## Introduction

This notebook introduces core Python programming concepts and then shows how Python is used for data science through NumPy, Pandas, and scikit-learn. The goal is to build a gradual understanding from basic syntax to a simple machine learning model.

In [2]:
# Python Basics + NumPy + Pandas + scikit-learn (Intro Notebook)
# This cell imports key libraries and prints version numbers.

import sys
import numpy as np
import pandas as pd

print("Python:", sys.version.split()[0])
print("NumPy:", np.__version__)
print("Pandas:", pd.__version__)


Python: 3.12.12
NumPy: 2.0.2
Pandas: 2.2.2


## Indentation and comments

### Indentation (tabs and spaces)
Python uses **indentation** (leading spaces at the start of a line) to define code blocks.  
A code block is a group of lines that belong together, for example inside `if`, `for`, `while`, or a function.

- Indentation is usually **4 spaces**.
- Tabs also work, but Python 3 raises a `TabError` when tabs and spaces are mixed inconsistently, so pick one style (spaces are the convention).
- In most editors, pressing the Tab key inserts spaces (often 4), which is fine.

Example idea:

```text
if condition:
    this line is inside the if-block
this line is outside the if-block
```
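
The same idea as runnable Python. This is a minimal sketch; the names `temperature` and `messages` are illustrative and not used elsewhere in the notebook.

```python
temperature = 30  # illustrative value
messages = []

if temperature > 25:
    # Both indented lines belong to the if-block
    messages.append("It is warm.")
    messages.append("Drink some water.")

# Back at the left margin: this line runs whether or not the condition held
messages.append("Done.")

print(messages)
```

Changing `temperature` to `20` skips the two indented lines, but `"Done."` is still appended.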

### Comments (`#`)
A comment is text that Python ignores when running the code.

- Single-line comments start with `#`.
- Comments are used to explain what code is doing, or to temporarily disable a line.
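
A short sketch of the comment styles described above. The variable names (`radius`, `area`, `tag`) are illustrative only.

```python
# This whole line is a comment and is ignored by Python.
radius = 2  # a comment can also follow code on the same line

area = 3.14159 * radius ** 2
# print("debug radius:", radius)   # disabled line: it will not run
print("area:", area)

# A '#' inside a string is ordinary text, not the start of a comment
tag = "#python"
print(tag)
```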

In [3]:
# Variables store values that we can re-use later.
x = 10          # integer (whole number)
y = 3           # integer

# Basic arithmetic operations
print("x + y =", x + y)    # addition
print("x / y =", x / y)    # division (always returns a float)
print("x // y =", x // y)  # floor division (divides, then rounds down to a whole number)
print("x % y =", x % y)    # remainder (modulo)
print("x ** y =", x ** y)  # exponentiation (x to the power y)

# Other common data types
name = "Mehrdad"   # string (text)
is_active = True   # boolean (True/False)
pi = 3.14159       # float (decimal number)

# type() shows the data type of a value
print(type(x), type(name), type(is_active), type(pi))


x + y = 13
x / y = 3.3333333333333335
x // y = 3
x % y = 1
x ** y = 1000
<class 'int'> <class 'str'> <class 'bool'> <class 'float'>


## Python data structures

Python includes built-in data structures:

- **List**: ordered collection that can be changed (mutable).
- **Tuple**: ordered collection that cannot be changed after creation (immutable).
- **Dictionary**: key–value pairs (like a lookup table).
- **Set**: collection of unique values (no duplicates).
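
To make the mutable/immutable distinction concrete, here is a small sketch. The values are illustrative, and `tuple_error` is a helper name introduced only for this example.

```python
numbers = [1, 2, 3]
numbers[0] = 99            # lists are mutable: item assignment works
print(numbers)

coords = (10, 20)
try:
    coords[0] = 99         # tuples are immutable: this raises TypeError
except TypeError as err:
    tuple_error = str(err)
    print("TypeError:", tuple_error)

person = {"name": "Ada"}
# .get() returns a default instead of raising KeyError for a missing key
role = person.get("role", "unknown")
print(role)

# Adding a value that is already in a set leaves the set unchanged
seen = {1, 2}
seen.add(2)
print(seen, 2 in seen)
```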

In [4]:
# A list can contain duplicates and can be modified
numbers = [1, 2, 3, 4, 4]

# A tuple is an ordered "fixed" collection
coords = (10, 20)

# A dictionary stores key-value pairs
person = {"name": "Ada", "role": "scientist", "year": 1843}

# A set removes duplicates automatically
unique_numbers = set(numbers)

print("numbers:", numbers)
print("coords:", coords)
print("person name:", person["name"])     # access dictionary value by key
print("unique_numbers:", unique_numbers)

# Update data structures
numbers.append(5)              # add a value to the end of the list
person["country"] = "UK"       # add a new key-value pair to the dictionary

print("updated numbers:", numbers)
print("updated person:", person)


numbers: [1, 2, 3, 4, 4]
coords: (10, 20)
person name: Ada
unique_numbers: {1, 2, 3, 4}
updated numbers: [1, 2, 3, 4, 4, 5]
updated person: {'name': 'Ada', 'role': 'scientist', 'year': 1843, 'country': 'UK'}


## Control flow

Control flow determines which statements run and how many times:

- `if` statements make decisions.
- `for` loops run a block once for each item in a sequence.
- `while` loops repeat while a condition is true.
- List comprehensions provide a compact way to create lists.

Indentation matters here: lines inside the block must be indented.
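
As a sketch of how a list comprehension relates to an ordinary loop, the two versions below build the same list. The filter on even numbers is an illustrative addition.

```python
# Loop version: collect the squares of the even numbers from 1 to 10
even_squares_loop = []
for n in range(1, 11):
    if n % 2 == 0:
        even_squares_loop.append(n ** 2)

# Comprehension version: the same result in a single expression
even_squares = [n ** 2 for n in range(1, 11) if n % 2 == 0]

print(even_squares_loop)
print(even_squares)
```

The comprehension reads as "expression, for each item, if condition". When the logic needs several statements per item, the loop form is usually clearer.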

In [5]:
score = 78

# if / elif / else chooses one branch based on conditions
if score >= 85:
    grade = "High Distinction"
elif score >= 75:
    grade = "Distinction"
elif score >= 65:
    grade = "Credit"
elif score >= 50:
    grade = "Pass"
else:
    grade = "Fail"

print("Grade:", grade)

# List comprehension: compact way to build a list
squares = [n**2 for n in range(1, 6)]
print("Squares:", squares)

# A for-loop example: summing numbers 1..5
total = 0
for n in range(1, 6):
    total += n  # shorthand for total = total + n
print("Sum 1..5:", total)

# A while-loop example: counts down
k = 3
while k > 0:
    print("k =", k)
    k -= 1


Grade: Distinction
Squares: [1, 4, 9, 16, 25]
Sum 1..5: 15
k = 3
k = 2
k = 1


## Functions

Functions group reusable logic into a named block.  
They improve readability and reduce repetition.

Indentation matters: the function body is indented under `def`.
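
Functions can also give parameters default values. The sketch below uses a hypothetical `apply_discount` function (not part of this notebook) to show a default argument, a docstring, and a keyword argument.

```python
def apply_discount(price, rate=0.10):
    """Return price reduced by rate, where rate is a fraction between 0 and 1."""
    if not 0 <= rate <= 1:
        raise ValueError("rate must be between 0 and 1")
    return price * (1 - rate)

default_price = apply_discount(200)             # rate falls back to 0.10
custom_price = apply_discount(200, rate=0.25)   # rate passed by keyword

print("default rate:", default_price)
print("custom rate:", custom_price)
```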

In [14]:
def mean(values):
    # Guard clause: prevent division by zero (empty list)
    if len(values) == 0:
        raise ValueError("values must not be empty")
    return sum(values) / len(values)

vals = [2, 4, 6, 8]
print("mean:", mean(vals))

def describe_number(n):
    # Return different text depending on the value of n
    if n > 0:
        return "positive"
    if n == 0:
        return "zero"
    return "negative"

for n in [5, 0, -1]:
    print(n, "is", describe_number(n))


mean: 5.0
5 is positive
0 is zero
-1 is negative


## NumPy basics

NumPy provides efficient numerical arrays.  
It supports element-wise operations, basic statistics, and matrix-style indexing.
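
A brief sketch contrasting NumPy arrays with plain lists, plus two common operations (boolean masks and per-axis sums). The `prices` values are illustrative.

```python
import numpy as np

prices = np.array([10.0, 20.0, 30.0])

# Multiplying a list repeats it; multiplying an array scales each element
doubled_list = [10, 20, 30] * 2
doubled_array = prices * 2
print(doubled_list)
print(doubled_array)

# Boolean mask: keep only the elements where the condition is True
expensive = prices[prices > 15]
print(expensive)

# axis=0 aggregates down the rows (one result per column);
# axis=1 aggregates across the columns (one result per row)
M = np.array([[1, 2, 3],
              [4, 5, 6]])
col_sums = M.sum(axis=0)
row_sums = M.sum(axis=1)
print(col_sums, row_sums)
```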

In [7]:
# Create NumPy arrays from Python lists
a = np.array([1, 2, 3, 4, 5])
b = np.array([10, 20, 30, 40, 50])

# Element-wise operations (happens per position)
print("a:", a)
print("b:", b)
print("a + b:", a + b)
print("a * 2:", a * 2)

# Simple statistics
print("mean(a):", a.mean())
print("std(a):", a.std())    # population standard deviation (NumPy default, ddof=0)

# A 2D array (matrix-like)
M = np.array([[1, 2, 3],
              [4, 5, 6]])

# Shape tells you rows and columns
print("M shape:", M.shape)

# Indexing examples
print("M first row:", M[0, :])      # first row, all columns
print("M second column:", M[:, 1])  # all rows, second column


a: [1 2 3 4 5]
b: [10 20 30 40 50]
a + b: [11 22 33 44 55]
a * 2: [ 2  4  6  8 10]
mean(a): 3.0
std(a): 1.4142135623730951
M shape: (2, 3)
M first row: [1 2 3]
M second column: [2 5]


## Random numbers and reproducibility

Random numbers are used in simulations and modelling.  
A fixed seed (here, `42`) makes the results reproducible, meaning you will get the same random values each time you run the cell.
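
To see what the seed controls, the sketch below creates three generators. The second seed (`7`) and the variable names are illustrative choices.

```python
import numpy as np

rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)   # same seed as rng_a
rng_c = np.random.default_rng(7)    # different seed

draw_a = rng_a.normal(size=3)
draw_b = rng_b.normal(size=3)
draw_c = rng_c.normal(size=3)

print("same seed, same values:", np.array_equal(draw_a, draw_b))
print("different seed, same values:", np.array_equal(draw_a, draw_c))

# Drawing again from the same generator advances its internal state,
# so the next batch differs from the first
draw_a_again = rng_a.normal(size=3)
print("second draw equals first:", np.array_equal(draw_a, draw_a_again))
```

This is why the results repeat only when the generator is recreated with the same seed, as happens each time the cell is re-run.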

In [8]:
# Create a random number generator with a fixed seed for reproducibility
rng = np.random.default_rng(42)

# Generate 10 values from a normal distribution (mean=0, std=1)
samples = rng.normal(loc=0.0, scale=1.0, size=10)

print("samples:", samples)
print("samples mean:", samples.mean())
print("samples min/max:", samples.min(), samples.max())


samples: [ 0.30471708 -1.03998411  0.7504512   0.94056472 -1.95103519 -1.30217951
  0.1278404  -0.31624259 -0.01680116 -0.85304393]
samples mean: -0.33557130840587135
samples min/max: -1.9510351886538364 0.9405647163912139


## Pandas DataFrame basics

Pandas is used for structured data analysis.  
A DataFrame is similar to a spreadsheet table: rows are records, columns are variables.  
You can summarise and filter data in a few lines of code.
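
Filters can combine several conditions. This sketch rebuilds the small student table used in this notebook and selects rows with `.loc`; the thresholds (60 and 5) are illustrative.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "student": ["Ava", "Bilal", "Chen", "Dina", "Ehsan"],
    "hours_studied": [2.0, 4.5, 3.0, 5.0, 1.5],
    "quiz_score": [55, 78, 67, 88, 49],
})

# Combine conditions with & (and) or | (or).
# Each condition needs its own parentheses.
mask = (df_demo["quiz_score"] >= 60) & (df_demo["hours_studied"] < 5)

# .loc[rows, columns] selects matching rows and only the listed columns
selected = df_demo.loc[mask, ["student", "quiz_score"]]
print(selected)
```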

In [9]:
# Create a DataFrame from a dictionary of columns
data = {
    "student": ["Ava", "Bilal", "Chen", "Dina", "Ehsan"],
    "hours_studied": [2.0, 4.5, 3.0, 5.0, 1.5],
    "quiz_score": [55, 78, 67, 88, 49]
}
df = pd.DataFrame(data)

print(df)

# Data types of each column
print("\nTypes:\n", df.dtypes)

# Summary statistics for numeric columns
print("\nSummary:\n", df.describe())

# Select a single column (Series)
print("\nSelect a column:\n", df["quiz_score"])

# Filter rows based on a condition
print("\nFilter rows (score >= 70):\n", df[df["quiz_score"] >= 70])


  student  hours_studied  quiz_score
0     Ava            2.0          55
1   Bilal            4.5          78
2    Chen            3.0          67
3    Dina            5.0          88
4   Ehsan            1.5          49

Types:
 student           object
hours_studied    float64
quiz_score         int64
dtype: object

Summary:
        hours_studied  quiz_score
count       5.000000    5.000000
mean        3.200000   67.400000
std         1.524795   16.040574
min         1.500000   49.000000
25%         2.000000   55.000000
50%         3.000000   67.000000
75%         4.500000   78.000000
max         5.000000   88.000000

Select a column:
 0    55
1    78
2    67
3    88
4    49
Name: quiz_score, dtype: int64

Filter rows (score >= 70):
   student  hours_studied  quiz_score
1   Bilal            4.5          78
3    Dina            5.0          88


## Transforming data in Pandas

You can:
- create new columns from existing columns,
- sort by a column,
- create categories (bands) from numeric ranges.

These steps are common in data cleaning and feature engineering.
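
When creating bands with `pd.cut`, a score that sits exactly on a boundary (such as 50) goes to the lower or the upper band depending on the `right` argument. A minimal sketch comparing the two settings on three boundary scores (the scores are illustrative):

```python
import pandas as pd

scores = pd.Series([50, 65, 85])
labels = ["Fail", "Pass", "Credit", "Distinction", "High Distinction"]

# right=True (the default): bins include their upper edge, e.g. (50, 65]
right_closed = pd.cut(scores, bins=[0, 50, 65, 75, 85, 100],
                      labels=labels, include_lowest=True)

# right=False: bins include their lower edge, e.g. [50, 65).
# The last edge is 101 so that a score of 100 is still covered.
left_closed = pd.cut(scores, bins=[0, 50, 65, 75, 85, 101],
                     labels=labels, right=False)

comparison = pd.DataFrame({
    "score": scores,
    "right_closed": right_closed.astype(str),
    "left_closed": left_closed.astype(str),
})
print(comparison)
```

With `right=False`, a score of 50 is a "Pass", which matches a rule written as `score >= 50`.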

In [10]:
# Create a new column based on existing columns
df["score_per_hour"] = df["quiz_score"] / df["hours_studied"]

# Sort rows by the new column
print(df.sort_values("score_per_hour", ascending=False))

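# Create a categorical "band" column based on score ranges.
# right=False makes each bin include its lower edge (e.g. 50 <= score < 65 is "Pass"),
# which matches the >= thresholds used in the if/elif example earlier.
# The last edge is 101 so that a score of 100 still falls in the top band.
df["band"] = pd.cut(
    df["quiz_score"],
    bins=[0, 50, 65, 75, 85, 101],
    labels=["Fail", "Pass", "Credit", "Distinction", "High Distinction"],
    right=False
)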

# Count how many students are in each band
print("\nBand counts:\n", df["band"].value_counts())


  student  hours_studied  quiz_score  score_per_hour
4   Ehsan            1.5          49       32.666667
0     Ava            2.0          55       27.500000
2    Chen            3.0          67       22.333333
3    Dina            5.0          88       17.600000
1   Bilal            4.5          78       17.333333

Band counts:
 band
Fail                1
Pass                1
Credit              1
Distinction         1
High Distinction    1
Name: count, dtype: int64


## Handling missing data

Missing values often appear in real datasets.  
Pandas represents missing numeric values as `NaN` (Not a Number).  
A common approach is to fill missing values using a simple statistic such as the median (for skewed data) or mean (for roughly symmetric data).
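
The sketch below shows why the choice matters for skewed data. The `incomes` values are made up for illustration, with one deliberately extreme value.

```python
import numpy as np
import pandas as pd

# Illustrative data: one missing value and one extreme value (400)
incomes = pd.Series([30, 32, 35, np.nan, 38, 400])

# mean() and median() skip NaN by default
mean_value = incomes.mean()
median_value = incomes.median()
print("mean:", mean_value, "median:", median_value)

filled_mean = incomes.fillna(mean_value)
filled_median = incomes.fillna(median_value)
print("filled with mean:", filled_mean[3])
print("filled with median:", filled_median[3])
```

The single extreme value pulls the mean well above most of the observed values, while the median stays close to them.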

In [11]:
# Make a copy so we do not change the original df
df2 = df.copy()

# Insert missing values artificially (for demonstration)
df2.loc[2, "hours_studied"] = np.nan
df2.loc[4, "quiz_score"] = np.nan

print(df2)

# Count missing values per column
print("\nMissing counts:\n", df2.isna().sum())

# Fill missing values with simple summary statistics
df2_filled = df2.copy()
df2_filled["hours_studied"] = df2_filled["hours_studied"].fillna(df2_filled["hours_studied"].median())
df2_filled["quiz_score"] = df2_filled["quiz_score"].fillna(df2_filled["quiz_score"].mean())

print("\nFilled:\n", df2_filled)


  student  hours_studied  quiz_score  score_per_hour              band
0     Ava            2.0        55.0       27.500000              Pass
1   Bilal            4.5        78.0       17.333333       Distinction
2    Chen            NaN        67.0       22.333333            Credit
3    Dina            5.0        88.0       17.600000  High Distinction
4   Ehsan            1.5         NaN       32.666667              Fail

Missing counts:
 student           0
hours_studied     1
quiz_score        1
score_per_hour    0
band              0
dtype: int64

Filled:
   student  hours_studied  quiz_score  score_per_hour              band
0     Ava           2.00        55.0       27.500000              Pass
1   Bilal           4.50        78.0       17.333333       Distinction
2    Chen           3.25        67.0       22.333333            Credit
3    Dina           5.00        88.0       17.600000  High Distinction
4   Ehsan           1.50        72.0       32.666667              Fail


## First machine learning model with scikit-learn

scikit-learn provides tools for machine learning.  
This example trains a classifier on the Iris dataset.

Key ideas:
- Split data into training and test sets.
- Scale features (helps models that are sensitive to feature magnitudes, such as regularised logistic regression or k-nearest neighbours).
- Train a model on training data.
- Evaluate performance on test data using accuracy and other metrics.
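
As a sketch of what the split and scaling steps do when performed by hand (a `Pipeline` handles this automatically), the scaler below learns its mean and standard deviation from the training rows only and then reuses them on the test rows. Variable names such as `X_train_scaled` are introduced just for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print("train shape:", X_train.shape, "test shape:", X_test.shape)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

# Training features end up with mean 0 and standard deviation 1 (up to rounding)
print("train means:", X_train_scaled.mean(axis=0).round(6))
print("train stds:", X_train_scaled.std(axis=0).round(6))

# Test features are only approximately centred, because they were not used to fit
print("test means:", X_test_scaled.mean(axis=0).round(3))
```

Fitting the scaler on the training set alone keeps information about the test set out of the model.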

In [12]:
# Import scikit-learn tools for datasets, splitting, modelling, and evaluation
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the classic Iris dataset (features X, labels y)
X, y = load_iris(return_X_y=True)

# Split into training and test sets
# stratify=y keeps class proportions similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# A Pipeline chains steps:
# 1) scale the features
# 2) train a classifier
model = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, random_state=42))
])

# Train (fit) the model
model.fit(X_train, y_train)

# Predict on test data
y_pred = model.predict(X_test)

# Evaluate performance
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification report:\n", classification_report(y_test, y_pred))


Accuracy: 0.9333333333333333

Confusion matrix:
 [[10  0  0]
 [ 0  9  1]
 [ 0  1  9]]

Classification report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       0.90      0.90      0.90        10
           2       0.90      0.90      0.90        10

    accuracy                           0.93        30
   macro avg       0.93      0.93      0.93        30
weighted avg       0.93      0.93      0.93        30



## Exercise: Customer Subscription Prediction from CSV

In this exercise you will practise a full applied machine learning workflow using data loaded from a CSV file.

### Task A: Load the dataset

Download the file `customer_subscription_data_100.csv` and upload it to your Colab session (or place it in the same directory if working locally).

Load the dataset into a Pandas DataFrame called `df_ex`.

Inspect:
- The first 5 rows
- Data types
- Basic summary statistics

### Task B: Train and evaluate a model

Use the following columns as features (`X`):
- `age`
- `income_k`
- `spent_last_month`

Use `subscribed` as the target (`y`).

Then:

1. Split the dataset into training and test sets.
2. Build a logistic regression model (with scaling).
3. Train the model.
4. Report:
   - Accuracy
   - Confusion matrix

### Task C: Try one improvement

Change one modelling choice and observe whether results change. For example:

- Change `test_size`
- Change `random_state`
- Remove the scaler
- Try a different classifier (e.g. `KNeighborsClassifier`)

Write 2 to 3 sentences explaining what you changed and what happened.
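
One way to organise such a comparison is sketched below. It uses the Iris dataset rather than the exercise CSV so that it runs without any file, and `n_neighbors=5` is an illustrative choice; adapt the data loading and columns for Task C.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Only the classifier changes; the split and the scaler stay the same,
# so any difference in accuracy comes from the model choice
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000, random_state=42),
    "k-nearest neighbours": KNeighborsClassifier(n_neighbors=5),
}

results = {}
for name, clf in candidates.items():
    pipe = Pipeline([("scaler", StandardScaler()), ("clf", clf)])
    pipe.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, pipe.predict(X_test))
    print(f"{name}: {results[name]:.3f}")
```

Changing one thing at a time makes it easier to explain what caused any difference in the results.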

In [13]:
# Exercise: Customer subscription prediction from CSV

import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# ---- Task A: Load dataset ----
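# This reads the CSV directly from the course GitHub repository.
# If you downloaded the file instead (uploaded to Colab or saved locally),
# replace the URL with "customer_subscription_data_100.csv".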

df_ex = pd.read_csv("https://raw.githubusercontent.com/mehrdadgit/uow-MBAS922-applied-ml-ai-business/refs/heads/main/Datasets/customer_subscription_data_100.csv")

print("First 5 rows:\n", df_ex.head())
print("\nData types:\n", df_ex.dtypes)
print("\nSummary statistics:\n", df_ex.describe())

# ---- Task B: Prepare features and target ----
X = df_ex[["age", "income_k", "spent_last_month"]]
y = df_ex["subscribed"]

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

# Build pipeline (scaling + logistic regression)
model = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, random_state=42))
])

# Train
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
print("\nAccuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion matrix:\n", confusion_matrix(y_test, y_pred))

# ---- Task C: Modify one modelling choice and re-run ----

First 5 rows:
    age  income_k  spent_last_month  subscribed
0   24       129               251           1
1   54        53               549           1
2   49       126               322           1
3   39        30               434           1
4   39       125               301           1

Data types:
 age                 int64
income_k            int64
spent_last_month    int64
subscribed          int64
dtype: object

Summary statistics:
               age    income_k  spent_last_month  subscribed
count  100.000000  100.000000        100.000000  100.000000
mean    43.210000   88.430000        313.310000    0.750000
std     12.272409   33.259693        147.304103    0.435194
min     21.000000   30.000000         60.000000    0.000000
25%     34.000000   63.000000        196.000000    0.750000
50%     43.000000   89.000000        312.500000    1.000000
75%     54.000000  114.500000        434.750000    1.000000
max     63.000000  149.000000        594.000000    1.000000

Accuracy