# Module 1 Main Notebook
**Seed = 1955** · Datasets: `data/housing_synth.csv`, `data/titanic_synth.csv`

Welcome to **Module 1 Advanced Tech - AI, ML, and Data Science** 

### How to Use This Notebook
Each section uses the pattern: **Introduction → Code Example → Step-by-Step Explanation → Output Interpretation → Summary**.
All randomness uses **seed=1955** for reproducibility across different machines.

## Module 1 — Foundations of AI, ML, and Data Science

### 1.1 Defining AI, ML, and DL — What’s the difference?
**Introduction:**
- **AI** is the umbrella term: systems that perform tasks requiring human-like intelligence.
- **ML** is data-driven learning from examples rather than rules.
- **DL** uses multi-layer neural networks to learn rich representations.

In [None]:
SEED = 1955
print('Seed set to', SEED)

Seed set to 1955


### 1.2 Categories of Machine Learning — Matching problems to paradigms
**Introduction:**
- **Supervised** (features→label): regression, classification.
- **Unsupervised** (no labels): clustering, dimensionality reduction, anomalies.
- **Reinforcement Learning** (rewards): policies learned by trial-and-error.
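
As a rough illustration of the first two paradigms, the sketch below fits a supervised classifier and an unsupervised clusterer on the same toy feature. The data values and variable names are made up for this example; `DecisionTreeClassifier` and `KMeans` are standard scikit-learn classes. Reinforcement learning needs an environment to interact with, so it is left out here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

SEED = 1955

# Hypothetical toy feature: two well-separated groups of values
X = np.array([[1.0], [1.2], [0.8], [5.0], [5.3], [4.9]])
y = np.array([0, 0, 0, 1, 1, 1])  # labels, used only by the supervised model

# Supervised: learns a mapping from features to known labels
clf = DecisionTreeClassifier(random_state=SEED).fit(X, y)
supervised_pred = clf.predict(np.array([[1.1], [5.1]]))

# Unsupervised: sees only X and groups similar rows; cluster ids are arbitrary
km = KMeans(n_clusters=2, n_init=10, random_state=SEED).fit(X)
cluster_ids = km.labels_

print("Supervised predictions:", supervised_pred)
print("Cluster assignments   :", cluster_ids)
```

The key difference is visible in the `fit` calls: the classifier receives `X` and `y`, while the clusterer receives only `X`.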

### 1.3 The ML Pipeline — End-to-end mindset
**Introduction:** problem → data → features → model → evaluation → deployment → monitoring.

In [None]:
# A minimal scaffold for an end-to-end pipeline on a tabular dataset (no plots)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

df = pd.read_csv('../data/titanic_synth.csv')
X = df.drop('Survived', axis=1)
y = df['Survived']
cat = X.select_dtypes(include=['object']).columns.tolist()

pre = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat)
], remainder='passthrough')

clf = Pipeline([
    ('pre', pre),
    ('model', DecisionTreeClassifier(max_depth=4, random_state=1955))
])

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, stratify=y, random_state=1955)
clf.fit(Xtr, ytr)
pred = clf.predict(Xte)
acc = accuracy_score(yte, pred)
print('Accuracy:', acc)
print(classification_report(yte, pred))


This short program builds a classification pipeline to predict whether a passenger survived (1) or not (0) using a synthetic Titanic dataset.
It demonstrates the entire supervised ML workflow: load data → preprocess → split → train → predict → evaluate.

**Code Explanation**

First, we need to import the necessary libraries. Here's a brief explanation of each library:

- `pandas`: Used for data manipulation and analysis.
- `scikit-learn`: Used for machine learning algorithms and tools.
    - `train_test_split`: divides the data into training and test sets.
    - `OneHotEncoder`: converts categorical features into numeric binary columns. Example: sex = {male, female} becomes sex_male, sex_female.
    - `ColumnTransformer`: applies different preprocessing to different column types. Here it ensures only categorical columns are encoded, while numeric ones pass through unchanged.
    - `Pipeline`: links preprocessing and modeling steps together. This guarantees the same transformations are applied during both training and prediction.
    - `DecisionTreeClassifier`: the Decision Tree model, a rule-based algorithm that splits data by feature thresholds.
    - `accuracy_score`, `classification_report`: tools to evaluate model performance. Accuracy is overall correctness; the classification report adds precision, recall, F1-score, and support for each class.

Next, we **load the dataset**. The line

`df = pd.read_csv('../data/titanic_synth.csv')`

reads the CSV file into a pandas DataFrame. Each row represents a passenger; each column is a feature or the target (`Survived`).

The next step in the pipeline is to **define the problem**. Here we need to define our variables with:

`X = df.drop('Survived', axis=1)`

`y = df['Survived']`

This code splits the data into `X`, the features (inputs), and `y`, the target (the label we want to predict).

Next, we need to **preprocess** the data.
First, we find which columns are categorical (string type) and store them in a list for encoding. We accomplish that with:

`cat = X.select_dtypes(include=['object']).columns.tolist()`

Next, we define the preprocessing step:

`pre = ColumnTransformer([`
    `('cat', OneHotEncoder(handle_unknown='ignore'), cat)`
`], remainder='passthrough')`

- `'cat'`: label for this transformation.

- `OneHotEncoder(handle_unknown='ignore')`: converts categories to 0/1 columns; a category not seen during training is encoded as all zeros at inference instead of raising an error.

- `cat`: list of categorical columns.

- `remainder='passthrough'`: keeps numeric features as they are.
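
To make the effect of `handle_unknown='ignore'` concrete, here is a minimal sketch that fits the encoder on two categories and then transforms a value it has never seen. The category values are made up for illustration.

```python
from sklearn.preprocessing import OneHotEncoder

# Fit on the categories seen during training (hypothetical values)
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit([['male'], ['female']])

print(enc.categories_)                         # learned categories, sorted alphabetically
known = enc.transform([['male']]).toarray()    # one-hot row for a known category
unseen = enc.transform([['other']]).toarray()  # unseen category -> all zeros
print(known)
print(unseen)
```

With the default `handle_unknown='error'`, the last `transform` call would raise an exception instead.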

Next, we combine the entire workflow into one Pipeline. Remember that the Pipeline links the preprocessing and modeling steps:

`clf = Pipeline([`
   `('pre', pre),`
   `('model', DecisionTreeClassifier(max_depth=4, random_state=1955))`
`])`

The code above combines the entire workflow into one Pipeline:

`'pre'`: preprocessing step.

`'model'`: classifier (Decision Tree).

`max_depth=4` keeps the tree from growing too deep, which limits overfitting.

`random_state=1955` ensures reproducibility.


The next step is to **split** the data for training and testing.

`Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, stratify=y, random_state=1955)`

The code splits the data into 80% for training and 20% for testing.

`stratify=y` keeps class ratios consistent (same percentage of survivors).

`random_state=1955` ensures everyone in class gets the same split.
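
The effect of `stratify` is easiest to see on imbalanced labels. The sketch below uses made-up labels (80 zeros, 20 ones) and compares the positive rate of the test set with and without stratification; the array names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 zeros and 20 ones
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)

# Stratified split: class ratios are preserved in both parts
_, _, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=1955
)
# Plain random split: the ratio in the test set can drift
_, _, y_tr_r, y_te_r = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1955
)

print("Positive rate, full data        :", y_demo.mean())
print("Positive rate, stratified test  :", y_te_s.mean())
print("Positive rate, unstratified test:", y_te_r.mean())
```

The stratified test set keeps the 20% positive rate of the full data, while the unstratified one depends on the luck of the draw.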


Once we have split the data, it's finally time to **Train** the model. We use:

`clf.fit(Xtr, ytr)`

The code automatically preprocesses the training data using OneHotEncoder and fits a decision tree model on the transformed data.

Next, we can start making **predictions** with our model.

`pred = clf.predict(Xte)` applies the same preprocessing to the test data and generates predictions.


Finally, we need to evaluate the model. The following code calculates and prints the desired metrics:

`acc = accuracy_score(yte, pred)`
`print('Accuracy:', acc)`

The previous code calculates and prints accuracy, the proportion of correct predictions.

`print(classification_report(yte, pred))` 

This code prints precision, recall, F1-score, and support for each class. This gives a more complete picture than accuracy alone.


**Summary:** The real value comes from pipeline hygiene (one preprocessing path for training and prediction, a held-out test set, fixed seeds), not just the model choice.

#### Defining X and y for a simple dataset

Example: Defining `X` and `y` (Housing Dataset)

In supervised learning, we separate our dataset into:
- **`X` (features)** → input variables used for prediction  
- **`y` (target)** → the outcome we want the model to learn

Once these are defined, we split the data into **training** and **testing** sets to fairly evaluate performance.


##### Step 1: Import and Load Data

In [None]:
import pandas as pd

# Load a simple housing dataset
df = pd.read_csv('../data/housing_synth.csv')
df.head()

After running the code, you will see part of the dataset. The sample columns may appear as follows:

`['sqft', 'bedrooms', 'bathrooms', 'age_years', 'lot_size', 'dist_to_center_km', 'price']`

To define our target variable (the outcome we want the model to learn), we must identify which column to predict. In this case, the target is `price`, so we exclude `price` from the list of features.

##### Step 2: Define Features (X) and Target (y)

In [None]:
# Separate inputs (X) and output (y)
X = df.drop('price', axis=1)   # all columns except 'price'
y = df['price']                # target column

# Check the dimensions
print("Shape of X:", X.shape)
print("Shape of y:", y.shape)

- X includes variables like square footage, number of bedrooms, and distance to city center.

- y contains the house price we want to predict.

Keeping them separate ensures the model only learns relationships from the predictors.
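
The train/test split mentioned at the start of this example could look like the sketch below. It uses a tiny made-up DataFrame with a few of the housing column names so that it runs without the CSV file; the values and the `housing_demo` name are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical rows that mimic some of the housing columns listed above
housing_demo = pd.DataFrame({
    'sqft': [850, 1200, 1500, 2100, 950, 1750, 1300, 2400, 1100, 1600],
    'bedrooms': [2, 3, 3, 4, 2, 4, 3, 5, 2, 3],
    'price': [150, 210, 260, 380, 165, 310, 225, 450, 190, 280],  # in thousands
})

X_demo = housing_demo.drop('price', axis=1)
y_demo = housing_demo['price']

# No stratify here: the target is continuous, so there are no classes to balance
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1955
)
print("Train:", X_train.shape, "Test:", X_test.shape)
```

With 10 rows and `test_size=0.2`, 8 rows go to training and 2 to testing.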

### 1.4 Data Preprocessing — Imputation & scaling
**Introduction:** Raw data is messy. Impute, encode, and scale before modeling.

#### Data Cleaning Demo — Titanic Example

In this example, we’ll clean a small Titanic-style dataset that has:
- Missing **Age** values  
- A categorical **Sex** column that needs encoding  
- Inconsistent **Fare** entries (negative / missing)  
- **Outlier Fares** (unrealistically high values)

---

##### Step 1: Import Libraries and Create the Dataset

The following code creates a very simple dataset of Titanic-style passengers. The dataset includes `Name`, `Age`, `Sex`, and `Fare`.


In [None]:
import pandas as pd
import numpy as np

# Create a sample Titanic-like DataFrame
data = pd.DataFrame({
    'Name': ['John', 'Mary', 'Alex', 'Linda', 'James', 'Anna', 'Tom'],
    'Age':  [22, np.nan, 35, 58, np.nan, 29, 44],
    'Sex':  ['male', 'female', 'male', 'female', 'male', 'female', 'male'],
    # includes negatives, missing, and very large values to simulate real-world data struggles
    'Fare': [7.25, 71.83, -5.00, 512.33, 8.05, np.nan, 300.00]
})

print("Original Data:")
print(data)


Original Data:
    Name   Age     Sex    Fare
0   John  22.0    male    7.25
1   Mary   NaN  female   71.83
2   Alex  35.0    male   -5.00
3  Linda  58.0  female  512.33
4  James   NaN    male    8.05
5   Anna  29.0  female     NaN
6    Tom  44.0    male  300.00


##### Step 2: Identify Missing Values
This helps you see which columns need cleaning (here: Age, Fare).

In [None]:
print("\nMissing values per column:")
print(data.isna().sum())


Missing values per column:
Name    0
Age     2
Sex     0
Fare    1
dtype: int64


##### Step 3: Fix Missing Ages
This block handles **missing values** in the `Age` column using the `SimpleImputer` class from scikit-learn. **Note:** The following example uses `mean` as the strategy. To try another strategy, change `mean` in the code below to one of the alternatives. Remember to re-run the previous two cells first so the dataset resets to its original state.

In [None]:
from sklearn.impute import SimpleImputer

# Replace missing ages with the mean age
imputer = SimpleImputer(strategy='mean')
data['Age'] = imputer.fit_transform(data[['Age']])

age_mean = data['Age'].mean()
print(f"The mean age is: {age_mean}")
print("\nData after imputing missing ages:")
print(data)

The mean age is: 37.6

Data after imputing missing ages:
    Name   Age     Sex    Fare
0   John  22.0    male    7.25
1   Mary  37.6  female   71.83
2   Alex  35.0    male   -5.00
3  Linda  58.0  female  512.33
4  James  37.6    male    8.05
5   Anna  29.0  female     NaN
6    Tom  44.0    male  300.00


##### Step-by-Step Breakdown

**`from sklearn.impute import SimpleImputer`**  
Imports the `SimpleImputer` class, which is used to fill in missing values (NaN) in a dataset.

---

**`imputer = SimpleImputer(strategy='mean')`**  
Creates an instance of the imputer and sets the strategy to `'mean'`.  
- `'mean'`: replaces missing values with the **average** of the non-missing values in that column.  
- Other options include:
  - `'median'` → replaces with the median value  
  - `'most_frequent'` → replaces with the mode (most common value)  
  - `'constant'` → replaces with a fixed value you define using `fill_value`
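
To compare the strategies side by side without editing the cell above, the following sketch applies each one to the same ages. The `fill_values` dictionary and the choice of `fill_value=0` are illustrative assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same ages as the demo DataFrame, as a 2D column
ages = np.array([[22], [np.nan], [35], [58], [np.nan], [29], [44]])

fill_values = {}
for strategy in ['mean', 'median', 'most_frequent']:
    imp = SimpleImputer(strategy=strategy)
    filled = imp.fit_transform(ages)
    fill_values[strategy] = float(filled[1, 0])  # row 1 was missing
    print(f"{strategy:>13}: {fill_values[strategy]}")

# 'constant' needs an explicit fill_value (0 here is an arbitrary choice)
const = SimpleImputer(strategy='constant', fill_value=0).fit_transform(ages)
fill_values['constant'] = float(const[1, 0])
print(f"{'constant':>13}: {fill_values['constant']}")
```

Because every age in this tiny dataset is unique, `'most_frequent'` has a tie to break; scikit-learn returns the smallest of the tied values, which is rarely a sensible fill for a continuous column.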

---

**`data['Age'] = imputer.fit_transform(data[['Age']])`**  
- `fit_transform()` does two things:
  1. **`fit()`** → calculates the mean of the existing (non-NaN) `Age` values.  
  2. **`transform()`** → replaces all missing `Age` values with that mean.  
- The result is a **NumPy array**, so we assign it back to the DataFrame column `data['Age']`.

> Note: `[['Age']]` (double brackets) keeps the column as a 2D array, which `SimpleImputer` expects as input.
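
The difference between single and double brackets can be checked directly. This is a small sketch on a three-row frame; the `demo` name is illustrative.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'Age': [22, np.nan, 35]})

one_d = demo['Age']    # single brackets -> Series (1D)
two_d = demo[['Age']]  # double brackets -> DataFrame (2D, one column)

print(type(one_d).__name__, one_d.shape)
print(type(two_d).__name__, two_d.shape)
```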

---

**`print("\nData after imputing missing ages:")`**  
**`print(data)`**  
Prints the updated DataFrame so you can confirm that all missing `Age` values have been replaced.

---

##### Summary
- Missing ages are now filled with the column’s mean value.  
- This keeps every row and preserves the column mean, although filling with a single value understates the spread of `Age`.  
- Imputation is an essential preprocessing step before model training, ensuring algorithms receive complete, numeric data.


##### Step 4: Fix Inconsistent Fares (negative / missing)
Next we need to clean inconsistent or missing fares.

In [None]:
# Replace negative or missing fares with median of valid positive fares
median_fare = data.loc[data['Fare'] > 0, 'Fare'].median()
data['Fare'] = data['Fare'].apply(lambda x: median_fare if (pd.isna(x) or x <= 0) else x)

print("\nData after fixing inconsistent fares (negatives/missing → median):")
print(data)



Data after fixing inconsistent fares (negatives/missing → median):
    Name   Age     Sex    Fare
0   John  22.0    male    7.25
1   Mary  37.6  female   71.83
2   Alex  35.0    male   71.83
3  Linda  58.0  female  512.33
4  James  37.6    male    8.05
5   Anna  29.0  female   71.83
6    Tom  44.0    male  300.00



```python
median_fare = data.loc[data['Fare'] > 0, 'Fare'].median()
```

This line finds the **median (middle)** fare from all valid, positive fare values in the dataset.

| Expression              | Meaning                                                                         |
| ----------------------- | ------------------------------------------------------------------------------- |
| `data['Fare'] > 0`      | Creates a Boolean mask to select only rows where the fare is greater than zero. |
| `data.loc[..., 'Fare']` | Uses `.loc[]` to access the `Fare` column only for those valid rows.            |
| `.median()`             | Calculates the median of those selected fares.                                  |

The result is stored in the variable **`median_fare`**, which serves as a **reference value** to replace missing or invalid fares later.

---

##### Why median instead of mean

The **median** is less sensitive to extreme outliers (e.g., very high first-class fares), making it a more **robust “typical” value** than the mean when dealing with skewed data.
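
A quick check on the five valid fares from the demo dataset shows how far apart the two statistics are. The sketch uses only the standard library.

```python
from statistics import mean, median

# Valid (positive) fares from the demo dataset
valid_fares = [7.25, 71.83, 512.33, 8.05, 300.00]

print(f"Mean  : {mean(valid_fares):.2f}")    # pulled upward by 512.33 and 300.00
print(f"Median: {median(valid_fares):.2f}")  # the middle value after sorting
```

The mean comes out near 179.89 while the median is 71.83, so the mean would assign a fare higher than three of the five observed values.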

---

```python
data['Fare'] = data['Fare'].apply(lambda x: median_fare if (pd.isna(x) or x <= 0) else x)
```

This line **cleans the `Fare` column** by replacing invalid or missing values with the median fare computed above.

---

* `pd.isna(x)` → checks if the fare value is missing (`NaN`).
* `x <= 0` → identifies fares that are negative or zero (invalid).
* `lambda x: ...` → defines a short, anonymous function that applies this rule to each value in the column.
* If either condition is true (missing or invalid), it replaces the fare with **`median_fare`**.
* Otherwise, it keeps the original value (`else x`).
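
The same rule can also be written without `apply`. The sketch below is an alternative, not the notebook's method: it uses `Series.where`, which keeps values where a condition is true and replaces the rest.

```python
import numpy as np
import pandas as pd

fare = pd.Series([7.25, 71.83, -5.00, 512.33, 8.05, np.nan, 300.00])
median_fare = fare[fare > 0].median()

# Keep values where the condition holds; replace everything else.
# NaN > 0 evaluates to False, so missing fares are replaced as well.
fare_clean = fare.where(fare > 0, median_fare)
print(fare_clean.tolist())
```

On large columns the vectorized form is usually faster than a Python-level lambda, and the result is identical here.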

---

**Result:**
All missing or invalid fare values are replaced with a reliable, central value — the median.
This ensures the dataset remains consistent, realistic, and ready for analysis.



##### Step 5: Handle Fare Outliers (IQR method — robust)
Next, we need to handle outliers in the data. We use the IQR (interquartile range) method.

In [None]:
# IQR-based capping (winsorization): cap extreme high fares at the upper whisker
q1, q3 = data['Fare'].quantile([0.25, 0.75])
iqr = q3 - q1
upper_whisker = q3 + 1.5 * iqr

# Keep a copy to show before/after if desired
fare_before = data['Fare'].copy()
data['Fare'] = np.where(data['Fare'] > upper_whisker, upper_whisker, data['Fare'])

print("\nIQR capping applied:")
print(f"Q1={q1:.2f}, Q3={q3:.2f}, IQR={iqr:.2f}, Upper whisker={upper_whisker:.2f}")
print("\nFares (before → after) for affected rows:")
changed = pd.DataFrame({'before': fare_before, 'after': data['Fare']})
print(changed[changed['before'] != changed['after']])

##### Code explanation

**`q1, q3 = data['Fare'].quantile([0.25, 0.75])`**

* Calculates the **first quartile (Q1)** and **third quartile (Q3)** of the `Fare` column.
* Q1 = value at the 25th percentile
* Q3 = value at the 75th percentile

These values help define the **interquartile range (IQR)** — the middle 50% of the data.



**`iqr = q3 - q1`**

* Computes the **Interquartile Range (IQR)**.
* The IQR measures the spread of the middle 50% of the data, helping to identify unusually high or low values (outliers).



**`upper_whisker = q3 + 1.5 * iqr`**

* Calculates the **upper limit** for normal fare values, often called the *upper whisker* in boxplots.
* Any fare **greater than this value** is considered an **outlier**.
* The multiplier `1.5` is a standard rule-of-thumb for detecting moderate outliers.



**`fare_before = data['Fare'].copy()`**

* Creates a copy of the `Fare` column before modifications.
* This lets you compare original and capped values later — great for demonstrations and audits.



**`data['Fare'] = np.where(data['Fare'] > upper_whisker, upper_whisker, data['Fare'])`**

* Applies **IQR-based capping** (also known as *winsorization*).
* `np.where()` checks each value:

  * If `Fare` > `upper_whisker`, it replaces it with `upper_whisker`.
  * Otherwise, it leaves the value unchanged.
* This ensures that extreme outliers don’t skew your model or distort visualizations.



**`print("\nIQR capping applied:")`**
**`print(f"Q1={q1:.2f}, Q3={q3:.2f}, IQR={iqr:.2f}, Upper whisker={upper_whisker:.2f}")`**

* Displays the calculated thresholds for reference (rounded to two decimal places).



**`changed = pd.DataFrame({'before': fare_before, 'after': data['Fare']})`**
**`print(changed[changed['before'] != changed['after']])`**

* Builds a comparison table showing which fares were capped (changed).
* Filters only rows where the `before` and `after` values differ.

---

* **IQR capping (winsorization)** replaces extreme outliers with the highest “reasonable” value, preserving data shape while preventing outliers from dominating your model.
* This is safer than simply dropping outliers because it keeps all records but reduces distortion.
* After this step, the `Fare` column is clean, consistent, and ready for modeling or visualization.
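
For reference, the thresholds for the demo fares can be reproduced in a standalone sketch. It swaps `np.where` for `Series.clip`, an equivalent pandas call, and also computes the lower whisker, which the demo does not use.

```python
import pandas as pd

# Fares as they stand after the negative/missing fix in Step 4
fare = pd.Series([7.25, 71.83, 71.83, 512.33, 8.05, 71.83, 300.00])

q1, q3 = fare.quantile([0.25, 0.75])
iqr = q3 - q1
lower_whisker = q1 - 1.5 * iqr  # not used in the demo; shown for completeness
upper_whisker = q3 + 1.5 * iqr

# clip() caps values above the given bound in one call
fare_capped = fare.clip(upper=upper_whisker)

print(f"Q1={q1:.2f}, Q3={q3:.2f}, IQR={iqr:.2f}")
print(f"Lower whisker={lower_whisker:.2f}, Upper whisker={upper_whisker:.2f}")
print(fare_capped.tolist())
```

The lower whisker is negative for this data, so no fare falls below it; only the single fare of 512.33 exceeds the upper whisker and gets capped.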




##### Step 6: Encode Categorical Variable (Sex)
Encoding categorical data into a numerical format is essential for machine learning models.

In [None]:
# One-Hot Encode 'Sex' (drop_first=True to avoid redundancy)
data_encoded = pd.get_dummies(data, columns=['Sex'], drop_first=True)

print("\nData after encoding 'Sex':")
print(data_encoded)

##### Code Explanation

**`pd.get_dummies(data, columns=['Sex'], drop_first=True)`**

* This line uses **pandas’ `get_dummies()`** function to convert the categorical column **`Sex`** into **numeric (binary)** columns — a process called **One-Hot Encoding**.
* Machine learning models typically require numerical input, so categorical variables (like “male” or “female”) must be encoded as numbers.


| Parameter         | Meaning                                                                                                                                            |
| ----------------- | -------------------------------------------------------------------------------------------------------------------------------------------------- |
| `data`            | The original DataFrame.                                                                                                                            |
| `columns=['Sex']` | Specifies which column(s) to encode. In this case, we’re encoding only the `Sex` column.                                                           |
| `drop_first=True` | Drops the **first category** to avoid the “dummy variable trap.” This prevents multicollinearity (redundant information) when using linear models. |


What is happening here?
* The `Sex` column contains two unique values: **`female`** and **`male`**.
* One-hot encoding creates a binary column for each category. With `drop_first=True`, the first category in sorted order (`female`) is dropped, which leaves:

  * `Sex_male` = 1 if the passenger is male, 0 otherwise.
  * (The female category is **implied** when `Sex_male = 0`, because `drop_first=True` removes the redundant column.)
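
To see what `drop_first=True` removes, this small sketch encodes the same column with and without the option. The three-row `demo` frame is made up for illustration.

```python
import pandas as pd

demo = pd.DataFrame({'Sex': ['male', 'female', 'male']})

full = pd.get_dummies(demo, columns=['Sex'])                      # one column per category
reduced = pd.get_dummies(demo, columns=['Sex'], drop_first=True)  # first category dropped

print(list(full.columns))
print(list(reduced.columns))
```

The two columns in `full` always sum to 1, which is the redundancy that `drop_first=True` removes.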

---

```python
print("\nData after encoding 'Sex':")
print(data_encoded)
```

* Prints the cleaned and encoded DataFrame to confirm that:

  * The original `Sex` column was replaced with a binary column (`Sex_male`).
  * Every remaining column except `Name` is numeric. `Name` is an identifier rather than a feature, so drop it before fitting a model.


##### Step 7: Final Check

In [None]:
print("\nCleaned Data Overview:")
print(data_encoded.info())
print("\nCleaned Data Preview:")
print(data_encoded)

**Summary**

In this demo, we cleaned our data and prepared it for analysis. This is what we did:

* Missing **Age** → imputed (mean).
* Invalid **Fare** (negative/missing) → median.
* **Outliers** in **Fare** → capped (IQR or business rule).
* **Sex** encoded → numeric features for modeling.




### 1.5 Feature Engineering — Improve signal-to-noise
**Introduction:** Domain knowledge → better features → simpler models, better results.

In [None]:
import pandas as pd

# --- Step 1: Create an expanded Titanic-like dataset ---
titanic = pd.DataFrame({
    'Name': [
        'John', 'Mary', 'Sam', 'Lucy', 'Tom', 'Anna',
        'James', 'Ella', 'Mike', 'Sophie', 'George', 'Lily'
    ],
    'Age': [8, 17, 26, 38, 52, 72, 15, 28, 61, 45, 33, 5],
    'Fare': [7.25, 10.50, 35.00, 71.83, 8.05, 512.33, 9.25, 27.50, 82.10, 30.00, 45.80, 12.50]
})

print("Original Titanic Data:")
print(titanic)

# --- Step 2: Define bins and apply pd.cut() ---
bins = [0, 10, 25, 40, 60, 200]
labels = ['0-10', '11-25', '26-40', '41-60', '60+']

titanic['Age_Group'] = pd.cut(titanic['Age'], bins=bins, labels=labels, include_lowest=True)

print("\nAfter Binning Ages into Categories:")
print(titanic[['Name', 'Age', 'Age_Group']])

# --- Step 3 (optional): Count how many passengers per group ---
print("\nPassenger count by Age Group:")
print(titanic['Age_Group'].value_counts().sort_index())


Original Titanic Data:
      Name  Age    Fare
0     John    8    7.25
1     Mary   17   10.50
2      Sam   26   35.00
3     Lucy   38   71.83
4      Tom   52    8.05
5     Anna   72  512.33
6    James   15    9.25
7     Ella   28   27.50
8     Mike   61   82.10
9   Sophie   45   30.00
10  George   33   45.80
11    Lily    5   12.50

After Binning Ages into Categories:
      Name  Age Age_Group
0     John    8      0-10
1     Mary   17     11-25
2      Sam   26     26-40
3     Lucy   38     26-40
4      Tom   52     41-60
5     Anna   72       60+
6    James   15     11-25
7     Ella   28     26-40
8     Mike   61       60+
9   Sophie   45     41-60
10  George   33     26-40
11    Lily    5      0-10

Passenger count by Age Group:
Age_Group
0-10     2
11-25    2
26-40    4
41-60    2
60+      2
Name: count, dtype: int64


##### Binning (Grouping) Ages with `pd.cut()`

This example shows how to divide a continuous numeric column (`Age`) into **discrete ranges** (bins) using `pandas.cut()`.

---

##### Step 1: Create the dataset
We start with 12 Titanic passengers and their ages and fares.

| Name | Age | Fare |
|------|-----|------|
| John | 8 | 7.25 |
| Mary | 17 | 10.50 |
| Sam | 26 | 35.00 |
| Lucy | 38 | 71.83 |
| Tom | 52 | 8.05 |
| Anna | 72 | 512.33 |
| James | 15 | 9.25 |
| Ella | 28 | 27.50 |
| Mike | 61 | 82.10 |
| Sophie | 45 | 30.00 |
| George | 33 | 45.80 |
| Lily | 5 | 12.50 |

---

##### Step 2: Define bins and labels
```python
bins = [0, 10, 25, 40, 60, 200]
labels = ['0-10', '11-25', '26-40', '41-60', '60+']
```
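* The **bins** define the numeric breakpoints for each age range. By default each interval includes its right edge, so age 10 falls in `0-10` and age 60 in `41-60`.
* The **labels** are human-readable categories for each range; `60+` therefore covers ages above 60.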

---

##### Step 3: Apply `pd.cut()`

```python
titanic['Age_Group'] = pd.cut(
    titanic['Age'], 
    bins=bins, 
    labels=labels, 
    include_lowest=True
)
```

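`pd.cut()` assigns each `Age` to one of the predefined ranges. `include_lowest=True` makes the first interval include its left edge, so an age of 0 is binned rather than left as `NaN`.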
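
How `pd.cut()` treats values that sit exactly on a bin edge is easy to get wrong, so here is a small sketch with ages chosen on or next to the edges. The `edge_ages` series is made up for illustration; the bins and labels are the ones defined in Step 2.

```python
import pandas as pd

bins = [0, 10, 25, 40, 60, 200]
labels = ['0-10', '11-25', '26-40', '41-60', '60+']

# Ages chosen to sit on or next to the bin edges
edge_ages = pd.Series([0, 10, 11, 25, 26, 60, 61])
groups = pd.cut(edge_ages, bins=bins, labels=labels, include_lowest=True)

for age, group in zip(edge_ages, groups):
    print(age, '->', group)
```

Ages 10, 25, and 60 land in the lower of the two neighboring groups because intervals are closed on the right by default.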

---

##### Step 4: Count the passengers in each group

```python
titanic['Age_Group'].value_counts().sort_index()
```

This shows how many passengers fall into each age category.

| Age_Group | Count |
| --------- | ----- |
| 0-10      | 2     |
| 11-25     | 2     |
| 26-40     | 4     |
| 41-60     | 2     |
| 60+       | 2     |

---

##### Why Use Binning?

* Makes numeric data **more interpretable**.
* Helps visualize and analyze trends across age groups.
* Useful when building features that reflect **categories** (like “child,” “young adult,” “senior”).
* Simplifies model input when exact numeric precision isn’t critical.
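
One way to use the binned column is to summarize another variable per group. The sketch below takes six of the passengers from the example and reports the count and mean fare for each age group; the `demo` and `summary` names are illustrative.

```python
import pandas as pd

# Six of the passengers from the example above
demo = pd.DataFrame({
    'Age': [8, 17, 26, 38, 72, 61],
    'Fare': [7.25, 10.50, 35.00, 71.83, 512.33, 82.10],
})
demo['Age_Group'] = pd.cut(
    demo['Age'], bins=[0, 10, 25, 40, 60, 200],
    labels=['0-10', '11-25', '26-40', '41-60', '60+'], include_lowest=True
)

# observed=False keeps empty groups (here 41-60) in the result
summary = demo.groupby('Age_Group', observed=False)['Fare'].agg(['count', 'mean'])
print(summary)
```

Keeping empty groups visible makes it obvious when a bin has no data rather than silently dropping it from the table.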

---

**In summary:**
`pd.cut()` is an elegant way to group continuous values into labeled ranges — a foundational skill for **feature engineering** and **data preparation**.

### 1.6 Evaluation Metrics — Picking the right yardstick

- Classification → accuracy, precision, recall, F1. 

- Regression → RMSE, MAE, R².
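
The code cell below covers the classification metrics. For the regression metrics, here is a minimal sketch that computes MAE, RMSE, and R² by hand on made-up prices and predictions, using only the standard library. All values are hypothetical.

```python
from math import sqrt

# Hypothetical house prices (in thousands) and model predictions
y_true = [200, 250, 300, 350]
y_pred = [210, 240, 320, 330]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

mae = sum(abs(e) for e in errors) / n         # mean absolute error
rmse = sqrt(sum(e ** 2 for e in errors) / n)  # root mean squared error

mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)                # variation the model fails to explain
ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total variation around the mean
r2 = 1 - ss_res / ss_tot

print(f"MAE : {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R2  : {r2:.2f}")
```

RMSE is slightly larger than MAE here because squaring gives the two 20-unit errors more weight than the two 10-unit errors.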

In [None]:
# Step 1: Imports
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, accuracy_score, precision_score, recall_score, f1_score
import matplotlib.pyplot as plt

# Step 2: Create a small Titanic-like dataset
data = pd.DataFrame({
    'Sex_male': [1, 0, 1, 1, 0, 0, 1, 0, 1, 0],
    'Age': [22, 38, 26, 35, 28, 19, 42, 50, 30, 45],
    'Survived': [0, 1, 1, 0, 1, 1, 0, 0, 1, 0]
})

X = data[['Sex_male', 'Age']]
y = data['Survived']

# Step 3: Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1955, stratify=y)

# Step 4: Train a simple Decision Tree
model = DecisionTreeClassifier(max_depth=3, random_state=1955)
model.fit(X_train, y_train)

# Step 5: Make predictions
y_pred = model.predict(X_test)

# Step 6: Evaluate
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Step 7: Display Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Did Not Survive', 'Survived'])
disp.plot(cmap='Blues')
plt.title('Confusion Matrix — Titanic Mini Example')
plt.show()

print(f"Accuracy : {acc:.2f}")
print(f"Precision: {prec:.2f}")
print(f"Recall   : {rec:.2f}")
print(f"F1 Score : {f1:.2f}")


##### Understanding the Confusion Matrix: Accuracy vs Precision vs Recall

This example trains a **small Decision Tree** to predict passenger survival from age and sex (`Sex_male`).  
The **confusion matrix** plotted by the code above summarizes how many predictions were correct vs. incorrect.

##### What is a Confusion Matrix?

| **Actual / Predicted** | **Predicted: No (0)** | **Predicted: Yes (1)** |
|-------------------------|-----------------------|------------------------|
| **Actual: No (0)**      | True Negatives (TN)   | False Positives (FP)   |
| **Actual: Yes (1)**     | False Negatives (FN)  | True Positives (TP)    |

It tells us *how the model’s predictions are distributed* — not just how often it’s right.


##### Metrics Derived from the Matrix

- **Accuracy** = (TP + TN) / (TP + TN + FP + FN)  
  - The fraction of total predictions that were correct.  
  - Example: 8 out of 10 passengers correctly classified → 0.80 (80%).

- **Precision** = TP / (TP + FP)  
  - Of all passengers predicted as “Survived,” how many actually survived?  
  - Measures *how reliable a positive prediction is.*

- **Recall** = TP / (TP + FN)  
  - Of all passengers who actually survived, how many did the model correctly identify?  
  - Measures *how complete the positive predictions are.*

- **F1 Score** = 2 × (Precision × Recall) / (Precision + Recall)  
  - A balanced measure when you care about both precision and recall.
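
The four formulas can be checked with plain arithmetic. The sketch below plugs in made-up confusion-matrix counts for a 15-passenger test set; the counts are hypothetical and are not the output of the code cell above.

```python
# Hypothetical confusion-matrix counts for a 15-passenger test set
tp, fp, fn, tn = 3, 1, 2, 9

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy : {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall   : {recall:.2f}")
print(f"F1 Score : {f1:.2f}")
```

Accuracy is high mainly because of the nine true negatives, while recall shows that two of the five actual survivors were missed.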

---

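##### Example Output (Illustrative)
The values below are illustrative, not the output of the cell above: with only 3 test rows, that cell can report accuracy only in steps of 1/3.
- Accuracy : 0.80
- Precision: 0.75
- Recall : 0.60
- F1 Score : 0.67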


##### Interpreting These Results
- **Accuracy (80%)** looks good, but the confusion matrix might show that the model **missed several survivors (false negatives)**.  
- **Precision (75%)** means that when the model predicts “Survived,” it’s right 75% of the time.  
- **Recall (60%)** shows it found only 60% of all actual survivors — it’s cautious but misses some positives.


##### Takeaway
- **Accuracy** can be misleading on **imbalanced datasets** (e.g., many more deaths than survivors).  
- **Precision** focuses on *quality* of positive predictions.  
- **Recall** focuses on *quantity* of true positives found.  
- The **confusion matrix** gives you the *complete picture* of your classifier’s performance.

In real projects, always look beyond accuracy — analyze the full confusion matrix to understand *how* your model is getting results.
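
A short sketch shows how misleading accuracy can be. It scores a "model" that always predicts the majority class on a made-up imbalanced test set; the labels are hypothetical.

```python
# Hypothetical imbalanced test set: 90 non-survivors (0) and 10 survivors (1)
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100  # a "model" that always predicts the majority class

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)

print(f"Accuracy: {accuracy:.2f}")
print(f"Recall  : {recall:.2f}")
```

The model reaches 90% accuracy without identifying a single survivor, which only the recall (0%) or the confusion matrix reveals.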

### 1.7 Hands-On (Mini Project) 
See additional notebooks:
- M1_1.7A_Titanic_Classification.ipynb
- M1_1.7B_Housing_Regression.ipynb
- M1_1.7C_Go_Further_Lab.ipynb