# Module 1 — Hands-On 1.7 (A): Titanic Classification (Survival)
**Goal:** Apply the ML pipeline to a small dataset.

**Steps:** load → minimal EDA/cleaning → encode/split → simple model → metric (accuracy).

## Step 1: Load a Real-World Dataset

### When the real file isn’t available, synthetic Titanic-like data is created to complete the project

If the `titanic_synth.csv` file isn’t found under `../data/`, we create a small synthetic dataset so you can still run the notebook. The code uses a **fixed random seed (`SEED=1955`)**, so every run of the example produces the same numbers (reproducible). It builds **240 rows** (data points), each representing one passenger.

**Columns generated:**
- **`Age`** — integer from **1 to 79** (years).
- **`Sex`** — one of `{"male", "female"}`.
- **`Fare`** — floating-point ticket price between **\$5 and \$120** (rounded to 2 decimals).
- **`Embarked`** — port of embarkation:  
  - `S` = **Southampton** (United Kingdom)  
  - `C` = **Cherbourg** (France)  
  - `Q` = **Queenstown** (now Cobh, Ireland)
- **`Survived`** — outcome flag: `1` = survived, `0` = did not survive.

This synthetic dataset mimics the shape of the real Titanic data so that you can practice the **full ML pipeline** (load → clean → encode → model → evaluate) without relying on external downloads.
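
To make the reproducibility claim concrete, here is a minimal sketch that builds the synthetic table twice from the same seed and compares the results. The helper name `make_synthetic` is hypothetical (it is not part of the notebook); the column recipes mirror the list above.

```python
import numpy as np
import pandas as pd

SEED = 1955  # same seed value the notebook uses


def make_synthetic(seed: int, n: int = 240) -> pd.DataFrame:
    """Hypothetical helper: rebuild the synthetic passenger table from a seed."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "Age": rng.integers(1, 80, n),              # upper bound is exclusive, so 1..79
        "Sex": rng.choice(["male", "female"], n),
        "Fare": rng.uniform(5, 120, n).round(2),
        "Embarked": rng.choice(["S", "C", "Q"], n),
        "Survived": rng.integers(0, 2, n),
    })


first = make_synthetic(SEED)
second = make_synthetic(SEED)
print("Identical tables:", first.equals(second))
print("Shape:", first.shape)
```

Changing the seed changes every column, which is why the seed is fixed once at the top of the notebook.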


In [1]:
import pandas as pd
import numpy as np
SEED = 1955
try:
    df = pd.read_csv('../data/titanic_synth.csv')
except FileNotFoundError:
    rng = np.random.default_rng(SEED)
    df = pd.DataFrame({
        'Age': rng.integers(1, 80, 240),
        'Sex': rng.choice(['male','female'], 240),
        'Fare': rng.uniform(5, 120, 240).round(2),
        'Embarked': rng.choice(['S','C','Q'], 240),
        'Survived': rng.integers(0, 2, 240)
    })
df.head()

Unnamed: 0,Survived,age,fare,sibsp,parch,pclass,sex,embarked
0,1,34.6,7.22,0,2,2,female,C
1,0,35.4,5.84,1,2,2,female,S
2,0,36.2,6.73,1,1,1,female,S
3,1,35.8,6.64,1,1,3,female,C
4,1,20.1,7.63,0,2,1,female,S


### What `df.head()` is telling you, and what to check for

`df.head()` prints the **first few rows** (default 5). Use it to quickly check the data:

- **Column names & order** — Do you see the features you expect (`Age`, `Sex`, `Fare`, `Embarked`, `Survived`)? Any typos?
- **Value shapes** — Are `Age` and `Fare` numeric-looking? Are categorical fields like `Sex` and `Embarked` strings?
- **Obvious anomalies** — Negative fares? Ages of 0 or 999? Blank or `NaN` cells?
- **Consistent coding** — e.g., `male/female` vs `Male/Female` (case consistency matters before encoding).

> Tip: `head()` is your **first line of defense** against bad assumptions. It doesn’t replace full validation, but it helps you spot glaring issues early.
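
The checks in the list above can also be done programmatically once the table is too large to eyeball. The sketch below runs on a tiny hand-made frame (`sample` is a hypothetical stand-in, not the notebook's `df`) that deliberately contains one mixed-case label, one negative fare, and one blank cell.

```python
import pandas as pd

# Hypothetical four-row stand-in for the notebook's df
sample = pd.DataFrame({
    "Age": [22, 38, 0, 35],
    "Sex": ["male", "Female", "female", "male"],
    "Fare": [7.25, 71.28, -1.0, 8.05],
    "Embarked": ["S", "C", None, "S"],
    "Survived": [0, 1, 1, 0],
})

expected = ["Age", "Sex", "Fare", "Embarked", "Survived"]

missing_cols = [c for c in expected if c not in sample.columns]  # column names
sex_labels = sorted(sample["Sex"].dropna().unique())             # consistent coding
n_negative_fares = int((sample["Fare"] < 0).sum())               # obvious anomalies
n_blank_cells = int(sample.isna().sum().sum())                   # NaN / None cells

print("Missing columns:", missing_cols)
print("Sex labels:", sex_labels)
print("Negative fares:", n_negative_fares)
print("Blank cells:", n_blank_cells)
```

Seeing both `Female` and `female` in the label list is the cue to normalize case (for example with `.str.lower()`) before encoding.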


In [None]:
# --- Inject a bit of realistic "messiness" for hands-on cleaning ---
rng = np.random.default_rng(SEED)

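# Cast Age to float first so the column can hold NaN without a dtype warning
df['Age'] = df['Age'].astype(float)

# ~8% missing ages, ~5% missing fares, ~4% missing embarked
df.loc[rng.random(len(df)) < 0.08, 'Age'] = np.nan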
df.loc[rng.random(len(df)) < 0.05, 'Fare'] = np.nan
df.loc[rng.random(len(df)) < 0.04, 'Embarked'] = None

# A few outlier fares (e.g., data entry issues)
out_idx = df.sample(3, random_state=SEED).index
df.loc[out_idx, 'Fare'] = df.loc[out_idx, 'Fare'].fillna(0) + 500  # make them very large


### Why we’re adding “messiness”

Real-world data is rarely perfect. We deliberately introduce:
- **Missing values** (`Age`, `Fare`, `Embarked`) to practice **imputation** and handling categorical NaNs.
- **Outliers** in `Fare` to practice **simple anomaly handling** (e.g., capping or inspecting unusual values).

> These small imperfections give us something meaningful to clean before modeling.

In [None]:
df.info()  # shows dtypes and missingness

In [None]:
df.describe(include='all')  # summarizes ranges and frequencies

### Understanding `df.info()` and `df.describe(include='all')`

These two commands are part of the **exploratory data analysis (EDA)** step. They help you **understand the structure and quality** of your dataset before you start cleaning or modeling.

---

#### `df.info()`
- Displays a concise **summary of the DataFrame** — column names, data types, and counts of non-null values.
- Use it to quickly check:
  - Which columns are **numeric**, **categorical (object)**, or **boolean**.
  - Whether any columns have **missing values** (the “non-null count” will be lower than total rows).
  - The overall **memory usage** (useful for large datasets).

In the output above, you can spot the missing values: `Age`, `Fare`, and `Embarked` each have fewer than 240 non-null entries.

> **Why it matters:** `df.info()` helps you decide which columns might need imputation or type conversion before modeling.

---

#### `df.describe(include='all')`
- Gives **summary statistics** for each column.
- By default, `.describe()` only includes numeric columns.  
  Adding `include='all'` makes it include **both numeric and categorical** data.

**For numeric columns**, it shows:
| Statistic | Meaning |
|------------|----------|
| `count` | Number of non-missing values |
| `mean` | Average |
| `std` | Standard deviation (spread) |
| `min`, `25%`, `50%`, `75%`, `max` | Minimum, quartiles (`50%` is the median), and maximum |

**For categorical columns**, it shows:
| Statistic | Meaning |
|------------|----------|
| `count` | Number of non-missing values |
| `unique` | Number of distinct categories |
| `top` | Most frequent category |
| `freq` | How often that top category appears |

**Why it matters:** `df.describe(include='all')` gives a **first quantitative sense** of your data’s shape and scale — useful for spotting:
- Missing or extreme values (e.g., `min=0` where that makes no sense).
- Unbalanced categorical distributions (e.g., too few males/females).
- Skewed numeric variables (large `max` vs `mean`).

Together, these two commands form the **“X-ray”** of your dataset — they help you see the big picture before doing any transformations.
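
As a small illustration of how to read the summary, the sketch below runs `describe(include='all')` on a three-column toy frame (`toy` and its values are made up for this example) and pulls out the statistics discussed above.

```python
import numpy as np
import pandas as pd

# Made-up frame: one missing Age, one extreme Fare, an unbalanced Sex column
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 58.0],
    "Fare": [7.25, 8.05, 512.33, 13.0],
    "Sex": ["male", "female", "female", "female"],
})

summary = toy.describe(include="all")

age_count = summary.loc["count", "Age"]   # non-missing ages (lower than 4 rows)
sex_top = summary.loc["top", "Sex"]       # most frequent category
sex_freq = summary.loc["freq", "Sex"]     # how often it appears
fare_max = summary.loc["max", "Fare"]
fare_mean = summary.loc["mean", "Fare"]

print(summary)
print("Fare max vs mean:", fare_max, fare_mean)  # a large gap hints at skew
```

Cells that do not apply to a column type (for example `unique` for `Age`) show up as `NaN` in the combined table.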


## Step 2: Minimal EDA & Cleaning

**Why this step matters:**  
Garbage in → garbage out. Even small cleaning choices can change your model’s conclusions.

**What to try:**  
- Identify columns with missing values using `df.isna().sum()`.  
- Impute **one** column (e.g., `Age` or `Fare`) to see the impact.  
- Consider simple outlier handling (e.g., cap extreme `Fare`).

**Common pitfalls:**  
- Dropping too many rows with `dropna()` and losing valuable data.  
- Applying imputation differently to train vs test (always fit on train, apply to test via pipelines).
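
The second pitfall can be sketched as follows: the imputer learns its fill value from the training rows only and then reuses that value on the test rows. The eight-row `toy` frame is made up for illustration, and the choice of `strategy="median"` is an assumption of this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Made-up Age column with two missing values
toy = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0, 50.0, np.nan, 60.0, 70.0]})
train, test = train_test_split(toy, test_size=0.25, random_state=0)

imputer = SimpleImputer(strategy="median")
imputer.fit(train[["Age"]])                       # statistic comes from train only

train_filled = imputer.transform(train[["Age"]])
test_filled = imputer.transform(test[["Age"]])    # reuses the train median

print("Learned fill value:", imputer.statistics_[0])
print("NaNs left in test:", int(np.isnan(test_filled).sum()))
```

Placing the imputer inside a scikit-learn `Pipeline` gives the same guarantee automatically, because `fit` only ever sees the training data.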


In [None]:
# Count the missing (NaN or None) values in each column of the DataFrame
df.isna().sum()

This result means that:
- 20 rows have a missing `Age`,
- 6 rows have a missing `Fare`,
- 9 rows have a missing `Embarked`,
- no rows have a missing `Sex` or `Survived`.

### Handling missing or invalid Age values

In [None]:
from sklearn.impute import SimpleImputer
if df['Age'].isna().any():
    imputer = SimpleImputer(strategy='mean')
    df['Age'] = imputer.fit_transform(df[['Age']])

# Validation
print("Missing Age values after cleaning:", df['Age'].isna().sum())
print(df['Age'].describe())

#### What this imputation code does (and why)

Step-by-Step:

1. Check for missingness: `df['Age'].isna().any()` means imputation runs only if it is needed.
2. Choose a strategy: `strategy='mean'` replaces missing values with the column’s average.
3. Fit + transform: `fit_transform` computes the mean from the non-missing rows and fills the NaNs. (We pass `df[['Age']]`, a 2-D selection, to match scikit-learn’s expected input shape.)

#### When to use which strategy?

- Mean: fast, OK if distribution is fairly symmetric.
- Median: more robust to outliers (often better for skewed numeric data like prices).
- Most frequent: for categorical text columns (e.g., fill missing Embarked with the mode).
- Constant: when you need a placeholder value (e.g., 0 or "Unknown").

**Always document your choice and consider how it might bias the model. Imputation is a modeling decision.**
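
To see how much the strategy matters, the sketch below fills the same missing value four ways. The numbers are made up: four fares with one extreme value, plus a tiny port column stored as a NumPy object array.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up fares: three ordinary values, one extreme value, one missing
fares = np.array([[8.0], [10.0], [12.0], [500.0], [np.nan]])

mean_fill = SimpleImputer(strategy="mean").fit_transform(fares)[-1, 0]
median_fill = SimpleImputer(strategy="median").fit_transform(fares)[-1, 0]
const_fill = SimpleImputer(strategy="constant", fill_value=0).fit_transform(fares)[-1, 0]

# Made-up port labels with one missing entry
ports = np.array([["S"], ["S"], ["C"], [np.nan]], dtype=object)
mode_fill = SimpleImputer(strategy="most_frequent").fit_transform(ports)[-1, 0]

print("mean:", mean_fill)          # 132.5, pulled up by the 500 outlier
print("median:", median_fill)      # 11.0, barely affected by the outlier
print("constant:", const_fill)
print("most_frequent:", mode_fill)
```

The gap between the mean and median fills is the practical reason the median is often preferred for skewed columns such as prices.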

### Handling missing or invalid Fare values

In [None]:
# --- Handle missing or invalid Fare values ---

from sklearn.impute import SimpleImputer
import numpy as np

# Replace negative or zero fares with NaN (invalid values)
df.loc[df['Fare'] <= 0, 'Fare'] = np.nan

# Cap extreme outlier fares using IQR (winsorization)
q1, q3 = df['Fare'].quantile([0.25, 0.75])
iqr = q3 - q1
upper_whisker = q3 + 1.5 * iqr
df['Fare'] = np.where(df['Fare'] > upper_whisker, upper_whisker, df['Fare'])

# Impute remaining missing values with the median Fare
fare_imputer = SimpleImputer(strategy='median')
df['Fare'] = fare_imputer.fit_transform(df[['Fare']])

# Quick validation
print("Missing Fares after cleaning:", df['Fare'].isna().sum())
print(df['Fare'].describe())
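
The IQR cap used in the cell above can be checked by hand on a small made-up series. This sketch uses `Series.clip(upper=...)` in place of `np.where`; for this purpose the two behave the same, and both leave `NaN` values untouched until the imputation step.

```python
import numpy as np
import pandas as pd

# Made-up fares: five ordinary values, one extreme value, one missing
fares = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 600.0, np.nan])

q1, q3 = fares.quantile([0.25, 0.75])   # NaN is ignored: 22.5 and 47.5
iqr = q3 - q1                           # 25.0
upper_whisker = q3 + 1.5 * iqr          # 47.5 + 37.5 = 85.0

capped = fares.clip(upper=upper_whisker)   # 600.0 becomes 85.0, NaN stays NaN
filled = capped.fillna(capped.median())    # median of the capped values is 35.0

print("Upper whisker:", upper_whisker)
print(filled.tolist())
```

Capping before imputing matters here: the median is computed on the capped values, so the extreme fare no longer influences the fill value.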


### Handling missing or invalid Embarked values

In [None]:
# --- Handle missing values in the Embarked column ---

from sklearn.impute import SimpleImputer

# Check how many are missing
print("Missing Embarked values before cleaning:", df['Embarked'].isna().sum())
print(df['Embarked'].value_counts())

# Alternative (scikit-learn), left commented out:
#   embarked_imputer = SimpleImputer(strategy='most_frequent')
#   df['Embarked'] = embarked_imputer.fit_transform(df[['Embarked']])
# fit_transform returns a 2-D array of shape (n_rows, 1), which can raise an error
# when assigned to a single column; flattening the result (e.g., with .ravel())
# avoids that. The pandas version below sidesteps the shape issue entirely.

# Fill missing values with the most frequent value (mode)
mode_val = df['Embarked'].mode(dropna=True).iloc[0]
df['Embarked'] = df['Embarked'].fillna(mode_val)

# Confirm that all missing values have been filled
print("Missing Embarked values after cleaning:", df['Embarked'].isna().sum())
print(df['Embarked'].value_counts())


#### Fixing missing values in `Embarked`

Since `Embarked` is a **categorical feature**, we handle missing values by filling them with the **most frequent category** (the mode). In this version, we use pure **pandas**, which is simpler and avoids shape issues.

#### Step-by-step
1. **Count missing values** – see how many entries lack a port of embarkation.
2. **Choose a strategy** – for categorical features, `'most_frequent'` is common because:
   - It keeps every filled value inside the existing set of categories (at the cost of slightly inflating the share of the most common one).
   - It avoids introducing a fake “unknown” class unless that’s analytically useful.
3. **Find the mode** – `mode_val = df['Embarked'].mode(dropna=True).iloc[0]` identifies the most common value (e.g., `'S'` for Southampton).
4. **Fill missing values** – `df['Embarked'] = df['Embarked'].fillna(mode_val)` replaces every missing cell with that most frequent value.
5. **Validate** – recheck missing counts and category frequencies.


**Why this approach works well:**  
Filling with the most common port (e.g., `'S'` for Southampton) is a reasonable assumption unless you have reason to believe missing values cluster differently.  
For more advanced pipelines, you could later encode `'Missing'` as its own category to preserve that information.
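
The two options can be compared side by side on a small made-up series. The label `"Missing"` is just a placeholder string chosen for this sketch.

```python
import numpy as np
import pandas as pd

# Made-up port column with two missing entries
embarked = pd.Series(["S", "C", np.nan, "S", "Q", np.nan], name="Embarked")

# Option 1: fill with the mode (what the notebook does)
mode_val = embarked.mode(dropna=True).iloc[0]
mode_filled = embarked.fillna(mode_val)

# Option 2: keep missingness visible as its own category
flagged = embarked.fillna("Missing")

print(mode_filled.value_counts().to_dict())
print(flagged.value_counts().to_dict())
```

With mode filling, the count for `S` grows from 2 to 4; with the placeholder, the original counts stay intact and a later one-hot encoder would produce an extra indicator column for the missing rows.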



## Step 3: Encode & Split Data

**Why this step matters:**  
- Models need numbers. Categorical features (e.g., `Sex`, `Embarked`) require encoding.  
- Train/test split ensures unbiased evaluation of generalization.

**Good practices:**  
- Use a **`ColumnTransformer` + `Pipeline`** so preprocessing is learned on training data and applied consistently to test/new data.  
- Set `random_state` for reproducibility; use `stratify=y` for classification to keep class balance.


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

X = df.drop('Survived', axis=1)
y = df['Survived']
cat_cols = X.select_dtypes(include=['object']).columns.tolist()

pre = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)
], remainder='passthrough')

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=SEED
)
X_train.shape, X_test.shape

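**Explanation:**
- `OneHotEncoder` turns each categorical column into 0/1 indicator columns, and `remainder='passthrough'` leaves the numeric columns unchanged. The transformer is only defined here; it is fitted later, inside the pipeline.
- The stratified split preserves the class balance of `Survived` in both the training and test sets.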

## Step 4: Fit a Simple Model

**Why start simple?**  
- A **Decision Tree** (classification) or **Linear Regression** (regression) is interpretable and fast.  
- Helps build intuition before exploring more complex models (Random Forest, XGBoost, Neural Nets).

**Common pitfalls:**  
- Overfitting by using deep trees / over-parameterized models too early.  
- Confusing training performance with generalization (we care about test/validation metrics).

In [None]:
from sklearn.tree import DecisionTreeClassifier

clf = Pipeline([
    ('pre', pre),
    ('model', DecisionTreeClassifier(max_depth=4, random_state=SEED))
])
clf.fit(X_train, y_train);

**Explanation:** A shallow Decision Tree (`max_depth=4`) keeps the model interpretable and limits how closely it can fit noise in the training data.
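
One way to see that interpretability is to print the learned rules with scikit-learn's `export_text`. The sketch below fits a tree on made-up data where the label depends on a single feature; the column names `Age` and `Fare` are used only as labels for this toy example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 2))   # made-up columns standing in for Age, Fare
y = (X[:, 1] > 50).astype(int)           # toy rule: the label depends on Fare only

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
rules = export_text(tree, feature_names=["Age", "Fare"])

print(rules)
print("Depth used:", tree.get_depth())
```

On the notebook's pipeline, the fitted tree is available as `clf.named_steps['model']`, though its inputs are the one-hot encoded columns rather than the raw ones.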

## Step 5: Brief Metric Calculation

**Pick metrics that match the goal:**
- **Classification**:  
  - **Accuracy** for balanced classes.  
  - **Precision/Recall/F1** when positives are rare or costs differ (e.g., missing a survivor vs false alarm).  
- **Regression**:  
  - **RMSE/MAE** for absolute error.  
  - **R²** to gauge explained variance.

In [None]:
from sklearn.metrics import accuracy_score
pred = clf.predict(X_test)
acc = accuracy_score(y_test, pred)
print(f'Accuracy: {acc:.3f}')

**Discussion:** When can accuracy be misleading (imbalance, unequal error costs)?
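
As a starting point for that discussion, here is a small sketch with made-up labels: 90 negatives, 10 positives, and a "model" that always predicts the majority class.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Made-up imbalanced labels: 10% positives
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100                      # always predict the majority class

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"Accuracy: {acc:.3f}")   # 0.900
print(f"Recall:   {rec:.3f}")   # 0.000
print(f"F1:       {f1:.3f}")    # 0.000
```

The accuracy looks strong even though not a single positive case is found, which is why recall and F1 are worth reporting whenever classes are imbalanced or the two error types have different costs.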