# 02 - Feature Engineering and Preprocessing

In this notebook, we will focus on **feature engineering** and **data preprocessing** for the Titanic dataset. Based on the insights from our Exploratory Data Analysis (EDA), we'll clean the data, handle missing values, create new features that might improve model performance, and transform categorical and numerical features into a format suitable for machine learning algorithms.

## 1. Load Data

First, let's load the `train.csv` dataset. We'll also make a copy to work with, preserving the original dataframe.

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Load the dataset
try:
    df = pd.read_csv('../data/train.csv')
    # Create a copy to work with, so the original DataFrame remains untouched
    data = df.copy()
    print("Dataset loaded successfully and a working copy created.")
except FileNotFoundError:
    print("Error: train.csv not found. Please make sure it's in the '../data/' directory.")
    data = pd.DataFrame() # Create an empty DataFrame to avoid errors later


## 2. Handle Missing Values

We identified `Age`, `Embarked`, and `Cabin` as columns with missing values in our EDA.

### 2.1. Impute `Age`

We'll impute missing `Age` values with the **median** age. The median is often preferred over the mean for skewed distributions or when outliers are present, as it's more robust.

In [None]:
if not data.empty:
    # Impute Age with median
    median_age = data['Age'].median()
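    # Assign the result back: calling fillna(inplace=True) on a column selection is deprecated in recent pandas
    data['Age'] = data['Age'].fillna(median_age)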
    print(f"Missing 'Age' values imputed with median: {median_age:.2f}")
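
To illustrate why the median is the more robust choice, here is a minimal sketch on a made-up series of ages (the values are invented for illustration and are not taken from `train.csv`). A single outlier pulls the mean well above the typical age, while the median stays put.

```python
import pandas as pd

# Toy ages (made up for illustration): mostly young passengers, one outlier, two missing
ages = pd.Series([22.0, 24.0, 26.0, 28.0, 80.0, None, None])

mean_age = ages.mean()      # pulled upward by the outlier
median_age = ages.median()  # unaffected by the outlier

# Fill the gaps with the median and assign the result to a new Series
filled = ages.fillna(median_age)

print(f"mean={mean_age:.1f}, median={median_age:.1f}")
print(filled.tolist())
```

Both statistics skip missing values by default, so the gaps themselves do not distort either estimate.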


### 2.2. Impute `Embarked`

For `Embarked`, which has only a couple of missing values, we'll impute with the **mode** (most frequent value).

In [None]:
if not data.empty:
    # Impute Embarked with mode
    mode_embarked = data['Embarked'].mode()[0]
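    # Assign the result back: calling fillna(inplace=True) on a column selection is deprecated in recent pandas
    data['Embarked'] = data['Embarked'].fillna(mode_embarked)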
    print(f"Missing 'Embarked' values imputed with mode: {mode_embarked}")


### 2.3. Handle `Cabin`

Given the high percentage of missing values in `Cabin`, we'll simplify it by creating a binary feature: `Has_Cabin`. This feature will indicate whether a passenger had cabin information recorded (1) or not (0).

In [None]:
if not data.empty:
    # Create a new feature 'Has_Cabin'
    data['Has_Cabin'] = data['Cabin'].notna().astype(int)
    # Drop the original 'Cabin' column
    data.drop('Cabin', axis=1, inplace=True)
    print("Created 'Has_Cabin' feature and dropped original 'Cabin' column.")


### 2.4. Verify Missing Values

In [None]:
if not data.empty:
    print("\nMissing values after imputation and handling:")
    # Compute the per-column counts once and reuse them
    missing_counts = data.isnull().sum()
    print(missing_counts[missing_counts > 0])
    if missing_counts.sum() == 0:
        print("All missing values have been handled.")


## 3. Feature Engineering

Let's create some new features that might provide more predictive power.

### 3.1. `FamilySize` and `IsAlone`

We'll combine `SibSp` (siblings/spouses aboard) and `Parch` (parents/children aboard) to create `FamilySize`. From `FamilySize`, we'll derive `IsAlone` to indicate if a passenger traveled alone.

In [None]:
if not data.empty:
    data['FamilySize'] = data['SibSp'] + data['Parch'] + 1 # +1 for the passenger themselves
    data['IsAlone'] = (data['FamilySize'] == 1).astype(int)
    print("Created 'FamilySize' and 'IsAlone' features.")
    display(data[['SibSp', 'Parch', 'FamilySize', 'IsAlone']].head())


### 3.2. `Title` Extraction

The `Name` column contains titles (e.g., Mr., Mrs., Miss, Master). These titles often reflect social status and could be indicative of survival. We'll extract them, group less common titles into a 'Rare' category, and fold the variants `Mlle`, `Ms`, and `Mme` into `Miss` and `Mrs`.

In [None]:
if not data.empty:
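    # Capture the word right before the first period; the raw string avoids an invalid-escape warning for '\.'
    data['Title'] = data['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
    # Group rare titles into 'Rare' and map variant spellings to their common equivalents
    rare_titles = ['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona']
    title_map = {title: 'Rare' for title in rare_titles}
    title_map.update({'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs'})
    data['Title'] = data['Title'].replace(title_map)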
    print("Extracted and categorized 'Title' feature.")
    print("\nDistribution of Titles:")
    print(data['Title'].value_counts())
    data.drop('Name', axis=1, inplace=True) # Drop the original 'Name' column
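
As a small standalone check of the extraction pattern, the sketch below runs the same regular expression on a handful of names in the "Surname, Title. Given names" format. The last two names are hypothetical, and the mapping is only a two-entry subset written as a dictionary for illustration.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Doe, Mlle. Jane",   # hypothetical name
    "Roe, Dr. John",     # hypothetical name
])

# Capture the run of letters that sits between a space and the first period
titles = names.str.extract(r' ([A-Za-z]+)\.', expand=False)

# Subset of the grouping used in this notebook, keyed by original title
title_map = {'Mlle': 'Miss', 'Dr': 'Rare'}
grouped = titles.replace(title_map)

print(titles.tolist())
print(grouped.tolist())
```

Titles that are absent from the mapping (here `Mr` and `Miss`) pass through `replace` unchanged.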


### 3.3. `Fare_Bin` (Binning `Fare`)

We'll discretize `Fare` into 4 bins using `qcut` to ensure each bin has roughly the same number of observations. This can help handle its skewed distribution.

In [None]:
if not data.empty:
    data['Fare_Bin'] = pd.qcut(data['Fare'], q=4, labels=['Very_Low', 'Low', 'Medium', 'High'])
    print("Binned 'Fare' into 'Fare_Bin'.")
    print("\nDistribution of Fare_Bin:")
    print(data['Fare_Bin'].value_counts())
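
To see what equal-frequency binning buys us on skewed data, the following sketch compares `pd.qcut` with equal-width `pd.cut` on a made-up, right-skewed list of fares (the numbers are invented for illustration).

```python
import pandas as pd

# Made-up, right-skewed fares: many cheap tickets and a few very expensive ones
fares = pd.Series([7.0, 7.5, 8.0, 9.0, 13.0, 15.0, 26.0, 30.0, 80.0, 150.0, 260.0, 512.0])
labels = ['Very_Low', 'Low', 'Medium', 'High']

equal_freq = pd.qcut(fares, q=4, labels=labels)     # edges placed at the quartiles
equal_width = pd.cut(fares, bins=4, labels=labels)  # edges evenly spaced over the range

freq_counts = equal_freq.value_counts().sort_index().tolist()
width_counts = equal_width.value_counts().sort_index().tolist()

print("qcut:", freq_counts)
print("cut: ", width_counts)
```

With equal-width bins almost every fare lands in the lowest bin, whereas `qcut` spreads the observations evenly.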


### 3.4. `Age_Bin` (Binning `Age`)

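Similarly, we'll bin `Age` into meaningful categories: 'Child', 'YoungAdult', 'Adult', and 'Senior'. Because we pass `right=False`, each bin includes its lower edge and excludes its upper edge, giving the intervals [0, 12), [12, 25), [25, 60), and [60, 100).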

In [None]:
if not data.empty:
    # Define age bins and labels
    age_bins = [0, 12, 25, 60, 100]
    age_labels = ['Child', 'YoungAdult', 'Adult', 'Senior']
    data['Age_Bin'] = pd.cut(data['Age'], bins=age_bins, labels=age_labels, right=False)
    print("Binned 'Age' into 'Age_Bin'.")
    print("\nDistribution of Age_Bin:")
    print(data['Age_Bin'].value_counts())
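
The effect of `right=False` is easiest to see on values that sit exactly on a bin edge. The sketch below uses the same bins and labels on a few made-up ages and compares the result with the default `right=True`.

```python
import pandas as pd

age_bins = [0, 12, 25, 60, 100]
age_labels = ['Child', 'YoungAdult', 'Adult', 'Senior']

# Made-up ages chosen to sit on or near the bin edges
ages = pd.Series([0.42, 11.9, 12, 24.9, 25, 59, 60])

# right=False: intervals are [0, 12), [12, 25), [25, 60), [60, 100)
left_closed = pd.cut(ages, bins=age_bins, labels=age_labels, right=False).astype(str).tolist()
# right=True (the pandas default): intervals are (0, 12], (12, 25], (25, 60], (60, 100]
right_closed = pd.cut(ages, bins=age_bins, labels=age_labels, right=True).astype(str).tolist()

for age, left, right in zip(ages, left_closed, right_closed):
    print(f"{age:>5}: right=False -> {left:<10} right=True -> {right}")
```

A 12-year-old is a 'YoungAdult' under `right=False` but a 'Child' under `right=True`, so the choice should match the intended definition of each group.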


## 4. Encoding Categorical Features

Machine learning models typically require numerical input. We'll use **One-Hot Encoding** for our categorical features.

In [None]:
if not data.empty:
    # Identify categorical columns to be one-hot encoded
    # Pclass is numerical but treated as categorical due to its discrete nature and influence on survival
    categorical_cols = ['Sex', 'Embarked', 'Pclass', 'Title', 'Fare_Bin', 'Age_Bin']

    # Create a OneHotEncoder instance
    one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

    # Fit and transform the categorical columns
    encoded_features = one_hot_encoder.fit_transform(data[categorical_cols])

    # Create a DataFrame from the encoded features
    encoded_feature_names = one_hot_encoder.get_feature_names_out(categorical_cols)
    encoded_df = pd.DataFrame(encoded_features, columns=encoded_feature_names, index=data.index)

    # Drop original categorical columns and concatenate with encoded features
    data = data.drop(columns=categorical_cols)
    data = pd.concat([data, encoded_df], axis=1)

    print("Categorical features successfully One-Hot Encoded.")
    print("\nDataFrame after One-Hot Encoding:")
    display(data.head())


## 5. Feature Scaling

We'll apply **Standard Scaling** to the continuous numerical features `Age`, `Fare`, and `FamilySize`. This transforms each feature to have a mean of 0 and a standard deviation of 1, which matters for algorithms that are sensitive to feature scale (e.g., Logistic Regression, SVMs, neural networks). Note that `Age` and `Fare` also have binned versions, and the raw columns are dropped in Section 6, so of these three only the scaled `FamilySize` reaches the saved file.

In [None]:
if not data.empty:
    # Identify numerical columns to be scaled. PassengerId and Survived are excluded.
    # Age and Fare are scaled here even though binned versions exist; the raw columns are dropped in Section 6.
    numerical_cols_for_scaling = ['Age', 'Fare', 'FamilySize']

    # Ensure these columns exist and are numeric before scaling
    numerical_cols_for_scaling = [col for col in numerical_cols_for_scaling if col in data.columns and pd.api.types.is_numeric_dtype(data[col])]

    if numerical_cols_for_scaling:
        # Create a StandardScaler instance
        scaler = StandardScaler()

        # Fit and transform the numerical columns
        data[numerical_cols_for_scaling] = scaler.fit_transform(data[numerical_cols_for_scaling])
        print("Numerical features successfully Standard Scaled.")
        print("\nDataFrame after Feature Scaling (scaled columns):")
        display(data[numerical_cols_for_scaling].head())
    else:
        print("No numerical columns selected for scaling or they do not exist.")
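
To make the transformation concrete, the sketch below scales a small made-up table and checks it against the formula `(x - mean) / std` computed by hand. `StandardScaler` uses the population standard deviation (`ddof=0`), which differs from the pandas default of `ddof=1`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up values for illustration only
toy = pd.DataFrame({
    'Fare': [7.25, 8.05, 13.0, 71.28, 512.33],
    'FamilySize': [1, 1, 2, 2, 4],
})

scaler = StandardScaler()
scaled = scaler.fit_transform(toy)

# Manual z-score with the population standard deviation (ddof=0)
manual = ((toy - toy.mean()) / toy.std(ddof=0)).to_numpy()

print("column means:", scaled.mean(axis=0).round(6))
print("column stds: ", scaled.std(axis=0).round(6))
print("matches manual z-score:", np.allclose(scaled, manual))
```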


## 6. Final DataFrame Review

Let's check the final structure of our preprocessed DataFrame. We should see all missing values handled, new features created, and all relevant features encoded and scaled.

In [None]:
if not data.empty:
    print("\nFinal DataFrame Information:")
    data.info()

    print("\nFinal DataFrame Head:")
    display(data.head())


### 6.1. Drop Unnecessary Columns

We'll drop `PassengerId` and `Ticket`, since they act as identifiers and typically do not contribute predictive power. We'll also drop the original `Age` and `Fare` columns, since their binned versions have already been one-hot encoded; this discards the scaled values computed in Section 5.

In [None]:
if not data.empty:
    columns_to_drop = ['PassengerId', 'Ticket', 'Age', 'Fare']
    # Ensure columns exist before dropping
    existing_columns_to_drop = [col for col in columns_to_drop if col in data.columns]
    data.drop(columns=existing_columns_to_drop, inplace=True)
    print("Dropped unnecessary columns: ", existing_columns_to_drop)
    print("\nFinal DataFrame Head after dropping columns:")
    display(data.head())
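
The imports in Section 1 include `SimpleImputer`, `ColumnTransformer`, and `Pipeline`, which the manual steps in this notebook do not use. As a hedged sketch of one way they could be combined (not the approach this notebook takes), the example below bundles imputation, scaling, and encoding into a single object on a made-up four-row table. The step names `num`, `cat`, `impute`, `scale`, and `encode` are arbitrary labels chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows with one missing Age and one missing Embarked
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Fare': [7.25, 71.28, 7.92, 53.10],
    'Sex': ['male', 'female', 'female', 'male'],
    'Embarked': ['S', 'C', np.nan, 'S'],
})

# Numeric columns: median imputation, then standard scaling
numeric_steps = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

# Categorical columns: mode imputation, then one-hot encoding
categorical_steps = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(handle_unknown='ignore')),
])

preprocessor = ColumnTransformer([
    ('num', numeric_steps, ['Age', 'Fare']),
    ('cat', categorical_steps, ['Sex', 'Embarked']),
])

features = preprocessor.fit_transform(toy)
print(features.shape)  # 2 scaled columns + 2 Sex columns + 2 Embarked columns
```

One design advantage of this form is that the fitted `preprocessor` can later be applied to new data with `transform`, reusing the medians, modes, and scaling statistics learned during fitting.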


## 7. Save the Preprocessed DataFrame

Finally, we'll save this preprocessed DataFrame to a new CSV file (`processed_data.csv`) in the `data/` directory. This will allow us to easily load the cleaned and engineered data in the next notebook for model training and evaluation, without having to rerun the preprocessing steps every time.

In [None]:
if not data.empty:
    output_path = '../data/processed_data.csv'
    data.to_csv(output_path, index=False)
    print(f"Preprocessed data saved to {output_path}")
else:
    print("DataFrame is empty, cannot save.")
