<a href="https://colab.research.google.com/github/vaibhavb/zero-to-one-datascience-to-ai/blob/main/003_Data_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Kaggle and Data Setup

In [1]:
# IMPORTANT: SOME KAGGLE DATA SOURCES ARE PRIVATE
# RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES.
import kagglehub
kagglehub.login()



Kaggle credentials set.
Kaggle credentials successfully validated.


In [2]:
titanic_path = kagglehub.competition_download('titanic')

print('Data source import complete.')

Downloading from https://www.kaggle.com/api/v1/competitions/data/download-all/titanic...


100%|██████████| 34.1k/34.1k [00:00<00:00, 30.6MB/s]

Extracting files...
Data source import complete.





# `df_processed`
- Handle missing data in `Age`, `Cabin`, and `Embarked`.
- Create new features `FamilySize` and `IsAlone` (this step is not included in the cell below).

In [4]:
import pandas as pd

train_df = pd.read_csv(titanic_path + "/train.csv")
df_processed = train_df.copy()  # work on a copy so the raw data stays intact

# Drop Cabin rather than imputing it
df_processed.drop(columns='Cabin', inplace=True)

# Fill missing Age values with the median
median_age = df_processed['Age'].median()
df_processed.fillna({'Age': median_age}, inplace=True)

# Fill missing Embarked values with the most frequent port;
# mode() returns a Series, so take its first entry
mode_embarked = df_processed['Embarked'].mode()[0]
df_processed.fillna({'Embarked': mode_embarked}, inplace=True)
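
The bullets above mention `FamilySize` and `IsAlone`, but the cell above does not compute them. Here is a minimal sketch of one common way to derive them, assuming `FamilySize` counts the passenger plus `SibSp` and `Parch`, and `IsAlone` flags a family of one. The `toy` DataFrame and its values are hypothetical stand-ins, not rows from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the two Titanic columns used here
toy = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Assumed definition: siblings/spouses + parents/children + the passenger
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# Assumed definition: 1 when the passenger has no relatives aboard
toy["IsAlone"] = (toy["FamilySize"] == 1).astype(int)

print(toy)
```

Applying the same two assignments to `df_processed` would have to happen before `SibSp` and `Parch` are dropped later on.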


# Type Conversions
Machine learning algorithms typically require numerical input, so we need to convert categorical (text-based) features into numbers.

In [5]:
print(df_processed.dtypes)

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Embarked        object
dtype: object


## Techniques for Conversions
* **Label Encoding:** Assigns each category a unique integer (e.g., C → 0, Q → 1, S → 2).
  * *Pros:* Simple.
  * *Cons:* Can inadvertently introduce an ordinal relationship where none exists (e.g., a model might think category 2 is "greater" than category 1). Usually not ideal for nominal categories (like `Sex` or `Embarked`) when using linear or distance-based models.

* **One-Hot Encoding (Dummy Variables):** Creates new binary (0 or 1) columns for each category. For example, if `Embarked` has values S, C, Q, one-hot encoding creates three new columns: `Embarked_S`, `Embarked_C`, `Embarked_Q`. A passenger from port S would have 1 in `Embarked_S` and 0 in the others.
  * *Pros:* No ordinal relationship is implied. Generally better for nominal data with most algorithms.
  * *Cons:* Can increase the number of features (dimensionality), especially if a category has many unique values.
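
To make the contrast concrete, here is a small sketch that applies both encodings to a toy `Embarked` column. The `ports` DataFrame is a hypothetical example, and label encoding is done here with pandas category codes as one possible approach.

```python
import pandas as pd

# Hypothetical toy column with the three port codes
ports = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Label encoding: one integer per category (codes follow sorted category order)
label_codes = ports["Embarked"].astype("category").cat.codes
print(label_codes.tolist())

# One-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(ports, columns=["Embarked"], prefix="Embarked", dtype=int)
print(one_hot)
```

The integer codes imply an order (C < Q < S) that has no real meaning, whereas each one-hot row simply marks the single port that applies.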

### One-Hot Encoding with Pandas

Pandas has a convenient function `pd.get_dummies()` for this.

In [7]:
# Let's work on a new copy for this stage, or continue with df_processed
df_model_ready = df_processed.copy()

# One-Hot Encode 'Sex'
# drop_first=True is often used to avoid multicollinearity (dummy variable trap).
# If you have k categories, drop_first=True creates k-1 dummy variables.
# The dropped category is implicitly represented when all other dummies are 0.
if 'Sex' in df_model_ready.columns:
    df_model_ready = pd.get_dummies(df_model_ready, columns=['Sex'], prefix='Sex', drop_first=True)
    print("\nDataFrame after One-Hot Encoding 'Sex'.")
else:
    print("\n'Sex' column not found for One-Hot Encoding.")

# One-Hot Encode 'Embarked'
if 'Embarked' in df_model_ready.columns:
    df_model_ready = pd.get_dummies(df_model_ready, columns=['Embarked'], prefix='Embarked', drop_first=True)
    print("DataFrame after One-Hot Encoding 'Embarked'.")
else:
    print("\n'Embarked' column not found for One-Hot Encoding.")

print("\nDataFrame after One-Hot Encoding (head):")
print(df_model_ready.head())
print("\nNew columns created (if any):")
print(df_model_ready.columns)


DataFrame after One-Hot Encoding 'Sex'.
DataFrame after One-Hot Encoding 'Embarked'.

DataFrame after One-Hot Encoding (head):
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name   Age  SibSp  Parch  \
0                            Braund, Mr. Owen Harris  22.0      1      0   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  38.0      1      0   
2                             Heikkinen, Miss. Laina  26.0      0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  35.0      1      0   
4                           Allen, Mr. William Henry  35.0      0      0   

             Ticket     Fare  Sex_male  Embarked_Q  Embarked_S  
0         A/5 21171   7.2500      True       False        True  
1          PC 17599  71.2833     False       False       False  
2  

## Drop Unnecessary Columns
* `Name`: We haven't extracted titles yet, so the raw name is hard for most models to use.
* `Ticket`: Ticket numbers are often unique or require complex parsing.
* `PassengerId`: Just an identifier.
* `SibSp` and `Parch`: Engineered features such as `FamilySize` and `IsAlone` can capture this information more compactly, so the originals are dropped here to avoid redundancy. This only makes sense if those engineered columns actually exist in `df_model_ready`. The preprocessing cell above does not create them, so in this run dropping `SibSp` and `Parch` removes the family information altogether.

In [8]:
columns_to_drop_for_model = ['Name', 'Ticket', 'PassengerId', 'SibSp', 'Parch']
# Ensure columns exist before attempting to drop, to avoid errors
columns_present_to_drop = [col for col in columns_to_drop_for_model if col in df_model_ready.columns]

if columns_present_to_drop:
    # `columns=` already targets columns, so `axis=1` is not needed
    df_model_ready.drop(columns=columns_present_to_drop, inplace=True)
    print(f"\nDropped columns for modeling: {columns_present_to_drop}")
else:
    print(f"\nNone of the columns intended for dropping were found: {columns_to_drop_for_model}")

print("\nDataFrame ready for scaling (head):")
print(df_model_ready.head())
print("\nFinal columns for the model (before scaling):")
print(df_model_ready.columns)


Dropped columns for modeling: ['Name', 'Ticket', 'PassengerId', 'SibSp', 'Parch']

DataFrame ready for scaling (head):
   Survived  Pclass   Age     Fare  Sex_male  Embarked_Q  Embarked_S
0         0       3  22.0   7.2500      True       False        True
1         1       1  38.0  71.2833     False       False       False
2         1       3  26.0   7.9250     False       False        True
3         1       1  35.0  53.1000     False       False        True
4         0       3  35.0   8.0500      True       False        True

Final columns for the model (before scaling):
Index(['Survived', 'Pclass', 'Age', 'Fare', 'Sex_male', 'Embarked_Q',
       'Embarked_S'],
      dtype='object')


# Feature Scaling
Many machine learning algorithms perform better when numerical input features are on a similar scale. This is especially true for algorithms that compute
distances (like k-NN) or use gradient descent (like Logistic Regression, SVMs, Neural Networks).
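
As a rough illustration of the distance argument, the sketch below compares two passengers described by (`Age`, `Fare`), using the values of the first two rows shown earlier. The per-feature spreads (13 and 50) are assumed round numbers chosen for illustration, not fitted statistics.

```python
import math

# (Age, Fare) for two passengers
a = (22.0, 7.25)
b = (38.0, 71.2833)

# Raw Euclidean distance: the Fare gap (about 64) swamps the Age gap (16)
raw = math.dist(a, b)

# Divide each feature by an assumed spread so both are on a similar scale
spread = (13.0, 50.0)
scaled = math.dist(
    [x / s for x, s in zip(a, spread)],
    [x / s for x, s in zip(b, spread)],
)

print(round(raw, 2), round(scaled, 2))
```

After rescaling, the age difference and the fare difference contribute comparably to the distance instead of `Fare` dominating it.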

From `df_model_ready.columns`, the numerical features to scale are `Age`, `Fare`, and `Pclass` (treated as numerical); `FamilySize` would join them if it had been created. The one-hot encoded columns are already 0/1, as `IsAlone` would be. `Survived` is the target, not a feature to scale.

In [10]:
import numpy as np
# numerical_cols_for_scaling = ['Age', 'Fare', 'Pclass', 'FamilySize'] # Example from before
# For robustness, dynamically identify numerical columns that are not binary (0/1) or the target

if 'df_model_ready' in locals() and not df_model_ready.empty:
    potential_numerical_cols = df_model_ready.select_dtypes(include=np.number).columns.tolist()
    # Exclude target and already binary/dummy columns
    cols_to_exclude_from_scaling = ['Survived'] # Always exclude target
    for col in df_model_ready.columns:
        if df_model_ready[col].nunique() <= 2: # If column is binary (like dummies, IsAlone)
            cols_to_exclude_from_scaling.append(col)

    numerical_cols_for_scaling = [col for col in potential_numerical_cols if col not in cols_to_exclude_from_scaling]
    numerical_cols_for_scaling = [col for col in numerical_cols_for_scaling if col in df_model_ready.columns] # Ensure they still exist
else: # Fallback if df_model_ready is not defined
    numerical_cols_for_scaling = ['Age', 'Fare', 'Pclass', 'FamilySize']
    # This fallback list requires manual adjustment if columns were named differently or dropped.
    print("--- Using a predefined list for numerical_cols_for_scaling as df_model_ready was not found. ---")


print(f"\nNumerical columns identified for scaling: {numerical_cols_for_scaling}")
if numerical_cols_for_scaling and 'df_model_ready' in locals() and not df_model_ready.empty:
    # Ensure all listed columns are actually in the DataFrame before calling describe
    valid_cols_for_describe = [col for col in numerical_cols_for_scaling if col in df_model_ready.columns]
    if valid_cols_for_describe:
        print(df_model_ready[valid_cols_for_describe].describe())
    else:
        print("None of the identified numerical columns for scaling are present in df_model_ready.")
elif not numerical_cols_for_scaling:
    print("No numerical columns were identified for scaling.")


Numerical columns identified for scaling: ['Pclass', 'Age', 'Fare']
           Pclass         Age        Fare
count  891.000000  891.000000  891.000000
mean     2.308642   29.361582   32.204208
std      0.836071   13.019697   49.693429
min      1.000000    0.420000    0.000000
25%      2.000000   22.000000    7.910400
50%      3.000000   28.000000   14.454200
75%      3.000000   35.000000   31.000000
max      3.000000   80.000000  512.329200


## Techniques for Scaling
* **Standardization (Z-score Normalization):** Transforms data to have a mean of 0 and a standard deviation of 1.
  * *Formula:* $X_{scaled} = (X - \mu) / \sigma$
  * *Tool:* `sklearn.preprocessing.StandardScaler`
* **Min-Max Scaling (Normalization):** Rescales data to a fixed range, usually 0 to 1.
  * *Formula:* $X_{scaled} = (X - X_{min}) / (X_{max} - X_{min})$
  * *Tool:* `sklearn.preprocessing.MinMaxScaler`
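
Both formulas can be checked by hand. The sketch below applies them in plain Python to the first five `Age` values shown earlier; the variable names are illustrative, and the population standard deviation is used because that is what `StandardScaler` computes.

```python
import statistics

ages = [22.0, 38.0, 26.0, 35.0, 35.0]  # first five Age values shown earlier

mu = statistics.fmean(ages)
sigma = statistics.pstdev(ages)  # population std (divide by n)

# Standardization: (X - mu) / sigma
z_scores = [(x - mu) / sigma for x in ages]

# Min-max scaling: (X - X_min) / (X_max - X_min)
min_max = [(x - min(ages)) / (max(ages) - min(ages)) for x in ages]

print([round(z, 3) for z in z_scores])
print([round(m, 3) for m in min_max])
```

Note that pandas' `describe()` reports the sample standard deviation (divide by n - 1), so a standardized column with 891 rows shows a std of about 1.000562 rather than exactly 1.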

### Applying Standardization
**Important:** Fit the scaler **only** on the training data, then use that *same* fitted scaler to transform both the training and the validation/test data. This prevents "data leakage" from the validation/test set into the training process. Ideally, split first, then scale. For simplicity in this lesson flow, we scale `df_model_ready` before splitting, treating it as our *potential* full training data, but keep this principle in mind.
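
For reference, a minimal sketch of the split-first, scale-second order looks like the following. It runs on synthetic data, and the names `X_demo`, `y_demo`, `X_tr`, and `X_va` are hypothetical, so treat it as an outline under those assumptions rather than the lesson's actual pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two numeric features and a binary target
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[30.0, 32.0], scale=[13.0, 50.0], size=(100, 2))
y_demo = np.array([0, 1] * 50)

# Split first, so validation rows never influence the scaler
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean/std from training rows only
X_va_scaled = scaler.transform(X_va)      # reuse the training statistics

print(X_tr_scaled.mean(axis=0))  # essentially 0 by construction
print(X_va_scaled.mean(axis=0))  # near 0, but generally not exactly 0
```

The validation means are not exactly zero because the scaler never saw those rows, which is the point of fitting on the training split alone.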

In [11]:
from sklearn.preprocessing import StandardScaler

if numerical_cols_for_scaling and 'df_model_ready' in locals() and not df_model_ready.empty: # Proceed only if there are columns to scale
    # Ensure all identified columns are actually present
    valid_cols_to_scale = [col for col in numerical_cols_for_scaling if col in df_model_ready.columns]
    if valid_cols_to_scale:
        scaler = StandardScaler()
        # Fit the scaler and transform the numerical columns
        df_model_ready[valid_cols_to_scale] = scaler.fit_transform(df_model_ready[valid_cols_to_scale])

        print("\nDataFrame after Scaling numerical features (head):")
        print(df_model_ready[valid_cols_to_scale].head())
        print("\nDescriptive stats of scaled features (should have mean ~0, std ~1):")
        print(df_model_ready[valid_cols_to_scale].describe())
    else:
        print("\nNone of the identified numerical columns for scaling were found in df_model_ready.")
else:
    print("\nNo numerical columns were identified or available for scaling, or df_model_ready is not defined.")


DataFrame after Scaling numerical features (head):
     Pclass       Age      Fare
0  0.827377 -0.565736 -0.502445
1 -1.566107  0.663861  0.786845
2  0.827377 -0.258337 -0.488854
3 -1.566107  0.433312  0.420730
4  0.827377  0.433312 -0.486337

Descriptive stats of scaled features (should have mean ~0, std ~1):
             Pclass           Age          Fare
count  8.910000e+02  8.910000e+02  8.910000e+02
mean  -8.772133e-17  2.272780e-16  3.987333e-18
std    1.000562e+00  1.000562e+00  1.000562e+00
min   -1.566107e+00 -2.224156e+00 -6.484217e-01
25%   -3.693648e-01 -5.657365e-01 -4.891482e-01
50%    8.273772e-01 -1.046374e-01 -3.573909e-01
75%    8.273772e-01  4.333115e-01 -2.424635e-02
max    8.273772e-01  3.891554e+00  9.667167e+00


# Splitting Data into Training and Validation Sets
To evaluate our model's performance on unseen data, we need to split our dataset.

* **Why Split?**
    * **Training Set:** Used to train the machine learning model (i.e., learn the parameters).
    * **Validation Set (or Development Set):** Used to tune hyperparameters and make decisions about the model (e.g., feature selection, model architecture). It estimates how the model performs on data it wasn't trained on, though that estimate becomes optimistic once it has been used repeatedly for tuning.
    * **Test Set (from `test.csv`):** Used for a final evaluation of the chosen model *after* all training and tuning are complete. Kaggle's `test.csv` has no `Survived` labels, so it is scored by submitting predictions; we won't touch it until the very end.

* **Define Features (X) and Target (y):**
    * `X`: All columns in `df_model_ready` except `Survived`. These are our input features.
    * `y`: The `Survived` column. This is what we want to predict.

In [12]:
# (Assuming df_model_ready is available)
if 'df_model_ready' in locals() and not df_model_ready.empty and 'Survived' in df_model_ready.columns:
    X = df_model_ready.drop('Survived', axis=1)
    y = df_model_ready['Survived']

    print("\nFeatures (X) for the model (head):")
    print(X.head())
    print("\nTarget (y) for the model (head):")
    print(y.head())
    print(f"\nShape of X: {X.shape}, Shape of y: {y.shape}")
else:
    print("\n'Survived' column not found in df_model_ready or df_model_ready is empty. Cannot define X and y.")
    # Create dummy X and y if 'Survived' is missing for notebook continuity
    if 'df_model_ready' in locals() and not df_model_ready.empty:
        # Select all columns except a potential target if 'Survived' is missing
        potential_features = [col for col in df_model_ready.columns if col not in ['Survived']] # Example exclusion
        if potential_features:
            X = df_model_ready[potential_features]
            y = pd.Series(np.random.randint(0,2,size=len(df_model_ready)), name='Survived') # Dummy target
            print("--- Created dummy X and y as 'Survived' was not found for split. ---")
        else:
            X, y = pd.DataFrame(), pd.Series(name='Survived') # Empty placeholders
            print("--- df_model_ready is empty or has no suitable columns for X. ---")
    else: # If df_model_ready itself is not defined
        X, y = pd.DataFrame(), pd.Series(name='Survived') # Empty placeholders
        print("--- df_model_ready is not defined. Cannot define X and y. ---")


Features (X) for the model (head):
     Pclass       Age      Fare  Sex_male  Embarked_Q  Embarked_S
0  0.827377 -0.565736 -0.502445      True       False        True
1 -1.566107  0.663861  0.786845     False       False       False
2  0.827377 -0.258337 -0.488854     False       False        True
3 -1.566107  0.433312  0.420730     False       False        True
4  0.827377  0.433312 -0.486337      True       False        True

Target (y) for the model (head):
0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

Shape of X: (891, 6), Shape of y: (891,)


## Performing the Split
We use `sklearn.model_selection.train_test_split`:

In [13]:
from sklearn.model_selection import train_test_split

if 'X' in locals() and 'y' in locals() and not X.empty and not y.empty: # Proceed only if X and y are defined
    # test_size: proportion of the dataset to include in the validation split (e.g., 0.2 for 20%)
    # random_state: ensures the split is the same every time you run the code (for reproducibility)
    # stratify=y: recommended for classification tasks. It ensures that the proportion of the target
    #             variable's classes is approximately the same in both training and validation sets.
    X_train, X_val, y_train, y_val = train_test_split(
        X, y,
        test_size=0.2,    # 20% for validation, 80% for training
        random_state=42,  # The answer to life, the universe, and everything - for consistency
        stratify=y        # Important for classification if y has more than 1 class and is not empty
    )

    print("\nShapes of the resulting data splits:")
    print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
    print(f"X_val shape: {X_val.shape}, y_val shape: {y_val.shape}")

    if not y.empty and y.nunique() > 1: # Stratification makes sense if there are classes to stratify
        print("\nProportion of target classes in original y:")
        print(y.value_counts(normalize=True))
        print("\nProportion of target classes in y_train:")
        print(y_train.value_counts(normalize=True))
        print("\nProportion of target classes in y_val (should be similar to y_train due to stratify):")
        print(y_val.value_counts(normalize=True))
    elif not y.empty:
        print("\nTarget variable 'y' has only one class, so stratification is not meaningful.")
        print(y.value_counts(normalize=True))

else:
    print("\nCannot perform train-test split as X or y is empty/undefined.")
    # Define empty placeholders if split failed, for subsequent lessons to run
    X_train, X_val, y_train, y_val = pd.DataFrame(), pd.DataFrame(), pd.Series(dtype='int'), pd.Series(dtype='int')


Shapes of the resulting data splits:
X_train shape: (712, 6), y_train shape: (712,)
X_val shape: (179, 6), y_val shape: (179,)

Proportion of target classes in original y:
Survived
0    0.616162
1    0.383838
Name: proportion, dtype: float64

Proportion of target classes in y_train:
Survived
0    0.616573
1    0.383427
Name: proportion, dtype: float64

Proportion of target classes in y_val (should be similar to y_train due to stratify):
Survived
0    0.614525
1    0.385475
Name: proportion, dtype: float64
