
***Step 1: Loading and Initial Inspection***

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

# Load the dataset built into seaborn for easy access
# (In a real scenario, you would load your train.csv)
df = sns.load_dataset('titanic')

# ---- Initial Inspection ----

# 1. View the first few rows to understand what the data looks like
print("--- First 5 Rows ---")
print(df.head())

# 2. View data types and non-null counts
# This shows which columns have missing data and which are categorical vs. numerical.
print("\n--- DataFrame Info ---")
df.info()  # info() prints its report directly and returns None, so it is not wrapped in print()

# 3. Check exact count of missing values per column
print("\n--- Missing Value Count ---")
print(df.isnull().sum())

--- First 5 Rows ---
   survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
0         0       3    male  22.0      1      0   7.2500        S  Third   
1         1       1  female  38.0      1      0  71.2833        C  First   
2         1       3  female  26.0      0      0   7.9250        S  Third   
3         1       1  female  35.0      1      0  53.1000        S  First   
4         0       3    male  35.0      0      0   8.0500        S  Third   

     who  adult_male deck  embark_town alive  alone  
0    man        True  NaN  Southampton    no  False  
1  woman       False    C    Cherbourg   yes  False  
2  woman       False  NaN  Southampton   yes   True  
3  woman       False    C  Southampton   yes  False  
4    man        True  NaN  Southampton    no   True  

--- DataFrame Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------
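
The categorical vs. numerical split that `info()` reveals can also be pulled out programmatically. The following is a minimal sketch on a made-up toy frame (the `toy` data and the variable names are assumptions for illustration, not part of the notebook), using `select_dtypes`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame; the values are made up for illustration.
toy = pd.DataFrame({
    "survived": [0, 1, 1],
    "age": [22.0, np.nan, 26.0],
    "sex": ["male", "female", "female"],
    "embarked": ["S", "C", None],
})

# Columns with a numeric dtype (ints and floats)
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
# Everything else (strings, categories, booleans)
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Non-numeric:", non_numeric_cols)
```

On the real dataset the same two calls would give a starting list of columns to impute numerically and columns to encode.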

***Step 2: Handling Missing Values (Imputation)***

***2a. The 'Cabin' Column (Too much missing data)***

In [2]:
# 'cabin' (Kaggle CSV) has too many missing values to impute reliably, so drop it.
# Note: The seaborn version has no 'cabin' column; its counterpart is 'deck', which is also mostly missing.
cols_to_drop = ['cabin', 'deck']
# Keep only the names that actually exist in this DataFrame
cols_to_drop = [col for col in cols_to_drop if col in df.columns]
df = df.drop(columns=cols_to_drop)

print("Cabin column dropped.")

Cabin column dropped.
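
"Too much missing data" can be made explicit with a cutoff on the fraction of missing values per column. This is only a sketch: the toy frame is made up, and the 0.5 threshold is a hypothetical choice, not a value taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame; the values are made up for illustration.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 7.93, 53.10],
})

MISSING_THRESHOLD = 0.5  # hypothetical cutoff; tune it for your own data

# isnull().mean() gives the fraction of missing entries in each column
missing_fraction = toy.isnull().mean()
sparse_cols = missing_fraction[missing_fraction > MISSING_THRESHOLD].index.tolist()

trimmed = toy.drop(columns=sparse_cols)
print("Dropped:", sparse_cols)
print("Kept:", trimmed.columns.tolist())
```

Here `deck` is 75% missing and is dropped, while `age` (25% missing) is kept for imputation.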


***2b. The 'Embarked' Column (Very little missing data)***

In [3]:
# Find the most common embarkation point (mode)
most_common_embarked = df['embarked'].mode()[0]

print(f"Filling missing 'embarked' values with: {most_common_embarked}")

# Fill missing values
df['embarked'] = df['embarked'].fillna(most_common_embarked)
# The seaborn version also has 'embark_town', which holds the same information spelled out.
# Fill it the same way so the two stay consistent (it is dropped in Step 4).
if 'embark_town' in df.columns:
    df['embark_town'] = df['embark_town'].fillna(df['embark_town'].mode()[0])

Filling missing 'embarked' values with: S
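
One detail worth knowing about `mode()[0]`: `mode()` ignores missing values and returns every value tied for most frequent, in sorted order, so `[0]` picks the first of them. A small sketch on a made-up series (the `ports` data is an assumption for illustration):

```python
import pandas as pd

# Made-up embarkation codes with a deliberate tie between 'S' and 'C'
ports = pd.Series(["S", "C", "S", None, "Q", "C"])

modes = ports.mode()  # missing values are ignored; ties are all returned, sorted
print(modes.tolist())

# [0] takes the first of the tied values
filled = ports.fillna(modes[0])
print(filled.tolist())
```

On the Titanic data there is no tie, so this does not change the result above, but it matters when reusing the pattern on other columns.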


***2c. The 'Age' Column (Smarter Imputation)***

In [4]:
# Group by Pclass and Sex, then transform 'age' by filling NaNs with the group's median
df['age'] = df.groupby(['pclass', 'sex'])['age'].transform(lambda x: x.fillna(x.median()))

# Verification check:
print("\nMissing values after imputation:")
print(df.isnull().sum())
# Every count should now be zero: 'deck' was dropped, and 'embarked', 'embark_town', and 'age' were filled.


Missing values after imputation:
survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    0
alive          0
alone          0
dtype: int64
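
Group-wise median imputation leaves a value missing if its whole group has no known ages, because the group median is then NaN. That does not happen in this dataset, but a fallback to the overall median is a cheap safeguard. A minimal sketch on made-up data (the `toy` frame and the `age_filled` column name are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Made-up data: the (2, male) group has no known age at all
toy = pd.DataFrame({
    "pclass": [1, 1, 1, 3, 3, 2],
    "sex": ["female", "female", "female", "male", "male", "male"],
    "age": [30.0, 40.0, np.nan, 20.0, np.nan, np.nan],
})

# Median age of each (pclass, sex) group, aligned row by row with the frame
group_median = toy.groupby(["pclass", "sex"])["age"].transform("median")
# Fallback for groups whose median is itself NaN
overall_median = toy["age"].median()

toy["age_filled"] = toy["age"].fillna(group_median).fillna(overall_median)
print(toy)
```

The first `fillna` uses the group median where one exists; the second only touches rows that are still missing afterwards.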


***Step 3: Feature Engineering (Creating New Features)***

In [5]:
# Create FamilySize: siblings/spouses (sibsp) + parents/children (parch) + the passenger themselves (+1)
df['FamilySize'] = df['sibsp'] + df['parch'] + 1

# Optional: a simpler binary feature, 1 when the passenger travels alone (FamilySize == 1)
df['IsAlone'] = (df['FamilySize'] == 1).astype(int)

print("New features 'FamilySize' and 'IsAlone' created.")
print(df[['sibsp', 'parch', 'FamilySize', 'IsAlone']].head())

New features 'FamilySize' and 'IsAlone' created.
   sibsp  parch  FamilySize  IsAlone
0      1      0           2        0
1      1      0           2        0
2      0      0           1        1
3      1      0           2        0
4      0      0           1        1
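
A quick way to check whether a new feature carries any signal is to compare the target's mean across its values. The sketch below does this on a made-up toy frame (the data is an assumption for illustration, so the rates it prints say nothing about the real Titanic passengers):

```python
import pandas as pd

# Made-up passengers, for illustration only
toy = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1, 0],
    "sibsp": [1, 1, 0, 0, 2, 0],
    "parch": [0, 0, 0, 0, 1, 0],
})

toy["FamilySize"] = toy["sibsp"] + toy["parch"] + 1
toy["IsAlone"] = (toy["FamilySize"] == 1).astype(int)

# Mean of a 0/1 target per group is the survival rate of that group
survival_by_alone = toy.groupby("IsAlone")["survived"].mean()
print(survival_by_alone)
```

Running the same `groupby` on `df` would show whether `IsAlone` separates survivors from non-survivors in the actual data.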


***Step 4: Data Transformation and Encoding***

***4a. Dropping Redundant/Useless Features***

In [6]:
# Columns that don't add predictive value for survival in their current form
# Note: 'passengerid', 'name', and 'ticket' exist in the Kaggle CSV but not in the seaborn dataset; they are listed for completeness.
useless_cols = ['passengerid', 'name', 'ticket']
redundant_seaborn_cols = ['class', 'who', 'adult_male', 'embark_town', 'alive', 'alone']

# Combine and filter for columns actually present
cols_to_drop_final = [col for col in (useless_cols + redundant_seaborn_cols) if col in df.columns]

df = df.drop(columns=cols_to_drop_final)

print("Redundant columns dropped.")
print(df.columns.tolist())
# Remaining columns should be roughly: ['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked', 'FamilySize', 'IsAlone']

Redundant columns dropped.
['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked', 'FamilySize', 'IsAlone']
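
As an alternative to filtering the list by hand, `DataFrame.drop` accepts `errors='ignore'`, which skips any names that are not present. A minimal sketch on a made-up frame (the `toy` data and `candidates` list are assumptions for illustration):

```python
import pandas as pd

# Made-up frame containing only some of the candidate columns
toy = pd.DataFrame({
    "survived": [0, 1],
    "class": ["Third", "First"],
    "alive": ["no", "yes"],
    "fare": [7.25, 71.28],
})

candidates = ["passengerid", "name", "ticket", "class", "alive"]

# errors="ignore" silently skips labels that do not exist in the frame
trimmed = toy.drop(columns=candidates, errors="ignore")
print(trimmed.columns.tolist())
```

The explicit filter used above has one advantage: `cols_to_drop_final` records exactly which columns were removed, which is useful for logging.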


***4b. Encoding Categorical Variables (One-Hot Encoding)***
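
Before applying it to the full frame, the effect of `drop_first=True` is easiest to see on a single column. This is a sketch on made-up data (the `toy` frame is an assumption for illustration); `dtype=int` is an optional extra that yields 0/1 columns instead of booleans.

```python
import pandas as pd

# Made-up embarkation codes
toy = pd.DataFrame({"embarked": ["S", "C", "Q", "S"]})

# One indicator column per category
full = pd.get_dummies(toy, columns=["embarked"])
print(full.columns.tolist())

# drop_first=True removes the first category ('C'); a row of all zeros now means 'C'
reduced = pd.get_dummies(toy, columns=["embarked"], drop_first=True, dtype=int)
print(reduced)
```

Dropping one category avoids perfectly collinear indicator columns, which matters mainly for linear models.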

In [7]:
# Identify categorical columns for encoding
categorical_cols = ['sex', 'embarked']

# Apply One-Hot Encoding
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

print("\nData after Encoding (Final State):")
print(df_encoded.head())
print("\nFinal Data Types:")
df_encoded.info()  # info() prints its report directly and returns None


Data after Encoding (Final State):
   survived  pclass   age  sibsp  parch     fare  FamilySize  IsAlone  \
0         0       3  22.0      1      0   7.2500           2        0   
1         1       1  38.0      1      0  71.2833           2        0   
2         1       3  26.0      0      0   7.9250           1        1   
3         1       1  35.0      1      0  53.1000           2        0   
4         0       3  35.0      0      0   8.0500           1        1   

   sex_male  embarked_Q  embarked_S  
0      True       False        True  
1     False       False       False  
2     False       False        True  
3     False       False        True  
4      True       False        True  

Final Data Types:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   survived    891 non-null    int64  
 1   pclass      891 non-null    int64  
 2   age 