# Task 2: Data Preprocessing for Machine Learning – AI Bootcamp

Download Titanic Dataset here: https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv

#### About this file

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

## Section 1: Data Loading & Exploration

### **Task 1**: Load and Inspect a Dataset

*Instruction*: Load the `titanic.csv` dataset and display the first 5 rows. Show basic info and describe statistics of the dataset.

In [None]:
import pandas as pd

df = pd.read_csv('titanic.csv')
print(df.head())
print(df.info())
print(df.describe())

   Survived  Pclass                                               Name  \
0         0       3                             Mr. Owen Harris Braund   
1         1       1  Mrs. John Bradley (Florence Briggs Thayer) Cum...   
2         1       3                              Miss. Laina Heikkinen   
3         1       1        Mrs. Jacques Heath (Lily May Peel) Futrelle   
4         0       3                            Mr. William Henry Allen   

      Sex   Age  Siblings/Spouses Aboard  Parents/Children Aboard     Fare  
0    male  22.0                        1                        0   7.2500  
1  female  38.0                        1                        0  71.2833  
2  female  26.0                        0                        0   7.9250  
3  female  35.0                        1                        0  53.1000  
4    male  35.0                        0                        0   8.0500  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 887 entries, 0 to 886
Data columns (total 8

## Section 2: Handling Missing Values

### **Task 2**: Identify and Handle Missing Data

*Instruction*:



*   Display the number of missing values per column.
*   Fill missing `Age` values with the median.
*   Drop the second row in the dataset.



In [2]:
import pandas as pd

# Reload the dataset (Colab path; adjust it to wherever titanic.csv is stored):
df = pd.read_csv('/content/titanic.csv')

# Work on a copy so the original DataFrame stays untouched
df_processed = df.copy()

# 1. Display the number of missing values per column:
print(df_processed.isnull().sum())

# 2. Fill missing 'Age' values with the median.
#    Assign the result back; fillna(..., inplace=True) on a selected column is deprecated.
df_processed['Age'] = df_processed['Age'].fillna(df_processed['Age'].median())

# 3. Drop the second row (position 1) in the dataset:
df_processed = df_processed.drop(df_processed.index[1])
print(df_processed.head())



Survived                   0
Pclass                     0
Name                       0
Sex                        0
Age                        0
Siblings/Spouses Aboard    0
Parents/Children Aboard    0
Fare                       0
dtype: int64
   Survived  Pclass                                         Name     Sex  \
0         0       3                       Mr. Owen Harris Braund    male   
2         1       3                        Miss. Laina Heikkinen  female   
3         1       1  Mrs. Jacques Heath (Lily May Peel) Futrelle  female   
4         0       3                      Mr. William Henry Allen    male   
5         0       3                              Mr. James Moran    male   

    Age  Siblings/Spouses Aboard  Parents/Children Aboard     Fare  
0  22.0                        1                        0   7.2500  
2  26.0                        0                        0   7.9250  
3  35.0                        1                        0  53.1000  
4  35.0               

Note: pandas emits a `FutureWarning` when the fill is written as `df_processed['Age'].fillna(df_processed['Age'].median(), inplace=True)`. The selected column always behaves as a copy, so this in-place call will stop working in pandas 3.0. Use `df.fillna({col: value}, inplace=True)` or `df[col] = df[col].fillna(value)` to modify the original DataFrame instead.

The missing-value counts above are all 0 for this version of the dataset, so the median fill leaves `Age` unchanged here.
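
The three steps of this task can be sketched on a tiny made-up DataFrame where a value is actually missing. This is a minimal illustration, not the notebook's own run: the `toy` frame and its values are hypothetical, and only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for two Titanic columns (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# 1. Count missing values per column before imputing.
missing_before = toy.isnull().sum()

# 2. Fill by assigning the result back, rather than using inplace=True on a selection.
age_median = toy["Age"].median()  # the median skips NaN by default
toy["Age"] = toy["Age"].fillna(age_median)

# 3. Drop the second row by position: index[1] looks up that row's label first.
toy = toy.drop(toy.index[1])

print(missing_before)
print(toy)
```

Dropping through `toy.index[1]` keeps working when the index labels are not the default 0, 1, 2, ..., which is why it is preferred here over passing the label `1` directly.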


## Section 3: Encoding Categorical Features

### **Task 3**: Convert Categorical to Numeric

*Instruction*: Convert `Sex` and `Pclass` columns to numeric using:


*   Label Encoding for `Sex`
*   One-Hot Encoding for `Pclass`



In [3]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Assuming df_processed is your DataFrame from previous steps

# 1. Label Encoding for 'Sex'
le = LabelEncoder()
df_processed['Sex_encoded'] = le.fit_transform(df_processed['Sex'])

# 2. One-Hot Encoding for 'Pclass'
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), ['Pclass'])], remainder='passthrough')
df_encoded = pd.DataFrame(ct.fit_transform(df_processed))

# Rebuild the column names: the one-hot columns come first, followed by the
# passthrough columns in their original order (every column except 'Pclass')
feature_names = list(ct.named_transformers_['encoder'].get_feature_names_out(['Pclass'])) + list(df_processed.columns.drop(['Pclass']))
df_encoded.columns = feature_names

# Display the updated DataFrame
print(df_encoded.head())


  Pclass_1 Pclass_2 Pclass_3 Survived  \
0      0.0      0.0      1.0        0   
1      0.0      0.0      1.0        1   
2      1.0      0.0      0.0        1   
3      0.0      0.0      1.0        0   
4      0.0      0.0      1.0        0   

                                          Name     Sex   Age  \
0                       Mr. Owen Harris Braund    male  22.0   
1                        Miss. Laina Heikkinen  female  26.0   
2  Mrs. Jacques Heath (Lily May Peel) Futrelle  female  35.0   
3                      Mr. William Henry Allen    male  35.0   
4                              Mr. James Moran    male  27.0   

  Siblings/Spouses Aboard Parents/Children Aboard    Fare Sex_encoded  
0                       1                       0    7.25           1  
1                       0                       0   7.925           0  
2                       1                       0    53.1           0  
3                       0                       0    8.05           1  
4       
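
As a possible alternative to the `ColumnTransformer` route, the same two encodings can be sketched with `LabelEncoder` plus `pandas.get_dummies`, which keeps the original index and the dtypes of the untouched columns. The `toy` frame below is hypothetical, and this is one option under those assumptions rather than the notebook's method.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data using the same column names as the dataset.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Pclass": [3, 1, 3, 2],
})

# Label encoding: classes are sorted, so 'female' maps to 0 and 'male' to 1.
le = LabelEncoder()
toy["Sex_encoded"] = le.fit_transform(toy["Sex"])
sex_mapping = {str(label): code for code, label in enumerate(le.classes_)}

# One-hot encoding: one column per Pclass value, appended after the other columns.
toy_encoded = pd.get_dummies(toy, columns=["Pclass"], prefix="Pclass", dtype=float)

print(sex_mapping)
print(toy_encoded)
```

Inspecting `le.classes_` is a quick way to confirm which integer each category received before relying on the encoded column.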

## Section 4: Feature Scaling

### **Task 4**: Scale Numerical Features

*Instruction*: Use `StandardScaler` to scale the `Age` and `Fare` columns.

In [4]:
from sklearn.preprocessing import StandardScaler

# Assuming 'df_encoded' is your DataFrame and you want to scale 'Age' and 'Fare'
scaler = StandardScaler()
df_encoded[['Age', 'Fare']] = scaler.fit_transform(df_encoded[['Age', 'Fare']])
print(df_encoded.head())


  Pclass_1 Pclass_2 Pclass_3 Survived  \
0      0.0      0.0      1.0        0   
1      0.0      0.0      1.0        1   
2      1.0      0.0      0.0        1   
3      0.0      0.0      1.0        0   
4      0.0      0.0      1.0        0   

                                          Name     Sex       Age  \
0                       Mr. Owen Harris Braund    male -0.528495   
1                        Miss. Laina Heikkinen  female -0.245189   
2  Mrs. Jacques Heath (Lily May Peel) Futrelle  female  0.392250   
3                      Mr. William Henry Allen    male  0.392250   
4                              Mr. James Moran    male -0.174362   

  Siblings/Spouses Aboard Parents/Children Aboard      Fare Sex_encoded  
0                       1                       0 -0.502593           1  
1                       0                       0 -0.489029           0  
2                       1                       0  0.418741           0  
3                       0                       
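
To make the scaled values easier to interpret, the sketch below compares `StandardScaler` with the manual formula (value minus column mean, divided by the population standard deviation). The four rows are a hypothetical sample, so the numbers will not match the full-dataset output above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical four-row sample; the real columns have many more rows.
toy = pd.DataFrame({
    "Age": [22.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 7.925, 53.1, 8.05],
})

# Scale with scikit-learn.
scaler = StandardScaler()
scaled = scaler.fit_transform(toy[["Age", "Fare"]])

# Manual z-score. StandardScaler uses the population standard deviation,
# so ddof=0 is passed explicitly (pandas defaults to ddof=1).
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(scaled)
print(np.allclose(scaled, manual.to_numpy()))
```

After scaling, each column has mean 0 and standard deviation 1, which is why the transformed `Age` and `Fare` values are centered on zero.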

## Section 5: Preprocessing Pipeline

### **Task 5**: Build Preprocessing Pipeline

*Instruction*: Using `ColumnTransformer` and `Pipeline` from `sklearn`, build a pipeline that:



*   Imputes missing values
*   Scales numeric data
*   Encodes categorical data



In [5]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

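# Define categorical and numerical features.
# This version of the dataset has no 'Embarked' column, so 'Pclass' is used
# as the second categorical feature instead.
categorical_features = ['Sex', 'Pclass']
numerical_features = ['Age', 'Fare']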

# Create pipelines for numerical and categorical features
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])

categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(sparse_output=False, handle_unknown='ignore')),
])

# Combine pipelines using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, numerical_features),
        ('cat', categorical_pipeline, categorical_features),
    ])

# To apply the preprocessing, pass a DataFrame 'X' that contains the raw feature columns:
# X_processed = preprocessor.fit_transform(X)

# The preprocessor is only defined in this cell, not applied, so df_encoded is unchanged from Task 4
print(df_encoded.head())


  Pclass_1 Pclass_2 Pclass_3 Survived  \
0      0.0      0.0      1.0        0   
1      0.0      0.0      1.0        1   
2      1.0      0.0      0.0        1   
3      0.0      0.0      1.0        0   
4      0.0      0.0      1.0        0   

                                          Name     Sex       Age  \
0                       Mr. Owen Harris Braund    male -0.528495   
1                        Miss. Laina Heikkinen  female -0.245189   
2  Mrs. Jacques Heath (Lily May Peel) Futrelle  female  0.392250   
3                      Mr. William Henry Allen    male  0.392250   
4                              Mr. James Moran    male -0.174362   

  Siblings/Spouses Aboard Parents/Children Aboard      Fare Sex_encoded  
0                       1                       0 -0.502593           1  
1                       0                       0 -0.489029           0  
2                       1                       0  0.418741           0  
3                       0                       
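
The cell above only defines the preprocessor. A minimal sketch of actually applying such a pipeline follows, assuming a hypothetical feature frame `X` with made-up values and deliberate gaps. `sparse_threshold=0` is used here to force a dense result, in place of the encoder's `sparse_output=False` setting.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical raw features with missing values in every column.
X = pd.DataFrame({
    "Sex": ["male", "female", np.nan, "female"],
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, np.nan, 53.10],
})

numerical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

# sparse_threshold=0 makes the combined output a dense NumPy array.
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numerical_pipeline, ["Age", "Fare"]),
        ("cat", categorical_pipeline, ["Sex"]),
    ],
    sparse_threshold=0,
)

# Fit on X and transform it: 2 scaled numeric columns + 2 one-hot columns.
X_processed = preprocessor.fit_transform(X)
print(X_processed.shape)
print(X_processed)
```

In a modeling workflow the preprocessor would normally be fitted on the training split only and then reused with `transform` on the test split, so that imputation and scaling statistics do not leak from test data.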

## Section 6: Feature Engineering

### **Task 6**: Create a New Feature

*Instruction*: Create a new feature `FamilySize` = `Siblings/Spouses Aboard` + `Parents/Children Aboard` + 1.

In [10]:
import pandas as pd

data = pd.read_csv('/content/titanic.csv')
data['FamilySize'] = data['Siblings/Spouses Aboard'] + data['Parents/Children Aboard'] + 1
print(data.head())



   Survived  Pclass                                               Name  \
0         0       3                             Mr. Owen Harris Braund   
1         1       1  Mrs. John Bradley (Florence Briggs Thayer) Cum...   
2         1       3                              Miss. Laina Heikkinen   
3         1       1        Mrs. Jacques Heath (Lily May Peel) Futrelle   
4         0       3                            Mr. William Henry Allen   

      Sex   Age  Siblings/Spouses Aboard  Parents/Children Aboard     Fare  \
0    male  22.0                        1                        0   7.2500   
1  female  38.0                        1                        0  71.2833   
2  female  26.0                        0                        0   7.9250   
3  female  35.0                        1                        0  53.1000   
4    male  35.0                        0                        0   8.0500   

   FamilySize  
0           2  
1           2  
2           1  
3           2  
4     