# Task 2: Data Preprocessing for Machine Learning – AI Bootcamp

Download Titanic Dataset here: https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv

#### About this file

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

## Section 1: Data Loading & Exploration

### **Task 1**: Load and Inspect a Dataset

*Instruction*: Load the `titanic.csv` dataset and display the first 5 rows. Show basic info and describe statistics of the dataset.

In [None]:
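import pandas as pd

# Load the dataset; adjust the path to wherever titanic.csv is saved
df = pd.read_csv('/content/sample_data/titanic.csv')

print(df.head())      # first 5 rows
df.info()             # prints dtypes and non-null counts itself; returns None
print(df.describe())  # summary statistics for the numeric columns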
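
The inspection calls can also be tried without the download. The following is a minimal sketch that assumes a tiny hand-written CSV: the three rows are made up for illustration, and only the column names follow the Titanic file. It also shows that `info()` writes its report directly and returns `None`, which is why wrapping it in `print` is unnecessary.

```python
import io

import pandas as pd

# Hypothetical three-row sample; the values are illustrative, not taken from titanic.csv.
csv_text = """Survived,Pclass,Sex,Age,Fare
0,3,male,22.0,7.25
1,1,female,,71.28
1,3,female,26.0,7.92
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head())   # first rows (all three here)
print(sample.shape)    # (rows, columns)

# info() writes its report to stdout, or to `buf` when one is given, and returns None.
report = io.StringIO()
info_result = sample.info(buf=report)
print(report.getvalue())

# describe() summarises the numeric columns; "count" excludes missing values.
summary = sample.describe()
print(summary.loc[["count", "mean"], ["Age", "Fare"]])
```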

## Section 2: Handling Missing Values

### **Task 2**: Identify and Handle Missing Data

*Instruction*:



*   Display the number of missing values per column.
*   Fill missing `Age` values with the median.
*   Drop the second row in the dataset.



In [None]:
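# Count missing values in each column
missing_counts = df.isnull().sum()
print(missing_counts)

# Fill missing ages with the median; assign the result back, because
# inplace=True on a column selection triggers a warning in newer pandas
median_age = df['Age'].median()
df['Age'] = df['Age'].fillna(median_age)

# Drop the second row (position 1); re-running this cell drops one more row
df = df.drop(df.index[1])
print(df)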
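
To see what each of the three steps does, here is a small sketch on a made-up four-row frame (the names and ages are hypothetical, chosen so the median is easy to check by hand).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age": [22.0, np.nan, 30.0, 40.0],
})

# Step 1: missing values per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Step 2: median() skips NaN, so this is the median of 22, 30 and 40.
median_age = toy["Age"].median()
toy["Age"] = toy["Age"].fillna(median_age)

# Step 3: toy.index[1] is the label of the row at position 1 (the second row).
toy = toy.drop(toy.index[1])
print(toy)
```

After the drop, the index keeps its original labels (0, 2, 3 here), so a gap remains where the row was. Calling `reset_index(drop=True)` afterwards is one way to renumber the rows if a contiguous index is wanted.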



## Section 3: Encoding Categorical Features

### **Task 3**: Convert Categorical to Numeric

*Instruction*: Convert `Sex` and `Pclass` columns to numeric using:


*   Label Encoding for `Sex`
*   One-Hot Encoding for `Pclass`



In [None]:
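from sklearn.preprocessing import LabelEncoder

# Label-encode Sex as integers (classes are sorted, so female -> 0, male -> 1)
le = LabelEncoder()
df['Sex'] = le.fit_transform(df['Sex'])

# One-hot encode Pclass into one indicator column per class
df = pd.get_dummies(df, columns=['Pclass'])
print(df)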


     Survived                                         Name  Sex   Age  \
0           0                       Mr. Owen Harris Braund    1  22.0   
3           1  Mrs. Jacques Heath (Lily May Peel) Futrelle    0  35.0   
4           0                      Mr. William Henry Allen    1  35.0   
5           0                              Mr. James Moran    1  27.0   
6           0                       Mr. Timothy J McCarthy    1  54.0   
..        ...                                          ...  ...   ...   
882         0                         Rev. Juozas Montvila    1  27.0   
883         1                  Miss. Margaret Edith Graham    0  19.0   
884         0               Miss. Catherine Helen Johnston    0   7.0   
885         1                         Mr. Karl Howell Behr    1  26.0   
886         0                           Mr. Patrick Dooley    1  32.0   

     Siblings/Spouses Aboard  Parents/Children Aboard     Fare  Pclass_1  \
0                          1                   
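
The difference between the two encodings is easier to see on a few rows. This sketch uses a made-up four-row frame (hypothetical values) and recovers the label-to-integer mapping from the encoder's `classes_` attribute, which holds the labels in sorted order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Pclass": [3, 1, 2, 3],
})

# Label encoding: one integer per category, in a single column.
le = LabelEncoder()
toy["Sex"] = le.fit_transform(toy["Sex"])

# classes_ is sorted, and each label's position is its integer code.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

# One-hot encoding: one indicator column per category.
encoded = pd.get_dummies(toy, columns=["Pclass"])
print(encoded.columns.tolist())
```

Label encoding implies an order between categories, which is harmless for a two-valued column such as `Sex`. One-hot encoding avoids implying that class 3 is "three times" class 1, at the cost of extra columns.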

## Section 4: Feature Scaling

### **Task 4**: Scale Numerical Features

*Instruction*: Use `StandardScaler` to scale the `Age` and `Fare` columns.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
print(df)

     Survived                                         Name  Sex       Age  \
0           0                       Mr. Owen Harris Braund    1 -0.528491   
3           1  Mrs. Jacques Heath (Lily May Peel) Futrelle    0  0.391765   
4           0                      Mr. William Henry Allen    1  0.391765   
5           0                              Mr. James Moran    1 -0.174546   
6           0                       Mr. Timothy J McCarthy    1  1.736756   
..        ...                                          ...  ...       ...   
882         0                         Rev. Juozas Montvila    1 -0.174546   
883         1                  Miss. Margaret Edith Graham    0 -0.740858   
884         0               Miss. Catherine Helen Johnston    0 -1.590326   
885         1                         Mr. Karl Howell Behr    1 -0.245335   
886         0                           Mr. Patrick Dooley    1  0.179398   

     Siblings/Spouses Aboard  Parents/Children Aboard      Fare  Pclass_1  
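
`StandardScaler` replaces each value with its z-score, z = (x - mean) / std, computed per column. The sketch below checks this by hand on a made-up three-row frame (hypothetical values). Note that the scaler uses the population standard deviation (`ddof=0`), whereas pandas' `std()` defaults to `ddof=1`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({"Age": [20.0, 30.0, 40.0], "Fare": [10.0, 10.0, 40.0]})

scaled = StandardScaler().fit_transform(toy[["Age", "Fare"]])

# Manual z-score with the population standard deviation (ddof=0).
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(scaled)
print(manual.to_numpy())
```

After scaling, each column has mean 0 and unit variance, which is why the `Age` values in the output above are centred around zero.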

## Section 5: Preprocessing Pipeline

### **Task 5**: Build Preprocessing Pipeline

*Instruction*: Using `ColumnTransformer` and `Pipeline` from `sklearn`, build a pipeline that:



*   Imputes missing values
*   Scales numeric data
*   Encodes categorical data



In [None]:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ['Age', 'Fare']
categorical_features = ['Sex', 'Pclass']

# Numeric columns: fill gaps with the median, then standardize
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical columns: fill gaps with the most frequent value, then one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Route each column group to its transformer; unlisted columns are dropped by default
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Reload the raw data, since df was already encoded and scaled in the earlier tasks
df = pd.read_csv('/content/sample_data/titanic.csv')
X = preprocessor.fit_transform(df)  # keep df intact; X is the transformed feature matrix
print(X)

[[-0.52936601 -0.50358635  0.         ...  0.          0.
   1.        ]
 [ 0.60426454  0.78341245  1.         ...  1.          0.
   0.        ]
 [-0.24595837 -0.49001959  1.         ...  0.          0.
   1.        ]
 ...
 [-1.59214465 -0.17798419  1.         ...  0.          0.
   1.        ]
 [-0.24595837 -0.04633641  0.         ...  1.          0.
   0.        ]
 [ 0.17915309 -0.4935369   0.         ...  0.          0.
   1.        ]]
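
One reason to wrap the steps in a pipeline is that the statistics learned during `fit` (medians, means, category lists) can be reused on rows the pipeline has not seen. The following is a minimal sketch under that assumption, using made-up data: `train` and `new_rows` are hypothetical frames, and `sparse_threshold=0` is set only so the result is always a dense array that is easy to print.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; the values are made up for illustration.
train = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, 30.0],
    "Fare": [10.0, 20.0, 30.0, 40.0],
    "Sex": ["male", "female", "female", "male"],
    "Pclass": [1, 2, 3, 3],
})
new_rows = pd.DataFrame({
    "Age": [np.nan], "Fare": [25.0], "Sex": ["male"], "Pclass": [2],
})

numeric = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric, ["Age", "Fare"]),
        ("cat", categorical, ["Sex", "Pclass"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# fit_transform learns the medians, means and categories from the training rows.
X_train = preprocessor.fit_transform(train)

# transform reuses those learned values; nothing is re-estimated from new_rows.
X_new = preprocessor.transform(new_rows)

print(X_train.shape)
print(X_new)
```

In `new_rows`, the missing age is filled with the training median (30) and the fare equals the training mean (25), so both scaled values come out as 0. The remaining five columns are the one-hot indicators for `Sex` (female, male) and `Pclass` (1, 2, 3).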


## Section 6: Feature Engineering

### **Task 6**: Create a New Feature

*Instruction*: Create a new feature `FamilySize` = `Siblings/Spouses Aboard` + `Parents/Children Aboard` + 1.

In [None]:
import pandas as pd

# Reload the raw data so the original columns are available
data = pd.read_csv('/content/sample_data/titanic.csv')

# Family size = siblings/spouses + parents/children + the passenger themselves
data['FamilySize'] = data['Siblings/Spouses Aboard'] + data['Parents/Children Aboard'] + 1
print(data)


     Survived  Pclass                                               Name  \
0           0       3                             Mr. Owen Harris Braund   
1           1       1  Mrs. John Bradley (Florence Briggs Thayer) Cum...   
2           1       3                              Miss. Laina Heikkinen   
3           1       1        Mrs. Jacques Heath (Lily May Peel) Futrelle   
4           0       3                            Mr. William Henry Allen   
..        ...     ...                                                ...   
882         0       2                               Rev. Juozas Montvila   
883         1       1                        Miss. Margaret Edith Graham   
884         0       3                     Miss. Catherine Helen Johnston   
885         1       1                               Mr. Karl Howell Behr   
886         0       3                                 Mr. Patrick Dooley   

        Sex   Age  Siblings/Spouses Aboard  Parents/Children Aboard     Fare  \
0      
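
The same calculation can be wrapped in a small function that leaves its input untouched. This is only a sketch: `add_family_size` is a hypothetical helper name, and the three rows are made up so the sums can be checked by hand.

```python
import pandas as pd


def add_family_size(frame):
    """Hypothetical helper: return a copy of `frame` with a FamilySize column."""
    out = frame.copy()
    # Relatives aboard plus the passenger themselves.
    out["FamilySize"] = (
        out["Siblings/Spouses Aboard"] + out["Parents/Children Aboard"] + 1
    )
    return out


# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Siblings/Spouses Aboard": [1, 0, 3],
    "Parents/Children Aboard": [0, 0, 2],
})

result = add_family_size(toy)
print(result)
```

A passenger travelling alone gets a `FamilySize` of 1, because the `+ 1` counts the passenger.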