# Titanic Dataset Preprocessing

## Introduction

This project preprocesses the Titanic dataset, a classic dataset for binary classification (predicting the `Survived` column). The goal is to prepare the data for machine learning algorithms through essential data cleaning and transformation steps.


### 1. **Understand the Dataset:** Explore and familiarize ourselves with the Titanic dataset.

In [1]:
import pandas as pd # type: ignore
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder # type: ignore

#### Load the Data

In [2]:
df = pd.read_csv('Data/train.csv') # type: ignore

#### Explore the Data

In [3]:
df.head()  # Show the first 5 rows of the dataset

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
df.tail(3) #gives the last 3 rows of the dataset

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [5]:
df.info() #Check the data types and non-null counts.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [6]:
df.describe() #Get summary statistics of numerical columns.

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [7]:
df.sort_index(axis=0, ascending=True)  # Sort by index labels along an axis (0 = rows, 1 = columns); returns a new DataFrame

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [8]:
df.sort_values(by="PassengerId", ascending=True)  # Sort rows by the values in PassengerId; returns a new DataFrame

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C
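
Both sorting calls above return a new, sorted DataFrame and leave `df` itself unchanged. The following is a minimal sketch of that behavior on a toy frame; the name `toy` and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame with an out-of-order index; values are illustrative only.
toy = pd.DataFrame({"PassengerId": [3, 1, 2]}, index=[2, 0, 1])

# Order rows by their index labels.
by_index = toy.sort_index(axis=0, ascending=True)

# Order rows by the values of one column, largest first.
by_value = toy.sort_values(by="PassengerId", ascending=False)

print(by_index.index.tolist())
print(by_value["PassengerId"].tolist())
print(toy.index.tolist())  # the original frame keeps its order
```

To keep a sorted result, assign it back to a variable (for example `df = df.sort_values(by="PassengerId")`).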


### 2. **Handle Missing Values:** Identify and address missing values to ensure data completeness.

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
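
The non-null counts above can be turned into missing-value counts and percentages, which helps when deciding how to treat each column. The sketch below only redoes the arithmetic on the numbers printed by `df.info()`; the variable names are illustrative.

```python
# Non-null counts copied from the df.info() output above.
total_rows = 891
non_null = {"Age": 714, "Cabin": 204, "Embarked": 889}

# Number of missing entries per column.
missing = {col: total_rows - n for col, n in non_null.items()}

# Share of missing entries per column, in percent (one decimal place).
share = {col: round(100 * m / total_rows, 1) for col, m in missing.items()}

for col in non_null:
    print(f"{col}: {missing[col]} missing ({share[col]}%)")
```

On the DataFrame itself, `df.isna().sum()` should report the same missing counts directly.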


In [10]:
df['Age'] = df['Age'].fillna(df['Age'].median())  # Fill missing Age values with the median of the Age column

In [11]:
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])  # Fill missing Embarked values with the most frequent value in the Embarked column

In [12]:
df.drop(columns=['Cabin'], inplace=True)  # Drop the Cabin column (only 204 of 891 values are non-null)

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Embarked     891 non-null    object 
dtypes: float64(2), int64(5), object(4)
memory usage: 76.7+ KB
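
The three steps in this section (median fill, mode fill, column drop) can be checked on a small example. This is a sketch on a hypothetical four-row frame named `toy`, not on the real data.

```python
import pandas as pd

# Hypothetical miniature of the three columns that contain missing values.
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0],
    "Embarked": ["S", None, "C", "S"],
    "Cabin": [None, "C85", None, None],
})

# Numeric column: replace missing entries with the median of the observed values.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Categorical column: replace missing entries with the most frequent value.
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

# Mostly empty column: remove it entirely.
toy = toy.drop(columns=["Cabin"])

print(toy)
print(toy.isna().sum())  # every count should now be 0
```

The median is less sensitive to extreme values than the mean, which is one common reason to prefer it for a column such as `Age`.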


### 3. **Feature Scaling:** Normalize or standardize numerical features for consistent scaling.

In [14]:
df[['Age', 'Fare']]

Unnamed: 0,Age,Fare
0,22.0,7.2500
1,38.0,71.2833
2,26.0,7.9250
3,35.0,53.1000
4,35.0,8.0500
...,...,...
886,27.0,13.0000
887,19.0,30.0000
888,28.0,23.4500
889,26.0,30.0000


#### Standardization
Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data.

In [15]:
scaler = StandardScaler()
df[['Age']] = scaler.fit_transform(df[['Age']])
df[['Age', 'Fare']]

Unnamed: 0,Age,Fare
0,-0.565736,7.2500
1,0.663861,71.2833
2,-0.258337,7.9250
3,0.433312,53.1000
4,0.433312,8.0500
...,...,...
886,-0.181487,13.0000
887,-0.796286,30.0000
888,-0.104637,23.4500
889,-0.258337,30.0000
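
Standardization replaces each value `x` with `z = (x - mean) / std`, so the scaled column has mean 0 and standard deviation 1. The sketch below compares that formula with `StandardScaler` on a few illustrative ages (the array `ages` is an assumption for demonstration). Note that `StandardScaler` uses the population standard deviation (`ddof=0`), whereas `df.describe()` reports the sample standard deviation (`ddof=1`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative ages as a single-feature column (shape: 4 rows x 1 column).
ages = np.array([[22.0], [38.0], [26.0], [35.0]])

# Manual z-score; np.std uses ddof=0 by default.
manual = (ages - ages.mean()) / ages.std()

# Same transformation via scikit-learn.
scaled = StandardScaler().fit_transform(ages)

print(np.allclose(manual, scaled))
print(scaled.mean(), scaled.std())
```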


#### Normalization
Normalization is scaling features to lie between a given minimum and maximum value, often between zero and one, or so that the maximum absolute value of each feature is scaled to unit size.

In [16]:
min_max_scaler = MinMaxScaler()
df[['Fare']] = min_max_scaler.fit_transform(df[['Fare']])
df[['Age', 'Fare']]

Unnamed: 0,Age,Fare
0,-0.565736,0.014151
1,0.663861,0.139136
2,-0.258337,0.015469
3,0.433312,0.103644
4,0.433312,0.015713
...,...,...
886,-0.181487,0.025374
887,-0.796286,0.058556
888,-0.104637,0.045771
889,-0.258337,0.058556
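
With its default range of 0 to 1, min-max scaling maps each value `x` to `(x - min) / (max - min)`. The sketch below applies that formula by hand, using the `Fare` minimum and maximum from the `df.describe()` output earlier; the helper `min_max` is a hypothetical name introduced only for this illustration.

```python
# Minimum and maximum of Fare, taken from the df.describe() output.
fare_min, fare_max = 0.0, 512.3292


def min_max(x, lo, hi):
    """Rescale x linearly so that lo maps to 0 and hi maps to 1."""
    return (x - lo) / (hi - lo)


# Fares of the first two passengers, plus the maximum fare.
for fare in (7.25, 71.2833, 512.3292):
    print(fare, round(min_max(fare, fare_min, fare_max), 6))
```

The first two results agree with the scaled `Fare` values shown for rows 0 and 1 above.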


### 4. **Encode Categorical Variables:** Convert categorical features into numerical format suitable for machine learning models.

#### One-Hot Encoding: Create binary columns for each category.

One-hot encoding is a common approach for transforming categorical variables into numeric values. It converts categorical data into binary data, where each category is represented by its own binary value. For each categorical variable, a binary vector whose length equals the number of categories is generated. For example, suppose a categorical variable has three categories. For `Pclass`, with the categories 1, 2, and 3, the one-hot representation has three dimensions, 'Pclass_1', 'Pclass_2', and 'Pclass_3', each representing one of the categories.

In [17]:
df = pd.get_dummies(df, columns=['Sex', 'Pclass'])
df[['Sex_male', 'Sex_female', 'Pclass_1', 'Pclass_2', 'Pclass_3']]

Unnamed: 0,Sex_male,Sex_female,Pclass_1,Pclass_2,Pclass_3
0,True,False,False,False,True
1,False,True,True,False,False
2,False,True,False,False,True
3,False,True,True,False,False
4,True,False,False,False,True
...,...,...,...,...,...
886,True,False,False,True,False
887,False,True,True,False,False
888,False,True,False,False,True
889,True,False,True,False,False
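
To make the shape of a one-hot representation concrete, here is a small sketch on a hypothetical four-row `Pclass` column. Passing `dtype=int` is an assumption made here to get 0/1 columns regardless of the installed pandas version; the notebook's own call uses the default dtype.

```python
import pandas as pd

# Hypothetical class labels for four passengers.
toy = pd.DataFrame({"Pclass": [3, 1, 2, 3]})

# One binary column per category; exactly one of them is 1 in every row.
one_hot = pd.get_dummies(toy, columns=["Pclass"], dtype=int)
print(one_hot)

# drop_first=True omits the first category, which is implied
# whenever all remaining columns are 0.
reduced = pd.get_dummies(toy, columns=["Pclass"], dtype=int, drop_first=True)
print(reduced.columns.tolist())
```

Dropping one column per variable can be useful for linear models, where the full set of indicator columns is perfectly collinear.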


#### Label Encoding: Convert each unique category value to a number.

Label encoding assigns a unique integer to each category, which turns the variable into ordinal data. For example, if the categorical variable "Embarked" includes the categories "C", "Q", and "S", label encoding assigns the values 0, 1, and 2 to "C", "Q", and "S" respectively.

In [18]:
le = LabelEncoder()

df['Embarked'] = le.fit_transform(df['Embarked']) #C = 0, Q = 1, S = 2
df['Embarked']

0      2
1      0
2      2
3      2
4      2
      ..
886    2
887    2
888    2
889    0
890    1
Name: Embarked, Length: 891, dtype: int32
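
`LabelEncoder` numbers the categories in sorted order and keeps the mapping in its `classes_` attribute, so the codes can be translated back later. A minimal sketch on a hypothetical list of ports:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical embarkation ports for four passengers.
ports = ["S", "C", "S", "Q"]

le = LabelEncoder()
codes = le.fit_transform(ports)

print(list(le.classes_))                   # position in this list = assigned code
print(codes.tolist())
print(list(le.inverse_transform(codes)))   # recover the original labels
```

The integer codes imply an order (C < Q < S) that the ports do not actually have. When that matters for the model, one-hot encoding `Embarked` may be the safer choice.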

### 5. **Feature Engineering:** Create new features to enhance the predictive power of the dataset.

In [19]:
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
df

Unnamed: 0,PassengerId,Survived,Name,Age,SibSp,Parch,Ticket,Fare,Embarked,Sex_female,Sex_male,Pclass_1,Pclass_2,Pclass_3,FamilySize
0,1,0,"Braund, Mr. Owen Harris",-0.565736,1,0,A/5 21171,0.014151,2,False,True,False,False,True,2
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0.663861,1,0,PC 17599,0.139136,0,True,False,True,False,False,2
2,3,1,"Heikkinen, Miss. Laina",-0.258337,0,0,STON/O2. 3101282,0.015469,2,True,False,False,False,True,1
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0.433312,1,0,113803,0.103644,2,True,False,True,False,False,2
4,5,0,"Allen, Mr. William Henry",0.433312,0,0,373450,0.015713,2,False,True,False,False,True,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,"Montvila, Rev. Juozas",-0.181487,0,0,211536,0.025374,2,False,True,False,True,False,1
887,888,1,"Graham, Miss. Margaret Edith",-0.796286,0,0,112053,0.058556,2,True,False,True,False,False,1
888,889,0,"Johnston, Miss. Catherine Helen ""Carrie""",-0.104637,1,2,W./C. 6607,0.045771,2,True,False,False,False,True,4
889,890,1,"Behr, Mr. Karl Howell",-0.258337,0,0,111369,0.058556,0,False,True,True,False,False,1
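
`FamilySize` counts siblings/spouses (`SibSp`), parents/children (`Parch`), and the passenger themselves (the `+ 1`). The sketch below recomputes it for the `SibSp` and `Parch` values of rows 0, 1, 2, and 888 shown above. The `IsAlone` flag is a hypothetical follow-up feature that is not part of the original notebook.

```python
import pandas as pd

# SibSp and Parch values of rows 0, 1, 2 and 888 from the table above.
toy = pd.DataFrame({"SibSp": [1, 1, 0, 1], "Parch": [0, 0, 0, 2]})

# Relatives aboard plus the passenger themselves.
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# Hypothetical derived feature: 1 if the passenger travels alone, else 0.
toy["IsAlone"] = (toy["FamilySize"] == 1).astype(int)

print(toy)
```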


### 6. Save the Preprocessed Data

In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Name         891 non-null    object 
 3   Age          891 non-null    float64
 4   SibSp        891 non-null    int64  
 5   Parch        891 non-null    int64  
 6   Ticket       891 non-null    object 
 7   Fare         891 non-null    float64
 8   Embarked     891 non-null    int32  
 9   Sex_female   891 non-null    bool   
 10  Sex_male     891 non-null    bool   
 11  Pclass_1     891 non-null    bool   
 12  Pclass_2     891 non-null    bool   
 13  Pclass_3     891 non-null    bool   
 14  FamilySize   891 non-null    int64  
dtypes: bool(5), float64(2), int32(1), int64(5), object(2)
memory usage: 70.6+ KB


In [21]:
df.to_csv('Data/train_preprocessed.csv')  # Save the preprocessed data; by default the row index is written as the first column
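
When a CSV file written with the default settings is read back, the saved row index appears as an extra column named `Unnamed: 0`. The sketch below shows the difference that `index=False` makes, using in-memory buffers and a hypothetical two-row frame instead of real files.

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for the preprocessed data.
toy = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})

# Default behavior: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```

If the index carries no information, as with the default `RangeIndex` here, writing the file with `index=False` keeps the round trip clean.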