# Titanic Survival Prediction: Data Preprocessing

In this notebook, we preprocess the Titanic dataset so that it can be used to build machine learning models. The steps are: handling missing values, encoding categorical variables, engineering a new feature, and standardizing numerical columns.

---

In [1]:
import kagglehub
brendan45774_test_file_path = kagglehub.dataset_download('brendan45774/test-file')

print('Data source import complete.')

Data source import complete.


## Import Libraries

In [2]:
import os
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.impute import SimpleImputer

## Load Data

In [3]:
csv_file_path = os.path.join(brendan45774_test_file_path, 'tested.csv')

In [4]:
tested = pd.read_csv(csv_file_path)

## Handle Missing Values

Check for missing values

In [5]:
missing_values = tested.isnull().sum()

Display missing values and their percentage

In [6]:
missing_percentage = (missing_values / len(tested)) * 100
missing_data = pd.DataFrame({'Missing Values': missing_values, 'Percentage': missing_percentage})
missing_data.sort_values('Percentage', ascending=False)

Column,Missing Values,Percentage
Cabin,327,78.229665
Age,86,20.574163
Fare,1,0.239234
PassengerId,0,0.0
Name,0,0.0
Pclass,0,0.0
Survived,0,0.0
Sex,0,0.0
Parch,0,0.0
SibSp,0,0.0
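
The missing-value check above can be wrapped in a small reusable helper. The following is a minimal sketch on a made-up toy frame (not the Titanic data); the function name `missing_summary` is hypothetical and introduced only for illustration.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst first."""
    counts = df.isnull().sum()
    percentage = counts / len(df) * 100
    summary = pd.DataFrame({'Missing Values': counts, 'Percentage': percentage})
    return summary.sort_values('Percentage', ascending=False)


# Toy frame with made-up values, used only to exercise the helper
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0, np.nan],
    'Fare': [7.25, 8.05, np.nan, 13.0],
    'Pclass': [3, 3, 1, 2],
})

summary = missing_summary(toy)
print(summary)
```

Calling the same helper before and after imputation gives a quick way to confirm that the counts drop to zero.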


Impute missing Age values with the median

In [7]:
age_imputer = SimpleImputer(strategy='median')
# SimpleImputer expects 2D input and returns a 2D array; flatten it back to 1D
tested['Age'] = age_imputer.fit_transform(tested[['Age']]).ravel()
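
To make the effect of median imputation concrete, here is a small sketch on made-up ages (not the Titanic data). It assumes only the `SimpleImputer` API already used in this notebook; the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up ages with two gaps
toy = pd.DataFrame({'Age': [22.0, np.nan, 35.0, 58.0, np.nan]})

imputer = SimpleImputer(strategy='median')
# fit_transform takes a 2D input and returns a 2D array, hence [['Age']] and ravel()
filled = imputer.fit_transform(toy[['Age']]).ravel()

# The value learned during fit is the median of the observed ages
learned_median = imputer.statistics_[0]
toy['Age'] = filled
print(learned_median)
print(toy)
```

The median is computed from the non-missing values only, so outliers in the observed ages affect the fill value less than they would with the mean.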

Drop 'Cabin' column due to excessive missing values

In [8]:
tested.drop('Cabin', axis=1, inplace=True)

Impute missing Embarked values with the mode

In [16]:
embarked_imputer = SimpleImputer(strategy='most_frequent')
# Select the column as a 2D DataFrame for the imputer, then flatten the result back to 1D
tested['Embarked'] = embarked_imputer.fit_transform(tested[['Embarked']]).ravel()
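
Mode imputation works the same way on a text column. This is a sketch on made-up port codes, assuming the column is stored with `object` dtype (set explicitly here) and that missing entries are `np.nan`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up embarkation codes with one missing entry
toy = pd.DataFrame({'Embarked': ['S', 'C', np.nan, 'S', 'Q']}, dtype=object)

imputer = SimpleImputer(strategy='most_frequent')
toy['Embarked'] = imputer.fit_transform(toy[['Embarked']]).ravel()

# The most frequent observed value is stored after fitting
most_frequent = imputer.statistics_[0]
print(most_frequent)
print(toy)
```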

Impute missing Fare values with the median

In [17]:
fare_imputer = SimpleImputer(strategy='median')
# Select the column as a 2D DataFrame for the imputer, then flatten the result back to 1D
tested['Fare'] = fare_imputer.fit_transform(tested[['Fare']]).ravel()

Check missing values after handling

In [18]:
tested.isnull().sum()

Column,Missing Values
PassengerId,0
Survived,0
Pclass,0
Name,0
Sex,0
Age,0
SibSp,0
Parch,0
Ticket,0
Fare,0


## Encode Categorical Variables

Most machine learning models require numeric inputs, so categorical variables such as Sex and Embarked need to be encoded before training.

In [19]:
# Encode 'Sex' using LabelEncoder
le = LabelEncoder()
tested['Sex'] = le.fit_transform(tested['Sex'])

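# One-hot encode 'Embarked' using pd.get_dummies
# Note: the result is stored in `train`; `tested` itself is not modified,
# so 'Embarked' remains a text column in `tested` from here on
train = pd.get_dummies(tested, columns=['Embarked'], drop_first=True)

# Display the first few rows of `tested` (only 'Sex' is encoded in this frame)
tested.head()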

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,892,0,3,"Kelly, Mr. James",1,34.5,0,0,330911,7.8292,Q
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",0,47.0,1,0,363272,7.0,S
2,894,0,2,"Myles, Mr. Thomas Francis",1,62.0,0,0,240276,9.6875,Q
3,895,0,3,"Wirz, Mr. Albert",1,27.0,0,0,315154,8.6625,S
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",0,22.0,1,1,3101298,12.2875,S
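
The two encoders behave differently, which a toy example makes easier to see. The sketch below uses made-up rows; it assumes `object`-dtype text columns and shows that `LabelEncoder` assigns codes in sorted label order, while `pd.get_dummies` returns a new DataFrame and leaves its input unchanged.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows, not the Titanic data
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Embarked': ['Q', 'S', 'C', 'S'],
}, dtype=object)

# LabelEncoder sorts the labels, so 'female' -> 0 and 'male' -> 1
le = LabelEncoder()
toy['Sex'] = le.fit_transform(toy['Sex'])
sex_mapping = {label: code for code, label in enumerate(le.classes_)}

# get_dummies returns a NEW frame; drop_first=True drops the first category ('C')
encoded = pd.get_dummies(toy, columns=['Embarked'], drop_first=True)

print(sex_mapping)
print(list(toy.columns))      # input still has the original 'Embarked' column
print(list(encoded.columns))  # result has the dummy columns instead
```

Because the input frame is not modified, the encoded columns only take effect in later steps if the returned frame is the one carried forward.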


## Feature Engineering

We create a new feature FamilySize by summing the SibSp and Parch columns, which hold the number of siblings/spouses and parents/children aboard. With this definition the passenger is not counted, so a passenger travelling alone has a FamilySize of 0.

In [20]:
# Create 'FamilySize' feature
tested['FamilySize'] = tested['SibSp'] + tested['Parch']

# Drop 'SibSp' and 'Parch' columns as they are now redundant
tested.drop(['SibSp', 'Parch'], axis=1, inplace=True)

# Display the updated dataset
tested.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Ticket,Fare,Embarked,FamilySize
0,892,0,3,"Kelly, Mr. James",1,34.5,330911,7.8292,Q,0
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",0,47.0,363272,7.0,S,1
2,894,0,2,"Myles, Mr. Thomas Francis",1,62.0,240276,9.6875,Q,0
3,895,0,3,"Wirz, Mr. Albert",1,27.0,315154,8.6625,S,0
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",0,22.0,3101298,12.2875,S,2
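
As a quick check of the arithmetic, here is the same sum on made-up rows. The column `FamilySizeInclSelf` is a hypothetical variant, not something this notebook computes, included only to show the alternative convention of counting the passenger too.

```python
import pandas as pd

# Made-up sibling/spouse and parent/child counts
toy = pd.DataFrame({'SibSp': [0, 1, 0, 1], 'Parch': [0, 0, 0, 1]})

# Definition used in this notebook: relatives aboard, passenger excluded
toy['FamilySize'] = toy['SibSp'] + toy['Parch']

# Hypothetical variant that also counts the passenger
toy['FamilySizeInclSelf'] = toy['FamilySize'] + 1

print(toy)
```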


## Normalize Numerical Data

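Next, we normalize the numerical features (Age, Fare) with StandardScaler, which standardizes each column to zero mean and unit variance. This puts the features on a comparable scale, which is particularly important for scale-sensitive models like Logistic Regression and SVM.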

In [21]:
# Normalize the 'Age' and 'Fare' features
scaler = StandardScaler()
tested[['Age', 'Fare']] = scaler.fit_transform(tested[['Age', 'Fare']])

# Display the first few rows after normalization
tested.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Ticket,Fare,Embarked,FamilySize
0,892,0,3,"Kelly, Mr. James",1,0.386231,330911,-0.497413,Q,0
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",0,1.37137,363272,-0.512278,S,1
2,894,0,2,"Myles, Mr. Thomas Francis",1,2.553537,240276,-0.4641,Q,0
3,895,0,3,"Wirz, Mr. Albert",1,-0.204852,315154,-0.482475,S,0
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",0,-0.598908,3101298,-0.417492,S,2
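
What `StandardScaler` computes can be verified by hand: each column has its mean subtracted and is divided by its population standard deviation. The sketch below checks this on made-up age and fare values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up (age, fare) pairs
values = np.array([
    [22.0, 7.25],
    [38.0, 71.28],
    [26.0, 7.92],
    [35.0, 53.10],
])

scaler = StandardScaler()
scaled = scaler.fit_transform(values)

# Manual z-score: NumPy's std defaults to the population form (ddof=0),
# which is what StandardScaler uses
manual = (values - values.mean(axis=0)) / values.std(axis=0)

# inverse_transform maps the scaled values back to the original units
restored = scaler.inverse_transform(scaled)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

Keeping the fitted `scaler` object around makes it possible to apply the same mean and standard deviation to new data with `scaler.transform`.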


## Drop Unnecessary Columns

We'll drop columns that are unlikely to contribute to the model's prediction in their raw form: Name and Ticket are free-text fields, and PassengerId is a row identifier.

In [22]:
# Drop unnecessary columns
tested.drop(['Name', 'Ticket', 'PassengerId'], axis=1, inplace=True)

# Display the final dataset after preprocessing
tested.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Fare,Embarked,FamilySize
0,0,3,1,0.386231,-0.497413,Q,0
1,1,3,0,1.37137,-0.512278,S,1
2,0,2,1,2.553537,-0.4641,Q,0
3,0,3,1,-0.204852,-0.482475,S,0
4,1,3,0,-0.598908,-0.417492,S,2


## Save Preprocessed Data

Finally, we save the preprocessed dataset as a new CSV file, which we can use for model training.

In [25]:
tested.to_csv('tested_preprocessed.csv', index=False)

# Display the shape of the preprocessed dataset
tested.shape

(418, 7)
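
A simple way to confirm that a saved file is usable is to read it back and compare its shape and columns. This is a sketch on a made-up frame written to a temporary directory; the file name `toy_preprocessed.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Made-up frame standing in for a preprocessed dataset
toy = pd.DataFrame({
    'Survived': [0, 1],
    'Pclass': [3, 1],
    'Age': [0.386231, -0.598908],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_preprocessed.csv')
    # index=False keeps the row index out of the file
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print(same_shape, same_columns)
```

Without `index=False`, the index is written as an extra unnamed column, which shows up as `Unnamed: 0` when the file is read back.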

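## Conclusion

In this notebook, we performed the following preprocessing steps:

- Handled missing values by imputing Age, Fare, and Embarked and dropping Cabin.
- Encoded Sex with LabelEncoder. The one-hot encoding of Embarked was assigned to a separate `train` DataFrame, so Embarked is still a text column in `tested` and in the saved file.
- Created a new feature FamilySize.
- Normalized numerical features (Age, Fare).
- Dropped unnecessary columns (Name, Ticket, PassengerId).

The preprocessed dataset is saved as `tested_preprocessed.csv` with shape (418, 7). Embarked still needs to be encoded before the data is passed to a model that requires numeric inputs.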