# P3: Building a Classifier - Titanic Survival Prediction

**Author:** Saratchandra Golla  
**Date:** 11/08/2025    
**Dataset:** Titanic (from seaborn)     
**Introduction:**
This project builds and evaluates classification models that predict passenger survival on the Titanic, using the Titanic dataset bundled with seaborn. We compare three classifier types, a Decision Tree (DT), a Support Vector Machine (SVM), and a Neural Network (NN), across three feature sets to determine which combination works best for this task. The work follows four steps: data preparation, feature selection, model training, and performance evaluation using accuracy, precision, recall, and F1-score.

## Section 1: Import and Inspect the Data
We begin by importing all required libraries in one place and then load the Titanic dataset.

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Load Titanic dataset
titanic = sns.load_dataset('titanic')

# Display a few records to verify
print("First 5 rows of the Titanic dataset:")
print(titanic.head())
print("\nDataset Info:")
titanic.info()

First 5 rows of the Titanic dataset:
   survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
0         0       3    male  22.0      1      0   7.2500        S  Third   
1         1       1  female  38.0      1      0  71.2833        C  First   
2         1       3  female  26.0      0      0   7.9250        S  Third   
3         1       1  female  35.0      1      0  53.1000        S  First   
4         0       3    male  35.0      0      0   8.0500        S  Third   

     who  adult_male deck  embark_town alive  alone  
0    man        True  NaN  Southampton    no  False  
1  woman       False    C    Cherbourg   yes  False  
2  woman       False  NaN  Southampton   yes   True  
3  woman       False    C  Southampton   yes  False  
4    man        True  NaN  Southampton    no   True  

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       -
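
Before cleaning, it helps to see which columns contain gaps and how large they are. The sketch below is one way to summarize that. The helper name `missing_summary` and the toy values are assumptions made for illustration; they are not part of the original notebook, and the same function could be called on `titanic` instead.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame; the values are invented for illustration.
toy = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1],
    "age": [22.0, np.nan, 26.0, np.nan, 35.0],
    "embark_town": ["Southampton", "Cherbourg", None, "Southampton", "Southampton"],
    "fare": [7.25, 71.28, 7.93, 53.10, 8.05],
})


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: count and percentage of missing values per column.

    Only columns with at least one missing value are returned.
    """
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


summary = missing_summary(toy)
print(summary)
```

Columns that appear in this summary are the candidates for the imputation step in Section 2.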

## Section 2: Data Exploration and Preparation
### 2.1 Handle Missing Values and Clean Data
Missing values in the continuous feature age are imputed with the median, and missing values in the categorical feature embark_town are filled with the mode. The coded counterpart embarked is a separate column with the same missing rows, so it needs its own fill before it is encoded.

In [2]:
# Impute missing values for 'age' using the median
median_age = titanic['age'].median()
titanic['age'] = titanic['age'].fillna(median_age)

# Fill in missing values for 'embark_town' using the mode ('embarked' is a separate column and is not changed here)
mode_embark = titanic['embark_town'].mode()[0]
titanic['embark_town'] = titanic['embark_town'].fillna(mode_embark)

# Check for remaining missing values in columns we plan to use
print("\nMissing values after imputation:")
print(titanic[['age', 'embark_town']].isnull().sum())


Missing values after imputation:
age            0
embark_town    0
dtype: int64
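
The median/mode strategy can be sketched as a small reusable function. The helper name `impute` and the toy values are assumptions for illustration only. The example also fills both embarkation columns, since filling `embark_town` does not change `embarked`.

```python
import numpy as np
import pandas as pd

# Toy data; the values are invented for illustration only.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 80.0, np.nan],
    "embarked": ["S", "C", None, "S", "S"],
    "embark_town": ["Southampton", "Cherbourg", None, "Southampton", "Southampton"],
})


def impute(df, median_cols, mode_cols):
    """Hypothetical helper: return a copy with gaps filled.

    Numeric columns in median_cols get their median; categorical columns
    in mode_cols get their most frequent value.
    """
    out = df.copy()
    for col in median_cols:
        out[col] = out[col].fillna(out[col].median())
    for col in mode_cols:
        out[col] = out[col].fillna(out[col].mode()[0])
    return out


filled = impute(toy, median_cols=["age"], mode_cols=["embarked", "embark_town"])
print(filled)
print(filled.isnull().sum())
```

In this toy frame the median age is 26, while the mean would be pulled up to about 42.7 by the single age of 80. This is the usual reason for preferring the median on a skewed column.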


### 2.2 Feature Engineering
We create a new feature, family_size, and convert the non-numeric features (sex, embarked, alone) into numeric form for model training.

In [3]:
# 1. Add family_size: sibsp (siblings/spouses) + parch (parents/children) + 1 (for the individual)
titanic['family_size'] = titanic['sibsp'] + titanic['parch'] + 1

# 2. Convert categorical 'sex' to numeric binary (male=0, female=1)
titanic['sex'] = titanic['sex'].map({'male': 0, 'female': 1})

# 3. Convert categorical 'embarked' to numeric (C=0, Q=1, S=2).
#    Fill its missing values with the mode first; otherwise map() leaves NaN
#    in those rows and the column becomes float.
mode_embarked = titanic['embarked'].mode()[0]
titanic['embarked'] = titanic['embarked'].fillna(mode_embarked).map({'C': 0, 'Q': 1, 'S': 2})

# 4. Binary feature - convert 'alone' to numeric binary (already True/False, convert to int 1/0)
titanic['alone'] = titanic['alone'].astype(int)

print("\nData after feature engineering and mapping:")
print(titanic[['family_size', 'sex', 'embarked', 'alone']].head())


Data after feature engineering and mapping:
   family_size  sex  embarked  alone
0            2    0         2      0
1            2    1         0      0
2            1    1         2      1
3            2    1         2      0
4            1    0         2      1
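
Because `Series.map` returns NaN for any value that is missing or absent from the mapping dictionary, an encoded column can silently pick up gaps. A minimal sketch of a guard against this follows. The helper name `encode_with_check` and the toy values are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy data; the values are invented for illustration only.
toy = pd.DataFrame({
    "sibsp": [1, 0, 0, 3],
    "parch": [0, 0, 2, 1],
    "sex": ["male", "female", "female", "male"],
    "embarked": ["S", "C", None, "Q"],  # one missing value on purpose
})


def encode_with_check(series, mapping):
    """Hypothetical helper: map categories to ints and fail loudly on gaps."""
    encoded = series.map(mapping)
    unmapped = encoded.isnull()
    if unmapped.any():
        raise ValueError(
            f"{series.name}: {int(unmapped.sum())} value(s) missing or not in the mapping"
        )
    return encoded.astype(int)


# Same derived feature as in the notebook: the passenger plus relatives aboard.
toy["family_size"] = toy["sibsp"] + toy["parch"] + 1
toy["sex"] = encode_with_check(toy["sex"], {"male": 0, "female": 1})

# 'embarked' still has a missing value, so the guard raises instead of
# quietly producing a float column with NaN.
embarked_error = None
try:
    toy["embarked"] = encode_with_check(toy["embarked"], {"C": 0, "Q": 1, "S": 2})
except ValueError as err:
    embarked_error = str(err)
    print("Encoding failed:", embarked_error)
```

A check like this surfaces the problem at encoding time rather than later, when an estimator that does not accept NaN is fitted.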
