# Titanic Survival Prediction - Machine Learning Project

## Important: Running Order

**Please run the cells in order from top to bottom.** If you encounter errors, it's likely because cells were run out of order. In that case, restart the kernel and run all cells sequentially.

The notebook follows this workflow:

1. Load and combine data
2. Feature engineering and preprocessing
3. Model training and evaluation
4. Results visualization


In [26]:
# Import necessary libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Import classifiers from scikit-learn
from sklearn.linear_model import LogisticRegression, Perceptron, SGDClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Set plot style
%matplotlib inline
sns.set_style('whitegrid')

## Step 1: Load the Data

First, we'll load the `train.csv` and `test.csv` files. For consistent preprocessing, we'll combine them into a single DataFrame.


In [27]:
# Load the training and testing data
try:
    train_df = pd.read_csv('train.csv')
    test_df = pd.read_csv('test.csv')
except FileNotFoundError:
    print("Ensure 'train.csv' and 'test.csv' are in the same directory.")
    raise  # Stop here: every later cell needs both DataFrames

print("Data loaded successfully!")
print(f"Training data shape: {train_df.shape}")
print(f"Testing data shape: {test_df.shape}")

# Combine datasets for easier preprocessing
all_data = pd.concat([train_df, test_df], sort=False).reset_index(drop=True)
print(f"Combined data shape: {all_data.shape}")

Data loaded successfully!
Training data shape: (891, 12)
Testing data shape: (418, 11)
Combined data shape: (1309, 12)
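
The test file has one column fewer than the training file because it lacks `Survived`. The sketch below, which uses small made-up frames rather than the real CSV files, illustrates how `pd.concat` fills that column with `NaN` for test rows, which is what makes it possible to separate the two sets again later. The names `toy_train`, `toy_test`, and `is_test` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (hypothetical values)
toy_train = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1], "Fare": [7.25, 71.28]})
toy_test = pd.DataFrame({"PassengerId": [3], "Fare": [8.05]})

# Columns missing from one frame are filled with NaN in the combined frame
combined = pd.concat([toy_train, toy_test], sort=False).reset_index(drop=True)

# Rows that came from the test set can be recovered through the missing label
is_test = combined["Survived"].isna()
print(combined.shape)
print(is_test.tolist())
```

Because of the `NaN` entries, `Survived` is stored as a float column in the combined frame, which is why it is cast back to `int` before training.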


## Step 2: Improved Data Preprocessing & Feature Engineering

Here we engineer new features (`Deck`, `FamilySize`), impute missing values, bin `Age` and `Fare`, and one-hot encode the categorical columns.


### a. Feature Engineering from 'Cabin'

We extract the first letter of the `Cabin` number to create a new `Deck` feature. Missing values are filled with 'U' for 'Unknown'.


In [28]:
all_data['Deck'] = all_data['Cabin'].apply(lambda x: str(x)[0] if pd.notnull(x) else 'U')
print("Created 'Deck' feature. Unique values:", all_data['Deck'].unique())

Created 'Deck' feature. Unique values: ['U' 'C' 'E' 'G' 'D' 'A' 'B' 'F' 'T']
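
As an alternative to the row-wise `apply`, the same extraction can probably be written with pandas string accessors. This is a minimal sketch on a made-up `cabins` Series, not the notebook's data.

```python
import numpy as np
import pandas as pd

cabins = pd.Series(["C85", np.nan, "E46", "B57 B59"])  # toy values

# .str[0] takes the first character and leaves missing entries as NaN,
# which fillna then replaces with the 'U' placeholder
deck = cabins.str[0].fillna("U")
print(deck.tolist())
```

Note that a passenger with several cabins listed (such as `"B57 B59"`) is assigned the deck of the first cabin only, as in the cell above.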


### b. Create 'FamilySize' Feature

We add `SibSp` and `Parch`, plus one for the passenger, to get the total family size, then group it into three categories: `Alone` (1), `Small` (2 to 4), and `Large` (5 or more).


In [29]:
all_data['FamilySize'] = all_data['SibSp'] + all_data['Parch'] + 1
all_data['FamilySize_cat'] = pd.cut(all_data['FamilySize'], bins=[0, 1, 4, 20], labels=['Alone', 'Small', 'Large'])
print("Created 'FamilySize_cat' feature.")
all_data[['FamilySize', 'FamilySize_cat']].head()

Created 'FamilySize_cat' feature.


|   | FamilySize | FamilySize_cat |
|---|------------|----------------|
| 0 | 2 | Small |
| 1 | 2 | Small |
| 2 | 1 | Alone |
| 3 | 2 | Small |
| 4 | 1 | Alone |
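
To make the bin boundaries concrete: `pd.cut` treats each interval as open on the left and closed on the right by default, so the edges `[0, 1, 4, 20]` correspond to (0, 1], (1, 4], and (4, 20]. The sketch below checks this on a few made-up family sizes.

```python
import pandas as pd

sizes = pd.Series([1, 2, 4, 5, 11])  # toy family sizes

# Same bin edges and labels as the FamilySize_cat cell
cats = pd.cut(sizes, bins=[0, 1, 4, 20], labels=["Alone", "Small", "Large"])
print(cats.tolist())
```

A family size of 4 therefore still counts as `Small`, while 5 is the first `Large` value.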


### c. Impute Missing 'Fare' and 'Embarked'

We fill the single missing `Fare` value with the median and the two missing `Embarked` values with the mode.


In [30]:
all_data['Fare'] = all_data['Fare'].fillna(all_data['Fare'].median())
all_data['Embarked'] = all_data['Embarked'].fillna(all_data['Embarked'].mode()[0])
print("Missing 'Fare' and 'Embarked' imputed.")

Missing 'Fare' and 'Embarked' imputed.


### d. Advanced Imputation for 'Age'

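Instead of using a single overall median, we fill missing `Age` values with the median age of passengers who share the same `Pclass` and `Sex`. Because the median age differs noticeably between these groups, this should estimate a missing age more closely than one global median.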


In [31]:
# Check what columns are available before proceeding
print("Available columns in all_data:")
print(all_data.columns.tolist())
print("\nChecking if required columns exist:")
print("'Age' in columns:", 'Age' in all_data.columns)
print("'Pclass' in columns:", 'Pclass' in all_data.columns)
print("'Sex' in columns:", 'Sex' in all_data.columns)

# Only proceed with age imputation if the required columns exist
if all(col in all_data.columns for col in ['Age', 'Pclass', 'Sex']):
    age_impute_table = all_data.pivot_table(values='Age', index=['Pclass', 'Sex'], aggfunc='median')
    print("\nMedian age by Pclass and Sex:")
    print(age_impute_table)

    # Fix the age imputation to ensure we get scalar values
    def impute_age(row):
        if pd.isnull(row['Age']):
            return age_impute_table.loc[(row['Pclass'], row['Sex']), 'Age']
        else:
            return row['Age']

    all_data['Age'] = all_data.apply(impute_age, axis=1)
    print("\nMissing 'Age' values imputed.")
else:
    print("\nRequired columns not found. Age imputation skipped.")
    print("This may be because the notebook is being run out of order.")

Available columns in all_data:
['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'Deck', 'FamilySize', 'FamilySize_cat']

Checking if required columns exist:
'Age' in columns: True
'Pclass' in columns: True
'Sex' in columns: True

Median age by Pclass and Sex:
                Age
Pclass Sex         
1      female  36.0
       male    42.0
2      female  28.0
       male    29.5
3      female  22.0
       male    25.0

Missing 'Age' values imputed.
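
The same group-wise imputation can likely be expressed without a lookup table or a row-wise function, using `groupby(...).transform`. The following is a sketch on a made-up frame `toy`; it is an alternative formulation, not the code this notebook ran.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Age": [40.0, 44.0, np.nan, 20.0, np.nan, 24.0],
})

# Median age of each (Pclass, Sex) group, aligned with the original rows
group_median = toy.groupby(["Pclass", "Sex"])["Age"].transform("median")

# Only the missing entries are replaced; known ages are left untouched
toy["Age"] = toy["Age"].fillna(group_median)
print(toy["Age"].tolist())
```

Here the missing first-class male age becomes 42.0 (the median of 40 and 44) and the missing third-class female age becomes 22.0.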



In [32]:
# Ensure Age column is numeric and handle any remaining issues
if 'Age' in all_data.columns:
    all_data['Age'] = pd.to_numeric(all_data['Age'], errors='coerce')
    print("Age column converted to numeric.")
    print("Age column data type:", all_data['Age'].dtype)
    print("Age column info:")
    print(all_data['Age'].describe())
else:
    print("Age column not found. Skipping numeric conversion.")


Age column converted to numeric.
Age column data type: float64
Age column info:
count    1309.000000
mean       29.261398
std        13.218275
min         0.170000
25%        22.000000
50%        26.000000
75%        36.000000
max        80.000000
Name: Age, dtype: float64


### e. Bin Numerical Features ('Age' and 'Fare')

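We convert `Age` and `Fare` from continuous numbers into four equal-width bins each using `pd.cut`. Binning lets linear models capture non-linear relationships. Two caveats: the `Child` bin spans ages up to roughly 20, and because `Fare` is heavily right-skewed, equal-width bins place almost every passenger (1242 of 1309) in `Low`.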


In [33]:
# Bin numerical features ('Age' and 'Fare') into equal-width intervals.
# pd.cut is used instead of pd.qcut, which raises an error when quantile edges are duplicated.
if 'Age' in all_data.columns:
    all_data['Age_cat'] = pd.cut(all_data['Age'], bins=4, labels=['Child', 'YoungAdult', 'Adult', 'Senior'])
    print("Binned 'Age' into categories.")
    print("Age categories distribution:")
    print(all_data['Age_cat'].value_counts())
else:
    print("Age column not found. Skipping Age binning.")

if 'Fare' in all_data.columns:
    all_data['Fare_cat'] = pd.cut(all_data['Fare'], bins=4, labels=['Low', 'Medium', 'Mid-High', 'High'])
    print("\nBinned 'Fare' into categories.")
    print("Fare categories distribution:")
    print(all_data['Fare_cat'].value_counts())
else:
    print("Fare column not found. Skipping Fare binning.")

Binned 'Age' into categories.
Age categories distribution:
Age_cat
YoungAdult    806
Child         248
Adult         222
Senior         33
Name: count, dtype: int64

Binned 'Fare' into categories.
Fare categories distribution:
Fare_cat
Low         1242
Medium        50
Mid-High      13
High           4
Name: count, dtype: int64


### f. Drop Unnecessary Columns & Encode Categorical Features

Finally, we drop the original columns that are no longer needed and convert all remaining categorical features into numerical format using one-hot encoding.


In [34]:
# Drop original columns
all_data = all_data.drop(['Ticket', 'Cabin', 'Name', 'SibSp', 'Parch', 'Age', 'Fare', 'FamilySize'], axis=1)

# One-hot encode categorical features
all_data = pd.get_dummies(all_data, columns=['Pclass', 'Sex', 'Embarked', 'Deck', 'FamilySize_cat', 'Age_cat', 'Fare_cat'], drop_first=True)

print("Final processed data columns:")
print(all_data.columns)
all_data.head()

Final processed data columns:
Index(['PassengerId', 'Survived', 'Pclass_2', 'Pclass_3', 'Sex_male',
       'Embarked_Q', 'Embarked_S', 'Deck_B', 'Deck_C', 'Deck_D', 'Deck_E',
       'Deck_F', 'Deck_G', 'Deck_T', 'Deck_U', 'FamilySize_cat_Small',
       'FamilySize_cat_Large', 'Age_cat_YoungAdult', 'Age_cat_Adult',
       'Age_cat_Senior', 'Fare_cat_Medium', 'Fare_cat_Mid-High',
       'Fare_cat_High'],
      dtype='object')


|   | PassengerId | Survived | Pclass_2 | Pclass_3 | Sex_male | Embarked_Q | Embarked_S | Deck_B | Deck_C | Deck_D | ... | Deck_T | Deck_U | FamilySize_cat_Small | FamilySize_cat_Large | Age_cat_YoungAdult | Age_cat_Adult | Age_cat_Senior | Fare_cat_Medium | Fare_cat_Mid-High | Fare_cat_High |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.0 | False | True | True | False | True | False | False | False | ... | False | True | True | False | True | False | False | False | False | False |
| 1 | 2 | 1.0 | False | False | False | False | False | False | True | False | ... | False | False | True | False | True | False | False | False | False | False |
| 2 | 3 | 1.0 | False | True | False | False | True | False | False | False | ... | False | True | False | False | True | False | False | False | False | False |
| 3 | 4 | 1.0 | False | False | False | False | True | False | True | False | ... | False | False | True | False | True | False | False | False | False | False |
| 4 | 5 | 0.0 | False | True | True | False | True | False | False | False | ... | False | True | False | False | True | False | False | False | False | False |
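
The effect of `drop_first=True` is easiest to see on a single column. In this sketch on a made-up `toy` frame, `Embarked` has three categories but only two indicator columns are produced; the dropped category (`C`, the first in sorted order) is represented by all indicators being false.

```python
import pandas as pd

toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One indicator column per category, minus the first category in sorted order
encoded = pd.get_dummies(toy, columns=["Embarked"], drop_first=True)
print(encoded.columns.tolist())
print(encoded)
```

This is why the processed data above contains `Embarked_Q` and `Embarked_S` but no `Embarked_C`, and likewise no `Pclass_1` or `Deck_A`.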


## Step 3: Model Training and Evaluation

Now that our data is fully preprocessed, we can split it, train our models, and evaluate their performance.


In [35]:
# Split data back into train and test sets
train_processed = all_data[all_data['Survived'].notna()]
test_processed = all_data[all_data['Survived'].isna()].drop('Survived', axis=1)

# Define features (X) and target (y)
X = train_processed.drop("Survived", axis=1)
y = train_processed["Survived"].astype(int)

# Create a validation set from the training data to evaluate models
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

print("Data is split and scaled. Ready for training.")

Data is split and scaled. Ready for training.
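
One thing to be aware of: in the cell above, `X` is built by dropping only `Survived`, so `PassengerId` remains among the features. A row identifier is unlikely to carry real signal, and some models may fit noise in it. A possible way to exclude it is sketched below on a made-up frame; `feature_cols`, `X_toy`, and `y_toy` are hypothetical names, and the accuracies reported in this notebook were obtained with `PassengerId` still included.

```python
import pandas as pd

# Toy frame with the same kind of columns as train_processed (hypothetical values)
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Sex_male": [True, False, False],
})

# Keep only columns that describe the passenger, not the row identifier or the label
feature_cols = [c for c in toy.columns if c not in ("PassengerId", "Survived")]
X_toy = toy[feature_cols]
y_toy = toy["Survived"]
print(X_toy.columns.tolist())
```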


In [36]:
# Define the models to be evaluated
models = {
    'Support Vector Machines': SVC(),
    'KNN': KNeighborsClassifier(),
    'Logistic Regression': LogisticRegression(),
    'Random Forest': RandomForestClassifier(random_state=42),
    'Naive Bayes': GaussianNB(),
    'Perceptron': Perceptron(),
    'Stochastic Gradient Descent': SGDClassifier(),
    'Linear SVC': LinearSVC(dual=False), # dual=False to avoid convergence warnings
    'Decision Tree': DecisionTreeClassifier(random_state=42)
}

model_accuracies = {}

# Loop through each model, train it, and store its accuracy
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_val_scaled)
    accuracy = accuracy_score(y_val, y_pred)
    model_accuracies[name] = accuracy
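
The loop above scores each model on a single 80/20 split, so the ranking can shift with a different `random_state`. A common complement is k-fold cross-validation. The sketch below uses synthetic data from `make_classification` as a stand-in for the Titanic features, and wraps the scaler and one classifier in a pipeline; it is an illustration of the technique, not a result from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Titanic features (not the real data)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# The pipeline refits the scaler on the training part of every fold,
# so the held-out part never influences the scaling statistics
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="accuracy")
print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Reporting the mean together with the spread across folds gives a rough sense of how much of a difference between two models is just split-to-split noise.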

## Step 4: Display and Visualize Results

Let's see which models performed the best with our improved features.


In [37]:
# Create a DataFrame to display accuracies
accuracy_df = pd.DataFrame(list(model_accuracies.items()), columns=['Model', 'Accuracy'])
accuracy_df = accuracy_df.sort_values(by='Accuracy', ascending=False).reset_index(drop=True)

print("--- Model Accuracy Ranking ---")
print(accuracy_df)

# # Plotting the results for better visualization
# plt.figure(figsize=(10, 7))
# sns.barplot(x='Accuracy', y='Model', data=accuracy_df, palette='viridis')
# plt.title('Improved Model Classification Accuracies', fontsize=16)
# plt.xlabel('Accuracy', fontsize=12)
# plt.ylabel('Model', fontsize=12)
# plt.xlim(0.7, 0.9) # Set x-axis limit for better readability
# plt.show()

--- Model Accuracy Ranking ---
                         Model  Accuracy
0      Support Vector Machines  0.798883
1          Logistic Regression  0.798883
2                   Linear SVC  0.798883
3                          KNN  0.787709
4                   Perceptron  0.776536
5  Stochastic Gradient Descent  0.776536
6                Random Forest  0.748603
7                Decision Tree  0.720670
8                  Naive Bayes  0.709497
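
To put these accuracies in perspective, they can be converted back into counts of correctly classified passengers. The sketch below assumes the validation set has 179 rows, which is what `train_test_split` with `test_size=0.2` should produce from 891 training rows; the three accuracies are copied from the ranking above.

```python
n_val = 179  # assumed validation size: 20% of 891 rows, rounded up

accuracies = {
    "Support Vector Machines": 0.798883,
    "KNN": 0.787709,
    "Naive Bayes": 0.709497,
}

# Accuracy times the number of validation rows gives the number of correct predictions
correct = {name: round(acc * n_val) for name, acc in accuracies.items()}
print(correct)
```

Under that assumption, the top three models each classify 143 passengers correctly and KNN 141, so the gap between them amounts to two passengers on this particular split.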
