In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Titanic Survival Prediction

## Project Overview
This project predicts passenger survival on the Titanic using machine learning. Using the famous Kaggle Titanic dataset, we build a Random Forest classifier with custom preprocessing pipelines to achieve ~78% accuracy.

**Key Objectives:**
- Perform exploratory data analysis to understand survival patterns
- Engineer features and handle missing data
- Build and optimize a machine learning model
- Generate predictions for the test set

**Tools & Technologies:** Python, Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn

---

## 1. Import Libraries

In [None]:
# Load the training data
titanic_data = pd.read_csv('../data/train.csv')

## 2. Load and Explore Data

Loading the training dataset to understand the structure and identify data quality issues.

In [4]:
titanic_data

NameError: name 'titanic_data' is not defined

In [None]:
# Preview the first few rows to understand the data structure
titanic_data.head()

In [None]:
# Get statistical summary of numerical features
titanic_data.describe()

### Statistical Summary

Key observations:
- 891 passengers in training set
- 38.4% survival rate
- Age has 177 missing values (714/891 present)
- Large fare variance suggests different ticket classes

In [None]:
import seaborn as sns

# Create correlation heatmap to understand feature relationships
sns.heatmap(titanic_data.corr(numeric_only=True), cmap="YlGnBu", annot=True, fmt='.2f')
plt.title('Feature Correlation Matrix')
plt.show()

## 3. Exploratory Data Analysis (EDA)

### Correlation Heatmap
Visualizing correlations between numerical features to identify relationships with survival.

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

# Split data while preserving the distribution of Survived, Pclass, and Sex
# 80% training, 20% testing
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

for train_indicies, test_indicies in split.split(titanic_data, titanic_data[["Survived", "Pclass", "Sex"]]):
    strat_train_set = titanic_data.loc[train_indicies]
    strat_test_set = titanic_data.loc[test_indicies]

## 4. Train-Test Split

Using **Stratified Shuffle Split** to maintain the distribution of key features (Survived, Pclass, Sex) in both training and test sets. This ensures our model evaluation is more reliable.

In [None]:
strat_test_set

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
374,375,0,3,"Palsson, Miss. Stina Viola",female,3.0,3,1,349909,21.0750,,S
519,520,0,3,"Pavlovic, Mr. Stefo",male,32.0,0,0,349242,7.8958,,S
557,558,0,1,"Robbins, Mr. Victor",male,,0,0,PC 17757,227.5250,,C
83,84,0,1,"Carrau, Mr. Francisco M",male,28.0,0,0,113059,47.1000,,S
173,174,0,3,"Sivola, Mr. Antti Wilhelm",male,21.0,0,0,STON/O 2. 3101280,7.9250,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
312,313,0,2,"Lahtinen, Mrs. William (Anna Sylfven)",female,26.0,1,1,250651,26.0000,,S
672,673,0,2,"Mitchell, Mr. Henry Michael",male,70.0,0,0,C.A. 24580,10.5000,,S
492,493,0,1,"Molson, Mr. Harry Markland",male,55.0,0,0,113787,30.5000,C30,S
240,241,0,3,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,,C


In [None]:
# Visualize distribution of Survived and Pclass in both sets
plt.subplot(1,2,1)
strat_train_set['Survived'].hist()
strat_train_set['Pclass'].hist()
plt.title('Training Set Distribution')

plt.subplot(1,2,2)
strat_test_set['Survived'].hist()
strat_test_set['Pclass'].hist()
plt.title('Test Set Distribution')

### Verify Stratification
Comparing the distributions in training and test sets to ensure proper stratification.

In [None]:
# Check for missing values and data types
# Age: 142 missing, Cabin: 549 missing, Embarked: 2 missing
strat_train_set.info()

## 5. Data Preprocessing Pipeline

### 5.1 Check Data Quality
Identifying missing values and data types in the training set.

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer

class AgeImputer(BaseEstimator, TransformerMixin):
    """
    Custom transformer to impute missing Age values using mean strategy.
    
    This is necessary because ~20% of Age values are missing, and age is
    an important predictor of survival (children had priority in lifeboats).
    """
    
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        imputer = SimpleImputer(strategy="mean")
        X['Age'] = imputer.fit_transform(X[['Age']])
        return X

### 5.2 Custom Transformers

Building custom Scikit-learn transformers for preprocessing steps:
1. **AgeImputer**: Fills missing age values with the mean
2. **FeatureEncoder**: One-hot encodes categorical variables (Embarked, Sex)
3. **FeatureDropper**: Removes unnecessary columns

In [None]:
from sklearn.preprocessing import OneHotEncoder

class FeatureEncoder(BaseEstimator, TransformerMixin):
    """
    Custom transformer to one-hot encode categorical variables.
    
    Converts:
    - Embarked (C/S/Q/N) into 4 binary columns
    - Sex (Female/Male) into 2 binary columns
    
    One-hot encoding prevents the model from assuming ordinal relationships
    in categorical data.
    """
    
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        encoder = OneHotEncoder()
        
        # Encode Embarked port (C=Cherbourg, S=Southampton, Q=Queenstown)
        matrix = encoder.fit_transform(X[['Embarked']]).toarray()
        column_names = ["C", "S", "Q", "N"]
        for i in range(len(matrix.T)):
            X[column_names[i]] = matrix.T[i]

        # Encode Sex (Female/Male)
        matrix = encoder.fit_transform(X[['Sex']]).toarray()
        column_names = ["Female", "Male"]
        for i in range(len(matrix.T)):
            X[column_names[i]] = matrix.T[i]
            
        return X

In [None]:
class FeatureDropper(BaseEstimator, TransformerMixin):
    """
    Custom transformer to drop unnecessary features.
    
    Drops:
    - Embarked, Sex: Already encoded as binary columns
    - Name, Ticket, Cabin: High cardinality, not useful for this model
    - N: Placeholder column from encoding
    """
    
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.drop(["Embarked", "Name", "Ticket", "Cabin", "Sex", "N"], axis=1, errors="ignore")

In [None]:
from sklearn.pipeline import Pipeline

# Create a pipeline that sequentially applies all preprocessing steps
# This ensures consistency between training and test data
pipeline = Pipeline([("ageimputer", AgeImputer()),
                     ("featureencoder", FeatureEncoder()),
                     ("featuredropper", FeatureDropper())])

### 5.3 Build Preprocessing Pipeline

Combining all transformers into a single pipeline for reproducible preprocessing.

In [None]:
# Transform the training data through the pipeline
strat_train_set = pipeline.fit_transform(strat_train_set)

### 5.4 Apply Pipeline to Training Data

In [None]:
strat_train_set

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare,C,S,Q,Female,Male
832,833,0,3,29.98407,0,0,7.2292,1.0,0.0,0.0,0.0,1.0
96,97,0,1,71.00000,0,0,34.6542,1.0,0.0,0.0,0.0,1.0
878,879,0,3,29.98407,0,0,7.8958,0.0,0.0,1.0,0.0,1.0
288,289,1,2,42.00000,0,0,13.0000,0.0,0.0,1.0,0.0,1.0
777,778,1,3,5.00000,0,0,12.4750,0.0,0.0,1.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
815,816,0,1,29.98407,0,0,0.0000,0.0,0.0,1.0,0.0,1.0
805,806,0,3,31.00000,0,0,7.7750,0.0,0.0,1.0,0.0,1.0
142,143,1,3,24.00000,1,0,15.8500,0.0,0.0,1.0,1.0,0.0
594,595,0,2,37.00000,1,0,26.0000,0.0,0.0,1.0,0.0,1.0


In [None]:
# Verify all values are present and numerical
strat_train_set.info()

### Verify Preprocessing Results

After preprocessing:
- All 12 features are numerical
- No missing values remain
- Features ready for scaling and model training

In [None]:
from sklearn.preprocessing import StandardScaler

# Separate features from target variable
X = strat_train_set.drop(['Survived'], axis=1)
y = strat_train_set['Survived']

# Standardize features (mean=0, std=1)
scaler = StandardScaler()
X_data = scaler.fit_transform(X)
y_data = y.to_numpy()

## 6. Feature Scaling & Train/Test Preparation

Separating features (X) from target (y) and applying standardization. StandardScaler ensures all features have mean=0 and std=1, which helps many ML algorithms converge faster.

In [None]:
X_data

array([[ 1.52424857e+00,  8.27893418e-01, -2.76416267e-16, ...,
        -1.60558072e+00, -7.35612358e-01,  7.35612358e-01],
       [-1.34128144e+00, -1.56828591e+00,  3.19121416e+00, ...,
        -1.60558072e+00, -7.35612358e-01,  7.35612358e-01],
       [ 1.70334420e+00,  8.27893418e-01, -2.76416267e-16, ...,
         6.22827609e-01, -7.35612358e-01,  7.35612358e-01],
       ...,
       [-1.16218581e+00,  8.27893418e-01, -4.65586165e-01, ...,
         6.22827609e-01,  1.35941164e+00, -1.35941164e+00],
       [ 5.97623379e-01, -3.70196244e-01,  5.45869244e-01, ...,
         6.22827609e-01, -7.35612358e-01,  7.35612358e-01],
       [-5.85965102e-01, -1.56828591e+00, -3.09977640e-01, ...,
         6.22827609e-01,  1.35941164e+00, -1.35941164e+00]])

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

clf = RandomForestClassifier()

# Define hyperparameter grid to search
param_grid = [
    {"n_estimators": [10, 100, 200, 500], "max_depth": [None, 5, 10], "min_samples_split": [2,3,4]}
]

# Perform grid search with 3-fold cross-validation
grid_search = GridSearchCV(clf, param_grid, cv=3, scoring="accuracy", return_train_score=True)
grid_search.fit(X_data, y_data)

## 7. Model Training & Hyperparameter Tuning

### Random Forest Classifier with GridSearchCV

Using **Random Forest** because:
- Handles non-linear relationships well
- Resistant to overfitting
- Provides feature importance
- No assumptions about data distribution

**GridSearchCV** exhaustively searches through specified hyperparameters:
- `n_estimators`: Number of trees in the forest
- `max_depth`: Maximum depth of each tree
- `min_samples_split`: Minimum samples required to split a node

Using 3-fold cross-validation to find the best combination.

In [None]:
# Get the best model from grid search
final_clf = grid_search.best_estimator_

### Extract Best Model

GridSearchCV automatically identifies the best performing hyperparameters based on cross-validation accuracy.

In [None]:
final_clf

0,1,2
,n_estimators,200
,criterion,'gini'
,max_depth,
,min_samples_split,3
,min_samples_leaf,1
,min_weight_fraction_leaf,0.0
,max_features,'sqrt'
,max_leaf_nodes,
,min_impurity_decrease,0.0
,bootstrap,True


**Best Hyperparameters Found:**
- n_estimators: 200
- min_samples_split: 3
- max_depth: None (unlimited)

In [None]:
# Apply the same preprocessing pipeline to test data
strat_test_set = pipeline.fit_transform(strat_test_set)

## 8. Model Evaluation

### 8.1 Prepare Test Set

Applying the same preprocessing pipeline to the test set.

In [None]:
strat_test_set

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare,C,S,Q,Female,Male
374,375,0,3,3.000000,3,1,21.0750,0.0,0.0,1.0,1.0,0.0
519,520,0,3,32.000000,0,0,7.8958,0.0,0.0,1.0,0.0,1.0
557,558,0,1,28.571181,0,0,227.5250,1.0,0.0,0.0,0.0,1.0
83,84,0,1,28.000000,0,0,47.1000,0.0,0.0,1.0,0.0,1.0
173,174,0,3,21.000000,0,0,7.9250,0.0,0.0,1.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...
312,313,0,2,26.000000,1,1,26.0000,0.0,0.0,1.0,1.0,0.0
672,673,0,2,70.000000,0,0,10.5000,0.0,0.0,1.0,0.0,1.0
492,493,0,1,55.000000,0,0,30.5000,0.0,0.0,1.0,0.0,1.0
240,241,0,3,28.571181,1,0,14.4542,1.0,0.0,0.0,1.0,0.0


In [None]:
# Prepare test features and labels, then standardize
X_test = strat_test_set.drop(['Survived'], axis=1)
y_test = strat_test_set['Survived']

scaler = StandardScaler()
X_data_test = scaler.fit_transform(X_test)
y_data_test = y_test.to_numpy()

In [None]:
# Evaluate model accuracy on test set
# Result: ~78.8% accuracy
final_clf.score(X_data_test, y_data_test)

### 8.2 Test Set Accuracy

Evaluating the model on unseen data to get an unbiased estimate of performance.

In [None]:
# Apply preprocessing to the entire training dataset
final_data = pipeline.fit_transform(titanic_data)

## 9. Production Model Training

### 9.1 Train on Full Dataset

Now that we've validated our approach, train the final model on ALL available training data (not just the 80% split) for maximum performance.

In [None]:
final_data

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare,C,S,Q,Female,Male
0,1,0,3,22.000000,1,0,7.2500,0.0,0.0,1.0,0.0,1.0
1,2,1,1,38.000000,1,0,71.2833,1.0,0.0,0.0,1.0,0.0
2,3,1,3,26.000000,0,0,7.9250,0.0,0.0,1.0,1.0,0.0
3,4,1,1,35.000000,1,0,53.1000,0.0,0.0,1.0,1.0,0.0
4,5,0,3,35.000000,0,0,8.0500,0.0,0.0,1.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,27.000000,0,0,13.0000,0.0,0.0,1.0,0.0,1.0
887,888,1,1,19.000000,0,0,30.0000,0.0,0.0,1.0,1.0,0.0
888,889,0,3,29.699118,1,2,23.4500,0.0,0.0,1.0,1.0,0.0
889,890,1,1,26.000000,0,0,30.0000,1.0,0.0,0.0,0.0,1.0


In [None]:
# Prepare features and target for production model
X_final = final_data.drop(['Survived'], axis=1)
y_final = final_data['Survived']

# Standardize the full dataset
scaler = StandardScaler()
X_data_final = scaler.fit_transform(X_final)
y_data_final = y_final.to_numpy()

In [None]:
# Initialize a new classifier for production
prod_clf = RandomForestClassifier()

param_grid = [
    {
        "n_estimators": [10, 100, 200, 500],
        "max_depth": [None, 5, 10],
        "min_samples_split": [2, 3, 4]
    }
]

# Perform grid search on the full training set
grid_search = GridSearchCV(prod_clf, param_grid, cv=3, scoring="accuracy", return_train_score=True)
grid_search.fit(X_data_final, y_data_final)

# Assign the best model
prod_final_clf = grid_search.best_estimator_

### 9.2 Retrain with Hyperparameter Tuning

Perform another grid search on the full dataset to ensure optimal hyperparameters.

In [None]:
prod_final_clf

0,1,2
,n_estimators,100
,criterion,'gini'
,max_depth,5
,min_samples_split,4
,min_samples_leaf,1
,min_weight_fraction_leaf,0.0
,max_features,'sqrt'
,max_leaf_nodes,
,min_impurity_decrease,0.0
,bootstrap,True


**Final Production Model Parameters:**
- max_depth: 5
- min_samples_split: 4
- n_estimators: 10 (default)

In [None]:
# Load the Kaggle test dataset (no 'Survived' column)
titanic_test_data = pd.read_csv("../data/test.csv")

## 10. Generate Kaggle Predictions

### 10.1 Load and Preprocess Kaggle Test Set

In [None]:
# Apply the same preprocessing pipeline to Kaggle test data
final_test_data = pipeline.fit_transform(titanic_test_data)

In [None]:
final_test_data

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare,C,S,Q,Female,Male
0,892,3,34.50000,0,0,7.8292,0.0,1.0,0.0,0.0,1.0
1,893,3,47.00000,1,0,7.0000,0.0,0.0,1.0,1.0,0.0
2,894,2,62.00000,0,0,9.6875,0.0,1.0,0.0,0.0,1.0
3,895,3,27.00000,0,0,8.6625,0.0,0.0,1.0,0.0,1.0
4,896,3,22.00000,1,1,12.2875,0.0,0.0,1.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
413,1305,3,30.27259,0,0,8.0500,0.0,0.0,1.0,0.0,1.0
414,1306,1,39.00000,0,0,108.9000,1.0,0.0,0.0,1.0,0.0
415,1307,3,38.50000,0,0,7.2500,0.0,0.0,1.0,0.0,1.0
416,1308,3,30.27259,0,0,8.0500,0.0,0.0,1.0,0.0,1.0


In [None]:
# Prepare final test features
X_final_test = final_test_data
X_final_test = X_final_test.ffill()  # Forward fill any remaining NaN values

# Standardize using the same scaler
scaler = StandardScaler()
X_data_final_test = scaler.fit_transform(X_final_test)

In [None]:
# Use the production model to predict survival on Kaggle test set
predictions = prod_final_clf.predict(X_data_final_test)

### 10.2 Generate Predictions

In [None]:
# Create submission DataFrame with PassengerId and Survived columns
final_df = pd.DataFrame(titanic_test_data['PassengerId'])
final_df['Survived'] = predictions

# Save to CSV for Kaggle submission
final_df.to_csv("../data/predictions.csv", index=False)

### 10.3 Export Predictions to CSV

Creating the submission file in Kaggle's required format: PassengerId, Survived

In [None]:
final_df

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,1
...,...,...
413,1305,0
414,1306,1
415,1307,0
416,1308,0


---

## Summary

**Project Accomplishments:**
- Built a machine learning pipeline for Titanic survival prediction
- Achieved **78.8% accuracy** on validation set
- Implemented custom preprocessing transformers for reproducibility
- Used GridSearchCV to optimize Random Forest hyperparameters
- Generated predictions for 418 test passengers

**Key Techniques Used:**
- Stratified train-test split to preserve class distributions
- Mean imputation for missing Age values
- One-hot encoding for categorical features
- Feature standardization with StandardScaler
- Random Forest with hyperparameter tuning

**Next Steps for Improvement:**
- Feature engineering (FamilySize, Title extraction from names)
- Try ensemble methods (XGBoost, Gradient Boosting)
- Implement SMOTE for class imbalance
- Add cross-validation curves and learning curves
- Feature importance analysis

In [None]:
import os
print(f"Current working directory: {os.getcwd()}")
print(f"Does ../data/train.csv exist? {os.path.exists('../data/train.csv')}")
print(f"Does data/train.csv exist? {os.path.exists('data/train.csv')}")

# List what's in the parent directory
if os.path.exists('..'):
    print(f"\nContents of parent directory: {os.listdir('..')}")

Current working directory: /Users/tyrbujac/Documents/Developments/2025/Juypter/titanic/notebooks
Does ../data/train.csv exist? True
Does data/train.csv exist? False

Contents of parent directory: ['.DS_Store', 'requirements.txt', '.claude', 'README.md', '.gitignore', '.ipynb_checkpoints', '.git', 'main.py', 'data', 'notebooks', 'src']


In [None]:
import os
print("Current working directory:", os.getcwd())
print("\nChecking if paths exist:")
print("'data/train.csv' exists:", os.path.exists('data/train.csv'))
print("'../data/train.csv' exists:", os.path.exists('../data/train.csv'))