# Titanic Survival Prediction

This notebook walks through the analysis, visualization, and prediction of survival on the Titanic using machine learning models. The dataset is available from the Kaggle Titanic competition, and the objective is to predict survival outcomes based on passenger data.


# Importing Necessary Libraries

In this step, we import the essential libraries required for data analysis, preprocessing, visualization, and machine learning model training and evaluation. Each library serves a specific purpose, as described below:

- **NumPy (`np`)**: A fundamental package for numerical computing in Python, used here for data manipulation and mathematical operations.
- **Pandas (`pd`)**: A powerful library for data manipulation and analysis, enabling us to load, clean, and process the Titanic dataset efficiently.
- **Seaborn (`sns`)** and **Matplotlib (`plt`)**: Visualization libraries used to create insightful graphs and charts for data exploration. `Seaborn` provides a high-level interface for drawing attractive and informative statistical graphics.
- **Scikit-learn (`sklearn`)**: The core machine learning library in Python, providing modules for:
  - **Data Preprocessing**: `OneHotEncoder` for encoding categorical variables.
  - **Model Selection**: `train_test_split` for data splitting, and `GridSearchCV` for hyperparameter tuning.
  - **Metrics**: `accuracy_score`, `classification_report`, and `confusion_matrix` for model evaluation.
  - **Algorithms**: Various classifiers like `RandomForestClassifier`, `LogisticRegression`, `DecisionTreeClassifier`, `SVC`, and more.
- **XGBoost (`XGBClassifier`)**: A popular gradient-boosting framework known for its speed and accuracy in structured/tabular data.
- **CatBoost (`CatBoostClassifier`)**: A gradient-boosting framework that handles categorical features natively and is known for its efficiency and accuracy.
- **Warnings**: `warnings.filterwarnings('ignore')` suppresses all warnings for cleaner output during model training and hyperparameter tuning. Note that this also hides convergence and deprecation warnings.

With these libraries imported, we’re ready to load, explore, and preprocess the data, as well as build and evaluate a variety of machine learning models for predicting survival on the Titanic.


In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.ensemble import (
    RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier
from catboost import CatBoostClassifier
import warnings
warnings.filterwarnings('ignore')

# Import data

In [None]:
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

# Exploratory Data Analysis

Data overview
- PassengerId: Unique id for each passenger. No effect on the target feature.
- Survived: Whether or not the passenger survived. This is the target feature.
  - 0 = No, 1 = Yes
- Pclass: Reflects the socio-economic status of the passenger.
  - 1 = 1st, Upper Class
  - 2 = 2nd, Middle Class
  - 3 = 3rd, Lower Class
- Name: The name of the passenger. Includes the title of the passenger, such as "Mr.", "Mrs.", and "Master.".
- Sex: Gender of the passenger, either "male" or "female".
- Age: The age of the passenger in years.
- SibSp: # of siblings / spouses aboard the Titanic.
- Parch: # of parents / children aboard the Titanic.
- Ticket: The Ticket number.
- Fare: Passenger fare.
- Cabin: Cabin number of the passenger.
- Embarked: Which port the passenger embarked from.
  - C = Cherbourg
  - Q = Queenstown
  - S = Southampton

In [None]:
train_data.head()

In [None]:
train_data.info()

In [None]:
train_data.shape

In [None]:
test_data.head()

In [None]:
test_data.info()

In [None]:
test_data.shape

# Features and Survival

### Age Distribution by Survival Status
This plot compares the age distribution between passengers who survived and those who did not.


In [None]:
plt.figure(figsize=(6, 4))

plt.hist(train_data[train_data['Survived'] == 1]['Age'], bins=30,
         alpha=0.5, label='Survived', color='green', edgecolor='black')
plt.hist(train_data[train_data['Survived'] == 0]['Age'], bins=30,
         alpha=0.5, label='Did Not Survive', color='red', edgecolor='black')

plt.title('Age Distribution by Survival Status')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.legend()
plt.grid(True)

plt.show()

- Younger passengers, especially those < 5 years old, seem to survive at a higher rate.
- Older passengers seem to have a lower survival rate, especially around 40 - 75 years old.

### Fare Distribution by Survival Status
This plot shows the distribution of fares paid by passengers who survived and those who did not.


In [None]:
plt.figure(figsize=(6, 4))

plt.hist(train_data[train_data['Survived'] == 1]['Fare'], bins=30,
         alpha=0.5, label='Survived', color='green', edgecolor='black')
plt.hist(train_data[train_data['Survived'] == 0]['Fare'], bins=30,
         alpha=0.5, label='Did Not Survive', color='red', edgecolor='black')

plt.title('Fare Distribution by Survival Status')
plt.xlabel('Fare')
plt.ylabel('Frequency')
plt.legend()
plt.grid(True)

plt.show()

- Passengers who paid lower fares seem to have survived less often.
- Passengers who paid higher fares seem to have survived more often.
- This can be correlated with the socio-economic status of the passenger.

### Survival Counts by Sex
This bar chart displays the survival counts based on the passenger’s sex.


In [None]:
df = train_data.copy()
df['Sex_Label'] = df['Sex'].map({'male': 'Male', 'female': 'Female'})

survival_counts = df.groupby(['Sex_Label', 'Survived']).size().unstack()
survival_counts.plot(kind='bar', stacked=False,
                     figsize=(6, 4), color=['salmon', 'skyblue'])

plt.title('Survival Counts by Sex')
plt.xlabel('Sex')
plt.ylabel('Number of People')
plt.xticks(rotation=0)
plt.legend(title='Survived', labels=['Did not Survive', 'Survived'])
plt.grid(True)

plt.show()

- Females have a higher survival rate than males, which aligns with evacuation protocols that prioritized women and children.
- Males have a significantly lower survival rate, likely due to this prioritization.


### Survival Counts by Embarkation Point
This plot displays survival counts based on the port where passengers embarked: Southampton (S), Cherbourg (C), or Queenstown (Q).


In [None]:
survival_counts = train_data.groupby(['Embarked', 'Survived']).size().unstack()

survival_counts.plot(kind='bar', stacked=False,
                     figsize=(6, 4), color=['salmon', 'skyblue'])

plt.title('Survival Counts by Embarked')
plt.xlabel('Embarked')
plt.ylabel('Number of People')
plt.xticks(rotation=0)
plt.legend(title='Survived', labels=['Did not Survive', 'Survived'])
plt.grid(True)

plt.show()

- Passengers from Cherbourg (C) had the highest survival rates, possibly reflecting higher socio-economic status.
- Passengers from Southampton (S) had the lowest survival rates.


### Survival Counts by Passenger Class
This bar chart shows the survival counts based on passenger class (1st, 2nd, and 3rd class).


In [None]:
survival_counts = train_data.groupby(['Pclass', 'Survived']).size().unstack()

survival_counts.plot(kind='bar', stacked=False,
                     figsize=(6, 4), color=['salmon', 'skyblue'])

plt.title('Survival Counts by Pclass')
plt.xlabel('Pclass')
plt.ylabel('Number of People')
plt.xticks(rotation=0)
plt.legend(title='Survived', labels=['Did not Survive', 'Survived'])
plt.grid(True)

plt.show()

- 1st class passengers were the most likely to survive, while 3rd class passengers were the most likely to die.
- Socio-economic status seems to play a part in survival.
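
The charts above show counts, which are hard to compare when the groups differ in size. A minimal sketch of computing survival rates instead, on a small made-up frame (`toy` and its values are hypothetical, not the Kaggle data):

```python
import pandas as pd

# Hypothetical toy data, not the Kaggle file.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the share of 1s, i.e. the survival rate per group.
survival_rate = toy.groupby("Pclass")["Survived"].mean()
print(survival_rate)
```

On the real data, the same `groupby` applied to `train_data` would give the per-class rates directly, and the grouping column could be swapped for `Sex` or `Embarked`.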

### Feature Engineering: Fill Missing Values
First, we identify missing values in the training and test datasets to determine the most appropriate imputation strategy for each feature.


In [None]:
train_data.isnull().sum()

Missing values in train data:

- Age (177 missing values)
- Cabin (687 missing values)
- Embarked (2 missing values)

In [None]:
test_data.isnull().sum()

Missing values in test data:

- Age (86 missing values)
- Cabin (327 missing values)
- Fare (1 missing value)

### Combine Datasets
We combine the training and test data for consistent preprocessing and feature engineering. Later, we’ll split them back into separate datasets.


In [None]:
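# ignore_index=True gives every row a unique label (otherwise the test rows would reuse the training rows' labels)
all_data = pd.concat([train_data, test_data], ignore_index=True)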

**Update Sex to binary value** 

- I change the Sex feature to a binary value, 0 = male, 1 = female. Later, this will be changed to one-hot encoded values.

In [None]:
all_data['Sex'] = all_data['Sex'].map({'male':0,'female':1})

### Impute Missing Ages Using `Pclass`
To fill in missing `Age` values, we use the median age within each passenger class (`Pclass`), since `Pclass` is the feature most strongly correlated with `Age` in the heatmap below.


In [None]:
features = ["Survived", "Pclass", "Sex", "SibSp", "Parch", "Fare", "Age"]

# Only select rows without missing Age values
age_present = all_data[all_data['Age'].notna()]
age_data = age_present[features]

corr = age_data.corr()

plt.figure(figsize=(7, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix Heatmap')
plt.show()

- `Pclass` is the feature most strongly correlated with `Age`. The missing `Age` values are therefore filled with the median age of each `Pclass`, preserving meaningful differences in age distribution across classes.
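
To make the group-wise fill concrete, here is a minimal sketch on a made-up frame (`toy` and its ages are hypothetical). It shows why `transform('median')` is convenient: it returns one value per row, aligned with the original column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing Age in each class.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age":    [40.0, 50.0, np.nan, 20.0, 30.0, np.nan],
})

# transform('median') broadcasts each class median back to every row of that class,
# so the result can be passed straight to fillna.
class_median = toy.groupby("Pclass")["Age"].transform("median")
toy["Age"] = toy["Age"].fillna(class_median)
print(toy)
```

Only the missing entries change; the observed ages are left untouched.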


In [None]:
median_ages = all_data.groupby("Pclass")['Age'].transform('median')
all_data['Age'] = all_data['Age'].fillna(median_ages)

### Extract Deck and Create Cabin Missing Indicator

- I replace the `Cabin` column with the following two features:
    - Deck: The first letter in the cabin number. Set to "U" if Cabin is missing.
    - CabinMissing: 1 if cabin is missing, 0 otherwise.

In [None]:
all_data['Deck'] = all_data['Cabin'].str[0].fillna('U')
all_data['CabinMissing'] = all_data['Cabin'].isna().astype(int)

In [None]:
survival_counts = all_data[:891].groupby(['Deck','Survived']).size().unstack()

survival_counts.plot(kind='bar', stacked=False, figsize=(10,6), color = ['salmon','skyblue'])

plt.title('Survival Counts by Deck')
plt.xlabel('Deck')
plt.ylabel('Number of People')
plt.xticks(rotation=0)
plt.legend(title='Survived', labels=['Did not Survive', 'Survived'])
plt.grid(True)

plt.show()

- `Deck` is extracted from `Cabin` and filled with 'U' for missing values.
- The `CabinMissing` indicator helps capture missing information which may correlate with survival.


### Impute Missing Embarked Values

- I fill the missing values with the most common value among similar rows.
- There are only 2 missing values, so I set them to the most common `Embarked` value among passengers who are also female and on Deck "B".

In [None]:
missing_embarked = all_data[all_data["Embarked"].isna()]
missing_embarked

In [None]:
def mode_or_nan(series):
    mode = series.mode()
    return mode[0] if not mode.empty else np.nan

all_data['Embarked'] = all_data.groupby(['Sex','Deck'])['Embarked'].transform(lambda x: x.fillna(mode_or_nan(x)))

- Missing `Embarked` values are filled with the mode of similar rows based on `Sex` and `Deck`.
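
A small sketch of how this group-wise mode fill behaves, including the edge case where a group has no known value at all. The frame `toy` is hypothetical; `mode_or_nan` mirrors the helper above.

```python
import numpy as np
import pandas as pd

def mode_or_nan(series):
    """Return the most frequent non-null value, or NaN if there is none."""
    mode = series.mode()
    return mode.iloc[0] if not mode.empty else np.nan

# Hypothetical toy data: group (1, 'B') has a clear mode,
# group (0, 'U') has no known Embarked value.
toy = pd.DataFrame({
    "Sex":      [1, 1, 1, 1, 0],
    "Deck":     ["B", "B", "B", "B", "U"],
    "Embarked": ["S", "S", "C", np.nan, np.nan],
})

toy["Embarked"] = toy.groupby(["Sex", "Deck"])["Embarked"].transform(
    lambda x: x.fillna(mode_or_nan(x))
)
print(toy)
```

The row in group `(1, 'B')` is filled with `S`, while the row in the all-missing group stays missing, so it is worth re-checking `isnull().sum()` after this kind of fill.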


### Impute Missing Fare Values

- I fill the missing fare with the mean fare of passengers in the same `Pclass`, the feature most strongly correlated with `Fare`.

In [None]:
all_data[all_data.Fare.isna()]

In [None]:
all_data["Fare"] = all_data.groupby("Pclass")["Fare"].transform(
    lambda x: x.fillna(x.mean())
)

- Missing `Fare` values are filled based on the mean fare of passengers in the same `Pclass`.


# New Features

### Create New Features
1. `TicketCount`: The number of passengers with the same ticket, indicating group size.
2. `FamilySize`: `SibSp + Parch + 1`, i.e. the passenger plus any siblings, spouses, parents, and children aboard.


In [None]:
all_data["TicketCount"] = all_data.groupby(
    "Ticket")["Ticket"].transform("count")

In [None]:
all_data["FamilySize"] = all_data["SibSp"] + all_data["Parch"] + 1

- `TicketCount` might indicate group size, which can impact survival probability.
- `FamilySize` can affect survival as passengers traveling alone may face different survival chances than those with family.


### Create IsAlone Feature
Create `IsAlone`, which is 1 if `FamilySize` is 1, indicating the passenger is traveling alone.

In [None]:
all_data["IsAlone"] = (all_data["FamilySize"] == 1).astype(int)

- `IsAlone` could reveal differences in survival rates based on whether a passenger is alone or with family.

### Create Age Bins
Bin `Age` into categorical ranges to simplify age-based differences in survival rates.


In [None]:
all_data["Age"].describe()

In [None]:
bins = [x*10 for x in range(10)]
labels = range(1, len(bins))

all_data["AgeBin"] = pd.cut(
    all_data["Age"], bins=bins, labels=labels, right=False)

- `AgeBin` allows us to group passengers by age range, which can capture non-linear relationships between age and survival.
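
The bin edges deserve a quick check, because `right=False` makes each interval closed on the left and open on the right. A sketch with a few hand-picked ages (the `ages` values are illustrative only):

```python
import pandas as pd

bins = [x * 10 for x in range(10)]   # edges 0, 10, ..., 90
labels = list(range(1, len(bins)))   # labels 1..9, one per interval

# Illustrative ages chosen to sit on or near the bin edges.
ages = pd.Series([0.42, 9.0, 10.0, 80.0, 90.0])
age_bin = pd.cut(ages, bins=bins, labels=labels, right=False)
print(age_bin.tolist())
```

An age of exactly 10 lands in bin 2 (`[10, 20)`), 80 lands in bin 9 (`[80, 90)`), and anything at or above 90 falls outside the last edge and becomes missing.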


### Extract and Group Titles from Names

- The title is extracted from the `Name` feature of each passenger.
- It seems to carry potential information such as their gender (Mr. for male, Mrs. for female) and their age (Master. is given to boys).

In [None]:
all_data["Title"] = all_data["Name"].apply(
    lambda x: x.split(',')[1].split()[0])
all_data["Title"].unique()

In [None]:
plt.figure(figsize=(7, 6))
all_data["Title"].value_counts().plot(kind='bar', color='skyblue')
plt.show()

In [None]:
match_list = ["the", "Jonkheer.", "Dona.", "Mlle.", "Mme.", "Don."]

all_data[all_data["Name"].apply(
    lambda x: x.split(',')[1].split()[0]).isin(match_list)]

- Jonkheer., Dona., Mlle., Mme. and Don. appear to be genuine, if uncommon, titles (for example, Mlle. and Mme. are French abbreviations for Mademoiselle and Madame).
- "the" is for PassengerId 760. It's followed by the term "Countess".

I'll group together these titles by similarity into one of four groups:

- Mr
- Mrs/Miss
- Master
- Officer/Professional

In [None]:
replacements = {
    'Mr.': 'Mr',
    'Mrs.': 'Mrs/Miss',
    'Miss.': 'Mrs/Miss',
    'Master.': 'Master',
    'Don.': 'Mr',
    'Rev.': 'Officer/Professional',
    'Dr.': 'Officer/Professional',
    'Mme.': 'Mrs/Miss',
    'Ms.': 'Mrs/Miss',
    'Major.': 'Officer/Professional',
    'Lady.': 'Mrs/Miss',
    'Sir.': 'Mr',
    'Mlle.': 'Mrs/Miss',
    'Col.': 'Officer/Professional',
    'Capt.': 'Officer/Professional',
    'the': 'Mrs/Miss',
    'Jonkheer.': 'Mr',
    'Dona.': 'Mrs/Miss'
}

all_data["Title"] = all_data["Title"].replace(replacements)

all_data["Title"].unique()

- Titles can provide insight into social status or gender and age (e.g., Master for young boys).
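
The whitespace split used above keeps only the first token after the comma, which is why the Countess shows up as "the". As a hedged alternative (not the approach used in this notebook), a regular expression can capture everything between the comma and the first period:

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Whitespace split, as in the cell above: first token after the comma.
split_title = names.apply(lambda x: x.split(",")[1].split()[0])

# Regex alternative: text between the comma and the first period.
regex_title = names.str.extract(r",\s*([^.]+)\.", expand=False)

print(split_title.tolist())
print(regex_title.tolist())
```

The regex variant drops the trailing period, so the keys of a replacement dictionary would need to be written without it (`'Mr'` rather than `'Mr.'`).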


### One-Hot Encode Categorical Variables
Convert categorical variables (`Pclass`, `Title`, `Embarked`, `Deck`, and `Sex` via `OneHotEncoder`, then `AgeBin` via `pd.get_dummies`) into one-hot encoded features for model compatibility.


In [None]:
one_hot_columns = ["Pclass", "Title", "Embarked", "Deck", "Sex"]

encoder = OneHotEncoder(sparse_output=False, dtype=int)
encoded_features = encoder.fit_transform(all_data[one_hot_columns])
encoded_df = pd.DataFrame(
    encoded_features, columns=encoder.get_feature_names_out(one_hot_columns))
encoded_df.index = all_data.index

In [None]:
df_final = pd.concat([all_data, encoded_df], axis=1)

In [None]:
# One-hot encode AgeBin with pandas, since it was not passed to the OneHotEncoder above
df_final = pd.get_dummies(df_final, columns=["AgeBin"])

In [None]:
df_final.columns

- One-hot encoding allows categorical data to be represented numerically, making it suitable for model input.


### Drop Irrelevant Columns
Remove columns that are no longer needed after feature engineering (e.g., original categorical columns and IDs).


In [None]:
drop_cols = ["Ticket", "Cabin", "Name", "PassengerId", "Age",
             "SibSp", "Parch", "Pclass", "Embarked", "Title", "Deck", "Sex"]

df_final.drop(columns=drop_cols, inplace=True)

- Dropping these columns reduces dimensionality, keeping only relevant engineered features.


In [None]:
df_final.columns

In [None]:
df_final.head()

### Split Data Back into Train and Test Sets
After feature engineering, separate the data back into training and test sets for model training and evaluation.


In [None]:
# Split the data back into the train and test sets
df_train = df_final[:891]
df_test = df_final[891:]

In [None]:
print(df_train.shape)
print(df_test.shape)

In [None]:
# Reassign instead of dropping in place, because df_test is a slice of df_final
df_test = df_test.drop(columns=['Survived'])

In [None]:
X_train = df_train.drop(columns="Survived")
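# Survived became float when the unlabelled test rows were appended, so cast it back to int
y_train = df_train["Survived"].astype(int).values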

X_test = df_test

- Data is split back into training and testing sets with the target variable (`Survived`) separated for training.
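
Slicing at row 891 relies on knowing the length of the training set. A sketch of an alternative that infers the split from the labels, assuming (as is the case here) that the test rows are exactly the rows with a missing `Survived`. The frames `train` and `test` below are hypothetical stand-ins:

```python
import pandas as pd

# Hypothetical stand-ins for the labelled and unlabelled files.
train = pd.DataFrame({"PassengerId": [1, 2, 3], "Survived": [0, 1, 1]})
test = pd.DataFrame({"PassengerId": [4, 5]})

combined = pd.concat([train, test], ignore_index=True)

# Test rows have no label, so Survived is NaN there and the column becomes float.
is_train = combined["Survived"].notna()
train_part = combined[is_train].copy()
test_part = combined[~is_train].drop(columns="Survived")

# Restore the integer labels on the training part.
train_part["Survived"] = train_part["Survived"].astype(int)
print(train_part.shape, test_part.shape)
```

This keeps working if rows are filtered or reordered during feature engineering, where a hard-coded slice would silently pick the wrong rows.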


In [None]:
X_train.head()

# Model Selection and Hyperparameter Tuning

We use various machine learning algorithms to predict Titanic survival, including:
- **Random Forest**: A robust ensemble method that performs well on tabular data.
- **Logistic Regression**: A simple, interpretable linear model often used in binary classification.
- **Artificial Neural Network (ANN)**: A basic neural network that can capture complex patterns.
- **Decision Tree**: A non-linear, highly interpretable model that splits the data on feature thresholds.
- **Support Vector Machine (SVM)**: A classifier that looks for a maximum-margin decision boundary, optionally in a kernel-induced feature space.
- **k-Nearest Neighbors (k-NN)**: A non-parametric method that relies on feature similarity.
- **Naive Bayes**: A simple probabilistic classifier based on Bayes' theorem.
- **AdaBoost and Gradient Boosting**: Ensemble methods that iteratively improve weak learners.
- **XGBoost and CatBoost**: Gradient-boosting models optimized for performance on structured data.

For each model, we define a parameter grid for `GridSearchCV` to perform hyperparameter tuning and select the best combination of parameters using cross-validation. This will optimize each model’s performance by finding the most suitable parameter values.


In [None]:
# Define all models with base parameters for GridSearchCV
models = {
    'Random Forest': RandomForestClassifier(class_weight='balanced'),
    'Logistic Regression': LogisticRegression(max_iter=200),
    'Artificial Neural Network': MLPClassifier(max_iter=200),
    'Decision Tree': DecisionTreeClassifier(criterion='gini'),
    'Support Vector Machine': SVC(),
    'k-Nearest Neighbors': KNeighborsClassifier(),
    'Naive Bayes': GaussianNB(),
    'AdaBoost': AdaBoostClassifier(),
    'Gradient Boosting': GradientBoostingClassifier(),
    'XGBoost': XGBClassifier(use_label_encoder=False, eval_metric='logloss'),
    'CatBoost': CatBoostClassifier(verbose=0)
}

This dictionary contains the initialized models, each with some default parameters to be further tuned with `GridSearchCV`. Each model has different characteristics that may contribute to improved performance in predicting survival on the Titanic dataset.


### Defining Parameter Grids for GridSearchCV

Here, we define the parameter grids for each model. These grids contain values for each hyperparameter to test during cross-validation. Notable parameters for each model include:
- **Random Forest**: Number of estimators (trees), maximum tree depth, minimum samples per split, etc.
- **Logistic Regression**: Regularization strength, penalty type, solver type, and maximum iterations.
- **Artificial Neural Network**: Hidden layer sizes, solver, L2 penalty (`alpha`), learning-rate schedule, and maximum iterations.
- **Support Vector Machine**: Penalty parameter (C), kernel type, and gamma for non-linear kernels.
- **AdaBoost** and **Gradient Boosting**: Learning rate, number of estimators, and parameters for weak learners.
- **XGBoost and CatBoost**: Number of boosting iterations, learning rate, and regularization parameters, among others.

These parameter grids will be used in `GridSearchCV` to test multiple combinations, identifying the optimal parameter set for each model.


In [None]:
# Define parameter grids for models that will use GridSearchCV
param_grids = {
    'Random Forest': {'n_estimators': [100, 200, 300], 'max_depth': [5, 10, 15, None], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4]},
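    # Only valid solver/penalty pairs are listed; 'elasticnet' is omitted because it needs 'saga' plus an l1_ratio, which this grid does not set
    'Logistic Regression': [
        {'C': [0.01, 0.1, 0.5, 1.0, 10], 'solver': ['liblinear', 'saga'], 'penalty': ['l1', 'l2'], 'max_iter': [100, 200, 300]},
        {'C': [0.01, 0.1, 0.5, 1.0, 10], 'solver': ['lbfgs'], 'penalty': ['l2'], 'max_iter': [100, 200, 300]},
    ],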
    'Artificial Neural Network': {'hidden_layer_sizes': [(50, 30), (100,),(100, 50),(150, 100, 50)], 'solver': ['adam','sgd'], 'alpha': [0.0001, 0.001, 0.01], 'learning_rate': ['constant', 'adaptive'], 'max_iter': [200, 300, 500]},
    'Decision Tree': {'max_depth': [5, 10, 15, None], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4], 'criterion': ['gini', 'entropy', 'log_loss']},
    'Support Vector Machine': {'C': [0.1, 0.5, 1.0, 10], 'kernel': ['rbf', 'linear'], 'gamma': ['scale', 'auto']},
    'k-Nearest Neighbors': {'n_neighbors': [3, 5, 7, 10], 'weights': ['uniform', 'distance'], 'metric': ['euclidean', 'manhattan', 'minkowski']},
    'Naive Bayes': {'var_smoothing': [1e-9, 1e-8, 1e-7, 1e-6, 1e-5]},  # Test different levels of variance smoothing
    'AdaBoost': {'n_estimators': [50, 100, 200], 'learning_rate': [0.1, 0.5, 1.0, 1.5]},
    'Gradient Boosting': {'n_estimators': [100, 150, 200, 300], 'learning_rate': [0.01, 0.05, 0.1, 0.2], 'max_depth': [3, 5, 7], 'subsample': [0.7, 0.8, 1.0], 'min_samples_split': [2, 5, 10]},
    'XGBoost': {'n_estimators': [100, 150, 200, 300], 'learning_rate': [0.01, 0.05, 0.1, 0.2], 'max_depth': [3, 5, 7], 'subsample': [0.7, 0.8, 1.0], 'colsample_bytree': [0.5, 0.8, 1.0], 'gamma': [0, 0.1, 0.3], 'reg_alpha': [0, 0.01, 0.1], 'reg_lambda': [1, 1.5, 2]},
    'CatBoost': {'iterations': [100, 150, 200, 300],  'learning_rate': [0.01, 0.05, 0.1, 0.2],  'depth': [3, 5, 7, 10], 'l2_leaf_reg': [1, 3, 5, 7], 'bagging_temperature': [0.0, 0.5, 1.0], 'border_count': [32, 64, 128]}
}

The parameters defined in each grid cover a range of potential values, allowing `GridSearchCV` to find the most effective hyperparameters through cross-validation.
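
To get a feel for the cost of an exhaustive search, the number of candidate combinations can be counted before fitting anything. A sketch using scikit-learn's `ParameterGrid` on two of the grids above (copied here so the snippet runs on its own; `cv_folds = 5` is an assumption matching the 5-fold setting used in the training loop):

```python
from sklearn.model_selection import ParameterGrid

# Copies of two grids from the cell above.
adaboost_grid = {'n_estimators': [50, 100, 200], 'learning_rate': [0.1, 0.5, 1.0, 1.5]}
xgboost_grid = {
    'n_estimators': [100, 150, 200, 300], 'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'max_depth': [3, 5, 7], 'subsample': [0.7, 0.8, 1.0],
    'colsample_bytree': [0.5, 0.8, 1.0], 'gamma': [0, 0.1, 0.3],
    'reg_alpha': [0, 0.01, 0.1], 'reg_lambda': [1, 1.5, 2],
}

cv_folds = 5  # assumed number of cross-validation folds
grid_sizes = {}
for name, grid in {"AdaBoost": adaboost_grid, "XGBoost": xgboost_grid}.items():
    grid_sizes[name] = len(ParameterGrid(grid))
    print(f"{name}: {grid_sizes[name]} candidates, {grid_sizes[name] * cv_folds} fits")
```

The AdaBoost grid has 12 candidates, while the XGBoost grid has 11,664, so most of the runtime is likely to go to the largest grids. If that is too slow, `RandomizedSearchCV` is one option for sampling a subset of the combinations.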


### Train Models and Perform Hyperparameter Tuning

For each model, we use `GridSearchCV` with 5-fold cross-validation on an 80% training split to search for the best hyperparameters. For each algorithm, the combination with the highest cross-validation accuracy is kept and then scored on the held-out 20% validation split.


In [None]:
# Track performance of each model
model_performance = []

# Split the data into training and validation sets
train_x, X_val, train_y, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42)

for model_name, model in models.items():
    print(f"Training {model_name}...")

    # Get the parameter grid for the model
    param_grid = param_grids.get(model_name, {})

    # Initialize best model
    best_model = model

    # If there are parameters to tune, use GridSearchCV
    if param_grid:
        grid_search_cv = GridSearchCV(
            estimator=model, param_grid=param_grid, cv=5, n_jobs=-1, scoring='accuracy', verbose=0)
        grid_search_cv.fit(train_x, train_y)

        # Extract the best model and parameters
        best_model = grid_search_cv.best_estimator_
        best_params = grid_search_cv.best_params_
        best_score = grid_search_cv.best_score_

        print(f"Best Parameters for {model_name}: {best_params}")
        print(
            f"Best Cross-Validation Accuracy for {model_name}: {best_score:.4f}")
    else:
        # If no parameters to tune, fit model as-is
        best_model.fit(train_x, train_y)

    # Predict on validation set with best model
    y_pred = best_model.predict(X_val)

    # Calculate accuracy and detailed metrics
    accuracy = accuracy_score(y_val, y_pred)
    report = classification_report(y_val, y_pred, output_dict=True)

    print(f"{model_name} Validation Accuracy: {accuracy:.4f}")
    print(classification_report(y_val, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_val, y_pred))
    print("\n" + "="*50 + "\n")

    # Append results to model_performance list
    model_performance.append({
        'Model': model_name,
        'Best Parameters': best_params if param_grid else "Default",
        'Cross-Validation Accuracy': best_score if param_grid else np.nan,  # NaN keeps the column numeric for sorting
        'Validation Accuracy': accuracy,
        'Precision': report['weighted avg']['precision'],
        'Recall': report['weighted avg']['recall'],
        'F1 Score': report['weighted avg']['f1-score']
    })


Each model is trained and evaluated on the validation set, with the best-performing hyperparameters applied. Performance metrics, including accuracy, precision, recall, and F1-score, are recorded for comparison.


# Evaluate and Compare Model Performance

After training, we compare all models, ranking them by cross-validation accuracy and reporting validation accuracy alongside, to identify the best-performing model for Titanic survival prediction.


In [None]:
# Convert results to a DataFrame for better comparison
performance_df = pd.DataFrame(model_performance).sort_values(
    by='Cross-Validation Accuracy', ascending=False)
performance_df

This DataFrame ranks the models by cross-validation accuracy, with the validation-set metrics shown alongside for comparison.


### Select and Train the Best Model

We take the top-ranked model from the table above (e.g., Gradient Boosting) and re-run `GridSearchCV` for that model on the full training data, so that the final hyperparameters are selected using all labelled rows.


In [None]:
best_model_name = performance_df.iloc[0]['Model']
print(f"Best Model: {best_model_name}")
print(f"Best Model Parameters: {performance_df.iloc[0]['Best Parameters']}")

# Retrieve the parameter grid and the untuned estimator for the best model
best_param_grid = param_grids[best_model_name]
best_estimator = models[best_model_name]

# Re-run GridSearchCV on the full training data so the final parameters are chosen using all labelled rows
grid_search_cv = GridSearchCV(estimator=best_estimator, param_grid=best_param_grid, cv=5, n_jobs=-1, scoring='accuracy')
grid_search_cv.fit(X_train, y_train)

# Get the best estimator from grid search (refit on all of X_train by default)
final_model = grid_search_cv.best_estimator_

The grid search is repeated for the best model on the entire training set, and the resulting best estimator is used to make predictions on the test data.
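
If a second full search is too expensive, one alternative is to reuse the parameters already found and simply refit on all labelled rows. A hedged sketch on synthetic data (the dataset from `make_classification` and the small `C` grid are assumptions for illustration, not this notebook's data or grids):

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the labelled data.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_fit, X_hold, y_fit, y_hold = train_test_split(X, y, test_size=0.2, random_state=0)

# Search on the 80% split only, as in the training loop.
search = GridSearchCV(LogisticRegression(max_iter=200), {"C": [0.1, 1.0, 10]},
                      cv=5, scoring="accuracy")
search.fit(X_fit, y_fit)

# Reuse the winning parameters on all labelled rows without searching again.
final = clone(search.estimator).set_params(**search.best_params_)
final.fit(X, y)
print(search.best_params_)
```

The trade-off is that the parameters were chosen on 80% of the data; re-running the search uses every row for selection but multiplies the training time.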


### Generate Predictions and Create Submission File

Using the final trained model, we predict survival outcomes on the test dataset and save the results in a CSV file for Kaggle submission.


In [None]:
# Make predictions on the test data
prediction = final_model.predict(X_test)

# Create a submission file
submission = pd.DataFrame(
    {'PassengerId': test_data.PassengerId, 'Survived': prediction.astype(int)})
submission.to_csv('titanic_submission_all_comparison.csv', index=False)