# Titanic Survival Prediction Using Neural Networks

This lab focuses on building and training a neural network model to predict survival on the Titanic.

## Titanic Dataset

In [None]:
from google.colab import drive
drive.mount("/content/gdrive")

In [None]:
import os

root_dir = "PATH/TO/YOUR/DIRECTORY"

# Checking if our specified directory exists
os.path.exists(root_dir)

In [None]:
import pandas as pd

# Path to the downloaded training file
data_path = os.path.join(root_dir, "titanic_train.csv")

# Load data
df = pd.read_csv(data_path)
df

In [None]:
random_state = 100
target = "Survived"

## Data Preprocessing

In [None]:
df.info()

### Variable Selection

Drop variables that are not used as model inputs (`Name`, `PassengerId`, `Ticket`) or that contain many missing values (`Cabin`).

In [None]:
drop_vars = ["Name", "PassengerId", "Ticket", "Cabin"]
df.drop(drop_vars, axis=1, inplace=True)
df.info()

### Missing Value Imputation

* [sklearn.impute.SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer): Univariate imputer for completing missing values with simple strategies.
* [sklearn.impute.KNNImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html#sklearn.impute.KNNImputer): Imputation for completing missing values using k-Nearest Neighbors. Each sample’s missing values are imputed using the mean value from `n_neighbors` nearest neighbors found in the training set.
* [sklearn.impute.IterativeImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html#sklearn.impute.IterativeImputer): Multivariate imputer that models each feature with missing values as a function of the other features, in a round-robin fashion. (Default estimator: `BayesianRidge`)
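
As a rough illustration of how the first two imputers differ, the sketch below runs them on a small made-up array. The values and the names `X_toy`, `X_mean`, and `X_knn` are assumptions for this example and are not part of the Titanic data.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Toy data (assumed for illustration): two cells are missing
X_toy = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Univariate: each gap is filled with its own column's mean
X_mean = SimpleImputer(strategy="mean").fit_transform(X_toy)

# Multivariate: each gap is filled with the mean of that column over the
# 2 nearest rows that have it, with distance measured on the observed columns
X_knn = KNNImputer(n_neighbors=2).fit_transform(X_toy)

print("Mean imputation:\n", X_mean)
print("KNN imputation:\n", X_knn)
```

The two methods fill the same gaps with different values because KNN imputation only averages over rows that look similar to the incomplete one.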

In [None]:
from sklearn.ensemble import RandomForestRegressor
# IterativeImputer is experimental: this import must come first to enable it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

df_imputed = df.copy()

# Mode imputation
imputer = SimpleImputer(strategy='most_frequent')
df_imputed[['Embarked']] = imputer.fit_transform(df[['Embarked']])


features = ['Age', 'Pclass', 'SibSp', 'Parch']  # Ensure all features are numerical

# # K-Nearest Neighbors (KNN) Imputation
# imputer = KNNImputer(n_neighbors=5)

# Multivariate Imputation by Chained Equations (MICE)
imputer = IterativeImputer()

# # Random Forest Imputation
# imputer = IterativeImputer(estimator=RandomForestRegressor())

df_imputed[features] = imputer.fit_transform(df[features])

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 6))
sns.histplot(df_imputed['Age'], kde=True, color='blue', alpha=0.5, label='Imputed Age')
sns.histplot(df['Age'].dropna(), kde=True, color='red', alpha=0.5, label='Original Age')
plt.legend()
plt.title('Distribution of Age Before and After Imputation')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

In [None]:
df = df_imputed

### Handling Categorical Variables

In [None]:
df

In [None]:
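# Encode Sex as a binary integer column (map avoids the downcasting warning that replace emits in recent pandas)
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})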

var = "Embarked"
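# One-hot encode Embarked; dtype=int keeps the dummy columns numeric (0/1) rather than boolean
one_hot = pd.get_dummies(df[var], prefix=var, dtype=int)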
df = pd.concat([df, one_hot], axis=1).drop([var], axis=1)

df

In [None]:
features = df.drop(target, axis=1).columns
features

### Outlier Detection

* Using Z-score, Interquartile Range (IQR)
* [sklearn.ensemble.IsolationForest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html): The IsolationForest ‘isolates’ observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.
* [sklearn.cluster.DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html): Density-Based Spatial Clustering of Applications with Noise. Finds core samples of high density and expands clusters from them. Noisy samples are given the label -1.
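
The Z-score and IQR rules from the first bullet can be sketched for a single numeric column as follows. The helper names `zscore_outlier_mask` and `iqr_outlier_mask`, the thresholds (3 standard deviations, 1.5 times the IQR), and the synthetic sample are assumptions chosen for illustration, not part of the lab.

```python
import numpy as np

def zscore_outlier_mask(values, threshold=3.0):
    """Flag values whose absolute z-score exceeds `threshold` (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return np.abs(z) > threshold

def iqr_outlier_mask(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Synthetic data (assumed for illustration): 200 typical values plus one extreme
rng = np.random.default_rng(0)
sample = np.append(rng.normal(loc=30, scale=5, size=200), 150.0)

z_mask = zscore_outlier_mask(sample)
iqr_mask = iqr_outlier_mask(sample)
print("Z-score outliers:", z_mask.sum())
print("IQR outliers:", iqr_mask.sum())
```

In this lab, either helper could be applied to one numeric column at a time, for example `df["Fare"]`. The IQR rule is usually the stricter of the two on skewed data, since the quartiles are less affected by the extreme values than the mean and standard deviation are.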

#### IsolationForest

In [None]:
import numpy as np
from sklearn.ensemble import IsolationForest

df_outlier = df.copy()

iso_forest = IsolationForest(n_estimators=100, contamination=0.02, random_state=random_state)
outliers = iso_forest.fit_predict(df_outlier[features])

print("Outliers detected:", np.sum(outliers == -1))
df_outlier['outlier'] = outliers

plt.figure(figsize=(10, 6))
sns.scatterplot(x='Age', y='Fare', hue='outlier', data=df_outlier, palette={-1: 'red', 1: 'blue'})
plt.title('Outlier Detection with Isolation Forest')
plt.xlabel('Age')
plt.ylabel('Fare')
plt.show()

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_outlier[features])

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

fig = plt.figure(figsize=(10, 8))

colors = {1: 'blue', -1: 'red'}
marker_colors = [colors[label] for label in df_outlier['outlier']]

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=marker_colors, marker='o')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA Plot with Outliers')
plt.show()

#### DBSCAN

<img src="https://machinelearninggeek.com/wp-content/uploads/2020/10/image-58.png" width="800">

In [None]:
from sklearn.cluster import DBSCAN

df_outlier = df.copy()

dbscan = DBSCAN(eps=1.0, min_samples=5)
clusters = dbscan.fit_predict(X_pca)

outliers = np.sum(clusters == -1)
print("Number of outliers:", outliers)
df_outlier['outlier'] = np.where(clusters == -1, -1, 1)

plt.figure(figsize=(10, 6))
sns.scatterplot(x='Age', y='Fare', hue='outlier', data=df_outlier, palette={-1: 'red', 1: 'blue'})
plt.title('Outlier Detection with DBSCAN Clustering')
plt.xlabel('Age')
plt.ylabel('Fare')
plt.legend(title='Outlier')
plt.show()

In [None]:
fig = plt.figure(figsize=(10, 8))

colors = {1: 'blue', -1: 'red'}
marker_colors = [colors[label] for label in df_outlier['outlier']]

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=marker_colors, marker='o')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA Plot with DBSCAN Outliers')
plt.show()

### Data Split

Split the data into training (75%) and test (25%) sets.

In [None]:
from sklearn.model_selection import train_test_split

shuffle = True
test_size_ratio = 0.25

train_df, test_df = train_test_split(df, test_size=test_size_ratio, random_state=random_state, shuffle=shuffle)
print(train_df.shape, test_df.shape)

In [None]:
X_train = train_df.drop(target, axis=1).values
y_train = train_df[target].values

X_test = test_df.drop(target, axis=1).values
y_test = test_df[target].values

### Data Normalization

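Standardize the features with [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) from sklearn, which rescales each feature to zero mean and unit variance. The scaler is fit on the training set only, and the same transformation is then applied to the test set.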

In [None]:
from sklearn.preprocessing import StandardScaler

# Standardize: fit on the training set only, then reuse those statistics for the test set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
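
To make the transformation concrete, here is a minimal sketch on made-up arrays. The names `X_tr`, `X_te`, and `toy_scaler` are assumptions for this example, chosen so they do not overwrite the lab's variables. It checks that the scaled training columns have mean 0 and standard deviation 1, and that the test row is scaled with the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy arrays (assumed for illustration)
X_tr = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
X_te = np.array([[2.0, 30.0]])

toy_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = toy_scaler.transform(X_tr)

# Manual standardization using the mean and scale learned from the training rows
manual_te = (X_te - toy_scaler.mean_) / toy_scaler.scale_

print("Train column means:", X_tr_scaled.mean(axis=0))
print("Train column stds:", X_tr_scaled.std(axis=0))
print("Manual result matches transform:", np.allclose(manual_te, toy_scaler.transform(X_te)))
```

The scaled test set will generally not have exactly zero mean and unit variance, which is expected because its statistics are never used.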

## Training and Evaluation using Scikit-Learn

### Training

* MLP Classifier ([sklearn.neural_network.MLPClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html))

In [None]:
from sklearn.neural_network import MLPClassifier

# Note: `learning_rate` and `power_t` only take effect with solver='sgd',
# so they are omitted here; Adam uses `learning_rate_init` as its step size.
model = MLPClassifier(hidden_layer_sizes=(50, 30),
                      max_iter=300,
                      activation='relu',
                      solver='adam',
                      batch_size=200,
                      learning_rate_init=0.01,
                      warm_start=True,
                      random_state=random_state,
                      verbose=True)  # Print the loss at each iteration

# Fit the model
model.fit(X_train, y_train)

# Access the loss_curve_ attribute
loss_values = model.loss_curve_

# Plot the loss curve
plt.figure(figsize=(8, 4))
plt.plot(loss_values, label='Loss per iteration')
plt.title('Training Loss per Iteration')
plt.xlabel('Iterations')
plt.ylabel('Loss')
plt.legend()
plt.show()

### Evaluation

In [None]:
y_prob = model.predict_proba(X_test)
print("Estimated probs:", y_prob[:10])

y_cls = model.predict(X_test)
print("Estimated classes:", y_cls[:10])
print()

* Accuracy ([metrics.accuracy_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html))
* F1 ([metrics.f1_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html))
* ROC AUC ([metrics.roc_auc_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html))

In [None]:
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

print("Accuracy:", accuracy_score(y_test, y_cls))
print("F1:", f1_score(y_test, y_cls))
print("ROC AUC:", roc_auc_score(y_test, y_prob[:, 1]))

* Confusion Matrix ([metrics.confusion_matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html))

In [None]:
from sklearn.metrics import confusion_matrix

conf_matrix = confusion_matrix(y_test, y_cls)
conf_matrix_df = pd.DataFrame(
    conf_matrix,
    columns=["Predicted Not-Survived", "Predicted Survived"],
    index=["Actual Not-Survived", "Actual Survived"]
)
print(conf_matrix_df)
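
Accuracy and F1 can both be recovered from the four cells of a binary confusion matrix. The sketch below does this by hand on made-up labels (`y_true_toy` and `y_pred_toy` are assumptions for illustration) and compares the results with the sklearn functions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Toy labels (assumed for illustration)
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred_toy = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# For binary labels, ravel() yields the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("Accuracy:", accuracy, "| sklearn:", accuracy_score(y_true_toy, y_pred_toy))
print("F1:", f1, "| sklearn:", f1_score(y_true_toy, y_pred_toy))
```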

* ROC Curve ([metrics.roc_curve](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html))

In [None]:
from sklearn.metrics import roc_curve

fpr, tpr, _ = roc_curve(y_test, y_prob[:, 1])

plt.plot(fpr, tpr, color="darkorange", lw=2)
plt.plot([0, 1], [0, 1], color="navy", lw=2, linestyle="--")
plt.xlabel("1 - Specificity (FP Rate)")
plt.ylabel("Sensitivity (TP Rate)")
plt.title("ROC Curve")
plt.show()
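
One way to read the area under this curve: it equals the fraction of (positive, negative) pairs in which the positive example receives the higher score. The following sketch checks that on four made-up scores; `y_true_toy`, `scores_toy`, and `pairwise_auc` are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and scores (assumed for illustration)
y_true_toy = np.array([0, 0, 1, 1])
scores_toy = np.array([0.1, 0.4, 0.35, 0.8])

pos = scores_toy[y_true_toy == 1]
neg = scores_toy[y_true_toy == 0]

# Score difference for every (positive, negative) pair
diff = pos[:, None] - neg[None, :]

# Fraction of pairs ranked correctly; ties count as half
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print("Pairwise AUC:", pairwise_auc)
print("sklearn AUC:", roc_auc_score(y_true_toy, scores_toy))
```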