# Titanic Survival Prediction Using PyTorch

This lab focuses on building and training a neural network model to predict survival on the Titanic. The session will guide you through the process of handling a real-world tabular dataset, preprocessing it, and applying a machine learning model using PyTorch.

In [None]:
import numpy as np

import torch
import torch.nn as nn
import torch.nn.functional as F

## Titanic Dataset

In [None]:
from google.colab import drive
drive.mount("/content/gdrive")

In [None]:
import os

# root_dir = "PATH/TO/YOUR/DIRECTORY"
root_dir = "/content/gdrive/MyDrive/lecture/[Common] 머신러닝 원리와 응용/lab12_nn_torch"

# Checking if our specified directory exists
os.path.exists(root_dir)

In [None]:
import pandas as pd

# Paths to the downloaded files
data_path = os.path.join(root_dir, "titanic_train.csv")

# Load data
df = pd.read_csv(data_path)
df

In [None]:
random_state = 100
target = "Survived"

## Data Preprocessing

In [None]:
df.info()

### Variable Selection

Eliminate variables that are not utilized as inputs or that contain numerous missing values.

In [None]:
drop_vars = ["Name", "PassengerId", "Ticket", "Cabin"]
df.drop(drop_vars, axis=1, inplace=True)
df.info()

### Missing Value Imputation

* [sklearn.impute.SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer): Univariate imputer for completing missing values with simple strategies.
* [sklearn.impute.KNNImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html#sklearn.impute.KNNImputer): Imputation for completing missing values using k-Nearest Neighbors. Each sample’s missing values are imputed using the mean value from `n_neighbors` nearest neighbors found in the training set.
* [sklearn.impute.IterativeImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html#sklearn.impute.IterativeImputer): Multivariate imputer that models each feature with missing values as a function of the other features, cycling through the features in a round-robin fashion. It is still experimental, so `enable_iterative_imputer` must be imported from `sklearn.experimental` before it can be used. (Default estimator: `BayesianRidge`)
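
To make the difference between the three imputers concrete, here is a minimal sketch on a made-up toy array (the values and the `toy_imputers` / `toy_filled` names are assumptions for illustration, not part of the Titanic data). The second column is roughly twice the first, and the last row is missing its second value.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must precede the IterativeImputer import)
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

# Toy data (made up): column 1 is roughly 2 * column 0, last row has a gap.
X_toy = np.array([
    [1.0, 2.0],
    [2.0, 4.1],
    [3.0, 6.0],
    [4.0, 8.2],
    [5.0, np.nan],
])

toy_imputers = {
    "mean": SimpleImputer(strategy="mean"),         # column mean, ignores other features
    "knn": KNNImputer(n_neighbors=2),               # average over the 2 nearest rows
    "iterative": IterativeImputer(random_state=0),  # regression on the other feature
}

toy_filled = {}
for name, toy_imputer in toy_imputers.items():
    toy_filled[name] = toy_imputer.fit_transform(X_toy)
    print(f"{name:>9}: imputed value = {toy_filled[name][4, 1]:.2f}")
```

On data like this, the mean imputer ignores the relationship between the columns, the KNN imputer averages the two closest rows, and the iterative imputer can extrapolate along the trend. Which behavior is preferable depends on the dataset.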

In [None]:
from sklearn.ensemble import RandomForestRegressor
# IterativeImputer is experimental: this import must come before importing it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

df_imputed = df.copy()

# Mode imputation
imputer = SimpleImputer(strategy='most_frequent')
df_imputed[['Embarked']] = imputer.fit_transform(df[['Embarked']])


features = ['Age', 'Pclass', 'SibSp', 'Parch']  # Ensure all features are numerical

# # K-Nearest Neighbors (KNN) Imputation
# imputer = KNNImputer(n_neighbors=5)

# Multivariate Imputation by Chained Equations (MICE)
imputer = IterativeImputer()

# # Random Forest Imputation
# imputer = IterativeImputer(estimator=RandomForestRegressor())

df_imputed[features] = imputer.fit_transform(df[features])

df = df_imputed

### Handling Categorical Variables

In [None]:
df

In [None]:
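# Encode the binary variable as 0/1
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

# One-hot encode the multi-class variable as integer 0/1 columns
var = "Embarked"
one_hot = pd.get_dummies(df[var], prefix=var, dtype=int)
df = pd.concat([df, one_hot], axis=1).drop([var], axis=1)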

df

In [None]:
features = df.drop(target, axis=1).columns
features

### Data Split

Split the data into training and test sets.

In [None]:
from sklearn.model_selection import train_test_split

shuffle = True
test_size_ratio = 0.25

train_df, test_df = train_test_split(df, test_size=test_size_ratio, random_state=random_state, shuffle=shuffle)
print(train_df.shape, test_df.shape)

In [None]:
X_train = train_df.drop(target, axis=1).values
y_train = train_df[target].values

X_test = test_df.drop(target, axis=1).values
y_test = test_df[target].values

### Data Normalization

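Use [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) from sklearn to standardize each feature to zero mean and unit variance. The scaler is fit on the training set only and then applied unchanged to the test set, so no test-set statistics leak into training.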
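
As a rough illustration of what the scaler computes, the sketch below standardizes two made-up arrays by hand with NumPy. The arrays and variable names are hypothetical; `StandardScaler` performs the equivalent computation using the population standard deviation (`ddof=0`).

```python
import numpy as np

# Hypothetical toy feature matrices (rows = passengers, columns = features).
X_train_toy = np.array([[22.0, 7.25], [38.0, 71.28], [26.0, 7.92], [35.0, 53.10]])
X_test_toy = np.array([[54.0, 51.86], [2.0, 21.08]])

# Statistics are estimated on the training split only ...
toy_mean = X_train_toy.mean(axis=0)
toy_std = X_train_toy.std(axis=0)  # population std (ddof=0)

# ... and then reused, unchanged, for the test split.
X_train_toy_scaled = (X_train_toy - toy_mean) / toy_std
X_test_toy_scaled = (X_test_toy - toy_mean) / toy_std

print(X_train_toy_scaled.mean(axis=0))  # approximately 0 for every column
print(X_train_toy_scaled.std(axis=0))   # approximately 1 for every column
print(X_test_toy_scaled.mean(axis=0))   # generally not 0: the test data was not used for fitting
```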

In [None]:
from sklearn.preprocessing import StandardScaler

# Normalize the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

### Over-sampling

- In this dataset, **survivors** are the minority class (roughly 38% of passengers in the standard Kaggle training set) and **non-survivors** the majority. This imbalance can bias a model towards the majority class (non-survivors).
- **[SMOTE (Synthetic Minority Over-sampling Technique)](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html)** is an **oversampling technique** used to address class imbalance by **synthesizing new samples** of the minority class rather than simply duplicating existing ones.
- It works by selecting instances from the minority class and then creating new synthetic examples along the lines between the selected instance and one of its **k-nearest neighbors**. This helps the model learn better and avoids the pitfalls of overfitting to duplicate data.
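
The interpolation step described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the `imblearn` implementation; the toy points and the `smote_like_sample` helper are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class samples in a 2-D feature space.
X_minority = np.array([[1.0, 1.0], [1.5, 1.2], [2.0, 0.8], [1.2, 1.8]])

def smote_like_sample(X, k, rng):
    """Create one synthetic point between a random sample and one of its k nearest neighbors."""
    i = rng.integers(len(X))
    distances = np.linalg.norm(X - X[i], axis=1)
    neighbor_ids = np.argsort(distances)[1:k + 1]  # skip position 0, the sample itself
    j = rng.choice(neighbor_ids)
    lam = rng.random()                             # interpolation factor in [0, 1)
    return X[i] + lam * (X[j] - X[i])

synthetic = np.array([smote_like_sample(X_minority, k=2, rng=rng) for _ in range(5)])
print(synthetic)
```

Each synthetic point lies on the line segment between two existing minority samples, which is why it adds variety without leaving the region the minority class already occupies.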

In [None]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(sampling_strategy='auto', random_state=random_state, k_neighbors=5)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

Check the class distribution before and after applying SMOTE.

In [None]:
print("Class distribution in y_train before SMOTE:")
print(pd.Series(y_train).value_counts())

print("Class distribution in y_train after SMOTE:")
print(pd.Series(y_train_smote).value_counts())

In [None]:
X_train, y_train = X_train_smote, y_train_smote

## Training and Evaluation using PyTorch

### Preparation for PyTorch Training

Conversion to PyTorch Tensors:
- The normalized data is then converted into PyTorch tensors, which are the fundamental data structures used in PyTorch for building and training neural networks.
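- `FloatTensor` (32-bit floats) is used for the input features (`X_train` and `X_test`), which is the default dtype of `nn.Linear` weights. `LongTensor` (64-bit integers) is used for the labels (`y_train` and `y_test`), because `F.cross_entropy` expects class indices of that type.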

Creating DataLoaders:
- `TensorDataset` wraps tensors into a dataset. Each sample will be retrieved by indexing tensors along the first dimension.
- `DataLoader` creates an iterable over a dataset that yields mini-batches. `train_loader` shuffles the samples every epoch, while `test_loader` keeps them in their original order.
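
The following sketch shows how the two classes behave on a small made-up dataset. The tensors and names are hypothetical, and `torch.tensor(..., dtype=...)` is used here as an equivalent way to obtain float and long tensors.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy data: 10 samples, 3 features, binary labels.
X_toy = torch.arange(30, dtype=torch.float32).reshape(10, 3)
y_toy = torch.tensor([0, 1, 0, 1, 0, 1, 0, 1, 0, 1], dtype=torch.long)

toy_dataset = TensorDataset(X_toy, y_toy)
first_x, first_y = toy_dataset[0]   # indexing returns one (features, label) pair
print(first_x, first_y)

toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=False)
batch_sizes_seen = [xb.shape[0] for xb, yb in toy_loader]
print(batch_sizes_seen)             # the last batch holds the leftover samples
```

Because 10 is not a multiple of 4, the final batch is smaller than the others. This matters later when per-batch losses are averaged over an epoch.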

In [None]:
from torch.utils.data import DataLoader, TensorDataset

batch_size = 200

# Convert to PyTorch tensors
X_train = torch.FloatTensor(X_train)
X_test = torch.FloatTensor(X_test)
y_train = torch.LongTensor(y_train)
y_test = torch.LongTensor(y_test)

# Create dataloaders
train_dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)

test_dataset = TensorDataset(X_test, y_test)
test_loader = DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False)

### Model Architecture

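The `SimpleNN` class extends PyTorch's `nn.Module` and represents a simple fully connected neural network (also known as a Multilayer Perceptron) with two hidden layers. When `apply_bn=True`, each hidden layer applies batch normalization before its ReLU activation. The output layer returns two raw scores (logits), one per class.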

In [None]:
class SimpleNN(nn.Module):
    def __init__(self, hidden_sizes=(50, 30), apply_bn=False):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(len(features), hidden_sizes[0])
        self.bn1 = nn.BatchNorm1d(hidden_sizes[0])
        self.fc2 = nn.Linear(hidden_sizes[0], hidden_sizes[1])
        self.bn2 = nn.BatchNorm1d(hidden_sizes[1])
        self.fc3 = nn.Linear(hidden_sizes[1], 2)
        self.apply_bn = apply_bn

    def forward(self, x):
        x = self.fc1(x)
        if self.apply_bn:
            x = self.bn1(x)
        x = F.relu(x)
        x = self.fc2(x)
        if self.apply_bn:
            x = self.bn2(x)
        x = F.relu(x)
        x = self.fc3(x)
        return x

model = SimpleNN(hidden_sizes=(50, 30), apply_bn=True)
model

### Weight Initialization (Skipped)

In PyTorch, the default weight initialization depends on the layer type. For `nn.Linear`, the weights are initialized with Kaiming (He) uniform initialization using `a=sqrt(5)`, which is equivalent to sampling from a uniform distribution on [-1/sqrt(fan_in), 1/sqrt(fan_in)], where `fan_in` is the number of input features. The biases are drawn from the same range rather than being set to zero. This lab keeps these defaults.
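
As a quick check of the default behavior, and a sketch of how it could be overridden, consider the snippet below. The layer sizes and the `init_he_zero_bias` helper are hypothetical and are not used elsewhere in this lab.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
demo_layer = nn.Linear(in_features=9, out_features=50)

# With the default initialization, weights and biases of nn.Linear
# both fall inside [-1/sqrt(fan_in), 1/sqrt(fan_in)].
bound = 1 / math.sqrt(demo_layer.in_features)
print(demo_layer.weight.abs().max().item() <= bound + 1e-6)
print(demo_layer.bias.abs().max().item() <= bound + 1e-6)

def init_he_zero_bias(module):
    """Hypothetical helper: He-uniform weights suited to ReLU, and zero biases."""
    if isinstance(module, nn.Linear):
        nn.init.kaiming_uniform_(module.weight, nonlinearity="relu")
        nn.init.zeros_(module.bias)

demo_net = nn.Sequential(nn.Linear(9, 50), nn.ReLU(), nn.Linear(50, 2))
demo_net.apply(init_he_zero_bias)  # apply() visits every submodule
```

A custom initializer like this is only needed when the defaults cause training problems; for a small network such as `SimpleNN` the defaults are usually left as they are.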

### Optimizer and Learning Rate Scheduler

When training a neural network, the choice of optimizer and regularization can significantly affect performance. The cell below defines a helper that selects an optimizer by name and applies L2 regularization through the `weight_decay` argument to reduce overfitting. A `StepLR` scheduler then multiplies the learning rate by `gamma` after every `step_size` epochs.
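
To see what a step schedule does to the learning rate, here is a small sketch that records the rate over 100 epochs. The single dummy parameter and the `demo_` names are assumptions for illustration; the hyperparameters mirror the ones used in this lab.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

# A single dummy parameter stands in for a model's parameters.
demo_param = torch.nn.Parameter(torch.zeros(1))
demo_optimizer = Adam([demo_param], lr=1e-3, weight_decay=1e-3)
demo_scheduler = StepLR(demo_optimizer, step_size=50, gamma=0.1)

demo_lrs = []
for epoch in range(100):
    demo_lrs.append(demo_scheduler.get_last_lr()[0])
    demo_optimizer.step()   # would normally follow loss.backward()
    demo_scheduler.step()   # advance the schedule once per epoch

print(demo_lrs[0], demo_lrs[49], demo_lrs[50], demo_lrs[99])
```

The rate stays at its initial value for the first 50 epochs and is then reduced by a factor of 10 for the remaining ones.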

In [None]:
from torch.optim import Adam, SGD, RMSprop
from torch.optim.lr_scheduler import StepLR

def select_optimizer(optimizer_name, parameters, lr=1e-3, weight_decay=0):
    if optimizer_name == "sgd":
        return torch.optim.SGD(parameters, lr=lr, weight_decay=weight_decay, momentum=0.9)
    elif optimizer_name == "rmsprop":
        return torch.optim.RMSprop(parameters, lr=lr, weight_decay=weight_decay, alpha=0.99)
    elif optimizer_name == "adam":
        return torch.optim.Adam(parameters, lr=lr, weight_decay=weight_decay)
    else:
        raise ValueError(f"Unknown optimizer: {optimizer_name}")


# Choose optimizer and regularization hyperparameters
optimizer_name = "adam" # Could be "sgd", "rmsprop", or "adam"
learning_rate = 0.001
weight_decay = 0.001    # L2 regularization coefficient

optimizer = select_optimizer(optimizer_name=optimizer_name, parameters=model.parameters(), lr=learning_rate, weight_decay=weight_decay)
scheduler = StepLR(optimizer, step_size=50, gamma=0.1)

### Training

This section of the code represents the training loop for our neural network model. The loop iterates over the dataset multiple times (epochs), updating the model's weights to minimize the loss function, which in this case measures the discrepancy between the predicted and actual class labels.
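
Two details of the loss are easy to get wrong, so here is a minimal sketch with made-up numbers. It shows that cross-entropy is the negative log of the softmax probability of the true class, and that an epoch-level average over batches of unequal size should weight each batch mean by its batch size. All values below are hypothetical.

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for 3 samples and 2 classes, with their true labels.
demo_logits = torch.tensor([[2.0, 0.5], [0.1, 1.5], [1.0, 1.0]])
demo_labels = torch.tensor([0, 1, 1])

# Cross-entropy = negative log of the softmax probability assigned to the true class.
log_probs = F.log_softmax(demo_logits, dim=1)
manual_loss = -log_probs[torch.arange(len(demo_labels)), demo_labels].mean()
builtin_loss = F.cross_entropy(demo_logits, demo_labels, reduction="mean")
print(manual_loss.item(), builtin_loss.item())

# Epoch-level average: weight each batch mean by its batch size,
# then divide by the total number of samples (not the number of batches).
batch_means = [0.70, 0.60, 0.40]   # made-up per-batch mean losses
batch_counts = [200, 200, 100]     # made-up batch sizes
epoch_loss = sum(m * n for m, n in zip(batch_means, batch_counts)) / sum(batch_counts)
print(epoch_loss)
```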

In [None]:
num_epochs = 100
train_losses = []

for epoch in range(num_epochs):
    model.train()  # Training mode: BatchNorm uses batch statistics and updates its running estimates
    running_loss = 0.0
    for inputs, labels in train_loader:
        # Clear old gradients; if not cleared, they would accumulate with subsequent backward passes.
        optimizer.zero_grad()

        outputs = model(inputs) # Forward pass to get predictions.
        loss = F.cross_entropy(outputs, labels, reduction='mean') # Use mean for gradient calculation
        loss.backward()         # Backpropagation to compute the gradients.
        running_loss += loss.item() * inputs.size(0)              # Use sum for tracking running loss

        # Update the weights of the model based on the gradients calculated during backpropagation.
        optimizer.step()

        # train_losses.append(loss.item())

    # running_loss is a sum over samples, so divide by the number of samples (not batches)
    average_loss = running_loss / len(train_loader.dataset)
    current_lr = scheduler.get_last_lr()[0]
    print(f"[Epoch {epoch + 1}] (LR: {current_lr:.8f}) Average Loss: {average_loss:.4f}")

    # Store the loss for visualization
    train_losses.append(average_loss)

    # Update the learning rate according to the specified schedule
    scheduler.step()

print("Finished Training")

In [None]:
import matplotlib.pyplot as plt

# Plot the training loss
plt.figure(figsize=(10, 6))
plt.plot(train_losses, label='Training Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training Loss Over Epochs')
plt.legend()
plt.show()

### Evaluation

Evaluate the performance of the trained neural network model on the test dataset.
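
Because the model in this lab can include `BatchNorm1d` layers, the mode of the module matters at evaluation time. The sketch below, on a standalone hypothetical layer with made-up inputs, illustrates that in evaluation mode the output for a sample depends only on the stored running statistics and not on the other samples in the batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
demo_bn = nn.BatchNorm1d(num_features=3)

# A few training-mode passes update the running mean and variance.
demo_bn.train()
for _ in range(10):
    demo_bn(torch.randn(32, 3) * 2.0 + 5.0)

sample = torch.tensor([[5.0, 5.0, 5.0]])
other_rows = torch.randn(7, 3)

demo_bn.eval()
with torch.no_grad():
    alone = demo_bn(sample)
    in_batch = demo_bn(torch.cat([sample, other_rows]))[:1]

# In eval mode the output for a sample does not depend on its batch companions.
print(torch.allclose(alone, in_batch))
```

In training mode the same layer would normalize with the statistics of the current batch, so predictions on a test batch would change with the batch composition.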

In [None]:
correct = 0

# Switch to evaluation mode so that BatchNorm uses its running statistics
# instead of the statistics of the current batch.
model.eval()

# Context manager that disables gradient tracking.
# Gradients are not needed for evaluation, so this saves memory and computation.
with torch.no_grad():
    for inputs, labels in test_loader:
        outputs = model(inputs)

        # The predicted class is the index of the largest logit
        pred = outputs.argmax(dim=1)
        correct += pred.eq(labels).sum().item()

# Accuracy is calculated as the percentage of correct predictions over the total number of predictions.
accuracy = 100. * correct / len(test_loader.dataset)

print(f"\nTest Set Accuracy: {correct}/{len(test_loader.dataset)} (= {accuracy:.0f}%)\n")