<a href="https://www.kaggle.com/code/sonnylowe/titanic-simple-deep-learning-tutorial?scriptVersionId=181117638" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# **Titanic - Simple Deep Learning Tutorial**

Welcome! This tutorial walks you through basic data processing and uses Keras's Sequential API to build a deep neural network that models our data. You will learn basic concepts and develop a baseline model with respectable accuracy. You are still encouraged to experiment with the data and the model to produce more accurate predictions.

The Titanic dataset is relatively simple and small, so deep learning may well be overkill here and less accurate than simpler models such as decision trees or logistic regression. Even so, the skills you learn on this small problem carry over to the more complex datasets in your problem-solving future.

If you learned something, please upvote to support these types of tutorials!

By Sonny Lowe

# File Processing:
Listing the CSV files available in the input directory.

In [None]:
import numpy as np
import pandas as pd
import os

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Data Processing:

Our data processing is split into two steps: a custom function for the more complex columns (e.g. name and ticket), followed by a scikit-learn pipeline for the general numerical and categorical columns.

First, our custom function, `preprocess`, isolates the prefix in a name (Miss, Mr, Mrs) and places it in its own column. It also splits the ticket into the ticket number and any text that precedes it (the ticket "item").

Finally, we split the data into two sections, **training** and **validation**. The share of rows that goes to training is set through the *train_size* argument.
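To make the name and ticket handling concrete, here is a small standalone sketch of the same kind of string splitting, applied to values in the dataset's format. The helper names `split_name` and `split_ticket` are made up for this illustration and are not the notebook's own functions.

```python
def split_name(raw_name):
    """Return (prefix, rest) for a name shaped like 'Surname, Prefix. Given names'."""
    surname, rest = raw_name.split(",", 1)
    prefix, _, given = rest.strip().partition(". ")
    return prefix, f"{surname} {given}"


def split_ticket(raw_ticket):
    """Return (item, number), where item is the text before the final token."""
    parts = raw_ticket.split(" ")
    number = int(parts[-1]) if parts[-1].isnumeric() else 0
    item = "_".join(parts[:-1]) if len(parts) > 1 else "NONE"
    return item, number


print(split_name("Braund, Mr. Owen Harris"))  # ('Mr', 'Braund Owen Harris')
print(split_ticket("A/5 21171"))              # ('A/5', 21171)
print(split_ticket("113803"))                 # ('NONE', 113803)
```

Tickets without any leading text get the placeholder item `"NONE"`, so the column never contains an empty string.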

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer

file_path = '../input/titanic/train.csv'
data = pd.read_csv(file_path)
data.head()

X = data.copy()
y = X.Survived

def preprocess(df):
    df = df.copy()
    
    def prefix(x):
        name = x.split(",")
        return (name[1].split(" ")[1])[:-1]
    
    def normalize_name(x):
        name = " ".join([v.strip(",()[].\"'") for v in x.split(" ")])
        return name.split(" ")[0] + " " + " ".join(name.split(" ")[2:])
    
    def ticket_number(x):
        num = x.split(" ")[-1]
        # return an int in both branches so the column has a single numeric type
        if num.isnumeric():
            return int(num)
        return 0
        
    def ticket_item(x):
        items = x.split(" ")
        if len(items) == 1:
            return "NONE"
        return "_".join(items[0:-1])
    
    df["Prefix"] = df["Name"].apply(prefix)
    df["Name"] = df["Name"].apply(normalize_name)
    df["Ticket_number"] = df["Ticket"].apply(ticket_number)
    df["Ticket_item"] = df["Ticket"].apply(ticket_item)
        
    return df

X = preprocess(X)

prefix_possibilities = X['Prefix'].unique()
print(prefix_possibilities)

X_train, X_valid, y_train, y_valid = \
    train_test_split(X, y, stratify=y, train_size=0.75)

X.head()

The features are split into two categories for pipeline processing: **Numerical** and **Categorical**.
- A data **pipeline** takes in raw data and processes it into the desired format through a fixed sequence of steps.
- For numerical data, the pipeline first **imputes** missing values by substituting a replacement value (the column mean, by default). Then it **scales** each column to zero mean and unit variance.
- For categorical data, it uses **one-hot encoding** (`OneHotEncoder`) to turn each column into one column per unique value, holding a 1 where the row has that value and a 0 where it does not. See the diagram below:

![1_ggtP4a5YaRx6l09KQaYOnw.png](attachment:8b055ab6-cb25-4640-8e3a-11a9d98be05b.png)
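As a rough illustration of what these three steps compute, the sketch below redoes them by hand with NumPy on toy values. The ages and ports are invented for the example, and this is not how scikit-learn implements its transformers internally; it only mirrors the arithmetic.

```python
import numpy as np

# Toy numerical column with one missing value (hypothetical ages).
age = np.array([20.0, 30.0, np.nan, 40.0])

# Impute: replace the missing entry with the mean of the observed values.
age_filled = np.where(np.isnan(age), np.nanmean(age), age)

# Scale: subtract the mean and divide by the standard deviation.
age_scaled = (age_filled - age_filled.mean()) / age_filled.std()

# Toy categorical column (hypothetical ports of embarkation).
embarked = np.array(["S", "C", "S", "Q"])
categories = np.unique(embarked)  # sorted unique values: C, Q, S

# One-hot: one column per category, 1 where the row has that value.
one_hot = (embarked[:, None] == categories).astype(int)

print(age_filled)  # [20. 30. 30. 40.]
print(one_hot)
```

Each row of `one_hot` contains exactly one 1, in the column of that row's category.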

In [None]:
features_num = [
    'Pclass', 'Age', 'SibSp', 'Parch', 'Ticket_number'
]

features_cat = [
    'Sex', 'Embarked', 'Prefix'
]

# these are features that we want to keep but need not process. In this case, it is empty but feel free to add some
features_other = [
]

transformer_num = make_pipeline(
    SimpleImputer(),
    StandardScaler()
)

transformer_cat = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="NA"),
    OneHotEncoder(handle_unknown='ignore', sparse_output=False),
)

preprocessor = make_column_transformer(
    (transformer_num, features_num),
    (transformer_cat, features_cat),
    ('passthrough', features_other)
)

processed_X_train = preprocessor.fit_transform(X_train)
processed_X_valid = preprocessor.transform(X_valid)

df = pd.DataFrame(processed_X_train)
df.head()

# Model
Using Keras, we will create the structure of our neural network. Choosing the number of layers, dropout layers, and batch normalization layers is a complex process, and you are encouraged to experiment with the structure. The given structure is by no means an optimal one.

This model uses two basic types of layers:
- **Dense layers:** these are the most basic layers of a neural network. They are made up of nodes (neurons), each of which takes its inputs and applies a linear transformation to produce an output (a very simplified explanation). Each node then applies an activation function, which essentially puts a kink in the otherwise linear path so the model can fit more than a straight line.
- **Dropout layers:** a dropout layer randomly sets some of its input units to zero during training to help prevent overfitting. Each unit is dropped with a fixed probability, specified through the *rate* argument.

You may also experiment with batch normalization layers:
- **Batch Normalization:** Batch normalization is a technique that normalizes the inputs of each layer in a deep neural network to improve training speed and stability.
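To make the dense and dropout descriptions concrete, here is a minimal NumPy sketch of a forward pass through one dense layer with a ReLU activation, followed by dropout as applied during training. The shapes and random values are arbitrary, and the function names are illustrative rather than part of Keras.

```python
import numpy as np

rng = np.random.default_rng(0)


def dense_relu(x, weights, bias):
    """Linear transformation followed by ReLU, which zeroes negative values."""
    return np.maximum(0.0, x @ weights + bias)


def dropout(x, rate, rng):
    """Zero each unit with probability `rate` and rescale the survivors."""
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)


x = rng.normal(size=(4, 3))        # 4 samples, 3 input features
weights = rng.normal(size=(3, 5))  # 3 inputs feeding 5 units
bias = np.zeros(5)

hidden = dense_relu(x, weights, bias)
dropped = dropout(hidden, rate=0.2, rng=rng)
print(hidden.shape, dropped.shape)  # (4, 5) (4, 5)
```

Rescaling the surviving units by `1 / (1 - rate)` keeps the expected activation the same whether or not dropout is active.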

In [None]:
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(256, activation='relu'),
    layers.Dense(256, activation='relu'),
    layers.Dropout(rate=0.2),
    layers.Dense(256, activation='relu'),
    layers.Dense(256, activation='relu'),
    layers.Dropout(rate=0.2),
    layers.Dense(256, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])

This model's goal is a binary output: 0 for death and 1 for survival. Therefore, when **compiling** the model, we specify a binary loss function and a binary metric. The loss function is what the model tries to minimize during training; it can be thought of as a measure of how far the predictions are from the true labels.
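For intuition about the loss, the following sketch computes binary cross-entropy by hand with NumPy. The labels and probabilities are invented, and the `eps` clipping is an assumption added here to avoid taking the log of zero.

```python
import numpy as np


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels under y_prob."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))


y_true = np.array([1, 0, 1, 0])
confident = np.array([0.9, 0.1, 0.8, 0.2])  # close to the true labels
unsure = np.array([0.6, 0.4, 0.5, 0.5])     # hovering near 0.5

print(binary_crossentropy(y_true, confident))
print(binary_crossentropy(y_true, unsure))
```

Confident, correct predictions give a lower loss than predictions that hover near 0.5, which is the behaviour the optimizer exploits.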

In addition, to prevent overfitting, we use an **early_stopping** callback. We specify how many epochs to wait (*patience*) without a minimum improvement (*min_delta*) before stopping training early. By default the callback watches the validation loss, and *restore_best_weights* rolls the model back to the weights from its best epoch. You can play around with these settings too.
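The interplay of *patience* and *min_delta* can be sketched in plain Python. This is a simplified model of the idea, not Keras's actual implementation, and the validation losses are invented.

```python
def early_stopping_epoch(val_losses, patience, min_delta):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses."""
    best = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # Large enough improvement: remember it and reset the counter.
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


losses = [0.70, 0.60, 0.55, 0.5495, 0.56, 0.57, 0.58]
print(early_stopping_epoch(losses, patience=3, min_delta=0.001))  # (5, 2)
```

The drop from 0.55 to 0.5495 is smaller than *min_delta*, so it does not count as an improvement; training stops three epochs after epoch 2, and epoch 2 is the one whose weights would be restored.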

Finally, we record the training **history** of our model and set how many *epochs* (full passes over the training data) to run. This number is deliberately larger than needed, because we rely on early stopping to end training. You can also change *batch_size*, which is how many rows the model looks at in each iteration of an epoch. (Every epoch is made up of one or more iterations, each of which updates the weights.)
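As a quick worked example of the relationship between batch size and iterations, the arithmetic below assumes the 891 rows of Kaggle's Titanic `train.csv` and the 75% training split used earlier.

```python
import math

n_rows = 891       # rows in Kaggle's Titanic train.csv
train_size = 0.75  # share of rows used for training
batch_size = 512

n_train = math.floor(n_rows * train_size)
steps_per_epoch = math.ceil(n_train / batch_size)
print(n_train, steps_per_epoch)  # 668 2
```

With a batch size of 512, each epoch is only two weight updates, so a smaller batch size is one of the easier settings to experiment with.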

In [None]:
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['binary_accuracy'],
)

early_stopping = keras.callbacks.EarlyStopping(
    patience=20,
    min_delta=0.001,
    restore_best_weights=True,
)

history = model.fit(
    processed_X_train, y_train,
    validation_data=(processed_X_valid, y_valid),
    batch_size=512,
    epochs=100,
    callbacks=[early_stopping],
)

Now, we can plot this history.

In [None]:
history_df = pd.DataFrame(history.history)
history_df.loc[:, ['loss', 'val_loss']].plot(title="Cross-entropy")
history_df.loc[:, ['binary_accuracy', 'val_binary_accuracy']].plot(title="Accuracy")

print(("Best Validation Loss: {:0.4f}" +\
      "\nBest Validation Accuracy: {:0.4f}")\
      .format(history_df['val_loss'].min(), 
              history_df['val_binary_accuracy'].max()))

# Predicting

Using our established data processing pipeline, we can process the given test data. Because the early stopping callback restores the weights from the epoch with the lowest validation loss, those are the weights the model uses to make predictions.

The final step is writing the predictions into an output file to submit. It is now your job to get a better score than we have here! Good luck!
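The model outputs a survival probability for each passenger, which has to be turned into a 0 or 1 before submission. The sketch below uses invented probabilities, shaped like the `(n, 1)` array that `model.predict` returns, to show that rounding and a strict 0.5 threshold agree.

```python
import numpy as np

# Hypothetical sigmoid outputs for four passengers, one column per output unit.
probs = np.array([[0.08], [0.5], [0.51], [0.97]])

rounded = np.round(probs)[:, 0].astype(int)
thresholded = (probs[:, 0] > 0.5).astype(int)

print(rounded)      # [0 0 1 1]
print(thresholded)  # [0 0 1 1]
```

`np.round` rounds an exact 0.5 to the nearest even value, which is 0, so it matches a strict `> 0.5` threshold.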

In [None]:
subm_path = '../input/titanic/test.csv'

data = pd.read_csv(subm_path)
test_data = data.copy()

test_data = preprocess(test_data)
test_data = preprocessor.transform(test_data)

final_preds = np.round(np.clip(model.predict(test_data), 0, 1))

output = pd.DataFrame({
    'PassengerId': data.PassengerId,
    'Survived': final_preds[:, 0].astype(int)
})

# write the predictions to the submission file, then read it back to preview it
output.to_csv('submission.csv', index=False)
output = pd.read_csv('submission.csv')
output.head()

# Hyperparameter Tuning
In this section, we will experiment with different hyperparameters to find the best combination for this dataset. Several tools can make this easier (**Keras Tuner**, grid search, random search, etc.).

In this example, we will use Keras Tuner to tune several parameters, including:
- **number of dense layers**
- **presence of dropout layers**
- **dense unit count**
- **dropout rate** (held at 0.2 in the code below; it could be made tunable with `hp.Float`)
- **activation function**
- **optimizer**

Given the relatively small dataset, these parameters might not make a huge difference. However, tuning is still a very useful skill to have when you move on to larger, more complex datasets.

In [None]:
import keras_tuner
from tensorflow import keras
from tensorflow.keras import layers


def build_model(hp):
    model = keras.Sequential()
    
    for i in range(hp.Int("num_layers", 1, 2)):
        model.add(
            layers.Dense(
                # Tune number of units separately.
                units=hp.Int(f"units_{i}", min_value=32, max_value=512, step=32),
                activation=hp.Choice("activation", ["relu", "tanh"]),
            )
        )
    if hp.Boolean("dropout"):
        model.add(layers.Dropout(0.2))
        
    for i in range(hp.Int("num_layers", 1, 2)):
        model.add(
            layers.Dense(
                # Same hyperparameter names as the first block, so both blocks share their tuned values.
                units=hp.Int(f"units_{i}", min_value=32, max_value=512, step=32),
                activation=hp.Choice("activation", ["relu", "tanh"]),
            )
        )
    if hp.Boolean("dropout"):
        model.add(layers.Dropout(0.2))
    
    model.add(layers.Dense(1, activation="sigmoid"))
    
    model.compile(
        optimizer=hp.Choice("optimizer", ['adam', 'sgd', 'rmsprop']),
        loss="binary_crossentropy",
        metrics=["binary_accuracy"],
    )
    return model

# testing if it builds successfully (expected: <Sequential name=sequential_#, built=False>)
build_model(keras_tuner.HyperParameters())

Now we set up the tuner, using a random search and a few specified parameters. The important ones to note here are *max_trials* and *executions_per_trial*. As in a lab experiment, each trial uses a different set of parameters. However, because model training has inherent variance, we can choose to execute each trial multiple times. Here is a more comprehensive tutorial on hyperparameter tuning: [https://keras.io/guides/keras_tuner/getting_started/](https://keras.io/guides/keras_tuner/getting_started/)
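The roles of the trial count and the executions per trial can be sketched in plain Python. This is only a toy model of random search, not Keras Tuner's implementation: the search space is invented, and `noisy_score` is a hypothetical stand-in for training a model and reading off its validation accuracy.

```python
import random
import statistics

search_space = {
    "units": [32, 64, 128, 256],
    "activation": ["relu", "tanh"],
    "dropout": [True, False],
}


def noisy_score(config, rng):
    """Hypothetical stand-in for 'train a model, return validation accuracy'."""
    base = 0.78 + (0.02 if config["dropout"] else 0.0)
    return base + rng.uniform(-0.01, 0.01)


def random_search(max_trials, executions_per_trial, seed=0):
    rng = random.Random(seed)
    results = []
    for _ in range(max_trials):
        # Each trial samples one configuration from the search space.
        config = {name: rng.choice(options) for name, options in search_space.items()}
        # Repeating the trial and averaging smooths out run-to-run variance.
        scores = [noisy_score(config, rng) for _ in range(executions_per_trial)]
        results.append((statistics.mean(scores), config))
    return max(results, key=lambda item: item[0])


best_score, best_config = random_search(max_trials=15, executions_per_trial=3)
print(best_config, round(best_score, 3))
```

More executions per trial give a steadier score for each configuration, at the cost of proportionally more training time.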

In [None]:
tuner = keras_tuner.RandomSearch(
    hypermodel=build_model,
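    # score trials on validation accuracy rather than training accuracy
    objective=keras_tuner.Objective("val_binary_accuracy", direction="max"),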
    max_trials=15,
    executions_per_trial=3,
    overwrite=True
)

# confirming the parameters of our search
tuner.search_space_summary()

Now we actually start the search! This will take a while, depending on how many trials you configured. Once the search has finished, we can ask for a summary of the best models and their respective parameters.

In [None]:
# uncomment this code to begin the search

# early_stopping = keras.callbacks.EarlyStopping(
#     patience=5,
#     min_delta=0.001,
#     restore_best_weights=True,
# )

# tuner.search(processed_X_train, y_train, epochs=100, batch_size=512, validation_data=(processed_X_valid, y_valid), callbacks=[early_stopping], verbose=0)
# tuner.results_summary()

Next, we can isolate the best model and ask for its parameters.

In [None]:
# best_model = tuner.get_best_models(num_models=1)[0]
# best_model.summary()

Finally, we will use the best hyperparameters found to retrain a model on the entire dataset (training and validation rows combined) and use it for submission.

In [None]:
# Uncomment below to Get the top hyperparameter

# best_hps = tuner.get_best_hyperparameters()[0]
# final_model = build_model(best_hps)

# x_all = np.concatenate((processed_X_train, processed_X_valid))
# y_all = np.concatenate((y_train, y_valid))

# final_model.fit(x_all, y_all, epochs=30)

# subm_path = '../input/titanic/test.csv'

# data = pd.read_csv(subm_path)
# test_data = data.copy()

# test_data = preprocess(test_data)
# test_data = preprocessor.transform(test_data)

# final_preds = np.round(np.clip(final_model.predict(test_data), 0, 1))

# output = pd.DataFrame({
#     'PassengerId': data.PassengerId,
#     'Survived': final_preds[:, 0].astype(int)
# })

#uncomment the below lines to write the output into your submission file.
#output.to_csv('submission.csv', index=False)

#output = pd.read_csv('submission.csv')
#output.head()

# Bonus Section
Taking the majority vote of many models, without hyperparameter tuning.
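The voting idea can be sketched compactly with NumPy. The predictions below are invented, and this vectorized form is an alternative illustration of the vote rather than the notebook's own loop; ties go to "died", as in the cells that follow.

```python
import numpy as np

# Hypothetical 0/1 predictions from 4 models (rows) for 5 passengers (columns).
all_preds = np.array([
    [1, 0, 1, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 1, 1, 1],
    [1, 0, 0, 1, 0],
])

survived_votes = all_preds.sum(axis=0)
dead_votes = all_preds.shape[0] - survived_votes

# A passenger is predicted to survive only with a strict majority of votes.
final = (survived_votes > dead_votes).astype(int)
print(final)  # [1 0 0 0 1]
```

Using an odd number of models avoids ties altogether.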

In [None]:
def train_model(X, y, test_data):

    X_train, X_valid, p_y_train, p_y_valid = \
        train_test_split(X, y, stratify=y, train_size=0.75)
    
    p_X_train = preprocessor.fit_transform(X_train)
    p_X_valid = preprocessor.transform(X_valid)
    pred_data = test_data.copy()
    pred_data = preprocessor.transform(pred_data)
    
    new_model = keras.Sequential([
        layers.Dense(256, activation='relu'),
        layers.Dense(256, activation='relu'),
        layers.Dropout(rate=0.2),
        layers.Dense(256, activation='relu'),
        layers.Dense(256, activation='relu'),
        layers.Dropout(rate=0.2),
        layers.Dense(256, activation='relu'),
        layers.Dense(1, activation='sigmoid'),
    ])
    new_model.compile(
        optimizer='adam',
        loss='binary_crossentropy',
        metrics=['binary_accuracy'],
    )
    early_stopping = keras.callbacks.EarlyStopping(
        patience=10,
        min_delta=0.001,
        restore_best_weights=True,
    )
    new_history = new_model.fit(
        p_X_train, p_y_train,
        validation_data=(p_X_valid, p_y_valid),
        batch_size=512,
        epochs=100,
        callbacks=[early_stopping],
        verbose=0
    )

    preds = np.round(np.clip(new_model.predict(pred_data), 0, 1))
    histories.append(new_history)
    return preds
    

In [None]:
histories = []

file_path = '../input/titanic/train.csv'
data = pd.read_csv(file_path)

X = data.copy()
y = X.Survived
X = preprocess(X)

subm_path = '../input/titanic/test.csv'
test_data = pd.read_csv(subm_path)
test_data = preprocess(test_data)

# arrays to track the number of predictions for a passenger to have died or survived
# the final prediction will be determined by which prediction is more frequent with tiebreaks for dying
dead = []
survived = []

# change the range to set how many models to train - the more models, the more time it takes
# uncomment the lines below to run the training
# for x in range(15):
#    preds = train_model(X, y, test_data)
       
#    for i in range(0, len(preds)):
#        if(x == 0):
#            if (preds[i] == 0):
#                dead.append(1)
#                survived.append(0)
#            else:
#                dead.append(0)
#                survived.append(1)
#        else:
#             if (preds[i] == 0):
#                 dead[i] += 1
#             else:
#                 survived[i] += 1

# print("Binary Accuracies:")
# for x in range(0, len(histories)):
#     history_df = pd.DataFrame(histories[x].history)
#     print(history_df['val_binary_accuracy'].max())

In [None]:
final_preds = []

# for i in range(0, len(dead)):
#     if(dead[i] >= survived[i]):
#         final_preds.append(0)
#     else:
#         final_preds.append(1)

# output = pd.DataFrame({
#     'PassengerId': test_data.PassengerId,
#     'Survived': final_preds
# })

# output.to_csv('submission.csv', index=False)
# output = pd.read_csv('submission.csv')
# output.head()