# Titanic Prediction with Different Models
## Table of Contents
- Summary
- Import Packages
- Common Functions
- Import datasets
- Data Wrangling
- Data Preprocessing
    - Add Cabin type Column
    - Add family member size and family member type Column
    - Handle Categorical Features
- Exploratory Data Analysis
    - Basic Statistic infos
    - What's the factor to survive?
        - Survival of different Gender
        - Survival of different Age
        - Survival of different Pclass
        - Survival of different Fare
        - Survival of different Cabin
        - Survival of different Embarked
        - Survival of different SibSp (Number of siblings or spouses)
        - Survival of different Parch (Number of parents or children)
        - Survival of different family member size
- More Data Preprocessing
    - Convert Categorical features to one hot features
    - Remove unused columns
    - Train Validation Split
    - Balance Training dataset
- Feature Scaling
- Model Development and Evaluation
    - Using TensorFlow DNN
    - Using TensorFlow DNN with DenseFeatures
    - Using Logistic Regression
    - Using KNN
    - Using Decision Tree Classifier
    - Using Gradient Boosting Classifier
    - Using Random Forest Classifier
    - Using KMeans
    - Using XGBoost Classifier
    - Using Catboost Classifier
- Submission
- Conclusions
    
## Summary
In this notebook I do EDA and data preprocessing on the Titanic dataset, then implement Titanic survival prediction using several kinds of classification models:
- Deep Neural Network
- Deep and Wide Neural Network using keras DenseFeatures
- Logistic Regression
- KNN
- Decision Tree Classifier
- Gradient Boosting Classifier
- Random Forest Classifier
- KMeans
- XGBoost 
- CatBoost

## Import Packages

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import seaborn as sns
import tensorflow as tf
from tensorflow import feature_column
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import model_selection
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.impute import KNNImputer

## Common Functions

**Save results**

In [None]:
def save_results(Survived, test, path):
    submission = pd.DataFrame({
        "PassengerId": test["PassengerId"],
        "Survived": Survived
    })
    submission.to_csv(path, index=False)

**Evaluate the model, save its predictions and keep track of the best result**

In [None]:
def evaulate_and_save(
    model, 
    validation_features, 
    validation_targets, 
    test_features, 
    save_path, 
    best_score, 
    best_path, 
    columns = None
):
    if columns is None:
        feature_columns = validation_features.columns
    else:
        feature_columns = columns
    y_pred = model.predict(validation_features[feature_columns])
    # Neural networks return probabilities; convert them to 0/1 class labels.
    if not np.issubdtype(y_pred.dtype, np.integer):
        if y_pred.ndim == 2 and y_pred.shape[-1] == 2:
            y_pred = np.argmax(y_pred, axis=-1)
        elif y_pred.ndim == 2 and y_pred.shape[-1] == 1:
            y_pred = np.array(y_pred > 0.5, dtype=int)
    y_pred = y_pred.reshape(-1)
    score = sklearn.metrics.accuracy_score(validation_targets, y_pred)
    f1 = sklearn.metrics.f1_score(validation_targets, y_pred)
    print("Accuracy Score:", score)
    print("F1 Score:", f1)
    print("Classification Report:")
    print(sklearn.metrics.classification_report(validation_targets, y_pred))
    # Apply the same conversion to the predictions on the test set.
    Survived = model.predict(test_features[feature_columns])
    if not np.issubdtype(Survived.dtype, np.integer):
        if Survived.ndim == 2 and Survived.shape[-1] == 2:
            Survived = np.argmax(Survived, axis=-1)
        elif Survived.ndim == 2 and Survived.shape[-1] == 1:
            Survived = np.array(Survived > 0.5, dtype=int)
    Survived = np.array(Survived, dtype=int).reshape(-1)
    save_results(Survived, test_features, save_path)
    if score > best_score:
        best_score = score
        best_path = save_path
    return best_score, best_path
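
The conversion from raw model outputs to class labels can also be isolated in a small helper. The following is only a sketch: `to_class_labels` and its `threshold` parameter are hypothetical names that are not used elsewhere in this notebook, and the sample arrays are made up.

```python
import numpy as np

def to_class_labels(predictions, threshold=0.5):
    """Convert raw model outputs to 0/1 class labels (hypothetical helper)."""
    predictions = np.asarray(predictions)
    # Two columns: treat them as per-class probabilities and take the larger one.
    if predictions.ndim == 2 and predictions.shape[1] == 2:
        return np.argmax(predictions, axis=1).astype(int)
    # One column or 1-D floats: treat them as P(class 1) and apply the threshold.
    flat = predictions.reshape(-1)
    if np.issubdtype(flat.dtype, np.floating):
        return (flat > threshold).astype(int)
    # Already integer labels, for example from scikit-learn classifiers.
    return flat.astype(int)

softmax_like = np.array([[0.9, 0.1], [0.2, 0.8]])  # shape (N, 2)
sigmoid_like = np.array([[0.3], [0.7]])            # shape (N, 1)
hard_labels = np.array([1, 0, 1])                  # shape (N,)

print(to_class_labels(softmax_like))
print(to_class_labels(sigmoid_like))
print(to_class_labels(hard_labels))
```

Keeping the three cases in one place makes it harder to apply `argmax` to a single-column sigmoid output by accident, which would always return class 0.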

## Import datasets

In [None]:
train = pd.read_csv("/kaggle/input/titanic/train.csv")
test = pd.read_csv("/kaggle/input/titanic/test.csv")

In [None]:
train.head()

In [None]:
test.head()

## Data Wrangling

As the counts below show, Age, Cabin, Embarked and Fare contain missing values, so we need to impute them. The most common approach is to replace missing categorical values with the most frequent category and missing numerical values with the mean of that feature. The toggles below select the strategy; this notebook fills categorical gaps with "Unknown" and numerical gaps with KNN imputation.
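
As a minimal illustration of the "most frequent category" and "mean" strategies, here is a sketch on a made-up four-row frame (the `toy` data is an assumption for the example, not part of the Titanic files):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Embarked": ["S", "C", np.nan, "S"],
    "Age": [22.0, np.nan, 30.0, 38.0],
})

# Categorical column: fill gaps with the most frequent category.
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])
# Numerical column: fill gaps with the mean of the observed values.
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())

print(toy)
```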

In [None]:
train.isnull().sum()

In [None]:
test.isnull().sum()

In [None]:
categorical_imputation_strategy = ["mode", "unknown", "knn"][1]
numerical_imputation_strategy = ["mean", "median", "knn"][2]

In [None]:
if categorical_imputation_strategy == "mode":
    train["Cabin"] = train["Cabin"].replace(np.nan, train["Cabin"].mode()[0])
    train["Embarked"] = train["Embarked"].replace(np.nan, train["Embarked"].mode()[0])
    # Cabin is also missing in the test set; reuse the mode of the training set.
    test["Cabin"] = test["Cabin"].replace(np.nan, train["Cabin"].mode()[0])
if categorical_imputation_strategy == "unknown":
    train["Cabin"] = train["Cabin"].replace(np.nan, "Unknown")
    train["Embarked"] = train["Embarked"].replace(np.nan, "Unknown")
    test["Cabin"] = test["Cabin"].replace(np.nan, "Unknown")
if categorical_imputation_strategy == "knn":
    print("To be continued")
if numerical_imputation_strategy == "mean":
    train["Age"] = train["Age"].replace(np.nan, train["Age"].mean())
    test["Age"] = test["Age"].replace(np.nan, test["Age"].mean())
    test["Fare"] = test["Fare"].replace(np.nan, test["Fare"].mean())
if numerical_imputation_strategy == "median":
    train["Age"] = train["Age"].replace(np.nan, train["Age"].median())
    test["Age"] = test["Age"].replace(np.nan, test["Age"].median())
    test["Fare"] = test["Fare"].replace(np.nan, test["Fare"].median())
if numerical_imputation_strategy == "knn":
    # Fit the imputer on the training set only, then reuse it for the test set.
    imputer = KNNImputer(n_neighbors=5)
    columns = ["Age", "Fare", "SibSp", "Parch"]
    train[columns] = imputer.fit_transform(train[columns])
    test[columns] = imputer.transform(test[columns])
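
To see what KNN imputation does, it can help to run it on a tiny made-up example. In this sketch, `toy_train`, `toy_test` and `n_neighbors=2` are assumptions chosen so that the result is easy to check by hand: each missing value is replaced by the average of that column over the two nearest training rows, where distance is measured on the columns that are present.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

toy_train = pd.DataFrame({
    "Age": [20.0, 22.0, 60.0, 62.0],
    "Fare": [7.0, 8.0, 80.0, 82.0],
})
toy_test = pd.DataFrame({
    "Age": [np.nan, 61.0],
    "Fare": [7.5, np.nan],
})

# Fit on the training rows, then fill the gaps in the test rows.
toy_imputer = KNNImputer(n_neighbors=2)
toy_imputer.fit(toy_train)
filled = pd.DataFrame(toy_imputer.transform(toy_test), columns=toy_test.columns)

print(filled)
```

The first test row is closest in Fare to the two cheap tickets, so its Age becomes the mean of 20 and 22. The second row is closest in Age to the two older passengers, so its Fare becomes the mean of 80 and 82.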

## Data Preprocessing

### Add Cabin type Column

Let's look at the Cabin labels. There are many of them, but I assume that the first letter matters: it indicates the passenger's location and class on the ship, so it may have affected survival.

In [None]:
cabin_labels = sorted(set(list(train["Cabin"].unique()) + list(test["Cabin"].unique())))
print(cabin_labels[:30])

In [None]:
train["Cabin_type"] = train["Cabin"].apply(lambda cabin: cabin[0])
test["Cabin_type"] = test["Cabin"].apply(lambda cabin: cabin[0])

### Add family member size and family member type Column

Family member size can be derived from the SibSp and Parch features, counting the passenger as well:

In [None]:
train["family_member_size"] = 1 + train["SibSp"] + train["Parch"]
test["family_member_size"] = 1 + test["SibSp"] + test["Parch"]

According to the EDA below, family member size had an impact on survival, but the relationship was not linear, which is why its Pearson correlation is low. So I will convert it to a categorical feature: single (1 family member), medium (2-4 family members), or large (more than 4 members). A feature toggle controls whether this feature is used.

In [None]:
def convert_family_member_size(size):
    if size == 1:
        return "single"
    elif size < 5:
        return "medium"
    else:
        return "large"
should_add_family_member_type = False
if should_add_family_member_type:
    for data in [train, test]:
        # Use each frame's own sizes (not always train's) to build its column.
        data["family_member_type"] = data["family_member_size"].apply(convert_family_member_size)
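
The same binning can also be expressed with `pd.cut`. This is just an alternative sketch on a made-up series; the bin edges are chosen to reproduce the single / medium / large rule described above.

```python
import pandas as pd

sizes = pd.Series([1, 2, 4, 5, 11], name="family_member_size")

# Intervals are right-closed: (0, 1], (1, 4], (4, inf].
family_type = pd.cut(
    sizes,
    bins=[0, 1, 4, float("inf")],
    labels=["single", "medium", "large"],
)

print(family_type.tolist())
```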

### Handle Categorical Features

In [None]:
categorical_features = ["Sex", "Cabin_type", "Embarked"]
if should_add_family_member_type:
    categorical_features.append("family_member_type")
categorical_label_dictionary = dict()
for feature in categorical_features:
    unique_labels = sorted(set(list(train[feature].unique()) + list(test[feature].unique())))
    for data in [train, test]:
        categorical_label_dictionary[feature] = unique_labels
        data[feature + "_value"] = data[feature].apply(lambda item: unique_labels.index(item))
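
The important detail of this encoding is that the vocabulary is built from train and test together, so a label gets the same code in both frames. A minimal sketch of the idea on made-up frames, using a dictionary lookup instead of `list.index` (`toy_train`, `toy_test` and `code_of` are hypothetical names):

```python
import pandas as pd

toy_train = pd.DataFrame({"Embarked": ["S", "C", "S"]})
toy_test = pd.DataFrame({"Embarked": ["Q", "S"]})

# One shared, sorted vocabulary so that codes agree across both frames.
vocabulary = sorted(set(toy_train["Embarked"]) | set(toy_test["Embarked"]))
code_of = {label: code for code, label in enumerate(vocabulary)}

for frame in (toy_train, toy_test):
    frame["Embarked_value"] = frame["Embarked"].map(code_of)

print(vocabulary)
print(toy_train)
print(toy_test)
```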

Let's see what the data looks like after preprocessing.

In [None]:
train.head(10)

## Exploratory Data Analysis

### Basic Statistic infos

In [None]:
train.info()

In [None]:
train.describe()

### What's the factor to survive?
As we can see it's related to Gender, PClass, Status, Fare, Cabin and Embarked. 

In [None]:
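# numeric_only=True skips text columns such as Name, which recent pandas versions reject.
train.corr(numeric_only=True)["Survived"].sort_values(key=lambda x: abs(x), ascending=False)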

In [None]:
survival_correlation = train.corr(numeric_only=True)["Survived"]
related_columns = list(survival_correlation[survival_correlation.abs() > 0.05].index)
related_columns.remove("Survived")
print(related_columns)
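
Keep in mind that Pearson correlation only measures linear relationships, so a feature with a low score can still be informative. The sketch below uses made-up survival rates (not values from the dataset) that rise and then fall with family size; the correlation comes out at essentially zero even though the rate clearly depends on the size.

```python
import numpy as np

family_size = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
# Hypothetical rates: low for singles, highest in the middle, low for large families.
survival_rate = np.array([0.2, 0.5, 0.6, 0.5, 0.2])

r = np.corrcoef(family_size, survival_rate)[0, 1]
print("Pearson correlation:", round(r, 6))
```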

#### Survival of different Gender
Women have a higher Survival rate than Men.

In [None]:
sns.countplot(x="Sex", hue="Survived", data=train)
plt.title("Survival of different Gender")
plt.show()

#### Survival of different Age


In [None]:
sns.histplot(x="Age", hue="Survived", data=train)
plt.title("Survival of different Age")
plt.show()

#### Survival of different Pclass
- Passengers in Pclass 1 had a survival rate of about 63%;
- Passengers in Pclass 2 had a survival rate of about 47%;
- Passengers in Pclass 3 had a survival rate of about 24%.

In [None]:
train.groupby("Pclass")["Survived"].mean()

In [None]:
sns.countplot(x="Pclass", hue="Survived", data=train)
plt.title("Survival of different Pclass")
plt.show()

#### Survival of different Fare
Most tickets cost less than 100 pounds. Only about 1 in 5 passengers who paid a fare of around 10 pounds survived.

In [None]:
plt.figure(figsize=(15, 7))
sns.histplot(x="Fare", hue="Survived", bins=20, kde=True, data=train)
plt.title("Survival of different Fare")
plt.show()

#### Survival of different Cabin
- More than half of the passengers in cabins starting with B, C, D, E or F survived;
- About half of the passengers in cabins starting with A or G survived;
- About 30% of the passengers with an unknown cabin survived;
- The only passenger in a cabin starting with T did not survive.

In [None]:
train.groupby("Cabin_type")["Survived"].mean()
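
A mean on its own can be misleading when a group is tiny, so it can help to show the group size next to the rate. A minimal sketch on made-up rows (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "Cabin_type": ["C", "C", "C", "C", "T", "U", "U", "U"],
    "Survived":   [1,   1,   0,   1,   0,   0,   1,   0],
})

# "mean" is the survival rate, "count" is how many passengers back it up.
summary = toy.groupby("Cabin_type")["Survived"].agg(["mean", "count"])
print(summary)
```

A rate of 0 that comes from a single passenger carries far less weight than a rate computed from hundreds of rows.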

In [None]:
sns.countplot(x="Cabin_type", hue="Survived", data=train)
plt.title("Survival of different Cabin")
plt.show()

#### Survival of different Embarked
- About 1 / 3 of the passengers who embarked at Q or S survived;
- About half of the passengers who embarked at C survived.

In [None]:
train.groupby("Embarked")["Survived"].mean()

In [None]:
sns.countplot(x="Embarked", hue="Survived", data=train)
plt.title("Survival of different Embarked")
plt.show()

#### Survival of different SibSp (Number of siblings or spouses)
- Passengers without siblings or spouses had a survival rate of about 1 / 3.
- Passengers with one or two siblings or spouses had a survival rate of about 1 / 2.
- Passengers with more than two siblings or spouses were less likely to survive.

In [None]:
train.groupby("SibSp")["Survived"].mean()

In [None]:
sns.countplot(x="SibSp", hue="Survived", data=train)
plt.title("Survival of different SibSp")
plt.show()

#### Survival of different Parch (Number of parents or children)
- Passengers without parents or children had a survival rate of about 1 / 3.
- Passengers with 1 - 3 parents or children had a survival rate of about 1 / 2.
- Passengers with more than 3 parents or children were less likely to survive.


In [None]:
train.groupby("Parch")["Survived"].mean()

In [None]:
sns.countplot(x="Parch", hue="Survived", data=train)
plt.title("Survival of different Parch")
plt.show()

#### Survival of different family member size
- Those who were alone (family member size of 1) had a survival rate of about 1 / 3.
- Those with a family member size of 2 - 4 had a survival rate of more than 1 / 2.
- Those with a family member size of 5 - 11 were less likely to survive.

In [None]:
sns.countplot(x="family_member_size", hue="Survived", data=train)
plt.title("Survival of Family Member Size")
plt.show()

After converting family member size to a categorical feature, the relationship between family member size and survival rate is more obvious.

In [None]:
if should_add_family_member_type:
    sns.countplot(x="family_member_type", hue="Survived", data=train)
    plt.title("Survival of Family Member Type")
    plt.show()

## More Data Preprocessing

In [None]:
train_test = pd.concat([train, test])
train_test.head()

### Convert Categorical features to one hot features

In [None]:
categorical_columns_to_one_hot = ["Sex", "Cabin_type", "Embarked"]
if should_add_family_member_type:
    categorical_columns_to_one_hot.append("family_member_type")
for feature in categorical_columns_to_one_hot:
    items = pd.get_dummies(train_test[feature + "_value"])
    labels = categorical_label_dictionary[feature]
    items.columns = [feature + "_" + labels[column] for column in list(items.columns)]
    train_test[items.columns] = items
    train_test.pop(feature + "_value")
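
For comparison, `pd.get_dummies` can also one-hot encode the original string column directly and name the new columns through `prefix`. This is an alternative sketch on a made-up frame, not the approach used in the cell above:

```python
import pandas as pd

toy = pd.DataFrame({"Sex": ["male", "female", "male"], "Age": [22, 38, 26]})

# The encoded column is replaced by one 0/1 column per category.
one_hot = pd.get_dummies(toy, columns=["Sex"], prefix="Sex", dtype=int)
print(one_hot)
```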

Calculate the mean and standard deviation of each column for future use.

In [None]:
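# numeric_only=True skips text columns such as Name and Ticket.
mean_value = train_test.mean(numeric_only=True)
std_value = train_test.std(numeric_only=True)
_ = mean_value.pop("Survived")
_ = std_value.pop("Survived")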

In [None]:
train_test.head()

### Remove unused columns

In [None]:
for column in ["Name", "Sex", "Ticket", "Cabin", "Cabin_type", "Embarked", "family_member_size", "family_member_type"]:
    if column in train_test.columns:
        train_test.pop(column)

In [None]:
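# Copy the slices so that later in-place changes do not trigger SettingWithCopyWarning.
train_features = train_test.iloc[0: len(train)].copy()
test_features = train_test.iloc[len(train):].copy()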

In [None]:
test_features.head()

In [None]:
train_features.pop("PassengerId")
test_features.pop("Survived")
train_features.head()

In [None]:
test_features.head()

### Train Validation Split

In [None]:
validation_split = 0.2

In [None]:
train_features, validation_features = model_selection.train_test_split(train_features, test_size=validation_split)
print(train_features.shape, validation_features.shape)
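
The split above is random and not seeded, so scores change between runs and the share of survivors can differ between the two parts. If that becomes a problem, a stratified and seeded split is one option. A sketch on a made-up frame (`toy` and `random_state=0` are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({
    "Fare": np.arange(100, dtype=float),
    "Survived": [1] * 40 + [0] * 60,
})

# stratify keeps the share of survivors equal in both parts;
# random_state makes the split reproducible.
toy_train, toy_val = train_test_split(
    toy, test_size=0.2, stratify=toy["Survived"], random_state=0
)

print(toy_train["Survived"].mean(), toy_val["Survived"].mean())
```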

In [None]:
train_features

### Balance Training dataset
Let's balance the training dataset and add some noise to data. I will add a Toggle here to control whether to balance the dataset.

In [None]:
should_balance = False
batch_size = 32
number_batch_per_category = 100
if should_balance:
    survived = train_features[train_features.Survived == 1]
    not_survived = train_features[train_features.Survived == 0]
    survived_indices = list(np.random.choice(len(survived), size=number_batch_per_category * batch_size))
    not_survived_indices = list(np.random.choice(len(not_survived), size=number_batch_per_category * batch_size))
    survived_features = survived.iloc[survived_indices]
    not_survived_features = not_survived.iloc[not_survived_indices]
    print(not_survived_features.shape)
    train_features = pd.concat([survived_features, not_survived_features])
    train_features = sklearn.utils.shuffle(train_features)
    train_targets = train_features.pop("Survived")
    validation_targets = validation_features.pop("Survived")
    # 0.95 ~ 1.05
    scale = 1 + 0.1 * (np.random.rand(train_features.shape[0], train_features.shape[1]) - 0.5)
    train_features =  train_features * scale
else:
    train_targets = train_features.pop("Survived")
    validation_targets = validation_features.pop("Survived")
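
The balancing idea, sampling each class with replacement up to the same size and then jittering the features, can be sketched on a small made-up frame. The names `toy`, `samples_per_class` and the seed are assumptions for illustration only, and the noise is applied to the single numeric feature here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Fare": rng.uniform(5, 100, size=10),
    "Survived": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
})
samples_per_class = 6  # hypothetical target size per class

parts = []
for label, group in toy.groupby("Survived"):
    # Sample row positions with replacement, so the smaller class is oversampled.
    picks = rng.integers(0, len(group), size=samples_per_class)
    parts.append(group.iloc[picks])

# Combine both classes and shuffle the rows.
balanced = pd.concat(parts, ignore_index=True)
balanced = balanced.sample(frac=1.0, random_state=0).reset_index(drop=True)

# Multiplicative noise in [0.95, 1.05) on the feature column only.
noise = 1 + 0.1 * (rng.random(len(balanced)) - 0.5)
balanced["Fare"] = balanced["Fare"].to_numpy() * noise

print(balanced["Survived"].value_counts())
```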

After balancing the training dataset

In [None]:
train_features.describe()

## Feature Scaling

In [None]:
train_features.head()

In [None]:
data_scaling_strategies = ["none", "max", "standard"]
data_scaling_strategy = data_scaling_strategies[2]
if data_scaling_strategy == data_scaling_strategies[1]:
    features_max = pd.concat([train_features, validation_features]).max()
    train_features = train_features / features_max
    validation_features = validation_features / features_max
    test_features[train_features.columns] = test_features[train_features.columns] / features_max
if data_scaling_strategy == data_scaling_strategies[2]:
    for data in [train_features, validation_features, test_features]:
        columns_to_scale = ["Age", "Fare"]
        data.loc[:, columns_to_scale] = (data.loc[:, columns_to_scale]  - mean_value[columns_to_scale]) / std_value[columns_to_scale]
print(train_features.shape)
print(test_features.shape)
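
Standard scaling subtracts the mean and divides by the standard deviation of each column. The sketch below shows the arithmetic on made-up numbers. One difference is an assumption of the sketch: it takes the statistics from the training rows only, which is a common way to avoid leaking information from other splits, whereas this notebook computes them on train and test combined.

```python
import pandas as pd

toy_train = pd.DataFrame({"Age": [20.0, 30.0, 40.0], "Fare": [10.0, 20.0, 30.0]})
toy_val = pd.DataFrame({"Age": [30.0], "Fare": [40.0]})

scaled_columns = ["Age", "Fare"]
# Statistics from the training rows only (assumption of this sketch).
toy_mean = toy_train[scaled_columns].mean()
toy_std = toy_train[scaled_columns].std()

scaled_train = (toy_train[scaled_columns] - toy_mean) / toy_std
scaled_val = (toy_val[scaled_columns] - toy_mean) / toy_std

print(scaled_train)
print(scaled_val)
```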

## Model Development & Evaluation
I will try different models and use the results from the best one.

In [None]:
best_score = 0
best_path = ""

### Using TensorFlow DNN

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(train_features.shape[1],)),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(2, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
early_stop = tf.keras.callbacks.EarlyStopping(patience=10)
history = model.fit(
    train_features, train_targets, 
    epochs=400, validation_data=(validation_features, validation_targets), 
    callbacks=[early_stop],
    verbose=0
)
pd.DataFrame(history.history).plot()

In [None]:
best_score, best_path = evaulate_and_save(
    model, 
    validation_features, 
    validation_targets, 
    test_features,
    "submission_dnn.csv",
    best_score, 
    best_path,
    columns=validation_features.columns
)

### Using TensorFlow DNN with DenseFeatures

In [None]:
categorical_feature_names = ["Pclass", "Sex_value", "Embarked_value", "Cabin_type_value"]
if should_add_family_member_type:
    categorical_feature_names.append("family_member_type_value")
numerical_feature_names = ["Age", "Fare", "SibSp", "Parch"]
categorical_features = [
    feature_column.indicator_column(
        feature_column.categorical_column_with_vocabulary_list(key, sorted(list(train[key].unique())))
    ) for key in categorical_feature_names
]
numerical_features = [feature_column.numeric_column(key) for key in numerical_feature_names]
input_dictionary = dict()
inputs = dict()
for item in numerical_features:
    inputs[item.key] = tf.keras.layers.Input(name=item.key, shape=())
for item in categorical_features:
    inputs[item.categorical_column.key] = tf.keras.layers.Input(name=item.categorical_column.key, shape=(), dtype="int32")

In [None]:
def features_and_labels(row_data):
    label = row_data.pop("Survived")
    features = row_data
    return features, label

def create_dataset(pattern, epochs=1, batch_size=32, mode='eval'):
    # Only shuffle while training; evaluation rows must keep their file order
    # so that predictions line up with the labels in val_data.
    dataset = tf.data.experimental.make_csv_dataset(
        pattern, batch_size, shuffle=(mode == 'train')
    )
    dataset = dataset.map(features_and_labels)
    if mode == 'train':
        dataset = dataset.shuffle(buffer_size=128).repeat(epochs)
    dataset = dataset.prefetch(1)
    return dataset

def create_test_dataset(pattern, batch_size=32):
    # Keep the file order so that predictions line up with PassengerId.
    dataset = tf.data.experimental.make_csv_dataset(
        pattern, batch_size, shuffle=False
    )
    dataset = dataset.map(lambda features: features)
    dataset = dataset.prefetch(1)
    return dataset

In [None]:
train_data, val_data = train_test_split(
    train[categorical_feature_names + numerical_feature_names + ["Survived"]],
    test_size=validation_split,
    random_state=np.random.randint(0, 1000)
)
train_data.to_csv("train_data.csv", index=False)
val_data.to_csv("val_data.csv", index=False)
test[categorical_feature_names + numerical_feature_names].to_csv("test_data.csv", index=False)
train_dataset = create_dataset("train_data.csv", batch_size=batch_size, mode='train')
val_dataset = create_dataset("val_data.csv", batch_size=val_data.shape[0], mode='eval').take(1)
test_dataset = create_test_dataset("test_data.csv", batch_size = test.shape[0]).take(1)

In [None]:
def build_dnn_with_dense_features():
    deep = tf.keras.layers.DenseFeatures(numerical_features + categorical_features, name='deep')(inputs)
    deep = tf.keras.layers.Dense(16, activation='relu')(deep)
    deep = tf.keras.layers.Dropout(0.3)(deep)
    deep = tf.keras.layers.Dense(16, activation='relu')(deep)
    deep = tf.keras.layers.Dropout(0.3)(deep)
    deep = tf.keras.layers.Dense(16, activation='relu')(deep)
    deep = tf.keras.layers.Dropout(0.3)(deep)
    deep = tf.keras.layers.Dense(16, activation='relu')(deep)
    deep = tf.keras.layers.Dropout(0.3)(deep)
    output = tf.keras.layers.Dense(1, activation="sigmoid")(deep)
    model = tf.keras.Model(inputs=list(inputs.values()), outputs=output)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

In [None]:
dnn_dense_features = build_dnn_with_dense_features()
tf.keras.utils.plot_model(dnn_dense_features, show_shapes=False, rankdir='LR')

In [None]:
epochs = 400
early_stop = tf.keras.callbacks.EarlyStopping(patience=10)
steps_per_epoch = train_data.shape[0] // batch_size
history = dnn_dense_features.fit(
    train_dataset, 
    steps_per_epoch=steps_per_epoch,
    validation_data=val_dataset,
    epochs=epochs,
    callbacks=[early_stop],
    verbose=0
)
pd.DataFrame(history.history).plot()

In [None]:
y_pred =  np.array(dnn_dense_features.predict(val_dataset) > 0.5, dtype=int).reshape(-1)
score = accuracy_score(val_data["Survived"], y_pred)
print("Accuracy score:", score)
print(sklearn.metrics.classification_report(val_data["Survived"], y_pred))
# The model has a single sigmoid output, so threshold it; argmax would always return 0.
Survived = np.array(dnn_dense_features.predict(test_dataset) > 0.5, dtype=int).reshape(-1)
print(Survived.shape)
path = "submission_dnn_dense_features_model.csv"
save_results(Survived, test, path)
if score > best_score:
    best_score = score
    best_path = path

### Using Logistic Regression

In [None]:
survival_correlation = train.corr(numeric_only=True)["Survived"]
logitistc_related_columns = list(survival_correlation[survival_correlation.abs() > 0.2].index)
logitistc_related_columns.remove("Survived")
logitistc_related_columns

In [None]:
from sklearn.linear_model import LogisticRegression
best_logit = None
best_solver = ""
best_logit_score = 0
logit_train_features, logit_val_features = train_test_split(train[logitistc_related_columns +  ["Survived"]], test_size=0.2, random_state=48)
logit_train_targets = logit_train_features.pop("Survived")
logit_val_targets = logit_val_features.pop("Survived")
for solver in ['newton-cg', 'lbfgs', 'liblinear']:
    logit = LogisticRegression(solver=solver)
    logit.fit(logit_train_features, logit_train_targets)
    score = logit.score(logit_val_features, logit_val_targets)
    if score > best_logit_score:
        best_solver = solver
        best_logit_score = score
        best_logit = logit
print("Best Solver:", best_solver, "Score:", best_logit_score)

In [None]:
best_score, best_path = evaulate_and_save(
    best_logit, 
    logit_val_features, 
    logit_val_targets, 
    test, 
    "submission_logit.csv", 
    best_score, 
    best_path,
    columns=logitistc_related_columns
)

### Using KNN

In [None]:
best_algorithm = ""
best_knn_score = 0
best_knn = None
best_n = 2
for n in range(2, 10):
    knn = KNeighborsClassifier(n, algorithm='ball_tree')
    knn.fit(train_features, train_targets)
    score = knn.score(validation_features, validation_targets) 
    if score > best_knn_score:
        best_n = n
        best_knn_score = score
        best_knn = knn
print("Best KNN Score: ", best_knn_score, "Model:", best_knn, "Best N:", best_n)

In [None]:
best_score, best_path = evaulate_and_save(
    best_knn, validation_features, validation_targets, test_features, 
    "submission_knn.csv", best_score, best_path
)

### Using Decision Tree Classifier

In [None]:
from sklearn.tree import DecisionTreeClassifier
best_tree = None
best_tree_score = 0
for max_depth in range(6, 30):
    tree = sklearn.tree.DecisionTreeClassifier(max_depth=max_depth)
    tree.fit(train_features, train_targets)
    score = tree.score(validation_features, validation_targets)
    if score > best_tree_score:
        best_tree_score = score
        best_tree = tree
print("Best Decision Tree Score: ", best_tree_score, "Model:", best_tree)

In [None]:
best_score, best_path = evaulate_and_save(
    best_tree, validation_features, validation_targets, test_features, 
    "submission_tree.csv", best_score, best_path
)

### Using Gradient Boosting Classifier

In [None]:
best_gbc_score = 0
best_depth = 5
best_n_estimators = 5
best_learning_rate = 0.1
best_gbc_model = None
for learning_rate in list(np.arange(0.05, 0.15, 0.01)):
    gbc = GradientBoostingClassifier(
            n_estimators=best_n_estimators, 
            learning_rate=learning_rate, 
            max_depth=best_depth, 
            random_state=np.random.randint(1, 1000)
    )
    gbc.fit(train_features, train_targets)
    score = gbc.score(validation_features, validation_targets)
    if score > best_gbc_score:
        best_learning_rate = learning_rate
        best_gbc_score = score
        best_gbc_model = gbc
print("Best Learning Rate:", best_learning_rate)
for depth in range(5, 20):
    gbc = GradientBoostingClassifier(
            n_estimators=best_n_estimators, 
            learning_rate=best_learning_rate, 
            max_depth=depth, 
            random_state=np.random.randint(1, 1000)
    )
    gbc.fit(train_features, train_targets)
    score = gbc.score(validation_features, validation_targets)
    if score > best_gbc_score:
        best_depth = depth
        best_gbc_score = score
        best_gbc_model = gbc
print("Best Depth:", best_depth)
for n_estimators in range(5, 15):
    gbc = GradientBoostingClassifier(
            n_estimators=n_estimators, 
            learning_rate=best_learning_rate, 
            max_depth=best_depth, 
            random_state=np.random.randint(1, 1000)
    )
    gbc.fit(train_features, train_targets)
    score = gbc.score(validation_features, validation_targets)
    if score > best_gbc_score:
        best_n_estimators = n_estimators
        best_gbc_score = score
        best_gbc_model = gbc
print("Best Number of Estimator:", best_n_estimators)
print("Best Gradient Boosting Classifier Score:", best_gbc_score, " Model:", best_gbc_model)
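
The loops above tune one parameter at a time against a single validation split. A different technique, shown here only as a sketch, is scikit-learn's `GridSearchCV`, which tries every combination and scores each with cross-validation. The synthetic data from `make_classification` and the small grid are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the Titanic features.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

param_grid = {
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 5],
    "n_estimators": [5, 10],
}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=3,                # 3-fold cross-validation for every combination
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

A full grid costs more fits than the one-parameter-at-a-time search, but it does not depend on the order in which the parameters are tuned.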

In [None]:
best_score, best_path = evaulate_and_save(
    best_gbc_model, validation_features, validation_targets, test_features, 
    "submission_gbc.csv", best_score, best_path
)

### Using Random Forest Classifier

In [None]:
best_forest = None
best_max_depth = 8
best_n_estimators = 15
best_forest_score = 0
print("Find best number of estimators")
for n_estimators in list(range(3, 40, 2)):
    forest = RandomForestClassifier(
        n_estimators=n_estimators, 
        max_depth=best_max_depth, 
        random_state=np.random.randint(1, 1000)
    )
    forest.fit(train_features, train_targets)
    score = forest.score(validation_features, validation_targets)
    print("Score: ", score)
    if score > best_forest_score:
        best_n_estimators = n_estimators
        best_forest_score = score
        best_forest = forest
print("Best Number of Estimator:", best_n_estimators)
for max_depth in range(4, 15):
    forest = RandomForestClassifier(
        n_estimators=best_n_estimators, 
        max_depth=max_depth, 
        random_state=np.random.randint(1, 1000)
    )
    forest.fit(train_features, train_targets)
    score = forest.score(validation_features, validation_targets)
    print("Score: ", score)
    if score > best_forest_score:
        best_max_depth = max_depth
        best_forest_score = score
        best_forest = forest
print("Best Max Depth:", best_max_depth,"\nBest score:", best_forest_score)

In [None]:
best_score, best_path = evaulate_and_save(
    best_forest, validation_features, validation_targets, test_features, 
    "submission_forest.csv", best_score, best_path
)

### Using KMeans

In [None]:
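# KMeans is unsupervised: fit() ignores the targets, and the cluster ids (0 or 1)
# are arbitrary, so they do not necessarily match the Survived labels.
kmeans = KMeans(n_clusters=2)
kmeans.fit(train_features, train_targets)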

In [None]:
best_score, best_path = evaulate_and_save(
    kmeans, validation_features, validation_targets, test_features, 
    "submission_kmeans.csv", best_score, best_path
)
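
Because cluster ids carry no meaning by themselves, one way to turn a clustering into a classifier is to map each cluster to the majority label among its training members. The following sketch does this on two synthetic, well-separated blobs; the data, the seed and the `cluster_to_label` name are assumptions, and the step is not part of the evaluation above.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated blobs standing in for the real features.
blob_features = np.vstack([
    rng.normal(0.0, 0.5, size=(50, 2)),
    rng.normal(5.0, 0.5, size=(50, 2)),
])
blob_targets = np.array([1] * 50 + [0] * 50)

toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(blob_features)
clusters = toy_kmeans.predict(blob_features)

# Map every cluster id to the majority target among its members.
cluster_to_label = {
    int(cluster): int(np.round(blob_targets[clusters == cluster].mean()))
    for cluster in np.unique(clusters)
}
predicted = np.array([cluster_to_label[int(c)] for c in clusters])
accuracy = (predicted == blob_targets).mean()

print(cluster_to_label, accuracy)
```

Without this mapping, the reported accuracy can flip between a value and one minus that value, depending on which cluster happens to get id 0.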

### Using XGBoost Classifier

In [None]:
def get_value(key1, key2, value, parameters, best_index):
    return parameters[key1][best_index[key1]] if key1 != key2 else value
def find_best_model_with_xgboost(
    train_features, 
    train_targets,
    validation_features,
    validation_targets,
    parameters,
    columns = None
):
    # Optionally restrict both splits to a subset of columns.
    train_f = train_features if columns is None else train_features[columns]
    val_f = validation_features if columns is None else validation_features[columns]
    all_keys = parameters.keys()
    best_index = {key: 0 for key in all_keys}
    best_xgb_score = 0
    best_xgb_model = None
    for key in all_keys:
        values = parameters[key]
        current_best_model = None
        current_best_score = 0
        for index, value in enumerate(values):
            learning_rate = get_value("learning_rate", key, value, parameters, best_index)
            max_depth = get_value("max_depth", key, value, parameters, best_index)
            gamma = get_value("gamma", key, value, parameters, best_index)
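            # early_stopping_rounds and eval_metric are constructor arguments in
            # recent XGBoost releases (passing them to fit() was removed in 2.0).
            xgb = XGBClassifier(
                max_depth=max_depth,
                learning_rate=learning_rate,
                gamma=gamma,
                early_stopping_rounds=10,
                eval_metric="logloss"
            )
            xgb.fit(
                train_f, 
                train_targets, 
                eval_set=[(val_f, validation_targets)], 
                verbose=False
            )
            # Score on the same (possibly column-restricted) validation features.
            score = xgb.score(val_f, validation_targets)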
            if score > current_best_score:
                current_best_score = score
                current_best_model = xgb
                best_index[key] = index
            if score > best_xgb_score: 
                best_xgb_score = score
                best_xgb_model = xgb
    return best_xgb_model, best_xgb_score
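
The search strategy used here, tuning one hyperparameter at a time while the others stay at their best value so far, can be sketched independently of XGBoost. In this simplified sketch, `coordinate_search` and `toy_score` are hypothetical names, and `toy_score` stands in for "fit a model and score it on the validation data".

```python
def coordinate_search(score_fn, grid):
    """Tune one parameter at a time, keeping the best values found so far."""
    best = {name: values[0] for name, values in grid.items()}
    best_score = score_fn(**best)
    for name, values in grid.items():
        for value in values:
            # Change only the current parameter; keep the others fixed.
            candidate = {**best, name: value}
            score = score_fn(**candidate)
            if score > best_score:
                best, best_score = candidate, score
    return best, best_score

def toy_score(max_depth, learning_rate):
    # Made-up score surface with its maximum at max_depth=7, learning_rate=0.2.
    return -((max_depth - 7) ** 2) - 100 * (learning_rate - 0.2) ** 2

best_params, best_value = coordinate_search(
    toy_score,
    {"max_depth": [5, 6, 7, 8], "learning_rate": [0.1, 0.2, 0.3]},
)
print(best_params, best_value)
```

This needs far fewer evaluations than a full grid (7 candidates instead of 12 here), at the cost of possibly missing combinations that only work well together.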

In [None]:
from xgboost import XGBClassifier
hyper_parameters = {
    "max_depth": list(range(5, 15)),
    "learning_rate": [0.1, 0.15, 0.2, 0.25, 0.3],
    "gamma": [0.5, 1, 1.5, 2.0]
}
# find_best_model_with_xgboost returns (model, score), in that order.
best_xgb_model, best_xgb_score = find_best_model_with_xgboost(
    train_features, 
    train_targets,
    validation_features,
    validation_targets,
    hyper_parameters
)
print("Best Model:", best_xgb_model, " Score: ", best_xgb_score)

In [None]:
best_score, best_path = evaulate_and_save(
    best_xgb_model, validation_features, validation_targets, test_features, 
    "submission_xgb.csv", best_score, best_path
)

### Using Catboost Classifier

In [None]:
from catboost import CatBoostClassifier
cat = CatBoostClassifier()
cat.fit(train_features, train_targets, verbose=False)

In [None]:
best_score, best_path = evaulate_and_save(
    cat, validation_features, validation_targets, test_features, 
    "submission_cat.csv", best_score, best_path
)

## Submission
The validation score can differ from the Kaggle leaderboard score, so you may want to try different submission files.

In [None]:
print("Best path:", best_path)
print("Best Score:", best_score)

In [None]:
submission = pd.read_csv(best_path)
print(submission.head(10))
submission.to_csv("submission.csv", index=False)

## Conclusions
- Deep learning is very powerful, but on this dataset it is not easy to find a neural network that outperforms traditional machine learning algorithms, perhaps because the dataset is too small.
- Gradient Boosting Classifier and Random Forest Classifier can also achieve very good performance while requiring less computing power than a deep neural network.