# AutoML Comparison On Titanic Dataset

We are going to try the following AutoML libraries and train an XGBoost model as a baseline.

- [TPOT](https://github.com/EpistasisLab/tpot)
- [AutoGluon](https://github.com/awslabs/autogluon)
- [AutoSklearn](https://github.com/automl/auto-sklearn)
- [H2OAutoML](https://github.com/h2oai/h2o-3)
- [AutoKeras](https://github.com/keras-team/autokeras)
- [MLJarSupervised](https://github.com/mljar/mljar-supervised)
- [HyperOptSklearn](https://github.com/hyperopt/hyperopt-sklearn)

## <center style="background-color: #6dc8b5; width:30%;">Contents</center>
* [Import Libraries](#Import)
* [Load Data](#Load)
* [Visualize Data](#Visualize)
* [Preprocess Data](#Preprocess)
* [Train Models](#Train)
    1. [XGBoost](#XGBoost)
    2. [AutoSklearn](#AutoSklearn)
    3. [HyperOptSklearn](#HyperOptSklearn)
    4. [TPOT](#TPOT)
    5. [AutoGluon](#AutoGluon)
    6. [H2OAutoML](#H2OAutoML)
    7. [AutoKeras](#AutoKeras)
    8. [MLJarSupervised](#MLJarSupervised)
* [Submission File](#Submission)
* [Cleanup](#Cleanup)

<a class="anchor" id="Import"></a>
# Import Libraries

In [None]:
%%capture
# https://github.com/parrt/dtreeviz/issues/108
# updated versions are needed for MLJarSupervised
! pip3 install graphviz==0.15.0
import graphviz
print(graphviz.__version__)

In [None]:
import os
import time
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import warnings
import logging
from warnings import simplefilter

warnings.filterwarnings('ignore')
logging.captureWarnings(True)
simplefilter(action='ignore', category=FutureWarning)

<a class="anchor" id="Load"></a>
# Load Data

In [None]:
train_data = pd.read_csv('../input/titanic/train.csv')
train_data.head()

In [None]:
test_data = pd.read_csv('../input/titanic/test.csv')
test_data.head()

In [None]:
submission = pd.read_csv('../input/titanic/gender_submission.csv')
submission.head()

<a class="anchor" id="Visualize"></a>
# Visualize Data

Check for NaN values, which we will handle later during preprocessing.

In [None]:
sns.heatmap(train_data.isnull(), cbar=False)

In [None]:
sns.heatmap(test_data.isnull(), cbar=False)

Check for outliers, which we will remove later if there are any.

In [None]:
sns.swarmplot(data=train_data, x='Sex', y='Age', hue="Survived")

In [None]:
sns.swarmplot(data=train_data, x='Sex', y='Fare', hue="Survived")

In [None]:
sns.swarmplot(data=train_data, x='Sex', y='Pclass', hue="Survived")

In [None]:
sns.swarmplot(data=train_data, x='Sex', y='Parch', hue="Survived")

In [None]:
sns.swarmplot(data=train_data, x='Sex', y='SibSp', hue="Survived")

<a class="anchor" id="Preprocess"></a>
# Preprocess Data

### Impute/Remove NaN Values

First we are going to impute/remove the NaN values.

In [None]:
from sklearn.impute import SimpleImputer

def impute_nan_values(data, column):
    imr = SimpleImputer(missing_values=np.nan, strategy='median')
    print(f"Number of {column} NaN values before impute: {data[column].isnull().sum().sum()}")
    imr = imr.fit(data[[column]])
    data[column] = imr.transform(data[[column]]).ravel()
    print(f"Number of {column} NaN values after impute: {data[column].isnull().sum().sum()}")

def remove_nan_values(data, column):
    # drops the rows where `column` is NaN and returns the filtered frame
    print(f"Number of {column} NaN values before removal: {data[column].isnull().sum()}")
    _data = data[data[column].notnull()]
    print(f"Number of {column} NaN values after removal: {_data[column].isnull().sum()}")
    return _data
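
For reference, the same two steps (median imputation and dropping rows with a missing value) can be sketched with plain pandas on a made-up frame. The `toy` frame and its values are assumptions for illustration only, not the Titanic data.

```python
import numpy as np
import pandas as pd

# hypothetical toy frame with one missing Age and one missing Embarked
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Embarked": ["S", "C", None, "Q"],
})

# impute: replace missing ages with the median of the observed ages
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# remove: drop the rows where Embarked is missing
toy = toy.dropna(subset=["Embarked"])

print(toy)
```

Imputing keeps every row at the cost of inventing a value, while dropping keeps only observed values at the cost of losing rows, which is why the notebook imputes the frequently missing `Age` and drops the few rows without `Embarked`.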

In [None]:
for column in train_data.columns:
    print(f"{column}: {str(sum(train_data[column].isnull()))} missing values")

impute_nan_values(train_data, 'Age')
train_data = remove_nan_values(train_data, 'Embarked')

In [None]:
for column in test_data.columns:
    print(f"{column}: {str(sum(test_data[column].isnull()))} missing values")

impute_nan_values(test_data, 'Age')
impute_nan_values(test_data, 'Fare')

### Remove Outliers

Next, we are going to remove outliers.

In [None]:
"""
Usage of 'Z-score' (z = (x - μ) / σ) to find outliers
"""
def outliers_z_score(data):
    outliers=[]
    threshold = 6

    mean_y = np.mean(data)
    stdev_y = np.std(data)

    for i in data:
        z_score = (i-mean_y) / stdev_y
        if np.abs(z_score) > threshold:
            outliers.append(i)
    return outliers
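
The loop above can also be written without iterating in Python. The sketch below is a minimal vectorized variant on made-up fares; `z_score_mask` and the toy array are illustrative names and values, not part of this notebook's pipeline.

```python
import numpy as np

def z_score_mask(values, threshold=6.0):
    """Return a boolean mask that is True where |z| exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    std = values.std()  # population standard deviation, same default as np.std
    if std == 0:
        # all values identical: nothing can be an outlier
        return np.zeros(values.shape, dtype=bool)
    return np.abs((values - mean) / std) > threshold

# hypothetical data: 99 ordinary fares plus one extreme value
fares = np.array([10.0] * 99 + [500.0])
mask = z_score_mask(fares)
print(fares[mask])
```

A boolean mask also lets you filter a DataFrame in one step (`df[~mask]`) instead of removing outlier values one by one.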

In [None]:
# Age feature
age_outliers = outliers_z_score(train_data['Age'])
print(f"Age outliers: {age_outliers}")
for ao in age_outliers:     
    train_data = train_data[train_data.Age != ao]


# Fare feature
fare_outliers = outliers_z_score(train_data['Fare'])
print(f"Fare outliers: {fare_outliers}")
for fo in fare_outliers:     
    train_data = train_data[train_data.Fare != fo]
    
    
# Parch feature
parch_outliers = outliers_z_score(train_data['Parch'])
print(f"Parch outliers: {parch_outliers}")
for po in parch_outliers:
    train_data = train_data[train_data.Parch != po]

# SibSp feature
sibsp_outliers = outliers_z_score(train_data['SibSp'])
print(f"SibSp outliers: {sibsp_outliers}")
for so in sibsp_outliers:     
    train_data = train_data[train_data.SibSp != so]

In [None]:
sns.swarmplot(data=train_data, x='Sex', y='Fare', hue="Survived")

In [None]:
sns.swarmplot(data=train_data, x='Sex', y='Age', hue="Survived")

In [None]:
sns.swarmplot(data=train_data, x='Sex', y='Parch', hue="Survived")

In [None]:
sns.swarmplot(data=train_data, x='Sex', y='SibSp', hue="Survived")

Without the outliers it already looks much better!

### Drop redundant columns

In [None]:
# not going to use these columns to train/test on
train_data.drop(['Name', 'PassengerId', 'Cabin', 'Ticket'], inplace=True, axis=1)
test_data.drop(['Name', 'PassengerId', 'Cabin', 'Ticket'], inplace=True, axis=1)

In [None]:
print(train_data.dtypes)

### Categorical To Numerical Columns

In [None]:
from sklearn.preprocessing import LabelEncoder

# fit each encoder on the training data only and reuse it for the test data,
# so both sets share the same category-to-integer mapping
embarked_encoder = LabelEncoder()
train_data["Embarked"] = embarked_encoder.fit_transform(train_data["Embarked"])
test_data["Embarked"] = embarked_encoder.transform(test_data["Embarked"])

sex_encoder = LabelEncoder()
train_data["Sex"] = sex_encoder.fit_transform(train_data["Sex"])
test_data["Sex"] = sex_encoder.transform(test_data["Sex"])
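
Encoding train and test separately is only safe when both contain the same set of categories. The sketch below shows the pitfall on made-up values (the lists are assumptions, not the Titanic columns): `LabelEncoder` assigns integers by sorted class order, so a set with fewer categories gets a different mapping.

```python
from sklearn.preprocessing import LabelEncoder

# hypothetical values: train has all three ports, test only has "S"
train_ports = ["S", "C", "Q", "S"]
test_ports = ["S", "S"]

# fit once on train, then reuse the same mapping for test
shared_encoder = LabelEncoder().fit(train_ports)
consistent = shared_encoder.transform(test_ports)

# refitting on test alone silently changes what each integer means
inconsistent = LabelEncoder().fit_transform(test_ports)

print(list(shared_encoder.classes_))
print(list(consistent), list(inconsistent))
```

With the shared encoder, "S" keeps the code it had during training. With the refitted one, it becomes 0, which the model would read as a different port.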

In [None]:
print(train_data.dtypes)

In [None]:
f,ax = plt.subplots(figsize=(8, 8))
sns.heatmap(train_data.corr(), annot=True, linewidths=1, ax=ax)

Small recap of the fields:
* SibSp - Number of Siblings/Spouses Aboard
* Parch - Number of Parents/Children Aboard
* Pclass - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)

We can see several things in this correlation plot:
- the higher the Fare, the lower the Pclass number (a lower number means a higher class) => negative correlation of -0.61
- the higher the Sex value (female is encoded as 0, male as 1), the lower the survival rate => negative correlation of -0.55
- the higher the Parch, the higher the SibSp (that indicates large families) => positive correlation of 0.40
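
To make reading such a coefficient concrete, here is a minimal sketch on made-up numbers (the `toy` frame is an assumption, not the Titanic data). `DataFrame.corr()` computes the Pearson correlation by default, and a single cell of the resulting matrix can be looked up by column name.

```python
import pandas as pd

# hypothetical passengers: first class pays the most, third class the least
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Fare": [80.0, 70.0, 30.0, 25.0, 8.0, 7.0],
})

corr = toy.corr()  # Pearson correlation matrix
fare_vs_class = corr.loc["Fare", "Pclass"]

# negative sign: Fare goes down as the Pclass number goes up
print(round(fare_vs_class, 2))
```

The sign tells you the direction of the relationship and the magnitude tells you how close it is to linear. It says nothing about causation.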

<a class="anchor" id="Train"></a>
# Train Models

I had a hard time getting everything installed in one environment: a lot of package versions clashed because each AutoML library needs specific versions. I therefore install each AutoML library right before I use it, although this might break earlier installations.

In [None]:
target = train_data['Survived']
train_data.drop(['Survived'], inplace=True, axis=1)

X_train, X_test, y_train, y_test = train_test_split(train_data, target, test_size=0.25, random_state=42, shuffle=False)
print(f'Sizes: X_train={X_train.shape}, y_train={y_train.shape}, X_test={X_test.shape}, y_test={y_test.shape}')

# will need this later on for AutoGluon
X_train_with_target = X_train.copy()
X_train_with_target['Survived'] = y_train

print(f'Sizes: X_train_with_target={X_train_with_target.shape}')

Check one last time for null values.

In [None]:
sns.heatmap(X_train.isnull(), cbar=False)

In [None]:
X_train.head()

In [None]:
best_model = None
best_model_name = None
best_model_acc = 0.0

models = []

def validate_model(model_name, model, accuracy):
    global best_model, best_model_name, best_model_acc, models
    
    models.append([model_name, accuracy])

    print()
    print(f"Current accuracy of model {model_name}: {accuracy}")
    print(f"Previous best accuracy of model {best_model_name}: {best_model_acc}")

    if accuracy > best_model_acc:
        print(f"Improved previous accuracy!")
        best_model_acc = accuracy
        best_model = model
        best_model_name = model_name
    else:
        print(f"Did not improve previous accuracy.")

<a class="anchor" id="XGBoost"></a>
### XGBoost

We are going to train an XGBoost model as a baseline.

In [None]:
%%time
from xgboost import XGBClassifier
warnings.filterwarnings('ignore')
logging.captureWarnings(True)
simplefilter(action='ignore', category=FutureWarning)

# tree_method='gpu_hist' requires a GPU-enabled runtime
xgboost_model = XGBClassifier(tree_method='gpu_hist')
xgboost_model.fit(X_train, y_train)
y_preds_xgboost = xgboost_model.predict(X_test)

accuracy = accuracy_score(y_test, y_preds_xgboost)
validate_model('xgboost', xgboost_model, accuracy)

<a class="anchor" id="AutoSklearn"></a>
### AutoSklearn

In [None]:
%%capture
%%bash
# https://github.com/automl/auto-sklearn/issues/101
apt-get remove swig
apt-get install swig3.0
ln -s /usr/bin/swig3.0 /usr/bin/swig
pip3 install pyrfr
# https://stackoverflow.com/questions/55833509/attributeerror-type-object-callable-has-no-attribute-abc-registry
pip3 uninstall -y typing

In [None]:
%%capture
%%bash
# actual installation
# curl https://raw.githubusercontent.com/automl/auto-sklearn/master/requirements.txt | xargs -n 1 -L 1 pip3 install
pip3 install auto-sklearn

In [None]:
%%time
import autosklearn.classification
warnings.filterwarnings('ignore')
logging.captureWarnings(True)
simplefilter(action='ignore', category=FutureWarning)

# set time_left_for_this_task to prevent a trial from getting stuck (default 3600 seconds)
auto_sklearn_model = autosklearn.classification.AutoSklearnClassifier(time_left_for_this_task=3600, n_jobs=-1)

auto_sklearn_model.fit(X_train, y_train)
y_preds_autosklearn = auto_sklearn_model.predict(X_test)

accuracy = accuracy_score(y_test, y_preds_autosklearn)
validate_model('autosklearn', auto_sklearn_model, accuracy)

<a class="anchor" id="HyperOptSklearn"></a>
### HyperOptSklearn

In [None]:
%%capture
%%bash
rm -rf hyperopt-sklearn
git clone https://github.com/hyperopt/hyperopt-sklearn.git
(cd hyperopt-sklearn && pip3 install -e .)
mv hyperopt-sklearn/hpsklearn /opt/conda/lib/python3.7/site-packages/hpsklearn

In [None]:
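# limit OpenMP to one thread; set it via os.environ because `! export`
# only affects a throwaway subshell and not this kernel
os.environ['OMP_NUM_THREADS'] = "1"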

In [None]:
%%time
from hpsklearn import HyperoptEstimator, any_classifier, any_preprocessing
warnings.filterwarnings('ignore')
logging.captureWarnings(True)
simplefilter(action='ignore', category=FutureWarning)

# setting seed to avoid a trial getting stuck
np.random.seed(42)

# set max_evals to prevent too long search (default 100)
# set trial_timeout to prevent a trial from getting stuck (default None)
estim = HyperoptEstimator(
    classifier=any_classifier('my_clf'),
    preprocessing=any_preprocessing('my_pre'),
    n_jobs=-1,
    max_evals=40,
    trial_timeout=400
)

estim.fit(X_train, y_train)
y_preds_hyperopt = estim.predict(X_test)

accuracy = accuracy_score(y_test, y_preds_hyperopt)
validate_model('hyperopt', estim, accuracy)

print(estim.best_model())

<a class="anchor" id="TPOT"></a>
### TPOT

In [None]:
%%capture
%%bash
pip3 install tpot

In [None]:
%%time
from tpot import TPOTClassifier
warnings.filterwarnings('ignore')
logging.captureWarnings(True)
simplefilter(action='ignore', category=FutureWarning)

# set generations and population_size to prevent too long search (both default to 100)
tpot_classifier = TPOTClassifier(generations=50, population_size=50, verbosity=2, n_jobs=-1)
tpot_classifier.fit(X_train, y_train)
y_preds_tpot = tpot_classifier.predict(X_test)

tpot_classifier.export('tpot_pipeline.py')

accuracy = accuracy_score(y_test, y_preds_tpot)
validate_model('tpot', tpot_classifier, accuracy)

<a class="anchor" id="AutoGluon"></a>
### AutoGluon

In [None]:
%%capture
%%bash
python3 -m pip install --upgrade "mxnet<2.0.0"
pip3 install autogluon autogluon.tabular
# https://github.com/awslabs/autogluon/issues/810
pip3 install --upgrade pillow

In [None]:
%%time
from autogluon.tabular import TabularPrediction as task
warnings.filterwarnings('ignore')
logging.captureWarnings(True)
simplefilter(action='ignore', category=FutureWarning)

# autogluon needs target in the training_data
predictor = task.fit(train_data=X_train_with_target, label='Survived')
y_preds_autogluon = predictor.predict(X_test)

accuracy = accuracy_score(y_test, y_preds_autogluon)
validate_model('autogluon', predictor, accuracy)

print(predictor.leaderboard())

<a class="anchor" id="H2OAutoML"></a>
### H2OAutoML

In [None]:
%%capture
%%bash
pip3 install h2o

In [None]:
%%time
import h2o
from h2o.sklearn import H2OAutoMLClassifier
warnings.filterwarnings('ignore')
logging.captureWarnings(True)
simplefilter(action='ignore', category=FutureWarning)

h2o.init()

# set max_runtime_secs to prevent too long search (default 3600)
aml = H2OAutoMLClassifier(max_runtime_secs=3600)

aml.fit(X_train, y_train.values)
y_preds_h2o = aml.predict(X_test)

accuracy = accuracy_score(y_test, y_preds_h2o)
validate_model('H2OautoML', aml, accuracy)

<a class="anchor" id="AutoKeras"></a>
### AutoKeras

In [None]:
%%capture
%%bash
# https://github.com/tensorflow/tensorflow/issues/42441
pip3 install autokeras emcee pyDOE

In [None]:
%%time
import autokeras as ak
import tensorflow as tf
tf.get_logger().setLevel(logging.ERROR)
warnings.filterwarnings('ignore')
logging.captureWarnings(True)
simplefilter(action='ignore', category=FutureWarning)

# set max_trials to prevent too long search (default 100)
clf = ak.StructuredDataClassifier(overwrite=True, max_trials=100)
clf.fit(x=X_train, y=y_train)
y_preds_autokeras = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_preds_autokeras)
validate_model('autokeras', clf, accuracy)

print(clf.export_model().summary())

<a class="anchor" id="MLJarSupervised"></a>
### MLJarSupervised

In [None]:
%%capture
%%bash
pip3 install mljar-supervised
pip3 install matplotlib==3.1.3

In [None]:
%%time
from supervised.automl import AutoML
warnings.filterwarnings('ignore')
logging.captureWarnings(True)
simplefilter(action='ignore', category=FutureWarning)

# https://github.com/mljar/mljar-supervised#available-modes-books
# set total_time_limit to prevent too long search (default 3600 seconds)
# features_selection causes issues with xgboost on gpu
automl = AutoML(
    mode="Compete",
    stack_models=True,
    train_ensemble=True,
    total_time_limit=3600,
    features_selection=False
)

automl.fit(X_train, y_train)
y_preds_mljar = automl.predict(X_test)

accuracy = accuracy_score(y_test, y_preds_mljar)
validate_model('mljar-supervised', automl, accuracy)

automl.get_leaderboard()

<a class="anchor" id="Submission"></a>
# Submission File

In [None]:
models_df = pd.DataFrame(models, columns=['model_name', 'accuracy'])
models_df.sort_values(by=['accuracy'], ascending=False, inplace=True)
models_df = models_df.reset_index(drop=True)
models_df

Below we print out the best-performing model. Let's use this model to generate predictions for our final submission.

In [None]:
print(best_model_name)
print(best_model)
print(best_model_acc)

In [None]:
y_preds = best_model.predict(test_data)
submission['Survived'] = y_preds.ravel().astype(int)
submission.to_csv('submission.csv', index = False)

Sidenote: it was really, really difficult to get all of these AutoML libraries to work in one notebook; I encountered a lot of dependency issues. If you ever use AutoML, pick one library to run in your notebook.

TODO: incorporate [AutoPyTorch](https://github.com/automl/Auto-PyTorch)

<a class="anchor" id="Cleanup"></a>
# Cleanup

In [None]:
! rm -rf */