This is my naive attempt at the [Titanic - Machine Learning from Disaster](https://www.kaggle.com/competitions/titanic) Kaggle competition. This notebook is by no means meant to be the best solution for the competition. I am adding more detail about what I am trying here because I have seen a sudden interest in the notebook.

## Simple exploratory data analysis

Loading some simple libraries and looking at the dataset files to be used.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Loading the files in the dataset.

In [None]:
testDf = pd.read_csv('/kaggle/input/titanic/test.csv')
submissionDf = pd.read_csv('/kaggle/input/titanic/gender_submission.csv')
trainDf = pd.read_csv('/kaggle/input/titanic/train.csv')
trainDf.info()

In [None]:
trainDf.head()

Simple overall statistics.

In [None]:
trainDf.describe()

Checking the distributions of the variables in the train and test sets. This is useful to get an idea of what these variables look like in each set.

In [None]:
trainDf.hist(bins=20, figsize=(20,20))
testDf.hist(bins=20, figsize=(20,20))
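
The histograms can be complemented with a numeric side-by-side view. This is only a sketch on made-up numbers: `train_toy` and `test_toy` are hypothetical stand-ins for `trainDf` and `testDf`, not the real data.

```python
import pandas as pd

# Toy stand-ins for trainDf and testDf (values are made up for illustration).
train_toy = pd.DataFrame({"Age": [22.0, 38.0, None, 35.0],
                          "Fare": [7.25, 71.28, 7.92, 53.10]})
test_toy = pd.DataFrame({"Age": [34.5, 47.0, 62.0, None],
                         "Fare": [7.83, 7.00, 9.69, 8.66]})

# Put the describe() tables next to each other, one column group per dataset.
comparison = pd.concat(
    [train_toy.describe().T, test_toy.describe().T],
    axis=1,
    keys=["train", "test"],
)

# Compare, for example, the means of each variable in both sets.
print(comparison[[("train", "mean"), ("test", "mean")]])
```

The `count` column of such a table also shows how many non-missing values each variable has in each set.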

I did not see an obvious way to use the `Cabin` variable directly, but I noticed that each cabin value starts with a letter, which may be related to the "level" or floor of the ship. I will try to use this letter as a new variable, `Level`.

In [None]:
trainDf['Level'] = trainDf.Cabin.str[0]
testDf['Level'] = testDf.Cabin.str[0]
# print(trainDf.Cabin.unique())
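
To see what `.str[0]` does with missing cabins, here is a small sketch on a made-up `Cabin` column. The `"U"` label for unknown cabins is a hypothetical choice for illustration and is not used elsewhere in this notebook.

```python
import pandas as pd

# Made-up cabin values, including missing ones and a multi-cabin entry.
cabins = pd.Series(["C85", None, "E46", "B57 B59 B63", None], name="Cabin")

# .str[0] keeps the first character and leaves missing values missing.
level = cabins.str[0]
print(level.tolist())

# Optionally give missing cabins an explicit label ("U" is an assumption).
level_filled = level.fillna("U")
print(level_filled.value_counts())
```

Without the `fillna` step, passengers with no cabin information keep a missing `Level`.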

Looking at the correlations of the initial variables:

In [None]:
plt.figure(figsize=(15,10))
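# numeric_only=True skips the text columns (Name, Sex, Ticket, ...), which recent pandas versions no longer drop silently
sns.heatmap(trainDf.corr(numeric_only=True), annot=True, center=0, linewidths=.5, fmt='.2f', vmin=-1, vmax=1, cmap='RdBu')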

## Simplest Logistic Regression

Since I want a starting point, I fit the simplest possible logistic regression. I plan to use its accuracy as a baseline to build my model on top of.

In [None]:
import scipy.stats as sp
from sklearn import preprocessing 
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

Here I am dropping some columns in order to train the simplest possible model, without doing any feature engineering. These variables will be used later.

In [None]:
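# 'Survived' is the target, so it must not stay among the features (otherwise the model sees the answer)
colToDrop = ['Survived', 'Cabin', 'Name', 'Sex', 'Ticket', 'Embarked', 'Age', 'Level']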
X_train, X_test, y_train, y_test = train_test_split( trainDf.drop(columns=colToDrop),
                                                    trainDf['Survived'],
                                                    test_size=0.2, 
                                                    random_state=0, shuffle=True)
reg = LogisticRegression(max_iter=10000).fit(X_train, y_train)
# pred = reg.predict(X_test)
# accuracy_score(y_test, pred)
reg.score(X_test, y_test)
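
A useful reference for any accuracy value is what a model that always predicts the majority class would get. A minimal sketch with scikit-learn's `DummyClassifier`, on a made-up label vector (not the Titanic labels):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: 7 passengers did not survive (0), 3 survived (1).
y_toy = np.array([0, 0, 0, 1, 0, 1, 0, 0, 1, 0])
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by this strategy

# Always predict the most frequent class seen during fit.
dummy = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_acc = dummy.score(X_toy, y_toy)
print(baseline_acc)  # fraction of samples in the majority class
```

Any real model should score above this number on the same split to be worth keeping.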

## First serious attempt

First, I looked at [this example](https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html?highlight=titanic) and tried to build something on top of it.

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV

In [None]:
colToDrop = ['Survived', 'Name', 'Cabin']
X_train, X_test, y_train, y_test = train_test_split( trainDf.drop(columns=colToDrop),
                                                    trainDf['Survived'],
                                                    test_size=0.2, 
                                                    random_state=0, shuffle=True)
# X_train = trainDf.drop(columns=colToDrop)
# y_train = trainDf['Survived']
# X_test = testDf.drop(columns=colToDrop)
# y_test = testDf['Survived']

In [None]:
numeric_features = ["Age", "Fare"]

numeric_transformer = Pipeline(
    steps=[ 
        ("imputer", SimpleImputer(strategy="most_frequent", add_indicator=True)), 
#         ("imputer", KNNImputer(add_indicator=True)), 
        ("scaler", StandardScaler())
    ]
)

categorical_features = ["Embarked", "Sex", "Pclass", 'Level']
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

clf = Pipeline(
    steps=[
        ("preprocessor", preprocessor), 
        ("classifier", LogisticRegression())
    ]
)

clf.fit(X_train, y_train)
print("model score: %.4f" % clf.score(X_test, y_test))
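
The `handle_unknown="ignore"` option matters when the test set contains a category that was never seen during training. A small sketch on made-up embarkation codes (the value `"X"` is invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit on the three categories seen in "training".
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(np.array([["S"], ["C"], ["Q"]]))
print(enc.categories_)  # categories are stored in sorted order

# "C" is known; "X" was never seen and is encoded as all zeros instead of raising.
encoded = enc.transform(np.array([["C"], ["X"]])).toarray()
print(encoded)
```

With the default `handle_unknown="error"`, the second row would raise an error at prediction time.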

Saving the prediction and the `submission.csv` file. 

In [None]:
# testDf = pd.read_csv('/kaggle/input/titanic/test.csv')

finalDf = pd.DataFrame() 
finalDf['PassengerId'] = testDf['PassengerId']
finalDf['Survived'] = clf.predict(testDf.drop(columns=['Name', 'Cabin']))
finalDf.to_csv('submission.csv', index=False)

In [None]:
pd.read_csv('submission.csv')

## Different test with parameter optimization

These tests are inspired by this [notebook](https://www.kaggle.com/code/jiwonkng/tabular-playground-apr-22). Kudos to the author.

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from scipy.stats import uniform, randint

# plot_confusion_matrix was removed from recent scikit-learn releases and is not used below
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

from sklearn.pipeline import Pipeline
import time
import warnings
warnings.filterwarnings('ignore')

In order to use `GridSearchCV`, I am creating a function that wraps the data transformations and a given classifier into a single pipeline.

In [None]:
def pipeline( Classifier ):
    numeric_features = ["Age", "Fare"]
    numeric_transformer = Pipeline(
        steps=[ 
            ("imputer", SimpleImputer(strategy="most_frequent", add_indicator=True)), 
            ("scaler", StandardScaler())
        ]
    )

    categorical_features = ["Embarked", "Sex", "Pclass", 'Level']
    categorical_transformer = OneHotEncoder(handle_unknown="ignore")

    preprocessor = ColumnTransformer(
        transformers=[
            ("num", numeric_transformer, numeric_features),
            ("cat", categorical_transformer, categorical_features),
        ]
    )

    clf = Pipeline(
        steps=[
            ("preprocessor", preprocessor), 
            ("classifier", Classifier )
        ]
    )
    return clf

Creating different dictionaries with the algorithms and parameters that I want to test. 

In [None]:
classifiers = {
    "KNN" : pipeline(KNeighborsClassifier()),
    "LogisticRegression" : pipeline(LogisticRegression(random_state=42)),
    "RandomForest" : pipeline(RandomForestClassifier(random_state=42)),
    "LGBM" : pipeline(LGBMClassifier(random_state=42)),
    "SVM" : pipeline(SVC(random_state=42))
}

# define grid
KNN_grid = {'classifier__n_neighbors': [3, 5, 7, 9],
            'classifier__p': [1, 2]}

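# 'liblinear' supports both the l1 and l2 penalties; the default 'lbfgs' solver rejects l1
LR_grid = {'classifier__penalty': ['l1','l2'],
           'classifier__solver': ['liblinear'],
           'classifier__C': [0.25, 0.5, 0.75, 1, 1.25],
           'classifier__max_iter': [50, 100, 150]}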

RF_grid = {'classifier__n_estimators': [50, 100, 150, 200],
        'classifier__max_depth': [6, 8, 10, 12]}

LGBM_grid = {'classifier__n_estimators': [50, 100, 150, 200],
        'classifier__max_depth': [6, 8, 10, 12],
        'classifier__learning_rate': [0.05, 0.1, 0.15]}

SVM_grid = [{'classifier__C': [0.01, 0.1, 1.0, 10.0],
             'classifier__kernel': ['linear']},
            {'classifier__C': [0.01, 0.1, 1.0, 10.0],
             'classifier__gamma': [0.01, 0.1, 1.0, 10.0],
             'classifier__kernel': ['rbf']}]

grid = {
    "KNN" : KNN_grid,
    "LogisticRegression" : LR_grid,
    "RandomForest" : RF_grid,
    "LGBM" : LGBM_grid,
    "SVM" : SVM_grid
}
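
The `classifier__` prefix in the grids comes from how scikit-learn addresses parameters inside a `Pipeline`: the step name, two underscores, then the parameter name. A minimal sketch with a simplified two-step pipeline (not the full preprocessing used in this notebook):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("classifier", LogisticRegression()),
    ]
)

# Every tunable parameter of the "classifier" step is exposed as classifier__<name>.
tunable = [name for name in pipe.get_params() if name.startswith("classifier__")]
print(tunable)

# The same names work with set_params, which is what the grid search relies on.
pipe.set_params(classifier__C=0.5)
print(pipe.named_steps["classifier"].C)
```

Listing `get_params()` like this is a quick way to check the spelling of a key before putting it in a grid.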

In [None]:
i=0
clf_best_params = classifiers.copy()
scores = pd.DataFrame({
                    'Classifier': list(classifiers.keys()),
                    'Train accuracy' : np.zeros(len(classifiers)),
                    'Validation accuracy': np.zeros(len(classifiers)),
                    'Training time': np.zeros(len(classifiers))
                    })
for key, classifier in classifiers.items():
    start = time.time()
    clf = GridSearchCV(estimator=classifier, param_grid=grid[key], n_jobs=-1, cv=None)

    clf.fit(X_train, y_train)
    scores.iloc[i,1]=clf.score(X_train, y_train)
    scores.iloc[i,2]=clf.score(X_test, y_test)
    clf_best_params[key]=clf.best_params_
    
    stop = time.time()
    scores.iloc[i,3]=np.round((stop - start)/60, 2)
    
    print('Model:', key)
    print('Training time (mins):', scores.iloc[i,3])
    print('')
    i+=1

The best parameters per algorithm are:

In [None]:
clf_best_params

Looking at the different scores per algorithm tested.

In [None]:
scores
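
Besides the train and validation accuracy, a fitted `GridSearchCV` also stores the mean cross-validated score of every parameter combination. A small sketch on synthetic data from `make_classification` (not the Titanic data), with an arbitrary grid chosen for illustration:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification problem, only for illustration.
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)

search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 5, 15]}, cv=5)
search.fit(X_syn, y_syn)

# One row per parameter combination, with its mean score over the 5 folds.
results = pd.DataFrame(search.cv_results_)[["param_n_neighbors", "mean_test_score"]]
print(results)

# best_score_ is the mean cross-validated score of the best combination.
print(search.best_params_, search.best_score_)
```

Since `best_score_` is computed on held-out folds, it is usually a more honest number than the accuracy on the data used for fitting.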

In [None]:
lgb = pipeline(LGBMClassifier(learning_rate=0.1, max_depth=5, n_estimators=100, random_state=42))

splitter = StratifiedKFold(n_splits = 3, shuffle = True, random_state=42)
# Use a separate name so the `scores` table from the grid search is not overwritten
cv_scores = cross_validate(lgb, X_train, y_train, return_train_score = True, cv=splitter)
print("\t", np.mean(cv_scores['train_score']), np.mean(cv_scores['test_score']), "\n")
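
`StratifiedKFold` keeps the class proportions roughly the same in every fold, which is helpful when one class is less frequent. A minimal sketch on made-up labels:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up labels: 6 negatives and 3 positives.
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1])
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for the split itself

splitter_toy = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# Count how many positives end up in each validation fold.
positives_per_fold = [
    int(y_toy[test_idx].sum()) for _, test_idx in splitter_toy.split(X_toy, y_toy)
]
print(positives_per_fold)
```

Each validation fold here holds three samples, one of them positive, matching the overall 1:2 ratio.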

Let's look at a table comparing the predictions and the actual values (a confusion matrix). This helps to see visually how far we are from the best solution.

In [None]:
lgb.fit(X_train, y_train)
y_pred = lgb.predict(X_test)
pd.DataFrame(confusion_matrix(y_test, y_pred),
                index = [["actual", "actual"], ["N", "P"]],
                columns = [["pred", "pred"], ["N", "P"]])
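
As a reminder of how to read this table: in scikit-learn's `confusion_matrix`, rows are the actual classes and columns are the predicted ones. A small sketch with made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Made-up actual and predicted labels.
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_toy = [0, 0, 1, 0, 1, 0, 1, 0]

cm = confusion_matrix(y_true_toy, y_pred_toy)
print(cm)

# For a binary problem, ravel() returns the cells in this order.
tn, fp, fn, tp = cm.ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```

The diagonal holds the correct predictions, so a perfect classifier would have zeros everywhere else.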

Checking all the scores available:

In [None]:
print("Acc. :", accuracy_score(y_test, y_pred))
print("Prec. :", precision_score(y_test, y_pred))
print('Recall :', recall_score(y_test, y_pred))
print('f1. :', f1_score(y_test, y_pred))
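
All four scores can be derived from the confusion matrix counts. The sketch below recomputes them by hand on made-up labels and compares against scikit-learn:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up labels: 4 actual positives, 6 actual negatives.
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of correct predictions
precision = tp / (tp + fp)                  # of predicted positives, how many are right
recall = tp / (tp + fn)                     # of actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, precision, recall, f1)
print(accuracy_score(y_true_toy, y_pred_toy), precision_score(y_true_toy, y_pred_toy),
      recall_score(y_true_toy, y_pred_toy), f1_score(y_true_toy, y_pred_toy))
```

Both printed lines should agree, which is a quick sanity check on the definitions.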

In [None]:
fig, ax = plt.subplots(figsize=(7,7))

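# Use predicted probabilities rather than hard 0/1 labels, so the curve covers more than one threshold
y_score = lgb.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_score)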
ax.plot(fpr, tpr, color='r', lw=2)
ax.plot([0, 1], [0, 1], color="navy", lw=1, linestyle="--")
plt.gca().set_aspect('equal')

ax.set_xlim([0.0, 1.0])
ax.set_ylim([0.0, 1.0])
ax.set_xlabel("FPR", size=12)
ax.set_ylabel("TPR", size=12)
ax.set_title("ROC Curve", size=15)

plt.show()

# AUC is computed from the predicted probability of the positive class
print("AUC Score:", roc_auc_score(y_test, lgb.predict_proba(X_test)[:, 1]))
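
The ROC curve and the AUC are defined over continuous scores, so whether `roc_auc_score` receives probabilities or hard 0/1 labels can change the result. A small sketch on made-up scores, with a 0.5 threshold assumed for the hard labels:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up true labels and predicted probabilities of the positive class.
y_true_toy = np.array([0, 0, 1, 1, 1])
proba_toy = np.array([0.2, 0.6, 0.4, 0.7, 0.9])

# Hard labels obtained with a 0.5 threshold (the threshold is an assumption).
labels_toy = (proba_toy >= 0.5).astype(int)

auc_from_scores = roc_auc_score(y_true_toy, proba_toy)
auc_from_labels = roc_auc_score(y_true_toy, labels_toy)
print(auc_from_scores, auc_from_labels)
```

Thresholding throws away the ranking information within each predicted class, which is why the two values differ here.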

In [None]:
# finalDf = pd.DataFrame() 
# finalDf['PassengerId'] = testDf['PassengerId']
# finalDf['Survived'] = lgb.predict(testDf.drop(columns=['Name', 'Cabin']))
# finalDf.to_csv('submission.csv', index=False)

## Test with AutoML

I recently learned about AutoML and I want to give it a try! This step is based on [this notebook](https://www.kaggle.com/code/jeonghyunha/automl-with-titanic-dataset#Additional-Feature-Engineering).

In [None]:
! pip install mljar-supervised

In [None]:
from supervised.automl import AutoML # mljar-supervised
automl = AutoML( mode="Compete", eval_metric='accuracy' )
automl.fit(X_train, y_train)

In [None]:
pd.set_option('display.max_rows', None)
automl.get_leaderboard().sort_values('metric_value')

In [None]:
finalDf = pd.DataFrame() 
finalDf['PassengerId'] = testDf['PassengerId']
# Predict on the same columns that were used for training
finalDf['Survived'] = automl.predict(testDf[X_train.columns])
finalDf.to_csv('submission.csv', index=False)

In [None]:
# ! tar -czvf AutoML_1.tar.gz AutoML_1/

## Test with autogluon

(Funny side note: I am a particle physicist who has dedicated his career to studying interactions of gluons and quarks, so I am particularly invested in playing with "autogluon" :P)

In [None]:
! pip install autogluon

In [None]:
from autogluon.tabular import TabularPredictor, TabularDataset
predictor = TabularPredictor(label='Survived', eval_metric='accuracy').fit(trainDf, presets='best_quality')

In [None]:
predictor.leaderboard()

In [None]:
finalDf = pd.DataFrame() 
finalDf['PassengerId'] = testDf['PassengerId']
finalDf['Survived'] = predictor.predict(testDf)
finalDf.to_csv('submission.csv', index=False)
finalDf.head()

#### More tests are yet to come!