## AutoML Basic Concepts

[Source](https://github.com/EpistasisLab/tpot/blob/master/tutorials/Titanic_Kaggle.ipynb)

In [None]:
#! pip install -U tpot

In [None]:
# Import required libraries
from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split
import pandas as pd 
import numpy as np

**Note**
If you cannot import tpot because of the following error:

`AttributeError: module 'numpy' has no attribute 'float'.`

downgrade NumPy to version 1.23.5. The deprecated `np.float` alias was removed in NumPy 1.24.
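
As a quick diagnostic, the installed NumPy version can be compared against 1.24, the release that removed `np.float`. This is only a sketch; the helper name `numpy_has_float_alias` is made up for illustration and is not part of NumPy or TPOT.

```python
import numpy as np


def numpy_has_float_alias(version: str) -> bool:
    """Return True if this NumPy version predates 1.24, which removed np.float."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) < (1, 24)


if numpy_has_float_alias(np.__version__):
    print(f"NumPy {np.__version__}: the np.float alias is still available.")
else:
    print(f"NumPy {np.__version__}: np.float is gone; pin numpy==1.23.5 if the import fails.")
```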

In [None]:
# Load the data
titanic = pd.read_csv('../../Data/data_titanic/train.csv')
titanic.head(5)

### Data Exploration
Before we get going with TPOT, we start with some simple data exploration to understand our data set.

In [None]:
titanic.groupby('Sex').Survived.value_counts()

In [None]:
titanic.groupby(['Pclass','Sex']).Survived.value_counts()

In [None]:
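# Count survivors per (Pclass, Sex) group, then divide each row by its total
# to obtain proportions. The name avoids shadowing Python's built-in id().
survival_counts = pd.crosstab([titanic.Pclass, titanic.Sex], titanic.Survived.astype(float))
survival_counts.div(survival_counts.sum(axis=1).astype(float), axis=0)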
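
The row-wise normalisation above can also be written more compactly. The sketch below uses a small made-up table (not the Kaggle data) to show two equivalent routes: the group mean of a 0/1 column, and `pd.crosstab` with `normalize="index"`.

```python
import pandas as pd

# Toy passenger table (made-up values, not the Kaggle data)
toy = pd.DataFrame({
    "Pclass":   [1, 1, 3, 3, 3, 3],
    "Sex":      ["female", "male", "female", "male", "male", "male"],
    "Survived": [1, 0, 1, 0, 0, 1],
})

# Because Survived is 0/1, its group mean is the survival rate
rates = toy.groupby(["Pclass", "Sex"])["Survived"].mean()

# Same proportions via crosstab, normalising each row to sum to 1
table = pd.crosstab([toy["Pclass"], toy["Sex"]], toy["Survived"], normalize="index")

print(rates)
print(table)
```

The column `table[1]` holds the same numbers as `rates`; the crosstab additionally shows the complementary share of non-survivors.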

### Data Munging
The first and most important step in using TPOT on any data set is to rename the target class/response variable to 'class'.

In [None]:
titanic.rename(columns={'Survived': 'class'}, inplace=True)

At present, TPOT requires all the data to be in numerical format. As we can see below, our data set has 5 categorical variables which contain non-numerical values: 'Name', 'Sex', 'Ticket', 'Cabin' and 'Embarked'.

In [None]:
titanic.dtypes

We then check the number of distinct levels that each of the five categorical variables has.

In [None]:
for cat in ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']:
    print("Number of levels in category '{0}': {1}".format(cat, titanic[cat].unique().size))
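
Note that `unique().size` counts a missing value as a level of its own, whereas `nunique()` drops missing values by default. A minimal sketch on a made-up column shows the difference:

```python
import pandas as pd

# Toy column with one missing entry (made-up values)
embarked = pd.Series(["S", "C", None, "S", "Q"], name="Embarked")

n_with_missing = embarked.unique().size       # the missing value counts as a level
n_without_missing = embarked.nunique()        # dropna=True by default
n_explicit = embarked.nunique(dropna=False)   # matches unique().size

print(n_with_missing, n_without_missing, n_explicit)
```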

As we can see, 'Sex' and 'Embarked' have only a few levels. Let's find out what they are.

In [None]:
for cat in ['Sex', 'Embarked']:
    print("Levels for category '{0}': {1}".format(cat, titanic[cat].unique()))

We then encode these levels manually as numerical values. Missing values (NaN) are replaced with a placeholder value (-999); we apply this replacement to the entire data set.

In [None]:
titanic['Sex'] = titanic['Sex'].map({'male':0,'female':1})
titanic['Embarked'] = titanic['Embarked'].map({'S':0,'C':1,'Q':2})

In [None]:
titanic = titanic.fillna(-999)
pd.isnull(titanic).any()
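
One detail worth keeping in mind: `Series.map` with a dictionary turns every value that has no key in the dictionary into NaN, so the later `fillna(-999)` also catches unexpected levels. A small sketch with made-up values, where `"X"` is a hypothetical unseen level:

```python
import pandas as pd

embarked = pd.Series(["S", "C", "Q", None, "X"])   # "X" is a made-up unseen level

codes = embarked.map({"S": 0, "C": 1, "Q": 2})     # unmapped entries become NaN
print(codes.isna().sum())                          # number of entries still to fill

filled = codes.fillna(-999)
print(filled.tolist())
```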

Since 'Name' and 'Ticket' have so many different levels, we drop them from the analysis in this example for the sake of simplicity. For 'Cabin', we encode the levels as binary indicator columns using Scikit-learn's `MultiLabelBinarizer` and treat them as new features.

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
CabinTrans = mlb.fit_transform([{str(val)} for val in titanic['Cabin'].values])

In [None]:
CabinTrans
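
To make the shape of this encoding concrete, here is a minimal sketch on four made-up cabin values. Each value is wrapped in a one-element set, as in the cell above, so every row ends up with exactly one 1. The variable name `mlb_toy` is chosen only to avoid clobbering `mlb`.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy cabin values; "-999" stands in for a missing cabin, as in the text
cabins = ["C85", "-999", "C85", "E46"]

mlb_toy = MultiLabelBinarizer()
# One column per distinct level, one row per sample
encoded = mlb_toy.fit_transform([{c} for c in cabins])

print(mlb_toy.classes_)   # the learned levels, in sorted order
print(encoded)
```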

Drop the unused features from the dataset.

In [None]:
titanic_new = titanic.drop(['Name','Ticket','Cabin','class'], axis=1)

In [None]:
assert (len(titanic['Cabin'].unique()) == len(mlb.classes_)), "Not Equal" #check correct encoding done

We then add the encoded features to form the final dataset to be used with TPOT.

In [None]:
titanic_new = np.hstack((titanic_new.values,CabinTrans))

In [None]:
np.isnan(titanic_new).any()

Keep in mind that the final data set is in the form of a numpy array. We can check the number of features in the final data set as follows.

In [None]:
titanic_new[0].size
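
The stacking step can be illustrated on made-up arrays: `np.hstack` concatenates column-wise, so both inputs need the same number of rows, and the feature count is the second entry of `shape`.

```python
import numpy as np

base = np.array([[1, 0, 22.0],
                 [2, 1, 38.0]])      # 2 rows, 3 numeric features (made-up)
indicators = np.array([[1, 0],
                       [0, 1]])      # 2 rows, 2 indicator columns

combined = np.hstack((base, indicators))

print(combined.shape)   # (rows, features)
print(combined.dtype)   # common numeric dtype of both inputs
```

`combined.shape[1]` gives the same number as `combined[0].size`, without indexing a row first.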

Finally, we store the class labels that we need to predict in a separate variable.

In [None]:
titanic_class = titanic['class'].values

### Data Analysis using TPOT
To begin our analysis, we need to divide our training data into training and validation sets. The validation set is only there to give us an idea of the test set error. Model selection and tuning are taken care of entirely by TPOT, so we could skip the creation of this validation set if we wanted to.

In [None]:
training_indices, validation_indices = train_test_split(
    titanic.index, stratify=titanic_class, train_size=0.75, test_size=0.25
)
training_indices.size, validation_indices.size
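
The effect of `stratify` is that both parts keep roughly the same class balance as the full data. The sketch below checks this on made-up labels with 40% positives; `random_state=0` is added here only to make the toy run repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 40% positives (made-up, not the Titanic labels)
labels = np.array([1] * 40 + [0] * 60)
indices = np.arange(labels.size)

train_idx, valid_idx = train_test_split(
    indices, stratify=labels, train_size=0.75, test_size=0.25, random_state=0
)

print(train_idx.size, valid_idx.size)
# Share of positives in each part; both should be close to 0.4
print(labels[train_idx].mean(), labels[valid_idx].mean())
```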

After that, we proceed with calling the fit, score and export functions on our training dataset. To get a better idea of how these functions work, refer to the TPOT documentation.

An important TPOT parameter is the **number of generations**. Since our aim is only to illustrate the use of TPOT, we instead cap the total optimization time at 2 minutes (`max_time_mins=2`). On a standard laptop with 4 GB of RAM, one generation takes roughly 5 minutes, and each additional generation adds about 5 minutes more. With the default of 100 generations, the total run time would therefore be roughly 8 hours.
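
The run-time estimate is simple arithmetic on the two figures quoted above, and is only as accurate as the rough 5-minute figure:

```python
minutes_per_generation = 5   # rough figure quoted in the text
generations = 100            # default value quoted in the text

total_minutes = minutes_per_generation * generations
total_hours = total_minutes / 60

print(f"{total_minutes} minutes = {total_hours:.1f} hours")
```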

In [None]:
# Only relevant for Databricks users using an ML runtime. 
# import mlflow
# mlflow.autolog(disable=True)  # As MLFlow is not totally integrated with TPOT, we disable the autologging when running on Databricks

In [None]:
tpot = TPOTClassifier(verbosity=2, max_time_mins=2, max_eval_time_mins=0.04, population_size=40)

In [None]:
tpot.fit(titanic_new[training_indices], titanic_class[training_indices])

In [None]:
tpot.score(titanic_new[validation_indices], titanic.loc[validation_indices, 'class'].values)

In [None]:
tpot.export('tpot_titanic_pipeline.py') # locally
#tpot.export('/tmp/tpot_titanic_pipeline.py') # Databricks

Let's have a look at the generated code. In the original run, a random forest classifier performed best among the models that TPOT evaluates. Because TPOT's search is stochastic, your run may select a different pipeline. Running TPOT for more generations should improve the score further.

To take a look at the saved tpot file, run:
> `%load tpot_titanic_pipeline.py`

If you want to use this file, you need to specify the path of the *preprocessed* data where it says `PATH/TO/DATA/FILE`.

### Make predictions on the submission data

In [None]:
# Read in the submission dataset
titanic_sub = pd.read_csv('../../Data/data_titanic/test.csv')
titanic_sub.describe()

When looking at fresh data to make predictions on, the most important step is to **check for new levels in the categorical variables** of the submission data set which were absent in the training set. We identify them and set them to our placeholder value of `-999`, i.e., we treat them as missing values. This ensures training consistency, as otherwise the model would not know what to do with the new levels in the submission data set.

In [None]:
for var in ['Cabin']: #,'Name','Ticket']:
    new = list(set(titanic_sub[var]) - set(titanic[var]))
    titanic_sub.loc[titanic_sub[var].isin(new), var] = -999
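
The same replacement can be traced on two tiny made-up frames. The names `train_toy`, `sub_toy` and `PLACEHOLDER` are illustrative only; the columns are created with `dtype=object` so that strings and the integer placeholder can share a column.

```python
import pandas as pd

PLACEHOLDER = -999

train_toy = pd.DataFrame({"Cabin": pd.Series(["C85", "E46", PLACEHOLDER], dtype=object)})
sub_toy = pd.DataFrame({"Cabin": pd.Series(["C85", "B99", "E46"], dtype=object)})

# Levels that occur in the submission data but never in the training data
unseen = sorted(set(sub_toy["Cabin"]) - set(train_toy["Cabin"]))
print(unseen)

# Replace them with the placeholder so both sets share the same levels
sub_toy.loc[sub_toy["Cabin"].isin(unseen), "Cabin"] = PLACEHOLDER
print(sub_toy["Cabin"].tolist())
```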

We then carry out the data munging steps as done earlier for the training dataset.

In [None]:
titanic_sub['Sex'] = titanic_sub['Sex'].map({'male':0,'female':1})
titanic_sub['Embarked'] = titanic_sub['Embarked'].map({'S':0,'C':1,'Q':2})

In [None]:
titanic_sub = titanic_sub.fillna(-999)
pd.isnull(titanic_sub).any()

When calling `MultiLabelBinarizer` on the submission data set, we first fit on the training set again to learn the levels and then transform the submission data set values. This further ensures that only those levels that were present in the training data set are transformed. If new levels are still present in the submission data set, the transform will either raise an error or, in recent scikit-learn versions, emit a warning and ignore them. In either case we need to go back and check our earlier step of replacing new levels with the placeholder value.

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
SubCabinTrans = mlb.fit([{str(val)} for val in titanic['Cabin'].values]).transform([{str(val)} for val in titanic_sub['Cabin'].values])
titanic_sub = titanic_sub.drop(['Name','Ticket','Cabin'], axis=1)
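
The fit-on-training, transform-on-submission pattern keeps the column layout identical across both data sets. The sketch below uses made-up cabin values and assumes a recent scikit-learn release, where a level not seen during `fit` triggers a warning and produces a row of zeros; older releases may raise an error instead.

```python
import warnings
from sklearn.preprocessing import MultiLabelBinarizer

train_cabins = ["C85", "E46", "-999"]
sub_cabins = ["E46", "B99"]   # "B99" was never seen during fit (made-up)

mlb_toy = MultiLabelBinarizer().fit([{c} for c in train_cabins])

# Record warnings instead of printing them, so they can be inspected
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    sub_encoded = mlb_toy.transform([{c} for c in sub_cabins])

print(sub_encoded)                       # columns follow mlb_toy.classes_
print([str(w.message) for w in caught])  # any warnings about unknown levels
```

A row without any 1 in it is a sign that the placeholder replacement missed a level.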

In [None]:
# Form the new submission data set
titanic_sub_new = np.hstack((titanic_sub.values,SubCabinTrans))

In [None]:
np.any(np.isnan(titanic_sub_new))

In [None]:
# Ensure an equal number of features in both the final training and submission dataset
assert (titanic_new.shape[1] == titanic_sub_new.shape[1]), "Not Equal" 

In [None]:
# Generate the predictions
submission = tpot.predict(titanic_sub_new)

In [None]:
# Create the submission file
final = pd.DataFrame({'PassengerId': titanic_sub['PassengerId'], 'Survived': submission})
#final.to_csv('submission.csv', index = False)

In [None]:
final.shape
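
Before uploading, a few sanity checks on the submission frame can catch obvious mistakes. This is a sketch on a made-up stand-in for `final`; the helper `check_submission` is hypothetical, and for the real data `expected_rows` would be the 418 rows of the submission set.

```python
import pandas as pd

# Toy stand-in for the `final` frame built above (made-up IDs and predictions)
final_toy = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})


def check_submission(df: pd.DataFrame, expected_rows: int) -> None:
    """Raise AssertionError if the frame does not look like a valid submission."""
    assert list(df.columns) == ["PassengerId", "Survived"], "unexpected columns"
    assert len(df) == expected_rows, "unexpected number of rows"
    assert df["PassengerId"].is_unique, "duplicate passenger IDs"
    assert df["Survived"].isin([0, 1]).all(), "predictions must be 0 or 1"


check_submission(final_toy, expected_rows=3)
print("submission looks fine")
```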

There we go! We have successfully generated the predictions for the 418 data points in the submission dataset, and we're good to go ahead to submit these predictions on Kaggle.