![](https://media.giphy.com/media/m2tOKbpjpFMvm/giphy.gif)
# 1. Introduction
This Kernel builds its approach on three Python modules: [fastai](https://github.com/fastai/fastai), [Pandas Profiling](https://github.com/pandas-profiling/pandas-profiling) and [H2O](https://www.h2o.ai/products/h2o/). Why these modules? Because they do a lot of work in just a few lines of code and speed up the whole process, hence the name "Fastanic".<br>

fastai loads a lot of useful modules, such as Pandas and NumPy; Pandas Profiling can generate a detailed Exploratory Data Analysis (EDA) with a single line of code; and H2O, through its AutoML module, can train different Machine Learning models and ensemble them to make good predictions. But remember there's [*No Free Lunch*](https://towardsdatascience.com/a-blog-about-lunch-and-data-science-how-there-is-no-such-a-thing-as-free-lunch-e46fd57c7f27): fastai is good, but you may need other modules for your EDA; Pandas Profiling struggles with large datasets (I tried it on the [Porto Seguro Safe Driver Prediction Dataset](https://www.kaggle.com/c/porto-seguro-safe-driver-prediction) without success); and H2O AutoML can take a long time to train all the models.<br>

The legendary competition to predict the survivors of the Titanic disaster is a good starting point to apply these modules, understand how to work with them and learn their pros and cons. The dataset is small enough to use these tools comfortably and to get a good idea of when to use them, which leaves more time to understand the data and apply Feature Engineering (FE) to it.

### 1.1 The Challenge

The [sinking of the Titanic](https://en.wikipedia.org/wiki/Sinking_of_the_RMS_Titanic) is one of the most infamous shipwrecks in history.<br>

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.<br>

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.<br>

In this challenge, we ask you to build a predictive model that answers the question: "what sorts of people were more likely to survive?" using passenger data (ie name, age, gender, socio-economic class, etc).<br>

It is your job to predict if a passenger survived the sinking of the Titanic or not.<br>
For each passenger in the test set, you must predict a 0 (Dead) or 1 (Survived) value for the `Survived` variable.<br>
Your score is the percentage of passengers you correctly predict, the accuracy.<br>
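
To make the metric concrete, here is a minimal sketch of how accuracy can be computed. The `accuracy` helper and the five labels are made up for illustration; this is not Kaggle's scoring code.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical labels for five passengers (1 = survived, 0 = died)
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
print(accuracy(y_true, y_pred))  # 4 of 5 correct -> 0.8
```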

### 1.2 Data
Some variables worth noting for insights:<br>
* **pclass**: A proxy for socio-economic status (SES)<br>
1st = Upper<br>
2nd = Middle<br>
3rd = Lower<br>

* **age**: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5<br>

* **sibsp**: The dataset defines family relations in this way...<br>
Sibling = brother, sister, stepbrother, stepsister<br>
Spouse = husband, wife (mistresses and fiancés were ignored)<br>

* **parch**: The dataset defines family relations in this way...<br>
Parent = mother, father<br>
Child = daughter, son, stepdaughter, stepson<br>
Some children travelled only with a nanny, therefore parch=0 for them.<br>

# 2. Load Modules and Data
Let's start by importing our Power Trio (fastai, Pandas Profiling and H2O), defining the data path and loading the available data.

In [1]:
# The fast guys (fastai.imports brings in pd, np, plt and os, among others)
from fastai.imports import *
import pandas_profiling
import h2o
from h2o.automl import H2OAutoML
import warnings
warnings.filterwarnings('ignore')
%config InlineBackend.figure_format = 'retina'
# List of files in the directory
PATH = '../input/titanic/'
list(os.listdir(PATH))

In [1]:
train = pd.read_csv(PATH+'train.csv')
test = pd.read_csv(PATH+'test.csv')
submission = pd.read_csv(PATH+'gender_submission.csv')

# 3. Exploratory Data Analysis (EDA)

Here's the magic of Pandas Profiling: a good EDA using just the `ProfileReport` function. It gives the variable distributions, null-value checks, variable types, a correlation matrix and other useful things.<br>
Let's apply it to the training data.

In [1]:
%%time
pandas_profiling.ProfileReport(train)

Amazing, no?<br>
What can be learned from this? Let's examine each variable (features that will feed the model) and get some insights.<br>

* `Survived`: The target variable. The plot shows a small class imbalance, but there's no need for resampling, and accuracy remains a reasonable metric to evaluate the model.<br>
* `Age`: Numeric, with almost 20% missing values (`NaN`). The distribution is slightly right-skewed but close to normal. There's a small negative correlation with `Survived`, indicating that younger passengers may have had a better chance of surviving than older ones.<br>
* `Cabin`: Categorical, with a lot of missing values (77.1%); some passengers have more than one cabin. The first letter may indicate the deck, so it is treated as a relevant feature.<br>
* `Embarked`: Another categorical feature, with only 2 missing values. Dropping these rows in the Feature Engineering step will not harm the model.<br>
* `Fare`: Numeric and right-skewed, with no missing values. It has a small positive correlation with `Survived`, which fits the idea that passengers with more expensive tickets, and therefore a better cabin/area/class, had a better chance of surviving. Note that the mode sits at low values, so most passengers bought cheaper tickets.<br>
* `Name`: Categorical, no missing values. The name structure is interesting: 'Family Name' + ', ' + 'Honorific' + '. ' + 'Given Names'.
* `Parch`: Numeric and right-skewed. No missing values. Small positive correlation with `Survived`.
* `PassengerId`: Numeric but not relevant for the prediction. It will be dropped in the FE step.
* `Pclass`: Categorical, no missing values. There are more 3rd class passengers than any other class. The negative correlation with `Survived` reinforces the idea from `Fare`: expensive tickets mean better classes, which lead to a better chance of surviving.
* `Sex`: Categorical, no missing values. The distribution shows more men than women.
* `SibSp`: Numeric, with a small negative correlation with `Survived`. No missing values.
* `Ticket`: Categorical, no missing values, but with high cardinality.
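
The missing-value percentages and correlations quoted in the list can also be cross-checked with plain pandas. The sketch below uses a tiny made-up frame named `toy` instead of the real `train` data, so the numbers it prints are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `train`; the values are made up for illustration
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 2, 3, 1, 3],
    "Age":      [22.0, 38.0, np.nan, 35.0, np.nan, 54.0],
    "Fare":     [7.25, 71.28, 13.0, 8.05, 53.1, 7.9],
})

# Share of missing values per column, in percent
missing_pct = toy.isnull().mean().mul(100).round(1)

# Pearson correlation of each numeric column with the target
corr_with_target = toy.corr()["Survived"].drop("Survived")

print(missing_pct)
print(corr_with_target)
```

On the real data, replacing `toy` with `train` (restricted to its numeric columns for the correlation) should give figures comparable to the report.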

With all the information gathered from the EDA, it's time to turn these insights into FE, which starts now.

# 4. Feature Engineering (FE)

After analyzing all the variables in the EDA, it's time to clean the data and extract more useful information to feed the model.<br>
Let's start by dropping the features that add little value: `PassengerId` and `Ticket`.

In [1]:
train = train.drop(columns=['PassengerId', 'Ticket'])
test = test.drop(columns=['PassengerId', 'Ticket']);

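Use the first letter of the `Cabin` feature to identify the deck. This matters for predicting survivors because the deck location is a proxy for the distance to the ship's staircases. In the code below, decks A to G are encoded as 1, while deck T and missing cabins (whose string form `'nan'` starts with `'n'`) are encoded as 0.<br>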
Read this excellent [Kernel](https://www.kaggle.com/gunesevitan/advanced-feature-engineering-tutorial-with-titanic) for more information.


In [1]:
# Map the first cabin letter to a flag: 1 for decks A-G, 0 for deck T or a missing cabin.
# astype(str) turns NaN into 'nan', so missing cabins show up as the letter 'n'.
deck_map = {'A': 1, 'B': 1, 'C': 1, 'D': 1, 'E': 1, 'F': 1, 'G': 1, 'T': 0, 'n': 0}
train['Deck'] = train['Cabin'].astype(str).str[0].map(deck_map).astype(int)
test['Deck'] = test['Cabin'].astype(str).str[0].map(deck_map).astype(int)
train = train.drop(columns=['Cabin'])
test = test.drop(columns=['Cabin'])
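
The `'n'` key in the mapping can look odd at first. The sketch below, on a made-up `cabin` Series, shows where it comes from and an alternative way to build the same 0/1 flag without relying on it. `has_deck` is a hypothetical name, not something used elsewhere in this Kernel.

```python
import numpy as np
import pandas as pd

# Hypothetical cabin values, including a missing entry and a multi-cabin entry
cabin = pd.Series(["C85", np.nan, "B57 B59", "T"])

# The text form of NaN is 'nan', so its first character is 'n'
nan_first_char = str(np.nan)[0]
print(nan_first_char)  # n

# Same flag without the 'n' trick: 1 for decks A-G, 0 for anything else (T or missing)
has_deck = cabin.str[0].isin(list("ABCDEFG")).astype(int)
print(has_deck.tolist())  # [1, 0, 1, 0]
```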

Drop the two samples with `NaN` in `Embarked`.

In [1]:
train = train.dropna(subset=['Embarked']).reset_index(drop=True)

Speaking of `NaN` values, it's worth checking them on the test set too.

In [1]:
missing = test.isnull().sum()
missing = missing[missing > 0]
missing

And let's check the `Fare` distribution to decide how to fill `NaN` values.

In [1]:
test['Fare'].hist();
plt.title('Fare Distribution on Test Set')
plt.xlabel('Fare');

Given the skewed distribution, filling with the mode looks like the better choice.

In [1]:
test['Fare'] = test['Fare'].fillna(test['Fare'].mode()[0])
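
To see why the choice of fill value matters for a right-skewed variable, here is a small sketch on made-up fares. A few expensive tickets pull the mean far above what a typical passenger paid, while the mode stays at the most common cheap ticket.

```python
import pandas as pd

# Made-up right-skewed fares: many cheap tickets and a few expensive ones
fare = pd.Series([7.75, 7.75, 7.75, 8.05, 13.0, 26.0, 71.28, 512.33])

fare_mean = fare.mean()
fare_median = fare.median()
fare_mode = fare.mode()[0]  # mode() returns a Series; take the first (smallest) mode

print("mean  :", round(fare_mean, 2))
print("median:", fare_median)
print("mode  :", fare_mode)
```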

Time to work with `Age`. Since its distribution is close to normal, filling all `NaN` values with the median will not distort the distribution much.

In [1]:
train['Age'] = train['Age'].fillna(train['Age'].median())
test['Age'] = test['Age'].fillna(test['Age'].median())

OK, all `NaN` values are filled and we could proceed to build the model, but AutoML, like any Machine Learning algorithm, is **NOT** a magic wand! You can feed an AutoML model with untreated data, but it will not work a miracle. You, as a Data Scientist or in any other role, are the miracle! You are the wizard who chants spells (code) learned from your grimoire (books, courses, etc.) and adds creativity (the miracle factor).<br>
Hence, let's go one step further. Remember `Name`? What if we create a new feature called `Family_Name` to capture whether passengers belong to the same family? I got this insight because it could help with the missing `Cabin` values. Let me explain the idea.<br>

According to [the Wikipedia article about the sinking of the Titanic](https://en.wikipedia.org/wiki/Sinking_of_the_RMS_Titanic), the ship sank at 02:20 AM, a time when many people would have been asleep, and families were probably sleeping together in the same cabin or in nearby cabins, which could also explain why some passengers had more than one cabin. Creating this feature also reduces the cardinality of `Name`. I will extract the honorific into a new feature too.

In [1]:
# 'Braund, Mr. Owen Harris' -> Family_Name 'Braund', Honorific 'Mr'
train['Family_Name'] = train['Name'].apply(lambda x: x.split(',')[0])
train['First_Name'] = train['Name'].apply(lambda x: x.split(',')[1])
# strip() removes the space left after the comma
train['Honorific'] = train['First_Name'].apply(lambda x: x.split('.')[0].strip())
train = train.drop(columns=['Name', 'First_Name'])
test['Family_Name'] = test['Name'].apply(lambda x: x.split(',')[0])
test['First_Name'] = test['Name'].apply(lambda x: x.split(',')[1])
test['Honorific'] = test['First_Name'].apply(lambda x: x.split('.')[0].strip())
test = test.drop(columns=['Name', 'First_Name'])
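
The same split can also be written as a single regular expression. This is a minimal sketch on two names in the dataset's layout; the pattern and the `parts` variable are my own illustration, not what the cell above uses.

```python
import pandas as pd

# Two names in the 'Family Name, Honorific. Given Names' layout
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
])

# Family name = text before the comma; honorific = text between the comma and the first period
parts = names.str.extract(r"^(?P<Family_Name>[^,]+),\s*(?P<Honorific>[^.]+)\.")

print(parts)
```

Named groups become column names, so `parts` comes back as a DataFrame with `Family_Name` and `Honorific` columns.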

Let's standardize `Age` and `Fare`.

In [1]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
# Fit the scaler on the training data only, then reuse its mean and std on the test set
train[['Age', 'Fare']] = sc.fit_transform(train[['Age', 'Fare']])
test[['Age', 'Fare']] = sc.transform(test[['Age', 'Fare']])
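
Standardizing means subtracting the mean and dividing by the standard deviation. A common convention is to learn both statistics from the training set only and reuse them on the test set, so the two sets end up on the same scale. The sketch below does this by hand with NumPy on made-up ages; it imitates the idea behind the scaler rather than calling scikit-learn.

```python
import numpy as np

# Made-up ages standing in for the train and test columns
train_age = np.array([20.0, 30.0, 40.0])
test_age = np.array([30.0, 50.0])

# "fit": learn the statistics from the training data only
mean = train_age.mean()
std = train_age.std()  # population standard deviation (ddof=0)

# "transform": apply those same statistics to both sets
train_scaled = (train_age - mean) / std
test_scaled = (test_age - mean) / std

print(train_scaled)
print(test_scaled)
```

A test age equal to the training mean maps to 0, and an age above every training value maps beyond the largest scaled training value.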

And here are the new datasets.

In [1]:
display('Train Dataset')
display(train.head(3))
display('Test Dataset')
display(test.head(3))

As for the remaining categorical values, I will not encode them; I'll leave that job to AutoML.<br>
Missing cabins are already encoded as 0 in `Deck`, so there is no `NaN` left there for H2O AutoML to handle.

# 5. AutoML Model

The model is built using the great [H2O AutoML](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html). Read about it; it is a nice addition to your skills.<br>
It trains many Machine Learning algorithms in search of the best fit and reports the results, even building ensembles from the trained models. There are many parameters to configure, and you need to be cautious about how many models you train, because training can take a lot of time and resources! Know your data and decide whether to use H2O AutoML or another AutoML module.

In [1]:
h2o.init()
# Convert the pandas DataFrames to H2OFrames
train_frame = h2o.H2OFrame(train)
test_frame = h2o.H2OFrame(test)
# Treat the target as categorical so AutoML runs a classification task
train_frame['Survived'] = train_frame['Survived'].asfactor()
# Predictor columns (x) and target column (y)
x = train_frame.columns
y = 'Survived'
x.remove(y)

Train up to 10 models (`max_models=10`, not counting the Stacked Ensembles) with 5-fold cross-validation.
The leaderboard is sorted by AUC, since H2O AutoML does not offer accuracy as a sort metric.

In [1]:
%%time
# Build and Train the model
model = H2OAutoML(max_models=10, seed=42, nfolds=5)
model.train(x=x, y=y, training_frame=train_frame)
lb = model.leaderboard
lb
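
The `nfolds=5` setting asks for 5-fold cross-validation: the training rows are split into five parts, and each part is used once for validation while the other four are used for training. The sketch below illustrates the idea with NumPy; `kfold_indices` is a hypothetical helper, and H2O's own fold assignment may differ in its details.

```python
import numpy as np

def kfold_indices(n_rows, n_folds=5, seed=42):
    """Yield (train_idx, valid_idx) pairs; every row is validated exactly once."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    folds = np.array_split(shuffled, n_folds)
    for i in range(n_folds):
        valid_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, valid_idx

splits = list(kfold_indices(n_rows=10, n_folds=5))
for train_idx, valid_idx in splits:
    print(len(train_idx), "train rows,", len(valid_idx), "validation rows")
```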

## 5.1 Model Interpretability

Building the model and leaving it as a black box is not enough; it needs to be interpretable so we can understand how the features contribute to the predictions. So, time to look at the feature importance.

In [1]:
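# Row 2 of the leaderboard (third place) is used, presumably because the top rows are
# Stacked Ensembles, which offer no variable importance plot
m = h2o.get_model(lb[2, "model_id"])
m.varimp_plot()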
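
Picking a leaderboard row by a fixed position only works if you already know where the first non-ensemble model sits. A more robust option, sketched below, is to filter the leaderboard by model ID. The frame here is a made-up stand-in for `lb.as_data_frame()`, and the model IDs are hypothetical, assuming only that Stacked Ensemble IDs start with `StackedEnsemble`.

```python
import pandas as pd

# Made-up leaderboard in the shape of a pandas export of the H2O leaderboard
lb_df = pd.DataFrame({
    "model_id": [
        "StackedEnsemble_AllModels_example",
        "StackedEnsemble_BestOfFamily_example",
        "GBM_1_example",
        "XGBoost_1_example",
    ],
    "auc": [0.88, 0.87, 0.86, 0.85],
})

# Keep only base models, then take the best-ranked one
is_ensemble = lb_df["model_id"].str.startswith("StackedEnsemble")
best_base_id = lb_df.loc[~is_ensemble, "model_id"].iloc[0]
print(best_base_id)  # GBM_1_example
```

With a real leaderboard, the resulting ID could then be passed to `h2o.get_model` in place of `lb[2, "model_id"]`.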

We can see that `Family_Name` gives a good gain to the model's predictions, followed by `Honorific` and `Sex`.<br>
Let's re-train with the top 6 features and see if we get any improvement.

In [1]:
train_frame_b = train_frame[:, ['Family_Name', 'Honorific', 'Sex', 'Pclass', 'Fare', 'Deck', 'Survived']]
train_frame_b['Survived'] = train_frame_b['Survived'].asfactor()
x_b = train_frame_b.columns
y_b = 'Survived'
x_b.remove(y_b)
model_b = H2OAutoML(max_models=10, seed=42, nfolds=5)
model_b.train(x=x_b, y=y_b, training_frame=train_frame_b)
lb_b = model_b.leaderboard
lb_b

No improvement here: the AUC score is slightly lower, although the model still performed well with only the top 6 features.
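
For readers less familiar with AUC: it can be read as the probability that a randomly chosen survivor gets a higher predicted score than a randomly chosen non-survivor. The sketch below computes it by brute force over all such pairs; the `auc` helper and the four scores are made up for illustration and are not how H2O computes the metric internally.

```python
def auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(y_true, scores))  # 3 of the 4 positive/negative pairs are ordered correctly -> 0.75
```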

## 5.2 Predictions

Make the predictions with the model trained on all features and write the results to the submission file.

In [1]:
# Predict with the best model on the leaderboard
predictions = model.leader.predict(test_frame)
# Keep only the predicted class (the frame also holds the per-class probabilities)
predictions = predictions.as_data_frame()['predict']
# PassengerId comes from the sample submission, which follows the row order of test.csv
submission_final = pd.concat([submission['PassengerId'], predictions], axis=1, ignore_index=False)
submission_final.columns = ['PassengerId', 'Survived']
submission_final.to_csv('submission.csv', index=False)
print('Submission file saved!')
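
Before uploading, it can be worth running a quick sanity check on the file. The sketch below uses a hypothetical `check_submission` helper on a made-up three-row frame; on the real data, `expected_rows` would be the number of rows in the test set.

```python
import pandas as pd

def check_submission(df, expected_rows):
    """Raise AssertionError if the frame does not look like a valid submission."""
    assert list(df.columns) == ["PassengerId", "Survived"], "unexpected columns"
    assert len(df) == expected_rows, "unexpected number of rows"
    assert df["PassengerId"].is_unique, "duplicate PassengerId"
    assert df["Survived"].isin([0, 1]).all(), "Survived must be 0 or 1"
    return True

# Made-up three-row frame standing in for the real submission
toy_submission = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})
print(check_submission(toy_submission, expected_rows=3))  # True
```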

# 6. Conclusion

The automation tools worked well to speed up the process, making steps like EDA smoother and leaving more time for FE.
As for AutoML, it was possible to achieve a good result in a few steps, leaving more time to research other solutions.
Thanks for reading!