In [0]:
!pip install --upgrade scikit-learn
dbutils.library.restartPython()

# Supervised Learning Workflow

In [0]:
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier


train = pd.read_csv("../../../Data/data_titanic/train.csv")
train.Pclass = train.Pclass.astype(float)  # to avoid DataConversionWarning

In [0]:
train = train.dropna(axis=0)
X_train, X_test, y_train, y_test = train_test_split(
    train[["Pclass", "Age", "Sex", "Embarked"]],
    train["Survived"],
    test_size=0.2,
    random_state=42,
)

## Part 3: Tree-based Models & Hyperparameter Tuning
In the previous notebook we constructed our first pipeline:
```
entire_pipeline = Pipeline([('feature_engineering', feature_engineering), ('dummy', DummyClassifier(strategy="most_frequent"))])
```
Hold on to that pipeline! The only thing we need to do now is replace [`DummyClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html) with a proper learning model.  
We can start with a decision tree.
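
As a minimal sketch of what "replacing the final step" means, the example below builds two pipelines that share the same preprocessing and differ only in the estimator. The toy data, the column names `size` and `colour`, and the step names `features` and `model` are hypothetical and are not part of the Titanic exercise.

```python
import pandas as pd
from sklearn import preprocessing
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyClassifier
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data (not the Titanic set)
toy_X = pd.DataFrame(
    {
        "size": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "colour": ["red", "blue", "red", "blue", "red", "blue"],
    }
)
toy_y = pd.Series([0, 0, 0, 1, 1, 1])

# Preprocessing shared by both pipelines
toy_features = ColumnTransformer(
    [
        ("scaler", preprocessing.MinMaxScaler(), ["size"]),
        ("ohe", preprocessing.OneHotEncoder(), ["colour"]),
    ]
)

# Baseline: always predicts the most frequent class
baseline = Pipeline(
    [("features", clone(toy_features)), ("model", DummyClassifier(strategy="most_frequent"))]
)
baseline.fit(toy_X, toy_y)

# Same preprocessing, but the final step is now a decision tree
tree_pipeline = Pipeline(
    [("features", clone(toy_features)), ("model", DecisionTreeClassifier(random_state=0))]
)
tree_pipeline.fit(toy_X, toy_y)

baseline_acc = baseline.score(toy_X, toy_y)
tree_acc = tree_pipeline.score(toy_X, toy_y)
print(baseline_acc, tree_acc)
```

Everything before the final step stays untouched, which is the main convenience of wrapping preprocessing and model in one `Pipeline`.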

In [0]:
feature_engineering = ColumnTransformer(
    [
        ("numerical_scaler", preprocessing.MinMaxScaler(), ["Pclass", "Age"]),
        ("ohe", preprocessing.OneHotEncoder(sparse_output=False), ["Sex", "Embarked"]),
    ],
    remainder="passthrough",
)

### Fitting a Learning Model – Decision Tree

In [0]:
# TASK 1A: Reuse your composite and instead of a dummy, fit a decision tree with default parameters.
# Store the result as dt_pipeline.

# Train the pipeline

In [0]:
# TASK 1B: Let the pipeline predict for the training set. 
# Store the result as y_pred_TRAIN_DT.
# Also, display accuracy.

In [0]:
# TASK 1C: Let the pipeline predict for the holdout set. 
# Store the result as y_pred_HOLDOUT_DT.
# Also, display accuracy.

Looking at the accuracy on the training and holdout sets, what can you infer about our model? Will it generalize well?
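
To make the comparison concrete, here is a small sketch on synthetic data (an assumption for illustration only, not the Titanic set). It fits a decision tree with default parameters on deliberately noisy labels and prints both accuracies. The variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data (illustrative assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 5))
# Only the first feature carries signal; the added noise makes some labels unpredictable
y_demo = (X_demo[:, 0] + rng.normal(scale=1.0, size=400) > 0).astype(int)

X_tr, X_ho, y_tr, y_ho = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# With default parameters the tree keeps splitting until every training leaf is pure
unrestricted_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_acc = unrestricted_tree.score(X_tr, y_tr)
holdout_acc = unrestricted_tree.score(X_ho, y_ho)
print(f"train accuracy:   {train_acc:.3f}")
print(f"holdout accuracy: {holdout_acc:.3f}")
```

A large gap between the two numbers is the usual sign that a model has memorized its training data rather than learned a pattern that transfers.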

In [0]:
# OPTIONAL TASK 2: Repeat the same steps with a RandomForest using default parameters.
# Does the random forest give similar results to the decision tree? If not, why?

# Reuse your composite and fit a random forest with default parameters.
# Store the result as rf_pipeline.

# Train the pipeline

# Predict and show accuracy TRAIN

# Predict and show accuracy HOLDOUT

### Tuning Hyperparameters of our Decision Tree
Time to improve the performance of our learning model by finding its optimal set of hyperparameters.  
We start by examining **which hyperparameters are available** in our decision tree pipeline.

In [0]:
dt_pipeline.get_params()

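We would like to tune `max_depth` and `min_samples_split`.  
Notice that these parameters belong to a step inside the pipeline, so each one must be prefixed with the name of that step and two underscores, for example `decision_tree__max_depth` if the tree step is named `decision_tree`.  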
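
The naming convention can be sketched as follows. The pipeline here is a stand-in, and the step names `scaler` and `decision_tree` are assumptions; use whatever names you gave the steps in your own pipeline.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Stand-in pipeline; the step names are illustrative
demo_pipeline = Pipeline(
    [("scaler", MinMaxScaler()), ("decision_tree", DecisionTreeClassifier())]
)

# Parameters of a step are exposed as <step name>__<parameter name>
tree_param_names = sorted(
    name for name in demo_pipeline.get_params() if name.startswith("decision_tree__")
)
print(tree_param_names)

# The same prefixed name is used to change a parameter on the nested estimator
demo_pipeline.set_params(decision_tree__max_depth=3)
print(demo_pipeline.named_steps["decision_tree"].max_depth)
```

The same prefixed names serve as the keys of the `param_grid` dictionary passed to `GridSearchCV`.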

In [0]:
# TASK 3: Define a grid through which we should search.
# Tune parameters: max_depth and min_samples_split.
# The values you pick are up to you; choose them intuitively.
# Store the grid as param_grid (the next cell passes it to GridSearchCV).

In [0]:
from sklearn.model_selection import GridSearchCV

# Model: dt_pipeline (from Task 1A); grid: param_grid (from Task 3)
# Searching strategy, providing grid
tuning = GridSearchCV(dt_pipeline, param_grid)

# Train
tuning.fit(X_train, y_train)

In [0]:
# Let's get the best parameters
best_par = tuning.best_params_
print(best_par)

If you want to have a more detailed look at the result from the grid search you can use the `cv_results_` attribute.
The dict is easily transformed to a pandas DataFrame.

In [0]:
# Let's check out the full Grid Search results
# We sort the dataframe according to the rank and have a look at the top 10 models
gs_result = pd.DataFrame(tuning.cv_results_)
gs_result.sort_values("rank_test_score").head(10)

In [0]:
# TASK 4A: Use the best setting of the two hyperparameters and fit an optimized decision tree.
# Hint: Reuse the pipeline and when declaring it, specify the params.

# Store it as dt_pipeline_tuned.

# Train

In [0]:
# TASK 4B: Display accuracy on the training set of the optimized decision tree.

In [0]:
# TASK 4C: Display accuracy on the holdout set of the optimized decision tree.

Does the optimized decision tree perform better than the one with default parameters?

The best model can also be retrieved directly from the result of the grid search, provided the parameter `refit=True` is used.
This parameter is `True` by default, so instead of retraining manually we can either use the attribute `best_estimator_` to retrieve the model or make predictions straight away by calling
[`predict()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV.predict)
on the grid search object.

In [0]:
# retrieve the best model from the Grid Search object
dt_tuning = tuning.best_estimator_
print(metrics.accuracy_score(y_test, dt_tuning.predict(X_test)))

# directly predict using the Grid Search object
print(metrics.accuracy_score(y_test, tuning.predict(X_test)))

### Optional Advanced TASK: Tuning Random Forest
When you are tuning a more complex model, it is good practice to search the available literature for which hyperparameters should be tuned. Below I have predefined some. You can **play around with the grid**, for example by expanding or narrowing it. Keep in mind that because our feature set is extremely limited, it's hard for hyperparameter tuning to arrive at something meaningful.
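
Before expanding a grid, it can help to check how many model fits it implies. The sketch below copies the grid values from the cell that follows and counts the candidate combinations with `ParameterGrid`. The fold counts 3 and 5 are shown as examples (5 is the `GridSearchCV` default), and the final refit on the full training set is not included in the count.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the grid used in the exercise cell
param_grid_rf = {
    "random_forest__bootstrap": [True, False],
    "random_forest__max_depth": [3, 5, 10, 15],
    "random_forest__max_features": [2, 3],
    "random_forest__min_samples_leaf": [3, 4, 5],
    "random_forest__min_samples_split": [5, 8, 10, 12],
    "random_forest__n_estimators": [5, 10, 15, 20, 25],
}

# Number of distinct hyperparameter combinations in the grid
n_candidates = len(ParameterGrid(param_grid_rf))
print(n_candidates)  # 960

# Each candidate is fitted once per cross-validation fold
for n_folds in (3, 5):
    print(n_folds, "folds:", n_candidates * n_folds, "fits")
```

Adding one more value to any list multiplies the total, so the grid grows quickly as you widen it.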

In [0]:
# OPTIONAL TASK 5
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Define a pipeline
rf_pipeline = Pipeline([('feature_engineering', feature_engineering), ('random_forest', RandomForestClassifier())])

# Create the parameter grid based on the results of random search 
param_grid_rf = {
    'random_forest__bootstrap': [True, False],
    'random_forest__max_depth': [3, 5, 10, 15],
    'random_forest__max_features': [2, 3],
    'random_forest__min_samples_leaf': [3, 4, 5],
    'random_forest__min_samples_split': [5, 8, 10, 12],
    'random_forest__n_estimators': [5, 10, 15, 20, 25]
}
# Searching strategy, providing grid
# cv=3 and n_jobs=-1 keep the run time manageable; verbose=2 prints progress
tuning_rf = GridSearchCV(
    estimator=rf_pipeline,
    param_grid=param_grid_rf,
    cv=3,
    n_jobs=-1,
    verbose=2,
)

# Train
tuning_rf.fit(X_train, y_train)

# Cross-validated score (likely more robust than the score on a single holdout set)
print(tuning_rf.best_score_)
print(tuning_rf.best_params_)

### Optional Advanced TASK: Check Kaggle competitions and join one of them!  
https://www.kaggle.com/