# **Lab: Engineering for ML**




## Exercise 3: Modelling

This time we will train a RandomForest model.


**Pre-requisites:**
- Create a GitHub account (https://github.com/join)
- Install git (https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)
- Install pyenv (https://realpython.com/lessons/installing-pyenv/)
- Install poetry (https://python-poetry.org/docs/#installation)
- Install Wget for Windows users (https://eternallybored.org/misc/wget/)


The steps are:
1.   Create new Git branch
2.   Load the dataset
3.   Train Initial RandomForest with Default Hyperparameters
4.   Tune RandomForest
5.   Push changes


## 1. Create new Git branch


**[1.1]** Create a new git branch called `adv_mla_1_rf`


In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
! git checkout -b adv_mla_1_rf

**[1.2]** Launch Jupyter Lab from your virtual environment

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
! poetry run jupyter lab

**[1.3]** Navigate to the folder `notebooks` and create a new Jupyter notebook called `2_rf.ipynb`

## 2. Load the dataset


**[2.1]** Launch magic commands to automatically reload modules

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
%load_ext autoreload
%autoreload 2

**[2.2]** Import the pandas and numpy packages, and `dump` from joblib

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
import pandas as pd
import numpy as np
from joblib import dump

**[2.3]** Load the saved sets from `data/processed`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
X_train = pd.read_csv('../data/processed/X_train.csv')
X_val   = pd.read_csv('../data/processed/X_val.csv'  )
X_test  = pd.read_csv('../data/processed/X_test.csv' )
y_train = pd.read_csv('../data/processed/y_train.csv')
y_val   = pd.read_csv('../data/processed/y_val.csv'  )
y_test  = pd.read_csv('../data/processed/y_test.csv' )
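
Note that `pd.read_csv` always returns a DataFrame, so a target file with a single column is loaded with shape `(n_samples, 1)` rather than as a 1-D array. The sketch below shows one way to flatten such a target before passing it to scikit-learn. The CSV content and the column name `target` are hypothetical stand-ins, not the lab's actual files.

```python
import io

import pandas as pd

# Hypothetical stand-in for a single-column target file such as y_train.csv
csv_text = "target\n0\n1\n1\n0\n"
y_df = pd.read_csv(io.StringIO(csv_text))

# read_csv returns a DataFrame, so the target has shape (n_samples, 1)
print(y_df.shape)

# Flatten to a 1-D array of shape (n_samples,)
y_1d = y_df.to_numpy().ravel()
print(y_1d.shape)
```

Passing the 1-D version to `fit` avoids the column-vector warning scikit-learn may raise for 2-D targets.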

## 3. Train Initial RandomForest with Default Hyperparameters

**[3.1]** Import the RandomForestClassifier from sklearn.ensemble

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution:
from sklearn.ensemble import RandomForestClassifier

**[3.2]** Instantiate the RandomForestClassifier class as `rf1` with `random_state=8`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
rf1 = RandomForestClassifier(random_state=8)

**[3.3]** Fit the RandomForest model

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
rf1.fit(X_train, y_train)

**[3.4]** Import `dump` from `joblib` (if not already imported) and save the fitted model into the folder `models` as a file called `rf1.joblib`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution:
from joblib import dump

dump(rf1,  '../models/rf1.joblib')
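
A saved model can be read back with `joblib.load`. The sketch below is a minimal round trip on synthetic data. It writes to a temporary directory rather than `../models`, and the file name `rf_demo.joblib` is a hypothetical example.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; NOT the lab's dataset
X, y = make_classification(n_samples=200, n_features=5, random_state=8)
model = RandomForestClassifier(n_estimators=10, random_state=8).fit(X, y)

# Save the fitted model, then load it back from disk
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "rf_demo.joblib")  # hypothetical file name
    dump(model, path)
    restored = load(path)

# The restored model should reproduce the original predictions
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print(same_predictions)
```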

**[3.5]** Save the predicted probabilities from this model for the training and validation sets into 2 variables called `y_train_probas` and `y_val_probas`


In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution:
y_train_probas = rf1.predict_proba(X_train)
y_val_probas = rf1.predict_proba(X_val)
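
For a binary problem, `predict_proba` returns one column per class, ordered as in the fitted model's `classes_` attribute. This is why the later steps select column `1` for the positive class. The sketch below illustrates this on synthetic data with labels 0 and 1, which is an assumption about the lab's target encoding.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with labels 0 and 1; NOT the lab's dataset
X, y = make_classification(n_samples=200, n_features=5, random_state=8)
clf = RandomForestClassifier(n_estimators=10, random_state=8).fit(X, y)

probas = clf.predict_proba(X)
print(clf.classes_)   # column order of the probability matrix
print(probas.shape)   # one row per sample, one column per class

# Look up the column of the positive class instead of hard-coding it
positive_col = list(clf.classes_).index(1)
positive_probas = probas[:, positive_col]
```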

**[3.6]** Import roc_auc_score from sklearn.metrics

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution:
from sklearn.metrics import roc_auc_score

**[3.7]** Display the ROC AUC score of this model on the training set

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution:
roc_auc_score(y_train, y_train_probas[:, 1])

**[3.8]** Display the ROC AUC score of this model on the validation set

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution:
roc_auc_score(y_val, y_val_probas[:, 1])
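
The ROC AUC can be read as the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as half. The sketch below checks this interpretation against `roc_auc_score` on a small hand-made example. The labels and scores are illustrative values only.

```python
from itertools import product

from sklearn.metrics import roc_auc_score

# Illustrative labels and scores, not taken from the lab's data
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

pos = [s for s, t in zip(scores, y_true) if t == 1]
neg = [s for s, t in zip(scores, y_true) if t == 0]

# Count pairs where the positive sample outranks the negative one (ties = 0.5)
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
manual_auc = wins / (len(pos) * len(neg))

sk_auc = roc_auc_score(y_true, scores)
print(manual_auc, sk_auc)
```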

## 4. Tune RandomForest

**[4.1]** Instantiate and fit a RandomForestClassifier called `rf2` with `random_state=8`, `max_depth=6` and `min_samples_leaf=50`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution:
rf2 = RandomForestClassifier(random_state=8, max_depth=6, min_samples_leaf=50)
rf2.fit(X_train, y_train)

**[4.2]** Display the ROC AUC score of this model on the training and validation sets

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution:
print(roc_auc_score(y_train, rf2.predict_proba(X_train)[:, 1]))
print(roc_auc_score(y_val, rf2.predict_proba(X_val)[:, 1]))
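
Comparing training and validation scores side by side helps judge overfitting: a large gap between the two is a common warning sign. The sketch below wraps that comparison in a small helper and applies it to a default forest and a constrained one. The helper name `train_val_auc` and the synthetic data are assumptions for illustration, and the printed scores will differ from those on the lab's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; NOT the lab's dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=8)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=8)


def train_val_auc(model):
    """Fit the model and return (training AUC, validation AUC)."""
    model.fit(X_tr, y_tr)
    train_auc = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
    val_auc = roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])
    return train_auc, val_auc


default_scores = train_val_auc(RandomForestClassifier(random_state=8))
constrained_scores = train_val_auc(
    RandomForestClassifier(random_state=8, max_depth=6, min_samples_leaf=50)
)
print("default:", default_scores)
print("constrained:", constrained_scores)
```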

**[4.3]** Display the ROC AUC score of this model on the testing set

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution:
roc_auc_score(y_test, rf2.predict_proba(X_test)[:, 1])

**[4.4]** Save the fitted model into the folder `models` as a file called `rf2`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution:
dump(rf2,  '../models/rf2.joblib')

## 5. Push changes

**[5.1]** Add your changes to the git staging area

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
! git add .

**[5.2]** Create the snapshot of your repository and add a description

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
! git commit -m "best rf"

**[5.3]** Push your snapshot to GitHub

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
! git push -u origin adv_mla_1_rf

**[5.4]** Go to GitHub and merge your changes into the master/main branch

**[5.5]** Check out the master branch

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
! git checkout master

**[5.6]** Pull the latest updates

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
! git pull

**[5.7]** Stop Jupyter Lab