# **Lab: ML Lifecycle**



## Exercise 2: Tracking

Pre-requisites:
- Create a free account on Weights and Biases: https://wandb.ai/
- Create an API key

The steps are:
1.   Setup Environment
2.   Load and explore dataset
3.   Prepare Data
4.   Split Dataset
5.   Train model and track Experiment
6.   Manage Model
7.   Push changes


### 1. Setup Environment

**[1.1]** Go to a folder of your choice on your computer (where you store projects)

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
cd /Users/anthonyso/Projects/adv_mla_2024/

**[1.2]** Run the built Docker image

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
docker run -dit --rm --name adv_mla_lab_7 -p 8888:8888 -v ~/Projects/adv_mla_2024/adv_mla_lab_7:/home/jovyan/work/ tensorflow-jupyter:latest

**[1.3]** Display last 50 lines of logs

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
docker logs --tail 50 adv_mla_lab_7

**[1.4]** Copy the URL displayed in the logs and paste it into a browser to launch Jupyter Lab

**[1.5]** Navigate to the folder `notebooks` and create a new Jupyter notebook called `1_wandb.ipynb`

### 2.   Load and Explore Dataset

**[2.1]** Launch magic commands to automatically reload modules



In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
%load_ext autoreload
%autoreload 2

**[2.2]** Install your custom package with pip

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
! pip install -i https://test.pypi.org/simple/ my-krml-149874

**[2.3]** Import the pandas and numpy packages

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
import pandas as pd
import numpy as np

**[2.4]** Load the dataset into a dataframe called `df`


In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution:
df = pd.read_csv('../data/raw/Bank Customer Churn Prediction.csv')

**[2.5]** Display the first 5 rows of `df`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
df.head()

**[2.6]** Display the dimensions of `df`


In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
df.shape

**[2.7]** Display the summary (info) of `df`


In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
df.info()

**[2.8]** Display the descriptive statistics of `df`


In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
df.describe()

### 3. Prepare Data

**[3.1]** Create a copy of `df` and save it into a variable called `df_cleaned`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
df_cleaned = df.copy()

**[3.2]** Import OneHotEncoder, StandardScaler from sklearn.preprocessing

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
from sklearn.preprocessing import OneHotEncoder, StandardScaler

**[3.3]** Drop the column `customer_id`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
df_cleaned.drop('customer_id', axis=1, inplace=True)

**[3.4]** Create a list called `num_cols` that will contain the list of columns that are numeric type except for the boolean ones

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
num_cols = ['credit_score', 'age', 'tenure', 'balance', 'estimated_salary']
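
The list above is written by hand. As a rough sketch only, the same kind of list could be derived programmatically on a toy dataframe (the column values below are made up). The heuristic assumed here, "numeric with more than two distinct values", is an illustration and would not reproduce the lab's list exactly, since a column such as `products_number` is treated as categorical in this lab.

```python
import pandas as pd

# Toy dataframe with made-up values, standing in for a few columns of the dataset
toy = pd.DataFrame({
    "credit_score": [600, 700, 650],
    "balance": [0.0, 1500.5, 320.0],
    "credit_card": [1, 0, 1],  # 0/1 flag stored as an integer
    "country": ["France", "Spain", "France"],
})

# Keep numeric columns, then drop the ones that only take two values (boolean-like flags)
numeric = toy.select_dtypes(include="number").columns
toy_num_cols = [col for col in numeric if toy[col].nunique() > 2]
print(toy_num_cols)
```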

**[3.5]** Create a list called `cat_cols` that will contain the list of columns that are categorical type

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
cat_cols = ['country', 'gender', 'products_number', 'credit_card', 'active_member']

**[3.6]** Instantiate the StandardScaler

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
scaler = StandardScaler()

**[3.7]** Fit and apply the scaling on the numeric columns from `df_cleaned` and convert the results to a dataframe called `num_features`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
num_features = pd.DataFrame(scaler.fit_transform(df_cleaned[num_cols]), columns=df_cleaned[num_cols].columns)
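
To see what the scaling step does, here is a minimal sketch on toy data (the values and the names `toy` and `toy_scaler` are illustrative, not part of the lab). After `fit_transform`, each column should have a mean of 0 and a population standard deviation of 1.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data standing in for two numeric columns (values are made up)
toy = pd.DataFrame({"age": [20, 30, 40, 50], "balance": [0.0, 100.0, 200.0, 300.0]})

toy_scaler = StandardScaler()
scaled = pd.DataFrame(toy_scaler.fit_transform(toy), columns=toy.columns)

# StandardScaler uses the population standard deviation, hence ddof=0
col_means = scaled.mean().to_numpy()
col_stds = scaled.std(ddof=0).to_numpy()
print(col_means, col_stds)
```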

**[3.8]** Instantiate the OneHotEncoder

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
ohe = OneHotEncoder(sparse_output=False, drop='first')

**[3.9]** Fit and apply the OneHotEncoder on the categorical columns from `df_cleaned` and save the result in `cat_features`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
cat_features = ohe.fit_transform(df_cleaned[cat_cols])

**[3.10]** Convert `cat_features` into a dataframe

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
cat_features = pd.DataFrame(cat_features, columns=ohe.get_feature_names_out())

**[3.11]** Combine all the transformed features into `features`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
features = num_features.copy()
features[ohe.get_feature_names_out()] = cat_features[ohe.get_feature_names_out()]
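
An alternative way to combine the two dataframes, shown here as a sketch on toy data, is `pd.concat` along the column axis. It assumes both dataframes share the same row index, which is the case when both were built from the same source rows with a default index.

```python
import pandas as pd

# Toy stand-ins for the scaled numeric features and the one-hot encoded features
num_part = pd.DataFrame({"age": [0.1, -0.1]})
cat_part = pd.DataFrame({"gender_Male": [1.0, 0.0]})

# axis=1 places the dataframes side by side, aligning rows on the index
combined = pd.concat([num_part, cat_part], axis=1)
print(combined.columns.tolist())
```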

**[3.14]** Save the prepared dataframe in the `data/interim` folder

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
features.to_csv('../data/interim/dataset_prepared.csv', index=False)

**[3.15]** Import dump from joblib

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution:
from joblib import dump

**[3.16]** Save all the sklearn transformers

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution:
dump(scaler, '../models/scaler.joblib')
dump(ohe, '../models/ohe.joblib')
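
Saved transformers can later be reloaded with `joblib.load` and applied to new data without refitting. The sketch below shows the round trip on a toy scaler; it writes to a temporary directory rather than the lab's `models` folder, and the variable names are illustrative.

```python
import tempfile
from pathlib import Path

import numpy as np
from joblib import dump, load
from sklearn.preprocessing import StandardScaler

# Fit a scaler on toy data (values are made up)
data = np.array([[1.0], [2.0], [3.0]])
fitted = StandardScaler().fit(data)

# Save the fitted scaler, then load it back from disk
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "scaler.joblib"
    dump(fitted, path)
    restored = load(path)

# The restored scaler applies the same transformation as the original
same_output = np.allclose(fitted.transform(data), restored.transform(data))
print(same_output)
```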

### 4. Split Dataset

**[4.1]** Import the function `split_sets_random` from your custom package and split the data into several sets as Numpy arrays

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
from my_krml_149874.data.sets import split_sets_random

X_train, y_train, X_val, y_val, X_test, y_test = split_sets_random(features, df_cleaned['churn'], test_ratio=0.2)
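
The implementation of `split_sets_random` lives in your custom package and is not shown in this lab. As a rough idea of what such a function might do, here is a hypothetical stand-in (`toy_split_random`, not the real function) that performs a random train/validation/test split by calling scikit-learn's `train_test_split` twice. It assumes the validation set has the same size as the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def toy_split_random(features, target, test_ratio=0.2, random_state=8):
    """Hypothetical stand-in: randomly split data into train, validation and test sets."""
    # Share of the remaining data needed for a validation set as large as the test set
    val_ratio = test_ratio / (1 - test_ratio)
    X_rest, X_test, y_rest, y_test = train_test_split(
        features, target, test_size=test_ratio, random_state=random_state
    )
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=val_ratio, random_state=random_state
    )
    return X_train, y_train, X_val, y_val, X_test, y_test


# Toy data: 100 rows, 2 features, binary target
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100) % 2

sets = toy_split_random(X_toy, y_toy, test_ratio=0.2)
sizes = [len(s) for s in sets]
print(sizes)
```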

**[4.2]** Import the function `save_sets` from your custom package and save the sets into the folder `data/processed`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
from my_krml_149874.data.sets import save_sets

save_sets(X_train, y_train, X_val, y_val, X_test, y_test, path='../data/processed/')

### 5. Train model and Track Experiment

**[5.1]** Import the `wandb` package

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
import wandb

**[5.2]** Log in with your WandB API key

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
wandb.login()

**[5.3]** Initialise a new WandB run called `rf-default` in a project called `bank-churn`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
run = wandb.init(project="bank-churn", name='rf-default')

**[5.4]** Import RandomForestClassifier from sklearn.ensemble

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
from sklearn.ensemble import RandomForestClassifier

**[5.5]** Instantiate the RandomForestClassifier

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
model = RandomForestClassifier(random_state=42)

**[5.6]** Train the model on the training set

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
model.fit(X_train, y_train)

**[5.7]** Save the predictions on the validation set in a variable called `y_preds`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
y_preds = model.predict(X_val)

**[5.8]** Save the predicted probabilities on the validation set in a variable called `y_probas`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
y_probas = model.predict_proba(X_val)
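
The relationship between `predict` and `predict_proba` can be illustrated with a minimal sketch on synthetic data (generated with `make_classification`, not the lab dataset; names prefixed with `toy_` are illustrative). `predict_proba` returns one column per class, and for a random forest `predict` returns the class with the highest probability.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data, used only for illustration
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=42)
toy_model = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_toy, y_toy)

toy_preds = toy_model.predict(X_toy)
toy_probas = toy_model.predict_proba(X_toy)

# One column per class; each row of probabilities sums to 1
rows_sum_to_one = np.allclose(toy_probas.sum(axis=1), 1.0)

# The predicted label is the class with the highest probability
matches_argmax = np.array_equal(
    toy_preds, toy_model.classes_[np.argmax(toy_probas, axis=1)]
)
print(toy_probas.shape, rows_sum_to_one, matches_argmax)
```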

**[5.9]** Generate all the plots using WandB

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
wandb.sklearn.plot_classifier(
    model,
    # Evaluate on the validation set, since y_preds and y_probas were computed on X_val
    X_train, X_val,
    y_train, y_val,
    y_preds, y_probas,
    ['no_churn', 'churn'],
    is_binary=True,
    model_name='RandomForest'
)
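
Besides plots, a few scalar metrics could also be tracked for each run. The sketch below computes them with scikit-learn on made-up labels and scores; the metric names (`val_accuracy`, `val_f1`, `val_roc_auc`) are illustrative choices. In the lab notebook, a dictionary like this could be passed to `wandb.log`, with `y_val`, `y_preds` and `y_probas[:, 1]` in place of the toy lists.

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Made-up ground truth, predicted labels and positive-class scores
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 0, 1, 0, 0, 1]
y_score = [0.1, 0.3, 0.8, 0.4, 0.2, 0.9]

# Collect scalar metrics in a dictionary keyed by metric name
metrics = {
    "val_accuracy": accuracy_score(y_true, y_pred),
    "val_f1": f1_score(y_true, y_pred),
    "val_roc_auc": roc_auc_score(y_true, y_score),
}
print(metrics)
```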

### 6. Manage Model

**[6.1]** Add the model as WandB artefact

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
# Persist the trained model to disk, then log the file to W&B as a model artifact
dump(model, '../models/rf-default.joblib')
model_artifact = wandb.Artifact(f"model_{run.id}", type='model')
model_artifact.add_file('../models/rf-default.joblib')
run.log_artifact(model_artifact)

**[6.2]** Add the model to WandB Model Registry

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
run.link_artifact(model_artifact, 'model-registry/bank-churn-rf')

**[6.3]** Stop WandB

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
wandb.finish()

### 7.   Push changes

**[7.1]** Add your changes to the git staging area

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
git add .

**[7.2]** Create the snapshot of your repository and add a description

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
git commit -m "wandb"

**[7.3]** Push your snapshot to GitHub

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
git push

**[7.4]** Go to GitHub and merge the branch after reviewing the code and resolving any conflicts




**[7.5]** Check out the master branch

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
git checkout master

**[7.6]** Pull the latest updates

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
git pull