# **Lab: ML Product**



## Exercise 1: EDA and Baseline Model

In this exercise, we will start our data science project by preparing the dataset for modeling and training a baseline model.

**Pre-requisites:**
- Create a DockerHub account (https://hub.docker.com/)
- Create a Render account (https://render.com/)

The steps are:
1.   Setup Repository
2.   Add functionalities to custom package
3.   Load and explore dataset
4.   Prepare Data
5.   Split Dataset
6.   Get Baseline model
7.   Push changes


## 1. Setup Repository

**[1.1]** Go to a folder of your choice on your computer (where you store projects)

In [None]:
# This is an example of a UNIX command. You don't have to use the command line; you can perform this step manually
cd /Users/anthonyso/Projects/adv_mla_2024

**[1.2]** Copy the cookiecutter data science template with `adv_mla_lab_4` as project and repo names

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution
cookiecutter -c v1 https://github.com/drivendata/cookiecutter-data-science

**[1.3]** Go inside the created folder

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution
cd adv_mla_lab_4

**[1.4]** Initialise the repo


In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
git init

**[1.5]** Log in to GitHub with your account (https://github.com/) and create a public repo with the name `adv_mla_lab_4`

**[1.6]** Link your local repo `adv_mla_lab_4` to GitHub (replace `<username>` in the URL with your username)

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
git remote add origin git@github.com:<username>/adv_mla_lab_4

**[1.7]** Add your changes to the git staging area, then commit and push them to GitHub

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
git add .
git commit -m "init"
git push --set-upstream origin main

**[1.8]** Set the python version to 3.11.4 with pyenv

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution
pyenv local 3.11.4

**[1.9]** Initialise a Poetry project with Python ~3.11 and no dependencies installed

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution
poetry init

**[1.10]** Install with poetry the following packages:
*   pandas==2.2.2
*   scikit-learn==1.5.1
*   jupyterlab==4.2.3


In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution
poetry add jupyterlab==4.2.3 pandas==2.2.2 scikit-learn==1.5.1

**[1.11]** Download the dataset into the sub-folder data/raw (https://code.datasciencedojo.com/datasciencedojo/datasets/raw/master/Online%20News%20Popularity/OnlineNewsPopularity.csv)

In [None]:
# Windows users can download and install wget, or manually download the file from the link and save it to the specified path
wget -P /Users/anthonyso/Projects/adv_mla_2024/adv_mla_lab_4/data/raw https://code.datasciencedojo.com/datasciencedojo/datasets/raw/master/Online%20News%20Popularity/OnlineNewsPopularity.csv
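
If `wget` is not available, the same download can be done with Python's standard library. This is only a minimal sketch: the helper names `destination_for` and `download` are illustrative and are not part of the lab's custom package, and the call at the bottom assumes you run it from the project root.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlsplit
from urllib.request import urlretrieve

DATA_URL = (
    "https://code.datasciencedojo.com/datasciencedojo/datasets/raw/master/"
    "Online%20News%20Popularity/OnlineNewsPopularity.csv"
)


def destination_for(url, dest_dir):
    """Build the local path for a URL: <dest_dir>/<decoded file name>."""
    file_name = unquote(PurePosixPath(urlsplit(url).path).name)
    return Path(dest_dir) / file_name


def download(url, dest_dir):
    """Download url into dest_dir (created if missing) and return the saved path."""
    target = destination_for(url, dest_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(url, str(target))
    return target


# Example call (hypothetical usage, not executed here):
# download(DATA_URL, "data/raw")
```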

**[1.12]** Launch Jupyter Lab from your virtual environment

In [None]:
# Placeholder for student's code (command line)

In [None]:
#Solution:
poetry run jupyter lab

**[1.13]** Create a new Jupyter Notebook called `1_baseline.ipynb` inside the `work/adv_mla_lab_4/notebooks/` directory


## 2.   Add functionalities to custom package

**[2.1]** Inside `src/models/performance.py`, define a function called `assess_regressor_set` with the following logic:
- input parameters: trained model (`model`), features for a set (`features`), target for a set (`target`) and name of the set (`set_name`)
- logic: Save the predictions of the model and print the RMSE and MAE on the given set
- output parameters: None

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution:
def assess_regressor_set(model, features, target, set_name=''):
    """Save the predictions from a trained model on a given set and print its RMSE and MAE scores

    Parameters
    ----------
    model: sklearn.base.BaseEstimator
        Trained Sklearn model with set hyperparameters
    features : Numpy Array
        Features
    target : Numpy Array
        Target variable
    set_name : str
        Name of the set to be printed

    Returns
    -------
    None
    """
    preds = model.predict(features)
    print_regressor_scores(y_preds=preds, y_actuals=target, set_name=set_name)
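
`assess_regressor_set` relies on `print_regressor_scores`, which is expected to already exist in the same module. Its code is not shown in this lab, so the block below is only a guess at what it might do, written with NumPy; the names `regressor_scores` and `print_regressor_scores_sketch` are hypothetical.

```python
import numpy as np


def regressor_scores(y_preds, y_actuals):
    """Return (RMSE, MAE) between predictions and actual values."""
    errors = np.asarray(y_preds, dtype=float) - np.asarray(y_actuals, dtype=float)
    rmse = float(np.sqrt(np.mean(errors ** 2)))  # root of the mean squared error
    mae = float(np.mean(np.abs(errors)))         # mean of the absolute errors
    return rmse, mae


def print_regressor_scores_sketch(y_preds, y_actuals, set_name=''):
    """Print RMSE and MAE for one set (assumed behaviour of the real helper)."""
    rmse, mae = regressor_scores(y_preds, y_actuals)
    print(f"RMSE {set_name}: {rmse}")
    print(f"MAE {set_name}: {mae}")


print_regressor_scores_sketch(y_preds=[2, 4], y_actuals=[1, 1], set_name='Toy')
```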

**[2.2]** Inside `src/models/performance.py`, define a function called `fit_assess_regressor` with the following logic:
- input parameters: instantiated model (`model`), features for the training set (`X_train`), target for the training set (`y_train`), features for the validation set (`X_val`), target for the validation set (`y_val`)
- logic: Train the model on the training set and print its RMSE and MAE scores on the training and validation sets
- output parameters: trained model

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution:
def fit_assess_regressor(model, X_train, y_train, X_val, y_val):
    """Train a regressor model, print its RMSE and MAE scores on the training and validation sets and return the trained model

    Parameters
    ----------
    model: sklearn.base.BaseEstimator
        Instantiated Sklearn model with set hyperparameters
    X_train : Numpy Array
        Features for the training set
    y_train : Numpy Array
        Target for the training set
    X_val : Numpy Array
        Features for the validation set
    y_val : Numpy Array
        Target for the validation set

    Returns
    -------
    sklearn.base.BaseEstimator
        Trained model
    """
    model.fit(X_train, y_train)
    assess_regressor_set(model, X_train, y_train, set_name='Training')
    assess_regressor_set(model, X_val, y_val, set_name='Validation')
    return model

**[2.3]** Add your changes to the git staging area

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution
git add .

**[2.4]** Create the snapshot of your repository and add a description

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution
git commit -m "add regressor assessment"

**[2.5]** Push your snapshot to GitHub

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution
git push

**[2.6]** Version your package with Poetry

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
poetry version 2024.0.1.3

**[2.7]** Build your package with Poetry

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
poetry build

**[2.8]** Publish your built package to TestPyPI

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
poetry publish -r test-pypi


## 3. Load and Explore Dataset



**[3.1]** Launch magic commands to automatically reload modules

In [1]:
%load_ext autoreload
%autoreload 2

**[3.2]** Install your custom package with pip

In [None]:
# Placeholder for student's code (Python code)

In [2]:
# Solution
! pip install -i https://test.pypi.org/simple/ my-krml-studentid

Looking in indexes: https://test.pypi.org/simple/
Collecting my-krml-studentid
  Using cached https://test-files.pythonhosted.org/packages/52/42/cccd42db9498a27a869d8486806c3d1d35f39366cc67cfee038969da0a5c/my_krml_studentid-0.1.9-py3-none-any.whl.metadata (1.3 kB)
Using cached https://test-files.pythonhosted.org/packages/52/42/cccd42db9498a27a869d8486806c3d1d35f39366cc67cfee038969da0a5c/my_krml_studentid-0.1.9-py3-none-any.whl (5.5 kB)
Installing collected packages: my-krml-studentid
Successfully installed my-krml-studentid-0.1.9


**[3.3]** Import the pandas and numpy packages

In [None]:
# Placeholder for student's code (Python code)

In [3]:
# Solution
import pandas as pd
import numpy as np

**[3.4]** Load the dataset into a dataframe called `df`

In [4]:
# Placeholder for student's code (Python code)

In [5]:
#Solution:
df = pd.read_csv('../data/raw/OnlineNewsPopularity.csv')

**[3.5]** Display the first 5 rows of df

In [6]:
# Placeholder for student's code (Python code)

In [7]:
# Solution
df.head()

| | url | timedelta | n_tokens_title | n_tokens_content | n_unique_tokens | n_non_stop_words | n_non_stop_unique_tokens | num_hrefs | num_self_hrefs | num_imgs | ... | min_positive_polarity | max_positive_polarity | avg_negative_polarity | min_negative_polarity | max_negative_polarity | title_subjectivity | title_sentiment_polarity | abs_title_subjectivity | abs_title_sentiment_polarity | shares |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | http://mashable.com/2013/01/07/amazon-instant-... | 731.0 | 12.0 | 219.0 | 0.663594 | 1.0 | 0.815385 | 4.0 | 2.0 | 1.0 | ... | 0.1 | 0.7 | -0.35 | -0.6 | -0.2 | 0.5 | -0.1875 | 0.0 | 0.1875 | 593 |
| 1 | http://mashable.com/2013/01/07/ap-samsung-spon... | 731.0 | 9.0 | 255.0 | 0.604743 | 1.0 | 0.791946 | 3.0 | 1.0 | 1.0 | ... | 0.033333 | 0.7 | -0.11875 | -0.125 | -0.1 | 0.0 | 0.0 | 0.5 | 0.0 | 711 |
| 2 | http://mashable.com/2013/01/07/apple-40-billio... | 731.0 | 9.0 | 211.0 | 0.57513 | 1.0 | 0.663866 | 3.0 | 1.0 | 1.0 | ... | 0.1 | 1.0 | -0.466667 | -0.8 | -0.133333 | 0.0 | 0.0 | 0.5 | 0.0 | 1500 |
| 3 | http://mashable.com/2013/01/07/astronaut-notre... | 731.0 | 9.0 | 531.0 | 0.503788 | 1.0 | 0.665635 | 9.0 | 0.0 | 1.0 | ... | 0.136364 | 0.8 | -0.369697 | -0.6 | -0.166667 | 0.0 | 0.0 | 0.5 | 0.0 | 1200 |
| 4 | http://mashable.com/2013/01/07/att-u-verse-apps/ | 731.0 | 13.0 | 1072.0 | 0.415646 | 1.0 | 0.54089 | 19.0 | 19.0 | 20.0 | ... | 0.033333 | 1.0 | -0.220192 | -0.5 | -0.05 | 0.454545 | 0.136364 | 0.045455 | 0.136364 | 505 |


**[3.6]** Display the dimensions (shape) of df

In [8]:
# Placeholder for student's code (Python code)

In [9]:
# Solution
df.shape

(39644, 61)

**[3.7]** Display the summary (info) of df

In [10]:
# Placeholder for student's code (Python code)

In [11]:
# Solution
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39644 entries, 0 to 39643
Data columns (total 61 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   url                             39644 non-null  object 
 1    timedelta                      39644 non-null  float64
 2    n_tokens_title                 39644 non-null  float64
 3    n_tokens_content               39644 non-null  float64
 4    n_unique_tokens                39644 non-null  float64
 5    n_non_stop_words               39644 non-null  float64
 6    n_non_stop_unique_tokens       39644 non-null  float64
 7    num_hrefs                      39644 non-null  float64
 8    num_self_hrefs                 39644 non-null  float64
 9    num_imgs                       39644 non-null  float64
 10   num_videos                     39644 non-null  float64
 11   average_token_length           39644 non-null  float64
 12   num_keywords                   

**[3.8]** Display the descriptive statistics of df


In [12]:
# Placeholder for student's code (Python code)

In [13]:
# Solution
df.describe()

| | timedelta | n_tokens_title | n_tokens_content | n_unique_tokens | n_non_stop_words | n_non_stop_unique_tokens | num_hrefs | num_self_hrefs | num_imgs | num_videos | ... | min_positive_polarity | max_positive_polarity | avg_negative_polarity | min_negative_polarity | max_negative_polarity | title_subjectivity | title_sentiment_polarity | abs_title_subjectivity | abs_title_sentiment_polarity | shares |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 39644.0 | 39644.0 | 39644.0 | 39644.0 | 39644.0 | 39644.0 | 39644.0 | 39644.0 | 39644.0 | 39644.0 | ... | 39644.0 | 39644.0 | 39644.0 | 39644.0 | 39644.0 | 39644.0 | 39644.0 | 39644.0 | 39644.0 | 39644.0 |
| mean | 354.530471 | 10.398749 | 546.514731 | 0.548216 | 0.996469 | 0.689175 | 10.88369 | 3.293638 | 4.544143 | 1.249874 | ... | 0.095446 | 0.756728 | -0.259524 | -0.521944 | -0.1075 | 0.282353 | 0.071425 | 0.341843 | 0.156064 | 3395.380184 |
| std | 214.163767 | 2.114037 | 471.107508 | 3.520708 | 5.231231 | 3.264816 | 11.332017 | 3.855141 | 8.309434 | 4.107855 | ... | 0.071315 | 0.247786 | 0.127726 | 0.29029 | 0.095373 | 0.324247 | 0.26545 | 0.188791 | 0.226294 | 11626.950749 |
| min | 8.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | -1.0 | -1.0 | -1.0 | 0.0 | -1.0 | 0.0 | 0.0 | 1.0 |
| 25% | 164.0 | 9.0 | 246.0 | 0.47087 | 1.0 | 0.625739 | 4.0 | 1.0 | 1.0 | 0.0 | ... | 0.05 | 0.6 | -0.328383 | -0.7 | -0.125 | 0.0 | 0.0 | 0.166667 | 0.0 | 946.0 |
| 50% | 339.0 | 10.0 | 409.0 | 0.539226 | 1.0 | 0.690476 | 8.0 | 3.0 | 1.0 | 0.0 | ... | 0.1 | 0.8 | -0.253333 | -0.5 | -0.1 | 0.15 | 0.0 | 0.5 | 0.0 | 1400.0 |
| 75% | 542.0 | 12.0 | 716.0 | 0.608696 | 1.0 | 0.75463 | 14.0 | 4.0 | 4.0 | 1.0 | ... | 0.1 | 1.0 | -0.186905 | -0.3 | -0.05 | 0.5 | 0.15 | 0.5 | 0.25 | 2800.0 |
| max | 731.0 | 23.0 | 8474.0 | 701.0 | 1042.0 | 650.0 | 304.0 | 116.0 | 128.0 | 91.0 | ... | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.5 | 1.0 | 843300.0 |


## 4. Prepare Data

**[4.1]** Create a copy of df and save it into a variable called df_cleaned

In [14]:
# Placeholder for student's code (Python code)

In [15]:
# Solution
df_cleaned = df.copy()

**[4.2]** Drop the column `url`

In [16]:
# Placeholder for student's code (Python code)

In [17]:
# Solution:
df_cleaned.drop('url', axis=1, inplace=True)

**[4.3]** Remove leading and trailing space from the column names

In [18]:
# Placeholder for student's code (Python code)

In [19]:
# Solution:
df_cleaned.columns = df_cleaned.columns.str.strip()
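
The `df.info()` output earlier shows why this step matters: every column except `url` is stored with a leading space (for example ` timedelta`), so a lookup such as `df['shares']` would fail on the raw dataframe. The toy dataframe below is made up for illustration and shows the effect of `str.strip()` on such headers.

```python
import pandas as pd

# Toy frame with the same kind of padded headers as the raw CSV
toy = pd.DataFrame({"url": ["a"], " timedelta": [731.0], " shares": [593]})

# Before cleaning, the column is only reachable with its leading space
has_shares_before = "shares" in toy.columns

# Strip leading and trailing whitespace from every column name
toy.columns = toy.columns.str.strip()
has_shares_after = "shares" in toy.columns

print(has_shares_before, has_shares_after)
print(list(toy.columns))
```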

**[4.4]** Import the `pop_target()` function from your custom package



In [20]:
# Placeholder for student's code (Python code)

In [21]:
# Solution:
from my_krml_studentid.data.sets import pop_target

**[4.5]** Use your `pop_target()` function to extract the column `shares` and save the results into 2 variables called `features` and `target`

In [22]:
# Placeholder for student's code (Python code)

In [23]:
# Solution:
features, target = pop_target(df_cleaned, 'shares')
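
The code of `pop_target()` lives in your custom package and is not reproduced in this lab. As a reminder of the idea, here is a minimal sketch under the assumption that it copies the dataframe and pops the target column; the name `pop_target_sketch` and the toy data are hypothetical.

```python
import pandas as pd


def pop_target_sketch(df, target_col):
    """Return (features, target) without modifying the input dataframe."""
    df_copy = df.copy()               # keep the caller's dataframe intact
    target = df_copy.pop(target_col)  # removes the column and returns it as a Series
    return df_copy, target


toy = pd.DataFrame({"n_tokens_title": [12.0, 9.0], "shares": [593, 711]})
toy_features, toy_target = pop_target_sketch(toy, "shares")
print(list(toy_features.columns), list(toy_target))
```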

**[4.6]** Import StandardScaler from sklearn.preprocessing

In [24]:
# Placeholder for student's code (Python code)

In [25]:
# Solution
from sklearn.preprocessing import StandardScaler

**[4.7]** Instantiate the StandardScaler

In [26]:
# Placeholder for student's code (Python code)

In [27]:
# Solution
scaler = StandardScaler()

**[4.8]** Fit and apply the scaling on `features` and convert it back to a dataframe called `features`

In [28]:
# Placeholder for student's code (Python code)

In [29]:
# Solution:
features = pd.DataFrame(scaler.fit_transform(features), columns=features.columns)
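
To make the transformation concrete: for each column, `StandardScaler` subtracts the column mean and divides by the column's population standard deviation (`ddof=0`). The NumPy sketch below reproduces that formula on a made-up array; it skips the special handling scikit-learn applies to constant columns, where the scale is set to 1 to avoid dividing by zero.

```python
import numpy as np

# Made-up feature matrix: 3 rows, 2 columns
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

col_mean = X.mean(axis=0)
col_std = X.std(axis=0)  # population standard deviation (ddof=0)

# z-score: every column ends up with mean 0 and standard deviation 1
X_scaled = (X - col_mean) / col_std
print(X_scaled)
```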

**[4.9]** Import dump from joblib



In [30]:
# Placeholder for student's code (Python code)

In [31]:
# Solution:
from joblib import dump

**[4.10]** Save the scaler into the folder `models` and call the file `scaler.joblib`

In [32]:
# Placeholder for student's code (Python code)

In [33]:
# Solution:
dump(scaler, '../models/scaler.joblib')

['../models/scaler.joblib']
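
Saving the fitted scaler lets you apply exactly the same scaling to new data later, for example when serving predictions. The sketch below illustrates the round trip with `joblib.load` on a made-up array and a temporary folder, so it does not touch your `models` directory.

```python
import tempfile
from pathlib import Path

import numpy as np
from joblib import dump, load
from sklearn.preprocessing import StandardScaler

# Fit a scaler on made-up data
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
scaler = StandardScaler().fit(X)

# Save it, then load it back as a new object
with tempfile.TemporaryDirectory() as tmp_dir:
    scaler_path = Path(tmp_dir) / "scaler.joblib"
    dump(scaler, str(scaler_path))
    restored = load(str(scaler_path))

# The restored scaler applies the same means and scales to unseen rows
new_rows = np.array([[2.0, 30.0]])
print(restored.transform(new_rows))
```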

## 5. Split Dataset

**[5.1]** Import the function `split_sets_random()` from your custom package

In [34]:
# Placeholder for student's code (Python code)

In [35]:
# Solution
from my_krml_studentid.data.sets import split_sets_random

**[5.2]** Split the data into training, validation and testing sets as Numpy arrays

In [36]:
# Placeholder for student's code (Python code)

In [37]:
# Solution
X_train, y_train, X_val, y_val, X_test, y_test = split_sets_random(features, target, test_ratio=0.2)

**[5.3]** Print the dimensions of `X_train`, `X_val`, `X_test`

In [38]:
# Placeholder for student's code (Python code)

In [39]:
# Solution
print(X_train.shape)
print(X_val.shape)
print(X_test.shape)

(23786, 59)
(7929, 59)
(7929, 59)
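
These shapes correspond to a 60/20/20 split of the 39644 rows. The implementation of `split_sets_random()` is in your custom package; the NumPy sketch below is only one possible way to get the same set sizes, assuming the validation set uses the same ratio as the testing set and that sizes are rounded up. The function name and the `seed` parameter are hypothetical.

```python
import math

import numpy as np


def split_sets_random_sketch(features, target, test_ratio=0.2, seed=8):
    """Randomly split features/target into train, validation and test arrays."""
    features = np.asarray(features)
    target = np.asarray(target)
    n_rows = len(features)
    n_holdout = math.ceil(n_rows * test_ratio)  # size of validation and of test

    # Shuffle row indices once, then slice them into three disjoint groups
    idx = np.random.default_rng(seed).permutation(n_rows)
    test_idx = idx[:n_holdout]
    val_idx = idx[n_holdout:2 * n_holdout]
    train_idx = idx[2 * n_holdout:]

    return (features[train_idx], target[train_idx],
            features[val_idx], target[val_idx],
            features[test_idx], target[test_idx])


# Dummy data with the same dimensions as the scaled dataset
dummy_X = np.zeros((39644, 59))
dummy_y = np.arange(39644)
parts = split_sets_random_sketch(dummy_X, dummy_y, test_ratio=0.2)
print([p.shape for p in parts])
```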


**[5.4]** Print the dimensions of `y_train`, `y_val`, `y_test`

In [40]:
# Placeholder for student's code (Python code)

In [41]:
# Solution
print(y_train.shape)
print(y_val.shape)
print(y_test.shape)

(23786,)
(7929,)
(7929,)


**[5.5]** Import the `save_sets()` function from your custom package

In [42]:
# Placeholder for student's code (Python code)

In [43]:
# Solution
from my_krml_studentid.data.sets import save_sets

**[5.6]** Save the sets into the folder `data/processed`

In [44]:
# Placeholder for student's code (Python code)

In [45]:
# Solution
save_sets(X_train, y_train, X_val, y_val, X_test, y_test, path='../data/processed/')
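
As a reminder of what this call is expected to do, here is a minimal sketch of a `save_sets`-style helper. It assumes each set is written as a separate `.npy` file named after the variable (`X_train.npy`, `y_train.npy`, and so on); the real function in your package may use different file names. The demo writes to a temporary folder.

```python
import os
import tempfile

import numpy as np


def save_sets_sketch(X_train=None, y_train=None, X_val=None, y_val=None,
                     X_test=None, y_test=None, path='../data/processed/'):
    """Save every provided set as <path>/<name>.npy (assumed naming)."""
    sets = {'X_train': X_train, 'y_train': y_train,
            'X_val': X_val, 'y_val': y_val,
            'X_test': X_test, 'y_test': y_test}
    os.makedirs(path, exist_ok=True)
    for name, data in sets.items():
        if data is not None:
            np.save(os.path.join(path, name), data)  # np.save appends ".npy"


with tempfile.TemporaryDirectory() as tmp_dir:
    save_sets_sketch(X_train=np.ones((3, 2)), y_train=np.array([1, 2, 3]), path=tmp_dir)
    saved_files = sorted(os.listdir(tmp_dir))
    reloaded = np.load(os.path.join(tmp_dir, 'X_train.npy'))

print(saved_files)
```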

## 6. Get Baseline Model

**[6.1]** Import the `NullRegressor` from your custom package

In [46]:
# Placeholder for student's code (Python code)

In [47]:
# Solution:
from my_krml_studentid.models.null import NullRegressor

**[6.2]** Instantiate a `NullRegressor` and save it into a variable called `base_model`

In [48]:
# Placeholder for student's code (Python code)

In [49]:
# Solution:
base_model = NullRegressor()

**[6.3]** Make a prediction using `fit_predict()` and save the results in a variable called `y_base`

In [50]:
# Placeholder for student's code (Python code)

In [51]:
# Solution:
y_base = base_model.fit_predict(y_train)
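
The `NullRegressor` class comes from your custom package and is not defined in this lab. A common choice for a regression baseline is to predict the mean of the training target for every observation; the sketch below assumes that behaviour. The class name `NullRegressorSketch` and its attribute are hypothetical.

```python
import numpy as np


class NullRegressorSketch:
    """Baseline that always predicts the mean of the target it was fitted on."""

    def fit(self, y):
        self.pred_value = float(np.mean(y))  # the single value used for every prediction
        return self

    def predict(self, y):
        # One prediction per observation, all equal to the stored mean
        return np.full(len(y), self.pred_value)

    def fit_predict(self, y):
        return self.fit(y).predict(y)


toy_target = np.array([1.0, 2.0, 6.0])
toy_base = NullRegressorSketch().fit_predict(toy_target)
print(toy_base)
```

Any model you train later should achieve lower RMSE and MAE than this baseline to be worth keeping.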

**[6.4]** Import the `print_regressor_scores()` function from your custom package

In [52]:
# Placeholder for student's code (Python code)

In [53]:
# Solution:
from my_krml_studentid.models.performance import print_regressor_scores

**[6.5]** Display the RMSE and MAE scores of this baseline model on the training set

In [55]:
# Placeholder for student's code (Python code)

In [56]:
# Solution:
print_regressor_scores(y_preds=y_base, y_actuals=y_train, set_name='Training')

RMSE Training: 11854.785261497587
MAE Training: 3191.7838929535715




## 7.   Push changes

**[7.1]** Add your changes to git staging area

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
git add .

**[7.2]** Create the snapshot of your repository and add a description

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
git commit -m "prepare data and baseline"

**[7.3]** Push your snapshot to GitHub

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
git push

**[7.4]** Stop Jupyter Lab