# **Lab: Engineering for ML**



## Exercise 2: EDA and Baseline Model

In this exercise we will start our data science project by exploring the dataset and preparing it for modeling.

**Pre-requisites:**
- Create a GitHub account (https://github.com/join)
- Install git (https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)
- Install pyenv (https://realpython.com/lessons/installing-pyenv/)
- Install poetry (https://python-poetry.org/docs/#installation)
- Windows users only: install Wget (https://eternallybored.org/misc/wget/)

The steps are:
1. Setup Environment
2. Load and Explore Dataset
3. Prepare Data
4. Split Dataset
5. Get Baseline Model
6. Push Changes


### 1. Setup Environment

**[1.1]** Download the dataset (https://raw.githubusercontent.com/aso-uts/labs_datasets/refs/heads/main/36120-adv_mla/lab01/wfh.csv) into the sub-folder data/raw

In [None]:
# Windows users: install Wget first, or download the file manually from the link above and save it to the path below
! wget -P ~/Projects/adv_mla_2025/adv_mla_lab_1/data/raw https://raw.githubusercontent.com/aso-uts/labs_datasets/refs/heads/main/36120-adv_mla/lab01/wfh.csv
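If Wget is not available, the same download can be done from Python with the standard library. The following is a minimal sketch, not part of the original lab: the helper name `download_file` is hypothetical, and it assumes the destination folder may not exist yet.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_file(url: str, dest_dir: str) -> Path:
    """Download `url` into `dest_dir` and return the path of the saved file.

    Hypothetical helper for illustration; it is not part of the lab material.
    """
    dest = Path(dest_dir).expanduser()
    # Create the folder (and any missing parents) if it does not exist yet
    dest.mkdir(parents=True, exist_ok=True)
    # Use the last segment of the URL as the file name, e.g. "wfh.csv"
    target = dest / url.rsplit("/", 1)[-1]
    urlretrieve(url, target)
    return target


# Example call (requires network access), using the URL and path from step 1.1:
# download_file(
#     "https://raw.githubusercontent.com/aso-uts/labs_datasets/refs/heads/main/36120-adv_mla/lab01/wfh.csv",
#     "~/Projects/adv_mla_2025/adv_mla_lab_1/data/raw",
# )
```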

**[1.5]** Launch Jupyter Lab from your virtual environment

In [None]:
# Placeholder for student's code (command line)

In [None]:
#Solution:
! poetry run jupyter lab

**[1.6]** Create a new Jupyter Notebook called `1_baseline.ipynb` inside the `work/adv_mla_lab_1/notebooks/` directory


### 2. Load and Explore Dataset



**[2.1]** Launch magic commands to automatically reload modules

In [None]:
%load_ext autoreload
%autoreload 2

**[2.2]** Import the pandas and numpy packages

In [None]:
# Placeholder for student's code (Python code)

In [1]:
# Solution
import pandas as pd
import numpy as np

**[2.3]** Load the dataset into a DataFrame called `df`

In [None]:
# Placeholder for student's code (Python code)

In [2]:
#Solution:
df = pd.read_csv('../data/raw/wfh.csv')

**[2.4]** Display the first 5 rows of df

In [None]:
# Placeholder for student's code (Python code)

In [3]:
# Solution
df.head()

| | id | distance_from_office | salary_range | gas_price_per_litre | public_transportation_cost | wfh_prev_workday | workday | tenure | work_home_actual |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 5.962247 | 40K - 60K | 2.119485 | 8.568058 | False | Friday | 0.212653 | 1 |
| 1 | 1 | 0.535872 | 40K - 60K | 2.357199 | 5.425382 | True | Tuesday | 4.927549 | 0 |
| 2 | 2 | 1.969519 | 40K - 60K | 2.366849 | 8.247158 | False | Monday | 0.520817 | 1 |
| 3 | 3 | 2.53041 | 20K - 40K | 2.318722 | 7.944251 | False | Tuesday | 0.453649 | 1 |
| 4 | 4 | 2.253635 | 60K+ | 2.221265 | 8.884478 | True | Thursday | 5.695263 | 1 |


**[2.5]** Display the dimensions (shape) of df

In [None]:
# Placeholder for student's code (Python code)

In [4]:
# Solution
df.shape

(50000, 9)

**[2.6]** Display the summary (info) of df

In [None]:
# Placeholder for student's code (Python code)

In [5]:
# Solution
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 9 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   id                          50000 non-null  int64  
 1   distance_from_office        50000 non-null  float64
 2   salary_range                50000 non-null  object 
 3   gas_price_per_litre         50000 non-null  float64
 4   public_transportation_cost  50000 non-null  float64
 5   wfh_prev_workday            50000 non-null  bool   
 6   workday                     50000 non-null  object 
 7   tenure                      50000 non-null  float64
 8   work_home_actual            50000 non-null  int64  
dtypes: bool(1), float64(4), int64(2), object(2)
memory usage: 3.1+ MB


**[2.7]** Display the descriptive statistics of df


In [None]:
# Placeholder for student's code (Python code)

In [6]:
# Solution
df.describe(include='all')

| | id | distance_from_office | salary_range | gas_price_per_litre | public_transportation_cost | wfh_prev_workday | workday | tenure | work_home_actual |
|---|---|---|---|---|---|---|---|---|---|
| count | 50000.0 | 50000.0 | 50000 | 50000.0 | 50000.0 | 50000 | 50000 | 50000.0 | 50000.0 |
| unique | | | 4 | | | 2 | 5 | | |
| top | | | 0 - 20K | | | True | Wednesday | | |
| freq | | | 19918 | | | 27547 | 10095 | | |
| mean | 24999.5 | 3.929033 | | 2.049616 | 7.323907 | | | 4.60004 | 0.49958 |
| std | 14433.901067 | 4.079528 | | 0.334385 | 1.63039 | | | 2.301937 | 0.500005 |
| min | 0.0 | 0.00221 | | 1.400369 | 4.003417 | | | 0.002253 | 0.0 |
| 25% | 12499.75 | 0.897909 | | 1.769163 | 6.109763 | | | 2.797947 | 0.0 |
| 50% | 24999.5 | 2.380855 | | 2.189073 | 8.074422 | | | 5.584845 | 0.0 |
| 75% | 37499.25 | 5.679015 | | 2.337894 | 8.627489 | | | 6.531917 | 1.0 |
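For a 0/1 column such as `work_home_actual`, the mean reported by `describe()` is the share of rows equal to 1, so the value 0.49958 above suggests the two classes are close to balanced. The sketch below illustrates this on a small made-up Series (the values are hypothetical, not taken from the lab dataset).

```python
import pandas as pd

# Hypothetical toy target, not the lab data: four 1s and four 0s
toy_target = pd.Series([1, 0, 1, 1, 0, 0, 0, 1], name="work_home_actual")

# For a binary 0/1 column, the mean equals the proportion of 1s
share_of_ones = toy_target.mean()

# value_counts(normalize=True) gives the proportion of each class explicitly
class_shares = toy_target.value_counts(normalize=True)

print(share_of_ones)
print(class_shares)
```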


### 3. Prepare Data

**[3.1]** Create a copy of df and save it into a variable called df_cleaned

In [None]:
# Placeholder for student's code (Python code)

In [7]:
# Solution
df_cleaned = df.copy()

**[3.2]** Extract the column `work_home_actual` and save it into a variable called `y`

In [None]:
# Placeholder for student's code (Python code)

In [8]:
# Solution:
y = df_cleaned.pop('work_home_actual')

**[3.3]** Import OrdinalEncoder from sklearn.preprocessing

In [9]:
# Placeholder for student's code (Python code)

In [10]:
# Solution:
from sklearn.preprocessing import OrdinalEncoder

**[3.4]** Instantiate an `OrdinalEncoder` with the values of the `workday` column, listed in weekday order

In [11]:
# Placeholder for student's code (Python code)

In [12]:
# Solution
ord_enc = OrdinalEncoder(categories=[['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']])

**[3.5]** Fit and apply the `OrdinalEncoder` on the `workday` column and replace the column with the encoded values

In [13]:
# Placeholder for student's code (Python code)

In [14]:
# Solution
df_cleaned['workday'] = ord_enc.fit_transform(df_cleaned[['workday']])
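To see what the encoder does, the sketch below applies the same category order to a small made-up DataFrame (the rows are hypothetical, not the lab data). Because the categories are passed explicitly, each day is mapped to its position in that list rather than to its alphabetical rank.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data, not the lab dataset
toy = pd.DataFrame({"workday": ["Friday", "Monday", "Wednesday"]})

# Explicit order: Monday -> 0.0, Tuesday -> 1.0, ..., Friday -> 4.0
toy_enc = OrdinalEncoder(
    categories=[["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]]
)

# fit_transform expects a 2D input (hence the double brackets) and returns
# a 2D float array; ravel() flattens it so it can be stored as one column
toy["workday_encoded"] = toy_enc.fit_transform(toy[["workday"]]).ravel()

print(toy)
```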

**[3.6]** Apply OneHotEncoding on `salary_range` and save the result back in `df_cleaned`

In [15]:
# Placeholder for student's code (Python code)

In [16]:
# Solution
df_cleaned = pd.get_dummies(df_cleaned, columns=["salary_range"])
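As an illustration of what `pd.get_dummies` produces, the sketch below encodes a made-up two-column DataFrame (hypothetical values). The original column is replaced by one indicator column per distinct value, named `<column>_<value>`, which is why `salary_range` with its 4 unique values becomes 4 columns in `df_cleaned`.

```python
import pandas as pd

# Hypothetical toy data, not the lab dataset
toy = pd.DataFrame(
    {
        "salary_range": ["0 - 20K", "60K+", "0 - 20K"],
        "tenure": [1.0, 2.5, 4.0],
    }
)

# Only the listed column is encoded; other columns are left untouched
toy_encoded = pd.get_dummies(toy, columns=["salary_range"])

print(toy_encoded.columns.tolist())
```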

**[3.7]** Remove the `id` column from `df_cleaned`

In [17]:
# Placeholder for student's code (Python code)

In [18]:
# Solution
df_cleaned.drop(["id"], axis=1, inplace=True)

### 4. Split Dataset

**[4.1]** Import `train_test_split` from `sklearn.model_selection`

In [19]:
# Placeholder for student's code (Python code)

In [20]:
# Solution
from sklearn.model_selection import train_test_split

**[4.2]** Split the data into training, validation and testing sets, kept as pandas DataFrames and Series

In [21]:
# Placeholder for student's code (Python code)

In [23]:
# Solution
X_data, X_test, y_data, y_test = train_test_split(df_cleaned, y, test_size=0.2, random_state=8)
X_train, X_val, y_train, y_val = train_test_split(X_data, y_data, test_size=0.2, random_state=8)

**[4.3]** Print the dimensions of `X_train`, `X_val`, `X_test`

In [24]:
# Placeholder for student's code (Python code)

In [25]:
# Solution
print(X_train.shape)
print(X_val.shape)
print(X_test.shape)

(32000, 10)
(8000, 10)
(10000, 10)
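The shapes above follow from applying a 20% split twice: 20% of the 50,000 rows go to the test set, then 20% of the remaining 40,000 go to the validation set, leaving 64% / 16% / 20% overall. A minimal sketch that reproduces these sizes on a placeholder array of row numbers (not the real features):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder standing in for the 50,000 rows of the dataset
rows = np.arange(50_000)

# First split: hold out 20% of all rows for testing
rest, test = train_test_split(rows, test_size=0.2, random_state=8)

# Second split: hold out 20% of the remaining rows for validation
train, val = train_test_split(rest, test_size=0.2, random_state=8)

for name, part in [("train", train), ("val", val), ("test", test)]:
    print(name, len(part), len(part) / len(rows))
```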


**[4.4]** Print the dimensions of `y_train`, `y_val`, `y_test`

In [26]:
# Placeholder for student's code (Python code)

In [27]:
# Solution
print(y_train.shape)
print(y_val.shape)
print(y_test.shape)

(32000,)
(8000,)
(10000,)


**[4.5]** Save the sets as CSV files in the folder `data/processed`

In [28]:
# Placeholder for student's code (Python code)

In [29]:
# Solution
X_train.to_csv('../data/processed/X_train.csv', index=False)
X_val.to_csv('../data/processed/X_val.csv', index=False)
X_test.to_csv('../data/processed/X_test.csv', index=False)
y_train.to_csv('../data/processed/y_train.csv', index=False)
y_val.to_csv('../data/processed/y_val.csv', index=False)
y_test.to_csv('../data/processed/y_test.csv', index=False)
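`to_csv` does not create missing folders, so `data/processed` has to exist before the cell above runs. The sketch below, which writes to a temporary directory rather than the project folder, shows one way to create the folder first and to read a saved target back as a Series. The file name `y_toy.csv` and the toy values are made up for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy target, not the lab data
y_toy = pd.Series([1, 0, 1], name="work_home_actual")

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create the output folder (and parents) if it does not exist yet
    processed_dir = Path(tmp_dir) / "data" / "processed"
    processed_dir.mkdir(parents=True, exist_ok=True)

    # index=False drops the row index; the Series name becomes the CSV header
    y_path = processed_dir / "y_toy.csv"
    y_toy.to_csv(y_path, index=False)

    # read_csv returns a DataFrame; select the column to get a Series back
    y_loaded = pd.read_csv(y_path)["work_home_actual"]

print(y_loaded.tolist())
```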

### 5. Get Baseline Model

**[5.1]** Import the `DummyClassifier` class from `sklearn.dummy`

In [30]:
# Placeholder for student's code (Python code)

In [31]:
# Solution:
from sklearn.dummy import DummyClassifier

**[5.2]** Instantiate the `DummyClassifier` class into a variable called `base_clf` and fit it on the training set

In [32]:
# Placeholder for student's code (Python code)

In [33]:
# Solution:
base_clf = DummyClassifier(strategy='most_frequent')
base_clf.fit(X_train, y_train)
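With `strategy='most_frequent'`, the classifier ignores the features and always predicts the class seen most often during `fit`. A small sketch on made-up arrays (hypothetical values, not the lab data) makes this visible:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy data: class 1 appears three times, class 0 twice
X_toy = np.zeros((5, 2))
y_toy = np.array([0, 1, 1, 1, 0])

toy_clf = DummyClassifier(strategy="most_frequent")
toy_clf.fit(X_toy, y_toy)

# The feature values are irrelevant: every row gets the majority class
X_new = np.ones((3, 2))
toy_preds = toy_clf.predict(X_new)
toy_probas = toy_clf.predict_proba(X_new)

print(toy_preds)
print(toy_probas)
```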

**[5.3]** Import roc_auc_score from sklearn.metrics

In [34]:
# Placeholder for student's code (Python code)

In [35]:
# Solution:
from sklearn.metrics import roc_auc_score

**[5.6]** Display the ROC AUC score of this baseline model on the training set

In [36]:
# Placeholder for student's code (Python code)

In [37]:
# Solution:
y_proba_preds = base_clf.predict_proba(X_train)
roc_auc_score(y_train, y_proba_preds[:, 1])

np.float64(0.5)
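A score of exactly 0.5 is expected here: the baseline assigns the same probability to every row, so it cannot rank positives above negatives, and ROC AUC treats that as chance level. The sketch below checks this on a made-up label vector (hypothetical values) and contrasts it with scores that separate the classes perfectly.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical toy labels, not the lab data
y_true_toy = np.array([0, 0, 1, 1, 1, 0])

# Same score for every row, as a constant baseline would produce
constant_scores = np.zeros(len(y_true_toy))
auc_constant = roc_auc_score(y_true_toy, constant_scores)

# Scores that rank every positive above every negative
perfect_scores = y_true_toy.astype(float)
auc_perfect = roc_auc_score(y_true_toy, perfect_scores)

print(auc_constant, auc_perfect)
```

Any model trained later in the project should therefore score above 0.5 to be considered better than this baseline.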

### 6. Push Changes

**[6.1]** Add your changes to the Git staging area

In [38]:
# Placeholder for student's code (command line)

In [39]:
# Solution:
! git add .



**[6.2]** Create a snapshot (commit) of your repository and add a description

In [40]:
# Placeholder for student's code (command line)

In [41]:
# Solution:
! git commit -m "prepare data and baseline"

[main fe01e01] prepare data and baseline
 1 file changed, 1430 insertions(+)
 create mode 100644 notebooks/1_baseline.ipynb


**[6.3]** Push your snapshot to GitHub

In [42]:
# Placeholder for student's code (command line)

In [43]:
# Solution:
! git push

To https://github.com/tuannm3812/adv_mla_lab_2.git
   f59f2f8..fe01e01  main -> main


**[6.4]** Stop Jupyter Lab

In [44]:
# Solution: this is a keyboard shortcut, not Python code.
# Press Ctrl+C in the terminal where Jupyter Lab is running.