# **Lab: Custom Python Package**



## Exercise 2: EDA and Baseline Model

In this exercise, we will start our data science project by preparing the dataset for modeling.

**Pre-requisites:**
- Create a GitHub account (https://github.com/join)
- Install git (https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)
- Install pyenv (https://realpython.com/lessons/installing-pyenv/)
- Install poetry (https://python-poetry.org/docs/#installation)
- Windows users only: install Wget (https://eternallybored.org/misc/wget/)

The steps are:
1.   Set Up Environment
2.   Load and Explore Dataset
3.   Prepare Data
4.   Split Dataset
5.   Get Baseline Model
6.   Push Changes


### 1. Setup Environment

**[1.1]** Download the dataset (https://raw.githubusercontent.com/aso-uts/labs_datasets/main/36120-adv_mla/lab01/insurance.csv) into the sub-folder data/raw

In [None]:
# Windows users: either install Wget first, or download the file manually from the link and save it to the specified path
! wget -P ~/Projects/adv_mla_2024/adv_mla_lab_1/data/raw https://raw.githubusercontent.com/aso-uts/labs_datasets/main/36120-adv_mla/lab02/insurance.csv
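
If Wget is not available, the download can also be done from Python. The sketch below is one possible approach using only the standard library; `download_file` is a hypothetical helper name, not part of the lab package, and the destination folder is passed in so you can adapt it to your own project path.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_file(url: str, dest_dir: str) -> Path:
    """Download `url` into `dest_dir` and return the path of the saved file."""
    dest = Path(dest_dir).expanduser()
    dest.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    target = dest / url.rsplit("/", 1)[-1]   # keep the remote file name
    urlretrieve(url, target)
    return target


# Example call (commented out so nothing is downloaded by accident):
# download_file("<dataset URL from step 1.1>", "~/Projects/adv_mla_2024/adv_mla_lab_1/data/raw")
```

The helper keeps the remote file name, so the dataset ends up as `insurance.csv` inside the folder you pass.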

**[1.5]** Launch Jupyter Lab from your virtual environment

In [None]:
# Placeholder for student's code (command line)

In [None]:
#Solution:
! poetry run jupyter lab

**[1.6]** Create a new Jupyter Notebook called `1_baseline.ipynb` inside the `work/adv_mla_lab_2/notebooks/` directory


### 2. Load and Explore Dataset



**[2.0]** Install your custom package with pip

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
! pip install -i https://test.pypi.org/simple/ my-krml-149874==2025.0.1.1

**[2.1]** Run the magic commands that automatically reload modules

In [None]:
%load_ext autoreload
%autoreload 2

**[2.2]** Import the pandas and numpy packages

In [None]:
# Placeholder for student's code (Python code)

In [1]:
# Solution
import pandas as pd
import numpy as np

**[2.3]** Load the dataset into a dataframe called `df`

In [None]:
# Placeholder for student's code (Python code)

In [2]:
#Solution:
df = pd.read_csv('../data/raw/insurance.csv')

**[2.4]** Display the first 5 rows of df

In [None]:
# Placeholder for student's code (Python code)

In [3]:
# Solution
df.head()

|   | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 18 | female | 33.82 | 0 | no | southeast | 1630.6617 |
| 1 | 19 | female | 23.48 | 1 | no | southeast | 1836.8043 |
| 2 | 46 | male | 30.57 | 2 | no | southeast | 6632.3513 |
| 3 | 54 | male | 32.05 | 1 | yes | southeast | 31922.4295 |
| 4 | 21 | male | 21.345 | 4 | no | northeast | 1638.37255 |


**[2.5]** Display the dimensions (shape) of df

In [None]:
# Placeholder for student's code (Python code)

In [4]:
# Solution
df.shape

(50000, 7)

**[2.6]** Display the summary (info) of df

In [None]:
# Placeholder for student's code (Python code)

In [5]:
# Solution
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       50000 non-null  int64  
 1   sex       50000 non-null  object 
 2   bmi       50000 non-null  float64
 3   children  50000 non-null  int64  
 4   smoker    50000 non-null  object 
 5   region    50000 non-null  object 
 6   charges   50000 non-null  float64
dtypes: float64(2), int64(2), object(3)
memory usage: 2.7+ MB


**[2.7]** Display the descriptive statistics of df


In [None]:
# Placeholder for student's code (Python code)

In [6]:
# Solution
df.describe(include='all')

|   | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| count | 50000.0 | 50000 | 50000.0 | 50000.0 | 50000 | 50000 | 50000.0 |
| unique |  | 2 |  |  | 2 | 4 |  |
| top |  | male |  |  | no | southeast |  |
| freq |  | 25176 |  |  | 38976 | 14197 |  |
| mean | 39.46312 |  | 30.713734 | 1.11376 |  |  | 13343.216363 |
| std | 14.117142 |  | 6.092727 | 1.212835 |  |  | 12131.222744 |
| min | 18.0 |  | 17.291 | 0.0 |  |  | 1137.5359 |
| 25% | 27.0 |  | 26.6 | 0.0 |  |  | 4694.4318 |
| 50% | 40.0 |  | 30.3 | 1.0 |  |  | 9399.232775 |
| 75% | 51.0 |  | 34.57 | 2.0 |  |  | 17340.746925 |


### 3. Prepare Data

**[3.1]** Create a copy of df and save it into a variable called df_cleaned

In [None]:
# Placeholder for student's code (Python code)

In [7]:
# Solution
df_cleaned = df.copy()

**[3.2]** Extract the column `charges` and save it into a variable called `target`

In [None]:
# Placeholder for student's code (Python code)

In [8]:
# Solution:
target = df_cleaned.pop('charges')

**[3.3]** Create 2 lists named `num_cols` and `cat_cols` containing the names of the numerical and categorical columns respectively

In [None]:
# Placeholder for student's code (Python code)

In [9]:
# Solution:
num_cols = list(df_cleaned.select_dtypes('number').columns)
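# Select by dtype rather than by set difference so the column order stays stable between runs
cat_cols = list(df_cleaned.select_dtypes(exclude='number').columns)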
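
To check what ends up in each list, it can help to try the selection on a tiny dataframe first. The sketch below uses a made-up frame called `toy` (an assumption for illustration, not the lab dataset) and selects columns by dtype, which keeps them in their original order.

```python
import pandas as pd

# Toy frame with two numerical and two categorical columns (illustrative values only)
toy = pd.DataFrame({
    "age": [18, 46],
    "sex": ["female", "male"],
    "bmi": [33.82, 30.57],
    "smoker": ["no", "yes"],
})

num_cols = list(toy.select_dtypes(include="number").columns)
cat_cols = list(toy.select_dtypes(exclude="number").columns)

print(num_cols)  # ['age', 'bmi']
print(cat_cols)  # ['sex', 'smoker']
```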

**[3.4]** Import StandardScaler and OneHotEncoder from sklearn.preprocessing

In [None]:
# Placeholder for student's code (Python code)

In [10]:
# Solution
from sklearn.preprocessing import StandardScaler, OneHotEncoder

**[3.5]** Instantiate the OneHotEncoder

In [None]:
# Placeholder for student's code (Python code)

In [11]:
# Solution
ohe = OneHotEncoder(sparse_output=False, drop='first')

**[3.6]** Fit and apply the OneHotEncoder on `df_cleaned` and save the result in `features`

In [None]:
# Placeholder for student's code (Python code)

In [12]:
# Solution
features = ohe.fit_transform(df_cleaned[cat_cols])

**[3.7]** Convert `features` into a dataframe

In [None]:
# Placeholder for student's code (Python code)

In [13]:
# Solution
features = pd.DataFrame(features, columns=ohe.get_feature_names_out())

**[3.8]** Instantiate the StandardScaler

In [None]:
# Placeholder for student's code (Python code)

In [14]:
# Solution
scaler = StandardScaler()

**[3.9]** Fit and apply the scaling on the numerical columns of `df_cleaned` and add the results into `features`

In [None]:
# Placeholder for student's code (Python code)

In [15]:
# Solution:
features[num_cols] = scaler.fit_transform(df_cleaned[num_cols])
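
As a quick sanity check of what the scaler does, the sketch below standardises a handful of made-up ages and compares the result with the formula `(x - mean) / std` computed by hand. The values are illustrative only; note that `StandardScaler` uses the population standard deviation, which matches NumPy's default `std()`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One numerical column with four illustrative values
ages = np.array([[18.0], [46.0], [54.0], [21.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(ages)

# Same result computed by hand: (x - mean) / std, with the population std (ddof=0)
manual = (ages - ages.mean()) / ages.std()

print(np.allclose(scaled, manual))  # True
```

After scaling, the column has a mean of 0 and a standard deviation of 1.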

**[3.10]** Import dump from joblib



In [None]:
# Placeholder for student's code (Python code)

In [16]:
# Solution:
from joblib import dump

**[3.11]** Save the one-hot encoder and the scaler into the folder `models`, naming the files `ohe.joblib` and `scaler.joblib` respectively

In [None]:
# Placeholder for student's code (Python code)

In [17]:
# Solution:
dump(ohe, '../models/ohe.joblib')
dump(scaler, '../models/scaler.joblib')

['../models/scaler.joblib']
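
Saving the fitted transformers matters because the same encoding and scaling must be reapplied to any new data later on. The sketch below shows a round trip with `joblib.load` on a small scaler; it writes to a temporary folder (an assumption made so the example does not touch your `models` directory).

```python
import tempfile
from pathlib import Path

import numpy as np
from joblib import dump, load
from sklearn.preprocessing import StandardScaler

# Fit a scaler on three illustrative values
scaler = StandardScaler().fit(np.array([[1.0], [2.0], [3.0]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "scaler.joblib"
    dump(scaler, path)       # save the fitted scaler
    restored = load(path)    # load it back as a new object

# The restored scaler transforms new data exactly like the original one
new_data = np.array([[4.0]])
same = np.allclose(scaler.transform(new_data), restored.transform(new_data))
print(same)  # True
```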

### 4. Split Dataset

**[4.1]** Import your new function `split_sets_random`

In [None]:
# Placeholder for student's code (Python code)

In [18]:
# Solution
from my_krml_149874.data.sets import split_sets_random

**[4.2]** Split the data into training, validation and testing sets

In [None]:
# Placeholder for student's code (Python code)

In [19]:
# Solution
X_train, y_train, X_val, y_val, X_test, y_test = split_sets_random(features, target, test_ratio=0.2)
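
The implementation of `split_sets_random` lives in your custom package and is not shown here. As a rough idea of what such a function might do, the sketch below defines a hypothetical stand-in, `split_sets_random_sketch`, that calls scikit-learn's `train_test_split` twice; the `random_state` parameter and its default are assumptions. With `test_ratio=0.2`, the test and validation sets each get 20% of the rows and the training set the remaining 60%.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_sets_random_sketch(features, target, test_ratio=0.2, random_state=42):
    """Hypothetical stand-in: randomly split into train / validation / test sets."""
    # Validation ratio relative to what is left once the test set is removed
    val_ratio = test_ratio / (1 - test_ratio)
    X_rest, X_test, y_rest, y_test = train_test_split(
        features, target, test_size=test_ratio, random_state=random_state
    )
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=val_ratio, random_state=random_state
    )
    return X_train, y_train, X_val, y_val, X_test, y_test


# 100 illustrative rows with 2 features each
X = np.arange(200).reshape(100, 2)
y = np.arange(100)
X_train, y_train, X_val, y_val, X_test, y_test = split_sets_random_sketch(X, y, test_ratio=0.2)
print(X_train.shape, X_val.shape, X_test.shape)  # (60, 2) (20, 2) (20, 2)
```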

**[4.3]** Print the dimensions of `X_train`, `X_val`, `X_test`

In [None]:
# Placeholder for student's code (Python code)

In [20]:
# Solution
print(X_train.shape)
print(X_val.shape)
print(X_test.shape)

(30000, 8)
(10000, 8)
(10000, 8)


**[4.4]** Print the dimensions of `y_train`, `y_val`, `y_test`

In [None]:
# Placeholder for student's code (Python code)

In [21]:
# Solution
print(y_train.shape)
print(y_val.shape)
print(y_test.shape)

(30000,)
(10000,)
(10000,)


**[4.5]** Import the `save_sets()` function from `my_krml_149874/data/sets.py`

In [None]:
# Placeholder for student's code (Python code)

In [22]:
# Solution
from my_krml_149874.data.sets import save_sets

**[4.6]** Save the sets into the folder `data/processed`

In [None]:
# Placeholder for student's code (Python code)

In [23]:
# Solution
save_sets(X_train, y_train, X_val, y_val, X_test, y_test, path='../data/processed/')
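
Like the splitting function, `save_sets` comes from your custom package. The sketch below is a hypothetical stand-in, `save_sets_sketch`, that assumes each set is written as a NumPy `.npy` file named after the variable; the file format and names are assumptions, and your own implementation may differ. It writes to a temporary folder so the example does not touch `data/processed`.

```python
import tempfile
from pathlib import Path

import numpy as np


def save_sets_sketch(X_train=None, y_train=None, X_val=None, y_val=None,
                     X_test=None, y_test=None, path="../data/processed/"):
    """Hypothetical stand-in: save every set that is not None as a .npy file."""
    out_dir = Path(path)
    out_dir.mkdir(parents=True, exist_ok=True)
    sets = {
        "X_train": X_train, "y_train": y_train,
        "X_val": X_val, "y_val": y_val,
        "X_test": X_test, "y_test": y_test,
    }
    for name, data in sets.items():
        if data is not None:
            np.save(out_dir / f"{name}.npy", np.asarray(data))


with tempfile.TemporaryDirectory() as tmp_dir:
    save_sets_sketch(X_train=np.ones((3, 2)), y_train=np.arange(3), path=tmp_dir)
    saved_files = sorted(p.name for p in Path(tmp_dir).iterdir())
    reloaded = np.load(Path(tmp_dir) / "X_train.npy")

print(saved_files)  # ['X_train.npy', 'y_train.npy']
```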

### 5. Get Baseline Model

**[5.1]** Import the `DummyRegressor` class from `sklearn.dummy`

In [None]:
# Placeholder for student's code (Python code)

In [24]:
# Solution:
from sklearn.dummy import DummyRegressor

**[5.2]** Instantiate the `DummyRegressor` class into a variable called `base_reg`

In [None]:
# Placeholder for student's code (Python code)

In [25]:
# Solution:
base_reg = DummyRegressor(strategy='mean')

**[5.3]** Fit `base_reg` on the training set, then make predictions using `predict()` and save the results in a variable called `y_base`

In [None]:
# Placeholder for student's code (Python code)

In [26]:
# Solution:
base_reg.fit(X_train, y_train)
y_base = base_reg.predict(X_train)
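
A baseline built with `DummyRegressor(strategy='mean')` ignores the features and always predicts the mean of the training target. The sketch below shows this on four made-up target values; the variable names mirror the ones used in this lab, but the data is illustrative only.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# The features are ignored by the dummy model, so zeros are enough here
X_train = np.zeros((4, 2))
y_train = np.array([10.0, 20.0, 30.0, 40.0])

base_reg = DummyRegressor(strategy="mean")
base_reg.fit(X_train, y_train)
y_base = base_reg.predict(X_train)

print(y_base)  # every prediction equals y_train.mean(), i.e. 25.0
```

Any real model should score better than this baseline, otherwise it has not learned anything useful from the features.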

**[5.6]** Import the `print_regressor_scores()` function from `my_krml_149874.models.performance` and then display the RMSE and MAE scores of this baseline model

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution:
from my_krml_149874.models.performance import print_regressor_scores

print_regressor_scores(y_preds=y_base, y_actuals=y_train, set_name='Training')
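
The function `print_regressor_scores` comes from your custom package. As a rough idea of the metrics involved, the sketch below defines a hypothetical stand-in, `print_regressor_scores_sketch`, that computes RMSE and MAE with scikit-learn; the printed format and the returned values are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error


def print_regressor_scores_sketch(y_preds, y_actuals, set_name=None):
    """Hypothetical stand-in: print and return the RMSE and MAE of the predictions."""
    rmse = np.sqrt(mean_squared_error(y_actuals, y_preds))  # root of the mean squared error
    mae = mean_absolute_error(y_actuals, y_preds)           # mean of the absolute errors
    print(f"RMSE {set_name}: {rmse}")
    print(f"MAE {set_name}: {mae}")
    return rmse, mae


# Illustrative targets and a constant "mean" prediction of 25.0
y_actuals = np.array([10.0, 20.0, 30.0, 40.0])
y_base = np.full(4, 25.0)
rmse, mae = print_regressor_scores_sketch(y_base, y_actuals, set_name="Training")
```

Here the errors are -15, -5, 5 and 15, so the MAE is 10 and the RMSE is the square root of 125, roughly 11.18. RMSE penalises large errors more heavily than MAE.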

### 6. Push Changes

**[6.1]** Add your changes to git staging area

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
! git add .

**[6.2]** Create a snapshot (commit) of your repository and add a description

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
! git commit -m "prepare data and baseline"

**[6.3]** Push your snapshot to GitHub

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
! git push

**[6.4]** Stop Jupyter Lab

In [None]:
# Solution:
# Press Ctrl+C in the terminal where Jupyter Lab is running