# **Lab: Time-Series**



## Exercise 2: Baseline

In this exercise, we will work on a time-series dataset containing observations related to bike sharing services:
https://code.datasciencedojo.com/datasciencedojo/datasets/raw/master/Bike%20Sharing/day.csv

The objective is to train a model that can accurately forecast the volume of bikes shared.

The steps are:
1.   Setup Repository
2.   Load and Explore Dataset
3.   Prepare Data
4.   Split Dataset
5.   Baseline Model
6.   Push Changes


### 1. Setup Repository

**[1.1]** Go to a folder of your choice on your computer (where you store projects)

In [None]:
# Placeholder for student's code (command line)

In [None]:
#Solution:
cd ~/Projects/adv_mla_2024

**[1.2]** Copy the cookiecutter data science template

In [None]:
# Placeholder for student's code (command line)

In [None]:
#Solution:
cookiecutter -c v1 https://github.com/drivendata/cookiecutter-data-science
# OR
#ccds

Follow the prompts (name both the project and the repo `adv_mla_lab_2`)

**[1.3]** Go inside the created folder

In [None]:
# Placeholder for student's code (command line)

In [None]:
#Solution:
cd adv_mla_lab_2

**[1.4]** Initialise the repo


In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
git init

**[1.5]** Log in to GitHub with your account (https://github.com/) and create a public repo with the name `adv_mla_lab_2`

**[1.6]** Link your local repo `adv_mla_lab_2` to GitHub (replace `<username>` in the URL with your username)

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
git remote add origin git@github.com:<username>/adv_mla_lab_2

**[1.7]** Add your changes to the git staging area, commit them and push them to GitHub

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
git add .
git commit -m "init"
# Rename the current branch to main in case git created it as master
git branch -M main
git push -u origin main

**[1.8]** Set the Python version to 3.11.4 with pyenv

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution
pyenv local 3.11.4

**[1.9]** Initialise a poetry project with the Python constraint `~3.11` and no dependencies installed

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution
poetry init

**[1.10]** Install the following packages with poetry:
*   pandas==2.2.2
*   scikit-learn==1.5.1
*   jupyterlab==4.2.3
*   prophet==1.1.5


In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution
poetry add jupyterlab==4.2.3 pandas==2.2.2 scikit-learn==1.5.1 prophet==1.1.5

**[1.11]** Download the dataset into the sub-folder data/raw

In [None]:
# Placeholder for student's code (command line)

In [None]:
#Solution:
# Run from the project root (adv_mla_lab_2) so the relative path resolves
wget -P data/raw https://code.datasciencedojo.com/datasciencedojo/datasets/raw/master/Bike%20Sharing/day.csv

**[1.12]** Launch Jupyter Lab from your virtual environment

In [None]:
# Placeholder for student's code (command line)

In [None]:
#Solution:
poetry run jupyter lab

**[1.13]** Navigate to the folder `notebooks` and create a new Jupyter notebook called `1_data_prep.ipynb`

### 2.   Load and Explore Dataset

**[2.1]** Launch magic commands to automatically reload modules

In [None]:
# Placeholder for student's code (Python code)

In [1]:
# Solution
%load_ext autoreload
%autoreload 2

**[2.2]** Install your custom package with pip

In [2]:
# Placeholder for student's code (Python code)

In [1]:
# Solution
! pip install -i https://test.pypi.org/simple/ my-krml-studentid==0.1.8

Looking in indexes: https://test.pypi.org/simple/
Collecting my-krml-studentid==0.1.8
  Downloading https://test-files.pythonhosted.org/packages/dc/b6/5715ef8053c27ed5992a71533d0a22ce92dc5ec5f6902b91b9c214acf2c0/my_krml_studentid-0.1.8-py3-none-any.whl.metadata (1.3 kB)
Downloading https://test-files.pythonhosted.org/packages/dc/b6/5715ef8053c27ed5992a71533d0a22ce92dc5ec5f6902b91b9c214acf2c0/my_krml_studentid-0.1.8-py3-none-any.whl (5.4 kB)
Installing collected packages: my-krml-studentid
  Attempting uninstall: my-krml-studentid
    Found existing installation: my_krml_studentid 0.1.7
    Uninstalling my_krml_studentid-0.1.7:
      Successfully uninstalled my_krml_studentid-0.1.7
Successfully installed my-krml-studentid-0.1.8


**[2.3]** Import the pandas and numpy packages

In [2]:
# Placeholder for student's code (Python code)

In [3]:
# Solution
import pandas as pd
import numpy as np

**[2.4]** Load the dataset into a dataframe called `df`

In [4]:
# Placeholder for student's code (Python code)

In [5]:
#Solution:
df = pd.read_csv('../data/raw/day.csv')

**[2.5]** Display the first 5 rows of df

In [6]:
# Placeholder for student's code (Python code)

In [7]:
# Solution
df.head()

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,1,2011-01-01,1,0,1,0,6,0,2,0.344167,0.363625,0.805833,0.160446,331,654,985
1,2,2011-01-02,1,0,1,0,0,0,2,0.363478,0.353739,0.696087,0.248539,131,670,801
2,3,2011-01-03,1,0,1,0,1,1,1,0.196364,0.189405,0.437273,0.248309,120,1229,1349
3,4,2011-01-04,1,0,1,0,2,1,1,0.2,0.212122,0.590435,0.160296,108,1454,1562
4,5,2011-01-05,1,0,1,0,3,1,1,0.226957,0.22927,0.436957,0.1869,82,1518,1600


**[2.6]** Display the dimensions (shape) of df

In [8]:
# Placeholder for student's code (Python code)

In [9]:
# Solution
df.shape

(731, 16)

**[2.7]** Display the summary (info) of df

In [10]:
# Placeholder for student's code (Python code)

In [11]:
# Solution
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   instant     731 non-null    int64  
 1   dteday      731 non-null    object 
 2   season      731 non-null    int64  
 3   yr          731 non-null    int64  
 4   mnth        731 non-null    int64  
 5   holiday     731 non-null    int64  
 6   weekday     731 non-null    int64  
 7   workingday  731 non-null    int64  
 8   weathersit  731 non-null    int64  
 9   temp        731 non-null    float64
 10  atemp       731 non-null    float64
 11  hum         731 non-null    float64
 12  windspeed   731 non-null    float64
 13  casual      731 non-null    int64  
 14  registered  731 non-null    int64  
 15  cnt         731 non-null    int64  
dtypes: float64(4), int64(11), object(1)
memory usage: 91.5+ KB


**[2.8]** Display the descriptive statistics of df


In [12]:
# Placeholder for student's code (Python code)

In [13]:
# Solution
df.describe(include='all')

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
count,731.0,731,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0
unique,,731,,,,,,,,,,,,,,
top,,2011-01-01,,,,,,,,,,,,,,
freq,,1,,,,,,,,,,,,,,
mean,366.0,,2.49658,0.500684,6.519836,0.028728,2.997264,0.683995,1.395349,0.495385,0.474354,0.627894,0.190486,848.176471,3656.172367,4504.348837
std,211.165812,,1.110807,0.500342,3.451913,0.167155,2.004787,0.465233,0.544894,0.183051,0.162961,0.142429,0.077498,686.622488,1560.256377,1937.211452
min,1.0,,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.05913,0.07907,0.0,0.022392,2.0,20.0,22.0
25%,183.5,,2.0,0.0,4.0,0.0,1.0,0.0,1.0,0.337083,0.337842,0.52,0.13495,315.5,2497.0,3152.0
50%,366.0,,3.0,1.0,7.0,0.0,3.0,1.0,1.0,0.498333,0.486733,0.626667,0.180975,713.0,3662.0,4548.0
75%,548.5,,3.0,1.0,10.0,0.0,5.0,1.0,2.0,0.655417,0.608602,0.730209,0.233214,1096.0,4776.5,5956.0


### 3. Prepare Data

**[3.1]** Create a copy of df and save it into a variable called df_cleaned

In [14]:
# Placeholder for student's code (Python code)

In [15]:
# Solution
df_cleaned = df.copy()

**[3.2]** Import your custom function `convert_to_date` from `my_krml_studentid.features.dates`

In [16]:
# Placeholder for student's code (Python code)

In [17]:
# Solution
from my_krml_studentid.features.dates import convert_to_date

**[3.3]** Convert the column `dteday` with your function `convert_to_date`

In [18]:
# Placeholder for student's code (Python code)

In [19]:
# Solution
df_cleaned = convert_to_date(df_cleaned, ['dteday'])

In [23]:
print(df['dteday'],df_cleaned['dteday'])

0      2011-01-01
1      2011-01-02
2      2011-01-03
3      2011-01-04
4      2011-01-05
          ...    
726    2012-12-27
727    2012-12-28
728    2012-12-29
729    2012-12-30
730    2012-12-31
Name: dteday, Length: 731, dtype: object 0     2011-01-01
1     2011-01-02
2     2011-01-03
3     2011-01-04
4     2011-01-05
         ...    
726   2012-12-27
727   2012-12-28
728   2012-12-29
729   2012-12-30
730   2012-12-31
Name: dteday, Length: 731, dtype: datetime64[ns]
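
If you have not written `convert_to_date` yet, the sketch below shows one way it could look. It is a hypothetical implementation: it assumes the function takes a dataframe and a list of column names and returns a converted copy, which matches how it is called above, but the function in your own package may differ.

```python
import pandas as pd


def convert_to_date(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    """Return a copy of df with the listed columns parsed as datetimes.

    Hypothetical sketch: the real function in your package may differ.
    """
    df = df.copy()  # leave the caller's dataframe untouched
    for col in cols:
        df[col] = pd.to_datetime(df[col])
    return df


# Small made-up frame that mirrors the string format of `dteday`
sample = pd.DataFrame({"dteday": ["2011-01-01", "2011-01-02"], "cnt": [985, 801]})
converted = convert_to_date(sample, ["dteday"])
print(converted.dtypes)
```

Working on a copy keeps `df` unchanged, which is why the printout above can still show the original text column next to the converted one.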


**[3.4]** Create a new dataframe `prophet_df` that contains only the columns `dteday` and `cnt` from `df_cleaned`

In [24]:
# Placeholder for student's code (Python code)

In [25]:
# Solution
# Copy the selection so that later changes do not touch df_cleaned
prophet_df = df_cleaned[['dteday', 'cnt']].copy()

**[3.5]** Rename the columns of `prophet_df` to `ds` and `y`


In [26]:
# Placeholder for student's code (Python code)

In [27]:
# Solution
prophet_df.columns = ['ds', 'y']

**[3.6]** Save the dataframe in the `data/interim/` folder

In [28]:
# Placeholder for student's code (Python code)

In [29]:
# Solution
prophet_df.to_csv('../data/interim/day_prophet.csv', index=False)
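
A CSV file stores dates as plain text, so the `datetime64` type of `ds` is not kept when the file is loaded again later. The sketch below illustrates this round trip on a made-up two-row frame, using an in-memory buffer instead of the real file; the variable names are illustrative only.

```python
import io

import pandas as pd

# Made-up two-row frame in the ds/y layout
frame = pd.DataFrame({"ds": pd.to_datetime(["2011-01-01", "2011-01-02"]), "y": [985, 801]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime type that plain CSV text does not keep
reloaded = pd.read_csv(buffer, parse_dates=["ds"])
print(reloaded.dtypes)
```

Passing `parse_dates=['ds']` when reading `day_prophet.csv` should have the same effect.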

### 4. Split Dataset

**[4.1]** Import your new function `split_sets_by_time` and split the data into several sets

In [30]:
# Placeholder for student's code (Python code)

In [31]:
# Solution
from my_krml_studentid.data.sets import split_sets_by_time

X_train, y_train, X_val, y_val, X_test, y_test = split_sets_by_time(prophet_df, 'y', test_ratio=0.2)
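
For reference, a time-based split keeps the rows in chronological order: the oldest rows go to training and the most recent ones to testing. The sketch below is a hypothetical version of `split_sets_by_time`. It assumes the dataframe is already sorted by date and that the validation set has the same size as the testing set; your own function may make different choices.

```python
import pandas as pd


def split_sets_by_time(df: pd.DataFrame, target_col: str, test_ratio: float = 0.2):
    """Chronological split: oldest rows for training, newest for testing.

    Hypothetical sketch; assumes df is sorted by time and that the
    validation set is as large as the testing set.
    """
    features = df.drop(columns=[target_col])
    target = df[target_col]

    n_holdout = int(len(df) * test_ratio)   # rows in each of validation and testing
    train_end = len(df) - 2 * n_holdout
    val_end = len(df) - n_holdout

    X_train, y_train = features.iloc[:train_end], target.iloc[:train_end]
    X_val, y_val = features.iloc[train_end:val_end], target.iloc[train_end:val_end]
    X_test, y_test = features.iloc[val_end:], target.iloc[val_end:]
    return X_train, y_train, X_val, y_val, X_test, y_test


# Ten made-up daily observations
toy = pd.DataFrame({"ds": pd.date_range("2011-01-01", periods=10), "y": range(10)})
toy_sets = split_sets_by_time(toy, "y", test_ratio=0.2)
print([len(s) for s in toy_sets])
```

No shuffling takes place, so every validation date comes after every training date and every testing date comes after every validation date.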

**[4.2]** Import the `save_sets()` function from `my_krml_studentid/data/sets.py`

In [None]:
# Placeholder for student's code (Python code)

In [32]:
# Solution
from my_krml_studentid.data.sets import save_sets

**[4.3]** Save the sets into the folder `data/processed`

In [33]:
# Placeholder for student's code (Python code)

In [34]:
# Solution
save_sets(X_train, y_train, X_val, y_val, X_test, y_test, path='../data/processed/')
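
The sketch below shows one possible shape for `save_sets`. It is hypothetical: it assumes each set is written as a CSV file named after its variable, whereas your own function may use another file format. The example writes into a temporary folder so that it does not touch `data/processed`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_sets(X_train=None, y_train=None, X_val=None, y_val=None,
              X_test=None, y_test=None, path="../data/processed/"):
    """Write each provided set to `<path>/<name>.csv` (hypothetical sketch)."""
    out_dir = Path(path)
    out_dir.mkdir(parents=True, exist_ok=True)
    sets = {"X_train": X_train, "y_train": y_train, "X_val": X_val,
            "y_val": y_val, "X_test": X_test, "y_test": y_test}
    for name, data in sets.items():
        if data is not None:  # skip sets that were not passed in
            data.to_csv(out_dir / f"{name}.csv", index=False)


# Write two made-up sets to a temporary folder and list what was created
with tempfile.TemporaryDirectory() as tmp:
    X = pd.DataFrame({"ds": pd.date_range("2011-01-01", periods=3)})
    y = pd.Series([1, 2, 3], name="y")
    save_sets(X_train=X, y_train=y, path=tmp)
    saved = sorted(p.name for p in Path(tmp).iterdir())
print(saved)
```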

### 5. Baseline Model

**[5.1]** Import the `NullRegressor` class from `my_krml_studentid.models.null`

In [None]:
# Placeholder for student's code (Python code)

In [35]:
# Solution:
from my_krml_studentid.models.null import NullRegressor

**[5.2]** Instantiate a `NullRegressor` and save it into a variable called `base_model`

In [36]:
# Placeholder for student's code (Python code)

In [37]:
# Solution:
base_model = NullRegressor()

**[5.3]** Make a prediction using `fit_predict()` and save the results in a variable called `y_base`

In [38]:
# Placeholder for student's code (Python code)

In [39]:
# Solution:
y_base = base_model.fit_predict(y_train)
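
A null model ignores the features and always returns the same value. The sketch below is a hypothetical `NullRegressor` that predicts the mean of the target it was fitted on; the mean is an assumption here, and the class in your package may use a different constant or return a different array shape.

```python
import numpy as np


class NullRegressor:
    """Baseline that always predicts the mean of the target seen during fit.

    Hypothetical sketch; the class in your package may be implemented differently.
    """

    def fit(self, y):
        self.pred_value = float(np.mean(y))
        return self

    def predict(self, y):
        # Only the length of y is used: one constant prediction per row
        return np.full(len(y), self.pred_value)

    def fit_predict(self, y):
        return self.fit(y).predict(y)


toy_model = NullRegressor()
toy_train_preds = toy_model.fit_predict([10, 20, 30])  # mean of the toy target
toy_new_preds = toy_model.predict([0, 0])              # same constant for new rows
print(toy_train_preds, toy_new_preds)
```

Because the constant is learned only when fitting, calling `predict()` on another set reuses the training value, which is what the validation and testing steps below rely on.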

**[5.4]** Import the `print_regressor_scores()` function from `my_krml_studentid.models.performance` and then display the RMSE and MAE scores of this baseline model on the training set

In [None]:
# Placeholder for student's code (Python code)

In [40]:
# Solution
from my_krml_studentid.models.performance import print_regressor_scores

print_regressor_scores(y_preds=y_base, y_actuals=y_train, set_name='Training')

RMSE Training: 1324.7868420297616
MAE Training: 1132.7364532147508
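
To see how these two scores are computed, here is a hypothetical version of `print_regressor_scores` written with numpy only. RMSE is the square root of the mean squared error and MAE is the mean of the absolute errors. Returning the two values is an addition made for this sketch; your own function may only print them.

```python
import numpy as np


def print_regressor_scores(y_preds, y_actuals, set_name=None):
    """Print RMSE and MAE for a set of predictions (hypothetical sketch)."""
    y_preds = np.asarray(y_preds, dtype=float).ravel()
    y_actuals = np.asarray(y_actuals, dtype=float).ravel()
    errors = y_actuals - y_preds

    rmse = float(np.sqrt(np.mean(errors ** 2)))  # penalises large errors more
    mae = float(np.mean(np.abs(errors)))         # average size of the errors
    print(f"RMSE {set_name}: {rmse}")
    print(f"MAE {set_name}: {mae}")
    return rmse, mae


# Made-up values: the errors are -2, 0, 2 and 0
toy_rmse, toy_mae = print_regressor_scores(
    y_preds=[2, 2, 2, 2], y_actuals=[0, 2, 4, 2], set_name="Toy"
)
```

Both scores are expressed in the same unit as the target, here the number of bikes shared per day.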




**[5.5]** Display the RMSE and MAE scores of this baseline model on the validation set

In [41]:
# Placeholder for student's code (Python code)

In [42]:
print_regressor_scores(y_preds=base_model.predict(y_val), y_actuals=y_val, set_name='Validation')

RMSE Validation: 3110.722951659261
MAE Validation: 2951.6668330889006




**[5.6]** Display the RMSE and MAE scores of this baseline model on the testing set

In [43]:
# Placeholder for student's code (Python code)

In [44]:
print_regressor_scores(y_preds=base_model.predict(y_test), y_actuals=y_test, set_name='Testing')

RMSE Testing: 3090.593123684775
MAE Testing: 2817.6229288232903




### 6.   Push changes

**[6.1]** Add your changes to the git staging area

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
git add .

**[6.2]** Create the snapshot of your repository and add a description

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
git commit -m "baseline model"

**[6.3]** Push your snapshot to Github

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
git push

**[6.4]** Go to GitHub and merge the branch after reviewing the code and resolving any conflicts




**[6.5]** Check out the main branch

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
git checkout main

**[6.6]** Pull the latest updates

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
git pull