# 04 Full Pipelines with DVC

> ‚ÄúThe most powerful tool we have as developers is automation.‚Äù ~ Scott Hanselman

![dvc_cml](https://i.ytimg.com/vi/H1VBsK7XiKs/maxresdefault.jpg)

Image source: [dvc.org](https://dvc.org/)

## Synopsis

If you have ever worked with ML pipelines of any kind and wondered if it would be possible to combine the training and model saving parts of it with a CI/CD pipeline, üê≥, look no further as it possible and you will learn about it here. The aim of this tutorial is to show you a step-by-step process for creating reproducible pipelines to gather, clean, train, evaluate, and track machine learning models with DVC (data version control), CML (continuous machine learning), and other tools in the Data Science stack. By the end of this tutorial, you will have the tools to create your own reproducible pipelines and experiment with different tools and models to heart's content.

## Learning Outcomes

By the end of the tutorial you would have learned
1. a bit of git for keeping track of your code
2. a bit of dvc to track your different datasets and machine learning models
3. a bit of ML pipelines
4. a bit of ML modeling and some python tools for it

## Table of Contents

1. [Scenario](#Scenario)
2. [The Tools](#2.-The-Tools)
3. [Environment Set Up](#3.-Environment-Set-Up)
4. [The Data](#4.-The-Data)
    - [Getting the Data](#4.1-Getting-the-Data)
    - [Preparing the Data](#4.2-Preparing-the-Data)
5. [Training our First Model](#5.-Training-our-First-Model)
6. [Model Evaluation](#6.-Model-Evaluation)
7. [DVC Pipelines](#7.-DVC-Pipelines)
8. [CI/CD Pipelines with CML](#8.-CI/CD-Pipelines-with-CML)
9. [Experiments](#9.-Experiments)
10. [Merging our Changes - PRs](#10.-Merging-our-Changes---PRs)
11. [Summary](#11.-Summary)
12. [Blind Spots and Future Work](#12.-Blind-Spots-and-Future-Work)
13. [Resources](#13.-Resources)

**NB:** this tutorial was built in a Linux machine and some terminal commands might only be available in \*NIX-based systems.

## 1. Scenario

Imagine you work at a data analytics consultancy called **XYZ Analytics**, and that your boss comes to you with a new challenge for you, to create a machine learning model to predict the amount of bikes neeeded at any given hour of the day in Seoul, South Korea. You don't know anything about bicycle rental systems but you're excited to take on the challenge and accept it with pleasure.

![bikes_seoul](https://img.koreatimes.co.kr/upload/newsV2/images/202103/3e9b5801c43048eca31b3309176c8da9.jpg)

The challenge was presented to your boss by the South Korean government, and what they are hoping to get later on is an in-house analytical product that anyone can use to figure out the amount of rental bicycles needed at any given time and at different locations in the city of Seoul. You will tackle the predictive modeling part while the rest of team will work on the application and the geospatial part of the task.

Lastly, XYZ Analytics has been improving their data science capabilities and would like for every project to use data and model version control tools, which means you will be using dvc and other cool tools for the first time for this task. Let's go over the tooling in the next section.

## 2. The Tools

Here are some of the tools that we will be using.

- [DVC](https://dvc.org/) - "Data Version Control, or DVC, is a data and ML experiment management tool that takes advantage of the existing engineering toolset that you're already familiar with (Git, CI/CD, etc.)."
- [NumPy](https://numpy.org/) - "It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more."
- [pandas](https://pandas.pydata.org/) - "is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language."
- [scikit-learn](https://scikit-learn.org/stable/index.html) - "is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities."
- [XGBoost](https://xgboost.readthedocs.io/en/latest/index.html) - "is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides a parallel tree boosting (also known as GBDT, GBM) that solve many data science problems in a fast and accurate way."
- [LightGBM](https://lightgbm.readthedocs.io/en/latest/index.html) - "is a gradient boosting framework that uses tree based learning algorithms. It is designed to be distributed and efficient with the following advantages: Faster training speed and higher efficiency, lower memory usage, better accuracy, support of parallel, distributed, and GPU learning, and capable of handling large-scale data."
- [CatBoost](https://catboost.ai/en/docs/) - "CatBoost is a machine learning algorithm that uses gradient boosting on decision trees. It is available as an open source library."
- [Git](https://git-scm.com/) - "Git is a free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency."

Let's now set up our development environment and get started with our project.

**NB:** the definitions above have been taken directly from the tools' respective websites.

## 3. Environment Set Up

Usually we want to work with a directory with a somewhat similar structure to the one below but for the purpose of this tutorial, the set up will be slightly different.

```bash
.
‚îú‚îÄ‚îÄ data
‚îÇ¬†¬† ‚îú‚îÄ‚îÄ processed
‚îÇ¬†¬† ‚îÇ¬†¬† ‚îú‚îÄ‚îÄ test
‚îÇ¬†¬† ‚îÇ¬†¬† ‚îî‚îÄ‚îÄ train
‚îÇ¬†¬† ‚îî‚îÄ‚îÄ raw
‚îú‚îÄ‚îÄ metrics
‚îú‚îÄ‚îÄ models
‚îú‚îÄ‚îÄ notebooks
‚îú‚îÄ‚îÄ src
‚îî‚îÄ‚îÄ README.md
```


Let's now initialize our git and dvc repository with the following commands.

```bash
git init

dvc init
```

You should see the following output.

![gitdvc](../images/git_dvc.png)

Now that we have everything we need, we can open up our IDE using the `jupyter lab` command in the terminal. (You should see the following output minus the README.md file.)

![jlab](../images/jupyterlab.png)

After you open Jupyter Lab, create a notebook in the notebooks directory and call it `exploration.ipynb`. The rest of the tutorial can, and will be, done through the jupyter notebook we just created. Let's dive in.

**NB:** You can get `conda` through the miniconda distribution [here](https://docs.conda.io/en/latest/miniconda.html).

In [None]:
!rm

In [None]:
!git init

In [13]:
!dvc init

Initialized DVC repository.

You can now commit the changes to git.

[31m+---------------------------------------------------------------------+
[0m[31m|[0m                                                                     [31m|[0m
[31m|[0m        DVC has enabled anonymous aggregate usage analytics.         [31m|[0m
[31m|[0m     Read the analytics documentation (and how to opt-out) here:     [31m|[0m
[31m|[0m             <[36mhttps://dvc.org/doc/user-guide/analytics[39m>              [31m|[0m
[31m|[0m                                                                     [31m|[0m
[31m+---------------------------------------------------------------------+
[0m
[33mWhat's next?[39m
[33m------------[39m
- Check out the documentation: <[36mhttps://dvc.org/doc[39m>
- Get help and share ideas: <[36mhttps://dvc.org/chat[39m>
- Star us on GitHub: <[36mhttps://github.com/iterative/dvc[39m>
[0m

## 4. The Data

Following the described scenario above, the data was donated to the UCI ML Repository on 2020-03-01 for a regression task. It contains information regarding the amount of bikes available per hour between 2017 and 2018.

![bikes_uci](../images/uci_bikes.png)

Here are the variables found in the dataset.

- `Date` - year-month-day
- `Rented Bike count` - Count of bikes rented at each hour
- `Hour` - Hour of the day
- `Temperature`-Temperature in Celsius
- `Humidity` - %
- `Windspeed` - m/s
- `Visibility` - 10m
- `Dew point temperature` - Celsius
- `Solar radiation` - MJ/m2
- `Rainfall` - mm
- `Snowfall` - cm
- `Seasons` - Winter, Spring, Summer, and Autumn
- `Holiday` - Holiday and No holiday
- `Functional Day` - NoFunc (Non Functional Hours), Fun (Functional hours)

### 4.1 Getting the Data

We will use the `urllib.request` and the `os` libraries to download the data and set up our desired path for it, respectively.

In [1]:
import urllib.request, os

In [2]:
!pwd

/home/ramonperez/Tresors/datascience/tutorials/pycon_apac21/notebooks


In [3]:
os.chdir("..")

In [4]:
!pwd

/home/ramonperez/Tresors/datascience/tutorials/pycon_apac21


The dataset can be downloaded from the URL below and we will give it the filename, `SeoulBikeData.csv`.

In [5]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00560/SeoulBikeData.csv'
path_and_filename = os.path.join('data', '03_part', 'raw', 'SeoulBikeData.csv')

In [6]:
urllib.request.urlretrieve(url, path_and_filename)

('data/03_part/raw/SeoulBikeData.csv',
 <http.client.HTTPMessage at 0x7f94306b5280>)

Because we want to be able to create a pipeline later on, we will export the few lines of code above as a python script called `get_data.py` for later use. We will put every python script we want in our pipeline inside the `src` directory. Get used to this pattern. :)

Also, it is good practice to make sure the directories we are using always exist, so we will add an additional if-else statement to search and/or create it if it does not exist.

The command `%%writefile` below is a magic function and these are special function of ipython interpreter. The one we are using allows us to write anything in that cell to a file. Others like the `%%bash`, as we will soon see, make the entire cell a bash executable cell.

In [7]:
%%writefile src/full_pipe/get_data.py

import urllib.request, os

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00560/SeoulBikeData.csv'
path = os.path.join('data', '03_part', 'raw')
filename = 'SeoulBikeData.csv'

if not os.path.exists(path): os.makedirs(path)
        
urllib.request.urlretrieve(url, os.path.join(path, filename))

Overwriting src/full_pipe/get_data.py


Now that we have our dataset, let's go ahead and add it to our remote storage and start keeping track of the changes that we make to it with dvc. We will use s3 as our primary storage tool for the tutorial but feel free to use the option that best suits your needs and experience from the [dvc website](https://dvc.org/doc/command-reference/remote/add). Here are the step to do it with AWS.

1. Log into your AWS account
2. Navigate to **Identity and Access Management (IAM)** > **Access Management** > **Users**
3. Click on **Add users**
4. Add a name under **Set user details**, e.g. `bikes`
5. In the **Select AWS access type**, check the box [x] next to **Access key - Programmatic access**
6. In the **Set permissions** section, click on **Attach existing policies directly** and search and select **AmazonS3FullAccess**
7. Accept the rest of the defaults, create your user and download the csv file with your **Access key ID** and your **Secret access key**
8. Navigate to the **S3 Management Console** and create a new bucket with the default settings, mine is **bikesdata**
9. Navigate to your bucket's **Permissions** tab and in the **Bucket policy** section click on **Edit** and the click on **Policy generator**
    - In the **Select Type of Policy** select **S3 Bucket Policy**
    - In the **Principal** box add your user's arn code, e.g. `arn:aws:iam::123456789135:user/bikes`
    - In the **Actions** box, select ListBucket
    - In the **Amazon Resource Name** add your bucket's arn, e.g. `arn:aws:s3:::bikesdata`
    - Click on **Add Statement**
    - Click on **Generate Policy** and then copy **Policy JSON Document**
![bucketpolicy](../images/policy_json.png)
    - Add the policy to your bucket and save the changes
   
Our Bucket without the data should look as follows.
![empty_bucket](../images/empty_bucket.png)
    
Now let's add our bucket to our dvc repo.

In [8]:
!dvc remote add -d bikestorage gdrive://1hhbxsGSxrVRcJaxI8_cZ9wroJzo3DsVy

Setting 'my_drive' as a default remote.


In [9]:
!dvc add data/03_part/raw/SeoulBikeData.csv

[2K[32m‚†ô[0m Checking graph                                                   [32m‚†ã[0m Checking graph
Adding...                                                                       
![A
  0%|          |                                   0.00/? [00:00<?,        ?B/s][A
                                                                                [A
![A
  0%|          |.RLigE5i2k6uE6Sd9kjJKTZ.tmp           0/1 [00:00<?,       ?it/s][A
100% Adding...|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà|1/1 [00:00, 32.05file/s][A

To track the changes with git, run:

	git add data/03_part/raw/SeoulBikeData.csv.dvc data/03_part/raw/.gitignore
[0m

You will have to run `dvc push` on the terminal, and after that, 

In the first line above the `-d` stands for default and `bikestorage` is the name we have decided on for our bucket. The last piece is the url that directs dvc to our s3 bucket. You can find out more about the `remote` command through the official docummentation [here](https://dvc.org/doc/command-reference/remote).

In the next few lines, where we use the `modify` command to update our dvc local files, we give dvc the resources necessary to access our remote storage from our local computer. If you were to have the `awscli` installed and configured in your machine, you could have skipped the modify parts of the above cell. You can find more about the `modify` command through the official docummentation [here](https://dvc.org/doc/command-reference/remote/modify)

Make sure to always keep your credentials in a safe and secure place.

Now let's start tracking our data and make sure our remote storage is fully connected to our local storage.

In [11]:
%%bash

git add data/03_part/raw/SeoulBikeData.csv.dvc data/03_part/raw/.gitignore

![fullbucket](../images/file_up.png)

DVC uses special names to keep track of files, so there's no need to try and figure out what the above name means. Everything in our bucket can always be accessed through dvc.

Lastly, we'll commit our changes to our git repo after making sure we add the two files created by dvc, `data/raw/.gitignore` and `data/raw/SeoulBikeData.csv.dvc`. What dvc is doing is tracking some information about our dataset through git, hence the files `...Data.csv.dvc` and `.gitignore` with the actual data file, while the actual data goes to our remote storage.

In [12]:
!git status

On branch master
Your branch is up to date with 'origin/master'.

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
	[32mnew file:   data/03_part/raw/.gitignore[m
	[32mnew file:   data/03_part/raw/SeoulBikeData.csv.dvc[m

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	[31mmodified:   .dvc/config[m
	[31mmodified:   notebooks/03_dvc_pipes.ipynb[m

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	[31mimages/dvc_push.png[m
	[31mimages/file_up.png[m
	[31mimages/gdrive1.png[m



In [13]:
%%bash

git commit -m "Start Tracking Data"

[master 8359d71] Start Tracking Data
 2 files changed, 5 insertions(+)
 create mode 100644 data/03_part/raw/.gitignore
 create mode 100644 data/03_part/raw/SeoulBikeData.csv.dvc


Before we push any changes to GitHub, make sure you create your repository, as shown below, and then connect your local and remote repo with the commands below.
![reposetup](../images/repo_setup.png)

In [18]:
%%bash

# git remote add origin https://github.com/ramonpzg/bikes_ml.git
git push -u origin master

To github.com:ramonpzg/pycon-apac21-pipelines.git
   035163e..8359d71  master -> master


After pushing the changes to GitHub you should see the files 

### 4.2 Preparing the Data

The following steps should feel familiar to us, we want to separate the date variable into its components, create dummy variables for the categorical features, normalize the columns so that they only contain alphanumerical characters with underscores instead of spaces, and finally split the data into train and test sets.

In [15]:
import pandas as pd

data = pd.read_csv(os.path.join('data', '03_part', 'raw', 'SeoulBikeData.csv'), encoding='iso-8859-1')
data.head()

Unnamed: 0,Date,Rented Bike Count,Hour,Temperature(¬∞C),Humidity(%),Wind speed (m/s),Visibility (10m),Dew point temperature(¬∞C),Solar Radiation (MJ/m2),Rainfall(mm),Snowfall (cm),Seasons,Holiday,Functioning Day
0,01/12/2017,254,0,-5.2,37,2.2,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes
1,01/12/2017,204,1,-5.5,38,0.8,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes
2,01/12/2017,173,2,-6.0,39,1.0,2000,-17.7,0.0,0.0,0.0,Winter,No Holiday,Yes
3,01/12/2017,107,3,-6.2,40,0.9,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes
4,01/12/2017,78,4,-6.0,36,2.3,2000,-18.6,0.0,0.0,0.0,Winter,No Holiday,Yes


In [16]:
data['Date'] = pd.to_datetime(data['Date'])

In [17]:
data.sort_values(['Date', 'Hour'], inplace=True)
data["Year"] = data['Date'].dt.year
data["Month"] = data['Date'].dt.month
data["Week"] = data['Date'].dt.isocalendar().week
data["Day"] = data['Date'].dt.day
data["Dayofweek"] = data['Date'].dt.dayofweek
data["Dayofyear"] = data['Date'].dt.dayofyear
data["Is_month_end"] = data['Date'].dt.is_month_end
data["Is_month_start"] = data['Date'].dt.is_month_start
data["Is_quarter_end"] = data['Date'].dt.is_quarter_end
data["Is_quarter_start"] = data['Date'].dt.is_quarter_start
data["Is_year_end"] = data['Date'].dt.is_year_end
data["Is_year_start"] = data['Date'].dt.is_year_start
data.drop('Date', axis=1, inplace=True)

In [18]:
data = pd.get_dummies(data=data, columns=['Holiday', 'Seasons', 'Functioning Day'])

In [19]:
data.columns = ['rented_bike_count', 'hour', 'temperature', 'humidity', 'wind_speed', 'visibility', 
                'dew_point_temperature', 'solar_radiation', 'rainfall', 'snowfall', 'year', 
                'month', 'week', 'day', 'dayofweek', 'dayofyear', 'is_month_end', 'is_month_start',
                'is_quarter_end', 'is_quarter_start', 'is_year_end', 'is_year_start',
                'seasons_autumn', 'seasons_winter', 'seasons_summer', 'seasons_spring',
                'holiday_yes', 'holiday_no', 'functioning_day_no', 'functioning_day_yes']

In [20]:
split = 0.30
n_train = int(len(data) - len(data) * split)

train_path = os.path.join('data', '03_part', 'processed', 'train.csv')
test_path = os.path.join('data', '03_part', 'processed', 'test.csv')

data[:n_train].reset_index(drop=True).to_csv(train_path, index=False)
data[n_train:].reset_index(drop=True).to_csv(test_path, index=False)

Using the same commands as before, let's keep track of our new dataset with dvc and push the changes to our s3 bucket. In addition, we'll create a file called `prepared.py` for later use in our pipelines.

In [21]:
%%bash

dvc add data/03_part/processed/train.csv data/03_part/processed/test.csv


To track the changes with git, run:

	git add data/03_part/processed/test.csv.dvc data/03_part/processed/.gitignore data/03_part/processed/train.csv.dvc


In [22]:
%%writefile src/full_pipe/prepare.py

import pandas as pd
import os, sys

split = 0.30

raw_data_path = sys.argv[1]
train_path = os.path.join('data', '03_part', 'processed', 'train.csv')
test_path = os.path.join('data', '03_part', 'processed', 'test.csv')

# read the data
data = pd.read_csv(raw_data_path, encoding='iso-8859-1')

# add date vars
data['Date'] = pd.to_datetime(data['Date'])
data.sort_values(['Date', 'Hour'], inplace=True)
data["Year"] = data['Date'].dt.year
data["Month"] = data['Date'].dt.month
data["Week"] = data['Date'].dt.isocalendar().week
data["Day"] = data['Date'].dt.day
data["Dayofweek"] = data['Date'].dt.dayofweek
data["Dayofyear"] = data['Date'].dt.dayofyear
data["Is_month_end"] = data['Date'].dt.is_month_end
data["Is_month_start"] = data['Date'].dt.is_month_start
data["Is_quarter_end"] = data['Date'].dt.is_quarter_end
data["Is_quarter_start"] = data['Date'].dt.is_quarter_start
data["Is_year_end"] = data['Date'].dt.is_year_end
data["Is_year_start"] = data['Date'].dt.is_year_start
data.drop('Date', axis=1, inplace=True)

# add dummies
data = pd.get_dummies(data=data, columns=['Holiday', 'Seasons', 'Functioning Day'])

# Normalize columns
data.columns = ['rented_bike_count', 'hour', 'temperature', 'humidity', 'wind_speed', 'visibility', 
                'dew_point_temperature', 'solar_radiation', 'rainfall', 'snowfall', 'year', 
                'month', 'week', 'day', 'dayofweek', 'dayofyear', 'is_month_end', 'is_month_start',
                'is_quarter_end', 'is_quarter_start', 'is_year_end', 'is_year_start',
                'seasons_autumn', 'seasons_winter', 'seasons_summer', 'seasons_spring',
                'holiday_yes', 'holiday_no', 'functioning_day_no', 'functioning_day_yes']

n_train = int(len(data) - len(data) * split)
data[:n_train].reset_index(drop=True).to_csv(train_path, index=False)
data[n_train:].reset_index(drop=True).to_csv(test_path, index=False)

Overwriting src/full_pipe/prepare.py


Now we are ready to commit all of our changes.

In [23]:
%%bash

git add .
git commit -m "Preparation stage completed"
git push

[master c1ffbd5] Preparation stage completed
 8 files changed, 120 insertions(+), 38 deletions(-)
 create mode 100644 data/03_part/processed/.gitignore
 create mode 100644 data/03_part/processed/test.csv.dvc
 create mode 100644 data/03_part/processed/train.csv.dvc
 create mode 100644 images/dvc_push.png
 create mode 100644 images/file_up.png
 create mode 100644 images/gdrive1.png


To github.com:ramonpzg/pycon-apac21-pipelines.git
   8359d71..c1ffbd5  master -> master


## 5. Training our First Model

We want to create a model that predicts how many bikes will be needed at any given hour and on any given date in the future in the city of Seoul. Since the number of bicycles available for rent is a continuous number, this is a regression problem and what a better tool to use for regression problems that Random Forests.

![rfs](https://media.makeameme.org/created/its-easy-just-5bd65c.jpg)

What are Random Forests anyways?

> "Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time. For classification tasks, the output of the random forest is the class selected by most trees. For regression tasks, the mean or average prediction of the individual trees is returned. Random decision forests correct for decision trees' habit of overfitting to their training set." ~ [Wikipedia](https://en.wikipedia.org/wiki/Random_forest)

We want to start with a baseline model, evaluate it, and then fine tune either the implementation that we picked, in this case the scikit-learn's one or, as we'll see in a later section, an implementation from another framework.

After we train our sklearn model, we want to serialize (or pickle) that model, track it with dvc, and use it later with unseen data in the evaluation stage.

We'll import sklearn's `RandomForestRegressor` and python's `pickle` module, load our train set, and start our training with 100 estimators and seed. Feel free to change these however you'd like tho.

In [24]:
from sklearn.ensemble import RandomForestRegressor
import pickle

In [25]:
X_train = pd.read_csv(os.path.join('data', '03_part', 'processed', 'train.csv'))
y_train = X_train.pop('rented_bike_count')

In [26]:
seed = 42
n_est = 100

In [27]:
rf = RandomForestRegressor(n_estimators=n_est, random_state=seed)
rf.fit(X_train.values, y_train.values)

RandomForestRegressor(random_state=42)

In [28]:
rf.predict(X_train.values)[:10]

array([230.73, 214.51, 163.6 ,  96.76,  79.29,  90.66, 158.14, 480.86,
       791.04, 466.79])

In [29]:
with open('models/rf_model.pkl', "wb") as fd:
    pickle.dump(rf, fd)

Now that we have a trained model, let's save the steps we just took to a file called `train.py`, and let's also start tracking our model in the same way in which we tracked our data earlier with dvc. Lastly, we'll commit our work and push everything to GitHub.

In [30]:
%%writefile src/full_pipe/train.py

import os, pickle, sys
import numpy as np, pandas as pd
from sklearn.ensemble import RandomForestRegressor

input_data = sys.argv[1]
output = os.path.join('models', 'rf_model.pkl')
seed = 42
n_est = 100

X_train = pd.read_csv(input_data)
y_train = X_train.pop('rented_bike_count')

rf = RandomForestRegressor(n_estimators=n_est, random_state=seed)
rf.fit(X_train.values, y_train.values)

with open(output, "wb") as fd:
    pickle.dump(rf, fd)

Overwriting src/full_pipe/train.py


In [31]:
!dvc add models/rf_model.pkl

In [31]:
!dvc push


To track the changes with git, run:

	git add models/rf_model.pkl.dvc
3 files pushed


In [32]:
%%bash

git add .
git commit -m "Training 1 completed"
git push

[master 393cd55] Training 1 completed
 2 files changed, 36 insertions(+), 92 deletions(-)
 create mode 100644 models/rf_model.pkl.dvc


To github.com:ramonpzg/pycon-apac21-pipelines.git
   c1ffbd5..393cd55  master -> master


## 6. Model Evaluation

Model evaluation is a crucial part of training ML models, and it is important that we pick useful metrics that can indicate to us how well our model is perfoming, or expecting to perform, when presented with unseen data.

The metrics we'll use are Mean Absolute Error, Root Mean Squared Error, and $R^2$.

- Mean Absolute Error - "is a measure of errors between paired observations expressing the same phenomenon. Examples of Y versus X include comparisons of predicted versus observed, subsequent time versus initial time, and one technique of measurement versus an alternative technique of measurement." ~ [Wikipedia](https://en.wikipedia.org/wiki/Mean_absolute_error)
- Root Mean Squared Error - "is a frequently used measure of the differences between values (sample or population values) predicted by a model or an estimator and the values observed. The RMSD serves to aggregate the magnitudes of the errors in predictions for various data points into a single measure of predictive power. RMSD is a measure of accuracy, to compare forecasting errors of different models for a particular dataset and not between datasets, as it is scale-dependent." ~ [Wikipedia](https://en.wikipedia.org/wiki/Root-mean-square_deviation)
- $R^2$ - "In statistics, the coefficient of determination, also spelt co√´fficient, denoted $R^2$ or r2 and pronounced "R squared", is the proportion of the variation in the dependent variable that is predictable from the independent variable(s)." ~ [Wikipedia](https://en.wikipedia.org/wiki/Coefficient_of_determination)

We'll start by loading our model and our test set, predict the test set and compare such predictions with the ground truth. After we compute the metrics above, we want to save them to a JSON file for further use and comparison using dvc.

In [33]:
import sklearn.metrics as metrics, json, numpy as np

In [34]:
with open(os.path.join('models', 'rf_model.pkl'), "rb") as fd:
    model = pickle.load(fd)

In [35]:
X_test = pd.read_csv(os.path.join('data', '03_part', 'processed', 'test.csv'))
y_test = X_test.pop('rented_bike_count')

In [36]:
predictions = model.predict(X_test.values)
predictions[:10]

array([1003.81, 1035.99, 1030.2 , 1084.73, 1383.08, 1795.49, 2581.49,
       2256.15, 1960.51, 1900.05])

In [37]:
mae = metrics.mean_absolute_error(y_test.values, predictions)
rmse = np.sqrt(metrics.mean_squared_error(y_test.values, predictions))
r2_score = model.score(X_test.values, y_test.values)

In [38]:
print(f"Mean Absolute Error: {mae:.2f}")
print(f"Root Mean Square Error: {rmse:.2f}")
print(f"R^2: {r2_score:.3f}")

Mean Absolute Error: 191.92
Root Mean Square Error: 290.80
R^2: 0.788


In [39]:
with open(os.path.join('metrics', 'metrics.json'), "w") as fd:
    json.dump({"MAE": mae, "RMSE": rmse, "R^2":r2_score}, fd, indent=4)

We will save the steps above to a file called `evaluate.py` for futher use later, and we will add our metrics to git and GitHub rather than to our remote s3 bucket. The reason beign that dvc has a special function that allows us to compare the diff of metrics between those in a branch and those in master, and as you'll see soon, this is a very powerful feature of dvc that we certainly want to take advantage of.

In [40]:
%%writefile src/full_pipe/evaluate.py

import json, os, pickle, sys, pandas as pd, numpy as np
import sklearn.metrics as metrics

model_file = sys.argv[1]
test_file = os.path.join(sys.argv[2], "test.csv")
scores_file = os.path.join('metrics', 'metrics.json')

with open(model_file, "rb") as fd:
    model = pickle.load(fd)

X_test = pd.read_csv(test_file)
y_test = X_test.pop('rented_bike_count')

predictions = model.predict(X_test.values)

mae = metrics.mean_absolute_error(y_test.values, predictions)
rmse = np.sqrt(metrics.mean_squared_error(y_test.values, predictions))
r2_score = model.score(X_test.values, y_test.values)

with open(scores_file, "w") as fd:
    json.dump({"MAE": mae, "RMSE": rmse, "R^2":r2_score}, fd, indent=4)

Overwriting src/full_pipe/evaluate.py


In [41]:
%%bash

git add .
git commit -m "Evaluation completed"
git push

On branch master
Your branch is up to date with 'origin/master'.

nothing to commit, working tree clean


Everything up-to-date


## 7. DVC Pipelines

DVC pipelines is one of the best features offered by dvc. They allow us to create reproducible pipilines containing anything from getting the data to training and evaluating ML models.

There are several ways for creating pipelines with dvc and here we'll do so with `dvc run`. `dvc run` starts with the `-n` flag followed by the name we want to give to the step of the pipeline we want to create. Next, we add the `-d` flag to signal dependencies such as the python file we want to run as well as any arguments that such file takes. Next we have the `-o` flag which tells dvc the output expected from such step of the pipeline. For example, this stage would take the `train.csv` and `test.csv` files from the data preparation stage. Lastly, you need to pass the full python call without any flags.

After we run our dvc command, dvc creates 2 files, a `dvc.yaml` and a `dvc.lock` file. The former contains the stages dvc will follow for our pipeline, and the latter contains the metadata and other information regarding our pipeline. Once you have a look at the yaml file, you'll probably wonder if you can create such a file manually, the answer is yes. For the `dvc.lock` on the other hand, dvc will take care of that one through the command `dvc repro`, which runs whatever pipeline resides in your `dvc.yaml` file.

More information about both can be found in the official documentation site [here](https://dvc.org/doc/command-reference/run).

Before we start the stages of our pipeline, let's first remove the files we were already tracking with dvc. Not doing so will result in dvc giving us an error since the tracked files already exist.

In [42]:
%%bash

dvc remove data/03_part/raw/SeoulBikeData.csv.dvc \
           data/03_part/processed/train.csv.dvc \
           data/03_part/processed/test.csv.dvc \
           models/rf_model.pkl.dvc

In [43]:
%%bash

dvc run -n get_data \
-d src/full_pipe/get_data.py \
-o data/03_part/raw/SeoulBikeData.csv \
python src/full_pipe/get_data.py

Stage 'get_data' is cached - skipping run, checking out outputs
Creating 'dvc.yaml'
Adding stage 'get_data' in 'dvc.yaml'
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add data/03_part/raw/.gitignore dvc.lock dvc.yaml


In [44]:
%%bash

dvc run -n prepare \
-d src/full_pipe/prepare.py -d data/03_part/raw/SeoulBikeData.csv \
-o data/03_part/processed/train.csv -o data/03_part/processed/test.csv \
python src/full_pipe/prepare.py data/03_part/raw/SeoulBikeData.csv

Stage 'prepare' is cached - skipping run, checking out outputs
Adding stage 'prepare' in 'dvc.yaml'
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add data/03_part/processed/.gitignore dvc.lock dvc.yaml


In [45]:
%%bash

dvc run -n train \
-d src/full_pipe/train.py -d data/03_part/processed/train.csv \
-o models/rf_model.pkl \
python src/full_pipe/train.py data/03_part/processed/train.csv

Stage 'train' is cached - skipping run, checking out outputs
Adding stage 'train' in 'dvc.yaml'
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add models/.gitignore dvc.lock dvc.yaml


In [46]:
%%bash

dvc run -n evaluate \
-d src/full_pipe/evaluate.py -d models/rf_model.pkl -d data/03_part/processed \
-M metrics/metrics.json \
python src/full_pipe/evaluate.py models/rf_model.pkl data/03_part/processed

Stage 'evaluate' is cached - skipping run, checking out outputs
Adding stage 'evaluate' in 'dvc.yaml'
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.yaml dvc.lock


Notice that in the last part of our pipeline we have the flag `-M`. This flag tells dvc to treat the output of that particular stage as a metric so that we can later use `dvc diff` on it and compare the metrics in the master branch with those in another.

Using the `dvc status` will tells us whether there are changes in our pipeline and files or if everthing is up to date.

In [47]:
!dvc status

Data and pipelines are up to date.                                              
[0m

Another cool function of dvc is `dvc dag`, which will show us a graph with the steps in our pipeline.

In [48]:
!dvc dag

        +----------+      
        | get_data |      
        +----------+      
              *           
              *           
              *           
         +---------+      
         | prepare |      
         +---------+      
         **        **     
       **            *    
      *               **  
+-------+               * 
| train |             **  
+-------+            *    
         **        **     
           **    **       
             *  *         
        +----------+      
        | evaluate |      
        +----------+      [0m

In order to re-run our pipeline again with one command, `dvc repro`, and see it in action, let's lemove the `dvc.lock` and the data files, and run `dvc repro` once.

In [49]:
!dvc repro

Stage 'get_data' didn't change, skipping                              core[39m>
Stage 'prepare' didn't change, skipping
Stage 'train' didn't change, skipping
Stage 'evaluate' didn't change, skipping                                        
Data and pipelines are up to date.
[0m

In [50]:
!rm dvc.lock data/03_part/raw/SeoulBikeData.csv data/03_part/processed/train.csv data/03_part/processed/test.csv

In [51]:
!dvc repro

Stage 'get_data' is cached - skipping run, checking out outputs       core[39m>
Generating lock file 'dvc.lock'                                                 
Updating lock file 'dvc.lock'

Stage 'prepare' is cached - skipping run, checking out outputs
Updating lock file 'dvc.lock'                                                   

Stage 'train' is cached - skipping run, checking out outputs
Updating lock file 'dvc.lock'                                                   

Stage 'evaluate' is cached - skipping run, checking out outputs                 
Updating lock file 'dvc.lock'                                                   

To track the changes with git, run:

	git add dvc.lock
Use `dvc push` to send your updates to remote storage.
[0m

Now that we have learned about dvc pipelines and how to reproduce them, let's check the files that need to be committed and let's push them to GitHub.

In [52]:
!git status

On branch master
Your branch is up to date with 'origin/master'.

Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	[31mdeleted:    data/03_part/processed/test.csv.dvc[m
	[31mdeleted:    data/03_part/processed/train.csv.dvc[m
	[31mdeleted:    data/03_part/raw/SeoulBikeData.csv.dvc[m
	[31mdeleted:    models/rf_model.pkl.dvc[m
	[31mmodified:   notebooks/03_dvc_pipes.ipynb[m

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	[31mdvc.lock[m
	[31mdvc.yaml[m

no changes added to commit (use "git add" and/or "git commit -a")


In [53]:
%%bash

git add .
git commit -m "Pipeline Finished"
git push

[master 88c6c06] Pipeline Finished
 7 files changed, 146 insertions(+), 74 deletions(-)
 delete mode 100644 data/03_part/processed/test.csv.dvc
 delete mode 100644 data/03_part/processed/train.csv.dvc
 delete mode 100644 data/03_part/raw/SeoulBikeData.csv.dvc
 create mode 100644 dvc.lock
 create mode 100644 dvc.yaml
 delete mode 100644 models/rf_model.pkl.dvc


To github.com:ramonpzg/pycon-apac21-pipelines.git
   393cd55..88c6c06  master -> master


In [54]:
!git status

On branch master
Your branch is up to date with 'origin/master'.

nothing to commit, working tree clean


## 9. Experiments

We've been working inside our master branch using scikit-learn, and now we want to start experimenting with other tree-based frameworks like XGBoost, LightGBM, and CatBoost using different branches for each experiment. Let's do just that and start by checking out a new branch, adding XGBoost to our train file, and triggering a new run.

In [55]:
!git checkout -b "exp1-xgb"

Switched to a new branch 'exp1-xgb'


In [61]:
%%writefile src/full_pipe/train.py

import os, pickle, sys, pandas as pd
from xgboost import XGBRFRegressor

input_data = sys.argv[1]
output = os.path.join('models', 'rf_model.pkl')
seed, n_est = 42, 100

X_train = pd.read_csv(input_data)
y_train = X_train.pop('rented_bike_count')

rf = XGBRFRegressor(n_estimators=n_est, seed=seed)
rf.fit(X_train.values, y_train.values)

with open(output, "wb") as fd: pickle.dump(rf, fd)

Overwriting src/full_pipe/train.py


In [62]:
!dvc repro

Stage 'get_data' didn't change, skipping                              core[39m>
Stage 'prepare' didn't change, skipping
Running stage 'train':
> python src/full_pipe/train.py data/03_part/processed/train.csv
Traceback (most recent call last):
  File "/home/ramonperez/Tresors/datascience/tutorials/pycon_apac21/src/full_pipe/train.py", line 3, in <module>
    from xgboost import XGBRFRegressor
ModuleNotFoundError: No module named 'xgboost'
[31mERROR[39m: failed to reproduce 'dvc.yaml': failed to run: python src/full_pipe/train.py data/03_part/processed/train.csv, exited with 1
[0m

Note that in order for GitHub to now we have been working in a different branch, we need to use the `git push --set-upstream origin exp1-xgb` command. Otherwise, we'll get an error.

In [57]:
%%bash

git add .
git commit -m "Testing XGBoost"
git push --set-upstream origin exp1-xgb
git push

[exp1-xgb fb6dcf7] Testing XGBoost
 2 files changed, 48 insertions(+), 253 deletions(-)
 create mode 100644 src/train.py
Branch 'exp1-xgb' set up to track remote branch 'exp1-xgb' from 'origin'.


remote: 
remote: Create a pull request for 'exp1-xgb' on GitHub by visiting:        
remote:      https://github.com/ramonpzg/pycon-apac21-pipelines/pull/new/exp1-xgb        
remote: 
To github.com:ramonpzg/pycon-apac21-pipelines.git
 * [new branch]      exp1-xgb -> exp1-xgb
Everything up-to-date


In [59]:
!dvc metrics diff --show-md master

| Path   | Metric   | Old   | New   | Change   |                      core[39m>
|--------|----------|-------|-------|----------|

[0m

Following the same process as the one from the previous section, we can go to the **Actions** tab and have a look at our XGBoost run.
![actions](../images/pipe_xgb.png)

In addition, thanks to the `dvc metrics diff --show-md master` command in our CI/CD pipeline, we can now look at the difference in metrics between our current branch and the master one.
![actions](../images/xgb_metrics.png)

It seems that a plain baseline XGB model performs a bit worse than our sklearn one. Let's try LightGBM and see how it does. We'll move to a new branch, update the `train.py` file, and trigger the run on git push.

In [None]:
!git checkout -b "exp2-lgbm"

In [None]:
%%writefile src/train.py

import os, pickle, sys, pandas as pd
from lightgbm import LGBMRegressor

input_data = sys.argv[1]
output = os.path.join('models', 'rf_model.pkl')
seed, n_est = 42, 100

X_train = pd.read_csv(input_data)
y_train = X_train.pop('rented_bike_count')

rf = LGBMRegressor(n_estimators=n_est, random_state=seed)
rf.fit(X_train.values, y_train.values)

with open(output, "wb") as fd: pickle.dump(rf, fd)

In [None]:
%%bash

git add .
git commit -m "Testing LightGBM"
git push --set-upstream origin exp2-lgbm
git push

![actions](../images/lgbm_metrics.png)
The base implementation of LightGBM seems to have performed quite well. Let's try out CatBoost now.

In [None]:
!git checkout -b "exp3-cat"

In [None]:
%%writefile src/train.py

import os, pickle, sys, pandas as pd
from catboost import CatBoostRegressor

input_data = sys.argv[1]
output = os.path.join('models', 'rf_model.pkl')
seed, n_est = 42, 100

X_train = pd.read_csv(input_data)
y_train = X_train.pop('rented_bike_count')

rf = CatBoostRegressor(n_estimators=n_est, random_state=seed)
rf.fit(X_train.values, y_train.values)

with open(output, "wb") as fd: pickle.dump(rf, fd)

In [None]:
%%bash

git add .
git commit -m "Testing CatBoost"
git push --set-upstream origin exp3-cat
git push

![actions](../images/cat_metrics.png)

Surprisingly, CatBoost's MAE performed a bit worse than LightGBM but RMSE performed much better.

You can go to your Actions tab again and see a recap of all of your runs.
![exps](../images/exps.png)

We have a good candidate with CatBoost and we should merge this branch with master and start tunning our model.

## 10. Merging our Changes - PRs

![empspr](../images/compare_pr.png)
If you go back to the main page of our repo, you'll notice that GitHub has added a **Compare & pull request** option for each the three experiments. This is a nice shortcut to help us pick the one we liked best and add it to our main project's branch, master.

So what is a pull request anyways? "A PR provides a user-friendly web interface for discussing proposed changes before integrating them into the official project." ~ [Atlassian](https://www.atlassian.com/git/tutorials/making-a-pull-request)

Let's merge our experiment branch with our master branch.

1. Click on the **Compare & pull request** for exp3-cat.
2. Compare the changes.
![comp](../images/compare_pr.png)
3. Check out the report again.
![rep](../images/report_bottom.png)
4. Open a pull request with your details on why it should go to master.
![rep](../images/pr_mess.png)
5. Once reviewed, write a comment and merge the pull request.
![rep](../images/awesome_cat.png)
6. Lastly, we need to make sure our local env is up to date and once it is, we can switch to the master branch and work from there again. Run a `git pull` and a `dvc pull`.
![list_prs](../images/last_of_pr.png)

In [None]:
%%bash

git pull
dvc pull

## 11. Summary

As you have seen throughout the tutorial

DVC helps us track our data, models, and metrics, and it also allows us to create pipelines for getting, preparing, and modeling data. In contrast, CML allows us to continuously deliver ML models through its CI/CD configuration alongside dvc. CML makes it easier to experiment and deploy models in a production environment.

DVC and CML are making possible what git alone can't do for the machine learning community, and the do this by enhancing git in the parts where it's lacking. Both tools should be in every data scientist and ML Engineer's toolkit. Enough said!

![enough](https://media.giphy.com/media/mVJ5xyiYkC3Vm/giphy.gif)

## 12. Blind Spots and Future Work

**Blind Spots**
- We could have fine tuned our base model even further and make better comparisons with the other frameworks.
- We could have conducted more feature engineering.
- We could have selected the best features only based on feature importance.
- We could have done a bit more analysis of the data.
- We could have taken out the second dummy or our categorical variables. For example, there is no need to have Holiday and No Holiday as variables in our dataset.


**Future Work**
- If the data will be provided in the same formate we received it, then we need an easier transformation pipeline for the date, column names, and dummies.
- We could add the analytical tool, e.g. our dashboard, to the master branch and work with the models solely through branches.

## 13. Resources

Here are a few additional resources to dive deeped into some of the tools discussed above.
- [DVC Get Started](https://dvc.org/doc/start)
- [DVC Use Cases](https://dvc.org/doc/use-cases)
- [CML Get Started](https://cml.dev/doc/start)
- [CatBoost Tutorial](https://catboost.ai/en/docs/concepts/tutorials)
- [Git](https://realpython.com/python-git-github-intro)
- [Pull Requests](https://www.atlassian.com/git/tutorials/making-a-pull-request)