# 04 Full Pipelines with DVC

> “The most powerful tool we have as developers is automation.” ~ Scott Hanselman

![dvc_cml](https://blog.codecentric.de/files/2019/03/logo-owl-readme.png)

Image source: [dvc.org](https://dvc.org/)

## Synopsis

If you have ever built a machine learning pipeline and wondered whether you could combine your data, feature engineering, training, and model saving in a single place where you can experiment to your heart's content, look no further: it is possible with DVC, and you will learn how here. The aim of this tutorial is to show you a step-by-step process for creating reproducible pipelines to gather, clean, train, evaluate, and track machine learning models with DVC (Data Version Control) and other tools in the data science stack. By the end of this tutorial, you will have the tools to create your own reproducible pipelines and keep track not only of your code but also of your data.

## Learning Outcomes

By the end of the tutorial you will have learned:
1. a bit of git for keeping track of your code
2. a bit of dvc to track your different datasets and machine learning models
3. a bit of ML pipelines
4. a bit of ML modeling and some python tools for it

## Table of Contents

1. [Scenario](#1.-Scenario)
2. [The Tools](#2.-The-Tools)
3. [Project Structure](#3.-Project-Structure)
4. [The Data](#4.-The-Data)
    - [Getting the Data](#4.1-Getting-the-Data)
    - [Preparing the Data](#4.2-Preparing-the-Data)
5. [Training our First Model](#5.-Training-our-First-Model)
6. [Model Evaluation](#6.-Model-Evaluation)
7. [DVC Pipelines](#7.-DVC-Pipelines)
8. [Experiments](#8.-Experiments)
9. [Merging our Changes - PRs](#9.-Merging-our-Changes---PRs)
10. [Summary](#10.-Summary)
11. [Blind Spots and Future Work](#11.-Blind-Spots-and-Future-Work)
12. [Resources](#12.-Resources)

**NB:** this tutorial was built on a Linux machine and some terminal commands might only be available on \*NIX-based systems.

## 1. Scenario

![bikes_seoul](https://img.koreatimes.co.kr/upload/newsV2/images/202103/3e9b5801c43048eca31b3309176c8da9.jpg)

Imagine you work at a data analytics consultancy called **Beautiful Analytics**, and your boss comes to you with a new challenge: create a machine learning model to predict the number of bikes needed at any given hour of the day in Seoul, South Korea. You don't know anything about bicycle rental systems, but you're excited to take on the challenge and accept it with pleasure.

The challenge was presented to your boss by the South Korean government, and what they are hoping to get later on is an in-house analytical product that anyone can use to figure out the number of rental bicycles needed at any given time and at different locations in the city of Seoul. You will tackle the predictive modeling part while the rest of the team works on the application and the geospatial part of the task.

Lastly, Beautiful Analytics has been improving its data science capabilities and would like every project to use data and model version control tools. This means you will be using dvc and other cool tools for the first time on this task. Let's go over the tooling in the next section.

## 2. The Tools

Here are some of the tools that we will be using.

- [DVC](https://dvc.org/) - "Data Version Control, or DVC, is a data and ML experiment management tool that takes advantage of the existing engineering toolset that you're already familiar with (Git, CI/CD, etc.)."
- [NumPy](https://numpy.org/) - "It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more."
- [pandas](https://pandas.pydata.org/) - "is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language."
- [scikit-learn](https://scikit-learn.org/stable/index.html) - "is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities."
- [XGBoost](https://xgboost.readthedocs.io/en/latest/index.html) - "is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides a parallel tree boosting (also known as GBDT, GBM) that solve many data science problems in a fast and accurate way."
- [LightGBM](https://lightgbm.readthedocs.io/en/latest/index.html) - "is a gradient boosting framework that uses tree based learning algorithms. It is designed to be distributed and efficient with the following advantages: Faster training speed and higher efficiency, lower memory usage, better accuracy, support of parallel, distributed, and GPU learning, and capable of handling large-scale data."
- [CatBoost](https://catboost.ai/en/docs/) - "CatBoost is a machine learning algorithm that uses gradient boosting on decision trees. It is available as an open source library."
- [Git](https://git-scm.com/) - "Git is a free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency."

Let's now set up our development environment and get started with our project.

**NB:** the definitions above have been taken directly from the tools' respective websites.

## 3. Project Structure

Usually we want to work in a directory with a structure similar to the one below, but for the purpose of this tutorial the setup will be slightly different and everything will happen from within this notebook.

```bash
.
├── data
│   ├── processed
│   │   ├── test
│   │   └── train
│   └── raw
├── metrics
├── models
├── notebooks
├── src
└── README.md
```
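
The layout above can also be created from Python. The following is a minimal sketch: the helper name `make_project_dirs` and the use of a temporary directory for the demo are assumptions made for illustration and are not part of the original project.

```python
import tempfile
from pathlib import Path

# Folder names taken from the tree shown above.
PROJECT_DIRS = [
    "data/processed/test",
    "data/processed/train",
    "data/raw",
    "metrics",
    "models",
    "notebooks",
    "src",
]

def make_project_dirs(root):
    """Create the tutorial's folder layout under `root` and return the created paths."""
    root = Path(root)
    created = []
    for rel in PROJECT_DIRS:
        path = root / rel
        path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        created.append(path)
    (root / "README.md").touch(exist_ok=True)
    return created

# Demo in a throwaway directory so nothing in a real project is touched.
with tempfile.TemporaryDirectory() as tmp:
    for p in make_project_dirs(tmp):
        print(p.relative_to(tmp))
```

Calling `make_project_dirs(".")` instead would create the same folders in the current working directory.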


Let's now initialize our git and dvc repositories with the following commands.

```bash
git init

dvc init
```

You should see the following output for dvc.

![gitdvc](../images/git_dvc.png)

If you are in binder or a different environment, make sure you run the following cell first for a fresh start.

In [None]:
# !rm -r .git .dvc dvc.lock dvc.yaml metrics/* models/* .dvcignore

In [None]:
!git init

In [None]:
!dvc init

## 4. The Data

Following the scenario described above, a sample dataset was donated to the UCI ML Repository on 2020-03-01 for a regression task. It contains the number of bikes rented per hour between 2017 and 2018.

![bikes_uci](../images/uci_bikes.png)

Here are the variables found in the dataset.

- `Date` - year-month-day
- `Rented Bike count` - Count of bikes rented at each hour
- `Hour` - Hour of the day
- `Temperature` - Temperature in Celsius
- `Humidity` - %
- `Windspeed` - m/s
- `Visibility` - 10m
- `Dew point temperature` - Celsius
- `Solar radiation` - MJ/m2
- `Rainfall` - mm
- `Snowfall` - cm
- `Seasons` - Winter, Spring, Summer, and Autumn
- `Holiday` - Holiday and No holiday
- `Functioning Day` - NoFunc (non-functional hours), Fun (functional hours)

### 4.1 Getting the Data

We will use the `urllib.request` and the `os` libraries to download the data and set up our desired path for it, respectively.

In [None]:
import urllib.request, os

In [None]:
!pwd

In [None]:
os.chdir("..")

In [None]:
!pwd

The dataset can be downloaded from the URL below and we will give it the filename, `SeoulBikeData.csv`.

In [None]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00560/SeoulBikeData.csv'
path_and_filename = os.path.join('data', '03_part', 'raw', 'SeoulBikeData.csv')

In [None]:
urllib.request.urlretrieve(url, path_and_filename)

In [None]:
import pandas as pd

pd.read_csv(path_and_filename, encoding='iso-8859-1').head()

Because we want to be able to create a pipeline later on, we will export the few lines of code above as a python script called `get_data.py` for later use. We will put every python script we want in our pipeline inside the `src` directory. Get used to this pattern. :)

Also, it is good practice to make sure the directories we use always exist, so we will add an if statement that creates the directory when it is missing.
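
The "make sure the directory exists" idea can be wrapped in a small reusable function. This is only a sketch under assumptions: the name `download_if_missing` and the injectable `fetch` argument are hypothetical additions that make the function easy to try without network access, and they are not used elsewhere in this tutorial.

```python
import os
import urllib.request

def download_if_missing(url, dest, fetch=urllib.request.urlretrieve):
    """Download `url` to `dest` unless the file is already there.

    `fetch` is injectable so the function can be exercised offline.
    Returns True if a download happened, False if the file already existed.
    """
    folder = os.path.dirname(dest)
    if folder:
        os.makedirs(folder, exist_ok=True)  # create missing parent folders
    if os.path.exists(dest):
        return False
    fetch(url, dest)
    return True
```

With the variables defined earlier, a call could look like `download_if_missing(url, path_and_filename)`. Skipping the download when the file exists avoids fetching the same data on every run.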

The `%%writefile` command below is a cell magic, a special function provided by the IPython interpreter. It writes everything in the cell to a file. Others, like `%%bash`, as we will soon see, run the entire cell as a bash script.

In [None]:
%%writefile src/full_pipe/get_data.py

import os
import urllib.request

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00560/SeoulBikeData.csv'
path = os.path.join('data', '03_part', 'raw')
filename = 'SeoulBikeData.csv'

# create the target directory if it does not exist yet
if not os.path.exists(path):
    os.makedirs(path)

# download the raw dataset
urllib.request.urlretrieve(url, os.path.join(path, filename))

Now that we have our dataset, let's add it to our remote storage and start tracking the changes we make to it with dvc. We will use Google Drive as our remote storage for the tutorial, but feel free to pick the option that best suits your needs and experience from the [dvc website](https://dvc.org/doc/command-reference/remote/add). Here are the steps to follow.

1. Log into Google Drive with your Gmail account, or create an account first and then log in.
2. Create a folder for this tutorial and copy the alphanumeric string after the last `/` in its URL.
![tuto](../images/gdrive1.png)
3. Add your remote storage with `dvc remote add -d bikestorage gdrive://your_string`
4. You will be redirected to a page where you can allow `dvc` to access that folder. Select the boxes and confirm.
5. Add the data/models/files you want to track with `dvc add data/03_part/raw/SeoulBikeData.csv`
6. From the command line run `dvc push`, which will ask you for one last authentication step
    - Copy and paste the link provided into a browser tab
    - Copy the code provided and paste it into the terminal
    ![ipush](../images/dvc_push.png)

**You should be good to go**, so now let's add our remote storage to our dvc repo.

In [None]:
!dvc remote add -d bikestorage gdrive://1hhbxsGSxrVRcJaxI8_cZ9wroJzo3DsVy

In the command above, `-d` stands for default and `bikestorage` is the name we have chosen for our remote. The last piece is the URL that points dvc to our Google Drive folder. You can find out more about the `remote` command in the official documentation [here](https://dvc.org/doc/command-reference/remote).

Now let's start tracking our data and make sure our remote storage is fully connected to our local storage.

In [None]:
!dvc add data/03_part/raw/SeoulBikeData.csv

You will need to run `dvc push` from the terminal.

![fullbucket](../images/file_up.png)

DVC uses special names to keep track of files, so there's no need to try and figure out what the above name means. Everything in our bucket can always be accessed through dvc.
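
To give an intuition for where such names come from: dvc identifies a file by a hash of its content (MD5) and stores it under a path derived from that hash. The snippet below is a simplified illustration of that idea, not dvc's actual code. The helper name `content_address` is hypothetical, and the exact folder layout differs between dvc versions and remotes.

```python
import hashlib

def content_address(data: bytes):
    """Return (md5 hex digest, cache-style relative path) for some file content."""
    digest = hashlib.md5(data).hexdigest()
    # First two characters become a folder, the rest the file name.
    return digest, f"{digest[:2]}/{digest[2:]}"

digest, rel_path = content_address(b"hello")
print(digest)
print(rel_path)
```

Because the name depends only on the content, two identical files map to the same stored object, and any change to the data produces a new name.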

Lastly, we'll commit our changes to our git repo after adding the two files created by dvc, `data/03_part/raw/.gitignore` and `data/03_part/raw/SeoulBikeData.csv.dvc`. The `.dvc` file is a small text file with metadata about our dataset, and the `.gitignore` entry keeps the actual data file out of git. In other words, these two small files go into git while the actual data goes to our remote storage.

In [None]:
!git add data/03_part/raw/SeoulBikeData.csv.dvc data/03_part/raw/.gitignore

In [None]:
!git status

In [None]:
%%bash

git commit -m "Start Tracking Data"

Before we push any changes to GitHub, make sure you create your repository as shown here, and then connect your local and remote repo with the commands below.
![reposetup](../images/repo_setup.png)

In [None]:
%%bash

# git remote add origin https://github.com/ramonpzg/your_repo.git
git push -u origin master

### 4.2 Preparing the Data

The following steps should feel familiar. We want to separate the date variable into its components, create dummy variables for the categorical features, normalize the column names so that they only contain alphanumeric characters with underscores instead of spaces, and finally split the data into train and test sets.
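
The column-name normalization can also be done programmatically instead of typing every name by hand. The sketch below is an alternative to the hand-written list used in this tutorial, not a description of it: `normalize_column` is a hypothetical helper, and the example names are assumed spellings of raw headers.

```python
import re

def normalize_column(name):
    """Lower-case a column name and replace runs of non-alphanumerics with '_'."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()

examples = ["Rented Bike Count", "Wind speed (m/s)", "Functioning Day_Yes"]
print([normalize_column(c) for c in examples])
```

With a DataFrame this would be applied as `data.columns = [normalize_column(c) for c in data.columns]`. Note that units in the raw headers would then end up in the names (for example `wind_speed_m_s`), so the result would not match the shorter names used in the rest of the tutorial.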

In [None]:
import pandas as pd

data = pd.read_csv(os.path.join('data', '03_part', 'raw', 'SeoulBikeData.csv'), encoding='iso-8859-1')
data.head()

In [None]:
# the raw file is assumed to store dates as DD/MM/YYYY, so parse the day first
data['Date'] = pd.to_datetime(data['Date'], dayfirst=True)

In [None]:
data.sort_values(['Date', 'Hour'], inplace=True)
data["Year"] = data['Date'].dt.year
data["Month"] = data['Date'].dt.month
data["Week"] = data['Date'].dt.isocalendar().week
data["Day"] = data['Date'].dt.day
data["Dayofweek"] = data['Date'].dt.dayofweek
data["Dayofyear"] = data['Date'].dt.dayofyear
data["Is_month_end"] = data['Date'].dt.is_month_end
data["Is_month_start"] = data['Date'].dt.is_month_start
data["Is_quarter_end"] = data['Date'].dt.is_quarter_end
data["Is_quarter_start"] = data['Date'].dt.is_quarter_start
data["Is_year_end"] = data['Date'].dt.is_year_end
data["Is_year_start"] = data['Date'].dt.is_year_start
data.drop('Date', axis=1, inplace=True)

In [None]:
data = pd.get_dummies(data=data, columns=['Holiday', 'Seasons', 'Functioning Day'])

In [None]:
# get_dummies appends the dummy columns in the order of the `columns` argument
# (Holiday, Seasons, Functioning Day), with each variable's categories sorted alphabetically
data.columns = ['rented_bike_count', 'hour', 'temperature', 'humidity', 'wind_speed', 'visibility',
                'dew_point_temperature', 'solar_radiation', 'rainfall', 'snowfall', 'year',
                'month', 'week', 'day', 'dayofweek', 'dayofyear', 'is_month_end', 'is_month_start',
                'is_quarter_end', 'is_quarter_start', 'is_year_end', 'is_year_start',
                'holiday_yes', 'holiday_no',
                'seasons_autumn', 'seasons_spring', 'seasons_summer', 'seasons_winter',
                'functioning_day_no', 'functioning_day_yes']

In [None]:
split = 0.30
n_train = int(len(data) - len(data) * split)

train_path = os.path.join('data', '03_part', 'processed', 'train.csv')
test_path = os.path.join('data', '03_part', 'processed', 'test.csv')

data[:n_train].reset_index(drop=True).to_csv(train_path, index=False)
data[n_train:].reset_index(drop=True).to_csv(test_path, index=False)
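
Because the data was sorted by date and hour before slicing, the split above is chronological: the earliest 70% of the rows become the train set and the most recent 30% the test set, with no shuffling. Here is a small stand-alone sketch of the same arithmetic, where the function name `chronological_split` and the toy list are made up for illustration.

```python
def chronological_split(rows, test_fraction):
    """Split time-ordered rows into (train, test) without shuffling."""
    n_train = int(len(rows) - len(rows) * test_fraction)
    return rows[:n_train], rows[n_train:]

hours = list(range(8))  # stand-in for rows already sorted by date and hour
train, test = chronological_split(hours, 0.25)
print(train, test)
```

Keeping the latest observations for testing mimics how the model would be used in practice, predicting hours it has not seen yet.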

Using the same commands as before, let's keep track of our new datasets with dvc and push the changes to our gdrive. Also, we'll create a file called `prepare.py` for later use in our pipelines.

In [None]:
%%bash

dvc add data/03_part/processed/train.csv data/03_part/processed/test.csv
dvc push

In [None]:
%%writefile src/full_pipe/prepare.py

import pandas as pd
import os, sys

split = 0.30

raw_data_path = sys.argv[1]
train_path = os.path.join('data', '03_part', 'processed', 'train.csv')
test_path = os.path.join('data', '03_part', 'processed', 'test.csv')

# read the data
data = pd.read_csv(raw_data_path, encoding='iso-8859-1')

# add date vars
# the raw file is assumed to store dates as DD/MM/YYYY, so parse the day first
data['Date'] = pd.to_datetime(data['Date'], dayfirst=True)
data.sort_values(['Date', 'Hour'], inplace=True)
data["Year"] = data['Date'].dt.year
data["Month"] = data['Date'].dt.month
data["Week"] = data['Date'].dt.isocalendar().week
data["Day"] = data['Date'].dt.day
data["Dayofweek"] = data['Date'].dt.dayofweek
data["Dayofyear"] = data['Date'].dt.dayofyear
data["Is_month_end"] = data['Date'].dt.is_month_end
data["Is_month_start"] = data['Date'].dt.is_month_start
data["Is_quarter_end"] = data['Date'].dt.is_quarter_end
data["Is_quarter_start"] = data['Date'].dt.is_quarter_start
data["Is_year_end"] = data['Date'].dt.is_year_end
data["Is_year_start"] = data['Date'].dt.is_year_start
data.drop('Date', axis=1, inplace=True)

# add dummies
data = pd.get_dummies(data=data, columns=['Holiday', 'Seasons', 'Functioning Day'])

# Normalize columns
# get_dummies appends the dummy columns in the order of the `columns` argument
# (Holiday, Seasons, Functioning Day), with each variable's categories sorted alphabetically
data.columns = ['rented_bike_count', 'hour', 'temperature', 'humidity', 'wind_speed', 'visibility',
                'dew_point_temperature', 'solar_radiation', 'rainfall', 'snowfall', 'year',
                'month', 'week', 'day', 'dayofweek', 'dayofyear', 'is_month_end', 'is_month_start',
                'is_quarter_end', 'is_quarter_start', 'is_year_end', 'is_year_start',
                'holiday_yes', 'holiday_no',
                'seasons_autumn', 'seasons_spring', 'seasons_summer', 'seasons_winter',
                'functioning_day_no', 'functioning_day_yes']

n_train = int(len(data) - len(data) * split)
data[:n_train].reset_index(drop=True).to_csv(train_path, index=False)
data[n_train:].reset_index(drop=True).to_csv(test_path, index=False)

Now we are ready to commit all of our changes.

In [None]:
%%bash

git add .
git commit -m "Preparation stage completed"
git push

## 5. Training our First Model

We want to create a model that predicts how many bikes will be needed at any given hour and on any given date in the city of Seoul. Since the number of bicycles rented is a number rather than a category, this is a regression problem, and what better tool to use for regression problems than Random Forests?

![rfs](https://media.makeameme.org/created/its-easy-just-5bd65c.jpg)

What are Random Forests anyway?

> "Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time. For classification tasks, the output of the random forest is the class selected by most trees. For regression tasks, the mean or average prediction of the individual trees is returned. Random decision forests correct for decision trees' habit of overfitting to their training set." ~ [Wikipedia](https://en.wikipedia.org/wiki/Random_forest)

We want to start with a baseline model, evaluate it, and then fine-tune either the implementation we picked, in this case scikit-learn's, or, as we'll see in a later section, an implementation from another framework.
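
To make the "many trees, averaged" idea from the definition above concrete, here is a toy sketch in plain Python. It is not how scikit-learn implements random forests: each "tree" is a one-split stump on a single feature, there is no random feature selection, and the function names and the made-up hour/count data are assumptions for illustration only.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """Fit a one-split regression tree (a 'stump') on 1-D data."""
    best = None
    for threshold in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        if not left or not right:
            continue
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all x identical: always predict the overall mean
        overall = mean(ys)
        return lambda x: overall
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def fit_forest(xs, ys, n_trees=25, seed=42):
    """Train each stump on a bootstrap sample (sampling rows with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees

def predict_forest(trees, x):
    """For regression, the forest prediction is the average over all trees."""
    return mean(tree(x) for tree in trees)

# Made-up data: few rentals in the morning hours, many in the afternoon.
hours = list(range(24))
counts = [10 if h < 12 else 100 for h in hours]
trees = fit_forest(hours, counts)
print(predict_forest(trees, 2), predict_forest(trees, 20))
```

Each stump sees a slightly different resample of the data, and averaging their outputs smooths out the quirks of any single tree.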

After we train our sklearn model, we want to serialize (or pickle) that model, track it with dvc, and use it later with unseen data in the evaluation stage.

We'll import sklearn's `RandomForestRegressor()` and python's `pickle` module, load our train set, and start our training with 100 estimators and a seed. Feel free to change these parameters however you like though.

In [None]:
from sklearn.ensemble import RandomForestRegressor
import pickle

In [None]:
X_train = pd.read_csv(os.path.join('data', '03_part', 'processed', 'train.csv'))
y_train = X_train.pop('rented_bike_count')

In [None]:
seed = 42
n_est = 100

In [None]:
rf = RandomForestRegressor(n_estimators=n_est, random_state=seed)
rf.fit(X_train.values, y_train.values)

In [None]:
rf.predict(X_train.values)[:10]

In [None]:
with open('models/rf_model.pkl', "wb") as fd:
    pickle.dump(rf, fd)

Now that we have a trained model, let's save the steps we just took to a file called `train.py`, and let's also start tracking our model in the same way in which we tracked our data earlier with dvc. Lastly, we'll commit our work and push everything to GitHub.

In [None]:
%%writefile src/full_pipe/train.py

import os, pickle, sys
import numpy as np, pandas as pd
from sklearn.ensemble import RandomForestRegressor

input_data = sys.argv[1]
output = os.path.join('models', 'rf_model.pkl')
seed = 42
n_est = 100

X_train = pd.read_csv(input_data)
y_train = X_train.pop('rented_bike_count')

rf = RandomForestRegressor(n_estimators=n_est, random_state=seed)
rf.fit(X_train.values, y_train.values)

with open(output, "wb") as fd:
    pickle.dump(rf, fd)

In [None]:
!dvc add models/rf_model.pkl

In [None]:
!dvc push

In [None]:
%%bash

git add .
git commit -m "Training 1 completed"
git push

## 6. Model Evaluation

Model evaluation is a crucial part of training ML models, and it is important that we pick useful metrics that indicate how well our model is performing, or can be expected to perform, when presented with unseen data.

The metrics we'll use are Mean Absolute Error, Root Mean Squared Error, and $R^2$.

- Mean Absolute Error - "is a measure of errors between paired observations expressing the same phenomenon. Examples of Y versus X include comparisons of predicted versus observed, subsequent time versus initial time, and one technique of measurement versus an alternative technique of measurement." ~ [Wikipedia](https://en.wikipedia.org/wiki/Mean_absolute_error)
- Root Mean Squared Error - "is a frequently used measure of the differences between values (sample or population values) predicted by a model or an estimator and the values observed. The RMSD serves to aggregate the magnitudes of the errors in predictions for various data points into a single measure of predictive power. RMSD is a measure of accuracy, to compare forecasting errors of different models for a particular dataset and not between datasets, as it is scale-dependent." ~ [Wikipedia](https://en.wikipedia.org/wiki/Root-mean-square_deviation)
- $R^2$ - "In statistics, the coefficient of determination, also spelt coëfficient, denoted $R^2$ or r2 and pronounced "R squared", is the proportion of the variation in the dependent variable that is predictable from the independent variable(s)." ~ [Wikipedia](https://en.wikipedia.org/wiki/Coefficient_of_determination)
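
The three metrics above can be written out in a few lines of plain Python, which may help in reading the scikit-learn calls used later. This is a sketch for intuition only: the function names are our own, and the observed and predicted counts are made up.

```python
import math

def mean_absolute_error(y_true, y_pred):
    """Average size of the errors, ignoring their sign."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def root_mean_squared_error(y_true, y_pred):
    """Square root of the average squared error; large misses weigh more."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical observed and predicted bike counts.
y_true = [100, 150, 200, 250]
y_pred = [110, 140, 210, 230]
print(mean_absolute_error(y_true, y_pred))
print(root_mean_squared_error(y_true, y_pred))
print(r_squared(y_true, y_pred))
```

MAE and RMSE are expressed in the same unit as the target (bikes per hour), while $R^2$ is unitless.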

We'll start by loading our model and our test set from the previous steps. We'll then predict the number of bikes in the test set and compare those predictions with the ground truth. After we compute the metrics above, we'll save them to a JSON file for further use and comparison using dvc.

In [None]:
import sklearn.metrics as metrics, json, numpy as np

In [None]:
with open(os.path.join('models', 'rf_model.pkl'), "rb") as fd:
    model = pickle.load(fd)

In [None]:
X_test = pd.read_csv(os.path.join('data', '03_part', 'processed', 'test.csv'))
y_test = X_test.pop('rented_bike_count')

In [None]:
predictions = model.predict(X_test.values)
predictions[:10]

In [None]:
mae = metrics.mean_absolute_error(y_test.values, predictions)
rmse = np.sqrt(metrics.mean_squared_error(y_test.values, predictions))
r2_score = model.score(X_test.values, y_test.values)

In [None]:
print(f"Mean Absolute Error: {mae:.2f}")
print(f"Root Mean Square Error: {rmse:.2f}")
print(f"R^2: {r2_score:.3f}")

In [None]:
with open(os.path.join('metrics', 'metrics.json'), "w") as fd:
    json.dump({"MAE": mae, "RMSE": rmse, "R^2":r2_score}, fd, indent=4)

We will save the steps above in a file called `evaluate.py` for later use, and we will add our metrics to git and GitHub rather than to our remote gdrive storage. The reason is that dvc has a command, `dvc metrics diff`, that lets us compare metrics between a branch and master or between different commits. As you'll see soon, this is a very powerful feature of dvc that we certainly want to take advantage of.
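
Conceptually, comparing metrics between two commits boils down to loading two versions of `metrics.json` and subtracting the values. The sketch below shows that idea in plain Python. It is not dvc's implementation, `diff_metrics` is a hypothetical helper, and the numbers are invented.

```python
def diff_metrics(old, new):
    """Return {metric: (old, new, change)} for metrics present in both runs."""
    return {
        name: (old[name], new[name], new[name] - old[name])
        for name in old
        if name in new
    }

# Hypothetical numbers, only to show the shape of the comparison.
baseline = {"MAE": 10.0, "RMSE": 15.0}
candidate = {"MAE": 8.5, "RMSE": 16.0}
for name, (before, after, change) in diff_metrics(baseline, candidate).items():
    print(f"{name}: {before} -> {after} ({change:+.2f})")
```

For error metrics such as MAE and RMSE a negative change is an improvement, while for $R^2$ a positive change is.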

In [None]:
%%writefile src/full_pipe/evaluate.py

import json, os, pickle, sys, pandas as pd, numpy as np
import sklearn.metrics as metrics

model_file = sys.argv[1]
test_file = os.path.join(sys.argv[2], "test.csv")
scores_file = os.path.join('metrics', 'metrics.json')

with open(model_file, "rb") as fd:
    model = pickle.load(fd)

X_test = pd.read_csv(test_file)
y_test = X_test.pop('rented_bike_count')

predictions = model.predict(X_test.values)

mae = metrics.mean_absolute_error(y_test.values, predictions)
rmse = np.sqrt(metrics.mean_squared_error(y_test.values, predictions))
r2_score = model.score(X_test.values, y_test.values)

with open(scores_file, "w") as fd:
    json.dump({"MAE": mae, "RMSE": rmse, "R^2":r2_score}, fd, indent=4)

In [None]:
%%bash

git add .
git commit -m "Evaluation completed"
git push

## 7. DVC Pipelines

DVC pipelines are one of the best features offered by dvc. They allow us to create reproducible workflows containing anything from getting the data to training and evaluating models.

There are several ways to create pipelines with dvc, and here we'll do so with `dvc run`. The command starts with the `-n` flag followed by the name we want to give to the stage. Next, we add one `-d` flag per dependency, such as the Python file we want to run and any input files it reads. Then comes the `-o` flag, which tells dvc which outputs to expect from the stage; for example, the data preparation stage produces the `train.csv` and `test.csv` files. Lastly, we pass the full Python call exactly as we would type it in the terminal. Note that `dvc run` belongs to the dvc version available when this tutorial was written; in DVC 3.0 and later it was replaced by `dvc stage add`, which takes the same `-n`, `-d`, and `-o` flags and is followed by `dvc repro` to execute the stage.
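
The anatomy of such a command (name, dependencies, outputs, then the command to run) can be made explicit by assembling it from its parts. The helper `build_dvc_run` below is hypothetical and only builds the argument list. It does not call dvc, and it uses only the `-n`, `-d`, and `-o` flags described above.

```python
import shlex

def build_dvc_run(name, deps, outs, command):
    """Assemble the argument list for a `dvc run` stage definition."""
    args = ["dvc", "run", "-n", name]
    for dep in deps:
        args += ["-d", dep]   # one -d per dependency
    for out in outs:
        args += ["-o", out]   # one -o per expected output
    return args + shlex.split(command)

args = build_dvc_run(
    name="train",
    deps=["src/full_pipe/train.py", "data/03_part/processed/train.csv"],
    outs=["models/rf_model.pkl"],
    command="python src/full_pipe/train.py data/03_part/processed/train.csv",
)
print(shlex.join(args))
```

The resulting list could be handed to `subprocess.run(args, check=True)` if you prefer to drive dvc from Python rather than from a `%%bash` cell.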

After we run our dvc command we'll get two files, `dvc.yaml` and `dvc.lock`. The former contains the stages dvc will follow in our pipeline, and the latter records the hashes of each stage's dependencies and outputs so that dvc can tell what has changed. Once you have a look at the yaml file, you'll probably wonder whether you can write such a file manually, and the answer is yes. The `dvc.lock` file, on the other hand, is maintained by dvc itself through the command `dvc repro`, which runs whatever pipeline resides in your `dvc.yaml` file.
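
Why keep a lock file at all? It stores content hashes of each stage's dependencies and outputs, so `dvc repro` can skip stages whose inputs have not changed. The following is a simplified sketch of that check, not dvc's code: the names `file_md5` and `changed_deps` and the temporary demo file are assumptions for illustration.

```python
import hashlib
import os
import tempfile

def file_md5(path):
    """Hash a file's content."""
    with open(path, "rb") as fh:
        return hashlib.md5(fh.read()).hexdigest()

def changed_deps(deps, recorded):
    """Return the dependencies whose current hash differs from the recorded one."""
    return [d for d in deps if recorded.get(d) != file_md5(d)]

with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "train.py")
    with open(script, "w") as fh:
        fh.write("n_est = 100\n")
    lock = {script: file_md5(script)}    # what a lock file would remember
    print(changed_deps([script], lock))  # nothing changed yet
    with open(script, "w") as fh:
        fh.write("n_est = 200\n")
    print(changed_deps([script], lock))  # the edited script is reported
```

A stage with no changed dependencies can be skipped, while a stage with at least one changed dependency, and everything downstream of it, needs to run again.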

More information about both can be found in the official documentation site [here](https://dvc.org/doc/command-reference/run).

Before we start adding the stages of our pipeline, let's first remove the files we were already tracking with dvc. Not doing so will result in dvc giving us an error since the tracked files already exist, and also, we want to see how the whole pipeline behaves.

In [None]:
!rm -f dvc.lock dvc.yaml

In [None]:
%%bash

dvc remove data/03_part/raw/SeoulBikeData.csv.dvc \
           data/03_part/processed/train.csv.dvc \
           data/03_part/processed/test.csv.dvc \
           models/rf_model.pkl.dvc

In [None]:
%%bash

dvc run -n get_data \
    -d src/full_pipe/get_data.py \
    -o data/03_part/raw/SeoulBikeData.csv \
    python src/full_pipe/get_data.py

In [None]:
%%bash

dvc run -n prepare \
    -d src/full_pipe/prepare.py -d data/03_part/raw/SeoulBikeData.csv \
    -o data/03_part/processed/train.csv -o data/03_part/processed/test.csv \
    python src/full_pipe/prepare.py data/03_part/raw/SeoulBikeData.csv

In [None]:
%%bash

dvc run -n train \
    -d src/full_pipe/train.py -d data/03_part/processed/train.csv \
    -o models/rf_model.pkl \
    python src/full_pipe/train.py data/03_part/processed/train.csv

In [None]:
%%bash

dvc run -n evaluate \
    -d src/full_pipe/evaluate.py -d models/rf_model.pkl -d data/03_part/processed \
    -M metrics/metrics.json \
    python src/full_pipe/evaluate.py models/rf_model.pkl data/03_part/processed

Notice that in the last stage of our pipeline we use the `-M` flag. This flag tells dvc to treat the output of that stage as a metrics file that is not stored in the dvc cache, so we commit it with git and can later run `dvc metrics diff` to compare metrics between commits or branches.

Running `dvc status` tells us whether there are changes in our pipeline and files or whether everything is up to date.

In [None]:
!dvc status

In [None]:
!dvc metrics show

Another cool function of dvc is `dvc dag`, which will show us a graph with the steps in our pipeline.

In [None]:
!dvc dag
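
The graph printed by `dvc dag` is a directed acyclic graph: a stage can only run after the stages it depends on. As a sketch of what that implies for execution order, the snippet below feeds the dependencies declared with `-d` in the four stages above to Python's standard-library topological sorter (`graphlib`, available from Python 3.9). The dictionary is our own transcription of the pipeline, not something dvc exports.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on.
stage_deps = {
    "get_data": set(),
    "prepare": {"get_data"},
    "train": {"prepare"},
    "evaluate": {"train", "prepare"},
}
run_order = list(TopologicalSorter(stage_deps).static_order())
print(run_order)
```

Since every stage here depends on the one before it, there is only one valid order, which is also the order in which the stages were defined.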

Running `dvc repro` right now should report that everything is up to date and skip every stage. To see the pipeline run again, let's then remove `dvc.lock` and the data files and run `dvc repro` once more.

In [None]:
!dvc repro

In [None]:
!rm dvc.lock data/03_part/raw/SeoulBikeData.csv data/03_part/processed/train.csv data/03_part/processed/test.csv

In [None]:
!dvc repro

In [None]:
!dvc push

Now that we have learned about dvc pipelines and how to reproduce them, let's check the files that need to be committed and let's push them to GitHub.

In [None]:
!git status

In [None]:
%%bash

git add .
git commit -m "Pipeline Finished"
git push

In [None]:
!git status

## 8. Experiments

We've been working inside our master branch using scikit-learn, and now we want to start experimenting with other tree-based frameworks like XGBoost, LightGBM, and CatBoost using different commits and/or branches for each experiment. Let's do just that and start by checking out a new branch, adding XGBoost to our train file, and triggering a new run.

**Note:** although the explanation is in terms of git branches we'll try to stay in the master branch for the workshop purposes. 

In [None]:
# !git checkout -b "exp1-xgb"

In [None]:
%%writefile src/full_pipe/train.py

import os, pickle, sys, pandas as pd
from xgboost import XGBRFRegressor

input_data = sys.argv[1]
output = os.path.join('models', 'rf_model.pkl')
seed, n_est = 42, 100

X_train = pd.read_csv(input_data)
y_train = X_train.pop('rented_bike_count')

rf = XGBRFRegressor(n_estimators=n_est, random_state=seed)
rf.fit(X_train.values, y_train.values)

with open(output, "wb") as fd: pickle.dump(rf, fd)

In [None]:
!dvc status

In [None]:
!dvc repro

In [None]:
!dvc status

In [None]:
!dvc push

In [None]:
!dvc metrics show

In [None]:
!dvc metrics diff --show-md master

Note that in order for GitHub to know we have been working in a different branch, we need to use the `git push --set-upstream origin exp1-xgb` command the first time we push. Otherwise, we'll get an error.

In [None]:
%%bash

git add .
git commit -m "Testing XGBoost"
git push
# git push --set-upstream origin exp1-xgb
# git push

In [None]:
# !git checkout -b "exp2-lgbm"

In [None]:
%%writefile src/full_pipe/train.py

import os, pickle, sys, pandas as pd
from lightgbm import LGBMRegressor

input_data = sys.argv[1]
output = os.path.join('models', 'rf_model.pkl')
seed, n_est = 42, 100

X_train = pd.read_csv(input_data)
y_train = X_train.pop('rented_bike_count')

rf = LGBMRegressor(n_estimators=n_est, random_state=seed)
rf.fit(X_train.values, y_train.values)

with open(output, "wb") as fd: pickle.dump(rf, fd)

In [None]:
!dvc status

In [None]:
!dvc repro

In [None]:
!dvc push

In [None]:
!dvc metrics diff --show-md master

In [None]:
%%bash

git add .
git commit -m "Testing LightGBM"
git push
# git push --set-upstream origin exp2-lgbm
# git push

The base implementation of LightGBM seems to have performed quite well. Let's try out CatBoost now.

In [None]:
# !git checkout -b "exp3-cat"

In [None]:
%%writefile src/full_pipe/train.py

import os, pickle, sys, pandas as pd
from catboost import CatBoostRegressor

input_data = sys.argv[1]
output = os.path.join('models', 'rf_model.pkl')
seed, n_est = 42, 100

X_train = pd.read_csv(input_data)
y_train = X_train.pop('rented_bike_count')

rf = CatBoostRegressor(n_estimators=n_est, random_state=seed)
rf.fit(X_train.values, y_train.values)

with open(output, "wb") as fd: pickle.dump(rf, fd)

In [None]:
!dvc status

In [None]:
!dvc repro

In [None]:
!dvc push

In [None]:
!dvc metrics diff --show-md master

In [None]:
%%bash

git add .
git commit -m "Testing CatBoost"
git push
# git push --set-upstream origin exp3-cat
# git push

Interestingly, CatBoost's MAE was a bit worse than LightGBM's, but its RMSE was much better.

We have a good candidate with CatBoost, so we should merge this branch into master and start tuning our model.
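
It may seem odd that two models can rank differently under MAE and RMSE. The reason is that RMSE squares the errors before averaging, so a few large misses hurt it much more than they hurt MAE. The toy error lists below are hypothetical and are not results from this tutorial. They only illustrate the effect.

```python
import math

def mae(errors):
    """Mean of the absolute errors."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Square root of the mean squared error."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Hypothetical prediction errors for two imaginary models.
steady = [5, 5, 5, 5]    # always off by a little
spiky = [1, 1, 1, 13]    # usually close, occasionally far off
print(mae(steady), rmse(steady))
print(mae(spiky), rmse(spiky))
```

Here the "spiky" model has the lower MAE but the higher RMSE, so which model counts as better depends on how costly occasional large errors are for the application.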

## 9. Merging our Changes - PRs

![empspr](../images/compare_pr.png)
If you go back to the main page of our repo, you'll notice that GitHub has added a **Compare & pull request** option for each of the three experiments. This is a nice shortcut to help us pick the one we liked best and add it to our project's main branch, master.

So what is a pull request anyway? "A PR provides a user-friendly web interface for discussing proposed changes before integrating them into the official project." ~ [Atlassian](https://www.atlassian.com/git/tutorials/making-a-pull-request)

Let's merge our experiment branch with our master branch.

1. Click on the **Compare & pull request** for exp3-cat.
2. Compare the changes.
![comp](../images/compare_pr.png)
3. Check out the report again.
![rep](../images/report_bottom.png)
4. Open a pull request with your details on why it should go to master.
![rep](../images/pr_mess.png)
5. Once reviewed, write a comment and merge the pull request.
![rep](../images/awesome_cat.png)
6. Lastly, we need to make sure our local env is up to date and once it is, we can switch to the master branch and work from there again. Run a `git pull` and a `dvc pull`.
![list_prs](../images/last_of_pr.png)

## 10. Summary

As you have seen throughout the tutorial, DVC helps us track our data, models, and metrics, and it also allows us to create pipelines for getting, preparing, and modeling data.

DVC fills in the gaps of what git alone can't do for the machine learning community. This tool should be in every data scientist and ML Engineer's toolkit. Enough said!

![enough](https://media.giphy.com/media/mVJ5xyiYkC3Vm/giphy.gif)

## 11. Blind Spots and Future Work

**Blind Spots**
- We could have fine-tuned our base model even further and made better comparisons with the other frameworks.
- We could have conducted more feature engineering.
- We could have selected the best features only based on feature importance.
- We could have done a bit more analysis of the data.
- We could have taken out the second dummy of our categorical variables. For example, there is no need to have both Holiday and No Holiday as variables in our dataset.


**Future Work**
- If the data keeps arriving in the same format we received it in, then we need an easier transformation pipeline for the date, column names, and dummies.
- We could add the analytical tool, e.g. our dashboard, to the master branch and work with the models solely through branches.

## 12. Resources

Here are a few additional resources to dive deeper into some of the tools discussed above.
- [DVC Get Started](https://dvc.org/doc/start)
- [DVC Use Cases](https://dvc.org/doc/use-cases)
- [CML Get Started](https://cml.dev/doc/start)
- [CatBoost Tutorial](https://catboost.ai/en/docs/concepts/tutorials)
- [Git](https://realpython.com/python-git-github-intro)
- [Pull Requests](https://www.atlassian.com/git/tutorials/making-a-pull-request)