<center>
<h1>The Full Machine Learning Lifecycle - How to Use Machine Learning in Production (MLOps)</h1>
<hr>
<h2>DVC Tutorial</h2>
<hr>
 </center>

# Introduction
This tutorial teaches you how to use DVC to version your data. You will learn how to set up data versioning and how to track and switch between dataset versions. To get started, let's navigate into our project home directory.


In [24]:
import pandas as pd
import numpy as np
import os
import sys
import matplotlib.pyplot as plt
import seaborn as sns

sys.path.insert(1, os.path.join(sys.path[0], '..'))
sys.path.append('/cd4ml/plugins/')
os.makedirs('/cd4ml/dvc-tutorial', exist_ok=True)
os.chdir("/cd4ml/dvc-tutorial")

from cd4ml.data_processing import ingest_data

# 1. Initialize the Git repository
DVC works hand-in-hand with Git. To get started tracking the data, we need to initialize a Git repository. 

In [25]:
! git init
! git config user.name "mlops-workshop"
! git config user.email "mlops@workshop.com"

Reinitialized existing Git repository in /cd4ml/dvc-tutorial/.git/


# 2. Initialize DVC
Once we are within a Git repository, we can initialize DVC by running `dvc init`. This creates a `.dvc` folder that DVC uses for data versioning. The `-f` flag used below forces re-initialization if a `.dvc` folder already exists.

In [26]:
! dvc init -f
! ls -a

Initialized DVC repository.

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>
.  ..  data  .dvc  .dvcignore  .git


### Exploring the contents of the `.dvc` folder

In [27]:
! ls .dvc

config	tmp


The `.dvc` directory contains a `config` file, a `tmp` folder that DVC uses for temporary files, and a hidden `.gitignore` (which is why plain `ls` does not list it). The config file is empty for now, but it will store the configuration of our DVC setup once we have defined it.

In [28]:
! cat .dvc/config

DVC lists its internal files in this `.gitignore` to exclude them from Git tracking.

In [29]:
! cat .dvc/.gitignore

/config.local
/tmp
/cache


We are now ready to commit our DVC initialization to the Git repository.

In [30]:
! git status

On branch master
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
	modified:   .dvc/config

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	data/



In [31]:
! git commit -m "Initialize DVC repository"

[master 804b985] Initialize DVC repository
 1 file changed, 4 deletions(-)


# 3. Set up remote data storage for DVC
Next, we define the remote storage location where the data will be stored. This can be cloud storage (e.g. Amazon S3, Azure Blob Storage, Google Drive) or a local folder on your system. In this tutorial we use a local folder.

In [32]:
! dvc remote add -d remote_storage ./dvc_remote

Setting 'remote_storage' as a default remote.

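The information about the remote storage is saved in DVC's `config` file. Relative paths are stored relative to the location of the config file, which is why `./dvc_remote` appears as `../dvc_remote`.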

In [33]:
! cat .dvc/config

[core]
    remote = remote_storage
['remote "remote_storage"']
    url = ../dvc_remote
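
To read this configuration programmatically, a minimal sketch follows. It assumes that Python's standard `configparser` is lenient enough for this particular file (DVC itself uses its own config loader), and the helper name `default_remote_url` is hypothetical. Relative local paths are resolved against the directory that holds the config file.

```python
import configparser
import os


def default_remote_url(config_text, config_dir=".dvc"):
    """Return the default remote's URL, resolving relative local paths against config_dir."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    name = parser["core"]["remote"]
    # DVC writes remote sections as ['remote "<name>"'], quotes included.
    section = "'remote \"{}\"'".format(name)
    url = parser[section]["url"]
    if "://" in url or os.path.isabs(url):
        # Cloud URLs and absolute paths are returned unchanged.
        return url
    return os.path.normpath(os.path.join(config_dir, url))


example_config = """\
[core]
    remote = remote_storage
['remote "remote_storage"']
    url = ../dvc_remote
"""

print(default_remote_url(example_config))
```

For the config shown above, the resolved path is the `dvc_remote` folder in the project root.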


Let's commit this change to the Git repository.

In [34]:
! git add .dvc/config
! git commit -m "Configuring remote storage"
! git log -n 2

[master 7d56ba1] Configuring remote storage
 1 file changed, 4 insertions(+)
commit 7d56ba1d16b58b4a23f3530a2c61827e826b5b5b (HEAD -> master)
Author: mlops-workshop <mlops@workshop.com>
Date:   Wed May 15 08:49:37 2024 +0000

    Configuring remote storage

commit 804b985598505166801c5cd9d32bb361323f97a2
Author: mlops-workshop <mlops@workshop.com>
Date:   Wed May 15 08:49:25 2024 +0000

    Initialize DVC repository


# 4. Tracking data
With the DVC setup complete, we can start versioning the data. Let's use the ingestion script to make the data available.

In [35]:
# paths and variables
_raw_data_dir = '/data/batch1'
_data_dir = 'data'

# ingest the data from blob storage
ingest_data(_raw_data_dir, data_files={'raw_data_file': os.path.join(_data_dir, 'data.csv')})

The folder `data` now contains the dataset `data.csv`, which we want to version with DVC. It contains 52384 lines of data.

In [36]:
! wc -l data/data.csv

52384 data/data.csv


Adding tracking to this dataset can be achieved using `dvc add <filename>`.

In [37]:
! dvc add data/data.csv

100% Adding...|████████████████████████████████████████|1/1 [00:00,  8.22file/s]

To track the changes with git, run:

	git add data/.gitignore data/data.csv.dvc

To enable auto staging, run:

	dvc config core.autostage true

Running `dvc add` created a `<filename>.dvc` file, which we will track with Git and which DVC uses to detect changes in the data. The `.gitignore` was also updated to exclude the data itself from Git tracking (Git tracks only the `<filename>.dvc` file). The `.dvc` file contains the file hash and some file metadata.

In [38]:
! cat data/data.csv.dvc

outs:
- md5: e7e332ee787f0207bdf0a76878a77829
  size: 13769143
  hash: md5
  path: data.csv


In [39]:
! cat data/.gitignore

/data.csv


Now, we can add the `data.csv.dvc` file and the modified `.gitignore` to a Git commit.

In [40]:
! git add data/data.csv.dvc data/.gitignore

In [41]:
! git commit -m "Dataset version 1"
! git tag "v1"

[master c68a947] Dataset version 1
 2 files changed, 6 insertions(+)
 create mode 100644 data/.gitignore
 create mode 100644 data/data.csv.dvc


In [42]:
! git log -n 3

commit c68a9471aeb6bbb686909aa8671fb8783abecd36 (HEAD -> master, tag: v1)
Author: mlops-workshop <mlops@workshop.com>
Date:   Wed May 15 08:53:33 2024 +0000

    Dataset version 1

commit 7d56ba1d16b58b4a23f3530a2c61827e826b5b5b
Author: mlops-workshop <mlops@workshop.com>
Date:   Wed May 15 08:49:37 2024 +0000

    Configuring remote storage

commit 804b985598505166801c5cd9d32bb361323f97a2
Author: mlops-workshop <mlops@workshop.com>
Date:   Wed May 15 08:49:25 2024 +0000

    Initialize DVC repository


Finally, we push the data to the remote storage location (in this example a local folder in our directory) using `dvc push`.

In [43]:
! dvc push

Collecting                                            |1.00 [00:00, 49.7entry/s]
Pushing
1 file pushed

That's it. We have now properly versioned our dataset.

# New data has arrived!
You have been informed that new data has arrived. We want to track this new version of the dataset so that we can later easily switch between dataset versions.

First, we run our ingestion script again to fetch the new "day 2" data.

In [44]:
# paths and variables
_raw_data_dir = '/data/batch2'

# ingest the data from blob storage
ingest_data(_raw_data_dir, data_files = {'raw_data_file': os.path.join(_data_dir, 'data.csv')})

We can detect changes in the dataset by running `dvc status`.

In [45]:
! dvc status

data/data.csv.dvc:
	changed outs:
		modified:           data/data.csv

Let us have a quick look at this modified dataset.

In [46]:
! wc -l data/data.csv

104188 data/data.csv


As you can see, our dataset has grown from 52384 to 104188 rows.
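
Conceptually, `dvc status` flags the file because its current hash no longer matches the one recorded in `data.csv.dvc`. The following is a simplified sketch of that comparison, not DVC's actual implementation. `recorded_md5` and `is_modified` are hypothetical helpers, and the hand-rolled parsing stands in for a real YAML parser.

```python
import hashlib


def recorded_md5(dvc_file_text):
    """Extract the first 'md5:' value from the text of a .dvc file."""
    for line in dvc_file_text.splitlines():
        key, _, value = line.strip().lstrip("- ").partition(":")
        if key == "md5":
            return value.strip()
    raise ValueError("no md5 entry found")


def is_modified(data_bytes, dvc_file_text):
    """True when the data no longer matches the hash stored in the .dvc file."""
    return hashlib.md5(data_bytes).hexdigest() != recorded_md5(dvc_file_text)


# Toy data standing in for the real dataset.
version_1 = b"id,value\n1,10\n"
dvc_text = "outs:\n- md5: {}\n  hash: md5\n  path: data.csv\n".format(
    hashlib.md5(version_1).hexdigest()
)

print(is_modified(version_1, dvc_text))              # False: data matches the record
print(is_modified(version_1 + b"2,20\n", dvc_text))  # True: new rows changed the hash
```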

To track the changes of the dataset, we run `dvc add` again and commit the change to the Git repository.

In [47]:
! dvc add data/data.csv

100% Adding...|████████████████████████████████████████|1/1 [00:00, 10.20file/s]

To track the changes with git, run:

	git add data/data.csv.dvc

To enable auto staging, run:

	dvc config core.autostage true

In [48]:
! git add data/data.csv.dvc
! git commit -m "Dataset version 2"
! git tag "v2"

[master 78f9164] Dataset version 2
 1 file changed, 2 insertions(+), 2 deletions(-)


Let's confirm that our changes have been committed.

In [49]:
! git log -n 4

commit 78f916406090b79a53ffce35f1a70132abcb9e0e (HEAD -> master, tag: v2)
Author: mlops-workshop <mlops@workshop.com>
Date:   Wed May 15 08:54:57 2024 +0000

    Dataset version 2

commit c68a9471aeb6bbb686909aa8671fb8783abecd36 (tag: v1)
Author: mlops-workshop <mlops@workshop.com>
Date:   Wed May 15 08:53:33 2024 +0000

    Dataset version 1

commit 7d56ba1d16b58b4a23f3530a2c61827e826b5b5b
Author: mlops-workshop <mlops@workshop.com>
Date:   Wed May 15 08:49:37 2024 +0000

    Configuring remote storage

commit 804b985598505166801c5cd9d32bb361323f97a2
Author: mlops-workshop <mlops@workshop.com>
Date:   Wed May 15 08:49:25 2024 +0000

    Initialize DVC repository


Finally, we push our latest version of the dataset to our remote storage location.

In [50]:
! dvc push

Collecting                                            |1.00 [00:00, 71.7entry/s]
Pushing
1 file pushed

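Inspecting the `dvc_remote` folder shows a single `files` directory. DVC stores every pushed version of the dataset beneath it (under `files/md5`), addressed by its content hash rather than by a version name.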

In [51]:
! ls dvc_remote

files
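
To see which dataset versions the remote actually holds, one can walk the directory tree and reassemble the hashes. This is a small sketch under the assumption that the remote follows the `files/md5/<first two characters>/<rest>` layout. The helper name `stored_hashes` is hypothetical.

```python
import os


def stored_hashes(remote_dir):
    """List the content hashes stored under <remote_dir>/files/md5."""
    root = os.path.join(remote_dir, "files", "md5")
    hashes = []
    for prefix in sorted(os.listdir(root)):
        prefix_dir = os.path.join(root, prefix)
        if not os.path.isdir(prefix_dir):
            continue
        for rest in sorted(os.listdir(prefix_dir)):
            # Directory name plus file name together form the full hash.
            hashes.append(prefix + rest)
    return hashes


# Only runs when executed from the tutorial's project directory.
if os.path.isdir("dvc_remote"):
    print(stored_hashes("dvc_remote"))
```

After both pushes, the list would be expected to contain one hash per dataset version, each matching the `md5` field of the corresponding `data.csv.dvc` revision.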


# Switching between dataset versions
Switching between dataset versions involves a combination of `git checkout` and `dvc checkout` (or `dvc pull`). `git checkout` loads the correct version of the `<filename>.dvc` file into the workspace, and `dvc checkout` then restores the associated data from our local cache. If the data is not in the local cache, run `dvc pull` to fetch it from the remote.

Let's look again at the size of our current dataset (version 2).

In [52]:
! wc -l data/data.csv

104188 data/data.csv


Now, we will check out version 1 of our dataset and look at the contents again.

In [53]:
! git checkout tags/v1 
! dvc checkout

Note: switching to 'tags/v1'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at c68a947 Dataset version 1
Building workspace index                              |2.00 [00:00, 28.9entry/s]
Comparing indexes                                    |3.00 [00:00, 1.35kentry/s]
Applying changes                                      |1.00 [00:00,  27.6file/s]
M       data/data.csv

In [54]:
! wc -l data/data.csv

52384 data/data.csv


As you can see, we have indeed switched to the previous version of our dataset.
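
Since the two commands always go together, they can be wrapped in a small helper. This is a sketch rather than part of the workshop code. `switch_dataset_version` is a hypothetical name, and the `run` parameter exists only so the command runner can be swapped out, for example in tests.

```python
import subprocess


def switch_dataset_version(rev, run=subprocess.run):
    """Check out a Git revision, then sync the DVC-tracked data to match it."""
    commands = [["git", "checkout", rev], ["dvc", "checkout"]]
    for command in commands:
        # check=True raises CalledProcessError if a step fails, so a failed
        # git checkout is never followed by dvc checkout.
        run(command, check=True)
    return commands
```

Called from inside the repository, `switch_dataset_version("v2")` would return the workspace to the second dataset version.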

# Summary
And there we have it! This is how you can use DVC to keep track of versions of data and switch between them. We started by initializing a Git repository, then we initialized DVC inside it. A combination of `dvc add` and `git commit` allowed us to track our dataset, which we pushed to remote storage with `dvc push`. We accessed different dataset versions with a combination of `git checkout` and `dvc checkout`.

In the next part of this workshop, you will learn how to incorporate DVC into an end-to-end Machine Learning workflow using MLflow and Apache Airflow.

In [55]:
# clean up
import shutil

os.chdir('..')
shutil.rmtree('dvc-tutorial')