# Data Versioning

Keeping track of versions of your data can be tricky. Historically, data scientists might simply keep copies of every version of the data they use and rename them appropriately. This can clutter up your data store or file system, and it makes it difficult to truly keep track of which data went with which model (experiment).

The notion of having a data version control system, similar to what git does for code, has been around for a while, and the most popular tools that have emerged to address this need are Git LFS and DVC. Some would argue, however, that a version control tool for data is unnecessary, so long as the data scientist is disciplined enough to create reproducible data pipelines themselves and the data sources themselves are managed correctly.

## Data Version Control (DVC)

[DVC](https://dvc.org/) is a tool that can be used to track versions of data, and is used similarly to git. First, let's install DVC using 

`pip install dvc` 

from within your environment. After it's installed you can check the version and help files:

`dvc -V`  
`dvc -h`

Then, in your project root folder you can initialize DVC, similar to initializing a git repo, using 

`dvc init` 

After this, you will see a new .dvc folder and a .dvcignore file. The .dvcignore file is similar to .gitignore: you use it to tell DVC which data files to ignore. The .dvc folder contains a config file and a tmp folder, and should also have a .gitignore file. This folder will eventually contain a cache folder as well, which we will talk about in a minute. First, let's add the .dvc/config, .dvcignore, and .dvc/.gitignore files to git.

`git add .dvc/config .dvcignore .dvc/.gitignore` (or `git add .`, if you don't have a bunch of other uncommitted changes)  
`git commit -m 'add dvc'`  

You may want to push these to your Github repo at this point.

`git push origin main`  

Now you can start version controlling data. Let's add a dataset to our data/ folder. You may already have data there, but you can also use any toy dataset that you want to use for practicing with DVC. I will add the adult.data file into my data/ folder and then add it to DVC (again, similar to git) using

`dvc add data/adult.data`

I can now see an adult.data.dvc file in the data/ folder. This file is important, and this is the one that we will keep under version control with git, so let's go ahead and add it now.

`git add data/adult.data.dvc`  
`git commit -m 'add adult data dvc file'`

And now feel free to take a look at what's in the adult.data.dvc file.

`cat data/adult.data.dvc`  

You will notice an md5 hash value. If you now go back to the .dvc folder, you will see that a cache folder has been created. Inside it is a new folder whose name matches the first two characters of the md5 hash value (newer DVC versions may nest this under additional folders such as `files/md5`). Navigate into that folder and you'll find a file, named with the remaining characters of the hash, that holds the contents of adult.data.

`cd .dvc/cache/{folder name}`  
`cat {file name}`  

This cache is important: it is how DVC keeps track of your different data versions.

To truly take advantage of data version control, we should store our data somewhere else, rather than locally. This way, your coworkers can access the same data. You can store the data in any number of cloud storage options, but for this demo and lab, we will use [Dagshub](https://dagshub.com). Dagshub is already integrated with DVC, it's free, and it provides 10 GB of data storage.
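To make the link between the hash and the cache folder concrete, here is a minimal sketch in Python. The helper names `md5_of_file` and `cache_location` are hypothetical, and the layout assumes the `<first two characters>/<rest of hash>` structure described above. The exact folder nesting, and how DVC hashes text files, can differ between DVC versions, so the result may not match your `.dvc` file exactly.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_location(md5_hex, cache_root=".dvc/cache"):
    """Split a hash into <first two characters>/<remaining characters>.

    Assumption: this mirrors the folder layout described above; newer DVC
    releases may nest it further (for example under files/md5/).
    """
    return Path(cache_root) / md5_hex[:2] / md5_hex[2:]
```

Calling `cache_location(md5_of_file("data/adult.data"))` from the project root should then give a path you can compare against what you see in the cache folder.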

### Dagshub

Dagshub markets itself as a Github for data science collaboration because it goes further than acting as a remote repository for your code. It's still a very new tool, and is adding new features, but I would currently describe it as a tool that can do versioning for code, data, models and experiments. For example, it has integrations with MLflow, so you can create a Dagshub repo and use Dagshub as your remote server for tracking your experiments and artifacts instead of setting up your own remote server or tracking your experiments and artifacts locally. 

For this demo we will only use Dagshub's DVC features, but you are free to explore it further to see if it is a tool you would enjoy using for other purposes.

1. Create a Dagshub account [here](https://dagshub.com).  
2. Take a look at Dagshub's features [here](https://dagshub.com/docs/).  
3. Create a new Dagshub repository. Do not use one of the templates, just create a blank repository.  
4. Give the repository a name, make it public or private (doesn't matter), and write a brief description.  
5. We can now push data to this Dagshub repo using DVC. You will need to set up the remote (similar to git remote) on your local machine so that DVC knows where to push your data. 

### DVC, git and Dagshub

I called my Dagshub repository `mlops-project`.  To set up the remote I use  
`dvc remote add origin https://dagshub.com/rclements/mlops-project.dvc`

Then you will need to configure things:  
`dvc remote modify origin --local auth basic`  
`dvc remote modify origin --local user your_username`

Lastly, you will need to supply a password, for which you use a Dagshub access token. You can usually find your tokens in Dagshub:

![](images/dagshubpw.png)

`dvc remote modify origin --local password your_token`  

To make sure it worked you can run  
`dvc remote list`  

You should also see a new remote added to your .dvc/config file.

To push your code to your Dagshub repo, set up a second git remote (call it dagshub). Once it is added, you can push code to both Github and Dagshub.  
`git remote add dagshub https://your_username:your_token@dagshub.com/rclements/mlops-project`  

And now you can push your code and data to Dagshub using  
`git push dagshub main`  
`dvc push -r origin`

Go to your Dagshub repository and you should be able to see your code, as well as a diagram showing your data pipeline:

![](images/dagshubdvc.png)

### Confusion Points

It can be tricky keeping track of your different code and data changes, and your code and data repos, but it is something you will get used to if you continue to use DVC and multiple remote repos like Github and Dagshub. The important points to remember right now are:

- git is for your code, not your data, so make sure to **ignore** your data files  
- dvc is for your data,  not your code  
- Github is for your code only  
- Dagshub is for your data **and** your code  
- If you use a different cloud storage for your data, such as AWS or GCP, then code goes in Github and data goes in that cloud storage  

### Simple Example

We can now modify our toy dataset, and *commit* and *push* those changes. I'm going to now add our `adult.test` data to the folder, and I'll go ahead and add it to DVC.

`dvc add data/adult.test`

Remember now to add the adult.test.dvc and any other changes you've made to your .gitignore files to your git repo. Then, push code changes to Github.

`git add data/.gitignore data/adult.test.dvc`  
`git commit -m 'add adult test data'`  
`git push origin main`  

And now you can push the data and code changes to Dagshub.

`git push dagshub main`  
`dvc push -r origin`  

If you go to Dagshub you will see the new adult.test data in the data pipeline diagram. Now, let's combine adult.data and adult.test. 

`wc -l data/adult.data`  
`cat data/adult.test >> data/adult.data`  
`wc -l data/adult.data`

The adult.data file has been modified. Let's add it to DVC and push the changes.

`dvc add data/adult.data`  
`git add data/adult.data.dvc`  
`git commit -m 'modify adult.data'`  
`git push origin main`  
`git push dagshub main`  
`dvc push -r origin`  

Now suppose we want to roll back to a previous version of adult.data. We can use git to roll back to the previous version of adult.data.dvc, and then use DVC to checkout the previous version of the data.

`git checkout HEAD~1 data/adult.data.dvc`  
`dvc checkout`  

You can go back to the most recent version again by running `git checkout HEAD data/adult.data.dvc` and `dvc checkout`, but instead, let's keep the previous version and run with it. Commit the change, then push to both git remotes.

`git commit -m 'revert to previous data version'`  
`git push origin main`  
`git push dagshub main`  
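The rollback works because git only versions the small pointer file (the hash), while the cache keeps every version of the actual data under its hash. The following is a toy sketch of that idea, not DVC's actual implementation; the `ToyCache` class and `demo` function are hypothetical names, and the rows are made-up placeholder data.

```python
import hashlib
import tempfile
from pathlib import Path


class ToyCache:
    """A tiny content-addressed store: files are saved under their MD5 hash."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def add(self, data_path):
        """Copy the file's bytes into the cache and return the hash (the pointer)."""
        content = Path(data_path).read_bytes()
        digest = hashlib.md5(content).hexdigest()
        (self.root / digest).write_bytes(content)
        return digest

    def checkout(self, digest, data_path):
        """Restore the workspace file from the cached copy named by the pointer."""
        Path(data_path).write_bytes((self.root / digest).read_bytes())


def demo():
    with tempfile.TemporaryDirectory() as workdir:
        cache = ToyCache(Path(workdir) / "cache")
        data = Path(workdir) / "adult.data"

        data.write_text("row1\nrow2\n")
        pointer_v1 = cache.add(data)      # roughly: dvc add, then commit the .dvc file

        with data.open("a") as handle:    # roughly: cat adult.test >> adult.data
            handle.write("row3\n")
        pointer_v2 = cache.add(data)      # a new hash, so a new version

        cache.checkout(pointer_v1, data)  # roughly: restore old pointer, then dvc checkout
        return pointer_v1 != pointer_v2, data.read_text()
```

Running `demo()` returns whether the two pointers differ, along with the restored file contents, which are the original two rows.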

### DVC Pipelines

A typical pipeline for a modeling experiment might have the following stages:

- Import data and create train/validation/test splits   
- Clean data and create features  
- Model training  
- Model evaluation

DVC is capable of keeping track of an entire pipeline like this, though for our purposes we will not include model training and evaluation. 
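One way to picture such a pipeline is as a dependency graph in which each stage lists the stages it needs to have run first. The sketch below uses Python's standard `graphlib` module (Python 3.9+) to derive a valid execution order. The stage names are hypothetical labels for the four stages listed above, not names that DVC requires.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on (hypothetical stage names).
stages = {
    "split_data": set(),
    "create_features": {"split_data"},
    "train_model": {"create_features"},
    "evaluate_model": {"train_model"},
}


def execution_order(graph):
    """Return the stages in an order that respects every dependency."""
    return list(TopologicalSorter(graph).static_order())
```

For this linear chain, `execution_order(stages)` lists the stages from splitting the data through to model evaluation.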

We've already created the first stage of our pipeline when we added our train and test datasets to DVC. To create the next stage in the data pipeline using DVC, we will create a script that reads in our train and test datasets, makes some changes, and outputs two new datasets for use later. We call this script `create_features.py`. This script loads the data, uses a Pipeline (from sklearn) to transform the data, writes the Pipeline to `pipeline.pkl` and the two transformed data sets to `processed_train_data.csv` and `processed_test_data.csv`. 

We can use DVC to create this new *stage* in our pipeline. We use the -n argument to name the stage, the -d argument to specify the dependencies, and the -o argument to specify the outputs for this stage. `dvc stage add` records the stage in `dvc.yaml`, and `dvc repro` then runs the code in `create_features.py`.
```
dvc stage add -n featurization \
-d data/adult.data \
-d data/adult.test \
-d src/create_features.py \
-o data/pipeline.pkl \
-o data/processed_train_data.csv \
-o data/processed_test_data.csv \
python src/create_features.py

dvc repro
```

The above should run successfully, and you should notice a new `dvc.yaml` file was created. Take a look and you'll see the details of the new pipeline you created. Once the stage has run, DVC will also tell you that a new dvc.lock file was created. From here, you should add the new dvc.lock, dvc.yaml, and the data/.gitignore to the git repo, as well as the create_features.py script. Then commit and push to Github and Dagshub. Now, when you go to Dagshub, you should see something similar to this:

![](images/dagshub2.png)

### Parameters

Suppose your pipeline depends on a set of parameters, for example a random seed for doing train/test splits. We can place those parameter values in a **params.yaml** file, and use DVC and git to keep track of changes to these parameters. We would need to change our script to include the yaml file and set the parameters.

Create a params.yaml file that looks like this:
```
features:  
    chi2percentile: 50
    train_name: adult.data  
    test_name: adult.test
```
Then change the script so that the percentile used for feature selection reads from the yaml file instead of being hard-coded:

`import yaml`  
`params = yaml.safe_load(open("params.yaml"))["features"]`  
`chi2percentile = params["chi2percentile"]`  

and update this line:

`("selector", SelectPercentile(chi2, percentile=chi2percentile)),`

I've included a new version of the script, called `create_features_w_params.py`, that you can use for this instead. Now, we'll want to replace the previous pipeline we created with this new pipeline with parameters. We can overwrite it by using `--force`. The `-p` argument tells DVC which parameters from params.yaml the stage depends on, so that changes to them are tracked.
```
dvc stage add -n featurization --force \
-p features.chi2percentile \
-d data/adult.data \
-d data/adult.test \
-d src/create_features_w_params.py \
-o data/pipeline.pkl \
-o data/processed_train_data.csv \
-o data/processed_test_data.csv \
python src/create_features_w_params.py
```

When you create pipelines using DVC, you can easily reproduce entire pipelines by running `dvc repro`. Since we just redefined the stage, run `dvc repro` once so that the new outputs are created and cached. After that, if we accidentally change or delete one of the output files, such as `data/processed_test_data.csv`, we can simply run `dvc repro` to get it back. In fact, delete both processed data files and run `dvc repro`. Because nothing else about the stage has changed, DVC should restore both files from the cache rather than rerunning the script to recreate them.
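The decision that `dvc repro` makes can be sketched as a comparison between the current dependency hashes and the hashes recorded the last time the stage ran. This is a simplified toy model under stated assumptions, not DVC's code: the function names and the JSON lock format are hypothetical, and real DVC also records the command, parameters, and output hashes in `dvc.lock`.

```python
import hashlib
import json
from pathlib import Path


def file_md5(path):
    """Return the hex MD5 digest of a file's contents."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def write_lock(deps, lock_path):
    """Record the current hash of every dependency (a toy stand-in for dvc.lock)."""
    hashes = {str(dep): file_md5(dep) for dep in deps}
    Path(lock_path).write_text(json.dumps(hashes, indent=2))


def plan_stage(deps, outs, lock_path):
    """Return 'run', 'restore', or 'skip' for a single stage.

    'run'     : never run before, or a dependency has changed
    'restore' : dependencies unchanged but an output file is missing
    'skip'    : everything is up to date
    """
    lock_path = Path(lock_path)
    if not lock_path.exists():
        return "run"
    recorded = json.loads(lock_path.read_text())
    current = {str(dep): file_md5(dep) for dep in deps}
    if current != recorded:
        return "run"
    if any(not Path(out).exists() for out in outs):
        return "restore"
    return "skip"
```

In this toy model, deleting an output while leaving the dependencies alone yields `"restore"`, which corresponds to the case above where the files come back from the cache without the script being rerun.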

### DVC Recap

We've shown that we can use DVC to keep track of data versions and pipelines:

- Initialize DVC, similar to a git repository  
- Add datasets to DVC so they are tracked  
- Use git to keep track of the versions of the data through the `{data_name}.dvc` files  
- Use a remote storage for storing your data  
- Roll back to previous versions of data using git  
- Create reproducible pipelines with or without parameters by adding `stages` and using `repro`  
- Use Dagshub, if you want, for tracking code, data and experiments

DVC actually does a whole lot more than this, some of which overlaps with other tools. 



## Reproducible Pipelines

If you don't want to learn how to use DVC, or if it doesn't seem useful, you can instead ensure you can reproduce your data with code. The key here is that each dataset you create, including intermediate datasets, should be reproducible by running a script, with optional parameters. Suppose you create intermediate dataset D1, and then you join D1 with another dataset and create D2, and then you do some feature engineering and transformations that result in a final dataset D3. Your code should be written in such a way that you can run the script and recreate the exact same D1, D2, and D3 datasets.

There are necessary conditions for this. In particular, your data sources (the raw data that your pipeline begins with) should remain static in some way: you should be able to pull the exact same data without worrying about changes to the schema, data types, column names, data values, etc. If the source of the data suddenly has 10 additional columns, or the data has been changed or updated over time, then your intermediate and final datasets will likely be different or difficult to reproduce.

Note that you should be able to track any parameter values you use when you run the pipeline. It doesn't help to write reproducible pipelines if you can't remember what parameter values created which versions of the data.
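One lightweight way to do this, sketched below, is to append a record to a manifest file each time the pipeline runs, pairing the parameter values with a hash of the data they produced. The function names, the manifest format (one JSON object per line), and the file name you pass in are all assumptions for illustration.

```python
import hashlib
import json


def manifest_entry(params, output_bytes):
    """Pair the parameter values with a hash of the data they produced."""
    return {
        "params": params,
        "output_md5": hashlib.md5(output_bytes).hexdigest(),
    }


def append_to_manifest(manifest_path, entry):
    """Append one JSON record per line so earlier runs are never overwritten."""
    with open(manifest_path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry, sort_keys=True) + "\n")
```

Keeping this manifest under git alongside the code gives you a history of which parameters created which version of the data.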

Our **create_features.py** script attempts to create a reproducible pipeline using the Pipeline class from sklearn, but there are several improvements we can make.

1. Create functions for each step, with parameters that make sense (e.g. the column names list, file paths) and put these at the top of the script  
2. Call these functions at the bottom after `if __name__ == "__main__":` 
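The two improvements above could look something like the following skeleton. It is a sketch using only the standard library rather than the sklearn Pipeline from the actual script. The function names, the default test fraction and seed, and the assumption that `?` marks a missing value are all illustrative choices.

```python
import csv
import random
import sys


def load_rows(path):
    """Read a CSV file into a list of rows, skipping blank lines."""
    with open(path, newline="") as handle:
        return [row for row in csv.reader(handle) if row]


def clean_rows(rows):
    """Strip whitespace from every field and drop rows containing '?'.

    Assumption: '?' marks a missing value in the raw data.
    """
    cleaned = [[field.strip() for field in row] for row in rows]
    return [row for row in cleaned if "?" not in row]


def split_rows(rows, test_fraction, seed):
    """Shuffle with a fixed seed so every run produces the same split."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def main(input_path, test_fraction=0.2, seed=42):
    """Run each step in order and return the (train, test) rows."""
    rows = clean_rows(load_rows(input_path))
    return split_rows(rows, test_fraction, seed)


if __name__ == "__main__":
    # Hypothetical usage: python sketch.py data/adult.data
    if len(sys.argv) == 2:
        train_rows, test_rows = main(sys.argv[1])
        print(len(train_rows), len(test_rows))
    else:
        print("usage: python sketch.py <path-to-csv>")
```

Because every step is a function with explicit parameters, the same inputs and the same seed give the same outputs on every run, and each step can be tested on its own.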

### Orchestration Engines

To run a data pipeline there are several tools available that are especially useful for data management, such as [Airflow](https://airflow.apache.org/), [Prefect](https://www.prefect.io/), [Dagster](https://dagster.io/), and [dbt](https://www.getdbt.com/). So. Many. Tools. These tools are what you would use to create proper production-grade data pipelines. 


# Data Versioning Lab

## Overview

In this lab you will practice using DVC and Dagshub to create data pipelines.

## Goal

The goal in this lab is to become familiar with the importance of keeping track of different versions of your data sets. Although we will use DVC in this lab, we are **not** trying to learn all we can about it. DVC has too much functionality for us to learn. 

## Instructions

Using the data that you've chosen for your project, start using DVC with Dagshub as remote storage, and create a reproducible pipeline for data preprocessing. 

- Create a Dagshub account  
- Create a new repository for your MLOps project  
- Set up your git and DVC remotes on your local machine  
- Add your datasets to DVC, and push code changes to Github and Dagshub, and your data to Dagshub (if it's less than 10 GB)  

At this point you should create a script that does only the data preprocessing stages of the pipeline. Any modeling stages can be kept in a separate script. You may need to first work inside of a notebook to finalize your code. 

Once your script works, 

- use `dvc stage add` followed by `dvc repro` (or `dvc run`, if you are on an older DVC version that still provides it) to run the script and track the dependencies and the outputs  
- commit and push changes to Github and Dagshub 

### Final Project

For the final project, after you've completed the above steps, see if you are able to repeat some of those steps in some way using at least one other tool. Specifically, can you easily start tracking data versions, storing data remotely, and creating reproducible pipelines? Thankfully, Neptune has written a nice blog that talks about alternatives to DVC [here](https://neptune.ai/blog/best-data-version-control-tools). Also, check [here](https://mymlops.com/builder) for a list of other options, which include Weights & Biases, Git LFS, and Pachyderm. **Note** that you can also choose to compare DVC to simply creating your own reproducible pipelines instead of one of these other tools. As I've said before, these tools may or may not even be necessary, and it's up to you to decide, but be prepared to convince me that you are able to do effective data versioning yourself if you decide not to use a tool for it.

Once you've worked with other tools or approaches, make a comparison. List out pros and cons of each. Think about user-friendliness, costs, integrations with other tools, etc. You **do not** need to decide which tool you'd pick for your stack yet, because that may depend on the other tools you choose for the rest of the pipeline. Just be sure to write up a good comparison, with evidence, so that you can include it in your final presentation.