# Lecture 24: Data Version Control I

![](https://www.tensorflow.org/images/colab_logo_32px.png)
[Run in colab](https://colab.research.google.com/drive/1jwHTnoQx45v_19H6Vy6zD4IrMVKOheac)

In [1]:
import datetime
now = datetime.datetime.now()
print("Last executed: " + now.strftime("%Y-%m-%d %H:%M:%S"))

Last executed: 2022-03-07 01:31:10


This lecture is part of a series on [Data Version Control (DVC)](https://dvc.org), a way of systematically keeping track of different versions of models and datasets.

This first lecture in the series will cover
- why using DVC is a good idea
- how to track files and move between versions

## What is version control?

<img src="http://phdcomics.com/comics/archive/phd101212s.gif" width="400" alt="Series of drawings of a graduate student making changes to a manuscript based on his supervisor's comments, with his frustration and file names progressively increasing."/>

(From [PHD Comics](http://phdcomics.com/comics/archive.php?comicid=1531))

Instead of having multiple copies or working on a shared version:
- **Track changes** in distinct stages ("commits") as you work
- Move backwards and forwards in history
- Explore different alternatives ("branches")
- Share entire history with others

Different systems: **git**, subversion, hg, ...

We start our work by committing the state of our code or data. Each commit we create is given a unique identifier:

![Diagram of a single commit, represented as a circle](Lecture24_Images/git_one.png)

As we work, we make more commits:

![Diagram of two commits with an arrow from the first to the second](Lecture24_Images/git_two.png)

Sometimes we make mistakes:

![Diagram of three linked commits, where the third is highlighted as wrong](Lecture24_Images/git_wrong.png)

After realising the error, we can go back and fix it, replacing it with a new commit:

![Diagram of three commits, where the previous mistake has been replaced with a new, fixed commit](Lecture24_Images/git_fixed.png)

Often, we want to try out different approaches before we decide on what's best:

![Diagram of a commit history with two branches](Lecture24_Images/git_branch.png)

This results in a non-linear history. If we want, we can also merge the two branches:

![Diagram of a commit history where two branches split off and are then rejoined](Lecture24_Images/git_merge.png)
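The commit, branch and merge steps pictured above can be tried out with git on a throwaway repository. This is only a minimal sketch: the file names, branch name (`experiment`), commit messages and user identity are all made up for illustration.

```shell
# Work in a temporary directory so that nothing existing is touched
demo=$(mktemp -d)
cd "$demo"
git init -q

# Identity used for this demo repository only (placeholder values)
git config user.name "Demo User"
git config user.email "demo@example.com"

# First commit
echo "first draft" > notes.txt
git add notes.txt
git commit -q -m "First commit"

# Remember the default branch name ("master" or "main", depending on setup)
main_branch=$(git symbolic-ref --short HEAD)

# Explore an alternative on a separate branch
git checkout -q -b experiment
echo "alternative idea" > idea.txt
git add idea.txt
git commit -q -m "Try an alternative"

# Meanwhile, continue on the original branch
git checkout -q "$main_branch"
echo "second draft" >> notes.txt
git commit -q -am "Continue main line"

# Merge the two branches and draw the resulting non-linear history
git merge -q --no-edit experiment
git --no-pager log --oneline --graph --all
```

The two branches touch different files, so the merge completes without conflicts.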

## Why data version control?

Similar principles apply to data workflows as to code:

- Mistakes happen!
- New data appears
- Try variants of model (e.g. algorithm or its parameters) or data pipeline (e.g. preprocessing)

Git is not only for source code files. However, a dedicated data-focused solution is more attractive:

- Git does not handle very large files efficiently
- Thinking in terms of data workflows (models, parameters, inputs, ...) offers new useful functionality, e.g. reproducibility, metrics
- Better integration with remote data providers e.g. S3
- Can still use git under the hood, keeping code and data versioned simultaneously

## Getting started with DVC

DVC is a command-line application that runs on any platform. Follow the [installation instructions](https://dvc.org/doc/install) to get it on your computer.

To follow along, first create a new directory, switch to it and download the sample data there by running this command on a terminal:

```bash
dvc get https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml
```

This should create a directory called `data` in your new directory, with a file called `data.xml` inside it.

(This walkthrough is based on the [official tutorial](https://dvc.org/doc/start).)

### Example: tracking a single file

First, initialize your directory as a dvc (and git) repository, so they can start tracking changes:
```bash
git init
dvc init
```

After the above, dvc creates some new files and gives you a hint about what to run: "You can now commit the changes to git."

```bash
git commit -m "Initial setup"
```


#### Initializing tracking
We are not tracking any files yet. Let's tell dvc to track the dataset we downloaded:

```bash
dvc add data/data.xml
```

As before, dvc creates some internal files and tells us what to commit with git:
> To track the changes with git, run:
>
>	git add data/data.xml.dvc data/.gitignore

Run the command it suggests, and then commit:
```bash
git commit -m "Add initial version of dataset"
```

Note that this is different from the usual git workflow. Normally, we would be `add`ing the data file itself (`data.xml`).

Instead, we are adding a small "proxy" file (`data.xml.dvc`), which dvc knows represents the original dataset. The `data/.gitignore` file that we add alongside it tells git to ignore the data file itself.

(To verify the size difference, check `ls -lh data`; the original data takes up 36MB, while the proxy file is only 80 bytes long.)
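To get a feel for why a proxy file can stay tiny however large the data is, here is a hand-made illustration. It is not the file that dvc itself writes. It only mimics the idea of storing a content hash and a path in place of the data. The file names `big.bin` and `big.bin.pointer` are invented, and `md5sum` is assumed to be available (it is part of GNU coreutils).

```shell
# Illustration only: a hand-made pointer file, not dvc's own format
demo=$(mktemp -d)
cd "$demo"

# Create a stand-in "dataset" of 1,000,000 bytes
head -c 1000000 /dev/zero > big.bin

# Record a hash of the contents and the path in a small text file
hash=$(md5sum big.bin | cut -d ' ' -f 1)
printf 'md5: %s\npath: big.bin\n' "$hash" > big.bin.pointer

# Compare the two sizes in bytes
data_size=$(wc -c < big.bin)
pointer_size=$(wc -c < big.bin.pointer)
echo "data: $data_size bytes, pointer: $pointer_size bytes"
```

The pointer stays the same size regardless of how large the data file grows, which is what makes it cheap to version with git.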

#### Making changes
During the course of our work, the dataset may change, intentionally or by accident. For simplicity, we will simulate a change by appending a copy of the dataset to itself:

```bash
cp data/data.xml temp.xml  # create a temporary copy
cat temp.xml >> data/data.xml  # append the copy to the original
rm temp.xml  # remove the copy
```

Check the size of the file with `ls -lh data` to verify it has doubled.
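If you prefer an exact check to reading the `ls -lh` output by eye, the byte counts can be compared directly. The sketch below repeats the same copy-and-append steps on a small throwaway file (`sample.xml`, a made-up stand-in for `data/data.xml`) and compares the sizes before and after.

```shell
# Work on a throwaway stand-in file, not the real dataset
demo=$(mktemp -d)
cd "$demo"
printf 'row 1\nrow 2\n' > sample.xml

# Size in bytes before the change
before=$(wc -c < sample.xml)

# Same steps as above: copy, append the copy, remove the copy
cp sample.xml temp.xml
cat temp.xml >> sample.xml
rm temp.xml

# Size in bytes after the change
after=$(wc -c < sample.xml)
echo "before: $before bytes, after: $after bytes"
```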

To register the changes with git and dvc, we run similar commands to before:
```shell
dvc add data/data.xml
git add data/data.xml.dvc  # as suggested by dvc
git commit -m "Double size of dataset"
```

#### Switching versions
Switching to another version happens in two stages.

First, we switch with git:
```bash
git checkout HEAD~
```
(`HEAD~` is the previous commit, which in this case holds the original dataset.)

Then we "synchronise" the files under dvc with
```bash
dvc checkout
```

This finds the version of the data that was recorded in that commit and restores it in the working directory.

Verify that the version changed with `ls -lh data` (look for the original size).

Go back to the newest version (doubled data) with

```bash
git checkout master  # or "main", depending on your default branch name
dvc checkout
```
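The git half of this round trip can be rehearsed without dvc on a throwaway repository. In the sketch below, a plain text file (`pointer.txt`, a made-up name) stands in for `data.xml.dvc`. The identity and commit messages are placeholders. The branch name is looked up and not hard-coded, since it may be `master` or `main` depending on your git configuration.

```shell
# Throwaway repository with two commits
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "version 1" > pointer.txt
git add pointer.txt
git commit -q -m "Original"

echo "version 2" > pointer.txt
git commit -q -am "Changed"

# Look up the current branch name before leaving it
branch=$(git symbolic-ref --short HEAD)

# Step back one commit and read the file as it was then
git checkout -q HEAD~
old=$(cat pointer.txt)

# Return to the tip of the branch
git checkout -q "$branch"
new=$(cat pointer.txt)

echo "previous commit: $old; branch tip: $new"
```

In the real workflow, each `git checkout` is followed by `dvc checkout`, so that the data file matches the proxy file that git has just restored.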

### Summary

This lecture covered the basic usage of dvc: tracking a file and moving between its versions. Building on this, in the next lecture we will see how dvc can be used to track models and entire machine learning workflows.