# Lecture 25: Data Version Control II

![](https://www.tensorflow.org/images/colab_logo_32px.png)
[Run in colab](https://colab.research.google.com/drive/106YjL7FM57HsYQMU3X5DZM2eEt_Nyewd)

```python
import datetime
now = datetime.datetime.now()
print("Last executed: " + now.strftime("%Y-%m-%d %H:%M:%S"))
```

This lecture is part of a series on [Data Version Control (DVC)](https://dvc.org), a way of systematically keeping track of different versions of models and datasets.

This second lecture in the series will cover
- including files from external sources
- automation: creating and rerunning pipelines

## Adding external files

We have seen how we can track our own files with **dvc add**. But what if we want to include data or other files that are already available?

DVC offers different options:
- **dvc get** to download a file from a DVC or git repository
- **dvc import** to reuse a file from a DVC or git repository
- **dvc get-url** and **dvc import-url** to copy or reuse a file from general remote storage, e.g. S3

The [`get`](https://dvc.org/doc/command-reference/get) and [`import`](https://dvc.org/doc/command-reference/import) commands are similar. The difference is that the latter also links to the original file and tracks its history. Therefore, if an update is made to the original repository later on, we can get the changes to our copy (see [`dvc update`](https://dvc.org/doc/command-reference/update)).

The [`get-url`](https://dvc.org/doc/command-reference/get-url) and [`import-url`](https://dvc.org/doc/command-reference/import-url) variants are useful when the original data is not already in a repository. They can handle different storage protocols and providers.
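As a rough illustration, the sketch below assembles the command lines for each option and prints them. The repository URL, the path inside it and the bucket address are hypothetical placeholders, not real locations. The block only builds and prints the commands; it does not run DVC.

```python
import shlex

# Hypothetical source locations, used only for illustration.
REPO = "https://github.com/example/dataset-registry"
PATH_IN_REPO = "data/samples.csv"
BUCKET_URL = "s3://example-bucket/samples.csv"

commands = {
    # One-off download: no link to the source is kept.
    "get": ["dvc", "get", REPO, PATH_IN_REPO, "-o", "samples.csv"],
    # Download and record where the file came from (in samples.csv.dvc).
    "import": ["dvc", "import", REPO, PATH_IN_REPO, "-o", "samples.csv"],
    # The same pair of options for data that is not in a repository.
    "get-url": ["dvc", "get-url", BUCKET_URL, "samples.csv"],
    "import-url": ["dvc", "import-url", BUCKET_URL, "samples.csv"],
    # Later on, bring an imported file up to date with its source.
    "update": ["dvc", "update", "samples.csv.dvc"],
}

for name, args in commands.items():
    print(f"{name:<11}", shlex.join(args))
```

Each printed line can be pasted into a terminal inside a DVC project, or passed to `subprocess.run(args, check=True)` if DVC is installed.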

## Creating pipelines

One of the core benefits of using a data-focused version control system is that we can structure our work around data flows, not individual files.

With DVC, we can
- specify each stage of a pipeline
- infer the connections between them
- run a whole pipeline or only the parts required

Let's consider this example (you can find a [toy implementation](https://github.com/ucl/dvc-example-pipeline) of it on GitHub):


![Image of a data pipeline with a set of samples undergoing PCA and logistic regression. Each step is represented as a box with arrows linking them.](Lecture25_Images/pipeline_example.svg)

We start with a set of labelled samples contained in a file `samples.csv`.

The first step is to run the code in `reduce_dim.py`. This reads the inputs and performs PCA on them to reduce the dimensionality of the problem. The dataset produced is stored in `reduced.csv`.

We then use this reduced dataset to train a logistic regression classifier. The code for this is in the file `log_reg.py`. After the training is complete, the code serializes the trained model and stores it in Pickle format in `classifier.pkl`, so it can be shared and reused.

### Representing a pipeline

DVC sees pipelines as collections of steps. Each step has some inputs or **dependencies** and produces **outputs**. The different stages are linked to each other through these.

Furthermore, a pipeline can have one or more **parameters** that control its stages and customize their behaviour.

DVC stores descriptions of pipelines in a structured file written in the YAML format. Here is how the above example could be represented:

```yaml
stages:
  pca:
    cmd: python reduce_dim.py
    deps:
      - reduce_dim.py
      - samples.csv
    params:
      - total_var
    outs:
      - reduced.csv 
  classification:
    cmd: python log_reg.py
    deps:
      - log_reg.py
      - reduced.csv
    outs:
      - classifier.pkl
```

Every step has its own section, appropriately named (`pca`, `classification`). Each section specification has:
- the command to run that step (`cmd`)
- a list of dependencies (`deps`) which feed into the step
- a list of outputs (`outs`) that the step produces
- a list of parameters (`params`) that control that step

Note that the dependencies include the code itself, as well as any input files!
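To see how stages can be linked through their dependencies and outputs, here is a minimal sketch that rebuilds the connections from the description above. It is a toy model written for this lecture, not DVC's own code: the stage description is copied into a plain dictionary, and the helper `stage_graph` is a name introduced here for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# The two stages from the YAML description, written as a plain dictionary.
stages = {
    "pca": {
        "deps": ["reduce_dim.py", "samples.csv"],
        "outs": ["reduced.csv"],
    },
    "classification": {
        "deps": ["log_reg.py", "reduced.csv"],
        "outs": ["classifier.pkl"],
    },
}

def stage_graph(stages):
    """Map each stage to the set of stages whose outputs it depends on."""
    # Which stage produces each output file?
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    return {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }

graph = stage_graph(stages)
order = list(TopologicalSorter(graph).static_order())
print(graph)  # classification depends on pca, because pca produces reduced.csv
print(order)  # an order in which the stages can be executed
```

No stage names its predecessor explicitly: the link between `pca` and `classification` follows purely from `reduced.csv` appearing as an output of one and a dependency of the other.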

The parameters are stored in a separate file. By default this is called `params.yaml`, but other file names and formats (such as JSON or TOML) are allowed.

In this example, the file needs to contain one parameter (the total variance explained by the chosen PCA dimensions):

```yaml
total_var: 0.9
```
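The lecture does not show how `reduce_dim.py` uses this value. One plausible reading, sketched below as an assumption, is that the script keeps the smallest number of principal components whose cumulative explained variance reaches `total_var`. The function name and the variance ratios are made up for illustration.

```python
from itertools import accumulate

def n_components_for(explained_variance_ratio, total_var):
    """Smallest number of leading components whose cumulative ratio reaches total_var."""
    for k, cumulative in enumerate(accumulate(explained_variance_ratio), start=1):
        if cumulative >= total_var:
            return k
    # Fall back to keeping every component if the target is never reached.
    return len(explained_variance_ratio)

# Made-up explained variance ratios for five principal components (they sum to 1).
ratios = [0.55, 0.25, 0.12, 0.05, 0.03]

print(n_components_for(ratios, 0.9))   # components kept with total_var = 0.9
print(n_components_for(ratios, 0.95))  # a higher target keeps more components
```

With these ratios, raising `total_var` from 0.9 to 0.95 keeps one extra component, so a small change in `params.yaml` changes the contents of `reduced.csv` and everything downstream of it.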

### Creating steps

Rather than write the above file all at once, we can run the steps in order and let DVC create the file.

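To do this, we need to tell DVC how to execute each step. Note that `dvc run` is the older form of this command: newer DVC releases replace it with `dvc stage add`, which takes the same options but only records the step and leaves execution to `dvc repro`. The command, run at the terminal, looks like this: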

```bash
dvc run -n pca \
        -d reduce_dim.py -d samples.csv \
        -p total_var \
        -o reduced.csv \
        python reduce_dim.py
```

The different aspects are given in the command options:
- `-n`: name of the step
- `-d`: dependencies
- `-p`: parameters
- `-o`: outputs
- finally, the command to run

### Reproducing a pipeline

We can run all steps in a pipeline using the command `dvc repro`.

Often, running all steps will be redundant. For example, if we have only made changes to the `log_reg.py` file since the last time we ran the pipeline, then the previous step is unaffected and does not need to be rerun. 

However, if we had updated the samples file or modified the parameter controlling the PCA step, then that step would need to be rerun, as well as all steps downstream of it.

By tracking changes to files and parameters and analysing the structure of the pipeline, DVC can infer which steps are affected and rerun only those.
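The reasoning in this section can be sketched in a few lines. This is a simplified model, not DVC's implementation: it takes the set of changed items as given (DVC would work this out itself, for example by comparing file hashes), it treats the parameter as one more dependency, and it assumes the stages are listed in execution order. The function name `stages_to_rerun` is introduced here for illustration.

```python
# Toy model of the example pipeline: what each stage reads and writes.
stages = {
    "pca": {
        "deps": {"reduce_dim.py", "samples.csv", "total_var"},
        "outs": {"reduced.csv"},
    },
    "classification": {
        "deps": {"log_reg.py", "reduced.csv"},
        "outs": {"classifier.pkl"},
    },
}

def stages_to_rerun(stages, changed):
    """Return the stages affected by the changed items, in execution order."""
    stale = set(changed)
    rerun = []
    for name, spec in stages.items():
        if spec["deps"] & stale:
            rerun.append(name)
            # Outputs of a rerun stage invalidate the stages downstream of it.
            stale |= spec["outs"]
    return rerun

print(stages_to_rerun(stages, {"log_reg.py"}))  # only the classifier step
print(stages_to_rerun(stages, {"total_var"}))   # PCA and everything after it
print(stages_to_rerun(stages, set()))           # nothing to do
```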

## Metrics and outputs

DVC also allows us to compare the performance of our models as we make changes to them. We do this by declaring **metrics** as part of a pipeline.

In the above example, let's assume we have an additional step which evaluates our trained classifier and stores the results in a file `scores.json`. If we compute two performance metrics, the precision and the area under the ROC curve, the results may look like:

```json
{ "precision": 0.63, "roc_auc": 0.85 }
```

We can record this by expanding the pipeline description with an extra step, like this:
```yaml
stages:
  (...as above...)
  evaluation:
    cmd: python evaluate.py
    deps:
      - evaluate.py
      - classifier.pkl
    metrics:
      - scores.json
```

Running the whole pipeline with `dvc repro` will now also produce the file with the scores.
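The contents of `evaluate.py` are not shown in this lecture. As a hedged sketch of what such a script might do, the block below computes the two metrics by hand and writes them to `scores.json`. The labels and scores are made up, and loading `classifier.pkl` is left out to keep the example self-contained.

```python
import json

def precision(y_true, y_pred):
    """Fraction of predicted positives that are truly positive."""
    predicted_pos = [t for t, p in zip(y_true, y_pred) if p == 1]
    return sum(predicted_pos) / len(predicted_pos) if predicted_pos else 0.0

def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and classifier scores, standing in for a real test set.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.6, 0.7, 0.4, 0.3, 0.2]
y_pred = [int(s >= 0.5) for s in scores]  # threshold the scores at 0.5

results = {"precision": precision(y_true, y_pred), "roc_auc": roc_auc(y_true, scores)}

# Write the metrics in the flat JSON layout shown earlier.
with open("scores.json", "w") as f:
    json.dump(results, f)
```

The only part that matters to DVC is the final file: any script that writes its metrics to the path listed under `metrics` would fit the stage description.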

Let's make a change to the parameters file, e.g. to increase `total_var` to 0.95.

DVC will notice this and can show us the difference if we run `dvc params diff`:

```
Path         Param          Old    New
params.yaml  total_var      0.9    0.95
```

We can also easily inspect what effect this has on the performance metrics. If we run the pipeline again, then `dvc metrics diff` will show us how the metrics have changed:

```
Path         Metric     Old     New      Change
scores.json  precision  0.63    0.65     0.02
scores.json  roc_auc    0.85    0.91     0.06
```

In two commands, we can run a whole series of steps and inspect the results. In this case, we see that keeping more of the variance after PCA has improved performance.
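The comparison itself is simple arithmetic on two versions of the metrics file. The sketch below reproduces the numbers in the table from two JSON strings. It only illustrates the idea and is not how DVC is implemented; the function name `metrics_diff` is hypothetical.

```python
import json

def metrics_diff(old_json, new_json, ndigits=4):
    """Return (metric, old, new, change) rows for metrics present in both versions."""
    old, new = json.loads(old_json), json.loads(new_json)
    return [
        # Round the difference to hide floating-point noise such as 0.020000000000000018.
        (name, old[name], new[name], round(new[name] - old[name], ndigits))
        for name in old
        if name in new
    ]

# The metrics before and after the parameter change, as shown in the table above.
old_scores = '{ "precision": 0.63, "roc_auc": 0.85 }'
new_scores = '{ "precision": 0.65, "roc_auc": 0.91 }'

rows = metrics_diff(old_scores, new_scores)
for name, before, after, change in rows:
    print(f"{name:<10} {before:<6} {after:<6} {change}")
```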

## Other DVC features

- **Automated plots** for metrics
- **Experiments** collect different runs and present them cleanly
- **Pushing** to remote storage, for backups and sharing
- **Python API** lets us call DVC functionality from a program (e.g. add a new file)

## MLflow

[MLflow](https://mlflow.org/) is a tool for tracking and rerunning machine learning workflows. It is similar to DVC but is controlled from a program rather than the command line, and is primarily focused on Python workflows.

- Tracking of code, data files, parameter configurations, environment, results.
- (Re)running code locally or remotely.
- Visual interface for monitoring progress.
- Integrations with specific toolchains and frameworks, e.g. TensorFlow.

## Summary
- DVC supports including data and files from other projects, from a variety of sources.
- Combining steps in pipelines makes it simpler to rerun analyses without manual intervention.
- Apart from files, DVC can also track metrics and automatically present the effect of changes.