# Data Versioning & Data Lineage

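In this notebook, we will learn how to version data and record metadata about its origin (its lineage) with DVC.

DVC decouples data versioning from data storage: Git tracks small metadata files that describe the data, while the data itself lives in a separate storage backend.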

## Setup — Please run the cell

In [None]:
!rm -rf sample_data .config
!git config --global user.email "jane@doe.eu"
!git config --global user.name "Jane Doe"
!git config --global init.defaultBranch main
!apt install tree
!pip install dvc dvc-s3

## Git repository creation

Create an account on [DagsHub](https://dagshub.com/) and create a blank repository.

When you have created the repository, click on “*Get started with Data*” and get your secret token from the “*Connection credentials*” section.

Fill in the `owner`, `repo` and `token` variables with your DagsHub username, repository name and secret token respectively.

In [None]:
owner = "m09"
repo = "dvc-labs"
token = "542e264f9623e096e63c24e523c7edf30081de7f"

You can now run the following cell to set up your environment.

In [None]:
!git init
!git remote add origin https://{token}@dagshub.com/{owner}/{repo}.git

## DVC repository setup

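Initialize DVC in the Git repository with [`dvc init`](https://dvc.org/doc/command-reference/init), then commit and push the files it creates.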

In [None]:
!# Your command here, please note the ! that prefixes bash commands in Colab

### Solution

In [None]:
!dvc init

In [None]:
!git commit -m "Initialize DVC"

In [None]:
!git push origin main

## Retrieving a first version of the data

There are several ways to import data with DVC. One of them combines two commands: [`dvc get`](https://dvc.org/doc/command-reference/get), which downloads a file or directory from a Git or DVC repository, and [`dvc add`](https://dvc.org/doc/command-reference/add), which starts tracking that data with DVC.

Use those two commands to import the `wikipedia-movie-plots/wiki_movie_plots_deduped.csv` file from the `https://github.com/shuuchuu/datasets/` repository, using `data.csv` as output name.

In [None]:
!# Your command here, please note the ! that prefixes bash commands in Colab

DVC uses a cache directory. You can inspect it with the following command: `!tree .dvc`. What do you notice?

Feel free to look at the `.dvc` files as well, with the command `!cat name-of-file.dvc`.

### Solution

In [None]:
!dvc get https://github.com/shuuchuu/datasets wikipedia-movie-plots/wiki_movie_plots_deduped.csv -o data.csv

In [None]:
!dvc add data.csv

In [None]:
!cat data.csv.dvc

In [None]:
!tree .dvc
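
To make sense of the `tree .dvc` output: the cache is content-addressed, so a file is stored under a path derived from its hash, and `data.csv.dvc` records that same hash. The sketch below recomputes an MD5 digest in Python and derives a cache-style path from it. It assumes that DVC hashes the raw bytes of the file and uses a `files/md5/<first two characters>/<rest>` layout, as recent DVC 3.x releases appear to do; older releases may differ. `cache_path_for` is a hypothetical helper, not part of DVC.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path_for(md5_hex: str, cache_dir: Path = Path(".dvc/cache")) -> Path:
    """Hypothetical helper: map a digest to a cache-style path (layout assumed)."""
    return cache_dir / "files" / "md5" / md5_hex[:2] / md5_hex[2:]


# Only prints something when data.csv is present in the working directory.
target = Path("data.csv")
if target.exists():
    print(cache_path_for(md5_of(target)))
```

If the assumptions hold, the printed path should match one of the files listed by `tree .dvc`, and the digest should match the `md5` field in `data.csv.dvc`.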

## Adding a file to a git commit

You can now commit the `.dvc` file with Git, following the hint that DVC printed in the output of `dvc add`.

In [None]:
!# Your command here, please note the ! that prefixes bash commands in Colab

### Solution

In [None]:
!git add .gitignore data.csv.dvc

In [None]:
!git commit -m "Add a first version of the data"

In [None]:
!git tag "v1"

In [None]:
!git push origin main v1

## Storage server setup

DVC can store data on several types of storage servers. DagsHub provides an S3-compatible bucket for each repository (S3 is Amazon's object storage service).

Set up the storage server using the information in the “*Data*” section that appears when you click on the green “*Remote*” button of your DagsHub repository, then publish your data.

In [None]:
!# Your command here, please note the ! that prefixes bash commands in Colab

### Solution

In [None]:
!dvc remote add origin s3://dvc
!dvc remote modify origin endpointurl https://dagshub.com/{owner}/{repo}.s3
!dvc remote modify origin --local access_key_id {token}
!dvc remote modify origin --local secret_access_key {token}

In [None]:
!dvc push -r origin
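
The `--local` flag used above writes the credentials to `.dvc/config.local`, which DVC keeps out of Git, while the shared settings go to `.dvc/config`. A quick sanity check before committing can confirm that the token did not end up in the shared file. This is only a sketch: `contains_secret` is a hypothetical helper, and `example_secret` is a placeholder to replace with the `token` variable defined earlier when running in the notebook.

```python
from pathlib import Path


def contains_secret(config_path: Path, secret: str) -> bool:
    """Return True if the file exists and contains the secret verbatim."""
    if not config_path.exists():
        return False
    return secret in config_path.read_text(encoding="utf8")


# Placeholder value (assumption); use the real `token` variable in the notebook.
example_secret = "not-a-real-token"

if contains_secret(Path(".dvc/config"), example_secret):
    print("Warning: the secret appears in .dvc/config, which is tracked by Git")
else:
    print("No secret found in .dvc/config")
```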

## Data modification

Run the following cell to modify the data:

In [None]:
!sort -r < data.csv > a && dvc remove data.csv.dvc && mv a data.csv

Now use `dvc add` and `git add` to register the modifications.

In [None]:
!# Your command here, please note the ! that prefixes bash commands in Colab

### Solution

In [None]:
!dvc add data.csv

In [None]:
!cat data.csv.dvc
!tree .dvc
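
The `md5` value in `data.csv.dvc` changes even though the file still holds the same rows, because the hash depends on the exact byte sequence and not just on the set of lines. A minimal sketch of that effect, using made-up sample rows rather than the real dataset:

```python
import hashlib

# Made-up sample rows standing in for data.csv (not the real dataset).
original_lines = ["1901,Film A\n", "1902,Film B\n", "1903,Film C\n"]

# The same rows in reverse-sorted order, similar in spirit to `sort -r`.
reordered_lines = sorted(original_lines, reverse=True)


def md5_of_lines(lines: list[str]) -> str:
    """Hash the concatenated lines as UTF-8 bytes."""
    return hashlib.md5("".join(lines).encode("utf8")).hexdigest()


same_rows = sorted(original_lines) == sorted(reordered_lines)
same_hash = md5_of_lines(original_lines) == md5_of_lines(reordered_lines)
print(f"same rows: {same_rows}, same hash: {same_hash}")
```

This is why the cache now holds two entries: each version of the file is stored under its own hash.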

In [None]:
!git add data.csv.dvc .gitignore

In [None]:
!git commit -m "v2 data"
!git tag "v2"

In [None]:
!git push origin main v2

In [None]:
!dvc push -r origin

## Back to the original data

Use `git checkout` and `dvc checkout` to revert to the original data.

In [None]:
!# Your command here, please note the ! that prefixes bash commands in Colab

You can now commit those changes.

In [None]:
!# Your command here, please note the ! that prefixes bash commands in Colab

### Solution

In [None]:
!git checkout v1 data.csv.dvc

In [None]:
!cat data.csv.dvc
!tree .dvc

In [None]:
!dvc checkout data.csv

In [None]:
!git add data.csv.dvc

In [None]:
!git commit -m "Revert to v1 data"
!git tag v3

In [None]:
!git push origin main v3

In [None]:
!dvc push -r origin
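
The revert above works in two steps: Git restores the old pointer file, then DVC rewrites the data file so that it matches the hash recorded in that pointer. The toy model below imitates this with a flat content-addressed directory. It is a sketch for intuition only: the `store` and `checkout` helpers and the `data.csv.pointer` file are hypothetical and far simpler than what DVC actually does.

```python
import hashlib
import tempfile
from pathlib import Path


def store(cache: Path, content: bytes) -> str:
    """Save content in the toy cache under its MD5 and return the digest."""
    digest = hashlib.md5(content).hexdigest()
    (cache / digest).write_bytes(content)
    return digest


def checkout(cache: Path, pointer: Path, workspace_file: Path) -> None:
    """Rewrite the workspace file from the digest recorded in the pointer."""
    digest = pointer.read_text(encoding="utf8").strip()
    workspace_file.write_bytes((cache / digest).read_bytes())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    cache = root / "cache"
    cache.mkdir()
    pointer = root / "data.csv.pointer"  # toy stand-in for data.csv.dvc
    data = root / "data.csv"

    v1_digest = store(cache, b"a\nb\nc\n")  # version 1
    store(cache, b"c\nb\na\n")              # version 2
    data.write_bytes(b"c\nb\na\n")          # the workspace currently holds v2

    # Step 1 (role of `git checkout v1 data.csv.dvc`): restore the old pointer.
    pointer.write_text(v1_digest, encoding="utf8")
    # Step 2 (role of `dvc checkout data.csv`): sync the data with the pointer.
    checkout(cache, pointer, data)

    restored = data.read_bytes()
    print(restored)
```

Both versions stay in the cache throughout, which is what makes switching between them cheap.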

## Data processing

Run the following cell. It writes a script to `upper.py` that takes two arguments: it reads the file at the path given as the first argument and writes its content, converted to upper case, to the path given as the second argument.

In [None]:
%%writefile upper.py
from pathlib import Path
from sys import argv

Path(argv[2]).write_text(
    Path(argv[1]).read_text(encoding="utf8").upper(),
    encoding="utf8")
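
To check the script's behaviour outside of DVC, the same transformation can be tried on a temporary file. A minimal sketch: `to_upper` is a hypothetical wrapper around the logic of `upper.py`, and the sample row is made up.

```python
import tempfile
from pathlib import Path


def to_upper(source: Path, destination: Path) -> None:
    """Same logic as upper.py, wrapped in a function so it is easy to test."""
    destination.write_text(
        source.read_text(encoding="utf8").upper(),
        encoding="utf8",
    )


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "in.csv"
    destination = Path(tmp) / "out.csv"
    source.write_text("1901,Some Film\n", encoding="utf8")  # made-up row
    to_upper(source, destination)
    result = destination.read_text(encoding="utf8")
    print(result)
```

Note that the whole file is read into memory at once, which is fine for this dataset but would not suit very large files.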

Add a processing stage with [`dvc stage add`](https://dvc.org/doc/command-reference/stage/add) that takes `data.csv` as input and runs the `upper.py` script to produce `data-upper.csv`.

In [None]:
!# Your command here, please note the ! that prefixes bash commands in Colab

Now run this step with [`dvc repro`](https://dvc.org/doc/command-reference/repro) and save the data and metadata with DVC and git.

### Solution

In [None]:
!dvc stage add -n transform-uppercase -d data.csv -o data-upper.csv python upper.py data.csv data-upper.csv

In [None]:
!dvc repro

In [None]:
!git add dvc.yaml .gitignore dvc.lock

In [None]:
!git commit -m "Pipeline to create an upper case file"

In [None]:
!git push origin main

In [None]:
!dvc push -r origin
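
Running `dvc repro` a second time without touching `data.csv` should skip the stage, because the dependency hashes recorded in `dvc.lock` still match the files on disk. The toy check below illustrates that idea. It is a simplified sketch, not DVC's actual logic (which also considers the command and the outputs); `needs_rerun` and the in-memory `lock` dictionary are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the hex MD5 digest of a small file."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def needs_rerun(deps: list[Path], lock: dict[str, str]) -> bool:
    """Toy check: rerun if any dependency hash differs from the recorded one."""
    return any(lock.get(str(dep)) != file_md5(dep) for dep in deps)


with tempfile.TemporaryDirectory() as tmp:
    dep = Path(tmp) / "data.csv"
    dep.write_text("a,b\n", encoding="utf8")

    lock: dict[str, str] = {}             # nothing recorded yet
    first_run = needs_rerun([dep], lock)  # the stage has never run
    lock[str(dep)] = file_md5(dep)        # record the hash after running
    unchanged = needs_rerun([dep], lock)  # dependency untouched
    dep.write_text("a,b\nc,d\n", encoding="utf8")
    after_edit = needs_rerun([dep], lock) # dependency modified

    print(first_run, unchanged, after_edit)
```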