# What is DVC?
- DVC (Data Version Control) is a tool for managing data and ML experiments.

# Install DVC

In [None]:
# In a notebook, prefix any shell command with ! to run it from a cell.

# macOS
! brew install dvc
# pip install dvc

# Windows
# choco install dvc
# pip install dvc

# Linux
# pip install dvc

# Initialize DVC
dvc init creates a .dvc/ directory. Its configuration files should be added to Git, while the cache and temporary files are ignored through .dvc/.gitignore:
- .dvc/config: The configuration file. It can be edited by hand or with the dvc config command.
- .dvc/cache: Default location of the cache directory, which stores the project data in a content-addressed structure.
- .dvc/cache/runs: Default location of the run-cache.
- .dvc/plots: Directory for plot templates.
- .dvc/tmp: Directory for miscellaneous temporary files.
- and more...


In [1]:
# Initialize Git (skip this if the directory is already a Git repository)
# ! git init

# Initialize a new DVC repository
! dvc init --subdir  # --subdir initializes DVC in a subdirectory of a Git repository; you likely don't need it

ERROR: failed to initiate DVC - '.dvc' exists. Use `-f` to force.

# DVC features
DVC's features can be grouped into functional components. You can explore them in two independent trails:
- Data Management Trail:
    - Data and model versioning - The base layer of DVC for large files, datasets, and machine learning models. Use a regular Git workflow, but without storing large files in the repo (think "Git for data"). Data is stored separately, which allows for efficient sharing.

- Experiments Trail
    - Experiments versioning - Enable exploration, iteration, and comparison across many ML experiments. Track your experiments with automatic versioning and checkpoint logging. Compare differences in parameters, metrics, code, and data. Apply, drop, roll back, resume, or share any experiment.

# Get a sample dataset

In [2]:
! mkdir data # Create a directory called data
! dvc get https://github.com/iterative/dataset-registry get-started/data.xml -o data/data.xml
! ls -lh data


# Use dvc add to start tracking a file or directory
- DVC stores information about the added file in a special .dvc file named data/data.xml.dvc, a small text file in a human-readable format. This metadata file is a placeholder for the original data and can be versioned like source code with Git:

In [3]:
# Add the file to the DVC repository
! dvc add data/data.xml

Checking graph
100% Adding...|████████████████████████████████████████|1/1 [00:00, 16.64file/s]

To track the changes with git, run:

    git add data/data.xml.dvc data/.gitignore

To enable auto staging, run:

    dvc config core.autostage true

In [4]:
! git add data/data.xml.dvc data/.gitignore
! git commit -m "Add raw data"

[main 5a63525] Add raw data
 2 files changed, 5 insertions(+)
 create mode 100644 dvc_example/data/.gitignore
 create mode 100644 dvc_example/data/data.xml.dvc
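
The .dvc placeholder committed above stays tiny because it records only metadata about the data file, not the data itself. The sketch below is not DVC's code; it builds text that resembles such a placeholder for a throwaway file. The field names (md5, size, path) are an assumption based on DVC 2.x-style .dvc files, and `placeholder_text` is a hypothetical helper introduced for illustration.

```python
import hashlib
import os
import tempfile


def placeholder_text(data_path: str) -> str:
    """Build text resembling a .dvc placeholder for one file (illustrative only)."""
    md5 = hashlib.md5()
    with open(data_path, "rb") as f:
        # Hash in chunks so large files never need to fit in memory.
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            md5.update(chunk)
    size = os.path.getsize(data_path)
    name = os.path.basename(data_path)
    # Field names (md5, size, path) are assumed from DVC 2.x-style .dvc files.
    return f"outs:\n- md5: {md5.hexdigest()}\n  size: {size}\n  path: {name}\n"


# Demo on a throwaway file rather than the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "data.xml")
    with open(demo_path, "wb") as f:
        f.write(b"hello")
    demo_text = placeholder_text(demo_path)

print(demo_text)
```

Because the hash changes whenever the file content changes, committing this small text file to Git is enough to pin an exact version of the data.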


# Remote Storage
- dvc push uploads DVC-tracked data or model files to remote storage so they can be retrieved later in other environments with dvc pull.

In [5]:
# Set up storage location
! dvc remote add -d -f vdsml-dvc-how-to s3://vdsml-dvc-how-to  # -d sets the default remote; -f overwrites the existing entry, since this demo has been run before
! git add .dvc/config
! git commit -m "Configure remote storage"

Setting 'vdsml-dvc-how-to' as a default remote.
On branch main
Your branch is ahead of 'origin/main' by 1 commit.
  (use "git push" to publish your local commits)

nothing to commit, working tree clean


In [6]:
# Push to remote storage
! dvc push

1 file pushed

dvc push copied the data cached locally to the remote storage we set up earlier. The remote storage directory should look like this:

.../dvcstore
└── 22
    └── a1a2931c8370d3aeedd7183606fd7f
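
The two-level layout in the listing above follows from the file's MD5 hash: the first two hex characters become the directory name and the remaining thirty become the file name. Here is a minimal sketch of that rule, assuming the layout shown in this notebook (other DVC releases may nest it under additional prefix directories). `cache_relpath` is a hypothetical helper, not part of DVC's API.

```python
import hashlib


def cache_relpath(md5_hex: str) -> str:
    """Split a 32-character MD5 hex digest into '<2 chars>/<30 chars>'."""
    if len(md5_hex) != 32:
        raise ValueError("expected a 32-character MD5 hex digest")
    return f"{md5_hex[:2]}/{md5_hex[2:]}"


# The hash implied by the directory listing above.
listing_path = cache_relpath("22a1a2931c8370d3aeedd7183606fd7f")
print(listing_path)

# The same rule applied to the MD5 of arbitrary content (here, empty bytes).
empty_path = cache_relpath(hashlib.md5(b"").hexdigest())
print(empty_path)
```

Storing files by content hash means two identical files share one cache entry, and a changed file gets a new entry without overwriting the old one.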

# DVC Pull
Let's remove the data from our directory and see how easy it is to retrieve it from the remote storage.

In [7]:
# Delete the working copy and the local cache so the data must come from the remote
! rm -f data/data.xml
! rm -rf .dvc/cache

In [8]:
# Pull from remote storage
! dvc pull

A       data/data.xml
1 file added

# Track changes in the data

In [9]:
# Make a change to the data
! cp data/data.xml /tmp/data.xml
! cat /tmp/data.xml >> data/data.xml
! ls -lh data

total 56896
-rw-r--r--  1 twileman  staff    28M Jul 28 20:39 data.xml
-rw-r--r--  1 twileman  staff    80B Jul 28 20:28 data.xml.dvc


In [10]:
! dvc add data/data.xml
! git add data/data.xml.dvc
! git commit -m "Add more data"
! dvc push

Checking graph
100% Adding...|████████████████████████████████████████|1/1 [00:00, 10.18file/s]

To track the changes with git, run:

    git add data/data.xml.dvc

To enable auto staging, run:

    dvc config core.autostage true
[main 12d24bb] Add more data
 1 file changed, 2 insertions(+), 2 deletions(-)


In [12]:
! git log --oneline

12d24bb (HEAD -> main) Add more data
5a63525 Add raw data
5bf5ede (origin/main, origin/HEAD) prior to demo
8da7ac1 Add raw data
352e99c Add more data
138fb8a Add raw data
0104f30 Configure remote storage
2d5e052 Add raw data
9138cd8 added dvc how to
c5e81a8 Add files via upload


In [13]:
# Check out the previous version of the .dvc file
! git checkout HEAD^1 data/data.xml.dvc

Updated 1 path from 55df548


In [14]:
# Now sync the data file with the restored .dvc file
! dvc checkout data/data.xml
! ls -lh data

M       data/data.xml
total 28224
-rw-r--r--  1 twileman  staff    14M Jul 28 12:27 data.xml
-rw-r--r--  1 twileman  staff    80B Jul 28 20:42 data.xml.dvc


In [15]:
# To keep the data I've just checked out, commit the restored .dvc file
# (-m supplies the message so Git does not open an editor, which would hang the notebook)
! git commit data/data.xml.dvc -m "Revert dataset updates"

# Pipelines
- DVC pipelines capture the data processes that produce a final result: how data is filtered, transformed, or used to train ML models.
- When you add a stage to a pipeline, DVC writes it to a dvc.yaml file. This file includes the command to run (python src/prepare.py data/data.xml), its dependencies, and its outputs.
- DVC uses these metafiles to track the data used and produced by the stage, so there's no need to run dvc add on data/prepared manually.
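
One way to picture the tracking described in the last bullet is as a hash comparison: record a hash for each dependency when the stage runs, then compare against the current hashes to decide whether the stage is out of date. The sketch below illustrates that idea only and is not DVC's implementation. The file contents and the `changed_deps` helper are hypothetical.

```python
import hashlib


def digest(content: bytes) -> str:
    """Return the MD5 hex digest of some file content."""
    return hashlib.md5(content).hexdigest()


def changed_deps(recorded: dict[str, str], current: dict[str, bytes]) -> list[str]:
    """Return dependency names whose current hash differs from the recorded one."""
    return sorted(
        name for name, content in current.items()
        if recorded.get(name) != digest(content)
    )


# Hypothetical snapshot taken the last time the stage ran.
files = {"src/prepare.py": b"print('v1')", "data/data.xml": b"<rows/>"}
recorded = {name: digest(content) for name, content in files.items()}

# Nothing has changed yet, so no dependency is reported.
unchanged = changed_deps(recorded, files)

# Editing the script changes its hash, so the stage is out of date.
files["src/prepare.py"] = b"print('v2')"
stale = changed_deps(recorded, files)
print(unchanged, stale)
```

An empty result means the recorded outputs can be reused; any reported dependency means the stage should be run again.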

In [None]:
! dvc stage add -n prepare \
                -p prepare.seed,prepare.split \
                -d src/prepare.py -d data/data.xml \
                -o data/prepared \
                python src/prepare.py data/data.xml

    -n prepare specifies a name for the stage. If you open the dvc.yaml file you will see a section named prepare.

    -p prepare.seed,prepare.split defines a special type of dependency: parameters. They are covered in more detail on the Metrics, Parameters, and Plots page of the DVC documentation, but the idea is that the stage can depend on field values from a parameters file (params.yaml by default):

prepare:
  split: 0.20
  seed: 20170428

    -d src/prepare.py and -d data/data.xml mean that the stage depends on these files to work. Notice that the source code itself is marked as a dependency. If any of these files change later, DVC will know that this stage needs to be reproduced.

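    -o data/prepared specifies an output directory for this script, which writes two files in it. This is how the workspace should look after the stage has run (dvc stage add only records the stage; running it, for example with dvc repro, produces the outputs and dvc.lock):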

     .
     ├── data
     │   ├── data.xml
     │   ├── data.xml.dvc
    +│   └── prepared
    +│       ├── test.tsv
    +│       └── train.tsv
    +├── dvc.yaml
    +├── dvc.lock
     ├── params.yaml
     └── src
         ├── ...

    The last line, python src/prepare.py data/data.xml, is the command to run in this stage. It is saved to dvc.yaml, as shown below.
