## Data Versioning Example
[DVC](https://dvc.org/) is an open-source framework and distributed version control system for machine learning projects. DVC is designed to handle large files, models, and metrics as well as code. Check out the [Getting Started](https://dvc.org/doc/get-started) page for more info. 

[Here's another example](https://dvc.org/doc/get-started/example) of an end-to-end machine learning scenario for your reference.
![alt text](../assets/images/dvc_info.png)

## Example DVC Workflow with Airline Data
This tutorial illustrates how to apply the DVC documentation to an example using our airline data.

1. **Installation:** Normally you would run `pip install dvc`, but `dvc` is already included in our conda environment.
Refer to the [installation documentation](https://dvc.org/doc/get-started/install) for more help.

2. **Initialization:** Run `dvc init && git commit -m 'initialize dvc'` to initialize your project to use DVC.
After initialization, DVC creates a new `.dvc` directory containing the `config` and `.gitignore` files and a `cache` directory. These files and directories are generally hidden, and users do not interact with them directly. The last command, `git commit`, puts the `.dvc/config` and `.dvc/.gitignore` files under Git control.

In [1]:
!ls ../.dvc/

config          link.state.lock state           updater
link.state      lock            state.lock


3. **Configure (Optional)**: Once DVC is installed, you can start using it immediately in its local setup. However, remote storage (a "remote") should be set up if you need to share data outside your local environment. For example, to set up an S3 remote: `dvc remote add -d myremote s3://mybucket/myproject`.
 A remote is specified by a remote-type prefix and a path. As of this version, DVC supports six types of remotes:
```
    local - Local directory
    s3 - Amazon Simple Storage Service
    gs - Google Cloud Storage
    azure - Azure Blob Storage
    ssh - Secure Shell
    hdfs - The Hadoop Distributed File System
```
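To make the prefix-and-path idea concrete, here is a minimal Python sketch that maps a remote URL to one of the six types listed above. The `remote_type` helper and the `REMOTE_TYPES` table are hypothetical names made up for this illustration, not part of DVC's API, and the sketch assumes POSIX-style local paths.

```python
from urllib.parse import urlparse

# Remote types from the list above, keyed by URL prefix (scheme).
# An empty scheme is treated as a plain local directory path.
REMOTE_TYPES = {
    "": "local",
    "s3": "s3",
    "gs": "gs",
    "azure": "azure",
    "ssh": "ssh",
    "hdfs": "hdfs",
}


def remote_type(url: str) -> str:
    """Return the remote type implied by a URL's prefix (illustrative only)."""
    scheme = urlparse(url).scheme
    if scheme not in REMOTE_TYPES:
        raise ValueError(f"unsupported remote prefix: {scheme!r}")
    return REMOTE_TYPES[scheme]


print(remote_type("s3://mybucket/myproject"))  # s3
print(remote_type("/mnt/shared/dvc-storage"))  # local
```

The path `/mnt/shared/dvc-storage` is just a placeholder for a shared local directory.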

4. **Add Files**: DVC lets you store and version source data files, ML models, directories, and intermediate results with Git, without checking the file contents into Git. Let's first add our dataset to DVC. By default, DVC uses the `.dvc` directory from the location where `dvc` commands are executed.

In [2]:
!dvc add ../data/external/allyears2k.csv

Saving '../data/external/allyears2k.csv' to cache '../.dvc/cache'.
Saving information to '../data/external/allyears2k.csv.dvc'.
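The output above shows two things happening: the file contents go into the cache, and a small `.dvc` file is written next to the data. The sketch below imitates that idea in plain Python, assuming an MD5-based, content-addressed cache layout. The function names, the two-character subdirectory scheme, and the pointer-file fields are simplifications chosen for illustration; real `.dvc` files carry more metadata, and this is not DVC's actual implementation.

```python
import hashlib
import shutil
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 checksum of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add_to_cache(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into a content-addressed cache and write a pointer file.

    Assumed layout (illustrative): <cache_dir>/<first 2 hex chars>/<remaining chars>.
    """
    checksum = file_md5(data_file)
    cached = cache_dir / checksum[:2] / checksum[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():  # identical content is stored only once
        shutil.copy2(data_file, cached)

    # Small text file that Git can track in place of the large data file.
    pointer = data_file.with_name(data_file.name + ".dvc")
    pointer.write_text(f"md5: {checksum}\npath: {data_file.name}\n")
    return pointer
```

Calling `add_to_cache(Path("../data/external/allyears2k.csv"), Path("../.dvc/cache"))` would mirror the notebook cell above, with only the small pointer file left for Git to version.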

5. **Share & Retrieve Data**: Now that your data files are managed by DVC (see Add Files), you can push them from your repository to the default remote storage using `dvc push`.

    As with a Git remote, this ensures that your data files and models are safely stored remotely and are shareable, so you or your team can pull the data whenever it is needed.

    Usually you run it along with `git commit` and `git push` to save changes to the `.dvc` files to Git.

    To retrieve data files to your local machine and your project's workspace, run `dvc pull`.
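Conceptually, pushing and pulling can be thought of as copying whichever cached objects the other side is missing. The sketch below models that with two directories, assuming a directory-backed "local" remote. The names `sync_objects`, `push`, and `pull` are hypothetical helpers for this illustration; the real `dvc pull` also checks files out into the workspace, which is not modeled here.

```python
import shutil
from pathlib import Path
from typing import List


def sync_objects(source: Path, target: Path) -> List[str]:
    """Copy every file under source that is missing from target.

    Returns the relative paths that were copied.
    """
    copied = []
    for obj in sorted(p for p in source.rglob("*") if p.is_file()):
        rel = obj.relative_to(source)
        dest = target / rel
        if not dest.exists():
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(obj, dest)
            copied.append(rel.as_posix())
    return copied


def push(cache_dir: Path, remote_dir: Path) -> List[str]:
    """Upload cached objects the remote does not have yet."""
    return sync_objects(cache_dir, remote_dir)


def pull(cache_dir: Path, remote_dir: Path) -> List[str]:
    """Download remote objects the local cache does not have yet."""
    return sync_objects(remote_dir, cache_dir)
```

Because objects are only copied when absent, running `push` twice in a row transfers nothing the second time.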

6. **Connect Code and Data**: Even in basic scenarios, the commands `dvc add`, `dvc push`, and `dvc pull` described in the previous sections can be used independently and provide a useful basic framework to track, save, and share models and large data files.

    Let's take a look at the help menu to see which top-level options are available...

In [3]:
!dvc -h

usage: dvc [-h] [-q] [-v] [-V] COMMAND ...

Data Version Control

optional arguments:
  -h, --help     show this help message and exit
  -q, --quiet    Be quiet.
  -v, --verbose  Be verbose.
  -V, --version  Show program's version

Available Commands:
  COMMAND        Use dvc COMMAND --help for command-specific help
    init         Initialize dvc over a directory (should already be a git dir)
    destroy      Destroy dvc
    add          Add files/directories to dvc
    import       Import files from URL
    checkout     Checkout data files from cache
    run          Generate a stage file from a given command and execute the command
    pull         Pull data files from the cloud
    push         Push data files to the cloud
    fetch        Fetch data files from the cloud
    status       Show the project status
    repro        Reproduce DVC file. Default file name - 'Dvcfile'
    remove       Remove outputs of DVC file.
    move         Move output of DVC f

Let's now take a look at `dvc run` for more info on how to stage & execute our training process such that it will be repeatable & reproducible by others.

In [4]:
!dvc run -h

usage: dvc run [-h] [-q] [-v] [-d DEPS] [-o OUTS] [-O OUTS_NO_CACHE]
               [-M METRICS_NO_CACHE] [-f FILE] [-c CWD] [--no-exec]
               ...

positional arguments:
  command               Command or command file to execute

optional arguments:
  -h, --help            show this help message and exit
  -q, --quiet           Be quiet.
  -v, --verbose         Be verbose.
  -d DEPS, --deps DEPS  Declare dependencies for reproducible cmd.
  -o OUTS, --outs OUTS  Declare output data file or data directory.
  -O OUTS_NO_CACHE, --outs-no-cache OUTS_NO_CACHE
                        Declare output regular file or directory (sync to Git,
                        not DVC cache).
  -M METRICS_NO_CACHE, --metrics-no-cache METRICS_NO_CACHE
                        Declare output metric file or directory (not cached by
                        DVC).
  -f FILE, --file FILE  Specify name of the state file
  -c CWD, --cwd CWD     Directory to run your command and place stat

To achieve full reproducibility, though, you have to connect your code and configuration with the data they process to produce the result:

In [None]:
!yes | dvc run -d ../src/models/Static_Model_Pitfalls_of_Model_Development.py -d ../data/external/allyears2k.csv \
              -o ../data/processed/ \
              python ../src/models/Static_Model_Pitfalls_of_Model_Development.py

processed.dvc already exists. Do you wish to overwrite it?
Running command:
	python ../src/models/Static_Model_Pitfalls_of_Model_Development.py
numpy: 1.14.3
pandas: 0.23.0
sklearn: 0.19.1
xgboost: 0.72

Label Encode Target into Integers...

Get Training Data...
Original shape: (43978, 31)
After columns dropped shape: (43978, 13)

Naive One-Hot-Encode for features: ['UniqueCarrier', 'Dest', 'Origin']

Total number of features before encoding: 13

Total number of features after encoding: 286

Label Encode Target into Integers...

Get Training Data...
Original shape: (43978, 31)
After columns dropped shape: (43978, 13)

Naive One-Hot-Encode for features: ['UniqueCarrier', 'Dest', 'Origin']

Total number of features before encoding: 13

Total number of features after encoding: 286
[01:14:17] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 44 extra nodes, 0 pruned nodes, max_depth=5
[0]	train-error:0.359407
[01:14:17] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 44

In [None]:
!ls -ltr ../data/processed/

7. **Reproduce**: In the previous section we described our first pipeline. Basically, we created a number of `*.dvc` files. Each file describes a single step we need to run to get to the final result, and each depends on some data (either source data files or intermediate results from another `*.dvc` file) and on code files.

    It's now easy for you or anyone on your team to reproduce the result end-to-end:

In [None]:
!dvc repro processed.dvc
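The idea behind reproduction can be sketched as follows: a stage remembers the checksums of its dependencies and only re-runs its command when one of them has changed. The `Stage` class below is a hypothetical, simplified model written for this tutorial. It is not DVC's stage-file format, and it ignores outputs and upstream stages, which a full pipeline tool would also need to consider.

```python
import hashlib
import subprocess
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List


def checksum(path: Path) -> str:
    """MD5 of a file's contents (fine for a sketch; reads the whole file)."""
    return hashlib.md5(path.read_bytes()).hexdigest()


@dataclass
class Stage:
    cmd: List[str]                 # command to execute, as an argument list
    deps: List[Path]               # files the command depends on
    recorded: Dict[str, str] = field(default_factory=dict)  # last-seen checksums

    def changed_deps(self) -> List[Path]:
        """Dependencies whose current checksum differs from the recorded one."""
        return [d for d in self.deps if self.recorded.get(str(d)) != checksum(d)]

    def reproduce(self) -> bool:
        """Re-run the command only if a dependency changed; return True if it ran."""
        if not self.changed_deps():
            return False
        subprocess.run(self.cmd, check=True)
        self.recorded = {str(d): checksum(d) for d in self.deps}
        return True
```

With a stage whose `deps` are the training script and `allyears2k.csv`, a second call to `reproduce()` does nothing until either file is edited.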