# dvc-demo

This notebook demonstrates using DVC on Spell. To learn more, refer to the accompanying blog post: ["Using DVC as a lightweight feature store on Spell"](https://spell.ml/blog/using-dvc-with-spell-YBHOChEAACgAaSmV).

## getting the data

For the purposes of this demonstration, we will use a sample from the dataset [A Year of Pumpkin Prices](https://www.kaggle.com/usda/a-year-of-pumpkin-prices) on Kaggle (specifically, `new-york_9-24-2016_9-30-2017.csv`).

To begin, I downloaded the data to the file `new_york_city_pumpkin_prices.csv` on my local machine. If you are running this code in a Spell workspace, you can `spell upload` this file, then mount it into your workspace.

In [5]:
import pandas as pd
pd.read_csv("new_york_city_pumpkin_prices.csv").head()

Unnamed: 0,Commodity Name,City Name,Type,Package,Variety,Sub Variety,Grade,Date,Low Price,High Price,...,Color,Environment,Unit of Sale,Quality,Condition,Appearance,Storage,Crop,Repack,Trans Mode
0,PUMPKINS,NEW YORK,,36 inch bins,HOWDEN TYPE,,,09/24/2016,150,170,...,,,,,,,,,N,
1,PUMPKINS,NEW YORK,,36 inch bins,HOWDEN TYPE,,,09/24/2016,150,170,...,,,,,,,,,N,
2,PUMPKINS,NEW YORK,,36 inch bins,HOWDEN TYPE,,,09/24/2016,130,150,...,,,,,,,,,N,
3,PUMPKINS,NEW YORK,,36 inch bins,HOWDEN TYPE,,,09/24/2016,130,150,...,,,,,,,,,N,
4,PUMPKINS,NEW YORK,,36 inch bins,HOWDEN TYPE,,,09/24/2016,120,140,...,,,,,,,,,N,


This is a simple dataset of pumpkin prices (`Low Price`, `High Price`) by `Variety` and `Item Size` sold in New York City in the days prior to Halloween 2016.
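
As a quick illustration of the kind of question this data can answer, the sketch below averages `Low Price` and `High Price` for each (`Variety`, `Item Size`) pair. To stay self-contained it uses a handful of rows from the file, trimmed to the relevant columns, rather than reading the CSV from disk. The helper name `mean_prices` is made up for this example.

```python
import csv
import io
from collections import defaultdict

# A few rows taken from the dataset, trimmed to the columns used here.
SAMPLE = """\
Variety,Item Size,Low Price,High Price
HOWDEN TYPE,xlge,150,170
HOWDEN TYPE,lge,150,170
HOWDEN TYPE,xlge,130,150
HOWDEN TYPE,lge,130,150
HOWDEN TYPE,med-lge,120,140
"""


def mean_prices(csv_text):
    """Return {(variety, item_size): (mean_low, mean_high)}."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["Variety"], row["Item Size"])
        groups[key].append((float(row["Low Price"]), float(row["High Price"])))
    return {
        key: (
            sum(low for low, _ in prices) / len(prices),
            sum(high for _, high in prices) / len(prices),
        )
        for key, prices in groups.items()
    }


for key, (low, high) in sorted(mean_prices(SAMPLE).items()):
    print(key, low, high)
```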

## versioning it locally using dvc

To initialize DVC, run `dvc init` (we use `--subdir` here because this demo is specific to the `dvc` directory in `spellml/examples` only). To add data to DVC, run `dvc add`.

In [7]:
!dvc init --subdir


You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>

In [10]:
%ls

devc-demo.ipynb                   new_york_city_pumpkin_prices.csv


In [11]:
!dvc add new_york_city_pumpkin_prices.csv

100% Add|██████████████████████████████████████████████|1/1 [00:00,  1.86file/s]

To track the changes with git, run:

	git add new_york_city_pumpkin_prices.csv.dvc .gitignore

Running `dvc add` does three things. First, it creates a `[...].dvc` file containing the MD5 content hash of the file.

In [12]:
%ls

devc-demo.ipynb                       new_york_city_pumpkin_prices.csv.dvc
new_york_city_pumpkin_prices.csv


In [13]:
!cat new_york_city_pumpkin_prices.csv.dvc

outs:
- md5: 10ac52bb2b805fe1a9de704d2f5a5be1
  size: 11875
  path: new_york_city_pumpkin_prices.csv


Second, it adds the CSV file itself to your `.gitignore`. From now on DVC, not Git, handles version control for the data.

In [14]:
!cat .gitignore

/new_york_city_pumpkin_prices.csv


Third, it stores a copy of the dataset in a [content addressable storage](https://en.wikipedia.org/wiki/Content-addressable_storage) filesystem inside the special `.dvc` directory.

In [16]:
%ls .dvc/cache/

[1m[36m10[m[m/


In [18]:
%ls .dvc/cache/10/

ac52bb2b805fe1a9de704d2f5a5be1


In [19]:
!head .dvc/cache/10/ac52bb2b805fe1a9de704d2f5a5be1

Commodity Name,City Name,Type,Package,Variety,Sub Variety,Grade,Date,Low Price,High Price,Mostly Low,Mostly High,Origin,Origin District,Item Size,Color,Environment,Unit of Sale,Quality,Condition,Appearance,Storage,Crop,Repack,Trans Mode
PUMPKINS,NEW YORK,,36 inch bins,HOWDEN TYPE,,,09/24/2016,150,170,150,170,MICHIGAN,,xlge,,,,,,,,,N,
PUMPKINS,NEW YORK,,36 inch bins,HOWDEN TYPE,,,09/24/2016,150,170,150,170,MICHIGAN,,lge,,,,,,,,,N,
PUMPKINS,NEW YORK,,36 inch bins,HOWDEN TYPE,,,09/24/2016,130,150,130,150,NEW JERSEY,,xlge,,,,,,,,,N,
PUMPKINS,NEW YORK,,36 inch bins,HOWDEN TYPE,,,09/24/2016,130,150,130,150,NEW JERSEY,,lge,,,,,,,,,N,
PUMPKINS,NEW YORK,,36 inch bins,HOWDEN TYPE,,,09/24/2016,120,140,120,140,NEW YORK,,med-lge,,,,,,,,,N,
PUMPKINS,NEW YORK,,36 inch bins,HOWDEN TYPE,,,09/24/2016,150,170,150,170,PENNSYLVANIA,,xlge,,,,,,,,,N,
PUMPKINS,NEW YORK,,36 inch bins,HOWDEN TYPE,,,09/24/2016,120,170,120,170,PENNSYLVANIA,,lge,,,,,,,,,N,
PUMPKINS,NEW YORK,,36 inch bins,HOWDEN TYPE,,,09/24/2016,1
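
The relationship between the hash in the `.dvc` file and the cache entry listed above can be sketched in a few lines of Python. This is an illustration of the layout seen in this notebook, not DVC's own code: the function names are invented for the example, DVC may preprocess a file (for instance, normalizing line endings) before hashing it, and other DVC versions may lay out the cache differently.

```python
import hashlib
from pathlib import Path


def md5_hex(data: bytes) -> str:
    """Hex MD5 digest of raw bytes."""
    return hashlib.md5(data).hexdigest()


def cache_path(md5: str, cache_dir: str = ".dvc/cache") -> Path:
    """Map a hash to <cache_dir>/<first two hex chars>/<remaining chars>."""
    return Path(cache_dir) / md5[:2] / md5[2:]


# The hash recorded in the .dvc file maps to the cache entry listed above.
print(cache_path("10ac52bb2b805fe1a9de704d2f5a5be1").as_posix())
```

Because the location is derived from the content, two files with identical bytes share one cache entry, and any change to the data produces a new entry instead of overwriting the old one.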

At this point, `dvc checkout` can be used to rehydrate `new_york_city_pumpkin_prices.csv`. This is helpful should we lose the file, or revert to an old commit that used an older version of this dataset.

In [23]:
!rm new_york_city_pumpkin_prices.csv

In [24]:
!dvc checkout

[32mA[39m	new_york_city_pumpkin_prices.csv                                    
[0m
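
Conceptually, rehydrating a file means reading the hash from the `.dvc` file, locating the matching blob in the cache, and placing it back in the workspace. The sketch below mimics that with a plain file copy. It is a simplification under stated assumptions, not DVC's implementation: `read_dvc_file` and `restore` are hypothetical helpers, the line-based parsing stands in for a real YAML parser, and DVC may link files rather than copy them.

```python
import shutil
import tempfile
from pathlib import Path


def read_dvc_file(text: str) -> dict:
    """Pull the fields out of a single-output .dvc file.

    Simplified line-based parsing for illustration; the real file is YAML.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().lstrip("- ").partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def restore(dvc_file: Path, cache_dir: Path) -> Path:
    """Copy the cached blob named by dvc_file back into the workspace."""
    fields = read_dvc_file(dvc_file.read_text())
    md5 = fields["md5"]
    target = dvc_file.parent / fields["path"]
    shutil.copyfile(cache_dir / md5[:2] / md5[2:], target)
    return target


# Demo in a throwaway directory, with placeholder contents for the CSV.
with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    md5 = "10ac52bb2b805fe1a9de704d2f5a5be1"
    blob = workspace / ".dvc" / "cache" / md5[:2] / md5[2:]
    blob.parent.mkdir(parents=True)
    blob.write_text("placeholder contents\n")
    dvc_file = workspace / "new_york_city_pumpkin_prices.csv.dvc"
    dvc_file.write_text(
        "outs:\n- md5: " + md5
        + "\n  size: 11875\n  path: new_york_city_pumpkin_prices.csv\n"
    )
    restored = restore(dvc_file, workspace / ".dvc" / "cache")
    print(restored.name, restored.read_text() == blob.read_text())
```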

## versioning it remotely using dvc

So far, everything we have done has used local storage. However, DVC's most interesting feature (its "killer app", if you will) is the ability to use remote storage.

To do this, you configure and use a **remote**. A remote in DVC is a storage location, typically cloud-based (an S3 or GCS bucket, for example), that serves as the central storage space for your project's data.

Setting a remote is easy to do, using `dvc remote add`:

In [38]:
!dvc remote add spell-datasets-share s3://spell-datasets-share/dvc/


In [41]:
# This line is only needed if you want DVC to use the non-default AWS profile.
!dvc remote modify spell-datasets-share profile spell2


These commands add a remote with the given name, URL, and AWS profile to the project's DVC configuration file, `.dvc/config`:

In [42]:
!cat .dvc/config

['remote "spell-datasets-share"']
    url = s3://spell-datasets-share/dvc/
    profile = spell2
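
Since `.dvc/config` is a plain INI-style file, it is easy to inspect from code. The sketch below reads the remote definitions with Python's standard `configparser`. This is only a stand-in: DVC reads the file with its own configuration loader, and `list_remotes` is a hypothetical helper written for this example.

```python
import configparser

# Same contents as the .dvc/config shown above.
CONFIG_TEXT = """\
['remote "spell-datasets-share"']
    url = s3://spell-datasets-share/dvc/
    profile = spell2
"""


def list_remotes(config_text):
    """Return {remote_name: {option: value}} for every remote section."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    remotes = {}
    for section in parser.sections():
        # Section headers look like: 'remote "<name>"' (quotes included).
        kind, _, name = section.strip("'").partition(" ")
        if kind == "remote":
            remotes[name.strip('"')] = dict(parser[section])
    return remotes


print(list_remotes(CONFIG_TEXT))
```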


We can now use `dvc push` to send that data to S3:

In [43]:
!dvc push --remote spell-datasets-share

  0% Uploading|                                      |0/1 [00:00<?,     ?file/s]
  0%|          |new_york_city_pumpkin_prices.cs0.00/11.6k [00:00<?,        ?B/s]
100%|██████████|new_york_city_pumpkin_pric11.6k/11.6k [00:00<00:00,    93.4kB/s]
1 file pushed

In [44]:
!aws s3 ls s3://spell-datasets-share/dvc/ --profile spell2

                           PRE 10/


In [45]:
!aws s3 ls s3://spell-datasets-share/dvc/10/ --profile spell2

2021-01-26 11:15:46      11875 ac52bb2b805fe1a9de704d2f5a5be1
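
As the listing shows, the remote mirrors the local cache layout: the first two characters of the hash form a prefix, and the rest form the object name. A minimal sketch of that mapping follows; `remote_object_url` is a hypothetical helper, and the layout is assumed from the output above rather than taken from DVC's documentation.

```python
def remote_object_url(remote_url: str, md5: str) -> str:
    """Build <remote_url>/<first two hex chars>/<remaining chars>."""
    return f"{remote_url.rstrip('/')}/{md5[:2]}/{md5[2:]}"


print(remote_object_url(
    "s3://spell-datasets-share/dvc/",
    "10ac52bb2b805fe1a9de704d2f5a5be1",
))
```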


We can now easily rehydrate our dataset at any time from remote storage using `dvc pull`:

In [46]:
!rm new_york_city_pumpkin_prices.csv

In [47]:
!dvc pull --remote spell-datasets-share

[32mA[39m	new_york_city_pumpkin_prices.csv                                    
1 file added
[0m

In [48]:
!head new_york_city_pumpkin_prices.csv

Commodity Name,City Name,Type,Package,Variety,Sub Variety,Grade,Date,Low Price,High Price,Mostly Low,Mostly High,Origin,Origin District,Item Size,Color,Environment,Unit of Sale,Quality,Condition,Appearance,Storage,Crop,Repack,Trans Mode
PUMPKINS,NEW YORK,,36 inch bins,HOWDEN TYPE,,,09/24/2016,150,170,150,170,MICHIGAN,,xlge,,,,,,,,,N,
PUMPKINS,NEW YORK,,36 inch bins,HOWDEN TYPE,,,09/24/2016,150,170,150,170,MICHIGAN,,lge,,,,,,,,,N,
PUMPKINS,NEW YORK,,36 inch bins,HOWDEN TYPE,,,09/24/2016,130,150,130,150,NEW JERSEY,,xlge,,,,,,,,,N,
PUMPKINS,NEW YORK,,36 inch bins,HOWDEN TYPE,,,09/24/2016,130,150,130,150,NEW JERSEY,,lge,,,,,,,,,N,
PUMPKINS,NEW YORK,,36 inch bins,HOWDEN TYPE,,,09/24/2016,120,140,120,140,NEW YORK,,med-lge,,,,,,,,,N,
PUMPKINS,NEW YORK,,36 inch bins,HOWDEN TYPE,,,09/24/2016,150,170,150,170,PENNSYLVANIA,,xlge,,,,,,,,,N,
PUMPKINS,NEW YORK,,36 inch bins,HOWDEN TYPE,,,09/24/2016,120,170,120,170,PENNSYLVANIA,,lge,,,,,,,,,N,
PUMPKINS,NEW YORK,,36 inch bins,HOWDEN TYPE,,,09/24/2016,1

Now *anyone on the team* can rehydrate the complete state of the repository, using Git for the code and DVC for the data, in just a couple of commands. ✨

```bash
$ git pull
$ dvc pull
```

**We are, in essence, using DVC as a lightweight feature store.** Nifty!

## using dvc with spell

Spell can mount S3 and/or GCS buckets directly onto the machine using our `--mount` syntax. Because Spell localizes mounted files in a multithreaded, asynchronous manner, this has the potential to offer a significant performance benefit over the (single-process, blocking) download behavior of `dvc pull`.

For example, I ran this notebook in a Spell workspace that mounts `s3://spell-datasets-share/dvc/` to `/mnt/dvc` on the machine:

```bash
$ spell jupyter --lab \
    --mount s3://spell-datasets-share/dvc/:/mnt/dvc \
    dvc-demo
```

I then set up DVC to pull from this remote by default:

In [2]:
!dvc remote add --default s3-via-spell /mnt/dvc/

Setting 's3-via-spell' as a default remote.

Now when I run `dvc pull`:

In [3]:
!dvc pull

  0% Downloading|                                    |0/1 [00:00<?,     ?file/s]
  0%|          |new_york_city_pumpkin_prices.cs0.00/11.9k [00:00<?,       ?it/s]
A	new_york_city_pumpkin_prices.csv
1 file added and 1 file fetched

While I was writing this code, Spell had already localized the contents of `s3://spell-datasets-share/dvc/` to disk in the background, so no actual over-the-wire data I/O had to occur!

[Refer to the blog post for more details.](https://spell.ml/blog/using-dvc-with-spell-YBHOChEAACgAaSmV)