# Intro to DVC

## Installing DVC

DVC can be installed either through pip:

```
pip install dvc
```

Or by downloading an installer from the DVC website: https://dvc.org/


### Discussion: Where's the Data?

  1. "I want to run your analysis script.  Where is the data for the analysis?  How do I get it?"
  2. "I found a cool data analysis project on the internet that has a data download step.  Where is it getting the data from?"


## Data Validation

### Discussion: Fickle Files

  1. "I ran the analysis, but there was an error when it tried to read the file. What went wrong?"
  2. "I ran the analysis on the data without errors, but I got a different result than you did. What went wrong?"

### File Hashing

Hashing means passing data through a function that maps input of any size to a short, fixed-size value, which can then be used to verify the data.  It's equivalent to a math function:

$$ y = f(x) $$

where $x$ is the data, $f$ is the hashing algorithm, and $y$ is the "hash" or "checksum".  Written another way:

$$ \text{checksum} = \text{hasher}(\text{data}) $$

### Desirable Qualities in a Hashing Algorithm for Data Validation in Data Science Contexts

  1. **Fast Checksum Comparison**.  Comparing two checksums should be significantly easier than comparing the original data.
  2. **Consistent Across Runs**.  Using the hasher twice on the same data should give the same result.
  3. **Consistent Across Machines**.  Using the hasher on the same data on different computers should give the same result.
  4. **Few Collisions**.  Different data should not result in the same checksum.
  5. **Unidirectionality**.  It should not be possible to reconstruct the data from the checksum.
  6. **Fast Hash Calculation**.  Calculating a checksum should be quick, for convenience.
  7. **Tiny Checksums**.  The checksum shouldn't take up much space on the computer.
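
A few of these qualities can be checked directly.  The sketch below is only an illustration, assuming SHA-256 from Python's `hashlib` as the hasher; the byte strings are made-up sample data, not files from this tutorial.

```python
from hashlib import sha256

# Made-up sample data; `changed` differs from `data` by a single character.
data = b"Session Count\n1 10\n2 7\n3 13\n"
changed = b"Session Count\n1 10\n2 7\n3 14\n"

first = sha256(data).hexdigest()
second = sha256(data).hexdigest()
other = sha256(changed).hexdigest()

# Consistent across runs: hashing the same data twice gives the same checksum.
print(first == second)

# Few collisions: a one-character change gives a different checksum.
print(first == other)

# Tiny checksums: the checksum length does not grow with the data.
print(len(first), len(sha256(data * 1000).hexdigest()))
```

Comparing `first` and `other` means comparing two short strings, no matter how large the underlying files are.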
  
   

### Write Some Test Data to a File

In [1]:
%%writefile test.txt
Session Count
1 10
2 7
3 13

Writing test.txt


### Windows (DOS) Hash Commands: Certutil

In [48]:
!Certutil -hashfile test.txt MD5

MD5 hash of myexp/test.txt:
3d810b829da0c3d9828dea806f26079f
CertUtil: -hashfile command completed successfully.


In [49]:
!Certutil -hashfile test.txt SHA256

SHA256 hash of myexp/test.txt:
f2ac3c76fe99a24847a2e30aee19f166b02b217d83510891fd4b118605edf6fe
CertUtil: -hashfile command completed successfully.


In [52]:
!Certutil -hashfile test.txt SHA512

SHA512 hash of myexp/test.txt:
3e411b655245b77cea179d41a5dd7a30e01c5fd2e3ccad063af1b5b5a13d82311e2838df621630b2e5436ca565e0289dff982ea43e97e90e897753e45d4a881c
CertUtil: -hashfile command completed successfully.


### Mac/Linux Hash Commands

In [53]:
!md5sum test.txt

'md5sum' is not recognized as an internal or external command,
operable program or batch file.


In [None]:
!sha256sum test.txt

In [None]:
!sha512sum test.txt

### Comparing Files Directly in Python

#### Reading Files with Path.read_bytes()

In [9]:
from pathlib import Path
data = Path("test.txt").read_bytes()
data

b'Session Count\r\n1 10\r\n2 7\r\n3 13\r\n'

#### Comparing Data with ==

In [55]:
data == b'Session Count\r\n1 10\r\n2 7\r\n3 13\r\n'

True

### Hashing in Python

In [13]:
from hashlib import sha256, sha512, md5

In [14]:
md5(data).hexdigest()

'3d810b829da0c3d9828dea806f26079f'

In [42]:
sha256(data).hexdigest()

'f2ac3c76fe99a24847a2e30aee19f166b02b217d83510891fd4b118605edf6fe'

In [43]:
sha512(data).hexdigest()

'3e411b655245b77cea179d41a5dd7a30e01c5fd2e3ccad063af1b5b5a13d82311e2838df621630b2e5436ca565e0289dff982ea43e97e90e897753e45d4a881c'
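
The calls above hash data that has already been read fully into memory.  For files too large for that, one common approach is to feed the hasher in chunks.  The sketch below assumes SHA-256; `hash_file`, its `chunk_size` default, and the `hash_demo.txt` file are hypothetical names chosen for illustration.

```python
from hashlib import sha256
from pathlib import Path


def hash_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    hasher = sha256()
    with open(path, "rb") as f:
        # Feed the hasher one chunk at a time until the file is exhausted.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


# Hypothetical demo file, written here so the sketch runs on its own.
demo = Path("hash_demo.txt")
demo.write_bytes(b"Session Count\n1 10\n2 7\n3 13\n")

# The chunked digest matches the digest of the whole file read at once.
print(hash_file(demo) == sha256(demo.read_bytes()).hexdigest())
```

The chunk size only affects memory use and speed; the resulting checksum is the same for any chunk size.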

### Hashing in DVC

#### One-Time Setup per Project: Setting Up the DVC Environment

DVC needs a few things to work:

  1. **A project folder**: Everything DVC tracks as a single project should live under one main folder on the computer.
  2. **A DVC repo**: Before it starts tracking files, DVC needs to initialize some things in the project first.

These translate into the following steps:

  1. `cd myprojectfolder`: Change the working directory to the folder you want to use as the project folder.
  2. `dvc init`: Make the DVC repository by adding a ".dvc" folder.

  

In [59]:
!git init

Initialized empty Git repository in C:/Users/Nick/.git/


In [4]:
!dvc init --no-scm

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>


### Adding Hashes to the DVC Repo by Adding "Stub" Files

In [15]:
%%writefile test.txt
Session Count
1 10
2 7
3 13

Overwriting test.txt


In [5]:
!dvc add test.txt

In [7]:
!dir

 Volume in drive C is Windows
 Volume Serial Number is B6B1-952C

 Directory of C:\Users\Nick\Desktop\DVC Intro

05/03/2021  09:56    <DIR>          .
05/03/2021  09:56    <DIR>          ..
05/03/2021  09:55    <DIR>          .dvc
05/03/2021  09:54               142 .dvcignore
05/03/2021  09:51    <DIR>          .ipynb_checkpoints
05/03/2021  09:56            13.238 DVC Intro.ipynb
05/03/2021  09:55                32 test.txt
05/03/2021  09:55                78 test.txt.dvc
               4 File(s)         13.490 bytes
               4 Dir(s)  24.757.575.680 bytes free


In [12]:
from pathlib import Path
print(Path("test.txt.dvc").read_text())

outs:
- md5: 1ade9261c064fdc16ee8ec1b6555bb0b
  size: 32
  path: test.txt
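
The stub file is plain text, so it can also be read programmatically.  The sketch below is a hand-rolled reader for the single-output stub shown above, not DVC's own parser; `parse_stub` and `STUB_TEXT` are hypothetical names, and a real project would more likely use a YAML library.

```python
from pathlib import Path

# Copy of the stub contents printed above.
STUB_TEXT = """\
outs:
- md5: 1ade9261c064fdc16ee8ec1b6555bb0b
  size: 32
  path: test.txt
"""


def parse_stub(text):
    """Collect the 'key: value' pairs of a single-output stub file."""
    fields = {}
    for line in text.splitlines():
        # Drop indentation and the leading list marker, if any.
        line = line.strip().lstrip("- ")
        if ": " in line:
            key, value = line.split(": ", 1)
            fields[key] = value
    return fields


stub = parse_stub(STUB_TEXT)
print(stub["path"], stub["size"], stub["md5"])

# A cheap first check before hashing: does the file size still match?
target = Path(stub["path"])
if target.exists():
    print(target.stat().st_size == int(stub["size"]))
```

The sketch checks only the size because the md5 recorded in the stub differs from the MD5 computed for `test.txt` earlier.  The notebook does not say why; one possible cause, stated here as an assumption, is that DVC normalized line endings in text files before hashing.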



## Version Control

Contrary to its name ("Data Version Control"), DVC doesn't actually handle much of the version control work itself.  Instead, it lets Git version DVC's small stub files.  This is a nice solution for keeping code, documentation, and data in sync: when only one system keeps track of things, synchronization errors can be avoided.

### Setting Up Git for Version Control for DVC

  1. `git init`: Make the project folder a Git repository.
  2. `dvc init`: Make the DVC repo, telling it to talk to Git when it needs version control help (the default behavior).
    - *Note*: For this tutorial, we already did this but specified `--no-scm` (no source control manager).  That setting no longer fits, so delete the .dvc folder and re-run `dvc init` to set things up correctly.
  3. `git commit -m "added dvc"`: Save the DVC configuration files into Git version control.

#### Init the Git Repository

In [None]:
!git init

#### Delete the .dvc folder

Instead of doing it via the command line, let's just open the file browser.  (This gives us a chance to talk about hidden files and folders.)

In [16]:
# Windows: open the file browser
!explorer .

In [None]:
!open .  # Mac, open Finder

In [None]:
!nautilus .  # Ubuntu, open file browser

#### Init the DVC Repository

In [None]:
!dvc init

### Add and Commit the Files

Version control is straightforward.  Whenever data files are changed or added, run `dvc add` on them, then commit the resulting stub files to Git with `git commit`.  All the manual steps are below, but DVC helps out a bit by printing the Git commands to run next.

In [None]:
!dvc add test.txt

In [None]:
!git add test.txt.dvc

In [None]:
!git commit -m "changed files"

*Note*: If this is your first time using Git on the computer, you may be asked to set your name and email first (the commands for doing this are below).

In [19]:
# !git config --global user.name "Nicholas A. Del Grosso"
# !git config --global user.email "delgrosso.nick@gmail.com"

That's it!  You've saved a snapshot of the project, along with the stub files.

## Documenting Remote Data Storage and Retrieval


Storing your data on a different machine can be really helpful if:

  1. You have lots of data, but only need to access some of it at a time.
  2. You use different computers to do your analysis, and want a common place to look for your data.
  3. Other people would like to access your data.
  4. You value your data and don't want to accidentally delete/modify it.
  5. Your organization/industry has regulations around data protection, privacy, security, and/or intellectual property.
  6. You love making things more complicated, and one computer for one analysis just isn't enough.
  

### DVC "Remotes": Connections to Various Remote Data Storage Tools

DVC's cache contains all the versions of your project's data, indexed by the file's checksum.  This is essentially a DVC-specific database for storing your data.
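
To make "indexed by the file's checksum" concrete, here is a toy checksum-indexed store.  It is a sketch of the general idea only and makes no claim about how DVC lays out its real cache; `cache_file`, `toy_cache`, and the folder scheme are hypothetical.

```python
import shutil
import tempfile
from hashlib import md5
from pathlib import Path


def cache_file(path, cache_dir):
    """Copy a file into cache_dir under a name derived from its checksum."""
    checksum = md5(Path(path).read_bytes()).hexdigest()
    # Split the checksum into a short folder name and a file name.
    target = Path(cache_dir) / checksum[:2] / checksum[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():  # identical content is stored only once
        shutil.copyfile(path, target)
    return checksum


workdir = Path(tempfile.mkdtemp())  # throwaway folder for the demo
data_file = workdir / "data.txt"
cache_dir = workdir / "toy_cache"

# Version 1 of the data.
data_file.write_bytes(b"Session Count\n1 10\n")
v1 = cache_file(data_file, cache_dir)

# Version 2 overwrites the working file, but version 1 stays in the cache.
data_file.write_bytes(b"Session Count\n1 10\n2 7\n")
v2 = cache_file(data_file, cache_dir)

# Both versions are still retrievable by checksum.
old = (cache_dir / v1[:2] / v1[2:]).read_bytes()
print(v1 != v2, old == b"Session Count\n1 10\n")
```

Because the storage location depends only on the content, a stub file that records a checksum is enough to find the matching version of the data later.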

DVC can connect to a lot of different services, from proprietary ones like Amazon S3 or Google Drive to generic ones like SSH or HTTP connections.  You can even treat a different folder on the same computer as a "remote" backup; to DVC, it's all the same.

#### Add a New DVC Remote

DVC Docs: `dvc remote add`: https://dvc.org/doc/command-reference/remote/add

To use these services, you'll likely need to pip install some remote-specific Python libraries, for example `pip install "dvc[ssh]"` for SSH remotes.  The docs show how.

In [21]:
# !dvc remote add local ../myfolder

In [22]:
# !dvc remote add ssh ssh://172.105.74.128/myfolder

In [23]:
# !dvc remote add gdrive gdrive://adf=2342876ds8762g324327824

### Push the Data to the Remote

In [25]:
# !dvc push -r local

### Pull the Data From the Remote

In [None]:
# !dvc pull -r local