# Intro to DVC

#### Installing DVC

DVC can be installed either through pip:

```
pip install dvc
```

Or by downloading an installer from the DVC website: https://dvc.org/


In [43]:
!pip install dvc

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


### Discussion: Where's the Data?

  1. "I want to run your analysis script.  Where is the data for the analysis?  How do I get it?"
  2. "I found a cool data analysis project on the internet that has a data download step. Where is it getting the data from?"


## Data Validation

### Discussion: Fickle Files

  1. "I ran the analysis, but there was an error when it tried to read the file. What went wrong?"
  2. "I ran the analysis on the data without errors, but I got a different result than you did. What went wrong?"

### Write Some Test Data to a File

In [44]:
%%writefile test.txt
Session Count
1 10
2 7
3 13

Overwriting test.txt


### Comparing Files Directly in Python

#### Reading Files with Path.read_bytes()

In [45]:
from pathlib import Path
data = Path("test.txt").read_bytes()
data

b'Session Count\n1 10\n2 7\n3 13\n'

#### Comparing Data with ==

In [46]:
data == b'Session Count\r\n1 10\r\n2 7\r\n3 13\r\n'

False
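The comparison is `False` because the second byte string uses Windows-style `\r\n` line endings, while `test.txt` was written with `\n`. Here is a minimal sketch of a comparison that ignores that one difference. The helper names `normalize_newlines` and `same_ignoring_line_endings` are made up for illustration, and it assumes line endings are the only difference you want to tolerate.

```python
def normalize_newlines(raw: bytes) -> bytes:
    """Convert Windows CRLF line endings to Unix LF."""
    return raw.replace(b"\r\n", b"\n")


def same_ignoring_line_endings(a: bytes, b: bytes) -> bool:
    """Compare two byte strings after normalizing their line endings."""
    return normalize_newlines(a) == normalize_newlines(b)


unix = b"Session Count\n1 10\n2 7\n3 13\n"
windows = b"Session Count\r\n1 10\r\n2 7\r\n3 13\r\n"

print(unix == windows)                            # False
print(same_ignoring_line_endings(unix, windows))  # True
```

A strict `==` on the raw bytes is still the right check when you need to know that two files are truly identical.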

### Hashing in Python

In [47]:
from hashlib import sha256, sha512, md5

In [48]:
md5(data).hexdigest()

'1ade9261c064fdc16ee8ec1b6555bb0b'

In [49]:
sha256(data).hexdigest()

'dde1cd8a5ce31ccd175c53962885a4c707c082d73814bfef6161f388dbb39a35'

In [50]:
sha512(data).hexdigest()

'b2c94009ef78e0b8be39373f3471ace5c2acb87de8c7ef9ceb810c2366d9e96fffebe77a689f32a958d5b2672d105c87a20af3781fcb182e37afc41222cadcaa'
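Calling `read_bytes()` loads the whole file into memory, which may not be practical for large datasets. As a rough sketch, the same digests can be computed by feeding the file to the hasher in chunks. The function name `file_digest`, the chunk size, and the demo file name `hash_demo.txt` are assumptions chosen for this example.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="sha256", chunk_size=65536):
    """Hash a file in fixed-size chunks so large files never sit fully in memory."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read() returns b"" (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


# Demo on a throwaway file (hypothetical name) so test.txt is left untouched.
demo = Path("hash_demo.txt")
demo.write_bytes(b"Session Count\n1 10\n2 7\n3 13\n")

# The chunked digest matches hashing the whole file at once.
print(file_digest(demo, "md5") == hashlib.md5(demo.read_bytes()).hexdigest())  # True
```

Two files with the same digest can be treated as identical for practical purposes, so comparing short hex strings can stand in for comparing the files themselves.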

### Hashing in DVC

#### One-Time Setup per Project: Setting Up the DVC Environment

DVC needs a few things to work:

  1. **A project folder**: Everything DVC should track as a single project should live under one main folder on the computer.
  2. **A DVC repo**: Before it starts tracking files, DVC needs to initialize some things in the project.
  
These translate into these steps:

  1. `cd myprojectfolder`: change the working directory to the folder you want to use as the project folder.
  2. `dvc init`: create the DVC repository by adding a `.dvc` folder.
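Because setup happens once per project, re-running `dvc init` in a folder that already contains `.dvc` fails unless `-f` is passed. The sketch below shows one way to make the setup step safe to re-run from a notebook. The function name `init_dvc_if_needed` and its `run` parameter are hypothetical, and it assumes the `dvc` command is on the PATH.

```python
import subprocess
from pathlib import Path


def init_dvc_if_needed(project_dir, run=subprocess.run):
    """Run `dvc init` in project_dir unless a .dvc folder is already there.

    Returns True if `dvc init` was invoked, False if it was skipped.
    The `run` parameter exists only so the command runner can be swapped out.
    """
    project = Path(project_dir)
    if (project / ".dvc").exists():
        return False  # already initialized; a second `dvc init` would error out
    # Assumption: project_dir is already a Git repository.
    # Without Git, `dvc init` needs the --no-scm flag.
    run(["dvc", "init"], cwd=project, check=True)
    return True


# Example call (commented out so nothing runs on import):
# init_dvc_if_needed(".")
```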

  

In [51]:
!git init

Reinitialized existing Git repository in /content/.git/


In [52]:
!dvc init --no-scm

ERROR: failed to initiate DVC - '.dvc' exists. Use `-f` to force.

### Adding Hashes to the DVC Repo by Adding "Stub" Files

In [53]:
%%writefile test.txt
Session Count
1 10
2 7
3 13

Overwriting test.txt


In [54]:
!dvc add test.txt

Adding...: 100% 1/1 [00:00<00:00, 38.51file/s]

In [55]:
!ls

sample_data  test.txt  test.txt.dvc


In [56]:
from pathlib import Path
print(Path("test.txt.dvc").read_text())

outs:
- md5: 1ade9261c064fdc16ee8ec1b6555bb0b
  size: 28
  path: test.txt
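The `md5` value in the stub is the same string that `md5(data).hexdigest()` returned earlier, so the stub can be used to check whether the data file has changed. The sketch below assumes a simple single-output stub like the one printed above, and assumes the recorded `md5` is the plain MD5 of the file's bytes, as it appears to be in this notebook. The helpers `read_stub` and `matches_stub` are hypothetical and parse the stub by hand rather than with a YAML library.

```python
import hashlib
from pathlib import Path


def read_stub(stub_path):
    """Pull `key: value` fields (md5, size, path) out of a simple .dvc stub."""
    fields = {}
    for line in Path(stub_path).read_text().splitlines():
        # Drop indentation and the leading "- " of the list item, then split on ":".
        key, sep, value = line.strip().lstrip("- ").partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def matches_stub(stub_path):
    """Return True if the data file's current MD5 equals the one recorded in the stub."""
    stub_path = Path(stub_path)
    fields = read_stub(stub_path)
    data_file = stub_path.parent / fields["path"]
    return hashlib.md5(data_file.read_bytes()).hexdigest() == fields["md5"]


# Example call (requires test.txt and test.txt.dvc to exist):
# matches_stub("test.txt.dvc")
```

In practice, `dvc status` performs this kind of check for you; the sketch only illustrates the idea.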



## Version Control

Despite its name ("Data Version Control"), DVC does little of the version control work itself. Instead, it lets Git track DVC's small stub files. This is a nice way to keep code, documentation, and data in sync: with a single system keeping track of everything, synchronization errors can be avoided.

### Setting Up Git for Version Control for DVC

  1. `git init`: Make the project folder a Git repository.
  2. `dvc init`: Make the DVC repo, telling it to talk to Git when it needs version control help (the default behavior).
    - *Note*: Earlier in this tutorial we ran `dvc init --no-scm` (no source control manager), which tells DVC not to use Git. To switch to the Git-backed setup, delete the `.dvc` folder and re-run `dvc init`.
  3. `git commit -m "added dvc"`: Save the DVC configuration files into Git version control.

#### Init the Git Repository

In [57]:
!git init

Reinitialized existing Git repository in /content/.git/


### Add and Commit the Files

Version control is straightforward. Whenever data files are changed or added, run `dvc add`, then commit the resulting stub files to Git with `git commit`. The cells below show every manual step; when DVC is set up with Git, `dvc add` also prints the Git commands to run next.

In [58]:
!dvc add test.txt

Adding...: 100% 1/1 [00:00<00:00, 31.86file/s]

In [59]:
!git add test.txt.dvc

In [60]:
!git config --global user.name "my name"
!git config --global user.email "my.name@gmail.com"

In [61]:
!git commit -m "changed files"

On branch master
Untracked files:
  (use "git add <file>..." to include in what will be committed)
	.config/
	.dvc/
	.dvcignore
	sample_data/
	test.txt

nothing added to commit but untracked files present (use "git add" to track)


*Note*: If this is your first time using Git on this computer, you may be asked to set your name and email first (the `git config` commands above do this).

That's it!  You've saved a snapshot of the project, along with the stub files.
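The add-then-commit cycle above can be wrapped in a small helper so the steps always run in the same order. This is only a sketch: the names `snapshot_commands` and `snapshot` are hypothetical, and it assumes `dvc` and `git` are both on the PATH and the stub is named `<data file>.dvc`, as in this notebook.

```python
import subprocess


def snapshot_commands(data_file, message):
    """Build the command sequence for versioning one data file."""
    stub = f"{data_file}.dvc"
    return [
        ["dvc", "add", data_file],          # hash the file and (re)write its stub
        ["git", "add", stub],               # stage the stub, not the data itself
        ["git", "commit", "-m", message],   # record the stub in Git history
    ]


def snapshot(data_file, message, run=subprocess.run):
    """Execute the commands in order, stopping at the first failure.

    Note: `git commit` exits with a non-zero status when there is nothing
    to commit, so check=True raises CalledProcessError in that case.
    """
    for cmd in snapshot_commands(data_file, message):
        run(cmd, check=True)


# Example call (commented out so nothing runs on import):
# snapshot("test.txt", "changed files")
```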

## Documenting Remote Data Storage and Retrieval


Storing your data on a different machine can be really helpful if:

  1. You have lots of data, but only need to access some of it at a time.
  2. You use different computers to do your analysis, and want a common place to look for your data.
  3. Other people would like to access your data.
  4. You value your data and don't want to accidentally delete/modify it.
  5. Your organization/industry has regulations around data protection, privacy, security, and/or intellectual property.
  6. You love making things more complicated, and one computer for one analysis just isn't enough.
  

### DVC "Remotes": Connections to various remote data storage tools

DVC's cache contains all the versions of your project's data, indexed by the file's checksum.  This is essentially a DVC-specific database for storing your data.
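To make "indexed by the file's checksum" concrete, here is a toy content-addressed store. It is not DVC's implementation, and the folder layout (first two hex characters as a subfolder) is an assumption for illustration. The function names `store` and `retrieve` are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def _slot(digest, cache_dir):
    """Where a file with this digest lives inside the toy cache."""
    return Path(cache_dir) / digest[:2] / digest[2:]


def store(file_path, cache_dir):
    """Copy a file into cache_dir under a name derived from its MD5; return the hash."""
    file_path = Path(file_path)
    digest = hashlib.md5(file_path.read_bytes()).hexdigest()
    target = _slot(digest, cache_dir)
    if not target.exists():  # identical content is stored only once
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(file_path, target)
    return digest


def retrieve(digest, cache_dir, dest):
    """Restore the version with the given hash to dest."""
    shutil.copyfile(_slot(digest, cache_dir), dest)
```

Each call to `store` returns the hash to write down (the role the stub file plays), and any earlier version can be brought back with `retrieve` as long as its hash is known.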

DVC can connect to a lot of different services, from proprietary ones like Amazon S3 or Google Drive to generic ones like SSH or FTP connections.  You can even consider a different folder on the same computer as a "remote" backup--it's all the same.

#### Add a New DVC Remote

DVC Docs: `dvc remote add`: https://dvc.org/doc/command-reference/remote/add

To use these services, you'll likely need to pip install some remote-specific Python libraries (for example, `pip install "dvc[s3]"` for Amazon S3). The docs show how.

In [62]:
!dvc remote add local ../myfolder

ERROR: configuration error - config file error: remote 'local' already exists. Use `-f|--force` to overwrite it.

In [63]:
# !dvc remote add ssh ssh://172.105.74.128/myfolder

In [64]:
# !dvc remote add gdrive gdrive://adf=2342876ds8762g324327824

### Push the Data to the Remote

In [65]:
!dvc push -r local

Everything is up to date.

### Pull the Data From the Remote

In [66]:
!dvc pull -r local

Everything is up to date.