# NLP Using DVC and DAGsHub

This notebook demonstrates how you can leverage git and [DVC](https://github.com/iterative/dvc) to easily manage ML experiments, including of course NLP.

By using this structure, you can quickly try a lot of different configurations, train a lot of models, and submit the ones that perform best.

## Instructions

* This notebook is intended to be used in Google Colab.
* To get started, we recommend you create a user in https://dagshub.com/user/sign_up
* Create a fork of the following repo: https://dagshub.com/Guy/uri_nlp_ner_workshop
  * [Link to fork creation screen](https://dagshub.com/repo/fork/19)
* If you want to modify the training code, e.g. switch to CNN instead of LSTM, we recommend to clone the repo to your laptop and edit it there. You can then push the modified code back to your repo, and the experiments in this notebook will automatically pull the latest version.
* In the **Setup** cell below, fill in your DAGsHub username `dagshub_user`.
  * If you leave the `dagshub_user` blank, then the original repo will be cloned instead of your fork. While this is OK, it will mean that you can't push the results of your experiments and be able to resume them if Google disconnects your Colab session (which is likely to happen).
  * If you do fill in `dagshub_user`, you will be prompted for your password, and will be able to push the results of experiments back to your git repo.
  * Alternatively, you can configure a git remote on your Google Drive, after it's mounted.
* The Setup section clones your git repo, mounts your Google Drive, and configures DVC to manage the different versions of your experiments inside a folder in your Google Drive.
* The **Experiment configuration** and **Experiment run** sections are the main part - here you can try a lot of different configurations, automatically commiting the result of each experiment to git and saving the resulting model to your Google Drive.
* The **Experiment overview** section is meant to run last, when you're choosing the best model. If you chose to create a fork in https://dagshub.com, you can compare metrics more comfortably over there.
* When you decide what you want to submit, run the **Submit results** section.

# Setup

## Logging in and cloning your DAGsHub repo
### RESTART THE RUNTIME AFTER RUNNING THIS CELL ONCE!

In [0]:
### RESTART THE RUNTIME AFTER RUNNING THIS CELL ONCE!
from getpass import getpass
dagshub_user = "" #@param {type:"string"}
user_email = "someone@somewhere.org" #@param {type:"string"} 
!git config --global user.name {dagshub_user}
!git config --global user.email {user_email}
if dagshub_user:
  dagshub_pass = getpass('DAGsHub password: ')
!git clone https://{dagshub_user + ':' + dagshub_pass + '@' if dagshub_user else ''}dagshub.com/{dagshub_user if dagshub_user else 'Guy'}/uri_nlp_ner_workshop.git # Change this to the URL for your fork of Uri's repo
dagshub_pass = None
!pip install -q -r uri_nlp_ner_workshop/requirements.txt
!pip install -q dvc
!pip uninstall -yq enum34
### RESTART THE RUNTIME AFTER RUNNING THIS CELL ONCE!

Git password: ··········
Cloning into 'uri_nlp_ner_workshop'...
remote: Enumerating objects: 1179, done.[K
remote: Counting objects: 100% (1179/1179), done.[K
remote: Compressing objects: 100% (957/957), done.[K
remote: Total 1179 (delta 222), reused 1134 (delta 202)[K
Receiving objects: 100% (1179/1179), 6.25 MiB | 21.21 MiB/s, done.
Resolving deltas: 100% (222/222), done.
[K    100% |████████████████████████████████| 17.3MB 1.7MB/s 
[K    100% |████████████████████████████████| 3.2MB 12.8MB/s 
[31mtorchvision 0.2.1 has requirement pillow>=4.1.1, but you'll have pillow 4.0.0 which is incompatible.[0m
[31mthinc 6.12.1 has requirement wrapt<1.11.0,>=1.10.0, but you'll have wrapt 1.11.1 which is incompatible.[0m
[31mpymc3 3.6 has requirement joblib<0.13.0, but you'll have joblib 0.13.2 which is incompatible.[0m
[31mfeaturetools 0.4.1 has requirement pandas>=0.23.0, but you'll have pandas 0.22.0 which is incompatible.[0m
[31malbumentations 0.1.12 has requirement imgaug<0.2.

**RESTART THE RUNTIME NOW!**

## Define Google Drive to be DVC remote for the project
### One time setup - run this once after restarting the runtime (after running the first cell and restaring the runtime)

In [0]:
# Run this cell once after restarting the runtime
import os
os.chdir('uri_nlp_ner_workshop')

In [0]:
# Mount your google drive
from google.colab import drive
drive.mount('/content/gdrive')

#Set up DVC
!mkdir -p '/content/gdrive/My Drive/nlp-workshop/dvc-cache'
!dvc remote add --local gdrive-remote '/content/gdrive/My Drive/nlp-workshop-dvc-cache'
!dvc config --local core.remote gdrive-remote
!dvc pull

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/gdrive
[0m[0m[0m[0mPreparing to download data from '/content/gdrive/My Drive/nlp-workshop-dvc-cache'
Preparing to collect status from /content/gdrive/My Drive/nlp-workshop-dvc-cache
[K[##############################] 100% Collecting information
[K[##############################] 100% Analysing status.
[K(1/2): [##############################] 100% ../model/model_arch.json
[K(2/2): [##############################] 100% 0.zip
Checking out '{'scheme': 'local', 

# Experiment configuration
To set up a new experiment, edit the fields on the right, then execute the cell.

It will save the experiment params into train_params.yaml, as well as create a new git branch for your experiment.

In [0]:
#@title Experiment Hyperparameters
experiment_name = "second-experiment" #@param {type: "string"}
#@markdown * WARNING: experiment name must be valid alpha or numeric or dash(-_) or dot characters.
!git fetch
!git checkout -b {experiment_name} origin/master

read_limit = 2000 #@param {type:"integer"}
max_sentence_size = 64 #@param {type:"integer"}
test_size = 0.1 #@param {type:"number"}
min_word_freq = 2 #@param {type:"integer"}
batch_size = 1024 #@param {type:"integer"}
epochs = 1 #@param {type:"integer"}
embedding_size = 128 #@param {type:"integer"}
lstm_size = 32 #@param {type:"integer"}
dropout = 0.25 #@param {type:"number"}
out_dir = '../model' #@param {type:"string"}

train_params = {
    "read_limit": read_limit,
    "max_sentence_size": max_sentence_size,
    "test_size": test_size,
    "min_word_freq": min_word_freq,
    "batch_size": batch_size,
    "epochs": epochs,
    "embedding_size": embedding_size,
    "lstm_size": lstm_size,
    "dropout": dropout,
    "out_dir": out_dir,
}

import yaml
with open('python/train_params.yaml', 'w') as f:
  yaml.dump(train_params, f, default_flow_style=False)
  
!git add python/train_params.yaml
!git diff
!git commit -m "Configured parameters for experiment {experiment_name}"

Branch 'second-experiment' set up to track remote branch 'master' from 'origin'.
Switched to a new branch 'second-experiment'
[second-experiment 23daf17] Configured parameters for experiment second-experiment
 1 file changed, 6 insertions(+), 6 deletions(-)


# Experiment run
This trains the model and records metrics

In [0]:
!dvc repro python/learn.dvc

[32mStage 'data/0.zip.dvc' didn't change.[39m
[33mStage 'python/learn.dvc' changed.[39m
Reproducing 'python/learn.dvc'
Running command:
	python3 style_learn.py 2>&1 | tee ../model/learn-stdout.txt
Using TensorFlow backend.
Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Use tf.cast instead.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
2019-03-04 00:39:52.029935: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300000000 Hz
2019-03-04 00:39:52.030200: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x1f52840 executing computations on platform Host. Devices:
2019-03-04 00:39:52.030229: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
Train params:
{'batch_size': 1024, 'dropout': 0.25, 'embedding_size': 128, 'epochs': 1, 'lstm_size': 32, 'max_sentence_size': 64, 'min_word_freq': 2, 'out_dir': '../model', 'read

## Commit the results of the experiment

In [0]:
# TODO: Git commit, dvc metrics --all, dvc push, git push
!git add .
!git status
!git commit -m "Results of {experiment_name}"
!git push --set-upstream origin {experiment_name}
!dvc push

On branch second-experiment
Your branch is ahead of 'origin/master' by 1 commit.
  (use "git push" to publish your local commits)

Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

	[32mmodified:   model/metrics/test_acc.json[m
	[32mmodified:   model/metrics/test_confusion.json[m
	[32mmodified:   model/metrics/test_fbeta.json[m
	[32mmodified:   model/metrics/test_precision.json[m
	[32mmodified:   model/metrics/test_recall.json[m
	[32mmodified:   model/metrics/test_score.json[m
	[32mmodified:   model/metrics/test_support.json[m
	[32mmodified:   model/metrics/train_acc.json[m
	[32mmodified:   model/metrics/train_confusion.json[m
	[32mmodified:   model/metrics/train_fbeta.json[m
	[32mmodified:   model/metrics/train_precision.json[m
	[32mmodified:   model/metrics/train_recall.json[m
	[32mmodified:   model/metrics/train_support.json[m
	[32mmodified:   python/learn.dvc[m

[second-experiment cf2de1d] Results of second-experiment
 14 files chan

# Experiments Overview
Compare the achieved results across your different experiments, each one saved in a git branch.

For easier comparison of your different experiments, we reccomend you push this repo to DAGsHub and use the "Branches" view, e.g.: https://dagshub.com/Guy/uri_nlp_ner_workshop/branches

In [0]:
!dvc metrics show model/metrics/test_acc.json --all-branches

first-experiment:
	model/metrics/test_acc.json: [0.9120653399696448]
master:
	model/metrics/test_acc.json: [0.9111092117352935]
second-experiment:
	model/metrics/test_acc.json: [0.9094315838217703]
[0m[0m

# Submit results
After you decide which branch you want, submit the results of that branch

In [0]:
best_experiment_name = "???"
!git checkout {best_experiment_name}
!dvc pull
!dvc checkout

In [0]:
import python.gorenml as gorenml
test_submission = gorenml.Submission(best_experiment_name, model_folder="model")

In [0]:
test_submission.submit(test_folder='data/test_txt')

100%|██████████| 382/382 [02:40<00:00,  3.69it/s]


0.46136575166199