# Jupyter Notebook to showcase the features of Data Version Control (DVC)
### Thesis topic: **"Towards CI/CD for trajectory-based molecular machine learning using Git and DVC"**
#### Author: Suraj Giri 
#### Supervisor: Prof. Dr. Peter Zaspel
### **Preliminary Dataset** : 10k .xyz files with molecular trajectories of C6H6 molecules and a .dat file with energies for those 10k molecules.

#### Major Tasks:
1. Developing **Kernel Ridge Regression in DVC Structure** 
2. Developing **Testing Software in DVC Structure**
3. Jupyter Notebooks to showcase features of DVC for ML

First, we developed a Kernel Ridge Regression (KRR) model using QML, a Python toolkit for quantum machine learning. The KRR model is organized as a modular pipeline with three stages: **prepare, train, and test**. This Jupyter notebook showcases the features of DVC.

Installing DVC

In [1]:
! pip install dvc

DEPRECATION: Python 2.7 reached the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 is no longer maintained. pip 21.0 will drop support for Python 2.7 in January 2021. More details about Python 2 support in pip can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support pip 21.0 will remove support for this functionality.
Defaulting to user installation because normal site-packages is not writeable
Collecting dvc
  Using cached dvc-0.78.1-py2.py3-none-any.whl (296 kB)
Collecting python-dateutil<2.8.1,>=2.1
  Using cached python_dateutil-2.8.0-py2.py3-none-any.whl (226 kB)
Collecting gitpython>=2.1.8
  Using cached GitPython-2.1.15-py2.py3-none-any.whl (452 kB)
Collecting inflect<4,>=2.1.0
  Using cached inflect-3.0.2-py2.py3-none-any.whl (31 kB)
Collecting voluptuous>=0.11.7
  Using cached voluptuous-0.13.1.tar.gz (47 kB)
Collecting shortuuid>=0.5.0
  Using cached shortuuid-0.5.0.tar.gz (6.1 kB)
Collecting zc.lockf

Pulling the Dataset from remote storage

In [2]:
! dvc pull

  0% Checkout|                                   |0/10008 [00:00<?,     ?file/s]
Building data objects from output/test/actual_pred    |0.00 [00:00,      ?obj/s]
Building data objects from dataset                    |0.00 [00:00,      ?obj/s]
Building data objects from dataset                   |1.42k [00:00,  14.2kobj/s]
Building data objects from dataset                   |2.83k [00:00,  13.6kobj/s]
Building data objects from dataset                   |4.92k [00:00,  16.8kobj/s]
Building data objects from dataset                   |7.03k [00:00,  18.5kobj/s]
Building data objects from dataset                   |9.21k [00:00,  19.7kobj/s]
Everything is up to date.

In [3]:
! git checkout v6.0
! dvc checkout

M	dvc_plots/index.html
M	dvc_plots/static/workspace_output_plots_Learning_Curve.png
M	output/metrics.csv
M	output/test/live/metrics.json
M	params.yaml
HEAD is now at f2969dd Model with dvc plots show and dvc exp run
  0% Checkout|                                   |0/10008 [00:00<?,     ?file/s]
Building data objects from output/test/actual_pred    |0.00 [00:00,      ?obj/s]
Building data objects from dataset                    |0.00 [00:00,      ?obj/s]
Building data objects from dataset                     |837 [00:00,  8.37kobj/s]
Building data objects from dataset                   |2.56k [00:00,  13.6kobj/s]
Building data objects from dataset                   |3.91k [00:00,  13.4kobj/s]
Building data objects from dataset                   |5.99k [00:00,  16.3kobj/s]
Building data objects from dataset                   |8.11k [00:00,  18.0kobj/s]

Reproducing the whole DVC Pipeline using **dvc repro**

In [4]:
! dvc repro

'dataset.dvc' didn't change, skipping                                           
Stage 'prepare' didn't change, skipping                                         
Stage 'test' didn't change, skipping                                            
Stage 'train' didn't change, skipping                                           
Data and pipelines are up to date.
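
The "didn't change, skipping" messages above come from DVC comparing the current state of each stage's dependencies with what it recorded on the previous run. The sketch below is a simplified illustration of that idea, not DVC's actual implementation. The helper names `file_md5` and `stage_changed`, the `recorded` dictionary, and the throwaway `params.yaml` file are all hypothetical and exist only for this example.

```python
import hashlib
import os
import tempfile


def file_md5(path):
    """Return the MD5 hex digest of a file's contents."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stage_changed(deps, recorded):
    """Return True if any dependency's current hash differs from the recorded one."""
    return any(file_md5(path) != recorded.get(path) for path in deps)


# Demonstration on a temporary file standing in for a stage dependency.
with tempfile.TemporaryDirectory() as tmp:
    dep = os.path.join(tmp, "params.yaml")
    with open(dep, "w") as handle:
        handle.write("sigma: 1.0\n")

    recorded = {dep: file_md5(dep)}             # state captured after a run
    unchanged = stage_changed([dep], recorded)  # nothing edited yet

    with open(dep, "w") as handle:
        handle.write("sigma: 2.0\n")            # edit the dependency
    changed = stage_changed([dep], recorded)

print(unchanged, changed)
```

In this toy version a stage would be skipped while `stage_changed` returns `False` and rerun once it returns `True`.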

Visualization of the DVC Pipeline

In [5]:
! dvc dag

        +-------------+    
        | dataset.dvc |    
        +-------------+    
          **        **     
        **            **   
       *                ** 
+---------+               *
| prepare |             ** 
+---------+           **   
          **        **     
            **    **       
              *  *         
           +-------+       
           | train |       
           +-------+       
+------+ 
| test | 
+------+ 

## Experiment Management

Running the dvc experiment using **dvc exp run**

In [6]:
! dvc exp run

  0% Checkout|                                   |0/10008 [00:00<?,     ?file/s]
Building data objects from output/test/actual_pred    |0.00 [00:00,      ?obj/s]
Building data objects from dataset                    |0.00 [00:00,      ?obj/s]
Building data objects from dataset                     |500 [00:00,  5.00kobj/s]
Building data objects from dataset                   |1.78k [00:00,  9.56kobj/s]
Building data objects from dataset                   |3.04k [00:00,  11.0kobj/s]
Building data objects from dataset                   |4.31k [00:00,  11.6kobj/s]
Building data objects from dataset                   |5.59k [00:00,  12.1kobj/s]
Building data objects from dataset                   |6.85k [00:00,  12.2kobj/s]
Building data objects from dataset                   |8.12k [00:00,  12.4kobj/s]
Building data objects from dataset                   |9.40k [00:00,  

Showing and comparing the results of different experiments

In [7]:
! dvc exp show

 Experiment   Created   errors.MAE   errors.MSE   errors.RMSE   errors.COD   train.sigma   train.order   train.metric   dataset   output/dataset   output/prepared_data.pkl   src/prepare

Displaying the metrics of the experiment

In [8]:
! dvc metrics show

Path                           errors.COD    errors.MAE    errors.MSE    errors.RMSE
output/metrics.csv             -             -             -             -
output/test/live/metrics.json  1.0           0.0           0.0           0.0
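
The four columns above are standard regression error measures. As a minimal sketch of how such values can be computed from actual and predicted energies, the example below uses plain Python. It assumes that `COD` stands for the coefficient of determination (R²); the function name `regression_errors` and the sample values are hypothetical and are not taken from the thesis code or dataset.

```python
import math


def regression_errors(actual, predicted):
    """Compute MAE, MSE, RMSE and the coefficient of determination (assumed meaning of COD)."""
    n = len(actual)
    residuals = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(r) for r in residuals) / n
    mse = sum(r * r for r in residuals) / n
    rmse = math.sqrt(mse)
    mean_actual = sum(actual) / n
    ss_res = sum(r * r for r in residuals)                 # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total sum of squares
    cod = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "COD": cod}


# Made-up values purely for illustration.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]
errors = regression_errors(actual, predicted)
print(errors)

# A perfect prediction gives zero errors and a COD of 1.0.
perfect = regression_errors(actual, actual)
print(perfect)
```

Under this definition, a prediction that matches the reference values exactly yields 0.0 for the three error measures and 1.0 for COD, which is the pattern shown in the table above.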

Displaying the Difference in metrics

In [9]:
! dvc metrics diff

Path                           Metric       HEAD    workspace    Change
output/test/live/metrics.json  errors.MAE   0.0     0.0          0.0
output/test/live/metrics.json  errors.MSE   0.0     0.0          0.0
output/test/live/metrics.json  errors.RMSE  0.0     0.0          0.0
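
The `Change` column is the difference between the workspace value and the `HEAD` value of each metric. A minimal sketch of that comparison is shown below; the function `metrics_diff` and the sample numbers are hypothetical and serve only to illustrate the arithmetic, not to reproduce DVC's internals.

```python
def metrics_diff(old, new):
    """Return {metric: (old, new, change)} for metrics present in both dicts."""
    return {
        name: (old[name], new[name], new[name] - old[name])
        for name in sorted(old.keys() & new.keys())
    }


# Hypothetical values for illustration only.
head = {"errors.MAE": 0.5, "errors.MSE": 0.25}
workspace = {"errors.MAE": 0.25, "errors.MSE": 0.25}

diff = metrics_diff(head, workspace)
for name, (old_value, new_value, change) in diff.items():
    print(name, old_value, new_value, change)
```

A negative change for an error metric means the workspace version performs better than the committed one.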

Displaying the plots using dvc plots show

In [2]:
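from IPython.display import HTML, display, IFrame

# Render the plots into dvc_plots/index.html
! dvc plots show

# Embed the generated HTML report inline in the notebook
display(HTML(filename="/home/sgiri/BSc_Thesis_DVC/dvc_plots/index.html"))
# Alternatively, show the same report inside an iframe
IFrame(src='/home/sgiri/BSc_Thesis_DVC/dvc_plots/index.html', width=700, height=500)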

file:///home/sgiri/BSc_Thesis_DVC/dvc_plots/index.html