## DVC Versioning Data and Models

- link: https://dvc.org/doc/use-cases/versioning-data-and-models/tutorial

The goal of this example is to get hands-on experience with the basics of version control for multiple datasets and Machine Learning models using DVC.

```shell
# Get the code
git clone https://github.com/robertogyn19/conecta-ceia-2024-dvc.git
cd conecta-ceia-2024-dvc
```

```shell
# Set up the virtual environment
python3 -m venv .env
source .env/bin/activate

# Install the dependencies
pip install -r requirements.txt
```

### Initializing DVC

```shell
# Initialize DVC inside the Git repository
$ dvc init
Initialized DVC repository.

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>
```


```shell
# Download the data
dvc get https://github.com/iterative/dataset-registry tutorials/versioning/data.zip
unzip -q data.zip
rm -f data.zip
tree data
data
├── train
│   ├── cats
│   │   ├── cat.1.jpg
│   │   ├── cat.10.jpg
│   │   ├── ...
│   │   ├── ...
│   │   ├── ...
│   │   ├── cat.102.jpg
│   │   └── cat.99.jpg
│   └── dogs
│       ├── dog.1.jpg
│       ├── dog.10.jpg
│       ├── ...
│       ├── ...
│       ├── ...
│       ├── dog.102.jpg
│       └── dog.99.jpg
└── validation
    ├── cats
    │   ├── cat.1001.jpg
    │   ├── cat.1002.jpg
    │   ├── ...
    │   ├── ...
    │   ├── ...    
    │   ├── cat.1005.jpg
    │   └── cat.1400.jpg
    └── dogs
        ├── dog.1001.jpg
        ├── dog.1002.jpg
        ├── ...
        ├── ...
        ├── ...        
        ├── dog.1005.jpg
        └── dog.1400.jpg

7 directories, 1800 files

```

```shell
# Add the data to DVC
$ dvc add data
100% Adding...|██████████████████████████████████████████████████████████████████|1/1 [00:00,  1.18file/s]
                                                                                                          
To track the changes with git, run:

        git add .gitignore data.dvc

To enable auto staging, run:

        dvc config core.autostage true
```

```shell
$ cat data.dvc
outs:
- md5: b8f4d5a78e55e88906d5f4aeaf43802e.dir
  size: 41149064
  nfiles: 1800
  hash: md5
  path: data
```
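
The hash recorded in `data.dvc` also tells you where the object lives in the local cache. The sketch below shows that mapping as it appears in this walkthrough: the first two characters of the hash become a subdirectory and the rest becomes the file name. The function name `cache_path_for` and the default `cache_root` are illustrative assumptions, not part of DVC's API, and the layout may differ in other DVC versions.

```python
from pathlib import PurePosixPath


def cache_path_for(md5: str, cache_root: str = ".dvc/cache/files/md5") -> str:
    """Map a hash from a .dvc file to its location in the local cache.

    Hypothetical helper: the first two characters of the hash become a
    subdirectory and the remainder (including any ".dir" suffix) becomes
    the file name.
    """
    return str(PurePosixPath(cache_root) / md5[:2] / md5[2:])


# Hash taken from the data.dvc file shown above
print(cache_path_for("b8f4d5a78e55e88906d5f4aeaf43802e.dir"))
```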

```python
# Inspect the .dir manifest that DVC stored in its cache
import json

filepath = ".dvc/cache/files/md5/b8/f4d5a78e55e88906d5f4aeaf43802e.dir"
with open(filepath, "r") as file:
    dvc_data = json.load(file)

print(json.dumps(dvc_data, indent=2))
```

Output (truncated):

```text
[
  {
    "md5": "ed779276108738fdb2179ccabf9680d9",
    "relpath": "train/cats/cat.1.jpg"
  },
  {
    "md5": "10d2a131081a3095726c5721ed31c21f",
    "relpath": "train/cats/cat.10.jpg"
  },
  {
    "md5": "0f2bfe74e9c363064087d0cd8a322106",
    "relpath": "train/cats/cat.100.jpg"
  },
  {
    "md5": "cdf4adb5d77200057c3dee2a62a0ee47",
    "relpath": "train/cats/cat.101.jpg"
  },
  {
    "md5": "7064a0c872b8feb1507ba841e8822c73",
    "relpath": "train/cats/cat.102.jpg"
  },
  {
    "md5": "412d9692e3067f75c9a984c5681883fe",
    "relpath": "train/cats/cat.103.jpg"
  },
  {
    "md5": "ebee1b7e74d6e5c70ea65f592ddc935f",
    "relpath": "train/cats/cat.104.jpg"
  },
  {
    "md5": "06e044feade703648bab0d2f7bfc147d",
    "relpath": "train/cats/cat.105.jpg"
  },
  {
    "md5": "69d3a61d7d9db45cc96e6e7b2c49bb46",
    "relpath": "train/cats/cat.106.jpg"
  },
  {
    "md5": "ba86efd9e49ce1e7268593fb32773370",
    "relpath": "train/cats/cat.107.jpg"
  },
  {
    "md5": "3bd0123fb30b49fd0ba0622e6
```

```shell
# Compute the MD5 of a few files to check that the values match the manifest
$ md5sum data/train/cats/cat.1.jpg data/train/cats/cat.10.jpg data/train/cats/cat.100.jpg
ed779276108738fdb2179ccabf9680d9  data/train/cats/cat.1.jpg
10d2a131081a3095726c5721ed31c21f  data/train/cats/cat.10.jpg
0f2bfe74e9c363064087d0cd8a322106  data/train/cats/cat.100.jpg
```
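
The same check can be automated for every entry in the manifest. The following is a minimal sketch, assuming the manifest is a list of `{"md5", "relpath"}` objects as shown above and that files are hashed by their raw bytes (which is what `md5sum` does). The helper names `file_md5` and `find_mismatches` are illustrative, not DVC functions.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(entries: list[dict], data_root: str) -> list[str]:
    """Return the relpaths whose on-disk MD5 differs from the manifest."""
    root = Path(data_root)
    return [
        entry["relpath"]
        for entry in entries
        if file_md5(root / entry["relpath"]) != entry["md5"]
    ]
```

With the manifest loaded as `dvc_data`, calling `find_mismatches(dvc_data, "data")` should return an empty list when the workspace matches what DVC recorded.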

### Configuring MinIO

```shell
# We will use MinIO as the DVC remote storage
# First, start MinIO
$ cd minio
$ docker compose up -d

# Open a shell in the MinIO container to create a bucket for the DVC data
$ docker exec -it minio-local bash

# Configure the credentials
$ mc alias set local http://localhost:9000 minio "minio123,./"
Added `local` successfully.

# Create the bucket
$ mc mb local/dvc
Bucket created successfully `local/dvc`.

# List the buckets to confirm
$ mc ls local/
[2024-10-06 11:47:13 UTC]     0B dvc/

# Create an access key / secret key pair for DVC to use
$ mc admin user svcacct add local minio \
--name dvc \
--access-key dvc \
--secret-key dvcsecret
Access Key: dvc
Secret Key: dvcsecret
Expiration: no-expiry

# Leave the MinIO container
$ exit
```

### Configuring the DVC Remote Storage

```shell
# Add the remote
$ dvc remote add --default minio s3://dvc
Setting 'minio' as a default remote.

# Set the endpoint
$ dvc remote modify minio endpointurl http://localhost:9000

# Set the access key and secret key (--local keeps them out of the Git-tracked config)
$ dvc remote modify --local minio access_key_id dvc
$ dvc remote modify --local minio secret_access_key dvcsecret

# Check the configuration
$ cat .dvc/config
[core]
    remote = minio
['remote "minio"']
    url = s3://dvc
    endpointurl = http://localhost:9000

# And where did the credentials go?
$ cat .dvc/config.local
['remote "minio"']
    access_key_id = dvc
    secret_access_key = dvcsecret
```
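
The split between `.dvc/config` and `.dvc/config.local` works because the local file is layered on top of the project file, so the credentials can stay out of Git while the remote URL is shared. The sketch below only illustrates that layering with Python's standard `configparser`; DVC uses its own config loader, and the `merge_configs` helper is a hypothetical name for this example.

```python
import configparser

# Contents of .dvc/config as shown above
PROJECT_CONFIG = """\
[core]
    remote = minio
['remote "minio"']
    url = s3://dvc
    endpointurl = http://localhost:9000
"""

# Contents of .dvc/config.local as shown above
LOCAL_CONFIG = """\
['remote "minio"']
    access_key_id = dvc
    secret_access_key = dvcsecret
"""


def merge_configs(*texts: str) -> dict:
    """Merge config texts in order; later texts override earlier ones."""
    merged: dict = {}
    for text in texts:
        parser = configparser.ConfigParser(interpolation=None)
        parser.read_string(text)
        for section in parser.sections():
            merged.setdefault(section, {}).update(dict(parser[section]))
    return merged


effective = merge_configs(PROJECT_CONFIG, LOCAL_CONFIG)
# configparser keeps the quotes that appear in the section header
print(effective["'remote \"minio\"'"])
```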

### Using the DVC Remote Storage

```shell
# push uploads the data to the remote storage (MinIO)
$ dvc push
Pushing
1801 files pushed

# Simulate a fresh environment by deleting the DVC cache directory
$ rm -rf .dvc/cache .dvc/tmp

# pull downloads the data from MinIO
$ dvc pull
Collecting                                                                                           |1.81k [00:00, 4.65kentry/s]
Fetching
Building workspace index                                                                             |1.81k [00:00, 5.41kentry/s]
Comparing indexes                                                                                    |1.81k [00:00, 38.6kentry/s]
Applying changes                                                                                       |0.00 [00:00,     ?file/s]
1801 files fetched
```

### First training run

```shell
$ python train.py
Found 1000 images belonging to 2 classes.
100/100 [==============================] - 22s 223ms/step
Found 800 images belonging to 2 classes.
80/80 [==============================] - 18s 231ms/step
100%|██████████████████████████| 10/10 [00:04<00:00,  2.42epoch/s, loss=0.118, accuracy=0.96, val_loss=0.494, val_accuracy=0.889]
```

### Adding the model to DVC

```shell
$ dvc add model.weights.h5
100% Adding...|█████████████████████████████████████████████████████████████████████████████████████████|1/1 [00:00, 31.85file/s]
               
To track the changes with git, run:
        git add .gitignore model.weights.h5.dvc

To enable auto staging, run:
        dvc config core.autostage true
```

### Versioning the .dvc file

```shell
$ git add model.weights.h5.dvc metrics.csv .gitignore
$ git commit -s -m "Primeiro modelo, treinado com 1000 imagens"
[main 5fbd6f1] Primeiro modelo, treinado com 1000 imagens
 3 files changed, 17 insertions(+)
 create mode 100644 metrics.csv
 create mode 100644 model.weights.h5.dvc

$ git tag -a "v1.0" -m "Modelo v1.0, 1000 imagens"
```

### New data

```shell
$ dvc get https://github.com/iterative/dataset-registry tutorials/versioning/new-labels.zip
$ unzip -q new-labels.zip
$ rm -f new-labels.zip
```

### Versioning the new data

```shell
$ dvc add data
100% Adding...|█████████████████████████████████████████████████████████████████████████████████████████|1/1 [00:00,  2.23file/s]
To track the changes with git, run:
        git add data.dvc

To enable auto staging, run:
        dvc config core.autostage true

$ dvc push
Pushing
1002 files pushed

$ git add data.dvc
$ git commit -s -m "Novos dados"
[main e040efd] Novos dados
 1 file changed, 3 insertions(+), 3 deletions(-)

$ git tag -a "v1.1" -m "Novos dados"
```

### Second training run

```shell
$ python train.py
Found 2000 images belonging to 2 classes.
200/200 [==============================] - 44s 219ms/step
Found 800 images belonging to 2 classes.
80/80 [==============================] - 19s 232ms/step
100%|██████████████████████████| 10/10 [00:07<00:00,  1.30epoch/s, loss=0.155, accuracy=0.947, val_loss=0.606, val_accuracy=0.87]
```

### Adding the second model to DVC

```shell
$ dvc add model.weights.h5
100% Adding...|█████████████████████████████████████████████████████████████████████████████████████████|1/1 [00:00, 31.35file/s]
To track the changes with git, run:
        git add model.weights.h5.dvc

To enable auto staging, run:
        dvc config core.autostage true

$ dvc push
Collecting                                                                       |2.81k [00:00, 20.5kentry/s]
Pushing
1 file pushed
```

### Versioning the second model

```shell
$ git add model.weights.h5.dvc metrics.csv

$ git commit -s -m "Segundo modelo, treinado com 2000 imagens"
[main 40d35ed] Segundo modelo, treinado com 2000 imagens
 2 files changed, 11 insertions(+), 11 deletions(-)

$ git tag -a "v2.0" -m "Modelo v2.0, 2000 imagens"
```

### Switching between data versions

```shell
# We can go back to v1.0 entirely, code included, by checking out the tag
$ git checkout v1.0
$ dvc checkout

# Alternatively, starting from the latest commit, we can roll back only the data and keep the code at its newest version
$ cat data.dvc
outs:
- md5: 21060888834f7220846d1c6f6c04e649.dir
  size: 64128504
  nfiles: 2800
  hash: md5
  path: data

$ git checkout v1.0 data.dvc
$ cat data.dvc
outs:
- md5: b8f4d5a78e55e88906d5f4aeaf43802e.dir
  size: 41149064
  nfiles: 1800
  hash: md5
  path: data

# Does the workspace already match? Not yet: git only restored data.dvc
$ find data/ -not -type d | wc -l
  2800

$ dvc checkout
Building workspace index                                                         |2.81k [00:00, 46.6kentry/s]
Comparing indexes                                                                |2.81k [00:00, 86.4kentry/s]
Applying changes                                                                   |0.00 [00:00,     ?file/s]
M       data/

# Now the workspace matches data.dvc
$ find data/ -not -type d | wc -l
  1800
```
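
Two versions of `data` differ only in the contents of their `.dir` manifests, so comparing the manifests shows what changed between tags. Below is a minimal sketch, assuming both manifests have been loaded as lists of `{"md5", "relpath"}` objects like the one inspected earlier. `diff_manifests` is an illustrative helper, and the second file name and hash in the sample data are made up for the example. DVC's own `dvc diff` command serves a similar purpose from the command line.

```python
def diff_manifests(old: list[dict], new: list[dict]) -> dict:
    """Compare two .dir manifests by relpath and md5."""
    old_map = {entry["relpath"]: entry["md5"] for entry in old}
    new_map = {entry["relpath"]: entry["md5"] for entry in new}
    shared = old_map.keys() & new_map.keys()
    return {
        "added": sorted(new_map.keys() - old_map.keys()),
        "removed": sorted(old_map.keys() - new_map.keys()),
        "modified": sorted(p for p in shared if old_map[p] != new_map[p]),
    }


# Sample data: the first entry comes from the manifest shown earlier,
# the second entry is hypothetical.
v1_manifest = [
    {"md5": "ed779276108738fdb2179ccabf9680d9", "relpath": "train/cats/cat.1.jpg"},
]
v2_manifest = v1_manifest + [
    {"md5": "00000000000000000000000000000000", "relpath": "train/cats/new-cat.jpg"},
]

print(diff_manifests(v1_manifest, v2_manifest))
```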

### Creating stages in DVC

```shell
# First remove the model from DVC (it is added back just below, as a stage output)
$ dvc remove model.weights.h5.dvc

# Create the stage
$ dvc stage add --name train --deps train.py --deps data \
--outs model.weights.h5 --outs bottleneck_features_train.npy --outs bottleneck_features_validation.npy \
--metrics-no-cache metrics.csv \
python train.py
Added stage 'train' in 'dvc.yaml'

To track the changes with git, run:
        git add .gitignore dvc.yaml

To enable auto staging, run:
        dvc config core.autostage true
```

### Running the stage

```shell
$ dvc repro
'data.dvc' didn't change, skipping                                                                                               
Running stage 'train':                                                                                                           
> python train.py
Found 2000 images belonging to 2 classes.
200/200 [==============================] - 47s 235ms/step
Found 800 images belonging to 2 classes.
80/80 [==============================] - 19s 239ms/step
100%|█████████████████████████| 10/10 [00:07<00:00,  1.26epoch/s, loss=0.148, accuracy=0.949, val_loss=0.424, val_accuracy=0.889]
Generating lock file 'dvc.lock'                                                                                                  
Updating lock file 'dvc.lock'

To track the changes with git, run:

        git add dvc.lock

To enable auto staging, run:

        dvc config core.autostage true
Use `dvc push` to send your updates to remote storage.
```
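
`dvc repro` only re-executes a stage when something it depends on has changed since the hashes in `dvc.lock` were recorded. The sketch below illustrates that idea in a simplified form, assuming file dependencies only and a plain dictionary standing in for the recorded hashes. The helper names are hypothetical, and real DVC also handles directories such as `data`, the stage command, and outputs.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path) -> str:
    """Return the MD5 hex digest of a file's contents."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def changed_deps(recorded: dict[str, str], root: str = ".") -> list[str]:
    """List dependencies whose current hash differs from the recorded one.

    `recorded` maps a file path to the MD5 saved on the last run
    (a stand-in for the entries kept in dvc.lock).
    A missing file also counts as changed.
    """
    changed = []
    for relpath, old_md5 in recorded.items():
        path = Path(root) / relpath
        if not path.is_file() or md5_of(path) != old_md5:
            changed.append(relpath)
    return changed


def needs_rerun(recorded: dict[str, str], root: str = ".") -> bool:
    """A stage is stale as soon as any dependency has changed."""
    return bool(changed_deps(recorded, root))
```

Under these assumptions, editing `train.py` after a run would make `needs_rerun` return `True`, while an untouched workspace would let the stage be skipped.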