# Lightning model evaluation

Evaluation notebook for MobileNet_V2. It aims to reproduce the DF20 authors' implementation of MobileNet_V2 training, but using PyTorch Lightning.

In [None]:
import torch

# Check if CUDA is available
cuda_available = torch.cuda.is_available()

print(f"CUDA is available: {cuda_available}")

# If CUDA is available, print additional details
if cuda_available:
    print(f"Number of CUDA devices: {torch.cuda.device_count()}")
    print(f"CUDA Device Name: {torch.cuda.get_device_name(0)}")
    print(f"CUDA Device Capability: {torch.cuda.get_device_capability(0)}")

CUDA is available: True
Number of CUDA devices: 1
CUDA Device Name: NVIDIA GeForce RTX 3080
CUDA Device Capability: (8, 6)


### Cleanup

When running on a Docker instance, the folders persist between runs, so the cleanup code must be run to remove data left over from previous runs.
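The cleanup cell relies on IPython shell magics (`!rm -rf`). A minimal pure-Python sketch of the same idea, which also works outside a notebook, could use `shutil.rmtree`. The helper name `cleanup` and the `root` parameter are hypothetical; the folder list is copied from the cell below.

```python
import shutil
from pathlib import Path

# Folders produced by earlier runs (same list as the cleanup cell below).
RUN_ARTIFACT_DIRS = ("./lightning_logs", "./wandb", "./checkpoints", "./artifacts")


def cleanup(dirs=RUN_ARTIFACT_DIRS, root="."):
    """Delete leftover run folders under `root`; missing folders are skipped."""
    removed = []
    for d in dirs:
        path = Path(root) / d
        if path.is_dir():
            shutil.rmtree(path)  # recursive delete, like `rm -rf`
            removed.append(str(path))
    return removed


print(cleanup())  # lists the folders that were actually removed
```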

In [None]:
CLEANUP = True
if CLEANUP:
  #!rm -rf './FungiDiploma/'
  !rm -rf './lightning_logs'
  !rm -rf './wandb'
  !rm -rf './checkpoints/'
  !rm -rf './artifacts'

# Installation

Several packages required by the later code are not bundled with Colab.

When adding a new installation, please add `!pip show <packagename>` at the end and use the `-q` switch on pip installs.
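As a complement to `!pip show`, the installed versions can also be checked from Python with the standard library's `importlib.metadata`. This is a small sketch; the `installed_version` helper and the `REQUIRED` list are illustrative, with the package names taken from the install cells below.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages this notebook installs with pip (see the cells below).
REQUIRED = ["timm", "lightning", "wandb"]


def installed_version(name):
    """Return the installed version string of `name`, or None if it is missing."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


for pkg in REQUIRED:
    print(f"{pkg}: {installed_version(pkg) or 'NOT INSTALLED'}")
```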

In [None]:
!pip install timm -q
!pip show timm


Name: timm
Version: 1.0.3
Summary: PyTorch Image Models
Home-page: https://github.com/huggingface/pytorch-image-models
Author: 
Author-email: Ross Wightman <ross@huggingface.co>
License: Apache-2.0
Location: /usr/local/lib/python3.10/dist-packages
Requires: huggingface_hub, pyyaml, safetensors, torch, torchvision
Required-by: 


In [None]:
!pip install lightning -q
!pip show lightning

Name: lightning
Version: 2.2.5
Summary: The Deep Learning framework to train, deploy, and ship AI products Lightning fast.
Home-page: https://github.com/Lightning-AI/lightning
Author: Lightning AI et al.
Author-email: pytorch@lightning.ai
License: Apache-2.0
Location: /usr/local/lib/python3.10/dist-packages
Requires: fsspec, lightning-utilities, numpy, packaging, pytorch-lightning, PyYAML, torch, torchmetrics, tqdm, typing-extensions
Required-by: 


In [None]:
#!pip install dropbox

In [None]:
# Reinstall urllib3
#!pip install --upgrade --force-reinstall urllib3

# Reinstall wandb
#!pip install --upgrade --force-reinstall wandb

In [None]:
# install weights and biases
!pip install wandb -qU
!pip show wandb

Name: wandb
Version: 0.17.1
Summary: A CLI and library for interacting with the Weights & Biases API.
Home-page: 
Author: 
Author-email: Weights & Biases <support@wandb.com>
License: MIT License

# Download

Clone the fungi repo and download the dataset.

In [None]:
# remove the clone when running on local dev
from pathlib import Path
if Path('./fungi').exists():
  !rm -rf './fungi'

# clone repo
# REMOVED DUE TO SECRET

Cloning into 'fungi'...
remote: Enumerating objects: 153, done.
remote: Counting objects: 100% (153/153), done.
remote: Compressing objects: 100% (117/117), done.
remote: Total 153 (delta 72), reused 108 (delta 31), pack-reused 0
Receiving objects: 100% (153/153), 245.22 KiB | 2.75 MiB/s, done.
Resolving deltas: 100% (72/72), done.


In [None]:
from pathlib import Path

# skip data download when the data folder exists
if not Path('./data').exists():
  %run "./fungi/dataset/DF20_dataset_download.ipynb"   #NOTE: using full path / non-relative

# Experiment setup

Fill in the input sizes and the mean/stddev for the dataloader. Different models use different settings. All values must be

## Reproducibility

If we need to compare two models, or two flavors of one model, the only way is to train them in exactly the same way. To do this, set `DETERMINISTIC=True`, which makes the code below reproducible. Reproducibility can be checked by running two instances of the same model and comparing their per-epoch training loss.

The seed value can be changed via `GLOBAL_SEED`.
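The helpers `set_random_seed` and `seed_worker` come from `fungi.utils.deterministic`, whose source is not shown here. The sketch below shows what such helpers typically contain, assuming they follow the PyTorch reproducibility notes; the actual repo implementation may differ. numpy and torch are imported optionally only so the sketch runs anywhere.

```python
import random

try:
    import numpy as np
except ImportError:  # numpy is optional in this sketch
    np = None
try:
    import torch
except ImportError:  # torch is optional in this sketch
    torch = None


def set_random_seed(seed: int) -> None:
    """Seed every RNG the training pipeline may touch (sketch)."""
    random.seed(seed)
    if np is not None:
        np.random.seed(seed)
    if torch is not None:
        torch.manual_seed(seed)  # seeds the CPU and all CUDA devices
        torch.backends.cudnn.benchmark = False  # avoid non-deterministic autotuning


def seed_worker(worker_id: int) -> None:
    """DataLoader worker_init_fn: derive numpy/random seeds from the worker's torch seed."""
    if torch is None:
        return
    worker_seed = torch.initial_seed() % 2**32
    random.seed(worker_seed)
    if np is not None:
        np.random.seed(worker_seed)
```

`seed_worker` is passed as `worker_init_fn` together with a seeded `torch.Generator`, which is exactly how `generateDataLoaders()` further down uses the repo versions.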

In [None]:
# functions for determinism
from fungi.utils.deterministic import seed_worker, set_random_seed

# set to provide a deterministic model execution
# use only when comparable outcomes are needed
DETERMINISTIC = True

# set the seed if needed
GLOBAL_SEED = 0

# Model import

The timm library is used as the source of models for training. A custom model implementation can be placed in the cell below. The model provides the mean and standard deviation used by the augmentations; the last cell in this part passes them on to the dataset for later use.
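To show what the mean and standard deviation are used for, here is a sketch of per-channel normalization, `(x - mean) / std`, applied to a single RGB pixel scaled to [0, 1]. This is the usual Normalize step of an augmentation pipeline. The ImageNet statistics below are an assumption for illustration only; the notebook reads the real values from `model.default_cfg`.

```python
# ImageNet statistics: ASSUMED for illustration; the notebook uses model.default_cfg.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Apply per-channel (x - mean) / std to one RGB pixel scaled to [0, 1]."""
    return tuple((x - m) / s for x, m, s in zip(rgb, mean, std))


print(normalize_pixel((0.485, 0.456, 0.406)))  # the mean pixel maps to (0.0, 0.0, 0.0)
print(normalize_pixel((1.0, 1.0, 1.0)))        # a white pixel maps to positive values
```

Because the statistics differ between pretrained models, taking them from the model itself (rather than hard-coding them as above) keeps the augmentations consistent with the chosen backbone.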

In [None]:
# Make an instance of the chosen model

if DETERMINISTIC:
  set_random_seed(GLOBAL_SEED)

from fungi.models.ref_mobilenet_v2 import RefMobileNet_V2
model = RefMobileNet_V2()



In [None]:
# place custom class here

#if DETERMINISTIC:
# set_random_seed(GLOBAL_SEED)

In [None]:
# provide the values for the future use by dataset&augmentations
mean = model.default_cfg['mean']
std_dev = model.default_cfg['std']

# Dataset and dataloader

First, the dataset metadata is loaded from a fixed location.

In [None]:
import pandas as pd
# load metadata
train_metadata = pd.read_csv('./data/FungiCLEF2023_train_metadata_PRODUCTION.csv')
val_metadata = pd.read_csv('./data/FungiCLEF2023_val_metadata_PRODUCTION.csv')

Cleanup of the validation dataset: it contains unlabeled samples (label -1), which the DF20 authors use as a private test set. These entries must be removed from the metadata.

The file paths must also be adjusted to point to a known location.
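`removeNonLabeledData` and `adjustPaths` live in `fungi.dataset.dataset` and are not shown here. A rough pandas sketch of the two steps might look like this. The column names `class_id` and `image_path`, the helpers `remove_unlabeled` and `adjust_paths`, and the toy rows are all hypothetical, and the real DF20 CSV schema may differ.

```python
from pathlib import Path

import pandas as pd

# Hypothetical column names; the real DF20 CSV schema may differ.
LABEL_COL = "class_id"
PATH_COL = "image_path"


def remove_unlabeled(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the private-test rows, which carry the label -1."""
    return df[df[LABEL_COL] != -1].reset_index(drop=True)


def adjust_paths(df: pd.DataFrame, image_root: str) -> pd.DataFrame:
    """Point every image path at `image_root`, keeping only the file name."""
    out = df.copy()
    out[PATH_COL] = out[PATH_COL].map(lambda p: str(Path(image_root) / Path(p).name))
    return out


# Toy metadata, made up for the example.
val = pd.DataFrame({
    "image_path": ["/orig/a.jpg", "/orig/b.jpg", "/orig/c.jpg"],
    "class_id": [3, -1, 7],
})
val = adjust_paths(remove_unlabeled(val), "./data/images")
print(val)
```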

In [None]:
from fungi.dataset import dataset as dset
val_metadata = dset.removeNonLabeledData(val_metadata)
train_metadata, val_metadata = dset.adjustPaths(train_metadata, val_metadata)

Building the dataset objects

In [None]:
from torch.utils.data import DataLoader, Dataset
from fungi.dataset.dataset import TrainDataset
from fungi.dataset.augmentations import get_orig_transforms

# TODO: change to a function

train_dataset = TrainDataset(train_metadata, transform=get_orig_transforms('train', mean=mean, std=std_dev))
valid_dataset = TrainDataset(val_metadata, transform=get_orig_transforms('validation', mean=mean, std=std_dev))

Dataloaders: the batch size (`BATCH_SIZE`), `pin_memory`, `persistent_workers` and the number of workers are tuned here to get maximum data throughput.

This can later be moved into the trainer.
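Tuning these settings needs a throughput number to compare. Below is a small sketch of a timing helper that works with any iterable yielding `(inputs, targets)` batches, including a PyTorch `DataLoader`. The helper `measure_throughput` and its `n_batches`/`warmup` parameters are hypothetical, not part of the repo.

```python
import time
from itertools import islice


def measure_throughput(loader, n_batches=50, warmup=5):
    """Return samples/second over `n_batches` batches after skipping `warmup` batches.

    `len(inputs)` is taken as the batch size (true for tensors and lists).
    """
    it = iter(loader)
    for _ in islice(it, warmup):  # let the worker processes spin up before timing
        pass
    samples = 0
    start = time.perf_counter()
    for inputs, _targets in islice(it, n_batches):
        samples += len(inputs)
    elapsed = time.perf_counter() - start
    return samples / elapsed if elapsed > 0 else float("inf")
```

For example, building `DataLoader(train_dataset, batch_size=BATCH_SIZE, num_workers=w, pin_memory=True)` for a few values of `w` and printing `measure_throughput(loader)` for each shows where adding workers stops helping.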

In [None]:
# and dataloaders
#WORKERS = 8 #colab
#BATCH_SIZE = 27

#train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=2, persistent_workers=True, pin_memory=True)
#train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=2, persistent_workers=False, pin_memory=False)
#valid_loader = DataLoader(valid_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=1, pin_memory=True)

Loader creation has moved to the training step (see `generateDataLoaders()` below).

In [None]:
# and dataloaders
#WORKERS = 2 #colab
WORKERS = 4
BATCH_SIZE = 32


def generateDataLoaders():
  if DETERMINISTIC:
    # prepare generators
    g_train = torch.Generator()
    g_train.manual_seed(GLOBAL_SEED)
    g_valid = torch.Generator()
    g_valid.manual_seed(GLOBAL_SEED)
    # prepare the loaders
    train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True,
                              num_workers=WORKERS, persistent_workers=True, pin_memory=True,
                              worker_init_fn=seed_worker, generator=g_train)
    valid_loader = DataLoader(valid_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=WORKERS, pin_memory=True,
                              worker_init_fn=seed_worker, generator=g_valid)
    return train_loader, valid_loader
  else:
    train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True,
                              num_workers=WORKERS, persistent_workers=True, pin_memory=True)
    valid_loader = DataLoader(valid_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=WORKERS, pin_memory=True)
    return train_loader, valid_loader


train_loader, valid_loader = generateDataLoaders()

# Model import and training

### Checkpoints

Checkpoints are stored in the `checkpoints` folder. They are created by a predefined `ModelCheckpoint` callback, configured to keep the best checkpoint and the checkpoint of the last epoch. The monitored metric is the validation set accuracy (`val_acc`).
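Within the same session, `checkpoint_callback.best_model_path` already points at the best checkpoint. After a kernel restart, a sketch like the one below can recover it from the folder. It assumes that Lightning expands the template `"{epoch}-{val_acc:.2f}"` into names such as `epoch=3-val_acc=0.17.ckpt`. The regex and the helper `best_checkpoint` are illustrative, and `last.ckpt` is deliberately ignored.

```python
import re
from pathlib import Path

# Assumed file-name pattern, e.g. "epoch=3-val_acc=0.17.ckpt"; "last.ckpt" does not match.
CKPT_RE = re.compile(r"epoch=(\d+)-val_acc=([0-9.]+)\.ckpt$")


def best_checkpoint(ckpt_dir="./checkpoints"):
    """Return (path, epoch, val_acc) of the highest-val_acc checkpoint, or None."""
    best = None
    for path in Path(ckpt_dir).glob("*.ckpt"):
        m = CKPT_RE.search(path.name)
        if m is None:  # skips last.ckpt and unexpected names
            continue
        epoch, acc = int(m.group(1)), float(m.group(2))
        if best is None or acc > best[2]:
            best = (path, epoch, acc)
    return best
```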

In [None]:
# as a function - run before every training from scratch
from pathlib import Path
from lightning.pytorch.callbacks import ModelCheckpoint

if not Path('./checkpoints').exists():
  !mkdir checkpoints

def generateModelCheckpointCallback():
  return ModelCheckpoint(
    dirpath="./checkpoints",
    filename="{epoch}-{val_acc:.2f}",
    save_top_k=1,
    monitor="val_acc",
    mode="max",
    save_last=True  # saves also the last checkpoint (best+last)
  )

# test
checkpoint_callback = generateModelCheckpointCallback()

### Continue learning and logging

When the learning process is interrupted for any reason, set `RESUME_RUN` to `True`. `WANDB_RUN_ID` must be set to the id of the W&B run being resumed.

The run can be resumed either from the local checkpoint or from a W&B artifact (repo data). When using a W&B checkpoint, the `artifact` variable must be set to the desired wandb path.
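The choice between a fresh run, the local `last.ckpt` and a downloaded artifact can be folded into one helper, because `trainer.fit` treats `ckpt_path=None` as "start from scratch". This is a sketch. The name `resolve_ckpt_path` is hypothetical, and the assumption that the artifact directory contains `model.ckpt` mirrors the artifact cell below.

```python
from pathlib import Path
from typing import Optional


def resolve_ckpt_path(resume: bool,
                      local_ckpt: str = "./checkpoints/last.ckpt",
                      artifact_dir: Optional[str] = None) -> Optional[str]:
    """Pick the checkpoint to resume from; None means 'start a fresh run'.

    A downloaded W&B artifact directory (assumed to contain model.ckpt)
    takes precedence over the local last.ckpt.
    """
    if not resume:
        return None
    if artifact_dir is not None:
        candidate = Path(artifact_dir) / "model.ckpt"
    else:
        candidate = Path(local_ckpt)
    if not candidate.is_file():
        raise FileNotFoundError(f"cannot resume: {candidate} does not exist")
    return str(candidate)
```

With it, the `if RESUME_RUN:` branch around `trainer.fit` could become a single call with `ckpt_path=resolve_ckpt_path(RESUME_RUN, last_checkpoint)`. A missing checkpoint then fails loudly instead of silently starting over.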

In [None]:
import os
from fungi.utils.wandb_handler import Wandb_handler
# read the key from the environment instead of hard-coding the secret in the notebook
api_key = os.environ['WANDB_API_KEY']
wdh = Wandb_handler(api_key=api_key)
!wandb login $api_key   # colab workaround

wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc

wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc

**Fill in below** the details of what you want to do.

In [None]:
# stays here, make conditionals to download the last ckpt etc...
RESUME_RUN = False
WANDB_RUN_ID = 'rcfwe3b5'
artifact = None
last_checkpoint = './checkpoints/last.ckpt'  # points to the last local checkpoint

To import an artifact (a checkpoint stored in W&B), set `artifact` to its identifier. The artifact is downloaded to the `artifacts` folder, and its `model.ckpt` is then used as `last_checkpoint` to continue the training.

In [None]:
if artifact is not None:
  ckpt_path = wdh.download(artifact)
  last_checkpoint= ckpt_path + '/model.ckpt'
  print(last_checkpoint)

### Training

Depending on `RESUME_RUN`, the training process either starts from scratch (a new W&B run) or continues the existing one. Note that when a run is resumed, the epoch count continues from the point at which training stopped and runs until `max_epochs` is reached; it does not restart at epoch 0.
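A quick worked example of what "continue until max epochs" means in practice. The defaults below are taken from the trainer cell (`max_epochs=20`, `limit_train_batches=100`, `BATCH_SIZE=32`). The helper `remaining_schedule` is hypothetical, and it assumes `accumulate_grad_batches=1`, so each batch is one optimizer step.

```python
def remaining_schedule(completed_epochs, max_epochs=20, batches_per_epoch=100, batch_size=32):
    """Work left when a run resumes after `completed_epochs` full epochs.

    Assumes accumulate_grad_batches=1, i.e. one optimizer step per batch.
    """
    epochs_left = max(max_epochs - completed_epochs, 0)
    return {
        "epochs_left": epochs_left,
        "optimizer_steps_left": epochs_left * batches_per_epoch,
        "samples_left": epochs_left * batches_per_epoch * batch_size,
    }


print(remaining_schedule(completed_epochs=7))
# {'epochs_left': 13, 'optimizer_steps_left': 1300, 'samples_left': 41600}
```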

**Logging:** W&B logging can be disabled for experiments that don't need to be stored. Just set the flag below to `False`.

In [None]:
USE_WANDB_LOGGING = True

The training loop generates a new checkpoint callback and new dataloaders when run.

In [None]:
import lightning as L
import torch
import wandb
from lightning.pytorch.loggers import WandbLogger
import numpy as np

# RESUME RUN & USE_WANDB_LOGGING
if USE_WANDB_LOGGING:
  if RESUME_RUN:
    wandb_logger = WandbLogger(project="FungiDiploma", log_model="all", id=WANDB_RUN_ID, resume="must")
    # we reuse the callback ?
  else:
    wandb_logger = WandbLogger(project="FungiDiploma", log_model="all")
else:
  wandb_logger = None

# enable TF32 tensor cores when running on Ampere or newer GPUs (L4, RTX 30xx, etc.)
# accepted values: 'highest' (default), 'high', 'medium'
torch.set_float32_matmul_precision('medium')

# each run must have a new callback ! (in case we need to continue or repeat...)
checkpoint_callback = generateModelCheckpointCallback()

trainer = L.Trainer(
      logger=wandb_logger,
      callbacks=[checkpoint_callback],
      limit_train_batches=100,
      limit_val_batches=10,
      max_epochs=20,
      accumulate_grad_batches=1,
      deterministic=DETERMINISTIC
  )

# regenerate data loaders
train_loader, valid_loader = generateDataLoaders()

if DETERMINISTIC:
  set_random_seed(GLOBAL_SEED)
if RESUME_RUN:
  trainer.fit(model=model, train_dataloaders=train_loader, val_dataloaders=valid_loader, ckpt_path=last_checkpoint)
else:
  trainer.fit(model=model, train_dataloaders=train_loader, val_dataloaders=valid_loader)

#print(f"Run ID={wandb.run.id}")

INFO: GPU available: True (cuda), used: True
INFO: TPU available: False, using: 0 TPU cores
INFO: IPU available: False, using: 0 IPUs
INFO: HPU available: False, using: 0 HPUs



INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO: 
  | Name       | Type               | Params
--------------------------------------------------
0 | model      | MobileNetV2        | 4.3 M 
1 | loss_fn    | CrossEntropyLoss   | 0     
2 | train_acc  | MulticlassAccuracy | 0     
3 | train_loss | MeanMetric         | 0     
4 | val_acc    | MulticlassAccuracy | 0     
5 | val_acc3   | MulticlassAccuracy | 0     
6 | val_f1     | MulticlassF1Score  | 0     
--------------------------------------------------
4.3 M     Trainable params
0         Non-trainable params
4.3 M     Total params
17.114    Total estimated model params size (MB)

In [None]:
# shut down wandb run
wandb.finish()


Run history:
epoch                ▁▁
train_acc            ▁
train_loss           ▁
trainer/global_step  ▁▁
val_acc              ▁
val_acc3             ▁
val_f1               ▁

Run summary:
epoch                0.0
train_acc            0.31583
train_loss           3.03459
trainer/global_step  99.0
val_acc              0.17188
val_acc3             0.33437
val_f1               0.10021
