# Training Demo

In this notebook we will run the training script for the work [*Unsupervised Change Detection of Extreme Events Using ML On-Board*](http://arxiv.org/abs/2111.02995). This work was conducted at the [FDL Europe 2021](https://fdleurope.org/fdl-europe-2021) research accelerator program.

**These instructions are meant to work on your local machine** (we don't use the Google Colab environment).

*Note that in practice training takes a long time, so this notebook should serve only as an orientation demo.*

## 1 Preparation

- Get the dataset (for this demo we also provide a tiny subset of the training dataset, see below)

- For better visualizations, log in to Weights & Biases with: `wandb init`



## 2 Libraries

**Run these:**

```
make requirements
conda activate ravaen_env
conda install nb_conda
jupyter notebook
# start this notebook
```

In [1]:
!pip install --quiet --upgrade gdown

In [2]:
!conda info | grep 'active environment'

     active environment : ravaen_env


In [3]:
!nvidia-smi

Sun Jan  7 16:22:08 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.05              Driver Version: 545.84       CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce RTX 3050 ...    On  | 00000000:01:00.0 Off |                  N/A |
| N/A   49C    P0              12W /  42W |      0MiB /  4096MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [7]:
# The official training dataset is much larger; for the purpose of this demo we provide a small subset:
!gdown https://drive.google.com/uc?id=1rl3Clf0c7HlXnlPXO837Pjr2iCjwak0Y -O train_minisubset.zip
!unzip -q train_minisubset.zip
!rm train_minisubset.zip

Downloading...
From (uriginal): https://drive.google.com/uc?id=1rl3Clf0c7HlXnlPXO837Pjr2iCjwak0Y
From (redirected): https://drive.google.com/uc?id=1rl3Clf0c7HlXnlPXO837Pjr2iCjwak0Y&confirm=t&uuid=d37337b2-4d62-4e13-9a2d-8a23a97a6fbe
To: /home/lucap/l46/l46-project/RaVAEn-master/notebooks/train_minisubset.zip
100%|████████████████████████████████████████| 658M/658M [00:31<00:00, 20.6MB/s]


**Edit the paths in `config/config.yaml`:**

```
log_dir: "/home/<USER>/results"
cache_dir: "/home/<USER>/cache"
```
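
The two path entries can also be filled in programmatically. The following is a minimal sketch, not part of the repository: the helper name `set_config_paths` is hypothetical, and it assumes that `log_dir` and `cache_dir` each sit on their own top-level line in the form `key: "value"`, as in the snippet above. It rewrites only those two lines and leaves the rest of the text untouched.

```python
import re
from pathlib import Path


def set_config_paths(yaml_text: str, log_dir: str, cache_dir: str) -> str:
    """Return yaml_text with the top-level log_dir and cache_dir values replaced.

    Hypothetical helper: assumes each key is on its own line as `key: "value"`.
    """
    for key, value in (("log_dir", log_dir), ("cache_dir", cache_dir)):
        pattern = re.compile(rf"^{key}:.*$", flags=re.MULTILINE)
        if not pattern.search(yaml_text):
            raise KeyError(f"{key} not found in config text")
        # A function replacement avoids backslash-escape surprises in paths.
        yaml_text = pattern.sub(lambda _m, k=key, v=value: f'{k}: "{v}"', yaml_text)
    return yaml_text


# Example with placeholder directories under the current user's home.
home = Path.home()
example = (
    'entity: "mlpayloads"\n'
    'log_dir: "/home/<USER>/results"\n'
    'cache_dir: "/home/<USER>/cache"\n'
)
print(set_config_paths(example, str(home / "results"), str(home / "cache")))
```

To apply it to the real file, read `config/config.yaml` with `Path.read_text()`, pass the text through the helper, and write the result back with `Path.write_text()`.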

In [4]:
!cat ../config/config.yaml
# Fill in:
# log_dir: "/home/<USER>/results"
# cache_dir: "/home/<USER>/cache"

---
entity: "mlpayloads"

# FL setup
fraction_fit: 0.0001
fraction_eval: 0.001
min_fit_clients: 20
min_eval_clients: 20
local_epochs: 10
num_rounds: 5
num_clients: 20
config_fit:
  lr: 0.001
  momentum: 0.9
  local_epochs: 10

# RaVAEn setup
log_dir: "/home/josie/l46/l46-project/RaVAEn-master/outputs/results"
cache_dir: "/home/josie/l46/l46-project/RaVAEn-master/outputs/cache"


In [1]:
!pwd

/home/josie/l46/l46-project/RaVAEn-master/notebooks


In [1]:
import os

os.chdir('/home/josie/l46/l46-project/RaVAEn-master')

In [3]:
!pwd

/home/josie/l46/l46-project/RaVAEn-master
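
Hard-coding an absolute path in `os.chdir` ties the notebook to one machine. As a hedged alternative, the sketch below walks upward from the current directory until it finds a folder that contains both `scripts` and `config`, the two directories this notebook refers to. The helper name `find_repo_root` and the choice of marker directories are assumptions, not part of the repository.

```python
import os
from pathlib import Path


def find_repo_root(start: Path, markers=("scripts", "config")) -> Path:
    """Return the nearest ancestor of `start` (inclusive) containing all marker dirs."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if all((candidate / marker).is_dir() for marker in markers):
            return candidate
    raise FileNotFoundError(f"no ancestor of {start} contains {markers}")


try:
    os.chdir(find_repo_root(Path.cwd()))
    print("Working directory:", Path.cwd())
except FileNotFoundError as err:
    # Outside the repository the search fails and the directory is left unchanged.
    print("Repository root not found:", err)
```

Started from the `notebooks` folder, this would move to the parent folder, provided that parent holds both marker directories.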


In [28]:
# ===== Parameters to adjust =====
epochs = 100
dataset_root_folder = "/home/lucap/l46/l46-project/RaVAEn-master/notebooks/train_minisubset"
dataset="alpha_multiscene_tiny" # for the demo, for the full training dataset we would use: dataset="alpha_multiscene"

name="VAE_128small" # note "small" uses these settings > module.model_cls_args.latent_dim=128 module.model_cls_args.extra_depth_on_scale=0 module.model_cls_args.hidden_channels=[16,32,64]

# ===== Parameters to keep the same ======
training="simple_vae"
module="deeper_vae"

# ========================================

#python3 -m scripts.train_model +dataset=alpha_multiscene_tiny ++dataset.root_folder="/home/lucap/l46/l46-project/RaVAEn-master/notebooks/train_minisubset" +normalisation=log_scale +channels=high_res +training=simple_vae +module=deeper_vae +project=train_VAE_128small +name="VAE_128small" module.model_cls_args.latent_dim=128 module.model_cls_args.extra_depth_on_scale=0 module.model_cls_args.hidden_channels=[16,32,64] training.epochs=100

!python3 -m scripts.train_model +dataset=$dataset ++dataset.root_folder="{dataset_root_folder}" \
         +normalisation=log_scale +channels=high_res +training=$training +module=$module +project=train_VAE_128small +name="{name}" \
         module.model_cls_args.latent_dim=128 module.model_cls_args.extra_depth_on_scale=0 module.model_cls_args.hidden_channels=[16,32,64] \
         training.epochs=$epochs

The version_base parameter is not specified.
Please specify a compatability version level, or None.
Will assume defaults for version 1.1
  @hydra.main(config_path='../config', config_name='config.yaml')
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
Global seed set to 42
Error executing job with overrides: ['+dataset=alpha_multiscene_tiny', '++dataset.root_folder=/home/lucap/l46/l46-project/RaVAEn-master/notebooks/train_minisubset', '+normalisation=log_scale', '+channels=high_res', '+training=simple_vae', '+module=deeper_vae', '+project=train_VAE_128small', '+name=VAE_128small', 'module.model_cls_args.latent_dim=128', 'module.model_cls_args.extra_depth_on_scale=0', 'module.model_cls_args.hidden_channels=[16,32,64]', 'training.epochs=100']
Traceback (most recent call last):
  File "/home/josie/l46/l46-project/RaVAEn-master/scripts/train_model.py", line 21, in main
    data_module = ParsedDataModule.load_or_create(cfg[

In [None]:
## MAIN
# ===== Parameters to adjust =====
epochs = 100
dataset_root_folder = "/home/josie/l46/l46-project/phoenix/train_minisubset"
dataset="alpha_multiscene_tiny" # for the demo, for the full training dataset we would use: dataset="alpha_multiscene"

# note "small" uses these settings:
# module.model_cls_args.latent_dim=128 
# module.model_cls_args.extra_depth_on_scale=0 
# module.model_cls_args.hidden_channels=[16,32,64]
name="VAE_128small" 

# ===== Parameters to keep the same ======
training="simple_vae"
module="deeper_vae"

# ========================================

!HYDRA_FULL_ERROR=1 python -m scripts.main +dataset=$dataset ++dataset.root_folder="{dataset_root_folder}" +training=$training \
        +module=$module +normalisation=log_scale +channels=high_res +name="{name}" module.model_cls_args.latent_dim=128

Training on cpu using PyTorch 1.9.0 and Flower 1.6.0
The version_base parameter is not specified.
Please specify a compatability version level, or None.
Will assume defaults for version 1.1
  @hydra.main(config_path="../config", config_name="config")
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
Global seed set to 42

Preprocessing dataset...
Extracted latent_dim=128 from config.yaml
Extracted vis_channels=[2, 1, 0] from config.yaml

Datamodule created!
Partitioned 12180 of 12190 training set entries into 20 partitions of length 609
Training set length: 12180
Sum of partitions: 12180

FedRaVAEn dataset loaded!
Number of training loaders: 20, Number of Validation Loaders: 20
Length of each partition's dataset: 549 (training), 60 (validation)
Length of test dataset: 726
Initialising Ray runtime..
2024-01-08 17:31:56,610	INFO worker.py:1621 -- Started a local Ray instance.
2024-01-08 17:31:56,707	INFO packaging.py:518 

[2m[36m(DefaultActor pid=74060)[0m Epoch 1: train loss nan[32m [repeated 6x across cluster][0m
[2m[36m(DefaultActor pid=74055)[0m [Client 11] fit, config: {}[32m [repeated 6x across cluster][0m
[2m[36m(DefaultActor pid=74057)[0m [Client 15] evaluate, config: {}[32m [repeated 5x across cluster][0m
100%|██████████| 3/3 [00:22<00:00,  7.49s/it]
DEBUG flwr 2024-01-08 17:32:55,466 | server.py:236 | fit_round 2 received 6 results and 0 failures
[2024-01-08 17:32:55,466][flwr][DEBUG] - fit_round 2 received 6 results and 0 failures
DEBUG flwr 2024-01-08 17:32:55,540 | server.py:173 | evaluate_round 2: strategy sampled 6 clients (out of 20)
[2024-01-08 17:32:55,540][flwr][DEBUG] - evaluate_round 2: strategy sampled 6 clients (out of 20)
DEBUG flwr 2024-01-08 17:32:57,720 | server.py:187 | evaluate_round 2 received 6 results and 0 failures
[2024-01-08 17:32:57,720][flwr][DEBUG] - evaluate_round 2 received 6 results and 0 failures
DEBUG flwr 2024-01-08 17:32:57,720 | server.py:222 
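
The partition sizes reported in the log above follow from simple integer arithmetic. The sketch below reproduces them under two assumptions that the log itself does not confirm: the training set is cut into equal partitions with the remainder dropped, and each partition holds out 10% of its entries (rounded down) for validation. The function name `partition_sizes` is hypothetical.

```python
def partition_sizes(n_samples: int, n_clients: int, val_fraction: float = 0.1):
    """Return (partition_len, used, dropped, n_train, n_val) for an equal split.

    Assumptions (not confirmed by the source): leftover entries are dropped and
    the validation share is floor(partition_len * val_fraction).
    """
    partition_len = n_samples // n_clients   # entries per client
    used = partition_len * n_clients         # entries actually assigned
    dropped = n_samples - used               # remainder that fits no partition
    n_val = int(partition_len * val_fraction)
    n_train = partition_len - n_val
    return partition_len, used, dropped, n_train, n_val


# 12190 entries over 20 clients, as reported in the log.
print(partition_sizes(12190, 20))  # (609, 12180, 10, 549, 60)
```

With these assumptions the numbers match the log: 20 partitions of 609 entries cover 12180 of the 12190 entries, and each partition splits into 549 training and 60 validation entries.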

In [7]:
!pwd

/home/josie/l46/l46-project/RaVAEn-master


In [17]:
dataset_root_folder = "/home/lucap/l46/l46-project/RaVAEn-master/notebooks/train_minisubset"
dataset="alpha_multiscene_tiny" # for the demo, for the full training dataset we would use: dataset="alpha_multiscene"

!python3 -m scripts.make_datamodule +dataset=$dataset ++dataset.root_folder="{dataset_root_folder}"

Error executing job with overrides: ['+dataset=alpha_multiscene_tiny', '++dataset.root_folder=/home/lucap/l46/l46-project/RaVAEn-master/notebooks/train_minisubset']
Traceback (most recent call last):
  File "/home/lucap/l46/l46-project/RaVAEn-master/scripts/make_datamodule.py", line 14, in main
    cfg = deepconvert(cfg)
  File "/home/lucap/l46/l46-project/RaVAEn-master/src/utils.py", line 26, in deepconvert
    not_omega_conf.update({k: deepconvert(v)})
  File "/home/lucap/l46/l46-project/RaVAEn-master/src/utils.py", line 26, in deepconvert
    not_omega_conf.update({k: deepconvert(v)})
  File "/home/lucap/l46/l46-project/RaVAEn-master/src/utils.py", line 25, in deepconvert
    for k, v in omega_conf.items():
omegaconf.errors.InterpolationKeyError: Interpolation key 'training.batch_size_train' not found
    full_key: dataset.train.batch_size
    object_type=dict

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
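
The error above says that `dataset.train.batch_size` refers to `training.batch_size_train`, which is missing from the composed configuration. The command passed only `+dataset=...`, so no `training` group was loaded. The sketch below is a toy stand-in for this kind of interpolation lookup, written with plain dictionaries. It is not OmegaConf's implementation: the functions `lookup` and `resolve` are hypothetical, and the form of the dataset entry is inferred from the error message.

```python
import re

# Matches a value that is exactly one reference, e.g. "${training.batch_size_train}".
_REFERENCE = re.compile(r"^\$\{([^}]+)\}$")


def lookup(cfg: dict, dotted_key: str):
    """Walk nested dicts along a dotted key; raise KeyError if any part is missing."""
    node = cfg
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            raise KeyError(dotted_key)
        node = node[part]
    return node


def resolve(cfg: dict, dotted_key: str):
    """Toy resolver: return the value at dotted_key, following '${...}' references."""
    value = lookup(cfg, dotted_key)
    match = _REFERENCE.match(value) if isinstance(value, str) else None
    return resolve(cfg, match.group(1)) if match else value


# Only the dataset group is present, as in the failing command.
dataset_only = {"dataset": {"train": {"batch_size": "${training.batch_size_train}"}}}
try:
    resolve(dataset_only, "dataset.train.batch_size")
except KeyError as err:
    print("Missing interpolation target:", err)

# With a training group present, the reference can be followed.
with_training = {**dataset_only, "training": {"batch_size_train": 256}}
print("Resolved batch size:", resolve(with_training, "dataset.train.batch_size"))
```

If this reading is right, adding a training group to the command (for example `+training=simple_vae`, as the training commands in this notebook do) would supply the missing key. Whether `scripts.make_datamodule` then runs to completion has not been verified here.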


### More advanced settings:

List the available options with `--help`, then look at the individual configuration files for the details of each option.

In [None]:
!python3 -m scripts.train_model --help

train_model is powered by Hydra.

== Configuration groups ==
Compose your configuration from those groups (group=option)

channels: all, high_res, high_res_phisat2overlap, rgb, rgb_nir, rgb_nir_b11, rgb_nir_b11_b12_landsat, rgb_nir_b12
dataset: alpha_multiscene, alpha_multiscene_tiny, alpha_singlescene, dataloader_test, eval, fire, fires, floods_evaluation, hurricanes, landslides, landslides_2, oilspills, preliminary, preliminary_da, preliminary_multiscene, preliminary_sequential, preliminary_sequential_bigger, preliminary_sequential_bigger_9k, preliminary_sequential_bigger_multiEval, preliminary_sequential_bigger_multiEval_Germany, samples_for_gui, temporal_analysis, volcanos
evaluation: ae_base, ae_fewer, vae_base, vae_da, vae_da_8px, vae_fewer, vae_paper
module: deeper_ae, deeper_ae_bigger_latent, deeper_vae, grx, simple_ae, simple_ae_with_linear, simple_vae
normalisation: log_scale, none
training: da, simple_ae, simple_vae
transform: eval_da, eval_da_8px, eval_nda, eval_

In [None]:
# to see the detailed options for "training: da, simple_ae, simple_vae"
!cat config/training/simple_vae.yaml
# for example, we would then set the number of epochs by adding this to the main command:
# training.epochs=1

---
gpus: -1
epochs: 400
grad_batches: 1
distr_backend: 'dp'
use_amp: true # ... true = 16 precision / false = 32 precision

# The check_val_every_n_epoch and val_check_interval settings overlap, see:
#     https://github.com/PyTorchLightning/pytorch-lightning/issues/6385
val_check_interval: 0.2  # either int to check after that many batches or float to check after that fraction of an epoch
check_val_every_n_epoch: 1 

fast_dev_run: false

num_workers: 16

batch_size_train: 256
batch_size_valid: 256
batch_size_test: 256

lr: 0.001
weight_decay: 0.0
# scheduler_gamma: 0.95

# auto_batch_size: 'binsearch'
#auto_lr: 'lr'
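
The commands in this notebook combine three kinds of Hydra overrides: `+group=option` adds a config group that is not in the defaults list, `++key=value` sets a key whether or not it already exists, and a plain `key=value` changes an existing value (such as `training.epochs=1`). The sketch below assembles such a command from Python values. The helper name `build_train_command` is hypothetical, the dataset path is a placeholder, and the command is only printed, not executed.

```python
import shlex


def build_train_command(groups: dict, forced: dict, overrides: dict,
                        script: str = "scripts.train_model") -> list:
    """Build an argv list for `python3 -m <script>` with Hydra-style overrides."""
    argv = ["python3", "-m", script]
    # +group=option: add a config group that is not in the defaults list.
    argv += [f"+{group}={option}" for group, option in groups.items()]
    # ++key=value: add the key, or override it if it already exists.
    argv += [f"++{key}={value}" for key, value in forced.items()]
    # key=value: override a value that already exists in the config.
    argv += [f"{key}={value}" for key, value in overrides.items()]
    return argv


argv = build_train_command(
    groups={"dataset": "alpha_multiscene_tiny", "training": "simple_vae", "module": "deeper_vae"},
    forced={"dataset.root_folder": "/path/to/train_minisubset"},  # placeholder path
    overrides={"training.epochs": 1},
)
print(shlex.join(argv))
```

The resulting list could be passed to `subprocess.run(argv, check=True)` from the repository root. Building it as a list rather than a single string avoids shell quoting issues with values such as `[16,32,64]`.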
