In [14]:
# Copyright (c) Facebook, Inc. and its affiliates. All rights reserved.
# edited by Richard Terrile to facilitate use with ZTF image data.

# Understanding VISSL Training and YAML Config and Using ZTF Image Data

This tutorial will guide you through using a Supervised ResNet-50 model on custom data from the Zwicky Transient Facility, and understanding various parts of the model training configuration.

You can make a copy of this tutorial by `File -> Open in playground mode` and make changes there. DO NOT request access to this tutorial.

**NOTE:** Please ensure your Colab Notebook has a GPU available. To check or select this, follow: `Edit -> Notebook Settings -> select GPU`.

## Introduction and Outline
This tutorial will walk through the following steps:
- Introduction to VISSL
- Installing VISSL
- Downloading YAML file for Supervised Training using resnet-50 model
- Download VISSL Training tool
- Creating a Custom Dataset
- Downloading ZTF Image Data
- Registering the Custom Data with VISSL
- Training the Model
- Training Logs and Checkpoints
- Understanding the Training Command
- Understanding the Training stdout
- Understanding the YAML Configuration File
- Visualization using Tensorboard


# Install VISSL

Installing VISSL is pretty straightforward. Use the pip binaries of VISSL and follow the instructions [here](https://github.com/facebookresearch/vissl/blob/master/INSTALL.md#install-vissl-pip-package).

In [15]:
# installing pytorch and torchvision
!pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 -f https://download.pytorch.org/whl/torch_stable.html
# installing apex
!pip install -f https://dl.fbaipublicfiles.com/vissl/packaging/apexwheels/py37_cu101_pyt171/download.html apex
# clone vissl repository
!git clone --recursive https://github.com/facebookresearch/vissl.git 
# install vissl dependencies
!pip install --progress-bar off -r requirements.txt
!pip install opencv-python
# update classy vision install to current master
!pip uninstall -y classy_vision
!pip install classy-vision@https://github.com/facebookresearch/ClassyVision/tarball/master

Looking in links: https://download.pytorch.org/whl/torch_stable.html
Looking in links: https://dl.fbaipublicfiles.com/vissl/packaging/apexwheels/py37_cu101_pyt171/download.html
fatal: destination path 'vissl' already exists and is not an empty directory.
Collecting fairscale@ https://github.com/facebookresearch/fairscale/tarball/df7db85cef7f9c30a5b821007754b96eb1f977b6
  Using cached https://github.com/facebookresearch/fairscale/tarball/df7db85cef7f9c30a5b821007754b96eb1f977b6
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
    Preparing wheel metadata ... done
Found existing installation: classy-vision 0.7.0.dev0
Uninstalling classy-vision-0.7.0.dev0:
  Successfully uninstalled classy-vision-0.7.0.dev0
Collecting classy-vision@ https://github.com/facebookresearch/ClassyVision/tarball/master
  Using cached https://github.com/facebookresearch/ClassyVision/ta

Next, install VISSL itself with its `[dev]` extras from inside the cloned `/content/vissl` folder. First change into that folder:

In [16]:
cd /content/vissl

/content/vissl


The following command installs the package in the current folder together with its `[dev]` extras:

In [17]:
!pip install ".[dev]"

Processing /content/vissl
  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.
Collecting fairscale@ https://github.com/facebookresearch/fairscale/tarball/df7db85cef7f9c30a5b821007754b96eb1f977b6
  Using cached https://github.com/facebookresearch/fairscale/tarball/df7db85cef7f9c30a5b821007754b96eb1f977b6
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
    Preparing wheel metadata ... done
Building wheels for collected packages: vissl
  Building wheel for vissl (setup.py) ... done
  Created wheel for 

Finally, run this line to confirm that VISSL and apex can both be imported.

In [18]:
!python -c 'import vissl, apex'

VISSL should be successfully installed by now and all the dependencies should be available. Run the following code cell to make sure it works.

In [19]:
import vissl
import tensorboard
import apex
import torch
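
If one of the imports above fails, it can help to see which packages are missing before re-running the install cells. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and is not part of VISSL.

```python
import importlib.util

def missing_packages(names):
    """Return the subset of top-level package `names` that the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# The four packages imported in the cell above.
required = ["vissl", "tensorboard", "apex", "torch"]
print("Missing packages:", missing_packages(required))
```

An empty list means every package was found; any name that is printed points at the install step to repeat.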

# YAML config file for Supervised Training

Definition of YAML via https://blog.stackpath.com/yaml/:
- YAML is a human-readable data serialization standard that can be used in conjunction with all programming languages and is often used to write configuration files.

VISSL provides YAML configuration files that reproduce training of all self-supervised approaches [here](https://github.com/facebookresearch/vissl/tree/master/configs/config/pretrain/supervised). 

For the purpose of this tutorial, we will use the config file for training ResNet-50 supervised model on 1-gpu. Let's go ahead and download the [example config file](https://github.com/facebookresearch/vissl/blob/master/configs/config/pretrain/supervised/supervised_1gpu_resnet_example.yaml).


In [20]:
!mkdir -p configs/config/
!wget -O configs/__init__.py https://dl.fbaipublicfiles.com/vissl/tutorials/configs/__init__.py 
!wget -O configs/config/supervised_1gpu_resnet_example.yaml https://dl.fbaipublicfiles.com/vissl/tutorials/configs/supervised_1gpu_resnet_example.yaml

--2021-09-01 00:25:30--  https://dl.fbaipublicfiles.com/vissl/tutorials/configs/__init__.py
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 172.67.9.4, 104.22.75.142, 104.22.74.142, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|172.67.9.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 71 [text/plain]
Saving to: ‘configs/__init__.py’


2021-09-01 00:25:31 (11.5 MB/s) - ‘configs/__init__.py’ saved [71/71]

--2021-09-01 00:25:31--  https://dl.fbaipublicfiles.com/vissl/tutorials/configs/supervised_1gpu_resnet_example.yaml
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 172.67.9.4, 104.22.75.142, 104.22.74.142, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|172.67.9.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2673 (2.6K) [text/plain]
Saving to: ‘configs/config/supervised_1gpu_resnet_example.yaml’


2021-09-01 00:25:32 (33.2 MB/s) - ‘configs/config/supervised_1gpu_resne

## Built-in training tool in VISSL

VISSL also provides a [helper Python tool](https://github.com/facebookresearch/vissl/blob/master/tools/run_distributed_engines.py) for running training with VISSL. This tool:
- supports both training and feature extraction.
- supports training on a single GPU or on multiple GPUs.
- can be used to launch multi-machine distributed training.

This tool is how you will be training your model.

Let's go ahead and download this tool directly.

In [21]:
!wget https://dl.fbaipublicfiles.com/vissl/tutorials/run_distributed_engines.py

--2021-09-01 00:25:32--  https://dl.fbaipublicfiles.com/vissl/tutorials/run_distributed_engines.py
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 172.67.9.4, 104.22.75.142, 104.22.74.142, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|172.67.9.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6568 (6.4K) [text/x-python]
Saving to: ‘run_distributed_engines.py.2’


2021-09-01 00:25:33 (57.5 MB/s) - ‘run_distributed_engines.py.2’ saved [6568/6568]



## Creating a custom dataset

The dataset uses the folder layout below. Run the code to create the directories; we will then download the image data. Images with different labels are stored in separate label folders (i.e. `label1`, `label2`, etc.).

In [22]:
!mkdir -p custom_data/train/label1
!mkdir -p custom_data/train/label2
!mkdir -p custom_data/val/label1
!mkdir -p custom_data/val/label2
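
The same layout can also be created from Python, which is convenient when there are more than two labels. This is a small sketch, not part of VISSL; the function name `make_dataset_dirs` is hypothetical, and the demo runs in a temporary directory so it does not touch your real `custom_data` folder.

```python
import tempfile
from pathlib import Path

def make_dataset_dirs(root, labels, splits=("train", "val")):
    """Create root/<split>/<label> for every split and label; return the created paths."""
    created = []
    for split in splits:
        for label in labels:
            path = Path(root) / split / label
            path.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
            created.append(path)
    return created

# Demonstrated in a temporary directory; in the notebook you would pass "custom_data".
with tempfile.TemporaryDirectory() as tmp:
    dirs = make_dataset_dirs(Path(tmp) / "custom_data", ["label1", "label2"])
    layout = sorted(p.relative_to(tmp).as_posix() for p in dirs)
    print(layout)
```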

## Downloading ZTF Image Data

After creating the custom data folders, find the image data you want and download it into the appropriate label folders. First, install `ztfquery`.

In [23]:
pip install ztfquery



Using the following code, print out wget commands to retrieve all the desired image data and save it to the appropriate image label folder.  

Change 'ra' and 'dec' variables to the desired position on the sky.  You must also change the username and password in the wget commands.  And remember to change the save directory at the end of the commands.

If you want the output in a text file simply uncomment the three "outF" lines in the code.  

**Note**: There are two issues with using wget commands in Google Colab. First, if the command only names the destination folder and does not give each image its own filename, the download may fail to import. To avoid this, a `counter` variable counts up from 1 and is used to number the saved files; it is initialized before retrieving the image data. Take a look at the returned wget commands to see how they are formatted, and feel free to change the format.

Second, downloading the data from within a Python script in Google Colab seems to corrupt the files. The best workaround is to save the wget commands and then copy and run them in a code cell.

wget commands in Google Colab need a "!" in front like so:

`!wget http://images.cocodataset.org/val2017/000000439715.jpg`

The code below puts this in for you.

In [24]:
counter = 1

In [None]:
# This code is created by Matthew Graham
from ztfquery import query, buildurl

#outF = open("nebulousData.txt", "a")    # creates a file if it does not exist to print into

ra, dec, size = 86.5546, -0.1014, 10    # sky position, and cutout size in arcsec
zquery = query.ZTFQuery()
zquery.load_metadata(kind="sci", radec=[ra, dec])  # , size=0.01)
mt = zquery.metatable
print(len(mt))    # number of science images found at this position
for i, m in mt.iterrows():
    # Rebuild the science-image filename from the metadata row
    filename = f"ztf_{m['filefracday']}_{m['field']:0>6}_{m['filtercode']}_c{m['ccdid']:0>2}_o_q{m['qid']}_sciimg.fits"
    furl = buildurl.filename_to_scienceurl(filename)
    url = f"{furl}?center={ra},{dec}&size={size}arcsec&gzip=false"
    # Replace USERID and PASSWORD with your own credentials and adjust the output folder as needed
    cmd = f'''!wget --auth-no-challenge --user=USERID --password=PASSWORD "{url}" -q -O custom_data/train/label1/img{counter}.jpg'''
    counter += 1    # `counter` is initialized in the cell above
    #outF.write(cmd)
    #outF.write("\n")
    print(cmd)

Now verify that the data is successfully downloaded:

In [None]:
!ls custom_data/val/label1/
!ls custom_data/train/label1/
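
Beyond listing the folders, a quick per-folder file count makes it easier to spot an empty or lopsided label. The sketch below uses only the standard library; `count_files_per_label` is a hypothetical helper name, and the demo builds a throwaway folder rather than reading your real data.

```python
import tempfile
from pathlib import Path

def count_files_per_label(root):
    """Map '<split>/<label>' to the number of regular files in that folder."""
    counts = {}
    for label_dir in sorted(Path(root).glob("*/*")):
        if label_dir.is_dir():
            key = label_dir.relative_to(root).as_posix()
            counts[key] = sum(1 for f in label_dir.iterdir() if f.is_file())
    return counts

# Self-contained demo; in the notebook call count_files_per_label("custom_data").
with tempfile.TemporaryDirectory() as tmp:
    for rel, n in [("train/label1", 3), ("val/label1", 1)]:
        folder = Path(tmp) / rel
        folder.mkdir(parents=True)
        for i in range(n):
            (folder / f"img{i + 1}.jpg").touch()  # empty placeholder files
    demo_counts = count_files_per_label(tmp)
    print(demo_counts)
```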

## Splitting Up the Image Data

Because this document uses a Supervised ResNet-50 model for training, the data must be split into training images and testing images. These are placed in their respective folders, /train and /val (evaluation).

It appears like this:
- /custom_data/train/label1 
- /custom_data/val/label1  

The following code takes a .txt file and randomly splits its lines into two files, one for training and one for testing. By default it assigns roughly 25% of the lines to training and 75% to testing; feel free to change the threshold.

You should have access to nebulous.txt and nonNebulous.txt which work properly with this code.

In [None]:
from random import seed
from random import random

#seed(132)    # If you would like to seed the randomness

# Read the source lines, dropping trailing newlines and skipping blank lines
with open("data.txt", "r") as file1:    # The file you are reading from
    lines = [line.rstrip("\n") for line in file1 if line.strip()]

# "a" appends, so delete old train.txt/test.txt first if you want a fresh split
with open("train.txt", "a") as outF1, open("test.txt", "a") as outF2:
    for line in lines:
        if random() < .75:      # Currently set for 25% training images and 75% for testing
            outF2.write(line + "\n")    # The data sources you will use to test
        else:
            outF1.write(line + "\n")    # The data sources you will use to train
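
For a split that can be reproduced exactly, the same idea can be wrapped in a function with its own seeded generator. This is an illustrative sketch only; `split_lines` and the placeholder `commands` list are hypothetical and stand in for the lines of your own text file.

```python
import random

def split_lines(lines, test_fraction=0.75, seed=None):
    """Randomly assign each line to (train, test); test_fraction is the chance of landing in test."""
    rng = random.Random(seed)  # private generator, so the global random state is untouched
    train, test = [], []
    for line in lines:
        (test if rng.random() < test_fraction else train).append(line)
    return train, test

commands = [f"cmd{i}" for i in range(100)]  # placeholder lines standing in for wget commands
train, test = split_lines(commands, test_fraction=0.75, seed=132)
print(len(train), len(test))
```

Because each line is assigned independently, the split sizes only approximate the requested fraction; calling the function again with the same seed returns the same two lists.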



## Using the custom data in VISSL

The next step is to register the custom data you created above with VISSL. Registering the dataset involves telling VISSL the dataset name and the paths for the dataset. For this, we create a simple JSON file with the metadata and save it to `configs/config/dataset_catalog.json`.

**NOTE**: VISSL uses the specific `dataset_catalog.json` under the path `configs/config/dataset_catalog.json`.

In [None]:
json_data = {
    "custom_data_folder": {
      "train": [
        "/content/vissl/custom_data/train", "/content/vissl/custom_data/train"
      ],
      "val": [
        "/content/vissl/custom_data/val", "/content/vissl/custom_data/val"
      ]
    }
}

# use VISSL's api to save or you can use your custom code.
from vissl.utils.io import save_file
save_file(json_data, "/content/vissl/configs/config/dataset_catalog.json", append_to_json=False )

** fvcore version of PathManager will be deprecated soon. **
** Please migrate to the version in iopath repo. **
https://github.com/facebookresearch/iopath 
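
As the comment in the cell above notes, you can also write the catalog with your own code. Here is a minimal sketch using the standard `json` module; `write_catalog` is a hypothetical helper, and the demo writes into a temporary directory instead of `configs/config/`.

```python
import json
import tempfile
from pathlib import Path

def write_catalog(path, dataset_name, train_dir, val_dir):
    """Write a one-dataset catalog in the same shape as `json_data` above."""
    catalog = {
        dataset_name: {
            # Each split lists two paths, mirroring the cell above (the same folder twice).
            "train": [train_dir, train_dir],
            "val": [val_dir, val_dir],
        }
    }
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w") as f:
        json.dump(catalog, f, indent=2)
    return catalog

with tempfile.TemporaryDirectory() as tmp:  # demo location only
    target = Path(tmp) / "configs" / "config" / "dataset_catalog.json"
    written = write_catalog(target, "custom_data_folder",
                            "/content/vissl/custom_data/train",
                            "/content/vissl/custom_data/val")
    reloaded = json.loads(target.read_text())
    print(reloaded == written)
```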



Next, verify that the dataset is registered with VISSL. Do this with a query of VISSL's dataset catalog as shown below:

In [None]:
from vissl.data.dataset_catalog import VisslDatasetCatalog

# list all the datasets that exist in catalog
print(VisslDatasetCatalog.list())

# get the metadata of custom_data_folder dataset
print(VisslDatasetCatalog.get("custom_data_folder"))

['custom_data_folder']
{'train': ['/content/vissl/custom_data/train', '/content/vissl/custom_data/train'], 'val': ['/content/vissl/custom_data/val', '/content/vissl/custom_data/val']}


## Train the model

You are ready to train the model. VISSL supports training on a wide range of datasets and allows adding custom datasets. Please see VISSL documentation on how to use other datasets.

For example, to train on an ImageNet folder on disk instead, you would add the following input to your training command:
```
config.DATA.TRAIN.DATASET_NAMES=[imagenet1k_folder] \
config.DATA.TRAIN.DATA_SOURCES=[disk_folder] \
config.DATA.TRAIN.DATA_PATHS=["/path/to/my/imagenet/folder/train"] \
config.DATA.TRAIN.LABEL_SOURCES=[disk_folder]
```

The training command looks like:

In [None]:
!python3 run_distributed_engines.py \
    hydra.verbose=true \
    config=supervised_1gpu_resnet_example \
    config.DATA.TRAIN.DATA_SOURCES=[disk_folder] \
    config.DATA.TRAIN.LABEL_SOURCES=[disk_folder] \
    config.DATA.TRAIN.DATASET_NAMES=[custom_data_folder] \
    config.DATA.TRAIN.DATA_PATHS=[/content/vissl/custom_data/train] \
    config.DATA.TRAIN.BATCHSIZE_PER_REPLICA=2 \
    config.DATA.TEST.DATA_SOURCES=[disk_folder] \
    config.DATA.TEST.LABEL_SOURCES=[disk_folder] \
    config.DATA.TEST.DATASET_NAMES=[custom_data_folder] \
    config.DATA.TEST.DATA_PATHS=[/content/vissl/custom_data/val] \
    config.DATA.TEST.BATCHSIZE_PER_REPLICA=2 \
    config.DISTRIBUTED.NUM_NODES=1 \
    config.DISTRIBUTED.NUM_PROC_PER_NODE=1 \
    config.OPTIMIZER.num_epochs=2 \
    config.OPTIMIZER.param_schedulers.lr.values=[0.01,0.001] \
    config.OPTIMIZER.param_schedulers.lr.milestones=[1] \
    config.TENSORBOARD_SETUP.USE_TENSORBOARD=true \
    config.CHECKPOINT.DIR="./checkpoints"

And you are done!! You now have a Supervised ResNet-50 model trained on a custom dataset and available in `checkpoints/model_final_checkpoint_phase2.torch`.

## Training logs, checkpoints, metrics

VISSL dumps model checkpoints in the checkpoint directory specified by the user. In the above example, we used `./checkpoints` directory. Let's take a look at the content of the checkpoint directory.

In [None]:
ls checkpoints/

We notice:
- model checkpoints `.torch` files after every epoch, 
- model training log `log.txt` which has the full stdout but saved in file
- `metrics.json` if your training calculated some metrics, those metrics values will be saved there.
- `tb_logs` which are the tensorboard events

# Understanding the Training Command

Let's understand the training command we used above. You override the settings in our configuration file to train the desired setting of the model. In the example, we override the dataset to use, #images/gpu, number of GPUs to use and optimizer settings like #epochs and learning rate drops.

```
!python3 run_distributed_engines.py \
    hydra.verbose=true \
    config=supervised_1gpu_resnet_example \
    config.DATA.TRAIN.DATA_SOURCES=[disk_folder] \
    config.DATA.TRAIN.LABEL_SOURCES=[disk_folder] \
    config.DATA.TRAIN.DATASET_NAMES=[custom_data_folder] \
    config.DATA.TRAIN.DATA_PATHS=[/content/vissl/custom_data/train] \
    config.DATA.TRAIN.BATCHSIZE_PER_REPLICA=2 \
    config.DATA.TEST.DATA_SOURCES=[disk_folder] \
    config.DATA.TEST.LABEL_SOURCES=[disk_folder] \
    config.DATA.TEST.DATASET_NAMES=[custom_data_folder] \
    config.DATA.TEST.DATA_PATHS=[/content/vissl/custom_data/val] \
    config.DATA.TEST.BATCHSIZE_PER_REPLICA=2 \
    config.DISTRIBUTED.NUM_NODES=1 \
    config.DISTRIBUTED.NUM_PROC_PER_NODE=1 \
    config.OPTIMIZER.num_epochs=2 \
    config.OPTIMIZER.param_schedulers.lr.values=[0.01,0.001] \
    config.OPTIMIZER.param_schedulers.lr.milestones=[1] \
    config.TENSORBOARD_SETUP.USE_TENSORBOARD=true \
    config.CHECKPOINT.DIR="./checkpoints"
```

We can understand each line as below:

- `config=supervised_1gpu_resnet_example` -> specify the config file for supervised training
- `config.DATA.TRAIN.DATA_SOURCES=[disk_folder] config.DATA.TRAIN.LABEL_SOURCES=[disk_folder]` -> specify the data source for training i.e. `disk_folder`
- `config.DATA.TRAIN.DATASET_NAMES=[custom_data_folder]` -> specify the dataset name i.e. `custom_data_folder`. We registered this dataset above.
- `config.DATA.TRAIN.DATA_PATHS=[/content/vissl/custom_data/train]` -> another way of specifying where the data is on the disk. The example config file provided has some dummy paths set. We must override those with our desired paths.
- `config.DATA.TEST.DATA_SOURCES=[disk_folder] config.DATA.TEST.LABEL_SOURCES=[disk_folder] config.DATA.TEST.DATASET_NAMES=[custom_data_folder]` -> similar settings for the Test dataset.
- `config.DATA.TEST.DATA_PATHS=[/content/vissl/custom_data/val]` -> another way of specifying where the data is on the disk. The example config file provided has some dummy paths set. We must override those with our desired paths.
- `config.DATA.TRAIN.BATCHSIZE_PER_REPLICA=2 config.DATA.TEST.BATCHSIZE_PER_REPLICA=2` -> specify 2 images per GPU for both `TRAIN` and `TEST`.
- `config.DISTRIBUTED.NUM_NODES=1 config.DISTRIBUTED.NUM_PROC_PER_NODE=1` -> specify #gpus=1 and #machines=1.
- `config.OPTIMIZER.num_epochs=2 config.OPTIMIZER.param_schedulers.lr.values=[0.01,0.001] config.OPTIMIZER.param_schedulers.lr.milestones=[1]` -> specify #epochs=2 and drop the learning rate after 1 epoch.
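
Each override in the list above is a dotted path into the nested configuration followed by `=value`. The sketch below illustrates that idea in plain Python. It is not how Hydra is implemented, it handles only the simple `key=value` form, and it keeps values as strings; the starting dictionary is made up for the illustration.

```python
def apply_override(config, override):
    """Set a nested dict entry from an 'a.b.c=value' string (value kept as a string)."""
    dotted_key, value = override.split("=", 1)
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})  # descend, creating levels as needed
    node[leaf] = value
    return config

cfg = {"config": {"OPTIMIZER": {"num_epochs": "10"}}}  # illustrative starting values
for item in ["config.OPTIMIZER.num_epochs=2", "config.DISTRIBUTED.NUM_NODES=1"]:
    apply_override(cfg, item)
print(cfg)
```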



# Understanding Training stdout

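The following output indicates that the training is starting on `rank=0`. Similar output will be printed for each rank. The sample log shown here comes from a run that used `dummy_data_folder`; with the command above, you will see `custom_data_folder` and its paths instead.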
```
####### overrides: ['hydra.verbose=true', 'config=supervised_1gpu_resnet_example', 'config.DATA.TRAIN.DATA_SOURCES=[disk_folder]', 'config.DATA.TRAIN.LABEL_SOURCES=[disk_folder]', 'config.DATA.TRAIN.DATASET_NAMES=[dummy_data_folder]', 'config.DATA.TRAIN.DATA_PATHS=[/content/dummy_data/train]', 'config.DATA.TRAIN.BATCHSIZE_PER_REPLICA=2', 'config.DATA.TEST.DATA_SOURCES=[disk_folder]', 'config.DATA.TEST.LABEL_SOURCES=[disk_folder]', 'config.DATA.TEST.DATASET_NAMES=[dummy_data_folder]', 'config.DATA.TEST.DATA_PATHS=[/content/dummy_data/val]', 'config.DATA.TEST.BATCHSIZE_PER_REPLICA=2', 'config.DISTRIBUTED.NUM_NODES=1', 'config.DISTRIBUTED.NUM_PROC_PER_NODE=1', 'config.OPTIMIZER.num_epochs=2', 'config.OPTIMIZER.param_schedulers.lr.values=[0.01,0.001]', 'config.OPTIMIZER.param_schedulers.lr.milestones=[1]', 'config.TENSORBOARD_SETUP.USE_TENSORBOARD=true', 'config.CHECKPOINT.DIR=./checkpoints', 'hydra.verbose=true']
INFO 2021-01-25 20:04:50,636 __init__.py:  32: Provided Config has latest version: 1
INFO 2021-01-25 20:04:50,638 run_distributed_engines.py: 163: Spawning process for node_id: 0, local_rank: 0, dist_rank: 0, dist_run_id: localhost:56229
INFO 2021-01-25 20:04:50,638 train.py:  66: Env set for rank: 0, dist_rank: 0
```

VISSL is designed for reproducible research, so the training script will first print out the running configuration -- the environment variables, versions of various libraries, the full training config, data size, model etc.

The training will start afterwards and we see output like:

```
INFO 2021-01-25 20:05:01,116 state_update_hooks.py:  98: Starting phase 0 [train]
INFO 2021-01-25 20:05:04,574 log_hooks.py: 155: Rank: 0; [ep: 0] iter: 1; lr: 0.00078; loss: 7.07915; btime(ms): 6923; eta: 0:01:02; peak_mem: 2595M
INFO 2021-01-25 20:05:04,828 tensorboard_hook.py: 188: Logging metrics. Iteration 5
INFO 2021-01-25 20:05:04,839 log_hooks.py: 155: Rank: 0; [ep: 0] iter: 5; lr: 0.00078; loss: 0.81495; btime(ms): 1437; eta: 0:00:07; peak_mem: 2595M
INFO 2021-01-25 20:05:04,839 trainer_main.py: 194: Meters synced
INFO 2021-01-25 20:05:10,516 log_hooks.py: 346: Rank: 0, name: train_accuracy_list_meter, value: {'top_1': {0: 30.0}, 'top_5': {0: 60.0}}
INFO 2021-01-25 20:05:10,516 io.py:  56: Saving data to file: ./checkpoints/metrics.json
INFO 2021-01-25 20:05:10,516 io.py:  70: Saved data to file: ./checkpoints/metrics.json
INFO 2021-01-25 20:05:10,517 log_hooks.py: 283: [phase: 0] Saving checkpoint to ./checkpoints
INFO 2021-01-25 20:05:10,839 log_hooks.py: 312: Saved checkpoint: ./checkpoints/model_phase0.torch
INFO 2021-01-25 20:05:10,839 log_hooks.py: 316: Creating symlink...
INFO 2021-01-25 20:05:10,839 log_hooks.py: 320: Created symlink: ./checkpoints/checkpoint.torch
```

You can see the training stats printed out like LR, batch time, etc. VISSL also prints out the GPU memory usage and the ETA (approximate time for the experiment to finish).
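
If you want to plot or tabulate these stats yourself, the per-iteration lines can be parsed with a regular expression. The pattern below is a sketch written against the sample lines shown above; it assumes that exact field order and formatting, which may differ in other VISSL versions.

```python
import re

LOG_PATTERN = re.compile(
    r"\[ep: (?P<epoch>\d+)\] iter: (?P<iter>\d+); lr: (?P<lr>[\d.eE+-]+); "
    r"loss: (?P<loss>[\d.eE+-]+); btime\(ms\): (?P<btime_ms>\d+); "
    r"eta: (?P<eta>[\d:]+); peak_mem: (?P<peak_mem_mb>\d+)M"
)

def parse_log_line(line):
    """Return the training stats in a log line as a dict, or None if the line has none."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    stats = match.groupdict()
    for key in ("epoch", "iter", "btime_ms", "peak_mem_mb"):
        stats[key] = int(stats[key])
    for key in ("lr", "loss"):
        stats[key] = float(stats[key])
    return stats

sample = ("INFO 2021-01-25 20:05:04,839 log_hooks.py: 155: Rank: 0; [ep: 0] iter: 5; "
          "lr: 0.00078; loss: 0.81495; btime(ms): 1437; eta: 0:00:07; peak_mem: 2595M")
print(parse_log_line(sample))
```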

# Understanding YAML Config File

We can now try to understand the train config file.


## Data

The input data and labels needed to train the model are specified under the `DATA` key. The training and testing data are specified under `DATA.TRAIN` and `DATA.TEST`. For example,

```yaml
DATA:
  TRAIN:
    DATA_SOURCES: [disk_folder]
    DATA_PATHS: ["<path to train folder>"]
    LABEL_SOURCES: [disk_folder]
    DATASET_NAMES: [imagenet1k_folder]
    BATCHSIZE_PER_REPLICA: 32
```
This specifies that the model will train on the images provided in the folder `DATA.TRAIN.DATA_PATHS` and infer the labels from the directory structure of the images. The model is trained with a batchsize of 32 images/GPU. VISSL provides a `configs/config/dataset_catalog.json` to easily specify dataset paths in one place rather than repeat them in each config file. In our example above, we saw how to use the `dataset_catalog.json`.
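
To make "infer the labels from the directory structure" concrete, the sketch below maps each subfolder name to an integer class index in sorted order, which is the convention used by torchvision's `ImageFolder`. Whether VISSL's `disk_folder` source follows exactly this ordering is an assumption here, and `infer_class_index` is a hypothetical helper.

```python
import tempfile
from pathlib import Path

def infer_class_index(split_dir):
    """Map each subfolder name of `split_dir` to an integer label, in sorted order."""
    classes = sorted(p.name for p in Path(split_dir).iterdir() if p.is_dir())
    return {name: index for index, name in enumerate(classes)}

with tempfile.TemporaryDirectory() as tmp:  # stand-in for custom_data/train
    for label in ["label2", "label1"]:
        (Path(tmp) / label).mkdir()
    class_to_idx = infer_class_index(tmp)
    print(class_to_idx)
```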


## Data Transforms
The image transforms are specified in `TRANSFORMS` and generally wrap the torchvision image transforms. One can easily compose together multiple transforms by specifying them in the config file, or implement their own custom image transforms. VISSL uses such compositionality of data transforms for implementing many self-supervised methods as well.
For example, in our training, we specify the transforms as below:

```yaml
TRANSFORMS:
  - name: RandomResizedCrop
    size: 224
  - name: RandomHorizontalFlip
  - name: ColorJitter
    brightness: 0.4
    contrast: 0.4
    saturation: 0.4
    hue: 0.4
  - name: ToTensor
  - name: Normalize
    mean: [0.485, 0.456, 0.406]
    std: [0.229, 0.224, 0.225]
```
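
As a worked example of the last step, `Normalize` subtracts the per-channel mean and divides by the per-channel standard deviation. The sketch below applies that formula to a single RGB pixel in plain Python, assuming pixel values are already in the range [0, 1] as they are after `ToTensor`; it illustrates the arithmetic and is not the torchvision implementation.

```python
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]

def normalize_pixel(rgb, mean=MEAN, std=STD):
    """Apply (value - mean) / std per channel to one RGB pixel with values in [0, 1]."""
    return [(value - m) / s for value, m, s in zip(rgb, mean, std)]

print(normalize_pixel([0.485, 0.456, 0.406]))  # a pixel equal to the mean maps to zeros
print(normalize_pixel([1.0, 1.0, 1.0]))        # a pure white pixel
```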

## Model
VISSL specifies the model as a `TRUNK` (the base ConvNet) and a `HEAD` (the classification or task-specific parameters). This allows one to cleanly separate the logic between the task itself and the ConvNet. Multiple model trunks (see listing under `vissl/model/trunks`) can be used for the same task.

A ResNet-50 model that outputs classification scores for 1000 classes (the number of classes in ImageNet) is specified as

```yaml
MODEL:
  TRUNK:
    NAME: resnet
    TRUNK_PARAMS:
      RESNETS:
        DEPTH: 50
  HEAD:
    PARAMS: [
      ["mlp", {"dims": [2048, 1000]}],
    ]
```
Here `TRUNK` specifies the base ConvNet architecture, and `HEAD` specifies a single fully connected layer (a special case of an MLP) that produces 1000 outputs.

VISSL automatically sets the model to eval mode when using the data in `DATA.TEST`. This ensures that layers such as `BatchNorm` and `Dropout` behave correctly when used to report test set accuracies.
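
To get a feel for what the `dims` list implies, the sketch below counts the parameters of a chain of fully connected layers. It assumes each layer is a plain linear layer with a bias and nothing else; the two-output head is a hypothetical variant sized for two label folders and is not something the example config sets.

```python
def mlp_param_count(dims, bias=True):
    """Count weights (and optional biases) of linear layers chained through `dims`."""
    total = 0
    for fan_in, fan_out in zip(dims[:-1], dims[1:]):
        total += fan_in * fan_out + (fan_out if bias else 0)
    return total

print(mlp_param_count([2048, 1000]))  # the ImageNet-sized head from the config above
print(mlp_param_count([2048, 2]))     # hypothetical head sized for two label folders
```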

## Loss and Optimizer
The loss and optimizer are specified under the `LOSS` and `OPTIMIZER` keys. VISSL losses behave similarly to the default `torch.nn` losses.

Example: we used the cross-entropy loss:
```yaml
LOSS:
  name: cross_entropy_multiple_output_single_target
  cross_entropy_multiple_output_single_target:
    ignore_index: -1
```

Example: we used the optimizer below (after overriding with command line params).
The `OPTIMIZER` contains information about the base optimizer (SGD in this case) and the learning rate scheduler (`OPTIMIZER.param_schedulers`).

```yaml
OPTIMIZER:
  name: sgd
  weight_decay: 0.0001
  momentum: 0.9
  num_epochs: 2
  nesterov: True
  regularize_bn: False
  regularize_bias: True
  param_schedulers:
    lr:
      # learning rate is automatically scaled based on batch size
      auto_lr_scaling: 
        auto_scale: true
        base_value: 0.1
        base_lr_batch_size: 256 # learning rate of 0.1 is used for batch size of 256
      name: multistep
      # We want the learning rate to drop by 1/10 at epochs [1]
      milestones: [1] # epochs at which to drop the learning rate (N vals)
      values: [0.01,0.001] # the exact values of learning rate (N+1 vals)
      update_interval: epoch
```
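
The relationship between `milestones` (N values) and `values` (N+1 values) can be sketched as a simple lookup. This illustration ignores auto-scaling and assumes that the new value takes effect at the milestone epoch itself; it is not VISSL's scheduler code.

```python
from bisect import bisect_right

def multistep_lr(epoch, milestones, values):
    """Return the learning rate for `epoch`: values[i] applies until milestones[i] is reached."""
    if len(values) != len(milestones) + 1:
        raise ValueError("need exactly one more value than milestones")
    return values[bisect_right(milestones, epoch)]

for epoch in range(2):
    print(epoch, multistep_lr(epoch, milestones=[1], values=[0.01, 0.001]))
```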

## Measuring Accuracy
Accuracy meters are specified under `METERS` and measure the top-1 and top-5 accuracies.

Example:
```yaml
METERS:
  name: accuracy_list_meter
  accuracy_list_meter:
    num_meters: 1
    topk_values: [1, 5]
```
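
For reference, top-k accuracy counts a prediction as correct when the true class is among the k highest-scoring classes. The following is a plain-Python sketch of that definition with made-up scores; it is not the meter implementation used by VISSL.

```python
def topk_accuracy(scores, targets, k):
    """Percentage of samples whose target is among the k highest-scoring classes."""
    hits = 0
    for row, target in zip(scores, targets):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += target in ranked[:k]
    return 100.0 * hits / len(targets)

# Three samples, four classes; scores are made-up numbers for illustration.
scores = [[0.1, 0.7, 0.1, 0.1], [0.4, 0.3, 0.2, 0.1], [0.1, 0.2, 0.3, 0.4]]
targets = [1, 1, 0]
print(topk_accuracy(scores, targets, k=1), topk_accuracy(scores, targets, k=2))
```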

## Number of GPUs
The number of GPUs and number of nodes are specified under `DISTRIBUTED`. VISSL seamlessly runs the same code on either a single GPU or across multiple nodes/GPUs.

Example:
```yaml
DISTRIBUTED:
  BACKEND: nccl
  NUM_NODES: 1
  NUM_PROC_PER_NODE: 1 # 1 GPU
  RUN_ID: auto
```

If running on more than one node, you will need to run this command on each of the nodes. 


**NOTE**: The batch size specified in the configs under `DATA.TRAIN.BATCHSIZE_PER_REPLICA` (denoted as `B`) is per GPU. So if you run your code on `N` nodes with `G` GPUs each, then the total effective batch size is `B*N*G`.
Since running on multiple GPUs changes the effective batch size, you may also want to use learning rate warmup (see the [ImageNet in 1 hour paper](https://arxiv.org/abs/1706.02677)).
Scaling the learning rate according to the batch size is important for distributed training. VISSL can automatically do this for you.

### Auto-scaling the LR

To make distributed training even simpler, VISSL can automatically scale the learning rate depending on the total batch size used. This is controlled by the flag `OPTIMIZER.param_schedulers.lr.auto_lr_scaling.auto_scale`, which can be set to `true` to enable auto-scaling. By default the learning rate is scaled linearly with the batch size (see the [ImageNet in 1 hour paper](https://arxiv.org/abs/1706.02677)).

We specify a `base_lr_batch_size` when creating the learning rate scheduler. At run time, VISSL automatically computes the `run_time_batch_size`, and the base learning rate is multiplied by (`run_time_batch_size / base_lr_batch_size`). The auto-scaling logic resides in `vissl/utils/hydra_config.py`.

```yaml
OPTIMIZER:
  param_schedulers:
    lr:
      auto_lr_scaling: # learning rate is automatically scaled based on batch size
        auto_scale: true
        base_value: 0.1
        base_lr_batch_size: 256 # learning rate of 0.1 is used for batch size of 256
```
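
The linear scaling rule can be checked by hand. The sketch below combines the effective batch size `B*N*G` with the `base_value` and `base_lr_batch_size` from the config; it reproduces the arithmetic described in the text and is not the code in `vissl/utils/hydra_config.py`.

```python
def scaled_lr(base_value, base_lr_batch_size, batch_per_replica, num_nodes, gpus_per_node):
    """Linear scaling rule: base_value * (effective batch size / base_lr_batch_size)."""
    effective_batch = batch_per_replica * num_nodes * gpus_per_node  # B * N * G
    return base_value * effective_batch / base_lr_batch_size

# Settings from this tutorial's run: B=2 images per GPU, 1 node, 1 GPU.
lr = scaled_lr(base_value=0.1, base_lr_batch_size=256,
               batch_per_replica=2, num_nodes=1, gpus_per_node=1)
print(f"{lr:.5f}")
```

With these settings the result rounds to 0.00078, the same value as the `lr: 0.00078` entries in the sample training log.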

## Mixed Precision or FP16 training
If you installed Apex above, you can easily train the model using mixed precision. This requires adding the following lines to the config file under the `MODEL` key:

```yaml
AMP_PARAMS:
  USE_AMP: True
  AMP_ARGS: {"opt_level": "O1"}
```
This will run the model using the `O1` setting in apex, which should generally result in stable training while saving GPU memory (and possibly faster training, depending on the GPU architecture). See the [apex documentation](https://nvidia.github.io/apex/amp.html#opt-levels) for more information on what the different mixed precision flags do.

## Using SyncBatchNorm in the model
This can be specified in the config under the `MODEL`

```yaml
SYNC_BN_CONFIG:
  CONVERT_BN_TO_SYNC_BN: True
  SYNC_BN_TYPE: pytorch
```
If you have apex installed, you can use a faster version of `SyncBatchNorm` by setting:

```yaml
SYNC_BN_CONFIG:
  CONVERT_BN_TO_SYNC_BN: True
  SYNC_BN_TYPE: apex
  GROUP_SIZE: 8 # set to number of GPUs per node for fast performance.
```
Our model definitions are written such that one can easily replace `BatchNorm` with other normalization functions (`LayerNorm, GroupNorm` etc.) by changing arguments in the config file.


In [None]:
less configs/config/supervised_1gpu_resnet_example.yaml

# Visualizing Tensorboard Logs

If you have enabled `config.TENSORBOARD_SETUP.USE_TENSORBOARD=true`, you will see the TensorBoard events dumped in the `tb_logs/` directory. You can use this to visualize the events in TensorBoard as follows:

In [None]:
# Look at training curves in tensorboard:
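# 490 was a process ID from the original session (presumably a stale TensorBoard);
# substitute your own PID or remove this line.
!kill 490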
%reload_ext tensorboard
%tensorboard --logdir checkpoints/tb_logs