# Guide 2: Research projects with PyTorch

![Status](https://img.shields.io/static/v1.svg?label=Status&message=First%20version&color=yellow)

* Based on feedback we received, this guide summarizes tips and tricks on how to set up and structure large research projects in PyTorch, such as your Master's thesis
* Feel free to contribute yourself if you have good ideas

## Setup

### Framework

* Choose the right framework. If you have simple setups like classification, consider going with PyTorch Lightning. If you need to change the default training procedure, go with plain PyTorch and write your own framework
* Usually a good setup:

```bash
general/
│   train.py
│   task.py
│   mutils.py
layers/
experiments/
│   task1/
│        train.py
│        task.py
│        eval.py
│        dataset.py
│   task2/
│        train.py
│        task.py
│        eval.py
│        dataset.py
```

* The `general/train.py` file summarizes the default operations every model needs (training loop, loading/saving model, setting up model, etc.). If you use PyTorch Lightning, this reduces to a train file per task, and only needs the specification of the trainer object.
* The `general/task.py` file summarizes a template for the specific parts you have to do for a task (training step, validation step, etc.). If you use PyTorch Lightning, this would be the definition of the Lightning Module.
* The `layers/` folder (or `models/`, if you prefer that name) contains the code for specifying the `nn.Modules` you use for setting up the model
* The `experiments` folder contains the task-specific code. Each task has its own `train.py` for specifying the argument parser, setting up the model, etc., while the `task.py` overwrites the template in `general/task.py`. The `eval.py` file should take the checkpoint directory of a trained model as input, and evaluate this model on the test dataset. Finally, the file `dataset.py` contains all parts you need for setting up the dataset.
* Note that this template assumes that you might have multiple different tasks and multiple different models. If you have a simpler setup, you can shrink the template accordingly.
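
The split between `general/task.py` and a task-specific `task.py` can be sketched as a template class that each task subclasses. This is a minimal, framework-free sketch: the names `TaskTemplate`, `ToyTask`, `training_step`, `validation_step` and `run_epoch` are hypothetical, and real code would work on tensors and `nn.Modules` instead of plain floats.

```python
from abc import ABC, abstractmethod


class TaskTemplate(ABC):
    """Hypothetical template, as it might live in general/task.py."""

    @abstractmethod
    def training_step(self, batch):
        """Return the training loss for one batch."""

    @abstractmethod
    def validation_step(self, batch):
        """Return the validation metric for one batch."""

    def run_epoch(self, batches, train=True):
        """Shared logic: average the per-batch results over one epoch."""
        step = self.training_step if train else self.validation_step
        results = [step(batch) for batch in batches]
        return sum(results) / len(results)


class ToyTask(TaskTemplate):
    """Hypothetical task, as it might live in experiments/task1/task.py."""

    def training_step(self, batch):
        # Placeholder "loss": mean squared value of the batch
        return sum(x * x for x in batch) / len(batch)

    def validation_step(self, batch):
        # Placeholder "metric": mean absolute value of the batch
        return sum(abs(x) for x in batch) / len(batch)


task = ToyTask()
train_loss = task.run_epoch([[1.0, 2.0], [3.0, 4.0]], train=True)
val_metric = task.run_epoch([[-1.0, 1.0]], train=False)
print(train_loss, val_metric)
```

The shared loop stays in one place, and each task only implements the steps that differ.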


### Argument parser

* It is good practice to use argument parsers for specifying hyperparameters. Argument parsers allow you to start a training run like `python train.py --learning_rate ... --seed ... --hidden_size ...` etc.
* If you have multiple models to choose from, you will have multiple sets of hyperparameters. A good summary on that can be found in the [PyTorch Lightning documentation](https://pytorch-lightning.readthedocs.io/en/latest/hyperparameters.html#argparser-best-practices). The advice there applies even if you do not use Lightning.
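
A minimal sketch of such a parser with Python's built-in `argparse` could look as follows. The arguments `--model` and `--num_channels`, the model names `mlp` and `cnn`, and all default values are assumptions made for illustration.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Train a model")
    # Hyperparameters shared by all models
    parser.add_argument("--model", choices=["mlp", "cnn"], default="mlp")
    parser.add_argument("--learning_rate", type=float, default=1e-3)
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--checkpoint_path", type=str, default="checkpoints/")
    # Model-specific hyperparameters, kept in their own groups for --help
    mlp = parser.add_argument_group("mlp")
    mlp.add_argument("--hidden_size", type=int, default=128)
    cnn = parser.add_argument_group("cnn")
    cnn.add_argument("--num_channels", type=int, default=32)
    return parser


# Passing an explicit list stands in for the command line here;
# in train.py you would call build_parser().parse_args() instead.
args = build_parser().parse_args(["--learning_rate", "2e-3", "--seed", "43"])
print(vars(args))
```

Argument groups only affect how `--help` is displayed, but they make it easier to see which hyperparameters belong to which model.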

## Hyperparameter search 

* In general, hyperparameter search is all about experience. Once you have trained a lot of models, it will become easier for you to pick reasonable first-guess hyperparameters.
* Another good approach is to look at related work to your model, and see what others have used as hyperparameters for similar models. This will help you to get started with a reasonable choice.
* Hyperparameter search can be expensive. Thus, try to do the search on shallow models first before scaling them up.
* Although a large grid search is the best way to get the optimum out of your model, it is often not reasonable to run. Try to group hyperparameters, and optimize each group one by one. 
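
Optimizing hyperparameter groups one by one can be sketched as below. The function `toy_score` is a hypothetical stand-in for "train a model and return its validation score", and the groups, grids and default values are assumptions chosen only to make the example run.

```python
import itertools


def toy_score(cfg):
    """Hypothetical stand-in for training a model (higher is better)."""
    return -(
        abs(cfg["learning_rate"] - 1e-2) * 100
        + abs(cfg["dropout"] - 0.2)
        + abs(cfg["weight_decay"] - 1e-4) * 1000
    )


def grouped_search(score_fn, defaults, groups):
    """Grid-search each group in turn, keeping the other groups fixed."""
    best = dict(defaults)
    for group in groups:
        names = list(group)
        candidates = [
            {**best, **dict(zip(names, values))}
            for values in itertools.product(*(group[n] for n in names))
        ]
        best = max(candidates, key=score_fn)
    return best


defaults = {"learning_rate": 1e-3, "dropout": 0.0, "weight_decay": 0.0}
groups = [
    {"learning_rate": [1e-3, 1e-2, 1e-1]},  # optimization group
    {"dropout": [0.0, 0.2, 0.5], "weight_decay": [0.0, 1e-4, 1e-2]},  # regularization group
]
best = grouped_search(toy_score, defaults, groups)
print(best)
```

With these grids, the grouped search evaluates 3 + 9 = 12 configurations instead of the 27 of the full grid. It can miss the joint optimum when hyperparameters from different groups interact strongly, so strongly coupled hyperparameters belong in the same group.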

### Toolkits

* PyTorch Lightning provides a lot of useful tricks and toolkits, such as:
    * [Learning rate finder](https://pytorch-lightning.readthedocs.io/en/latest/lr_finder.html) that plots the learning rate vs loss for a few initial batches, and helps you to choose a reasonable learning rate.
    * [Autoscaling batch sizes](https://pytorch-lightning.readthedocs.io/en/latest/training_tricks.html#auto-scaling-of-batch-size) which finds the largest possible batch size given your GPU (helpful if you have very deep, large models, and it is obvious you need the largest batch size possible)
* For comparing multiple hyperparameter configurations, you can add them to TensorBoard. This is a clean way of comparing multiple runs. If interested, a blog on this can be found [here](https://towardsdatascience.com/a-complete-guide-to-using-tensorboard-with-pytorch-53cb2301e8c3)
* There are multiple libraries that support you in automatic hyperparameter search. A good overview for those in PyTorch can be found [here](https://medium.com/pytorch/accelerate-your-hyperparameter-optimization-with-pytorchs-ecosystem-tools-bc17001b9a49)

### Reproducibility

* Everything is about reproducibility. Make sure you can reproduce any training you do with the same random values, batches, etc. You will come to a point where you have tried a lot of different approaches, but none were able to improve upon one of your previous runs. When you try to run the model again with the best hyperparameters, you don't want to have a bad surprise (believe me, enough people have this issue, and it can also happen to you). Hence, before starting any grid search, make sure you are able to reproduce runs. Run two jobs in parallel on Lisa with the same hyperparams, seeds, etc., and if you don't get the exact same results, stop and try to fix it before anything else.
* Another aspect of reproducibility is that saving and loading a model must work without any problems. Make sure before a long training that you are able to load a saved model from the disk, and achieve the exact same test score as you had during training.
* Print your hyperparameters into the SLURM output file (simple print statement in Python). This will help you identify the runs, and you can easily check whether Lisa executes the job you intended it to.
* When running a job, copy the job file automatically to your checkpoint folder. This improves reproducibility.
* Besides the SLURM output file, create an output file in which you store the best training, validation and test score. This helps when you want to compare different runs.
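
Printing the hyperparameters and writing a small results file can be sketched as follows. The helper names `log_hparams` and `save_results`, the file name `results.json`, and the score values are placeholders for illustration; a temporary directory stands in for your checkpoint folder.

```python
import json
import os
import tempfile


def log_hparams(hparams):
    """Print hyperparameters so they end up in the SLURM output file."""
    for name, value in sorted(hparams.items()):
        print(f"{name}: {value}", flush=True)


def save_results(checkpoint_dir, hparams, scores):
    """Store hyperparameters and best scores next to the checkpoints."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    path = os.path.join(checkpoint_dir, "results.json")
    with open(path, "w") as f:
        json.dump({"hparams": hparams, "scores": scores}, f, indent=2)
    return path


hparams = {"seed": 42, "learning_rate": 1e-3}
# Placeholder numbers, not real results
scores = {"train": 0.0, "val": 0.0, "test": 0.0}

log_hparams(hparams)
checkpoint_dir = tempfile.mkdtemp()  # stand-in for your checkpoint folder
results_path = save_results(checkpoint_dir, hparams, scores)
```

Having one small JSON file per run makes it straightforward to collect all runs into a single table later.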

### Seeds

* DL models are noisy. Before running a grid search, try to get a feeling of how noisy your experiments might be. The more noise you expect compared to the score differences between hyperparameter configurations, the more seeds you should run per configuration.
* After finishing the grid search, run another model of the best configuration with a new seed. If the score is still the best, take the model. If not, consider running a few more seeds for the top $k$ models in your grid search. Otherwise you risk taking a suboptimal model, which was just lucky to be the best for a specific seed.
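
Comparing configurations over several seeds can be sketched as below. The helper names `summarize_seeds` and `top_k`, the configuration labels, and the scores are placeholders and not real results.

```python
from statistics import mean, stdev


def summarize_seeds(scores_per_config):
    """Map each configuration to (mean, std) of its scores over seeds."""
    return {
        name: (mean(scores), stdev(scores) if len(scores) > 1 else 0.0)
        for name, scores in scores_per_config.items()
    }


def top_k(summary, k):
    """Return the k configuration names with the highest mean score."""
    return sorted(summary, key=lambda name: summary[name][0], reverse=True)[:k]


# Placeholder validation accuracies, one entry per seed (not real results)
scores_per_config = {
    "lr=1e-3": [0.91, 0.92, 0.90],
    "lr=2e-3": [0.92, 0.88, 0.90],
    "lr=1e-2": [0.85, 0.86, 0.84],
}
summary = summarize_seeds(scores_per_config)
for name in top_k(summary, k=2):
    m, s = summary[name]
    print(f"{name}: {m:.3f} +- {s:.3f}")
```

If the standard deviations of the top configurations are larger than the gap between their means, more seeds are needed before picking a winner.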

### Learning rate

* The right learning rate depends on the optimizer, the model, and many other hyperparameters
* A good starting point is usually 0.1 for SGD and 1e-3 for Adam
* The deeper the model is, the lower the learning rate usually should be
* The smaller your batch size, the lower the learning rate should be. Consider using [gradient accumulation](https://towardsdatascience.com/what-is-gradient-accumulation-in-deep-learning-ec034122cfa) if your batch size is getting too small (PyTorch Lightning supports this, see [here](https://pytorch-lightning.readthedocs.io/en/latest/training_tricks.html#accumulate-gradients)).
* Consider using the PyTorch Lightning [learning rate finder](https://pytorch-lightning.readthedocs.io/en/latest/lr_finder.html) toolkit for an initial good guess. 

#### LR scheduler

* The best scheduler again depends on the task, the model and the optimizer
* For classifiers trained with SGD, the multi-step LR scheduler has been shown to work well
* Models trained with Adam commonly use a smooth exponential decay in the learning rate
* For Transformers: remember to use a learning rate warmup. A cosine scheduler is often used for decaying the learning rate afterwards
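
The warmup followed by a cosine decay can be sketched as a plain function that returns a multiplicative factor for the base learning rate. The function name `warmup_cosine_factor`, the linear shape of the warmup, and the step counts are assumptions for illustration.

```python
import math


def warmup_cosine_factor(step, warmup_steps, max_steps):
    """Multiplicative factor for the base learning rate at a given step."""
    if step < warmup_steps:
        # Linear warmup from 0 to 1
        return step / max(1, warmup_steps)
    # Cosine decay from 1 to 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


base_lr = 1e-3
for step in (0, 50, 100, 550, 1000):
    factor = warmup_cosine_factor(step, warmup_steps=100, max_steps=1000)
    print(step, base_lr * factor)
```

In PyTorch, such a function can be passed as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR`, which multiplies the optimizer's initial learning rate by the returned factor.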

### Regularization

* Regularization is important if you see a significantly higher training performance than test performance
* The regularization parameters all interact with each other, and hence must be tuned together. The most commonly used regularization techniques are: 
    * Weight decay
    * Dropout
    * Augmentation
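* Dropout is usually a good idea as it is applicable to most architectures and has been shown to effectively reduce overfitting
* If you want to use weight decay in Adam, remember to use `torch.optim.AdamW`, which decouples the weight decay from the adaptive gradient update, instead of `torch.optim.Adam`, which adds it as an L2 penalty to the gradients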

#### Domain specific regularization

* There are a couple of regularization techniques that depend on your input data/domain. The most common include:
    * Computer Vision: image augmentation
    * NLP: input dropout of whole words
    * Graphs: dropping edges, inputs


### Grid search with SLURM 

* SLURM supports grid searches through [job arrays](https://help.rc.ufl.edu/doc/SLURM_Job_Arrays).
* Job arrays allow you to start N jobs in parallel, each running with slightly different settings.
* It is effectively the same as creating N job files and calling `sbatch ...` N times, but doing that by hand quickly becomes tedious and messy.

#### Job arrays

Job arrays are created with two files: a job file, and a hyperparameter file.
The job file will start multiple sub-jobs that each use a different set of hyperparameters, as specified in the hyperparameter file.
In the job file, you need to add the argument `#SBATCH --array=...`. The argument specifies how many sub-jobs you want to start, how many to run in parallel (at maximum), and which lines to use from the hyperparameter file.
For example, if we specify `#SBATCH --array=1-16%8`, we start 16 jobs using lines 1 to 16 of the hyperparameter file, with at most 8 jobs running in parallel.
Note that the number of parallel jobs is there to limit yourself from blocking the whole cluster.
However, with your student accounts, you will not be able to run more than 1 job in parallel anyway.
The template job file `array_job.job` looks slightly different from the one we had before.
The SLURM output file is specified using `%A` and `%a`. `%A` is automatically replaced with the job ID, while `%a` is the index of the job within the array (so 1 to 16 in our example above).
Below, we also added a block for creating a checkpoint folder for the job array, and copying the job file including hyperparameters to that folder.
This is good practice for ensuring reproducibility. 
Finally, in the training call, we specify the checkpoint path (make sure to have implemented this argument in your argparse) with the addition of `experiment_${SLURM_ARRAY_TASK_ID}`, which is a sub-folder in the checkpoint directory named after the sub-job ID (1 to 16 in the example).
The next line, `$(head -$SLURM_ARRAY_TASK_ID $HPARAMS_FILE | tail -1)`, extracts the N-th line of the hyperparameter file, where N is the sub-job ID, and inserts it into the command, hence passing the hyperparameter arguments to the training file.

File `array_job.job`:
```bash
#!/bin/bash

#SBATCH --partition=gpu_shared_course
#SBATCH --gres=gpu:1
#SBATCH --job-name=ExampleArrayJob
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=3
#SBATCH --time=04:00:00
#SBATCH --mem=32000M
#SBATCH --array=1-16%8
#SBATCH --output=slurm_array_testing_%A_%a.out

module purge
module load 2019
module load Python/3.7.5-foss-2019b
module load CUDA/10.1.243
module load cuDNN/7.6.5.32-CUDA-10.1.243
module load NCCL/2.5.6-CUDA-10.1.243
module load Anaconda3/2018.12

# Your job starts in the directory where you call sbatch
cd $HOME/...
# Activate your environment
source activate ...

# Good practice: define your directory where to save the models, and copy the job file to it
JOB_FILE=$HOME/.../array_job.job
HPARAMS_FILE=$HOME/.../array_job_hyperparameters.txt
CHECKPOINTDIR=$HOME/.../checkpoints/array_job_${SLURM_ARRAY_JOB_ID}

# -p: do not fail if another sub-job of the array already created the folder
mkdir -p $CHECKPOINTDIR
rsync $HPARAMS_FILE $CHECKPOINTDIR/
rsync $JOB_FILE $CHECKPOINTDIR/

# Run your code
srun python -u train.py \
    --checkpoint_path $CHECKPOINTDIR/experiment_${SLURM_ARRAY_TASK_ID} \
    $(head -$SLURM_ARRAY_TASK_ID $HPARAMS_FILE | tail -1)
```

The hyperparameter file is simply a text file in which each line denotes one set of hyperparameters for which you want to run an experiment. The lines can be in any order, and you can extend them with as many hyperparameter arguments as you want.

File `array_job_hyperparameters.txt`:
```bash
--seed 42 --learning_rate 1e-3
--seed 43 --learning_rate 1e-3
--seed 44 --learning_rate 1e-3
--seed 45 --learning_rate 1e-3
--seed 42 --learning_rate 2e-3
--seed 43 --learning_rate 2e-3
--seed 44 --learning_rate 2e-3
--seed 45 --learning_rate 2e-3
--seed 42 --learning_rate 4e-3
--seed 43 --learning_rate 4e-3
--seed 44 --learning_rate 4e-3
--seed 45 --learning_rate 4e-3
--seed 42 --learning_rate 1e-2
--seed 43 --learning_rate 1e-2
--seed 44 --learning_rate 1e-2
--seed 45 --learning_rate 1e-2
```

#### PyTorch Lightning

Writing the hyperparameter files by hand can be tedious, and hence it is advised to write a script that generates them automatically (for instance by combining each hyperparameter configuration with 4 different seeds). However, if you are using PyTorch Lightning, you can directly create a job array file. The documentation for this can be found [here](https://pytorch-lightning.readthedocs.io/en/latest/slurm.html#building-slurm-scripts).
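
A script for generating the hyperparameter file can be sketched as follows. The function name `make_hparam_lines` is hypothetical; the grid and seeds are chosen to reproduce the 16 lines of `array_job_hyperparameters.txt` shown above.

```python
import itertools


def make_hparam_lines(grid, seeds):
    """Return one command-line string per (configuration, seed) pair."""
    names = list(grid)
    lines = []
    for values in itertools.product(*(grid[n] for n in names)):
        config = " ".join(f"--{n} {v}" for n, v in zip(names, values))
        for seed in seeds:
            lines.append(f"--seed {seed} {config}")
    return lines


# Values are kept as strings so they appear in the file exactly as written
lines = make_hparam_lines(
    grid={"learning_rate": ["1e-3", "2e-3", "4e-3", "1e-2"]},
    seeds=[42, 43, 44, 45],
)
with open("array_job_hyperparameters.txt", "w") as f:
    f.write("\n".join(lines) + "\n")
print(len(lines), "lines written")
```

Remember to adjust the range in `#SBATCH --array=...` to the number of lines in the generated file.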