# Week 6

goal

feedback


## Project reminder

The last 4 labs of the semester will focus solely on your projects. First, in 9th and 10th ...

Checkpoint - data analysis, model complete, at least one serious training run

How would you prefer it - joint submission / by week submission?
random assignment / subscribing


## Google Cloud Platform and training on a GPU

How does GCP work?
GCP tutorial

1. start VM
2. connect to VM
3. start docker
4. connect to docker
5. git pull
6. run a training command
    Meanwhile, press Ctrl+P, Ctrl+Q to detach and leave the Docker container running.
    Log back in later and check what is happening.

TF and GPU: most TensorFlow operations have GPU implementations, so code you write in TF can usually run on a GPU, which is typically much faster than running it on a CPU.

 1. Develop locally; once the code really works, run your tests in the cloud.
 
 https://www.tensorflow.org/guide/gpu


Use GCP for final training, not for development.
Docker images help you keep a unified environment: if it works on your machine, it should work in the VM as well.
Check whether the GPU is being used with `nvidia-smi`. If the utilization is too low, perhaps your data processing pipeline is slow.
Develop locally, train in the cloud.
You can use your own GPU, but we recommend it only if you have something powerful (8GB VRAM). Otherwise you can still do some smaller training runs, but they will probably be slow and memory will be limiting.


## Project management

In this section we provide several helpful tips on how to approach managing deep learning projects. These tips come from our personal experience with managing projects similar to yours. Projects like these are quite specific, mainly because of the high computational requirements and the experimental nature of machine learning. One experimental run can take hours of GPU time even for smaller projects. Some state-of-the-art projects can take even months of GPU time to complete. This costs you time and, if you compute in the cloud, it also costs you money. For this reason we want to __make every run count__ and we want to avoid redoing any calculations.

We also mention some additional tips on how to effectively develop your models and how to deploy them. Some of the tips here might seem like additional work, but they will very likely save you a lot of time down the road. They are also generally considered to be good practices and you should use them in any other machine learning project in the future.

### 1. Log your results

You should regularly measure various metrics and keep these records for future reference. These will help you compare different runs and they will also help you with early bug diagnosis.

__What should you log?__ At a minimum, you should record your _loss_ value and your _main evaluation metric_ value for both the _train_ and _validation_ sets. You can also measure additional evaluation metrics (e.g. per-class precision and recall for classification), as well as memory (how much RAM is taken) and time (how long one batch takes) requirements.

__How often should you log?__ Most often you evaluate after each epoch. In some cases the dataset is so big that even one epoch can take several hours. In that case we usually want to have some information about the training earlier and more often than once per epoch. We can simply log our metrics every X steps instead of at the end of an epoch.

__How to calculate the training set metrics?__ Calculating the metrics for the whole training set is usually quite costly, because training sets are much bigger than validation sets. You can either (1) calculate these results only for a subset of the training set or (2) calculate these metrics as you train and aggregate them at the end.
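Option (2) can be sketched in plain Python as a running, batch-size-weighted mean. The class name `RunningMean` and the batch values below are made up for illustration; they are not part of any library. Keep in mind that a value aggregated this way mixes results from different stages of the epoch, because the weights change between batches.

```python
class RunningMean:
    """Accumulates a batch-size-weighted mean of a per-batch metric."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, batch_value, batch_size):
        # Weight by batch size so a smaller final batch is not over-counted.
        self.total += batch_value * batch_size
        self.count += batch_size

    def result(self):
        return self.total / self.count if self.count else 0.0

    def reset(self):
        self.total = 0.0
        self.count = 0


# Hypothetical (mean batch loss, batch size) pairs collected during one epoch.
batches = [(0.9, 32), (0.7, 32), (0.5, 16)]

train_loss = RunningMean()
for batch_loss, batch_size in batches:
    train_loss.update(batch_loss, batch_size)

epoch_train_loss = train_loss.result()
```

Call `reset()` at the start of each epoch (or each logging interval) so that old batches do not leak into the next reported value.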

__How to implement logging?__ You can use TensorBoard. It provides a convenient API for logging your results and you get cool visualizations for free. `Keras` callbacks (seen in Week 4 lab) provide basic functionality, but you can also log custom metrics, [check their tutorial](https://www.tensorflow.org/tensorboard/scalars_and_keras). If you use TensorBoard, make sure you create a logical system for naming your runs so you can identify them later.
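One possible naming scheme, sketched below, combines a timestamp, a model name and the key hyperparameters. The function `make_run_name` and its format are only an assumption for illustration; any scheme works as long as you apply it consistently.

```python
from datetime import datetime


def make_run_name(model_name, hparams, now=None):
    """Build a sortable, descriptive run name, e.g. for a log directory."""
    now = now or datetime.now()
    timestamp = now.strftime("%Y%m%d-%H%M%S")
    # Sort keys so the same hyperparameters always give the same name.
    hparam_part = "_".join(f"{key}={hparams[key]}" for key in sorted(hparams))
    return f"{timestamp}_{model_name}_{hparam_part}"


# The fixed date is used here only to make the example deterministic.
run_name = make_run_name(
    "baseline",
    {"lr": 0.003, "bs": 8},
    now=datetime(2020, 3, 24, 14, 30, 0),
)
```

Starting the name with the timestamp keeps runs sorted chronologically when you list the log directory.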

### 2. Save your models

You should regularly save your model (its parameters) so you do not lose your progress. You can then restore your model anytime and continue with training or run additional evaluation on it.

__How often should you save your model?__ You can just save your model anytime you evaluate.

__How many snapshots should you keep?__ You can keep all the snapshots, but this approach can fill your disk quite easily for bigger models. In that case you can, as a bare minimum, keep only your last snapshot and your best performing snapshot.
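The "last plus best" policy can be sketched as a small helper that decides which snapshot files to keep. The function name, the tuple layout and the file paths are hypothetical; the actual saving and deleting of files is left out.

```python
def snapshots_to_keep(snapshots, higher_is_better=True):
    """Return the paths of the last and the best snapshot.

    `snapshots` is a list of (step, validation_metric, path) tuples.
    """
    if not snapshots:
        return set()
    last = max(snapshots, key=lambda s: s[0])
    pick_best = max if higher_is_better else min
    best = pick_best(snapshots, key=lambda s: s[1])
    return {last[2], best[2]}


# Hypothetical snapshots: (step, validation accuracy, file path).
history = [
    (1000, 0.71, "models/step-1000.ckpt"),
    (2000, 0.78, "models/step-2000.ckpt"),
    (3000, 0.75, "models/step-3000.ckpt"),
]

keep = snapshots_to_keep(history)
to_delete = [path for _, _, path in history if path not in keep]
```

For a metric where lower is better, such as validation loss, pass `higher_is_better=False`.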

__How to implement saving?__ TensorFlow provides a convenient API for model saving and restoring. [Check their tutorial](https://www.tensorflow.org/tutorials/keras/save_and_load#manually_save_weights).


### 3. Early stopping

You should stop training when you detect that the run is not improving anymore. This technique is called _early stopping_ and it can save you lots of GPU time. You can stop training for the following reasons:

1. No significant progress was made in the previous X epochs.
2. Results are getting significantly worse.
3. The model performs worse than a baseline after a certain number of epochs. Some hyperparameter settings (e.g. of learning rate or batch size) make training slower, so make sure you do not stop such runs prematurely.

This technique is recommended only after you get to know your model and how it behaves on the task. Otherwise, if you use early stopping too liberally, you can stop runs that could have achieved interesting results.
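The first criterion (no significant progress in the previous X evaluations) can be sketched in plain Python as below. The class, its parameter names and the validation losses are illustrative assumptions; in Keras, the built-in `tf.keras.callbacks.EarlyStopping` callback offers similar `patience` and `min_delta` settings.

```python
class EarlyStopping:
    """Signals a stop after `patience` evaluations without improvement."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta  # smallest change that counts as progress
        self.best = float("inf")
        self.bad_evals = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


# Made-up validation losses, one per epoch.
val_losses = [1.00, 0.80, 0.70, 0.71, 0.69, 0.70, 0.72, 0.71]

stopper = EarlyStopping(patience=3, min_delta=0.005)
stopped_at = None
for epoch, val_loss in enumerate(val_losses):
    if stopper.should_stop(val_loss):
        stopped_at = epoch
        break
```

A larger `patience` makes the rule more conservative, which is the safer choice while you are still getting to know how your model behaves.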

### 4. Hyperparameter tuning

Random search is generally the safest bet; however, it might still be too expensive for the resources you have available (depending on how difficult your project is training-wise). Look at the hyperparameters other people are using. You can tune them manually at first and then experiment at least with the most important hyperparameters (learning rate, batch size, optimizer) and then perhaps also with additional architecture parameters (e.g. network depth, hidden layer size).
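A minimal sketch of random search over the hyperparameters mentioned above follows. The ranges and choices are illustrative assumptions, not recommendations for your project. The learning rate is sampled on a logarithmic scale, which is a common choice for parameters spanning several orders of magnitude.

```python
import random


def sample_hparams(rng):
    """Draw one random configuration; the ranges here are illustrative only."""
    return {
        # Sample the learning rate log-uniformly between 1e-5 and 1e-1.
        "learning_rate": 10 ** rng.uniform(-5, -1),
        "batch_size": rng.choice([8, 16, 32, 64]),
        "optimizer": rng.choice(["sgd", "adam"]),
    }


rng = random.Random(42)  # fixed seed so the search is reproducible
trials = [sample_hparams(rng) for _ in range(5)]
```

Each dictionary in `trials` would then be passed to one training run, with its results logged under a run name that identifies the configuration.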

You should make a reasonable API for your hyperparameters, __do not rewrite them manually in a file all the time.__ Instead it is a good practice to be able to start the training via command line with a command, e.g.

```
python train.py --learning-rate 0.003 --batch-size 8
```

You can use the `argparse` Python module to easily parse arguments like these ([tutorial](https://docs.python.org/3/howto/argparse.html#id1)). Also, consider using the TensorBoard HParams plugin, which lets you log the hyperparameters for the current run and then visualize the results. You can check the [TensorBoard](https://www.tensorflow.org/tensorboard/hyperparameter_tuning_with_hparams) documentation or see the code from the Week 5 lab.
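A minimal sketch of how a `train.py` could parse the two flags from the command above using `argparse`. The default values and the `build_parser` helper are assumptions made for illustration.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Train a model.")
    parser.add_argument("--learning-rate", type=float, default=0.001)
    parser.add_argument("--batch-size", type=int, default=32)
    return parser


# Passing a list here mimics the command line; in train.py you would
# call parse_args() with no arguments to read sys.argv instead.
args = build_parser().parse_args(["--learning-rate", "0.003", "--batch-size", "8"])

# argparse converts dashes in flag names to underscores in attribute names.
learning_rate = args.learning_rate
batch_size = args.batch_size
```

Giving every flag a default means a plain `python train.py` still runs, and you only spell out the values you are experimenting with.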


### 5. Experiment notes

Try keeping logs about the experiments you were running. You have results and hyperparameters logged in various files, but you should also write down your findings and thoughts, e.g. when you found out that some hyperparameter seems to be very sensitive or when you found out that some technique seems to be beneficial for your experiment.

### 6. Project structure

Below is a possible project structure. It is inspired by [Cookiecutter Data Science project structure](https://drivendata.github.io/cookiecutter-data-science). This is just a suggestion that can help you get started.

<br/><br/>

```
├── .gitignore          <- You usually don't want to push your data, logs or models to your repo
├── README.md
│
├── data
│   ├── processed       <- The data prepared to be fed into your model
│   └── raw             <- The original data you got
│
├── docker
│   ├── Dockerfile      <- Dockerfile to build the image for your project
│   └── setup           <- Additional files needed for Dockerfile
│
├── logs                <- Saved evaluation results
│
├── models              <- Saved models
│
├── notebooks           <- Jupyter notebooks for data analysis and model interaction.
│
└── src
    │
    ├── data            <- Scripts that load your data, e.g. tf.data pipeline
    │   └── load_data.py
    │
    └── models         
        ├── model.py    <- Your model definition
        ├── predict.py  <- Makes prediction with trained model on new data
        └── train.py    <- Training loop

```
<br/><br/>

Note that you should have your model definition, training loops and data processing pipeline in normal Python scripts. Jupyter notebook is a great tool for communication and data visualization, but I would not recommend implementing your whole project in it. This usually results in a so-called _Big Ass Script_ architecture, which is considered to be a development anti-pattern.
 
## How to grow your model

There are a lot of options when starting a deep learning project. You have to design a model, choose an optimization algorithm, propose a data representation, pick evaluation metrics, choose some form of regularization, etc. The sheer number of options can be overwhelming. Even once you pick some, it is hard to tell what to change if you fail to train your model.

Another problem with developing deep learning models is that they can fail silently. The fact that no exception was raised during the training does not mean that the training was done correctly. You can feed in the wrong data or use the wrong format, you can fail to calculate your loss or minimize it properly, you can miscalculate your evaluation metrics, etc.

### 1. Build infrastructure first

1. Set up the whole process (feeding data, training, evaluation, logs, model saving, etc.) and use it with the simplest model possible.

### 2. From training to testing data

You should start with a very simple model -- a reasonable baseline -- that can fit your data.

Try to fit one batch by showing it to the model over and over again. This is a sanity check: does your model work? Is it able to learn? This can all be done locally.
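The idea of the one-batch sanity check can be sketched without any framework. The example below uses a tiny linear model trained with plain gradient descent as a stand-in for your real network; the data, learning rate and step count are made up. With your actual model you would do the same thing: feed the same batch repeatedly and check that the loss drops close to zero.

```python
def fit_one_batch(xs, ys, learning_rate=0.1, steps=500):
    """Repeatedly fit y = w * x + b to a single batch with gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    losses = []
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        losses.append(sum(e * e for e in errors) / n)  # mean squared error
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, losses


# One tiny batch generated from y = 2x + 1.
batch_x = [0.0, 1.0, 2.0, 3.0]
batch_y = [1.0, 3.0, 5.0, 7.0]

w, b, losses = fit_one_batch(batch_x, batch_y)
```

If the loss on a single repeated batch does not go down, the problem is most likely in the data feeding, the loss calculation or the optimization step, not in the model capacity.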

Then try to fit the training data. Are you able to achieve the expected performance? At this point you will probably need a GPU.

If not, you have high bias. Try a bigger model or hyperparameter tuning; it is also possible that your model is not suitable for this task.

Then concern yourself with the testing data. Do you generalize well?

If not, try regularization, more data, or data augmentation.

### 3. From simple to complex model

Similarly, when you work on an ML project, start with a simple baseline to find out what results you can expect. Then add complexity step by step, compare each addition to this baseline, and see whether it is worth it or not.

## Further Reading

Karpathy's blog
http://karpathy.github.io/2019/04/25/recipe/

Ng's third course
...

Cookiecutter machine learning project structure
https://drivendata.github.io/cookiecutter-data-science/#opinions
