# Quick start guide to CBRAIN-CAM

Hello to all you poor souls that have to use this repository. I apologize for all the sloppy coding and deprecated options. I promise that cbrain_v2 is coming!

For now this is an overview of the essential workflow. Feel free to ask me questions via email!

In [1]:
# Imports
import numpy as np

## Running SPCAM with the appropriate output

TBD

## Data preprocessing

The first step is to preprocess the data from the SPCAM NetCDF outputs to a format that is quickly readable for the neural network.

The preprocessing functions are `preprocess_aqua.py` and `shuffle_ds.py` in the `cbrain` subdirectory.

The raw SPCAM output will look something like this:

```
AndKua_aqua_SPCAM3.0_sp_fbp_f4.cam2.h1.0000-01-01-00000.nc
AndKua_aqua_SPCAM3.0_sp_fbp_f4.cam2.h1.0000-01-02-00000.nc
AndKua_aqua_SPCAM3.0_sp_fbp_f4.cam2.h1.0000-01-03-00000.nc
...
```

`preprocess_aqua.py` is meant to be called from the command line. The arguments can be passed in directly, in a `config.yml` file, or both, in which case the order of precedence is command line > config > default.
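To make the precedence rule concrete, here is a minimal sketch of how three option sources could be merged. It is not the actual parsing code of `preprocess_aqua.py`; the helper `resolve_options` and all values are hypothetical and only illustrate command line > config > default.

```python
import argparse


def resolve_options(defaults, config, cli):
    """Merge option sources; later sources override earlier ones.

    Hypothetical helper for illustration. Entries that are None in the
    config or on the command line count as "not set" and do not override.
    """
    resolved = dict(defaults)
    resolved.update({k: v for k, v in config.items() if v is not None})
    resolved.update({k: v for k, v in cli.items() if v is not None})
    return resolved


# Hypothetical defaults and config-file contents
defaults = {"out_pref": "default_pref", "in_dir": None, "ext_norm": None}
config = {"out_pref": "from_config", "in_dir": "/path/from/config/"}

# Simulated command line that only sets --out_pref
parser = argparse.ArgumentParser()
parser.add_argument("--out_pref", default=None)
parser.add_argument("--in_dir", default=None)
cli = vars(parser.parse_args(["--out_pref", "from_cli"]))

options = resolve_options(defaults, config, cli)
print(options)
```

Here `out_pref` comes from the command line, `in_dir` from the config, and `ext_norm` keeps its default.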

The most important arguments are:
- inputs: List of input variables, e.g. [TBP, QBP, VBP, PS, SOLIN, SHFLX, LHFLX]. The `BP` variables are computed from the raw output, e.g. `TBP = TAP - TPHYSTND*dt`
- outputs: List of output variables, e.g. [TPHYSTND, PHQ, FSNT, FSNS, FLNT, FLNS, PRECT]
- in_dir: Directory where SPCAM files are stored. I strongly recommend putting the files on a fast drive (e.g. /scratch/ on Greenplanet)
- aqua_names: Here you can put a string with placeholders `*`, e.g. `'*.h1.0001-*-*-*'` for all files from year 1 or `'*.h1.0000-*-1[7-9]-*'` for day 17-19 from each month of year 0.
- out_dir: directory where the processed files will be stored
- out_pref: Prefix for the output files. Give them some descriptive name, e.g. `my_experiment1`
- ext_norm: This is deprecated and ugly. If ext_norm = None (default), a normalization file will be computed from the data. This normalization file contains means, standard deviations, etc. For a full year of data this can take a long time, so I chose to compute the normalization files from a sample of the data. The differences are small. If ext_norm is some string (just pick any gibberish), no normalization file will be computed. Either way, the data itself is not normalized during preprocessing. That is controlled by the `norm_features` and `norm_targets` options, which are None by default and should stay so. The normalization happens later, on the fly, during network training.
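The `BP` formula from the inputs bullet can be sketched with NumPy as follows. The function name, the example values and the time step `dt = 1800` s are assumptions made for illustration, and reading `BP` as "before physics" is also an assumption.

```python
import numpy as np


def before_physics(state_after, physics_tendency, dt):
    """Subtract the physics increment from the after-physics state."""
    return state_after - physics_tendency * dt


dt = 1800.0  # assumed model time step in seconds (hypothetical value)
TAP = np.array([290.0, 280.0, 270.0])    # made-up temperatures in K
TPHYSTND = np.array([1e-4, -2e-4, 0.0])  # made-up tendencies in K/s

TBP = before_physics(TAP, TPHYSTND, dt)
print(TBP)
```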

There are also some arguments hard-coded, for example the normalization factors that are saved in the normalization file. They are at the top of `preprocess_aqua.py` in a dictionary.
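As a rough illustration of why computing the normalization statistics from a sample is usually good enough, the sketch below compares sample statistics with full-data statistics on synthetic data. The random subsampling and the array sizes are assumptions; the real normalization file is computed from whichever SPCAM files are selected, and it contains more entries than the two shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a feature array of shape (samples, features)
features = rng.normal(loc=5.0, scale=2.0, size=(10_000, 4))

# Use only 10% of the samples to compute the statistics
sample_idx = rng.choice(features.shape[0], size=1_000, replace=False)
sample = features[sample_idx]

norm = {
    "feature_means": sample.mean(axis=0),
    "feature_stds": sample.std(axis=0),
}

# Compare against the statistics of the full array
full_means = features.mean(axis=0)
max_mean_diff = np.abs(norm["feature_means"] - full_means).max()
print(max_mean_diff)
```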

An example usage of `preprocess_aqua.py` would be:
```commandline
python preprocess_aqua.py --config_file ../pp_config/fbp_engy_ess.yml --in_dir /beegfs/DATA/pritchard/srasp/fluxbypass_aqua/ --aqua_names '*.h1.0001-*-0[5-9]-*' --out_dir /beegfs/DATA/pritchard/srasp/preprocessed_data/ --out_pref fbp_engy_ess_sample_train
```
which will produce the files `<out_pref>_features.nc`, `<out_pref>_targets.nc` and `<out_pref>_norm.nc` in `out_dir`.

A validation set has to be created separately by simply choosing a different time interval with `aqua_names`.
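To check which files a given `aqua_names` pattern would select, and that the training and validation sets do not overlap, the patterns can be tried out with the standard library. This sketch assumes the patterns behave like shell-style wildcards; the file list is made up but follows the naming scheme shown above.

```python
from fnmatch import fnmatchcase

prefix = "AndKua_aqua_SPCAM3.0_sp_fbp_f4.cam2.h1."

# Made-up file names: two years, two months, a handful of days
files = [
    f"{prefix}{year:04d}-{month:02d}-{day:02d}-00000.nc"
    for year in (0, 1)
    for month in (1, 2)
    for day in (1, 5, 9, 17, 19, 20)
]


def select(pattern):
    """Return all file names that match a shell-style pattern."""
    return [f for f in files if fnmatchcase(f, pattern)]


train_files = select("*.h1.0001-*-0[5-9]-*")  # days 05-09 of year 1
valid_files = select("*.h1.0000-*-1[7-9]-*")  # days 17-19 of year 0

print(len(train_files), len(valid_files))
print(set(train_files) & set(valid_files))  # empty set: no overlap
```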


`shuffle_ds.py` pre-randomizes the data which is important for network training (not required for validation set). Its use is simple:
```commandline
python $REPO/cbrain/shuffle_ds.py --pref <out_dir>/<out_pref>
```
It will create a file with the suffix `_shuffle`.
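The key point of pre-randomizing is that features and targets have to be reordered with the same permutation so that each sample stays paired with its target. The following is a small in-memory sketch of that idea, not the actual implementation of `shuffle_ds.py`; the function name and the seed are hypothetical.

```python
import numpy as np


def shuffle_together(features, targets, seed=42):
    """Apply one random permutation along the sample axis of both arrays."""
    if features.shape[0] != targets.shape[0]:
        raise ValueError("features and targets need the same number of samples")
    perm = np.random.default_rng(seed).permutation(features.shape[0])
    return features[perm], targets[perm]


features = np.arange(10).reshape(5, 2)
targets = features.sum(axis=1)  # target i depends only on feature row i

shuf_f, shuf_t = shuffle_together(features, targets)

# The pairing between rows and targets survives the shuffle
print(np.array_equal(shuf_f.sum(axis=1), shuf_t))
```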

## Neural network training

You can either train a neural network using the script `run_experiment.py` or do it yourself in a notebook like this one. Let's go through the basic steps.

In [7]:
# Import the required cbrain functions
from cbrain.imports import *
from cbrain.data_generator import *
from cbrain.models import *
from cbrain.utils import limit_mem

In [8]:
# If you are running on the GPU, execute this
# Otherwise tensorflow will use ALL your GPU RAM for no reason
limit_mem()

First we need to define the data generators for the training and validation data.

The DataGenerator class takes the following important arguments:
- data_dir: path where preprocessed files are stored. Again, I recommend using an SSD for this.
- feature_fn: name of feature file, e.g. `fbp_engy_ess_sample_train_shuffle_features.nc`
- target_fn: corresponding targets file
- batch_size: the batch size
- norm_fn: this is the normalization file to be used. Has to be the same for validation and training set.
- fsub, fdiv, tsub, tmult: These indicate how the features and targets are normalized. The names correspond to entries in the normalization file, but some names also have special instructions inside the DataGenerator. My default values are `fsub: feature_means, fdiv: max_rs, tmult: target_conv, tsub: None`. For more info dig into the code or ask me :)
- shuffle: Whether the batches are shuffled pseudo-randomly. Should be True for training, doesn't matter for validation
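Judging by their names, `fsub`/`fdiv` are subtracted from and divided into the features, and `tsub`/`tmult` are subtracted from and multiplied onto the targets. The sketch below shows that reading on a toy batch. The function and the exact order of operations are assumptions, not the DataGenerator code, and the special-case names mentioned above are not handled.

```python
import numpy as np


def normalize_batch(x, y, fsub=None, fdiv=None, tsub=None, tmult=None):
    """Normalize one batch; any factor left as None is skipped."""
    if fsub is not None:
        x = x - fsub
    if fdiv is not None:
        x = x / fdiv
    if tsub is not None:
        y = y - tsub
    if tmult is not None:
        y = y * tmult
    return x, y


# Toy batch with 2 samples, 2 features and 1 target
x = np.array([[2.0, 4.0], [4.0, 8.0]])
y = np.array([[1.0], [2.0]])

x_norm, y_norm = normalize_batch(
    x, y,
    fsub=np.array([3.0, 6.0]),  # stand-in for feature_means
    fdiv=np.array([1.0, 2.0]),  # stand-in for feature_stds
    tmult=np.array([10.0]),     # stand-in for target_conv
)
print(x_norm)
print(y_norm)
```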

In [10]:
DATADIR = '/local/S.Rasp/preprocessed_data/'

In [12]:
train_gen = DataGenerator(
    data_dir=DATADIR, 
    feature_fn='fbp_engy_ess_train_sample1_shuffle_features.nc',
    target_fn='fbp_engy_ess_train_sample1_shuffle_targets.nc',
    batch_size=512,
    norm_fn='fbp_engy_ess_train_sample1_norm.nc',
    fsub='feature_means', 
    fdiv='feature_stds', 
    tmult='target_conv',
    shuffle=True,
)

Generator will have 23494656 samples in 45888 batches
Features have shape 94; targets have shape 65
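The reported number of batches is simply the number of samples divided by the batch size. A quick check of the numbers above, assuming integer division:

```python
n_samples = 23494656
batch_size = 512

n_batches = n_samples // batch_size  # full batches per epoch
leftover = n_samples % batch_size    # samples that do not fill a batch

print(n_batches, leftover)  # → 45888 0
```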


In [13]:
# Same for validation
valid_gen = DataGenerator(
    data_dir=DATADIR, 
    feature_fn='fbp_engy_ess_valid_sample1_features.nc',
    target_fn='fbp_engy_ess_valid_sample1_targets.nc',
    batch_size=512,
    norm_fn='fbp_engy_ess_train_sample1_norm.nc',  # SAME NORMALIZATION FILE!
    fsub='feature_means', 
    fdiv='feature_stds', 
    tmult='target_conv',
    shuffle=False,
)

Generator will have 23494656 samples in 45888 batches
Features have shape 94; targets have shape 65


Next we need to build a neural network. For this we can use the functions in cbrain.models or write our own network. Let's do this real quick.

In [14]:
from keras.models import Sequential
from keras.layers import *

In [15]:
model = Sequential([
    Dense(512, input_shape=(train_gen.feature_shape,), activation='relu'),  # First hidden layer; the input has size 94
    Dense(512, activation='relu'),
    Dense(train_gen.target_shape, activation='linear')  # output layer of size 65
])

In [16]:
# Now compile the model
model.compile(optimizer='adam', loss='mse')

In [17]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 512)               48640     
_________________________________________________________________
dense_2 (Dense)              (None, 512)               262656    
_________________________________________________________________
dense_3 (Dense)              (None, 65)                33345     
Total params: 344,641
Trainable params: 344,641
Non-trainable params: 0
_________________________________________________________________
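The parameter counts in the summary can be reproduced by hand: a dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. A small sketch with a hypothetical helper:

```python
def dense_param_counts(layer_sizes):
    """Parameters per dense layer: weights (n_in * n_out) plus biases (n_out)."""
    return [
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    ]


# 94 inputs, two hidden layers of 512 units, 65 outputs
counts = dense_param_counts([94, 512, 512, 65])
print(counts, sum(counts))  # → [48640, 262656, 33345] 344641
```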


In [18]:
# Finally train the model
model.fit_generator(
    train_gen.return_generator(),   # This actually returns the generator
    steps_per_epoch=train_gen.n_batches,   # Number of batches that make up one epoch
    epochs=1,
    validation_data=valid_gen.return_generator(),
    validation_steps=valid_gen.n_batches,
)

Epoch 1/1


<keras.callbacks.History at 0x7fac822a3908>

Here you can also see how long it takes on my GPU in Munich.

The `run_experiment.py` script basically does what we just did but has a lot more options, most of which are not active by default. The final experiment for the PNAS paper is D025, which has the following config file:
```
exp_name: D025_fbp_engy_ess_ref_fullyear_ref_longtrain
data_dir: /scratch/srasp/preprocessed_data/
train_fn: fbp_engy_ess_ref_train_fullyear_shuffle
valid_fn: fbp_engy_ess_valid_sample1
norm_fn: fbp_engy_ess_ref_train_sample1_norm.nc
fsub: feature_means
fdiv: max_rs
tmult: target_conv
activation: LeakyReLU
hidden_layers: [256,256,256,256,256,256,256,256,256]
loss: mse
log_dir: ./logs/
epochs: 18
lr_step: 3
valid_after: True
```

Some explanation:
- log_dir: This simply saves a Tensorboard log. I would not worry about this for now since it doesn't really give more information than the scores.
- lr_step: In addition to the simple example above, I implemented a learning rate scheduler, basically copied from https://machinelearningmastery.com/using-learning-rate-schedules-deep-learning-models-python-keras/ The option lr_step defines after how many epochs the learning rate is dropped.
- valid_after: If this is true, the validation score is only computed for the last epoch. This is reasonable once you know that your model behaves ok. Computing the validation score actually takes some time.
- model_dir and exp_name: The Keras model is saved after training in the specified directory under the experiment name. We need the saved model to implement it in CAM later.
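To illustrate what a step-wise learning rate schedule with `lr_step` looks like, here is a minimal sketch. The initial learning rate and the drop factor are hypothetical values, and the exact formula used in `run_experiment.py` may differ (for example in whether the first drop happens one epoch earlier).

```python
def step_decay(epoch, initial_lr=1e-3, drop_factor=0.5, lr_step=3):
    """Multiply the learning rate by drop_factor every lr_step epochs.

    epoch is zero-based; initial_lr and drop_factor are assumed values.
    """
    return initial_lr * drop_factor ** (epoch // lr_step)


# Learning rate for each of the 18 epochs of the config above
schedule = [step_decay(epoch) for epoch in range(18)]
print(schedule[:7])
```

A function like this could be handed to the Keras `LearningRateScheduler` callback, which calls it with the epoch index at the start of every epoch.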

## How to get the model into CAM

TBD