In [4]:
!pwd

/global/u2/a/ading/root_gnn/notebook


## Set Up Python Environment

1) Create an isolated Python environment named `gnn` with [conda](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#creating-an-environment-with-commands). 

\[Optional\] Create a configuration file for `conda`, `~/.condarc`, 
and specify the location of the environments that will house the Python modules. 
This directory grows very quickly, so I suggest using a project directory.
```yaml
envs_dirs:
  - /global/cfs/cdirs/atlas/xju/conda/envs
report_errors: true
```

1.1) The following commands create an environment named `gnn` and register it as a Jupyter kernel. 
```bash
module load python
conda create -n gnn python=3.8 ipykernel
source $(which conda | sed -e s#bin/conda#bin/activate#)  gnn
python -m ipykernel install --user --name gnn --display-name a-Gnn
```

The last command installs a kernel file at `~/.local/share/jupyter/kernels/gnn/kernel.json`. 

1.2) Create `~/.local/share/jupyter/kernels/gnn/setup.sh` with the following contents:
```bash
#!/bin/bash
module load python
source $(which conda | sed -e s#bin/conda#bin/activate#)  gnn
# Quote "$@" so that arguments containing spaces are passed through intact
python -m ipykernel_launcher "$@"
```
Then make it executable with `chmod +x ~/.local/share/jupyter/kernels/gnn/setup.sh`.

Get its absolute path with `readlink -f ~/.local/share/jupyter/kernels/gnn/setup.sh`.

1.3) Update `~/.local/share/jupyter/kernels/gnn/kernel.json` as follows. 
Note that the path to `setup.sh` must be the absolute path.
```json
{
 "argv": [
  "/global/u1/x/xju/.local/share/jupyter/kernels/gnn/setup.sh",
  "-f",
  "{connection_file}"
 ],
 "display_name": "a-Gnn",
 "language": "python"
}
```
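
Steps 1.2 and 1.3 can also be scripted. The following is a minimal sketch, not part of `root_gnn`: the helper name `write_kernel_spec` is hypothetical, and it assumes `setup.sh` already exists in the kernel directory. It resolves the absolute path (like `readlink -f`), sets the executable bit (like `chmod +x`), and writes a `kernel.json` with the same fields as above.

```python
import json
import os
import stat
import tempfile
from pathlib import Path


def write_kernel_spec(kernel_dir, display_name="a-Gnn"):
    """Write kernel.json so that it launches setup.sh by absolute path.

    Hypothetical helper for illustration; assumes setup.sh already exists.
    """
    kernel_dir = Path(kernel_dir)
    # realpath resolves symlinks, similar to `readlink -f`
    setup_sh = Path(os.path.realpath(kernel_dir / "setup.sh"))
    if not setup_sh.is_file():
        raise FileNotFoundError(f"missing launcher script: {setup_sh}")
    # Equivalent of `chmod +x`
    mode = setup_sh.stat().st_mode
    setup_sh.chmod(mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    spec = {
        "argv": [str(setup_sh), "-f", "{connection_file}"],
        "display_name": display_name,
        "language": "python",
    }
    (kernel_dir / "kernel.json").write_text(json.dumps(spec, indent=1) + "\n")
    return spec


# Demonstration in a throwaway directory instead of ~/.local/share/jupyter
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "gnn"
    demo_dir.mkdir()
    (demo_dir / "setup.sh").write_text("#!/bin/bash\n")
    demo_spec = write_kernel_spec(demo_dir)
    demo_written = json.loads((demo_dir / "kernel.json").read_text())
```

For the real kernel, the call would be `write_kernel_spec(Path.home() / ".local/share/jupyter/kernels/gnn")`.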

In [5]:
!which python

/global/cfs/cdirs/m3443/usr/ading/conda/envs/gnn/bin/python


In [6]:
!which pip

/global/cfs/cdirs/m3443/usr/ading/conda/envs/gnn/bin/pip


In [3]:
!pip install tensorflow

Collecting tensorflow
  Using cached tensorflow-2.4.1-cp38-cp38-manylinux2010_x86_64.whl (394.4 MB)
Collecting tensorboard~=2.4
  Using cached tensorboard-2.4.1-py3-none-any.whl (10.6 MB)
Collecting termcolor~=1.1.0
  Using cached termcolor-1.1.0-py3-none-any.whl
Collecting typing-extensions~=3.7.4
  Using cached typing_extensions-3.7.4.3-py3-none-any.whl (22 kB)
Collecting grpcio~=1.32.0
  Using cached grpcio-1.32.0-cp38-cp38-manylinux2014_x86_64.whl (3.8 MB)
Collecting protobuf>=3.9.2
  Downloading protobuf-3.15.6-cp38-cp38-manylinux1_x86_64.whl (1.0 MB)
Collecting google-pasta~=0.2
  Using cached google_pasta-0.2.0-py3-none-any.whl (57 kB)
Collecting numpy~=1.19.2
  Using cached numpy-1.19.5-cp38-cp38-manylinux2010_x86_64.whl (14.9 MB)
Collecting tensorflow-estimator<2.5.0,>=2.4.0
  Using cached tensorflow_estimator-2.4.0-py2.py3-none-any.whl (462 kB)
Collecting astunparse~=1.6.3
  Using cached astunparse-

Install the Python package [root_gnn](https://github.com/xju2/root_gnn/tree/tf2) from its `tf2` branch. 

In [5]:
!pip install -e ..

In [8]:
filename = '/global/homes/a/ading/atlas/data/top-tagger/test.h5'

## Setting up graphs for training, validation, and testing

1. Creating training graphs

```bash
create_tfrecord /global/homes/x/xju/atlas/data/top-tagger/train.h5 tfrec/train \
  --evts-per-record 100 --max-evts 1000 \
  --type TopTaggerDataset --num-workers 2
```


2. Creating validation graphs

```bash
create_tfrecord /global/homes/x/xju/atlas/data/top-tagger/val.h5 tfrec/val \
  --evts-per-record 100 --max-evts 1000 \
  --type TopTaggerDataset --num-workers 2
```


3. Creating testing graphs

```bash
create_tfrecord /global/homes/x/xju/atlas/data/top-tagger/test.h5 tfrec/test \
  --evts-per-record 100 --max-evts 1000 \
  --type TopTaggerDataset --num-workers 2
```
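
The three commands differ only in the split name, so they can be generated in a loop. This is a sketch under that assumption; `build_command` and `create_all` are hypothetical helper names, and the flags are copied verbatim from the commands above. With `dry_run=True` the commands are only printed, so nothing is executed until you opt in.

```python
import shlex
import subprocess

# Data directory taken from the commands above
DATA_DIR = "/global/homes/x/xju/atlas/data/top-tagger"


def build_command(split, evts_per_record=100, max_evts=1000, num_workers=2):
    """Build the create_tfrecord argument list for one split."""
    return [
        "create_tfrecord",
        f"{DATA_DIR}/{split}.h5",
        f"tfrec/{split}",
        "--evts-per-record", str(evts_per_record),
        "--max-evts", str(max_evts),
        "--type", "TopTaggerDataset",
        "--num-workers", str(num_workers),
    ]


def create_all(dry_run=True):
    """Print (dry run) or execute the command for each split."""
    commands = [build_command(split) for split in ("train", "val", "test")]
    for cmd in commands:
        if dry_run:
            print(shlex.join(cmd))
        else:
            # Requires create_tfrecord on PATH, i.e. root_gnn installed
            subprocess.run(cmd, check=True)
    return commands


commands = create_all(dry_run=True)
```

Calling `create_all(dry_run=False)` runs the three commands one after another and stops at the first failure.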


## Training

The main script, `train_classifier`, can be invoked from bash with its default arguments:

```bash
train_classifier
```

or with the following arguments specifying I/O and hyperparameters:
```bash
train_classifier --input-dir tfrec --output-dir trained \
  --batch-size 25 --num-epochs 10 --num-iters 10 --lr 0.002
```

You can also specify other models and loss functions defined in `root_gnn/model.py` and `root_gnn/losses.py`.

Let's examine what `train_classifier` is doing under the hood:

In [9]:
import tensorflow as tf

import os
import sys
import argparse

import re
import time
import random
import functools
import six

import numpy as np
import sklearn.metrics


from graph_nets import utils_tf
from graph_nets import utils_np
import sonnet as snt

from root_gnn import model as all_models
from root_gnn import losses
from root_gnn.src.datasets import graph
from root_gnn.utils import load_yaml

from root_gnn import trainer 

In [20]:
model = getattr(all_models, "GlobalClassifierNoEdgeInfo")()
loss_config = "GlobalLoss,1,1".split(',')
loss_fcn = getattr(losses, loss_config[0])(*[float(x) for x in loss_config[1:]])
config = {
    "input_dir": "../tfrec",
    "output_dir": "../trained",
    "batch_size": 50,
    "num_epochs": 5,
    "num_iters": 10,
    "shuffle_size": 1,
    "model": model,
    "loss_name": loss_config[0],
    "loss_fcn": loss_fcn,
    "lr": 0.001,
    "metric_mode": "clf",
    "early_stop": "auc",
    "max_attempts": 1
}
trnr = trainer.TrainerBase(**config)

The `TrainerBase()` constructor initializes a base trainer object by unpacking the `config` dict.
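
The cell above relies on two plain Python mechanisms: looking up a class by name with `getattr`, and passing a dict as keyword arguments with `**`. The sketch below illustrates both with stand-ins. `FakeLoss`, `fake_losses`, `build_loss`, and `ToyTrainer` are hypothetical and are not the classes from `root_gnn`; the real `TrainerBase` signature may differ.

```python
import types


class FakeLoss:
    """Hypothetical stand-in for a loss class; it just stores its arguments."""

    def __init__(self, *weights):
        self.weights = weights


# Stand-in for the `losses` module, so that getattr has something to search
fake_losses = types.SimpleNamespace(GlobalLoss=FakeLoss)


def build_loss(spec, namespace):
    """Turn a string such as "GlobalLoss,1,1" into a loss object."""
    name, *args = spec.split(",")
    # getattr fetches the class by name; the remaining fields become floats
    return getattr(namespace, name)(*[float(x) for x in args])


class ToyTrainer:
    """Hypothetical stand-in showing how a config dict is unpacked."""

    def __init__(self, input_dir, output_dir, batch_size=50, lr=0.001, **extra):
        self.input_dir = input_dir
        self.output_dir = output_dir
        self.batch_size = batch_size
        self.lr = lr
        self.extra = extra  # keys without a named parameter land here


loss = build_loss("GlobalLoss,1,1", fake_losses)
toy_config = {
    "input_dir": "../tfrec",
    "output_dir": "../trained",
    "batch_size": 50,
    "lr": 0.001,
    "loss_fcn": loss,
}
# ToyTrainer(**toy_config) is the same as ToyTrainer(input_dir="../tfrec", ...)
toy = ToyTrainer(**toy_config)
```

Keeping the settings in one dict makes it easy to log them or to build a second trainer with a single value changed.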

Next, call the functions that load the training, validation, and testing data. The files read from `input_dir` must be in the `.tfrec` format produced by `create_tfrecord`.
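
Before loading, it can help to confirm that the input directory actually contains record files. This is a sketch that assumes the records carry a `.tfrec` file extension and may sit in subdirectories; the helper names and the file names in the demonstration are hypothetical, and the real naming scheme of `create_tfrecord` may differ.

```python
import tempfile
from pathlib import Path


def find_tfrec_files(input_dir):
    """Return every *.tfrec file under input_dir, sorted by path."""
    return sorted(Path(input_dir).rglob("*.tfrec"))


def require_tfrec_files(input_dir):
    """Like find_tfrec_files, but fail early if nothing is found."""
    files = find_tfrec_files(input_dir)
    if not files:
        raise FileNotFoundError(f"no .tfrec files found under {input_dir}")
    return files


# Demonstration with empty placeholder files in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("train_0.tfrec", "val_0.tfrec", "notes.txt"):
        (root / name).touch()
    found_names = [p.name for p in require_tfrec_files(root)]
```

In the notebook, the equivalent check would be `require_tfrec_files("../tfrec")`.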

In [22]:
train_data, _ = trnr.load_training_data(shuffle=True)
val_data, _ = trnr.load_validating_data(shuffle=True)

The `train` function of `TrainerBase` performs training with the specified configuration and hyperparameters. It can be called in two main ways:

The first is the default call, which uses the model and loss from the configuration and the training data from the most recent call to `load_training_data`.

In [25]:
#trnr.train()

The second way to call `train` is to pass a model, a loss, or training data explicitly. The training data must have the format of the first element of the tuple returned by `load_training_data`.
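
The two calling conventions can be pictured as optional arguments that fall back to stored state. The class below is a hypothetical stand-in written for illustration; it is not the `TrainerBase` implementation, and the second value returned by its `load_training_data` is only a placeholder.

```python
class FallbackTrainer:
    """Hypothetical stand-in that mimics the two ways of calling train."""

    def __init__(self, model, loss_fcn):
        self.model = model
        self.loss_fcn = loss_fcn
        self.train_data = None

    def load_training_data(self):
        # Remember the data so that a later train() call can reuse it
        self.train_data = ["graph-0", "graph-1"]
        return self.train_data, None  # second value: placeholder only

    def train(self, model=None, loss_fcn=None, train_data=None):
        # Any argument left as None falls back to the stored value
        model = self.model if model is None else model
        loss_fcn = self.loss_fcn if loss_fcn is None else loss_fcn
        train_data = self.train_data if train_data is None else train_data
        if train_data is None:
            raise RuntimeError("call load_training_data() or pass train_data")
        return {"model": model, "loss_fcn": loss_fcn, "n_graphs": len(train_data)}


toy_trainer = FallbackTrainer(model="default-model", loss_fcn="default-loss")
toy_data, _ = toy_trainer.load_training_data()

default_run = toy_trainer.train()  # first way: everything from stored state
custom_run = toy_trainer.train("other-model", "other-loss", toy_data[:1])  # second way
```

With this pattern, an explicit argument overrides the stored value for that call only; the stored model, loss, and data are left unchanged.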

In [5]:
#trnr.train(model, loss_fcn, train_data)

For this next part, we will use TensorBoard with `nersc_tensorboard_helper`. At this point, it is recommended to switch the notebook kernel from "a-Gnn" to "tensorflow-v2.0.0-cpu" to access the tensorboard helper.

In [1]:
import nersc_tensorboard_helper
%load_ext tensorboard

In [2]:
%tensorboard --logdir /global/homes/a/ading/root_gnn/trained/noedge_fullevts/logs --port 0

In [3]:
nersc_tensorboard_helper.tb_address()

You can now access the link above to view the TensorBoard for your training.