Check the current GPU usage. Please try to be nice!

In [1]:
!nvidia-smi

Wed Oct  6 10:20:16 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.56       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  GeForce RTX 208...  On   | 00000000:18:00.0 Off |                  N/A |
| 50%   83C    P2   251W / 250W |   2304MiB / 11019MiB |     82%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  On   | 00000000:3B:00.0 Off |                  N/A |
| 29%   33C    P8    20W / 250W |      3MiB / 11019MiB |      0%      Default |
|       

In [2]:
import matplotlib as mpl

> **WARNING**: The card numbers shown here are *not* the same as the device numbers CUDA uses. They are correct, however, if you use the `select_gpu` helper function.
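
One common way to make CUDA's numbering agree with the `nvidia-smi` listing is to set `CUDA_DEVICE_ORDER=PCI_BUS_ID` before CUDA is initialized. The sketch below shows that approach. `pin_gpu_by_bus_order` is a hypothetical helper for illustration, and this is not necessarily how `select_gpu` works internally.

```python
import os


def pin_gpu_by_bus_order(index: int) -> dict:
    """Expose a single GPU to CUDA, numbered the way nvidia-smi lists it.

    Hypothetical helper: it must run before the first CUDA call in the
    process, because these variables are read when CUDA is initialized.
    """
    settings = {
        "CUDA_DEVICE_ORDER": "PCI_BUS_ID",   # order devices by PCI bus ID
        "CUDA_VISIBLE_DEVICES": str(index),  # hide every other card
    }
    os.environ.update(settings)
    return settings


# Card 1 in the nvidia-smi table becomes the only visible device.
pin_gpu_by_bus_order(1)
```

Inside the process, the single visible card is then addressed as `cuda:0`.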

## Imports

In [3]:
import matplotlib.pyplot as plt  # needed by the plotting cells below
import numpy as np
import time
import torch
from torch import nn
import inspect
import pandas as pd

# Python 3 standard library
from pathlib import Path

## Get the helper functions

In [4]:
from model.collectdata import collect_data
from model.loss import Loss
from model.training import trainNet, select_gpu, Results
from model.plots import dual_train_plots, replace_in_ax
from model.core import PVModel, write_model

## Make a model

In [5]:
# Predefined:
from model.models import SimpleCNN3Layer as Model

#class Model(PVModel):
#    INPUTS = 3
#    KERNEL_SIZE =   [25, 15, 5]
#    CHANNELS_SIZE = [5, 10, 1]
#    DEFAULTS = {'dropout_1':0.15, 'dropout_2':0.15, 'dropout_3':0.25}
#    FC = True
#    FINAL_ACTIVATION = nn.Softplus

## Set up local parameters

In [6]:
n_epochs = 80

# Name is the output file name
name = 'Feb05_mask_16K_z_3layer'

# Make an output folder named "name" (change if you want)
output = Path(name)

# Size of batches
batch_size = 128

# How fast to learn
learning_rate = 1e-3

Make the output directory if it does not exist:

In [None]:
output.mkdir(exist_ok=True)

Prepare output dataframe to be filled in during run:

In [None]:
# This gets built up during the run - do not rerun this cell
results = pd.DataFrame([], columns=Results._fields)

Save the model source code information into the output directory. Only PVModel subclasses are supported for in-cell definitions; other models need to be in a `.py` file.

In [None]:
write_model(output / f'{name}_model_info.py', Model, Loss)
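
Source capture of this kind is often built on `inspect.getsource`, which locates the defining file on disk. The following is a minimal sketch of that idea, not the actual `write_model` implementation. `dump_sources` is a hypothetical name, and standard-library objects stand in for `Model` and `Loss`.

```python
import inspect
import tempfile
import textwrap
from pathlib import Path


def dump_sources(path: Path, *objects) -> None:
    """Write the source of each class or function to one file (hypothetical helper)."""
    chunks = [inspect.getsource(obj) for obj in objects]
    path.write_text("\n\n".join(chunks))


# Stand-ins for Model and Loss: objects defined in an importable .py file.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "example_model_info.py"
    dump_sources(target, textwrap.dedent, textwrap.TextWrapper)
    saved = target.read_text()
```

Keeping a copy of the source next to the weights makes it possible to tell later which architecture a given `.pyt` file belongs to.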

## GPU selection

Set up the Torch device configuration. All tensors and model parameters need to know which device they should be placed on.
`select_gpu` takes a bus ID number; this is the same number as in the listing at the top of this notebook.

In [None]:
device = select_gpu(2)

## Loading data

Load the dataset and split it into parts. If `device=device` is passed, the data are moved to the device up front. If that line is commented out, the datasets are loaded as the calculations progress; this allows larger datasets and is gentler on memory, but is very slightly slower. See `collectdata.py` in the `../model` directory for the source. The datasets are listed in the model directory README and repeated here:

|        From       |          To         |         Events          |
|-------------------|---------------------|-------------------------|
| `kernel_20181003` | `Oct03_20K_val`     | 1,2                     |
| `kernel_20181003` | `Oct03_20K_test`    | 3,4                     |
| `kernel_20181003` | `Oct03_40K_train`   | 5,6,7,8                 |
| `kernel_20181003` | `Oct03_80K_train`   | 9,10,11,12,13,14,15,16  |
| `kernel_20181003` | `Oct03_80K2_train`  | 17,18,19,20,21,22,23,24 |
| `kernel_20180814` | `Aug14_80K_train`   | 1,2,3,4,5,6,7,8         |

In [None]:
# Training dataset. You can put as many files here as desired.
train_loader = collect_data('data/Oct03_80K_train.h5',
                            'data/Oct03_80K2_train.h5',
                            batch_size=batch_size,
                            #device=device,
                            #load_xy=True,
                            masking=True, shuffle=True)

# Validation dataset. You can slice to reduce the size.
val_loader = collect_data('data/Oct03_20K_val.h5',
                          batch_size=batch_size,
                          slice=slice(256 * 39),
                          #load_xy=True,
                          #device=device,
                          masking=True, shuffle=False)
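
To see what the validation slice does to the dataset size, the arithmetic can be worked through without loading any data. This sketch assumes `Oct03_20K_val.h5` holds 20,000 events, which is suggested by the file name but not confirmed here.

```python
import math

batch_size = 128
n_available = 20_000         # assumption: event count implied by the "20K" file name
val_slice = slice(256 * 39)  # same slice as passed to collect_data above

# Applying the slice to a range shows how many events survive it.
n_events = len(range(n_available)[val_slice])
n_batches = math.ceil(n_events / batch_size)

print(n_events, n_batches)  # prints: 9984 78
```

Under that assumption the slice keeps the first 9,984 events, which is exactly 78 batches of 128 with no partial batch left over.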

## Preparing the model

Prepare a model, use multiple GPUs if they are VISIBLE, and move the model to the device.

In [None]:
model = Model() # optional: dropout_1 = 0.15, etc.
loss = Loss(epsilon=1e-5)
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

This should support multiple GPUs, but it does not work very well.

In [None]:
print("Let's use", torch.cuda.device_count(), "GPUs!")
if torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model)
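
One side effect worth knowing: once a model is wrapped in `torch.nn.DataParallel`, the keys of its `state_dict()` gain a `module.` prefix, so files saved from a wrapped model do not load directly into an unwrapped one. A minimal sketch of removing that prefix follows. `strip_module_prefix` is a hypothetical helper, and plain floats stand in for tensors so the example runs without a GPU.

```python
def strip_module_prefix(state_dict: dict) -> dict:
    """Return a copy of state_dict with a leading 'module.' removed from each key."""
    prefix = "module."
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


# Made-up layer names; real keys depend on the model definition.
wrapped = {"module.conv1.weight": 0.1, "module.conv1.bias": 0.2}
unwrapped = strip_module_prefix(wrapped)
print(sorted(unwrapped))  # prints: ['conv1.bias', 'conv1.weight']
```

The cleaned dictionary can then be passed to `load_state_dict` on an unwrapped model.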

Let's move the model's weight matrices to the GPU:

In [None]:
model = model.to(device)

## Train

The body of this loop runs once per epoch. Each result is a `Results` named tuple holding the training and validation loss for that epoch and the time each took. Start by setting up a plot:

In [None]:
ax, tax, lax, lines = dual_train_plots()
fig = ax.figure
plt.tight_layout()

In [None]:
for result in trainNet(model, optimizer, loss,
                        train_loader, val_loader,
                        n_epochs, epoch_start=len(results),
                        notebook=True):
    
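    # DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
    results = pd.concat([results, pd.Series(result._asdict()).to_frame().T], ignore_index=True)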
    
    xs = results.index
    
    # Update the plot above
    lines['train'].set_data(results.index,results.cost)
    lines['val'].set_data(results.index,results.val)
    
    # Ignore the first epoch's cost for the upper limit (it can be really large)
    max_cost = max(max(results.cost if len(results.cost)<2 else results.cost[1:]), max(results.val))
    min_cost = min(min(results.cost), min(results.val))
    
    # The plot limits need updating too
    ax.set_ylim(min_cost*.9, max_cost*1.1)  
    ax.set_xlim(-.5, len(results.cost) - .5)
    
    replace_in_ax(lax, lines['eff'], xs, results['eff_val'].apply(lambda x: x.eff_rate))
    replace_in_ax(tax, lines['fp'], xs, results['eff_val'].apply(lambda x: x.fp_rate))
    
    # Redraw the figure
    fig.canvas.draw()

    # Save each model state dictionary
    torch.save(model.state_dict(), output / f'{name}_{result.epoch}.pyt')
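
The loop above relies on `trainNet` handing back one result per epoch, which lets the notebook update the plot and save a checkpoint while training continues. The pattern can be sketched without PyTorch, assuming the results are produced by a generator. `fake_train` and the field list of `EpochResult` are illustrative assumptions, not the actual `trainNet` or `Results` definitions.

```python
from collections import namedtuple

# Assumed field names, chosen to resemble the columns used in the loop above.
EpochResult = namedtuple("EpochResult", ["epoch", "cost", "val"])


def fake_train(n_epochs, epoch_start=0):
    """Yield one EpochResult per epoch (stand-in for a real training loop)."""
    for epoch in range(epoch_start, epoch_start + n_epochs):
        cost = 1.0 / (epoch + 1)  # made-up, steadily decreasing training loss
        val = 1.2 / (epoch + 1)   # made-up validation loss
        yield EpochResult(epoch, cost, val)


history = []
for result in fake_train(3, epoch_start=len(history)):
    history.append(result._asdict())  # per-epoch bookkeeping happens here

# Resuming continues the epoch numbering, mirroring epoch_start=len(results).
for result in fake_train(2, epoch_start=len(history)):
    history.append(result._asdict())

print([row["epoch"] for row in history])  # prints: [0, 1, 2, 3, 4]
```

Because the epoch counter starts from the number of rows already collected, rerunning the training cell extends the history instead of overwriting it.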

In [None]:
results

## Results

Let's look at the results and then save them. (If you have not changed the code above, the model has already been saved after every epoch.)

In [None]:
results

Go ahead and save the final model (even though it was also saved above):

In [None]:
torch.save(model.state_dict(), output / f'{name}_final.pyt')

Save the output results (ignore the warning about pickling):

In [None]:
results.to_hdf(f'{name}_stats.hdf5', key='results')

Save the plot (remake the plot just in case the one above has broken):

In [None]:
dual_train_plots(results.index,
                 results.cost, results.val, 
                 results['eff_val'].apply(lambda x: x.eff_rate),
                 results['eff_val'].apply(lambda x: x.fp_rate))
plt.tight_layout()
plt.savefig(str(output / f'{name}_stats_a.png'))

Quit the kernel (try to be nice to other users). Note that plots will vanish (but are saved, so that's okay).

In [None]:
quit()