Check the current GPU usage. Please try to be nice!

In [None]:
!nvidia-smi

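> **WARNING**: The card numbers shown by `nvidia-smi` are *not* necessarily the same as the device indices CUDA uses. `nvidia-smi` lists cards in PCI bus order, while CUDA by default orders them fastest first.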

## Imports

In [None]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np

# Python 3 standard library
from pathlib import Path

# Pretty progress bar (notebook version); tqdm.tqdm_notebook is deprecated
from tqdm.notebook import tqdm as progress_bar

Nicer plotting

In [None]:
plt.rcParams["font.weight"] = "bold"
plt.rcParams["font.size"] = "18"
plt.rcParams["axes.labelweight"] = "bold"

The standard way to control which GPUs are available is an environment variable (this works with any CUDA code):

In [None]:
import os

# Force only first GPU (P100 GPU) on Goofy
os.environ['CUDA_VISIBLE_DEVICES'] = "0"
# This would be just the K40s:
# os.environ['CUDA_VISIBLE_DEVICES'] = "1,2"
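
As a minimal sketch of how to make the numbering line up with `nvidia-smi`: CUDA also reads a `CUDA_DEVICE_ORDER` variable, and setting it to `PCI_BUS_ID` asks CUDA to enumerate cards in PCI bus order. Both variables must be set before anything initializes CUDA. Which physical card index `0` then refers to on a given machine is an assumption here, so check it against the `nvidia-smi` listing.

```python
import os

# Must run before the first CUDA call in this process (e.g. before torch.cuda is used).
# PCI_BUS_ID makes CUDA enumerate cards in the same order nvidia-smi lists them.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"

# Expose only one card; which physical card "0" is depends on the machine.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Inside this process the visible cards are renumbered starting from 0,
# so a single visible card always shows up as cuda:0.
visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
print(f"{len(visible)} GPU(s) requested: {visible}")
```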

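Set up the Torch device configuration. All tensors and model parameters need to be told which device to live on. Even with multi-GPU support, this remains `cuda:0`: the visible devices are renumbered starting from 0, and `DataParallel` uses the first one as its primary device.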

In [None]:
import torch
if torch.cuda.is_available():
    device = torch.device("cuda:0")
    for i in range(torch.cuda.device_count()):
        print(i, torch.cuda.get_device_name(i))
else:
    device = torch.device("cpu")
    print("Using CPU")

### Set up local parameters

In [None]:
n_epochs = 200
name = 'Aug_01_75000_2layer'
data = '/data/schreihf/PvFinder/July_31_75000.npz'
output = Path(name)  # output directory shares the run name
batch = 32
learning_rate = 1e-3

Make the output directory if it does not exist:

In [None]:
output.mkdir(exist_ok=True)

Add the directory with the model definitions to the path so we can import it:

> When you type `import X`, Python searches `sys.path` for a Python file named `X.py` to import. So we need to add the model directory to the path.

In [None]:
import sys
sys.path.append('../model')
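
The lookup described above can be tried in isolation. This is a small sketch, not part of the training setup: `demo_module` and the temporary directory are made-up names used only to show that a module becomes importable once its directory is on `sys.path`.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp)

    # Hypothetical module, written just for this demonstration
    (model_dir / "demo_module.py").write_text("ANSWER = 42\n")

    # Avoid adding the same directory twice when the cell is re-run
    if str(model_dir) not in sys.path:
        sys.path.append(str(model_dir))

    # Make sure the import system notices the newly created file
    importlib.invalidate_caches()
    demo_module = importlib.import_module("demo_module")
    print(demo_module.ANSWER)

    # Clean up so the temporary directory does not linger on the path
    sys.path.remove(str(model_dir))
```

Note that a relative entry such as `'../model'` is resolved against the current working directory, so the import only works when the notebook is started from the expected folder.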

In [None]:
from collectdata import collect_data
from loss import Loss
from training import trainNet
from models import SimpleCNN2Layer as Model

## Loading data

Load the dataset, split it into parts, then move it to the device (see `collectdata.py` in the `../model` directory):

In [None]:
dataset_train, dataset_val, _ = collect_data(
    data, 55_000, 10_000,
    verbose=True, device=device)

## Preparing the model

Prepare a model, use multiple GPUs if more than one is visible (see `CUDA_VISIBLE_DEVICES` above), and move the model to the device.

In [None]:
model = Model()
loss_fn = Loss()

> I currently have DataParallel as part of Model. It's probably the wrong place to put it, and it makes the model code more complicated. I may move it back here soon.

In [None]:
print("Let's use", torch.cuda.device_count(), "GPUs!")
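
If `DataParallel` were moved out of `Model` and into the notebook, the wrapping could look roughly like the sketch below. It uses a stand-in `nn.Linear` layer rather than the real model, and `maybe_parallel` is a hypothetical helper name. It should not be applied on top of a model that already wraps itself.

```python
import torch
from torch import nn


def maybe_parallel(model: nn.Module) -> nn.Module:
    """Wrap the model in DataParallel only when several GPUs are visible."""
    if torch.cuda.device_count() > 1:
        return nn.DataParallel(model)
    return model


device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Stand-in model for illustration; the notebook would pass Model() instead
model = maybe_parallel(nn.Linear(4, 2)).to(device)

# Each forward pass splits the batch across the visible GPUs (if any)
out = model(torch.zeros(3, 4, device=device))
print(out.shape)
```

One thing to keep in mind with this design: a wrapped model's `state_dict()` keys gain a `module.` prefix, so saved checkpoints need to be loaded into a wrapped model or have the prefix stripped.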

Let's move the model's weight matrices to the GPU:

In [None]:
model = model.to(device)

## Train

The body of this loop runs once per epoch. `results` is a named tuple holding the history so far (training and validation loss per epoch, and the time each epoch took).

In [None]:
# Make a pretty progress bar (any iterator can be given for epochs)
progress = progress_bar(range(n_epochs), dynamic_ncols=True)

# Run the epochs, using progress instead of range(n_epochs)
for results in trainNet(model, dataset_train, dataset_val,
                        loss_fn, batch, progress,
                        learning_rate=learning_rate, verbose=False):

    # Pretty print a description
    progress.set_postfix(train=results.cost[-1], val=results.val[-1])

    # Save each model state dictionary
    torch.save(model.state_dict(), output / f'{name}_{results.epoch}.pyt')
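
The "loop body runs once per epoch" pattern works because the training function is a generator. The sketch below is a hypothetical stand-in, not the real `trainNet` from `../model/training.py`: `train_sketch`, `train_step`, and `validate` are illustrative names, and the losses are dummy values. Only the field names (`epoch`, `cost`, `val`, `time`) mirror what the notebook uses.

```python
import time
from collections import namedtuple

# Same field names the notebook reads from `results`
Results = namedtuple("Results", ["epoch", "cost", "val", "time"])


def train_sketch(epochs, train_step, validate):
    """Yield the accumulated history once per epoch (illustrative only)."""
    cost, val, elapsed = [], [], []
    for epoch in epochs:
        start = time.perf_counter()
        cost.append(train_step(epoch))   # training loss for this epoch
        val.append(validate(epoch))      # validation loss for this epoch
        elapsed.append(time.perf_counter() - start)
        # Hand control back to the caller, who can log or checkpoint here
        yield Results(epoch, cost, val, elapsed)


# Dummy loss functions, used only to drive the loop
for results in train_sketch(range(3),
                            lambda e: 1.0 / (e + 1),
                            lambda e: 2.0 / (e + 1)):
    print(results.epoch, results.cost[-1], results.val[-1])
```

Because `epochs` can be any iterable, passing a progress bar in place of `range(n_epochs)` advances the bar automatically as the generator consumes it.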

Go ahead and save the final model (even though it was also saved above):

In [None]:
torch.save(model.state_dict(), output / f'{name}_final.pyt')

Save the training statistics:

In [None]:
np.savez(output / f'{name}_stats.npz',
         cost=np.array(results.cost),
         val=np.array(results.val),
         time=np.array(results.time))

## Plot final details

Who doesn't like pretty pictures?
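
A minimal sketch of a loss-history plot, assuming the per-epoch lists stored in `results.cost` and `results.val` are what should be shown. `plot_history` is a hypothetical helper, and the numbers in the example call are placeholders, not real training output.

```python
import matplotlib.pyplot as plt


def plot_history(cost, val):
    """Plot training and validation cost against the epoch index."""
    fig, ax = plt.subplots(figsize=(10, 6))
    epochs = range(len(cost))
    ax.plot(epochs, cost, label="Training")
    ax.plot(epochs, val, label="Validation")
    ax.set_xlabel("Epoch")
    ax.set_ylabel("Cost")
    ax.legend()
    return fig


# Placeholder values only; in the notebook this would be
# plot_history(results.cost, results.val)
fig = plot_history([1.0, 0.5, 0.25], [1.2, 0.7, 0.4])
```

Returning the figure makes it easy to save alongside the model files, for example with `fig.savefig(output / f'{name}_history.png')`.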
