In [1]:
%matplotlib widget

Check the current GPU usage. Please try to be nice!

In [2]:
!nvidia-smi

Thu Aug 16 09:30:15 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.44                 Driver Version: 396.44                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  TITAN V             On   | 00000000:03:00.0 Off |                  N/A |
| 28%   31C    P8    23W / 250W |      0MiB / 12066MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  On   | 00000000:83:00.0 Off |                    0 |
| N/A   70C    P0    52W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN V             On   | 00000000:84:00.0 Off |                  N/A |
| 28%   

> **WARNING**: The card numbers here are *not* the same as in CUDA. You have been warned.
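
One way to make CUDA number the cards in the same order as `nvidia-smi` is to set two environment variables before CUDA is initialized (in practice, before the first `torch.cuda` call). The sketch below is an illustration only: `pin_gpu` is a hypothetical helper name, not part of this project, and whether `select_gpu` works this way internally is an assumption.

```python
import os


def pin_gpu(bus_index: int) -> None:
    """Hypothetical helper: expose a single GPU, numbered as in nvidia-smi."""
    # Order devices by PCI bus ID, which is the order nvidia-smi lists them in.
    # CUDA's default order is "fastest first", which can differ.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # Hide every card except the requested one from this process.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(bus_index)


# Must run before CUDA is initialized in this process.
pin_gpu(1)
print(os.environ["CUDA_DEVICE_ORDER"], os.environ["CUDA_VISIBLE_DEVICES"])
```

After this, the one visible card shows up inside the process as CUDA device 0, whatever its number in the listing above.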

## Imports

In [3]:
import matplotlib.pyplot as plt
import numpy as np
import time
import torch

# Python 3 standard library
from pathlib import Path

### Set up local parameters

In [4]:
n_epochs = 20

# Name used for the output folder and the output file names
name = 'Aug_16_120000_2layer'

# Output folder, named after `name` (change if you want)
output = Path(name)

# This is the input file to read in
datafile = Path('/data/schreihf/PvFinder/Aug_15_140000.npz')

# Size of batches
batch_size = 32

# How fast to learn
learning_rate = 1e-3

Make the output directory if it does not exist:

In [5]:
output.mkdir(exist_ok=True)

## Get the helper functions

Add the directory with the model definitions to the path so we can import from it:

> When you type `import X`, Python searches the directories in `sys.path` for a module named `X` (for example, a file `X.py`). So we need to add the model directory to the path.

In [6]:
import sys
sys.path.append('../model')
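
The lookup described above can be tried in isolation. This is a minimal sketch that uses a temporary directory and a made-up module name (`toy_helpers`, which is not part of this project) to show that a module becomes importable once its directory is on `sys.path`.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Write a tiny module into a directory Python does not search yet
    Path(tmp, "toy_helpers.py").write_text("ANSWER = 42\n")

    # Add the directory to the search path, exactly as done for '../model'
    sys.path.append(tmp)
    try:
        importlib.invalidate_caches()
        toy_helpers = importlib.import_module("toy_helpers")
    finally:
        # Clean up so the temporary directory does not linger on the path
        sys.path.remove(tmp)

print(toy_helpers.ANSWER)  # prints 42
```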

In [7]:
# From model/collectdata.py
from collectdata import DataCollector

# From model/loss.py
from loss import Loss

# From model/training.py
from training import trainNet, select_gpu

# From model/models.py
from models import SimpleCNN3Layer as Model

Set up the Torch device configuration. All tensors and model parameters need to know which device they live on.
`select_gpu` takes a bus ID number, which matches the GPU numbering in the `nvidia-smi` listing at the top of this notebook.

In [8]:
device = select_gpu(1)

1 available GPUs (initially using device 0):
  0 Tesla P100-PCIE-16GB


## Loading data

Load the dataset, split it into training, validation, and test parts, then move it to the device (see `collectdata.py` in the `../model` directory).

In [9]:
collector = DataCollector(datafile, 120_000, 10_000)
train_loader = collector.get_training(batch_size, 120_000, device=device, shuffle=True)
val_loader = collector.get_validation(batch_size, 10_000, device=device, shuffle=False)

Loaded /data/schreihf/PvFinder/Aug_15_140000.npz in 9.121 s
Samples in Training: 120000 Validation: 10000 Test: 10000
Constructing dataset on cuda:0 took 8.502 s
Constructing dataset on cuda:0 took 0.4941 s
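
The sample counts above can be sanity-checked with a little arithmetic. The sketch below assumes the file holds 140,000 events laid out as training first, then validation, with the rest left for testing; that ordering is an assumption about `DataCollector`, and `split_ranges` and `n_batches` are hypothetical helpers. Keeping the final partial batch is also an assumption, though it is consistent with the 3750 and 313 batches reported by the training log.

```python
import math


def split_ranges(n_total, n_train, n_val):
    """Return index ranges for (train, validation, test).

    Assumed layout: training samples first, validation next, remainder is test.
    """
    if n_train + n_val > n_total:
        raise ValueError("Requested more samples than are available")
    train = range(0, n_train)
    val = range(n_train, n_train + n_val)
    test = range(n_train + n_val, n_total)
    return train, val, test


def n_batches(n_samples, batch_size):
    """Number of batches when the last, partial batch is kept."""
    return math.ceil(n_samples / batch_size)


train, val, test = split_ranges(140_000, 120_000, 10_000)
print(len(train), len(val), len(test))                     # 120000 10000 10000
print(n_batches(len(train), 32), n_batches(len(val), 32))  # 3750 313
```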


## Preparing the model

Prepare the model, wrap it for multiple GPUs if more than one is visible to CUDA, and move it to the device.

In [10]:
model = Model()
loss = Loss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

In [11]:
print("Let's use", torch.cuda.device_count(), "GPUs!")
if torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model)

Let's use 1 GPUs!


Let's move the model's weight matrices to the GPU:

In [12]:
model = model.to(device)

## Train

The body of the training loop runs once per epoch. `results` is a named tuple of values (the loss per epoch for training and validation, and the time each epoch took). Set up a plot first:

In [13]:
fig, ax = plt.subplots()
lines_train, = ax.plot([], [], 'o-', label='Train')
lines_val, = ax.plot([], [], 'o-', label='Validation')
ax.set_xlabel('Epochs')
ax.set_ylabel('Cost')
ax.set_yscale('log')
ax.legend();

In [14]:
# Run the epochs; trainNet reports progress and yields the results after each epoch
for results in trainNet(model, optimizer, loss,
                        train_loader, val_loader,
                        n_epochs,
                        notebook=True):
    
    # Update the plot above
    lines_train.set_data(np.arange(len(results.cost)), results.cost)
    lines_val.set_data(np.arange(len(results.val)), results.val)

    # Skip the first training cost when setting the upper limit (it can be really large)
    train_cost = results.cost if len(results.cost) < 2 else results.cost[1:]
    max_cost = max(max(train_cost), max(results.val))
    min_cost = min(min(results.cost), min(results.val))

    # The plot limits need updating too
    ax.set_ylim(min_cost * 0.9, max_cost * 1.1)
    ax.set_xlim(-0.5, len(results.cost) - 0.5)
    
    # Redraw the figure
    fig.canvas.draw()

    # Save each model state dictionary
    torch.save(model.state_dict(), output / f'{name}_{results.epoch}.pyt')

Number of batches: train = 3750, val = 313


Epoch 0: train=7.97621, val=2.43947, took 28.264 s
  Validation Total: 55048, Successes: 0, MT: 55048 (0.00%), FP: 0 (0.00%)


Epoch 1: train=2.4279, val=2.43934, took 27.523 s
  Validation Total: 55048, Successes: 0, MT: 55048 (0.00%), FP: 0 (0.00%)


Epoch 2: train=2.42672, val=2.43586, took 26.952 s
  Validation Total: 55048, Successes: 0, MT: 55048 (0.00%), FP: 0 (0.00%)


Epoch 3: train=2.42194, val=2.42694, took 27.881 s
  Validation Total: 55048, Successes: 0, MT: 55048 (0.00%), FP: 0 (0.00%)


Epoch 4: train=2.15225, val=1.82871, took 27.944 s
  Validation Total: 55048, Successes: 17302, MT: 37746 (31.43%), FP: 14 (0.03%) (Sp: 17403)


Epoch 5: train=1.63698, val=1.35683, took 27.815 s
  Validation Total: 55048, Successes: 34203, MT: 20845 (62.13%), FP: 801 (1.46%) (Sp: 34288)


Epoch 6: train=1.41193, val=1.29412, took 28.112 s
  Validation Total: 55048, Successes: 35632, MT: 19416 (64.73%), FP: 905 (1.64%) (Sp: 35732)


Epoch 7: train=1.35397, val=1.22912, took 27.998 s
  Validation Total: 55048, Successes: 37790, MT: 17258 (68.65%), FP: 1255 (2.28%) (Sp: 37958)


Epoch 8: train=1.31498, val=1.22746, took 28.543 s
  Validation Total: 55048, Successes: 37329, MT: 17719 (67.81%), FP: 1041 (1.89%) (Sp: 37462)


Epoch 9: train=1.29548, val=1.20332, took 28.382 s
  Validation Total: 55048, Successes: 38892, MT: 16156 (70.65%), FP: 1356 (2.46%) (Sp: 39088)


Epoch 10: train=1.28228, val=1.19437, took 27.976 s
  Validation Total: 55048, Successes: 39297, MT: 15751 (71.39%), FP: 1459 (2.65%) (Sp: 39527)


Epoch 11: train=1.2724, val=1.1846, took 28.569 s
  Validation Total: 55048, Successes: 39493, MT: 15555 (71.74%), FP: 1493 (2.71%) (Sp: 39688)


Epoch 12: train=1.26335, val=1.18566, took 28.12 s
  Validation Total: 55048, Successes: 39398, MT: 15650 (71.57%), FP: 1440 (2.62%) (Sp: 39628)


Epoch 13: train=1.2576, val=1.18321, took 28.185 s
  Validation Total: 55048, Successes: 39839, MT: 15209 (72.37%), FP: 1577 (2.86%) (Sp: 40079)


Epoch 14: train=1.25204, val=1.18405, took 28.25 s
  Validation Total: 55048, Successes: 39454, MT: 15594 (71.67%), FP: 1459 (2.65%) (Sp: 39679)


Epoch 15: train=1.2465, val=1.1701, took 28.553 s
  Validation Total: 55048, Successes: 39542, MT: 15506 (71.83%), FP: 1429 (2.60%) (Sp: 39753)


Epoch 16: train=1.24266, val=1.18027, took 27.926 s
  Validation Total: 55048, Successes: 39418, MT: 15630 (71.61%), FP: 1381 (2.51%) (Sp: 39642)


Epoch 17: train=1.23791, val=1.17341, took 28.309 s
  Validation Total: 55048, Successes: 39787, MT: 15261 (72.28%), FP: 1477 (2.68%) (Sp: 40013)


Epoch 18: train=1.2328, val=1.18598, took 28.138 s
  Validation Total: 55048, Successes: 39974, MT: 15074 (72.62%), FP: 1552 (2.82%) (Sp: 40253)


Epoch 19: train=1.22906, val=1.18361, took 28.298 s
  Validation Total: 55048, Successes: 40062, MT: 14986 (72.78%), FP: 1546 (2.81%) (Sp: 40344)
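
The percentages in these log lines can be reproduced from the raw counts. The short check below uses the numbers from the last epoch and shows that the value printed in parentheses after `MT` is the success fraction (Successes / Total), not the missed fraction. Reading `MT` as "missed targets" and `FP` as "false positives" is an assumption based on the numbers; the log itself does not define them.

```python
# Counts copied from the Epoch 19 validation line
total, successes, missed, false_pos = 55048, 40062, 14986, 1546

# Every target is either found or missed
assert successes + missed == total

success_rate = 100 * successes / total
fp_rate = 100 * false_pos / total

print(f"{success_rate:.2f}% {fp_rate:.2f}%")  # prints 72.78% 2.81%
```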



## Results

Let's look at the results and save them. (If you have not changed the code above, the model has already been saved after every epoch.)

In [15]:
results.cost

[7.976212815856933,
 2.427902067534129,
 2.426719831371307,
 2.4219394927978515,
 2.1522500019073485,
 1.636982250467936,
 1.4119255099455517,
 1.3539749533335368,
 1.3149804826259612,
 1.295478965727488,
 1.2822842124780018,
 1.2724033920605977,
 1.263346636168162,
 1.2576027384440105,
 1.2520439935684204,
 1.2465030914783477,
 1.2426551134109498,
 1.2379084973653158,
 1.232796300156911,
 1.229063246456782]

Go ahead and save the final model (even though it was also saved above):

In [16]:
torch.save(model.state_dict(), output / f'{name}_final.pyt')
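
A note for loading these files later: when the model has been wrapped in `torch.nn.DataParallel` (which this notebook does if more than one GPU is visible), every key in the saved state dictionary starts with `module.`. The sketch below shows one way to strip that prefix so the weights fit an unwrapped model. `strip_module_prefix` is a hypothetical helper, and the layer names in the demo are made up.

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading DataParallel prefix removed."""
    return OrderedDict(
        (key[len(prefix):] if key.startswith(prefix) else key, value)
        for key, value in state_dict.items()
    )


# Stand-in for a state dict saved from a DataParallel-wrapped model
wrapped = OrderedDict([("module.conv1.weight", 1), ("module.conv1.bias", 2)])
print(list(strip_module_prefix(wrapped)))  # ['conv1.weight', 'conv1.bias']
```

With a real checkpoint this would be used as `model.load_state_dict(strip_module_prefix(torch.load(path)))`, where `model` is a fresh, unwrapped `Model()`.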

Save the output results:

In [17]:
np.savez(output / f'{name}_stats.npz', **results._asdict())
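
`results._asdict()` turns the named tuple into a dictionary, and `**` passes each field to `np.savez` as a named array. The round trip below illustrates this with a stand-in tuple; the field names and values of `Results` here are assumptions for the demo, not the definition used in `training.py`.

```python
import tempfile
from collections import namedtuple
from pathlib import Path

import numpy as np

# Stand-in for the named tuple yielded by trainNet (field names assumed)
Results = namedtuple("Results", ["epoch", "cost", "val", "time"])
results_demo = Results(epoch=1, cost=[7.98, 2.43], val=[2.44, 2.44], time=27.5)

with tempfile.TemporaryDirectory() as tmp:
    stats_file = Path(tmp) / "demo_stats.npz"

    # Each field becomes one named array inside the .npz archive
    np.savez(str(stats_file), **results_demo._asdict())

    # Read the archive back and pull out one array by field name
    with np.load(str(stats_file)) as stats:
        saved_keys = sorted(stats.files)
        cost = stats["cost"].tolist()

print(saved_keys)  # ['cost', 'epoch', 'time', 'val']
print(cost)        # [7.98, 2.43]
```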

Save the plot above:

In [18]:
fig.savefig(str(output / f'{name}_stats_a.png'))