In [1]:
%matplotlib widget

Check the current GPU usage. Please try to be nice!

In [2]:
!nvidia-smi

Sun Aug 26 11:16:41 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.44                 Driver Version: 396.44                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  TITAN V             On   | 00000000:03:00.0 Off |                  N/A |
| 28%   32C    P8    24W / 250W |      0MiB / 12066MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  On   | 00000000:83:00.0 Off |                    0 |
| N/A   33C    P0    30W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN V             On   | 00000000:84:00.0 Off |                  N/

> **WARNING**: The card numbers here are *not* the same as in CUDA. You have been warned.
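
The mismatch arises because CUDA orders devices fastest-first by default, while `nvidia-smi` lists them by PCI bus ID. A minimal sketch of one common way to align the two, assuming it runs before CUDA is initialized in the process. Whether `select_gpu` (used later in this notebook) does this internally is not shown here, and the card index `2` is only illustrative:

```python
import os

# Ask CUDA to enumerate devices in PCI bus order, matching nvidia-smi.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"

# Optionally expose only one card to this process (index 2 is illustrative).
os.environ["CUDA_VISIBLE_DEVICES"] = "2"

print(os.environ["CUDA_DEVICE_ORDER"], os.environ["CUDA_VISIBLE_DEVICES"])
```

When only one card is visible, the process sees it as `cuda:0`, which is consistent with the `cuda:0` that appears in the output later in this notebook.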

## Imports

In [3]:
import matplotlib.pyplot as plt
import numpy as np
import time
import torch

# Python 3 standard library
from pathlib import Path

## Set up local parameters

In [4]:
n_epochs = 100

# Name is the output file name
name = 'Aug_26_120000_SimpleCNN4Layer_C_100epochs_epsilon_1em3_lr_3em4'

# Make an output folder named "name" (change if you want)
output = Path(name)

# This is the input file to read in
datafile = Path('/data/schreihf/PvFinder/Aug_15_140000.npz')

# Size of batches
batch_size = 32

# How fast to learn
learning_rate = 3e-4

Make the output directory if it does not exist:

In [5]:
output.mkdir(exist_ok=True)

## Get the helper functions

Add the directory with the model definitions to the path so we can import from it:

> When you type `import X`, Python searches `sys.path` for a Python file named `X.py` to import. So we need to add the model directory to the path.

In [6]:
import sys
sys.path.append('../model')
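
The effect of extending `sys.path` can be seen in isolation. The following is a small sketch that uses a throwaway directory and a made-up module name (`demo_helper`), not the real `../model` directory:

```python
import sys
import tempfile
from pathlib import Path

# Create a throwaway directory containing a tiny module (names are hypothetical).
tmp_dir = Path(tempfile.mkdtemp())
(tmp_dir / "demo_helper.py").write_text("ANSWER = 42\n")

# Without this line, the import below would raise ModuleNotFoundError.
sys.path.append(str(tmp_dir))

import demo_helper  # found because tmp_dir is now searched

print(demo_helper.ANSWER)
```

Note that a relative entry such as `'../model'` is resolved against the current working directory, so it only works when the notebook is started from the expected folder.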

In [8]:
# From model/collectdata.py
from collectdata import DataCollector

# From model/loss_epsilon_1em3.py
from loss_epsilon_1em3 import Loss

# From model/training.py
from training import trainNet, select_gpu

# From model/models_mds_C.py
from models_mds_C import SimpleCNN4Layer_C as Model

Set up the Torch device configuration. All tensors and model parameters need to know which device to live on.
`select_gpu` takes a bus ID number, which matches the GPU index in the `nvidia-smi` listing at the top of this notebook.

In [9]:
device = select_gpu(2)

1 available GPUs (initially using device 0):
  0 TITAN V


## Loading data

Load the dataset, split it into parts, then move it to the device (see `collectdata.py` in the `../model` directory).

In [10]:
collector = DataCollector(datafile, 120_000, 10_000)
train_loader = collector.get_training(batch_size, 120_000, device=device, shuffle=True)
val_loader = collector.get_validation(batch_size, 10_000, device=device, shuffle=False)

Loaded /data/schreihf/PvFinder/Aug_15_140000.npz in 9.337 s
Samples in Training: 120000 Validation: 10000 Test: 10000
Constructing dataset on cuda:0 took 14.35 s
Constructing dataset on cuda:0 took 0.4637 s
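
With the sample counts printed above and `batch_size = 32`, the number of batches per epoch follows from a ceiling division. This sketch assumes the loaders keep the final partial batch rather than dropping it:

```python
import math

batch_size = 32
n_train, n_val = 120_000, 10_000

# Ceiling division: a final partial batch still counts as one batch.
train_batches = math.ceil(n_train / batch_size)
val_batches = math.ceil(n_val / batch_size)

print(train_batches, val_batches)  # 3750 313
```

These values agree with the batch counts reported at the start of the training log.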


## Preparing the model

Prepare the model, loss, and optimizer. If more than one GPU is visible, wrap the model in `torch.nn.DataParallel`; then move the model to the device.

In [11]:
model = Model()
loss = Loss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

In [12]:
print("Let's use", torch.cuda.device_count(), "GPUs!")
if torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model)

Let's use 1 GPUs!
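
In this run only one GPU is visible, so the model is not wrapped. When `torch.nn.DataParallel` does wrap a model, the original network lives under its `.module` attribute, and the keys of `state_dict()` gain a `module.` prefix. That matters when a saved checkpoint is later loaded into an unwrapped model. A minimal sketch of a hypothetical helper (`strip_module_prefix` is not part of this project) that removes the prefix, shown on a plain dict with made-up keys:

```python
def strip_module_prefix(state_dict):
    """Return a copy with a leading 'module.' removed from every key."""
    prefix = "module."
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }

# Plain dict standing in for a real state dict; the keys are made up.
wrapped = {"module.conv1.weight": 1, "module.conv1.bias": 2}
print(strip_module_prefix(wrapped))
```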


Let's move the model's weight matrices to the GPU:

In [13]:
model = model.to(device)

## Train

The body of this loop runs once per epoch. `results` is a named tuple holding the training and validation cost for every epoch so far, along with the time each epoch took. Set up the plot first:

In [14]:
fig, ax = plt.subplots()
lines_train, = ax.plot([], [], 'o-', label='Train')
lines_val, = ax.plot([], [], 'o-', label='Validation')
ax.set_xlabel('Epochs')
ax.set_ylabel('Cost')
ax.set_yscale('log')
ax.legend();
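
The training cell that follows iterates over a function that hands back results after every epoch. A sketch of that pattern using a generator and a named tuple. Here `fake_train` is a hypothetical stand-in for `trainNet`, the numbers are placeholders, and the field name `time` is an assumption (only `epoch`, `cost`, and `val` are used in this notebook):

```python
from collections import namedtuple

# Field names follow the attributes used in this notebook; the real
# definition in training.py may differ.
Results = namedtuple("Results", ["epoch", "cost", "val", "time"])

def fake_train(n_epochs):
    """Hypothetical stand-in for trainNet: yields cumulative results each epoch."""
    cost, val, time_taken = [], [], []
    for epoch in range(n_epochs):
        cost.append(1.0 / (epoch + 1))   # placeholder training cost
        val.append(1.1 / (epoch + 1))    # placeholder validation cost
        time_taken.append(0.0)           # placeholder timing
        yield Results(epoch, list(cost), list(val), list(time_taken))

for results in fake_train(3):
    print(results.epoch, len(results.cost))
```

Yielding once per epoch lets the caller update a plot or save a checkpoint between epochs without the training function knowing about either.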

In [15]:
# Run the epochs, using progress instead of range(n_epochs)
for results in trainNet(model, optimizer, loss,
                        train_loader, val_loader,
                        n_epochs,
                        notebook=True):
    
    # Update the plot above
    lines_train.set_data(np.arange(len(results.cost)), results.cost)
    lines_val.set_data(np.arange(len(results.val)), results.val)

    # Skip the first training cost when setting the upper limit (it can be very large)
    train_cost = results.cost if len(results.cost) < 2 else results.cost[1:]
    max_cost = max(max(train_cost), max(results.val))
    min_cost = min(min(results.cost), min(results.val))

    # The plot limits need updating too
    ax.set_ylim(min_cost * .9, max_cost * 1.1)
    ax.set_xlim(-.5, len(results.cost) - .5)
    
    # Redraw the figure
    fig.canvas.draw()

    # Save each model state dictionary
    torch.save(model.state_dict(), output / f'{name}_{results.epoch}.pyt')

Number of batches: train = 3750, val = 313


Epoch 0: train=3.85666, val=1.0567, took 37.063 s
  Validation Found 0 of 55048, added 0 (eff 0.00%) (0.0 FP/event)


Epoch 1: train=1.05126, val=1.05542, took 36.347 s
  Validation Found 0 of 55048, added 0 (eff 0.00%) (0.0 FP/event)


Epoch 2: train=1.04849, val=1.05092, took 36.418 s
  Validation Found 0 of 55048, added 0 (eff 0.00%) (0.0 FP/event)


Epoch 3: train=1.0357, val=0.973194, took 36.683 s
  Validation Found 741 of 55048, added 1 (eff 1.35%) (0.0001 FP/event)


Epoch 4: train=0.823776, val=0.650399, took 36.322 s
  Validation Found 18750 of 55048, added 14 (eff 34.06%) (0.0014 FP/event)


Epoch 5: train=0.652485, val=0.535782, took 36.304 s
  Validation Found 27615 of 55048, added 111 (eff 50.17%) (0.0111 FP/event)


Epoch 6: train=0.594162, val=0.488234, took 37.175 s
  Validation Found 36465 of 55048, added 469 (eff 66.24%) (0.0469 FP/event)


Epoch 7: train=0.558879, val=0.466511, took 36.236 s
  Validation Found 34915 of 55048, added 328 (eff 63.43%) (0.0328 FP/event)


Epoch 8: train=0.535576, val=0.456921, took 36.352 s
  Validation Found 35725 of 55048, added 399 (eff 64.90%) (0.0399 FP/event)


Epoch 9: train=0.520239, val=0.445474, took 36.343 s
  Validation Found 37517 of 55048, added 551 (eff 68.15%) (0.0551 FP/event)


Epoch 10: train=0.508975, val=0.455222, took 36.344 s
  Validation Found 34969 of 55048, added 390 (eff 63.52%) (0.039 FP/event)


Epoch 11: train=0.500947, val=0.438736, took 36.428 s
  Validation Found 38675 of 55048, added 697 (eff 70.26%) (0.0697 FP/event)


Epoch 12: train=0.494191, val=0.438493, took 36.348 s
  Validation Found 38630 of 55048, added 676 (eff 70.18%) (0.0676 FP/event)


Epoch 13: train=0.48859, val=0.436685, took 36.394 s
  Validation Found 38818 of 55048, added 727 (eff 70.52%) (0.0727 FP/event)


Epoch 14: train=0.48338, val=0.436324, took 36.36 s
  Validation Found 38635 of 55048, added 826 (eff 70.18%) (0.0826 FP/event)


Epoch 15: train=0.478941, val=0.432875, took 36.418 s
  Validation Found 39364 of 55048, added 869 (eff 71.51%) (0.0869 FP/event)


Epoch 16: train=0.474681, val=0.432844, took 36.402 s
  Validation Found 40410 of 55048, added 1058 (eff 73.41%) (0.106 FP/event)


Epoch 17: train=0.470647, val=0.432388, took 36.389 s
  Validation Found 39183 of 55048, added 869 (eff 71.18%) (0.0869 FP/event)


Epoch 18: train=0.466855, val=0.430381, took 36.361 s
  Validation Found 39842 of 55048, added 918 (eff 72.38%) (0.0918 FP/event)


Epoch 19: train=0.463389, val=0.430428, took 36.354 s
  Validation Found 40519 of 55048, added 1142 (eff 73.61%) (0.114 FP/event)


Epoch 20: train=0.460007, val=0.430189, took 36.384 s
  Validation Found 39312 of 55048, added 901 (eff 71.41%) (0.0901 FP/event)


Epoch 21: train=0.45729, val=0.429991, took 36.369 s
  Validation Found 39659 of 55048, added 930 (eff 72.04%) (0.093 FP/event)


Epoch 22: train=0.454671, val=0.429871, took 36.334 s
  Validation Found 39057 of 55048, added 834 (eff 70.95%) (0.0834 FP/event)


Epoch 23: train=0.452453, val=0.43294, took 36.344 s
  Validation Found 38056 of 55048, added 754 (eff 69.13%) (0.0754 FP/event)


Epoch 24: train=0.450434, val=0.429177, took 36.379 s
  Validation Found 41843 of 55048, added 1289 (eff 76.01%) (0.129 FP/event)


Epoch 25: train=0.448727, val=0.426708, took 37.115 s
  Validation Found 40184 of 55048, added 996 (eff 73.00%) (0.0996 FP/event)


Epoch 26: train=0.446347, val=0.425219, took 36.72 s
  Validation Found 41213 of 55048, added 1169 (eff 74.87%) (0.117 FP/event)


Epoch 27: train=0.445191, val=0.425085, took 36.309 s
  Validation Found 41428 of 55048, added 1189 (eff 75.26%) (0.119 FP/event)


Epoch 28: train=0.444152, val=0.425509, took 36.601 s
  Validation Found 40303 of 55048, added 1068 (eff 73.21%) (0.107 FP/event)


Epoch 29: train=0.442724, val=0.424601, took 36.303 s
  Validation Found 41049 of 55048, added 1108 (eff 74.57%) (0.111 FP/event)


Epoch 30: train=0.441702, val=0.425067, took 36.268 s
  Validation Found 41624 of 55048, added 1191 (eff 75.61%) (0.119 FP/event)


Epoch 31: train=0.440224, val=0.425508, took 36.277 s
  Validation Found 41901 of 55048, added 1318 (eff 76.12%) (0.132 FP/event)


Epoch 32: train=0.439272, val=0.426578, took 36.188 s
  Validation Found 42490 of 55048, added 1453 (eff 77.19%) (0.145 FP/event)


Epoch 33: train=0.438421, val=0.424768, took 36.213 s
  Validation Found 41697 of 55048, added 1272 (eff 75.75%) (0.127 FP/event)


Epoch 34: train=0.437508, val=0.423518, took 36.274 s
  Validation Found 40860 of 55048, added 1112 (eff 74.23%) (0.111 FP/event)


Epoch 35: train=0.436746, val=0.423243, took 36.249 s
  Validation Found 41446 of 55048, added 1192 (eff 75.29%) (0.119 FP/event)


Epoch 36: train=0.436023, val=0.425172, took 36.239 s
  Validation Found 42303 of 55048, added 1411 (eff 76.85%) (0.141 FP/event)


Epoch 37: train=0.435394, val=0.422718, took 36.231 s
  Validation Found 41744 of 55048, added 1279 (eff 75.83%) (0.128 FP/event)


Epoch 38: train=0.434743, val=0.423329, took 36.252 s
  Validation Found 40724 of 55048, added 1053 (eff 73.98%) (0.105 FP/event)


Epoch 39: train=0.43373, val=0.423091, took 36.642 s
  Validation Found 40759 of 55048, added 1137 (eff 74.04%) (0.114 FP/event)


Epoch 40: train=0.4328, val=0.422856, took 36.219 s
  Validation Found 40987 of 55048, added 1166 (eff 74.46%) (0.117 FP/event)


Epoch 41: train=0.432284, val=0.422422, took 36.234 s
  Validation Found 41660 of 55048, added 1232 (eff 75.68%) (0.123 FP/event)


Epoch 42: train=0.431331, val=0.422378, took 36.251 s
  Validation Found 41164 of 55048, added 1188 (eff 74.78%) (0.119 FP/event)


Epoch 43: train=0.43097, val=0.423316, took 37.124 s
  Validation Found 40185 of 55048, added 1006 (eff 73.00%) (0.101 FP/event)


Epoch 44: train=0.430186, val=0.422125, took 36.196 s
  Validation Found 40931 of 55048, added 1098 (eff 74.36%) (0.11 FP/event)


Epoch 45: train=0.429739, val=0.422389, took 36.286 s
  Validation Found 40186 of 55048, added 972 (eff 73.00%) (0.0972 FP/event)


Epoch 46: train=0.428951, val=0.421007, took 36.269 s
  Validation Found 41199 of 55048, added 1145 (eff 74.84%) (0.114 FP/event)


Epoch 47: train=0.428331, val=0.421032, took 36.338 s
  Validation Found 41245 of 55048, added 1139 (eff 74.93%) (0.114 FP/event)


Epoch 48: train=0.427906, val=0.421232, took 36.216 s
  Validation Found 41994 of 55048, added 1307 (eff 76.29%) (0.131 FP/event)


Epoch 49: train=0.42706, val=0.423266, took 36.169 s
  Validation Found 39534 of 55048, added 835 (eff 71.82%) (0.0835 FP/event)


Epoch 50: train=0.426642, val=0.421047, took 35.901 s
  Validation Found 40579 of 55048, added 974 (eff 73.72%) (0.0974 FP/event)


Epoch 51: train=0.425711, val=0.42164, took 36.239 s
  Validation Found 40681 of 55048, added 1078 (eff 73.90%) (0.108 FP/event)


Epoch 52: train=0.425515, val=0.421183, took 37.13 s
  Validation Found 40626 of 55048, added 1046 (eff 73.80%) (0.105 FP/event)


Epoch 53: train=0.425053, val=0.420501, took 36.327 s
  Validation Found 41351 of 55048, added 1200 (eff 75.12%) (0.12 FP/event)


Epoch 54: train=0.424328, val=0.420529, took 36.307 s
  Validation Found 41255 of 55048, added 1116 (eff 74.94%) (0.112 FP/event)


Epoch 55: train=0.423265, val=0.42099, took 36.295 s
  Validation Found 40474 of 55048, added 1027 (eff 73.52%) (0.103 FP/event)


Epoch 56: train=0.423709, val=0.419798, took 36.108 s
  Validation Found 41423 of 55048, added 1176 (eff 75.25%) (0.118 FP/event)


Epoch 57: train=0.423001, val=0.419995, took 36.288 s
  Validation Found 41249 of 55048, added 1125 (eff 74.93%) (0.112 FP/event)


Epoch 58: train=0.422492, val=0.420078, took 36.293 s
  Validation Found 40817 of 55048, added 1089 (eff 74.15%) (0.109 FP/event)


Epoch 59: train=0.422018, val=0.420636, took 35.951 s
  Validation Found 41917 of 55048, added 1220 (eff 76.15%) (0.122 FP/event)


Epoch 60: train=0.421679, val=0.420176, took 35.958 s
  Validation Found 41983 of 55048, added 1309 (eff 76.27%) (0.131 FP/event)


Epoch 61: train=0.42079, val=0.419606, took 37.113 s
  Validation Found 41277 of 55048, added 1137 (eff 74.98%) (0.114 FP/event)


Epoch 62: train=0.420727, val=0.419211, took 36.28 s
  Validation Found 41575 of 55048, added 1156 (eff 75.52%) (0.116 FP/event)


Epoch 63: train=0.420201, val=0.419844, took 36.481 s
  Validation Found 41051 of 55048, added 1151 (eff 74.57%) (0.115 FP/event)


Epoch 64: train=0.420057, val=0.419561, took 36.242 s
  Validation Found 40812 of 55048, added 1049 (eff 74.14%) (0.105 FP/event)


Epoch 65: train=0.419586, val=0.420735, took 36.306 s
  Validation Found 41890 of 55048, added 1327 (eff 76.10%) (0.133 FP/event)


Epoch 66: train=0.419227, val=0.420873, took 36.378 s
  Validation Found 41528 of 55048, added 1272 (eff 75.44%) (0.127 FP/event)


Epoch 67: train=0.418725, val=0.419676, took 37.071 s
  Validation Found 40758 of 55048, added 1086 (eff 74.04%) (0.109 FP/event)


Epoch 68: train=0.418239, val=0.421601, took 36.539 s
  Validation Found 42337 of 55048, added 1350 (eff 76.91%) (0.135 FP/event)


Epoch 69: train=0.418269, val=0.42073, took 36.283 s
  Validation Found 40307 of 55048, added 934 (eff 73.22%) (0.0934 FP/event)


Epoch 70: train=0.417459, val=0.420429, took 36.24 s
  Validation Found 42273 of 55048, added 1355 (eff 76.79%) (0.135 FP/event)


Epoch 71: train=0.417602, val=0.421031, took 36.711 s
  Validation Found 40101 of 55048, added 988 (eff 72.85%) (0.0988 FP/event)


Epoch 72: train=0.417044, val=0.418808, took 36.235 s
  Validation Found 41738 of 55048, added 1258 (eff 75.82%) (0.126 FP/event)


Epoch 73: train=0.416315, val=0.420445, took 36.252 s
  Validation Found 40403 of 55048, added 884 (eff 73.40%) (0.0884 FP/event)


Epoch 74: train=0.416129, val=0.420672, took 36.272 s
  Validation Found 39966 of 55048, added 899 (eff 72.60%) (0.0899 FP/event)


Epoch 75: train=0.415628, val=0.420081, took 36.336 s
  Validation Found 40818 of 55048, added 1060 (eff 74.15%) (0.106 FP/event)


Epoch 76: train=0.415112, val=0.420565, took 36.244 s
  Validation Found 40601 of 55048, added 1056 (eff 73.76%) (0.106 FP/event)


Epoch 77: train=0.414702, val=0.419999, took 36.269 s
  Validation Found 41128 of 55048, added 1135 (eff 74.71%) (0.113 FP/event)


Epoch 78: train=0.414194, val=0.419775, took 36.266 s
  Validation Found 41344 of 55048, added 1061 (eff 75.11%) (0.106 FP/event)


Epoch 79: train=0.413659, val=0.420619, took 36.258 s
  Validation Found 41133 of 55048, added 1205 (eff 74.72%) (0.12 FP/event)


Epoch 80: train=0.413178, val=0.420183, took 36.233 s
  Validation Found 42024 of 55048, added 1303 (eff 76.34%) (0.13 FP/event)


Epoch 81: train=0.412722, val=0.419491, took 36.211 s
  Validation Found 41004 of 55048, added 1093 (eff 74.49%) (0.109 FP/event)


Epoch 82: train=0.412635, val=0.419494, took 36.761 s
  Validation Found 40595 of 55048, added 979 (eff 73.74%) (0.0979 FP/event)


Epoch 83: train=0.411787, val=0.419571, took 36.304 s
  Validation Found 41387 of 55048, added 1099 (eff 75.18%) (0.11 FP/event)


Epoch 84: train=0.411021, val=0.420224, took 36.294 s
  Validation Found 42226 of 55048, added 1326 (eff 76.71%) (0.133 FP/event)


Epoch 85: train=0.410746, val=0.419339, took 37.219 s
  Validation Found 40807 of 55048, added 944 (eff 74.13%) (0.0944 FP/event)


Epoch 86: train=0.410105, val=0.41863, took 36.3 s
  Validation Found 41422 of 55048, added 1104 (eff 75.25%) (0.11 FP/event)


Epoch 87: train=0.410021, val=0.418893, took 36.265 s
  Validation Found 41961 of 55048, added 1282 (eff 76.23%) (0.128 FP/event)


Epoch 88: train=0.409027, val=0.41862, took 36.314 s
  Validation Found 41249 of 55048, added 1087 (eff 74.93%) (0.109 FP/event)


Epoch 89: train=0.408764, val=0.420269, took 36.315 s
  Validation Found 42480 of 55048, added 1356 (eff 77.17%) (0.136 FP/event)


Epoch 90: train=0.408232, val=0.41827, took 36.277 s
  Validation Found 41696 of 55048, added 1208 (eff 75.74%) (0.121 FP/event)


Epoch 91: train=0.408015, val=0.418447, took 36.307 s
  Validation Found 41641 of 55048, added 1194 (eff 75.64%) (0.119 FP/event)


Epoch 92: train=0.40779, val=0.4198, took 36.279 s
  Validation Found 40338 of 55048, added 948 (eff 73.28%) (0.0948 FP/event)


Epoch 93: train=0.407099, val=0.418784, took 36.286 s
  Validation Found 41349 of 55048, added 1081 (eff 75.11%) (0.108 FP/event)


Epoch 94: train=0.406747, val=0.419116, took 36.348 s
  Validation Found 40525 of 55048, added 921 (eff 73.62%) (0.0921 FP/event)


Epoch 95: train=0.406103, val=0.418084, took 36.295 s
  Validation Found 41412 of 55048, added 1089 (eff 75.23%) (0.109 FP/event)


Epoch 96: train=0.405977, val=0.418502, took 36.273 s
  Validation Found 41082 of 55048, added 1033 (eff 74.63%) (0.103 FP/event)


Epoch 97: train=0.405313, val=0.418709, took 36.279 s
  Validation Found 40869 of 55048, added 1002 (eff 74.24%) (0.1 FP/event)


Epoch 98: train=0.405333, val=0.42053, took 36.325 s
  Validation Found 40701 of 55048, added 1067 (eff 73.94%) (0.107 FP/event)


Epoch 99: train=0.404884, val=0.419555, took 36.31 s
  Validation Found 40470 of 55048, added 1007 (eff 73.52%) (0.101 FP/event)
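
The y-axis limits chosen in the training loop can be checked by hand. A short worked example using the costs of the first three epochs from the log above:

```python
# First three epochs from the log above.
cost = [3.85666, 1.05126, 1.04849]
val = [1.0567, 1.05542, 1.05092]

# Same logic as in the training loop: ignore the first training cost
# for the upper limit once more than one epoch is available.
max_cost = max(max(cost if len(cost) < 2 else cost[1:]), max(val))
min_cost = min(min(cost), min(val))

print(max_cost, min_cost)  # 1.0567 1.04849
```

Without the filter, the upper limit would be set by the first-epoch training cost of 3.85666, which would compress the rest of the curve on the plot.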



## Results

Save the final model. Unless you changed the code above, the model state was already saved after every epoch, so this is an extra copy under a fixed name:

In [None]:
torch.save(model.state_dict(), output / f'{name}_final.pyt')
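
The final epoch is not necessarily the one with the lowest validation cost. Since a checkpoint was written every epoch, one option is to pick the best epoch afterwards. A sketch that parses a few lines copied from the training log; the regular expression assumes the log format shown in this notebook:

```python
import re

# A few lines copied from the training log above.
log = """\
Epoch 95: train=0.406103, val=0.418084, took 36.295 s
Epoch 96: train=0.405977, val=0.418502, took 36.273 s
Epoch 97: train=0.405313, val=0.418709, took 36.279 s
Epoch 98: train=0.405333, val=0.42053, took 36.325 s
Epoch 99: train=0.404884, val=0.419555, took 36.31 s
"""

pattern = re.compile(r"Epoch (\d+): train=([\d.]+), val=([\d.]+),")
val_by_epoch = {int(m.group(1)): float(m.group(3)) for m in pattern.finditer(log)}

# Epoch with the smallest validation cost among the parsed lines.
best_epoch = min(val_by_epoch, key=val_by_epoch.get)
print(best_epoch, val_by_epoch[best_epoch])  # 95 0.418084
```

Given the file naming used in the training loop, the matching checkpoint would be `output / f'{name}_{best_epoch}.pyt'`.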

Save the output results:

In [None]:
np.savez(output / f'{name}_stats.npz', **results._asdict())
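
`np.savez` stores each keyword argument as a separate array under the keyword's name, so `**results._asdict()` turns every field of the named tuple into one named array. A sketch of the unpacking step alone, with a hypothetical `Results` tuple, illustrative values, and a stand-in function in place of `np.savez`:

```python
from collections import namedtuple

# Hypothetical stand-in for the results tuple; the values are illustrative.
Results = namedtuple("Results", ["epoch", "cost", "val"])
results = Results(epoch=1, cost=[3.9, 1.05], val=[1.06, 1.06])

def show_keywords(path, **arrays):
    """Stand-in for np.savez: report which keyword names it would receive."""
    return sorted(arrays)

print(show_keywords("stats.npz", **results._asdict()))
```

When the real file is read back with `np.load`, the same field names serve as its keys.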

Save the plot above:

In [None]:
fig.savefig(str(output / f'{name}_stats_a.png'))