# Optimizing training and inference

In this notebook, we will discuss different ways to reduce memory and compute usage during training and inference.

## Prepare training script

When training large models, it is usually a best practice not to use Jupyter notebooks, but run a **separate script** for training which could have command-line flags for various hyperparameters and training modes. This is especially useful when you need to run multiple experiments simultaneously (e.g. on a cluster with task scheduler). Another advantage of this is that after training, the process will finish and free the resources for other users of a shared GPU.

In this part, you will need to put all your code to train a model on Tiny ImageNet that you wrote for the previous task in `train.py`.

You can then run your script from inside of this notebook like this:

In [24]:
! python3 train.py --batch_size 128 --epochs 2 --gpu_enabled \
                  --model_path model_state_dict_41.71.pcl \
                  --data_path tiny-imagenet-200 \
                  --model_module_path model.py \
                  --model_checkpoint_path model_checkpoints

Epoch 1 of 2 took 45.768s
  training loss (in-iteration): 	2.496664
  validation accuracy: 			48.23 %
Epoch 2 of 2 took 45.975s
  training loss (in-iteration): 	2.453964
  validation accuracy: 			46.06 %
Final results:
  test accuracy:		42.25 %


**Task** 

Write code for training with architecture from homework_part2

**Requirements**
* Optional arguments from command line such as batch size and number of epochs with built-in argparse
* Modular structure - separate functions for creating data generator, building model and training 


## Profiling time

For the next tasks, you need to add measurements to your training loop. You can use [`perf_counter`](https://docs.python.org/3/library/time.html#time.perf_counter) for that:

In [25]:
import time
import numpy as np
import torch

In [30]:
x = np.random.randn(1000, 1000)
y = np.random.randn(1000, 1000)

start_counter = time.perf_counter()
z = x @ y
elapsed_time = time.perf_counter() - start_counter
print("Matrix multiplication took {:.3f} seconds".format(elapsed_time))

Matrix multiplication took 0.020 seconds


In [36]:
! python3 train.py --batch_size 128 --epochs 2 --gpu_enabled \
                  --model_path model_state_dict_41.71.pcl \
                  --data_path tiny-imagenet-200 \
                  --model_module_path model.py \
                  --model_checkpoint_path model_checkpoints

Epoch 1 of 2 took 45.448s
  training loss (in-iteration): 	2.495216
  validation accuracy: 			47.26 %
		Forward pass took  2.290 seconds
		Backward pass took 1.755 seconds
Epoch 2 of 2 took 45.807s
  training loss (in-iteration): 	2.457833
  validation accuracy: 			46.19 %
		Forward pass took  2.262 seconds
		Backward pass took 1.715 seconds
Final results:
  test accuracy:		41.47 %


**Task**. You need to add the following measurements to your training script:
* How much time a forward-backward pass takes for a single batch;
* How much time an epoch takes.

## Profiling memory usage

**Task**. You need to measure the memory consumptions

This section depends on whether you train on CPU or GPU.

### If you train on CPU
You can use GNU time to measure peak RAM usage of a script:

In [None]:
!/usr/bin/time -lp python train.py

**Maximum resident set size**  will show you the peak RAM usage in bytes after the script finishes.

**Note**. 
Imports also require memory, do the correction

### If you train on GPU

Use [`torch.cuda.max_memory_allocated()`](https://pytorch.org/docs/stable/cuda.html#torch.cuda.max_memory_allocated) at the end of your script to show the maximum amount of memory in bytes used by all tensors.

In [40]:
x = torch.randn(1000, 1000, 1000, device='cuda:0')
print("Peak memory usage by Pytorch tensors: {:.2f} Mb".format((torch.cuda.max_memory_allocated() / 1024 / 1024)))

Peak memory usage by Pytorch tensors: 3815.58 Mb


In [1]:
! python3 train.py --batch_size 128 --epochs 2 --gpu_enabled \
                  --model_path model_state_dict_41.71.pcl \
                  --data_path tiny-imagenet-200 \
                  --model_module_path model.py \
                  --model_checkpoint_path model_checkpoints

Epoch 1 of 2 took 45.652s
  training loss (in-iteration): 	2.493698
  validation accuracy: 			47.80 %
		Forward pass took  2.287 seconds
		Backward pass took 1.732 seconds
Epoch 2 of 2 took 45.916s
  training loss (in-iteration): 	2.452110
  validation accuracy: 			45.72 %
		Forward pass took  2.271 seconds
		Backward pass took 1.729 seconds
Final results:
  test accuracy:		41.35 %
Peak memory usage by Pytorch tensors: 1660.91 Mb


## Gradient based techniques

Modern architectures can potentially consume lots and lots of memory even for minibatch of several objects. To handle such cases here we will discuss two simple techniques.

### Gradient Checkpointing

Checkpointing works by trading compute for memory. Rather than storing all intermediate activations of the entire computation graph for computing backward, the checkpointed part does not save intermediate activations, and instead recomputes them in backward pass. It can be applied on any part of a model.

See [blogpost](https://medium.com/tensorflow/fitting-larger-networks-into-memory-583e3c758ff9) for kind introduction and different strategies or [article](https://arxiv.org/pdf/1604.06174.pdf) for not kind introduction.

**Task**. Use [built-in checkpointing](https://pytorch.org/docs/stable/checkpoint.html), measure the difference in memory/compute 

**Requirements**. 
* Try several arrangements for checkpoints
* Add the chekpointing as the optional flag into your script
* Measure the difference in memory/compute between the different arrangements and baseline 

In [9]:
! python3 -W ignore train.py --batch_size 128 --epochs 2 --gpu_enabled --checkpoint \
                             --model_path model_state_dict_41.71.pcl \
                             --data_path tiny-imagenet-200 \
                             --model_module_path model.py \
                             --model_checkpoint_path model_checkpoints

Epoch 1 of 2 took 24.709s
  training loss (in-iteration): 	2.426903
  validation accuracy: 			50.22 %
		Forward pass took  10.040 seconds
		Backward pass took 0.419 seconds
Epoch 2 of 2 took 24.785s
  training loss (in-iteration): 	2.365057
  validation accuracy: 			48.75 %
		Forward pass took  10.074 seconds
		Backward pass took 0.524 seconds
Final results:
  test accuracy:		43.05 %
Peak memory usage by Pytorch tensors: 1650.14 Mb


In [3]:
# память уменьшилась незначительно, но стало быстрее почти в 2 раза за итерацию

### Accumulating gradient for large batches
We can increase the effective batch size by simply accumulating gradients over multiple forward passes. Note that `loss.backward()` simply adds the computed gradient to `tensor.grad`, so we can call this method multiple times before actually taking an optimizer step. However, this approach might be a little tricky to combine with batch normalization. Do you see why?

In [13]:
! python3 -W ignore train.py --batch_size 128 --epochs 2 --gpu_enabled \
                             --model_path model_state_dict_41.71.pcl \
                             --data_path tiny-imagenet-200 \
                             --model_module_path model.py \
                             --model_checkpoint_path model_checkpoints \
                             --effective_batch_size 1024

Epoch 1 of 2 took 42.368s
  training loss (in-iteration): 	2.312247
  validation accuracy: 			53.63 %
		Forward pass took  1.855 seconds
		Backward pass took 1.853 seconds
Saving new best model!
Epoch 2 of 2 took 42.571s
  training loss (in-iteration): 	2.205992
  validation accuracy: 			53.48 %
		Forward pass took  1.850 seconds
		Backward pass took 1.790 seconds
Final results:
  test accuracy:		44.61 %
Peak memory usage by Pytorch tensors: 1660.91 Mb


In [None]:
#  Модель за 2 эпохи пробила прошлый скор:) по скорости просадки нет

In [14]:
# from torch.utils.data import DataLoader


# effective_batch_size = 1024
# loader_batch_size = 32
# batches_per_update = effective_batch_size / loader_batch_size # Updating weights after 8 forward passes

# dataloader = DataLoader(dataset, batch_size=loader_batch_size)

# optimizer.zero_grad()

# for batch_i, (batch_X, batch_y) in enumerate(dataloader):
#     l = loss(model(batch_X), batch_y)
#     l.backward() # Adds gradients
  
#     if (batch_i + 1) % batches_per_update == 0:
#         optimizer.step()
#         optimizer.zero_grad()

**Task**. Explore the trade-off between computation time and memory usage while maintaining the same effective batch size. By effective batch size we mean the number of objects over which the loss is computed before taking a gradient step.

**Requirements**

* Compare compute between accumulating gradient and gradient checkpointing with similar memory consumptions
* Incorporate gradient accumulation into your script with optional argument

## Accuracy vs compute trade-off

### Tensor type size

One of the hyperparameter affecting memory consumption is the precision (e.g. floating point number). The most popular choice is 32 bit however with several hacks* 16 bit arithmetics can save you approximately half of the memory without considerable loss of perfomance. This is called mixed precision training.

*https://arxiv.org/pdf/1710.03740.pdf

### Quantization

We can actually move further and use even lower precision like 8-bit integers:

* https://heartbeat.fritz.ai/8-bit-quantization-and-tensorflow-lite-speeding-up-mobile-inference-with-low-precision-a882dfcafbbd
* https://nervanasystems.github.io/distiller/quantization/
* https://arxiv.org/abs/1712.05877

### Knowledge distillation
Suppose that we have a large network (*teacher network*) or an ensemble of networks which has a good accuracy. We can like train a much smaller network (*student network*) using the outputs of teacher networks. It turns out that the perfomance could be even better! This approach doesn't help with training speed, but can be quite beneficial when we'd like to reduce the model size for low-memory devices.

* https://www.ttic.edu/dl/dark14.pdf
* [Distilling the Knowledge in a Neural Network](https://arxiv.org/abs/1503.02531)
* https://medium.com/neural-machines/knowledge-distillation-dc241d7c2322

Even the completely different ([article](https://arxiv.org/abs/1711.10433)) architecture can be used in a student model, e.g. you can approximate an autoregressive model (WaveNet) by a non-autoregressive one.

**Task**. Distill your (teacher) network with smaller one (student), compare it perfomance with the teacher network and with the same (student) trained directly from data.

**Note**. Logits carry more information than the probabilities after softmax

This approach doesn't help with training speed, but can be quite beneficial when we'd like to reduce the model size for low-memory devices.

In [29]:
from imp import load_source
import numpy as np
import argparse
import sys
import time
import os

import torchvision
from torchvision import transforms
import torch
from torch.autograd import Variable
from torch.utils.checkpoint import checkpoint_sequential
    
means = np.array([0.485, 0.456, 0.406])
stds = np.array([0.229, 0.224, 0.225])

random_seed = 42
torch.manual_seed(random_seed)

def count_score(model, batch_gen, accuracy_list, gpu=False):
    model.train(False) # disable dropout / use averages for batch_norm
    for X_batch, y_batch in batch_gen:
        if gpu:
            logits = model(Variable(torch.FloatTensor(X_batch)).cuda())
        else:
            logits = model(Variable(torch.FloatTensor(X_batch)).cpu())

        y_pred = logits.max(1)[1].data
        accuracy_list.append(np.mean( (y_batch.cpu() == y_pred.cpu()).numpy() ))
    return accuracy_list

def train(student_model, teacher_model, opt, loss_fn, model_checkpoint_path, data_path, use_checkpoint=False, gpu=False, batch_size=128, epochs=100, effective_batch_size=None):
    transform = {
        'train': transforms.Compose([
            transforms.RandomRotation((-30,30)),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            transforms.Normalize(means, stds)
        ]),
        'val': transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize(means, stds)
        ])
    }
    
    dataset = torchvision.datasets.ImageFolder(os.path.join(data_path, 'train'), transform=transform['train'])
    train_dataset, val_dataset = torch.utils.data.random_split(dataset, [80000, 20000])


    train_batch_gen = torch.utils.data.DataLoader(train_dataset, 
                                                  batch_size=batch_size,
                                                  shuffle=True,
                                                  num_workers=4)
    val_batch_gen = torch.utils.data.DataLoader(val_dataset, 
                                                batch_size=batch_size,
                                                shuffle=True,
                                                num_workers=2)
    
    train_loss = []
    val_accuracy = []
    
    prev_val_acc =  0
    
    if effective_batch_size is None:
        batches_per_update = 1
    else:
        batches_per_update = effective_batch_size / batch_size
    
    if use_checkpoint:
        segments = 4
        modules = [module for k, module in student_model._modules.items()]
    
    additional_criterion = nn.CrossEntropyLoss()
    
    for epoch in range(epochs):
        # In each epoch, we do a full pass over the training data:
        start_time = time.time()
        student_model.train(True) # enable dropout / batch_norm training behavior
        forward_time = 0
        backward_time = 0

        for batch_i, (X_batch, y_batch) in enumerate(train_batch_gen):
            # train on batch
            start_counter = time.perf_counter()
            if gpu:
                X_batch = Variable(torch.FloatTensor(X_batch)).cuda()
                y_batch = Variable(torch.LongTensor(y_batch)).cuda()
                if use_checkpoint:
                    raise NotImplementedError()
                else:
                    logits = student_model.cuda()(X_batch)
                    teacher_logits = teacher_model.cuda()(X_batch)
            else:
                X_batch = Variable(torch.FloatTensor(X_batch)).cpu()
                y_batch = Variable(torch.LongTensor(y_batch)).cpu()
                if use_checkpoint:
                    raise NotImplementedError()
                else:
                    logits = student_model.cpu()(X_batch)
                    teacher_logits = teacher_model.cpu()(X_batch)

            loss = 0.8*loss_fn(logits, teacher_logits) + 0.2* additional_criterion(logits, y_batch)

            apply_counter = time.perf_counter()
            forward_time += apply_counter - start_counter
            
            loss.backward()
            backward_time += time.perf_counter() - apply_counter
            
            if (batch_i + 1) % batches_per_update == 0:
                opt.step()
                opt.zero_grad()

            train_loss.append(loss.data.cpu().numpy())
        
        val_accuracy = count_score(student_model, batch_gen=val_batch_gen, accuracy_list=val_accuracy, gpu=gpu)
        vall_acc =  np.mean(val_accuracy[-len(val_dataset) // batch_size :]) * 100

        # Then we print the results for this epoch:
        print("Epoch {} of {} took {:.3f}s".format(
            epoch + 1, epochs, time.time() - start_time))
        print("  training loss (in-iteration): \t{:.6f}".format(
            np.mean(train_loss[-len(train_dataset) // batch_size :])))
        print("  validation accuracy: \t\t\t{:.2f} %".format(vall_acc))
        torch.save(student_model.state_dict(), os.path.join(model_checkpoint_path, "model_{}_{:.2f}.pcl".format(epoch, vall_acc)))
        
        print("\t\tForward pass took  {:.3f} seconds".format(forward_time))
        print("\t\tBackward pass took {:.3f} seconds".format(backward_time))

        if vall_acc > prev_val_acc:
            prev_val_acc = vall_acc
            print("Saving new best model!")
            torch.save(student_model.state_dict(), os.path.join(model_checkpoint_path, "model_best.pcl"))

    return student_model

def validate(model, data_path, batch_size):
    transform = {
        'test': transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize(means, stds)
        ])
    }
    
    test_dataset = torchvision.datasets.ImageFolder(os.path.join(data_path, 'new_val'), transform=transform['test'])
    test_batch_gen = torch.utils.data.DataLoader(test_dataset, 
                                                 batch_size=batch_size,
                                                 shuffle=False,
                                                 num_workers=2)
    
    model.train(False) # disable dropout / use averages for batch_norm
    test_acc = []

    for X_batch, y_batch in test_batch_gen:
        logits = model(Variable(torch.FloatTensor(X_batch)).cuda())
        y_pred = logits.max(1)[1].data
        test_acc += list((y_batch.cpu() == y_pred.cpu()).numpy())
    
    test_accuracy = np.mean(test_acc)
    
    print("Final results:")
    print("  test accuracy:\t\t{:.2f} %".format(
        test_accuracy * 100))



In [2]:
gpu_enabled=True 



In [3]:
if gpu_enabled:
    torch.cuda.manual_seed(random_seed)

load_source("teacher_model", "model.py") 
from teacher_model import get_model as get_teacher_model


teacher_model, _, _ = get_teacher_model(model_path="model_checkpoints/model_best.pcl", gpu=gpu_enabled)


In [4]:
teacher_model.train(False)

Sequential(
  (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (2): ReLU(inplace)
  (3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (4): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (5): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (6): ReLU(inplace)
  (7): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (8): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (9): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (10): ReLU(inplace)
  (11): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (12): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (13): ReLU(inplace)
  (14): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (15)

In [23]:
import torch, torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

class Flatten(nn.Module):
    def forward(self, input):
        return input.view(input.size(0), -1)
    
student_model = nn.Sequential()


student_model.add_module('conv1', nn.Conv2d(3, 16, kernel_size=5, padding=2))
student_model.add_module('batchnorm1', nn.BatchNorm2d(16))
student_model.add_module('ReLU1', nn.ReLU())
student_model.add_module('MaxPool1', nn.MaxPool2d(2))

student_model.add_module('conv2', nn.Conv2d(16, 32, kernel_size=5, padding=2))
student_model.add_module('batchnorm1', nn.BatchNorm2d(16))
student_model.add_module('ReLU1', nn.ReLU())
student_model.add_module('MaxPool1', nn.MaxPool2d(2))

student_model.add_module('flatten', Flatten())

student_model.add_module('dense1', nn.Linear(32768, 1000))
student_model.add_module('dense1_relu', nn.ReLU())
student_model.add_module('dropout1', nn.Dropout(0.3))

student_model.add_module('dense2', nn.Linear(1000, 512))
student_model.add_module('dense2_relu', nn.ReLU())
student_model.add_module('dropout2', nn.Dropout(0.05))

student_model.add_module('dense2_logits', nn.Linear(512, 200)) # logits for 200 classes
student_model.eval()

Sequential(
  (conv1): Conv2d(3, 16, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
  (batchnorm1): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (ReLU1): ReLU()
  (MaxPool1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (conv2): Conv2d(16, 32, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
  (flatten): Flatten()
  (dense1): Linear(in_features=32768, out_features=1000, bias=True)
  (dense1_relu): ReLU()
  (dropout1): Dropout(p=0.3)
  (dense2): Linear(in_features=1000, out_features=512, bias=True)
  (dense2_relu): ReLU()
  (dropout2): Dropout(p=0.05)
  (dense2_logits): Linear(in_features=512, out_features=200, bias=True)
)

In [18]:
class RMSELoss(torch.nn.Module):
    def __init__(self):
        super(RMSELoss,self).__init__()

    def forward(self,x,y):
        criterion = nn.MSELoss()
        loss = torch.sqrt(criterion(x, y))
        return loss

In [26]:
opt = torch.optim.Adam(student_model.parameters(), lr=0.001, weight_decay=0.0001)
loss_fn = nn.MSELoss()

In [27]:
student_model = train(student_model=student_model, teacher_model=teacher_model, 
                      opt=opt, loss_fn=loss_fn, 
                      model_checkpoint_path='student_checkpoints', 
                      data_path='tiny-imagenet-200', 
                      gpu=gpu_enabled, 
                      batch_size=128, 
                      epochs=100, 
                      use_checkpoint=False, 
                      effective_batch_size=1024)
validate(student_model, data_path='tiny-imagenet-200', batch_size=128)



Epoch 1 of 100 took 52.281s
  training loss (in-iteration): 	12.182045
  validation accuracy: 			0.84 %
		Forward pass took  2.211 seconds
		Backward pass took 2.173 seconds
Saving new best model!
Epoch 2 of 100 took 53.234s
  training loss (in-iteration): 	8.665157
  validation accuracy: 			1.35 %
		Forward pass took  2.207 seconds
		Backward pass took 2.500 seconds
Saving new best model!
Epoch 3 of 100 took 53.435s
  training loss (in-iteration): 	8.121580
  validation accuracy: 			2.13 %
		Forward pass took  2.199 seconds
		Backward pass took 2.528 seconds
Saving new best model!
Epoch 4 of 100 took 53.435s
  training loss (in-iteration): 	7.627388
  validation accuracy: 			2.57 %
		Forward pass took  2.209 seconds
		Backward pass took 2.588 seconds
Saving new best model!
Epoch 5 of 100 took 53.590s
  training loss (in-iteration): 	7.233976
  validation accuracy: 			3.04 %
		Forward pass took  2.232 seconds
		Backward pass took 2.519 seconds
Saving new best model!
Epoch 6 of 100 took

Epoch 44 of 100 took 53.811s
  training loss (in-iteration): 	3.543021
  validation accuracy: 			19.86 %
		Forward pass took  2.226 seconds
		Backward pass took 1.916 seconds
Saving new best model!
Epoch 45 of 100 took 53.900s
  training loss (in-iteration): 	3.505640
  validation accuracy: 			20.28 %
		Forward pass took  2.233 seconds
		Backward pass took 2.307 seconds
Saving new best model!
Epoch 46 of 100 took 53.854s
  training loss (in-iteration): 	3.494891
  validation accuracy: 			20.31 %
		Forward pass took  2.219 seconds
		Backward pass took 2.657 seconds
Saving new best model!
Epoch 47 of 100 took 53.736s
  training loss (in-iteration): 	3.490957
  validation accuracy: 			20.65 %
		Forward pass took  2.221 seconds
		Backward pass took 2.579 seconds
Saving new best model!
Epoch 48 of 100 took 53.862s
  training loss (in-iteration): 	3.444209
  validation accuracy: 			20.72 %
		Forward pass took  2.224 seconds
		Backward pass took 2.599 seconds
Saving new best model!
Epoch 49 o

Epoch 91 of 100 took 53.894s
  training loss (in-iteration): 	2.867838
  validation accuracy: 			23.78 %
		Forward pass took  2.263 seconds
		Backward pass took 2.452 seconds
Epoch 92 of 100 took 53.753s
  training loss (in-iteration): 	2.848913
  validation accuracy: 			24.03 %
		Forward pass took  2.244 seconds
		Backward pass took 2.332 seconds
Saving new best model!
Epoch 93 of 100 took 53.786s
  training loss (in-iteration): 	2.854687
  validation accuracy: 			23.87 %
		Forward pass took  2.240 seconds
		Backward pass took 2.353 seconds
Epoch 94 of 100 took 53.809s
  training loss (in-iteration): 	2.862361
  validation accuracy: 			23.97 %
		Forward pass took  2.241 seconds
		Backward pass took 2.337 seconds
Epoch 95 of 100 took 53.795s
  training loss (in-iteration): 	2.815340
  validation accuracy: 			24.38 %
		Forward pass took  2.237 seconds
		Backward pass took 2.554 seconds
Saving new best model!
Epoch 96 of 100 took 53.847s
  training loss (in-iteration): 	2.823910
  valida

NameError: name 'model' is not defined

In [41]:
student_model = train(student_model=student_model, teacher_model=teacher_model, 
                      opt=opt, loss_fn=loss_fn, 
                      model_checkpoint_path='student_checkpoints', 
                      data_path='tiny-imagenet-200', 
                      gpu=gpu_enabled, 
                      batch_size=128, 
                      epochs=100, 
                      use_checkpoint=False, 
                      effective_batch_size=1024)



Epoch 1 of 100 took 51.330s
  training loss (in-iteration): 	2.394271
  validation accuracy: 			36.27 %
		Forward pass took  2.292 seconds
		Backward pass took 2.649 seconds
Saving new best model!
Epoch 2 of 100 took 51.964s
  training loss (in-iteration): 	2.376740
  validation accuracy: 			36.09 %
		Forward pass took  2.283 seconds
		Backward pass took 2.710 seconds
Epoch 3 of 100 took 52.887s
  training loss (in-iteration): 	2.363281
  validation accuracy: 			35.52 %
		Forward pass took  2.318 seconds
		Backward pass took 2.696 seconds
Epoch 4 of 100 took 53.325s
  training loss (in-iteration): 	2.360155
  validation accuracy: 			35.16 %
		Forward pass took  2.295 seconds
		Backward pass took 2.724 seconds
Epoch 5 of 100 took 53.522s
  training loss (in-iteration): 	2.357803
  validation accuracy: 			35.10 %
		Forward pass took  2.309 seconds
		Backward pass took 2.715 seconds
Epoch 6 of 100 took 53.559s
  training loss (in-iteration): 	2.371397
  validation accuracy: 			34.64 %
		F

Epoch 48 of 100 took 53.946s
  training loss (in-iteration): 	2.271876
  validation accuracy: 			32.26 %
		Forward pass took  2.332 seconds
		Backward pass took 2.164 seconds
Epoch 49 of 100 took 53.955s
  training loss (in-iteration): 	2.274115
  validation accuracy: 			32.23 %
		Forward pass took  2.329 seconds
		Backward pass took 2.356 seconds
Epoch 50 of 100 took 54.047s
  training loss (in-iteration): 	2.275790
  validation accuracy: 			31.96 %
		Forward pass took  2.339 seconds
		Backward pass took 2.506 seconds
Epoch 51 of 100 took 54.042s
  training loss (in-iteration): 	2.270029
  validation accuracy: 			32.13 %
		Forward pass took  2.329 seconds
		Backward pass took 2.638 seconds
Epoch 52 of 100 took 53.896s
  training loss (in-iteration): 	2.269085
  validation accuracy: 			32.21 %
		Forward pass took  2.328 seconds
		Backward pass took 2.018 seconds
Epoch 53 of 100 took 53.941s
  training loss (in-iteration): 	2.268888
  validation accuracy: 			32.25 %
		Forward pass took 

KeyboardInterrupt: 

In [42]:
validate(student_model, data_path='tiny-imagenet-200', batch_size=128)


Final results:
  test accuracy:		30.97 %


На исходных данных эта сеть за аналогичное число шагов обучилась до 25.44 %, что говорит о практической пользе подхода

### Pruning

The idea of pruning is to remove unnecessary (in terms of loss) weights. It can be measured in different ways: for example, by the norm of the weights (similar to L1 feature selection), by the magnitude of the activation or via Taylor expansion*.

One iteration of pruning consists of two steps:

1) Rank weights with some importance measure and remove the least important

2) Fine-tune the model

This approach is a bit computationally heavy but can lead to drastic (up to 150x) decrease of memory to store the weights. Moreover if you make use of structure in layers you can decrease also compute. For example, the whole convolutional filters can be removed.

*https://arxiv.org/pdf/1611.06440.pdf