# Homework 2.2: The Quest For A Better Network

In this assignment you will build a monster network to solve CIFAR10 image classification.

This notebook is intended as a sequel to seminar 3; please give the seminar a try if you haven't done so yet.

(please read it at least diagonally)

* The ultimate quest is to create a network with as high an __accuracy__ as you can push it to.
* There is a __mini-report__ at the end that you will have to fill in. We recommend reading it first and filling it in as you iterate.
 
## Grading
* starting at zero points
* +20% for describing your iteration path in a report below.
* +20% for building a network that gets above 20% accuracy
* +10% for beating each of these milestones on __TEST__ dataset:
    * 50% (50% points)
    * 60% (60% points)
    * 65% (70% points)
    * 70% (80% points)
    * 75% (90% points)
    * 80% (full points)
    
## Restrictions
* Please do NOT use pre-trained networks for this assignment until you reach 80%.
 * In other words, base milestones must be beaten without pre-trained nets (and such net must be present in the e-mail). After that, you can use whatever you want.
* you __can__ use validation data for training, but you __can't__ do anything with test data apart from running the evaluation procedure.

## Tips on what can be done:


 * __Network size__
   * MOAR neurons, 
   * MOAR layers, ([torch.nn docs](http://pytorch.org/docs/master/nn.html))

   * Nonlinearities in the hidden layers
     * tanh, relu, leaky relu, etc
   * Larger networks may take more epochs to train, so don't discard your net just because it didn't beat the baseline in 5 epochs.

   * Ph'nglui mglw'nafh Cthulhu R'lyeh wgah'nagl fhtagn!


### The main rule of prototyping: one change at a time
   * By now you probably have several ideas on what to change. By all means, try them out! But there's a catch: __never test several new things at once__.


### Optimization
   * Training for 100 epochs regardless of anything is probably a bad idea.
   * Some networks converge over 5 epochs, others - over 500.
   * Way to go: stop when validation score is 10 iterations past maximum
   * You should certainly use adaptive optimizers
     * rmsprop, nesterov_momentum, adam, adagrad and so on.
     * Converge faster and sometimes reach better optima
     * It might make sense to tweak learning rate/momentum, other learning parameters, batch size and number of epochs
   * __BatchNormalization__ (nn.BatchNorm2d) for the win!
     * Sometimes more batch normalization is better.
   * __Regularize__ to prevent overfitting
     * Add some L2 weight norm to the loss function, PyTorch autograd will do the rest
       * Can be done manually or like [this](https://discuss.pytorch.org/t/simple-l2-regularization/139/2).
     * Dropout (`nn.Dropout`) - to prevent overfitting
       * Don't overdo it. Check if it actually makes your network better
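
The stopping rule from the list above (stop once the validation score has gone 10 checks without a new maximum) can be sketched as a small helper. This is only an illustration: the class name `EarlyStopper`, its `patience` argument and the toy accuracy curve are made up here and are not part of the assignment template.

```python
class EarlyStopper:
    """Signals a stop once the score has not improved for `patience` checks.

    Hypothetical helper, not part of the assignment template.
    """

    def __init__(self, patience=10):
        self.patience = patience
        self.best_score = float("-inf")
        self.checks_since_best = 0

    def should_stop(self, score):
        """Record a new validation score; return True when it is time to stop."""
        if score > self.best_score:
            self.best_score = score
            self.checks_since_best = 0
        else:
            self.checks_since_best += 1
        return self.checks_since_best >= self.patience


# Toy usage with a made-up accuracy curve and a short patience.
stopper = EarlyStopper(patience=3)
fake_val_accuracy = [0.50, 0.62, 0.70, 0.69, 0.68, 0.70, 0.66]
stopped_at = None
for epoch, acc in enumerate(fake_val_accuracy):
    if stopper.should_stop(acc):
        stopped_at = epoch
        break
```

In a real training loop one would probably call `should_stop` once per epoch with the mean validation accuracy, and keep a copy of the weights from the best epoch.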
   
### Convolution architectures
   * This task __can__ be solved by a sequence of convolutions and poolings with batch_norm and ReLU seasoning, but you shouldn't necessarily stop there.
   * [Inception family](https://hacktilldawn.com/2016/09/25/inception-modules-explained-and-implemented/), [ResNet family](https://towardsdatascience.com/an-overview-of-resnet-and-its-variants-5281e2f56035?gi=9018057983ca), [Densely-connected convolutions (exotic)](https://arxiv.org/abs/1608.06993), [Capsule networks (exotic)](https://arxiv.org/abs/1710.09829)
   * Please do try a few simple architectures before you go for resnet-152.
   * Warning! Training convolutional networks can take long without GPU. That's okay.
     * If you are CPU-only, we still recommend that you try a simple convolutional architecture
     * a perfect option is to set it up to run overnight and check on it in the morning.
     * Make reasonable layer size estimates. A 128-neuron first convolution is likely an overkill.
     * __To reduce computation__ time by a factor in exchange for some accuracy drop, try using __stride__ parameter. A stride=2 convolution should take roughly 1/4 of the default (stride=1) one.
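
The "roughly 1/4" figure for stride=2 follows from the output size formula: halving both spatial dimensions leaves a quarter of the output positions to compute. Below is a rough back-of-the-envelope sketch; the helper names `conv_output_size` and `conv_macs` are invented for this example, and the count ignores everything except multiply-accumulates.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Output side length of a square convolution (dilation 1, floor rounding)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_macs(in_size, in_ch, out_ch, kernel, stride=1, padding=0):
    """Approximate multiply-accumulate count for one square conv layer."""
    out_size = conv_output_size(in_size, kernel, stride, padding)
    return out_size * out_size * kernel * kernel * in_ch * out_ch


# First layer on a 32x32 RGB image, 64 filters of size 3x3.
macs_stride1 = conv_macs(32, in_ch=3, out_ch=64, kernel=3, stride=1, padding=1)
macs_stride2 = conv_macs(32, in_ch=3, out_ch=64, kernel=3, stride=2, padding=1)
print(macs_stride1 / macs_stride2)  # 4.0
```

Every layer after the strided one also sees a 4x smaller feature map, so the saving carries through the rest of the network.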
 
   
### Data augmentation
   * Getting a 5x larger dataset for free is a great deal
     * Zoom-in+slice = move
     * Rotate+zoom(to remove black stripes)
     * Add noise (gaussian or bernoulli)
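   * Simple way to do that (if you have PIL/Image): 
     * ```from scipy.misc import imrotate,imresize``` (these were removed from recent SciPy releases; on newer installs use Pillow's `Image.rotate` / `Image.resize` or `skimage.transform` instead)
     * and a bit of slicing
     * Other cool libraries: cv2, skimage, PIL/Pillow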
   * A more advanced way is to use torchvision transforms:
    ```
    import torch
    import torchvision
    from torchvision import transforms

    # random shifts (pad by 4, crop back to 32x32) and mirror flips, then per-channel normalization
    transform_train = transforms.Compose([
        transforms.RandomCrop(32, padding=4),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
    ])
    trainset = torchvision.datasets.CIFAR10(root=path_to_cifar_like_in_seminar, train=True, download=True, transform=transform_train)
    trainloader = torch.utils.data.DataLoader(trainset, batch_size=128, shuffle=True, num_workers=2)
    ```
   * Or use this tool from Keras (requires theano/tensorflow): [tutorial](https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html), [docs](https://keras.io/preprocessing/image/)
   * Stay realistic. There's usually no point in flipping dogs upside down as that is not the way you usually see them.
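
If you stay with the numpy arrays from the template rather than a torchvision dataset, the "move" and flip ideas above can be approximated directly on a batch. The sketch below is one possible way to do it, not the reference solution; `augment_batch` and all of its parameters are names invented for this example.

```python
import numpy as np


def augment_batch(X, pad=4, flip_prob=0.5, noise_std=0.0, rng=None):
    """Random shift (pad + crop), horizontal flip and optional Gaussian noise.

    X is a float array of shape (batch, channels, height, width).
    Hypothetical helper, not part of the assignment template.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, c, h, w = X.shape
    # Zero-pad the spatial dimensions, then cut a random h x w window per image.
    padded = np.pad(X, ((0, 0), (0, 0), (pad, pad), (pad, pad)), mode="constant")
    out = np.empty_like(X)
    for i in range(n):
        top = rng.integers(0, 2 * pad + 1)
        left = rng.integers(0, 2 * pad + 1)
        crop = padded[i, :, top:top + h, left:left + w]
        if rng.random() < flip_prob:
            crop = crop[:, :, ::-1]  # mirror left-right
        out[i] = crop
    if noise_std > 0:
        out = out + rng.normal(0.0, noise_std, size=out.shape).astype(X.dtype)
    return out
```

It would be applied to training batches only, e.g. `X_batch = augment_batch(X_batch)` before computing the loss; validation and test batches stay untouched.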
   

   
There is a template for your solution below that you can opt to use or throw away and write it your way.

In [1]:
import math
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
from cifar import load_cifar10
X_train,y_train,X_val,y_val,X_test,y_test = load_cifar10("cifar_data")
class_names = np.array(['airplane','automobile','bird','cat','deer','dog','frog','horse','ship','truck'])

print(X_train.shape,y_train.shape)

(40000, 3, 32, 32) (40000,)


In [3]:
import torch, torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
class Flatten(nn.Module):
    def forward(self, input):
        return input.view(input.size(0), -1)

In [4]:
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=(5, 5), padding=2, bias=False),
    nn.BatchNorm2d(64),
    nn.LeakyReLU(),
    
    nn.Conv2d(64, 64, kernel_size=(3, 3), padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.LeakyReLU(),
    
    nn.MaxPool2d((3, 3), stride=2, padding=1),
    
    nn.Conv2d(64, 128, kernel_size=(3, 3), padding=1, bias=False),
    nn.BatchNorm2d(128),
    nn.LeakyReLU(),
    
    nn.Conv2d(128, 128, kernel_size=(3, 3), padding=1, bias=False),
    nn.BatchNorm2d(128),
    nn.LeakyReLU(),
    
    nn.MaxPool2d((3, 3), stride=2, padding=1),
    
    nn.Conv2d(128, 256, kernel_size=(3, 3), padding=1, bias=False),
    nn.BatchNorm2d(256),
    nn.LeakyReLU(),
    
    nn.Conv2d(256, 256, kernel_size=(3, 3), padding=1, bias=False),
    nn.BatchNorm2d(256),
    nn.LeakyReLU(),
    
    nn.MaxPool2d((3, 3), stride=2, padding=1),
    
    Flatten(),
    
    nn.Dropout(),
    nn.Linear(4096, 2048, bias=False),
    nn.BatchNorm1d(2048),  # 1d: the input here is (batch, features) after Flatten
    nn.LeakyReLU(),
    
    nn.Dropout(),
    nn.Linear(2048, 1024, bias=False),
    nn.BatchNorm1d(1024),
    nn.LeakyReLU(),
    
    nn.Dropout(),
    nn.Linear(1024, 10)
)

model.cuda()

Sequential(
  (0): Conv2d(3, 64, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), bias=False)
  (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
  (2): LeakyReLU(0.01)
  (3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (4): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
  (5): LeakyReLU(0.01)
  (6): MaxPool2d(kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), dilation=(1, 1), ceil_mode=False)
  (7): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (8): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
  (9): LeakyReLU(0.01)
  (10): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (11): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
  (12): LeakyReLU(0.01)
  (13): MaxPool2d(kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), dilation=(1, 1), ceil_mode=False)
  (14): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (15): Batc

In [5]:
def compute_loss(X_batch, y_batch):
    X_batch = Variable(torch.FloatTensor(X_batch)).cuda()
    y_batch = Variable(torch.LongTensor(y_batch)).cuda()
    logits = model(X_batch)
    return F.cross_entropy(logits, y_batch).mean()

__Training__

In [6]:
def adjust_learning_rate(optimizer, initial_lr, epoch):
    lr = initial_lr * (0.9 ** (epoch // 3))
    print("New learning rate: {}".format(lr))
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

In [7]:
def iterate_minibatches(X, y, batchsize):
    indices = np.random.permutation(np.arange(len(X)))
    for start in range(0, len(indices), batchsize):
        ix = indices[start: start + batchsize]
        yield X[ix], y[ix]
        
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)

train_loss = []
val_accuracy = []

In [8]:
import time
num_epochs = 100 # total amount of full passes over training data
batch_size = 64  # number of samples processed in one SGD iteration

for epoch in range(num_epochs):
    adjust_learning_rate(opt, 0.01, epoch)
    # In each epoch, we do a full pass over the training data:
    start_time = time.time()
    model.train(True) # enable dropout / batch_norm training behavior
    for X_batch, y_batch in iterate_minibatches(X_train, y_train, batch_size):
        # train on batch
        loss = compute_loss(X_batch, y_batch)
        loss.backward()
        opt.step()
        opt.zero_grad()
        train_loss.append(loss.item())  # .item() pulls the Python float out of the 0-dim loss tensor
        
    # And a full pass over the validation data:
    model.train(False) # disable dropout / use averages for batch_norm
    for X_batch, y_batch in iterate_minibatches(X_val, y_val, batch_size):
        logits = model(Variable(torch.FloatTensor(X_batch)).cuda())
        y_pred = logits.max(1)[1].data.cpu().numpy()
        val_accuracy.append(np.mean(y_batch == y_pred))

    
    # Then we print the results for this epoch:
    print("Epoch {} of {} took {:.3f}s".format(
        epoch + 1, num_epochs, time.time() - start_time))
    print("  training loss (in-iteration): \t{:.6f}".format(
        np.mean(train_loss[-len(X_train) // batch_size :])))
    print("  validation accuracy: \t\t\t{:.2f} %".format(
        np.mean(val_accuracy[-len(X_val) // batch_size :]) * 100))

New learning rate: 0.01
Epoch 1 of 100 took 11.560s
  training loss (in-iteration): 	1.411481
  validation accuracy: 			62.23 %
New learning rate: 0.01
Epoch 2 of 100 took 11.079s
  training loss (in-iteration): 	0.943350
  validation accuracy: 			54.80 %
New learning rate: 0.01
Epoch 3 of 100 took 11.111s
  training loss (in-iteration): 	0.771843
  validation accuracy: 			75.34 %
New learning rate: 0.009000000000000001
Epoch 4 of 100 took 11.138s
  training loss (in-iteration): 	0.654050
  validation accuracy: 			76.00 %
New learning rate: 0.009000000000000001
Epoch 5 of 100 took 11.120s
  training loss (in-iteration): 	0.584165
  validation accuracy: 			74.93 %
New learning rate: 0.009000000000000001
Epoch 6 of 100 took 11.157s
  training loss (in-iteration): 	0.516268
  validation accuracy: 			79.64 %
New learning rate: 0.008100000000000001
Epoch 7 of 100 took 11.125s
  training loss (in-iteration): 	0.453549
  validation accuracy: 			82.41 %
New learning rate: 0.008100000000000001
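
The validation loop above averages per-batch accuracies. Whenever the set size is not a multiple of the batch size, the smaller last batch gets the same weight as a full one, which skews the average slightly. A minimal sketch of an exact count is shown below; `exact_accuracy` and `predict_fn` are hypothetical names, and the toy "model" simply predicts class 0 for everything.

```python
import numpy as np


def exact_accuracy(predict_fn, X, y, batch_size=64):
    """Fraction of correctly classified samples, counted over the whole set."""
    correct = 0
    for start in range(0, len(X), batch_size):
        y_pred = predict_fn(X[start:start + batch_size])
        correct += int(np.sum(y_pred == y[start:start + batch_size]))
    return correct / len(X)


# Toy check with a "model" that always predicts class 0.
X_toy = np.zeros((10, 1))
y_toy = np.array([0] * 8 + [1] * 2)
always_zero = lambda X_batch: np.zeros(len(X_batch), dtype=int)

exact = exact_accuracy(always_zero, X_toy, y_toy, batch_size=4)

# Mean of per-batch accuracies for comparison: batches of 4, 4 and 2 samples.
per_batch = [np.mean(always_zero(X_toy[s:s + 4]) == y_toy[s:s + 4]) for s in range(0, 10, 4)]
print(exact, np.mean(per_batch))  # 0.8 vs. roughly 0.667
```

With the notebook's model, `predict_fn` could wrap the same `model(...)` call and `logits.max(1)[1]` step used in the loop above.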


In [9]:
model.train(False) # disable dropout / use averages for batch_norm
test_batch_acc = []
for X_batch, y_batch in iterate_minibatches(X_test, y_test, 500):
    logits = model(Variable(torch.FloatTensor(X_batch)).cuda())
    y_pred = logits.max(1)[1].cpu().data.numpy()
    test_batch_acc.append(np.mean(y_batch == y_pred))

test_accuracy = np.mean(test_batch_acc)
    
print("Final results:")
print("  test accuracy:\t\t{:.2f} %".format(
    test_accuracy * 100))

if test_accuracy * 100 > 95:
    print("Double-check, then consider applying for NIPS'17. SRSly.")
elif test_accuracy * 100 > 90:
    print("U'r freakin' amazin'!")
elif test_accuracy * 100 > 80:
    print("Achievement unlocked: 110lvl Warlock!")
elif test_accuracy * 100 > 70:
    print("Achievement unlocked: 80lvl Warlock!")
elif test_accuracy * 100 > 60:
    print("Achievement unlocked: 70lvl Warlock!")
elif test_accuracy * 100 > 50:
    print("Achievement unlocked: 60lvl Warlock!")
else:
    print("We need more magic! Follow instructions below")

Final results:
  test accuracy:		87.25 %
Achievement unlocked: 110lvl Warlock!



# Report

All creative approaches are highly welcome, but at the very least it would be great to mention
* the idea;
* brief history of tweaks and improvements;
* what is the final architecture and why?
* what is the training method and, again, why?
* Any regularizations and other techniques applied and their effects;


There is no need to write strict mathematical proofs (unless you want to).
 * "I tried this, this and this, and the second one turned out to be better. And I just didn't like the name of that one" - OK, but can be better
 * "I have analyzed these and these articles|sources|blog posts, tried that and that to adapt them to my problem and the conclusions are such and such" - the ideal one
 * "I took that code from that demo without understanding it, but I'll never confess that and instead I'll make up some pseudoscientific explanation" - __not_ok__

#### Table 1: model architectures

|ALL-CONV-FC|ALL-CONV-AVGPOOL|POOL-CONV-FC|
|---|---|---|
|conv32_5x5(stride=2, pad=2)|conv96_3x3(pad=1)|conv64_5x5(pad=2)|
|conv64_3x3|conv96_3x3(pad=1)|conv64_3x3(pad=1)|
|conv64_3x3|conv96_3x3(stride=2, pad=1)|max_pool_3x3_(stride=2, pad=1)|
|conv128_3x3(stride=2, pad=1)|conv192_3x3(pad=1)|conv128_3x3(pad=1)|
|conv256_3x3|conv192_3x3(pad=1)|conv128_3x3(pad=1)|
|conv256_3x3|conv192_3x3(stride=2, pad=1)|max_pool_3x3_(stride=2, pad=1)|
|conv512_3x3(stride=2, pad=1)|conv192_3x3(pad=1)|conv256_3x3(pad=1)|
|fc_512|conv192_1x1(pad=1)|conv256_3x3(pad=1)|
|fc_512|conv10_1x1(pad=1)|max_pool_3x3_(stride=2, pad=1)|
|fc_10|global_avg_pool|fc_2048|
|-|-|fc_1024|
|-|-|fc_10|

#### Table 2: Optimizer vs test accuracy for ALL-CONV-AVGPOOL model

|Optimizer|ALL-CONV-AVGPOOL test accuracy|
|---|---|
|**SGD(momentum=0.9, lr=0.1)**|**78.67 %**|
|Adam(lr=0.1)|73.44 %|

#### Table 3: final results

|Network|CIFAR-10 test accuracy, %|
|---|---|
|RESNET-18|64.32|
|ALL-CONV-FC|78.65|
|ALL-CONV-AVGPOOL|78.67|
|Leaky POOL-CONV-FC|86.28|
|**Leaky POOL-CONV-FC-Dropout**|**87.25**|
|Leaky POOL-CONV-Dropout-FC-Dropout|86.64|

All models consist of convolutional layers with *batch normalization* and some sort of *rectifier activation*, with occasional *pooling* between the conv layers. The convolutional part of the network is followed by either *global average pooling* or a *dense network*. In the dense case, batch normalization and dropout are also used between some of the fully connected layers. See **table 1** for reference.

We also tried a resnet-18 model, but its final accuracy was below that of the other architectures, mainly because the resnet model is designed for 224x224 inputs; to make it usable for 32x32 inputs we had to remove the final global pooling.

All models were trained with the following learning rate schedule: every third epoch the learning rate was multiplied by a factor of 0.9. Initial learning rates were selected from the set [0.1, 0.01, 0.001] to achieve the fastest convergence on a per-model basis. Several different optimizers were tested, but our tests showed that SGD with a momentum of 0.9 was the least prone to overfitting, so it was used for all models. See **table 2**.

Batch normalization was used to improve the stability of training as well as convergence speed. It is placed before the activation of each layer except the last one, which has no activation. The layer preceding each batch normalization has its bias turned off to reduce the number of correlated parameters.
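
The reason the bias can be dropped is that batch normalization subtracts the per-feature batch mean, so any constant added by the preceding layer cancels out. A small numpy sketch of this, using a simplified training-mode normalization without the learned scale and shift (the function `batch_norm` here is a stand-in written for this example, not the PyTorch layer):

```python
import numpy as np


def batch_norm(x, eps=1e-5):
    """Training-mode batch normalization over the batch axis, without learned scale and shift."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


rng = np.random.default_rng(0)
pre_activation = rng.normal(size=(64, 8))  # stand-in for a layer output: (batch, features)
bias = rng.normal(size=(1, 8))             # a per-feature bias the layer could have added

# The bias shifts the batch mean by the same amount, so it cancels out.
same = np.allclose(batch_norm(pre_activation), batch_norm(pre_activation + bias))
print(same)  # True
```

The learned shift inside the batch normalization layer already plays the role of a bias, which is why keeping both would add parameters without adding expressiveness.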

ReLU activation was selected for every model except POOL-CONV-FC, where Leaky ReLU was used.

We tried the all-convolutional architecture from [Striving for Simplicity: The All Convolutional Net, Springenberg et al., 2014](https://arxiv.org/abs/1412.6806), see All-CNN-C, All-CNN-B from the paper. **We were unable to reproduce the results from the paper, as in our case using max pooling instead of convolutional layers with strides greatly increased accuracy.** Having established that, we decided to use max pooling as we continued our search for the single best model, and so began the experiments on the POOL-CONV-FC architecture.

We used fully connected layers instead of global pooling after the convolutions because that allowed us to use dropout between them. We also tried dropout between convolutional layers, but only dropout between the fully connected layers proved useful in our tests, decreasing the error rate by about 1% absolute.
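
For reference, the behaviour of dropout in training versus evaluation mode can be sketched in a few lines of numpy. This mirrors the documented behaviour of `nn.Dropout` (zero each element with probability `p` during training and rescale the survivors by `1 / (1 - p)`; do nothing in evaluation mode), but the function below is an illustrative stand-in, not the PyTorch implementation.

```python
import numpy as np


def dropout(x, p=0.5, training=True, rng=None):
    """Inverted dropout: zero each element with probability p, rescale the rest by 1 / (1 - p)."""
    if not training or p == 0.0:
        return x  # evaluation mode: identity
    rng = np.random.default_rng() if rng is None else rng
    keep_mask = rng.random(x.shape) >= p
    return x * keep_mask / (1.0 - p)


activations = np.ones((4, 5))
train_out = dropout(activations, p=0.5, training=True, rng=np.random.default_rng(0))
eval_out = dropout(activations, p=0.5, training=False)
```

The rescaling keeps the expected activation the same in both modes, which is why switching the model with `model.train(False)` before validation needs no further correction.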

Our single best model, Leaky POOL-CONV-FC-Dropout, consists of 3 blocks, each a pair of convolutions followed by a max pooling layer. The convolutional part is followed by 3 fully connected layers with dropout between them. Both batch normalization and dropout proved useful in this model: without them the final accuracy decreases slightly. For the results, see **table 3**.
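
The 4096 input features of the first fully connected layer follow from the layer shapes in the POOL-CONV-FC column of table 1. The short trace below recomputes them; `out_size` is a helper written for this example and assumes square inputs, dilation 1 and floor rounding.

```python
def out_size(size, kernel, stride, padding):
    """Output side length of a square conv or pooling layer (dilation 1, floor rounding)."""
    return (size + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding) for the spatial layers of POOL-CONV-FC
layers = [
    ("conv64_5x5", 5, 1, 2),
    ("conv64_3x3", 3, 1, 1),
    ("max_pool_3x3", 3, 2, 1),
    ("conv128_3x3", 3, 1, 1),
    ("conv128_3x3", 3, 1, 1),
    ("max_pool_3x3", 3, 2, 1),
    ("conv256_3x3", 3, 1, 1),
    ("conv256_3x3", 3, 1, 1),
    ("max_pool_3x3", 3, 2, 1),
]

size = 32  # CIFAR-10 images are 32x32
for name, kernel, stride, padding in layers:
    size = out_size(size, kernel, stride, padding)

flat_features = 256 * size * size  # 256 channels after the last conv block
print(size, flat_features)  # 4 4096
```

Each padded convolution keeps the spatial size, and each pooling layer halves it (32, 16, 8, 4), so the flattened tensor has 256 x 4 x 4 values.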

## Appendix: torch.nn.Module logs and results

```
78.65 %
lr = 0.01, 0.001
ALL-CONV-FC
Sequential(
  (0): Conv2d(3, 32, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), bias=False)
  (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True)
  (2): ReLU()
  (3): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), bias=False)
  (4): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
  (5): ReLU()
  (6): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), bias=False)
  (7): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
  (8): ReLU()
  (9): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
  (10): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
  (11): ReLU()
  (12): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), bias=False)
  (13): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
  (14): ReLU()
  (15): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), bias=False)
  (16): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
  (17): ReLU()
  (18): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
  (19): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
  (20): ReLU()
  (21): Flatten(
  )
  (22): Linear(in_features=512, out_features=512, bias=False)
  (23): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True)
  (24): ReLU()
  (25): Linear(in_features=512, out_features=512, bias=False)
  (26): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True)
  (27): ReLU()
  (28): Linear(in_features=512, out_features=10, bias=True)
)


SGD(momentum=0.9, lr=0.1): 78.67 %
Adam(lr=0.1): ~73.44 %
ALL-CONV-AVGPOOL
Sequential(
  (0): Conv2d(3, 96, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (1): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True)
  (2): ReLU()
  (3): Conv2d(96, 96, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (4): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True)
  (5): ReLU()
  (6): Conv2d(96, 96, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
  (7): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True)
  (8): ReLU()
  (9): Conv2d(96, 192, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (10): BatchNorm2d(192, eps=1e-05, momentum=0.1, affine=True)
  (11): ReLU()
  (12): Conv2d(192, 192, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (13): BatchNorm2d(192, eps=1e-05, momentum=0.1, affine=True)
  (14): ReLU()
  (15): Conv2d(192, 192, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
  (16): BatchNorm2d(192, eps=1e-05, momentum=0.1, affine=True)
  (17): ReLU()
  (18): Conv2d(192, 192, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (19): BatchNorm2d(192, eps=1e-05, momentum=0.1, affine=True)
  (20): ReLU()
  (21): Conv2d(192, 192, kernel_size=(1, 1), stride=(1, 1), padding=(1, 1), bias=False)
  (22): BatchNorm2d(192, eps=1e-05, momentum=0.1, affine=True)
  (23): ReLU()
  (24): Conv2d(192, 10, kernel_size=(1, 1), stride=(1, 1), padding=(1, 1), bias=False)
  (25): AvgPool2d(kernel_size=(6, 6), stride=(6, 6), padding=0, ceil_mode=False, count_include_pad=True)
  (26): Flatten(
  )
)

lr=0.01 RESNET-18: 64.32 %

SGD(lr=0.01, momentum=0.9): 86.28 %
CONV+POOL_FC
+ Dropout: 87.08 %
+ additional dropout: 86.64 %
Sequential(
  (0): Conv2d(3, 64, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), bias=False)
  (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
  (2): LeakyReLU(0.01)
  (3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (4): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
  (5): LeakyReLU(0.01)
  (6): MaxPool2d(kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), dilation=(1, 1), ceil_mode=False)
  (7): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (8): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
  (9): LeakyReLU(0.01)
  (10): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (11): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
  (12): LeakyReLU(0.01)
  (13): MaxPool2d(kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), dilation=(1, 1), ceil_mode=False)
  (14): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (15): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
  (16): LeakyReLU(0.01)
  (17): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (18): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
  (19): LeakyReLU(0.01)
  (20): MaxPool2d(kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), dilation=(1, 1), ceil_mode=False)
  (21): Flatten(
  )
  (22): Linear(in_features=4096, out_features=2048, bias=False)
  (23): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True)
  (24): LeakyReLU(0.01)
  (25): Linear(in_features=2048, out_features=1024, bias=False)
  (26): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True)
  (27): LeakyReLU(0.01)
  (28): Linear(in_features=1024, out_features=10, bias=True)
)
```