### Model with updated loss function

By our "modern" standards, the 1989-era model has a slightly strange output layer for a multi-class classification problem:

* there is a tanh activation on the output units, which maps each of the 10 outputs to the range -1 to 1,
* and then there is a mean squared error loss function on those outputs.

We will "update" the model by removing the tanh activation on the output units, and then using the typical (for a classification problem) [cross entropy loss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss).

When I first ran this modified model, the loss seemed to "blow up", which suggests the learning rate was too high, so I reduced it from 0.03 to 0.01.
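
To make the output-layer change concrete, here is a minimal sketch comparing the two loss formulations on made-up data. The random `logits`, the `labels`, and the ±1 target encoding are assumptions for illustration (the encoding is inferred from the "9/10 targets are -1" comment in the model code); this is not the notebook's actual data.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical raw scores (logits) for a batch of 4 examples and 10 classes.
logits = torch.randn(4, 10)
labels = torch.tensor([3, 0, 7, 9])

# Assumed 1989-style targets: +1 for the correct class, -1 everywhere else.
targets_pm1 = -torch.ones(4, 10)
targets_pm1[torch.arange(4), labels] = 1.0

# Original formulation: squash the outputs with tanh, then mean squared error.
mse_loss = F.mse_loss(torch.tanh(logits), targets_pm1)

# Updated formulation: cross entropy takes raw logits and class indices,
# so the per-class targets are converted with argmax.
class_idx = targets_pm1.argmax(dim=1)
ce_loss = F.cross_entropy(logits, class_idx)

# cross_entropy is log_softmax followed by the negative log likelihood.
ce_manual = F.nll_loss(F.log_softmax(logits, dim=1), class_idx)

print(mse_loss.item(), ce_loss.item(), ce_manual.item())
```

Because `F.cross_entropy` applies the softmax internally, the model should return raw scores, which is why the final tanh is removed in the class below.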

In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
torch.set_num_threads(2) # for performance

In [None]:

class ModernLossNet(nn.Module):

    def __init__(self):
        super().__init__()

        # initialization as described in the paper to my best ability, but it doesn't look right...
        winit = lambda fan_in, *shape: (torch.rand(*shape) - 0.5) * 2 * 2.4 / fan_in**0.5
        macs = 0 # keep track of MACs (multiply accumulates)
        acts = 0 # keep track of number of activations

        # H1 layer parameters and their initialization
        self.H1w = nn.Parameter(winit(5*5*1, 12, 1, 5, 5))
        self.H1b = nn.Parameter(torch.zeros(12, 8, 8)) # presumably init to zero for biases
        assert self.H1w.nelement() + self.H1b.nelement() == 1068
        macs += (5*5*1) * (8*8) * 12
        acts += (8*8) * 12

        # H2 layer parameters and their initialization
        """
        H2 neurons all connect to only 8 of the 12 input planes, with an unspecified pattern
        I am going to assume the most sensible block pattern where 4 planes at a time connect
        to differently overlapping groups of 8/12 input planes. We will implement this with 3
        separate convolutions that we concatenate the results of.
        """
        self.H2w = nn.Parameter(winit(5*5*8, 12, 8, 5, 5))
        self.H2b = nn.Parameter(torch.zeros(12, 4, 4)) # presumably init to zero for biases
        assert self.H2w.nelement() + self.H2b.nelement() == 2592
        macs += (5*5*8) * (4*4) * 12
        acts += (4*4) * 12

        # H3 is a fully connected layer
        self.H3w = nn.Parameter(winit(4*4*12, 4*4*12, 30))
        self.H3b = nn.Parameter(torch.zeros(30))
        assert self.H3w.nelement() + self.H3b.nelement() == 5790
        macs += (4*4*12) * 30
        acts += 30

        # output layer is also fully connected layer
        self.outw = nn.Parameter(winit(30, 30, 10))
        self.outb = nn.Parameter(-torch.ones(10)) # kept from the original model, where 9/10 targets are -1; under softmax a constant shift of all biases has no effect
        assert self.outw.nelement() + self.outb.nelement() == 310
        macs += 30 * 10
        acts += 10

        self.macs = macs
        self.acts = acts

    def forward(self, x):

        # x has shape (1, 1, 16, 16)
        x = F.pad(x, (2, 2, 2, 2), 'constant', -1.0) # pad by two using constant -1 for background
        x = F.conv2d(x, self.H1w, stride=2) + self.H1b
        x = torch.tanh(x)

        # x is now shape (1, 12, 8, 8)
        x = F.pad(x, (2, 2, 2, 2), 'constant', -1.0) # pad by two using constant -1 for background
        slice1 = F.conv2d(x[:, 0:8], self.H2w[0:4], stride=2) # first 4 planes look at first 8 input planes
        slice2 = F.conv2d(x[:, 4:12], self.H2w[4:8], stride=2) # next 4 planes look at last 8 input planes
        slice3 = F.conv2d(torch.cat((x[:, 0:4], x[:, 8:12]), dim=1), self.H2w[8:12], stride=2) # last 4 planes are cross
        x = torch.cat((slice1, slice2, slice3), dim=1) + self.H2b
        x = torch.tanh(x)

        # x is now shape (1, 12, 4, 4)
        x = x.flatten(start_dim=1) # (1, 12*4*4)
        x = x @ self.H3w + self.H3b
        x = torch.tanh(x)

        # x is now shape (1, 30)
        x = x @ self.outw + self.outb
        # Note: we deleted the tanh activation here!

        # x is finally shape (1, 10)
        return x
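
The H2 connectivity in `forward` can be checked in isolation. The sketch below uses random stand-in tensors (`x` and `w` are hypothetical, shaped like the H1 activations and `H2w`) and only confirms that the three 8-plane slices each produce 4 output planes, which concatenate to the expected `(1, 12, 4, 4)`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

x = torch.randn(1, 12, 8, 8)   # stand-in for the H1 activations
w = torch.randn(12, 8, 5, 5)   # same shape as H2w in the model

# Pad by two on each side with the -1 background value -> (1, 12, 12, 12)
x = F.pad(x, (2, 2, 2, 2), 'constant', -1.0)

# Three overlapping groups of 8 out of the 12 input planes
groups = [
    x[:, 0:8],                                  # planes 0..7
    x[:, 4:12],                                 # planes 4..11
    torch.cat((x[:, 0:4], x[:, 8:12]), dim=1),  # planes 0..3 and 8..11
]

# Each group is convolved with its own block of 4 filters
outs = [F.conv2d(g, w[4 * i:4 * i + 4], stride=2) for i, g in enumerate(groups)]
h2 = torch.cat(outs, dim=1)

print([tuple(o.shape) for o in outs])
print(tuple(h2.shape))
```

With a 12x12 padded input, a 5x5 kernel and stride 2, each spatial dimension comes out as (12 - 5) / 2 + 1 = 4 (rounding down), matching the `(4*4) * 12` activation count used in `__init__`.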

In [None]:

# Note: with the original learning rate, the SGD did not seem to learn well
# so I changed it to a smaller learning rate
learning_rate = 0.01

# init rng
torch.manual_seed(1337)
np.random.seed(1337)
torch.use_deterministic_algorithms(True)

# init a model
model = ModernLossNet()
print("model stats:")
print("# params:      ", sum(p.numel() for p in model.parameters())) # in paper total is 9,760
print("# MACs:        ", model.macs)
print("# activations: ", model.acts)

# init data
Xtr, Ytr = torch.load('train1989.pt')
Xte, Yte = torch.load('test1989.pt')

# init optimizer
optimizer = optim.SGD(model.parameters(), lr=learning_rate)

def eval_split(split):
    # eval the full train/test set, batched implementation for efficiency
    model.eval()
    X, Y = (Xtr, Ytr) if split == 'train' else (Xte, Yte)
    with torch.no_grad(): # gradients are not needed for evaluation
        Yhat = model(X)
        # Note: the updated loss function! Y.argmax(dim=1) converts the per-class targets to class indices
        loss = F.cross_entropy(Yhat, Y.argmax(dim=1))
        err = torch.mean((Y.argmax(dim=1) != Yhat.argmax(dim=1)).float())
    print(f"eval: split {split:5s}. loss {loss.item():e}. error {err.item()*100:.2f}%. misses: {int(err.item()*Y.size(0))}")

# train
for pass_num in range(23):

    # perform one epoch of training
    model.train()
    for step_num in range(Xtr.size(0)):

        # fetch a single example into a batch of 1
        x, y = Xtr[[step_num]], Ytr[[step_num]]

        # forward the model and the loss
        yhat = model(x)
        # Note: the updated loss function!
        loss = F.cross_entropy(yhat, y.argmax(dim=1))

        # calculate the gradient and update the parameters
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()

    # after each epoch, evaluate the train and test error / metrics
    print(pass_num + 1)
    eval_split('train')
    eval_split('test')

# save final model to file
torch.save(model.state_dict(), 'crossentropy_model.pt')

model stats:
# params:       9760
# MACs:         63660
# activations:  1000
1
eval: split train. loss 2.547408e-01. error 7.57%. misses: 552
eval: split test . loss 2.883549e-01. error 8.67%. misses: 174
2
eval: split train. loss 1.863126e-01. error 5.76%. misses: 419
eval: split test . loss 2.397452e-01. error 7.42%. misses: 149
3
eval: split train. loss 1.585573e-01. error 4.58%. misses: 333
eval: split test . loss 2.222673e-01. error 6.88%. misses: 138
4
eval: split train. loss 1.367183e-01. error 4.14%. misses: 301
eval: split test . loss 2.259786e-01. error 7.22%. misses: 144
5
eval: split train. loss 1.080973e-01. error 3.46%. misses: 252
eval: split test . loss 2.042281e-01. error 6.08%. misses: 122
6
eval: split train. loss 9.097695e-02. error 3.07%. misses: 223
eval: split test . loss 2.100919e-01. error 6.38%. misses: 128
7
eval: split train. loss 6.656891e-02. error 2.02%. misses: 146
eval: split test . loss 1.925825e-01. error 5.78%. misses: 115
8
eval: split train. loss 5

The test performance is similar to before, maybe a little worse. Since I reduced the learning rate, I may need to increase the number of passes to compensate.

But we are now seeing zero error on the training set, so we may be overfitting.

Fortunately, we have some "tricks" for improving the performance of deep neural networks! We can try:

* a more modern optimizer (i.e. improve over the stochastic gradient descent with fixed learning rate)
* data augmentation
* a regularization technique, like dropout

and see if they improve our performance, while still keeping the basic model - number of layers, number of units, size of each convolutional filter - the same.
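
As a rough preview of what those three tricks look like in PyTorch, here is a minimal sketch. Everything in it is an assumption for illustration: `toy_model` is a small stand-in rather than the network above, `random_shift` is a hypothetical helper, and the choice of AdamW, the learning rate, the dropout probability and the shift size are guesses, not tuned values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(0)

# Stand-in model so the snippet runs by itself; NOT the network above.
toy_model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(16 * 16, 30),
    nn.Tanh(),
    nn.Dropout(p=0.25),  # regularization: randomly zero activations during training
    nn.Linear(30, 10),
)

# A more modern optimizer than SGD with a fixed learning rate (assumed settings).
optimizer = optim.AdamW(toy_model.parameters(), lr=3e-4)

def random_shift(x, max_shift=1):
    """Data augmentation: shift the image by up to max_shift pixels,
    filling the exposed border with the -1 background value."""
    dx, dy = torch.randint(-max_shift, max_shift + 1, (2,)).tolist()
    padded = F.pad(x, (max_shift,) * 4, 'constant', -1.0)
    h, w = x.shape[-2:]
    top, left = max_shift + dy, max_shift + dx
    return padded[..., top:top + h, left:left + w]

# One training step on a made-up example
x = torch.rand(1, 1, 16, 16) * 2 - 1  # fake image with values in [-1, 1)
y = torch.tensor([3])                 # fake class index

toy_model.train()  # enables dropout
loss = F.cross_entropy(toy_model(random_shift(x)), y)
optimizer.zero_grad(set_to_none=True)
loss.backward()
optimizer.step()
```

Note that dropout is only active in training mode, so calling `model.eval()` before evaluation (as `eval_split` already does) matters once a dropout layer is present.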