Derived from reading through ["What is `torch.nn` _really_?"](https://pytorch.org/tutorials/beginner/nn_tutorial.html).
* I find this somewhat ironic, because Jeremy usually likes to teach top-down, from a high level
* This tutorial builds everything from the bottom up

---

In [2]:
from pathlib import Path
import requests

DATA_PATH = Path("data")
PATH = DATA_PATH / "mnist"

PATH.mkdir(parents=True, exist_ok=True)

In [3]:
URL = "http://deeplearning.net/data/mnist/"

In [4]:
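def make_filestructure(filename):
    # download the file once; skip the request if it is already on disk
    if not (PATH / filename).exists():
        content = requests.get(URL + filename).content
        (PATH / filename).write_bytes(content) # write_bytes opens, writes, and closes the file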

In [5]:
def load_train_validation_sets(filename):
    import pickle
    import gzip
    
    with gzip.open((PATH / filename).as_posix(), "rb") as f:
        ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(f, encoding="latin-1")
        return ((x_train, y_train), (x_valid, y_valid))

Now let's actually bring in the data

In [6]:
FILENAME = "mnist.pkl.gz"

make_filestructure(FILENAME)

Just a sidenote - this is __totally__ how Jeremy Howard codes:
1. Very brief, but super-effective, blocks of code
2. No comments, but the code is so _direct_ that knowing how to code in Python is all you need to follow it

In [7]:
xy_train, xy_valid = load_train_validation_sets(FILENAME)

In [8]:
def show_one_digit(images_flat):
    from matplotlib import pyplot
    import numpy as np
    
    random_index = np.random.randint(len(images_flat), size=1)[0]
    # show our random image in grayscale
    pyplot.imshow(images_flat[random_index].reshape((28, 28)), cmap="gray")
    print("images.shape =", images_flat.shape)

In [9]:
# pass in the training set, but it doesn't really matter which one we look at
show_one_digit(xy_train[0])

images.shape = (50000, 784)


# Torch-ify me cap'n!

In [10]:
import torch

def convert_np_to_torch(images_as_numpy):
    '''
    Args:
        images_as_numpy: all data as np.array's, concatenated together.
            I.e. (x_train, y_train, x_val, y_val[, x_test, y_test])
    '''
    return map(torch.tensor, (*images_as_numpy,))

x_train, y_train, x_valid, y_valid = convert_np_to_torch((*xy_train, *xy_valid))

In [11]:
x_train, y_train

(tensor([[0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]]), tensor([5, 0, 4,  ..., 8, 4, 8]))

In [12]:
n, c = x_train.shape # number of samples, columns (784 = 28 * 28 flattened pixels)
print(n, c, 'from', x_train.shape)

50000 784 from torch.Size([50000, 784])


In [13]:
print(y_train.min(), y_train.max())

tensor(0) tensor(9)


In [14]:
def mnist_weights_and_biases():
    '''Return (weights, bias) constructed from a Normal Distribution, and "Xavier-initialized"'''
    import math
    
    weights = torch.randn(784, 10) / math.sqrt(784)
    weights.requires_grad_() # set requires_grad = True post-hoc
    
    bias = torch.zeros(10, requires_grad=True)
    
    return weights, bias

Because PyTorch calculates gradients automatically (autograd), plain Python functions are all we need, so we're going to write two
  functions for use
1. A "log softmax" function, which turns raw scores into a (log) probability distribution
2. A "model" function, because anything that is invocable can be a PyTorch model
  * And the gradients will still be calculated!
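
To convince myself of the "any callable can be a model" point, here is a minimal sketch. The sizes (4 samples, 3 features, 2 classes), the name `toy_model`, and the squared-output loss are made up for illustration; they are not from the tutorial.

```python
import torch

# Hypothetical toy sizes: 4 samples, 3 features, 2 classes.
torch.manual_seed(0)
x = torch.randn(4, 3)
w = torch.randn(3, 2, requires_grad=True)
b = torch.zeros(2, requires_grad=True)

def toy_model(batch):
    # An ordinary Python function; autograd records the ops as they run.
    return batch @ w + b

# Any scalar works as a stand-in loss for this check.
loss = toy_model(x).pow(2).mean()
loss.backward()

# The gradients have the same shapes as the parameters they belong to.
print(w.grad.shape, b.grad.shape)
```

No `nn.Module` anywhere, and `w.grad` / `b.grad` are still populated after `backward()`.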

In [15]:
def simple_linear_model():
    '''To encapsulate the shenanigans that are about to ensue, I will work in a function'''
    
    def log_softmax(x):
        return x - x.exp().sum(-1).log().unsqueeze(-1)

    def model(mini_batch, weights, bias):
        return log_softmax(mini_batch @ weights + bias)
    
    w, b = mnist_weights_and_biases()
    
    bs = 64 # batch size
    
    x_b = x_train[0:bs] # a mini-batch from our inputs
    predictions = model(x_b, w, b)
    
    print('Prediction 1:', predictions[0])
    print('predictions.shape', predictions.shape)

In [16]:
simple_linear_model()

Prediction 1: tensor([-2.3130, -2.4403, -2.1679, -2.2292, -2.1659, -2.9368, -2.3876, -2.1868,
        -2.2024, -2.2117], grad_fn=<SelectBackward>)
predictions.shape torch.Size([64, 10])


Again, for the sake of keeping the global namespace relatively unpolluted (until we get to the `torch.nn` stuff)
  I'm going to simply rewrite `simple_linear_model`.
* With the caveat of pulling up `log_softmax(x)` for its general utility

In [17]:
def _feeling_out_torch_sum():
    # give me [0, 120) and arrange them as 4 blocks of 5x6 matrices
    b = torch.arange(4 * 5 * 6).view(4, 5, 6)
    
    a = torch.arange(10 * 2).view(2, 10)
    print(a)
    print(a.sum(0))

In [18]:
_feeling_out_torch_sum()

tensor([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
        [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]])
tensor([10, 12, 14, 16, 18, 20, 22, 24, 26, 28])


In [19]:
# ?torch.exp

In [20]:
def _feeling_out_torch_unsqueeze():
#     ?torch.unsqueeze
    small_range = torch.arange(1., 3., .5)
    print(small_range)
    
    _sum = small_range.sum()
    print(_sum)
    
    print(_sum.unsqueeze(-1))
    print(_sum.log())
    print(_sum.log().unsqueeze(-1))

In [21]:
_feeling_out_torch_unsqueeze()

tensor([1.0000, 1.5000, 2.0000, 2.5000])
tensor(7.)
tensor([7.])
tensor(1.9459)
tensor([1.9459])


(Log) Softmax, here, is the difference of
1. the input, and
2. the log of the sum along the last dimension of the exponential of the input

The log gets boxed (`unsqueeze(-1)`) before it is subtracted from the input.
* Why index `-1` rather than `0` for a scalar value? I don't know, but maybe we can find out

In [22]:
torch.Tensor(1).unsqueeze(-1), torch.Tensor(1).unsqueeze(0)

(tensor([[0.]]), tensor([[0.]]))

No difference when there's one element, but how about with multiple?

In [23]:
torch.Tensor(3).unsqueeze(-1), torch.Tensor(3).unsqueeze(0)

(tensor([[ 0.0000e+00],
         [-1.0842e-19],
         [ 3.6898e+05]]), tensor([[ 0.0000e+00, -1.0842e-19,  2.9062e+05]]))

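There it is. `unsqueeze(-1)` boxes each element into its own row (shape `(3, 1)`), while `unsqueeze(0)` wraps the whole vector in a single row (shape `(1, 3)`).
* `torch.Tensor(3)` allocates 3 _uninitialized_ floats, which is why the printed values are garbage; only the shapes matter here
* In `log_softmax`, `sum(-1)` collapses each row of scores to a single number, so this `unsqueeze` makes sure that
  you have individually boxed elements (one per row) that broadcast against the original rows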
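
A quick shape check on a realistically sized input makes this concrete. This is a sketch: `scores` is a random stand-in for `mini_batch @ weights + bias`, and the comparison against `F.log_softmax` is my own sanity check, not something the tutorial does at this point.

```python
import torch
import torch.nn.functional as F

def log_softmax(x):
    return x - x.exp().sum(-1).log().unsqueeze(-1)

torch.manual_seed(0)
scores = torch.randn(64, 10)           # stand-in for a batch of raw model outputs

row_sums = scores.exp().sum(-1)        # sum over the last dim: (64, 10) -> (64,)
boxed = row_sums.log().unsqueeze(-1)   # (64,) -> (64, 1), one value per row
print(row_sums.shape, boxed.shape)

out = log_softmax(scores)              # (64, 10) - (64, 1) broadcasts row by row
matches = torch.allclose(out, F.log_softmax(scores, dim=-1), atol=1e-5)
print(out.shape, matches)
```

Without the `unsqueeze(-1)`, a `(64, 10)` tensor minus a `(64,)` tensor would fail to broadcast, since the trailing dimensions (10 and 64) don't line up.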

---
Another token Jeremy Howard-ism:
* Great coding practices, but he doesn't explain them along the way.
* He wants you to trust his implementation, and know enough to dig in yourself (as I have done above)

---

In [24]:
def log_softmax(x):
    return x - x.exp().sum(-1).log().unsqueeze(-1)

In [25]:
def neg_log_likelihood(x, target):
    return -x[range(target.shape[0]), target].mean()

In [26]:
def accuracy(out, target_batch):
    preds = torch.argmax(out, dim=1) # this is weird to me
    return (preds == target_batch).float().mean() # average accuracy of the predictions

In [27]:
# ?torch.argmax

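It gets the index of the maximum value in each row of the model's output...
* Why is that how we determine our prediction? Each row holds one score (a log-probability) per digit class, so the index of the largest one is the digit the model considers most likely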
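
A tiny hand-made example helps here. The probabilities and targets below are invented for illustration; the two computation lines are the same as in `accuracy` above.

```python
import torch

# Hypothetical probabilities for 2 samples over 3 classes, stored as logs.
log_probs = torch.log(torch.tensor([[0.1, 0.7, 0.2],
                                    [0.6, 0.3, 0.1]]))
targets = torch.tensor([1, 2])

preds = torch.argmax(log_probs, dim=1)           # most probable class per row
acc = (preds == targets).float().mean()          # fraction of rows predicted correctly
print(preds, acc)
```

The first row's largest entry sits at index 1 (correct), the second row's at index 0 (the target was 2), so half the batch is right.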

In [28]:
def simple_linear_model():
    '''To encapsulate the shenanigans that are about to ensue, I will work in a function'''

    w, b = mnist_weights_and_biases()
    
    def model(mini_batch):
        return log_softmax(mini_batch @ w + b)
    
    
    bs = 64 # batch size
    
    x_b = x_train[0:bs] # a mini-batch from our inputs
    predictions = model(x_b)
    
    print('Prediction 1:', predictions[0])
    print('predictions.shape', predictions.shape)
    
    loss_func = neg_log_likelihood
    
    y_b = y_train[0:bs]
    
    print(loss_func(predictions, y_b))
    
    lr = 0.5
    epochs = 2 # how many laps through the dataset

    print(f"training our model {epochs} epochs with learning rate of {lr}")
    for epoch in range(epochs):
        # this 'i' is sometimes called a "period" when iterating through the dataset
        for i in range((n - 1) // bs + 1): # ceiling of n / bs, so the last partial batch is included
            start_i = i * bs
            end_i = start_i + bs # now we have a range [i*bs, (i+1)*bs)
            
            x_mini_batch = x_train[start_i: end_i]
            y_mini_batch = y_train[start_i: end_i]
            
            prediction = model(x_mini_batch) # (bs, 10) tensor of log-probabilities, one row per image
            loss = loss_func(prediction, y_mini_batch)
            
            if i % 100 == 0: # do some logging every 100th batch
                print(f' Loss of {loss}, with batch accuracy of {accuracy(prediction, y_mini_batch)}')
            
            loss.backward()
            with torch.no_grad():
                w -= w.grad * lr
                b -= b.grad * lr
                w.grad.zero_()
                b.grad.zero_()
    
    print(loss_func(model(x_b), y_b), accuracy(model(x_b), y_b))
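
On the `(n - 1) // bs + 1` in the loop above: as far as I can tell it is just integer ceiling division. A small sketch to check that reading (the helper name `num_batches` is mine):

```python
import math

def num_batches(n, bs):
    # Integer ceiling division: the smallest number of size-`bs` batches covering n items.
    return (n - 1) // bs + 1

# Agrees with math.ceil whether or not n is a multiple of bs.
for n_items in (50000, 128, 129):
    assert num_batches(n_items, 64) == math.ceil(n_items / 64)

print(num_batches(50000, 64))   # 782 batches for the training set
print(50000 - 781 * 64)         # the last batch only holds 16 samples
print(128 // 64 + 1)            # the naive `n // bs + 1` gives 3 here, one (empty) batch too many
```

Slicing past the end of a tensor just returns the shorter remainder, which is why the final partial batch doesn't raise an error.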

In [29]:
simple_linear_model()

Prediction 1: tensor([-2.6708, -2.2444, -2.5179, -2.3209, -2.1072, -2.6687, -1.8283, -2.4041,
        -2.5895, -2.0480], grad_fn=<SelectBackward>)
predictions.shape torch.Size([64, 10])
tensor(2.3308, grad_fn=<NegBackward>)
training our model 2 epochs with learning rate of 0.5
 Loss of 2.330822706222534, with batch accuracy of 0.109375
 Loss of 0.31124401092529297, with batch accuracy of 0.90625
 Loss of 0.2908550500869751, with batch accuracy of 0.90625
 Loss of 0.38969311118125916, with batch accuracy of 0.921875
 Loss of 0.23748202621936798, with batch accuracy of 0.90625
 Loss of 0.37772005796432495, with batch accuracy of 0.890625
 Loss of 0.2568615674972534, with batch accuracy of 0.890625
 Loss of 0.3808162808418274, with batch accuracy of 0.890625
 Loss of 0.28118157386779785, with batch accuracy of 0.921875
 Loss of 0.25753968954086304, with batch accuracy of 0.921875
 Loss of 0.19260373711585999, with batch accuracy of 0.90625
 Loss of 0.34308546781539917, with batch accuracy

# Use torch.nn.functional

In [30]:
import torch.nn.functional as F
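
The model in the next cell drops `log_softmax` because `F.cross_entropy` combines the log softmax and the negative log likelihood in one call. A sketch to check that against the hand-written versions from earlier (the random `logits` and `targets` are stand-ins, not MNIST data):

```python
import torch
import torch.nn.functional as F

def log_softmax(x):
    return x - x.exp().sum(-1).log().unsqueeze(-1)

def neg_log_likelihood(x, target):
    return -x[range(target.shape[0]), target].mean()

torch.manual_seed(0)
logits = torch.randn(64, 10)               # raw scores, no log_softmax applied
targets = torch.randint(0, 10, (64,))      # random class labels in [0, 10)

by_hand = neg_log_likelihood(log_softmax(logits), targets)
built_in = F.cross_entropy(logits, targets)
same_loss = torch.allclose(by_hand, built_in, atol=1e-5)
print(by_hand, built_in, same_loss)
```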

In [31]:
def functional_linear_model():
    '''Use torch.nn.functional'''
    weights, bias = mnist_weights_and_biases()
    
    bs = 64 # batch_size

    xb = x_train[0:bs]
    yb = y_train[0:bs]
    loss_func = F.cross_entropy

    def model(xb):
        return xb @ weights + bias
    
    print('Prediction #1 loss:', loss_func(model(xb), yb), 'Accuracy:', accuracy(model(xb), yb))
    
    lr = 0.5
    epochs = 2 # how many laps through the dataset

    print(f"training our model {epochs} epochs with learning rate of {lr}")
    for epoch in range(epochs):
        # this 'i' is sometimes called a "period" when iterating through the dataset
        for i in range((n - 1) // bs + 1): # ceiling of n / bs, so the last partial batch is included
            start_i = i * bs
            end_i = start_i + bs # now we have a range [i*bs, (i+1)*bs)
            
            x_mini_batch = x_train[start_i: end_i]
            y_mini_batch = y_train[start_i: end_i]
            
            prediction = model(x_mini_batch) # (bs, 10) tensor of raw scores (logits), one row per image
            loss = loss_func(prediction, y_mini_batch)
            
            if i % 100 == 0: # do some logging every 100th batch
                print(f' Loss of {loss}, with batch accuracy of {accuracy(prediction, y_mini_batch)}')
            
            loss.backward()
            with torch.no_grad():
                weights -= weights.grad * lr
                bias -= bias.grad * lr
                weights.grad.zero_()
                bias.grad.zero_()
    
    print('Final Loss', loss_func(model(xb), yb), 'and Accuracy', accuracy(model(xb), yb))

In [32]:
functional_linear_model()

Prediction #1 loss: tensor(2.3569, grad_fn=<NllLossBackward>) Accuracy: tensor(0.0625)
training our model 2 epochs with learning rate of 0.5
 Loss of 2.3569226264953613, with batch accuracy of 0.0625
 Loss of 0.3186066746711731, with batch accuracy of 0.890625
 Loss of 0.3051021993160248, with batch accuracy of 0.890625
 Loss of 0.3864426612854004, with batch accuracy of 0.921875
 Loss of 0.23489728569984436, with batch accuracy of 0.890625
 Loss of 0.3852415084838867, with batch accuracy of 0.890625
 Loss of 0.26294535398483276, with batch accuracy of 0.890625
 Loss of 0.37971171736717224, with batch accuracy of 0.890625
 Loss of 0.27945461869239807, with batch accuracy of 0.921875
 Loss of 0.2591649889945984, with batch accuracy of 0.921875
 Loss of 0.19632761180400848, with batch accuracy of 0.90625
 Loss of 0.3407357931137085, with batch accuracy of 0.921875
 Loss of 0.2096318006515503, with batch accuracy of 0.9375
 Loss of 0.35662347078323364, with batch accuracy of 0.90625
 Loss

# Refactor into PyTorch nn.Module

In [33]:
from torch import nn
import math

class Mnist_Logistic(nn.Module):
    '''Encapsulates the weights, bias, and forward() method'''
    def __init__(self):
        super().__init__()
        self.weights = nn.Parameter(torch.randn(784, 10) / math.sqrt(784))
        self.bias = nn.Parameter(torch.zeros(10))
    
    def forward(self, xb):
        return xb @ self.weights + self.bias

In [34]:
model = Mnist_Logistic()

In [35]:
bs = 64
lr = 0.5
xb = x_train[0:bs] # one batch
yb = y_train[0:bs]

In [36]:
print(F.cross_entropy(model(xb), yb))

tensor(2.3358, grad_fn=<NllLossBackward>)


In [37]:
def fit(epochs=2):
    '''Train the model to fit to the data distribution'''
    
    for epoch in range(epochs):
        steps = (n - 1) // bs + 1
        for i in range(steps):
            start_i = i * bs
            end_i = start_i + bs
            xb, yb = x_train[start_i:end_i], y_train[start_i:end_i]
            
            pred = model(xb)
            loss = F.cross_entropy(pred, yb)
            
            if i % 50 == 0:
                print("Loss:", loss)
            if i % 100 == 0:
                print("Accuracy:", accuracy(pred, yb))
            
            loss.backward()
            with torch.no_grad():
                for p in model.parameters():
                    p -= p.grad * lr
                model.zero_grad()

fit()

Loss: tensor(2.3358, grad_fn=<NllLossBackward>)
Accuracy: tensor(0.0469)
Loss: tensor(0.4071, grad_fn=<NllLossBackward>)
Loss: tensor(0.3210, grad_fn=<NllLossBackward>)
Accuracy: tensor(0.8906)
Loss: tensor(0.2699, grad_fn=<NllLossBackward>)
Loss: tensor(0.3044, grad_fn=<NllLossBackward>)
Accuracy: tensor(0.8750)
Loss: tensor(0.5514, grad_fn=<NllLossBackward>)
Loss: tensor(0.3868, grad_fn=<NllLossBackward>)
Accuracy: tensor(0.9219)
Loss: tensor(0.3048, grad_fn=<NllLossBackward>)
Loss: tensor(0.2362, grad_fn=<NllLossBackward>)
Accuracy: tensor(0.9219)
Loss: tensor(0.2110, grad_fn=<NllLossBackward>)
Loss: tensor(0.3737, grad_fn=<NllLossBackward>)
Accuracy: tensor(0.8906)
Loss: tensor(0.3655, grad_fn=<NllLossBackward>)
Loss: tensor(0.2598, grad_fn=<NllLossBackward>)
Accuracy: tensor(0.8906)
Loss: tensor(0.2252, grad_fn=<NllLossBackward>)
Loss: tensor(0.3784, grad_fn=<NllLossBackward>)
Accuracy: tensor(0.9062)
Loss: tensor(0.2158, grad_fn=<NllLossBackward>)
Loss: tensor(0.2741, grad_fn=<Nl

# Refactor for nn.Linear

In [38]:
class Mnist_Logistic(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(784, 10)
    
    def forward(self, xb):
        return self.lin(xb)

In [39]:
loss_func = F.cross_entropy

In [40]:
model = Mnist_Logistic()
print(loss_func(model(xb), yb))

tensor(2.2945, grad_fn=<NllLossBackward>)


In [41]:
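def print_global_loss():
    # "global" = uses the global `model`, `xb`, `yb`; this is the loss on whichever batch `xb`, `yb` currently hold
    print('Global Loss:', loss_func(model(xb), yb))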

In [42]:
fit()

print_global_loss()

Loss: tensor(2.2945, grad_fn=<NllLossBackward>)
Accuracy: tensor(0.0938)
Loss: tensor(0.4156, grad_fn=<NllLossBackward>)
Loss: tensor(0.3107, grad_fn=<NllLossBackward>)
Accuracy: tensor(0.8906)
Loss: tensor(0.2771, grad_fn=<NllLossBackward>)
Loss: tensor(0.2985, grad_fn=<NllLossBackward>)
Accuracy: tensor(0.8906)
Loss: tensor(0.5565, grad_fn=<NllLossBackward>)
Loss: tensor(0.3838, grad_fn=<NllLossBackward>)
Accuracy: tensor(0.9219)
Loss: tensor(0.3042, grad_fn=<NllLossBackward>)
Loss: tensor(0.2380, grad_fn=<NllLossBackward>)
Accuracy: tensor(0.8906)
Loss: tensor(0.2121, grad_fn=<NllLossBackward>)
Loss: tensor(0.3795, grad_fn=<NllLossBackward>)
Accuracy: tensor(0.8906)
Loss: tensor(0.3640, grad_fn=<NllLossBackward>)
Loss: tensor(0.2651, grad_fn=<NllLossBackward>)
Accuracy: tensor(0.8906)
Loss: tensor(0.2284, grad_fn=<NllLossBackward>)
Loss: tensor(0.3797, grad_fn=<NllLossBackward>)
Accuracy: tensor(0.9062)
Loss: tensor(0.2145, grad_fn=<NllLossBackward>)
Loss: tensor(0.2764, grad_fn=<Nl

# Refactor to use torch.optim

For all your optimization needs!
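
As I understand it, plain `optim.SGD` (no momentum, no weight decay) does the same update we wrote by hand in `fit()`. A sketch to check that on a throwaway model; the sizes and the names `manual` / `auto` are made up for illustration.

```python
import torch
import torch.nn.functional as F
from torch import nn, optim

torch.manual_seed(0)
lr = 0.5
x, y = torch.randn(8, 4), torch.randint(0, 3, (8,))

manual = nn.Linear(4, 3)
auto = nn.Linear(4, 3)
auto.load_state_dict(manual.state_dict())      # start both from identical weights
opt = optim.SGD(auto.parameters(), lr=lr)

# Manual update, as in the earlier fit()
F.cross_entropy(manual(x), y).backward()
with torch.no_grad():
    for p in manual.parameters():
        p -= p.grad * lr
manual.zero_grad()

# The same step through the optimizer
F.cross_entropy(auto(x), y).backward()
opt.step()
opt.zero_grad()

same = all(torch.allclose(a, b) for a, b in zip(manual.parameters(), auto.parameters()))
print(same)
```

The payoff is that swapping in a different update rule later is a one-line change to how `opt` is built.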

In [43]:
from torch import optim

In [44]:
def get_model():
    model = Mnist_Logistic()
    return model, optim.SGD(model.parameters(), lr=lr)

In [45]:
model, opt = get_model()
print_global_loss()

Global Loss: tensor(2.3401, grad_fn=<NllLossBackward>)


In [46]:
epochs = 2

In [47]:
for epoch in range(epochs):
    for i in range((n - 1) // bs + 1):
        start_i = i * bs
        end_i = start_i + bs
        xb = x_train[start_i:end_i]
        yb = y_train[start_i:end_i]
        pred = model(xb)
        loss = loss_func(pred, yb)

        loss.backward()
        opt.step()
        opt.zero_grad()

print_global_loss()

Global Loss: tensor(0.0812, grad_fn=<NllLossBackward>)


# Refactor to use Dataset

By defining how we index into our dataset, we can iterate and slice like a serial killer.
* I haven't yet done the Dataset tutorial, because it might not be as necessary for my work.
  + What I have done is pick up the ideas behind the Dataset API while working with DataLoaders
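
A minimal sketch of what indexing a `TensorDataset` gives back. The tensors `xs` and `ys` are small made-up stand-ins for `x_train` and `y_train`.

```python
import torch
from torch.utils.data import TensorDataset

# Hypothetical stand-ins for x_train / y_train: 10 samples, 784 features each.
xs = torch.arange(10 * 784, dtype=torch.float32).view(10, 784)
ys = torch.arange(10)
ds = TensorDataset(xs, ys)

x0, y0 = ds[0]          # one sample: an (input, label) pair
xb, yb = ds[2:5]        # a slice: a pair of batched tensors
print(len(ds), x0.shape, xb.shape, yb)
```

So inputs and labels always travel together, and one slice replaces the two separate `x_train[...]` / `y_train[...]` lines.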

In [48]:
from torch.utils.data import TensorDataset

In [49]:
train_ds = TensorDataset(x_train, y_train)

In [50]:
model, opt = get_model()
print_global_loss()

for e in range(epochs):
    steps = (n - 1) // bs + 1
    for i in range(steps):
        start = i * bs
        xb, yb = train_ds[start:start + bs]
        pred = model(xb)
        loss = loss_func(pred, yb)
        
        loss.backward()
        opt.step()
        opt.zero_grad()

print_global_loss()

Global Loss: tensor(2.3164, grad_fn=<NllLossBackward>)
Global Loss: tensor(0.0827, grad_fn=<NllLossBackward>)


# A tower of abstractions: DataLoader

You can wrap a `Dataset` in a `DataLoader` to get a better grip on things.
* Even easier to iterate. We don't control the indices anymore!

In so doing, we:
* Off-load batching
* Remove concerns for "Index Out of Bounds" errors
* Leverage the ecosystem
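
A sketch of the batching being off-loaded, on toy data (10 samples and a batch size of 4 are arbitrary choices so that the last batch comes up short):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy data: 10 samples, batch size 4.
ds = TensorDataset(torch.randn(10, 784), torch.arange(10))
dl = DataLoader(ds, batch_size=4)

sizes = [len(xb) for xb, yb in dl]
print(len(dl), sizes)   # 3 batches; the last one only has 2 samples
```

`len(dl)` is the number of batches, not the number of samples, which matters later when averaging losses.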

In [51]:
from torch.utils.data import DataLoader

train_ds = TensorDataset(x_train, y_train)
train_dl = DataLoader(train_ds, batch_size=bs)

In [52]:
model, opt = get_model()

print_global_loss()

for epoch in range(epochs):
    for xb, yb in train_dl:
        pred = model(xb)
        loss = loss_func(pred, yb)
        
        loss.backward()
        opt.step()
        opt.zero_grad()

print_global_loss()

Global Loss: tensor(2.3418, grad_fn=<NllLossBackward>)
Global Loss: tensor(0.0810, grad_fn=<NllLossBackward>)


## A quick aside

Till this point, we've been doing a lot of trimming and refactoring to get a solid training loop
* We're down to ~7 lines of code (which is something of an ideal)
* We __haven't__ given any consideration to a train/validation/test data split

# Validate or die

In [54]:
train_ds = TensorDataset(x_train, y_train)
train_dl = DataLoader(train_ds, batch_size=bs, shuffle=True)

valid_ds = TensorDataset(x_valid, y_valid)
# no shuffle, because it's just inference and the network isn't "learning"
valid_dl = DataLoader(valid_ds, batch_size=bs * 2)

Everything before this is considered "step 1".
* `model.train()` is a switch that sets a flag so that layers like `nn.BatchNorm2d` and `nn.Dropout` behave appropriately for training
* `model.eval()` is the switch for when you're evaluating your model, employing it for _inference_
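
Our logistic model has neither layer, so the switches don't change anything yet. To see the flag do something, here is a sketch with a hypothetical little network (`net`) that includes `nn.Dropout`:

```python
import torch
from torch import nn

# Hypothetical model with a Dropout layer (the logistic model above has none).
net = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.ones(2, 4)

net.train()
train_flag = net.training     # True: Dropout randomly zeroes activations

net.eval()
eval_flag = net.training      # False: Dropout passes its input through unchanged
with torch.no_grad():
    deterministic = torch.equal(net(x), net(x))   # two eval-mode passes agree exactly

print(train_flag, eval_flag, deterministic)
```

Note that `model.eval()` only flips layer behavior; it is `torch.no_grad()` that turns off gradient tracking.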

In [53]:
model, opt = get_model()

In [55]:
for epoch in range(epochs):
    model.train()
    for xb, yb in train_dl:
        pred = model(xb)
        loss = loss_func(pred, yb)
        
        loss.backward()
        opt.step()
        opt.zero_grad()
    
    model.eval()
    with torch.no_grad():
        valid_loss = sum(loss_func(model(xb), yb) for xb, yb in valid_dl) # sum all the small losses
    
    print(epoch, valid_loss / len(valid_dl))

0 tensor(0.2978)
1 tensor(0.2777)


# Introducing our own abstraction

I'm tired of copy-pasting/typing the same loops out over and over.
* It's good to know what you have to do, but it's bad to be so redundant in programming

1. Because we compute the loss of a given `model` on a given batch in both the training and validation loops, we can make one function
   that handles a single batch of data
2. We can also group our train/validation `DataLoader`s because of the shared initialization code.

In [56]:
def loss_batch(model, f_loss, xb, yb, opt=None):
    loss = f_loss(model(xb), yb)
    
    if opt is not None:
        loss.backward()
        opt.step()
        opt.zero_grad()
    
    # the batch size is returned so fit() can weight each batch's loss when averaging (the last batch may be smaller)
    return loss.item(), len(xb)
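
Why return `len(xb)` alongside the loss? A sketch with made-up numbers: when the last batch is smaller, a plain mean of the per-batch losses gives it too much say, while weighting by batch size recovers the per-sample average.

```python
# Hypothetical per-batch mean losses and batch sizes (the last batch is short).
losses = [0.30, 0.20, 0.50]
nums = [128, 128, 16]

naive = sum(losses) / len(losses)
weighted = sum(l * n for l, n in zip(losses, nums)) / sum(nums)
print(round(naive, 4), round(weighted, 4))
```

The 16-sample batch with the high loss drags the naive mean up, but it barely moves the weighted one.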

In [78]:
import numpy as np

def fit(epochs, model, f_loss, opt, train_dl, valid_dl):
    """
    Args:
        f_loss: function that computes the loss, given predictions and labels
    """
    
    for epoch in range(epochs):
        model.train()
        for xb, yb in train_dl:
            loss_batch(model, f_loss, xb, yb, opt)
        
        model.eval()
        with torch.no_grad():
            losses, nums = zip(
                *[loss_batch(model, f_loss, xb, yb) for xb, yb in valid_dl]
            )
#             print(len(losses))
#             print(len(nums))
#             if epoch == 0: print((losses, nums))
            
        val_loss = np.sum(np.multiply(losses, nums)) / np.sum(nums)
        
        print(epoch, val_loss)

In [65]:
def get_data(train_dataset, valid_dataset, batch_size):
    return (
        DataLoader(train_dataset, batch_size=batch_size, shuffle=True),
        DataLoader(valid_dataset, batch_size=batch_size * 2),
    )

In [66]:
train_dl, valid_dl = get_data(train_ds, valid_ds, bs)
model, opt = get_model()
fit(epochs, model, loss_func, opt, train_dl, valid_dl)

79
79
((0.8022516965866089, 0.7330406904220581, 0.8440415859222412, 0.9190787672996521, 0.9403488039970398, 0.6811109781265259, 0.794546902179718, 0.5706993341445923, 0.42919379472732544, 0.6954090595245361, 0.48123791813850403, 0.6090065836906433, 0.5629838109016418, 0.5737200975418091, 0.6367772221565247, 0.7905337810516357, 0.9434087872505188, 0.8313820362091064, 0.6373736262321472, 0.5125751495361328, 0.5941041111946106, 0.8983874917030334, 1.0063868761062622, 1.00661301612854, 0.8407307267189026, 0.6104289293289185, 0.27579009532928467, 0.7052438259124756, 0.5208159685134888, 0.629485011100769, 0.8378593325614929, 0.8677883148193359, 0.7785532474517822, 0.6377819776535034, 0.7012762427330017, 0.7289571166038513, 0.6604243516921997, 0.692999541759491, 0.8283705115318298, 0.5624573230743408, 0.7023231983184814, 0.7204043865203857, 0.44788095355033875, 0.666982114315033, 0.45516759157180786, 0.621056318283081, 0.48781630396842957, 0.5128905177116394, 0.6097342371940613, 0.66936290264

# Switch to Convolutional Neural Networks

In [70]:
class Mnist_CNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1)        
        self.conv3 = nn.Conv2d(16, 10, kernel_size=3, stride=2, padding=1)
    
    def forward(self, xb):
        # gotta love Jeremy's code
        xb = xb.view(-1, 1, 28, 28) # any batch_size, 1 channel, 28x28 pixels
        xb = F.relu(self.conv1(xb))
        xb = F.relu(self.conv2(xb))
        xb = F.relu(self.conv3(xb))
        xb = F.avg_pool2d(xb, 4)
        
        # after pooling the shape is (batch_size, 10, 1, 1); flatten it to (batch_size, 10)
        return xb.view(-1, xb.size(1))
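
To see why `avg_pool2d(xb, 4)` leaves exactly one value per channel, the spatial sizes can be traced by hand. This sketch uses the output-size formula from the `nn.Conv2d` docs (with dilation 1); the helper name `conv_out` is mine.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    # Output length of a convolution along one spatial dimension (dilation 1).
    return (size + 2 * padding - kernel) // stride + 1

size = 28
for layer in ("conv1", "conv2", "conv3"):
    size = conv_out(size)
    print(layer, size)                  # 14, then 7, then 4

# avg_pool2d with kernel 4: stride defaults to the kernel size, no padding.
pooled = (size - 4) // 4 + 1
print("pooled", pooled)                 # 1, so the tensor is (batch, 10, 1, 1)
```

That `(batch, 10, 1, 1)` tensor is what the final `view` flattens into `(batch, 10)`, one score per digit.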

In [68]:
lr = 0.1 # drop the learning rate a smidge

We're also going to apply "[Momentum](https://cs231n.github.io/neural-networks-3/#sgd)", a variation on SGD that keeps a memory of previous updates, which allows for acceleration through
  the loss function landscape.
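
A hand-rolled sketch of the idea, on a one-parameter toy loss `f(p) = p**2`. The update below (`v = momentum * v + g`, then `p = p - lr * v`) is my reading of what `optim.SGD` does with `momentum` set and no dampening or Nesterov; the function name and the toy problem are made up.

```python
def sgd_momentum(grad_fn, p, lr, momentum, steps):
    # v accumulates past gradients; the parameter moves along v, not the raw gradient.
    v = 0.0
    for _ in range(steps):
        g = grad_fn(p)
        v = momentum * v + g
        p = p - lr * v
    return p

def grad(p):
    return 2 * p                # gradient of the toy loss f(p) = p**2

plain = sgd_momentum(grad, p=1.0, lr=0.01, momentum=0.0, steps=10)
heavy = sgd_momentum(grad, p=1.0, lr=0.01, momentum=0.9, steps=10)
print(plain, heavy)             # the momentum run ends up closer to the minimum at 0
```

With `momentum=0.0` this is ordinary gradient descent. With `0.9`, consecutive gradients pointing the same way pile up in `v`, so each step covers more ground. Run it for enough steps and it will overshoot the minimum before settling.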

In [74]:
model = Mnist_CNN()
opt = optim.SGD(model.parameters(), lr=lr, momentum=0.9)

In [75]:
fit(epochs, model, loss_func, opt, train_dl, valid_dl)

79
79
((0.30444860458374023, 0.3562984764575958, 0.4277384877204895, 0.5014829635620117, 0.48250144720077515, 0.3880632519721985, 0.20272275805473328, 0.32982489466667175, 0.20577040314674377, 0.3576623797416687, 0.2006949484348297, 0.26592713594436646, 0.30693739652633667, 0.296922504901886, 0.24109363555908203, 0.26987355947494507, 0.5987476110458374, 0.2588673532009125, 0.2934098243713379, 0.15217657387256622, 0.2632577419281006, 0.4800276458263397, 0.549534261226654, 0.32499027252197266, 0.24341151118278503, 0.22855380177497864, 0.17020606994628906, 0.4800698757171631, 0.30877888202667236, 0.3629472851753235, 0.49663811922073364, 0.5787115097045898, 0.2166202962398529, 0.2930310368537903, 0.35528451204299927, 0.3584819734096527, 0.21122372150421143, 0.2924754321575165, 0.423581063747406, 0.3465912938117981, 0.2591487169265747, 0.28588730096817017, 0.30690738558769226, 0.33558472990989685, 0.3036747872829437, 0.2575536072254181, 0.23342877626419067, 0.15359291434288025, 0.3096280395

# An alternative abstraction: [`nn.Sequential`](https://pytorch.org/docs/stable/nn.html#torch.nn.Sequential)

In [76]:
# you can only supply instances of `nn.Module` to `nn.Sequential`
class Lambda(nn.Module):
    def __init__(self, func):
        super().__init__()
        self.func = func
    
    def forward(self, x):
        return self.func(x)

def preprocess(x):
    '''Reshapes an MNIST input for CNN'''
    return x.view(-1, 1, 28, 28)

In [77]:
model = nn.Sequential(
    Lambda(preprocess),
    nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 10, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.AvgPool2d(kernel_size=4),
    Lambda(lambda x: x.view(x.size(0), -1)),
)

opt = optim.SGD(model.parameters(), lr=lr, momentum=0.9)

fit(epochs, model, loss_func, opt, train_dl, valid_dl)

79
79
((0.5561911463737488, 0.6966434121131897, 0.4914347231388092, 0.6605960130691528, 0.7451556921005249, 0.650911808013916, 0.5321139693260193, 0.37168675661087036, 0.2354753464460373, 0.5936785340309143, 0.4616206884384155, 0.369215726852417, 0.3910178542137146, 0.36459881067276, 0.5307605862617493, 0.4271329641342163, 0.7557274103164673, 0.42243075370788574, 0.4085313379764557, 0.23524603247642517, 0.4090055227279663, 0.6132016181945801, 0.7016457915306091, 0.486743688583374, 0.37956374883651733, 0.4555610418319702, 0.315112829208374, 0.6407705545425415, 0.4629344940185547, 0.3783039152622223, 0.39488157629966736, 0.7018197178840637, 0.29813095927238464, 0.4129551947116852, 0.4378446638584137, 0.6195954084396362, 0.3565754294395447, 0.3668956160545349, 0.6045891046524048, 0.38897472620010376, 0.34762752056121826, 0.3507446050643921, 0.4087551534175873, 0.3978545367717743, 0.5548195838928223, 0.5066040754318237, 0.30303144454956055, 0.258012056350708, 0.5292671322822571, 0.35029694

# Removing Assumptions

In [79]:
def preprocess(x, y):
    return x.view(-1, 1, 28, 28), y

In [80]:
class WrappedDataLoader:
    def __init__(self, dl, func):
        self.dl = dl
        self.func = func
    
    def __len__(self):
        return len(self.dl)
    
    def __iter__(self):
        batches = iter(self.dl)
        for b in batches:
            yield (self.func(*b))


In [81]:
train_dl, valid_dl = get_data(train_ds, valid_ds, bs)
train_dl = WrappedDataLoader(train_dl, preprocess)
valid_dl = WrappedDataLoader(valid_dl, preprocess)

`F.avg_pool2d` ~ `nn.AvgPool2d`
* The latter can be replaced with `nn.AdaptiveAvgPool2d`, which allows you to articulate the desired _output_ size,
  rather than a pooling window (which ties the model to one _input_ size); we have been doing the latter until now
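
A sketch of the difference, feeding both pooling layers feature maps of two different (made-up) sizes:

```python
import torch
from torch import nn

fixed = nn.AvgPool2d(kernel_size=4)      # you choose the window
adaptive = nn.AdaptiveAvgPool2d(1)       # you choose the output size

shapes = {}
for side in (4, 8):                      # hypothetical feature-map sizes
    x = torch.randn(2, 10, side, side)
    shapes[side] = (tuple(fixed(x).shape), tuple(adaptive(x).shape))
    print(side, shapes[side])
```

With the fixed window, an 8x8 feature map comes out as 2x2 and the flatten step would no longer yield 10 values per image; the adaptive layer always hands back 1x1.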

We've also pulled the data management into our own code, and left the tools to do what they do best.
* Leverage the framework where useful, but code your rules.

In [84]:
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 10, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    Lambda(lambda x: x.view(x.size(0), -1)),
)

opt = optim.SGD(model.parameters(), lr=lr, momentum=0.9)

In [85]:
fit(epochs, model, loss_func, opt, train_dl, valid_dl)

0 0.4469121433734894
1 0.31724769039154055


## Leverage a GPU

In [86]:
dev = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

def preprocess(x, y):
    # put the data on the gpu if you can
    return x.view(-1, 1, 28, 28).to(dev), y.to(dev)

In [87]:
train_dl, valid_dl = get_data(train_ds, valid_ds, bs)
train_dl = WrappedDataLoader(train_dl, preprocess)
valid_dl = WrappedDataLoader(valid_dl, preprocess)

In [88]:
model.to(dev)
opt = optim.SGD(model.parameters(), lr=lr, momentum=0.9)

In [89]:
fit(epochs, model, loss_func, opt, train_dl, valid_dl)

0 0.21975669264793396
1 0.1737469123840332
