Derived from reading through ["What is `torch.nn` _really_?"](https://pytorch.org/tutorials/beginner/nn_tutorial.html).
* I find this a little ironic, because Jeremy usually teaches top-down, starting from the high-level API
* This tutorial builds everything from the bottom up

---

In [1]:
from pathlib import Path
import requests

DATA_PATH = Path("data")
PATH = DATA_PATH / "mnist"

PATH.mkdir(parents=True, exist_ok=True)

In [2]:
URL = "http://deeplearning.net/data/mnist/"

In [3]:
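def make_filestructure(filename):
    # download the file only if we don't already have a local copy
    if not (PATH / filename).exists():
        content = requests.get(URL + filename).content
        # write_bytes opens, writes, and closes the file for us
        (PATH / filename).write_bytes(content)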

In [4]:
def load_train_validation_sets(filename):
    import pickle
    import gzip
    
    with gzip.open((PATH / filename).as_posix(), "rb") as f:
        ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(f, encoding="latin-1")
        return ((x_train, y_train), (x_valid, y_valid))

Now let's actually bring in the data

In [5]:
FILENAME = "mnist.pkl.gz"

make_filestructure(FILENAME)

Just a sidenote - this is __totally__ how Jeremy Howard codes:
1. Very brief, but super-effective, blocks of code
2. No comments, but the code is so _direct_ that knowing Python is all you need to follow it

In [6]:
xy_train, xy_valid = load_train_validation_sets(FILENAME)

In [7]:
def show_one_digit(images_flat):
    from matplotlib import pyplot
    import numpy as np
    
    random_index = np.random.randint(len(images_flat), size=1)[0]
    # show our random image in grayscale
    pyplot.imshow(images_flat[random_index].reshape((28, 28)), cmap="gray")
    print("images.shape =", images_flat.shape)

In [8]:
# pass in the training set, but it doesn't really matter which one we look at
show_one_digit(xy_train[0])

images.shape = (50000, 784)


# Torch-ify me cap'n!

In [9]:
import torch

def convert_np_to_torch(images_as_numpy):
    '''
    Args:
        images_as_numpy: all data as np.array's, concatenated together.
            I.e. (x_train, y_train, x_val, y_val[, x_test, y_test])
    '''
    return map(torch.tensor, (*images_as_numpy,))

x_train, y_train, x_valid, y_valid = convert_np_to_torch((*xy_train, *xy_valid))

In [10]:
x_train, y_train

(tensor([[0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         ...,
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]]), tensor([5, 0, 4,  ..., 8, 4, 8]))

In [11]:
n, c = x_train.shape # n = number of samples (rows), c = pixels per image (columns, 28 * 28)
print(n, c, 'from', x_train.shape)

50000 784 from torch.Size([50000, 784])


In [12]:
print(y_train.min(), y_train.max())

tensor(0) tensor(9)


In [13]:
def mnist_weights_and_biases():
    '''Return (weights, bias) constructed from a Normal Distribution, and "Xavier-initialized"'''
    import math
    
    weights = torch.randn(784, 10) / math.sqrt(784)
    weights.requires_grad_() # set requires_grad = True post-hoc
    
    bias = torch.zeros(10, requires_grad=True)
    
    return weights, bias

Because PyTorch's autograd tracks any tensor operation that involves a `requires_grad` tensor, plain Python
  functions are enough here. We're going to write two:
1. A "log softmax" function, which turns raw scores into log-probabilities
2. A "model" function, because anything that is callable can be a PyTorch model
  * And the gradients will still be calculated!
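
A minimal sketch of that last point, using made-up shapes and names (`tiny_model`, `w`, `b`, and the 4-feature, 3-class sizes are illustrative, not from the tutorial): an ordinary function built from tensor operations is all autograd needs.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy parameters: 4 input features, 3 classes
w = torch.randn(4, 3, requires_grad=True)
b = torch.zeros(3, requires_grad=True)

def tiny_model(xb):
    # an ordinary function; autograd records the matmul and the add
    return xb @ w + b

xb = torch.randn(5, 4)        # a fake mini-batch of 5 samples
loss = tiny_model(xb).sum()   # any scalar works as a stand-in loss
loss.backward()

print(w.grad.shape, b.grad.shape)
```

Nothing here subclasses `nn.Module`; the gradients land on `w.grad` and `b.grad` simply because both tensors were created with `requires_grad=True`.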

In [14]:
def simple_linear_model():
    '''To encapsulate the shenanigans that are about to ensue, I will work in a function'''
    
    def log_softmax(x):
        return x - x.exp().sum(-1).log().unsqueeze(-1)

    def model(mini_batch, weights, bias):
        return log_softmax(mini_batch @ weights + bias)
    
    w, b = mnist_weights_and_biases()
    
    bs = 64 # batch size
    
    x_b = x_train[0:bs] # a mini-batch from our inputs
    predictions = model(x_b, w, b)
    
    print('Prediction 1:', predictions[0])
    print('predictions.shape', predictions.shape)

In [15]:
simple_linear_model()

Prediction 1: tensor([-3.0738, -2.2553, -2.1997, -2.1826, -2.3903, -2.1997, -2.6240, -1.6742,
        -2.5130, -2.5024], grad_fn=<SelectBackward>)
predictions.shape torch.Size([64, 10])


Again, for the sake of keeping the global namespace relatively unpolluted (until we get to the `torch.nn` stuff)
  I'm going to simply rewrite `simple_linear_model`.
* With the caveat of pulling up `log_softmax(x)` for its general utility

In [16]:
def _feeling_out_torch_sum():
    # give me [0, 120) and arrange them as 4 blocks of 5x6 matrices
    b = torch.arange(4 * 5 * 6).view(4, 5, 6)
    
    a = torch.arange(10 * 2).view(2, 10)
    print(a)
    print(a.sum(0))

In [17]:
_feeling_out_torch_sum()

tensor([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
        [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]])
tensor([10, 12, 14, 16, 18, 20, 22, 24, 26, 28])


In [18]:
# ?torch.exp

In [19]:
def _feeling_out_torch_unsqueeze():
#     ?torch.unsqueeze
    small_range = torch.arange(1., 3., .5)
    print(small_range)
    
    _sum = small_range.sum()
    print(_sum)
    
    print(_sum.unsqueeze(-1))
    print(_sum.log())
    print(_sum.log().unsqueeze(-1))

In [20]:
_feeling_out_torch_unsqueeze()

tensor([1.0000, 1.5000, 2.0000, 2.5000])
tensor(7.)
tensor([7.])
tensor(1.9459)
tensor([1.9459])


(Log) Softmax, here, is the difference of
1. the input, and
2. the log of the sum along the last dimension of the exponential of the input

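The `unsqueeze(-1)` boxes that log-sum into a trailing dimension before the subtraction, so it lines up with the rows of the input.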
* Why index `-1` rather than `0` for a scalar value? I don't know, but maybe we can find out
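
Written as a formula for a single row $x$ of the input, the difference described above is:

```latex
\log \operatorname{softmax}(x)_i
  = \log \frac{e^{x_i}}{\sum_{j} e^{x_j}}
  = x_i - \log \sum_{j} e^{x_j}
```

The sum over $j$ is the "sum along the last dimension": it runs over the classes within one row, not over the samples in the batch.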

In [21]:
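# NB: torch.Tensor(n) allocates n uninitialized floats, so the printed values are arbitrary; only the shapes matter
torch.Tensor(1).unsqueeze(-1), torch.Tensor(1).unsqueeze(0)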

(tensor([[0.]]), tensor([[0.]]))

No difference when there's one element, but how about with multiple?

In [22]:
torch.Tensor(3).unsqueeze(-1), torch.Tensor(3).unsqueeze(0)

(tensor([[ 0.0000e+00],
         [-3.6893e+19],
         [-5.0996e-06]]), tensor([[ 0.0000e+00, -3.6893e+19,  0.0000e+00]]))

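There it is. `unsqueeze(-1)` always adds the new axis at the end, so each element gets its own box: shape `(3,)` becomes `(3, 1)`, whereas `unsqueeze(0)` gives `(1, 3)`.
* In `log_softmax`, the sum over the last dimension leaves one value per row, and this `unsqueeze` makes sure
  those values are individually boxed, so each one is subtracted from its own row of (log) probabilities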
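
To make the shape argument concrete, here is a small sketch on a made-up 2x3 input (the names `x`, `row_lse`, and `out` are illustrative only):

```python
import torch

x = torch.arange(6.).view(2, 3)       # 2 rows (samples), 3 columns (classes)
row_lse = x.exp().sum(-1).log()       # one log-sum-exp per row, shape (2,)

print(row_lse.unsqueeze(-1).shape)    # torch.Size([2, 1]): one boxed value per row
print(row_lse.unsqueeze(0).shape)     # torch.Size([1, 2]): a single row of two values

# (2, 3) - (2, 1) broadcasts across the columns, so every row loses its own log-sum-exp
out = x - row_lse.unsqueeze(-1)
print(out.exp().sum(-1))              # each row of probabilities sums to (about) 1
```

With `unsqueeze(0)` the subtraction would pair a `(2, 3)` tensor with a `(1, 2)` one, and those trailing dimensions (3 and 2) do not broadcast.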

---
Another token Jeremy Howard-ism:
* Great coding practices, but he doesn't explain them along the way.
* He wants you to trust his implementation, and know enough to dig in yourself (as I have done above)

---

In [23]:
def log_softmax(x):
    return x - x.exp().sum(-1).log().unsqueeze(-1)

In [24]:
def neg_log_likelihood(x, target):
    return -x[range(target.shape[0]), target].mean()

In [25]:
def accuracy(out, target_batch):
    preds = torch.argmax(out, dim=1) # this is weird to me
    return (preds == target_batch).float().mean() # average accuracy of the predictions

In [26]:
# ?torch.argmax

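It gets the indices of the maximum values of the rows of the model output...
* Why is that how we determine our prediction? Each row holds one score per digit (10 columns for the digits 0-9), so the index of the largest score is the digit the model considers most likely.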
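
A tiny worked example may help here. The numbers are invented (3 samples, 4 classes), and `out`, `targets`, `preds`, and `acc` are just illustrative names:

```python
import torch

# Hypothetical log-probabilities for 3 samples over 4 classes
out = torch.log(torch.tensor([[0.1, 0.7, 0.1, 0.1],
                              [0.6, 0.2, 0.1, 0.1],
                              [0.2, 0.2, 0.5, 0.1]]))
targets = torch.tensor([1, 0, 3])

preds = torch.argmax(out, dim=1)            # index of the most likely class in each row
acc = (preds == targets).float().mean()     # fraction of rows where the guess was right

print(preds)   # tensor([1, 0, 2])
print(acc)     # two of the three predictions match their target
```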

In [27]:
def simple_linear_model():
    '''To encapsulate the shenanigans that are about to ensue, I will work in a function'''

    w, b = mnist_weights_and_biases()
    
    def model(mini_batch):
        return log_softmax(mini_batch @ w + b)
    
    
    bs = 64 # batch size
    
    x_b = x_train[0:bs] # a mini-batch from our inputs
    predictions = model(x_b)
    
    print('Prediction 1:', predictions[0])
    print('predictions.shape', predictions.shape)
    
    loss_func = neg_log_likelihood
    
    y_b = y_train[0:bs]
    
    print(loss_func(predictions, y_b))
    
    lr = 0.5
    epochs = 2 # how many laps through the dataset

    print(f"training our model {epochs} epochs with learning rate of {lr}")
    for epoch in range(epochs):
        # each pass of this inner loop handles one mini-batch (often called a "step" or "iteration")
        for i in range((n - 1) // bs + 1): # ceiling division, so a final partial batch is included
            start_i = i * bs
            end_i = start_i + bs # now we have a range [i*bs, (i+1)*bs)
            
            x_mini_batch = x_train[start_i: end_i]
            y_mini_batch = y_train[start_i: end_i]
            
            prediction = model(x_mini_batch) # (batch, 10) tensor of log-probabilities
            loss = loss_func(prediction, y_mini_batch)
            
            if i % 100 == 0: # do some logging every 100th batch
                print(f' Loss of {loss}, with batch accuracy of {accuracy(prediction, y_mini_batch)}')
            
            loss.backward()
            with torch.no_grad():
                w -= w.grad * lr
                b -= b.grad * lr
                w.grad.zero_()
                b.grad.zero_()
    
    print(loss_func(model(x_b), y_b), accuracy(model(x_b), y_b))

In [28]:
simple_linear_model()

Prediction 1: tensor([-1.9524, -2.5062, -2.1828, -2.5593, -2.8282, -2.2362, -1.9332, -2.0642,
        -3.0664, -2.2805], grad_fn=<SelectBackward>)
predictions.shape torch.Size([64, 10])
tensor(2.3146, grad_fn=<NegBackward>)
training our model 2 epochs with learning rate of 0.5
 Loss of 2.3146328926086426, with batch accuracy of 0.0625
 Loss of 0.32288476824760437, with batch accuracy of 0.890625
 Loss of 0.2939816117286682, with batch accuracy of 0.90625
 Loss of 0.39154088497161865, with batch accuracy of 0.921875
 Loss of 0.23661889135837555, with batch accuracy of 0.90625
 Loss of 0.3833451271057129, with batch accuracy of 0.890625
 Loss of 0.2693641781806946, with batch accuracy of 0.890625
 Loss of 0.37894997000694275, with batch accuracy of 0.90625
 Loss of 0.279062420129776, with batch accuracy of 0.921875
 Loss of 0.26254570484161377, with batch accuracy of 0.921875
 Loss of 0.1923317313194275, with batch accuracy of 0.90625
 Loss of 0.3412483334541321, with batch accuracy of 0
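
On the "why n-1?" question in the loop above: `(n - 1) // bs + 1` is integer ceiling division. A quick sketch with this notebook's numbers (`num_batches` is a hypothetical helper, not part of the tutorial):

```python
def num_batches(n, bs):
    # ceiling division with integers: rounds up whenever n is not a multiple of bs
    return (n - 1) // bs + 1

n, bs = 50000, 64
print(n // bs)               # 781 full batches
print(n % bs)                # 16 samples left over
print(num_batches(n, bs))    # 782: the leftover samples get their own, smaller batch
print(num_batches(128, 64))  # 2: an exact multiple gains no empty extra batch
```

A plain `n // bs` would silently drop the last 16 samples of every epoch, while `n // bs + 1` would add an empty batch whenever `n` divides evenly.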

# Use torch.nn.functional

In [31]:
import torch.nn.functional as F
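
`F.cross_entropy` combines a log-softmax with the negative log-likelihood, which is why the model in the next cell returns raw scores without calling `log_softmax`. A quick check on made-up inputs (the `logits` and `targets` below are random stand-ins, not MNIST data):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(8, 10)              # made-up raw model outputs
targets = torch.randint(0, 10, (8,))     # made-up class labels

# the hand-written pair from earlier in this notebook
log_probs = logits - logits.exp().sum(-1).log().unsqueeze(-1)
manual = -log_probs[range(targets.shape[0]), targets].mean()

builtin = F.cross_entropy(logits, targets)
print(torch.allclose(manual, builtin, atol=1e-5))
```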

In [40]:
def functional_linear_model():
    '''Use torch.nn.functional'''
    weights, bias = mnist_weights_and_biases()
    
    bs = 64 # batch_size

    xb = x_train[0:bs]
    yb = y_train[0:bs]
    loss_func = F.cross_entropy

    def model(xb):
        return xb @ weights + bias
    
    print('Prediction #1 loss:', loss_func(model(xb), yb), 'Accuracy:', accuracy(model(xb), yb))
    
    lr = 0.5
    epochs = 2 # how many laps through the dataset

    print(f"training our model {epochs} epochs with learning rate of {lr}")
    for epoch in range(epochs):
        # each pass of this inner loop handles one mini-batch (often called a "step" or "iteration")
        for i in range((n - 1) // bs + 1): # ceiling division, so a final partial batch is included
            start_i = i * bs
            end_i = start_i + bs # now we have a range [i*bs, (i+1)*bs)
            
            x_mini_batch = x_train[start_i: end_i]
            y_mini_batch = y_train[start_i: end_i]
            
            prediction = model(x_mini_batch) # (batch, 10) tensor of raw scores (logits)
            loss = loss_func(prediction, y_mini_batch)
            
            if i % 100 == 0: # do some logging every 100th batch
                print(f' Loss of {loss}, with batch accuracy of {accuracy(prediction, y_mini_batch)}')
            
            loss.backward()
            with torch.no_grad():
                weights -= weights.grad * lr
                bias -= bias.grad * lr
                weights.grad.zero_()
                bias.grad.zero_()
    
    print('Final Loss', loss_func(model(xb), yb), 'and Accuracy', accuracy(model(xb), yb))

In [41]:
functional_linear_model()

Prediction #1 loss: tensor(2.3454, grad_fn=<NllLossBackward>) Accuracy: tensor(0.0938)
training our model 2 epochs with learning rate of 0.5
 Loss of 2.3454432487487793, with batch accuracy of 0.09375
 Loss of 0.317440390586853, with batch accuracy of 0.890625
 Loss of 0.3076491355895996, with batch accuracy of 0.859375
 Loss of 0.39076757431030273, with batch accuracy of 0.90625
 Loss of 0.23709739744663239, with batch accuracy of 0.890625
 Loss of 0.38853177428245544, with batch accuracy of 0.890625
 Loss of 0.2624582052230835, with batch accuracy of 0.890625
 Loss of 0.3857908248901367, with batch accuracy of 0.90625
 Loss of 0.28397250175476074, with batch accuracy of 0.921875
 Loss of 0.26196223497390747, with batch accuracy of 0.921875
 Loss of 0.1962507665157318, with batch accuracy of 0.90625
 Loss of 0.3435472249984741, with batch accuracy of 0.921875
 Loss of 0.21012549102306366, with batch accuracy of 0.921875
 Loss of 0.35814568400382996, with batch accuracy of 0.890625
 Lo

# Refactor into PyTorch nn.Module

In [44]:
from torch import nn
import math

class Mnist_Logistic(nn.Module):
    '''Encapsulates the weights, bias, and forward() method'''
    def __init__(self):
        super().__init__()
        self.weights = nn.Parameter(torch.randn(784, 10) / math.sqrt(784))
        self.bias = nn.Parameter(torch.zeros(10))
    
    def forward(self, xb):
        return xb @ self.weights + self.bias

In [45]:
model = Mnist_Logistic()

In [54]:
bs = 64
lr = 0.5
xb = x_train[0:bs] # one batch
yb = y_train[0:bs]

In [49]:
print(F.cross_entropy(model(xb), yb))

tensor(2.3429, grad_fn=<NllLossBackward>)


In [59]:
def fit(epochs=2):
    '''Train the model to fit to the data distribution'''
    
    for epoch in range(epochs):
        steps = (n - 1) // bs + 1
        for i in range(steps):
            start_i = i * bs
            end_i = start_i + bs
            xb, yb = x_train[start_i:end_i], y_train[start_i:end_i]
            
            pred = model(xb)
            loss = F.cross_entropy(pred, yb)
            
            if i % 50 == 0:
                print("Loss:", loss)
            if i % 100 == 0:
                print("Accuracy:", accuracy(pred, yb))
            
            loss.backward()
            with torch.no_grad():
                for p in model.parameters():
                    p -= p.grad * lr
                model.zero_grad()

fit()

Loss: tensor(0.1544, grad_fn=<NllLossBackward>)
Accuracy: tensor(0.9375)
Loss: tensor(0.2990, grad_fn=<NllLossBackward>)
Loss: tensor(0.2586, grad_fn=<NllLossBackward>)
Accuracy: tensor(0.9375)
Loss: tensor(0.2205, grad_fn=<NllLossBackward>)
Loss: tensor(0.1578, grad_fn=<NllLossBackward>)
Accuracy: tensor(0.9219)
Loss: tensor(0.4692, grad_fn=<NllLossBackward>)
Loss: tensor(0.2976, grad_fn=<NllLossBackward>)
Accuracy: tensor(0.9375)
Loss: tensor(0.2169, grad_fn=<NllLossBackward>)
Loss: tensor(0.1869, grad_fn=<NllLossBackward>)
Accuracy: tensor(0.9062)
Loss: tensor(0.1216, grad_fn=<NllLossBackward>)
Loss: tensor(0.3173, grad_fn=<NllLossBackward>)
Accuracy: tensor(0.8906)
Loss: tensor(0.3043, grad_fn=<NllLossBackward>)
Loss: tensor(0.1762, grad_fn=<NllLossBackward>)
Accuracy: tensor(0.9688)
Loss: tensor(0.1503, grad_fn=<NllLossBackward>)
Loss: tensor(0.3253, grad_fn=<NllLossBackward>)
Accuracy: tensor(0.9219)
Loss: tensor(0.1515, grad_fn=<NllLossBackward>)
Loss: tensor(0.1530, grad_fn=<Nl

# Refactor for nn.Linear

In [60]:
class Mnist_Logistic(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(784, 10)
    
    def forward(self, xb):
        return self.lin(xb)

In [62]:
loss_func = F.cross_entropy

In [63]:
model = Mnist_Logistic()
print(loss_func(model(xb), yb))

tensor(2.3260, grad_fn=<NllLossBackward>)


In [66]:
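def print_global_loss():
    # "global" as in scope: uses whatever xb, yb currently are at module level (one mini-batch, not the full dataset)
    print('Global Loss:', loss_func(model(xb), yb))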

In [65]:
fit()

print_global_loss()

Loss: tensor(2.3260, grad_fn=<NllLossBackward>)
Accuracy: tensor(0.0625)
Loss: tensor(0.4135, grad_fn=<NllLossBackward>)
Loss: tensor(0.3205, grad_fn=<NllLossBackward>)
Accuracy: tensor(0.9062)
Loss: tensor(0.2697, grad_fn=<NllLossBackward>)
Loss: tensor(0.2987, grad_fn=<NllLossBackward>)
Accuracy: tensor(0.8750)
Loss: tensor(0.5576, grad_fn=<NllLossBackward>)
Loss: tensor(0.3888, grad_fn=<NllLossBackward>)
Accuracy: tensor(0.9219)
Loss: tensor(0.3000, grad_fn=<NllLossBackward>)
Loss: tensor(0.2411, grad_fn=<NllLossBackward>)
Accuracy: tensor(0.8906)
Loss: tensor(0.2113, grad_fn=<NllLossBackward>)
Loss: tensor(0.3863, grad_fn=<NllLossBackward>)
Accuracy: tensor(0.8906)
Loss: tensor(0.3603, grad_fn=<NllLossBackward>)
Loss: tensor(0.2629, grad_fn=<NllLossBackward>)
Accuracy: tensor(0.8906)
Loss: tensor(0.2308, grad_fn=<NllLossBackward>)
Loss: tensor(0.3723, grad_fn=<NllLossBackward>)
Accuracy: tensor(0.9062)
Loss: tensor(0.2131, grad_fn=<NllLossBackward>)
Loss: tensor(0.2797, grad_fn=<Nl

# Refactor to use torch.optim

For all your optimization needs!
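
As a sanity check on what the optimizer takes over, here is a sketch comparing one `optim.SGD` step with the hand-written `p -= p.grad * lr` update. It uses a toy `nn.Linear(3, 2)` and a made-up loss, and assumes plain SGD with no momentum or weight decay:

```python
import torch
from torch import nn, optim

torch.manual_seed(0)
lr = 0.5
layer = nn.Linear(3, 2)                     # toy stand-in for the MNIST model
opt = optim.SGD(layer.parameters(), lr=lr)

loss = layer(torch.randn(4, 3)).pow(2).mean()
loss.backward()

# what the hand-written loop would compute: p - p.grad * lr
expected = [p.detach() - p.grad * lr for p in layer.parameters()]

opt.step()        # updates every parameter in place
opt.zero_grad()   # clears the gradients for the next batch

matches = all(torch.allclose(p, e) for p, e in zip(layer.parameters(), expected))
print(matches)
```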

In [67]:
from torch import optim

In [68]:
def get_model():
    model = Mnist_Logistic()
    return model, optim.SGD(model.parameters(), lr=lr)

In [69]:
model, opt = get_model()
print_global_loss()

Global Loss: tensor(2.2984, grad_fn=<NllLossBackward>)


In [71]:
epochs = 2

In [72]:
for epoch in range(epochs):
    for i in range((n - 1) // bs + 1):
        start_i = i * bs
        end_i = start_i + bs
        xb = x_train[start_i:end_i]
        yb = y_train[start_i:end_i]
        pred = model(xb)
        loss = loss_func(pred, yb)

        loss.backward()
        opt.step()
        opt.zero_grad()

print_global_loss()

Global Loss: tensor(0.0821, grad_fn=<NllLossBackward>)


# Refactor to use Dataset

By defining how we index into our dataset, we can iterate and slice like a serial killer.
* I haven't yet done the Dataset tutorial, because it might not be as necessary for my work.
  + What I have done is pick up the thinking behind the Dataset API while working with DataLoaders
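
A small sketch of what `TensorDataset` buys us, on invented data (`xs`, `ys`, and `ds` are illustrative names): indexing or slicing the dataset indexes every wrapped tensor the same way, so inputs and labels stay paired.

```python
import torch
from torch.utils.data import TensorDataset

xs = torch.arange(12.).view(6, 2)    # 6 fake samples with 2 features each
ys = torch.arange(6)                 # 6 fake labels

ds = TensorDataset(xs, ys)
print(len(ds))                       # 6

xb, yb = ds[2:5]                     # one slice indexes both tensors together
print(xb.shape, yb)                  # torch.Size([3, 2]) tensor([2, 3, 4])
```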

In [73]:
from torch.utils.data import TensorDataset

In [74]:
train_ds = TensorDataset(x_train, y_train)

In [75]:
model, opt = get_model()
print_global_loss()

for e in range(epochs):
    steps = (n - 1) // bs + 1
    for i in range(steps):
        start = i * bs
        xb, yb = train_ds[start:start + bs]
        pred = model(xb)
        loss = loss_func(pred, yb)
        
        loss.backward()
        opt.step()
        opt.zero_grad()

print_global_loss()

Global Loss: tensor(2.3320, grad_fn=<NllLossBackward>)
Global Loss: tensor(0.0833, grad_fn=<NllLossBackward>)


# A tower of abstractions: DataLoader

You can wrap a `Dataset` in a `DataLoader` to get a better grip on things.
* Even easier to iterate. We don't control the indices anymore!

In so doing, we:
* Off-load batching
* Remove the hand-computed start/end indices (and the off-by-one mistakes that come with them)
* Leverage the ecosystem
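
A minimal sketch of the batching behaviour, assuming a made-up dataset of 10 samples and a batch size of 4 (neither is from the tutorial):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.arange(10.).view(10, 1), torch.arange(10))
dl = DataLoader(ds, batch_size=4)    # no shuffling, so batches come in order

sizes = [len(yb) for xb, yb in dl]
print(sizes)                         # [4, 4, 2]: the last batch is simply shorter
print(len(dl))                       # 3 batches, the same count as (10 - 1) // 4 + 1
```

The loader reproduces the ceiling-division batch count from the manual loops without any index arithmetic on our side.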

In [76]:
from torch.utils.data import DataLoader

train_ds = TensorDataset(x_train, y_train)
train_dl = DataLoader(train_ds, batch_size=bs)

In [77]:
model, opt = get_model()

print_global_loss()

for epoch in range(epochs):
    for xb, yb in train_dl:
        pred = model(xb)
        loss = loss_func(pred, yb)
        
        loss.backward()
        opt.step()
        opt.zero_grad()

print_global_loss()

Global Loss: tensor(2.2560, grad_fn=<NllLossBackward>)
Global Loss: tensor(0.0813, grad_fn=<NllLossBackward>)


## A quick aside

Till this point, we've been doing a lot of trimming and refactoring to get a solid training loop
* We're down to ~7 lines of code (which is something of an ideal)
* We __haven't__ given any consideration to a 