# Under the Hood: Training a Digit Classifier

## Pixels: The Foundations of Computer Vision

In order to understand what happens in a computer vision model, we first have to understand how computers handle images.
We are going to try to create a model that can classify any image as a 3 or a 7.

In [None]:
from fastai.vision.all import *
from fastbook import *

matplotlib.rc('image', cmap='Greys')

In [None]:
# To download a dataset or pretrained weights, pass the corresponding URLs constant to untar_data, e.g.:
#    path = untar_data(URLs.PETS)
# For details: https://docs.fast.ai/data.external.html

# Download sample of MNIST that contains images of just these digits

path = untar_data(URLs.MNIST_SAMPLE)

In [None]:
# We can see what is in the directory by using `ls`.
# The MNIST dataset follows a common layout for machine learning datasets: separate folders for the training set and the validation set.
path.ls()

In [None]:
Path.BASE_PATH = path
path.ls()

In [None]:
(path/'train').ls()

In [None]:
threes = (path/'train'/'3').ls().sorted()
sevens = (path/'train'/'7').ls().sorted()
im3 = Image.open(threes[1])                  # Image class from the Python Imaging Library (PIL)
im3

In a computer, everything is represented as a number. To view the numbers that make up this image, we have to convert it to a *NumPy array* or a *PyTorch tensor*. For instance, here's what a section of the image looks like, converted to a NumPy array:

In [None]:
array(im3)[4:15,4:22]

We can use a Pandas DataFrame to color-code the values using a gradient, which shows us clearly how the image is created from the pixel values. You can see that the background white pixels are stored as the number 0, black is the number 255, and shades of gray are between the two.

In [None]:
#hide_output
im3_t = tensor(im3)
df = pd.DataFrame(im3_t[4:15,4:22])
df.style.set_properties(**{'font-size':'6pt'}).background_gradient('Greys')

So, now you've seen what an image looks like to a computer, let's recall our goal: create a model that can recognize 3s and 7s. 

## First Try: Pixel Similarity

How about we find the average pixel value for every pixel of the 3s, then do the same for the 7s. This will give us two group averages, defining what we might call the "ideal" 3 and 7. Then, to classify an image as one digit or the other, we see which of these two ideal digits the image is most similar to. This certainly seems like it should be better than nothing, so it will make a good baseline.

> jargon: Baseline: A simple model which you are confident should perform reasonably well. It should be very simple to implement, and very easy to test, so that you can then test each of your improved ideas, and make sure they are always better than your baseline. Without starting with a sensible baseline, it is very difficult to know whether your super-fancy models are actually any good. One good approach to creating a baseline is doing what we have done here: think of a simple, easy-to-implement model. Another good approach is to search around to find other people that have solved similar problems to yours, and download and run their code on your dataset. Ideally, try both of these!

In [None]:
three_tensors = [tensor(Image.open(o)) for o in threes]
seven_tensors = [tensor(Image.open(o)) for o in sevens]
print(three_tensors[0].shape)
len(three_tensors),len(seven_tensors)

In [None]:
# Check what a single image looks like
show_image(three_tensors[1])

For every pixel position, we want to compute the average over all the images of the intensity of that pixel. To do this, we first combine all the images in this list into a single three-dimensional tensor. The most common way to describe such a tensor is to call it a *rank-3 tensor*. We often need to stack up individual tensors in a collection into a single tensor, and, unsurprisingly, PyTorch comes with a function called `stack` that we can use for this purpose.

Some operations in PyTorch, such as taking a mean, require us to *cast* our integer types to float types. Generally when images are floats, the pixel values are expected to be between 0 and 1, so we will also divide by 255 here.

In [None]:
# Stack up individual tensors in a collection into a single tensor.
stacked_threes = torch.stack(three_tensors).float()/255
stacked_sevens = torch.stack(seven_tensors).float()/255
stacked_threes.shape
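
To make the stacking and casting steps concrete, here is a minimal sketch on toy data. The 2×2 "images" and the names `imgs`, `stacked`, and `scaled` are made up for illustration and are not part of the MNIST notebook. It assumes a recent PyTorch version, where taking the mean of an integer tensor raises an error.

```python
import torch

# Three tiny 2x2 "images" with integer pixel values (hypothetical stand-ins for MNIST digits)
imgs = [torch.tensor([[0, 255], [128, 0]], dtype=torch.uint8) for _ in range(3)]

# stack adds a new leading axis: three rank-2 tensors become one rank-3 tensor
stacked = torch.stack(imgs)
print(stacked.shape, stacked.dtype)

try:
    stacked.mean(0)                      # mean is not defined for integer dtypes
except RuntimeError as e:
    print("mean on uint8 failed:", e)

# Cast to float and rescale so pixel values lie between 0 and 1
scaled = stacked.float() / 255
print(scaled.mean(0))
```

The same pattern, `torch.stack(...).float()/255`, is what the cell above applies to the real lists of image tensors.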

**RANK** is the number of axes or dimensions in a tensor. <br>
**SHAPE** is the size of each axis of a tensor.

We can get a tensor's rank with `ndim`, or by taking the `len` of its shape:

In [None]:
print(len(stacked_threes.shape))
print(stacked_threes.ndim)

Finally, we can compute what the ideal 3 looks like. For every pixel position, we'll compute the average of that pixel over all images. The result will be one value for every pixel position, or a single image:

In [None]:
mean3 = stacked_threes.mean(0)
show_image(mean3);

In [None]:
mean7 = stacked_sevens.mean(0)
show_image(mean7);

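Let's now pick an arbitrary 3 and measure its *distance* from our "ideal digits". We measure it in two ways: the mean of the absolute value of the differences (the *mean absolute difference*, or *L1 norm*), and the square root of the mean of the squared differences (the *root mean squared error*, RMSE, or *L2 norm*).
We'll see that in both cases, the distance between our 3 and the "ideal" 3 is less than the distance to the ideal 7. So our simple model will give the right prediction in this case.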

In [None]:
dist_3_abs = (stacked_threes[1] - mean3).abs().mean()
dist_3_sqr = ((stacked_threes[1] - mean3)**2).mean().sqrt()
dist_3_abs,dist_3_sqr

In [None]:
dist_7_abs = (stacked_threes[1] - mean7).abs().mean()
dist_7_sqr = ((stacked_threes[1] - mean7)**2).mean().sqrt()
dist_7_abs,dist_7_sqr

PyTorch already provides both of these as *loss functions*. You'll find these inside `torch.nn.functional`, which the PyTorch team recommends importing as `F` (and is available by default under that name in fastai):

In [None]:
# "l1" is the standard mathematical name for the mean absolute value
# (in math it is called the L1 norm).
# "mse" stands for mean squared error; taking its square root gives the RMSE.
F.l1_loss(stacked_threes[1].float(), mean7), F.mse_loss(stacked_threes[1], mean7).sqrt()
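
The two distance measures do not always agree on which prediction is "further away". The sketch below uses made-up error vectors (`small_errors` and `one_big_error` are hypothetical names, not from the notebook) to suggest that RMSE penalizes one large mistake more heavily than the mean absolute difference does.

```python
import torch
import torch.nn.functional as F

target = torch.zeros(4)
small_errors = torch.tensor([0.5, 0.5, 0.5, 0.5])    # several moderate errors
one_big_error = torch.tensor([0.0, 0.0, 0.0, 2.0])   # a single large error

# Mean absolute difference (L1): both cases have the same total error spread over 4 values
l1_a = F.l1_loss(small_errors, target)
l1_b = F.l1_loss(one_big_error, target)

# Root mean squared error: squaring makes the single large error dominate
rmse_a = F.mse_loss(small_errors, target).sqrt()
rmse_b = F.mse_loss(one_big_error, target).sqrt()

print(l1_a, l1_b)
print(rmse_a, rmse_b)
```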

### PyTorch Tensors

A tensor is a container of data, almost always numerical data. You may already be familiar with matrices, which are 2D tensors: tensors are a generalization of matrices to an arbitrary number of dimensions (often called *axes*).
For example, scalars are 0D tensors, vectors are 1D tensors, and matrices are 2D tensors. If you pack matrices into a new array, you obtain a 3D tensor.

In [None]:
x = np.array([
    [
      [1, 2],
      [3, 4]
    ],
    [
      [1, 2],
      [3, 4]
    ]
])
x.ndim

A tensor is defined by three key attributes:
1. Rank: the number of axes
2. Shape: the length of each axis
3. Data type: the type of the data contained in the tensor

In general, the first axis in all data tensors you'll come across in deep learning will be the *sample axis* (sometimes called the *samples dimension*).

In [None]:
stacked_threes.shape
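
As a small illustration of the three attributes, the sketch below builds toy tensors of rank 0 to 3 and prints their rank, shape, and data type. The tensor names and sizes are made up for this example; `batch` merely mimics the layout of `stacked_threes`, with the sample axis first.

```python
import torch

scalar = torch.tensor(7.0)                # rank 0: a single number
vector = torch.tensor([1.0, 2.0, 3.0])    # rank 1
matrix = torch.ones(2, 3)                 # rank 2
batch = torch.zeros(5, 28, 28)            # rank 3: 5 samples, each 28x28

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("batch", batch)]:
    # ndim is the rank, shape holds the length of each axis, dtype is the data type
    print(name, t.ndim, tuple(t.shape), t.dtype)
```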

In [None]:
data = [[1,2,3],[4,5,6], [7,8,9]]
tns = tensor(data)         
tns

Select a row (tensors are 0-indexed, so `1` refers to the second row):

In [None]:
tns[1]

Select a column (`:` selects every row, and `1` selects the second column):

In [None]:
tns[:, 1]

You can combine these with Python slice syntax (`[start:end]`, with `end` being excluded) to select part of a row or column:

In [None]:
tns[1, 1:3] 

And you can use the standard operators such as `+`, `-`, `*`, `/`:

In [None]:
tns+1

Tensors have a type, and will automatically change it as needed, for example from `int` to `float`:

In [None]:
print(tns.type())
tns1 = tns*1.5
print(tns1.type())

## Computing Metrics Using Broadcasting

Recall that a metric is a measurement of how good the model is using the validation set, chosen for human consumption. This is a number that is calculated based on the prediction of our model, and the correct labels in our dataset. 

In [None]:
valid_threes = (path/'valid'/'3').ls().sorted()
valid_sevens = (path/'valid'/'7').ls().sorted()
valid_three_tensors = [tensor(Image.open(o)) for o in valid_threes]
valid_seven_tensors = [tensor(Image.open(o)) for o in valid_sevens]

valid_stacked_threes = torch.stack(valid_three_tensors).float()/255
valid_stacked_sevens = torch.stack(valid_seven_tensors).float()/255
# 3s validation set of 1,010 images of size 28×28, 
# 7s validation set of 1,028 images of size 28×28.
valid_stacked_threes.shape, valid_stacked_sevens.shape

Let's write a function that calculates the mean absolute error:

In [None]:
def mnist_distance(a, b):
    return (a-b).abs().mean((-1,-2))
# Calculate distance between arbitrary three and 'ideal' three mean3.
# Recall that mean3 was calculated using stacked tensors of threes of training set 
# and calculating mean value of each pixel. 
mnist_distance(stacked_threes[1], mean3)  

In order to calculate a metric for overall accuracy, we'll need to calculate the distance to the ideal 3 for _every_ image in the validation set.
We can use the `mnist_distance()` function, designed for comparing two single images, but pass in `valid_stacked_threes` as an argument. Instead of complaining about shapes not matching, it returns the distance for every single image as a vector (a rank-1 tensor). See Appendix A for details on broadcasting.

In [None]:
valid_3_dist = mnist_distance(valid_stacked_threes, mean3)
valid_3_dist, valid_3_dist.shape

In [None]:
(valid_stacked_threes - mean3).shape

We can figure out whether an image is a 3 or not by using the following logic: if the distance between the digit in question and the ideal 3 is less than the distance to the ideal 7, then it's a 3. This function will automatically do broadcasting and be applied elementwise, just like all PyTorch functions and operators:

In [None]:
def is_3(x): 
    return mnist_distance(x,mean3) < mnist_distance(x,mean7)
print(is_3(stacked_threes[1]), is_3(stacked_threes[1]).float())
print(is_3(stacked_sevens[1]), is_3(stacked_sevens[1]).float())

In [None]:
accuracy_3s =      is_3(valid_stacked_threes).float() .mean()
accuracy_7s = (1 - is_3(valid_stacked_sevens).float()).mean()

accuracy_3s,accuracy_7s,(accuracy_3s+accuracy_7s)/2

This looks like a pretty good start! We're getting over 90% accuracy on both 3s and 7s, and we've seen how to define a metric conveniently using broadcasting.

But let's be honest: 3s and 7s are very different-looking digits. And we're only classifying 2 out of the 10 possible digits so far. So we're going to need to do better!

To do better, perhaps it is time to try a system that does some real learning—that is, that can automatically modify itself to improve its performance. In other words, it's time to talk about the training process, and SGD.

#### Appendix A: Broadcast

In [None]:
def mnist_distance(a, b):
    return (a-b).abs().mean((-1,-2))
valid_3_dist = mnist_distance(valid_stacked_threes, mean3)

Instead of complaining about shapes not matching, it returns the distance for every single image as a vector (a rank-1 tensor).
The magic trick is that PyTorch, when it tries to perform a simple subtraction operation between two tensors of different ranks, will use broadcasting. That is, it will automatically expand the tensor with the smaller rank to have the same size as the one with the larger rank. After broadcasting so the two argument tensors have the same rank, PyTorch applies its usual logic for two tensors of the same rank: it performs the operation on each corresponding element of the two tensors, and returns the tensor result. 

In [None]:
tensor([1,2,3]) + tensor(1)

There are a couple of important points about how broadcasting is implemented, which make it valuable not just for expressivity but also for performance:

- PyTorch doesn't *actually* copy `mean3` 1,010 times. It *pretends* it were a tensor of that shape, but doesn't actually allocate any additional memory.
- It does the whole calculation in C (or, if you're using a GPU, in CUDA, the equivalent of C on the GPU), tens of thousands of times faster than pure Python (up to millions of times faster on a GPU!).
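
One way to get a feel for the first point is `Tensor.expand`, which builds the "pretend" larger tensor explicitly as a view. The sketch below uses toy tensors (`batch` and `mean` are stand-ins for `valid_stacked_threes` and `mean3`) and is only an illustration of the idea, not a description of what the subtraction does internally.

```python
import torch

batch = torch.arange(12.0).reshape(3, 2, 2)   # stands in for a stack of 3 images
mean = torch.ones(2, 2)                       # stands in for mean3

# Broadcasting: the rank-2 tensor is treated as if it had shape (3, 2, 2)
diff = batch - mean
print(diff.shape)

# expand() creates that larger tensor as a view, without copying the data
expanded = mean.expand(3, 2, 2)
print(expanded.stride())                        # stride 0 on the new first axis
print(expanded.data_ptr() == mean.data_ptr())   # same underlying memory as `mean`

# Subtracting the expanded view gives the same result as relying on broadcasting
print(torch.equal(diff, batch - expanded))
```

A stride of 0 on the first axis means that moving from one "copy" to the next does not move in memory at all, which is why no extra storage is needed.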

NOTE: The tuple `(-1,-2)` represents a range of axes. In Python, `-1` refers to the last element, and `-2` refers to the second-to-last. So in this case, this tells PyTorch that we want to take the mean ranging over the values indexed by the last two axes of the tensor.
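
A quick check on a toy tensor may help here. In this sketch, `batch` is a made-up stack of three 2×2 "images"; taking the mean over axes `(-1, -2)` leaves one value per image, and for a rank-3 tensor it matches naming the axes as `(1, 2)` or flattening each image first.

```python
import torch

batch = torch.arange(12.0).reshape(3, 2, 2)   # 3 toy "images" of size 2x2

per_image = batch.mean((-1, -2))      # mean over the last two axes
explicit = batch.mean((1, 2))         # the same axes, counted from the front
flattened = batch.view(3, -1).mean(1) # flatten each image, then average

print(per_image.shape)   # one value per image
print(per_image)
```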


## Stochastic Gradient Descent (SGD)

### An End-to-End SGD Example

In [None]:
time = torch.arange(0,20).float()
time

In [None]:
speed = torch.randn(20)*3 + 0.75*(time-9.5)**2 + 1
plt.scatter(time,speed);

Let's try to create a model for the data above. We need to find a quadratic function that predicts these values accurately.

In [None]:
# Prediction function
def f(t, params):
    a,b,c = params
    return a*(t**2) + (b*t) + c

Next, we choose a loss function, which returns a value based on a prediction and a target. For continuous data, it's common to use *mean squared error*.

In [None]:
# Loss function
def mse(preds, targets): 
    return ((preds-targets)**2).mean()
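
As a sanity check, the hand-written `mse` should agree with PyTorch's built-in `F.mse_loss`. The sketch below repeats the definition so that it runs on its own, and uses three made-up predictions and targets.

```python
import torch
import torch.nn.functional as F

def mse(preds, targets):
    return ((preds - targets) ** 2).mean()

preds = torch.tensor([1.0, 2.0, 4.0])     # toy predictions
targets = torch.tensor([1.0, 3.0, 2.0])   # toy targets

# Squared errors are 0, 1 and 4, so the mean is 5/3
ours = mse(preds, targets)
builtin = F.mse_loss(preds, targets)
print(ours, builtin)
```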

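Let's work through our 7-step process: initialize the parameters, calculate the predictions, calculate the loss, calculate the gradients, step the weights, repeat the process, and stop.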

#### Step 1: Initialize the parameters

In [None]:
# Initialize the parameters to random values
params = torch.randn(3).requires_grad_()

In [None]:
#hide
orig_params = params.clone()
orig_params

#### Step 2: Calculate the predictions

In [None]:
preds = f(time, params)

Let's create a little function to see how close our predictions are to our targets, and take a look:

In [None]:
def show_preds(preds, ax=None):
    if ax is None: ax=plt.subplots()[1]
    ax.scatter(time, speed)
    ax.scatter(time, to_np(preds), color='red')
    ax.set_ylim(-300,100)

In [None]:
show_preds(preds)

#### Step 3: Calculate the loss

In [None]:
loss = mse(preds, speed)
loss

#### Step 4: Calculate the gradients

In [None]:
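loss.backward()
print(params.grad)           # Gradient of the loss with respect to each parameter
print(params.grad * 1e-5)    # The same gradient scaled by a learning rate of 1e-5
print(orig_params)           # Initial values saved in Step 1
params                       # Unchanged so far: no step has been taken yet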
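
To demystify what `backward` computes here, the gradients of the MSE for a quadratic can also be derived by hand with the chain rule and compared against autograd. The sketch below is self-contained and uses noise-free targets and a fixed seed (both assumptions made only so the check is reproducible); the names `t`, `y`, `err`, and `manual` are not from the notebook.

```python
import torch

torch.manual_seed(0)
t = torch.arange(0, 20).float()
y = 0.75 * (t - 9.5) ** 2 + 1               # noise-free targets, for a reproducible check
params = torch.randn(3).requires_grad_()

a, b, c = params
preds = a * t**2 + b * t + c
loss = ((preds - y) ** 2).mean()
loss.backward()                              # autograd fills in params.grad

# Chain rule by hand: d(loss)/da = mean(2*err*t^2), d/db = mean(2*err*t), d/dc = mean(2*err)
with torch.no_grad():
    err = preds - y
    manual = torch.stack([(2 * err * t**2).mean(),
                          (2 * err * t).mean(),
                          (2 * err).mean()])

print(params.grad)
print(manual)
```

The two printed tensors should match up to floating-point rounding.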

#### Step 5: Step the weights

In [None]:
# w := w - α [dJ(w)/dw] where α is learning rate
lr = 1e-5
params.data -= lr * params.grad.data
params.grad = None
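
The same update can also be written with `torch.no_grad()` instead of `.data`, which is a common idiom in more recent PyTorch code. The sketch below is a self-contained variant under a few assumptions: noise-free targets, parameters starting at zero rather than at random values, and a helper named `loss_fn` that is introduced only for this example. It checks that one step with this learning rate lowers the loss.

```python
import torch

t = torch.arange(0, 20).float()
y = 0.75 * (t - 9.5) ** 2 + 1                  # noise-free targets (assumption)
params = torch.zeros(3, requires_grad=True)    # fixed start, for reproducibility
lr = 1e-5

def loss_fn(p):
    # Quadratic model followed by mean squared error
    a, b, c = p
    return ((a * t**2 + b * t + c - y) ** 2).mean()

loss_before = loss_fn(params)
loss_before.backward()

# w := w - lr * gradient, done without recording the update in the autograd graph
with torch.no_grad():
    params -= lr * params.grad
params.grad = None                             # reset so gradients do not accumulate

loss_after = loss_fn(params)
print(loss_before.item(), loss_after.item())
```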

In [None]:
preds = f(time,params)
mse(preds, speed)

In [None]:
show_preds(preds)

We need to repeat this a few times, so we'll create a function to apply one step:

In [None]:
def apply_step(params, prn=True):
    preds = f(time, params)
    loss = mse(preds, speed)
    loss.backward()
    params.data -= lr * params.grad.data      # w := w - α [dJ(w)/dw] where α is learning rate
    params.grad = None
    if prn: print(loss.item())
    return preds

#### Step 6: Repeat the process 

In [None]:
for i in range(10): 
    apply_step(params)

In [None]:
_,axs = plt.subplots(1,4,figsize=(12,3))
for ax in axs: show_preds(apply_step(params, False), ax)
plt.tight_layout()

#### Step 7: Stop

We just decided to stop after 10 epochs arbitrarily. In practice, we would watch the training and validation losses and our metrics to decide when to stop, as we've discussed.
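
As a rough illustration of "watching the validation loss", the sketch below holds out every other time step as a validation set and stops once the validation loss no longer improves by at least `min_delta`. Everything here is an assumption made for the example rather than part of the notebook: the even/odd split, the zero initialization, the helper `predict`, and the settings `max_steps` and `min_delta`.

```python
import torch

torch.manual_seed(0)
t = torch.arange(0, 20).float()
y = torch.randn(20) * 3 + 0.75 * (t - 9.5) ** 2 + 1

# Hypothetical split: even time steps for training, odd ones for validation
train_idx, valid_idx = torch.arange(0, 20, 2), torch.arange(1, 20, 2)

def predict(p, t):
    a, b, c = p
    return a * t**2 + b * t + c

def mse(preds, targets):
    return ((preds - targets) ** 2).mean()

params = torch.zeros(3, requires_grad=True)
lr, max_steps, min_delta = 1e-5, 1000, 1e-3    # hypothetical settings
best_valid = float("inf")
steps_run = 0

for step in range(max_steps):
    loss = mse(predict(params, t[train_idx]), y[train_idx])
    loss.backward()
    with torch.no_grad():
        params -= lr * params.grad
        valid = mse(predict(params, t[valid_idx]), y[valid_idx]).item()
    params.grad = None
    steps_run = step + 1
    if best_valid - valid < min_delta:         # validation loss stopped improving
        break
    best_valid = valid

print(steps_run, best_valid)
```

Real training loops usually add some patience before stopping, so that one noisy step does not end training too early.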

#### All together

In [None]:
# Prediction function
def f(t, params):
    a,b,c = params
    return a*(t**2) + (b*t) + c

# Loss function
def mse(preds, targets): 
    return ((preds-targets)**2).mean()

time = torch.arange(0,20).float()
speed = torch.randn(20)*3 + 0.75*(time-9.5)**2 + 1   # Labels/y/original values
lr = 1e-5                                            # Learning rate (same value as in Step 5)

def apply_step(params, prn=True):
    preds = f(time, params)
    loss = mse(preds, speed)
    loss.backward()
    params.data -= lr * params.grad.data
    params.grad = None
    if prn: print(loss.item())
    return preds

params = torch.randn(3).requires_grad_()  # Randomly initialize the parameters
for i in range(10): 
    apply_step(params)
    
_,axs = plt.subplots(1,4,figsize=(12,3))
for ax in axs: show_preds(apply_step(params, False), ax)
plt.tight_layout()