In [None]:
from torchvision import datasets, transforms 
data_path = './datasets/'
cifar10 = datasets.CIFAR10(root = data_path, download=True, train=True)
cifar10_val = datasets.CIFAR10(root = data_path, download=True, train=False)

In [None]:
# Print the method resolution order of our dataset instance. 
# Notice that the dataset is returned as a subclass of torch.utils.data.dataset.Dataset base class. 
type(cifar10).__mro__, cifar10.__repr__()

In [None]:
# plot one of the images 
import matplotlib.pyplot as plt 

img, label = cifar10[99]
plt.imshow(img)
plt.show()

In [None]:
class_names = cifar10.classes
class_names[label]

### Transform image data on instantiation 

Now let's use the `transforms` module from torchvision to convert these PIL images to PyTorch tensors. 

- This module defines a set of composable function-like objects that can be passed as an argument to a torchvision dataset upon instantiation, and they perform transformations on the data after it is loaded but before it is returned by `__getitem__`


In [None]:
from torchvision import transforms 
dir(transforms)

In [None]:
to_tensor = transforms.ToTensor()
img_t = to_tensor(img)
img_t.shape

The image has been turned into a 3x32x32 tensor, that is, a 3-channel (RGB) image of 32x32 pixels.

In [None]:
# instantiate the dataset with the transform 
tensor_cifar10 = datasets.CIFAR10(root=data_path, train=True, download=False, transform = transforms.ToTensor())

In [None]:
img_t, _ = tensor_cifar10[99]
type(img_t), type(img)

In [None]:
img_t.dtype, img_t.shape

The `ToTensor()` transform converts each channel to 32-bit floating point and scales the 8-bit pixel values (`0` to `255`) down to the range `0.0` to `1.0`.
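
As a rough illustration of that conversion, the sketch below reproduces the two steps by hand on a tiny made-up image. The tensor `img_hwc` is hypothetical data standing in for a PIL image, and this is not torchvision's actual implementation, only the same arithmetic.

```python
import torch

# Hypothetical 2x2 RGB image laid out as H x W x C with uint8 values in [0, 255].
img_hwc = torch.tensor(
    [[[0, 128, 255], [255, 255, 255]],
     [[10, 20, 30], [0, 0, 0]]],
    dtype=torch.uint8,
)

# Move channels first (C x H x W), convert to float32, and rescale to [0.0, 1.0].
img_chw = img_hwc.permute(2, 0, 1).to(torch.float32) / 255.0

print(img_chw.shape, img_chw.dtype)
print(img_chw.min().item(), img_chw.max().item())
```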

In [None]:
img_t.min(), img_t.max()

In [None]:
# plotting img_t: a tensor of image data 
plt.imshow(img_t.permute(1,2,0))
plt.show()

Transforms can be chained using `transforms.Compose`, and they can handle normalization and data augmentation transparently. 
- It is good practice to normalize the dataset so that each channel has zero mean and unit standard deviation. 
    If we choose activation functions that are roughly linear around `0` plus or minus `1` (or `2`), keeping the data in that same range makes it more likely that neurons have nonzero gradients and hence learn sooner. 
- Normalizing each channel so that it has the same distribution also ensures that channel information can be mixed and updated through gradient descent using the same learning rate. 
- `transforms.Normalize` takes the per-channel mean and standard deviation as arguments. 
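
The per-channel normalization described in the list above can be sketched by hand. The batch here is random data standing in for real images (an assumption for illustration), and the arithmetic is written out with broadcasting rather than with `transforms.Normalize`.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-in for a stack of images: N x C x H x W, values in [0, 1].
batch = torch.rand(100, 3, 8, 8)

# Per-channel statistics: reduce over every dimension except the channel one.
mean = batch.mean(dim=(0, 2, 3))
std = batch.std(dim=(0, 2, 3))

# Subtract the channel mean and divide by the channel std, broadcast over N, H, W.
normalized = (batch - mean[None, :, None, None]) / std[None, :, None, None]

print(normalized.mean(dim=(0, 2, 3)))  # close to 0 for every channel
print(normalized.std(dim=(0, 2, 3)))   # close to 1 for every channel
```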


### Normalizing our tensors

In [None]:
# Compute the mean and standard deviation of each channel across the dataset,
# so that transforms.Normalize can apply: v_n[c] = (v[c] - mean[c]) / stdev[c]
import torch 

# Stack all images along a new last dimension: the result is 3 x 32 x 32 x N
imgs = torch.stack([img_t for img_t, _ in tensor_cifar10], dim=3)
imgs.shape

`3` channels (RGB), height `32`, width `32`, and `50000` images stacked along the last dimension. 

In [None]:
imgs.view(3,-1).mean(dim=1), imgs.view(3,-1).std(dim=1)

In [None]:
transforms.Normalize(mean = (0.4914, 0.4822, 0.4465), std=(0.2470, 0.2435, 0.2616))

In [None]:
transformed_cifar10 = datasets.CIFAR10(
    root = data_path, 
    download=False, 
    train=True, 
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize(mean = (0.4914, 0.4822, 0.4465), std=(0.2470, 0.2435, 0.2616)),
    ])
)

Plotting the normalized image 

In [None]:
img_t, label = transformed_cifar10[99]
plt.imshow(img_t.permute(1,2,0))
plt.show()
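
Because normalized values fall outside the `0.0` to `1.0` range, `plt.imshow` clips them and the plot looks distorted. One possible way to view the original colors again is to undo the normalization first. The helper `denormalize` below is a hypothetical name introduced for this sketch, and the round trip uses a random image rather than CIFAR-10 data.

```python
import torch

# Statistics copied from the Normalize call used for the dataset.
mean = torch.tensor([0.4914, 0.4822, 0.4465])
std = torch.tensor([0.2470, 0.2435, 0.2616])

def denormalize(img_n):
    """Undo per-channel normalization on a C x H x W tensor (illustrative helper)."""
    img = img_n * std[:, None, None] + mean[:, None, None]
    return img.clamp(0.0, 1.0)

# Round trip on a hypothetical random image with values in [0, 1].
torch.manual_seed(0)
original = torch.rand(3, 32, 32)
normalized = (original - mean[:, None, None]) / std[:, None, None]
restored = denormalize(normalized)

print(torch.allclose(restored, original, atol=1e-5))
```

The restored tensor could then be passed to `plt.imshow(restored.permute(1, 2, 0))` as in the earlier plotting cells.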

### Building the Dataset, testing the model. 

In [None]:
class_names = ['airplane', 'bird']
label_map = {0:0, 2:1}
cifar2 = [(img, label_map[label]) for img, label in cifar10 if label in [0,2]]
cifar2_val = [(img, label_map[label]) for img, label in cifar10_val if label in [0,2]]

In [None]:
import torch.nn as nn 
n_out = 2
model = nn.Sequential(
    nn.Linear(3072, 512),
    nn.Tanh(),
    nn.Linear(512, n_out),
    nn.Softmax(dim=1)
)

The output is categorical (airplane or bird), so we represent it with a one-hot encoding. In the ideal case, the network would output `torch.tensor([1.0, 0.0])` for an airplane and `torch.tensor([0.0, 1.0])` for a bird, but since our classifier will not be perfect, we can expect the network to output something in between. **We can interpret our output as probabilities, i.e. the first entry is the probability of 'airplane', and the second is the probability of 'bird'.** 
- Each element of the output must be in the `[0.0, 1.0]` range. 
- The elements of the output must add up to 1.0. 

In other words, a probability of an outcome cannot be less than 0 or greater than 1, and we are certain that one of the two outcomes will occur. 

**Softmax** is a function that takes a vector of values and produces another vector of the same dimension, where the values satisfy the constraints listed above. 

In [None]:
def softmax(x):
    return torch.exp(x) / torch.exp(x).sum()
x = torch.tensor([1.0, 2.0, 3.0])
softmax(x), softmax(x).sum()

In [None]:
import torch.nn as nn 
x = torch.tensor([[1.0,2.0,3.0],[1.0,2.0,3.0]])
# In this example, each row is a different input vector.
# dim=1 normalizes across the columns within each row, so every row sums to 1.
softmax = nn.Softmax(dim=1)
softmax(x)
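
To make the effect of the `dim` argument concrete, the sketch below applies softmax to the same input along both dimensions and checks which sums come out as 1.

```python
import torch
import torch.nn as nn

x = torch.tensor([[1.0, 2.0, 3.0],
                  [1.0, 2.0, 3.0]])

# dim=1: normalize within each row, so every row sums to 1.
rows = nn.Softmax(dim=1)(x)
# dim=0: normalize within each column, so every column sums to 1.
cols = nn.Softmax(dim=0)(x)

print(rows.sum(dim=1))
print(cols.sum(dim=0))
print(cols)  # the two rows are identical, so every entry is 0.5
```

For a batch of inputs where each row is one sample, `dim=1` is the choice that yields one probability distribution per sample.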

In [None]:
img, _ = cifar2[0]
img

In [None]:
img_t_t = to_tensor(img)
plt.imshow(img_t_t.permute(1,2,0))
plt.show()

Note, our model expects 3072 features (3x32x32) in the input. Also, `nn` works with data organized into batches. So after we obtain a 1d tensor of length 3072, unsqueeze it to produce a batch dimension. 

In [None]:
# test the model 
img_batch = img_t_t.view(-1).unsqueeze(0)
img_batch.size()


In [None]:
out = model(img_batch)
out

The output is meaningless because the weights and biases of our linear layers have not been trained. PyTorch initializes their elements randomly, with small values that lie well within -1.0 and 1.0. 

Furthermore, the model does not yet know which output probability is which. The loss function gives these two numbers their meaning through backpropagation during training, and since no training has run, they have no meaning yet. 
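
The initialization range can be checked directly. The sketch below assumes the default `nn.Linear` initialization documented for current PyTorch releases, which draws weights and biases uniformly from plus or minus `1 / sqrt(in_features)`; older or customized setups may differ.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(3072, 512)

# Assumed default init: U(-bound, bound) with bound = 1 / sqrt(in_features).
bound = 1.0 / math.sqrt(layer.in_features)

print(f"bound        = {bound:.4f}")
print(f"max |weight| = {layer.weight.abs().max().item():.4f}")
print(f"max |bias|   = {layer.bias.abs().max().item():.4f}")
```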

### Loss Function for Classification 

After training, we will be able to get the label as an index by computing the argmax of the output probabilities. That is, the index at which we get the maximum probability. 

In [None]:
torch.max(out, dim=1)
# index

We will be using the Negative Log Likelihood loss function, conveniently provided in the `nn` module. 

- We want to maximize the probability associated with the correct class (the likelihood). Note that what matters is that the correct class's probability is higher than the others (winning the softmax ranking). We are **not** primarily interested in driving this probability exactly to 1, as a loss like MSE would. 

- In the following example, we have two predictions from our model, in which the correct classification of the input is associated with the second index of the output. In both cases, training adjusts the model parameters to maximize the likelihood. 

    - Prediction 1: `[0.40, 0.60]`
    - Prediction 2: `[0.60, 0.40]`

    - In the first prediction, the likelihood (the probability assigned to the correct class, `0.60`) is larger than the probability at the first index, so we want a low loss (penalty) for this correct classification. 
    - In the second prediction, the likelihood (`0.40`) is lower than the probability at the first index, so we want a high loss to correct this misclassification. 

- From a loss perspective, we want to minimize the negative log-likelihood. 
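
The two predictions discussed above can be worked through numerically. This is a plain-Python sketch of the per-sample negative log-likelihood; the helper name `nll` is introduced here for illustration only.

```python
import math

def nll(probs, target):
    """Negative log-likelihood of the target class for one sample."""
    return -math.log(probs[target])

target = 1  # the correct class is the second entry

loss_correct = nll([0.40, 0.60], target)
loss_wrong = nll([0.60, 0.40], target)

print(f"{loss_correct:.4f}")  # 0.5108
print(f"{loss_wrong:.4f}")    # 0.9163
```

The misclassified prediction receives the larger penalty, and the loss only reaches zero when the correct class is assigned probability 1.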



- Our input to the loss function needs to be a tensor of log probabilities. 
    - Therefore we should use `nn.LogSoftmax`
        - Softmax providing us with the probabilities, 
        - Log providing us with a numerically stable Logarithm of these probabilities. 
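
The numerical-stability point can be illustrated with a small sketch. The logits below are hypothetical values picked to force an underflow in 32-bit floats; with typical, small logits both routes would agree.

```python
import torch
import torch.nn.functional as F

# Hypothetical logits with a large spread, chosen to trigger underflow.
logits = torch.tensor([[0.0, 200.0]])

# Naive route: softmax underflows to exactly 0 for the first class, so log gives -inf.
naive = torch.log(torch.softmax(logits, dim=1))

# Fused route: log_softmax computes the same quantity without leaving log space.
stable = F.log_softmax(logits, dim=1)

print(naive)
print(stable)
```

An infinite log-probability would turn into an infinite loss, which is why the fused `LogSoftmax` is preferred over taking the log of `Softmax` outputs.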



In [None]:
# Let's rewrite the model to output log probabilities
import torch.nn as nn 
model = nn.Sequential(
    nn.Linear(3072, 512),
    nn.Tanh(),
    nn.Linear(512, 2),
    nn.LogSoftmax(dim=1)
)
loss = nn.NLLLoss()
img, label = cifar2[0]
img = to_tensor(img)
out = model(img.view(-1).unsqueeze(0))
loss(out, torch.tensor([label]))


**Note**, we are introducing randomness in our gradient descent by estimating the gradient on a few samples at a time. We are working on small batches of shuffled data. It turns out that following gradients estimated over minibatches, which are poorer approximations of gradients estimated across the whole dataset, helps convergence and prevents the optimization process from getting stuck in local minima it encounters along the way. 

- Since gradients estimated over minibatches are poorer approximations of gradients estimated across the whole dataset, we want to use a reasonably small learning rate. 
- Shuffling the data at each epoch is an attempt to help ensure that the sequence of gradients (from the minibatch) is representative of the whole dataset. 
- Minibatches are typically a constant size that we need to set prior to training, just like the learning rate. (hyperparameters)
- Below we use minibatches of size 64. The `DataLoader` class from `torch.utils.data` helps with shuffling and organizing the data into minibatches. 
- `DataLoader` also provides a range of different sampling strategies. 
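
As a small illustration of the batching behaviour described in the list above, the sketch below runs a `DataLoader` over a toy `TensorDataset`. The dataset is made up for the example; the sizes are chosen so that the number of samples is not a multiple of the batch size.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy dataset: 10 samples with 4 features each, plus integer labels.
features = torch.arange(40, dtype=torch.float32).view(10, 4)
labels = torch.arange(10) % 2
dataset = TensorDataset(features, labels)

# shuffle=True draws a fresh sample order on every pass over the loader.
loader = DataLoader(dataset, batch_size=4, shuffle=True)

batch_sizes = [x.shape[0] for x, y in loader]
print(batch_sizes)  # [4, 4, 2]: the last minibatch is smaller
```

Code that consumes the loader should therefore read the batch size from the tensor itself rather than assume every batch is full.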

### `DataLoader`

In [1]:
import torch.optim as optim
from torchvision import datasets, transforms 
import torch 
import torch.nn as nn 
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


In [3]:
data_path = './datasets/'
cifar10 = datasets.CIFAR10(root = data_path, 
                           train=True, 
                           download=False, 
                           transform = transforms.Compose(
                                    [transforms.ToTensor(), 
                                    transforms.Normalize(mean = (0.4914, 0.4822, 0.4465), std=(0.2470, 0.2435, 0.2616))]
                            ))
cifar10_val = datasets.CIFAR10(root = data_path, 
                               train=False, 
                               download=False, 
                                transform = transforms.Compose(
                                    [transforms.ToTensor(), 
                                    transforms.Normalize(mean = (0.4914, 0.4822, 0.4465), std=(0.2470, 0.2435, 0.2616))]
                            ))

In [4]:
class_names = ['airplane', 'bird']
label_map = {0:0, 2:1}

# to_tensor = transforms.ToTensor()
# cifar2 = [(to_tensor(img), label_map[label]) for img, label in cifar10 if label in [0,2]]
# cifar2_val = [(to_tensor(img), label_map[label]) for img, label in cifar10_val if label in [0,2]]

cifar2 = [(img, label_map[label]) for img, label in cifar10 if label in [0,2]]
cifar2_val = [(img, label_map[label]) for img, label in cifar10_val if label in [0,2]]

In [6]:
img, label = cifar2[0]
img.view(-1).unsqueeze(-1).shape

torch.Size([3072, 1])

In [5]:
# note, dataloader can be iterated over 
train_loader = torch.utils.data.DataLoader(cifar2, batch_size=64, shuffle=True, pin_memory=True)
val_loader = torch.utils.data.DataLoader(cifar2_val, batch_size=64, shuffle=False, pin_memory=True)

In [None]:
# Model Definition 
model = nn.Sequential(
    nn.Linear(3072, 1024),
    nn.Tanh(),
    nn.Linear(1024, 512),    
    nn.Tanh(),
    nn.Linear(512, 128),    
    nn.Tanh(),
    nn.Linear(128, 2),
    nn.LogSoftmax(dim=1)
)

learning_rate = 1e-2
optimizer = optim.SGD(model.parameters(), lr=learning_rate)
loss_fn = nn.NLLLoss()
n_epochs = 100 

# Training Loop 
for epoch in range(n_epochs):
    
    for imgs, labels in train_loader:
        batch_size = imgs.shape[0]
        outputs = model(imgs.view(batch_size, -1))
        loss = loss_fn(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


    # Report the loss of the last minibatch in this epoch
    print(f"Epoch: {epoch}", f"Loss: {loss.item():.4f}")

In [None]:
# Calculate prediction accuracy on the validation set
correct_predictions = 0
total = 0

with torch.no_grad():
    for imgs, labels in val_loader:
        # Flatten each image; use the actual batch size so the last,
        # smaller batch is counted too
        outputs = model(imgs.view(imgs.shape[0], -1))
        # The index of the largest log-probability is the predicted class
        _, predicted = torch.max(outputs, dim=1)

        total += labels.shape[0]
        correct_predictions += int((predicted == labels).sum())

print(correct_predictions / total)


We achieve roughly 80% prediction accuracy with the following architecture: 

- linear 
- activation 
- linear 

Next we taper the number of features more gently toward the output, in the hope that the intermediate layers will do a better job of squeezing information into increasingly shorter intermediate outputs. Here is the next architecture, which gave us 83% prediction accuracy:

- linear 3072 x 1024
- activation 
- linear 1024 x 512
- activation
- linear 512 x 128
- activation 
- linear 128 x 2



Since `nn.LogSoftmax()` followed by `nn.NLLLoss()` is equivalent to `nn.CrossEntropyLoss()`, let's change the model to use that. The only difference is that `nn.CrossEntropyLoss` takes raw scores (logits) as input, rather than log probabilities computed with `nn.LogSoftmax()`. In other words, our model will no longer output (log) probabilities, so if we want probabilities, we have to pass the output through `nn.Softmax()`.
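
The equivalence can be checked numerically. The logits and targets in this sketch are random, made-up values, not outputs of the model trained here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical raw scores (logits) for 5 samples and 2 classes.
logits = torch.randn(5, 2)
targets = torch.tensor([0, 1, 1, 0, 1])

# Route 1: explicit log-probabilities followed by NLLLoss.
loss_nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)

# Route 2: CrossEntropyLoss applied directly to the logits.
loss_ce = nn.CrossEntropyLoss()(logits, targets)

# Probabilities are still available on demand from the logits.
probs = torch.softmax(logits, dim=1)

print(loss_nll.item(), loss_ce.item())
print(probs.sum(dim=1))
```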

In [9]:
# Model Definition 
model = nn.Sequential(
    nn.Linear(3072, 1024),
    nn.Tanh(),
    nn.Linear(1024, 512),    
    nn.Tanh(),
    nn.Linear(512, 128),    
    nn.Tanh(),
    nn.Linear(128, 2),
).to(device)

learning_rate = 1e-2
optimizer = optim.SGD(model.parameters(), lr=learning_rate)
loss_fn = nn.CrossEntropyLoss()
n_epochs = 100 

# Training Loop 
for epoch in range(n_epochs):
    
    for imgs, labels in train_loader:
        imgs = imgs.to(device, non_blocking=True)
        labels = labels.to(device, non_blocking=True)
        batch_size = imgs.shape[0]
        outputs = model(imgs.view(batch_size, -1))
        loss = loss_fn(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"Epoch: {epoch}", f"Loss: {loss.item():.4f}")

Epoch: 0 Loss: 0.4158
Epoch: 1 Loss: 0.5512
Epoch: 2 Loss: 0.4369
Epoch: 3 Loss: 0.2352
Epoch: 4 Loss: 0.4977
Epoch: 5 Loss: 0.3340
Epoch: 6 Loss: 0.6549
Epoch: 7 Loss: 0.3093
Epoch: 8 Loss: 0.3780
Epoch: 9 Loss: 0.3420
Epoch: 10 Loss: 0.4732
Epoch: 11 Loss: 0.3006
Epoch: 12 Loss: 0.5349
Epoch: 13 Loss: 0.7789
Epoch: 14 Loss: 0.5304
Epoch: 15 Loss: 0.1487
Epoch: 16 Loss: 0.4387
Epoch: 17 Loss: 0.3058
Epoch: 18 Loss: 0.2961
Epoch: 19 Loss: 0.4051
Epoch: 20 Loss: 0.3337
Epoch: 21 Loss: 0.2484
Epoch: 22 Loss: 0.3913
Epoch: 23 Loss: 0.3757
Epoch: 24 Loss: 0.2041
Epoch: 25 Loss: 0.0762
Epoch: 26 Loss: 0.3186
Epoch: 27 Loss: 0.1389
Epoch: 28 Loss: 0.7340
Epoch: 29 Loss: 0.2120
Epoch: 30 Loss: 0.8572
Epoch: 31 Loss: 0.1143
Epoch: 32 Loss: 0.2143
Epoch: 33 Loss: 0.0343
Epoch: 34 Loss: 0.1618
Epoch: 35 Loss: 0.3736
Epoch: 36 Loss: 0.1013
Epoch: 37 Loss: 0.0572
Epoch: 38 Loss: 0.0504
Epoch: 39 Loss: 0.0530
Epoch: 40 Loss: 0.0580
Epoch: 41 Loss: 0.7399
Epoch: 42 Loss: 0.0250
Epoch: 43 Loss: 0.215

In [12]:
# calculate Prediction accuracy using validation set 
correct_predictions=0
total=0

with torch.no_grad():
    for img, label in val_loader:

        # move validation tensors to GPU 
        img = img.to(device, non_blocking=True)
        label = label.to(device, non_blocking=True)

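        # NOTE: view(64, -1) assumes full batches. A final batch smaller than 64
        # does not reshape to [64, 3072] and is skipped, so it is not counted.
        input = img.view(64, -1)
        if input.size() != torch.Size([64, 3072]):
            continue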
        
        output = model(input)
        max_likelihood, predicted_label = torch.max(output, dim=1) 

        total += label.shape[0]
        correct_predictions += int((predicted_label == label).sum())

    print((correct_predictions/total))

0.8024193548387096


Note that our training loss drops close to zero, but our validation accuracy is only about 80%. What is happening here? We are overfitting the training data: the model is too complex, so it is 'memorizing' the training data rather than learning features that generalize. 

In [13]:
numel_list = [p.numel() for p in model.parameters() if p.requires_grad]
sum(numel_list), numel_list

(3737474, [3145728, 1024, 524288, 512, 65536, 128, 256, 2])
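
These counts follow directly from the layer shapes: each `nn.Linear(n_in, n_out)` contributes an `n_in * n_out` weight matrix and an `n_out` bias vector. A plain-Python sketch of the arithmetic, with the layer sizes copied from the model definition:

```python
# Layer sizes taken from the nn.Sequential definition of the model.
layer_sizes = [(3072, 1024), (1024, 512), (512, 128), (128, 2)]

counts = []
for n_in, n_out in layer_sizes:
    counts.append(n_in * n_out)  # weight matrix
    counts.append(n_out)         # bias vector

print(sum(counts), counts)
```

The first layer alone accounts for more than three million of the parameters, because every one of the 3072 input values is connected to every one of the 1024 hidden units.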

### Conclusion

We have a model, a dataset, and a training loop, and our model learns. However, due to a mismatch between our problem and our network structure, we end up overfitting our training data rather than learning the generalized features of what we want the model to detect. 

We have created a model that allows every pixel to be related to every other pixel in the image, regardless of their spatial arrangement. It is reasonable to assume, though, that pixels that are closer together are much more strongly related. This means we are training a classifier that is not translation-invariant, so we're forced to use a lot of capacity for learning translated replicas if we want to hope to do well on the validation set. The solution to our current set of problems is to change our model to use convolutional layers.

We have been treating 2D images as 1D data. 

**Translation invariance** means a model's prediction does not change much when the input is shifted in space. Since our model is not translation invariant, it cannot readily recognize that an airplane in the bottom right of an image is the same as an airplane in the top left of an image. 
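
A toy sketch of why a fully connected layer on flattened pixels lacks this property: the image, the patch, and the untrained layer below are all made up for illustration. Moving the same patch to another corner activates an entirely different set of input features.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical single-channel 8x8 "image" with a bright 2x2 patch in the top-left corner.
img = torch.zeros(1, 8, 8)
img[0, 0:2, 0:2] = 1.0

# Same patch moved to the bottom-right corner (circular shift by 6 pixels on both axes).
shifted = torch.roll(img, shifts=(6, 6), dims=(1, 2))

# After flattening, the patch lands on a completely different set of input features.
active = set(img.view(-1).nonzero().flatten().tolist())
active_shifted = set(shifted.view(-1).nonzero().flatten().tolist())
print(sorted(active), sorted(active_shifted))

# An untrained fully connected layer therefore treats them as unrelated inputs.
fc = nn.Linear(64, 2)
with torch.no_grad():
    out_a = fc(img.view(1, -1))
    out_b = fc(shifted.view(1, -1))
print(out_a, out_b)
```

Because the two inputs share no active features, whatever the layer learns about the patch in one position is stored in weights that play no role for the other position.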
