<a id="title"></a>
# Transfer Learning MNIST Classification Tutorial using PyTorch
***
## Learning Goals:
By the end of this tutorial, you will:
- load and transform the MNIST dataset
- load and modify a pretrained model
- train and evaluate a pretrained model

## Table of Contents
[Introduction](#intro) <br>
[0. Imports](#imports) <br>
[1. MNIST Dataset and Scaling](#mnist) <br>
[2. Load and Modify a Pretrained Model](#load) <br>
[3. Test Model Functionality](#test) <br>
[4. Set Training and Test Sets](#set) <br>
[5. Hyperparameters and Loading](#hyper) <br>
[6. Train Model](#train) <br>
[7. Plot Metrics](#plot) <br>
[8. Analyze Samples](#analyze) <br>
[9. Conclusions](#con) <br>
[Additional Resources](#add) <br>
[About this Notebook](#about) <br>
[Citations](#cite) <br>

## Introduction <a id="intro"></a>

The main purpose of this notebook is to demonstrate transfer learning in [PyTorch](https://pytorch.org/), a deep learning Python library. This tutorial is not an exhaustive introduction to machine learning and assumes the user is familiar with vocabulary (supervised vs. unsupervised, neural networks, loss functions, backpropagation, etc.) and methodology (model selection, feature selection, hyperparameter tuning, etc.). This notebook also assumes the user is familiar with convolutional neural networks (CNNs) and the [MNIST handwritten dataset](http://yann.lecun.com/exdb/mnist/). Look at [Additional Resources](#add) for more complete machine learning guides. The paragraphs below serve as a brief introduction to transfer learning.

[Transfer learning](https://en.wikipedia.org/wiki/Transfer_learning) is a method of machine learning that uses a well-trained model to solve a similar problem instead of the problem it was originally designed for. Training and tuning a deep model takes a lot of time, data, and computation, so it rarely makes sense to train every model from scratch when high-performing models already exist.

We can use a pretrained model as a "feature extractor" by freezing most of the layers in the model. Then, we can retrain the unfrozen layers after changing the number of output neurons to match our problem (e.g. changing the number of final classifications from 1000 to 10) and/or adding more hidden layers to the end of the pretrained model (e.g. changing the number of hidden fully connected layers from 1 to 3). At the end, we'll have a model that was initially designed for one task customized for our task.

For example, let's take a CNN trained to classify house cats and house dogs. The model will learn features that are useful for distinguishing between the two animals, such as ears, noses, etc. That model can transfer its knowledge to classify lynxes and wolves, animals that are similar to cats and dogs. The model can use its previous knowledge of ears, noses, etc. as a starting point for fine-tuning and produce good results quickly. In addition, you don't have to go through the trouble of training an entire model from scratch, which would dramatically increase training time and computation for a possible drop in performance.

To extrapolate from that example, extremely deep CNNs trained on an extensive diverse dataset will learn features general enough for a large number of computer vision problems, such as multiclass classification and object detection.

The two main uses for transfer learning are when you have data limits or computation limits, which are explained below:
- Data Limitations: most datasets for real-world problems are actually pretty small, making them prone to overfitting when trained on too deep a network. The most advantageous solution is to obtain more training data, but that can be taxing and time consuming. However, a model previously trained on millions of examples can be a good starting point since it already has well-defined features useful for many problems.

- Computation Limitations: training a deep network from scratch can take anywhere from hours to weeks depending on the scope of the problem at hand. Not everyone has the time or the resources to train deep models. However, since a pretrained model acts as a feature extractor, backpropagation is only performed on the last few layers, which is computationally inexpensive and faster than training from scratch.

**In this notebook, we will perform transfer learning on a pretrained network to classify MNIST handwritten digits using PyTorch.**

## 0. Imports <a id="imports"></a>

If you are running this notebook on Google Colab, you shouldn't have to install anything. If you are running this notebook in Jupyter, this notebook assumes you created the virtual environment defined in `environment.yml`. If not, close this notebook and run the following lines in a terminal window:

`conda env create -f environment.yml`

`conda activate deepwfc3_env`

We import the following libraries:
- *numpy* for handling arrays
- *matplotlib* for plotting
- *tqdm* for keeping track of loop speed
- *tensorflow* for accessing MNIST images 
- *torch* as our machine learning framework
- *torchvision* for loading models and data transforms
- *sklearn.metrics* for model evaluation
- *seaborn* for plotting confusion matrices

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm

import tensorflow as tf

import torch
from torch import nn
from torch.utils.data import DataLoader
import torchvision
from torchvision import transforms

from sklearn import metrics
import seaborn as sns

## 1. MNIST Dataset and Scaling<a id="mnist"></a>

The MNIST dataset is nicely packed in `tensorflow` as `np.arrays`, which is why we are grabbing our data from there instead of directly from `torch`. The data is unpacked as `x_train` for training features, `y_train` for training labels, `x_test` for testing features, and `y_test` for testing labels. 

In [None]:
(x_train, y_train),(x_test, y_test) = tf.keras.datasets.mnist.load_data()

We need to transform our dataset into the desired input space. The pretrained model was trained on 3-channel (RGB) 224x224 images normalized with the means and standard deviations listed in `preprocess`.

In [None]:
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
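`ToTensor` scales 8-bit pixel values to the range [0, 1], and `Normalize` then applies `(x - mean) / std` to each channel. The following is a minimal NumPy sketch of that arithmetic for a single pixel, using the means and standard deviations from `preprocess`; the pixel values are made up for illustration.

```python
import numpy as np

# Channel statistics copied from `preprocess`
mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])

# Hypothetical 8-bit pixels: black (0) and white (255) in all three channels,
# scaled to [0, 1] the way ToTensor does
black = np.zeros(3) / 255.0
white = np.full(3, 255.0) / 255.0

# Per-channel normalization: (x - mean) / std
black_norm = (black - mean) / std
white_norm = (white - mean) / std

print(black_norm)
print(white_norm)
```

Because the three channel statistics differ, the R, G, and B channels of a processed image end up with slightly different values even when they start as identical copies.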

Since our images are 8-bit grayscale, we will append two copies of each sample as "additional channels" in `process` and perform the transformations on the "RGB" image.

In [None]:
def process(image):
    image = image.reshape(1,image.shape[0],image.shape[1])
    image_3 = np.concatenate((image, image, image))
    image_rgb = np.transpose(image_3, axes=(1,2,0))
    image_rgb_tensor = preprocess(image_rgb)
    return image_rgb_tensor.numpy()
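As a sketch of an equivalent way to build the three-channel image, `np.stack` along a new last axis gives the same result as the reshape, concatenate, and transpose steps in `process`. The 28x28 array below is random stand-in data, not an MNIST digit.

```python
import numpy as np

rng = np.random.default_rng(0)
gray = rng.integers(0, 256, size=(28, 28), dtype=np.uint8)  # stand-in for one MNIST image

# Approach used in `process`: add a channel axis, concatenate, move channels last
image = gray.reshape(1, gray.shape[0], gray.shape[1])
rgb_concat = np.transpose(np.concatenate((image, image, image)), axes=(1, 2, 0))

# Alternative: stack three copies directly along a new last axis
rgb_stack = np.stack((gray, gray, gray), axis=-1)

print(rgb_concat.shape, rgb_stack.shape)
print(np.array_equal(rgb_concat, rgb_stack))
```

Either form produces a height x width x channel array, which is the layout `ToTensor` expects for NumPy input.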

At the time this notebook was written, `torchvision.transforms` did not support transforming batches of several images, so we must manually loop through our data and transform each image. **Note: we will be training on the first 1000 training images to showcase transfer learning on a small dataset.**

In [None]:
x_train_1000_process = []
for image in x_train[:1000]:
    x_train_1000_process.append(process(image))
x_train_1000_process = np.array(x_train_1000_process)

y_train_1000 = y_train[:1000]

x_test_process = []
for image in x_test:
    x_test_process.append(process(image))
x_test_process = np.array(x_test_process)

Let's check the shapes of `x_train_1000_process` and `x_test_process` to make sure they are what we expect (number of images, 3, 224, 224).

In [None]:
print ('Train data shape is {}'.format(x_train_1000_process.shape))
print ('Test data shape is {}'.format(x_test_process.shape))

For a final check, let's look at the separate channels of an image to confirm they are the same image.

In [None]:
index = 0
fig, axs = plt.subplots(1,3,figsize=[15,5])
axs[0].set_title('Training Image {} with label {} (R channel)'.format(index, y_train_1000[index]))
axs[0].imshow(x_train_1000_process[index,0])
axs[1].set_title('Training Image {} with label {} (G channel)'.format(index, y_train_1000[index]))
axs[1].imshow(x_train_1000_process[index,1])
axs[2].set_title('Training Image {} with label {} (B channel)'.format(index, y_train_1000[index]))
axs[2].imshow(x_train_1000_process[index,2])

Now that our data is processed, we can move into loading a pretrained model.

## 2. Load and Modify a Pretrained Model <a id="load"></a>

PyTorch has dozens of pretrained models in `torchvision` under [torchvision.models](https://pytorch.org/vision/stable/models.html). Some famous models include [VGG](https://pytorch.org/hub/pytorch_vision_vgg/), [ResNet](https://pytorch.org/hub/pytorch_vision_resnet/), [Inception v3](https://pytorch.org/hub/pytorch_vision_inception_v3/), and [GoogLeNet](https://pytorch.org/hub/pytorch_vision_googlenet/). The published papers for these models are in [Additional Resources](#add). These models were trained using [the ImageNet dataset](https://image-net.org/index.php), which contains over 1M images with 1000 unique classifications. 

In this tutorial, we will use GoogLeNet because it is a relatively small network (about 7M trainable parameters) with a high top-5 accuracy (about 89%) on ImageNet.

In [None]:
model = torchvision.models.googlenet(pretrained=True)

We'll also switch to evaluation mode, which makes layers such as dropout and batch normalization behave as they do at inference time. Evaluation mode does not by itself stop the weights from being trained; freezing the parameters is handled in the next step. The architecture will be printed, which starts with a convolutional layer, goes through several [inception modules](https://paperswithcode.com/method/inception-module#:~:text=An%20Inception%20Module%20is%20an,pass%20onto%20the%20next%20layer.), and ends with a fully connected layer. Notice in the final layer (fc) that `out_features=1000`, which is the number of classifications for ImageNet.

In [None]:
model.eval()

As a default, all of the parameters in the model have the attribute `requires_grad` set to `True`, which determines if the gradients should be calculated for that layer. Since we only want to train the last layer, we first set all the `requires_grad` to `False`. Later when we modify our model, we will reactivate the last layer to require gradient calculation.

In [None]:
for param in model.parameters():
    param.requires_grad = False

Now we will change the final layer to have 10 output classifications instead of 1000. Commented is an example of adding another layer going from 1024 neurons to 128 and 128 to 10 with ReLU activation and dropout. Uncommenting the additional lines will train a deeper model, but training will take longer.

In [None]:
num_input_features = model.fc.in_features
model.fc = nn.Linear(num_input_features, 10)

# adding another layer
#dropout = 0.2
#intermediate = 128
#model.fc = nn.Sequential(
 #   nn.Linear(num_input_features, intermediate),
 #   nn.ReLU(),
 #   nn.Dropout(dropout),
 #   nn.Linear(intermediate, 10)
#)

model.fc

**Tip: all of the layers in the model are organized as attributes, making it easy to call and modify individual layers, i.e. `model.layer.sublayer`.** For example, if you wanted to see the 0th layer of branch 3 in Inception 3a, you would return `model.inception3a.branch3[0]`. 

**Tip: In order to modify a layer, you must redefine the layer and can't simply change the attribute values.** For example, trying to change the number of output classes by returning `model.fc.out_features = 10` wouldn't actually change the model output neurons.
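The second tip can be checked on a standalone layer. This is a minimal sketch using a fresh `nn.Linear` with the same 1024-to-1000 shape as the final layer described above, rather than the pretrained model itself.

```python
import torch
from torch import nn

fc = nn.Linear(1024, 1000)  # stand-in with the same shape as the final layer

# Overwriting the attribute only changes the stored number, not the weights
fc.out_features = 10
weight_shape = tuple(fc.weight.shape)
stale_output_shape = tuple(fc(torch.zeros(1, 1024)).shape)
print(weight_shape, stale_output_shape)  # still sized for 1000 outputs

# Redefining the layer allocates new weights of the right shape
fc = nn.Linear(1024, 10)
new_output_shape = tuple(fc(torch.zeros(1, 1024)).shape)
print(new_output_shape)
```

The weight matrix is allocated when the layer is constructed, which is why replacing the whole layer is the reliable way to change its size.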

As mentioned previously, we'll require gradients for only the final layer.

In [None]:
for param in model.fc.parameters():
    param.requires_grad = True

Let's check that all of our gradients are as we expect: False for all layers except the last layer.

In [None]:
for name, param in model.named_parameters():
    print (name, param.requires_grad)
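The same freeze-then-unfreeze pattern can be verified on a toy network, where the full printout is short. The two-layer model below is a hypothetical stand-in for the pretrained network, with made-up layer sizes; it only illustrates how `requires_grad` controls which parameters receive gradients.

```python
import torch
from torch import nn

# Hypothetical stand-in for a pretrained network: a small "backbone" plus a final layer
toy = nn.Sequential()
toy.add_module('backbone', nn.Linear(8, 4))
toy.add_module('fc', nn.Linear(4, 1000))

# Freeze everything, then replace the head
for param in toy.parameters():
    param.requires_grad = False
toy.fc = nn.Linear(4, 10)  # parameters of a newly created layer require gradients by default

trainable = [name for name, p in toy.named_parameters() if p.requires_grad]
frozen = [name for name, p in toy.named_parameters() if not p.requires_grad]
print('trainable:', trainable)
print('frozen:', frozen)

# After a backward pass, only the trainable parameters hold gradients
toy(torch.randn(2, 8)).sum().backward()
print(toy.backbone.weight.grad is None, toy.fc.weight.grad is not None)
```

Since a freshly constructed layer already requires gradients, re-enabling them explicitly on the new head acts as a safeguard and makes the intent clear.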

## 3. Test Model Functionality <a id="test"></a>

Before training, we need to make sure our model is properly built, i.e. the expected input (3D 3x224x224 array) will return the expected output (2D 1x10 array). An error indicates that the architecture is inconsistent in some way, such as unexpected input and output neurons, incorrectly changing the number of output neurons, etc.

We'll use the first example in our training set to test our model's functionality. Slicing will guarantee the image shape will be (1, 3, 224, 224).

In [None]:
index = 0
test_image = x_train_1000_process[index:index+1]
test_image.shape

After the dimensions are changed, we convert the image from a `np.array` to a `torch.Tensor`.

In [None]:
test_image_torch = torch.Tensor(test_image)

Now we can "predict" the output neurons of the input image.

In [None]:
testoutput_torch = model(test_image_torch)

If there isn't an error, we know our model is working. If there is, go back and check to make sure the input size is (1, 3, 224, 224), the input and outputs are consistent, etc.

We also detach the output from the computation graph using the `detach()` method and convert the `torch.Tensor` to a `np.array` using the `numpy()` method.

In [None]:
testoutput = testoutput_torch.detach().numpy()
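The call to `detach()` is needed because the output depends on parameters that require gradients. A small sketch with a made-up tensor shows what happens without it:

```python
import torch

w = torch.ones(3, requires_grad=True)
y = w * 2  # y is attached to the autograd graph

try:
    y.numpy()
    direct_conversion_ok = True
except RuntimeError as err:
    direct_conversion_ok = False
    print('numpy() failed:', err)

# detach() returns a tensor that no longer tracks gradients
y_np = y.detach().numpy()
print(y_np)
```

If the tensor lived on a GPU, a `.cpu()` call would also be needed before `.numpy()`.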

Let's check the shape of the output neurons to make sure they are what we expect.

In [None]:
print ('The shape of the output neurons is {}.'.format(testoutput.shape))

Let's return the output neurons. Note since we do not use an activation function at the end of our network, the domain of our output can be any real number.

In [None]:
testoutput

We can use the [softmax activation function](https://en.wikipedia.org/wiki/Softmax_function) to convert the output neurons to probabilities for each classification with the index corresponding to the digit classification probability, e.g. the 0th index corresponds to the probability of a 0 classification. Since our model isn't trained, all the output probabilities are approximately 0.1, indicating our model is randomly "classifying".

In [None]:
softmax = nn.Softmax(dim=1)
softmax(testoutput_torch)
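For reference, softmax exponentiates each output neuron and divides by the sum of the exponentials. The NumPy sketch below implements that formula on hypothetical output values (not taken from the model); `softmax_np` is an illustrative helper, not part of the notebook.

```python
import numpy as np

def softmax_np(logits):
    """Softmax over the last axis, shifted by the max for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical output neurons for one image (shape 1x10)
logits = np.array([[0.2, -0.1, 0.05, 0.0, 0.3, -0.2, 0.1, 0.0, -0.05, 0.15]])
probs = softmax_np(logits)

print(probs.round(3))
print(probs.sum())  # probabilities sum to 1
```

When the output neurons are all close to each other, as in this example, every probability stays near 0.1, which matches what we expect from an untrained final layer.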

In addition, it's good practice to know how many trainable parameters are in our model. The number of trainable parameters can be used as a proxy for estimating total training time. We define [a counting function](https://stackoverflow.com/questions/49201236/check-the-total-number-of-parameters-in-a-pytorch-model) and use it to determine how many trainable parameters there are in our model. **Note this function only counts parameters that require gradients, so it should only show the fully connected layer's parameters.**

In [None]:
def count_parameters(model):
    total_params = 0
    for name, parameter in model.named_parameters():
        if not parameter.requires_grad: continue
        param = parameter.numel()
        print([name, param])
        total_params+=param
    return total_params

In [None]:
count_parameters(model)

About 10K trainable parameters is far fewer than the original 7M, so training will take much less time than training GoogLeNet from scratch! Now that our model is ready, let's further prepare our data and set our hyperparameters.
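The count of roughly 10K can be checked by hand: a fully connected layer has one weight per input-output pair plus one bias per output. The arithmetic below uses the 1024 input features and 10 outputs from Section 2, and also covers the optional deeper head from the commented-out example.

```python
# Weights plus biases of a fully connected layer: 1024 -> 10
in_features, out_features = 1024, 10
head_params = in_features * out_features + out_features
print(head_params)

# Optional deeper head from the commented example: 1024 -> 128 -> 10
intermediate = 128
deeper_head_params = (in_features * intermediate + intermediate) \
    + (intermediate * out_features + out_features)
print(deeper_head_params)
```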

## 4. Set Training and Test Sets <a id="set"></a>

PyTorch uses iterables to create its data objects. Here we use lists to format the data to be PyTorch compatible. Experienced Python users are more likely to be comfortable using and manipulating lists. The function `format_dataset` makes a 2D list with each element being [image, label].

In [None]:
def format_dataset(image_set, labels):
    data_set = []
    for i in range(len(image_set)):
        data_set.append([image_set[i], labels[i]])  
    return data_set

In [None]:
train_set = format_dataset(x_train_1000_process, y_train_1000)
val_set = format_dataset(x_test_process, y_test)
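As an alternative sketch, `torch.utils.data.TensorDataset` pairs images and labels without building a Python list. The arrays below are small random stand-ins for the processed data, and the batch size of 4 is chosen only for the demonstration.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# Random stand-ins for the processed images and labels (not real MNIST data)
images = np.random.rand(8, 3, 224, 224).astype(np.float32)
labels = np.random.randint(0, 10, size=8).astype(np.int64)

# Each dataset item is an (image, label) pair, like the lists from format_dataset
dataset = TensorDataset(torch.from_numpy(images), torch.from_numpy(labels))
loader = DataLoader(dataset, batch_size=4, shuffle=True)

batch_images, batch_labels = next(iter(loader))
print(batch_images.shape, batch_labels.shape)
```

Both approaches give `DataLoader` an indexable collection of (image, label) pairs; the list version keeps the data as NumPy arrays until batching.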

We also need to define a baseline for our model to perform better than. The baseline helps us understand if our model is learning anything at all. We choose the model classifying at random to be our baseline since a poor model would perform as such. That being said, if our model's accuracy is well above 10%, then we know the model is learning something useful.
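The 10% figure assumes ten equally likely classes. A quick simulation with made-up labels (not MNIST) shows the accuracy that uniformly random guessing produces:

```python
import numpy as np

rng = np.random.default_rng(42)
num_classes = 10
num_samples = 100_000

# Hypothetical labels and uniformly random guesses
labels = rng.integers(0, num_classes, size=num_samples)
guesses = rng.integers(0, num_classes, size=num_samples)

baseline_accuracy = (labels == guesses).mean()
print('Random-guess accuracy: {:.3f}'.format(baseline_accuracy))
```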

## 5. Hyperparameters and Loading <a id="hyper"></a>

First, we must set our hyperparameters for the model to use for training. The hyperparameters we are using are batch size, shuffle, and number of workers. Batch size can be tuned as needed to improve results. Shuffle should almost always be True for training since the data shouldn't be presented in any specific order. In addition, the number of workers has a default of 0, which loads the data in the main process.

In [None]:
torch.manual_seed(42)

params = {
        'batch_size': 32,
        'shuffle': True,
        'num_workers': 0
    }

Next, we choose the number of epochs we wish to train for.

In [None]:
num_epochs = 10

Another useful metric to know is how many updates our model will perform during training. We can calculate this by finding the number of batches in the training set (number of training samples / batch size, rounded up because `DataLoader` keeps the last partial batch by default) and multiplying it by the number of epochs. Knowing how many batches our model might need to be well trained can be a good place to start when tuning hyperparameters.

In [None]:
# Round up: the last, smaller batch still counts as an update
batches_per_epoch = int(np.ceil(x_train_1000_process.shape[0] / params['batch_size']))
print ('The model will train using a total of {} batches'.format(num_epochs * batches_per_epoch))

**Tip: the author's rule of thumb is to have at least 100 batches trained per epoch, but this can be difficult with a small dataset. In addition, a minimum batch of 32 hints at enough samples for central limit theorem to hold true, although that may not be the case for each batch.**
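The number of updates depends on whether the final partial batch is kept. `DataLoader` keeps it unless `drop_last=True` is passed, so with the values used in this notebook the arithmetic works out as follows:

```python
import math

num_samples = 1000
batch_size = 32
num_epochs = 10

# DataLoader keeps the final partial batch unless drop_last=True
batches_per_epoch = math.ceil(num_samples / batch_size)
batches_if_dropping_last = num_samples // batch_size
last_batch_size = num_samples - batches_if_dropping_last * batch_size

print(batches_per_epoch, batches_if_dropping_last, last_batch_size)
print('Total updates:', num_epochs * batches_per_epoch)
```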

With our hyperparameters set, we can load our training and test set using `DataLoader`. 

**Note the variable and function names in the notebook refer to a validation set, but we use them for the test set instead.** That being said, we use the terms validation set and test set interchangeably here.

In [None]:
# TRAINING SET
train_loader = DataLoader(train_set, **params)

# TEST SET
valid_loader = DataLoader(val_set, **params)

Now we define our loss function to be [Cross Entropy Loss](https://en.wikipedia.org/wiki/Cross_entropy), which combines [softmax and the negative log likelihood loss](https://ljvmiranda921.github.io/notebook/2017/08/13/softmax-and-the-negative-log-likelihood/). This function is standard in multiclass classification problems.

In [None]:
distance = nn.CrossEntropyLoss()
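The relationship between cross entropy, softmax, and negative log likelihood can be checked numerically. This sketch uses random stand-in outputs and made-up labels:

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)          # hypothetical output neurons for a batch of 4
target = torch.tensor([3, 0, 7, 1])  # hypothetical labels

# Cross entropy equals log-softmax followed by negative log likelihood
loss_ce = F.cross_entropy(logits, target)
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), target)
print(loss_ce.item(), loss_manual.item())

# With identical outputs for every class, the loss equals ln(10)
uniform_loss = F.cross_entropy(torch.zeros(4, 10), target)
print(uniform_loss.item(), math.log(10))
```

A loss near ln(10), about 2.3, therefore corresponds to a model that assigns equal probability to all ten digits.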

Then we choose [Adam](https://en.wikipedia.org/wiki/Stochastic_gradient_descent#Adam) as our optimizer, since it adapts the step size for each parameter automatically and typically trains relatively fast compared to [Stochastic Gradient Descent (SGD)](https://en.wikipedia.org/wiki/Stochastic_gradient_descent).

In [None]:
optimizer = torch.optim.Adam(model.parameters(),  weight_decay=1e-5)

If you have GPUs available, then those will be used for training. If not, then the model will train on CPUs.

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device);

Let's print the device to make sure we know what's available.

In [None]:
device

## 6. Train Model <a id="train"></a>

In order to train our model, we have to manually loop through our data for training. This is probably the biggest difference between PyTorch and [Tensorflow](https://www.tensorflow.org/), but this allows for more hands-on manipulation of how training is performed, which can be advantageous. We will train our model as follows:
1. Change the model to training mode, which switches layers such as dropout and batch normalization to their training behavior
2. Initialize training loss to be 0
3. Loop through each batch of features and labels by:
    - Putting the data onto your device
    - Calculating the output neurons and the loss
    - Performing backpropagation, updating the weights, and adding the batch training loss to total training loss
4. Normalize the total training loss by the number of batches

In [None]:
# Define train loop

def train_model(train_loader):

    # Change model to training mode (training behavior for dropout and batch norm)
    model.train()
    
    # Initialize training loss
    train_loss = 0
    
    # Loop through batches of training data
    for data, target in train_loader:
        
        # Put training batch on device
        data = data.float().to(device)
        target = target.type(torch.LongTensor).to(device)

        # Calculate output and loss from training batch
        output = model(data)
        loss = distance(output, target)
        
        # Backpropagation and weight update
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
    
    # Normalize training loss from one epoch
    train_loss_norm = train_loss / len(train_loader)
    
    return train_loss_norm

In addition, we define a similar loop for evaluating the test set at each epoch, which signals us if our model is generalizing. We will test our model as follows:
1. Change model to evaluation mode, which switches layers such as dropout and batch normalization to their inference behavior
2. Initialize test loss and number of correctly classified samples to 0
3. Loop through each batch of features and labels by:
    - Putting the data onto your device
    - Calculating the output neurons and the loss
    - Counting the number of correct predictions
4. Calculate test set accuracy
5. Normalize the total test loss by the number of batches

In [None]:
# Define validation loop

def validate_model(valid_loader):

    # Change model to evaluation mode (inference behavior for dropout and batch norm)
    model.eval()
    
    # Initialize validation loss and number of correct predictions
    val_loss = 0
    correct = 0
    
    # Do not calculate gradients for the loop
    with torch.no_grad():
        
        # Loop through batches of validation data
        for data, target in valid_loader:
            
            # Put validation batch on device
            data = data.float().to(device)
            target = target.type(torch.LongTensor).to(device)
            
            # Calculate output and loss from validation batch
            output = model(data)
            val_loss += distance(output, target).item()
            
            # Count number of correct predictions (as a Python number, so it plots on any device)
            pred = output.argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()
        
        # Calculate accuracy
        accuracy = 100. * correct / len(valid_loader.dataset)
    
    # Normalize validation loss from one epoch
    val_loss_norm = val_loss / len(valid_loader)
    
    return val_loss_norm, accuracy

Finally, we can train our model! We will print out the train and test loss/accuracy per epoch to keep track of performance. The loop below runs the training and validation loops defined above and records our metrics. Most of the time in the loop is actually spent pushing the batches through the network (performing millions of computations to go from a 3x224x224 image to 10 neurons) rather than calculating gradients, which is usually the most expensive part of training. To speed this up, [manually extract the features from the data](https://discuss.pytorch.org/t/how-can-i-extract-intermediate-layer-output-from-loaded-cnn-model/77301/3) once, and train a small custom neural network on them.

In [None]:
# keep track of metrics
lst_train_loss = []
lst_val_loss = []
lst_accuracy = []

# training loop
for epoch in tqdm(range(num_epochs), total=num_epochs):

    # Go through loops
    train_loss = train_model(train_loader)
    val_loss, accuracy = validate_model(valid_loader)

    # Append metrics
    lst_train_loss.append(train_loss)
    lst_val_loss.append(val_loss)
    lst_accuracy.append(accuracy)

    # Log
    print('Epoch {} - Train loss: {:.3f} - Val Loss: {:.3f} - Accuracy: ({:.0f}%)'.format(
            epoch, train_loss, val_loss, accuracy))
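The feature-extraction shortcut mentioned before the training loop can be sketched as follows. The `backbone` here is a hypothetical stand-in (a single random linear layer producing 1024 features) rather than the pretrained network, and the images and labels are random; the point is only the pattern of running the frozen part once, caching its output, and training a small head on the cached features.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in for a frozen pretrained backbone that outputs 1024 features
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 1024), nn.ReLU())
for param in backbone.parameters():
    param.requires_grad = False
backbone.eval()

# Stand-in data: 64 small "images" with labels 0-9
images = torch.randn(64, 3, 8, 8)
labels = torch.randint(0, 10, (64,))

# Run the expensive forward pass once and cache the features
with torch.no_grad():
    features = backbone(images)

# Train only a small head on the cached features
head = nn.Linear(1024, 10)
optimizer = torch.optim.Adam(head.parameters())
loss_fn = nn.CrossEntropyLoss()

losses = []
for epoch in range(20):
    optimizer.zero_grad()
    loss = loss_fn(head(features), labels)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(features.shape, losses[0], losses[-1])
```

Caching like this assumes the backbone is fully frozen and the inputs are fixed, i.e. no random data augmentation is applied between epochs.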

The model trained on ImageNet was able to use its previous knowledge to classify digits! In addition, we only used 1000 training images (~100/class) and it was able to generalize pretty well, which is really impressive. Now we'll look at some metrics to get more details on performance.

## 7. Plot Metrics <a id="plot"></a>

We plot the train/test loss and test accuracy to determine how well converged our model is. Any small overfitting is most likely caused by how small our training set is and how much larger our test set is than our training set. This could be resolved by changing the size of our datasets or using stronger regularization. The model may also not be fully converged yet so training for more epochs may increase performance. However, since our model is performing a lot better than our baseline, we are satisfied with the model we trained.

In [None]:
fig, axs = plt.subplots(1, 2, figsize=[10,5])

axs[0].set_title('Loss')
axs[0].plot(np.arange(num_epochs), lst_train_loss, label='train')
axs[0].plot(np.arange(num_epochs), lst_val_loss, label='val')
axs[0].set_xlabel('Epochs')
axs[0].legend()

axs[1].set_title('Accuracy')
axs[1].plot(np.arange(num_epochs), lst_accuracy, color='C1')
axs[1].set_xlabel('Epochs')

We'll also plot a [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix) to see how our model is performing on an individual class basis and see what digits are most difficult to classify. We will only be evaluating the first 1000 samples of the test set to save on computation.

In [None]:
# Evaluate test subset
x_test_1000_process = x_test_process[:1000]
y_test_1000 = y_test[:1000]

# Run inference on the model's device without tracking gradients, then move the output to the CPU
model.eval()
with torch.no_grad():
    val_pred0 = model(torch.Tensor(x_test_1000_process).to(device)).cpu()
val_pred = val_pred0.argmax(dim=1).numpy()
confusion_matrix = metrics.confusion_matrix(y_test_1000, val_pred, normalize='true')

In [None]:
# Plot confusion matrix
plt.figure(figsize=(16,10))
sns.heatmap(confusion_matrix, annot=True, cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
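With `normalize='true'`, each row of the confusion matrix is divided by the number of samples of that true class, so the diagonal holds the per-class recall. The NumPy sketch below reproduces that normalization by hand on made-up labels for a 3-class problem:

```python
import numpy as np

# Hypothetical true and predicted labels for a 3-class problem
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 0, 1, 1, 2, 0, 2, 2])

num_classes = 3
counts = np.zeros((num_classes, num_classes), dtype=int)
for t, p in zip(y_true, y_pred):
    counts[t, p] += 1  # rows are true labels, columns are predictions

# Divide each row by its total so the diagonal holds per-class recall
row_normalized = counts / counts.sum(axis=1, keepdims=True)
print(row_normalized)
```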

Our model seems to perform well (> 70%) on most digits, but falls short for some. Let's analyze some individual samples to see the classification probabilities per sample.

## 8. Analyze Samples <a id="analyze"></a>
    
Now that our model is trained, let's analyze some samples to see the classification probabilities of some images. We can look at random examples in our test set and plot classification probabilities using a bar graph.

In [None]:
# choose random image and corresponding output neurons from test subset
rand_index = np.random.randint(x_test_1000_process.shape[0])
rand_image = x_test_1000_process[rand_index][0]
rand_class_prob_tensor = softmax(val_pred0[rand_index:rand_index+1])
rand_class_prob = rand_class_prob_tensor.detach().numpy().flatten()

# plot image and classification probabilities
fig, axs = plt.subplots(1,2,figsize=[10,5])
axs[0].set_title('Testing Image {} with a label of {}'.format(rand_index, y_test_1000[rand_index]))
axs[0].imshow(rand_image)
axs[1].set_title('Classification Probabilities for Testing Image {}'.format(rand_index))
axs[1].bar(np.arange(10), rand_class_prob)
axs[1].set_xlabel('Label')
axs[1].set_ylabel('Probability')
print ('Prediction: {}, Label: {}'.format(val_pred[rand_index], y_test_1000[rand_index]))
plt.tight_layout()

In addition, we can look at misclassified samples to investigate the harder examples in the test set. Let's create a mask that selects the incorrect predictions.

In [None]:
mask = val_pred != y_test_1000

In [None]:
# choose random incorrect sample
false_index = np.random.randint(mask.sum())
false_image = x_test_1000_process[mask][false_index][0]
false_class_prob_tensor = softmax(val_pred0[mask][false_index:false_index+1])
false_class_prob = false_class_prob_tensor.detach().numpy().flatten()

# plot image and classification probabilities
fig, axs = plt.subplots(1,2,figsize=[10,5])
axs[0].set_title('Testing Image (mask) {} with a label of {}'.format(false_index, y_test_1000[mask][false_index]))
axs[0].imshow(false_image)
axs[1].set_title('Classification Probabilities')
axs[1].bar(np.arange(10), false_class_prob)
print ('Prediction: {}, Label: {}'.format(val_pred[mask][false_index], y_test_1000[mask][false_index]))

## 9. Conclusions <a id="con"></a>

Transfer learning can be a powerful tool in the absence of data, time, and/or computational resources. If those restrictions do not apply, it's often better to train an entire model from scratch. However, in the real world, those restrictions are frequent and challenging to deal with as a data scientist. It's nice to have a technique that can compensate for small training data and "slow" computers, which in the end makes machine learning and deep learning more accessible to all.

Thank you for walking through this notebook. Now you should be more familiar with:
- loading and transforming the MNIST dataset
- loading and modifying a pretrained model
- training and evaluating a pretrained model

**Congratulations, you have completed the notebook!**

## Additional Resources <a id="add"></a>

Machine learning is a dense and rapidly evolving field of study. Becoming an expert takes years of practice and patience, but hopefully this notebook brought you closer in that direction. Here are some of the author's favorite resources for learning about machine learning and data science:

- [Google Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course/ml-intro)
- [scikit-learn Python Library](https://scikit-learn.org/stable/index.html) (go-to for most ML algorithms besides neural networks)
- [StatQuest YouTube Channel](https://www.youtube.com/c/joshstarmer)
- [DeepLearningAI YouTube Channel](https://www.youtube.com/c/Deeplearningai/videos)
- [Towards Data Science](https://towardsdatascience.com/) (articles about data science and machine learning, some involving example blocks of code)
- Advanced search on [arXiv](https://arxiv.org/search/advanced) (e.g. search term "machine learning" in Abstract for Subject astro-ph) to see what others are doing currently
- Google, YouTube, and Wikipedia in general
- Pretrained model papers:
    - [VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION (VGG)](https://arxiv.org/pdf/1409.1556.pdf)
    - [Deep Residual Learning for Image Recognition (ResNet)](https://arxiv.org/pdf/1512.03385.pdf)
    - [Going deeper with convolutions (GoogLeNet)](https://arxiv.org/pdf/1409.4842.pdf)
    - [Rethinking the Inception Architecture for Computer Vision (Inception v3)](https://arxiv.org/pdf/1512.00567.pdf)
- Supplementary articles used for learning transfer learning in PyTorch:
    - [Pytorch: Transfer Learning and Image Classification](https://pyimagesearch.com/2021/10/11/pytorch-transfer-learning-and-image-classification/)
    - [Deep Learning for Everyone](https://www.analyticsvidhya.com/blog/2019/10/how-to-master-transfer-learning-using-pytorch/)
    - [PyTorch freeze part of the layers](https://jimmy-shen.medium.com/pytorch-freeze-part-of-the-layers-4554105e03a6)

## About this Notebook <a id="about"></a>

**Author:** Fred Dauphin, DeepWFC3

**Updated on:** 2021-12-03

## Citations <a id="cite"></a>

If you use `numpy`, `matplotlib`, or `torch` for published research, please cite the authors. Follow these links for more information about citing `numpy`, `matplotlib`, and `torch`:

* [Citing `numpy`](https://numpy.org/doc/stable/license.html)
* [Citing `matplotlib`](https://matplotlib.org/stable/users/project/license.html#:~:text=Matplotlib%20only%20uses%20BSD%20compatible,are%20acceptable%20in%20matplotlib%20toolkits.)
* [Citing `torch`](https://github.com/pytorch/pytorch/blob/master/LICENSE)

***
[Top of Page](#title)