<a id="title"></a>
# Autoencoder MNIST Tutorial using PyTorch
***
## Learning Goals:
By the end of this tutorial, you will:
- build an autoencoder
- train and evaluate an autoencoder
- visualize the latent space of an autoencoder
- detect anomalies using an autoencoder

## Table of Contents
[Introduction](#intro) <br>
[0. Imports](#imports) <br>
[1. MNIST Dataset and Scaling](#mnist) <br>
[2. Build an Autoencoder](#build) <br>
[3. Test Model Functionality](#test) <br>
[4. Set Training and Test Sets](#set) <br>
[5. Hyperparameters and Loading](#hyper) <br>
[6. Train Model](#train) <br>
[7. Plot Loss Function and R2](#plot) <br>
[8. Analyze Samples](#analyze) <br>
[9. Visualize the Latent Space](#latent) <br>
[10. Detect Anomalies](#detect) <br>
[11. Conclusions](#con) <br>
[Additional Resources](#add) <br>
[About this Notebook](#about) <br>
[Citations](#cite) <br>

## Introduction <a id="intro"></a>

The main purpose of this notebook is to build an autoencoder in [PyTorch](https://pytorch.org/), a deep learning Python library. This tutorial is not an exhaustive introduction to machine learning and assumes the user is familiar with the vocabulary (supervised vs. unsupervised learning, neural networks, loss functions, backpropagation, etc.) and methodology (model selection, feature selection, hyperparameter tuning, etc.). This notebook also assumes the user is familiar with convolutional neural networks (CNNs) and the [MNIST handwritten dataset](http://yann.lecun.com/exdb/mnist/). See [Additional Resources](#add) for more complete machine learning guides. The paragraphs below serve as a brief introduction to autoencoders.

An [autoencoder](https://en.wikipedia.org/wiki/Autoencoder) is an unsupervised learning algorithm that learns to reconstruct its input as its output (i.e. it uses a neural network to learn a complex version of the identity function). It is composed of an encoder, which compresses the input into a low dimensional latent space, and a decoder, which decompresses the representation from the latent space into the output. Autoencoders are versatile models whose main uses include dimensionality reduction, anomaly detection, and image denoising. The list below explains why each use may be valuable:

1. Dimensionality reduction: If an autoencoder is able to reconstruct the input to a high degree, then the latent space (low dimensionality) must be a well constructed representation of the input space (high dimensionality). That is to say the model has learned an efficient, compressed representation of the data. You could vastly decrease the training time or increase the complexity of other machine learning models by using the latent space representation as features instead of using the original inputs.

    - Example: on a training data of cats and dogs, the autoencoder could differentiate between these two classes in the latent space (a cluster of cat images separate from a cluster of dog images). 


2. Anomaly detection: If an autoencoder is able to reconstruct the input to a high degree, then it has learned the features and distribution of the training data well. Therefore features outside of the distribution of the training data will reconstruct poorly and can be automatically detected by having a high loss compared to inputs within the distribution. **Note: a deep model trained on an extremely diverse dataset could learn features general enough to reconstruct inputs outside of its learned domain. Since the model is "too good", the model may perform poorly on anomaly detection.** 

    - Example: on training data of cats and dogs, the autoencoder has learned the features of these two animals. If an image of a car was an input, the autoencoder would not understand this image and would try to reconstruct a cat or dog, leading to a high loss between the input and reconstruction.
    
    
3. Image denoising: An autoencoder can in theory learn a mapping between any two input and output spaces. If we train it with noisy images as features and the corresponding clean images as labels, then the autoencoder will learn to remove the noise from those inputs. Any noisy inputs collected in the future can then be denoised and used for further analysis.

    - Example: on training data of noisy cat images as features and clean cat images as labels, the autoencoder could denoise any new cat images.
    
Other types of autoencoders exist, such as the [variational autoencoder (VAE)](https://en.wikipedia.org/wiki/Variational_autoencoder), but implementing one is beyond the scope of this tutorial.
    
**In this notebook, we will build an autoencoder using PyTorch to learn the representation of MNIST handwritten digit dataset. We will also go over the first two use cases of an autoencoder as a dimensionality reducer and an anomaly detector.**

## 0. Imports <a id="imports"></a>

If you are running this notebook on Google Colab, you shouldn't have to install anything. If you are running this notebook in Jupyter, this notebook assumes you created the virtual environment defined in `environment.yml`. If not, close this notebook and run the following lines in a terminal window:

`conda env create -f environment.yml`

`conda activate deepwfc3_env`

We import the following libraries:
- *numpy* for handling arrays
- *matplotlib* for plotting
- *tqdm* for keeping track of loop speed
- *tensorflow* for accessing MNIST images 
- *torch* as our machine learning framework

In [None]:
import numpy as np
from matplotlib import pyplot as plt
from tqdm import tqdm

import tensorflow as tf

import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader

## 1. MNIST Dataset and Scaling<a id="mnist"></a>

The MNIST dataset is nicely packed in `tensorflow` as `np.arrays`, which is why we are grabbing our data from there instead of directly from `torch`. The data is unpacked as `x_train` for training features, `y_train` for training labels, `x_test` for testing features, and `y_test` for testing labels.

In [None]:
(x_train, y_train),(x_test, y_test) = tf.keras.datasets.mnist.load_data()

We'll also define some frequently used global variables. `x_train_size` is the number of images in the training set, `x_test_size` is the number of images in the test set, and `x_length` is the length/width of an image. In addition, we min-max scale our images to have a minimum value of 0 and a maximum value of 1.

In [None]:
x_train_size = x_train.shape[0]
x_test_size = x_test.shape[0]
x_length = x_train.shape[1]

norm = x_train.max()
x_train_scale = x_train / norm
x_test_scale = x_test / norm
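
As a small illustration of the scaling step, the sketch below applies the general min-max formula to a hypothetical batch of random 8-bit images (a stand-in for MNIST, not the real data) and checks that it matches the divide-by-maximum shortcut used above whenever the minimum pixel value is already 0.

```python
import numpy as np

# Hypothetical stand-in for a batch of 8-bit images (values 0..255)
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(4, 28, 28), dtype=np.uint8)
images[0, 0, 0] = 0      # make sure both extremes are present
images[0, 0, 1] = 255

# General min-max scaling: (x - min) / (max - min)
lo, hi = images.min(), images.max()
scaled = (images - lo) / (hi - lo)

# When the minimum is already 0, dividing by the maximum is equivalent
scaled_by_max = images / hi

print(scaled.dtype, scaled.min(), scaled.max())
```

Note that the test set is divided by the training maximum rather than its own, so both sets share one scale.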

## 2. Build an Autoencoder <a id="build"></a>

PyTorch has its own data abstraction called `torch.utils.data.Dataset`. A `Dataset` has methods to retrieve the data length and individual instances. The datasets built from the class are used as inputs for `torch.utils.data.DataLoader`, which prepares our data for training. Since an autoencoder isn't trained using labels, the "labels" are not defined.

In [None]:
class LoadDataset(Dataset):
    
    def __init__(self, images):
        self.images = images
        
    def __len__(self):
        return len(self.images)
    
    def __getitem__(self, index):
        image = self.images[index]
        
        return image

We'll also define some helper functions to flatten our last encoded feature maps to neurons and unflatten our last decoded neurons to feature maps.

In [None]:
class Flatten(nn.Module):
    def forward(self, input):
        return input.view(input.size(0), -1)

class UnFlatten(nn.Module):
    def forward(self, input, shape_before_flatten):
        return input.view(input.size(0), *shape_before_flatten)
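
To make the two helpers concrete, here is a minimal round-trip sketch using the same `view` calls on a hypothetical batch of 4 samples with 32 feature maps of size 7x7 (the shape the encoder is expected to produce). The tensor values are arbitrary and only serve to show that no information is lost.

```python
import torch

# Hypothetical batch: 4 samples, 32 feature maps of 7x7
x = torch.arange(4 * 32 * 7 * 7, dtype=torch.float32).view(4, 32, 7, 7)

shape_before_flatten = x.size()[1:]            # torch.Size([32, 7, 7])
flat = x.view(x.size(0), -1)                   # one row of 1568 values per sample
restored = flat.view(flat.size(0), *shape_before_flatten)

print(flat.shape, restored.shape)
```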

Here we define the functions and layers to build our autoencoder. The constructor has our model hyperparameters as inputs:

- `filters`: the number of filters the convolutional layers will learn
- `latent_dimensions`: the dimensionality of the latent space
- `sub_array_size`: the image length/width
- `k`: the length/width of the filter being learned
- `pool`: the length/width of max pooling
- `pad`: the length/width of zero padding

Using the constructor's parameters, we define the CNN layers and functions. We use the [rectified linear unit (ReLU)](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)) as our activation function to add nonlinearity to our model, use max pooling to downsample (decrease model size and force feature extraction), use `filters`, `k`, and `pad` to build our convolutional layers, and use `latent_dimensions` to build our latent space. In addition, we use max unpooling and transpose convolution to upsample in our decoder.

The `forward` function builds our autoencoder from the functions we defined in the constructor. The autoencoder is built as follows:
- Encoder Layer 1
    - convolve 1 28x28 image into 16 28x28 feature maps
    - activate the feature maps using ReLU
    - max pool the 16 28x28 feature maps to 16 14x14 feature maps
- Encoder Layer 2
    - convolve 16 14x14 feature maps into 32 14x14 feature maps
    - activate the feature maps using ReLU
    - max pool the 32 14x14 feature maps to 32 7x7 feature maps
- Flatten the 32 7x7 feature maps to a 1D 32 * 7 * 7 array
- Latent Layer
    - use the flattened 1D array as inputs for the 2 dimensional latent space
    - use the 2 dimensional latent space as inputs for a new flattened 1D array
- Unflatten the 1D 32 * 7 * 7 array to 32 7x7 feature maps
- Decoder Layer 1
    - max unpool the 32 7x7 feature maps to 32 14x14 feature maps
    - activate the feature maps using ReLU
    - transpose convolve 32 14x14 feature maps into 16 14x14 feature maps
- Decoder Layer 2
    - max unpool the 16 14x14 feature maps to 16 28x28 feature maps
    - activate the feature maps using ReLU
    - transpose convolve 16 28x28 feature maps into 1 28x28 output
- Final activation using ReLU
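
The shapes listed above can be checked with a little arithmetic. The sketch below uses the standard output-size formula for a convolution along one axis, `(size + 2 * pad - k) // stride + 1`, together with non-overlapping pooling. The helper names `conv_out` and `pool_out` are hypothetical and exist only for this illustration.

```python
def conv_out(size, k=3, pad=1, stride=1):
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * pad - k) // stride + 1

def pool_out(size, pool=2):
    """Output length of non-overlapping max pooling along one axis."""
    return size // pool

filters = [1, 16, 32]
size = 28
shapes = [(filters[0], size, size)]
for out_channels in filters[1:]:
    size = conv_out(size)      # k=3 with pad=1 keeps the size unchanged
    size = pool_out(size)      # pooling by 2 halves it
    shapes.append((out_channels, size, size))

# Number of neurons feeding the latent layer
size_before_latent = filters[-1] * size ** 2
print(shapes)
print(size_before_latent)
```

With the default hyperparameters this gives 32 * 7 * 7 = 1568 neurons before the latent layer.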

In [None]:
# define functions and build model

class Autoencoder(nn.Module):
    def __init__(self, 
                 filters = [1, 16, 32],
                 latent_dimensions = 2,
                 sub_array_size = x_length,
                 k = 3,
                 pool = 2,
                 pad = 1):

        super(Autoencoder, self).__init__()

        # The Rectified Linear Unit (ReLU)
        self.relu = nn.ReLU()

        # Max Pool and Un-Max Pool
        self.mp = nn.MaxPool2d(pool, return_indices=True)
        self.up = nn.MaxUnpool2d(pool)
        
        # Flattens the feature map to a 1D array
        self.flatten = Flatten()
        # Unflattens 1D array to feature map
        self.unflatten = UnFlatten()

        # ---- ENCODER ----
        self.conv1 = nn.Conv2d(in_channels=filters[0], out_channels=filters[1], kernel_size=k, padding=pad)
        self.conv2 = nn.Conv2d(in_channels=filters[1], out_channels=filters[2], kernel_size=k, padding=pad)
        
        # ---- LATENT ----
        size_before_latent = filters[-1] * (sub_array_size // pool ** (len(filters) - 1)) ** 2
        self.latent = nn.Linear(size_before_latent, latent_dimensions)
        self.out_of_latent = nn.Linear(latent_dimensions, size_before_latent)

        # ---- DECODER ----
        self.trans1 = nn.ConvTranspose2d(in_channels=filters[2], out_channels=filters[1], kernel_size=k, padding=pad)
        self.trans2 = nn.ConvTranspose2d(in_channels=filters[1], out_channels=filters[0], kernel_size=k, padding=pad)

    def forward(self,x):
        
        # ENCODER
        # Layer 1
        x = self.conv1(x)
        x = self.relu(x)
        x, ind1 = self.mp(x)

        # Layer 2
        x = self.conv2(x)
        x = self.relu(x)
        x, ind2 = self.mp(x)

        # LATENT
        shape_before_flatten = x.size()[1:]
        x = self.flatten(x)
        x = self.latent(x)
        x = self.out_of_latent(x)
        x = self.unflatten(x, shape_before_flatten)        

        # DECODER
        # Layer 1
        x = self.up(x, ind2)
        x = self.relu(x)
        x = self.trans1(x)

        # Layer 2
        x = self.up(x, ind1)
        x = self.relu(x)
        x = self.trans2(x)

        # Final activation
        x = self.relu(x)

        return x
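
The decoder relies on the indices returned by max pooling. As a minimal sketch of that mechanism on a hypothetical 4x4 input, `nn.MaxPool2d(..., return_indices=True)` records where each maximum came from, and `nn.MaxUnpool2d` writes the pooled values back to those positions and fills the rest with zeros.

```python
import torch
from torch import nn

mp = nn.MaxPool2d(2, return_indices=True)
up = nn.MaxUnpool2d(2)

# One 4x4 single-channel "image" with values 0..15
x = torch.arange(16, dtype=torch.float32).view(1, 1, 4, 4)

pooled, indices = mp(x)           # shape (1, 1, 2, 2)
unpooled = up(pooled, indices)    # shape (1, 1, 4, 4)

print(pooled[0, 0])
print(unpooled[0, 0])
```

Because the indices come from the encoder's forward pass on the same sample, the decoder in this architecture receives some positional information that does not pass through the latent space.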

## 3. Test Model Functionality <a id="test"></a>

Before training, we need to make sure our model is properly built, i.e. the expected input (2D 28x28 array) will return the expected output (2D 28x28 array). An error indicates the architecture is inconsistent in some way, such as unexpected input and output filters, unexpected input and output neurons, etc. Some ways to "break" the model are listed below:
- comment out a method in the constructor or forward
- manually change arguments in the methods to a different value

To start off, we will build our model.

In [None]:
model = Autoencoder()

Next, we change the shape of our image to be compatible with PyTorch. The input dimensions for images are (number of samples, number of input channels, y dimension, x dimension), which in our case is (1, 1, 28, 28).

In [None]:
index = 0
test_image = x_train_scale[index].reshape(1,1,x_length,x_length) 

After the dimensions are changed, we convert the image from a `np.array` to a `torch.Tensor`.

In [None]:
test_image_torch = torch.Tensor(test_image)

Now we can "reconstruct" our input image.

In [None]:
testoutput_torch = model(test_image_torch)

Since there is no error, we know our model is working. We also detach the output from the computation graph using the `detach()` method and convert the `torch.Tensor` to a `np.array` by using the `numpy()` method.

In [None]:
testoutput = testoutput_torch.detach().numpy()

Let's check the shape of the output to make sure it is what we expect. If it's not, then we have to fix the parameters where we defined the model.

In [None]:
print ('The shape of the output is {}.'.format(testoutput.shape))

Now let's plot the input and output. Since the model hasn't been trained, the output should look like random noise.

In [None]:
fig, axs = plt.subplots(1,2,figsize=[10,5])
axs[0].set_title('Training Scaled Image {}'.format(index))
axs[0].imshow(test_image[0,0].reshape(x_length, x_length))
axs[1].set_title('Reconstructed Output')
axs[1].imshow(testoutput[0,0].reshape(x_length, x_length))

In addition, it's good practice to know how many trainable parameters are in our model. The number of trainable parameters can be used as a proxy for estimating total training time. We define [a counting function](https://stackoverflow.com/questions/49201236/check-the-total-number-of-parameters-in-a-pytorch-model) and use it to determine how many trainable parameters there are in our model.

In [None]:
def count_parameters(model):
    total_params = 0
    for name, parameter in model.named_parameters():
        if not parameter.requires_grad: continue
        param = parameter.numel()
        print([name, param])
        total_params+=param
    return total_params

In [None]:
count_parameters(model)
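
As a cross-check, the count can be worked out by hand for the default hyperparameters. The sketch below assumes every layer has a bias term (the PyTorch default) and uses hypothetical helper names. It groups weights and biases per layer, whereas `count_parameters` prints them separately.

```python
def conv_params(c_in, c_out, k):
    # one k x k kernel per (input, output) channel pair, plus one bias per output channel
    return c_in * c_out * k * k + c_out

def linear_params(n_in, n_out):
    # weight matrix plus one bias per output neuron
    return n_in * n_out + n_out

filters, k, latent_dimensions = [1, 16, 32], 3, 2
size_before_latent = filters[-1] * 7 * 7   # 1568 neurons before the latent layer

expected = {
    'conv1': conv_params(filters[0], filters[1], k),
    'conv2': conv_params(filters[1], filters[2], k),
    'latent': linear_params(size_before_latent, latent_dimensions),
    'out_of_latent': linear_params(latent_dimensions, size_before_latent),
    'trans1': conv_params(filters[2], filters[1], k),
    'trans2': conv_params(filters[1], filters[0], k),
}
total = sum(expected.values())
print(expected)
print(total)
```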

## 4. Set Training and Test Sets <a id="set"></a>

PyTorch uses iterables to create its data objects. Here we show two ways to format the data to be PyTorch compatible.

1. **Use arrays:** Experienced Python users are more likely to be comfortable using and manipulating arrays. We will just reshape our images to have an input channel of 1, i.e. (1, 28, 28).

2. **Use LoadDataset:** In [Section 2](#build), we defined the `LoadDataset` class to format the data to be PyTorch compatible. The `Dataset` class comes with additional functionality specifically for PyTorch, but is beyond the scope of this tutorial.

We choose option 1 as default, but option 2 can be uncommented below. Using either does not affect training at all and is up to user preference.

In [None]:
train_set = x_train_scale.reshape(x_train_size, 1, x_length, x_length)
val_set = x_test_scale.reshape(x_test_size, 1, x_length, x_length)

In [None]:
# LoadDataset class
#train_set = LoadDataset(x_train_scale.reshape(x_train_size, 1, x_length, x_length))
#val_set = LoadDataset(x_test_scale.reshape(x_test_size, 1, x_length, x_length))

We also need to define a baseline for our model to perform better than. The baseline helps us understand if our model is learning anything at all. We choose the mean pixel of the inputs to be our baseline, i.e. a poor model would learn the reconstructed image as an image of the mean pixel. By calculating the [Mean Squared Error (MSE)](https://en.wikipedia.org/wiki/Mean_squared_error) of our training set and mean pixels, we have an established baseline to outperform.

In [None]:
# find mean pixel values of each image
mean = np.mean(x_train_scale, axis=(1,2)).reshape(x_train_size,1,1)

# create mean pixel value images
ones = np.ones((x_train_size, x_length, x_length))
mean_ones = mean * ones

# calculate baseline
baseline = np.sum(np.square(x_train_scale - mean_ones)) / (x_train_size * x_length ** 2)

baseline
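
One way to read this baseline: it is the average variance of the pixels within an image. The sketch below checks that equivalence on hypothetical random images standing in for `x_train_scale`.

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((100, 28, 28))   # hypothetical stand-in for x_train_scale

# Same computation as above: MSE between each image and its mean pixel value
mean = images.mean(axis=(1, 2)).reshape(-1, 1, 1)
baseline = np.sum(np.square(images - mean)) / images.size

# Equivalent view: the variance of each image, averaged over images
per_image_variance = images.var(axis=(1, 2))
print(baseline, per_image_variance.mean())
```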

## 5. Hyperparameters and Loading <a id="hyper"></a>

We must set the hyperparameters used for training. The hyperparameters we pass to `DataLoader` are batch size, shuffle, and number of workers. Batch size can be tuned as needed to improve results. Shuffle should almost always be True for the training set, since the data shouldn't be ordered in any specific way when training. In addition, the number of workers has a default of 0, which loads the data in the main process. We also choose the number of epochs we wish to train for.

In [None]:
torch.manual_seed(42)

# Prepping arguments we have to feed to `DataLoader`
params = {
        'batch_size': 128,
        'shuffle': True,
        'num_workers': 0
    }

# Number of epochs to train for
num_epochs = 5

Another useful number to calculate is how many updates our model will perform during training. We can calculate this by finding the number of batches in the training set (number of training samples / batch size, rounded up because `DataLoader` keeps the last partial batch by default) and multiplying it by the number of epochs. Knowing how many batches our model might need to be well trained can be a good place to start when tuning hyperparameters.

In [None]:
print ('The model will train using a total of {} batches'.format(num_epochs * 
                                                       int(np.ceil(x_train_size / params['batch_size']))))

With our hyperparameters set, we can load our training and test set using `DataLoader`. 

**Note the variable and function names in the notebook are directed for validation sets, but we will use them for the test set instead.** That being said, we use the definitions for validation set and test set interchangeably here.

In [None]:
# TRAINING SET
train_loader = DataLoader(train_set, **params)

# VALIDATION SET
valid_loader = DataLoader(val_set, **params)
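
As a quick arithmetic sketch of how the 60000 MNIST training images split into batches, assuming `DataLoader`'s default `drop_last=False` (the last, smaller batch is kept):

```python
import math

# MNIST training set size and the hyperparameters chosen above
n_samples, batch_size, num_epochs = 60000, 128, 5

# Number of batches per epoch when the final partial batch is kept
batches_per_epoch = math.ceil(n_samples / batch_size)

# Size of that final partial batch
last_batch_size = n_samples - (batches_per_epoch - 1) * batch_size

total_updates = num_epochs * batches_per_epoch
print(batches_per_epoch, last_batch_size, total_updates)
```

Since `nn.MSELoss` averages within each batch, the smaller final batch carries the same weight as a full one when batch losses are averaged over an epoch. For these sizes the effect is negligible.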

We will initialize our model again to be sure we are starting from scratch.

In [None]:
model = Autoencoder()

Now we define our loss function to be Mean Square Error. This function is standard in regression problems.

In [None]:
distance = nn.MSELoss()

Then we choose our optimizer to be [Adam](https://en.wikipedia.org/wiki/Stochastic_gradient_descent#Adam), since it adapts the step size of each parameter automatically and typically trains relatively fast compared to plain [Stochastic Gradient Descent (SGD)](https://en.wikipedia.org/wiki/Stochastic_gradient_descent).

In [None]:
optimizer = torch.optim.Adam(model.parameters(),  weight_decay=1e-5)
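
To give a feel for what the optimizer does, here is a simplified, scalar sketch of the Adam update rule using the default values of `torch.optim.Adam` (`lr=1e-3`, `betas=(0.9, 0.999)`, `eps=1e-8`). It omits weight decay and is not PyTorch's implementation. The function name `adam_step` and the toy objective f(x) = x**2 are assumptions made for illustration.

```python
import math

def adam_step(x, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Running averages of the gradient and of the squared gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction for the zero-initialized averages
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # The step is scaled by the gradient's recent magnitude
    x = x - lr * m_hat / (math.sqrt(v_hat) + eps)
    return x, m, v

x, m, v = 1.0, 0.0, 0.0
for t in range(1, 1001):
    grad = 2 * x                  # gradient of f(x) = x**2
    x, m, v = adam_step(x, grad, m, v, t)
    if t == 1:
        first_x = x               # value after the very first update

print(first_x, x)
```

The first update moves the parameter by almost exactly `lr` regardless of the gradient's size, which is one reason Adam is less sensitive to gradient scale than plain SGD.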

If you have GPUs available, then those will be used for training. If not, then the model will train on CPUs.

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device);

Let's print the device to make sure we know what's available.

In [None]:
device

## 6. Train Model <a id="train"></a>

In order to train our model, we have to manually loop through our data. This is probably the biggest difference between PyTorch and [TensorFlow](https://www.tensorflow.org/), but it allows for more hands-on control over how training is performed, which can be advantageous. We will train our model as follows:
1. Change the model to training mode (this sets the behavior of layers such as dropout and batch normalization; our model has neither, but it is good practice)
2. Initialize training loss to be 0
3. Loop through each batch of features by:
    - Putting the data onto your device
    - Calculating the outputs and the loss
    - Performing backpropagation, updating the weights, and adding the batch training loss to total training loss
4. Normalize the total training loss by the number of batches

In [None]:
# Define train loop

def train_model(train_loader):

    # Change model to training mode (affects layers such as dropout and batch norm)
    model.train()
    
    # Initialize training loss
    train_loss = 0
    
    # Loop through batches of training data
    for data in train_loader:
        
        # Put training batch on device
        data = data.float().to(device)

        # Calculate output and loss from training batch
        output = model(data)
        loss = distance(output, data)
        
        # Backpropagation and weight update
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
    
    # Normalize training loss from one epoch
    train_loss_norm = train_loss / len(train_loader)
    
    return train_loss_norm

In addition, we define a similar loop for evaluating the test set at each epoch, which tells us whether our model is generalizing. We will test our model as follows:
1. Change model to evaluation mode and disable gradient calculations with `torch.no_grad()`
2. Initialize test loss and variances to 0
3. Loop through each batch of features by:
    - Putting the data onto your device
    - Calculating the outputs and the loss
4. Calculate test set loss and normalize it by the number of batches
5. Calculate the R2 score to determine how well the outputs reproduce the inputs (a score near 0 means the model does no better than predicting each image's mean pixel; a score near 1 means a near-perfect reconstruction)

In [None]:
# Define validation loop

def validate_model(valid_loader):

    # Change model to evaluation mode (gradients are disabled by torch.no_grad() below)
    model.eval()
    
    # Initialize validation loss and variances
    val_loss = 0
    data_variance = 0
    res_variance = 0
    
    # Do not calculate gradients for the loop
    with torch.no_grad():
        
        # Loop through batches of validation data
        for data in valid_loader:
            
            # Put validation batch on device
            data = data.float().to(device)
            
            # Calculate output and loss from validation batch
            output = model(data)
            val_loss += distance(output, data).item()
            
            # Calculate variances (move tensors to the CPU before converting to numpy)
            data = data.detach().cpu().numpy()
            output = output.detach().cpu().numpy()
            
            data_mean = np.mean(data, axis=(1,2,3)).reshape(data.shape[0], 1, 1, 1)
            data_var = np.nansum((data - data_mean)**2)
            res_var = np.nansum((data - output)**2)
            
            data_variance += data_var
            res_variance += res_var
    
    # Normalize validation loss from one epoch
    val_loss_norm = val_loss / len(valid_loader)
    
    # Calculate r2 score
    r2_score = 1 - res_variance / data_variance
    
    return val_loss_norm, r2_score
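
The two switches used in these loops do different jobs, which the following minimal sketch tries to show on a hypothetical two-layer network: `eval()` changes how certain layers behave (here, dropout becomes a no-op), while `torch.no_grad()` is what stops autograd from tracking operations.

```python
import torch
from torch import nn

# Hypothetical tiny network with a dropout layer
net = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.ones(1, 4)

net.eval()                                       # switches layer behavior only
out_eval = net(x)
tracked_in_eval = out_eval.requires_grad         # autograd still records the graph

with torch.no_grad():                            # this disables gradient tracking
    out_no_grad = net(x)
tracked_in_no_grad = out_no_grad.requires_grad

print(net.training, tracked_in_eval, tracked_in_no_grad)
```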

Finally, we can train our model! We will print out the train and test loss/R2 score per epoch to keep track of performance. The loop below performs the training and validation loops defined above and records our metrics. The last print statement will tell us how many times better our model is than our baseline with respect to the training data.

In [None]:
# keep track of metrics
lst_train_loss = []
lst_val_loss = []
lst_r2_score = []

# training loop
for epoch in tqdm(range(num_epochs), total=num_epochs):

    # Go through loops
    train_loss = train_model(train_loader)
    val_loss, r2_score = validate_model(valid_loader)

    # Append metrics
    lst_train_loss.append(train_loss)
    lst_val_loss.append(val_loss)
    lst_r2_score.append(r2_score)

    # Log
    print('Epoch {} - Train loss: {:.4f} - Val Loss: {:.4f} - R2 Score: {:.4f}'.format(
            epoch, train_loss, val_loss, r2_score))
print ('The model is performing {:.4f} times better than the baseline'.format(baseline / lst_train_loss[-1]))
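
The R2 score and the baseline ratio are two views of the same comparison. The sketch below uses hypothetical random inputs and noisy "reconstructions" to show that, on a single dataset, R2 equals 1 - MSE / baseline. In the notebook the relation is only approximate, because the ratio uses the training loss while R2 is computed on the test set.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((50, 1, 28, 28))                      # hypothetical inputs
output = data + rng.normal(0, 0.1, size=data.shape)     # hypothetical reconstructions

# Mean squared error of the reconstructions and of the mean-pixel baseline
mse = np.mean((data - output) ** 2)
data_mean = data.mean(axis=(1, 2, 3)).reshape(-1, 1, 1, 1)
baseline = np.mean((data - data_mean) ** 2)

# R2 as computed in the validation loop
res_variance = np.sum((data - output) ** 2)
data_variance = np.sum((data - data_mean) ** 2)
r2_score = 1 - res_variance / data_variance

print(r2_score, 1 - mse / baseline, baseline / mse)
```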

## 7. Plot Loss Function and R2 <a id="plot"></a>

We plot the train/test loss and R2 scores to determine how well converged our model is.

In [None]:
fig, axs = plt.subplots(1, 2, figsize=[10,5])

axs[0].set_title('Loss')
axs[0].plot(np.arange(num_epochs), lst_train_loss, label='train')
axs[0].plot(np.arange(num_epochs), lst_val_loss, label='val')
axs[0].set_xlabel('Epochs')
axs[0].legend()

axs[1].set_title('R2 Score')
axs[1].plot(np.arange(num_epochs), lst_r2_score, color='C1')
axs[1].set_xlabel('Epochs')

## 8. Analyze Samples <a id="analyze"></a>
    
Now that our model is trained, let's analyze some samples to see how well our images are being reconstructed.
    
First, we predict the outputs of our test set.

In [None]:
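# gradients are not needed for inference
with torch.no_grad():
    output = model(torch.Tensor(val_set).to(device))
# move to the CPU before converting (required when the model runs on a GPU)
recon = output.detach().cpu().numpy()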
mse = np.sum((val_set - recon) ** 2, axis=(1,2,3)) / x_length ** 2

Next, we can plot random samples, their reconstructions, and their squared residuals.

In [None]:
# choose random image and corresponding output from test set
rand_index = np.random.randint(x_test_size)
rand_image = x_test_scale[rand_index]
rand_recon = recon[rand_index][0]
rand_sq_res = (rand_image - rand_recon) ** 2
rand_mse = mse[rand_index]

# plot input, output, and squared residuals
fig, axs = plt.subplots(2,2,figsize=[10,10])
axs[0,0].set_title('Testing Scaled Image {}'.format(rand_index))
axs[0,0].imshow(rand_image)
axs[0,1].set_title('Reconstructed Output')
axs[0,1].imshow(rand_recon, vmin=0, vmax=1)
axs[1,0].set_title('Squared Residual Image')
axs[1,0].imshow(rand_sq_res)
axs[1,1].set_title('Squared Residual Image (0-1 min-max)')
axs[1,1].imshow(rand_sq_res, vmin=0, vmax=1)
plt.tight_layout()

print ('MSE: {:.4f}'.format(rand_mse))

Now, let's plot the distribution of MSEs to get a better understanding of how well our model reconstructs each test sample.

In [None]:
plt.figure(figsize=[10,5])
plt.title('Test Set MSE Distribution')
plt.hist(mse, bins=50)
plt.xlabel('mse')
plt.ylabel('frequency')

In addition, let's see if we can distinguish the loss by class. If most of the distributions are within reason in relation to each other, then the model generalizes to all classes.

In [None]:
plt.figure(figsize=[10,5])
plt.title('Test Set MSE Distribution (by class)')
for digit in range(10):
    plt.hist(mse[y_test == digit], bins=50, label=digit, alpha=0.25)
plt.xlabel('mse')
plt.ylabel('frequency')
plt.legend()

Although the model is performing a lot better than the baseline, there are still samples it struggles with. Let's see how many samples have an MSE more than 3 sigma above the mean.

In [None]:
threshold = mse.mean() + 3 * mse.std()
mask = mse > threshold

print ('There are {} MSEs above {:.4f}.'.format(mask.sum(), threshold))

Now with our mask, we can look through "poorly" reconstructed samples.

In [None]:
# choose random poorly reconstructed image and corresponding output from test set
rand_index = np.random.randint(mask.sum())
rand_image = x_test_scale[mask][rand_index]
rand_recon = recon[mask][rand_index][0]
rand_sq_res = (rand_image - rand_recon) ** 2
rand_mse = mse[mask][rand_index]

# plot input, output, and squared residuals
fig, axs = plt.subplots(2,2,figsize=[10,10])
axs[0,0].set_title('Testing Masked Scaled Image {}'.format(rand_index))
axs[0,0].imshow(rand_image)
axs[0,1].set_title('Reconstructed Output')
axs[0,1].imshow(rand_recon, vmin=0, vmax=1)
axs[1,0].set_title('Squared Residual Image')
axs[1,0].imshow(rand_sq_res)
axs[1,1].set_title('Squared Residual Image (0-1 min-max)')
axs[1,1].imshow(rand_sq_res, vmin=0, vmax=1)
plt.tight_layout()

print ('MSE: {:.4f}'.format(rand_mse))

## 9. Visualize the Latent Space <a id="latent"></a>

As mentioned in the [Introduction](#intro), one of the use cases of an autoencoder is to reduce the dimensionality of the dataset. Since our autoencoder is able to reconstruct our inputs to a high degree, our data must be efficiently stored in the latent space. By using the encoder as a feature extractor, we can visualize the data in the latent space.

First, we'll define [a function](https://discuss.pytorch.org/t/how-can-i-extract-intermediate-layer-output-from-loaded-cnn-model/77301/3) that extracts the features at an intermediate step.

In [None]:
activation = {}
def get_activation(name):
    def hook(model, input, output):
        activation[name] = output.detach()
    return hook

model.latent.register_forward_hook(get_activation('latent'))
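
Here is a self-contained sketch of the same hook pattern on a hypothetical two-layer bottleneck network (not the autoencoder above). It also keeps the handle returned by `register_forward_hook`, so the hook can be removed again; re-running a registration cell without removing the old hook would otherwise attach a second copy.

```python
import torch
from torch import nn

# Hypothetical tiny bottleneck model: 8 -> 2 -> 8
net = nn.Sequential(nn.Linear(8, 2), nn.Linear(2, 8))

activation = {}
def get_activation(name):
    def hook(module, inputs, output):
        activation[name] = output.detach()
    return hook

# Keep the handle so the hook can be removed later
handle = net[0].register_forward_hook(get_activation('latent'))

with torch.no_grad():
    net(torch.zeros(5, 8))
captured_shape = tuple(activation['latent'].shape)

# After removal, a forward pass no longer fills the dictionary
handle.remove()
activation.clear()
with torch.no_grad():
    net(torch.zeros(5, 8))

print(captured_shape, 'latent' in activation)
```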

Then, we'll transform the test set to the latent space.

In [None]:
input_batch = torch.Tensor(val_set).to(device)
with torch.no_grad():
    output = model(input_batch)
features = activation['latent'].cpu().numpy()

Finally, we can plot the representation.

In [None]:
plt.figure(figsize=[10,5])
plt.title('Latent Space')
for digit in range(10):
    plt.scatter(features[:, 0][y_test==digit], features[:, 1][y_test==digit], label=digit, alpha=0.25)
plt.xlabel('AE 1')
plt.ylabel('AE 2')
plt.legend()

Even if the classes do not separate out distinctly, it's still impressive that the decoder is able to decipher what a digit will look like in a two dimensional space! There are a few things you can try to see if we can get better representation in the latent space.
- Increase the number of dimensions of the latent space: currently we are using 2 dimensions because it's easy to plot and visualize. However, as the number of dimensions increases so does the amount of information the latent space can store. The classes could further be distinguished in these slightly higher dimensions (5-10), but still far lower than the original input space (784).
- Increase the depth of the autoencoder: currently we are not activating the neurons from our final flattened feature maps to the latent space because that would constrain our reduced space. By increasing the depth of these fully connected layers, we can activate the neurons, adding nonlinearity and extracting even higher level features that could separate out the digits.

## 10. Detect Anomalies <a id="detect"></a>

As mentioned in the [Introduction](#intro), another use case of an autoencoder is to detect anomalies. Since the model has learned the features (straight lines, curves, etc.) that make up digits really well, it should reconstruct images outside of this domain poorly. To demonstrate this, we will generate 10000 anomalies and compare the MSEs of their reconstructions to that of the test set.

First, let's generate our "anomalies" and visualize what they look like. Our anomalies will be normally distributed noise with the same mean and variance as the scaled test set.

In [None]:
n = 10000
anom = np.random.normal(x_test_scale.mean(), x_test_scale.std(), (n, 1, x_length, x_length))
plt.imshow(anom[0, 0])

Then, we can reconstruct the noise images using the autoencoder and plot an output.

In [None]:
output_anom = model(torch.Tensor(anom).to(device)).detach().cpu().numpy()
plt.imshow(output_anom[0,0])

Now, let's calculate the loss of each anomaly.

In [None]:
mse_anom = np.sum((anom - output_anom) ** 2, axis=(1,2,3)) / x_length ** 2

Finally, we can plot the losses and compare distributions.

In [None]:
plt.figure(figsize=[15,10])
plt.title('Test Set and Anomaly MSE Distributions')
plt.hist(mse, bins=50, label='test set', alpha=0.5)
plt.hist(mse_anom, bins=50, label='anom', alpha=0.5)
plt.xlabel('mse')
plt.ylabel('frequency')
plt.legend()

The anomalies separate from the real data. Now if we were getting images in real time, we'd be able to distinguish between digits and noise by using a loss threshold.
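
One possible way to pick such a threshold is a high percentile of the losses on known-good data. The sketch below uses synthetic loss values (not results from this notebook) and the 99th percentile, both of which are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic reconstruction losses: in-distribution data and anomalies
mse_normal = rng.gamma(shape=4.0, scale=0.005, size=10000)
mse_anomaly = rng.normal(loc=0.12, scale=0.01, size=10000)

# Flag anything above the 99th percentile of the in-distribution losses
threshold = np.percentile(mse_normal, 99)
false_positive_rate = np.mean(mse_normal > threshold)
detection_rate = np.mean(mse_anomaly > threshold)

print(threshold, false_positive_rate, detection_rate)
```

The percentile directly sets the fraction of real digits that would be flagged by mistake, so it can be tuned to the cost of a false alarm.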

## 11. Conclusions <a id="con"></a>

Thank you for walking through this notebook. Now you should be more familiar with:
- building an autoencoder
- training and evaluating an autoencoder
- visualizing the latent space of an autoencoder
- detecting anomalies using an autoencoder

**Congratulations, you have completed the notebook!**

## Additional Resources <a id="add"></a>

Machine learning is a dense and rapidly evolving field of study. Becoming an expert takes years of practice and patience, but hopefully this notebook brought you closer in that direction. Here are some of the author's favorite resources for learning about machine learning and data science:

- [Google Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course/ml-intro)
- [scikit-learn Python Library](https://scikit-learn.org/stable/index.html) (go-to for most ML algorithms besides neural networks)
- [StatQuest YouTube Channel](https://www.youtube.com/c/joshstarmer)
- [DeepLearningAI YouTube Channel](https://www.youtube.com/c/Deeplearningai/videos)
- [Towards Data Science](https://towardsdatascience.com/) (articles about data science and machine learning, some involving example blocks of code)
- Advanced search on [arXiv](https://arxiv.org/search/advanced) (e.g. search term "machine learning" in Abstract for Subject astro-ph) to see what others are doing currently
- Google, YouTube, and Wikipedia in general

## About this Notebook <a id="about"></a>

**Author:** Fred Dauphin, DeepWFC3

**Updated on:** 2021-12-03

## Citations <a id="cite"></a>

If you use `numpy`, `matplotlib`, or `torch` for published research, please cite the authors. Follow these links for more information about citing `numpy`, `matplotlib`, and `torch`:

* [Citing `numpy`](https://numpy.org/doc/stable/license.html)
* [Citing `matplotlib`](https://matplotlib.org/stable/users/project/license.html#:~:text=Matplotlib%20only%20uses%20BSD%20compatible,are%20acceptable%20in%20matplotlib%20toolkits.)
* [Citing `torch`](https://github.com/pytorch/pytorch/blob/master/LICENSE)

***
[Top of Page](#title)