<h1> Approximating a Region of the Sine Function </h1>

In this notebook we will approximate a region of the sine function with a neural network to get a sense of how architecture and hyperparameters affect neural network performance.

<img src="../data/sine_wave.gif" width="1200" align="center">


In [None]:
import numpy as np
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
from torch import optim
from torch.utils.data import DataLoader
from torch.utils.data.dataset import Dataset

from tqdm.notebook import trange, tqdm

<h2> Pytorch Datasets and Dataloaders </h2>
<b> *Basics: we will cover these in more detail later on, but for now...</b><br>
Pytorch has a huge number of features that make training our neural networks very easy! One of these is the Pytorch dataset and dataloader (they are real life-savers!). The "dataset" class is an object that "stores" our dataset either directly (it loads all the data in the initialisation function) or indirectly (it stores only the image paths during initialisation and loads each image when it is needed; for large image-based datasets this is usually the only way to do it). We will see how we can create our own Pytorch dataset soon!<br>
These datasets are then used to create a "dataloader" object that is "iterable". The Pytorch dataloader will take our dataset and randomly shuffle it (if we tell it to). It will also divide the dataset into "mini-batches", which are groups of datapoints of a fixed size (the batch size). Our neural network is then trained with a single step of gradient descent (GD) on each mini-batch. As we iterate through the dataloader it will pass us a new, unique mini-batch until the whole dataset has been passed to us. One whole loop through the dataset is called an "epoch". If shuffling is enabled, the dataset is re-shuffled at the start of every epoch so the mini-batches are different each time. This random sampling of the dataset and training on mini-batches (instead of performing GD on the whole dataset) is called Stochastic Gradient Descent (SGD).<br>
Note: If the dataset does not divide evenly into mini-batches, the last iteration will just pass us whatever is left over!
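
To make the mini-batch behaviour concrete, here is a minimal sketch using a tiny stand-in dataset. The names `toy_dataset` and `toy_loader` and the choice of 10 datapoints are assumptions made purely for illustration; `TensorDataset` is a ready-made Pytorch dataset that simply wraps tensors.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy data (hypothetical): 10 inputs and their squares as targets
x = torch.arange(10, dtype=torch.float32).unsqueeze(1)
y = x ** 2
toy_dataset = TensorDataset(x, y)

# batch_size=4 does not divide 10 evenly, so the last mini-batch is smaller
toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=True)

batch_sizes = []
seen = []
for x_batch, y_batch in toy_loader:  # one full pass over the loader = one epoch
    batch_sizes.append(x_batch.shape[0])
    seen.extend(x_batch.flatten().tolist())

print(batch_sizes)   # sizes of the mini-batches in this epoch
print(sorted(seen))  # every datapoint appears exactly once per epoch
```

Running the loop a second time would give the same batch sizes but, because `shuffle=True`, a different assignment of datapoints to mini-batches.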

<h3> Creating a Pytorch dataset </h3>
The dataset we will be creating will be points from a "noisy" sine wave.<br>
The Pytorch dataset class has three essential parts:<br>
The __init__ function (as most Python classes do)<br>
The __getitem__ function (this is called during every iteration)<br>
The __len__ function (this must return the length of the dataset)
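
As a rough illustration of how the three parts fit together, the sketch below defines a different, much simpler dataset. `SquaresDataset` is a hypothetical example invented for this note, not the dataset built in this notebook.

```python
import torch
from torch.utils.data import Dataset


class SquaresDataset(Dataset):
    """Hypothetical toy dataset: evenly spaced x values and their squares."""

    def __init__(self, num_datapoints):
        # Build and store all of the data up front
        self.x_data = torch.linspace(-1, 1, num_datapoints).unsqueeze(1)
        self.y_data = self.x_data ** 2

    def __getitem__(self, index):
        # Return the input AND the label for a single index
        return self.x_data[index], self.y_data[index]

    def __len__(self):
        # Number of datapoints in the dataset
        return self.x_data.shape[0]


toy = SquaresDataset(5)
x0, y0 = toy[0]       # indexing calls __getitem__
print(len(toy))       # len() calls __len__
print(x0, y0)
```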

In [None]:
# Create a "SineDataset" class by inheriting from the Pytorch Dataset class
class SineDataset(Dataset):
    """ Noisy sine wave dataset
        num_datapoints - the number of datapoints you want
    """
    def __init__(self, num_datapoints):
        # Let's generate the noisy sine wave points
        
        # Create "num_datapoints" random x points from a uniform distribution (0-1) using torch.rand
        # Then scale and shift the points to be between -9 and 9
        self.x_data = # TO DO
        
        # Calculate the sine of all data points in the x vector, then scale the amplitude down by 2.5
        self.y_data = # TO DO
        
        # Add some Gaussian noise to each datapoint using torch.randn_like 
        # (scale the noise down by about 20 - see how different noise levels affect your model)
        # Note: torch.randn_like will generate a tensor of Gaussian noise with the same size 
        # and type as the provided tensor
        self.y_data += # TO DO

    def __getitem__(self, index):
        # This function is called by the dataLOADER class whenever it needs a datapoint for a mini-batch
        # The dataLOADER passes the dataSET one index at a time and collates the results into a mini-batch
        # It is up to the dataSET's __getitem__ function to output the corresponding input datapoint 
        # AND the corresponding label
        return # TO DO
    
        # Note: If you use multiple dataLOADER "workers", multiple __getitem__ calls will be made in parallel
        # (Pytorch will spawn multiple worker processes)

    def __len__(self):
        # We also need to specify a "length" function. Python will call this function whenever
        # you use the built-in len() function on the dataset
        # We need to define it so the dataLOADER knows how big the dataSET is!
        return # TO DO

Now that we've defined our dataset, let's create an instance of it for training and testing and then create dataloaders to make it easy to iterate over them.

In [None]:
n_x_train = 30000   # the number of training datapoints
n_x_test = 8000     # the number of testing datapoints
batch_size = 16

# Create an instance of the SineDataset for both the training and test set
dataset_train = # TO DO
dataset_test  = # TO DO

# https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader
# Now we need to pass the dataSET to the Pytorch dataLOADER class along with some other arguments
# dataset - the dataset for this dataloader
# batch_size - the size of our mini-batches
# shuffle - whether or not we want to shuffle the dataset (True for training False for testing)
    
data_loader_train = # TO DO
data_loader_test = # TO DO

Let's visualise the dataset we've created!

In [None]:
fig = plt.figure(figsize=(20, 10))
plt.scatter(dataset_train.x_data, dataset_train.y_data, s=0.2)
# Note: see here how we can just directly access the data from the dataset class

<h2> Neural Network Architecture</h2>
<b> Non-Linear function approximators! </b> <br>

Up until now we have only created a single linear layer with an input layer and an output layer. In this section we will start to create multi-layered networks with many "hidden" layers separated by "activation functions" that give our networks "non-linearities". If we didn't have these activation functions and simply stacked layers together, our network would be no better than a single linear layer! Why? Because multiple sequential "linear transformations" can be modeled with just a single linear transformation. This is easiest to understand with matrix multiplications (which is exactly what happens inside a linear layer).<br>

$M_o = M_i*M_1*M_2*M_3*M_4*M_5$<br>
Is the same as<br>
$M_o = M_i*M_T$<br>
Where<br>
$M_T = M_1*M_2*M_3*M_4*M_5$<br>

In other words, multiplication by several matrices in sequence can be simplified to multiplication by a single matrix.<br>
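
A quick numerical check of this claim is sketched below. It assumes three weight matrices rather than five and arbitrary shapes chosen for illustration; the names mirror the symbols in the equations above.

```python
import torch

torch.manual_seed(0)

# Arbitrary shapes (assumed for illustration); float64 keeps rounding error tiny
M_i = torch.randn(8, 4, dtype=torch.float64)  # a batch of 8 inputs
M_1 = torch.randn(4, 6, dtype=torch.float64)
M_2 = torch.randn(6, 6, dtype=torch.float64)
M_3 = torch.randn(6, 2, dtype=torch.float64)

# Apply the three "layers" one after another
stacked = M_i @ M_1 @ M_2 @ M_3

# Collapse the three layers into a single matrix first
M_T = M_1 @ M_2 @ M_3
collapsed = M_i @ M_T
same_linear = torch.allclose(stacked, collapsed)

# With a tanh between the layers the collapse no longer holds
with_tanh = torch.tanh(torch.tanh(M_i @ M_1) @ M_2) @ M_3
same_nonlinear = torch.allclose(with_tanh, collapsed)

print(same_linear, same_nonlinear)
```

The first comparison succeeds because matrix multiplication is associative; the second fails because the activation function sits between the multiplications and cannot be folded into a single matrix.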

So what are these nonlinear activation functions that turn our simple linear models into a powerful "nonlinear function approximator"? Some common examples are:<br>
1. relu
2. sigmoid
3. tanh

Simply put, they are "nonlinear" functions. The simplest is the "rectified linear unit" (relu), which is "piecewise linear": it outputs zero for negative inputs and passes positive inputs through unchanged, so it is nonlinear overall.
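
As a small illustration, the three activation functions listed above can be applied directly to a tensor. The input values here are arbitrary and chosen only to show the behaviour either side of zero.

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

relu_out = torch.relu(x)        # zero for negative inputs, identity for positive inputs
sigmoid_out = torch.sigmoid(x)  # squashes every input into the range (0, 1)
tanh_out = torch.tanh(x)        # squashes every input into the range (-1, 1)

print(relu_out)
print(sigmoid_out)
print(tanh_out)
```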

NOTE: The term "layer" most commonly refers to the inputs or outputs of the weight matrix or activation functions and not the linear layer or activation layer themselves. Output layers in between two "linear layers" are called "hidden layers". You can imagine them "inside" the neural network, with us only being able to see the input and output layers. To confuse things even further, the outputs of activation functions are also commonly called "activations".

Why do we want a non-linear function approximator? Because many processes, tasks and systems in the real world are non-linear. "Linear" in basic terms refers to any process that takes inputs, scales them and sums them together to get an output. 


## Regression or Classification Neural Networks do only one thing....
In this notebook we are performing regression, which as we've seen is very similar to classification! Both regression and classification can be thought of as producing a distribution over possible values for a given input. In classification the model produces the probability that the input belongs to a particular category where the probabilities define a discrete categorical distribution (or a Bernoulli distribution for binary classification)!
In regression our model also produces a distribution on the output, however it may be less clear how. The output of the model is in fact the expectation (mean) of a normal distribution with sigma (standard deviation) equal to 1 (assuming we are using the basic MSE loss). In fact, the Mean Squared Error (MSE) loss we use can be thought of in the same way as the cross entropy loss used in classification! With the MSE loss we are trying to learn a model that produces a normal distribution (conditioned on the input data) such that the target value has the highest likelihood! Where does the normal distribution have the highest likelihood? At the mean!!<br>
For more information on how we get the MSE loss from the maximum likelihood of a normal distribution, have a look at the following:<br>
[Blog: MSE is Cross Entropy at heart by Moein Shariatnia](https://towardsdatascience.com/mse-is-cross-entropy-at-heart-maximum-likelihood-estimation-explained-181a29450a0b)
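
The link between the two losses can be checked numerically. The sketch below assumes random stand-in predictions and targets (not data from this notebook) and compares the MSE with the negative log-likelihood of a normal distribution whose mean is the prediction and whose sigma is 1.

```python
import math

import torch
import torch.nn as nn
from torch.distributions import Normal

torch.manual_seed(0)

# Random stand-ins for model outputs and targets (assumed for illustration)
prediction = torch.randn(100, 1)
target = torch.randn(100, 1)

# Standard mean squared error
mse = nn.MSELoss()(prediction, target)

# Average negative log-likelihood of the targets under N(prediction, 1)
nll = -Normal(loc=prediction, scale=1.0).log_prob(target).mean()

# The two differ only by a factor of 0.5 and a constant that does not depend on the model
constant = 0.5 * math.log(2 * math.pi)
matches = torch.allclose(nll, 0.5 * mse + constant, atol=1e-5)
print(mse.item(), nll.item(), matches)
```

Because the scale factor and the constant do not depend on the model's parameters, minimising the MSE and maximising this likelihood lead to the same solution.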

<h3>Pytorch nn.Module</h3>
Now we can define a Pytorch model to be trained!<br>
To do so we use the Pytorch nn.Module class as the base for defining our network. Just like the dataset class, this class has a number of important functions.
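
For orientation, here is a minimal sketch of the general pattern on a much smaller network. `TinyNet`, its layer names and its sizes are hypothetical and deliberately differ from the model you will build below.

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical two-layer network used only to show the nn.Module pattern."""

    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        # Layers defined here are registered as learnable parameters
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # The forward pass: linear layer, activation, linear layer
        x = torch.relu(self.fc1(x))
        return self.fc2(x)


net = TinyNet(input_size=1, hidden_size=8, output_size=1)
x = torch.randn(5, 1)  # a mini-batch of 5 one-dimensional inputs

out = net(x)  # calling the model runs forward() for us
num_params = sum(p.numel() for p in net.parameters())
print(out.shape, num_params)
```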

In [None]:
# Define our network class by inheriting from the nn.Module class
class ShallowLinear(nn.Module):
    '''
    A simple, general purpose, fully connected network
    '''
    # Here we initialise our network and define all the layers we need
    def __init__(self, input_size, output_size, hidden_size):
        # Perform initialization of the pytorch superclass, this will allow us to inherit 
        # functions from the nn.Module class
        super(ShallowLinear, self).__init__()
        
        # Define 4 linear layers for our model with the same hidden size for all
        # Note: the output of one layer must be the same size as the input to the next!
        # TO DO
        # TO DO
        # TO DO
        # TO DO

    def forward(self, x):
        # This function is an important one and we must create it or pytorch will give us an error!
        # This function defines the "forward pass" of our neural network
        # and will be called when we simply call our network class
        # aka we can do net(input) instead of net.forward(input)
        
        # Let's define the sequence of events for our forward pass!
        # We'll use a tanh activation function here
        # You can experiment with other activation functions!
        x = # TO DO layer 1
        x = # TO DO activation function
        
        x = # TO DO layer 2
        x = # TO DO activation function
        
        x = # TO DO layer 3
        x = # TO DO activation function

        # No activation function on the output!!
        x = # TO DO layer 4
        
        # Note we re-use the variable x as we don't care about overwriting it 
        # though in later labs we will want to use earlier hidden layers
        # later in our network!
        return x

## Define hyperparameters, model and optimizer

Here we define the following parameters for training:

- batch size (which has already been defined)
- learning rate
- number of training epochs
- optimizer
- loss function

Ideally, numeric parameters would be tested empirically with an exhaustive search. When testing manually, it is recommended to tune one parameter at a time, keeping the others fixed, to avoid confounding your results. 

Try these learning rates:
- 5e-2, 1e-2, 5e-3, 1e-3, 5e-4, 1e-4, 5e-5

Try these optimizers:
- `optim.SGD(shallow_model.parameters(), lr=learning_rate)`
- `optim.Adam(shallow_model.parameters(), lr=learning_rate)`

[Youtube: Optimizers - EXPLAINED! by CodeEmporium](https://youtu.be/mdKjMPmcWjY?si=VtjuF_QlPHAzj2Sx)

See the pytorch documentation pages for an extensive list of options:
- Optimizers: http://pytorch.org/docs/master/optim.html#algorithms
- Loss: http://pytorch.org/docs/master/nn.html#id46

Read this page for a detailed comparison of optimizers: http://ruder.io/optimizing-gradient-descent/
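
To get a feel for the optimizer interface before using it on a network, the sketch below minimises a simple one-parameter function with both optimizers. The helper `minimise`, the target value 3.0, the learning rate and the step count are all assumptions chosen for illustration.

```python
import torch
from torch import optim


def minimise(optimizer_cls, lr, steps=200):
    """Hypothetical helper: minimise (w - 3)^2 starting from w = 0."""
    w = torch.zeros(1, requires_grad=True)
    optimizer = optimizer_cls([w], lr=lr)
    for _ in range(steps):
        loss = ((w - 3.0) ** 2).sum()
        optimizer.zero_grad()  # clear gradients left over from the previous step
        loss.backward()        # compute d(loss)/dw
        optimizer.step()       # update w using the optimizer's rule
    return w.item()


w_sgd = minimise(optim.SGD, lr=0.1)
w_adam = minimise(optim.Adam, lr=0.1)
print(w_sgd, w_adam)  # both should have moved from 0 towards 3
```

The same three calls (`zero_grad`, `backward`, `step`) are used regardless of which optimizer is chosen, which is what makes swapping optimizers in the cell below a one-line change.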



In [None]:
# Define the hyperparameters
learning_rate = # TO DO
nepochs = # TO DO

# Create model
shallow_model = # TO DO

# Initialize the optimizer with above parameters
optimizer = # TO DO

# Define the loss function
loss_fn = # TO DO mean squared error

## Initiate training, plot testing results
Here we put all the previous methods together to train and test the model. This problem is an unusual one in that our loss is the best quantitative metric of the model performance. Classification problems require further analysis of true/false positives/negatives.

Rerun this cell several times without editing any parameters. Is the result the same?

Try a larger batch size. How is the training time affected?

Look at the slope and noise level of the loss plot. Does it look like the training converged on a local minimum?

Try some different hyperparameters and see how accurate you can get your model.

In [None]:
# Here we create two lists to store the loss values from training and testing
training_loss_logger = []
testing_loss_logger = []
# Note: create them outside of the train/test cell so they don't get overwritten 
# if we want to run the cell again

In [None]:
# The main train/test cell
# This will run one epoch of training and then one epoch of testing nepochs times
for epoch in trange(nepochs, desc="Epochs", leave=False):
    
    # Perform training Loop!
    for x, y in tqdm(data_loader_train, desc="Training", leave=False):

        # Run forward calculation
        # TO DO
        
        # Compute loss.
        # TO DO
        
        # Before the backward pass, use the optimizer object to zero all of the
        # gradients for the variables it will update (which are the learnable weights
        # of the model)
        # TO DO
        
        # Backward pass: compute gradient of the loss with respect to model
        # parameters
        # TO DO

        # Calling the step function on an Optimizer makes an update to its
        # parameters
        # TO DO

        # Log the loss so we can visualise the training plot later
        # TO DO

    # Perform a test Loop!
    # While we are within a "with torch.no_grad():" block, Pytorch will not construct the computational graph
    # This helps speed up computation and saves memory while we are not training
    with torch.no_grad():
        # Create a variable to accumulate the test loss so we can take the average
        test_loss_accum = 0
        for i, (x, y) in enumerate(tqdm(data_loader_test, desc="Testing", leave=False)):

            # Run forward calculation
            # TO DO
            
            # Compute loss.
            # TO DO
            
            # Log the loss so we can visualise the testing plot later
            # TO DO
            test_loss_accum += loss.item()
            
        test_loss_accum /= (i + 1)
        
print("Epoch [%d/%d], Average Test Loss %.4f" % (epoch + 1, nepochs, test_loss_accum))
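
The effect of the `torch.no_grad()` block used in the test loop can be seen on a single layer. This is a minimal sketch; the layer and input sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

layer = nn.Linear(1, 1)
x = torch.randn(4, 1)

# Normal forward pass: Pytorch records the operations needed for backpropagation
out_train = layer(x)

# Inside no_grad the same computation runs, but no computational graph is built
with torch.no_grad():
    out_eval = layer(x)

print(out_train.requires_grad, out_eval.requires_grad)
print(torch.equal(out_train, out_eval))  # the numerical values are identical
```

Only the bookkeeping changes, not the result, which is why it is safe to wrap evaluation code in this block.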

<h3>Let's visualise our results!</h3>

In [None]:
# Plot out the test and train losses!
# TO DO

In [None]:
# Perform one last epoch over the testing set, this time logging the model's outputs!
# Perform a test Loop!
# TO DO

In [None]:
# Plot the test data against the model's predicted outputs. How accurate does it look?
# TO DO