# Fundamentals of Numpy & PyTorch
---

## **Contents**

*Note: the content is based on this [blog post](https://rickwierenga.com/blog/machine%20learning/numpy-vs-pytorch-linalg.html).*

1. Setting Up Numpy & Pytorch
2. Basic Numpy Arrays & Pytorch Tensors
  2a. Example of Numpy Arrays and Pytorch Tensors
  2b. Pivoting Data
  2c. Combining Data
  2d. Mathematical operations
3. Torch GPU Operation
4. Torch Automatic Differentiation
5. Torch Neural Net Example Using MNIST

## **1. Setting up Numpy and Pytorch**

In [None]:
# You can install NumPy/PyTorch with the following commands:
# !conda install numpy
# !conda install pytorch
# To check which version of NumPy/PyTorch you are using,
# or to confirm that it is installed on your system:
!conda list numpy 
!conda list torch 

In [None]:
import numpy as np
import torch
# Set random seed to 0 to ensure identical results in each run.
torch.manual_seed(0) 
np.random.seed(0) 

## **2. Basic Numpy Arrays & Pytorch Tensors**

In NumPy, an array is a data structure that stores values of the same data type, 
while a Python list can store values of different types.
Similar to a NumPy array, a tensor in torch is a multi-dimensional matrix 
containing elements of a single data type.
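
The single-dtype rule can be seen in a short sketch. The variable names below are illustrative and not part of the original notebook, and the dtypes shown assume PyTorch's default float dtype has not been changed.

```python
import numpy as np
import torch

# A Python list keeps each element's own type.
mixed_list = [1, 2.5, "three"]
list_types = [type(v).__name__ for v in mixed_list]

# NumPy upcasts mixed ints and floats to one common dtype.
arr = np.array([1, 2.5, 3])

# A tensor built from Python ints is stored as 64-bit integers.
int_tensor = torch.tensor([1, 2, 3])
# A single float in the input makes the whole tensor floating point.
float_tensor = torch.tensor([1, 2.5, 3])

print(list_types)          # ['int', 'float', 'str']
print(arr.dtype)           # float64
print(int_tensor.dtype)    # torch.int64
print(float_tensor.dtype)  # torch.float32
```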

### **2a. Example of Numpy Arrays and Pytorch Tensors**

In [None]:
# Initializes a numpy array of zeroes, with size (3,4)
new_array1 = np.zeros((3, 4))
# Initializes a numpy array of ones, with size (3,4)
new_array2 = np.ones((3, 4))
# Initializes a random numpy array with standard normal distribution
new_array3 = np.random.randn(3, 4) 
# Initializes a random numpy array with uniformly distributed integers in [0, 3)
new_array4 = np.random.randint(3, size = (4, 5)) 


# Initializes a torch tensor of zeroes, with size (5,3)
new_tensor1 = torch.zeros(size=(5,3))
# Initializes a torch tensor of ones, with size (5,3)
new_tensor2 = torch.ones(size=(5,3))
# Returns a 2-D tensor with ones on the diagonal and zeros elsewhere
new_tensor3 = torch.eye(3)
# Returns a tensor with random numbers from uniform distribution on [0, 1)
new_tensor4 = torch.rand(size=(3,4))
# Returns a 1-D tensor with values from the interval [start, end) taken with 
# common difference step beginning from start
new_tensor5 = torch.arange(start=-3, end=9, step=2)


# List - Numpy Array - Torch Tensor conversion
# list -> array
list1 = [1, 2, 3, 4, 5]
list_toarray = np.array(list1) # print(type(list_toarray))
# list -> tensor
list2 = [6, 7, 8, 9, 10]
list_totensor = torch.tensor(list2) # print(type(list_totensor))
# array -> tensor
nparray1 = np.array([1,2,3,4])
array_totensor = torch.from_numpy(nparray1) # print(type(array_totensor))
# tensor -> array
tensor1 = torch.tensor([5,6,7,8])
tensor_toarray = tensor1.detach().numpy() # print(type(tensor_toarray))


### **2b. Pivoting Data**

**NOTE:** Values can easily be modified by using the accessing method to select 
the desired section of the array/tensor to be modified.
* Indexing is using the location of an element in an array/tensor to extract it.
* Slicing is used to obtain/extract a subset of elements in an array/tensor.

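**NOTE:** For a PyTorch tensor, `size()` and `shape` mean the same thing. For a NumPy array, `shape` is the tuple of dimension lengths, while `size` is the total number of elements.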
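
As a rough illustration of indexing, slicing and in-place modification, here is a small sketch; the array contents and names are made up for the example.

```python
import numpy as np
import torch

arr = np.arange(12).reshape(3, 4)      # rows: [0..3], [4..7], [8..11]
ten = torch.arange(12).reshape(3, 4)

# Indexing: pick a single element by its position.
elem_np = arr[1, 2]        # row 1, column 2
elem_t = ten[1, 2]

# Slicing: pick a sub-block (first two rows, last two columns).
block_np = arr[:2, -2:]
block_t = ten[:2, -2:]

# Assigning through an index or slice modifies the original in place.
arr[0, :] = -1             # overwrite the first row
ten[:, 0] = 0              # overwrite the first column

print(elem_np, elem_t.item())          # 6 6
print(block_np.shape, block_t.shape)
print(arr.shape, arr.size)             # (3, 4) 12
print(ten.shape == ten.size())         # True
```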

In [None]:
# Flattening an array/tensor means to remove all of the dimensions except for one.

# Flatten a Numpy Array
original_array = np.random.randint(3, size = (2, 3, 4)) 
flattened_array = original_array.flatten()

# Flatten a Torch Tensor
original_tensor = torch.rand(size=(2, 3, 4))
flattened_tensor = original_tensor.flatten()

In [None]:
# Squeezing an array/tensor removes the dimensions or axes that have a length of
# one, while unsqueezing an array/tensor adds a dimension with a length of one.
# These functions allow us to expand or shrink the rank (number of dimensions)
# of our array/tensor.

# Squeeze & unsqueeze a Numpy Array
original_array = np.random.randint(3, size = (6, 1, 3))
squeezed_array = np.squeeze(original_array, axis = 1)
unsqueezed_array = np.expand_dims(squeezed_array, axis = 1) 

# Squeeze & unsqueeze a Pytorch Tensor
original_tensor = torch.rand(size=(3, 2, 1, 2))
squeezed_tensor = original_tensor.squeeze(2)
unsqueezed_tensor = squeezed_tensor.unsqueeze(2)

In [None]:
# Using the reshape() function, we can specify the shape we want, as long as 
# the total number of elements in the array/tensor stays the same. 
# NumPy/PyTorch allow one entry of the new shape to be -1 (e.g. (2,-1) or
# (-1,3), but not (-1,-1)). It marks an unknown dimension whose size 
# NumPy/PyTorch infer from the total number of elements and the remaining 
# dimensions.

# Reshaping a Numpy Array
original_array = np.random.randint(3, size = (2, 3, 4)) 
reshaped_array1 = np.reshape(original_array, (4, 2, 3))
reshaped_array2 = np.reshape(original_array, (6, -1))

# Reshaping a Torch Tensor
original_tensor = torch.rand(size=(3, 1, 2, 2))
reshaped_tensor1 = original_tensor.reshape(3, 4, 1)
reshaped_tensor2 = original_tensor.reshape(2, -1)


In [None]:
# NUMPY: If the axes are specified, it must be a tuple or list which contains a 
# permutation of [0,1,..,N-1] where N is the number of axes of the array. The 
# i’th axis of the returned array will correspond to the axis numbered axes[i] 
# of the input. If not specified, defaults to range(a.ndim)[::-1], which 
# reverses the order of the axes.

# Transposing a Numpy Array
original_array = np.random.randint(3, size = (2, 3)) 
# Matrix transpose with axes left empty
transposed_array1 = np.transpose(original_array) # (3, 2)

example_array = np.random.randint(3, size = (2, 3, 5))
# axes tuple must have length n, where n = rank of the array (3 in this case)
transposed_exampleX = np.transpose(example_array, axes = (0,1,2)) # (2, 3, 5), unchanged
transposed_example1 = np.transpose(example_array, axes = (1,0,2)) # (3, 2, 5)
transposed_example2 = np.transpose(example_array, axes = (2,1,0)) # (5, 3, 2)
transposed_example3 = np.transpose(example_array, axes = (0,2,1)) # (2, 5, 3)
transposed_example4 = np.transpose(example_array, axes = (1,2,0)) # (3, 5, 2)


# PYTORCH: Returns a tensor that is a transposed version of input. The given 
# dimensions are swapped. The resulting tensor shares its underlying storage
# with the input tensor, so changing the content of one would change the 
# content of the other.

# Transposing a Torch Tensor
original_tensor = torch.rand(size=(2, 3, 4))
transposed_tensor1 = original_tensor.transpose(0,2)

# NOTE: In Torch, the permute operation reorders any number of dimensions at 
# once, unlike transpose, which swaps exactly two dimensions. torch.permute() 
# returns a view of the original tensor with its dimensions rearranged in the 
# requested order. The number of elements stays the same; only the order of 
# the dimensions (and hence the shape) changes.
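
The difference between `transpose` and `permute` can be sketched as follows. The tensor and the chosen dimension order are arbitrary examples.

```python
import torch

x = torch.rand(2, 3, 4)

# transpose swaps exactly two dimensions.
t = x.transpose(0, 2)          # shape (4, 3, 2)

# permute reorders all dimensions in one call.
p = x.permute(2, 0, 1)         # shape (4, 2, 3)

# Both return views, so they start at the same memory address as x.
same_storage = p.data_ptr() == x.data_ptr()

# Element (i, j, k) of x is found at (k, i, j) in p.
match = bool(x[1, 2, 3] == p[3, 1, 2])

print(t.shape, p.shape, same_storage, match)
```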


### **2c. Combining Data**

In [None]:
# A concatenation operation joins a sequence of arrays/tensors along an existing
# axis. All arrays/tensors must either have the same shape (except in the 
# concatenating dimension) or be empty.

# Concatenating Numpy Arrays
array1 = np.random.randint(3, size = (3, 4, 5))
array2 = np.random.randint(4, size = (3, 4, 5))
concatenated_array1 = np.concatenate((array1, array2), axis = 0) # (6, 4, 5)
concatenated_array2 = np.concatenate((array1, array2), axis = 1) # (3, 8, 5)
concatenated_array3 = np.concatenate((array1, array2), axis = 2) # (3, 4, 10)

# Concatenating Torch Tensors
tensor1 = torch.rand(size=(3, 4, 5))
tensor2 = torch.rand(size=(3, 4, 5))
concatenated_tensor1 = torch.cat([tensor1, tensor2], dim=0) # (6, 4, 5)
concatenated_tensor2 = torch.cat([tensor1, tensor2], dim=1) # (3, 8, 5)
concatenated_tensor3 = torch.cat([tensor1, tensor2], dim=2) # (3, 4, 10)

In [None]:
# The stack operation joins a sequence of arrays/tensors along a new axis. The 
# axis parameter specifies the index of the new axis in the dimensions of the 
# result. For example, if axis=0 it will be the first dimension and if axis=-1 
# it will be the last dimension. All arrays/tensors need to be of the same size.
# The stacked array/tensor has one more dimension than the input arrays.

# Stacking Numpy Arrays
array1 = np.random.randint(3, size = (3, 4, 5))
array2 = np.random.randint(4, size = (3, 4, 5))
stacked_array1 = np.stack((array1, array2), axis = 0) # (2, 3, 4, 5)
stacked_array2 = np.stack((array1, array2), axis = 1) # (3, 2, 4, 5)
stacked_array3 = np.stack((array1, array2), axis = 2) # (3, 4, 2, 5)
stacked_array4 = np.stack((array1, array2), axis = -1) # (3, 4, 5, 2)

# Stacking Torch Tensors
tensor1 = torch.rand(size=(3, 4, 5))
tensor2 = torch.rand(size=(3, 4, 5))
stacked_tensor1 = torch.stack([tensor1, tensor2], dim=0) # (2, 3, 4, 5)
stacked_tensor2 = torch.stack([tensor1, tensor2], dim=1) # (3, 2, 4, 5)
stacked_tensor3 = torch.stack([tensor1, tensor2], dim=2) # (3, 4, 2, 5)
stacked_tensor4 = torch.stack([tensor1, tensor2], dim=-1) # (3, 4, 5, 2)


In [None]:
# The repeat operation repeats elements of an array. The number of repetitions 
# for each element is broadcast to fit the shape of the given axis.

# Repeat in Numpy Arrays
original_array = np.array([[1,2],[3,4]])
repeated_array1 = np.repeat(original_array, 2) # (8, )
repeated_array2 = np.repeat(original_array, 3, axis=0) # (6, 2)
repeated_array3 = np.repeat(original_array, 3, axis=1) # (2, 6)
repeated_array4 = np.repeat(original_array, [2,3], axis=0) # (5, 2)

# Torch's 'repeat' tiles the whole tensor the given number of times along each
# dimension (like np.tile) instead of repeating individual elements. The 
# element-wise behaviour of np.repeat is provided by 'repeat_interleave'.

# Repeat in Torch Tensors
original_tensor = torch.tensor([1,2,3,4]) # (4,)
repeated_tensor1 = original_tensor.repeat(0) # (0,), an empty tensor
repeated_tensor2 = original_tensor.repeat(2) # (8,)
repeated_tensor3 = original_tensor.repeat((2,3)) # (2, 12)

# Repeat Interleave in Torch Tensors
original_tensor = torch.tensor([[1,2],[3,4]])
repeated_tensor1 = original_tensor.repeat_interleave(2) #(8)
repeated_tensor2 = original_tensor.repeat_interleave(3, dim=0) #(6, 2)
repeated_tensor3 = original_tensor.repeat_interleave(3, dim=1) #(2, 6)

In [None]:
# Arrays/tensors are padded so that the underlying data all has the same size, 
# which allows computations to be batched and optimized.

# Padding Numpy Arrays
original_array = np.array([[1,2,3,4],
                 [1,2,3,4],
                 [1,2,3,4],
                 [1,2,3,4]])
# Setting the width of padding for each side
pad_left   = 1
pad_right  = 2
pad_top    = 1
pad_bottom = 2

padded_array1 = np.pad(original_array, pad_width =  ((pad_top, pad_bottom), (pad_left, pad_right)), mode = 'constant' )
padded_array2 = np.pad(original_array, pad_width =  ((pad_top, pad_bottom), (pad_left, pad_right)), mode = 'edge' )
padded_array3 = np.pad(original_array, pad_width =  ((pad_top, pad_bottom), (pad_left, pad_right)), mode = 'reflect' )
padded_array4 = np.pad(original_array, pad_width =  ((pad_top, pad_bottom), (pad_left, pad_right)), mode = 'symmetric')

# Padding Torch Tensors
# NOTE: Requires the functional module from torch.nn
from torch.nn import functional as F
original_tensor = torch.tensor([[1,2,3,4],
                 [1,2,3,4],
                 [1,2,3,4],
                 [1,2,3,4]])
padded_tensor1 = F.pad( original_tensor, (pad_left, pad_right, pad_top, pad_bottom), mode = 'constant' )

### **2d. Mathematical Operations**

**Point-wise/element-wise Array operations**
* Addition/Multiplication with Scalars
* Elementwise Addition/Multiplication of Arrays (elementwise multiplication is also known as the Hadamard product)
* Absolute value
* Broadcasting between arrays of different dimensions

*Note:* When broadcasting two multi-dimensional tensors, match their corresponding 
dimensions beginning from the last dimension.
Each pair of dimensions must either match, or one of the two must have length 1 
(a missing leading dimension is treated as length 1).
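
A small sketch of these point-wise operations and of the broadcasting rule follows. The shapes are chosen only for illustration.

```python
import numpy as np
import torch

a = np.arange(6).reshape(2, 3)        # shape (2, 3)
row = np.array([10, 20, 30])          # shape (3,), broadcast as (1, 3)
col = np.array([[1], [2]])            # shape (2, 1)

scaled = 2 * a + 1                    # scalar ops apply to every element
hadamard = a * a                      # elementwise (Hadamard) product
plus_row = a + row                    # row is added to every row of a
plus_col = a + col                    # col is added to every column of a

# The same rules hold for tensors: (1, 2, 3) + (4, 1, 1) -> (4, 2, 3)
t = torch.arange(6).reshape(2, 3)
t_abs = torch.abs(t - 3)
t_b = t.unsqueeze(0) + torch.ones(4, 1, 1, dtype=t.dtype)

# Shapes (2, 3) and (2,) do not line up from the last dimension (3 vs 2).
try:
    a + np.array([1, 2])
    compatible = True
except ValueError:
    compatible = False

print(plus_row)
print(plus_col)
print(t_b.shape, compatible)
```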


**Reduction Operations**

NumPy & Torch support all commonly used mathematical reduction operations such 
as sum(), mean(), std(), max(), argmax(), unique() etc. These can either be 
applied on the entire array/tensor or along specific dimensions.
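
The sketch below applies a few reductions to the whole array and along single dimensions. It also shows one difference worth knowing: by default `np.std` uses the population formula while `torch.std` applies Bessel's correction. The data are made up for the example.

```python
import numpy as np
import torch

a = np.array([[1, 5, 2],
              [4, 3, 6]])
t = torch.tensor(a, dtype=torch.float32)

total = a.sum()                 # over all elements
col_sums = a.sum(axis=0)        # one value per column
row_max = a.max(axis=1)         # one value per row
row_argmax = a.argmax(axis=1)   # position of each row's maximum

t_mean = t.mean()               # over all elements
t_row_mean = t.mean(dim=1)      # one value per row
# With a dim argument, torch.max returns both values and indices.
values, indices = t.max(dim=1)

np_std = a.std()                # population std (ddof=0)
torch_std = t.std()             # sample std (matches ddof=1)

print(total, col_sums, row_max, row_argmax)
print(t_mean.item(), values, indices)
print(np_std, torch_std.item(), a.std(ddof=1))
```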

**Comparison Operations**

Comparison operations perform comparisons on the arrays/tensors as a whole as 
well as along particular axes.

In [None]:
# Numpy Comparison Operations
original_array1 = np.random.randint(3, size=(3,4))
original_array2 = np.random.randint(3, size=(3,4))
original_array3 = np.random.randint(3, size=(3,4))

# Element-wise Comparison Operations for > < or !=
# Combining reduction operations with boolean tensors
print((original_array1 > original_array2).any(), "\n") # ||
print((original_array1 > original_array2).all(), "\n") # &&
print((original_array1 > original_array2).any(axis=0), "\n")

# Torch Comparison Operations
original_tensor1 = torch.rand(size=(3,4))
original_tensor2 = torch.rand(size=(3,4))
original_tensor3 = torch.rand(size=(3,4))

# Element-wise Comparison Operations for > < or !=
# Combining reduction operations with boolean tensors
print((original_tensor1 > original_tensor2).any(), "\n") # ||
print((original_tensor1 > original_tensor2).all(), "\n") # &&
print((original_tensor1 > original_tensor2).any(dim=0), "\n")

**Vector/Matrix Operations**

**Dot Product:** aka Inner product (Matrix multiplication relies on dot product to multiply various combinations of rows and columns.)

**Tensor Product:** Tensordot (also known as tensor contraction) sums the product of elements from a and b over the indices specified by a_axes and b_axes. The lists a_axes and b_axes specify those pairs of axes along which to contract the tensors. (np.tensordot(), torch.tensordot())

**Einsum:** Imagine that we have two multi-dimensional arrays, A and B. Now let's suppose we want to multiply A with B in a particular way to create a new array of products, then maybe sum this new array along particular axes, and then maybe transpose the axes of the new array in a particular order.
There's a good chance that einsum will help us do this faster and more memory-efficiently than combinations of NumPy functions like multiply, sum and transpose will allow. (np.einsum(), torch.einsum())
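
To make the relationship between these operations concrete, the sketch below computes the same products with `@`, `tensordot` and `einsum`. The shapes and subscript strings are illustrative choices.

```python
import numpy as np
import torch

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
v = rng.standard_normal(3)
w = rng.standard_normal(3)

# Dot product: multiply matching entries and sum them.
dot_ref = np.dot(v, w)
dot_einsum = np.einsum('i,i->', v, w)

# Matrix product: contract axis 1 of A with axis 0 of B.
mm_ref = A @ B
mm_tensordot = np.tensordot(A, B, axes=([1], [0]))
mm_einsum = np.einsum('ik,kj->ij', A, B)

# Multiply, sum over k and transpose the result in a single call.
mm_T_einsum = np.einsum('ik,kj->ji', A, B)

# The torch versions take the same subscripts and axis pairs.
tA, tB = torch.from_numpy(A), torch.from_numpy(B)
t_einsum = torch.einsum('ik,kj->ij', tA, tB)
t_tensordot = torch.tensordot(tA, tB, dims=([1], [0]))

print(np.allclose(mm_ref, mm_einsum), np.allclose(mm_ref.T, mm_T_einsum))
```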

In [None]:
# Numpy Vector/Matrix operations
array1 = np.random.randn(3)
array2 = np.random.randn(3)
array3 = np.random.randn(3, 4)
array4 = np.random.randn(4)
matrix1 = np.random.randint(4, size = (2, 3))
matrix2 = np.random.randint(4, size = (3, 2))
# Matmul Examples 
# Vector x Vector
print('Matmul: \n', np.matmul(array1, array2))
print("Matmul: \n", array1@array2)
# Matrix x Vector
print('Matmul: \n', np.matmul(array3, array4))
print('Matmul: \n', array3@array4)
# Matrix x Matrix 
print('Matmul: \n', np.matmul(matrix1, matrix2))


# Torch Vector/Matrix operations
tensor1 = torch.randn(3)
tensor2 = torch.randn(3)
tensor3 = torch.randn(3, 4)
tensor4 = torch.randn(4)
matrix1 = torch.randn(2, 3)
matrix2 = torch.randn(3, 2)
# Vector x Vector
print('Matmul: \n', torch.matmul(tensor1, tensor2))
# Matrix x Vector
print('Matmul: \n', torch.matmul(tensor3, tensor4))
# Matrix x Matrix
print('Matmul: \n', torch.matmul(matrix1, matrix2))


## **3. Torch GPU operation**

PyTorch tensors can perform operations on the GPU as well as on the CPU. To do 
GPU tensor operations, you must first move the tensor from the CPU to the GPU. 
Operations require all operands to be on the same device (CPU or GPU); 
operations between CPU and GPU tensors will fail. A convenient way to stay on 
one device is the `new_*` family of methods (such as `new_zeros` and 
`new_tensor`), which create a new tensor on the same device as an existing 
tensor. Use them for creating tensors whenever possible.

A GPU operation’s runtime comes in two parts: 1) the time taken to move a 
tensor to the GPU, and 2) the time taken for the operation itself. Part 2 is 
very fast on the GPU, but for small operations part 1 can take much longer, so 
in some cases it may be faster to perform the operation on the CPU.

GPU memory is quite limited, and you will frequently run into two kinds of 
errors.

The first is `RuntimeError: CUDA out of memory`. When this happens, either 
reduce the batch size or check whether any unused tensors are still held on the 
GPU. Delete the references to such tensors (for example with `del`), then call 
`torch.cuda.empty_cache()` to release the cached memory.

The second is a CUDA error such as `RuntimeError: CUDA error: device-side 
assert triggered`. This can mean many things, for example GPU operations on 
tensors of unexpected shape or type, or an index that is out of range. Try 
running the code again after setting the environment variable 
`CUDA_LAUNCH_BLOCKING=1`, which forces CUDA to run operations synchronously so 
that the error is reported where it actually occurs. Remember to set it back to 
`CUDA_LAUNCH_BLOCKING=0` afterwards; otherwise your code will be slow.
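
A minimal device-agnostic sketch of these points is given below. It assumes nothing about the machine: it falls back to the CPU when no GPU is present, and it uses `new_zeros` as one example of creating a tensor on the same device as an existing one.

```python
import torch

# Pick the GPU when one is available, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Creating the tensor directly on the target device avoids a later copy.
x = torch.rand(3, 2, device=device)

# new_zeros and ones_like inherit the device (and dtype) of x.
zeros_like_x = x.new_zeros(3, 2)
ones = torch.ones_like(x)

# All operands live on the same device, so this works on CPU and GPU alike.
y = x + zeros_like_x + ones

# Bring the result back to the CPU (returns y itself if it is already there).
y_cpu = y.cpu()

# On a GPU, drop references that are no longer needed, then release the cache.
if device.type == "cuda":
    del x, zeros_like_x, ones
    torch.cuda.empty_cache()

print(device, y_cpu.shape)
```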


In [None]:
# import torch
a = torch.randn(5, 5)

# Put tensor on CUDA if available
x = torch.rand(3,2)
if torch.cuda.is_available():
    x = x.to("cuda:0")
    print(x, x.dtype)
    
# Do some calculations
y = x ** 2 
print(y)

# Copy to CPU if on GPU
if y.is_cuda:
    y = y.cpu()
    print(y, y.dtype)

try:
    a.cuda()
except (AssertionError, RuntimeError) as e:
    # Raised when PyTorch was built without CUDA or no GPU/driver is available
    print(e)



## **4. Torch Automatic differentiation**
Tensors provide automatic differentiation. Tensors you are differentiating 
with respect to must have `requires_grad=True`. Call `.backward()` on the scalar 
you are differentiating; to differentiate a vector, sum it first. 
Differentiation accumulates gradients. This is sometimes what you want and 
sometimes not. Make sure to zero gradients between batches if performing 
gradient descent or you will get strange results!

PyTorch records the graph of all computations in order to perform 
differentiation. To be part of this graph, the raw data is wrapped in the Tensor 
class (which has taken over the role of the former Variable class). You can 
detach a tensor from the graph using the **.detach()** method, which returns a 
tensor with the same data but with requires_grad set to False. You can also set 
the flag `requires_grad = False` on a leaf tensor, so that operations on it are 
not recorded in the graph. If you have a differentiable tensor that you don't 
need to differentiate, consider detaching it from the graph.
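
The following sketch illustrates detaching, `torch.no_grad()` as a second way to skip graph recording, and gradient accumulation. The functions being differentiated are arbitrary examples.

```python
import torch

x = torch.arange(4, dtype=torch.float32, requires_grad=True)

y = x ** 2                 # recorded in the graph
y_detached = y.detach()    # same values, no longer tracked

with torch.no_grad():
    z = x * 3              # computed without recording the graph

# A vector output is reduced to a scalar before calling backward().
y.sum().backward()
grad_first = x.grad.clone()          # d/dx sum(x^2) = 2x

# A second backward pass adds to the existing gradient.
(x ** 2).sum().backward()
grad_accumulated = x.grad.clone()    # 2x + 2x

# Zero the gradient to start fresh.
x.grad.zero_()
(x ** 3).sum().backward()
grad_cubic = x.grad.clone()          # d/dx sum(x^3) = 3x^2

print(y.requires_grad, y_detached.requires_grad, z.requires_grad)
print(grad_first, grad_accumulated, grad_cubic)
```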

In [None]:
# Create differentiable tensor
# x = torch.tensor(torch.arange(0,4), requires_grad=True, dtype=torch.float)
# would trigger a warning, so the tensor is built in two steps instead
x = torch.arange(0,4).float()
x.requires_grad = True
y = x**2
# Calculate gradient (dy/dx=2x)
y.sum().backward() 
# Print values
print(x)
print(y)
print(x.grad)

# Differentiate again (accumulates gradient)
torch.sum(x**2).backward()
print(x.grad)
# Zero gradient before differentiating
x.grad.zero_()
torch.sum(x**2).backward()
print(x.grad)
x.detach().numpy()

## **5. Torch Neural Net Example Using MNIST**

torch.nn provides basic 1-layer nets, such as Linear (perceptron), and activation
layers. All nn.Module objects are reusable as components of bigger networks! 
That is how you build custom nets. The simplest way is to use the 
nn.Sequential class. You can also create your own class that inherits from 
nn.Module. The forward method should specify what happens in the forward pass 
given an input. This lets you specify behaviors more complicated than just 
applying layers one after another, if necessary.

Parameters are of type Parameter, which is basically a wrapper for a tensor. 
A network's parameters are its attributes of type Parameter. Moreover, if an 
attribute is of type nn.Module, its own parameters are added to your network's 
parameters. Hence, when defining a network by combining basic components such 
as `nn.Linear`, you should never have to define parameters explicitly. However,
if no default PyTorch module does what you need, you can define parameters 
explicitly (this should be rare). Parameters are meant to be all the network's 
weights that will be optimized during training. If you need a tensor in your 
computational graph that should remain constant, just define it as a regular 
tensor.
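
A short sketch of how parameters are collected is shown below. `ScaledLinear` is a hypothetical module invented for this example; it mixes a sub-module, an explicit `nn.Parameter` and a regular constant tensor.

```python
import torch
import torch.nn as nn

class ScaledLinear(nn.Module):  # hypothetical module, for illustration only
    def __init__(self, in_features, out_features):
        super().__init__()
        # Sub-module: its weight and bias are registered automatically.
        self.linear = nn.Linear(in_features, out_features)
        # Explicit parameter: will be optimized during training.
        self.gain = nn.Parameter(torch.ones(out_features))
        # Regular tensor: stays constant and is not a parameter.
        self.offset = torch.full((out_features,), 0.5)

    def forward(self, x):
        return self.gain * self.linear(x) + self.offset

net = ScaledLinear(4, 2)
param_names = sorted(name for name, _ in net.named_parameters())
n_params = sum(p.numel() for p in net.parameters())  # 2 (gain) + 2 (bias) + 8 (weight)
out = net(torch.randn(3, 4))

print(param_names)   # ['gain', 'linear.bias', 'linear.weight']
print(n_params, out.shape)
```

Note that a regular tensor attribute is not moved by `model.to(device)`. If the constant must follow the model across devices, `register_buffer` is one option.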

`nn.CrossEntropyLoss` does both the softmax and the actual cross-entropy: given 
$output$ of size $(n,d)$ and $y$ of size $n$ with values in $0,1,...,d-1$, it 
computes $-\frac{1}{n}\sum_{i=0}^{n-1}\log(s[i,y[i]])$ (with the default mean 
reduction) where 
$s[i,j] = \frac{e^{output[i,j]}}{\sum_{j'=0}^{d-1}e^{output[i,j']}}$. You can 
also compose nn.LogSoftmax and nn.NLLLoss to get the same result. Note that all 
these use the log-softmax rather than the softmax, for numerical stability.
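
The equivalence can be checked numerically. The sketch below compares `nn.CrossEntropyLoss` (with its default mean reduction) against the composition of `nn.LogSoftmax` and `nn.NLLLoss`, and against the averaged negative log-probability computed by hand. The sizes and random inputs are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d = 5, 3
output = torch.randn(n, d)            # raw scores (logits)
y = torch.randint(0, d, (n,))         # class labels in 0..d-1

# Built-in loss: softmax and cross-entropy in one step.
ce = nn.CrossEntropyLoss()(output, y)

# Same result from LogSoftmax followed by NLLLoss.
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(output), y)

# Same result by hand: mean of -log s[i, y[i]].
s = torch.softmax(output, dim=1)
manual = -torch.log(s[torch.arange(n), y]).mean()

print(torch.allclose(ce, nll), torch.allclose(ce, manual))  # True True
```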

Now, to perform the backward pass, just execute **loss.backward()**! It 
computes gradients and accumulates them in the `.grad` attribute of all 
differentiable leaf tensors in the graph, which in particular includes all the 
network parameters.
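
A single optimization step on a toy model shows where the gradients end up. The model, data and learning rate here are placeholders chosen for the sketch, not the MNIST setup used later.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 3)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

inputs = torch.randn(8, 4)
targets = torch.randint(0, 3, (8,))

# Before any backward pass, no parameter has a gradient yet.
grads_before = [p.grad for p in model.parameters()]

optimizer.zero_grad()                       # clear gradients from earlier steps
loss = criterion(model(inputs), targets)    # forward pass
loss.backward()                             # fills p.grad for every parameter

weights_before = model.weight.detach().clone()
optimizer.step()                            # update parameters using p.grad
weights_changed = not torch.equal(weights_before, model.weight)

print(loss.item(), weights_changed)
```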

The MNIST dataset is a large database of handwritten digits that is commonly 
used for training various image processing systems. The database is also widely
used for training and testing in the field of machine learning. It consists of 
60,000 training images and 10,000 testing images.

In [None]:
# Import required libraries
import sys
import time
import numpy as np
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from torch.utils import data
from torchvision import transforms
from torchvision.datasets import MNIST

cuda = torch.cuda.is_available() # True if a CUDA GPU is available

torch.__version__

# Simple MLP with 2 layers and sigmoid activation using sequential network 
# (`nn.Module` object) from layers (other `nn.Module` objects).

x = torch.arange(0,32).float() 
# net = torch.nn.Sequential(
#     torch.nn.Linear(32,128),
#     torch.nn.Sigmoid(),
#     torch.nn.Linear(128,10))
# y = net(x)
# print(y)

In [None]:
# Obtain related dataset, MNIST (download=True fetches it into ./data if missing)
train = MNIST('./data', train=True, download=True, transform=transforms.ToTensor())
test = MNIST('./data', train=False, download=True, transform=transforms.ToTensor())
# Scale the raw uint8 pixels to floats in [0, 1], keeping the (60000, 28, 28) shape
train_data = train.data.float() / 255.0

print('[Train Data]')
print(' - Numpy Shape:', train_data.cpu().numpy().shape)
print(' - Tensor Shape:', train_data.size())
print(' - min:', torch.min(train_data))
print(' - max:', torch.max(train_data))
print(' - mean:', torch.mean(train_data))
print(' - std:', torch.std(train_data))
print(' - var:', torch.var(train_data))

print('\n[Train Labels]')
print(' - Numpy Shape:', train.targets.cpu().numpy().shape)
print(' - Tensor Shape:', train.targets.size())

# plt.imshow(train.data.cpu().numpy()[1], cmap='gray')
# print(train.targets.cpu().numpy()[1])

In [None]:
# Custom dataset wrapping the image and label tensors
class MyDataset(data.Dataset):
    def __init__(self, X, Y):
        self.X = X
        self.Y = Y

    def __len__(self):
        return len(self.Y)

    def __getitem__(self,index):
        X = self.X[index].float().reshape(-1) #flatten the input
        Y = self.Y[index].long()
        return X,Y

In [None]:
# create a more customizable network module (equivalent here)
class MyNetwork(torch.nn.Module):
    # you can use the layer sizes as initialization arguments if you want to
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.layer1 = torch.nn.Linear(input_size,hidden_size)
        self.layer2 = torch.nn.Sigmoid()
        self.layer3 = torch.nn.Linear(hidden_size,output_size)

    def forward(self, input_val):
        h = input_val
        h = self.layer1(h)
        h = self.layer2(h)
        h = self.layer3(h)
        return h

class MyNetworkWithParams(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(MyNetworkWithParams,self).__init__()
        self.layer1_weights = nn.Parameter(torch.randn(input_size,hidden_size))
        self.layer1_bias = nn.Parameter(torch.randn(hidden_size))
        self.layer2_weights = nn.Parameter(torch.randn(hidden_size,output_size))
        self.layer2_bias = nn.Parameter(torch.randn(output_size))
        
    def forward(self,x):
        h1 = torch.matmul(x,self.layer1_weights) + self.layer1_bias
        h1_act = torch.relu(h1) # ReLU; works on any device, unlike a CPU zeros tensor
        output = torch.matmul(h1_act,self.layer2_weights) + self.layer2_bias
        return output

class MyNetWithMultiHidenLayer(nn.Module):
    def __init__(self,n_hidden_layers):
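        super().__init__()
        self.n_hidden_layers=n_hidden_layers
        self.final_layer = nn.Linear(128,10)
        self.act = nn.ReLU()
        # nn.ModuleList registers the layers so their parameters are trained;
        # a plain Python list would hide them from model.parameters()
        self.hidden = nn.ModuleList()
        for i in range(n_hidden_layers):
            self.hidden.append(nn.Linear(128,128))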
    
    def forward(self,x):
        h = x
        for i in range(self.n_hidden_layers):
            h = self.hidden[i](h)
            h = self.act(h)
        out = self.final_layer(h)
        return out

# SIMPLE MODEL DEFINITION
class Simple_MLP(nn.Module):
    def __init__(self, size_list):
        super(Simple_MLP, self).__init__()
        layers = []
        self.size_list = size_list
        for i in range(len(size_list) - 2):
            layers.append(nn.Linear(size_list[i],size_list[i+1]))
            layers.append(nn.ReLU())
        layers.append(nn.Linear(size_list[-2], size_list[-1]))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)


In [None]:
def train_epoch(model, train_loader, criterion, optimizer):
    model.train()

    running_loss = 0.0
    
    start_time = time.time()
    for batch_idx, (data, target) in enumerate(train_loader):   
        optimizer.zero_grad()   # .backward() accumulates gradients
        data = data.to(device)
        target = target.to(device) # all data & model on same device

        outputs = model(data)
        loss = criterion(outputs, target)
        running_loss += loss.item()

        loss.backward()
        optimizer.step()
    end_time = time.time()
    
    running_loss /= len(train_loader)
    print('Training Loss: ', running_loss, 'Time: ',end_time - start_time, 's')
    return running_loss

def test_model(model, test_loader, criterion):
    with torch.no_grad():
        model.eval()

        running_loss = 0.0
        total_predictions = 0.0
        correct_predictions = 0.0

        for batch_idx, (data, target) in enumerate(test_loader):   
            data = data.to(device)
            target = target.to(device)

            outputs = model(data)

            _, predicted = torch.max(outputs, 1)
            total_predictions += target.size(0)
            correct_predictions += (predicted == target).sum().item()

            loss = criterion(outputs, target).detach()
            running_loss += loss.item()


        running_loss /= len(test_loader)
        acc = (correct_predictions/total_predictions)*100.0
        print('Testing Loss: ', running_loss)
        print('Testing Accuracy: ', acc, '%')
        return running_loss, acc

In [None]:
num_workers = 8 if cuda else 0 
    
# Training
train_dataset = MyDataset(train.data, train.targets)
train_loader_args = dict(shuffle=True, batch_size=256, \
                         num_workers=num_workers, pin_memory=True) \
    if cuda else dict(shuffle=True, batch_size=64)
# Testing
test_dataset = MyDataset(test.data, test.targets)

test_loader_args = dict(shuffle=False, batch_size=256, \
                        num_workers=num_workers, pin_memory=True) \
    if cuda else dict(shuffle=False, batch_size=1)

train_loader = data.DataLoader(train_dataset, **train_loader_args)
test_loader = data.DataLoader(test_dataset, **test_loader_args)

n_epochs = 10
Train_loss = []
Test_loss = []
Test_acc = []

model = Simple_MLP([784, 256, 10]) # 784 input pixels, 256 hidden units, 10 digit classes
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters())
device = torch.device("cuda" if cuda else "cpu")
model.to(device)
print(model)
for i in range(n_epochs):
    train_loss = train_epoch(model, train_loader, criterion, optimizer)
    test_loss, test_acc = test_model(model, test_loader, criterion)
    Train_loss.append(train_loss)
    Test_loss.append(test_loss)
    Test_acc.append(test_acc)
    print('='*20)

# save a dictionary
torch.save(model.state_dict(),'test.t7')
# load a dictionary
model.load_state_dict(torch.load('test.t7'))

# Start a new figure for each plot so the curves are not drawn on the same axes
plt.figure()
plt.title('Training Loss')
plt.xlabel('Epoch Number')
plt.ylabel('Loss')
plt.plot(Train_loss)

plt.figure()
plt.title('Test Loss')
plt.xlabel('Epoch Number')
plt.ylabel('Loss')
plt.plot(Test_loss)

plt.figure()
plt.title('Test Accuracy')
plt.xlabel('Epoch Number')
plt.ylabel('Accuracy (%)')
plt.plot(Test_acc)
plt.show()
