## __Activation Functions__

First, let's load a dataset and run our previous code on it with a few small changes, such as adding more layers and changing parameters. We will use the MNIST dataset of handwritten digits, and our project will be built around it. We will use `pickle` to load the dataset, then flatten each image from a 28x28 array into a vector of size 784.

In [1]:
import pickle
import numpy as np

with open('dataset.p', 'rb') as file:
  X, y = pickle.load(file)

X = X.reshape(X.shape[0], -1)

print(f'Input size: {X.shape}')

Input size: (70000, 784)
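
The `-1` in the reshape call asks NumPy to infer that dimension from the remaining data. A minimal sketch on a made-up stand-in array (not the real dataset) shows the effect:

```python
import numpy as np

# Stand-in for image data: 5 made-up "images" of 28x28 pixels.
images = np.zeros((5, 28, 28))

# Keep the first axis (number of samples) and let NumPy infer the rest: 28 * 28 = 784.
flat = images.reshape(images.shape[0], -1)
print(flat.shape)  # (5, 784)
```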


In [2]:
class Dense:
  def __init__(self, input_neurons, output_neurons):
    self.weights = np.random.randn(input_neurons, output_neurons)
    self.biases = np.zeros((1, output_neurons))
  def forward(self, inputs):
    self.output = np.dot(inputs, self.weights) + self.biases

layer_one = Dense(784, 128)
layer_one.forward(X)
layer_two = Dense(128, 64)
layer_two.forward(layer_one.output)
layer_three = Dense(64, 10)
layer_three.forward(layer_two.output)
print(layer_three.output)

[[ 309364.5215605   211902.66957978 -208831.83333386 ... -315013.16757424
   349481.74643741 -343205.23029616]
 [ 194721.61957701  307436.12271173  -79373.72638533 ...   43134.48232765
   -76537.25584081 -344886.85772519]
 [ -64325.21569131  361371.95698315  -96925.75269489 ...  -33517.13392191
   274963.39854676  -56024.38398058]
 ...
 [  28388.3853326   306543.52060894  160751.65440851 ...  -51462.80959851
   161943.6190995   185346.88253138]
 [ -39453.54304542  162326.97457821  -69681.85804179 ... -252697.45713589
   -63833.39513267  129302.80272159]
 [ 367982.6481871   166047.86945259 -321764.34322551 ... -521936.574629
   134838.17629632  -25096.15075296]]


Now let's talk about activation functions. So far each layer only computes a linear function of its input, $mx + b$, but generally we want to introduce non-linearity into our neural network so that it can fit the data better. For this we will use a simple activation function known as ReLU, which converts negative values to 0 and leaves positive values alone. We will create a class `ReLu`, which just has a forward method.
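
To see why non-linearity matters, note that two dense layers with nothing in between can always be rewritten as a single dense layer. The sketch below checks this numerically; the shapes, the seed, and the variable names are made up for illustration and are not part of the network above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up small shapes: 4 samples, 3 input features, 5 hidden units, 2 outputs.
x = rng.standard_normal((4, 3))
w1, b1 = rng.standard_normal((3, 5)), rng.standard_normal((1, 5))
w2, b2 = rng.standard_normal((5, 2)), rng.standard_normal((1, 2))

# Two dense layers with no activation in between.
two_layers = (x @ w1 + b1) @ w2 + b2

# The same mapping written as one dense layer.
w_combined = w1 @ w2
b_combined = b1 @ w2 + b2
one_layer = x @ w_combined + b_combined

print(np.allclose(two_layers, one_layer))  # True

# With ReLU in between, the result generally no longer matches that single linear layer.
with_relu = np.maximum(0, x @ w1 + b1) @ w2 + b2
print(np.allclose(with_relu, one_layer))
```

So without an activation function, adding more layers does not make the model any more expressive than one layer.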

### ReLU function: $f(x) = \max(0, x)$

where $x$ is the matrix passed in and the maximum is taken element-wise. Now let's implement it!

In [3]:
class ReLu:
  def forward(self, inputs):
    self.output = np.maximum(0, inputs)

inputs = [-12, 17, -29, 14, 68, -7]
relu_sample = ReLu()
relu_sample.forward(inputs)
print(relu_sample.output)

[ 0 17  0 14 68  0]


As we can see, we passed a list with positive and negative values and it works! Now let's talk about the softmax activation. 

## __Softmax Activation__
The softmax activation takes the output of our neural network and converts it into a probability distribution, from which we can extract predicted labels. Let's assume we have a sample output, O. To apply softmax, we first subtract the largest value in that sample from all of its values. This ensures that exponentiation does not overflow. Next we exponentiate each value, then we create a probability distribution by dividing each exponentiated value by the sum of all exponentiated values in that sample.

### Softmax function: $f\left(x\right)_{i}=\frac{e^{x_{i}-\max\left(x\right)}}{\sum_{n=1}^{k}e^{x_{n}-\max\left(x\right)}}$

where $k$ is the length of vector $x$ and $i$ indexes its entries. Now let's implement it!

Note: We pass `axis=1` and `keepdims=True`, so the maximum and the sum are computed per row and softmax is applied to each sample independently.

In [4]:
class Softmax:
    def forward(self, inputs):
        exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
        self.output = exp_values / np.sum(exp_values, axis=1, keepdims=True)
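
As a quick check of the steps described above, the sketch below applies the same computation to two made-up samples. The helper name `softmax_rows` and the input values are hypothetical and only for illustration; the second row is chosen to be large enough that exponentiating it directly overflows.

```python
import numpy as np

def softmax_rows(logits):
    # Hypothetical helper mirroring Softmax.forward:
    # shift each row by its max, exponentiate, then normalize per row.
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    exp_values = np.exp(shifted)
    return exp_values / np.sum(exp_values, axis=1, keepdims=True)

# Two made-up samples (rows) with three class scores each.
logits = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1001.0, 1002.0]])

probs = softmax_rows(logits)
print(probs)
print(probs.sum(axis=1))  # each row sums to 1 (up to floating-point rounding)

# Without the shift, the large row overflows: np.exp(1000.0) is inf, and inf / inf is nan.
with np.errstate(over='ignore', invalid='ignore'):
    naive = np.exp(logits) / np.sum(np.exp(logits), axis=1, keepdims=True)
print(np.isnan(naive[1]).all())  # True
```

Both rows give the same probabilities, because subtracting a constant from every value in a row does not change the softmax result. That is why the shift is safe to apply.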

Now let's bring ReLU and softmax together to create a neural network! (Note: We still have loss, backpropagation, and optimization to go, but the neural network architecture is fully created!)

In [5]:
relu_one = ReLu()
relu_two = ReLu()
softmax = Softmax()

layer_one.forward(X)
relu_one.forward(layer_one.output)
layer_two.forward(relu_one.output)
relu_two.forward(layer_two.output)
layer_three.forward(relu_two.output)
softmax.forward(layer_three.output)

print(softmax.output)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 1.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


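So it looks like the ReLU and softmax activations worked. Each row is a probability distribution over all possible classes. The values print as 0 or 1 because the outputs of the final layer are very large, so softmax puts almost all of the probability on the largest one. But now we need to calculate the loss of this network, to see how good it is.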
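
As a side note on extracting predicted labels from a probability distribution: one common approach is to take the index of the largest probability in each row. The sketch below uses made-up probabilities for 3 samples and 4 classes rather than the real network output.

```python
import numpy as np

# Made-up softmax output for 3 samples and 4 classes (each row sums to 1).
probs = np.array([[0.10, 0.70, 0.15, 0.05],
                  [0.25, 0.25, 0.40, 0.10],
                  [0.05, 0.05, 0.10, 0.80]])

# The predicted label for each sample is the index of its largest probability.
predicted_labels = np.argmax(probs, axis=1)
print(predicted_labels)  # [1 2 3]
```

Applied to the network above, `np.argmax(softmax.output, axis=1)` would give one predicted class per image. Since the weights are still random and untrained, those predictions should not be expected to match `y`.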