<a href="https://colab.research.google.com/github/saralieber/CS_Studio/blob/master/Review_Ch8_ArtificialNeuralNets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Artificial Neural Nets (ANNs)
*   aka Multi-Layer Perceptron (MLP) 
*   aka Deep-Learning Model


Neural networks were proposed by McCulloch and Pitts (1943), who likened a model to the neurons in the nervous system.
*   ANNs - a complex computation made up of simple parts (i.e., neurons) where each simple part is only connected to its neighboring parts 
*   Each "neuron" has an input and an output; they are organized in layers


The first neural nets were simple:

<img src='https://miro.medium.com/max/1032/1*WswH2fPx0bf_JFRMm8V-HA.gif'>

This is just linear regression.



The next proposed version of neural nets added another layer.

<img src='http://www.practicalai.io/wp-content/uploads/2017/06/admission-data-linear.png'  height=200>

But this is still just linear regression where the weights (beta weights) are linear transformations of the previous inputs.

<img src='https://miro.medium.com/max/1016/1*g2HHjCkxeemizfLQC-BaAg.gif'>
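To see why an extra layer alone does not escape linear regression, here is a minimal sketch. The weights are hypothetical, chosen only for illustration, and no activation function is applied. It shows that two stacked layers give the same answers as one layer whose weights are a combination of the originals.

```python
# Hypothetical weights, chosen only for illustration.
layer1 = [[2.0, -1.0], [0.5, 3.0]]   # two hidden nodes, two input weights each
layer2 = [1.5, -2.0]                 # one output node, one weight per hidden node

def two_layer_linear(x):
    """Pass x through both layers with no activation function."""
    hidden = [sum(w * xi for w, xi in zip(node, x)) for node in layer1]
    return sum(w * h for w, h in zip(layer2, hidden))

# Collapse the two layers into a single weight per input.
combined = [
    sum(layer2[j] * layer1[j][i] for j in range(len(layer2)))
    for i in range(2)
]

def one_layer_linear(x):
    """A single linear layer using the collapsed weights."""
    return sum(w * xi for w, xi in zip(combined, x))

samples = [[1.0, 0.0], [0.0, 1.0], [3.0, -2.0]]
for x in samples:
    print(x, two_layer_linear(x), one_layer_linear(x))
```

The last two columns match for every sample, so the second layer adds no expressive power on its own.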




There's nothing wrong with linear regression; it's just that it can only be used for problems that have linear solutions. For example, predicting who survived and who died in the Titanic data set can be approached with a linear model.

<img src='https://sebastianraschka.com/images/faq/logistic_regression_linear/4.png'>



But what if there is no line that cleanly separates the data?

<img src='https://www.dropbox.com/s/rk83o8qqgr7vsq2/Screenshot%202020-04-28%2011.00.53.png?raw=1'>



That's where other algorithms come in, like...

## Support Vector Machines (SVM)

See the image above. In this example, an SVM takes the 2D space and maps it into 3D space. (The opposite of feature reduction - it's feature expansion!)

Once in 3D space, you can find a plane that may cleanly separate the points you're trying to classify. First, you have to find the 2D-to-3D mapping function, called a *kernel function*. Then, you have to find the plane. 
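The 2D-to-3D idea can be sketched by hand. This is not how an SVM library works internally (a real SVM learns the separating plane and applies the kernel implicitly); the toy data, the `lift` mapping, and the plane height below are all assumptions picked for illustration.

```python
import math

# Hypothetical toy data: class 0 sits near the origin, class 1 on a ring around it.
# No straight line in 2D separates the two classes.
angles = (0.0, 1.5, 3.0, 4.5)
inner = [(0.5 * math.cos(a), 0.5 * math.sin(a)) for a in angles]
outer = [(2.0 * math.cos(a), 2.0 * math.sin(a)) for a in angles]

def lift(point):
    """Map a 2D point (x, y) to 3D by adding a third feature z = x^2 + y^2."""
    x, y = point
    return (x, y, x * x + y * y)

def classify(point3d, height=1.0):
    """Separate the lifted points with the flat plane z = height."""
    return 1 if point3d[2] > height else 0

inner_labels = [classify(lift(p)) for p in inner]
outer_labels = [classify(lift(p)) for p in outer]
print(inner_labels)
print(outer_labels)
```

After lifting, every inner point sits below the plane and every outer point sits above it, so a flat plane does the job that no line could do in 2D.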

More to come on SVM later.


## Artificial Neural Nets (ANNs) 

ANNs are another alternative, alongside SVM. They end up being more versatile than SVM when it comes to working with text data.

The jump in our understanding of neural nets came from observing biological neurons and seeing that the cell body takes a summation of inputs, and these inputs have to surpass a certain threshold to activate the neuron.

<img src='https://upload.wikimedia.org/wikipedia/commons/4/44/Neuron3.png'>

This leads to the idea of an *activation function* - a certain threshold has to be reached before firing takes place.



This is what a simple ANN looks like:

<img src='https://www.dropbox.com/s/zdv3sjzssiewwf3/Screenshot%202020-02-18%2010.30.12.png?raw=1'>

The nodes on the left make up the *input layer*.
- This is a vector that we pass into the model. It represents one row of a dataset being handed to the ANN. 
- So 0.44 would be the first number in the vector, 0.33 would be the second, and so on.
- The vectors we'll work with are *dense*. This means each node in the input layer supplies a value to each node in the hidden layer (note the lines from every input node to every node in the hidden layer).

The nodes in the middle make up the *hidden layer* (not actually hidden).
- The nodes in the hidden layer are connected to all the nodes in the output layer (a single node, in this case). 

Not depicted in this picture are the weights - there should be a weight corresponding to each line.

## Nodes

The circles in the diagrams above are called nodes. Each node has k input lines. The exceptions are the input nodes, which have only 1 input line and no weights. They just pass the value from the sample through to the first hidden layer.


<img src='https://www.researchgate.net/profile/Karem_Chokmani/publication/255629329/figure/fig2/AS:339705213276163@1458003442306/The-basic-element-of-a-neural-network-node-computation.png'>



In this diagram,
- The inputs to the node are labeled O (to represent that they are the output from the previous layer to the left)
- Each input line has its own weight, W
- The operation of the first part of the node is the dot-product of the outputs (vector 1) and the weights (vector 2), where the resulting dot-product is typically called Z. This is represented by the function labeled I-sub-j, where j is the node number.
- The activation function is labeled f (this function was labeled A in an earlier diagram). It takes the result of the dot-product (Z) and produces the actual output of the node. If you choose a linear activation function (e.g., f(x) = cx), you'll end up with a network that computes a linear function no matter how many layers and nodes you have.



## Exclusive-OR

We'll use the XOR function for the following examples. The XOR function outputs True if exactly one of its inputs is True, and False otherwise.


The possible outcomes of the XOR function are:
- Inputs: True and True, Output: False
- Inputs: True and False, Output: True
- Inputs: False and True, Output: True
- Inputs: False and False, Output: False



Here are these outcomes represented graphically:

<img src='https://www.dropbox.com/s/oeud4lstd84l88d/Screenshot%202020-02-21%2013.27.25.png?raw=1' height=300>




There's not a linear solution to this problem. The solution? Use a non-linear activation function.
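One way to convince yourself that no line works is a brute-force search. The sketch below tries every line `w1*x + w2*y + b = 0` on a small grid of candidate values and counts how many classify all four XOR rows correctly. The grid range and step are arbitrary choices, and the AND function is included only as a contrasting case that a line can solve.

```python
from itertools import product

def separates(truth_table, w1, w2, b):
    """True if the line w1*x + w2*y + b = 0 puts every row on its correct side."""
    return all((w1 * x + w2 * y + b > 0) == bool(label)
               for (x, y), label in truth_table.items())

xor_table = {(x, y): x ^ y for x in (0, 1) for y in (0, 1)}
and_table = {(x, y): x & y for x in (0, 1) for y in (0, 1)}

# Candidate weights and bias from -2 to 2 in steps of 0.25 (an arbitrary grid).
grid = [i / 4 for i in range(-8, 9)]

xor_hits = sum(separates(xor_table, w1, w2, b) for w1, w2, b in product(grid, repeat=3))
and_hits = sum(separates(and_table, w1, w2, b) for w1, w2, b in product(grid, repeat=3))

print("lines that solve XOR:", xor_hits)
print("lines that solve AND:", and_hits)
```

The XOR count comes out as 0, while AND is solved by many lines on the same grid. A finer grid would not help: the four XOR constraints contradict each other for any choice of `w1`, `w2`, and `b`.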


Here are a couple of popular non-linear activation functions.

## 1. RELU

RELU stands for rectified linear unit. It is a piecewise linear function that outputs the input directly if the input is positive; otherwise (if the input is zero or negative), it outputs zero. In Python, it can be written as `max(0, x)`.

<img src='https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2018/10/Line-Plot-of-Rectified-Linear-Activation-for-Negative-and-Positive-Inputs.png' height=200>


This might be the closest function to mimicking a biological neuron that needs to reach a threshold before activating.


## 2. Sigmoid

Sigmoid produces an "S" curve with the following function (the logistic version produces a value between 0 and 1):

<img src='https://www.dropbox.com/s/58hr9e4iusnmapc/Screenshot%202020-02-18%2014.02.06.png?raw=1'>

<img src='https://www.dropbox.com/s/wdqdl22m2l7jruo/Screenshot%202020-02-18%2014.02.21.png?raw=1'>


## 3. Others

See https://medium.com/@himanshuxd/activation-functions-sigmoid-relu-leaky-relu-and-softmax-basics-for-neural-networks-and-deep-8d9c70eed91e for a nice table of activation functions you have to choose from.



## How is ANN implemented?

What is a node? It really is its vector of weights on the outputs from the preceding layer. 

So if we store all our node weights in a matrix (one matrix per layer that has weights, i.e., the hidden layer and the output layer), then we are just doing a dot-product of the output vector from the preceding layer against the weight matrix.

This will produce an intermediate value (called Z). Z gets passed to the activation function to get the node output.
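As a rough sketch of the matrix view, the block below stores a layer's weights with one row per node and computes every Z with a single matrix-vector product. The weights and the incoming values are hypothetical, and the sigmoid is used as the activation function only as an example. A node-by-node loop is included to show that both routes agree.

```python
import math
import numpy as np

# Hypothetical layer: 2 nodes, each with 3 incoming weights (one row per node).
W = np.array([[0.5, 0.4, -0.2],
              [-0.3, 0.8, 0.1]])
outputs_from_previous_layer = np.array([0.002, -0.09, 0.6])

Z = W @ outputs_from_previous_layer   # one dot-product per node, all at once
layer_out = 1 / (1 + np.exp(-Z))      # sigmoid applied to every Z

# Same computation, one node at a time, for comparison.
node_by_node = []
for row in W:
    z = sum(w * o for w, o in zip(row, outputs_from_previous_layer))
    node_by_node.append(1 / (1 + math.exp(-z)))

print(Z)
print(layer_out)
print(node_by_node)
```

Each row of `W` is one node's weight vector, which is why a node "really is" its weights.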


Example: Build an ANN from scratch using the XOR function as an example.

In [0]:
# Import Steve's uo_puddles library
!rm -r  'uo_puddles'
my_github_name = 'uo-puddles'  
clone_url = f'https://github.com/{my_github_name}/uo_puddles.git'
!git clone $clone_url 
import uo_puddles.uo_puddles as up

rm: cannot remove 'uo_puddles': No such file or directory
Cloning into 'uo_puddles'...
remote: Enumerating objects: 258, done.
remote: Counting objects: 100% (258/258), done.
remote: Compressing objects: 100% (222/222), done.
remote: Total 258 (delta 154), reused 64 (delta 33), pack-reused 0
Receiving objects: 100% (258/258), 66.97 KiB | 387.00 KiB/s, done.
Resolving deltas: 100% (154/154), done.


### Build a function that calculates dot-products

Reminder: the dot-product multiplies corresponding items from two vectors, then takes the sum of the products.

In [0]:
# Compute the dot-product of two vectors

def dot(vector1: list, vector2:list) -> float:
  assert isinstance(vector1, list), f'vector1 should be a list but is instead a {type(vector1)}'
  assert isinstance(vector2, list), f'vector2 should be a list but is instead a {type(vector2)}'
  assert len(vector1) == len(vector2), f'both vectors should be the same length'

  result = 0
  for i in range(len(vector1)):
    term = vector1[i]*vector2[i]
    result += term
  return result

In [0]:
# Test the dot function
inputs = [.002, -.09, .6] # values coming from a previous layer
weights = [.5, .4, -.2] # weights on those values

z = dot(weights, inputs) # argument order does not matter: the dot-product is commutative
z # This is I-sub-j in the diagram below

-0.155

<img src='https://www.dropbox.com/s/en3wtc2p8nea5ig/Screenshot%202020-05-13%2014.35.58.png?raw=1' height=400>

In [0]:
# Now, we need to apply an activation function to z, f(z)
## You get to choose which activation function to use. Let's define and use a couple.

In [0]:
# Sigmoid function
import math

def sigmoid(t:float) -> float:
  s = 1 / (1 + math.exp(-t)) # e to the -t power
  return s

In [0]:
# Apply the sigmoid function to the dot-product, z
sigmoid(z)

0.46132739479349205

<img src='https://www.dropbox.com/s/exe1xq59vpnxfrl/Screenshot%202020-05-13%2014.46.43.png?raw=1' height=400>

In [0]:
# Now try the RELU activation function.

In [0]:
# RELU function
def relu(t:float) -> float:
  result = max(t, 0.0)
  return result

In [0]:
# Apply the RELU function to the dot-product, z.
relu(z)

0.0

<img src='https://www.dropbox.com/s/wv16f1ym79gqmzw/Screenshot%202020-05-13%2014.49.58.png?raw=1' height=300>

#### Now, write a single function for computing the output of a node, assuming we want to use the sigmoid activation function.

In [0]:
def neuron_output(weights:list, inputs:list) -> float:
  assert isinstance(weights, list), f'weights should be a list but is instead a {type(weights)}'
  assert isinstance(inputs, list), f'inputs should be a list but is instead a {type(inputs)}'
  assert len(weights) == len(inputs), f'weights and inputs should be the same length'

  z = dot(weights, inputs)
  s = sigmoid(z)
  return s

In [0]:
# Test the neuron_output function out.
neuron_output(weights, inputs)

0.46132739479349205

## A feedforward function

A feedforward function will take as arguments the set of weights in a network and input values. It will output the final result (i.e., prediction).

Example using the XOR function: the feedforward function takes in two binary numbers and outputs a value between 0 and 1. Round the result to get a binary number as the prediction:

<img src='https://codingvision.net/imgs/posts/c-backpropagation-tutorial-xor/1.png'>

<img src='https://www.dropbox.com/s/fvko9fo71pp1cpr/Screenshot%202020-02-21%2009.14.37.png?raw=1'>


Even though three layers are shown above, the input layer is implied.
- We only have to deal with two layers: the hidden layer and the output layer
- The input layer just pumps data to the first hidden layer, so it is implied
- Think of the weights as belonging to the layer on their right. Given that the input layer has no weights on its left, we can leave it out. Essentially, it is represented as the `input_vector` below.


#### Choosing the weights

It's up to the researcher. One way is to use a random distribution of weights between -1 and 1. 

So for the first hidden node, we will have a list of two weights (notice two weights feed into it in the image above).

<pre>
hidden1 = [rdist1, rdist2]
</pre>

`rdist1` and `rdist2` are random numbers falling in a uniform distribution. We'll need the same for hidden2 and for the output node.

There are a total of three nodes in the network shown above, and each needs a list of its own weights. We're randomly gathering these weights below.

In [0]:
import numpy as np
np.random.seed(1234)

hidden1 = list(np.random.uniform(-1,1,2)) # Create a list of 2 random items taken from a uniform distribution between -1 and 1
hidden2 = list(np.random.uniform(-1,1,2))
output = list(np.random.uniform(-1,1,2))

In [0]:
# Combine the weights into a single object - a list of lists

xor_network = [[hidden1, hidden2], [output]] # the weights for each node in the hidden layer are contained in a separate list from the weights for the output node

print(len(xor_network)) # 2 (there are two separate lists)
print(xor_network)

2
[[[-0.6169610992422154, 0.24421754207966373], [-0.12454452198577104, 0.5707171674275384]], [[0.559951616237607, -0.45481478943471676]]]


<img src='https://www.dropbox.com/s/a8s43c314op5qg8/Screenshot%202020-05-13%2015.11.06.png?raw=1' height=500>

# Assignment 1

In [0]:
# Assignment 1
## Complete the last two functions we need to simulate a feedforward network

# First, define the function `layer_output`.
## The `layer` parameter below is something like xor_network[0] or xor_network[1]
## The `inputs` are values from the preceding layer.
## You will use the 'create a new list from an old list' gist and the `neuron_output` function.

In [0]:
def layer_output(layer:list, inputs:list) -> list:
  assert isinstance(layer, list), f'layer must be a list but is a {type(layer)}'
  assert all(isinstance(item, list) for item in layer), f'layer must be a list of lists'
  assert isinstance(inputs, list), f'inputs must be a list but is a {type(inputs)}'

  # Build a new list holding one output per node (neuron) in the layer
  new_list = []

  for node_weights in layer:
    output = neuron_output(node_weights, inputs)  # dot-product, then sigmoid
    new_list.append(output)

  return new_list

I'll test your function out by stepping through the layers, left to right. First up is the hidden layer or `xor_network[0]`.

We need a sample input. The xor operator takes binary value pairs. There are 4 such unique pairs, right? Let's just choose one.

In [0]:
layer = xor_network[0] # the hidden layer
input = [0,1]
print(layer)

[[-0.6169610992422154, 0.24421754207966373], [-0.12454452198577104, 0.5707171674275384]]


Now produce the output for the first layer.

In [0]:
hidden_output = layer_output(layer, input)
hidden_output # two outputs because two nodes

[0.5607527329852497, 0.6389286413163151]

There are 2 neurons in the hidden layer so we get an output from each.

Now here is the cool part. We take the output and make it the input to the 2nd layer, i.e., the output layer.

In [0]:
input = hidden_output
layer = xor_network[1] # the output layer
print(layer)

[[0.559951616237607, -0.45481478943471676]]


Now we're ready to produce the next output, which is the actual output of the entire network!

In [0]:
output_output = layer_output(layer, input)
output_output # prediction = 1

[0.5058497839923097]

In [0]:
round(output_output[0])

1

# Assignment 2

Create a function called `feed_forward` that takes the entire network in as a parameter (instead of each layer in steps), as well as the initial input to the network. 

It will go through each layer calling `layer_output`.

In [0]:
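def feed_forward(neural_network:list, input_vector:list) -> list:
  assert isinstance(neural_network, list), f'neural_network must be a list but is a {type(neural_network)}'
  assert isinstance(input_vector, list), f'input_vector must be a list but is a {type(input_vector)}'

  current = input_vector  # the input vector feeds the first (hidden) layer

  for layer in neural_network:
    current = layer_output(layer, current)  # this layer's output becomes the next layer's input

  return current  # output of the last layer, e.g., [0.5058497839923097]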

Try it out.

In [0]:
raw_prediction = feed_forward(xor_network, [0,1])
raw_prediction  #should be same as above: [0.5058497839923097]

[0.5058497839923097]

We now have an ANN that's set up to compute the final predictions for the XOR problem.

But, we're missing our label column. We need the actual labels (answers) to see how accurate our predictions were. We'll put off generating a label column for now - it's easier to use the XOR operator built into Python to get the actual results.

Let's run through all 4 samples (inputs), get a prediction, call XOR for the actual answer, and then compare.

In [0]:
0^1 # this is the built-in python XOR operator ("hat" is XOR)

1

Test on all XOR combinations.

<img src='https://www.dropbox.com/s/fvko9fo71pp1cpr/Screenshot%202020-02-21%2009.14.37.png?raw=1'>

In [0]:
for x in [0,1]: 
  for y in [0,1]:
    result = feed_forward(xor_network,[x,y])
    print(x,y, round(result[0]), x^y, result) 
    # prints each combination of numbers from x & y (first two columns)
    # then prints the predicted result (third column)
    # then prints the actual result (fourth column)
    # then prints the raw result

0 0 1 0 [0.5131390777373889]
0 1 1 1 [0.5058497839923097]
1 0 0 1 [0.4957459470865491]
1 1 0 0 [0.4877720338005447]


In [0]:
for x in [0,1]:
  for y in [0,1]:
    result = [x,y]
    print(result)

[0, 0]
[0, 1]
[1, 0]
[1, 1]

In [0]:
'''
0 0 1 0 [0.5131390777373889]  # x, y, predicted, actual (WRONG)
0 1 1 1 [0.5058497839923097]  # x, y, predicted, actual (RIGHT)
1 0 0 1 [0.4957459470865491]  # x, y, predicted, actual (WRONG)
1 1 0 0 [0.4877720338005447]  # x, y, predicted, actual (RIGHT)
'''

We got 50% accuracy using randomly assigned weights. The outputs were always very close to .5.
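The 50% figure can be checked with a few lines. The predictions are copied by hand from the third column of the printed table above; the variable names are just for this sketch.

```python
# Predictions copied from the third column of the printed table above.
predicted = [1, 1, 0, 0]

# Actual answers from the built-in XOR operator, in the same (x, y) order.
actual = [x ^ y for x in [0, 1] for y in [0, 1]]

correct = sum(p == a for p, a in zip(predicted, actual))
accuracy = correct / len(actual)
print(accuracy)   # 0.5
```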

### Discussion

The process we went through above really is similar to an actual neural network algorithm in how we used a list of weights and moved layer by layer toward the output. 

The difference between what we did above and a fully-developed set of ANN algorithms is that the latter tries to speed things up. It'll use a GPU to run calculations in parallel, and it'll use faster forms of the `dot` and `sigmoid` functions.
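As a rough illustration of what "faster forms" can look like, here is a sketch that uses NumPy to push all four XOR samples through the network at once. It runs on the CPU, not a GPU, and it is only one possible vectorized layout. The weights are copied from the printed `xor_network` above; `sigmoid_all` is a name made up for this example.

```python
import numpy as np

# Weights copied from the printed xor_network (one row per node).
hidden_weights = np.array([[-0.6169610992422154, 0.24421754207966373],
                           [-0.12454452198577104, 0.5707171674275384]])
output_weights = np.array([[0.559951616237607, -0.45481478943471676]])

def sigmoid_all(z):
    """Sigmoid applied elementwise to a whole array."""
    return 1 / (1 + np.exp(-z))

# All four XOR samples as one batch, one row per sample.
samples = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

hidden_out = sigmoid_all(samples @ hidden_weights.T)                   # shape (4, 2)
raw_predictions = sigmoid_all(hidden_out @ output_weights.T).ravel()   # shape (4,)

print(raw_predictions)
print(np.round(raw_predictions).astype(int))
```

The raw predictions match the four values printed by the loop over `feed_forward`, but they come from two matrix products in place of nested Python loops.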

## What about the "learning" part of neural networks?

We generated random weights above and then ran our four samples (inputs) to get four predictions for the XOR problem. But we're not getting very good results - all predictions were around .5. 

What we want is for the network to be able to learn from its mistakes.

Let's say our network predicts .1 but the actual value is 1. REALLY big error (difference of .9).

Our goal is to reduce the amount of error between predicted and actual. What do we have to change to achieve this? The weights. The weights are knobs we can turn.
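To make the "knobs" idea concrete, here is a naive sketch with a single sigmoid neuron. The weights, inputs, label, and nudge size are all hypothetical. It turns one weight slightly in each direction and re-measures the error. This is plain trial and error for illustration, not the procedure real training uses.

```python
import math

def predict(weights, inputs):
    """One sigmoid neuron: dot-product followed by the sigmoid function."""
    z = sum(w * x for w, x in zip(weights, inputs))
    return 1 / (1 + math.exp(-z))

# Hypothetical neuron, sample, and label, chosen only for illustration.
weights = [0.5, -0.5]
inputs = [1.0, 1.0]
actual = 1.0

base_error = abs(actual - predict(weights, inputs))

# Turn the first "knob" a little in each direction and re-measure the error.
nudged_up = [weights[0] + 0.1, weights[1]]
nudged_down = [weights[0] - 0.1, weights[1]]
error_up = abs(actual - predict(nudged_up, inputs))
error_down = abs(actual - predict(nudged_down, inputs))

print(base_error, error_up, error_down)
```

Here, nudging the first weight up lowers the error and nudging it down raises it, so the error itself tells us which way to turn the knob.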