<h1 style="color:brown;">  Introduction to neural networks</h1> 

#### Import packages

In [1]:
import numpy as np

### Lesson plan 
1. Familiarity
1. Applications
1. Motivation
  1. A piece of history 
  1. Connection to neurons 
1. The neural network graph
  1. Computational graph with step function
  1. Computational graph with sigmoid - Logistic Regression
  1. The general Neural Network
  1. Non linearity - XoR
1. Optimization in NN aka backprop aka chain rule
  1. Simple backprop example
  1. Backprop simple rules
1. Size constraint
  1. Batch gradient descent
  1. Epochs
1. Takeaways

### Next lectures: 

1. Hands-on experience in running a neural network
1. Different types of nets and tricks. 

### 1. It's quite familiar

![](./img/linear_vs_logistic_regression.jpg)

##### Another way of writing it:
![](img/log_reg_deriv.png)

#### --> Logistic regression was our first Neural Network!
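
To make the connection concrete, here is a minimal sketch of logistic regression written as a single "neuron": a weighted sum of the inputs followed by a sigmoid. The names `sigmoid` and `logistic_neuron` and all numeric values are hypothetical, chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_neuron(x, w, b):
    """One 'neuron': weighted sum of the inputs, then a sigmoid."""
    return sigmoid(np.dot(w, x) + b)

# Hypothetical weights, bias and input (illustration only)
w = np.array([2.0, -1.0])
b = 0.5
x = np.array([1.0, 3.0])

# The weighted sum is 2*1 - 1*3 + 0.5 = -0.5, so the output is sigmoid(-0.5)
p = logistic_neuron(x, w, b)
print(p)
```

The output can be read as a probability, exactly as in logistic regression.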

### 2.  Applications (Let's see why the hype)

All the hype started with this <a href="https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf">paper</a>

More about applications: https://cs.stanford.edu/people/eroberts/courses/soco/projects/neural-networks/Applications/index.html

#### Some impressive applications:
- Image recognition: video, faces, neural style
- Recommendation systems 
- Weather predictions
- AlphaGo
- Translations, speech recognition
- Driverless cars


#### Videos:
- <a href="https://www.youtube.com/watch?v=4eIBisqx9_g">face recognition</a>
- <a href="https://www.youtube.com/watch?v=gn4nRCC9TwQ">walking robot </a>
- <a href="https://www.youtube.com/watch?v=8ypnLjwpzK8">translations - open.ai</a>


### 3.1. The imitation game - motivation I

Are humans just very sophisticated machines?

![](https://media.giphy.com/media/IPuswjolNFQze/giphy.gif)

### 3.2. A neuron (inside your brain) - motivation II

![](https://media.giphy.com/media/kyR6tnuZItmec/giphy.gif)

We can view the brain as a function approximator that adjusts weights and thresholds to reach 
a desired result, i.e. we are looking for a performance function (aka loss function, aka cost function) that measures how close we are to that result

### That's very interesting and inspiring but only loosely related - be careful with analogies!

So what parts do we have:
- inputs
- weights 
- sum of weighted electrical currents
- thresholds - should we activate or not 
- output
- performance function
    
Don't these parts sound very familiar?

### 4. The Neural Network graph 

#### 4.1 Simple computational graph

![Comp_graph](./img/computational_graph.jpg)

A graph in which each node applies a mathematical operation to the values carried by the arrows pointing into it.

#### 4.2. Computational graph with step function 

![](img/step_func.jpg)

--> Notice that the step function is problematic for taking derivatives
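
A small numerical sketch of why this is a problem: away from the jump, the slope of the step function is zero, so gradient-based learning gets no signal, while the sigmoid has a non-zero slope everywhere. The helper `numerical_derivative` and the sample points are assumptions made for this illustration.

```python
import numpy as np

def step(z):
    """Step activation: 1 where z > 0, else 0."""
    return (z > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def numerical_derivative(f, z, h=1e-5):
    """Central finite-difference estimate of the slope of f at z."""
    return (f(z + h) - f(z - h)) / (2 * h)

# Sample points away from the jump at z = 0
z = np.array([-2.0, -0.5, 0.5, 2.0])

step_grad = numerical_derivative(step, z)        # zero at every sample point
sigmoid_grad = numerical_derivative(sigmoid, z)  # strictly positive

print(step_grad)
print(sigmoid_grad)
```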

#### 4.3. Computational graph with sigmoid function  

![log-reg](./img/log_reg.png)

##### The non-linear function is called an activation function

Why is sigmoid not the best choice? Meet ReLU:
<a href="https://stats.stackexchange.com/questions/126238/what-are-the-advantages-of-relu-over-sigmoid-function-in-deep-neural-networks">ReLU advantages</a>

<a href="https://www.quora.com/What-are-the-advantages-of-using-Leaky-Rectified-Linear-Units-Leaky-ReLU-over-normal-ReLU-in-deep-learning">Leaky ReLU advantages</a>
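
As a rough sketch of the difference, the functions below implement ReLU and Leaky ReLU and compare their slopes with the slope of the sigmoid. The negative-side slope `alpha=0.01` is an assumed, commonly used value, not something fixed by these notes.

```python
import numpy as np

def relu(z):
    """ReLU: keep positive values, replace negative values with 0."""
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: like ReLU, but negative values keep a small slope alpha."""
    return np.where(z > 0, z, alpha * z)

def relu_grad(z):
    """Slope of ReLU: 1 for positive inputs, 0 otherwise."""
    return (z > 0).astype(float)

def sigmoid_grad(z):
    """Slope of the sigmoid: s * (1 - s), where s = sigmoid(z)."""
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

z = np.array([-10.0, -1.0, 1.0, 10.0])

print(relu(z))          # negative inputs are clipped to 0
print(leaky_relu(z))    # negative inputs are scaled by alpha instead
print(relu_grad(z))     # stays at 1 for large positive inputs
print(sigmoid_grad(z))  # shrinks towards 0 for large |z| (saturation)
```

The sigmoid's slope almost vanishes for large inputs, which slows learning, while ReLU keeps a slope of 1 for any positive input.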

#### 4.4 General Neural network graph

So what does the graphical representation of a neural network look like?

![nn](./img/First_network.jpg)

<h3 style="color:green;">The procedure of flowing inputs to get output is called feed-forward </h3>
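
A minimal feed-forward sketch for a network with one hidden layer. The layer sizes (3 inputs, 4 hidden units, 2 outputs), the ReLU activation and the random weights are all assumptions for illustration; they are not read off the figure above.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def feed_forward(x, W1, b1, W2, b2):
    """Push x through one hidden layer (linear + ReLU) and a linear output layer."""
    hidden = relu(W1 @ x + b1)   # hidden layer: weighted sums, then activation
    return W2 @ hidden + b2      # output layer: weighted sums only

rng = np.random.default_rng(0)
x = rng.normal(size=3)                          # 3 input features
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # 3 inputs -> 4 hidden units
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # 4 hidden units -> 2 outputs

output = feed_forward(x, W1, b1, W2, b2)
print(output.shape)   # (2,)
```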

### Why hidden?

![](./img/Ice_cream_network.jpg)

### Why deep?

"deep networks have a hierarchical structure 
which makes them particularly well adapted to learn the hierarchies of knowledge 
that seem to be useful in solving real-world problems. Put more concretely, 
when attacking problems such as image recognition, 
it helps to use a system that understands not just individual pixels, 
but also increasingly more complex concepts: from edges to simple geometric shapes, 
all the way up through complex, multi-object scenes."

### 4.5. The strength of NN - XOR example

#### Can we classify correctly with linear separation?

![](./img/XoR.jpg)

- There is no linear split that will separate the classes correctly 

## Class exercise (5 min)
Let's try to solve it ourselves:
- We have the following points: [[1, 1], [0, 1], [1, 0], [0, 0]]
- We will try to use the following formula: $f = W_2 \cdot \max(0, W_1 X + b)$
- Take $W_1$ to be a 2×2 matrix full of 1's
- Take $b$ to be [0, -1]
- Take the max operation element-wise
- Take $W_2$ to be [1, -2]
- Use np.array, np.dot (for matrix multiplication) and np.maximum
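
One possible way to check the exercise, as a sketch. It stores one point per row of `X`, so the product is written as `X` times the transpose of `W1` rather than $W_1 X$; that layout is a choice made here, not part of the exercise.

```python
import numpy as np

X = np.array([[1, 1], [0, 1], [1, 0], [0, 0]])   # one point per row
W1 = np.ones((2, 2))
b = np.array([0, -1])
W2 = np.array([1, -2])

# Hidden layer: linear map plus bias, then element-wise max with 0
hidden = np.maximum(0, np.dot(X, W1.T) + b)   # shape (4, 2)

# Output: one number per point
f = np.dot(hidden, W2)

print(f)   # [0. 1. 1. 0.]
```

The output is 1 exactly for the points [0, 1] and [1, 0], which is the XOR pattern that no single linear split could produce.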

More impressive example: <a href="http://playground.tensorflow.org/#activation=sigmoid&batchSize=10&dataset=xor&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=6&seed=0.49592&showTestData=false&discretize=false&percTrainData=60&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false">Playground</a>

Recommended <a href="http://neuralnetworksanddeeplearning.com/chap4.html">read</a>: Neural nets can approximate any function

### 5. How do we adjust weights & thresholds? Optimization = Backprop + update rule

#### 5.1. Simple backprop example - some math 

![comp-graph](./img/computational_graph.jpg)

Backprop: 

- Instead of taking the full derivative from the loss function all the way back to the inputs in one go:
  - break the problem into small chunks
  - work on one edge at a time, reusing the derivatives already calculated for previous edges 

$p = x + y$ 

$g = p*z$

$\frac{\partial g}{\partial p} = z$

$\frac{\partial g}{\partial z} = p$

$\frac{\partial g}{\partial x} = \frac{\partial g}{\partial p} \frac{\partial p}{\partial x}$

$\frac{\partial g}{\partial y} = \frac{\partial g}{\partial p} \frac{\partial p}{\partial y}$

--> Notice: we only calculate $\frac{\partial g}{\partial p}$ once, then we can reuse it!
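
The derivatives above can be traced in a few lines of plain Python. This is a sketch with hypothetical input values; the finite-difference check at the end is an extra sanity test, not part of backprop itself.

```python
def forward(x, y, z):
    """Forward pass through the graph: p = x + y, then g = p * z."""
    p = x + y
    g = p * z
    return p, g

# Hypothetical input values
x, y, z = 2.0, 3.0, 4.0
p, g = forward(x, y, z)   # p = 5, g = 20

# Backward pass: local derivatives, from the output back to the inputs
dg_dp = z                 # computed once ...
dg_dz = p
dp_dx = 1.0
dp_dy = 1.0
dg_dx = dg_dp * dp_dx     # ... and reused here
dg_dy = dg_dp * dp_dy     # ... and here

# Sanity check of dg/dx with a central finite difference
h = 1e-6
numeric_dg_dx = (forward(x + h, y, z)[1] - forward(x - h, y, z)[1]) / (2 * h)

print(dg_dx, dg_dy, dg_dz)   # 4.0 4.0 5.0
```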

Don't forget that the parameter update in gradient descent is:

$W_1 = W_1 - \epsilon \frac{\partial f}{\partial W_1}$

where $\epsilon$ is the learning rate.
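
A minimal sketch of this update rule on a toy problem. The loss $(w - 3)^2$, the learning rate 0.1 and the number of steps are all assumptions picked so that the behaviour is easy to follow.

```python
def loss(w):
    """Toy loss with its minimum at w = 3 (an illustrative choice)."""
    return (w - 3.0) ** 2

def loss_gradient(w):
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)

epsilon = 0.1   # learning rate
w = 0.0         # arbitrary starting point

for step in range(100):
    # Move a small step against the gradient
    w = w - epsilon * loss_gradient(w)

print(w)   # very close to 3
```

Each step shrinks the distance to the minimum by a constant factor here; a learning rate that is too large would instead overshoot and diverge.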

![ff-bb](./img/ff-bb.gif)

### 6. Neural networks are data hungry

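A typical (small) image is 64×64 pixels, i.e. a 4096-dimensional vector.

Let's imagine we pass it through one linear layer followed by the element-wise max operator, 
so that the output is still a 4096-dimensional vector. 
This layer alone needs $4096^2 = 16{,}777{,}216$ weights (about $1.7 \times 10^7$)! The weights are shared across images, so their number does not grow with the data, 
but the work does: pushing 10000 images through this single layer takes about $1.7 \times 10^{11}$ multiplications per pass.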
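
A quick back-of-the-envelope check of these sizes, as a sketch. It assumes a single fully connected layer that maps a 4096-dimensional input to a 4096-dimensional output with no bias terms, and it counts one multiplication per weight per image. The variable names are made up for illustration.

```python
n_pixels = 64 * 64                 # inputs per image
n_weights = n_pixels * n_pixels    # one weight per (input, output) pair
n_images = 10_000

# Every image is multiplied by every weight once per forward pass
n_multiplications = n_weights * n_images

print(n_pixels)            # 4096
print(n_weights)           # 16777216
print(n_multiplications)   # 167772160000
```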

#### 6.1 Batch gradient descent

##### Instead of calculating the influence of all observations at once, randomly choose a small sample (a batch) and update the parameters accordingly. 

One of the most popular optimizers these days is called 'Adam', which extends ordinary gradient descent with an individual, adaptive learning rate for each parameter. This <a href="https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/">article</a> has a nice discussion of Adam.
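
For intuition, here is a sketch of a single Adam update step in NumPy, applied to a toy quadratic loss. The function name `adam_step` is hypothetical, the default values for `beta1`, `beta2` and `eps` are the commonly quoted ones, and the toy loss and step count are assumptions for this illustration.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    # Each parameter gets its own effective step size via sqrt(v_hat)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy problem: minimise sum((theta - target)^2)
target = np.array([1.0, -2.0])
theta = np.zeros(2)
m = np.zeros(2)
v = np.zeros(2)

for t in range(1, 5001):
    grad = 2.0 * (theta - target)
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.01)

print(theta)   # should end up near the target
```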

Other optimization algorithms:
- Adadelta
- Nesterov
- Momentum

For the mathematical details, check out this <a href="https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/">post</a>.

Given that we are only passing some of the data points through at any one time, the question of when to update the weights becomes pressing. The standard approach is to wait until all the data has passed through before updating, but we can also update after each batch (called "batch gradient descent" in these notes, and more commonly "mini-batch gradient descent") or even after each point ("stochastic gradient descent").
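
A minimal sketch of how the data can be cut into random batches. The helper name `iterate_minibatches` and the toy data are hypothetical; real frameworks ship their own batching utilities.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Yield shuffled (X_batch, y_batch) pairs that together cover the data once."""
    indices = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch_idx = indices[start:start + batch_size]
        yield X[batch_idx], y[batch_idx]

rng = np.random.default_rng(0)
X = np.arange(20).reshape(10, 2)   # 10 toy observations, 2 features each
y = np.arange(10)

# In training, the weights would be updated once per yielded batch
batch_sizes = [len(X_batch) for X_batch, y_batch in
               iterate_minibatches(X, y, batch_size=4, rng=rng)]

print(batch_sizes)   # [4, 4, 2]
```

The last batch is smaller because 10 observations do not divide evenly into batches of 4.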

#### 6.2 Epochs

One epoch is one full pass over the training data: we keep taking batches until every observation has been used once.

Typically we will run hundreds or thousands of epochs
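
To show how epochs and batches fit together, here is a sketch that fits a straight line with one weight update per batch. The toy data (slope 3, intercept 1, no noise), the learning rate, the batch size and the epoch count are all assumptions for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=100)   # 100 toy observations
y = 3.0 * X + 1.0                      # noiseless target: slope 3, intercept 1

w, b = 0.0, 0.0
learning_rate = 0.1
batch_size = 10
n_epochs = 50
n_updates = 0

for epoch in range(n_epochs):
    order = rng.permutation(len(X))            # reshuffle at every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        error = (w * X[idx] + b) - y[idx]
        # Gradients of the mean squared error on this batch
        w -= learning_rate * 2.0 * np.mean(error * X[idx])
        b -= learning_rate * 2.0 * np.mean(error)
        n_updates += 1

print(n_updates)   # 50 epochs * 10 batches per epoch = 500
print(w, b)        # should be close to 3 and 1
```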

### 7. Takeaways

- Terminology & concepts:
    - Linear layers aka fully-connected aka dense layers
    - Neuron: sum of weighted inputs 
    - Activation function: choose according to problem specifics. Common choices: ReLU, Leaky ReLU (<a href="https://en.wikipedia.org/wiki/Rectifier_(neural_networks)">ReLU layers</a>), <a href="https://en.wikipedia.org/wiki/Hyperbolic_function">tanh</a>
    - Hidden layer: linear layer + activation function (we can add as many as we want)
    - Feed-forward: push inputs through network to get the output
    - Backprop: calculate the derivatives of the loss with respect to the weights by recursion (the chain rule), reusing derivatives that were already computed 
    - Batch gradient descent
    - Epochs

### Typical neural net architecture:

![](./img/Neural_network_layers.png)

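---> Notice: it doesn't make sense to use 2 (or more) linear layers in a row without an activation function between them, because their composition is just another linear layer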
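
A small sketch of why: two linear layers applied one after the other give the same result as one layer whose weight matrix is their product, while putting an activation in between breaks that equivalence. The matrices and the input are hypothetical values chosen to keep the arithmetic easy.

```python
import numpy as np

W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])   # first linear layer
W2 = np.array([[1.0, 1.0]])    # second linear layer
x = np.array([2.0, 1.0])

# Two linear layers in a row ...
two_linear_layers = W2 @ (W1 @ x)

# ... equal one linear layer with the combined weight matrix W2 @ W1
one_linear_layer = (W2 @ W1) @ x

# With a ReLU in between, the result is no longer a single linear map
with_activation = W2 @ np.maximum(0.0, W1 @ x)

print(two_linear_layers, one_linear_layer)   # [0.] [0.]
print(with_activation)                       # [1.]
```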

- Most commonly successful in (applications)
    - NLP (natural language processing)
    - Machine vision
    - Games

- The bad
    - Typically tons of data
    - Tons of tuning 
    - Many things to adjust - parameters, layers, optimization techniques, learning rate, initializations
    - Hard to interpret 

- The good
    - After training, very quick to produce an answer
    - Amazing results if trained correctly
    - They are potentially very powerful function approximators

## First TensorFlow model!

In [None]:
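# Note: this cell uses the TensorFlow 1.x API (tf.placeholder, tf.Session and the
# tensorflow.examples.tutorials module); it does not run as-is on TensorFlow 2.x.
import tensorflow as tf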

# Import MNIST data
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)

# Parameters
learning_rate = 0.01
training_epochs = 25
batch_size = 100
display_step = 1

# tf Graph Input
x = tf.placeholder(tf.float32, [None, 784]) # mnist data image of shape 28*28=784
y = tf.placeholder(tf.float32, [None, 10]) # 0-9 digits recognition => 10 classes

# Set model weights
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))

# Construct model
pred = tf.nn.softmax(tf.matmul(x, W) + b) # Softmax

# Minimize error using cross entropy
cost = tf.reduce_mean(-tf.reduce_sum(y*tf.log(pred), reduction_indices=1))
# Gradient Descent
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

# Initialize the variables (i.e. assign their default value)
init = tf.global_variables_initializer()

# Start training
with tf.Session() as sess:

    # Run the initializer
    sess.run(init)

    # Training cycle
    for epoch in range(training_epochs):
        avg_cost = 0.
        total_batch = int(mnist.train.num_examples/batch_size)
        # Loop over all batches
        for i in range(total_batch):
            batch_xs, batch_ys = mnist.train.next_batch(batch_size)
            # Run optimization op (backprop) and cost op (to get loss value)
            _, c = sess.run([optimizer, cost], feed_dict={x: batch_xs,
                                                          y: batch_ys})
            # Compute average loss
            avg_cost += c / total_batch
        # Display logs per epoch step
        if (epoch+1) % display_step == 0:
            print("Epoch:", '%04d' % (epoch+1), "cost=", "{:.9f}".format(avg_cost))

    print("Optimization Finished!")

    # Test model
    correct_prediction = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
    # Calculate accuracy
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    print("Accuracy:", accuracy.eval({x: mnist.test.images, y: mnist.test.labels}))
