<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:200%;
           font-family:Arial;letter-spacing:0.5px">

<p width = 20%, style="padding: 10px;
              color:white;">
Neural Networks: Architecture
              
</p>
</div>

Data Science Cohort Live NYC Nov 2022
<p>Phase 4: Topic 38</p>
<br>
<br>

<div align = "right">
<img src="Images/flatiron-school-logo.png" align = "right" width="200"/>
</div>
    
    

In [None]:
import os
import sys
module_path = os.path.abspath(os.path.join(os.pardir, os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)
    
import numpy as np
from matplotlib import pyplot as plt
import matplotlib.image as mpimg
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits

#### What is a neural network?

- Computational graph made of layers of composite calculation units:
    - designed to compute a function mapping inputs to outputs. 
- Each unit/layer gets tuned during training:
    - learns some specific part of input-output mapping.

<center> <img src = "images/dogcat.gif" width = 500 > </center>

The many interconnections: important
    
- Let nodes at each layer use relations it learned in adjacent layers.

- Build high degree of flexibility: **can learn very complex functions**

- Use connections in model to learn what aspects of data to rely on.

<center><img src = "images/dogcat.gif" width = 500 ></center>

#### Function complexity

<img src = "Images/neural-networks-layers.webp" width = 600 >

Can learn complex functions and decision boundaries.

Let's see how a neural network is learning at each layer:

<a href = "http://playground.tensorflow.org" >Tensorflow Playground</a>

Neural network: minimizing objective function (squared loss/binary cross-entropy)
- tune connections
- each node/unit learning features in the process
- feeds features to next layer. Learns more complex features to predict with, etc.

Complexity sufficient to learn/generalize on some pretty difficult problems:

<center><img src = "Images/image_multiclass.png" width = 400 ></center>

#### But how does all this work?
- Back to basics.

<img src = "Images/simple_nn.jfif" >

**Composition of a single unit** 
- can be thought of as a model

<img src = "Images/single-unit.png" width = 500>

- If $f$ is the identity function: literally linear regression (depending on objective function)


$$ \text{Output} = x_1 w_1 + x_2 w_2  + x_3 w_3 + b $$

In vector form: 

$$ \text{Output} = \textbf{w}^T \textbf{x} + b $$

<img src = "Images/single-unit.png" width = 500>

- Goal: tune weights $\textbf{w}$ to minimize objective function.
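A minimal sketch of the single-unit calculation with identity activation. The input, weight, and bias values below are made up for illustration; the point is only that the term-by-term sum and the vector form give the same number.

```python
import numpy as np

# Hypothetical values for one unit with three input features.
x_unit = np.array([1.0, 2.0, 3.0])
w_unit = np.array([0.5, -1.0, 0.25])
b_unit = 0.1

# Summation written out term by term, as in the scalar formula.
output_sum = (x_unit[0] * w_unit[0]
              + x_unit[1] * w_unit[1]
              + x_unit[2] * w_unit[2]
              + b_unit)

# The same computation in vector form: w^T x + b.
output_dot = w_unit @ x_unit + b_unit

print(output_sum, output_dot)
```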

#### Relation to logistic regression

<img src = "Images/single-unit.png" width = 500>

Now $f$ is the sigmoid function:

$$ f = \sigma(w_1x_1+\dots+w_nx_n + b)$$

<img src = "Images/linear_vs_logistic_regression.png" >
<center> Linear doesn't model well</center>
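As a rough sketch, a single unit with sigmoid activation computes the same thing as a logistic regression prediction. The function name `sigmoid_unit` and the example values are hypothetical, chosen only for this illustration.

```python
import numpy as np

def sigmoid_unit(x, w, b):
    """Single unit with sigmoid activation: sigma(w^T x + b)."""
    z = w @ x + b
    return 1 / (1 + np.exp(-z))

# Hypothetical feature vector, weights, and bias.
x_example = np.array([1.0, 2.0, 3.0])
w_example = np.array([0.5, -1.0, 0.25])
b_example = 0.1

p = sigmoid_unit(x_example, w_example, b_example)
print(p)  # a value strictly between 0 and 1, read as a class probability
```

With all weights and the bias at zero the unit outputs exactly 0.5, i.e. no preference for either class.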

#### The role of f: the activation function

- Linear: turns out to be too simple in general.
- Want neuron to learn some non-trivial part of function.

**Key is adding nonlinearity through $f$** (often referred to as $g$ )

<center><img src = "Images/activation_func.png" >
Typical choices for activation function $g$</center>

**ReLU: the most common activation function**

- simple thresholding behavior
- automated feature selection: 
    - turns a node off/on depending on the features passed in from previous layers of the network.

<center><img src = "Images/relu.png" >
    If input < 0: turn off neuron/corresponding feature.</center>

Putting many units together: form a layer

<img src = "images/activations_nn_layer.png" >

Matrix/vector representation 

$$ \textbf{a}^{[1]} = g(\textbf{z}^{[1]})$$

$$ \textbf{z}^{[1]} = W^{[1]T} \textbf{x} + \textbf{b}^{[1]}$$ 

<img src = "images/activations_nn_layer.png" >

Matrix/vector representation 

$$ \textbf{a}^{[1]} = g(\textbf{z}^{[1]})$$

$$ \textbf{z}^{[1]} = W^{[1]T} \textbf{x} + \textbf{b}^{[1]} \\ = \begin{bmatrix}
           z^{[1]}_{1} \\
           z^{[1]}_{2} \\
           z^{[1]}_{3} \\
           z^{[1]}_{4} \\
         \end{bmatrix} = \left[
  \begin{array}{ccc}
    \rule[.5ex]{3.5em}{0.4pt} & \textbf{w}^{T}_{1} & \rule[.5ex]{3.5em}{0.4pt}\\
    \rule[.5ex]{3.5em}{0.4pt} & \textbf{w}^{T}_{2} & \rule[.5ex]{3.5em}{0.4pt} \\
          \rule[.5ex]{3.5em}{0.4pt}   & \textbf{w}^{T}_{3}    &     \rule[.5ex]{3.5em}{0.4pt}     \\
    \rule[.5ex]{3.5em}{0.4pt} & \textbf{w}^{T}_{4} & \rule[.5ex]{3.5em}{0.4pt}
  \end{array}
\right] 
\begin{bmatrix}
           x_{1} \\
           x_{2} \\
           x_{3} \\
         \end{bmatrix}
+ \textbf{b}^{[1]}$$ 

Expanded further: matrix of weights/bias vector to learn.

$$ \textbf{z}^{[1]} = W^{[1]T} \textbf{x} + \textbf{b}^{[1]} \\ = \begin{bmatrix}
           z^{[1]}_{1} \\
           z^{[1]}_{2} \\
           z^{[1]}_{3} \\
           z^{[1]}_{4} \\
         \end{bmatrix} = \left[
  \begin{array}{ccc}
    w_{11} & w_{12} & w_{13}\\
    w_{21} & w_{22} & w_{23} \\
    w_{31} & w_{32} & w_{33} \\    
    w_{41} & w_{42} & w_{43} \\
  \end{array}
\right] 
\begin{bmatrix}
           x_{1} \\
           x_{2} \\
           x_{3} \\
         \end{bmatrix}
+ \begin{bmatrix}
           b_{1}^{[1]} \\
           b_{2}^{[1]} \\
           b_{3}^{[1]} \\
           b_{4}^{[1]} 
           \\
         \end{bmatrix}$$ 
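The matrix form above can be sketched in NumPy for 3 inputs and 4 units. This assumes the convention that `W_layer` stores one column of weights per unit, so that its transpose has one row per unit; the variable names and the random values are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# One column of weights per unit: shape (3 inputs, 4 units).
# W_layer.T then has one row w_i^T per unit, as in the matrix above.
W_layer = rng.normal(0, 0.1, size=(3, 4))
b_layer = np.zeros((4, 1))
x_col = np.array([[1.0], [2.0], [3.0]])

z_layer = W_layer.T @ x_col + b_layer   # shape (4, 1)
a_layer = np.maximum(0, z_layer)        # ReLU chosen here as the activation g

# Each entry of z is just the single-unit dot product for that unit.
z_first_unit = W_layer[:, 0] @ x_col[:, 0] + b_layer[0, 0]
print(z_layer.shape, np.isclose(z_layer[0, 0], z_first_unit))
```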

Deeper neural network

<center><img src = "images/Deeper_network.jpg" > Activations and weights for each layer </center>


<center><img src = "images/Deeper_network.jpg" > Feedforward: pass data in, calculate activations. Flows from input to output through hidden layers. </center>


- Hidden layers: layers between input/output layer.
- Neural network learning features here automatically.
- Don't really know/control what hidden layers are finding: **black box** 
    - inputs/outputs in hidden layers are hidden

**Forward Propagation**

- Pass input: compute function map layer by layer using weights
- yield output $\hat{y}$ (regression target, class label, etc.)
- Compute cost function $L(\hat{y},y)$
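The three steps above can be sketched for one example as follows. This is an illustration under assumptions: a tiny 3-4-1 regression network with random weights, ReLU in the hidden layer, identity at the output, and squared error as $L$. The helper names (`forward`, `relu_fn`, `identity_fn`) are hypothetical.

```python
import numpy as np

def relu_fn(z):
    return np.maximum(0, z)

def identity_fn(z):
    return z

def forward(x, layers):
    """Pass x through a list of (W, b, g) layers; return the final activation."""
    a = x
    for W, b, g in layers:
        a = g(W.T @ a + b)   # z = W^T a + b, then apply the activation g
    return a

rng = np.random.default_rng(1)
layers_demo = [
    (rng.normal(0, 0.1, (3, 4)), np.zeros((4, 1)), relu_fn),      # hidden layer
    (rng.normal(0, 0.1, (4, 1)), np.zeros((1, 1)), identity_fn),  # output unit
]

x_in = np.array([[1.0], [2.0], [3.0]])
y_true = 2.0                                   # made-up regression target

y_hat_demo = forward(x_in, layers_demo)        # prediction, shape (1, 1)
loss_demo = (y_hat_demo.item() - y_true) ** 2  # squared-error loss L(y_hat, y)
print(y_hat_demo.item(), loss_demo)
```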

Single neuron network (regression):
<center><img src = "Images/costfunction_singleexample.png" width = 2000></center>

Implement **forward propagation** for single neuron.

Dataset: handwritten numbers from `sklearn`:
- Each record is a 64-pixel (8x8) image of a handwritten number between 0 and 9. 
- Each pixel value (a number between 0 and 16) represents the relative intensity of the pixel.

In [None]:
digits = load_digits()
flat_image = np.array(digits.data[0]).reshape(digits.data[0].shape[0], -1)
eight_by_eight_image = digits.images[0]

Let's look at one digit:

In [None]:
imgplot = plt.imshow(eight_by_eight_image,
                     cmap='Greys')

Image in matrix form

In [None]:
# larger numbers = darker

eight_by_eight_image

Image data fed into neuron: must be in vector form:
- **flatten** the image into a 64-element array.

In [None]:
img_flattened = np.ravel(eight_by_eight_image)
img_flattened

In [None]:
print(img_flattened.shape)

Want to reshape into column vector

In [None]:
img_col_vec = img_flattened.reshape(-1,1)
print(img_col_vec)

In [None]:
print(img_col_vec.shape)

Forward propagate data through single neuron:
<center><img src = "Images/single-unit.png" width = 450></center>

We will initialize our weights with small random numbers.


In [None]:
w = np.random.uniform(-0.1, 0.1, (flat_image.shape[0], 1))
print(w[:5])
print(w.shape)

We'll set our bias term to 0:

In [None]:
b = 0

### Summation

Calculate the summation (linear part of calculation):

<center><img src = "Images/single-unit.png" width = 450></center>

Dot product: $\textbf{w}^T \textbf{x} + b$

In [None]:
z = w.T@img_col_vec + b
z

### The activation function

Nonlinear part of calculation:

<center><img src = "Images/single-unit.png" width = 450></center>

We have a suite of activation functions to choose from.

#### Sigmoid

$f(x) = \frac{1}{1+e^{-x}}$

Typically used in last node of network doing binary classification:

<img src = "Images/sigmoid_binary.png" width = 350>

In [None]:
# Z is the input from our collector, the sum of the weights
# multiplied by the features and the bias

def sigmoid(z):
    '''
    Input the sum of our weights times the pixel intensities, plus the bias
    Output a number between 0 and 1.
    
    '''
    return 1/(1 + np.exp(-z))

In [None]:
X = np.linspace(-10, 10, 20000)
sig = sigmoid(X)

fig, ax = plt.subplots()
ax.plot(X, sig)
ax.set_title('Sigmoid Activation');

Calculating neuron output if sigmoid:

In [None]:
a = sigmoid(z)
a

#### tanh

- shifted/scaled version of sigmoid (roughly)
- when used, typically activation for hidden layers

$f(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$

In [None]:
def tanh(z):
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

In [None]:
# Coding tanh:

X = np.linspace(-10, 10, 20000)
y_tanh = tanh(X)

fig, ax = plt.subplots()
ax.plot(X, y_tanh)
ax.set_title('Hyperbolic Tangent Activation');

Shifted/scaled (roughly) compared to sigmoid:

- output is centered around 0
- the output is between -1 and 1
- steeper slope vs input (more sensitivity) 
- makes learning in the next layer easier. 
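The shift/scale relationship can be made concrete: $\tanh(x) = 2\sigma(2x) - 1$. A quick numerical check, with a locally defined `sigmoid_fn` so the snippet runs on its own:

```python
import numpy as np

def sigmoid_fn(z):
    return 1 / (1 + np.exp(-z))

x_grid = np.linspace(-5, 5, 101)

# Scale the input by 2, scale the output by 2, then shift down by 1.
tanh_from_sigmoid = 2 * sigmoid_fn(2 * x_grid) - 1

print(np.allclose(tanh_from_sigmoid, np.tanh(x_grid)))
```

The factor of 2 inside the sigmoid is what produces the steeper slope around 0.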

One problem with tanh (and sigmoid) for hidden layers is that for large parts of input space: 

- the slope of the activation function flattens out

The network doesn't learn from features effectively in these regions:
- tuning weights based on information in these regions is ineffective and slow
- **vanishing gradient** problem.
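To see the flattening numerically, here is a small sketch that evaluates the derivatives $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ and $\tanh'(z) = 1 - \tanh^2(z)$ at a few inputs. The helper names are hypothetical.

```python
import numpy as np

def sigmoid_grad(z):
    s = 1 / (1 + np.exp(-z))
    return s * (1 - s)

def tanh_grad(z):
    return 1 - np.tanh(z) ** 2

# Slopes are largest at z = 0 and shrink rapidly as |z| grows.
for z_val in [0.0, 2.0, 5.0, 10.0]:
    print(z_val, sigmoid_grad(z_val), tanh_grad(z_val))
```

Since backpropagation multiplies these slopes together across layers, many small factors in a row drive the gradient reaching early layers toward zero.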

#### ReLU

- superseded tanh in hidden layers (for the most part)
- de facto non-output activation function

$f(x) = 0$ if $x\leq 0$; $f(x) = x$ otherwise

In [None]:
def relu(z):
    return (z * (z > 0))

In [None]:
# Coding ReLU:

X = np.linspace(-10, 10, 200)

y_relu = relu(X)  # relu is vectorized, so no loop is needed

fig, ax = plt.subplots()
ax.plot(X, y_relu)
ax.set_title('ReLU Activation');

- Constant gradient (for positive inputs) leads to faster learning in comparison to sigmoid and tanh.

- Region of 0 activation: easy computation.

Models notion of neuron on/off.

#### Softmax

$$ \large \sigma(\textbf{z})_i = \frac{e^{z_i}}{\sum_{j}e^{z_j}} $$

<img src = "Images/softmax.png" width = 400>

Appropriate activation in the output layer for **multi-class** classification problems. 

- Outputting the probabilities of belonging to each class.
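A minimal softmax sketch in NumPy. Subtracting the maximum score before exponentiating is a common numerical-stability trick that does not change the result; the example scores are made up.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis, shifted by the max for numerical stability."""
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

scores = np.array([2.0, 1.0, 0.1])   # hypothetical raw outputs for 3 classes
probs = softmax(scores)

print(probs, probs.sum())            # probabilities sum to 1
print(probs.argmax())                # index of the predicted class
```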

There are other activation functions; [see here](https://towardsdatascience.com/comparison-of-activation-functions-for-deep-neural-networks-706ac4284c8a). 

Our nodes will be taking in input from multiple sources. Let's add the entire training set as our input. 


In [None]:
X_train, X_test, y_train, y_test = train_test_split(digits.data,
                                                    digits.target,
                                                    random_state=42,
                                                    test_size=0.2)
X_train.shape

In [None]:
X_train[0, :].reshape(8, 8) # first example

In [None]:
imgplot = plt.imshow(X_train[0, :].reshape(8, 8),
                     cmap='Greys')

In [None]:
y_train[0]

Feeding all training data into single neuron

In [None]:
z_0 = X_train@(w)+b
z_0.shape

In [None]:
z_0

In [None]:
# calculating sigmoid activation for each data point
a_0 = sigmoid(z_0)
a_0

Calculating the ReLU activation for each training data point:

In [None]:
a_0_relu = relu(z_0)  # vectorized: applies ReLU elementwise
a_0_relu[:10]

In [None]:
a_0_relu.shape

Now we compute with a matrix of weights: a layer with 4 neurons.

In [None]:
w_1 = np.random.normal(0, 0.1, (X_train.shape[1], 4))
w_1.shape

In [None]:
b_1 = 0

In [None]:
Z_1 = X_train@w_1 + b_1
Z_1

In [None]:
A_1 = relu(Z_1)
A_1

Now each of these neurons has a set of weights and a bias associated with it.

In [None]:
w_2 = np.random.normal(0, 0.1, (A_1.shape[1], 1))

w_2.shape

In [None]:
b_2 = 0

In [None]:
Z_2 = A_1@w_2 + b_2

In [None]:
output = sigmoid(Z_2)
y_pred = output > 0.5
y_hat = y_pred.astype(int)
y_hat[:5]

#### Backpropagation

The **backpropagation** algorithm: adjusting the parameters (weights) to get a better result. 

Propagating the error (averaged over all training samples) back through the network:

- with the many-sample cost function $$J(\{W^{[l]}\}, \{b^{[l]}\}) = \frac{1}{m} \sum_{i = 1}^m L(\hat{y}_i, y_i)$$ guiding us.



- Goal: Tune $\{W^{[l]}\}$ and $ \{b^{[l]}\}$ to minimize $J$.

**How?**


We do this tuning by propagating the (average) error back through the network, with the cost function $J$ guiding us and adjusting via gradient descent.

> Turn down previous neurons that give a bad result
>
> Turn up previous neurons that give a good result

**Gradient descent algorithm**
- Compute $\frac{\partial J}{\partial W^{[l]}}$ and $\frac{\partial J}{\partial b^{[l]}}$ for each layer $l$.
- Update each bias vector/weight matrix:
$$  W^{[l]} \rightarrow W^{[l]} - \alpha \frac{\partial J}{\partial W^{[l]}}$$
$$  b^{[l]} \rightarrow b^{[l]} - \alpha \frac{\partial J}{\partial b^{[l]}}$$


***Backpropagation allows very efficient computation of gradients at each layer***
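A sketch of the gradient descent loop for the simplest case: one sigmoid neuron trained with binary cross-entropy on made-up toy data. For this setup the gradients have a closed form, $\frac{\partial J}{\partial w} = \frac{1}{m} X^T(a - y)$ and $\frac{\partial J}{\partial b} = \frac{1}{m}\sum_i (a_i - y_i)$. The data, variable names, and learning rate are assumptions for illustration, not part of the lesson's dataset.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical toy data: 200 points, 2 features, linearly separable labels.
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, [0]] + X_toy[:, [1]] > 0).astype(float)   # shape (200, 1)

w_toy = np.zeros((2, 1))
b_toy = 0.0
alpha = 0.5                     # learning rate
m = X_toy.shape[0]

def bce_cost(a, y, eps=1e-12):
    """Binary cross-entropy averaged over all samples."""
    return -np.mean(y * np.log(a + eps) + (1 - y) * np.log(1 - a + eps))

costs = []
for step in range(100):
    # Forward propagation
    a_toy = 1 / (1 + np.exp(-(X_toy @ w_toy + b_toy)))
    costs.append(bce_cost(a_toy, y_toy))

    # Gradients of J with respect to the weights and the bias
    dw = X_toy.T @ (a_toy - y_toy) / m
    db = np.mean(a_toy - y_toy)

    # Gradient descent update
    w_toy = w_toy - alpha * dw
    b_toy = b_toy - alpha * db

print(costs[0], costs[-1])      # the cost should go down over the iterations
```

In a multi-layer network the update step looks the same; backpropagation is what supplies `dw` and `db` for every layer.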

Then forward prop again. Repeat back prop...cycle until $J$ converges:

<img src = "Images/backprop.gif" width = 600>

**Weights are updated only after all gradients have been computed during the backward pass.**

Great video explanation of backpropagation by 3Blue1Brown (part of a full playlist): [Backpropagation calculus | Deep learning, chapter 4](https://www.youtube.com/watch?v=tIeHLnjs5U8&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&index=4)

We'll talk more about this in optimizing our neural networks but some hyperparameters include:

- **Learning Rate ($\alpha$)**: how big of a step we take in gradient descent
- **Number of Epochs**: how many full passes we make over the training data
- **Batch Size**: how many data points we use to compute a single gradient update (one epoch is made up of many batches)
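How these hyperparameters fit together can be sketched as a mini-batch training loop skeleton. The data here is random placeholder data, and the gradient step itself is left as a comment; the snippet only counts how many updates the chosen settings would produce.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, batch_size, n_epochs = 1000, 32, 3
X_fake = rng.normal(size=(n_samples, 64))    # placeholder data, 64 features

n_updates = 0
for epoch in range(n_epochs):
    order = rng.permutation(n_samples)       # reshuffle the data every epoch
    for start in range(0, n_samples, batch_size):
        batch_idx = order[start:start + batch_size]
        X_batch = X_fake[batch_idx]
        # Forward propagation, backpropagation, and one gradient descent
        # step with learning rate alpha would happen here, using X_batch only.
        n_updates += 1

# 32 batches per epoch (the last one is smaller) times 3 epochs
print(n_updates)  # 96
```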