# Topic 40: Introduction to Neural Networks

1. What is a **perceptron**?
2. Multilayer perceptrons (neural networks!)
3. Backpropagation
4. Building our first neural network on Google Colab

Check out this "cheat sheet": [These are all neural networks!](https://towardsdatascience.com/the-mostly-complete-chart-of-neural-networks-explained-3fb6f2367464)

<img src='resources/move_on.jpg' width=500/>

## Applications of Neural Networks

- Clustering
- Pattern Recognition
- Image Recognition (CNN)
- Time Series Forecasting (RNN)
- Audio/Video/Image Generation (GAN) 

#### Limitations
- Good for prediction, but weak for inference: the learned weights are hard to interpret
- Computationally expensive 

## Logistic Regression as a Perceptron

* The diagram below shows **one row of data**; each input is a different feature
* Weights are determined through gradient descent
* The **bias** term is our logistic regression intercept term
* The **activation function** is the sigmoid function that forces output values between 0 and 1
* Output is our classification result

![](https://miro.medium.com/max/1280/1*8VSBCaqL2XeSCZQe_BAyVA.jpeg)


* The perceptron algorithm is about learning the weights for inputs in order to draw a **linear decision boundary** that allows us to discriminate between two linearly separable classes
* A perceptron takes in inputs, sums them up with weights, adds a bias, applies some activation function --> output
* You can use different activation functions (sigmoid, tanh, ReLU, etc.)
* Many perceptrons put together create a neural network
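
As a minimal sketch of the bullets above, here is a single perceptron with a sigmoid activation in NumPy. The feature values, weights, and bias are made-up numbers for illustration; in practice the weights and bias would be learned through gradient descent.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(x, w, b):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = np.dot(w, x) + b
    return sigmoid(z)

# Hypothetical values chosen only for illustration.
x = np.array([1.0, 2.0, 3.0])    # one row of data with three features
w = np.array([0.5, -0.25, 0.0])  # one weight per feature
b = 0.0                          # bias (the intercept in logistic regression)

probability = perceptron(x, w, b)          # z = 0.5 - 0.5 + 0 = 0, so sigmoid(0) = 0.5
predicted_class = int(probability >= 0.5)  # threshold the probability to get a class
print(probability, predicted_class)
```

Swapping `sigmoid` for another activation function changes the output range but not the structure: weighted sum, bias, activation.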

<img src='resources/perceptron_binary.png'/>


## Multilayer Perceptrons

<img src='resources/non-linear-meme.webp'/>
<img src='resources/non-linear.png'/>

![fnn](resources/First_network.jpg)

* Each node in the hidden layer works like a single perceptron: it has its own weights and bias, which it applies to the inputs of every row
* Each node computes a weighted sum of its inputs and passes the result through an activation function
* The outputs of the hidden nodes are then combined, with their own weights, into the final output
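
The forward pass described above can be sketched with two matrix multiplications. This is an illustrative example, not the network from the figure: the layer sizes (3 inputs, 5 hidden nodes, 1 output), the random weights, and the choice of sigmoid for both layers are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    """Forward pass through one hidden layer and one output node."""
    hidden = sigmoid(X @ W1 + b1)       # each column of W1 holds one hidden node's weights
    output = sigmoid(hidden @ W2 + b2)  # the output node combines the hidden activations
    return hidden, output

# Made-up data and weights, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))    # 4 rows of data, 3 features each
W1 = rng.normal(size=(3, 5))   # 3 inputs -> 5 hidden nodes
b1 = np.zeros(5)
W2 = rng.normal(size=(5, 1))   # 5 hidden nodes -> 1 output
b2 = np.zeros(1)

hidden, output = forward(X, W1, b1, W2, b2)
print(hidden.shape, output.shape)  # (4, 5) (4, 1)
```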

## Activation Functions

- Sigmoid
- Softmax
- Tanh
- ReLU
- ELU

*In most cases you can use the ReLU activation function (or one of its variants) in the hidden layers. For the output layer, the softmax activation function is generally good for multiclass problems and the sigmoid function for binary classification problems.*
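
For reference, here is one way to write the activation functions listed above in NumPy. These are plain sketches of the standard definitions; the `alpha` parameter of `elu` and the sample input `z` are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    """Maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Maps any real number into (-1, 1)."""
    return np.tanh(z)

def relu(z):
    """Keeps positive values, replaces negative values with 0."""
    return np.maximum(0.0, z)

def elu(z, alpha=1.0):
    """Like ReLU, but smoothly approaches -alpha for very negative inputs."""
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

def softmax(z):
    """Turns a vector of scores into probabilities that sum to 1."""
    shifted = z - np.max(z, axis=-1, keepdims=True)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

z = np.array([-2.0, 0.0, 2.0])
print(relu(z))           # negative inputs are clipped to 0
print(softmax(z).sum())  # the probabilities add up to 1
```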

## Backpropagation

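* Backpropagation is the algorithm used to learn the weights of a **feedforward** neural network
* It uses the chain rule to propagate the output error backwards through the layers, which gives the gradient of the loss with respect to every weight; gradient descent then uses those gradients to update the weights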

<img src='resources/back.png' width=600/>
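
To make this concrete, here is a rough sketch of backpropagation for a network with one hidden layer. Several things are assumptions made for the example: the toy XOR-style data, the sigmoid activations, the squared-error loss, the layer sizes, and the learning rate. The finite-difference check is only a sanity test that the backpropagated gradient is right.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_gradients(X, y, W1, b1, W2, b2):
    """Forward pass, then backward pass via the chain rule."""
    n = X.shape[0]
    # Forward pass: inputs -> hidden layer -> output
    a1 = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(a1 @ W2 + b2)
    loss = 0.5 * np.mean((y_hat - y) ** 2)

    # Backward pass: propagate the error from the output back towards the inputs
    d_z2 = (y_hat - y) * y_hat * (1.0 - y_hat) / n  # error at the output node
    d_z1 = (d_z2 @ W2.T) * a1 * (1.0 - a1)          # error pushed back to the hidden layer
    grads = (X.T @ d_z1, d_z1.sum(axis=0), a1.T @ d_z2, d_z2.sum(axis=0))
    return loss, grads

# Toy data and random starting weights (illustrative values only)
rng = np.random.default_rng(0)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0], [0.0]])
params = [rng.normal(size=(2, 3)), np.zeros(3), rng.normal(size=(3, 1)), np.zeros(1)]

# Sanity check: the backpropagated gradient should match a finite-difference estimate
_, grads = loss_and_gradients(X, y, *params)
eps = 1e-5
bumped_up = [p.copy() for p in params]
bumped_down = [p.copy() for p in params]
bumped_up[0][0, 0] += eps
bumped_down[0][0, 0] -= eps
loss_up, _ = loss_and_gradients(X, y, *bumped_up)
loss_down, _ = loss_and_gradients(X, y, *bumped_down)
gradient_gap = abs((loss_up - loss_down) / (2 * eps) - grads[0][0, 0])

# Gradient descent: repeatedly step every parameter against its gradient
initial_loss, _ = loss_and_gradients(X, y, *params)
learning_rate = 0.5
for _ in range(200):
    _, grads = loss_and_gradients(X, y, *params)
    params = [p - learning_rate * g for p, g in zip(params, grads)]
final_loss, _ = loss_and_gradients(X, y, *params)
print(gradient_gap, initial_loss, final_loss)
```

Backpropagation itself is only the `loss_and_gradients` step; the loop at the end is gradient descent using the gradients it returns.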

### Gradient Descent in Neural Networks
Neural nets are usually trained at scale on large data sets, so optimizing for speed becomes a big concern. Gradient descent can take a very, **very** long time to run if we update the weights and biases after every single training example. Therefore, we usually process the training examples in batches:

- **Batch**: 
In batch gradient descent, we pass all of the training examples through forward propagation before using backpropagation to update the weights and biases

- **Epoch**: 
An epoch is one full pass of all the training examples through the network


##### Types of Gradient Descent
- Stochastic Gradient Descent 

SGD calculates the error and updates the weights after each individual observation in the training set.

- Batch Gradient Descent

Batch gradient descent calculates the error for every observation, but only updates the weights after all of the observations have been processed.

- Mini-Batch Gradient Descent

Mini-batch gradient descent is a compromise between batch and SGD: it splits the training examples into mini-batches, then calculates the error and updates the weights after each mini-batch.
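
The three variants differ only in how many observations go into each weight update. The sketch below uses a plain linear model instead of a neural network to keep the code short; the function name `train_linear_model`, the synthetic data, and the hyperparameters are all made up for the example.

```python
import numpy as np

def train_linear_model(X, y, batch_size, epochs=2, learning_rate=0.05, seed=0):
    """Gradient descent where batch_size controls how often the weights are updated."""
    rng = np.random.default_rng(seed)
    n_rows, n_features = X.shape
    w = np.zeros(n_features)
    n_updates = 0
    for _ in range(epochs):                      # one epoch = one full pass over the data
        order = rng.permutation(n_rows)          # shuffle the rows every epoch
        for start in range(0, n_rows, batch_size):
            batch = order[start:start + batch_size]
            error = X[batch] @ w - y[batch]
            gradient = X[batch].T @ error / len(batch)
            w -= learning_rate * gradient        # one weight update per batch
            n_updates += 1
    return w, n_updates

# Synthetic data: 8 observations, 2 features
rng = np.random.default_rng(42)
X = rng.normal(size=(8, 2))
y = X @ np.array([2.0, -1.0])

_, sgd_updates = train_linear_model(X, y, batch_size=1)         # stochastic
_, batch_updates = train_linear_model(X, y, batch_size=len(X))  # batch
_, mini_updates = train_linear_model(X, y, batch_size=4)        # mini-batch
print(sgd_updates, batch_updates, mini_updates)  # 16 2 4
```

With 8 observations and 2 epochs, SGD makes 16 updates, batch gradient descent makes 2, and mini-batches of 4 make 4.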


### Random Initialization 
Before the first forward pass, we initialize the weights with random values and the biases to zero. We do not initialize the weights to 0, because then every node in a layer would compute the same output and receive the same update, so the weights would stay identical to one another and training would be pointless. We also do not want very large initial weights, because they saturate the activation function: its gradient becomes close to zero, which slows learning down (this mainly applies to sigmoid and tanh).
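
Both problems can be seen in a small NumPy experiment. This is a sketch under assumptions: a one-hidden-layer network with sigmoid activations, a squared-error loss, no biases, and made-up data. It uses a constant starting value of 0.5 to show the symmetry problem, since any constant (zero included) leaves the hidden nodes indistinguishable.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def hidden_weight_gradient(X, y, W1, W2):
    """Gradient of a squared-error loss w.r.t. the hidden-layer weights W1."""
    a1 = sigmoid(X @ W1)
    y_hat = sigmoid(a1 @ W2)
    d_z2 = (y_hat - y) * y_hat * (1.0 - y_hat)
    d_z1 = (d_z2 @ W2.T) * a1 * (1.0 - a1)
    return X.T @ d_z1  # one column per hidden node

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
y = rng.integers(0, 2, size=(6, 1)).astype(float)

# Constant initialization: every hidden node receives exactly the same gradient
grad_const = hidden_weight_gradient(X, y, np.full((2, 3), 0.5), np.full((3, 1), 0.5))
same_for_every_node = bool(
    np.allclose(grad_const[:, 0], grad_const[:, 1])
    and np.allclose(grad_const[:, 1], grad_const[:, 2])
)

# Small random initialization: the hidden nodes receive different gradients
W1_rand = rng.normal(scale=0.1, size=(2, 3))
W2_rand = rng.normal(scale=0.1, size=(3, 1))
grad_rand = hidden_weight_gradient(X, y, W1_rand, W2_rand)
differs_between_nodes = not np.allclose(grad_rand[:, 0], grad_rand[:, 1])

# Saturation: the sigmoid gradient is largest at 0 and nearly vanishes for large inputs
print(same_for_every_node, differs_between_nodes)
print(sigmoid_grad(0.0), sigmoid_grad(10.0))
```

Because nodes that start identical get identical updates, they stay identical after every step, so the hidden layer behaves like a single node.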

## Keras on Google Colab
https://colab.research.google.com/drive/175P-oUi2VRZQx0isy5Yum-WxueYWOQlB?usp=sharing

Keras is a popular choice for deep learning beginners thanks to its user-friendly API. It is built on top of TensorFlow. The four steps to building a neural network model are:
1. Specify the architecture 
    - how many layers?
    - how many nodes in hidden layers?
    - what activation functions should you use?
2. Compile your model 
    - specify the cost function and details about how optimization works
        - learning rate
        - optimizer
        - list of metrics to use
3. Fit your model 
    - run backpropagation and adjust the model weights
4. Make predictions

See the Keras Sequential model guide [here](https://keras.io/getting-started/sequential-model-guide/)
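
The four steps might look like this in code. This is a minimal sketch, not the contents of the Colab notebook linked above: the synthetic data, layer sizes, learning rate, and number of epochs are all assumptions picked to keep the example small, and it assumes TensorFlow 2 is installed.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Synthetic binary-classification data, made up for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)).astype("float32")
y = (X[:, 0] + X[:, 1] > 0).astype("float32")

# 1. Specify the architecture: layers, nodes per layer, activation functions
model = keras.Sequential([
    keras.Input(shape=(4,)),                # 4 input features
    layers.Dense(8, activation="relu"),     # hidden layer with 8 nodes
    layers.Dense(1, activation="sigmoid"),  # one output node for binary classification
])

# 2. Compile: cost function, optimizer (with its learning rate), and metrics
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.01),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

# 3. Fit: backpropagation and weight updates, here in mini-batches of 32
history = model.fit(X, y, epochs=5, batch_size=32, verbose=0)

# 4. Make predictions: one probability per row
probabilities = model.predict(X[:5], verbose=0)
print(probabilities.shape)
```

For a multiclass problem, the output layer would typically use one node per class with a softmax activation and a categorical cross-entropy loss.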

# Resources
* 3Blue1Brown: Backpropagation: https://www.youtube.com/watch?v=Ilg3gGewQ5U&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&index=3
* StatQuest Neural Networks: https://www.youtube.com/watch?v=CqOfi41LfDw