# Learning

## Lecture plan

- Review: **deep neural networks**.
- Conceptual introduction to training neural networks.
   - High-level overview of **back-propagation**.
- Introduction to `pytorch`.
   - A simple classification problem using `torch`.

## Review: what is a *neural network*? 

> A **neural network** is a predictive model loosely inspired by biological neural networks, which *updates weights* based on data to make better predictions.

The most basic neural network has several ingredients:

- An *input layer*, or $X$.
- An *output layer*, or $Y$.
- *Weights* $W$ (analogous to coefficients $\beta$).

### Why do hidden layers work?

Most modern neural networks have **hidden layers**.

> A **hidden layer** adds new *parameters* to a neural network, which expands the space of features (and relationships) the system can learn.

- Similar intuition to the **non-linear kernel trick** for SVMs.
- Also similar to **polynomial regression**.


<img src="img/networks/nn3.png" width="300" alt="Larger model">
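The polynomial regression analogy can be made concrete with a small sketch. It assumes a toy dataset where $y = x^2$ (made up for illustration) and uses NumPy's `polyfit`: a straight line cannot capture the curve, but adding a squared feature can. Hidden layers play a loosely similar role, except that the network learns which extra features to build.

```python
import numpy as np

# Toy data (an assumption for illustration): y depends on x non-linearly.
x = np.linspace(-1.0, 1.0, 21)
y = x ** 2

# Fit a straight line (degree 1) and a quadratic (degree 2).
linear_coefs = np.polyfit(x, y, deg=1)
quadratic_coefs = np.polyfit(x, y, deg=2)

# Mean squared error of each fit on the same data.
mse_linear = np.mean((np.polyval(linear_coefs, x) - y) ** 2)
mse_quadratic = np.mean((np.polyval(quadratic_coefs, x) - y) ** 2)

print("MSE with linear features:   ", mse_linear)
print("MSE with quadratic features:", mse_quadratic)
```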

## On "Learning"

> Neural networks **update their weights** to make better predictions; this process is sometimes called *learning*.

In this section, we'll briefly discuss how neural networks update their weights, which involves a technique called *backpropagation*. 

We'll focus on a simple kind of network called a **feed-forward neural network**.

### "Feed-forward" networks

> A **feed-forward neural network (FFN)** is a neural network with no "cycles". Each unit connects only "forward" to units in the next layer.

The figure below shows a feed-forward model with a single hidden layer. 

<img src="img/networks/ffn.png" width="300" alt="Simple FFN">


#### Getting *predictions* from a feed-forward model

- **Step 1**: Multiply inputs $X$ by weights $W$ (and add a bias term). 
- **Step 2**: Apply some *activation function* to obtain hidden unit activations.
- **Step 3**: Multiply hidden units by weights $U$ (and add a second bias term).
- **Step 4**: Apply the *softmax* function to obtain predictions $\hat{Y}$.

<img src="img/networks/ffn.png" width="300" alt="Simple FFN">


#### Getting predictions: the equations

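First, we obtain the hidden layer activations, where $\sigma$ is the activation function and $b_1$ is a bias vector:

$$h = \sigma(Wx + b_1)$$

Then, we multiply the hidden layer by weights $U$ and add a second bias vector $b_2$:

$$z = Uh + b_2$$

Finally, we apply the *softmax* function to obtain a **probability distribution** over the outputs:

$$\hat{y} = \text{softmax}(z)$$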
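The three equations above can be sketched in a few lines of NumPy. This is a minimal illustration, not a full implementation: the layer sizes, the random weights, and the names `b1` and `b2` (used to keep the two bias vectors distinct) are all assumptions, and the sigmoid is used as the activation function $\sigma$.

```python
import numpy as np

def sigmoid(z):
    """Squash each value into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Turn a vector of scores into a probability distribution."""
    exp_z = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return exp_z / np.sum(exp_z)

def forward(x, W, b1, U, b2):
    h = sigmoid(W @ x + b1)  # hidden layer activations
    z = U @ h + b2           # output scores
    return softmax(z)        # predicted probabilities

# Hypothetical sizes: 4 inputs, 5 hidden units, 3 output classes.
rng = np.random.default_rng(0)
n_inputs, n_hidden, n_classes = 4, 5, 3
W, b1 = rng.normal(size=(n_hidden, n_inputs)), np.zeros(n_hidden)
U, b2 = rng.normal(size=(n_classes, n_hidden)), np.zeros(n_classes)

x = rng.normal(size=n_inputs)
y_hat = forward(x, W, b1, U, b2)
print("Predicted probabilities:", y_hat)
print("Sum of probabilities:", y_hat.sum())
```

Because of the softmax step, the outputs are non-negative and sum to 1, whatever the weights happen to be.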

#### Different activation functions

> An **activation function** passes each hidden unit's weighted input (e.g., $Wx$ plus a bias) through some (typically non-linear) function; the result is that unit's "activation". **Non-linear activation functions** are important for making neural networks more powerful.

- *Sigmoid* activation function: 

$$ g(z) \ = \ \dfrac{1}{1 + e^{-z}}$$

- *Rectified linear unit* function:

$$ g(z) \ = \ (z)_+ \ = \ \max(0, z) $$
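As a quick sketch, both functions are one-liners in NumPy. The input values below are arbitrary and chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid: 1 / (1 + e^(-z)), always between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Rectified linear unit: keep positive values, replace negatives with 0."""
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])  # arbitrary example inputs
print("sigmoid:", sigmoid(z))
print("relu:   ", relu(z))
```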

#### Illustration of activation functions

- Both the *sigmoid* and *ReLU* activation functions are non-linear.
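- These days, ReLU is more common because it is **cheaper to compute** and its gradient does not shrink for positive inputs, which tends to make deep networks easier to train and leads to **better performance**. 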

<img src="img/networks/nn2.png" width="400" alt="Activation function">


#### Check-in: random weights

When we **train** a neural network, we typically start with *random weights*. What does that mean about our predictions?


### Your answer here

### Where do the weights come from?

- When we **train** a neural network, we typically start with *random weights*.
- This means that at first, our predictions will be very wrong.
- However, we can **adjust** those weights iteratively until we get better and better predictions.

> Analogy: turning the knob in a shower until you reach the desired temperature.
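To see why untrained predictions are uninformative, here is a small sketch that runs made-up inputs through a network with random weights. Everything here is an assumption for illustration: the layer sizes, the small weight scale (0.01), the omission of bias terms, and the convention of storing one example per row (so the product is written `X @ W` rather than $Wx$).

```python
import numpy as np

rng = np.random.default_rng(0)
n_examples, n_inputs, n_hidden, n_classes = 300, 4, 5, 3

# Made-up inputs and small random weights (no training has happened yet).
X = rng.normal(size=(n_examples, n_inputs))
W = 0.01 * rng.normal(size=(n_inputs, n_hidden))
U = 0.01 * rng.normal(size=(n_hidden, n_classes))

H = 1.0 / (1.0 + np.exp(-(X @ W)))            # sigmoid hidden activations
Z = H @ U                                     # output scores
exp_Z = np.exp(Z - Z.max(axis=1, keepdims=True))
probs = exp_Z / exp_Z.sum(axis=1, keepdims=True)  # softmax, one row per example

print("First three predictions:")
print(probs[:3])
```

With weights this small, every example gets roughly the same probability for each class (about 1/3 here), so the network is effectively guessing until training adjusts the weights.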


### Back-propagation, briefly explained

> **Backpropagation** is a technique for *propagating* the error signal from the *output layer* back through the network to update the weights at each layer.

- **Forward pass**: generate predictions based on input $X$.
- **Compute error**: compare prediction to actual value(s).
- **Backward pass**: propagate the error signal back through the network to work out how each weight should change.

<img src="img/networks/backprop.png" width="400" alt="Backpropagation">
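The three steps above can be sketched by hand for a tiny network. This is a minimal illustration under several assumptions: a sigmoid hidden layer, a softmax output, a cross-entropy loss (one common choice, not necessarily the one used later in the course), made-up sizes and data, and bias vectors named `b1` and `b2`. The last few lines check one hand-computed gradient against a numerical estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

def forward(x, W, b1, U, b2):
    """Forward pass: return hidden activations and predicted probabilities."""
    h = sigmoid(W @ x + b1)
    y_hat = softmax(U @ h + b2)
    return h, y_hat

def loss(y_hat, y):
    """Cross-entropy between predictions and a one-hot label (assumed loss)."""
    return -np.sum(y * np.log(y_hat))

def backward(x, y, h, y_hat, U):
    """Backward pass: apply the chain rule layer by layer, output to input."""
    d_z = y_hat - y               # error signal at the output scores
    grad_U = np.outer(d_z, h)     # gradient for output weights
    grad_b2 = d_z
    d_h = U.T @ d_z               # error signal passed back to the hidden layer
    d_a = d_h * h * (1.0 - h)     # through the sigmoid (its derivative is h(1-h))
    grad_W = np.outer(d_a, x)     # gradient for input weights
    grad_b1 = d_a
    return grad_W, grad_b1, grad_U, grad_b2

# Made-up example: 4 inputs, 5 hidden units, 3 classes, true class is index 1.
rng = np.random.default_rng(0)
x = rng.normal(size=4)
y = np.array([0.0, 1.0, 0.0])
W, b1 = rng.normal(size=(5, 4)), np.zeros(5)
U, b2 = rng.normal(size=(3, 5)), np.zeros(3)

h, y_hat = forward(x, W, b1, U, b2)
grad_W, grad_b1, grad_U, grad_b2 = backward(x, y, h, y_hat, U)

# Sanity check: nudge one weight and measure how the loss changes.
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 0] += eps
W_minus[0, 0] -= eps
numeric = (loss(forward(x, W_plus, b1, U, b2)[1], y)
           - loss(forward(x, W_minus, b1, U, b2)[1], y)) / (2 * eps)
print("Backprop gradient: ", grad_W[0, 0])
print("Numerical gradient:", numeric)
```

In practice, libraries compute these gradients automatically; the hand-written version is only meant to show what "propagating the error back" involves.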


#### Calculating the error

- First, the **error** (or **loss**) is calculated by comparing the network's *prediction* to the actual value.
- Conceptually, this is very similar to mean squared error (MSE) and related measures!

> Given some initial parameters $\theta$, we can calculate the **error** by defining some loss function $J(\theta)$.
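As a sketch, here are two common loss functions in NumPy: mean squared error for numeric predictions, and cross-entropy for predicted class probabilities. The numbers are made up for illustration, and cross-entropy is included as an assumption about what a classification network would typically use.

```python
import numpy as np

def mse(y_pred, y_true):
    """Mean squared error: average squared gap between prediction and truth."""
    return np.mean((y_pred - y_true) ** 2)

def cross_entropy(probs, labels):
    """Average negative log-probability assigned to the correct class."""
    correct_class_probs = probs[np.arange(len(labels)), labels]
    return -np.mean(np.log(correct_class_probs))

# Made-up regression example.
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])
regression_loss = mse(y_pred, y_true)

# Made-up classification example: two items, three classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])
classification_loss = cross_entropy(probs, labels)

print("MSE:", regression_loss)
print("Cross-entropy:", classification_loss)
```

In both cases, a lower value means the predictions are closer to the actual values.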

#### Updating the parameters 

> **Gradient descent** is used to iteratively update the weights such that our **cost** $J(\theta)$ is *minimized*.

- We want to find the value(s) of our weight(s) $\theta$ that minimize our cost.
- Analogy: "rolling down a hill" to find the lowest point (least error).
  - The **learning rate** determines how far we roll at each step!

<img src="img/networks/gradient_descent.png" width="400" alt="Gradient descent">
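Here is a minimal sketch of gradient descent on a single parameter. The cost function $J(\theta) = (\theta - 3)^2$ is a toy example chosen for illustration (its minimum is at $\theta = 3$), and the learning rates and step counts are arbitrary.

```python
def J(theta):
    """Toy cost function with its minimum at theta = 3."""
    return (theta - 3.0) ** 2

def grad_J(theta):
    """Derivative of J with respect to theta."""
    return 2.0 * (theta - 3.0)

def gradient_descent(theta, learning_rate, n_steps):
    """Repeatedly step 'downhill', against the gradient."""
    for _ in range(n_steps):
        theta = theta - learning_rate * grad_J(theta)
    return theta

# A modest learning rate settles near the minimum.
theta_good = gradient_descent(theta=0.0, learning_rate=0.1, n_steps=100)

# A learning rate that is too large overshoots further on every step.
theta_bad = gradient_descent(theta=0.0, learning_rate=1.1, n_steps=20)

print("learning_rate=0.1:", theta_good, "cost:", J(theta_good))
print("learning_rate=1.1:", theta_bad, "cost:", J(theta_bad))
```

The second run shows why the learning rate matters: steps that are too big bounce back and forth across the valley and move away from the minimum.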


#### Training gets complicated

> In many cases, our optimization problem is **non-convex**, so it's more challenging to find a *global minimum*.

- Analogy: rolling down a hill to reach the bottom, but you get stuck in a crevasse. 
- In these cases, researchers use techniques like **stochastic gradient descent**, whose noisy updates (each based on a random subset of the data) can help the search escape poor local minima.

<img src="img/networks/optimization.webp" width="400" alt="Optimization problem">
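The "stuck in a crevasse" problem can be sketched with a toy non-convex function. The function $f(x) = x^4 - 3x^2 + x$ is an assumption chosen for illustration: it has two valleys, one deeper than the other. Plain gradient descent ends up in whichever valley is downhill from where it starts.

```python
def f(x):
    """Toy non-convex function with two valleys of different depth."""
    return x ** 4 - 3 * x ** 2 + x

def grad_f(x):
    """Derivative of f."""
    return 4 * x ** 3 - 6 * x + 1

def gradient_descent(x, learning_rate=0.01, n_steps=1000):
    for _ in range(n_steps):
        x = x - learning_rate * grad_f(x)
    return x

# Same algorithm, two different starting points.
x_left = gradient_descent(-2.0)
x_right = gradient_descent(2.0)

print("Start at -2.0: x =", x_left, "f(x) =", f(x_left))
print("Start at +2.0: x =", x_right, "f(x) =", f(x_right))
```

Both runs stop where the gradient is zero, but only the run starting on the left finds the deeper valley; the other settles in a local minimum.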


## Lecture wrap-up

- Neural networks are predictive models consisting of multiple **layers**, with **weights** connecting those layers.
- **Training** a neural network consists of updating those weights (**parameters**), based on training data.
- In general, weights are updated to *minimize prediction error*; the standard method for doing this is **gradient descent**, with **backpropagation** supplying the gradients.

> Okay, so how can we actually *use* these systems?