# Notes from the 3Blue1Brown YouTube series on neural networks


## References



- [Grant Sanderson's first video in the Neural Net series](https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi)
- [Steve Brunton's...]
- [Michael Nielsen's book on Deep Learning and Neural Nets](http://neuralnetworksanddeeplearning.com/)
    - [Michael Nielsen's MNIST OCR walkthrough](https://github.com/mnielsen/neural-networks-and-deep-learning)
- [Gil Strang's course reference at MIT](https://ocw.mit.edu/courses/18-065-matrix-methods-in-data-analysis-signal-processing-and-machine-learning-spring-2018/)
    - [Gil Strang: First in a CNN lecture series](https://www.youtube.com/watch?v=sx00s7nYmRM&pp=ygUaZ2lsIHN0cmFuZyBuZXVyYWwgbmV0d29ya3M%3D)
- [Chris Olah's blog](http://colah.github.io/)
- [Articles at *distill*](https://distill.pub/)
- [Samson Zhang "ground up from scratch no PyTorch" neural net build (MNIST revisited) video](https://www.youtube.com/watch?v=w8yWXqWQYmU)


## Fundamentals


**Machine Learning** uses data, rather than hand-written rules, to determine how a model behaves.


## Video 1: But what is a neural network?


- The 28 x 28 pixel grid gives values called *activations*
    - This is stacked to give the activation layer
    - We end up not caring how the pixels are sorted into this layer
        - This mapping intentionally does not try to retain the spatial information of the grid
- The activation value for each pixel is real on [0, 1]
- The second layer happens to have 16 neurons; 784 x 16 weights
- There are also 16 bias values, one for each layer-2 neuron
    - Bias acts like a threshold: It sets a bar to clear
- The result of the weighted sum plus bias is unconstrained...
    - ...but we are interested in something constrained to be on (say) [0, 1]
    - This is precisely analogous to a compressor in audio signal processing
    - This compression was originally done with a sigmoid function, as in:


$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$


More recently this has been replaced by the Rectified Linear Unit function: $\mathrm{ReLU}(a) = \max(0, a)$.
Pronounced "Ray - Loo".
Gil Strang refers to this in one of his lectures. I will continue to indicate compression
as $\sigma(x)$.
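
To make the two functions concrete, here is a minimal sketch in plain Python. The function names `sigmoid` and `relu` are my own choices, not something from the video.

```python
import math

def sigmoid(x: float) -> float:
    """Squash a real number into the open interval (0, 1)."""
    # Fine for moderate x; very large negative x would overflow math.exp.
    return 1.0 / (1.0 + math.exp(-x))

def relu(a: float) -> float:
    """Pass positive values through unchanged; clamp negatives to 0."""
    return max(0.0, a)

# Compare the two on a negative, a zero, and a positive input.
for value in (-4.0, 0.0, 4.0):
    print(value, round(sigmoid(value), 3), relu(value))
```

Note that ReLU is unbounded above, so unlike the sigmoid it does not confine the result to [0, 1].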

- number of weights = 784 * 16 + 16 * 16 + 16 * 10 = 12,960
- number of biases = 16 + 16 + 10 = 42


13,002 total parameters (about 13k)
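
The arithmetic can be checked with a few lines of Python. The list `layer_sizes` is just my way of writing down the 784-16-16-10 layout described above.

```python
# Layer widths: input pixels, two hidden layers, ten output digits.
layer_sizes = [784, 16, 16, 10]

# One weight for every (input neuron, output neuron) pair between adjacent layers.
weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# One bias for every neuron that is not in the input layer.
biases = sum(layer_sizes[1:])

total = weights + biases
print(weights, biases, total)  # 12960 42 13002
```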


Exhortation: Dig into ***why?*** Challenge your assumptions.


- Activations are an $n \times 1$ column vector $A$ (here $n = 784$)
- Weights are an $m \times n$ matrix $W$ (here $m = 16$)
- Bias is an $m \times 1$ column vector $B$
- Second layer of neurons is an $m \times 1$ column vector $N$


$$
N = \sigma( W \cdot A + B )
$$
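
A minimal sketch of this one-layer computation, using plain Python lists in place of matrices. The helper name `layer_forward` and the random stand-in values are assumptions for illustration; a real network would load actual pixel data and learned weights.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(W, A, B):
    """Compute N = sigma(W . A + B); W is a list of m rows, each with n weights."""
    return [sigmoid(sum(w * a for w, a in zip(row, A)) + b)
            for row, b in zip(W, B)]

random.seed(0)
n, m = 784, 16
A = [random.random() for _ in range(n)]                            # stand-in pixel activations
W = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(m)]  # random starting weights
B = [random.uniform(-1, 1) for _ in range(m)]                      # random starting biases

N = layer_forward(W, A, B)
print(len(N))  # 16: one activation per second-layer neuron
```

Repeating the same call with the next layer's weights and biases carries the activations forward to the 10 output neurons.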



## Video 2: Gradient descent and how neural networks learn




- Introduces a training dataset: With answers!
- MNIST database is freely available
- Now we start in on *the calculus exercise*
- Initial: Random values


Now we are introduced to a cost function: 


- Go through the (presumably wrong) values in the output vector $a_i$.
- For each: Take the difference from the corresponding correct value; for a handwritten 4 the correct vector is $0, 0, 0, 0, 1, 0, 0, 0, 0, 0$
- Square this
- Add them up
- The higher the cost value, the worse this set of weights and biases performed
- This is a result for but one example
    - This produces a single number from the combination of 784 pixel values and 13k weights and biases


Consider the average cost over the entire training set.
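
The recipe above (difference, square, add up, then average over examples) can be sketched as follows. The function names and the two sample output vectors are made up for illustration.

```python
def one_hot(digit, size=10):
    """Correct answer as a vector: 1.0 at the digit's index, 0.0 elsewhere."""
    return [1.0 if i == digit else 0.0 for i in range(size)]

def example_cost(output, label):
    """Sum of squared differences between the output and the correct vector."""
    target = one_hot(label)
    return sum((o - t) ** 2 for o, t in zip(output, target))

def average_cost(outputs, labels):
    """Mean of the per-example costs over a set of examples."""
    costs = [example_cost(o, l) for o, l in zip(outputs, labels)]
    return sum(costs) / len(costs)

confident_right = one_hot(4)  # hypothetical perfect output for a 4
unsure = [0.1] * 10           # hypothetical output that commits to nothing

print(example_cost(confident_right, 4))  # 0.0
print(example_cost(unsure, 4))           # roughly 0.9
print(average_cost([confident_right, unsure], [4, 4]))
```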


Now drop into single-variable calculus thinking: Use the local slope to determine a step to take
in searching for a minimum. Steeper slope: Bigger step. Small slope: Small step (avoid overshoot).
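
Here is a minimal single-variable sketch. The cost function is a toy parabola I chose so the minimum (at $w = 3$) is known in advance; `learning_rate` is a hypothetical scaling factor for the step.

```python
def cost(w):
    """Toy cost with its minimum at w = 3."""
    return (w - 3.0) ** 2 + 1.0

def slope(w):
    """Derivative of the toy cost."""
    return 2.0 * (w - 3.0)

w = 0.0
learning_rate = 0.1
for step in range(50):
    # Step size is proportional to the slope: steep means big step, flat means small step.
    w -= learning_rate * slope(w)

print(round(w, 3), round(cost(w), 3))  # w ends up close to 3
```

Because the step scales with the slope, the steps shrink automatically as the minimum approaches, which is what limits overshoot.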


Now move to cost as a surface above two variables (up from one). Now we are taking gradients.


So what is moving to calculate the local surface gradient? Those two variables; suppose they are
the first two weights $w_1$ and $w_2$. Now we have the steepest direction; and we use the negative
of that (as it points *upward*) to descend to some minimum. But of course there is the danger 
that it is a *local* minimum.


And now of course go from 2 dimensions to 13,002: one dimension for each weight and bias (the inputs to the cost function, not the 784 pixels).


$- \nabla C(\vec{w})$
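
The same idea in two variables, as a rough sketch. The bowl-shaped `cost` is a made-up stand-in for the real cost surface, and the gradient is estimated by finite differences rather than by the method used in the videos.

```python
def cost(w1, w2):
    """Toy cost surface with its minimum at (1, -2)."""
    return (w1 - 1.0) ** 2 + 2.0 * (w2 + 2.0) ** 2

def numerical_gradient(f, w, h=1e-6):
    """Estimate each partial derivative of f at w by central differences."""
    grad = []
    for i in range(len(w)):
        up = list(w)
        down = list(w)
        up[i] += h
        down[i] -= h
        grad.append((f(*up) - f(*down)) / (2 * h))
    return grad

w = [4.0, 3.0]
rate = 0.1
for _ in range(200):
    g = numerical_gradient(cost, w)
    # Move against the gradient, since the gradient itself points uphill.
    w = [wi - rate * gi for wi, gi in zip(w, g)]

print([round(wi, 3) for wi in w])
```

Nothing in the loop depends on there being only two variables; the list `w` could just as well hold 13,002 entries.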


The algorithm for calculating this gradient is called *backpropagation*: Next video.



All we mean when we say a network is learning is that it is minimizing its cost function. 



So let's keep in mind: 


- We have multiple layers of neurons, weights, biases to start with
- These represent a progression from 784 inputs to one result
- And we have training data including *correct* answers
- So we can determine cost function values across the entire training dataset
- And this cost function $C$ has a mean value
- And somehow we can calculate the gradient of $C$: A 13,002 element vector
    - Whose elements indicate changes to apply; by both sign and magnitude
    - ...so that we can start over with new weights; and iterate


Then there is **the test**: Score a version of the network on data it has never seen before.
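
One simple way to score the network, sketched under the assumption that the prediction is the index of the largest output activation. The names `predict` and `accuracy` and the tiny sample outputs are hypothetical.

```python
def predict(output):
    """The predicted digit is the index of the largest output activation."""
    return max(range(len(output)), key=lambda i: output[i])

def accuracy(outputs, labels):
    """Fraction of examples whose predicted digit matches the correct label."""
    correct = sum(1 for o, l in zip(outputs, labels) if predict(o) == l)
    return correct / len(labels)

# Two made-up output vectors (shortened to 3 classes) with their correct labels.
held_out_outputs = [[0.1, 0.8, 0.3], [0.7, 0.2, 0.4]]
held_out_labels = [1, 2]
print(accuracy(held_out_outputs, held_out_labels))  # 0.5
```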



### Does the network behave understandably?



- No. 
- The weights from the 784 layer to the 16 layer (when viewed as 28 x 28 images) are just random-looking...
- ...all 16 of them...
- ...and will classify random junk as a particular digit with high confidence
- ...and has no mechanism to actually *draw* an archetype 3


Grant observes that there is no mechanism in the system for uncertainty. The target vectors
used in training, for example, are always certain: 9 zeros and a single 1.


The video concludes with remarks on "memorizing the dataset" by means of all these parameters; 
what I suspect is called over-fitting.


## Video 3: Backpropagation: Plausible story


- Backpropagation as a compelling story without notation or calculus
- Consider weights and activations in the penultimate layer
    - Proceeding from the weighted sum: Converse concept:
        - "Modify Weights where Activations are high for impact!"
        - "Modify Activations where Weights are high for impact!"
    - Complexity concept:
        - We are considering 10 (not just the correct +1 neuron) end activations
        - Multiplexing means individual cases are given a vote, not final say


We want to find a minimum, hence gradient descent. 


## Video 4: Backpropagation: Calculus basis


"What is SGD?"


- Backpropagation as calculus
- Principal idea is the multivariate calculus chain rule
- A given end-neuron (e.g. answer = 3) is impacted by an activation (from the prior neuron layer), a weight and a bias
- The idea is to get the gradient of the cost function in terms of chain rule partial derivatives
- Back-prop refers to chaining backwards up the network until you arrive at the stimulus activation layer. These cannot be modified.
- Also we can think in terms of modifying other feeder neuron activation layers...
    - ...but this is actually expressed in terms of -- in turn -- that neuron's inbound weight and bias
- The other dial here: How many training inputs feed each gradient step
    - We calculate the cost function gradient for one input case
    - ...but in fact we do some sort of average over many training inputs
    - ...but not every single one as this becomes computationally prohibitive
    - ...so we resort to random samples creating cohorts
    - ...hence stochastic
    - ...hence stochastic gradient descent
    - ...hence SGD 
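
A minimal sketch of both ideas, assuming a single sigmoid neuron with one weight and one bias (far smaller than the network in the videos) and made-up training data. The chain rule supplies the gradient for one example; averaging over small random batches makes the descent stochastic. All names here (`gradients`, `average_cost`, `rate`, `batch_size`) are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradients(w, b, x, y):
    """Chain rule for C = (a - y)^2 with a = sigmoid(z) and z = w * x + b."""
    a = sigmoid(w * x + b)
    dC_da = 2.0 * (a - y)   # how the cost changes with the activation
    da_dz = a * (1.0 - a)   # derivative of the sigmoid
    dz_dw = x               # the weight's influence scales with the incoming activation
    dz_db = 1.0
    return dC_da * da_dz * dz_dw, dC_da * da_dz * dz_db

def average_cost(w, b, data):
    return sum((sigmoid(w * x + b) - y) ** 2 for x, y in data) / len(data)

random.seed(1)
xs = [random.random() for _ in range(200)]
data = [(x, 1.0 if x > 0.5 else 0.0) for x in xs]  # toy rule: label is 1 when x > 0.5

w, b = 0.0, 0.0
rate, batch_size = 1.0, 10
start_cost = average_cost(w, b, data)
for epoch in range(50):
    random.shuffle(data)                      # random cohorts: the "stochastic" part
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        grads = [gradients(w, b, x, y) for x, y in batch]
        w -= rate * sum(g[0] for g in grads) / len(batch)
        b -= rate * sum(g[1] for g in grads) / len(batch)
end_cost = average_cost(w, b, data)
print(start_cost, end_cost)
```

Each update uses only 10 examples, so every step is a noisy estimate of the true gradient, but it costs far less than a pass over the whole dataset.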
 

## Video 5: GPT


* Transformers are the next key idea
    * Consist of alternating **Attention Blocks** and **Perceptron Blocks**
    * Predict the next word by generating a pdf over a set of words
* First we need encoding as tokens (words or fragments)
* Then we need embedding: Each token is assigned a vector value
    * The vector space is "defined" by the model
    * Similar words (I use words instead of tokens as roughly equivalent)...
        * ...wind up with roughly aligned vector values
    * The processing step is now to iterate through Attention and Perceptron blocks
        * The Attention block allows the series of vectors to interact with one another
            * ...hence the vector elements are modified
        * The Perceptron blocks operate on the vectors in parallel
            * ...so no interactivity in this step
            * "Multi-layer Perceptron" or equivalently "Feed Forward Layer"
            * What this does is, for now, a mystery
                * ...but it is certainly more linear algebra
        * There is also some normalization going on, in passing
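
To illustrate what "roughly aligned" means for embedding vectors, here is a small sketch using cosine similarity, which is one common way to measure alignment (my choice, not something stated above). The three-dimensional vectors are invented for the example; real embeddings are learned and have thousands of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Near 1.0 when two vectors point the same way; near 0 when unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy embeddings, for illustration only.
embedding = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.9, 0.15],
    "bridge": [0.1, -0.2, 0.95],
}

print(cosine_similarity(embedding["cat"], embedding["kitten"]))  # close to 1
print(cosine_similarity(embedding["cat"], embedding["bridge"]))  # close to 0
```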
     
The end result is understood as the now-thoroughly-modified last vector in
the token sequence. And this modified vector is then used to generate the pdf,
a distribution of probabilities for a set of possible "next words".
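
These notes do not say how a vector of scores becomes a distribution. One standard way is the softmax function, sketched below as an assumption. The candidate words and their scores (`logits`) are invented for the example.

```python
import math

def softmax(logits):
    """Turn arbitrary real scores into probabilities that sum to 1."""
    shift = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

candidates = ["mat", "moon", "carburetor"]  # hypothetical "next word" options
logits = [2.0, 1.0, -1.0]                   # hypothetical scores from the last vector
probs = softmax(logits)

for word, p in zip(candidates, probs):
    print(word, round(p, 3))
```

Higher scores get higher probabilities, but every candidate keeps a nonzero chance, which is what allows sampling a varied "next word".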