[This Notebook on the web](https://github.com/robfatland/othermathclub/blob/master/neuralnetworks.ipynb)


[Questions](#Questions)


# Understanding neural networks and artificial intelligence models


This would naturally be broken up into multiple `.ipynb` files at some point.


## Part 1 Reference Resource and Redux

### Learning resources: Understanding neural networks and AI models today

- [Grant Sanderson's Neural Network series at 3Blue1Brown (YouTube)](https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi)
    - A series of 7+ "20 minute" videos; unsurpassed presentation style; requires frequent "pause and ponder" not to mention rewind/re-watch
- [Steve Brunton's Machine Learning Primer](https://www.youtube.com/watch?v=Vx2DpMgplEM)
- [Michael Nielsen's book on Deep Learning and Neural Nets](http://neuralnetworksanddeeplearning.com/)
    - [Chapter 1 in particular gets right to it](http://neuralnetworksanddeeplearning.com/chap1.html)
    - [M. Nielsen's repo for the MNIST OCR walkthrough](https://github.com/mnielsen/neural-networks-and-deep-learning)
- [Gil Strang's course reference at MIT](https://ocw.mit.edu/courses/18-065-matrix-methods-in-data-analysis-signal-processing-and-machine-learning-spring-2018/)
    - [Gil Strang: First in a CNN lecture series](https://www.youtube.com/watch?v=sx00s7nYmRM&pp=ygUaZ2lsIHN0cmFuZyBuZXVyYWwgbmV0d29ya3M%3D)
- [Chris Olah's blog](http://colah.github.io/)
- [Andrej Karpathy: Build GPT from scratch](https://www.youtube.com/watch?v=kCc8FmEb1nY&t=0s)
- [Articles at *distill*](https://distill.pub/)
- [Samson Zhang: Build an MNIST neural net from scratch (NumPy not PyTorch)](https://www.youtube.com/watch?v=w8yWXqWQYmU)
    - Avoids higher abstractions like PyTorch


### Looking back: How did we get here? (History) 


- [Historical background on machine learning / neural nets](https://www.youtube.com/watch?v=1il-s4mgNdI&t=0s)
- [AlexNet as an important historical milestone](https://www.youtube.com/watch?v=UZDiGooFs54)


### Data


#### Particular to image classification CNNs


- [ImageNet](https://en.wikipedia.org/wiki/ImageNet) > [Download site](https://www.image-net.org/download.php)
- [CIFAR-10/100](https://en.wikipedia.org/wiki/CIFAR-10) > [Download site](https://www.cs.toronto.edu/~kriz/cifar.html)
- [MNIST](https://en.wikipedia.org/wiki/MNIST_database) > [Download site](https://yann.lecun.com/exdb/mnist/)



### Fundamentals


**Machine Learning** uses data, rather than hand-written rules, to determine how a model behaves.


The *lesson* from the past couple decades: Scale alone gives huge improvement
in model behavior.


Model development involves a *training* phase where parameters (primarily weights
used in weighted sums) are established. When the model is ready for use we want it to
produce a response to an input. This process is *inference*.


Stochastic gradient descent (SGD) makes use of the first derivative only (in the Taylor
series sense) and takes comparatively many small steps. By virtue of the *Stochastic*,
each step is not precise; the path is more of a drunkard's walk, the randomness being
an attempt to speed things up even further. By comparison, Newton's method requires both
the gradient (first derivative; the Jacobian of a scalar cost) and the Hessian (second
derivative) matrix, and consequently takes larger and more precise steps towards a
minimum of the cost function. However there is also the risk of starting out in the
wrong place and winding up on a runaway track to infinity.
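
To make the contrast concrete, here is a minimal sketch on a made-up one-variable cost $C(w) = (w - 3)^2 + 1$ (my own toy example, not taken from any of the references). It uses plain gradient descent, without the stochastic part, and a learning rate `s` that I picked arbitrarily.

```python
# Toy cost C(w) = (w - 3)^2 + 1, chosen for illustration only.
def grad(w):
    return 2.0 * (w - 3.0)      # first derivative

def hess(w):
    return 2.0                  # second derivative (constant for a quadratic)

# Gradient descent: many small steps scaled by a learning rate s.
w_gd, s, gd_steps = 0.0, 0.1, 0
while abs(grad(w_gd)) > 1e-6:
    w_gd -= s * grad(w_gd)
    gd_steps += 1

# Newton's method: divide the slope by the curvature; one step suffices here.
w_newton = 0.0
w_newton -= grad(w_newton) / hess(w_newton)

print(gd_steps, round(w_gd, 6), w_newton)
```

On a true quadratic Newton lands on the minimum in a single step; that is a best case, and says nothing about the runaway behavior mentioned above on less friendly cost functions.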


Contrast **FNN** with **RNN**: Feedforward neural networks only move information in 
the forward direction. They are trained using the (incongruously named) backpropagation
method. Recurrent neural networks, in contrast, feature information loops or what we
could term time-dependent feedback mechanisms, cf LSTM.


How parameters (and hyperparameters?) factor into the game...


- Learning rate $s$ moderates step size in gradient descent (but I think not for Newton?)
- Bias $b$ raises the bar of activation. You have to really *want* it
- Softmax temperature $T$: raising it drives up the probabilities of less likely choices relative to the most likely one
    - Look for cases where an API might arbitrarily cap $T$ so the LLM avoids looking bad


Attention heads (96 in the case study) plus *feedforward* multi-layer perceptrons (MLPs) comprise one
iteration of a transformer. 


### Questions


Definition: AI is Artificial Intelligence writ large; it includes ML and so on. gen-AI (generative AI)
sits somewhere inside the AI space. The distinction is usually obvious from context, but let's take care
not to *mean* generative AI while leaving the reader to infer broad AI, and vice versa. Like it or not
we are going to have confusion.


Input from the naïf... and I will add some fragmentary 'response thinking' in **bold font**.


- Help me understand the dimensionality of the space that AI and particularly models address
    - Tell me about the "open-ness" axis of this space
        - What is awesomely cool about OLMo? Is that a fair premise?
    - Tell me about the size of various models
        - How to evaluate "this will run on my laptop" vs "needs a big cloud server" etcetera
        - How do I understand the scale of inference in relation to training and parameter count?
    - Tell me about pipelines. What is a pipeline? Why do I need a pipeline?
        - Tell me why I need to understand `langchain`. What is it? What does it do?
    - Tell me about hyperparameters: Again what do I need to know? What is beyond my concern?
        - Learning phase versus Inference phase
        - Temperature
        - Learning rate
        - Bias
    - I need to create a model and somebody told me to use PyTorch
        - Should I? I mean... what is it? what does it do for me?
        - Is PyTorch a slow abstraction compared to something like CUDA?
- Help me understand tokenization and embedding
    - ...is tokenizing mysterious and arcane? Or easy to implement (and how?)
    - ...is there something useful I can do in just the embedding space?
        - i.e. without resorting to all this chat bot nonsense?
- Same thing for Adversarial Networks (GAN too)
- I have a very specific language-based task in mind...
    - i.e. not desultory 'chat chat chat'... but LLM-centric, gen-AI applied to a very focused task
    - RAG or fine tune maybe? What are the tradeoffs? Which do I try first?
    - Are there other accessible approaches to specializing?
- What is this "Foundation Model" concept promoted by AWS?
- Help me understand prompt engineering...
   - specifically different approaches to prompting to get to what I want
   - ...I want to know what "Zero shot" means, and jargon like that
- Why is there so much hype around gen-AI...
    - ...when I can easily stump OpenAI: It can't perform a trivial task
        - ...for example "Write a version of Goldilocks in the style of Ladle Rat Rotten Hut"
- I understand containerization (a bit). Is it an important tool in AI applications?
    - If so: What sort of learning journey do I need to go on to make use of it?
- What is a diffusion model? Or *stable* diffusion?
    - **thermodynamics-inspired image generation arXiv:2209.04747**
    - unconditioned versus conditioned diffusion models: distinction?
    - (wringing hands) At what point is the jargon useful?
    - Same for transformers... there seems to be both the concept and the library...
    - Same for attention heads, stochastic gradient descent, ...etcetera...
    - The underlying question here: Is there a strategy for coping with the profusion of jargon?
- Research focused: "How do I get started applying AI methods to my data?"
    - Suppose I am using gen-AI to write code...
        - Can I also get it to compile and run the code and check the results?
    - Suppose I have data that can be stored in tabular form, say in CSV files
        - How do I get started on applying AI methods to this data?


## Part 2: 3Blue1Brown Video Series


Probably as good a place to start as any, or better: Grant Sanderson's series
explains neural networks and, in later videos, the basic plan of large language
models. He works specifically from the example of GPT-3 and includes a
careful accounting of how this LLM came to have 175 billion parameters.


These notes are in a *student format*: Interpreting the narrative in my own
words.


### 3Blue1Brown Video 1: But what is a neural network?


We have a number of digital greyscale images, 28 x 28 pixels, each pixel 
having a value on [0, 1]; and the assumption is that each picture is 
of precisely one base-10 digit { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 }. 


- We will conceptually build, train and test a neural network to interpret these images
    - We have, somewhere, the "correct answer" recorded
- The 28 x 28 pixel grid gives 784 neurons
    - Pixel/neuron values have an *activation* level, a real value on [0, 1]
    - The linear stacking of these pixels is the *input layer* (the first layer of activations)
        - Don't care how the pixels are ordered in this layer
        - That is: The list (or vertical stacking) of pixels does not retain the spatial information of the grid
- A second layer has 16 neurons
    - Connecting the source information to the second layer requires 784 x 16 weights
- Third layer: 16 more neurons
- Fourth layer: 10 'digit choice' output neurons
    - For a given picture input we want all of these neurons at zero except the correct answer, at 1
- There are also 16 bias values for the second layer: One for each layer-2 neuron
    - Bias acts like a threshold: It sets a bar to clear
    - Und so weiter biases for the remaining layers
    - There is no bias for layer 1
- The result of weighted sum plus bias is unconstrained...
    - ...but we want a value on (say) [0, 1]
    - Analogous to a compressor in audio signal processing
    - Compression was originally done with a sigmoid function


$
\begin{align}
{
\sigma(x) = \frac{1}{1 + e^{-x}}
}
\end{align}
$


More recently compression is via a Rectified Linear Unit function: $\mathrm{ReLU}(a) = \max(0, a)$.
Pronounced "Ray - Loo". I will continue to indicate compression as $\sigma(x)$.
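
As a quick check on the two functions above, here is a small sketch in plain Python. The function names `sigmoid` and `relu` are my own labels.

```python
import math

def sigmoid(x):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(a):
    """Pass positive values through unchanged; clamp negatives to zero."""
    return max(0.0, a)

for z in (-4.0, 0.0, 4.0):
    print(z, round(sigmoid(z), 3), relu(z))
```

Note that ReLU, unlike the sigmoid, is unbounded above, so it only "compresses" on the negative side.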

- number of weights = 784 * 16 + 16 * 16 + 16 * 10 = 12,960
- number of biases = 16 + 16 + 10 = 42


13,002 total parameters (about 13k)
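
The same bookkeeping as a few lines of Python, in case the layer sizes change. The list name `layer_sizes` is my own.

```python
layer_sizes = [784, 16, 16, 10]   # input, two hidden layers, output

# One weight per connection between consecutive layers.
weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
# One bias per neuron in every layer except the input layer.
biases = sum(layer_sizes[1:])

print(weights, biases, weights + biases)   # 12960 42 13002
```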


Exhortation: Dig into ***why?*** Challenge your assumptions.


- Activations are an $n \times 1$ column vector $A$
- Weights are an $m \times n$ matrix $W$
- Bias is an $m \times 1$ column vector $B$
- Second layer of neurons is an $m \times 1$ column vector $N$


$
\begin{align}
{
N = \sigma( W \cdot A + B )
}
\end{align}
$
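
A minimal NumPy sketch of this one-layer step, treating activations and biases as column vectors. The values are random placeholders rather than trained weights, and the sigmoid stands in for $\sigma$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 784, 16                      # input size, second-layer size

A = rng.random((n, 1))              # activations: column vector, values in [0, 1)
W = rng.standard_normal((m, n))     # weights: one row per second-layer neuron
B = rng.standard_normal((m, 1))     # biases: one per second-layer neuron

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

N = sigma(W @ A + B)                # second-layer activations
print(N.shape)                      # (16, 1)
```

Each row of `W` holds the 784 weights feeding one second-layer neuron, which is why the matrix product produces all 16 weighted sums at once.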



### Video 2: Gradient descent and how neural networks learn




- Introduces a training dataset: With answers!
- MNIST database is freely available
- Now we start in on *the calculus exercise*
- Initial: Random values


Introduce a cost function: 


- Go through the (presumably wrong) values in the output vector $a_i$.
- For each: Difference the value from the correct value $0, 0, 0, 0, 1, 0, 0, 0, 0, 0$
- Square this
- Add them up
- Higher cost value: The worse this set of weights performed
- This is a result for but one example
    - This produces a single number from the combination of 784 pixel values and 13k weights and biases
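
The recipe above for a single example, as a sketch. The output activations are numbers I made up; the correct label is taken to be 4 to match the target vector shown in the list.

```python
# Hypothetical output activations for an image whose correct label is 4.
output = [0.1, 0.0, 0.2, 0.0, 0.7, 0.1, 0.0, 0.3, 0.0, 0.1]
label = 4

# Desired output: 1 at the correct digit, 0 elsewhere.
target = [1.0 if digit == label else 0.0 for digit in range(10)]

# Sum of squared differences; a perfect answer would cost 0.
cost = sum((a - y) ** 2 for a, y in zip(output, target))
print(round(cost, 2))   # 0.25
```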


Consider the average cost over the entire training set.


Now drop into single-variable calculus thinking: Use the local slope to determine a step to take
in searching for a minimum. Steeper slope: Bigger step. Small slope: Small step (avoid overshoot).


Now move to cost as a surface above two variables (up from one). Now we are taking gradients.


What *moves* as we calculate the local surface gradient? Those two variables represent a
simplification of the idea from above of many hundreds / thousands of weights. Suppose they are
the first two weights $w_1$ and $w_2$. The cost function is a surface generated by the two-D
space of weights 1 and 2. For a particular value of $w_1$ and $w_2$ we have a direction of
steepest ascent (the gradient) and we use its negative to get a descent vector. We can follow
this down to arrive at some local minimum.


And now of course go from 2 dimensions to the 13,002 dimensions of the weight-and-bias space.


$- \nabla C(\vec{w})$
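
Here is a sketch of following the negative gradient on a two-variable surface. The cost function is a bowl I invented for illustration, and the gradient is estimated by finite differences rather than by backpropagation, which is only practical because there are just two variables.

```python
def cost(w1, w2):
    # Made-up bowl-shaped surface with its minimum at (1, -2).
    return (w1 - 1.0) ** 2 + 2.0 * (w2 + 2.0) ** 2

def numerical_gradient(f, w1, w2, h=1e-6):
    # Central differences approximate the two partial derivatives.
    d1 = (f(w1 + h, w2) - f(w1 - h, w2)) / (2 * h)
    d2 = (f(w1, w2 + h) - f(w1, w2 - h)) / (2 * h)
    return d1, d2

w1, w2, step = 4.0, 3.0, 0.1
for _ in range(200):
    g1, g2 = numerical_gradient(cost, w1, w2)
    w1, w2 = w1 - step * g1, w2 - step * g2   # move along the negative gradient

print(round(w1, 3), round(w2, 3))
```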


Gradient calculation in this context is called *back propagation*: Next video.



All we mean when we say a network is learning is that it is minimizing its cost function. 



- We have multiple layers of neurons, weights, biases to start with
- These represent a progression from 784 inputs to one result
    - the result is the selection of one grandmother neuron for one of { 0, 1, 2, ..., 9 }
- We have training data including *correct* answers
- We can determine cost function values for the entire training dataset
- This cost function $C$ has a mean value
- Somehow we can calculate the gradient of $C$: A 13,000 element vector
    - ...whose elements indicate changes to apply; by both sign and magnitude
    - ...so we can modify the weights into new weights
    - ...and iterate


Then there is **the test**: Score a version of the network on data it has never seen before.



### Does the network behave understandably?



- No. 
- The weights from the 784 layer to the 16 layer (when viewed as 28 x 28 images) are just random-looking...
- ...all 16 of them...
- ...and will classify random junk as a particular digit with high confidence
- ...and has no mechanism to actually *draw* an archetype 3


Grant observes that there is no mechanism in the system for uncertainty. The result vectors
for example are always certain: 9 zeros and a single 1.


The video concludes with remarks on "memorizing the dataset" by means of all these parameters; 
what I suspect is called over-fitting.


### Video 3: Backpropagation: Plausible story


- Backpropagation as a compelling story without notation or calculus
- Consider weights and activations in the penultimate layer
    - Proceeding from the weighted sum: Converse concept:
        - "Modify Weights where Activations are high for impact!"
        - "Modify Activations where Weights are high for impact!"
    - Complexity concept:
        - We are considering 10 (not just the correct +1 neuron) end activations
        - Multiplexing means individual cases are given a vote, not final say


We want to find a minimum, hence gradient descent. 


### Video 4: Backpropagation: Calculus basis


"What is SGD?"


- Backpropagation as calculus
- Principal idea is the multivariate calculus chain rule
- A given end-neuron (e.g. answer = 3) is impacted by an activation (from the prior neuron layer), a weight and a bias
- The idea is to get the gradient of the cost function in terms of chain rule partial derivatives
- Back-prop refers to chaining backwards up the network until you arrive at the stimulus activation layer. These cannot be modified.
- Also we can think in terms of modifying other feeder neuron activation layers...
    - ...but this is actually expressed in terms of -- in turn -- that neuron's inbound weight and bias
- The other dial here
    - We calculate the cost function gradient for one input case
    - ...but in fact we do some sort of average over many training inputs
    - ...but not every single one as this becomes computationally prohibitive
    - ...so we resort to random samples creating cohorts
    - ...hence stochastic
    - ...hence stochastic gradient descent
    - ...hence SGD
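
To tie the chain rule and the random cohorts together, here is a sketch that trains a single sigmoid neuron with mini-batch SGD. Everything in it is an assumption for illustration: the toy dataset, the learning rate, the batch size, and the squared-error cost carried over from Video 2. A real network repeats the same chain-rule step layer by layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set (an assumption for illustration): label is 1 when x > 0.
x = rng.uniform(-1.0, 1.0, size=1000)
y = (x > 0).astype(float)

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_cost(w, b):
    return np.mean((sigma(w * x + b) - y) ** 2)

w, b = 0.0, 0.0
learning_rate, batch_size = 0.5, 32
cost_before = mean_cost(w, b)

for step in range(2000):
    batch = rng.choice(len(x), size=batch_size, replace=False)  # random cohort
    xb, yb = x[batch], y[batch]
    a = sigma(w * xb + b)
    # Chain rule: dC/dw = dC/da * da/dz * dz/dw, and likewise for b.
    dC_da = 2.0 * (a - yb)
    da_dz = a * (1.0 - a)
    grad_w = np.mean(dC_da * da_dz * xb)
    grad_b = np.mean(dC_da * da_dz)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

cost_after = mean_cost(w, b)
print(round(cost_before, 3), round(cost_after, 3))
```

Each step sees only 32 of the 1000 examples, so the gradient is a noisy estimate; the mean cost over the full set still comes down.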
 

> Note: When you use backpropagation you are in the Deep Learning space.
 

### Video 5: GPT and Transformers


It is very important to call out the fact that our narrator makes an almost invisible
transition here from talking about *Deep Neural Nets* (more than one hidden layer)
to Large Language Models (which build on DNNs). This can be a bit disorienting if we
just walked in off the street. Furthermore the manner in which "LLM is a kind of DNN"
never really gets spelled out; so we may need to seek further afield for this type of
insight.


* Transformers are the next key idea
    * Consist of alternating **Attention Blocks** and **Perceptron Blocks**
    * Predict the next word by generating a pdf over a set of words
* First we need encoding as tokens (words or fragments)
* Then we need embedding: Each token is assigned a vector value
    * The vector space is "defined" by the model
    * Similar words (I use words instead of tokens as roughly equivalent)...
        * ...wind up with roughly aligned vector values
    * The processing step is now to iterate through Attention and Perceptron blocks
        * The Attention block allows the series of vectors to interact with one another
            * ...hence the vector elements are modified
        * The Perceptron blocks operate on the vectors in parallel
            * ...so no interactivity in this step
            * "Multi-layer Perceptron" (MLP) or equivalently "Feed Forward Layer"
            * What this does is, for now, a mystery
                * ...but it is certainly more linear algebra
        * There is also some normalization going on, in passing
     
The end result is understood as the now-thoroughly-modified last vector in
the token sequence. And this modified vector is then used to generate the pdf,
a distribution of probabilities for a set of possible "next words".


Now so far this is "next word prediction". To go to Chatbot we have

- A System prompt ("The Chatbot is a helpful yadda yadda")
- An initial User prompt

Now in some sense the model is "Predicting what that helpful assistant would say...",
an odd bit of rhetorical indirection.

The model uses (100B+) weights organized into matrices, in turn 
divided up into eight categories.


Pre-processing: Tokenization

Categories


- embedding: One entry for each token/word
    - This translates our English query into a sequence of vectors
    - GPT 3 uses 12,288 dimensions for these vectors
    - Vector alignment (embedding similarity): Inner (dot) product
    - 600 million weights in total just for token > embedded vector
    - How many tokens are allowed?
        - GPT 3: 2048 tokens, the **Context Size**
    - Look-up (embedding) matrix is $W_e$
- Together Query, Key and Value comprise one **Head** of attention
    - query
        - Lower-dimensional space: Asks about word modification of meaning
    - key
        - Lower-dimensional space: Answers query for relevance
    - value
        - Generates a vector based on a relevant modifier
        - This is added to the embedding of the modified word
        - ...and this changes its location in the embedding space
    - Words that follow are not allowed to modify words that precede
- output
- up-projection
- down-projection
- unembedding
    - Simple idea: Works on the very last highly-modified embedding vector
        - So we will be looking at $12k$ values in just that last vector
    - Reality is more complicated
    - Call the un-embedding matrix $W_u$
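
A toy sketch of the embedding look-up and the dot-product similarity from the list above. The three-word vocabulary and the numbers in `W_e` are invented; real embeddings are learned, and have 12,288 rows rather than 4.

```python
import numpy as np

# Toy vocabulary and a hypothetical embedding matrix W_e: one column per token.
vocab = ["king", "queen", "banana"]
W_e = np.array([
    [0.9, 0.8, -0.1],
    [0.7, 0.9,  0.0],
    [0.1, 0.2,  0.9],
    [0.0, 0.1,  0.8],
])

def embed(token):
    return W_e[:, vocab.index(token)]    # look up the token's column

def similarity(t1, t2):
    return float(embed(t1) @ embed(t2))  # inner (dot) product

print(similarity("king", "queen"), similarity("king", "banana"))
```

For scale: with a vocabulary of 50,257 tokens (a figure I am taking from the video, not recorded in the notes above), 50,257 x 12,288 = 617,558,016, which is the "600 million weights" in the list.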
 
**Softmax** is an operation on a vector of arbitrary values mapping this to a 
pdf with sum 1 and values on [0, 1]. This gets us to the idea of *Temperature* 
which in this context is a user-chosen value that acts as a "creativity dial". 
Higher temperature means the GPT responses will stretch further down into the 
pdf (lower probabilities) to select that next word. 

If the vector $\vec{x} = \{ x_0, \dots , x_{N-1} \}$ then the softmax-modified
value of $x_n$ (given a temperature $T$) is given by:


$
\begin{align}
{\Huge {x'}_{n} = {e^{\frac{x_n}{T}}}/{\sum_{i=0}^{N-1}{e^{\frac{x_i}{T}}}}
}\end{align}
$
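
The formula in a few lines of Python, to see what the temperature dial does. The three logits are made-up scores; subtracting the maximum before exponentiating is a common numerical safeguard and does not change the result.

```python
import math

def softmax(x, T=1.0):
    # Subtracting the maximum does not change the result but avoids overflow.
    m = max(x)
    exps = [math.exp((xi - m) / T) for xi in x]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]            # made-up scores for three candidate words
for T in (0.5, 1.0, 5.0):
    print(T, [round(p, 3) for p in softmax(logits, T)])
```

Low temperature concentrates probability on the top choice; high temperature flattens the distribution so that less likely words are picked more often.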


Now we have an idea of softmax, of embedding, general use of 
matrix $\times$ vector in this process, and inner product for similarity...
so we're invited to the next video.

### Video 6: Attention in Transformers


"Attention is all you need" (2017)


Embedding: ***Direction corresponds to semantic meaning***


A mole is a mole is a mole (lookup table).


Attention block 
* moves a given word's vector into a more accurate direction based on context words
    * ...and those modifiers could be quite far away within the body of the input
* consists of many *heads* running in parallel


A hitherto skipped over detail: The embedding vector encodes the position of the token in
the source: Location context. This in addition to the encoding of the word itself.

#### Towards intuitive comprehension

- The noun asks the question "Am I being modified" (a Query)
    - and the adjective answers "Yes by me!" (a Key)
    - The Query / Key space is a much lower dimensionality (e.g. 128)
    - Query and Key are matrices (per head)
        - ...that operate on the embeddings to reflect query and key being aligned
        - the verb is that the modifiers *attend* to the target word / embedding

- Softmax applied to the results gives a grid called an Attention Pattern
- In Key Query space we tend to divide by the square root of the dimensionality
- Acausal influence is avoided by forcing the weights on later tokens to zero
    - Later tokens are not permitted to modify earlier ones 
    - This is actually done by setting the Attention Pattern element to $-\infty$ prior to softmax
    - This is called masking
- Attention Pattern size is the square of the Context size
    - So Context is a bottleneck
    - Here are the names of approaches developed to address this
        - Sparse Attention Mechanisms
        - Blockwise Attention
        - Linformer
        - Reformer
        - Ring attention
        - Longformer
        - Adaptive Attention Span
     
I am fairly certain that the Query Key > Attention Pattern is not modifying
the embedding vectors just yet. Rather it is a setup for the next step (Value).
That is: The Query / Key step produces an Attention Pattern matrix which will
act as weights. 


...this is a little hard to grok from Grant's presentation... 


Grant is saying that we have not yet actually used the adjective embedding to 
modify the noun embedding; and so we turn now to a Value matrix which is left-multiplied
to the adjective's embedding, result = vector, and this vector is then (I think?)
multiplied by the Attention Pattern to give a $\Delta$ modification to be added to the
noun's embedding. This is the Value step. Need a few more watchings.


And now we have the Key, Query, and Value matrices driving a modification of 
the input token embeddings, voila the Attention Block. This is one head of attention; 
of which there are many. 


Efficiency note: Make the Value map not square 12k x 12k but rather low-rank, matching the Key / Query
dimensionality. There is an explanation on factoring the Value matrix into two matrices
that are each 12k x 128. 
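
Here is my attempt at one head of attention as a NumPy sketch, mostly to check my own reading of the Query / Key / Value steps. The sizes are toy values, the matrices are random rather than trained, and the names `W_V_down` and `W_V_up` are my labels for the two factors of the Value map. I store one token per row and apply softmax along rows; the video lays the grid out the other way around.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 5, 16, 4   # toy sizes (GPT-3: 12,288 and 128)

E = rng.standard_normal((n_tokens, d_model))       # one embedding per row
W_Q = rng.standard_normal((d_model, d_head))       # query matrix
W_K = rng.standard_normal((d_model, d_head))       # key matrix
W_V_down = rng.standard_normal((d_model, d_head))  # value, factored: down...
W_V_up = rng.standard_normal((d_head, d_model))    # ...then back up

Q, K = E @ W_Q, E @ W_K
scores = Q @ K.T / np.sqrt(d_head)   # scores[i, j]: relevance of token j to token i

# Masking: token i may only attend to tokens j <= i.
future = np.triu(np.ones((n_tokens, n_tokens), dtype=bool), k=1)
scores[future] = -np.inf

# Softmax over each row gives the attention pattern.
scores -= scores.max(axis=1, keepdims=True)
pattern = np.exp(scores)
pattern /= pattern.sum(axis=1, keepdims=True)

delta = pattern @ (E @ W_V_down @ W_V_up)   # weighted sum of value vectors
E_new = E + delta                           # modified embeddings
print(pattern.round(2))
```

The printed pattern is lower-triangular: the first token can only attend to itself, and every row sums to 1.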


The attention heads here are of type 'self-attention head'; contrast with a 'cross-attention head',
which processes two distinct types of data. For example: Audio input of speech and its transcription.
So a translator system might have Query in one language and Key in the other.


GPT-3 uses 96 attention heads in one attention block. Factor of 100 in the parameter count.
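
A rough tally of the attention parameters, as a sketch. The dimensions 12,288, 128 and 96 heads come from the notes above; the count of 96 attention blocks is not recorded in these notes and is my assumption about GPT-3. I count four matrices per head: query, key, and the two factors of the value map.

```python
d_model = 12_288      # embedding dimension (from the notes above)
d_head = 128          # query/key dimension (from the notes above)
n_heads = 96          # heads per attention block (from the notes above)
n_layers = 96         # attention blocks in GPT-3 (assumption: not in these notes)

# Query, key, value-down and value-up matrices are each d_model x d_head.
per_head = 4 * d_model * d_head
per_block = n_heads * per_head
attention_total = n_layers * per_block

print(f"{per_head:,} {per_block:,} {attention_total:,}")
# 6,291,456 603,979,776 57,982,058,496
```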

 
Each head gives a modification $\Delta$ for a given token embedding. All of them are applied.


Finally: We are up in the 50B parameters... eventually 175B... so parallel processing (GPU access)
is key. 


### Video 7: How might LLMs store facts... deep learning


Multi-Layer Perceptron (MLP): Maybe has something to do with facts. 


The Grandmother Neuron is abandoned in favor of information distributed as a pattern 
across many neurons. Look at the search term **sparse autoencoders**. Anthropic has 
articles with titles 'Toy Models of Superposition' and 'Towards Monosemanticity:
Decomposing Language Models With Dictionary Learning'. 



Details not discussed in this series


- Tokenization
- Positional encoding
- Layer normalization
- Training (chapter pending) which is all about back-propagation
    - Includes fine tuning


A lot of this is reminiscent of Fourier transforms. 

## [Part 2 Michael Nielsen's Book, Chapter 1](http://neuralnetworksanddeeplearning.com/chap1.html)


* Perceptron model is *binary output*: $1$ if $\sum_j w_j x_j + b > 0$, otherwise $0$.
* Perceptrons cover logic gates such as $NAND$ and so on.
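
A small sketch of a perceptron acting as a NAND gate, using weights $-2, -2$ and bias $3$ as in the chapter's example. I follow the convention that the output is 1 only when the weighted sum plus bias is strictly positive; the function name `perceptron` is mine.

```python
def perceptron(x, w, b):
    """Output 1 if the weighted sum plus bias is positive, else 0."""
    total = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if total > 0 else 0

# Weights -2, -2 and bias 3 make a two-input perceptron behave as NAND.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, perceptron((x1, x2), (-2, -2), 3))
```

Only the input (1, 1) gives a negative sum ($-2 - 2 + 3 = -1$) and hence output 0, which is exactly the NAND truth table.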


## Part 3 Gil Strang lectures


### [Gradient Descent](https://www.youtube.com/watch?v=AeRwohPuUHQ)


- Trying to minimize the cost (or any) multi-variable function...
- "When there are too many variables (weights) to take a second derivative we settle for first derivatives of the function"
- Let's introduce a learning rate (step size) parameter for step $k$, call this $s_k$


$\begin{align}
x_{k+1} = x_{k} - s_{k} \; \cdot \; \nabla f(x_k)
\end{align}$


Let's be on the lookout for the Hessian and the role played by convexity, spoken of in ominous tones by our lecturer.


We have two independent variables $x$ and $y$ and a quadratic function *of* them:


$\begin{align}
f(x, y) = \frac{1}{2} \cdot (x^2 + by^2)
\end{align}$


Let's write this in vector algebra form with a symmetric matrix **$S$** and these two variables 
organized as a column vector, pardon the redundant notation:


$\tilde{x}=\left[{\begin{array}{c}x\\y\end{array}}\right]$


$
\begin{align}
S=\left[{\begin{array}{cc}
   1 & 0 \\
   0 & b \\
  \end{array}}\right]
\; \; \; 
\end{align}$


And so now 

$\begin{align}
f(\tilde{x}) = \frac{1}{2} \cdot \tilde{x}^{T} \cdot S \cdot \tilde{x}
\end{align}$
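
A sketch of gradient descent on this quadratic, using the fact that the gradient of $\frac{1}{2} \tilde{x}^{T} S \tilde{x}$ is $S \tilde{x}$ for symmetric $S$. The value $b = 0.1$, the starting point and the fixed step size $s = 0.5$ are my own choices for illustration; a fixed step is the simplest option and is not necessarily how the lecture picks $s_k$.

```python
import numpy as np

b = 0.1                                   # example value; small b gives a narrow valley
S = np.array([[1.0, 0.0],
              [0.0, b]])

def f(x):
    return 0.5 * x @ S @ x                # same as 0.5 * (x^2 + b * y^2)

def grad_f(x):
    return S @ x                          # gradient of the quadratic form

x = np.array([b, 1.0])                    # starting point (assumed)
s = 0.5                                   # fixed step size (assumed)
values = [f(x)]
for k in range(100):
    x = x - s * grad_f(x)                 # x_{k+1} = x_k - s * grad f(x_k)
    values.append(f(x))

print(x, values[-1])
```

The $x$ component shrinks by a factor of $0.5$ per step while the $y$ component shrinks by only $0.95$, so progress along the shallow direction is slow. The smaller $b$ gets, the worse this becomes.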

```python
import torch
```

```python
x = 3
x + 2   # 5
```