# Neural Networks II  
## Data Structures 

## Structures of Data in the Real World

Data in the real world often have structure

- Continuity in space
- Continuity in time  
- Causal (and noncausal) relationships
- ... 


## Dismantling Structures: Restoring Independence for Inference

Structures in data often lead to dependence among observations

Most statistical inference methods require independence 

Sometimes we have to dismantle the structures for proper inference 
- Blocking or stratification
- Washout periods between experiments
- Randomized controlled trials 
- ... 


## Harnessing Structure: Architectures that Generalize

Sometimes knowing the structure helps

- Image classification (convolutional neural networks)
- Language translation (attention and transformer)
- Social network analysis (graph neural networks)
- ... 

Main message for today: Architectures that encode prior knowledge need *fewer* parameters and generalize better.

## Convolutional Neural Networks

### Example: Image classification

## Structures of Images 

- Local continuity for most pixels 
- Objects/parts are more important than single pixels for prediction
- ... 




## Images in Computers' Eyes


<p align="center">
  <img src="./figures/7pixel.png" width="60%" />
</p>

- Input: $\mathbf{X} \in \mathbb{R}^{m\times m}$ 
    - $x_{i,j}$: integer between 0 and 255 (pixel intensity)
    - $\mathbf{X} \in \mathbb{R}^{m\times m \times 3}$ for color images (RGB)

- Goal: Classify the image as one of the digits $0$ to $9$



Images from the [Modified National Institute of Standards and Technology database
(MNIST) dataset](https://en.wikipedia.org/wiki/MNIST_database)

## Convolutional Neural Networks  


<p align="center">
  <img src="./figures/CNN.png" width="70%" />
</p>


## Convolution Layer

Recall that a layer of an MLP takes the form  
$$ \mathbf{z}^{(l)} = \phi\left( \mathbf{W}^{(l)}\,\mathbf{z}^{(l-1)} + \mathbf{b}^{(l)} \right). $$


A convolution layer replaces the dense matrix multiplication by $\mathbf{W}^{(l)} \in \mathbb{R}^{p \times m}$ with a *convolution*: for each output position $(i,j)$ and each filter $q$,

$$z_{ij}^{(q)}= \sum_{c}\sum_{u=1}^k \sum_{v=1}^k w_{uv}^{(q,c)}\;x_{i+u,j+v}^{(c)}+b^{(q)}$$

where $c$ indexes the input channels, $q$ indexes the filters (output channels), and $k$ is the filter size.

- Local *receptive field*: filter size $k \times k$ (e.g. $3\times 3$) looks only at nearby pixels.  
- Weight sharing: the same kernel slides across the image 
- Usually followed by a ReLU activation
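
A minimal sketch of the sliding-window sum, assuming a single input channel, a single filter, no padding, and stride 1. The names `conv2d_single` and `relu`, the toy image, and the edge kernel are illustrative and not taken from the lecture notebook; indices start at 0 rather than 1.

```python
def conv2d_single(x, w, b=0.0):
    """Slide a k×k kernel w over a square image x (no padding, stride 1)."""
    m, k = len(x), len(w)
    out = m - k + 1  # output size for stride 1 and no padding
    return [
        [
            # the same weights w are reused at every position (i, j)
            b + sum(w[u][v] * x[i + u][j + v] for u in range(k) for v in range(k))
            for j in range(out)
        ]
        for i in range(out)
    ]


def relu(z):
    """Elementwise ReLU on a 2-D list."""
    return [[max(0.0, val) for val in row] for row in z]


# Toy 5×5 image: dark on the left, bright on the right
image = [[0, 0, 0, 1, 1] for _ in range(5)]

# Hypothetical 3×3 kernel that responds to a left-to-right increase in intensity
edge_kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = relu(conv2d_single(image, edge_kernel))
for row in feature_map:
    print(row)
```

Each output value depends only on a $3\times 3$ patch of the input, and the nine kernel weights are the only parameters regardless of the image size.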


## Convolution 

<p align="center">
  <img src="./figures/convo_layer.png" width="70%" />
</p>

* Input dimension $M$ (often **W**idth or **H**eight)
* Kernel/filter size $K$ (typically 3 or 5)
* Stride $S$
* Padding $P$
* Output dimension $\lfloor (M+2P-K)/S \rfloor +1$
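
The output-dimension formula can be checked with a few lines of Python. This is a sketch: the helper name `conv_output_size` is hypothetical, and rounding down when the stride does not divide evenly is an assumption (it matches the convention documented for PyTorch's `Conv2d`).

```python
def conv_output_size(m, k, stride=1, padding=0):
    """Output length along one dimension: floor((M + 2P - K) / S) + 1."""
    return (m + 2 * padding - k) // stride + 1


# 28-pixel input, 3×3 kernel, padding 1: the size is preserved
print(conv_output_size(28, 3, padding=1))   # 28
# 2×2 window with stride 2 halves the size
print(conv_output_size(28, 2, stride=2))    # 14
# 5×5 kernel without padding shrinks the input by 4
print(conv_output_size(28, 5))              # 24
```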

## Convolution Layer: Example 
<p align="center">
  <img src="./figures/7convo.png" width="80%" />
</p>

## Pooling Layer  

Pooling: Down-sample while keeping the most salient signal.

- Max-Pool: $z_{ij}=\max_{u,v} x_{(i+u)(j+v)}$, with $(u,v)$ ranging over a $k\times k$ window   
    (some invariance to small translations/shifts)
- Average-Pool: $z_{ij}=k^{-2}\sum_{u,v} x_{(i+u)(j+v)}$   
    (smooths signals)
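
A small sketch of both pooling rules, assuming non-overlapping $k\times k$ windows (stride equal to $k$, which is the default behaviour of `nn.MaxPool2d(2)`). The helper names `pool2d` and `mean` and the toy input are illustrative.

```python
def mean(values):
    """Average of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def pool2d(x, k=2, reduce=max):
    """Apply `reduce` to each non-overlapping k×k window of the 2-D list x."""
    rows, cols = len(x) // k, len(x[0]) // k
    return [
        [
            reduce(x[i * k + u][j * k + v] for u in range(k) for v in range(k))
            for j in range(cols)
        ]
        for i in range(rows)
    ]


x = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]

print(pool2d(x, k=2, reduce=max))   # [[4, 2], [2, 8]]
print(pool2d(x, k=2, reduce=mean))  # [[2.5, 1.0], [0.75, 6.5]]
```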

<p align="center">
  <img src="./figures/max_pool.png" width="60%" />
</p>

## [Convolutional Neural Networks: Example](vscode://file//Users/shizhe/Documents/Github/Website(Public)/CNN.ipynb)


<p align="center">
  <img src="./figures/7cnn.png" width="80%" />
</p>





## CNN vs. FNN

In the [example](vscode://file//Users/shizhe/Documents/Github/Website(Public)/CNN.ipynb), you will find 

1. A convolutional neural network
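```python
import torch.nn as nn

class CNN(nn.Module):  # wrapper class and forward() are assumed; see the notebook
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),   # 1×28×28 → 32×28×28
            nn.MaxPool2d(2),                             # → 32×14×14
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),  # → 64×14×14
            nn.MaxPool2d(2),                             # → 64×7×7
        )
        self.fc = nn.Sequential(
            nn.Flatten(),                                # 64×7×7 → 3136
            nn.Linear(64 * 7 * 7, 128), nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, 10),                          # one logit per digit
        )

    def forward(self, x):
        return self.fc(self.conv(x))
```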
2. An FNN/MLP 
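```python
import torch.nn as nn

class FNN(nn.Module):  # wrapper class and forward() are assumed; see the notebook
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                    # 1×28×28 → 784
            nn.Linear(784, 512), nn.ReLU(),
            nn.Linear(512, 32), nn.ReLU(),
            nn.Linear(32, 10))               # one logit per digit

    def forward(self, x):
        return self.net(x)
```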
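
The sizes of the two models can be tallied by hand from the layer definitions above. The sketch below does this with two hypothetical helpers, `conv2d_params` and `linear_params`, which count weights plus biases; ReLU, pooling, dropout, and flatten layers have no trainable parameters.

```python
def conv2d_params(c_in, c_out, k):
    """One k×k kernel per (input channel, output channel) pair, plus one bias per output channel."""
    return c_out * c_in * k * k + c_out


def linear_params(n_in, n_out):
    """Dense weight matrix plus one bias per output unit."""
    return n_in * n_out + n_out


cnn_params = (
    conv2d_params(1, 32, 3)            # first convolution
    + conv2d_params(32, 64, 3)         # second convolution
    + linear_params(64 * 7 * 7, 128)   # first fully connected layer
    + linear_params(128, 10)           # output layer
)

fnn_params = (
    linear_params(784, 512)
    + linear_params(512, 32)
    + linear_params(32, 10)
)

print(cnn_params)  # 421642
print(fnn_params)  # 418666
```

Most of the CNN's parameters sit in its first fully connected layer; the two convolution layers together contribute fewer than 19,000.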



## Performance

<section>

<table style="width: 100%;border: none; border-collapse: collapse;">
<tr>

<!-- Text Column -->
<td style="width: 50%; vertical-align: top; padding-right: 20px;border: none;">

- Number of parameters (trainable weights)
    - CNN: 421,642
    - FNN: 418,666

- Train losses look fine for both models

- Misclassification on test sets:
    - CNN: $18.1\%$
    - FNN: $47.1\%$
</td>

<!-- Image Column -->
<td style="width: 50%;border: none;">
  <img src="./figures/loss_cnn_fnn.png" alt="attention" style="max-width: 100%;">
</td>

</tr>
</table>

</section>




## Convolutional Neural Networks: Milestones


| Year | Architecture | # Layers |
|------|--------------|----------|
| 1998 | LeNet‑5      | 5        |
| 2012 | AlexNet      | 8        |
| 2014 | VGG          | 19       |
| 2015 | ResNet       | **152**  |
| 2017 | SENet        | **152**  |



## Key to Go Deep: ResNet (He et al. 2015)


<section>
  <div style="display: flex; align-items: center;">
    <!-- Left: Text -->
    <div style="flex: 3; padding-right: 20px;">
      
- Prior to ResNet: deeper models yielded larger *training* error

- This contradicts known facts:
    - Deeper models can capture more complex mappings
    - CNNs are already past the interpolation threshold (second descent); AlexNet has around 60 million parameters

- Hypothesis: the large training error results from failures in optimization 
    - Possible cause: vanishing gradients in backpropagation 

- Solution: explicitly include an identity mapping to let gradients flow
    - Desired mapping $H(x)$
    - "Residual" (nonlinear) mapping $F(x) = H(x) - x$, so the block outputs $F(x) + x$

    </div>

    <!-- Right: Image -->
    <div style="flex: 2;">
    <img src="./figures/resnet_orig.png" alt="" width="300" align="center">
    </div>
  </div>
</section>
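
A toy scalar illustration of why the identity path helps, not the ResNet architecture itself. It assumes each layer computes $F(x)=\tanh(wx)$ with a fixed weight $w=0.5$ and compares the derivative of the output with respect to the input for a plain stack and a residual stack. The function name `forward_and_grad` and all numeric choices are hypothetical.

```python
import math


def forward_and_grad(x, depth, w=0.5, residual=False):
    """Return the output and d(output)/d(input) of a stack of scalar tanh layers."""
    grad = 1.0
    for _ in range(depth):
        f = math.tanh(w * x)       # F(x)
        df = w * (1.0 - f * f)     # F'(x), always between 0 and w
        if residual:
            # H(x) = F(x) + x, so H'(x) = F'(x) + 1 >= 1
            x, grad = x + f, grad * (1.0 + df)
        else:
            # H(x) = F(x), so H'(x) = F'(x) <= w < 1
            x, grad = f, grad * df
    return x, grad


_, plain_grad = forward_and_grad(1.0, depth=30)
_, resid_grad = forward_and_grad(1.0, depth=30, residual=True)

print(f"plain stack gradient:    {plain_grad:.3e}")
print(f"residual stack gradient: {resid_grad:.3e}")
```

In the plain stack every layer multiplies the gradient by a factor of at most $0.5$, so it shrinks geometrically with depth. With the skip connection every factor is at least $1$, so the gradient cannot vanish.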



## Attention and Transformer

### Example: Translation


## Machine Translation Examples


French to English  

- **FR:** <span style="color:#d62728">Demain</span>, <span style="color:#1f77b4">je</span> <span style="color:#2ca02c">donnerai</span> <span style="color:#9467bd">le&nbsp;livre</span> <span style="color:#ff7f0e">à&nbsp;mon&nbsp;ami</span>.  
- **EN:** <span style="color:#1f77b4">I</span> <span style="color:#2ca02c">will&nbsp;give</span> <span style="color:#9467bd">the&nbsp;book</span> <span style="color:#ff7f0e">to&nbsp;my&nbsp;friend</span> <span style="color:#d62728">tomorrow</span>.

Spanish to English  

- **ES:** <span style="color:#d62728">Ayer</span>, <span style="color:#1f77b4">María</span> <span style="color:#2ca02c">le&nbsp;envió</span> <span style="color:#9467bd">una&nbsp;carta</span> <span style="color:#ff7f0e">a&nbsp;su&nbsp;hermano</span> <span style="color:#8c564b">desde&nbsp;México</span>.  
- **EN:** <span style="color:#1f77b4">Maria</span> <span style="color:#2ca02c">sent</span> <span style="color:#ff7f0e">her&nbsp;brother</span> <span style="color:#9467bd">a&nbsp;letter</span> <span style="color:#8c564b">from&nbsp;Mexico</span> <span style="color:#d62728">yesterday</span>.

Chinese to English  

- **ZH:** <span style="color:#1f77b4">我</span><span style="color:#8c564b">在图书馆</span><span style="color:#2ca02c">借了</span><span style="color:#9467bd">一本有趣的书</span>。  
- **EN:** <span style="color:#1f77b4">I</span> <span style="color:#2ca02c">borrowed</span> <span style="color:#9467bd">an&nbsp;interesting&nbsp;book</span> <span style="color:#8c564b">from&nbsp;the&nbsp;library</span>.


Text in the eyes of computers: [Tokenizer playground](https://huggingface.co/spaces/Xenova/the-tokenizer-playground)


[French numbers to English using Transformer](vscode://file//Users/shizhe/Documents/Github/Website(Public)/Transformer_Number.ipynb)


## Sequence-to-Sequence Task

- Machine Translation (e.g., French to English sentence translation)
- Music Generation (input: musical themes or motifs, output: extended music sequences)
- Code Generation from Natural Language (input: problem description, output: source code)
- ...

Key idea: **Encode** the input into a context state $c$, then **decode** step‑by‑step with an RNN or Transformer.


<p align="center">
  <img src="./figures/encoder_decoder_marked.png" width="70%" />
</p>

## Recurrent Neural Networks  

Encoder 
$$
\mathbf{h}_t = f\bigl( \mathbf{W}_{hx}\mathbf{x}_t + \mathbf{W}_{hh}\mathbf{h}_{t-1} + \mathbf{b} \bigr),\qquad t = 1,\dots,T
$$

Decoder
$$\mathbf{s}_t = g\big(\mathbf{U}_{sc}\mathbf{c} + \mathbf{U}_{sy}\mathbf{y}_{t-1}+\mathbf{U}_{ss}\mathbf{s}_{t-1}\big)$$

- Shares parameters across all time steps.
- Trained with *Back‑Propagation Through Time* (BPTT).  
- Suffers from vanishing gradients for long sequences (large $T$).
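
The encoder recursion can be sketched in plain Python. Several choices here are assumptions rather than part of the slides: $f=\tanh$, a zero initial state $\mathbf{h}_0$, the context taken to be the last hidden state, and the toy weights and inputs. The helper names `matvec` and `rnn_encode` are illustrative.

```python
import math


def matvec(W, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rnn_encode(xs, W_hx, W_hh, b):
    """Run h_t = tanh(W_hx x_t + W_hh h_{t-1} + b) over the sequence xs."""
    h = [0.0] * len(b)  # assumed initial state h_0 = 0
    for x in xs:
        pre = [p + q + r for p, q, r in zip(matvec(W_hx, x), matvec(W_hh, h), b)]
        h = [math.tanh(a) for a in pre]  # the same weights are reused at every step
    return h  # last hidden state, used here as the context


# Three input tokens with 2-dimensional embeddings (toy values)
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_hx = [[0.5, -0.5], [0.3, 0.8]]
W_hh = [[0.1, 0.0], [0.0, 0.1]]
b = [0.0, 0.0]

context = rnn_encode(xs, W_hx, W_hh, b)
print(context)
```

The loop makes the sequential dependence explicit: step $t$ cannot start before step $t-1$ has finished.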


## Recurrent Neural Networks

Encoder

<p align="center">
  <img src="./figures/encoder_color.png" width="70%" />
</p>


Decoder
<p align="center">
  <img src="./figures/decoder_color.png" width="70%" />
</p>

## Problem with the Context 

One problem with the RNN encoder-decoder is that 

- All information about the input is stored in a single context vector $c$. 
- Every decoding step is based on the same $c$.

This has a few consequences:
- It might fail for long sequences 
- It might not be optimal for sequential decoding

<p align="center">
  <img src="./figures/encoder_decoder_compact.png" width="70%" />
</p>


## Attention: Step 1

<section>

<table style="width: 100%;border: none; border-collapse: collapse;">
<tr>

<!-- Text Column -->
<td style="width: 45%; vertical-align: top; padding-right: 20px;border: none;">


  1. Compute alignment scores $e_{1,i} \in \mathbb{R}$
  $$e_{1,i}=f_{\rm att}(s_{0},h_i)$$

  2. Apply softmax to obtain attention weights (weights sum up to 1)
  $$ a_{1,i}= \frac{\exp(e_{1,i})}{\sum_{j=1}^{3} \exp(e_{1,j})} $$

  3. Compute context vector as the weighted sum of hidden states
  $$c_1 = \sum_{i=1}^3 a_{1,i}h_i$$

  4. Decode the new state 
  $$s_1 = g\big(\mathbf{U}_{sc}c_1 + \mathbf{U}_{sy}{y}_{0}+\mathbf{U}_{ss}{s}_{0}\big)$$

</td>

<!-- Image Column -->
<td style="width: 55%;border: none;">
  <img src="./figures/attention_step_1.png" alt="attention" style="max-width: 100%;">
</td>

</tr>
</table>

</section>


## Attention: Step $t$

<section>

<table style="width: 100%;border: none; border-collapse: collapse;">
<tr>

<!-- Text Column -->
<td style="width: 45%; vertical-align: top; padding-right: 20px;border: none;">


1. Compute alignment scores $e_{t,i} \in \mathbb{R}$
$$e_{t,i}=f_{\rm att}(s_{t-1},h_i)$$

2. Apply softmax to obtain attention weights (weights sum up to 1)
$$ a_{t,i}= \frac{\exp(e_{t,i})}{\sum_{j=1}^{3} \exp(e_{t,j})} $$

3. Compute context vector as the weighted sum of hidden states
$$c_t = \sum_{i=1}^3 a_{t,i}h_i$$

4. Decode the new state 
$$s_t = g\big(\mathbf{U}_{sc}c_t + \mathbf{U}_{sy}{y}_{t-1}+\mathbf{U}_{ss}{s}_{t-1}\big)$$

</td>

<!-- Image Column -->
<td style="width: 55%;border: none;">
  <img src="./figures/attention_step_2.png" alt="attention" style="max-width: 100%;">
</td>

</tr>
</table>

</section>
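
Steps 1 to 3 of the attention computation can be sketched as follows. The scoring function $f_{\rm att}$ is taken to be a plain dot product here, which is an assumption made for brevity (Bahdanau et al. use a small learned network). The names `softmax` and `attention_context` and the toy vectors are illustrative; step 4 (the decoder update) is omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(s_prev, hs):
    """Return attention weights a_{t,i} and context c_t for decoder state s_{t-1}."""
    # Step 1: alignment scores e_{t,i} (dot product assumed as f_att)
    scores = [sum(a * b for a, b in zip(s_prev, h)) for h in hs]
    # Step 2: softmax turns scores into weights that sum to 1
    weights = softmax(scores)
    # Step 3: context vector as the weighted sum of encoder hidden states
    dim = len(hs[0])
    context = [sum(w * h[d] for w, h in zip(weights, hs)) for d in range(dim)]
    return weights, context


hs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three encoder hidden states h_1, h_2, h_3
s_prev = [2.0, 0.0]                        # previous decoder state s_{t-1}

weights, context = attention_context(s_prev, hs)
print(weights)
print(context)
```

Because `s_prev` aligns with the first coordinate, the first and third hidden states receive equal, larger weights than the second. A different decoder state would give a different context vector, which is the point of recomputing $c_t$ at every step.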


## Attention: Example 

<p align="center">
  <img src="./figures/attention_example.png" width="50%" />
</p>


Source: Figure 2(a) from [Bahdanau et al. (2015)](https://arxiv.org/abs/1409.0473)


## Self-Attention

<section>

<table style="width: 100%;border: none; border-collapse: collapse;">
<tr>

<!-- Text Column -->
<td style="width: 50%; vertical-align: top; padding-right: 20px;border: none;">

Observation: Hidden states and decoding states are both functions of input $\mathbf{X}$


Self-Attention: a new architecture built directly on the input $\mathbf{X}$
$$\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^\top}{\sqrt d}\right)V,$$
where the query matrix $Q$, key matrix $K$, and value matrix $V$ are all functions of the input $\mathbf{X}$, $d$ is the dimension of the query and key vectors, and the softmax is applied row-wise

For more detail, see [Lecture 8](https://cs231n.stanford.edu/slides/2024/lecture_4.pdf) of Stanford CS231n by Fei-Fei Li et al.



</td>

<!-- Image Column -->
<td style="width: 50%;border: none;">
  <img src="./figures/self_attention.png" alt="attention" style="max-width: 100%;">
</td>

</tr>
</table>

</section>
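
A minimal sketch of scaled dot-product self-attention in plain Python. It assumes the common choice of linear projections $Q=\mathbf{X}W_q$, $K=\mathbf{X}W_k$, $V=\mathbf{X}W_v$ (the slides only say that $Q$, $K$, $V$ are functions of $\mathbf{X}$); the helper names and toy matrices are illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V, with the softmax applied to each row."""
    d = len(K[0])
    scores = matmul(Q, transpose(K))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, V)


def self_attention(X, W_q, W_k, W_v):
    """Queries, keys, and values are all linear projections of the same input X."""
    return attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three tokens, 2-dimensional embeddings
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]

out = self_attention(X, identity, identity, identity)
uniform = self_attention(X, zeros, identity, identity)  # zero queries give equal weights
print(out)
print(uniform)
```

Every output row is computed from all input rows at once, with no recursion over positions. With zero queries all scores are equal, so each output row is simply the average of the value vectors.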



## Transformer 

<section>

<table style="width: 100%;border: none; border-collapse: collapse;">
<tr>

<!-- Text Column -->
<td style="width: 50%; vertical-align: top; padding-right: 20px;border: none;">

Stacking multi‑head self‑attention layers yields the **Transformer** (Vaswani et al., 2017). 

Key advantages:

- Parallelizable (vs. sequential RNN).
- Captures long‑range dependencies.
- Scales with data 

</td>

<!-- Image Column -->
<td style="width: 50%;border: none;">
  <img src="./figures/transformer.png" alt="attention" style="max-width: 100%;">
</td>

</tr>
</table>

</section>


## Transformer  

<p align="center">
  <img src="./figures/transformer_paper.png" width="50%" />
</p>

Interactive demo: [The Annotated Transformer](https://nlp.seas.harvard.edu/annotated-transformer/).


## Summary

Real‑world data come with *structure*:

* **Grid‑like** → images (2‑D pixels), audio spectrograms (1‑D or 2‑D).
* **Sequential** → text, time‑series, DNA.

Neural layers that *leverage* structures learn faster and generalize better:

| Data type | Good default layer | Key idea |
|-----------|-------------------|----------|
| Grid      | Convolution       | local receptive field & weight sharing |
| Sequence  | Recurrent / Transformer | hidden state or attention over positions |
