# Deep Learning: Types of Deep Learning

In [6]:
import torch
import torch.nn as nn
import torch.nn.functional as F

## The Perceptron

The **Perceptron** is the simplest type of artificial neural network. It is the fundamental building block of deep learning.

Think of it as a single voter deciding 'Yes' or 'No' based on weighted evidence. It takes several inputs, multiplies them by **weights**, sums them up, adds a **bias**, and passes the result through an **activation function** (like a step function) to produce an output.

Mathematically: $y = f\left(\sum_i w_i x_i + b\right)$, where $w_i$ are the weights, $x_i$ the inputs, and $b$ the bias.
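The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not a trained model: the function name `perceptron_output` and the AND-gate weights are assumptions chosen for the example, and the activation is a simple step function.

```python
def perceptron_output(weights, inputs, bias):
    """Weighted sum of the inputs plus a bias, passed through a step activation."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0  # step function: 'Yes' (1) or 'No' (0)

# Hypothetical hand-picked weights that make the perceptron behave like logical AND
and_weights, and_bias = [1.0, 1.0], -1.5

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", perceptron_output(and_weights, [a, b], and_bias))
```

Only the input `(1, 1)` produces a weighted sum above zero, so it is the only case where this perceptron answers 'Yes'.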

### Linear vs. Affine
Strictly speaking, a **linear** function must pass through the origin (0,0). It obeys $f(cx+y) = cf(x) + f(y)$.
However, most real-world decision boundaries don't pass exactly through the origin.

This is why we add the **bias** term. It shifts the decision boundary away from the origin, making it an **affine** function.
*   **Linear**: $y = w \cdot x$ (Must go through origin)
*   **Affine**: $y = w \cdot x + b$ (Can be shifted)

<p align="center">
    <img src="img/funny_cnn.png" width="500">
    <br>
    <em>Chihuahua or Muffin?</em>
</p>


**Lifting Trick**: We can treat an affine function as linear by adding a "1" to our input vector ($x_{n+1} = 1$) and treating the bias as just another weight ($w_{n+1}$).
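A small sketch of the lifting trick, using made-up numbers: the helper names `affine` and `lifted_linear` are illustrative assumptions. Appending a constant 1 to the input and the bias to the weights gives the same value as the affine form.

```python
def affine(w, b, x):
    """Affine form: w . x + b"""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def lifted_linear(w_lifted, x):
    """Linear form on the lifted input: append x_{n+1} = 1, then take a plain dot product."""
    x_lifted = list(x) + [1.0]
    return sum(wi * xi for wi, xi in zip(w_lifted, x_lifted))

w, b = [2.0, -1.0], 0.5   # example weights and bias (arbitrary values)
x = [3.0, 4.0]

w_lifted = w + [b]        # treat the bias as weight w_{n+1}
print(affine(w, b, x), lifted_linear(w_lifted, x))
```

Both calls compute the same number, which is why the algorithm below can work with $w$ alone and still represent a bias.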

<p align="center">
    <img src="img/image1.png" width="500">
    <br>
    <em>Linear Decision Boundary</em>
</p>


### Perceptron Algorithm Steps

Given a set of data $S = \{(x_1, y_1), ..., (x_n, y_n)\}$:

1.  **Initialize** $w_0 = 0$.
2.  **Iterate** For $t=1, 2, ..., T$:
3.  **Check** If there exists some $(x_i, y_i)$ that is classified incorrectly, in other words, $y_i(w_{t-1} \cdot x_i) \le 0$:
4.  **Update** Set the next $w_t = w_{t-1} + y_i x_i$.
5.  **Terminate** when no more data is incorrectly classified.

**Runtime**: $O(n \cdot d \cdot T)$ where $d$ is the dimension of each data-point $x_i$ and $T$ is the maximum number of steps.
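The steps above can be sketched in plain Python as follows. This is a minimal version under a few assumptions: labels are $\pm 1$, there is no separate bias (use the lifting trick if one is needed), the first misclassified point in list order is always chosen, and `max_steps` plays the role of $T$. The toy dataset is made up for illustration.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def train_perceptron(data, max_steps=1000):
    """data: list of (x, y) pairs with y in {-1, +1}. Returns (w, number_of_updates)."""
    w = [0.0] * len(data[0][0])                      # 1. initialize w_0 = 0
    for step in range(max_steps):                    # 2. iterate t = 1, ..., T
        # 3. look for any point with y_i (w . x_i) <= 0
        mistake = next(((x, y) for x, y in data if y * dot(w, x) <= 0), None)
        if mistake is None:                          # 5. nothing misclassified: stop
            return w, step
        x, y = mistake
        w = [wi + y * xi for wi, xi in zip(w, x)]    # 4. update w_t = w_{t-1} + y_i x_i
    return w, max_steps

# Toy linearly separable data (hypothetical)
data = [([2.0, 1.0], 1), ([1.0, 3.0], 1), ([-1.0, -2.0], -1), ([-2.0, -1.0], -1)]
w, updates = train_perceptron(data)
print(w, updates)  # -> [2.0, 1.0] 1
```

Each pass over the data costs $O(n \cdot d)$, and there are at most $T$ passes, which matches the runtime stated above.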

### Termination of the Perceptron Algorithm

The perceptron algorithm is only guaranteed to **terminate** if the data is linearly separable.

Assume the data is linearly separable: there is a separator $w^*$, normalized so that $||w^*||=1$, that classifies every point in $S$ correctly with some margin $m > 0$, meaning $y_i(w^* \cdot x_i) \ge m$ for all $i$.
As we have finite data, we can bound it within some radius $R$ such that $||x_i|| \le R$ for all $x_i$.
The perceptron algorithm will then take at most $(R/m)^2$ steps, i.e., $T \le (R/m)^2$. By steps, we mean the number of times the algorithm finds a misclassified datapoint and updates $w$.

#### Proof of Perceptron Termination

1. We bound $w_t \cdot w^* = (w_{t-1} + y_i x_i) \cdot w^* \ge w_{t-1} \cdot w^* + m$, because $w^*$ separates every point with margin at least $m$, so $y_i (x_i \cdot w^*) \ge m$.
2. $||w_t||^2 = ||w_{t-1} + y_i x_i||^2 = ||w_{t-1}||^2 + 2y_i(w_{t-1} \cdot x_i) + ||y_i x_i||^2$.

Since a mistake occurred, $y_i(w_{t-1} \cdot x_i) \le 0$, and since $y_i = \pm 1$, $||y_i x_i||^2 = ||x_i||^2 \le R^2$.
Thus, $||w_t||^2 \le ||w_{t-1}||^2 + R^2$.

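Since $w_0=0$, after $T$ updates the first bound gives $w_T \cdot w^* \ge Tm$ and the second gives $||w_T||^2 \le TR^2$, i.e., $||w_T|| \le \sqrt{T}R$.

By **Cauchy-Schwarz** (using $||w^*|| = 1$),
$w_T \cdot w^* \le ||w_T|| \cdot ||w^*|| \le \sqrt{T}R$.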

Thus, $Tm \le \sqrt{T}R \implies \sqrt{T} \le \frac{R}{m} \implies T \le (\frac{R}{m})^2$.
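As a sanity check of the bound, the sketch below runs the perceptron on a tiny made-up dataset and compares the number of updates with $(R/m)^2$. The reference separator `w_star` is a hand-picked unit vector that happens to separate this data (an assumption for the example); the proof only needs some unit-norm separator with margin $m$.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def count_updates(data, max_steps=1000):
    """Run the perceptron from w_0 = 0 and count how many updates it makes."""
    w = [0.0] * len(data[0][0])
    for step in range(max_steps):
        mistake = next(((x, y) for x, y in data if y * dot(w, x) <= 0), None)
        if mistake is None:
            return step
        x, y = mistake
        w = [wi + y * xi for wi, xi in zip(w, x)]
    return max_steps

# Hypothetical separable data and a hand-picked unit-norm separator
data = [([-1.0, 3.0], 1), ([3.0, -1.0], 1), ([-3.0, 1.0], -1), ([1.0, -3.0], -1)]
w_star = [1 / math.sqrt(2), 1 / math.sqrt(2)]

m = min(y * dot(w_star, x) for x, y in data)      # margin achieved by w_star
R = max(math.sqrt(dot(x, x)) for x, _ in data)    # radius containing all points
bound = (R / m) ** 2
updates = count_updates(data)

print("updates:", updates, "bound:", bound)
```

On this data the algorithm stops after 2 updates, comfortably within the bound of roughly 5.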

<p align="center">
    <img src="img/image8.gif" width="500">
    <br>
    <em>Visualizing Convergence (Termination)</em>
</p>


----

## From Perceptrons to Neural Networks

### The Limitation: Linearity
Perceptrons are powerful, but they have a major flaw: they are **linear classifiers**. This means they can only separate data that can be split by a straight line (or plane).

A famous example where they fail is the **XOR problem**. You cannot draw a single straight line to separate (0,0) and (1,1) from (0,1) and (1,0).


<p align="center">
    <img src="img/image6.png" width="500">
    <br>
    <em>Linearization of Non-Linear Data</em>
</p>

### The Solution: Neural Networks
To solve complex, non-linear problems, we need teamwork! By combining multiple perceptrons into **layers** and using **non-linear activation functions**, we create a **Neural Network**.

This allows the network to learn complex curves and patterns, not just straight lines.

### The Power of Non-Linearity: Linearization
How do we solve non-linear problems? We can transform the space!

**Linearization of New Space**: If we transform our inputs (e.g., $z = x^2$), a complex curved boundary in the original space ($X$) becomes a simple straight line in the new space ($Z$).

Example:
-   Original Circle: $g(X) = (x_1 - c_1)^2 + (x_2 - c_2)^2 = r^2$ (a curved boundary of radius $r$ in $X$)
-   Transformation: Let $z_1 = (x_1 - c_1)^2$ and $z_2 = (x_2 - c_2)^2$
-   New Linear Boundary: $g(Z) = z_1 + z_2 = r^2$ (a straight line in $Z$)
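The circle example can be checked numerically. In this sketch the centre `(c1, c2)`, the radius `r`, and the sample points are all made-up values; the point is that after the transformation a plain linear rule (weights $(-1, -1)$ and bias $r^2$) decides inside versus outside.

```python
c1, c2, r = 1.0, -2.0, 2.0   # hypothetical circle centre and radius

def to_z(x1, x2):
    """Map a point from the original space X into the new space Z."""
    return (x1 - c1) ** 2, (x2 - c2) ** 2

def linear_rule(z1, z2):
    """A linear classifier in Z: sign of (-1)*z1 + (-1)*z2 + r^2."""
    return 1 if (-1.0 * z1 - 1.0 * z2 + r ** 2) > 0 else -1

points = [(1.0, -2.0), (2.0, -1.0), (1.0, 0.5), (4.0, -2.0)]
labels = [linear_rule(*to_z(x1, x2)) for x1, x2 in points]
print(labels)  # -> [1, 1, -1, -1]  (+1 = inside the circle, -1 = outside)
```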



----

## Terminology

Before we get into the weeds of specific types of deep learning, we should briefly define the components that go into a neural network.

<p align="center">
    <img src="img/image10.png" width="500">
    <br>
    <em>Anatomy of a Single Neuron</em>
</p>


### Weights
**Weights represent importance.**
Think of them as the 'strength' of the connection between neurons. If a feature is very important for the decision (e.g., 'has wheels' for detecting a car), it will have a large weight.

Geometrically, the weight vector $w$ determines the **orientation** of the decision boundary.

<p align="center">
    <img src="img/image4.png" width="500">
    <br>
    <em>The Weight Vector is Normal to the Boundary</em>
</p>


### Bias
**Bias represents the threshold.**
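It allows the activation function to be shifted to the left or right. Without bias, the neuron's threshold is stuck at zero: it fires exactly when the weighted sum of its inputs is positive. Bias lets the neuron say, "I only fire if the weighted sum is greater than 5" (which corresponds to a bias of $-5$).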



In [7]:
# A linear layer contains both weights (w) and bias (b)
layer = nn.Linear(in_features=10, out_features=5, bias=True)
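To make the weights and bias concrete, this short sketch inspects the parameters of the same kind of layer and reproduces its output by hand. The batch of random inputs is an assumption for illustration; `nn.Linear` stores its weight with shape `(out_features, in_features)` and computes $x W^T + b$.

```python
import torch
import torch.nn as nn

layer = nn.Linear(in_features=10, out_features=5, bias=True)

print(layer.weight.shape)  # torch.Size([5, 10]): one row of weights per output neuron
print(layer.bias.shape)    # torch.Size([5]): one bias per output neuron

x = torch.randn(3, 10)                       # hypothetical batch of 3 inputs
manual = x @ layer.weight.T + layer.bias     # the affine map written out
print(torch.allclose(layer(x), manual, atol=1e-6))
```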


### Activation Functions
<p align="center">
    <img src="img/image10.png" width="500">
    <br>
    <em>A Single Neuron with Activation</em>
</p>


<p align="center">
    <img src="img/image9.png" width="500">
    <br>
    <em>Common Activation Functions</em>
</p>

### Key Idea 💡
**Activation functions introduce non-linearity.**
They decide whether a neuron should 'fire' or not. Without them, a neural network would just be one big linear model, no matter how many layers you stack!

Common examples:
- **Sigmoid**: S-shape, squashes output between 0 and 1 (like a probability).
- **ReLU (Rectified Linear Unit)**: If positive, keep it; if negative, set to zero. Simple but very effective.




Example:
```python
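import torch
input_tensor = torch.tensor([-1.0, 0.0, 2.0])  # small example input
output = F.relu(input_tensor)        # ReLU: negative values become 0
output = torch.sigmoid(input_tensor) # Sigmoid: values squashed into (0, 1)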
```
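To see why the non-linearity matters, the sketch below stacks two linear layers with no activation in between and shows that they collapse into a single affine map. The layer sizes and random inputs are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lin1 = nn.Linear(4, 8)
lin2 = nn.Linear(8, 3)

# Compose the two layers algebraically: W = W2 W1, b = W2 b1 + b2
W = lin2.weight @ lin1.weight               # shape (3, 4)
b = lin2.weight @ lin1.bias + lin2.bias     # shape (3,)

x = torch.randn(5, 4)
stacked = lin2(lin1(x))                     # two layers, no activation
collapsed = x @ W.T + b                     # one equivalent layer
print(torch.allclose(stacked, collapsed, atol=1e-5))

# With a ReLU in between, the network is no longer this single affine map
with_relu = lin2(torch.relu(lin1(x)))
print(torch.allclose(with_relu, collapsed, atol=1e-5))
```

The first comparison agrees up to floating-point error. The second generally does not, because ReLU zeroes out part of the hidden layer.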



### Layers
**Layers organize neurons.**
Neurons are arranged in layers. The first layer is the **Input Layer**, the last is the **Output Layer**, and everything in between is a **Hidden Layer**.

<p align="center">
    <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/46/Colored_neural_network.svg/998px-Colored_neural_network.svg.png" height="300" style="background-color:white;">
    <br>
    <em>Layers of a Neural Network</em>
</p>


----

## Feedforward Neural Networks (FNNs)

<p align="center">
    <img src="img/fnn_detailed.png" width="500">
    <br>
    <em>Detailed View of a Feedforward Network</em>
</p>

FNNs are the simplest type of deep neural network. In an FNN, information moves in only one direction, forward, from the input nodes, through the hidden nodes (if any), and to the output nodes. There are no cycles or loops in the network.

### How it Works
1.  **Input Layer**: Receives the raw data (e.g., pixels of an image, features of a house).
2.  **Hidden Layers**: The "magic" happens here. Each neuron in a hidden layer takes a weighted sum of the previous layer's outputs, adds a bias, and applies an activation function.
    -   Mathematically: $h = f(W_1 x + b_1)$
3.  **Output Layer**: Produces the final prediction (e.g., probability of "Cat").
    -   Mathematically: $y = f(W_2 h + b_2)$
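The two formulas above can be written out directly with tensors. In this sketch the sizes, the random weights, and the choice of activations (ReLU for the hidden layer, softmax for the output) are assumptions for illustration; the text leaves $f$ unspecified.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4)                             # input layer: 4 raw features
W1, b1 = torch.randn(3, 4), torch.randn(3)     # hidden layer parameters
W2, b2 = torch.randn(2, 3), torch.randn(2)     # output layer parameters

h = torch.relu(W1 @ x + b1)                    # h = f(W_1 x + b_1), with f = ReLU
y = torch.softmax(W2 @ h + b2, dim=0)          # y = f(W_2 h + b_2), with f = softmax

print(h.shape, y.shape)   # torch.Size([3]) torch.Size([2])
print(y.sum())            # the softmax outputs sum to 1, like probabilities
```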

### Why Hidden Layers?
Hidden layers allow the network to learn **intermediate representations**. For example, if the input is pixels:
-   Layer 1 might learn to detect edges.
-   Layer 2 might combine edges to detect shapes (circles, squares).
-   Layer 3 might combine shapes to detect objects (wheels, eyes).



### The Universal Approximation Theorem
A feed-forward network with a single hidden layer containing a finite number of neurons can approximate continuous functions on compact subsets of $\mathbb{R}^n$, under mild assumptions on the activation function.

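**Translation**: In simple terms, a neural network with enough hidden neurons can approximate any continuous function (curve) on a bounded region as closely as you like. It's a universal function approximator! Note that the theorem only says such a network exists; it does not promise that training will find it. This is why we moved from Perceptrons (limited to lines) to FNNs.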

**PyTorch Connection:**


In [8]:
# A simple FNN with one hidden layer
model = nn.Sequential(
    nn.Linear(784, 128),   # Input to Hidden (784 pixels -> 128 features)
    nn.ReLU(),             # Activation
    nn.Linear(128, 10)     # Hidden to Output (10 classes)
)
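As a usage sketch, the same model can be run on a dummy batch. The batch of random 28x28 "images" is an assumption standing in for real data such as handwritten digits; the model outputs raw scores (logits), which softmax turns into class probabilities.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)

images = torch.randn(32, 1, 28, 28)            # hypothetical batch of grayscale images
flat = images.flatten(start_dim=1)             # (32, 1, 28, 28) -> (32, 784)
logits = model(flat)                           # (32, 10): one score per class
probs = torch.softmax(logits, dim=1)           # each row sums to 1

n_params = sum(p.numel() for p in model.parameters())
print(logits.shape, n_params)  # torch.Size([32, 10]) 101770
```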

## Convolutional Neural Networks (CNNs)

<p align="center">
    <img src = "https://media.geeksforgeeks.org/wp-content/uploads/20250924160202277839/23.webp" width = "500">
</p>

CNNs are the superstars of **image recognition**. While an FNN flattens an image into a plain vector and has no built-in notion of which pixels are neighbors, CNNs exploit spatial structure (like knowing that an eye is usually next to a nose).



### How it Works: The Mechanics

1.  **Convolution (The Filter)**: A small matrix (kernel) slides over the image, performing element-wise multiplication and summation. This detects features like edges or curves.
    -   Mathematically: $(I * K)(i, j) = \sum_m \sum_n I(i+m, j+n) K(m, n)$
    -   **Stride**: How many pixels the filter moves at a time.
    -   **Padding**: Adding zeros around the border so the output does not shrink (e.g., to keep the image size constant).
2.  **Activation (ReLU)**: Turns negative values to zero (introducing non-linearity).
3.  **Pooling (Downsampling)**: Reduces the size of the feature map (e.g., taking the maximum value in a 2x2 grid). This makes the network computationally efficient and robust to small shifts in the image.
4.  **Flattening & Fully Connected**: The final 2D feature maps are flattened into a 1D vector and fed into a standard FNN for classification.
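The sliding-window formula in step 1 can be sketched with two loops and compared against PyTorch's built-in `F.conv2d`. The helper name `conv2d_naive`, the 5x5 test image, and the vertical-edge kernel are assumptions for illustration; the sketch handles a single channel with stride 1 and no padding.

```python
import torch
import torch.nn.functional as F

def conv2d_naive(image, kernel):
    """(I * K)(i, j) = sum_m sum_n I(i+m, j+n) K(m, n), stride 1, no padding."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = torch.zeros(H - kH + 1, W - kW + 1)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (image[i:i + kH, j:j + kW] * kernel).sum()
    return out

image = torch.arange(25, dtype=torch.float32).reshape(5, 5)
kernel = torch.tensor([[1.0, 0.0, -1.0]] * 3)        # simple vertical-edge filter

naive = conv2d_naive(image, kernel)
builtin = F.conv2d(image[None, None], kernel[None, None])[0, 0]
print(torch.allclose(naive, builtin))

# Padding and stride change the output size
same = F.conv2d(image[None, None], kernel[None, None], padding=1)     # stays 5x5
strided = F.conv2d(image[None, None], kernel[None, None], stride=2)   # shrinks to 2x2
print(naive.shape, same.shape[-2:], strided.shape[-2:])
```

Because the test image increases by 1 per column, every 3x3 window gives the same response of -6: the filter sees the same left-to-right gradient everywhere.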


### Intuition: The Assembly Line

<p align="center">
    <img src="img/linear_vs_affine.png" width="500">
    <br>
    <em>Why we need CNNs: Linear vs. Affine</em>
</p>

Imagine a factory assembly line inspecting a photo of a vehicle:
1.  **Convolutional Layers (The Specialists)**: These are like workers scanning the object with specific tools (filters).
    -   Early layers detect simple lines and edges.
    -   Middle layers detect shapes (circles, squares).
    -   Later layers detect complex objects (wheels, headlights).
2.  **Pooling Layers (The Summarizers)**: These workers simplify the report. If a 'wheel' was found in the top-left, they just note 'wheel present' without recording its exact millimeter position. This makes the network faster and more robust.
3.  **Fully Connected Layers (The Decision Makers)**: They take the final summary ('2 wheels', 'handlebars', 'frame') and make the final classification: 'This is a bicycle'.



In [None]:
# 3 input channels (RGB), 16 output channels (features), 3x3 filter
conv_layer = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
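Putting the four stages together, here is a minimal sketch of a tiny CNN. The 32x32 RGB input size, the `padding=1` setting, and the 10 output classes are assumptions chosen so the shapes are easy to follow; it is not a tuned architecture.

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),  # (3, 32, 32) -> (16, 32, 32)
    nn.ReLU(),                                                            # non-linearity
    nn.MaxPool2d(kernel_size=2),                                          # (16, 32, 32) -> (16, 16, 16)
    nn.Flatten(),                                                         # -> 16 * 16 * 16 = 4096 values
    nn.Linear(16 * 16 * 16, 10),                                          # fully connected classifier
)

images = torch.randn(8, 3, 32, 32)   # hypothetical batch of 8 RGB images
logits = cnn(images)
print(logits.shape)  # torch.Size([8, 10])
```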

----

## Recurrent Neural Networks (RNNs)

<p align="center">
    <img src = "https://media.geeksforgeeks.org/wp-content/uploads/20250523171309383561/recurrent_neural_network.webp" width = "500">
</p>

Standard networks have no memory of the past. **RNNs have memory.** They are designed for **sequential data** like text, speech, or stock prices.


### How it Works: The Hidden State
Unlike FNNs, RNNs have a **Hidden State** ($h_t$) that acts as memory. At each time step $t$:
1.  The network takes the **current input** ($x_t$) AND the **previous hidden state** ($h_{t-1}$).
2.  It calculates the **new hidden state** ($h_t$).
3.  It produces an **output** ($y_t$).

Mathematically: $h_t = \tanh(W x_t + U h_{t-1} + b)$
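The recurrence can be unrolled by hand with a simple loop. In this sketch the sizes, the small random weights, and the zero initial state $h_0$ are assumptions for illustration; the same $W$, $U$, and $b$ are reused at every time step.

```python
import torch

torch.manual_seed(0)
input_size, hidden_size, seq_len = 10, 20, 5

W = torch.randn(hidden_size, input_size) * 0.1    # input-to-hidden weights
U = torch.randn(hidden_size, hidden_size) * 0.1   # hidden-to-hidden weights
b = torch.zeros(hidden_size)

xs = torch.randn(seq_len, input_size)   # one input vector per time step
h = torch.zeros(hidden_size)            # h_0: empty memory

states = []
for x_t in xs:
    h = torch.tanh(W @ x_t + U @ h + b)  # h_t = tanh(W x_t + U h_{t-1} + b)
    states.append(h)

print(len(states), h.shape)  # 5 torch.Size([20])
```

The final `h` depends on every earlier input through the chain of hidden states, which is the sense in which the network has memory.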

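**The Challenge**: Standard RNNs suffer from **Vanishing Gradients**: as the error signal is propagated back through many time steps it shrinks toward zero, so the network struggles to learn from information seen long ago. This led to gated architectures like **LSTMs** (Long Short-Term Memory) and **GRUs** (Gated Recurrent Units).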


### Intuition: Reading a Sentence
Imagine reading the sentence: *'I grew up in France... I speak fluent ____'.*

To fill in the blank with 'French', you need to remember the word 'France' from the beginning of the sentence. A standard FNN might look at 'fluent' and guess 'English' or 'Spanish' randomly.

An RNN processes words one by one, passing a **hidden state** (memory) from the previous step to the current one. It 'remembers' the context of 'France' while processing 'fluent'.



In [11]:
# Input size 10, Hidden memory size 20
rnn_layer = nn.RNN(input_size=10, hidden_size=20, batch_first=True)
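A short usage sketch for this layer: with `batch_first=True` the input has shape `(batch, seq_len, input_size)`. The batch of random sequences is an assumption for illustration. The layer returns the hidden state at every time step plus the final hidden state.

```python
import torch
import torch.nn as nn

rnn_layer = nn.RNN(input_size=10, hidden_size=20, batch_first=True)

batch = torch.randn(4, 7, 10)        # hypothetical: 4 sequences, 7 steps, 10 features each
outputs, h_n = rnn_layer(batch)

print(outputs.shape)  # torch.Size([4, 7, 20]): hidden state at every time step
print(h_n.shape)      # torch.Size([1, 4, 20]): final hidden state per sequence

# For this single-layer, one-directional RNN the last output equals the final state
print(torch.allclose(outputs[:, -1, :], h_n[0]))
```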

----

## Transformer Networks

<p align="center">
    <img src = "https://media.geeksforgeeks.org/wp-content/uploads/20250924111849816889/encoder_decoder_image.webp" width = "500">
</p>

Transformers are the architecture behind most modern NLP (Natural Language Processing) systems. They power models like **ChatGPT** and **BERT**.

### Intuition: Attention Mechanism
RNNs are great, but they get 'tired' reading long books. They often forget the beginning of a paragraph by the time they reach the end.

**Transformers don't read sequentially.** They look at the entire sentence at once and use **Self-Attention** to understand context.

Think of the sentence: *'The animal didn't cross the street because **it** was too tired.'*
When the model processes the word **'it'**, the Attention mechanism highlights **'animal'** heavily, understanding that 'it' refers to the animal, not the street.
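A minimal sketch of how self-attention computes those "highlights", assuming the standard scaled dot-product formulation (the text above does not spell out the formula). The token embeddings and projection matrices are random placeholders, so the attention weights here are not meaningful; the sketch only shows the mechanics for a single attention head.

```python
import math
import torch

torch.manual_seed(0)
seq_len, d_model = 6, 8                      # hypothetical: 6 tokens, 8-dim embeddings
X = torch.randn(seq_len, d_model)            # one embedding vector per token

# Random stand-ins for the learned query, key, and value projections
Wq, Wk, Wv = (torch.randn(d_model, d_model) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

scores = Q @ K.T / math.sqrt(d_model)        # how strongly each token matches every other token
weights = torch.softmax(scores, dim=-1)      # each row sums to 1: where this token "looks"
context = weights @ V                        # weighted mix of all tokens' values

print(weights.shape, context.shape)  # torch.Size([6, 6]) torch.Size([6, 8])
```

Every token attends to every other token in one matrix multiplication, which is why the whole sentence can be processed at once rather than word by word.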

- Wikipedia overview [here](https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture))
- All ChatGPT models are transformer LLMs!
- Helpful overview [here](https://www.geeksforgeeks.org/machine-learning/getting-started-with-transformers/)


----

## Summary 📚
- Neural networks are *much* more complicated on average than the algorithms we've seen thus far
- Neural networks leverage **layers** to extract complex relationships
- Feature extraction happens within the network itself, but your inputs may still require preprocessing!