# Module 02: Introduction to Deep Learning

**Understanding the Revolution**

---

## Objectives

By the end of this notebook, you will:
- Understand the difference between deep learning and traditional machine learning
- Know the key milestones in deep learning history
- Grasp why deep learning works (Universal Approximation Theorem intuition)
- Understand why GPUs are essential for deep learning

**Prerequisites:** [Module 01 - Python & Math Prerequisites](../01_python_math_prerequisites/01_prerequisites.ipynb)

---

In [None]:
import numpy as np
import matplotlib.pyplot as plt

plt.style.use('seaborn-v0_8-whitegrid')
np.random.seed(42)

---

# Part 1: Machine Learning vs Deep Learning

---

## 1.1 Traditional Machine Learning

In traditional ML, the workflow is:

1. **Raw Data** -> 2. **Feature Engineering** (manual) -> 3. **Model** -> 4. **Output**

The key bottleneck is **feature engineering** - humans must decide what features matter.

### Example: Recognizing Cats

Traditional ML approach:
- Extract edges using Sobel filters
- Compute color histograms
- Calculate texture features (SIFT, HOG)
- Feed these features to an SVM or Random Forest

**Problem:** What features define a cat? Pointy ears? Fur texture? Whiskers? This requires domain expertise and is hard to generalize.
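
To make one of the hand-crafted features above concrete, here is a minimal NumPy sketch of a Sobel edge filter applied to a toy image. The helper name `filter2d` and the 8x8 step-edge image are illustrative assumptions, not part of any library; a real pipeline would more likely use an image-processing package.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Standard 3x3 Sobel kernels for horizontal and vertical intensity changes
SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def filter2d(img, kernel):
    """Slide `kernel` over `img` (cross-correlation, no padding). Hypothetical helper."""
    windows = sliding_window_view(img, kernel.shape)  # shape (H-2, W-2, 3, 3)
    return np.einsum('ijkl,kl->ij', windows, kernel)

# Toy image: dark left half, bright right half (a single vertical edge)
toy = np.zeros((8, 8))
toy[:, 4:] = 1.0

gx = filter2d(toy, SOBEL_X)        # responds to left-right changes
gy = filter2d(toy, SOBEL_Y)        # responds to top-bottom changes
edge_strength = np.hypot(gx, gy)   # gradient magnitude per pixel

print("Edge strength along one row:", edge_strength[0])
print("Hand-crafted feature (mean edge strength):", edge_strength.mean())
```

The filter responds only near the column where brightness changes, which is the kind of signal a human engineer would then summarize and pass to an SVM or Random Forest.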

In [None]:
# Traditional ML: Manual feature extraction
# Simulate an image as pixel values
image = np.random.rand(28, 28)

# Manual features (simplified)
def extract_features(img):
    features = []
    features.append(np.mean(img))        # Average pixel value
    features.append(np.std(img))         # Standard deviation
    features.append(np.max(img))         # Maximum
    features.append(np.min(img))         # Minimum
    # More features would be needed...
    return np.array(features)

features = extract_features(image)
print(f"Manually extracted features: {features}")
print(f"From {28*28} pixels to {len(features)} features")

## 1.2 Deep Learning: Automatic Feature Learning

In deep learning, the workflow is:

1. **Raw Data** -> 2. **Neural Network** (learns features automatically) -> 3. **Output**

The network learns features directly from data - no manual engineering needed!

### How Deep Learning Learns Features

A deep network learns hierarchical representations:

- **Layer 1:** Edges, simple patterns
- **Layer 2:** Textures, simple shapes
- **Layer 3:** Object parts (ears, eyes)
- **Layer 4+:** Full objects, abstract concepts

Each layer builds on the previous, creating increasingly abstract representations.

In [None]:
# Visualization: Feature hierarchy
fig, axes = plt.subplots(1, 4, figsize=(14, 3))

# Layer 1: Edges
edge_patterns = np.array([[[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],
                          [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],
                          [[0, 1, 0], [1, 1, 1], [0, 1, 0]],
                          [[1, 0, -1], [0, 0, 0], [-1, 0, 1]]])

for i, ax in enumerate(axes):
    if i == 0:
        ax.imshow(edge_patterns[0], cmap='gray')
        ax.set_title('Layer 1: Edges')
    elif i == 1:
        # Simulated texture pattern
        texture = np.random.rand(8, 8) * 0.5 + 0.25
        texture[2:6, 2:6] = 0.8
        ax.imshow(texture, cmap='gray')
        ax.set_title('Layer 2: Textures')
    elif i == 2:
        # Simulated part (eye-like)
        part = np.zeros((16, 16))
        y, x = np.ogrid[:16, :16]
        center = (8, 8)
        mask = (x - center[0])**2 + (y - center[1])**2 <= 5**2
        part[mask] = 1
        part[7:9, 7:9] = 0  # pupil
        ax.imshow(part, cmap='gray')
        ax.set_title('Layer 3: Parts')
    else:
        # Full object (abstract cat)
        cat = np.zeros((32, 32))
        cat[8:24, 8:24] = 0.5  # face
        cat[4:10, 6:10] = 0.8  # left ear
        cat[4:10, 22:26] = 0.8  # right ear
        cat[14:16, 12:14] = 1  # left eye
        cat[14:16, 18:20] = 1  # right eye
        cat[18:20, 14:18] = 0.3  # nose
        ax.imshow(cat, cmap='gray')
        ax.set_title('Layer 4+: Objects')
    ax.axis('off')

plt.suptitle('Hierarchical Feature Learning in Deep Networks', fontsize=12)
plt.tight_layout()
plt.show()

## 1.3 Key Differences Summary

| Aspect | Traditional ML | Deep Learning |
|--------|---------------|---------------|
| Features | Manual engineering | Automatic learning |
| Data needed | Hundreds-thousands | Thousands-millions |
| Compute | CPU usually sufficient | GPU usually needed for large models |
| Interpretability | Often clearer | Often "black box" |
| Domain expertise | Required for features | Less required |
| Performance ceiling | Limited by features | Scales with data |

## 1.4 When to Use Which?

**Use Traditional ML when:**
- Limited data (hundreds to low thousands)
- Need interpretability (medical, finance)
- Structured data (tables, not images/text)
- Limited compute resources

**Use Deep Learning when:**
- Large dataset available
- Complex patterns (images, speech, text)
- Compute resources available
- State-of-the-art performance needed

---

# Part 2: History and Evolution

---

## 2.1 Timeline of Key Milestones

Understanding history helps you understand why things are the way they are.

In [None]:
# Timeline visualization
milestones = {
    1958: "Perceptron\n(Rosenblatt)",
    1969: "XOR Problem\nAI Winter begins",
    1986: "Backpropagation\n(Rumelhart et al.)",
    1998: "LeNet-5\n(LeCun)",
    2006: "Deep Belief Nets\nDeep learning revival",
    2012: "AlexNet\nImageNet breakthrough",
    2014: "GANs, VGGNet\nDropout",
    2017: "Transformer\nAttention mechanism",
    2018: "BERT\n(NLP revolution)",
    2020: "GPT-3\n175B parameters",
    2022: "ChatGPT\nAI goes mainstream"
}

fig, ax = plt.subplots(figsize=(16, 6))

years = list(milestones.keys())
ax.set_xlim(1955, 2025)
ax.set_ylim(-1, 1)

# Draw timeline
ax.axhline(y=0, color='gray', linewidth=2)

for i, (year, event) in enumerate(milestones.items()):
    y_offset = 0.5 if i % 2 == 0 else -0.5
    ax.scatter([year], [0], color='blue', s=100, zorder=5)
    ax.plot([year, year], [0, y_offset*0.8], 'b-', linewidth=1)
    ax.text(year, y_offset, event, ha='center', va='center' if y_offset > 0 else 'top',
            fontsize=8, bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.8))

ax.set_xlabel('Year', fontsize=12)
ax.set_title('Deep Learning Timeline: Key Milestones', fontsize=14)
ax.set_yticks([])
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
plt.tight_layout()
plt.show()

## 2.2 Key Historical Moments

### The Perceptron (1958)
Frank Rosenblatt invented the perceptron - a single artificial neuron that could learn. The media claimed it would solve AI.

### AI Winter (1969-1980s)
Minsky & Papert showed that single-layer perceptrons cannot represent XOR, a function that is not linearly separable. Funding dried up. The "AI Winter" began.

### Backpropagation (1986)
Rumelhart, Hinton, and Williams popularized backpropagation - enabling training of multi-layer networks. Multi-layer networks can represent XOR, the function a single perceptron cannot.
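
The XOR story can be checked in a few lines. The sketch below is illustrative only: the weights of the two-layer network are picked by hand rather than learned with backpropagation, and the single-neuron search covers a coarse grid of weights (a demonstration, not a proof).

```python
import itertools
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
xor_target = np.array([0, 1, 1, 0])

def step(z):
    """Threshold activation used by the classic perceptron."""
    return (z > 0).astype(int)

# Single neuron: try a coarse grid of weights and biases; none reproduces XOR
grid = np.arange(-2.0, 2.01, 0.5)
single_layer_hits = sum(
    np.array_equal(step(X @ np.array([w1, w2]) + b), xor_target)
    for w1, w2, b in itertools.product(grid, repeat=3)
)
print("Single-neuron settings that solve XOR:", single_layer_hits)

# Two-layer network with hand-picked weights:
# hidden unit 1 computes OR, hidden unit 2 computes AND
W_hidden = np.array([[1.0, 1.0],
                     [1.0, 1.0]])
b_hidden = np.array([-0.5, -1.5])
w_out = np.array([1.0, -1.0])   # output fires for "OR but not AND"
b_out = -0.5

hidden = step(X @ W_hidden + b_hidden)
two_layer_output = step(hidden @ w_out + b_out)
print("Two-layer output:", two_layer_output)
```

The hidden layer re-describes each input as (OR, AND), and in that new description XOR becomes linearly separable.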

### ImageNet Moment (2012)
AlexNet (Krizhevsky, Sutskever, Hinton) won ImageNet by a huge margin using deep CNNs on GPUs. This sparked the current revolution.

## 2.3 Why the Sudden Success?

Three factors converged:

1. **Data:** Internet produced massive datasets (ImageNet, Wikipedia, etc.)
2. **Compute:** GPUs made training roughly 10-100x faster than CPUs
3. **Algorithms:** Better architectures, initialization, regularization

In [None]:
# Visualization: The three factors
fig, axes = plt.subplots(1, 3, figsize=(14, 4))

# Data growth
years = np.arange(2000, 2024)
data_volume = 0.1 * np.exp(0.4 * (years - 2000))  # Exponential growth
axes[0].plot(years, data_volume, 'b-', linewidth=2)
axes[0].fill_between(years, data_volume, alpha=0.3)
axes[0].set_xlabel('Year')
axes[0].set_ylabel('Data Volume (relative)')
axes[0].set_title('1. DATA: Exponential Growth')
axes[0].set_yscale('log')

# Compute (GPU TFLOPs)
gpu_years = [2008, 2010, 2012, 2014, 2016, 2018, 2020, 2022]
gpu_flops = [0.5, 1, 3, 5, 10, 15, 30, 80]  # Approximate TFLOPs
axes[1].bar(gpu_years, gpu_flops, color='green', alpha=0.7, width=1.5)
axes[1].set_xlabel('Year')
axes[1].set_ylabel('GPU TFLOPs')
axes[1].set_title('2. COMPUTE: GPU Power')

# Algorithm improvements (ImageNet error rate)
alg_years = [2010, 2012, 2014, 2015, 2016, 2017, 2018]
error_rates = [28, 16, 7, 4, 3, 2.5, 2]  # Approximate top-5 error %
axes[2].plot(alg_years, error_rates, 'r-o', linewidth=2, markersize=8)
axes[2].axhline(y=5.1, color='gray', linestyle='--', label='Human level')
axes[2].set_xlabel('Year')
axes[2].set_ylabel('ImageNet Error Rate (%)')
axes[2].set_title('3. ALGORITHMS: Better Performance')
axes[2].legend()

plt.tight_layout()
plt.show()

---

# Part 3: Why Deep Learning Works

---

## 3.1 The Universal Approximation Theorem

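**Theorem:** A feedforward neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function on a compact subset of R^n to arbitrary accuracy, given a suitable non-linear activation function (for example, a sigmoid).

**In simple terms:** With enough hidden neurons, a neural network can get as close as you like to any continuous function on a bounded input range.

This is why neural networks are called universal function approximators. Note that the theorem only says such a network exists - it does not say how many neurons are needed or that training will find the right weights.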

## 3.2 Intuition: How Can Neurons Approximate Any Function?

Think of it like this:
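- Each sigmoid neuron acts as a smooth "step"; a step up combined with a step down makes a localized "bump"
- By combining many narrow bumps of different heights, you can approximate any continuous curve
- More neurons = more bumps = finer approximation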
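
A quick numerical check of "more neurons = finer approximation" is sketched below. It is a simplification, not standard training: the hidden weights are random and frozen (a "random features" shortcut), and only the output weights are fitted by least squares. The weight ranges and hidden-layer sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    return np.sin(2 * x) + 0.5 * np.cos(4 * x)

x_grid = np.linspace(-3, 3, 500)
y_grid = target(x_grid)

# One pool of random hidden units; smaller networks reuse the first n of them
max_hidden = 100
w = rng.normal(0.0, 3.0, size=max_hidden)       # input-to-hidden weights
b = rng.uniform(-6.0, 6.0, size=max_hidden)     # hidden biases
hidden_all = np.tanh(np.outer(x_grid, w) + b)   # shape (500, max_hidden)

mse_by_width = {}
for n_hidden in [5, 20, 100]:
    # Hidden activations plus a constant column for the output bias
    H = np.column_stack([hidden_all[:, :n_hidden], np.ones_like(x_grid)])
    # Only the output weights are fitted, by linear least squares
    v, *_ = np.linalg.lstsq(H, y_grid, rcond=None)
    mse_by_width[n_hidden] = np.mean((H @ v - y_grid) ** 2)
    print(f"{n_hidden:>3} hidden units -> MSE {mse_by_width[n_hidden]:.6f}")
```

Because each wider network contains the narrower one, the fitting error cannot increase as hidden units are added.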

In [None]:
# Demonstration: Approximating a function with "bumps"
def target_function(x):
    return np.sin(2 * x) + 0.5 * np.cos(4 * x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def create_bump(x, center, width, height):
    """Create a sigmoid-based bump at a location."""
    left = sigmoid((x - center + width/2) * 10)
    right = sigmoid(-(x - center - width/2) * 10)
    return height * left * right

x = np.linspace(-3, 3, 500)
y_target = target_function(x)

fig, axes = plt.subplots(2, 2, figsize=(12, 8))

for idx, n_bumps in enumerate([2, 5, 10, 20]):
    ax = axes[idx // 2, idx % 2]
    
    # Create approximation with n bumps
    centers = np.linspace(-2.5, 2.5, n_bumps)
    width = 5 / n_bumps
    
    approximation = np.zeros_like(x)
    for c in centers:
        # Height determined by target function at center
        h = target_function(c)
        approximation += create_bump(x, c, width, h)
    
    ax.plot(x, y_target, 'b-', linewidth=2, label='Target function')
    ax.plot(x, approximation, 'r--', linewidth=2, label=f'Approximation ({n_bumps} bumps)')
    ax.set_title(f'{n_bumps} Neurons/Bumps')
    ax.legend()
    ax.set_xlim(-3, 3)
    ax.set_ylim(-2, 2)

plt.suptitle('Universal Approximation: More Neurons = Better Approximation', fontsize=12)
plt.tight_layout()
plt.show()

## 3.3 Why Depth Matters

The theorem says ONE hidden layer is enough. So why go deep?

**Depth gives efficiency:**
- A deep network can represent some functions with exponentially fewer neurons than a shallow one
- Shallow: might need 2^n neurons
- Deep: might need only O(n) neurons

**Depth gives hierarchy:**
- Early layers learn simple features
- Later layers compose them into complex features
- This matches the structure of real-world data
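
The efficiency claim can be illustrated with a classic construction, sketched here under simple assumptions: a scalar input in [0, 1] and a hand-built two-unit ReLU layer (`tent_layer`, a name chosen for this example). Stacking the layer k times produces a zig-zag with 2^k linear pieces from only 2k hidden units, whereas a single-hidden-layer ReLU network with m units on a scalar input has at most m + 1 linear pieces.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def tent_layer(x):
    """One hidden layer with two ReLU units: maps 0 -> 0, 0.5 -> 1, 1 -> 0."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def count_linear_pieces(y):
    """Count the pieces of a sampled zig-zag by counting slope sign flips."""
    slope_sign = np.sign(np.diff(y))
    return 1 + int(np.count_nonzero(np.diff(slope_sign)))

# Dyadic grid, so every kink of the zig-zag lands exactly on a sample point
x_unit = np.linspace(0.0, 1.0, 2**12 + 1)

pieces_by_depth = {}
out = x_unit
for depth in range(1, 6):
    out = tent_layer(out)   # stack one more 2-unit layer
    pieces_by_depth[depth] = count_linear_pieces(out)
    print(f"depth {depth}: {2 * depth} ReLU units -> {pieces_by_depth[depth]} linear pieces")
```

Matching the depth-5 zig-zag with one hidden layer would take at least 31 ReLU units instead of 10, and the gap doubles with every extra layer.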

In [None]:
# Visualization: Shallow vs Deep
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Shallow network
ax = axes[0]
# Input layer
for i in range(3):
    ax.scatter([0], [i], s=300, c='lightblue', edgecolors='black', zorder=5)
# Hidden layer (many neurons needed)
for i in range(8):
    ax.scatter([1], [i*0.4 - 0.4], s=300, c='lightgreen', edgecolors='black', zorder=5)
# Output
ax.scatter([2], [1], s=300, c='salmon', edgecolors='black', zorder=5)

# Connections
for i in range(3):
    for j in range(8):
        ax.plot([0, 1], [i, j*0.4-0.4], 'gray', alpha=0.3, linewidth=0.5)
for j in range(8):
    ax.plot([1, 2], [j*0.4-0.4, 1], 'gray', alpha=0.3, linewidth=0.5)

ax.set_xlim(-0.5, 2.5)
ax.set_title('Shallow Network\n(Many neurons in one layer)')
ax.axis('off')

# Deep network
ax = axes[1]
# Input layer
for i in range(3):
    ax.scatter([0], [i*0.5+0.25], s=300, c='lightblue', edgecolors='black', zorder=5)
# Hidden layers (fewer neurons per layer)
for layer in range(1, 4):
    for i in range(4):
        ax.scatter([layer], [i*0.4+0.3], s=300, c='lightgreen', edgecolors='black', zorder=5)
# Output
ax.scatter([4], [0.9], s=300, c='salmon', edgecolors='black', zorder=5)

# Connections
prev_positions = [i*0.5+0.25 for i in range(3)]
for layer in range(1, 4):
    curr_positions = [i*0.4+0.3 for i in range(4)]
    for p in prev_positions:
        for c in curr_positions:
            ax.plot([layer-1, layer], [p, c], 'gray', alpha=0.3, linewidth=0.5)
    prev_positions = curr_positions

for p in prev_positions:
    ax.plot([3, 4], [p, 0.9], 'gray', alpha=0.3, linewidth=0.5)

ax.set_xlim(-0.5, 4.5)
ax.set_title('Deep Network\n(Fewer neurons, more layers)')
ax.axis('off')

plt.suptitle('Shallow vs Deep: Same Representational Power, Different Efficiency', fontsize=12)
plt.tight_layout()
plt.show()

## 3.4 Representation Learning

The real power of deep learning is **learning representations** - finding the right way to encode information.

- Raw pixels are a poor representation for classification
- The network learns to transform pixels into features that make classification easy
- The final layer sees "cat-like features" not pixels
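
A toy illustration of why the representation matters is sketched below. The data (an inner disk surrounded by a ring) and the helper `linear_accuracy` are assumptions made for this example, and the "good" representation (distance from the origin) is chosen by hand as a stand-in for what a network would have to learn.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_class = 200

# Class 0: inner disk; class 1: surrounding ring
radius = np.concatenate([rng.uniform(0.0, 1.0, n_per_class),
                         rng.uniform(2.0, 3.0, n_per_class)])
angle = rng.uniform(0.0, 2 * np.pi, 2 * n_per_class)
points = np.column_stack([radius * np.cos(angle), radius * np.sin(angle)])
labels = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])

def linear_accuracy(features, labels):
    """Fit a least-squares linear classifier and return its training accuracy."""
    design = np.column_stack([features, np.ones(len(features))])
    coef, *_ = np.linalg.lstsq(design, 2 * labels - 1, rcond=None)
    predictions = (design @ coef > 0).astype(float)
    return np.mean(predictions == labels)

# Same classifier, two representations of the same points
acc_raw = linear_accuracy(points, labels)  # raw (x, y) coordinates
acc_radius = linear_accuracy(np.linalg.norm(points, axis=1, keepdims=True), labels)

print(f"Linear classifier on raw coordinates: {acc_raw:.2f}")
print(f"Linear classifier on radius feature:  {acc_radius:.2f}")
```

The classifier is identical in both cases; only the encoding of the input changes, which is the job the hidden layers of a deep network take over.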

---

# Part 4: Hardware - CPUs vs GPUs

---

## 4.1 Why GPUs?

Neural network training is dominated by matrix multiplication. Let's see why GPUs excel at this.

### CPU Architecture
- Few cores (4-16 typically)
- Each core is very powerful
- Optimized for sequential tasks
- Good for: complex logic, branching code

### GPU Architecture
- Hundreds to thousands of small cores
- Each core is simple
- Optimized for parallel tasks
- Good for: same operation on many data points

In [None]:
# Visualization: CPU vs GPU architecture
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# CPU
ax = axes[0]
# 8 big cores
for i in range(2):
    for j in range(4):
        rect = plt.Rectangle((j*1.2, i*1.2), 1, 1, fill=True, 
                             facecolor='royalblue', edgecolor='black', linewidth=2)
        ax.add_patch(rect)
        ax.text(j*1.2 + 0.5, i*1.2 + 0.5, f'Core\n{i*4+j+1}', 
                ha='center', va='center', fontsize=10, color='white', fontweight='bold')

ax.set_xlim(-0.5, 5.5)
ax.set_ylim(-0.5, 3)
ax.set_aspect('equal')
ax.set_title('CPU: 8 Powerful Cores\n(Sequential processing)', fontsize=12)
ax.axis('off')

# GPU
ax = axes[1]
# Many small cores
for i in range(8):
    for j in range(16):
        rect = plt.Rectangle((j*0.35, i*0.35), 0.3, 0.3, fill=True, 
                             facecolor='green', edgecolor='darkgreen', linewidth=0.5)
        ax.add_patch(rect)

ax.set_xlim(-0.5, 6)
ax.set_ylim(-0.5, 3.5)
ax.set_aspect('equal')
ax.set_title('GPU: 128+ Simple Cores\n(Parallel processing)', fontsize=12)
ax.axis('off')

plt.suptitle('CPU vs GPU Architecture', fontsize=14)
plt.tight_layout()
plt.show()

## 4.2 Matrix Multiplication is Parallel

Each element of the output matrix can be computed independently:

```
C[i,j] = sum(A[i,:] * B[:,j])
```

For a 1000x1000 output matrix, that's 1,000,000 independent dot products - a perfect fit for a GPU's many cores!
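
The formula can be spelled out as explicit loops to make the independence visible. This is a small sketch with arbitrary 4x3 and 3x5 matrices; the variable names are chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
A_small = rng.standard_normal((4, 3))
B_small = rng.standard_normal((3, 5))

C_loop = np.empty((4, 5))
for i in range(4):
    for j in range(5):
        # Each output entry depends only on row i of A and column j of B,
        # so all 20 entries could be computed at the same time.
        C_loop[i, j] = np.sum(A_small[i, :] * B_small[:, j])

print("Matches A @ B:", np.allclose(C_loop, A_small @ B_small))
```

No iteration of the inner loop reads a value written by another iteration, which is what lets a GPU assign each output entry to a different core.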

In [None]:
# Compare an estimated sequential run with NumPy's optimized (BLAS) matrix multiply
import time

sizes = [100, 500, 1000, 2000]
times_sequential = []
times_parallel = []

for size in sizes:
    A = np.random.randn(size, size)
    B = np.random.randn(size, size)
    
    # NumPy delegates to an optimized BLAS library (vectorized, often multi-threaded)
    start = time.perf_counter()
    C = A @ B
    times_parallel.append(time.perf_counter() - start)
    
    # Estimate (not measure) a naive sequential run:
    # each of the size^2 outputs needs 'size' multiply-adds, so size^3 in total
    ops = size ** 3  # O(n^3) operations
    times_sequential.append(ops / 1e9)  # Assumes ~1e9 multiply-adds per second

fig, ax = plt.subplots(figsize=(10, 5))

x = np.arange(len(sizes))
width = 0.35

bars1 = ax.bar(x - width/2, times_sequential, width, label='Sequential (simulated)', color='royalblue')
bars2 = ax.bar(x + width/2, times_parallel, width, label='Parallel (NumPy BLAS)', color='green')

ax.set_xlabel('Matrix Size')
ax.set_ylabel('Time (relative)')
ax.set_title('Matrix Multiplication: Sequential vs Parallel')
ax.set_xticks(x)
ax.set_xticklabels([f'{s}x{s}' for s in sizes])
ax.legend()
ax.set_yscale('log')

plt.tight_layout()
plt.show()

print("Key insight: As matrices get larger, parallelization advantage grows!")

## 4.3 CUDA and PyTorch

**CUDA** is NVIDIA's platform for GPU computing. PyTorch uses CUDA to run operations on GPU.

```python
import torch

# Check if GPU is available
if torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

# Move tensor to GPU
x = torch.randn(1000, 1000)
x = x.to(device)  # On the GPU if one is available, otherwise stays on the CPU
```

In [None]:
# Check GPU availability (will work even without GPU)
try:
    import torch
    print(f"PyTorch version: {torch.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")
    if torch.cuda.is_available():
        print(f"GPU: {torch.cuda.get_device_name(0)}")
        print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    else:
        print("Running on CPU - GPU not available")
except ImportError:
    print("PyTorch not installed. Install with: pip install torch")

## 4.4 Practical GPU Tips

1. **Batch operations:** GPU is efficient when processing many samples at once
2. **Minimize CPU-GPU transfers:** Moving data is expensive
3. **Use appropriate batch sizes:** Too small wastes GPU; too large runs out of memory
4. **Monitor GPU memory:** Use `nvidia-smi` or `torch.cuda.memory_allocated()`
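
A minimal sketch of how tips 1, 2, and 4 might look in PyTorch follows. The batch size and layer shape are arbitrary assumptions, and the code falls back to the CPU (or skips entirely) when no GPU or no PyTorch installation is present.

```python
try:
    import torch
except ImportError:
    torch = None  # PyTorch not installed; the sketch is skipped

if torch is not None:
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Tip 2: create tensors directly on the target device instead of
    # building them on the CPU and copying them over afterwards
    batch = torch.randn(64, 1000, device=device)
    weights = torch.randn(1000, 10, device=device)

    # Tip 1: one matrix multiply handles all 64 samples at once
    logits = batch @ weights

    # Tip 4: check how much GPU memory the tensors occupy
    if device.type == "cuda":
        print(f"GPU memory in use: {torch.cuda.memory_allocated() / 1e6:.1f} MB")

    result_shape = tuple(logits.shape)
    print(f"Device: {device}, output shape: {result_shape}")
else:
    result_shape = None
    print("PyTorch not installed. Install with: pip install torch")
```

Tip 3 is usually settled empirically: increase the batch size until throughput stops improving or memory runs out.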

---

# Key Points Summary

---

## Deep Learning vs Traditional ML
- Traditional ML requires manual feature engineering
- Deep learning learns features automatically from data
- Deep learning scales better with more data

## History
- Perceptron (1958) -> AI Winter -> Backpropagation (1986) -> ImageNet (2012) -> Today
- Three factors enabled the revolution: Data + Compute + Algorithms

## Why Deep Learning Works
- Universal Approximation: a large enough network can approximate any continuous function
- Depth gives efficiency and hierarchical representations
- Networks learn good representations of data

## Hardware
- GPUs have many simple cores vs CPU's few powerful cores
- Matrix multiplication is highly parallel
- PyTorch uses CUDA to leverage GPU power

---

# Interview Tips

---

## Common Questions

**Q: What is the difference between ML and DL?**
A: Traditional ML requires manual feature engineering while deep learning learns features automatically. DL uses neural networks with multiple layers to learn hierarchical representations from raw data.

**Q: Why did deep learning take off around 2012?**
A: Three factors converged: (1) Large datasets like ImageNet became available, (2) GPUs made training feasible, and (3) Algorithmic improvements like ReLU and dropout improved training.

**Q: What is the Universal Approximation Theorem?**
A: It states that a neural network with a single hidden layer can approximate any continuous function on a bounded domain to arbitrary accuracy, given enough neurons. It guarantees that such a network exists, not that training will find it.

**Q: Why use GPUs for deep learning?**
A: Neural network training is dominated by matrix operations, which are highly parallel. GPUs have thousands of cores designed for parallel computation, making them 10-100x faster than CPUs for this workload.

**Q: Why go deep? Why not just one wide layer?**
A: Deep networks can represent complex functions more efficiently than shallow ones. They also learn hierarchical features naturally - simple features in early layers, complex in later layers.

---

# Practice Exercises

---

## Exercise 1: Feature Engineering

You have a dataset of house prices with features: square footage, bedrooms, bathrooms, year built. For traditional ML, what additional features might you engineer?

In [None]:
# Your answer here (as comments)
# Example engineered features:
# 1. ?
# 2. ?
# 3. ?

## Exercise 2: When to Use What?

For each scenario, would you recommend traditional ML or deep learning? Why?

1. Predicting customer churn with 500 customers and 20 features
2. Classifying 1 million images into 1000 categories
3. Predicting stock prices based on historical data
4. Translating English to French

In [None]:
# Your answers here
# 1. 
# 2. 
# 3. 
# 4. 

## Exercise 3: GPU Memory Estimation

A neural network has:
- Input: 1000 features
- Hidden layer 1: 512 neurons
- Hidden layer 2: 256 neurons
- Output: 10 classes

How many parameters (weights + biases) does it have, and how much memory do they occupy? (Assume float32 = 4 bytes per parameter)

In [None]:
# Your calculation here
# Layer 1: ? weights + ? biases
# Layer 2: ? weights + ? biases
# Output: ? weights + ? biases
# Total: ?

## Solutions

In [None]:
# Solutions to Exercises 1-3
print("Exercise 1 - Engineered features for house prices:")
print("1. Total rooms (bedrooms + bathrooms)")
print("2. Age of house (current year - year built)")
print("3. Bathroom to bedroom ratio")
print("4. Square footage per bedroom")
print("5. Polynomial features (sqft^2, interactions)")
print("6. Log of square footage (if skewed)")

print("\nExercise 2 - ML vs DL:")
print("1. Customer churn (500 samples): Traditional ML - not enough data for DL")
print("2. 1M images: Deep Learning - large data, complex patterns")
print("3. Stock prices: Either - start with a traditional ML baseline; DL if there is enough history to learn complex patterns")
print("4. Translation: Deep Learning - sequential data, complex relationships")

print("\nExercise 3 - Parameter count:")
layer1 = 1000 * 512 + 512  # weights + biases
layer2 = 512 * 256 + 256
output_layer = 256 * 10 + 10
total = layer1 + layer2 + output_layer
print(f"Layer 1: {1000}*{512} + {512} = {layer1:,}")
print(f"Layer 2: {512}*{256} + {256} = {layer2:,}")
print(f"Output: {256}*{10} + {10} = {output_layer:,}")
print(f"Total parameters: {total:,}")
print(f"Memory (float32): {total * 4 / 1e6:.2f} MB")

---

## Next Module: [03 - PyTorch Fundamentals](../03_pytorch_fundamentals/03_pytorch_fundamentals.ipynb)

Now that we understand what deep learning is and why it works, let's dive into PyTorch - the framework we'll use to build neural networks.