# Module 18: Practice Problems & Interview Questions

**Test Your Knowledge**

---

## Section 1: Conceptual Questions

---

### Q1: What is the difference between a parameter and a hyperparameter?

<details>
<summary>Answer</summary>

- **Parameters**: Learned during training (weights, biases)
- **Hyperparameters**: Set before training (learning rate, batch size, number of layers)
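
A minimal sketch of the distinction, assuming a single `nn.Linear` layer; the values for `learning_rate`, `batch_size`, and `hidden_units` are illustrative and not taken from the course:

```python
import torch
import torch.nn as nn

# Hyperparameters: chosen by us before training (illustrative values).
learning_rate = 0.01
batch_size = 32
hidden_units = 4

# Parameters: created inside the layer and updated by the optimizer.
layer = nn.Linear(in_features=3, out_features=hidden_units)
num_params = sum(p.numel() for p in layer.parameters())  # 3*4 weights + 4 biases

optimizer = torch.optim.SGD(layer.parameters(), lr=learning_rate)
print(f"Learned parameters: {num_params}")
```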
</details>

### Q2: Why do we use activation functions?

<details>
<summary>Answer</summary>

Activation functions introduce non-linearity. Without them, a stack of linear layers collapses into a single linear (affine) transformation, no matter how many layers it has. Non-linearity is what allows networks to learn complex patterns.
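
The collapse can be checked numerically. This is a small sketch with random matrices standing in for two layers; biases are left out for brevity, and the shapes and seed are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))  # first "layer" (bias omitted, an assumption for brevity)
W2 = rng.normal(size=(2, 4))  # second "layer"
x = rng.normal(size=(3,))

two_layers = W2 @ (W1 @ x)    # two stacked linear layers, no activation
one_layer = (W2 @ W1) @ x     # a single equivalent linear layer
collapsed = np.allclose(two_layers, one_layer)

# With a ReLU in between, the output is in general no longer a linear function of x.
with_relu = W2 @ np.maximum(W1 @ x, 0.0)
print(collapsed)
```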
</details>

### Q3: What is the vanishing gradient problem?

<details>
<summary>Answer</summary>

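During backpropagation, the gradient reaching an early layer is a product of per-layer factors. When those factors are smaller than 1 (the sigmoid derivative is at most 0.25, and tanh saturates for large inputs), the product shrinks exponentially with depth, so early layers learn very slowly or not at all. Common remedies: ReLU, residual connections, proper initialization, BatchNorm.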
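
A rough sketch of how quickly this compounds with sigmoid. It only tracks the best-case factor contributed by the activation derivative and ignores the weights, which is a simplifying assumption:

```python
import math

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

max_grad = sigmoid_grad(0.0)  # 0.25, the largest value the derivative can take

# Upper bound on the factor contributed by the activations alone at each depth.
bounds = {depth: max_grad ** depth for depth in (5, 10, 20)}
for depth, bound in bounds.items():
    print(depth, bound)
```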
</details>

### Q4: Why is ReLU preferred over sigmoid?

<details>
<summary>Answer</summary>

1. No vanishing gradient for positive values (gradient is 1)
2. Computationally faster (no exp)
3. Leads to sparse activations (some neurons output 0)
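
The first point can be checked with autograd. This sketch compares the gradients of both activations at a few arbitrary sample inputs:

```python
import torch

x = torch.tensor([-2.0, 0.5, 3.0], requires_grad=True)

# Gradient of ReLU: 0 for negative inputs, 1 for positive inputs.
torch.relu(x).sum().backward()
relu_grad = x.grad.clone()

# Gradient of sigmoid: s * (1 - s), never larger than 0.25.
x.grad.zero_()
torch.sigmoid(x).sum().backward()
sigmoid_grad = x.grad.clone()

print(relu_grad, sigmoid_grad)
```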
</details>

### Q5: What is the purpose of dropout?

<details>
<summary>Answer</summary>

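Dropout is a regularization technique that randomly sets each activation to zero with probability `p` during training. This prevents co-adaptation of neurons and forces the network to learn more robust features. It is disabled during inference; PyTorch's `nn.Dropout` scales the surviving activations by `1 / (1 - p)` during training, so no rescaling is needed at inference.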
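
A short sketch of the two modes using `nn.Dropout`; the seed, `p=0.5`, and the all-ones input are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()
train_out = drop(x)  # each entry is either 0 or 1 / (1 - p) = 2

drop.eval()
eval_out = drop(x)   # dropout is a no-op in eval mode

print(train_out, eval_out)
```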
</details>

### Q6: What is batch normalization and why does it help?

<details>
<summary>Answer</summary>

BatchNorm normalizes each feature over the mini-batch to zero mean and unit variance, then applies a learnable scale and shift. At inference it uses running statistics collected during training. It:
- Stabilizes training
- Allows higher learning rates
- Reduces sensitivity to initialization
- Acts as regularization
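
A minimal sketch of the normalization step with `nn.BatchNorm1d` in training mode. The input statistics (mean around 10, standard deviation around 5) are made up for illustration; the scale and shift start at 1 and 0, so the output is simply the normalized batch:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(64, 3) * 5.0 + 10.0  # batch of 64 samples, 3 features

bn = nn.BatchNorm1d(num_features=3)
bn.train()                            # use batch statistics, update running stats
y = bn(x)

batch_mean = y.mean(dim=0)                  # close to 0 per feature
batch_var = y.var(dim=0, unbiased=False)    # close to 1 per feature
print(batch_mean, batch_var)
```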
</details>

### Q7: Explain the difference between SGD, SGD with momentum, and Adam.

<details>
<summary>Answer</summary>

- **SGD**: Updates weights using the current mini-batch gradient alone
- **SGD + Momentum**: Accumulates a velocity from past gradients, which smooths updates and can help move through flat regions and shallow local minima
- **Adam**: Combines momentum with adaptive learning rates per parameter. Usually works well out of the box.
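
The three update rules can be sketched on a one-dimensional toy problem, f(w) = w². The learning rate, step count, and starting point are arbitrary; the momentum and Adam formulas follow the common textbook form with the usual default coefficients:

```python
import math

def grad(w):
    return 2.0 * w  # gradient of the toy loss f(w) = w**2

lr, steps = 0.1, 50

# Plain SGD: step against the gradient.
w_sgd = 5.0
for _ in range(steps):
    w_sgd -= lr * grad(w_sgd)

# SGD with momentum: accumulate a velocity, then step along it.
w_mom, velocity, mu = 5.0, 0.0, 0.9
for _ in range(steps):
    velocity = mu * velocity + grad(w_mom)
    w_mom -= lr * velocity

# Adam: running mean of gradients (m) scaled by a running mean of squared gradients (v).
w_adam, m, v = 5.0, 0.0, 0.0
beta1, beta2, eps = 0.9, 0.999, 1e-8
for t in range(1, steps + 1):
    g = grad(w_adam)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)  # bias correction for the zero-initialized averages
    v_hat = v / (1 - beta2 ** t)
    w_adam -= lr * m_hat / (math.sqrt(v_hat) + eps)

print(w_sgd, w_mom, w_adam)
```

In practice these rules are available as `torch.optim.SGD` (with its `momentum` argument) and `torch.optim.Adam`.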
</details>

### Q8: Why does weight initialization matter?

<details>
<summary>Answer</summary>

Proper initialization:
- Prevents vanishing/exploding gradients at the start of training
- Keeps activation variance roughly constant across layers
- Depends on the activation: Xavier (Glorot) for tanh/sigmoid, He (Kaiming) for ReLU
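
A short sketch of both schemes using `torch.nn.init`. The layer size of 512 is an arbitrary choice; the measured standard deviations should land near the theoretical targets noted in the comments:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
tanh_layer = nn.Linear(512, 512)
relu_layer = nn.Linear(512, 512)

nn.init.xavier_uniform_(tanh_layer.weight)                       # Xavier / Glorot
nn.init.kaiming_normal_(relu_layer.weight, nonlinearity="relu")  # He / Kaiming

xavier_std = tanh_layer.weight.std().item()  # target: sqrt(2 / (fan_in + fan_out))
he_std = relu_layer.weight.std().item()      # target: sqrt(2 / fan_in)
print(f"Xavier std: {xavier_std:.4f}, He std: {he_std:.4f}")
```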
</details>

---

## Section 2: Coding Exercises

---

### Exercise 1: Implement Sigmoid from Scratch

```python
import torch
import numpy as np

def sigmoid(x):
    # Your code here
    pass

# Test
x = torch.tensor([0., 1., -1.])
# Expected: [0.5, 0.7311, 0.2689]
```

### Exercise 2: Implement Binary Cross-Entropy

```python
def binary_cross_entropy(y_true, y_pred):
    # Your code here
    pass

# Test
y_true = np.array([1, 0, 1])
y_pred = np.array([0.9, 0.1, 0.8])
# Expected: 0.1446
```

### Exercise 3: Build a 3-Layer MLP

```python
import torch.nn as nn

class ThreeLayerMLP(nn.Module):
    def __init__(self, input_size, hidden1, hidden2, output_size):
        super().__init__()
        # Your code here
        pass

    def forward(self, x):
        # Your code here
        pass
```

---

## Section 3: Solutions

---

```python
# Exercise 1 Solution
def sigmoid(x):
    return 1 / (1 + torch.exp(-x))

x = torch.tensor([0., 1., -1.])
print(f"Sigmoid: {sigmoid(x)}")
```

```
Sigmoid: tensor([0.5000, 0.7311, 0.2689])
```
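
As a sanity check, the hand-written version can be compared against PyTorch's built-in `torch.sigmoid`. This is a small sketch; the input range is an arbitrary choice:

```python
import torch

def sigmoid(x):
    return 1 / (1 + torch.exp(-x))

x = torch.linspace(-5, 5, steps=11)
matches_builtin = torch.allclose(sigmoid(x), torch.sigmoid(x))
print(matches_builtin)
```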


```python
# Exercise 2 Solution
def binary_cross_entropy(y_true, y_pred, epsilon=1e-15):
    # Clip predictions away from 0 and 1 so np.log never sees 0
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1])
y_pred = np.array([0.9, 0.1, 0.8])
print(f"BCE: {binary_cross_entropy(y_true, y_pred):.4f}")
```

```
BCE: 0.1446
```
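
The result can be cross-checked against `torch.nn.functional.binary_cross_entropy`, which averages over elements by default. The sketch below recomputes the manual value inline so it runs on its own:

```python
import numpy as np
import torch
import torch.nn.functional as F

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.1, 0.8])

# Manual BCE, same formula as the solution (no clipping needed for these values).
manual = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Built-in BCE: first argument is the prediction, second is the target.
builtin = F.binary_cross_entropy(torch.tensor(y_pred), torch.tensor(y_true)).item()
print(f"manual={manual:.4f}, builtin={builtin:.4f}")
```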


```python
# Exercise 3 Solution
class ThreeLayerMLP(nn.Module):
    def __init__(self, input_size, hidden1, hidden2, output_size):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden1)
        self.fc2 = nn.Linear(hidden1, hidden2)
        self.fc3 = nn.Linear(hidden2, output_size)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.fc3(x)  # no activation on the output layer (raw logits)
        return x

model = ThreeLayerMLP(784, 256, 128, 10)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
```

```
Parameters: 235,146
```
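
The parameter count can be verified by hand: each `nn.Linear` layer holds `n_in * n_out` weights plus `n_out` biases. A small sketch of that arithmetic, where `linear_params` is a helper name introduced here for illustration:

```python
def linear_params(n_in, n_out):
    # weights + biases of one fully connected layer
    return n_in * n_out + n_out

sizes = [784, 256, 128, 10]
per_layer = [linear_params(a, b) for a, b in zip(sizes, sizes[1:])]
total = sum(per_layer)

print(per_layer)  # [200960, 32896, 1290]
print(total)      # 235146
```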


---

# Congratulations! 🎉

You've completed the PyTorch Deep Learning Fundamentals course!

## Topics Covered:
1. Math Prerequisites
2. Introduction to Deep Learning
3. PyTorch Fundamentals
4. The Neuron
5. Activation Functions
6. Perceptron
7. Loss Functions
8. Gradient Descent
9. Backpropagation
10. Optimizers
11. Building Neural Networks
12. Training Pipeline
13. Regularization
14. Weight Initialization
15. Batch Normalization
16. Hyperparameter Tuning
17. Practical Project
18. Interview Prep

## Next Steps:
- Practice with real datasets (MNIST, CIFAR-10)
- Explore CNNs for image tasks
- Learn RNNs/Transformers for sequences
- Build projects!