# Deep Learning — Assessment

This assessment aligns with materials in `Deep-Learning/Code - Notes` and `Deep-Learning/Master - Notes` (Transformers, RNNs, CNNs, Attention, TensorFlow/PyTorch basics). Focus: foundational theory, architectures, training dynamics, and small coding utilities (framework-light).

Total questions: 25 (10 Theory, 8 Fill-in-the-Blanks, 7 Coding). Difficulty mix: 40% easy, 40% medium, 20% hard.


## Instructions
- Answer all questions.
- Coding tasks are framework-agnostic or NumPy-based to avoid heavy dependencies; asserts included.
- Solutions are provided at the bottom.


## References
- Code/notes across `Deep-Learning/Code - Notes` and `Deep-Learning/Master - Notes` (e.g., Transformer, RNN/LSTM, CNN, Positional Encoding, TensorFlow/PyTorch overviews)


## Part A — Theory (10)
1. Explain the difference between training loss and validation loss. What does divergence indicate?
2. MCQ: Which activation helps mitigate vanishing gradients? (a) sigmoid (b) tanh (c) ReLU (d) linear
3. Describe how backpropagation uses the chain rule to update weights.
4. What is overfitting? Name three techniques to reduce it in deep nets.
5. MCQ: In attention, the output is a weighted sum of (a) queries (b) keys (c) values (d) biases
6. Contrast RNN, LSTM, and GRU with respect to long-term dependency handling.
7. What is positional encoding in Transformers and why is it needed?
8. Explain the concept of teacher forcing in seq2seq training and a potential downside.
9. MCQ: BatchNorm typically (a) speeds convergence (b) eliminates need for LR tuning (c) prevents overfitting always (d) replaces dropout
10. Why do CNNs share weights spatially? What benefit does this confer?


## Part B — Fill in the Blanks (8)
1. The gradient descent update is `w ← w − η * ______`.
2. Dropout randomly sets activations to zero during ______.
3. In self-attention, the similarity between query and key is computed before applying ______ over the scores.
4. In LSTM, the gate that controls memory content removal is the ______ gate.
5. The Transformer replaces recurrence with ______ mechanisms.
6. To stabilize training, gradients may be clipped by ______.
7. In convolution, the operation uses a learnable ______ (a small matrix) sliding over the input.
8. Layer normalization normalizes across the ______ dimension for each sample.


## Part C — Coding Tasks (7)
Implement each function with NumPy, then run the asserts.

Tasks:
1. `relu(x)` — elementwise ReLU.
2. `softmax(x, axis=-1)` — numerically stable along given axis.
3. `cross_entropy(pred_probs, targets)` — mean CE for one-hot targets.
4. `positional_encoding(max_len, d_model)` — sinusoidal PE matrix [max_len, d_model].
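5. `scaled_dot_attention(q, k, v, mask=None)` — compute `softmax(q k^T / sqrt(d)) v` with an optional boolean mask (`True` marks positions to exclude; their scores are set to -inf, or a large negative value, before the softmax).
6. `layer_norm(x, eps=1e-5)` — normalize each row (last axis) to zero mean and unit variance.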
7. `gru_cell(x_t, h_prev, Wx, Wh, b)` — single-step GRU: return h_t.
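
For reference, the formulas behind tasks 2 to 7 can be written roughly as follows. This is a sketch of the standard definitions, not a required derivation. The superscripts $(z)$, $(r)$, $(h)$ are notation assumed here for the three column blocks of `Wx`, `Wh`, and `b`, and the last line follows one common GRU convention (some references swap the roles of $z_t$ and $1 - z_t$).

```latex
\begin{aligned}
\mathrm{softmax}(x)_i &= \frac{e^{x_i - m}}{\sum_k e^{x_k - m}}, \qquad m = \max_j x_j \\
\mathrm{CE}(p, y) &= -\frac{1}{N} \sum_{n=1}^{N} \sum_{c} y_{n,c} \log p_{n,c} \\
PE_{(pos,\,2i)} &= \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right) \\
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V \\
\mathrm{LN}(x) &= \frac{x - \mu}{\sqrt{\sigma^{2} + \epsilon}} \\
z_t &= \sigma\!\left(x_t W_x^{(z)} + h_{t-1} W_h^{(z)} + b^{(z)}\right) \\
r_t &= \sigma\!\left(x_t W_x^{(r)} + h_{t-1} W_h^{(r)} + b^{(r)}\right) \\
\tilde{h}_t &= \tanh\!\left(x_t W_x^{(h)} + (r_t \odot h_{t-1}) W_h^{(h)} + b^{(h)}\right) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```

Subtracting $m$ in the softmax does not change the result; it only keeps the exponentials from overflowing.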


In [None]:
import numpy as np

def relu(x):
    x = np.asarray(x)
    return np.maximum(0, x)

def softmax(x, axis=-1):
    x = np.asarray(x, float)
    x = x - np.max(x, axis=axis, keepdims=True)
    ex = np.exp(x)
    return ex / np.sum(ex, axis=axis, keepdims=True)

def cross_entropy(pred_probs, targets):
    p = np.asarray(pred_probs, float)
    y = np.asarray(targets, float)
    eps = 1e-12
    p = np.clip(p, eps, 1.0)
    return float(-(y * np.log(p)).sum(axis=1).mean())

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]
    i = np.arange(d_model)[None, :]
    angle_rates = 1 / np.power(10000, (2*(i//2))/d_model)
    angles = pos * angle_rates
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

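def scaled_dot_attention(q, k, v, mask=None):
    q, k, v = (np.asarray(a, float) for a in (q, k, v))
    d = q.shape[-1]
    # Swap only the last two axes so batched inputs work too (k.T reverses every axis).
    scores = (q @ np.swapaxes(k, -1, -2)) / np.sqrt(d)
    if mask is not None:
        m = np.asarray(mask, bool)
        # True = masked out; -1e9 stands in for -inf and avoids NaN if a whole row is masked.
        scores = np.where(m, -1e9, scores)
    probs = softmax(scores, axis=-1)
    return probs @ v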

def layer_norm(x, eps=1e-5):
    x = np.asarray(x, float)
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def sigmoid(z):
    return 1/(1+np.exp(-z))

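def gru_cell(x_t, h_prev, Wx, Wh, b):
    """
    Single GRU step. Column blocks are ordered [update z, reset r, candidate].
    Wx: weights for x -> [input_dim, 3*hidden]
    Wh: weights for h -> [hidden, 3*hidden]
    b: bias [3*hidden]
    """
    x_t = np.asarray(x_t, float)
    h_prev = np.asarray(h_prev, float)
    H = h_prev.shape[-1]
    # Gate pre-activations only need the first 2*H columns.
    gates = x_t @ Wx[:, :2*H] + h_prev @ Wh[:, :2*H] + b[:2*H]
    zt = sigmoid(gates[..., :H])
    rt = sigmoid(gates[..., H:])
    # Candidate state: the reset gate scales h_prev before the recurrent projection.
    ht_tilde = np.tanh(x_t @ Wx[:, 2*H:] + (rt * h_prev) @ Wh[:, 2*H:] + b[2*H:])
    # Convention used here: zt -> 1 moves h_t toward the candidate (some references swap the roles).
    h_t = (1 - zt) * h_prev + zt * ht_tilde
    return h_t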


In [None]:
# Asserts
assert np.all(relu([-1,0,2]) == np.array([0,0,2]))

logits = np.array([[1.0, 2.0, 3.0]])
probs = softmax(logits, axis=-1)
assert np.allclose(probs.sum(axis=-1), 1.0)  # each row sums to 1

p = np.array([[0.2,0.8]])
y = np.array([[0,1]])
ce = cross_entropy(p,y)
assert np.isclose(ce, -np.log(0.8))  # only the true class contributes

pe = positional_encoding(4, 6)
assert pe.shape == (4,6)

q = np.array([[1.,0.]])
k = np.array([[1.,0.],[0.,1.]])
v = np.array([[1.,2.],[3.,4.]])
out = scaled_dot_attention(q,k,v)
assert out.shape == (1,2)

ln = layer_norm(np.array([[1.,2.,3.]]))
assert np.allclose(ln.mean(), 0, atol=1e-6)

rng = np.random.default_rng(0)
inp, hid = 5, 4
Wx = rng.normal(size=(inp, 3*hid))
Wh = rng.normal(size=(hid, 3*hid))
b = rng.normal(size=(3*hid,))
h0 = np.zeros(hid)
x0 = rng.normal(size=(inp,))
h1 = gru_cell(x0, h0, Wx, Wh, b)
assert h1.shape == (hid,)

print('Deep-Learning asserts passed ✅')
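
The asserts above do not exercise the `mask` argument. As a hedged illustration of the masking convention from task 5 (`True` means "masked out"), the sketch below builds a causal mask and checks that no position attends to later positions. The helper `masked_attention_weights` is a hypothetical name introduced here; it returns only the attention probabilities and re-declares `softmax` so the snippet runs on its own.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - np.max(x, axis=axis, keepdims=True)
    ex = np.exp(x)
    return ex / np.sum(ex, axis=axis, keepdims=True)

def masked_attention_weights(q, k, mask):
    """Attention probabilities; mask is boolean with True = masked-out position."""
    scores = (q @ np.swapaxes(k, -1, -2)) / np.sqrt(q.shape[-1])
    scores = np.where(mask, -1e9, scores)  # -1e9 approximates -inf
    return softmax(scores, axis=-1)

seq_len, d = 4, 8
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d))

# Causal mask: True strictly above the diagonal, i.e. at future positions.
causal_mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
weights = masked_attention_weights(x, x, causal_mask)

assert np.allclose(weights.sum(axis=-1), 1.0)   # rows are still distributions
assert np.allclose(weights[causal_mask], 0.0)   # no weight on future positions
assert np.isclose(weights[0, 0], 1.0)           # first token can only see itself
```

Using a large negative constant instead of `-np.inf` is a design choice: a fully masked row then yields a uniform distribution rather than NaN.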


## Solutions

### Theory (sample)
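1. Training loss measures fit to the training data; validation loss estimates generalization to unseen data. Validation loss rising while training loss keeps falling suggests overfitting or a shift between training and validation data.
2. (c) ReLU
3. Backpropagation applies the chain rule layer by layer, from the loss back to each weight, to compute ∂L/∂w; the optimizer then uses these gradients to update the weights.
4. Overfitting: the model fits the training data (including its noise) but generalizes poorly. Remedies: regularization (dropout, weight decay), data augmentation, early stopping.
5. (c) values
6. A vanilla RNN suffers from vanishing gradients over long sequences. LSTM and GRU add gates that mitigate this: LSTM keeps a separate cell state with input, forget, and output gates; GRU uses update and reset gates with no separate cell state.
7. Self-attention by itself ignores token order, so positional encodings (e.g., sin/cos patterns) are added to the inputs to inject sequence order that attention can use.
8. Teacher forcing feeds the ground-truth previous token to the decoder during training. Downside: exposure bias, because at inference the model must condition on its own predictions.
9. (a) speeds convergence
10. The same kernel is applied at every spatial location, which reduces the parameter count, exploits locality, and lets a feature be detected wherever it appears.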
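
To make answer 3 concrete, here is a minimal sketch of backpropagation on a tiny two-layer network, checked against finite differences. The network shape, the tanh activation (chosen because it is smooth, which keeps the numerical check well behaved), and the function names `forward`, `backward`, and `numerical_grad` are all assumptions made for illustration.

```python
import numpy as np

def forward(W1, W2, x, y):
    """Tiny net: x -> tanh(x @ W1) -> @ W2, with a squared-error loss."""
    a = x @ W1
    h = np.tanh(a)
    y_hat = h @ W2
    loss = 0.5 * np.sum((y_hat - y) ** 2)
    return loss, (h, y_hat)

def backward(W1, W2, x, y):
    """Chain rule applied from the loss back to each weight matrix."""
    _, (h, y_hat) = forward(W1, W2, x, y)
    d_yhat = y_hat - y           # dL/dy_hat
    dW2 = h.T @ d_yhat           # dL/dW2
    d_h = d_yhat @ W2.T          # dL/dh
    d_a = d_h * (1.0 - h ** 2)   # tanh'(a) = 1 - tanh(a)^2
    dW1 = x.T @ d_a              # dL/dW1
    return dW1, dW2

def numerical_grad(loss_fn, W, eps=1e-6):
    """Central finite differences; perturbs W in place and restores it."""
    g = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        old = W[idx]
        W[idx] = old + eps
        loss_plus = loss_fn()
        W[idx] = old - eps
        loss_minus = loss_fn()
        W[idx] = old
        g[idx] = (loss_plus - loss_minus) / (2 * eps)
    return g

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
y = rng.normal(size=(3, 2))
W1 = rng.normal(size=(4, 5))
W2 = rng.normal(size=(5, 2))

dW1, dW2 = backward(W1, W2, x, y)
num_dW1 = numerical_grad(lambda: forward(W1, W2, x, y)[0], W1)
num_dW2 = numerical_grad(lambda: forward(W1, W2, x, y)[0], W2)

assert np.allclose(dW1, num_dW1, atol=1e-5)
assert np.allclose(dW2, num_dW2, atol=1e-5)
```

Each line of `backward` is one application of the chain rule: the gradient with respect to a layer's input is the upstream gradient multiplied by that layer's local derivative.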

### Fill blanks
1. gradient (∂L/∂w)
2. training
3. softmax
4. forget
5. attention
6. norm (or value)
7. kernel/filter
8. feature
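
Answers 1 and 6 can be combined in a short sketch: clip the gradients by their global L2 norm, then apply the update `w ← w − η * ∂L/∂w`. The helper names `clip_by_global_norm` and `sgd_step`, and the toy parameter values, are hypothetical and chosen only for illustration.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))  # no-op when already within the limit
    return [g * scale for g in grads], total

def sgd_step(params, grads, lr):
    """Plain gradient descent: w <- w - lr * dL/dw for each parameter array."""
    return [w - lr * g for w, g in zip(params, grads)]

params = [np.array([1.0, -2.0]), np.array([[0.5]])]
grads = [np.array([3.0, 4.0]), np.array([[12.0]])]  # global norm = sqrt(9 + 16 + 144) = 13

clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
new_params = sgd_step(params, clipped, lr=0.1)

assert np.isclose(norm_before, 13.0)
assert np.isclose(norm_after, 1.0)
```

Clipping by norm rescales all gradients by the same factor, so the update direction is preserved; clipping by value would instead clamp each element independently.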
