---
### Exercise: Two-Neuron ReLU Network and Gradient Descent

#### Given:
- Activation function: $\sigma(x) = \max(0, x)$ (ReLU)  
- Input $x \in \mathbb{R}$
- Neural network with two neurons:
   - $w_1 = 1, b_1 = -1$
   - $w_2 = -1, b_2 = 0$

Output function:
$$\hat{y}(x) = \sigma(w_1 x + b_1) + \sigma(w_2 x + b_2) \in \mathbb{R}$$
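As a starting point for the tasks below, the network can be written as a short Python function. This is a minimal sketch: the names `relu` and `y_hat` are illustrative choices, and the default arguments are simply the parameter values given above.

```python
def relu(z):
    """ReLU activation: max(0, z)."""
    return max(0.0, z)


def y_hat(x, w1=1.0, b1=-1.0, w2=-1.0, b2=0.0):
    """Output of the two-neuron network for a scalar input x."""
    return relu(w1 * x + b1) + relu(w2 * x + b2)


# Evaluate the network at a few inputs to get a feel for its shape.
for x in (-2.0, -0.5, 0.0, 0.5, 1.0, 2.0):
    print(f"x = {x:5.1f}  ->  y_hat = {y_hat(x):.2f}")
```

Evaluating `y_hat` on a dense grid of inputs and passing the result to any plotting library gives the curve asked for in the first task.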

#### Tasks:

1. **Plot the output function**  
   Plot $\hat{y}(x)$ as a function of $x$.

2. **Stochastic Gradient Descent on a Single Sample**  
   Given a training sample $(x_0, y_0) = (-0.5, 1)$:

   - **(2a)** Plot the sample $(x_0, y_0)$ on the previous plot.
   - **(2b)** Determine the direction in which the weights $w_1, w_2$ and biases $b_1, b_2$ will move after a gradient step. Justify your answer without performing full gradient calculations.
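Once you have reasoned about (2b) by hand, a finite-difference check can confirm the signs. The sketch below assumes a squared-error loss $L = \tfrac{1}{2}(\hat{y}(x_0) - y_0)^2$, which the exercise does not specify; the helper names `loss` and `numerical_gradient` are likewise illustrative. A gradient descent step moves each parameter in the direction opposite to the sign of its gradient.

```python
def relu(z):
    return max(0.0, z)


def loss(params, x0=-0.5, y0=1.0):
    """Squared-error loss on the single sample (an assumed loss function)."""
    w1, b1, w2, b2 = params
    pred = relu(w1 * x0 + b1) + relu(w2 * x0 + b2)
    return 0.5 * (pred - y0) ** 2


def numerical_gradient(params, eps=1e-6):
    """Central finite differences, one parameter at a time."""
    grads = []
    for i in range(len(params)):
        plus = list(params)
        minus = list(params)
        plus[i] += eps
        minus[i] -= eps
        grads.append((loss(plus) - loss(minus)) / (2 * eps))
    return grads


names = ["w1", "b1", "w2", "b2"]
params = [1.0, -1.0, -1.0, 0.0]  # values given in the exercise
grads = numerical_gradient(params)
for name, g in zip(names, grads):
    print(f"dL/d{name} = {g:+.4f}")
```

Comparing the printed signs with your own argument is a quick way to catch a mistake about which neuron is active at $x_0$.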


---
### Exercise: FLOPs and Parameters

#### **FLOP Definition**  
A **FLOP (Floating Point Operation)** is a basic arithmetic operation (e.g., addition or multiplication) performed on scalar values.

#### **Tasks**  
1. **Scalar Product**  
   Given two vectors $a, b \in \mathbb{R}^d$, determine the number of FLOPs required to compute their dot product $a^\top b$.

2. **Linear Transformation**  
   Consider a fully connected (linear) layer with input dimension $d$ and output dimension $d'$ (denoted as `Linear(d, d')`):
   - How many **parameters** (weights) does this layer have (assuming no bias)?
   - How many **FLOPs** are required for a single forward pass through this layer?
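To check a closed-form answer empirically, one option is to instrument a naive implementation and count operations as they happen. The sketch below is one such counter under a stated convention: the accumulator starts from the first product, so no addition with zero is counted. The function names `counted_dot` and `counted_linear` and the sizes `d = 4`, `d_out = 3` are illustrative assumptions.

```python
def counted_dot(a, b):
    """Return (dot product, FLOP count), counting each scalar + and * once."""
    assert len(a) == len(b) and len(a) > 0
    total = a[0] * b[0]  # first product, no addition needed
    flops = 1
    for ai, bi in zip(a[1:], b[1:]):
        total += ai * bi
        flops += 2  # one multiplication and one addition
    return total, flops


def counted_linear(W, x):
    """Apply a bias-free linear layer; W holds one row of weights per output."""
    outputs, flops = [], 0
    for row in W:
        value, row_flops = counted_dot(row, x)
        outputs.append(value)
        flops += row_flops
    return outputs, flops


d, d_out = 4, 3
x = [1.0] * d
W = [[0.5] * d for _ in range(d_out)]

_, dot_flops = counted_dot(x, x)
y, layer_flops = counted_linear(W, x)
n_params = sum(len(row) for row in W)
print(f"dot product FLOPs: {dot_flops}")
print(f"layer parameters:  {n_params}")
print(f"layer FLOPs:       {layer_flops}")
```

Running the counter for several values of `d` and `d_out` lets you compare the measured counts against your formulas.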


---
### **General NN Questions**
1. Explain the vanishing/exploding gradient problem in training neural networks.
2. Why are normalization layers used?
3. Why are residual connections used?
4. Why is weight initialization important? Give an example with a linear layer.
5. What is the link between a Linear layer and a convolutional layer?
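For the questions on gradients and initialization, a small numerical experiment can make the effect of weight scale concrete. The sketch below stacks purely linear layers (no activation, no normalization) with Gaussian weights and reports the standard deviation of the final activations. The width `d = 256`, the depth of 20, and the three scales are arbitrary choices made for illustration.

```python
import numpy as np


def final_activation_std(weight_std, d=256, depth=20, seed=0):
    """Push a random vector through `depth` random linear layers."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(d)
    for _ in range(depth):
        W = rng.standard_normal((d, d)) * weight_std
        h = W @ h
    return float(h.std())


d = 256
scales = {
    "0.5 / sqrt(d)": 0.5 / np.sqrt(d),
    "1.0 / sqrt(d)": 1.0 / np.sqrt(d),
    "2.0 / sqrt(d)": 2.0 / np.sqrt(d),
}
results = {label: final_activation_std(std, d=d) for label, std in scales.items()}
for label, value in results.items():
    print(f"weight std = {label}:  activation std after 20 layers = {value:.3e}")
```

The same multiplicative effect applies to gradients flowing backward through the layers, which is one way to connect the initialization question to the vanishing/exploding gradient question.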

### **Transformer Questions**
1. What is the purpose of self-attention in a transformer model?
2. Why do we scale the dot-product attention scores by $1/\sqrt{d}$?
3. Why do we apply the **softmax function** in the attention mechanism?
4. How are the parameters of the $W_Q, W_K, W_V$ projection matrices set?
5. What is the time complexity of the self-attention mechanism?
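When working through these questions, it can help to have a concrete single-head implementation to refer to. The following NumPy sketch is an illustration under assumptions, not a reference implementation: the sizes `n`, `d_model`, and `d` are arbitrary, and the projection matrices hold random placeholder values used only so the code runs.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over n tokens."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # shape (n, n): one score per token pair
    weights = softmax(scores)      # each row is non-negative and sums to 1
    return weights @ V, weights


rng = np.random.default_rng(0)
n, d_model, d = 5, 8, 4  # illustrative sizes
X = rng.standard_normal((n, d_model))
W_q, W_k, W_v = (rng.standard_normal((d_model, d)) for _ in range(3))

out, weights = self_attention(X, W_q, W_k, W_v)
print("output shape:", out.shape)
print("attention matrix shape:", weights.shape)
print("row sums:", weights.sum(axis=-1))
```

The shape of the `scores` matrix is a useful hint for the complexity question, and removing the division by `np.sqrt(d)` while increasing `d` shows how the softmax rows change without the scaling.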
