# Exercise set 5

We are going to implement the *cross-entropy* loss ourselves.

## Exercise 5.1 (2 points)

First, we will implement a numerically stable version of $\log(\text{softmax}(\ldots))$. We will be given an input of shape `(N,C)` corresponding to `N` vectors (e.g., the output of an MLP for `N` inputs) of dimensionality `C`. Class labels are numbered from `0` to `C-1`, and our vector of *targets* has shape `(N,)`.

Recall that, for an input $\mathbf{x} \in \mathbb{R}^C$, the softmax operator computes
$$ \text{softmax}(\mathbf{x}) = \left[\frac{ e^{x_1}}{\sum_c e^{x_c}}, \cdots, \frac{e^{x_C}}{\sum_c e^{x_c}}\right]$$
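
As a quick sanity check of the definition, the following minimal sketch evaluates the formula directly for one small, hand-picked vector and compares it with `torch.softmax`. The vector `x` is an arbitrary illustration; the direct formula is only safe here because the entries are small.

```python
import torch

# Arbitrary small input vector (C = 3), chosen only for illustration.
x = torch.tensor([1.0, 2.0, 3.0])

# Direct evaluation of the softmax definition.
manual = torch.exp(x) / torch.exp(x).sum()

print(manual)        # all entries are positive
print(manual.sum())  # the entries sum to 1 (up to rounding)

# Agreement with the built-in operator along the only dimension.
print(torch.allclose(manual, torch.softmax(x, dim=0)))
```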

If the correct label is at the $i$-th entry, the **cross-entropy** loss for $\mathbf{x}$ is

$$ -\log(\text{softmax}(\mathbf{x})_i) = -\log\left( \frac{ e^{x_i}}{\sum_c e^{x_c}} \right) = -x_i + \log\left(\sum_c e^{x_c}\right) $$

In particular, the log-sum-exp term

$$ \log\left(\sum_{c=1}^C e^{x_c}\right) $$

can be problematic numerically: for large $x_c$, $e^{x_c}$ overflows in floating-point arithmetic. Implement a function `stable_log_softmax` that receives the `(N,C)` tensor as input and returns a tensor of shape `(N,C)` that holds at position `(n,j)` the value $\log(\text{softmax}(\mathbf{x}_n)_j)$. In **Exercise 5.2** (below), we will use this tensor to select the correct entries for computing the **cross-entropy loss**.
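
To see the problem concretely, the sketch below evaluates the sum term for a single vector with large entries, once naively and once using the common max-shift identity $\log\sum_c e^{x_c} = m + \log\sum_c e^{x_c - m}$ with $m = \max_c x_c$. The identity is a standard technique and not necessarily the approach the exercise intends; the values in `x` are made up for illustration, and the example covers a single vector only, not the full `(N,C)` case.

```python
import torch

# Made-up vector with large entries (float32 by default).
x = torch.tensor([1000.0, 1001.0, 1002.0])

# Naive evaluation: exp(1000.) overflows to inf in float32,
# so the sum and its logarithm are inf as well.
naive = torch.log(torch.exp(x).sum())

# Max-shift: subtract the maximum before exponentiating (all exponents
# are then <= 0, so exp cannot overflow) and add it back after the log.
m = x.max()
shifted = m + torch.log(torch.exp(x - m).sum())

print(naive)    # inf
print(shifted)  # finite, roughly 1002.41
```

The shifted result can be compared against `torch.logsumexp(x, dim=0)`, which computes the same quantity.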

## Exercise 5.2 (4 points)

Implement the **cross-entropy loss (for logits)** in the function `cross_entropy_from_scratch`.

As its first argument, `logits`, this function takes an input of shape `(N,C)` corresponding to `N` vectors (e.g., the output of an MLP for `N` inputs) of dimensionality `C`. As its second argument, it takes a tensor of shape `(N,)` that holds class labels (integers in the range `0` to `C-1`).

Further, the function takes a third argument, `ignore_index`, and a fourth argument, `reduction`. `ignore_index` specifies which class label to ignore in the cross-entropy computation, and `reduction` should support `'mean'` and `'sum'`. Here's an example call to `cross_entropy_from_scratch`:

```python
import torch

N = torch.randint(16, 128, ()).item()
C = torch.randint(3, 10, ()).item()
logits = torch.randn(N, C, device='cpu', requires_grad=True)
targets = torch.randint(0, C, (N,), device='cpu')

loss = cross_entropy_from_scratch(logits, targets, reduction='sum', ignore_index=0)
```
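One possible way to test an implementation is to compare it with `torch.nn.functional.cross_entropy` on random inputs, as in the sketch below. This assumes that the intended behaviour (including how `'mean'` treats ignored entries) matches the PyTorch function, which the exercise text does not state explicitly. The helper name `check_against_reference` and the tolerance are hypothetical choices.

```python
import torch
import torch.nn.functional as F

def check_against_reference(fn, trials=5, rtol=1e-4):
    """Hypothetical helper: compare `fn` with F.cross_entropy on random inputs.

    `fn` is assumed to accept (logits, targets, ignore_index=..., reduction=...).
    """
    for _ in range(trials):
        N = torch.randint(16, 128, ()).item()
        C = torch.randint(3, 10, ()).item()
        logits = torch.randn(N, C)
        targets = torch.randint(0, C, (N,))
        # Make sure at least one label is never ignored below (C-1 >= 2),
        # so that the 'mean' reduction is well defined.
        targets[0] = C - 1
        for reduction in ("mean", "sum"):
            for ignore_index in (-1, 0):
                expected = F.cross_entropy(
                    logits, targets, ignore_index=ignore_index, reduction=reduction
                )
                actual = fn(
                    logits, targets, ignore_index=ignore_index, reduction=reduction
                )
                if not torch.allclose(actual, expected, rtol=rtol):
                    return False
    return True
```

Once `cross_entropy_from_scratch` is implemented, calling `check_against_reference(cross_entropy_from_scratch)` should return `True` under the assumption above.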

In detail, all instances whose label equals `ignore_index` (e.g., `ignore_index=0` as above) are ignored; with `ignore_index=-1`, nothing is ignored. Next, apply `stable_log_softmax` to the `logits` tensor to get all log-softmax values, negate them, and select the *correct* column entry per row (i.e., for row `i`, the column whose index is stored at the `i`-th position in `targets`). **Hint**: read the `torch.gather` documentation to get this step done quickly. Finally, depending on whether `reduction` is `'mean'` or `'sum'`, return either the average or the sum of the selected values.
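
The row-wise selection and the masking can be illustrated on a toy tensor. This is a minimal sketch with made-up numbers: `values` merely stands in for the negated log-softmax matrix, and masking with a boolean index is one possible way to drop ignored entries, not necessarily the intended one.

```python
import torch

# Toy (N=3, C=4) matrix standing in for the negated log-softmax values.
values = torch.tensor([[10., 11., 12., 13.],
                       [20., 21., 22., 23.],
                       [30., 31., 32., 33.]])
targets = torch.tensor([2, 0, 3])

# torch.gather needs an index with the same number of dimensions as the
# input, so targets is reshaped from (N,) to (N, 1) and squeezed back.
picked = values.gather(dim=1, index=targets.unsqueeze(1)).squeeze(1)
print(picked)  # tensor([12., 20., 33.])

# Drop every row whose label equals ignore_index (here: the second row).
ignore_index = 0
keep = targets != ignore_index
print(picked[keep])  # tensor([12., 33.])
```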