# Homework 1

*   Goals:
    *   Implement cross-entropy loss
    *   Implement SDPA (scaled dot-product attention)


---


# Cross-Entropy Computation

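Given a target vector *y_true* and a prediction vector *y_pred* of raw scores produced by a model, write a function that computes the **Cross-Entropy Loss** between them. Because the formula expects a probability distribution, *y_pred* is first passed through a softmax.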

The formula for the cross-entropy loss is:

$H(t, p) = - \sum_{x \in X} t(x) \log p(x)$

- *t* is the **true probability** distribution
- *p* is the **predicted probability** distribution
- *X* is the **set of classes**
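
As a quick illustration of the formula, here is a minimal sketch that applies it to values that are already probabilities. The function name `cross_entropy_from_probs` and the example numbers are assumptions made for this illustration, not part of the assignment. With a one-hot target, only the term for the true class survives, so the loss reduces to $-\log p(x_{true})$.

```python
import math

def cross_entropy_from_probs(t, p):
    # Sum -t(x) * log p(x) over the classes; classes with t(x) = 0 contribute nothing
    return -sum(t_x * math.log(p_x) for t_x, p_x in zip(t, p) if t_x > 0)

t = [1.0, 0.0, 0.0]  # one-hot target: the first class is the true one
p = [0.7, 0.2, 0.1]  # hypothetical predicted probabilities (they sum to 1)

h = cross_entropy_from_probs(t, p)
print(h)  # equals -log(0.7), roughly 0.357
```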

In [9]:
import numpy as np

y_true = [1, 0, 0, 0, 0]
y_pred = [10, 5, 3, 1, 4]

def cross_entropy(y_true, y_pred):
    """
    cross-entropy implementation
    """
    # calculate the softmax of y_pred
    exp = np.exp(y_pred)
    sum_exp = np.sum(exp)
    softmax = exp / sum_exp
    # compute the cross-entropy according to the formula
    ce = -np.sum(y_true * np.log(softmax))

    return ce


loss = cross_entropy(y_true, y_pred)

print("Cross Entropy Loss: ", loss)
assert np.isclose(loss, 0.010199795719758164)  # compare floats with a tolerance, not ==
print("Success!") # nice!

Cross Entropy Loss:  0.010199795719758164
Success!
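
The softmax in the cell above exponentiates the raw scores directly, which can overflow once the scores get large (`np.exp(1000)` is `inf`). A common remedy is to subtract the maximum score before exponentiating and to work with the log-softmax. The sketch below shows one way this might look; the name `cross_entropy_stable` is hypothetical and the large-score input is made up for illustration.

```python
import numpy as np

def cross_entropy_stable(y_true, y_pred):
    """Cross-entropy between a target distribution and raw scores."""
    y_true = np.asarray(y_true, dtype=float)
    scores = np.asarray(y_pred, dtype=float)
    # Subtracting the maximum leaves the softmax unchanged but keeps exp() in range
    shifted = scores - np.max(scores)
    # log(softmax) computed directly, avoiding log(0) for very unlikely classes
    log_softmax = shifted - np.log(np.sum(np.exp(shifted)))
    return -np.sum(y_true * log_softmax)

# Same inputs as the homework cell
loss_small = cross_entropy_stable([1, 0, 0, 0, 0], [10, 5, 3, 1, 4])
# Scores this large would overflow in a softmax that calls exp() on them directly
loss_large = cross_entropy_stable([0, 1], [1000.0, 0.0])
print(loss_small, loss_large)
```

On the homework inputs this should agree with the original function up to floating-point rounding, which is one more reason to compare against the expected value with a tolerance.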


# Scaled Dot-Product Attention

***The task is to implement a simple version of the scaled dot-product attention mechanism.***

I will use PyTorch!

My implementation will:
*   take three tensors (query, key, and value), each with dimensions $[batch\_size, seq\_len, d_{model}]$,
*   compute attention using the scaled dot-product attention formula,
*   and return a tensor with the same dimensions as the input.

---

## Reference
> [Vaswani et al. (2017)](https://arxiv.org/pdf/1706.03762) is the famous paper that introduced the Transformer architecture.

> The most important concept is the *scaled dot-product attention*. Given *query*, *key*, and *value* matrices $Q, K, V$, this mechanism is defined as\
$\text{Attention}(Q,K,V)=\sigma\left(\frac{QK^T}{\sqrt{d_k}}\right)V$\
where $\sigma$ is the row-wise softmax and $d_k$ is the dimensionality of each row of $K$, i.e. of each key vector (in this single-head exercise, $d_k = d_q = d_v = d_{model}$).
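
To make the shapes in the formula concrete, here is a small NumPy sketch for a single unbatched sequence. The sizes `n`, `m`, `d_k`, `d_v` and the helper `softmax_rows` are assumptions chosen for illustration; the homework itself uses PyTorch and equal dimensions for all three inputs.

```python
import numpy as np

def softmax_rows(x):
    # Row-wise softmax; subtracting the row maximum keeps exp() in range
    shifted = x - x.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, m, d_k, d_v = 4, 6, 8, 5       # hypothetical sizes: n queries, m keys/values
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(m, d_k))
V = rng.normal(size=(m, d_v))

scores = Q @ K.T / np.sqrt(d_k)   # shape (n, m): one score per query/key pair
weights = softmax_rows(scores)    # shape (n, m): each row sums to 1
output = weights @ V              # shape (n, d_v): weighted average of value rows

print(scores.shape, weights.sum(axis=-1), output.shape)
```

Each output row is a convex combination of the rows of $V$, with weights determined by how well the corresponding query matches each key.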

In [10]:
import torch
from torch import softmax, einsum # einsum was suggested for the batched products

def sdp_attention(q, k, v):
  """
  Scaled dot-product attention.

  q, k, v: tensors of shape [batch_size, seq_len, d_model].
  Returns a tensor of the same shape.
  """
  # for each batch b, take the dot product between the query at
  # position i and the key at position j
  scores = einsum("bid, bjd -> bij", q, k)
  # scale by sqrt(d_k), the size of each key vector
  d_k = k.shape[-1]
  scores = scores / d_k ** 0.5
  # apply softmax over the key positions j so each row sums to 1
  probs = softmax(scores, dim=-1)
  # weight the value vectors by the attention probabilities
  output = einsum("bij, bjd -> bid", probs, v)

  return output

### TEST CASE ###
from torch import manual_seed, randn
manual_seed(42)

q,k,v = [randn(1,4,8) for _ in range(3)]
outp = sdp_attention(q, k, v)
expected = torch.tensor([[[0.26668164134025574, 0.2370503693819046, -0.05542190372943878, 0.12984780967235565, 0.354112833738327, -0.19060277938842773, -0.6448009014129639, -0.008517783135175705], [0.10862354934215546, 0.2443515509366989, -0.2163696587085724, 0.38144612312316895, 0.06310948729515076, -0.5632890462875366, -1.1007437705993652, -0.33056533336639404], [0.49469706416130066, -0.10946226119995117, -0.5350303649902344, 0.3420145511627197, -0.62238609790802, -0.47719380259513855, 0.322252094745636, 0.2334655523300171], [0.5705238580703735, -0.014622762799263, -0.27748897671699524, 0.32384753227233887, -0.46800583600997925, -0.4084155559539795, 0.19807007908821106, 0.32957538962364197]]])
assert torch.allclose(expected, outp, atol=1e-4)
print("Success!") # nice!

Success!
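
As an additional sanity check, the result can be compared against PyTorch's built-in `torch.nn.functional.scaled_dot_product_attention`, which I assume is available (it was added in PyTorch 2.0). The sketch below is self-contained: `manual_attention` is a hypothetical stand-in written with `@` instead of `einsum`, and it also checks that the `einsum` pattern used above matches a batched `Q K^T`.

```python
import torch
import torch.nn.functional as F

def manual_attention(q, k, v):
    # Same computation as the einsum version, written with batched matmul
    d_k = k.shape[-1]
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    return torch.softmax(scores, dim=-1) @ v

torch.manual_seed(0)
q, k, v = (torch.randn(2, 4, 8) for _ in range(3))

# "bid, bjd -> bij" contracts over the feature axis d, i.e. Q K^T per batch
einsum_scores = torch.einsum("bid,bjd->bij", q, k)
matmul_scores = q @ k.transpose(-2, -1)
same_scores = torch.allclose(einsum_scores, matmul_scores, atol=1e-6)

# Built-in attention with default settings: no mask, no dropout, 1/sqrt(d) scaling
builtin = F.scaled_dot_product_attention(q, k, v)
same_output = torch.allclose(manual_attention(q, k, v), builtin, atol=1e-5)

print(same_scores, same_output)
```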
