<a href="https://colab.research.google.com/github/romenlaw/RL-playground/blob/main/DeepSeek_GPRO.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[EVERY term in DeepSeek R1's GRPO explained (with examples and exercises) | RL Foundations](https://www.youtube.com/watch?v=mXWiDU9-fOk)

In [2]:
import numpy as np

# Advantage
It measures how well an answer performed relative to the other answers in its group.
$$A_i=\frac{r_i-\operatorname{mean}(\{r_1, r_2, \dots, r_G\})}{\operatorname{std}(\{r_1, r_2, \dots, r_G\})}$$

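The rewards ($r$) are 1 for a correct answer and 0 for a wrong one.
In a group containing both correct and wrong answers, the wrong answers get a negative advantage and the correct ones a positive advantage. If every answer in the group gets the same reward, the std is 0 and the advantage is undefined.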

In [12]:
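# Sample binary rewards (0 = wrong, 1 = correct) for a group of G = 10 answers
r = np.random.randint(0, 2, 10)
m = r.mean()
s = r.std()  # 0 if all rewards are equal, in which case A is undefined
A = (r - m) / s
r, m, s, A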

(array([0, 0, 0, 1, 0, 0, 0, 1, 0, 0]),
 np.float64(0.2),
 np.float64(0.4000000000000001),
 array([-0.5, -0.5, -0.5,  2. , -0.5, -0.5, -0.5,  2. , -0.5, -0.5]))
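
The cell above divides by zero when every answer in the group gets the same reward. A minimal sketch of a guarded version follows; the helper name `group_advantages` and the small constant `eps` are assumptions for illustration, not part of the original notebook.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Standardise rewards within one group.

    eps is an assumed guard: it keeps the division finite when all
    rewards are equal (std == 0), giving an advantage of 0 for everyone.
    """
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Same rewards as the sample output above: two correct answers out of ten
mixed = group_advantages([0, 0, 0, 1, 0, 0, 0, 1, 0, 0])

# Degenerate group: every answer is wrong, so nobody stands out
all_wrong = group_advantages([0, 0, 0, 0])
```

With the guard, `mixed` is approximately -0.5 for wrong answers and 2.0 for correct ones, and `all_wrong` is all zeros, so a group with identical rewards contributes no learning signal.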

# Policy
## current policy $\pi_\theta$
Main terms of the objective formula:
$\pi_\theta(o_i|q)A_i$

* if $A_i>0$, i.e. the answer is correct, then adjust $\theta$ (hence the policy) to increase the probability of $o_i$
* if $A_i<0$, i.e. the answer is wrong, then adjust $\theta$ to decrease the probability of $o_i$
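
The two cases above can be checked on a toy policy. This is only a sketch under strong assumptions: the "policy" is a softmax over four fixed candidate outputs (not an LLM), `theta` holds its logits, and the step size of 1.0 is arbitrary.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Advantages for four candidate outputs: only output 1 is correct
A = np.array([-0.5, 2.0, -0.5, -0.5])

theta = np.zeros(4)          # toy logits, start from a uniform policy
p_before = softmax(theta)

# Gradient of J = sum_i pi_i * A_i with respect to the logits:
# dJ/dtheta_k = pi_k * (A_k - sum_j pi_j * A_j)
grad = p_before * (A - p_before @ A)

theta = theta + 1.0 * grad   # one gradient-ascent step
p_after = softmax(theta)
```

After the step, the probability of the correct output (index 1) goes up and the probabilities of the three wrong outputs go down.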

## old policy $\pi_{\theta_{old}}$

An older snapshot of the policy, e.g. the policy from 10 gradient updates ago, produced by an earlier version of the LLM's weights.

The following term keeps the probability ratio within the range $[1-\epsilon, 1+\epsilon]$: even if the new policy's probability of $o_i$ has moved a lot relative to the old policy's, we don't want to reward moving it beyond this range, i.e. the objective is conservative about its gradient updates. One intuition for this clipping is that even though an answer was wrong, it may be because only one small step/part/token was wrong, so we don't want to over-penalise the model for the whole answer.
$$\operatorname{clip}\left(\frac{\pi_\theta(o_i|q)}{\pi_{\theta_{old}}(o_i|q)}, 1-\epsilon, 1+\epsilon\right)A_i$$
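
In practice the ratio is usually computed from log-probabilities rather than raw probabilities, since sequence probabilities are tiny. A small sketch, assuming made-up log-probabilities for three outputs under the old and current policy (the numbers are illustrative, not from any model):

```python
import numpy as np

# Hypothetical log pi(o_i | q) for three outputs
logp_old = np.array([-12.0, -9.5, -14.2])   # under the old policy
logp_new = np.array([-11.7, -9.5, -14.6])   # under the current policy

# pi_new / pi_old = exp(log pi_new - log pi_old)
ratio = np.exp(logp_new - logp_old)

eps = 0.1
clipped_ratio = np.clip(ratio, 1 - eps, 1 + eps)
```

Here the first ratio is above $1+\epsilon$ and is clipped to 1.1, the second is exactly 1 and is untouched, and the third is below $1-\epsilon$ and is clipped to 0.9.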

# The min term

The objective takes the minimum of the unclipped term (ratio times advantage) and the clipped term (clipped ratio times advantage).

It limits how far the probability of one particular response ($o_i$) to a question $q$ can be pushed up, which encourages the model
1. to work on the other responses to the question $q$ ($o_1, \dots, o_G$)
2. to work on responses to other questions as well ($q\sim Q$).



In [15]:
def clip(x, low, high):
  # Element-wise clamp of x to [low, high]; equivalent to np.clip(x, low, high)
  return np.minimum(np.maximum(x, low), high)

In [26]:
# ratio is the pi/pi_old
eps = 0.1
"""
scenarios:
  A>0, ratio > 1+e
  A>0, ratio < 1+e
  A<0, ratio > 1-e
  A<0, ratio < 1-e
"""
A=np.array([2, 2, -2, -2])
ratio = np.array([1.2, 1, 1.2, 0.8])
unclipped = ratio * A
clipped = clip(ratio, 1-eps, 1+eps) * A

In [27]:
unclipped, clipped

(array([ 2.4,  2. , -2.4, -1.6]), array([ 2.2,  2. , -2.2, -1.8]))

In [28]:
np.minimum(unclipped, clipped)

array([ 2.2,  2. , -2.4, -1.8])
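
One way to see what the min does to learning is to look at the slope of the objective with respect to the ratio in each of the four scenarios above. The sketch below estimates that slope with a central finite difference; the function names and the step `h` are assumptions for illustration.

```python
import numpy as np

def surrogate(ratio, A, eps):
    # min of the unclipped and the clipped term, as in the cells above
    return np.minimum(ratio * A, np.clip(ratio, 1 - eps, 1 + eps) * A)

def surrogate_slope(ratio, A, eps, h=1e-6):
    # central finite difference of the surrogate with respect to the ratio
    return (surrogate(ratio + h, A, eps) - surrogate(ratio - h, A, eps)) / (2 * h)

A = np.array([2.0, 2.0, -2.0, -2.0])
ratio = np.array([1.2, 1.0, 1.2, 0.8])
g = surrogate_slope(ratio, A, eps=0.1)
```

`g` comes out as approximately `[0, 2, -2, 0]`. A good answer whose probability has already risen past $1+\epsilon$ gets no further push (slope 0), and likewise a bad answer already pushed below $1-\epsilon$. A bad answer whose probability has gone up (ratio 1.2, $A<0$) keeps its full slope, so the model is still pushed to correct it.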

# The KL Divergence term

It's the KL divergence between the current policy and the reference policy (DeepSeek v3). It is a second safeguard, alongside clipping, that keeps the updated policy from drifting too far from the reference/base model.

The larger β is, the more strongly that drift is penalised, so training stays closer to the base model (DeepSeek v3).
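
A rough sketch of how the penalty could be computed from log-probabilities. It assumes the per-sample estimator usually given for GRPO, $\frac{\pi_{ref}}{\pi_\theta} - \log\frac{\pi_{ref}}{\pi_\theta} - 1$, which is not spelled out in these notes; the log-probabilities and the value of `beta` are made up for illustration.

```python
import numpy as np

def kl_estimate(logp_theta, logp_ref):
    # Assumed estimator: r - log(r) - 1 with r = pi_ref / pi_theta.
    # It is 0 when the two policies agree and positive otherwise.
    log_r = logp_ref - logp_theta
    return np.exp(log_r) - log_r - 1.0

# Hypothetical log-probabilities of three outputs
logp_ref = np.array([-10.0, -10.0, -10.0])    # reference policy
logp_theta = np.array([-10.0, -9.5, -11.0])   # current policy

kl = kl_estimate(logp_theta, logp_ref)

beta = 0.04                      # illustrative value only
penalty = beta * kl.mean()       # amount subtracted from the objective
```

The first output has identical probability under both policies and contributes 0; the other two contribute positive amounts, and doubling `beta` doubles the penalty for the same drift.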