# Training Language Models with GRPO Trainer

TRL's `GRPOTrainer` trains language models with Group Relative Policy Optimization (GRPO), a post-training method introduced in the paper **"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models"**, which studies how to improve the mathematical reasoning of open language models. The paper's authors are Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo.

The full paper is available [here](https://huggingface.co/papers/2402.03300).

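As a rough illustration of the "group relative" idea in GRPO: several completions are sampled for each prompt, and each completion's reward is normalized against the mean and standard deviation of its own group. The sketch below is not TRL's implementation; the function name `group_relative_advantages`, the sample rewards, the choice of the population standard deviation, and the `eps` term are all assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """Normalize the rewards of one group of completions sampled for the same prompt.

    `eps` is an assumed small constant that avoids division by zero
    when every completion in the group receives the same reward.
    """
    group_mean = mean(rewards)
    group_std = pstdev(rewards)  # population standard deviation (an assumption)
    return [(reward - group_mean) / (group_std + eps) for reward in rewards]


# Hypothetical rewards for four completions of a single prompt.
advantages = group_relative_advantages([-15.0, -3.0, 0.0, -6.0])
print(advantages)
```

Completions that score above the group mean get a positive advantage and those below it get a negative one, so no separate value model is needed to provide a baseline.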
## Contribution

The implementation of this post-training method in TRL was contributed by Quentin Gallouédec. His profile is available [here](https://huggingface.co/qgallouedec).

## Saving the Markdown File in Jupyter Notebook

In a Jupyter Notebook, you can save the Markdown content to a file named `summary.md` with the `%%writefile` cell magic. Create a new cell whose first line is the magic command, followed by the content to write:

```python
%%writefile summary.md
# Training Language Models with GRPO Trainer

TRL's `GRPOTrainer` trains language models with Group Relative Policy Optimization (GRPO), a post-training method introduced in the paper **"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models"**, which studies how to improve the mathematical reasoning of open language models. The paper's authors are Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo.

The full paper is available [here](https://huggingface.co/papers/2402.03300).

## Contribution

The implementation of this post-training method in TRL was contributed by Quentin Gallouédec. His profile is available [here](https://huggingface.co/qgallouedec).
```

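Outside a notebook, where cell magics are unavailable, the same file can be written with the standard library. This is a minimal sketch: the variable names are hypothetical, and `summary_markdown` holds only a shortened stand-in for the full Markdown text.

```python
from pathlib import Path

# Shortened stand-in for the Markdown content; replace with the full text.
summary_markdown = """# Training Language Models with GRPO Trainer

The full paper is available at https://huggingface.co/papers/2402.03300.
"""

output_path = Path("summary.md")
# write_text creates the file, or overwrites it if it already exists.
output_path.write_text(summary_markdown, encoding="utf-8")

# Read the file back to confirm the heading was written.
print(output_path.read_text(encoding="utf-8").splitlines()[0])
```

Like `%%writefile`, this overwrites any existing `summary.md` in the current working directory.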
```python
!pip install transformers
!pip install datasets
!pip install trl
```

```python
%%writefile train_grpo.py

# Import necessary libraries
from typing import List

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Load a dataset for training
dataset = load_dataset("trl-lib/tldr", split="train")

# Define the reward function for the GRPO training
def reward_len(completions: List[str], **kwargs) -> List[int]:
    """
    Rewards completions that are close to a target length of 20 characters.

    Args:
        completions (List[str]): Generated completions from the model.
        **kwargs: Additional keyword arguments not used in this function.

    Returns:
        List[int]: One reward per completion; 0 at exactly 20 characters,
            decreasing by 1 for each character of deviation.
    """
    return [-abs(20 - len(completion)) for completion in completions]

# Configuration for the GRPO trainer
training_args = GRPOConfig(
    output_dir="Qwen2-0.5B-GRPO",          # directory for checkpoints and other training outputs
    logging_steps=1,                       # log metrics every training step
    per_device_train_batch_size=8,         # batch size per device
    gradient_accumulation_steps=1,         # number of batches to accumulate gradients over before each optimizer step
    fp16=True,                             # enable fp16 mixed-precision training
)

# Initialize the GRPOTrainer
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",      # model to be trained
    reward_funcs=reward_len,               # reward function as defined above
    args=training_args,                    # pass the GRPOConfig as arguments
    train_dataset=dataset,                 # dataset for training
)

# Execute the training process
trainer.train()
```

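Before launching a training run, it can help to call the reward function on a few strings by hand. The sketch below repeats `reward_len` from the script so that it runs on its own; the sample completions are hypothetical and chosen only to show how the reward changes with length.

```python
from typing import List


def reward_len(completions: List[str], **kwargs) -> List[int]:
    """Same length-based reward as in train_grpo.py: 0 at 20 characters, lower otherwise."""
    return [-abs(20 - len(completion)) for completion in completions]


# Hypothetical completions of 5, 20, and 35 characters.
samples = ["short", "x" * 20, "y" * 35]
rewards = reward_len(samples)
print(rewards)  # [-15, 0, -15]
```

The reward is symmetric around the target: a completion 15 characters too short is penalized exactly as much as one 15 characters too long.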

```python
!accelerate launch train_grpo.py
```