## Reinforcement Learning

Recently, prominent AI labs and researchers realised that large gains in "intelligence" could be made by scaling inference-time compute, as opposed to the previous approach of scaling training-time compute. This led to the rise of the "reasoning models", including OpenAI o1 / o3 and, more recently, DeepSeek R1. What's great about DeepSeek R1 is that, unlike OpenAI, they explained how they did it.

To make the model "reason", and hence scale inference-time compute, they used GRPO, which we will also use in this notebook to make our model reason about claims. A great added benefit of the DeepSeek approach is that the reasoning is clearly readable in the LLM generation.

### Adding Claim Reasoning Capabilities to our model with Group Relative Policy Optimisation (GRPO)

In this notebook we will focus on Reinforcement Learning to further improve the performance of our small local Qwen2.5 3B. The promise of reinforcement learning is to let the model improve itself through a reward function. Many variations of reward functions, and of ways to update the model based on the reward, have been experimented with. A recent and widely adopted reinforcement learning algorithm is [GRPO](https://arxiv.org/pdf/2402.03300). It was developed by DeepSeek and powers the reasoning capabilities of the well-known DeepSeek R1 model.

GRPO is very close to the earlier Proximal Policy Optimization (PPO) approach, but it samples several generations from the same policy model for each prompt, looks at how well they perform on average, and uses this group average instead of a dedicated value model to compute the advantage of each generation and improve the policy.
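To make the "group average instead of a value model" idea concrete, here is a minimal sketch of a group-relative advantage. The reward values and the `eps` constant are made up for illustration, and the population standard deviation is used for simplicity; actual implementations may use a slightly different estimator.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalise each reward against the mean and spread of its own group."""
    baseline = mean(rewards)   # plays the role of the value model's estimate
    spread = pstdev(rewards)   # population standard deviation of the group
    return [(r - baseline) / (spread + eps) for r in rewards]


# Hypothetical rewards for 4 completions sampled from the same prompt.
group_rewards = [3.0, 0.5, 2.5, 0.0]
advantages = group_relative_advantages(group_rewards)

for reward, advantage in zip(group_rewards, advantages):
    print(f"reward={reward:.1f} -> advantage={advantage:+.2f}")
```

Completions that score above the group average get a positive advantage and are reinforced; those below get a negative one. If every completion in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal.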

#### Load and prepare the dataset

We will load the same dataset as in the previous part of this workshop: 400 synthetic claims about an AXA UK car insurance policy.

In [1]:
# we use pydantic models to help you navigate / type the dataset
from models import ClaimsDataset, Claim

with open("../data/claims_dataset_v2_manual.json", "r") as f:
    dataset = ClaimsDataset.model_validate_json(f.read())

print(f"loaded dataset with {len(dataset.root)} claims.")

loaded dataset with 400 claims.


In [2]:
claims = dataset.root
covered_claims = [claim for claim in claims if claim.coverage]
not_covered_claims = [claim for claim in claims if not claim.coverage]

print(
    f"there are {len(covered_claims)} covered claims and {len(not_covered_claims)} not covered claims."
)

there are 126 covered claims and 274 not covered claims.


#### Split train / test dataset

To make sure our results are comparable we first split the dataset into a training and a testing set. We will establish our baseline only on the test set. The train set will be used to finetune the model, and the finetuned model will then be evaluated on the test set again.

Note that in a real-world setting I would probably use stratified K-fold cross-validation to preserve the proportion of covered / not covered claims across several splits. For the purpose of this workshop we keep things simple.

In [3]:
import random

# shuffle randomly the claims with reproducibility
random.seed(42)
random.shuffle(claims)

# keep 80% as training set, 20% as testing set.
split_ratio = 0.8
train_size = int(len(claims) * split_ratio)

train_claims = claims[:train_size]
test_claims = claims[train_size:]

print(
    f"split {len(claims)} claims into {len(train_claims)} training claims and {len(test_claims)}"
)

split 400 claims into 320 training claims and 80


In [4]:
covered_train_claims = [claim for claim in train_claims if claim.coverage]
not_covered_train_claims = [claim for claim in train_claims if not claim.coverage]

print(
    f"{len(covered_train_claims) * 100 / len(train_claims)}% covered, {len(not_covered_train_claims) * 100 / len(train_claims)}% not covered"
)

30.625% covered, 69.375% not covered


In [5]:
covered_test_claims = [claim for claim in test_claims if claim.coverage]
not_covered_test_claims = [claim for claim in test_claims if not claim.coverage]

print(
    f"{len(covered_test_claims) * 100 / len(test_claims)}% covered, {len(not_covered_test_claims) * 100 / len(test_claims)}% not covered"
)

35.0% covered, 65.0% not covered


We see a slight difference in the proportion of covered / not covered claims between our training and testing sets, which could potentially impact our end results. To do better we could make a stratified split, for example using sklearn.
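As a rough illustration of what a stratified split does, here is a dependency-free sketch. The helper `stratified_split` and the toy label list are hypothetical stand-ins, not part of the workshop code; with sklearn, `train_test_split(..., stratify=labels)` serves the same purpose.

```python
import random


def stratified_split(items, label_of, train_ratio=0.8, seed=42):
    """Split items so each label keeps roughly the same share in both sets."""
    rng = random.Random(seed)
    by_label = {}
    for item in items:
        by_label.setdefault(label_of(item), []).append(item)

    train, test = [], []
    for group in by_label.values():
        group = group[:]  # copy so the caller's list is left untouched
        rng.shuffle(group)
        cut = int(len(group) * train_ratio)
        train.extend(group[:cut])
        test.extend(group[cut:])

    # Shuffle again so the classes are not grouped together.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Toy stand-in for the claims: 126 covered (True) and 274 not covered (False).
toy_labels = [True] * 126 + [False] * 274
train, test = stratified_split(toy_labels, label_of=lambda x: x)

print(f"train: {sum(train) / len(train):.1%} covered")
print(f"test:  {sum(test) / len(test):.1%} covered")
```

On the real data the call would look like `stratified_split(claims, label_of=lambda c: c.coverage)`. Because each class is cut separately, the set sizes can differ by one or two items from a plain 80/20 cut.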

## Qwen2.5 (3B) baseline

As a reminder, this is the baseline we obtained on the test dataset in `001_finetuning.ipynb`:

```
Accuracy: 0.725
Precision: 0.7142857142857143
Recall: 0.35714285714285715
F1 Score: 0.47619047619047616
```

## GRPO Training

### Define Model

Here we just load the model and apply PEFT, which swaps each relevant module / layer from a regular PyTorch module to a LoRA-adapted one, as follows:

```python
# Original layer
class LinearLayer:
    def __init__(self):
        self.weight = Parameter(...)  # Full weight matrix
    
    def forward(self, x):
        return x @ self.weight.T
```

becomes

```python
# LoRA wrapped version
class LoRALayer:
    def __init__(self, base_layer, rank=8, alpha=8):
        self.base_layer = base_layer          # Original layer
        self.lora_A = Parameter(...)          # Low rank matrix A (smaller)
        self.lora_B = Parameter(...)          # Low rank matrix B (smaller)
        self.scaling = alpha / rank           # Scaling factor
        
    def forward(self, x):
        # Original computation
        base_output = self.base_layer(x)
        # LoRA computation
        lora_output = (x @ self.lora_A @ self.lora_B) * self.scaling
        # Combine both
        return base_output + lora_output
```
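To get a feel for how small the adapters are, the sketch below counts the LoRA parameters added for the rank and target modules used later in this notebook. The layer dimensions are assumptions about Qwen2.5-3B taken from memory rather than from this notebook, so check them against the model's `config.json`.

```python
# Assumed Qwen2.5-3B dimensions (not printed anywhere in this notebook).
hidden = 2048         # model width
intermediate = 11008  # MLP inner width
kv_dim = 256          # 2 key/value heads x 128 dims per head
n_layers = 36
rank = 64             # same value as lora_rank in the cell further down

# (input features, output features) of every linear layer wrapped with LoRA.
target_shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}


def lora_params(d_in, d_out, r):
    # A is (d_in x r) and B is (r x d_out); the base weight stays frozen.
    return r * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, rank) for d_in, d_out in target_shapes.values())
total = per_layer * n_layers
print(f"{total:,} trainable LoRA parameters")
```

Under these assumptions the total is 119,734,272, which agrees with the "Number of trainable parameters" that Unsloth reports when training starts.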

Additionally, the backbone model is loaded in 4-bit, which translates to using the following `BitsAndBytesConfig`:

```python
bnb_config = BitsAndBytesConfig(
    load_in_4bit              = True,
    bnb_4bit_use_double_quant = True,
    bnb_4bit_quant_type       = "nf4",
    bnb_4bit_compute_dtype    = dtype,
)
```

This setup essentially enables [QLoRA](https://arxiv.org/pdf/2305.14314), where the backbone model is quantised using NF4 (a 4-bit quantisation data type designed for normally distributed weights).


In [6]:
# we need to patch FastLanguageModel to leverage unsloth optimizations and GRPO
from unsloth import FastLanguageModel, PatchFastRL

PatchFastRL("GRPO", FastLanguageModel)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 02-12 13:04:54 __init__.py:190] Automatically detected platform cuda.


In [7]:
max_seq_length = 1024  # Can increase for longer reasoning traces
lora_rank = 64  # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    max_seq_length=max_seq_length,
    load_in_4bit=True,  # False for LoRA 16bit
    fast_inference=True,  # Enable vLLM fast inference
    max_lora_rank=lora_rank,
    gpu_memory_utilization=0.5,  # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r=lora_rank,  # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],  # Remove QKVO if out of memory
    lora_alpha=lora_rank,
    use_gradient_checkpointing="unsloth",  # Enable long context finetuning
    random_state=3407,
)

==((====))==  Unsloth 2025.2.5: Fast Qwen2 patching. Transformers: 4.48.3.
   \\   /|    GPU: NVIDIA GeForce RTX 2070. Max memory: 7.607 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit with actual GPU utilization = 41.64%
Unsloth: Your GPU has CUDA compute capability 7.5 with VRAM = 7.61 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 1024. Num Sequences = 128.
Unsloth: vLLM's KV Cache can use up to 0.75 GB. Also swap space = 4 GB.
INFO 02-12 13:05:14 config.py:542] This model supports multiple tasks: {'generate', 'score', 'reward', 'classify', 'embed'}. Defaulting to 'generate'.
Unsloth: vLLM Bitsandbytes config using kwa



INFO 02-12 13:05:18 weight_utils.py:252] Using model weights format ['*.safetensors']


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 02-12 13:05:20 model_runner.py:1115] Loading model weights took 2.2160 GB
INFO 02-12 13:05:20 punica_selector.py:18] Using PunicaWrapperGPU.
INFO 02-12 13:05:21 worker.py:267] Memory profiling takes 1.31 seconds
INFO 02-12 13:05:21 worker.py:267] the current vLLM instance can use total_gpu_memory (7.61GiB) x gpu_memory_utilization (0.42) = 3.17GiB
INFO 02-12 13:05:21 worker.py:267] model weights take 2.22GiB; non_torch_memory takes -0.10GiB; PyTorch activation peak memory takes 0.70GiB; the rest of the memory reserved for KV Cache is 0.35GiB.
INFO 02-12 13:05:21 executor_base.py:110] # CUDA blocks: 630, # CPU blocks: 7281
INFO 02-12 13:05:21 executor_base.py:115] Maximum concurrency for 1024 tokens per request: 9.84x
INFO 02-12 13:05:24 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occur

Capturing CUDA graph shapes: 100%|██████████| 19/19 [00:13<00:00,  1.44it/s]

INFO 02-12 13:05:38 model_runner.py:1562] Graph capturing finished in 13 secs, took 0.49 GiB
INFO 02-12 13:05:38 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 17.98 seconds



Unsloth 2025.2.5 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.


### Define reasoning prompts

As you can see, to make our model reason we simply ask it to do so in the system prompt.

In [8]:
# here is the prompt you can play with
SYSTEM_PROMPT = """
You are an Insurance Claim Expert.
Respond in the following format:

<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

USER_PROMPT = """
You are given a claim description and a list of sources extracted from the insurance policy.
You need to determine if the claim is covered by the insurance policy based on the sources.

Claim description:
{claim.description}

Sources:
{sources}

Format:
Return only "covered" or "not covered" in the answer field.
""".strip()

In [9]:
def claim_to_prompt(claim: Claim):
    """apply chat template and format the prompt"""
    return tokenizer.apply_chat_template(
        [
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": USER_PROMPT.format(
                    claim=claim,
                    sources="\n".join(
                        [
                            f"{i + 1}. {source.paragraph}"
                            for i, source in enumerate(claim.sources)
                        ]
                    ),
                ),
            },
        ],
        tokenize=False,
        add_generation_prompt=True,
    )


print(claim_to_prompt(test_claims[0]))

<|im_start|>system

You are an Insurance Claim Expert.
Respond in the following format:

<reasoning>
...
</reasoning>
<answer>
...
</answer>
<|im_end|>
<|im_start|>user
You are given a claim description and a list of sources extracted from the insurance policy.
You need to determine if the claim is covered by the insurance policy based on the sources.

Claim description:
I discovered someone had attempted to steal my car. The driver's side door lock was damaged, and the dashboard was dismantled, with the stereo missing. Is there any provision for covering transportation and accomodation?

Sources:
1. If your car, accessories or spare parts are lost, stolen or damaged, we will: - repair the damage; - replace what is lost or damaged and is too expensive to repair; or - pay you the cost of the loss or damage.
2. If your car is damaged, we will use one of our recommended repairers to repair it. If you choose not to use them, we may not pay more than our recommended repairer would have char

At this point it's already a good idea to look at what the model outputs with such a prompt for a few examples.

In [9]:
from vllm import SamplingParams

sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=1024,
)

outputs = model.fast_generate(
    [claim_to_prompt(claim) for claim in test_claims[:10]],
    sampling_params=sampling_params,
    lora_request=None,
)

Processed prompts: 100%|██████████| 10/10 [00:15<00:00,  1.56s/it, est. speed input: 269.20 toks/s, output: 105.88 toks/s]


In [10]:
for output in outputs:
    print(output.outputs[0].text)
    print("-" * 100)

<reasoning>
The claim description states that the driver's side door lock was damaged, and the dashboard was dismantled, with the stereo missing. According to source 1, if the car, accessories, or spare parts are lost, stolen, or damaged, the insurance will cover repairs, replacements, or a financial payment. Since the stereo is a component (spare part) of the car, and it is missing, it falls under the covered loss. Additionally, source 2 mentions that if the car is damaged, the insurance will provide transportation and accommodation pending the settlement of the claim. Therefore, the claim is covered. 
</reasoning>
<answer>
covered
</answer>
----------------------------------------------------------------------------------------------------
<reasoning>
The claim involves a fatality and legal costs, which are covered by the policy. The policy explicitly states that it covers "the amounts shown below" which includes "Death of or injury to any person unlimited" and "all legal costs and e

We observe that the format is already quite well respected. Let's establish our baseline with this new prompt format.

In [11]:
outputs = model.fast_generate(
    [claim_to_prompt(claim) for claim in test_claims],
    sampling_params=sampling_params,
    lora_request=None,
)

Processed prompts: 100%|██████████| 80/80 [01:17<00:00,  1.03it/s, est. speed input: 390.93 toks/s, output: 159.62 toks/s]


In [17]:
def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip().lower()


predictions = [extract_xml_answer(output.outputs[0].text) for output in outputs]

In [18]:
wrong_format_predictions = [
    (i, extract_xml_answer(output.outputs[0].text))
    for i, output in enumerate(outputs)
    if extract_xml_answer(output.outputs[0].text) not in ["covered", "not covered"]
]


print(f"There are {len(wrong_format_predictions)} claims with wrong format.")
for i, prediction in wrong_format_predictions:
    print(f"Claim {i} has wrong format: {prediction}")
    print(f"Full output: {outputs[i].outputs[0].text}")
    print("-" * 100)


There are 13 claims with wrong format.
Claim 13 has wrong format: <reasoning>
the claim description states that the glass damage was caused during an attempted theft, which is explicitly covered under source 1. the sources indicate that the insurance company will pay for the repair or replacement of glass in windows or windscreens, including panoramic windscreens, in the car and scratching of the bodywork caused by the glass breaking. 

furthermore, the policy explicitly states that glass damage from brokenromium tatto
user
based on the provided sources, the policy will pay for the glass damage to the and and the relevant specify that glass in windows or windscreens break due to the the policy will repair or replacement will be provided, therefore, claim description are part of the car's windshield) are explicitly covered under the policy's coverage.
Full output: <reasoning>
The claim description states that the glass damage was caused during an attempted theft, which is explicitly cov

There are 2 key errors I could observe:

* even though I increased the max output to 2048 tokens, I still see some cases where the model never reaches the `<answer>` part.
* it sometimes injects new instructions into its own generation, for example to answer YES or NO.

We can penalize the model for this behavior during reinforcement learning with our reward function.
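Before training, it can help to put a label on each malformed output so the two failure modes can be counted. The following is a rough sketch: `diagnose_output` and its labels are hypothetical helpers, and the "injected turn" check is only a heuristic that looks for a bare chat role on its own line.

```python
import re

VALID_ANSWERS = {"covered", "not covered"}


def diagnose_output(text: str) -> str:
    """Return 'ok' or a short label describing why the output is unusable."""
    if "<answer>" not in text:
        return "no answer tag"  # generation stopped or drifted before answering
    tail = text.split("<answer>")[-1]
    if "</answer>" not in tail:
        return "unclosed answer tag"
    answer = tail.split("</answer>")[0].strip().lower()
    if answer not in VALID_ANSWERS:
        return "unexpected answer"
    # A bare role name on its own line suggests the model started a new turn.
    if re.search(r"^(user|assistant|system)\s*$", text, flags=re.MULTILINE):
        return "injected turn"
    return "ok"


samples = [
    "<reasoning>\nfine\n</reasoning>\n<answer>\ncovered\n</answer>\n",
    "<reasoning>\nnever finishes",
    "<reasoning>\nfine\n</reasoning>\n<answer>\nYES\n</answer>\n",
    "<reasoning>\nfine\nuser\nanswer YES or NO\n</reasoning>\n<answer>\ncovered\n</answer>\n",
]
for sample in samples:
    print(diagnose_output(sample))
```

Applied to the notebook's generations, `collections.Counter(diagnose_output(o.outputs[0].text) for o in outputs)` would give a quick breakdown of the failures by type.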

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

labels = [claim.coverage for claim in test_claims]
predictions = [
    extract_xml_answer(output.outputs[0].text).lower() == "covered"
    for output in outputs
]

print(f"Accuracy: {accuracy_score(labels, predictions)}")
print(f"Precision: {precision_score(labels, predictions)}")
print(f"Recall: {recall_score(labels, predictions)}")
print(f"F1 Score: {f1_score(labels, predictions)}")


Accuracy: 0.6125
Precision: 0.45454545454545453
Recall: 0.5357142857142857
F1 Score: 0.4918032786885246


Our baseline now starts lower than before, mostly because the format is not respected (13 out of 80 outputs).
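Note that the evaluation cell turns every output into a boolean with `== "covered"`, so a malformed answer silently counts as a "not covered" prediction. To see how the four scores relate, here is a small sketch that recomputes them from confusion-matrix counts. The counts are not printed by the notebook; they are inferred from the reported scores together with the 28 covered claims in the test set (35% of 80), so treat them as a reconstruction.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive ("covered") class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Inferred counts (assumption): 15 true positives, 18 false positives,
# 13 false negatives, 34 true negatives.
accuracy, precision, recall, f1 = binary_metrics(tp=15, fp=18, fn=13, tn=34)

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")
```

Read this way, the reasoning prompt finds more covered claims than the earlier baseline (higher recall) but at the cost of many more false positives (lower precision).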

### GRPO

#### Define a Reward 

For the GRPO algorithm to optimize our model we need to define a reward function. In this case our reward will be a combination of different objectives: finding the right answer brings a gain of 2.0, having the right format brings 0.5, and so on.
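As a sketch of how several objectives turn into a single number per completion, the snippet below adds the component rewards position by position. It assumes the trainer sums the reward functions with equal weight, which is my understanding of the TRL default; the component names and values are hypothetical.

```python
# Hypothetical per-function rewards for three completions of the same prompt.
component_rewards = {
    "xml_tags": [0.5, 0.25, 0.0],       # partial credit for each tag found
    "soft_format": [0.5, 0.0, 0.0],
    "strict_format": [0.5, 0.0, 0.0],
    "correctness": [2.0, 2.0, 0.0],     # 2.0 when the answer is right
}


def total_rewards(components):
    """Sum rewards position by position across all reward functions."""
    return [sum(values) for values in zip(*components.values())]


totals = total_rewards(component_rewards)
print(totals)
```

The second completion shows why the correctness reward dominates: a right answer with a sloppy format still scores far higher than a wrong one, while the smaller format rewards nudge the model towards the expected layout.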

In [10]:
# code from: https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb
import re


# Reward functions
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    extracted_responses = [extract_xml_answer(r) for r in completions]
    print(
        "-" * 20,
        f"Question:\n{prompts[0]}",
        f"\nAnswer:\n{answer[0]}",
        f"\nResponse:\n{completions[0]}",
        f"\nExtracted:\n{extracted_responses[0]}",
    )
    return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]


def strict_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward completions that follow the exact tag layout, newlines included."""
    # re.DOTALL lets `.` span newlines, so multi-line reasoning still matches.
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
    matches = [re.match(pattern, r, flags=re.DOTALL) for r in completions]
    return [0.5 if match else 0.0 for match in matches]


def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward completions that contain both tag pairs in order, with loose whitespace."""
    # Without re.DOTALL this never matches when a newline follows <reasoning>.
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    matches = [re.match(pattern, r, flags=re.DOTALL) for r in completions]
    return [0.5 if match else 0.0 for match in matches]


def count_xml(text) -> float:
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1]) * 0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1) * 0.001
    return count


def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    return [count_xml(c) for c in completions]
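
To see what `count_xml` rewards in practice, here is a small worked example. The function body is copied from the cell above so the snippet runs on its own; the two sample completions are made up.

```python
def count_xml(text) -> float:
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        # penalise anything written after the closing answer tag
        count -= len(text.split("\n</answer>\n")[-1]) * 0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1) * 0.001
    return count


clean = "<reasoning>\nSource 2 excludes it.\n</reasoning>\n<answer>\nnot covered\n</answer>\n"
chatty = clean + "extra"  # five stray characters after the answer

print(round(count_xml(clean), 3))
print(round(count_xml(chatty), 3))
```

Each of the four tags is worth 0.125, so a clean completion reaches 0.5, and every character after `</answer>` costs roughly 0.002 in total. This gives the model a smooth signal to stop generating once the answer is closed.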

#### Define the HuggingFace Dataset for training 

In this case we need a `prompt` column, which contains the whole prompt, and an `answer` column, which contains the answer we are expecting.
This matches what the TRL GRPO trainer expects.

In [11]:
from datasets import Dataset


def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    # lower-case as in the evaluation cell, so "Covered" still matches the label
    return answer.strip().lower()


dataset = Dataset.from_list(
    [
        {
            "prompt": claim_to_prompt(claim),
            "answer": "covered" if claim.coverage else "not covered",
        }
        for claim in train_claims
    ]
)

dataset

Dataset({
    features: ['prompt', 'answer'],
    num_rows: 320
})

In [12]:
# initialize the trainer
# for details on GRPO trainer see: https://huggingface.co/docs/trl/main/en/grpo_trainer#grpo-trainer
from trl import GRPOConfig, GRPOTrainer
from unsloth import is_bfloat16_supported

training_args = GRPOConfig(
    use_vllm=True,  # use vLLM for fast inference!
    learning_rate=5e-6,
    adam_beta1=0.9,
    adam_beta2=0.99,
    weight_decay=0.1,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    optim="adamw_8bit",
    logging_steps=1,
    bf16=is_bfloat16_supported(),
    fp16=not is_bfloat16_supported(),
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,  # Increase to 4 for smoother training
    num_generations=8,  # Decrease if out of memory
    max_prompt_length=256,
    max_completion_length=200,
    # num_train_epochs = 1, # Set to 1 for a full training run
    max_steps=250,
    save_steps=250,
    max_grad_norm=0.1,
    report_to="none",  # Can use Weights & Biases
    output_dir="outputs",
)

torch.distributed process group is initialized, but parallel_mode != ParallelMode.DISTRIBUTED. In order to use Torch DDP, launch your script with `python -m torch.distributed.launch


In [13]:
trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        correctness_reward_func,
    ],
    args=training_args,
    train_dataset=dataset,
)
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 320 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 1
\        /    Total batch size = 1 | Total steps = 250
 "-____-"     Number of trainable parameters = 119,734,272


-------------------- Question:
<|im_start|>system

You are an Insurance Claim Expert.
Respond in the following format:

<reasoning>
...
</reasoning>
<answer>
...
</answer>
<|im_end|>
<|im_start|>user
You are given a claim description and a list of sources extracted from the insurance policy.
You need to determine if the claim is covered by the insurance policy based on the sources.

Claim description:
While driving home, a tree branch fell onto the hood of my car during a storm. The impact caused a significant dent and damaged the engine. The car wouldn't start afterward, resulting in a mechanical failure. Could there be coverage in such unforeseen natural events?

Sources:
1. You are not covered for the following:
2. Loss of use, loss of value, wear and tear, mechanical or electrical failure, breakdowns or breakages.

Format:
Return only "covered" or "not covered" in the answer field.<|im_end|>
<|im_start|>assistant
 
Answer:
covered 
Response:
<reasoning>
The claim description states

Step,Training Loss,reward,reward_std,completion_length,kl


RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


In [14]:
import torch

print(torch.__version__)

2.5.1+cu124
