<a href="https://colab.research.google.com/github/venezianof/booksum/blob/main/finetuning/notebooks/grpo_for_verifiable_tasks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GRPO Fine-tuning for verifiable tasks

Fine-tuning requires a GPU. If you don't have one locally, you can run this notebook for free on [Google Colab](https://colab.research.google.com/github/Liquid4All/cookbook/blob/main/finetuning/notebooks/grpo_for_verifiable_tasks.ipynb) using a free NVIDIA T4 GPU instance.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Liquid4All/cookbook/blob/main/finetuning/notebooks/grpo_for_verifiable_tasks.ipynb)

### What's in this notebook?

In this notebook you will learn how to fine-tune a Small Text-to-text Language Model for verifiable tasks using Group Relative Policy Optimization (GRPO).

GRPO is a Reinforcement Learning algorithm widely used by AI labs and practitioners to fine-tune models for easily verifiable tasks such as:

- Mathematical problem solving with numeric verification
- Code generation with unit test validation
- Structured output tasks (JSON, SQL) with schema validation
- Question answering with ground truth answers
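
The "group relative" part of GRPO can be sketched in a few lines: several completions are sampled for the same prompt, each is scored by a verifier, and every reward is normalized against the mean and spread of its own group. The snippet below is a minimal illustration, not TRL's implementation; the function name, the `eps` value, and the sample rewards are assumptions, and real implementations may use a different standard deviation estimator.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of one group of completions sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps (an assumed value) avoids division by zero when all rewards are equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for 4 completions of a single prompt (1.0 = verified correct).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that score above their group's average get a positive advantage and are reinforced; those below get a negative one. When every completion in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal.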


We will cover:
- Environment setup
- Data preparation
- Model training
- Local inference with your new model
- Saving the model and exporting it in the format you need for **deployment**

### Deployment options

LFM2.5 models are small and efficient, enabling deployment across a wide range of platforms:

<table align="left">
  <tr>
    <th>Deployment Target</th>
    <th>Use Case</th>
  </tr>
  <tr>
    <td>üì± <a href="https://docs.liquid.ai/leap/edge-sdk/android/android-quick-start-guide"><b>Android</b></a></td>
    <td>Mobile apps on Android devices</td>
  </tr>
  <tr>
    <td>üì± <a href="https://docs.liquid.ai/leap/edge-sdk/ios/ios-quick-start-guide"><b>iOS</b></a></td>
    <td>Mobile apps on iPhone/iPad</td>
  </tr>
  <tr>
    <td>üçé <a href="https://docs.liquid.ai/docs/inference/mlx"><b>Apple Silicon Mac</b></a></td>
    <td>Local inference on Mac with MLX</td>
  </tr>
  <tr>
    <td>ü¶ô <a href="https://docs.liquid.ai/docs/inference/llama-cpp"><b>llama.cpp</b></a></td>
    <td>Local deployments on any hardware</td>
  </tr>
  <tr>
    <td>ü¶ô <a href="https://docs.liquid.ai/docs/inference/ollama"><b>Ollama</b></a></td>
    <td>Local inference with easy setup</td>
  </tr>
  <tr>
    <td>üñ•Ô∏è <a href="https://docs.liquid.ai/docs/inference/lm-studio"><b>LM Studio</b></a></td>
    <td>Desktop app for local inference</td>
  </tr>
  <tr>
    <td>⚡ <a href="https://docs.liquid.ai/docs/inference/vllm"><b>vLLM</b></a></td>
    <td>Cloud deployments with high throughput</td>
  </tr>
  <tr>
    <td>☁️ <a href="https://docs.liquid.ai/docs/inference/modal-deployment"><b>Modal</b></a></td>
    <td>Serverless cloud deployment</td>
  </tr>
  <tr>
    <td>üèóÔ∏è <a href="https://docs.liquid.ai/docs/inference/baseten-deployment"><b>Baseten</b></a></td>
    <td>Production ML infrastructure</td>
  </tr>
  <tr>
    <td>üöÄ <a href="https://docs.liquid.ai/docs/inference/fal-deployment"><b>Fal</b></a></td>
    <td>Fast inference API</td>
  </tr>
</table>

In [9]:
from trl import GRPOConfig, GRPOTrainer

# Configure the training arguments with GRPOConfig
training_args = GRPOConfig(
    output_dir="LFM2.5-1.2B-Instruct-GRPO",
    learning_rate=2e-5,
    max_steps=30,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,

    # GRPO-specific parameters
    num_generations=8,
    max_completion_length=256,

    # Optimizations
    optim="paged_adamw_8bit",
    bf16=True,
    use_liger_kernel=True,  # requires the liger-kernel package to be installed
    gradient_checkpointing=True,

    # Logging
    logging_steps=10,
    report_to=[]
)

print("GRPO configuration completed successfully.")

  _is_package_version_below("liger_kernel", "0.6.5") or _is_package_version_below("peft", "0.18.0")


RuntimeError: Failed to import trl.trainer.grpo_trainer because of the following error (look up to see its traceback):
Invalid version: 'N/A'

In [3]:
import re

# 1. Format Reward: Checks if the model uses the required tags
def medical_format_reward(completions, **kwargs):
    pattern = r"^<think>.*</think>.*<diagnosis>.*</diagnosis>$"
    completion_contents = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, content, re.DOTALL) for content in completion_contents]
    return [1.0 if match else 0.0 for match in matches]

# 2. Accuracy Reward: Placeholder for checking against ground truth
def medical_accuracy_reward(completions, solution, **kwargs):
    # 'solution' comes from the dataset if included in the prompt/extra columns
    rewards = []
    for completion, sol in zip(completions, solution):
        content = completion[0]["content"]
        # Simple string matching logic for demonstration
        if sol.lower() in content.lower():
            rewards.append(1.5) # Higher weight for correct diagnosis
        else:
            rewards.append(0.0)
    return rewards

# 3. Conciseness Reward: Penalizes overly long responses
def conciseness_reward(completions, **kwargs):
    contents = [len(c[0]["content"]) for c in completions]
    # Reward shorter responses (this is a simplified example)
    return [-0.1 if length > 1000 else 0.0 for length in contents]

# How to pass them to the trainer:
# trainer = GRPOTrainer(
#     model=model,
#     reward_funcs=[medical_format_reward, medical_accuracy_reward, conciseness_reward],
#     args=training_args,
#     train_dataset=train_dataset,
#     peft_config=peft_config,
# )
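
To check reward functions like these before launching a training run, it helps to call them on a couple of hand-written completions. The sketch below re-declares compact versions of the format and accuracy rewards so that it runs on its own; the sample completions and reference answers are made up, and summing the two scores assumes the rewards are combined with equal weights.

```python
import re

# Same pattern as in `medical_format_reward` above.
PATTERN = r"^<think>.*</think>.*<diagnosis>.*</diagnosis>$"


def format_reward(completions, **kwargs):
    contents = [completion[0]["content"] for completion in completions]
    return [1.0 if re.match(PATTERN, c, re.DOTALL) else 0.0 for c in contents]


def accuracy_reward(completions, solution, **kwargs):
    return [
        1.5 if sol.lower() in completion[0]["content"].lower() else 0.0
        for completion, sol in zip(completions, solution)
    ]


# Made-up completions in the conversational format (one assistant message each).
sample_completions = [
    [{"role": "assistant", "content": "<think>Fever and cough.</think>\n<diagnosis>Influenza</diagnosis>"}],
    [{"role": "assistant", "content": "The diagnosis is influenza."}],
]
sample_solutions = ["influenza", "influenza"]  # hypothetical ground truth

format_scores = format_reward(sample_completions)
accuracy_scores = accuracy_reward(sample_completions, solution=sample_solutions)
total_scores = [f + a for f, a in zip(format_scores, accuracy_scores)]
print(format_scores, accuracy_scores, total_scores)
```

The second completion names the right diagnosis but skips the required tags, so it earns the accuracy reward only. Seeing the two scores side by side makes it easier to judge whether the relative weights push the model toward the behavior you want.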

### Need help building with our models and tools?
Join the Liquid AI Discord Community and ask.

<a href="https://discord.com/invite/liquid-ai"><img src="https://img.shields.io/discord/1385439864920739850?color=7289da&label=Join%20Discord&logo=discord&logoColor=white" alt="Join Discord"></a>

And now, let the fine-tuning begin!

## 📦 Installation & Setup

First, let's install all the required packages:

We'll install **TRL** with the **PEFT** extra, which ensures all main dependencies such as **Transformers** and **PEFT** (a package for parameter-efficient fine-tuning, e.g., LoRA/QLoRA) are included.

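Additionally, the recipe relies on
- **bitsandbytes** to enable quantization of LLMs, reducing memory consumption for both inference and training,
- **datasets** to load the training data, and
- optionally, **trackio** to log and monitor experiments and **liger-kernel** for more efficient training. The install cell below does not install these two (it uninstalls liger-kernel), so add them to the `pip install` line if you enable `report_to="trackio"` or `use_liger_kernel=True`.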

In [8]:
!pip uninstall -y liger-kernel
!pip install -Uq "trl[peft]" bitsandbytes datasets

In [None]:
import torch
import transformers
import trl
import os
os.environ["WANDB_DISABLED"] = "true"

print(f"📦 PyTorch version: {torch.__version__}")
print(f"🤗 Transformers version: {transformers.__version__}")
print(f"📊 TRL version: {trl.__version__}")

## Load the dataset

In this step, we load the [**AI-MO/NuminaMath-TIR**](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR) dataset from the Hugging Face Hub using the `datasets` library.
This dataset focuses on **mathematical reasoning**, featuring problems that require step-by-step logical solutions.
A model that does not yet exhibit strong reasoning capabilities can learn from it to **generate structured reasoning steps**, which improves both its **accuracy** and its **interpretability** on math-related tasks.

For efficiency, we'll load only a **small portion of the training split**:

In [None]:
from datasets import load_dataset

dataset_name = 'AI-MO/NuminaMath-TIR'
train_dataset = load_dataset(dataset_name, split='train[:5%]')

# Check the structure of the dataset
print(train_dataset)
print(train_dataset[0])


## Transform the dataset

We will adapt our dataset to a conversational format using a custom system prompt, guiding the LLM to generate both step-by-step reasoning and the final answer.

In [None]:
# Transform
SYSTEM_PROMPT = (
    "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant  "
    "first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning "
    "process is enclosed strictly within <think> and </think> tags. "
    "After closing </think>, the assistant MUST provide the final answer in plain text."
)


def make_conversation(example):
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": example["problem"]},
        ],
    }

train_dataset = train_dataset.map(make_conversation)

# remove unused columns
train_dataset = train_dataset.remove_columns(['messages', 'problem'])

# Check the structure of the dataset
print(train_dataset)
print(train_dataset[0]['prompt'])


## Load the model

This notebook can be used with two fine-tuning methods: **QLoRA**, which quantizes the base model with `BitsAndBytesConfig`, and standard **LoRA** without quantization. In the cell below the `quantization_config` block is commented out, so the model is loaded unquantized for standard LoRA; uncomment that block to switch to QLoRA and reduce memory consumption.
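
As a rough guide to what quantization saves, the memory needed for the weights alone can be estimated from the parameter count and the bits used per parameter. This is a back-of-the-envelope sketch: the 1.2B figure is taken from the model name rather than an exact count, and the estimate ignores activations, optimizer state, LoRA adapters, and the overhead of quantization constants.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory needed to hold the weights alone, in gigabytes (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 1.2e9  # nominal size from the model name; the exact count may differ

bf16_gb = weight_memory_gb(NUM_PARAMS, 16)  # bfloat16: 16 bits per parameter
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)    # 4-bit quantization
print(f"bfloat16: ~{bf16_gb:.1f} GB, 4-bit: ~{nf4_gb:.1f} GB")
```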

In [None]:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id, output_dir = "LiquidAI/LFM2.5-1.2B-Instruct", "LFM2.5-1.2B-Instruct-GRPO"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    # attn_implementation="flash_attention_2",                   # Uncomment to use Flash Attention 2 if the GPU supports it
    dtype="bfloat16",                          # bfloat16 precision; switch to "float16" if the GPU lacks bfloat16 support
    # quantization_config=BitsAndBytesConfig(
    #     load_in_4bit=True,                        # Load the model in 4-bit precision to save memory
    #     bnb_4bit_compute_dtype=torch.float16,     # Data type used for internal computations in quantization
    #     bnb_4bit_use_double_quant=True,           # Use double quantization to improve accuracy
    #     bnb_4bit_quant_type="nf4"                 # Type of quantization. "nf4" is recommended for recent LLMs
    # )
)

## Define LoRA adapters

The following cell defines LoRA (or QLoRA if needed). When training with LoRA/QLoRA, we use a **base model** (the one selected above) and, instead of modifying its original weights, we fine-tune a **LoRA adapter**, a lightweight layer that enables efficient and memory-friendly training. The **`target_modules`** specify which parts of the model (e.g., attention or projection layers) will be adapted by LoRA during fine-tuning.
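
The idea behind a LoRA adapter can be sketched without any deep learning library. The frozen weight `W` is left untouched, and the layer output becomes `W x + (lora_alpha / r) * B (A x)`, where only the two small matrices `A` and `B` are trained. The toy sizes and values below are hypothetical and chosen so the arithmetic is easy to follow; this illustrates the formula and is not PEFT's implementation.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, lora_alpha, r):
    """Compute W x + (lora_alpha / r) * B (A x) for one input vector x."""
    base = matvec(W, x)               # output of the frozen base layer
    update = matvec(B, matvec(A, x))  # low-rank correction
    scale = lora_alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Toy sizes (hypothetical): 2 inputs, 2 outputs, rank 1.
W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight, shape (d_out, d_in)
A = [[0.5, 0.5]]              # trainable, shape (r, d_in)
B_init = [[0.0], [0.0]]       # trainable, shape (d_out, r); zero before training
B_trained = [[1.0], [-1.0]]   # made-up values standing in for a trained adapter
x = [2.0, 4.0]

out_init = lora_forward(W, A, B_init, x, lora_alpha=1, r=1)
out_trained = lora_forward(W, A, B_trained, x, lora_alpha=1, r=1)
print(out_init, out_trained)
```

With `B` at zero the adapter leaves the base model's output unchanged, which is why training can start from the pretrained behavior. In the configuration used in this notebook, `r=32` and `lora_alpha=32` give a scale of 1.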

In [None]:
from peft import LoraConfig
peft_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules = ["q_proj", "k_proj", "v_proj", "out_proj", "in_proj", "w1", "w2", "w3"],
)

## Load reward functions or define your own

GRPO requires **reward functions** to guide the learning process. For convenience, we can directly load pre-defined rewards from `trl.rewards`, which already includes a [collection of ready-to-use rewards](https://huggingface.co/docs/trl/rewards).

You can also write your own reward functions: a reward function is simply a Python function that takes the generated completions and returns a list of floats. For example, the following function, which we use in this notebook, rewards completions that correctly follow the `<think>` format:

```python
import re

def think_format_reward(completions: list[list[dict[str, str]]], **kwargs) -> list[float]:
    # Exactly one <think>...</think> block at the start, followed by the answer
    pattern = r"^<think>(?!.*<think>)(.*?)</think>.*$"
    completion_contents = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, content, re.DOTALL | re.MULTILINE) for content in completion_contents]
    return [1.0 if match else 0.0 for match in matches]
```

In this notebook, we will use both `think_format_reward`, which rewards completions that correctly follow the `<think>` format, and `reasoning_accuracy_reward`, which evaluates the correctness of the model's solution to the mathematical problem. Together, these rewards guide the model to generate **structured reasoning** while producing **accurate answers**.
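
If the built-in rewards do not fit your task, a custom accuracy reward follows the same shape. The sketch below is a hypothetical example, not part of `trl.rewards`: the function name is made up, and it assumes that the `solution` column holds a bare numeric answer as a string, which is not the case for the worked solutions in NuminaMath-TIR. It compares the last number that appears after the reasoning block with the reference.

```python
import re


def last_number_reward(completions, solution, **kwargs):
    """Return 1.0 when the last number after </think> equals the reference answer."""
    rewards = []
    for completion, reference in zip(completions, solution):
        content = completion[0]["content"]
        # Only look at the text after the reasoning block, if there is one.
        answer_text = content.split("</think>")[-1]
        numbers = re.findall(r"-?\d+(?:\.\d+)?", answer_text)
        if not numbers:
            rewards.append(0.0)
            continue
        try:
            correct = abs(float(numbers[-1]) - float(reference)) < 1e-6
        except ValueError:
            # The reference is not a plain number, so nothing can be verified.
            correct = False
        rewards.append(1.0 if correct else 0.0)
    return rewards


# Made-up completions and references for a quick check.
demo_completions = [
    [{"role": "assistant", "content": "<think>2 + 2 = 4</think>The answer is 4"}],
    [{"role": "assistant", "content": "<think>A guess.</think>The answer is 5."}],
]
demo_rewards = last_number_reward(demo_completions, solution=["4", "4"])
print(demo_rewards)
```

Extra dataset columns such as `solution` reach the function as keyword arguments, which is why any column a reward depends on must still be present in the training dataset.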

In [None]:
from trl.rewards import think_format_reward, reasoning_accuracy_reward

## Train model

In [None]:
from trl import GRPOConfig

# Configure training arguments using GRPOConfig
training_args = GRPOConfig(
    # Training schedule / optimization
    learning_rate=2e-5,                                     # Learning rate for the optimizer
    #num_train_epochs=1,
    max_steps=30,                                          # Number of optimizer steps to run. For full trainings, use `num_train_epochs` instead

    # Parameters that control GRPO training (you can adapt them)
    per_device_train_batch_size = 8,
    max_completion_length=256, # default: 256               # Max completion length produced during training
    num_generations=8, # default: 8                         # Number of generations produced during training for comparison

    # Optimizations
    optim = "paged_adamw_8bit",                             # Optimizer
    use_liger_kernel=True,                                  # Enable Liger kernel optimizations for faster training (requires the liger-kernel package)
    gradient_checkpointing=True,                            # Save memory by re-computing activations during backpropagation

    # Parameters related to reporting and saving
    output_dir=output_dir,                                  # Where to save model checkpoints and logs
    logging_steps=10,                                       # Log training metrics every N steps
    report_to=[],
    # report_to="trackio",                                    # Experiment tracking tool
    # trackio_space_id=output_dir,                            # HF Space where the experiment tracking will be saved
    log_completions=False,                                  # Whether to log sample completions during training

    # Hub integration
    # push_to_hub=True,                                       # Automatically push the trained model to the Hugging Face Hub
                                                            # The model will be saved under your Hub account in the repository named `output_dir`
    # vLLM params
    # Enable with `use_vllm=True` and customize with the remaining params for faster training
    #use_vllm=False,
    #vllm_mode='colocate',
    #vllm_gpu_memory_utilization=0.1,
    #vllm_enable_sleep_mode=True
)

from trl import GRPOTrainer
trainer = GRPOTrainer(
    model=model,
    reward_funcs=[think_format_reward, reasoning_accuracy_reward],
    args=training_args,
    train_dataset=train_dataset,
    peft_config=peft_config,
)

trainer_stats = trainer.train()
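
One detail of the configuration above is easy to miss: the batch size counts completions, not prompts, so it has to be compatible with `num_generations`. The helper below is a hypothetical sketch based on the divisibility constraint described in the `GRPOConfig` documentation; the exact rule has changed across TRL releases, so treat it as an assumption and check the version you have installed.

```python
def prompts_per_step(per_device_train_batch_size, num_generations,
                     num_processes=1, gradient_accumulation_steps=1):
    """Number of distinct prompts covered by one effective batch of completions."""
    effective_batch = (
        per_device_train_batch_size * num_processes * gradient_accumulation_steps
    )
    if effective_batch % num_generations != 0:
        raise ValueError(
            f"effective batch size {effective_batch} is not divisible by "
            f"num_generations={num_generations}"
        )
    return effective_batch // num_generations


# Values from the configuration above, assuming a single GPU.
unique_prompts = prompts_per_step(per_device_train_batch_size=8, num_generations=8)
print(unique_prompts)
```

Under these assumptions, each optimizer step compares eight completions of a single prompt. Raising the batch size or the gradient accumulation steps lets each step cover more prompts.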

## (Optional) Save fine-tuned model

In this step, we save the fine-tuned model **locally**.

In [None]:
trainer.save_model(output_dir)

# Task
Load the 'AI-MO/NuminaMath-TIR' dataset from the Hugging Face Hub, selecting a small portion of the training split, and transform it into a conversational format with a custom system prompt that guides the model to generate both step-by-step reasoning and the final answer, removing the unused columns.

## Load the dataset

### Subtask:
Load the 'AI-MO/NuminaMath-TIR' dataset from the Hugging Face Hub, selecting a small portion of the training split for efficiency.


**Reasoning**:
I need to load the specified dataset from Hugging Face Hub, selecting a small portion of the training split, and then print its structure and the first element as per the instructions.



In [1]:
from datasets import load_dataset

dataset_name = 'AI-MO/NuminaMath-TIR'
train_dataset = load_dataset(dataset_name, split='train[:5%]')

# Check the structure of the dataset
print(train_dataset)
print(train_dataset[0])


Dataset({
    features: ['problem', 'solution', 'messages'],
    num_rows: 3622
})
{'problem': 'What is the coefficient of $x^2y^6$ in the expansion of $\\left(\\frac{3}{5}x-\\frac{y}{2}\\right)^8$?  Express your answer as a common fraction.', 'solution': "To determine the coefficient of \\(x^2y^6\\) in the expansion of \\(\\left(\\frac{3}{5}x - \\frac{y}{2}\\right)^8\\), we can use the binomial theorem.\n\nThe binomial theorem states:\n\\[\n(a + b)^n = \\sum_{k=0}^{n} \\binom{n}{k} a^{n-k} b^k\n\\]\n\nIn this case, \\(a = \\frac{3}{5}x\\), \\(b = -\\frac{y}{2}\\), and \\(n = 8\\).\n\nWe are interested in the term that contains \\(x^2y^6\\). In the general term of the binomial expansion:\n\\[\n\\binom{8}{k} \\left(\\frac{3}{5}x\\right)^{8-k} \\left(-\\frac{y}{2}\\right)^k\n\\]\n\nTo get \\(x^2\\), we need \\(8 - k = 2\\), thus \\(k = 6\\).\n\nSubstituting \\(k = 6\\) into the expression:\n\\[\n\\binom{8}{6} \\left(\\frac{3}{5}x\\right)^{8-6} \\left(-\\frac{y}{2}\\right)^6 = \\binom{8}{

**Reasoning**:
Now that the dataset is loaded, I need to transform it into a conversational format using a custom system prompt and then remove the original 'messages' and 'problem' columns as they are no longer needed. This will prepare the dataset for the model training.



In [2]:
# Transform
SYSTEM_PROMPT = (
    "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant  "
    "first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning "
    "process is enclosed strictly within <think> and </think> tags. "
    "After closing </think>, the assistant MUST provide the final answer in plain text."
)

def make_conversation(example):
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": example["problem"]},
        ],
    }

train_dataset = train_dataset.map(make_conversation)

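# Remove unused columns.
# Note: this also drops 'solution', which rewards that compare against the ground truth
# (e.g. `reasoning_accuracy_reward`) need; keep that column if you train with them.
train_dataset = train_dataset.remove_columns(['messages', 'problem', 'solution'])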

# Check the structure of the dataset
print(train_dataset)
print(train_dataset[0]['prompt'])


Dataset({
    features: ['prompt'],
    num_rows: 3622
})
[{'content': 'A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant  first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process is enclosed strictly within <think> and </think> tags. After closing </think>, the assistant MUST provide the final answer in plain text.', 'role': 'system'}, {'content': 'What is the coefficient of $x^2y^6$ in the expansion of $\\left(\\frac{3}{5}x-\\frac{y}{2}\\right)^8$?  Express your answer as a common fraction.', 'role': 'user'}]


## End of data preparation

### Subtask:
The data has been loaded and transformed and is ready to be used in the fine-tuning process.


## Summary:

### Data Analysis Key Findings

*   The `AI-MO/NuminaMath-TIR` dataset was successfully loaded from Hugging Face Hub, specifically utilizing 5% of its training split, which resulted in a dataset containing 3622 rows.
*   Initially, the dataset included `['problem', 'solution', 'messages']` columns, with the first example showing a mathematical problem, its step-by-step solution, and a conversational history.
*   A custom `SYSTEM_PROMPT` was defined to instruct the model to first generate reasoning within `<think>` tags, followed by a plain text final answer.
*   The dataset was transformed into a conversational format by applying a `make_conversation` function, which generated a new `prompt` column containing the system message and the user's problem.
*   Unused columns, specifically `messages`, `problem`, and `solution`, were subsequently removed, leaving the dataset with only the `['prompt']` column, which is now in the desired conversational format.

### Insights or Next Steps

*   The dataset is now successfully transformed into a conversational format, with a system prompt designed to guide a language model in generating structured reasoning and a clear final answer, making it ready for fine-tuning.
*   The structured `prompt` column, including the system message and user problem, can be directly used as input for fine-tuning a conversational language model.


# Task
Load the base model 'LiquidAI/LFM2.5-1.2B-Instruct' using bfloat16 precision for GPU optimization, and configure the LoraConfig by specifying the target modules `["q_proj", "k_proj", "v_proj", "out_proj", "in_proj", "w1", "w2", "w3"]` with rank `r=32` and `lora_alpha=32` to prepare the model for parameter-efficient fine-tuning with the GRPOTrainer.

## Load the base model

### Subtask:
Load the 'LiquidAI/LFM2.5-1.2B-Instruct' model using bfloat16 precision for efficient GPU usage.


**Reasoning**:
Load the LiquidAI base model with bfloat16 precision as specified in the subtask instructions.



In [4]:
import torch
from transformers import AutoModelForCausalLM

model_id = "LiquidAI/LFM2.5-1.2B-Instruct"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model with bfloat16 precision
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
).to(device)

print(f"Model {model_id} loaded successfully on {device} with {model.dtype}.")

`torch_dtype` is deprecated! Use `dtype` instead!


Model LiquidAI/LFM2.5-1.2B-Instruct loaded successfully on cuda with torch.bfloat16.


**Reasoning**:
The previous execution generated a deprecation warning for 'torch_dtype'. I will update the code to use the recommended 'dtype' parameter to ensure compatibility with newer versions of the Transformers library while maintaining the requested bfloat16 precision.



In [5]:
import torch
from transformers import AutoModelForCausalLM

model_id = "LiquidAI/LFM2.5-1.2B-Instruct"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model with bfloat16 precision using the recommended 'dtype' parameter
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
    trust_remote_code=True
).to(device)

print(f"Model {model_id} loaded successfully on {device} with {model.dtype}.")

Model LiquidAI/LFM2.5-1.2B-Instruct loaded successfully on cuda with torch.bfloat16.


## Configure LoRA

### Subtask:
Define the LoraConfig with specific hyperparameters and target modules for parameter-efficient fine-tuning.


**Reasoning**:
I will define the LoraConfig using the specified hyperparameters, target modules, and task type to prepare the model for parameter-efficient fine-tuning.



In [6]:
from peft import LoraConfig, TaskType

# Define LoRA configuration
peft_config = LoraConfig(
    r=32,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj", "in_proj", "w1", "w2", "w3"],
    task_type=TaskType.CAUSAL_LM
)

print("LoRA configuration created with rank 32 and specified target modules.")

LoRA configuration created with rank 32 and specified target modules.


## Final Task

### Subtask:
Confirm the model and LoRA configuration are ready for the GRPOTrainer.


## Summary:

### Data Analysis Key Findings

*   **Model Initialization**: The base model `LiquidAI/LFM2.5-1.2B-Instruct` was successfully loaded into the environment.
*   **Precision Optimization**: The model was configured with `torch.bfloat16` precision to optimize memory usage and performance on the GPU.
*   **Hardware Utilization**: The setup correctly identified and utilized the CUDA device for model placement.
*   **LoRA Hyperparameters**: A Parameter-Efficient Fine-Tuning (PEFT) configuration was established with a rank ($r$) of 32 and a scaling factor (`lora_alpha`) of 32.
*   **Comprehensive Module Targeting**: The LoRA configuration was mapped to eight specific target modules: `q_proj`, `k_proj`, `v_proj`, `out_proj`, `in_proj`, `w1`, `w2`, and `w3`.
*   **Task Alignment**: The fine-tuning task was explicitly defined as `CAUSAL_LM`, matching the generative capabilities of the LFM2.5-1.2B architecture.

### Insights or Next Steps

*   **Trainer Integration**: The next step is to pass the loaded model and the `peft_config` into the `GRPOTrainer` to begin the reinforcement learning/fine-tuning process.
*   **Efficiency Monitoring**: Given the rank of 32, monitor the trainable parameter count relative to the total model size to ensure the PEFT approach remains memory-efficient during training.
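
To get a feel for the numbers before training, the parameters LoRA adds to a single linear layer can be computed directly: a layer of shape `(d_out, d_in)` gains `r * (d_in + d_out)` trainable parameters. The layer size below is hypothetical, since the real projection shapes of LFM2.5 are not listed in this notebook.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one linear layer: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)


# Hypothetical square projection; substitute the real layer shapes of your model.
d_in = d_out = 2048
full_params = d_in * d_out
added_params = lora_param_count(d_in, d_out, r=32)
ratio = added_params / full_params
print(added_params, full_params, ratio)
```

For this hypothetical layer, rank 32 adds about 3% of the layer's own parameter count. Summing the same quantity over every targeted module gives the trainable total to compare against the full model size.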
