<a href="https://colab.research.google.com/github/semhoun/omnius/blob/main/nb/Claire.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Acknowledgments
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> <a href="https://github.com/unslothai/unsloth"><img src="https://github.githubassets.com/assets/GitHub-Mark-ea2971cee799.png" width="43"></a>
</div>

# Data Sources:
* https://huggingface.co/angeluriot
* https://huggingface.co/CATIE-AQ
* https://huggingface.co/jpacifico
* https://huggingface.co/OpenLLM-France

# Params


In [1]:
import os

if "UNSLOTH_DOCKER" in "".join(os.environ.keys()):
  hf_token = os.environ["HUGGINGFACE_TOKEN"]
  hf_username = os.environ["HUGGINGFACE_USERNAME"]
elif "COLAB_" in "".join(os.environ.keys()):
  from google.colab import userdata
  hf_token = userdata.get('HuggingFaceToken')
  hf_username = userdata.get('HuggingFaceUsername')
else:
  hf_token = os.environ["HUGGINGFACE_TOKEN"]
  hf_username = os.environ["HUGGINGFACE_USERNAME"]

debug = True
save_hf = False
local_llama = "LLAMACPP_PATH" in "".join(os.environ.keys())
dataset_processors = 8

source_model = "HuggingFaceTB/SmolLM3-3B-Base"
quantizations = ["q8_0", "q4_k_m "]
model_name = "Claire-3B-v0.2.0"

# Installation

In [2]:
%%capture
import os, re

if "UNSLOTH_DOCKER" in "".join(os.environ.keys()):
  # Docker Unsloth version
  pass
if "COLAB_" in "".join(os.environ.keys()):
  os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"
  # Google colab
  import torch; v = re.match(r"[0-9\.]{3,}", str(torch.__version__)).group(0)
  xformers = "xformers==" + ("0.0.32.post2" if v == "2.8.0" else "0.0.29.post3")
  !pip install --upgrade --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
  !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
  !pip install --no-deps --upgrade unsloth
  !pip install transformers==4.55.4
  !pip install --no-deps trl==0.22.2
else:
  os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"
  !pip install --upgrade unsloth-zoo
  !pip install --upgrade unsloth
  !pip install transformers==4.55.4
  !pip install --no-deps trl==0.22.2

# Unsloth

## Model download

In [3]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = source_model,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = False,
    load_in_8bit = False,
    token = hf_token,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
INFO 09-30 12:18:06 [__init__.py:241] Automatically detected platform cuda.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.9.9: Fast Smollm3 patching. Transformers: 4.56.2. vLLM: 0.10.1.
   \\   /|    NVIDIA GeForce RTX 3060. Num GPUs = 1. Max memory: 11.629 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.1+cu128. CUDA: 8.6. CUDA Toolkit: 12.8. Triton: 3.3.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.31. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.


model.safetensors.index.json: 0.00B [00:00, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/1.18G [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/126 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/151 [00:00<?, ?B/s]

HuggingFaceTB/SmolLM3-3B-Base does not have a padding token! Will use pad_token = <|finetune_right_pad_id|>.


## LoRA
Add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [4]:
from unsloth import FastModel

lora_rank = 16 # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank,
    target_modules = [
      "q_proj", "k_proj", "v_proj", "o_proj",
      "gate_proj", "up_proj", "down_proj"
      "embed_tokens", "lm_head", # Add for continual pretraining, it seem to be replaced by module_to_save
    ],
    modules_to_save=["embed_tokens", "lm_head"],
    lora_alpha = lora_rank,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth: Making `model.base_model.model.model.embed_tokens` require gradients
Unsloth: Allowing gradients for `base_model.model.model.embed_tokens` since it's in `modules_to_save`.
Unsloth: Allowing gradients for `base_model.model.lm_head` since it's in `modules_to_save`.


# Tools


## Dataset formatter
**[NOTE]** Remember to add the **EOS_TOKEN** to the tokenized output!! Otherwise you'll get infinite generations!

ChatML renders multi turn conversations like below:

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What's the capital of France?<|im_end|>
<|im_start|>assistant
Paris.
```

We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old` and our own optimized `unsloth` template.

Normally one has to train `<|im_start|>` and `<|im_end|>`. We instead map `<|im_end|>` to be the EOS token, and leave `<|im_start|>` as is. This requires no additional training of additional tokens.

More info on chat templates on [our wiki page!](https://github.com/unslothai/unsloth/wiki#chat-templates)

For text completions like novel writing, try this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing).

In [5]:
#@title Base Prompt
wikipedia_prompt = """{} Article
### Title: {}

### Article:
{}"""

ebook_prompt = """Book
### Title: {}

### Author: {}

### Content:
{}"""

In [6]:
#@title SmallLm ChatML
smollm_chatml = """{#  ───── defaults ─────  #}
{%- if enable_thinking is not defined -%}
	{%- set enable_thinking = true -%}
{%- endif -%}
{#  ───── reasoning mode ─────  #}
{%- if enable_thinking -%}
	{%- set reasoning_mode = "/think" -%}
{%- else -%}
	{%- set reasoning_mode = "/no_think" -%}
{%- endif -%}
{#  ───── header (system message) ─────  #}
{{- "<|im_start|>system\n" -}}
{%- if messages[0].role == "system" -%}
	{%- set system_message = messages[0].content -%}
	{%- if "/no_think" in system_message -%}
		{%- set reasoning_mode = "/no_think" -%}
	{%- elif "/think" in system_message -%}
		{%- set reasoning_mode = "/think" -%}
	{%- endif -%}
	{%- set custom_instructions = system_message.replace("/no_think", "").replace("/think", "").rstrip() -%}
{%- endif -%}
{%- if "/system_override" in system_message -%}
	{{- custom_instructions.replace("/system_override", "").rstrip() -}}
	{{- "<|im_end|>\n" -}}
{%- else -%}
	{{- "## Metadata\n\n" -}}
	{{- "Knowledge Cutoff Date: June 2025\n" -}}
	{%- set today = strftime_now("%d %B %Y") -%}
	{{- "Today Date: " ~ today ~ "\n" -}}
	{{- "Reasoning Mode: " + reasoning_mode + "\n\n" -}}
	{{- "## Custom Instructions\n\n" -}}
	{%- if custom_instructions -%}
		{{- custom_instructions + "\n\n" -}}
	{%- elif reasoning_mode == "/think" -%}
		{{- "You are a helpful AI assistant named SmolLM, trained by Hugging Face. Your role as an assistant involves thoroughly exploring questions through a systematic thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracking, and iteration to develop well-considered thinking process. Please structure your response into two main sections: Thought and Solution using the specified format: <think> Thought section </think> Solution section. In the Thought section, detail your reasoning process in steps. Each step should include detailed considerations such as analysing questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The Solution section should be logical, accurate, and concise and detail necessary steps needed to reach the conclusion.\n\n" -}}
	{%- else -%}
		{{- "You are a helpful AI assistant named SmolLM, trained by Hugging Face.\n\n" -}}
	{%- endif -%}
	{%- if xml_tools or python_tools or tools -%}
		{{- "### Tools\n\n" -}}
		{%- if xml_tools or tools -%}
			{%- if tools -%}
				{%- set xml_tools = tools -%}
			{%- endif -%}
			{%- set ns = namespace(xml_tool_string="You may call one or more functions to assist with the user query.\nYou are provided with function signatures within <tools></tools> XML tags:\n\n<tools>\n") -%}
			{%- for tool in xml_tools[:] -%}
				{#  The slicing makes sure that xml_tools is a list  #}
				{%- set ns.xml_tool_string = ns.xml_tool_string ~ tool | string ~ "\n" -%}
			{%- endfor -%}
			{%- set xml_tool_string = ns.xml_tool_string + "</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{" + '"' + "name" + '"' + ": <function-name>, " + '"' + "arguments" + '"' + ": <args-json-object>}\n</tool_call>" -%}
			{{- xml_tool_string -}}
		{%- endif -%}
		{%- if python_tools -%}
			{%- set ns = namespace(python_tool_string="When you send a message containing Python code between '<code>' and '</code>' tags, it will be executed in a stateful Jupyter notebook environment, and you will then be given the output to continued reasoning in an agentic loop.\n\nYou can use the following tools in your python code like regular functions:\n<tools>\n") -%}
			{%- for tool in python_tools[:] -%}
				{#  The slicing makes sure that python_tools is a list  #}
				{%- set ns.python_tool_string = ns.python_tool_string ~ tool | string ~ "\n" -%}
			{%- endfor -%}
			{%- set python_tool_string = ns.python_tool_string + "</tools>\n\nThe state persists between code executions: so variables that you define in one step are still available thereafter." -%}
			{{- python_tool_string -}}
		{%- endif -%}
		{{- "\n\n" -}}
		{{- "<|im_end|>\n" -}}
	{%- endif -%}
{%- endif -%}
{#  ───── main loop ─────  #}
{%- for message in messages -%}
	{%- set content = message.content if message.content is string else "" -%}
	{%- if message.role == "user" -%}
		{{- "<|im_start|>" + message.role + "\n" + content + "<|im_end|>\n" -}}
	{%- elif message.role == "assistant" -%}
		{%- if reasoning_mode == "/think" -%}
			{{- "<|im_start|>assistant\n" + content.lstrip("\n") + "<|im_end|>\n" -}}
		{%- else -%}
			{{- "<|im_start|>assistant\n" + "<think>\n\n</think>\n" + content.lstrip("\n") + "<|im_end|>\n" -}}
		{%- endif -%}
	{%- elif message.role == "tool" -%}
		{{- "<|im_start|>" + "user\n" + content + "<|im_end|>\n" -}}
	{%- endif -%}
{%- endfor -%}
{#  ───── generation prompt ─────  #}
{%- if add_generation_prompt -%}
	{%- if reasoning_mode == "/think" -%}
		{{- "<|im_start|>assistant\n" -}}
	{%- else -%}
		{{- "<|im_start|>assistant\n" + "<think>\n\n</think>\n" -}}
	{%- endif -%}
{%- endif -%}"""

In [8]:
#@title Prompt Formatter
import pprint, json

tokenizer.chat_template = smollm_chatml
EOS_TOKEN = tokenizer.eos_token

def format_sft(examples, thinking):
  # convos or messages
  for key in ("conversations", "conversation", "messages"):
    convos = examples.get(key)
    if convos is not None:
      break
  else:
    raise ValueError("Unknown sft data format.")

  # system and context
  for key in ("system", "context"):
    systems = examples.get(key)
    if systems is not None:
      # Special fix for context
      if (key == "context"):
        contexts = []
        for system in systems:
          contexts.append("Utilise le contexte entre <ctx> pour répondre. Si tu ne sais pas, dis-le.\n<ctx>\n" + system +"\n</ctx>")
        systems = contexts
      break
  else:
    systems = [""] * len(convos)

  if (examples.get('chat_template_kwargs') != None):
    kwargss = examples["chat_template_kwargs"]
  else:
    kwargss = [{'enable_thinking' : thinking}] * len(convos) # Corrected list creation

  outputs = []
  for system, convo, kwargs in zip(systems, convos, kwargss):
    messages = [{
      "role": "system",
      "content": system
    }]
    for row in convo:
      role = row.get('from') or row.get('role')
      if role is None:
        raise ValueError('Unknow role data format.')
      normalized_role = str(role).lower()
      if 'humain' in normalized_role:
        role = 'user'
      elif 'gpt' in normalized_role:
        role = 'assistant'
      else:
        role = normalized_role

      content = row.get('content') or row.get('text')
      if content is None:
          raise ValueError('Unknow role data format.')

      messages.append({
        "role": role,
        "content": content
      })
    outputs.append(tokenizer.apply_chat_template(
      messages,
      tokenize=False,
      add_generation_prompt=False,
      **kwargs
    ))
  return { "text" : outputs, }

def format_cpt_wikipedia(examples):
  sources = examples["source"]
  titles = examples["title"]
  texts  = examples["text"]
  outputs = []
  for source, title, text in zip(sources, titles, texts):
    text = wikipedia_prompt.format(source.title(),title, text) + EOS_TOKEN
    outputs.append(text)
  return { "text" : outputs, }

def format_cpt_ebook(examples):
  titles = examples["title"]
  authors = examples["author"]
  texts  = examples.get('text') or examples.get('content')
  outputs = []
  for title, author, text in zip(titles, authors, texts):
    author = json.loads(author)
    text = ebook_prompt.format(title, author['author'], text) + EOS_TOKEN
    outputs.append(text)
  return { "text" : outputs, }

def format_text(examples):
    return { "text" : [example + EOS_TOKEN for example in examples["text"]] }

def format_dpo(exemples, thinking):
  for key in ("question", "prompt"):
    prompts = examples.get(key)
    if prompts is not None:
      break
  else:
    raise ValueError("Unknown dpo data format.")

  for key in ("system", "context"):
    systems = examples.get(key)
    if systems is not None:
      break
  else:
    systems = [""] * len(prompts)

  chosens = examples.get('chosen')
  rejecteds = examples.get('rejected')

  outputs = {
      "prompt" : [],
      "chosen" : [],
      "rejected" : []
  }
  kwargs = [{'enable_thinking' : false}]
  for system, prompt, chosen, rejected in zip(systems, prompts, chosens, rejecteds):
    messages = [{
      "role": "system",
      "content": system
    }, {
      "role": "user",
      "content": prompt
    }]
    outputs["prompt"].append(tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True, **kwargs))
    outputs["chosen"].append(chosen + EOS_TOKEN)
    outputs["rejected"].append(rejected + EOS_TOKEN)
  return outputs

# Must add EOS_TOKEN, otherwise your generation will go on forever!
# batched=True is needed
def formatting_prompts_func(
    examples,
    task: "sft", # "sft", "cpt_wikipedia", "cpt_book", "text", "dpo"
    thinking = False):
  formatting_functions = {
      'sft': lambda exs: format_sft(exs, thinking),
      'cpt_wikipedia': format_cpt_wikipedia,
      'cpt_ebook': format_cpt_ebook,
      'dpo': lambda exs: format_dpo(exs, thinking),
      'text': format_text,
  }
  formatter = formatting_functions.get(task)
  if formatter:
      return formatter(examples)
  else:
      raise ValueError(f'Unknown task format: {task}')

## Dataset helper

In [9]:
import random
from datasets import load_dataset
from datasets import concatenate_datasets
from datasets import Dataset

def dumpDatasets(datasets):
  for key, value in datasets.items():
    print('=' * 25 + ' ' + key.upper() + ' ' + '=' * 25)
    row = value[random.randrange(0, len(value), 1)]
    for rkey, rvalue in row.items():
      print('#' * 10 + ' ' + rkey.upper())
      pprint.pp(rvalue[:16384])
    print("\n\n")
pass

def mergeDatasets(datasets):
  resDataSet = Dataset.from_dict({})
  for key, value in datasets.items():
    if debug:
      value = value.train_test_split(train_size = 0.02)['train']
    resDataSet = concatenate_datasets([resDataSet, value])
  resDataSet = resDataSet.shuffle(seed=0)
  return resDataSet
pass

def loadDataset(url, task, subset = None, thinking = False, split = 'train'):
  if (subset != None):
    dataset = load_dataset(url, subset, split = split, token=hf_token)
  else:
    dataset = load_dataset(url, split = split, token=hf_token)
  dataset = dataset.map(formatting_prompts_func, batched = True, fn_kwargs= {'task':task, 'thinking': thinking})
  if (task == 'dpo'):
    dataset = dataset.select_columns(['prompt', 'chosen', 'rejected'])
  else:
    dataset = dataset.select_columns(['text'])
  return dataset

# Continued pre-training (CPT)

## Datasets(s)



In [None]:
#@title Init datasets
datasets = {}

In [None]:
#@title EBook data
datasets['ebooks'] = loadDataset('nsemhoun/ebooks', 'cpt_ebook')

books.jsonl:   0%|          | 0.00/59.3M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/101 [00:00<?, ? examples/s]

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

In [None]:
#@title Wikipedia data
datasets['wikipedia'] = loadDataset('OpenLLM-France/wikimedia', 'cpt_wikipedia', subset='fr')

NameError: name 'loadDataset' is not defined

In [None]:
#@title Gutenberg data
datasets['gutenberg'] = loadDataset('OpenLLM-France/Lucie-Training-Dataset', 'cpt_ebook', subset='Gutenberg-fr')

README.md: 0.00B [00:00, ?B/s]

Resolving data files:   0%|          | 0/9491 [00:00<?, ?it/s]

data/v1.1/natural/fr/Gutenberg/Gutenberg(…):   0%|          | 0.00/246M [00:00<?, ?B/s]

data/v1.1/natural/fr/Gutenberg/Gutenberg(…):   0%|          | 0.00/247M [00:00<?, ?B/s]

data/v1.1/natural/fr/Gutenberg/Gutenberg(…):   0%|          | 0.00/240M [00:00<?, ?B/s]

data/v1.1/natural/fr/Gutenberg/Gutenberg(…):   0%|          | 0.00/158M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/3449 [00:00<?, ? examples/s]

In [None]:
#@title Dump datasets
dumpDatasets(datasets)

########## TEXT
('Book\n'
 '### Title: Œuvres complètes de Alfred de Musset — Tome 4\n'
 '\n'
 '### Author: Musset, Alfred de\n'
 '\n'
 '### Content:\n'
 'OEUVRES COMPLÈTES\n'
 '\n'
 '                                DE\n'
 '\n'
 '                         ALFRED DE MUSSET\n'
 '\n'
 "ÉDITION ORNÉE DE 28 GRAVURES D'APRÈS LES DESSINS DE BIDA D'UN PORTRAIT\n"
 "GRAVÉ PAR FLAMENG; D'APRÈS L'ORIGINAL DE LANDELLE ET ACCOMPAGNÉE D'UNE\n"
 'NOTICE SUR ALFRED DE MUSSET PAR SON FRÈRE\n'
 '\n'
 '                          TOME QUATRIÈME\n'
 '\n'
 '                             COMÉDIES\n'
 '                                II\n'
 '\n'
 'PARIS EDITION CHARPENTIER L. HÉBERT, LIBRAIRE 7, RUE PERRONET, 7\n'
 '\n'
 '1888\n'
 '\n'
 '       *       *       *       *       *\n'
 '\n'
 '                            LORENZACCIO\n'
 '\n'
 '                        DRAME EN CINQ ACTES\n'
 '\n'
 '                               1834\n'
 '\n'
 'PERSONNAGES.\n'
 '\n'
 '    ALEXANDRE DE MÉDICIS, duc de Florence.\n'
 '\n

In [None]:
#@title Merge datasets
dataset = mergeDatasets(datasets)

########## TEXT
('Epub Book\n'
 '### Filename: crime du golf, Le - Agatha Christie.epub\n'
 '\n'
 '### Title: Le crime du golf\n'
 '\n'
 '### Author: Agatha Christie\n'
 '\n'
 '### Subject: Policier\n'
 '\n'
 '### Part: Christie,Agatha-Le crime du '
 'golf(1923).French.ebook.AlexandriZ_split_024\n'
 '\n'
 '### Content:\n'
 '24  Sauvez\\-le !\n'
 '\n'
 'Nous revînmes en France par le bateau\n'
 'du soir, pour nous trouver le lendemain matin à Saint\\-Omer, où l’on avait\n'
 'transféré Jack Renauld. Poirot voulut aussitôt rendre visite à M. Hautet. '
 'Comme\n'
 'il ne semblait voir aucune objection à ma présence, je l’accompagnai.\n'
 '\n'
 'Après diverses formalités, nous\n'
 'fûmes introduits dans le bureau du magistrat, qui nous accueillit avec\n'
 'cordialité.\n'
 '\n'
 '— Je m’étais laissé dire que\n'
 'vous étiez retourné en Angleterre, monsieur Poirot. Je me réjouis de '
 'constater\n'
 'qu’il n’en est rien.\n'
 '\n'
 '— J’y suis allé, en effet, mais\n'
 'pour une simple visite é

## Training
Now let's use Unsloth's `UnslothTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer).
Also set `embedding_learning_rate` to be a learning rate at least 2x or 10x smaller than `learning_rate` to make continual pretraining work!

### Setup Trainer

In [None]:
from transformers import TrainingArguments
from unsloth import UnslothTrainer, UnslothTrainingArguments

trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = dataset_processors,

    args = UnslothTrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 8,

        # Use warmup_ratio and num_train_epochs for longer runs!
        max_steps = 60 if debug else -1, # None for full run
        num_train_epochs = 0 if debug else 1,
        warmup_ratio = 0.1,

        # Select a 2 to 10x smaller learning rate for the embedding matrices!
        learning_rate = 5e-5,
        embedding_learning_rate = 1e-5,

        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=6):   0%|          | 0/2179 [00:00<?, ? examples/s]

### Training Execution
Execute the training process with the configured trainer and monitor the training progress.

In [None]:
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
8.863 GB of memory reserved.


In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 2,179 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 8 x 1) = 16
 "-____-"     Trainable parameters = 555,565,056 of 3,630,663,680 (15.30% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss


OutOfMemoryError: CUDA out of memory. Tried to allocate 1008.00 MiB. GPU 0 has a total capacity of 14.74 GiB of which 436.12 MiB is free. Process 9950 has 14.31 GiB memory in use. Of the allocated memory 14.08 GiB is allocated by PyTorch, and 84.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [None]:
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

# Supervised Fine-Tuning (SFT)

## Datasets(s)

In [None]:
#@title Init datasets
datasets = {}

In [None]:
#@title French instruct
datasets['french_instruct'] = loadDataset('angeluriot/french_instruct', 'sft')

README.md: 0.00B [00:00, ?B/s]

dataset.parquet:   0%|          | 0.00/181M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/275600 [00:00<?, ? examples/s]

Map:   0%|          | 0/275600 [00:00<?, ? examples/s]

In [None]:
#@title SmolTalk Think
datasets['smaltalk-think'] = loadDataset('CATIE-AQ/smoltalk2_aya_think_dataset_french_split', 'sft', subset='french_raisonning', thinking=True)

README.md: 0.00B [00:00, ?B/s]

french/train-00000-of-00001.parquet:   0%|          | 0.00/3.85M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1393 [00:00<?, ? examples/s]

Map:   0%|          | 0/1393 [00:00<?, ? examples/s]

In [None]:
#@title SmolTalk ToolCalling
datasets['smaltalk-toolcalling'] = loadDataset('CATIE-AQ/smoltalk2_smolagents_toolcalling_french', 'sft')

README.md:   0%|          | 0.00/968 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/82.3M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/9079 [00:00<?, ? examples/s]

Map:   0%|          | 0/9079 [00:00<?, ? examples/s]

In [None]:
#@title Facebook Community
datasets['facebook'] = loadDataset('CATIE-AQ/facebook-community-alignment-dataset_french_conversation', 'sft')

README.md: 0.00B [00:00, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/13.3M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/11433 [00:00<?, ? examples/s]

Map:   0%|          | 0/11433 [00:00<?, ? examples/s]

In [None]:
#@title Dump datasets
dumpDatasets(datasets)

########## TEXT
('<|im_start|>system\n'
 '## Metadata\n'
 '\n'
 'Knowledge Cutoff Date: June 2025\n'
 'Today Date: 29 September 2025\n'
 'Reasoning Mode: /no_think\n'
 '\n'
 '## Custom Instructions\n'
 '\n'
 'Utilise le contexte entre <ctx> pour répondre. Si tu ne sais pas, dis-le.\n'
 '<ctx>\n'
 '\n'
 '</ctx>\n'
 '\n'
 '<|im_start|>user\n'
 'Une application financière omniprésente sur Mac, iOS et Windows ?<|im_end|>\n'
 '<|im_start|>assistant\n'
 '<think>\n'
 '\n'
 '</think>\n'
 'Il existe plusieurs applications financières disponibles sur Mac, iOS et '
 "Windows. Voici quelques exemples : Mint : il s'agit d'une application de "
 'finances personnelles gratuite qui vous permet de suivre vos dépenses, de '
 "créer un budget et d'obtenir des conseils personnalisés sur la façon "
 "d'économiser de l'argent. Il est disponible sur les trois plates-formes. "
 "Personal Capital : il s'agit d'une application de gestion financière qui "
 'offre une gamme de fonctionnalités, notamment la budgét

In [None]:
#@title Merge datasets
dataset = mergeDatasets(datasets)

########## TEXT
('<|im_start|>user\n'
 'Comment s’appelle la partie solide située au centre de la Terre ?<|im_end|>\n'
 '<|im_start|>assistant\n'
 '<think>\n'
 "D'accord, l'utilisateur pose des questions sur la partie solide au centre de "
 'la Terre. Permettez-moi de réfléchir. Je me souviens que la Terre a des '
 'couches. La couche la plus externe est la croûte, puis le manteau, qui est '
 "principalement solide mais peut s'écouler lentement. Ensuite, il y a le "
 "noyau externe, qui est liquide, n'est-ce pas ? Et le noyau interne est "
 'solide. Ainsi, le centre solide serait le noyau interne.  Attendez, le noyau '
 "interne est-il vraiment solide ? Je pense qu'il est sous une pression si "
 "élevée que même s'il fait extrêmement chaud, la pression le maintient "
 'solide. Le noyau externe est liquide car la pression y est plus faible, ce '
 "qui permet au fer et au nickel de rester à l'état liquide. La réponse "
 'devrait donc être le noyau interne. Permettez-moi de vérifier cela.

## Training


### Setup Trainer

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = dataset_processors,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 8,
        warmup_steps = 5,
        max_steps = 60 if debug else -1, # None for full run
        num_train_epochs = 0 if debug else 1,
        learning_rate = 2e-4, # Lower for slower but more precise fine-tuning. Try values like 1e-4, 5e-5, or 2e-5
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # remove to activate WandDB
    ),
)

### Training Execution
Execute the training process with the configured trainer and monitor the training progress.

In [None]:
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

In [None]:
trainer_stats = trainer.train()

In [None]:
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

# Direct Preference Optimization (DPO)

## Datasets

In [None]:
#@title Init datasets
datasets = {}

In [None]:
#@title French ORCA DPO
datasets['french-orca-dpo'] = loadDataset('jpacifico/french-orca-dpo-pairs-revised', 'dpo')

In [None]:
#@title Aya French
if False: datasets['aya-french'] = loadDataset('CATIE-AQ/aya_french_dpo', 'dpo')

In [None]:
#@title Dump datasets
dumpDatasets(datasets)

In [None]:
#@title Merge datasets
dataset = mergeDatasets(datasets)

## Training


### Setup Trainer

In [None]:
# Enable reward modelling stats
from unsloth import PatchDPOTrainer
PatchDPOTrainer()
from transformers import TrainingArguments
from trl import DPOTrainer, DPOConfig
from unsloth import is_bfloat16_supported

dpo_trainer = DPOTrainer(
    model = model,
    ref_model = None,
    beta = 0.1,
    train_dataset = dataset,
    tokenizer = tokenizer,
    max_length = 4096,
    max_prompt_length = 2048,
    dataset_num_proc = dataset_processors,
    args = DPOConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60 if debug else -1, # None for full run
        num_train_epochs = 0 if debug else 1,
        learning_rate = 2e-4, # Lower for slower but more precise fine-tuning. Try values like 1e-4, 5e-5, or 2e-5
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 42,
        output_dir = "outputs",
        report_to = "none", # remove to activate WandDB
    ),
)

### Training Execution
Execute the training process with the configured trainer and monitor the training progress.

In [None]:
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

In [None]:
dpo_trainer.train()

In [None]:
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

# Testing (inference)
Let's run the model! You can change the instruction and input - leave the output blank!

In [None]:
from unsloth.chat_templates import get_chat_template

tokenizerInference = get_chat_template(
    tokenizer,
    chat_template = "chatml", # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
    map_eos_token = True, # Maps <|im_end|> to </s> instead
)

FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenizerInference = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True)
tokenizerInference.batch_decode(outputs)

 You can also use a `TextStreamer` for continuous inference - so you can see the generation token by token, instead of waiting the whole time!

In [None]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,"},
    #{"role": "user", "content": "Quel est la fameuse grande tour à Paris?"},
]
inputs = tokenizerInference.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizerInference)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128, use_cache = True)

# Saving

## Create HF repository

In [None]:
from huggingface_hub import HfApi

if save_hf:
  hf_api = HfApi(token=hf_token)
  hf_api.create_repo(repo_id = hf_username + "/" + model_name, repo_type = "model", private = True, exist_ok = True)
  hf_api.create_repo(repo_id = hf_username + "/" + model_name + "-GGUF", repo_type = "model", private = True, exist_ok = True)


## Saving to float16 for VLLM (safetensors)

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [10]:
model.save_pretrained_merged(model_name, tokenizer)
if save_hf: model.push_to_hub_merged(hf_username + "/" + model_name, token = hf_token)

Found HuggingFace hub cache directory: /home/unsloth/.cache/huggingface/hub


Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]

Checking cache directory for required files...
Successfully copied all 2 files from cache to Claire-3B-v0.2.0.


Unsloth: Preparing safetensor model files: 100%|██████████| 2/2 [00:00<00:00, 33288.13it/s]
Unsloth: Merging weights into 16bit: 100%|██████████| 2/2 [02:55<00:00, 87.71s/it]


Unsloth: Merge process complete.


## GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and we default save it to `q8_0`. We allow all methods like `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

In [None]:
model.save_pretrained_gguf(model_name, tokenizer, quantization_method = quantizations)

if save_hf:
  for quant in quantizations:
    hf_api.upload_file(
      path_or_fileobj=model_name + "." + quant.upper() + ".gguf",
      path_in_repo=model_name + "-" + quant.upper() + ".gguf",
      repo_id=hf_username + "/" + model_name + "-GGUF"
    )

RuntimeError: Unsloth: `q4_k_m` quantization type is not supported.
The following quantization types are supported: `['f16', 'q8_0', 'f32', 'bf16']`