<a href="https://colab.research.google.com/github/semhoun/omnius/blob/main/nb/ClaireLight.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>


# Definitions


In [2]:
import os

if "COLAB_" in "".join(os.environ.keys()):
  from google.colab import userdata
  hf_token = userdata.get('HuggingFaceToken')
else:
  hf_token = os.environ["HUGGINGFACETOKEN"]

hf_username = "nsemhoun"
debug = False
save_hf = True
version = '3B'

if version == '9B':
  source_model = "nvidia/NVIDIA-Nemotron-Nano-9B-v2"
  quantization_method = ["q8_0", "f16", "q4_k_m", "q4_0"]
  model_name = "Claire-9B-v0.1.0"
else:
  source_model = "HuggingFaceTB/SmolLM3-3B"
  quantization_method = ["q8_0", "f16"]
  model_name = "Claire-3B-v0.1.1"
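Note that the `if`/`else` above falls back to the 3B settings for any value other than `'9B'`, including typos such as `'3b'` or `'9b'`. A minimal sketch of a stricter alternative, assuming a lookup table is acceptable here; `MODEL_CONFIGS` and `select_config` are hypothetical names, and the values are copied from the cell above:

```python
# Hypothetical lookup table; the values mirror the if/else branches above.
MODEL_CONFIGS = {
    "9B": {
        "source_model": "nvidia/NVIDIA-Nemotron-Nano-9B-v2",
        "quantization_method": ["q8_0", "f16", "q4_k_m", "q4_0"],
        "model_name": "Claire-9B-v0.1.0",
    },
    "3B": {
        "source_model": "HuggingFaceTB/SmolLM3-3B",
        "quantization_method": ["q8_0", "f16"],
        "model_name": "Claire-3B-v0.1.1",
    },
}

def select_config(version):
    """Return the settings for `version`, failing loudly on unknown values."""
    try:
        return MODEL_CONFIGS[version]
    except KeyError:
        known = sorted(MODEL_CONFIGS)
        raise ValueError(f"Unknown version {version!r}; expected one of {known}") from None

config = select_config("3B")
print(config["source_model"], config["model_name"])
```

With this variant a mistyped version raises a `ValueError` instead of silently training the 3B model.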

# Installation

In [3]:
%%capture
import os, re
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

if "COLAB_" in "".join(os.environ.keys()):
    # Google colab
    import torch; v = re.match(r"[0-9\.]{3,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.32.post2" if v == "2.8.0" else "0.0.29.post3")
    !pip install --upgrade --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps --upgrade unsloth
    !pip install transformers==4.55.4
    !pip install --no-deps trl==0.22.2
elif "VAST_" in "".join(os.environ.keys()):
    # Vast.ai Unsloth version
    pass
else:
    !pip install --upgrade unsloth-zoo
    !pip install --upgrade unsloth
    !pip install transformers==4.55.4
    !pip install --no-deps trl==0.22.2
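The Colab branch above pins `xformers` according to the installed `torch` version. The sketch below isolates that logic so it can be checked without installing anything; `pick_xformers` is a hypothetical helper name, and `"2.6.0+cu124"` is only an illustrative version string:

```python
import re

def pick_xformers(torch_version):
    """Mirror the Colab branch: choose the xformers pin for a torch version string."""
    # Keep only the leading run of digits and dots, e.g. "2.8.0+cu126" -> "2.8.0".
    v = re.match(r"[0-9\.]{3,}", torch_version).group(0)
    return "xformers==" + ("0.0.32.post2" if v == "2.8.0" else "0.0.29.post3")

print(pick_xformers("2.8.0+cu126"))
print(pick_xformers("2.6.0+cu124"))
```

Any version other than exactly `2.8.0` falls through to the older pin, so the pin may need revisiting when Colab upgrades `torch`.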

# Unsloth

In [4]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = source_model,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = False,
    load_in_8bit = False,
    token = hf_token,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.9.6: Fast Smollm3 patching. Transformers: 4.55.4.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.


model.safetensors.index.json: 0.00B [00:00, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/1.18G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/182 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/289 [00:00<?, ?B/s]

chat_template.jinja: 0.00B [00:00, ?B/s]

HuggingFaceTB/SmolLM3-3B does not have a padding token! Will use pad_token = <|finetune_right_pad_id|>.


Add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [4]:
from unsloth import FastLanguageModel

lora_rank = 16 # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",
                      "embed_tokens", "lm_head",], # Add for continual pretraining
    lora_alpha = lora_rank,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)



Unsloth: Making `model.base_model.model.model.embed_tokens` require gradients
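To see why LoRA touches only a small share of the weights, it helps to count parameters for a single linear layer. A LoRA adapter of rank `r` adds two matrices, `A` of shape `r x d_in` and `B` of shape `d_out x r`, next to a frozen `d_in x d_out` weight. The sketch below uses a hypothetical layer size chosen for illustration, not the actual SmolLM3 dimensions, and it counts only the low-rank matrices (how `embed_tokens` and `lm_head` are handled is not covered):

```python
def lora_param_count(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

def full_param_count(d_in, d_out):
    """Parameters in the frozen weight matrix the adapter sits next to (no bias)."""
    return d_in * d_out

# Hypothetical layer size, for illustration only.
d_in, d_out, rank = 2048, 2048, 16
added = lora_param_count(d_in, d_out, rank)
frozen = full_param_count(d_in, d_out)
print(added, frozen, f"{100 * added / frozen:.2f}%")
```

Doubling `lora_rank` doubles the number of added parameters, which is the main trade-off behind the suggested values 8 to 128.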


# Continued Pretraining

## Dataset(s)
**[NOTE]** Remember to add the **EOS_TOKEN** to the tokenized output!! Otherwise you'll get infinite generations!


In [5]:
from datasets import load_dataset
import pprint, json
EOS_TOKEN = tokenizer.eos_token

# use: dataGutenberg = dataGutenberg.map(formatting_text_prompts_func, batched = True,)
def formatting_text_prompts_func(examples):
    return { "text" : [example + EOS_TOKEN for example in examples["text"]] }
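As a quick check of the note above, the following sketch runs the same formatting function on a toy batch. With `batched = True`, `datasets` passes a dict that maps column names to lists, which is the shape used here. The `"<eos>"` string is a stand-in for `tokenizer.eos_token`, whose real value depends on the tokenizer:

```python
# Stand-in for tokenizer.eos_token; the real value depends on the tokenizer.
EOS_TOKEN = "<eos>"

def formatting_text_prompts_func(examples):
    # `examples` is a batch: a dict mapping column names to lists of values.
    return {"text": [example + EOS_TOKEN for example in examples["text"]]}

batch = {"text": ["First document.", "Second document."]}
formatted = formatting_text_prompts_func(batch)
print(formatted["text"])
```

Every document now ends with the end-of-sequence marker, which is what teaches the model where to stop.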

In [6]:
#@title EBook data

ebook_prompt = """Epub Book
### Filename: {}

### Title: {}

### Author: {}

### Subject: {}

### Part: {}

### Content:
{}"""

def formatting_ebook_prompts_func(examples):
    filenames = examples["filename"]
    titles = examples["title"]
    authors = examples["author"]
    subjects = examples["subject"]
    parts = examples["part"]
    contents  = examples["content"]
    outputs = []
    for filename, title, author, subject, part, content in zip(filenames, titles, authors, subjects, parts, contents):
        text = ebook_prompt.format(filename, title, author, subject, part, content) + EOS_TOKEN
        outputs.append(text)
    return { "text" : outputs, }
pass

dataset = load_dataset('nsemhoun/ebooks', split = 'train', token=hf_token)
dataset = dataset.train_test_split(train_size = 0.25)['train']
dataset = dataset.map(formatting_ebook_prompts_func, batched = True,)
dataset = dataset.select_columns(['text'])

books.jsonl:   0%|          | 0.00/2.12G [00:00<?, ?B/s]

Generating train split:   0%|          | 0/113813 [00:00<?, ? examples/s]

Map:   0%|          | 0/113813 [00:00<?, ? examples/s]

## Training
Now let's use Unsloth's `UnslothTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer).
Also set `embedding_learning_rate` to a value 2x to 10x smaller than `learning_rate` to make continual pretraining work!
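The idea behind a separate embedding learning rate can be sketched as two optimizer parameter groups, one for the embedding matrices and one for everything else. This is an illustration of the concept only, not Unsloth's implementation; `split_param_groups` and the parameter names are hypothetical:

```python
def split_param_groups(param_names, learning_rate, embedding_learning_rate,
                       embedding_keys=("embed_tokens", "lm_head")):
    """Assign each parameter name to a group with its own learning rate."""
    embeddings, others = [], []
    for name in param_names:
        if any(key in name for key in embedding_keys):
            embeddings.append(name)
        else:
            others.append(name)
    return [
        {"params": others, "lr": learning_rate},
        {"params": embeddings, "lr": embedding_learning_rate},
    ]

# Hypothetical parameter names, for illustration only.
names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.lora_A.weight",
    "model.layers.0.mlp.up_proj.lora_B.weight",
    "lm_head.weight",
]
groups = split_param_groups(names, learning_rate=5e-5, embedding_learning_rate=1e-5)
for group in groups:
    print(group["lr"], group["params"])
```

The embedding group moves five times more slowly here, which matches the 2x to 10x guidance.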

In [11]:
from transformers import TrainingArguments
from unsloth import UnslothTrainer, UnslothTrainingArguments

trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 4,

    args = UnslothTrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 8,

        # Use warmup_ratio and num_train_epochs for longer runs!
        max_steps = 60 if debug else -1, # -1 means no step cap, so num_train_epochs decides the run length
        num_train_epochs = 0 if debug else 1,
        warmup_ratio = 0.1,

        # Select a 2 to 10x smaller learning rate for the embedding matrices!
        learning_rate = 5e-5,
        embedding_learning_rate = 1e-5,

        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)
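The combination of `lr_scheduler_type = "linear"` and a warmup phase means the learning rate climbs linearly from zero to `learning_rate`, then decays linearly back to zero. A minimal sketch of that shape, assuming a hypothetical run of 1000 steps with 100 warmup steps (10%, matching `warmup_ratio = 0.1`); `linear_lr` is a hypothetical helper, not a library function:

```python
def linear_lr(step, total_steps, warmup_steps, peak_lr):
    """Learning rate at `step`: linear warmup to peak_lr, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))

# Hypothetical run length; peak_lr is the learning_rate from the cell above.
total_steps, warmup_steps = 1000, 100
for step in (0, 50, 100, 550, 1000):
    print(step, linear_lr(step, total_steps, warmup_steps, peak_lr=5e-5))
```

The peak is reached exactly at the end of warmup, and the rate is back at zero on the final step.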

In [12]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA A100-SXM4-40GB. Max memory = 39.557 GB.
8.658 GB of memory reserved.
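The memory cells label their values as GB, but dividing a byte count by 1024 three times actually yields GiB (1 GiB = 1024³ bytes). The two conversions used in these cells can be factored into small helpers, sketched below with hypothetical names `bytes_to_gib` and `percent_of`; the numeric inputs are copied from the output above:

```python
def bytes_to_gib(num_bytes, digits=3):
    """Convert a byte count to GiB (1 GiB = 1024**3 bytes), rounded for display."""
    return round(num_bytes / 1024**3, digits)

def percent_of(part, whole, digits=3):
    """Express `part` as a percentage of `whole`."""
    return round(part / whole * 100, digits)

# Values copied from the printed output above.
max_memory = 39.557
start_gpu_memory = 8.658
print(bytes_to_gib(2 * 1024**3))
print(percent_of(start_gpu_memory, max_memory))
```

So roughly a fifth of the card is already reserved by the loaded model before training starts.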


In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 113,813 | Num Epochs = 1 | Total steps = 3,557
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 8 x 1) = 32
 "-____-"     Trainable parameters = 32,313,344 of 3,109,496,832 (1.04% trained)


Step,Training Loss
1,2.2483
2,2.2253
3,2.2025
4,2.2133
5,2.2241
6,2.2489
7,2.222
8,2.2526
9,2.2412
10,2.2025
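The totals in the training banner above can be reproduced with a little arithmetic: the effective batch size is the per-device batch size times the gradient accumulation steps times the number of GPUs, and the step count is the number of examples divided by that, rounded up. The sketch below assumes the last partial batch is kept, which is consistent with the 3,557 steps reported. `total_steps` is a hypothetical helper name. The banner reports a per-device batch size of 4 while the trainer cell sets 2, so the banner was presumably produced with a different setting; both cases are computed:

```python
import math

def total_steps(num_examples, per_device_batch_size, grad_accum_steps, num_gpus=1, epochs=1):
    """Optimizer steps for a run, assuming the last partial batch is kept (rounded up)."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_gpus
    return math.ceil(num_examples / effective_batch) * epochs

# Numbers copied from the banner above (batch size per device = 4).
print(total_steps(113_813, 4, 8))
# The trainer cell shows per_device_train_batch_size = 2, which would give:
print(total_steps(113_813, 2, 8))
# Share of trainable parameters reported in the banner.
print(f"{100 * 32_313_344 / 3_109_496_832:.2f}%")
```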


In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

# Continued Pretraining Book

## Data Prep
We now use the French data from the `OpenLLM-France/Lucie-Training-Dataset` dataset. We sample only the first 5000 rows to speed up training. We must append `EOS_TOKEN` (that is, `tokenizer.eos_token`) to each example, or else the model's generation will go on forever.

# SFT (Chat)

## Dataset Preparation and Processing
We now use the `ChatML` format for conversation style finetunes. ChatML renders multi turn conversations like below:

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What's the capital of France?<|im_end|>
<|im_start|>assistant
Paris.
```

We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old` and our own optimized `unsloth` template.

Normally one has to train `<|im_start|>` and `<|im_end|>` as new tokens. We instead map `<|im_end|>` to the EOS token and leave `<|im_start|>` as is, so no additional tokens need to be trained.

More info on chat templates on [our wiki page!](https://github.com/unslothai/unsloth/wiki#chat-templates)

For text completions like novel writing, try this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing).
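The ChatML layout described above can be approximated in a few lines of plain Python. This is a sketch for intuition only: the template returned by `get_chat_template` is a Jinja template and may differ in whitespace or in how it handles a missing system prompt. `render_chatml` is a hypothetical helper name:

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts in ChatML layout."""
    parts = []
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model writes the reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the capital of France?"},
]
print(render_chatml(messages, add_generation_prompt=True))
```

For training data the generation prompt is left off, because the assistant reply is already one of the messages; for inference it is added so the model continues from an open assistant turn.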

In [5]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "chatml", # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
    # No ShareGPT-style `mapping` is passed: the messages built below already use "role"/"content" keys with "user"/"assistant" roles.
    map_eos_token = True, # Maps <|im_end|> to </s> instead
)

def formatting_prompts_func(example):
    # Called with batched = False, so `example` is a single row (a dict).
    messages = []
    if example.get("system") is not None:
        messages.append({"role": "system", "content": example["system"]})
    messages.append({"role": "user", "content": example.get("question")})
    messages.append({"role": "assistant", "content": example.get("chosen")})
    # Render the conversation as one training string. No generation prompt is
    # added because the assistant reply is already part of the text.
    text = tokenizer.apply_chat_template(messages, tokenize = False, add_generation_prompt = False)
    return { "text" : text, }

from datasets import load_dataset
dataset = load_dataset("jpacifico/french-orca-dpo-pairs-revised", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = False,)

Unsloth: Will map <|im_end|> to EOS = <|im_end|>.


README.md:   0%|          | 0.00/676 [00:00<?, ?B/s]

french_orca_rlhf_revised.jsonl:   0%|          | 0.00/44.6M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/12670 [00:00<?, ? examples/s]

Map:   0%|          | 0/12670 [00:00<?, ? examples/s]

Let's see how the format works by printing the row at index 5.

In [12]:
row = dataset[5]
print('SYSTEM: ' + '=' * 50)
pprint.pprint(row["system"])
print('INSTRUCTION: ' + '=' * 50)
pprint.pprint(row["question"])
print('ACCEPTED: ' + '=' * 50)
pprint.pprint(row["chosen"])
print('CHAT TEMPLATE: ' + '=' * 50)
pprint.pprint(row["text"])


'Vous êtes un assistant IA qui aide les gens à trouver des informations.'
('Compte tenu du principe du flux de conscience, proposez une question et une '
 'réponse pertinentes. Justification : Dans ce contexte, le jeu fait référence '
 'à une partie de volleyball où une joueuse de beach-volley effectue le '
 'service.\n'
 'La question et la réponse :')
("Question\xa0: Quelle est la technique appropriée pour qu'une joueuse de "
 'beach-volley puisse servir le ballon efficacement dans un match\xa0?\n'
 '\n'
 'Réponse : Pour servir le ballon efficacement au beach-volley, une joueuse '
 'doit adopter une routine de pré-service cohérente, se tenir dans une '
 'position équilibrée avec les pieds écartés à la largeur des épaules, lancer '
 'le ballon à une hauteur appropriée tout en étendant son bras non dominant, '
 "et utiliser une combinaison de mouvements de l'épaule, du bras et du poignet "
 'pour obtenir un service puissant et précis. Différents styles de service, '
 'tels que le servic

## Train the model


Set up the SFT trainer with appropriate arguments.

In [13]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2, #1
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 8,
        warmup_steps = 5,
        max_steps = 60 if debug else -1, # -1 means no step cap, so num_train_epochs decides the run length
        num_train_epochs = 0 if debug else 1,
        learning_rate = 2e-4, # Lower for slower but more precise fine-tuning. Try values like 1e-4, 5e-5, or 2e-5
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Set to "wandb" to log to Weights & Biases
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=16):   0%|          | 0/12670 [00:00<?, ? examples/s]

KeyboardInterrupt: 

## Training Execution
Execute the training process with the configured trainer and monitor the training progress.

In [None]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

In [None]:
trainer_stats = trainer.train()

In [None]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

In [None]:
#@title Clean trainer
del trainer
torch.cuda.empty_cache()

<a name="Inference"></a>
# Inference
Let's run the model! You can change the user message in `messages` to try your own prompt.

In [None]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "chatml", # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
    map_eos_token = True, # Maps <|im_end|> to </s> instead
)

FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8,"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the whole output!

In [None]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8,"},
    #{"role": "user", "content": "Quel est la fameuse grande tour à Paris?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128, use_cache = True)

# Saving

In [None]:
#@title Create HF repository
from huggingface_hub import HfApi

if save_hf:
  hf_api = HfApi(token=hf_token)
  hf_api.create_repo(repo_id = hf_username + "/" + model_name, repo_type = "model", private = True, exist_ok = True)
  hf_api.create_repo(repo_id = hf_username + "/" + model_name + "-GGUF", repo_type = "model", private = True, exist_ok = True)


## Saving to float16 for vLLM (safetensors)

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
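# Merge the LoRA weights into the base model and save in float16, together with the tokenizer.
model.save_pretrained_merged(model_name, tokenizer, save_method = "merged_16bit")
if save_hf:
  model.push_to_hub_merged(hf_username + "/" + model_name, tokenizer, save_method = "merged_16bit", token = hf_token)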

## GGUF / llama.cpp Conversion
Saving to `GGUF` / `llama.cpp` is now supported natively! We clone `llama.cpp` and save to `q8_0` by default. Other methods such as `q4_k_m` are also allowed. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

In [None]:
model.save_pretrained_gguf(model_name, tokenizer, quantization_method = quantization_method)

if save_hf:
  for quant in quantization_method:
    hf_api.upload_file(
      path_or_fileobj=model_name + "." + quant.upper() + ".gguf",
      path_in_repo=model_name + "-" + quant.upper() + ".gguf",
      repo_id=hf_username + "/" + model_name + "-GGUF"
    )
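Before uploading, it can help to print which files the loop above will look for and where they will land. The sketch below only builds the names and makes no Hub calls; `gguf_upload_plan` is a hypothetical helper. The local file pattern is taken from the loop above, and whether `save_pretrained_gguf` actually writes files under those names is an assumption that may depend on the Unsloth version, so checking that each file exists before calling `upload_file` is a reasonable precaution:

```python
def gguf_upload_plan(model_name, quantization_methods, hf_username):
    """List (local_file, path_in_repo, repo_id) triples, following the loop above."""
    repo_id = f"{hf_username}/{model_name}-GGUF"
    plan = []
    for quant in quantization_methods:
        tag = quant.upper()  # e.g. "q8_0" -> "Q8_0"
        plan.append((f"{model_name}.{tag}.gguf", f"{model_name}-{tag}.gguf", repo_id))
    return plan

plan = gguf_upload_plan("Claire-3B-v0.1.1", ["q8_0", "f16"], "nsemhoun")
for local_file, path_in_repo, repo_id in plan:
    print(local_file, "->", f"{repo_id}/{path_in_repo}")
```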