# (e) Continued Pretraining — Teach a Tiny New Language
**Created:** 2025-11-10 02:42 UTC

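We demonstrate **continued pretraining** by training a small causal language model on synthetic text in a toy language called *Quirkish*, so that it picks up the language's word patterns. The tokenizer is left unchanged, so no new tokens are added; Quirkish words are split into the base model's existing subword tokens.
For larger corpora and custom tokenizers, see Unsloth's Continued Pretraining docs.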
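
For reference, continued pretraining of a causal language model uses the usual next-token negative log-likelihood. The notation below is ours rather than the notebook's: $x_1, \dots, x_T$ are the tokens of one sequence and $\theta$ the model parameters. This is a sketch of the standard objective, averaged over the predicted positions:

```latex
\mathcal{L}(\theta) = -\frac{1}{T-1} \sum_{t=2}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)
```

The training loss reported later in this notebook is this quantity (in nats), averaged over the tokens in a batch.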

In [1]:
!pip -q install --upgrade pip
!pip -q install "transformers>=4.44.2" "datasets>=2.19.0" "accelerate>=0.33.0" "unsloth>=2024.11.0"

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
bigframes 2.12.0 requires google-cloud-bigquery-storage<3.0.0,>=2.30.0, which is not installed.
pylibcudf-cu12 25.2.2 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 22.0.0 which is incompatible.
cudf-cu12 25.2.2 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 22.0.0 which is incompatible.
bigframes 2.12.0 requires rich<14,>=12.4.4, but you have rich 14.2.0 which is incompatible.
libcugraph-cu12 25.6.0 requires libraft-cu12==25.6.*, but you have libraft-cu12 25.2.0 which is incompatible.
torchaudio 2.6.0+cu124 requires torch==2.6.0, but you have torch 2.8.0 which is incompatible.
cudf-polars-cu

In [2]:
import torch

print("CUDA:", torch.cuda.is_available())

CUDA: True


## Build toy raw text

In [3]:
from datasets import Dataset

quirkish = [
    "zor blip mako. quori zen taf. noro blip blip mako!",
    "mako-lin quori lin-lin. zor quori zen?",
    "zen zor mako mako. blip norin quori zor.",
    "blip-zen mako norin zor. quori zen blip.",
    "noro blip mako zor zen. quori norin mako!",
]
ds = Dataset.from_list([{"text": t} for t in quirkish])
ds

Dataset({
    features: ['text'],
    num_rows: 5
})
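
Five rows are enough for a demo, but a slightly larger corpus can be generated rather than typed by hand. The sketch below is one possible way to do that, not part of the original notebook: `WORDS`, `ENDINGS`, `make_sentence`, and `make_corpus` are hypothetical names, and the vocabulary is collected by hand from the five seed sentences. Hyphenated compounds such as `mako-lin` are not generated.

```python
import random

# Hypothetical vocabulary, collected by hand from the five seed sentences.
WORDS = ["zor", "blip", "mako", "quori", "zen", "taf", "noro", "norin", "lin"]
ENDINGS = [".", ".", "!", "?"]  # "." listed twice so it is the most common ending


def make_sentence(rng, min_words=3, max_words=6):
    """Return one pseudo-Quirkish sentence: random words plus an ending mark."""
    n = rng.randint(min_words, max_words)
    words = [rng.choice(WORDS) for _ in range(n)]
    return " ".join(words) + rng.choice(ENDINGS)


def make_corpus(num_lines, sentences_per_line=3, seed=0):
    """Build num_lines text rows, each a few sentences, reproducibly from seed."""
    rng = random.Random(seed)
    return [
        " ".join(make_sentence(rng) for _ in range(sentences_per_line))
        for _ in range(num_lines)
    ]


extra = make_corpus(num_lines=200)
print(len(extra))
print(extra[0])
```

The generated rows could be appended to `quirkish` before calling `Dataset.from_list`. Because the words are drawn uniformly at random, the result has no real grammar; it only increases the variety of token sequences.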

## Tokenize and run continued pretraining

In [4]:
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "HuggingFaceTB/SmolLM2-135M"
out_dir = "/kaggle/working/smollm2_cpt"

tokenizer = AutoTokenizer.from_pretrained(base)
# Padding is needed to batch sequences of different lengths.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def tok(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

tok_ds = ds.map(tok, batched=True, remove_columns=["text"])

use_cuda = torch.cuda.is_available()
model = AutoModelForCausalLM.from_pretrained(
    base, torch_dtype=torch.bfloat16 if use_cuda else torch.float32
)
# Only changes the embedding matrix if the tokenizer and model vocab sizes differ.
model.resize_token_embeddings(len(tokenizer))
# mlm=False gives the causal-LM objective: labels are copied from input_ids.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# The 5-row dataset fits in one batch, so each step is a full pass over the corpus.
args = TrainingArguments(
    output_dir=out_dir,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    learning_rate=5e-5,
    max_steps=200,
    bf16=use_cuda,
    logging_steps=10,
    report_to="none",
)
trainer = Trainer(model=model, args=args, train_dataset=tok_ds, data_collator=collator)
_ = trainer.train()
trainer.save_model(out_dir)
tokenizer.save_pretrained(out_dir)

2025-11-10 04:24:03.245454: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1762748643.452738      48 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1762748643.509056      48 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'


Skipping import of cpp extensions due to incompatible torch version 2.8.0+cu128 for torchao version 0.14.1             Please see https://github.com/pytorch/ao/issues/2919 for more info


`torch_dtype` is deprecated! Use `dtype` instead!


| Step | Training Loss |
|------|---------------|
| 10   | 5.3405        |
| 20   | 4.2656        |
| 30   | 3.4609        |
| 40   | 2.6559        |
| 50   | 1.8846        |
| 60   | 1.1337        |
| 70   | 0.5477        |
| 80   | 0.2575        |
| 90   | 0.106         |
| 100  | 0.0526        |
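
The logged loss is a mean per-token cross-entropy in nats, so it can be read as a perplexity by exponentiating it. The short sketch below does that for the values in the log above; `perplexity` is a helper name introduced here for illustration.

```python
import math

# (step, training loss) pairs copied from the log above.
log = [(10, 5.3405), (20, 4.2656), (30, 3.4609), (40, 2.6559), (50, 1.8846),
       (60, 1.1337), (70, 0.5477), (80, 0.2575), (90, 0.106), (100, 0.0526)]


def perplexity(loss):
    """Perplexity is exp of the mean per-token cross-entropy (in nats)."""
    return math.exp(loss)


for step, loss in log:
    print(f"step {step:>3}: loss {loss:.4f} -> perplexity {perplexity(loss):.2f}")
```

A training perplexity close to 1 on a five-sentence corpus most likely means the model has memorized those sentences, not that it has learned a general Quirkish grammar.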


('/kaggle/working/smollm2_cpt/tokenizer_config.json',
 '/kaggle/working/smollm2_cpt/special_tokens_map.json',
 '/kaggle/working/smollm2_cpt/vocab.json',
 '/kaggle/working/smollm2_cpt/merges.txt',
 '/kaggle/working/smollm2_cpt/added_tokens.json',
 '/kaggle/working/smollm2_cpt/tokenizer.json')

## Sample generation in the *Quirkish* style
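
The generation cell in this section samples with `do_sample=True` and `temperature=0.9`. As a rough illustration of what temperature does, the standalone sketch below divides logits by the temperature before applying a softmax. The logits are made up and `softmax_with_temperature` is a hypothetical helper, not the library's internal code.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for three candidate next tokens.
logits = [2.0, 1.0, 0.1]
for t in (0.5, 0.9, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Temperatures below 1 concentrate probability on the highest-scoring token, and temperatures above 1 flatten the distribution, so 0.9 samples slightly more conservatively than the raw model distribution.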

In [5]:
from transformers import TextStreamer
streamer = TextStreamer(tokenizer, skip_special_tokens=True)
prompt = "zor blip"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    _ = model.generate(**inputs, max_new_tokens=60, do_sample=True, temperature=0.9, streamer=streamer)

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


zor blip mako. quori zen taf. noro blip blip mako!

Rafi Letzner: (1:04) You like it, so much, so much. (laughter) I'll be fine with that. If we have to talk, we
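
The first line of the sample is identical to the first training row, after which the output drifts into unrelated English. One simple way to check for this kind of verbatim recall is a substring test against the training rows. The sketch below is illustrative only; `memorized_rows` is a hypothetical helper, and the training rows are repeated so that the block is self-contained.

```python
# The five training rows from the "Build toy raw text" cell.
training_texts = [
    "zor blip mako. quori zen taf. noro blip blip mako!",
    "mako-lin quori lin-lin. zor quori zen?",
    "zen zor mako mako. blip norin quori zor.",
    "blip-zen mako norin zor. quori zen blip.",
    "noro blip mako zor zen. quori norin mako!",
]


def memorized_rows(generated, corpus):
    """Return the corpus rows that occur verbatim inside the generated text."""
    return [row for row in corpus if row in generated]


# First line of the sample output above.
generated = "zor blip mako. quori zen taf. noro blip blip mako!"
hits = memorized_rows(generated, training_texts)
print(f"{len(hits)} of {len(training_texts)} training rows reproduced verbatim")
# prints "1 of 5 training rows reproduced verbatim"
```

A verbatim hit is consistent with the near-zero training loss: with so little data, the model appears to recite the corpus rather than compose new Quirkish sentences.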
