# How to train a new language model from scratch using Transformers and Tokenizers

### Notebook edition (link to blogpost [link](https://huggingface.co/blog/how-to-train)). Last update May 15, 2020


Over the past few months, we made several improvements to our [`transformers`](https://github.com/huggingface/transformers) and [`tokenizers`](https://github.com/huggingface/tokenizers) libraries, with the goal of making it easier than ever to **train a new language model from scratch**.

In this post we’ll demo how to train a “small” model (84 M parameters = 6 layers, 768 hidden size, 12 attention heads), which has the same number of layers and heads as DistilBERT, on **Esperanto**. We’ll then fine-tune the model on a downstream task of part-of-speech tagging.


## 1. Find a dataset

First, let us find a corpus of text in Esperanto. Here we’ll use the Esperanto portion of the [OSCAR corpus](https://traces1.inria.fr/oscar/) from INRIA.
OSCAR is a huge multilingual corpus obtained by language classification and filtering of [Common Crawl](https://commoncrawl.org/) dumps of the Web.

<img src="https://huggingface.co/blog/assets/01_how-to-train/oscar.png" style="margin: auto; display: block; width: 260px;">

The Esperanto portion of the dataset is only 299M, so we’ll concatenate it with the Esperanto sub-corpus of the [Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download), which is composed of text from diverse sources like news, literature, and Wikipedia.

The final training corpus has a size of 3 GB, which is still small. For your own model, the more data you can pretrain on, the better your results will be.



In [9]:
# in this notebook we'll only get one of the files (the Oscar one) for the sake of simplicity and performance
!wget -c https://cdn-datasets.huggingface.co/EsperBERTo/data/oscar.eo.txt

--2026-02-13 21:32:49--  https://cdn-datasets.huggingface.co/EsperBERTo/data/oscar.eo.txt
Resolving cdn-datasets.huggingface.co (cdn-datasets.huggingface.co)... 3.174.180.16, 3.174.180.105, 3.174.180.76, ...
Connecting to cdn-datasets.huggingface.co (cdn-datasets.huggingface.co)|3.174.180.16|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 312733741 (298M) [text/plain]
Saving to: ‘oscar.eo.txt’


2026-02-13 21:32:52 (112 MB/s) - ‘oscar.eo.txt’ saved [312733741/312733741]



In [10]:
!mkdir data
!mv oscar.eo.txt data/

mkdir: cannot create directory ‘data’: File exists


## 2. Train a tokenizer

We choose to train a byte-level byte-pair encoding (BPE) tokenizer (the same as GPT-2), with the same special tokens as RoBERTa. Let’s arbitrarily pick its size to be 52,000.

We recommend training a byte-level BPE (rather than, say, a WordPiece tokenizer like BERT’s) because it will start building its vocabulary from an alphabet of single bytes, so all words will be decomposable into tokens (no more `<unk>` tokens!).
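To make the "no more `<unk>`" point concrete, here is a minimal sketch in plain Python (it is not the `tokenizers` implementation). Any string, including Esperanto diacritics, reduces to UTF-8 bytes, and each byte is one of only 256 values, which is the alphabet a byte-level BPE starts from. The helper name `to_byte_symbols` and the sample word are illustrative assumptions.

```python
def to_byte_symbols(text: str) -> list[int]:
    """Return the UTF-8 byte values of `text`; every value lies in range(256)."""
    return list(text.encode("utf-8"))


# The base alphabet of a byte-level BPE: every possible byte value.
base_alphabet = set(range(256))

# "ŝanĝiĝo" written with escapes so the example does not depend on file encoding.
word = "\u015dan\u011di\u011do"
symbols = to_byte_symbols(word)

# Every symbol is already in the base alphabet, so no unknown token is ever needed.
covered = all(s in base_alphabet for s in symbols)

# The bytes decode back to the original word, so nothing is lost.
round_trip = bytes(symbols).decode("utf-8")

print(len(word), "characters ->", len(symbols), "bytes; covered:", covered)
```

Training then only decides which byte sequences get merged into larger tokens; coverage of arbitrary input is guaranteed before any merge is learned.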


In [None]:
# We won't need TensorFlow here
# !pip uninstall -y tensorflow
# Install `transformers` from master
# !pip install git+https://github.com/huggingface/transformers
# !pip list | grep -E 'transformers|tokenizers'
# transformers version at notebook update --- 2.11.0
# tokenizers version at notebook update --- 0.8.0rc1

Note: **Code explanation**

`paths = [str(x) for x in Path(".").glob("**/*.txt")]`
- `Path.glob` finds every file that matches a pattern (here, any file with the `.txt` extension), starting from the current directory and exploring the whole subtree.
- `str(x)` converts each `Path` object into its path string.

So this line finds every file ending in `.txt` and returns their paths. For this exercise, the only one it will find is `./data/oscar.eo.txt`.
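The recursive behaviour of `glob("**/*.txt")` can be checked on a throwaway directory. The sketch below uses a hypothetical layout created in a temporary folder. The second `.txt` file is there to show that anything ending in `.txt` under the working directory matches, which would include a `merges.txt` saved earlier if the cell is re-run.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "data").mkdir()
    (root / "EsperBERTo").mkdir()
    (root / "data" / "oscar.eo.txt").write_text("Saluton!\n", encoding="utf-8")
    (root / "EsperBERTo" / "merges.txt").write_text("l a\n", encoding="utf-8")
    (root / "notes.md").write_text("not a corpus file\n", encoding="utf-8")

    # "**/*.txt" walks the whole subtree, so it matches both .txt files.
    found_all = sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.txt"))

    # Globbing inside data/ only keeps the search to the corpus directory.
    found_data = sorted(p.relative_to(root).as_posix() for p in (root / "data").glob("*.txt"))

print(found_all)
print(found_data)
```

If only the corpus should be used for tokenizer training, pointing the glob at the `data` directory is one way to keep other text files out.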

In [2]:
%%time
from pathlib import Path

from tokenizers import ByteLevelBPETokenizer

paths = [str(x) for x in Path(".").glob("**/*.txt")]

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])




CPU times: user 2min 14s, sys: 5.74 s, total: 2min 20s
Wall time: 25.7 s


Now let's save files to disk

In [5]:
!mkdir EsperBERTo
tokenizer.save_model("EsperBERTo")



['EsperBERTo/vocab.json', 'EsperBERTo/merges.txt']

🔥🔥 Wow, that was fast! ⚡️🔥

We now have both a `vocab.json`, which is a list of the most frequent tokens ranked by frequency, and a `merges.txt` list of merges.

```json
{
	"<s>": 0,
	"<pad>": 1,
	"</s>": 2,
	"<unk>": 3,
	"<mask>": 4,
	"!": 5,
	"\"": 6,
	"#": 7,
	"$": 8,
	"%": 9,
	"&": 10,
	"'": 11,
	"(": 12,
	")": 13,
	# ...
}

# merges.txt
l a
Ġ k
o n
Ġ la
t a
Ġ e
Ġ d
Ġ p
# ...
```
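As a rough illustration of how a ranked merge list like the one above is used at encoding time, the sketch below applies merges greedily, always picking the pair that appears earliest in the list. It is a simplified toy in plain Python, not the `tokenizers` implementation. The function name `apply_merges` is hypothetical, and the merge list is just the first five entries shown above.

```python
SPACE = "\u0120"  # "Ġ", the symbol byte-level BPE uses for a leading space


def apply_merges(symbols, merges):
    """Greedy BPE: repeatedly apply the highest-priority merge found in the sequence."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(symbols)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda pair: ranks.get(pair, float("inf")))
        if best not in ranks:
            break  # no adjacent pair is in the merge list any more
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# First five merges from the listing, in rank order.
merges = [("l", "a"), (SPACE, "k"), ("o", "n"), (SPACE, "la"), ("t", "a")]

article = apply_merges([SPACE, "l", "a"], merges)
partial = apply_merges(["k", "o", "n", "t", "a"], merges)
print(article, partial)
```

With only five merges, " la" already collapses into a single token, while a sequence with fewer known pairs stays split into several pieces.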

What is great is that our tokenizer is optimized for Esperanto. Compared to a generic tokenizer trained for English, more native words are represented by a single, unsplit token. Diacritics, i.e. the accented characters used in Esperanto (`ĉ`, `ĝ`, `ĥ`, `ĵ`, `ŝ`, and `ŭ`), are encoded natively. We also represent sequences in a more efficient manner: on this corpus, the average length of encoded sequences is ~30% smaller than when using the pretrained GPT-2 tokenizer.

Here’s how you can use it in `tokenizers`, including handling the RoBERTa special tokens. Of course, you’ll also be able to use it directly from `transformers`.


In [6]:
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing


tokenizer = ByteLevelBPETokenizer(
    "./EsperBERTo/vocab.json",
    "./EsperBERTo/merges.txt",
)

In [7]:
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)

In [8]:
tokenizer.encode("Mi estas Julien.")

Encoding(num_tokens=7, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [9]:
tokenizer.encode("Mi estas Julien.").tokens

['<s>', 'Mi', 'Ġestas', 'ĠJuli', 'en', '.', '</s>']
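Conceptually, the post-processor and truncation settings wrap the BPE tokens in `<s>` and `</s>` and cap the total length. The sketch below mimics that in plain Python. It assumes the length limit counts the two special tokens, and `add_special_tokens` is a hypothetical helper, not a `tokenizers` function.

```python
def add_special_tokens(tokens, max_length=512, bos="<s>", eos="</s>"):
    """Wrap `tokens` in <s> ... </s>, truncating so the result fits in `max_length`."""
    body = list(tokens)[: max_length - 2]  # leave room for the two special tokens
    return [bos, *body, eos]


bpe_tokens = ["Mi", "\u0120estas", "\u0120Juli", "en", "."]
encoded = add_special_tokens(bpe_tokens)
long_encoded = add_special_tokens(["x"] * 1000)

print(encoded)
print(len(long_encoded))
```

Five BPE tokens plus two special tokens gives the seven tokens reported by `Encoding(num_tokens=7, ...)` above.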

## 3. Train a language model from scratch

**Update:** This section follows along the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/legacy/run_language_modeling.py) script, using our new [`Trainer`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py) directly. Feel free to pick the approach you like best.

> We’ll train a RoBERTa-like model, which is a BERT-like model with a couple of changes (check the [documentation](https://huggingface.co/transformers/model_doc/roberta.html) for more details).

As the model is BERT-like, we’ll train it on a task of *masked language modeling*, i.e. predicting how to fill arbitrary tokens that we randomly mask in the dataset. This is taken care of by the example script.
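For intuition, here is a minimal pure-Python sketch of the masking step. It assumes the BERT-style recipe: each non-special token is selected with probability 0.15, and a selected token becomes `<mask>` 80% of the time, a random token 10% of the time, and stays unchanged otherwise. The function `mask_tokens` and the sample ids are illustrative; the real work is done inside the library on tensors.

```python
import random

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(), mlm_probability=0.15, rng=None):
    """Return (inputs, labels) for masked language modeling on one sequence."""
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(inputs)
    for i, tok in enumerate(token_ids):
        # Special tokens are never selected; other tokens are selected at random.
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80%: replace with <mask>
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels


# Hypothetical sequence: <s> (id 0), 100 ordinary tokens, </s> (id 2).
ids = [0] + list(range(5, 105)) + [2]
inputs, labels = mask_tokens(
    ids, mask_id=4, vocab_size=52_000, special_ids={0, 1, 2, 3, 4}, rng=random.Random(0)
)
n_selected = sum(label != IGNORE_INDEX for label in labels)
print(f"{n_selected} of {len(ids)} tokens selected for prediction")
```

Only the selected positions contribute to the loss; every other label is set to the ignore value.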

**Note:** The code below assumes you are using CUDA, but it can also run on other devices like XPUs or TPUs. The framework dynamically detects the available hardware and adjusts accordingly.

In [2]:
# Check that we have a GPU
!nvidia-smi

Sat Feb 14 09:11:36 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.48.01              Driver Version: 590.48.01      CUDA Version: 13.1     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 3060 Ti     Off |   00000000:01:00.0  On |                  N/A |
|  0%   23C    P5             39W /  240W |     428MiB /   8192MiB |     32%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+----------------------------------

In [1]:
# Check that PyTorch sees it
import torch
torch.cuda.is_available()

True

### We'll define the following config for the model

In [4]:
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

Now let's re-create our tokenizer in transformers

In [17]:
#from transformers import RobertaTokenizerFast
#tokenizer = RobertaTokenizerFast.from_pretrained("./EsperBERTo", max_len=512)

Note:
- I switched from `RobertaTokenizerFast` to `RobertaTokenizer` because the fast tokenizer rejected the structure of the `merges.txt` file; apparently it only accepts 2 "words" per line.
- In this respect, `RobertaTokenizer` is more flexible.
- Once the files were processed, I saved the model.
- I then re-created the `RobertaTokenizerFast` from the saved "slow" model.

In [5]:
from transformers import RobertaTokenizer, RobertaTokenizerFast

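# Load with the slow tokenizer first, then save it so the fast tokenizer can read the result.
# `model_max_length` is the documented name of the length limit (`max_len` is the older alias).
slow_tokenizer = RobertaTokenizer.from_pretrained("./EsperBERTo", model_max_length=512)
slow_tokenizer.save_pretrained("./EsperBERTo")  # Save the model

tokenizer = RobertaTokenizerFast.from_pretrained("./EsperBERTo", model_max_length=512)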

Finally let's initialize our model.

**Important:**

As we are training from scratch, we only initialize from a config, not from an existing pretrained model or checkpoint.

In [6]:
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM(config=config)

In [7]:
model.num_parameters()
# => 84 million parameters

83504416
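The parameter count can be reproduced by hand from the config. The arithmetic below is a sketch that assumes the `RobertaConfig` defaults not set explicitly above (`hidden_size=768`, `intermediate_size=3072`), an LM head whose output weights are tied to the input embeddings, and no pooler layer. Under those assumptions it matches the reported figure.

```python
vocab_size = 52_000
max_position_embeddings = 514
type_vocab_size = 1
num_hidden_layers = 6
hidden = 768         # assumed default hidden_size
intermediate = 3072  # assumed default intermediate_size


def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # scale and shift vectors

# Word, position and token-type embeddings, followed by a LayerNorm.
embeddings = (vocab_size + max_position_embeddings + type_vocab_size) * hidden + layer_norm

encoder_layer = (
    4 * linear(hidden, hidden)      # query, key, value and attention output projections
    + layer_norm
    + linear(hidden, intermediate)  # feed-forward expansion
    + linear(intermediate, hidden)  # feed-forward projection
    + layer_norm
)

# LM head: dense + LayerNorm + output bias (the output weights are tied, so not counted again).
lm_head = linear(hidden, hidden) + layer_norm + vocab_size

total = embeddings + num_hidden_layers * encoder_layer + lm_head
print(f"{total:,}")  # 83,504,416
```

Almost half of the parameters sit in the 52,000-token embedding matrix. The number of attention heads does not change the total, since the heads only split the same projection matrices.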

### Now let's build our training Dataset

We'll build our dataset by applying our tokenizer to our text file.

Here, as we only have one text file, we don't even need to customize our `Dataset`. We'll just use the `LineByLineTextDataset` out-of-the-box.

Note: **`LineByLineTextDataset` has been deprecated**

So we will use its modern equivalent: `load_dataset`.

In [None]:
"""
%%time
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="oscar.eo.txt",
    block_size=128,
)
"""

In [8]:
from datasets import load_dataset

raw_datasets = load_dataset("text", data_files={"train": "data/oscar.eo.txt"})

def tokenize_function(examples):
    # Tokenize a batch of lines, truncating each one to at most 128 tokens
    return tokenizer(examples["text"], truncation=True, max_length=128)

tokenized_datasets = raw_datasets.map(
    tokenize_function,
    batched=True,
    num_proc=4,
    remove_columns=["text"],  # keep only the tokenizer outputs
)
dataset = tokenized_datasets["train"]

Like in the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py) script, we need to define a `data_collator`.

This is just a small helper that batches different samples of the dataset together into an object that PyTorch knows how to perform backprop on.

In [9]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
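Besides masking, the batching part of a collator has to make variable-length samples rectangular. The plain-Python sketch below pads every sequence to the longest one in the batch and builds the matching attention mask. `pad_batch` is a hypothetical helper, and `pad_id=1` follows the `<pad>` entry in the `vocab.json` excerpt shown earlier.

```python
def pad_batch(sequences, pad_id):
    """Right-pad token id lists to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[0, 10, 11, 2], [0, 12, 2]], pad_id=1)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding to the longest sample in each batch, rather than to a fixed 128 tokens, keeps batches of short lines small.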

### Finally, we are all set to initialize our Trainer

In [9]:
import transformers
print(transformers.__version__)

5.1.0


In [12]:
!pip install transformers[torch]


[notice] A new release of pip is available: 24.3.1 -> 26.0.1
[notice] To update, run: pip install --upgrade pip


In [10]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./EsperBERTo",
    #overwrite_output_dir=True,
    num_train_epochs=1,
    gradient_accumulation_steps=16,
    #gradient_accumulation_steps=8,
    per_device_train_batch_size=16,
    #per_gpu_train_batch_size=64,
    save_steps=5000,
    save_total_limit=2,
    logging_steps=500,
    prediction_loss_only=True,
    #dataloader_num_workers=2,
    dataloader_num_workers=1,

    fp16=True,

    # Configuration for a desktop tower with an RTX 3060: training reduced to around 1 hour with <5 GiB in use
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

### Start training

In [11]:
%%time
trainer.train()

| Step | Training Loss |
|------|---------------|
| 500  | 52.91891      |
| 1000 | 43.458223     |
| 1500 | 42.784863     |
| 2000 | 42.004195     |
| 2500 | 40.771758     |
| 3000 | 38.53707      |
| 3500 | 36.771543     |


Writing model shards: 100%|██████████| 1/1 [00:00<00:00,  2.75it/s]


CPU times: user 1h 8min 39s, sys: 16.3 s, total: 1h 8min 55s
Wall time: 1h 15min 23s


TrainOutput(global_step=3807, training_loss=41.94483178601917, metrics={'train_runtime': 4523.0526, 'train_samples_per_second': 215.424, 'train_steps_per_second': 0.842, 'total_flos': 3.2307042064367616e+16, 'train_loss': 41.94483178601917, 'epoch': 1.0})
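The step count in `TrainOutput` can be sanity-checked against the training arguments. The sketch below makes three assumptions: training ran on a single GPU, the number of examples can be estimated as reported throughput times runtime, and a final partial accumulation window still counts as one optimizer step.

```python
import math

# Values taken from the training arguments and the TrainOutput above.
per_device_train_batch_size = 16
gradient_accumulation_steps = 16
train_runtime = 4523.0526           # seconds
train_samples_per_second = 215.424
global_step = 3807

# One optimizer step sees batch_size * accumulation_steps examples (single GPU assumed).
effective_batch = per_device_train_batch_size * gradient_accumulation_steps

# The dataset size is not printed, so estimate it from throughput * runtime.
approx_examples = round(train_runtime * train_samples_per_second)

micro_batches = math.ceil(approx_examples / per_device_train_batch_size)
estimated_steps = math.ceil(micro_batches / gradient_accumulation_steps)

print(effective_batch, approx_examples, estimated_steps)
```

Under these assumptions the estimate agrees with `global_step=3807`, so the single epoch amounts to roughly 3,800 updates of 256 lines each.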

#### 🎉 Save final model (+ tokenizer + config) to disk

In [12]:
trainer.save_model("./EsperBERTo")

Writing model shards: 100%|██████████| 1/1 [00:00<00:00,  2.42it/s]


## 4. Check that the LM actually trained

Aside from looking at the training and eval losses going down, the easiest way to check whether our language model is learning anything interesting is via the `FillMaskPipeline`.

Pipelines are simple wrappers around tokenizers and models, and the 'fill-mask' one will let you input a sequence containing a masked token (here, `<mask>`) and return a list of the most probable filled sequences, with their probabilities.
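Under the hood, the ranking step amounts to a softmax over the vocabulary at the masked position followed by a top-k selection. The toy sketch below shows that step in plain Python. The four-word vocabulary and the logits are made up for illustration, and `top_k_fill` is a hypothetical helper rather than part of the pipeline API.

```python
import math


def top_k_fill(logits, vocab, k=5):
    """Turn the logits at the masked position into the k most probable tokens."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(vocab)), key=lambda i: probs[i], reverse=True)[:k]
    return [{"score": probs[i], "token": i, "token_str": vocab[i]} for i in ranked]


vocab = ["brilas", "estas", ".", "o"]  # hypothetical vocabulary
logits = [2.0, 1.0, 0.5, -1.0]         # hypothetical model outputs at the <mask> position

candidates = top_k_fill(logits, vocab, k=3)
for candidate in candidates:
    print(f"{candidate['token_str']!r}: {candidate['score']:.3f}")
```

The `score`, `token` and `token_str` fields mirror those in the pipeline output; the scores only sum to 1 when every vocabulary entry is returned.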



In [13]:
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="./EsperBERTo",
    tokenizer="./EsperBERTo"
)

Loading weights: 100%|██████████| 106/106 [00:00<00:00, 1262.62it/s, Materializing param=roberta.encoder.layer.5.output.dense.weight]


In [14]:
# The sun <mask>.
# =>

fill_mask("La suno <mask>.")

[{'score': 0.35176873207092285,
  'token': 18,
  'token_str': '.',
  'sequence': 'La suno ..'},
 {'score': 0.10754923522472382,
  'token': 83,
  'token_str': 'o',
  'sequence': 'La suno o.'},
 {'score': 0.04824163392186165,
  'token': 69,
  'token_str': 'a',
  'sequence': 'La suno a.'},
 {'score': 0.03688143193721771,
  'token': 77,
  'token_str': 'i',
  'sequence': 'La suno i.'},
 {'score': 0.029939131811261177,
  'token': 73,
  'token_str': 'e',
  'sequence': 'La suno e.'}]

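In the original post, simple syntax/grammar works at this point; in this run, the top predictions are only punctuation and single letters. Let’s try a slightly more interesting prompt: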



In [15]:
fill_mask("Jen la komenco de bela <mask>.")

# This is the beginning of a beautiful <mask>.
# =>

[{'score': 0.16486455500125885,
  'token': 83,
  'token_str': 'o',
  'sequence': 'Jen la komenco de bela o.'},
 {'score': 0.12627507746219635,
  'token': 18,
  'token_str': '.',
  'sequence': 'Jen la komenco de bela ..'},
 {'score': 0.04670954868197441,
  'token': 82,
  'token_str': 'n',
  'sequence': 'Jen la komenco de bela n.'},
 {'score': 0.04132508113980293,
  'token': 69,
  'token_str': 'a',
  'sequence': 'Jen la komenco de bela a.'},
 {'score': 0.04064961150288582,
  'token': 87,
  'token_str': 's',
  'sequence': 'Jen la komenco de bela s.'}]

Note: **Experiment results**

The experiment failed: the model was not able to complete the sentences correctly. I think this happened for the following reasons:
- JC's article explained that he used about 3 GB of data for training, whereas the `oscar.eo.txt` file we trained on is barely 300 MB. With that volume of data and such a large number of tokens, the model could hardly find meaningful relationships between words.
- The full dataset I should have used is the concatenation of `oscar.eo.txt` with multiple entries from the Leipzig Corpora Collection. These entries are not in a "clean" format (many of them contain the line number or characters such as `<<`), so I assume the article left out a prior preprocessing step. I will apply one in the next test.
- The number of epochs is insufficient: a single epoch does not let the model learn about infrequent words. At least 5 epochs are recommended to get an acceptable result, and up to 40 epochs to refine it.
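As a starting point for the preprocessing step mentioned above, a cleaning function could look like the sketch below. It assumes that each Leipzig line has the form `number<TAB>sentence` and that stray `<<` / `>>` markers can simply be dropped. Both assumptions need to be checked against the actual files, and `clean_line` is a hypothetical helper.

```python
import re

LEADING_NUMBER = re.compile(r"^\d+\t")  # assumed "line number<TAB>" prefix
WHITESPACE = re.compile(r"\s+")


def clean_line(line: str) -> str:
    """Strip the numeric prefix and stray angle-bracket markers from one corpus line."""
    line = LEADING_NUMBER.sub("", line.strip())
    line = line.replace("<<", "").replace(">>", "")
    return WHITESPACE.sub(" ", line).strip()


samples = ["12\tLa suno << brilas.", "Jen la komenco.\n"]
cleaned = [clean_line(s) for s in samples]
print(cleaned)
```

Matching the tab explicitly avoids stripping sentences that legitimately begin with a number, such as a year.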
