In [None]:
#@title
%%html
<div style="background-color: pink;">
  Notebook written in collaboration with <a href="https://github.com/aditya-malte">Aditya Malte</a>.
  <br>
  The Notebook is on GitHub, so contributions are more than welcome.
</div>
<br>
<div style="background-color: yellow;">
  Aditya wrote another notebook with a slightly different use case and methodology, please check it out.
  <br>
  <a target="_blank" href="https://gist.github.com/aditya-malte/2d4f896f471be9c38eb4d723a710768b">
    https://gist.github.com/aditya-malte/2d4f896f471be9c38eb4d723a710768b
  </a>
</div>


# How to train a new language model from scratch using Transformers and Tokenizers

### Notebook edition (see the [blog post](https://huggingface.co/blog/how-to-train)). Last updated May 15, 2020


Over the past few months, we made several improvements to our [`transformers`](https://github.com/huggingface/transformers) and [`tokenizers`](https://github.com/huggingface/tokenizers) libraries, with the goal of making it easier than ever to **train a new language model from scratch**.

In this post we’ll demo how to train a “small” model (84 M parameters = 6 layers, 768 hidden size, 12 attention heads) – that’s the same number of layers & heads as DistilBERT – on **Esperanto**. We’ll then fine-tune the model on a downstream task of part-of-speech tagging.


## 1. Find a dataset

First, let us find a corpus of text in Esperanto. Here we’ll use the Esperanto portion of the [OSCAR corpus](https://traces1.inria.fr/oscar/) from INRIA.
OSCAR is a huge multilingual corpus obtained by language classification and filtering of [Common Crawl](https://commoncrawl.org/) dumps of the Web.

<img src="https://huggingface.co/blog/assets/01_how-to-train/oscar.png" style="margin: auto; display: block; width: 260px;">

The Esperanto portion of the dataset is only 299 MB, so we’ll concatenate it with the Esperanto sub-corpus of the [Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download), which comprises text from diverse sources like news, literature, and Wikipedia.

The final training corpus is 3 GB, which is still small: the more data you can pretrain on, the better your model's results will be.



In [None]:
# in this notebook we'll only get one of the files (the Oscar one) for the sake of simplicity and performance
!wget -c https://cdn-datasets.huggingface.co/EsperBERTo/data/oscar.eo.txt

--2021-12-02 05:54:47--  https://cdn-datasets.huggingface.co/EsperBERTo/data/oscar.eo.txt
Resolving cdn-datasets.huggingface.co (cdn-datasets.huggingface.co)... 54.192.18.43, 54.192.18.58, 54.192.18.90, ...
Connecting to cdn-datasets.huggingface.co (cdn-datasets.huggingface.co)|54.192.18.43|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 312733741 (298M) [text/plain]
Saving to: ‘oscar.eo.txt’


2021-12-02 05:54:50 (87.7 MB/s) - ‘oscar.eo.txt’ saved [312733741/312733741]



In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# Extract the `regex` column from the labeled CSV and write it to a plain-text file, one regex per line
import pandas as pd

df = pd.read_csv("/content/drive/MyDrive/ANLP21/data/regex_data/labeled_data_huge.csv")
df['regex'].to_csv('/content/drive/MyDrive/ANLP21/data/regex_data/regex_only_huge.txt', sep=' ', header=False, index=False)

## 2. Train a tokenizer

We choose to train a byte-level Byte-pair encoding tokenizer (the same as GPT-2), with the same special tokens as RoBERTa. Let’s arbitrarily pick its size to be 52,000.

We recommend training a byte-level BPE (rather than let’s say, a WordPiece tokenizer like BERT) because it will start building its vocabulary from an alphabet of single bytes, so all words will be decomposable into tokens (no more `<unk>` tokens!).
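
As a quick illustration of why a byte-level alphabet never needs `<unk>`, here is a minimal sketch in plain Python (it does not use the `tokenizers` library). It shows that any string, whether it contains Esperanto diacritics, regex metacharacters, or emoji, reduces to UTF-8 bytes in the range 0 to 255. The helper name `to_byte_symbols` is ours, not part of any library.

```python
from typing import List


def to_byte_symbols(text: str) -> List[int]:
    """Return the UTF-8 bytes of `text` as integers in 0..255."""
    return list(text.encode("utf-8"))


for sample in ["ĉ", "(dog).*", "🔥"]:
    symbols = to_byte_symbols(sample)
    # Every symbol is one of 256 possible byte values, so a 256-entry base
    # alphabet can represent any input without an unknown token.
    assert all(0 <= b <= 255 for b in symbols)
    # The mapping is lossless: decoding the bytes gives back the original text.
    assert bytes(symbols).decode("utf-8") == sample
    print(repr(sample), "->", symbols)
```

BPE merges then build longer tokens on top of this base alphabet, which is why the vocabulary size is a free choice.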


In [3]:
# We won't need TensorFlow here
!pip uninstall -y tensorflow
# Install `transformers` from master
!pip install git+https://github.com/huggingface/transformers
!pip list | grep -E 'transformers|tokenizers'
# transformers version at notebook update --- 2.11.0
# tokenizers version at notebook update --- 0.8.0rc1

Found existing installation: tensorflow 2.7.0
Uninstalling tensorflow-2.7.0:
  Successfully uninstalled tensorflow-2.7.0
Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-aky2xuf5
  Running command git clone -q https://github.com/huggingface/transformers /tmp/pip-req-build-aky2xuf5
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.2.1-py3-none-any.whl (61 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl

In [4]:
reberth_path = "/content/drive/MyDrive/ANLP21/ReBERTh"
data_path = "/content/drive/MyDrive/ANLP21/data/regex_data"

In [5]:
%%time 
from pathlib import Path

from tokenizers import ByteLevelBPETokenizer

# paths = [str(x) for x in Path("/content/drive/Starred/ANLP21/data/regex_data/final_data.txt")]
paths = ["/content/drive/MyDrive/ANLP21/data/regex_data/regex_only_huge.txt"]
print(paths)

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train(files=paths, vocab_size=1_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

['/content/drive/MyDrive/ANLP21/data/regex_data/regex_only_huge.txt']
CPU times: user 1.75 s, sys: 12.8 ms, total: 1.77 s
Wall time: 973 ms


Now let's save the tokenizer files to disk, one copy per model size.

In [6]:
model_sizes = ['mini', 'small', 'medium', 'base']
for size in model_sizes:
  %mkdir -p $reberth_path/$size
  tokenizer.save_model(f"{reberth_path}/{size}")

🔥🔥 Wow, that was fast! ⚡️🔥

We now have both a `vocab.json`, which is a list of the most frequent tokens ranked by frequency, and a `merges.txt` list of merges.

```json
{
	"<s>": 0,
	"<pad>": 1,
	"</s>": 2,
	"<unk>": 3,
	"<mask>": 4,
	"!": 5,
	"\"": 6,
	"#": 7,
	"$": 8,
	"%": 9,
	"&": 10,
	"'": 11,
	"(": 12,
	")": 13,
	# ...
}

# merges.txt
l a
Ġ k
o n
Ġ la
t a
Ġ e
Ġ d
Ġ p
# ...
```
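
To make the role of `merges.txt` concrete, here is a toy sketch of how an ordered merge list is applied to a sequence of symbols. It is a simplified illustration, not the `tokenizers` implementation: the function name `apply_merges` is hypothetical, and the merge list is just the first few entries shown above. In byte-level BPE, `Ġ` is the printable stand-in for a leading space.

```python
from typing import List, Tuple


def apply_merges(symbols: List[str], merges: List[Tuple[str, str]]) -> List[str]:
    """Repeatedly merge the adjacent pair that appears earliest in `merges`."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(symbols)
    while len(symbols) > 1:
        # Rank every adjacent pair; pairs that are not in the merge list get infinity.
        candidates = [
            (ranks.get(pair, float("inf")), index)
            for index, pair in enumerate(zip(symbols, symbols[1:]))
        ]
        best_rank, index = min(candidates)
        if best_rank == float("inf"):
            break  # no applicable merge is left
        symbols[index:index + 2] = [symbols[index] + symbols[index + 1]]
    return symbols


merges = [("l", "a"), ("Ġ", "k"), ("o", "n"), ("Ġ", "la")]
print(apply_merges(["Ġ", "l", "a"], merges))
print(apply_merges(["k", "o", "n"], merges))
```

Earlier lines in `merges.txt` take priority, so frequent pairs are fused first and rarer, longer tokens are built from them.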

What is great is that our tokenizer is optimized for Esperanto. Compared to a generic tokenizer trained for English, more native words are represented by a single, unsplit token. Diacritics, i.e. accented characters used in Esperanto – `ĉ`, `ĝ`, `ĥ`, `ĵ`, `ŝ`, and `ŭ` – are encoded natively. We also represent sequences in a more efficient manner. Here on this corpus, the average length of encoded sequences is ~30% smaller than when using the pretrained GPT-2 tokenizer.

Here’s how you can use it in `tokenizers`, including handling the RoBERTa special tokens – of course, you’ll also be able to use it directly from `transformers`.


In [7]:
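from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing

# The tokenizer files were saved into one sub-directory per model size above;
# every copy is identical, so we load the one under "mini".
tokenizer = ByteLevelBPETokenizer(
    f"{reberth_path}/mini/vocab.json",
    f"{reberth_path}/mini/merges.txt",
)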

In [8]:
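# Wrap every encoded sequence in <s> ... </s>, as RoBERTa expects.
# BertProcessing takes the separator (end) token first, then the start token.
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
# Cap encoded sequences at 512 tokens
tokenizer.enable_truncation(max_length=512)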

In [9]:
tokenizer.encode("(dog)")

Encoding(num_tokens=5, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [10]:
tokenizer.encode("(dog)").tokens

['<s>', '(', 'dog', ')', '</s>']
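
To see what the post-processor and truncation settings do to a sequence of ids, here is a small sketch in plain Python. It assumes that `<s>` and `</s>` have ids 0 and 2 (as in the `vocab.json` shown earlier) and that the 512-token budget includes the two special tokens. The function name `wrap_and_truncate` and the example ids are made up for illustration.

```python
from typing import List

BOS_ID, EOS_ID = 0, 2  # assumed ids of <s> and </s>


def wrap_and_truncate(ids: List[int], max_length: int = 512) -> List[int]:
    """Truncate `ids` so that the result, including <s> and </s>, fits in `max_length`."""
    body = ids[: max_length - 2]  # reserve two positions for the special tokens
    return [BOS_ID] + body + [EOS_ID]


short = wrap_and_truncate([12, 266, 13])       # hypothetical ids for "(", "dog", ")"
long = wrap_and_truncate(list(range(5, 1005)))  # 1000 made-up ids
print(short)
print(len(long))
```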

## 3. Train a language model from scratch

**Update:** This section follows along the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/legacy/run_language_modeling.py) script, using our new [`Trainer`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py) directly. Feel free to pick the approach you like best.

> We’ll train a RoBERTa-like model, which is BERT-like with a couple of changes (check the [documentation](https://huggingface.co/transformers/model_doc/roberta.html) for more details).

As the model is BERT-like, we’ll train it on a task of *masked language modeling*, i.e. predicting how to fill in tokens that we randomly mask in the dataset. This is taken care of by the example script.
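
Here is a simplified sketch of the masking step, in plain Python. It assumes `<mask>` has id 4 (as in the `vocab.json` shown earlier) and uses `-100` as the "ignore" label, which is the default `ignore_index` of PyTorch's cross-entropy loss. The function name `mask_tokens` is ours. As far as we know, the library's collator is a bit more elaborate: it leaves special tokens alone and, for some of the selected positions, substitutes a random token or keeps the original instead of `<mask>`.

```python
import random
from typing import List, Tuple

MASK_ID = 4          # assumed id of <mask>
IGNORE_INDEX = -100  # positions with this label do not contribute to the loss


def mask_tokens(ids: List[int], mlm_probability: float = 0.15,
                seed: int = 0) -> Tuple[List[int], List[int]]:
    """Replace roughly `mlm_probability` of the tokens with <mask>."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for token_id in ids:
        if rng.random() < mlm_probability:
            inputs.append(MASK_ID)       # the model sees <mask> ...
            labels.append(token_id)      # ... and must predict the original token
        else:
            inputs.append(token_id)
            labels.append(IGNORE_INDEX)  # unmasked positions are ignored by the loss
    return inputs, labels


ids = list(range(5, 105))  # 100 made-up token ids
inputs, labels = mask_tokens(ids)
print(sum(label != IGNORE_INDEX for label in labels), "of", len(ids), "tokens masked")
```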


In [11]:
# Check that we have a GPU
!nvidia-smi

Tue Dec  7 00:49:47 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   70C    P8    31W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [12]:
# Check that PyTorch sees it
import torch
torch.cuda.is_available()

True

### We'll define the following config for the model

In [13]:
from transformers import RobertaConfig

mini_config = RobertaConfig(
    vocab_size=1_000,
    max_position_embeddings=512,
    hidden_size=256,
    num_attention_heads=4,
    num_hidden_layers=4,
    type_vocab_size=1,
)


small_config = RobertaConfig(
    vocab_size=1_000,
    max_position_embeddings=512,
    hidden_size=512,
    num_attention_heads=8,
    num_hidden_layers=4,
    type_vocab_size=1,
)

medium_config = RobertaConfig(
    vocab_size=1_000,
    max_position_embeddings=512,
    hidden_size=512,
    num_attention_heads=8,
    num_hidden_layers=8,
    type_vocab_size=1,
)

base_config = RobertaConfig(
    vocab_size=1_000,
    max_position_embeddings=512,
    hidden_size=768,
    num_attention_heads=12,
    num_hidden_layers=12,
    type_vocab_size=1,
)

In [14]:
configs = {
    'mini': mini_config,
    'small': small_config,
    'medium': medium_config,
    'base': base_config
}

Now let's re-create our tokenizer in `transformers`, one instance per model size.

In [15]:
from transformers import RobertaTokenizerFast

# tokenizer = RobertaTokenizerFast.from_pretrained(reberth_path, max_len=512)
tokenizers = {k: RobertaTokenizerFast.from_pretrained(f"{reberth_path}/{k}", max_len=512) for k in configs.keys()}

file /content/drive/MyDrive/ANLP21/ReBERTh/small/config.json not found
file /content/drive/MyDrive/ANLP21/ReBERTh/small/config.json not found
file /content/drive/MyDrive/ANLP21/ReBERTh/medium/config.json not found
file /content/drive/MyDrive/ANLP21/ReBERTh/medium/config.json not found
file /content/drive/MyDrive/ANLP21/ReBERTh/base/config.json not found
file /content/drive/MyDrive/ANLP21/ReBERTh/base/config.json not found


Finally, let's initialize our models.

**Important:**

As we are training from scratch, we only initialize from a config, not from an existing pretrained model or checkpoint.

In [16]:
from transformers import RobertaForMaskedLM

models = {k: RobertaForMaskedLM(config=config) for k, config in configs.items()} 

In [None]:
# Parameter count of each model
{k: model.num_parameters() for k, model in models.items()}
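
If you want a feel for where the parameters go, here is a back-of-the-envelope estimate in plain Python. It is a sketch under stated assumptions rather than a replacement for `num_parameters()`: it assumes the output projection of the LM head is tied to the token embeddings, and it uses `intermediate_size=3072`, the `RobertaConfig` default, because the configs above do not override it. The function name `estimate_roberta_mlm_params` is ours.

```python
def estimate_roberta_mlm_params(vocab_size: int, max_position_embeddings: int,
                                hidden_size: int, num_hidden_layers: int,
                                intermediate_size: int = 3072,
                                type_vocab_size: int = 1) -> int:
    """Rough parameter count for a RoBERTa-style masked-LM with tied output weights."""
    h, ff = hidden_size, intermediate_size
    layer_norm = 2 * h  # weight + bias
    embeddings = (vocab_size + max_position_embeddings + type_vocab_size) * h + layer_norm
    attention = 4 * (h * h + h) + layer_norm  # Q, K, V and output projections
    feed_forward = (h * ff + ff) + (ff * h + h) + layer_norm
    lm_head = (h * h + h) + layer_norm + vocab_size  # dense + norm + output bias
    return embeddings + num_hidden_layers * (attention + feed_forward) + lm_head


# (hidden_size, num_hidden_layers) for the four configs defined above
sizes = {"mini": (256, 4), "small": (512, 4), "medium": (512, 8), "base": (768, 12)}
estimates = {
    name: estimate_roberta_mlm_params(1_000, 512, hidden, layers)
    for name, (hidden, layers) in sizes.items()
}
for name, count in estimates.items():
    print(f"{name}: ~{count / 1e6:.1f} M parameters")
```

For the 6-layer, 768-hidden configuration from the introduction, with a 52,000-token vocabulary and assuming 514 position embeddings, the same formula gives about 83.5 M, in line with the "84 M" figure quoted there.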

### Now let's build our training Dataset

We'll build our dataset by applying our tokenizer to our text file.

Here, as we only have one text file, we don't even need to customize our `Dataset`. We'll just use the `LineByLineTextDataset` out-of-the-box.
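
Conceptually, a line-by-line dataset does very little: it reads the file, drops empty lines, and encodes each remaining line as one example truncated to `block_size`. Here is a minimal sketch of that idea in plain Python. The names `load_line_by_line` and `toy_encode` are hypothetical, and `toy_encode` is only a stand-in for a real tokenizer.

```python
import os
import tempfile
from typing import Callable, List


def load_line_by_line(path: str, encode: Callable[[str], List[int]],
                      block_size: int = 128) -> List[List[int]]:
    """Encode every non-blank line of `path` as one example of at most `block_size` ids."""
    with open(path, encoding="utf-8") as f:
        lines = [line for line in f.read().splitlines() if line and not line.isspace()]
    return [encode(line)[:block_size] for line in lines]


def toy_encode(line: str) -> List[int]:
    """Stand-in tokenizer: one id per UTF-8 byte."""
    return list(line.encode("utf-8"))


# Write a tiny corpus to a temporary file and load it.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("(dog).*\n\n   \n[a-z]+\n" + "a" * 300 + "\n")
examples = load_line_by_line(tmp.name, toy_encode, block_size=128)
os.remove(tmp.name)
print(len(examples), [len(example) for example in examples])
```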

In [17]:
%%time
from transformers import LineByLineTextDataset

datasets = {k: LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="/content/drive/MyDrive/ANLP21/data/regex_data/regex_only_huge.txt",
    block_size=128,
) for k, tokenizer in tokenizers.items()}



CPU times: user 16.4 s, sys: 480 ms, total: 16.9 s
Wall time: 12.1 s


In [18]:
datasets

{'base': <transformers.data.datasets.language_modeling.LineByLineTextDataset at 0x7f91adef5950>,
 'medium': <transformers.data.datasets.language_modeling.LineByLineTextDataset at 0x7f9187ed0a90>,
 'mini': <transformers.data.datasets.language_modeling.LineByLineTextDataset at 0x7f91adcd0650>,
 'small': <transformers.data.datasets.language_modeling.LineByLineTextDataset at 0x7f918f367210>}

Like in the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py) script, we need to define a `data_collator`.

This is a small helper that batches different samples of the dataset together into an object that PyTorch knows how to perform backprop on.
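
The batching part of a collator boils down to padding: sequences of different lengths are extended to a common length, and an attention mask records which positions are real. Here is a minimal sketch in plain Python lists, assuming `<pad>` has id 1 as in the `vocab.json` shown earlier. The function name `pad_batch` is ours, and the real collator returns tensors and also applies the masking described in section 3.

```python
from typing import Dict, List

PAD_ID = 1  # assumed id of <pad>


def pad_batch(sequences: List[List[int]], pad_id: int = PAD_ID) -> Dict[str, List[List[int]]]:
    """Right-pad every sequence to the length of the longest one in the batch."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should not attend to.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[0, 12, 2], [0, 12, 266, 13, 2]])  # made-up token ids
print(batch["input_ids"])       # [[0, 12, 2, 1, 1], [0, 12, 266, 13, 2]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```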

In [19]:
from transformers import DataCollatorForLanguageModeling

data_collator = {k: DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
) for k, tokenizer in tokenizers.items()}

### Finally, we are all set to initialize our Trainer

In [20]:
from transformers import Trainer, TrainingArguments

training_args = {k: TrainingArguments(
    output_dir=f"{reberth_path}/{k}",
    overwrite_output_dir=True,
    num_train_epochs=5,
    per_gpu_train_batch_size=64,
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
) for k in configs.keys()}

trainers = {k: Trainer(
    model=model,
    args=training_args[k],
    data_collator=data_collator[k],
    train_dataset=datasets[k],
) for k, model in models.items()}

### Start training

In [27]:
%%time
trainers['mini'].train()

Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.
Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.
***** Running training *****
  Num examples = 68034
  Num Epochs = 5
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 5320
Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.


| Step | Training Loss |
|------|---------------|
| 500 | 4.085 |
| 1000 | 2.6675 |
| 1500 | 2.1336 |
| 2000 | 1.8209 |
| 2500 | 1.6722 |
| 3000 | 1.5619 |
| 3500 | 1.4759 |
| 4000 | 1.4474 |
| 4500 | 1.3981 |
| 5000 | 1.3895 |




Training completed. Do not forget to share your model on huggingface.co/models =)




CPU times: user 8min 39s, sys: 19.1 s, total: 8min 58s
Wall time: 8min 58s


TrainOutput(global_step=5320, training_loss=1.9297516786962523, metrics={'train_runtime': 537.9436, 'train_samples_per_second': 632.353, 'train_steps_per_second': 9.89, 'total_flos': 299743328226048.0, 'train_loss': 1.9297516786962523, 'epoch': 5.0})
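
The numbers in the log above can be checked by hand. The short calculation below reproduces `Total optimization steps = 5320` and the reported `train_samples_per_second`, assuming the last, partially filled batch of each epoch is kept and there is no gradient accumulation (the log reports `Gradient Accumulation steps = 1`).

```python
import math

num_examples = 68_034       # "Num examples" in the log
batch_size = 64             # "Total train batch size"
num_epochs = 5              # "Num Epochs"
train_runtime_s = 537.9436  # "train_runtime" of the mini model

steps_per_epoch = math.ceil(num_examples / batch_size)  # last partial batch counts as a step
total_steps = steps_per_epoch * num_epochs
samples_per_second = num_examples * num_epochs / train_runtime_s

print(steps_per_epoch)               # 1064
print(total_steps)                   # 5320
print(round(samples_per_second, 3))
```

Since all four models train on the same dataset with the same batch size, they all run for the same 5320 steps; only the time per step differs.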

In [25]:
%%time
trainers['small'].train()

Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.
Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.
***** Running training *****
  Num examples = 68034
  Num Epochs = 5
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 5320
Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.


| Step | Training Loss |
|------|---------------|
| 500 | 3.0885 |
| 1000 | 1.743 |
| 1500 | 1.4345 |
| 2000 | 1.2881 |
| 2500 | 1.1968 |
| 3000 | 1.1369 |
| 3500 | 1.0847 |
| 4000 | 1.0392 |
| 4500 | 1.0017 |
| 5000 | 0.9803 |




Training completed. Do not forget to share your model on huggingface.co/models =)




CPU times: user 13min 51s, sys: 21.9 s, total: 14min 13s
Wall time: 14min 10s


TrainOutput(global_step=5320, training_loss=1.3728270566553102, metrics={'train_runtime': 850.6099, 'train_samples_per_second': 399.913, 'train_steps_per_second': 6.254, 'total_flos': 688928135161440.0, 'train_loss': 1.3728270566553102, 'epoch': 5.0})

In [23]:
%%time
trainers['medium'].train()

Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.
Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.
***** Running training *****
  Num examples = 68034
  Num Epochs = 5
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 5320
Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.


| Step | Training Loss |
|------|---------------|
| 500 | 2.9359 |
| 1000 | 1.647 |
| 1500 | 1.3687 |
| 2000 | 1.2085 |
| 2500 | 1.0953 |
| 3000 | 1.001 |
| 3500 | 0.9268 |
| 4000 | 0.8727 |
| 4500 | 0.8394 |
| 5000 | 0.8095 |




Training completed. Do not forget to share your model on huggingface.co/models =)




CPU times: user 25min 24s, sys: 28.3 s, total: 25min 52s
Wall time: 25min 46s


TrainOutput(global_step=5320, training_loss=1.242900928698088, metrics={'train_runtime': 1546.5432, 'train_samples_per_second': 219.955, 'train_steps_per_second': 3.44, 'total_flos': 1365872174603424.0, 'train_loss': 1.242900928698088, 'epoch': 5.0})

In [21]:
%%time
trainers['base'].train()

Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.
Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.
***** Running training *****
  Num examples = 68034
  Num Epochs = 5
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 5320
Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.


| Step | Training Loss |
|------|---------------|
| 500 | 2.5325 |
| 1000 | 1.4209 |
| 1500 | 1.1667 |
| 2000 | 1.0035 |
| 2500 | 0.8434 |
| 3000 | 0.7443 |
| 3500 | 0.6539 |
| 4000 | 0.6001 |
| 4500 | 0.5638 |
| 5000 | 0.529 |




Training completed. Do not forget to share your model on huggingface.co/models =)




CPU times: user 51min 33s, sys: 3min 36s, total: 55min 9s
Wall time: 54min 58s


TrainOutput(global_step=5320, training_loss=0.9756975446428572, metrics={'train_runtime': 3298.0518, 'train_samples_per_second': 103.143, 'train_steps_per_second': 1.613, 'total_flos': 3455386280661024.0, 'train_loss': 0.9756975446428572, 'epoch': 5.0})

#### 🎉 Save final model (+ tokenizer + config) to disk

In [28]:
trainers['mini'].save_model(f"{reberth_path}/mini")

Saving model checkpoint to /content/drive/MyDrive/ANLP21/ReBERTh/mini
Configuration saved in /content/drive/MyDrive/ANLP21/ReBERTh/mini/config.json
Model weights saved in /content/drive/MyDrive/ANLP21/ReBERTh/mini/pytorch_model.bin


In [26]:
trainers['small'].save_model(f"{reberth_path}/small")

Saving model checkpoint to /content/drive/MyDrive/ANLP21/ReBERTh/small
Configuration saved in /content/drive/MyDrive/ANLP21/ReBERTh/small/config.json
Model weights saved in /content/drive/MyDrive/ANLP21/ReBERTh/small/pytorch_model.bin


In [24]:
trainers['medium'].save_model(f"{reberth_path}/medium")

Saving model checkpoint to /content/drive/MyDrive/ANLP21/ReBERTh/medium
Configuration saved in /content/drive/MyDrive/ANLP21/ReBERTh/medium/config.json
Model weights saved in /content/drive/MyDrive/ANLP21/ReBERTh/medium/pytorch_model.bin


In [22]:
trainers['base'].save_model(f"{reberth_path}/base")

Saving model checkpoint to /content/drive/MyDrive/ANLP21/ReBERTh/base
Configuration saved in /content/drive/MyDrive/ANLP21/ReBERTh/base/config.json
Model weights saved in /content/drive/MyDrive/ANLP21/ReBERTh/base/pytorch_model.bin


In [None]:
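# NOTE: `trainer` (singular) is not defined in the cells above, which only build the
# `trainers` dict; this cell and its output appear to come from a single-model run.
trainer.save_model(reberth_path)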

Saving model checkpoint to /content/drive/MyDrive/ANLP21/ReBERTh
Configuration saved in /content/drive/MyDrive/ANLP21/ReBERTh/config.json
Model weights saved in /content/drive/MyDrive/ANLP21/ReBERTh/pytorch_model.bin


## 4. Check that the LM actually trained

Aside from looking at the training and eval losses going down, the easiest way to check whether our language model is learning anything interesting is via the `FillMaskPipeline`.

Pipelines are simple wrappers around tokenizers and models, and the 'fill-mask' one will let you input a sequence containing a masked token (here, `<mask>`) and return a list of the most probable filled sequences, with their probabilities.
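
Under the hood, the interesting part of fill-mask is short: take the model's scores (logits) at the masked position, turn them into probabilities with a softmax, and return the top few tokens. Here is a sketch of that step in plain Python. The logits and the four-token vocabulary are made up for illustration, and `top_k_fill` is a hypothetical name, not the pipeline's API.

```python
import math
from typing import Dict, List


def top_k_fill(logits: List[float], id_to_token: Dict[int, str], k: int = 5) -> List[dict]:
    """Softmax the logits at the masked position and return the k most probable tokens."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [{"token": i, "token_str": id_to_token[i], "score": probs[i]} for i in ranked]


toy_vocab = {0: ")", 1: ".*", 2: "))", 3: "{"}  # made-up ids and tokens
toy_logits = [2.0, 0.5, 1.0, -1.0]              # made-up scores for the masked position
predictions = top_k_fill(toy_logits, toy_vocab, k=2)
for prediction in predictions:
    print(prediction)
```

The output has the same shape as the pipeline results shown below: one entry per candidate, with its token id, its string form, and its probability.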



In [None]:
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model=reberth_path,
    tokenizer=reberth_path
)

loading configuration file /content/drive/MyDrive/ANLP21/ReBERTh/config.json
Model config RobertaConfig {
  "_name_or_path": "/content/drive/MyDrive/ANLP21/ReBERTh",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.13.0.dev0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 1000
}

loading configuration file /content/drive/MyDrive/ANLP21/ReBERTh/config.json
Model config RobertaConfig {
  "_name_or_path": "/content/drive/MyDrive/ANLP21/ReBERTh",
  "architectures": [
    "Rob

In [None]:
# Ask the model to complete a partial regex
fill_mask("(dog<mask>")

[{'score': 0.12063060700893402,
  'sequence': '(dog).*',
  'token': 264,
  'token_str': ').*'},
 {'score': 0.051579393446445465,
  'sequence': '(dog))',
  'token': 304,
  'token_str': '))'},
 {'score': 0.026141837239265442,
  'sequence': '(dog)).*',
  'token': 320,
  'token_str': ')).*'},
 {'score': 0.020209401845932007,
  'sequence': '(dog.*',
  'token': 261,
  'token_str': '.*'},
 {'score': 0.01259597111493349,
  'sequence': '(dog){',
  'token': 311,
  'token_str': '){'}]

Ok, simple syntax/grammar works. Let’s try a slightly more interesting prompt:



In [None]:
fill_mask("Jen la komenco de bela <mask>.")

# This is the beginning of a beautiful <mask>.
# =>

[{'score': 0.3046925365924835,
  'sequence': 'Jen la komenco de beladog.',
  'token': 266,
  'token_str': 'dog'},
 {'score': 0.13810423016548157,
  'sequence': 'Jen la komenco de belab.',
  'token': 70,
  'token_str': 'b'},
 {'score': 0.08295250684022903,
  'sequence': 'Jen la komenco de belatruck.',
  'token': 289,
  'token_str': 'truck'},
 {'score': 0.06642985343933105,
  'sequence': 'Jen la komenco de belaz.',
  'token': 94,
  'token_str': 'z'},
 {'score': 0.058431822806596756,
  'sequence': 'Jen la komenco de belaAEIOUaeiou.',
  'token': 279,
  'token_str': 'AEIOUaeiou'}]

## 5. Share your model 🎉

Finally, when you have a nice model, please think about sharing it with the community:

- upload your model using the CLI: `transformers-cli upload`
- write a README.md model card and add it to the repository under `model_cards/`. Your model card should ideally include:
    - a model description,
    - training params (dataset, preprocessing, hyperparameters),
    - evaluation results,
    - intended uses & limitations,
    - whatever else is helpful! 🤓

### **TADA!**

➡️ Your model has a page on http://huggingface.co/models and everyone can load it using `AutoModel.from_pretrained("username/model_name")`.

[![tb](https://huggingface.co/blog/assets/01_how-to-train/model_page.png)](https://huggingface.co/julien-c/EsperBERTo-small)


If you want to take a look at models in different languages, check https://huggingface.co/models

[![all models](https://huggingface.co/front/thumbnails/models.png)](https://huggingface.co/models)
