<a href="https://colab.research.google.com/github/wendywtchang/NLP-projects/blob/master/asr/lm/Train_LM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

1. Find a dataset
2. Train a tokenizer
3. Train a language model from scratch
4. Check that the LM actually trained
5. Fine-tune your LM on a downstream task

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
%cd /content/drive/MyDrive/LM
%ll

/content/drive/MyDrive/LM
total 330
drwx------ 2 root   4096 Nov 29 11:06 aux_scripts/
drwx------ 2 root   4096 Nov 29 11:06 config/
drwx------ 2 root   4096 Nov 29 11:06 data/
drwx------ 2 root   4096 Feb  4 23:16 EsperBERTo/
drwx------ 2 root   4096 Nov 29 16:06 gpt2/
-rw------- 1 root   7481 Nov 30 14:01 kneser_ney.py
-rw------- 1 root   1082 Nov 29 11:05 LICENCE.md
drwx------ 2 root   4096 Nov 29 11:06 logs/
drwx------ 2 root   4096 Nov 30 12:56 models/
-rw------- 1 root 158976 Nov 29 11:05 paper_LREC18.pdf
-rw------- 1 root  63073 Nov 29 11:05 poster_LREC18.pdf
drwx------ 2 root   4096 Nov 30 14:01 __pycache__/
-rw------- 1 root   8148 Nov 29 11:05 README.md
drwx------ 2 root   4096 Nov 29 11:06 scripts/
-rw------- 1 root  37893 Feb  4 22:57 Train_lm_huggingface.ipynb
-rw------- 1 root  21811 Feb  5 00:45 Train_LM.ipynb


# 1. Dataset

In [4]:
# train = ./data/hound/hound-train.txt
# test = ./data/hound/hound-test.txt
%ll ./data/hound/

total 928
-rw------- 1 root  20669 Feb  4 22:49 'Hound challenge.pdf'
-rw------- 1 root 320384 Feb  4 22:49  hound-test.txt
-rw------- 1 root 607788 Feb  4 22:49  hound-train.txt


# 2. Train a tokenizer

- Train a byte-level byte-pair encoding (BPE) tokenizer (the same kind as GPT-2), with the same special tokens as RoBERTa.

- The vocabulary size is picked arbitrarily: 52,000.

- A byte-level BPE is recommended over, say, a WordPiece tokenizer like BERT's because it builds its vocabulary from an alphabet of single bytes, so every word can be decomposed into tokens (no more `<unk>` tokens!).
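
To make the last point concrete, here is a minimal pure-Python sketch of byte-level BPE. It is a toy illustration, not the `tokenizers` implementation: the function names (`merge_pair`, `train_bpe`, `encode`) and the tiny corpus are made up for this example, and it leaves out GPT-2's pre-tokenization regex and the byte-to-unicode remapping that produces symbols such as `Ġ`.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace every adjacent occurrence of `pair` in `tokens` with the merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges, starting from single UTF-8 bytes."""
    # Every whitespace-separated word starts as a tuple of one-byte tokens.
    words = Counter(
        tuple(bytes([b]) for b in word.encode("utf-8")) for word in corpus.split()
    )
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair becomes a new token
        merges.append(best)
        merged = Counter()
        for word, freq in words.items():
            merged[merge_pair(word, best)] += freq
        words = merged
    return merges


def encode(text, merges):
    """Split `text` into bytes, then apply the learned merges in order."""
    tokens = tuple(bytes([b]) for b in text.encode("utf-8"))
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return list(tokens)


merges = train_bpe("low low low lower lowest", num_merges=2)
print(encode("low", merges))  # a word seen in training
print(encode("ĉ", merges))    # a character never seen in training
```

With the two merges learned here, `low` becomes a single token, while the unseen character `ĉ` falls back to its two UTF-8 bytes instead of an `<unk>` token.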


In [5]:
!pwd

/content/drive/MyDrive/LM


In [6]:
# We won't need TensorFlow here
#!pip uninstall -y tensorflow

# Install `transformers` from master
!pip install git+https://github.com/huggingface/transformers
!pip list | grep -E 'transformers|tokenizers'
# transformers version at notebook update --- 2.11.0 -> 4.17.0.dev0 (05/Feb/2022)
# tokenizers version at notebook update --- 0.8.0rc1 -> 0.11.4 (05/Feb/2022)

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-7d4t5462
  Running command git clone -q https://github.com/huggingface/transformers /tmp/pip-req-build-7d4t5462
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting sacremoses
  Downloading sacremoses-0.0.47-py2.py3-none-any.whl (895 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting tokenizers!=0.11.3,>=0.10.1
  Downloading tokenizers-0.11.4-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.8 MB)
Collecting huggingface-hub<

In [7]:
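# NOTE: `%time` on a line by itself times an empty statement, which is why the
# reading below is in microseconds; `%%time` as the first line would time the whole cell.
%time
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer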

paths = [str(x) for x in Path(".").glob("data/hound/hound-train.txt")]
#paths = [str(x) for x in Path(".").glob("**/*.txt")]
print(paths)

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize Training
tokenizer.train(files=paths, vocab_size=52000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 20.3 µs
['data/hound/hound-train.txt']


In [8]:
!mkdir -p EsperBERTo  # -p: no error if the directory already exists
tokenizer.save_model("EsperBERTo")  # writes vocab.json and merges.txt


['EsperBERTo/vocab.json', 'EsperBERTo/merges.txt']

We now have both a `vocab.json`, which is a list of the most frequent tokens ranked by frequency, and a `merges.txt` list of merges.

```json
{
	"<s>": 0,
	"<pad>": 1,
	"</s>": 2,
	"<unk>": 3,
	"<mask>": 4,
	"!": 5,
	"\"": 6,
	"#": 7,
	"$": 8,
	"%": 9,
	"&": 10,
	"'": 11,
	"(": 12,
	")": 13,
	# ...
}

# merges.txt
l a
Ġ k
o n
Ġ la
t a
Ġ e
Ġ d
Ġ p
# ...
```

What is great is that our tokenizer is optimized for Esperanto. Compared to a generic tokenizer trained for English, more native words are represented by a single, unsplit token. Diacritics, i.e. the accented characters used in Esperanto – `ĉ`, `ĝ`, `ĥ`, `ĵ`, `ŝ`, and `ŭ` – are encoded natively. We also represent sequences more efficiently: on this corpus, the average length of encoded sequences is ~30% smaller than when using the pretrained GPT-2 tokenizer.

Here’s how you can use it in `tokenizers`, including handling the RoBERTa special tokens. Of course, you’ll also be able to use it directly from `transformers`.


In [9]:
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing

tokenizer = ByteLevelBPETokenizer(
    "./EsperBERTo/vocab.json",
    "./EsperBERTo/merges.txt"
)

In [10]:
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)

In [11]:
tokenizer.encode("Mi estas Julien.")

Encoding(num_tokens=10, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [12]:
tokenizer.encode("Mi estas Julien.").tokens

['<s>', 'M', 'i', 'Ġest', 'as', 'ĠJ', 'ul', 'ien', '.', '</s>']
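
The post-processing step configured above can be pictured with a small pure-Python sketch. This is an illustration only, not the `tokenizers` implementation: `wrap_with_special_tokens` is a hypothetical helper, and it assumes that truncation reserves room for the two special tokens so that the final sequence never exceeds `max_length`.

```python
def wrap_with_special_tokens(tokens, max_length=512, bos="<s>", eos="</s>"):
    """Truncate `tokens`, then add the RoBERTa-style start and end tokens."""
    body = tokens[: max_length - 2]  # leave room for <s> and </s> (assumption)
    return [bos, *body, eos]


# Sub-word tokens taken from the output above, without the special tokens.
pieces = ["M", "i", "Ġest", "as", "ĠJ", "ul", "ien", "."]
print(wrap_with_special_tokens(pieces))
print(len(wrap_with_special_tokens(["x"] * 1000)))  # capped at max_length
```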

# 3. Train a language model from scratch

- This follows the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/legacy/run_language_modeling.py) script, using the [`Trainer`](https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py) class from Transformers directly. (Another training approach could be substituted.)

- Train a RoBERTa-like model, which is BERT-like with a couple of changes (check the [documentation](https://huggingface.co/transformers/model_doc/roberta.html) for more details).

- As the model is BERT-like, it is trained on the task of *masked language modeling*, i.e. predicting how to fill in tokens that are randomly masked in the dataset. In this notebook the masking is handled by the data collator defined further down.


## Define the following config for the model

In [13]:
import torch
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

Re-create the tokenizer in transformers

In [14]:
from transformers import RobertaTokenizerFast

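# Loading the tokenizer trained above failed with a "config not found" error:
# tokenizer = RobertaTokenizerFast.from_pretrained("./EsperBERTo", max_len=512)
# Fallback: the pretrained roberta-base tokenizer. NOTE: this is not the tokenizer
# trained in section 2, so its vocabulary differs from the one saved in ./EsperBERTo.
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")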


## Initialize the model.

**Important:**

As we are training from scratch, we only initialize from a config, not from an existing pretrained model or checkpoint.

In [15]:
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM(config=config)

In [16]:
model.num_parameters()
# => about 84 million parameters

83504416
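
The reported count can be reproduced by hand from the config. The sketch below is a back-of-the-envelope check, not library code: `roberta_mlm_param_count` is a hypothetical helper, `hidden_size=768` and `intermediate_size=3072` are the `RobertaConfig` defaults (they are not set explicitly above), and it assumes the masked-LM head shares its output matrix with the word embeddings and that the model has no pooler.

```python
def roberta_mlm_param_count(
    vocab_size=52000,
    max_position_embeddings=514,
    type_vocab_size=1,
    hidden_size=768,         # RobertaConfig default (not set in the config above)
    intermediate_size=3072,  # RobertaConfig default (not set in the config above)
    num_hidden_layers=6,
):
    h = hidden_size
    layer_norm = 2 * h  # weight + bias

    def linear(n_in, n_out):
        return n_in * n_out + n_out  # weight matrix + bias

    # Word, position and token-type embeddings, followed by a LayerNorm.
    embeddings = (vocab_size + max_position_embeddings + type_vocab_size) * h + layer_norm
    # Self-attention: query, key, value and output projections, plus a LayerNorm.
    attention = 4 * linear(h, h) + layer_norm
    # Feed-forward block: up-projection, down-projection, plus a LayerNorm.
    feed_forward = linear(h, intermediate_size) + linear(intermediate_size, h) + layer_norm
    encoder = num_hidden_layers * (attention + feed_forward)
    # LM head: dense + LayerNorm + output bias; the output weight is tied to the
    # word embeddings (assumption), so it is not counted a second time.
    lm_head = linear(h, h) + layer_norm + vocab_size
    return embeddings + encoder + lm_head


print(roberta_mlm_param_count())  # 83504416
```

Nearly half of the parameters (52,000 × 768 ≈ 39.9 million) sit in the word-embedding matrix, so the vocabulary size dominates the size of a model this shallow. The number of attention heads does not change the count.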

## Build a training dataset
- Build the training dataset by applying the tokenizer to the text file.
- Use the `LineByLineTextDataset` class out of the box.
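
Conceptually, a line-by-line dataset turns each non-empty line of the file into one training example, truncated to `block_size` tokens. The following is a rough sketch of that idea, not the `transformers` code: `line_by_line_examples` and the `tokenize` argument are hypothetical, special tokens are ignored, and whitespace splitting stands in for the real tokenizer.

```python
def line_by_line_examples(lines, tokenize, block_size=128):
    """Return one token list per non-blank line, truncated to `block_size`."""
    examples = []
    for line in lines:
        if not line or line.isspace():
            continue  # skip empty and whitespace-only lines
        examples.append(tokenize(line)[:block_size])
    return examples


sample_lines = ["Mr. Sherlock Holmes was seated at the table.\n", "\n", "   \n"]
for example in line_by_line_examples(sample_lines, str.split, block_size=5):
    print(example)
```

A consequence of this design is that long lines lose everything past `block_size` tokens and short lines become short examples that are padded later, at batching time.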

In [17]:
!pip install datasets

Collecting datasets
  Downloading datasets-1.18.3-py3-none-any.whl (311 kB)

In [18]:
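# NOTE: a bare `%time` times an empty statement; `%%time` as the first line would time the whole cell.
%time
from transformers import LineByLineTextDataset  # deprecated; slated for removal from transformers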
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./data/hound/hound-train.txt",
    block_size=128,
)

# from datasets import load_dataset
# dataset = load_dataset('text', data_files={'train': ['./data/hound/hound-train.txt'], 'test': './data/hound/hound-test.txt'})

# #dataset = load_dataset('text', data_files=['./data/hound/hound-train.txt'], split='train')



CPU times: user 5 µs, sys: 2 µs, total: 7 µs
Wall time: 12.2 µs




## Define a data_collator
- As in the [`run_language_modeling.py`](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_language_modeling.py) script, define a data collator.

- It is a small helper that batches samples of the dataset together into tensors that PyTorch can run backpropagation on. With `mlm=True` it also applies the random masking.

In [19]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, 
    mlm=True,
    mlm_probability=0.15
)
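
What `mlm=True` with `mlm_probability=0.15` means can be sketched in plain Python. This is a simplified illustration rather than the collator's actual code: `mask_tokens` is a hypothetical helper working on one list of ids (no padding, no tensors). It follows the usual BERT recipe: of the selected positions, 80% become the mask token, 10% a random token, and 10% stay unchanged, and every unselected position gets the label `-100` so the loss ignores it.

```python
import random

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_tokens(input_ids, mask_id, vocab_size, special_ids=(), mlm_probability=0.15, rng=None):
    """Return (masked_inputs, labels) for one sequence of token ids."""
    rng = rng or random.Random()
    inputs = list(input_ids)
    labels = [IGNORE_INDEX] * len(inputs)
    for i, token in enumerate(input_ids):
        # Special tokens are never selected; other tokens are selected with p = mlm_probability.
        if token in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = token  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with <mask>
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels


# Illustrative ids using the special-token ids from the vocab.json listing above:
# <s>=0, <pad>=1, </s>=2, <mask>=4. The other ids are arbitrary.
ids = [0] + list(range(100, 120)) + [2]
masked, labels = mask_tokens(ids, mask_id=4, vocab_size=52000, special_ids={0, 1, 2})
print(masked)
print(labels)
```

Because the selection is redrawn each time a batch is built, the same sentence is masked differently from epoch to epoch.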

## Initialize the Trainer

In [28]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./EsperBERTo",
    overwrite_output_dir=True,
    num_train_epochs=5,
    per_gpu_train_batch_size=64,  # deprecated name; `per_device_train_batch_size` is preferred
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
)


# trainer = Trainer(
#     model=model, 
#     args=training_args, 
#     data_collator=data_collator,
#     train_dataset=dataset['train']['text'], 
#     eval_dataset=dataset['test']['text']
# )

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


## Start Training

In [29]:
dataset

<transformers.data.datasets.language_modeling.LineByLineTextDataset at 0x7f4885fa6e10>

In [30]:
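# NOTE: a bare `%time` times an empty statement; `%%time` as the first line would time the training run.
%time
trainer.train()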

Using deprecated `--per_gpu_train_batch_size` argument which will be removed in a future version. Using `--per_device_train_batch_size` is preferred.
***** Running training *****
  Num examples = 9633
  Num Epochs = 5
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 755


CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 24.6 µs


| Step | Training Loss |
|------|---------------|
| 500  | 6.2453        |




Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=755, training_loss=6.132304655163494, metrics={'train_runtime': 128.8039, 'train_samples_per_second': 373.94, 'train_steps_per_second': 5.862, 'total_flos': 345739517083584.0, 'train_loss': 6.132304655163494, 'epoch': 5.0})
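
The step count and throughput in the log can be cross-checked with a few lines of arithmetic. This is only a sanity check on the numbers printed above; it assumes the last, partially filled batch of each epoch is kept rather than dropped.

```python
import math

num_examples = 9633       # "Num examples" in the training log
batch_size = 64           # "Total train batch size" in the training log
num_epochs = 5
train_runtime = 128.8039  # seconds, from TrainOutput

steps_per_epoch = math.ceil(num_examples / batch_size)  # last batch is partial
total_steps = steps_per_epoch * num_epochs
samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 2), round(steps_per_second, 3))
```

The 755 optimization steps match a batch size of 64. The "Instantaneous batch size per device = 8" line in the log does not, since a batch of 8 would give 1,205 steps per epoch.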

## Save model

In [23]:
# trainer.save_model("./EsperBERTo")

- `train_loss` = 8.085252774472268 after 1 epoch
- `train_loss` = 6.132304655163494 after 5 epochs


# Evaluation: perplexity

- TODO: compute the perplexity on the test set (`./data/hound/hound-test.txt`).

In [27]:
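# `torch.exp` expects a Tensor, but `train_loss` is a Python float (hence the TypeError); use `math.exp`.
import math
perplexity = math.exp(train_loss)  # from the training loss, so this is not a test-set perplexity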
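
Perplexity is the exponential of the mean per-token cross-entropy (in nats), so for a plain Python float `math.exp` is all that is needed. The sketch below converts the two training losses recorded above and compares them with a uniform guess over the 52,000-entry output vocabulary. The `perplexity` helper is hypothetical, and because these are masked-LM training losses the results are only a rough "did it learn anything" check, not a test-set perplexity.

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity for a mean cross-entropy loss expressed in nats."""
    return math.exp(mean_cross_entropy)


vocab_size = 52000
uniform_loss = math.log(vocab_size)  # loss of a model that guesses uniformly

for label, loss in [
    ("uniform guess", uniform_loss),
    ("epoch 1", 8.085252774472268),
    ("epoch 5", 6.132304655163494),
]:
    print(f"{label:>13}: loss={loss:.3f}  perplexity={perplexity(loss):,.0f}")
```

Both recorded losses are well below the uniform baseline of about 10.86 nats, which suggests the model learned something, although a loss near 6 after five epochs on a small corpus is still high.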