# Hugging Face Causal Language Model

Fine-tuning of the [facebook/opt-125m](https://huggingface.co/facebook/opt-125m) model on a Portuguese dataset, [mc4-pt-sample-1g.txt](https://unicamp-dl/ia025a_2022s1/aula9/sample-1gb.txt), of about 300 million tokens, using the causal language modeling pre-training objective. The OPT-125M model was originally trained on an English dataset of approximately 300 billion tokens.

The task is to predict the next token in a sequence given only the tokens before it, as in GPT, rather than to fill in masked tokens using context from both sides, as in BERT.
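
To make the objective concrete, here is a minimal sketch in plain Python of how next-token training pairs can be read off a token sequence. The token ids and the helper name `next_token_pairs` are illustrative assumptions, not part of the notebook or of the `transformers` API.

```python
# Toy token ids standing in for a tokenized sentence (hypothetical values).
token_ids = [2, 10926, 260, 43598, 897]


def next_token_pairs(ids):
    """Return (context, target) pairs for next-token prediction.

    At position i the model may only look at ids[: i + 1] and is asked
    to predict ids[i + 1]; it never sees tokens to the right.
    """
    return [(ids[: i + 1], ids[i + 1]) for i in range(len(ids) - 1)]


for context, target in next_token_pairs(token_ids):
    print(context, "->", target)
```

A sequence of length n therefore yields n - 1 prediction targets, all computed in a single forward pass thanks to the causal attention mask.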

[![google colab link](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/tcvieira/IA368-DD-012023/blob/main/assingments/04/notebook.ipynb)

In [1]:
# Training runs on a T4 GPU with 16 GB of memory
!nvidia-smi

Thu Mar 30 00:05:57 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   61C    P8    16W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               

# Installs

In [2]:
!pip install transformers -q
!pip install datasets -q
!pip install ipython-autotime -q
%load_ext autotime

time: 539 µs (started: 2023-03-30 00:06:25 +00:00)


# Imports

In [3]:
from datasets import load_dataset, DatasetDict
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
    AutoConfig
)
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

time: 5.74 s (started: 2023-03-30 00:06:25 +00:00)


# Dataset

In [4]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive
time: 6.28 s (started: 2023-03-30 00:06:31 +00:00)


In [5]:
PATH_DATASET = '/content/drive/MyDrive/unicamp/IA368DD/class_4'

time: 483 µs (started: 2023-03-30 00:26:42 +00:00)


In [6]:
#!gsutil cp gs://unicamp-dl/ia025a_2022s1/aula9/sample-1gb.txt {PATH_DATASET}/sample-1gb.txt

time: 190 µs (started: 2023-03-30 00:26:42 +00:00)


In [7]:
!head {PATH_DATASET}/sample-1gb.txt

Linkbar Há alguns anos, o número de rapazes e moças que subiam ao púlpito para pregar era maior que o de hoje. Na sua simplicidade, falavam do amor de Deus, da Salvação e davam testemunho sob a unção do Espirito Santo. Hoje, parece que a figura do "preletor oficial" inibiu muitos de falarem com ousadia a Palavra de Deus. Parece que há um receio de falar diante de um público que, certamente, é mais intelectualizado que há alguns anos. Jovens pregadores ficam embaraçados e cometem certos deslizes, que poderiam ser evitados. Neste modesto trabalho, vamos dar apenas algumas sugestões, e não um estudo sobre a Homilética (Arte de Falar em Publico). I -O QUE PREGAR? É a comunicação verbal da Palavra de Deus aos ouvintes. É a transmissão do evangelho de Nosso Senhor Jesus Cristo às pessoas que precisam ouvi-lo. II- QUAL A FINALIDADE DA PREGAÇÃO? É persuadir as pessoas a aceitarem a mensagem da Palavra de Deus para sua salvação (descrentes) ou para seu crescimento espiritual (crentes). Diante d

In [8]:
!wc -l {PATH_DATASET}/sample-1gb.txt

250000 /content/drive/MyDrive/unicamp/IA368DD/class_4/sample-1gb.txt
time: 833 ms (started: 2023-03-30 00:26:44 +00:00)


## Small dataset for testing

In [9]:
#!sed -n '1,100p' {PATH_DATASET}/sample-1gb.txt > {PATH_DATASET}/sample_small.txt

time: 231 µs (started: 2023-03-30 00:26:48 +00:00)


In [10]:
!wc -l {PATH_DATASET}/sample_small.txt

100 /content/drive/MyDrive/unicamp/IA368DD/class_4/sample_small.txt
time: 128 ms (started: 2023-03-30 00:26:48 +00:00)


In [11]:
small_dataset = load_dataset("text", data_files=f'{PATH_DATASET}/sample_small.txt')
small_dataset["validation"] = load_dataset("text", data_files=f'{PATH_DATASET}/sample_small.txt', split="train[:5%]")
small_dataset["train"] = load_dataset("text", data_files=f'{PATH_DATASET}/sample_small.txt', split="train[5%:]")

small_dataset



  0%|          | 0/1 [00:00<?, ?it/s]



DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 95
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 5
    })
})

time: 2.97 s (started: 2023-03-30 00:26:49 +00:00)


In [43]:
base_dataset = load_dataset("text", data_files=f'{PATH_DATASET}/sample-1gb.txt')
base_dataset["validation"] = load_dataset("text", data_files=f'{PATH_DATASET}/sample-1gb.txt', split="train[:20%]")
base_dataset["train"] = load_dataset("text", data_files=f'{PATH_DATASET}/sample-1gb.txt', split="train[20%:]")

base_dataset



  0%|          | 0/1 [00:00<?, ?it/s]



DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 200000
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 50000
    })
})

time: 2.82 s (started: 2023-03-29 19:56:59 +00:00)


# Select Dataset

In [44]:
#dataset = small_dataset
dataset = base_dataset

time: 325 µs (started: 2023-03-29 19:57:17 +00:00)


# Parameters

In [77]:
MODEL_NAME = 'facebook/opt-125m'
MAX_SEQ_LENGTH=1024
BATCH_SIZE=4
EPOCHS=1
MODEL_OUTPUT_FOLDER=f'{PATH_DATASET}/model_output'
MODEL_SAVE_FOLDER=f'{PATH_DATASET}/model_save'

time: 515 µs (started: 2023-03-29 23:56:14 +00:00)


# Tokenizer

In [69]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

time: 1.09 s (started: 2023-03-29 23:41:13 +00:00)


# Model

`AutoModelForCausalLM` is used for auto-regressive language models like all the GPT models, while `AutoModelForSeq2SeqLM` is used for language models with encoder-decoder architecture like T5 and BART.

https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForCausalLM

In [70]:
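# Download the configuration from huggingface.co and cache it.
# Note: from_config builds the architecture with randomly initialized weights;
# AutoModelForCausalLM.from_pretrained(MODEL_NAME) would load the pretrained weights instead.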
config = AutoConfig.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_config(config)
model.to(device)

OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 768, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 768)
      (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (layers): ModuleList(
        (0): OPTDecoderLayer(
          (self_attn): OPTAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (activation_fn): ReLU()
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (final_layer_norm): LayerNorm((768,), eps=1e-05,

time: 3.18 s (started: 2023-03-29 23:41:17 +00:00)


In [48]:
model_size = sum(t.numel() for t in model.parameters())
print(f"OPT-125m size: {model_size/1000**2:.1f}M parameters")

OPT-125m size: 125.2M parameters
time: 1.16 ms (started: 2023-03-29 19:57:45 +00:00)
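
As a sanity check, the 125.2M figure can be reproduced by hand from the layer shapes in the model printout. The sketch below is plain Python arithmetic under two assumptions that the truncated printout does not show: the decoder has 12 identical layers, and the output projection shares its weights with the token embedding (so it is not counted twice).

```python
HIDDEN = 768       # model width, from the printout
FFN = 3072         # feed-forward width (fc1 / fc2), from the printout
VOCAB = 50272      # rows of embed_tokens, from the printout
POSITIONS = 2050   # rows of embed_positions, from the printout
NUM_LAYERS = 12    # assumption: the printout is truncated after layer 0


def linear(n_in, n_out):
    """Parameters of a Linear layer with bias."""
    return n_in * n_out + n_out


def layer_norm(n):
    """Parameters of a LayerNorm with elementwise affine (weight + bias)."""
    return 2 * n


per_layer = (
    4 * linear(HIDDEN, HIDDEN)   # k_proj, v_proj, q_proj, out_proj
    + linear(HIDDEN, FFN)        # fc1
    + linear(FFN, HIDDEN)        # fc2
    + 2 * layer_norm(HIDDEN)     # self_attn_layer_norm, final_layer_norm
)

total = (
    VOCAB * HIDDEN               # token embeddings
    + POSITIONS * HIDDEN         # learned positional embeddings
    + NUM_LAYERS * per_layer
    + layer_norm(HIDDEN)         # decoder-level final_layer_norm
)

print(f"{total:,} parameters ({total / 1000**2:.1f}M)")
```

Roughly a third of the parameters sit in the token embedding table, which is worth keeping in mind when the vocabulary was built for English rather than Portuguese.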


# Tokenization

The model was pretrained with a maximum sequence length of 2048. We initially use a length of 256 to avoid running out of memory.

CHANGE: we tokenize all samples in a batch (consisting of 1000 documents) and build one long token sequence by concatenating all examples, separated by the special EOS token. Finally, we split the long sequence into chunks of 512 tokens, which are used for training. Note that the tokenization cell in this section still truncates and pads each document to `MAX_SEQ_LENGTH` instead.
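
The concatenate-and-chunk idea described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the notebook's code: the function name `concat_and_chunk` is hypothetical, the EOS id is assumed to be 2 (in practice read it from `tokenizer.eos_token_id`), and the documents are assumed to have been tokenized without padding or truncation.

```python
from itertools import chain

EOS_TOKEN_ID = 2   # assumption: id of </s>; prefer tokenizer.eos_token_id
CHUNK_SIZE = 512   # chunk length mentioned in the text


def concat_and_chunk(batch, chunk_size=CHUNK_SIZE, eos_id=EOS_TOKEN_ID):
    """Concatenate tokenized documents and cut them into fixed-size chunks.

    batch: {"input_ids": [[...], [...], ...]}, one list of ids per document.
    """
    # Append EOS after every document and flatten into one long stream.
    stream = list(chain.from_iterable(ids + [eos_id] for ids in batch["input_ids"]))
    # Drop the incomplete tail so that every chunk has exactly chunk_size tokens.
    usable = (len(stream) // chunk_size) * chunk_size
    chunks = [stream[i:i + chunk_size] for i in range(0, usable, chunk_size)]
    # No padding is needed, so every position is attended to.
    return {
        "input_ids": chunks,
        "attention_mask": [[1] * chunk_size for _ in chunks],
    }


demo = concat_and_chunk({"input_ids": [[5, 6, 7], [8, 9], [10]]}, chunk_size=4)
print(demo["input_ids"])
```

A function of this shape could be applied with `dataset.map(..., batched=True, batch_size=1000)` after a tokenization pass that keeps only `input_ids`; because the number of output rows differs from the number of input rows, the original columns have to be removed in the same call. Compared with padding every document to a fixed length, this avoids spending compute on padding tokens.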

In [71]:
tokenized_dataset = dataset.map(lambda x: tokenizer(x["text"], 
                                      truncation=True, 
                                      padding="max_length", 
                                      max_length=MAX_SEQ_LENGTH), 
                                      batched=True, 
                                      num_proc=4, 
                                      remove_columns=["text"]
                                     )

Map (num_proc=4):   0%|          | 0/200000 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/50000 [00:00<?, ? examples/s]

time: 7min 2s (started: 2023-03-29 23:41:41 +00:00)


In [72]:
print(tokenized_dataset)
print(f"{len(tokenized_dataset['train']['input_ids'][0])} tokens - {tokenized_dataset['train']['input_ids'][0]}")

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 200000
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 50000
    })
})
1024 tokens - [2, 10926, 260, 43598, 897, 5441, 2102, 263, 5716, 9803, 109, 17614, 242, 4, 4838, 10870, 16738, 842, 1855, 424, 424, 62, 260, 991, 9401, 4, 673, 32868, 636, 1977, 642, 1020, 842, 3304, 14537, 2953, 290, 5352, 6, 466, 6301, 40632, 364, 8541, 8604, 3137, 316, 361, 4531, 10870, 16738, 117, 952, 3070, 7984, 11332, 740, 225, 2527, 4, 83, 32709, 808, 1829, 4410, 2154, 338, 1526, 506, 2426, 7935, 263, 501, 6, 466, 10870, 16738, 2953, 6301, 40632, 117, 38064, 1479, 12834, 109, 32868, 636, 1977, 642, 1020, 4, 846, 1210, 13265, 9958, 32868, 636, 1977, 642, 4544, 263, 2884, 102, 5874, 6, 22595, 9618, 211, 3181, 12, 22686, 4168, 2102, 939, 4746, 139, 9803, 6, 3105, 260, 8557, 842, 2662, 4324, 10, 3191, 6301, 10, 17074, 12, 673, 13967, 263, 6331, 257, 10, 913

# Training

In [78]:
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir=MODEL_OUTPUT_FOLDER,
                                  num_train_epochs=EPOCHS, 
                                  per_device_train_batch_size=BATCH_SIZE,
                                  per_device_eval_batch_size=BATCH_SIZE, 
                                  evaluation_strategy="epoch", 
                                  save_strategy="epoch",
                                  logging_strategy="epoch", 
                                  learning_rate=2e-5, 
                                  weight_decay=0.01,
                                  fp16=True # Use mixed precision
                                )

time: 2.84 ms (started: 2023-03-29 23:56:47 +00:00)


In [79]:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
trainer = Trainer(model=model, 
                  args=training_args, 
                  train_dataset=tokenized_dataset["train"],
                  eval_dataset=tokenized_dataset["validation"], 
                  data_collator=data_collator)

time: 6.39 ms (started: 2023-03-29 23:56:51 +00:00)
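
With `mlm=False`, the collator does not mask any input tokens; it copies `input_ids` into `labels` and marks padding positions so that they do not contribute to the loss. The plain-Python sketch below mimics that step for a single sequence. The helper name `make_clm_labels` is hypothetical, and the pad id of 1 is taken from `padding_idx=1` in the model printout.

```python
PAD_TOKEN_ID = 1     # padding_idx shown in the model printout
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss by default


def make_clm_labels(input_ids, pad_id=PAD_TOKEN_ID):
    """Copy input ids into labels, replacing padding with the ignore index."""
    return [IGNORE_INDEX if token == pad_id else token for token in input_ids]


padded = [2, 10926, 260, 1, 1]
print(make_clm_labels(padded))
```

The one-position shift between inputs and targets is not done here; the causal LM model applies it internally when it computes the loss from `labels`.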


In [80]:
trainer.train()

OutOfMemoryError: ignored

time: 87.4 ms (started: 2023-03-29 23:57:07 +00:00)
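
One plausible contributor to the out-of-memory error is the size of the output logits, which grows linearly with sequence length. The back-of-the-envelope sketch below estimates that single tensor only; it assumes 4 bytes per value (fp32) and ignores activations, gradients, and optimizer state, so it is a rough lower bound rather than a diagnosis. The helper name `logits_gib` is hypothetical.

```python
VOCAB_SIZE = 50272  # from Embedding(50272, 768) in the model printout


def logits_gib(batch_size, seq_len, vocab_size=VOCAB_SIZE, bytes_per_value=4):
    """Size in GiB of a (batch, seq_len, vocab) logits tensor."""
    return batch_size * seq_len * vocab_size * bytes_per_value / 2**30


for seq_len in (256, 1024):
    print(f"batch=4, seq_len={seq_len}: {logits_gib(4, seq_len):.2f} GiB")
```

Going from 256 to 1024 tokens multiplies this tensor by four, and the attention scores grow even faster (quadratically in sequence length), which is consistent with the shorter setting fitting on the T4 while the longer one does not.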


In [54]:
model.save_pretrained(MODEL_SAVE_FOLDER)
tokenizer.save_pretrained(MODEL_SAVE_FOLDER)

('/content/drive/MyDrive/unicamp/IA368DD/class_4/model_save/tokenizer_config.json',
 '/content/drive/MyDrive/unicamp/IA368DD/class_4/model_save/special_tokens_map.json',
 '/content/drive/MyDrive/unicamp/IA368DD/class_4/model_save/vocab.json',
 '/content/drive/MyDrive/unicamp/IA368DD/class_4/model_save/merges.txt',
 '/content/drive/MyDrive/unicamp/IA368DD/class_4/model_save/added_tokens.json',
 '/content/drive/MyDrive/unicamp/IA368DD/class_4/model_save/tokenizer.json')

time: 2.72 s (started: 2023-03-29 22:20:18 +00:00)


In [58]:
#!zip model {MODEL_SAVE_FOLDER}/*

time: 283 µs (started: 2023-03-29 22:55:03 +00:00)


# Evaluation

In [56]:
model = AutoModelForCausalLM.from_pretrained(MODEL_SAVE_FOLDER)
trainer = Trainer(
    model = model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['validation'],
    data_collator = data_collator
)

time: 1.86 s (started: 2023-03-29 22:48:14 +00:00)


In [57]:
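# NOTE: evaluate() without arguments uses eval_dataset (the validation split),
# so despite the 'Train' label this is the validation perplexity.
eval_results = trainer.evaluate()
perplexity = torch.exp(torch.tensor(eval_results["eval_loss"]))
print(f'Train Perplexity =  {perplexity.item():.2f}')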

Train Perplexity =  31.67
time: 6min 36s (started: 2023-03-29 22:48:26 +00:00)
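
Perplexity is the exponential of the mean per-token cross-entropy (in nats), which is why the cell above applies `exp` to `eval_loss`. A minimal plain-Python sketch of the relationship, with a hypothetical helper `perplexity` and made-up per-token losses:

```python
import math


def perplexity(token_nlls):
    """exp of the mean negative log-likelihood (natural log) per token."""
    return math.exp(sum(token_nlls) / len(token_nlls))


# A model that always gives probability 1/4 to the correct token is as
# uncertain as a uniform choice among 4 options, so its perplexity is 4.
uniform_over_four = [-math.log(0.25)] * 10
print(perplexity(uniform_over_four))

# Going the other way: the loss that corresponds to a perplexity of 31.67.
print(math.log(31.67))
```

Read this way, a perplexity of 31.67 corresponds to a mean loss of about 3.46 nats per token, or roughly the uncertainty of choosing uniformly among 32 tokens at each step.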


In [62]:
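# This evaluates the same validation split again, passed explicitly;
# the notebook does not define a separate test split.
eval_results = trainer.evaluate(tokenized_dataset["validation"])
perplexity = torch.exp(torch.tensor(eval_results["eval_loss"]))
print(f'Validation Perplexity =  {perplexity.item():.2f}')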

KeyboardInterrupt: ignored

time: 25min 46s (started: 2023-03-29 23:12:18 +00:00)


In [63]:
prompt = 'rapazes e moças que subiam ao púlpito'
inputs = tokenizer(prompt, return_tensors='pt')

# Generate
generate_ids = model.generate(inputs.input_ids.to(device), max_length=30)
tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

'rapazes e moças que subiam ao púlpito, ao público, ao púb'

time: 123 ms (started: 2023-03-29 23:38:10 +00:00)


In [64]:
# Leonardo Augusto da Silva Pacheco - https://github.com/leonardo3108/IA368dd/blob/main/exercicios/Aula_5/Aula_5_Treino_Modelo_de_Linguagem.ipynb

prompt = 'rapazes e moças que subiam ao púlpito'
max_output_tokens = 20
model.eval()

for _ in range(max_output_tokens):
    input_ids = tokenizer(text=prompt)['input_ids']
    input_ids_truncated = input_ids[-MAX_SEQ_LENGTH:]  # Use only the last MAX_SEQ_LENGTH tokens as model input.
    output = model(torch.LongTensor([input_ids_truncated]).to(device))
    logits = output['logits'][:, -1, :]  # Keep only the logits for the last position in the sequence.
    predicted_id = torch.argmax(logits).item()  # Pick the highest-probability token (greedy decoding).
    input_ids += [predicted_id]  # Append the chosen token to the input.
    prompt = tokenizer.decode(input_ids)
    print(prompt.replace('</s>', ''))

rapazes e moças que subiam ao púlpito,
rapazes e moças que subiam ao púlpito, a
rapazes e moças que subiam ao púlpito, ao
rapazes e moças que subiam ao púlpito, ao p
rapazes e moças que subiam ao púlpito, ao pú
rapazes e moças que subiam ao púlpito, ao púb
rapazes e moças que subiam ao púlpito, ao públic
rapazes e moças que subiam ao púlpito, ao público
rapazes e moças que subiam ao púlpito, ao público,
rapazes e moças que subiam ao púlpito, ao público, a
rapazes e moças que subiam ao púlpito, ao público, ao
rapazes e moças que subiam ao púlpito, ao público, ao p
rapazes e moças que subiam ao púlpito, ao público, ao pú
rapazes e moças que subiam ao púlpito, ao público, ao púb
rapazes e moças que subiam ao púlpito, ao público, ao públic
rapazes e moças que subiam ao púlpito, ao público, ao público
rapazes e moças que subiam ao púlpito, ao público, ao público de
rapazes e moças que subiam ao púlpito, ao público, ao público de j
rapazes e moças que subiam ao púlpito, ao público, ao públic
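
The repeated ", ao público" in the output above is a common side effect of greedy decoding: once the most likely continuation leads back to a token already produced, the argmax follows the same path again. The toy sketch below reproduces the effect with a hand-written table of next-token scores. The table and the function `greedy_decode` are purely hypothetical and are not derived from the trained model.

```python
# Hypothetical next-token scores for a tiny vocabulary (not from the trained model).
NEXT_TOKEN_SCORES = {
    "púlpito": {",": 0.6, ".": 0.4},
    ",": {"ao": 0.7, "e": 0.3},
    "ao": {"público": 0.8, "púlpito": 0.2},
    "público": {",": 0.55, ".": 0.45},
}


def greedy_decode(start, steps):
    """Always append the highest-scoring next token."""
    tokens = [start]
    for _ in range(steps):
        scores = NEXT_TOKEN_SCORES.get(tokens[-1])
        if scores is None:  # no known continuation: stop early
            break
        tokens.append(max(scores, key=scores.get))
    return tokens


print(" ".join(greedy_decode("púlpito", 7)))
```

Even though "." is almost as likely as "," after "público", greedy decoding never picks it, so the loop never ends. Sampling-based options of `model.generate` (for example `do_sample=True` with `top_k` or `top_p`) are one common way to break such cycles.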

# Results

Train/validation split: 200,000 / 50,000 lines.

| seq length | epochs | batch size | validation perplexity |
|:----------:|:------:|:----------:|:---------------------:|
|     256    |    1   |      4     |         31.67         |
|    1024    |    1   |      8     |                       |

----