<a href="https://colab.research.google.com/github/thiagolaitz/IA368-search-engines/blob/main/Project%2004/opt_125m_pt_finetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Causal Language Modeling

This Colab notebook fine-tunes the facebook/opt-125m model on Portuguese text by continuing its causal language modeling pre-training on roughly 300 million additional tokens. The opt-125m model was originally trained on approximately 300 billion tokens.

Training uses the HuggingFace Trainer, with Wandb for logging. The training report is publicly available [here](https://api.wandb.ai/links/thiagolaitz1/ths2zi4c).


In [None]:
# The training is done using an A100 with 40GB of memory
!nvidia-smi

Mon Mar 27 20:12:34 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   30C    P0    44W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

# Wandb

Wandb (short for Weights and Biases) is a cloud-based platform for experiment tracking and visualization. It lets users log and compare machine learning experiments, visualize results, and collaborate with team members. Wandb provides an easy-to-use interface for tracking hyperparameters, training loss, and other model performance metrics, and it is fully integrated with the HuggingFace Trainer.

In [None]:
!pip install wandb -q



In [None]:
import wandb

# Log in with your account
wandb.login()

wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [None]:
# Choose the project
%env WANDB_PROJECT=causal_language_modeling

env: WANDB_PROJECT=causal_language_modeling
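
Outside a notebook the `%env` magic is not available. A minimal equivalent, assuming it runs before the `Trainer` is created, sets the same variable through `os.environ`:

```python
import os

# Same effect as the %env magic above: the HuggingFace wandb integration
# reads WANDB_PROJECT to decide which project receives the run.
os.environ["WANDB_PROJECT"] = "causal_language_modeling"
```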


# Dataset Download

The chosen dataset is a reduced version of mC4 that contains only Portuguese examples (~300 million tokens).

In [None]:
# Downloads the dataset
!gsutil cp gs://unicamp-dl/ia025a_2022s1/aula9/sample-1gb.txt .

Copying gs://unicamp-dl/ia025a_2022s1/aula9/sample-1gb.txt...
==> NOTE: You are downloading one or more large file(s), which would
run significantly faster if you enabled sliced object downloads. This
feature is enabled by default but requires that compiled crcmod be
installed (see "gsutil help crcmod").

Operation completed over 1 objects/1.2 GiB.


In [None]:
!wc -l /content/sample-1gb.txt

250000 /content/sample-1gb.txt


# Model and Tokenizer

In [None]:
!pip install transformers datasets -q


In [None]:
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)


In [None]:
from datasets import load_dataset

# Load the text file; each line becomes one example
dataset = load_dataset("text", data_files="/content/sample-1gb.txt")


## Tokenization

First, we need to decide on the context length. Here it is set to 512, which is shorter than the 2048 tokens used in the original model training. Next, we tokenize all samples in each batch (1000 documents) and build one long sequence of tokens by concatenating the examples, separating them with the special EOS token. Finally, we split the long sequence into chunks of 512 tokens, which are used for training. The last chunk of each batch can be shorter than 512 tokens.

In [None]:
context_length = 512

def tokenize(element):
    # Get all tokenized samples from the batch without truncation or padding
    tokenized_samples = tokenizer(
        element["text"],
        truncation=False,
        padding=False,
        add_special_tokens=False
    )

    # Concatenate all samples of the batch into one long sample
    # separated by the tokenizer's eos_token
    long_sample = []
    for example in tokenized_samples["input_ids"]:
        long_sample.extend(example + [tokenizer.eos_token_id])

    # Return chunks of context_length tokens (the last one may be shorter)
    batch = []
    for i in range(0, len(long_sample), context_length):
        batch.append(long_sample[i:i+context_length])
    
    return {"input_ids": batch}


tokenized_datasets = dataset.map(
    tokenize, batched=True, remove_columns=dataset["train"].column_names
)
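
To make the concatenate-and-chunk step concrete, here is a small sketch that applies the same logic to hand-written token ids. The helper name `chunk_documents`, the toy ids, and the use of `0` as the EOS id are all hypothetical and chosen only for illustration.

```python
def chunk_documents(tokenized_docs, eos_id, context_length):
    """Concatenate token-id lists (EOS after each) and split into chunks."""
    long_sample = []
    for doc in tokenized_docs:
        long_sample.extend(doc + [eos_id])
    return [
        long_sample[i:i + context_length]
        for i in range(0, len(long_sample), context_length)
    ]


# Hypothetical token ids; 0 plays the role of the EOS token.
toy_docs = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]
chunks = chunk_documents(toy_docs, eos_id=0, context_length=5)
print(chunks)
# [[11, 12, 13, 0, 21], [22, 0, 31, 32, 33], [34, 0]]
```

Chunks can start or end in the middle of a document, and the final chunk is shorter than `context_length`, which is why padding is needed at training time.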

# Training

For training, we initialize a data collator that dynamically pads each batch to the length of its longest sample. This is necessary because the last chunk of a long token sequence can be shorter than 512 tokens, so the remaining positions must be padded for all samples in the batch to have the same length. We also set mlm to False, since we are training a causal language model rather than a masked one; in this mode the collator builds the labels as a copy of input_ids and excludes padding positions from the loss.

In [None]:
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
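
The following pure-Python sketch illustrates roughly what the collator produces for a batch with one short chunk. It is a simplified illustration, not the library's implementation: the function name `collate_causal_lm` and the toy ids are hypothetical, right-padding is assumed, and only the padded positions are masked with `-100` (the index ignored by the cross-entropy loss).

```python
def collate_causal_lm(batch, pad_id, ignore_index=-100):
    """Pad every sample to the longest one and build causal LM labels."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        # Labels mirror the inputs; padded positions are ignored by the loss.
        labels.append(ids + [ignore_index] * n_pad)
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
    }


# Hypothetical token ids, with 1 as the padding id.
toy_batch = collate_causal_lm([[5, 6, 7], [8]], pad_id=1)
print(toy_batch["input_ids"])  # [[5, 6, 7], [8, 1, 1]]
print(toy_batch["labels"])     # [[5, 6, 7], [8, -100, -100]]
```

The labels are not shifted here; the model shifts them internally so that each position predicts the next token.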

The training arguments use a batch size of 32, which requires approximately 33GB of GPU memory during training. Training runs in half precision (fp16) to reduce GPU memory use and to shorten training time. Additionally, logging_first_step is set to True so that the training loss is logged at step 1, which gives a baseline close to the model's loss before fine-tuning.

In [None]:
# Define the training arguments
training_args = TrainingArguments(
    output_dir="trained_model",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    save_strategy="steps",
    save_steps=5000,
    logging_strategy="steps",
    logging_steps=500,
    logging_first_step=True,
    warmup_steps=500,
    learning_rate=5e-5,
    weight_decay=0.1,
    report_to=["wandb"],
    fp16=True,
)
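
With these arguments the Trainer uses its default linear learning-rate schedule: the rate rises linearly from 0 to `learning_rate` over `warmup_steps`, then decays linearly to 0 at the last step. The sketch below reproduces that shape under stated assumptions; the function name `linear_schedule_lr` is hypothetical, and `total_steps=26710` is the step count this run reports at the end of training.

```python
def linear_schedule_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=26710):
    """Learning rate at a given step for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Warmup: scale linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay: scale linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 250, 500, 13605, 26710):
    print(step, linear_schedule_lr(step))
```

The peak rate of 5e-5 is reached at step 500, so the first logged losses in this run are produced while the learning rate is still ramping up.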

In [None]:
# Defines the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    data_collator=data_collator
)

In [None]:
# Train the model
trainer.train()

| Step | Training Loss |
|------|---------------|
| 1    | 3.2153        |
| 500  | 3.165         |
| 1000 | 2.9463        |
| 1500 | 2.8331        |
| 2000 | 2.7503        |
| 2500 | 2.6966        |
| 3000 | 2.653         |
| 3500 | 2.6249        |
| 4000 | 2.5953        |
| 4500 | 2.5722        |


TrainOutput(global_step=26710, training_loss=2.4524274227994964, metrics={'train_runtime': 8005.3958, 'train_samples_per_second': 106.768, 'train_steps_per_second': 3.336, 'total_flos': 2.23331264299008e+17, 'train_loss': 2.4524274227994964, 'epoch': 1.0})
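
As a rough cross-check of how much data the model saw, the reported step count can be converted into a token count. This sketch assumes a single GPU (so the per-device batch size is the global batch size), no gradient accumulation, and that every chunk is full, which makes the result an upper bound.

```python
global_step = 26710          # from the TrainOutput above
per_device_batch_size = 32   # per_device_train_batch_size
context_length = 512         # chunk length used during tokenization

max_sequences = global_step * per_device_batch_size
max_tokens = max_sequences * context_length

print(max_sequences)  # 854720
print(max_tokens)     # 437616640
```

This counts tokens as produced by the OPT tokenizer, so it need not match a token count for the same text obtained with a different tokenizer.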

In [None]:
# Save the last state of the model
trainer.save_model("./final_model")

# Perplexity

Perplexity is a commonly used evaluation metric for language models. It measures how well a language model can predict the next word in a sequence, based on the probability distribution of the model. A lower perplexity score indicates better performance, as it means the model is more confident and accurate in its predictions.

In [2]:
import math

# Perplexity is exp(loss); 3.2153 is the training loss logged at step 1
print(f"Perplexity at the first step: {math.exp(3.2153):.3f}")
print(f"Perplexity after the Portuguese finetuning: {math.exp(2.316900):.3f}")

Perplexity at the first step: 24.911
Perplexity after the Portuguese finetuning: 10.144
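
The conversion above relies on the training loss being the mean per-token negative log-likelihood (in nats), so perplexity is its exponential. A minimal sketch with hypothetical per-token probabilities, not tied to the model in this notebook:

```python
import math


def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to the true tokens."""
    # Mean negative log-likelihood per token (the cross-entropy loss).
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# A model that always gives the correct token probability 1/4 is as
# uncertain as a uniform choice among 4 options.
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])
print(round(uniform_ppl, 6))  # 4.0
```

Read this way, a perplexity of about 10 means the model is, on average, as uncertain as if it were choosing uniformly among roughly 10 tokens at each position.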


# HuggingFace

In this section we upload the trained model to my HuggingFace profile. The trained model is publicly available at:

**thiagolaitz/opt-125m-pt-finetuned**

In [None]:
!pip install huggingface_hub -q

In [None]:
from huggingface_hub import notebook_login

# Login to my account using the access token
notebook_login()

Token is valid.
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
from huggingface_hub import HfApi

# Uses the HF's API to upload the model to the repo
api = HfApi()

api.upload_folder(
    folder_path="final_model",
    repo_id="thiagolaitz/opt-125m-pt-finetuned",
    repo_type="model",
)

'https://huggingface.co/thiagolaitz/opt-125m-pt-finetuned/tree/main/'

# Comparing the models

Finally, the snippets below generate text with the original and the fine-tuned model from the same Portuguese prompt, so the two outputs can be compared.

In [None]:
from transformers import pipeline

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

# Original
generator = pipeline('text-generation', model=model, tokenizer=tokenizer, max_length=30)
generator("Eles brincaram o dia inteiro sob o sol quente, mas")

[{'generated_text': 'Eles brincaram o dia inteiro sob o sol quente, mas não sei se não sei se não'}]

In [None]:
model = AutoModelForCausalLM.from_pretrained("thiagolaitz/opt-125m-pt-finetuned")

# Portuguese finetuned
generator = pipeline('text-generation', model=model, tokenizer=tokenizer, max_length=30)
generator("Eles brincaram o dia inteiro sob o sol quente, mas")

[{'generated_text': 'Eles brincaram o dia inteiro sob o sol quente, mas não se deixaram levar pelo sol.'}]