In [None]:
!pip install --upgrade pip
!pip install sentencepiece
!pip install datasets
!pip install transformers
!pip install accelerate -U

# Fine-tuning XLM-T

This notebook describes a simple case of finetuning. You can finetune either the `XLM-T` language model, or XLM-T sentiment, which has already been fine-tuned on sentiment analysis data, in 8 languages (this could be useful to do sentiment transfer learning on new languages).,

This notebook was modified from https://huggingface.co/transformers/custom_datasets.html

For a short documentation how it works https://github.com/wilberquito/xlm-t-ft

In [4]:
from transformers import AutoTokenizer, AutoConfig
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
import torch

import numpy as np
from sklearn.metrics import classification_report

## Parameters

In [12]:
LR = 2e-5
EPOCHS = 1
BATCH_SIZE = 32
# MODEL = "cardiffnlp/twitter-xlm-roberta-base" # use this to finetune the language model
# MODEL = "results/best_model" # use this to finetune the language model
MODEL = "cardiffnlp/twitter-xlm-roberta-base-sentiment"  # use this to finetune the language model
MAX_LENGTH = 514


# MODEL = "cardiffnlp/twitter-xlm-roberta-base-sentiment" # use this to finetune the sentiment classifier
MAX_TRAINING_EXAMPLES = -1  # set this to -1 if you want to use the whole training set

## Data

We download the xml-t sentiment dataset (`UMSAB`) but you can use your own.
If you use the same files structures as [TweetEval](https://github.com/cardiffnlp/tweeteval) (`train_text.txt`, `train_labels.txt`, `val_text.txt`, `...`), you do not need to change anything in the code.

---



In [6]:
!rm -rf *.txt

In [None]:
# loading dataset for UMSAB's all 8 languages

ORIGIN_DATA_URL = "https://raw.githubusercontent.com/cardiffnlp/xlm-t/main/data/sentiment/all/{}"

files = """test_labels.txt
test_text.txt
train_labels.txt
train_text.txt
val_labels.txt
val_text.txt""".split('\n')

for f in files:
  p = ORIGIN_DATA_URL.format(f)
  !wget $p

In [8]:
dataset_dict = {}
for i in ["train", "val", "test"]:
    dataset_dict[i] = {}
    for j in ["text", "labels"]:
        dataset_dict[i][j] = open(f"{i}_{j}.txt").read().split("\n")
        if j == "labels":
            dataset_dict[i][j] = [int(x) for x in dataset_dict[i][j]]

if MAX_TRAINING_EXAMPLES > 0:
    dataset_dict["train"]["text"] = dataset_dict["train"]["text"][
        :MAX_TRAINING_EXAMPLES
    ]
    dataset_dict["train"]["labels"] = dataset_dict["train"]["labels"][
        :MAX_TRAINING_EXAMPLES
    ]

In [9]:
tokenizer = AutoTokenizer.from_pretrained(MODEL, use_fast=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/841 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/150 [00:00<?, ?B/s]

In [10]:
config = AutoConfig.from_pretrained(MODEL)
config

XLMRobertaConfig {
  "_name_or_path": "cardiffnlp/twitter-xlm-roberta-base-sentiment",
  "architectures": [
    "XLMRobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "negative",
    "1": "neutral",
    "2": "positive"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "negative": 0,
    "neutral": 1,
    "positive": 2
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "xlm-roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.35.2",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 250002
}

In [13]:
train_encodings = tokenizer(
    dataset_dict["train"]["text"], truncation=True, padding=True, max_length=MAX_LENGTH
)
val_encodings = tokenizer(
    dataset_dict["val"]["text"], truncation=True, padding=True, max_length=MAX_LENGTH
)
test_encodings = tokenizer(
    dataset_dict["test"]["text"], truncation=True, padding=True, max_length=MAX_LENGTH
)

In [14]:
class MyDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)


train_dataset = MyDataset(train_encodings, dataset_dict["train"]["labels"])
val_dataset = MyDataset(val_encodings, dataset_dict["val"]["labels"])
test_dataset = MyDataset(test_encodings, dataset_dict["test"]["labels"])

## Fine-tuning

The steps above prepared the datasets in the way that the trainer is expected. Now all we need to do is create a model
to fine-tune, define the `TrainingArguments`/`TFTrainingArguments` and
instantiate a `Trainer`/`TFTrainer`.

In [15]:
training_args = TrainingArguments(
    output_dir="./results",  # output directory
    num_train_epochs=EPOCHS,  # total number of training epochs
    per_device_train_batch_size=BATCH_SIZE,  # batch size per device during training
    per_device_eval_batch_size=BATCH_SIZE,  # batch size for evaluation
    warmup_steps=100,  # number of warmup steps for learning rate scheduler
    weight_decay=0.01,  # strength of weight decay
    logging_dir="./logs",  # directory for storing logs
    logging_steps=10,  # when to print log
    load_best_model_at_end=True,  # load or not best model at the end
    evaluation_strategy="steps",
)

num_labels = len(set(dataset_dict["train"]["labels"]))
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=num_labels)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


pytorch_model.bin:   0%|          | 0.00/1.11G [00:00<?, ?B/s]

In [16]:
trainer = Trainer(
    model=model,  # the instantiated 🤗 Transformers model to be trained
    args=training_args,  # training arguments, defined above
    train_dataset=train_dataset,  # training dataset
    eval_dataset=val_dataset,  # evaluation dataset
)

trainer.train()

Step,Training Loss,Validation Loss
10,0.5423,0.681126
20,0.6153,0.701536
30,0.5857,0.689826
40,0.5506,0.72389
50,0.5531,0.701805


TrainOutput(global_step=58, training_loss=0.5795381973529684, metrics={'train_runtime': 39.3649, 'train_samples_per_second': 46.717, 'train_steps_per_second': 1.473, 'total_flos': 100175294865996.0, 'train_loss': 0.5795381973529684, 'epoch': 1.0})

In [18]:
trainer.save_model("./results/best_model")  # save best model

## Evaluate on Test set

In [19]:
test_preds_raw, test_labels, _ = trainer.predict(test_dataset)
test_preds = np.argmax(test_preds_raw, axis=-1)
print(classification_report(test_labels, test_preds, digits=3))

              precision    recall  f1-score   support

           0      0.705     0.724     0.714       290
           1      0.565     0.586     0.575       290
           2      0.756     0.707     0.731       290

    accuracy                          0.672       870
   macro avg      0.675     0.672     0.673       870
weighted avg      0.675     0.672     0.673       870



<a id='ft_native'></a>