# Assignment 6

Encoder model families such as BERT, ALBERT, DistilBERT, ELECTRA, RoBERTa, and MPNet have become fundamental building blocks in natural language processing. These models share the Transformer encoder architecture and are pretrained on large unlabeled corpora to learn contextual representations of language, which can then be fine-tuned for downstream tasks.

Let us start with BERT, which popularized deeply bidirectional pretraining through masked language modeling: every token can attend to context on both its left and its right. BERT is released in several sizes, such as BERT-base and BERT-large, and captures rich contextual and syntactic relationships. ALBERT, in turn, reduces the parameter count through two techniques, factorized embedding parameterization and cross-layer parameter sharing, which lowers memory use with little loss in accuracy. Comparing BERT and ALBERT illustrates the trade-off between model size and efficiency.
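
To make ALBERT's embedding factorization concrete, the short calculation below compares the parameter count of a single V×H token-embedding matrix with a factorized pair of V×E and E×H matrices. The sizes used (V = 30,000, H = 768, E = 128) are illustrative assumptions, not values taken from this notebook, and `embedding_params` is a hypothetical helper.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Count token-embedding parameters, optionally with a factorized projection."""
    if embedding_size is None:
        # One big lookup table: vocab_size x hidden_size
        return vocab_size * hidden_size
    # Small lookup table (vocab_size x embedding_size) plus a projection to hidden_size
    return vocab_size * embedding_size + embedding_size * hidden_size


V, H, E = 30_000, 768, 128  # assumed sizes, for illustration only
full = embedding_params(V, H)
factorized = embedding_params(V, H, E)
print(f"full: {full:,}  factorized: {factorized:,}  ratio: {full / factorized:.1f}x")
```

The saving grows with the gap between H and E, which is why the factorization matters most for models with large hidden sizes.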

DistilBERT aims for a smaller model without giving up much representational capacity. It is trained by knowledge distillation from a larger BERT teacher, which yields a model with roughly 40% fewer parameters that is lighter and faster at inference. This makes it an attractive option when computational resources are limited.
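
The core idea of distillation can be sketched as a loss that pulls the student's output distribution toward the teacher's temperature-softened distribution. The pure-Python example below shows only this soft-target term under made-up logits and an assumed temperature of 2.0; the actual DistilBERT training objective combines it with further loss terms, and the function names here are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) computed on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher = [2.0, 0.5, -1.0]        # invented teacher logits
close_student = [1.8, 0.6, -0.9]  # student that roughly agrees with the teacher
far_student = [-1.0, 0.5, 2.0]    # student that ranks the classes in reverse

print("close student loss:", soft_target_loss(teacher, close_student))
print("far student loss:  ", soft_target_loss(teacher, far_student))
```

A student that agrees with the teacher gets a loss near zero, while one that disagrees is penalized more heavily.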

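ELECTRA replaces masked language modeling with a discriminative objective called replaced token detection: a small generator network substitutes some input tokens with plausible alternatives, and the main model (the discriminator) learns to predict, for every token, whether it was replaced. Because the training signal comes from all input positions rather than only the masked ones, this approach is sample-efficient and yields competitive, high-quality representations.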
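
The training targets for ELECTRA's discriminator are simple per-token binary labels. A minimal sketch, assuming word-level tokens and a hand-written "generator output" (in practice a small masked language model produces the replacements); `replaced_token_labels` is a hypothetical helper name.

```python
def replaced_token_labels(original_tokens, corrupted_tokens):
    """Return 1 where the token differs from the original, 0 where it is unchanged."""
    if len(original_tokens) != len(corrupted_tokens):
        raise ValueError("sequences must have the same length")
    return [int(o != c) for o, c in zip(original_tokens, corrupted_tokens)]


original = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate", "the", "meal"]  # hypothetical generator output

labels = replaced_token_labels(original, corrupted)
print(list(zip(corrupted, labels)))
```

Every position contributes a label, which is where the objective gets its sample efficiency compared with predicting only the masked positions.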

RoBERTa refines BERT's pretraining recipe: it removes the next sentence prediction (NSP) task, applies dynamic masking, and trains longer on more data with larger batches. These changes lead to more robust representations and better results across a range of downstream tasks.

Finally, MPNet (Masked and Permuted Pre-training) combines masked language modeling with the permuted language modeling introduced by XLNet. It models the dependencies among predicted tokens while still giving the model the position information of the full sentence, addressing limitations of both objectives.

In [2]:
from collections import defaultdict, Counter
import json

from matplotlib import pyplot as plt
import numpy as np
import torch

In [1]:
from transformers import AutoModelForSequenceClassification, DistilBertForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-cased', num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'classifier.bias', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [3]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased") # convenient! Defaults to Fast
print(tokenizer)



DistilBertTokenizerFast(name_or_path='distilbert-base-cased', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}


In [4]:
input_str = "I dub thee the unforgiven"

In [5]:
model_inputs = tokenizer(input_str, return_tensors="pt")


# Option 1
model_outputs = model(input_ids=model_inputs.input_ids, attention_mask=model_inputs.attention_mask)

# Option 2 - the keys of the dictionary the tokenizer returns are the same as the keyword arguments
#            the model expects

# f({k1: v1, k2: v2}) = f(k1=v1, k2=v2)

model_outputs = model(**model_inputs)



print(model_inputs)
print()
print(model_outputs)
print()
print(f"Distribution over labels: {torch.softmax(model_outputs.logits, dim=1)}")

{'input_ids': tensor([[  101,   146, 23700, 20021,  1103,  8362, 14467, 10805,  2109,  1179,
           102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

SequenceClassifierOutput(loss=None, logits=tensor([[ 0.1662, -0.1271]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

Distribution over labels: tensor([[0.5728, 0.4272]], grad_fn=<SoftmaxBackward0>)
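
The distribution above can be reproduced by hand from the two logits. A minimal pure-Python check, using the logit values printed in the output (rounded to four decimals):

```python
import math

logits = [0.1662, -0.1271]  # values copied from the output above

# softmax: exponentiate each logit, then normalize so the values sum to 1
exps = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]

print([round(p, 4) for p in probs])  # [0.5728, 0.4272]
```

Because the classification head is newly initialized (see the warning printed when the model was loaded), this near-even split carries no meaning until the model is fine-tuned.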


In [6]:
from transformers import AutoModel

model = AutoModel.from_pretrained("distilbert-base-cased", output_attentions=True, output_hidden_states=True)
model.eval()

model_inputs = tokenizer(input_str, return_tensors="pt")
with torch.no_grad():
    model_output = model(**model_inputs)


print("Hidden state size (per layer):  ", model_output.hidden_states[0].shape)
print("Attention head size (per layer):", model_output.attentions[0].shape)     # (batch, head, query_word_idx, key_word_idx)
                                                                               # in an attention plot, y-axis is query, x-axis is key
print(model_output)

Hidden state size (per layer):   torch.Size([1, 11, 768])
Attention head size (per layer): torch.Size([1, 12, 11, 11])
BaseModelOutput(last_hidden_state=tensor([[[ 0.3658, -0.1363, -0.0241,  ..., -0.0243,  0.2092, -0.0434],
         [ 0.3351, -0.4731,  0.4455,  ...,  0.3333,  0.0894,  0.1752],
         [ 0.0559, -0.2744,  0.2439,  ...,  0.3805, -0.1535,  0.0684],
         ...,
         [-0.1702,  0.1649, -0.1826,  ...,  0.7063, -0.4539, -0.0200],
         [ 0.1001, -0.6128,  0.0116,  ...,  0.1753,  0.5835,  0.0543],
         [ 0.6431, -0.7274, -0.2860,  ...,  0.2709,  0.1994, -0.3916]]]), hidden_states=(tensor([[[ 5.5207e-01,  1.7780e-01, -5.8549e-02,  ..., -1.5978e-02,
           2.0846e-01, -1.1543e-01],
         [-9.8524e-01,  2.7473e-01,  1.2517e-03,  ...,  1.3920e+00,
          -1.1383e+00,  5.2962e-01],
         [-7.2930e-01, -2.7470e-02, -6.9238e-01,  ..., -3.3046e-01,
          -2.7515e-01, -1.2525e-01],
         ...,
         [ 1.0097e+00,  7.3380e-01, -1.2838e+00,  ..., -2.18

In [111]:
import os
import random
from transformers import AutoTokenizer
from datasets import Dataset, DatasetDict

# Function to map an author name to an integer class label
def encode_author(author, authors_list):
    return authors_list.index(author)

# Function to split text into chunks of equal length
def split_into_chunks(text, chunk_size):
    return [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]

# Define authors and paths
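# sorted() keeps the author -> label mapping stable across runs; stray files are skipped
authors = sorted(d for d in os.listdir("book_datasets") if os.path.isdir(os.path.join("book_datasets", d)))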
data = {'text': [], 'label': []}

# Iterate over authors and text files
for author in authors:
    author_path = os.path.join("book_datasets", author)
    author_files = [f for f in os.listdir(author_path) if f.endswith(".txt")]

    for file in author_files:
        file_path = os.path.join(author_path, file)
        
        with open(file_path, "r", encoding="utf-8") as f:
            text = f.read()
            
            # Split text into chunks of 512 characters (not tokens)
            chunks = split_into_chunks(text, 512)
            
            # Assign the author's integer label to each chunk
            for chunk in chunks:
                label = encode_author(author, authors)
                data['text'].append(chunk)
                data['label'].append(label)

# Shuffle the dataset
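# Shuffling the two lists independently would break the text/label pairing,
# so permute one list of indices and apply it to both
indices = list(range(len(data['text'])))
random.shuffle(indices)
data['text'] = [data['text'][i] for i in indices]
data['label'] = [data['label'][i] for i in indices]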

# Split the dataset into training and validation sets
train_size = int(0.8 * len(data['text']))
train_data = {'text': data['text'][:train_size], 'label': data['label'][:train_size]}
val_data = {'text': data['text'][train_size:], 'label': data['label'][train_size:]}

# Create datasets
book_train_dataset = Dataset.from_dict(train_data)
book_val_dataset = Dataset.from_dict(val_data)

# Create a DatasetDict
book_dataset = DatasetDict({'train': book_train_dataset, 'val': book_val_dataset})
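
Texts and labels must be permuted identically, otherwise each chunk ends up with an arbitrary author label. One way to package the shuffle and the 80/20 split as a reusable, seeded helper is sketched below using only the standard library. `shuffle_and_split` and its parameters are hypothetical names, and the toy data stands in for the book chunks.

```python
import random


def shuffle_and_split(texts, labels, train_fraction=0.8, seed=0):
    """Shuffle (text, label) pairs together, then split into train/validation dicts."""
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)  # local RNG, does not touch global state
    cut = int(train_fraction * len(indices))

    def take(idx):
        return {"text": [texts[i] for i in idx], "label": [labels[i] for i in idx]}

    return take(indices[:cut]), take(indices[cut:])


# Toy data: the label is encoded at the end of each text so alignment can be checked
texts = [f"chunk {i} by author {i % 3}" for i in range(10)]
labels = [i % 3 for i in range(10)]

train, val = shuffle_and_split(texts, labels)
print(len(train["text"]), len(val["text"]))  # 8 2
print(all(t.endswith(str(y)) for t, y in zip(train["text"], train["label"])))  # True
```

The two returned dictionaries have the same shape as `train_data` and `val_data`, so they could be passed to `Dataset.from_dict` in the same way.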

In [112]:
book_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 5135
    })
    val: Dataset({
        features: ['text', 'label'],
        num_rows: 1284
    })
})

In [113]:
# Tokenize dataset
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
tokenized_dataset = book_dataset.map(lambda example: tokenizer(example['text'], truncation=True), batched=True)

tokenized_dataset = tokenized_dataset.remove_columns(["text"])
tokenized_dataset = tokenized_dataset.rename_column("label", "labels")
tokenized_dataset.set_format("torch")




In [114]:
tokenized_dataset["train"][0:2]

{'labels': tensor([0, 0]),
 'input_ids': [tensor([  101,  1142,  1721,  1107,  1240,  1493,   117,  1240,  2252,   787,
            188,  1567,  1110,  1107,  1240,  1493,   119,  1790,   787,   189,
          12477,  1197,  1241,  1111,  1140,   119,   164,   168, 13832,  2083,
            168,   156, 18172,   155,  2346, 27211, 10460, 24890, 17656, 12880,
           2069,  2249,   119,   166,   156, 18172,   155,  2346, 27211, 10460,
          24890, 17656, 12880,  2069,  2249,   119, 20286,   117,  1303,  1110,
           1103,  5039,  1104,  1139,  2998,   119, 17604,   146,  2373,  1122,
           1106,  1128,   136, 10722,  2137,  3663, 24890, 17656, 12880,  2069,
           2249,   119,  2421,  1143,  1267,  1122,   119,   164,   156, 18172,
            155,  2346, 27211, 10460,   168,  1493,  1123,  1103,  2998,   168,
            119,   168,  1153,  9568,  1122,   168,   117,   168,  1105,  1173,
            168,   117,   168,  1114,   170,  8982,  1104,  7615,   168,   117,


In [123]:
# Check whether a GPU is available (Trainer selects the device on its own)
device = "cuda" if torch.cuda.is_available() else "cpu"

In [125]:
torch.cuda.is_available()

False


In [127]:
from transformers import TrainingArguments, Trainer

# One output class per author in the dataset
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-cased', num_labels=len(authors))

arguments = TrainingArguments(
    output_dir="sample_hf_trainer",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    evaluation_strategy="epoch", # run validation at the end of each epoch
    save_strategy="epoch",
    learning_rate=2e-5,
    load_best_model_at_end=True,
    seed=224
)


def compute_metrics(eval_pred):
    """Called at the end of validation. Gives accuracy"""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # calculates the accuracy
    return {"accuracy": np.mean(predictions == labels)}


trainer = Trainer(
    model=model,
    args=arguments,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['val'],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'classifier.bias', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
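
What `compute_metrics` reports can be checked on a tiny example: take the highest-scoring class in each row of logits and compare it with the label. The sketch below does this in pure Python on invented logits; `accuracy_from_logits` is a hypothetical name and mirrors the `argmax` plus `mean` logic rather than replacing the NumPy version.

```python
def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Four invented examples with three classes each
logits = [
    [2.1, 0.3, -1.0],  # predicted class 0
    [0.2, 0.1, 1.5],   # predicted class 2
    [0.0, 0.9, 0.4],   # predicted class 1
    [1.2, 1.1, -0.3],  # predicted class 0
]
labels = [0, 2, 2, 0]

print(accuracy_from_logits(logits, labels))  # 0.75
```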


In [128]:
from transformers import TrainerCallback, EarlyStoppingCallback

class LoggingCallback(TrainerCallback):
    def __init__(self, log_path):
        self.log_path = log_path

    def on_log(self, args, state, control, logs=None, **kwargs):
        _ = logs.pop("total_flos", None)
        if state.is_local_process_zero:
            with open(self.log_path, "a") as f:
                f.write(json.dumps(logs) + "\n")


trainer.add_callback(EarlyStoppingCallback(early_stopping_patience=1, early_stopping_threshold=0.0))
trainer.add_callback(LoggingCallback("sample_hf_trainer/log.jsonl"))
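
With `early_stopping_patience=1`, training stops as soon as one evaluation fails to improve on the best metric seen so far. The loop below is a simplified sketch of that rule for a lower-is-better metric such as validation loss. It is not the library's implementation, `early_stop_index` is a hypothetical name, and the loss values are invented.

```python
def early_stop_index(eval_losses, patience=1, threshold=0.0):
    """Return the index of the evaluation that triggers the stop, or None."""
    best = None
    bad_evals = 0
    for i, loss in enumerate(eval_losses):
        improved = best is None or (loss < best and abs(loss - best) > threshold)
        if improved:
            best = loss
            bad_evals = 0   # any improvement resets the counter
        else:
            bad_evals += 1  # one more evaluation without improvement
        if bad_evals >= patience:
            return i
    return None


print(early_stop_index([0.9, 0.7, 0.75, 0.6]))  # 2: the third evaluation got worse
print(early_stop_index([0.9, 0.8, 0.7]))        # None: every evaluation improved
```

A patience of 1 is aggressive: a single noisy evaluation ends the run, so the later, better loss of 0.6 in the first example is never reached.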

In [129]:
# Fine-tune for up to 3 epochs (early stopping may end training sooner)
trainer.train()

  0%|          | 0/963 [52:10<?, ?it/s]
  1%|          | 10/963 [49:57<79:21:11, 299.76s/it]
  0%|          | 1/963 [00:07<2:02:38,  7.65s/it]
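
The total of 963 steps in the progress bar follows from the dataset size and the batch size. A quick check, assuming a single device and no gradient accumulation (the `TrainingArguments` above set neither):

```python
import math

train_rows = 5135  # num_rows of the train split, from the DatasetDict output
batch_size = 16    # per_device_train_batch_size
epochs = 3         # num_train_epochs

steps_per_epoch = math.ceil(train_rows / batch_size)  # the last batch is partial
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 321 963
```

At the roughly 300 seconds per step shown in the progress bar, 963 steps amount to about 80 hours on this CPU, which is consistent with the remaining-time estimate printed there.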

In [69]:
len(book_dataset)

11

In [91]:
train_dataset.keys()

dict_keys(['input_ids', 'attention_mask', 'label'])

In [50]:
train_dataset

{'input_ids': tensor([[  101,  1109,  4042,  ...,  1114,  1103,   102],
         [  101,  1109,  4042,  ...,   168,  5706,   102],
         [  101,  1109,  4042,  ..., 25370, 16972,   102],
         ...,
         [  101,  1109,  4042,  ...,   117,   790,   102],
         [  101,  1109,  4042,  ...,  1760, 15380,   102],
         [  101,  1109,  4042,  ...,  7462,  1787,   102]]),
 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         ...,
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1]]),
 'labels': tensor([[1., 0., 0.],
         [1., 0., 0.],
         [1., 0., 0.],
         [0., 1., 0.],
         [0., 1., 0.],
         [0., 1., 0.],
         [0., 1., 0.],
         [0., 0., 1.]], dtype=torch.float64)}

In [40]:
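# transformers.AdamW is deprecated; use the PyTorch implementation instead
from torch.optim import AdamW
from transformers import DistilBertForSequenceClassification, EarlyStoppingCallback, Trainer, TrainingArguments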

# Training arguments
training_args = TrainingArguments(
    output_dir="fine-tuned_model",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    load_best_model_at_end=True,
    seed=224
)

# Custom compute_metrics function
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)

    accuracy = np.mean(preds == labels)
    return {"accuracy": accuracy}

# Model, tokenizer, and trainer
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-cased', num_labels=3)
model.train()  # Optional: Trainer switches the model to training mode itself

optimizer = AdamW(model.parameters(), lr=training_args.learning_rate)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    optimizers=(optimizer, None),  # use the optimizer above; None lets Trainer build the LR scheduler
)

# Add early stopping callback
trainer.add_callback(EarlyStoppingCallback(early_stopping_patience=1, early_stopping_threshold=0.0))

# Train the model
trainer.train()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'classifier.bias', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  0%|          | 0/3 [09:11<?, ?it/s]
  0%|          | 0/3 [06:41<?, ?it/s]
  0%|          | 0/3 [00:00<?, ?it/s]

KeyError: 2
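
A plausible reading of the `KeyError: 2` (an interpretation, since the traceback is not shown): `train_dataset` in this cell is a plain dictionary of column tensors, as printed in the earlier cell, while `Trainer` fetches examples by integer index. Indexing a dict with `2` looks up a key named `2`, which does not exist. The sketch below reproduces the failure with plain lists and shows a hypothetical `ColumnDataset` wrapper that returns one row per index; `Dataset.from_dict`, as used in the book-dataset cell, serves the same purpose.

```python
class ColumnDataset:
    """Map-style dataset over a dict of equal-length columns (hypothetical helper)."""

    def __init__(self, columns):
        lengths = {len(values) for values in columns.values()}
        if len(lengths) != 1:
            raise ValueError("all columns must have the same length")
        self.columns = columns
        self.length = lengths.pop()

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        # Return one example: the index-th entry of every column
        return {name: values[index] for name, values in self.columns.items()}


# Toy columns standing in for the tokenized tensors
columns = {
    "input_ids": [[101, 7, 102], [101, 8, 102], [101, 9, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
    "labels": [0, 1, 2],
}

try:
    columns[2]  # a dict has no integer positions, only keys
except KeyError as err:
    print("plain dict:", repr(err))

wrapped = ColumnDataset(columns)
print("wrapped:", wrapped[2])
```

Note also that the labels printed for that `train_dataset` are one-hot `float64` vectors, whereas single-label classification expects integer class indices such as those produced by `encode_author`.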