# Tarea 6

Las familias de modelos codificadores, tales como BERT, ALBERT, DistilBERT, ELECTRA, RoBERTA, y MPnet, han emergido como pilares fundamentales en el campo de procesamiento de lenguaje natural. Estos modelos comparten la arquitectura de transformer y han sido preentrenados para aprender representaciones semánticas profundas del lenguaje.

Comencemos analizando BERT, un pionero en la atención bidireccional. BERT, en sus múltiples variantes, como BERT-base y BERT-large, destaca por su capacidad para capturar contextos ricos y relaciones sintácticas. Por otro lado, ALBERT introduce la innovación de la factorización de la matriz de atención, optimizando así la eficiencia computacional sin comprometer el rendimiento. La comparación entre BERT y ALBERT revela el equilibrio entre complejidad y eficiencia.

DistilBERT, por su parte, busca la simplicidad sin sacrificar la capacidad representativa. Al destilar el conocimiento de un modelo más grande como BERT, logra una reducción significativa en los parámetros, haciéndolo más liviano y ágil. Esta característica lo posiciona como una opción atractiva en tareas con recursos computacionales limitados.

ELECTRA, con su aproximación novedosa, implementa un enfoque de sustitución de palabras para el entrenamiento discriminatorio. Al destacar la eficacia de este método, ELECTRA demuestra ser competitivo en la obtención de representaciones de alta calidad.

RoBERTa, una mejora de BERT, ajusta la metodología de preentrenamiento al eliminar la tarea de predicción de la orientación de la siguiente palabra. Esto conduce a representaciones más coherentes y robustas, destacando su eficacia en diversos contextos lingüísticos.

Finalmente, MPnet se destaca por su enfoque en la modelación de patrones a nivel de múltiples posiciones. Su capacidad para capturar información contextual a lo largo de múltiples niveles jerárquicos proporciona una perspectiva única en la representación del lenguaje.

In [1]:
from collections import defaultdict, Counter
import json

from matplotlib import pyplot as plt
import numpy as np
import torch

In [2]:
import os
import random
import re
from transformers import AutoTokenizer
from datasets import Dataset, DatasetDict
import nltk

# Download the NLTK Punkt tokenizer if not already downloaded
nltk.download('punkt')

# Function to create integer label for authors
def encode_author(author, authors_list):
    return authors_list.index(author)

# Function to clean and preprocess text
def clean_text(text):
    # Remove unwanted characters, including newline
    text = text.replace("\n", " ")
    return text

# Function to tokenize text into sentences
def tokenize_into_sentences(text):
    sentences = nltk.sent_tokenize(text)
    return sentences

# Define authors and paths
authors = os.listdir("book_datasets")
data = {'text': [], 'label': []}

# Iterate over authors and text files
for author in authors:
    author_path = os.path.join("book_datasets", author)
    author_files = [f for f in os.listdir(author_path) if f.endswith(".txt")]

    for file in author_files:
        file_path = os.path.join(author_path, file)
        
        with open(file_path, "r", encoding="utf-8") as f:
            text = f.read()
            
            # Clean and preprocess the text
            text = clean_text(text)
            
            # Tokenize text into sentences
            sentences = tokenize_into_sentences(text)
            
            # Create integer label for each sentence
            for sentence in sentences:
                label = encode_author(author, authors)
                data['text'].append(sentence)
                data['label'].append(label)

# Shuffle the dataset
random.shuffle(data['text'])
random.shuffle(data['label'])

# Split the dataset into training and validation sets
train_size = int(0.8 * len(data['text']))
train_data = {'text': data['text'][:train_size], 'label': data['label'][:train_size]}
val_data = {'text': data['text'][train_size:], 'label': data['label'][train_size:]}

# Create datasets
book_train_dataset = Dataset.from_dict(train_data)
book_val_dataset = Dataset.from_dict(val_data)

# Create a DatasetDict
book_dataset = DatasetDict({'train': book_train_dataset, 'val': book_val_dataset})

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\santi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [3]:
book_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 42889
    })
    val: Dataset({
        features: ['text', 'label'],
        num_rows: 10723
    })
})

In [4]:
book_dataset["train"]["text"][5050]

'Fiery?'

In [5]:
book_dataset["train"]["label"][5050]

0

In [6]:
np.unique(book_dataset["train"]["label"])

array([0, 1, 2])

In [7]:
# with 'train' and 'val' splits
train_labels = book_dataset['train']['label']
val_labels = book_dataset['val']['label']

# Compute label counts for the 'train' split
train_label_counts = dict(zip(*np.unique(train_labels, return_counts=True)))

# Compute label counts for the 'val' split
val_label_counts = dict(zip(*np.unique(val_labels, return_counts=True)))

print("Train Label Counts:")
print(train_label_counts)

print("\nVal Label Counts:")
print(val_label_counts)


Train Label Counts:
{0: 15325, 1: 12780, 2: 14784}

Val Label Counts:
{0: 3730, 1: 3258, 2: 3735}


In [8]:
# Tokenize dataset
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
tokenized_dataset = book_dataset.map(lambda example: tokenizer(example['text'], truncation=True), batched=True)

tokenized_dataset = tokenized_dataset.remove_columns(["text"])
tokenized_dataset = tokenized_dataset.rename_column("label", "labels")
tokenized_dataset.set_format("torch")

Map: 100%|██████████| 42889/42889 [00:01<00:00, 30214.69 examples/s]
Map: 100%|██████████| 10723/10723 [00:00<00:00, 34547.21 examples/s]


In [9]:
tokenized_dataset["train"][0:2]

{'labels': tensor([2, 2]),
 'input_ids': [tensor([  101,   143,  2346, 14697,  2249, 14697,  9919,   196,   121,   198,
            168, 20777,  4854,  1906, 23231,   168,  1108,  1308,  1112,   170,
           2767,   174,  1942, 11708,  1204,  1118,  4042,   144,  6140,  8904,
            117,  1105,  1110,  1136,  1529,  1107,  1142,   174,  1942, 11708,
           1204,   119,   102]),
  tensor([  101,   789,  1573,   117,  1128,   787,  1231,  1280,  1113,  1106,
           5738,  1944,  8914,  6944,   117,  1132,  1128,   136,   790, 28110,
          14159,  9029,   119,   102])],
 'attention_mask': [tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
  tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])]}

In [10]:
# Specify GPU device for Trainer
device = "cuda" if torch.cuda.is_available() else "cpu"

In [11]:
device

'cuda'

In [12]:
from transformers import TrainerCallback, TrainingArguments, Trainer, DistilBertForSequenceClassification, EarlyStoppingCallback
from transformers.integrations import TensorBoardCallback

# Load pre-trained model
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-cased', num_labels=3)
model.to(device)

# Training arguments
arguments = TrainingArguments(
    output_dir="sample_hf_trainer",
    per_device_train_batch_size=12,
    per_device_eval_batch_size=12,
    num_train_epochs=20,
    evaluation_strategy="epoch",  # run validation at the end of each epoch
    save_strategy="epoch",
    learning_rate=2e-5,
    load_best_model_at_end=True,
    seed=224,
    logging_dir="sample_hf_trainer/logs",  # Directory for Tensorboard logs
    logging_steps=100,  # Log every 100 steps
)

def compute_metrics(eval_pred):
    """Called at the end of validation. Gives accuracy"""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # calculates the accuracy
    return {"accuracy": np.mean(predictions == labels)}

# Create trainer with loaded pre-trained model
trainer = Trainer(
    model=model,
    args=arguments,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['val'],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['pre_classifier.bias', 'classifier.bias', 'pre_classifier.weight', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [13]:
class LoggingCallback(TrainerCallback):
    def __init__(self, log_path):
        self.log_path = log_path
        self.train_loss = []
        self.val_loss = []

    def on_log(self, args, state, control, logs=None, **kwargs):
        _ = logs.pop("total_flos", None)
        if state.is_local_process_zero:
            with open(self.log_path, "a") as f:
                f.write(json.dumps(logs) + "\n")
            
            # Append training and validation loss to lists
            self.train_loss.append(logs.get("loss", np.inf))
            self.val_loss.append(logs.get("eval_loss", np.inf))

In [14]:
trainer = Trainer(
    model=model,
    args=arguments,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['val'], # change to test when you do your final evaluation!
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

# Add the callbacks to the trainer
trainer.add_callback(EarlyStoppingCallback(early_stopping_patience=5, early_stopping_threshold=0.0))
trainer.add_callback(LoggingCallback("sample_hf_trainer/log.jsonl"))
trainer.add_callback(TensorBoardCallback())

You are adding a <class 'transformers.integrations.integration_utils.TensorBoardCallback'> to the callbacks of this Trainer, but there is already one. The currentlist of callbacks is
:DefaultFlowCallback
TensorBoardCallback
ProgressCallback
EarlyStoppingCallback
LoggingCallback


In [15]:
trainer.train()

  0%|          | 0/71500 [00:00<?, ?it/s]You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
  0%|          | 100/71500 [00:12<1:48:28, 10.97it/s]

{'loss': 1.1008, 'learning_rate': 1.9972027972027975e-05, 'epoch': 0.03}


  0%|          | 201/71500 [00:22<2:07:03,  9.35it/s]

{'loss': 1.1025, 'learning_rate': 1.9944055944055948e-05, 'epoch': 0.06}


  0%|          | 300/71500 [00:33<1:45:55, 11.20it/s]

{'loss': 1.1016, 'learning_rate': 1.9916083916083918e-05, 'epoch': 0.08}


  1%|          | 401/71500 [00:43<2:14:07,  8.84it/s]

{'loss': 1.1034, 'learning_rate': 1.988811188811189e-05, 'epoch': 0.11}


  1%|          | 502/71500 [00:54<1:51:25, 10.62it/s]

{'loss': 1.0994, 'learning_rate': 1.986013986013986e-05, 'epoch': 0.14}


  1%|          | 601/71500 [01:06<2:20:57,  8.38it/s]

{'loss': 1.1015, 'learning_rate': 1.9832167832167833e-05, 'epoch': 0.17}


  1%|          | 700/71500 [01:18<1:57:37, 10.03it/s]

{'loss': 1.0983, 'learning_rate': 1.9804195804195807e-05, 'epoch': 0.2}


  1%|          | 800/71500 [01:28<1:57:38, 10.02it/s]

{'loss': 1.1008, 'learning_rate': 1.9776223776223776e-05, 'epoch': 0.22}


  1%|▏         | 901/71500 [01:39<1:55:56, 10.15it/s]

{'loss': 1.1014, 'learning_rate': 1.974825174825175e-05, 'epoch': 0.25}


  1%|▏         | 1001/71500 [01:51<2:49:56,  6.91it/s]

{'loss': 1.1003, 'learning_rate': 1.9720279720279722e-05, 'epoch': 0.28}


  2%|▏         | 1101/71500 [02:03<2:30:38,  7.79it/s]

{'loss': 1.1028, 'learning_rate': 1.9692307692307696e-05, 'epoch': 0.31}


  2%|▏         | 1201/71500 [02:14<2:20:47,  8.32it/s]

{'loss': 1.0993, 'learning_rate': 1.9664335664335665e-05, 'epoch': 0.34}


  2%|▏         | 1302/71500 [02:26<1:58:10,  9.90it/s]

{'loss': 1.0999, 'learning_rate': 1.963636363636364e-05, 'epoch': 0.36}


  2%|▏         | 1403/71500 [02:36<1:45:37, 11.06it/s]

{'loss': 1.101, 'learning_rate': 1.960839160839161e-05, 'epoch': 0.39}


  2%|▏         | 1501/71500 [02:48<2:29:04,  7.83it/s]

{'loss': 1.0992, 'learning_rate': 1.958041958041958e-05, 'epoch': 0.42}


  2%|▏         | 1600/71500 [02:59<2:43:12,  7.14it/s]

{'loss': 1.101, 'learning_rate': 1.9552447552447554e-05, 'epoch': 0.45}


  2%|▏         | 1701/71500 [03:11<2:41:39,  7.20it/s]

{'loss': 1.0967, 'learning_rate': 1.9524475524475527e-05, 'epoch': 0.48}


  3%|▎         | 1802/71500 [03:23<1:58:20,  9.82it/s]

{'loss': 1.1004, 'learning_rate': 1.9496503496503497e-05, 'epoch': 0.5}


  3%|▎         | 1901/71500 [03:35<1:58:12,  9.81it/s]

{'loss': 1.0995, 'learning_rate': 1.946853146853147e-05, 'epoch': 0.53}


  3%|▎         | 2002/71500 [03:46<2:07:58,  9.05it/s]

{'loss': 1.0947, 'learning_rate': 1.944055944055944e-05, 'epoch': 0.56}


  3%|▎         | 2102/71500 [03:59<2:12:51,  8.71it/s]

{'loss': 1.0939, 'learning_rate': 1.9412587412587413e-05, 'epoch': 0.59}


  3%|▎         | 2201/71500 [04:11<2:49:50,  6.80it/s]

{'loss': 1.0979, 'learning_rate': 1.9384615384615386e-05, 'epoch': 0.62}


  3%|▎         | 2300/71500 [04:23<2:19:31,  8.27it/s]

{'loss': 1.0956, 'learning_rate': 1.935664335664336e-05, 'epoch': 0.64}


  3%|▎         | 2401/71500 [04:36<2:28:20,  7.76it/s]

{'loss': 1.0995, 'learning_rate': 1.9328671328671332e-05, 'epoch': 0.67}


  3%|▎         | 2501/71500 [04:47<1:58:05,  9.74it/s]

{'loss': 1.0979, 'learning_rate': 1.9300699300699302e-05, 'epoch': 0.7}


  4%|▎         | 2601/71500 [04:59<1:56:17,  9.87it/s]

{'loss': 1.0923, 'learning_rate': 1.9272727272727275e-05, 'epoch': 0.73}


  4%|▍         | 2701/71500 [05:11<2:12:36,  8.65it/s]

{'loss': 1.0953, 'learning_rate': 1.9244755244755248e-05, 'epoch': 0.76}


  4%|▍         | 2801/71500 [05:22<2:03:56,  9.24it/s]

{'loss': 1.0985, 'learning_rate': 1.9216783216783218e-05, 'epoch': 0.78}


  4%|▍         | 2901/71500 [05:34<2:33:01,  7.47it/s]

{'loss': 1.1007, 'learning_rate': 1.918881118881119e-05, 'epoch': 0.81}


  4%|▍         | 3002/71500 [05:46<2:10:00,  8.78it/s]

{'loss': 1.0949, 'learning_rate': 1.916083916083916e-05, 'epoch': 0.84}


  4%|▍         | 3101/71500 [05:58<1:44:57, 10.86it/s]

{'loss': 1.0881, 'learning_rate': 1.9132867132867134e-05, 'epoch': 0.87}


  4%|▍         | 3201/71500 [06:11<2:39:26,  7.14it/s]

{'loss': 1.1027, 'learning_rate': 1.9104895104895107e-05, 'epoch': 0.9}


  5%|▍         | 3301/71500 [06:23<2:18:37,  8.20it/s]

{'loss': 1.0941, 'learning_rate': 1.907692307692308e-05, 'epoch': 0.92}


  5%|▍         | 3402/71500 [06:35<2:12:52,  8.54it/s]

{'loss': 1.1004, 'learning_rate': 1.904895104895105e-05, 'epoch': 0.95}


  5%|▍         | 3500/71500 [06:47<2:27:40,  7.67it/s]

{'loss': 1.0963, 'learning_rate': 1.9020979020979023e-05, 'epoch': 0.98}


                                                      
  5%|▌         | 3575/71500 [07:36<1:52:24, 10.07it/s]

{'eval_loss': 1.1002342700958252, 'eval_accuracy': 0.34803692996362956, 'eval_runtime': 40.6223, 'eval_samples_per_second': 263.968, 'eval_steps_per_second': 22.008, 'epoch': 1.0}


  5%|▌         | 3601/71500 [07:42<2:51:19,  6.61it/s]  

{'loss': 1.0975, 'learning_rate': 1.8993006993006996e-05, 'epoch': 1.01}


  5%|▌         | 3701/71500 [07:53<2:26:40,  7.70it/s]

{'loss': 1.0986, 'learning_rate': 1.896503496503497e-05, 'epoch': 1.03}


  5%|▌         | 3801/71500 [08:05<2:25:33,  7.75it/s]

{'loss': 1.0975, 'learning_rate': 1.893706293706294e-05, 'epoch': 1.06}


  5%|▌         | 3901/71500 [08:17<2:25:23,  7.75it/s]

{'loss': 1.0954, 'learning_rate': 1.8909090909090912e-05, 'epoch': 1.09}


  6%|▌         | 4001/71500 [08:29<2:15:44,  8.29it/s]

{'loss': 1.0926, 'learning_rate': 1.888111888111888e-05, 'epoch': 1.12}


  6%|▌         | 4101/71500 [08:40<2:06:58,  8.85it/s]

{'loss': 1.097, 'learning_rate': 1.8853146853146855e-05, 'epoch': 1.15}


  6%|▌         | 4201/71500 [08:53<2:37:50,  7.11it/s]

{'loss': 1.0977, 'learning_rate': 1.8825174825174824e-05, 'epoch': 1.17}


  6%|▌         | 4301/71500 [09:12<3:49:49,  4.87it/s] 

{'loss': 1.0978, 'learning_rate': 1.8797202797202798e-05, 'epoch': 1.2}


  6%|▌         | 4401/71500 [09:24<2:14:34,  8.31it/s]

{'loss': 1.0957, 'learning_rate': 1.876923076923077e-05, 'epoch': 1.23}


  6%|▋         | 4501/71500 [09:37<2:26:14,  7.64it/s]

{'loss': 1.0953, 'learning_rate': 1.8741258741258744e-05, 'epoch': 1.26}


  6%|▋         | 4602/71500 [09:48<2:04:58,  8.92it/s]

{'loss': 1.098, 'learning_rate': 1.8713286713286717e-05, 'epoch': 1.29}


  7%|▋         | 4701/71500 [10:00<2:35:21,  7.17it/s]

{'loss': 1.0953, 'learning_rate': 1.8685314685314687e-05, 'epoch': 1.31}


  7%|▋         | 4801/71500 [10:12<1:57:56,  9.43it/s]

{'loss': 1.0897, 'learning_rate': 1.865734265734266e-05, 'epoch': 1.34}


  7%|▋         | 4901/71500 [10:25<2:27:53,  7.51it/s]

{'loss': 1.0934, 'learning_rate': 1.8629370629370633e-05, 'epoch': 1.37}


  7%|▋         | 5002/71500 [10:38<1:58:41,  9.34it/s]

{'loss': 1.0953, 'learning_rate': 1.8601398601398602e-05, 'epoch': 1.4}


  7%|▋         | 5102/71500 [10:49<2:08:02,  8.64it/s]

{'loss': 1.1007, 'learning_rate': 1.8573426573426576e-05, 'epoch': 1.43}


  7%|▋         | 5201/71500 [11:00<2:10:08,  8.49it/s]

{'loss': 1.0974, 'learning_rate': 1.8545454545454545e-05, 'epoch': 1.45}


  7%|▋         | 5301/71500 [11:18<2:44:57,  6.69it/s] 

{'loss': 1.1002, 'learning_rate': 1.851748251748252e-05, 'epoch': 1.48}


  8%|▊         | 5401/71500 [11:29<2:04:21,  8.86it/s]

{'loss': 1.0973, 'learning_rate': 1.848951048951049e-05, 'epoch': 1.51}


  8%|▊         | 5501/71500 [11:41<2:07:34,  8.62it/s]

{'loss': 1.0954, 'learning_rate': 1.8461538461538465e-05, 'epoch': 1.54}


  8%|▊         | 5601/71500 [11:53<2:09:17,  8.50it/s]

{'loss': 1.0878, 'learning_rate': 1.8433566433566434e-05, 'epoch': 1.57}


  8%|▊         | 5701/71500 [12:04<1:51:17,  9.85it/s]

{'loss': 1.0981, 'learning_rate': 1.8405594405594407e-05, 'epoch': 1.59}


  8%|▊         | 5801/71500 [12:17<2:04:42,  8.78it/s]

{'loss': 1.0946, 'learning_rate': 1.837762237762238e-05, 'epoch': 1.62}


  8%|▊         | 5902/71500 [12:29<2:14:51,  8.11it/s]

{'loss': 1.1014, 'learning_rate': 1.8349650349650354e-05, 'epoch': 1.65}


  8%|▊         | 6001/71500 [12:41<1:51:43,  9.77it/s]

{'loss': 1.0991, 'learning_rate': 1.8321678321678323e-05, 'epoch': 1.68}


  9%|▊         | 6102/71500 [12:58<1:55:10,  9.46it/s] 

{'loss': 1.097, 'learning_rate': 1.8293706293706296e-05, 'epoch': 1.71}


  9%|▊         | 6202/71500 [13:09<2:00:49,  9.01it/s]

{'loss': 1.0993, 'learning_rate': 1.8265734265734266e-05, 'epoch': 1.73}


  9%|▉         | 6300/71500 [13:21<2:28:40,  7.31it/s]

{'loss': 1.0942, 'learning_rate': 1.823776223776224e-05, 'epoch': 1.76}


  9%|▉         | 6401/71500 [13:33<2:10:50,  8.29it/s]

{'loss': 1.0967, 'learning_rate': 1.820979020979021e-05, 'epoch': 1.79}


  9%|▉         | 6502/71500 [13:44<1:52:48,  9.60it/s]

{'loss': 1.0932, 'learning_rate': 1.8181818181818182e-05, 'epoch': 1.82}


  9%|▉         | 6600/71500 [13:56<1:46:05, 10.20it/s]

{'loss': 1.1004, 'learning_rate': 1.8153846153846155e-05, 'epoch': 1.85}


  9%|▉         | 6701/71500 [14:08<2:06:02,  8.57it/s]

{'loss': 1.0938, 'learning_rate': 1.8125874125874128e-05, 'epoch': 1.87}


 10%|▉         | 6801/71500 [14:26<2:12:52,  8.12it/s] 

{'loss': 1.0992, 'learning_rate': 1.80979020979021e-05, 'epoch': 1.9}


 10%|▉         | 6901/71500 [14:38<2:02:36,  8.78it/s]

{'loss': 1.0977, 'learning_rate': 1.806993006993007e-05, 'epoch': 1.93}


 10%|▉         | 7000/71500 [14:50<2:11:58,  8.15it/s]

{'loss': 1.0958, 'learning_rate': 1.8041958041958044e-05, 'epoch': 1.96}


 10%|▉         | 7100/71500 [15:02<2:39:28,  6.73it/s]

{'loss': 1.0978, 'learning_rate': 1.8013986013986017e-05, 'epoch': 1.99}


                                                      
 10%|█         | 7150/71500 [15:49<2:07:25,  8.42it/s]

{'eval_loss': 1.0973016023635864, 'eval_accuracy': 0.351860486804066, 'eval_runtime': 41.4835, 'eval_samples_per_second': 258.488, 'eval_steps_per_second': 21.551, 'epoch': 2.0}


 10%|█         | 7201/71500 [15:57<2:01:14,  8.84it/s]  

{'loss': 1.0968, 'learning_rate': 1.7986013986013987e-05, 'epoch': 2.01}


 10%|█         | 7301/71500 [16:09<2:02:10,  8.76it/s]

{'loss': 1.0793, 'learning_rate': 1.795804195804196e-05, 'epoch': 2.04}


 10%|█         | 7401/71500 [16:27<2:05:25,  8.52it/s] 

{'loss': 1.088, 'learning_rate': 1.793006993006993e-05, 'epoch': 2.07}


 10%|█         | 7501/71500 [16:39<2:12:47,  8.03it/s]

{'loss': 1.0892, 'learning_rate': 1.7902097902097903e-05, 'epoch': 2.1}


 11%|█         | 7601/71500 [16:51<2:16:19,  7.81it/s]

{'loss': 1.089, 'learning_rate': 1.7874125874125876e-05, 'epoch': 2.13}


 11%|█         | 7702/71500 [17:02<2:00:23,  8.83it/s]

{'loss': 1.0893, 'learning_rate': 1.784615384615385e-05, 'epoch': 2.15}


 11%|█         | 7802/71500 [17:14<1:53:54,  9.32it/s]

{'loss': 1.0701, 'learning_rate': 1.781818181818182e-05, 'epoch': 2.18}


 11%|█         | 7901/71500 [17:32<3:09:26,  5.60it/s] 

{'loss': 1.0781, 'learning_rate': 1.7790209790209792e-05, 'epoch': 2.21}


 11%|█         | 8001/71500 [17:44<2:03:02,  8.60it/s]

{'loss': 1.0704, 'learning_rate': 1.7762237762237765e-05, 'epoch': 2.24}


 11%|█▏        | 8102/71500 [17:56<1:58:12,  8.94it/s]

{'loss': 1.0914, 'learning_rate': 1.7734265734265738e-05, 'epoch': 2.27}


 11%|█▏        | 8202/71500 [18:07<1:46:39,  9.89it/s]

{'loss': 1.0741, 'learning_rate': 1.7706293706293708e-05, 'epoch': 2.29}


 12%|█▏        | 8301/71500 [18:25<2:11:27,  8.01it/s] 

{'loss': 1.0763, 'learning_rate': 1.767832167832168e-05, 'epoch': 2.32}


 12%|█▏        | 8401/71500 [18:38<2:05:53,  8.35it/s]

{'loss': 1.0752, 'learning_rate': 1.765034965034965e-05, 'epoch': 2.35}


 12%|█▏        | 8501/71500 [18:50<2:13:22,  7.87it/s]

{'loss': 1.0785, 'learning_rate': 1.7622377622377624e-05, 'epoch': 2.38}


 12%|█▏        | 8601/71500 [19:02<2:16:49,  7.66it/s]

{'loss': 1.0904, 'learning_rate': 1.7594405594405597e-05, 'epoch': 2.41}


 12%|█▏        | 8701/71500 [19:20<2:05:44,  8.32it/s] 

{'loss': 1.0825, 'learning_rate': 1.7566433566433567e-05, 'epoch': 2.43}


 12%|█▏        | 8801/71500 [19:32<1:55:55,  9.01it/s]

{'loss': 1.062, 'learning_rate': 1.753846153846154e-05, 'epoch': 2.46}


 12%|█▏        | 8901/71500 [19:44<2:15:07,  7.72it/s]

{'loss': 1.0804, 'learning_rate': 1.7510489510489513e-05, 'epoch': 2.49}


 13%|█▎        | 9001/71500 [20:03<2:21:40,  7.35it/s] 

{'loss': 1.0901, 'learning_rate': 1.7482517482517486e-05, 'epoch': 2.52}


 13%|█▎        | 9102/71500 [20:15<1:46:41,  9.75it/s]

{'loss': 1.0812, 'learning_rate': 1.7454545454545456e-05, 'epoch': 2.55}


 13%|█▎        | 9201/71500 [20:26<2:08:46,  8.06it/s]

{'loss': 1.0792, 'learning_rate': 1.742657342657343e-05, 'epoch': 2.57}


 13%|█▎        | 9301/71500 [20:39<2:20:19,  7.39it/s]

{'loss': 1.0673, 'learning_rate': 1.7398601398601402e-05, 'epoch': 2.6}


 13%|█▎        | 9400/71500 [20:50<1:59:26,  8.67it/s]

{'loss': 1.0604, 'learning_rate': 1.737062937062937e-05, 'epoch': 2.63}


 13%|█▎        | 9501/71500 [21:09<1:58:35,  8.71it/s] 

{'loss': 1.0801, 'learning_rate': 1.7342657342657345e-05, 'epoch': 2.66}


 13%|█▎        | 9602/71500 [21:21<1:59:33,  8.63it/s]

{'loss': 1.0682, 'learning_rate': 1.7314685314685314e-05, 'epoch': 2.69}


 14%|█▎        | 9701/71500 [21:34<2:02:25,  8.41it/s]

{'loss': 1.0719, 'learning_rate': 1.7286713286713287e-05, 'epoch': 2.71}


 14%|█▎        | 9801/71500 [21:51<2:02:13,  8.41it/s] 

{'loss': 1.0766, 'learning_rate': 1.725874125874126e-05, 'epoch': 2.74}


 14%|█▍        | 9901/71500 [22:02<1:38:32, 10.42it/s]

{'loss': 1.0884, 'learning_rate': 1.7230769230769234e-05, 'epoch': 2.77}


 14%|█▍        | 10001/71500 [22:15<2:00:29,  8.51it/s]

{'loss': 1.0834, 'learning_rate': 1.7202797202797203e-05, 'epoch': 2.8}


 14%|█▍        | 10101/71500 [22:34<2:20:39,  7.27it/s] 

{'loss': 1.0775, 'learning_rate': 1.7174825174825176e-05, 'epoch': 2.83}


 14%|█▍        | 10201/71500 [22:45<1:59:41,  8.54it/s]

{'loss': 1.0745, 'learning_rate': 1.714685314685315e-05, 'epoch': 2.85}


 14%|█▍        | 10301/71500 [22:57<2:12:30,  7.70it/s]

{'loss': 1.0697, 'learning_rate': 1.7118881118881123e-05, 'epoch': 2.88}


 15%|█▍        | 10401/71500 [23:09<2:26:26,  6.95it/s]

{'loss': 1.0785, 'learning_rate': 1.7090909090909092e-05, 'epoch': 2.91}


 15%|█▍        | 10502/71500 [23:26<1:53:22,  8.97it/s] 

{'loss': 1.0737, 'learning_rate': 1.7062937062937065e-05, 'epoch': 2.94}


 15%|█▍        | 10601/71500 [23:38<1:56:30,  8.71it/s]

{'loss': 1.0707, 'learning_rate': 1.7034965034965035e-05, 'epoch': 2.97}


 15%|█▍        | 10701/71500 [23:50<2:02:56,  8.24it/s]

{'loss': 1.071, 'learning_rate': 1.7006993006993008e-05, 'epoch': 2.99}


                                                       
 15%|█▌        | 10725/71500 [24:41<2:07:22,  7.95it/s]

{'eval_loss': 1.1313278675079346, 'eval_accuracy': 0.3488762473188473, 'eval_runtime': 48.1243, 'eval_samples_per_second': 222.819, 'eval_steps_per_second': 18.577, 'epoch': 3.0}


 15%|█▌        | 10725/71500 [24:43<2:20:04,  7.23it/s]

{'train_runtime': 1483.2063, 'train_samples_per_second': 578.328, 'train_steps_per_second': 48.206, 'train_loss': 1.0910380687580241, 'epoch': 3.0}





TrainOutput(global_step=10725, training_loss=1.0910380687580241, metrics={'train_runtime': 1483.2063, 'train_samples_per_second': 578.328, 'train_steps_per_second': 48.206, 'train_loss': 1.0910380687580241, 'epoch': 3.0})