<a target="_blank" href="https://colab.research.google.com/github/retowuest/uio-dl-2024/blob/main/Notebooks/nb-4.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Deep Learning for Social Scientists

### PhD Course, University of Bergen

### **Notebook 4:**<br>Transformers

### Table of Contents
* [Introduction](#section_1)
* [Loading the Data](#section_2)
* [Loading and Fine-Tuning a Pre-Trained BERT Model](#section_3)

### Introduction <a class="anchor" id="section_1"></a>

In this notebook, our goal is to fine-tune a BERT model for sentiment classification in PyTorch. We will use the `transformers` [Python library](https://huggingface.co/docs/transformers/index) provided by [Hugging Face](https://huggingface.co/), which includes a number of pre-trained models that are ready for fine-tuning.

Our use case is the IMDb movie review data set: we fine-tune the distilled BERT model (`DistilBERT`) to perform sentiment classification. `DistilBERT` is a lightweight transformer model created by distilling a pre-trained BERT base model. The original uncased BERT base model contains roughly 110 million parameters. According to Hugging Face (see quote below), `DistilBERT` has 40% fewer parameters and runs 60% faster while preserving over 95% of BERT's performance on the GLUE language understanding benchmark.

---

Quote from https://huggingface.co/docs/transformers/model_doc/distilbert:

> DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% less parameters than *google-bert/bert-base-uncased*, runs 60% faster while preserving over 95% of BERT's performances as measured on the GLUE language understanding benchmark.

---
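As a quick sanity check on the figures above, the following sketch works out what "40% fewer parameters" implies in absolute terms. The 110 million figure is the rounded size of BERT base mentioned in the introduction, so the result is only approximate.

```python
# Rough parameter count implied by the figures quoted above.
# 110 million is a rounded size for BERT base, so the result is approximate.
bert_base_params = 110_000_000
reduction_percent = 40  # "40% fewer parameters"

# Integer arithmetic avoids floating-point rounding in the result
distilbert_params = bert_base_params * (100 - reduction_percent) // 100
print(f"Approximate DistilBERT size: {distilbert_params / 1e6:.0f} million parameters")
```

Under these assumptions, `DistilBERT` has on the order of 66 million parameters.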

### Loading the Data <a class="anchor" id="section_2"></a>

We will begin by loading the required packages.

In [1]:
# Import packages
import gzip
import shutil
import time

import pandas as pd
import requests
import torch
import torch.nn.functional as F
import torchtext

import transformers
from transformers import DistilBertTokenizerFast
from transformers import DistilBertForSequenceClassification



Next, we specify some general settings (number of epochs we use for training the model, device specification, and the random seed).

In [2]:
#torch.backends.cudnn.deterministic = True (for reproducibility when using NVIDIA CUDA Deep Neural Network (cuDNN))
RANDOM_SEED = 123
torch.manual_seed(RANDOM_SEED)
DEVICE = torch.device("cpu")

NUM_EPOCHS = 3

Next, we will fetch the compressed IMDb movie review dataset (http://ai.stanford.edu/~amaas/data/sentiment/) for positive-negative sentiment classification, unzip it, and write it into a CSV-formatted file.

In [3]:
url = "https://github.com/rasbt/machine-learning-book/raw/main/ch08/movie_data.csv.gz"
filename = url.split("/")[-1]

# Download the compressed file; stop early if the request failed
r = requests.get(url)
r.raise_for_status()
with open(filename, "wb") as f:
    f.write(r.content)

# Decompress the archive into a plain CSV file
with gzip.open(filename, "rb") as f_in:
    with open("movie_data.csv", "wb") as f_out:
        shutil.copyfileobj(f_in, f_out) # copy content from source file to destination file
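To see what the decompression step does without downloading anything, here is a small offline sketch. It writes a tiny gzip file to a temporary directory and unpacks it with the same `gzip.open` and `shutil.copyfileobj` pattern. The file names and the toy CSV content are made up for this illustration.

```python
import gzip
import shutil
import tempfile
from pathlib import Path

# Work in a temporary directory so that no real files are touched.
# The file names and contents below are made up for this illustration.
with tempfile.TemporaryDirectory() as tmp_dir:
    gz_path = Path(tmp_dir) / "toy_data.csv.gz"
    csv_path = Path(tmp_dir) / "toy_data.csv"
    original = b"review,sentiment\ngreat movie,1\nterrible plot,0\n"

    # Write a small gzip-compressed file to stand in for the downloaded archive
    with gzip.open(gz_path, "wb") as f:
        f.write(original)

    # Decompress it: stream the bytes from the gzip reader into a plain file,
    # chunk by chunk, rather than reading the whole archive into memory
    with gzip.open(gz_path, "rb") as f_in, open(csv_path, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)

    restored = csv_path.read_bytes()

print(restored == original)
```

The decompressed bytes are identical to what was compressed, which is all the unzip step in the cell above is meant to achieve.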

Check if the data set looks okay.

In [4]:
# Load data into a Pandas DataFrame and print first few rows
df = pd.read_csv("movie_data.csv")
df.head()

| | review | sentiment |
|---|---|---|
| 0 | In 1974, the teenager Martha Moxley (Maggie Gr... | 1 |
| 1 | OK... so... I really like Kris Kristofferson a... | 0 |
| 2 | \*\*\*SPOILER\*\*\* Do not read this, if you think a... | 0 |
| 3 | hi for all the people who have seen this wonde... | 1 |
| 4 | I recently bought the DVD, forgetting just how... | 0 |


In [5]:
# Print dimensions of dataframe
df.shape

(50000, 2)

The next step is to split the data set into training, validation, and test sets. We use 70% (or 35,000 examples) of the data for training, 10% (or 5,000 examples) for validation, and the remaining 20% (or 10,000 examples) for testing.

In [6]:
# Split data into training, validation, and test sets
train_texts = df.iloc[:35000]["review"].values
train_labels = df.iloc[:35000]["sentiment"].values

valid_texts = df.iloc[35000:40000]["review"].values
valid_labels = df.iloc[35000:40000]["sentiment"].values

test_texts = df.iloc[40000:]["review"].values
test_labels = df.iloc[40000:]["sentiment"].values
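The slice boundaries can be checked with a little arithmetic. The sketch below uses a plain `range` as a stand-in for the 50,000 row positions of the DataFrame and applies the same cut points (35,000 and 40,000).

```python
# Sizes of the three splits produced by slicing 50,000 rows at 35,000 and 40,000.
n_total = 50_000
rows = range(n_total)  # stand-in for the DataFrame's row positions

train_rows = rows[:35_000]
valid_rows = rows[35_000:40_000]
test_rows = rows[40_000:]

for name, part in [("train", train_rows), ("valid", valid_rows), ("test", test_rows)]:
    print(f"{name}: {len(part)} rows ({len(part) / n_total:.0%})")
```

Note that the split is purely positional: rows are assigned by their order in the file, without any shuffling at this stage.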

Next, we will tokenize the texts using the tokenizer implementation from the pre-trained model class. The tokenizer splits each text into word and subword tokens and maps them to integer IDs from the model's vocabulary.

In [7]:
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased", model_max_length=512)

In [8]:
train_encodings = tokenizer(list(train_texts), truncation=True, padding=True)
valid_encodings = tokenizer(list(valid_texts), truncation=True, padding=True)
test_encodings = tokenizer(list(test_texts), truncation=True, padding=True)
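To make the effect of `truncation=True` and `padding=True` more concrete, here is a toy sketch in plain Python. It does not use the real tokenizer: the function `toy_encode`, the whitespace splitting, the on-the-fly vocabulary, and the maximum length of 4 are all made up for illustration, whereas the actual tokenizer uses a fixed subword vocabulary and a maximum length of 512. The sketch only conveys the idea that sequences are cut to a maximum length, padded to a common length, and accompanied by an attention mask that marks the real tokens.

```python
def toy_encode(texts, max_length=4, pad_id=0):
    """Toy illustration of truncation, padding, and attention masks (not the real tokenizer)."""
    vocab = {}
    token_ids = []
    for text in texts:
        ids = []
        for word in text.lower().split():
            # Give each new word the next free ID (IDs start at 1; 0 is reserved for padding)
            ids.append(vocab.setdefault(word, len(vocab) + 1))
        token_ids.append(ids[:max_length])  # truncation: keep at most max_length tokens

    # Padding: extend every sequence to the length of the longest one in the batch
    longest = max(len(ids) for ids in token_ids)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in token_ids]
    # Attention mask: 1 for real tokens, 0 for padding positions
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in token_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


encodings = toy_encode(["A great movie", "Not a great movie at all"])
print(encodings["input_ids"])
print(encodings["attention_mask"])
```

The second text is cut to four tokens, and the first is padded with one zero so that both sequences have the same length.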

Finally, we create a class called `IMDbDataset` and create the data loaders (the encodings store a lot of information about the tokenized texts; with the dictionary in the `__getitem__` method defined below, we extract only the relevant information).

In [10]:
# Create class IMDbDataset
class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)


train_dataset = IMDbDataset(train_encodings, train_labels)
valid_dataset = IMDbDataset(valid_encodings, valid_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)
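The indexing logic in `__getitem__` can be tried out without PyTorch. The sketch below uses a hypothetical `ToyDataset` class and hand-written toy encodings; it mirrors the dictionary comprehension above but returns plain Python values instead of tensors.

```python
class ToyDataset:
    """Torch-free stand-in that mirrors the indexing logic of IMDbDataset."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Pick the idx-th entry from every field of the encodings ...
        item = {key: val[idx] for key, val in self.encodings.items()}
        # ... and attach the matching label under the key "labels"
        item["labels"] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)


# Hand-written toy encodings for two short "reviews"
toy_encodings = {
    "input_ids": [[1, 2, 3, 0], [4, 1, 2, 3]],
    "attention_mask": [[1, 1, 1, 0], [1, 1, 1, 1]],
}
toy_labels = [1, 0]

toy_dataset = ToyDataset(toy_encodings, toy_labels)
print(len(toy_dataset))
print(toy_dataset[0])
```

Each item is a dictionary with the keys `input_ids`, `attention_mask`, and `labels`, which are the names of the arguments the sequence classification model accepts in its forward pass.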

In [11]:
# Create data loaders
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16, shuffle=True)
valid_loader = torch.utils.data.DataLoader(valid_dataset, batch_size=16, shuffle=False)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=16, shuffle=False)
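The batch size determines how many optimization steps one pass over the training data takes. The following arithmetic sketch assumes a single device and no gradient accumulation; under these assumptions, the result matches the total of 6564 steps shown in the progress bar of the training output further below.

```python
import math

n_train = 35_000   # number of training examples
batch_size = 16    # batch size (one device assumed)
num_epochs = 3     # number of training epochs

# The last batch of each epoch is smaller than 16, hence the ceiling
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * num_epochs

print(f"Steps per epoch: {steps_per_epoch}")
print(f"Total steps:     {total_steps}")
```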

### Loading and Fine-Tuning a Pre-Trained BERT Model <a class="anchor" id="section_3"></a>

We first load the pre-trained DistilBERT model with a sequence classification head ("uncased" means that the model does not distinguish between upper- and lower-case letters).

In [12]:
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased") # this specifies the downstream task for which we want to fine-tune the model
model.to(DEVICE)
model.train();

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
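Because the classification head is newly initialized, the model starts out with essentially no knowledge of the task. A useful reference point is the cross-entropy loss of a classifier that assigns equal probability to both classes. The sketch below computes this value, assuming two classes; the first loss logged in the training output further below (0.6931) is consistent with it.

```python
import math

num_classes = 2                  # positive vs. negative sentiment
p_correct = 1 / num_classes      # probability a uniform guess puts on the true class

# Cross-entropy loss of a prediction that is uniform over the classes
chance_loss = -math.log(p_correct)
print(f"Loss at chance level: {chance_loss:.4f}")
```

A training loss that falls clearly below this value indicates that the model has begun to pick up on the sentiment signal.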


To train (fine-tune) the model, we will use the `Trainer` API provided by Hugging Face.

In [13]:
# Import Trainer and TrainingArguments from transformers
from transformers import Trainer, TrainingArguments

# Specify optimization algorithm
optim = torch.optim.Adam(model.parameters(), lr=5e-5)

# Specify training arguments
# (directories for output and logs, number of epochs, batch sizes)
training_args = TrainingArguments(
    output_dir="./results", 
    num_train_epochs=3,     
    per_device_train_batch_size=16, 
    per_device_eval_batch_size=16,   
    logging_dir="./logs",
    logging_steps=10,
)

# Pass TrainingArguments settings to the Trainer class to instantiate a new trainer object
trainer = Trainer(
    model=model, # the model to be fine-tuned
    args=training_args, # training arguments specified above
    train_dataset=train_dataset, # training set
    optimizers=(optim, None) # optim and learning rate schedule
)

We can now train the model by calling the `trainer.train` method (we will use this method shortly).

By default, the `Trainer` reports only the training loss and does not compute evaluation metrics such as accuracy. Therefore, to evaluate the model, we define an evaluation function.

In [14]:
# Import load_metric and numpy
from datasets import load_metric
import numpy as np

# Define metric
metric = load_metric("accuracy")

# Define evaluation function
# (function operates on the model's test predictions as logits, which is the default output of the model, and the labels)
def compute_metrics(eval_pred):
    logits, labels = eval_pred # logits are a numpy array, not pytorch tensor
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(
               predictions=predictions, references=labels)
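To illustrate what `compute_metrics` calculates, here is a plain-Python sketch of the same two steps: taking the argmax over the class dimension and comparing the predictions with the labels. The function `toy_accuracy` and the logits are made up for this example; the notebook itself relies on NumPy and the loaded accuracy metric.

```python
def toy_accuracy(logits, labels):
    """Plain-Python illustration of argmax followed by accuracy."""
    # Argmax over the class dimension: index of the largest logit in each row
    predictions = [row.index(max(row)) for row in logits]
    # Accuracy: share of predictions that agree with the labels
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Made-up logits for four reviews and two classes (0 = negative, 1 = positive)
toy_logits = [[2.0, -1.0], [0.3, 0.9], [-0.5, 0.1], [1.2, 0.4]]
toy_labels = [0, 1, 0, 0]

result = toy_accuracy(toy_logits, toy_labels)
print(result)
```

Three of the four predictions match the labels, so the accuracy is 0.75.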



In [15]:
# Instantiate the trainer again, this time including the test set and compute_metrics
optim = torch.optim.Adam(model.parameters(), lr=5e-5)

training_args = TrainingArguments(
    output_dir="./results", 
    num_train_epochs=3,     
    per_device_train_batch_size=16, 
    per_device_eval_batch_size=16,   
    logging_dir="./logs",
    logging_steps=10
)

trainer = Trainer(
    model=model,
    compute_metrics=compute_metrics,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    optimizers=(optim, None) # optimizer and learning rate scheduler
)
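Passing `None` as the second element of `optimizers` leaves the choice of learning rate scheduler to the `Trainer`. Assuming the default settings of `TrainingArguments` (a linear schedule without warmup), the learning rate decays linearly from its initial value to zero over all training steps. The sketch below reproduces this rule in plain Python; the helper `linear_decay_lr` is our own and not part of the library. Its values can be compared with the `learning_rate` entries in the training log below.

```python
initial_lr = 5e-5    # learning rate passed to the optimizer
total_steps = 6564   # total number of training steps (see progress bar below)


def linear_decay_lr(step, initial_lr=initial_lr, total_steps=total_steps):
    """Learning rate after `step` updates under linear decay without warmup."""
    return initial_lr * max(0.0, 1 - step / total_steps)


for step in (10, 20, 30):
    print(step, linear_decay_lr(step))
```

The computed values agree with the logged learning rates at steps 10, 20, and 30 up to floating-point precision, which supports the assumption that a linear schedule is in use.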

In [16]:
# Train model by calling trainer.train method
start_time = time.time()
trainer.train()
print(f"Total Training Time: {(time.time() - start_time)/60:.2f} min")

  0%|          | 0/6564 [00:00<?, ?it/s]
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
  0%|          | 10/6564 [00:16<2:15:53,  1.24s/it]

{'loss': 0.6931, 'grad_norm': 1.5085525512695312, 'learning_rate': 4.9923826934795856e-05, 'epoch': 0.0}


  0%|          | 20/6564 [00:28<2:14:05,  1.23s/it]

{'loss': 0.6482, 'grad_norm': 4.474645614624023, 'learning_rate': 4.984765386959172e-05, 'epoch': 0.01}


  0%|          | 30/6564 [00:40<2:12:24,  1.22s/it]

{'loss': 0.6231, 'grad_norm': 6.141397953033447, 'learning_rate': 4.977148080438757e-05, 'epoch': 0.01}


  1%|          | 40/6564 [00:52<2:07:29,  1.17s/it]

{'loss': 0.3685, 'grad_norm': 4.076168537139893, 'learning_rate': 4.9695307739183424e-05, 'epoch': 0.02}


  1%|          | 50/6564 [01:04<2:07:29,  1.17s/it]

{'loss': 0.3367, 'grad_norm': 18.48862648010254, 'learning_rate': 4.9619134673979285e-05, 'epoch': 0.02}


  1%|          | 60/6564 [01:16<2:08:49,  1.19s/it]

{'loss': 0.3646, 'grad_norm': 10.214604377746582, 'learning_rate': 4.954296160877514e-05, 'epoch': 0.03}


  1%|          | 70/6564 [01:28<2:08:41,  1.19s/it]

{'loss': 0.2852, 'grad_norm': 1.983090877532959, 'learning_rate': 4.9466788543571e-05, 'epoch': 0.03}


  1%|          | 80/6564 [01:40<2:06:19,  1.17s/it]

{'loss': 0.3987, 'grad_norm': 9.811516761779785, 'learning_rate': 4.939061547836685e-05, 'epoch': 0.04}


  1%|▏         | 90/6564 [01:52<2:09:17,  1.20s/it]

{'loss': 0.3119, 'grad_norm': 4.606213092803955, 'learning_rate': 4.931444241316271e-05, 'epoch': 0.04}


  2%|▏         | 100/6564 [02:04<2:08:34,  1.19s/it]

{'loss': 0.3231, 'grad_norm': 11.816107749938965, 'learning_rate': 4.923826934795857e-05, 'epoch': 0.05}


  2%|▏         | 110/6564 [02:15<2:06:18,  1.17s/it]

{'loss': 0.2979, 'grad_norm': 13.447651863098145, 'learning_rate': 4.916209628275442e-05, 'epoch': 0.05}


  2%|▏         | 120/6564 [02:27<2:05:40,  1.17s/it]

{'loss': 0.4116, 'grad_norm': 3.3356642723083496, 'learning_rate': 4.9085923217550275e-05, 'epoch': 0.05}


  2%|▏         | 130/6564 [02:39<2:05:06,  1.17s/it]

{'loss': 0.2725, 'grad_norm': 8.442696571350098, 'learning_rate': 4.9009750152346136e-05, 'epoch': 0.06}


  2%|▏         | 140/6564 [02:50<2:04:27,  1.16s/it]

{'loss': 0.3021, 'grad_norm': 5.998447895050049, 'learning_rate': 4.893357708714199e-05, 'epoch': 0.06}


  2%|▏         | 150/6564 [03:02<2:04:00,  1.16s/it]

{'loss': 0.2582, 'grad_norm': 8.48385238647461, 'learning_rate': 4.885740402193785e-05, 'epoch': 0.07}


  2%|▏         | 160/6564 [03:14<2:04:16,  1.16s/it]

{'loss': 0.3742, 'grad_norm': 1.8933043479919434, 'learning_rate': 4.8781230956733704e-05, 'epoch': 0.07}


  3%|▎         | 170/6564 [03:25<2:05:23,  1.18s/it]

{'loss': 0.2428, 'grad_norm': 5.0918779373168945, 'learning_rate': 4.870505789152956e-05, 'epoch': 0.08}


  3%|▎         | 180/6564 [03:37<2:03:24,  1.16s/it]

{'loss': 0.2972, 'grad_norm': 13.854533195495605, 'learning_rate': 4.862888482632542e-05, 'epoch': 0.08}


  3%|▎         | 190/6564 [03:49<2:03:30,  1.16s/it]

{'loss': 0.3061, 'grad_norm': 4.074699401855469, 'learning_rate': 4.855271176112127e-05, 'epoch': 0.09}


  3%|▎         | 200/6564 [04:00<2:02:56,  1.16s/it]

{'loss': 0.3281, 'grad_norm': 5.683719635009766, 'learning_rate': 4.8476538695917126e-05, 'epoch': 0.09}


  3%|▎         | 210/6564 [04:12<2:03:42,  1.17s/it]

{'loss': 0.2826, 'grad_norm': 5.315464973449707, 'learning_rate': 4.8400365630712987e-05, 'epoch': 0.1}


  3%|▎         | 220/6564 [04:24<2:04:25,  1.18s/it]

{'loss': 0.3369, 'grad_norm': 12.594862937927246, 'learning_rate': 4.8324192565508834e-05, 'epoch': 0.1}


  4%|▎         | 230/6564 [04:36<2:03:37,  1.17s/it]

{'loss': 0.3694, 'grad_norm': 8.190422058105469, 'learning_rate': 4.8248019500304694e-05, 'epoch': 0.11}


  4%|▎         | 240/6564 [04:47<2:03:09,  1.17s/it]

{'loss': 0.1993, 'grad_norm': 9.928567886352539, 'learning_rate': 4.817184643510055e-05, 'epoch': 0.11}


  4%|▍         | 250/6564 [04:59<2:03:36,  1.17s/it]

{'loss': 0.2587, 'grad_norm': 8.553461074829102, 'learning_rate': 4.80956733698964e-05, 'epoch': 0.11}


  4%|▍         | 260/6564 [05:11<2:02:23,  1.16s/it]

{'loss': 0.2144, 'grad_norm': 4.435325622558594, 'learning_rate': 4.801950030469226e-05, 'epoch': 0.12}


  4%|▍         | 270/6564 [05:23<2:02:16,  1.17s/it]

{'loss': 0.4408, 'grad_norm': 4.221426010131836, 'learning_rate': 4.7943327239488116e-05, 'epoch': 0.12}


  4%|▍         | 280/6564 [05:35<2:06:43,  1.21s/it]

{'loss': 0.2962, 'grad_norm': 6.258211612701416, 'learning_rate': 4.786715417428398e-05, 'epoch': 0.13}


  4%|▍         | 290/6564 [12:48<14:54:56,  8.56s/it]  

{'loss': 0.2543, 'grad_norm': 6.355643272399902, 'learning_rate': 4.779098110907983e-05, 'epoch': 0.13}


  5%|▍         | 300/6564 [13:00<2:22:36,  1.37s/it] 

{'loss': 0.3085, 'grad_norm': 2.4925572872161865, 'learning_rate': 4.7714808043875684e-05, 'epoch': 0.14}


  5%|▍         | 310/6564 [13:11<2:00:14,  1.15s/it]

{'loss': 0.4113, 'grad_norm': 6.82965612411499, 'learning_rate': 4.7638634978671545e-05, 'epoch': 0.14}


  5%|▍         | 320/6564 [13:24<2:20:03,  1.35s/it]

{'loss': 0.3283, 'grad_norm': 3.382976770401001, 'learning_rate': 4.75624619134674e-05, 'epoch': 0.15}


  5%|▌         | 330/6564 [13:36<2:03:19,  1.19s/it]

{'loss': 0.3687, 'grad_norm': 5.6091227531433105, 'learning_rate': 4.748628884826325e-05, 'epoch': 0.15}


  5%|▌         | 340/6564 [13:48<2:04:06,  1.20s/it]

{'loss': 0.2698, 'grad_norm': 5.15540075302124, 'learning_rate': 4.741011578305911e-05, 'epoch': 0.16}


  5%|▌         | 350/6564 [13:59<2:02:16,  1.18s/it]

{'loss': 0.2544, 'grad_norm': 17.42645263671875, 'learning_rate': 4.733394271785497e-05, 'epoch': 0.16}


  5%|▌         | 360/6564 [14:11<2:02:25,  1.18s/it]

{'loss': 0.36, 'grad_norm': 8.634753227233887, 'learning_rate': 4.725776965265082e-05, 'epoch': 0.16}


  6%|▌         | 370/6564 [14:23<2:00:51,  1.17s/it]

{'loss': 0.2654, 'grad_norm': 1.646098256111145, 'learning_rate': 4.718159658744668e-05, 'epoch': 0.17}


  6%|▌         | 380/6564 [14:35<2:01:09,  1.18s/it]

{'loss': 0.2822, 'grad_norm': 5.405315399169922, 'learning_rate': 4.7105423522242535e-05, 'epoch': 0.17}


  6%|▌         | 390/6564 [14:46<2:01:29,  1.18s/it]

{'loss': 0.2375, 'grad_norm': 5.025873184204102, 'learning_rate': 4.7029250457038396e-05, 'epoch': 0.18}


  6%|▌         | 400/6564 [14:58<2:01:00,  1.18s/it]

{'loss': 0.2835, 'grad_norm': 5.348091125488281, 'learning_rate': 4.695307739183425e-05, 'epoch': 0.18}


  6%|▌         | 410/6564 [15:10<1:59:27,  1.16s/it]

{'loss': 0.2634, 'grad_norm': 13.701735496520996, 'learning_rate': 4.6876904326630103e-05, 'epoch': 0.19}


  6%|▋         | 420/6564 [15:22<1:59:47,  1.17s/it]

{'loss': 0.308, 'grad_norm': 6.961171627044678, 'learning_rate': 4.6800731261425964e-05, 'epoch': 0.19}


  7%|▋         | 430/6564 [15:33<1:59:50,  1.17s/it]

{'loss': 0.2548, 'grad_norm': 5.7014899253845215, 'learning_rate': 4.672455819622182e-05, 'epoch': 0.2}


  7%|▋         | 440/6564 [15:45<1:59:00,  1.17s/it]

{'loss': 0.3277, 'grad_norm': 4.470987796783447, 'learning_rate': 4.664838513101767e-05, 'epoch': 0.2}


  7%|▋         | 450/6564 [15:57<2:00:28,  1.18s/it]

{'loss': 0.2506, 'grad_norm': 6.4014716148376465, 'learning_rate': 4.657221206581353e-05, 'epoch': 0.21}


  7%|▋         | 460/6564 [16:09<1:59:23,  1.17s/it]

{'loss': 0.2726, 'grad_norm': 6.382644176483154, 'learning_rate': 4.6496039000609386e-05, 'epoch': 0.21}


  7%|▋         | 470/6564 [16:20<2:00:39,  1.19s/it]

{'loss': 0.2143, 'grad_norm': 4.8790812492370605, 'learning_rate': 4.641986593540525e-05, 'epoch': 0.21}


  7%|▋         | 480/6564 [16:32<1:59:42,  1.18s/it]

{'loss': 0.3999, 'grad_norm': 6.798576831817627, 'learning_rate': 4.63436928702011e-05, 'epoch': 0.22}


  7%|▋         | 490/6564 [16:44<1:59:40,  1.18s/it]

{'loss': 0.2745, 'grad_norm': 5.370530128479004, 'learning_rate': 4.6267519804996954e-05, 'epoch': 0.22}


  8%|▊         | 500/6564 [16:56<1:59:17,  1.18s/it]

{'loss': 0.1787, 'grad_norm': 7.663488864898682, 'learning_rate': 4.6191346739792815e-05, 'epoch': 0.23}


  8%|▊         | 510/6564 [17:09<2:01:25,  1.20s/it]

{'loss': 0.3007, 'grad_norm': 9.962599754333496, 'learning_rate': 4.611517367458867e-05, 'epoch': 0.23}


  8%|▊         | 520/6564 [17:21<1:59:27,  1.19s/it]

{'loss': 0.4264, 'grad_norm': 17.733213424682617, 'learning_rate': 4.603900060938452e-05, 'epoch': 0.24}


  8%|▊         | 530/6564 [17:33<1:58:25,  1.18s/it]

{'loss': 0.2794, 'grad_norm': 8.008511543273926, 'learning_rate': 4.596282754418038e-05, 'epoch': 0.24}


  8%|▊         | 540/6564 [17:44<1:58:55,  1.18s/it]

{'loss': 0.2272, 'grad_norm': 5.819798946380615, 'learning_rate': 4.588665447897624e-05, 'epoch': 0.25}


  8%|▊         | 550/6564 [17:56<1:57:55,  1.18s/it]

{'loss': 0.3254, 'grad_norm': 2.790743112564087, 'learning_rate': 4.581048141377209e-05, 'epoch': 0.25}


  9%|▊         | 560/6564 [18:08<1:58:00,  1.18s/it]

{'loss': 0.3064, 'grad_norm': 9.901390075683594, 'learning_rate': 4.573430834856795e-05, 'epoch': 0.26}


  9%|▊         | 570/6564 [18:20<1:58:53,  1.19s/it]

{'loss': 0.252, 'grad_norm': 5.460116386413574, 'learning_rate': 4.5658135283363805e-05, 'epoch': 0.26}


  9%|▉         | 580/6564 [18:32<1:56:40,  1.17s/it]

{'loss': 0.2859, 'grad_norm': 5.816339015960693, 'learning_rate': 4.5581962218159666e-05, 'epoch': 0.27}


  9%|▉         | 590/6564 [18:43<1:56:42,  1.17s/it]

{'loss': 0.2552, 'grad_norm': 6.6258864402771, 'learning_rate': 4.550578915295552e-05, 'epoch': 0.27}


  9%|▉         | 600/6564 [18:55<1:57:41,  1.18s/it]

{'loss': 0.3079, 'grad_norm': 7.387386798858643, 'learning_rate': 4.542961608775137e-05, 'epoch': 0.27}


  9%|▉         | 610/6564 [19:07<1:57:38,  1.19s/it]

{'loss': 0.2883, 'grad_norm': 6.2699103355407715, 'learning_rate': 4.5353443022547234e-05, 'epoch': 0.28}


  9%|▉         | 620/6564 [19:19<1:56:30,  1.18s/it]

{'loss': 0.2886, 'grad_norm': 4.849356651306152, 'learning_rate': 4.527726995734309e-05, 'epoch': 0.28}


 10%|▉         | 630/6564 [19:31<1:57:07,  1.18s/it]

{'loss': 0.2884, 'grad_norm': 8.237793922424316, 'learning_rate': 4.520109689213894e-05, 'epoch': 0.29}


 10%|▉         | 640/6564 [19:42<1:55:56,  1.17s/it]

{'loss': 0.2436, 'grad_norm': 3.0922303199768066, 'learning_rate': 4.51249238269348e-05, 'epoch': 0.29}


 10%|▉         | 650/6564 [19:54<1:56:01,  1.18s/it]

{'loss': 0.3259, 'grad_norm': 7.831559181213379, 'learning_rate': 4.504875076173065e-05, 'epoch': 0.3}


 10%|█         | 660/6564 [20:06<1:56:12,  1.18s/it]

{'loss': 0.2793, 'grad_norm': 2.5130982398986816, 'learning_rate': 4.497257769652651e-05, 'epoch': 0.3}


 10%|█         | 670/6564 [20:18<1:54:40,  1.17s/it]

{'loss': 0.2393, 'grad_norm': 4.469945907592773, 'learning_rate': 4.4896404631322363e-05, 'epoch': 0.31}


 10%|█         | 680/6564 [20:29<1:54:51,  1.17s/it]

{'loss': 0.3089, 'grad_norm': 12.373077392578125, 'learning_rate': 4.482023156611822e-05, 'epoch': 0.31}


 11%|█         | 690/6564 [20:41<1:56:10,  1.19s/it]

{'loss': 0.2316, 'grad_norm': 8.76420783996582, 'learning_rate': 4.474405850091408e-05, 'epoch': 0.32}


 11%|█         | 700/6564 [20:53<1:55:59,  1.19s/it]

{'loss': 0.2516, 'grad_norm': 9.099811553955078, 'learning_rate': 4.466788543570993e-05, 'epoch': 0.32}


 11%|█         | 710/6564 [21:05<1:57:14,  1.20s/it]

{'loss': 0.2877, 'grad_norm': 6.351034164428711, 'learning_rate': 4.459171237050579e-05, 'epoch': 0.32}


 11%|█         | 720/6564 [21:17<1:53:46,  1.17s/it]

{'loss': 0.1956, 'grad_norm': 6.39694356918335, 'learning_rate': 4.4515539305301646e-05, 'epoch': 0.33}


 11%|█         | 730/6564 [21:29<1:54:10,  1.17s/it]

{'loss': 0.231, 'grad_norm': 0.7530974745750427, 'learning_rate': 4.44393662400975e-05, 'epoch': 0.33}


 11%|█▏        | 740/6564 [21:41<1:53:16,  1.17s/it]

{'loss': 0.1932, 'grad_norm': 13.478053092956543, 'learning_rate': 4.436319317489336e-05, 'epoch': 0.34}


 11%|█▏        | 750/6564 [21:53<1:56:08,  1.20s/it]

{'loss': 0.2453, 'grad_norm': 13.352371215820312, 'learning_rate': 4.4287020109689214e-05, 'epoch': 0.34}


 12%|█▏        | 760/6564 [22:04<1:53:42,  1.18s/it]

{'loss': 0.226, 'grad_norm': 3.3582279682159424, 'learning_rate': 4.421084704448507e-05, 'epoch': 0.35}


 12%|█▏        | 770/6564 [22:16<1:54:35,  1.19s/it]

{'loss': 0.2419, 'grad_norm': 1.5418829917907715, 'learning_rate': 4.413467397928093e-05, 'epoch': 0.35}


 12%|█▏        | 780/6564 [22:28<1:55:59,  1.20s/it]

{'loss': 0.3361, 'grad_norm': 2.424987316131592, 'learning_rate': 4.405850091407678e-05, 'epoch': 0.36}


 12%|█▏        | 790/6564 [22:40<1:51:26,  1.16s/it]

{'loss': 0.2378, 'grad_norm': 11.558798789978027, 'learning_rate': 4.398232784887264e-05, 'epoch': 0.36}


 12%|█▏        | 800/6564 [22:51<1:51:59,  1.17s/it]

{'loss': 0.1888, 'grad_norm': 7.946761131286621, 'learning_rate': 4.39061547836685e-05, 'epoch': 0.37}


 12%|█▏        | 810/6564 [23:03<1:51:07,  1.16s/it]

{'loss': 0.2105, 'grad_norm': 4.720186233520508, 'learning_rate': 4.382998171846435e-05, 'epoch': 0.37}


 12%|█▏        | 820/6564 [23:15<1:52:56,  1.18s/it]

{'loss': 0.2758, 'grad_norm': 3.982029914855957, 'learning_rate': 4.375380865326021e-05, 'epoch': 0.37}


 13%|█▎        | 830/6564 [23:27<1:53:55,  1.19s/it]

{'loss': 0.2489, 'grad_norm': 2.9200665950775146, 'learning_rate': 4.3677635588056065e-05, 'epoch': 0.38}


 13%|█▎        | 840/6564 [23:38<1:52:10,  1.18s/it]

{'loss': 0.2722, 'grad_norm': 1.9802788496017456, 'learning_rate': 4.360146252285192e-05, 'epoch': 0.38}


 13%|█▎        | 850/6564 [23:50<1:52:14,  1.18s/it]

{'loss': 0.2772, 'grad_norm': 5.276895999908447, 'learning_rate': 4.352528945764778e-05, 'epoch': 0.39}


 13%|█▎        | 860/6564 [24:02<1:51:11,  1.17s/it]

{'loss': 0.2482, 'grad_norm': 5.757599353790283, 'learning_rate': 4.344911639244363e-05, 'epoch': 0.39}


 13%|█▎        | 870/6564 [24:14<1:51:14,  1.17s/it]

{'loss': 0.2877, 'grad_norm': 3.957468271255493, 'learning_rate': 4.337294332723949e-05, 'epoch': 0.4}


 13%|█▎        | 880/6564 [24:25<1:49:41,  1.16s/it]

{'loss': 0.1953, 'grad_norm': 5.820405006408691, 'learning_rate': 4.329677026203535e-05, 'epoch': 0.4}


 14%|█▎        | 890/6564 [24:37<1:49:08,  1.15s/it]

{'loss': 0.2724, 'grad_norm': 2.8670578002929688, 'learning_rate': 4.32205971968312e-05, 'epoch': 0.41}


 14%|█▎        | 900/6564 [24:48<1:49:47,  1.16s/it]

{'loss': 0.3944, 'grad_norm': 8.70785140991211, 'learning_rate': 4.314442413162706e-05, 'epoch': 0.41}


 14%|█▍        | 910/6564 [25:00<1:48:55,  1.16s/it]

{'loss': 0.2648, 'grad_norm': 4.624115943908691, 'learning_rate': 4.3068251066422916e-05, 'epoch': 0.42}


 14%|█▍        | 920/6564 [25:11<1:48:27,  1.15s/it]

{'loss': 0.2102, 'grad_norm': 3.269089460372925, 'learning_rate': 4.299207800121877e-05, 'epoch': 0.42}


 14%|█▍        | 930/6564 [25:23<1:48:30,  1.16s/it]

{'loss': 0.2386, 'grad_norm': 3.3938214778900146, 'learning_rate': 4.291590493601463e-05, 'epoch': 0.43}


 14%|█▍        | 940/6564 [25:35<1:49:27,  1.17s/it]

{'loss': 0.3604, 'grad_norm': 6.549070835113525, 'learning_rate': 4.2839731870810484e-05, 'epoch': 0.43}


 14%|█▍        | 950/6564 [25:46<1:48:37,  1.16s/it]

{'loss': 0.289, 'grad_norm': 3.637716054916382, 'learning_rate': 4.276355880560634e-05, 'epoch': 0.43}


 15%|█▍        | 960/6564 [25:58<1:49:21,  1.17s/it]

{'loss': 0.1889, 'grad_norm': 3.8251137733459473, 'learning_rate': 4.26873857404022e-05, 'epoch': 0.44}


 15%|█▍        | 970/6564 [26:10<1:47:51,  1.16s/it]

{'loss': 0.275, 'grad_norm': 3.6311709880828857, 'learning_rate': 4.261121267519805e-05, 'epoch': 0.44}


 15%|█▍        | 980/6564 [26:21<1:47:55,  1.16s/it]

{'loss': 0.1931, 'grad_norm': 4.610900402069092, 'learning_rate': 4.253503960999391e-05, 'epoch': 0.45}


 15%|█▌        | 990/6564 [26:33<1:47:33,  1.16s/it]

{'loss': 0.2948, 'grad_norm': 4.750773906707764, 'learning_rate': 4.245886654478977e-05, 'epoch': 0.45}


 15%|█▌        | 1000/6564 [26:44<1:47:41,  1.16s/it]

{'loss': 0.2324, 'grad_norm': 2.4097776412963867, 'learning_rate': 4.238269347958562e-05, 'epoch': 0.46}


 15%|█▌        | 1010/6564 [26:57<1:47:51,  1.17s/it]

{'loss': 0.3829, 'grad_norm': 2.456437826156616, 'learning_rate': 4.230652041438148e-05, 'epoch': 0.46}


 16%|█▌        | 1020/6564 [27:09<1:48:34,  1.18s/it]

{'loss': 0.2039, 'grad_norm': 2.8997998237609863, 'learning_rate': 4.2230347349177335e-05, 'epoch': 0.47}


 16%|█▌        | 1030/6564 [27:20<1:48:02,  1.17s/it]

{'loss': 0.2193, 'grad_norm': 0.4877140522003174, 'learning_rate': 4.215417428397319e-05, 'epoch': 0.47}


 16%|█▌        | 1040/6564 [27:32<1:47:46,  1.17s/it]

{'loss': 0.3136, 'grad_norm': 7.595839023590088, 'learning_rate': 4.207800121876905e-05, 'epoch': 0.48}


 16%|█▌        | 1050/6564 [27:44<1:55:51,  1.26s/it]

{'loss': 0.1413, 'grad_norm': 7.446314334869385, 'learning_rate': 4.20018281535649e-05, 'epoch': 0.48}


 16%|█▌        | 1060/6564 [27:56<1:48:46,  1.19s/it]

{'loss': 0.2452, 'grad_norm': 8.73460578918457, 'learning_rate': 4.192565508836076e-05, 'epoch': 0.48}


 16%|█▋        | 1070/6564 [28:08<1:47:36,  1.18s/it]

{'loss': 0.1754, 'grad_norm': 7.065487861633301, 'learning_rate': 4.184948202315662e-05, 'epoch': 0.49}


 16%|█▋        | 1080/6564 [28:20<1:50:16,  1.21s/it]

{'loss': 0.2062, 'grad_norm': 5.477041244506836, 'learning_rate': 4.1773308957952465e-05, 'epoch': 0.49}


 17%|█▋        | 1090/6564 [28:32<1:49:04,  1.20s/it]

{'loss': 0.1395, 'grad_norm': 1.0251052379608154, 'learning_rate': 4.1697135892748325e-05, 'epoch': 0.5}


 17%|█▋        | 1100/6564 [28:44<1:53:16,  1.24s/it]

{'loss': 0.2724, 'grad_norm': 10.856822967529297, 'learning_rate': 4.162096282754418e-05, 'epoch': 0.5}


 17%|█▋        | 1110/6564 [28:56<1:47:53,  1.19s/it]

{'loss': 0.2244, 'grad_norm': 0.8719285726547241, 'learning_rate': 4.154478976234004e-05, 'epoch': 0.51}


 17%|█▋        | 1120/6564 [29:08<1:47:25,  1.18s/it]

{'loss': 0.17, 'grad_norm': 9.31025505065918, 'learning_rate': 4.146861669713589e-05, 'epoch': 0.51}


 17%|█▋        | 1130/6564 [29:20<1:45:58,  1.17s/it]

{'loss': 0.2176, 'grad_norm': 3.247462749481201, 'learning_rate': 4.139244363193175e-05, 'epoch': 0.52}


 17%|█▋        | 1140/6564 [29:32<1:45:13,  1.16s/it]

{'loss': 0.1953, 'grad_norm': 4.35000467300415, 'learning_rate': 4.131627056672761e-05, 'epoch': 0.52}


 18%|█▊        | 1150/6564 [29:43<1:47:35,  1.19s/it]

{'loss': 0.41, 'grad_norm': 11.879334449768066, 'learning_rate': 4.124009750152346e-05, 'epoch': 0.53}


 18%|█▊        | 1160/6564 [29:55<1:45:34,  1.17s/it]

{'loss': 0.2089, 'grad_norm': 10.482942581176758, 'learning_rate': 4.1163924436319315e-05, 'epoch': 0.53}


 18%|█▊        | 1170/6564 [30:07<1:45:37,  1.17s/it]

{'loss': 0.1874, 'grad_norm': 6.962795257568359, 'learning_rate': 4.1087751371115176e-05, 'epoch': 0.53}


 18%|█▊        | 1180/6564 [30:19<1:45:37,  1.18s/it]

{'loss': 0.2615, 'grad_norm': 4.575301170349121, 'learning_rate': 4.101157830591103e-05, 'epoch': 0.54}


 18%|█▊        | 1190/6564 [30:31<1:44:41,  1.17s/it]

{'loss': 0.2409, 'grad_norm': 9.057226181030273, 'learning_rate': 4.0935405240706884e-05, 'epoch': 0.54}


 18%|█▊        | 1200/6564 [30:42<1:46:32,  1.19s/it]

{'loss': 0.1887, 'grad_norm': 5.247386455535889, 'learning_rate': 4.0859232175502744e-05, 'epoch': 0.55}


 18%|█▊        | 1210/6564 [30:54<1:43:38,  1.16s/it]

{'loss': 0.2899, 'grad_norm': 4.447776794433594, 'learning_rate': 4.07830591102986e-05, 'epoch': 0.55}


 19%|█▊        | 1220/6564 [31:06<1:43:58,  1.17s/it]

{'loss': 0.1936, 'grad_norm': 4.13130521774292, 'learning_rate': 4.070688604509446e-05, 'epoch': 0.56}


 19%|█▊        | 1230/6564 [31:18<1:44:46,  1.18s/it]

{'loss': 0.2075, 'grad_norm': 3.6529910564422607, 'learning_rate': 4.063071297989031e-05, 'epoch': 0.56}


 19%|█▉        | 1240/6564 [31:29<1:44:05,  1.17s/it]

{'loss': 0.2607, 'grad_norm': 8.376482009887695, 'learning_rate': 4.0554539914686166e-05, 'epoch': 0.57}


 19%|█▉        | 1250/6564 [31:41<1:42:35,  1.16s/it]

{'loss': 0.3043, 'grad_norm': 10.221869468688965, 'learning_rate': 4.047836684948203e-05, 'epoch': 0.57}


 19%|█▉        | 1260/6564 [31:53<1:44:03,  1.18s/it]

{'loss': 0.2191, 'grad_norm': 5.041331768035889, 'learning_rate': 4.040219378427788e-05, 'epoch': 0.58}


 19%|█▉        | 1270/6564 [32:04<1:42:10,  1.16s/it]

{'loss': 0.1756, 'grad_norm': 3.2296817302703857, 'learning_rate': 4.0326020719073734e-05, 'epoch': 0.58}


 20%|█▉        | 1280/6564 [32:16<1:42:21,  1.16s/it]

{'loss': 0.3598, 'grad_norm': 4.030505657196045, 'learning_rate': 4.0249847653869595e-05, 'epoch': 0.59}


 20%|█▉        | 1290/6564 [32:28<1:41:37,  1.16s/it]

{'loss': 0.1305, 'grad_norm': 6.232932090759277, 'learning_rate': 4.017367458866545e-05, 'epoch': 0.59}


 20%|█▉        | 1300/6564 [32:39<1:41:21,  1.16s/it]

{'loss': 0.1806, 'grad_norm': 7.98593282699585, 'learning_rate': 4.009750152346131e-05, 'epoch': 0.59}


 20%|█▉        | 1310/6564 [32:51<1:41:59,  1.16s/it]

{'loss': 0.2634, 'grad_norm': 1.1696819067001343, 'learning_rate': 4.002132845825716e-05, 'epoch': 0.6}


 20%|██        | 1320/6564 [33:02<1:41:48,  1.16s/it]

{'loss': 0.3164, 'grad_norm': 7.769529819488525, 'learning_rate': 3.994515539305302e-05, 'epoch': 0.6}


 20%|██        | 1330/6564 [33:14<1:40:40,  1.15s/it]

{'loss': 0.2178, 'grad_norm': 2.1690797805786133, 'learning_rate': 3.986898232784888e-05, 'epoch': 0.61}


 20%|██        | 1340/6564 [33:25<1:39:56,  1.15s/it]

{'loss': 0.2398, 'grad_norm': 4.143704414367676, 'learning_rate': 3.979280926264473e-05, 'epoch': 0.61}


 21%|██        | 1350/6564 [33:37<1:40:28,  1.16s/it]

{'loss': 0.2096, 'grad_norm': 2.0510096549987793, 'learning_rate': 3.9716636197440585e-05, 'epoch': 0.62}


 21%|██        | 1360/6564 [33:49<1:40:51,  1.16s/it]

{'loss': 0.2686, 'grad_norm': 7.217216491699219, 'learning_rate': 3.9640463132236446e-05, 'epoch': 0.62}


 21%|██        | 1370/6564 [34:00<1:39:57,  1.15s/it]

{'loss': 0.2635, 'grad_norm': 9.964211463928223, 'learning_rate': 3.95642900670323e-05, 'epoch': 0.63}


 21%|██        | 1380/6564 [34:12<1:41:39,  1.18s/it]

{'loss': 0.2421, 'grad_norm': 1.6370264291763306, 'learning_rate': 3.948811700182815e-05, 'epoch': 0.63}


 21%|██        | 1390/6564 [34:24<1:40:57,  1.17s/it]

{'loss': 0.1648, 'grad_norm': 1.3885245323181152, 'learning_rate': 3.9411943936624014e-05, 'epoch': 0.64}


 21%|██▏       | 1400/6564 [34:35<1:40:25,  1.17s/it]

{'loss': 0.3006, 'grad_norm': 9.01548957824707, 'learning_rate': 3.933577087141987e-05, 'epoch': 0.64}


 21%|██▏       | 1410/6564 [34:47<1:40:08,  1.17s/it]

{'loss': 0.225, 'grad_norm': 11.126134872436523, 'learning_rate': 3.925959780621573e-05, 'epoch': 0.64}


 22%|██▏       | 1420/6564 [34:59<1:40:26,  1.17s/it]

{'loss': 0.2882, 'grad_norm': 9.469328880310059, 'learning_rate': 3.918342474101158e-05, 'epoch': 0.65}


 22%|██▏       | 1430/6564 [35:10<1:39:18,  1.16s/it]

{'loss': 0.2517, 'grad_norm': 2.4000632762908936, 'learning_rate': 3.9107251675807436e-05, 'epoch': 0.65}


 22%|██▏       | 1440/6564 [35:22<1:38:49,  1.16s/it]

{'loss': 0.2593, 'grad_norm': 9.950848579406738, 'learning_rate': 3.9031078610603297e-05, 'epoch': 0.66}


 22%|██▏       | 1450/6564 [35:34<1:38:22,  1.15s/it]

{'loss': 0.1847, 'grad_norm': 5.771816253662109, 'learning_rate': 3.895490554539915e-05, 'epoch': 0.66}


 22%|██▏       | 1460/6564 [35:45<1:38:55,  1.16s/it]

{'loss': 0.2179, 'grad_norm': 8.052103996276855, 'learning_rate': 3.8878732480195004e-05, 'epoch': 0.67}


 22%|██▏       | 1470/6564 [35:57<1:37:36,  1.15s/it]

{'loss': 0.3652, 'grad_norm': 8.800811767578125, 'learning_rate': 3.8802559414990865e-05, 'epoch': 0.67}


 23%|██▎       | 1480/6564 [36:08<1:39:10,  1.17s/it]

{'loss': 0.2271, 'grad_norm': 2.9849462509155273, 'learning_rate': 3.872638634978672e-05, 'epoch': 0.68}


 23%|██▎       | 1490/6564 [36:20<1:38:29,  1.16s/it]

{'loss': 0.24, 'grad_norm': 5.798781871795654, 'learning_rate': 3.865021328458258e-05, 'epoch': 0.68}


 23%|██▎       | 1500/6564 [36:32<1:37:55,  1.16s/it]

{'loss': 0.1279, 'grad_norm': 1.2784970998764038, 'learning_rate': 3.857404021937843e-05, 'epoch': 0.69}


 23%|██▎       | 1510/6564 [36:45<1:38:01,  1.16s/it]

{'loss': 0.1681, 'grad_norm': 0.2257666438817978, 'learning_rate': 3.849786715417429e-05, 'epoch': 0.69}


 23%|██▎       | 1520/6564 [36:56<1:38:14,  1.17s/it]

{'loss': 0.3027, 'grad_norm': 0.31386351585388184, 'learning_rate': 3.842169408897014e-05, 'epoch': 0.69}


 23%|██▎       | 1530/6564 [37:08<1:37:43,  1.16s/it]

{'loss': 0.2382, 'grad_norm': 7.253026008605957, 'learning_rate': 3.8345521023765994e-05, 'epoch': 0.7}


 23%|██▎       | 1540/6564 [37:20<1:37:57,  1.17s/it]

{'loss': 0.2208, 'grad_norm': 6.765403747558594, 'learning_rate': 3.8269347958561855e-05, 'epoch': 0.7}


 24%|██▎       | 1550/6564 [37:31<1:37:05,  1.16s/it]

{'loss': 0.1787, 'grad_norm': 2.894822835922241, 'learning_rate': 3.819317489335771e-05, 'epoch': 0.71}


 24%|██▍       | 1560/6564 [37:43<1:37:13,  1.17s/it]

{'loss': 0.2498, 'grad_norm': 4.324819087982178, 'learning_rate': 3.811700182815356e-05, 'epoch': 0.71}


 24%|██▍       | 1570/6564 [37:55<1:37:23,  1.17s/it]

{'loss': 0.2655, 'grad_norm': 4.715109348297119, 'learning_rate': 3.804082876294942e-05, 'epoch': 0.72}


 24%|██▍       | 1580/6564 [38:06<1:36:32,  1.16s/it]

{'loss': 0.2054, 'grad_norm': 5.74821662902832, 'learning_rate': 3.796465569774528e-05, 'epoch': 0.72}


 24%|██▍       | 1590/6564 [38:18<1:35:57,  1.16s/it]

{'loss': 0.2506, 'grad_norm': 2.9920661449432373, 'learning_rate': 3.788848263254113e-05, 'epoch': 0.73}


 24%|██▍       | 1600/6564 [38:29<1:36:59,  1.17s/it]

{'loss': 0.2066, 'grad_norm': 7.905521869659424, 'learning_rate': 3.781230956733699e-05, 'epoch': 0.73}


 25%|██▍       | 1610/6564 [38:41<1:36:30,  1.17s/it]

{'loss': 0.3132, 'grad_norm': 2.276318311691284, 'learning_rate': 3.7736136502132845e-05, 'epoch': 0.74}


 25%|██▍       | 1620/6564 [38:53<1:34:47,  1.15s/it]

{'loss': 0.2267, 'grad_norm': 3.308094024658203, 'learning_rate': 3.7659963436928706e-05, 'epoch': 0.74}


 25%|██▍       | 1630/6564 [39:04<1:36:35,  1.17s/it]

{'loss': 0.1927, 'grad_norm': 5.3849616050720215, 'learning_rate': 3.758379037172456e-05, 'epoch': 0.74}


 25%|██▍       | 1640/6564 [39:16<1:36:19,  1.17s/it]

{'loss': 0.155, 'grad_norm': 5.801999568939209, 'learning_rate': 3.750761730652041e-05, 'epoch': 0.75}


 25%|██▌       | 1650/6564 [39:28<1:36:00,  1.17s/it]

{'loss': 0.2803, 'grad_norm': 4.3402605056762695, 'learning_rate': 3.7431444241316274e-05, 'epoch': 0.75}


 25%|██▌       | 1660/6564 [39:39<1:34:15,  1.15s/it]

{'loss': 0.2167, 'grad_norm': 4.600078105926514, 'learning_rate': 3.735527117611213e-05, 'epoch': 0.76}


 25%|██▌       | 1670/6564 [39:51<1:33:52,  1.15s/it]

{'loss': 0.2537, 'grad_norm': 7.123739719390869, 'learning_rate': 3.727909811090798e-05, 'epoch': 0.76}


 26%|██▌       | 1680/6564 [40:03<1:34:30,  1.16s/it]

{'loss': 0.2445, 'grad_norm': 5.118649482727051, 'learning_rate': 3.720292504570384e-05, 'epoch': 0.77}


 26%|██▌       | 1690/6564 [40:14<1:34:01,  1.16s/it]

{'loss': 0.2486, 'grad_norm': 4.128127098083496, 'learning_rate': 3.7126751980499696e-05, 'epoch': 0.77}


 26%|██▌       | 1700/6564 [40:26<1:34:17,  1.16s/it]

{'loss': 0.2148, 'grad_norm': 1.335541009902954, 'learning_rate': 3.705057891529555e-05, 'epoch': 0.78}


 26%|██▌       | 1710/6564 [40:37<1:33:27,  1.16s/it]

{'loss': 0.1618, 'grad_norm': 9.164387702941895, 'learning_rate': 3.697440585009141e-05, 'epoch': 0.78}


 26%|██▌       | 1720/6564 [40:49<1:34:05,  1.17s/it]

{'loss': 0.2621, 'grad_norm': 2.6553356647491455, 'learning_rate': 3.6898232784887264e-05, 'epoch': 0.79}


 26%|██▋       | 1730/6564 [41:01<1:33:08,  1.16s/it]

{'loss': 0.278, 'grad_norm': 5.470137596130371, 'learning_rate': 3.6822059719683125e-05, 'epoch': 0.79}


 27%|██▋       | 1740/6564 [41:12<1:33:16,  1.16s/it]

{'loss': 0.2352, 'grad_norm': 4.401906490325928, 'learning_rate': 3.674588665447898e-05, 'epoch': 0.8}


 27%|██▋       | 1750/6564 [41:24<1:33:58,  1.17s/it]

{'loss': 0.1984, 'grad_norm': 1.937485933303833, 'learning_rate': 3.666971358927483e-05, 'epoch': 0.8}


 27%|██▋       | 1760/6564 [41:36<1:33:22,  1.17s/it]

{'loss': 0.1582, 'grad_norm': 7.768233776092529, 'learning_rate': 3.659354052407069e-05, 'epoch': 0.8}


 27%|██▋       | 1770/6564 [41:47<1:32:28,  1.16s/it]

{'loss': 0.2077, 'grad_norm': 6.539453029632568, 'learning_rate': 3.651736745886655e-05, 'epoch': 0.81}


 27%|██▋       | 1780/6564 [41:59<1:31:40,  1.15s/it]

{'loss': 0.1614, 'grad_norm': 7.454543113708496, 'learning_rate': 3.64411943936624e-05, 'epoch': 0.81}


 27%|██▋       | 1790/6564 [42:10<1:32:30,  1.16s/it]

{'loss': 0.288, 'grad_norm': 12.600789070129395, 'learning_rate': 3.636502132845826e-05, 'epoch': 0.82}


 27%|██▋       | 1800/6564 [42:22<1:32:59,  1.17s/it]

{'loss': 0.2903, 'grad_norm': 8.3849458694458, 'learning_rate': 3.6288848263254115e-05, 'epoch': 0.82}


 28%|██▊       | 1810/6564 [42:34<1:33:20,  1.18s/it]

{'loss': 0.3142, 'grad_norm': 7.899462699890137, 'learning_rate': 3.6212675198049976e-05, 'epoch': 0.83}


 28%|██▊       | 1820/6564 [42:46<1:32:49,  1.17s/it]

{'loss': 0.1552, 'grad_norm': 3.740419626235962, 'learning_rate': 3.613650213284583e-05, 'epoch': 0.83}


 28%|██▊       | 1830/6564 [42:57<1:31:30,  1.16s/it]

{'loss': 0.2403, 'grad_norm': 3.6767795085906982, 'learning_rate': 3.606032906764168e-05, 'epoch': 0.84}


 28%|██▊       | 1840/6564 [43:09<1:31:50,  1.17s/it]

{'loss': 0.2204, 'grad_norm': 5.886195659637451, 'learning_rate': 3.5984156002437544e-05, 'epoch': 0.84}


 28%|██▊       | 1850/6564 [43:21<1:31:46,  1.17s/it]

{'loss': 0.2333, 'grad_norm': 4.600228786468506, 'learning_rate': 3.59079829372334e-05, 'epoch': 0.85}


 28%|██▊       | 1860/6564 [43:32<1:30:08,  1.15s/it]

{'loss': 0.2889, 'grad_norm': 8.411650657653809, 'learning_rate': 3.583180987202925e-05, 'epoch': 0.85}


 28%|██▊       | 1870/6564 [43:44<1:30:04,  1.15s/it]

{'loss': 0.206, 'grad_norm': 0.6866732239723206, 'learning_rate': 3.575563680682511e-05, 'epoch': 0.85}


 29%|██▊       | 1880/6564 [43:56<1:30:51,  1.16s/it]

{'loss': 0.3009, 'grad_norm': 15.311250686645508, 'learning_rate': 3.5679463741620966e-05, 'epoch': 0.86}


 29%|██▉       | 1890/6564 [44:07<1:30:34,  1.16s/it]

{'loss': 0.2067, 'grad_norm': 0.38857781887054443, 'learning_rate': 3.560329067641682e-05, 'epoch': 0.86}


 29%|██▉       | 1900/6564 [44:19<1:31:04,  1.17s/it]

{'loss': 0.2509, 'grad_norm': 6.75197172164917, 'learning_rate': 3.552711761121268e-05, 'epoch': 0.87}


 29%|██▉       | 1910/6564 [44:30<1:29:44,  1.16s/it]

{'loss': 0.2774, 'grad_norm': 1.0758992433547974, 'learning_rate': 3.5450944546008534e-05, 'epoch': 0.87}


 29%|██▉       | 1920/6564 [44:42<1:29:23,  1.15s/it]

{'loss': 0.2691, 'grad_norm': 7.623169422149658, 'learning_rate': 3.5374771480804395e-05, 'epoch': 0.88}


 29%|██▉       | 1930/6564 [44:54<1:29:13,  1.16s/it]

{'loss': 0.2745, 'grad_norm': 5.922852516174316, 'learning_rate': 3.529859841560025e-05, 'epoch': 0.88}


 30%|██▉       | 1940/6564 [45:05<1:29:50,  1.17s/it]

{'loss': 0.2667, 'grad_norm': 4.369819641113281, 'learning_rate': 3.52224253503961e-05, 'epoch': 0.89}


 30%|██▉       | 1950/6564 [45:17<1:30:02,  1.17s/it]

{'loss': 0.1534, 'grad_norm': 0.6020002961158752, 'learning_rate': 3.5146252285191956e-05, 'epoch': 0.89}


 30%|██▉       | 1960/6564 [45:29<1:30:05,  1.17s/it]

{'loss': 0.2662, 'grad_norm': 2.8985161781311035, 'learning_rate': 3.507007921998781e-05, 'epoch': 0.9}


 30%|███       | 1970/6564 [45:40<1:29:23,  1.17s/it]

{'loss': 0.2022, 'grad_norm': 8.507351875305176, 'learning_rate': 3.499390615478367e-05, 'epoch': 0.9}


 30%|███       | 1980/6564 [45:52<1:28:37,  1.16s/it]

{'loss': 0.1875, 'grad_norm': 4.8701348304748535, 'learning_rate': 3.4917733089579524e-05, 'epoch': 0.9}


 30%|███       | 1990/6564 [46:04<1:29:26,  1.17s/it]

{'loss': 0.1807, 'grad_norm': 3.545762300491333, 'learning_rate': 3.484156002437538e-05, 'epoch': 0.91}


 30%|███       | 2000/6564 [46:15<1:30:08,  1.19s/it]

{'loss': 0.1928, 'grad_norm': 5.509070873260498, 'learning_rate': 3.476538695917124e-05, 'epoch': 0.91}


 31%|███       | 2010/6564 [46:28<1:28:54,  1.17s/it]

{'loss': 0.1459, 'grad_norm': 7.9765801429748535, 'learning_rate': 3.468921389396709e-05, 'epoch': 0.92}


 31%|███       | 2020/6564 [46:40<1:27:36,  1.16s/it]

{'loss': 0.2535, 'grad_norm': 4.215446472167969, 'learning_rate': 3.4613040828762946e-05, 'epoch': 0.92}


 31%|███       | 2030/6564 [46:52<1:33:12,  1.23s/it]

{'loss': 0.2077, 'grad_norm': 3.262648105621338, 'learning_rate': 3.453686776355881e-05, 'epoch': 0.93}


 31%|███       | 2040/6564 [47:04<1:28:40,  1.18s/it]

{'loss': 0.2791, 'grad_norm': 3.915167808532715, 'learning_rate': 3.446069469835466e-05, 'epoch': 0.93}


 31%|███       | 2050/6564 [47:16<1:28:16,  1.17s/it]

{'loss': 0.1433, 'grad_norm': 2.4946658611297607, 'learning_rate': 3.438452163315052e-05, 'epoch': 0.94}


 31%|███▏      | 2060/6564 [47:27<1:27:01,  1.16s/it]

{'loss': 0.184, 'grad_norm': 9.28756046295166, 'learning_rate': 3.4308348567946375e-05, 'epoch': 0.94}


 32%|███▏      | 2070/6564 [47:39<1:26:54,  1.16s/it]

{'loss': 0.1685, 'grad_norm': 6.726015567779541, 'learning_rate': 3.423217550274223e-05, 'epoch': 0.95}


 32%|███▏      | 2080/6564 [47:51<1:26:27,  1.16s/it]

{'loss': 0.257, 'grad_norm': 14.182364463806152, 'learning_rate': 3.415600243753809e-05, 'epoch': 0.95}


 32%|███▏      | 2090/6564 [48:02<1:26:01,  1.15s/it]

{'loss': 0.2344, 'grad_norm': 11.736066818237305, 'learning_rate': 3.407982937233394e-05, 'epoch': 0.96}


 32%|███▏      | 2100/6564 [48:14<1:25:43,  1.15s/it]

{'loss': 0.1356, 'grad_norm': 4.600945472717285, 'learning_rate': 3.40036563071298e-05, 'epoch': 0.96}


 32%|███▏      | 2110/6564 [48:25<1:25:33,  1.15s/it]

{'loss': 0.2505, 'grad_norm': 5.267944812774658, 'learning_rate': 3.392748324192566e-05, 'epoch': 0.96}


 32%|███▏      | 2120/6564 [48:37<1:25:37,  1.16s/it]

{'loss': 0.2684, 'grad_norm': 7.904533863067627, 'learning_rate': 3.385131017672151e-05, 'epoch': 0.97}


 32%|███▏      | 2130/6564 [48:48<1:25:29,  1.16s/it]

{'loss': 0.1771, 'grad_norm': 2.537166118621826, 'learning_rate': 3.377513711151737e-05, 'epoch': 0.97}


 33%|███▎      | 2140/6564 [49:00<1:24:54,  1.15s/it]

{'loss': 0.2104, 'grad_norm': 7.179216384887695, 'learning_rate': 3.3698964046313226e-05, 'epoch': 0.98}


 33%|███▎      | 2150/6564 [49:11<1:24:50,  1.15s/it]

{'loss': 0.144, 'grad_norm': 1.0901341438293457, 'learning_rate': 3.362279098110908e-05, 'epoch': 0.98}


 33%|███▎      | 2160/6564 [49:23<1:24:39,  1.15s/it]

{'loss': 0.2919, 'grad_norm': 6.92612886428833, 'learning_rate': 3.354661791590494e-05, 'epoch': 0.99}


 33%|███▎      | 2170/6564 [49:34<1:24:18,  1.15s/it]

{'loss': 0.2272, 'grad_norm': 9.368473052978516, 'learning_rate': 3.3470444850700794e-05, 'epoch': 0.99}


 33%|███▎      | 2180/6564 [49:46<1:24:15,  1.15s/it]

{'loss': 0.2317, 'grad_norm': 4.071686267852783, 'learning_rate': 3.339427178549665e-05, 'epoch': 1.0}


 33%|███▎      | 2190/6564 [49:58<1:26:16,  1.18s/it]

{'loss': 0.21, 'grad_norm': 5.249389171600342, 'learning_rate': 3.331809872029251e-05, 'epoch': 1.0}


 34%|███▎      | 2200/6564 [50:10<1:27:50,  1.21s/it]

{'loss': 0.0856, 'grad_norm': 0.37125709652900696, 'learning_rate': 3.324192565508836e-05, 'epoch': 1.01}


 34%|███▎      | 2210/6564 [50:21<1:25:59,  1.18s/it]

{'loss': 0.1567, 'grad_norm': 0.8401703834533691, 'learning_rate': 3.3165752589884216e-05, 'epoch': 1.01}


 34%|███▍      | 2220/6564 [50:33<1:24:54,  1.17s/it]

{'loss': 0.0891, 'grad_norm': 0.2362436205148697, 'learning_rate': 3.3089579524680077e-05, 'epoch': 1.01}


 34%|███▍      | 2230/6564 [50:45<1:24:54,  1.18s/it]

{'loss': 0.0784, 'grad_norm': 0.12791508436203003, 'learning_rate': 3.301340645947593e-05, 'epoch': 1.02}


 34%|███▍      | 2240/6564 [50:57<1:24:40,  1.17s/it]

{'loss': 0.0558, 'grad_norm': 0.16592560708522797, 'learning_rate': 3.293723339427179e-05, 'epoch': 1.02}


 34%|███▍      | 2250/6564 [51:09<1:24:31,  1.18s/it]

{'loss': 0.1549, 'grad_norm': 0.038741789758205414, 'learning_rate': 3.2861060329067645e-05, 'epoch': 1.03}


 34%|███▍      | 2260/6564 [51:20<1:24:50,  1.18s/it]

{'loss': 0.0878, 'grad_norm': 0.3169513940811157, 'learning_rate': 3.27848872638635e-05, 'epoch': 1.03}


 35%|███▍      | 2270/6564 [51:32<1:23:23,  1.17s/it]

{'loss': 0.1436, 'grad_norm': 9.612409591674805, 'learning_rate': 3.270871419865936e-05, 'epoch': 1.04}


 35%|███▍      | 2280/6564 [51:44<1:22:55,  1.16s/it]

{'loss': 0.0811, 'grad_norm': 0.03675293177366257, 'learning_rate': 3.263254113345521e-05, 'epoch': 1.04}


 35%|███▍      | 2290/6564 [51:55<1:22:17,  1.16s/it]

{'loss': 0.0631, 'grad_norm': 0.6477396488189697, 'learning_rate': 3.255636806825107e-05, 'epoch': 1.05}


 35%|███▌      | 2300/6564 [52:07<1:24:07,  1.18s/it]

{'loss': 0.2014, 'grad_norm': 0.1669951230287552, 'learning_rate': 3.248019500304693e-05, 'epoch': 1.05}


 35%|███▌      | 2310/6564 [52:19<1:23:03,  1.17s/it]

{'loss': 0.0917, 'grad_norm': 17.573139190673828, 'learning_rate': 3.240402193784278e-05, 'epoch': 1.06}


 35%|███▌      | 2320/6564 [52:31<1:23:28,  1.18s/it]

{'loss': 0.1323, 'grad_norm': 5.490989685058594, 'learning_rate': 3.232784887263864e-05, 'epoch': 1.06}


 35%|███▌      | 2330/6564 [52:43<1:23:11,  1.18s/it]

{'loss': 0.2268, 'grad_norm': 2.3314027786254883, 'learning_rate': 3.2251675807434496e-05, 'epoch': 1.06}


 36%|███▌      | 2340/6564 [52:54<1:23:51,  1.19s/it]

{'loss': 0.0798, 'grad_norm': 0.13614565134048462, 'learning_rate': 3.217550274223035e-05, 'epoch': 1.07}


 36%|███▌      | 2350/6564 [53:06<1:23:38,  1.19s/it]

{'loss': 0.0429, 'grad_norm': 2.046937942504883, 'learning_rate': 3.209932967702621e-05, 'epoch': 1.07}


 36%|███▌      | 2360/6564 [53:18<1:21:15,  1.16s/it]

{'loss': 0.1184, 'grad_norm': 0.21694351732730865, 'learning_rate': 3.2023156611822064e-05, 'epoch': 1.08}


 36%|███▌      | 2370/6564 [53:30<1:24:31,  1.21s/it]

{'loss': 0.047, 'grad_norm': 0.05405857041478157, 'learning_rate': 3.194698354661792e-05, 'epoch': 1.08}


 36%|███▋      | 2380/6564 [53:42<1:21:45,  1.17s/it]

{'loss': 0.1645, 'grad_norm': 2.506153106689453, 'learning_rate': 3.187081048141377e-05, 'epoch': 1.09}


 36%|███▋      | 2390/6564 [53:53<1:21:14,  1.17s/it]

{'loss': 0.225, 'grad_norm': 0.21679382026195526, 'learning_rate': 3.1794637416209625e-05, 'epoch': 1.09}


 37%|███▋      | 2400/6564 [54:05<1:20:49,  1.16s/it]

{'loss': 0.1144, 'grad_norm': 1.3271936178207397, 'learning_rate': 3.1718464351005486e-05, 'epoch': 1.1}


 37%|███▋      | 2410/6564 [54:17<1:20:46,  1.17s/it]

{'loss': 0.1096, 'grad_norm': 2.8806052207946777, 'learning_rate': 3.164229128580134e-05, 'epoch': 1.1}


 37%|███▋      | 2420/6564 [54:28<1:20:49,  1.17s/it]

{'loss': 0.1012, 'grad_norm': 1.2271696329116821, 'learning_rate': 3.1566118220597193e-05, 'epoch': 1.11}


 37%|███▋      | 2430/6564 [54:40<1:20:24,  1.17s/it]

{'loss': 0.1584, 'grad_norm': 0.13479292392730713, 'learning_rate': 3.1489945155393054e-05, 'epoch': 1.11}


 37%|███▋      | 2440/6564 [54:52<1:21:09,  1.18s/it]

{'loss': 0.1932, 'grad_norm': 0.09088807553052902, 'learning_rate': 3.141377209018891e-05, 'epoch': 1.12}


 37%|███▋      | 2450/6564 [55:04<1:20:17,  1.17s/it]

{'loss': 0.2049, 'grad_norm': 17.222909927368164, 'learning_rate': 3.133759902498477e-05, 'epoch': 1.12}


 37%|███▋      | 2460/6564 [55:15<1:20:39,  1.18s/it]

{'loss': 0.1153, 'grad_norm': 2.411273241043091, 'learning_rate': 3.126142595978062e-05, 'epoch': 1.12}


 38%|███▊      | 2470/6564 [55:27<1:19:34,  1.17s/it]

{'loss': 0.2565, 'grad_norm': 11.40098762512207, 'learning_rate': 3.1185252894576476e-05, 'epoch': 1.13}


 38%|███▊      | 2480/6564 [55:39<1:19:23,  1.17s/it]

{'loss': 0.1172, 'grad_norm': 5.308154106140137, 'learning_rate': 3.110907982937234e-05, 'epoch': 1.13}


 38%|███▊      | 2490/6564 [55:50<1:19:09,  1.17s/it]

{'loss': 0.1557, 'grad_norm': 7.218357563018799, 'learning_rate': 3.103290676416819e-05, 'epoch': 1.14}


 38%|███▊      | 2500/6564 [56:02<1:18:23,  1.16s/it]

{'loss': 0.0923, 'grad_norm': 4.979889392852783, 'learning_rate': 3.0956733698964044e-05, 'epoch': 1.14}


 38%|███▊      | 2510/6564 [56:15<1:20:02,  1.18s/it]

{'loss': 0.1676, 'grad_norm': 0.6581547260284424, 'learning_rate': 3.0880560633759905e-05, 'epoch': 1.15}


 38%|███▊      | 2520/6564 [56:26<1:18:02,  1.16s/it]

{'loss': 0.0696, 'grad_norm': 0.2143227607011795, 'learning_rate': 3.080438756855576e-05, 'epoch': 1.15}


 39%|███▊      | 2530/6564 [56:38<1:18:35,  1.17s/it]

{'loss': 0.0852, 'grad_norm': 2.37446928024292, 'learning_rate': 3.072821450335161e-05, 'epoch': 1.16}


 39%|███▊      | 2540/6564 [56:50<1:18:16,  1.17s/it]

{'loss': 0.1668, 'grad_norm': 0.6923195719718933, 'learning_rate': 3.065204143814747e-05, 'epoch': 1.16}


 39%|███▉      | 2550/6564 [57:01<1:17:58,  1.17s/it]

{'loss': 0.0715, 'grad_norm': 2.0592823028564453, 'learning_rate': 3.057586837294333e-05, 'epoch': 1.17}


 39%|███▉      | 2560/6564 [57:13<1:18:12,  1.17s/it]

{'loss': 0.0619, 'grad_norm': 0.6407021880149841, 'learning_rate': 3.0499695307739184e-05, 'epoch': 1.17}


 39%|███▉      | 2570/6564 [57:25<1:18:37,  1.18s/it]

{'loss': 0.229, 'grad_norm': 11.317167282104492, 'learning_rate': 3.042352224253504e-05, 'epoch': 1.17}


 39%|███▉      | 2580/6564 [57:37<1:18:30,  1.18s/it]

{'loss': 0.1839, 'grad_norm': 7.385293483734131, 'learning_rate': 3.03473491773309e-05, 'epoch': 1.18}


 39%|███▉      | 2590/6564 [57:49<1:20:20,  1.21s/it]

{'loss': 0.1443, 'grad_norm': 10.888921737670898, 'learning_rate': 3.0271176112126752e-05, 'epoch': 1.18}


 40%|███▉      | 2600/6564 [58:01<1:18:36,  1.19s/it]

{'loss': 0.1058, 'grad_norm': 1.7857502698898315, 'learning_rate': 3.019500304692261e-05, 'epoch': 1.19}


 40%|███▉      | 2610/6564 [58:13<1:18:48,  1.20s/it]

{'loss': 0.1653, 'grad_norm': 14.266030311584473, 'learning_rate': 3.0118829981718467e-05, 'epoch': 1.19}


 40%|███▉      | 2620/6564 [58:24<1:17:23,  1.18s/it]

{'loss': 0.1631, 'grad_norm': 0.11740263551473618, 'learning_rate': 3.0042656916514324e-05, 'epoch': 1.2}


 40%|████      | 2630/6564 [58:36<1:18:11,  1.19s/it]

{'loss': 0.0566, 'grad_norm': 5.54660701751709, 'learning_rate': 2.9966483851310178e-05, 'epoch': 1.2}


 40%|████      | 2640/6564 [58:48<1:17:00,  1.18s/it]

{'loss': 0.1118, 'grad_norm': 6.508388519287109, 'learning_rate': 2.9890310786106035e-05, 'epoch': 1.21}


 40%|████      | 2650/6564 [59:00<1:16:07,  1.17s/it]

{'loss': 0.0901, 'grad_norm': 0.3256705105304718, 'learning_rate': 2.9814137720901892e-05, 'epoch': 1.21}


 41%|████      | 2660/6564 [59:12<1:16:32,  1.18s/it]

{'loss': 0.1716, 'grad_norm': 0.1785009354352951, 'learning_rate': 2.973796465569775e-05, 'epoch': 1.22}


 41%|████      | 2670/6564 [59:23<1:15:36,  1.17s/it]

{'loss': 0.1455, 'grad_norm': 10.67719554901123, 'learning_rate': 2.9661791590493603e-05, 'epoch': 1.22}


 41%|████      | 2680/6564 [59:35<1:15:21,  1.16s/it]

{'loss': 0.1283, 'grad_norm': 12.157363891601562, 'learning_rate': 2.958561852528946e-05, 'epoch': 1.22}


 41%|████      | 2690/6564 [59:47<1:15:08,  1.16s/it]

{'loss': 0.1865, 'grad_norm': 5.263129234313965, 'learning_rate': 2.9509445460085317e-05, 'epoch': 1.23}


 41%|████      | 2700/6564 [59:58<1:17:07,  1.20s/it]

{'loss': 0.0851, 'grad_norm': 0.17981913685798645, 'learning_rate': 2.943327239488117e-05, 'epoch': 1.23}


 41%|████▏     | 2710/6564 [1:00:10<1:15:14,  1.17s/it]

{'loss': 0.1007, 'grad_norm': 0.7467876672744751, 'learning_rate': 2.935709932967703e-05, 'epoch': 1.24}


 41%|████▏     | 2720/6564 [1:00:22<1:15:10,  1.17s/it]

{'loss': 0.1345, 'grad_norm': 6.544661998748779, 'learning_rate': 2.9280926264472886e-05, 'epoch': 1.24}


 42%|████▏     | 2730/6564 [1:00:34<1:14:20,  1.16s/it]

{'loss': 0.2424, 'grad_norm': 4.307293891906738, 'learning_rate': 2.9204753199268743e-05, 'epoch': 1.25}


 42%|████▏     | 2740/6564 [1:00:45<1:14:57,  1.18s/it]

{'loss': 0.091, 'grad_norm': 0.2715713381767273, 'learning_rate': 2.9128580134064597e-05, 'epoch': 1.25}


 42%|████▏     | 2750/6564 [1:00:57<1:14:55,  1.18s/it]

{'loss': 0.1096, 'grad_norm': 0.1337081342935562, 'learning_rate': 2.9052407068860454e-05, 'epoch': 1.26}


 42%|████▏     | 2760/6564 [1:01:09<1:15:42,  1.19s/it]

{'loss': 0.0711, 'grad_norm': 11.27046012878418, 'learning_rate': 2.897623400365631e-05, 'epoch': 1.26}


 42%|████▏     | 2770/6564 [1:01:21<1:14:25,  1.18s/it]

{'loss': 0.1568, 'grad_norm': 0.0594547875225544, 'learning_rate': 2.8900060938452168e-05, 'epoch': 1.27}


 42%|████▏     | 2780/6564 [1:01:32<1:13:15,  1.16s/it]

{'loss': 0.0785, 'grad_norm': 12.853815078735352, 'learning_rate': 2.8823887873248022e-05, 'epoch': 1.27}


 43%|████▎     | 2790/6564 [1:01:44<1:13:08,  1.16s/it]

{'loss': 0.2069, 'grad_norm': 7.644510746002197, 'learning_rate': 2.874771480804388e-05, 'epoch': 1.28}


 43%|████▎     | 2800/6564 [1:01:56<1:13:21,  1.17s/it]

{'loss': 0.1855, 'grad_norm': 1.0332313776016235, 'learning_rate': 2.8671541742839736e-05, 'epoch': 1.28}


 43%|████▎     | 2810/6564 [1:02:07<1:13:01,  1.17s/it]

{'loss': 0.0951, 'grad_norm': 3.629451036453247, 'learning_rate': 2.8595368677635587e-05, 'epoch': 1.28}


 43%|████▎     | 2820/6564 [1:02:19<1:12:42,  1.17s/it]

{'loss': 0.0572, 'grad_norm': 2.603800058364868, 'learning_rate': 2.8519195612431444e-05, 'epoch': 1.29}


 43%|████▎     | 2830/6564 [1:02:31<1:11:59,  1.16s/it]

{'loss': 0.1405, 'grad_norm': 15.498551368713379, 'learning_rate': 2.8443022547227298e-05, 'epoch': 1.29}


 43%|████▎     | 2840/6564 [1:02:42<1:12:34,  1.17s/it]

{'loss': 0.1291, 'grad_norm': 2.741499185562134, 'learning_rate': 2.8366849482023155e-05, 'epoch': 1.3}


 43%|████▎     | 2850/6564 [1:02:54<1:12:05,  1.16s/it]

{'loss': 0.1385, 'grad_norm': 2.635972738265991, 'learning_rate': 2.8290676416819012e-05, 'epoch': 1.3}


 44%|████▎     | 2860/6564 [1:03:06<1:12:02,  1.17s/it]

{'loss': 0.0492, 'grad_norm': 0.23363196849822998, 'learning_rate': 2.821450335161487e-05, 'epoch': 1.31}


 44%|████▎     | 2870/6564 [1:03:18<1:12:11,  1.17s/it]

{'loss': 0.1662, 'grad_norm': 5.4561920166015625, 'learning_rate': 2.8138330286410723e-05, 'epoch': 1.31}


 44%|████▍     | 2880/6564 [1:03:29<1:11:10,  1.16s/it]

{'loss': 0.1514, 'grad_norm': 4.514926910400391, 'learning_rate': 2.806215722120658e-05, 'epoch': 1.32}


 44%|████▍     | 2890/6564 [1:03:41<1:10:52,  1.16s/it]

{'loss': 0.2411, 'grad_norm': 2.0438904762268066, 'learning_rate': 2.7985984156002438e-05, 'epoch': 1.32}


 44%|████▍     | 2900/6564 [1:03:53<1:13:29,  1.20s/it]

{'loss': 0.0828, 'grad_norm': 1.1101844310760498, 'learning_rate': 2.7909811090798295e-05, 'epoch': 1.33}


 44%|████▍     | 2910/6564 [1:04:05<1:11:37,  1.18s/it]

{'loss': 0.2182, 'grad_norm': 1.8757838010787964, 'learning_rate': 2.783363802559415e-05, 'epoch': 1.33}


 44%|████▍     | 2920/6564 [1:04:16<1:12:41,  1.20s/it]

{'loss': 0.147, 'grad_norm': 10.066351890563965, 'learning_rate': 2.7757464960390006e-05, 'epoch': 1.33}


 45%|████▍     | 2930/6564 [1:04:28<1:12:43,  1.20s/it]

{'loss': 0.0653, 'grad_norm': 0.27781549096107483, 'learning_rate': 2.7681291895185863e-05, 'epoch': 1.34}


 45%|████▍     | 2940/6564 [1:04:40<1:10:27,  1.17s/it]

{'loss': 0.1071, 'grad_norm': 0.13621152937412262, 'learning_rate': 2.760511882998172e-05, 'epoch': 1.34}


 45%|████▍     | 2950/6564 [1:04:52<1:10:44,  1.17s/it]

{'loss': 0.0761, 'grad_norm': 5.878857612609863, 'learning_rate': 2.7528945764777574e-05, 'epoch': 1.35}


 45%|████▌     | 2960/6564 [1:05:03<1:09:49,  1.16s/it]

{'loss': 0.1635, 'grad_norm': 6.227474689483643, 'learning_rate': 2.745277269957343e-05, 'epoch': 1.35}


 45%|████▌     | 2970/6564 [1:05:15<1:10:20,  1.17s/it]

{'loss': 0.1321, 'grad_norm': 9.201252937316895, 'learning_rate': 2.737659963436929e-05, 'epoch': 1.36}


 45%|████▌     | 2980/6564 [1:05:27<1:10:23,  1.18s/it]

{'loss': 0.1735, 'grad_norm': 12.14353084564209, 'learning_rate': 2.7300426569165142e-05, 'epoch': 1.36}


 46%|████▌     | 2990/6564 [1:05:39<1:09:03,  1.16s/it]

{'loss': 0.1683, 'grad_norm': 15.056737899780273, 'learning_rate': 2.7224253503961e-05, 'epoch': 1.37}


 46%|████▌     | 3000/6564 [1:05:50<1:10:10,  1.18s/it]

{'loss': 0.1379, 'grad_norm': 5.186688423156738, 'learning_rate': 2.7148080438756857e-05, 'epoch': 1.37}


 46%|████▌     | 3010/6564 [1:06:04<1:11:57,  1.21s/it]

{'loss': 0.0842, 'grad_norm': 8.39631175994873, 'learning_rate': 2.7071907373552714e-05, 'epoch': 1.38}


 46%|████▌     | 3020/6564 [1:06:16<1:10:40,  1.20s/it]

{'loss': 0.1806, 'grad_norm': 2.580153703689575, 'learning_rate': 2.6995734308348568e-05, 'epoch': 1.38}


 46%|████▌     | 3030/6564 [1:06:27<1:09:22,  1.18s/it]

{'loss': 0.1562, 'grad_norm': 2.4276294708251953, 'learning_rate': 2.6919561243144425e-05, 'epoch': 1.38}


 46%|████▋     | 3040/6564 [1:06:39<1:08:25,  1.17s/it]

{'loss': 0.0943, 'grad_norm': 8.457245826721191, 'learning_rate': 2.6843388177940282e-05, 'epoch': 1.39}


 46%|████▋     | 3050/6564 [1:06:51<1:07:53,  1.16s/it]

{'loss': 0.0998, 'grad_norm': 1.5102890729904175, 'learning_rate': 2.676721511273614e-05, 'epoch': 1.39}


 47%|████▋     | 3060/6564 [1:07:02<1:08:32,  1.17s/it]

{'loss': 0.1579, 'grad_norm': 0.6194230914115906, 'learning_rate': 2.6691042047531993e-05, 'epoch': 1.4}


 47%|████▋     | 3070/6564 [1:07:14<1:07:23,  1.16s/it]

{'loss': 0.1255, 'grad_norm': 14.769829750061035, 'learning_rate': 2.661486898232785e-05, 'epoch': 1.4}


 47%|████▋     | 3080/6564 [1:07:26<1:08:59,  1.19s/it]

{'loss': 0.1945, 'grad_norm': 0.30124226212501526, 'learning_rate': 2.6538695917123707e-05, 'epoch': 1.41}


 47%|████▋     | 3090/6564 [1:07:37<1:07:51,  1.17s/it]

{'loss': 0.1237, 'grad_norm': 12.342453002929688, 'learning_rate': 2.6462522851919565e-05, 'epoch': 1.41}


 47%|████▋     | 3100/6564 [1:07:49<1:07:31,  1.17s/it]

{'loss': 0.1218, 'grad_norm': 12.838286399841309, 'learning_rate': 2.638634978671542e-05, 'epoch': 1.42}


 47%|████▋     | 3110/6564 [1:08:01<1:07:39,  1.18s/it]

{'loss': 0.0999, 'grad_norm': 13.04186725616455, 'learning_rate': 2.6310176721511276e-05, 'epoch': 1.42}


 48%|████▊     | 3120/6564 [1:08:13<1:07:14,  1.17s/it]

{'loss': 0.1588, 'grad_norm': 0.2426454722881317, 'learning_rate': 2.6234003656307133e-05, 'epoch': 1.43}


 48%|████▊     | 3130/6564 [1:08:24<1:07:56,  1.19s/it]

{'loss': 0.174, 'grad_norm': 17.803733825683594, 'learning_rate': 2.615783059110299e-05, 'epoch': 1.43}


 48%|████▊     | 3140/6564 [1:08:36<1:06:59,  1.17s/it]

{'loss': 0.0555, 'grad_norm': 2.5330617427825928, 'learning_rate': 2.6081657525898844e-05, 'epoch': 1.44}


 48%|████▊     | 3150/6564 [1:08:48<1:06:25,  1.17s/it]

{'loss': 0.1724, 'grad_norm': 8.09517765045166, 'learning_rate': 2.60054844606947e-05, 'epoch': 1.44}


 48%|████▊     | 3160/6564 [1:09:00<1:06:54,  1.18s/it]

{'loss': 0.1895, 'grad_norm': 4.236171245574951, 'learning_rate': 2.5929311395490558e-05, 'epoch': 1.44}


 48%|████▊     | 3170/6564 [1:09:11<1:06:06,  1.17s/it]

{'loss': 0.1041, 'grad_norm': 4.084079742431641, 'learning_rate': 2.5853138330286415e-05, 'epoch': 1.45}


 48%|████▊     | 3180/6564 [1:09:23<1:05:49,  1.17s/it]

{'loss': 0.0833, 'grad_norm': 9.42581844329834, 'learning_rate': 2.577696526508227e-05, 'epoch': 1.45}


 49%|████▊     | 3190/6564 [1:09:35<1:05:06,  1.16s/it]

{'loss': 0.0549, 'grad_norm': 10.037931442260742, 'learning_rate': 2.5700792199878127e-05, 'epoch': 1.46}


 49%|████▉     | 3200/6564 [1:09:46<1:04:51,  1.16s/it]

{'loss': 0.1832, 'grad_norm': 19.21676254272461, 'learning_rate': 2.5624619134673984e-05, 'epoch': 1.46}


 49%|████▉     | 3210/6564 [1:09:58<1:05:45,  1.18s/it]

{'loss': 0.0547, 'grad_norm': 21.309946060180664, 'learning_rate': 2.5548446069469838e-05, 'epoch': 1.47}


 49%|████▉     | 3220/6564 [1:10:10<1:05:06,  1.17s/it]

{'loss': 0.1494, 'grad_norm': 0.06049312651157379, 'learning_rate': 2.5472273004265695e-05, 'epoch': 1.47}


 49%|████▉     | 3230/6564 [1:10:22<1:05:02,  1.17s/it]

{'loss': 0.0914, 'grad_norm': 22.15172004699707, 'learning_rate': 2.5396099939061552e-05, 'epoch': 1.48}


 49%|████▉     | 3240/6564 [1:10:33<1:04:24,  1.16s/it]

{'loss': 0.1598, 'grad_norm': 0.14791692793369293, 'learning_rate': 2.5319926873857402e-05, 'epoch': 1.48}


 50%|████▉     | 3250/6564 [1:10:45<1:06:13,  1.20s/it]

{'loss': 0.1965, 'grad_norm': 12.921769142150879, 'learning_rate': 2.524375380865326e-05, 'epoch': 1.49}


 50%|████▉     | 3260/6564 [1:10:57<1:04:13,  1.17s/it]

{'loss': 0.1211, 'grad_norm': 4.284175872802734, 'learning_rate': 2.5167580743449117e-05, 'epoch': 1.49}


 50%|████▉     | 3270/6564 [1:11:08<1:04:37,  1.18s/it]

{'loss': 0.0865, 'grad_norm': 10.458927154541016, 'learning_rate': 2.509140767824497e-05, 'epoch': 1.49}


 50%|████▉     | 3280/6564 [1:11:20<1:04:05,  1.17s/it]

{'loss': 0.078, 'grad_norm': 0.6749761700630188, 'learning_rate': 2.5015234613040828e-05, 'epoch': 1.5}


 50%|█████     | 3290/6564 [1:11:32<1:04:36,  1.18s/it]

{'loss': 0.121, 'grad_norm': 0.09847043454647064, 'learning_rate': 2.493906154783669e-05, 'epoch': 1.5}


 50%|█████     | 3300/6564 [1:11:44<1:03:38,  1.17s/it]

{'loss': 0.1648, 'grad_norm': 0.043906036764383316, 'learning_rate': 2.4862888482632542e-05, 'epoch': 1.51}


 50%|█████     | 3310/6564 [1:11:55<1:02:52,  1.16s/it]

{'loss': 0.1035, 'grad_norm': 16.91029930114746, 'learning_rate': 2.47867154174284e-05, 'epoch': 1.51}


 51%|█████     | 3320/6564 [1:12:07<1:02:34,  1.16s/it]

{'loss': 0.1334, 'grad_norm': 1.1183264255523682, 'learning_rate': 2.4710542352224257e-05, 'epoch': 1.52}


 51%|█████     | 3330/6564 [1:12:18<1:02:28,  1.16s/it]

{'loss': 0.1683, 'grad_norm': 20.06360626220703, 'learning_rate': 2.4634369287020114e-05, 'epoch': 1.52}


 51%|█████     | 3340/6564 [1:12:30<1:02:06,  1.16s/it]

{'loss': 0.1491, 'grad_norm': 0.09176309406757355, 'learning_rate': 2.4558196221815968e-05, 'epoch': 1.53}


 51%|█████     | 3350/6564 [1:12:42<1:02:02,  1.16s/it]

{'loss': 0.1809, 'grad_norm': 0.49128538370132446, 'learning_rate': 2.448202315661182e-05, 'epoch': 1.53}


 51%|█████     | 3360/6564 [1:12:53<1:03:27,  1.19s/it]

{'loss': 0.1575, 'grad_norm': 10.125486373901367, 'learning_rate': 2.440585009140768e-05, 'epoch': 1.54}


 51%|█████▏    | 3370/6564 [1:13:05<1:02:11,  1.17s/it]

{'loss': 0.1701, 'grad_norm': 2.499164342880249, 'learning_rate': 2.4329677026203536e-05, 'epoch': 1.54}


 51%|█████▏    | 3380/6564 [1:13:17<1:02:48,  1.18s/it]

{'loss': 0.0914, 'grad_norm': 1.727538824081421, 'learning_rate': 2.425350396099939e-05, 'epoch': 1.54}


 52%|█████▏    | 3390/6564 [1:13:29<1:02:15,  1.18s/it]

{'loss': 0.2217, 'grad_norm': 0.2228633165359497, 'learning_rate': 2.4177330895795247e-05, 'epoch': 1.55}


 52%|█████▏    | 3400/6564 [1:13:41<1:02:24,  1.18s/it]

{'loss': 0.1341, 'grad_norm': 12.529808044433594, 'learning_rate': 2.4101157830591104e-05, 'epoch': 1.55}


 52%|█████▏    | 3410/6564 [1:13:53<1:02:50,  1.20s/it]

{'loss': 0.1596, 'grad_norm': 0.36960890889167786, 'learning_rate': 2.402498476538696e-05, 'epoch': 1.56}


 52%|█████▏    | 3420/6564 [1:14:05<1:01:51,  1.18s/it]

{'loss': 0.1244, 'grad_norm': 0.5278613567352295, 'learning_rate': 2.3948811700182815e-05, 'epoch': 1.56}


 52%|█████▏    | 3430/6564 [1:14:16<1:02:21,  1.19s/it]

{'loss': 0.089, 'grad_norm': 13.384969711303711, 'learning_rate': 2.3872638634978672e-05, 'epoch': 1.57}


 52%|█████▏    | 3440/6564 [1:14:28<1:01:25,  1.18s/it]

{'loss': 0.1819, 'grad_norm': 9.866079330444336, 'learning_rate': 2.379646556977453e-05, 'epoch': 1.57}


 53%|█████▎    | 3450/6564 [1:14:40<1:00:52,  1.17s/it]

{'loss': 0.205, 'grad_norm': 14.7227783203125, 'learning_rate': 2.3720292504570387e-05, 'epoch': 1.58}


 53%|█████▎    | 3460/6564 [1:14:52<1:00:14,  1.16s/it]

{'loss': 0.08, 'grad_norm': 0.14117130637168884, 'learning_rate': 2.364411943936624e-05, 'epoch': 1.58}


 53%|█████▎    | 3470/6564 [1:15:03<1:00:28,  1.17s/it]

{'loss': 0.087, 'grad_norm': 1.8794268369674683, 'learning_rate': 2.3567946374162098e-05, 'epoch': 1.59}


 53%|█████▎    | 3480/6564 [1:15:15<1:01:06,  1.19s/it]

{'loss': 0.1733, 'grad_norm': 1.1988705396652222, 'learning_rate': 2.3491773308957955e-05, 'epoch': 1.59}


 53%|█████▎    | 3490/6564 [1:15:27<59:41,  1.17s/it]  

{'loss': 0.1314, 'grad_norm': 0.9468510746955872, 'learning_rate': 2.341560024375381e-05, 'epoch': 1.6}


 53%|█████▎    | 3500/6564 [1:15:39<1:00:47,  1.19s/it]

{'loss': 0.1922, 'grad_norm': 17.92820930480957, 'learning_rate': 2.3339427178549666e-05, 'epoch': 1.6}


 53%|█████▎    | 3510/6564 [1:15:52<1:01:35,  1.21s/it]

{'loss': 0.1247, 'grad_norm': 0.4867855906486511, 'learning_rate': 2.3263254113345523e-05, 'epoch': 1.6}


 54%|█████▎    | 3520/6564 [1:16:04<1:00:29,  1.19s/it]

{'loss': 0.1031, 'grad_norm': 17.655536651611328, 'learning_rate': 2.318708104814138e-05, 'epoch': 1.61}


 54%|█████▍    | 3530/6564 [1:16:15<58:52,  1.16s/it]  

{'loss': 0.1811, 'grad_norm': 8.785656929016113, 'learning_rate': 2.3110907982937234e-05, 'epoch': 1.61}


 54%|█████▍    | 3540/6564 [1:16:27<59:09,  1.17s/it]  

{'loss': 0.1941, 'grad_norm': 5.033447265625, 'learning_rate': 2.303473491773309e-05, 'epoch': 1.62}


 54%|█████▍    | 3550/6564 [1:16:39<59:51,  1.19s/it]  

{'loss': 0.1708, 'grad_norm': 7.704570770263672, 'learning_rate': 2.295856185252895e-05, 'epoch': 1.62}


 54%|█████▍    | 3560/6564 [1:16:51<58:11,  1.16s/it]

{'loss': 0.1353, 'grad_norm': 12.7319917678833, 'learning_rate': 2.2882388787324806e-05, 'epoch': 1.63}


 54%|█████▍    | 3570/6564 [1:17:03<58:15,  1.17s/it]

{'loss': 0.1415, 'grad_norm': 10.197020530700684, 'learning_rate': 2.280621572212066e-05, 'epoch': 1.63}


 55%|█████▍    | 3580/6564 [1:17:14<59:02,  1.19s/it]

{'loss': 0.1339, 'grad_norm': 13.606178283691406, 'learning_rate': 2.2730042656916513e-05, 'epoch': 1.64}


 55%|█████▍    | 3590/6564 [1:17:26<58:11,  1.17s/it]  

{'loss': 0.1182, 'grad_norm': 0.9780967831611633, 'learning_rate': 2.265386959171237e-05, 'epoch': 1.64}


 55%|█████▍    | 3600/6564 [1:17:38<57:38,  1.17s/it]

{'loss': 0.1726, 'grad_norm': 3.4824013710021973, 'learning_rate': 2.2577696526508228e-05, 'epoch': 1.65}


 55%|█████▍    | 3610/6564 [1:17:50<57:34,  1.17s/it]

{'loss': 0.1693, 'grad_norm': 5.06027889251709, 'learning_rate': 2.2501523461304085e-05, 'epoch': 1.65}


 55%|█████▌    | 3620/6564 [1:18:01<57:49,  1.18s/it]

{'loss': 0.1084, 'grad_norm': 0.7732430696487427, 'learning_rate': 2.242535039609994e-05, 'epoch': 1.65}


 55%|█████▌    | 3630/6564 [1:18:13<56:57,  1.16s/it]

{'loss': 0.1033, 'grad_norm': 8.538224220275879, 'learning_rate': 2.2349177330895796e-05, 'epoch': 1.66}


 55%|█████▌    | 3640/6564 [1:18:25<56:31,  1.16s/it]

{'loss': 0.1583, 'grad_norm': 1.9244009256362915, 'learning_rate': 2.2273004265691653e-05, 'epoch': 1.66}


 56%|█████▌    | 3650/6564 [1:18:36<56:37,  1.17s/it]

{'loss': 0.2282, 'grad_norm': 7.817650318145752, 'learning_rate': 2.2196831200487507e-05, 'epoch': 1.67}


 56%|█████▌    | 3660/6564 [1:18:48<56:10,  1.16s/it]

{'loss': 0.1174, 'grad_norm': 8.931446075439453, 'learning_rate': 2.2120658135283364e-05, 'epoch': 1.67}


 56%|█████▌    | 3670/6564 [1:19:00<56:02,  1.16s/it]

{'loss': 0.0743, 'grad_norm': 4.063891887664795, 'learning_rate': 2.204448507007922e-05, 'epoch': 1.68}


 56%|█████▌    | 3680/6564 [1:19:11<55:52,  1.16s/it]

{'loss': 0.0867, 'grad_norm': 0.2497372031211853, 'learning_rate': 2.196831200487508e-05, 'epoch': 1.68}


 56%|█████▌    | 3690/6564 [1:19:23<56:59,  1.19s/it]

{'loss': 0.0889, 'grad_norm': 4.944891929626465, 'learning_rate': 2.1892138939670932e-05, 'epoch': 1.69}


 56%|█████▋    | 3700/6564 [1:19:35<57:50,  1.21s/it]

{'loss': 0.1714, 'grad_norm': 0.6394674181938171, 'learning_rate': 2.181596587446679e-05, 'epoch': 1.69}


 57%|█████▋    | 3710/6564 [1:19:47<55:59,  1.18s/it]

{'loss': 0.0813, 'grad_norm': 0.18757112324237823, 'learning_rate': 2.1739792809262647e-05, 'epoch': 1.7}


 57%|█████▋    | 3720/6564 [1:19:59<56:07,  1.18s/it]

{'loss': 0.2265, 'grad_norm': 19.687864303588867, 'learning_rate': 2.1663619744058504e-05, 'epoch': 1.7}


 57%|█████▋    | 3730/6564 [1:20:11<55:00,  1.16s/it]

{'loss': 0.1783, 'grad_norm': 11.102585792541504, 'learning_rate': 2.1587446678854358e-05, 'epoch': 1.7}


 57%|█████▋    | 3740/6564 [1:20:23<55:02,  1.17s/it]

{'loss': 0.0931, 'grad_norm': 5.729753017425537, 'learning_rate': 2.1511273613650215e-05, 'epoch': 1.71}


 57%|█████▋    | 3750/6564 [1:20:34<55:31,  1.18s/it]

{'loss': 0.1577, 'grad_norm': 0.9559081196784973, 'learning_rate': 2.1435100548446072e-05, 'epoch': 1.71}


 57%|█████▋    | 3760/6564 [1:20:46<55:23,  1.19s/it]

{'loss': 0.0967, 'grad_norm': 0.17030054330825806, 'learning_rate': 2.135892748324193e-05, 'epoch': 1.72}


 57%|█████▋    | 3770/6564 [1:20:58<55:04,  1.18s/it]

{'loss': 0.2618, 'grad_norm': 2.4629034996032715, 'learning_rate': 2.1282754418037783e-05, 'epoch': 1.72}


 58%|█████▊    | 3780/6564 [1:21:10<54:44,  1.18s/it]

{'loss': 0.1174, 'grad_norm': 2.1520698070526123, 'learning_rate': 2.1206581352833637e-05, 'epoch': 1.73}


 58%|█████▊    | 3790/6564 [1:21:22<54:00,  1.17s/it]

{'loss': 0.1647, 'grad_norm': 9.526540756225586, 'learning_rate': 2.1130408287629494e-05, 'epoch': 1.73}


 58%|█████▊    | 3800/6564 [1:21:34<55:20,  1.20s/it]

{'loss': 0.1405, 'grad_norm': 5.2952165603637695, 'learning_rate': 2.105423522242535e-05, 'epoch': 1.74}


 58%|█████▊    | 3810/6564 [1:21:46<54:23,  1.18s/it]

{'loss': 0.0877, 'grad_norm': 0.45401731133461, 'learning_rate': 2.0978062157221205e-05, 'epoch': 1.74}


 58%|█████▊    | 3820/6564 [1:21:57<54:27,  1.19s/it]

{'loss': 0.1362, 'grad_norm': 0.19752778112888336, 'learning_rate': 2.0901889092017062e-05, 'epoch': 1.75}


 58%|█████▊    | 3830/6564 [1:22:09<53:47,  1.18s/it]

{'loss': 0.1699, 'grad_norm': 0.20542672276496887, 'learning_rate': 2.082571602681292e-05, 'epoch': 1.75}


 59%|█████▊    | 3840/6564 [1:22:21<52:36,  1.16s/it]

{'loss': 0.154, 'grad_norm': 17.079124450683594, 'learning_rate': 2.0749542961608777e-05, 'epoch': 1.76}


 59%|█████▊    | 3850/6564 [1:22:33<53:05,  1.17s/it]

{'loss': 0.0781, 'grad_norm': 0.2378496527671814, 'learning_rate': 2.067336989640463e-05, 'epoch': 1.76}


 59%|█████▉    | 3860/6564 [1:22:44<54:00,  1.20s/it]

{'loss': 0.0618, 'grad_norm': 0.28005746006965637, 'learning_rate': 2.0597196831200488e-05, 'epoch': 1.76}


 59%|█████▉    | 3870/6564 [1:22:56<53:33,  1.19s/it]

{'loss': 0.1167, 'grad_norm': 0.34447675943374634, 'learning_rate': 2.0521023765996345e-05, 'epoch': 1.77}


 59%|█████▉    | 3880/6564 [1:23:08<53:28,  1.20s/it]

{'loss': 0.1355, 'grad_norm': 8.811196327209473, 'learning_rate': 2.0444850700792202e-05, 'epoch': 1.77}


 59%|█████▉    | 3890/6564 [1:23:20<53:50,  1.21s/it]

{'loss': 0.1067, 'grad_norm': 0.8150097131729126, 'learning_rate': 2.0368677635588056e-05, 'epoch': 1.78}


 59%|█████▉    | 3900/6564 [1:23:32<51:14,  1.15s/it]

{'loss': 0.1074, 'grad_norm': 3.2581796646118164, 'learning_rate': 2.0292504570383913e-05, 'epoch': 1.78}


 60%|█████▉    | 3910/6564 [1:23:44<53:20,  1.21s/it]

{'loss': 0.157, 'grad_norm': 20.000093460083008, 'learning_rate': 2.021633150517977e-05, 'epoch': 1.79}


 60%|█████▉    | 3920/6564 [1:23:56<52:35,  1.19s/it]

{'loss': 0.0088, 'grad_norm': 0.5218881964683533, 'learning_rate': 2.0140158439975627e-05, 'epoch': 1.79}


 60%|█████▉    | 3930/6564 [1:24:07<51:08,  1.17s/it]

{'loss': 0.1117, 'grad_norm': 16.912248611450195, 'learning_rate': 2.006398537477148e-05, 'epoch': 1.8}


 60%|██████    | 3940/6564 [1:24:19<50:51,  1.16s/it]

{'loss': 0.1353, 'grad_norm': 26.479734420776367, 'learning_rate': 1.998781230956734e-05, 'epoch': 1.8}


 60%|██████    | 3950/6564 [1:24:31<52:36,  1.21s/it]

{'loss': 0.1041, 'grad_norm': 0.08233145624399185, 'learning_rate': 1.9911639244363196e-05, 'epoch': 1.81}


 60%|██████    | 3960/6564 [1:24:43<51:06,  1.18s/it]

{'loss': 0.1656, 'grad_norm': 4.9224724769592285, 'learning_rate': 1.9835466179159053e-05, 'epoch': 1.81}


 60%|██████    | 3970/6564 [1:24:56<53:47,  1.24s/it]

{'loss': 0.1551, 'grad_norm': 4.553347587585449, 'learning_rate': 1.9759293113954907e-05, 'epoch': 1.81}


 61%|██████    | 3980/6564 [1:25:07<51:49,  1.20s/it]

{'loss': 0.053, 'grad_norm': 0.2870252728462219, 'learning_rate': 1.9683120048750764e-05, 'epoch': 1.82}


 61%|██████    | 3990/6564 [1:25:19<51:38,  1.20s/it]

{'loss': 0.1834, 'grad_norm': 1.1241309642791748, 'learning_rate': 1.960694698354662e-05, 'epoch': 1.82}


 61%|██████    | 4000/6564 [1:25:31<51:22,  1.20s/it]

{'loss': 0.1489, 'grad_norm': 1.7609113454818726, 'learning_rate': 1.9530773918342475e-05, 'epoch': 1.83}


 61%|██████    | 4010/6564 [1:25:45<51:12,  1.20s/it]  

{'loss': 0.0742, 'grad_norm': 7.295375823974609, 'learning_rate': 1.945460085313833e-05, 'epoch': 1.83}


 61%|██████    | 4020/6564 [1:25:57<52:57,  1.25s/it]

{'loss': 0.1469, 'grad_norm': 2.3040659427642822, 'learning_rate': 1.9378427787934186e-05, 'epoch': 1.84}


 61%|██████▏   | 4030/6564 [1:26:09<51:12,  1.21s/it]

{'loss': 0.1261, 'grad_norm': 0.25323158502578735, 'learning_rate': 1.9302254722730043e-05, 'epoch': 1.84}


 62%|██████▏   | 4040/6564 [1:26:21<50:14,  1.19s/it]

{'loss': 0.1578, 'grad_norm': 1.3489563465118408, 'learning_rate': 1.92260816575259e-05, 'epoch': 1.85}


 62%|██████▏   | 4050/6564 [1:26:33<50:33,  1.21s/it]

{'loss': 0.1233, 'grad_norm': 7.253735065460205, 'learning_rate': 1.9149908592321754e-05, 'epoch': 1.85}


 62%|██████▏   | 4060/6564 [1:26:45<49:20,  1.18s/it]

{'loss': 0.0848, 'grad_norm': 6.781784534454346, 'learning_rate': 1.907373552711761e-05, 'epoch': 1.86}


 62%|██████▏   | 4070/6564 [1:26:57<50:43,  1.22s/it]

{'loss': 0.1214, 'grad_norm': 5.507378578186035, 'learning_rate': 1.899756246191347e-05, 'epoch': 1.86}


 62%|██████▏   | 4080/6564 [1:27:09<48:41,  1.18s/it]

{'loss': 0.1688, 'grad_norm': 0.3325740396976471, 'learning_rate': 1.8921389396709326e-05, 'epoch': 1.86}


 62%|██████▏   | 4090/6564 [1:27:21<49:00,  1.19s/it]

{'loss': 0.0496, 'grad_norm': 0.30664533376693726, 'learning_rate': 1.884521633150518e-05, 'epoch': 1.87}


 62%|██████▏   | 4100/6564 [1:27:33<49:13,  1.20s/it]

{'loss': 0.1875, 'grad_norm': 13.688329696655273, 'learning_rate': 1.8769043266301037e-05, 'epoch': 1.87}


 63%|██████▎   | 4110/6564 [1:27:45<47:45,  1.17s/it]

{'loss': 0.1928, 'grad_norm': 7.592219829559326, 'learning_rate': 1.8692870201096894e-05, 'epoch': 1.88}


 63%|██████▎   | 4120/6564 [1:27:56<48:05,  1.18s/it]

{'loss': 0.1409, 'grad_norm': 1.464354157447815, 'learning_rate': 1.861669713589275e-05, 'epoch': 1.88}


 63%|██████▎   | 4130/6564 [1:28:08<47:54,  1.18s/it]

{'loss': 0.1951, 'grad_norm': 1.2037854194641113, 'learning_rate': 1.8540524070688605e-05, 'epoch': 1.89}


 63%|██████▎   | 4140/6564 [1:28:20<51:13,  1.27s/it]

{'loss': 0.0935, 'grad_norm': 20.099943161010742, 'learning_rate': 1.8464351005484462e-05, 'epoch': 1.89}


 63%|██████▎   | 4150/6564 [1:28:32<47:53,  1.19s/it]

{'loss': 0.1098, 'grad_norm': 4.3572821617126465, 'learning_rate': 1.838817794028032e-05, 'epoch': 1.9}


 63%|██████▎   | 4160/6564 [1:28:44<47:41,  1.19s/it]

{'loss': 0.1001, 'grad_norm': 0.25634393095970154, 'learning_rate': 1.8312004875076173e-05, 'epoch': 1.9}


 64%|██████▎   | 4170/6564 [1:28:56<47:24,  1.19s/it]

{'loss': 0.2415, 'grad_norm': 5.618750095367432, 'learning_rate': 1.823583180987203e-05, 'epoch': 1.91}


 64%|██████▎   | 4180/6564 [1:29:08<47:08,  1.19s/it]

{'loss': 0.171, 'grad_norm': 1.5673106908798218, 'learning_rate': 1.8159658744667887e-05, 'epoch': 1.91}


 64%|██████▍   | 4190/6564 [1:29:20<47:34,  1.20s/it]

{'loss': 0.1189, 'grad_norm': 5.80615234375, 'learning_rate': 1.8083485679463745e-05, 'epoch': 1.91}


 64%|██████▍   | 4200/6564 [1:29:32<46:31,  1.18s/it]

{'loss': 0.1011, 'grad_norm': 2.5959908962249756, 'learning_rate': 1.80073126142596e-05, 'epoch': 1.92}


 64%|██████▍   | 4210/6564 [1:29:44<46:34,  1.19s/it]

{'loss': 0.09, 'grad_norm': 5.448886394500732, 'learning_rate': 1.7931139549055456e-05, 'epoch': 1.92}


 64%|██████▍   | 4220/6564 [1:29:56<45:54,  1.18s/it]

{'loss': 0.1586, 'grad_norm': 0.4241207242012024, 'learning_rate': 1.785496648385131e-05, 'epoch': 1.93}


 64%|██████▍   | 4230/6564 [1:30:08<46:35,  1.20s/it]

{'loss': 0.0832, 'grad_norm': 0.2956974506378174, 'learning_rate': 1.7778793418647167e-05, 'epoch': 1.93}


 65%|██████▍   | 4240/6564 [1:30:20<47:15,  1.22s/it]

{'loss': 0.1056, 'grad_norm': 0.21347463130950928, 'learning_rate': 1.7702620353443024e-05, 'epoch': 1.94}


 65%|██████▍   | 4250/6564 [1:30:32<45:47,  1.19s/it]

{'loss': 0.1086, 'grad_norm': 0.06104281172156334, 'learning_rate': 1.7626447288238878e-05, 'epoch': 1.94}


 65%|██████▍   | 4260/6564 [1:30:43<44:57,  1.17s/it]

{'loss': 0.0708, 'grad_norm': 0.7084373235702515, 'learning_rate': 1.7550274223034735e-05, 'epoch': 1.95}


 65%|██████▌   | 4270/6564 [1:30:55<44:40,  1.17s/it]

{'loss': 0.2483, 'grad_norm': 0.1950598806142807, 'learning_rate': 1.7474101157830592e-05, 'epoch': 1.95}


 65%|██████▌   | 4280/6564 [1:31:07<44:40,  1.17s/it]

{'loss': 0.1318, 'grad_norm': 12.49069595336914, 'learning_rate': 1.739792809262645e-05, 'epoch': 1.96}


 65%|██████▌   | 4290/6564 [1:31:19<44:10,  1.17s/it]

{'loss': 0.1377, 'grad_norm': 10.005043029785156, 'learning_rate': 1.7321755027422303e-05, 'epoch': 1.96}


 66%|██████▌   | 4300/6564 [1:31:30<44:46,  1.19s/it]

{'loss': 0.1985, 'grad_norm': 11.53274154663086, 'learning_rate': 1.724558196221816e-05, 'epoch': 1.97}


 66%|██████▌   | 4310/6564 [1:31:42<44:32,  1.19s/it]

{'loss': 0.0947, 'grad_norm': 8.339262008666992, 'learning_rate': 1.7169408897014017e-05, 'epoch': 1.97}


 66%|██████▌   | 4320/6564 [1:31:54<45:03,  1.20s/it]

{'loss': 0.1261, 'grad_norm': 0.2911263108253479, 'learning_rate': 1.709323583180987e-05, 'epoch': 1.97}


 66%|██████▌   | 4330/6564 [1:32:06<43:29,  1.17s/it]

{'loss': 0.0785, 'grad_norm': 1.2159154415130615, 'learning_rate': 1.701706276660573e-05, 'epoch': 1.98}


 66%|██████▌   | 4340/6564 [1:32:18<44:27,  1.20s/it]

{'loss': 0.1516, 'grad_norm': 0.534681499004364, 'learning_rate': 1.6940889701401586e-05, 'epoch': 1.98}


 66%|██████▋   | 4350/6564 [1:32:30<43:54,  1.19s/it]

{'loss': 0.1597, 'grad_norm': 4.5061469078063965, 'learning_rate': 1.6864716636197443e-05, 'epoch': 1.99}


 66%|██████▋   | 4360/6564 [1:32:42<42:55,  1.17s/it]

{'loss': 0.1088, 'grad_norm': 8.431715965270996, 'learning_rate': 1.6788543570993297e-05, 'epoch': 1.99}


 67%|██████▋   | 4370/6564 [1:32:53<43:23,  1.19s/it]

{'loss': 0.1208, 'grad_norm': 2.420541286468506, 'learning_rate': 1.6712370505789154e-05, 'epoch': 2.0}


 67%|██████▋   | 4380/6564 [1:33:05<42:23,  1.16s/it]

{'loss': 0.0449, 'grad_norm': 0.19978150725364685, 'learning_rate': 1.663619744058501e-05, 'epoch': 2.0}


 67%|██████▋   | 4390/6564 [1:33:17<42:38,  1.18s/it]

{'loss': 0.0604, 'grad_norm': 0.16168636083602905, 'learning_rate': 1.6560024375380868e-05, 'epoch': 2.01}


 67%|██████▋   | 4400/6564 [1:33:28<42:35,  1.18s/it]

{'loss': 0.0907, 'grad_norm': 0.1850530505180359, 'learning_rate': 1.6483851310176722e-05, 'epoch': 2.01}


 67%|██████▋   | 4410/6564 [1:33:41<43:00,  1.20s/it]

{'loss': 0.0255, 'grad_norm': 0.11802350729703903, 'learning_rate': 1.640767824497258e-05, 'epoch': 2.02}


 67%|██████▋   | 4420/6564 [1:33:53<43:39,  1.22s/it]

{'loss': 0.0962, 'grad_norm': 11.793299674987793, 'learning_rate': 1.6331505179768436e-05, 'epoch': 2.02}


 67%|██████▋   | 4430/6564 [1:34:05<41:59,  1.18s/it]

{'loss': 0.1001, 'grad_norm': 3.5995681285858154, 'learning_rate': 1.625533211456429e-05, 'epoch': 2.02}


 68%|██████▊   | 4440/6564 [1:34:16<41:35,  1.17s/it]

{'loss': 0.0502, 'grad_norm': 0.13423369824886322, 'learning_rate': 1.6179159049360147e-05, 'epoch': 2.03}


 68%|██████▊   | 4450/6564 [1:34:28<43:16,  1.23s/it]

{'loss': 0.0405, 'grad_norm': 0.07823609560728073, 'learning_rate': 1.6102985984156e-05, 'epoch': 2.03}


 68%|██████▊   | 4460/6564 [1:34:40<41:16,  1.18s/it]

{'loss': 0.0538, 'grad_norm': 5.599870681762695, 'learning_rate': 1.602681291895186e-05, 'epoch': 2.04}


 68%|██████▊   | 4470/6564 [1:34:52<40:37,  1.16s/it]

{'loss': 0.0756, 'grad_norm': 0.07614962756633759, 'learning_rate': 1.5950639853747716e-05, 'epoch': 2.04}


 68%|██████▊   | 4480/6564 [1:35:04<40:49,  1.18s/it]

{'loss': 0.0041, 'grad_norm': 0.09375171363353729, 'learning_rate': 1.587446678854357e-05, 'epoch': 2.05}


 68%|██████▊   | 4490/6564 [1:35:15<40:14,  1.16s/it]

{'loss': 0.0377, 'grad_norm': 13.516674041748047, 'learning_rate': 1.5798293723339427e-05, 'epoch': 2.05}


 69%|██████▊   | 4500/6564 [1:35:27<42:06,  1.22s/it]

{'loss': 0.0359, 'grad_norm': 0.06875430792570114, 'learning_rate': 1.5722120658135284e-05, 'epoch': 2.06}


 69%|██████▊   | 4510/6564 [1:35:40<41:00,  1.20s/it]

{'loss': 0.034, 'grad_norm': 0.12864913046360016, 'learning_rate': 1.564594759293114e-05, 'epoch': 2.06}


 69%|██████▉   | 4520/6564 [1:35:52<40:06,  1.18s/it]

{'loss': 0.0568, 'grad_norm': 0.41262632608413696, 'learning_rate': 1.5569774527726995e-05, 'epoch': 2.07}


 69%|██████▉   | 4530/6564 [1:36:04<40:51,  1.21s/it]

{'loss': 0.0688, 'grad_norm': 0.9349941611289978, 'learning_rate': 1.5493601462522852e-05, 'epoch': 2.07}


 69%|██████▉   | 4540/6564 [1:36:16<39:43,  1.18s/it]

{'loss': 0.0442, 'grad_norm': 0.035089846700429916, 'learning_rate': 1.541742839731871e-05, 'epoch': 2.07}


 69%|██████▉   | 4550/6564 [1:36:28<39:23,  1.17s/it]

{'loss': 0.1115, 'grad_norm': 0.5455880761146545, 'learning_rate': 1.5341255332114566e-05, 'epoch': 2.08}


 69%|██████▉   | 4560/6564 [1:36:40<38:52,  1.16s/it]

{'loss': 0.0136, 'grad_norm': 0.09646622091531754, 'learning_rate': 1.526508226691042e-05, 'epoch': 2.08}


 70%|██████▉   | 4570/6564 [1:36:51<38:47,  1.17s/it]

{'loss': 0.0814, 'grad_norm': 0.6409608721733093, 'learning_rate': 1.5188909201706277e-05, 'epoch': 2.09}


 70%|██████▉   | 4580/6564 [1:37:03<39:13,  1.19s/it]

{'loss': 0.0138, 'grad_norm': 0.713467538356781, 'learning_rate': 1.5112736136502135e-05, 'epoch': 2.09}


 70%|██████▉   | 4590/6564 [1:37:15<39:05,  1.19s/it]

{'loss': 0.1732, 'grad_norm': 0.11979474872350693, 'learning_rate': 1.503656307129799e-05, 'epoch': 2.1}


 70%|███████   | 4600/6564 [1:37:27<38:45,  1.18s/it]

{'loss': 0.0368, 'grad_norm': 0.043425627052783966, 'learning_rate': 1.4960390006093847e-05, 'epoch': 2.1}


 70%|███████   | 4610/6564 [1:37:38<37:53,  1.16s/it]

{'loss': 0.0848, 'grad_norm': 0.42455095052719116, 'learning_rate': 1.4884216940889703e-05, 'epoch': 2.11}


 70%|███████   | 4620/6564 [1:37:50<37:39,  1.16s/it]

{'loss': 0.0634, 'grad_norm': 0.1552007496356964, 'learning_rate': 1.480804387568556e-05, 'epoch': 2.11}


 71%|███████   | 4630/6564 [1:38:02<38:08,  1.18s/it]

{'loss': 0.1701, 'grad_norm': 18.002147674560547, 'learning_rate': 1.4731870810481416e-05, 'epoch': 2.12}


 71%|███████   | 4640/6564 [1:38:14<37:50,  1.18s/it]

{'loss': 0.0692, 'grad_norm': 12.307827949523926, 'learning_rate': 1.4655697745277273e-05, 'epoch': 2.12}


 71%|███████   | 4650/6564 [1:38:25<37:23,  1.17s/it]

{'loss': 0.2115, 'grad_norm': 0.16352397203445435, 'learning_rate': 1.4579524680073125e-05, 'epoch': 2.13}


 71%|███████   | 4660/6564 [1:38:37<37:19,  1.18s/it]

{'loss': 0.0673, 'grad_norm': 0.18606100976467133, 'learning_rate': 1.4503351614868982e-05, 'epoch': 2.13}


 71%|███████   | 4670/6564 [1:38:49<36:57,  1.17s/it]

{'loss': 0.032, 'grad_norm': 0.07553093880414963, 'learning_rate': 1.4427178549664838e-05, 'epoch': 2.13}


 71%|███████▏  | 4680/6564 [1:39:01<37:13,  1.19s/it]

{'loss': 0.0681, 'grad_norm': 0.04116776958107948, 'learning_rate': 1.4351005484460695e-05, 'epoch': 2.14}


 71%|███████▏  | 4690/6564 [1:39:13<36:57,  1.18s/it]

{'loss': 0.0453, 'grad_norm': 0.14512993395328522, 'learning_rate': 1.427483241925655e-05, 'epoch': 2.14}


 72%|███████▏  | 4700/6564 [1:39:25<37:06,  1.19s/it]

{'loss': 0.0589, 'grad_norm': 32.16224670410156, 'learning_rate': 1.4198659354052407e-05, 'epoch': 2.15}


 72%|███████▏  | 4710/6564 [1:39:36<36:02,  1.17s/it]

{'loss': 0.0775, 'grad_norm': 0.06372810900211334, 'learning_rate': 1.4122486288848263e-05, 'epoch': 2.15}


 72%|███████▏  | 4720/6564 [1:39:48<35:58,  1.17s/it]

{'loss': 0.0685, 'grad_norm': 0.4886122941970825, 'learning_rate': 1.404631322364412e-05, 'epoch': 2.16}


 72%|███████▏  | 4730/6564 [1:40:00<36:09,  1.18s/it]

{'loss': 0.0462, 'grad_norm': 21.54496955871582, 'learning_rate': 1.3970140158439976e-05, 'epoch': 2.16}


 72%|███████▏  | 4740/6564 [1:40:12<35:13,  1.16s/it]

{'loss': 0.065, 'grad_norm': 0.16136276721954346, 'learning_rate': 1.3893967093235833e-05, 'epoch': 2.17}


 72%|███████▏  | 4750/6564 [1:40:23<35:23,  1.17s/it]

{'loss': 0.0309, 'grad_norm': 0.12975572049617767, 'learning_rate': 1.3817794028031688e-05, 'epoch': 2.17}


 73%|███████▎  | 4760/6564 [1:40:35<35:59,  1.20s/it]

{'loss': 0.0232, 'grad_norm': 0.11081256717443466, 'learning_rate': 1.3741620962827546e-05, 'epoch': 2.18}


 73%|███████▎  | 4770/6564 [1:40:47<35:38,  1.19s/it]

{'loss': 0.0309, 'grad_norm': 7.844303607940674, 'learning_rate': 1.3665447897623401e-05, 'epoch': 2.18}


 73%|███████▎  | 4780/6564 [1:40:59<35:00,  1.18s/it]

{'loss': 0.102, 'grad_norm': 1.6599410772323608, 'learning_rate': 1.3589274832419258e-05, 'epoch': 2.18}


 73%|███████▎  | 4790/6564 [1:41:11<36:42,  1.24s/it]

{'loss': 0.0882, 'grad_norm': 9.763582229614258, 'learning_rate': 1.3513101767215114e-05, 'epoch': 2.19}


 73%|███████▎  | 4800/6564 [1:41:23<34:46,  1.18s/it]

{'loss': 0.0742, 'grad_norm': 5.603075981140137, 'learning_rate': 1.3436928702010971e-05, 'epoch': 2.19}


 73%|███████▎  | 4810/6564 [1:41:35<34:54,  1.19s/it]

{'loss': 0.0409, 'grad_norm': 0.2745377719402313, 'learning_rate': 1.3360755636806826e-05, 'epoch': 2.2}


 73%|███████▎  | 4820/6564 [1:41:46<34:25,  1.18s/it]

{'loss': 0.0106, 'grad_norm': 24.792041778564453, 'learning_rate': 1.3284582571602684e-05, 'epoch': 2.2}


 74%|███████▎  | 4830/6564 [1:41:59<35:22,  1.22s/it]

{'loss': 0.1213, 'grad_norm': 2.515357732772827, 'learning_rate': 1.320840950639854e-05, 'epoch': 2.21}


 74%|███████▎  | 4840/6564 [1:42:11<34:16,  1.19s/it]

{'loss': 0.0123, 'grad_norm': 0.05793308466672897, 'learning_rate': 1.3132236441194395e-05, 'epoch': 2.21}


 74%|███████▍  | 4850/6564 [1:42:22<33:50,  1.18s/it]

{'loss': 0.0338, 'grad_norm': 0.14469066262245178, 'learning_rate': 1.3056063375990252e-05, 'epoch': 2.22}


 74%|███████▍  | 4860/6564 [1:42:35<34:38,  1.22s/it]

{'loss': 0.0678, 'grad_norm': 0.2546468675136566, 'learning_rate': 1.2979890310786106e-05, 'epoch': 2.22}


 74%|███████▍  | 4870/6564 [1:42:46<33:07,  1.17s/it]

{'loss': 0.1144, 'grad_norm': 23.85500717163086, 'learning_rate': 1.2903717245581961e-05, 'epoch': 2.23}


 74%|███████▍  | 4880/6564 [1:42:58<33:57,  1.21s/it]

{'loss': 0.1029, 'grad_norm': 3.329155921936035, 'learning_rate': 1.2827544180377818e-05, 'epoch': 2.23}


 74%|███████▍  | 4890/6564 [1:43:10<33:06,  1.19s/it]

{'loss': 0.1042, 'grad_norm': 0.5148162841796875, 'learning_rate': 1.2751371115173674e-05, 'epoch': 2.23}


 75%|███████▍  | 4900/6564 [1:43:22<32:40,  1.18s/it]

{'loss': 0.1132, 'grad_norm': 10.343450546264648, 'learning_rate': 1.2675198049969531e-05, 'epoch': 2.24}


 75%|███████▍  | 4910/6564 [1:43:34<32:36,  1.18s/it]

{'loss': 0.0352, 'grad_norm': 0.15583285689353943, 'learning_rate': 1.2599024984765387e-05, 'epoch': 2.24}


 75%|███████▍  | 4920/6564 [1:43:46<32:44,  1.20s/it]

{'loss': 0.0258, 'grad_norm': 0.08789531886577606, 'learning_rate': 1.2522851919561244e-05, 'epoch': 2.25}


 75%|███████▌  | 4930/6564 [1:43:58<32:19,  1.19s/it]

{'loss': 0.0496, 'grad_norm': 5.127464771270752, 'learning_rate': 1.24466788543571e-05, 'epoch': 2.25}


 75%|███████▌  | 4940/6564 [1:44:10<31:48,  1.18s/it]

{'loss': 0.0413, 'grad_norm': 0.052441105246543884, 'learning_rate': 1.2370505789152956e-05, 'epoch': 2.26}


 75%|███████▌  | 4950/6564 [1:44:21<31:08,  1.16s/it]

{'loss': 0.0851, 'grad_norm': 0.0810997486114502, 'learning_rate': 1.2294332723948812e-05, 'epoch': 2.26}


 76%|███████▌  | 4960/6564 [1:44:33<31:35,  1.18s/it]

{'loss': 0.0038, 'grad_norm': 0.06683114916086197, 'learning_rate': 1.221815965874467e-05, 'epoch': 2.27}


 76%|███████▌  | 4970/6564 [1:44:45<31:49,  1.20s/it]

{'loss': 0.092, 'grad_norm': 0.11066436022520065, 'learning_rate': 1.2141986593540525e-05, 'epoch': 2.27}


 76%|███████▌  | 4980/6564 [1:44:57<31:52,  1.21s/it]

{'loss': 0.0541, 'grad_norm': 2.2217960357666016, 'learning_rate': 1.206581352833638e-05, 'epoch': 2.28}


 76%|███████▌  | 4990/6564 [1:45:09<31:41,  1.21s/it]

{'loss': 0.1283, 'grad_norm': 10.920653343200684, 'learning_rate': 1.1989640463132237e-05, 'epoch': 2.28}


 76%|███████▌  | 5000/6564 [1:45:21<30:31,  1.17s/it]

{'loss': 0.0652, 'grad_norm': 0.11005783081054688, 'learning_rate': 1.1913467397928093e-05, 'epoch': 2.29}


 76%|███████▋  | 5010/6564 [1:45:34<30:58,  1.20s/it]

{'loss': 0.0214, 'grad_norm': 0.024691354483366013, 'learning_rate': 1.183729433272395e-05, 'epoch': 2.29}


 76%|███████▋  | 5020/6564 [1:45:46<30:47,  1.20s/it]

{'loss': 0.0207, 'grad_norm': 0.03695464879274368, 'learning_rate': 1.1761121267519806e-05, 'epoch': 2.29}


 77%|███████▋  | 5030/6564 [1:45:58<31:37,  1.24s/it]

{'loss': 0.07, 'grad_norm': 0.027202997356653214, 'learning_rate': 1.1684948202315661e-05, 'epoch': 2.3}


 77%|███████▋  | 5040/6564 [1:46:10<30:25,  1.20s/it]

{'loss': 0.1046, 'grad_norm': 0.08574314415454865, 'learning_rate': 1.1608775137111518e-05, 'epoch': 2.3}


 77%|███████▋  | 5050/6564 [1:46:22<29:31,  1.17s/it]

{'loss': 0.0182, 'grad_norm': 0.03474871814250946, 'learning_rate': 1.1532602071907374e-05, 'epoch': 2.31}


 77%|███████▋  | 5060/6564 [1:46:34<30:12,  1.21s/it]

{'loss': 0.0564, 'grad_norm': 0.0755540132522583, 'learning_rate': 1.145642900670323e-05, 'epoch': 2.31}


 77%|███████▋  | 5070/6564 [1:46:46<29:41,  1.19s/it]

{'loss': 0.0534, 'grad_norm': 13.718018531799316, 'learning_rate': 1.1380255941499086e-05, 'epoch': 2.32}


 77%|███████▋  | 5080/6564 [1:46:58<28:51,  1.17s/it]

{'loss': 0.0999, 'grad_norm': 0.06375642120838165, 'learning_rate': 1.1304082876294942e-05, 'epoch': 2.32}


 78%|███████▊  | 5090/6564 [1:47:10<29:16,  1.19s/it]

{'loss': 0.0053, 'grad_norm': 0.051638826727867126, 'learning_rate': 1.12279098110908e-05, 'epoch': 2.33}


 78%|███████▊  | 5100/6564 [1:47:21<29:14,  1.20s/it]

{'loss': 0.0041, 'grad_norm': 0.021658651530742645, 'learning_rate': 1.1151736745886655e-05, 'epoch': 2.33}


 78%|███████▊  | 5110/6564 [1:47:34<29:27,  1.22s/it]

{'loss': 0.0691, 'grad_norm': 0.041214555501937866, 'learning_rate': 1.1075563680682512e-05, 'epoch': 2.34}


 78%|███████▊  | 5120/6564 [1:47:46<28:44,  1.19s/it]

{'loss': 0.0043, 'grad_norm': 0.025748832151293755, 'learning_rate': 1.0999390615478367e-05, 'epoch': 2.34}


 78%|███████▊  | 5130/6564 [1:47:57<28:06,  1.18s/it]

{'loss': 0.0368, 'grad_norm': 0.039530862122774124, 'learning_rate': 1.0923217550274223e-05, 'epoch': 2.34}


 78%|███████▊  | 5140/6564 [1:48:09<27:42,  1.17s/it]

{'loss': 0.0019, 'grad_norm': 0.06022071838378906, 'learning_rate': 1.0847044485070078e-05, 'epoch': 2.35}


 78%|███████▊  | 5150/6564 [1:48:21<27:25,  1.16s/it]

{'loss': 0.0958, 'grad_norm': 30.41588592529297, 'learning_rate': 1.0770871419865936e-05, 'epoch': 2.35}


 79%|███████▊  | 5160/6564 [1:48:32<27:14,  1.16s/it]

{'loss': 0.1627, 'grad_norm': 0.4867681860923767, 'learning_rate': 1.0694698354661791e-05, 'epoch': 2.36}


 79%|███████▉  | 5170/6564 [1:48:44<27:04,  1.17s/it]

{'loss': 0.0782, 'grad_norm': 0.05081498250365257, 'learning_rate': 1.0618525289457648e-05, 'epoch': 2.36}


 79%|███████▉  | 5180/6564 [1:48:56<26:42,  1.16s/it]

{'loss': 0.0387, 'grad_norm': 0.03800646588206291, 'learning_rate': 1.0542352224253504e-05, 'epoch': 2.37}


 79%|███████▉  | 5190/6564 [1:49:07<26:45,  1.17s/it]

{'loss': 0.0617, 'grad_norm': 0.07397352159023285, 'learning_rate': 1.0466179159049361e-05, 'epoch': 2.37}


 79%|███████▉  | 5200/6564 [1:49:19<26:24,  1.16s/it]

{'loss': 0.0374, 'grad_norm': 0.037134747952222824, 'learning_rate': 1.0390006093845217e-05, 'epoch': 2.38}


 79%|███████▉  | 5210/6564 [1:49:31<26:23,  1.17s/it]

{'loss': 0.0904, 'grad_norm': 0.03445752337574959, 'learning_rate': 1.0313833028641074e-05, 'epoch': 2.38}


 80%|███████▉  | 5220/6564 [1:49:43<26:35,  1.19s/it]

{'loss': 0.0027, 'grad_norm': 0.04434563219547272, 'learning_rate': 1.023765996343693e-05, 'epoch': 2.39}


 80%|███████▉  | 5230/6564 [1:49:54<26:13,  1.18s/it]

{'loss': 0.0209, 'grad_norm': 0.026654601097106934, 'learning_rate': 1.0161486898232786e-05, 'epoch': 2.39}


 80%|███████▉  | 5240/6564 [1:50:06<26:07,  1.18s/it]

{'loss': 0.0538, 'grad_norm': 0.021725745871663094, 'learning_rate': 1.008531383302864e-05, 'epoch': 2.39}


 80%|███████▉  | 5250/6564 [1:50:18<25:34,  1.17s/it]

{'loss': 0.1266, 'grad_norm': 27.285869598388672, 'learning_rate': 1.0009140767824497e-05, 'epoch': 2.4}


 80%|████████  | 5260/6564 [1:50:29<25:12,  1.16s/it]

{'loss': 0.0254, 'grad_norm': 19.519704818725586, 'learning_rate': 9.932967702620353e-06, 'epoch': 2.4}


 80%|████████  | 5270/6564 [1:50:41<25:20,  1.18s/it]

{'loss': 0.0678, 'grad_norm': 0.03672628477215767, 'learning_rate': 9.85679463741621e-06, 'epoch': 2.41}


 80%|████████  | 5280/6564 [1:56:29<18:01:59, 50.56s/it] 

{'loss': 0.0434, 'grad_norm': 0.3626496493816376, 'learning_rate': 9.780621572212066e-06, 'epoch': 2.41}


 81%|████████  | 5290/6564 [1:56:43<56:06,  2.64s/it]   

{'loss': 0.0817, 'grad_norm': 0.2051614671945572, 'learning_rate': 9.704448507007923e-06, 'epoch': 2.42}


 81%|████████  | 5300/6564 [1:56:58<29:27,  1.40s/it]

{'loss': 0.0571, 'grad_norm': 0.03448178991675377, 'learning_rate': 9.628275441803778e-06, 'epoch': 2.42}


 81%|████████  | 5310/6564 [1:57:10<25:01,  1.20s/it]

{'loss': 0.0865, 'grad_norm': 6.9016313552856445, 'learning_rate': 9.552102376599636e-06, 'epoch': 2.43}


 81%|████████  | 5320/6564 [1:57:21<24:13,  1.17s/it]

{'loss': 0.0394, 'grad_norm': 0.31904178857803345, 'learning_rate': 9.475929311395491e-06, 'epoch': 2.43}


 81%|████████  | 5330/6564 [1:57:33<23:55,  1.16s/it]

{'loss': 0.0637, 'grad_norm': 0.045820217579603195, 'learning_rate': 9.399756246191348e-06, 'epoch': 2.44}


 81%|████████▏ | 5340/6564 [1:57:45<23:49,  1.17s/it]

{'loss': 0.0861, 'grad_norm': 18.438051223754883, 'learning_rate': 9.323583180987204e-06, 'epoch': 2.44}


 82%|████████▏ | 5350/6564 [1:57:56<23:28,  1.16s/it]

{'loss': 0.0276, 'grad_norm': 0.02043003775179386, 'learning_rate': 9.24741011578306e-06, 'epoch': 2.45}


 82%|████████▏ | 5360/6564 [1:58:08<23:18,  1.16s/it]

{'loss': 0.0937, 'grad_norm': 0.2013404369354248, 'learning_rate': 9.171237050578915e-06, 'epoch': 2.45}


 82%|████████▏ | 5370/6564 [1:58:20<23:06,  1.16s/it]

{'loss': 0.058, 'grad_norm': 6.578105926513672, 'learning_rate': 9.095063985374772e-06, 'epoch': 2.45}


 82%|████████▏ | 5380/6564 [1:58:31<22:56,  1.16s/it]

{'loss': 0.0768, 'grad_norm': 13.728053092956543, 'learning_rate': 9.018890920170627e-06, 'epoch': 2.46}


 82%|████████▏ | 5390/6564 [1:58:43<23:05,  1.18s/it]

{'loss': 0.0964, 'grad_norm': 0.06772981584072113, 'learning_rate': 8.942717854966485e-06, 'epoch': 2.46}


 82%|████████▏ | 5400/6564 [1:58:55<23:16,  1.20s/it]

{'loss': 0.0696, 'grad_norm': 0.1070556491613388, 'learning_rate': 8.86654478976234e-06, 'epoch': 2.47}


 82%|████████▏ | 5410/6564 [1:59:07<22:43,  1.18s/it]

{'loss': 0.0333, 'grad_norm': 0.02708963118493557, 'learning_rate': 8.790371724558197e-06, 'epoch': 2.47}


 83%|████████▎ | 5420/6564 [1:59:19<22:39,  1.19s/it]

{'loss': 0.0706, 'grad_norm': 4.393401622772217, 'learning_rate': 8.714198659354053e-06, 'epoch': 2.48}


 83%|████████▎ | 5430/6564 [1:59:31<22:45,  1.20s/it]

{'loss': 0.0445, 'grad_norm': 16.338003158569336, 'learning_rate': 8.63802559414991e-06, 'epoch': 2.48}


 83%|████████▎ | 5440/6564 [1:59:43<22:22,  1.19s/it]

{'loss': 0.0309, 'grad_norm': 0.04385371506214142, 'learning_rate': 8.561852528945766e-06, 'epoch': 2.49}


 83%|████████▎ | 5450/6564 [1:59:55<21:47,  1.17s/it]

{'loss': 0.0352, 'grad_norm': 0.03146085515618324, 'learning_rate': 8.485679463741623e-06, 'epoch': 2.49}


 83%|████████▎ | 5460/6564 [2:00:06<21:28,  1.17s/it]

{'loss': 0.1059, 'grad_norm': 0.3540728986263275, 'learning_rate': 8.409506398537477e-06, 'epoch': 2.5}


 83%|████████▎ | 5470/6564 [2:00:18<21:35,  1.18s/it]

{'loss': 0.0468, 'grad_norm': 0.44589701294898987, 'learning_rate': 8.333333333333334e-06, 'epoch': 2.5}


 83%|████████▎ | 5480/6564 [2:00:30<21:08,  1.17s/it]

{'loss': 0.0754, 'grad_norm': 19.399938583374023, 'learning_rate': 8.25716026812919e-06, 'epoch': 2.5}


 84%|████████▎ | 5490/6564 [2:00:42<21:11,  1.18s/it]

{'loss': 0.0458, 'grad_norm': 0.08378936350345612, 'learning_rate': 8.180987202925046e-06, 'epoch': 2.51}


 84%|████████▍ | 5500/6564 [2:00:53<20:56,  1.18s/it]

{'loss': 0.0835, 'grad_norm': 41.220458984375, 'learning_rate': 8.104814137720902e-06, 'epoch': 2.51}


 84%|████████▍ | 5510/6564 [2:01:07<21:12,  1.21s/it]

{'loss': 0.0341, 'grad_norm': 0.03599700331687927, 'learning_rate': 8.028641072516759e-06, 'epoch': 2.52}


 84%|████████▍ | 5520/6564 [2:01:19<20:40,  1.19s/it]

{'loss': 0.0494, 'grad_norm': 0.05069280043244362, 'learning_rate': 7.952468007312615e-06, 'epoch': 2.52}


 84%|████████▍ | 5530/6564 [2:01:30<20:13,  1.17s/it]

{'loss': 0.0399, 'grad_norm': 0.03986920416355133, 'learning_rate': 7.876294942108472e-06, 'epoch': 2.53}


 84%|████████▍ | 5540/6564 [2:01:43<20:24,  1.20s/it]

{'loss': 0.0886, 'grad_norm': 14.128746032714844, 'learning_rate': 7.800121876904327e-06, 'epoch': 2.53}


 85%|████████▍ | 5550/6564 [2:01:54<20:04,  1.19s/it]

{'loss': 0.0628, 'grad_norm': 0.0917443260550499, 'learning_rate': 7.723948811700185e-06, 'epoch': 2.54}


 85%|████████▍ | 5560/6564 [2:02:06<19:55,  1.19s/it]

{'loss': 0.0015, 'grad_norm': 0.07702615112066269, 'learning_rate': 7.647775746496038e-06, 'epoch': 2.54}


 85%|████████▍ | 5570/6564 [2:02:18<19:59,  1.21s/it]

{'loss': 0.0035, 'grad_norm': 0.021205155178904533, 'learning_rate': 7.571602681291895e-06, 'epoch': 2.55}


 85%|████████▌ | 5580/6564 [2:02:30<20:02,  1.22s/it]

{'loss': 0.0284, 'grad_norm': 0.06330925971269608, 'learning_rate': 7.495429616087751e-06, 'epoch': 2.55}


 85%|████████▌ | 5590/6564 [2:02:43<19:25,  1.20s/it]

{'loss': 0.0018, 'grad_norm': 0.0433436818420887, 'learning_rate': 7.419256550883607e-06, 'epoch': 2.55}


 85%|████████▌ | 5600/6564 [2:02:54<19:04,  1.19s/it]

{'loss': 0.0413, 'grad_norm': 0.31534668803215027, 'learning_rate': 7.343083485679464e-06, 'epoch': 2.56}


 85%|████████▌ | 5610/6564 [2:03:06<18:52,  1.19s/it]

{'loss': 0.0013, 'grad_norm': 0.04053394868969917, 'learning_rate': 7.26691042047532e-06, 'epoch': 2.56}


 86%|████████▌ | 5620/6564 [2:03:18<18:32,  1.18s/it]

{'loss': 0.0931, 'grad_norm': 0.4882762134075165, 'learning_rate': 7.1907373552711764e-06, 'epoch': 2.57}


 86%|████████▌ | 5630/6564 [2:03:30<18:33,  1.19s/it]

{'loss': 0.1058, 'grad_norm': 0.043691378086805344, 'learning_rate': 7.114564290067033e-06, 'epoch': 2.57}


 86%|████████▌ | 5640/6564 [2:03:42<18:27,  1.20s/it]

{'loss': 0.1698, 'grad_norm': 0.08877409994602203, 'learning_rate': 7.038391224862889e-06, 'epoch': 2.58}


 86%|████████▌ | 5650/6564 [2:03:54<17:49,  1.17s/it]

{'loss': 0.0183, 'grad_norm': 0.019488923251628876, 'learning_rate': 6.9622181596587455e-06, 'epoch': 2.58}


 86%|████████▌ | 5660/6564 [2:04:05<17:31,  1.16s/it]

{'loss': 0.0633, 'grad_norm': 2.797370195388794, 'learning_rate': 6.886045094454602e-06, 'epoch': 2.59}


 86%|████████▋ | 5670/6564 [2:04:17<17:42,  1.19s/it]

{'loss': 0.0568, 'grad_norm': 20.39288902282715, 'learning_rate': 6.8098720292504565e-06, 'epoch': 2.59}


 87%|████████▋ | 5680/6564 [2:04:29<17:39,  1.20s/it]

{'loss': 0.1189, 'grad_norm': 0.10269598662853241, 'learning_rate': 6.733698964046313e-06, 'epoch': 2.6}


 87%|████████▋ | 5690/6564 [2:04:41<17:01,  1.17s/it]

{'loss': 0.0434, 'grad_norm': 1.156854271888733, 'learning_rate': 6.657525898842169e-06, 'epoch': 2.6}


 87%|████████▋ | 5700/6564 [2:04:53<17:04,  1.19s/it]

{'loss': 0.0703, 'grad_norm': 22.239606857299805, 'learning_rate': 6.5813528336380256e-06, 'epoch': 2.61}


 87%|████████▋ | 5710/6564 [2:05:05<16:52,  1.19s/it]

{'loss': 0.057, 'grad_norm': 0.28279057145118713, 'learning_rate': 6.505179768433882e-06, 'epoch': 2.61}


 87%|████████▋ | 5720/6564 [2:05:17<16:46,  1.19s/it]

{'loss': 0.0401, 'grad_norm': 0.039673078805208206, 'learning_rate': 6.429006703229738e-06, 'epoch': 2.61}


 87%|████████▋ | 5730/6564 [2:05:29<16:53,  1.22s/it]

{'loss': 0.0978, 'grad_norm': 0.10191051661968231, 'learning_rate': 6.352833638025595e-06, 'epoch': 2.62}


 87%|████████▋ | 5740/6564 [2:05:41<16:35,  1.21s/it]

{'loss': 0.0772, 'grad_norm': 0.06314467638731003, 'learning_rate': 6.276660572821451e-06, 'epoch': 2.62}


 88%|████████▊ | 5750/6564 [2:05:53<16:04,  1.19s/it]

{'loss': 0.0631, 'grad_norm': 0.0648263469338417, 'learning_rate': 6.2004875076173065e-06, 'epoch': 2.63}


 88%|████████▊ | 5760/6564 [2:06:05<16:11,  1.21s/it]

{'loss': 0.0502, 'grad_norm': 20.51567840576172, 'learning_rate': 6.124314442413163e-06, 'epoch': 2.63}


 88%|████████▊ | 5770/6564 [2:06:17<15:40,  1.19s/it]

{'loss': 0.0482, 'grad_norm': 0.07514764368534088, 'learning_rate': 6.048141377209019e-06, 'epoch': 2.64}


 88%|████████▊ | 5780/6564 [2:06:29<15:34,  1.19s/it]

{'loss': 0.0982, 'grad_norm': 0.2543574869632721, 'learning_rate': 5.9719683120048755e-06, 'epoch': 2.64}


 88%|████████▊ | 5790/6564 [2:06:41<15:24,  1.19s/it]

{'loss': 0.0145, 'grad_norm': 0.16939125955104828, 'learning_rate': 5.895795246800732e-06, 'epoch': 2.65}


 88%|████████▊ | 5800/6564 [2:06:52<15:01,  1.18s/it]

{'loss': 0.052, 'grad_norm': 0.028997093439102173, 'learning_rate': 5.819622181596588e-06, 'epoch': 2.65}


 89%|████████▊ | 5810/6564 [2:07:04<15:08,  1.21s/it]

{'loss': 0.0452, 'grad_norm': 3.5914251804351807, 'learning_rate': 5.743449116392444e-06, 'epoch': 2.66}


 89%|████████▊ | 5820/6564 [2:07:16<14:50,  1.20s/it]

{'loss': 0.0334, 'grad_norm': 0.3885941505432129, 'learning_rate': 5.6672760511883e-06, 'epoch': 2.66}


 89%|████████▉ | 5830/6564 [2:07:28<14:51,  1.22s/it]

{'loss': 0.0381, 'grad_norm': 0.02082815021276474, 'learning_rate': 5.591102985984156e-06, 'epoch': 2.66}


 89%|████████▉ | 5840/6564 [2:07:41<14:56,  1.24s/it]

{'loss': 0.0409, 'grad_norm': 3.3115344047546387, 'learning_rate': 5.514929920780013e-06, 'epoch': 2.67}


 89%|████████▉ | 5850/6564 [2:07:53<14:12,  1.19s/it]

{'loss': 0.0082, 'grad_norm': 2.1816155910491943, 'learning_rate': 5.438756855575869e-06, 'epoch': 2.67}


 89%|████████▉ | 5860/6564 [2:08:05<13:40,  1.17s/it]

{'loss': 0.0675, 'grad_norm': 6.916333198547363, 'learning_rate': 5.362583790371725e-06, 'epoch': 2.68}


 89%|████████▉ | 5870/6564 [2:08:17<13:59,  1.21s/it]

{'loss': 0.0298, 'grad_norm': 0.6399375200271606, 'learning_rate': 5.286410725167581e-06, 'epoch': 2.68}


 90%|████████▉ | 5880/6564 [2:08:29<13:33,  1.19s/it]

{'loss': 0.0538, 'grad_norm': 1.84055495262146, 'learning_rate': 5.210237659963437e-06, 'epoch': 2.69}


 90%|████████▉ | 5890/6564 [2:08:40<13:21,  1.19s/it]

{'loss': 0.0022, 'grad_norm': 0.026082493364810944, 'learning_rate': 5.134064594759294e-06, 'epoch': 2.69}


 90%|████████▉ | 5900/6564 [2:08:53<13:21,  1.21s/it]

{'loss': 0.0648, 'grad_norm': 7.446417331695557, 'learning_rate': 5.05789152955515e-06, 'epoch': 2.7}


 90%|█████████ | 5910/6564 [2:09:04<12:58,  1.19s/it]

{'loss': 0.0653, 'grad_norm': 0.6462566256523132, 'learning_rate': 4.981718464351006e-06, 'epoch': 2.7}


 90%|█████████ | 5920/6564 [2:09:16<12:27,  1.16s/it]

{'loss': 0.1967, 'grad_norm': 0.0323537215590477, 'learning_rate': 4.905545399146862e-06, 'epoch': 2.71}


 90%|█████████ | 5930/6564 [2:09:28<12:24,  1.17s/it]

{'loss': 0.0015, 'grad_norm': 0.11580958962440491, 'learning_rate': 4.829372333942718e-06, 'epoch': 2.71}


 90%|█████████ | 5940/6564 [2:09:40<12:40,  1.22s/it]

{'loss': 0.064, 'grad_norm': 0.10241850465536118, 'learning_rate': 4.753199268738575e-06, 'epoch': 2.71}


 91%|█████████ | 5950/6564 [2:09:52<12:34,  1.23s/it]

{'loss': 0.0016, 'grad_norm': 0.052852876484394073, 'learning_rate': 4.677026203534431e-06, 'epoch': 2.72}


 91%|█████████ | 5960/6564 [2:10:04<11:51,  1.18s/it]

{'loss': 0.0792, 'grad_norm': 0.03304526209831238, 'learning_rate': 4.600853138330287e-06, 'epoch': 2.72}


 91%|█████████ | 5970/6564 [2:10:15<11:30,  1.16s/it]

{'loss': 0.0444, 'grad_norm': 0.14582961797714233, 'learning_rate': 4.524680073126143e-06, 'epoch': 2.73}


 91%|█████████ | 5980/6564 [2:10:27<11:16,  1.16s/it]

{'loss': 0.0074, 'grad_norm': 0.029292413964867592, 'learning_rate': 4.448507007921999e-06, 'epoch': 2.73}


 91%|█████████▏| 5990/6564 [2:10:39<11:33,  1.21s/it]

{'loss': 0.0488, 'grad_norm': 0.031892143189907074, 'learning_rate': 4.3723339427178555e-06, 'epoch': 2.74}


 91%|█████████▏| 6000/6564 [2:10:51<11:00,  1.17s/it]

{'loss': 0.0324, 'grad_norm': 0.18524375557899475, 'learning_rate': 4.296160877513712e-06, 'epoch': 2.74}


 92%|█████████▏| 6010/6564 [2:11:04<11:00,  1.19s/it]

{'loss': 0.0016, 'grad_norm': 0.03595380857586861, 'learning_rate': 4.219987812309568e-06, 'epoch': 2.75}


 92%|█████████▏| 6020/6564 [2:11:16<10:47,  1.19s/it]

{'loss': 0.0013, 'grad_norm': 0.22287617623806, 'learning_rate': 4.143814747105424e-06, 'epoch': 2.75}


 92%|█████████▏| 6030/6564 [2:11:27<10:25,  1.17s/it]

{'loss': 0.0809, 'grad_norm': 4.251863479614258, 'learning_rate': 4.06764168190128e-06, 'epoch': 2.76}


 92%|█████████▏| 6040/6564 [2:11:39<10:21,  1.19s/it]

{'loss': 0.0109, 'grad_norm': 0.19045688211917877, 'learning_rate': 3.991468616697136e-06, 'epoch': 2.76}


 92%|█████████▏| 6050/6564 [2:11:51<09:57,  1.16s/it]

{'loss': 0.0372, 'grad_norm': 3.9486217498779297, 'learning_rate': 3.915295551492993e-06, 'epoch': 2.77}


 92%|█████████▏| 6060/6564 [2:12:03<09:51,  1.17s/it]

{'loss': 0.019, 'grad_norm': 1.3420705795288086, 'learning_rate': 3.839122486288848e-06, 'epoch': 2.77}


 92%|█████████▏| 6070/6564 [2:12:14<09:34,  1.16s/it]

{'loss': 0.021, 'grad_norm': 1.2953158617019653, 'learning_rate': 3.762949421084705e-06, 'epoch': 2.77}


 93%|█████████▎| 6080/6564 [2:12:26<09:31,  1.18s/it]

{'loss': 0.0572, 'grad_norm': 24.675743103027344, 'learning_rate': 3.6867763558805605e-06, 'epoch': 2.78}


 93%|█████████▎| 6090/6564 [2:12:38<09:33,  1.21s/it]

{'loss': 0.0998, 'grad_norm': 0.0389069989323616, 'learning_rate': 3.610603290676417e-06, 'epoch': 2.78}


 93%|█████████▎| 6100/6564 [2:12:50<09:19,  1.21s/it]

{'loss': 0.0016, 'grad_norm': 0.01808329112827778, 'learning_rate': 3.5344302254722732e-06, 'epoch': 2.79}


 93%|█████████▎| 6110/6564 [2:13:02<08:58,  1.19s/it]

{'loss': 0.0682, 'grad_norm': 0.020550405606627464, 'learning_rate': 3.4582571602681296e-06, 'epoch': 2.79}


 93%|█████████▎| 6120/6564 [2:13:14<09:03,  1.22s/it]

{'loss': 0.0033, 'grad_norm': 0.19578780233860016, 'learning_rate': 3.382084095063986e-06, 'epoch': 2.8}


 93%|█████████▎| 6130/6564 [2:13:26<08:43,  1.21s/it]

{'loss': 0.0497, 'grad_norm': 41.935791015625, 'learning_rate': 3.3059110298598414e-06, 'epoch': 2.8}


 94%|█████████▎| 6140/6564 [2:13:38<08:24,  1.19s/it]

{'loss': 0.0906, 'grad_norm': 0.04775018244981766, 'learning_rate': 3.2297379646556978e-06, 'epoch': 2.81}


 94%|█████████▎| 6150/6564 [2:13:50<08:20,  1.21s/it]

{'loss': 0.033, 'grad_norm': 0.021373389288783073, 'learning_rate': 3.153564899451554e-06, 'epoch': 2.81}


 94%|█████████▍| 6160/6564 [2:14:02<08:05,  1.20s/it]

{'loss': 0.0299, 'grad_norm': 0.0896301344037056, 'learning_rate': 3.0773918342474105e-06, 'epoch': 2.82}


 94%|█████████▍| 6170/6564 [2:14:14<07:54,  1.20s/it]

{'loss': 0.0631, 'grad_norm': 9.817818641662598, 'learning_rate': 3.0012187690432664e-06, 'epoch': 2.82}


 94%|█████████▍| 6180/6564 [2:14:26<07:32,  1.18s/it]

{'loss': 0.05, 'grad_norm': 0.03585360199213028, 'learning_rate': 2.9250457038391228e-06, 'epoch': 2.82}


 94%|█████████▍| 6190/6564 [2:14:38<07:21,  1.18s/it]

{'loss': 0.0737, 'grad_norm': 0.07066062092781067, 'learning_rate': 2.848872638634979e-06, 'epoch': 2.83}


 94%|█████████▍| 6200/6564 [2:14:50<07:04,  1.17s/it]

{'loss': 0.0637, 'grad_norm': 0.050514210015535355, 'learning_rate': 2.772699573430835e-06, 'epoch': 2.83}


 95%|█████████▍| 6210/6564 [2:15:01<06:56,  1.18s/it]

{'loss': 0.0408, 'grad_norm': 0.0378100648522377, 'learning_rate': 2.6965265082266914e-06, 'epoch': 2.84}


 95%|█████████▍| 6220/6564 [2:15:13<06:46,  1.18s/it]

{'loss': 0.1038, 'grad_norm': 1.8040308952331543, 'learning_rate': 2.6203534430225473e-06, 'epoch': 2.84}


 95%|█████████▍| 6230/6564 [2:15:25<06:36,  1.19s/it]

{'loss': 0.0069, 'grad_norm': 0.03541530296206474, 'learning_rate': 2.5441803778184037e-06, 'epoch': 2.85}


 95%|█████████▌| 6240/6564 [2:15:37<06:19,  1.17s/it]

{'loss': 0.0051, 'grad_norm': 0.033474646508693695, 'learning_rate': 2.4680073126142596e-06, 'epoch': 2.85}


 95%|█████████▌| 6250/6564 [2:15:49<06:10,  1.18s/it]

{'loss': 0.1308, 'grad_norm': 0.01950627937912941, 'learning_rate': 2.391834247410116e-06, 'epoch': 2.86}


 95%|█████████▌| 6260/6564 [2:16:01<05:53,  1.16s/it]

{'loss': 0.0341, 'grad_norm': 12.179352760314941, 'learning_rate': 2.315661182205972e-06, 'epoch': 2.86}


 96%|█████████▌| 6270/6564 [2:16:13<05:50,  1.19s/it]

{'loss': 0.0018, 'grad_norm': 0.022293154150247574, 'learning_rate': 2.239488117001828e-06, 'epoch': 2.87}


 96%|█████████▌| 6280/6564 [2:16:25<05:42,  1.20s/it]

{'loss': 0.0682, 'grad_norm': 2.130002737045288, 'learning_rate': 2.163315051797684e-06, 'epoch': 2.87}


 96%|█████████▌| 6290/6564 [2:16:37<05:26,  1.19s/it]

{'loss': 0.0344, 'grad_norm': 0.046727389097213745, 'learning_rate': 2.0871419865935405e-06, 'epoch': 2.87}


 96%|█████████▌| 6300/6564 [2:16:48<05:11,  1.18s/it]

{'loss': 0.0811, 'grad_norm': 22.859132766723633, 'learning_rate': 2.010968921389397e-06, 'epoch': 2.88}


 96%|█████████▌| 6310/6564 [2:17:00<05:09,  1.22s/it]

{'loss': 0.0308, 'grad_norm': 0.042938295751810074, 'learning_rate': 1.9347958561852528e-06, 'epoch': 2.88}


 96%|█████████▋| 6320/6564 [2:17:12<04:52,  1.20s/it]

{'loss': 0.0213, 'grad_norm': 8.13247299194336, 'learning_rate': 1.8586227909811093e-06, 'epoch': 2.89}


 96%|█████████▋| 6330/6564 [2:17:24<04:35,  1.18s/it]

{'loss': 0.0016, 'grad_norm': 0.03345213830471039, 'learning_rate': 1.7824497257769653e-06, 'epoch': 2.89}


 97%|█████████▋| 6340/6564 [2:17:36<04:29,  1.20s/it]

{'loss': 0.0808, 'grad_norm': 0.031208084896206856, 'learning_rate': 1.7062766605728214e-06, 'epoch': 2.9}


 97%|█████████▋| 6350/6564 [2:17:48<04:11,  1.18s/it]

{'loss': 0.0406, 'grad_norm': 4.408977508544922, 'learning_rate': 1.6301035953686777e-06, 'epoch': 2.9}


 97%|█████████▋| 6360/6564 [2:18:00<04:04,  1.20s/it]

{'loss': 0.0251, 'grad_norm': 0.06275659054517746, 'learning_rate': 1.5539305301645339e-06, 'epoch': 2.91}


 97%|█████████▋| 6370/6564 [2:18:12<03:50,  1.19s/it]

{'loss': 0.1304, 'grad_norm': 0.050761982798576355, 'learning_rate': 1.47775746496039e-06, 'epoch': 2.91}


 97%|█████████▋| 6380/6564 [2:18:24<03:35,  1.17s/it]

{'loss': 0.0735, 'grad_norm': 0.11581601947546005, 'learning_rate': 1.4015843997562462e-06, 'epoch': 2.92}


 97%|█████████▋| 6390/6564 [2:18:36<03:26,  1.19s/it]

{'loss': 0.171, 'grad_norm': 0.29513606429100037, 'learning_rate': 1.3254113345521023e-06, 'epoch': 2.92}


 98%|█████████▊| 6400/6564 [2:18:48<03:17,  1.20s/it]

{'loss': 0.0564, 'grad_norm': 9.165438652038574, 'learning_rate': 1.2492382693479586e-06, 'epoch': 2.93}


 98%|█████████▊| 6410/6564 [2:18:59<03:00,  1.17s/it]

{'loss': 0.007, 'grad_norm': 5.851017475128174, 'learning_rate': 1.1730652041438148e-06, 'epoch': 2.93}


 98%|█████████▊| 6420/6564 [2:19:12<02:53,  1.20s/it]

{'loss': 0.0503, 'grad_norm': 0.05469610169529915, 'learning_rate': 1.096892138939671e-06, 'epoch': 2.93}


 98%|█████████▊| 6430/6564 [2:19:23<02:36,  1.17s/it]

{'loss': 0.05, 'grad_norm': 0.042262982577085495, 'learning_rate': 1.0207190737355273e-06, 'epoch': 2.94}


 98%|█████████▊| 6440/6564 [2:19:35<02:29,  1.20s/it]

{'loss': 0.0301, 'grad_norm': 0.022372540086507797, 'learning_rate': 9.445460085313834e-07, 'epoch': 2.94}


 98%|█████████▊| 6450/6564 [2:19:47<02:15,  1.19s/it]

{'loss': 0.098, 'grad_norm': 0.03492415323853493, 'learning_rate': 8.683729433272396e-07, 'epoch': 2.95}


 98%|█████████▊| 6460/6564 [2:19:59<02:05,  1.21s/it]

{'loss': 0.1041, 'grad_norm': 28.102115631103516, 'learning_rate': 7.921998781230957e-07, 'epoch': 2.95}


 99%|█████████▊| 6470/6564 [2:20:11<01:52,  1.20s/it]

{'loss': 0.0131, 'grad_norm': 0.03474888950586319, 'learning_rate': 7.160268129189518e-07, 'epoch': 2.96}


 99%|█████████▊| 6480/6564 [2:20:23<01:41,  1.21s/it]

{'loss': 0.0704, 'grad_norm': 0.04523150995373726, 'learning_rate': 6.398537477148081e-07, 'epoch': 2.96}


 99%|█████████▉| 6490/6564 [2:20:35<01:29,  1.20s/it]

{'loss': 0.0629, 'grad_norm': 0.07383324205875397, 'learning_rate': 5.636806825106642e-07, 'epoch': 2.97}


 99%|█████████▉| 6500/6564 [2:20:47<01:15,  1.18s/it]

{'loss': 0.1434, 'grad_norm': 7.6739912033081055, 'learning_rate': 4.875076173065205e-07, 'epoch': 2.97}


 99%|█████████▉| 6510/6564 [2:21:00<01:04,  1.19s/it]

{'loss': 0.049, 'grad_norm': 0.09557928144931793, 'learning_rate': 4.113345521023766e-07, 'epoch': 2.98}


 99%|█████████▉| 6520/6564 [2:21:12<00:51,  1.16s/it]

{'loss': 0.0409, 'grad_norm': 0.03226378932595253, 'learning_rate': 3.351614868982328e-07, 'epoch': 2.98}


 99%|█████████▉| 6530/6564 [2:21:24<00:39,  1.16s/it]

{'loss': 0.0033, 'grad_norm': 0.029864417389035225, 'learning_rate': 2.58988421694089e-07, 'epoch': 2.98}


100%|█████████▉| 6540/6564 [2:21:35<00:28,  1.19s/it]

{'loss': 0.0393, 'grad_norm': 0.0208151675760746, 'learning_rate': 1.8281535648994517e-07, 'epoch': 2.99}


100%|█████████▉| 6550/6564 [2:21:47<00:16,  1.20s/it]

{'loss': 0.0949, 'grad_norm': 0.06539382040500641, 'learning_rate': 1.0664229128580134e-07, 'epoch': 2.99}


100%|█████████▉| 6560/6564 [2:22:00<00:04,  1.22s/it]

{'loss': 0.0012, 'grad_norm': 0.04037441685795784, 'learning_rate': 3.046922608165753e-08, 'epoch': 3.0}


100%|██████████| 6564/6564 [2:22:04<00:00,  1.30s/it]

{'train_runtime': 8524.5142, 'train_samples_per_second': 12.317, 'train_steps_per_second': 0.77, 'train_loss': 0.14979129153661916, 'epoch': 3.0}
Total Training Time: 142.08 min





After training has completed (about 142 minutes in this run), we can call `trainer.evaluate()` to obtain the model's performance on the test set.

In [17]:
# Evaluate the fine-tuned model on the test set
print(trainer.evaluate())

100%|██████████| 625/625 [03:57<00:00,  2.63it/s]

{'eval_loss': 0.3030503988265991, 'eval_accuracy': 0.9362, 'eval_runtime': 239.074, 'eval_samples_per_second': 41.828, 'eval_steps_per_second': 2.614, 'epoch': 3.0}





The evaluation accuracy is about 93.6% (`eval_accuracy` of 0.9362).
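
As a quick sanity check, the logged metrics can be turned into more tangible numbers. The sketch below is only illustrative: the two dictionaries are copied by hand from the output above (they are not read from the `trainer` object), and the size of the test set is inferred from the rounded runtime and throughput figures, so it should be read as an estimate rather than as a value taken from the data-loading code.

```python
# Metrics copied by hand from the training and evaluation output above.
train_metrics = {
    "train_runtime": 8524.5142,          # seconds
    "train_samples_per_second": 12.317,
    "epoch": 3.0,
}
eval_metrics = {
    "eval_accuracy": 0.9362,
    "eval_runtime": 239.074,             # seconds
    "eval_samples_per_second": 41.828,
}

# Training time in minutes (matches the "Total Training Time" line).
train_minutes = train_metrics["train_runtime"] / 60

# Estimated number of test reviews: runtime times throughput.
# Both inputs are rounded in the log, so we round the product.
n_eval = round(
    eval_metrics["eval_runtime"] * eval_metrics["eval_samples_per_second"]
)

# Number of correctly and incorrectly classified test reviews.
n_correct = round(eval_metrics["eval_accuracy"] * n_eval)
n_wrong = n_eval - n_correct

print(f"Training time: {train_minutes:.2f} min")
print(f"Estimated test set size: {n_eval}")
print(f"Correct: {n_correct}, misclassified: {n_wrong}")
```

Read this way, an accuracy of 0.9362 corresponds to a few hundred misclassified reviews in the test set, which can be a more intuitive figure to report than the percentage alone.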