## Using Unsloth to finetune

## Install Prerequisite Packages

In [None]:
# This is necessary for colab
!pip install python-dotenv
!pip install datasets
!pip install plotly
!pip install nbformat
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers "trl<0.9.0" peft accelerate bitsandbytes

Collecting unsloth@ git+https://github.com/unslothai/unsloth.git (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-0hbirobv/unsloth_7ece29f83129443888b84ec24b54f40b
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-0hbirobv/unsloth_7ece29f83129443888b84ec24b54f40b
  Resolved https://github.com/unslothai/unsloth.git to commit a2f8db3e7341f983af5814a2c56f54fa29ee548d
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: unsloth
  Building wheel for unsloth (pyproject.toml) ... done
  Created wheel for unsloth: filename=unsloth-2024.10.7-py3-none-any.whl size=164377 sha256=c8373ee7d1e549f3d79ee9a9ed4aed159bd7b745ec8366ef8f6c71758fc31eb6
  Stored in directory: /tmp/pip-ephem-w

## Load `.env`

In [1]:
import os
import sys

from datasets import Dataset

from dotenv import find_dotenv, load_dotenv

load_dotenv()


True

## Important Global Parameters

In [11]:
# FINETUNING_DATASET_NAME="CPSC532/2024NOV2_arxiv_qa_data_cleaned"
FINETUNING_DATASET_NAME="CPSC532/2024NOV6_qwen_2_5_model"
OUTPUT_MODEL_NAME="2024NOV6_qwen_2_5_model"
BASE_MODEL_NAME="unsloth/Llama-3.2-3B-Instruct"

## API Keys

In [3]:
# Could also insert the token here directly
HF_TOKEN = os.getenv("HUGGINGFACE_API_KEY")
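
If `HUGGINGFACE_API_KEY` is missing from `.env`, `os.getenv` returns `None` without complaint, and the problem only surfaces later when the dataset is requested. A minimal sketch of an early check follows; `require_env` is a hypothetical helper name used for illustration and is not part of `python-dotenv`.

```python
import os

def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if unset or empty.

    Hypothetical helper for illustration; not part of python-dotenv.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set; check your .env file.")
    return value

# Usage (assumes load_dotenv() has already run):
# HF_TOKEN = require_env("HUGGINGFACE_API_KEY")
```

Failing at this point keeps the error close to its cause instead of inside a later `load_dataset` call.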

The following cells are adapted from Unsloth's finetuning notebooks.

In [4]:
max_seq_length = 16000 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.


In [5]:
from unsloth import FastLanguageModel
import torch
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = BASE_MODEL_NAME, # or choose "unsloth/Llama-3.2-1B-Instruct"
    # model_name = "unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.10.7: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: NVIDIA GeForce RTX 3080 Ti. Max memory: 11.753 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.1. CUDA = 8.6. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Unsloth: We fixed a gradient accumulation bug, but it seems like you don't have the latest transformers version!
Please update transformers, TRL and unsloth via:
`pip install --upgrade --no-cache-dir --no-deps unsloth transformers git+https://github.com/huggingface/trl.git`


In [6]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 128, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = True,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.10.7 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
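
As a sanity check on the adapter size, the number of trainable LoRA weights can be worked out by hand: each adapted linear layer with shape `in x out` gains `r * (in + out)` parameters. The sketch below assumes the Llama-3.2-3B layer shapes from the published model config (hidden size 3072, MLP size 8192, 8 key/value heads of dimension 128); those numbers do not appear in this notebook, and `lora_param_count` and `lora_scale` are illustrative names.

```python
import math

# Assumed Llama-3.2-3B shapes (from the published model config, not from this notebook).
HIDDEN = 3072          # model width
INTERMEDIATE = 8192    # MLP width
KV_DIM = 8 * 128       # 8 key/value heads x head_dim 128
NUM_LAYERS = 28        # matches the "patched 28 layers" message above

# (in_features, out_features) for each module listed in target_modules.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(r: int) -> int:
    """Each adapted linear layer adds A (r x in) and B (out x r): r * (in + out) weights."""
    per_layer = sum(r * (n_in + n_out) for n_in, n_out in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS

def lora_scale(alpha: float, r: int, use_rslora: bool) -> float:
    """Scaling applied to the LoRA update: alpha / sqrt(r) with rsLoRA, alpha / r otherwise."""
    return alpha / math.sqrt(r) if use_rslora else alpha / r

print(f"{lora_param_count(128):,}")          # 194,510,848
print(round(lora_scale(16, 128, True), 3))   # 1.414
print(lora_scale(16, 128, False))            # 0.125
```

Under these assumptions the count equals the 194,510,848 trainable parameters that Unsloth reports when training starts. It also shows why `use_rslora = True` matters at `r = 128`: the update is scaled by about 1.41 rather than 0.125.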


In [7]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)

def formatting_prompts_func(examples):
    # Render each conversation into one training string using the chat template.
    convos = examples["conversations"]
    texts = [
        tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False)
        for convo in convos
    ]
    return { "text" : texts, }

from datasets import load_dataset


## Get dataset

In [23]:
dataset_finetune = load_dataset(
    "CPSC532/arxiv_qa_data",
    "2024NOV6_qwen_2_5_model",
    split="train",
    token=HF_TOKEN
)

In [24]:
dataset_finetune

Dataset({
    features: ['filename', 'source', 'source_type', 'chunk', 'question', 'answer'],
    num_rows: 1068
})

In [25]:

dataset_finetune['question'][0]

'Where are the CLS tokens added to the augmented input tensor x[+]1:T +1, and what is their role before and after processing through the Transformer?'

In [26]:

dataset_finetune['answer'][0]

'The CLS tokens are added as an additional component (the \\((T+1)\\)-st component) to the augmented input tensor \\(x^{+}_{1:T+1}\\), which has a shape of \\([T + 1] \\times 2b\\). Specifically, the augmented input tensor is formed by combining the original input tensor \\(x_{1:T} \\in \\mathbb{R}[T \\times 2b]\\) with the CLS tokens:\n\n\\[ x^{+}_{1:T+1} = [x_1, \\ldots, x_T, c]^\\top \\]\n\nwhere \\(c\\) represents the CLS token.\n\n### Role Before Processing Through the Transformer\n\nBefore processing through the Transformer, the CLS tokens do not carry any covariate-specific information. They are initialized as part of the augmented input tensor but have not yet interacted with the covariates or undergone any transformation by the attention mechanism (2.4)-(2.6). Their role at this stage is to serve as a global prior parameter, providing initial context before being influenced by the covariate data.\n\n### Role After Processing Through the Transformer\n\nAfter processing through 

## Convert dataset to messages format

In [27]:
def convert_to_messages_format(example):
    return [
        {"role": "user", "content": example['question']},
        {"role": "assistant", "content": example['answer']},
    ]

In [28]:
dataset_finetune = dataset_finetune.map(
    lambda x: {
        'conversations' : convert_to_messages_format(x)
        }
)

Map: 100%|██████████| 1068/1068 [00:00<00:00, 15076.20 examples/s]


In [29]:
dataset_finetune = dataset_finetune.map(formatting_prompts_func, batched = True,)

Map: 100%|██████████| 1068/1068 [00:00<00:00, 18775.59 examples/s]


In [30]:
dataset_finetune['text'][0]

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhere are the CLS tokens added to the augmented input tensor x[+]1:T +1, and what is their role before and after processing through the Transformer?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nThe CLS tokens are added as an additional component (the \\((T+1)\\)-st component) to the augmented input tensor \\(x^{+}_{1:T+1}\\), which has a shape of \\([T + 1] \\times 2b\\). Specifically, the augmented input tensor is formed by combining the original input tensor \\(x_{1:T} \\in \\mathbb{R}[T \\times 2b]\\) with the CLS tokens:\n\n\\[ x^{+}_{1:T+1} = [x_1, \\ldots, x_T, c]^\\top \\]\n\nwhere \\(c\\) represents the CLS token.\n\n### Role Before Processing Through the Transformer\n\nBefore processing through the Transformer, the CLS tokens do not carry any covariate-specific information. They are
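
To make the structure of the rendered string easier to see, here is a hand-written approximation of the layout visible in the output above. It is not the Jinja template that `get_chat_template` installs; `render_llama31` and the fixed system header are illustrative assumptions, and the closing `<|eot_id|>` after each turn reflects how the Llama 3.1 format is generally described rather than anything shown in the truncated output.

```python
# Default system block as it appears in the rendered example above.
SYSTEM_HEADER = "Cutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n"

def render_llama31(conversation: list[dict]) -> str:
    """Approximate the llama-3.1 chat layout for a list of {"role", "content"} messages."""
    parts = ["<|begin_of_text|>"]
    parts.append(f"<|start_header_id|>system<|end_header_id|>\n\n{SYSTEM_HEADER}<|eot_id|>")
    for message in conversation:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    return "".join(parts)

# Made-up conversation, for illustration only.
example = [
    {"role": "user", "content": "What is a CLS token?"},
    {"role": "assistant", "content": "A summary token appended to the input."},
]
print(render_llama31(example))
```

The header markers that delimit each turn are the same strings later passed to `train_on_responses_only` as `instruction_part` and `response_part`.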

## Set Training Parameters

In [31]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset_finetune,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 1,  # Affects memory usage
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 1, # Affects memory usage
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 10, # Number of full passes over the training set.
        # max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none"
    ),
)

Map: 100%|██████████| 1068/1068 [00:00<00:00, 6513.37 examples/s]
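
With `lr_scheduler_type = "linear"` and `warmup_steps = 5`, the learning rate rises linearly to `2e-4` over the first five optimizer steps and then decays linearly to zero at the final step. The sketch below reproduces that shape in plain Python. `linear_lr` is an illustrative function rather than the scheduler object the trainer builds, and the total of 2,670 steps is derived from the configuration (1,068 rows, effective batch size 4, 10 epochs).

```python
def linear_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup to base_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# 1068 rows / effective batch 4 = 267 steps per epoch, times 10 epochs.
total_steps = 267 * 10

for step in (1, 5, 6, total_steps):
    print(step, linear_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=total_steps))
```

Steps 1, 5, and 6 come out at roughly `4e-05`, `2e-04`, and `1.99925e-04`, which agrees with the `learning_rate` values logged during training.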


We also use Unsloth's `train_on_responses_only` helper so that the loss is computed only on the assistant's responses and the user's inputs are ignored.

In [32]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

Map: 100%|██████████| 1068/1068 [00:00<00:00, 5527.58 examples/s]
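
Conceptually, training on responses only means replacing the label of every token outside an assistant turn with `-100`, the index that PyTorch's cross-entropy loss ignores by default. The following is a simplified sketch over plain token-id lists, not Unsloth's implementation; `mask_non_response`, `find_sub`, and the toy token ids are made up for illustration.

```python
IGNORE_INDEX = -100  # label value skipped by the loss

def find_sub(seq: list[int], sub: list[int], start: int = 0) -> int:
    """Return the first index >= start where sub occurs in seq, or -1."""
    for i in range(start, len(seq) - len(sub) + 1):
        if seq[i:i + len(sub)] == sub:
            return i
    return -1

def mask_non_response(input_ids, instruction_ids, response_ids):
    """Keep labels only for tokens after a response marker, up to the next instruction marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    pos = 0
    while True:
        start = find_sub(input_ids, response_ids, pos)
        if start == -1:
            break
        begin = start + len(response_ids)           # first token of the assistant reply
        end = find_sub(input_ids, instruction_ids, begin)
        if end == -1:
            end = len(input_ids)                    # reply runs to the end of the sequence
        labels[begin:end] = input_ids[begin:end]
        pos = end
    return labels

# Toy ids: 1 = user marker, 2 = assistant marker, others = content.
ids = [1, 10, 11, 2, 20, 21, 1, 12, 2, 22]
print(mask_non_response(ids, [1], [2]))
# [-100, -100, -100, -100, 20, 21, -100, -100, -100, 22]
```

Decoding the labels with `-100` swapped for a space token, as done a few cells later, is a quick way to confirm that only the assistant text survives the mask.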


In [33]:
tokenizer.decode(trainer.train_dataset[0]["input_ids"])

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhere are the CLS tokens added to the augmented input tensor x[+]1:T +1, and what is their role before and after processing through the Transformer?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nThe CLS tokens are added as an additional component (the \\((T+1)\\)-st component) to the augmented input tensor \\(x^{+}_{1:T+1}\\), which has a shape of \\([T + 1] \\times 2b\\). Specifically, the augmented input tensor is formed by combining the original input tensor \\(x_{1:T} \\in \\mathbb{R}[T \\times 2b]\\) with the CLS tokens:\n\n\\[ x^{+}_{1:T+1} = [x_1, \\ldots, x_T, c]^\\top \\]\n\nwhere \\(c\\) represents the CLS token.\n\n### Role Before Processing Through the Transformer\n\nBefore processing through the Transformer, the CLS tokens do not carry any covariate-specific information. They are

In [34]:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

'                                                                      \n\n### What are CLS Tokens and How Are They Used in the Credibility Transformer Architecture?\n\n#### Definition of CLS Tokens\nCLS (Class) tokens play a crucial role in the Credibility Transformer architecture. These tokens are scalar values (\\( c_j \\in \\mathbb{R} \\)) that encode one column of the input tensor, forming part of an augmented input tensor denoted as \\( x[+]1:T + 1 \\). Specifically, CLS tokens are represented by \\( c = [c_1,..., c_{2b}]^T \\), where each scalar \\( c_j \\) corresponds to a one-dimensional projection of the corresponding vector in the input tensor. In this augmented input tensor, these CLS tokens occupy the (T + 1)-st component.\n\n#### Usage in Credibility Transformer Architecture\n- **Augmented Input Tensor**: The CLS tokens are part of an augmented input tensor that includes both covariates and positional encodings.\n  \n- **Normalization Layer**: Before entering the Transfor

In [35]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA GeForce RTX 3080 Ti. Max memory = 11.753 GB.
3.275 GB of memory reserved.
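
The same figures can be expressed as a share of total memory, which is convenient when comparing usage before and after training. A small sketch in plain Python using the numbers printed above; `to_gb` and `memory_share` are illustrative helpers.

```python
def to_gb(num_bytes: int) -> float:
    """Convert bytes to GB with the same 1024**3 divisor and rounding as the cell above."""
    return round(num_bytes / 1024 / 1024 / 1024, 3)

def memory_share(reserved_gb: float, total_gb: float) -> float:
    """Percentage of total GPU memory that is currently reserved."""
    return round(reserved_gb / total_gb * 100, 1)

# Figures from the output above: 3.275 GB reserved out of 11.753 GB.
print(memory_share(3.275, 11.753))  # 27.9
```

So a little over a quarter of the card is already reserved before the first training step.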


## Train

In [36]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 1,068 | Num Epochs = 10
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 4
\        /    Total batch size = 4 | Total steps = 2,670
 "-____-"     Number of trainable parameters = 194,510,848


**** Unsloth: Please use our fixed gradient_accumulation_steps by updating transformers, TRL and Unsloth!
`pip install --upgrade --no-cache-dir --no-deps unsloth transformers git+https://github.com/huggingface/trl.git`
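
The batch size and step counts in the banner above follow directly from the trainer configuration. A short worked example of the arithmetic; `training_plan` is an illustrative helper, not a function from TRL or Unsloth.

```python
import math

def training_plan(num_examples, per_device_batch, grad_accum, num_gpus, epochs):
    """Derive the effective batch size and optimizer step counts reported in the banner."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * epochs,
    }

plan = training_plan(num_examples=1068, per_device_batch=1, grad_accum=4, num_gpus=1, epochs=10)
print(plan)  # {'effective_batch': 4, 'steps_per_epoch': 267, 'total_steps': 2670}

# Fraction of an epoch completed after a given optimizer step (the log's 'epoch' field).
print(round(134 / plan["steps_per_epoch"], 2))  # 0.5
```

Each logged step therefore covers four examples, and 267 steps make up one pass over the 1,068 rows.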


  0%|          | 1/2670 [00:02<1:32:07,  2.07s/it]

{'loss': 1.8814, 'grad_norm': 1.4999339580535889, 'learning_rate': 4e-05, 'epoch': 0.0}


  0%|          | 2/2670 [00:02<1:02:08,  1.40s/it]

{'loss': 1.9541, 'grad_norm': 4.335379600524902, 'learning_rate': 8e-05, 'epoch': 0.01}


  0%|          | 3/2670 [00:04<1:00:36,  1.36s/it]

{'loss': 2.5364, 'grad_norm': 7.329607963562012, 'learning_rate': 0.00012, 'epoch': 0.01}


  0%|          | 4/2670 [00:05<57:42,  1.30s/it]  

{'loss': 1.56, 'grad_norm': 1.255139708518982, 'learning_rate': 0.00016, 'epoch': 0.01}


  0%|          | 5/2670 [00:06<56:31,  1.27s/it]

{'loss': 1.6291, 'grad_norm': 1.1483500003814697, 'learning_rate': 0.0002, 'epoch': 0.02}


  0%|          | 6/2670 [00:08<1:03:15,  1.42s/it]

{'loss': 1.5504, 'grad_norm': 0.9299312233924866, 'learning_rate': 0.00019992495309568482, 'epoch': 0.02}


  0%|          | 7/2670 [00:10<1:05:06,  1.47s/it]

{'loss': 1.3889, 'grad_norm': 0.8983221650123596, 'learning_rate': 0.0001998499061913696, 'epoch': 0.03}


  0%|          | 8/2670 [00:11<1:04:47,  1.46s/it]

{'loss': 1.5516, 'grad_norm': 0.9957975745201111, 'learning_rate': 0.0001997748592870544, 'epoch': 0.03}


  0%|          | 9/2670 [00:12<1:00:04,  1.35s/it]

{'loss': 1.6243, 'grad_norm': 1.420490026473999, 'learning_rate': 0.00019969981238273922, 'epoch': 0.03}


  0%|          | 10/2670 [00:13<1:00:03,  1.35s/it]

{'loss': 1.6244, 'grad_norm': 1.0333540439605713, 'learning_rate': 0.00019962476547842403, 'epoch': 0.04}


  0%|          | 11/2670 [00:15<1:02:17,  1.41s/it]

{'loss': 1.3095, 'grad_norm': 1.0123270750045776, 'learning_rate': 0.00019954971857410884, 'epoch': 0.04}


  0%|          | 12/2670 [00:16<1:02:23,  1.41s/it]

{'loss': 1.7838, 'grad_norm': 0.9026670455932617, 'learning_rate': 0.00019947467166979364, 'epoch': 0.04}


  0%|          | 13/2670 [00:18<1:04:47,  1.46s/it]

{'loss': 1.5747, 'grad_norm': 0.8933441638946533, 'learning_rate': 0.00019939962476547845, 'epoch': 0.05}


  1%|          | 14/2670 [00:19<1:03:24,  1.43s/it]

{'loss': 1.583, 'grad_norm': 0.9513471722602844, 'learning_rate': 0.00019932457786116324, 'epoch': 0.05}


  1%|          | 15/2670 [00:21<1:06:46,  1.51s/it]

{'loss': 1.3352, 'grad_norm': 0.7099140882492065, 'learning_rate': 0.00019924953095684804, 'epoch': 0.06}


  1%|          | 16/2670 [00:22<1:02:48,  1.42s/it]

{'loss': 1.6923, 'grad_norm': 0.9866396188735962, 'learning_rate': 0.00019917448405253285, 'epoch': 0.06}


  1%|          | 17/2670 [00:24<1:05:12,  1.47s/it]

{'loss': 1.5648, 'grad_norm': 0.6784353852272034, 'learning_rate': 0.00019909943714821763, 'epoch': 0.06}


  1%|          | 18/2670 [00:25<1:07:00,  1.52s/it]

{'loss': 1.7083, 'grad_norm': 0.7330591082572937, 'learning_rate': 0.00019902439024390244, 'epoch': 0.07}


  1%|          | 19/2670 [00:27<1:03:45,  1.44s/it]

{'loss': 1.8701, 'grad_norm': 0.8928713798522949, 'learning_rate': 0.00019894934333958725, 'epoch': 0.07}


  1%|          | 20/2670 [00:28<1:05:18,  1.48s/it]

{'loss': 1.551, 'grad_norm': 0.8457674384117126, 'learning_rate': 0.00019887429643527203, 'epoch': 0.07}


  1%|          | 21/2670 [00:30<1:05:45,  1.49s/it]

{'loss': 1.4609, 'grad_norm': 1.0717543363571167, 'learning_rate': 0.00019879924953095684, 'epoch': 0.08}


  1%|          | 22/2670 [00:31<1:07:04,  1.52s/it]

{'loss': 1.4061, 'grad_norm': 0.8107964992523193, 'learning_rate': 0.00019872420262664165, 'epoch': 0.08}


  1%|          | 23/2670 [00:33<1:03:38,  1.44s/it]

{'loss': 1.2095, 'grad_norm': 0.9287973642349243, 'learning_rate': 0.0001986491557223265, 'epoch': 0.09}


  1%|          | 24/2670 [00:34<1:03:01,  1.43s/it]

{'loss': 1.3834, 'grad_norm': 1.015104055404663, 'learning_rate': 0.00019857410881801127, 'epoch': 0.09}


  1%|          | 25/2670 [00:36<1:04:29,  1.46s/it]

{'loss': 1.4019, 'grad_norm': 0.937252938747406, 'learning_rate': 0.00019849906191369608, 'epoch': 0.09}


  1%|          | 26/2670 [00:37<1:02:07,  1.41s/it]

{'loss': 1.1771, 'grad_norm': 0.872703492641449, 'learning_rate': 0.0001984240150093809, 'epoch': 0.1}


  1%|          | 27/2670 [00:38<1:00:48,  1.38s/it]

{'loss': 1.5768, 'grad_norm': 0.9836437106132507, 'learning_rate': 0.00019834896810506567, 'epoch': 0.1}


  1%|          | 28/2670 [00:39<59:12,  1.34s/it]  

{'loss': 1.2488, 'grad_norm': 1.1096760034561157, 'learning_rate': 0.00019827392120075048, 'epoch': 0.1}


  1%|          | 29/2670 [00:41<1:02:33,  1.42s/it]

{'loss': 1.6164, 'grad_norm': 0.7690670490264893, 'learning_rate': 0.0001981988742964353, 'epoch': 0.11}


  1%|          | 30/2670 [00:42<1:00:37,  1.38s/it]

{'loss': 1.3646, 'grad_norm': 0.8451787233352661, 'learning_rate': 0.00019812382739212007, 'epoch': 0.11}


  1%|          | 31/2670 [00:44<58:55,  1.34s/it]  

{'loss': 1.1061, 'grad_norm': 0.8278184533119202, 'learning_rate': 0.00019804878048780488, 'epoch': 0.12}


  1%|          | 32/2670 [00:45<58:34,  1.33s/it]

{'loss': 1.5513, 'grad_norm': 0.8071569204330444, 'learning_rate': 0.00019797373358348969, 'epoch': 0.12}


  1%|          | 33/2670 [00:46<56:38,  1.29s/it]

{'loss': 1.4677, 'grad_norm': 0.938491702079773, 'learning_rate': 0.0001978986866791745, 'epoch': 0.12}


  1%|▏         | 34/2670 [00:47<57:54,  1.32s/it]

{'loss': 1.4389, 'grad_norm': 0.9732556939125061, 'learning_rate': 0.0001978236397748593, 'epoch': 0.13}


  1%|▏         | 35/2670 [00:49<59:19,  1.35s/it]

{'loss': 1.2385, 'grad_norm': 0.8377432823181152, 'learning_rate': 0.0001977485928705441, 'epoch': 0.13}


  1%|▏         | 36/2670 [00:50<1:00:02,  1.37s/it]

{'loss': 1.3881, 'grad_norm': 0.8687959313392639, 'learning_rate': 0.00019767354596622892, 'epoch': 0.13}


  1%|▏         | 37/2670 [00:52<1:01:50,  1.41s/it]

{'loss': 1.5279, 'grad_norm': 0.8189431428909302, 'learning_rate': 0.0001975984990619137, 'epoch': 0.14}


  1%|▏         | 38/2670 [00:53<1:03:08,  1.44s/it]

{'loss': 1.4703, 'grad_norm': 0.8265480995178223, 'learning_rate': 0.0001975234521575985, 'epoch': 0.14}


  1%|▏         | 39/2670 [00:55<1:02:58,  1.44s/it]

{'loss': 1.3288, 'grad_norm': 0.766923725605011, 'learning_rate': 0.00019744840525328332, 'epoch': 0.15}


  1%|▏         | 40/2670 [00:56<1:03:04,  1.44s/it]

{'loss': 1.6216, 'grad_norm': 0.8498266339302063, 'learning_rate': 0.0001973733583489681, 'epoch': 0.15}


  2%|▏         | 41/2670 [00:57<1:00:22,  1.38s/it]

{'loss': 1.1334, 'grad_norm': 1.0123637914657593, 'learning_rate': 0.0001972983114446529, 'epoch': 0.15}


  2%|▏         | 42/2670 [00:59<1:01:38,  1.41s/it]

{'loss': 1.422, 'grad_norm': 0.8046767115592957, 'learning_rate': 0.00019722326454033772, 'epoch': 0.16}


  2%|▏         | 43/2670 [01:00<59:08,  1.35s/it]  

{'loss': 1.4447, 'grad_norm': 0.895034670829773, 'learning_rate': 0.0001971482176360225, 'epoch': 0.16}


  2%|▏         | 44/2670 [01:01<56:55,  1.30s/it]

{'loss': 1.3585, 'grad_norm': 0.9440730810165405, 'learning_rate': 0.0001970731707317073, 'epoch': 0.16}


  2%|▏         | 45/2670 [01:03<1:00:46,  1.39s/it]

{'loss': 1.2331, 'grad_norm': 0.7039077281951904, 'learning_rate': 0.00019699812382739215, 'epoch': 0.17}


  2%|▏         | 46/2670 [01:04<1:02:34,  1.43s/it]

{'loss': 1.1689, 'grad_norm': 0.6966765522956848, 'learning_rate': 0.00019692307692307696, 'epoch': 0.17}


  2%|▏         | 47/2670 [01:06<1:00:36,  1.39s/it]

{'loss': 1.4765, 'grad_norm': 0.9199206829071045, 'learning_rate': 0.00019684803001876174, 'epoch': 0.18}


  2%|▏         | 48/2670 [01:07<1:03:06,  1.44s/it]

{'loss': 1.4938, 'grad_norm': 0.8813374638557434, 'learning_rate': 0.00019677298311444655, 'epoch': 0.18}


  2%|▏         | 49/2670 [01:08<59:59,  1.37s/it]  

{'loss': 1.1455, 'grad_norm': 1.1979137659072876, 'learning_rate': 0.00019669793621013136, 'epoch': 0.18}


  2%|▏         | 50/2670 [01:10<1:00:13,  1.38s/it]

{'loss': 1.4983, 'grad_norm': 0.8927791118621826, 'learning_rate': 0.00019662288930581614, 'epoch': 0.19}


  2%|▏         | 51/2670 [01:11<59:56,  1.37s/it]  

{'loss': 1.4307, 'grad_norm': 0.8430410027503967, 'learning_rate': 0.00019654784240150095, 'epoch': 0.19}


  2%|▏         | 52/2670 [01:13<1:01:45,  1.42s/it]

{'loss': 1.4928, 'grad_norm': 0.7813041806221008, 'learning_rate': 0.00019647279549718575, 'epoch': 0.19}


  2%|▏         | 53/2670 [01:14<58:33,  1.34s/it]  

{'loss': 1.1367, 'grad_norm': 0.9888713955879211, 'learning_rate': 0.00019639774859287054, 'epoch': 0.2}


  2%|▏         | 54/2670 [01:15<1:00:01,  1.38s/it]

{'loss': 1.3822, 'grad_norm': 0.7854729890823364, 'learning_rate': 0.00019632270168855535, 'epoch': 0.2}


  2%|▏         | 55/2670 [01:17<58:07,  1.33s/it]  

{'loss': 0.9085, 'grad_norm': 0.7992561459541321, 'learning_rate': 0.00019624765478424015, 'epoch': 0.21}


  2%|▏         | 56/2670 [01:18<59:04,  1.36s/it]

{'loss': 1.3443, 'grad_norm': 0.9032743573188782, 'learning_rate': 0.00019617260787992496, 'epoch': 0.21}


  2%|▏         | 57/2670 [01:19<56:38,  1.30s/it]

{'loss': 1.4019, 'grad_norm': 1.4080089330673218, 'learning_rate': 0.00019609756097560977, 'epoch': 0.21}


  2%|▏         | 58/2670 [01:21<1:00:35,  1.39s/it]

{'loss': 1.7165, 'grad_norm': 1.092320442199707, 'learning_rate': 0.00019602251407129458, 'epoch': 0.22}


  2%|▏         | 59/2670 [01:22<58:09,  1.34s/it]  

{'loss': 1.4758, 'grad_norm': 1.1796749830245972, 'learning_rate': 0.0001959474671669794, 'epoch': 0.22}


  2%|▏         | 60/2670 [01:24<1:01:39,  1.42s/it]

{'loss': 1.3831, 'grad_norm': 0.787621259689331, 'learning_rate': 0.00019587242026266417, 'epoch': 0.22}


  2%|▏         | 61/2670 [01:25<59:47,  1.38s/it]  

{'loss': 1.3843, 'grad_norm': 1.3761162757873535, 'learning_rate': 0.00019579737335834898, 'epoch': 0.23}


  2%|▏         | 62/2670 [01:26<57:06,  1.31s/it]

{'loss': 1.2604, 'grad_norm': 1.0732957124710083, 'learning_rate': 0.0001957223264540338, 'epoch': 0.23}


  2%|▏         | 63/2670 [01:28<1:05:20,  1.50s/it]

{'loss': 1.156, 'grad_norm': 0.6469979882240295, 'learning_rate': 0.00019564727954971857, 'epoch': 0.24}


  2%|▏         | 64/2670 [01:30<1:07:06,  1.55s/it]

{'loss': 1.3385, 'grad_norm': 0.7027405500411987, 'learning_rate': 0.00019557223264540338, 'epoch': 0.24}


  2%|▏         | 65/2670 [01:31<1:05:49,  1.52s/it]

{'loss': 1.1286, 'grad_norm': 0.8013327121734619, 'learning_rate': 0.0001954971857410882, 'epoch': 0.24}


  2%|▏         | 66/2670 [01:33<1:04:42,  1.49s/it]

{'loss': 1.2745, 'grad_norm': 0.9353983998298645, 'learning_rate': 0.00019542213883677297, 'epoch': 0.25}


  3%|▎         | 67/2670 [01:34<1:02:26,  1.44s/it]

{'loss': 1.3522, 'grad_norm': 1.2060198783874512, 'learning_rate': 0.0001953470919324578, 'epoch': 0.25}


  3%|▎         | 68/2670 [01:36<1:05:22,  1.51s/it]

{'loss': 1.0533, 'grad_norm': 0.641863226890564, 'learning_rate': 0.00019527204502814262, 'epoch': 0.25}


  3%|▎         | 69/2670 [01:37<1:02:45,  1.45s/it]

{'loss': 1.1486, 'grad_norm': 1.0249252319335938, 'learning_rate': 0.0001951969981238274, 'epoch': 0.26}


  3%|▎         | 70/2670 [01:38<1:04:44,  1.49s/it]

{'loss': 1.3409, 'grad_norm': 0.9229178428649902, 'learning_rate': 0.0001951219512195122, 'epoch': 0.26}


  3%|▎         | 71/2670 [01:40<1:02:35,  1.44s/it]

{'loss': 1.0507, 'grad_norm': 1.088132619857788, 'learning_rate': 0.00019504690431519701, 'epoch': 0.27}


  3%|▎         | 72/2670 [01:41<1:01:01,  1.41s/it]

{'loss': 1.1412, 'grad_norm': 1.0096207857131958, 'learning_rate': 0.00019497185741088182, 'epoch': 0.27}


  3%|▎         | 73/2670 [01:43<1:01:04,  1.41s/it]

{'loss': 1.4964, 'grad_norm': 0.9406872391700745, 'learning_rate': 0.0001948968105065666, 'epoch': 0.27}


  3%|▎         | 74/2670 [01:44<58:41,  1.36s/it]  

{'loss': 1.5374, 'grad_norm': 1.0387954711914062, 'learning_rate': 0.00019482176360225141, 'epoch': 0.28}


  3%|▎         | 75/2670 [01:45<55:25,  1.28s/it]

{'loss': 1.4563, 'grad_norm': 1.0437957048416138, 'learning_rate': 0.00019474671669793622, 'epoch': 0.28}


  3%|▎         | 76/2670 [01:47<1:00:33,  1.40s/it]

{'loss': 1.3107, 'grad_norm': 0.7816162705421448, 'learning_rate': 0.000194671669793621, 'epoch': 0.28}


  3%|▎         | 77/2670 [01:48<1:02:12,  1.44s/it]

{'loss': 1.1314, 'grad_norm': 0.7461825013160706, 'learning_rate': 0.00019459662288930581, 'epoch': 0.29}


  3%|▎         | 78/2670 [01:49<1:01:56,  1.43s/it]

{'loss': 1.1957, 'grad_norm': 0.8896241784095764, 'learning_rate': 0.00019452157598499062, 'epoch': 0.29}


  3%|▎         | 79/2670 [01:51<59:09,  1.37s/it]  

{'loss': 0.8245, 'grad_norm': 0.7619091868400574, 'learning_rate': 0.00019444652908067543, 'epoch': 0.3}


  3%|▎         | 80/2670 [01:52<1:02:35,  1.45s/it]

{'loss': 1.212, 'grad_norm': 0.7017621994018555, 'learning_rate': 0.00019437148217636024, 'epoch': 0.3}


  3%|▎         | 81/2670 [01:54<1:02:59,  1.46s/it]

{'loss': 1.4563, 'grad_norm': 1.1262227296829224, 'learning_rate': 0.00019429643527204505, 'epoch': 0.3}


  3%|▎         | 82/2670 [01:55<1:03:41,  1.48s/it]

{'loss': 1.0533, 'grad_norm': 0.857958972454071, 'learning_rate': 0.00019422138836772986, 'epoch': 0.31}


  3%|▎         | 83/2670 [01:57<1:04:15,  1.49s/it]

{'loss': 1.3454, 'grad_norm': 0.8999024033546448, 'learning_rate': 0.00019414634146341464, 'epoch': 0.31}


  3%|▎         | 84/2670 [01:59<1:07:17,  1.56s/it]

{'loss': 1.2175, 'grad_norm': 0.770698070526123, 'learning_rate': 0.00019407129455909945, 'epoch': 0.31}


  3%|▎         | 85/2670 [02:00<1:05:15,  1.51s/it]

{'loss': 1.0123, 'grad_norm': 0.8635655045509338, 'learning_rate': 0.00019399624765478426, 'epoch': 0.32}


  3%|▎         | 86/2670 [02:02<1:07:29,  1.57s/it]

{'loss': 1.1949, 'grad_norm': 0.7209506034851074, 'learning_rate': 0.00019392120075046904, 'epoch': 0.32}


  3%|▎         | 87/2670 [02:03<1:04:29,  1.50s/it]

{'loss': 1.0416, 'grad_norm': 1.0180848836898804, 'learning_rate': 0.00019384615384615385, 'epoch': 0.33}


  3%|▎         | 88/2670 [02:04<1:02:07,  1.44s/it]

{'loss': 0.8596, 'grad_norm': 0.903834342956543, 'learning_rate': 0.00019377110694183866, 'epoch': 0.33}


  3%|▎         | 89/2670 [02:06<1:06:51,  1.55s/it]

{'loss': 1.3008, 'grad_norm': 0.9023714661598206, 'learning_rate': 0.00019369606003752347, 'epoch': 0.33}


  3%|▎         | 90/2670 [02:08<1:04:50,  1.51s/it]

{'loss': 0.8907, 'grad_norm': 0.7753657698631287, 'learning_rate': 0.00019362101313320827, 'epoch': 0.34}


  3%|▎         | 91/2670 [02:09<1:03:10,  1.47s/it]

{'loss': 0.9975, 'grad_norm': 0.9177045822143555, 'learning_rate': 0.00019354596622889308, 'epoch': 0.34}


  3%|▎         | 92/2670 [02:11<1:05:04,  1.51s/it]

{'loss': 1.0624, 'grad_norm': 0.8080312013626099, 'learning_rate': 0.00019347091932457787, 'epoch': 0.34}


  3%|▎         | 93/2670 [02:12<1:05:06,  1.52s/it]

{'loss': 1.1736, 'grad_norm': 0.8208636045455933, 'learning_rate': 0.00019339587242026267, 'epoch': 0.35}


  4%|▎         | 94/2670 [02:14<1:04:29,  1.50s/it]

{'loss': 1.2681, 'grad_norm': 1.0111960172653198, 'learning_rate': 0.00019332082551594748, 'epoch': 0.35}


  4%|▎         | 95/2670 [02:15<1:02:53,  1.47s/it]

{'loss': 0.8458, 'grad_norm': 0.9090946912765503, 'learning_rate': 0.0001932457786116323, 'epoch': 0.36}


  4%|▎         | 96/2670 [02:17<1:06:05,  1.54s/it]

{'loss': 1.2764, 'grad_norm': 0.8983012437820435, 'learning_rate': 0.00019317073170731707, 'epoch': 0.36}


  4%|▎         | 97/2670 [02:18<1:07:16,  1.57s/it]

{'loss': 0.8914, 'grad_norm': 0.7161690592765808, 'learning_rate': 0.00019309568480300188, 'epoch': 0.36}


  4%|▎         | 98/2670 [02:20<1:06:32,  1.55s/it]

{'loss': 1.1201, 'grad_norm': 0.7675704956054688, 'learning_rate': 0.0001930206378986867, 'epoch': 0.37}


  4%|▎         | 99/2670 [02:21<1:01:13,  1.43s/it]

{'loss': 0.8694, 'grad_norm': 0.7851967215538025, 'learning_rate': 0.00019294559099437147, 'epoch': 0.37}


  4%|▎         | 100/2670 [02:22<1:02:15,  1.45s/it]

{'loss': 1.2084, 'grad_norm': 0.8352574706077576, 'learning_rate': 0.00019287054409005628, 'epoch': 0.37}


  4%|▍         | 101/2670 [02:24<1:02:12,  1.45s/it]

{'loss': 1.527, 'grad_norm': 0.9252908229827881, 'learning_rate': 0.0001927954971857411, 'epoch': 0.38}


  4%|▍         | 102/2670 [02:25<1:02:23,  1.46s/it]

{'loss': 1.0754, 'grad_norm': 0.8404330015182495, 'learning_rate': 0.0001927204502814259, 'epoch': 0.38}


  4%|▍         | 103/2670 [02:27<59:21,  1.39s/it]  

{'loss': 1.1366, 'grad_norm': 0.957699716091156, 'learning_rate': 0.0001926454033771107, 'epoch': 0.39}


  4%|▍         | 104/2670 [02:28<1:02:18,  1.46s/it]

{'loss': 0.8212, 'grad_norm': 0.6915704607963562, 'learning_rate': 0.00019257035647279552, 'epoch': 0.39}


  4%|▍         | 105/2670 [02:29<58:46,  1.37s/it]  

{'loss': 1.2348, 'grad_norm': 1.0942485332489014, 'learning_rate': 0.00019249530956848033, 'epoch': 0.39}


  4%|▍         | 106/2670 [02:31<57:22,  1.34s/it]

{'loss': 1.3974, 'grad_norm': 1.1258493661880493, 'learning_rate': 0.0001924202626641651, 'epoch': 0.4}


  4%|▍         | 107/2670 [02:32<59:54,  1.40s/it]

{'loss': 1.0899, 'grad_norm': 0.870192289352417, 'learning_rate': 0.00019234521575984992, 'epoch': 0.4}


  4%|▍         | 108/2670 [02:34<59:25,  1.39s/it]

{'loss': 1.2512, 'grad_norm': 1.071904182434082, 'learning_rate': 0.00019227016885553473, 'epoch': 0.4}


  4%|▍         | 109/2670 [02:35<1:02:11,  1.46s/it]

{'loss': 1.2326, 'grad_norm': 0.952530026435852, 'learning_rate': 0.0001921951219512195, 'epoch': 0.41}


  4%|▍         | 110/2670 [02:37<1:02:57,  1.48s/it]

{'loss': 1.1793, 'grad_norm': 0.7702542543411255, 'learning_rate': 0.00019212007504690432, 'epoch': 0.41}


  4%|▍         | 111/2670 [02:38<1:04:44,  1.52s/it]

{'loss': 1.313, 'grad_norm': 0.8339638710021973, 'learning_rate': 0.00019204502814258912, 'epoch': 0.42}


  4%|▍         | 112/2670 [02:40<1:01:50,  1.45s/it]

{'loss': 1.1589, 'grad_norm': 1.0069241523742676, 'learning_rate': 0.00019196998123827393, 'epoch': 0.42}


  4%|▍         | 113/2670 [02:41<1:01:41,  1.45s/it]

{'loss': 1.1215, 'grad_norm': 0.8888334035873413, 'learning_rate': 0.00019189493433395874, 'epoch': 0.42}


  4%|▍         | 114/2670 [02:42<1:01:21,  1.44s/it]

{'loss': 1.1254, 'grad_norm': 0.8063221573829651, 'learning_rate': 0.00019181988742964355, 'epoch': 0.43}


  4%|▍         | 115/2670 [02:44<1:02:13,  1.46s/it]

{'loss': 1.128, 'grad_norm': 0.7835497856140137, 'learning_rate': 0.00019174484052532833, 'epoch': 0.43}


  4%|▍         | 116/2670 [02:45<57:50,  1.36s/it]  

{'loss': 0.6451, 'grad_norm': 0.7845574021339417, 'learning_rate': 0.00019166979362101314, 'epoch': 0.43}


  4%|▍         | 117/2670 [02:47<59:49,  1.41s/it]

{'loss': 1.2895, 'grad_norm': 1.1049498319625854, 'learning_rate': 0.00019159474671669795, 'epoch': 0.44}


  4%|▍         | 118/2670 [02:48<58:54,  1.39s/it]

{'loss': 1.168, 'grad_norm': 1.1010576486587524, 'learning_rate': 0.00019151969981238276, 'epoch': 0.44}


  4%|▍         | 119/2670 [02:49<1:01:01,  1.44s/it]

{'loss': 0.9941, 'grad_norm': 0.8988431096076965, 'learning_rate': 0.00019144465290806754, 'epoch': 0.45}


  4%|▍         | 120/2670 [02:51<59:22,  1.40s/it]  

{'loss': 0.6355, 'grad_norm': 1.0000977516174316, 'learning_rate': 0.00019136960600375235, 'epoch': 0.45}


  5%|▍         | 121/2670 [02:52<59:30,  1.40s/it]

{'loss': 1.1045, 'grad_norm': 1.0586273670196533, 'learning_rate': 0.00019129455909943716, 'epoch': 0.45}


  5%|▍         | 122/2670 [02:54<59:23,  1.40s/it]

{'loss': 0.7311, 'grad_norm': 0.8337058424949646, 'learning_rate': 0.00019121951219512194, 'epoch': 0.46}


  5%|▍         | 123/2670 [02:55<56:44,  1.34s/it]

{'loss': 1.0516, 'grad_norm': 1.0386929512023926, 'learning_rate': 0.00019114446529080675, 'epoch': 0.46}


  5%|▍         | 124/2670 [02:56<56:04,  1.32s/it]

{'loss': 1.2476, 'grad_norm': 1.046160101890564, 'learning_rate': 0.00019106941838649156, 'epoch': 0.46}


  5%|▍         | 125/2670 [02:58<59:56,  1.41s/it]

{'loss': 1.1375, 'grad_norm': 0.8118711709976196, 'learning_rate': 0.00019099437148217637, 'epoch': 0.47}


  5%|▍         | 126/2670 [03:00<1:05:21,  1.54s/it]

{'loss': 1.2018, 'grad_norm': 0.7302491664886475, 'learning_rate': 0.00019091932457786118, 'epoch': 0.47}


  5%|▍         | 127/2670 [03:01<1:06:05,  1.56s/it]

{'loss': 0.9304, 'grad_norm': 0.8028238415718079, 'learning_rate': 0.00019084427767354599, 'epoch': 0.48}


  5%|▍         | 128/2670 [03:03<1:08:03,  1.61s/it]

{'loss': 1.3084, 'grad_norm': 0.8226630091667175, 'learning_rate': 0.0001907692307692308, 'epoch': 0.48}


  5%|▍         | 129/2670 [03:04<1:04:50,  1.53s/it]

{'loss': 1.2044, 'grad_norm': 1.0818675756454468, 'learning_rate': 0.00019069418386491558, 'epoch': 0.48}


  5%|▍         | 130/2670 [03:06<1:04:43,  1.53s/it]

{'loss': 1.0579, 'grad_norm': 0.9131962656974792, 'learning_rate': 0.00019061913696060038, 'epoch': 0.49}


  5%|▍         | 131/2670 [03:07<1:01:40,  1.46s/it]

{'loss': 1.157, 'grad_norm': 1.0773954391479492, 'learning_rate': 0.0001905440900562852, 'epoch': 0.49}


  5%|▍         | 132/2670 [03:08<1:01:14,  1.45s/it]

{'loss': 1.0481, 'grad_norm': 0.8856256008148193, 'learning_rate': 0.00019046904315196998, 'epoch': 0.49}


  5%|▍         | 133/2670 [03:10<1:02:22,  1.48s/it]

{'loss': 0.8975, 'grad_norm': 0.8208453059196472, 'learning_rate': 0.00019039399624765478, 'epoch': 0.5}


  5%|▌         | 134/2670 [03:11<1:02:13,  1.47s/it]

{'loss': 1.3016, 'grad_norm': 0.9239622950553894, 'learning_rate': 0.0001903189493433396, 'epoch': 0.5}


  5%|▌         | 135/2670 [03:13<1:02:33,  1.48s/it]

{'loss': 1.2623, 'grad_norm': 0.9760921001434326, 'learning_rate': 0.0001902439024390244, 'epoch': 0.51}


  5%|▌         | 136/2670 [03:15<1:06:30,  1.57s/it]

{'loss': 1.211, 'grad_norm': 0.7714282274246216, 'learning_rate': 0.0001901688555347092, 'epoch': 0.51}


  5%|▌         | 137/2670 [03:16<1:05:11,  1.54s/it]

{'loss': 1.3234, 'grad_norm': 0.8274587392807007, 'learning_rate': 0.00019009380863039402, 'epoch': 0.51}


  5%|▌         | 138/2670 [03:18<1:08:36,  1.63s/it]

{'loss': 0.8895, 'grad_norm': 0.6923800110816956, 'learning_rate': 0.0001900187617260788, 'epoch': 0.52}


  5%|▌         | 139/2670 [03:20<1:07:00,  1.59s/it]

{'loss': 1.2207, 'grad_norm': 0.9036975502967834, 'learning_rate': 0.0001899437148217636, 'epoch': 0.52}


  5%|▌         | 140/2670 [03:21<1:05:18,  1.55s/it]

{'loss': 1.4233, 'grad_norm': 0.9966073036193848, 'learning_rate': 0.00018986866791744842, 'epoch': 0.52}


  5%|▌         | 141/2670 [03:22<1:04:06,  1.52s/it]

{'loss': 0.7364, 'grad_norm': 0.8060712814331055, 'learning_rate': 0.00018979362101313323, 'epoch': 0.53}


  5%|▌         | 142/2670 [03:24<1:01:24,  1.46s/it]

{'loss': 1.0282, 'grad_norm': 0.9137322902679443, 'learning_rate': 0.000189718574108818, 'epoch': 0.53}


  5%|▌         | 143/2670 [03:25<1:01:08,  1.45s/it]

{'loss': 1.1095, 'grad_norm': 1.2246837615966797, 'learning_rate': 0.00018964352720450282, 'epoch': 0.54}


  5%|▌         | 144/2670 [03:26<58:50,  1.40s/it]  

{'loss': 0.7382, 'grad_norm': 0.9253960251808167, 'learning_rate': 0.00018956848030018763, 'epoch': 0.54}


  5%|▌         | 145/2670 [03:28<58:51,  1.40s/it]

{'loss': 0.9761, 'grad_norm': 1.0554521083831787, 'learning_rate': 0.0001894934333958724, 'epoch': 0.54}


  5%|▌         | 146/2670 [03:29<58:09,  1.38s/it]

{'loss': 0.9435, 'grad_norm': 1.0760082006454468, 'learning_rate': 0.00018941838649155722, 'epoch': 0.55}


  6%|▌         | 147/2670 [03:31<1:00:42,  1.44s/it]

{'loss': 0.844, 'grad_norm': 0.8181930780410767, 'learning_rate': 0.00018934333958724205, 'epoch': 0.55}


  6%|▌         | 148/2670 [03:32<1:02:48,  1.49s/it]

{'loss': 1.1311, 'grad_norm': 0.9069963693618774, 'learning_rate': 0.00018926829268292684, 'epoch': 0.55}


  6%|▌         | 149/2670 [03:34<58:48,  1.40s/it]  

{'loss': 0.7827, 'grad_norm': 1.0163404941558838, 'learning_rate': 0.00018919324577861164, 'epoch': 0.56}


  6%|▌         | 150/2670 [03:35<1:00:09,  1.43s/it]

{'loss': 1.0331, 'grad_norm': 1.0197733640670776, 'learning_rate': 0.00018911819887429645, 'epoch': 0.56}


  6%|▌         | 151/2670 [03:37<1:00:59,  1.45s/it]

{'loss': 1.0994, 'grad_norm': 0.9215015769004822, 'learning_rate': 0.00018904315196998126, 'epoch': 0.57}


  6%|▌         | 152/2670 [03:38<57:19,  1.37s/it]  

{'loss': 1.1243, 'grad_norm': 0.9694474935531616, 'learning_rate': 0.00018896810506566604, 'epoch': 0.57}


  6%|▌         | 153/2670 [03:39<56:38,  1.35s/it]

{'loss': 1.201, 'grad_norm': 1.0268540382385254, 'learning_rate': 0.00018889305816135085, 'epoch': 0.57}


  6%|▌         | 154/2670 [03:41<58:38,  1.40s/it]

{'loss': 1.1069, 'grad_norm': 0.8139618635177612, 'learning_rate': 0.00018881801125703566, 'epoch': 0.58}


  6%|▌         | 155/2670 [03:42<1:04:12,  1.53s/it]

{'loss': 1.0179, 'grad_norm': 0.8058673143386841, 'learning_rate': 0.00018874296435272044, 'epoch': 0.58}


  6%|▌         | 156/2670 [03:44<1:03:51,  1.52s/it]

{'loss': 0.9189, 'grad_norm': 0.8487463593482971, 'learning_rate': 0.00018866791744840525, 'epoch': 0.58}


  6%|▌         | 157/2670 [03:45<1:03:52,  1.53s/it]

{'loss': 0.9217, 'grad_norm': 0.7958036065101624, 'learning_rate': 0.00018859287054409006, 'epoch': 0.59}


  6%|▌         | 158/2670 [03:47<1:01:40,  1.47s/it]

{'loss': 1.145, 'grad_norm': 0.836961030960083, 'learning_rate': 0.00018851782363977487, 'epoch': 0.59}


  6%|▌         | 159/2670 [03:48<58:14,  1.39s/it]  

{'loss': 1.0277, 'grad_norm': 1.0280561447143555, 'learning_rate': 0.00018844277673545968, 'epoch': 0.6}


  6%|▌         | 160/2670 [03:49<55:52,  1.34s/it]

{'loss': 0.932, 'grad_norm': 1.0223512649536133, 'learning_rate': 0.0001883677298311445, 'epoch': 0.6}


  6%|▌         | 161/2670 [03:51<1:00:56,  1.46s/it]

{'loss': 1.2129, 'grad_norm': 0.8393130898475647, 'learning_rate': 0.00018829268292682927, 'epoch': 0.6}


  6%|▌         | 162/2670 [03:52<1:00:22,  1.44s/it]

{'loss': 1.3376, 'grad_norm': 1.0436290502548218, 'learning_rate': 0.00018821763602251408, 'epoch': 0.61}


  6%|▌         | 163/2670 [03:54<1:04:02,  1.53s/it]

{'loss': 1.0461, 'grad_norm': 0.8722823262214661, 'learning_rate': 0.0001881425891181989, 'epoch': 0.61}


  6%|▌         | 164/2670 [03:56<1:03:49,  1.53s/it]

{'loss': 1.0531, 'grad_norm': 0.8375099301338196, 'learning_rate': 0.0001880675422138837, 'epoch': 0.61}


  6%|▌         | 165/2670 [03:57<1:01:05,  1.46s/it]

{'loss': 1.0653, 'grad_norm': 1.0614607334136963, 'learning_rate': 0.00018799249530956848, 'epoch': 0.62}


  6%|▌         | 166/2670 [03:58<1:00:17,  1.44s/it]

{'loss': 0.6236, 'grad_norm': 0.6904727220535278, 'learning_rate': 0.0001879174484052533, 'epoch': 0.62}


  6%|▋         | 167/2670 [04:00<1:02:14,  1.49s/it]

{'loss': 0.9434, 'grad_norm': 0.9447101354598999, 'learning_rate': 0.0001878424015009381, 'epoch': 0.63}


  6%|▋         | 168/2670 [04:01<58:54,  1.41s/it]  

{'loss': 0.6806, 'grad_norm': 0.9206408858299255, 'learning_rate': 0.00018776735459662288, 'epoch': 0.63}


  6%|▋         | 169/2670 [04:03<57:43,  1.38s/it]

{'loss': 0.8802, 'grad_norm': 0.8537869453430176, 'learning_rate': 0.0001876923076923077, 'epoch': 0.63}


  6%|▋         | 170/2670 [04:04<58:53,  1.41s/it]

{'loss': 0.9496, 'grad_norm': 0.8654609322547913, 'learning_rate': 0.00018761726078799252, 'epoch': 0.64}


  6%|▋         | 171/2670 [04:05<59:56,  1.44s/it]

{'loss': 0.9607, 'grad_norm': 1.1153559684753418, 'learning_rate': 0.0001875422138836773, 'epoch': 0.64}


  6%|▋         | 172/2670 [04:07<58:03,  1.39s/it]

{'loss': 1.1051, 'grad_norm': 1.2823294401168823, 'learning_rate': 0.0001874671669793621, 'epoch': 0.64}


  6%|▋         | 173/2670 [04:08<59:49,  1.44s/it]

{'loss': 0.8329, 'grad_norm': 0.8196737766265869, 'learning_rate': 0.00018739212007504692, 'epoch': 0.65}


  7%|▋         | 174/2670 [04:10<58:38,  1.41s/it]

{'loss': 0.9365, 'grad_norm': 0.9694922566413879, 'learning_rate': 0.00018731707317073173, 'epoch': 0.65}


  7%|▋         | 175/2670 [04:11<59:53,  1.44s/it]

{'loss': 1.0435, 'grad_norm': 0.8634311556816101, 'learning_rate': 0.0001872420262664165, 'epoch': 0.66}


  7%|▋         | 176/2670 [04:13<1:02:28,  1.50s/it]

{'loss': 1.0074, 'grad_norm': 0.8283655643463135, 'learning_rate': 0.00018716697936210132, 'epoch': 0.66}


  7%|▋         | 177/2670 [04:14<59:38,  1.44s/it]  

{'loss': 0.6703, 'grad_norm': 0.8621525764465332, 'learning_rate': 0.00018709193245778613, 'epoch': 0.66}


  7%|▋         | 178/2670 [04:15<58:32,  1.41s/it]

{'loss': 0.9312, 'grad_norm': 0.9825034737586975, 'learning_rate': 0.0001870168855534709, 'epoch': 0.67}


  7%|▋         | 179/2670 [04:17<58:55,  1.42s/it]

{'loss': 0.8494, 'grad_norm': 0.8710715770721436, 'learning_rate': 0.00018694183864915572, 'epoch': 0.67}


  7%|▋         | 180/2670 [04:18<58:17,  1.40s/it]

{'loss': 0.9442, 'grad_norm': 0.878926157951355, 'learning_rate': 0.00018686679174484053, 'epoch': 0.67}


  7%|▋         | 181/2670 [04:20<1:00:43,  1.46s/it]

{'loss': 0.8579, 'grad_norm': 0.7707462906837463, 'learning_rate': 0.00018679174484052534, 'epoch': 0.68}


  7%|▋         | 182/2670 [04:21<1:00:12,  1.45s/it]

{'loss': 0.9512, 'grad_norm': 0.8351399302482605, 'learning_rate': 0.00018671669793621015, 'epoch': 0.68}


  7%|▋         | 183/2670 [04:23<1:03:12,  1.52s/it]

{'loss': 0.9127, 'grad_norm': 0.7324218153953552, 'learning_rate': 0.00018664165103189496, 'epoch': 0.69}


  7%|▋         | 184/2670 [04:24<1:01:48,  1.49s/it]

{'loss': 0.6753, 'grad_norm': 1.1441900730133057, 'learning_rate': 0.00018656660412757974, 'epoch': 0.69}


  7%|▋         | 185/2670 [04:26<1:00:56,  1.47s/it]

{'loss': 1.0467, 'grad_norm': 1.2328987121582031, 'learning_rate': 0.00018649155722326455, 'epoch': 0.69}


  7%|▋         | 186/2670 [04:27<1:02:49,  1.52s/it]

{'loss': 0.9462, 'grad_norm': 1.039818286895752, 'learning_rate': 0.00018641651031894936, 'epoch': 0.7}


  7%|▋         | 187/2670 [04:29<57:54,  1.40s/it]  

{'loss': 0.9798, 'grad_norm': 1.155814528465271, 'learning_rate': 0.00018634146341463416, 'epoch': 0.7}


  7%|▋         | 188/2670 [04:30<59:12,  1.43s/it]

{'loss': 0.9168, 'grad_norm': 0.8655292391777039, 'learning_rate': 0.00018626641651031895, 'epoch': 0.7}


  7%|▋         | 189/2670 [04:31<59:01,  1.43s/it]

{'loss': 1.4032, 'grad_norm': 0.9979897737503052, 'learning_rate': 0.00018619136960600376, 'epoch': 0.71}


  7%|▋         | 190/2670 [04:33<58:01,  1.40s/it]

{'loss': 1.1626, 'grad_norm': 1.4581836462020874, 'learning_rate': 0.00018611632270168856, 'epoch': 0.71}


  7%|▋         | 191/2670 [04:34<57:50,  1.40s/it]

{'loss': 1.0297, 'grad_norm': 0.9164560437202454, 'learning_rate': 0.00018604127579737337, 'epoch': 0.72}


  7%|▋         | 192/2670 [04:36<1:00:29,  1.46s/it]

{'loss': 0.7567, 'grad_norm': 0.7608788013458252, 'learning_rate': 0.00018596622889305818, 'epoch': 0.72}


  7%|▋         | 193/2670 [04:37<1:02:43,  1.52s/it]

{'loss': 0.8289, 'grad_norm': 0.8456688523292542, 'learning_rate': 0.000185891181988743, 'epoch': 0.72}


  7%|▋         | 194/2670 [04:39<1:01:01,  1.48s/it]

{'loss': 0.7368, 'grad_norm': 0.9591609835624695, 'learning_rate': 0.00018581613508442777, 'epoch': 0.73}


  7%|▋         | 195/2670 [04:40<1:02:36,  1.52s/it]

{'loss': 0.7708, 'grad_norm': 0.8034269213676453, 'learning_rate': 0.00018574108818011258, 'epoch': 0.73}


  7%|▋         | 196/2670 [04:42<1:01:19,  1.49s/it]

{'loss': 0.9521, 'grad_norm': 0.8955484628677368, 'learning_rate': 0.0001856660412757974, 'epoch': 0.73}


  7%|▋         | 197/2670 [04:44<1:05:13,  1.58s/it]

{'loss': 1.3719, 'grad_norm': 1.056671142578125, 'learning_rate': 0.00018559099437148217, 'epoch': 0.74}


  7%|▋         | 198/2670 [04:45<1:03:47,  1.55s/it]

{'loss': 0.9974, 'grad_norm': 0.9989265203475952, 'learning_rate': 0.00018551594746716698, 'epoch': 0.74}


  7%|▋         | 199/2670 [04:47<1:03:08,  1.53s/it]

{'loss': 1.0937, 'grad_norm': 1.093043327331543, 'learning_rate': 0.0001854409005628518, 'epoch': 0.75}


  7%|▋         | 200/2670 [04:48<1:02:53,  1.53s/it]

{'loss': 0.7852, 'grad_norm': 0.9071416258811951, 'learning_rate': 0.0001853658536585366, 'epoch': 0.75}


  8%|▊         | 201/2670 [04:50<1:02:47,  1.53s/it]

{'loss': 0.9742, 'grad_norm': 0.8543567657470703, 'learning_rate': 0.00018529080675422138, 'epoch': 0.75}


  8%|▊         | 202/2670 [04:51<1:03:02,  1.53s/it]

{'loss': 0.7045, 'grad_norm': 0.8605316877365112, 'learning_rate': 0.0001852157598499062, 'epoch': 0.76}


  8%|▊         | 203/2670 [04:53<1:00:16,  1.47s/it]

{'loss': 1.2064, 'grad_norm': 1.1464241743087769, 'learning_rate': 0.000185140712945591, 'epoch': 0.76}


  8%|▊         | 204/2670 [04:54<1:03:23,  1.54s/it]

{'loss': 1.2257, 'grad_norm': 0.843928337097168, 'learning_rate': 0.0001850656660412758, 'epoch': 0.76}


  8%|▊         | 205/2670 [04:56<1:01:37,  1.50s/it]

{'loss': 0.918, 'grad_norm': 0.947966456413269, 'learning_rate': 0.00018499061913696062, 'epoch': 0.77}


  8%|▊         | 206/2670 [04:57<59:18,  1.44s/it]  

{'loss': 1.1788, 'grad_norm': 1.308289885520935, 'learning_rate': 0.00018491557223264542, 'epoch': 0.77}


  8%|▊         | 207/2670 [04:58<58:42,  1.43s/it]

{'loss': 0.9975, 'grad_norm': 0.9286049008369446, 'learning_rate': 0.0001848405253283302, 'epoch': 0.78}


  8%|▊         | 208/2670 [05:00<1:02:51,  1.53s/it]

{'loss': 1.061, 'grad_norm': 0.9245262742042542, 'learning_rate': 0.00018476547842401501, 'epoch': 0.78}


  8%|▊         | 209/2670 [05:02<1:02:56,  1.53s/it]

{'loss': 1.0774, 'grad_norm': 0.9976503252983093, 'learning_rate': 0.00018469043151969982, 'epoch': 0.78}


  8%|▊         | 210/2670 [05:03<1:02:13,  1.52s/it]

{'loss': 0.516, 'grad_norm': 0.826149046421051, 'learning_rate': 0.00018461538461538463, 'epoch': 0.79}


  8%|▊         | 211/2670 [05:05<1:00:56,  1.49s/it]

{'loss': 1.106, 'grad_norm': 1.1197729110717773, 'learning_rate': 0.00018454033771106941, 'epoch': 0.79}


  8%|▊         | 212/2670 [05:06<1:00:13,  1.47s/it]

{'loss': 1.003, 'grad_norm': 0.942689061164856, 'learning_rate': 0.00018446529080675422, 'epoch': 0.79}


  8%|▊         | 213/2670 [05:08<1:01:11,  1.49s/it]

{'loss': 1.0817, 'grad_norm': 0.8827250003814697, 'learning_rate': 0.00018439024390243903, 'epoch': 0.8}


  8%|▊         | 214/2670 [05:09<1:00:41,  1.48s/it]

{'loss': 0.9102, 'grad_norm': 1.0306540727615356, 'learning_rate': 0.00018431519699812384, 'epoch': 0.8}


  8%|▊         | 215/2670 [05:11<1:03:43,  1.56s/it]

{'loss': 0.985, 'grad_norm': 0.844195544719696, 'learning_rate': 0.00018424015009380865, 'epoch': 0.81}


  8%|▊         | 216/2670 [05:12<1:03:29,  1.55s/it]

{'loss': 1.1088, 'grad_norm': 0.9208515882492065, 'learning_rate': 0.00018416510318949346, 'epoch': 0.81}


  8%|▊         | 217/2670 [05:14<1:01:15,  1.50s/it]

{'loss': 1.0984, 'grad_norm': 1.1098246574401855, 'learning_rate': 0.00018409005628517824, 'epoch': 0.81}


  8%|▊         | 218/2670 [05:15<1:01:43,  1.51s/it]

{'loss': 0.8656, 'grad_norm': 0.878213107585907, 'learning_rate': 0.00018401500938086305, 'epoch': 0.82}


  8%|▊         | 219/2670 [05:17<1:00:42,  1.49s/it]

{'loss': 1.0409, 'grad_norm': 0.9619109630584717, 'learning_rate': 0.00018393996247654786, 'epoch': 0.82}


  8%|▊         | 220/2670 [05:18<1:00:39,  1.49s/it]

{'loss': 0.8854, 'grad_norm': 1.0427801609039307, 'learning_rate': 0.00018386491557223264, 'epoch': 0.82}


  8%|▊         | 221/2670 [05:20<1:00:21,  1.48s/it]

{'loss': 0.4994, 'grad_norm': 0.8120683431625366, 'learning_rate': 0.00018378986866791745, 'epoch': 0.83}


  8%|▊         | 222/2670 [05:21<57:39,  1.41s/it]  

{'loss': 1.0163, 'grad_norm': 1.1344729661941528, 'learning_rate': 0.00018371482176360226, 'epoch': 0.83}


  8%|▊         | 223/2670 [05:23<1:01:10,  1.50s/it]

{'loss': 1.0859, 'grad_norm': 0.7926139831542969, 'learning_rate': 0.00018363977485928707, 'epoch': 0.84}


  8%|▊         | 224/2670 [05:24<1:01:34,  1.51s/it]

{'loss': 0.8473, 'grad_norm': 1.047844409942627, 'learning_rate': 0.00018356472795497185, 'epoch': 0.84}


  8%|▊         | 225/2670 [05:26<1:01:45,  1.52s/it]

{'loss': 0.6673, 'grad_norm': 0.8143274188041687, 'learning_rate': 0.00018348968105065666, 'epoch': 0.84}


  8%|▊         | 226/2670 [05:27<1:02:25,  1.53s/it]

{'loss': 1.2221, 'grad_norm': 0.9314271807670593, 'learning_rate': 0.0001834146341463415, 'epoch': 0.85}


  9%|▊         | 227/2670 [05:29<1:02:27,  1.53s/it]

{'loss': 0.7469, 'grad_norm': 0.9066036939620972, 'learning_rate': 0.00018333958724202627, 'epoch': 0.85}


  9%|▊         | 228/2670 [05:30<1:03:38,  1.56s/it]

{'loss': 0.616, 'grad_norm': 0.7137109637260437, 'learning_rate': 0.00018326454033771108, 'epoch': 0.85}


  9%|▊         | 229/2670 [05:32<1:02:27,  1.54s/it]

{'loss': 1.0938, 'grad_norm': 0.868909478187561, 'learning_rate': 0.0001831894934333959, 'epoch': 0.86}


  9%|▊         | 230/2670 [05:33<59:09,  1.45s/it]  

{'loss': 0.7001, 'grad_norm': 0.8983033299446106, 'learning_rate': 0.00018311444652908067, 'epoch': 0.86}


  9%|▊         | 231/2670 [05:35<1:00:03,  1.48s/it]

{'loss': 0.9897, 'grad_norm': 1.1212639808654785, 'learning_rate': 0.00018303939962476548, 'epoch': 0.87}


  9%|▊         | 232/2670 [05:36<1:04:25,  1.59s/it]

{'loss': 0.8229, 'grad_norm': 0.736832320690155, 'learning_rate': 0.0001829643527204503, 'epoch': 0.87}


  9%|▊         | 233/2670 [05:38<1:01:20,  1.51s/it]

{'loss': 0.4379, 'grad_norm': 0.7644688487052917, 'learning_rate': 0.0001828893058161351, 'epoch': 0.87}


  9%|▉         | 234/2670 [05:39<1:01:30,  1.52s/it]

{'loss': 0.9849, 'grad_norm': 0.9184508919715881, 'learning_rate': 0.00018281425891181988, 'epoch': 0.88}


  9%|▉         | 235/2670 [05:41<1:02:31,  1.54s/it]

{'loss': 1.0991, 'grad_norm': 0.8763378858566284, 'learning_rate': 0.0001827392120075047, 'epoch': 0.88}


  9%|▉         | 236/2670 [05:42<1:01:15,  1.51s/it]

{'loss': 0.93, 'grad_norm': 0.887546718120575, 'learning_rate': 0.0001826641651031895, 'epoch': 0.88}


  9%|▉         | 237/2670 [05:44<59:17,  1.46s/it]  

{'loss': 1.0191, 'grad_norm': 0.974234402179718, 'learning_rate': 0.0001825891181988743, 'epoch': 0.89}


  9%|▉         | 238/2670 [05:45<59:15,  1.46s/it]

{'loss': 0.9483, 'grad_norm': 1.009032964706421, 'learning_rate': 0.00018251407129455912, 'epoch': 0.89}


  9%|▉         | 239/2670 [05:47<57:31,  1.42s/it]

{'loss': 0.721, 'grad_norm': 0.9800095558166504, 'learning_rate': 0.00018243902439024393, 'epoch': 0.9}


  9%|▉         | 240/2670 [05:48<58:24,  1.44s/it]

{'loss': 1.1403, 'grad_norm': 0.9674206376075745, 'learning_rate': 0.0001823639774859287, 'epoch': 0.9}


  9%|▉         | 241/2670 [05:49<56:20,  1.39s/it]

{'loss': 1.0095, 'grad_norm': 0.9916980862617493, 'learning_rate': 0.00018228893058161352, 'epoch': 0.9}


  9%|▉         | 242/2670 [05:51<55:16,  1.37s/it]

{'loss': 0.8435, 'grad_norm': 0.9957733750343323, 'learning_rate': 0.00018221388367729833, 'epoch': 0.91}


  9%|▉         | 243/2670 [05:52<58:26,  1.44s/it]

{'loss': 0.5419, 'grad_norm': 0.7759210467338562, 'learning_rate': 0.0001821388367729831, 'epoch': 0.91}


  9%|▉         | 244/2670 [05:54<1:00:23,  1.49s/it]

{'loss': 0.6097, 'grad_norm': 0.775246798992157, 'learning_rate': 0.00018206378986866792, 'epoch': 0.91}


  9%|▉         | 245/2670 [05:55<58:16,  1.44s/it]  

{'loss': 0.7733, 'grad_norm': 1.014561653137207, 'learning_rate': 0.00018198874296435273, 'epoch': 0.92}


  9%|▉         | 246/2670 [05:57<57:59,  1.44s/it]

{'loss': 0.8713, 'grad_norm': 1.0461020469665527, 'learning_rate': 0.00018191369606003753, 'epoch': 0.92}


  9%|▉         | 247/2670 [05:58<58:38,  1.45s/it]

{'loss': 0.6637, 'grad_norm': 0.8739756345748901, 'learning_rate': 0.00018183864915572232, 'epoch': 0.93}


  9%|▉         | 248/2670 [06:00<59:05,  1.46s/it]

{'loss': 1.1177, 'grad_norm': 1.1148302555084229, 'learning_rate': 0.00018176360225140715, 'epoch': 0.93}


  9%|▉         | 249/2670 [06:01<58:26,  1.45s/it]

{'loss': 0.9708, 'grad_norm': 1.0576038360595703, 'learning_rate': 0.00018168855534709196, 'epoch': 0.93}


  9%|▉         | 250/2670 [06:03<1:01:19,  1.52s/it]

{'loss': 0.7979, 'grad_norm': 0.8396061062812805, 'learning_rate': 0.00018161350844277674, 'epoch': 0.94}


  9%|▉         | 251/2670 [06:04<57:52,  1.44s/it]  

{'loss': 0.7416, 'grad_norm': 1.0362131595611572, 'learning_rate': 0.00018153846153846155, 'epoch': 0.94}


  9%|▉         | 252/2670 [06:05<59:20,  1.47s/it]

{'loss': 0.696, 'grad_norm': 0.8434439301490784, 'learning_rate': 0.00018146341463414636, 'epoch': 0.94}


  9%|▉         | 253/2670 [06:07<58:44,  1.46s/it]

{'loss': 1.4233, 'grad_norm': 1.3029038906097412, 'learning_rate': 0.00018138836772983114, 'epoch': 0.95}


 10%|▉         | 254/2670 [06:08<58:45,  1.46s/it]

{'loss': 0.5672, 'grad_norm': 0.8779231905937195, 'learning_rate': 0.00018131332082551595, 'epoch': 0.95}


 10%|▉         | 255/2670 [06:10<57:20,  1.42s/it]

{'loss': 1.0212, 'grad_norm': 1.4211483001708984, 'learning_rate': 0.00018123827392120076, 'epoch': 0.96}


 10%|▉         | 256/2670 [06:11<59:14,  1.47s/it]

{'loss': 0.6316, 'grad_norm': 0.743217408657074, 'learning_rate': 0.00018116322701688557, 'epoch': 0.96}


 10%|▉         | 257/2670 [06:13<58:13,  1.45s/it]

{'loss': 0.7649, 'grad_norm': 0.8021498918533325, 'learning_rate': 0.00018108818011257035, 'epoch': 0.96}


 10%|▉         | 258/2670 [06:14<1:00:05,  1.49s/it]

{'loss': 0.9285, 'grad_norm': 0.8706200122833252, 'learning_rate': 0.00018101313320825516, 'epoch': 0.97}


 10%|▉         | 259/2670 [06:16<1:01:39,  1.53s/it]

{'loss': 0.6931, 'grad_norm': 0.8035203218460083, 'learning_rate': 0.00018093808630393997, 'epoch': 0.97}


 10%|▉         | 260/2670 [06:17<1:01:48,  1.54s/it]

{'loss': 0.6072, 'grad_norm': 0.8135258555412292, 'learning_rate': 0.00018086303939962478, 'epoch': 0.97}


 10%|▉         | 261/2670 [06:19<1:01:34,  1.53s/it]

{'loss': 0.977, 'grad_norm': 0.9221916198730469, 'learning_rate': 0.00018078799249530959, 'epoch': 0.98}


 10%|▉         | 262/2670 [06:20<1:00:15,  1.50s/it]

{'loss': 0.5492, 'grad_norm': 1.065190076828003, 'learning_rate': 0.0001807129455909944, 'epoch': 0.98}


 10%|▉         | 263/2670 [06:22<59:36,  1.49s/it]  

{'loss': 0.8492, 'grad_norm': 1.0713071823120117, 'learning_rate': 0.00018063789868667918, 'epoch': 0.99}


 10%|▉         | 264/2670 [06:23<59:12,  1.48s/it]

{'loss': 0.8971, 'grad_norm': 1.0840240716934204, 'learning_rate': 0.00018056285178236399, 'epoch': 0.99}


 10%|▉         | 265/2670 [06:25<59:53,  1.49s/it]

{'loss': 0.6168, 'grad_norm': 0.9086223244667053, 'learning_rate': 0.0001804878048780488, 'epoch': 0.99}


 10%|▉         | 266/2670 [06:27<1:03:20,  1.58s/it]

{'loss': 0.9437, 'grad_norm': 1.0657494068145752, 'learning_rate': 0.00018041275797373358, 'epoch': 1.0}


 10%|█         | 267/2670 [06:28<59:04,  1.48s/it]  

{'loss': 0.6444, 'grad_norm': 1.1843454837799072, 'learning_rate': 0.00018033771106941839, 'epoch': 1.0}


 10%|█         | 268/2670 [06:29<59:34,  1.49s/it]

{'loss': 0.3088, 'grad_norm': 1.0239872932434082, 'learning_rate': 0.0001802626641651032, 'epoch': 1.0}


 10%|█         | 269/2670 [06:31<1:00:02,  1.50s/it]

{'loss': 0.214, 'grad_norm': 0.5889759659767151, 'learning_rate': 0.000180187617260788, 'epoch': 1.01}


 10%|█         | 270/2670 [06:32<59:12,  1.48s/it]  

{'loss': 0.3977, 'grad_norm': 0.716795027256012, 'learning_rate': 0.0001801125703564728, 'epoch': 1.01}


 10%|█         | 271/2670 [06:34<58:07,  1.45s/it]

{'loss': 0.3826, 'grad_norm': 0.8703362345695496, 'learning_rate': 0.00018003752345215762, 'epoch': 1.01}


 10%|█         | 272/2670 [06:35<54:32,  1.36s/it]

{'loss': 0.402, 'grad_norm': 1.009680151939392, 'learning_rate': 0.00017996247654784243, 'epoch': 1.02}


 10%|█         | 273/2670 [06:36<54:02,  1.35s/it]

{'loss': 0.3142, 'grad_norm': 0.7565304636955261, 'learning_rate': 0.0001798874296435272, 'epoch': 1.02}


 10%|█         | 274/2670 [06:38<55:22,  1.39s/it]

{'loss': 0.6166, 'grad_norm': 1.1176295280456543, 'learning_rate': 0.00017981238273921202, 'epoch': 1.03}


 10%|█         | 275/2670 [06:39<55:26,  1.39s/it]

{'loss': 0.3716, 'grad_norm': 1.1247645616531372, 'learning_rate': 0.00017973733583489683, 'epoch': 1.03}


 10%|█         | 276/2670 [06:41<57:45,  1.45s/it]

{'loss': 0.5825, 'grad_norm': 0.9415843486785889, 'learning_rate': 0.0001796622889305816, 'epoch': 1.03}


 10%|█         | 277/2670 [06:42<1:00:32,  1.52s/it]

{'loss': 0.3554, 'grad_norm': 0.7752965092658997, 'learning_rate': 0.00017958724202626642, 'epoch': 1.04}


 10%|█         | 278/2670 [06:43<56:23,  1.41s/it]  

{'loss': 0.298, 'grad_norm': 0.9006833434104919, 'learning_rate': 0.00017951219512195123, 'epoch': 1.04}


 10%|█         | 279/2670 [06:45<56:43,  1.42s/it]

{'loss': 0.3728, 'grad_norm': 1.1164859533309937, 'learning_rate': 0.00017943714821763604, 'epoch': 1.04}


 10%|█         | 280/2670 [06:46<55:25,  1.39s/it]

{'loss': 0.4215, 'grad_norm': 0.9479760527610779, 'learning_rate': 0.00017936210131332082, 'epoch': 1.05}


 11%|█         | 281/2670 [06:48<56:20,  1.41s/it]

{'loss': 0.4232, 'grad_norm': 0.8970595002174377, 'learning_rate': 0.00017928705440900563, 'epoch': 1.05}


 11%|█         | 282/2670 [06:50<1:01:08,  1.54s/it]

{'loss': 0.2763, 'grad_norm': 0.7162668108940125, 'learning_rate': 0.00017921200750469044, 'epoch': 1.06}


 11%|█         | 283/2670 [06:51<56:09,  1.41s/it]  

{'loss': 0.3517, 'grad_norm': 1.2858378887176514, 'learning_rate': 0.00017913696060037525, 'epoch': 1.06}


 11%|█         | 284/2670 [06:52<52:10,  1.31s/it]

{'loss': 0.3598, 'grad_norm': 1.1359354257583618, 'learning_rate': 0.00017906191369606005, 'epoch': 1.06}


 11%|█         | 285/2670 [06:53<53:28,  1.35s/it]

{'loss': 0.545, 'grad_norm': 0.9117560386657715, 'learning_rate': 0.00017898686679174486, 'epoch': 1.07}


 11%|█         | 286/2670 [06:55<55:26,  1.40s/it]

{'loss': 0.4078, 'grad_norm': 0.6767479181289673, 'learning_rate': 0.00017891181988742964, 'epoch': 1.07}


 11%|█         | 287/2670 [06:56<55:52,  1.41s/it]

{'loss': 0.4491, 'grad_norm': 0.8546830415725708, 'learning_rate': 0.00017883677298311445, 'epoch': 1.07}


 11%|█         | 288/2670 [06:58<55:59,  1.41s/it]

{'loss': 0.2006, 'grad_norm': 0.798409104347229, 'learning_rate': 0.00017876172607879926, 'epoch': 1.08}


 11%|█         | 289/2670 [06:59<59:29,  1.50s/it]

{'loss': 0.578, 'grad_norm': 0.9353043437004089, 'learning_rate': 0.00017868667917448404, 'epoch': 1.08}


 11%|█         | 290/2670 [07:00<56:49,  1.43s/it]

{'loss': 0.5226, 'grad_norm': 0.9665949940681458, 'learning_rate': 0.00017861163227016885, 'epoch': 1.09}


 11%|█         | 291/2670 [07:02<1:00:37,  1.53s/it]

{'loss': 0.3253, 'grad_norm': 0.7699593901634216, 'learning_rate': 0.00017853658536585366, 'epoch': 1.09}


 11%|█         | 292/2670 [07:04<1:02:28,  1.58s/it]

{'loss': 0.4049, 'grad_norm': 0.9381515383720398, 'learning_rate': 0.00017846153846153847, 'epoch': 1.09}


 11%|█         | 293/2670 [07:05<1:01:36,  1.56s/it]

{'loss': 0.298, 'grad_norm': 0.7000707983970642, 'learning_rate': 0.00017838649155722328, 'epoch': 1.1}


 11%|█         | 294/2670 [07:07<59:45,  1.51s/it]  

{'loss': 0.3632, 'grad_norm': 0.8374091982841492, 'learning_rate': 0.0001783114446529081, 'epoch': 1.1}


 11%|█         | 295/2670 [07:08<56:15,  1.42s/it]

{'loss': 0.5661, 'grad_norm': 1.23092782497406, 'learning_rate': 0.0001782363977485929, 'epoch': 1.1}


 11%|█         | 296/2670 [07:09<56:05,  1.42s/it]

{'loss': 0.4548, 'grad_norm': 0.925433337688446, 'learning_rate': 0.00017816135084427768, 'epoch': 1.11}


 11%|█         | 297/2670 [07:11<54:16,  1.37s/it]

{'loss': 0.3874, 'grad_norm': 0.9900577664375305, 'learning_rate': 0.0001780863039399625, 'epoch': 1.11}


 11%|█         | 298/2670 [07:12<55:53,  1.41s/it]

{'loss': 0.3532, 'grad_norm': 0.9772403836250305, 'learning_rate': 0.0001780112570356473, 'epoch': 1.12}


 11%|█         | 299/2670 [07:14<58:09,  1.47s/it]

{'loss': 0.3717, 'grad_norm': 0.9210931658744812, 'learning_rate': 0.00017793621013133208, 'epoch': 1.12}


 11%|█         | 300/2670 [07:15<57:35,  1.46s/it]

{'loss': 0.6088, 'grad_norm': 0.9341455698013306, 'learning_rate': 0.0001778611632270169, 'epoch': 1.12}


 11%|█▏        | 301/2670 [07:17<57:05,  1.45s/it]

{'loss': 0.231, 'grad_norm': 0.7084178328514099, 'learning_rate': 0.0001777861163227017, 'epoch': 1.13}


 11%|█▏        | 302/2670 [07:18<1:00:13,  1.53s/it]

{'loss': 0.3953, 'grad_norm': 0.6661517024040222, 'learning_rate': 0.0001777110694183865, 'epoch': 1.13}


 11%|█▏        | 303/2670 [07:20<59:04,  1.50s/it]  

{'loss': 0.2886, 'grad_norm': 0.6691703200340271, 'learning_rate': 0.0001776360225140713, 'epoch': 1.13}


 11%|█▏        | 304/2670 [07:21<58:17,  1.48s/it]

{'loss': 0.4118, 'grad_norm': 0.7700850367546082, 'learning_rate': 0.0001775609756097561, 'epoch': 1.14}


 11%|█▏        | 305/2670 [07:23<1:01:53,  1.57s/it]

{'loss': 0.5196, 'grad_norm': 0.7576090693473816, 'learning_rate': 0.0001774859287054409, 'epoch': 1.14}


 11%|█▏        | 306/2670 [07:25<1:04:59,  1.65s/it]

{'loss': 0.5784, 'grad_norm': 0.7350921630859375, 'learning_rate': 0.00017741088180112571, 'epoch': 1.15}


 11%|█▏        | 307/2670 [07:26<1:02:13,  1.58s/it]

{'loss': 0.3613, 'grad_norm': 1.2229583263397217, 'learning_rate': 0.00017733583489681052, 'epoch': 1.15}


 12%|█▏        | 308/2670 [07:28<1:01:45,  1.57s/it]

{'loss': 0.4185, 'grad_norm': 0.8678189516067505, 'learning_rate': 0.00017726078799249533, 'epoch': 1.15}


 12%|█▏        | 309/2670 [07:29<59:37,  1.52s/it]  

{'loss': 0.6526, 'grad_norm': 1.0388469696044922, 'learning_rate': 0.0001771857410881801, 'epoch': 1.16}


 12%|█▏        | 310/2670 [07:30<56:22,  1.43s/it]

{'loss': 0.2495, 'grad_norm': 1.2356703281402588, 'learning_rate': 0.00017711069418386492, 'epoch': 1.16}


 12%|█▏        | 311/2670 [07:32<59:03,  1.50s/it]

{'loss': 0.4248, 'grad_norm': 0.8371739387512207, 'learning_rate': 0.00017703564727954973, 'epoch': 1.16}


 12%|█▏        | 312/2670 [07:34<59:25,  1.51s/it]

{'loss': 0.4467, 'grad_norm': 0.7937420010566711, 'learning_rate': 0.0001769606003752345, 'epoch': 1.17}


 12%|█▏        | 313/2670 [07:35<58:19,  1.48s/it]

{'loss': 0.32, 'grad_norm': 0.8487620949745178, 'learning_rate': 0.00017688555347091932, 'epoch': 1.17}


 12%|█▏        | 314/2670 [07:37<58:55,  1.50s/it]

{'loss': 0.242, 'grad_norm': 0.8136025667190552, 'learning_rate': 0.00017681050656660413, 'epoch': 1.18}


 12%|█▏        | 315/2670 [07:38<54:36,  1.39s/it]

{'loss': 0.281, 'grad_norm': 0.8125027418136597, 'learning_rate': 0.00017673545966228894, 'epoch': 1.18}


 12%|█▏        | 316/2670 [07:39<54:24,  1.39s/it]

{'loss': 0.4002, 'grad_norm': 0.8900081515312195, 'learning_rate': 0.00017666041275797375, 'epoch': 1.18}


 12%|█▏        | 317/2670 [07:41<54:43,  1.40s/it]

{'loss': 0.442, 'grad_norm': 0.8576437830924988, 'learning_rate': 0.00017658536585365856, 'epoch': 1.19}


 12%|█▏        | 318/2670 [07:42<55:13,  1.41s/it]

{'loss': 0.5529, 'grad_norm': 1.0057305097579956, 'learning_rate': 0.00017651031894934337, 'epoch': 1.19}


 12%|█▏        | 319/2670 [07:44<56:58,  1.45s/it]

{'loss': 0.3837, 'grad_norm': 0.8354799747467041, 'learning_rate': 0.00017643527204502815, 'epoch': 1.19}


 12%|█▏        | 320/2670 [07:45<56:32,  1.44s/it]

{'loss': 0.3912, 'grad_norm': 0.9821653366088867, 'learning_rate': 0.00017636022514071296, 'epoch': 1.2}


 12%|█▏        | 321/2670 [07:47<59:56,  1.53s/it]

{'loss': 0.3958, 'grad_norm': 0.851405143737793, 'learning_rate': 0.00017628517823639777, 'epoch': 1.2}


 12%|█▏        | 322/2670 [07:48<59:30,  1.52s/it]

{'loss': 0.2093, 'grad_norm': 0.6308373808860779, 'learning_rate': 0.00017621013133208255, 'epoch': 1.21}


 12%|█▏        | 323/2670 [07:50<58:46,  1.50s/it]

{'loss': 0.358, 'grad_norm': 0.8078818917274475, 'learning_rate': 0.00017613508442776736, 'epoch': 1.21}


 12%|█▏        | 324/2670 [07:51<1:00:09,  1.54s/it]

{'loss': 0.2762, 'grad_norm': 0.7863413691520691, 'learning_rate': 0.00017606003752345216, 'epoch': 1.21}


 12%|█▏        | 325/2670 [07:53<59:57,  1.53s/it]

{'loss': 0.3397, 'grad_norm': 0.8203716278076172, 'learning_rate': 0.00017598499061913695, 'epoch': 1.22}


 12%|█▏        | 326/2670 [07:54<59:49,  1.53s/it]

{'loss': 0.2275, 'grad_norm': 0.5872220396995544, 'learning_rate': 0.00017590994371482176, 'epoch': 1.22}


 12%|█▏        | 327/2670 [07:56<57:32,  1.47s/it]

{'loss': 0.3056, 'grad_norm': 0.7530582547187805, 'learning_rate': 0.00017583489681050656, 'epoch': 1.22}


 12%|█▏        | 328/2670 [07:57<57:06,  1.46s/it]

{'loss': 0.4983, 'grad_norm': 0.9398319721221924, 'learning_rate': 0.0001757598499061914, 'epoch': 1.23}


 12%|█▏        | 329/2670 [07:59<56:51,  1.46s/it]

{'loss': 0.4942, 'grad_norm': 0.9369226694107056, 'learning_rate': 0.00017568480300187618, 'epoch': 1.23}


 12%|█▏        | 330/2670 [08:00<59:39,  1.53s/it]

{'loss': 0.5628, 'grad_norm': 0.8223934173583984, 'learning_rate': 0.000175609756097561, 'epoch': 1.24}


 12%|█▏        | 331/2670 [08:02<58:31,  1.50s/it]

{'loss': 0.4418, 'grad_norm': 1.0635627508163452, 'learning_rate': 0.0001755347091932458, 'epoch': 1.24}


 12%|█▏        | 332/2670 [08:03<59:54,  1.54s/it]

{'loss': 0.7151, 'grad_norm': 0.9431918859481812, 'learning_rate': 0.00017545966228893058, 'epoch': 1.24}


 12%|█▏        | 333/2670 [08:05<59:56,  1.54s/it]

{'loss': 0.4861, 'grad_norm': 0.783811092376709, 'learning_rate': 0.0001753846153846154, 'epoch': 1.25}


 13%|█▎        | 334/2670 [08:06<58:33,  1.50s/it]

{'loss': 0.405, 'grad_norm': 0.8149716854095459, 'learning_rate': 0.0001753095684803002, 'epoch': 1.25}


 13%|█▎        | 335/2670 [08:08<56:44,  1.46s/it]

{'loss': 0.219, 'grad_norm': 0.7880218625068665, 'learning_rate': 0.00017523452157598498, 'epoch': 1.25}


 13%|█▎        | 336/2670 [08:09<53:12,  1.37s/it]

{'loss': 0.1734, 'grad_norm': 0.6933165192604065, 'learning_rate': 0.0001751594746716698, 'epoch': 1.26}


 13%|█▎        | 337/2670 [08:10<55:13,  1.42s/it]

{'loss': 0.4969, 'grad_norm': 0.9433191418647766, 'learning_rate': 0.0001750844277673546, 'epoch': 1.26}


 13%|█▎        | 338/2670 [08:12<56:32,  1.45s/it]

{'loss': 0.2244, 'grad_norm': 0.6241526007652283, 'learning_rate': 0.0001750093808630394, 'epoch': 1.27}


 13%|█▎        | 339/2670 [08:13<55:22,  1.43s/it]

{'loss': 0.4189, 'grad_norm': 0.9058762192726135, 'learning_rate': 0.00017493433395872422, 'epoch': 1.27}


 13%|█▎        | 340/2670 [08:15<58:02,  1.49s/it]

{'loss': 0.7265, 'grad_norm': 0.9623207449913025, 'learning_rate': 0.00017485928705440902, 'epoch': 1.27}


 13%|█▎        | 341/2670 [08:17<1:00:36,  1.56s/it]

{'loss': 0.392, 'grad_norm': 0.8722590208053589, 'learning_rate': 0.00017478424015009383, 'epoch': 1.28}


 13%|█▎        | 342/2670 [08:18<58:11,  1.50s/it]

{'loss': 0.2654, 'grad_norm': 0.7893103361129761, 'learning_rate': 0.00017470919324577862, 'epoch': 1.28}


 13%|█▎        | 343/2670 [08:20<58:41,  1.51s/it]

{'loss': 0.6173, 'grad_norm': 0.9851779937744141, 'learning_rate': 0.00017463414634146342, 'epoch': 1.28}


 13%|█▎        | 344/2670 [08:21<57:00,  1.47s/it]

{'loss': 0.3133, 'grad_norm': 0.8277234435081482, 'learning_rate': 0.00017455909943714823, 'epoch': 1.29}


 13%|█▎        | 345/2670 [08:22<55:44,  1.44s/it]

{'loss': 0.2147, 'grad_norm': 0.7179730534553528, 'learning_rate': 0.00017448405253283302, 'epoch': 1.29}


 13%|█▎        | 346/2670 [08:24<57:46,  1.49s/it]

{'loss': 0.4765, 'grad_norm': 0.8164328336715698, 'learning_rate': 0.00017440900562851782, 'epoch': 1.3}


 13%|█▎        | 347/2670 [08:25<55:03,  1.42s/it]

{'loss': 0.499, 'grad_norm': 1.053031325340271, 'learning_rate': 0.00017433395872420263, 'epoch': 1.3}


 13%|█▎        | 348/2670 [08:27<54:54,  1.42s/it]

{'loss': 0.3387, 'grad_norm': 1.0053660869598389, 'learning_rate': 0.00017425891181988741, 'epoch': 1.3}


 13%|█▎        | 349/2670 [08:28<54:04,  1.40s/it]

{'loss': 0.5584, 'grad_norm': 0.9997620582580566, 'learning_rate': 0.00017418386491557222, 'epoch': 1.31}


 13%|█▎        | 350/2670 [08:29<53:31,  1.38s/it]

{'loss': 0.4792, 'grad_norm': 0.9601948857307434, 'learning_rate': 0.00017410881801125706, 'epoch': 1.31}


 13%|█▎        | 351/2670 [08:31<54:35,  1.41s/it]

{'loss': 0.3923, 'grad_norm': 0.8846482634544373, 'learning_rate': 0.00017403377110694187, 'epoch': 1.31}


 13%|█▎        | 352/2670 [08:32<54:33,  1.41s/it]

{'loss': 0.3173, 'grad_norm': 0.755922794342041, 'learning_rate': 0.00017395872420262665, 'epoch': 1.32}


 13%|█▎        | 353/2670 [08:34<57:10,  1.48s/it]

{'loss': 0.3777, 'grad_norm': 0.7629696726799011, 'learning_rate': 0.00017388367729831146, 'epoch': 1.32}


 13%|█▎        | 354/2670 [08:35<59:28,  1.54s/it]

{'loss': 0.4769, 'grad_norm': 0.8202177882194519, 'learning_rate': 0.00017380863039399627, 'epoch': 1.33}


 13%|█▎        | 355/2670 [08:37<57:26,  1.49s/it]

{'loss': 0.1995, 'grad_norm': 0.7830820083618164, 'learning_rate': 0.00017373358348968105, 'epoch': 1.33}


 13%|█▎        | 356/2670 [08:38<56:19,  1.46s/it]

{'loss': 0.5069, 'grad_norm': 0.9077515602111816, 'learning_rate': 0.00017365853658536586, 'epoch': 1.33}


 13%|█▎        | 357/2670 [08:40<55:08,  1.43s/it]

{'loss': 0.4416, 'grad_norm': 1.079761028289795, 'learning_rate': 0.00017358348968105067, 'epoch': 1.34}


 13%|█▎        | 358/2670 [08:41<54:00,  1.40s/it]

{'loss': 0.4368, 'grad_norm': 0.9440984725952148, 'learning_rate': 0.00017350844277673545, 'epoch': 1.34}


 13%|█▎        | 359/2670 [08:42<51:49,  1.35s/it]

{'loss': 0.5538, 'grad_norm': 1.5138273239135742, 'learning_rate': 0.00017343339587242026, 'epoch': 1.34}


 13%|█▎        | 360/2670 [08:43<50:52,  1.32s/it]

{'loss': 0.2725, 'grad_norm': 0.6825062036514282, 'learning_rate': 0.00017335834896810507, 'epoch': 1.35}


 14%|█▎        | 361/2670 [08:45<51:36,  1.34s/it]

{'loss': 0.4047, 'grad_norm': 0.990020215511322, 'learning_rate': 0.00017328330206378988, 'epoch': 1.35}


 14%|█▎        | 362/2670 [08:46<52:36,  1.37s/it]

{'loss': 0.2012, 'grad_norm': 0.6053915619850159, 'learning_rate': 0.00017320825515947468, 'epoch': 1.36}


 14%|█▎        | 363/2670 [08:48<53:21,  1.39s/it]

{'loss': 0.3119, 'grad_norm': 0.7344991564750671, 'learning_rate': 0.0001731332082551595, 'epoch': 1.36}


 14%|█▎        | 364/2670 [08:49<52:18,  1.36s/it]

{'loss': 0.5099, 'grad_norm': 0.9719470739364624, 'learning_rate': 0.0001730581613508443, 'epoch': 1.36}


 14%|█▎        | 365/2670 [08:51<54:41,  1.42s/it]

{'loss': 0.2287, 'grad_norm': 0.6003044843673706, 'learning_rate': 0.00017298311444652908, 'epoch': 1.37}


 14%|█▎        | 366/2670 [08:52<55:19,  1.44s/it]

{'loss': 0.3406, 'grad_norm': 0.7710877060890198, 'learning_rate': 0.0001729080675422139, 'epoch': 1.37}


 14%|█▎        | 367/2670 [08:53<56:09,  1.46s/it]

{'loss': 0.488, 'grad_norm': 0.858100175857544, 'learning_rate': 0.0001728330206378987, 'epoch': 1.37}


 14%|█▍        | 368/2670 [08:55<56:39,  1.48s/it]

{'loss': 0.3478, 'grad_norm': 0.7835261821746826, 'learning_rate': 0.00017275797373358348, 'epoch': 1.38}


 14%|█▍        | 369/2670 [08:56<55:49,  1.46s/it]

{'loss': 0.3626, 'grad_norm': 0.7799492478370667, 'learning_rate': 0.0001726829268292683, 'epoch': 1.38}


 14%|█▍        | 370/2670 [08:58<54:12,  1.41s/it]

{'loss': 0.343, 'grad_norm': 0.958564817905426, 'learning_rate': 0.0001726078799249531, 'epoch': 1.39}


 14%|█▍        | 371/2670 [08:59<52:41,  1.38s/it]

{'loss': 0.5385, 'grad_norm': 1.1275837421417236, 'learning_rate': 0.00017253283302063788, 'epoch': 1.39}


 14%|█▍        | 372/2670 [09:01<55:36,  1.45s/it]

{'loss': 0.4987, 'grad_norm': 0.8418052792549133, 'learning_rate': 0.00017245778611632272, 'epoch': 1.39}


 14%|█▍        | 373/2670 [09:02<55:12,  1.44s/it]

{'loss': 0.3975, 'grad_norm': 0.9436595439910889, 'learning_rate': 0.00017238273921200753, 'epoch': 1.4}


 14%|█▍        | 374/2670 [09:03<53:30,  1.40s/it]

{'loss': 0.4646, 'grad_norm': 0.9765052199363708, 'learning_rate': 0.00017230769230769234, 'epoch': 1.4}


 14%|█▍        | 375/2670 [09:05<52:46,  1.38s/it]

{'loss': 0.6342, 'grad_norm': 1.119372010231018, 'learning_rate': 0.00017223264540337712, 'epoch': 1.4}


 14%|█▍        | 376/2670 [09:06<54:29,  1.43s/it]

{'loss': 0.3648, 'grad_norm': 0.8270366191864014, 'learning_rate': 0.00017215759849906193, 'epoch': 1.41}


 14%|█▍        | 377/2670 [09:07<52:22,  1.37s/it]

{'loss': 0.2944, 'grad_norm': 0.9919695258140564, 'learning_rate': 0.00017208255159474674, 'epoch': 1.41}


 14%|█▍        | 378/2670 [09:09<55:18,  1.45s/it]

{'loss': 0.6478, 'grad_norm': 0.8825339078903198, 'learning_rate': 0.00017200750469043152, 'epoch': 1.42}


 14%|█▍        | 379/2670 [09:11<56:14,  1.47s/it]

{'loss': 0.4974, 'grad_norm': 0.8438308835029602, 'learning_rate': 0.00017193245778611633, 'epoch': 1.42}


 14%|█▍        | 380/2670 [09:12<55:58,  1.47s/it]

{'loss': 0.6229, 'grad_norm': 0.9473931193351746, 'learning_rate': 0.00017185741088180114, 'epoch': 1.42}


 14%|█▍        | 381/2670 [09:14<55:34,  1.46s/it]

{'loss': 0.439, 'grad_norm': 0.8883255124092102, 'learning_rate': 0.00017178236397748592, 'epoch': 1.43}


 14%|█▍        | 382/2670 [09:15<53:00,  1.39s/it]

{'loss': 0.3023, 'grad_norm': 0.8008376955986023, 'learning_rate': 0.00017170731707317073, 'epoch': 1.43}


 14%|█▍        | 383/2670 [09:16<53:45,  1.41s/it]

{'loss': 0.368, 'grad_norm': 0.7887706756591797, 'learning_rate': 0.00017163227016885553, 'epoch': 1.43}


 14%|█▍        | 384/2670 [09:18<54:04,  1.42s/it]

{'loss': 0.2267, 'grad_norm': 0.6580536961555481, 'learning_rate': 0.00017155722326454034, 'epoch': 1.44}


 14%|█▍        | 385/2670 [09:19<53:15,  1.40s/it]

{'loss': 0.4965, 'grad_norm': 0.7124298214912415, 'learning_rate': 0.00017148217636022515, 'epoch': 1.44}


 14%|█▍        | 386/2670 [09:21<54:27,  1.43s/it]

{'loss': 0.2462, 'grad_norm': 0.7356524467468262, 'learning_rate': 0.00017140712945590996, 'epoch': 1.45}


 14%|█▍        | 387/2670 [09:22<55:32,  1.46s/it]

{'loss': 0.2433, 'grad_norm': 0.8031590580940247, 'learning_rate': 0.00017133208255159477, 'epoch': 1.45}


 15%|█▍        | 388/2670 [09:24<55:57,  1.47s/it]

{'loss': 0.3608, 'grad_norm': 1.07081937789917, 'learning_rate': 0.00017125703564727955, 'epoch': 1.45}


 15%|█▍        | 389/2670 [09:25<54:56,  1.45s/it]

{'loss': 0.2212, 'grad_norm': 0.8634881377220154, 'learning_rate': 0.00017118198874296436, 'epoch': 1.46}


 15%|█▍        | 390/2670 [09:26<55:50,  1.47s/it]

{'loss': 0.3585, 'grad_norm': 0.8427318930625916, 'learning_rate': 0.00017110694183864917, 'epoch': 1.46}


 15%|█▍        | 391/2670 [09:28<57:21,  1.51s/it]

{'loss': 0.4223, 'grad_norm': 0.8283810615539551, 'learning_rate': 0.00017103189493433395, 'epoch': 1.46}


 15%|█▍        | 392/2670 [09:29<56:05,  1.48s/it]

{'loss': 0.4221, 'grad_norm': 0.9790589809417725, 'learning_rate': 0.00017095684803001876, 'epoch': 1.47}


 15%|█▍        | 393/2670 [09:31<57:10,  1.51s/it]

{'loss': 0.429, 'grad_norm': 0.904904305934906, 'learning_rate': 0.00017088180112570357, 'epoch': 1.47}


 15%|█▍        | 394/2670 [09:33<57:04,  1.50s/it]

{'loss': 0.4176, 'grad_norm': 0.9241225719451904, 'learning_rate': 0.00017080675422138838, 'epoch': 1.48}


 15%|█▍        | 395/2670 [09:34<56:48,  1.50s/it]

{'loss': 0.4142, 'grad_norm': 1.1183050870895386, 'learning_rate': 0.0001707317073170732, 'epoch': 1.48}


 15%|█▍        | 396/2670 [09:36<56:56,  1.50s/it]

{'loss': 0.1836, 'grad_norm': 0.645963728427887, 'learning_rate': 0.000170656660412758, 'epoch': 1.48}


 15%|█▍        | 397/2670 [09:37<57:10,  1.51s/it]

{'loss': 0.253, 'grad_norm': 0.7915809750556946, 'learning_rate': 0.0001705816135084428, 'epoch': 1.49}


 15%|█▍        | 398/2670 [09:38<54:25,  1.44s/it]

{'loss': 0.3284, 'grad_norm': 0.8310629725456238, 'learning_rate': 0.00017050656660412759, 'epoch': 1.49}


 15%|█▍        | 399/2670 [09:40<54:45,  1.45s/it]

{'loss': 0.4556, 'grad_norm': 0.9301605224609375, 'learning_rate': 0.0001704315196998124, 'epoch': 1.49}


 15%|█▍        | 400/2670 [09:41<54:58,  1.45s/it]

{'loss': 0.3943, 'grad_norm': 0.9546126127243042, 'learning_rate': 0.0001703564727954972, 'epoch': 1.5}


 15%|█▌        | 401/2670 [09:43<59:16,  1.57s/it]

{'loss': 0.6027, 'grad_norm': 0.797061026096344, 'learning_rate': 0.00017028142589118199, 'epoch': 1.5}


 15%|█▌        | 402/2670 [09:44<57:17,  1.52s/it]

{'loss': 0.4228, 'grad_norm': 0.8781175017356873, 'learning_rate': 0.0001702063789868668, 'epoch': 1.51}


 15%|█▌        | 403/2670 [09:46<56:07,  1.49s/it]

{'loss': 0.4182, 'grad_norm': 0.8658390641212463, 'learning_rate': 0.0001701313320825516, 'epoch': 1.51}


 15%|█▌        | 404/2670 [09:47<57:21,  1.52s/it]

{'loss': 0.5127, 'grad_norm': 0.815056324005127, 'learning_rate': 0.00017005628517823639, 'epoch': 1.51}


 15%|█▌        | 405/2670 [09:49<57:29,  1.52s/it]

{'loss': 0.3405, 'grad_norm': 0.7337132692337036, 'learning_rate': 0.0001699812382739212, 'epoch': 1.52}


 15%|█▌        | 406/2670 [09:50<55:02,  1.46s/it]

{'loss': 0.1257, 'grad_norm': 0.5862448215484619, 'learning_rate': 0.000169906191369606, 'epoch': 1.52}


 15%|█▌        | 407/2670 [09:52<52:49,  1.40s/it]

{'loss': 0.5617, 'grad_norm': 0.9582924842834473, 'learning_rate': 0.00016983114446529084, 'epoch': 1.52}


 15%|█▌        | 408/2670 [09:53<52:28,  1.39s/it]

{'loss': 0.4857, 'grad_norm': 0.9769982695579529, 'learning_rate': 0.00016975609756097562, 'epoch': 1.53}


 15%|█▌        | 409/2670 [09:55<54:51,  1.46s/it]

{'loss': 0.2904, 'grad_norm': 0.6860275268554688, 'learning_rate': 0.00016968105065666043, 'epoch': 1.53}


 15%|█▌        | 410/2670 [09:56<53:30,  1.42s/it]

{'loss': 0.3681, 'grad_norm': 0.8735485672950745, 'learning_rate': 0.00016960600375234524, 'epoch': 1.54}


 15%|█▌        | 411/2670 [09:58<56:52,  1.51s/it]

{'loss': 0.4971, 'grad_norm': 0.7847551703453064, 'learning_rate': 0.00016953095684803002, 'epoch': 1.54}


 15%|█▌        | 412/2670 [09:59<57:54,  1.54s/it]

{'loss': 0.385, 'grad_norm': 0.7976089715957642, 'learning_rate': 0.00016945590994371483, 'epoch': 1.54}


 15%|█▌        | 413/2670 [10:01<55:32,  1.48s/it]

{'loss': 0.3228, 'grad_norm': 0.780807375907898, 'learning_rate': 0.00016938086303939964, 'epoch': 1.55}


 16%|█▌        | 414/2670 [10:02<56:54,  1.51s/it]

{'loss': 0.686, 'grad_norm': 0.9629232883453369, 'learning_rate': 0.00016930581613508442, 'epoch': 1.55}


 16%|█▌        | 415/2670 [10:04<57:03,  1.52s/it]

{'loss': 0.7333, 'grad_norm': 1.22441828250885, 'learning_rate': 0.00016923076923076923, 'epoch': 1.55}


 16%|█▌        | 416/2670 [10:05<59:42,  1.59s/it]

{'loss': 0.3424, 'grad_norm': 0.8625447750091553, 'learning_rate': 0.00016915572232645404, 'epoch': 1.56}


 16%|█▌        | 417/2670 [10:07<56:34,  1.51s/it]

{'loss': 0.235, 'grad_norm': 0.7462207078933716, 'learning_rate': 0.00016908067542213885, 'epoch': 1.56}


 16%|█▌        | 418/2670 [10:08<56:42,  1.51s/it]

{'loss': 0.2746, 'grad_norm': 0.752883791923523, 'learning_rate': 0.00016900562851782366, 'epoch': 1.57}


 16%|█▌        | 419/2670 [10:10<57:04,  1.52s/it]

{'loss': 0.545, 'grad_norm': 0.8881988525390625, 'learning_rate': 0.00016893058161350846, 'epoch': 1.57}


 16%|█▌        | 420/2670 [10:11<55:37,  1.48s/it]

{'loss': 0.1229, 'grad_norm': 0.5849631428718567, 'learning_rate': 0.00016885553470919327, 'epoch': 1.57}


 16%|█▌        | 421/2670 [10:13<55:54,  1.49s/it]

{'loss': 0.4627, 'grad_norm': 0.8803382515907288, 'learning_rate': 0.00016878048780487805, 'epoch': 1.58}


 16%|█▌        | 422/2670 [10:14<54:51,  1.46s/it]

{'loss': 0.5066, 'grad_norm': 1.0201975107192993, 'learning_rate': 0.00016870544090056286, 'epoch': 1.58}


 16%|█▌        | 423/2670 [10:16<53:42,  1.43s/it]

{'loss': 0.374, 'grad_norm': 0.8808330297470093, 'learning_rate': 0.00016863039399624767, 'epoch': 1.58}


 16%|█▌        | 424/2670 [10:17<57:37,  1.54s/it]

{'loss': 0.3787, 'grad_norm': 0.7734600305557251, 'learning_rate': 0.00016855534709193245, 'epoch': 1.59}


 16%|█▌        | 425/2670 [10:19<57:56,  1.55s/it]

{'loss': 0.3878, 'grad_norm': 0.8587217330932617, 'learning_rate': 0.00016848030018761726, 'epoch': 1.59}


 16%|█▌        | 426/2670 [10:21<59:08,  1.58s/it]

{'loss': 0.4796, 'grad_norm': 1.0325937271118164, 'learning_rate': 0.00016840525328330207, 'epoch': 1.6}


 16%|█▌        | 427/2670 [10:22<59:49,  1.60s/it]

{'loss': 0.6655, 'grad_norm': 0.9764149785041809, 'learning_rate': 0.00016833020637898685, 'epoch': 1.6}


 16%|█▌        | 428/2670 [10:23<53:50,  1.44s/it]

{'loss': 0.6484, 'grad_norm': 5.389888763427734, 'learning_rate': 0.00016825515947467166, 'epoch': 1.6}


 16%|█▌        | 429/2670 [10:25<54:30,  1.46s/it]

{'loss': 0.2913, 'grad_norm': 0.758443295955658, 'learning_rate': 0.0001681801125703565, 'epoch': 1.61}


 16%|█▌        | 430/2670 [10:26<53:56,  1.44s/it]

{'loss': 0.3119, 'grad_norm': 0.8795582056045532, 'learning_rate': 0.0001681050656660413, 'epoch': 1.61}


 16%|█▌        | 431/2670 [10:28<55:35,  1.49s/it]

{'loss': 0.3533, 'grad_norm': 0.8143899440765381, 'learning_rate': 0.0001680300187617261, 'epoch': 1.61}


 16%|█▌        | 432/2670 [10:29<55:43,  1.49s/it]

{'loss': 0.1653, 'grad_norm': 0.5926997065544128, 'learning_rate': 0.0001679549718574109, 'epoch': 1.62}


 16%|█▌        | 433/2670 [10:31<57:25,  1.54s/it]

{'loss': 0.6326, 'grad_norm': 0.9262698888778687, 'learning_rate': 0.0001678799249530957, 'epoch': 1.62}


 16%|█▋        | 434/2670 [10:32<57:09,  1.53s/it]

{'loss': 0.4295, 'grad_norm': 0.9458078145980835, 'learning_rate': 0.0001678048780487805, 'epoch': 1.63}


 16%|█▋        | 435/2670 [10:34<54:33,  1.46s/it]

{'loss': 0.3005, 'grad_norm': 0.7886026501655579, 'learning_rate': 0.0001677298311444653, 'epoch': 1.63}


 16%|█▋        | 436/2670 [10:35<52:44,  1.42s/it]

{'loss': 0.2938, 'grad_norm': 0.6673480868339539, 'learning_rate': 0.0001676547842401501, 'epoch': 1.63}


 16%|█▋        | 437/2670 [10:37<56:09,  1.51s/it]

{'loss': 0.7363, 'grad_norm': 0.8350823521614075, 'learning_rate': 0.0001675797373358349, 'epoch': 1.64}


 16%|█▋        | 438/2670 [10:38<54:38,  1.47s/it]

{'loss': 0.3456, 'grad_norm': 1.076958179473877, 'learning_rate': 0.0001675046904315197, 'epoch': 1.64}


 16%|█▋        | 439/2670 [10:40<56:38,  1.52s/it]

{'loss': 0.3622, 'grad_norm': 0.6977125406265259, 'learning_rate': 0.0001674296435272045, 'epoch': 1.64}


 16%|█▋        | 440/2670 [10:42<59:07,  1.59s/it]

{'loss': 0.4909, 'grad_norm': 0.8209798336029053, 'learning_rate': 0.00016735459662288931, 'epoch': 1.65}


 17%|█▋        | 441/2670 [10:43<59:44,  1.61s/it]

{'loss': 0.5116, 'grad_norm': 0.90169757604599, 'learning_rate': 0.00016727954971857412, 'epoch': 1.65}


 17%|█▋        | 442/2670 [10:45<59:21,  1.60s/it]

{'loss': 0.409, 'grad_norm': 0.803968608379364, 'learning_rate': 0.00016720450281425893, 'epoch': 1.66}


 17%|█▋        | 443/2670 [10:46<57:33,  1.55s/it]

{'loss': 0.2531, 'grad_norm': 0.7775071859359741, 'learning_rate': 0.00016712945590994374, 'epoch': 1.66}


 17%|█▋        | 444/2670 [10:48<55:42,  1.50s/it]

{'loss': 0.3197, 'grad_norm': 1.0056419372558594, 'learning_rate': 0.00016705440900562852, 'epoch': 1.66}


 17%|█▋        | 445/2670 [10:49<56:36,  1.53s/it]

{'loss': 0.1956, 'grad_norm': 0.8371291160583496, 'learning_rate': 0.00016697936210131333, 'epoch': 1.67}


 17%|█▋        | 446/2670 [10:51<57:12,  1.54s/it]

{'loss': 0.4728, 'grad_norm': 0.9679154753684998, 'learning_rate': 0.00016690431519699814, 'epoch': 1.67}


 17%|█▋        | 447/2670 [10:52<53:41,  1.45s/it]

{'loss': 0.4399, 'grad_norm': 1.0473589897155762, 'learning_rate': 0.00016682926829268292, 'epoch': 1.67}


 17%|█▋        | 448/2670 [10:53<52:41,  1.42s/it]

{'loss': 0.3117, 'grad_norm': 1.0110397338867188, 'learning_rate': 0.00016675422138836773, 'epoch': 1.68}


 17%|█▋        | 449/2670 [10:55<52:00,  1.40s/it]

{'loss': 0.3384, 'grad_norm': 0.9204409718513489, 'learning_rate': 0.00016667917448405254, 'epoch': 1.68}


 17%|█▋        | 450/2670 [10:56<52:14,  1.41s/it]

{'loss': 0.3374, 'grad_norm': 0.8515989184379578, 'learning_rate': 0.00016660412757973732, 'epoch': 1.69}


 17%|█▋        | 451/2670 [10:58<57:09,  1.55s/it]

{'loss': 0.6321, 'grad_norm': 1.000788927078247, 'learning_rate': 0.00016652908067542216, 'epoch': 1.69}


 17%|█▋        | 452/2670 [10:59<55:51,  1.51s/it]

{'loss': 0.3075, 'grad_norm': 0.9461055994033813, 'learning_rate': 0.00016645403377110697, 'epoch': 1.69}


 17%|█▋        | 453/2670 [11:01<54:51,  1.48s/it]

{'loss': 0.1989, 'grad_norm': 0.793731689453125, 'learning_rate': 0.00016637898686679175, 'epoch': 1.7}


 17%|█▋        | 454/2670 [11:02<55:20,  1.50s/it]

{'loss': 0.5396, 'grad_norm': 0.971052885055542, 'learning_rate': 0.00016630393996247656, 'epoch': 1.7}


 17%|█▋        | 455/2670 [11:04<53:41,  1.45s/it]

{'loss': 0.3958, 'grad_norm': 1.030229091644287, 'learning_rate': 0.00016622889305816137, 'epoch': 1.7}


 17%|█▋        | 456/2670 [11:05<51:44,  1.40s/it]

{'loss': 0.743, 'grad_norm': 1.2081289291381836, 'learning_rate': 0.00016615384615384617, 'epoch': 1.71}


 17%|█▋        | 457/2670 [11:06<52:17,  1.42s/it]

{'loss': 0.4208, 'grad_norm': 0.9329719543457031, 'learning_rate': 0.00016607879924953096, 'epoch': 1.71}


 17%|█▋        | 458/2670 [11:08<52:14,  1.42s/it]

{'loss': 0.2484, 'grad_norm': 0.7970436811447144, 'learning_rate': 0.00016600375234521577, 'epoch': 1.72}


 17%|█▋        | 459/2670 [11:09<50:29,  1.37s/it]

{'loss': 0.282, 'grad_norm': 0.8422890305519104, 'learning_rate': 0.00016592870544090057, 'epoch': 1.72}


 17%|█▋        | 460/2670 [11:11<54:56,  1.49s/it]

{'loss': 0.2851, 'grad_norm': 0.6187687516212463, 'learning_rate': 0.00016585365853658536, 'epoch': 1.72}


 17%|█▋        | 461/2670 [11:12<53:43,  1.46s/it]

{'loss': 0.1538, 'grad_norm': 0.6090940237045288, 'learning_rate': 0.00016577861163227016, 'epoch': 1.73}


 17%|█▋        | 462/2670 [11:14<55:32,  1.51s/it]

{'loss': 0.3415, 'grad_norm': 0.759616494178772, 'learning_rate': 0.00016570356472795497, 'epoch': 1.73}


 17%|█▋        | 463/2670 [11:15<50:59,  1.39s/it]

{'loss': 0.4234, 'grad_norm': 1.0645416975021362, 'learning_rate': 0.00016562851782363978, 'epoch': 1.73}


 17%|█▋        | 464/2670 [11:16<51:21,  1.40s/it]

{'loss': 0.2168, 'grad_norm': 0.7795007824897766, 'learning_rate': 0.0001655534709193246, 'epoch': 1.74}


 17%|█▋        | 465/2670 [11:18<49:42,  1.35s/it]

{'loss': 0.3198, 'grad_norm': 1.016972303390503, 'learning_rate': 0.0001654784240150094, 'epoch': 1.74}


 17%|█▋        | 466/2670 [11:19<48:48,  1.33s/it]

{'loss': 0.3672, 'grad_norm': 0.8825218677520752, 'learning_rate': 0.0001654033771106942, 'epoch': 1.75}


 17%|█▋        | 467/2670 [11:20<48:13,  1.31s/it]

{'loss': 0.3168, 'grad_norm': 0.9120123386383057, 'learning_rate': 0.000165328330206379, 'epoch': 1.75}


 18%|█▊        | 468/2670 [11:22<48:47,  1.33s/it]

{'loss': 0.4782, 'grad_norm': 1.2310035228729248, 'learning_rate': 0.0001652532833020638, 'epoch': 1.75}


 18%|█▊        | 469/2670 [11:23<53:13,  1.45s/it]

{'loss': 0.5586, 'grad_norm': 0.9300328493118286, 'learning_rate': 0.0001651782363977486, 'epoch': 1.76}


 18%|█▊        | 470/2670 [11:25<52:46,  1.44s/it]

{'loss': 0.6277, 'grad_norm': 1.0564950704574585, 'learning_rate': 0.0001651031894934334, 'epoch': 1.76}


 18%|█▊        | 471/2670 [11:26<53:49,  1.47s/it]

{'loss': 0.3719, 'grad_norm': 0.9903765916824341, 'learning_rate': 0.0001650281425891182, 'epoch': 1.76}


 18%|█▊        | 472/2670 [11:28<54:16,  1.48s/it]

{'loss': 0.4694, 'grad_norm': 0.9682808518409729, 'learning_rate': 0.000164953095684803, 'epoch': 1.77}


 18%|█▊        | 473/2670 [11:29<53:36,  1.46s/it]

{'loss': 0.343, 'grad_norm': 0.9398562908172607, 'learning_rate': 0.00016487804878048782, 'epoch': 1.77}


 18%|█▊        | 474/2670 [11:31<52:49,  1.44s/it]

{'loss': 0.2364, 'grad_norm': 0.6939002871513367, 'learning_rate': 0.00016480300187617263, 'epoch': 1.78}


 18%|█▊        | 475/2670 [11:32<53:33,  1.46s/it]

{'loss': 0.2689, 'grad_norm': 0.7120252251625061, 'learning_rate': 0.00016472795497185743, 'epoch': 1.78}


 18%|█▊        | 476/2670 [11:33<52:21,  1.43s/it]

{'loss': 0.549, 'grad_norm': 1.0908973217010498, 'learning_rate': 0.00016465290806754222, 'epoch': 1.78}


 18%|█▊        | 477/2670 [11:35<53:06,  1.45s/it]

{'loss': 0.2788, 'grad_norm': 0.7312307357788086, 'learning_rate': 0.00016457786116322703, 'epoch': 1.79}


 18%|█▊        | 478/2670 [11:36<52:18,  1.43s/it]

{'loss': 0.2731, 'grad_norm': 0.8715852499008179, 'learning_rate': 0.00016450281425891183, 'epoch': 1.79}


 18%|█▊        | 479/2670 [11:38<49:12,  1.35s/it]

{'loss': 0.3915, 'grad_norm': 1.0309970378875732, 'learning_rate': 0.00016442776735459664, 'epoch': 1.79}


 18%|█▊        | 480/2670 [11:39<50:25,  1.38s/it]

{'loss': 0.3365, 'grad_norm': 0.7270018458366394, 'learning_rate': 0.00016435272045028142, 'epoch': 1.8}


 18%|█▊        | 481/2670 [11:41<52:55,  1.45s/it]

{'loss': 0.3452, 'grad_norm': 0.8567725419998169, 'learning_rate': 0.00016427767354596623, 'epoch': 1.8}


 18%|█▊        | 482/2670 [11:42<55:17,  1.52s/it]

{'loss': 0.4051, 'grad_norm': 1.036192774772644, 'learning_rate': 0.00016420262664165104, 'epoch': 1.81}


 18%|█▊        | 483/2670 [11:44<55:17,  1.52s/it]

{'loss': 0.5432, 'grad_norm': 1.0806796550750732, 'learning_rate': 0.00016412757973733582, 'epoch': 1.81}


 18%|█▊        | 484/2670 [11:45<53:17,  1.46s/it]

{'loss': 0.3464, 'grad_norm': 0.9526211023330688, 'learning_rate': 0.00016405253283302063, 'epoch': 1.81}


 18%|█▊        | 485/2670 [11:47<53:47,  1.48s/it]

{'loss': 0.3245, 'grad_norm': 0.8506864309310913, 'learning_rate': 0.00016397748592870544, 'epoch': 1.82}


 18%|█▊        | 486/2670 [11:48<51:00,  1.40s/it]

{'loss': 0.3593, 'grad_norm': 0.9057244062423706, 'learning_rate': 0.00016390243902439025, 'epoch': 1.82}


 18%|█▊        | 487/2670 [11:49<53:07,  1.46s/it]

{'loss': 0.396, 'grad_norm': 0.7943552732467651, 'learning_rate': 0.00016382739212007506, 'epoch': 1.82}


 18%|█▊        | 488/2670 [11:51<54:09,  1.49s/it]

{'loss': 0.2291, 'grad_norm': 0.7397666573524475, 'learning_rate': 0.00016375234521575987, 'epoch': 1.83}


 18%|█▊        | 489/2670 [11:53<54:21,  1.50s/it]

{'loss': 0.3082, 'grad_norm': 0.8310060501098633, 'learning_rate': 0.00016367729831144468, 'epoch': 1.83}


 18%|█▊        | 490/2670 [11:54<53:08,  1.46s/it]

{'loss': 0.3769, 'grad_norm': 0.9522413611412048, 'learning_rate': 0.00016360225140712946, 'epoch': 1.84}


 18%|█▊        | 491/2670 [11:55<53:53,  1.48s/it]

{'loss': 0.266, 'grad_norm': 0.7385202646255493, 'learning_rate': 0.00016352720450281427, 'epoch': 1.84}


 18%|█▊        | 492/2670 [11:57<51:15,  1.41s/it]

{'loss': 0.368, 'grad_norm': 1.1226485967636108, 'learning_rate': 0.00016345215759849908, 'epoch': 1.84}


 18%|█▊        | 493/2670 [11:58<52:35,  1.45s/it]

{'loss': 0.3551, 'grad_norm': 0.8426619172096252, 'learning_rate': 0.00016337711069418386, 'epoch': 1.85}


 19%|█▊        | 494/2670 [12:00<51:36,  1.42s/it]

{'loss': 0.2859, 'grad_norm': 0.8515224456787109, 'learning_rate': 0.00016330206378986867, 'epoch': 1.85}


 19%|█▊        | 495/2670 [12:01<52:40,  1.45s/it]

{'loss': 0.309, 'grad_norm': 1.0093464851379395, 'learning_rate': 0.00016322701688555348, 'epoch': 1.85}


 19%|█▊        | 496/2670 [12:02<50:43,  1.40s/it]

{'loss': 0.2737, 'grad_norm': 0.8200439810752869, 'learning_rate': 0.00016315196998123829, 'epoch': 1.86}


 19%|█▊        | 497/2670 [12:04<54:13,  1.50s/it]

{'loss': 0.7108, 'grad_norm': 0.9994272589683533, 'learning_rate': 0.0001630769230769231, 'epoch': 1.86}


 19%|█▊        | 498/2670 [12:06<54:46,  1.51s/it]

{'loss': 0.3906, 'grad_norm': 0.8856616616249084, 'learning_rate': 0.0001630018761726079, 'epoch': 1.87}


 19%|█▊        | 499/2670 [12:07<52:26,  1.45s/it]

{'loss': 0.4897, 'grad_norm': 1.0682029724121094, 'learning_rate': 0.00016292682926829268, 'epoch': 1.87}


 19%|█▊        | 500/2670 [12:08<53:04,  1.47s/it]

{'loss': 0.2579, 'grad_norm': 1.2906678915023804, 'learning_rate': 0.0001628517823639775, 'epoch': 1.87}


 19%|█▉        | 501/2670 [12:12<1:16:46,  2.12s/it]

{'loss': 0.4115, 'grad_norm': 0.8570066690444946, 'learning_rate': 0.0001627767354596623, 'epoch': 1.88}


 19%|█▉        | 502/2670 [12:14<1:13:02,  2.02s/it]

{'loss': 0.5929, 'grad_norm': 0.8682137727737427, 'learning_rate': 0.0001627016885553471, 'epoch': 1.88}


 19%|█▉        | 503/2670 [12:15<1:04:29,  1.79s/it]

{'loss': 0.3886, 'grad_norm': 1.0490803718566895, 'learning_rate': 0.0001626266416510319, 'epoch': 1.88}


 19%|█▉        | 504/2670 [12:17<1:01:45,  1.71s/it]

{'loss': 0.4105, 'grad_norm': 1.0744026899337769, 'learning_rate': 0.0001625515947467167, 'epoch': 1.89}


 19%|█▉        | 505/2670 [12:18<57:35,  1.60s/it]

{'loss': 0.2964, 'grad_norm': 0.9976047873497009, 'learning_rate': 0.0001624765478424015, 'epoch': 1.89}


 19%|█▉        | 506/2670 [12:20<58:42,  1.63s/it]

{'loss': 0.3616, 'grad_norm': 0.7590601444244385, 'learning_rate': 0.0001624015009380863, 'epoch': 1.9}


 19%|█▉        | 507/2670 [12:21<58:22,  1.62s/it]

{'loss': 0.3096, 'grad_norm': 0.7558102607727051, 'learning_rate': 0.0001623264540337711, 'epoch': 1.9}


 19%|█▉        | 508/2670 [12:23<55:07,  1.53s/it]

{'loss': 0.367, 'grad_norm': 0.9013243913650513, 'learning_rate': 0.0001622514071294559, 'epoch': 1.9}


 19%|█▉        | 509/2670 [12:24<55:50,  1.55s/it]

{'loss': 0.1649, 'grad_norm': 0.6716073155403137, 'learning_rate': 0.00016217636022514072, 'epoch': 1.91}


 19%|█▉        | 510/2670 [12:26<54:20,  1.51s/it]

{'loss': 0.2611, 'grad_norm': 0.8243446350097656, 'learning_rate': 0.00016210131332082553, 'epoch': 1.91}


 19%|█▉        | 511/2670 [12:27<54:46,  1.52s/it]

{'loss': 0.3925, 'grad_norm': 1.0420633554458618, 'learning_rate': 0.00016202626641651034, 'epoch': 1.91}


 19%|█▉        | 512/2670 [12:29<54:38,  1.52s/it]

{'loss': 0.4625, 'grad_norm': 0.8218337297439575, 'learning_rate': 0.00016195121951219515, 'epoch': 1.92}


 19%|█▉        | 513/2670 [12:30<50:54,  1.42s/it]

{'loss': 0.5707, 'grad_norm': 1.0132533311843872, 'learning_rate': 0.00016187617260787993, 'epoch': 1.92}


 19%|█▉        | 514/2670 [12:31<50:56,  1.42s/it]

{'loss': 0.3217, 'grad_norm': 0.7970952987670898, 'learning_rate': 0.00016180112570356474, 'epoch': 1.93}


 19%|█▉        | 515/2670 [12:33<50:23,  1.40s/it]

{'loss': 0.2666, 'grad_norm': 0.7214953899383545, 'learning_rate': 0.00016172607879924954, 'epoch': 1.93}


 19%|█▉        | 516/2670 [12:34<50:26,  1.41s/it]

{'loss': 0.4233, 'grad_norm': 0.9273810982704163, 'learning_rate': 0.00016165103189493433, 'epoch': 1.93}


 19%|█▉        | 517/2670 [12:35<49:53,  1.39s/it]

{'loss': 0.397, 'grad_norm': 0.8839590549468994, 'learning_rate': 0.00016157598499061914, 'epoch': 1.94}


 19%|█▉        | 518/2670 [12:37<49:47,  1.39s/it]

{'loss': 0.3075, 'grad_norm': 0.8914449214935303, 'learning_rate': 0.00016150093808630394, 'epoch': 1.94}


 19%|█▉        | 519/2670 [12:38<49:48,  1.39s/it]

{'loss': 0.2162, 'grad_norm': 0.708358883857727, 'learning_rate': 0.00016142589118198875, 'epoch': 1.94}


 19%|█▉        | 520/2670 [12:40<49:12,  1.37s/it]

{'loss': 0.222, 'grad_norm': 0.8331077098846436, 'learning_rate': 0.00016135084427767356, 'epoch': 1.95}


 20%|█▉        | 521/2670 [12:41<55:10,  1.54s/it]

{'loss': 0.2593, 'grad_norm': 0.744538426399231, 'learning_rate': 0.00016127579737335837, 'epoch': 1.95}


 20%|█▉        | 522/2670 [12:43<54:18,  1.52s/it]

{'loss': 0.5103, 'grad_norm': 1.0682108402252197, 'learning_rate': 0.00016120075046904315, 'epoch': 1.96}


 20%|█▉        | 523/2670 [12:44<53:04,  1.48s/it]

{'loss': 0.23, 'grad_norm': 0.8245058059692383, 'learning_rate': 0.00016112570356472796, 'epoch': 1.96}


 20%|█▉        | 524/2670 [12:46<51:04,  1.43s/it]

{'loss': 0.212, 'grad_norm': 0.7628507614135742, 'learning_rate': 0.00016105065666041277, 'epoch': 1.96}


 20%|█▉        | 525/2670 [12:47<49:15,  1.38s/it]

{'loss': 0.3457, 'grad_norm': 0.9821978807449341, 'learning_rate': 0.00016097560975609758, 'epoch': 1.97}


 20%|█▉        | 526/2670 [12:48<47:14,  1.32s/it]

{'loss': 0.2457, 'grad_norm': 0.9650209546089172, 'learning_rate': 0.00016090056285178236, 'epoch': 1.97}


 20%|█▉        | 527/2670 [12:49<46:34,  1.30s/it]

{'loss': 0.2036, 'grad_norm': 1.0264387130737305, 'learning_rate': 0.00016082551594746717, 'epoch': 1.97}


 20%|█▉        | 528/2670 [12:51<49:40,  1.39s/it]

{'loss': 0.2077, 'grad_norm': 0.7071818709373474, 'learning_rate': 0.00016075046904315198, 'epoch': 1.98}


 20%|█▉        | 529/2670 [12:52<47:24,  1.33s/it]

{'loss': 0.3308, 'grad_norm': 1.071293830871582, 'learning_rate': 0.00016067542213883676, 'epoch': 1.98}


 20%|█▉        | 530/2670 [12:53<47:40,  1.34s/it]

{'loss': 0.5232, 'grad_norm': 0.9773825407028198, 'learning_rate': 0.00016060037523452157, 'epoch': 1.99}


 20%|█▉        | 531/2670 [12:55<49:46,  1.40s/it]

{'loss': 0.2139, 'grad_norm': 0.8657611608505249, 'learning_rate': 0.0001605253283302064, 'epoch': 1.99}


 20%|█▉        | 532/2670 [12:56<49:17,  1.38s/it]

{'loss': 0.2122, 'grad_norm': 0.7809075713157654, 'learning_rate': 0.0001604502814258912, 'epoch': 1.99}


 20%|█▉        | 533/2670 [12:58<47:47,  1.34s/it]

{'loss': 0.3335, 'grad_norm': 0.9479188919067383, 'learning_rate': 0.000160375234521576, 'epoch': 2.0}


 20%|██        | 534/2670 [12:59<50:06,  1.41s/it]

{'loss': 0.4182, 'grad_norm': 0.8265287280082703, 'learning_rate': 0.0001603001876172608, 'epoch': 2.0}


 20%|██        | 535/2670 [13:01<50:19,  1.41s/it]

{'loss': 0.1995, 'grad_norm': 0.8588557243347168, 'learning_rate': 0.00016022514071294561, 'epoch': 2.0}


 20%|██        | 536/2670 [13:02<51:34,  1.45s/it]

{'loss': 0.1165, 'grad_norm': 0.4943504333496094, 'learning_rate': 0.0001601500938086304, 'epoch': 2.01}


 20%|██        | 537/2670 [13:04<53:42,  1.51s/it]

{'loss': 0.1223, 'grad_norm': 0.5254456400871277, 'learning_rate': 0.0001600750469043152, 'epoch': 2.01}


 20%|██        | 538/2670 [13:05<55:48,  1.57s/it]

{'loss': 0.1508, 'grad_norm': 0.66147381067276, 'learning_rate': 0.00016, 'epoch': 2.01}


 20%|██        | 539/2670 [13:07<55:29,  1.56s/it]

{'loss': 0.1036, 'grad_norm': 0.5515768527984619, 'learning_rate': 0.0001599249530956848, 'epoch': 2.02}


 20%|██        | 540/2670 [13:09<54:51,  1.55s/it]

{'loss': 0.1252, 'grad_norm': 0.6738636493682861, 'learning_rate': 0.0001598499061913696, 'epoch': 2.02}


 20%|██        | 541/2670 [13:10<52:40,  1.48s/it]

{'loss': 0.1187, 'grad_norm': 0.5630198121070862, 'learning_rate': 0.0001597748592870544, 'epoch': 2.03}


 20%|██        | 542/2670 [13:11<49:09,  1.39s/it]

{'loss': 0.0789, 'grad_norm': 0.4631318747997284, 'learning_rate': 0.00015969981238273922, 'epoch': 2.03}


 20%|██        | 543/2670 [13:12<47:44,  1.35s/it]

{'loss': 0.2029, 'grad_norm': 0.8013830184936523, 'learning_rate': 0.00015962476547842403, 'epoch': 2.03}


 20%|██        | 544/2670 [13:14<47:56,  1.35s/it]

{'loss': 0.2136, 'grad_norm': 1.0162311792373657, 'learning_rate': 0.00015954971857410884, 'epoch': 2.04}


 20%|██        | 545/2670 [13:15<46:26,  1.31s/it]

{'loss': 0.1436, 'grad_norm': 0.7504698634147644, 'learning_rate': 0.00015947467166979362, 'epoch': 2.04}


 20%|██        | 546/2670 [13:16<48:22,  1.37s/it]

{'loss': 0.0891, 'grad_norm': 0.6538156867027283, 'learning_rate': 0.00015939962476547843, 'epoch': 2.04}


 20%|██        | 547/2670 [13:18<46:47,  1.32s/it]

{'loss': 0.168, 'grad_norm': 1.6030329465866089, 'learning_rate': 0.00015932457786116324, 'epoch': 2.05}


 21%|██        | 548/2670 [13:19<45:51,  1.30s/it]

{'loss': 0.1812, 'grad_norm': 0.9324310421943665, 'learning_rate': 0.00015924953095684805, 'epoch': 2.05}


 21%|██        | 549/2670 [13:20<43:38,  1.23s/it]

{'loss': 0.1388, 'grad_norm': 0.8246085047721863, 'learning_rate': 0.00015917448405253283, 'epoch': 2.06}


 21%|██        | 550/2670 [13:21<46:22,  1.31s/it]

{'loss': 0.1056, 'grad_norm': 1.0464224815368652, 'learning_rate': 0.00015909943714821764, 'epoch': 2.06}


 21%|██        | 551/2670 [13:23<48:25,  1.37s/it]

{'loss': 0.0986, 'grad_norm': 0.730577826499939, 'learning_rate': 0.00015902439024390245, 'epoch': 2.06}


 21%|██        | 552/2670 [13:24<47:48,  1.35s/it]

{'loss': 0.0569, 'grad_norm': 0.4988686144351959, 'learning_rate': 0.00015894934333958726, 'epoch': 2.07}


 21%|██        | 553/2670 [13:25<46:47,  1.33s/it]

{'loss': 0.2162, 'grad_norm': 1.2808233499526978, 'learning_rate': 0.00015887429643527206, 'epoch': 2.07}


 21%|██        | 554/2670 [13:27<49:32,  1.40s/it]

{'loss': 0.1704, 'grad_norm': 0.8394178748130798, 'learning_rate': 0.00015879924953095687, 'epoch': 2.07}


 21%|██        | 555/2670 [13:29<54:32,  1.55s/it]

{'loss': 0.1599, 'grad_norm': 0.7987822890281677, 'learning_rate': 0.00015872420262664166, 'epoch': 2.08}


 21%|██        | 556/2670 [13:30<50:17,  1.43s/it]

{'loss': 0.1364, 'grad_norm': 0.714813232421875, 'learning_rate': 0.00015864915572232646, 'epoch': 2.08}


 21%|██        | 557/2670 [13:32<52:02,  1.48s/it]

{'loss': 0.1463, 'grad_norm': 0.8373457193374634, 'learning_rate': 0.00015857410881801127, 'epoch': 2.09}


 21%|██        | 558/2670 [13:33<53:29,  1.52s/it]

{'loss': 0.1256, 'grad_norm': 0.6833876967430115, 'learning_rate': 0.00015849906191369608, 'epoch': 2.09}


 21%|██        | 559/2670 [13:35<55:01,  1.56s/it]

{'loss': 0.1024, 'grad_norm': 0.5881589651107788, 'learning_rate': 0.00015842401500938086, 'epoch': 2.09}


 21%|██        | 560/2670 [13:36<52:41,  1.50s/it]

{'loss': 0.1998, 'grad_norm': 0.8900511264801025, 'learning_rate': 0.00015834896810506567, 'epoch': 2.1}


 21%|██        | 561/2670 [13:38<53:26,  1.52s/it]

{'loss': 0.0617, 'grad_norm': 0.42191624641418457, 'learning_rate': 0.00015827392120075048, 'epoch': 2.1}


 21%|██        | 562/2670 [13:39<50:03,  1.42s/it]

{'loss': 0.0644, 'grad_norm': 0.46176043152809143, 'learning_rate': 0.00015819887429643526, 'epoch': 2.1}


 21%|██        | 563/2670 [13:41<52:02,  1.48s/it]

{'loss': 0.094, 'grad_norm': 0.6290491819381714, 'learning_rate': 0.00015812382739212007, 'epoch': 2.11}


 21%|██        | 564/2670 [13:42<50:36,  1.44s/it]

{'loss': 0.1735, 'grad_norm': 0.7695235013961792, 'learning_rate': 0.00015804878048780488, 'epoch': 2.11}


 21%|██        | 565/2670 [13:43<49:23,  1.41s/it]

{'loss': 0.2632, 'grad_norm': 0.9707357883453369, 'learning_rate': 0.0001579737335834897, 'epoch': 2.12}


 21%|██        | 566/2670 [13:45<51:16,  1.46s/it]

{'loss': 0.2341, 'grad_norm': 0.8057092428207397, 'learning_rate': 0.0001578986866791745, 'epoch': 2.12}


 21%|██        | 567/2670 [13:47<52:44,  1.50s/it]

{'loss': 0.1164, 'grad_norm': 0.5684698224067688, 'learning_rate': 0.0001578236397748593, 'epoch': 2.12}


 21%|██▏       | 568/2670 [13:48<47:25,  1.35s/it]

{'loss': 0.1993, 'grad_norm': 0.9761792421340942, 'learning_rate': 0.0001577485928705441, 'epoch': 2.13}


 21%|██▏       | 569/2670 [13:49<49:09,  1.40s/it]

{'loss': 0.1459, 'grad_norm': 0.6624965071678162, 'learning_rate': 0.0001576735459662289, 'epoch': 2.13}


 21%|██▏       | 570/2670 [13:51<50:22,  1.44s/it]

{'loss': 0.1443, 'grad_norm': 0.7219461798667908, 'learning_rate': 0.0001575984990619137, 'epoch': 2.13}


 21%|██▏       | 571/2670 [13:52<50:24,  1.44s/it]

{'loss': 0.3532, 'grad_norm': 1.7407523393630981, 'learning_rate': 0.00015752345215759852, 'epoch': 2.14}


 21%|██▏       | 572/2670 [13:53<47:27,  1.36s/it]

{'loss': 0.1756, 'grad_norm': 1.1089398860931396, 'learning_rate': 0.0001574484052532833, 'epoch': 2.14}


 21%|██▏       | 573/2670 [13:55<48:07,  1.38s/it]

{'loss': 0.174, 'grad_norm': 0.8077962398529053, 'learning_rate': 0.0001573733583489681, 'epoch': 2.15}


 21%|██▏       | 574/2670 [13:56<46:51,  1.34s/it]

{'loss': 0.2184, 'grad_norm': 0.9033962488174438, 'learning_rate': 0.00015729831144465292, 'epoch': 2.15}


 22%|██▏       | 575/2670 [13:57<45:01,  1.29s/it]

{'loss': 0.0765, 'grad_norm': 0.5206926465034485, 'learning_rate': 0.00015722326454033772, 'epoch': 2.15}


 22%|██▏       | 576/2670 [13:59<46:28,  1.33s/it]

{'loss': 0.219, 'grad_norm': 0.887611448764801, 'learning_rate': 0.00015714821763602253, 'epoch': 2.16}


 22%|██▏       | 577/2670 [14:00<48:50,  1.40s/it]

{'loss': 0.1453, 'grad_norm': 0.769986629486084, 'learning_rate': 0.00015707317073170734, 'epoch': 2.16}


 22%|██▏       | 578/2670 [14:02<50:11,  1.44s/it]

{'loss': 0.0773, 'grad_norm': 0.45508289337158203, 'learning_rate': 0.00015699812382739212, 'epoch': 2.16}


 22%|██▏       | 579/2670 [14:03<50:06,  1.44s/it]

{'loss': 0.2181, 'grad_norm': 0.8835832476615906, 'learning_rate': 0.00015692307692307693, 'epoch': 2.17}


 22%|██▏       | 580/2670 [14:04<49:14,  1.41s/it]

{'loss': 0.1415, 'grad_norm': 0.8512759208679199, 'learning_rate': 0.00015684803001876174, 'epoch': 2.17}


 22%|██▏       | 581/2670 [14:06<49:38,  1.43s/it]

{'loss': 0.2115, 'grad_norm': 0.68714439868927, 'learning_rate': 0.00015677298311444652, 'epoch': 2.18}


 22%|██▏       | 582/2670 [14:07<48:43,  1.40s/it]

{'loss': 0.0985, 'grad_norm': 0.868053674697876, 'learning_rate': 0.00015669793621013133, 'epoch': 2.18}


 22%|██▏       | 583/2670 [14:09<52:29,  1.51s/it]

{'loss': 0.2718, 'grad_norm': 0.7146534323692322, 'learning_rate': 0.00015662288930581614, 'epoch': 2.18}


 22%|██▏       | 584/2670 [14:11<53:21,  1.53s/it]

{'loss': 0.2078, 'grad_norm': 0.8846215605735779, 'learning_rate': 0.00015654784240150095, 'epoch': 2.19}


 22%|██▏       | 585/2670 [14:12<54:33,  1.57s/it]

{'loss': 0.1883, 'grad_norm': 0.6727622151374817, 'learning_rate': 0.00015647279549718573, 'epoch': 2.19}


 22%|██▏       | 586/2670 [14:14<53:22,  1.54s/it]

{'loss': 0.1289, 'grad_norm': 0.6722792983055115, 'learning_rate': 0.00015639774859287054, 'epoch': 2.19}


 22%|██▏       | 587/2670 [14:15<53:25,  1.54s/it]

{'loss': 0.1704, 'grad_norm': 0.6725286841392517, 'learning_rate': 0.00015632270168855535, 'epoch': 2.2}


 22%|██▏       | 588/2670 [14:16<50:29,  1.46s/it]

{'loss': 0.133, 'grad_norm': 0.749885082244873, 'learning_rate': 0.00015624765478424016, 'epoch': 2.2}


 22%|██▏       | 589/2670 [14:18<50:43,  1.46s/it]

{'loss': 0.1697, 'grad_norm': 0.7090213298797607, 'learning_rate': 0.00015617260787992497, 'epoch': 2.21}


 22%|██▏       | 590/2670 [14:20<52:19,  1.51s/it]

{'loss': 0.1814, 'grad_norm': 0.7160885334014893, 'learning_rate': 0.00015609756097560978, 'epoch': 2.21}


 22%|██▏       | 591/2670 [14:21<51:41,  1.49s/it]

{'loss': 0.2332, 'grad_norm': 0.8818460702896118, 'learning_rate': 0.00015602251407129456, 'epoch': 2.21}


 22%|██▏       | 592/2670 [14:23<52:14,  1.51s/it]

{'loss': 0.0859, 'grad_norm': 0.4955446124076843, 'learning_rate': 0.00015594746716697937, 'epoch': 2.22}


 22%|██▏       | 593/2670 [14:24<54:01,  1.56s/it]

{'loss': 0.0744, 'grad_norm': 0.4807446002960205, 'learning_rate': 0.00015587242026266417, 'epoch': 2.22}


 22%|██▏       | 594/2670 [14:26<53:00,  1.53s/it]

{'loss': 0.1636, 'grad_norm': 0.8396703600883484, 'learning_rate': 0.00015579737335834898, 'epoch': 2.22}


 22%|██▏       | 595/2670 [14:27<52:17,  1.51s/it]

{'loss': 0.1364, 'grad_norm': 0.6311339139938354, 'learning_rate': 0.00015572232645403377, 'epoch': 2.23}


 22%|██▏       | 596/2670 [14:29<51:55,  1.50s/it]

{'loss': 0.2116, 'grad_norm': 0.774257481098175, 'learning_rate': 0.00015564727954971857, 'epoch': 2.23}


 22%|██▏       | 597/2670 [14:30<51:35,  1.49s/it]

{'loss': 0.122, 'grad_norm': 0.7421743869781494, 'learning_rate': 0.00015557223264540338, 'epoch': 2.24}


 22%|██▏       | 598/2670 [14:31<48:56,  1.42s/it]

{'loss': 0.0826, 'grad_norm': 0.44308578968048096, 'learning_rate': 0.0001554971857410882, 'epoch': 2.24}


 22%|██▏       | 599/2670 [14:33<48:15,  1.40s/it]

{'loss': 0.1071, 'grad_norm': 0.7218056917190552, 'learning_rate': 0.000155422138836773, 'epoch': 2.24}


 22%|██▏       | 600/2670 [14:34<47:19,  1.37s/it]

{'loss': 0.17, 'grad_norm': 0.710468590259552, 'learning_rate': 0.0001553470919324578, 'epoch': 2.25}


 23%|██▎       | 601/2670 [14:35<47:02,  1.36s/it]

{'loss': 0.2363, 'grad_norm': 0.749695360660553, 'learning_rate': 0.0001552720450281426, 'epoch': 2.25}


 23%|██▎       | 602/2670 [14:37<47:35,  1.38s/it]

{'loss': 0.1719, 'grad_norm': 0.7740315794944763, 'learning_rate': 0.0001551969981238274, 'epoch': 2.25}


 23%|██▎       | 603/2670 [14:38<50:28,  1.47s/it]

{'loss': 0.1427, 'grad_norm': 0.622779905796051, 'learning_rate': 0.0001551219512195122, 'epoch': 2.26}


 23%|██▎       | 604/2670 [14:40<51:17,  1.49s/it]

{'loss': 0.1509, 'grad_norm': 0.6892864108085632, 'learning_rate': 0.000155046904315197, 'epoch': 2.26}


 23%|██▎       | 605/2670 [14:42<51:44,  1.50s/it]

{'loss': 0.1164, 'grad_norm': 0.7835919260978699, 'learning_rate': 0.0001549718574108818, 'epoch': 2.27}


 23%|██▎       | 606/2670 [14:43<50:45,  1.48s/it]

{'loss': 0.1275, 'grad_norm': 0.6654770374298096, 'learning_rate': 0.0001548968105065666, 'epoch': 2.27}


 23%|██▎       | 607/2670 [14:45<53:32,  1.56s/it]

{'loss': 0.1707, 'grad_norm': 0.821922779083252, 'learning_rate': 0.00015482176360225142, 'epoch': 2.27}


 23%|██▎       | 608/2670 [14:46<52:31,  1.53s/it]

{'loss': 0.1913, 'grad_norm': 0.8316491842269897, 'learning_rate': 0.0001547467166979362, 'epoch': 2.28}


 23%|██▎       | 609/2670 [14:48<53:26,  1.56s/it]

{'loss': 0.1692, 'grad_norm': 0.6940563321113586, 'learning_rate': 0.000154671669793621, 'epoch': 2.28}


 23%|██▎       | 610/2670 [14:49<53:55,  1.57s/it]

{'loss': 0.218, 'grad_norm': 0.8415630459785461, 'learning_rate': 0.00015459662288930584, 'epoch': 2.28}


 23%|██▎       | 611/2670 [14:51<54:06,  1.58s/it]

{'loss': 0.0726, 'grad_norm': 0.5758682489395142, 'learning_rate': 0.00015452157598499063, 'epoch': 2.29}


 23%|██▎       | 612/2670 [14:52<52:33,  1.53s/it]

{'loss': 0.153, 'grad_norm': 0.735988974571228, 'learning_rate': 0.00015444652908067543, 'epoch': 2.29}


 23%|██▎       | 613/2670 [14:54<52:34,  1.53s/it]

{'loss': 0.1399, 'grad_norm': 0.6565821766853333, 'learning_rate': 0.00015437148217636024, 'epoch': 2.3}


 23%|██▎       | 614/2670 [14:55<50:43,  1.48s/it]

{'loss': 0.1482, 'grad_norm': 0.6464351415634155, 'learning_rate': 0.00015429643527204503, 'epoch': 2.3}


 23%|██▎       | 615/2670 [14:57<48:14,  1.41s/it]

{'loss': 0.1423, 'grad_norm': 0.974770724773407, 'learning_rate': 0.00015422138836772983, 'epoch': 2.3}


 23%|██▎       | 616/2670 [14:58<51:04,  1.49s/it]

{'loss': 0.0737, 'grad_norm': 0.6357188820838928, 'learning_rate': 0.00015414634146341464, 'epoch': 2.31}


 23%|██▎       | 617/2670 [15:00<50:11,  1.47s/it]

{'loss': 0.1011, 'grad_norm': 0.5714409947395325, 'learning_rate': 0.00015407129455909945, 'epoch': 2.31}


 23%|██▎       | 618/2670 [15:01<46:21,  1.36s/it]

{'loss': 0.1248, 'grad_norm': 0.71414715051651, 'learning_rate': 0.00015399624765478423, 'epoch': 2.31}


 23%|██▎       | 619/2670 [15:02<49:22,  1.44s/it]

{'loss': 0.1001, 'grad_norm': 0.6542508006095886, 'learning_rate': 0.00015392120075046904, 'epoch': 2.32}


 23%|██▎       | 620/2670 [15:04<50:07,  1.47s/it]

{'loss': 0.2975, 'grad_norm': 0.9059062004089355, 'learning_rate': 0.00015384615384615385, 'epoch': 2.32}


 23%|██▎       | 621/2670 [15:06<52:31,  1.54s/it]

{'loss': 0.231, 'grad_norm': 0.7554742097854614, 'learning_rate': 0.00015377110694183866, 'epoch': 2.33}


 23%|██▎       | 622/2670 [15:07<52:03,  1.53s/it]

{'loss': 0.1124, 'grad_norm': 0.705100953578949, 'learning_rate': 0.00015369606003752347, 'epoch': 2.33}


 23%|██▎       | 623/2670 [15:09<53:15,  1.56s/it]

{'loss': 0.2908, 'grad_norm': 0.8863115310668945, 'learning_rate': 0.00015362101313320828, 'epoch': 2.33}


 23%|██▎       | 624/2670 [15:10<48:33,  1.42s/it]

{'loss': 0.0932, 'grad_norm': 0.6753265261650085, 'learning_rate': 0.00015354596622889306, 'epoch': 2.34}


 23%|██▎       | 625/2670 [15:11<48:26,  1.42s/it]

{'loss': 0.0571, 'grad_norm': 0.3941034972667694, 'learning_rate': 0.00015347091932457787, 'epoch': 2.34}


 23%|██▎       | 626/2670 [15:13<48:51,  1.43s/it]

{'loss': 0.1651, 'grad_norm': 0.6756704449653625, 'learning_rate': 0.00015339587242026268, 'epoch': 2.34}


 23%|██▎       | 627/2670 [15:14<50:47,  1.49s/it]

{'loss': 0.1898, 'grad_norm': 0.8803021907806396, 'learning_rate': 0.00015332082551594746, 'epoch': 2.35}


 24%|██▎       | 628/2670 [15:16<48:17,  1.42s/it]

{'loss': 0.1178, 'grad_norm': 0.6848833560943604, 'learning_rate': 0.00015324577861163227, 'epoch': 2.35}


 24%|██▎       | 629/2670 [15:17<45:10,  1.33s/it]

{'loss': 0.1102, 'grad_norm': 0.5649510622024536, 'learning_rate': 0.00015317073170731708, 'epoch': 2.36}


 24%|██▎       | 630/2670 [15:18<47:21,  1.39s/it]

{'loss': 0.1054, 'grad_norm': 0.7007929086685181, 'learning_rate': 0.00015309568480300189, 'epoch': 2.36}


 24%|██▎       | 631/2670 [15:20<48:39,  1.43s/it]

{'loss': 0.2228, 'grad_norm': 0.8442178964614868, 'learning_rate': 0.00015302063789868667, 'epoch': 2.36}


 24%|██▎       | 632/2670 [15:22<51:28,  1.52s/it]

{'loss': 0.066, 'grad_norm': 0.42316269874572754, 'learning_rate': 0.0001529455909943715, 'epoch': 2.37}


 24%|██▎       | 633/2670 [15:23<51:17,  1.51s/it]

{'loss': 0.281, 'grad_norm': 0.848116934299469, 'learning_rate': 0.0001528705440900563, 'epoch': 2.37}


 24%|██▎       | 634/2670 [15:25<51:29,  1.52s/it]

{'loss': 0.2954, 'grad_norm': 0.8493538498878479, 'learning_rate': 0.0001527954971857411, 'epoch': 2.37}


 24%|██▍       | 635/2670 [15:26<51:49,  1.53s/it]

{'loss': 0.1323, 'grad_norm': 0.65345698595047, 'learning_rate': 0.0001527204502814259, 'epoch': 2.38}


 24%|██▍       | 636/2670 [15:28<53:47,  1.59s/it]

{'loss': 0.2762, 'grad_norm': 0.7606272101402283, 'learning_rate': 0.0001526454033771107, 'epoch': 2.38}


 24%|██▍       | 637/2670 [15:29<51:08,  1.51s/it]

{'loss': 0.1858, 'grad_norm': 0.7388919591903687, 'learning_rate': 0.0001525703564727955, 'epoch': 2.39}


 24%|██▍       | 638/2670 [15:30<49:05,  1.45s/it]

{'loss': 0.1333, 'grad_norm': 0.6628273129463196, 'learning_rate': 0.0001524953095684803, 'epoch': 2.39}


 24%|██▍       | 639/2670 [15:32<51:32,  1.52s/it]

{'loss': 0.2167, 'grad_norm': 0.7645533680915833, 'learning_rate': 0.0001524202626641651, 'epoch': 2.39}


 24%|██▍       | 640/2670 [15:33<48:34,  1.44s/it]

{'loss': 0.1484, 'grad_norm': 0.6862233281135559, 'learning_rate': 0.00015234521575984992, 'epoch': 2.4}


 24%|██▍       | 641/2670 [15:35<46:25,  1.37s/it]

{'loss': 0.1551, 'grad_norm': 0.9734851121902466, 'learning_rate': 0.0001522701688555347, 'epoch': 2.4}


 24%|██▍       | 642/2670 [15:36<45:49,  1.36s/it]

{'loss': 0.0752, 'grad_norm': 0.5510382056236267, 'learning_rate': 0.0001521951219512195, 'epoch': 2.4}


 24%|██▍       | 643/2670 [15:37<47:29,  1.41s/it]

{'loss': 0.1426, 'grad_norm': 1.213181972503662, 'learning_rate': 0.00015212007504690432, 'epoch': 2.41}


 24%|██▍       | 644/2670 [15:39<48:36,  1.44s/it]

{'loss': 0.1299, 'grad_norm': 0.6772633194923401, 'learning_rate': 0.00015204502814258913, 'epoch': 2.41}


 24%|██▍       | 645/2670 [15:40<48:47,  1.45s/it]

{'loss': 0.0477, 'grad_norm': 0.43606874346733093, 'learning_rate': 0.00015196998123827394, 'epoch': 2.42}


 24%|██▍       | 646/2670 [15:42<47:15,  1.40s/it]

{'loss': 0.1552, 'grad_norm': 0.6841598153114319, 'learning_rate': 0.00015189493433395875, 'epoch': 2.42}


 24%|██▍       | 647/2670 [15:43<48:45,  1.45s/it]

{'loss': 0.1538, 'grad_norm': 0.6542750597000122, 'learning_rate': 0.00015181988742964353, 'epoch': 2.42}


 24%|██▍       | 648/2670 [15:45<47:41,  1.42s/it]

{'loss': 0.1034, 'grad_norm': 0.5385205745697021, 'learning_rate': 0.00015174484052532834, 'epoch': 2.43}


 24%|██▍       | 649/2670 [15:46<46:41,  1.39s/it]

{'loss': 0.1089, 'grad_norm': 0.8509840369224548, 'learning_rate': 0.00015166979362101315, 'epoch': 2.43}


 24%|██▍       | 650/2670 [15:47<47:05,  1.40s/it]

{'loss': 0.1472, 'grad_norm': 0.7718850374221802, 'learning_rate': 0.00015159474671669793, 'epoch': 2.43}


 24%|██▍       | 651/2670 [15:49<46:14,  1.37s/it]

{'loss': 0.0915, 'grad_norm': 0.4384502172470093, 'learning_rate': 0.00015151969981238274, 'epoch': 2.44}


 24%|██▍       | 652/2670 [15:50<46:19,  1.38s/it]

{'loss': 0.1812, 'grad_norm': 0.7964531183242798, 'learning_rate': 0.00015144465290806755, 'epoch': 2.44}


 24%|██▍       | 653/2670 [15:52<47:16,  1.41s/it]

{'loss': 0.2222, 'grad_norm': 0.7257682681083679, 'learning_rate': 0.00015136960600375235, 'epoch': 2.45}


 24%|██▍       | 654/2670 [15:53<50:09,  1.49s/it]

{'loss': 0.2025, 'grad_norm': 0.6941007375717163, 'learning_rate': 0.00015129455909943716, 'epoch': 2.45}


 25%|██▍       | 655/2670 [15:55<52:52,  1.57s/it]

{'loss': 0.1475, 'grad_norm': 0.6473216414451599, 'learning_rate': 0.00015121951219512197, 'epoch': 2.45}


 25%|██▍       | 656/2670 [15:56<51:05,  1.52s/it]

{'loss': 0.1522, 'grad_norm': 0.7598049640655518, 'learning_rate': 0.00015114446529080678, 'epoch': 2.46}


 25%|██▍       | 657/2670 [15:58<51:48,  1.54s/it]

{'loss': 0.1016, 'grad_norm': 0.5365865230560303, 'learning_rate': 0.00015106941838649156, 'epoch': 2.46}


 25%|██▍       | 658/2670 [15:59<50:04,  1.49s/it]

{'loss': 0.0811, 'grad_norm': 0.4751507043838501, 'learning_rate': 0.00015099437148217637, 'epoch': 2.46}


 25%|██▍       | 659/2670 [16:01<47:00,  1.40s/it]

{'loss': 0.0963, 'grad_norm': 0.653532862663269, 'learning_rate': 0.00015091932457786118, 'epoch': 2.47}


 25%|██▍       | 660/2670 [16:02<45:46,  1.37s/it]

{'loss': 0.2336, 'grad_norm': 1.1048611402511597, 'learning_rate': 0.00015084427767354596, 'epoch': 2.47}


 25%|██▍       | 661/2670 [16:03<48:03,  1.44s/it]

{'loss': 0.1064, 'grad_norm': 0.7002822160720825, 'learning_rate': 0.00015076923076923077, 'epoch': 2.48}


 25%|██▍       | 662/2670 [16:05<48:46,  1.46s/it]

{'loss': 0.1506, 'grad_norm': 0.6216983199119568, 'learning_rate': 0.00015069418386491558, 'epoch': 2.48}


 25%|██▍       | 663/2670 [16:06<47:34,  1.42s/it]

{'loss': 0.0677, 'grad_norm': 0.586851954460144, 'learning_rate': 0.0001506191369606004, 'epoch': 2.48}


 25%|██▍       | 664/2670 [16:07<45:26,  1.36s/it]

{'loss': 0.1588, 'grad_norm': 0.9784364104270935, 'learning_rate': 0.00015054409005628517, 'epoch': 2.49}


 25%|██▍       | 665/2670 [16:09<46:09,  1.38s/it]

{'loss': 0.1627, 'grad_norm': 0.7531434893608093, 'learning_rate': 0.00015046904315196998, 'epoch': 2.49}


 25%|██▍       | 666/2670 [16:10<47:39,  1.43s/it]

{'loss': 0.1076, 'grad_norm': 0.6067988872528076, 'learning_rate': 0.0001503939962476548, 'epoch': 2.49}


 25%|██▍       | 667/2670 [16:12<47:59,  1.44s/it]

{'loss': 0.0811, 'grad_norm': 0.6288000345230103, 'learning_rate': 0.0001503189493433396, 'epoch': 2.5}


 25%|██▌       | 668/2670 [16:14<50:48,  1.52s/it]

{'loss': 0.1281, 'grad_norm': 0.6321525573730469, 'learning_rate': 0.0001502439024390244, 'epoch': 2.5}


 25%|██▌       | 669/2670 [16:15<51:33,  1.55s/it]

{'loss': 0.1685, 'grad_norm': 0.739266574382782, 'learning_rate': 0.00015016885553470921, 'epoch': 2.51}


 25%|██▌       | 670/2670 [16:17<49:39,  1.49s/it]

{'loss': 0.1229, 'grad_norm': 0.7679492235183716, 'learning_rate': 0.000150093808630394, 'epoch': 2.51}


 25%|██▌       | 671/2670 [16:18<49:18,  1.48s/it]

{'loss': 0.0675, 'grad_norm': 0.5431169867515564, 'learning_rate': 0.0001500187617260788, 'epoch': 2.51}


 25%|██▌       | 672/2670 [16:19<46:11,  1.39s/it]

{'loss': 0.0946, 'grad_norm': 0.5428764224052429, 'learning_rate': 0.00014994371482176361, 'epoch': 2.52}


 25%|██▌       | 673/2670 [16:21<47:28,  1.43s/it]

{'loss': 0.1812, 'grad_norm': 0.7543787360191345, 'learning_rate': 0.0001498686679174484, 'epoch': 2.52}


 25%|██▌       | 674/2670 [16:23<51:56,  1.56s/it]

{'loss': 0.1422, 'grad_norm': 0.6816059350967407, 'learning_rate': 0.0001497936210131332, 'epoch': 2.52}


 25%|██▌       | 675/2670 [16:24<49:23,  1.49s/it]

{'loss': 0.1052, 'grad_norm': 0.6424818634986877, 'learning_rate': 0.000149718574108818, 'epoch': 2.53}


 25%|██▌       | 676/2670 [16:25<48:42,  1.47s/it]

{'loss': 0.0762, 'grad_norm': 0.6979430317878723, 'learning_rate': 0.00014964352720450282, 'epoch': 2.53}


 25%|██▌       | 677/2670 [16:27<49:59,  1.51s/it]

{'loss': 0.1328, 'grad_norm': 0.6640111804008484, 'learning_rate': 0.00014956848030018763, 'epoch': 2.54}


 25%|██▌       | 678/2670 [16:28<49:31,  1.49s/it]

{'loss': 0.0588, 'grad_norm': 0.5097827911376953, 'learning_rate': 0.00014949343339587244, 'epoch': 2.54}


 25%|██▌       | 679/2670 [16:30<47:39,  1.44s/it]

{'loss': 0.2019, 'grad_norm': 0.7227354645729065, 'learning_rate': 0.00014941838649155725, 'epoch': 2.54}


 25%|██▌       | 680/2670 [16:31<48:08,  1.45s/it]

{'loss': 0.0672, 'grad_norm': 0.5921526551246643, 'learning_rate': 0.00014934333958724203, 'epoch': 2.55}


 26%|██▌       | 681/2670 [16:33<48:36,  1.47s/it]

{'loss': 0.1269, 'grad_norm': 0.5730261206626892, 'learning_rate': 0.00014926829268292684, 'epoch': 2.55}


 26%|██▌       | 682/2670 [16:34<46:40,  1.41s/it]

{'loss': 0.1666, 'grad_norm': 0.9120457172393799, 'learning_rate': 0.00014919324577861165, 'epoch': 2.55}


 26%|██▌       | 683/2670 [16:35<45:21,  1.37s/it]

{'loss': 0.0871, 'grad_norm': 0.7331465482711792, 'learning_rate': 0.00014911819887429643, 'epoch': 2.56}


 26%|██▌       | 684/2670 [16:37<45:55,  1.39s/it]

{'loss': 0.1064, 'grad_norm': 0.5519458651542664, 'learning_rate': 0.00014904315196998124, 'epoch': 2.56}


 26%|██▌       | 685/2670 [16:38<45:35,  1.38s/it]

{'loss': 0.0844, 'grad_norm': 0.768690288066864, 'learning_rate': 0.00014896810506566605, 'epoch': 2.57}


 26%|██▌       | 686/2670 [16:39<43:48,  1.32s/it]

{'loss': 0.1191, 'grad_norm': 0.6339293718338013, 'learning_rate': 0.00014889305816135086, 'epoch': 2.57}


 26%|██▌       | 687/2670 [16:41<43:28,  1.32s/it]

{'loss': 0.0913, 'grad_norm': 0.5106675028800964, 'learning_rate': 0.00014881801125703564, 'epoch': 2.57}


 26%|██▌       | 688/2670 [16:42<45:53,  1.39s/it]

{'loss': 0.181, 'grad_norm': 0.8477293252944946, 'learning_rate': 0.00014874296435272045, 'epoch': 2.58}


 26%|██▌       | 689/2670 [16:44<48:23,  1.47s/it]

{'loss': 0.1557, 'grad_norm': 0.6788881421089172, 'learning_rate': 0.00014866791744840526, 'epoch': 2.58}


 26%|██▌       | 690/2670 [16:45<48:24,  1.47s/it]

{'loss': 0.1358, 'grad_norm': 0.5950583815574646, 'learning_rate': 0.00014859287054409006, 'epoch': 2.58}


 26%|██▌       | 691/2670 [16:46<45:45,  1.39s/it]

{'loss': 0.1037, 'grad_norm': 0.876578152179718, 'learning_rate': 0.00014851782363977487, 'epoch': 2.59}


 26%|██▌       | 692/2670 [16:48<48:09,  1.46s/it]

{'loss': 0.0711, 'grad_norm': 0.4435814917087555, 'learning_rate': 0.00014844277673545968, 'epoch': 2.59}


 26%|██▌       | 693/2670 [16:49<46:40,  1.42s/it]

{'loss': 0.0816, 'grad_norm': 0.598748505115509, 'learning_rate': 0.00014836772983114446, 'epoch': 2.6}


 26%|██▌       | 694/2670 [16:51<47:35,  1.45s/it]

{'loss': 0.1646, 'grad_norm': 0.7755576968193054, 'learning_rate': 0.00014829268292682927, 'epoch': 2.6}


 26%|██▌       | 695/2670 [16:53<52:46,  1.60s/it]

{'loss': 0.1019, 'grad_norm': 0.5050419569015503, 'learning_rate': 0.00014821763602251408, 'epoch': 2.6}


 26%|██▌       | 696/2670 [16:54<48:24,  1.47s/it]

{'loss': 0.1071, 'grad_norm': 0.7990249395370483, 'learning_rate': 0.00014814258911819886, 'epoch': 2.61}


 26%|██▌       | 697/2670 [16:55<47:33,  1.45s/it]

{'loss': 0.0597, 'grad_norm': 0.5224154591560364, 'learning_rate': 0.00014806754221388367, 'epoch': 2.61}


 26%|██▌       | 698/2670 [16:57<48:32,  1.48s/it]

{'loss': 0.1203, 'grad_norm': 0.5994104743003845, 'learning_rate': 0.00014799249530956848, 'epoch': 2.61}


 26%|██▌       | 699/2670 [16:58<49:08,  1.50s/it]

{'loss': 0.2679, 'grad_norm': 0.6716285943984985, 'learning_rate': 0.0001479174484052533, 'epoch': 2.62}


 26%|██▌       | 700/2670 [17:00<52:03,  1.59s/it]

{'loss': 0.2613, 'grad_norm': 0.7202606201171875, 'learning_rate': 0.0001478424015009381, 'epoch': 2.62}


 26%|██▋       | 701/2670 [17:02<51:18,  1.56s/it]

{'loss': 0.066, 'grad_norm': 0.46043628454208374, 'learning_rate': 0.0001477673545966229, 'epoch': 2.63}


 26%|██▋       | 702/2670 [17:03<52:14,  1.59s/it]

{'loss': 0.2412, 'grad_norm': 0.8109120726585388, 'learning_rate': 0.00014769230769230772, 'epoch': 2.63}


 26%|██▋       | 703/2670 [17:05<50:51,  1.55s/it]

{'loss': 0.1529, 'grad_norm': 0.670268714427948, 'learning_rate': 0.0001476172607879925, 'epoch': 2.63}


 26%|██▋       | 704/2670 [17:06<49:30,  1.51s/it]

{'loss': 0.2075, 'grad_norm': 0.7358793020248413, 'learning_rate': 0.0001475422138836773, 'epoch': 2.64}


 26%|██▋       | 705/2670 [17:08<48:22,  1.48s/it]

{'loss': 0.03, 'grad_norm': 0.2924574017524719, 'learning_rate': 0.00014746716697936212, 'epoch': 2.64}


 26%|██▋       | 706/2670 [17:09<49:43,  1.52s/it]

{'loss': 0.0954, 'grad_norm': 0.5961561799049377, 'learning_rate': 0.0001473921200750469, 'epoch': 2.64}


 26%|██▋       | 707/2670 [17:11<49:41,  1.52s/it]

{'loss': 0.1534, 'grad_norm': 0.6347991824150085, 'learning_rate': 0.0001473170731707317, 'epoch': 2.65}


 27%|██▋       | 708/2670 [17:12<48:53,  1.50s/it]

{'loss': 0.1121, 'grad_norm': 0.5879014730453491, 'learning_rate': 0.00014724202626641652, 'epoch': 2.65}


 27%|██▋       | 709/2670 [17:14<47:25,  1.45s/it]

{'loss': 0.1689, 'grad_norm': 0.7035996317863464, 'learning_rate': 0.0001471669793621013, 'epoch': 2.66}


 27%|██▋       | 710/2670 [17:15<47:07,  1.44s/it]

{'loss': 0.175, 'grad_norm': 0.718387246131897, 'learning_rate': 0.0001470919324577861, 'epoch': 2.66}


 27%|██▋       | 711/2670 [17:17<48:48,  1.49s/it]

{'loss': 0.1415, 'grad_norm': 0.7108581066131592, 'learning_rate': 0.00014701688555347092, 'epoch': 2.66}


 27%|██▋       | 712/2670 [17:18<48:03,  1.47s/it]

{'loss': 0.1101, 'grad_norm': 0.6433650255203247, 'learning_rate': 0.00014694183864915575, 'epoch': 2.67}


 27%|██▋       | 713/2670 [17:19<47:06,  1.44s/it]

{'loss': 0.0796, 'grad_norm': 0.5499284863471985, 'learning_rate': 0.00014686679174484053, 'epoch': 2.67}


 27%|██▋       | 714/2670 [17:21<46:24,  1.42s/it]

{'loss': 0.1674, 'grad_norm': 0.755446195602417, 'learning_rate': 0.00014679174484052534, 'epoch': 2.67}


 27%|██▋       | 715/2670 [17:22<46:04,  1.41s/it]

{'loss': 0.1545, 'grad_norm': 0.7762033343315125, 'learning_rate': 0.00014671669793621015, 'epoch': 2.68}


 27%|██▋       | 716/2670 [17:24<48:02,  1.48s/it]

{'loss': 0.1527, 'grad_norm': 0.7472879886627197, 'learning_rate': 0.00014664165103189493, 'epoch': 2.68}


 27%|██▋       | 717/2670 [17:25<47:35,  1.46s/it]

{'loss': 0.1294, 'grad_norm': 0.7250727415084839, 'learning_rate': 0.00014656660412757974, 'epoch': 2.69}


 27%|██▋       | 718/2670 [17:27<46:54,  1.44s/it]

{'loss': 0.1638, 'grad_norm': 0.8046157360076904, 'learning_rate': 0.00014649155722326455, 'epoch': 2.69}


 27%|██▋       | 719/2670 [17:28<47:48,  1.47s/it]

{'loss': 0.2317, 'grad_norm': 0.7996575832366943, 'learning_rate': 0.00014641651031894933, 'epoch': 2.69}


 27%|██▋       | 720/2670 [17:30<46:54,  1.44s/it]

{'loss': 0.136, 'grad_norm': 0.8277842998504639, 'learning_rate': 0.00014634146341463414, 'epoch': 2.7}


 27%|██▋       | 721/2670 [17:31<47:58,  1.48s/it]

{'loss': 0.0541, 'grad_norm': 0.41052526235580444, 'learning_rate': 0.00014626641651031895, 'epoch': 2.7}


 27%|██▋       | 722/2670 [17:33<49:26,  1.52s/it]

{'loss': 0.1976, 'grad_norm': 0.9219529032707214, 'learning_rate': 0.00014619136960600376, 'epoch': 2.7}


 27%|██▋       | 723/2670 [17:34<46:02,  1.42s/it]

{'loss': 0.1754, 'grad_norm': 0.8799299001693726, 'learning_rate': 0.00014611632270168857, 'epoch': 2.71}


 27%|██▋       | 724/2670 [17:36<47:59,  1.48s/it]

{'loss': 0.2207, 'grad_norm': 0.8349860906600952, 'learning_rate': 0.00014604127579737338, 'epoch': 2.71}


 27%|██▋       | 725/2670 [17:37<47:22,  1.46s/it]

{'loss': 0.1082, 'grad_norm': 0.578764021396637, 'learning_rate': 0.00014596622889305819, 'epoch': 2.72}


 27%|██▋       | 726/2670 [17:39<48:07,  1.49s/it]

{'loss': 0.1491, 'grad_norm': 1.032204508781433, 'learning_rate': 0.00014589118198874297, 'epoch': 2.72}


 27%|██▋       | 727/2670 [17:40<47:45,  1.47s/it]

{'loss': 0.1639, 'grad_norm': 0.8146791458129883, 'learning_rate': 0.00014581613508442778, 'epoch': 2.72}


 27%|██▋       | 728/2670 [17:41<46:37,  1.44s/it]

{'loss': 0.1641, 'grad_norm': 0.6566330790519714, 'learning_rate': 0.00014574108818011258, 'epoch': 2.73}


 27%|██▋       | 729/2670 [17:43<46:48,  1.45s/it]

{'loss': 0.0708, 'grad_norm': 0.5139054656028748, 'learning_rate': 0.00014566604127579737, 'epoch': 2.73}


 27%|██▋       | 730/2670 [17:44<46:40,  1.44s/it]

{'loss': 0.1386, 'grad_norm': 0.7871342897415161, 'learning_rate': 0.00014559099437148218, 'epoch': 2.73}


 27%|██▋       | 731/2670 [17:46<48:25,  1.50s/it]

{'loss': 0.1485, 'grad_norm': 0.6394511461257935, 'learning_rate': 0.00014551594746716698, 'epoch': 2.74}


 27%|██▋       | 732/2670 [17:47<48:47,  1.51s/it]

{'loss': 0.2045, 'grad_norm': 0.7853655815124512, 'learning_rate': 0.00014544090056285177, 'epoch': 2.74}


 27%|██▋       | 733/2670 [17:49<48:52,  1.51s/it]

{'loss': 0.1067, 'grad_norm': 0.44485482573509216, 'learning_rate': 0.0001453658536585366, 'epoch': 2.75}


 27%|██▋       | 734/2670 [17:50<47:51,  1.48s/it]

{'loss': 0.193, 'grad_norm': 0.8418623805046082, 'learning_rate': 0.0001452908067542214, 'epoch': 2.75}


 28%|██▊       | 735/2670 [17:52<48:02,  1.49s/it]

{'loss': 0.0866, 'grad_norm': 0.5621123313903809, 'learning_rate': 0.00014521575984990622, 'epoch': 2.75}


 28%|██▊       | 736/2670 [17:54<49:22,  1.53s/it]

{'loss': 0.1675, 'grad_norm': 0.6723294258117676, 'learning_rate': 0.000145140712945591, 'epoch': 2.76}


 28%|██▊       | 737/2670 [17:55<48:58,  1.52s/it]

{'loss': 0.1986, 'grad_norm': 0.9080674052238464, 'learning_rate': 0.0001450656660412758, 'epoch': 2.76}


 28%|██▊       | 738/2670 [17:56<46:24,  1.44s/it]

{'loss': 0.1302, 'grad_norm': 0.7894476056098938, 'learning_rate': 0.00014499061913696062, 'epoch': 2.76}


 28%|██▊       | 739/2670 [17:58<48:52,  1.52s/it]

{'loss': 0.0814, 'grad_norm': 0.48442333936691284, 'learning_rate': 0.0001449155722326454, 'epoch': 2.77}


 28%|██▊       | 740/2670 [17:59<47:54,  1.49s/it]

{'loss': 0.0908, 'grad_norm': 0.6837579011917114, 'learning_rate': 0.0001448405253283302, 'epoch': 2.77}


 28%|██▊       | 741/2670 [18:01<46:37,  1.45s/it]

{'loss': 0.1837, 'grad_norm': 0.8949213027954102, 'learning_rate': 0.00014476547842401502, 'epoch': 2.78}


 28%|██▊       | 742/2670 [18:02<45:10,  1.41s/it]

{'loss': 0.0989, 'grad_norm': 0.6792762875556946, 'learning_rate': 0.0001446904315196998, 'epoch': 2.78}


 28%|██▊       | 743/2670 [18:03<44:28,  1.38s/it]

{'loss': 0.1193, 'grad_norm': 0.681296706199646, 'learning_rate': 0.0001446153846153846, 'epoch': 2.78}


 28%|██▊       | 744/2670 [18:05<43:54,  1.37s/it]

{'loss': 0.1274, 'grad_norm': 0.8220813274383545, 'learning_rate': 0.00014454033771106942, 'epoch': 2.79}


 28%|██▊       | 745/2670 [18:06<45:23,  1.41s/it]

{'loss': 0.1276, 'grad_norm': 0.6852532029151917, 'learning_rate': 0.00014446529080675423, 'epoch': 2.79}


 28%|██▊       | 746/2670 [18:08<45:09,  1.41s/it]

{'loss': 0.0974, 'grad_norm': 0.801398515701294, 'learning_rate': 0.00014439024390243904, 'epoch': 2.79}


 28%|██▊       | 747/2670 [18:09<47:06,  1.47s/it]

{'loss': 0.1985, 'grad_norm': 0.8750125765800476, 'learning_rate': 0.00014431519699812384, 'epoch': 2.8}


 28%|██▊       | 748/2670 [18:11<47:03,  1.47s/it]

{'loss': 0.2384, 'grad_norm': 0.8425114750862122, 'learning_rate': 0.00014424015009380865, 'epoch': 2.8}


 28%|██▊       | 749/2670 [18:12<46:35,  1.45s/it]

{'loss': 0.1211, 'grad_norm': 0.5979477167129517, 'learning_rate': 0.00014416510318949344, 'epoch': 2.81}


 28%|██▊       | 750/2670 [18:14<46:42,  1.46s/it]

{'loss': 0.1809, 'grad_norm': 0.7964854836463928, 'learning_rate': 0.00014409005628517824, 'epoch': 2.81}


 28%|██▊       | 751/2670 [18:15<45:13,  1.41s/it]

{'loss': 0.1492, 'grad_norm': 0.8500566482543945, 'learning_rate': 0.00014401500938086305, 'epoch': 2.81}


 28%|██▊       | 752/2670 [18:16<43:56,  1.37s/it]

{'loss': 0.2286, 'grad_norm': 1.007235050201416, 'learning_rate': 0.00014393996247654783, 'epoch': 2.82}


 28%|██▊       | 753/2670 [18:18<44:12,  1.38s/it]

{'loss': 0.0794, 'grad_norm': 0.6097642779350281, 'learning_rate': 0.00014386491557223264, 'epoch': 2.82}


 28%|██▊       | 754/2670 [18:19<46:12,  1.45s/it]

{'loss': 0.1704, 'grad_norm': 0.949988603591919, 'learning_rate': 0.00014378986866791745, 'epoch': 2.82}


 28%|██▊       | 755/2670 [18:21<50:44,  1.59s/it]

{'loss': 0.2458, 'grad_norm': 0.7512903213500977, 'learning_rate': 0.00014371482176360226, 'epoch': 2.83}


 28%|██▊       | 756/2670 [18:23<50:13,  1.57s/it]

{'loss': 0.1548, 'grad_norm': 0.6190857291221619, 'learning_rate': 0.00014363977485928707, 'epoch': 2.83}


 28%|██▊       | 757/2670 [18:24<49:42,  1.56s/it]

{'loss': 0.2278, 'grad_norm': 0.8554903864860535, 'learning_rate': 0.00014356472795497188, 'epoch': 2.84}


 28%|██▊       | 758/2670 [18:25<46:57,  1.47s/it]

{'loss': 0.0985, 'grad_norm': 0.4571702480316162, 'learning_rate': 0.0001434896810506567, 'epoch': 2.84}


 28%|██▊       | 759/2670 [18:27<48:24,  1.52s/it]

{'loss': 0.3094, 'grad_norm': 0.8413286209106445, 'learning_rate': 0.00014341463414634147, 'epoch': 2.84}


 28%|██▊       | 760/2670 [18:29<48:13,  1.52s/it]

{'loss': 0.264, 'grad_norm': 0.8977476954460144, 'learning_rate': 0.00014333958724202628, 'epoch': 2.85}


 29%|██▊       | 761/2670 [18:30<51:15,  1.61s/it]

{'loss': 0.0564, 'grad_norm': 0.3487406373023987, 'learning_rate': 0.0001432645403377111, 'epoch': 2.85}


 29%|██▊       | 762/2670 [18:32<49:53,  1.57s/it]

{'loss': 0.1586, 'grad_norm': 0.6535152196884155, 'learning_rate': 0.00014318949343339587, 'epoch': 2.85}


 29%|██▊       | 763/2670 [18:33<47:46,  1.50s/it]

{'loss': 0.1324, 'grad_norm': 0.7402995824813843, 'learning_rate': 0.00014311444652908068, 'epoch': 2.86}


 29%|██▊       | 764/2670 [18:35<49:03,  1.54s/it]

{'loss': 0.2343, 'grad_norm': 0.8602921366691589, 'learning_rate': 0.0001430393996247655, 'epoch': 2.86}


 29%|██▊       | 765/2670 [18:36<49:50,  1.57s/it]

{'loss': 0.1905, 'grad_norm': 0.8315479159355164, 'learning_rate': 0.00014296435272045027, 'epoch': 2.87}


 29%|██▊       | 766/2670 [18:38<49:50,  1.57s/it]

{'loss': 0.1041, 'grad_norm': 0.7744430303573608, 'learning_rate': 0.00014288930581613508, 'epoch': 2.87}


 29%|██▊       | 767/2670 [18:39<47:11,  1.49s/it]

{'loss': 0.0982, 'grad_norm': 0.6603910326957703, 'learning_rate': 0.00014281425891181989, 'epoch': 2.87}


 29%|██▉       | 768/2670 [18:41<45:12,  1.43s/it]

{'loss': 0.1113, 'grad_norm': 0.5564302206039429, 'learning_rate': 0.0001427392120075047, 'epoch': 2.88}


 29%|██▉       | 769/2670 [18:42<45:15,  1.43s/it]

{'loss': 0.2187, 'grad_norm': 0.7431875467300415, 'learning_rate': 0.0001426641651031895, 'epoch': 2.88}


 29%|██▉       | 770/2670 [18:43<44:08,  1.39s/it]

{'loss': 0.1358, 'grad_norm': 0.7484090924263, 'learning_rate': 0.0001425891181988743, 'epoch': 2.88}


 29%|██▉       | 771/2670 [18:45<47:12,  1.49s/it]

{'loss': 0.1758, 'grad_norm': 0.6105278730392456, 'learning_rate': 0.00014251407129455912, 'epoch': 2.89}


 29%|██▉       | 772/2670 [18:46<45:56,  1.45s/it]

{'loss': 0.143, 'grad_norm': 0.7747563123703003, 'learning_rate': 0.0001424390243902439, 'epoch': 2.89}


 29%|██▉       | 773/2670 [18:48<46:17,  1.46s/it]

{'loss': 0.1011, 'grad_norm': 0.6962183117866516, 'learning_rate': 0.0001423639774859287, 'epoch': 2.9}


 29%|██▉       | 774/2670 [18:49<44:15,  1.40s/it]

{'loss': 0.1227, 'grad_norm': 0.6726738214492798, 'learning_rate': 0.00014228893058161352, 'epoch': 2.9}


 29%|██▉       | 775/2670 [18:51<44:10,  1.40s/it]

{'loss': 0.0994, 'grad_norm': 0.5555264353752136, 'learning_rate': 0.0001422138836772983, 'epoch': 2.9}


 29%|██▉       | 776/2670 [18:52<45:12,  1.43s/it]

{'loss': 0.0664, 'grad_norm': 0.43454328179359436, 'learning_rate': 0.0001421388367729831, 'epoch': 2.91}


 29%|██▉       | 777/2670 [18:53<43:26,  1.38s/it]

{'loss': 0.1167, 'grad_norm': 0.6756535768508911, 'learning_rate': 0.00014206378986866792, 'epoch': 2.91}


 29%|██▉       | 778/2670 [18:55<43:48,  1.39s/it]

{'loss': 0.1255, 'grad_norm': 0.6928317546844482, 'learning_rate': 0.00014198874296435273, 'epoch': 2.91}


 29%|██▉       | 779/2670 [18:56<41:53,  1.33s/it]

{'loss': 0.0825, 'grad_norm': 0.6019057631492615, 'learning_rate': 0.00014191369606003754, 'epoch': 2.92}


 29%|██▉       | 780/2670 [18:57<40:31,  1.29s/it]

{'loss': 0.1856, 'grad_norm': 0.8555563688278198, 'learning_rate': 0.00014183864915572235, 'epoch': 2.92}


 29%|██▉       | 781/2670 [18:59<44:40,  1.42s/it]

{'loss': 0.2419, 'grad_norm': 0.8856773972511292, 'learning_rate': 0.00014176360225140716, 'epoch': 2.93}


 29%|██▉       | 782/2670 [19:01<46:35,  1.48s/it]

{'loss': 0.1872, 'grad_norm': 0.8485321998596191, 'learning_rate': 0.00014168855534709194, 'epoch': 2.93}


 29%|██▉       | 783/2670 [19:02<45:55,  1.46s/it]

{'loss': 0.1917, 'grad_norm': 0.8838168978691101, 'learning_rate': 0.00014161350844277675, 'epoch': 2.93}


 29%|██▉       | 784/2670 [19:03<45:58,  1.46s/it]

{'loss': 0.18, 'grad_norm': 0.7845128178596497, 'learning_rate': 0.00014153846153846156, 'epoch': 2.94}


 29%|██▉       | 785/2670 [19:05<47:57,  1.53s/it]

{'loss': 0.2036, 'grad_norm': 0.9661158919334412, 'learning_rate': 0.00014146341463414634, 'epoch': 2.94}


 29%|██▉       | 786/2670 [19:07<51:07,  1.63s/it]

{'loss': 0.2252, 'grad_norm': 0.6296307444572449, 'learning_rate': 0.00014138836772983115, 'epoch': 2.94}


 29%|██▉       | 787/2670 [19:09<52:10,  1.66s/it]

{'loss': 0.1298, 'grad_norm': 0.6167003512382507, 'learning_rate': 0.00014131332082551595, 'epoch': 2.95}


 30%|██▉       | 788/2670 [19:10<51:33,  1.64s/it]

{'loss': 0.087, 'grad_norm': 0.5709974765777588, 'learning_rate': 0.00014123827392120074, 'epoch': 2.95}


 30%|██▉       | 789/2670 [19:12<47:46,  1.52s/it]

{'loss': 0.0997, 'grad_norm': 0.5726589560508728, 'learning_rate': 0.00014116322701688555, 'epoch': 2.96}


 30%|██▉       | 790/2670 [19:13<46:47,  1.49s/it]

{'loss': 0.1584, 'grad_norm': 0.6560164093971252, 'learning_rate': 0.00014108818011257035, 'epoch': 2.96}


 30%|██▉       | 791/2670 [19:15<49:01,  1.57s/it]

{'loss': 0.1466, 'grad_norm': 0.6565191149711609, 'learning_rate': 0.0001410131332082552, 'epoch': 2.96}


 30%|██▉       | 792/2670 [19:16<45:44,  1.46s/it]

{'loss': 0.1863, 'grad_norm': 0.8500398993492126, 'learning_rate': 0.00014093808630393997, 'epoch': 2.97}


 30%|██▉       | 793/2670 [19:18<47:03,  1.50s/it]

{'loss': 0.1756, 'grad_norm': 0.674309253692627, 'learning_rate': 0.00014086303939962478, 'epoch': 2.97}


 30%|██▉       | 794/2670 [19:19<47:05,  1.51s/it]

{'loss': 0.1073, 'grad_norm': 0.7419252395629883, 'learning_rate': 0.0001407879924953096, 'epoch': 2.97}


 30%|██▉       | 795/2670 [19:20<46:39,  1.49s/it]

{'loss': 0.206, 'grad_norm': 0.5593941807746887, 'learning_rate': 0.00014071294559099437, 'epoch': 2.98}


 30%|██▉       | 796/2670 [19:22<45:21,  1.45s/it]

{'loss': 0.1549, 'grad_norm': 0.7717687487602234, 'learning_rate': 0.00014063789868667918, 'epoch': 2.98}


 30%|██▉       | 797/2670 [19:23<41:20,  1.32s/it]

{'loss': 0.1366, 'grad_norm': 0.9084485173225403, 'learning_rate': 0.000140562851782364, 'epoch': 2.99}


 30%|██▉       | 798/2670 [19:24<44:09,  1.42s/it]

{'loss': 0.197, 'grad_norm': 0.8109543919563293, 'learning_rate': 0.00014048780487804877, 'epoch': 2.99}


 30%|██▉       | 799/2670 [19:26<44:56,  1.44s/it]

{'loss': 0.1544, 'grad_norm': 0.6773528456687927, 'learning_rate': 0.00014041275797373358, 'epoch': 2.99}


 30%|██▉       | 800/2670 [19:27<44:29,  1.43s/it]

{'loss': 0.2045, 'grad_norm': 0.6870987415313721, 'learning_rate': 0.0001403377110694184, 'epoch': 3.0}


 30%|███       | 801/2670 [19:29<42:44,  1.37s/it]

{'loss': 0.3024, 'grad_norm': 0.9904983639717102, 'learning_rate': 0.0001402626641651032, 'epoch': 3.0}


 30%|███       | 802/2670 [19:30<44:38,  1.43s/it]

{'loss': 0.059, 'grad_norm': 0.4155069887638092, 'learning_rate': 0.000140187617260788, 'epoch': 3.0}


 30%|███       | 803/2670 [19:32<43:45,  1.41s/it]

{'loss': 0.0767, 'grad_norm': 0.48734256625175476, 'learning_rate': 0.00014011257035647282, 'epoch': 3.01}


 30%|███       | 804/2670 [19:33<43:14,  1.39s/it]

{'loss': 0.1042, 'grad_norm': 0.800247848033905, 'learning_rate': 0.00014003752345215762, 'epoch': 3.01}


 30%|███       | 805/2670 [19:34<41:24,  1.33s/it]

{'loss': 0.0575, 'grad_norm': 0.611259400844574, 'learning_rate': 0.0001399624765478424, 'epoch': 3.01}


 30%|███       | 806/2670 [19:36<43:55,  1.41s/it]

{'loss': 0.0734, 'grad_norm': 0.40823912620544434, 'learning_rate': 0.00013988742964352721, 'epoch': 3.02}


 30%|███       | 807/2670 [19:37<42:17,  1.36s/it]

{'loss': 0.0321, 'grad_norm': 0.4088846743106842, 'learning_rate': 0.00013981238273921202, 'epoch': 3.02}


 30%|███       | 808/2670 [19:38<43:19,  1.40s/it]

{'loss': 0.071, 'grad_norm': 0.40053901076316833, 'learning_rate': 0.0001397373358348968, 'epoch': 3.03}


 30%|███       | 809/2670 [19:40<42:55,  1.38s/it]

{'loss': 0.0448, 'grad_norm': 0.34133073687553406, 'learning_rate': 0.00013966228893058161, 'epoch': 3.03}


 30%|███       | 810/2670 [19:41<43:36,  1.41s/it]

{'loss': 0.091, 'grad_norm': 0.7347246408462524, 'learning_rate': 0.00013958724202626642, 'epoch': 3.03}


 30%|███       | 811/2670 [19:43<44:53,  1.45s/it]

{'loss': 0.0714, 'grad_norm': 1.3861807584762573, 'learning_rate': 0.0001395121951219512, 'epoch': 3.04}


 30%|███       | 812/2670 [19:44<45:45,  1.48s/it]

{'loss': 0.1024, 'grad_norm': 0.6668906211853027, 'learning_rate': 0.000139437148217636, 'epoch': 3.04}


 30%|███       | 813/2670 [19:46<46:10,  1.49s/it]

{'loss': 0.0884, 'grad_norm': 0.5966193675994873, 'learning_rate': 0.00013936210131332085, 'epoch': 3.04}


 30%|███       | 814/2670 [19:47<42:42,  1.38s/it]

{'loss': 0.0342, 'grad_norm': 0.3771149218082428, 'learning_rate': 0.00013928705440900566, 'epoch': 3.05}


 31%|███       | 815/2670 [19:48<43:55,  1.42s/it]

{'loss': 0.0792, 'grad_norm': 0.7188996076583862, 'learning_rate': 0.00013921200750469044, 'epoch': 3.05}


 31%|███       | 816/2670 [19:50<48:12,  1.56s/it]

{'loss': 0.0771, 'grad_norm': 0.49409669637680054, 'learning_rate': 0.00013913696060037525, 'epoch': 3.06}


 31%|███       | 817/2670 [19:52<46:51,  1.52s/it]

{'loss': 0.0989, 'grad_norm': 1.2524374723434448, 'learning_rate': 0.00013906191369606006, 'epoch': 3.06}


 31%|███       | 818/2670 [19:53<47:06,  1.53s/it]

{'loss': 0.0884, 'grad_norm': 0.6487843990325928, 'learning_rate': 0.00013898686679174484, 'epoch': 3.06}


 31%|███       | 819/2670 [19:55<47:22,  1.54s/it]

{'loss': 0.0921, 'grad_norm': 0.6622267365455627, 'learning_rate': 0.00013891181988742965, 'epoch': 3.07}


 31%|███       | 820/2670 [19:57<50:06,  1.63s/it]

{'loss': 0.0464, 'grad_norm': 0.4821510910987854, 'learning_rate': 0.00013883677298311446, 'epoch': 3.07}


 31%|███       | 821/2670 [19:58<48:59,  1.59s/it]

{'loss': 0.0972, 'grad_norm': 0.7335585355758667, 'learning_rate': 0.00013876172607879924, 'epoch': 3.07}


 31%|███       | 822/2670 [20:00<47:45,  1.55s/it]

{'loss': 0.09, 'grad_norm': 0.7038542628288269, 'learning_rate': 0.00013868667917448405, 'epoch': 3.08}


 31%|███       | 823/2670 [20:01<49:14,  1.60s/it]

{'loss': 0.0516, 'grad_norm': 0.4345545470714569, 'learning_rate': 0.00013861163227016886, 'epoch': 3.08}


 31%|███       | 824/2670 [20:03<47:19,  1.54s/it]

{'loss': 0.0689, 'grad_norm': 0.5564051270484924, 'learning_rate': 0.00013853658536585367, 'epoch': 3.09}


 31%|███       | 825/2670 [20:04<47:18,  1.54s/it]

{'loss': 0.0495, 'grad_norm': 0.4747839570045471, 'learning_rate': 0.00013846153846153847, 'epoch': 3.09}


 31%|███       | 826/2670 [20:06<48:56,  1.59s/it]

{'loss': 0.131, 'grad_norm': 0.7256742119789124, 'learning_rate': 0.00013838649155722328, 'epoch': 3.09}


 31%|███       | 827/2670 [20:08<47:34,  1.55s/it]

{'loss': 0.069, 'grad_norm': 0.6228281259536743, 'learning_rate': 0.0001383114446529081, 'epoch': 3.1}


 31%|███       | 828/2670 [20:09<46:03,  1.50s/it]

{'loss': 0.1265, 'grad_norm': 0.6510199308395386, 'learning_rate': 0.00013823639774859287, 'epoch': 3.1}


 31%|███       | 829/2670 [20:11<49:20,  1.61s/it]

{'loss': 0.0546, 'grad_norm': 0.47152605652809143, 'learning_rate': 0.00013816135084427768, 'epoch': 3.1}


 31%|███       | 830/2670 [20:12<48:34,  1.58s/it]

{'loss': 0.16, 'grad_norm': 1.7374656200408936, 'learning_rate': 0.0001380863039399625, 'epoch': 3.11}


 31%|███       | 831/2670 [20:14<45:46,  1.49s/it]

{'loss': 0.1032, 'grad_norm': 0.7994008660316467, 'learning_rate': 0.00013801125703564727, 'epoch': 3.11}


 31%|███       | 832/2670 [20:15<45:25,  1.48s/it]

{'loss': 0.0362, 'grad_norm': 0.47039109468460083, 'learning_rate': 0.00013793621013133208, 'epoch': 3.12}


 31%|███       | 833/2670 [20:17<46:02,  1.50s/it]

{'loss': 0.1361, 'grad_norm': 0.6906189322471619, 'learning_rate': 0.0001378611632270169, 'epoch': 3.12}


 31%|███       | 834/2670 [20:18<47:33,  1.55s/it]

{'loss': 0.0648, 'grad_norm': 0.46178677678108215, 'learning_rate': 0.00013778611632270167, 'epoch': 3.12}


 31%|███▏      | 835/2670 [20:20<50:20,  1.65s/it]

{'loss': 0.0619, 'grad_norm': 0.4846166968345642, 'learning_rate': 0.0001377110694183865, 'epoch': 3.13}


 31%|███▏      | 836/2670 [20:22<50:02,  1.64s/it]

{'loss': 0.0714, 'grad_norm': 0.5446690320968628, 'learning_rate': 0.00013763602251407132, 'epoch': 3.13}


 31%|███▏      | 837/2670 [20:23<48:10,  1.58s/it]

{'loss': 0.1041, 'grad_norm': 0.6879042983055115, 'learning_rate': 0.0001375609756097561, 'epoch': 3.13}


 31%|███▏      | 838/2670 [20:25<50:35,  1.66s/it]

{'loss': 0.063, 'grad_norm': 0.5314264893531799, 'learning_rate': 0.0001374859287054409, 'epoch': 3.14}


 31%|███▏      | 839/2670 [20:27<50:13,  1.65s/it]

{'loss': 0.06, 'grad_norm': 0.4776507019996643, 'learning_rate': 0.00013741088180112572, 'epoch': 3.14}


 31%|███▏      | 840/2670 [20:28<46:47,  1.53s/it]

{'loss': 0.0833, 'grad_norm': 0.9159560203552246, 'learning_rate': 0.00013733583489681053, 'epoch': 3.15}


 31%|███▏      | 841/2670 [20:29<45:37,  1.50s/it]

{'loss': 0.1026, 'grad_norm': 0.5308805704116821, 'learning_rate': 0.0001372607879924953, 'epoch': 3.15}


 32%|███▏      | 842/2670 [20:31<45:57,  1.51s/it]

{'loss': 0.0384, 'grad_norm': 0.39426156878471375, 'learning_rate': 0.00013718574108818012, 'epoch': 3.15}


 32%|███▏      | 843/2670 [20:32<47:13,  1.55s/it]

{'loss': 0.0739, 'grad_norm': 0.688544511795044, 'learning_rate': 0.00013711069418386493, 'epoch': 3.16}


 32%|███▏      | 844/2670 [20:34<47:05,  1.55s/it]

{'loss': 0.0855, 'grad_norm': 0.6515462398529053, 'learning_rate': 0.0001370356472795497, 'epoch': 3.16}


 32%|███▏      | 845/2670 [20:35<45:56,  1.51s/it]

{'loss': 0.0488, 'grad_norm': 0.5726118683815002, 'learning_rate': 0.00013696060037523452, 'epoch': 3.16}


 32%|███▏      | 846/2670 [20:37<46:54,  1.54s/it]

{'loss': 0.0967, 'grad_norm': 0.9747046828269958, 'learning_rate': 0.00013688555347091932, 'epoch': 3.17}


 32%|███▏      | 847/2670 [20:39<46:38,  1.54s/it]

{'loss': 0.0673, 'grad_norm': 0.44492319226264954, 'learning_rate': 0.00013681050656660413, 'epoch': 3.17}


 32%|███▏      | 848/2670 [20:40<45:32,  1.50s/it]

{'loss': 0.0465, 'grad_norm': 0.3954547941684723, 'learning_rate': 0.00013673545966228894, 'epoch': 3.18}


 32%|███▏      | 849/2670 [20:41<44:05,  1.45s/it]

{'loss': 0.0623, 'grad_norm': 0.4766857326030731, 'learning_rate': 0.00013666041275797375, 'epoch': 3.18}


 32%|███▏      | 850/2670 [20:43<42:26,  1.40s/it]

{'loss': 0.0765, 'grad_norm': 0.5722558498382568, 'learning_rate': 0.00013658536585365856, 'epoch': 3.18}


 32%|███▏      | 851/2670 [20:44<45:08,  1.49s/it]

{'loss': 0.0601, 'grad_norm': 0.49004223942756653, 'learning_rate': 0.00013651031894934334, 'epoch': 3.19}


 32%|███▏      | 852/2670 [20:46<44:19,  1.46s/it]

{'loss': 0.0795, 'grad_norm': 0.7049011588096619, 'learning_rate': 0.00013643527204502815, 'epoch': 3.19}


 32%|███▏      | 853/2670 [20:47<43:09,  1.43s/it]

{'loss': 0.0601, 'grad_norm': 0.48119211196899414, 'learning_rate': 0.00013636022514071296, 'epoch': 3.19}


 32%|███▏      | 854/2670 [20:48<43:09,  1.43s/it]

{'loss': 0.0372, 'grad_norm': 0.3698757588863373, 'learning_rate': 0.00013628517823639774, 'epoch': 3.2}


 32%|███▏      | 855/2670 [20:50<43:44,  1.45s/it]

{'loss': 0.0229, 'grad_norm': 0.45548126101493835, 'learning_rate': 0.00013621013133208255, 'epoch': 3.2}


 32%|███▏      | 856/2670 [20:51<43:16,  1.43s/it]

{'loss': 0.0956, 'grad_norm': 0.6174457669258118, 'learning_rate': 0.00013613508442776736, 'epoch': 3.21}


 32%|███▏      | 857/2670 [20:53<47:37,  1.58s/it]

{'loss': 0.1431, 'grad_norm': 0.49171650409698486, 'learning_rate': 0.00013606003752345217, 'epoch': 3.21}


 32%|███▏      | 858/2670 [20:55<48:00,  1.59s/it]

{'loss': 0.0602, 'grad_norm': 0.5110611915588379, 'learning_rate': 0.00013598499061913698, 'epoch': 3.21}


 32%|███▏      | 859/2670 [20:56<46:54,  1.55s/it]

{'loss': 0.0485, 'grad_norm': 0.519769012928009, 'learning_rate': 0.00013590994371482179, 'epoch': 3.22}


 32%|███▏      | 860/2670 [20:58<44:47,  1.48s/it]

{'loss': 0.0583, 'grad_norm': 0.5092200636863708, 'learning_rate': 0.00013583489681050657, 'epoch': 3.22}


 32%|███▏      | 861/2670 [20:59<42:36,  1.41s/it]

{'loss': 0.1041, 'grad_norm': 0.5604702234268188, 'learning_rate': 0.00013575984990619138, 'epoch': 3.22}


 32%|███▏      | 862/2670 [21:00<43:41,  1.45s/it]

{'loss': 0.0489, 'grad_norm': 0.44783955812454224, 'learning_rate': 0.00013568480300187619, 'epoch': 3.23}


 32%|███▏      | 863/2670 [21:02<43:27,  1.44s/it]

{'loss': 0.0733, 'grad_norm': 0.5708857774734497, 'learning_rate': 0.000135609756097561, 'epoch': 3.23}


 32%|███▏      | 864/2670 [21:03<41:54,  1.39s/it]

{'loss': 0.0919, 'grad_norm': 0.7533451318740845, 'learning_rate': 0.00013553470919324578, 'epoch': 3.24}


 32%|███▏      | 865/2670 [21:05<43:08,  1.43s/it]

{'loss': 0.0682, 'grad_norm': 0.5310498476028442, 'learning_rate': 0.00013545966228893058, 'epoch': 3.24}


 32%|███▏      | 866/2670 [21:06<44:45,  1.49s/it]

{'loss': 0.0886, 'grad_norm': 0.5194545388221741, 'learning_rate': 0.0001353846153846154, 'epoch': 3.24}


 32%|███▏      | 867/2670 [21:08<42:38,  1.42s/it]

{'loss': 0.0462, 'grad_norm': 0.42100438475608826, 'learning_rate': 0.00013530956848030018, 'epoch': 3.25}


 33%|███▎      | 868/2670 [21:09<43:36,  1.45s/it]

{'loss': 0.085, 'grad_norm': 0.7831487655639648, 'learning_rate': 0.00013523452157598498, 'epoch': 3.25}


 33%|███▎      | 869/2670 [21:11<44:19,  1.48s/it]

{'loss': 0.0884, 'grad_norm': 0.658178985118866, 'learning_rate': 0.0001351594746716698, 'epoch': 3.25}


 33%|███▎      | 870/2670 [21:12<44:01,  1.47s/it]

{'loss': 0.0781, 'grad_norm': 0.5887109637260437, 'learning_rate': 0.0001350844277673546, 'epoch': 3.26}


 33%|███▎      | 871/2670 [21:13<43:06,  1.44s/it]

{'loss': 0.1316, 'grad_norm': 0.713162899017334, 'learning_rate': 0.0001350093808630394, 'epoch': 3.26}


 33%|███▎      | 872/2670 [21:15<41:35,  1.39s/it]

{'loss': 0.0654, 'grad_norm': 0.64961838722229, 'learning_rate': 0.00013493433395872422, 'epoch': 3.27}


 33%|███▎      | 873/2670 [21:16<41:01,  1.37s/it]

{'loss': 0.0647, 'grad_norm': 0.5987582206726074, 'learning_rate': 0.00013485928705440903, 'epoch': 3.27}


 33%|███▎      | 874/2670 [21:18<42:06,  1.41s/it]

{'loss': 0.0645, 'grad_norm': 0.5250503420829773, 'learning_rate': 0.0001347842401500938, 'epoch': 3.27}


 33%|███▎      | 875/2670 [21:19<42:37,  1.42s/it]

{'loss': 0.0476, 'grad_norm': 0.4191366136074066, 'learning_rate': 0.00013470919324577862, 'epoch': 3.28}


 33%|███▎      | 876/2670 [21:20<42:14,  1.41s/it]

{'loss': 0.0562, 'grad_norm': 0.5036846995353699, 'learning_rate': 0.00013463414634146343, 'epoch': 3.28}


 33%|███▎      | 877/2670 [21:22<42:16,  1.41s/it]

{'loss': 0.0353, 'grad_norm': 0.40055206418037415, 'learning_rate': 0.0001345590994371482, 'epoch': 3.28}


 33%|███▎      | 878/2670 [21:23<40:47,  1.37s/it]

{'loss': 0.0862, 'grad_norm': 0.5894114375114441, 'learning_rate': 0.00013448405253283302, 'epoch': 3.29}


 33%|███▎      | 879/2670 [21:24<40:26,  1.36s/it]

{'loss': 0.0996, 'grad_norm': 0.6902434825897217, 'learning_rate': 0.00013440900562851783, 'epoch': 3.29}


 33%|███▎      | 880/2670 [21:26<40:50,  1.37s/it]

{'loss': 0.0729, 'grad_norm': 0.5031479001045227, 'learning_rate': 0.00013433395872420264, 'epoch': 3.3}


 33%|███▎      | 881/2670 [21:27<42:10,  1.41s/it]

{'loss': 0.0583, 'grad_norm': 0.5543463826179504, 'learning_rate': 0.00013425891181988745, 'epoch': 3.3}


 33%|███▎      | 882/2670 [21:29<43:25,  1.46s/it]

{'loss': 0.0394, 'grad_norm': 0.4110580384731293, 'learning_rate': 0.00013418386491557225, 'epoch': 3.3}


 33%|███▎      | 883/2670 [21:30<43:43,  1.47s/it]

{'loss': 0.0341, 'grad_norm': 0.3951626718044281, 'learning_rate': 0.00013410881801125704, 'epoch': 3.31}


 33%|███▎      | 884/2670 [21:32<45:16,  1.52s/it]

{'loss': 0.0612, 'grad_norm': 0.36957061290740967, 'learning_rate': 0.00013403377110694184, 'epoch': 3.31}


 33%|███▎      | 885/2670 [21:33<44:15,  1.49s/it]

{'loss': 0.0583, 'grad_norm': 0.5726811289787292, 'learning_rate': 0.00013395872420262665, 'epoch': 3.31}


 33%|███▎      | 886/2670 [21:35<42:46,  1.44s/it]

{'loss': 0.1187, 'grad_norm': 0.900905966758728, 'learning_rate': 0.00013388367729831146, 'epoch': 3.32}


 33%|███▎      | 887/2670 [21:36<44:24,  1.49s/it]

{'loss': 0.0666, 'grad_norm': 0.47350966930389404, 'learning_rate': 0.00013380863039399624, 'epoch': 3.32}


 33%|███▎      | 888/2670 [21:38<43:27,  1.46s/it]

{'loss': 0.0529, 'grad_norm': 0.4265856146812439, 'learning_rate': 0.00013373358348968105, 'epoch': 3.33}


 33%|███▎      | 889/2670 [21:39<43:11,  1.46s/it]

{'loss': 0.0232, 'grad_norm': 0.44971027970314026, 'learning_rate': 0.00013365853658536586, 'epoch': 3.33}


 33%|███▎      | 890/2670 [21:41<44:37,  1.50s/it]

{'loss': 0.0603, 'grad_norm': 0.5382187366485596, 'learning_rate': 0.00013358348968105064, 'epoch': 3.33}


 33%|███▎      | 891/2670 [21:43<46:01,  1.55s/it]

{'loss': 0.0744, 'grad_norm': 0.5613914728164673, 'learning_rate': 0.00013350844277673545, 'epoch': 3.34}


 33%|███▎      | 892/2670 [21:44<47:28,  1.60s/it]

{'loss': 0.1008, 'grad_norm': 0.6323879361152649, 'learning_rate': 0.0001334333958724203, 'epoch': 3.34}


 33%|███▎      | 893/2670 [21:45<43:45,  1.48s/it]

{'loss': 0.0464, 'grad_norm': 0.37584757804870605, 'learning_rate': 0.00013335834896810507, 'epoch': 3.34}


 33%|███▎      | 894/2670 [21:47<41:28,  1.40s/it]

{'loss': 0.0727, 'grad_norm': 0.566994845867157, 'learning_rate': 0.00013328330206378988, 'epoch': 3.35}


 34%|███▎      | 895/2670 [21:48<44:03,  1.49s/it]

{'loss': 0.0787, 'grad_norm': 0.6019702553749084, 'learning_rate': 0.0001332082551594747, 'epoch': 3.35}


 34%|███▎      | 896/2670 [21:50<42:34,  1.44s/it]

{'loss': 0.0721, 'grad_norm': 0.5686037540435791, 'learning_rate': 0.0001331332082551595, 'epoch': 3.36}


 34%|███▎      | 897/2670 [21:51<43:10,  1.46s/it]

{'loss': 0.0494, 'grad_norm': 0.4895561933517456, 'learning_rate': 0.00013305816135084428, 'epoch': 3.36}


 34%|███▎      | 898/2670 [21:53<42:44,  1.45s/it]

{'loss': 0.0736, 'grad_norm': 0.5562372803688049, 'learning_rate': 0.0001329831144465291, 'epoch': 3.36}


 34%|███▎      | 899/2670 [21:54<41:08,  1.39s/it]

{'loss': 0.0782, 'grad_norm': 0.6131134629249573, 'learning_rate': 0.0001329080675422139, 'epoch': 3.37}


 34%|███▎      | 900/2670 [21:56<43:55,  1.49s/it]

{'loss': 0.0715, 'grad_norm': 0.49458226561546326, 'learning_rate': 0.00013283302063789868, 'epoch': 3.37}


 34%|███▎      | 901/2670 [21:57<43:25,  1.47s/it]

{'loss': 0.0691, 'grad_norm': 0.5004873871803284, 'learning_rate': 0.0001327579737335835, 'epoch': 3.37}


 34%|███▍      | 902/2670 [21:59<44:55,  1.52s/it]

{'loss': 0.0888, 'grad_norm': 0.8661516308784485, 'learning_rate': 0.0001326829268292683, 'epoch': 3.38}


 34%|███▍      | 903/2670 [22:00<44:46,  1.52s/it]

{'loss': 0.1267, 'grad_norm': 0.8717193007469177, 'learning_rate': 0.0001326078799249531, 'epoch': 3.38}


 34%|███▍      | 904/2670 [22:01<42:52,  1.46s/it]

{'loss': 0.0686, 'grad_norm': 0.5122511386871338, 'learning_rate': 0.0001325328330206379, 'epoch': 3.39}


 34%|███▍      | 905/2670 [22:03<42:33,  1.45s/it]

{'loss': 0.0528, 'grad_norm': 0.4168863892555237, 'learning_rate': 0.00013245778611632272, 'epoch': 3.39}


 34%|███▍      | 906/2670 [22:04<40:16,  1.37s/it]

{'loss': 0.0772, 'grad_norm': 0.6193875074386597, 'learning_rate': 0.0001323827392120075, 'epoch': 3.39}


 34%|███▍      | 907/2670 [22:06<41:41,  1.42s/it]

{'loss': 0.0531, 'grad_norm': 0.5319533348083496, 'learning_rate': 0.0001323076923076923, 'epoch': 3.4}


 34%|███▍      | 908/2670 [22:07<42:35,  1.45s/it]

{'loss': 0.0661, 'grad_norm': 0.5258303284645081, 'learning_rate': 0.00013223264540337712, 'epoch': 3.4}


 34%|███▍      | 909/2670 [22:08<41:53,  1.43s/it]

{'loss': 0.0195, 'grad_norm': 0.236260324716568, 'learning_rate': 0.00013215759849906193, 'epoch': 3.4}


 34%|███▍      | 910/2670 [22:10<42:47,  1.46s/it]

{'loss': 0.0736, 'grad_norm': 0.5227624177932739, 'learning_rate': 0.0001320825515947467, 'epoch': 3.41}


 34%|███▍      | 911/2670 [22:12<44:03,  1.50s/it]

{'loss': 0.0704, 'grad_norm': 0.5084413290023804, 'learning_rate': 0.00013200750469043152, 'epoch': 3.41}


 34%|███▍      | 912/2670 [22:13<41:53,  1.43s/it]

{'loss': 0.0968, 'grad_norm': 0.969139814376831, 'learning_rate': 0.00013193245778611633, 'epoch': 3.42}


 34%|███▍      | 913/2670 [22:14<42:39,  1.46s/it]

{'loss': 0.0691, 'grad_norm': 0.5874625444412231, 'learning_rate': 0.0001318574108818011, 'epoch': 3.42}


 34%|███▍      | 914/2670 [22:16<43:45,  1.49s/it]

{'loss': 0.057, 'grad_norm': 0.4968523681163788, 'learning_rate': 0.00013178236397748595, 'epoch': 3.42}


 34%|███▍      | 915/2670 [22:17<43:33,  1.49s/it]

{'loss': 0.0843, 'grad_norm': 0.5194451808929443, 'learning_rate': 0.00013170731707317076, 'epoch': 3.43}


 34%|███▍      | 916/2670 [22:19<43:21,  1.48s/it]

{'loss': 0.0764, 'grad_norm': 0.48933595418930054, 'learning_rate': 0.00013163227016885554, 'epoch': 3.43}


 34%|███▍      | 917/2670 [22:20<43:33,  1.49s/it]

{'loss': 0.1336, 'grad_norm': 0.7879453897476196, 'learning_rate': 0.00013155722326454035, 'epoch': 3.43}


 34%|███▍      | 918/2670 [22:22<42:12,  1.45s/it]

{'loss': 0.0285, 'grad_norm': 0.3094415068626404, 'learning_rate': 0.00013148217636022516, 'epoch': 3.44}


 34%|███▍      | 919/2670 [22:23<42:03,  1.44s/it]

{'loss': 0.0724, 'grad_norm': 0.52284175157547, 'learning_rate': 0.00013140712945590996, 'epoch': 3.44}


 34%|███▍      | 920/2670 [22:24<38:52,  1.33s/it]

{'loss': 0.1904, 'grad_norm': 1.7460136413574219, 'learning_rate': 0.00013133208255159475, 'epoch': 3.45}


 34%|███▍      | 921/2670 [22:26<40:41,  1.40s/it]

{'loss': 0.0974, 'grad_norm': 0.6354599595069885, 'learning_rate': 0.00013125703564727956, 'epoch': 3.45}


 35%|███▍      | 922/2670 [22:28<44:02,  1.51s/it]

{'loss': 0.088, 'grad_norm': 0.4925580620765686, 'learning_rate': 0.00013118198874296436, 'epoch': 3.45}


 35%|███▍      | 923/2670 [22:29<44:09,  1.52s/it]

{'loss': 0.0836, 'grad_norm': 0.5283346772193909, 'learning_rate': 0.00013110694183864915, 'epoch': 3.46}


 35%|███▍      | 924/2670 [22:31<42:48,  1.47s/it]

{'loss': 0.0445, 'grad_norm': 0.43251699209213257, 'learning_rate': 0.00013103189493433395, 'epoch': 3.46}


 35%|███▍      | 925/2670 [22:32<41:51,  1.44s/it]

{'loss': 0.0583, 'grad_norm': 0.40529245138168335, 'learning_rate': 0.00013095684803001876, 'epoch': 3.46}


 35%|███▍      | 926/2670 [22:33<42:27,  1.46s/it]

{'loss': 0.0557, 'grad_norm': 0.5164781212806702, 'learning_rate': 0.00013088180112570357, 'epoch': 3.47}


 35%|███▍      | 927/2670 [22:35<40:40,  1.40s/it]

{'loss': 0.0415, 'grad_norm': 0.30598461627960205, 'learning_rate': 0.00013080675422138838, 'epoch': 3.47}


 35%|███▍      | 928/2670 [22:36<39:15,  1.35s/it]

{'loss': 0.0234, 'grad_norm': 0.3563039302825928, 'learning_rate': 0.0001307317073170732, 'epoch': 3.48}


 35%|███▍      | 929/2670 [22:37<39:52,  1.37s/it]

{'loss': 0.0477, 'grad_norm': 0.42842739820480347, 'learning_rate': 0.00013065666041275797, 'epoch': 3.48}


 35%|███▍      | 930/2670 [22:39<38:53,  1.34s/it]

{'loss': 0.0552, 'grad_norm': 0.4514131247997284, 'learning_rate': 0.00013058161350844278, 'epoch': 3.48}


 35%|███▍      | 931/2670 [22:40<37:21,  1.29s/it]

{'loss': 0.0815, 'grad_norm': 0.6240585446357727, 'learning_rate': 0.0001305065666041276, 'epoch': 3.49}


 35%|███▍      | 932/2670 [22:41<39:29,  1.36s/it]

{'loss': 0.0754, 'grad_norm': 0.5058117508888245, 'learning_rate': 0.0001304315196998124, 'epoch': 3.49}


 35%|███▍      | 933/2670 [22:42<37:54,  1.31s/it]

{'loss': 0.0301, 'grad_norm': 0.23914673924446106, 'learning_rate': 0.00013035647279549718, 'epoch': 3.49}


 35%|███▍      | 934/2670 [22:44<37:28,  1.30s/it]

{'loss': 0.0488, 'grad_norm': 0.5139831304550171, 'learning_rate': 0.000130281425891182, 'epoch': 3.5}


 35%|███▌      | 935/2670 [22:45<38:22,  1.33s/it]

{'loss': 0.0743, 'grad_norm': 0.5363637804985046, 'learning_rate': 0.0001302063789868668, 'epoch': 3.5}


 35%|███▌      | 936/2670 [22:47<39:39,  1.37s/it]

{'loss': 0.0746, 'grad_norm': 0.6322453618049622, 'learning_rate': 0.0001301313320825516, 'epoch': 3.51}


 35%|███▌      | 937/2670 [22:48<40:15,  1.39s/it]

{'loss': 0.0449, 'grad_norm': 0.5334080457687378, 'learning_rate': 0.00013005628517823642, 'epoch': 3.51}


 35%|███▌      | 938/2670 [22:50<42:03,  1.46s/it]

{'loss': 0.0228, 'grad_norm': 0.2647143006324768, 'learning_rate': 0.00012998123827392122, 'epoch': 3.51}


 35%|███▌      | 939/2670 [22:51<42:27,  1.47s/it]

{'loss': 0.0742, 'grad_norm': 0.5494904518127441, 'learning_rate': 0.000129906191369606, 'epoch': 3.52}


 35%|███▌      | 940/2670 [22:53<45:20,  1.57s/it]

{'loss': 0.0623, 'grad_norm': 0.5316585302352905, 'learning_rate': 0.00012983114446529082, 'epoch': 3.52}


 35%|███▌      | 941/2670 [22:55<45:02,  1.56s/it]

{'loss': 0.0709, 'grad_norm': 0.5258689522743225, 'learning_rate': 0.00012975609756097562, 'epoch': 3.52}


 35%|███▌      | 942/2670 [22:56<44:09,  1.53s/it]

{'loss': 0.0494, 'grad_norm': 0.32501065731048584, 'learning_rate': 0.00012968105065666043, 'epoch': 3.53}


 35%|███▌      | 943/2670 [22:58<44:17,  1.54s/it]

{'loss': 0.044, 'grad_norm': 0.5923983454704285, 'learning_rate': 0.00012960600375234521, 'epoch': 3.53}


 35%|███▌      | 944/2670 [22:59<42:35,  1.48s/it]

{'loss': 0.0659, 'grad_norm': 0.38708722591400146, 'learning_rate': 0.00012953095684803002, 'epoch': 3.54}


 35%|███▌      | 945/2670 [23:00<41:03,  1.43s/it]

{'loss': 0.0978, 'grad_norm': 0.6530256271362305, 'learning_rate': 0.00012945590994371483, 'epoch': 3.54}


 35%|███▌      | 946/2670 [23:02<41:53,  1.46s/it]

{'loss': 0.0896, 'grad_norm': 0.7147018909454346, 'learning_rate': 0.00012938086303939961, 'epoch': 3.54}


 35%|███▌      | 947/2670 [23:03<42:17,  1.47s/it]

{'loss': 0.1035, 'grad_norm': 0.6006160378456116, 'learning_rate': 0.00012930581613508442, 'epoch': 3.55}


 36%|███▌      | 948/2670 [23:05<42:49,  1.49s/it]

{'loss': 0.0606, 'grad_norm': 0.49960243701934814, 'learning_rate': 0.00012923076923076923, 'epoch': 3.55}


 36%|███▌      | 949/2670 [23:06<41:42,  1.45s/it]

{'loss': 0.0581, 'grad_norm': 0.41444897651672363, 'learning_rate': 0.00012915572232645404, 'epoch': 3.55}


 36%|███▌      | 950/2670 [23:08<41:09,  1.44s/it]

{'loss': 0.059, 'grad_norm': 0.41726967692375183, 'learning_rate': 0.00012908067542213885, 'epoch': 3.56}


 36%|███▌      | 951/2670 [23:09<39:53,  1.39s/it]

{'loss': 0.1178, 'grad_norm': 0.5233067870140076, 'learning_rate': 0.00012900562851782366, 'epoch': 3.56}


 36%|███▌      | 952/2670 [23:10<40:42,  1.42s/it]

{'loss': 0.0521, 'grad_norm': 0.43362587690353394, 'learning_rate': 0.00012893058161350844, 'epoch': 3.57}


 36%|███▌      | 953/2670 [23:12<40:56,  1.43s/it]

{'loss': 0.037, 'grad_norm': 0.3909396529197693, 'learning_rate': 0.00012885553470919325, 'epoch': 3.57}


 36%|███▌      | 954/2670 [23:13<41:28,  1.45s/it]

{'loss': 0.0862, 'grad_norm': 0.6127503514289856, 'learning_rate': 0.00012878048780487806, 'epoch': 3.57}


 36%|███▌      | 955/2670 [23:15<42:03,  1.47s/it]

{'loss': 0.0716, 'grad_norm': 0.5006822943687439, 'learning_rate': 0.00012870544090056287, 'epoch': 3.58}


 36%|███▌      | 956/2670 [23:16<40:51,  1.43s/it]

{'loss': 0.0689, 'grad_norm': 0.5329442620277405, 'learning_rate': 0.00012863039399624765, 'epoch': 3.58}


 36%|███▌      | 957/2670 [23:18<42:21,  1.48s/it]

{'loss': 0.0404, 'grad_norm': 0.36186039447784424, 'learning_rate': 0.00012855534709193246, 'epoch': 3.58}


 36%|███▌      | 958/2670 [23:19<40:11,  1.41s/it]

{'loss': 0.0672, 'grad_norm': 0.6247409582138062, 'learning_rate': 0.00012848030018761727, 'epoch': 3.59}


 36%|███▌      | 959/2670 [23:20<41:11,  1.44s/it]

{'loss': 0.1217, 'grad_norm': 0.56944340467453, 'learning_rate': 0.00012840525328330208, 'epoch': 3.59}


 36%|███▌      | 960/2670 [23:22<42:41,  1.50s/it]

{'loss': 0.0599, 'grad_norm': 0.3532780706882477, 'learning_rate': 0.00012833020637898688, 'epoch': 3.6}


 36%|███▌      | 961/2670 [23:24<42:31,  1.49s/it]

{'loss': 0.1025, 'grad_norm': 0.6476199626922607, 'learning_rate': 0.0001282551594746717, 'epoch': 3.6}


 36%|███▌      | 962/2670 [23:25<39:52,  1.40s/it]

{'loss': 0.0518, 'grad_norm': 0.4336263835430145, 'learning_rate': 0.00012818011257035647, 'epoch': 3.6}


 36%|███▌      | 963/2670 [23:26<38:29,  1.35s/it]

{'loss': 0.083, 'grad_norm': 0.475089430809021, 'learning_rate': 0.00012810506566604128, 'epoch': 3.61}


 36%|███▌      | 964/2670 [23:28<40:12,  1.41s/it]

{'loss': 0.03, 'grad_norm': 0.3225310146808624, 'learning_rate': 0.0001280300187617261, 'epoch': 3.61}


 36%|███▌      | 965/2670 [23:29<41:57,  1.48s/it]

{'loss': 0.0464, 'grad_norm': 0.42568761110305786, 'learning_rate': 0.00012795497185741087, 'epoch': 3.61}


 36%|███▌      | 966/2670 [23:31<42:05,  1.48s/it]

{'loss': 0.0425, 'grad_norm': 0.4518797695636749, 'learning_rate': 0.00012787992495309568, 'epoch': 3.62}


 36%|███▌      | 967/2670 [23:32<41:36,  1.47s/it]

{'loss': 0.067, 'grad_norm': 0.47123387455940247, 'learning_rate': 0.0001278048780487805, 'epoch': 3.62}


 36%|███▋      | 968/2670 [23:33<39:40,  1.40s/it]

{'loss': 0.119, 'grad_norm': 0.9165560007095337, 'learning_rate': 0.0001277298311444653, 'epoch': 3.63}


 36%|███▋      | 969/2670 [23:35<39:52,  1.41s/it]

{'loss': 0.0594, 'grad_norm': 0.6374346613883972, 'learning_rate': 0.00012765478424015008, 'epoch': 3.63}


 36%|███▋      | 970/2670 [23:36<38:09,  1.35s/it]

{'loss': 0.0484, 'grad_norm': 0.7643871903419495, 'learning_rate': 0.0001275797373358349, 'epoch': 3.63}


 36%|███▋      | 971/2670 [23:37<38:48,  1.37s/it]

{'loss': 0.055, 'grad_norm': 0.5246492624282837, 'learning_rate': 0.0001275046904315197, 'epoch': 3.64}


 36%|███▋      | 972/2670 [23:39<37:50,  1.34s/it]

{'loss': 0.1032, 'grad_norm': 0.6773462891578674, 'learning_rate': 0.0001274296435272045, 'epoch': 3.64}


 36%|███▋      | 973/2670 [23:40<36:38,  1.30s/it]

{'loss': 0.0509, 'grad_norm': 0.6288099884986877, 'learning_rate': 0.00012735459662288932, 'epoch': 3.64}


 36%|███▋      | 974/2670 [23:41<37:53,  1.34s/it]

{'loss': 0.0794, 'grad_norm': 0.5327705144882202, 'learning_rate': 0.00012727954971857413, 'epoch': 3.65}


 37%|███▋      | 975/2670 [23:43<37:46,  1.34s/it]

{'loss': 0.0456, 'grad_norm': 0.8707559704780579, 'learning_rate': 0.0001272045028142589, 'epoch': 3.65}


 37%|███▋      | 976/2670 [23:44<39:53,  1.41s/it]

{'loss': 0.056, 'grad_norm': 0.5274943113327026, 'learning_rate': 0.00012712945590994372, 'epoch': 3.66}


 37%|███▋      | 977/2670 [23:46<39:29,  1.40s/it]

{'loss': 0.0792, 'grad_norm': 0.5633854866027832, 'learning_rate': 0.00012705440900562853, 'epoch': 3.66}


 37%|███▋      | 978/2670 [23:47<40:29,  1.44s/it]

{'loss': 0.0843, 'grad_norm': 0.6035155057907104, 'learning_rate': 0.00012697936210131334, 'epoch': 3.66}


 37%|███▋      | 979/2670 [23:48<39:33,  1.40s/it]

{'loss': 0.112, 'grad_norm': 0.6131971478462219, 'learning_rate': 0.00012690431519699812, 'epoch': 3.67}


 37%|███▋      | 980/2670 [23:50<40:13,  1.43s/it]

{'loss': 0.072, 'grad_norm': 0.5247800946235657, 'learning_rate': 0.00012682926829268293, 'epoch': 3.67}


 37%|███▋      | 981/2670 [23:52<41:44,  1.48s/it]

{'loss': 0.0506, 'grad_norm': 0.4244588613510132, 'learning_rate': 0.00012675422138836773, 'epoch': 3.67}


 37%|███▋      | 982/2670 [23:53<41:21,  1.47s/it]

{'loss': 0.0626, 'grad_norm': 0.4651292562484741, 'learning_rate': 0.00012667917448405254, 'epoch': 3.68}


 37%|███▋      | 983/2670 [23:55<42:42,  1.52s/it]

{'loss': 0.0486, 'grad_norm': 0.32957887649536133, 'learning_rate': 0.00012660412757973735, 'epoch': 3.68}


 37%|███▋      | 984/2670 [23:56<42:14,  1.50s/it]

{'loss': 0.088, 'grad_norm': 0.5762293934822083, 'learning_rate': 0.00012652908067542216, 'epoch': 3.69}


 37%|███▋      | 985/2670 [23:58<41:48,  1.49s/it]

{'loss': 0.0463, 'grad_norm': 0.4190669655799866, 'learning_rate': 0.00012645403377110694, 'epoch': 3.69}


 37%|███▋      | 986/2670 [23:59<42:22,  1.51s/it]

{'loss': 0.1034, 'grad_norm': 0.6292109489440918, 'learning_rate': 0.00012637898686679175, 'epoch': 3.69}


 37%|███▋      | 987/2670 [24:00<40:57,  1.46s/it]

{'loss': 0.0693, 'grad_norm': 0.4803236126899719, 'learning_rate': 0.00012630393996247656, 'epoch': 3.7}


 37%|███▋      | 988/2670 [24:02<41:32,  1.48s/it]

{'loss': 0.0391, 'grad_norm': 0.3411017060279846, 'learning_rate': 0.00012622889305816134, 'epoch': 3.7}


 37%|███▋      | 989/2670 [24:03<41:52,  1.49s/it]

{'loss': 0.0824, 'grad_norm': 0.63923180103302, 'learning_rate': 0.00012615384615384615, 'epoch': 3.7}


 37%|███▋      | 990/2670 [24:05<43:01,  1.54s/it]

{'loss': 0.0832, 'grad_norm': 0.5156157612800598, 'learning_rate': 0.00012607879924953096, 'epoch': 3.71}


 37%|███▋      | 991/2670 [24:07<43:30,  1.55s/it]

{'loss': 0.114, 'grad_norm': 0.6477682590484619, 'learning_rate': 0.00012600375234521577, 'epoch': 3.71}


 37%|███▋      | 992/2670 [24:08<44:02,  1.57s/it]

{'loss': 0.092, 'grad_norm': 0.517077624797821, 'learning_rate': 0.00012592870544090055, 'epoch': 3.72}


 37%|███▋      | 993/2670 [24:10<44:31,  1.59s/it]

{'loss': 0.0547, 'grad_norm': 0.4804627299308777, 'learning_rate': 0.00012585365853658536, 'epoch': 3.72}


 37%|███▋      | 994/2670 [24:11<41:36,  1.49s/it]

{'loss': 0.1677, 'grad_norm': 1.7512617111206055, 'learning_rate': 0.0001257786116322702, 'epoch': 3.72}


 37%|███▋      | 995/2670 [24:13<43:32,  1.56s/it]

{'loss': 0.0824, 'grad_norm': 0.5360488295555115, 'learning_rate': 0.00012570356472795498, 'epoch': 3.73}


 37%|███▋      | 996/2670 [24:15<43:50,  1.57s/it]

{'loss': 0.097, 'grad_norm': 0.5743194818496704, 'learning_rate': 0.00012562851782363979, 'epoch': 3.73}


 37%|███▋      | 997/2670 [24:16<45:07,  1.62s/it]

{'loss': 0.1288, 'grad_norm': 0.6789196133613586, 'learning_rate': 0.0001255534709193246, 'epoch': 3.73}


 37%|███▋      | 998/2670 [24:18<45:10,  1.62s/it]

{'loss': 0.0504, 'grad_norm': 0.41005927324295044, 'learning_rate': 0.00012547842401500938, 'epoch': 3.74}


 37%|███▋      | 999/2670 [24:19<39:43,  1.43s/it]

{'loss': 0.0851, 'grad_norm': 0.5950130820274353, 'learning_rate': 0.00012540337711069419, 'epoch': 3.74}


 37%|███▋      | 1000/2670 [24:20<39:27,  1.42s/it]

{'loss': 0.0669, 'grad_norm': 0.4557454288005829, 'learning_rate': 0.000125328330206379, 'epoch': 3.75}


 37%|███▋      | 1001/2670 [24:24<56:18,  2.02s/it]

{'loss': 0.0328, 'grad_norm': 0.43694281578063965, 'learning_rate': 0.0001252532833020638, 'epoch': 3.75}


 38%|███▊      | 1002/2670 [24:25<52:13,  1.88s/it]

{'loss': 0.0852, 'grad_norm': 0.5602312684059143, 'learning_rate': 0.00012517823639774858, 'epoch': 3.75}


 38%|███▊      | 1003/2670 [24:27<47:54,  1.72s/it]

{'loss': 0.0748, 'grad_norm': 0.4569249749183655, 'learning_rate': 0.0001251031894934334, 'epoch': 3.76}


 38%|███▊      | 1004/2670 [24:28<45:03,  1.62s/it]

{'loss': 0.052, 'grad_norm': 0.42766237258911133, 'learning_rate': 0.0001250281425891182, 'epoch': 3.76}


 38%|███▊      | 1005/2670 [24:30<44:16,  1.60s/it]

{'loss': 0.0768, 'grad_norm': 0.5413685441017151, 'learning_rate': 0.000124953095684803, 'epoch': 3.76}


 38%|███▊      | 1006/2670 [24:31<45:11,  1.63s/it]

{'loss': 0.0479, 'grad_norm': 0.633088231086731, 'learning_rate': 0.00012487804878048782, 'epoch': 3.77}


 38%|███▊      | 1007/2670 [24:33<42:39,  1.54s/it]

{'loss': 0.0882, 'grad_norm': 0.8685272932052612, 'learning_rate': 0.00012480300187617263, 'epoch': 3.77}


 38%|███▊      | 1008/2670 [24:34<41:15,  1.49s/it]

{'loss': 0.0695, 'grad_norm': 0.42859840393066406, 'learning_rate': 0.0001247279549718574, 'epoch': 3.78}


 38%|███▊      | 1009/2670 [24:35<39:31,  1.43s/it]

{'loss': 0.0306, 'grad_norm': 0.36201775074005127, 'learning_rate': 0.00012465290806754222, 'epoch': 3.78}


 38%|███▊      | 1010/2670 [24:37<39:24,  1.42s/it]

{'loss': 0.069, 'grad_norm': 0.7061306834220886, 'learning_rate': 0.00012457786116322703, 'epoch': 3.78}


 38%|███▊      | 1011/2670 [24:38<37:16,  1.35s/it]

{'loss': 0.0766, 'grad_norm': 0.7418325543403625, 'learning_rate': 0.0001245028142589118, 'epoch': 3.79}


 38%|███▊      | 1012/2670 [24:39<36:36,  1.32s/it]

{'loss': 0.1124, 'grad_norm': 1.7004916667938232, 'learning_rate': 0.00012442776735459662, 'epoch': 3.79}


 38%|███▊      | 1013/2670 [24:41<38:18,  1.39s/it]

{'loss': 0.1001, 'grad_norm': 0.6174584627151489, 'learning_rate': 0.00012435272045028143, 'epoch': 3.79}


 38%|███▊      | 1014/2670 [24:42<38:12,  1.38s/it]

{'loss': 0.0495, 'grad_norm': 0.3965018391609192, 'learning_rate': 0.00012427767354596624, 'epoch': 3.8}


 38%|███▊      | 1015/2670 [24:44<40:13,  1.46s/it]

{'loss': 0.0521, 'grad_norm': 0.5384741425514221, 'learning_rate': 0.00012420262664165102, 'epoch': 3.8}


 38%|███▊      | 1016/2670 [24:45<39:54,  1.45s/it]

{'loss': 0.0666, 'grad_norm': 0.5515367388725281, 'learning_rate': 0.00012412757973733585, 'epoch': 3.81}


 38%|███▊      | 1017/2670 [24:46<39:24,  1.43s/it]

{'loss': 0.0599, 'grad_norm': 0.7070996761322021, 'learning_rate': 0.00012405253283302066, 'epoch': 3.81}


 38%|███▊      | 1018/2670 [24:48<38:16,  1.39s/it]

{'loss': 0.0585, 'grad_norm': 0.5000608563423157, 'learning_rate': 0.00012397748592870545, 'epoch': 3.81}


 38%|███▊      | 1019/2670 [24:49<38:35,  1.40s/it]

{'loss': 0.0791, 'grad_norm': 0.6005622744560242, 'learning_rate': 0.00012390243902439025, 'epoch': 3.82}


 38%|███▊      | 1020/2670 [24:51<38:11,  1.39s/it]

{'loss': 0.0892, 'grad_norm': 0.5545933246612549, 'learning_rate': 0.00012382739212007506, 'epoch': 3.82}


 38%|███▊      | 1021/2670 [24:52<37:58,  1.38s/it]

{'loss': 0.096, 'grad_norm': 0.732860803604126, 'learning_rate': 0.00012375234521575984, 'epoch': 3.82}


 38%|███▊      | 1022/2670 [24:53<35:02,  1.28s/it]

{'loss': 0.1004, 'grad_norm': 0.7198096513748169, 'learning_rate': 0.00012367729831144465, 'epoch': 3.83}


 38%|███▊      | 1023/2670 [24:54<36:51,  1.34s/it]

{'loss': 0.0756, 'grad_norm': 0.4906739890575409, 'learning_rate': 0.00012360225140712946, 'epoch': 3.83}


 38%|███▊      | 1024/2670 [24:56<39:43,  1.45s/it]

{'loss': 0.0769, 'grad_norm': 0.4868440330028534, 'learning_rate': 0.00012352720450281427, 'epoch': 3.84}


 38%|███▊      | 1025/2670 [24:58<40:12,  1.47s/it]

{'loss': 0.086, 'grad_norm': 0.5656563639640808, 'learning_rate': 0.00012345215759849905, 'epoch': 3.84}


 38%|███▊      | 1026/2670 [24:59<38:36,  1.41s/it]

{'loss': 0.1297, 'grad_norm': 0.7403966188430786, 'learning_rate': 0.00012337711069418386, 'epoch': 3.84}


 38%|███▊      | 1027/2670 [25:00<36:33,  1.33s/it]

{'loss': 0.0511, 'grad_norm': 0.4014694094657898, 'learning_rate': 0.00012330206378986867, 'epoch': 3.85}


 39%|███▊      | 1028/2670 [25:01<37:12,  1.36s/it]

{'loss': 0.1086, 'grad_norm': 0.6226219534873962, 'learning_rate': 0.00012322701688555348, 'epoch': 3.85}


 39%|███▊      | 1029/2670 [25:03<39:41,  1.45s/it]

{'loss': 0.0408, 'grad_norm': 0.3562382757663727, 'learning_rate': 0.0001231519699812383, 'epoch': 3.85}


 39%|███▊      | 1030/2670 [25:05<39:37,  1.45s/it]

{'loss': 0.024, 'grad_norm': 0.2610847055912018, 'learning_rate': 0.0001230769230769231, 'epoch': 3.86}


 39%|███▊      | 1031/2670 [25:06<38:31,  1.41s/it]

{'loss': 0.054, 'grad_norm': 0.4153185486793518, 'learning_rate': 0.00012300187617260788, 'epoch': 3.86}


 39%|███▊      | 1032/2670 [25:08<42:05,  1.54s/it]

{'loss': 0.1375, 'grad_norm': 0.4557631313800812, 'learning_rate': 0.0001229268292682927, 'epoch': 3.87}


 39%|███▊      | 1033/2670 [25:09<43:13,  1.58s/it]

{'loss': 0.0539, 'grad_norm': 0.38486093282699585, 'learning_rate': 0.0001228517823639775, 'epoch': 3.87}


 39%|███▊      | 1034/2670 [25:11<41:50,  1.53s/it]

{'loss': 0.0894, 'grad_norm': 0.5199715495109558, 'learning_rate': 0.00012277673545966228, 'epoch': 3.87}


 39%|███▉      | 1035/2670 [25:12<41:54,  1.54s/it]

{'loss': 0.1247, 'grad_norm': 0.5545440912246704, 'learning_rate': 0.0001227016885553471, 'epoch': 3.88}


 39%|███▉      | 1036/2670 [25:14<41:38,  1.53s/it]

{'loss': 0.0767, 'grad_norm': 0.44278576970100403, 'learning_rate': 0.0001226266416510319, 'epoch': 3.88}


 39%|███▉      | 1037/2670 [25:16<42:34,  1.56s/it]

{'loss': 0.0765, 'grad_norm': 0.4626119136810303, 'learning_rate': 0.0001225515947467167, 'epoch': 3.88}


 39%|███▉      | 1038/2670 [25:17<41:05,  1.51s/it]

{'loss': 0.1041, 'grad_norm': 0.5955580472946167, 'learning_rate': 0.00012247654784240151, 'epoch': 3.89}


 39%|███▉      | 1039/2670 [25:18<41:13,  1.52s/it]

{'loss': 0.0501, 'grad_norm': 0.3940720558166504, 'learning_rate': 0.00012240150093808632, 'epoch': 3.89}


 39%|███▉      | 1040/2670 [25:20<39:43,  1.46s/it]

{'loss': 0.0477, 'grad_norm': 0.34620895981788635, 'learning_rate': 0.00012232645403377113, 'epoch': 3.9}


 39%|███▉      | 1041/2670 [25:21<39:26,  1.45s/it]

{'loss': 0.0531, 'grad_norm': 0.45137304067611694, 'learning_rate': 0.0001222514071294559, 'epoch': 3.9}


 39%|███▉      | 1042/2670 [25:23<39:30,  1.46s/it]

{'loss': 0.0475, 'grad_norm': 0.4917123317718506, 'learning_rate': 0.00012217636022514072, 'epoch': 3.9}


 39%|███▉      | 1043/2670 [25:24<36:55,  1.36s/it]

{'loss': 0.0543, 'grad_norm': 0.46491432189941406, 'learning_rate': 0.00012210131332082553, 'epoch': 3.91}


 39%|███▉      | 1044/2670 [25:25<35:17,  1.30s/it]

{'loss': 0.0412, 'grad_norm': 0.42258015275001526, 'learning_rate': 0.00012202626641651031, 'epoch': 3.91}


 39%|███▉      | 1045/2670 [25:26<36:39,  1.35s/it]

{'loss': 0.0949, 'grad_norm': 0.7326070666313171, 'learning_rate': 0.00012195121951219512, 'epoch': 3.91}


 39%|███▉      | 1046/2670 [25:28<36:07,  1.33s/it]

{'loss': 0.0632, 'grad_norm': 0.5600022673606873, 'learning_rate': 0.00012187617260787993, 'epoch': 3.92}


 39%|███▉      | 1047/2670 [25:29<36:55,  1.37s/it]

{'loss': 0.0456, 'grad_norm': 0.45318570733070374, 'learning_rate': 0.00012180112570356474, 'epoch': 3.92}


 39%|███▉      | 1048/2670 [25:31<37:08,  1.37s/it]

{'loss': 0.0568, 'grad_norm': 0.5260111093521118, 'learning_rate': 0.00012172607879924953, 'epoch': 3.93}


 39%|███▉      | 1049/2670 [25:32<37:30,  1.39s/it]

{'loss': 0.1067, 'grad_norm': 0.7890318036079407, 'learning_rate': 0.00012165103189493434, 'epoch': 3.93}


 39%|███▉      | 1050/2670 [25:33<37:41,  1.40s/it]

{'loss': 0.0572, 'grad_norm': 0.4300878345966339, 'learning_rate': 0.00012157598499061915, 'epoch': 3.93}


 39%|███▉      | 1051/2670 [25:35<38:34,  1.43s/it]

{'loss': 0.068, 'grad_norm': 0.66169273853302, 'learning_rate': 0.00012150093808630393, 'epoch': 3.94}


 39%|███▉      | 1052/2670 [25:36<39:08,  1.45s/it]

{'loss': 0.0482, 'grad_norm': 0.45780128240585327, 'learning_rate': 0.00012142589118198874, 'epoch': 3.94}


 39%|███▉      | 1053/2670 [25:38<38:48,  1.44s/it]

{'loss': 0.0742, 'grad_norm': 0.5381873250007629, 'learning_rate': 0.00012135084427767355, 'epoch': 3.94}


 39%|███▉      | 1054/2670 [25:39<38:48,  1.44s/it]

{'loss': 0.0807, 'grad_norm': 0.7049152255058289, 'learning_rate': 0.00012127579737335835, 'epoch': 3.95}


 40%|███▉      | 1055/2670 [25:41<37:38,  1.40s/it]

{'loss': 0.0488, 'grad_norm': 0.35837820172309875, 'learning_rate': 0.00012120075046904316, 'epoch': 3.95}


 40%|███▉      | 1056/2670 [25:42<37:00,  1.38s/it]

{'loss': 0.082, 'grad_norm': 0.4937233626842499, 'learning_rate': 0.00012112570356472797, 'epoch': 3.96}


 40%|███▉      | 1057/2670 [25:44<40:12,  1.50s/it]

{'loss': 0.062, 'grad_norm': 0.4375626742839813, 'learning_rate': 0.00012105065666041275, 'epoch': 3.96}


 40%|███▉      | 1058/2670 [25:45<38:31,  1.43s/it]

{'loss': 0.0851, 'grad_norm': 0.7623246312141418, 'learning_rate': 0.00012097560975609757, 'epoch': 3.96}


 40%|███▉      | 1059/2670 [25:46<38:00,  1.42s/it]

{'loss': 0.0622, 'grad_norm': 0.5742056965827942, 'learning_rate': 0.00012090056285178238, 'epoch': 3.97}


 40%|███▉      | 1060/2670 [25:48<38:50,  1.45s/it]

{'loss': 0.0659, 'grad_norm': 0.5477712154388428, 'learning_rate': 0.00012082551594746719, 'epoch': 3.97}


 40%|███▉      | 1061/2670 [25:50<40:46,  1.52s/it]

{'loss': 0.053, 'grad_norm': 0.3871777355670929, 'learning_rate': 0.00012075046904315197, 'epoch': 3.97}


 40%|███▉      | 1062/2670 [25:51<41:20,  1.54s/it]

{'loss': 0.0538, 'grad_norm': 0.4360158145427704, 'learning_rate': 0.00012067542213883678, 'epoch': 3.98}


 40%|███▉      | 1063/2670 [25:53<41:16,  1.54s/it]

{'loss': 0.0476, 'grad_norm': 0.42097407579421997, 'learning_rate': 0.00012060037523452159, 'epoch': 3.98}


 40%|███▉      | 1064/2670 [25:54<40:28,  1.51s/it]

{'loss': 0.0731, 'grad_norm': 0.47573432326316833, 'learning_rate': 0.00012052532833020638, 'epoch': 3.99}


 40%|███▉      | 1065/2670 [25:56<40:09,  1.50s/it]

{'loss': 0.0497, 'grad_norm': 0.3512307405471802, 'learning_rate': 0.00012045028142589119, 'epoch': 3.99}


 40%|███▉      | 1066/2670 [25:57<39:22,  1.47s/it]

{'loss': 0.1033, 'grad_norm': 0.6731744408607483, 'learning_rate': 0.000120375234521576, 'epoch': 3.99}


 40%|███▉      | 1067/2670 [25:59<40:28,  1.52s/it]

{'loss': 0.0314, 'grad_norm': 0.2573103904724121, 'learning_rate': 0.00012030018761726078, 'epoch': 4.0}


 40%|████      | 1068/2670 [26:00<38:43,  1.45s/it]

{'loss': 0.0412, 'grad_norm': 0.30312708020210266, 'learning_rate': 0.00012022514071294559, 'epoch': 4.0}


 40%|████      | 1069/2670 [26:02<40:36,  1.52s/it]

{'loss': 0.0384, 'grad_norm': 0.31664758920669556, 'learning_rate': 0.0001201500938086304, 'epoch': 4.0}


 40%|████      | 1070/2670 [26:03<42:56,  1.61s/it]

{'loss': 0.0374, 'grad_norm': 0.4375337064266205, 'learning_rate': 0.00012007504690431521, 'epoch': 4.01}


 40%|████      | 1071/2670 [26:05<41:25,  1.55s/it]

{'loss': 0.0135, 'grad_norm': 0.15403014421463013, 'learning_rate': 0.00012, 'epoch': 4.01}


 40%|████      | 1072/2670 [26:06<40:13,  1.51s/it]

{'loss': 0.0182, 'grad_norm': 0.20638588070869446, 'learning_rate': 0.00011992495309568481, 'epoch': 4.01}


 40%|████      | 1073/2670 [26:08<38:49,  1.46s/it]

{'loss': 0.0714, 'grad_norm': 1.0745230913162231, 'learning_rate': 0.00011984990619136962, 'epoch': 4.02}


 40%|████      | 1074/2670 [26:09<38:35,  1.45s/it]

{'loss': 0.0443, 'grad_norm': 0.56430584192276, 'learning_rate': 0.0001197748592870544, 'epoch': 4.02}


 40%|████      | 1075/2670 [26:11<40:47,  1.53s/it]

{'loss': 0.0251, 'grad_norm': 0.3476446866989136, 'learning_rate': 0.00011969981238273922, 'epoch': 4.03}


 40%|████      | 1076/2670 [26:12<41:16,  1.55s/it]

{'loss': 0.0064, 'grad_norm': 0.17308762669563293, 'learning_rate': 0.00011962476547842403, 'epoch': 4.03}


 40%|████      | 1077/2670 [26:14<40:40,  1.53s/it]

{'loss': 0.025, 'grad_norm': 0.3590766489505768, 'learning_rate': 0.00011954971857410882, 'epoch': 4.03}


 40%|████      | 1078/2670 [26:15<39:34,  1.49s/it]

{'loss': 0.0268, 'grad_norm': 0.38669782876968384, 'learning_rate': 0.00011947467166979362, 'epoch': 4.04}


 40%|████      | 1079/2670 [26:17<38:55,  1.47s/it]

{'loss': 0.0333, 'grad_norm': 0.574977695941925, 'learning_rate': 0.00011939962476547843, 'epoch': 4.04}


 40%|████      | 1080/2670 [26:18<40:13,  1.52s/it]

{'loss': 0.0574, 'grad_norm': 0.47936636209487915, 'learning_rate': 0.00011932457786116323, 'epoch': 4.04}


 40%|████      | 1081/2670 [26:20<39:29,  1.49s/it]

{'loss': 0.0738, 'grad_norm': 0.5955968499183655, 'learning_rate': 0.00011924953095684804, 'epoch': 4.05}


 41%|████      | 1082/2670 [26:21<39:27,  1.49s/it]

{'loss': 0.0446, 'grad_norm': 0.5957215428352356, 'learning_rate': 0.00011917448405253285, 'epoch': 4.05}


 41%|████      | 1083/2670 [26:23<39:53,  1.51s/it]

{'loss': 0.0726, 'grad_norm': 0.727241039276123, 'learning_rate': 0.00011909943714821766, 'epoch': 4.06}


 41%|████      | 1084/2670 [26:24<39:55,  1.51s/it]

{'loss': 0.032, 'grad_norm': 0.42640742659568787, 'learning_rate': 0.00011902439024390244, 'epoch': 4.06}


 41%|████      | 1085/2670 [26:26<39:46,  1.51s/it]

{'loss': 0.05, 'grad_norm': 0.48526957631111145, 'learning_rate': 0.00011894934333958725, 'epoch': 4.06}


 41%|████      | 1086/2670 [26:27<39:12,  1.48s/it]

{'loss': 0.0159, 'grad_norm': 0.24592219293117523, 'learning_rate': 0.00011887429643527205, 'epoch': 4.07}


 41%|████      | 1087/2670 [26:29<40:22,  1.53s/it]

{'loss': 0.0306, 'grad_norm': 0.29236188530921936, 'learning_rate': 0.00011879924953095685, 'epoch': 4.07}


 41%|████      | 1088/2670 [26:30<38:00,  1.44s/it]

{'loss': 0.1721, 'grad_norm': 2.4464938640594482, 'learning_rate': 0.00011872420262664166, 'epoch': 4.07}


 41%|████      | 1089/2670 [26:31<37:17,  1.42s/it]

{'loss': 0.0313, 'grad_norm': 0.27537456154823303, 'learning_rate': 0.00011864915572232647, 'epoch': 4.08}


 41%|████      | 1090/2670 [26:33<37:19,  1.42s/it]

{'loss': 0.0323, 'grad_norm': 0.3106077313423157, 'learning_rate': 0.00011857410881801125, 'epoch': 4.08}


 41%|████      | 1091/2670 [26:34<36:23,  1.38s/it]

{'loss': 0.0322, 'grad_norm': 0.5433679819107056, 'learning_rate': 0.00011849906191369606, 'epoch': 4.09}


 41%|████      | 1092/2670 [26:35<34:01,  1.29s/it]

{'loss': 0.0419, 'grad_norm': 0.7297554612159729, 'learning_rate': 0.00011842401500938087, 'epoch': 4.09}


 41%|████      | 1093/2670 [26:36<33:25,  1.27s/it]

{'loss': 0.0222, 'grad_norm': 0.400828093290329, 'learning_rate': 0.00011834896810506566, 'epoch': 4.09}


 41%|████      | 1094/2670 [26:38<35:17,  1.34s/it]

{'loss': 0.086, 'grad_norm': 0.5261372327804565, 'learning_rate': 0.00011827392120075047, 'epoch': 4.1}


 41%|████      | 1095/2670 [26:39<35:02,  1.34s/it]

{'loss': 0.0261, 'grad_norm': 0.3173755705356598, 'learning_rate': 0.00011819887429643528, 'epoch': 4.1}


 41%|████      | 1096/2670 [26:41<35:20,  1.35s/it]

{'loss': 0.0304, 'grad_norm': 0.28214433789253235, 'learning_rate': 0.00011812382739212009, 'epoch': 4.1}


 41%|████      | 1097/2670 [26:42<36:58,  1.41s/it]

{'loss': 0.0547, 'grad_norm': 0.40417832136154175, 'learning_rate': 0.00011804878048780488, 'epoch': 4.11}


 41%|████      | 1098/2670 [26:43<35:28,  1.35s/it]

{'loss': 0.0566, 'grad_norm': 0.4728490114212036, 'learning_rate': 0.00011797373358348969, 'epoch': 4.11}


 41%|████      | 1099/2670 [26:45<37:32,  1.43s/it]

{'loss': 0.0217, 'grad_norm': 0.3350580930709839, 'learning_rate': 0.0001178986866791745, 'epoch': 4.12}


 41%|████      | 1100/2670 [26:46<36:28,  1.39s/it]

{'loss': 0.0488, 'grad_norm': 0.5643281936645508, 'learning_rate': 0.00011782363977485928, 'epoch': 4.12}


 41%|████      | 1101/2670 [26:47<33:42,  1.29s/it]

{'loss': 0.0769, 'grad_norm': 0.43323439359664917, 'learning_rate': 0.00011774859287054409, 'epoch': 4.12}


 41%|████▏     | 1102/2670 [26:49<35:08,  1.34s/it]

{'loss': 0.041, 'grad_norm': 0.36190855503082275, 'learning_rate': 0.0001176735459662289, 'epoch': 4.13}


 41%|████▏     | 1103/2670 [26:51<37:29,  1.44s/it]

{'loss': 0.0424, 'grad_norm': 0.34557655453681946, 'learning_rate': 0.0001175984990619137, 'epoch': 4.13}


 41%|████▏     | 1104/2670 [26:52<38:00,  1.46s/it]

{'loss': 0.0364, 'grad_norm': 0.4377177357673645, 'learning_rate': 0.0001175234521575985, 'epoch': 4.13}


 41%|████▏     | 1105/2670 [26:53<35:38,  1.37s/it]

{'loss': 0.0395, 'grad_norm': 0.4214009940624237, 'learning_rate': 0.00011744840525328331, 'epoch': 4.14}


 41%|████▏     | 1106/2670 [26:55<37:45,  1.45s/it]

{'loss': 0.0226, 'grad_norm': 0.327541708946228, 'learning_rate': 0.00011737335834896812, 'epoch': 4.14}


 41%|████▏     | 1107/2670 [26:56<38:28,  1.48s/it]

{'loss': 0.0441, 'grad_norm': 0.41916146874427795, 'learning_rate': 0.0001172983114446529, 'epoch': 4.15}


 41%|████▏     | 1108/2670 [26:58<38:54,  1.49s/it]

{'loss': 0.0312, 'grad_norm': 0.30106717348098755, 'learning_rate': 0.00011722326454033771, 'epoch': 4.15}


 42%|████▏     | 1109/2670 [26:59<37:51,  1.46s/it]

{'loss': 0.0413, 'grad_norm': 0.399326890707016, 'learning_rate': 0.00011714821763602252, 'epoch': 4.15}


 42%|████▏     | 1110/2670 [27:01<38:27,  1.48s/it]

{'loss': 0.0505, 'grad_norm': 0.3964962363243103, 'learning_rate': 0.00011707317073170732, 'epoch': 4.16}


 42%|████▏     | 1111/2670 [27:02<37:26,  1.44s/it]

{'loss': 0.0548, 'grad_norm': 0.3990006446838379, 'learning_rate': 0.00011699812382739213, 'epoch': 4.16}


 42%|████▏     | 1112/2670 [27:03<36:26,  1.40s/it]

{'loss': 0.018, 'grad_norm': 0.23662534356117249, 'learning_rate': 0.00011692307692307694, 'epoch': 4.16}


 42%|████▏     | 1113/2670 [27:05<37:33,  1.45s/it]

{'loss': 0.0149, 'grad_norm': 0.22866229712963104, 'learning_rate': 0.00011684803001876172, 'epoch': 4.17}


 42%|████▏     | 1114/2670 [27:06<36:37,  1.41s/it]

{'loss': 0.103, 'grad_norm': 0.9070590734481812, 'learning_rate': 0.00011677298311444653, 'epoch': 4.17}


 42%|████▏     | 1115/2670 [27:08<37:18,  1.44s/it]

{'loss': 0.0387, 'grad_norm': 0.4236549437046051, 'learning_rate': 0.00011669793621013135, 'epoch': 4.18}


 42%|████▏     | 1116/2670 [27:09<37:42,  1.46s/it]

{'loss': 0.0147, 'grad_norm': 0.2987939417362213, 'learning_rate': 0.00011662288930581613, 'epoch': 4.18}


 42%|████▏     | 1117/2670 [27:11<39:24,  1.52s/it]

{'loss': 0.0432, 'grad_norm': 0.4490582346916199, 'learning_rate': 0.00011654784240150094, 'epoch': 4.18}


 42%|████▏     | 1118/2670 [27:13<38:52,  1.50s/it]

{'loss': 0.0506, 'grad_norm': 0.5151351094245911, 'learning_rate': 0.00011647279549718575, 'epoch': 4.19}


 42%|████▏     | 1119/2670 [27:14<39:56,  1.55s/it]

{'loss': 0.0142, 'grad_norm': 0.22055062651634216, 'learning_rate': 0.00011639774859287056, 'epoch': 4.19}


 42%|████▏     | 1120/2670 [27:16<38:59,  1.51s/it]

{'loss': 0.0186, 'grad_norm': 0.3043164014816284, 'learning_rate': 0.00011632270168855535, 'epoch': 4.19}


 42%|████▏     | 1121/2670 [27:17<39:14,  1.52s/it]

{'loss': 0.0516, 'grad_norm': 0.4777275323867798, 'learning_rate': 0.00011624765478424016, 'epoch': 4.2}


 42%|████▏     | 1122/2670 [27:18<37:46,  1.46s/it]

{'loss': 0.0166, 'grad_norm': 0.36349308490753174, 'learning_rate': 0.00011617260787992497, 'epoch': 4.2}


 42%|████▏     | 1123/2670 [27:20<35:15,  1.37s/it]

{'loss': 0.0373, 'grad_norm': 0.6658977270126343, 'learning_rate': 0.00011609756097560975, 'epoch': 4.21}


 42%|████▏     | 1124/2670 [27:21<35:08,  1.36s/it]

{'loss': 0.038, 'grad_norm': 0.27034950256347656, 'learning_rate': 0.00011602251407129456, 'epoch': 4.21}


 42%|████▏     | 1125/2670 [27:22<34:21,  1.33s/it]

{'loss': 0.0609, 'grad_norm': 0.41760653257369995, 'learning_rate': 0.00011594746716697937, 'epoch': 4.21}


 42%|████▏     | 1126/2670 [27:24<36:28,  1.42s/it]

{'loss': 0.0534, 'grad_norm': 0.6172930598258972, 'learning_rate': 0.00011587242026266416, 'epoch': 4.22}


 42%|████▏     | 1127/2670 [27:26<38:51,  1.51s/it]

{'loss': 0.0273, 'grad_norm': 0.43579724431037903, 'learning_rate': 0.00011579737335834897, 'epoch': 4.22}


 42%|████▏     | 1128/2670 [27:27<39:43,  1.55s/it]

{'loss': 0.0347, 'grad_norm': 0.3037914037704468, 'learning_rate': 0.00011572232645403378, 'epoch': 4.22}


 42%|████▏     | 1129/2670 [27:29<39:47,  1.55s/it]

{'loss': 0.0102, 'grad_norm': 0.1094219759106636, 'learning_rate': 0.00011564727954971859, 'epoch': 4.23}


 42%|████▏     | 1130/2670 [27:30<39:30,  1.54s/it]

{'loss': 0.0482, 'grad_norm': 0.4110032320022583, 'learning_rate': 0.00011557223264540337, 'epoch': 4.23}


 42%|████▏     | 1131/2670 [27:32<37:36,  1.47s/it]

{'loss': 0.0268, 'grad_norm': 0.33682993054389954, 'learning_rate': 0.00011549718574108818, 'epoch': 4.24}


 42%|████▏     | 1132/2670 [27:33<36:56,  1.44s/it]

{'loss': 0.0499, 'grad_norm': 0.6026962995529175, 'learning_rate': 0.00011542213883677299, 'epoch': 4.24}


 42%|████▏     | 1133/2670 [27:35<38:24,  1.50s/it]

{'loss': 0.0291, 'grad_norm': 0.25272905826568604, 'learning_rate': 0.00011534709193245779, 'epoch': 4.24}


 42%|████▏     | 1134/2670 [27:36<37:54,  1.48s/it]

{'loss': 0.0176, 'grad_norm': 0.3741079270839691, 'learning_rate': 0.0001152720450281426, 'epoch': 4.25}


 43%|████▎     | 1135/2670 [27:38<38:17,  1.50s/it]

{'loss': 0.0156, 'grad_norm': 0.207182377576828, 'learning_rate': 0.0001151969981238274, 'epoch': 4.25}


 43%|████▎     | 1136/2670 [27:39<37:47,  1.48s/it]

{'loss': 0.0373, 'grad_norm': 0.3757596015930176, 'learning_rate': 0.00011512195121951219, 'epoch': 4.25}


 43%|████▎     | 1137/2670 [27:41<38:12,  1.50s/it]

{'loss': 0.0226, 'grad_norm': 0.31083419919013977, 'learning_rate': 0.00011504690431519701, 'epoch': 4.26}


 43%|████▎     | 1138/2670 [27:42<37:37,  1.47s/it]

{'loss': 0.022, 'grad_norm': 0.24188338220119476, 'learning_rate': 0.00011497185741088182, 'epoch': 4.26}


 43%|████▎     | 1139/2670 [27:43<38:11,  1.50s/it]

{'loss': 0.0406, 'grad_norm': 0.4987911283969879, 'learning_rate': 0.0001148968105065666, 'epoch': 4.27}


 43%|████▎     | 1140/2670 [27:45<38:29,  1.51s/it]

{'loss': 0.0316, 'grad_norm': 0.4107014238834381, 'learning_rate': 0.00011482176360225141, 'epoch': 4.27}


 43%|████▎     | 1141/2670 [27:46<36:36,  1.44s/it]

{'loss': 0.0165, 'grad_norm': 0.25365543365478516, 'learning_rate': 0.00011474671669793622, 'epoch': 4.27}


 43%|████▎     | 1142/2670 [27:48<37:04,  1.46s/it]

{'loss': 0.0258, 'grad_norm': 0.28896963596343994, 'learning_rate': 0.00011467166979362103, 'epoch': 4.28}


 43%|████▎     | 1143/2670 [27:49<38:16,  1.50s/it]

{'loss': 0.0248, 'grad_norm': 0.21629926562309265, 'learning_rate': 0.00011459662288930582, 'epoch': 4.28}


 43%|████▎     | 1144/2670 [27:51<37:03,  1.46s/it]

{'loss': 0.0451, 'grad_norm': 0.4971151053905487, 'learning_rate': 0.00011452157598499063, 'epoch': 4.28}


 43%|████▎     | 1145/2670 [27:52<37:00,  1.46s/it]

{'loss': 0.0281, 'grad_norm': 0.3434620499610901, 'learning_rate': 0.00011444652908067544, 'epoch': 4.29}


 43%|████▎     | 1146/2670 [27:54<36:40,  1.44s/it]

{'loss': 0.0375, 'grad_norm': 0.3806328773498535, 'learning_rate': 0.00011437148217636022, 'epoch': 4.29}


 43%|████▎     | 1147/2670 [27:55<37:24,  1.47s/it]

{'loss': 0.0517, 'grad_norm': 0.2884376347064972, 'learning_rate': 0.00011429643527204503, 'epoch': 4.3}


 43%|████▎     | 1148/2670 [27:56<34:51,  1.37s/it]

{'loss': 0.0402, 'grad_norm': 0.5814339518547058, 'learning_rate': 0.00011422138836772984, 'epoch': 4.3}


 43%|████▎     | 1149/2670 [27:58<35:14,  1.39s/it]

{'loss': 0.0143, 'grad_norm': 0.1880427896976471, 'learning_rate': 0.00011414634146341463, 'epoch': 4.3}


 43%|████▎     | 1150/2670 [27:59<37:05,  1.46s/it]

{'loss': 0.0272, 'grad_norm': 0.2588435411453247, 'learning_rate': 0.00011407129455909944, 'epoch': 4.31}


 43%|████▎     | 1151/2670 [28:01<37:09,  1.47s/it]

{'loss': 0.0204, 'grad_norm': 0.2992102801799774, 'learning_rate': 0.00011399624765478425, 'epoch': 4.31}


 43%|████▎     | 1152/2670 [28:02<35:32,  1.40s/it]

{'loss': 0.0477, 'grad_norm': 0.37526410818099976, 'learning_rate': 0.00011392120075046906, 'epoch': 4.31}


 43%|████▎     | 1153/2670 [28:04<36:56,  1.46s/it]

{'loss': 0.0828, 'grad_norm': 0.6407425999641418, 'learning_rate': 0.00011384615384615384, 'epoch': 4.32}


 43%|████▎     | 1154/2670 [28:05<38:12,  1.51s/it]

{'loss': 0.028, 'grad_norm': 0.3388019800186157, 'learning_rate': 0.00011377110694183865, 'epoch': 4.32}


 43%|████▎     | 1155/2670 [28:07<37:38,  1.49s/it]

{'loss': 0.0368, 'grad_norm': 0.28026729822158813, 'learning_rate': 0.00011369606003752347, 'epoch': 4.33}


 43%|████▎     | 1156/2670 [28:08<39:04,  1.55s/it]

{'loss': 0.0512, 'grad_norm': 0.5162548422813416, 'learning_rate': 0.00011362101313320825, 'epoch': 4.33}


 43%|████▎     | 1157/2670 [28:10<39:33,  1.57s/it]

{'loss': 0.0273, 'grad_norm': 0.3908390700817108, 'learning_rate': 0.00011354596622889306, 'epoch': 4.33}


 43%|████▎     | 1158/2670 [28:11<38:18,  1.52s/it]

{'loss': 0.0552, 'grad_norm': 0.6804388165473938, 'learning_rate': 0.00011347091932457787, 'epoch': 4.34}


 43%|████▎     | 1159/2670 [28:13<39:33,  1.57s/it]

{'loss': 0.0502, 'grad_norm': 0.4053530991077423, 'learning_rate': 0.00011339587242026267, 'epoch': 4.34}


 43%|████▎     | 1160/2670 [28:15<43:01,  1.71s/it]

{'loss': 0.0562, 'grad_norm': 0.35384735465049744, 'learning_rate': 0.00011332082551594748, 'epoch': 4.34}


 43%|████▎     | 1161/2670 [28:17<40:35,  1.61s/it]

{'loss': 0.0492, 'grad_norm': 0.42318058013916016, 'learning_rate': 0.00011324577861163229, 'epoch': 4.35}


 44%|████▎     | 1162/2670 [28:18<38:49,  1.54s/it]

{'loss': 0.0308, 'grad_norm': 0.32112541794776917, 'learning_rate': 0.00011317073170731707, 'epoch': 4.35}


 44%|████▎     | 1163/2670 [28:19<38:34,  1.54s/it]

{'loss': 0.029, 'grad_norm': 0.40097641944885254, 'learning_rate': 0.00011309568480300188, 'epoch': 4.36}


 44%|████▎     | 1164/2670 [28:21<36:50,  1.47s/it]

{'loss': 0.0201, 'grad_norm': 0.25827428698539734, 'learning_rate': 0.00011302063789868668, 'epoch': 4.36}


 44%|████▎     | 1165/2670 [28:22<34:31,  1.38s/it]

{'loss': 0.0197, 'grad_norm': 0.48105108737945557, 'learning_rate': 0.0001129455909943715, 'epoch': 4.36}


 44%|████▎     | 1166/2670 [28:23<35:32,  1.42s/it]

{'loss': 0.05, 'grad_norm': 0.5905174016952515, 'learning_rate': 0.00011287054409005629, 'epoch': 4.37}


 44%|████▎     | 1167/2670 [28:25<37:04,  1.48s/it]

{'loss': 0.0354, 'grad_norm': 0.32737746834754944, 'learning_rate': 0.0001127954971857411, 'epoch': 4.37}


 44%|████▎     | 1168/2670 [28:27<37:36,  1.50s/it]

{'loss': 0.0212, 'grad_norm': 0.2941819429397583, 'learning_rate': 0.0001127204502814259, 'epoch': 4.37}


 44%|████▍     | 1169/2670 [28:28<37:45,  1.51s/it]

{'loss': 0.0177, 'grad_norm': 0.18597979843616486, 'learning_rate': 0.00011264540337711069, 'epoch': 4.38}


 44%|████▍     | 1170/2670 [28:30<39:31,  1.58s/it]

{'loss': 0.0449, 'grad_norm': 0.3246235251426697, 'learning_rate': 0.0001125703564727955, 'epoch': 4.38}


 44%|████▍     | 1171/2670 [28:31<36:43,  1.47s/it]

{'loss': 0.0604, 'grad_norm': 0.6298174858093262, 'learning_rate': 0.0001124953095684803, 'epoch': 4.39}


 44%|████▍     | 1172/2670 [28:32<34:49,  1.39s/it]

{'loss': 0.0176, 'grad_norm': 0.1738206148147583, 'learning_rate': 0.0001124202626641651, 'epoch': 4.39}


 44%|████▍     | 1173/2670 [28:34<36:46,  1.47s/it]

{'loss': 0.0234, 'grad_norm': 0.2862361967563629, 'learning_rate': 0.00011234521575984991, 'epoch': 4.39}


 44%|████▍     | 1174/2670 [28:36<37:47,  1.52s/it]

{'loss': 0.0317, 'grad_norm': 0.3559668958187103, 'learning_rate': 0.00011227016885553472, 'epoch': 4.4}


 44%|████▍     | 1175/2670 [28:37<38:39,  1.55s/it]

{'loss': 0.0332, 'grad_norm': 0.3228769898414612, 'learning_rate': 0.00011219512195121953, 'epoch': 4.4}


 44%|████▍     | 1176/2670 [28:39<39:08,  1.57s/it]

{'loss': 0.026, 'grad_norm': 0.3118053674697876, 'learning_rate': 0.00011212007504690431, 'epoch': 4.4}


 44%|████▍     | 1177/2670 [28:40<38:41,  1.55s/it]

{'loss': 0.0602, 'grad_norm': 0.44932055473327637, 'learning_rate': 0.00011204502814258913, 'epoch': 4.41}


 44%|████▍     | 1178/2670 [28:42<37:09,  1.49s/it]

{'loss': 0.0501, 'grad_norm': 0.5216091871261597, 'learning_rate': 0.00011196998123827394, 'epoch': 4.41}


 44%|████▍     | 1179/2670 [28:43<36:34,  1.47s/it]

{'loss': 0.0179, 'grad_norm': 0.28983354568481445, 'learning_rate': 0.00011189493433395872, 'epoch': 4.42}


 44%|████▍     | 1180/2670 [28:44<34:38,  1.39s/it]

{'loss': 0.0279, 'grad_norm': 0.2636585235595703, 'learning_rate': 0.00011181988742964353, 'epoch': 4.42}


 44%|████▍     | 1181/2670 [28:46<36:41,  1.48s/it]

{'loss': 0.0558, 'grad_norm': 0.42292410135269165, 'learning_rate': 0.00011174484052532834, 'epoch': 4.42}


 44%|████▍     | 1182/2670 [28:48<36:53,  1.49s/it]

{'loss': 0.0316, 'grad_norm': 0.3057440519332886, 'learning_rate': 0.00011166979362101314, 'epoch': 4.43}


 44%|████▍     | 1183/2670 [28:49<33:50,  1.37s/it]

{'loss': 0.036, 'grad_norm': 0.35562846064567566, 'learning_rate': 0.00011159474671669794, 'epoch': 4.43}


 44%|████▍     | 1184/2670 [28:50<31:36,  1.28s/it]

{'loss': 0.0194, 'grad_norm': 0.2062309831380844, 'learning_rate': 0.00011151969981238275, 'epoch': 4.43}


 44%|████▍     | 1185/2670 [28:51<32:47,  1.32s/it]

{'loss': 0.0217, 'grad_norm': 0.2327585518360138, 'learning_rate': 0.00011144465290806754, 'epoch': 4.44}


 44%|████▍     | 1186/2670 [28:53<33:26,  1.35s/it]

{'loss': 0.031, 'grad_norm': 0.5771418809890747, 'learning_rate': 0.00011136960600375234, 'epoch': 4.44}


 44%|████▍     | 1187/2670 [28:54<32:59,  1.33s/it]

{'loss': 0.0372, 'grad_norm': 0.438388466835022, 'learning_rate': 0.00011129455909943715, 'epoch': 4.45}


 44%|████▍     | 1188/2670 [28:55<33:52,  1.37s/it]

{'loss': 0.0248, 'grad_norm': 0.2594558000564575, 'learning_rate': 0.00011121951219512196, 'epoch': 4.45}


 45%|████▍     | 1189/2670 [28:57<32:42,  1.32s/it]

{'loss': 0.0216, 'grad_norm': 0.272077739238739, 'learning_rate': 0.00011114446529080676, 'epoch': 4.45}


 45%|████▍     | 1190/2670 [28:58<34:57,  1.42s/it]

{'loss': 0.0221, 'grad_norm': 0.3368917405605316, 'learning_rate': 0.00011106941838649157, 'epoch': 4.46}


 45%|████▍     | 1191/2670 [29:00<35:40,  1.45s/it]

{'loss': 0.0332, 'grad_norm': 0.3069421648979187, 'learning_rate': 0.00011099437148217637, 'epoch': 4.46}


 45%|████▍     | 1192/2670 [29:01<35:06,  1.42s/it]

{'loss': 0.0369, 'grad_norm': 0.3767137825489044, 'learning_rate': 0.00011091932457786116, 'epoch': 4.46}


 45%|████▍     | 1193/2670 [29:03<35:49,  1.46s/it]

{'loss': 0.0265, 'grad_norm': 0.3692798912525177, 'learning_rate': 0.00011084427767354597, 'epoch': 4.47}


 45%|████▍     | 1194/2670 [29:04<36:02,  1.47s/it]

{'loss': 0.0202, 'grad_norm': 0.26440349221229553, 'learning_rate': 0.00011076923076923077, 'epoch': 4.47}


 45%|████▍     | 1195/2670 [29:05<34:01,  1.38s/it]

{'loss': 0.0298, 'grad_norm': 0.2999420166015625, 'learning_rate': 0.00011069418386491557, 'epoch': 4.48}


 45%|████▍     | 1196/2670 [29:07<34:28,  1.40s/it]

{'loss': 0.0091, 'grad_norm': 0.15799134969711304, 'learning_rate': 0.00011061913696060038, 'epoch': 4.48}


 45%|████▍     | 1197/2670 [29:08<34:34,  1.41s/it]

{'loss': 0.0296, 'grad_norm': 0.44084620475769043, 'learning_rate': 0.00011054409005628519, 'epoch': 4.48}


 45%|████▍     | 1198/2670 [29:09<33:23,  1.36s/it]

{'loss': 0.0397, 'grad_norm': 0.45328667759895325, 'learning_rate': 0.00011046904315197, 'epoch': 4.49}


 45%|████▍     | 1199/2670 [29:11<36:58,  1.51s/it]

{'loss': 0.0177, 'grad_norm': 0.278729647397995, 'learning_rate': 0.00011039399624765479, 'epoch': 4.49}


 45%|████▍     | 1200/2670 [29:13<35:46,  1.46s/it]

{'loss': 0.0138, 'grad_norm': 0.2778342664241791, 'learning_rate': 0.0001103189493433396, 'epoch': 4.49}


 45%|████▍     | 1201/2670 [29:14<37:07,  1.52s/it]

{'loss': 0.0359, 'grad_norm': 0.48222026228904724, 'learning_rate': 0.00011024390243902441, 'epoch': 4.5}


 45%|████▌     | 1202/2670 [29:16<37:40,  1.54s/it]

{'loss': 0.0567, 'grad_norm': 0.5489521622657776, 'learning_rate': 0.00011016885553470919, 'epoch': 4.5}


 45%|████▌     | 1203/2670 [29:17<35:35,  1.46s/it]

{'loss': 0.0222, 'grad_norm': 0.3571557402610779, 'learning_rate': 0.000110093808630394, 'epoch': 4.51}


 45%|████▌     | 1204/2670 [29:18<34:38,  1.42s/it]

{'loss': 0.0618, 'grad_norm': 0.8322045803070068, 'learning_rate': 0.00011001876172607881, 'epoch': 4.51}


 45%|████▌     | 1205/2670 [29:20<36:04,  1.48s/it]

{'loss': 0.0126, 'grad_norm': 0.2690916061401367, 'learning_rate': 0.0001099437148217636, 'epoch': 4.51}


 45%|████▌     | 1206/2670 [29:22<38:10,  1.56s/it]

{'loss': 0.0352, 'grad_norm': 0.4279215633869171, 'learning_rate': 0.00010986866791744841, 'epoch': 4.52}


 45%|████▌     | 1207/2670 [29:23<35:09,  1.44s/it]

{'loss': 0.0205, 'grad_norm': 0.2075890302658081, 'learning_rate': 0.00010979362101313322, 'epoch': 4.52}


 45%|████▌     | 1208/2670 [29:24<34:09,  1.40s/it]

{'loss': 0.0366, 'grad_norm': 0.4086996614933014, 'learning_rate': 0.000109718574108818, 'epoch': 4.52}


 45%|████▌     | 1209/2670 [29:26<35:21,  1.45s/it]

{'loss': 0.0284, 'grad_norm': 0.282749742269516, 'learning_rate': 0.00010964352720450281, 'epoch': 4.53}


 45%|████▌     | 1210/2670 [29:27<34:54,  1.43s/it]

{'loss': 0.0279, 'grad_norm': 0.23256519436836243, 'learning_rate': 0.00010956848030018762, 'epoch': 4.53}


 45%|████▌     | 1211/2670 [29:29<36:02,  1.48s/it]

{'loss': 0.0178, 'grad_norm': 0.21615681052207947, 'learning_rate': 0.00010949343339587243, 'epoch': 4.54}


 45%|████▌     | 1212/2670 [29:30<33:24,  1.38s/it]

{'loss': 0.0412, 'grad_norm': 0.8398169279098511, 'learning_rate': 0.00010941838649155723, 'epoch': 4.54}


 45%|████▌     | 1213/2670 [29:32<34:53,  1.44s/it]

{'loss': 0.0254, 'grad_norm': 0.4076058864593506, 'learning_rate': 0.00010934333958724203, 'epoch': 4.54}


 45%|████▌     | 1214/2670 [29:32<31:22,  1.29s/it]

{'loss': 0.0131, 'grad_norm': 0.21538011729717255, 'learning_rate': 0.00010926829268292684, 'epoch': 4.55}


 46%|████▌     | 1215/2670 [29:34<31:52,  1.31s/it]

{'loss': 0.0611, 'grad_norm': 0.381353497505188, 'learning_rate': 0.00010919324577861162, 'epoch': 4.55}


 46%|████▌     | 1216/2670 [29:35<31:21,  1.29s/it]

{'loss': 0.0516, 'grad_norm': 0.5477539896965027, 'learning_rate': 0.00010911819887429643, 'epoch': 4.55}


 46%|████▌     | 1217/2670 [29:37<33:21,  1.38s/it]

{'loss': 0.0281, 'grad_norm': 0.3610478639602661, 'learning_rate': 0.00010904315196998126, 'epoch': 4.56}


 46%|████▌     | 1218/2670 [29:38<33:07,  1.37s/it]

{'loss': 0.0449, 'grad_norm': 0.45908600091934204, 'learning_rate': 0.00010896810506566604, 'epoch': 4.56}


 46%|████▌     | 1219/2670 [29:40<34:10,  1.41s/it]

{'loss': 0.0997, 'grad_norm': 0.5137485861778259, 'learning_rate': 0.00010889305816135085, 'epoch': 4.57}


 46%|████▌     | 1220/2670 [29:41<34:53,  1.44s/it]

{'loss': 0.0446, 'grad_norm': 0.3854930102825165, 'learning_rate': 0.00010881801125703566, 'epoch': 4.57}


 46%|████▌     | 1221/2670 [29:42<34:12,  1.42s/it]

{'loss': 0.0177, 'grad_norm': 0.2511517107486725, 'learning_rate': 0.00010874296435272045, 'epoch': 4.57}


 46%|████▌     | 1222/2670 [29:44<35:10,  1.46s/it]

{'loss': 0.0227, 'grad_norm': 0.25066131353378296, 'learning_rate': 0.00010866791744840526, 'epoch': 4.58}


 46%|████▌     | 1223/2670 [29:46<37:42,  1.56s/it]

{'loss': 0.0397, 'grad_norm': 0.27445492148399353, 'learning_rate': 0.00010859287054409007, 'epoch': 4.58}


 46%|████▌     | 1224/2670 [29:47<36:52,  1.53s/it]

{'loss': 0.0335, 'grad_norm': 0.3651977479457855, 'learning_rate': 0.00010851782363977488, 'epoch': 4.58}


 46%|████▌     | 1225/2670 [29:49<35:56,  1.49s/it]

{'loss': 0.025, 'grad_norm': 0.2476796805858612, 'learning_rate': 0.00010844277673545966, 'epoch': 4.59}


 46%|████▌     | 1226/2670 [29:50<36:50,  1.53s/it]

{'loss': 0.0578, 'grad_norm': 0.5294448733329773, 'learning_rate': 0.00010836772983114447, 'epoch': 4.59}


 46%|████▌     | 1227/2670 [29:52<35:27,  1.47s/it]

{'loss': 0.0191, 'grad_norm': 0.23354126513004303, 'learning_rate': 0.00010829268292682928, 'epoch': 4.6}


 46%|████▌     | 1228/2670 [29:53<35:34,  1.48s/it]

{'loss': 0.0272, 'grad_norm': 0.2961873710155487, 'learning_rate': 0.00010821763602251407, 'epoch': 4.6}


 46%|████▌     | 1229/2670 [29:55<36:29,  1.52s/it]

{'loss': 0.0347, 'grad_norm': 0.3704160749912262, 'learning_rate': 0.00010814258911819888, 'epoch': 4.6}


 46%|████▌     | 1230/2670 [29:56<37:15,  1.55s/it]

{'loss': 0.021, 'grad_norm': 0.29110535979270935, 'learning_rate': 0.00010806754221388369, 'epoch': 4.61}


 46%|████▌     | 1231/2670 [29:58<38:01,  1.59s/it]

{'loss': 0.0893, 'grad_norm': 0.5203683972358704, 'learning_rate': 0.00010799249530956847, 'epoch': 4.61}


 46%|████▌     | 1232/2670 [29:59<37:22,  1.56s/it]

{'loss': 0.0418, 'grad_norm': 0.5562618970870972, 'learning_rate': 0.00010791744840525328, 'epoch': 4.61}


 46%|████▌     | 1233/2670 [30:01<36:57,  1.54s/it]

{'loss': 0.0461, 'grad_norm': 0.3947610557079315, 'learning_rate': 0.00010784240150093809, 'epoch': 4.62}


 46%|████▌     | 1234/2670 [30:02<34:45,  1.45s/it]

{'loss': 0.0377, 'grad_norm': 0.29127761721611023, 'learning_rate': 0.00010776735459662291, 'epoch': 4.62}


 46%|████▋     | 1235/2670 [30:04<36:56,  1.54s/it]

{'loss': 0.0324, 'grad_norm': 0.4109661877155304, 'learning_rate': 0.0001076923076923077, 'epoch': 4.63}


 46%|████▋     | 1236/2670 [30:06<36:40,  1.53s/it]

{'loss': 0.0246, 'grad_norm': 0.264193058013916, 'learning_rate': 0.0001076172607879925, 'epoch': 4.63}


 46%|████▋     | 1237/2670 [30:07<36:40,  1.54s/it]

{'loss': 0.027, 'grad_norm': 0.42561039328575134, 'learning_rate': 0.00010754221388367731, 'epoch': 4.63}


 46%|████▋     | 1238/2670 [30:09<36:39,  1.54s/it]

{'loss': 0.0466, 'grad_norm': 0.41616156697273254, 'learning_rate': 0.00010746716697936209, 'epoch': 4.64}


 46%|████▋     | 1239/2670 [30:10<35:57,  1.51s/it]

{'loss': 0.0331, 'grad_norm': 0.3558961749076843, 'learning_rate': 0.00010739212007504692, 'epoch': 4.64}


 46%|████▋     | 1240/2670 [30:12<35:41,  1.50s/it]

{'loss': 0.0435, 'grad_norm': 0.5267574787139893, 'learning_rate': 0.00010731707317073172, 'epoch': 4.64}


 46%|████▋     | 1241/2670 [30:14<39:14,  1.65s/it]

{'loss': 0.053, 'grad_norm': 0.4536839425563812, 'learning_rate': 0.0001072420262664165, 'epoch': 4.65}


 47%|████▋     | 1242/2670 [30:15<39:51,  1.67s/it]

{'loss': 0.0488, 'grad_norm': 0.40984609723091125, 'learning_rate': 0.00010716697936210131, 'epoch': 4.65}


 47%|████▋     | 1243/2670 [30:17<38:00,  1.60s/it]

{'loss': 0.0255, 'grad_norm': 0.4052249789237976, 'learning_rate': 0.00010709193245778612, 'epoch': 4.66}


 47%|████▋     | 1244/2670 [30:18<39:14,  1.65s/it]

{'loss': 0.0241, 'grad_norm': 0.2757940888404846, 'learning_rate': 0.00010701688555347092, 'epoch': 4.66}


 47%|████▋     | 1245/2670 [30:20<40:30,  1.71s/it]

{'loss': 0.0293, 'grad_norm': 0.31405824422836304, 'learning_rate': 0.00010694183864915573, 'epoch': 4.66}


 47%|████▋     | 1246/2670 [30:22<39:15,  1.65s/it]

{'loss': 0.0274, 'grad_norm': 0.28793275356292725, 'learning_rate': 0.00010686679174484054, 'epoch': 4.67}


 47%|████▋     | 1247/2670 [30:23<38:45,  1.63s/it]

{'loss': 0.0399, 'grad_norm': 0.41115060448646545, 'learning_rate': 0.00010679174484052535, 'epoch': 4.67}


 47%|████▋     | 1248/2670 [30:25<37:57,  1.60s/it]

{'loss': 0.0555, 'grad_norm': 0.44515112042427063, 'learning_rate': 0.00010671669793621013, 'epoch': 4.67}


 47%|████▋     | 1249/2670 [30:26<36:45,  1.55s/it]

{'loss': 0.0495, 'grad_norm': 0.5167940855026245, 'learning_rate': 0.00010664165103189494, 'epoch': 4.68}


 47%|████▋     | 1250/2670 [30:28<35:43,  1.51s/it]

{'loss': 0.0192, 'grad_norm': 0.1909327507019043, 'learning_rate': 0.00010656660412757974, 'epoch': 4.68}


 47%|████▋     | 1251/2670 [30:29<36:17,  1.53s/it]

{'loss': 0.043, 'grad_norm': 0.43860551714897156, 'learning_rate': 0.00010649155722326454, 'epoch': 4.69}


 47%|████▋     | 1252/2670 [30:31<35:04,  1.48s/it]

{'loss': 0.0348, 'grad_norm': 0.3453276753425598, 'learning_rate': 0.00010641651031894935, 'epoch': 4.69}


 47%|████▋     | 1253/2670 [30:32<35:20,  1.50s/it]

{'loss': 0.037, 'grad_norm': 0.380332887172699, 'learning_rate': 0.00010634146341463416, 'epoch': 4.69}


 47%|████▋     | 1254/2670 [30:34<34:05,  1.44s/it]

{'loss': 0.0148, 'grad_norm': 0.17374984920024872, 'learning_rate': 0.00010626641651031894, 'epoch': 4.7}


 47%|████▋     | 1255/2670 [30:35<32:48,  1.39s/it]

{'loss': 0.0324, 'grad_norm': 0.47862672805786133, 'learning_rate': 0.00010619136960600375, 'epoch': 4.7}


 47%|████▋     | 1256/2670 [30:36<33:52,  1.44s/it]

{'loss': 0.054, 'grad_norm': 0.5869646072387695, 'learning_rate': 0.00010611632270168857, 'epoch': 4.7}


 47%|████▋     | 1257/2670 [30:38<33:26,  1.42s/it]

{'loss': 0.0462, 'grad_norm': 0.5124980211257935, 'learning_rate': 0.00010604127579737338, 'epoch': 4.71}


 47%|████▋     | 1258/2670 [30:39<34:46,  1.48s/it]

{'loss': 0.029, 'grad_norm': 0.3923136591911316, 'learning_rate': 0.00010596622889305816, 'epoch': 4.71}


 47%|████▋     | 1259/2670 [30:41<33:07,  1.41s/it]

{'loss': 0.0441, 'grad_norm': 0.5322574973106384, 'learning_rate': 0.00010589118198874297, 'epoch': 4.72}


 47%|████▋     | 1260/2670 [30:42<32:48,  1.40s/it]

{'loss': 0.0297, 'grad_norm': 0.2759733200073242, 'learning_rate': 0.00010581613508442778, 'epoch': 4.72}


 47%|████▋     | 1261/2670 [30:43<33:19,  1.42s/it]

{'loss': 0.0345, 'grad_norm': 0.38240984082221985, 'learning_rate': 0.00010574108818011257, 'epoch': 4.72}


 47%|████▋     | 1262/2670 [30:45<32:51,  1.40s/it]

{'loss': 0.0354, 'grad_norm': 0.5634448528289795, 'learning_rate': 0.00010566604127579738, 'epoch': 4.73}


 47%|████▋     | 1263/2670 [30:46<33:50,  1.44s/it]

{'loss': 0.0282, 'grad_norm': 0.3328951895236969, 'learning_rate': 0.00010559099437148219, 'epoch': 4.73}


 47%|████▋     | 1264/2670 [30:48<34:02,  1.45s/it]

{'loss': 0.0449, 'grad_norm': 0.455258309841156, 'learning_rate': 0.00010551594746716697, 'epoch': 4.73}


 47%|████▋     | 1265/2670 [30:49<33:05,  1.41s/it]

{'loss': 0.0635, 'grad_norm': 0.4657668173313141, 'learning_rate': 0.00010544090056285178, 'epoch': 4.74}


 47%|████▋     | 1266/2670 [30:50<32:07,  1.37s/it]

{'loss': 0.0267, 'grad_norm': 0.3110892176628113, 'learning_rate': 0.00010536585365853659, 'epoch': 4.74}


 47%|████▋     | 1267/2670 [30:52<33:15,  1.42s/it]

{'loss': 0.0378, 'grad_norm': 0.3647269606590271, 'learning_rate': 0.00010529080675422139, 'epoch': 4.75}


 47%|████▋     | 1268/2670 [30:54<34:44,  1.49s/it]

{'loss': 0.0281, 'grad_norm': 0.25518155097961426, 'learning_rate': 0.0001052157598499062, 'epoch': 4.75}


 48%|████▊     | 1269/2670 [30:55<34:02,  1.46s/it]

{'loss': 0.0365, 'grad_norm': 0.33137378096580505, 'learning_rate': 0.000105140712945591, 'epoch': 4.75}


 48%|████▊     | 1270/2670 [30:56<34:13,  1.47s/it]

{'loss': 0.0293, 'grad_norm': 0.2536441683769226, 'learning_rate': 0.00010506566604127581, 'epoch': 4.76}


 48%|████▊     | 1271/2670 [30:58<34:32,  1.48s/it]

{'loss': 0.0371, 'grad_norm': 0.33605527877807617, 'learning_rate': 0.0001049906191369606, 'epoch': 4.76}


 48%|████▊     | 1272/2670 [31:00<34:56,  1.50s/it]

{'loss': 0.0279, 'grad_norm': 0.24052684009075165, 'learning_rate': 0.0001049155722326454, 'epoch': 4.76}


 48%|████▊     | 1273/2670 [31:01<34:56,  1.50s/it]

{'loss': 0.0109, 'grad_norm': 0.11756906658411026, 'learning_rate': 0.00010484052532833021, 'epoch': 4.77}


 48%|████▊     | 1274/2670 [31:02<33:44,  1.45s/it]

{'loss': 0.0272, 'grad_norm': 0.2565789222717285, 'learning_rate': 0.00010476547842401501, 'epoch': 4.77}


 48%|████▊     | 1275/2670 [31:04<32:30,  1.40s/it]

{'loss': 0.0237, 'grad_norm': 0.1996869146823883, 'learning_rate': 0.00010469043151969982, 'epoch': 4.78}


 48%|████▊     | 1276/2670 [31:05<32:56,  1.42s/it]

{'loss': 0.0267, 'grad_norm': 0.3116162419319153, 'learning_rate': 0.00010461538461538463, 'epoch': 4.78}


 48%|████▊     | 1277/2670 [31:07<35:51,  1.54s/it]

{'loss': 0.0238, 'grad_norm': 0.2758793830871582, 'learning_rate': 0.00010454033771106941, 'epoch': 4.78}


 48%|████▊     | 1278/2670 [31:08<34:05,  1.47s/it]

{'loss': 0.0385, 'grad_norm': 0.6573187112808228, 'learning_rate': 0.00010446529080675423, 'epoch': 4.79}


 48%|████▊     | 1279/2670 [31:10<34:09,  1.47s/it]

{'loss': 0.0486, 'grad_norm': 0.6609554290771484, 'learning_rate': 0.00010439024390243904, 'epoch': 4.79}


 48%|████▊     | 1280/2670 [31:11<34:30,  1.49s/it]

{'loss': 0.0424, 'grad_norm': 0.5189324021339417, 'learning_rate': 0.00010431519699812385, 'epoch': 4.79}


 48%|████▊     | 1281/2670 [31:13<33:23,  1.44s/it]

{'loss': 0.0245, 'grad_norm': 0.433531790971756, 'learning_rate': 0.00010424015009380863, 'epoch': 4.8}


 48%|████▊     | 1282/2670 [31:14<35:02,  1.51s/it]

{'loss': 0.0541, 'grad_norm': 0.315906286239624, 'learning_rate': 0.00010416510318949344, 'epoch': 4.8}


 48%|████▊     | 1283/2670 [31:16<35:05,  1.52s/it]

{'loss': 0.0469, 'grad_norm': 0.4113926589488983, 'learning_rate': 0.00010409005628517825, 'epoch': 4.81}


 48%|████▊     | 1284/2670 [31:17<34:55,  1.51s/it]

{'loss': 0.0282, 'grad_norm': 0.3153471350669861, 'learning_rate': 0.00010401500938086304, 'epoch': 4.81}


 48%|████▊     | 1285/2670 [31:19<34:19,  1.49s/it]

{'loss': 0.0432, 'grad_norm': 0.31500184535980225, 'learning_rate': 0.00010393996247654785, 'epoch': 4.81}


 48%|████▊     | 1286/2670 [31:20<32:38,  1.42s/it]

{'loss': 0.0248, 'grad_norm': 0.2158835083246231, 'learning_rate': 0.00010386491557223266, 'epoch': 4.82}


 48%|████▊     | 1287/2670 [31:21<32:59,  1.43s/it]

{'loss': 0.0389, 'grad_norm': 0.4283323287963867, 'learning_rate': 0.00010378986866791744, 'epoch': 4.82}


 48%|████▊     | 1288/2670 [31:23<31:50,  1.38s/it]

{'loss': 0.0301, 'grad_norm': 0.2470802664756775, 'learning_rate': 0.00010371482176360225, 'epoch': 4.82}


 48%|████▊     | 1289/2670 [31:24<32:53,  1.43s/it]

{'loss': 0.0402, 'grad_norm': 0.41193801164627075, 'learning_rate': 0.00010363977485928706, 'epoch': 4.83}


 48%|████▊     | 1290/2670 [31:25<30:54,  1.34s/it]

{'loss': 0.0306, 'grad_norm': 0.4269319176673889, 'learning_rate': 0.00010356472795497186, 'epoch': 4.83}


 48%|████▊     | 1291/2670 [31:27<30:31,  1.33s/it]

{'loss': 0.03, 'grad_norm': 0.23264138400554657, 'learning_rate': 0.00010348968105065666, 'epoch': 4.84}


 48%|████▊     | 1292/2670 [31:29<33:56,  1.48s/it]

{'loss': 0.0249, 'grad_norm': 0.25335946679115295, 'learning_rate': 0.00010341463414634147, 'epoch': 4.84}


 48%|████▊     | 1293/2670 [31:30<31:56,  1.39s/it]

{'loss': 0.0232, 'grad_norm': 0.22374115884304047, 'learning_rate': 0.00010333958724202628, 'epoch': 4.84}


 48%|████▊     | 1294/2670 [31:31<33:05,  1.44s/it]

{'loss': 0.0315, 'grad_norm': 0.3539167046546936, 'learning_rate': 0.00010326454033771106, 'epoch': 4.85}


 49%|████▊     | 1295/2670 [31:33<32:35,  1.42s/it]

{'loss': 0.0247, 'grad_norm': 0.2967195510864258, 'learning_rate': 0.00010318949343339587, 'epoch': 4.85}


 49%|████▊     | 1296/2670 [31:34<31:17,  1.37s/it]

{'loss': 0.0292, 'grad_norm': 0.2435954511165619, 'learning_rate': 0.0001031144465290807, 'epoch': 4.85}


 49%|████▊     | 1297/2670 [31:36<33:06,  1.45s/it]

{'loss': 0.0226, 'grad_norm': 0.3429529368877411, 'learning_rate': 0.00010303939962476548, 'epoch': 4.86}


 49%|████▊     | 1298/2670 [31:37<34:25,  1.51s/it]

{'loss': 0.0417, 'grad_norm': 0.3324393928050995, 'learning_rate': 0.00010296435272045029, 'epoch': 4.86}


 49%|████▊     | 1299/2670 [31:38<33:03,  1.45s/it]

{'loss': 0.0379, 'grad_norm': 0.40563109517097473, 'learning_rate': 0.0001028893058161351, 'epoch': 4.87}


 49%|████▊     | 1300/2670 [31:40<33:14,  1.46s/it]

{'loss': 0.0461, 'grad_norm': 0.40829452872276306, 'learning_rate': 0.00010281425891181989, 'epoch': 4.87}


 49%|████▊     | 1301/2670 [31:42<35:17,  1.55s/it]

{'loss': 0.0449, 'grad_norm': 0.3506523370742798, 'learning_rate': 0.0001027392120075047, 'epoch': 4.87}


 49%|████▉     | 1302/2670 [31:43<35:45,  1.57s/it]

{'loss': 0.0345, 'grad_norm': 0.34744957089424133, 'learning_rate': 0.00010266416510318951, 'epoch': 4.88}


 49%|████▉     | 1303/2670 [31:45<34:44,  1.53s/it]

{'loss': 0.0366, 'grad_norm': 0.4483555853366852, 'learning_rate': 0.00010258911819887432, 'epoch': 4.88}


 49%|████▉     | 1304/2670 [31:46<32:03,  1.41s/it]

{'loss': 0.0346, 'grad_norm': 0.6410121917724609, 'learning_rate': 0.0001025140712945591, 'epoch': 4.88}


 49%|████▉     | 1305/2670 [31:47<30:38,  1.35s/it]

{'loss': 0.0261, 'grad_norm': 0.18232402205467224, 'learning_rate': 0.0001024390243902439, 'epoch': 4.89}


 49%|████▉     | 1306/2670 [31:48<30:11,  1.33s/it]

{'loss': 0.0081, 'grad_norm': 0.08862894773483276, 'learning_rate': 0.00010236397748592872, 'epoch': 4.89}


 49%|████▉     | 1307/2670 [31:50<29:24,  1.29s/it]

{'loss': 0.0527, 'grad_norm': 0.40398451685905457, 'learning_rate': 0.00010228893058161351, 'epoch': 4.9}


 49%|████▉     | 1308/2670 [31:51<28:25,  1.25s/it]

{'loss': 0.036, 'grad_norm': 0.46865618228912354, 'learning_rate': 0.00010221388367729832, 'epoch': 4.9}


 49%|████▉     | 1309/2670 [31:52<30:56,  1.36s/it]

{'loss': 0.0475, 'grad_norm': 0.4209856688976288, 'learning_rate': 0.00010213883677298313, 'epoch': 4.9}


 49%|████▉     | 1310/2670 [31:54<30:51,  1.36s/it]

{'loss': 0.033, 'grad_norm': 0.42364001274108887, 'learning_rate': 0.00010206378986866791, 'epoch': 4.91}


 49%|████▉     | 1311/2670 [31:55<31:20,  1.38s/it]

{'loss': 0.0355, 'grad_norm': 0.501617431640625, 'learning_rate': 0.00010198874296435272, 'epoch': 4.91}


 49%|████▉     | 1312/2670 [31:57<32:12,  1.42s/it]

{'loss': 0.0314, 'grad_norm': 0.510217010974884, 'learning_rate': 0.00010191369606003753, 'epoch': 4.91}


 49%|████▉     | 1313/2670 [31:58<33:29,  1.48s/it]

{'loss': 0.0225, 'grad_norm': 0.2804356515407562, 'learning_rate': 0.00010183864915572232, 'epoch': 4.92}


 49%|████▉     | 1314/2670 [31:59<30:46,  1.36s/it]

{'loss': 0.0221, 'grad_norm': 0.4669468402862549, 'learning_rate': 0.00010176360225140713, 'epoch': 4.92}


 49%|████▉     | 1315/2670 [32:01<30:21,  1.34s/it]

{'loss': 0.0303, 'grad_norm': 0.44393229484558105, 'learning_rate': 0.00010168855534709194, 'epoch': 4.93}


 49%|████▉     | 1316/2670 [32:02<30:59,  1.37s/it]

{'loss': 0.0209, 'grad_norm': 0.2607419788837433, 'learning_rate': 0.00010161350844277675, 'epoch': 4.93}


 49%|████▉     | 1317/2670 [32:04<32:07,  1.42s/it]

{'loss': 0.0223, 'grad_norm': 0.3072609603404999, 'learning_rate': 0.00010153846153846153, 'epoch': 4.93}


 49%|████▉     | 1318/2670 [32:05<33:23,  1.48s/it]

{'loss': 0.0413, 'grad_norm': 0.40555936098098755, 'learning_rate': 0.00010146341463414635, 'epoch': 4.94}


 49%|████▉     | 1319/2670 [32:07<33:37,  1.49s/it]

{'loss': 0.0535, 'grad_norm': 0.5187285542488098, 'learning_rate': 0.00010138836772983116, 'epoch': 4.94}


 49%|████▉     | 1320/2670 [32:08<33:22,  1.48s/it]

{'loss': 0.0492, 'grad_norm': 0.43304890394210815, 'learning_rate': 0.00010131332082551594, 'epoch': 4.94}


 49%|████▉     | 1321/2670 [32:10<33:38,  1.50s/it]

{'loss': 0.0358, 'grad_norm': 0.4381079077720642, 'learning_rate': 0.00010123827392120075, 'epoch': 4.95}


 50%|████▉     | 1322/2670 [32:11<32:12,  1.43s/it]

{'loss': 0.0293, 'grad_norm': 0.275993287563324, 'learning_rate': 0.00010116322701688556, 'epoch': 4.95}


 50%|████▉     | 1323/2670 [32:12<31:01,  1.38s/it]

{'loss': 0.0216, 'grad_norm': 0.20147240161895752, 'learning_rate': 0.00010108818011257036, 'epoch': 4.96}


 50%|████▉     | 1324/2670 [32:14<31:14,  1.39s/it]

{'loss': 0.0168, 'grad_norm': 0.2995123267173767, 'learning_rate': 0.00010101313320825517, 'epoch': 4.96}


 50%|████▉     | 1325/2670 [32:15<30:56,  1.38s/it]

{'loss': 0.0439, 'grad_norm': 0.428681343793869, 'learning_rate': 0.00010093808630393998, 'epoch': 4.96}


 50%|████▉     | 1326/2670 [32:17<31:58,  1.43s/it]

{'loss': 0.0208, 'grad_norm': 0.34648439288139343, 'learning_rate': 0.00010086303939962478, 'epoch': 4.97}


 50%|████▉     | 1327/2670 [32:18<31:34,  1.41s/it]

{'loss': 0.0393, 'grad_norm': 0.4804864823818207, 'learning_rate': 0.00010078799249530957, 'epoch': 4.97}


 50%|████▉     | 1328/2670 [32:20<34:28,  1.54s/it]

{'loss': 0.0434, 'grad_norm': 0.37388408184051514, 'learning_rate': 0.00010071294559099437, 'epoch': 4.97}


 50%|████▉     | 1329/2670 [32:21<33:52,  1.52s/it]

{'loss': 0.0451, 'grad_norm': 0.4291574954986572, 'learning_rate': 0.00010063789868667918, 'epoch': 4.98}


 50%|████▉     | 1330/2670 [32:23<34:45,  1.56s/it]

{'loss': 0.0177, 'grad_norm': 0.2879912257194519, 'learning_rate': 0.00010056285178236398, 'epoch': 4.98}


 50%|████▉     | 1331/2670 [32:24<32:21,  1.45s/it]

{'loss': 0.0152, 'grad_norm': 0.268112450838089, 'learning_rate': 0.00010048780487804879, 'epoch': 4.99}


 50%|████▉     | 1332/2670 [32:26<32:16,  1.45s/it]

{'loss': 0.098, 'grad_norm': 0.5160629153251648, 'learning_rate': 0.0001004127579737336, 'epoch': 4.99}


 50%|████▉     | 1333/2670 [32:27<31:44,  1.42s/it]

{'loss': 0.0366, 'grad_norm': 0.3376880884170532, 'learning_rate': 0.00010033771106941838, 'epoch': 4.99}


 50%|████▉     | 1334/2670 [32:28<30:48,  1.38s/it]

{'loss': 0.062, 'grad_norm': 0.45815014839172363, 'learning_rate': 0.00010026266416510319, 'epoch': 5.0}


 50%|█████     | 1335/2670 [32:30<31:45,  1.43s/it]

{'loss': 0.0458, 'grad_norm': 0.37274911999702454, 'learning_rate': 0.000100187617260788, 'epoch': 5.0}


 50%|█████     | 1336/2670 [32:32<35:23,  1.59s/it]

{'loss': 0.0177, 'grad_norm': 0.24457880854606628, 'learning_rate': 0.00010011257035647279, 'epoch': 5.0}


 50%|█████     | 1337/2670 [32:33<34:26,  1.55s/it]

{'loss': 0.0292, 'grad_norm': 0.23187968134880066, 'learning_rate': 0.0001000375234521576, 'epoch': 5.01}


 50%|█████     | 1338/2670 [32:35<35:14,  1.59s/it]

{'loss': 0.0521, 'grad_norm': 0.3778998851776123, 'learning_rate': 9.996247654784241e-05, 'epoch': 5.01}


 50%|█████     | 1339/2670 [32:37<36:05,  1.63s/it]

{'loss': 0.009, 'grad_norm': 0.10823983699083328, 'learning_rate': 9.98874296435272e-05, 'epoch': 5.01}


 50%|█████     | 1340/2670 [32:38<33:47,  1.52s/it]

{'loss': 0.0172, 'grad_norm': 0.33182480931282043, 'learning_rate': 9.981238273921201e-05, 'epoch': 5.02}


 50%|█████     | 1341/2670 [32:39<33:47,  1.53s/it]

{'loss': 0.0252, 'grad_norm': 0.31491148471832275, 'learning_rate': 9.973733583489682e-05, 'epoch': 5.02}


 50%|█████     | 1342/2670 [32:41<33:42,  1.52s/it]

{'loss': 0.0407, 'grad_norm': 0.3827368915081024, 'learning_rate': 9.966228893058162e-05, 'epoch': 5.03}


 50%|█████     | 1343/2670 [32:42<33:49,  1.53s/it]

{'loss': 0.0137, 'grad_norm': 0.2331538051366806, 'learning_rate': 9.958724202626643e-05, 'epoch': 5.03}


 50%|█████     | 1344/2670 [32:44<33:13,  1.50s/it]

{'loss': 0.0371, 'grad_norm': 0.3573716878890991, 'learning_rate': 9.951219512195122e-05, 'epoch': 5.03}


 50%|█████     | 1345/2670 [32:45<33:21,  1.51s/it]

{'loss': 0.0128, 'grad_norm': 0.21741090714931488, 'learning_rate': 9.943714821763602e-05, 'epoch': 5.04}


 50%|█████     | 1346/2670 [32:47<31:48,  1.44s/it]

{'loss': 0.0225, 'grad_norm': 0.2510015368461609, 'learning_rate': 9.936210131332083e-05, 'epoch': 5.04}


 50%|█████     | 1347/2670 [32:48<32:25,  1.47s/it]

{'loss': 0.0126, 'grad_norm': 0.28363555669784546, 'learning_rate': 9.928705440900563e-05, 'epoch': 5.04}


 50%|█████     | 1348/2670 [32:50<32:46,  1.49s/it]

{'loss': 0.0134, 'grad_norm': 0.18604622781276703, 'learning_rate': 9.921200750469044e-05, 'epoch': 5.05}


 51%|█████     | 1349/2670 [32:51<33:37,  1.53s/it]

{'loss': 0.0084, 'grad_norm': 0.13041652739048004, 'learning_rate': 9.913696060037524e-05, 'epoch': 5.05}


 51%|█████     | 1350/2670 [32:53<33:06,  1.50s/it]

{'loss': 0.0149, 'grad_norm': 0.2676853835582733, 'learning_rate': 9.906191369606003e-05, 'epoch': 5.06}


 51%|█████     | 1351/2670 [32:55<34:49,  1.58s/it]

{'loss': 0.0092, 'grad_norm': 0.10026442259550095, 'learning_rate': 9.898686679174484e-05, 'epoch': 5.06}


 51%|█████     | 1352/2670 [32:56<32:49,  1.49s/it]

{'loss': 0.0471, 'grad_norm': 0.3356820344924927, 'learning_rate': 9.891181988742965e-05, 'epoch': 5.06}


 51%|█████     | 1353/2670 [32:57<32:03,  1.46s/it]

{'loss': 0.0215, 'grad_norm': 0.31157758831977844, 'learning_rate': 9.883677298311446e-05, 'epoch': 5.07}


 51%|█████     | 1354/2670 [32:59<33:06,  1.51s/it]

{'loss': 0.0084, 'grad_norm': 0.182752788066864, 'learning_rate': 9.876172607879926e-05, 'epoch': 5.07}


 51%|█████     | 1355/2670 [33:01<34:04,  1.56s/it]

{'loss': 0.0147, 'grad_norm': 0.19428682327270508, 'learning_rate': 9.868667917448405e-05, 'epoch': 5.07}


 51%|█████     | 1356/2670 [33:02<32:46,  1.50s/it]

{'loss': 0.0102, 'grad_norm': 0.1508411020040512, 'learning_rate': 9.861163227016886e-05, 'epoch': 5.08}


 51%|█████     | 1357/2670 [33:03<30:54,  1.41s/it]

{'loss': 0.0151, 'grad_norm': 0.2199164479970932, 'learning_rate': 9.853658536585366e-05, 'epoch': 5.08}


 51%|█████     | 1358/2670 [33:05<31:42,  1.45s/it]

{'loss': 0.0183, 'grad_norm': 0.2718077600002289, 'learning_rate': 9.846153846153848e-05, 'epoch': 5.09}


 51%|█████     | 1359/2670 [33:06<30:48,  1.41s/it]

{'loss': 0.0214, 'grad_norm': 0.3134619891643524, 'learning_rate': 9.838649155722327e-05, 'epoch': 5.09}


 51%|█████     | 1360/2670 [33:08<33:25,  1.53s/it]

{'loss': 0.0174, 'grad_norm': 0.2933940887451172, 'learning_rate': 9.831144465290807e-05, 'epoch': 5.09}


 51%|█████     | 1361/2670 [33:09<32:11,  1.48s/it]

{'loss': 0.0195, 'grad_norm': 0.2327781766653061, 'learning_rate': 9.823639774859288e-05, 'epoch': 5.1}


 51%|█████     | 1362/2670 [33:11<31:09,  1.43s/it]

{'loss': 0.0164, 'grad_norm': 0.31122905015945435, 'learning_rate': 9.816135084427767e-05, 'epoch': 5.1}


 51%|█████     | 1363/2670 [33:12<32:21,  1.49s/it]

{'loss': 0.004, 'grad_norm': 0.1035596951842308, 'learning_rate': 9.808630393996248e-05, 'epoch': 5.1}


 51%|█████     | 1364/2670 [33:14<31:50,  1.46s/it]

{'loss': 0.0139, 'grad_norm': 0.17042911052703857, 'learning_rate': 9.801125703564729e-05, 'epoch': 5.11}


 51%|█████     | 1365/2670 [33:15<29:56,  1.38s/it]

{'loss': 0.0199, 'grad_norm': 0.3878258764743805, 'learning_rate': 9.793621013133209e-05, 'epoch': 5.11}


 51%|█████     | 1366/2670 [33:16<29:44,  1.37s/it]

{'loss': 0.0122, 'grad_norm': 0.15467949211597443, 'learning_rate': 9.78611632270169e-05, 'epoch': 5.12}


 51%|█████     | 1367/2670 [33:18<32:04,  1.48s/it]

{'loss': 0.0159, 'grad_norm': 0.2515994608402252, 'learning_rate': 9.778611632270169e-05, 'epoch': 5.12}


 51%|█████     | 1368/2670 [33:19<32:21,  1.49s/it]

{'loss': 0.0111, 'grad_norm': 0.21935710310935974, 'learning_rate': 9.771106941838649e-05, 'epoch': 5.12}


 51%|█████▏    | 1369/2670 [33:21<30:44,  1.42s/it]

{'loss': 0.0279, 'grad_norm': 0.28515005111694336, 'learning_rate': 9.763602251407131e-05, 'epoch': 5.13}


 51%|█████▏    | 1370/2670 [33:22<30:19,  1.40s/it]

{'loss': 0.015, 'grad_norm': 0.38438475131988525, 'learning_rate': 9.75609756097561e-05, 'epoch': 5.13}


 51%|█████▏    | 1371/2670 [33:23<30:33,  1.41s/it]

{'loss': 0.0078, 'grad_norm': 0.09750310331583023, 'learning_rate': 9.748592870544091e-05, 'epoch': 5.13}


 51%|█████▏    | 1372/2670 [33:25<28:58,  1.34s/it]

{'loss': 0.0138, 'grad_norm': 0.28386828303337097, 'learning_rate': 9.741088180112571e-05, 'epoch': 5.14}


 51%|█████▏    | 1373/2670 [33:26<30:57,  1.43s/it]

{'loss': 0.0212, 'grad_norm': 0.356589138507843, 'learning_rate': 9.73358348968105e-05, 'epoch': 5.14}


 51%|█████▏    | 1374/2670 [33:28<31:15,  1.45s/it]

{'loss': 0.0066, 'grad_norm': 0.14707611501216888, 'learning_rate': 9.726078799249531e-05, 'epoch': 5.15}


 51%|█████▏    | 1375/2670 [33:29<31:18,  1.45s/it]

{'loss': 0.0176, 'grad_norm': 0.37423306703567505, 'learning_rate': 9.718574108818012e-05, 'epoch': 5.15}


 52%|█████▏    | 1376/2670 [33:31<31:06,  1.44s/it]

{'loss': 0.0059, 'grad_norm': 0.14966648817062378, 'learning_rate': 9.711069418386493e-05, 'epoch': 5.15}


 52%|█████▏    | 1377/2670 [33:32<31:52,  1.48s/it]

{'loss': 0.0298, 'grad_norm': 0.45435163378715515, 'learning_rate': 9.703564727954972e-05, 'epoch': 5.16}


 52%|█████▏    | 1378/2670 [33:34<32:27,  1.51s/it]

{'loss': 0.0285, 'grad_norm': 0.393439382314682, 'learning_rate': 9.696060037523452e-05, 'epoch': 5.16}


 52%|█████▏    | 1379/2670 [33:35<33:02,  1.54s/it]

{'loss': 0.0108, 'grad_norm': 0.24134448170661926, 'learning_rate': 9.688555347091933e-05, 'epoch': 5.16}


 52%|█████▏    | 1380/2670 [33:37<32:19,  1.50s/it]

{'loss': 0.0192, 'grad_norm': 0.41197100281715393, 'learning_rate': 9.681050656660414e-05, 'epoch': 5.17}


 52%|█████▏    | 1381/2670 [33:38<29:36,  1.38s/it]

{'loss': 0.0441, 'grad_norm': 0.5847606062889099, 'learning_rate': 9.673545966228893e-05, 'epoch': 5.17}


 52%|█████▏    | 1382/2670 [33:39<27:53,  1.30s/it]

{'loss': 0.015, 'grad_norm': 0.40095558762550354, 'learning_rate': 9.666041275797374e-05, 'epoch': 5.18}


 52%|█████▏    | 1383/2670 [33:40<29:17,  1.37s/it]

{'loss': 0.0222, 'grad_norm': 0.21546970307826996, 'learning_rate': 9.658536585365854e-05, 'epoch': 5.18}


 52%|█████▏    | 1384/2670 [33:42<29:40,  1.38s/it]

{'loss': 0.01, 'grad_norm': 0.1270536333322525, 'learning_rate': 9.651031894934335e-05, 'epoch': 5.18}


 52%|█████▏    | 1385/2670 [33:43<29:44,  1.39s/it]

{'loss': 0.0138, 'grad_norm': 0.16520723700523376, 'learning_rate': 9.643527204502814e-05, 'epoch': 5.19}


 52%|█████▏    | 1386/2670 [33:45<32:22,  1.51s/it]

{'loss': 0.0226, 'grad_norm': 0.2676343321800232, 'learning_rate': 9.636022514071295e-05, 'epoch': 5.19}


 52%|█████▏    | 1387/2670 [33:47<33:22,  1.56s/it]

{'loss': 0.0424, 'grad_norm': 0.3583899438381195, 'learning_rate': 9.628517823639776e-05, 'epoch': 5.19}


 52%|█████▏    | 1388/2670 [33:48<32:28,  1.52s/it]

{'loss': 0.0407, 'grad_norm': 0.4630476236343384, 'learning_rate': 9.621013133208255e-05, 'epoch': 5.2}


 52%|█████▏    | 1389/2670 [33:50<31:39,  1.48s/it]

{'loss': 0.0113, 'grad_norm': 0.13716070353984833, 'learning_rate': 9.613508442776736e-05, 'epoch': 5.2}


 52%|█████▏    | 1390/2670 [33:51<31:17,  1.47s/it]

{'loss': 0.0089, 'grad_norm': 0.20390008389949799, 'learning_rate': 9.606003752345216e-05, 'epoch': 5.21}


 52%|█████▏    | 1391/2670 [33:53<31:50,  1.49s/it]

{'loss': 0.0299, 'grad_norm': 0.35238486528396606, 'learning_rate': 9.598499061913697e-05, 'epoch': 5.21}


 52%|█████▏    | 1392/2670 [33:54<31:30,  1.48s/it]

{'loss': 0.0132, 'grad_norm': 0.2920192778110504, 'learning_rate': 9.590994371482178e-05, 'epoch': 5.21}


 52%|█████▏    | 1393/2670 [33:55<30:29,  1.43s/it]

{'loss': 0.0086, 'grad_norm': 0.12992238998413086, 'learning_rate': 9.583489681050657e-05, 'epoch': 5.22}


 52%|█████▏    | 1394/2670 [33:57<30:11,  1.42s/it]

{'loss': 0.0505, 'grad_norm': 0.3733450472354889, 'learning_rate': 9.575984990619138e-05, 'epoch': 5.22}


 52%|█████▏    | 1395/2670 [33:58<31:25,  1.48s/it]

{'loss': 0.0084, 'grad_norm': 0.09881480783224106, 'learning_rate': 9.568480300187618e-05, 'epoch': 5.22}


 52%|█████▏    | 1396/2670 [34:00<32:06,  1.51s/it]

{'loss': 0.0273, 'grad_norm': 0.27434059977531433, 'learning_rate': 9.560975609756097e-05, 'epoch': 5.23}


 52%|█████▏    | 1397/2670 [34:01<31:05,  1.47s/it]

{'loss': 0.0081, 'grad_norm': 0.11497567594051361, 'learning_rate': 9.553470919324578e-05, 'epoch': 5.23}


 52%|█████▏    | 1398/2670 [34:03<31:18,  1.48s/it]

{'loss': 0.0253, 'grad_norm': 0.289185494184494, 'learning_rate': 9.545966228893059e-05, 'epoch': 5.24}


 52%|█████▏    | 1399/2670 [34:04<32:32,  1.54s/it]

{'loss': 0.0072, 'grad_norm': 0.12625977396965027, 'learning_rate': 9.53846153846154e-05, 'epoch': 5.24}


 52%|█████▏    | 1400/2670 [34:06<32:24,  1.53s/it]

{'loss': 0.0085, 'grad_norm': 0.15261748433113098, 'learning_rate': 9.530956848030019e-05, 'epoch': 5.24}


 52%|█████▏    | 1401/2670 [34:07<30:43,  1.45s/it]

{'loss': 0.0288, 'grad_norm': 0.22032475471496582, 'learning_rate': 9.523452157598499e-05, 'epoch': 5.25}


 53%|█████▎    | 1402/2670 [34:09<30:47,  1.46s/it]

{'loss': 0.0151, 'grad_norm': 0.24853317439556122, 'learning_rate': 9.51594746716698e-05, 'epoch': 5.25}


 53%|█████▎    | 1403/2670 [34:10<31:44,  1.50s/it]

{'loss': 0.0207, 'grad_norm': 0.2552948296070099, 'learning_rate': 9.50844277673546e-05, 'epoch': 5.25}


 53%|█████▎    | 1404/2670 [34:12<30:15,  1.43s/it]

{'loss': 0.0289, 'grad_norm': 0.3832373023033142, 'learning_rate': 9.50093808630394e-05, 'epoch': 5.26}


 53%|█████▎    | 1405/2670 [34:13<30:38,  1.45s/it]

{'loss': 0.0105, 'grad_norm': 0.13381920754909515, 'learning_rate': 9.493433395872421e-05, 'epoch': 5.26}


 53%|█████▎    | 1406/2670 [34:15<31:04,  1.47s/it]

{'loss': 0.0247, 'grad_norm': 0.3798273801803589, 'learning_rate': 9.4859287054409e-05, 'epoch': 5.27}


 53%|█████▎    | 1407/2670 [34:16<31:54,  1.52s/it]

{'loss': 0.0262, 'grad_norm': 0.22589516639709473, 'learning_rate': 9.478424015009381e-05, 'epoch': 5.27}


 53%|█████▎    | 1408/2670 [34:18<34:16,  1.63s/it]

{'loss': 0.0252, 'grad_norm': 0.2493157833814621, 'learning_rate': 9.470919324577861e-05, 'epoch': 5.27}


 53%|█████▎    | 1409/2670 [34:19<31:52,  1.52s/it]

{'loss': 0.0061, 'grad_norm': 0.073526531457901, 'learning_rate': 9.463414634146342e-05, 'epoch': 5.28}


 53%|█████▎    | 1410/2670 [34:21<32:02,  1.53s/it]

{'loss': 0.0225, 'grad_norm': 0.36835843324661255, 'learning_rate': 9.455909943714823e-05, 'epoch': 5.28}


 53%|█████▎    | 1411/2670 [34:22<31:27,  1.50s/it]

{'loss': 0.0082, 'grad_norm': 0.2935071289539337, 'learning_rate': 9.448405253283302e-05, 'epoch': 5.28}


 53%|█████▎    | 1412/2670 [34:24<30:50,  1.47s/it]

{'loss': 0.0109, 'grad_norm': 0.11446667462587357, 'learning_rate': 9.440900562851783e-05, 'epoch': 5.29}


 53%|█████▎    | 1413/2670 [34:25<29:58,  1.43s/it]

{'loss': 0.0151, 'grad_norm': 0.12736830115318298, 'learning_rate': 9.433395872420263e-05, 'epoch': 5.29}


 53%|█████▎    | 1414/2670 [34:27<31:41,  1.51s/it]

{'loss': 0.0173, 'grad_norm': 0.23559623956680298, 'learning_rate': 9.425891181988744e-05, 'epoch': 5.3}


 53%|█████▎    | 1415/2670 [34:29<32:49,  1.57s/it]

{'loss': 0.0092, 'grad_norm': 0.14318020641803741, 'learning_rate': 9.418386491557224e-05, 'epoch': 5.3}


 53%|█████▎    | 1416/2670 [34:30<30:50,  1.48s/it]

{'loss': 0.0256, 'grad_norm': 0.31464001536369324, 'learning_rate': 9.410881801125704e-05, 'epoch': 5.3}


 53%|█████▎    | 1417/2670 [34:31<29:56,  1.43s/it]

{'loss': 0.0331, 'grad_norm': 0.30021244287490845, 'learning_rate': 9.403377110694185e-05, 'epoch': 5.31}


 53%|█████▎    | 1418/2670 [34:32<27:42,  1.33s/it]

{'loss': 0.0553, 'grad_norm': 0.5703822374343872, 'learning_rate': 9.395872420262664e-05, 'epoch': 5.31}


 53%|█████▎    | 1419/2670 [34:34<27:49,  1.33s/it]

{'loss': 0.0307, 'grad_norm': 0.34571245312690735, 'learning_rate': 9.388367729831144e-05, 'epoch': 5.31}


 53%|█████▎    | 1420/2670 [34:35<28:48,  1.38s/it]

{'loss': 0.0267, 'grad_norm': 0.22766247391700745, 'learning_rate': 9.380863039399626e-05, 'epoch': 5.32}


 53%|█████▎    | 1421/2670 [34:37<30:07,  1.45s/it]

{'loss': 0.0199, 'grad_norm': 0.4213862717151642, 'learning_rate': 9.373358348968106e-05, 'epoch': 5.32}


 53%|█████▎    | 1422/2670 [34:38<29:11,  1.40s/it]

{'loss': 0.0286, 'grad_norm': 0.3264549672603607, 'learning_rate': 9.365853658536587e-05, 'epoch': 5.33}


 53%|█████▎    | 1423/2670 [34:39<27:37,  1.33s/it]

{'loss': 0.0129, 'grad_norm': 0.19980762898921967, 'learning_rate': 9.358348968105066e-05, 'epoch': 5.33}


 53%|█████▎    | 1424/2670 [34:41<28:17,  1.36s/it]

{'loss': 0.0183, 'grad_norm': 0.4348357617855072, 'learning_rate': 9.350844277673546e-05, 'epoch': 5.33}


 53%|█████▎    | 1425/2670 [34:42<27:04,  1.30s/it]

{'loss': 0.0176, 'grad_norm': 0.17837995290756226, 'learning_rate': 9.343339587242026e-05, 'epoch': 5.34}


 53%|█████▎    | 1426/2670 [34:43<27:07,  1.31s/it]

{'loss': 0.0126, 'grad_norm': 0.23086011409759521, 'learning_rate': 9.335834896810507e-05, 'epoch': 5.34}


 53%|█████▎    | 1427/2670 [34:44<27:41,  1.34s/it]

{'loss': 0.0143, 'grad_norm': 0.19789670407772064, 'learning_rate': 9.328330206378987e-05, 'epoch': 5.34}


 53%|█████▎    | 1428/2670 [34:46<26:48,  1.29s/it]

{'loss': 0.0299, 'grad_norm': 0.3720673620700836, 'learning_rate': 9.320825515947468e-05, 'epoch': 5.35}


 54%|█████▎    | 1429/2670 [34:47<27:52,  1.35s/it]

{'loss': 0.0125, 'grad_norm': 0.346329003572464, 'learning_rate': 9.313320825515947e-05, 'epoch': 5.35}


 54%|█████▎    | 1430/2670 [34:49<29:36,  1.43s/it]

{'loss': 0.0127, 'grad_norm': 0.16523505747318268, 'learning_rate': 9.305816135084428e-05, 'epoch': 5.36}


 54%|█████▎    | 1431/2670 [34:50<29:59,  1.45s/it]

{'loss': 0.0274, 'grad_norm': 0.3890019655227661, 'learning_rate': 9.298311444652909e-05, 'epoch': 5.36}


 54%|█████▎    | 1432/2670 [34:52<29:41,  1.44s/it]

{'loss': 0.0168, 'grad_norm': 0.21822044253349304, 'learning_rate': 9.290806754221389e-05, 'epoch': 5.36}


 54%|█████▎    | 1433/2670 [34:53<30:04,  1.46s/it]

{'loss': 0.0381, 'grad_norm': 0.3402078151702881, 'learning_rate': 9.28330206378987e-05, 'epoch': 5.37}


 54%|█████▎    | 1434/2670 [34:55<30:55,  1.50s/it]

{'loss': 0.0267, 'grad_norm': 0.26570582389831543, 'learning_rate': 9.275797373358349e-05, 'epoch': 5.37}


 54%|█████▎    | 1435/2670 [34:56<30:05,  1.46s/it]

{'loss': 0.015, 'grad_norm': 0.1911773830652237, 'learning_rate': 9.26829268292683e-05, 'epoch': 5.37}


 54%|█████▍    | 1436/2670 [34:58<30:18,  1.47s/it]

{'loss': 0.0082, 'grad_norm': 0.08633757382631302, 'learning_rate': 9.26078799249531e-05, 'epoch': 5.38}


 54%|█████▍    | 1437/2670 [34:59<30:58,  1.51s/it]

{'loss': 0.0118, 'grad_norm': 0.13942883908748627, 'learning_rate': 9.25328330206379e-05, 'epoch': 5.38}


 54%|█████▍    | 1438/2670 [35:01<30:35,  1.49s/it]

{'loss': 0.0164, 'grad_norm': 0.27621570229530334, 'learning_rate': 9.245778611632271e-05, 'epoch': 5.39}


 54%|█████▍    | 1439/2670 [35:02<29:52,  1.46s/it]

{'loss': 0.0097, 'grad_norm': 0.18377649784088135, 'learning_rate': 9.238273921200751e-05, 'epoch': 5.39}


 54%|█████▍    | 1440/2670 [35:04<30:14,  1.47s/it]

{'loss': 0.0179, 'grad_norm': 0.3650909662246704, 'learning_rate': 9.230769230769232e-05, 'epoch': 5.39}


 54%|█████▍    | 1441/2670 [35:05<30:28,  1.49s/it]

{'loss': 0.0184, 'grad_norm': 0.20286685228347778, 'learning_rate': 9.223264540337711e-05, 'epoch': 5.4}


 54%|█████▍    | 1442/2670 [35:07<30:46,  1.50s/it]

{'loss': 0.0262, 'grad_norm': 0.2163102924823761, 'learning_rate': 9.215759849906192e-05, 'epoch': 5.4}


 54%|█████▍    | 1443/2670 [35:08<30:33,  1.49s/it]

{'loss': 0.0088, 'grad_norm': 0.13720658421516418, 'learning_rate': 9.208255159474673e-05, 'epoch': 5.4}


 54%|█████▍    | 1444/2670 [35:09<30:12,  1.48s/it]

{'loss': 0.02, 'grad_norm': 0.1884811669588089, 'learning_rate': 9.200750469043152e-05, 'epoch': 5.41}


 54%|█████▍    | 1445/2670 [35:11<30:44,  1.51s/it]

{'loss': 0.0185, 'grad_norm': 0.26950782537460327, 'learning_rate': 9.193245778611632e-05, 'epoch': 5.41}


 54%|█████▍    | 1446/2670 [35:12<28:50,  1.41s/it]

{'loss': 0.0183, 'grad_norm': 0.37560442090034485, 'learning_rate': 9.185741088180113e-05, 'epoch': 5.42}


 54%|█████▍    | 1447/2670 [35:13<27:14,  1.34s/it]

{'loss': 0.0236, 'grad_norm': 0.24459196627140045, 'learning_rate': 9.178236397748592e-05, 'epoch': 5.42}


 54%|█████▍    | 1448/2670 [35:15<28:31,  1.40s/it]

{'loss': 0.0181, 'grad_norm': 0.30697011947631836, 'learning_rate': 9.170731707317075e-05, 'epoch': 5.42}


 54%|█████▍    | 1449/2670 [35:16<26:21,  1.30s/it]

{'loss': 0.0456, 'grad_norm': 0.3978016674518585, 'learning_rate': 9.163227016885554e-05, 'epoch': 5.43}


 54%|█████▍    | 1450/2670 [35:17<26:04,  1.28s/it]

{'loss': 0.0082, 'grad_norm': 0.09876500070095062, 'learning_rate': 9.155722326454034e-05, 'epoch': 5.43}


 54%|█████▍    | 1451/2670 [35:19<28:10,  1.39s/it]

{'loss': 0.0272, 'grad_norm': 0.43563729524612427, 'learning_rate': 9.148217636022515e-05, 'epoch': 5.43}


 54%|█████▍    | 1452/2670 [35:20<28:46,  1.42s/it]

{'loss': 0.0228, 'grad_norm': 0.2901982069015503, 'learning_rate': 9.140712945590994e-05, 'epoch': 5.44}


 54%|█████▍    | 1453/2670 [35:22<27:47,  1.37s/it]

{'loss': 0.0085, 'grad_norm': 0.13161881268024445, 'learning_rate': 9.133208255159475e-05, 'epoch': 5.44}


 54%|█████▍    | 1454/2670 [35:23<27:06,  1.34s/it]

{'loss': 0.0265, 'grad_norm': 0.5308930277824402, 'learning_rate': 9.125703564727956e-05, 'epoch': 5.45}


 54%|█████▍    | 1455/2670 [35:25<29:09,  1.44s/it]

{'loss': 0.0109, 'grad_norm': 0.28116780519485474, 'learning_rate': 9.118198874296435e-05, 'epoch': 5.45}


 55%|█████▍    | 1456/2670 [35:26<29:22,  1.45s/it]

{'loss': 0.0287, 'grad_norm': 0.25683337450027466, 'learning_rate': 9.110694183864916e-05, 'epoch': 5.45}


 55%|█████▍    | 1457/2670 [35:28<31:39,  1.57s/it]

{'loss': 0.0206, 'grad_norm': 0.35508325695991516, 'learning_rate': 9.103189493433396e-05, 'epoch': 5.46}


 55%|█████▍    | 1458/2670 [35:29<31:26,  1.56s/it]

{'loss': 0.0184, 'grad_norm': 0.2253381460905075, 'learning_rate': 9.095684803001877e-05, 'epoch': 5.46}


 55%|█████▍    | 1459/2670 [35:31<31:12,  1.55s/it]

{'loss': 0.0255, 'grad_norm': 0.2580952048301697, 'learning_rate': 9.088180112570358e-05, 'epoch': 5.46}


 55%|█████▍    | 1460/2670 [35:32<29:36,  1.47s/it]

{'loss': 0.0179, 'grad_norm': 0.33952489495277405, 'learning_rate': 9.080675422138837e-05, 'epoch': 5.47}


 55%|█████▍    | 1461/2670 [35:34<31:03,  1.54s/it]

{'loss': 0.0207, 'grad_norm': 0.34418731927871704, 'learning_rate': 9.073170731707318e-05, 'epoch': 5.47}


 55%|█████▍    | 1462/2670 [35:35<29:59,  1.49s/it]

{'loss': 0.026, 'grad_norm': 0.28350916504859924, 'learning_rate': 9.065666041275798e-05, 'epoch': 5.48}


 55%|█████▍    | 1463/2670 [35:37<29:25,  1.46s/it]

{'loss': 0.0147, 'grad_norm': 0.2786450982093811, 'learning_rate': 9.058161350844278e-05, 'epoch': 5.48}


 55%|█████▍    | 1464/2670 [35:38<29:44,  1.48s/it]

{'loss': 0.0208, 'grad_norm': 0.29106929898262024, 'learning_rate': 9.050656660412758e-05, 'epoch': 5.48}


 55%|█████▍    | 1465/2670 [35:40<29:21,  1.46s/it]

{'loss': 0.0242, 'grad_norm': 0.2732214331626892, 'learning_rate': 9.043151969981239e-05, 'epoch': 5.49}


 55%|█████▍    | 1466/2670 [35:41<28:55,  1.44s/it]

{'loss': 0.0293, 'grad_norm': 0.21670520305633545, 'learning_rate': 9.03564727954972e-05, 'epoch': 5.49}


 55%|█████▍    | 1467/2670 [35:42<28:39,  1.43s/it]

{'loss': 0.0213, 'grad_norm': 0.2334536910057068, 'learning_rate': 9.028142589118199e-05, 'epoch': 5.49}


 55%|█████▍    | 1468/2670 [35:44<28:37,  1.43s/it]

{'loss': 0.0292, 'grad_norm': 0.29029497504234314, 'learning_rate': 9.020637898686679e-05, 'epoch': 5.5}


 55%|█████▌    | 1469/2670 [35:45<29:29,  1.47s/it]

{'loss': 0.022, 'grad_norm': 0.21924155950546265, 'learning_rate': 9.01313320825516e-05, 'epoch': 5.5}


 55%|█████▌    | 1470/2670 [35:47<29:46,  1.49s/it]

{'loss': 0.0156, 'grad_norm': 0.13839265704154968, 'learning_rate': 9.00562851782364e-05, 'epoch': 5.51}


 55%|█████▌    | 1471/2670 [35:48<28:31,  1.43s/it]

{'loss': 0.0558, 'grad_norm': 0.4972872734069824, 'learning_rate': 8.998123827392121e-05, 'epoch': 5.51}


 55%|█████▌    | 1472/2670 [35:50<30:44,  1.54s/it]

{'loss': 0.0146, 'grad_norm': 0.17263193428516388, 'learning_rate': 8.990619136960601e-05, 'epoch': 5.51}


 55%|█████▌    | 1473/2670 [35:51<29:41,  1.49s/it]

{'loss': 0.0129, 'grad_norm': 0.22272555530071259, 'learning_rate': 8.98311444652908e-05, 'epoch': 5.52}


 55%|█████▌    | 1474/2670 [35:52<27:02,  1.36s/it]

{'loss': 0.0187, 'grad_norm': 0.18380451202392578, 'learning_rate': 8.975609756097561e-05, 'epoch': 5.52}


 55%|█████▌    | 1475/2670 [35:54<26:12,  1.32s/it]

{'loss': 0.0242, 'grad_norm': 0.2645740807056427, 'learning_rate': 8.968105065666041e-05, 'epoch': 5.52}


 55%|█████▌    | 1476/2670 [35:55<26:38,  1.34s/it]

{'loss': 0.016, 'grad_norm': 0.18602149188518524, 'learning_rate': 8.960600375234522e-05, 'epoch': 5.53}


 55%|█████▌    | 1477/2670 [35:56<25:10,  1.27s/it]

{'loss': 0.0334, 'grad_norm': 0.49423953890800476, 'learning_rate': 8.953095684803003e-05, 'epoch': 5.53}


 55%|█████▌    | 1478/2670 [35:58<26:12,  1.32s/it]

{'loss': 0.0267, 'grad_norm': 0.3004031777381897, 'learning_rate': 8.945590994371482e-05, 'epoch': 5.54}


 55%|█████▌    | 1479/2670 [35:59<25:38,  1.29s/it]

{'loss': 0.034, 'grad_norm': 0.9039240479469299, 'learning_rate': 8.938086303939963e-05, 'epoch': 5.54}


 55%|█████▌    | 1480/2670 [36:00<26:48,  1.35s/it]

{'loss': 0.0339, 'grad_norm': 0.2474776804447174, 'learning_rate': 8.930581613508443e-05, 'epoch': 5.54}


 55%|█████▌    | 1481/2670 [36:02<27:46,  1.40s/it]

{'loss': 0.0204, 'grad_norm': 0.22619931399822235, 'learning_rate': 8.923076923076924e-05, 'epoch': 5.55}


 56%|█████▌    | 1482/2670 [36:03<27:52,  1.41s/it]

{'loss': 0.0152, 'grad_norm': 0.12919016182422638, 'learning_rate': 8.915572232645404e-05, 'epoch': 5.55}


 56%|█████▌    | 1483/2670 [36:05<27:50,  1.41s/it]

{'loss': 0.0124, 'grad_norm': 0.1406899094581604, 'learning_rate': 8.908067542213884e-05, 'epoch': 5.55}


 56%|█████▌    | 1484/2670 [36:06<27:06,  1.37s/it]

{'loss': 0.0065, 'grad_norm': 0.11156241595745087, 'learning_rate': 8.900562851782365e-05, 'epoch': 5.56}


 56%|█████▌    | 1485/2670 [36:08<30:44,  1.56s/it]

{'loss': 0.0175, 'grad_norm': 0.25646039843559265, 'learning_rate': 8.893058161350844e-05, 'epoch': 5.56}


 56%|█████▌    | 1486/2670 [36:10<30:31,  1.55s/it]

{'loss': 0.0215, 'grad_norm': 0.329593688249588, 'learning_rate': 8.885553470919325e-05, 'epoch': 5.57}


 56%|█████▌    | 1487/2670 [36:11<30:24,  1.54s/it]

{'loss': 0.0125, 'grad_norm': 0.2106400728225708, 'learning_rate': 8.878048780487805e-05, 'epoch': 5.57}


 56%|█████▌    | 1488/2670 [36:12<29:40,  1.51s/it]

{'loss': 0.0215, 'grad_norm': 0.20388923585414886, 'learning_rate': 8.870544090056286e-05, 'epoch': 5.57}


 56%|█████▌    | 1489/2670 [36:14<29:22,  1.49s/it]

{'loss': 0.0063, 'grad_norm': 0.06903963536024094, 'learning_rate': 8.863039399624767e-05, 'epoch': 5.58}


 56%|█████▌    | 1490/2670 [36:15<28:54,  1.47s/it]

{'loss': 0.0136, 'grad_norm': 0.18728794157505035, 'learning_rate': 8.855534709193246e-05, 'epoch': 5.58}


 56%|█████▌    | 1491/2670 [36:17<27:26,  1.40s/it]

{'loss': 0.0246, 'grad_norm': 0.21242544054985046, 'learning_rate': 8.848030018761726e-05, 'epoch': 5.58}


 56%|█████▌    | 1492/2670 [36:18<28:45,  1.46s/it]

{'loss': 0.0197, 'grad_norm': 0.31258425116539, 'learning_rate': 8.840525328330207e-05, 'epoch': 5.59}


 56%|█████▌    | 1493/2670 [36:20<28:43,  1.46s/it]

{'loss': 0.0079, 'grad_norm': 0.13896414637565613, 'learning_rate': 8.833020637898687e-05, 'epoch': 5.59}


 56%|█████▌    | 1494/2670 [36:21<28:01,  1.43s/it]

{'loss': 0.0155, 'grad_norm': 0.1626349836587906, 'learning_rate': 8.825515947467168e-05, 'epoch': 5.6}


 56%|█████▌    | 1495/2670 [36:22<27:05,  1.38s/it]

{'loss': 0.0058, 'grad_norm': 0.11051925271749496, 'learning_rate': 8.818011257035648e-05, 'epoch': 5.6}


 56%|█████▌    | 1496/2670 [36:24<27:27,  1.40s/it]

{'loss': 0.0101, 'grad_norm': 0.19961747527122498, 'learning_rate': 8.810506566604127e-05, 'epoch': 5.6}


 56%|█████▌    | 1497/2670 [36:25<27:29,  1.41s/it]

{'loss': 0.0169, 'grad_norm': 0.16892951726913452, 'learning_rate': 8.803001876172608e-05, 'epoch': 5.61}


 56%|█████▌    | 1498/2670 [36:27<28:01,  1.43s/it]

{'loss': 0.016, 'grad_norm': 0.16955377161502838, 'learning_rate': 8.795497185741088e-05, 'epoch': 5.61}


 56%|█████▌    | 1499/2670 [36:28<28:34,  1.46s/it]

{'loss': 0.0066, 'grad_norm': 0.09958434849977493, 'learning_rate': 8.78799249530957e-05, 'epoch': 5.61}


 56%|█████▌    | 1500/2670 [36:29<27:04,  1.39s/it]

{'loss': 0.0206, 'grad_norm': 0.3282098174095154, 'learning_rate': 8.78048780487805e-05, 'epoch': 5.62}


 56%|█████▌    | 1501/2670 [36:33<38:09,  1.96s/it]

{'loss': 0.0272, 'grad_norm': 0.21864619851112366, 'learning_rate': 8.772983114446529e-05, 'epoch': 5.62}


 56%|█████▋    | 1502/2670 [36:34<34:11,  1.76s/it]

{'loss': 0.0296, 'grad_norm': 0.39483746886253357, 'learning_rate': 8.76547842401501e-05, 'epoch': 5.63}


 56%|█████▋    | 1503/2670 [36:35<32:54,  1.69s/it]

{'loss': 0.0129, 'grad_norm': 0.14733269810676575, 'learning_rate': 8.75797373358349e-05, 'epoch': 5.63}


 56%|█████▋    | 1504/2670 [36:37<32:23,  1.67s/it]

{'loss': 0.0121, 'grad_norm': 0.28634610772132874, 'learning_rate': 8.75046904315197e-05, 'epoch': 5.63}


 56%|█████▋    | 1505/2670 [36:39<32:07,  1.65s/it]

{'loss': 0.0177, 'grad_norm': 0.3607477843761444, 'learning_rate': 8.742964352720451e-05, 'epoch': 5.64}


 56%|█████▋    | 1506/2670 [36:40<30:48,  1.59s/it]

{'loss': 0.0229, 'grad_norm': 0.43507668375968933, 'learning_rate': 8.735459662288931e-05, 'epoch': 5.64}


 56%|█████▋    | 1507/2670 [36:41<29:03,  1.50s/it]

{'loss': 0.0128, 'grad_norm': 0.29658374190330505, 'learning_rate': 8.727954971857412e-05, 'epoch': 5.64}


 56%|█████▋    | 1508/2670 [36:43<27:28,  1.42s/it]

{'loss': 0.0138, 'grad_norm': 0.14552424848079681, 'learning_rate': 8.720450281425891e-05, 'epoch': 5.65}


 57%|█████▋    | 1509/2670 [36:44<27:08,  1.40s/it]

{'loss': 0.024, 'grad_norm': 0.2061094492673874, 'learning_rate': 8.712945590994371e-05, 'epoch': 5.65}


 57%|█████▋    | 1510/2670 [36:45<26:43,  1.38s/it]

{'loss': 0.0158, 'grad_norm': 0.30183666944503784, 'learning_rate': 8.705440900562853e-05, 'epoch': 5.66}


 57%|█████▋    | 1511/2670 [36:47<27:58,  1.45s/it]

{'loss': 0.0107, 'grad_norm': 0.23163239657878876, 'learning_rate': 8.697936210131332e-05, 'epoch': 5.66}


 57%|█████▋    | 1512/2670 [36:48<27:46,  1.44s/it]

{'loss': 0.0208, 'grad_norm': 0.2127498835325241, 'learning_rate': 8.690431519699813e-05, 'epoch': 5.66}


 57%|█████▋    | 1513/2670 [36:50<28:08,  1.46s/it]

{'loss': 0.0147, 'grad_norm': 0.2411385178565979, 'learning_rate': 8.682926829268293e-05, 'epoch': 5.67}


 57%|█████▋    | 1514/2670 [36:51<27:25,  1.42s/it]

{'loss': 0.0147, 'grad_norm': 0.3249267339706421, 'learning_rate': 8.675422138836772e-05, 'epoch': 5.67}


 57%|█████▋    | 1515/2670 [36:53<27:44,  1.44s/it]

{'loss': 0.0145, 'grad_norm': 0.1462206095457077, 'learning_rate': 8.667917448405253e-05, 'epoch': 5.67}


 57%|█████▋    | 1516/2670 [36:54<26:54,  1.40s/it]

{'loss': 0.0349, 'grad_norm': 0.6164582967758179, 'learning_rate': 8.660412757973734e-05, 'epoch': 5.68}


 57%|█████▋    | 1517/2670 [36:56<27:41,  1.44s/it]

{'loss': 0.0125, 'grad_norm': 0.1675141602754593, 'learning_rate': 8.652908067542215e-05, 'epoch': 5.68}


 57%|█████▋    | 1518/2670 [36:57<29:36,  1.54s/it]

{'loss': 0.0248, 'grad_norm': 0.4460832178592682, 'learning_rate': 8.645403377110695e-05, 'epoch': 5.69}


 57%|█████▋    | 1519/2670 [36:58<27:10,  1.42s/it]

{'loss': 0.0302, 'grad_norm': 2.261331796646118, 'learning_rate': 8.637898686679174e-05, 'epoch': 5.69}


 57%|█████▋    | 1520/2670 [37:00<27:50,  1.45s/it]

{'loss': 0.0207, 'grad_norm': 0.36557629704475403, 'learning_rate': 8.630393996247655e-05, 'epoch': 5.69}


 57%|█████▋    | 1521/2670 [37:02<28:13,  1.47s/it]

{'loss': 0.0079, 'grad_norm': 0.15635333955287933, 'learning_rate': 8.622889305816136e-05, 'epoch': 5.7}


 57%|█████▋    | 1522/2670 [37:03<27:58,  1.46s/it]

{'loss': 0.015, 'grad_norm': 0.27085304260253906, 'learning_rate': 8.615384615384617e-05, 'epoch': 5.7}


 57%|█████▋    | 1523/2670 [37:04<26:50,  1.40s/it]

{'loss': 0.012, 'grad_norm': 0.15578444302082062, 'learning_rate': 8.607879924953096e-05, 'epoch': 5.7}


 57%|█████▋    | 1524/2670 [37:05<25:01,  1.31s/it]

{'loss': 0.0114, 'grad_norm': 0.15026871860027313, 'learning_rate': 8.600375234521576e-05, 'epoch': 5.71}


 57%|█████▋    | 1525/2670 [37:07<25:47,  1.35s/it]

{'loss': 0.019, 'grad_norm': 0.2107684463262558, 'learning_rate': 8.592870544090057e-05, 'epoch': 5.71}


 57%|█████▋    | 1526/2670 [37:08<26:01,  1.36s/it]

{'loss': 0.0142, 'grad_norm': 0.4547654390335083, 'learning_rate': 8.585365853658536e-05, 'epoch': 5.72}


 57%|█████▋    | 1527/2670 [37:10<27:05,  1.42s/it]

{'loss': 0.0095, 'grad_norm': 0.15244998037815094, 'learning_rate': 8.577861163227017e-05, 'epoch': 5.72}


 57%|█████▋    | 1528/2670 [37:11<26:23,  1.39s/it]

{'loss': 0.0209, 'grad_norm': 0.23090893030166626, 'learning_rate': 8.570356472795498e-05, 'epoch': 5.72}


 57%|█████▋    | 1529/2670 [37:13<27:41,  1.46s/it]

{'loss': 0.0178, 'grad_norm': 0.3294438421726227, 'learning_rate': 8.562851782363978e-05, 'epoch': 5.73}


 57%|█████▋    | 1530/2670 [37:14<28:55,  1.52s/it]

{'loss': 0.0111, 'grad_norm': 0.15813219547271729, 'learning_rate': 8.555347091932458e-05, 'epoch': 5.73}


 57%|█████▋    | 1531/2670 [37:16<29:23,  1.55s/it]

{'loss': 0.0058, 'grad_norm': 0.08892594277858734, 'learning_rate': 8.547842401500938e-05, 'epoch': 5.73}


 57%|█████▋    | 1532/2670 [37:17<28:38,  1.51s/it]

{'loss': 0.016, 'grad_norm': 0.19362063705921173, 'learning_rate': 8.540337711069419e-05, 'epoch': 5.74}


 57%|█████▋    | 1533/2670 [37:19<28:48,  1.52s/it]

{'loss': 0.0249, 'grad_norm': 0.5670633912086487, 'learning_rate': 8.5328330206379e-05, 'epoch': 5.74}


 57%|█████▋    | 1534/2670 [37:20<28:41,  1.52s/it]

{'loss': 0.0249, 'grad_norm': 0.2563028633594513, 'learning_rate': 8.525328330206379e-05, 'epoch': 5.75}


 57%|█████▋    | 1535/2670 [37:22<28:44,  1.52s/it]

{'loss': 0.0168, 'grad_norm': 0.24774406850337982, 'learning_rate': 8.51782363977486e-05, 'epoch': 5.75}


 58%|█████▊    | 1536/2670 [37:23<28:08,  1.49s/it]

{'loss': 0.0164, 'grad_norm': 0.19436460733413696, 'learning_rate': 8.51031894934334e-05, 'epoch': 5.75}


 58%|█████▊    | 1537/2670 [37:25<27:54,  1.48s/it]

{'loss': 0.0081, 'grad_norm': 0.0952346920967102, 'learning_rate': 8.502814258911819e-05, 'epoch': 5.76}


 58%|█████▊    | 1538/2670 [37:26<27:16,  1.45s/it]

{'loss': 0.0284, 'grad_norm': 0.21666347980499268, 'learning_rate': 8.4953095684803e-05, 'epoch': 5.76}


 58%|█████▊    | 1539/2670 [37:27<26:22,  1.40s/it]

{'loss': 0.0148, 'grad_norm': 0.15377259254455566, 'learning_rate': 8.487804878048781e-05, 'epoch': 5.76}


 58%|█████▊    | 1540/2670 [37:29<26:04,  1.38s/it]

{'loss': 0.0259, 'grad_norm': 0.34811681509017944, 'learning_rate': 8.480300187617262e-05, 'epoch': 5.77}


 58%|█████▊    | 1541/2670 [37:30<27:12,  1.45s/it]

{'loss': 0.0172, 'grad_norm': 0.287230908870697, 'learning_rate': 8.472795497185741e-05, 'epoch': 5.77}


 58%|█████▊    | 1542/2670 [37:32<27:35,  1.47s/it]

{'loss': 0.052, 'grad_norm': 0.3600176274776459, 'learning_rate': 8.465290806754221e-05, 'epoch': 5.78}


 58%|█████▊    | 1543/2670 [37:34<29:13,  1.56s/it]

{'loss': 0.0084, 'grad_norm': 0.12831319868564606, 'learning_rate': 8.457786116322702e-05, 'epoch': 5.78}


 58%|█████▊    | 1544/2670 [37:35<28:52,  1.54s/it]

{'loss': 0.0186, 'grad_norm': 0.22920486330986023, 'learning_rate': 8.450281425891183e-05, 'epoch': 5.78}


 58%|█████▊    | 1545/2670 [37:37<28:27,  1.52s/it]

{'loss': 0.0214, 'grad_norm': 0.2596891522407532, 'learning_rate': 8.442776735459664e-05, 'epoch': 5.79}


 58%|█████▊    | 1546/2670 [37:38<27:36,  1.47s/it]

{'loss': 0.0128, 'grad_norm': 0.23946474492549896, 'learning_rate': 8.435272045028143e-05, 'epoch': 5.79}


 58%|█████▊    | 1547/2670 [37:39<26:59,  1.44s/it]

{'loss': 0.015, 'grad_norm': 0.25682878494262695, 'learning_rate': 8.427767354596623e-05, 'epoch': 5.79}


 58%|█████▊    | 1548/2670 [37:41<28:29,  1.52s/it]

{'loss': 0.0189, 'grad_norm': 0.20072437822818756, 'learning_rate': 8.420262664165104e-05, 'epoch': 5.8}


 58%|█████▊    | 1549/2670 [37:43<28:14,  1.51s/it]

{'loss': 0.0251, 'grad_norm': 0.31890547275543213, 'learning_rate': 8.412757973733583e-05, 'epoch': 5.8}


 58%|█████▊    | 1550/2670 [37:44<28:59,  1.55s/it]

{'loss': 0.0196, 'grad_norm': 0.2121955156326294, 'learning_rate': 8.405253283302065e-05, 'epoch': 5.81}


 58%|█████▊    | 1551/2670 [37:46<28:03,  1.50s/it]

{'loss': 0.0144, 'grad_norm': 0.21424978971481323, 'learning_rate': 8.397748592870545e-05, 'epoch': 5.81}


 58%|█████▊    | 1552/2670 [37:47<27:52,  1.50s/it]

{'loss': 0.0143, 'grad_norm': 0.12458554655313492, 'learning_rate': 8.390243902439024e-05, 'epoch': 5.81}


 58%|█████▊    | 1553/2670 [37:49<27:21,  1.47s/it]

{'loss': 0.0173, 'grad_norm': 0.2388649731874466, 'learning_rate': 8.382739212007505e-05, 'epoch': 5.82}


 58%|█████▊    | 1554/2670 [37:50<27:41,  1.49s/it]

{'loss': 0.0277, 'grad_norm': 0.28293660283088684, 'learning_rate': 8.375234521575985e-05, 'epoch': 5.82}


 58%|█████▊    | 1555/2670 [37:52<27:32,  1.48s/it]

{'loss': 0.0298, 'grad_norm': 0.31501665711402893, 'learning_rate': 8.367729831144466e-05, 'epoch': 5.82}


 58%|█████▊    | 1556/2670 [37:53<28:16,  1.52s/it]

{'loss': 0.0136, 'grad_norm': 0.18552617728710175, 'learning_rate': 8.360225140712947e-05, 'epoch': 5.83}


 58%|█████▊    | 1557/2670 [37:55<28:42,  1.55s/it]

{'loss': 0.0047, 'grad_norm': 0.07912012189626694, 'learning_rate': 8.352720450281426e-05, 'epoch': 5.83}


 58%|█████▊    | 1558/2670 [37:56<29:44,  1.60s/it]

{'loss': 0.021, 'grad_norm': 0.22856134176254272, 'learning_rate': 8.345215759849907e-05, 'epoch': 5.84}


 58%|█████▊    | 1559/2670 [37:58<30:33,  1.65s/it]

{'loss': 0.0292, 'grad_norm': 0.2510456442832947, 'learning_rate': 8.337711069418387e-05, 'epoch': 5.84}


 58%|█████▊    | 1560/2670 [38:00<30:27,  1.65s/it]

{'loss': 0.023, 'grad_norm': 0.3249055743217468, 'learning_rate': 8.330206378986866e-05, 'epoch': 5.84}


 58%|█████▊    | 1561/2670 [38:01<29:49,  1.61s/it]

{'loss': 0.0275, 'grad_norm': 0.4222799837589264, 'learning_rate': 8.322701688555348e-05, 'epoch': 5.85}


 59%|█████▊    | 1562/2670 [38:03<31:04,  1.68s/it]

{'loss': 0.0102, 'grad_norm': 0.11219021677970886, 'learning_rate': 8.315196998123828e-05, 'epoch': 5.85}


 59%|█████▊    | 1563/2670 [38:05<30:10,  1.64s/it]

{'loss': 0.0201, 'grad_norm': 0.28510338068008423, 'learning_rate': 8.307692307692309e-05, 'epoch': 5.85}


 59%|█████▊    | 1564/2670 [38:06<28:25,  1.54s/it]

{'loss': 0.0182, 'grad_norm': 0.2072124183177948, 'learning_rate': 8.300187617260788e-05, 'epoch': 5.86}


 59%|█████▊    | 1565/2670 [38:08<27:43,  1.51s/it]

{'loss': 0.0253, 'grad_norm': 0.22251975536346436, 'learning_rate': 8.292682926829268e-05, 'epoch': 5.86}


 59%|█████▊    | 1566/2670 [38:09<27:42,  1.51s/it]

{'loss': 0.0181, 'grad_norm': 0.3981642723083496, 'learning_rate': 8.285178236397749e-05, 'epoch': 5.87}


 59%|█████▊    | 1567/2670 [38:11<27:42,  1.51s/it]

{'loss': 0.0156, 'grad_norm': 0.28546738624572754, 'learning_rate': 8.27767354596623e-05, 'epoch': 5.87}


 59%|█████▊    | 1568/2670 [38:12<28:53,  1.57s/it]

{'loss': 0.0213, 'grad_norm': 0.3061770498752594, 'learning_rate': 8.27016885553471e-05, 'epoch': 5.87}


 59%|█████▉    | 1569/2670 [38:13<26:57,  1.47s/it]

{'loss': 0.0116, 'grad_norm': 0.20893919467926025, 'learning_rate': 8.26266416510319e-05, 'epoch': 5.88}


 59%|█████▉    | 1570/2670 [38:15<26:25,  1.44s/it]

{'loss': 0.0262, 'grad_norm': 0.32823100686073303, 'learning_rate': 8.25515947467167e-05, 'epoch': 5.88}


 59%|█████▉    | 1571/2670 [38:16<26:33,  1.45s/it]

{'loss': 0.0411, 'grad_norm': 0.3357205092906952, 'learning_rate': 8.24765478424015e-05, 'epoch': 5.88}


 59%|█████▉    | 1572/2670 [38:18<26:57,  1.47s/it]

{'loss': 0.0104, 'grad_norm': 0.266595721244812, 'learning_rate': 8.240150093808631e-05, 'epoch': 5.89}


 59%|█████▉    | 1573/2670 [38:20<28:49,  1.58s/it]

{'loss': 0.0286, 'grad_norm': 0.7596198320388794, 'learning_rate': 8.232645403377111e-05, 'epoch': 5.89}


 59%|█████▉    | 1574/2670 [38:21<27:12,  1.49s/it]

{'loss': 0.029, 'grad_norm': 0.22445903718471527, 'learning_rate': 8.225140712945592e-05, 'epoch': 5.9}


 59%|█████▉    | 1575/2670 [38:22<26:38,  1.46s/it]

{'loss': 0.0184, 'grad_norm': 0.2780212461948395, 'learning_rate': 8.217636022514071e-05, 'epoch': 5.9}


 59%|█████▉    | 1576/2670 [38:24<26:25,  1.45s/it]

{'loss': 0.0145, 'grad_norm': 0.12076184898614883, 'learning_rate': 8.210131332082552e-05, 'epoch': 5.9}


 59%|█████▉    | 1577/2670 [38:25<27:23,  1.50s/it]

{'loss': 0.0213, 'grad_norm': 0.3609505891799927, 'learning_rate': 8.202626641651032e-05, 'epoch': 5.91}


 59%|█████▉    | 1578/2670 [38:27<28:22,  1.56s/it]

{'loss': 0.0201, 'grad_norm': 0.24136875569820404, 'learning_rate': 8.195121951219513e-05, 'epoch': 5.91}


 59%|█████▉    | 1579/2670 [38:28<26:34,  1.46s/it]

{'loss': 0.0216, 'grad_norm': 0.29329627752304077, 'learning_rate': 8.187617260787993e-05, 'epoch': 5.91}


 59%|█████▉    | 1580/2670 [38:30<25:24,  1.40s/it]

{'loss': 0.0095, 'grad_norm': 0.13449446856975555, 'learning_rate': 8.180112570356473e-05, 'epoch': 5.92}


 59%|█████▉    | 1581/2670 [38:31<25:39,  1.41s/it]

{'loss': 0.0235, 'grad_norm': 0.2581024467945099, 'learning_rate': 8.172607879924954e-05, 'epoch': 5.92}


 59%|█████▉    | 1582/2670 [38:33<26:10,  1.44s/it]

{'loss': 0.0148, 'grad_norm': 0.2466321587562561, 'learning_rate': 8.165103189493433e-05, 'epoch': 5.93}


 59%|█████▉    | 1583/2670 [38:34<25:28,  1.41s/it]

{'loss': 0.0185, 'grad_norm': 0.2688828706741333, 'learning_rate': 8.157598499061914e-05, 'epoch': 5.93}


 59%|█████▉    | 1584/2670 [38:35<25:21,  1.40s/it]

{'loss': 0.0203, 'grad_norm': 0.168142631649971, 'learning_rate': 8.150093808630395e-05, 'epoch': 5.93}


 59%|█████▉    | 1585/2670 [38:37<25:29,  1.41s/it]

{'loss': 0.018, 'grad_norm': 0.31424030661582947, 'learning_rate': 8.142589118198875e-05, 'epoch': 5.94}


 59%|█████▉    | 1586/2670 [38:38<25:32,  1.41s/it]

{'loss': 0.0215, 'grad_norm': 0.1611519604921341, 'learning_rate': 8.135084427767356e-05, 'epoch': 5.94}


 59%|█████▉    | 1587/2670 [38:40<26:18,  1.46s/it]

{'loss': 0.0182, 'grad_norm': 0.3066765367984772, 'learning_rate': 8.127579737335835e-05, 'epoch': 5.94}


 59%|█████▉    | 1588/2670 [38:41<27:06,  1.50s/it]

{'loss': 0.0144, 'grad_norm': 0.16478534042835236, 'learning_rate': 8.120075046904315e-05, 'epoch': 5.95}


 60%|█████▉    | 1589/2670 [38:43<27:53,  1.55s/it]

{'loss': 0.0206, 'grad_norm': 0.33206361532211304, 'learning_rate': 8.112570356472796e-05, 'epoch': 5.95}


 60%|█████▉    | 1590/2670 [38:44<27:23,  1.52s/it]

{'loss': 0.0143, 'grad_norm': 0.1693791002035141, 'learning_rate': 8.105065666041276e-05, 'epoch': 5.96}


 60%|█████▉    | 1591/2670 [38:46<27:39,  1.54s/it]

{'loss': 0.0085, 'grad_norm': 0.10340838134288788, 'learning_rate': 8.097560975609757e-05, 'epoch': 5.96}


 60%|█████▉    | 1592/2670 [38:47<25:59,  1.45s/it]

{'loss': 0.0208, 'grad_norm': 0.19747157394886017, 'learning_rate': 8.090056285178237e-05, 'epoch': 5.96}


 60%|█████▉    | 1593/2670 [38:49<25:37,  1.43s/it]

{'loss': 0.0182, 'grad_norm': 0.3343648612499237, 'learning_rate': 8.082551594746716e-05, 'epoch': 5.97}


 60%|█████▉    | 1594/2670 [38:50<26:05,  1.46s/it]

{'loss': 0.0295, 'grad_norm': 0.44278186559677124, 'learning_rate': 8.075046904315197e-05, 'epoch': 5.97}


 60%|█████▉    | 1595/2670 [38:52<26:28,  1.48s/it]

{'loss': 0.0137, 'grad_norm': 0.28143060207366943, 'learning_rate': 8.067542213883678e-05, 'epoch': 5.97}


 60%|█████▉    | 1596/2670 [38:53<28:03,  1.57s/it]

{'loss': 0.0363, 'grad_norm': 0.45301732420921326, 'learning_rate': 8.060037523452158e-05, 'epoch': 5.98}


 60%|█████▉    | 1597/2670 [38:55<25:53,  1.45s/it]

{'loss': 0.0159, 'grad_norm': 0.20684009790420532, 'learning_rate': 8.052532833020639e-05, 'epoch': 5.98}


 60%|█████▉    | 1598/2670 [38:56<23:55,  1.34s/it]

{'loss': 0.0372, 'grad_norm': 0.5907332897186279, 'learning_rate': 8.045028142589118e-05, 'epoch': 5.99}


 60%|█████▉    | 1599/2670 [38:58<27:08,  1.52s/it]

{'loss': 0.0123, 'grad_norm': 0.16104595363140106, 'learning_rate': 8.037523452157599e-05, 'epoch': 5.99}


 60%|█████▉    | 1600/2670 [38:59<25:22,  1.42s/it]

{'loss': 0.0107, 'grad_norm': 0.12644460797309875, 'learning_rate': 8.030018761726078e-05, 'epoch': 5.99}


 60%|█████▉    | 1601/2670 [39:00<24:44,  1.39s/it]

{'loss': 0.0322, 'grad_norm': 0.5401408672332764, 'learning_rate': 8.02251407129456e-05, 'epoch': 6.0}


 60%|██████    | 1602/2670 [39:02<25:26,  1.43s/it]

{'loss': 0.0674, 'grad_norm': 0.635743260383606, 'learning_rate': 8.01500938086304e-05, 'epoch': 6.0}


 60%|██████    | 1603/2670 [39:03<24:50,  1.40s/it]

{'loss': 0.0063, 'grad_norm': 0.07349958270788193, 'learning_rate': 8.00750469043152e-05, 'epoch': 6.0}


 60%|██████    | 1604/2670 [39:04<24:24,  1.37s/it]

{'loss': 0.0133, 'grad_norm': 0.19343240559101105, 'learning_rate': 8e-05, 'epoch': 6.01}


 60%|██████    | 1605/2670 [39:06<25:10,  1.42s/it]

{'loss': 0.009, 'grad_norm': 0.15985628962516785, 'learning_rate': 7.99249530956848e-05, 'epoch': 6.01}


 60%|██████    | 1606/2670 [39:07<25:14,  1.42s/it]

{'loss': 0.0116, 'grad_norm': 0.1553405225276947, 'learning_rate': 7.984990619136961e-05, 'epoch': 6.01}


 60%|██████    | 1607/2670 [39:09<25:24,  1.43s/it]

{'loss': 0.0056, 'grad_norm': 0.058803491294384, 'learning_rate': 7.977485928705442e-05, 'epoch': 6.02}


 60%|██████    | 1608/2670 [39:10<25:55,  1.46s/it]

{'loss': 0.0127, 'grad_norm': 0.09809516370296478, 'learning_rate': 7.969981238273921e-05, 'epoch': 6.02}


 60%|██████    | 1609/2670 [39:12<27:16,  1.54s/it]

{'loss': 0.006, 'grad_norm': 0.15752506256103516, 'learning_rate': 7.962476547842402e-05, 'epoch': 6.03}


 60%|██████    | 1610/2670 [39:14<27:59,  1.58s/it]

{'loss': 0.0105, 'grad_norm': 0.20844674110412598, 'learning_rate': 7.954971857410882e-05, 'epoch': 6.03}


 60%|██████    | 1611/2670 [39:15<25:36,  1.45s/it]

{'loss': 0.0054, 'grad_norm': 0.09604202210903168, 'learning_rate': 7.947467166979363e-05, 'epoch': 6.03}


 60%|██████    | 1612/2670 [39:16<26:19,  1.49s/it]

{'loss': 0.0113, 'grad_norm': 0.1984594315290451, 'learning_rate': 7.939962476547844e-05, 'epoch': 6.04}


 60%|██████    | 1613/2670 [39:18<26:54,  1.53s/it]

{'loss': 0.0128, 'grad_norm': 0.22363629937171936, 'learning_rate': 7.932457786116323e-05, 'epoch': 6.04}


 60%|██████    | 1614/2670 [39:19<24:00,  1.36s/it]

{'loss': 0.0376, 'grad_norm': 0.5215232372283936, 'learning_rate': 7.924953095684804e-05, 'epoch': 6.04}


 60%|██████    | 1615/2670 [39:20<22:52,  1.30s/it]

{'loss': 0.0085, 'grad_norm': 0.11815076321363449, 'learning_rate': 7.917448405253284e-05, 'epoch': 6.05}


 61%|██████    | 1616/2670 [39:21<22:23,  1.28s/it]

{'loss': 0.0097, 'grad_norm': 0.09960027039051056, 'learning_rate': 7.909943714821763e-05, 'epoch': 6.05}


 61%|██████    | 1617/2670 [39:23<23:05,  1.32s/it]

{'loss': 0.0156, 'grad_norm': 0.28402677178382874, 'learning_rate': 7.902439024390244e-05, 'epoch': 6.06}


 61%|██████    | 1618/2670 [39:24<24:38,  1.41s/it]

{'loss': 0.0069, 'grad_norm': 0.09832191467285156, 'learning_rate': 7.894934333958725e-05, 'epoch': 6.06}


 61%|██████    | 1619/2670 [39:26<25:21,  1.45s/it]

{'loss': 0.0197, 'grad_norm': 0.14159180223941803, 'learning_rate': 7.887429643527204e-05, 'epoch': 6.06}


 61%|██████    | 1620/2670 [39:27<25:11,  1.44s/it]

{'loss': 0.0065, 'grad_norm': 0.08527175337076187, 'learning_rate': 7.879924953095685e-05, 'epoch': 6.07}


 61%|██████    | 1621/2670 [39:29<25:13,  1.44s/it]

{'loss': 0.0044, 'grad_norm': 0.07336462289094925, 'learning_rate': 7.872420262664165e-05, 'epoch': 6.07}


 61%|██████    | 1622/2670 [39:30<26:30,  1.52s/it]

{'loss': 0.0056, 'grad_norm': 0.11909867823123932, 'learning_rate': 7.864915572232646e-05, 'epoch': 6.07}


 61%|██████    | 1623/2670 [39:32<25:54,  1.48s/it]

{'loss': 0.0146, 'grad_norm': 0.24497392773628235, 'learning_rate': 7.857410881801127e-05, 'epoch': 6.08}


 61%|██████    | 1624/2670 [39:33<26:13,  1.50s/it]

{'loss': 0.0094, 'grad_norm': 0.18728435039520264, 'learning_rate': 7.849906191369606e-05, 'epoch': 6.08}


 61%|██████    | 1625/2670 [39:35<24:04,  1.38s/it]

{'loss': 0.0127, 'grad_norm': 0.12096396833658218, 'learning_rate': 7.842401500938087e-05, 'epoch': 6.09}


 61%|██████    | 1626/2670 [39:36<23:32,  1.35s/it]

{'loss': 0.0198, 'grad_norm': 0.25699713826179504, 'learning_rate': 7.834896810506567e-05, 'epoch': 6.09}


 61%|██████    | 1627/2670 [39:38<25:36,  1.47s/it]

{'loss': 0.0057, 'grad_norm': 0.09653349965810776, 'learning_rate': 7.827392120075047e-05, 'epoch': 6.09}


 61%|██████    | 1628/2670 [39:39<26:50,  1.55s/it]

{'loss': 0.0048, 'grad_norm': 0.09463624656200409, 'learning_rate': 7.819887429643527e-05, 'epoch': 6.1}


 61%|██████    | 1629/2670 [39:41<25:52,  1.49s/it]

{'loss': 0.0164, 'grad_norm': 0.32694900035858154, 'learning_rate': 7.812382739212008e-05, 'epoch': 6.1}


 61%|██████    | 1630/2670 [39:42<24:38,  1.42s/it]

{'loss': 0.0064, 'grad_norm': 0.07256990671157837, 'learning_rate': 7.804878048780489e-05, 'epoch': 6.1}


 61%|██████    | 1631/2670 [39:43<23:41,  1.37s/it]

{'loss': 0.0046, 'grad_norm': 0.07050886750221252, 'learning_rate': 7.797373358348968e-05, 'epoch': 6.11}


 61%|██████    | 1632/2670 [39:45<25:02,  1.45s/it]

{'loss': 0.0082, 'grad_norm': 0.1600005328655243, 'learning_rate': 7.789868667917449e-05, 'epoch': 6.11}


 61%|██████    | 1633/2670 [39:46<26:18,  1.52s/it]

{'loss': 0.0102, 'grad_norm': 0.16951487958431244, 'learning_rate': 7.782363977485929e-05, 'epoch': 6.12}


 61%|██████    | 1634/2670 [39:48<26:09,  1.52s/it]

{'loss': 0.0039, 'grad_norm': 0.052793119102716446, 'learning_rate': 7.77485928705441e-05, 'epoch': 6.12}


 61%|██████    | 1635/2670 [39:49<25:14,  1.46s/it]

{'loss': 0.0164, 'grad_norm': 3.1324751377105713, 'learning_rate': 7.76735459662289e-05, 'epoch': 6.12}


 61%|██████▏   | 1636/2670 [39:51<25:02,  1.45s/it]

{'loss': 0.0114, 'grad_norm': 0.16178904473781586, 'learning_rate': 7.75984990619137e-05, 'epoch': 6.13}


 61%|██████▏   | 1637/2670 [39:52<24:14,  1.41s/it]

{'loss': 0.0051, 'grad_norm': 0.23646417260169983, 'learning_rate': 7.75234521575985e-05, 'epoch': 6.13}


 61%|██████▏   | 1638/2670 [39:54<25:38,  1.49s/it]

{'loss': 0.0084, 'grad_norm': 0.2716037333011627, 'learning_rate': 7.74484052532833e-05, 'epoch': 6.13}


 61%|██████▏   | 1639/2670 [39:55<26:14,  1.53s/it]

{'loss': 0.0038, 'grad_norm': 0.07167454808950424, 'learning_rate': 7.73733583489681e-05, 'epoch': 6.14}


 61%|██████▏   | 1640/2670 [39:57<24:37,  1.43s/it]

{'loss': 0.0141, 'grad_norm': 0.483234167098999, 'learning_rate': 7.729831144465292e-05, 'epoch': 6.14}


 61%|██████▏   | 1641/2670 [39:58<24:08,  1.41s/it]

{'loss': 0.0271, 'grad_norm': 0.25829318165779114, 'learning_rate': 7.722326454033772e-05, 'epoch': 6.15}


 61%|██████▏   | 1642/2670 [39:59<24:50,  1.45s/it]

{'loss': 0.0125, 'grad_norm': 0.1786007583141327, 'learning_rate': 7.714821763602251e-05, 'epoch': 6.15}


 62%|██████▏   | 1643/2670 [40:01<25:31,  1.49s/it]

{'loss': 0.0292, 'grad_norm': 0.2839798331260681, 'learning_rate': 7.707317073170732e-05, 'epoch': 6.15}


 62%|██████▏   | 1644/2670 [40:03<25:45,  1.51s/it]

{'loss': 0.0113, 'grad_norm': 0.12556050717830658, 'learning_rate': 7.699812382739212e-05, 'epoch': 6.16}


 62%|██████▏   | 1645/2670 [40:04<24:22,  1.43s/it]

{'loss': 0.0042, 'grad_norm': 0.17883247137069702, 'learning_rate': 7.692307692307693e-05, 'epoch': 6.16}


 62%|██████▏   | 1646/2670 [40:05<24:17,  1.42s/it]

{'loss': 0.0465, 'grad_norm': 0.5083239078521729, 'learning_rate': 7.684803001876173e-05, 'epoch': 6.16}


 62%|██████▏   | 1647/2670 [40:07<25:15,  1.48s/it]

{'loss': 0.0097, 'grad_norm': 0.10646844655275345, 'learning_rate': 7.677298311444653e-05, 'epoch': 6.17}


 62%|██████▏   | 1648/2670 [40:08<25:20,  1.49s/it]

{'loss': 0.0196, 'grad_norm': 0.1401016265153885, 'learning_rate': 7.669793621013134e-05, 'epoch': 6.17}


 62%|██████▏   | 1649/2670 [40:10<25:22,  1.49s/it]

{'loss': 0.0051, 'grad_norm': 0.12095263600349426, 'learning_rate': 7.662288930581613e-05, 'epoch': 6.18}


 62%|██████▏   | 1650/2670 [40:11<24:35,  1.45s/it]

{'loss': 0.0138, 'grad_norm': 0.11084200441837311, 'learning_rate': 7.654784240150094e-05, 'epoch': 6.18}


 62%|██████▏   | 1651/2670 [40:12<23:06,  1.36s/it]

{'loss': 0.0216, 'grad_norm': 0.16204573214054108, 'learning_rate': 7.647279549718575e-05, 'epoch': 6.18}


 62%|██████▏   | 1652/2670 [40:14<24:15,  1.43s/it]

{'loss': 0.0198, 'grad_norm': 0.3422932028770447, 'learning_rate': 7.639774859287055e-05, 'epoch': 6.19}


 62%|██████▏   | 1653/2670 [40:15<24:36,  1.45s/it]

{'loss': 0.0208, 'grad_norm': 0.23672422766685486, 'learning_rate': 7.632270168855536e-05, 'epoch': 6.19}


 62%|██████▏   | 1654/2670 [40:17<24:37,  1.45s/it]

{'loss': 0.0139, 'grad_norm': 0.15893776714801788, 'learning_rate': 7.624765478424015e-05, 'epoch': 6.19}


 62%|██████▏   | 1655/2670 [40:18<24:44,  1.46s/it]

{'loss': 0.005, 'grad_norm': 0.11404293030500412, 'learning_rate': 7.617260787992496e-05, 'epoch': 6.2}


 62%|██████▏   | 1656/2670 [40:20<25:26,  1.51s/it]

{'loss': 0.0095, 'grad_norm': 0.13003948330879211, 'learning_rate': 7.609756097560976e-05, 'epoch': 6.2}


 62%|██████▏   | 1657/2670 [40:21<24:35,  1.46s/it]

{'loss': 0.0162, 'grad_norm': 0.16879230737686157, 'learning_rate': 7.602251407129456e-05, 'epoch': 6.21}


 62%|██████▏   | 1658/2670 [40:23<26:01,  1.54s/it]

{'loss': 0.0126, 'grad_norm': 0.20676441490650177, 'learning_rate': 7.594746716697937e-05, 'epoch': 6.21}


 62%|██████▏   | 1659/2670 [40:25<26:02,  1.55s/it]

{'loss': 0.0094, 'grad_norm': 0.11186865717172623, 'learning_rate': 7.587242026266417e-05, 'epoch': 6.21}


 62%|██████▏   | 1660/2670 [40:26<25:55,  1.54s/it]

{'loss': 0.011, 'grad_norm': 0.32624340057373047, 'learning_rate': 7.579737335834896e-05, 'epoch': 6.22}


 62%|██████▏   | 1661/2670 [40:28<24:53,  1.48s/it]

{'loss': 0.0112, 'grad_norm': 0.11610473692417145, 'learning_rate': 7.572232645403377e-05, 'epoch': 6.22}


 62%|██████▏   | 1662/2670 [40:29<25:23,  1.51s/it]

{'loss': 0.0038, 'grad_norm': 0.051549624651670456, 'learning_rate': 7.564727954971858e-05, 'epoch': 6.22}


 62%|██████▏   | 1663/2670 [40:31<25:01,  1.49s/it]

{'loss': 0.0093, 'grad_norm': 0.15956932306289673, 'learning_rate': 7.557223264540339e-05, 'epoch': 6.23}


 62%|██████▏   | 1664/2670 [40:32<25:06,  1.50s/it]

{'loss': 0.016, 'grad_norm': 0.15833592414855957, 'learning_rate': 7.549718574108819e-05, 'epoch': 6.23}


 62%|██████▏   | 1665/2670 [40:33<24:44,  1.48s/it]

{'loss': 0.0088, 'grad_norm': 0.5742579102516174, 'learning_rate': 7.542213883677298e-05, 'epoch': 6.24}


 62%|██████▏   | 1666/2670 [40:36<27:57,  1.67s/it]

{'loss': 0.0224, 'grad_norm': 0.25632455945014954, 'learning_rate': 7.534709193245779e-05, 'epoch': 6.24}


 62%|██████▏   | 1667/2670 [40:37<26:17,  1.57s/it]

{'loss': 0.0053, 'grad_norm': 0.07847350835800171, 'learning_rate': 7.527204502814259e-05, 'epoch': 6.24}


 62%|██████▏   | 1668/2670 [40:38<25:41,  1.54s/it]

{'loss': 0.0093, 'grad_norm': 0.1289823353290558, 'learning_rate': 7.51969981238274e-05, 'epoch': 6.25}


 63%|██████▎   | 1669/2670 [40:40<23:54,  1.43s/it]

{'loss': 0.007, 'grad_norm': 0.11626964807510376, 'learning_rate': 7.51219512195122e-05, 'epoch': 6.25}


 63%|██████▎   | 1670/2670 [40:41<23:50,  1.43s/it]

{'loss': 0.0103, 'grad_norm': 0.11593220382928848, 'learning_rate': 7.5046904315197e-05, 'epoch': 6.25}


 63%|██████▎   | 1671/2670 [40:43<25:45,  1.55s/it]

{'loss': 0.0222, 'grad_norm': 0.4753117859363556, 'learning_rate': 7.497185741088181e-05, 'epoch': 6.26}


 63%|██████▎   | 1672/2670 [40:44<26:08,  1.57s/it]

{'loss': 0.0079, 'grad_norm': 0.2043744921684265, 'learning_rate': 7.48968105065666e-05, 'epoch': 6.26}


 63%|██████▎   | 1673/2670 [40:46<25:04,  1.51s/it]

{'loss': 0.0113, 'grad_norm': 0.10810667276382446, 'learning_rate': 7.482176360225141e-05, 'epoch': 6.27}


 63%|██████▎   | 1674/2670 [40:47<24:41,  1.49s/it]

{'loss': 0.0065, 'grad_norm': 0.06782764196395874, 'learning_rate': 7.474671669793622e-05, 'epoch': 6.27}


 63%|██████▎   | 1675/2670 [40:49<23:44,  1.43s/it]

{'loss': 0.0062, 'grad_norm': 0.066738061606884, 'learning_rate': 7.467166979362102e-05, 'epoch': 6.27}


 63%|██████▎   | 1676/2670 [40:50<24:47,  1.50s/it]

{'loss': 0.0132, 'grad_norm': 0.15085574984550476, 'learning_rate': 7.459662288930582e-05, 'epoch': 6.28}


 63%|██████▎   | 1677/2670 [40:51<23:28,  1.42s/it]

{'loss': 0.0144, 'grad_norm': 0.243409663438797, 'learning_rate': 7.452157598499062e-05, 'epoch': 6.28}


 63%|██████▎   | 1678/2670 [40:53<23:35,  1.43s/it]

{'loss': 0.0053, 'grad_norm': 0.11737295240163803, 'learning_rate': 7.444652908067543e-05, 'epoch': 6.28}


 63%|██████▎   | 1679/2670 [40:54<23:05,  1.40s/it]

{'loss': 0.0095, 'grad_norm': 0.115748330950737, 'learning_rate': 7.437148217636022e-05, 'epoch': 6.29}


 63%|██████▎   | 1680/2670 [40:55<21:32,  1.31s/it]

{'loss': 0.0088, 'grad_norm': 0.10717777162790298, 'learning_rate': 7.429643527204503e-05, 'epoch': 6.29}


 63%|██████▎   | 1681/2670 [40:57<23:38,  1.43s/it]

{'loss': 0.0114, 'grad_norm': 0.1556520015001297, 'learning_rate': 7.422138836772984e-05, 'epoch': 6.3}


 63%|██████▎   | 1682/2670 [40:58<23:36,  1.43s/it]

{'loss': 0.0104, 'grad_norm': 0.16766756772994995, 'learning_rate': 7.414634146341464e-05, 'epoch': 6.3}


 63%|██████▎   | 1683/2670 [41:00<24:04,  1.46s/it]

{'loss': 0.0117, 'grad_norm': 0.11444485187530518, 'learning_rate': 7.407129455909943e-05, 'epoch': 6.3}


 63%|██████▎   | 1684/2670 [41:02<24:18,  1.48s/it]

{'loss': 0.0151, 'grad_norm': 0.1882791668176651, 'learning_rate': 7.399624765478424e-05, 'epoch': 6.31}


 63%|██████▎   | 1685/2670 [41:03<24:29,  1.49s/it]

{'loss': 0.0083, 'grad_norm': 0.10039932280778885, 'learning_rate': 7.392120075046905e-05, 'epoch': 6.31}


 63%|██████▎   | 1686/2670 [41:04<23:32,  1.44s/it]

{'loss': 0.0183, 'grad_norm': 0.7140663862228394, 'learning_rate': 7.384615384615386e-05, 'epoch': 6.31}


 63%|██████▎   | 1687/2670 [41:06<22:48,  1.39s/it]

{'loss': 0.0041, 'grad_norm': 0.13352829217910767, 'learning_rate': 7.377110694183865e-05, 'epoch': 6.32}


 63%|██████▎   | 1688/2670 [41:07<23:55,  1.46s/it]

{'loss': 0.0062, 'grad_norm': 0.06883440166711807, 'learning_rate': 7.369606003752345e-05, 'epoch': 6.32}


 63%|██████▎   | 1689/2670 [41:08<22:19,  1.37s/it]

{'loss': 0.01, 'grad_norm': 0.16814133524894714, 'learning_rate': 7.362101313320826e-05, 'epoch': 6.33}


 63%|██████▎   | 1690/2670 [41:10<22:23,  1.37s/it]

{'loss': 0.0047, 'grad_norm': 0.20988114178180695, 'learning_rate': 7.354596622889305e-05, 'epoch': 6.33}


 63%|██████▎   | 1691/2670 [41:11<22:06,  1.35s/it]

{'loss': 0.0044, 'grad_norm': 0.07011525332927704, 'learning_rate': 7.347091932457788e-05, 'epoch': 6.33}


 63%|██████▎   | 1692/2670 [41:12<21:09,  1.30s/it]

{'loss': 0.0092, 'grad_norm': 0.19139419496059418, 'learning_rate': 7.339587242026267e-05, 'epoch': 6.34}


 63%|██████▎   | 1693/2670 [41:14<22:50,  1.40s/it]

{'loss': 0.0116, 'grad_norm': 0.32390761375427246, 'learning_rate': 7.332082551594747e-05, 'epoch': 6.34}


 63%|██████▎   | 1694/2670 [41:15<22:55,  1.41s/it]

{'loss': 0.0073, 'grad_norm': 0.15727755427360535, 'learning_rate': 7.324577861163228e-05, 'epoch': 6.34}


 63%|██████▎   | 1695/2670 [41:17<22:49,  1.40s/it]

{'loss': 0.0172, 'grad_norm': 0.38605797290802, 'learning_rate': 7.317073170731707e-05, 'epoch': 6.35}


 64%|██████▎   | 1696/2670 [41:18<23:28,  1.45s/it]

{'loss': 0.0069, 'grad_norm': 0.12438949197530746, 'learning_rate': 7.309568480300188e-05, 'epoch': 6.35}


 64%|██████▎   | 1697/2670 [41:20<23:22,  1.44s/it]

{'loss': 0.0088, 'grad_norm': 0.12304209172725677, 'learning_rate': 7.302063789868669e-05, 'epoch': 6.36}


 64%|██████▎   | 1698/2670 [41:21<24:04,  1.49s/it]

{'loss': 0.0066, 'grad_norm': 0.1430331915616989, 'learning_rate': 7.294559099437148e-05, 'epoch': 6.36}


 64%|██████▎   | 1699/2670 [41:23<23:19,  1.44s/it]

{'loss': 0.0091, 'grad_norm': 0.09028641134500504, 'learning_rate': 7.287054409005629e-05, 'epoch': 6.36}


 64%|██████▎   | 1700/2670 [41:24<23:39,  1.46s/it]

{'loss': 0.0097, 'grad_norm': 0.1976800113916397, 'learning_rate': 7.279549718574109e-05, 'epoch': 6.37}


 64%|██████▎   | 1701/2670 [41:26<25:49,  1.60s/it]

{'loss': 0.0158, 'grad_norm': 0.36009034514427185, 'learning_rate': 7.272045028142588e-05, 'epoch': 6.37}


 64%|██████▎   | 1702/2670 [41:28<24:54,  1.54s/it]

{'loss': 0.0073, 'grad_norm': 0.11320500075817108, 'learning_rate': 7.26454033771107e-05, 'epoch': 6.37}


 64%|██████▍   | 1703/2670 [41:29<25:13,  1.56s/it]

{'loss': 0.0139, 'grad_norm': 0.235442653298378, 'learning_rate': 7.25703564727955e-05, 'epoch': 6.38}


 64%|██████▍   | 1704/2670 [41:31<25:17,  1.57s/it]

{'loss': 0.0109, 'grad_norm': 0.1546035259962082, 'learning_rate': 7.249530956848031e-05, 'epoch': 6.38}


 64%|██████▍   | 1705/2670 [41:32<25:24,  1.58s/it]

{'loss': 0.0177, 'grad_norm': 0.19517332315444946, 'learning_rate': 7.24202626641651e-05, 'epoch': 6.39}


 64%|██████▍   | 1706/2670 [41:34<26:02,  1.62s/it]

{'loss': 0.0103, 'grad_norm': 0.12823759019374847, 'learning_rate': 7.23452157598499e-05, 'epoch': 6.39}


 64%|██████▍   | 1707/2670 [41:36<26:03,  1.62s/it]

{'loss': 0.02, 'grad_norm': 0.19224540889263153, 'learning_rate': 7.227016885553471e-05, 'epoch': 6.39}


 64%|██████▍   | 1708/2670 [41:37<25:14,  1.57s/it]

{'loss': 0.0159, 'grad_norm': 0.2335471659898758, 'learning_rate': 7.219512195121952e-05, 'epoch': 6.4}


 64%|██████▍   | 1709/2670 [41:39<24:43,  1.54s/it]

{'loss': 0.004, 'grad_norm': 0.05234821140766144, 'learning_rate': 7.212007504690433e-05, 'epoch': 6.4}


 64%|██████▍   | 1710/2670 [41:40<25:13,  1.58s/it]

{'loss': 0.0114, 'grad_norm': 0.2473141998052597, 'learning_rate': 7.204502814258912e-05, 'epoch': 6.4}


 64%|██████▍   | 1711/2670 [41:42<24:40,  1.54s/it]

{'loss': 0.013, 'grad_norm': 0.1461043655872345, 'learning_rate': 7.196998123827392e-05, 'epoch': 6.41}


 64%|██████▍   | 1712/2670 [41:43<25:04,  1.57s/it]

{'loss': 0.0112, 'grad_norm': 0.33392903208732605, 'learning_rate': 7.189493433395873e-05, 'epoch': 6.41}


 64%|██████▍   | 1713/2670 [41:45<25:20,  1.59s/it]

{'loss': 0.0152, 'grad_norm': 0.20171870291233063, 'learning_rate': 7.181988742964353e-05, 'epoch': 6.42}


 64%|██████▍   | 1714/2670 [41:46<24:17,  1.52s/it]

{'loss': 0.0096, 'grad_norm': 0.12259242683649063, 'learning_rate': 7.174484052532834e-05, 'epoch': 6.42}


 64%|██████▍   | 1715/2670 [41:48<23:29,  1.48s/it]

{'loss': 0.0106, 'grad_norm': 0.14077578485012054, 'learning_rate': 7.166979362101314e-05, 'epoch': 6.42}


 64%|██████▍   | 1716/2670 [41:49<23:18,  1.47s/it]

{'loss': 0.0065, 'grad_norm': 0.10835234820842743, 'learning_rate': 7.159474671669793e-05, 'epoch': 6.43}


 64%|██████▍   | 1717/2670 [41:51<23:37,  1.49s/it]

{'loss': 0.0057, 'grad_norm': 0.06654787063598633, 'learning_rate': 7.151969981238274e-05, 'epoch': 6.43}


 64%|██████▍   | 1718/2670 [41:52<23:25,  1.48s/it]

{'loss': 0.0057, 'grad_norm': 0.1518305540084839, 'learning_rate': 7.144465290806754e-05, 'epoch': 6.43}


 64%|██████▍   | 1719/2670 [41:53<21:48,  1.38s/it]

{'loss': 0.0172, 'grad_norm': 0.15968520939350128, 'learning_rate': 7.136960600375235e-05, 'epoch': 6.44}


 64%|██████▍   | 1720/2670 [41:55<23:46,  1.50s/it]

{'loss': 0.0033, 'grad_norm': 0.09000230580568314, 'learning_rate': 7.129455909943716e-05, 'epoch': 6.44}


 64%|██████▍   | 1721/2670 [41:57<23:31,  1.49s/it]

{'loss': 0.0125, 'grad_norm': 0.16030800342559814, 'learning_rate': 7.121951219512195e-05, 'epoch': 6.45}


 64%|██████▍   | 1722/2670 [41:58<23:39,  1.50s/it]

{'loss': 0.0072, 'grad_norm': 0.11525318771600723, 'learning_rate': 7.114446529080676e-05, 'epoch': 6.45}


 65%|██████▍   | 1723/2670 [42:00<23:49,  1.51s/it]

{'loss': 0.0162, 'grad_norm': 0.2113245278596878, 'learning_rate': 7.106941838649156e-05, 'epoch': 6.45}


 65%|██████▍   | 1724/2670 [42:01<23:06,  1.47s/it]

{'loss': 0.0161, 'grad_norm': 0.15846209228038788, 'learning_rate': 7.099437148217636e-05, 'epoch': 6.46}


 65%|██████▍   | 1725/2670 [42:02<22:25,  1.42s/it]

{'loss': 0.0067, 'grad_norm': 0.18119221925735474, 'learning_rate': 7.091932457786117e-05, 'epoch': 6.46}


 65%|██████▍   | 1726/2670 [42:04<21:51,  1.39s/it]

{'loss': 0.0133, 'grad_norm': 0.30177488923072815, 'learning_rate': 7.084427767354597e-05, 'epoch': 6.46}


 65%|██████▍   | 1727/2670 [42:05<22:01,  1.40s/it]

{'loss': 0.0092, 'grad_norm': 0.13021735846996307, 'learning_rate': 7.076923076923078e-05, 'epoch': 6.47}


 65%|██████▍   | 1728/2670 [42:07<22:59,  1.46s/it]

{'loss': 0.008, 'grad_norm': 0.09629766643047333, 'learning_rate': 7.069418386491557e-05, 'epoch': 6.47}


 65%|██████▍   | 1729/2670 [42:08<24:07,  1.54s/it]

{'loss': 0.006, 'grad_norm': 0.24170084297657013, 'learning_rate': 7.061913696060037e-05, 'epoch': 6.48}


 65%|██████▍   | 1730/2670 [42:10<22:25,  1.43s/it]

{'loss': 0.0126, 'grad_norm': 0.13793477416038513, 'learning_rate': 7.054409005628518e-05, 'epoch': 6.48}


 65%|██████▍   | 1731/2670 [42:11<22:23,  1.43s/it]

{'loss': 0.0047, 'grad_norm': 0.06605049222707748, 'learning_rate': 7.046904315196999e-05, 'epoch': 6.48}


 65%|██████▍   | 1732/2670 [42:12<22:14,  1.42s/it]

{'loss': 0.0068, 'grad_norm': 0.20119836926460266, 'learning_rate': 7.03939962476548e-05, 'epoch': 6.49}


 65%|██████▍   | 1733/2670 [42:14<21:32,  1.38s/it]

{'loss': 0.0144, 'grad_norm': 0.14457844197750092, 'learning_rate': 7.031894934333959e-05, 'epoch': 6.49}


 65%|██████▍   | 1734/2670 [42:15<22:17,  1.43s/it]

{'loss': 0.0057, 'grad_norm': 0.06257633864879608, 'learning_rate': 7.024390243902439e-05, 'epoch': 6.49}


 65%|██████▍   | 1735/2670 [42:16<21:27,  1.38s/it]

{'loss': 0.0048, 'grad_norm': 0.046508047729730606, 'learning_rate': 7.01688555347092e-05, 'epoch': 6.5}


 65%|██████▌   | 1736/2670 [42:18<21:30,  1.38s/it]

{'loss': 0.0064, 'grad_norm': 0.21832649409770966, 'learning_rate': 7.0093808630394e-05, 'epoch': 6.5}


 65%|██████▌   | 1737/2670 [42:19<21:10,  1.36s/it]

{'loss': 0.015, 'grad_norm': 0.2103622853755951, 'learning_rate': 7.001876172607881e-05, 'epoch': 6.51}


 65%|██████▌   | 1738/2670 [42:20<20:57,  1.35s/it]

{'loss': 0.0112, 'grad_norm': 0.11773231625556946, 'learning_rate': 6.994371482176361e-05, 'epoch': 6.51}


 65%|██████▌   | 1739/2670 [42:22<20:46,  1.34s/it]

{'loss': 0.006, 'grad_norm': 0.11100290715694427, 'learning_rate': 6.98686679174484e-05, 'epoch': 6.51}


 65%|██████▌   | 1740/2670 [42:23<20:33,  1.33s/it]

{'loss': 0.0144, 'grad_norm': 0.2335183471441269, 'learning_rate': 6.979362101313321e-05, 'epoch': 6.52}


 65%|██████▌   | 1741/2670 [42:25<22:21,  1.44s/it]

{'loss': 0.0149, 'grad_norm': 0.18516257405281067, 'learning_rate': 6.9718574108818e-05, 'epoch': 6.52}


 65%|██████▌   | 1742/2670 [42:26<21:43,  1.40s/it]

{'loss': 0.0083, 'grad_norm': 0.11779121309518814, 'learning_rate': 6.964352720450283e-05, 'epoch': 6.52}


 65%|██████▌   | 1743/2670 [42:28<22:42,  1.47s/it]

{'loss': 0.0098, 'grad_norm': 0.12479325383901596, 'learning_rate': 6.956848030018762e-05, 'epoch': 6.53}


 65%|██████▌   | 1744/2670 [42:29<22:22,  1.45s/it]

{'loss': 0.0099, 'grad_norm': 0.07772546261548996, 'learning_rate': 6.949343339587242e-05, 'epoch': 6.53}


 65%|██████▌   | 1745/2670 [42:30<21:37,  1.40s/it]

{'loss': 0.0075, 'grad_norm': 0.08443472534418106, 'learning_rate': 6.941838649155723e-05, 'epoch': 6.54}


 65%|██████▌   | 1746/2670 [42:32<20:59,  1.36s/it]

{'loss': 0.0064, 'grad_norm': 0.07438591867685318, 'learning_rate': 6.934333958724202e-05, 'epoch': 6.54}


 65%|██████▌   | 1747/2670 [42:33<20:32,  1.34s/it]

{'loss': 0.0142, 'grad_norm': 0.2434963583946228, 'learning_rate': 6.926829268292683e-05, 'epoch': 6.54}


 65%|██████▌   | 1748/2670 [42:34<20:24,  1.33s/it]

{'loss': 0.0195, 'grad_norm': 0.17359761893749237, 'learning_rate': 6.919324577861164e-05, 'epoch': 6.55}


 66%|██████▌   | 1749/2670 [42:36<20:54,  1.36s/it]

{'loss': 0.0063, 'grad_norm': 0.10478538274765015, 'learning_rate': 6.911819887429644e-05, 'epoch': 6.55}


 66%|██████▌   | 1750/2670 [42:37<21:19,  1.39s/it]

{'loss': 0.0146, 'grad_norm': 0.26229792833328247, 'learning_rate': 6.904315196998125e-05, 'epoch': 6.55}


 66%|██████▌   | 1751/2670 [42:38<20:50,  1.36s/it]

{'loss': 0.0106, 'grad_norm': 0.16001035273075104, 'learning_rate': 6.896810506566604e-05, 'epoch': 6.56}


 66%|██████▌   | 1752/2670 [42:40<21:00,  1.37s/it]

{'loss': 0.0115, 'grad_norm': 0.2722207307815552, 'learning_rate': 6.889305816135084e-05, 'epoch': 6.56}


 66%|██████▌   | 1753/2670 [42:41<21:40,  1.42s/it]

{'loss': 0.0133, 'grad_norm': 0.14946947991847992, 'learning_rate': 6.881801125703566e-05, 'epoch': 6.57}


 66%|██████▌   | 1754/2670 [42:43<22:38,  1.48s/it]

{'loss': 0.0095, 'grad_norm': 0.1783282309770584, 'learning_rate': 6.874296435272045e-05, 'epoch': 6.57}


 66%|██████▌   | 1755/2670 [42:44<21:41,  1.42s/it]

{'loss': 0.0045, 'grad_norm': 0.0662420317530632, 'learning_rate': 6.866791744840526e-05, 'epoch': 6.57}


 66%|██████▌   | 1756/2670 [42:46<21:51,  1.44s/it]

{'loss': 0.008, 'grad_norm': 0.09453010559082031, 'learning_rate': 6.859287054409006e-05, 'epoch': 6.58}


 66%|██████▌   | 1757/2670 [42:47<21:19,  1.40s/it]

{'loss': 0.0055, 'grad_norm': 0.07067645341157913, 'learning_rate': 6.851782363977485e-05, 'epoch': 6.58}


 66%|██████▌   | 1758/2670 [42:48<21:17,  1.40s/it]

{'loss': 0.0149, 'grad_norm': 0.21038542687892914, 'learning_rate': 6.844277673545966e-05, 'epoch': 6.58}


 66%|██████▌   | 1759/2670 [42:50<21:53,  1.44s/it]

{'loss': 0.012, 'grad_norm': 0.24908298254013062, 'learning_rate': 6.836772983114447e-05, 'epoch': 6.59}


 66%|██████▌   | 1760/2670 [42:51<21:52,  1.44s/it]

{'loss': 0.0043, 'grad_norm': 0.060587327927351, 'learning_rate': 6.829268292682928e-05, 'epoch': 6.59}


 66%|██████▌   | 1761/2670 [42:53<22:09,  1.46s/it]

{'loss': 0.0067, 'grad_norm': 0.2700795829296112, 'learning_rate': 6.821763602251408e-05, 'epoch': 6.6}


 66%|██████▌   | 1762/2670 [42:55<24:13,  1.60s/it]

{'loss': 0.0065, 'grad_norm': 0.1002836674451828, 'learning_rate': 6.814258911819887e-05, 'epoch': 6.6}


 66%|██████▌   | 1763/2670 [42:57<24:19,  1.61s/it]

{'loss': 0.0067, 'grad_norm': 0.12694527208805084, 'learning_rate': 6.806754221388368e-05, 'epoch': 6.6}


 66%|██████▌   | 1764/2670 [42:58<24:03,  1.59s/it]

{'loss': 0.0059, 'grad_norm': 0.06841641664505005, 'learning_rate': 6.799249530956849e-05, 'epoch': 6.61}


 66%|██████▌   | 1765/2670 [43:00<24:11,  1.60s/it]

{'loss': 0.0152, 'grad_norm': 0.2857583463191986, 'learning_rate': 6.791744840525328e-05, 'epoch': 6.61}


 66%|██████▌   | 1766/2670 [43:01<22:58,  1.52s/it]

{'loss': 0.0047, 'grad_norm': 0.05609288439154625, 'learning_rate': 6.784240150093809e-05, 'epoch': 6.61}


 66%|██████▌   | 1767/2670 [43:03<23:30,  1.56s/it]

{'loss': 0.0084, 'grad_norm': 0.13123199343681335, 'learning_rate': 6.776735459662289e-05, 'epoch': 6.62}


 66%|██████▌   | 1768/2670 [43:04<21:48,  1.45s/it]

{'loss': 0.0122, 'grad_norm': 0.1426287591457367, 'learning_rate': 6.76923076923077e-05, 'epoch': 6.62}


 66%|██████▋   | 1769/2670 [43:05<22:11,  1.48s/it]

{'loss': 0.0053, 'grad_norm': 0.2012939602136612, 'learning_rate': 6.761726078799249e-05, 'epoch': 6.63}


 66%|██████▋   | 1770/2670 [43:07<22:29,  1.50s/it]

{'loss': 0.0077, 'grad_norm': 0.10652563720941544, 'learning_rate': 6.75422138836773e-05, 'epoch': 6.63}


 66%|██████▋   | 1771/2670 [43:08<22:04,  1.47s/it]

{'loss': 0.0096, 'grad_norm': 0.3767617344856262, 'learning_rate': 6.746716697936211e-05, 'epoch': 6.63}


 66%|██████▋   | 1772/2670 [43:10<21:35,  1.44s/it]

{'loss': 0.0089, 'grad_norm': 0.1487635374069214, 'learning_rate': 6.73921200750469e-05, 'epoch': 6.64}


 66%|██████▋   | 1773/2670 [43:11<20:51,  1.40s/it]

{'loss': 0.0116, 'grad_norm': 0.16188541054725647, 'learning_rate': 6.731707317073171e-05, 'epoch': 6.64}


 66%|██████▋   | 1774/2670 [43:12<21:02,  1.41s/it]

{'loss': 0.0376, 'grad_norm': 0.48503607511520386, 'learning_rate': 6.724202626641651e-05, 'epoch': 6.64}


 66%|██████▋   | 1775/2670 [43:14<22:04,  1.48s/it]

{'loss': 0.0102, 'grad_norm': 0.1588083803653717, 'learning_rate': 6.716697936210132e-05, 'epoch': 6.65}


 67%|██████▋   | 1776/2670 [43:16<22:15,  1.49s/it]

{'loss': 0.0101, 'grad_norm': 0.10612991452217102, 'learning_rate': 6.709193245778613e-05, 'epoch': 6.65}


 67%|██████▋   | 1777/2670 [43:17<22:24,  1.51s/it]

{'loss': 0.0124, 'grad_norm': 0.10256931930780411, 'learning_rate': 6.701688555347092e-05, 'epoch': 6.66}


 67%|██████▋   | 1778/2670 [43:19<22:29,  1.51s/it]

{'loss': 0.0113, 'grad_norm': 0.11366982012987137, 'learning_rate': 6.694183864915573e-05, 'epoch': 6.66}


 67%|██████▋   | 1779/2670 [43:20<22:33,  1.52s/it]

{'loss': 0.0071, 'grad_norm': 0.06889175623655319, 'learning_rate': 6.686679174484053e-05, 'epoch': 6.66}


 67%|██████▋   | 1780/2670 [43:22<21:59,  1.48s/it]

{'loss': 0.0092, 'grad_norm': 0.11478941142559052, 'learning_rate': 6.679174484052532e-05, 'epoch': 6.67}


 67%|██████▋   | 1781/2670 [43:23<23:14,  1.57s/it]

{'loss': 0.0116, 'grad_norm': 0.1580583155155182, 'learning_rate': 6.671669793621014e-05, 'epoch': 6.67}


 67%|██████▋   | 1782/2670 [43:25<23:02,  1.56s/it]

{'loss': 0.0092, 'grad_norm': 0.26720669865608215, 'learning_rate': 6.664165103189494e-05, 'epoch': 6.67}


 67%|██████▋   | 1783/2670 [43:27<23:36,  1.60s/it]

{'loss': 0.0079, 'grad_norm': 0.13973389565944672, 'learning_rate': 6.656660412757975e-05, 'epoch': 6.68}


 67%|██████▋   | 1784/2670 [43:28<22:45,  1.54s/it]

{'loss': 0.01, 'grad_norm': 0.10399484634399414, 'learning_rate': 6.649155722326454e-05, 'epoch': 6.68}


 67%|██████▋   | 1785/2670 [43:30<23:03,  1.56s/it]

{'loss': 0.0134, 'grad_norm': 1.1352256536483765, 'learning_rate': 6.641651031894934e-05, 'epoch': 6.69}


 67%|██████▋   | 1786/2670 [43:31<21:00,  1.43s/it]

{'loss': 0.0099, 'grad_norm': 0.15646067261695862, 'learning_rate': 6.634146341463415e-05, 'epoch': 6.69}


 67%|██████▋   | 1787/2670 [43:32<20:56,  1.42s/it]

{'loss': 0.0084, 'grad_norm': 0.11289165914058685, 'learning_rate': 6.626641651031896e-05, 'epoch': 6.69}


 67%|██████▋   | 1788/2670 [43:34<21:07,  1.44s/it]

{'loss': 0.0073, 'grad_norm': 0.15337002277374268, 'learning_rate': 6.619136960600375e-05, 'epoch': 6.7}


 67%|██████▋   | 1789/2670 [43:35<21:42,  1.48s/it]

{'loss': 0.0122, 'grad_norm': 0.16238635778427124, 'learning_rate': 6.611632270168856e-05, 'epoch': 6.7}


 67%|██████▋   | 1790/2670 [43:37<20:46,  1.42s/it]

{'loss': 0.0171, 'grad_norm': 0.3976469039916992, 'learning_rate': 6.604127579737336e-05, 'epoch': 6.7}


 67%|██████▋   | 1791/2670 [43:38<21:45,  1.49s/it]

{'loss': 0.0089, 'grad_norm': 0.16909384727478027, 'learning_rate': 6.596622889305816e-05, 'epoch': 6.71}


 67%|██████▋   | 1792/2670 [43:40<21:40,  1.48s/it]

{'loss': 0.0359, 'grad_norm': 0.33071720600128174, 'learning_rate': 6.589118198874297e-05, 'epoch': 6.71}


 67%|██████▋   | 1793/2670 [43:41<21:06,  1.44s/it]

{'loss': 0.0073, 'grad_norm': 0.09098144620656967, 'learning_rate': 6.581613508442777e-05, 'epoch': 6.72}


 67%|██████▋   | 1794/2670 [43:42<21:09,  1.45s/it]

{'loss': 0.0063, 'grad_norm': 0.07539656013250351, 'learning_rate': 6.574108818011258e-05, 'epoch': 6.72}


 67%|██████▋   | 1795/2670 [43:44<19:57,  1.37s/it]

{'loss': 0.0224, 'grad_norm': 1.383499264717102, 'learning_rate': 6.566604127579737e-05, 'epoch': 6.72}


 67%|██████▋   | 1796/2670 [43:45<19:43,  1.35s/it]

{'loss': 0.0091, 'grad_norm': 0.17896948754787445, 'learning_rate': 6.559099437148218e-05, 'epoch': 6.73}


 67%|██████▋   | 1797/2670 [43:46<19:55,  1.37s/it]

{'loss': 0.0155, 'grad_norm': 0.2335110902786255, 'learning_rate': 6.551594746716698e-05, 'epoch': 6.73}


 67%|██████▋   | 1798/2670 [43:48<19:43,  1.36s/it]

{'loss': 0.0107, 'grad_norm': 0.21347017586231232, 'learning_rate': 6.544090056285179e-05, 'epoch': 6.73}


 67%|██████▋   | 1799/2670 [43:49<20:52,  1.44s/it]

{'loss': 0.0162, 'grad_norm': 0.17433659732341766, 'learning_rate': 6.53658536585366e-05, 'epoch': 6.74}


 67%|██████▋   | 1800/2670 [43:51<20:41,  1.43s/it]

{'loss': 0.0144, 'grad_norm': 0.16619528830051422, 'learning_rate': 6.529080675422139e-05, 'epoch': 6.74}


 67%|██████▋   | 1801/2670 [43:52<21:33,  1.49s/it]

{'loss': 0.013, 'grad_norm': 0.31261420249938965, 'learning_rate': 6.52157598499062e-05, 'epoch': 6.75}


 67%|██████▋   | 1802/2670 [43:54<21:40,  1.50s/it]

{'loss': 0.0049, 'grad_norm': 0.06918865442276001, 'learning_rate': 6.5140712945591e-05, 'epoch': 6.75}


 68%|██████▊   | 1803/2670 [43:56<23:05,  1.60s/it]

{'loss': 0.0091, 'grad_norm': 0.1460278332233429, 'learning_rate': 6.50656660412758e-05, 'epoch': 6.75}


 68%|██████▊   | 1804/2670 [43:57<22:53,  1.59s/it]

{'loss': 0.0102, 'grad_norm': 0.10539080202579498, 'learning_rate': 6.499061913696061e-05, 'epoch': 6.76}


 68%|██████▊   | 1805/2670 [43:59<22:32,  1.56s/it]

{'loss': 0.0107, 'grad_norm': 0.15135106444358826, 'learning_rate': 6.491557223264541e-05, 'epoch': 6.76}


 68%|██████▊   | 1806/2670 [44:00<21:56,  1.52s/it]

{'loss': 0.0081, 'grad_norm': 0.12867748737335205, 'learning_rate': 6.484052532833022e-05, 'epoch': 6.76}


 68%|██████▊   | 1807/2670 [44:02<21:08,  1.47s/it]

{'loss': 0.0171, 'grad_norm': 0.2035532295703888, 'learning_rate': 6.476547842401501e-05, 'epoch': 6.77}


 68%|██████▊   | 1808/2670 [44:03<20:35,  1.43s/it]

{'loss': 0.0079, 'grad_norm': 0.2499016523361206, 'learning_rate': 6.469043151969981e-05, 'epoch': 6.77}


 68%|██████▊   | 1809/2670 [44:04<21:05,  1.47s/it]

{'loss': 0.0136, 'grad_norm': 0.11762851476669312, 'learning_rate': 6.461538461538462e-05, 'epoch': 6.78}


 68%|██████▊   | 1810/2670 [44:06<22:06,  1.54s/it]

{'loss': 0.0149, 'grad_norm': 0.21230265498161316, 'learning_rate': 6.454033771106942e-05, 'epoch': 6.78}


 68%|██████▊   | 1811/2670 [44:08<21:20,  1.49s/it]

{'loss': 0.0167, 'grad_norm': 0.5459420680999756, 'learning_rate': 6.446529080675422e-05, 'epoch': 6.78}


 68%|██████▊   | 1812/2670 [44:09<21:43,  1.52s/it]

{'loss': 0.0074, 'grad_norm': 0.09207801520824432, 'learning_rate': 6.439024390243903e-05, 'epoch': 6.79}


 68%|██████▊   | 1813/2670 [44:11<21:38,  1.52s/it]

{'loss': 0.0086, 'grad_norm': 0.12909522652626038, 'learning_rate': 6.431519699812382e-05, 'epoch': 6.79}


 68%|██████▊   | 1814/2670 [44:12<20:33,  1.44s/it]

{'loss': 0.0156, 'grad_norm': 0.17670966684818268, 'learning_rate': 6.424015009380863e-05, 'epoch': 6.79}


 68%|██████▊   | 1815/2670 [44:13<20:18,  1.43s/it]

{'loss': 0.0188, 'grad_norm': 0.24553540349006653, 'learning_rate': 6.416510318949344e-05, 'epoch': 6.8}


 68%|██████▊   | 1816/2670 [44:15<20:41,  1.45s/it]

{'loss': 0.0118, 'grad_norm': 0.15250788629055023, 'learning_rate': 6.409005628517824e-05, 'epoch': 6.8}


 68%|██████▊   | 1817/2670 [44:16<20:34,  1.45s/it]

{'loss': 0.0199, 'grad_norm': 0.1666761338710785, 'learning_rate': 6.401500938086305e-05, 'epoch': 6.81}


 68%|██████▊   | 1818/2670 [44:18<21:20,  1.50s/it]

{'loss': 0.0097, 'grad_norm': 0.1558409482240677, 'learning_rate': 6.393996247654784e-05, 'epoch': 6.81}


 68%|██████▊   | 1819/2670 [44:19<21:45,  1.53s/it]

{'loss': 0.0061, 'grad_norm': 0.061023227870464325, 'learning_rate': 6.386491557223265e-05, 'epoch': 6.81}


 68%|██████▊   | 1820/2670 [44:21<21:42,  1.53s/it]

{'loss': 0.0081, 'grad_norm': 0.1190173327922821, 'learning_rate': 6.378986866791745e-05, 'epoch': 6.82}


 68%|██████▊   | 1821/2670 [44:22<21:11,  1.50s/it]

{'loss': 0.0046, 'grad_norm': 0.08087245374917984, 'learning_rate': 6.371482176360225e-05, 'epoch': 6.82}


 68%|██████▊   | 1822/2670 [44:24<21:26,  1.52s/it]

{'loss': 0.0104, 'grad_norm': 0.09813804179430008, 'learning_rate': 6.363977485928706e-05, 'epoch': 6.82}


 68%|██████▊   | 1823/2670 [44:25<21:03,  1.49s/it]

{'loss': 0.007, 'grad_norm': 0.0744071677327156, 'learning_rate': 6.356472795497186e-05, 'epoch': 6.83}


 68%|██████▊   | 1824/2670 [44:27<20:13,  1.43s/it]

{'loss': 0.0074, 'grad_norm': 0.09109531342983246, 'learning_rate': 6.348968105065667e-05, 'epoch': 6.83}


 68%|██████▊   | 1825/2670 [44:28<20:12,  1.44s/it]

{'loss': 0.0119, 'grad_norm': 0.296166330575943, 'learning_rate': 6.341463414634146e-05, 'epoch': 6.84}


 68%|██████▊   | 1826/2670 [44:30<20:01,  1.42s/it]

{'loss': 0.0107, 'grad_norm': 0.1298421174287796, 'learning_rate': 6.333958724202627e-05, 'epoch': 6.84}


 68%|██████▊   | 1827/2670 [44:31<20:47,  1.48s/it]

{'loss': 0.0073, 'grad_norm': 0.1133483275771141, 'learning_rate': 6.326454033771108e-05, 'epoch': 6.84}


 68%|██████▊   | 1828/2670 [44:33<20:32,  1.46s/it]

{'loss': 0.0088, 'grad_norm': 0.14374838769435883, 'learning_rate': 6.318949343339588e-05, 'epoch': 6.85}


 69%|██████▊   | 1829/2670 [44:34<21:17,  1.52s/it]

{'loss': 0.0043, 'grad_norm': 0.05145975574851036, 'learning_rate': 6.311444652908067e-05, 'epoch': 6.85}


 69%|██████▊   | 1830/2670 [44:35<19:40,  1.40s/it]

{'loss': 0.0095, 'grad_norm': 0.12322628498077393, 'learning_rate': 6.303939962476548e-05, 'epoch': 6.85}


 69%|██████▊   | 1831/2670 [44:37<20:06,  1.44s/it]

{'loss': 0.0164, 'grad_norm': 0.2389492392539978, 'learning_rate': 6.296435272045028e-05, 'epoch': 6.86}


 69%|██████▊   | 1832/2670 [44:38<20:49,  1.49s/it]

{'loss': 0.014, 'grad_norm': 0.175156831741333, 'learning_rate': 6.28893058161351e-05, 'epoch': 6.86}


 69%|██████▊   | 1833/2670 [44:40<20:25,  1.46s/it]

{'loss': 0.0089, 'grad_norm': 0.09996752440929413, 'learning_rate': 6.281425891181989e-05, 'epoch': 6.87}


 69%|██████▊   | 1834/2670 [44:41<20:35,  1.48s/it]

{'loss': 0.01, 'grad_norm': 0.28786206245422363, 'learning_rate': 6.273921200750469e-05, 'epoch': 6.87}


 69%|██████▊   | 1835/2670 [44:43<21:03,  1.51s/it]

{'loss': 0.0044, 'grad_norm': 0.07034969329833984, 'learning_rate': 6.26641651031895e-05, 'epoch': 6.87}


 69%|██████▉   | 1836/2670 [44:44<20:39,  1.49s/it]

{'loss': 0.0168, 'grad_norm': 0.2295340895652771, 'learning_rate': 6.258911819887429e-05, 'epoch': 6.88}


 69%|██████▉   | 1837/2670 [44:46<21:37,  1.56s/it]

{'loss': 0.0265, 'grad_norm': 0.4776749014854431, 'learning_rate': 6.25140712945591e-05, 'epoch': 6.88}


 69%|██████▉   | 1838/2670 [44:48<21:00,  1.51s/it]

{'loss': 0.0076, 'grad_norm': 0.18627724051475525, 'learning_rate': 6.243902439024391e-05, 'epoch': 6.88}


 69%|██████▉   | 1839/2670 [44:49<20:03,  1.45s/it]

{'loss': 0.0088, 'grad_norm': 0.0665806382894516, 'learning_rate': 6.23639774859287e-05, 'epoch': 6.89}


 69%|██████▉   | 1840/2670 [44:50<20:37,  1.49s/it]

{'loss': 0.0075, 'grad_norm': 0.09571224451065063, 'learning_rate': 6.228893058161351e-05, 'epoch': 6.89}


 69%|██████▉   | 1841/2670 [44:52<20:45,  1.50s/it]

{'loss': 0.0106, 'grad_norm': 0.14716866612434387, 'learning_rate': 6.221388367729831e-05, 'epoch': 6.9}


 69%|██████▉   | 1842/2670 [44:53<20:28,  1.48s/it]

{'loss': 0.0145, 'grad_norm': 0.18285921216011047, 'learning_rate': 6.213883677298312e-05, 'epoch': 6.9}


 69%|██████▉   | 1843/2670 [44:55<20:43,  1.50s/it]

{'loss': 0.0161, 'grad_norm': 0.1946737915277481, 'learning_rate': 6.206378986866793e-05, 'epoch': 6.9}


 69%|██████▉   | 1844/2670 [44:56<20:41,  1.50s/it]

{'loss': 0.0046, 'grad_norm': 0.05647142603993416, 'learning_rate': 6.198874296435272e-05, 'epoch': 6.91}


 69%|██████▉   | 1845/2670 [44:58<20:13,  1.47s/it]

{'loss': 0.0157, 'grad_norm': 0.38074249029159546, 'learning_rate': 6.191369606003753e-05, 'epoch': 6.91}


 69%|██████▉   | 1846/2670 [44:59<20:00,  1.46s/it]

{'loss': 0.009, 'grad_norm': 0.11332318186759949, 'learning_rate': 6.183864915572233e-05, 'epoch': 6.91}


 69%|██████▉   | 1847/2670 [45:01<19:24,  1.41s/it]

{'loss': 0.006, 'grad_norm': 0.100581593811512, 'learning_rate': 6.176360225140714e-05, 'epoch': 6.92}


 69%|██████▉   | 1848/2670 [45:02<19:17,  1.41s/it]

{'loss': 0.0119, 'grad_norm': 0.21184098720550537, 'learning_rate': 6.168855534709193e-05, 'epoch': 6.92}


 69%|██████▉   | 1849/2670 [45:03<19:23,  1.42s/it]

{'loss': 0.0169, 'grad_norm': 0.44117486476898193, 'learning_rate': 6.161350844277674e-05, 'epoch': 6.93}


 69%|██████▉   | 1850/2670 [45:05<19:50,  1.45s/it]

{'loss': 0.0087, 'grad_norm': 0.16614148020744324, 'learning_rate': 6.153846153846155e-05, 'epoch': 6.93}


 69%|██████▉   | 1851/2670 [45:06<19:22,  1.42s/it]

{'loss': 0.0157, 'grad_norm': 0.1798761934041977, 'learning_rate': 6.146341463414634e-05, 'epoch': 6.93}


 69%|██████▉   | 1852/2670 [45:08<18:46,  1.38s/it]

{'loss': 0.0166, 'grad_norm': 0.28718313574790955, 'learning_rate': 6.138836772983114e-05, 'epoch': 6.94}


 69%|██████▉   | 1853/2670 [45:09<18:35,  1.37s/it]

{'loss': 0.0132, 'grad_norm': 0.12464753538370132, 'learning_rate': 6.131332082551595e-05, 'epoch': 6.94}


 69%|██████▉   | 1854/2670 [45:10<18:08,  1.33s/it]

{'loss': 0.0188, 'grad_norm': 0.1234377920627594, 'learning_rate': 6.123827392120076e-05, 'epoch': 6.94}


 69%|██████▉   | 1855/2670 [45:12<18:34,  1.37s/it]

{'loss': 0.0128, 'grad_norm': 0.15265123546123505, 'learning_rate': 6.116322701688557e-05, 'epoch': 6.95}


 70%|██████▉   | 1856/2670 [45:13<19:55,  1.47s/it]

{'loss': 0.016, 'grad_norm': 0.18156372010707855, 'learning_rate': 6.108818011257036e-05, 'epoch': 6.95}


 70%|██████▉   | 1857/2670 [45:15<20:17,  1.50s/it]

{'loss': 0.0131, 'grad_norm': 0.27746716141700745, 'learning_rate': 6.1013133208255156e-05, 'epoch': 6.96}


 70%|██████▉   | 1858/2670 [45:17<20:57,  1.55s/it]

{'loss': 0.0158, 'grad_norm': 0.155207559466362, 'learning_rate': 6.0938086303939965e-05, 'epoch': 6.96}


 70%|██████▉   | 1859/2670 [45:18<19:57,  1.48s/it]

{'loss': 0.0169, 'grad_norm': 0.1501932442188263, 'learning_rate': 6.086303939962477e-05, 'epoch': 6.96}


 70%|██████▉   | 1860/2670 [45:19<19:09,  1.42s/it]

{'loss': 0.0119, 'grad_norm': 0.13179384171962738, 'learning_rate': 6.0787992495309576e-05, 'epoch': 6.97}


 70%|██████▉   | 1861/2670 [45:21<19:11,  1.42s/it]

{'loss': 0.009, 'grad_norm': 0.15155278146266937, 'learning_rate': 6.071294559099437e-05, 'epoch': 6.97}


 70%|██████▉   | 1862/2670 [45:22<19:02,  1.41s/it]

{'loss': 0.0094, 'grad_norm': 0.10418076068162918, 'learning_rate': 6.0637898686679174e-05, 'epoch': 6.97}


 70%|██████▉   | 1863/2670 [45:23<18:28,  1.37s/it]

{'loss': 0.0089, 'grad_norm': 0.0856180191040039, 'learning_rate': 6.056285178236398e-05, 'epoch': 6.98}


 70%|██████▉   | 1864/2670 [45:25<18:31,  1.38s/it]

{'loss': 0.0099, 'grad_norm': 0.1151801273226738, 'learning_rate': 6.0487804878048785e-05, 'epoch': 6.98}


 70%|██████▉   | 1865/2670 [45:26<19:22,  1.44s/it]

{'loss': 0.009, 'grad_norm': 0.08123242855072021, 'learning_rate': 6.0412757973733593e-05, 'epoch': 6.99}


 70%|██████▉   | 1866/2670 [45:27<18:14,  1.36s/it]

{'loss': 0.0119, 'grad_norm': 0.10442925244569778, 'learning_rate': 6.033771106941839e-05, 'epoch': 6.99}


 70%|██████▉   | 1867/2670 [45:29<18:53,  1.41s/it]

{'loss': 0.0138, 'grad_norm': 0.1880311369895935, 'learning_rate': 6.026266416510319e-05, 'epoch': 6.99}


 70%|██████▉   | 1868/2670 [45:30<19:13,  1.44s/it]

{'loss': 0.0117, 'grad_norm': 0.15273073315620422, 'learning_rate': 6.0187617260788e-05, 'epoch': 7.0}


 70%|███████   | 1869/2670 [45:32<18:30,  1.39s/it]

{'loss': 0.0087, 'grad_norm': 0.1301770955324173, 'learning_rate': 6.0112570356472795e-05, 'epoch': 7.0}


 70%|███████   | 1870/2670 [45:33<19:32,  1.47s/it]

{'loss': 0.0064, 'grad_norm': 0.06641454994678497, 'learning_rate': 6.0037523452157604e-05, 'epoch': 7.0}


 70%|███████   | 1871/2670 [45:35<19:26,  1.46s/it]

{'loss': 0.0042, 'grad_norm': 0.07271900773048401, 'learning_rate': 5.9962476547842406e-05, 'epoch': 7.01}


 70%|███████   | 1872/2670 [45:36<19:33,  1.47s/it]

{'loss': 0.0082, 'grad_norm': 0.43735000491142273, 'learning_rate': 5.98874296435272e-05, 'epoch': 7.01}


 70%|███████   | 1873/2670 [45:38<18:39,  1.40s/it]

{'loss': 0.015, 'grad_norm': 0.46651557087898254, 'learning_rate': 5.981238273921202e-05, 'epoch': 7.01}


 70%|███████   | 1874/2670 [45:39<17:37,  1.33s/it]

{'loss': 0.0064, 'grad_norm': 0.07294457405805588, 'learning_rate': 5.973733583489681e-05, 'epoch': 7.02}


 70%|███████   | 1875/2670 [45:40<18:26,  1.39s/it]

{'loss': 0.0064, 'grad_norm': 0.10088081657886505, 'learning_rate': 5.9662288930581614e-05, 'epoch': 7.02}


 70%|███████   | 1876/2670 [45:42<19:30,  1.47s/it]

{'loss': 0.0028, 'grad_norm': 0.032031089067459106, 'learning_rate': 5.958724202626642e-05, 'epoch': 7.03}


 70%|███████   | 1877/2670 [45:43<19:21,  1.46s/it]

{'loss': 0.0085, 'grad_norm': 0.14698031544685364, 'learning_rate': 5.951219512195122e-05, 'epoch': 7.03}


 70%|███████   | 1878/2670 [45:45<19:16,  1.46s/it]

{'loss': 0.0063, 'grad_norm': 0.12609010934829712, 'learning_rate': 5.943714821763603e-05, 'epoch': 7.03}


 70%|███████   | 1879/2670 [45:46<19:24,  1.47s/it]

{'loss': 0.0049, 'grad_norm': 0.0504642128944397, 'learning_rate': 5.936210131332083e-05, 'epoch': 7.04}


 70%|███████   | 1880/2670 [45:48<18:21,  1.39s/it]

{'loss': 0.0038, 'grad_norm': 0.04609556496143341, 'learning_rate': 5.9287054409005625e-05, 'epoch': 7.04}


 70%|███████   | 1881/2670 [45:49<17:43,  1.35s/it]

{'loss': 0.0098, 'grad_norm': 0.0957137942314148, 'learning_rate': 5.9212007504690434e-05, 'epoch': 7.04}


 70%|███████   | 1882/2670 [45:51<19:26,  1.48s/it]

{'loss': 0.0028, 'grad_norm': 0.04957021772861481, 'learning_rate': 5.9136960600375236e-05, 'epoch': 7.05}


 71%|███████   | 1883/2670 [45:53<21:17,  1.62s/it]

{'loss': 0.0119, 'grad_norm': 0.09817396104335785, 'learning_rate': 5.9061913696060044e-05, 'epoch': 7.05}


 71%|███████   | 1884/2670 [45:54<18:55,  1.44s/it]

{'loss': 0.0049, 'grad_norm': 0.1031528115272522, 'learning_rate': 5.8986866791744847e-05, 'epoch': 7.06}


 71%|███████   | 1885/2670 [45:55<19:59,  1.53s/it]

{'loss': 0.0073, 'grad_norm': 0.09555871039628983, 'learning_rate': 5.891181988742964e-05, 'epoch': 7.06}


 71%|███████   | 1886/2670 [45:57<18:56,  1.45s/it]

{'loss': 0.0063, 'grad_norm': 0.1371651440858841, 'learning_rate': 5.883677298311445e-05, 'epoch': 7.06}


 71%|███████   | 1887/2670 [45:58<18:55,  1.45s/it]

{'loss': 0.0039, 'grad_norm': 0.04189181327819824, 'learning_rate': 5.876172607879925e-05, 'epoch': 7.07}


 71%|███████   | 1888/2670 [46:00<19:51,  1.52s/it]

{'loss': 0.0031, 'grad_norm': 0.036768291145563126, 'learning_rate': 5.868667917448406e-05, 'epoch': 7.07}


 71%|███████   | 1889/2670 [46:01<20:41,  1.59s/it]

{'loss': 0.0116, 'grad_norm': 0.11238375306129456, 'learning_rate': 5.861163227016886e-05, 'epoch': 7.07}


 71%|███████   | 1890/2670 [46:03<19:43,  1.52s/it]

{'loss': 0.0081, 'grad_norm': 0.32680732011795044, 'learning_rate': 5.853658536585366e-05, 'epoch': 7.08}


 71%|███████   | 1891/2670 [46:04<19:03,  1.47s/it]

{'loss': 0.0057, 'grad_norm': 0.06890799105167389, 'learning_rate': 5.846153846153847e-05, 'epoch': 7.08}


 71%|███████   | 1892/2670 [46:05<18:40,  1.44s/it]

{'loss': 0.0079, 'grad_norm': 0.08181554079055786, 'learning_rate': 5.838649155722326e-05, 'epoch': 7.09}


 71%|███████   | 1893/2670 [46:07<20:28,  1.58s/it]

{'loss': 0.0047, 'grad_norm': 0.06092427298426628, 'learning_rate': 5.8311444652908065e-05, 'epoch': 7.09}


 71%|███████   | 1894/2670 [46:09<19:09,  1.48s/it]

{'loss': 0.0067, 'grad_norm': 0.13337840139865875, 'learning_rate': 5.8236397748592874e-05, 'epoch': 7.09}


 71%|███████   | 1895/2670 [46:10<18:43,  1.45s/it]

{'loss': 0.0058, 'grad_norm': 0.11683586239814758, 'learning_rate': 5.8161350844277676e-05, 'epoch': 7.1}


 71%|███████   | 1896/2670 [46:12<19:59,  1.55s/it]

{'loss': 0.0048, 'grad_norm': 0.06807338446378708, 'learning_rate': 5.8086303939962485e-05, 'epoch': 7.1}


 71%|███████   | 1897/2670 [46:13<18:49,  1.46s/it]

{'loss': 0.0133, 'grad_norm': 0.1464429646730423, 'learning_rate': 5.801125703564728e-05, 'epoch': 7.1}


 71%|███████   | 1898/2670 [46:15<19:01,  1.48s/it]

{'loss': 0.0091, 'grad_norm': 0.31703928112983704, 'learning_rate': 5.793621013133208e-05, 'epoch': 7.11}


 71%|███████   | 1899/2670 [46:16<18:57,  1.48s/it]

{'loss': 0.0067, 'grad_norm': 0.16213847696781158, 'learning_rate': 5.786116322701689e-05, 'epoch': 7.11}


 71%|███████   | 1900/2670 [46:17<18:30,  1.44s/it]

{'loss': 0.0039, 'grad_norm': 0.19900178909301758, 'learning_rate': 5.7786116322701687e-05, 'epoch': 7.12}


 71%|███████   | 1901/2670 [46:19<19:26,  1.52s/it]

{'loss': 0.0037, 'grad_norm': 0.037966739386320114, 'learning_rate': 5.7711069418386495e-05, 'epoch': 7.12}


 71%|███████   | 1902/2670 [46:20<18:28,  1.44s/it]

{'loss': 0.0047, 'grad_norm': 0.050923530012369156, 'learning_rate': 5.76360225140713e-05, 'epoch': 7.12}


 71%|███████▏  | 1903/2670 [46:22<18:22,  1.44s/it]

{'loss': 0.0024, 'grad_norm': 0.039235569536685944, 'learning_rate': 5.756097560975609e-05, 'epoch': 7.13}


 71%|███████▏  | 1904/2670 [46:23<18:02,  1.41s/it]

{'loss': 0.0092, 'grad_norm': 0.10358253121376038, 'learning_rate': 5.748592870544091e-05, 'epoch': 7.13}


 71%|███████▏  | 1905/2670 [46:25<18:24,  1.44s/it]

{'loss': 0.0045, 'grad_norm': 0.08190850168466568, 'learning_rate': 5.7410881801125704e-05, 'epoch': 7.13}


 71%|███████▏  | 1906/2670 [46:26<18:31,  1.45s/it]

{'loss': 0.0068, 'grad_norm': 0.17475475370883942, 'learning_rate': 5.733583489681051e-05, 'epoch': 7.14}


 71%|███████▏  | 1907/2670 [46:27<16:48,  1.32s/it]

{'loss': 0.0055, 'grad_norm': 0.06587167084217072, 'learning_rate': 5.7260787992495315e-05, 'epoch': 7.14}


 71%|███████▏  | 1908/2670 [46:29<17:32,  1.38s/it]

{'loss': 0.0035, 'grad_norm': 0.05416261777281761, 'learning_rate': 5.718574108818011e-05, 'epoch': 7.15}


 71%|███████▏  | 1909/2670 [46:30<17:49,  1.41s/it]

{'loss': 0.0063, 'grad_norm': 0.21898336708545685, 'learning_rate': 5.711069418386492e-05, 'epoch': 7.15}


 72%|███████▏  | 1910/2670 [46:31<17:22,  1.37s/it]

{'loss': 0.0057, 'grad_norm': 0.22339101135730743, 'learning_rate': 5.703564727954972e-05, 'epoch': 7.15}


 72%|███████▏  | 1911/2670 [46:33<18:05,  1.43s/it]

{'loss': 0.0046, 'grad_norm': 0.13090692460536957, 'learning_rate': 5.696060037523453e-05, 'epoch': 7.16}


 72%|███████▏  | 1912/2670 [46:34<18:15,  1.45s/it]

{'loss': 0.0053, 'grad_norm': 0.07829322665929794, 'learning_rate': 5.6885553470919325e-05, 'epoch': 7.16}


 72%|███████▏  | 1913/2670 [46:36<17:23,  1.38s/it]

{'loss': 0.006, 'grad_norm': 0.14688430726528168, 'learning_rate': 5.681050656660413e-05, 'epoch': 7.16}


 72%|███████▏  | 1914/2670 [46:37<18:18,  1.45s/it]

{'loss': 0.0071, 'grad_norm': 0.08413638919591904, 'learning_rate': 5.6735459662288936e-05, 'epoch': 7.17}


 72%|███████▏  | 1915/2670 [46:39<18:14,  1.45s/it]

{'loss': 0.0047, 'grad_norm': 0.04699331894516945, 'learning_rate': 5.666041275797374e-05, 'epoch': 7.17}


 72%|███████▏  | 1916/2670 [46:40<17:28,  1.39s/it]

{'loss': 0.0058, 'grad_norm': 0.06402646005153656, 'learning_rate': 5.6585365853658533e-05, 'epoch': 7.18}


 72%|███████▏  | 1917/2670 [46:42<17:53,  1.43s/it]

{'loss': 0.0039, 'grad_norm': 0.05679667741060257, 'learning_rate': 5.651031894934334e-05, 'epoch': 7.18}


 72%|███████▏  | 1918/2670 [46:43<17:21,  1.39s/it]

{'loss': 0.0082, 'grad_norm': 0.09843716025352478, 'learning_rate': 5.6435272045028144e-05, 'epoch': 7.18}


 72%|███████▏  | 1919/2670 [46:44<18:14,  1.46s/it]

{'loss': 0.0063, 'grad_norm': 0.10232862085103989, 'learning_rate': 5.636022514071295e-05, 'epoch': 7.19}


 72%|███████▏  | 1920/2670 [46:46<19:41,  1.58s/it]

{'loss': 0.0039, 'grad_norm': 0.04441889002919197, 'learning_rate': 5.628517823639775e-05, 'epoch': 7.19}


 72%|███████▏  | 1921/2670 [46:48<19:45,  1.58s/it]

{'loss': 0.0076, 'grad_norm': 0.1891971081495285, 'learning_rate': 5.621013133208255e-05, 'epoch': 7.19}


 72%|███████▏  | 1922/2670 [46:49<19:36,  1.57s/it]

{'loss': 0.0038, 'grad_norm': 0.05748794227838516, 'learning_rate': 5.613508442776736e-05, 'epoch': 7.2}


 72%|███████▏  | 1923/2670 [46:51<19:04,  1.53s/it]

{'loss': 0.0051, 'grad_norm': 0.07906089723110199, 'learning_rate': 5.6060037523452155e-05, 'epoch': 7.2}


 72%|███████▏  | 1924/2670 [46:52<18:28,  1.49s/it]

{'loss': 0.0052, 'grad_norm': 0.12253688275814056, 'learning_rate': 5.598499061913697e-05, 'epoch': 7.21}


 72%|███████▏  | 1925/2670 [46:54<18:20,  1.48s/it]

{'loss': 0.0039, 'grad_norm': 0.04947977513074875, 'learning_rate': 5.5909943714821766e-05, 'epoch': 7.21}


 72%|███████▏  | 1926/2670 [46:55<19:18,  1.56s/it]

{'loss': 0.0077, 'grad_norm': 0.14176708459854126, 'learning_rate': 5.583489681050657e-05, 'epoch': 7.21}


 72%|███████▏  | 1927/2670 [46:57<17:56,  1.45s/it]

{'loss': 0.0059, 'grad_norm': 0.08672411739826202, 'learning_rate': 5.575984990619138e-05, 'epoch': 7.22}


 72%|███████▏  | 1928/2670 [46:58<18:38,  1.51s/it]

{'loss': 0.0072, 'grad_norm': 0.13865427672863007, 'learning_rate': 5.568480300187617e-05, 'epoch': 7.22}


 72%|███████▏  | 1929/2670 [47:00<18:37,  1.51s/it]

{'loss': 0.0053, 'grad_norm': 0.0785355269908905, 'learning_rate': 5.560975609756098e-05, 'epoch': 7.22}


 72%|███████▏  | 1930/2670 [47:01<18:55,  1.53s/it]

{'loss': 0.0438, 'grad_norm': 0.8175063729286194, 'learning_rate': 5.553470919324578e-05, 'epoch': 7.23}


 72%|███████▏  | 1931/2670 [47:03<17:55,  1.46s/it]

{'loss': 0.006, 'grad_norm': 0.05844200775027275, 'learning_rate': 5.545966228893058e-05, 'epoch': 7.23}


 72%|███████▏  | 1932/2670 [47:04<17:50,  1.45s/it]

{'loss': 0.0034, 'grad_norm': 0.04776788502931595, 'learning_rate': 5.538461538461539e-05, 'epoch': 7.24}


 72%|███████▏  | 1933/2670 [47:05<17:23,  1.42s/it]

{'loss': 0.003, 'grad_norm': 0.05928156524896622, 'learning_rate': 5.530956848030019e-05, 'epoch': 7.24}


 72%|███████▏  | 1934/2670 [47:07<18:03,  1.47s/it]

{'loss': 0.0026, 'grad_norm': 0.05329188331961632, 'learning_rate': 5.5234521575985e-05, 'epoch': 7.24}


 72%|███████▏  | 1935/2670 [47:09<18:07,  1.48s/it]

{'loss': 0.0063, 'grad_norm': 0.15316586196422577, 'learning_rate': 5.51594746716698e-05, 'epoch': 7.25}


 73%|███████▎  | 1936/2670 [47:10<18:13,  1.49s/it]

{'loss': 0.0073, 'grad_norm': 0.07532471418380737, 'learning_rate': 5.5084427767354595e-05, 'epoch': 7.25}


 73%|███████▎  | 1937/2670 [47:11<17:26,  1.43s/it]

{'loss': 0.0025, 'grad_norm': 0.04243464395403862, 'learning_rate': 5.5009380863039404e-05, 'epoch': 7.25}


 73%|███████▎  | 1938/2670 [47:13<18:50,  1.54s/it]

{'loss': 0.0104, 'grad_norm': 0.2330462485551834, 'learning_rate': 5.4934333958724206e-05, 'epoch': 7.26}


 73%|███████▎  | 1939/2670 [47:14<17:54,  1.47s/it]

{'loss': 0.0207, 'grad_norm': 0.23339244723320007, 'learning_rate': 5.4859287054409e-05, 'epoch': 7.26}


 73%|███████▎  | 1940/2670 [47:16<17:49,  1.46s/it]

{'loss': 0.0126, 'grad_norm': 0.26746901869773865, 'learning_rate': 5.478424015009381e-05, 'epoch': 7.27}


 73%|███████▎  | 1941/2670 [47:17<17:30,  1.44s/it]

{'loss': 0.0054, 'grad_norm': 0.0715462863445282, 'learning_rate': 5.470919324577861e-05, 'epoch': 7.27}


 73%|███████▎  | 1942/2670 [47:18<16:19,  1.35s/it]

{'loss': 0.0106, 'grad_norm': 0.11262720078229904, 'learning_rate': 5.463414634146342e-05, 'epoch': 7.27}


 73%|███████▎  | 1943/2670 [47:20<15:57,  1.32s/it]

{'loss': 0.0037, 'grad_norm': 0.0453546978533268, 'learning_rate': 5.455909943714822e-05, 'epoch': 7.28}


 73%|███████▎  | 1944/2670 [47:21<16:21,  1.35s/it]

{'loss': 0.0092, 'grad_norm': 0.11039828509092331, 'learning_rate': 5.448405253283302e-05, 'epoch': 7.28}


 73%|███████▎  | 1945/2670 [47:23<16:27,  1.36s/it]

{'loss': 0.007, 'grad_norm': 0.1369301676750183, 'learning_rate': 5.440900562851783e-05, 'epoch': 7.28}


 73%|███████▎  | 1946/2670 [47:24<16:40,  1.38s/it]

{'loss': 0.0045, 'grad_norm': 0.08657915145158768, 'learning_rate': 5.433395872420263e-05, 'epoch': 7.29}


 73%|███████▎  | 1947/2670 [47:25<17:13,  1.43s/it]

{'loss': 0.0039, 'grad_norm': 0.16117793321609497, 'learning_rate': 5.425891181988744e-05, 'epoch': 7.29}


 73%|███████▎  | 1948/2670 [47:27<16:27,  1.37s/it]

{'loss': 0.0056, 'grad_norm': 0.21114103496074677, 'learning_rate': 5.4183864915572234e-05, 'epoch': 7.3}


 73%|███████▎  | 1949/2670 [47:28<16:06,  1.34s/it]

{'loss': 0.0067, 'grad_norm': 0.0763978585600853, 'learning_rate': 5.4108818011257036e-05, 'epoch': 7.3}


 73%|███████▎  | 1950/2670 [47:29<16:37,  1.39s/it]

{'loss': 0.0173, 'grad_norm': 0.5259388089179993, 'learning_rate': 5.4033771106941845e-05, 'epoch': 7.3}


 73%|███████▎  | 1951/2670 [47:31<16:25,  1.37s/it]

{'loss': 0.0045, 'grad_norm': 0.0745694637298584, 'learning_rate': 5.395872420262664e-05, 'epoch': 7.31}


 73%|███████▎  | 1952/2670 [47:32<16:06,  1.35s/it]

{'loss': 0.0058, 'grad_norm': 0.06839897483587265, 'learning_rate': 5.3883677298311456e-05, 'epoch': 7.31}


 73%|███████▎  | 1953/2670 [47:33<15:49,  1.32s/it]

{'loss': 0.005, 'grad_norm': 0.07210039347410202, 'learning_rate': 5.380863039399625e-05, 'epoch': 7.31}


 73%|███████▎  | 1954/2670 [47:35<15:51,  1.33s/it]

{'loss': 0.0036, 'grad_norm': 0.04580441117286682, 'learning_rate': 5.3733583489681046e-05, 'epoch': 7.32}


 73%|███████▎  | 1955/2670 [47:36<16:33,  1.39s/it]

{'loss': 0.0099, 'grad_norm': 0.18847548961639404, 'learning_rate': 5.365853658536586e-05, 'epoch': 7.32}


 73%|███████▎  | 1956/2670 [47:38<18:53,  1.59s/it]

{'loss': 0.0044, 'grad_norm': 0.07555493712425232, 'learning_rate': 5.358348968105066e-05, 'epoch': 7.33}


 73%|███████▎  | 1957/2670 [47:40<18:30,  1.56s/it]

{'loss': 0.0058, 'grad_norm': 0.0699533224105835, 'learning_rate': 5.350844277673546e-05, 'epoch': 7.33}


 73%|███████▎  | 1958/2670 [47:41<17:24,  1.47s/it]

{'loss': 0.002, 'grad_norm': 0.03362996503710747, 'learning_rate': 5.343339587242027e-05, 'epoch': 7.33}


 73%|███████▎  | 1959/2670 [47:42<16:59,  1.43s/it]

{'loss': 0.0084, 'grad_norm': 0.09239830821752548, 'learning_rate': 5.3358348968105064e-05, 'epoch': 7.34}


 73%|███████▎  | 1960/2670 [47:44<16:42,  1.41s/it]

{'loss': 0.0093, 'grad_norm': 0.0911584123969078, 'learning_rate': 5.328330206378987e-05, 'epoch': 7.34}


 73%|███████▎  | 1961/2670 [47:45<16:29,  1.40s/it]

{'loss': 0.0042, 'grad_norm': 0.04788007214665413, 'learning_rate': 5.3208255159474674e-05, 'epoch': 7.34}


 73%|███████▎  | 1962/2670 [47:46<15:41,  1.33s/it]

{'loss': 0.0106, 'grad_norm': 0.17254817485809326, 'learning_rate': 5.313320825515947e-05, 'epoch': 7.35}


 74%|███████▎  | 1963/2670 [47:48<16:42,  1.42s/it]

{'loss': 0.0137, 'grad_norm': 0.2589457035064697, 'learning_rate': 5.3058161350844285e-05, 'epoch': 7.35}


 74%|███████▎  | 1964/2670 [47:49<15:55,  1.35s/it]

{'loss': 0.0052, 'grad_norm': 0.06417998671531677, 'learning_rate': 5.298311444652908e-05, 'epoch': 7.36}


 74%|███████▎  | 1965/2670 [47:51<16:31,  1.41s/it]

{'loss': 0.0065, 'grad_norm': 0.07551396638154984, 'learning_rate': 5.290806754221389e-05, 'epoch': 7.36}


 74%|███████▎  | 1966/2670 [47:52<16:23,  1.40s/it]

{'loss': 0.0051, 'grad_norm': 0.11134962737560272, 'learning_rate': 5.283302063789869e-05, 'epoch': 7.36}


 74%|███████▎  | 1967/2670 [47:54<17:03,  1.46s/it]

{'loss': 0.0065, 'grad_norm': 0.13475444912910461, 'learning_rate': 5.275797373358349e-05, 'epoch': 7.37}


 74%|███████▎  | 1968/2670 [47:55<17:10,  1.47s/it]

{'loss': 0.0588, 'grad_norm': 3.763150453567505, 'learning_rate': 5.2682926829268296e-05, 'epoch': 7.37}


 74%|███████▎  | 1969/2670 [47:57<17:21,  1.49s/it]

{'loss': 0.0043, 'grad_norm': 0.04338296875357628, 'learning_rate': 5.26078799249531e-05, 'epoch': 7.37}


 74%|███████▍  | 1970/2670 [47:59<19:13,  1.65s/it]

{'loss': 0.0054, 'grad_norm': 0.1208205372095108, 'learning_rate': 5.253283302063791e-05, 'epoch': 7.38}


 74%|███████▍  | 1971/2670 [48:00<19:33,  1.68s/it]

{'loss': 0.0088, 'grad_norm': 0.10247337073087692, 'learning_rate': 5.24577861163227e-05, 'epoch': 7.38}


 74%|███████▍  | 1972/2670 [48:02<19:21,  1.66s/it]

{'loss': 0.0035, 'grad_norm': 0.04780654236674309, 'learning_rate': 5.2382739212007504e-05, 'epoch': 7.39}


 74%|███████▍  | 1973/2670 [48:03<18:31,  1.59s/it]

{'loss': 0.0052, 'grad_norm': 0.06470789760351181, 'learning_rate': 5.230769230769231e-05, 'epoch': 7.39}


 74%|███████▍  | 1974/2670 [48:05<18:16,  1.57s/it]

{'loss': 0.0063, 'grad_norm': 0.09801441431045532, 'learning_rate': 5.2232645403377115e-05, 'epoch': 7.39}


 74%|███████▍  | 1975/2670 [48:06<17:33,  1.52s/it]

{'loss': 0.0042, 'grad_norm': 0.05738504230976105, 'learning_rate': 5.2157598499061924e-05, 'epoch': 7.4}


 74%|███████▍  | 1976/2670 [48:08<16:40,  1.44s/it]

{'loss': 0.0071, 'grad_norm': 0.09255293756723404, 'learning_rate': 5.208255159474672e-05, 'epoch': 7.4}


 74%|███████▍  | 1977/2670 [48:09<16:18,  1.41s/it]

{'loss': 0.0042, 'grad_norm': 0.057990994304418564, 'learning_rate': 5.200750469043152e-05, 'epoch': 7.4}


 74%|███████▍  | 1978/2670 [48:10<16:26,  1.43s/it]

{'loss': 0.0051, 'grad_norm': 0.0762033760547638, 'learning_rate': 5.193245778611633e-05, 'epoch': 7.41}


 74%|███████▍  | 1979/2670 [48:12<16:38,  1.44s/it]

{'loss': 0.0038, 'grad_norm': 0.20146335661411285, 'learning_rate': 5.1857410881801125e-05, 'epoch': 7.41}


 74%|███████▍  | 1980/2670 [48:13<16:47,  1.46s/it]

{'loss': 0.0046, 'grad_norm': 0.051017723977565765, 'learning_rate': 5.178236397748593e-05, 'epoch': 7.42}


 74%|███████▍  | 1981/2670 [48:15<16:21,  1.42s/it]

{'loss': 0.0095, 'grad_norm': 0.18901343643665314, 'learning_rate': 5.1707317073170736e-05, 'epoch': 7.42}


 74%|███████▍  | 1982/2670 [48:16<16:41,  1.46s/it]

{'loss': 0.003, 'grad_norm': 0.05007309094071388, 'learning_rate': 5.163227016885553e-05, 'epoch': 7.42}


 74%|███████▍  | 1983/2670 [48:18<16:31,  1.44s/it]

{'loss': 0.015, 'grad_norm': 0.19623719155788422, 'learning_rate': 5.155722326454035e-05, 'epoch': 7.43}


 74%|███████▍  | 1984/2670 [48:19<17:06,  1.50s/it]

{'loss': 0.006, 'grad_norm': 0.08640771359205246, 'learning_rate': 5.148217636022514e-05, 'epoch': 7.43}


 74%|███████▍  | 1985/2670 [48:21<17:01,  1.49s/it]

{'loss': 0.0063, 'grad_norm': 0.08034704625606537, 'learning_rate': 5.1407129455909945e-05, 'epoch': 7.43}


 74%|███████▍  | 1986/2670 [48:22<17:09,  1.51s/it]

{'loss': 0.007, 'grad_norm': 0.08094910532236099, 'learning_rate': 5.1332082551594754e-05, 'epoch': 7.44}


 74%|███████▍  | 1987/2670 [48:24<16:43,  1.47s/it]

{'loss': 0.0061, 'grad_norm': 0.08889145404100418, 'learning_rate': 5.125703564727955e-05, 'epoch': 7.44}


 74%|███████▍  | 1988/2670 [48:25<17:13,  1.52s/it]

{'loss': 0.0031, 'grad_norm': 0.07290196418762207, 'learning_rate': 5.118198874296436e-05, 'epoch': 7.45}


 74%|███████▍  | 1989/2670 [48:27<16:51,  1.49s/it]

{'loss': 0.0078, 'grad_norm': 0.09012092649936676, 'learning_rate': 5.110694183864916e-05, 'epoch': 7.45}


 75%|███████▍  | 1990/2670 [48:28<16:12,  1.43s/it]

{'loss': 0.0059, 'grad_norm': 0.06278881430625916, 'learning_rate': 5.1031894934333955e-05, 'epoch': 7.45}


 75%|███████▍  | 1991/2670 [48:30<16:57,  1.50s/it]

{'loss': 0.0042, 'grad_norm': 0.0572514571249485, 'learning_rate': 5.0956848030018764e-05, 'epoch': 7.46}


 75%|███████▍  | 1992/2670 [48:31<16:39,  1.47s/it]

{'loss': 0.0096, 'grad_norm': 0.1041669249534607, 'learning_rate': 5.0881801125703566e-05, 'epoch': 7.46}


 75%|███████▍  | 1993/2670 [48:33<16:23,  1.45s/it]

{'loss': 0.0059, 'grad_norm': 0.06572592258453369, 'learning_rate': 5.0806754221388375e-05, 'epoch': 7.46}


 75%|███████▍  | 1994/2670 [48:34<16:10,  1.44s/it]

{'loss': 0.0046, 'grad_norm': 0.03975538909435272, 'learning_rate': 5.073170731707318e-05, 'epoch': 7.47}


 75%|███████▍  | 1995/2670 [48:35<15:41,  1.40s/it]

{'loss': 0.0099, 'grad_norm': 0.1169077605009079, 'learning_rate': 5.065666041275797e-05, 'epoch': 7.47}


 75%|███████▍  | 1996/2670 [48:37<15:49,  1.41s/it]

{'loss': 0.0098, 'grad_norm': 0.1864273101091385, 'learning_rate': 5.058161350844278e-05, 'epoch': 7.48}


 75%|███████▍  | 1997/2670 [48:38<15:54,  1.42s/it]

{'loss': 0.0104, 'grad_norm': 0.09419964253902435, 'learning_rate': 5.050656660412758e-05, 'epoch': 7.48}


 75%|███████▍  | 1998/2670 [48:40<16:08,  1.44s/it]

{'loss': 0.0023, 'grad_norm': 0.030149605125188828, 'learning_rate': 5.043151969981239e-05, 'epoch': 7.48}


 75%|███████▍  | 1999/2670 [48:41<16:03,  1.44s/it]

{'loss': 0.0061, 'grad_norm': 0.058626748621463776, 'learning_rate': 5.035647279549719e-05, 'epoch': 7.49}


 75%|███████▍  | 2000/2670 [48:42<15:48,  1.42s/it]

{'loss': 0.0025, 'grad_norm': 0.034577012062072754, 'learning_rate': 5.028142589118199e-05, 'epoch': 7.49}


 75%|███████▍  | 2001/2670 [48:45<21:10,  1.90s/it]

{'loss': 0.0049, 'grad_norm': 0.053503215312957764, 'learning_rate': 5.02063789868668e-05, 'epoch': 7.49}


 75%|███████▍  | 2002/2670 [48:47<19:40,  1.77s/it]

{'loss': 0.0033, 'grad_norm': 0.04331037774682045, 'learning_rate': 5.0131332082551594e-05, 'epoch': 7.5}


 75%|███████▌  | 2003/2670 [48:48<18:49,  1.69s/it]

{'loss': 0.0052, 'grad_norm': 0.08157249540090561, 'learning_rate': 5.0056285178236396e-05, 'epoch': 7.5}


 75%|███████▌  | 2004/2670 [48:50<18:19,  1.65s/it]

{'loss': 0.0065, 'grad_norm': 0.14796409010887146, 'learning_rate': 4.9981238273921205e-05, 'epoch': 7.51}


 75%|███████▌  | 2005/2670 [48:51<16:54,  1.53s/it]

{'loss': 0.0076, 'grad_norm': 0.30183541774749756, 'learning_rate': 4.990619136960601e-05, 'epoch': 7.51}


 75%|███████▌  | 2006/2670 [48:53<16:54,  1.53s/it]

{'loss': 0.0099, 'grad_norm': 0.14715167880058289, 'learning_rate': 4.983114446529081e-05, 'epoch': 7.51}


 75%|███████▌  | 2007/2670 [48:54<17:25,  1.58s/it]

{'loss': 0.0088, 'grad_norm': 0.24470776319503784, 'learning_rate': 4.975609756097561e-05, 'epoch': 7.52}


 75%|███████▌  | 2008/2670 [48:56<16:23,  1.49s/it]

{'loss': 0.0078, 'grad_norm': 0.10599647462368011, 'learning_rate': 4.968105065666041e-05, 'epoch': 7.52}


 75%|███████▌  | 2009/2670 [48:57<16:00,  1.45s/it]

{'loss': 0.0137, 'grad_norm': 0.15383557975292206, 'learning_rate': 4.960600375234522e-05, 'epoch': 7.52}


 75%|███████▌  | 2010/2670 [48:59<16:18,  1.48s/it]

{'loss': 0.0044, 'grad_norm': 0.07037656009197235, 'learning_rate': 4.953095684803002e-05, 'epoch': 7.53}


 75%|███████▌  | 2011/2670 [49:00<16:15,  1.48s/it]

{'loss': 0.0039, 'grad_norm': 0.057173050940036774, 'learning_rate': 4.9455909943714826e-05, 'epoch': 7.53}


 75%|███████▌  | 2012/2670 [49:01<15:52,  1.45s/it]

{'loss': 0.0059, 'grad_norm': 0.11171295493841171, 'learning_rate': 4.938086303939963e-05, 'epoch': 7.54}


 75%|███████▌  | 2013/2670 [49:03<15:27,  1.41s/it]

{'loss': 0.0059, 'grad_norm': 0.06840222328901291, 'learning_rate': 4.930581613508443e-05, 'epoch': 7.54}


 75%|███████▌  | 2014/2670 [49:04<15:36,  1.43s/it]

{'loss': 0.0046, 'grad_norm': 0.05603247508406639, 'learning_rate': 4.923076923076924e-05, 'epoch': 7.54}


 75%|███████▌  | 2015/2670 [49:06<15:13,  1.40s/it]

{'loss': 0.0038, 'grad_norm': 0.10944841802120209, 'learning_rate': 4.9155722326454034e-05, 'epoch': 7.55}


 76%|███████▌  | 2016/2670 [49:07<15:41,  1.44s/it]

{'loss': 0.0069, 'grad_norm': 0.08261758089065552, 'learning_rate': 4.9080675422138836e-05, 'epoch': 7.55}


 76%|███████▌  | 2017/2670 [49:09<15:40,  1.44s/it]

{'loss': 0.0064, 'grad_norm': 0.07633445411920547, 'learning_rate': 4.9005628517823645e-05, 'epoch': 7.55}


 76%|███████▌  | 2018/2670 [49:10<15:35,  1.44s/it]

{'loss': 0.0083, 'grad_norm': 0.12346494942903519, 'learning_rate': 4.893058161350845e-05, 'epoch': 7.56}


 76%|███████▌  | 2019/2670 [49:11<15:45,  1.45s/it]

{'loss': 0.0122, 'grad_norm': 0.5263869762420654, 'learning_rate': 4.885553470919324e-05, 'epoch': 7.56}


 76%|███████▌  | 2020/2670 [49:13<16:16,  1.50s/it]

{'loss': 0.0074, 'grad_norm': 0.09286694973707199, 'learning_rate': 4.878048780487805e-05, 'epoch': 7.57}


 76%|███████▌  | 2021/2670 [49:14<15:41,  1.45s/it]

{'loss': 0.0129, 'grad_norm': 0.43749716877937317, 'learning_rate': 4.8705440900562854e-05, 'epoch': 7.57}


 76%|███████▌  | 2022/2670 [49:16<15:08,  1.40s/it]

{'loss': 0.0044, 'grad_norm': 0.05243195593357086, 'learning_rate': 4.8630393996247656e-05, 'epoch': 7.57}


 76%|███████▌  | 2023/2670 [49:17<15:29,  1.44s/it]

{'loss': 0.0083, 'grad_norm': 0.08447962999343872, 'learning_rate': 4.8555347091932464e-05, 'epoch': 7.58}


 76%|███████▌  | 2024/2670 [49:19<15:17,  1.42s/it]

{'loss': 0.0112, 'grad_norm': 0.2986009120941162, 'learning_rate': 4.848030018761726e-05, 'epoch': 7.58}


 76%|███████▌  | 2025/2670 [49:20<15:16,  1.42s/it]

{'loss': 0.0074, 'grad_norm': 0.09081859141588211, 'learning_rate': 4.840525328330207e-05, 'epoch': 7.58}


 76%|███████▌  | 2026/2670 [49:21<14:55,  1.39s/it]

{'loss': 0.0082, 'grad_norm': 0.090232715010643, 'learning_rate': 4.833020637898687e-05, 'epoch': 7.59}


 76%|███████▌  | 2027/2670 [49:23<15:03,  1.41s/it]

{'loss': 0.0073, 'grad_norm': 0.10007913410663605, 'learning_rate': 4.825515947467167e-05, 'epoch': 7.59}


 76%|███████▌  | 2028/2670 [49:24<14:23,  1.35s/it]

{'loss': 0.0066, 'grad_norm': 0.06956639140844345, 'learning_rate': 4.8180112570356475e-05, 'epoch': 7.6}


 76%|███████▌  | 2029/2670 [49:26<15:06,  1.41s/it]

{'loss': 0.0043, 'grad_norm': 0.0544690303504467, 'learning_rate': 4.810506566604128e-05, 'epoch': 7.6}


 76%|███████▌  | 2030/2670 [49:27<15:16,  1.43s/it]

{'loss': 0.0071, 'grad_norm': 0.12831515073776245, 'learning_rate': 4.803001876172608e-05, 'epoch': 7.6}


 76%|███████▌  | 2031/2670 [49:28<14:55,  1.40s/it]

{'loss': 0.011, 'grad_norm': 0.10705015808343887, 'learning_rate': 4.795497185741089e-05, 'epoch': 7.61}


 76%|███████▌  | 2032/2670 [49:30<15:14,  1.43s/it]

{'loss': 0.0076, 'grad_norm': 0.09233949333429337, 'learning_rate': 4.787992495309569e-05, 'epoch': 7.61}


 76%|███████▌  | 2033/2670 [49:31<15:20,  1.45s/it]

{'loss': 0.0035, 'grad_norm': 0.054467327892780304, 'learning_rate': 4.7804878048780485e-05, 'epoch': 7.61}


 76%|███████▌  | 2034/2670 [49:33<15:06,  1.43s/it]

{'loss': 0.0106, 'grad_norm': 0.0828322246670723, 'learning_rate': 4.7729831144465294e-05, 'epoch': 7.62}


 76%|███████▌  | 2035/2670 [49:34<15:27,  1.46s/it]

{'loss': 0.0076, 'grad_norm': 0.09723799675703049, 'learning_rate': 4.7654784240150096e-05, 'epoch': 7.62}


 76%|███████▋  | 2036/2670 [49:36<15:42,  1.49s/it]

{'loss': 0.0037, 'grad_norm': 0.045301493257284164, 'learning_rate': 4.75797373358349e-05, 'epoch': 7.63}


 76%|███████▋  | 2037/2670 [49:37<15:11,  1.44s/it]

{'loss': 0.0051, 'grad_norm': 0.0810886099934578, 'learning_rate': 4.75046904315197e-05, 'epoch': 7.63}


 76%|███████▋  | 2038/2670 [49:39<15:00,  1.42s/it]

{'loss': 0.0063, 'grad_norm': 0.0672963336110115, 'learning_rate': 4.74296435272045e-05, 'epoch': 7.63}


 76%|███████▋  | 2039/2670 [49:40<14:50,  1.41s/it]

{'loss': 0.0072, 'grad_norm': 0.0996917262673378, 'learning_rate': 4.7354596622889305e-05, 'epoch': 7.64}


 76%|███████▋  | 2040/2670 [49:41<15:13,  1.45s/it]

{'loss': 0.0043, 'grad_norm': 0.060132596641778946, 'learning_rate': 4.727954971857411e-05, 'epoch': 7.64}


 76%|███████▋  | 2041/2670 [49:43<16:04,  1.53s/it]

{'loss': 0.0046, 'grad_norm': 0.06418140977621078, 'learning_rate': 4.7204502814258915e-05, 'epoch': 7.64}


 76%|███████▋  | 2042/2670 [49:45<16:22,  1.56s/it]

{'loss': 0.0087, 'grad_norm': 0.14755357801914215, 'learning_rate': 4.712945590994372e-05, 'epoch': 7.65}


 77%|███████▋  | 2043/2670 [49:47<17:14,  1.65s/it]

{'loss': 0.0079, 'grad_norm': 0.21844100952148438, 'learning_rate': 4.705440900562852e-05, 'epoch': 7.65}


 77%|███████▋  | 2044/2670 [49:48<16:14,  1.56s/it]

{'loss': 0.0083, 'grad_norm': 0.0896424651145935, 'learning_rate': 4.697936210131332e-05, 'epoch': 7.66}


 77%|███████▋  | 2045/2670 [49:49<15:34,  1.49s/it]

{'loss': 0.0072, 'grad_norm': 0.08759603649377823, 'learning_rate': 4.690431519699813e-05, 'epoch': 7.66}


 77%|███████▋  | 2046/2670 [49:51<14:58,  1.44s/it]

{'loss': 0.0084, 'grad_norm': 0.0938129872083664, 'learning_rate': 4.682926829268293e-05, 'epoch': 7.66}


 77%|███████▋  | 2047/2670 [49:52<15:04,  1.45s/it]

{'loss': 0.0034, 'grad_norm': 0.046313680708408356, 'learning_rate': 4.675422138836773e-05, 'epoch': 7.67}


 77%|███████▋  | 2048/2670 [49:54<14:58,  1.44s/it]

{'loss': 0.0056, 'grad_norm': 0.11789951473474503, 'learning_rate': 4.667917448405254e-05, 'epoch': 7.67}


 77%|███████▋  | 2049/2670 [49:55<14:05,  1.36s/it]

{'loss': 0.0067, 'grad_norm': 0.08684523403644562, 'learning_rate': 4.660412757973734e-05, 'epoch': 7.67}


 77%|███████▋  | 2050/2670 [49:56<14:53,  1.44s/it]

{'loss': 0.0068, 'grad_norm': 0.22703662514686584, 'learning_rate': 4.652908067542214e-05, 'epoch': 7.68}


 77%|███████▋  | 2051/2670 [49:58<15:34,  1.51s/it]

{'loss': 0.0064, 'grad_norm': 0.08451981097459793, 'learning_rate': 4.645403377110694e-05, 'epoch': 7.68}


 77%|███████▋  | 2052/2670 [50:00<15:52,  1.54s/it]

{'loss': 0.0018, 'grad_norm': 0.02624412812292576, 'learning_rate': 4.6378986866791745e-05, 'epoch': 7.69}


 77%|███████▋  | 2053/2670 [50:01<15:43,  1.53s/it]

{'loss': 0.0045, 'grad_norm': 0.06611353158950806, 'learning_rate': 4.630393996247655e-05, 'epoch': 7.69}


 77%|███████▋  | 2054/2670 [50:03<15:03,  1.47s/it]

{'loss': 0.0074, 'grad_norm': 0.07753124833106995, 'learning_rate': 4.6228893058161356e-05, 'epoch': 7.69}


 77%|███████▋  | 2055/2670 [50:04<14:18,  1.40s/it]

{'loss': 0.0122, 'grad_norm': 0.08899484574794769, 'learning_rate': 4.615384615384616e-05, 'epoch': 7.7}


 77%|███████▋  | 2056/2670 [50:05<14:41,  1.44s/it]

{'loss': 0.0092, 'grad_norm': 0.30001014471054077, 'learning_rate': 4.607879924953096e-05, 'epoch': 7.7}


 77%|███████▋  | 2057/2670 [50:07<14:11,  1.39s/it]

{'loss': 0.0086, 'grad_norm': 0.09514423459768295, 'learning_rate': 4.600375234521576e-05, 'epoch': 7.7}


 77%|███████▋  | 2058/2670 [50:08<14:40,  1.44s/it]

{'loss': 0.0082, 'grad_norm': 0.12326905131340027, 'learning_rate': 4.5928705440900564e-05, 'epoch': 7.71}


 77%|███████▋  | 2059/2670 [50:10<14:49,  1.46s/it]

{'loss': 0.0055, 'grad_norm': 0.1649925261735916, 'learning_rate': 4.585365853658537e-05, 'epoch': 7.71}


 77%|███████▋  | 2060/2670 [50:11<14:13,  1.40s/it]

{'loss': 0.0075, 'grad_norm': 0.08468618243932724, 'learning_rate': 4.577861163227017e-05, 'epoch': 7.72}


 77%|███████▋  | 2061/2670 [50:13<15:24,  1.52s/it]

{'loss': 0.006, 'grad_norm': 0.07563146203756332, 'learning_rate': 4.570356472795497e-05, 'epoch': 7.72}


 77%|███████▋  | 2062/2670 [50:14<14:46,  1.46s/it]

{'loss': 0.0019, 'grad_norm': 0.03506555035710335, 'learning_rate': 4.562851782363978e-05, 'epoch': 7.72}


 77%|███████▋  | 2063/2670 [50:15<14:14,  1.41s/it]

{'loss': 0.0069, 'grad_norm': 0.07219438999891281, 'learning_rate': 4.555347091932458e-05, 'epoch': 7.73}


 77%|███████▋  | 2064/2670 [50:17<14:29,  1.44s/it]

{'loss': 0.0056, 'grad_norm': 0.08350291103124619, 'learning_rate': 4.5478424015009384e-05, 'epoch': 7.73}


 77%|███████▋  | 2065/2670 [50:18<13:59,  1.39s/it]

{'loss': 0.0038, 'grad_norm': 0.05928688496351242, 'learning_rate': 4.5403377110694186e-05, 'epoch': 7.73}


 77%|███████▋  | 2066/2670 [50:20<15:00,  1.49s/it]

{'loss': 0.0053, 'grad_norm': 0.28033560514450073, 'learning_rate': 4.532833020637899e-05, 'epoch': 7.74}


 77%|███████▋  | 2067/2670 [50:21<14:43,  1.47s/it]

{'loss': 0.0054, 'grad_norm': 0.06593567132949829, 'learning_rate': 4.525328330206379e-05, 'epoch': 7.74}


 77%|███████▋  | 2068/2670 [50:23<14:31,  1.45s/it]

{'loss': 0.007, 'grad_norm': 0.09009452909231186, 'learning_rate': 4.51782363977486e-05, 'epoch': 7.75}


 77%|███████▋  | 2069/2670 [50:24<13:55,  1.39s/it]

{'loss': 0.0092, 'grad_norm': 0.10539688915014267, 'learning_rate': 4.5103189493433394e-05, 'epoch': 7.75}


 78%|███████▊  | 2070/2670 [50:25<14:19,  1.43s/it]

{'loss': 0.0068, 'grad_norm': 0.07729236781597137, 'learning_rate': 4.50281425891182e-05, 'epoch': 7.75}


 78%|███████▊  | 2071/2670 [50:27<14:52,  1.49s/it]

{'loss': 0.0052, 'grad_norm': 0.2393733114004135, 'learning_rate': 4.4953095684803005e-05, 'epoch': 7.76}


 78%|███████▊  | 2072/2670 [50:28<14:49,  1.49s/it]

{'loss': 0.0036, 'grad_norm': 0.053314127027988434, 'learning_rate': 4.487804878048781e-05, 'epoch': 7.76}


 78%|███████▊  | 2073/2670 [50:30<14:21,  1.44s/it]

{'loss': 0.0082, 'grad_norm': 0.07107644528150558, 'learning_rate': 4.480300187617261e-05, 'epoch': 7.76}


 78%|███████▊  | 2074/2670 [50:31<13:48,  1.39s/it]

{'loss': 0.0093, 'grad_norm': 0.1780117154121399, 'learning_rate': 4.472795497185741e-05, 'epoch': 7.77}


 78%|███████▊  | 2075/2670 [50:33<14:08,  1.43s/it]

{'loss': 0.0137, 'grad_norm': 0.2622995972633362, 'learning_rate': 4.465290806754221e-05, 'epoch': 7.77}


 78%|███████▊  | 2076/2670 [50:34<14:05,  1.42s/it]

{'loss': 0.0044, 'grad_norm': 0.06067490577697754, 'learning_rate': 4.457786116322702e-05, 'epoch': 7.78}


 78%|███████▊  | 2077/2670 [50:36<14:56,  1.51s/it]

{'loss': 0.0078, 'grad_norm': 0.15399616956710815, 'learning_rate': 4.4502814258911824e-05, 'epoch': 7.78}


 78%|███████▊  | 2078/2670 [50:37<15:01,  1.52s/it]

{'loss': 0.0081, 'grad_norm': 0.07925297319889069, 'learning_rate': 4.4427767354596626e-05, 'epoch': 7.78}


 78%|███████▊  | 2079/2670 [50:39<14:10,  1.44s/it]

{'loss': 0.0074, 'grad_norm': 0.07449667155742645, 'learning_rate': 4.435272045028143e-05, 'epoch': 7.79}


 78%|███████▊  | 2080/2670 [50:41<15:51,  1.61s/it]

{'loss': 0.0069, 'grad_norm': 0.09437887370586395, 'learning_rate': 4.427767354596623e-05, 'epoch': 7.79}


 78%|███████▊  | 2081/2670 [50:42<15:30,  1.58s/it]

{'loss': 0.006, 'grad_norm': 0.06684686988592148, 'learning_rate': 4.420262664165103e-05, 'epoch': 7.79}


 78%|███████▊  | 2082/2670 [50:43<14:51,  1.52s/it]

{'loss': 0.0073, 'grad_norm': 0.0884675681591034, 'learning_rate': 4.412757973733584e-05, 'epoch': 7.8}


 78%|███████▊  | 2083/2670 [50:45<14:49,  1.52s/it]

{'loss': 0.0051, 'grad_norm': 0.14042192697525024, 'learning_rate': 4.405253283302064e-05, 'epoch': 7.8}


 78%|███████▊  | 2084/2670 [50:46<14:29,  1.48s/it]

{'loss': 0.0211, 'grad_norm': 0.1536451280117035, 'learning_rate': 4.397748592870544e-05, 'epoch': 7.81}


 78%|███████▊  | 2085/2670 [50:48<13:59,  1.43s/it]

{'loss': 0.0095, 'grad_norm': 0.31247803568840027, 'learning_rate': 4.390243902439025e-05, 'epoch': 7.81}


 78%|███████▊  | 2086/2670 [50:49<13:44,  1.41s/it]

{'loss': 0.0053, 'grad_norm': 0.057323794811964035, 'learning_rate': 4.382739212007505e-05, 'epoch': 7.81}


 78%|███████▊  | 2087/2670 [50:50<13:44,  1.41s/it]

{'loss': 0.0057, 'grad_norm': 0.054326243698596954, 'learning_rate': 4.375234521575985e-05, 'epoch': 7.82}


 78%|███████▊  | 2088/2670 [50:52<14:18,  1.47s/it]

{'loss': 0.0047, 'grad_norm': 0.07295122742652893, 'learning_rate': 4.3677298311444654e-05, 'epoch': 7.82}


 78%|███████▊  | 2089/2670 [50:54<14:27,  1.49s/it]

{'loss': 0.0064, 'grad_norm': 0.08397146314382553, 'learning_rate': 4.3602251407129456e-05, 'epoch': 7.82}


 78%|███████▊  | 2090/2670 [50:55<15:01,  1.55s/it]

{'loss': 0.0111, 'grad_norm': 0.133133664727211, 'learning_rate': 4.3527204502814265e-05, 'epoch': 7.83}


 78%|███████▊  | 2091/2670 [50:57<15:10,  1.57s/it]

{'loss': 0.0102, 'grad_norm': 0.11349375545978546, 'learning_rate': 4.345215759849907e-05, 'epoch': 7.83}


 78%|███████▊  | 2092/2670 [50:58<14:41,  1.52s/it]

{'loss': 0.004, 'grad_norm': 0.0422426201403141, 'learning_rate': 4.337711069418386e-05, 'epoch': 7.84}


 78%|███████▊  | 2093/2670 [51:00<13:59,  1.45s/it]

{'loss': 0.0106, 'grad_norm': 0.08300530165433884, 'learning_rate': 4.330206378986867e-05, 'epoch': 7.84}


 78%|███████▊  | 2094/2670 [51:01<13:57,  1.45s/it]

{'loss': 0.0199, 'grad_norm': 0.2724382281303406, 'learning_rate': 4.322701688555347e-05, 'epoch': 7.84}


 78%|███████▊  | 2095/2670 [51:02<13:35,  1.42s/it]

{'loss': 0.0078, 'grad_norm': 0.08494796603918076, 'learning_rate': 4.3151969981238275e-05, 'epoch': 7.85}


 79%|███████▊  | 2096/2670 [51:04<14:15,  1.49s/it]

{'loss': 0.0104, 'grad_norm': 0.09465541690587997, 'learning_rate': 4.3076923076923084e-05, 'epoch': 7.85}


 79%|███████▊  | 2097/2670 [51:05<14:02,  1.47s/it]

{'loss': 0.0084, 'grad_norm': 0.1001514196395874, 'learning_rate': 4.300187617260788e-05, 'epoch': 7.85}


 79%|███████▊  | 2098/2670 [51:07<13:52,  1.46s/it]

{'loss': 0.0061, 'grad_norm': 0.10575857758522034, 'learning_rate': 4.292682926829268e-05, 'epoch': 7.86}


 79%|███████▊  | 2099/2670 [51:08<14:03,  1.48s/it]

{'loss': 0.0055, 'grad_norm': 0.05356356129050255, 'learning_rate': 4.285178236397749e-05, 'epoch': 7.86}


 79%|███████▊  | 2100/2670 [51:10<13:37,  1.44s/it]

{'loss': 0.0097, 'grad_norm': 0.08499806374311447, 'learning_rate': 4.277673545966229e-05, 'epoch': 7.87}


 79%|███████▊  | 2101/2670 [51:11<13:28,  1.42s/it]

{'loss': 0.0071, 'grad_norm': 0.08852889388799667, 'learning_rate': 4.2701688555347094e-05, 'epoch': 7.87}


 79%|███████▊  | 2102/2670 [51:12<12:34,  1.33s/it]

{'loss': 0.0072, 'grad_norm': 0.1808290034532547, 'learning_rate': 4.2626641651031897e-05, 'epoch': 7.87}


 79%|███████▉  | 2103/2670 [51:14<13:08,  1.39s/it]

{'loss': 0.0069, 'grad_norm': 0.10428079962730408, 'learning_rate': 4.25515947467167e-05, 'epoch': 7.88}


 79%|███████▉  | 2104/2670 [51:15<13:30,  1.43s/it]

{'loss': 0.0061, 'grad_norm': 0.07505317032337189, 'learning_rate': 4.24765478424015e-05, 'epoch': 7.88}


 79%|███████▉  | 2105/2670 [51:17<13:29,  1.43s/it]

{'loss': 0.0096, 'grad_norm': 0.1025073230266571, 'learning_rate': 4.240150093808631e-05, 'epoch': 7.88}


 79%|███████▉  | 2106/2670 [51:18<13:13,  1.41s/it]

{'loss': 0.0094, 'grad_norm': 0.11055665463209152, 'learning_rate': 4.2326454033771105e-05, 'epoch': 7.89}


 79%|███████▉  | 2107/2670 [51:20<14:07,  1.51s/it]

{'loss': 0.0073, 'grad_norm': 0.13730540871620178, 'learning_rate': 4.2251407129455914e-05, 'epoch': 7.89}


 79%|███████▉  | 2108/2670 [51:22<15:08,  1.62s/it]

{'loss': 0.0037, 'grad_norm': 0.062501922249794, 'learning_rate': 4.2176360225140716e-05, 'epoch': 7.9}


 79%|███████▉  | 2109/2670 [51:23<15:10,  1.62s/it]

{'loss': 0.0082, 'grad_norm': 0.08669881522655487, 'learning_rate': 4.210131332082552e-05, 'epoch': 7.9}


 79%|███████▉  | 2110/2670 [51:25<14:53,  1.60s/it]

{'loss': 0.0054, 'grad_norm': 0.06631550937891006, 'learning_rate': 4.202626641651033e-05, 'epoch': 7.9}


 79%|███████▉  | 2111/2670 [51:26<14:38,  1.57s/it]

{'loss': 0.0082, 'grad_norm': 0.22105713188648224, 'learning_rate': 4.195121951219512e-05, 'epoch': 7.91}


 79%|███████▉  | 2112/2670 [51:28<14:27,  1.55s/it]

{'loss': 0.0091, 'grad_norm': 0.09858458489179611, 'learning_rate': 4.1876172607879924e-05, 'epoch': 7.91}


 79%|███████▉  | 2113/2670 [51:29<14:00,  1.51s/it]

{'loss': 0.0036, 'grad_norm': 0.04607001319527626, 'learning_rate': 4.180112570356473e-05, 'epoch': 7.91}


 79%|███████▉  | 2114/2670 [51:31<13:35,  1.47s/it]

{'loss': 0.0197, 'grad_norm': 0.13563589751720428, 'learning_rate': 4.1726078799249535e-05, 'epoch': 7.92}


 79%|███████▉  | 2115/2670 [51:32<13:52,  1.50s/it]

{'loss': 0.0076, 'grad_norm': 0.10851956158876419, 'learning_rate': 4.165103189493433e-05, 'epoch': 7.92}


 79%|███████▉  | 2116/2670 [51:34<13:56,  1.51s/it]

{'loss': 0.0079, 'grad_norm': 0.07805101573467255, 'learning_rate': 4.157598499061914e-05, 'epoch': 7.93}


 79%|███████▉  | 2117/2670 [51:35<13:10,  1.43s/it]

{'loss': 0.0134, 'grad_norm': 0.1639900505542755, 'learning_rate': 4.150093808630394e-05, 'epoch': 7.93}


 79%|███████▉  | 2118/2670 [51:36<12:52,  1.40s/it]

{'loss': 0.0068, 'grad_norm': 0.07176848500967026, 'learning_rate': 4.1425891181988743e-05, 'epoch': 7.93}


 79%|███████▉  | 2119/2670 [51:38<13:01,  1.42s/it]

{'loss': 0.0039, 'grad_norm': 0.04418905824422836, 'learning_rate': 4.135084427767355e-05, 'epoch': 7.94}


 79%|███████▉  | 2120/2670 [51:39<13:09,  1.44s/it]

{'loss': 0.0036, 'grad_norm': 0.05386391654610634, 'learning_rate': 4.127579737335835e-05, 'epoch': 7.94}


 79%|███████▉  | 2121/2670 [51:41<13:24,  1.46s/it]

{'loss': 0.0045, 'grad_norm': 0.0526823066174984, 'learning_rate': 4.1200750469043156e-05, 'epoch': 7.94}


 79%|███████▉  | 2122/2670 [51:42<13:22,  1.46s/it]

{'loss': 0.0078, 'grad_norm': 0.15600275993347168, 'learning_rate': 4.112570356472796e-05, 'epoch': 7.95}


 80%|███████▉  | 2123/2670 [51:44<13:02,  1.43s/it]

{'loss': 0.0042, 'grad_norm': 0.05423121154308319, 'learning_rate': 4.105065666041276e-05, 'epoch': 7.95}


 80%|███████▉  | 2124/2670 [51:45<13:35,  1.49s/it]

{'loss': 0.0049, 'grad_norm': 0.05269043520092964, 'learning_rate': 4.097560975609756e-05, 'epoch': 7.96}


 80%|███████▉  | 2125/2670 [51:47<13:19,  1.47s/it]

{'loss': 0.0086, 'grad_norm': 0.32947587966918945, 'learning_rate': 4.0900562851782365e-05, 'epoch': 7.96}


 80%|███████▉  | 2126/2670 [51:48<13:12,  1.46s/it]

{'loss': 0.0045, 'grad_norm': 0.06316772103309631, 'learning_rate': 4.082551594746717e-05, 'epoch': 7.96}


 80%|███████▉  | 2127/2670 [51:50<13:06,  1.45s/it]

{'loss': 0.011, 'grad_norm': 0.17639730870723724, 'learning_rate': 4.0750469043151976e-05, 'epoch': 7.97}


 80%|███████▉  | 2128/2670 [51:51<13:18,  1.47s/it]

{'loss': 0.0054, 'grad_norm': 0.05376254394650459, 'learning_rate': 4.067542213883678e-05, 'epoch': 7.97}


 80%|███████▉  | 2129/2670 [51:53<14:07,  1.57s/it]

{'loss': 0.0048, 'grad_norm': 0.07901544123888016, 'learning_rate': 4.060037523452157e-05, 'epoch': 7.97}


 80%|███████▉  | 2130/2670 [51:54<13:47,  1.53s/it]

{'loss': 0.0485, 'grad_norm': 0.6760674715042114, 'learning_rate': 4.052532833020638e-05, 'epoch': 7.98}


 80%|███████▉  | 2131/2670 [51:56<13:29,  1.50s/it]

{'loss': 0.0087, 'grad_norm': 0.16901107132434845, 'learning_rate': 4.0450281425891184e-05, 'epoch': 7.98}


 80%|███████▉  | 2132/2670 [51:57<12:58,  1.45s/it]

{'loss': 0.0068, 'grad_norm': 0.06355517357587814, 'learning_rate': 4.0375234521575986e-05, 'epoch': 7.99}


 80%|███████▉  | 2133/2670 [51:59<13:10,  1.47s/it]

{'loss': 0.0055, 'grad_norm': 0.07586300373077393, 'learning_rate': 4.030018761726079e-05, 'epoch': 7.99}


 80%|███████▉  | 2134/2670 [52:00<12:51,  1.44s/it]

{'loss': 0.0071, 'grad_norm': 0.06587415933609009, 'learning_rate': 4.022514071294559e-05, 'epoch': 7.99}


 80%|███████▉  | 2135/2670 [52:01<12:16,  1.38s/it]

{'loss': 0.0047, 'grad_norm': 0.0600973479449749, 'learning_rate': 4.015009380863039e-05, 'epoch': 8.0}


 80%|████████  | 2136/2670 [52:03<12:46,  1.43s/it]

{'loss': 0.0057, 'grad_norm': 0.05893274396657944, 'learning_rate': 4.00750469043152e-05, 'epoch': 8.0}


 80%|████████  | 2137/2670 [52:04<12:21,  1.39s/it]

{'loss': 0.0074, 'grad_norm': 0.08578725159168243, 'learning_rate': 4e-05, 'epoch': 8.0}


 80%|████████  | 2138/2670 [52:05<12:14,  1.38s/it]

{'loss': 0.0046, 'grad_norm': 0.06471945345401764, 'learning_rate': 3.9924953095684805e-05, 'epoch': 8.01}


 80%|████████  | 2139/2670 [52:07<12:04,  1.36s/it]

{'loss': 0.0051, 'grad_norm': 0.05541151389479637, 'learning_rate': 3.984990619136961e-05, 'epoch': 8.01}


 80%|████████  | 2140/2670 [52:08<12:44,  1.44s/it]

{'loss': 0.0028, 'grad_norm': 0.024825915694236755, 'learning_rate': 3.977485928705441e-05, 'epoch': 8.01}


 80%|████████  | 2141/2670 [52:10<12:57,  1.47s/it]

{'loss': 0.0042, 'grad_norm': 0.19559234380722046, 'learning_rate': 3.969981238273922e-05, 'epoch': 8.02}


 80%|████████  | 2142/2670 [52:11<13:06,  1.49s/it]

{'loss': 0.0046, 'grad_norm': 0.04838912934064865, 'learning_rate': 3.962476547842402e-05, 'epoch': 8.02}


 80%|████████  | 2143/2670 [52:13<12:59,  1.48s/it]

{'loss': 0.0043, 'grad_norm': 0.06458818912506104, 'learning_rate': 3.9549718574108816e-05, 'epoch': 8.03}


 80%|████████  | 2144/2670 [52:14<12:48,  1.46s/it]

{'loss': 0.0066, 'grad_norm': 0.06645718961954117, 'learning_rate': 3.9474671669793625e-05, 'epoch': 8.03}


 80%|████████  | 2145/2670 [52:16<13:21,  1.53s/it]

{'loss': 0.0029, 'grad_norm': 0.034207653254270554, 'learning_rate': 3.939962476547843e-05, 'epoch': 8.03}


 80%|████████  | 2146/2670 [52:17<13:12,  1.51s/it]

{'loss': 0.0034, 'grad_norm': 0.038483839482069016, 'learning_rate': 3.932457786116323e-05, 'epoch': 8.04}


 80%|████████  | 2147/2670 [52:19<12:46,  1.47s/it]

{'loss': 0.0043, 'grad_norm': 0.046905647963285446, 'learning_rate': 3.924953095684803e-05, 'epoch': 8.04}


 80%|████████  | 2148/2670 [52:20<12:52,  1.48s/it]

{'loss': 0.0021, 'grad_norm': 0.025347376242280006, 'learning_rate': 3.917448405253283e-05, 'epoch': 8.04}


 80%|████████  | 2149/2670 [52:22<12:22,  1.43s/it]

{'loss': 0.0048, 'grad_norm': 0.056673966348171234, 'learning_rate': 3.9099437148217635e-05, 'epoch': 8.05}


 81%|████████  | 2150/2670 [52:23<12:25,  1.43s/it]

{'loss': 0.0045, 'grad_norm': 0.07999692112207413, 'learning_rate': 3.9024390243902444e-05, 'epoch': 8.05}


 81%|████████  | 2151/2670 [52:25<12:53,  1.49s/it]

{'loss': 0.0049, 'grad_norm': 0.06093117222189903, 'learning_rate': 3.8949343339587246e-05, 'epoch': 8.06}


 81%|████████  | 2152/2670 [52:26<12:28,  1.45s/it]

{'loss': 0.0049, 'grad_norm': 0.20146791636943817, 'learning_rate': 3.887429643527205e-05, 'epoch': 8.06}


 81%|████████  | 2153/2670 [52:28<13:00,  1.51s/it]

{'loss': 0.0055, 'grad_norm': 0.16829247772693634, 'learning_rate': 3.879924953095685e-05, 'epoch': 8.06}


 81%|████████  | 2154/2670 [52:29<12:28,  1.45s/it]

{'loss': 0.0048, 'grad_norm': 0.05192580446600914, 'learning_rate': 3.872420262664165e-05, 'epoch': 8.07}


 81%|████████  | 2155/2670 [52:30<12:28,  1.45s/it]

{'loss': 0.0071, 'grad_norm': 0.09548288583755493, 'learning_rate': 3.864915572232646e-05, 'epoch': 8.07}


 81%|████████  | 2156/2670 [52:32<12:24,  1.45s/it]

{'loss': 0.0043, 'grad_norm': 0.062001634389162064, 'learning_rate': 3.8574108818011256e-05, 'epoch': 8.07}


 81%|████████  | 2157/2670 [52:33<12:37,  1.48s/it]

{'loss': 0.0027, 'grad_norm': 0.03417634218931198, 'learning_rate': 3.849906191369606e-05, 'epoch': 8.08}


 81%|████████  | 2158/2670 [52:35<12:21,  1.45s/it]

{'loss': 0.0024, 'grad_norm': 0.031847089529037476, 'learning_rate': 3.842401500938087e-05, 'epoch': 8.08}


 81%|████████  | 2159/2670 [52:36<12:27,  1.46s/it]

{'loss': 0.0041, 'grad_norm': 0.04772992059588432, 'learning_rate': 3.834896810506567e-05, 'epoch': 8.09}


 81%|████████  | 2160/2670 [52:38<12:33,  1.48s/it]

{'loss': 0.0061, 'grad_norm': 0.07626762986183167, 'learning_rate': 3.827392120075047e-05, 'epoch': 8.09}


 81%|████████  | 2161/2670 [52:40<13:16,  1.56s/it]

{'loss': 0.0089, 'grad_norm': 0.08359074592590332, 'learning_rate': 3.8198874296435274e-05, 'epoch': 8.09}


 81%|████████  | 2162/2670 [52:41<12:48,  1.51s/it]

{'loss': 0.0045, 'grad_norm': 0.05591333657503128, 'learning_rate': 3.8123827392120076e-05, 'epoch': 8.1}


 81%|████████  | 2163/2670 [52:43<13:01,  1.54s/it]

{'loss': 0.0041, 'grad_norm': 0.05503192916512489, 'learning_rate': 3.804878048780488e-05, 'epoch': 8.1}


 81%|████████  | 2164/2670 [52:44<13:27,  1.60s/it]

{'loss': 0.0044, 'grad_norm': 0.06816671788692474, 'learning_rate': 3.7973733583489687e-05, 'epoch': 8.1}


 81%|████████  | 2165/2670 [52:46<13:12,  1.57s/it]

{'loss': 0.0031, 'grad_norm': 0.03353014588356018, 'learning_rate': 3.789868667917448e-05, 'epoch': 8.11}


 81%|████████  | 2166/2670 [52:47<12:53,  1.54s/it]

{'loss': 0.004, 'grad_norm': 0.05331452190876007, 'learning_rate': 3.782363977485929e-05, 'epoch': 8.11}


 81%|████████  | 2167/2670 [52:49<12:48,  1.53s/it]

{'loss': 0.0026, 'grad_norm': 0.0336156003177166, 'learning_rate': 3.774859287054409e-05, 'epoch': 8.12}


 81%|████████  | 2168/2670 [52:50<11:59,  1.43s/it]

{'loss': 0.0048, 'grad_norm': 0.061494436115026474, 'learning_rate': 3.7673545966228895e-05, 'epoch': 8.12}


 81%|████████  | 2169/2670 [52:52<12:01,  1.44s/it]

{'loss': 0.0028, 'grad_norm': 0.04276276007294655, 'learning_rate': 3.75984990619137e-05, 'epoch': 8.12}


 81%|████████▏ | 2170/2670 [52:53<12:19,  1.48s/it]

{'loss': 0.0024, 'grad_norm': 0.03573675826191902, 'learning_rate': 3.75234521575985e-05, 'epoch': 8.13}


 81%|████████▏ | 2171/2670 [52:54<12:06,  1.46s/it]

{'loss': 0.0039, 'grad_norm': 0.043408844619989395, 'learning_rate': 3.74484052532833e-05, 'epoch': 8.13}


 81%|████████▏ | 2172/2670 [52:56<11:47,  1.42s/it]

{'loss': 0.0023, 'grad_norm': 0.026752334088087082, 'learning_rate': 3.737335834896811e-05, 'epoch': 8.13}


 81%|████████▏ | 2173/2670 [52:57<11:16,  1.36s/it]

{'loss': 0.0028, 'grad_norm': 0.036617450416088104, 'learning_rate': 3.729831144465291e-05, 'epoch': 8.14}


 81%|████████▏ | 2174/2670 [52:59<11:37,  1.41s/it]

{'loss': 0.0102, 'grad_norm': 0.10098803043365479, 'learning_rate': 3.7223264540337714e-05, 'epoch': 8.14}


 81%|████████▏ | 2175/2670 [53:00<11:18,  1.37s/it]

{'loss': 0.0092, 'grad_norm': 0.09040574729442596, 'learning_rate': 3.7148217636022516e-05, 'epoch': 8.15}


 81%|████████▏ | 2176/2670 [53:01<11:38,  1.41s/it]

{'loss': 0.0047, 'grad_norm': 0.06202287971973419, 'learning_rate': 3.707317073170732e-05, 'epoch': 8.15}


 82%|████████▏ | 2177/2670 [53:03<11:27,  1.39s/it]

{'loss': 0.0037, 'grad_norm': 0.03840279579162598, 'learning_rate': 3.699812382739212e-05, 'epoch': 8.15}


 82%|████████▏ | 2178/2670 [53:04<11:11,  1.36s/it]

{'loss': 0.0019, 'grad_norm': 0.029885437339544296, 'learning_rate': 3.692307692307693e-05, 'epoch': 8.16}


 82%|████████▏ | 2179/2670 [53:05<11:29,  1.40s/it]

{'loss': 0.0126, 'grad_norm': 0.695096492767334, 'learning_rate': 3.6848030018761725e-05, 'epoch': 8.16}


 82%|████████▏ | 2180/2670 [53:07<11:44,  1.44s/it]

{'loss': 0.0063, 'grad_norm': 0.12341717630624771, 'learning_rate': 3.6772983114446527e-05, 'epoch': 8.16}


 82%|████████▏ | 2181/2670 [53:08<11:29,  1.41s/it]

{'loss': 0.0034, 'grad_norm': 0.0411500409245491, 'learning_rate': 3.6697936210131335e-05, 'epoch': 8.17}


 82%|████████▏ | 2182/2670 [53:10<11:06,  1.37s/it]

{'loss': 0.0058, 'grad_norm': 0.0875094085931778, 'learning_rate': 3.662288930581614e-05, 'epoch': 8.17}


 82%|████████▏ | 2183/2670 [53:11<11:05,  1.37s/it]

{'loss': 0.004, 'grad_norm': 0.05583875998854637, 'learning_rate': 3.654784240150094e-05, 'epoch': 8.18}


 82%|████████▏ | 2184/2670 [53:12<10:49,  1.34s/it]

{'loss': 0.0059, 'grad_norm': 0.06365196406841278, 'learning_rate': 3.647279549718574e-05, 'epoch': 8.18}


 82%|████████▏ | 2185/2670 [53:13<10:11,  1.26s/it]

{'loss': 0.0017, 'grad_norm': 0.025971349328756332, 'learning_rate': 3.6397748592870544e-05, 'epoch': 8.18}


 82%|████████▏ | 2186/2670 [53:15<11:01,  1.37s/it]

{'loss': 0.0061, 'grad_norm': 0.051239803433418274, 'learning_rate': 3.632270168855535e-05, 'epoch': 8.19}


 82%|████████▏ | 2187/2670 [53:16<10:35,  1.32s/it]

{'loss': 0.0045, 'grad_norm': 0.05344969406723976, 'learning_rate': 3.6247654784240155e-05, 'epoch': 8.19}


 82%|████████▏ | 2188/2670 [53:18<10:42,  1.33s/it]

{'loss': 0.0056, 'grad_norm': 0.10479165613651276, 'learning_rate': 3.617260787992495e-05, 'epoch': 8.19}


 82%|████████▏ | 2189/2670 [53:19<10:58,  1.37s/it]

{'loss': 0.0084, 'grad_norm': 0.08362750709056854, 'learning_rate': 3.609756097560976e-05, 'epoch': 8.2}


 82%|████████▏ | 2190/2670 [53:21<11:35,  1.45s/it]

{'loss': 0.003, 'grad_norm': 0.03680199384689331, 'learning_rate': 3.602251407129456e-05, 'epoch': 8.2}


 82%|████████▏ | 2191/2670 [53:22<11:35,  1.45s/it]

{'loss': 0.0052, 'grad_norm': 0.06817879527807236, 'learning_rate': 3.594746716697936e-05, 'epoch': 8.21}


 82%|████████▏ | 2192/2670 [53:23<11:16,  1.41s/it]

{'loss': 0.0041, 'grad_norm': 0.05098011717200279, 'learning_rate': 3.587242026266417e-05, 'epoch': 8.21}


 82%|████████▏ | 2193/2670 [53:25<11:29,  1.45s/it]

{'loss': 0.0041, 'grad_norm': 0.04728228226304054, 'learning_rate': 3.579737335834897e-05, 'epoch': 8.21}


 82%|████████▏ | 2194/2670 [53:26<11:27,  1.44s/it]

{'loss': 0.003, 'grad_norm': 0.04412712901830673, 'learning_rate': 3.572232645403377e-05, 'epoch': 8.22}


 82%|████████▏ | 2195/2670 [53:28<10:51,  1.37s/it]

{'loss': 0.0119, 'grad_norm': 0.3106161653995514, 'learning_rate': 3.564727954971858e-05, 'epoch': 8.22}


 82%|████████▏ | 2196/2670 [53:29<11:14,  1.42s/it]

{'loss': 0.0035, 'grad_norm': 0.04806550219655037, 'learning_rate': 3.557223264540338e-05, 'epoch': 8.22}


 82%|████████▏ | 2197/2670 [53:31<11:23,  1.45s/it]

{'loss': 0.0033, 'grad_norm': 0.04705136641860008, 'learning_rate': 3.549718574108818e-05, 'epoch': 8.23}


 82%|████████▏ | 2198/2670 [53:32<11:27,  1.46s/it]

{'loss': 0.0037, 'grad_norm': 0.04785561189055443, 'learning_rate': 3.5422138836772984e-05, 'epoch': 8.23}


 82%|████████▏ | 2199/2670 [53:34<11:33,  1.47s/it]

{'loss': 0.0041, 'grad_norm': 0.059153001755476, 'learning_rate': 3.5347091932457786e-05, 'epoch': 8.24}


 82%|████████▏ | 2200/2670 [53:35<10:49,  1.38s/it]

{'loss': 0.0053, 'grad_norm': 0.2132643759250641, 'learning_rate': 3.527204502814259e-05, 'epoch': 8.24}


 82%|████████▏ | 2201/2670 [53:36<10:49,  1.38s/it]

{'loss': 0.006, 'grad_norm': 0.07360200583934784, 'learning_rate': 3.51969981238274e-05, 'epoch': 8.24}


 82%|████████▏ | 2202/2670 [53:37<10:24,  1.33s/it]

{'loss': 0.0049, 'grad_norm': 0.05758986622095108, 'learning_rate': 3.512195121951219e-05, 'epoch': 8.25}


 83%|████████▎ | 2203/2670 [53:39<10:21,  1.33s/it]

{'loss': 0.0123, 'grad_norm': 0.47929248213768005, 'learning_rate': 3.5046904315197e-05, 'epoch': 8.25}


 83%|████████▎ | 2204/2670 [53:40<10:38,  1.37s/it]

{'loss': 0.0022, 'grad_norm': 0.044997718185186386, 'learning_rate': 3.4971857410881804e-05, 'epoch': 8.25}


 83%|████████▎ | 2205/2670 [53:42<10:56,  1.41s/it]

{'loss': 0.0018, 'grad_norm': 0.032734036445617676, 'learning_rate': 3.4896810506566606e-05, 'epoch': 8.26}


 83%|████████▎ | 2206/2670 [53:43<11:24,  1.47s/it]

{'loss': 0.0046, 'grad_norm': 0.03830742835998535, 'learning_rate': 3.4821763602251415e-05, 'epoch': 8.26}


 83%|████████▎ | 2207/2670 [53:45<11:39,  1.51s/it]

{'loss': 0.0031, 'grad_norm': 0.0498809739947319, 'learning_rate': 3.474671669793621e-05, 'epoch': 8.27}


 83%|████████▎ | 2208/2670 [53:46<11:36,  1.51s/it]

{'loss': 0.0026, 'grad_norm': 0.04607474431395531, 'learning_rate': 3.467166979362101e-05, 'epoch': 8.27}


 83%|████████▎ | 2209/2670 [53:48<11:53,  1.55s/it]

{'loss': 0.0031, 'grad_norm': 0.05254784971475601, 'learning_rate': 3.459662288930582e-05, 'epoch': 8.27}


 83%|████████▎ | 2210/2670 [53:49<11:29,  1.50s/it]

{'loss': 0.0037, 'grad_norm': 0.04582691565155983, 'learning_rate': 3.452157598499062e-05, 'epoch': 8.28}


 83%|████████▎ | 2211/2670 [53:51<11:12,  1.47s/it]

{'loss': 0.0035, 'grad_norm': 0.047631725668907166, 'learning_rate': 3.444652908067542e-05, 'epoch': 8.28}


 83%|████████▎ | 2212/2670 [53:53<11:46,  1.54s/it]

{'loss': 0.0083, 'grad_norm': 0.17536531388759613, 'learning_rate': 3.437148217636023e-05, 'epoch': 8.28}


 83%|████████▎ | 2213/2670 [53:54<11:33,  1.52s/it]

{'loss': 0.0043, 'grad_norm': 0.0478767566382885, 'learning_rate': 3.429643527204503e-05, 'epoch': 8.29}


 83%|████████▎ | 2214/2670 [53:56<11:37,  1.53s/it]

{'loss': 0.0031, 'grad_norm': 0.04461377114057541, 'learning_rate': 3.422138836772983e-05, 'epoch': 8.29}


 83%|████████▎ | 2215/2670 [53:57<11:06,  1.46s/it]

{'loss': 0.0026, 'grad_norm': 0.03441308066248894, 'learning_rate': 3.414634146341464e-05, 'epoch': 8.3}


 83%|████████▎ | 2216/2670 [53:58<11:01,  1.46s/it]

{'loss': 0.009, 'grad_norm': 0.20948779582977295, 'learning_rate': 3.4071294559099435e-05, 'epoch': 8.3}


 83%|████████▎ | 2217/2670 [54:00<10:53,  1.44s/it]

{'loss': 0.0042, 'grad_norm': 0.05998222902417183, 'learning_rate': 3.3996247654784244e-05, 'epoch': 8.3}


 83%|████████▎ | 2218/2670 [54:01<10:39,  1.42s/it]

{'loss': 0.0041, 'grad_norm': 0.05903862789273262, 'learning_rate': 3.3921200750469046e-05, 'epoch': 8.31}


 83%|████████▎ | 2219/2670 [54:03<10:51,  1.45s/it]

{'loss': 0.0055, 'grad_norm': 0.069550521671772, 'learning_rate': 3.384615384615385e-05, 'epoch': 8.31}


 83%|████████▎ | 2220/2670 [54:04<11:10,  1.49s/it]

{'loss': 0.0042, 'grad_norm': 0.05725942552089691, 'learning_rate': 3.377110694183865e-05, 'epoch': 8.31}


 83%|████████▎ | 2221/2670 [54:05<10:34,  1.41s/it]

{'loss': 0.0058, 'grad_norm': 0.0734555572271347, 'learning_rate': 3.369606003752345e-05, 'epoch': 8.32}


 83%|████████▎ | 2222/2670 [54:07<10:50,  1.45s/it]

{'loss': 0.0118, 'grad_norm': 0.4401838183403015, 'learning_rate': 3.3621013133208255e-05, 'epoch': 8.32}


 83%|████████▎ | 2223/2670 [54:08<10:50,  1.46s/it]

{'loss': 0.0073, 'grad_norm': 0.10162559151649475, 'learning_rate': 3.3545966228893063e-05, 'epoch': 8.33}


 83%|████████▎ | 2224/2670 [54:10<11:01,  1.48s/it]

{'loss': 0.003, 'grad_norm': 0.08704937249422073, 'learning_rate': 3.3470919324577866e-05, 'epoch': 8.33}


 83%|████████▎ | 2225/2670 [54:12<11:31,  1.55s/it]

{'loss': 0.006, 'grad_norm': 0.07739929854869843, 'learning_rate': 3.339587242026266e-05, 'epoch': 8.33}


 83%|████████▎ | 2226/2670 [54:13<11:02,  1.49s/it]

{'loss': 0.0042, 'grad_norm': 0.05438666045665741, 'learning_rate': 3.332082551594747e-05, 'epoch': 8.34}


 83%|████████▎ | 2227/2670 [54:15<11:22,  1.54s/it]

{'loss': 0.0035, 'grad_norm': 0.06151924282312393, 'learning_rate': 3.324577861163227e-05, 'epoch': 8.34}


 83%|████████▎ | 2228/2670 [54:16<11:06,  1.51s/it]

{'loss': 0.0028, 'grad_norm': 0.04729294404387474, 'learning_rate': 3.3170731707317074e-05, 'epoch': 8.34}


 83%|████████▎ | 2229/2670 [54:18<11:09,  1.52s/it]

{'loss': 0.0038, 'grad_norm': 0.04677840694785118, 'learning_rate': 3.3095684803001876e-05, 'epoch': 8.35}


 84%|████████▎ | 2230/2670 [54:19<11:17,  1.54s/it]

{'loss': 0.0049, 'grad_norm': 0.059220556169748306, 'learning_rate': 3.302063789868668e-05, 'epoch': 8.35}


 84%|████████▎ | 2231/2670 [54:21<11:15,  1.54s/it]

{'loss': 0.0032, 'grad_norm': 0.046005621552467346, 'learning_rate': 3.294559099437149e-05, 'epoch': 8.36}


 84%|████████▎ | 2232/2670 [54:22<10:53,  1.49s/it]

{'loss': 0.0028, 'grad_norm': 0.04712775722146034, 'learning_rate': 3.287054409005629e-05, 'epoch': 8.36}


 84%|████████▎ | 2233/2670 [54:24<11:06,  1.52s/it]

{'loss': 0.0035, 'grad_norm': 0.04412911832332611, 'learning_rate': 3.279549718574109e-05, 'epoch': 8.36}


 84%|████████▎ | 2234/2670 [54:25<10:53,  1.50s/it]

{'loss': 0.0023, 'grad_norm': 0.04007599502801895, 'learning_rate': 3.272045028142589e-05, 'epoch': 8.37}


 84%|████████▎ | 2235/2670 [54:27<10:43,  1.48s/it]

{'loss': 0.0025, 'grad_norm': 0.03674030303955078, 'learning_rate': 3.2645403377110695e-05, 'epoch': 8.37}


 84%|████████▎ | 2236/2670 [54:28<10:13,  1.41s/it]

{'loss': 0.0024, 'grad_norm': 0.040479667484760284, 'learning_rate': 3.25703564727955e-05, 'epoch': 8.37}


 84%|████████▍ | 2237/2670 [54:29<10:01,  1.39s/it]

{'loss': 0.0038, 'grad_norm': 0.054853346198797226, 'learning_rate': 3.2495309568480306e-05, 'epoch': 8.38}


 84%|████████▍ | 2238/2670 [54:31<10:03,  1.40s/it]

{'loss': 0.0087, 'grad_norm': 0.11105022579431534, 'learning_rate': 3.242026266416511e-05, 'epoch': 8.38}


 84%|████████▍ | 2239/2670 [54:32<10:08,  1.41s/it]

{'loss': 0.0077, 'grad_norm': 0.10506493598222733, 'learning_rate': 3.2345215759849904e-05, 'epoch': 8.39}


 84%|████████▍ | 2240/2670 [54:34<10:28,  1.46s/it]

{'loss': 0.0038, 'grad_norm': 0.05651915445923805, 'learning_rate': 3.227016885553471e-05, 'epoch': 8.39}


 84%|████████▍ | 2241/2670 [54:35<10:13,  1.43s/it]

{'loss': 0.0071, 'grad_norm': 0.1154741570353508, 'learning_rate': 3.2195121951219514e-05, 'epoch': 8.39}


 84%|████████▍ | 2242/2670 [54:37<10:25,  1.46s/it]

{'loss': 0.0058, 'grad_norm': 0.07238715142011642, 'learning_rate': 3.2120075046904317e-05, 'epoch': 8.4}


 84%|████████▍ | 2243/2670 [54:38<10:45,  1.51s/it]

{'loss': 0.0026, 'grad_norm': 0.03695250675082207, 'learning_rate': 3.204502814258912e-05, 'epoch': 8.4}


 84%|████████▍ | 2244/2670 [54:40<10:59,  1.55s/it]

{'loss': 0.0045, 'grad_norm': 0.0603330172598362, 'learning_rate': 3.196998123827392e-05, 'epoch': 8.4}


 84%|████████▍ | 2245/2670 [54:41<10:59,  1.55s/it]

{'loss': 0.0041, 'grad_norm': 0.0511670857667923, 'learning_rate': 3.189493433395872e-05, 'epoch': 8.41}


 84%|████████▍ | 2246/2670 [54:43<10:32,  1.49s/it]

{'loss': 0.0068, 'grad_norm': 0.17175358533859253, 'learning_rate': 3.181988742964353e-05, 'epoch': 8.41}


 84%|████████▍ | 2247/2670 [54:44<10:21,  1.47s/it]

{'loss': 0.0031, 'grad_norm': 0.04337158426642418, 'learning_rate': 3.1744840525328334e-05, 'epoch': 8.42}


 84%|████████▍ | 2248/2670 [54:46<10:45,  1.53s/it]

{'loss': 0.0026, 'grad_norm': 0.03560216724872589, 'learning_rate': 3.1669793621013136e-05, 'epoch': 8.42}


 84%|████████▍ | 2249/2670 [54:47<10:46,  1.54s/it]

{'loss': 0.0036, 'grad_norm': 0.04463071748614311, 'learning_rate': 3.159474671669794e-05, 'epoch': 8.42}


 84%|████████▍ | 2250/2670 [54:49<10:29,  1.50s/it]

{'loss': 0.007, 'grad_norm': 0.2848372161388397, 'learning_rate': 3.151969981238274e-05, 'epoch': 8.43}


 84%|████████▍ | 2251/2670 [54:50<10:20,  1.48s/it]

{'loss': 0.0047, 'grad_norm': 0.07332048565149307, 'learning_rate': 3.144465290806755e-05, 'epoch': 8.43}


 84%|████████▍ | 2252/2670 [54:51<09:45,  1.40s/it]

{'loss': 0.0074, 'grad_norm': 0.0893462672829628, 'learning_rate': 3.1369606003752344e-05, 'epoch': 8.43}


 84%|████████▍ | 2253/2670 [54:53<10:01,  1.44s/it]

{'loss': 0.0051, 'grad_norm': 0.06273811310529709, 'learning_rate': 3.1294559099437146e-05, 'epoch': 8.44}


 84%|████████▍ | 2254/2670 [54:54<10:08,  1.46s/it]

{'loss': 0.0033, 'grad_norm': 0.04723196476697922, 'learning_rate': 3.1219512195121955e-05, 'epoch': 8.44}


 84%|████████▍ | 2255/2670 [54:56<09:49,  1.42s/it]

{'loss': 0.0067, 'grad_norm': 0.06861183047294617, 'learning_rate': 3.114446529080676e-05, 'epoch': 8.45}


 84%|████████▍ | 2256/2670 [54:57<09:42,  1.41s/it]

{'loss': 0.004, 'grad_norm': 0.0611531026661396, 'learning_rate': 3.106941838649156e-05, 'epoch': 8.45}


 85%|████████▍ | 2257/2670 [54:59<09:56,  1.44s/it]

{'loss': 0.0036, 'grad_norm': 0.04769185557961464, 'learning_rate': 3.099437148217636e-05, 'epoch': 8.45}


 85%|████████▍ | 2258/2670 [55:00<09:55,  1.45s/it]

{'loss': 0.0048, 'grad_norm': 0.0863819569349289, 'learning_rate': 3.0919324577861163e-05, 'epoch': 8.46}


 85%|████████▍ | 2259/2670 [55:02<09:54,  1.45s/it]

{'loss': 0.0045, 'grad_norm': 0.049985408782958984, 'learning_rate': 3.0844277673545965e-05, 'epoch': 8.46}


 85%|████████▍ | 2260/2670 [55:03<10:18,  1.51s/it]

{'loss': 0.0028, 'grad_norm': 0.03636707738041878, 'learning_rate': 3.0769230769230774e-05, 'epoch': 8.46}


 85%|████████▍ | 2261/2670 [55:05<09:59,  1.46s/it]

{'loss': 0.0024, 'grad_norm': 0.03286414593458176, 'learning_rate': 3.069418386491557e-05, 'epoch': 8.47}


 85%|████████▍ | 2262/2670 [55:06<09:55,  1.46s/it]

{'loss': 0.0032, 'grad_norm': 0.04733199253678322, 'learning_rate': 3.061913696060038e-05, 'epoch': 8.47}


 85%|████████▍ | 2263/2670 [55:07<09:47,  1.44s/it]

{'loss': 0.0051, 'grad_norm': 0.052933644503355026, 'learning_rate': 3.054409005628518e-05, 'epoch': 8.48}


 85%|████████▍ | 2264/2670 [55:09<09:42,  1.44s/it]

{'loss': 0.0036, 'grad_norm': 0.051115483045578, 'learning_rate': 3.0469043151969983e-05, 'epoch': 8.48}


 85%|████████▍ | 2265/2670 [55:10<09:29,  1.41s/it]

{'loss': 0.0054, 'grad_norm': 0.07152114063501358, 'learning_rate': 3.0393996247654788e-05, 'epoch': 8.48}


 85%|████████▍ | 2266/2670 [55:12<09:35,  1.42s/it]

{'loss': 0.0046, 'grad_norm': 0.05282638967037201, 'learning_rate': 3.0318949343339587e-05, 'epoch': 8.49}


 85%|████████▍ | 2267/2670 [55:13<09:27,  1.41s/it]

{'loss': 0.0046, 'grad_norm': 0.05935118719935417, 'learning_rate': 3.0243902439024392e-05, 'epoch': 8.49}


 85%|████████▍ | 2268/2670 [55:15<09:36,  1.43s/it]

{'loss': 0.0045, 'grad_norm': 0.06700900197029114, 'learning_rate': 3.0168855534709194e-05, 'epoch': 8.49}


 85%|████████▍ | 2269/2670 [55:16<09:11,  1.38s/it]

{'loss': 0.0042, 'grad_norm': 0.06430771946907043, 'learning_rate': 3.0093808630394e-05, 'epoch': 8.5}


 85%|████████▌ | 2270/2670 [55:17<09:18,  1.40s/it]

{'loss': 0.0054, 'grad_norm': 0.07487604767084122, 'learning_rate': 3.0018761726078802e-05, 'epoch': 8.5}


 85%|████████▌ | 2271/2670 [55:19<09:07,  1.37s/it]

{'loss': 0.0051, 'grad_norm': 0.06117388606071472, 'learning_rate': 2.99437148217636e-05, 'epoch': 8.51}


 85%|████████▌ | 2272/2670 [55:20<09:33,  1.44s/it]

{'loss': 0.0046, 'grad_norm': 0.06229406222701073, 'learning_rate': 2.9868667917448406e-05, 'epoch': 8.51}


 85%|████████▌ | 2273/2670 [55:21<09:04,  1.37s/it]

{'loss': 0.0024, 'grad_norm': 0.03757752850651741, 'learning_rate': 2.979362101313321e-05, 'epoch': 8.51}


 85%|████████▌ | 2274/2670 [55:23<08:59,  1.36s/it]

{'loss': 0.0047, 'grad_norm': 0.06248132511973381, 'learning_rate': 2.9718574108818014e-05, 'epoch': 8.52}


 85%|████████▌ | 2275/2670 [55:24<09:17,  1.41s/it]

{'loss': 0.0041, 'grad_norm': 0.060736894607543945, 'learning_rate': 2.9643527204502812e-05, 'epoch': 8.52}


 85%|████████▌ | 2276/2670 [55:26<09:08,  1.39s/it]

{'loss': 0.0055, 'grad_norm': 0.08717387914657593, 'learning_rate': 2.9568480300187618e-05, 'epoch': 8.52}


 85%|████████▌ | 2277/2670 [55:27<08:50,  1.35s/it]

{'loss': 0.0054, 'grad_norm': 0.16166535019874573, 'learning_rate': 2.9493433395872423e-05, 'epoch': 8.53}


 85%|████████▌ | 2278/2670 [55:28<09:23,  1.44s/it]

{'loss': 0.0073, 'grad_norm': 0.07585661113262177, 'learning_rate': 2.9418386491557225e-05, 'epoch': 8.53}


 85%|████████▌ | 2279/2670 [55:30<09:56,  1.53s/it]

{'loss': 0.0024, 'grad_norm': 0.03520689904689789, 'learning_rate': 2.934333958724203e-05, 'epoch': 8.54}


 85%|████████▌ | 2280/2670 [55:32<09:35,  1.47s/it]

{'loss': 0.0049, 'grad_norm': 0.07740940898656845, 'learning_rate': 2.926829268292683e-05, 'epoch': 8.54}


 85%|████████▌ | 2281/2670 [55:33<09:39,  1.49s/it]

{'loss': 0.0046, 'grad_norm': 0.06269162148237228, 'learning_rate': 2.919324577861163e-05, 'epoch': 8.54}


 85%|████████▌ | 2282/2670 [55:35<09:29,  1.47s/it]

{'loss': 0.0042, 'grad_norm': 0.06297940760850906, 'learning_rate': 2.9118198874296437e-05, 'epoch': 8.55}


 86%|████████▌ | 2283/2670 [55:36<09:05,  1.41s/it]

{'loss': 0.0044, 'grad_norm': 0.05152755603194237, 'learning_rate': 2.9043151969981243e-05, 'epoch': 8.55}


 86%|████████▌ | 2284/2670 [55:37<09:15,  1.44s/it]

{'loss': 0.0031, 'grad_norm': 0.0467289574444294, 'learning_rate': 2.896810506566604e-05, 'epoch': 8.55}


 86%|████████▌ | 2285/2670 [55:40<10:58,  1.71s/it]

{'loss': 0.0091, 'grad_norm': 0.2008281797170639, 'learning_rate': 2.8893058161350843e-05, 'epoch': 8.56}


 86%|████████▌ | 2286/2670 [55:41<10:22,  1.62s/it]

{'loss': 0.003, 'grad_norm': 0.04666464775800705, 'learning_rate': 2.881801125703565e-05, 'epoch': 8.56}


 86%|████████▌ | 2287/2670 [55:42<09:40,  1.51s/it]

{'loss': 0.0056, 'grad_norm': 0.23689323663711548, 'learning_rate': 2.8742964352720454e-05, 'epoch': 8.57}


 86%|████████▌ | 2288/2670 [55:44<09:19,  1.46s/it]

{'loss': 0.0044, 'grad_norm': 0.06340949982404709, 'learning_rate': 2.8667917448405256e-05, 'epoch': 8.57}


 86%|████████▌ | 2289/2670 [55:45<09:24,  1.48s/it]

{'loss': 0.0026, 'grad_norm': 0.03697536140680313, 'learning_rate': 2.8592870544090055e-05, 'epoch': 8.57}


 86%|████████▌ | 2290/2670 [55:47<09:17,  1.47s/it]

{'loss': 0.0028, 'grad_norm': 0.042667679488658905, 'learning_rate': 2.851782363977486e-05, 'epoch': 8.58}


 86%|████████▌ | 2291/2670 [55:48<09:00,  1.43s/it]

{'loss': 0.004, 'grad_norm': 0.05994259938597679, 'learning_rate': 2.8442776735459663e-05, 'epoch': 8.58}


 86%|████████▌ | 2292/2670 [55:50<09:35,  1.52s/it]

{'loss': 0.0074, 'grad_norm': 0.08213254064321518, 'learning_rate': 2.8367729831144468e-05, 'epoch': 8.58}


 86%|████████▌ | 2293/2670 [55:51<09:42,  1.55s/it]

{'loss': 0.0033, 'grad_norm': 0.04428539425134659, 'learning_rate': 2.8292682926829267e-05, 'epoch': 8.59}


 86%|████████▌ | 2294/2670 [55:53<09:31,  1.52s/it]

{'loss': 0.0032, 'grad_norm': 0.042001232504844666, 'learning_rate': 2.8217636022514072e-05, 'epoch': 8.59}


 86%|████████▌ | 2295/2670 [55:54<09:36,  1.54s/it]

{'loss': 0.004, 'grad_norm': 0.05139493942260742, 'learning_rate': 2.8142589118198874e-05, 'epoch': 8.6}


 86%|████████▌ | 2296/2670 [55:56<09:06,  1.46s/it]

{'loss': 0.0069, 'grad_norm': 0.22893129289150238, 'learning_rate': 2.806754221388368e-05, 'epoch': 8.6}


 86%|████████▌ | 2297/2670 [55:57<08:52,  1.43s/it]

{'loss': 0.0027, 'grad_norm': 0.0407712385058403, 'learning_rate': 2.7992495309568485e-05, 'epoch': 8.6}


 86%|████████▌ | 2298/2670 [55:58<08:33,  1.38s/it]

{'loss': 0.006, 'grad_norm': 0.07882419973611832, 'learning_rate': 2.7917448405253284e-05, 'epoch': 8.61}


 86%|████████▌ | 2299/2670 [56:00<08:27,  1.37s/it]

{'loss': 0.0074, 'grad_norm': 0.08753135055303574, 'learning_rate': 2.7842401500938086e-05, 'epoch': 8.61}


 86%|████████▌ | 2300/2670 [56:01<08:33,  1.39s/it]

{'loss': 0.0405, 'grad_norm': 0.41867929697036743, 'learning_rate': 2.776735459662289e-05, 'epoch': 8.61}


 86%|████████▌ | 2301/2670 [56:02<08:36,  1.40s/it]

{'loss': 0.005, 'grad_norm': 0.06141234561800957, 'learning_rate': 2.7692307692307694e-05, 'epoch': 8.62}


 86%|████████▌ | 2302/2670 [56:04<08:09,  1.33s/it]

{'loss': 0.0084, 'grad_norm': 0.11584239453077316, 'learning_rate': 2.76172607879925e-05, 'epoch': 8.62}


 86%|████████▋ | 2303/2670 [56:05<08:34,  1.40s/it]

{'loss': 0.004, 'grad_norm': 0.056333284825086594, 'learning_rate': 2.7542213883677298e-05, 'epoch': 8.63}


 86%|████████▋ | 2304/2670 [56:07<09:05,  1.49s/it]

{'loss': 0.0042, 'grad_norm': 0.06374730914831161, 'learning_rate': 2.7467166979362103e-05, 'epoch': 8.63}


 86%|████████▋ | 2305/2670 [56:08<09:01,  1.48s/it]

{'loss': 0.0062, 'grad_norm': 0.07565007358789444, 'learning_rate': 2.7392120075046905e-05, 'epoch': 8.63}


 86%|████████▋ | 2306/2670 [56:10<09:01,  1.49s/it]

{'loss': 0.0054, 'grad_norm': 0.06645061820745468, 'learning_rate': 2.731707317073171e-05, 'epoch': 8.64}


 86%|████████▋ | 2307/2670 [56:11<09:01,  1.49s/it]

{'loss': 0.0081, 'grad_norm': 0.12379663437604904, 'learning_rate': 2.724202626641651e-05, 'epoch': 8.64}


 86%|████████▋ | 2308/2670 [56:13<08:50,  1.47s/it]

{'loss': 0.0055, 'grad_norm': 0.07264886051416397, 'learning_rate': 2.7166979362101315e-05, 'epoch': 8.64}


 86%|████████▋ | 2309/2670 [56:14<08:26,  1.40s/it]

{'loss': 0.0036, 'grad_norm': 0.04281177371740341, 'learning_rate': 2.7091932457786117e-05, 'epoch': 8.65}


 87%|████████▋ | 2310/2670 [56:15<08:19,  1.39s/it]

{'loss': 0.004, 'grad_norm': 0.05874604731798172, 'learning_rate': 2.7016885553470922e-05, 'epoch': 8.65}


 87%|████████▋ | 2311/2670 [56:17<08:12,  1.37s/it]

{'loss': 0.0046, 'grad_norm': 0.056728824973106384, 'learning_rate': 2.6941838649155728e-05, 'epoch': 8.66}


 87%|████████▋ | 2312/2670 [56:18<08:36,  1.44s/it]

{'loss': 0.0028, 'grad_norm': 0.039492592215538025, 'learning_rate': 2.6866791744840523e-05, 'epoch': 8.66}


 87%|████████▋ | 2313/2670 [56:20<08:47,  1.48s/it]

{'loss': 0.0063, 'grad_norm': 0.07109840214252472, 'learning_rate': 2.679174484052533e-05, 'epoch': 8.66}


 87%|████████▋ | 2314/2670 [56:21<08:42,  1.47s/it]

{'loss': 0.004, 'grad_norm': 0.04803866147994995, 'learning_rate': 2.6716697936210134e-05, 'epoch': 8.67}


 87%|████████▋ | 2315/2670 [56:23<08:23,  1.42s/it]

{'loss': 0.005, 'grad_norm': 0.06200924515724182, 'learning_rate': 2.6641651031894936e-05, 'epoch': 8.67}


 87%|████████▋ | 2316/2670 [56:24<08:02,  1.36s/it]

{'loss': 0.0098, 'grad_norm': 0.0941343754529953, 'learning_rate': 2.6566604127579735e-05, 'epoch': 8.67}


 87%|████████▋ | 2317/2670 [56:25<08:08,  1.39s/it]

{'loss': 0.0051, 'grad_norm': 0.05022620037198067, 'learning_rate': 2.649155722326454e-05, 'epoch': 8.68}


 87%|████████▋ | 2318/2670 [56:27<07:55,  1.35s/it]

{'loss': 0.0035, 'grad_norm': 0.05094018578529358, 'learning_rate': 2.6416510318949346e-05, 'epoch': 8.68}


 87%|████████▋ | 2319/2670 [56:28<08:20,  1.43s/it]

{'loss': 0.0046, 'grad_norm': 0.0516851507127285, 'learning_rate': 2.6341463414634148e-05, 'epoch': 8.69}


 87%|████████▋ | 2320/2670 [56:30<08:48,  1.51s/it]

{'loss': 0.0044, 'grad_norm': 0.052058346569538116, 'learning_rate': 2.6266416510318953e-05, 'epoch': 8.69}


 87%|████████▋ | 2321/2670 [56:31<08:47,  1.51s/it]

{'loss': 0.0079, 'grad_norm': 0.11204709857702255, 'learning_rate': 2.6191369606003752e-05, 'epoch': 8.69}


 87%|████████▋ | 2322/2670 [56:33<09:16,  1.60s/it]

{'loss': 0.0022, 'grad_norm': 0.029587816447019577, 'learning_rate': 2.6116322701688558e-05, 'epoch': 8.7}


 87%|████████▋ | 2323/2670 [56:35<09:15,  1.60s/it]

{'loss': 0.0032, 'grad_norm': 0.03904547169804573, 'learning_rate': 2.604127579737336e-05, 'epoch': 8.7}


 87%|████████▋ | 2324/2670 [56:37<09:38,  1.67s/it]

{'loss': 0.0084, 'grad_norm': 0.09586180746555328, 'learning_rate': 2.5966228893058165e-05, 'epoch': 8.7}


 87%|████████▋ | 2325/2670 [56:38<09:02,  1.57s/it]

{'loss': 0.0082, 'grad_norm': 0.18682242929935455, 'learning_rate': 2.5891181988742964e-05, 'epoch': 8.71}


 87%|████████▋ | 2326/2670 [56:39<08:42,  1.52s/it]

{'loss': 0.0045, 'grad_norm': 0.06338397413492203, 'learning_rate': 2.5816135084427766e-05, 'epoch': 8.71}


 87%|████████▋ | 2327/2670 [56:41<09:07,  1.60s/it]

{'loss': 0.0036, 'grad_norm': 0.059445422142744064, 'learning_rate': 2.574108818011257e-05, 'epoch': 8.72}


 87%|████████▋ | 2328/2670 [56:43<09:08,  1.60s/it]

{'loss': 0.0043, 'grad_norm': 0.05625591427087784, 'learning_rate': 2.5666041275797377e-05, 'epoch': 8.72}


 87%|████████▋ | 2329/2670 [56:44<08:43,  1.54s/it]

{'loss': 0.0046, 'grad_norm': 0.05459698662161827, 'learning_rate': 2.559099437148218e-05, 'epoch': 8.72}


 87%|████████▋ | 2330/2670 [56:46<08:38,  1.53s/it]

{'loss': 0.0028, 'grad_norm': 0.02852347493171692, 'learning_rate': 2.5515947467166978e-05, 'epoch': 8.73}


 87%|████████▋ | 2331/2670 [56:47<09:06,  1.61s/it]

{'loss': 0.0053, 'grad_norm': 0.09822290390729904, 'learning_rate': 2.5440900562851783e-05, 'epoch': 8.73}


 87%|████████▋ | 2332/2670 [56:49<08:52,  1.58s/it]

{'loss': 0.0058, 'grad_norm': 0.06714226305484772, 'learning_rate': 2.536585365853659e-05, 'epoch': 8.73}


 87%|████████▋ | 2333/2670 [56:51<09:12,  1.64s/it]

{'loss': 0.0054, 'grad_norm': 0.07275760918855667, 'learning_rate': 2.529080675422139e-05, 'epoch': 8.74}


 87%|████████▋ | 2334/2670 [56:52<08:49,  1.58s/it]

{'loss': 0.0055, 'grad_norm': 0.06901185214519501, 'learning_rate': 2.5215759849906196e-05, 'epoch': 8.74}


 87%|████████▋ | 2335/2670 [56:54<08:54,  1.59s/it]

{'loss': 0.0032, 'grad_norm': 0.047385863959789276, 'learning_rate': 2.5140712945590995e-05, 'epoch': 8.75}


 87%|████████▋ | 2336/2670 [56:55<08:47,  1.58s/it]

{'loss': 0.004, 'grad_norm': 0.04998200759291649, 'learning_rate': 2.5065666041275797e-05, 'epoch': 8.75}


 88%|████████▊ | 2337/2670 [56:57<08:25,  1.52s/it]

{'loss': 0.0064, 'grad_norm': 0.07301701605319977, 'learning_rate': 2.4990619136960602e-05, 'epoch': 8.75}


 88%|████████▊ | 2338/2670 [56:58<08:00,  1.45s/it]

{'loss': 0.0049, 'grad_norm': 0.06964770704507828, 'learning_rate': 2.4915572232645404e-05, 'epoch': 8.76}


 88%|████████▊ | 2339/2670 [56:59<08:00,  1.45s/it]

{'loss': 0.0052, 'grad_norm': 0.06884218752384186, 'learning_rate': 2.4840525328330206e-05, 'epoch': 8.76}


 88%|████████▊ | 2340/2670 [57:01<07:56,  1.44s/it]

{'loss': 0.0035, 'grad_norm': 0.05092659592628479, 'learning_rate': 2.476547842401501e-05, 'epoch': 8.76}


 88%|████████▊ | 2341/2670 [57:02<08:09,  1.49s/it]

{'loss': 0.0046, 'grad_norm': 0.0592644102871418, 'learning_rate': 2.4690431519699814e-05, 'epoch': 8.77}


 88%|████████▊ | 2342/2670 [57:04<07:59,  1.46s/it]

{'loss': 0.0078, 'grad_norm': 0.24325795471668243, 'learning_rate': 2.461538461538462e-05, 'epoch': 8.77}


 88%|████████▊ | 2343/2670 [57:05<08:04,  1.48s/it]

{'loss': 0.0058, 'grad_norm': 0.06992471218109131, 'learning_rate': 2.4540337711069418e-05, 'epoch': 8.78}


 88%|████████▊ | 2344/2670 [57:07<07:48,  1.44s/it]

{'loss': 0.0059, 'grad_norm': 0.06733429431915283, 'learning_rate': 2.4465290806754224e-05, 'epoch': 8.78}


 88%|████████▊ | 2345/2670 [57:08<07:40,  1.42s/it]

{'loss': 0.0047, 'grad_norm': 0.057576701045036316, 'learning_rate': 2.4390243902439026e-05, 'epoch': 8.78}


 88%|████████▊ | 2346/2670 [57:10<07:48,  1.45s/it]

{'loss': 0.0033, 'grad_norm': 0.043305594474077225, 'learning_rate': 2.4315196998123828e-05, 'epoch': 8.79}


 88%|████████▊ | 2347/2670 [57:11<07:32,  1.40s/it]

{'loss': 0.0032, 'grad_norm': 0.042444683611392975, 'learning_rate': 2.424015009380863e-05, 'epoch': 8.79}


 88%|████████▊ | 2348/2670 [57:12<07:23,  1.38s/it]

{'loss': 0.0041, 'grad_norm': 0.04645306244492531, 'learning_rate': 2.4165103189493435e-05, 'epoch': 8.79}


 88%|████████▊ | 2349/2670 [57:13<07:12,  1.35s/it]

{'loss': 0.0053, 'grad_norm': 0.07754278928041458, 'learning_rate': 2.4090056285178237e-05, 'epoch': 8.8}


 88%|████████▊ | 2350/2670 [57:15<07:50,  1.47s/it]

{'loss': 0.004, 'grad_norm': 0.07023625075817108, 'learning_rate': 2.401500938086304e-05, 'epoch': 8.8}


 88%|████████▊ | 2351/2670 [57:17<07:57,  1.50s/it]

{'loss': 0.0061, 'grad_norm': 0.1776880919933319, 'learning_rate': 2.3939962476547845e-05, 'epoch': 8.81}


 88%|████████▊ | 2352/2670 [57:18<07:59,  1.51s/it]

{'loss': 0.0042, 'grad_norm': 0.06254912912845612, 'learning_rate': 2.3864915572232647e-05, 'epoch': 8.81}


 88%|████████▊ | 2353/2670 [57:20<08:12,  1.55s/it]

{'loss': 0.0031, 'grad_norm': 0.04350389912724495, 'learning_rate': 2.378986866791745e-05, 'epoch': 8.81}


 88%|████████▊ | 2354/2670 [57:21<07:49,  1.49s/it]

{'loss': 0.0047, 'grad_norm': 0.09031443297863007, 'learning_rate': 2.371482176360225e-05, 'epoch': 8.82}


 88%|████████▊ | 2355/2670 [57:23<08:14,  1.57s/it]

{'loss': 0.0044, 'grad_norm': 0.05363195389509201, 'learning_rate': 2.3639774859287057e-05, 'epoch': 8.82}


 88%|████████▊ | 2356/2670 [57:25<08:09,  1.56s/it]

{'loss': 0.0058, 'grad_norm': 0.04331960529088974, 'learning_rate': 2.356472795497186e-05, 'epoch': 8.82}


 88%|████████▊ | 2357/2670 [57:26<08:03,  1.54s/it]

{'loss': 0.0032, 'grad_norm': 0.04652160778641701, 'learning_rate': 2.348968105065666e-05, 'epoch': 8.83}


 88%|████████▊ | 2358/2670 [57:28<07:51,  1.51s/it]

{'loss': 0.0057, 'grad_norm': 0.06313442438840866, 'learning_rate': 2.3414634146341466e-05, 'epoch': 8.83}


 88%|████████▊ | 2359/2670 [57:29<08:07,  1.57s/it]

{'loss': 0.0036, 'grad_norm': 0.050252899527549744, 'learning_rate': 2.333958724202627e-05, 'epoch': 8.84}


 88%|████████▊ | 2360/2670 [57:31<07:55,  1.53s/it]

{'loss': 0.0063, 'grad_norm': 0.08104687929153442, 'learning_rate': 2.326454033771107e-05, 'epoch': 8.84}


 88%|████████▊ | 2361/2670 [57:32<07:44,  1.50s/it]

{'loss': 0.0059, 'grad_norm': 0.09325919300317764, 'learning_rate': 2.3189493433395873e-05, 'epoch': 8.84}


 88%|████████▊ | 2362/2670 [57:34<07:46,  1.51s/it]

{'loss': 0.0064, 'grad_norm': 0.07591920346021652, 'learning_rate': 2.3114446529080678e-05, 'epoch': 8.85}


 89%|████████▊ | 2363/2670 [57:35<07:25,  1.45s/it]

{'loss': 0.0036, 'grad_norm': 0.055349938571453094, 'learning_rate': 2.303939962476548e-05, 'epoch': 8.85}


 89%|████████▊ | 2364/2670 [57:37<07:40,  1.50s/it]

{'loss': 0.0119, 'grad_norm': 0.2191375344991684, 'learning_rate': 2.2964352720450282e-05, 'epoch': 8.85}


 89%|████████▊ | 2365/2670 [57:38<07:47,  1.53s/it]

{'loss': 0.0057, 'grad_norm': 0.36433425545692444, 'learning_rate': 2.2889305816135084e-05, 'epoch': 8.86}


 89%|████████▊ | 2366/2670 [57:40<07:26,  1.47s/it]

{'loss': 0.0055, 'grad_norm': 0.1258823573589325, 'learning_rate': 2.281425891181989e-05, 'epoch': 8.86}


 89%|████████▊ | 2367/2670 [57:41<07:28,  1.48s/it]

{'loss': 0.0076, 'grad_norm': 0.07919173687696457, 'learning_rate': 2.2739212007504692e-05, 'epoch': 8.87}


 89%|████████▊ | 2368/2670 [57:43<07:52,  1.56s/it]

{'loss': 0.0039, 'grad_norm': 0.05047472193837166, 'learning_rate': 2.2664165103189494e-05, 'epoch': 8.87}


 89%|████████▊ | 2369/2670 [57:44<07:23,  1.47s/it]

{'loss': 0.0073, 'grad_norm': 0.08635606616735458, 'learning_rate': 2.25891181988743e-05, 'epoch': 8.87}


 89%|████████▉ | 2370/2670 [57:45<07:06,  1.42s/it]

{'loss': 0.0056, 'grad_norm': 0.0802357941865921, 'learning_rate': 2.25140712945591e-05, 'epoch': 8.88}


 89%|████████▉ | 2371/2670 [57:47<07:14,  1.45s/it]

{'loss': 0.0069, 'grad_norm': 0.23325824737548828, 'learning_rate': 2.2439024390243904e-05, 'epoch': 8.88}


 89%|████████▉ | 2372/2670 [57:48<07:25,  1.50s/it]

{'loss': 0.0078, 'grad_norm': 0.11570601165294647, 'learning_rate': 2.2363977485928706e-05, 'epoch': 8.88}


 89%|████████▉ | 2373/2670 [57:50<07:32,  1.52s/it]

{'loss': 0.0084, 'grad_norm': 0.32883530855178833, 'learning_rate': 2.228893058161351e-05, 'epoch': 8.89}


 89%|████████▉ | 2374/2670 [57:51<07:12,  1.46s/it]

{'loss': 0.0033, 'grad_norm': 0.04585174843668938, 'learning_rate': 2.2213883677298313e-05, 'epoch': 8.89}


 89%|████████▉ | 2375/2670 [57:53<06:59,  1.42s/it]

{'loss': 0.0083, 'grad_norm': 0.09592591226100922, 'learning_rate': 2.2138836772983115e-05, 'epoch': 8.9}


 89%|████████▉ | 2376/2670 [57:54<06:54,  1.41s/it]

{'loss': 0.0042, 'grad_norm': 0.05390365794301033, 'learning_rate': 2.206378986866792e-05, 'epoch': 8.9}


 89%|████████▉ | 2377/2670 [57:55<06:42,  1.37s/it]

{'loss': 0.0036, 'grad_norm': 0.04911287873983383, 'learning_rate': 2.198874296435272e-05, 'epoch': 8.9}


 89%|████████▉ | 2378/2670 [57:57<06:45,  1.39s/it]

{'loss': 0.0059, 'grad_norm': 0.1726037859916687, 'learning_rate': 2.1913696060037525e-05, 'epoch': 8.91}


 89%|████████▉ | 2379/2670 [57:58<06:57,  1.43s/it]

{'loss': 0.004, 'grad_norm': 0.052970416843891144, 'learning_rate': 2.1838649155722327e-05, 'epoch': 8.91}


 89%|████████▉ | 2380/2670 [58:00<07:05,  1.47s/it]

{'loss': 0.0043, 'grad_norm': 0.04077433794736862, 'learning_rate': 2.1763602251407132e-05, 'epoch': 8.91}


 89%|████████▉ | 2381/2670 [58:01<06:53,  1.43s/it]

{'loss': 0.0085, 'grad_norm': 0.1377258449792862, 'learning_rate': 2.168855534709193e-05, 'epoch': 8.92}


 89%|████████▉ | 2382/2670 [58:03<06:35,  1.37s/it]

{'loss': 0.01, 'grad_norm': 0.09850278496742249, 'learning_rate': 2.1613508442776737e-05, 'epoch': 8.92}


 89%|████████▉ | 2383/2670 [58:04<06:49,  1.43s/it]

{'loss': 0.005, 'grad_norm': 0.056416790932416916, 'learning_rate': 2.1538461538461542e-05, 'epoch': 8.93}


 89%|████████▉ | 2384/2670 [58:06<06:55,  1.45s/it]

{'loss': 0.0079, 'grad_norm': 0.07618497312068939, 'learning_rate': 2.146341463414634e-05, 'epoch': 8.93}


 89%|████████▉ | 2385/2670 [58:07<07:02,  1.48s/it]

{'loss': 0.0041, 'grad_norm': 0.04786885157227516, 'learning_rate': 2.1388367729831146e-05, 'epoch': 8.93}


 89%|████████▉ | 2386/2670 [58:09<07:07,  1.51s/it]

{'loss': 0.0039, 'grad_norm': 0.05384210869669914, 'learning_rate': 2.1313320825515948e-05, 'epoch': 8.94}


 89%|████████▉ | 2387/2670 [58:10<06:55,  1.47s/it]

{'loss': 0.0049, 'grad_norm': 0.0522756390273571, 'learning_rate': 2.123827392120075e-05, 'epoch': 8.94}


 89%|████████▉ | 2388/2670 [58:12<06:52,  1.46s/it]

{'loss': 0.0046, 'grad_norm': 0.06229430064558983, 'learning_rate': 2.1163227016885552e-05, 'epoch': 8.94}


 89%|████████▉ | 2389/2670 [58:13<06:31,  1.39s/it]

{'loss': 0.0068, 'grad_norm': 0.07820646464824677, 'learning_rate': 2.1088180112570358e-05, 'epoch': 8.95}


 90%|████████▉ | 2390/2670 [58:14<06:27,  1.38s/it]

{'loss': 0.0039, 'grad_norm': 0.048988934606313705, 'learning_rate': 2.1013133208255163e-05, 'epoch': 8.95}


 90%|████████▉ | 2391/2670 [58:16<06:37,  1.42s/it]

{'loss': 0.0051, 'grad_norm': 0.05655757337808609, 'learning_rate': 2.0938086303939962e-05, 'epoch': 8.96}


 90%|████████▉ | 2392/2670 [58:17<06:54,  1.49s/it]

{'loss': 0.0028, 'grad_norm': 0.04013131558895111, 'learning_rate': 2.0863039399624768e-05, 'epoch': 8.96}


 90%|████████▉ | 2393/2670 [58:19<07:13,  1.56s/it]

{'loss': 0.003, 'grad_norm': 0.04991436004638672, 'learning_rate': 2.078799249530957e-05, 'epoch': 8.96}


 90%|████████▉ | 2394/2670 [58:21<07:14,  1.58s/it]

{'loss': 0.0049, 'grad_norm': 0.062404289841651917, 'learning_rate': 2.0712945590994372e-05, 'epoch': 8.97}


 90%|████████▉ | 2395/2670 [58:22<07:08,  1.56s/it]

{'loss': 0.0064, 'grad_norm': 0.07980488240718842, 'learning_rate': 2.0637898686679174e-05, 'epoch': 8.97}


 90%|████████▉ | 2396/2670 [58:23<06:46,  1.48s/it]

{'loss': 0.0051, 'grad_norm': 0.06018561124801636, 'learning_rate': 2.056285178236398e-05, 'epoch': 8.97}


 90%|████████▉ | 2397/2670 [58:25<06:35,  1.45s/it]

{'loss': 0.0051, 'grad_norm': 0.060053396970033646, 'learning_rate': 2.048780487804878e-05, 'epoch': 8.98}


 90%|████████▉ | 2398/2670 [58:26<06:44,  1.49s/it]

{'loss': 0.0076, 'grad_norm': 0.0646914690732956, 'learning_rate': 2.0412757973733583e-05, 'epoch': 8.98}


 90%|████████▉ | 2399/2670 [58:28<06:46,  1.50s/it]

{'loss': 0.0036, 'grad_norm': 0.05063366889953613, 'learning_rate': 2.033771106941839e-05, 'epoch': 8.99}


 90%|████████▉ | 2400/2670 [58:29<06:29,  1.44s/it]

{'loss': 0.006, 'grad_norm': 0.0649741142988205, 'learning_rate': 2.026266416510319e-05, 'epoch': 8.99}


 90%|████████▉ | 2401/2670 [58:31<06:26,  1.44s/it]

{'loss': 0.0049, 'grad_norm': 0.06246205046772957, 'learning_rate': 2.0187617260787993e-05, 'epoch': 8.99}


 90%|████████▉ | 2402/2670 [58:32<06:35,  1.48s/it]

{'loss': 0.006, 'grad_norm': 0.11056031286716461, 'learning_rate': 2.0112570356472795e-05, 'epoch': 9.0}


 90%|█████████ | 2403/2670 [58:34<06:36,  1.49s/it]

{'loss': 0.0034, 'grad_norm': 0.04161662608385086, 'learning_rate': 2.00375234521576e-05, 'epoch': 9.0}


 90%|█████████ | 2404/2670 [58:35<06:46,  1.53s/it]

{'loss': 0.0023, 'grad_norm': 0.03601263836026192, 'learning_rate': 1.9962476547842403e-05, 'epoch': 9.0}


 90%|█████████ | 2405/2670 [58:37<06:38,  1.50s/it]

{'loss': 0.0037, 'grad_norm': 0.05017167329788208, 'learning_rate': 1.9887429643527205e-05, 'epoch': 9.01}


 90%|█████████ | 2406/2670 [58:38<06:12,  1.41s/it]

{'loss': 0.003, 'grad_norm': 0.04009184613823891, 'learning_rate': 1.981238273921201e-05, 'epoch': 9.01}


 90%|█████████ | 2407/2670 [58:39<06:17,  1.44s/it]

{'loss': 0.0038, 'grad_norm': 0.046520013362169266, 'learning_rate': 1.9737335834896812e-05, 'epoch': 9.01}


 90%|█████████ | 2408/2670 [58:41<06:16,  1.44s/it]

{'loss': 0.002, 'grad_norm': 0.023485848680138588, 'learning_rate': 1.9662288930581614e-05, 'epoch': 9.02}


 90%|█████████ | 2409/2670 [58:42<06:06,  1.40s/it]

{'loss': 0.0032, 'grad_norm': 0.039926353842020035, 'learning_rate': 1.9587242026266416e-05, 'epoch': 9.02}


 90%|█████████ | 2410/2670 [58:44<06:14,  1.44s/it]

{'loss': 0.0024, 'grad_norm': 0.03261886164546013, 'learning_rate': 1.9512195121951222e-05, 'epoch': 9.03}


 90%|█████████ | 2411/2670 [58:45<06:20,  1.47s/it]

{'loss': 0.0041, 'grad_norm': 0.06412622332572937, 'learning_rate': 1.9437148217636024e-05, 'epoch': 9.03}


 90%|█████████ | 2412/2670 [58:47<06:20,  1.47s/it]

{'loss': 0.0041, 'grad_norm': 0.046701282262802124, 'learning_rate': 1.9362101313320826e-05, 'epoch': 9.03}


 90%|█████████ | 2413/2670 [58:48<06:09,  1.44s/it]

{'loss': 0.0043, 'grad_norm': 0.054535262286663055, 'learning_rate': 1.9287054409005628e-05, 'epoch': 9.04}


 90%|█████████ | 2414/2670 [58:50<06:14,  1.46s/it]

{'loss': 0.0031, 'grad_norm': 0.04165560379624367, 'learning_rate': 1.9212007504690434e-05, 'epoch': 9.04}


 90%|█████████ | 2415/2670 [58:51<06:17,  1.48s/it]

{'loss': 0.0045, 'grad_norm': 0.060275401920080185, 'learning_rate': 1.9136960600375236e-05, 'epoch': 9.04}


 90%|█████████ | 2416/2670 [58:53<06:25,  1.52s/it]

{'loss': 0.0026, 'grad_norm': 0.036289457231760025, 'learning_rate': 1.9061913696060038e-05, 'epoch': 9.05}


 91%|█████████ | 2417/2670 [58:54<06:16,  1.49s/it]

{'loss': 0.0029, 'grad_norm': 0.03999430313706398, 'learning_rate': 1.8986866791744843e-05, 'epoch': 9.05}


 91%|█████████ | 2418/2670 [58:56<06:19,  1.50s/it]

{'loss': 0.0034, 'grad_norm': 0.04134495556354523, 'learning_rate': 1.8911819887429645e-05, 'epoch': 9.06}


 91%|█████████ | 2419/2670 [58:57<06:13,  1.49s/it]

{'loss': 0.0032, 'grad_norm': 0.04667677357792854, 'learning_rate': 1.8836772983114447e-05, 'epoch': 9.06}


 91%|█████████ | 2420/2670 [58:59<06:06,  1.47s/it]

{'loss': 0.004, 'grad_norm': 0.06350583583116531, 'learning_rate': 1.876172607879925e-05, 'epoch': 9.06}


 91%|█████████ | 2421/2670 [59:00<06:09,  1.48s/it]

{'loss': 0.0025, 'grad_norm': 0.03403039649128914, 'learning_rate': 1.8686679174484055e-05, 'epoch': 9.07}


 91%|█████████ | 2422/2670 [59:01<05:56,  1.44s/it]

{'loss': 0.003, 'grad_norm': 0.03637654334306717, 'learning_rate': 1.8611632270168857e-05, 'epoch': 9.07}


 91%|█████████ | 2423/2670 [59:03<06:08,  1.49s/it]

{'loss': 0.0036, 'grad_norm': 0.056775037199258804, 'learning_rate': 1.853658536585366e-05, 'epoch': 9.07}


 91%|█████████ | 2424/2670 [59:05<06:23,  1.56s/it]

{'loss': 0.0024, 'grad_norm': 0.034644417464733124, 'learning_rate': 1.8461538461538465e-05, 'epoch': 9.08}


 91%|█████████ | 2425/2670 [59:06<06:06,  1.50s/it]

{'loss': 0.0037, 'grad_norm': 0.05426502227783203, 'learning_rate': 1.8386491557223263e-05, 'epoch': 9.08}


 91%|█████████ | 2426/2670 [59:07<05:47,  1.43s/it]

{'loss': 0.0027, 'grad_norm': 0.03431712090969086, 'learning_rate': 1.831144465290807e-05, 'epoch': 9.09}


 91%|█████████ | 2427/2670 [59:09<05:30,  1.36s/it]

{'loss': 0.0067, 'grad_norm': 0.07231365144252777, 'learning_rate': 1.823639774859287e-05, 'epoch': 9.09}


 91%|█████████ | 2428/2670 [59:10<05:35,  1.39s/it]

{'loss': 0.0049, 'grad_norm': 0.05271144583821297, 'learning_rate': 1.8161350844277676e-05, 'epoch': 9.09}


 91%|█████████ | 2429/2670 [59:11<05:23,  1.34s/it]

{'loss': 0.0031, 'grad_norm': 0.05097869038581848, 'learning_rate': 1.8086303939962475e-05, 'epoch': 9.1}


 91%|█████████ | 2430/2670 [59:13<05:31,  1.38s/it]

{'loss': 0.0029, 'grad_norm': 0.03343106061220169, 'learning_rate': 1.801125703564728e-05, 'epoch': 9.1}


 91%|█████████ | 2431/2670 [59:14<05:47,  1.45s/it]

{'loss': 0.0028, 'grad_norm': 0.03631104528903961, 'learning_rate': 1.7936210131332086e-05, 'epoch': 9.1}


 91%|█████████ | 2432/2670 [59:16<05:26,  1.37s/it]

{'loss': 0.0075, 'grad_norm': 0.1886274218559265, 'learning_rate': 1.7861163227016885e-05, 'epoch': 9.11}


 91%|█████████ | 2433/2670 [59:17<05:17,  1.34s/it]

{'loss': 0.0027, 'grad_norm': 0.0593317449092865, 'learning_rate': 1.778611632270169e-05, 'epoch': 9.11}


 91%|█████████ | 2434/2670 [59:18<05:36,  1.43s/it]

{'loss': 0.0024, 'grad_norm': 0.037052661180496216, 'learning_rate': 1.7711069418386492e-05, 'epoch': 9.12}


 91%|█████████ | 2435/2670 [59:20<05:23,  1.38s/it]

{'loss': 0.0042, 'grad_norm': 0.046417295932769775, 'learning_rate': 1.7636022514071294e-05, 'epoch': 9.12}


 91%|█████████ | 2436/2670 [59:21<05:17,  1.36s/it]

{'loss': 0.003, 'grad_norm': 0.040671028196811676, 'learning_rate': 1.7560975609756096e-05, 'epoch': 9.12}


 91%|█████████▏| 2437/2670 [59:22<05:14,  1.35s/it]

{'loss': 0.0028, 'grad_norm': 0.0422985665500164, 'learning_rate': 1.7485928705440902e-05, 'epoch': 9.13}


 91%|█████████▏| 2438/2670 [59:24<05:11,  1.34s/it]

{'loss': 0.0029, 'grad_norm': 0.03982197120785713, 'learning_rate': 1.7410881801125707e-05, 'epoch': 9.13}


 91%|█████████▏| 2439/2670 [59:25<05:17,  1.37s/it]

{'loss': 0.0027, 'grad_norm': 0.04516848549246788, 'learning_rate': 1.7335834896810506e-05, 'epoch': 9.13}


 91%|█████████▏| 2440/2670 [59:27<05:21,  1.40s/it]

{'loss': 0.0022, 'grad_norm': 0.03417123854160309, 'learning_rate': 1.726078799249531e-05, 'epoch': 9.14}


 91%|█████████▏| 2441/2670 [59:28<05:30,  1.44s/it]

{'loss': 0.0033, 'grad_norm': 0.04556252062320709, 'learning_rate': 1.7185741088180114e-05, 'epoch': 9.14}


 91%|█████████▏| 2442/2670 [59:29<05:20,  1.41s/it]

{'loss': 0.0038, 'grad_norm': 0.047005534172058105, 'learning_rate': 1.7110694183864916e-05, 'epoch': 9.15}


 91%|█████████▏| 2443/2670 [59:31<05:32,  1.46s/it]

{'loss': 0.0031, 'grad_norm': 0.05336925387382507, 'learning_rate': 1.7035647279549718e-05, 'epoch': 9.15}


 92%|█████████▏| 2444/2670 [59:33<05:51,  1.55s/it]

{'loss': 0.0024, 'grad_norm': 0.03472273796796799, 'learning_rate': 1.6960600375234523e-05, 'epoch': 9.15}


 92%|█████████▏| 2445/2670 [59:34<05:35,  1.49s/it]

{'loss': 0.0033, 'grad_norm': 0.05171572417020798, 'learning_rate': 1.6885553470919325e-05, 'epoch': 9.16}


 92%|█████████▏| 2446/2670 [59:35<05:17,  1.42s/it]

{'loss': 0.0032, 'grad_norm': 0.04698804393410683, 'learning_rate': 1.6810506566604127e-05, 'epoch': 9.16}


 92%|█████████▏| 2447/2670 [59:37<05:10,  1.39s/it]

{'loss': 0.0042, 'grad_norm': 0.06025408208370209, 'learning_rate': 1.6735459662288933e-05, 'epoch': 9.16}


 92%|█████████▏| 2448/2670 [59:38<04:56,  1.33s/it]

{'loss': 0.005, 'grad_norm': 0.05667896568775177, 'learning_rate': 1.6660412757973735e-05, 'epoch': 9.17}


 92%|█████████▏| 2449/2670 [59:39<05:02,  1.37s/it]

{'loss': 0.0028, 'grad_norm': 0.04257461801171303, 'learning_rate': 1.6585365853658537e-05, 'epoch': 9.17}


 92%|█████████▏| 2450/2670 [59:41<04:56,  1.35s/it]

{'loss': 0.0044, 'grad_norm': 0.054356396198272705, 'learning_rate': 1.651031894934334e-05, 'epoch': 9.18}


 92%|█████████▏| 2451/2670 [59:42<04:49,  1.32s/it]

{'loss': 0.0046, 'grad_norm': 0.05990225449204445, 'learning_rate': 1.6435272045028144e-05, 'epoch': 9.18}


 92%|█████████▏| 2452/2670 [59:43<05:01,  1.38s/it]

{'loss': 0.0024, 'grad_norm': 0.03833319619297981, 'learning_rate': 1.6360225140712947e-05, 'epoch': 9.18}


 92%|█████████▏| 2453/2670 [59:45<04:49,  1.33s/it]

{'loss': 0.0048, 'grad_norm': 0.06026530638337135, 'learning_rate': 1.628517823639775e-05, 'epoch': 9.19}


 92%|█████████▏| 2454/2670 [59:46<05:05,  1.41s/it]

{'loss': 0.0031, 'grad_norm': 0.03843658044934273, 'learning_rate': 1.6210131332082554e-05, 'epoch': 9.19}


 92%|█████████▏| 2455/2670 [59:48<05:01,  1.40s/it]

{'loss': 0.003, 'grad_norm': 0.045483969151973724, 'learning_rate': 1.6135084427767356e-05, 'epoch': 9.19}


 92%|█████████▏| 2456/2670 [59:49<05:12,  1.46s/it]

{'loss': 0.0039, 'grad_norm': 0.05078183114528656, 'learning_rate': 1.6060037523452158e-05, 'epoch': 9.2}


 92%|█████████▏| 2457/2670 [59:51<05:14,  1.48s/it]

{'loss': 0.0047, 'grad_norm': 0.038544729351997375, 'learning_rate': 1.598499061913696e-05, 'epoch': 9.2}


 92%|█████████▏| 2458/2670 [59:52<05:15,  1.49s/it]

{'loss': 0.0029, 'grad_norm': 0.048404645174741745, 'learning_rate': 1.5909943714821766e-05, 'epoch': 9.21}


 92%|█████████▏| 2459/2670 [59:54<05:13,  1.49s/it]

{'loss': 0.0033, 'grad_norm': 0.10618600994348526, 'learning_rate': 1.5834896810506568e-05, 'epoch': 9.21}


 92%|█████████▏| 2460/2670 [59:55<05:16,  1.51s/it]

{'loss': 0.0032, 'grad_norm': 0.04490227997303009, 'learning_rate': 1.575984990619137e-05, 'epoch': 9.21}


 92%|█████████▏| 2461/2670 [59:57<05:09,  1.48s/it]

{'loss': 0.0049, 'grad_norm': 0.06070190668106079, 'learning_rate': 1.5684803001876172e-05, 'epoch': 9.22}


 92%|█████████▏| 2462/2670 [59:58<04:54,  1.42s/it]

{'loss': 0.0028, 'grad_norm': 0.04775979742407799, 'learning_rate': 1.5609756097560978e-05, 'epoch': 9.22}


 92%|█████████▏| 2463/2670 [1:00:00<04:58,  1.44s/it]

{'loss': 0.004, 'grad_norm': 0.056161198765039444, 'learning_rate': 1.553470919324578e-05, 'epoch': 9.22}


 92%|█████████▏| 2464/2670 [1:00:01<05:07,  1.49s/it]

{'loss': 0.0021, 'grad_norm': 0.03195654973387718, 'learning_rate': 1.5459662288930582e-05, 'epoch': 9.23}


 92%|█████████▏| 2465/2670 [1:00:03<05:09,  1.51s/it]

{'loss': 0.0024, 'grad_norm': 0.03327559679746628, 'learning_rate': 1.5384615384615387e-05, 'epoch': 9.23}


 92%|█████████▏| 2466/2670 [1:00:04<04:55,  1.45s/it]

{'loss': 0.0027, 'grad_norm': 0.039761219173669815, 'learning_rate': 1.530956848030019e-05, 'epoch': 9.24}


 92%|█████████▏| 2467/2670 [1:00:05<04:43,  1.40s/it]

{'loss': 0.0027, 'grad_norm': 0.04016274958848953, 'learning_rate': 1.5234521575984991e-05, 'epoch': 9.24}


 92%|█████████▏| 2468/2670 [1:00:07<04:50,  1.44s/it]

{'loss': 0.0022, 'grad_norm': 0.03793317824602127, 'learning_rate': 1.5159474671669793e-05, 'epoch': 9.24}


 92%|█████████▏| 2469/2670 [1:00:08<04:50,  1.44s/it]

{'loss': 0.0044, 'grad_norm': 0.04929938167333603, 'learning_rate': 1.5084427767354597e-05, 'epoch': 9.25}


 93%|█████████▎| 2470/2670 [1:00:10<04:59,  1.50s/it]

{'loss': 0.0028, 'grad_norm': 0.044052932411432266, 'learning_rate': 1.5009380863039401e-05, 'epoch': 9.25}


 93%|█████████▎| 2471/2670 [1:00:11<04:59,  1.51s/it]

{'loss': 0.0027, 'grad_norm': 0.029726145789027214, 'learning_rate': 1.4934333958724203e-05, 'epoch': 9.25}


 93%|█████████▎| 2472/2670 [1:00:13<05:04,  1.54s/it]

{'loss': 0.0044, 'grad_norm': 0.08968432247638702, 'learning_rate': 1.4859287054409007e-05, 'epoch': 9.26}


 93%|█████████▎| 2473/2670 [1:00:14<04:52,  1.49s/it]

{'loss': 0.0036, 'grad_norm': 0.05123215913772583, 'learning_rate': 1.4784240150093809e-05, 'epoch': 9.26}


 93%|█████████▎| 2474/2670 [1:00:16<04:43,  1.44s/it]

{'loss': 0.0032, 'grad_norm': 0.04338358715176582, 'learning_rate': 1.4709193245778613e-05, 'epoch': 9.27}


 93%|█████████▎| 2475/2670 [1:00:17<04:46,  1.47s/it]

{'loss': 0.0052, 'grad_norm': 0.057407233864068985, 'learning_rate': 1.4634146341463415e-05, 'epoch': 9.27}


 93%|█████████▎| 2476/2670 [1:00:18<04:29,  1.39s/it]

{'loss': 0.006, 'grad_norm': 0.0645618885755539, 'learning_rate': 1.4559099437148219e-05, 'epoch': 9.27}


 93%|█████████▎| 2477/2670 [1:00:20<04:44,  1.48s/it]

{'loss': 0.0041, 'grad_norm': 0.0494871512055397, 'learning_rate': 1.448405253283302e-05, 'epoch': 9.28}


 93%|█████████▎| 2478/2670 [1:00:21<04:35,  1.44s/it]

{'loss': 0.0031, 'grad_norm': 0.0442529134452343, 'learning_rate': 1.4409005628517824e-05, 'epoch': 9.28}


 93%|█████████▎| 2479/2670 [1:00:23<04:44,  1.49s/it]

{'loss': 0.003, 'grad_norm': 0.04242083430290222, 'learning_rate': 1.4333958724202628e-05, 'epoch': 9.28}


 93%|█████████▎| 2480/2670 [1:00:24<04:23,  1.39s/it]

{'loss': 0.003, 'grad_norm': 0.05140998587012291, 'learning_rate': 1.425891181988743e-05, 'epoch': 9.29}


 93%|█████████▎| 2481/2670 [1:00:26<04:20,  1.38s/it]

{'loss': 0.0059, 'grad_norm': 0.08361880481243134, 'learning_rate': 1.4183864915572234e-05, 'epoch': 9.29}


 93%|█████████▎| 2482/2670 [1:00:27<04:37,  1.48s/it]

{'loss': 0.0026, 'grad_norm': 0.04005025327205658, 'learning_rate': 1.4108818011257036e-05, 'epoch': 9.3}


 93%|█████████▎| 2483/2670 [1:00:29<04:32,  1.46s/it]

{'loss': 0.0029, 'grad_norm': 0.041064295917749405, 'learning_rate': 1.403377110694184e-05, 'epoch': 9.3}


 93%|█████████▎| 2484/2670 [1:00:30<04:45,  1.54s/it]

{'loss': 0.0029, 'grad_norm': 0.039932072162628174, 'learning_rate': 1.3958724202626642e-05, 'epoch': 9.3}


 93%|█████████▎| 2485/2670 [1:00:32<04:38,  1.51s/it]

{'loss': 0.003, 'grad_norm': 0.04769480973482132, 'learning_rate': 1.3883677298311446e-05, 'epoch': 9.31}


 93%|█████████▎| 2486/2670 [1:00:33<04:26,  1.45s/it]

{'loss': 0.003, 'grad_norm': 0.05507104843854904, 'learning_rate': 1.380863039399625e-05, 'epoch': 9.31}


 93%|█████████▎| 2487/2670 [1:00:35<04:34,  1.50s/it]

{'loss': 0.0024, 'grad_norm': 0.028279641643166542, 'learning_rate': 1.3733583489681052e-05, 'epoch': 9.31}


 93%|█████████▎| 2488/2670 [1:00:36<04:22,  1.44s/it]

{'loss': 0.0035, 'grad_norm': 0.047440145164728165, 'learning_rate': 1.3658536585365855e-05, 'epoch': 9.32}


 93%|█████████▎| 2489/2670 [1:00:37<04:12,  1.39s/it]

{'loss': 0.0053, 'grad_norm': 0.06300695985555649, 'learning_rate': 1.3583489681050657e-05, 'epoch': 9.32}


 93%|█████████▎| 2490/2670 [1:00:39<04:11,  1.39s/it]

{'loss': 0.0052, 'grad_norm': 0.07046379894018173, 'learning_rate': 1.3508442776735461e-05, 'epoch': 9.33}


 93%|█████████▎| 2491/2670 [1:00:40<04:13,  1.42s/it]

{'loss': 0.0031, 'grad_norm': 0.04790165647864342, 'learning_rate': 1.3433395872420262e-05, 'epoch': 9.33}


 93%|█████████▎| 2492/2670 [1:00:42<04:14,  1.43s/it]

{'loss': 0.004, 'grad_norm': 0.049003563821315765, 'learning_rate': 1.3358348968105067e-05, 'epoch': 9.33}


 93%|█████████▎| 2493/2670 [1:00:43<04:06,  1.39s/it]

{'loss': 0.0035, 'grad_norm': 0.05079207569360733, 'learning_rate': 1.3283302063789867e-05, 'epoch': 9.34}


 93%|█████████▎| 2494/2670 [1:00:44<04:05,  1.40s/it]

{'loss': 0.0026, 'grad_norm': 0.036255571991205215, 'learning_rate': 1.3208255159474673e-05, 'epoch': 9.34}


 93%|█████████▎| 2495/2670 [1:00:46<04:10,  1.43s/it]

{'loss': 0.0037, 'grad_norm': 0.05082211270928383, 'learning_rate': 1.3133208255159477e-05, 'epoch': 9.34}


 93%|█████████▎| 2496/2670 [1:00:47<04:11,  1.44s/it]

{'loss': 0.0032, 'grad_norm': 0.046935275197029114, 'learning_rate': 1.3058161350844279e-05, 'epoch': 9.35}


 94%|█████████▎| 2497/2670 [1:00:49<04:23,  1.52s/it]

{'loss': 0.0033, 'grad_norm': 0.04635535553097725, 'learning_rate': 1.2983114446529083e-05, 'epoch': 9.35}


 94%|█████████▎| 2498/2670 [1:00:50<04:12,  1.47s/it]

{'loss': 0.0028, 'grad_norm': 0.03351273015141487, 'learning_rate': 1.2908067542213883e-05, 'epoch': 9.36}


 94%|█████████▎| 2499/2670 [1:00:52<04:15,  1.50s/it]

{'loss': 0.0031, 'grad_norm': 0.0464714877307415, 'learning_rate': 1.2833020637898688e-05, 'epoch': 9.36}


 94%|█████████▎| 2500/2670 [1:00:54<04:15,  1.50s/it]

{'loss': 0.0027, 'grad_norm': 0.05334766209125519, 'learning_rate': 1.2757973733583489e-05, 'epoch': 9.36}


 94%|█████████▎| 2501/2670 [1:00:57<05:27,  1.94s/it]

{'loss': 0.004, 'grad_norm': 0.06395519524812698, 'learning_rate': 1.2682926829268294e-05, 'epoch': 9.37}


 94%|█████████▎| 2502/2670 [1:00:58<05:08,  1.84s/it]

{'loss': 0.0037, 'grad_norm': 0.08027903735637665, 'learning_rate': 1.2607879924953098e-05, 'epoch': 9.37}


 94%|█████████▎| 2503/2670 [1:00:59<04:41,  1.69s/it]

{'loss': 0.0046, 'grad_norm': 0.05630533769726753, 'learning_rate': 1.2532833020637898e-05, 'epoch': 9.37}


 94%|█████████▍| 2504/2670 [1:01:01<04:35,  1.66s/it]

{'loss': 0.0025, 'grad_norm': 0.05809566751122475, 'learning_rate': 1.2457786116322702e-05, 'epoch': 9.38}


 94%|█████████▍| 2505/2670 [1:01:02<04:21,  1.58s/it]

{'loss': 0.0027, 'grad_norm': 0.03764619305729866, 'learning_rate': 1.2382739212007504e-05, 'epoch': 9.38}


 94%|█████████▍| 2506/2670 [1:01:04<04:08,  1.51s/it]

{'loss': 0.0029, 'grad_norm': 0.040446311235427856, 'learning_rate': 1.230769230769231e-05, 'epoch': 9.39}


 94%|█████████▍| 2507/2670 [1:01:05<04:08,  1.52s/it]

{'loss': 0.0033, 'grad_norm': 0.047178033739328384, 'learning_rate': 1.2232645403377112e-05, 'epoch': 9.39}


 94%|█████████▍| 2508/2670 [1:01:07<04:02,  1.50s/it]

{'loss': 0.0032, 'grad_norm': 0.051152534782886505, 'learning_rate': 1.2157598499061914e-05, 'epoch': 9.39}


 94%|█████████▍| 2509/2670 [1:01:08<03:58,  1.48s/it]

{'loss': 0.0051, 'grad_norm': 0.07029983401298523, 'learning_rate': 1.2082551594746718e-05, 'epoch': 9.4}


 94%|█████████▍| 2510/2670 [1:01:10<03:50,  1.44s/it]

{'loss': 0.0042, 'grad_norm': 0.05923242121934891, 'learning_rate': 1.200750469043152e-05, 'epoch': 9.4}


 94%|█████████▍| 2511/2670 [1:01:11<03:53,  1.47s/it]

{'loss': 0.0028, 'grad_norm': 0.041711293160915375, 'learning_rate': 1.1932457786116324e-05, 'epoch': 9.4}


 94%|█████████▍| 2512/2670 [1:01:13<04:04,  1.55s/it]

{'loss': 0.0021, 'grad_norm': 0.026802709326148033, 'learning_rate': 1.1857410881801126e-05, 'epoch': 9.41}


 94%|█████████▍| 2513/2670 [1:01:14<03:59,  1.53s/it]

{'loss': 0.0017, 'grad_norm': 0.028891131281852722, 'learning_rate': 1.178236397748593e-05, 'epoch': 9.41}


 94%|█████████▍| 2514/2670 [1:01:16<03:52,  1.49s/it]

{'loss': 0.003, 'grad_norm': 0.04305355250835419, 'learning_rate': 1.1707317073170733e-05, 'epoch': 9.42}


 94%|█████████▍| 2515/2670 [1:01:17<03:44,  1.45s/it]

{'loss': 0.0022, 'grad_norm': 0.03648136928677559, 'learning_rate': 1.1632270168855535e-05, 'epoch': 9.42}


 94%|█████████▍| 2516/2670 [1:01:19<03:44,  1.46s/it]

{'loss': 0.0026, 'grad_norm': 0.03692798316478729, 'learning_rate': 1.1557223264540339e-05, 'epoch': 9.42}


 94%|█████████▍| 2517/2670 [1:01:20<03:38,  1.43s/it]

{'loss': 0.0032, 'grad_norm': 0.04222279414534569, 'learning_rate': 1.1482176360225141e-05, 'epoch': 9.43}


 94%|█████████▍| 2518/2670 [1:01:21<03:35,  1.42s/it]

{'loss': 0.0034, 'grad_norm': 0.045647285878658295, 'learning_rate': 1.1407129455909945e-05, 'epoch': 9.43}


 94%|█████████▍| 2519/2670 [1:01:23<03:30,  1.39s/it]

{'loss': 0.0036, 'grad_norm': 0.05328601226210594, 'learning_rate': 1.1332082551594747e-05, 'epoch': 9.43}


 94%|█████████▍| 2520/2670 [1:01:24<03:29,  1.40s/it]

{'loss': 0.0032, 'grad_norm': 0.045024171471595764, 'learning_rate': 1.125703564727955e-05, 'epoch': 9.44}


 94%|█████████▍| 2521/2670 [1:01:26<03:33,  1.43s/it]

{'loss': 0.0029, 'grad_norm': 0.03592384606599808, 'learning_rate': 1.1181988742964353e-05, 'epoch': 9.44}


 94%|█████████▍| 2522/2670 [1:01:27<03:25,  1.39s/it]

{'loss': 0.0044, 'grad_norm': 0.05719159170985222, 'learning_rate': 1.1106941838649157e-05, 'epoch': 9.45}


 94%|█████████▍| 2523/2670 [1:01:28<03:29,  1.43s/it]

{'loss': 0.0044, 'grad_norm': 0.061473581939935684, 'learning_rate': 1.103189493433396e-05, 'epoch': 9.45}


 95%|█████████▍| 2524/2670 [1:01:30<03:30,  1.44s/it]

{'loss': 0.0189, 'grad_norm': 0.6830554604530334, 'learning_rate': 1.0956848030018762e-05, 'epoch': 9.45}


 95%|█████████▍| 2525/2670 [1:01:31<03:25,  1.41s/it]

{'loss': 0.0031, 'grad_norm': 0.04973304644227028, 'learning_rate': 1.0881801125703566e-05, 'epoch': 9.46}


 95%|█████████▍| 2526/2670 [1:01:33<03:22,  1.41s/it]

{'loss': 0.0039, 'grad_norm': 0.05460500344634056, 'learning_rate': 1.0806754221388368e-05, 'epoch': 9.46}


 95%|█████████▍| 2527/2670 [1:01:34<03:18,  1.39s/it]

{'loss': 0.0045, 'grad_norm': 0.046347588300704956, 'learning_rate': 1.073170731707317e-05, 'epoch': 9.46}


 95%|█████████▍| 2528/2670 [1:01:36<03:24,  1.44s/it]

{'loss': 0.0047, 'grad_norm': 0.057163652032613754, 'learning_rate': 1.0656660412757974e-05, 'epoch': 9.47}


 95%|█████████▍| 2529/2670 [1:01:37<03:25,  1.46s/it]

{'loss': 0.0044, 'grad_norm': 0.0612531453371048, 'learning_rate': 1.0581613508442776e-05, 'epoch': 9.47}


 95%|█████████▍| 2530/2670 [1:01:38<03:19,  1.42s/it]

{'loss': 0.0025, 'grad_norm': 0.04098788648843765, 'learning_rate': 1.0506566604127582e-05, 'epoch': 9.48}


 95%|█████████▍| 2531/2670 [1:01:40<03:06,  1.34s/it]

{'loss': 0.0034, 'grad_norm': 0.05335719510912895, 'learning_rate': 1.0431519699812384e-05, 'epoch': 9.48}


 95%|█████████▍| 2532/2670 [1:01:41<03:10,  1.38s/it]

{'loss': 0.003, 'grad_norm': 0.04047873616218567, 'learning_rate': 1.0356472795497186e-05, 'epoch': 9.48}


 95%|█████████▍| 2533/2670 [1:01:42<03:10,  1.39s/it]

{'loss': 0.0033, 'grad_norm': 0.0396096371114254, 'learning_rate': 1.028142589118199e-05, 'epoch': 9.49}


 95%|█████████▍| 2534/2670 [1:01:44<03:11,  1.41s/it]

{'loss': 0.0019, 'grad_norm': 0.033489856868982315, 'learning_rate': 1.0206378986866792e-05, 'epoch': 9.49}


 95%|█████████▍| 2535/2670 [1:01:45<03:09,  1.41s/it]

{'loss': 0.0038, 'grad_norm': 0.048305731266736984, 'learning_rate': 1.0131332082551595e-05, 'epoch': 9.49}


 95%|█████████▍| 2536/2670 [1:01:47<03:17,  1.47s/it]

{'loss': 0.0036, 'grad_norm': 0.06326159089803696, 'learning_rate': 1.0056285178236398e-05, 'epoch': 9.5}


 95%|█████████▌| 2537/2670 [1:01:48<03:08,  1.42s/it]

{'loss': 0.0041, 'grad_norm': 0.05539437010884285, 'learning_rate': 9.981238273921201e-06, 'epoch': 9.5}


 95%|█████████▌| 2538/2670 [1:01:50<03:15,  1.48s/it]

{'loss': 0.0033, 'grad_norm': 0.057965945452451706, 'learning_rate': 9.906191369606005e-06, 'epoch': 9.51}


 95%|█████████▌| 2539/2670 [1:01:51<03:22,  1.54s/it]

{'loss': 0.0024, 'grad_norm': 0.03531288355588913, 'learning_rate': 9.831144465290807e-06, 'epoch': 9.51}


 95%|█████████▌| 2540/2670 [1:01:53<03:19,  1.54s/it]

{'loss': 0.0036, 'grad_norm': 0.04181978479027748, 'learning_rate': 9.756097560975611e-06, 'epoch': 9.51}


 95%|█████████▌| 2541/2670 [1:01:55<03:24,  1.59s/it]

{'loss': 0.0024, 'grad_norm': 0.04346490651369095, 'learning_rate': 9.681050656660413e-06, 'epoch': 9.52}


 95%|█████████▌| 2542/2670 [1:01:56<03:23,  1.59s/it]

{'loss': 0.0032, 'grad_norm': 0.048731885850429535, 'learning_rate': 9.606003752345217e-06, 'epoch': 9.52}


 95%|█████████▌| 2543/2670 [1:01:58<03:23,  1.60s/it]

{'loss': 0.0033, 'grad_norm': 0.03711260110139847, 'learning_rate': 9.530956848030019e-06, 'epoch': 9.52}


 95%|█████████▌| 2544/2670 [1:01:59<03:15,  1.55s/it]

{'loss': 0.0026, 'grad_norm': 0.04266221448779106, 'learning_rate': 9.455909943714823e-06, 'epoch': 9.53}


 95%|█████████▌| 2545/2670 [1:02:01<03:07,  1.50s/it]

{'loss': 0.0043, 'grad_norm': 0.05593423917889595, 'learning_rate': 9.380863039399625e-06, 'epoch': 9.53}


 95%|█████████▌| 2546/2670 [1:02:02<03:07,  1.51s/it]

{'loss': 0.0044, 'grad_norm': 0.06613053381443024, 'learning_rate': 9.305816135084429e-06, 'epoch': 9.54}


 95%|█████████▌| 2547/2670 [1:02:04<03:01,  1.48s/it]

{'loss': 0.0041, 'grad_norm': 0.07212024927139282, 'learning_rate': 9.230769230769232e-06, 'epoch': 9.54}


 95%|█████████▌| 2548/2670 [1:02:05<02:54,  1.43s/it]

{'loss': 0.0034, 'grad_norm': 0.0629015564918518, 'learning_rate': 9.155722326454034e-06, 'epoch': 9.54}


 95%|█████████▌| 2549/2670 [1:02:07<02:59,  1.49s/it]

{'loss': 0.0031, 'grad_norm': 0.045780353248119354, 'learning_rate': 9.080675422138838e-06, 'epoch': 9.55}


 96%|█████████▌| 2550/2670 [1:02:08<02:48,  1.41s/it]

{'loss': 0.0041, 'grad_norm': 0.06527800858020782, 'learning_rate': 9.00562851782364e-06, 'epoch': 9.55}


 96%|█████████▌| 2551/2670 [1:02:09<02:50,  1.43s/it]

{'loss': 0.0036, 'grad_norm': 0.05738327279686928, 'learning_rate': 8.930581613508442e-06, 'epoch': 9.55}


 96%|█████████▌| 2552/2670 [1:02:11<02:52,  1.46s/it]

{'loss': 0.0038, 'grad_norm': 0.05097493901848793, 'learning_rate': 8.855534709193246e-06, 'epoch': 9.56}


 96%|█████████▌| 2553/2670 [1:02:12<02:42,  1.39s/it]

{'loss': 0.0068, 'grad_norm': 0.07762411236763, 'learning_rate': 8.780487804878048e-06, 'epoch': 9.56}


 96%|█████████▌| 2554/2670 [1:02:13<02:41,  1.39s/it]

{'loss': 0.0036, 'grad_norm': 0.045649781823158264, 'learning_rate': 8.705440900562854e-06, 'epoch': 9.57}


 96%|█████████▌| 2555/2670 [1:02:15<02:39,  1.39s/it]

{'loss': 0.0031, 'grad_norm': 0.04478708654642105, 'learning_rate': 8.630393996247656e-06, 'epoch': 9.57}


 96%|█████████▌| 2556/2670 [1:02:16<02:38,  1.39s/it]

{'loss': 0.0025, 'grad_norm': 0.04108234494924545, 'learning_rate': 8.555347091932458e-06, 'epoch': 9.57}


 96%|█████████▌| 2557/2670 [1:02:18<02:40,  1.42s/it]

{'loss': 0.0047, 'grad_norm': 0.05825172737240791, 'learning_rate': 8.480300187617262e-06, 'epoch': 9.58}


 96%|█████████▌| 2558/2670 [1:02:19<02:45,  1.48s/it]

{'loss': 0.0037, 'grad_norm': 0.06330318003892899, 'learning_rate': 8.405253283302064e-06, 'epoch': 9.58}


 96%|█████████▌| 2559/2670 [1:02:21<02:47,  1.51s/it]

{'loss': 0.0033, 'grad_norm': 0.04586929455399513, 'learning_rate': 8.330206378986867e-06, 'epoch': 9.58}


 96%|█████████▌| 2560/2670 [1:02:22<02:37,  1.43s/it]

{'loss': 0.0058, 'grad_norm': 0.07798884063959122, 'learning_rate': 8.25515947467167e-06, 'epoch': 9.59}


 96%|█████████▌| 2561/2670 [1:02:24<02:39,  1.46s/it]

{'loss': 0.0046, 'grad_norm': 0.05779726430773735, 'learning_rate': 8.180112570356473e-06, 'epoch': 9.59}


 96%|█████████▌| 2562/2670 [1:02:25<02:42,  1.50s/it]

{'loss': 0.003, 'grad_norm': 0.04690731316804886, 'learning_rate': 8.105065666041277e-06, 'epoch': 9.6}


 96%|█████████▌| 2563/2670 [1:02:27<02:37,  1.47s/it]

{'loss': 0.0028, 'grad_norm': 0.04740187153220177, 'learning_rate': 8.030018761726079e-06, 'epoch': 9.6}


 96%|█████████▌| 2564/2670 [1:02:28<02:32,  1.44s/it]

{'loss': 0.0053, 'grad_norm': 0.056239575147628784, 'learning_rate': 7.954971857410883e-06, 'epoch': 9.6}


 96%|█████████▌| 2565/2670 [1:02:30<02:38,  1.51s/it]

{'loss': 0.0034, 'grad_norm': 0.08385259658098221, 'learning_rate': 7.879924953095685e-06, 'epoch': 9.61}


 96%|█████████▌| 2566/2670 [1:02:31<02:42,  1.57s/it]

{'loss': 0.0041, 'grad_norm': 0.043944064527750015, 'learning_rate': 7.804878048780489e-06, 'epoch': 9.61}


 96%|█████████▌| 2567/2670 [1:02:33<02:33,  1.49s/it]

{'loss': 0.0044, 'grad_norm': 0.06519152224063873, 'learning_rate': 7.729831144465291e-06, 'epoch': 9.61}


 96%|█████████▌| 2568/2670 [1:02:34<02:24,  1.41s/it]

{'loss': 0.0085, 'grad_norm': 0.04768695682287216, 'learning_rate': 7.654784240150095e-06, 'epoch': 9.62}


 96%|█████████▌| 2569/2670 [1:02:35<02:16,  1.36s/it]

{'loss': 0.0102, 'grad_norm': 0.2731023132801056, 'learning_rate': 7.579737335834897e-06, 'epoch': 9.62}


 96%|█████████▋| 2570/2670 [1:02:37<02:20,  1.41s/it]

{'loss': 0.0026, 'grad_norm': 0.042866166681051254, 'learning_rate': 7.5046904315197005e-06, 'epoch': 9.63}


 96%|█████████▋| 2571/2670 [1:02:38<02:22,  1.44s/it]

{'loss': 0.0027, 'grad_norm': 0.0411941222846508, 'learning_rate': 7.429643527204503e-06, 'epoch': 9.63}


 96%|█████████▋| 2572/2670 [1:02:40<02:15,  1.39s/it]

{'loss': 0.0029, 'grad_norm': 0.051022280007600784, 'learning_rate': 7.354596622889306e-06, 'epoch': 9.63}


 96%|█████████▋| 2573/2670 [1:02:41<02:16,  1.41s/it]

{'loss': 0.0043, 'grad_norm': 0.06000206246972084, 'learning_rate': 7.279549718574109e-06, 'epoch': 9.64}


 96%|█████████▋| 2574/2670 [1:02:42<02:13,  1.39s/it]

{'loss': 0.004, 'grad_norm': 0.0634259283542633, 'learning_rate': 7.204502814258912e-06, 'epoch': 9.64}


 96%|█████████▋| 2575/2670 [1:02:44<02:12,  1.39s/it]

{'loss': 0.0024, 'grad_norm': 0.04106370732188225, 'learning_rate': 7.129455909943715e-06, 'epoch': 9.64}


 96%|█████████▋| 2576/2670 [1:02:45<02:08,  1.36s/it]

{'loss': 0.0055, 'grad_norm': 0.07573509961366653, 'learning_rate': 7.054409005628518e-06, 'epoch': 9.65}


 97%|█████████▋| 2577/2670 [1:02:46<02:05,  1.35s/it]

{'loss': 0.0043, 'grad_norm': 0.051252420991659164, 'learning_rate': 6.979362101313321e-06, 'epoch': 9.65}


 97%|█████████▋| 2578/2670 [1:02:48<02:00,  1.31s/it]

{'loss': 0.0046, 'grad_norm': 0.062201108783483505, 'learning_rate': 6.904315196998125e-06, 'epoch': 9.66}


 97%|█████████▋| 2579/2670 [1:02:49<02:00,  1.33s/it]

{'loss': 0.0035, 'grad_norm': 0.057415351271629333, 'learning_rate': 6.829268292682928e-06, 'epoch': 9.66}


 97%|█████████▋| 2580/2670 [1:02:50<02:01,  1.35s/it]

{'loss': 0.0024, 'grad_norm': 0.03970223292708397, 'learning_rate': 6.754221388367731e-06, 'epoch': 9.66}


 97%|█████████▋| 2581/2670 [1:02:52<01:59,  1.35s/it]

{'loss': 0.0051, 'grad_norm': 0.06726040691137314, 'learning_rate': 6.6791744840525335e-06, 'epoch': 9.67}


 97%|█████████▋| 2582/2670 [1:02:53<02:03,  1.41s/it]

{'loss': 0.0054, 'grad_norm': 0.07274909317493439, 'learning_rate': 6.6041275797373365e-06, 'epoch': 9.67}


 97%|█████████▋| 2583/2670 [1:02:55<02:05,  1.45s/it]

{'loss': 0.0046, 'grad_norm': 0.19086909294128418, 'learning_rate': 6.529080675422139e-06, 'epoch': 9.67}


 97%|█████████▋| 2584/2670 [1:02:57<02:14,  1.57s/it]

{'loss': 0.0027, 'grad_norm': 0.04218347370624542, 'learning_rate': 6.4540337711069415e-06, 'epoch': 9.68}


 97%|█████████▋| 2585/2670 [1:02:58<02:12,  1.55s/it]

{'loss': 0.0032, 'grad_norm': 0.04540916904807091, 'learning_rate': 6.378986866791744e-06, 'epoch': 9.68}


 97%|█████████▋| 2586/2670 [1:03:00<02:08,  1.52s/it]

{'loss': 0.0022, 'grad_norm': 0.0350986123085022, 'learning_rate': 6.303939962476549e-06, 'epoch': 9.69}


 97%|█████████▋| 2587/2670 [1:03:01<02:07,  1.53s/it]

{'loss': 0.0038, 'grad_norm': 0.06016833335161209, 'learning_rate': 6.228893058161351e-06, 'epoch': 9.69}


 97%|█████████▋| 2588/2670 [1:03:03<02:15,  1.65s/it]

{'loss': 0.0023, 'grad_norm': 0.03547696769237518, 'learning_rate': 6.153846153846155e-06, 'epoch': 9.69}


 97%|█████████▋| 2589/2670 [1:03:04<02:07,  1.58s/it]

{'loss': 0.0019, 'grad_norm': 0.0340765118598938, 'learning_rate': 6.078799249530957e-06, 'epoch': 9.7}


 97%|█████████▋| 2590/2670 [1:03:06<02:14,  1.68s/it]

{'loss': 0.0043, 'grad_norm': 0.05642969533801079, 'learning_rate': 6.00375234521576e-06, 'epoch': 9.7}


 97%|█████████▋| 2591/2670 [1:03:08<02:06,  1.61s/it]

{'loss': 0.0021, 'grad_norm': 0.035043660551309586, 'learning_rate': 5.928705440900563e-06, 'epoch': 9.7}


 97%|█████████▋| 2592/2670 [1:03:09<02:01,  1.56s/it]

{'loss': 0.0046, 'grad_norm': 0.054091501981019974, 'learning_rate': 5.853658536585367e-06, 'epoch': 9.71}


 97%|█████████▋| 2593/2670 [1:03:11<01:59,  1.55s/it]

{'loss': 0.0043, 'grad_norm': 0.06006763502955437, 'learning_rate': 5.7786116322701695e-06, 'epoch': 9.71}


 97%|█████████▋| 2594/2670 [1:03:13<02:02,  1.61s/it]

{'loss': 0.0036, 'grad_norm': 0.05309227854013443, 'learning_rate': 5.7035647279549724e-06, 'epoch': 9.72}


 97%|█████████▋| 2595/2670 [1:03:14<01:57,  1.57s/it]

{'loss': 0.0045, 'grad_norm': 0.05300752446055412, 'learning_rate': 5.628517823639775e-06, 'epoch': 9.72}


 97%|█████████▋| 2596/2670 [1:03:15<01:48,  1.46s/it]

{'loss': 0.0054, 'grad_norm': 0.06750327348709106, 'learning_rate': 5.553470919324578e-06, 'epoch': 9.72}


 97%|█████████▋| 2597/2670 [1:03:17<01:53,  1.56s/it]

{'loss': 0.0033, 'grad_norm': 0.05140285566449165, 'learning_rate': 5.478424015009381e-06, 'epoch': 9.73}


 97%|█████████▋| 2598/2670 [1:03:18<01:44,  1.45s/it]

{'loss': 0.0058, 'grad_norm': 0.07415390014648438, 'learning_rate': 5.403377110694184e-06, 'epoch': 9.73}


 97%|█████████▋| 2599/2670 [1:03:20<01:44,  1.47s/it]

{'loss': 0.0033, 'grad_norm': 0.05091872811317444, 'learning_rate': 5.328330206378987e-06, 'epoch': 9.73}


 97%|█████████▋| 2600/2670 [1:03:21<01:40,  1.44s/it]

{'loss': 0.0031, 'grad_norm': 0.05513345077633858, 'learning_rate': 5.253283302063791e-06, 'epoch': 9.74}


 97%|█████████▋| 2601/2670 [1:03:23<01:38,  1.43s/it]

{'loss': 0.0053, 'grad_norm': 0.06130901724100113, 'learning_rate': 5.178236397748593e-06, 'epoch': 9.74}


 97%|█████████▋| 2602/2670 [1:03:24<01:35,  1.40s/it]

{'loss': 0.0035, 'grad_norm': 0.05505197122693062, 'learning_rate': 5.103189493433396e-06, 'epoch': 9.75}


 97%|█████████▋| 2603/2670 [1:03:25<01:31,  1.37s/it]

{'loss': 0.003, 'grad_norm': 0.04165085777640343, 'learning_rate': 5.028142589118199e-06, 'epoch': 9.75}


 98%|█████████▊| 2604/2670 [1:03:27<01:30,  1.37s/it]

{'loss': 0.0034, 'grad_norm': 0.04241397604346275, 'learning_rate': 4.9530956848030026e-06, 'epoch': 9.75}


 98%|█████████▊| 2605/2670 [1:03:28<01:30,  1.39s/it]

{'loss': 0.0047, 'grad_norm': 0.055026937276124954, 'learning_rate': 4.8780487804878055e-06, 'epoch': 9.76}


 98%|█████████▊| 2606/2670 [1:03:30<01:31,  1.44s/it]

{'loss': 0.003, 'grad_norm': 0.040264543145895004, 'learning_rate': 4.803001876172608e-06, 'epoch': 9.76}


 98%|█████████▊| 2607/2670 [1:03:31<01:27,  1.39s/it]

{'loss': 0.0071, 'grad_norm': 0.08177071064710617, 'learning_rate': 4.727954971857411e-06, 'epoch': 9.76}


 98%|█████████▊| 2608/2670 [1:03:32<01:28,  1.42s/it]

{'loss': 0.0027, 'grad_norm': 0.038754284381866455, 'learning_rate': 4.652908067542214e-06, 'epoch': 9.77}


 98%|█████████▊| 2609/2670 [1:03:34<01:30,  1.48s/it]

{'loss': 0.0025, 'grad_norm': 0.0335247702896595, 'learning_rate': 4.577861163227017e-06, 'epoch': 9.77}


 98%|█████████▊| 2610/2670 [1:03:35<01:26,  1.45s/it]

{'loss': 0.0049, 'grad_norm': 0.05552470684051514, 'learning_rate': 4.50281425891182e-06, 'epoch': 9.78}


 98%|█████████▊| 2611/2670 [1:03:37<01:29,  1.52s/it]

{'loss': 0.0038, 'grad_norm': 0.050155382603406906, 'learning_rate': 4.427767354596623e-06, 'epoch': 9.78}


 98%|█████████▊| 2612/2670 [1:03:39<01:29,  1.55s/it]

{'loss': 0.003, 'grad_norm': 0.055615127086639404, 'learning_rate': 4.352720450281427e-06, 'epoch': 9.78}


 98%|█████████▊| 2613/2670 [1:03:40<01:25,  1.49s/it]

{'loss': 0.0029, 'grad_norm': 0.040177736431360245, 'learning_rate': 4.277673545966229e-06, 'epoch': 9.79}


 98%|█████████▊| 2614/2670 [1:03:42<01:26,  1.55s/it]

{'loss': 0.0025, 'grad_norm': 0.059551797807216644, 'learning_rate': 4.202626641651032e-06, 'epoch': 9.79}


 98%|█████████▊| 2615/2670 [1:03:43<01:20,  1.46s/it]

{'loss': 0.0025, 'grad_norm': 0.047325026243925095, 'learning_rate': 4.127579737335835e-06, 'epoch': 9.79}


 98%|█████████▊| 2616/2670 [1:03:44<01:20,  1.49s/it]

{'loss': 0.0051, 'grad_norm': 0.061280421912670135, 'learning_rate': 4.0525328330206385e-06, 'epoch': 9.8}


 98%|█████████▊| 2617/2670 [1:03:46<01:19,  1.50s/it]

{'loss': 0.0034, 'grad_norm': 0.05120188742876053, 'learning_rate': 3.9774859287054415e-06, 'epoch': 9.8}


 98%|█████████▊| 2618/2670 [1:03:47<01:18,  1.50s/it]

{'loss': 0.0023, 'grad_norm': 0.03581766039133072, 'learning_rate': 3.902439024390244e-06, 'epoch': 9.81}


 98%|█████████▊| 2619/2670 [1:03:49<01:14,  1.46s/it]

{'loss': 0.0069, 'grad_norm': 0.08223654329776764, 'learning_rate': 3.827392120075047e-06, 'epoch': 9.81}


 98%|█████████▊| 2620/2670 [1:03:50<01:13,  1.48s/it]

{'loss': 0.0026, 'grad_norm': 0.041552189737558365, 'learning_rate': 3.7523452157598502e-06, 'epoch': 9.81}


 98%|█████████▊| 2621/2670 [1:03:52<01:12,  1.48s/it]

{'loss': 0.0033, 'grad_norm': 0.05665724352002144, 'learning_rate': 3.677298311444653e-06, 'epoch': 9.82}


 98%|█████████▊| 2622/2670 [1:03:53<01:10,  1.47s/it]

{'loss': 0.0038, 'grad_norm': 0.04210028052330017, 'learning_rate': 3.602251407129456e-06, 'epoch': 9.82}


 98%|█████████▊| 2623/2670 [1:03:55<01:08,  1.47s/it]

{'loss': 0.0024, 'grad_norm': 0.03623006120324135, 'learning_rate': 3.527204502814259e-06, 'epoch': 9.82}


 98%|█████████▊| 2624/2670 [1:03:56<01:09,  1.51s/it]

{'loss': 0.0036, 'grad_norm': 0.048509031534194946, 'learning_rate': 3.4521575984990624e-06, 'epoch': 9.83}


 98%|█████████▊| 2625/2670 [1:03:58<01:06,  1.47s/it]

{'loss': 0.0034, 'grad_norm': 0.05268669128417969, 'learning_rate': 3.3771106941838653e-06, 'epoch': 9.83}


 98%|█████████▊| 2626/2670 [1:03:59<01:05,  1.49s/it]

{'loss': 0.0026, 'grad_norm': 0.03432149440050125, 'learning_rate': 3.3020637898686682e-06, 'epoch': 9.84}


 98%|█████████▊| 2627/2670 [1:04:00<01:00,  1.40s/it]

{'loss': 0.0049, 'grad_norm': 0.06877008080482483, 'learning_rate': 3.2270168855534707e-06, 'epoch': 9.84}


 98%|█████████▊| 2628/2670 [1:04:02<01:01,  1.45s/it]

{'loss': 0.0026, 'grad_norm': 0.04291698336601257, 'learning_rate': 3.1519699812382745e-06, 'epoch': 9.84}


 98%|█████████▊| 2629/2670 [1:04:04<01:00,  1.48s/it]

{'loss': 0.0027, 'grad_norm': 0.04082442447543144, 'learning_rate': 3.0769230769230774e-06, 'epoch': 9.85}


 99%|█████████▊| 2630/2670 [1:04:05<01:01,  1.53s/it]

{'loss': 0.0046, 'grad_norm': 0.06482123583555222, 'learning_rate': 3.00187617260788e-06, 'epoch': 9.85}


 99%|█████████▊| 2631/2670 [1:04:07<01:01,  1.58s/it]

{'loss': 0.0033, 'grad_norm': 0.05429115146398544, 'learning_rate': 2.9268292682926833e-06, 'epoch': 9.85}


 99%|█████████▊| 2632/2670 [1:04:08<00:59,  1.56s/it]

{'loss': 0.0025, 'grad_norm': 0.034052371978759766, 'learning_rate': 2.8517823639774862e-06, 'epoch': 9.86}


 99%|█████████▊| 2633/2670 [1:04:10<00:58,  1.57s/it]

{'loss': 0.003, 'grad_norm': 0.04832801595330238, 'learning_rate': 2.776735459662289e-06, 'epoch': 9.86}


 99%|█████████▊| 2634/2670 [1:04:11<00:55,  1.54s/it]

{'loss': 0.0018, 'grad_norm': 0.03695274144411087, 'learning_rate': 2.701688555347092e-06, 'epoch': 9.87}


 99%|█████████▊| 2635/2670 [1:04:13<00:49,  1.41s/it]

{'loss': 0.0038, 'grad_norm': 0.062156736850738525, 'learning_rate': 2.6266416510318954e-06, 'epoch': 9.87}


 99%|█████████▊| 2636/2670 [1:04:14<00:42,  1.26s/it]

{'loss': 0.004, 'grad_norm': 0.05776910483837128, 'learning_rate': 2.551594746716698e-06, 'epoch': 9.87}


 99%|█████████▉| 2637/2670 [1:04:15<00:42,  1.27s/it]

{'loss': 0.0043, 'grad_norm': 0.051396694034338, 'learning_rate': 2.4765478424015013e-06, 'epoch': 9.88}


 99%|█████████▉| 2638/2670 [1:04:16<00:43,  1.37s/it]

{'loss': 0.0031, 'grad_norm': 0.05212782323360443, 'learning_rate': 2.401500938086304e-06, 'epoch': 9.88}


 99%|█████████▉| 2639/2670 [1:04:18<00:43,  1.42s/it]

{'loss': 0.0038, 'grad_norm': 0.049770694226026535, 'learning_rate': 2.326454033771107e-06, 'epoch': 9.88}


 99%|█████████▉| 2640/2670 [1:04:20<00:45,  1.51s/it]

{'loss': 0.0027, 'grad_norm': 0.047054193913936615, 'learning_rate': 2.25140712945591e-06, 'epoch': 9.89}


 99%|█████████▉| 2641/2670 [1:04:21<00:45,  1.58s/it]

{'loss': 0.0023, 'grad_norm': 0.03530747815966606, 'learning_rate': 2.1763602251407134e-06, 'epoch': 9.89}


 99%|█████████▉| 2642/2670 [1:04:23<00:46,  1.64s/it]

{'loss': 0.0023, 'grad_norm': 0.03807222470641136, 'learning_rate': 2.101313320825516e-06, 'epoch': 9.9}


 99%|█████████▉| 2643/2670 [1:04:25<00:42,  1.58s/it]

{'loss': 0.0043, 'grad_norm': 0.07891492545604706, 'learning_rate': 2.0262664165103193e-06, 'epoch': 9.9}


 99%|█████████▉| 2644/2670 [1:04:26<00:41,  1.61s/it]

{'loss': 0.0035, 'grad_norm': 0.0431559756398201, 'learning_rate': 1.951219512195122e-06, 'epoch': 9.9}


 99%|█████████▉| 2645/2670 [1:04:28<00:39,  1.58s/it]

{'loss': 0.0049, 'grad_norm': 0.07017149776220322, 'learning_rate': 1.8761726078799251e-06, 'epoch': 9.91}


 99%|█████████▉| 2646/2670 [1:04:29<00:37,  1.54s/it]

{'loss': 0.003, 'grad_norm': 0.0599701926112175, 'learning_rate': 1.801125703564728e-06, 'epoch': 9.91}


 99%|█████████▉| 2647/2670 [1:04:31<00:34,  1.49s/it]

{'loss': 0.0069, 'grad_norm': 0.2083398401737213, 'learning_rate': 1.7260787992495312e-06, 'epoch': 9.91}


 99%|█████████▉| 2648/2670 [1:04:32<00:32,  1.46s/it]

{'loss': 0.0058, 'grad_norm': 0.08112846314907074, 'learning_rate': 1.6510318949343341e-06, 'epoch': 9.92}


 99%|█████████▉| 2649/2670 [1:04:34<00:31,  1.49s/it]

{'loss': 0.0027, 'grad_norm': 0.04987359791994095, 'learning_rate': 1.5759849906191373e-06, 'epoch': 9.92}


 99%|█████████▉| 2650/2670 [1:04:35<00:29,  1.48s/it]

{'loss': 0.0026, 'grad_norm': 0.044389840215444565, 'learning_rate': 1.50093808630394e-06, 'epoch': 9.93}


 99%|█████████▉| 2651/2670 [1:04:36<00:27,  1.47s/it]

{'loss': 0.0031, 'grad_norm': 0.04559331387281418, 'learning_rate': 1.4258911819887431e-06, 'epoch': 9.93}


 99%|█████████▉| 2652/2670 [1:04:38<00:26,  1.46s/it]

{'loss': 0.0054, 'grad_norm': 0.15746666491031647, 'learning_rate': 1.350844277673546e-06, 'epoch': 9.93}


 99%|█████████▉| 2653/2670 [1:04:39<00:24,  1.45s/it]

{'loss': 0.0033, 'grad_norm': 0.04405747726559639, 'learning_rate': 1.275797373358349e-06, 'epoch': 9.94}


 99%|█████████▉| 2654/2670 [1:04:41<00:24,  1.53s/it]

{'loss': 0.0041, 'grad_norm': 0.06464705616235733, 'learning_rate': 1.200750469043152e-06, 'epoch': 9.94}


 99%|█████████▉| 2655/2670 [1:04:43<00:22,  1.51s/it]

{'loss': 0.0038, 'grad_norm': 0.042754992842674255, 'learning_rate': 1.125703564727955e-06, 'epoch': 9.94}


 99%|█████████▉| 2656/2670 [1:04:44<00:20,  1.47s/it]

{'loss': 0.0043, 'grad_norm': 0.056037914007902145, 'learning_rate': 1.050656660412758e-06, 'epoch': 9.95}


100%|█████████▉| 2657/2670 [1:04:45<00:19,  1.46s/it]

{'loss': 0.0043, 'grad_norm': 0.04030371457338333, 'learning_rate': 9.75609756097561e-07, 'epoch': 9.95}


100%|█████████▉| 2658/2670 [1:04:47<00:18,  1.53s/it]

{'loss': 0.005, 'grad_norm': 0.07162667065858841, 'learning_rate': 9.00562851782364e-07, 'epoch': 9.96}


100%|█████████▉| 2659/2670 [1:04:48<00:16,  1.50s/it]

{'loss': 0.0051, 'grad_norm': 0.06586946547031403, 'learning_rate': 8.255159474671671e-07, 'epoch': 9.96}


100%|█████████▉| 2660/2670 [1:04:50<00:15,  1.54s/it]

{'loss': 0.0058, 'grad_norm': 0.2307555228471756, 'learning_rate': 7.5046904315197e-07, 'epoch': 9.96}


100%|█████████▉| 2661/2670 [1:04:51<00:13,  1.48s/it]

{'loss': 0.0029, 'grad_norm': 0.04216348007321358, 'learning_rate': 6.75422138836773e-07, 'epoch': 9.97}


100%|█████████▉| 2662/2670 [1:04:53<00:11,  1.42s/it]

{'loss': 0.0036, 'grad_norm': 0.06057685613632202, 'learning_rate': 6.00375234521576e-07, 'epoch': 9.97}


100%|█████████▉| 2663/2670 [1:04:55<00:10,  1.54s/it]

{'loss': 0.0043, 'grad_norm': 0.06387597322463989, 'learning_rate': 5.25328330206379e-07, 'epoch': 9.97}


100%|█████████▉| 2664/2670 [1:04:56<00:09,  1.58s/it]

{'loss': 0.0041, 'grad_norm': 0.061511460691690445, 'learning_rate': 4.50281425891182e-07, 'epoch': 9.98}


100%|█████████▉| 2665/2670 [1:04:58<00:07,  1.53s/it]

{'loss': 0.0033, 'grad_norm': 0.04969601705670357, 'learning_rate': 3.75234521575985e-07, 'epoch': 9.98}


100%|█████████▉| 2666/2670 [1:04:59<00:06,  1.51s/it]

{'loss': 0.0083, 'grad_norm': 0.08040345460176468, 'learning_rate': 3.00187617260788e-07, 'epoch': 9.99}


100%|█████████▉| 2667/2670 [1:05:01<00:04,  1.53s/it]

{'loss': 0.003, 'grad_norm': 0.05069871246814728, 'learning_rate': 2.25140712945591e-07, 'epoch': 9.99}


100%|█████████▉| 2668/2670 [1:05:02<00:03,  1.53s/it]

{'loss': 0.0045, 'grad_norm': 0.057540714740753174, 'learning_rate': 1.50093808630394e-07, 'epoch': 9.99}


100%|█████████▉| 2669/2670 [1:05:04<00:01,  1.53s/it]

{'loss': 0.0043, 'grad_norm': 0.04822973906993866, 'learning_rate': 7.5046904315197e-08, 'epoch': 10.0}


100%|██████████| 2670/2670 [1:05:06<00:00,  1.62s/it]

{'loss': 0.0022, 'grad_norm': 0.034792423248291016, 'learning_rate': 0.0, 'epoch': 10.0}


100%|██████████| 2670/2670 [1:05:07<00:00,  1.46s/it]

{'train_runtime': 3907.5504, 'train_samples_per_second': 2.733, 'train_steps_per_second': 0.683, 'train_loss': 0.18064196587035306, 'epoch': 10.0}
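
The `learning_rate` values logged above drop by a constant amount each step and reach 0.0 at step 2670, which is consistent with a linear schedule preceded by a short warmup. The sketch below reproduces those values in plain Python. The peak rate of 2e-4 and the 5 warmup steps are assumptions inferred from the logged numbers, not settings read from the trainer configuration, and `linear_lr` is a hypothetical helper name.

```python
def linear_lr(step, peak_lr=2e-4, warmup_steps=5, total_steps=2670):
    """Learning rate after `step` optimizer steps: linear warmup, then linear decay.

    peak_lr and warmup_steps are assumed values inferred from the training logs.
    """
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay: fall linearly from peak_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Compare a few steps against the logged learning rates.
for step in (1, 2524, 2669, 2670):
    print(step, linear_lr(step))
```

If the printed values match the log at the same steps, the assumed schedule parameters are at least consistent with this run.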





In [22]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

2815.5718 seconds used for training.
46.93 minutes used for training.
Peak reserved memory = 5.35 GB.
Peak reserved memory for training = 2.075 GB.
Peak reserved memory % of max memory = 45.52 %.
Peak reserved memory for training % of max memory = 17.655 %.


In [37]:
import pandas as pd
import matplotlib.pyplot as plt
training_df = pd.DataFrame(trainer.state.log_history)

In [38]:
training_df.head(2)

Unnamed: 0,loss,grad_norm,learning_rate,epoch,step,train_runtime,train_samples_per_second,train_steps_per_second,total_flos,train_loss
0,1.8814,1.499934,4e-05,0.003745,1,,,,,
1,1.9541,4.33538,8e-05,0.007491,2,,,,,
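
Most columns in the rows above are empty because `trainer.state.log_history` is a list of dictionaries with two kinds of entries: one per logging step (loss, gradient norm, learning rate) and a final summary entry (runtime, throughput, mean training loss). A minimal sketch of separating the two before plotting, using values copied from the output in this notebook; `split_log_history` is a hypothetical helper, and `total_flos` is left out because its value is not shown.

```python
def split_log_history(log_history):
    """Split Trainer log entries into per-step records and summary records."""
    step_rows = [row for row in log_history if "loss" in row]
    summary_rows = [row for row in log_history if "train_runtime" in row]
    return step_rows, summary_rows


# Two per-step entries and the final summary entry, as logged in this run.
log_history = [
    {"loss": 1.8814, "grad_norm": 1.499934, "learning_rate": 4e-05,
     "epoch": 0.003745, "step": 1},
    {"loss": 1.9541, "grad_norm": 4.33538, "learning_rate": 8e-05,
     "epoch": 0.007491, "step": 2},
    {"train_runtime": 3907.5504, "train_samples_per_second": 2.733,
     "train_steps_per_second": 0.683, "train_loss": 0.18064196587035306,
     "epoch": 10.0, "step": 2670},
]

step_rows, summary_rows = split_log_history(log_history)
print(len(step_rows), "step records;", len(summary_rows), "summary record")
```

Filtering on the `loss` key keeps the summary entry out of the loss curve, since that entry only carries `train_loss`.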


In [39]:
from helpers import create_training_plots

In [26]:
fig = create_training_plots(training_df)
fig.show()

In [41]:
training_df.to_csv(f"training_logs/{OUTPUT_MODEL_NAME}.csv", index = False)

## Run Inference

In [42]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

def get_response(user_query):
    # Wrap the query as a single-turn chat and apply the chat template.
    messages = [
        {"role": "user", "content": user_query},
    ]
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize = True,
        add_generation_prompt = True, # Must add for generation
        return_tensors = "pt",
    ).to("cuda")

    # Generate at most 64 new tokens, so long answers are cut off mid-sentence.
    outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True,
                             temperature = 1.5, min_p = 0.1)
    return tokenizer.batch_decode(outputs)
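
The `temperature = 1.5, min_p = 0.1` pair in the cell above flattens the next-token distribution and then discards tokens that are much less likely than the top candidate. The following is a minimal pure-Python sketch of that idea, not the `transformers` implementation: it assumes temperature scaling is applied before the min-p cutoff, and `min_p_filter` and the toy logits are hypothetical.

```python
import math


def min_p_filter(logits, temperature=1.5, min_p=0.1):
    """Return renormalized probabilities after temperature scaling and min-p filtering."""
    # Temperature > 1 flattens the distribution.
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Drop every token whose probability is below min_p times the top probability.
    cutoff = min_p * max(probs)
    kept = [p if p >= cutoff else 0.0 for p in probs]
    kept_total = sum(kept)
    return [p / kept_total for p in kept]


toy_logits = [2.0, 1.0, -3.0]  # hypothetical logits for a three-token vocabulary
filtered = min_p_filter(toy_logits)
print(filtered)
```

With these toy logits the third token falls below the cutoff and receives zero probability, while the first two are renormalized to sum to 1.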

In [43]:
dataset_finetune['question'][0]

'Where are the CLS tokens added to the augmented input tensor x[+]1:T +1, and what is their role before and after processing through the Transformer?'

TODO: investigate how rephrasing the question affects the responses.

In [44]:
resp = get_response(dataset_finetune['question'][0])
print(resp[0].split("<|start_header_id|>assistant<|end_header_id|>")[1])

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.




The CLS tokens are added as an additional component (the \((T+1)\)-st component) to the augmented input tensor \(x^{+}_{1:T+1}\), which has a shape of \([T + 1] \times 2b\). Specifically, the augmented input tensor is
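
Splitting on the assistant header and indexing `[1]`, as in the cell above, keeps any trailing end-of-turn marker in the text and raises an `IndexError` if the header is missing. A slightly more defensive sketch follows; `extract_assistant_reply` is a hypothetical helper, and the `<|eot_id|>` marker is an assumption based on the Llama 3 chat format.

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"
END_OF_TURN = "<|eot_id|>"  # assumed end-of-turn marker


def extract_assistant_reply(decoded):
    """Return the assistant's text from a decoded chat string."""
    _, found, reply = decoded.partition(ASSISTANT_HEADER)
    if not found:
        # No assistant header: fall back to the whole string.
        return decoded.strip()
    # Keep only the text before the end-of-turn marker, if one is present.
    return reply.split(END_OF_TURN)[0].strip()


sample = (
    "<|start_header_id|>user<|end_header_id|>\n\nHi<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\nHello there.<|eot_id|>"
)
print(extract_assistant_reply(sample))
```

In the notebook this would be called as `extract_assistant_reply(resp[0])`.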


## Save to HF

In [45]:
print(f"Model dtype: {next(model.parameters()).dtype}")


Model dtype: torch.bfloat16


In [None]:
model.push_to_hub_gguf(
    f"CPSC532/{OUTPUT_MODEL_NAME}",
    tokenizer,
    # quantization_method = ["q4_k_m", "q8_0", "q5_k_m"],
    quantization_method = ["not_quantized"], # save original precision (bfloat16)
    token = HF_TOKEN,
)

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 39.33 out of 62.67 RAM for saving.


100%|██████████| 28/28 [00:00<00:00, 87.94it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Done.


Unsloth: Converting llama model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits will take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['bf16'] will take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: [0] Installing llama.cpp. This will take 3 minutes...
Unsloth: [1] Converting model at CPSC532/2024NOV6_qwen_2_5_model into bf16 GGUF format.
The output location will be /home/owen/Desktop/github/532/implementation/finetuning/CPSC532/2024NOV6_qwen_2_5_model/unsloth.BF16.gguf
This will take 3 minutes...
INFO:hf-to-gguf:Loading model: 2024NOV6_qwen_2_5_model
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,           torch.float32 --> F32, shape = {64}
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: l

unsloth.BF16.gguf: 100%|██████████| 6.43G/6.43G [04:24<00:00, 24.3MB/s]


Saved GGUF to https://huggingface.co/CPSC532/2024NOV6_qwen_2_5_model


No files have been modified since last commit. Skipping to prevent empty commit.


Saved Ollama Modelfile to https://huggingface.co/CPSC532/2024NOV6_qwen_2_5_model
