To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Join Discord if you need help + support us if you can!
</div>

To install Unsloth on your own computer, follow the installation instructions on our GitHub page [here](https://github.com/unslothai/unsloth#installation-instructions---conda).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), and [how to save it](#Save) (e.g. for llama.cpp).

See our [blog post](https://unsloth.ai/blog/gemma) for how we made **Gemma 7b 2.5x faster** and **Gemma 2x faster**!

In [1]:
# %%capture
import torch
import os

os.environ["WANDB_PROJECT"] = "gemma_2b_test"
os.environ["WANDB_LOG_MODEL"] = "checkpoint"
# major_version, minor_version = torch.cuda.get_device_capability()
# if major_version >= 8:
#     # Use this for new GPUs like Ampere, Hopper GPUs (RTX 30xx, RTX 40xx, A100, H100, L40)
#     !pip install "unsloth[colab-ampere] @ git+https://github.com/unslothai/unsloth.git"
# else:
#     # Use this for older GPUs (V100, Tesla T4, RTX 20xx)
#     !pip install "unsloth[colab] @ git+https://github.com/unslothai/unsloth.git"
# pass

* We support Llama, Mistral, CodeLlama, TinyLlama, Vicuna, Open Hermes, etc.
* We also support Yi, Qwen ([llamafied](https://huggingface.co/models?sort=trending&search=qwen+llama)), Deepseek, and all Llama- and Mistral-derived architectures.
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* [**NEW**] With [PR 26037](https://github.com/huggingface/transformers/pull/26037), we support downloading 4bit models **4x faster**! [Our repo](https://huggingface.co/unsloth) has Llama, Mistral 4bit models.
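
The RoPE scaling bullet can be illustrated with a small sketch. It assumes the linear position-interpolation variant, in which position indices are multiplied by `trained_len / target_len` before the rotary angles are computed. The function name, `head_dim`, `base`, and `trained_len` below are illustrative choices, not Unsloth's internals.

```python
import math

def rope_angles(position, head_dim=8, base=10000.0, scale=1.0):
    """Rotary angles for one position; scale < 1 compresses positions (linear interpolation)."""
    inv_freq = [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]
    return [position * scale * f for f in inv_freq]

trained_len = 2048   # hypothetical context length seen during pretraining
target_len = 8192    # the max_seq_length requested in this notebook
scale = trained_len / target_len

# With scaling, the last position of the longer window gets the same angles
# as the fractional position 2047.75, which lies inside the trained range.
scaled = rope_angles(target_len - 1, scale=scale)
unscaled = rope_angles((target_len - 1) * scale)
assert all(math.isclose(a, b) for a, b in zip(scaled, unscaled))
print(scale, scaled[0])
```

The design trade-off in this variant is that neighbouring tokens end up closer together in angle space, which is why some fine-tuning at the longer length is usually expected.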

In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 8192 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
    "unsloth/llama-2-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",
    "unsloth/gemma-7b-it-bnb-4bit", # Instruct version of Gemma 7b
    "unsloth/gemma-2b-bnb-4bit",
    "unsloth/gemma-2b-it-bnb-4bit", # Instruct version of Gemma 2b
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gemma-2b-it-bnb-4bit", # Choose ANY! eg teknium/OpenHermes-2.5-Mistral-7B
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)



==((====))==  Unsloth: Fast Gemma patching release 2024.3
   \\   /|    GPU: NVIDIA GeForce RTX 4090. Max memory: 23.647 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.1+cu121. CUDA = 8.9. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. Xformers = 0.0.24. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth




We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = True,
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.3 patched 18 layers with 18 QKV layers, 18 O layers and 18 MLP layers.
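
To see where a trainable-parameter count comes from, here is a rough sketch that counts the LoRA weights for the seven `target_modules` above. The layer shapes are assumptions taken from the published Gemma 2b configuration (they are not stated in this notebook); only `r = 16` and the 18 layers come from the cells above.

```python
def lora_params(in_features, out_features, r):
    # A LoRA adapter adds two matrices: A (r x in_features) and B (out_features x r).
    return r * (in_features + out_features)

# Assumed Gemma 2b shapes (not stated in this notebook).
hidden, intermediate = 2048, 16384
num_heads, num_kv_heads, head_dim = 8, 1, 256
num_layers, r = 18, 16

q_out = num_heads * head_dim      # width of the query projection
kv_out = num_kv_heads * head_dim  # width of the key/value projections

shapes = {
    "q_proj": (hidden, q_out),
    "k_proj": (hidden, kv_out),
    "v_proj": (hidden, kv_out),
    "o_proj": (q_out, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}
per_layer = sum(lora_params(i, o, r) for i, o in shapes.values())
total = per_layer * num_layers
print(f"{total:,}")  # 19,611,648
```

Under those assumed shapes the total agrees with the `Number of trainable parameters` line in the training log of this notebook.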


<a name="Data"></a>
### Data Prep
We now use the Alpaca dataset from [yahma](https://huggingface.co/datasets/yahma/alpaca-cleaned), which is a filtered version of the original 52K-example [Alpaca dataset](https://crfm.stanford.edu/2023/03/13/alpaca.html). You can replace this code section with your own data prep.

**[NOTE]** To train only on completions (ignoring the user's input) read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).

**[NOTE]** Remember to add the **EOS_TOKEN** to the tokenized output!! Otherwise you'll get infinite generations!

If you want to use the `ChatML` template for ShareGPT datasets, try our conversational [notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing).

For text completions like novel writing, try this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing).

In [4]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass

from datasets import load_dataset
dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)
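
As a quick sanity check of the formatting step, the sketch below runs the same idea on two made-up rows without loading a tokenizer or dataset. The `"<eos>"` string is a stand-in for `tokenizer.eos_token`, and `format_batch` and the sample rows are hypothetical names and data used only for illustration.

```python
ALPACA_PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = "<eos>"  # stand-in; the notebook reads this from tokenizer.eos_token

def format_batch(examples):
    # With batched=True, each column arrives as a list, so rows are zipped back together.
    texts = [
        ALPACA_PROMPT.format(instruction, context, output) + EOS_TOKEN
        for instruction, context, output in zip(
            examples["instruction"], examples["input"], examples["output"]
        )
    ]
    return {"text": texts}

batch = {  # made-up rows in the same column layout as the dataset
    "instruction": ["Add the numbers.", "Name a primary color."],
    "input": ["2 and 3", ""],
    "output": ["5", "Red"],
}
formatted = format_batch(batch)

# Every training text should end with the EOS marker so generation learns to stop.
assert all(text.endswith(EOS_TOKEN) for text in formatted["text"])
print(repr(formatted["text"][0][-25:]))
```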

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). The cell below trains for one full epoch (`num_train_epochs = 1`) with `max_steps` commented out; for a quick test run, set `max_steps` (for example, 60) instead. We also support TRL's `DPOTrainer`!
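
The relationship between epochs and optimizer steps can be sketched with simple arithmetic. The example uses the values reported in this notebook's training log (51,760 examples, a per-device batch size of 4, 4 gradient accumulation steps, 1 GPU); the helper name is hypothetical, and the rounding of a final partial batch is simplified compared with what the trainer actually does.

```python
import math

def optimizer_steps(num_examples, per_device_batch, grad_accum, num_gpus=1, epochs=1):
    # One optimizer update consumes per_device_batch * grad_accum * num_gpus examples.
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs

effective_batch, total_steps = optimizer_steps(51_760, 4, 4)
print(effective_batch, total_steps)  # 16 3235

# A short 60-step test run would only see a small slice of the data.
print(60 * effective_batch)  # 960
```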

In [5]:
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2, # Note: the logged run below reports a per-device batch size of 4.
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # max_steps = 240,
        num_train_epochs = 1,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "wandb",
    ),
)
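
The `warmup_steps = 5` and `lr_scheduler_type = "linear"` settings can be read as a linear warmup followed by a linear decay to zero. The sketch below is a plain-Python approximation of that schedule, not the library's own code; the function name is hypothetical and `total_steps = 3235` is taken from the training log of this notebook.

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=3235):
    """Learning rate after `step` optimizer updates: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        factor = step / warmup_steps
    else:
        factor = max(0.0, (total_steps - step) / (total_steps - warmup_steps))
    return base_lr * factor

# Warmup (steps 1 to 4), peak (step 5), start of decay (step 6), and the final step.
for step in (1, 5, 6, 3235):
    print(step, linear_schedule_lr(step))
```

Up to floating-point rounding, the values for steps 1, 5, and 6 line up with the `learning_rate` entries in the log (4e-05, 0.0002, and roughly 0.00019993808).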

In [6]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA GeForce RTX 4090. Max memory = 23.647 GB.
2.219 GB of memory reserved.


In [7]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 51,760 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 4 | Gradient Accumulation steps = 4
\        /    Total batch size = 16 | Total steps = 3,235
 "-____-"     Number of trainable parameters = 19,611,648
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: yashas-salankimatt (yashas-personal). Use `wandb login --relogin` to force relogin


  0%|          | 1/3235 [00:01<1:13:18,  1.36s/it]

{'loss': 2.7577, 'grad_norm': 2.042825698852539, 'learning_rate': 4e-05, 'epoch': 0.0}


  0%|          | 2/3235 [00:02<1:01:07,  1.13s/it]

{'loss': 2.6127, 'grad_norm': 1.7111485004425049, 'learning_rate': 8e-05, 'epoch': 0.0}


  0%|          | 3/3235 [00:03<59:47,  1.11s/it]  

{'loss': 2.6039, 'grad_norm': 1.702540636062622, 'learning_rate': 0.00012, 'epoch': 0.0}


  0%|          | 4/3235 [00:04<1:00:12,  1.12s/it]

{'loss': 2.2959, 'grad_norm': 1.611228346824646, 'learning_rate': 0.00016, 'epoch': 0.0}


  0%|          | 5/3235 [00:05<56:24,  1.05s/it]  

{'loss': 2.2904, 'grad_norm': 2.1859259605407715, 'learning_rate': 0.0002, 'epoch': 0.0}


  0%|          | 6/3235 [00:06<1:05:06,  1.21s/it]

{'loss': 1.8072, 'grad_norm': 2.0116045475006104, 'learning_rate': 0.00019993808049535605, 'epoch': 0.0}


  0%|          | 7/3235 [00:08<1:02:22,  1.16s/it]

{'loss': 1.6574, 'grad_norm': 2.761702537536621, 'learning_rate': 0.0001998761609907121, 'epoch': 0.0}


  0%|          | 8/3235 [00:09<1:00:21,  1.12s/it]

{'loss': 1.4487, 'grad_norm': 3.828615427017212, 'learning_rate': 0.0001998142414860681, 'epoch': 0.0}


  0%|          | 9/3235 [00:10<1:03:07,  1.17s/it]

{'loss': 1.5813, 'grad_norm': 1.1483932733535767, 'learning_rate': 0.00019975232198142414, 'epoch': 0.0}


  0%|          | 10/3235 [00:11<1:04:14,  1.20s/it]

{'loss': 1.3505, 'grad_norm': 1.2917784452438354, 'learning_rate': 0.00019969040247678018, 'epoch': 0.0}


  0%|          | 11/3235 [00:12<57:53,  1.08s/it]  

{'loss': 1.3504, 'grad_norm': 1.8911992311477661, 'learning_rate': 0.00019962848297213622, 'epoch': 0.0}


  0%|          | 12/3235 [00:13<56:45,  1.06s/it]

{'loss': 1.3668, 'grad_norm': 0.8988338112831116, 'learning_rate': 0.00019956656346749226, 'epoch': 0.0}


  0%|          | 13/3235 [00:14<55:41,  1.04s/it]

{'loss': 1.3944, 'grad_norm': 0.9187014698982239, 'learning_rate': 0.0001995046439628483, 'epoch': 0.0}


  0%|          | 14/3235 [00:15<1:00:00,  1.12s/it]

{'loss': 1.2908, 'grad_norm': 0.6273434162139893, 'learning_rate': 0.00019944272445820434, 'epoch': 0.0}


  0%|          | 15/3235 [00:16<57:41,  1.07s/it]  

{'loss': 1.1602, 'grad_norm': 0.6584575176239014, 'learning_rate': 0.00019938080495356038, 'epoch': 0.0}


  0%|          | 16/3235 [00:18<1:02:04,  1.16s/it]

{'loss': 1.162, 'grad_norm': 0.5049461722373962, 'learning_rate': 0.00019931888544891642, 'epoch': 0.0}


  1%|          | 17/3235 [00:18<56:42,  1.06s/it]  

{'loss': 1.1594, 'grad_norm': 0.8495122194290161, 'learning_rate': 0.00019925696594427246, 'epoch': 0.01}


  1%|          | 18/3235 [00:20<58:40,  1.09s/it]

{'loss': 1.1759, 'grad_norm': 0.6134774684906006, 'learning_rate': 0.0001991950464396285, 'epoch': 0.01}


  1%|          | 19/3235 [00:21<59:49,  1.12s/it]

{'loss': 1.059, 'grad_norm': 0.7079939842224121, 'learning_rate': 0.00019913312693498454, 'epoch': 0.01}


  1%|          | 20/3235 [00:22<1:00:47,  1.13s/it]

{'loss': 1.3848, 'grad_norm': 0.6738852262496948, 'learning_rate': 0.00019907120743034056, 'epoch': 0.01}


  1%|          | 21/3235 [00:23<58:36,  1.09s/it]  

{'loss': 1.1985, 'grad_norm': 0.8572585582733154, 'learning_rate': 0.0001990092879256966, 'epoch': 0.01}


  1%|          | 22/3235 [00:24<57:09,  1.07s/it]

{'loss': 1.1111, 'grad_norm': 0.882152259349823, 'learning_rate': 0.00019894736842105264, 'epoch': 0.01}


  1%|          | 23/3235 [00:25<58:57,  1.10s/it]

{'loss': 1.1463, 'grad_norm': 0.7175677418708801, 'learning_rate': 0.00019888544891640868, 'epoch': 0.01}


  1%|          | 24/3235 [00:26<56:34,  1.06s/it]

{'loss': 1.2376, 'grad_norm': 0.5956952571868896, 'learning_rate': 0.00019882352941176472, 'epoch': 0.01}


  1%|          | 25/3235 [00:27<1:00:43,  1.14s/it]

{'loss': 1.2129, 'grad_norm': 0.4650154709815979, 'learning_rate': 0.00019876160990712076, 'epoch': 0.01}


  1%|          | 26/3235 [00:28<56:49,  1.06s/it]  

{'loss': 1.1786, 'grad_norm': 0.534343421459198, 'learning_rate': 0.00019869969040247677, 'epoch': 0.01}


  1%|          | 27/3235 [00:29<54:02,  1.01s/it]

{'loss': 1.2986, 'grad_norm': 0.5067979693412781, 'learning_rate': 0.0001986377708978328, 'epoch': 0.01}


  1%|          | 28/3235 [00:30<56:24,  1.06s/it]

{'loss': 1.1435, 'grad_norm': 0.43570560216903687, 'learning_rate': 0.00019857585139318885, 'epoch': 0.01}


  1%|          | 29/3235 [00:31<55:36,  1.04s/it]

{'loss': 1.0374, 'grad_norm': 0.434885174036026, 'learning_rate': 0.0001985139318885449, 'epoch': 0.01}


  1%|          | 30/3235 [00:33<59:13,  1.11s/it]

{'loss': 1.0791, 'grad_norm': 0.43702346086502075, 'learning_rate': 0.00019845201238390096, 'epoch': 0.01}


  1%|          | 31/3235 [00:34<58:30,  1.10s/it]

{'loss': 1.1298, 'grad_norm': 0.45557084679603577, 'learning_rate': 0.000198390092879257, 'epoch': 0.01}


  1%|          | 32/3235 [00:35<56:07,  1.05s/it]

{'loss': 1.1379, 'grad_norm': 0.4742666780948639, 'learning_rate': 0.000198328173374613, 'epoch': 0.01}


  1%|          | 33/3235 [00:36<1:02:37,  1.17s/it]

{'loss': 1.166, 'grad_norm': 0.42528513073921204, 'learning_rate': 0.00019826625386996905, 'epoch': 0.01}


  1%|          | 34/3235 [00:37<1:04:20,  1.21s/it]

{'loss': 1.2502, 'grad_norm': 0.45066821575164795, 'learning_rate': 0.0001982043343653251, 'epoch': 0.01}


  1%|          | 35/3235 [00:39<1:07:23,  1.26s/it]

{'loss': 1.1769, 'grad_norm': 0.3741401731967926, 'learning_rate': 0.00019814241486068113, 'epoch': 0.01}


  1%|          | 36/3235 [00:40<1:02:59,  1.18s/it]

{'loss': 1.2161, 'grad_norm': 0.49857982993125916, 'learning_rate': 0.00019808049535603717, 'epoch': 0.01}


  1%|          | 37/3235 [00:41<58:27,  1.10s/it]  

{'loss': 1.0873, 'grad_norm': 0.5266259908676147, 'learning_rate': 0.0001980185758513932, 'epoch': 0.01}


  1%|          | 38/3235 [00:42<57:44,  1.08s/it]

{'loss': 1.1678, 'grad_norm': 0.5943069458007812, 'learning_rate': 0.00019795665634674922, 'epoch': 0.01}


  1%|          | 39/3235 [00:43<1:06:19,  1.25s/it]

{'loss': 1.0328, 'grad_norm': 0.3751859962940216, 'learning_rate': 0.00019789473684210526, 'epoch': 0.01}


  1%|          | 40/3235 [00:44<1:01:57,  1.16s/it]

{'loss': 1.1795, 'grad_norm': 0.428859144449234, 'learning_rate': 0.0001978328173374613, 'epoch': 0.01}


  1%|▏         | 41/3235 [00:45<1:01:26,  1.15s/it]

{'loss': 0.9681, 'grad_norm': 0.4430111348628998, 'learning_rate': 0.00019777089783281734, 'epoch': 0.01}


  1%|▏         | 42/3235 [00:46<57:50,  1.09s/it]  

{'loss': 1.1416, 'grad_norm': 0.5288403630256653, 'learning_rate': 0.00019770897832817338, 'epoch': 0.01}


  1%|▏         | 43/3235 [00:48<59:12,  1.11s/it]

{'loss': 1.1976, 'grad_norm': 0.4824059009552002, 'learning_rate': 0.00019764705882352942, 'epoch': 0.01}


  1%|▏         | 44/3235 [00:49<1:00:37,  1.14s/it]

{'loss': 1.2964, 'grad_norm': 0.49149197340011597, 'learning_rate': 0.00019758513931888546, 'epoch': 0.01}


  1%|▏         | 45/3235 [00:50<58:15,  1.10s/it]  

{'loss': 1.2019, 'grad_norm': 0.4597718119621277, 'learning_rate': 0.0001975232198142415, 'epoch': 0.01}


  1%|▏         | 46/3235 [00:51<56:00,  1.05s/it]

{'loss': 1.353, 'grad_norm': 0.5180004835128784, 'learning_rate': 0.00019746130030959754, 'epoch': 0.01}


  1%|▏         | 47/3235 [00:52<53:26,  1.01s/it]

{'loss': 1.1653, 'grad_norm': 0.5029881000518799, 'learning_rate': 0.00019739938080495358, 'epoch': 0.01}


  1%|▏         | 48/3235 [00:53<54:46,  1.03s/it]

{'loss': 1.0956, 'grad_norm': 0.49980705976486206, 'learning_rate': 0.00019733746130030962, 'epoch': 0.01}


  2%|▏         | 49/3235 [00:54<52:07,  1.02it/s]

{'loss': 1.1338, 'grad_norm': 0.5938388109207153, 'learning_rate': 0.00019727554179566566, 'epoch': 0.02}


  2%|▏         | 50/3235 [00:55<55:58,  1.05s/it]

{'loss': 1.0887, 'grad_norm': 0.4904727637767792, 'learning_rate': 0.00019721362229102167, 'epoch': 0.02}


  2%|▏         | 51/3235 [00:56<59:12,  1.12s/it]

{'loss': 1.2396, 'grad_norm': 0.5216323137283325, 'learning_rate': 0.00019715170278637771, 'epoch': 0.02}


  2%|▏         | 52/3235 [00:57<59:42,  1.13s/it]

{'loss': 1.1423, 'grad_norm': 0.47425687313079834, 'learning_rate': 0.00019708978328173375, 'epoch': 0.02}


  2%|▏         | 53/3235 [00:58<1:01:20,  1.16s/it]

{'loss': 1.1101, 'grad_norm': 0.42081624269485474, 'learning_rate': 0.0001970278637770898, 'epoch': 0.02}


  2%|▏         | 54/3235 [01:00<1:05:01,  1.23s/it]

{'loss': 1.1762, 'grad_norm': 0.4037075340747833, 'learning_rate': 0.00019696594427244583, 'epoch': 0.02}


  2%|▏         | 55/3235 [01:01<1:01:14,  1.16s/it]

{'loss': 1.0037, 'grad_norm': 0.5066520571708679, 'learning_rate': 0.00019690402476780187, 'epoch': 0.02}


  2%|▏         | 56/3235 [01:02<56:47,  1.07s/it]  

{'loss': 0.986, 'grad_norm': 0.5049020051956177, 'learning_rate': 0.0001968421052631579, 'epoch': 0.02}


  2%|▏         | 57/3235 [01:03<56:49,  1.07s/it]

{'loss': 1.04, 'grad_norm': 0.4868570864200592, 'learning_rate': 0.00019678018575851393, 'epoch': 0.02}


  2%|▏         | 58/3235 [01:03<51:31,  1.03it/s]

{'loss': 0.9978, 'grad_norm': 0.5432862043380737, 'learning_rate': 0.00019671826625386997, 'epoch': 0.02}


  2%|▏         | 59/3235 [01:05<54:19,  1.03s/it]

{'loss': 1.1171, 'grad_norm': 0.4914666712284088, 'learning_rate': 0.000196656346749226, 'epoch': 0.02}


  2%|▏         | 60/3235 [01:06<53:55,  1.02s/it]

{'loss': 1.1739, 'grad_norm': 0.5072021484375, 'learning_rate': 0.00019659442724458205, 'epoch': 0.02}


  2%|▏         | 61/3235 [01:07<55:10,  1.04s/it]

{'loss': 0.965, 'grad_norm': 0.4402197003364563, 'learning_rate': 0.0001965325077399381, 'epoch': 0.02}


  2%|▏         | 62/3235 [01:08<59:14,  1.12s/it]

{'loss': 1.1315, 'grad_norm': 0.5006334781646729, 'learning_rate': 0.00019647058823529413, 'epoch': 0.02}


  2%|▏         | 63/3235 [01:09<58:38,  1.11s/it]

{'loss': 1.1038, 'grad_norm': 0.5063748955726624, 'learning_rate': 0.00019640866873065017, 'epoch': 0.02}


  2%|▏         | 64/3235 [01:10<1:01:36,  1.17s/it]

{'loss': 1.1482, 'grad_norm': 0.47498226165771484, 'learning_rate': 0.0001963467492260062, 'epoch': 0.02}


  2%|▏         | 65/3235 [01:12<1:01:00,  1.15s/it]

{'loss': 1.065, 'grad_norm': 0.4804336428642273, 'learning_rate': 0.00019628482972136225, 'epoch': 0.02}


  2%|▏         | 66/3235 [01:13<1:00:17,  1.14s/it]

{'loss': 1.0963, 'grad_norm': 0.5192240476608276, 'learning_rate': 0.0001962229102167183, 'epoch': 0.02}


  2%|▏         | 67/3235 [01:14<1:03:54,  1.21s/it]

{'loss': 1.1611, 'grad_norm': 0.46985307335853577, 'learning_rate': 0.00019616099071207433, 'epoch': 0.02}


  2%|▏         | 68/3235 [01:15<1:02:31,  1.18s/it]

{'loss': 1.0871, 'grad_norm': 0.5093206167221069, 'learning_rate': 0.00019609907120743034, 'epoch': 0.02}


  2%|▏         | 69/3235 [01:16<1:03:31,  1.20s/it]

{'loss': 1.1472, 'grad_norm': 0.43155714869499207, 'learning_rate': 0.00019603715170278638, 'epoch': 0.02}


  2%|▏         | 70/3235 [01:17<1:00:07,  1.14s/it]

{'loss': 0.9683, 'grad_norm': 0.47145527601242065, 'learning_rate': 0.00019597523219814242, 'epoch': 0.02}


  2%|▏         | 71/3235 [01:18<59:46,  1.13s/it]  

{'loss': 1.1414, 'grad_norm': 0.46166539192199707, 'learning_rate': 0.00019591331269349846, 'epoch': 0.02}


  2%|▏         | 72/3235 [01:19<54:06,  1.03s/it]

{'loss': 1.0632, 'grad_norm': 0.5409115552902222, 'learning_rate': 0.0001958513931888545, 'epoch': 0.02}


  2%|▏         | 73/3235 [01:21<59:09,  1.12s/it]

{'loss': 1.1666, 'grad_norm': 0.41641736030578613, 'learning_rate': 0.00019578947368421054, 'epoch': 0.02}


  2%|▏         | 74/3235 [01:22<57:31,  1.09s/it]

{'loss': 1.2636, 'grad_norm': 0.4698566496372223, 'learning_rate': 0.00019572755417956655, 'epoch': 0.02}


  2%|▏         | 75/3235 [01:23<55:13,  1.05s/it]

{'loss': 1.1783, 'grad_norm': 0.5197896957397461, 'learning_rate': 0.0001956656346749226, 'epoch': 0.02}


  2%|▏         | 76/3235 [01:24<54:49,  1.04s/it]

{'loss': 1.0388, 'grad_norm': 0.5142955183982849, 'learning_rate': 0.00019560371517027863, 'epoch': 0.02}


  2%|▏         | 77/3235 [01:24<48:43,  1.08it/s]

{'loss': 1.1325, 'grad_norm': 0.6262083053588867, 'learning_rate': 0.00019554179566563467, 'epoch': 0.02}


  2%|▏         | 78/3235 [01:25<53:29,  1.02s/it]

{'loss': 1.0875, 'grad_norm': 0.4518938362598419, 'learning_rate': 0.0001954798761609907, 'epoch': 0.02}


  2%|▏         | 79/3235 [01:26<52:59,  1.01s/it]

{'loss': 1.227, 'grad_norm': 0.48277968168258667, 'learning_rate': 0.00019541795665634678, 'epoch': 0.02}


  2%|▏         | 80/3235 [01:27<52:22,  1.00it/s]

{'loss': 1.0984, 'grad_norm': 0.5533376336097717, 'learning_rate': 0.0001953560371517028, 'epoch': 0.02}


  3%|▎         | 81/3235 [01:29<58:34,  1.11s/it]

{'loss': 1.1777, 'grad_norm': 0.4059504270553589, 'learning_rate': 0.00019529411764705883, 'epoch': 0.03}


  3%|▎         | 82/3235 [01:30<59:46,  1.14s/it]

{'loss': 1.1201, 'grad_norm': 0.49216359853744507, 'learning_rate': 0.00019523219814241487, 'epoch': 0.03}


  3%|▎         | 83/3235 [01:31<1:03:18,  1.21s/it]

{'loss': 1.074, 'grad_norm': 0.4154938757419586, 'learning_rate': 0.0001951702786377709, 'epoch': 0.03}


  3%|▎         | 84/3235 [01:32<55:09,  1.05s/it]  

{'loss': 1.0577, 'grad_norm': 0.5679575800895691, 'learning_rate': 0.00019510835913312695, 'epoch': 0.03}


  3%|▎         | 85/3235 [01:33<57:24,  1.09s/it]

{'loss': 1.0078, 'grad_norm': 0.5161795020103455, 'learning_rate': 0.000195046439628483, 'epoch': 0.03}


  3%|▎         | 86/3235 [01:34<57:38,  1.10s/it]

{'loss': 1.2446, 'grad_norm': 0.4712928533554077, 'learning_rate': 0.000194984520123839, 'epoch': 0.03}


  3%|▎         | 87/3235 [01:35<55:16,  1.05s/it]

{'loss': 1.1354, 'grad_norm': 0.5016033053398132, 'learning_rate': 0.00019492260061919505, 'epoch': 0.03}


  3%|▎         | 88/3235 [01:36<57:16,  1.09s/it]

{'loss': 1.0796, 'grad_norm': 0.5322299599647522, 'learning_rate': 0.0001948606811145511, 'epoch': 0.03}


  3%|▎         | 89/3235 [01:38<1:00:31,  1.15s/it]

{'loss': 1.1591, 'grad_norm': 0.4458385705947876, 'learning_rate': 0.00019479876160990713, 'epoch': 0.03}


  3%|▎         | 90/3235 [01:39<58:19,  1.11s/it]  

{'loss': 1.1378, 'grad_norm': 0.4441717565059662, 'learning_rate': 0.00019473684210526317, 'epoch': 0.03}


  3%|▎         | 91/3235 [01:40<55:34,  1.06s/it]

{'loss': 1.1149, 'grad_norm': 0.4952576458454132, 'learning_rate': 0.0001946749226006192, 'epoch': 0.03}


  3%|▎         | 92/3235 [01:41<58:30,  1.12s/it]

{'loss': 1.0017, 'grad_norm': 0.44997531175613403, 'learning_rate': 0.00019461300309597522, 'epoch': 0.03}


  3%|▎         | 93/3235 [01:42<59:01,  1.13s/it]

{'loss': 1.1957, 'grad_norm': 0.4364430606365204, 'learning_rate': 0.0001945510835913313, 'epoch': 0.03}


  3%|▎         | 94/3235 [01:43<59:17,  1.13s/it]

{'loss': 1.0315, 'grad_norm': 0.44381338357925415, 'learning_rate': 0.00019448916408668733, 'epoch': 0.03}


  3%|▎         | 95/3235 [01:44<55:40,  1.06s/it]

{'loss': 1.0439, 'grad_norm': 0.5077940821647644, 'learning_rate': 0.00019442724458204337, 'epoch': 0.03}


  3%|▎         | 96/3235 [01:45<52:25,  1.00s/it]

{'loss': 0.9758, 'grad_norm': 0.4741361141204834, 'learning_rate': 0.0001943653250773994, 'epoch': 0.03}


  3%|▎         | 97/3235 [01:46<58:14,  1.11s/it]

{'loss': 1.1465, 'grad_norm': 0.4515680968761444, 'learning_rate': 0.00019430340557275545, 'epoch': 0.03}


  3%|▎         | 98/3235 [01:47<53:41,  1.03s/it]

{'loss': 1.0513, 'grad_norm': 0.4800645411014557, 'learning_rate': 0.00019424148606811146, 'epoch': 0.03}


  3%|▎         | 99/3235 [01:48<56:56,  1.09s/it]

{'loss': 1.0308, 'grad_norm': 0.4026529788970947, 'learning_rate': 0.0001941795665634675, 'epoch': 0.03}


  3%|▎         | 100/3235 [01:50<57:57,  1.11s/it]

{'loss': 1.1427, 'grad_norm': 0.5050045847892761, 'learning_rate': 0.00019411764705882354, 'epoch': 0.03}


  3%|▎         | 101/3235 [01:51<1:00:23,  1.16s/it]

{'loss': 1.238, 'grad_norm': 0.5009869933128357, 'learning_rate': 0.00019405572755417958, 'epoch': 0.03}


  3%|▎         | 102/3235 [01:52<1:00:15,  1.15s/it]

{'loss': 1.1181, 'grad_norm': 0.4742361009120941, 'learning_rate': 0.00019399380804953562, 'epoch': 0.03}


  3%|▎         | 103/3235 [01:53<56:25,  1.08s/it]  

{'loss': 1.1398, 'grad_norm': 0.5081246495246887, 'learning_rate': 0.00019393188854489166, 'epoch': 0.03}


  3%|▎         | 104/3235 [01:54<54:44,  1.05s/it]

{'loss': 1.0203, 'grad_norm': 0.47368529438972473, 'learning_rate': 0.00019386996904024767, 'epoch': 0.03}


  3%|▎         | 105/3235 [01:55<53:00,  1.02s/it]

{'loss': 1.0138, 'grad_norm': 0.4910472631454468, 'learning_rate': 0.0001938080495356037, 'epoch': 0.03}


  3%|▎         | 106/3235 [01:56<54:33,  1.05s/it]

{'loss': 1.2229, 'grad_norm': 0.481093168258667, 'learning_rate': 0.00019374613003095975, 'epoch': 0.03}


  3%|▎         | 107/3235 [01:57<52:47,  1.01s/it]

{'loss': 1.1099, 'grad_norm': 0.48230060935020447, 'learning_rate': 0.0001936842105263158, 'epoch': 0.03}


  3%|▎         | 108/3235 [01:58<55:59,  1.07s/it]

{'loss': 1.0515, 'grad_norm': 0.48032346367836, 'learning_rate': 0.00019362229102167183, 'epoch': 0.03}


  3%|▎         | 109/3235 [01:59<57:41,  1.11s/it]

{'loss': 1.1056, 'grad_norm': 0.4256758689880371, 'learning_rate': 0.00019356037151702787, 'epoch': 0.03}


  3%|▎         | 110/3235 [02:01<59:14,  1.14s/it]

{'loss': 1.0439, 'grad_norm': 0.41008567810058594, 'learning_rate': 0.0001934984520123839, 'epoch': 0.03}


  3%|▎         | 111/3235 [02:02<1:00:04,  1.15s/it]

{'loss': 1.1825, 'grad_norm': 0.5071830153465271, 'learning_rate': 0.00019343653250773995, 'epoch': 0.03}


  3%|▎         | 112/3235 [02:03<57:02,  1.10s/it]  

{'loss': 1.214, 'grad_norm': 0.5607526302337646, 'learning_rate': 0.000193374613003096, 'epoch': 0.03}


  3%|▎         | 113/3235 [02:04<57:39,  1.11s/it]

{'loss': 1.0052, 'grad_norm': 0.4189150929450989, 'learning_rate': 0.00019331269349845203, 'epoch': 0.03}


  4%|▎         | 114/3235 [02:05<55:50,  1.07s/it]

{'loss': 1.0832, 'grad_norm': 0.46995025873184204, 'learning_rate': 0.00019325077399380807, 'epoch': 0.04}


  4%|▎         | 115/3235 [02:06<59:53,  1.15s/it]

{'loss': 1.0588, 'grad_norm': 0.41684070229530334, 'learning_rate': 0.0001931888544891641, 'epoch': 0.04}


  4%|▎         | 116/3235 [02:07<59:28,  1.14s/it]

{'loss': 0.9644, 'grad_norm': 0.4164528548717499, 'learning_rate': 0.00019312693498452013, 'epoch': 0.04}


  4%|▎         | 117/3235 [02:09<1:00:37,  1.17s/it]

{'loss': 1.2382, 'grad_norm': 0.4463561475276947, 'learning_rate': 0.00019306501547987617, 'epoch': 0.04}


  4%|▎         | 118/3235 [02:09<57:40,  1.11s/it]  

{'loss': 1.0685, 'grad_norm': 0.48785263299942017, 'learning_rate': 0.0001930030959752322, 'epoch': 0.04}


  4%|▎         | 119/3235 [02:11<57:50,  1.11s/it]

{'loss': 1.0497, 'grad_norm': 0.4506019055843353, 'learning_rate': 0.00019294117647058825, 'epoch': 0.04}


  4%|▎         | 120/3235 [02:12<58:31,  1.13s/it]

{'loss': 1.1482, 'grad_norm': 0.46969595551490784, 'learning_rate': 0.00019287925696594429, 'epoch': 0.04}


  4%|▎         | 121/3235 [02:13<59:24,  1.14s/it]

{'loss': 1.1096, 'grad_norm': 0.42424124479293823, 'learning_rate': 0.00019281733746130033, 'epoch': 0.04}


  4%|▍         | 122/3235 [02:14<1:01:13,  1.18s/it]

{'loss': 1.1779, 'grad_norm': 0.4709770977497101, 'learning_rate': 0.00019275541795665634, 'epoch': 0.04}


  4%|▍         | 123/3235 [02:15<59:12,  1.14s/it]  

{'loss': 1.1042, 'grad_norm': 0.5426083207130432, 'learning_rate': 0.00019269349845201238, 'epoch': 0.04}


  4%|▍         | 124/3235 [02:16<56:44,  1.09s/it]

{'loss': 1.059, 'grad_norm': 0.44493770599365234, 'learning_rate': 0.00019263157894736842, 'epoch': 0.04}


  4%|▍         | 125/3235 [02:17<55:00,  1.06s/it]

{'loss': 1.1146, 'grad_norm': 0.5444877743721008, 'learning_rate': 0.00019256965944272446, 'epoch': 0.04}


  4%|▍         | 126/3235 [02:18<55:21,  1.07s/it]

{'loss': 1.02, 'grad_norm': 0.44413092732429504, 'learning_rate': 0.0001925077399380805, 'epoch': 0.04}


  4%|▍         | 127/3235 [02:19<54:38,  1.05s/it]

{'loss': 0.9894, 'grad_norm': 0.48061999678611755, 'learning_rate': 0.00019244582043343657, 'epoch': 0.04}


  4%|▍         | 128/3235 [02:20<52:32,  1.01s/it]

{'loss': 1.0223, 'grad_norm': 0.4721930921077728, 'learning_rate': 0.00019238390092879258, 'epoch': 0.04}


  4%|▍         | 129/3235 [02:21<53:09,  1.03s/it]

{'loss': 1.1975, 'grad_norm': 0.5133720636367798, 'learning_rate': 0.00019232198142414862, 'epoch': 0.04}


  4%|▍         | 130/3235 [02:22<53:59,  1.04s/it]

{'loss': 1.0977, 'grad_norm': 0.4450669586658478, 'learning_rate': 0.00019226006191950466, 'epoch': 0.04}


  4%|▍         | 131/3235 [02:23<52:39,  1.02s/it]

{'loss': 1.0677, 'grad_norm': 0.45876285433769226, 'learning_rate': 0.0001921981424148607, 'epoch': 0.04}


  4%|▍         | 132/3235 [02:24<53:24,  1.03s/it]

{'loss': 1.0587, 'grad_norm': 0.5258201360702515, 'learning_rate': 0.00019213622291021674, 'epoch': 0.04}


  4%|▍         | 133/3235 [02:25<53:31,  1.04s/it]

{'loss': 1.1659, 'grad_norm': 0.4926135241985321, 'learning_rate': 0.00019207430340557278, 'epoch': 0.04}


  4%|▍         | 134/3235 [02:26<51:19,  1.01it/s]

{'loss': 1.1097, 'grad_norm': 0.4704086482524872, 'learning_rate': 0.0001920123839009288, 'epoch': 0.04}


  4%|▍         | 135/3235 [02:28<53:55,  1.04s/it]

{'loss': 1.168, 'grad_norm': 0.4637715816497803, 'learning_rate': 0.00019195046439628483, 'epoch': 0.04}


  4%|▍         | 136/3235 [02:29<55:29,  1.07s/it]

{'loss': 1.0937, 'grad_norm': 0.4321858286857605, 'learning_rate': 0.00019188854489164087, 'epoch': 0.04}


  4%|▍         | 137/3235 [02:30<56:46,  1.10s/it]

{'loss': 1.1428, 'grad_norm': 0.45199471712112427, 'learning_rate': 0.0001918266253869969, 'epoch': 0.04}


  4%|▍         | 138/3235 [02:31<55:12,  1.07s/it]

{'loss': 1.0285, 'grad_norm': 0.5153383612632751, 'learning_rate': 0.00019176470588235295, 'epoch': 0.04}


  4%|▍         | 139/3235 [02:32<56:20,  1.09s/it]

{'loss': 1.0197, 'grad_norm': 0.48451662063598633, 'learning_rate': 0.000191702786377709, 'epoch': 0.04}


  4%|▍         | 140/3235 [02:33<56:58,  1.10s/it]

{'loss': 0.9809, 'grad_norm': 0.44394785165786743, 'learning_rate': 0.000191640866873065, 'epoch': 0.04}


  4%|▍         | 141/3235 [02:34<54:09,  1.05s/it]

{'loss': 1.008, 'grad_norm': 0.5434911251068115, 'learning_rate': 0.00019157894736842104, 'epoch': 0.04}


  4%|▍         | 142/3235 [02:35<54:30,  1.06s/it]

{'loss': 1.1672, 'grad_norm': 0.47050362825393677, 'learning_rate': 0.0001915170278637771, 'epoch': 0.04}


  4%|▍         | 143/3235 [02:36<56:33,  1.10s/it]

{'loss': 1.056, 'grad_norm': 0.3994986414909363, 'learning_rate': 0.00019145510835913315, 'epoch': 0.04}


  4%|▍         | 144/3235 [02:37<53:05,  1.03s/it]

{'loss': 1.0416, 'grad_norm': 0.5113745927810669, 'learning_rate': 0.0001913931888544892, 'epoch': 0.04}


  4%|▍         | 145/3235 [02:38<53:44,  1.04s/it]

{'loss': 1.1074, 'grad_norm': 0.4354860484600067, 'learning_rate': 0.00019133126934984523, 'epoch': 0.04}


  5%|▍         | 146/3235 [02:39<52:02,  1.01s/it]

{'loss': 1.0874, 'grad_norm': 0.5227409601211548, 'learning_rate': 0.00019126934984520124, 'epoch': 0.05}


  5%|▍         | 147/3235 [02:40<53:31,  1.04s/it]

{'loss': 1.09, 'grad_norm': 0.48232460021972656, 'learning_rate': 0.00019120743034055728, 'epoch': 0.05}


  5%|▍         | 148/3235 [02:41<52:38,  1.02s/it]

{'loss': 1.3116, 'grad_norm': 0.5249983668327332, 'learning_rate': 0.00019114551083591332, 'epoch': 0.05}


  5%|▍         | 149/3235 [02:42<52:30,  1.02s/it]

{'loss': 0.9477, 'grad_norm': 0.4422226846218109, 'learning_rate': 0.00019108359133126936, 'epoch': 0.05}


  5%|▍         | 150/3235 [02:43<46:59,  1.09it/s]

{'loss': 0.9578, 'grad_norm': 0.5302420854568481, 'learning_rate': 0.0001910216718266254, 'epoch': 0.05}


  5%|▍         | 151/3235 [02:44<50:10,  1.02it/s]

{'loss': 1.0439, 'grad_norm': 0.4554576873779297, 'learning_rate': 0.00019095975232198144, 'epoch': 0.05}


  5%|▍         | 152/3235 [02:45<52:40,  1.03s/it]

{'loss': 1.1202, 'grad_norm': 0.46519869565963745, 'learning_rate': 0.00019089783281733746, 'epoch': 0.05}


  5%|▍         | 153/3235 [02:46<53:50,  1.05s/it]

{'loss': 1.2675, 'grad_norm': 0.49588000774383545, 'learning_rate': 0.0001908359133126935, 'epoch': 0.05}


  5%|▍         | 154/3235 [02:47<52:49,  1.03s/it]

{'loss': 0.9273, 'grad_norm': 0.474027544260025, 'learning_rate': 0.00019077399380804954, 'epoch': 0.05}


  5%|▍         | 155/3235 [02:48<53:50,  1.05s/it]

{'loss': 0.951, 'grad_norm': 0.4234587252140045, 'learning_rate': 0.00019071207430340558, 'epoch': 0.05}


  5%|▍         | 156/3235 [02:50<56:11,  1.09s/it]

{'loss': 1.0855, 'grad_norm': 0.4179360866546631, 'learning_rate': 0.00019065015479876162, 'epoch': 0.05}


  5%|▍         | 157/3235 [02:51<57:24,  1.12s/it]

{'loss': 1.1967, 'grad_norm': 0.45609819889068604, 'learning_rate': 0.00019058823529411766, 'epoch': 0.05}


  5%|▍         | 158/3235 [02:52<59:42,  1.16s/it]

{'loss': 1.0752, 'grad_norm': 0.36718130111694336, 'learning_rate': 0.0001905263157894737, 'epoch': 0.05}


  5%|▍         | 159/3235 [02:53<1:01:55,  1.21s/it]

{'loss': 1.1945, 'grad_norm': 0.4262561798095703, 'learning_rate': 0.00019046439628482974, 'epoch': 0.05}


  5%|▍         | 160/3235 [02:54<59:23,  1.16s/it]

{'loss': 1.1453, 'grad_norm': 0.44211316108703613, 'learning_rate': 0.00019040247678018578, 'epoch': 0.05}


  5%|▍         | 161/3235 [02:56<58:51,  1.15s/it]

{'loss': 1.2228, 'grad_norm': 0.4646320343017578, 'learning_rate': 0.00019034055727554182, 'epoch': 0.05}


  5%|▌         | 162/3235 [02:57<59:37,  1.16s/it]

{'loss': 1.193, 'grad_norm': 0.4442059099674225, 'learning_rate': 0.00019027863777089786, 'epoch': 0.05}


  5%|▌         | 163/3235 [02:58<1:00:07,  1.17s/it]

{'loss': 1.1144, 'grad_norm': 0.44528335332870483, 'learning_rate': 0.00019021671826625387, 'epoch': 0.05}


  5%|▌         | 164/3235 [02:59<57:27,  1.12s/it]

{'loss': 1.0887, 'grad_norm': 0.46695899963378906, 'learning_rate': 0.0001901547987616099, 'epoch': 0.05}


  5%|▌         | 165/3235 [03:00<56:55,  1.11s/it]

{'loss': 1.0395, 'grad_norm': 0.45134884119033813, 'learning_rate': 0.00019009287925696595, 'epoch': 0.05}


  5%|▌         | 166/3235 [03:01<56:32,  1.11s/it]

{'loss': 1.1809, 'grad_norm': 0.4203859865665436, 'learning_rate': 0.000190030959752322, 'epoch': 0.05}


  5%|▌         | 167/3235 [03:02<55:35,  1.09s/it]

{'loss': 1.0087, 'grad_norm': 0.4677518606185913, 'learning_rate': 0.00018996904024767803, 'epoch': 0.05}


  5%|▌         | 168/3235 [03:03<53:06,  1.04s/it]

{'loss': 1.0872, 'grad_norm': 0.4602053463459015, 'learning_rate': 0.00018990712074303407, 'epoch': 0.05}


  5%|▌         | 169/3235 [03:04<57:09,  1.12s/it]

{'loss': 1.198, 'grad_norm': 0.4078342318534851, 'learning_rate': 0.00018984520123839008, 'epoch': 0.05}


  5%|▌         | 170/3235 [03:06<58:45,  1.15s/it]

{'loss': 1.0947, 'grad_norm': 0.436149924993515, 'learning_rate': 0.00018978328173374612, 'epoch': 0.05}


  5%|▌         | 171/3235 [03:07<55:19,  1.08s/it]

{'loss': 1.0326, 'grad_norm': 0.4792775809764862, 'learning_rate': 0.00018972136222910216, 'epoch': 0.05}


  5%|▌         | 172/3235 [03:08<55:37,  1.09s/it]

{'loss': 1.0376, 'grad_norm': 0.4288654923439026, 'learning_rate': 0.0001896594427244582, 'epoch': 0.05}


  5%|▌         | 173/3235 [03:09<53:23,  1.05s/it]

{'loss': 1.0194, 'grad_norm': 0.4731075167655945, 'learning_rate': 0.00018959752321981424, 'epoch': 0.05}


  5%|▌         | 174/3235 [03:10<57:22,  1.12s/it]

{'loss': 1.1069, 'grad_norm': 0.4320301115512848, 'learning_rate': 0.00018953560371517028, 'epoch': 0.05}


  5%|▌         | 175/3235 [03:11<1:00:17,  1.18s/it]

{'loss': 1.27, 'grad_norm': 0.46661296486854553, 'learning_rate': 0.00018947368421052632, 'epoch': 0.05}


  5%|▌         | 176/3235 [03:12<56:26,  1.11s/it]

{'loss': 1.0722, 'grad_norm': 0.4679522216320038, 'learning_rate': 0.00018941176470588236, 'epoch': 0.05}


  5%|▌         | 177/3235 [03:13<53:30,  1.05s/it]

{'loss': 1.1221, 'grad_norm': 0.4792155623435974, 'learning_rate': 0.0001893498452012384, 'epoch': 0.05}


  6%|▌         | 178/3235 [03:14<54:54,  1.08s/it]

{'loss': 1.0148, 'grad_norm': 0.4953596293926239, 'learning_rate': 0.00018928792569659444, 'epoch': 0.06}


  6%|▌         | 179/3235 [03:15<55:14,  1.08s/it]

{'loss': 1.0829, 'grad_norm': 0.46295714378356934, 'learning_rate': 0.00018922600619195048, 'epoch': 0.06}


  6%|▌         | 180/3235 [03:16<52:26,  1.03s/it]

{'loss': 1.0903, 'grad_norm': 0.5026911497116089, 'learning_rate': 0.00018916408668730652, 'epoch': 0.06}


  6%|▌         | 181/3235 [03:17<51:07,  1.00s/it]

{'loss': 0.9262, 'grad_norm': 0.4357762038707733, 'learning_rate': 0.00018910216718266254, 'epoch': 0.06}


  6%|▌         | 182/3235 [03:18<54:41,  1.07s/it]

{'loss': 1.1598, 'grad_norm': 0.5265110731124878, 'learning_rate': 0.00018904024767801858, 'epoch': 0.06}


  6%|▌         | 183/3235 [03:19<52:18,  1.03s/it]

{'loss': 1.0611, 'grad_norm': 0.46840301156044006, 'learning_rate': 0.00018897832817337462, 'epoch': 0.06}


  6%|▌         | 184/3235 [03:21<56:29,  1.11s/it]

{'loss': 1.1543, 'grad_norm': 0.39620092511177063, 'learning_rate': 0.00018891640866873066, 'epoch': 0.06}


  6%|▌         | 185/3235 [03:22<58:25,  1.15s/it]

{'loss': 1.0927, 'grad_norm': 0.4763052463531494, 'learning_rate': 0.0001888544891640867, 'epoch': 0.06}


  6%|▌         | 186/3235 [03:23<54:14,  1.07s/it]

{'loss': 1.022, 'grad_norm': 0.47791609168052673, 'learning_rate': 0.00018879256965944274, 'epoch': 0.06}


  6%|▌         | 187/3235 [03:24<52:37,  1.04s/it]

{'loss': 1.0708, 'grad_norm': 0.5048824548721313, 'learning_rate': 0.00018873065015479875, 'epoch': 0.06}


  6%|▌         | 188/3235 [03:25<54:37,  1.08s/it]

{'loss': 1.0634, 'grad_norm': 0.43199431896209717, 'learning_rate': 0.0001886687306501548, 'epoch': 0.06}


  6%|▌         | 189/3235 [03:26<57:12,  1.13s/it]

{'loss': 0.9523, 'grad_norm': 0.3915252089500427, 'learning_rate': 0.00018860681114551083, 'epoch': 0.06}


  6%|▌         | 190/3235 [03:27<52:42,  1.04s/it]

{'loss': 0.9325, 'grad_norm': 0.5447676777839661, 'learning_rate': 0.0001885448916408669, 'epoch': 0.06}


  6%|▌         | 191/3235 [03:28<57:51,  1.14s/it]

{'loss': 1.1145, 'grad_norm': 0.4445093274116516, 'learning_rate': 0.00018848297213622294, 'epoch': 0.06}


  6%|▌         | 192/3235 [03:29<52:10,  1.03s/it]

{'loss': 1.0795, 'grad_norm': 0.5036273002624512, 'learning_rate': 0.00018842105263157898, 'epoch': 0.06}


  6%|▌         | 193/3235 [03:30<52:00,  1.03s/it]

{'loss': 1.1199, 'grad_norm': 0.4620529115200043, 'learning_rate': 0.000188359133126935, 'epoch': 0.06}


  6%|▌         | 194/3235 [03:31<53:06,  1.05s/it]

{'loss': 1.1739, 'grad_norm': 0.4416956603527069, 'learning_rate': 0.00018829721362229103, 'epoch': 0.06}


  6%|▌         | 195/3235 [03:32<54:22,  1.07s/it]

{'loss': 1.0219, 'grad_norm': 0.4888019859790802, 'learning_rate': 0.00018823529411764707, 'epoch': 0.06}


  6%|▌         | 196/3235 [03:33<52:37,  1.04s/it]

{'loss': 1.0557, 'grad_norm': 0.47041556239128113, 'learning_rate': 0.0001881733746130031, 'epoch': 0.06}


  6%|▌         | 197/3235 [03:34<51:39,  1.02s/it]

{'loss': 1.0401, 'grad_norm': 0.5245705246925354, 'learning_rate': 0.00018811145510835915, 'epoch': 0.06}


  6%|▌         | 198/3235 [03:35<50:54,  1.01s/it]

{'loss': 1.0941, 'grad_norm': 0.4664936065673828, 'learning_rate': 0.0001880495356037152, 'epoch': 0.06}


  6%|▌         | 199/3235 [03:36<49:55,  1.01it/s]

{'loss': 1.1755, 'grad_norm': 0.5232594609260559, 'learning_rate': 0.0001879876160990712, 'epoch': 0.06}


  6%|▌         | 200/3235 [03:37<50:56,  1.01s/it]

{'loss': 1.1479, 'grad_norm': 0.44872787594795227, 'learning_rate': 0.00018792569659442724, 'epoch': 0.06}


  6%|▌         | 201/3235 [03:38<54:22,  1.08s/it]

{'loss': 0.975, 'grad_norm': 0.4640699326992035, 'learning_rate': 0.00018786377708978328, 'epoch': 0.06}


  6%|▌         | 202/3235 [03:39<51:25,  1.02s/it]

{'loss': 0.9228, 'grad_norm': 0.4693906903266907, 'learning_rate': 0.00018780185758513932, 'epoch': 0.06}


  6%|▋         | 203/3235 [03:41<56:31,  1.12s/it]

{'loss': 1.0679, 'grad_norm': 0.4156748950481415, 'learning_rate': 0.00018773993808049536, 'epoch': 0.06}


  6%|▋         | 204/3235 [03:42<56:34,  1.12s/it]

{'loss': 1.0945, 'grad_norm': 0.44134029746055603, 'learning_rate': 0.0001876780185758514, 'epoch': 0.06}


  6%|▋         | 205/3235 [03:43<57:05,  1.13s/it]

{'loss': 1.2106, 'grad_norm': 0.45273327827453613, 'learning_rate': 0.00018761609907120744, 'epoch': 0.06}


  6%|▋         | 206/3235 [03:44<56:45,  1.12s/it]

{'loss': 1.081, 'grad_norm': 0.45558780431747437, 'learning_rate': 0.00018755417956656348, 'epoch': 0.06}


  6%|▋         | 207/3235 [03:45<58:14,  1.15s/it]

{'loss': 1.0662, 'grad_norm': 0.4017143249511719, 'learning_rate': 0.00018749226006191952, 'epoch': 0.06}


  6%|▋         | 208/3235 [03:46<57:59,  1.15s/it]

{'loss': 0.993, 'grad_norm': 0.43394333124160767, 'learning_rate': 0.00018743034055727556, 'epoch': 0.06}


  6%|▋         | 209/3235 [03:47<53:17,  1.06s/it]

{'loss': 0.965, 'grad_norm': 0.4820634424686432, 'learning_rate': 0.0001873684210526316, 'epoch': 0.06}


  6%|▋         | 210/3235 [03:48<52:43,  1.05s/it]

{'loss': 1.1258, 'grad_norm': 0.45119526982307434, 'learning_rate': 0.00018730650154798764, 'epoch': 0.06}


  7%|▋         | 211/3235 [03:49<51:24,  1.02s/it]

{'loss': 1.0605, 'grad_norm': 0.4592263400554657, 'learning_rate': 0.00018724458204334366, 'epoch': 0.07}


  7%|▋         | 212/3235 [03:51<55:50,  1.11s/it]

{'loss': 1.0708, 'grad_norm': 0.4413527846336365, 'learning_rate': 0.0001871826625386997, 'epoch': 0.07}


  7%|▋         | 213/3235 [03:51<51:29,  1.02s/it]

{'loss': 0.9157, 'grad_norm': 0.4569911062717438, 'learning_rate': 0.00018712074303405574, 'epoch': 0.07}


  7%|▋         | 214/3235 [03:52<51:18,  1.02s/it]

{'loss': 1.0521, 'grad_norm': 0.45901939272880554, 'learning_rate': 0.00018705882352941178, 'epoch': 0.07}


  7%|▋         | 215/3235 [03:53<51:53,  1.03s/it]

{'loss': 1.0615, 'grad_norm': 0.417278528213501, 'learning_rate': 0.00018699690402476782, 'epoch': 0.07}


  7%|▋         | 216/3235 [03:54<51:25,  1.02s/it]

{'loss': 0.9812, 'grad_norm': 0.4399526119232178, 'learning_rate': 0.00018693498452012386, 'epoch': 0.07}


  7%|▋         | 217/3235 [03:56<51:44,  1.03s/it]

{'loss': 1.1382, 'grad_norm': 0.46579113602638245, 'learning_rate': 0.00018687306501547987, 'epoch': 0.07}


  7%|▋         | 218/3235 [03:57<55:26,  1.10s/it]

{'loss': 1.0762, 'grad_norm': 0.42997050285339355, 'learning_rate': 0.0001868111455108359, 'epoch': 0.07}


  7%|▋         | 219/3235 [03:58<57:47,  1.15s/it]

{'loss': 1.1259, 'grad_norm': 0.41108471155166626, 'learning_rate': 0.00018674922600619195, 'epoch': 0.07}


  7%|▋         | 220/3235 [03:59<54:31,  1.08s/it]

{'loss': 0.9177, 'grad_norm': 0.47552958130836487, 'learning_rate': 0.000186687306501548, 'epoch': 0.07}


  7%|▋         | 221/3235 [04:00<56:45,  1.13s/it]

{'loss': 1.1833, 'grad_norm': 0.4339301586151123, 'learning_rate': 0.00018662538699690403, 'epoch': 0.07}


  7%|▋         | 222/3235 [04:01<56:40,  1.13s/it]

{'loss': 0.991, 'grad_norm': 0.3825105130672455, 'learning_rate': 0.00018656346749226007, 'epoch': 0.07}


  7%|▋         | 223/3235 [04:02<56:08,  1.12s/it]

{'loss': 1.0187, 'grad_norm': 0.45915451645851135, 'learning_rate': 0.0001865015479876161, 'epoch': 0.07}


  7%|▋         | 224/3235 [04:03<53:58,  1.08s/it]

{'loss': 1.173, 'grad_norm': 0.4875468611717224, 'learning_rate': 0.00018643962848297215, 'epoch': 0.07}


  7%|▋         | 225/3235 [04:05<58:16,  1.16s/it]

{'loss': 1.1419, 'grad_norm': 0.4407069683074951, 'learning_rate': 0.0001863777089783282, 'epoch': 0.07}


  7%|▋         | 226/3235 [04:06<57:39,  1.15s/it]

{'loss': 1.1565, 'grad_norm': 0.4509962797164917, 'learning_rate': 0.00018631578947368423, 'epoch': 0.07}


  7%|▋         | 227/3235 [04:07<57:11,  1.14s/it]

{'loss': 1.1383, 'grad_norm': 0.46056628227233887, 'learning_rate': 0.00018625386996904027, 'epoch': 0.07}


  7%|▋         | 228/3235 [04:08<55:24,  1.11s/it]

{'loss': 1.0099, 'grad_norm': 0.4729224145412445, 'learning_rate': 0.0001861919504643963, 'epoch': 0.07}


  7%|▋         | 229/3235 [04:10<1:04:09,  1.28s/it]

{'loss': 0.915, 'grad_norm': 0.42343461513519287, 'learning_rate': 0.00018613003095975232, 'epoch': 0.07}


  7%|▋         | 230/3235 [04:11<1:02:37,  1.25s/it]

{'loss': 0.9933, 'grad_norm': 0.44350019097328186, 'learning_rate': 0.00018606811145510836, 'epoch': 0.07}


  7%|▋         | 231/3235 [04:12<56:55,  1.14s/it]

{'loss': 1.0912, 'grad_norm': 0.4564179480075836, 'learning_rate': 0.0001860061919504644, 'epoch': 0.07}


  7%|▋         | 232/3235 [04:13<51:40,  1.03s/it]

{'loss': 0.9955, 'grad_norm': 0.5368212461471558, 'learning_rate': 0.00018594427244582044, 'epoch': 0.07}


  7%|▋         | 233/3235 [04:14<53:02,  1.06s/it]

{'loss': 1.1448, 'grad_norm': 0.4347473680973053, 'learning_rate': 0.00018588235294117648, 'epoch': 0.07}


  7%|▋         | 234/3235 [04:15<50:18,  1.01s/it]

{'loss': 1.0779, 'grad_norm': 0.4703226089477539, 'learning_rate': 0.00018582043343653252, 'epoch': 0.07}


  7%|▋         | 235/3235 [04:16<52:41,  1.05s/it]

{'loss': 1.0237, 'grad_norm': 0.40833014249801636, 'learning_rate': 0.00018575851393188853, 'epoch': 0.07}


  7%|▋         | 236/3235 [04:17<52:32,  1.05s/it]

{'loss': 1.2321, 'grad_norm': 0.452378511428833, 'learning_rate': 0.00018569659442724457, 'epoch': 0.07}


  7%|▋         | 237/3235 [04:18<54:38,  1.09s/it]

{'loss': 1.0551, 'grad_norm': 0.4489210247993469, 'learning_rate': 0.00018563467492260061, 'epoch': 0.07}


  7%|▋         | 238/3235 [04:19<54:01,  1.08s/it]

{'loss': 1.0948, 'grad_norm': 0.4327194094657898, 'learning_rate': 0.00018557275541795665, 'epoch': 0.07}


  7%|▋         | 239/3235 [04:20<55:31,  1.11s/it]

{'loss': 1.0994, 'grad_norm': 0.45221707224845886, 'learning_rate': 0.00018551083591331272, 'epoch': 0.07}


  7%|▋         | 240/3235 [04:21<53:06,  1.06s/it]

{'loss': 1.1583, 'grad_norm': 0.5031891465187073, 'learning_rate': 0.00018544891640866876, 'epoch': 0.07}


  7%|▋         | 241/3235 [04:22<53:24,  1.07s/it]

{'loss': 1.0352, 'grad_norm': 0.467680424451828, 'learning_rate': 0.00018538699690402477, 'epoch': 0.07}


  7%|▋         | 242/3235 [04:23<53:16,  1.07s/it]

{'loss': 0.9454, 'grad_norm': 0.470249205827713, 'learning_rate': 0.00018532507739938081, 'epoch': 0.07}


  8%|▊         | 243/3235 [04:24<52:04,  1.04s/it]

{'loss': 1.0592, 'grad_norm': 0.47327592968940735, 'learning_rate': 0.00018526315789473685, 'epoch': 0.08}


  8%|▊         | 244/3235 [04:25<50:41,  1.02s/it]

{'loss': 1.0165, 'grad_norm': 0.5275542140007019, 'learning_rate': 0.0001852012383900929, 'epoch': 0.08}


  8%|▊         | 245/3235 [04:26<50:08,  1.01s/it]

{'loss': 1.0887, 'grad_norm': 0.5577475428581238, 'learning_rate': 0.00018513931888544893, 'epoch': 0.08}


  8%|▊         | 246/3235 [04:27<45:23,  1.10it/s]

{'loss': 1.0531, 'grad_norm': 0.5528433322906494, 'learning_rate': 0.00018507739938080497, 'epoch': 0.08}


  8%|▊         | 247/3235 [04:28<45:47,  1.09it/s]

{'loss': 1.0924, 'grad_norm': 0.46846145391464233, 'learning_rate': 0.000185015479876161, 'epoch': 0.08}


  8%|▊         | 248/3235 [04:29<47:58,  1.04it/s]

{'loss': 1.0854, 'grad_norm': 0.5052444338798523, 'learning_rate': 0.00018495356037151703, 'epoch': 0.08}


  8%|▊         | 249/3235 [04:30<52:43,  1.06s/it]

{'loss': 1.1351, 'grad_norm': 0.3943932056427002, 'learning_rate': 0.00018489164086687307, 'epoch': 0.08}


  8%|▊         | 250/3235 [04:31<48:56,  1.02it/s]

{'loss': 1.0718, 'grad_norm': 0.5504528880119324, 'learning_rate': 0.0001848297213622291, 'epoch': 0.08}


  8%|▊         | 251/3235 [04:32<51:21,  1.03s/it]

{'loss': 1.0855, 'grad_norm': 0.47578001022338867, 'learning_rate': 0.00018476780185758515, 'epoch': 0.08}


  8%|▊         | 252/3235 [04:33<50:41,  1.02s/it]

{'loss': 1.0446, 'grad_norm': 0.5215237140655518, 'learning_rate': 0.0001847058823529412, 'epoch': 0.08}


  8%|▊         | 253/3235 [04:35<57:40,  1.16s/it]

{'loss': 1.0254, 'grad_norm': 0.41444864869117737, 'learning_rate': 0.00018464396284829723, 'epoch': 0.08}


  8%|▊         | 254/3235 [04:36<53:42,  1.08s/it]

{'loss': 0.9174, 'grad_norm': 0.46703779697418213, 'learning_rate': 0.00018458204334365327, 'epoch': 0.08}


  8%|▊         | 255/3235 [04:37<59:08,  1.19s/it]

{'loss': 1.0551, 'grad_norm': 0.3897922933101654, 'learning_rate': 0.0001845201238390093, 'epoch': 0.08}


  8%|▊         | 256/3235 [04:38<57:32,  1.16s/it]

{'loss': 1.0332, 'grad_norm': 0.45722755789756775, 'learning_rate': 0.00018445820433436535, 'epoch': 0.08}


  8%|▊         | 257/3235 [04:39<56:17,  1.13s/it]

{'loss': 1.18, 'grad_norm': 0.50833660364151, 'learning_rate': 0.0001843962848297214, 'epoch': 0.08}


  8%|▊         | 258/3235 [04:40<51:39,  1.04s/it]

{'loss': 1.0239, 'grad_norm': 0.4641098082065582, 'learning_rate': 0.00018433436532507743, 'epoch': 0.08}


  8%|▊         | 259/3235 [04:41<54:43,  1.10s/it]

{'loss': 1.0202, 'grad_norm': 0.46811336278915405, 'learning_rate': 0.00018427244582043344, 'epoch': 0.08}


  8%|▊         | 260/3235 [04:42<55:51,  1.13s/it]

{'loss': 1.0379, 'grad_norm': 0.3993578851222992, 'learning_rate': 0.00018421052631578948, 'epoch': 0.08}


  8%|▊         | 261/3235 [04:44<57:03,  1.15s/it]

{'loss': 1.0794, 'grad_norm': 0.4132360816001892, 'learning_rate': 0.00018414860681114552, 'epoch': 0.08}


  8%|▊         | 262/3235 [04:45<53:22,  1.08s/it]

{'loss': 1.1083, 'grad_norm': 0.47737157344818115, 'learning_rate': 0.00018408668730650156, 'epoch': 0.08}


  8%|▊         | 263/3235 [04:45<50:24,  1.02s/it]

{'loss': 1.053, 'grad_norm': 0.5107963681221008, 'learning_rate': 0.0001840247678018576, 'epoch': 0.08}


  8%|▊         | 264/3235 [04:47<52:26,  1.06s/it]

{'loss': 1.05, 'grad_norm': 0.4100928008556366, 'learning_rate': 0.00018396284829721364, 'epoch': 0.08}


  8%|▊         | 265/3235 [04:48<53:00,  1.07s/it]

{'loss': 1.0966, 'grad_norm': 0.4497043490409851, 'learning_rate': 0.00018390092879256965, 'epoch': 0.08}


  8%|▊         | 266/3235 [04:49<52:15,  1.06s/it]

{'loss': 1.0808, 'grad_norm': 0.5203713178634644, 'learning_rate': 0.0001838390092879257, 'epoch': 0.08}


  8%|▊         | 267/3235 [04:50<50:24,  1.02s/it]

{'loss': 1.1751, 'grad_norm': 0.4917738735675812, 'learning_rate': 0.00018377708978328173, 'epoch': 0.08}


  8%|▊         | 268/3235 [04:50<48:11,  1.03it/s]

{'loss': 0.9611, 'grad_norm': 0.42276400327682495, 'learning_rate': 0.00018371517027863777, 'epoch': 0.08}


  8%|▊         | 269/3235 [04:52<49:06,  1.01it/s]

{'loss': 1.1052, 'grad_norm': 0.5050876140594482, 'learning_rate': 0.0001836532507739938, 'epoch': 0.08}


  8%|▊         | 270/3235 [04:52<46:58,  1.05it/s]

{'loss': 1.0562, 'grad_norm': 0.5712774991989136, 'learning_rate': 0.00018359133126934985, 'epoch': 0.08}


  8%|▊         | 271/3235 [04:53<49:03,  1.01it/s]

{'loss': 1.173, 'grad_norm': 0.429294615983963, 'learning_rate': 0.0001835294117647059, 'epoch': 0.08}


  8%|▊         | 272/3235 [04:54<47:35,  1.04it/s]

{'loss': 1.0195, 'grad_norm': 0.5066455602645874, 'learning_rate': 0.00018346749226006193, 'epoch': 0.08}


  8%|▊         | 273/3235 [04:55<48:41,  1.01it/s]

{'loss': 1.0678, 'grad_norm': 0.4707094132900238, 'learning_rate': 0.00018340557275541797, 'epoch': 0.08}


  8%|▊         | 274/3235 [04:56<48:49,  1.01it/s]

{'loss': 0.8476, 'grad_norm': 0.4089484214782715, 'learning_rate': 0.000183343653250774, 'epoch': 0.08}


  9%|▊         | 275/3235 [04:58<51:47,  1.05s/it]

{'loss': 1.0334, 'grad_norm': 0.40872061252593994, 'learning_rate': 0.00018328173374613005, 'epoch': 0.09}


  9%|▊         | 276/3235 [04:59<51:26,  1.04s/it]

{'loss': 1.0624, 'grad_norm': 0.44944965839385986, 'learning_rate': 0.0001832198142414861, 'epoch': 0.09}


  9%|▊         | 277/3235 [05:00<50:33,  1.03s/it]

{'loss': 0.9346, 'grad_norm': 0.435306191444397, 'learning_rate': 0.0001831578947368421, 'epoch': 0.09}


  9%|▊         | 278/3235 [05:01<51:25,  1.04s/it]

{'loss': 0.8997, 'grad_norm': 0.45313510298728943, 'learning_rate': 0.00018309597523219815, 'epoch': 0.09}


  9%|▊         | 279/3235 [05:02<52:16,  1.06s/it]

{'loss': 1.0776, 'grad_norm': 0.4566480815410614, 'learning_rate': 0.00018303405572755419, 'epoch': 0.09}


  9%|▊         | 280/3235 [05:03<49:22,  1.00s/it]

{'loss': 1.0753, 'grad_norm': 0.4832928478717804, 'learning_rate': 0.00018297213622291023, 'epoch': 0.09}


  9%|▊         | 281/3235 [05:04<51:02,  1.04s/it]

{'loss': 1.1039, 'grad_norm': 0.4407956898212433, 'learning_rate': 0.00018291021671826627, 'epoch': 0.09}


  9%|▊         | 282/3235 [05:05<51:39,  1.05s/it]

{'loss': 1.0967, 'grad_norm': 0.452994167804718, 'learning_rate': 0.0001828482972136223, 'epoch': 0.09}


  9%|▊         | 283/3235 [05:06<52:12,  1.06s/it]

{'loss': 1.1249, 'grad_norm': 0.4247986078262329, 'learning_rate': 0.00018278637770897832, 'epoch': 0.09}


  9%|▉         | 284/3235 [05:07<51:37,  1.05s/it]

{'loss': 0.9209, 'grad_norm': 0.43229496479034424, 'learning_rate': 0.00018272445820433436, 'epoch': 0.09}


  9%|▉         | 285/3235 [05:08<53:46,  1.09s/it]

{'loss': 1.147, 'grad_norm': 0.4449012875556946, 'learning_rate': 0.0001826625386996904, 'epoch': 0.09}


  9%|▉         | 286/3235 [05:09<56:41,  1.15s/it]

{'loss': 1.023, 'grad_norm': 0.4208511412143707, 'learning_rate': 0.00018260061919504644, 'epoch': 0.09}


  9%|▉         | 287/3235 [05:10<52:35,  1.07s/it]

{'loss': 0.9572, 'grad_norm': 0.5026318430900574, 'learning_rate': 0.00018253869969040248, 'epoch': 0.09}


  9%|▉         | 288/3235 [05:11<49:30,  1.01s/it]

{'loss': 1.0814, 'grad_norm': 0.4704461693763733, 'learning_rate': 0.00018247678018575855, 'epoch': 0.09}


  9%|▉         | 289/3235 [05:12<54:01,  1.10s/it]

{'loss': 0.9788, 'grad_norm': 0.43701067566871643, 'learning_rate': 0.00018241486068111456, 'epoch': 0.09}


  9%|▉         | 290/3235 [05:14<55:59,  1.14s/it]

{'loss': 1.0826, 'grad_norm': 0.4856424629688263, 'learning_rate': 0.0001823529411764706, 'epoch': 0.09}


  9%|▉         | 291/3235 [05:15<50:57,  1.04s/it]

{'loss': 0.8992, 'grad_norm': 0.4539082646369934, 'learning_rate': 0.00018229102167182664, 'epoch': 0.09}


  9%|▉         | 292/3235 [05:16<52:15,  1.07s/it]

{'loss': 1.0838, 'grad_norm': 0.42008334398269653, 'learning_rate': 0.00018222910216718268, 'epoch': 0.09}


  9%|▉         | 293/3235 [05:17<52:27,  1.07s/it]

{'loss': 1.0334, 'grad_norm': 0.48021063208580017, 'learning_rate': 0.00018216718266253872, 'epoch': 0.09}


  9%|▉         | 294/3235 [05:18<54:30,  1.11s/it]

{'loss': 1.2094, 'grad_norm': 0.4582887589931488, 'learning_rate': 0.00018210526315789476, 'epoch': 0.09}


  9%|▉         | 295/3235 [05:19<55:29,  1.13s/it]

{'loss': 1.1295, 'grad_norm': 0.4315063953399658, 'learning_rate': 0.00018204334365325077, 'epoch': 0.09}


  9%|▉         | 296/3235 [05:20<51:31,  1.05s/it]

{'loss': 1.0312, 'grad_norm': 0.5040947794914246, 'learning_rate': 0.0001819814241486068, 'epoch': 0.09}


  9%|▉         | 297/3235 [05:21<50:11,  1.03s/it]

{'loss': 1.072, 'grad_norm': 0.5154882073402405, 'learning_rate': 0.00018191950464396285, 'epoch': 0.09}


  9%|▉         | 298/3235 [05:22<48:05,  1.02it/s]

{'loss': 1.0929, 'grad_norm': 0.4705055356025696, 'learning_rate': 0.0001818575851393189, 'epoch': 0.09}


  9%|▉         | 299/3235 [05:23<46:28,  1.05it/s]

{'loss': 0.9699, 'grad_norm': 0.4763842821121216, 'learning_rate': 0.00018179566563467493, 'epoch': 0.09}


  9%|▉         | 300/3235 [05:24<50:09,  1.03s/it]

{'loss': 1.0184, 'grad_norm': 0.3915202021598816, 'learning_rate': 0.00018173374613003095, 'epoch': 0.09}


  9%|▉         | 301/3235 [05:25<52:55,  1.08s/it]

{'loss': 1.1188, 'grad_norm': 0.4097493588924408, 'learning_rate': 0.00018167182662538699, 'epoch': 0.09}


  9%|▉         | 302/3235 [05:26<52:45,  1.08s/it]

{'loss': 1.0356, 'grad_norm': 0.4534677565097809, 'learning_rate': 0.00018160990712074305, 'epoch': 0.09}


  9%|▉         | 303/3235 [05:27<52:40,  1.08s/it]

{'loss': 1.0723, 'grad_norm': 0.4497092366218567, 'learning_rate': 0.0001815479876160991, 'epoch': 0.09}


  9%|▉         | 304/3235 [05:28<51:35,  1.06s/it]

{'loss': 1.064, 'grad_norm': 0.4699910283088684, 'learning_rate': 0.00018148606811145513, 'epoch': 0.09}


  9%|▉         | 305/3235 [05:29<51:36,  1.06s/it]

{'loss': 1.1174, 'grad_norm': 0.4952399432659149, 'learning_rate': 0.00018142414860681117, 'epoch': 0.09}


  9%|▉         | 306/3235 [05:30<52:34,  1.08s/it]

{'loss': 1.116, 'grad_norm': 0.46013134717941284, 'learning_rate': 0.00018136222910216719, 'epoch': 0.09}


  9%|▉         | 307/3235 [05:32<52:11,  1.07s/it]

{'loss': 1.0169, 'grad_norm': 0.42944371700286865, 'learning_rate': 0.00018130030959752323, 'epoch': 0.09}


 10%|▉         | 308/3235 [05:33<51:34,  1.06s/it]

{'loss': 0.9936, 'grad_norm': 0.4470241367816925, 'learning_rate': 0.00018123839009287927, 'epoch': 0.1}


 10%|▉         | 309/3235 [05:34<53:27,  1.10s/it]

{'loss': 1.2333, 'grad_norm': 0.43253329396247864, 'learning_rate': 0.0001811764705882353, 'epoch': 0.1}


 10%|▉         | 310/3235 [05:35<50:47,  1.04s/it]

{'loss': 0.9898, 'grad_norm': 0.4470554292201996, 'learning_rate': 0.00018111455108359135, 'epoch': 0.1}


 10%|▉         | 311/3235 [05:36<52:52,  1.08s/it]

{'loss': 1.1817, 'grad_norm': 0.4370402693748474, 'learning_rate': 0.00018105263157894739, 'epoch': 0.1}


 10%|▉         | 312/3235 [05:37<52:16,  1.07s/it]

{'loss': 1.2915, 'grad_norm': 0.4850009083747864, 'learning_rate': 0.0001809907120743034, 'epoch': 0.1}


 10%|▉         | 313/3235 [05:38<50:49,  1.04s/it]

{'loss': 0.8973, 'grad_norm': 0.46866098046302795, 'learning_rate': 0.00018092879256965944, 'epoch': 0.1}


 10%|▉         | 314/3235 [05:39<52:30,  1.08s/it]

{'loss': 1.0884, 'grad_norm': 0.39264675974845886, 'learning_rate': 0.00018086687306501548, 'epoch': 0.1}


 10%|▉         | 315/3235 [05:40<56:26,  1.16s/it]

{'loss': 1.1729, 'grad_norm': 0.46160510182380676, 'learning_rate': 0.00018080495356037152, 'epoch': 0.1}


 10%|▉         | 316/3235 [05:41<54:47,  1.13s/it]

{'loss': 1.1416, 'grad_norm': 0.46379798650741577, 'learning_rate': 0.00018074303405572756, 'epoch': 0.1}


 10%|▉         | 317/3235 [05:42<53:37,  1.10s/it]

{'loss': 1.061, 'grad_norm': 0.4580579400062561, 'learning_rate': 0.0001806811145510836, 'epoch': 0.1}


 10%|▉         | 318/3235 [05:44<54:42,  1.13s/it]

{'loss': 1.0191, 'grad_norm': 0.44927066564559937, 'learning_rate': 0.00018061919504643964, 'epoch': 0.1}


 10%|▉         | 319/3235 [05:45<53:18,  1.10s/it]

{'loss': 1.0478, 'grad_norm': 0.44715192914009094, 'learning_rate': 0.00018055727554179568, 'epoch': 0.1}


 10%|▉         | 320/3235 [05:46<54:17,  1.12s/it]

{'loss': 0.9699, 'grad_norm': 0.4499310255050659, 'learning_rate': 0.00018049535603715172, 'epoch': 0.1}


 10%|▉         | 321/3235 [05:47<50:50,  1.05s/it]

{'loss': 1.0287, 'grad_norm': 0.48360979557037354, 'learning_rate': 0.00018043343653250776, 'epoch': 0.1}


 10%|▉         | 322/3235 [05:48<54:51,  1.13s/it]

{'loss': 1.1691, 'grad_norm': 0.4168757498264313, 'learning_rate': 0.0001803715170278638, 'epoch': 0.1}


 10%|▉         | 323/3235 [05:49<53:32,  1.10s/it]

{'loss': 1.1621, 'grad_norm': 0.4628090560436249, 'learning_rate': 0.00018030959752321984, 'epoch': 0.1}


 10%|█         | 324/3235 [05:50<53:04,  1.09s/it]

{'loss': 1.2237, 'grad_norm': 0.4778311252593994, 'learning_rate': 0.00018024767801857585, 'epoch': 0.1}


 10%|█         | 325/3235 [05:51<52:58,  1.09s/it]

{'loss': 1.0162, 'grad_norm': 0.41751304268836975, 'learning_rate': 0.0001801857585139319, 'epoch': 0.1}


 10%|█         | 326/3235 [05:52<49:45,  1.03s/it]

{'loss': 1.104, 'grad_norm': 0.5153523087501526, 'learning_rate': 0.00018012383900928793, 'epoch': 0.1}


 10%|█         | 327/3235 [05:53<53:27,  1.10s/it]

{'loss': 1.0383, 'grad_norm': 0.4551860988140106, 'learning_rate': 0.00018006191950464397, 'epoch': 0.1}


 10%|█         | 328/3235 [05:54<51:36,  1.07s/it]

{'loss': 1.1378, 'grad_norm': 0.4793952703475952, 'learning_rate': 0.00018, 'epoch': 0.1}


 10%|█         | 329/3235 [05:55<51:41,  1.07s/it]

{'loss': 1.1378, 'grad_norm': 0.44087910652160645, 'learning_rate': 0.00017993808049535605, 'epoch': 0.1}


 10%|█         | 330/3235 [05:56<48:13,  1.00it/s]

{'loss': 1.0315, 'grad_norm': 0.505214273929596, 'learning_rate': 0.00017987616099071206, 'epoch': 0.1}


 10%|█         | 331/3235 [05:57<48:03,  1.01it/s]

{'loss': 0.9393, 'grad_norm': 0.48200151324272156, 'learning_rate': 0.0001798142414860681, 'epoch': 0.1}


 10%|█         | 332/3235 [05:58<48:35,  1.00s/it]

{'loss': 1.0574, 'grad_norm': 0.4943074584007263, 'learning_rate': 0.00017975232198142414, 'epoch': 0.1}


 10%|█         | 333/3235 [05:59<50:49,  1.05s/it]

{'loss': 1.0543, 'grad_norm': 0.4315146803855896, 'learning_rate': 0.00017969040247678018, 'epoch': 0.1}


 10%|█         | 334/3235 [06:00<50:28,  1.04s/it]

{'loss': 0.9709, 'grad_norm': 0.44228023290634155, 'learning_rate': 0.00017962848297213622, 'epoch': 0.1}


 10%|█         | 335/3235 [06:02<51:29,  1.07s/it]

{'loss': 0.9512, 'grad_norm': 0.4546314477920532, 'learning_rate': 0.00017956656346749226, 'epoch': 0.1}


 10%|█         | 336/3235 [06:03<50:31,  1.05s/it]

{'loss': 1.0195, 'grad_norm': 0.4513036012649536, 'learning_rate': 0.0001795046439628483, 'epoch': 0.1}


 10%|█         | 337/3235 [06:03<48:28,  1.00s/it]

{'loss': 0.8768, 'grad_norm': 0.4669680595397949, 'learning_rate': 0.00017944272445820434, 'epoch': 0.1}


 10%|█         | 338/3235 [06:04<47:28,  1.02it/s]

{'loss': 1.104, 'grad_norm': 0.4785921573638916, 'learning_rate': 0.00017938080495356038, 'epoch': 0.1}


 10%|█         | 339/3235 [06:06<50:55,  1.06s/it]

{'loss': 1.07, 'grad_norm': 0.4524392783641815, 'learning_rate': 0.00017931888544891642, 'epoch': 0.1}


 11%|█         | 340/3235 [06:07<51:38,  1.07s/it]

{'loss': 1.2394, 'grad_norm': 0.4706572890281677, 'learning_rate': 0.00017925696594427246, 'epoch': 0.11}


 11%|█         | 341/3235 [06:08<54:28,  1.13s/it]

{'loss': 1.0091, 'grad_norm': 0.45660075545310974, 'learning_rate': 0.0001791950464396285, 'epoch': 0.11}


 11%|█         | 342/3235 [06:09<51:23,  1.07s/it]

{'loss': 1.0874, 'grad_norm': 0.48065119981765747, 'learning_rate': 0.00017913312693498452, 'epoch': 0.11}


 11%|█         | 343/3235 [06:10<48:38,  1.01s/it]

{'loss': 1.1957, 'grad_norm': 0.47186192870140076, 'learning_rate': 0.00017907120743034056, 'epoch': 0.11}


 11%|█         | 344/3235 [06:11<52:29,  1.09s/it]

{'loss': 1.0436, 'grad_norm': 0.4017937183380127, 'learning_rate': 0.0001790092879256966, 'epoch': 0.11}


 11%|█         | 345/3235 [06:12<51:41,  1.07s/it]

{'loss': 1.2161, 'grad_norm': 0.5010197758674622, 'learning_rate': 0.00017894736842105264, 'epoch': 0.11}


 11%|█         | 346/3235 [06:13<53:14,  1.11s/it]

{'loss': 1.0923, 'grad_norm': 0.4409404993057251, 'learning_rate': 0.00017888544891640868, 'epoch': 0.11}


 11%|█         | 347/3235 [06:14<51:54,  1.08s/it]

{'loss': 0.9044, 'grad_norm': 0.4907321631908417, 'learning_rate': 0.00017882352941176472, 'epoch': 0.11}


 11%|█         | 348/3235 [06:16<54:37,  1.14s/it]

{'loss': 0.9309, 'grad_norm': 0.4281328618526459, 'learning_rate': 0.00017876160990712073, 'epoch': 0.11}


 11%|█         | 349/3235 [06:17<52:37,  1.09s/it]

{'loss': 1.0539, 'grad_norm': 0.48910942673683167, 'learning_rate': 0.00017869969040247677, 'epoch': 0.11}


 11%|█         | 350/3235 [06:18<53:53,  1.12s/it]

{'loss': 1.1044, 'grad_norm': 0.4249371588230133, 'learning_rate': 0.0001786377708978328, 'epoch': 0.11}


 11%|█         | 351/3235 [06:19<54:02,  1.12s/it]

{'loss': 1.2053, 'grad_norm': 0.4798548221588135, 'learning_rate': 0.00017857585139318888, 'epoch': 0.11}


 11%|█         | 352/3235 [06:20<52:34,  1.09s/it]

{'loss': 1.0024, 'grad_norm': 0.45164981484413147, 'learning_rate': 0.00017851393188854492, 'epoch': 0.11}


 11%|█         | 353/3235 [06:21<51:32,  1.07s/it]

{'loss': 0.9988, 'grad_norm': 0.45937612652778625, 'learning_rate': 0.00017845201238390096, 'epoch': 0.11}


 11%|█         | 354/3235 [06:22<54:17,  1.13s/it]

{'loss': 1.214, 'grad_norm': 0.4886884093284607, 'learning_rate': 0.00017839009287925697, 'epoch': 0.11}


 11%|█         | 355/3235 [06:24<56:58,  1.19s/it]

{'loss': 1.1178, 'grad_norm': 0.438447505235672, 'learning_rate': 0.000178328173374613, 'epoch': 0.11}


 11%|█         | 356/3235 [06:25<56:08,  1.17s/it]

{'loss': 0.8695, 'grad_norm': 0.4482799172401428, 'learning_rate': 0.00017826625386996905, 'epoch': 0.11}


 11%|█         | 357/3235 [06:26<54:46,  1.14s/it]

{'loss': 1.0565, 'grad_norm': 0.5167585015296936, 'learning_rate': 0.0001782043343653251, 'epoch': 0.11}


 11%|█         | 358/3235 [06:27<55:27,  1.16s/it]

{'loss': 0.9837, 'grad_norm': 0.41219595074653625, 'learning_rate': 0.00017814241486068113, 'epoch': 0.11}


 11%|█         | 359/3235 [06:28<57:08,  1.19s/it]

{'loss': 1.0374, 'grad_norm': 0.4369116723537445, 'learning_rate': 0.00017808049535603717, 'epoch': 0.11}


 11%|█         | 360/3235 [06:29<54:36,  1.14s/it]

{'loss': 1.0882, 'grad_norm': 0.49223053455352783, 'learning_rate': 0.00017801857585139318, 'epoch': 0.11}


 11%|█         | 361/3235 [06:30<55:16,  1.15s/it]

{'loss': 0.9142, 'grad_norm': 0.45403212308883667, 'learning_rate': 0.00017795665634674922, 'epoch': 0.11}


 11%|█         | 362/3235 [06:31<52:26,  1.10s/it]

{'loss': 1.0112, 'grad_norm': 0.48392215371131897, 'learning_rate': 0.00017789473684210526, 'epoch': 0.11}


 11%|█         | 363/3235 [06:33<53:56,  1.13s/it]

{'loss': 1.0826, 'grad_norm': 0.43945515155792236, 'learning_rate': 0.0001778328173374613, 'epoch': 0.11}


 11%|█▏        | 364/3235 [06:34<52:51,  1.10s/it]

{'loss': 1.1155, 'grad_norm': 0.4319715201854706, 'learning_rate': 0.00017777089783281734, 'epoch': 0.11}


 11%|█▏        | 365/3235 [06:34<46:16,  1.03it/s]

{'loss': 0.9353, 'grad_norm': 0.5370807647705078, 'learning_rate': 0.00017770897832817338, 'epoch': 0.11}


 11%|█▏        | 366/3235 [06:36<51:40,  1.08s/it]

{'loss': 1.1514, 'grad_norm': 0.4597385823726654, 'learning_rate': 0.00017764705882352942, 'epoch': 0.11}


 11%|█▏        | 367/3235 [06:37<53:42,  1.12s/it]

{'loss': 1.0234, 'grad_norm': 0.4415033757686615, 'learning_rate': 0.00017758513931888546, 'epoch': 0.11}


 11%|█▏        | 368/3235 [06:38<54:22,  1.14s/it]

{'loss': 1.1548, 'grad_norm': 0.44446489214897156, 'learning_rate': 0.0001775232198142415, 'epoch': 0.11}


 11%|█▏        | 369/3235 [06:39<52:19,  1.10s/it]

{'loss': 1.0252, 'grad_norm': 0.4886744022369385, 'learning_rate': 0.00017746130030959754, 'epoch': 0.11}


 11%|█▏        | 370/3235 [06:40<46:33,  1.03it/s]

{'loss': 0.947, 'grad_norm': 0.5174376964569092, 'learning_rate': 0.00017739938080495358, 'epoch': 0.11}


 11%|█▏        | 371/3235 [06:41<51:18,  1.07s/it]

{'loss': 0.996, 'grad_norm': 0.3936271369457245, 'learning_rate': 0.00017733746130030962, 'epoch': 0.11}


 11%|█▏        | 372/3235 [06:42<54:19,  1.14s/it]

{'loss': 1.147, 'grad_norm': 0.4337683320045471, 'learning_rate': 0.00017727554179566564, 'epoch': 0.11}


 12%|█▏        | 373/3235 [06:43<53:11,  1.12s/it]

{'loss': 1.1087, 'grad_norm': 0.5147650837898254, 'learning_rate': 0.00017721362229102168, 'epoch': 0.12}


 12%|█▏        | 374/3235 [06:45<55:23,  1.16s/it]

{'loss': 1.2433, 'grad_norm': 0.4981346130371094, 'learning_rate': 0.00017715170278637772, 'epoch': 0.12}


 12%|█▏        | 375/3235 [06:46<53:08,  1.11s/it]

{'loss': 1.0805, 'grad_norm': 0.5264873504638672, 'learning_rate': 0.00017708978328173376, 'epoch': 0.12}


 12%|█▏        | 376/3235 [06:47<59:06,  1.24s/it]

{'loss': 0.9709, 'grad_norm': 0.3963066339492798, 'learning_rate': 0.0001770278637770898, 'epoch': 0.12}


 12%|█▏        | 377/3235 [06:48<58:05,  1.22s/it]

{'loss': 1.104, 'grad_norm': 0.4654442369937897, 'learning_rate': 0.00017696594427244584, 'epoch': 0.12}


 12%|█▏        | 378/3235 [06:49<53:42,  1.13s/it]

{'loss': 0.9935, 'grad_norm': 0.5076888799667358, 'learning_rate': 0.00017690402476780185, 'epoch': 0.12}


 12%|█▏        | 379/3235 [06:50<53:22,  1.12s/it]

{'loss': 0.9946, 'grad_norm': 0.49438410997390747, 'learning_rate': 0.0001768421052631579, 'epoch': 0.12}


 12%|█▏        | 380/3235 [06:51<50:14,  1.06s/it]

{'loss': 1.0073, 'grad_norm': 0.47196242213249207, 'learning_rate': 0.00017678018575851393, 'epoch': 0.12}


 12%|█▏        | 381/3235 [06:52<50:00,  1.05s/it]

{'loss': 1.1975, 'grad_norm': 0.4995451271533966, 'learning_rate': 0.00017671826625386997, 'epoch': 0.12}


 12%|█▏        | 382/3235 [06:53<48:08,  1.01s/it]

{'loss': 1.0198, 'grad_norm': 0.4552617371082306, 'learning_rate': 0.000176656346749226, 'epoch': 0.12}


 12%|█▏        | 383/3235 [06:54<46:01,  1.03it/s]

{'loss': 1.1729, 'grad_norm': 0.6840447187423706, 'learning_rate': 0.00017659442724458205, 'epoch': 0.12}


 12%|█▏        | 384/3235 [06:55<50:04,  1.05s/it]

{'loss': 1.007, 'grad_norm': 0.43049535155296326, 'learning_rate': 0.0001765325077399381, 'epoch': 0.12}


 12%|█▏        | 385/3235 [06:56<49:30,  1.04s/it]

{'loss': 0.9667, 'grad_norm': 0.4384138286113739, 'learning_rate': 0.00017647058823529413, 'epoch': 0.12}


 12%|█▏        | 386/3235 [06:58<51:01,  1.07s/it]

{'loss': 0.9771, 'grad_norm': 0.40528514981269836, 'learning_rate': 0.00017640866873065017, 'epoch': 0.12}


 12%|█▏        | 387/3235 [06:59<52:24,  1.10s/it]

{'loss': 1.1299, 'grad_norm': 0.41673439741134644, 'learning_rate': 0.0001763467492260062, 'epoch': 0.12}


 12%|█▏        | 388/3235 [07:00<49:06,  1.03s/it]

{'loss': 0.954, 'grad_norm': 0.5134264826774597, 'learning_rate': 0.00017628482972136225, 'epoch': 0.12}


 12%|█▏        | 389/3235 [07:01<49:38,  1.05s/it]

{'loss': 1.1707, 'grad_norm': 0.47499364614486694, 'learning_rate': 0.0001762229102167183, 'epoch': 0.12}


 12%|█▏        | 390/3235 [07:02<50:16,  1.06s/it]

{'loss': 0.9072, 'grad_norm': 0.46043816208839417, 'learning_rate': 0.0001761609907120743, 'epoch': 0.12}


 12%|█▏        | 391/3235 [07:03<53:22,  1.13s/it]

{'loss': 0.9759, 'grad_norm': 0.4224616289138794, 'learning_rate': 0.00017609907120743034, 'epoch': 0.12}


 12%|█▏        | 392/3235 [07:04<51:48,  1.09s/it]

{'loss': 0.9741, 'grad_norm': 0.4604766070842743, 'learning_rate': 0.00017603715170278638, 'epoch': 0.12}


 12%|█▏        | 393/3235 [07:05<51:19,  1.08s/it]

{'loss': 1.0198, 'grad_norm': 0.4911970794200897, 'learning_rate': 0.00017597523219814242, 'epoch': 0.12}


 12%|█▏        | 394/3235 [07:06<52:32,  1.11s/it]

{'loss': 0.9964, 'grad_norm': 0.4452950060367584, 'learning_rate': 0.00017591331269349846, 'epoch': 0.12}


 12%|█▏        | 395/3235 [07:07<52:55,  1.12s/it]

{'loss': 1.1029, 'grad_norm': 0.4133904278278351, 'learning_rate': 0.0001758513931888545, 'epoch': 0.12}


 12%|█▏        | 396/3235 [07:08<48:56,  1.03s/it]

{'loss': 1.2304, 'grad_norm': 0.514706552028656, 'learning_rate': 0.00017578947368421052, 'epoch': 0.12}


 12%|█▏        | 397/3235 [07:09<51:21,  1.09s/it]

{'loss': 1.1132, 'grad_norm': 0.4302632808685303, 'learning_rate': 0.00017572755417956656, 'epoch': 0.12}


 12%|█▏        | 398/3235 [07:11<51:27,  1.09s/it]

{'loss': 0.9124, 'grad_norm': 0.4495971202850342, 'learning_rate': 0.0001756656346749226, 'epoch': 0.12}


 12%|█▏        | 399/3235 [07:12<50:55,  1.08s/it]

{'loss': 0.9364, 'grad_norm': 0.4574093222618103, 'learning_rate': 0.00017560371517027864, 'epoch': 0.12}


 12%|█▏        | 400/3235 [07:13<50:28,  1.07s/it]

{'loss': 0.9995, 'grad_norm': 0.46104469895362854, 'learning_rate': 0.0001755417956656347, 'epoch': 0.12}


 12%|█▏        | 401/3235 [07:14<50:38,  1.07s/it]

{'loss': 1.136, 'grad_norm': 0.44788238406181335, 'learning_rate': 0.00017547987616099074, 'epoch': 0.12}


 12%|█▏        | 402/3235 [07:15<49:59,  1.06s/it]

{'loss': 1.0827, 'grad_norm': 0.5206626057624817, 'learning_rate': 0.00017541795665634676, 'epoch': 0.12}


 12%|█▏        | 403/3235 [07:16<49:03,  1.04s/it]

{'loss': 1.0218, 'grad_norm': 0.47797778248786926, 'learning_rate': 0.0001753560371517028, 'epoch': 0.12}


 12%|█▏        | 404/3235 [07:17<49:13,  1.04s/it]

{'loss': 1.0677, 'grad_norm': 0.4295448958873749, 'learning_rate': 0.00017529411764705884, 'epoch': 0.12}


 13%|█▎        | 405/3235 [07:18<49:09,  1.04s/it]

{'loss': 1.0394, 'grad_norm': 0.4553869366645813, 'learning_rate': 0.00017523219814241488, 'epoch': 0.13}


 13%|█▎        | 406/3235 [07:19<51:11,  1.09s/it]

{'loss': 0.9947, 'grad_norm': 0.4464775621891022, 'learning_rate': 0.00017517027863777092, 'epoch': 0.13}


 13%|█▎        | 407/3235 [07:20<53:53,  1.14s/it]

{'loss': 0.9634, 'grad_norm': 0.4229426681995392, 'learning_rate': 0.00017510835913312696, 'epoch': 0.13}


 13%|█▎        | 408/3235 [07:21<50:08,  1.06s/it]

{'loss': 1.0198, 'grad_norm': 0.5217769742012024, 'learning_rate': 0.00017504643962848297, 'epoch': 0.13}


 13%|█▎        | 409/3235 [07:22<49:10,  1.04s/it]

{'loss': 1.0295, 'grad_norm': 0.4618252217769623, 'learning_rate': 0.000174984520123839, 'epoch': 0.13}


 13%|█▎        | 410/3235 [07:24<53:16,  1.13s/it]

{'loss': 1.1422, 'grad_norm': 0.4345448911190033, 'learning_rate': 0.00017492260061919505, 'epoch': 0.13}


 13%|█▎        | 411/3235 [07:24<48:41,  1.03s/it]

{'loss': 1.1221, 'grad_norm': 0.5083858966827393, 'learning_rate': 0.0001748606811145511, 'epoch': 0.13}


 13%|█▎        | 412/3235 [07:25<47:23,  1.01s/it]

{'loss': 1.0747, 'grad_norm': 0.4614167809486389, 'learning_rate': 0.00017479876160990713, 'epoch': 0.13}


 13%|█▎        | 413/3235 [07:26<48:35,  1.03s/it]

{'loss': 0.9929, 'grad_norm': 0.43892720341682434, 'learning_rate': 0.00017473684210526317, 'epoch': 0.13}


 13%|█▎        | 414/3235 [07:28<52:45,  1.12s/it]

{'loss': 0.9255, 'grad_norm': 0.41471806168556213, 'learning_rate': 0.0001746749226006192, 'epoch': 0.13}


 13%|█▎        | 415/3235 [07:29<54:00,  1.15s/it]

{'loss': 1.0479, 'grad_norm': 0.39907601475715637, 'learning_rate': 0.00017461300309597525, 'epoch': 0.13}


 13%|█▎        | 416/3235 [07:30<53:12,  1.13s/it]

{'loss': 1.0265, 'grad_norm': 0.47164031863212585, 'learning_rate': 0.0001745510835913313, 'epoch': 0.13}


 13%|█▎        | 417/3235 [07:31<51:21,  1.09s/it]

{'loss': 1.1199, 'grad_norm': 0.4688468873500824, 'learning_rate': 0.00017448916408668733, 'epoch': 0.13}


 13%|█▎        | 418/3235 [07:32<54:52,  1.17s/it]

{'loss': 1.0623, 'grad_norm': 0.40509140491485596, 'learning_rate': 0.00017442724458204337, 'epoch': 0.13}


 13%|█▎        | 419/3235 [07:33<50:53,  1.08s/it]

{'loss': 1.1239, 'grad_norm': 0.49022427201271057, 'learning_rate': 0.0001743653250773994, 'epoch': 0.13}


 13%|█▎        | 420/3235 [07:34<50:22,  1.07s/it]

{'loss': 0.9995, 'grad_norm': 0.428762823343277, 'learning_rate': 0.00017430340557275542, 'epoch': 0.13}


 13%|█▎        | 421/3235 [07:35<47:26,  1.01s/it]

{'loss': 1.0678, 'grad_norm': 0.5062124133110046, 'learning_rate': 0.00017424148606811146, 'epoch': 0.13}


 13%|█▎        | 422/3235 [07:36<47:05,  1.00s/it]

{'loss': 1.0939, 'grad_norm': 0.4841926097869873, 'learning_rate': 0.0001741795665634675, 'epoch': 0.13}


 13%|█▎        | 423/3235 [07:37<46:50,  1.00it/s]

{'loss': 1.0409, 'grad_norm': 0.45915910601615906, 'learning_rate': 0.00017411764705882354, 'epoch': 0.13}


 13%|█▎        | 424/3235 [07:38<47:23,  1.01s/it]

{'loss': 0.9958, 'grad_norm': 0.4766846299171448, 'learning_rate': 0.00017405572755417958, 'epoch': 0.13}


 13%|█▎        | 425/3235 [07:39<46:50,  1.00s/it]

{'loss': 0.9583, 'grad_norm': 0.45480218529701233, 'learning_rate': 0.00017399380804953562, 'epoch': 0.13}


 13%|█▎        | 426/3235 [07:40<48:20,  1.03s/it]

{'loss': 0.9762, 'grad_norm': 0.40440329909324646, 'learning_rate': 0.00017393188854489163, 'epoch': 0.13}


 13%|█▎        | 427/3235 [07:41<46:46,  1.00it/s]

{'loss': 1.0065, 'grad_norm': 0.4523254334926605, 'learning_rate': 0.00017386996904024767, 'epoch': 0.13}


 13%|█▎        | 428/3235 [07:42<43:40,  1.07it/s]

{'loss': 1.0644, 'grad_norm': 0.511111855506897, 'learning_rate': 0.00017380804953560371, 'epoch': 0.13}


 13%|█▎        | 429/3235 [07:43<47:44,  1.02s/it]

{'loss': 1.1847, 'grad_norm': 0.4311838150024414, 'learning_rate': 0.00017374613003095975, 'epoch': 0.13}


 13%|█▎        | 430/3235 [07:44<47:58,  1.03s/it]

{'loss': 0.9819, 'grad_norm': 0.41411685943603516, 'learning_rate': 0.0001736842105263158, 'epoch': 0.13}


 13%|█▎        | 431/3235 [07:45<50:17,  1.08s/it]

{'loss': 1.1206, 'grad_norm': 0.4676339328289032, 'learning_rate': 0.00017362229102167183, 'epoch': 0.13}


 13%|█▎        | 432/3235 [07:46<48:45,  1.04s/it]

{'loss': 1.0853, 'grad_norm': 0.4781356155872345, 'learning_rate': 0.00017356037151702787, 'epoch': 0.13}


 13%|█▎        | 433/3235 [07:47<48:53,  1.05s/it]

{'loss': 1.0259, 'grad_norm': 0.4748821258544922, 'learning_rate': 0.00017349845201238391, 'epoch': 0.13}


 13%|█▎        | 434/3235 [07:48<45:39,  1.02it/s]

{'loss': 1.0547, 'grad_norm': 0.47715485095977783, 'learning_rate': 0.00017343653250773995, 'epoch': 0.13}


 13%|█▎        | 435/3235 [07:50<51:14,  1.10s/it]

{'loss': 1.0777, 'grad_norm': 0.44187024235725403, 'learning_rate': 0.000173374613003096, 'epoch': 0.13}


 13%|█▎        | 436/3235 [07:51<53:13,  1.14s/it]

{'loss': 1.2032, 'grad_norm': 0.412697434425354, 'learning_rate': 0.00017331269349845203, 'epoch': 0.13}


 14%|█▎        | 437/3235 [07:52<53:55,  1.16s/it]

{'loss': 1.1239, 'grad_norm': 0.46139129996299744, 'learning_rate': 0.00017325077399380807, 'epoch': 0.14}


 14%|█▎        | 438/3235 [07:53<54:23,  1.17s/it]

{'loss': 1.0385, 'grad_norm': 0.42146825790405273, 'learning_rate': 0.0001731888544891641, 'epoch': 0.14}


 14%|█▎        | 439/3235 [07:54<51:06,  1.10s/it]

{'loss': 1.153, 'grad_norm': 0.5224251747131348, 'learning_rate': 0.00017312693498452013, 'epoch': 0.14}


 14%|█▎        | 440/3235 [07:55<51:49,  1.11s/it]

{'loss': 1.1368, 'grad_norm': 0.5157164335250854, 'learning_rate': 0.00017306501547987617, 'epoch': 0.14}


 14%|█▎        | 441/3235 [07:56<50:27,  1.08s/it]

{'loss': 1.0731, 'grad_norm': 0.5042895674705505, 'learning_rate': 0.0001730030959752322, 'epoch': 0.14}


 14%|█▎        | 442/3235 [07:57<47:58,  1.03s/it]

{'loss': 0.9982, 'grad_norm': 0.47555816173553467, 'learning_rate': 0.00017294117647058825, 'epoch': 0.14}


 14%|█▎        | 443/3235 [07:58<48:20,  1.04s/it]

{'loss': 0.9586, 'grad_norm': 0.4622500240802765, 'learning_rate': 0.0001728792569659443, 'epoch': 0.14}


 14%|█▎        | 444/3235 [07:59<45:36,  1.02it/s]

{'loss': 0.9356, 'grad_norm': 0.5115309357643127, 'learning_rate': 0.0001728173374613003, 'epoch': 0.14}


 14%|█▍        | 445/3235 [08:00<49:58,  1.07s/it]

{'loss': 0.9258, 'grad_norm': 0.4086318612098694, 'learning_rate': 0.00017275541795665634, 'epoch': 0.14}


 14%|█▍        | 446/3235 [08:02<52:37,  1.13s/it]

{'loss': 1.0071, 'grad_norm': 0.4517531394958496, 'learning_rate': 0.00017269349845201238, 'epoch': 0.14}


 14%|█▍        | 447/3235 [08:03<56:34,  1.22s/it]

{'loss': 1.1338, 'grad_norm': 0.4452888071537018, 'learning_rate': 0.00017263157894736842, 'epoch': 0.14}


 14%|█▍        | 448/3235 [08:04<56:59,  1.23s/it]

{'loss': 1.1039, 'grad_norm': 0.4314531981945038, 'learning_rate': 0.0001725696594427245, 'epoch': 0.14}


 14%|█▍        | 449/3235 [08:05<53:50,  1.16s/it]

{'loss': 1.0771, 'grad_norm': 0.5096027255058289, 'learning_rate': 0.0001725077399380805, 'epoch': 0.14}


 14%|█▍        | 450/3235 [08:06<49:49,  1.07s/it]

{'loss': 0.9312, 'grad_norm': 0.45852118730545044, 'learning_rate': 0.00017244582043343654, 'epoch': 0.14}


 14%|█▍        | 451/3235 [08:07<50:33,  1.09s/it]

{'loss': 1.1623, 'grad_norm': 0.4652368724346161, 'learning_rate': 0.00017238390092879258, 'epoch': 0.14}


 14%|█▍        | 452/3235 [08:08<49:34,  1.07s/it]

{'loss': 1.073, 'grad_norm': 0.4653695821762085, 'learning_rate': 0.00017232198142414862, 'epoch': 0.14}


 14%|█▍        | 453/3235 [08:09<49:18,  1.06s/it]

{'loss': 1.0847, 'grad_norm': 0.42813003063201904, 'learning_rate': 0.00017226006191950466, 'epoch': 0.14}


 14%|█▍        | 454/3235 [08:11<54:06,  1.17s/it]

{'loss': 1.093, 'grad_norm': 0.44426771998405457, 'learning_rate': 0.0001721981424148607, 'epoch': 0.14}


 14%|█▍        | 455/3235 [08:12<52:38,  1.14s/it]

{'loss': 0.9046, 'grad_norm': 0.4376387596130371, 'learning_rate': 0.0001721362229102167, 'epoch': 0.14}


 14%|█▍        | 456/3235 [08:13<48:48,  1.05s/it]

{'loss': 1.0748, 'grad_norm': 0.4912249445915222, 'learning_rate': 0.00017207430340557275, 'epoch': 0.14}


 14%|█▍        | 457/3235 [08:14<46:11,  1.00it/s]

{'loss': 1.0862, 'grad_norm': 0.5331355333328247, 'learning_rate': 0.0001720123839009288, 'epoch': 0.14}


 14%|█▍        | 458/3235 [08:14<43:59,  1.05it/s]

{'loss': 1.0013, 'grad_norm': 0.470041424036026, 'learning_rate': 0.00017195046439628483, 'epoch': 0.14}


 14%|█▍        | 459/3235 [08:16<49:38,  1.07s/it]

{'loss': 1.3066, 'grad_norm': 0.4532346725463867, 'learning_rate': 0.00017188854489164087, 'epoch': 0.14}


 14%|█▍        | 460/3235 [08:17<48:55,  1.06s/it]

{'loss': 1.0833, 'grad_norm': 0.4743482172489166, 'learning_rate': 0.0001718266253869969, 'epoch': 0.14}


 14%|█▍        | 461/3235 [08:18<49:08,  1.06s/it]

{'loss': 1.0793, 'grad_norm': 0.4607042968273163, 'learning_rate': 0.00017176470588235293, 'epoch': 0.14}


 14%|█▍        | 462/3235 [08:19<50:51,  1.10s/it]

{'loss': 1.0228, 'grad_norm': 0.4110008776187897, 'learning_rate': 0.00017170278637770897, 'epoch': 0.14}


 14%|█▍        | 463/3235 [08:20<51:17,  1.11s/it]

{'loss': 0.9341, 'grad_norm': 0.4150453507900238, 'learning_rate': 0.00017164086687306503, 'epoch': 0.14}


 14%|█▍        | 464/3235 [08:22<53:27,  1.16s/it]

{'loss': 1.079, 'grad_norm': 0.45077571272850037, 'learning_rate': 0.00017157894736842107, 'epoch': 0.14}


 14%|█▍        | 465/3235 [08:23<56:07,  1.22s/it]

{'loss': 1.009, 'grad_norm': 0.4125334918498993, 'learning_rate': 0.0001715170278637771, 'epoch': 0.14}


 14%|█▍        | 466/3235 [08:24<56:33,  1.23s/it]

{'loss': 0.9153, 'grad_norm': 0.408729612827301, 'learning_rate': 0.00017145510835913315, 'epoch': 0.14}


 14%|█▍        | 467/3235 [08:25<52:25,  1.14s/it]

{'loss': 0.9952, 'grad_norm': 0.4622020125389099, 'learning_rate': 0.00017139318885448917, 'epoch': 0.14}


 14%|█▍        | 468/3235 [08:26<52:59,  1.15s/it]

{'loss': 0.9903, 'grad_norm': 0.3924243450164795, 'learning_rate': 0.0001713312693498452, 'epoch': 0.14}


 14%|█▍        | 469/3235 [08:27<49:51,  1.08s/it]

{'loss': 1.0503, 'grad_norm': 0.469462513923645, 'learning_rate': 0.00017126934984520125, 'epoch': 0.14}


 15%|█▍        | 470/3235 [08:28<48:34,  1.05s/it]

{'loss': 0.9639, 'grad_norm': 0.5232253670692444, 'learning_rate': 0.00017120743034055729, 'epoch': 0.15}


 15%|█▍        | 471/3235 [08:29<46:22,  1.01s/it]

{'loss': 1.0717, 'grad_norm': 0.5344030261039734, 'learning_rate': 0.00017114551083591333, 'epoch': 0.15}


 15%|█▍        | 472/3235 [08:30<52:03,  1.13s/it]

{'loss': 1.0425, 'grad_norm': 0.38921019434928894, 'learning_rate': 0.00017108359133126937, 'epoch': 0.15}


 15%|█▍        | 473/3235 [08:32<54:07,  1.18s/it]

{'loss': 1.088, 'grad_norm': 0.4371280372142792, 'learning_rate': 0.00017102167182662538, 'epoch': 0.15}


 15%|█▍        | 474/3235 [08:33<50:45,  1.10s/it]

{'loss': 0.9505, 'grad_norm': 0.4552868902683258, 'learning_rate': 0.00017095975232198142, 'epoch': 0.15}


 15%|█▍        | 475/3235 [08:34<49:35,  1.08s/it]

{'loss': 1.0029, 'grad_norm': 0.5538909435272217, 'learning_rate': 0.00017089783281733746, 'epoch': 0.15}


 15%|█▍        | 476/3235 [08:35<48:23,  1.05s/it]

{'loss': 0.9918, 'grad_norm': 0.5040006041526794, 'learning_rate': 0.0001708359133126935, 'epoch': 0.15}


 15%|█▍        | 477/3235 [08:36<47:19,  1.03s/it]

{'loss': 1.1138, 'grad_norm': 0.4561942219734192, 'learning_rate': 0.00017077399380804954, 'epoch': 0.15}


 15%|█▍        | 478/3235 [08:37<51:38,  1.12s/it]

{'loss': 1.0768, 'grad_norm': 0.4593261182308197, 'learning_rate': 0.00017071207430340558, 'epoch': 0.15}


 15%|█▍        | 479/3235 [08:38<49:18,  1.07s/it]

{'loss': 1.1788, 'grad_norm': 0.43778032064437866, 'learning_rate': 0.00017065015479876162, 'epoch': 0.15}


 15%|█▍        | 480/3235 [08:39<49:25,  1.08s/it]

{'loss': 1.1338, 'grad_norm': 0.4515131413936615, 'learning_rate': 0.00017058823529411766, 'epoch': 0.15}


 15%|█▍        | 481/3235 [08:40<50:46,  1.11s/it]

{'loss': 1.0012, 'grad_norm': 0.4023047387599945, 'learning_rate': 0.0001705263157894737, 'epoch': 0.15}


 15%|█▍        | 482/3235 [08:41<44:51,  1.02it/s]

{'loss': 1.0105, 'grad_norm': 0.5551360845565796, 'learning_rate': 0.00017046439628482974, 'epoch': 0.15}


 15%|█▍        | 483/3235 [08:42<42:25,  1.08it/s]

{'loss': 0.9361, 'grad_norm': 0.5089658498764038, 'learning_rate': 0.00017040247678018578, 'epoch': 0.15}


 15%|█▍        | 484/3235 [08:43<44:22,  1.03it/s]

{'loss': 1.0258, 'grad_norm': 0.4322760999202728, 'learning_rate': 0.00017034055727554182, 'epoch': 0.15}


 15%|█▍        | 485/3235 [08:44<46:47,  1.02s/it]

{'loss': 1.0129, 'grad_norm': 0.41547006368637085, 'learning_rate': 0.00017027863777089783, 'epoch': 0.15}


 15%|█▌        | 486/3235 [08:45<45:19,  1.01it/s]

{'loss': 1.0683, 'grad_norm': 0.5360144376754761, 'learning_rate': 0.00017021671826625387, 'epoch': 0.15}


 15%|█▌        | 487/3235 [08:46<47:07,  1.03s/it]

{'loss': 1.077, 'grad_norm': 0.5326995253562927, 'learning_rate': 0.0001701547987616099, 'epoch': 0.15}


 15%|█▌        | 488/3235 [08:47<46:07,  1.01s/it]

{'loss': 1.012, 'grad_norm': 0.505562424659729, 'learning_rate': 0.00017009287925696595, 'epoch': 0.15}


 15%|█▌        | 489/3235 [08:48<45:20,  1.01it/s]

{'loss': 1.0486, 'grad_norm': 0.47963231801986694, 'learning_rate': 0.000170030959752322, 'epoch': 0.15}


 15%|█▌        | 490/3235 [08:49<47:51,  1.05s/it]

{'loss': 1.047, 'grad_norm': 0.45525896549224854, 'learning_rate': 0.00016996904024767803, 'epoch': 0.15}


 15%|█▌        | 491/3235 [08:50<46:53,  1.03s/it]

{'loss': 0.9623, 'grad_norm': 0.4821057915687561, 'learning_rate': 0.00016990712074303405, 'epoch': 0.15}


 15%|█▌        | 492/3235 [08:51<46:56,  1.03s/it]

{'loss': 1.0854, 'grad_norm': 0.4605782926082611, 'learning_rate': 0.00016984520123839009, 'epoch': 0.15}


 15%|█▌        | 493/3235 [08:52<51:58,  1.14s/it]

{'loss': 1.2501, 'grad_norm': 0.44438108801841736, 'learning_rate': 0.00016978328173374613, 'epoch': 0.15}


 15%|█▌        | 494/3235 [08:54<52:20,  1.15s/it]

{'loss': 0.9032, 'grad_norm': 0.41385847330093384, 'learning_rate': 0.00016972136222910217, 'epoch': 0.15}


 15%|█▌        | 495/3235 [08:55<49:44,  1.09s/it]

{'loss': 1.0725, 'grad_norm': 0.48123934864997864, 'learning_rate': 0.0001696594427244582, 'epoch': 0.15}


 15%|█▌        | 496/3235 [08:56<49:41,  1.09s/it]

{'loss': 1.1278, 'grad_norm': 0.4919598698616028, 'learning_rate': 0.00016959752321981425, 'epoch': 0.15}


 15%|█▌        | 497/3235 [08:57<52:38,  1.15s/it]

{'loss': 0.9909, 'grad_norm': 0.4744219481945038, 'learning_rate': 0.00016953560371517029, 'epoch': 0.15}


 15%|█▌        | 498/3235 [08:58<55:01,  1.21s/it]

{'loss': 1.1555, 'grad_norm': 0.4418342709541321, 'learning_rate': 0.00016947368421052633, 'epoch': 0.15}


 15%|█▌        | 499/3235 [08:59<53:57,  1.18s/it]

{'loss': 1.0186, 'grad_norm': 0.3970569968223572, 'learning_rate': 0.00016941176470588237, 'epoch': 0.15}


 15%|█▌        | 500/3235 [09:00<45:54,  1.01s/it]

{'loss': 0.935, 'grad_norm': 0.6065307259559631, 'learning_rate': 0.0001693498452012384, 'epoch': 0.15}


[34m[1mwandb[0m: Adding directory to artifact (./outputs/checkpoint-500)... Done. 0.2s
 15%|█▌        | 501/3235 [09:02<55:15,  1.21s/it]

{'loss': 1.0431, 'grad_norm': 0.4895930290222168, 'learning_rate': 0.00016928792569659445, 'epoch': 0.15}


 16%|█▌        | 502/3235 [09:03<55:33,  1.22s/it]

{'loss': 1.049, 'grad_norm': 0.4345063865184784, 'learning_rate': 0.00016922600619195049, 'epoch': 0.16}


 16%|█▌        | 503/3235 [09:04<53:28,  1.17s/it]

{'loss': 1.1209, 'grad_norm': 0.4723479449748993, 'learning_rate': 0.0001691640866873065, 'epoch': 0.16}


 16%|█▌        | 504/3235 [09:05<50:04,  1.10s/it]

{'loss': 1.0631, 'grad_norm': 0.4799247682094574, 'learning_rate': 0.00016910216718266254, 'epoch': 0.16}


 16%|█▌        | 505/3235 [09:06<51:45,  1.14s/it]

{'loss': 1.0465, 'grad_norm': 0.428109735250473, 'learning_rate': 0.00016904024767801858, 'epoch': 0.16}


 16%|█▌        | 506/3235 [09:07<49:43,  1.09s/it]

{'loss': 0.9884, 'grad_norm': 0.507106602191925, 'learning_rate': 0.00016897832817337462, 'epoch': 0.16}


 16%|█▌        | 507/3235 [09:08<51:02,  1.12s/it]

{'loss': 1.0985, 'grad_norm': 0.449139267206192, 'learning_rate': 0.00016891640866873066, 'epoch': 0.16}


 16%|█▌        | 508/3235 [09:09<49:27,  1.09s/it]

{'loss': 0.8706, 'grad_norm': 0.41463613510131836, 'learning_rate': 0.0001688544891640867, 'epoch': 0.16}


 16%|█▌        | 509/3235 [09:10<49:34,  1.09s/it]

{'loss': 1.1547, 'grad_norm': 0.49091389775276184, 'learning_rate': 0.0001687925696594427, 'epoch': 0.16}


 16%|█▌        | 510/3235 [09:11<45:35,  1.00s/it]

{'loss': 1.0264, 'grad_norm': 0.5046709775924683, 'learning_rate': 0.00016873065015479875, 'epoch': 0.16}


 16%|█▌        | 511/3235 [09:12<45:33,  1.00s/it]

{'loss': 1.1063, 'grad_norm': 0.48848244547843933, 'learning_rate': 0.00016866873065015482, 'epoch': 0.16}


 16%|█▌        | 512/3235 [09:13<44:47,  1.01it/s]

{'loss': 1.2094, 'grad_norm': 0.5018621683120728, 'learning_rate': 0.00016860681114551086, 'epoch': 0.16}


 16%|█▌        | 513/3235 [09:14<44:56,  1.01it/s]

{'loss': 1.1061, 'grad_norm': 0.42703425884246826, 'learning_rate': 0.0001685448916408669, 'epoch': 0.16}


 16%|█▌        | 514/3235 [09:15<46:06,  1.02s/it]

{'loss': 1.0239, 'grad_norm': 0.4313018321990967, 'learning_rate': 0.00016848297213622294, 'epoch': 0.16}


 16%|█▌        | 515/3235 [09:17<50:06,  1.11s/it]

{'loss': 0.9022, 'grad_norm': 0.4031568765640259, 'learning_rate': 0.00016842105263157895, 'epoch': 0.16}


 16%|█▌        | 516/3235 [09:17<44:07,  1.03it/s]

{'loss': 1.0525, 'grad_norm': 0.5201647281646729, 'learning_rate': 0.000168359133126935, 'epoch': 0.16}


 16%|█▌        | 517/3235 [09:18<47:01,  1.04s/it]

{'loss': 1.0638, 'grad_norm': 0.4072067439556122, 'learning_rate': 0.00016829721362229103, 'epoch': 0.16}


 16%|█▌        | 518/3235 [09:20<47:28,  1.05s/it]

{'loss': 1.0132, 'grad_norm': 0.48387008905410767, 'learning_rate': 0.00016823529411764707, 'epoch': 0.16}


 16%|█▌        | 519/3235 [09:20<43:44,  1.03it/s]

{'loss': 0.9142, 'grad_norm': 0.4931531548500061, 'learning_rate': 0.0001681733746130031, 'epoch': 0.16}


 16%|█▌        | 520/3235 [09:21<45:03,  1.00it/s]

{'loss': 1.1779, 'grad_norm': 0.4308741092681885, 'learning_rate': 0.00016811145510835915, 'epoch': 0.16}


 16%|█▌        | 521/3235 [09:22<46:37,  1.03s/it]

{'loss': 1.0371, 'grad_norm': 0.4887942969799042, 'learning_rate': 0.00016804953560371516, 'epoch': 0.16}


 16%|█▌        | 522/3235 [09:23<44:32,  1.02it/s]

{'loss': 1.0714, 'grad_norm': 0.4977259039878845, 'learning_rate': 0.0001679876160990712, 'epoch': 0.16}


 16%|█▌        | 523/3235 [09:24<45:00,  1.00it/s]

{'loss': 1.0844, 'grad_norm': 0.44181641936302185, 'learning_rate': 0.00016792569659442724, 'epoch': 0.16}


 16%|█▌        | 524/3235 [09:26<48:01,  1.06s/it]

{'loss': 1.0014, 'grad_norm': 0.40984436869621277, 'learning_rate': 0.00016786377708978328, 'epoch': 0.16}


 16%|█▌        | 525/3235 [09:26<45:14,  1.00s/it]

{'loss': 0.986, 'grad_norm': 0.4788881540298462, 'learning_rate': 0.00016780185758513932, 'epoch': 0.16}


 16%|█▋        | 526/3235 [09:28<47:57,  1.06s/it]

{'loss': 1.0792, 'grad_norm': 0.44364628195762634, 'learning_rate': 0.00016773993808049536, 'epoch': 0.16}


 16%|█▋        | 527/3235 [09:29<49:06,  1.09s/it]

{'loss': 1.1014, 'grad_norm': 0.5055645704269409, 'learning_rate': 0.0001676780185758514, 'epoch': 0.16}


 16%|█▋        | 528/3235 [09:30<53:03,  1.18s/it]

{'loss': 1.0102, 'grad_norm': 0.42361530661582947, 'learning_rate': 0.00016761609907120744, 'epoch': 0.16}


 16%|█▋        | 529/3235 [09:31<51:24,  1.14s/it]

{'loss': 1.0785, 'grad_norm': 0.44418397545814514, 'learning_rate': 0.00016755417956656348, 'epoch': 0.16}


 16%|█▋        | 530/3235 [09:32<52:11,  1.16s/it]

{'loss': 1.1355, 'grad_norm': 0.4527415931224823, 'learning_rate': 0.00016749226006191952, 'epoch': 0.16}


 16%|█▋        | 531/3235 [09:33<50:37,  1.12s/it]

{'loss': 1.0625, 'grad_norm': 0.4477647840976715, 'learning_rate': 0.00016743034055727556, 'epoch': 0.16}


 16%|█▋        | 532/3235 [09:35<50:37,  1.12s/it]

{'loss': 1.1622, 'grad_norm': 0.47683119773864746, 'learning_rate': 0.0001673684210526316, 'epoch': 0.16}


 16%|█▋        | 533/3235 [09:36<51:28,  1.14s/it]

{'loss': 1.0031, 'grad_norm': 0.4587249755859375, 'learning_rate': 0.00016730650154798762, 'epoch': 0.16}


 17%|█▋        | 534/3235 [09:37<48:17,  1.07s/it]

{'loss': 1.126, 'grad_norm': 0.46043261885643005, 'learning_rate': 0.00016724458204334366, 'epoch': 0.17}


 17%|█▋        | 535/3235 [09:38<47:42,  1.06s/it]

{'loss': 1.2042, 'grad_norm': 0.4641822874546051, 'learning_rate': 0.0001671826625386997, 'epoch': 0.17}


 17%|█▋        | 536/3235 [09:39<47:33,  1.06s/it]

{'loss': 1.0997, 'grad_norm': 0.4555662274360657, 'learning_rate': 0.00016712074303405574, 'epoch': 0.17}


 17%|█▋        | 537/3235 [09:40<51:45,  1.15s/it]

{'loss': 1.0545, 'grad_norm': 0.36359018087387085, 'learning_rate': 0.00016705882352941178, 'epoch': 0.17}


 17%|█▋        | 538/3235 [09:41<48:40,  1.08s/it]

{'loss': 0.9194, 'grad_norm': 0.46778351068496704, 'learning_rate': 0.00016699690402476782, 'epoch': 0.17}


 17%|█▋        | 539/3235 [09:42<44:43,  1.00it/s]

{'loss': 1.0356, 'grad_norm': 0.5282687544822693, 'learning_rate': 0.00016693498452012383, 'epoch': 0.17}


 17%|█▋        | 540/3235 [09:43<42:23,  1.06it/s]

{'loss': 1.0113, 'grad_norm': 0.47647547721862793, 'learning_rate': 0.00016687306501547987, 'epoch': 0.17}


 17%|█▋        | 541/3235 [09:44<46:23,  1.03s/it]

{'loss': 1.1945, 'grad_norm': 0.45739349722862244, 'learning_rate': 0.0001668111455108359, 'epoch': 0.17}


 17%|█▋        | 542/3235 [09:45<44:51,  1.00it/s]

{'loss': 1.1178, 'grad_norm': 0.470180869102478, 'learning_rate': 0.00016674922600619195, 'epoch': 0.17}


 17%|█▋        | 543/3235 [09:46<42:29,  1.06it/s]

{'loss': 1.0135, 'grad_norm': 0.4960077106952667, 'learning_rate': 0.000166687306501548, 'epoch': 0.17}


 17%|█▋        | 544/3235 [09:47<43:08,  1.04it/s]

{'loss': 1.1184, 'grad_norm': 0.45460113883018494, 'learning_rate': 0.00016662538699690403, 'epoch': 0.17}


 17%|█▋        | 545/3235 [09:48<46:01,  1.03s/it]

{'loss': 1.0277, 'grad_norm': 0.41908979415893555, 'learning_rate': 0.00016656346749226007, 'epoch': 0.17}


 17%|█▋        | 546/3235 [09:49<47:23,  1.06s/it]

{'loss': 0.979, 'grad_norm': 0.45301494002342224, 'learning_rate': 0.0001665015479876161, 'epoch': 0.17}


 17%|█▋        | 547/3235 [09:50<49:44,  1.11s/it]

{'loss': 0.9613, 'grad_norm': 0.37770092487335205, 'learning_rate': 0.00016643962848297215, 'epoch': 0.17}


 17%|█▋        | 548/3235 [09:51<46:51,  1.05s/it]

{'loss': 0.9848, 'grad_norm': 0.5209594964981079, 'learning_rate': 0.0001663777089783282, 'epoch': 0.17}


 17%|█▋        | 549/3235 [09:52<50:02,  1.12s/it]

{'loss': 1.0761, 'grad_norm': 0.44933876395225525, 'learning_rate': 0.00016631578947368423, 'epoch': 0.17}


 17%|█▋        | 550/3235 [09:54<52:09,  1.17s/it]

{'loss': 1.0544, 'grad_norm': 0.4324674606323242, 'learning_rate': 0.00016625386996904027, 'epoch': 0.17}


 17%|█▋        | 551/3235 [09:55<53:33,  1.20s/it]

{'loss': 1.0724, 'grad_norm': 0.439921110868454, 'learning_rate': 0.00016619195046439628, 'epoch': 0.17}


 17%|█▋        | 552/3235 [09:56<52:36,  1.18s/it]

{'loss': 1.0081, 'grad_norm': 0.4613172709941864, 'learning_rate': 0.00016613003095975232, 'epoch': 0.17}


 17%|█▋        | 553/3235 [09:57<52:36,  1.18s/it]

{'loss': 1.0942, 'grad_norm': 0.44682955741882324, 'learning_rate': 0.00016606811145510836, 'epoch': 0.17}


 17%|█▋        | 554/3235 [09:58<50:57,  1.14s/it]

{'loss': 1.0703, 'grad_norm': 0.47070157527923584, 'learning_rate': 0.0001660061919504644, 'epoch': 0.17}


 17%|█▋        | 555/3235 [09:59<49:10,  1.10s/it]

{'loss': 0.9581, 'grad_norm': 0.44080299139022827, 'learning_rate': 0.00016594427244582044, 'epoch': 0.17}


 17%|█▋        | 556/3235 [10:00<48:58,  1.10s/it]

{'loss': 1.0209, 'grad_norm': 0.46899664402008057, 'learning_rate': 0.00016588235294117648, 'epoch': 0.17}


 17%|█▋        | 557/3235 [10:02<51:45,  1.16s/it]

{'loss': 1.0427, 'grad_norm': 0.416217178106308, 'learning_rate': 0.0001658204334365325, 'epoch': 0.17}


 17%|█▋        | 558/3235 [10:03<54:03,  1.21s/it]

{'loss': 1.0164, 'grad_norm': 0.4193258583545685, 'learning_rate': 0.00016575851393188854, 'epoch': 0.17}


 17%|█▋        | 559/3235 [10:04<55:04,  1.23s/it]

{'loss': 1.1265, 'grad_norm': 0.42302292585372925, 'learning_rate': 0.00016569659442724458, 'epoch': 0.17}


 17%|█▋        | 560/3235 [10:05<50:46,  1.14s/it]

{'loss': 1.111, 'grad_norm': 0.47640499472618103, 'learning_rate': 0.00016563467492260064, 'epoch': 0.17}


 17%|█▋        | 561/3235 [10:06<48:49,  1.10s/it]

{'loss': 1.0121, 'grad_norm': 0.4609692394733429, 'learning_rate': 0.00016557275541795668, 'epoch': 0.17}


 17%|█▋        | 562/3235 [10:08<52:05,  1.17s/it]

{'loss': 0.8894, 'grad_norm': 0.3879295289516449, 'learning_rate': 0.00016551083591331272, 'epoch': 0.17}


 17%|█▋        | 563/3235 [10:08<44:54,  1.01s/it]

{'loss': 0.9878, 'grad_norm': 0.5397077798843384, 'learning_rate': 0.00016544891640866874, 'epoch': 0.17}


 17%|█▋        | 564/3235 [10:09<43:00,  1.03it/s]

{'loss': 1.2076, 'grad_norm': 0.5390294194221497, 'learning_rate': 0.00016538699690402478, 'epoch': 0.17}


 17%|█▋        | 565/3235 [10:10<46:22,  1.04s/it]

{'loss': 1.2335, 'grad_norm': 0.43244805932044983, 'learning_rate': 0.00016532507739938082, 'epoch': 0.17}


 17%|█▋        | 566/3235 [10:11<48:01,  1.08s/it]

{'loss': 1.0589, 'grad_norm': 0.4218522906303406, 'learning_rate': 0.00016526315789473686, 'epoch': 0.17}


 18%|█▊        | 567/3235 [10:12<46:43,  1.05s/it]

{'loss': 1.0093, 'grad_norm': 0.4568462669849396, 'learning_rate': 0.0001652012383900929, 'epoch': 0.18}


 18%|█▊        | 568/3235 [10:13<46:27,  1.05s/it]

{'loss': 0.9947, 'grad_norm': 0.44078776240348816, 'learning_rate': 0.00016513931888544894, 'epoch': 0.18}


 18%|█▊        | 569/3235 [10:15<47:25,  1.07s/it]

{'loss': 1.0887, 'grad_norm': 0.4173734486103058, 'learning_rate': 0.00016507739938080495, 'epoch': 0.18}


 18%|█▊        | 570/3235 [10:16<48:22,  1.09s/it]

{'loss': 1.0883, 'grad_norm': 0.4431476891040802, 'learning_rate': 0.000165015479876161, 'epoch': 0.18}


 18%|█▊        | 571/3235 [10:17<47:01,  1.06s/it]

{'loss': 1.0925, 'grad_norm': 0.5464932322502136, 'learning_rate': 0.00016495356037151703, 'epoch': 0.18}


 18%|█▊        | 572/3235 [10:18<49:16,  1.11s/it]

{'loss': 1.1346, 'grad_norm': 0.42850959300994873, 'learning_rate': 0.00016489164086687307, 'epoch': 0.18}


 18%|█▊        | 573/3235 [10:19<51:04,  1.15s/it]

{'loss': 1.0431, 'grad_norm': 0.39013221859931946, 'learning_rate': 0.0001648297213622291, 'epoch': 0.18}


 18%|█▊        | 574/3235 [10:20<48:07,  1.09s/it]

{'loss': 1.1131, 'grad_norm': 0.46941259503364563, 'learning_rate': 0.00016476780185758515, 'epoch': 0.18}


 18%|█▊        | 575/3235 [10:21<49:16,  1.11s/it]

{'loss': 1.1017, 'grad_norm': 0.4538022577762604, 'learning_rate': 0.0001647058823529412, 'epoch': 0.18}


 18%|█▊        | 576/3235 [10:22<48:44,  1.10s/it]

{'loss': 1.0357, 'grad_norm': 0.4423985481262207, 'learning_rate': 0.00016464396284829723, 'epoch': 0.18}


 18%|█▊        | 577/3235 [10:23<46:56,  1.06s/it]

{'loss': 1.0724, 'grad_norm': 0.4967707693576813, 'learning_rate': 0.00016458204334365327, 'epoch': 0.18}


 18%|█▊        | 578/3235 [10:25<48:55,  1.10s/it]

{'loss': 1.1524, 'grad_norm': 0.41659456491470337, 'learning_rate': 0.0001645201238390093, 'epoch': 0.18}


 18%|█▊        | 579/3235 [10:26<48:16,  1.09s/it]

{'loss': 0.9905, 'grad_norm': 0.4396796226501465, 'learning_rate': 0.00016445820433436535, 'epoch': 0.18}


 18%|█▊        | 580/3235 [10:27<48:23,  1.09s/it]

{'loss': 1.0279, 'grad_norm': 0.4369862675666809, 'learning_rate': 0.0001643962848297214, 'epoch': 0.18}


 18%|█▊        | 581/3235 [10:28<45:52,  1.04s/it]

{'loss': 0.8995, 'grad_norm': 0.49277791380882263, 'learning_rate': 0.0001643343653250774, 'epoch': 0.18}


 18%|█▊        | 582/3235 [10:29<48:03,  1.09s/it]

{'loss': 1.0953, 'grad_norm': 0.39401745796203613, 'learning_rate': 0.00016427244582043344, 'epoch': 0.18}


 18%|█▊        | 583/3235 [10:30<47:05,  1.07s/it]

{'loss': 1.0807, 'grad_norm': 0.4395563304424286, 'learning_rate': 0.00016421052631578948, 'epoch': 0.18}


 18%|█▊        | 584/3235 [10:31<49:39,  1.12s/it]

{'loss': 0.985, 'grad_norm': 0.41209015250205994, 'learning_rate': 0.00016414860681114552, 'epoch': 0.18}


 18%|█▊        | 585/3235 [10:32<49:52,  1.13s/it]

{'loss': 1.0752, 'grad_norm': 0.42477861046791077, 'learning_rate': 0.00016408668730650156, 'epoch': 0.18}


 18%|█▊        | 586/3235 [10:33<44:50,  1.02s/it]

{'loss': 0.9473, 'grad_norm': 0.5478788614273071, 'learning_rate': 0.0001640247678018576, 'epoch': 0.18}


 18%|█▊        | 587/3235 [10:34<46:38,  1.06s/it]

{'loss': 1.0834, 'grad_norm': 0.4664691090583801, 'learning_rate': 0.00016396284829721362, 'epoch': 0.18}


 18%|█▊        | 588/3235 [10:35<48:08,  1.09s/it]

{'loss': 1.0535, 'grad_norm': 0.4516645669937134, 'learning_rate': 0.00016390092879256966, 'epoch': 0.18}


 18%|█▊        | 589/3235 [10:36<48:48,  1.11s/it]

{'loss': 1.1181, 'grad_norm': 0.44280439615249634, 'learning_rate': 0.0001638390092879257, 'epoch': 0.18}


 18%|█▊        | 590/3235 [10:38<48:32,  1.10s/it]

{'loss': 0.9425, 'grad_norm': 0.42685383558273315, 'learning_rate': 0.00016377708978328174, 'epoch': 0.18}


 18%|█▊        | 591/3235 [10:39<48:34,  1.10s/it]

{'loss': 1.0175, 'grad_norm': 0.4516229033470154, 'learning_rate': 0.00016371517027863778, 'epoch': 0.18}


 18%|█▊        | 592/3235 [10:40<49:43,  1.13s/it]

{'loss': 1.0302, 'grad_norm': 0.43265363574028015, 'learning_rate': 0.00016365325077399382, 'epoch': 0.18}


 18%|█▊        | 593/3235 [10:41<49:18,  1.12s/it]

{'loss': 1.0604, 'grad_norm': 0.42565709352493286, 'learning_rate': 0.00016359133126934986, 'epoch': 0.18}


 18%|█▊        | 594/3235 [10:42<51:02,  1.16s/it]

{'loss': 1.1459, 'grad_norm': 0.4040333032608032, 'learning_rate': 0.0001635294117647059, 'epoch': 0.18}


 18%|█▊        | 595/3235 [10:43<52:07,  1.18s/it]

{'loss': 1.0176, 'grad_norm': 0.4670743942260742, 'learning_rate': 0.00016346749226006194, 'epoch': 0.18}


 18%|█▊        | 596/3235 [10:45<52:12,  1.19s/it]

{'loss': 0.9714, 'grad_norm': 0.47060492634773254, 'learning_rate': 0.00016340557275541798, 'epoch': 0.18}


 18%|█▊        | 597/3235 [10:46<54:51,  1.25s/it]

{'loss': 1.0719, 'grad_norm': 0.4198873043060303, 'learning_rate': 0.00016334365325077402, 'epoch': 0.18}


 18%|█▊        | 598/3235 [10:47<52:09,  1.19s/it]

{'loss': 1.0461, 'grad_norm': 0.5210872888565063, 'learning_rate': 0.00016328173374613003, 'epoch': 0.18}


 19%|█▊        | 599/3235 [10:48<52:14,  1.19s/it]

{'loss': 1.1455, 'grad_norm': 0.4836914539337158, 'learning_rate': 0.00016321981424148607, 'epoch': 0.19}


 19%|█▊        | 600/3235 [10:50<53:02,  1.21s/it]

{'loss': 1.0841, 'grad_norm': 0.43644019961357117, 'learning_rate': 0.0001631578947368421, 'epoch': 0.19}


 19%|█▊        | 601/3235 [10:51<50:58,  1.16s/it]

{'loss': 1.1372, 'grad_norm': 0.5016985535621643, 'learning_rate': 0.00016309597523219815, 'epoch': 0.19}


 19%|█▊        | 602/3235 [10:51<47:46,  1.09s/it]

{'loss': 1.0445, 'grad_norm': 0.5172518491744995, 'learning_rate': 0.0001630340557275542, 'epoch': 0.19}


 19%|█▊        | 603/3235 [10:53<47:19,  1.08s/it]

{'loss': 0.9581, 'grad_norm': 0.4314672648906708, 'learning_rate': 0.00016297213622291023, 'epoch': 0.19}


 19%|█▊        | 604/3235 [10:54<48:19,  1.10s/it]

{'loss': 0.9903, 'grad_norm': 0.4355614185333252, 'learning_rate': 0.00016291021671826624, 'epoch': 0.19}


 19%|█▊        | 605/3235 [10:55<46:11,  1.05s/it]

{'loss': 1.1063, 'grad_norm': 0.5005284547805786, 'learning_rate': 0.00016284829721362228, 'epoch': 0.19}


 19%|█▊        | 606/3235 [10:56<46:04,  1.05s/it]

{'loss': 1.0628, 'grad_norm': 0.4781421422958374, 'learning_rate': 0.00016278637770897832, 'epoch': 0.19}


 19%|█▉        | 607/3235 [10:57<44:02,  1.01s/it]

{'loss': 1.1185, 'grad_norm': 0.4664442837238312, 'learning_rate': 0.00016272445820433436, 'epoch': 0.19}


 19%|█▉        | 608/3235 [10:58<46:49,  1.07s/it]

{'loss': 0.9866, 'grad_norm': 0.4308042824268341, 'learning_rate': 0.0001626625386996904, 'epoch': 0.19}


 19%|█▉        | 609/3235 [10:59<47:52,  1.09s/it]

{'loss': 1.0264, 'grad_norm': 0.46317964792251587, 'learning_rate': 0.00016260061919504647, 'epoch': 0.19}


 19%|█▉        | 610/3235 [11:00<47:53,  1.09s/it]

{'loss': 1.0711, 'grad_norm': 0.46421948075294495, 'learning_rate': 0.00016253869969040248, 'epoch': 0.19}


 19%|█▉        | 611/3235 [11:01<45:26,  1.04s/it]

{'loss': 1.1391, 'grad_norm': 0.5142749547958374, 'learning_rate': 0.00016247678018575852, 'epoch': 0.19}


 19%|█▉        | 612/3235 [11:02<45:55,  1.05s/it]

{'loss': 1.057, 'grad_norm': 0.4480326771736145, 'learning_rate': 0.00016241486068111456, 'epoch': 0.19}


 19%|█▉        | 613/3235 [11:03<42:14,  1.03it/s]

{'loss': 1.0118, 'grad_norm': 0.49957552552223206, 'learning_rate': 0.0001623529411764706, 'epoch': 0.19}


 19%|█▉        | 614/3235 [11:04<43:55,  1.01s/it]

{'loss': 1.0601, 'grad_norm': 0.5218062400817871, 'learning_rate': 0.00016229102167182664, 'epoch': 0.19}


 19%|█▉        | 615/3235 [11:05<46:59,  1.08s/it]

{'loss': 1.1018, 'grad_norm': 0.42256611585617065, 'learning_rate': 0.00016222910216718268, 'epoch': 0.19}


 19%|█▉        | 616/3235 [11:06<49:58,  1.14s/it]

{'loss': 1.0603, 'grad_norm': 0.45438316464424133, 'learning_rate': 0.0001621671826625387, 'epoch': 0.19}


 19%|█▉        | 617/3235 [11:08<53:52,  1.23s/it]

{'loss': 1.0876, 'grad_norm': 0.40835800766944885, 'learning_rate': 0.00016210526315789473, 'epoch': 0.19}


 19%|█▉        | 618/3235 [11:09<47:57,  1.10s/it]

{'loss': 0.9282, 'grad_norm': 0.6477875113487244, 'learning_rate': 0.00016204334365325077, 'epoch': 0.19}


 19%|█▉        | 619/3235 [11:10<47:18,  1.09s/it]

{'loss': 0.966, 'grad_norm': 0.48822420835494995, 'learning_rate': 0.00016198142414860681, 'epoch': 0.19}


 19%|█▉        | 620/3235 [11:11<44:26,  1.02s/it]

{'loss': 1.1446, 'grad_norm': 0.5176165699958801, 'learning_rate': 0.00016191950464396285, 'epoch': 0.19}


 19%|█▉        | 621/3235 [11:12<46:43,  1.07s/it]

{'loss': 1.145, 'grad_norm': 0.43583178520202637, 'learning_rate': 0.0001618575851393189, 'epoch': 0.19}


 19%|█▉        | 622/3235 [11:13<44:06,  1.01s/it]

{'loss': 1.0144, 'grad_norm': 0.476747989654541, 'learning_rate': 0.0001617956656346749, 'epoch': 0.19}


 19%|█▉        | 623/3235 [11:14<42:46,  1.02it/s]

{'loss': 1.0255, 'grad_norm': 0.4615476131439209, 'learning_rate': 0.00016173374613003097, 'epoch': 0.19}


 19%|█▉        | 624/3235 [11:15<42:02,  1.04it/s]

{'loss': 1.2297, 'grad_norm': 0.5089612007141113, 'learning_rate': 0.00016167182662538701, 'epoch': 0.19}


 19%|█▉        | 625/3235 [11:16<42:39,  1.02it/s]

{'loss': 1.0028, 'grad_norm': 0.44607746601104736, 'learning_rate': 0.00016160990712074305, 'epoch': 0.19}


 19%|█▉        | 626/3235 [11:17<43:44,  1.01s/it]

{'loss': 1.0922, 'grad_norm': 0.44987952709198, 'learning_rate': 0.0001615479876160991, 'epoch': 0.19}


 19%|█▉        | 627/3235 [11:18<47:41,  1.10s/it]

{'loss': 1.0914, 'grad_norm': 0.4471246600151062, 'learning_rate': 0.00016148606811145513, 'epoch': 0.19}


 19%|█▉        | 628/3235 [11:19<46:54,  1.08s/it]

{'loss': 1.0844, 'grad_norm': 0.4509156048297882, 'learning_rate': 0.00016142414860681115, 'epoch': 0.19}


 19%|█▉        | 629/3235 [11:20<44:18,  1.02s/it]

{'loss': 1.0304, 'grad_norm': 0.49851444363594055, 'learning_rate': 0.0001613622291021672, 'epoch': 0.19}


 19%|█▉        | 630/3235 [11:21<45:30,  1.05s/it]

{'loss': 1.0992, 'grad_norm': 0.4864300787448883, 'learning_rate': 0.00016130030959752323, 'epoch': 0.19}


 20%|█▉        | 631/3235 [11:22<44:20,  1.02s/it]

{'loss': 1.0682, 'grad_norm': 0.49172958731651306, 'learning_rate': 0.00016123839009287927, 'epoch': 0.2}


 20%|█▉        | 632/3235 [11:23<43:15,  1.00it/s]

{'loss': 1.123, 'grad_norm': 0.49957701563835144, 'learning_rate': 0.0001611764705882353, 'epoch': 0.2}


 20%|█▉        | 633/3235 [11:24<45:03,  1.04s/it]

{'loss': 1.0283, 'grad_norm': 0.4392688572406769, 'learning_rate': 0.00016111455108359135, 'epoch': 0.2}


 20%|█▉        | 634/3235 [11:25<48:10,  1.11s/it]

{'loss': 0.9582, 'grad_norm': 0.419941246509552, 'learning_rate': 0.00016105263157894736, 'epoch': 0.2}


 20%|█▉        | 635/3235 [11:26<45:55,  1.06s/it]

{'loss': 1.0446, 'grad_norm': 0.50629723072052, 'learning_rate': 0.0001609907120743034, 'epoch': 0.2}


 20%|█▉        | 636/3235 [11:27<45:54,  1.06s/it]

{'loss': 1.0061, 'grad_norm': 0.43243640661239624, 'learning_rate': 0.00016092879256965944, 'epoch': 0.2}


 20%|█▉        | 637/3235 [11:29<48:54,  1.13s/it]

{'loss': 1.1649, 'grad_norm': 0.4764527678489685, 'learning_rate': 0.00016086687306501548, 'epoch': 0.2}


 20%|█▉        | 638/3235 [11:29<45:57,  1.06s/it]

{'loss': 1.0796, 'grad_norm': 0.46867600083351135, 'learning_rate': 0.00016080495356037152, 'epoch': 0.2}


 20%|█▉        | 639/3235 [11:31<47:46,  1.10s/it]

{'loss': 1.1415, 'grad_norm': 0.4915861487388611, 'learning_rate': 0.00016074303405572756, 'epoch': 0.2}


 20%|█▉        | 640/3235 [11:32<46:45,  1.08s/it]

{'loss': 1.0011, 'grad_norm': 0.47925660014152527, 'learning_rate': 0.0001606811145510836, 'epoch': 0.2}


 20%|█▉        | 641/3235 [11:33<46:16,  1.07s/it]

{'loss': 0.9851, 'grad_norm': 0.4124748408794403, 'learning_rate': 0.00016061919504643964, 'epoch': 0.2}


 20%|█▉        | 642/3235 [11:34<46:42,  1.08s/it]

{'loss': 0.9776, 'grad_norm': 0.4704872667789459, 'learning_rate': 0.00016055727554179568, 'epoch': 0.2}


 20%|█▉        | 643/3235 [11:35<51:08,  1.18s/it]

{'loss': 1.0961, 'grad_norm': 0.4298701584339142, 'learning_rate': 0.00016049535603715172, 'epoch': 0.2}


 20%|█▉        | 644/3235 [11:36<50:13,  1.16s/it]

{'loss': 1.0439, 'grad_norm': 0.49175193905830383, 'learning_rate': 0.00016043343653250776, 'epoch': 0.2}


 20%|█▉        | 645/3235 [11:37<47:05,  1.09s/it]

{'loss': 0.9124, 'grad_norm': 0.45485880970954895, 'learning_rate': 0.0001603715170278638, 'epoch': 0.2}


 20%|█▉        | 646/3235 [11:38<48:26,  1.12s/it]

{'loss': 0.9189, 'grad_norm': 0.45453524589538574, 'learning_rate': 0.0001603095975232198, 'epoch': 0.2}


 20%|██        | 647/3235 [11:39<45:43,  1.06s/it]

{'loss': 1.045, 'grad_norm': 0.4892820417881012, 'learning_rate': 0.00016024767801857585, 'epoch': 0.2}


 20%|██        | 648/3235 [11:41<49:35,  1.15s/it]

{'loss': 1.0671, 'grad_norm': 0.40045133233070374, 'learning_rate': 0.0001601857585139319, 'epoch': 0.2}


 20%|██        | 649/3235 [11:42<47:03,  1.09s/it]

{'loss': 1.0783, 'grad_norm': 0.5199349522590637, 'learning_rate': 0.00016012383900928793, 'epoch': 0.2}


 20%|██        | 650/3235 [11:43<45:01,  1.05s/it]

{'loss': 1.0688, 'grad_norm': 0.48456838726997375, 'learning_rate': 0.00016006191950464397, 'epoch': 0.2}


 20%|██        | 651/3235 [11:44<45:36,  1.06s/it]

{'loss': 1.0635, 'grad_norm': 0.45679134130477905, 'learning_rate': 0.00016, 'epoch': 0.2}


 20%|██        | 652/3235 [11:45<43:13,  1.00s/it]

{'loss': 1.1384, 'grad_norm': 0.5091138482093811, 'learning_rate': 0.00015993808049535603, 'epoch': 0.2}


 20%|██        | 653/3235 [11:46<43:53,  1.02s/it]

{'loss': 0.9433, 'grad_norm': 0.4499826729297638, 'learning_rate': 0.00015987616099071207, 'epoch': 0.2}


 20%|██        | 654/3235 [11:47<47:42,  1.11s/it]

{'loss': 0.9328, 'grad_norm': 0.4090941846370697, 'learning_rate': 0.0001598142414860681, 'epoch': 0.2}


 20%|██        | 655/3235 [11:48<48:24,  1.13s/it]

{'loss': 1.1171, 'grad_norm': 0.46669653058052063, 'learning_rate': 0.00015975232198142415, 'epoch': 0.2}


 20%|██        | 656/3235 [11:49<48:15,  1.12s/it]

{'loss': 0.9545, 'grad_norm': 0.46003997325897217, 'learning_rate': 0.00015969040247678019, 'epoch': 0.2}


 20%|██        | 657/3235 [11:50<45:31,  1.06s/it]

{'loss': 1.1012, 'grad_norm': 0.5110046863555908, 'learning_rate': 0.00015962848297213623, 'epoch': 0.2}


 20%|██        | 658/3235 [11:52<49:44,  1.16s/it]

{'loss': 0.9865, 'grad_norm': 0.398687481880188, 'learning_rate': 0.00015956656346749227, 'epoch': 0.2}


 20%|██        | 659/3235 [11:53<48:20,  1.13s/it]

{'loss': 1.0101, 'grad_norm': 0.43391212821006775, 'learning_rate': 0.0001595046439628483, 'epoch': 0.2}


 20%|██        | 660/3235 [11:54<49:21,  1.15s/it]

{'loss': 1.1031, 'grad_norm': 0.48432210087776184, 'learning_rate': 0.00015944272445820435, 'epoch': 0.2}


 20%|██        | 661/3235 [11:55<49:51,  1.16s/it]

{'loss': 1.1859, 'grad_norm': 0.4646123945713043, 'learning_rate': 0.00015938080495356039, 'epoch': 0.2}


 20%|██        | 662/3235 [11:56<48:20,  1.13s/it]

{'loss': 1.1669, 'grad_norm': 0.488565593957901, 'learning_rate': 0.00015931888544891643, 'epoch': 0.2}


 20%|██        | 663/3235 [11:57<49:34,  1.16s/it]

{'loss': 0.9704, 'grad_norm': 0.4404005706310272, 'learning_rate': 0.00015925696594427247, 'epoch': 0.2}


 21%|██        | 664/3235 [11:58<47:48,  1.12s/it]

{'loss': 1.066, 'grad_norm': 0.4482964277267456, 'learning_rate': 0.00015919504643962848, 'epoch': 0.21}


 21%|██        | 665/3235 [12:00<51:33,  1.20s/it]

{'loss': 1.1044, 'grad_norm': 0.40813466906547546, 'learning_rate': 0.00015913312693498452, 'epoch': 0.21}


 21%|██        | 666/3235 [12:01<48:01,  1.12s/it]

{'loss': 1.0018, 'grad_norm': 0.4710635542869568, 'learning_rate': 0.00015907120743034056, 'epoch': 0.21}


 21%|██        | 667/3235 [12:02<52:28,  1.23s/it]

{'loss': 1.0041, 'grad_norm': 0.38794878125190735, 'learning_rate': 0.0001590092879256966, 'epoch': 0.21}


 21%|██        | 668/3235 [12:03<47:22,  1.11s/it]

{'loss': 0.9056, 'grad_norm': 0.4773879647254944, 'learning_rate': 0.00015894736842105264, 'epoch': 0.21}


 21%|██        | 669/3235 [12:04<48:48,  1.14s/it]

{'loss': 1.0499, 'grad_norm': 0.4862302839756012, 'learning_rate': 0.00015888544891640868, 'epoch': 0.21}


 21%|██        | 670/3235 [12:05<48:12,  1.13s/it]

{'loss': 1.21, 'grad_norm': 0.4831734299659729, 'learning_rate': 0.0001588235294117647, 'epoch': 0.21}


 21%|██        | 671/3235 [12:07<51:59,  1.22s/it]

{'loss': 1.0792, 'grad_norm': 0.3989211916923523, 'learning_rate': 0.00015876160990712073, 'epoch': 0.21}


 21%|██        | 672/3235 [12:08<49:27,  1.16s/it]

{'loss': 0.9821, 'grad_norm': 0.4945138692855835, 'learning_rate': 0.0001586996904024768, 'epoch': 0.21}


 21%|██        | 673/3235 [12:09<47:41,  1.12s/it]

{'loss': 1.0451, 'grad_norm': 0.47929251194000244, 'learning_rate': 0.00015863777089783284, 'epoch': 0.21}


 21%|██        | 674/3235 [12:10<47:49,  1.12s/it]

{'loss': 1.0353, 'grad_norm': 0.4594912827014923, 'learning_rate': 0.00015857585139318888, 'epoch': 0.21}


 21%|██        | 675/3235 [12:11<47:51,  1.12s/it]

{'loss': 1.0552, 'grad_norm': 0.4450334906578064, 'learning_rate': 0.00015851393188854492, 'epoch': 0.21}


 21%|██        | 676/3235 [12:12<49:27,  1.16s/it]

{'loss': 0.9594, 'grad_norm': 0.40371009707450867, 'learning_rate': 0.00015845201238390093, 'epoch': 0.21}


 21%|██        | 677/3235 [12:13<49:39,  1.16s/it]

{'loss': 1.2277, 'grad_norm': 0.48419326543807983, 'learning_rate': 0.00015839009287925697, 'epoch': 0.21}


 21%|██        | 678/3235 [12:14<44:49,  1.05s/it]

{'loss': 0.9743, 'grad_norm': 0.5252306461334229, 'learning_rate': 0.000158328173374613, 'epoch': 0.21}


 21%|██        | 679/3235 [12:15<43:52,  1.03s/it]

{'loss': 0.8473, 'grad_norm': 0.46389761567115784, 'learning_rate': 0.00015826625386996905, 'epoch': 0.21}


 21%|██        | 680/3235 [12:16<44:43,  1.05s/it]

{'loss': 0.9637, 'grad_norm': 0.4604925513267517, 'learning_rate': 0.0001582043343653251, 'epoch': 0.21}


 21%|██        | 681/3235 [12:17<44:53,  1.05s/it]

{'loss': 1.1384, 'grad_norm': 0.46643340587615967, 'learning_rate': 0.00015814241486068113, 'epoch': 0.21}


 21%|██        | 682/3235 [12:18<45:49,  1.08s/it]

{'loss': 0.9533, 'grad_norm': 0.42394930124282837, 'learning_rate': 0.00015808049535603714, 'epoch': 0.21}


 21%|██        | 683/3235 [12:20<47:20,  1.11s/it]

{'loss': 1.0446, 'grad_norm': 0.4052025377750397, 'learning_rate': 0.00015801857585139318, 'epoch': 0.21}


 21%|██        | 684/3235 [12:21<49:10,  1.16s/it]

{'loss': 1.0114, 'grad_norm': 0.41243720054626465, 'learning_rate': 0.00015795665634674923, 'epoch': 0.21}


 21%|██        | 685/3235 [12:22<51:07,  1.20s/it]

{'loss': 1.1424, 'grad_norm': 0.44226446747779846, 'learning_rate': 0.00015789473684210527, 'epoch': 0.21}


 21%|██        | 686/3235 [12:23<47:43,  1.12s/it]

{'loss': 1.1905, 'grad_norm': 0.5355028510093689, 'learning_rate': 0.0001578328173374613, 'epoch': 0.21}


 21%|██        | 687/3235 [12:24<45:34,  1.07s/it]

{'loss': 1.022, 'grad_norm': 0.5219385027885437, 'learning_rate': 0.00015777089783281735, 'epoch': 0.21}


 21%|██▏       | 688/3235 [12:25<44:03,  1.04s/it]

{'loss': 1.0811, 'grad_norm': 0.5043584704399109, 'learning_rate': 0.00015770897832817339, 'epoch': 0.21}


 21%|██▏       | 689/3235 [12:26<45:55,  1.08s/it]

{'loss': 1.0454, 'grad_norm': 0.44597455859184265, 'learning_rate': 0.00015764705882352943, 'epoch': 0.21}


 21%|██▏       | 690/3235 [12:27<42:45,  1.01s/it]

{'loss': 1.1549, 'grad_norm': 0.4593571126461029, 'learning_rate': 0.00015758513931888547, 'epoch': 0.21}


 21%|██▏       | 691/3235 [12:28<44:06,  1.04s/it]

{'loss': 1.1014, 'grad_norm': 0.47039371728897095, 'learning_rate': 0.0001575232198142415, 'epoch': 0.21}


 21%|██▏       | 692/3235 [12:29<46:39,  1.10s/it]

{'loss': 1.0355, 'grad_norm': 0.7239323854446411, 'learning_rate': 0.00015746130030959755, 'epoch': 0.21}


 21%|██▏       | 693/3235 [12:31<46:16,  1.09s/it]

{'loss': 0.9379, 'grad_norm': 0.4341233968734741, 'learning_rate': 0.00015739938080495359, 'epoch': 0.21}


 21%|██▏       | 694/3235 [12:32<48:37,  1.15s/it]

{'loss': 1.1296, 'grad_norm': 0.41496774554252625, 'learning_rate': 0.0001573374613003096, 'epoch': 0.21}


 21%|██▏       | 695/3235 [12:33<47:34,  1.12s/it]

{'loss': 1.0814, 'grad_norm': 0.4632503092288971, 'learning_rate': 0.00015727554179566564, 'epoch': 0.21}


 22%|██▏       | 696/3235 [12:34<46:32,  1.10s/it]

{'loss': 1.0533, 'grad_norm': 0.49980470538139343, 'learning_rate': 0.00015721362229102168, 'epoch': 0.22}


 22%|██▏       | 697/3235 [12:35<44:20,  1.05s/it]

{'loss': 0.9524, 'grad_norm': 0.47177836298942566, 'learning_rate': 0.00015715170278637772, 'epoch': 0.22}


 22%|██▏       | 698/3235 [12:36<46:27,  1.10s/it]

{'loss': 1.0477, 'grad_norm': 0.4190630316734314, 'learning_rate': 0.00015708978328173376, 'epoch': 0.22}


 22%|██▏       | 699/3235 [12:37<45:04,  1.07s/it]

{'loss': 0.948, 'grad_norm': 0.44234368205070496, 'learning_rate': 0.0001570278637770898, 'epoch': 0.22}


 22%|██▏       | 700/3235 [12:38<46:17,  1.10s/it]

{'loss': 1.01, 'grad_norm': 0.4273510277271271, 'learning_rate': 0.0001569659442724458, 'epoch': 0.22}


 22%|██▏       | 701/3235 [12:39<47:48,  1.13s/it]

{'loss': 1.0949, 'grad_norm': 0.4746944308280945, 'learning_rate': 0.00015690402476780185, 'epoch': 0.22}


 22%|██▏       | 702/3235 [12:40<43:19,  1.03s/it]

{'loss': 0.9396, 'grad_norm': 0.5177167057991028, 'learning_rate': 0.0001568421052631579, 'epoch': 0.22}


 22%|██▏       | 703/3235 [12:41<45:08,  1.07s/it]

{'loss': 1.1157, 'grad_norm': 0.4945661425590515, 'learning_rate': 0.00015678018575851393, 'epoch': 0.22}


 22%|██▏       | 704/3235 [12:43<47:07,  1.12s/it]

{'loss': 0.9352, 'grad_norm': 0.4392951428890228, 'learning_rate': 0.00015671826625386997, 'epoch': 0.22}


 22%|██▏       | 705/3235 [12:43<43:28,  1.03s/it]

{'loss': 1.2077, 'grad_norm': 1.4921598434448242, 'learning_rate': 0.000156656346749226, 'epoch': 0.22}


 22%|██▏       | 706/3235 [12:45<44:35,  1.06s/it]

{'loss': 1.0922, 'grad_norm': 0.458978533744812, 'learning_rate': 0.00015659442724458205, 'epoch': 0.22}


 22%|██▏       | 707/3235 [12:45<42:17,  1.00s/it]

{'loss': 0.9552, 'grad_norm': 0.49179041385650635, 'learning_rate': 0.0001565325077399381, 'epoch': 0.22}


 22%|██▏       | 708/3235 [12:46<42:37,  1.01s/it]

{'loss': 0.9387, 'grad_norm': 0.43496811389923096, 'learning_rate': 0.00015647058823529413, 'epoch': 0.22}


 22%|██▏       | 709/3235 [12:48<45:26,  1.08s/it]

{'loss': 0.8349, 'grad_norm': 0.4495636522769928, 'learning_rate': 0.00015640866873065017, 'epoch': 0.22}


 22%|██▏       | 710/3235 [12:49<46:53,  1.11s/it]

{'loss': 0.9474, 'grad_norm': 0.4388023018836975, 'learning_rate': 0.0001563467492260062, 'epoch': 0.22}


 22%|██▏       | 711/3235 [12:50<45:45,  1.09s/it]

{'loss': 0.931, 'grad_norm': 0.45224112272262573, 'learning_rate': 0.00015628482972136225, 'epoch': 0.22}


 22%|██▏       | 712/3235 [12:51<42:19,  1.01s/it]

{'loss': 0.9951, 'grad_norm': 0.4866211414337158, 'learning_rate': 0.00015622291021671826, 'epoch': 0.22}


 22%|██▏       | 713/3235 [12:52<43:18,  1.03s/it]

{'loss': 1.0529, 'grad_norm': 0.4583304524421692, 'learning_rate': 0.0001561609907120743, 'epoch': 0.22}


 22%|██▏       | 714/3235 [12:53<42:28,  1.01s/it]

{'loss': 1.0109, 'grad_norm': 0.47578540444374084, 'learning_rate': 0.00015609907120743034, 'epoch': 0.22}


 22%|██▏       | 715/3235 [12:54<41:53,  1.00it/s]

{'loss': 1.0455, 'grad_norm': 0.4660297632217407, 'learning_rate': 0.00015603715170278638, 'epoch': 0.22}


 22%|██▏       | 716/3235 [12:55<40:51,  1.03it/s]

{'loss': 1.0479, 'grad_norm': 0.4843333959579468, 'learning_rate': 0.00015597523219814242, 'epoch': 0.22}


 22%|██▏       | 717/3235 [12:56<43:25,  1.03s/it]

{'loss': 1.0148, 'grad_norm': 0.43758565187454224, 'learning_rate': 0.00015591331269349846, 'epoch': 0.22}


 22%|██▏       | 718/3235 [12:57<40:54,  1.03it/s]

{'loss': 1.0145, 'grad_norm': 0.48207148909568787, 'learning_rate': 0.00015585139318885448, 'epoch': 0.22}


 22%|██▏       | 719/3235 [12:58<40:53,  1.03it/s]

{'loss': 1.0537, 'grad_norm': 0.45951178669929504, 'learning_rate': 0.00015578947368421052, 'epoch': 0.22}


 22%|██▏       | 720/3235 [12:59<45:56,  1.10s/it]

{'loss': 0.9685, 'grad_norm': 0.3724649250507355, 'learning_rate': 0.00015572755417956656, 'epoch': 0.22}


 22%|██▏       | 721/3235 [13:00<47:36,  1.14s/it]

{'loss': 1.0542, 'grad_norm': 0.41491180658340454, 'learning_rate': 0.00015566563467492262, 'epoch': 0.22}


 22%|██▏       | 722/3235 [13:01<48:24,  1.16s/it]

{'loss': 0.9892, 'grad_norm': 0.48338186740875244, 'learning_rate': 0.00015560371517027866, 'epoch': 0.22}


 22%|██▏       | 723/3235 [13:03<47:34,  1.14s/it]

{'loss': 1.2086, 'grad_norm': 0.4369971752166748, 'learning_rate': 0.0001555417956656347, 'epoch': 0.22}


 22%|██▏       | 724/3235 [13:04<46:38,  1.11s/it]

{'loss': 1.1064, 'grad_norm': 0.5119318962097168, 'learning_rate': 0.00015547987616099072, 'epoch': 0.22}


 22%|██▏       | 725/3235 [13:05<46:58,  1.12s/it]

{'loss': 0.9829, 'grad_norm': 0.44279587268829346, 'learning_rate': 0.00015541795665634676, 'epoch': 0.22}


 22%|██▏       | 726/3235 [13:06<48:05,  1.15s/it]

{'loss': 1.0576, 'grad_norm': 0.4411042630672455, 'learning_rate': 0.0001553560371517028, 'epoch': 0.22}


 22%|██▏       | 727/3235 [13:07<47:54,  1.15s/it]

{'loss': 1.1177, 'grad_norm': 0.45427951216697693, 'learning_rate': 0.00015529411764705884, 'epoch': 0.22}


 23%|██▎       | 728/3235 [13:08<47:59,  1.15s/it]

{'loss': 1.0647, 'grad_norm': 0.44717854261398315, 'learning_rate': 0.00015523219814241488, 'epoch': 0.23}


 23%|██▎       | 729/3235 [13:10<49:14,  1.18s/it]

{'loss': 1.0986, 'grad_norm': 0.44211113452911377, 'learning_rate': 0.00015517027863777092, 'epoch': 0.23}


 23%|██▎       | 730/3235 [13:11<47:57,  1.15s/it]

{'loss': 0.9749, 'grad_norm': 0.47524434328079224, 'learning_rate': 0.00015510835913312693, 'epoch': 0.23}


 23%|██▎       | 731/3235 [13:12<44:52,  1.08s/it]

{'loss': 1.0166, 'grad_norm': 0.5323488712310791, 'learning_rate': 0.00015504643962848297, 'epoch': 0.23}


 23%|██▎       | 732/3235 [13:13<45:34,  1.09s/it]

{'loss': 0.9809, 'grad_norm': 0.4371962249279022, 'learning_rate': 0.000154984520123839, 'epoch': 0.23}


 23%|██▎       | 733/3235 [13:14<43:50,  1.05s/it]

{'loss': 1.095, 'grad_norm': 0.4868119955062866, 'learning_rate': 0.00015492260061919505, 'epoch': 0.23}


 23%|██▎       | 734/3235 [13:15<44:59,  1.08s/it]

{'loss': 1.1122, 'grad_norm': 0.45791345834732056, 'learning_rate': 0.0001548606811145511, 'epoch': 0.23}


 23%|██▎       | 735/3235 [13:16<46:36,  1.12s/it]

{'loss': 1.1857, 'grad_norm': 0.4371449053287506, 'learning_rate': 0.00015479876160990713, 'epoch': 0.23}


 23%|██▎       | 736/3235 [13:17<46:56,  1.13s/it]

{'loss': 0.9961, 'grad_norm': 0.41897955536842346, 'learning_rate': 0.00015473684210526317, 'epoch': 0.23}


 23%|██▎       | 737/3235 [13:18<45:51,  1.10s/it]

{'loss': 0.955, 'grad_norm': 0.47415319085121155, 'learning_rate': 0.0001546749226006192, 'epoch': 0.23}


 23%|██▎       | 738/3235 [13:19<44:43,  1.07s/it]

{'loss': 1.1005, 'grad_norm': 0.4458077549934387, 'learning_rate': 0.00015461300309597525, 'epoch': 0.23}


 23%|██▎       | 739/3235 [13:20<44:58,  1.08s/it]

{'loss': 0.9837, 'grad_norm': 0.440742164850235, 'learning_rate': 0.0001545510835913313, 'epoch': 0.23}


 23%|██▎       | 740/3235 [13:21<42:02,  1.01s/it]

{'loss': 0.9233, 'grad_norm': 0.46688711643218994, 'learning_rate': 0.00015448916408668733, 'epoch': 0.23}


 23%|██▎       | 741/3235 [13:22<43:58,  1.06s/it]

{'loss': 1.0112, 'grad_norm': 0.4179796278476715, 'learning_rate': 0.00015442724458204334, 'epoch': 0.23}


 23%|██▎       | 742/3235 [13:23<43:15,  1.04s/it]

{'loss': 1.0301, 'grad_norm': 0.4734555184841156, 'learning_rate': 0.00015436532507739938, 'epoch': 0.23}


 23%|██▎       | 743/3235 [13:24<44:19,  1.07s/it]

{'loss': 0.981, 'grad_norm': 0.46361297369003296, 'learning_rate': 0.00015430340557275542, 'epoch': 0.23}


 23%|██▎       | 744/3235 [13:26<44:59,  1.08s/it]

{'loss': 1.1155, 'grad_norm': 0.458469420671463, 'learning_rate': 0.00015424148606811146, 'epoch': 0.23}


 23%|██▎       | 745/3235 [13:27<46:30,  1.12s/it]

{'loss': 1.1564, 'grad_norm': 0.4726048409938812, 'learning_rate': 0.0001541795665634675, 'epoch': 0.23}


 23%|██▎       | 746/3235 [13:28<47:36,  1.15s/it]

{'loss': 1.0805, 'grad_norm': 0.41280314326286316, 'learning_rate': 0.00015411764705882354, 'epoch': 0.23}


 23%|██▎       | 747/3235 [13:29<41:30,  1.00s/it]

{'loss': 0.9154, 'grad_norm': 0.5175613760948181, 'learning_rate': 0.00015405572755417956, 'epoch': 0.23}


 23%|██▎       | 748/3235 [13:30<46:43,  1.13s/it]

{'loss': 0.9323, 'grad_norm': 0.38933977484703064, 'learning_rate': 0.0001539938080495356, 'epoch': 0.23}


 23%|██▎       | 749/3235 [13:31<45:27,  1.10s/it]

{'loss': 1.1607, 'grad_norm': 0.44524437189102173, 'learning_rate': 0.00015393188854489164, 'epoch': 0.23}


 23%|██▎       | 750/3235 [13:32<45:28,  1.10s/it]

{'loss': 0.981, 'grad_norm': 0.5009401440620422, 'learning_rate': 0.00015386996904024768, 'epoch': 0.23}


 23%|██▎       | 751/3235 [13:33<46:47,  1.13s/it]

{'loss': 1.1491, 'grad_norm': 0.41819125413894653, 'learning_rate': 0.00015380804953560372, 'epoch': 0.23}


 23%|██▎       | 752/3235 [13:34<46:29,  1.12s/it]

{'loss': 1.1469, 'grad_norm': 0.4318445920944214, 'learning_rate': 0.00015374613003095976, 'epoch': 0.23}


 23%|██▎       | 753/3235 [13:36<46:41,  1.13s/it]

{'loss': 1.0906, 'grad_norm': 0.4619191884994507, 'learning_rate': 0.0001536842105263158, 'epoch': 0.23}


 23%|██▎       | 754/3235 [13:37<49:34,  1.20s/it]

{'loss': 1.0642, 'grad_norm': 0.40154382586479187, 'learning_rate': 0.00015362229102167184, 'epoch': 0.23}


 23%|██▎       | 755/3235 [13:38<51:55,  1.26s/it]

{'loss': 1.2277, 'grad_norm': 0.40367525815963745, 'learning_rate': 0.00015356037151702788, 'epoch': 0.23}


 23%|██▎       | 756/3235 [13:39<46:58,  1.14s/it]

{'loss': 0.9759, 'grad_norm': 0.4724382758140564, 'learning_rate': 0.00015349845201238392, 'epoch': 0.23}


 23%|██▎       | 757/3235 [13:40<46:30,  1.13s/it]

{'loss': 0.8893, 'grad_norm': 0.455630898475647, 'learning_rate': 0.00015343653250773996, 'epoch': 0.23}


 23%|██▎       | 758/3235 [13:41<45:41,  1.11s/it]

{'loss': 1.1598, 'grad_norm': 0.5051549077033997, 'learning_rate': 0.000153374613003096, 'epoch': 0.23}


 23%|██▎       | 759/3235 [13:43<48:25,  1.17s/it]

{'loss': 1.0039, 'grad_norm': 0.42332565784454346, 'learning_rate': 0.000153312693498452, 'epoch': 0.23}


 23%|██▎       | 760/3235 [13:44<45:29,  1.10s/it]

{'loss': 0.9416, 'grad_norm': 0.5317744612693787, 'learning_rate': 0.00015325077399380805, 'epoch': 0.23}


 24%|██▎       | 761/3235 [13:45<46:50,  1.14s/it]

{'loss': 1.0717, 'grad_norm': 0.4530036449432373, 'learning_rate': 0.0001531888544891641, 'epoch': 0.24}


 24%|██▎       | 762/3235 [13:46<44:39,  1.08s/it]

{'loss': 1.0102, 'grad_norm': 0.4855031669139862, 'learning_rate': 0.00015312693498452013, 'epoch': 0.24}


 24%|██▎       | 763/3235 [13:47<43:56,  1.07s/it]

{'loss': 1.0232, 'grad_norm': 0.46706250309944153, 'learning_rate': 0.00015306501547987617, 'epoch': 0.24}


 24%|██▎       | 764/3235 [13:48<45:18,  1.10s/it]

{'loss': 0.9839, 'grad_norm': 0.44025206565856934, 'learning_rate': 0.0001530030959752322, 'epoch': 0.24}


 24%|██▎       | 765/3235 [13:49<45:15,  1.10s/it]

{'loss': 0.9849, 'grad_norm': 0.4255676567554474, 'learning_rate': 0.00015294117647058822, 'epoch': 0.24}


 24%|██▎       | 766/3235 [13:50<41:57,  1.02s/it]

{'loss': 0.9828, 'grad_norm': 0.4694061577320099, 'learning_rate': 0.00015287925696594426, 'epoch': 0.24}


 24%|██▎       | 767/3235 [13:51<43:58,  1.07s/it]

{'loss': 1.2152, 'grad_norm': 0.49981600046157837, 'learning_rate': 0.0001528173374613003, 'epoch': 0.24}


 24%|██▎       | 768/3235 [13:52<42:05,  1.02s/it]

{'loss': 1.106, 'grad_norm': 0.5545634031295776, 'learning_rate': 0.00015275541795665634, 'epoch': 0.24}


 24%|██▍       | 769/3235 [13:53<45:34,  1.11s/it]

{'loss': 0.911, 'grad_norm': 0.4309737980365753, 'learning_rate': 0.0001526934984520124, 'epoch': 0.24}


 24%|██▍       | 770/3235 [13:55<49:09,  1.20s/it]

{'loss': 0.9098, 'grad_norm': 0.4356989860534668, 'learning_rate': 0.00015263157894736845, 'epoch': 0.24}


 24%|██▍       | 771/3235 [13:56<46:16,  1.13s/it]

{'loss': 0.9293, 'grad_norm': 0.4763388931751251, 'learning_rate': 0.00015256965944272446, 'epoch': 0.24}


 24%|██▍       | 772/3235 [13:57<43:32,  1.06s/it]

{'loss': 0.9916, 'grad_norm': 0.47181907296180725, 'learning_rate': 0.0001525077399380805, 'epoch': 0.24}


 24%|██▍       | 773/3235 [13:58<41:39,  1.02s/it]

{'loss': 1.1836, 'grad_norm': 0.514782726764679, 'learning_rate': 0.00015244582043343654, 'epoch': 0.24}


 24%|██▍       | 774/3235 [13:59<42:14,  1.03s/it]

{'loss': 0.9704, 'grad_norm': 0.5002947449684143, 'learning_rate': 0.00015238390092879258, 'epoch': 0.24}


 24%|██▍       | 775/3235 [14:00<43:02,  1.05s/it]

{'loss': 0.9885, 'grad_norm': 0.48426005244255066, 'learning_rate': 0.00015232198142414862, 'epoch': 0.24}


 24%|██▍       | 776/3235 [14:00<38:52,  1.05it/s]

{'loss': 1.1352, 'grad_norm': 0.5712942481040955, 'learning_rate': 0.00015226006191950466, 'epoch': 0.24}


 24%|██▍       | 777/3235 [14:01<38:31,  1.06it/s]

{'loss': 1.0193, 'grad_norm': 0.47317856550216675, 'learning_rate': 0.00015219814241486067, 'epoch': 0.24}


 24%|██▍       | 778/3235 [14:02<41:06,  1.00s/it]

{'loss': 1.012, 'grad_norm': 0.4176483750343323, 'learning_rate': 0.00015213622291021671, 'epoch': 0.24}


 24%|██▍       | 779/3235 [14:03<39:22,  1.04it/s]

{'loss': 1.1022, 'grad_norm': 0.522635281085968, 'learning_rate': 0.00015207430340557275, 'epoch': 0.24}


 24%|██▍       | 780/3235 [14:05<42:43,  1.04s/it]

{'loss': 1.1703, 'grad_norm': 0.48312753438949585, 'learning_rate': 0.0001520123839009288, 'epoch': 0.24}


 24%|██▍       | 781/3235 [14:06<44:19,  1.08s/it]

{'loss': 0.9654, 'grad_norm': 0.40989601612091064, 'learning_rate': 0.00015195046439628483, 'epoch': 0.24}


 24%|██▍       | 782/3235 [14:07<43:08,  1.06s/it]

{'loss': 1.1725, 'grad_norm': 0.5132450461387634, 'learning_rate': 0.00015188854489164087, 'epoch': 0.24}


 24%|██▍       | 783/3235 [14:08<42:00,  1.03s/it]

{'loss': 0.9314, 'grad_norm': 0.4909552335739136, 'learning_rate': 0.0001518266253869969, 'epoch': 0.24}


 24%|██▍       | 784/3235 [14:09<45:27,  1.11s/it]

{'loss': 0.8484, 'grad_norm': 0.4111533761024475, 'learning_rate': 0.00015176470588235295, 'epoch': 0.24}


 24%|██▍       | 785/3235 [14:10<45:10,  1.11s/it]

{'loss': 1.1283, 'grad_norm': 0.47470515966415405, 'learning_rate': 0.000151702786377709, 'epoch': 0.24}


 24%|██▍       | 786/3235 [14:11<40:07,  1.02it/s]

{'loss': 0.9693, 'grad_norm': 0.49018076062202454, 'learning_rate': 0.00015164086687306504, 'epoch': 0.24}


 24%|██▍       | 787/3235 [14:12<42:08,  1.03s/it]

{'loss': 1.0694, 'grad_norm': 0.4804699718952179, 'learning_rate': 0.00015157894736842108, 'epoch': 0.24}


 24%|██▍       | 788/3235 [14:13<43:10,  1.06s/it]

{'loss': 1.0618, 'grad_norm': 0.48323798179626465, 'learning_rate': 0.00015151702786377712, 'epoch': 0.24}


 24%|██▍       | 789/3235 [14:14<45:30,  1.12s/it]

{'loss': 1.1288, 'grad_norm': 0.45175230503082275, 'learning_rate': 0.00015145510835913313, 'epoch': 0.24}


 24%|██▍       | 790/3235 [14:16<48:23,  1.19s/it]

{'loss': 1.1075, 'grad_norm': 0.456444650888443, 'learning_rate': 0.00015139318885448917, 'epoch': 0.24}


 24%|██▍       | 791/3235 [14:17<46:49,  1.15s/it]

{'loss': 1.0523, 'grad_norm': 0.4718805253505707, 'learning_rate': 0.0001513312693498452, 'epoch': 0.24}


 24%|██▍       | 792/3235 [14:18<45:11,  1.11s/it]

{'loss': 1.0476, 'grad_norm': 0.529040515422821, 'learning_rate': 0.00015126934984520125, 'epoch': 0.24}


 25%|██▍       | 793/3235 [14:19<43:34,  1.07s/it]

{'loss': 0.9495, 'grad_norm': 0.5507084727287292, 'learning_rate': 0.0001512074303405573, 'epoch': 0.25}


 25%|██▍       | 794/3235 [14:20<41:49,  1.03s/it]

{'loss': 0.9437, 'grad_norm': 0.4452938735485077, 'learning_rate': 0.00015114551083591333, 'epoch': 0.25}


 25%|██▍       | 795/3235 [14:21<44:02,  1.08s/it]

{'loss': 1.0125, 'grad_norm': 0.42029958963394165, 'learning_rate': 0.00015108359133126934, 'epoch': 0.25}


 25%|██▍       | 796/3235 [14:22<47:14,  1.16s/it]

{'loss': 1.0538, 'grad_norm': 0.40582138299942017, 'learning_rate': 0.00015102167182662538, 'epoch': 0.25}


 25%|██▍       | 797/3235 [14:23<47:33,  1.17s/it]

{'loss': 1.126, 'grad_norm': 0.44232165813446045, 'learning_rate': 0.00015095975232198142, 'epoch': 0.25}


 25%|██▍       | 798/3235 [14:25<48:55,  1.20s/it]

{'loss': 1.1017, 'grad_norm': 0.461870938539505, 'learning_rate': 0.00015089783281733746, 'epoch': 0.25}


 25%|██▍       | 799/3235 [14:26<46:32,  1.15s/it]

{'loss': 1.0095, 'grad_norm': 0.4668799638748169, 'learning_rate': 0.0001508359133126935, 'epoch': 0.25}


 25%|██▍       | 800/3235 [14:27<44:37,  1.10s/it]

{'loss': 1.1653, 'grad_norm': 0.472650408744812, 'learning_rate': 0.00015077399380804954, 'epoch': 0.25}


 25%|██▍       | 801/3235 [14:28<43:54,  1.08s/it]

{'loss': 1.0514, 'grad_norm': 0.4602973759174347, 'learning_rate': 0.00015071207430340558, 'epoch': 0.25}


 25%|██▍       | 802/3235 [14:29<45:44,  1.13s/it]

{'loss': 0.9717, 'grad_norm': 0.4138469099998474, 'learning_rate': 0.00015065015479876162, 'epoch': 0.25}


 25%|██▍       | 803/3235 [14:30<47:24,  1.17s/it]

{'loss': 1.0637, 'grad_norm': 0.41331857442855835, 'learning_rate': 0.00015058823529411766, 'epoch': 0.25}


 25%|██▍       | 804/3235 [14:31<45:53,  1.13s/it]

{'loss': 1.0249, 'grad_norm': 0.4535231292247772, 'learning_rate': 0.0001505263157894737, 'epoch': 0.25}


 25%|██▍       | 805/3235 [14:32<44:20,  1.09s/it]

{'loss': 1.0645, 'grad_norm': 0.4677095413208008, 'learning_rate': 0.00015046439628482974, 'epoch': 0.25}


 25%|██▍       | 806/3235 [14:33<42:38,  1.05s/it]

{'loss': 1.0115, 'grad_norm': 0.5284358263015747, 'learning_rate': 0.00015040247678018578, 'epoch': 0.25}


 25%|██▍       | 807/3235 [14:35<45:10,  1.12s/it]

{'loss': 0.9699, 'grad_norm': 0.4315777122974396, 'learning_rate': 0.0001503405572755418, 'epoch': 0.25}


 25%|██▍       | 808/3235 [14:35<43:12,  1.07s/it]

{'loss': 1.0587, 'grad_norm': 0.5176039338111877, 'learning_rate': 0.00015027863777089783, 'epoch': 0.25}


 25%|██▌       | 809/3235 [14:37<44:54,  1.11s/it]

{'loss': 0.8938, 'grad_norm': 0.4322737455368042, 'learning_rate': 0.00015021671826625387, 'epoch': 0.25}


 25%|██▌       | 810/3235 [14:38<42:10,  1.04s/it]

{'loss': 0.9914, 'grad_norm': 0.5012994408607483, 'learning_rate': 0.00015015479876160991, 'epoch': 0.25}


 25%|██▌       | 811/3235 [14:39<43:30,  1.08s/it]

{'loss': 1.0142, 'grad_norm': 0.4471655786037445, 'learning_rate': 0.00015009287925696595, 'epoch': 0.25}


 25%|██▌       | 812/3235 [14:40<44:53,  1.11s/it]

{'loss': 1.0671, 'grad_norm': 0.4435122311115265, 'learning_rate': 0.000150030959752322, 'epoch': 0.25}


 25%|██▌       | 813/3235 [14:41<44:44,  1.11s/it]

{'loss': 1.1065, 'grad_norm': 0.49519291520118713, 'learning_rate': 0.000149969040247678, 'epoch': 0.25}


 25%|██▌       | 814/3235 [14:42<44:05,  1.09s/it]

{'loss': 1.0854, 'grad_norm': 0.4812068045139313, 'learning_rate': 0.00014990712074303405, 'epoch': 0.25}


 25%|██▌       | 815/3235 [14:43<41:50,  1.04s/it]

{'loss': 0.9807, 'grad_norm': 0.47892752289772034, 'learning_rate': 0.0001498452012383901, 'epoch': 0.25}


 25%|██▌       | 816/3235 [14:44<44:27,  1.10s/it]

{'loss': 1.0064, 'grad_norm': 0.3912297189235687, 'learning_rate': 0.00014978328173374613, 'epoch': 0.25}


 25%|██▌       | 817/3235 [14:45<42:20,  1.05s/it]

{'loss': 1.0403, 'grad_norm': 0.45545312762260437, 'learning_rate': 0.00014972136222910217, 'epoch': 0.25}


 25%|██▌       | 818/3235 [14:46<43:43,  1.09s/it]

{'loss': 0.9284, 'grad_norm': 0.4423091411590576, 'learning_rate': 0.00014965944272445823, 'epoch': 0.25}


 25%|██▌       | 819/3235 [14:48<44:48,  1.11s/it]

{'loss': 1.0844, 'grad_norm': 0.4437365233898163, 'learning_rate': 0.00014959752321981425, 'epoch': 0.25}


 25%|██▌       | 820/3235 [14:49<44:40,  1.11s/it]

{'loss': 1.0373, 'grad_norm': 0.4825378656387329, 'learning_rate': 0.0001495356037151703, 'epoch': 0.25}


 25%|██▌       | 821/3235 [14:50<44:49,  1.11s/it]

{'loss': 0.9851, 'grad_norm': 0.41351646184921265, 'learning_rate': 0.00014947368421052633, 'epoch': 0.25}


 25%|██▌       | 822/3235 [14:51<46:48,  1.16s/it]

{'loss': 0.9809, 'grad_norm': 0.47254982590675354, 'learning_rate': 0.00014941176470588237, 'epoch': 0.25}


 25%|██▌       | 823/3235 [14:52<46:52,  1.17s/it]

{'loss': 1.1204, 'grad_norm': 0.47640448808670044, 'learning_rate': 0.0001493498452012384, 'epoch': 0.25}


 25%|██▌       | 824/3235 [14:54<49:13,  1.23s/it]

{'loss': 1.034, 'grad_norm': 0.41692960262298584, 'learning_rate': 0.00014928792569659445, 'epoch': 0.25}


 26%|██▌       | 825/3235 [14:55<48:20,  1.20s/it]

{'loss': 1.1117, 'grad_norm': 0.47897472977638245, 'learning_rate': 0.00014922600619195046, 'epoch': 0.26}


 26%|██▌       | 826/3235 [14:56<49:19,  1.23s/it]

{'loss': 1.0128, 'grad_norm': 0.42255932092666626, 'learning_rate': 0.0001491640866873065, 'epoch': 0.26}


 26%|██▌       | 827/3235 [14:57<46:06,  1.15s/it]

{'loss': 1.074, 'grad_norm': 0.4804135859012604, 'learning_rate': 0.00014910216718266254, 'epoch': 0.26}


 26%|██▌       | 828/3235 [14:58<43:36,  1.09s/it]

{'loss': 1.1223, 'grad_norm': 0.5430240631103516, 'learning_rate': 0.00014904024767801858, 'epoch': 0.26}


 26%|██▌       | 829/3235 [14:59<44:24,  1.11s/it]

{'loss': 1.1012, 'grad_norm': 0.43293869495391846, 'learning_rate': 0.00014897832817337462, 'epoch': 0.26}


 26%|██▌       | 830/3235 [15:00<43:38,  1.09s/it]

{'loss': 1.1054, 'grad_norm': 0.47305554151535034, 'learning_rate': 0.00014891640866873066, 'epoch': 0.26}


 26%|██▌       | 831/3235 [15:01<42:02,  1.05s/it]

{'loss': 1.0624, 'grad_norm': 0.5086494088172913, 'learning_rate': 0.00014885448916408667, 'epoch': 0.26}


 26%|██▌       | 832/3235 [15:02<43:03,  1.08s/it]

{'loss': 0.9876, 'grad_norm': 0.4417053163051605, 'learning_rate': 0.00014879256965944274, 'epoch': 0.26}


 26%|██▌       | 833/3235 [15:03<45:40,  1.14s/it]

{'loss': 1.1526, 'grad_norm': 0.46775302290916443, 'learning_rate': 0.00014873065015479878, 'epoch': 0.26}


 26%|██▌       | 834/3235 [15:05<47:11,  1.18s/it]

{'loss': 0.9204, 'grad_norm': 0.419626921415329, 'learning_rate': 0.00014866873065015482, 'epoch': 0.26}


 26%|██▌       | 835/3235 [15:06<43:48,  1.10s/it]

{'loss': 1.0753, 'grad_norm': 0.5844084024429321, 'learning_rate': 0.00014860681114551086, 'epoch': 0.26}


 26%|██▌       | 836/3235 [15:07<41:57,  1.05s/it]

{'loss': 1.0355, 'grad_norm': 0.5002879500389099, 'learning_rate': 0.0001485448916408669, 'epoch': 0.26}


 26%|██▌       | 837/3235 [15:08<43:54,  1.10s/it]

{'loss': 1.0908, 'grad_norm': 0.4655494689941406, 'learning_rate': 0.0001484829721362229, 'epoch': 0.26}


 26%|██▌       | 838/3235 [15:09<40:47,  1.02s/it]

{'loss': 0.8929, 'grad_norm': 0.48124781250953674, 'learning_rate': 0.00014842105263157895, 'epoch': 0.26}


 26%|██▌       | 839/3235 [15:10<40:37,  1.02s/it]

{'loss': 1.0947, 'grad_norm': 0.4552043676376343, 'learning_rate': 0.000148359133126935, 'epoch': 0.26}


 26%|██▌       | 840/3235 [15:11<42:44,  1.07s/it]

{'loss': 1.1371, 'grad_norm': 0.4557468891143799, 'learning_rate': 0.00014829721362229103, 'epoch': 0.26}


 26%|██▌       | 841/3235 [15:12<41:27,  1.04s/it]

{'loss': 1.0201, 'grad_norm': 0.46190696954727173, 'learning_rate': 0.00014823529411764707, 'epoch': 0.26}


 26%|██▌       | 842/3235 [15:13<37:47,  1.06it/s]

{'loss': 1.2041, 'grad_norm': 0.5483676791191101, 'learning_rate': 0.0001481733746130031, 'epoch': 0.26}


 26%|██▌       | 843/3235 [15:14<37:50,  1.05it/s]

{'loss': 1.0307, 'grad_norm': 0.45319756865501404, 'learning_rate': 0.00014811145510835913, 'epoch': 0.26}


 26%|██▌       | 844/3235 [15:14<34:51,  1.14it/s]

{'loss': 0.9988, 'grad_norm': 0.6087924242019653, 'learning_rate': 0.00014804953560371517, 'epoch': 0.26}


 26%|██▌       | 845/3235 [15:15<34:59,  1.14it/s]

{'loss': 0.986, 'grad_norm': 0.4641036093235016, 'learning_rate': 0.0001479876160990712, 'epoch': 0.26}


 26%|██▌       | 846/3235 [15:16<37:02,  1.07it/s]

{'loss': 1.0286, 'grad_norm': 0.47361627221107483, 'learning_rate': 0.00014792569659442725, 'epoch': 0.26}


 26%|██▌       | 847/3235 [15:17<36:32,  1.09it/s]

{'loss': 1.0371, 'grad_norm': 0.4726375937461853, 'learning_rate': 0.00014786377708978329, 'epoch': 0.26}


 26%|██▌       | 848/3235 [15:18<38:01,  1.05it/s]

{'loss': 1.0791, 'grad_norm': 0.4954090118408203, 'learning_rate': 0.00014780185758513933, 'epoch': 0.26}


 26%|██▌       | 849/3235 [15:19<39:00,  1.02it/s]

{'loss': 1.1184, 'grad_norm': 0.49471575021743774, 'learning_rate': 0.00014773993808049537, 'epoch': 0.26}


 26%|██▋       | 850/3235 [15:20<42:00,  1.06s/it]

{'loss': 1.1317, 'grad_norm': 0.41693198680877686, 'learning_rate': 0.0001476780185758514, 'epoch': 0.26}


 26%|██▋       | 851/3235 [15:21<39:48,  1.00s/it]

{'loss': 0.9719, 'grad_norm': 0.48285531997680664, 'learning_rate': 0.00014761609907120745, 'epoch': 0.26}


 26%|██▋       | 852/3235 [15:23<43:43,  1.10s/it]

{'loss': 0.9687, 'grad_norm': 0.3974907696247101, 'learning_rate': 0.00014755417956656349, 'epoch': 0.26}


 26%|██▋       | 853/3235 [15:24<44:21,  1.12s/it]

{'loss': 0.9315, 'grad_norm': 0.45884400606155396, 'learning_rate': 0.00014749226006191953, 'epoch': 0.26}


 26%|██▋       | 854/3235 [15:25<44:00,  1.11s/it]

{'loss': 1.0315, 'grad_norm': 0.47014859318733215, 'learning_rate': 0.00014743034055727557, 'epoch': 0.26}


 26%|██▋       | 855/3235 [15:26<43:09,  1.09s/it]

{'loss': 0.9922, 'grad_norm': 0.46206900477409363, 'learning_rate': 0.00014736842105263158, 'epoch': 0.26}


 26%|██▋       | 856/3235 [15:27<44:11,  1.11s/it]

{'loss': 1.092, 'grad_norm': 0.465896874666214, 'learning_rate': 0.00014730650154798762, 'epoch': 0.26}


 26%|██▋       | 857/3235 [15:28<46:09,  1.16s/it]

{'loss': 1.1713, 'grad_norm': 0.4110265374183655, 'learning_rate': 0.00014724458204334366, 'epoch': 0.26}


 27%|██▋       | 858/3235 [15:29<44:26,  1.12s/it]

{'loss': 1.0995, 'grad_norm': 0.4857025444507599, 'learning_rate': 0.0001471826625386997, 'epoch': 0.27}


 27%|██▋       | 859/3235 [15:30<44:25,  1.12s/it]

{'loss': 0.9379, 'grad_norm': 0.45211270451545715, 'learning_rate': 0.00014712074303405574, 'epoch': 0.27}


 27%|██▋       | 860/3235 [15:32<44:44,  1.13s/it]

{'loss': 1.0408, 'grad_norm': 0.4499737024307251, 'learning_rate': 0.00014705882352941178, 'epoch': 0.27}


 27%|██▋       | 861/3235 [15:33<45:54,  1.16s/it]

{'loss': 1.0974, 'grad_norm': 0.42635226249694824, 'learning_rate': 0.0001469969040247678, 'epoch': 0.27}


 27%|██▋       | 862/3235 [15:34<46:18,  1.17s/it]

{'loss': 1.0763, 'grad_norm': 0.5059679746627808, 'learning_rate': 0.00014693498452012383, 'epoch': 0.27}


 27%|██▋       | 863/3235 [15:35<44:09,  1.12s/it]

{'loss': 0.93, 'grad_norm': 0.5491071343421936, 'learning_rate': 0.00014687306501547987, 'epoch': 0.27}


 27%|██▋       | 864/3235 [15:36<41:05,  1.04s/it]

{'loss': 0.9008, 'grad_norm': 0.5169339179992676, 'learning_rate': 0.0001468111455108359, 'epoch': 0.27}


 27%|██▋       | 865/3235 [15:37<40:09,  1.02s/it]

{'loss': 0.9199, 'grad_norm': 0.4755426049232483, 'learning_rate': 0.00014674922600619195, 'epoch': 0.27}


 27%|██▋       | 866/3235 [15:38<41:31,  1.05s/it]

{'loss': 1.0356, 'grad_norm': 0.4513292610645294, 'learning_rate': 0.000146687306501548, 'epoch': 0.27}


 27%|██▋       | 867/3235 [15:39<41:01,  1.04s/it]

{'loss': 1.1034, 'grad_norm': 0.4688643217086792, 'learning_rate': 0.00014662538699690403, 'epoch': 0.27}


 27%|██▋       | 868/3235 [15:40<42:23,  1.07s/it]

{'loss': 1.094, 'grad_norm': 0.45586562156677246, 'learning_rate': 0.00014656346749226007, 'epoch': 0.27}


 27%|██▋       | 869/3235 [15:41<39:19,  1.00it/s]

{'loss': 1.0856, 'grad_norm': 0.7009754180908203, 'learning_rate': 0.0001465015479876161, 'epoch': 0.27}


 27%|██▋       | 870/3235 [15:42<42:54,  1.09s/it]

{'loss': 1.0317, 'grad_norm': 0.4361122250556946, 'learning_rate': 0.00014643962848297215, 'epoch': 0.27}


 27%|██▋       | 871/3235 [15:43<43:13,  1.10s/it]

{'loss': 1.047, 'grad_norm': 0.42394623160362244, 'learning_rate': 0.0001463777089783282, 'epoch': 0.27}


 27%|██▋       | 872/3235 [15:44<42:13,  1.07s/it]

{'loss': 0.8474, 'grad_norm': 0.4242505133152008, 'learning_rate': 0.00014631578947368423, 'epoch': 0.27}


 27%|██▋       | 873/3235 [15:45<42:36,  1.08s/it]

{'loss': 0.8536, 'grad_norm': 0.4295571446418762, 'learning_rate': 0.00014625386996904024, 'epoch': 0.27}


 27%|██▋       | 874/3235 [15:46<41:28,  1.05s/it]

{'loss': 0.9737, 'grad_norm': 0.43789079785346985, 'learning_rate': 0.00014619195046439628, 'epoch': 0.27}


 27%|██▋       | 875/3235 [15:48<41:38,  1.06s/it]

{'loss': 1.0045, 'grad_norm': 0.4307894706726074, 'learning_rate': 0.00014613003095975232, 'epoch': 0.27}


 27%|██▋       | 876/3235 [15:49<41:46,  1.06s/it]

{'loss': 1.0103, 'grad_norm': 0.44882938265800476, 'learning_rate': 0.00014606811145510836, 'epoch': 0.27}


 27%|██▋       | 877/3235 [15:50<40:25,  1.03s/it]

{'loss': 1.0359, 'grad_norm': 0.49261391162872314, 'learning_rate': 0.0001460061919504644, 'epoch': 0.27}


 27%|██▋       | 878/3235 [15:51<40:30,  1.03s/it]

{'loss': 1.1202, 'grad_norm': 0.476340115070343, 'learning_rate': 0.00014594427244582044, 'epoch': 0.27}


 27%|██▋       | 879/3235 [15:51<37:07,  1.06it/s]

{'loss': 0.9648, 'grad_norm': 0.5232070088386536, 'learning_rate': 0.00014588235294117646, 'epoch': 0.27}


 27%|██▋       | 880/3235 [15:52<39:17,  1.00s/it]

{'loss': 1.1029, 'grad_norm': 0.4580400288105011, 'learning_rate': 0.0001458204334365325, 'epoch': 0.27}


 27%|██▋       | 881/3235 [15:54<40:54,  1.04s/it]

{'loss': 1.0137, 'grad_norm': 0.4859107434749603, 'learning_rate': 0.00014575851393188856, 'epoch': 0.27}


 27%|██▋       | 882/3235 [15:55<43:53,  1.12s/it]

{'loss': 1.0092, 'grad_norm': 0.39848417043685913, 'learning_rate': 0.0001456965944272446, 'epoch': 0.27}


 27%|██▋       | 883/3235 [15:56<41:54,  1.07s/it]

{'loss': 0.9713, 'grad_norm': 0.5444724559783936, 'learning_rate': 0.00014563467492260064, 'epoch': 0.27}


 27%|██▋       | 884/3235 [15:57<40:37,  1.04s/it]

{'loss': 1.1543, 'grad_norm': 0.46761077642440796, 'learning_rate': 0.00014557275541795668, 'epoch': 0.27}


 27%|██▋       | 885/3235 [15:58<41:25,  1.06s/it]

{'loss': 1.1709, 'grad_norm': 0.4585787355899811, 'learning_rate': 0.0001455108359133127, 'epoch': 0.27}


 27%|██▋       | 886/3235 [15:59<41:10,  1.05s/it]

{'loss': 0.8779, 'grad_norm': 0.4609561860561371, 'learning_rate': 0.00014544891640866874, 'epoch': 0.27}


 27%|██▋       | 887/3235 [16:00<42:36,  1.09s/it]

{'loss': 1.0757, 'grad_norm': 0.45320427417755127, 'learning_rate': 0.00014538699690402478, 'epoch': 0.27}


 27%|██▋       | 888/3235 [16:01<42:25,  1.08s/it]

{'loss': 0.9178, 'grad_norm': 0.47095370292663574, 'learning_rate': 0.00014532507739938082, 'epoch': 0.27}


 27%|██▋       | 889/3235 [16:02<41:06,  1.05s/it]

{'loss': 0.9372, 'grad_norm': 0.46278220415115356, 'learning_rate': 0.00014526315789473686, 'epoch': 0.27}


 28%|██▊       | 890/3235 [16:03<42:34,  1.09s/it]

{'loss': 1.1049, 'grad_norm': 0.43570899963378906, 'learning_rate': 0.00014520123839009287, 'epoch': 0.28}


 28%|██▊       | 891/3235 [16:05<43:05,  1.10s/it]

{'loss': 0.9296, 'grad_norm': 0.4667748212814331, 'learning_rate': 0.0001451393188854489, 'epoch': 0.28}


 28%|██▊       | 892/3235 [16:05<40:43,  1.04s/it]

{'loss': 0.9707, 'grad_norm': 0.5027751922607422, 'learning_rate': 0.00014507739938080495, 'epoch': 0.28}


 28%|██▊       | 893/3235 [16:06<40:57,  1.05s/it]

{'loss': 1.1035, 'grad_norm': 0.4396493434906006, 'learning_rate': 0.000145015479876161, 'epoch': 0.28}


 28%|██▊       | 894/3235 [16:08<44:42,  1.15s/it]

{'loss': 1.0433, 'grad_norm': 0.4100717604160309, 'learning_rate': 0.00014495356037151703, 'epoch': 0.28}


 28%|██▊       | 895/3235 [16:09<45:39,  1.17s/it]

{'loss': 1.0494, 'grad_norm': 0.45666319131851196, 'learning_rate': 0.00014489164086687307, 'epoch': 0.28}


 28%|██▊       | 896/3235 [16:10<44:24,  1.14s/it]

{'loss': 1.1721, 'grad_norm': 0.4770738184452057, 'learning_rate': 0.0001448297213622291, 'epoch': 0.28}


 28%|██▊       | 897/3235 [16:11<44:05,  1.13s/it]

{'loss': 1.1981, 'grad_norm': 0.4811008870601654, 'learning_rate': 0.00014476780185758515, 'epoch': 0.28}


 28%|██▊       | 898/3235 [16:13<45:22,  1.16s/it]

{'loss': 1.1485, 'grad_norm': 0.45230233669281006, 'learning_rate': 0.0001447058823529412, 'epoch': 0.28}


 28%|██▊       | 899/3235 [16:14<45:33,  1.17s/it]

{'loss': 0.9931, 'grad_norm': 0.44261932373046875, 'learning_rate': 0.00014464396284829723, 'epoch': 0.28}


 28%|██▊       | 900/3235 [16:15<43:49,  1.13s/it]

{'loss': 1.0509, 'grad_norm': 0.5049288868904114, 'learning_rate': 0.00014458204334365327, 'epoch': 0.28}


 28%|██▊       | 901/3235 [16:16<42:27,  1.09s/it]

{'loss': 1.0877, 'grad_norm': 0.5049896836280823, 'learning_rate': 0.0001445201238390093, 'epoch': 0.28}


 28%|██▊       | 902/3235 [16:17<40:05,  1.03s/it]

{'loss': 1.0625, 'grad_norm': 0.4928757846355438, 'learning_rate': 0.00014445820433436532, 'epoch': 0.28}


 28%|██▊       | 903/3235 [16:18<42:22,  1.09s/it]

{'loss': 1.1, 'grad_norm': 0.4650018513202667, 'learning_rate': 0.00014439628482972136, 'epoch': 0.28}


 28%|██▊       | 904/3235 [16:19<42:20,  1.09s/it]

{'loss': 1.0701, 'grad_norm': 0.45615634322166443, 'learning_rate': 0.0001443343653250774, 'epoch': 0.28}


 28%|██▊       | 905/3235 [16:20<39:57,  1.03s/it]

{'loss': 1.0901, 'grad_norm': 0.4661777913570404, 'learning_rate': 0.00014427244582043344, 'epoch': 0.28}


 28%|██▊       | 906/3235 [16:21<41:43,  1.08s/it]

{'loss': 0.8022, 'grad_norm': 0.47411245107650757, 'learning_rate': 0.00014421052631578948, 'epoch': 0.28}


 28%|██▊       | 907/3235 [16:22<42:15,  1.09s/it]

{'loss': 0.9181, 'grad_norm': 0.43352577090263367, 'learning_rate': 0.00014414860681114552, 'epoch': 0.28}


 28%|██▊       | 908/3235 [16:23<41:48,  1.08s/it]

{'loss': 1.0689, 'grad_norm': 0.4722276031970978, 'learning_rate': 0.00014408668730650154, 'epoch': 0.28}


 28%|██▊       | 909/3235 [16:24<42:35,  1.10s/it]

{'loss': 1.0639, 'grad_norm': 0.4564206004142761, 'learning_rate': 0.00014402476780185758, 'epoch': 0.28}


 28%|██▊       | 910/3235 [16:25<39:48,  1.03s/it]

{'loss': 1.1054, 'grad_norm': 0.5156978964805603, 'learning_rate': 0.00014396284829721362, 'epoch': 0.28}


 28%|██▊       | 911/3235 [16:26<39:27,  1.02s/it]

{'loss': 1.0649, 'grad_norm': 0.4818112254142761, 'learning_rate': 0.00014390092879256966, 'epoch': 0.28}


 28%|██▊       | 912/3235 [16:27<40:40,  1.05s/it]

{'loss': 1.0689, 'grad_norm': 0.4471592307090759, 'learning_rate': 0.0001438390092879257, 'epoch': 0.28}


 28%|██▊       | 913/3235 [16:29<42:20,  1.09s/it]

{'loss': 1.0357, 'grad_norm': 0.44264036417007446, 'learning_rate': 0.00014377708978328174, 'epoch': 0.28}


 28%|██▊       | 914/3235 [16:30<43:51,  1.13s/it]

{'loss': 1.1185, 'grad_norm': 0.4588319957256317, 'learning_rate': 0.00014371517027863778, 'epoch': 0.28}


 28%|██▊       | 915/3235 [16:31<44:45,  1.16s/it]

{'loss': 1.0612, 'grad_norm': 0.4491364359855652, 'learning_rate': 0.00014365325077399382, 'epoch': 0.28}


 28%|██▊       | 916/3235 [16:32<43:29,  1.13s/it]

{'loss': 1.001, 'grad_norm': 0.5130767226219177, 'learning_rate': 0.00014359133126934986, 'epoch': 0.28}


 28%|██▊       | 917/3235 [16:33<43:52,  1.14s/it]

{'loss': 0.9985, 'grad_norm': 0.4661855697631836, 'learning_rate': 0.0001435294117647059, 'epoch': 0.28}


 28%|██▊       | 918/3235 [16:34<45:35,  1.18s/it]

{'loss': 1.1397, 'grad_norm': 0.46655139327049255, 'learning_rate': 0.00014346749226006194, 'epoch': 0.28}


 28%|██▊       | 919/3235 [16:36<47:51,  1.24s/it]

{'loss': 1.085, 'grad_norm': 0.39440152049064636, 'learning_rate': 0.00014340557275541798, 'epoch': 0.28}


 28%|██▊       | 920/3235 [16:37<45:55,  1.19s/it]

{'loss': 1.0457, 'grad_norm': 0.4639681875705719, 'learning_rate': 0.000143343653250774, 'epoch': 0.28}


 28%|██▊       | 921/3235 [16:38<45:00,  1.17s/it]

{'loss': 1.1164, 'grad_norm': 0.5765901207923889, 'learning_rate': 0.00014328173374613003, 'epoch': 0.28}


 29%|██▊       | 922/3235 [16:39<43:18,  1.12s/it]

{'loss': 1.0905, 'grad_norm': 0.4939430058002472, 'learning_rate': 0.00014321981424148607, 'epoch': 0.29}


 29%|██▊       | 923/3235 [16:40<43:47,  1.14s/it]

{'loss': 0.9309, 'grad_norm': 0.40581387281417847, 'learning_rate': 0.0001431578947368421, 'epoch': 0.29}


 29%|██▊       | 924/3235 [16:41<42:58,  1.12s/it]

{'loss': 1.0068, 'grad_norm': 0.45840755105018616, 'learning_rate': 0.00014309597523219815, 'epoch': 0.29}


 29%|██▊       | 925/3235 [16:42<43:39,  1.13s/it]

{'loss': 0.9856, 'grad_norm': 0.422463059425354, 'learning_rate': 0.0001430340557275542, 'epoch': 0.29}


 29%|██▊       | 926/3235 [16:43<39:39,  1.03s/it]

{'loss': 0.9624, 'grad_norm': 0.5114031434059143, 'learning_rate': 0.0001429721362229102, 'epoch': 0.29}


 29%|██▊       | 927/3235 [16:44<35:46,  1.08it/s]

{'loss': 0.9596, 'grad_norm': 0.5744758248329163, 'learning_rate': 0.00014291021671826624, 'epoch': 0.29}


 29%|██▊       | 928/3235 [16:45<39:14,  1.02s/it]

{'loss': 1.0048, 'grad_norm': 0.46374961733818054, 'learning_rate': 0.00014284829721362228, 'epoch': 0.29}


 29%|██▊       | 929/3235 [16:46<38:10,  1.01it/s]

{'loss': 0.9284, 'grad_norm': 0.48129791021347046, 'learning_rate': 0.00014278637770897832, 'epoch': 0.29}


 29%|██▊       | 930/3235 [16:47<39:20,  1.02s/it]

{'loss': 1.12, 'grad_norm': 0.5209022760391235, 'learning_rate': 0.0001427244582043344, 'epoch': 0.29}


 29%|██▉       | 931/3235 [16:48<37:07,  1.03it/s]

{'loss': 0.9824, 'grad_norm': 0.5080044269561768, 'learning_rate': 0.00014266253869969043, 'epoch': 0.29}


 29%|██▉       | 932/3235 [16:49<37:01,  1.04it/s]

{'loss': 1.0915, 'grad_norm': 0.5330417156219482, 'learning_rate': 0.00014260061919504644, 'epoch': 0.29}


 29%|██▉       | 933/3235 [16:50<35:40,  1.08it/s]

{'loss': 1.0089, 'grad_norm': 0.5054718852043152, 'learning_rate': 0.00014253869969040248, 'epoch': 0.29}


 29%|██▉       | 934/3235 [16:51<37:07,  1.03it/s]

{'loss': 1.048, 'grad_norm': 0.46583420038223267, 'learning_rate': 0.00014247678018575852, 'epoch': 0.29}


 29%|██▉       | 935/3235 [16:52<40:48,  1.06s/it]

{'loss': 1.0993, 'grad_norm': 0.48323652148246765, 'learning_rate': 0.00014241486068111456, 'epoch': 0.29}


 29%|██▉       | 936/3235 [16:53<42:39,  1.11s/it]

{'loss': 0.9974, 'grad_norm': 0.49665307998657227, 'learning_rate': 0.0001423529411764706, 'epoch': 0.29}


 29%|██▉       | 937/3235 [16:54<39:18,  1.03s/it]

{'loss': 0.9024, 'grad_norm': 0.4981301724910736, 'learning_rate': 0.00014229102167182664, 'epoch': 0.29}


 29%|██▉       | 938/3235 [16:55<36:54,  1.04it/s]

{'loss': 0.9022, 'grad_norm': 0.48980239033699036, 'learning_rate': 0.00014222910216718266, 'epoch': 0.29}


 29%|██▉       | 939/3235 [16:56<35:00,  1.09it/s]

{'loss': 0.9551, 'grad_norm': 0.48661354184150696, 'learning_rate': 0.0001421671826625387, 'epoch': 0.29}


 29%|██▉       | 940/3235 [16:57<36:56,  1.04it/s]

{'loss': 0.9859, 'grad_norm': 0.45588135719299316, 'learning_rate': 0.00014210526315789474, 'epoch': 0.29}


 29%|██▉       | 941/3235 [16:58<38:22,  1.00s/it]

{'loss': 0.9303, 'grad_norm': 0.4405968189239502, 'learning_rate': 0.00014204334365325078, 'epoch': 0.29}


 29%|██▉       | 942/3235 [16:59<37:50,  1.01it/s]

{'loss': 1.0684, 'grad_norm': 0.5645321607589722, 'learning_rate': 0.00014198142414860682, 'epoch': 0.29}


 29%|██▉       | 943/3235 [17:00<40:52,  1.07s/it]

{'loss': 0.9643, 'grad_norm': 0.42900794744491577, 'learning_rate': 0.00014191950464396286, 'epoch': 0.29}


 29%|██▉       | 944/3235 [17:01<37:43,  1.01it/s]

{'loss': 0.9847, 'grad_norm': 0.510648787021637, 'learning_rate': 0.0001418575851393189, 'epoch': 0.29}


 29%|██▉       | 945/3235 [17:02<39:36,  1.04s/it]

{'loss': 1.0878, 'grad_norm': 0.45989277958869934, 'learning_rate': 0.00014179566563467494, 'epoch': 0.29}


 29%|██▉       | 946/3235 [17:03<40:36,  1.06s/it]

{'loss': 1.08, 'grad_norm': 0.4679636061191559, 'learning_rate': 0.00014173374613003098, 'epoch': 0.29}


 29%|██▉       | 947/3235 [17:05<42:13,  1.11s/it]

{'loss': 1.093, 'grad_norm': 0.4310796856880188, 'learning_rate': 0.00014167182662538702, 'epoch': 0.29}


 29%|██▉       | 948/3235 [17:06<46:07,  1.21s/it]

{'loss': 1.0234, 'grad_norm': 0.41938281059265137, 'learning_rate': 0.00014160990712074306, 'epoch': 0.29}


 29%|██▉       | 949/3235 [17:07<45:48,  1.20s/it]

{'loss': 0.9286, 'grad_norm': 0.45675089955329895, 'learning_rate': 0.0001415479876160991, 'epoch': 0.29}


 29%|██▉       | 950/3235 [17:08<44:03,  1.16s/it]

{'loss': 1.0059, 'grad_norm': 0.47126007080078125, 'learning_rate': 0.0001414860681114551, 'epoch': 0.29}


 29%|██▉       | 951/3235 [17:09<44:23,  1.17s/it]

{'loss': 0.8902, 'grad_norm': 0.46553564071655273, 'learning_rate': 0.00014142414860681115, 'epoch': 0.29}


 29%|██▉       | 952/3235 [17:10<43:23,  1.14s/it]

{'loss': 1.0343, 'grad_norm': 0.43456578254699707, 'learning_rate': 0.0001413622291021672, 'epoch': 0.29}


 29%|██▉       | 953/3235 [17:11<41:10,  1.08s/it]

{'loss': 0.9974, 'grad_norm': 0.5433354377746582, 'learning_rate': 0.00014130030959752323, 'epoch': 0.29}


 29%|██▉       | 954/3235 [17:13<41:48,  1.10s/it]

{'loss': 1.0116, 'grad_norm': 0.47841331362724304, 'learning_rate': 0.00014123839009287927, 'epoch': 0.29}


 30%|██▉       | 955/3235 [17:14<40:39,  1.07s/it]

{'loss': 0.9303, 'grad_norm': 0.5410110354423523, 'learning_rate': 0.0001411764705882353, 'epoch': 0.3}


 30%|██▉       | 956/3235 [17:15<41:48,  1.10s/it]

{'loss': 1.0614, 'grad_norm': 0.40849533677101135, 'learning_rate': 0.00014111455108359132, 'epoch': 0.3}


 30%|██▉       | 957/3235 [17:16<39:11,  1.03s/it]

{'loss': 1.0931, 'grad_norm': 0.5885010957717896, 'learning_rate': 0.00014105263157894736, 'epoch': 0.3}


 30%|██▉       | 958/3235 [17:16<36:51,  1.03it/s]

{'loss': 1.2262, 'grad_norm': 0.6239563226699829, 'learning_rate': 0.0001409907120743034, 'epoch': 0.3}


 30%|██▉       | 959/3235 [17:17<36:38,  1.04it/s]

{'loss': 1.0275, 'grad_norm': 0.46513932943344116, 'learning_rate': 0.00014092879256965944, 'epoch': 0.3}


 30%|██▉       | 960/3235 [17:19<39:03,  1.03s/it]

{'loss': 1.0895, 'grad_norm': 0.5013121962547302, 'learning_rate': 0.00014086687306501548, 'epoch': 0.3}


 30%|██▉       | 961/3235 [17:20<41:32,  1.10s/it]

{'loss': 1.0405, 'grad_norm': 0.44953081011772156, 'learning_rate': 0.00014080495356037152, 'epoch': 0.3}


 30%|██▉       | 962/3235 [17:21<42:42,  1.13s/it]

{'loss': 0.8786, 'grad_norm': 0.4600675106048584, 'learning_rate': 0.00014074303405572756, 'epoch': 0.3}


 30%|██▉       | 963/3235 [17:22<39:14,  1.04s/it]

{'loss': 1.0288, 'grad_norm': 0.506923258304596, 'learning_rate': 0.0001406811145510836, 'epoch': 0.3}


 30%|██▉       | 964/3235 [17:23<38:17,  1.01s/it]

{'loss': 1.0401, 'grad_norm': 0.4689143896102905, 'learning_rate': 0.00014061919504643964, 'epoch': 0.3}


 30%|██▉       | 965/3235 [17:24<39:03,  1.03s/it]

{'loss': 1.0217, 'grad_norm': 0.4794379472732544, 'learning_rate': 0.00014055727554179568, 'epoch': 0.3}


 30%|██▉       | 966/3235 [17:25<40:26,  1.07s/it]

{'loss': 0.9811, 'grad_norm': 0.4373503029346466, 'learning_rate': 0.00014049535603715172, 'epoch': 0.3}


 30%|██▉       | 967/3235 [17:26<37:51,  1.00s/it]

{'loss': 0.8978, 'grad_norm': 0.5156981945037842, 'learning_rate': 0.00014043343653250776, 'epoch': 0.3}


 30%|██▉       | 968/3235 [17:27<36:35,  1.03it/s]

{'loss': 0.9686, 'grad_norm': 0.48474979400634766, 'learning_rate': 0.00014037151702786377, 'epoch': 0.3}


 30%|██▉       | 969/3235 [17:28<36:44,  1.03it/s]

{'loss': 0.9542, 'grad_norm': 0.5048162937164307, 'learning_rate': 0.00014030959752321981, 'epoch': 0.3}


 30%|██▉       | 970/3235 [17:29<35:57,  1.05it/s]

{'loss': 1.0782, 'grad_norm': 0.5701727271080017, 'learning_rate': 0.00014024767801857585, 'epoch': 0.3}


 30%|███       | 971/3235 [17:29<34:27,  1.09it/s]

{'loss': 0.9998, 'grad_norm': 0.54084712266922, 'learning_rate': 0.0001401857585139319, 'epoch': 0.3}


 30%|███       | 972/3235 [17:31<36:33,  1.03it/s]

{'loss': 1.1089, 'grad_norm': 0.47432056069374084, 'learning_rate': 0.00014012383900928793, 'epoch': 0.3}


 30%|███       | 973/3235 [17:32<37:35,  1.00it/s]

{'loss': 1.0773, 'grad_norm': 0.4606998860836029, 'learning_rate': 0.00014006191950464397, 'epoch': 0.3}


 30%|███       | 974/3235 [17:33<38:39,  1.03s/it]

{'loss': 1.1259, 'grad_norm': 0.48202723264694214, 'learning_rate': 0.00014, 'epoch': 0.3}


 30%|███       | 975/3235 [17:34<41:47,  1.11s/it]

{'loss': 1.0393, 'grad_norm': 0.46663475036621094, 'learning_rate': 0.00013993808049535603, 'epoch': 0.3}


 30%|███       | 976/3235 [17:35<40:51,  1.09s/it]

{'loss': 0.9987, 'grad_norm': 0.5010754466056824, 'learning_rate': 0.00013987616099071207, 'epoch': 0.3}


 30%|███       | 977/3235 [17:36<43:44,  1.16s/it]

{'loss': 1.1564, 'grad_norm': 0.4450337588787079, 'learning_rate': 0.0001398142414860681, 'epoch': 0.3}


 30%|███       | 978/3235 [17:37<41:47,  1.11s/it]

{'loss': 0.9931, 'grad_norm': 0.5428578853607178, 'learning_rate': 0.00013975232198142415, 'epoch': 0.3}


 30%|███       | 979/3235 [17:39<42:55,  1.14s/it]

{'loss': 0.9283, 'grad_norm': 0.46396127343177795, 'learning_rate': 0.00013969040247678021, 'epoch': 0.3}


 30%|███       | 980/3235 [17:39<39:57,  1.06s/it]

{'loss': 1.0349, 'grad_norm': 0.47261402010917664, 'learning_rate': 0.00013962848297213623, 'epoch': 0.3}


 30%|███       | 981/3235 [17:41<41:46,  1.11s/it]

{'loss': 0.9144, 'grad_norm': 0.47657114267349243, 'learning_rate': 0.00013956656346749227, 'epoch': 0.3}


 30%|███       | 982/3235 [17:42<44:12,  1.18s/it]

{'loss': 1.2063, 'grad_norm': 0.46460816264152527, 'learning_rate': 0.0001395046439628483, 'epoch': 0.3}


 30%|███       | 983/3235 [17:43<42:35,  1.13s/it]

{'loss': 1.0023, 'grad_norm': 0.4414892792701721, 'learning_rate': 0.00013944272445820435, 'epoch': 0.3}


 30%|███       | 984/3235 [17:44<39:55,  1.06s/it]

{'loss': 1.0066, 'grad_norm': 0.48517343401908875, 'learning_rate': 0.0001393808049535604, 'epoch': 0.3}


 30%|███       | 985/3235 [17:45<40:23,  1.08s/it]

{'loss': 1.1067, 'grad_norm': 0.4491899311542511, 'learning_rate': 0.00013931888544891643, 'epoch': 0.3}


 30%|███       | 986/3235 [17:46<39:08,  1.04s/it]

{'loss': 0.9554, 'grad_norm': 0.45468881726264954, 'learning_rate': 0.00013925696594427244, 'epoch': 0.3}


 31%|███       | 987/3235 [17:47<39:33,  1.06s/it]

{'loss': 1.0201, 'grad_norm': 0.4780237376689911, 'learning_rate': 0.00013919504643962848, 'epoch': 0.31}


 31%|███       | 988/3235 [17:48<36:42,  1.02it/s]

{'loss': 1.1341, 'grad_norm': 0.5689890384674072, 'learning_rate': 0.00013913312693498452, 'epoch': 0.31}


 31%|███       | 989/3235 [17:49<36:00,  1.04it/s]

{'loss': 0.9657, 'grad_norm': 0.5159749984741211, 'learning_rate': 0.00013907120743034056, 'epoch': 0.31}


 31%|███       | 990/3235 [17:50<36:47,  1.02it/s]

{'loss': 1.0727, 'grad_norm': 0.49744904041290283, 'learning_rate': 0.0001390092879256966, 'epoch': 0.31}


 31%|███       | 991/3235 [17:51<35:54,  1.04it/s]

{'loss': 1.0415, 'grad_norm': 0.5210123658180237, 'learning_rate': 0.00013894736842105264, 'epoch': 0.31}


 31%|███       | 992/3235 [17:52<38:55,  1.04s/it]

{'loss': 1.0773, 'grad_norm': 0.44179973006248474, 'learning_rate': 0.00013888544891640865, 'epoch': 0.31}


 31%|███       | 993/3235 [17:53<37:53,  1.01s/it]

{'loss': 1.0805, 'grad_norm': 0.509414792060852, 'learning_rate': 0.00013882352941176472, 'epoch': 0.31}


 31%|███       | 994/3235 [17:54<38:50,  1.04s/it]

{'loss': 0.8809, 'grad_norm': 0.4586712718009949, 'learning_rate': 0.00013876160990712076, 'epoch': 0.31}


 31%|███       | 995/3235 [17:55<39:14,  1.05s/it]

{'loss': 1.0653, 'grad_norm': 0.524432897567749, 'learning_rate': 0.0001386996904024768, 'epoch': 0.31}


 31%|███       | 996/3235 [17:56<40:36,  1.09s/it]

{'loss': 1.0031, 'grad_norm': 0.46410417556762695, 'learning_rate': 0.00013863777089783284, 'epoch': 0.31}


 31%|███       | 997/3235 [17:57<39:13,  1.05s/it]

{'loss': 0.9441, 'grad_norm': 0.4717578589916229, 'learning_rate': 0.00013857585139318888, 'epoch': 0.31}


 31%|███       | 998/3235 [17:59<43:04,  1.16s/it]

{'loss': 1.0182, 'grad_norm': 0.43215620517730713, 'learning_rate': 0.0001385139318885449, 'epoch': 0.31}


 31%|███       | 999/3235 [18:00<42:33,  1.14s/it]

{'loss': 1.1678, 'grad_norm': 0.5056880116462708, 'learning_rate': 0.00013845201238390093, 'epoch': 0.31}


 31%|███       | 1000/3235 [18:01<42:14,  1.13s/it]

{'loss': 0.9981, 'grad_norm': 0.4624656140804291, 'learning_rate': 0.00013839009287925697, 'epoch': 0.31}


[34m[1mwandb[0m: Adding directory to artifact (./outputs/checkpoint-1000)... Done. 0.1s
 31%|███       | 1001/3235 [18:03<51:09,  1.37s/it]

{'loss': 0.9696, 'grad_norm': 0.43210604786872864, 'learning_rate': 0.00013832817337461301, 'epoch': 0.31}


 31%|███       | 1002/3235 [18:04<50:29,  1.36s/it]

{'loss': 1.0817, 'grad_norm': 0.4423803985118866, 'learning_rate': 0.00013826625386996905, 'epoch': 0.31}


 31%|███       | 1003/3235 [18:05<46:39,  1.25s/it]

{'loss': 0.9592, 'grad_norm': 0.44726160168647766, 'learning_rate': 0.0001382043343653251, 'epoch': 0.31}


 31%|███       | 1004/3235 [18:06<45:58,  1.24s/it]

{'loss': 1.1391, 'grad_norm': 0.4605949819087982, 'learning_rate': 0.0001381424148606811, 'epoch': 0.31}


 31%|███       | 1005/3235 [18:08<47:14,  1.27s/it]

{'loss': 1.1182, 'grad_norm': 0.4584994614124298, 'learning_rate': 0.00013808049535603715, 'epoch': 0.31}


 31%|███       | 1006/3235 [18:09<42:21,  1.14s/it]

{'loss': 1.1482, 'grad_norm': 0.7043921947479248, 'learning_rate': 0.0001380185758513932, 'epoch': 0.31}


 31%|███       | 1007/3235 [18:10<41:49,  1.13s/it]

{'loss': 0.9293, 'grad_norm': 0.4348108470439911, 'learning_rate': 0.00013795665634674923, 'epoch': 0.31}


 31%|███       | 1008/3235 [18:11<43:32,  1.17s/it]

{'loss': 1.0377, 'grad_norm': 0.4393388330936432, 'learning_rate': 0.00013789473684210527, 'epoch': 0.31}


 31%|███       | 1009/3235 [18:12<40:51,  1.10s/it]

{'loss': 1.0534, 'grad_norm': 0.49187299609184265, 'learning_rate': 0.0001378328173374613, 'epoch': 0.31}


 31%|███       | 1010/3235 [18:13<38:12,  1.03s/it]

{'loss': 1.0513, 'grad_norm': 0.5595635771751404, 'learning_rate': 0.00013777089783281735, 'epoch': 0.31}


 31%|███▏      | 1011/3235 [18:14<38:56,  1.05s/it]

{'loss': 1.0127, 'grad_norm': 0.4658570885658264, 'learning_rate': 0.0001377089783281734, 'epoch': 0.31}


 31%|███▏      | 1012/3235 [18:15<37:47,  1.02s/it]

{'loss': 1.1559, 'grad_norm': 0.4828183352947235, 'learning_rate': 0.00013764705882352943, 'epoch': 0.31}


 31%|███▏      | 1013/3235 [18:16<38:11,  1.03s/it]

{'loss': 1.16, 'grad_norm': 0.6089365482330322, 'learning_rate': 0.00013758513931888547, 'epoch': 0.31}


 31%|███▏      | 1014/3235 [18:17<42:36,  1.15s/it]

{'loss': 0.9926, 'grad_norm': 0.39407920837402344, 'learning_rate': 0.0001375232198142415, 'epoch': 0.31}


 31%|███▏      | 1015/3235 [18:18<42:46,  1.16s/it]

{'loss': 0.9503, 'grad_norm': 0.4274018108844757, 'learning_rate': 0.00013746130030959755, 'epoch': 0.31}


 31%|███▏      | 1016/3235 [18:20<43:32,  1.18s/it]

{'loss': 0.7836, 'grad_norm': 0.4395201504230499, 'learning_rate': 0.00013739938080495356, 'epoch': 0.31}


 31%|███▏      | 1017/3235 [18:21<41:55,  1.13s/it]

{'loss': 1.0513, 'grad_norm': 0.4398971199989319, 'learning_rate': 0.0001373374613003096, 'epoch': 0.31}


 31%|███▏      | 1018/3235 [18:22<41:36,  1.13s/it]

{'loss': 1.2323, 'grad_norm': 0.5059861540794373, 'learning_rate': 0.00013727554179566564, 'epoch': 0.31}


 31%|███▏      | 1019/3235 [18:23<43:07,  1.17s/it]

{'loss': 1.0289, 'grad_norm': 0.44846493005752563, 'learning_rate': 0.00013721362229102168, 'epoch': 0.31}


 32%|███▏      | 1020/3235 [18:24<41:25,  1.12s/it]

{'loss': 1.0365, 'grad_norm': 0.500684380531311, 'learning_rate': 0.00013715170278637772, 'epoch': 0.32}


 32%|███▏      | 1021/3235 [18:25<43:42,  1.18s/it]

{'loss': 1.1205, 'grad_norm': 0.42716550827026367, 'learning_rate': 0.00013708978328173376, 'epoch': 0.32}


 32%|███▏      | 1022/3235 [18:26<40:49,  1.11s/it]

{'loss': 1.0101, 'grad_norm': 0.6182792782783508, 'learning_rate': 0.00013702786377708977, 'epoch': 0.32}


 32%|███▏      | 1023/3235 [18:27<38:56,  1.06s/it]

{'loss': 1.0606, 'grad_norm': 0.49621206521987915, 'learning_rate': 0.0001369659442724458, 'epoch': 0.32}


 32%|███▏      | 1024/3235 [18:28<36:52,  1.00s/it]

{'loss': 0.8154, 'grad_norm': 0.4376847445964813, 'learning_rate': 0.00013690402476780185, 'epoch': 0.32}


 32%|███▏      | 1025/3235 [18:29<36:32,  1.01it/s]

{'loss': 1.0218, 'grad_norm': 0.5041518807411194, 'learning_rate': 0.0001368421052631579, 'epoch': 0.32}


 32%|███▏      | 1026/3235 [18:30<37:24,  1.02s/it]

{'loss': 1.1034, 'grad_norm': 0.4673978090286255, 'learning_rate': 0.00013678018575851393, 'epoch': 0.32}


 32%|███▏      | 1027/3235 [18:31<40:01,  1.09s/it]

{'loss': 1.092, 'grad_norm': 0.4388231337070465, 'learning_rate': 0.00013671826625387, 'epoch': 0.32}


 32%|███▏      | 1028/3235 [18:32<36:51,  1.00s/it]

{'loss': 1.0527, 'grad_norm': 0.5439139604568481, 'learning_rate': 0.000136656346749226, 'epoch': 0.32}


 32%|███▏      | 1029/3235 [18:34<41:04,  1.12s/it]

{'loss': 1.136, 'grad_norm': 0.46665850281715393, 'learning_rate': 0.00013659442724458205, 'epoch': 0.32}


 32%|███▏      | 1030/3235 [18:35<40:27,  1.10s/it]

{'loss': 0.9459, 'grad_norm': 0.501520037651062, 'learning_rate': 0.0001365325077399381, 'epoch': 0.32}


 32%|███▏      | 1031/3235 [18:36<38:33,  1.05s/it]

{'loss': 0.9931, 'grad_norm': 0.5142756104469299, 'learning_rate': 0.00013647058823529413, 'epoch': 0.32}


 32%|███▏      | 1032/3235 [18:37<41:55,  1.14s/it]

{'loss': 1.1262, 'grad_norm': 0.45068761706352234, 'learning_rate': 0.00013640866873065017, 'epoch': 0.32}


 32%|███▏      | 1033/3235 [18:38<40:36,  1.11s/it]

{'loss': 1.0871, 'grad_norm': 0.4442950487136841, 'learning_rate': 0.00013634674922600619, 'epoch': 0.32}


 32%|███▏      | 1034/3235 [18:39<39:00,  1.06s/it]

{'loss': 0.9805, 'grad_norm': 0.5156621336936951, 'learning_rate': 0.00013628482972136223, 'epoch': 0.32}


 32%|███▏      | 1035/3235 [18:40<37:42,  1.03s/it]

{'loss': 1.1214, 'grad_norm': 0.5023043155670166, 'learning_rate': 0.00013622291021671827, 'epoch': 0.32}


 32%|███▏      | 1036/3235 [18:41<38:11,  1.04s/it]

{'loss': 1.0203, 'grad_norm': 0.4723163843154907, 'learning_rate': 0.0001361609907120743, 'epoch': 0.32}


 32%|███▏      | 1037/3235 [18:42<38:41,  1.06s/it]

{'loss': 0.9811, 'grad_norm': 0.5144045948982239, 'learning_rate': 0.00013609907120743035, 'epoch': 0.32}


 32%|███▏      | 1038/3235 [18:43<37:33,  1.03s/it]

{'loss': 1.184, 'grad_norm': 0.5483201742172241, 'learning_rate': 0.00013603715170278639, 'epoch': 0.32}


 32%|███▏      | 1039/3235 [18:44<40:01,  1.09s/it]

{'loss': 1.1754, 'grad_norm': 0.5124549865722656, 'learning_rate': 0.0001359752321981424, 'epoch': 0.32}


 32%|███▏      | 1040/3235 [18:45<40:42,  1.11s/it]

{'loss': 0.9718, 'grad_norm': 0.4807547628879547, 'learning_rate': 0.00013591331269349844, 'epoch': 0.32}


 32%|███▏      | 1041/3235 [18:47<40:33,  1.11s/it]

{'loss': 1.02, 'grad_norm': 0.46539106965065, 'learning_rate': 0.00013585139318885448, 'epoch': 0.32}


 32%|███▏      | 1042/3235 [18:48<41:23,  1.13s/it]

{'loss': 1.0185, 'grad_norm': 0.4335551857948303, 'learning_rate': 0.00013578947368421055, 'epoch': 0.32}


 32%|███▏      | 1043/3235 [18:49<40:34,  1.11s/it]

{'loss': 1.0184, 'grad_norm': 0.46052318811416626, 'learning_rate': 0.00013572755417956659, 'epoch': 0.32}


 32%|███▏      | 1044/3235 [18:50<38:52,  1.06s/it]

{'loss': 1.0292, 'grad_norm': 0.5115307569503784, 'learning_rate': 0.00013566563467492263, 'epoch': 0.32}


 32%|███▏      | 1045/3235 [18:51<37:35,  1.03s/it]

{'loss': 1.0932, 'grad_norm': 0.48556846380233765, 'learning_rate': 0.00013560371517027864, 'epoch': 0.32}


 32%|███▏      | 1046/3235 [18:52<38:37,  1.06s/it]

{'loss': 1.1797, 'grad_norm': 0.49452197551727295, 'learning_rate': 0.00013554179566563468, 'epoch': 0.32}


 32%|███▏      | 1047/3235 [18:53<40:15,  1.10s/it]

{'loss': 1.1768, 'grad_norm': 0.4453267753124237, 'learning_rate': 0.00013547987616099072, 'epoch': 0.32}


 32%|███▏      | 1048/3235 [18:54<40:11,  1.10s/it]

{'loss': 1.2039, 'grad_norm': 0.5034487843513489, 'learning_rate': 0.00013541795665634676, 'epoch': 0.32}


 32%|███▏      | 1049/3235 [18:55<39:48,  1.09s/it]

{'loss': 1.0259, 'grad_norm': 0.5020856261253357, 'learning_rate': 0.0001353560371517028, 'epoch': 0.32}


 32%|███▏      | 1050/3235 [18:56<40:00,  1.10s/it]

{'loss': 0.9597, 'grad_norm': 0.48227065801620483, 'learning_rate': 0.00013529411764705884, 'epoch': 0.32}


 32%|███▏      | 1051/3235 [18:57<38:56,  1.07s/it]

{'loss': 1.1793, 'grad_norm': 0.5568605661392212, 'learning_rate': 0.00013523219814241485, 'epoch': 0.32}


 33%|███▎      | 1052/3235 [18:59<40:21,  1.11s/it]

{'loss': 1.029, 'grad_norm': 0.4425099194049835, 'learning_rate': 0.0001351702786377709, 'epoch': 0.33}


 33%|███▎      | 1053/3235 [19:00<42:00,  1.16s/it]

{'loss': 1.0003, 'grad_norm': 0.43174439668655396, 'learning_rate': 0.00013510835913312693, 'epoch': 0.33}


 33%|███▎      | 1054/3235 [19:01<39:48,  1.10s/it]

{'loss': 0.9094, 'grad_norm': 0.47098520398139954, 'learning_rate': 0.00013504643962848297, 'epoch': 0.33}


 33%|███▎      | 1055/3235 [19:02<38:55,  1.07s/it]

{'loss': 1.1346, 'grad_norm': 0.46638041734695435, 'learning_rate': 0.000134984520123839, 'epoch': 0.33}


 33%|███▎      | 1056/3235 [19:03<43:05,  1.19s/it]

{'loss': 1.0073, 'grad_norm': 0.41788437962532043, 'learning_rate': 0.00013492260061919505, 'epoch': 0.33}


 33%|███▎      | 1057/3235 [19:04<40:12,  1.11s/it]

{'loss': 1.0832, 'grad_norm': 0.5362136960029602, 'learning_rate': 0.0001348606811145511, 'epoch': 0.33}


 33%|███▎      | 1058/3235 [19:05<39:09,  1.08s/it]

{'loss': 0.8716, 'grad_norm': 0.48984283208847046, 'learning_rate': 0.00013479876160990713, 'epoch': 0.33}


 33%|███▎      | 1059/3235 [19:06<40:07,  1.11s/it]

{'loss': 0.9728, 'grad_norm': 0.4649367332458496, 'learning_rate': 0.00013473684210526317, 'epoch': 0.33}


 33%|███▎      | 1060/3235 [19:08<41:42,  1.15s/it]

{'loss': 1.0375, 'grad_norm': 0.45823249220848083, 'learning_rate': 0.0001346749226006192, 'epoch': 0.33}


 33%|███▎      | 1061/3235 [19:09<41:22,  1.14s/it]

{'loss': 0.8524, 'grad_norm': 0.4533309042453766, 'learning_rate': 0.00013461300309597525, 'epoch': 0.33}


 33%|███▎      | 1062/3235 [19:10<41:17,  1.14s/it]

{'loss': 0.9542, 'grad_norm': 0.4607775807380676, 'learning_rate': 0.0001345510835913313, 'epoch': 0.33}


 33%|███▎      | 1063/3235 [19:11<40:04,  1.11s/it]

{'loss': 0.9079, 'grad_norm': 0.4622279107570648, 'learning_rate': 0.0001344891640866873, 'epoch': 0.33}


 33%|███▎      | 1064/3235 [19:12<41:06,  1.14s/it]

{'loss': 1.0264, 'grad_norm': 0.41330358386039734, 'learning_rate': 0.00013442724458204334, 'epoch': 0.33}


 33%|███▎      | 1065/3235 [19:13<42:38,  1.18s/it]

{'loss': 1.0359, 'grad_norm': 0.44065961241722107, 'learning_rate': 0.00013436532507739938, 'epoch': 0.33}


 33%|███▎      | 1066/3235 [19:14<40:47,  1.13s/it]

{'loss': 1.0829, 'grad_norm': 0.4764159321784973, 'learning_rate': 0.00013430340557275542, 'epoch': 0.33}


 33%|███▎      | 1067/3235 [19:16<43:06,  1.19s/it]

{'loss': 1.0924, 'grad_norm': 0.44737836718559265, 'learning_rate': 0.00013424148606811146, 'epoch': 0.33}


 33%|███▎      | 1068/3235 [19:17<41:00,  1.14s/it]

{'loss': 0.9046, 'grad_norm': 0.48172733187675476, 'learning_rate': 0.0001341795665634675, 'epoch': 0.33}


 33%|███▎      | 1069/3235 [19:18<41:35,  1.15s/it]

{'loss': 1.094, 'grad_norm': 0.458603173494339, 'learning_rate': 0.00013411764705882352, 'epoch': 0.33}


 33%|███▎      | 1070/3235 [19:19<41:33,  1.15s/it]

{'loss': 1.1017, 'grad_norm': 0.5274955630302429, 'learning_rate': 0.00013405572755417956, 'epoch': 0.33}


 33%|███▎      | 1071/3235 [19:20<41:22,  1.15s/it]

{'loss': 1.1203, 'grad_norm': 0.504108190536499, 'learning_rate': 0.0001339938080495356, 'epoch': 0.33}


 33%|███▎      | 1072/3235 [19:22<43:35,  1.21s/it]

{'loss': 1.0107, 'grad_norm': 0.44416719675064087, 'learning_rate': 0.00013393188854489164, 'epoch': 0.33}


 33%|███▎      | 1073/3235 [19:23<41:10,  1.14s/it]

{'loss': 1.0564, 'grad_norm': 0.49730539321899414, 'learning_rate': 0.00013386996904024768, 'epoch': 0.33}


 33%|███▎      | 1074/3235 [19:24<40:04,  1.11s/it]

{'loss': 1.0337, 'grad_norm': 0.5373873114585876, 'learning_rate': 0.00013380804953560372, 'epoch': 0.33}


 33%|███▎      | 1075/3235 [19:25<40:38,  1.13s/it]

{'loss': 1.034, 'grad_norm': 0.443822979927063, 'learning_rate': 0.00013374613003095976, 'epoch': 0.33}


 33%|███▎      | 1076/3235 [19:26<42:23,  1.18s/it]

{'loss': 0.9618, 'grad_norm': 0.40786781907081604, 'learning_rate': 0.0001336842105263158, 'epoch': 0.33}


 33%|███▎      | 1077/3235 [19:27<42:00,  1.17s/it]

{'loss': 1.0823, 'grad_norm': 0.5033513307571411, 'learning_rate': 0.00013362229102167184, 'epoch': 0.33}


 33%|███▎      | 1078/3235 [19:28<42:42,  1.19s/it]

{'loss': 1.1801, 'grad_norm': 0.4515264332294464, 'learning_rate': 0.00013356037151702788, 'epoch': 0.33}


 33%|███▎      | 1079/3235 [19:29<41:36,  1.16s/it]

{'loss': 1.0603, 'grad_norm': 0.4873657822608948, 'learning_rate': 0.00013349845201238392, 'epoch': 0.33}


 33%|███▎      | 1080/3235 [19:31<41:48,  1.16s/it]

{'loss': 0.9842, 'grad_norm': 0.4662872552871704, 'learning_rate': 0.00013343653250773996, 'epoch': 0.33}


 33%|███▎      | 1081/3235 [19:32<40:29,  1.13s/it]

{'loss': 1.0741, 'grad_norm': 0.45862337946891785, 'learning_rate': 0.00013337461300309597, 'epoch': 0.33}


 33%|███▎      | 1082/3235 [19:33<38:18,  1.07s/it]

{'loss': 1.0976, 'grad_norm': 0.4927886724472046, 'learning_rate': 0.000133312693498452, 'epoch': 0.33}


 33%|███▎      | 1083/3235 [19:33<35:53,  1.00s/it]

{'loss': 0.9048, 'grad_norm': 0.5472935438156128, 'learning_rate': 0.00013325077399380805, 'epoch': 0.33}


 34%|███▎      | 1084/3235 [19:34<34:50,  1.03it/s]

{'loss': 0.9685, 'grad_norm': 0.5347598791122437, 'learning_rate': 0.0001331888544891641, 'epoch': 0.34}


 34%|███▎      | 1085/3235 [19:35<34:53,  1.03it/s]

{'loss': 1.0892, 'grad_norm': 0.4667496085166931, 'learning_rate': 0.00013312693498452013, 'epoch': 0.34}


 34%|███▎      | 1086/3235 [19:36<33:14,  1.08it/s]

{'loss': 1.0738, 'grad_norm': 0.5835923552513123, 'learning_rate': 0.00013306501547987617, 'epoch': 0.34}


 34%|███▎      | 1087/3235 [19:37<30:57,  1.16it/s]

{'loss': 0.8887, 'grad_norm': 0.49722155928611755, 'learning_rate': 0.00013300309597523218, 'epoch': 0.34}


 34%|███▎      | 1088/3235 [19:38<32:13,  1.11it/s]

{'loss': 1.0382, 'grad_norm': 0.4766005277633667, 'learning_rate': 0.00013294117647058822, 'epoch': 0.34}


 34%|███▎      | 1089/3235 [19:39<33:42,  1.06it/s]

{'loss': 1.111, 'grad_norm': 0.5436107516288757, 'learning_rate': 0.00013287925696594426, 'epoch': 0.34}


 34%|███▎      | 1090/3235 [19:40<35:27,  1.01it/s]

{'loss': 1.0519, 'grad_norm': 0.5380786061286926, 'learning_rate': 0.00013281733746130033, 'epoch': 0.34}


 34%|███▎      | 1091/3235 [19:41<37:27,  1.05s/it]

{'loss': 0.9044, 'grad_norm': 0.44670411944389343, 'learning_rate': 0.00013275541795665637, 'epoch': 0.34}


 34%|███▍      | 1092/3235 [19:42<38:15,  1.07s/it]

{'loss': 1.0223, 'grad_norm': 0.4835392236709595, 'learning_rate': 0.0001326934984520124, 'epoch': 0.34}


 34%|███▍      | 1093/3235 [19:44<39:30,  1.11s/it]

{'loss': 1.0508, 'grad_norm': 0.4158784747123718, 'learning_rate': 0.00013263157894736842, 'epoch': 0.34}


 34%|███▍      | 1094/3235 [19:44<37:51,  1.06s/it]

{'loss': 1.0341, 'grad_norm': 0.47008267045021057, 'learning_rate': 0.00013256965944272446, 'epoch': 0.34}


 34%|███▍      | 1095/3235 [19:46<39:57,  1.12s/it]

{'loss': 1.1048, 'grad_norm': 0.4169985353946686, 'learning_rate': 0.0001325077399380805, 'epoch': 0.34}


 34%|███▍      | 1096/3235 [19:47<41:28,  1.16s/it]

{'loss': 1.0604, 'grad_norm': 0.47168853878974915, 'learning_rate': 0.00013244582043343654, 'epoch': 0.34}


 34%|███▍      | 1097/3235 [19:48<40:25,  1.13s/it]

{'loss': 0.9181, 'grad_norm': 0.4746823012828827, 'learning_rate': 0.00013238390092879258, 'epoch': 0.34}


 34%|███▍      | 1098/3235 [19:49<41:00,  1.15s/it]

{'loss': 1.2285, 'grad_norm': 0.4661833941936493, 'learning_rate': 0.00013232198142414862, 'epoch': 0.34}


 34%|███▍      | 1099/3235 [19:50<39:26,  1.11s/it]

{'loss': 0.9608, 'grad_norm': 0.4991014003753662, 'learning_rate': 0.00013226006191950464, 'epoch': 0.34}


 34%|███▍      | 1100/3235 [19:51<38:30,  1.08s/it]

{'loss': 0.9711, 'grad_norm': 0.47071361541748047, 'learning_rate': 0.00013219814241486068, 'epoch': 0.34}


 34%|███▍      | 1101/3235 [19:52<38:51,  1.09s/it]

{'loss': 1.0711, 'grad_norm': 0.4376075267791748, 'learning_rate': 0.00013213622291021672, 'epoch': 0.34}


 34%|███▍      | 1102/3235 [19:54<38:55,  1.10s/it]

{'loss': 0.9976, 'grad_norm': 0.45898938179016113, 'learning_rate': 0.00013207430340557276, 'epoch': 0.34}


 34%|███▍      | 1103/3235 [19:54<37:26,  1.05s/it]

{'loss': 0.9563, 'grad_norm': 0.47703638672828674, 'learning_rate': 0.0001320123839009288, 'epoch': 0.34}


 34%|███▍      | 1104/3235 [19:55<36:51,  1.04s/it]

{'loss': 0.8983, 'grad_norm': 0.4763844907283783, 'learning_rate': 0.00013195046439628484, 'epoch': 0.34}


 34%|███▍      | 1105/3235 [19:56<34:27,  1.03it/s]

{'loss': 1.1305, 'grad_norm': 0.5680310130119324, 'learning_rate': 0.00013188854489164088, 'epoch': 0.34}


 34%|███▍      | 1106/3235 [19:58<38:27,  1.08s/it]

{'loss': 1.0541, 'grad_norm': 0.49523475766181946, 'learning_rate': 0.00013182662538699692, 'epoch': 0.34}


 34%|███▍      | 1107/3235 [19:59<40:02,  1.13s/it]

{'loss': 1.0829, 'grad_norm': 0.4569251835346222, 'learning_rate': 0.00013176470588235296, 'epoch': 0.34}


 34%|███▍      | 1108/3235 [20:00<38:40,  1.09s/it]

{'loss': 1.037, 'grad_norm': 0.5209143161773682, 'learning_rate': 0.000131702786377709, 'epoch': 0.34}


 34%|███▍      | 1109/3235 [20:01<40:05,  1.13s/it]

{'loss': 0.9156, 'grad_norm': 0.43134158849716187, 'learning_rate': 0.00013164086687306504, 'epoch': 0.34}


 34%|███▍      | 1110/3235 [20:02<37:42,  1.06s/it]

{'loss': 0.9476, 'grad_norm': 0.5094008445739746, 'learning_rate': 0.00013157894736842108, 'epoch': 0.34}


 34%|███▍      | 1111/3235 [20:03<38:33,  1.09s/it]

{'loss': 1.1823, 'grad_norm': 0.49976760149002075, 'learning_rate': 0.0001315170278637771, 'epoch': 0.34}


 34%|███▍      | 1112/3235 [20:04<35:38,  1.01s/it]

{'loss': 0.9939, 'grad_norm': 0.5250904560089111, 'learning_rate': 0.00013145510835913313, 'epoch': 0.34}


 34%|███▍      | 1113/3235 [20:05<34:40,  1.02it/s]

{'loss': 0.8792, 'grad_norm': 0.47415634989738464, 'learning_rate': 0.00013139318885448917, 'epoch': 0.34}


 34%|███▍      | 1114/3235 [20:06<33:44,  1.05it/s]

{'loss': 1.0893, 'grad_norm': 0.46333855390548706, 'learning_rate': 0.0001313312693498452, 'epoch': 0.34}


 34%|███▍      | 1115/3235 [20:07<35:45,  1.01s/it]

{'loss': 1.024, 'grad_norm': 0.4496683180332184, 'learning_rate': 0.00013126934984520125, 'epoch': 0.34}


 34%|███▍      | 1116/3235 [20:08<39:12,  1.11s/it]

{'loss': 1.1041, 'grad_norm': 0.4742432236671448, 'learning_rate': 0.0001312074303405573, 'epoch': 0.34}


 35%|███▍      | 1117/3235 [20:10<41:21,  1.17s/it]

{'loss': 1.0009, 'grad_norm': 0.4332391023635864, 'learning_rate': 0.0001311455108359133, 'epoch': 0.35}


 35%|███▍      | 1118/3235 [20:11<41:27,  1.17s/it]

{'loss': 1.183, 'grad_norm': 0.5005272626876831, 'learning_rate': 0.00013108359133126934, 'epoch': 0.35}


 35%|███▍      | 1119/3235 [20:12<38:30,  1.09s/it]

{'loss': 0.8649, 'grad_norm': 0.5213527083396912, 'learning_rate': 0.00013102167182662538, 'epoch': 0.35}


 35%|███▍      | 1120/3235 [20:13<40:11,  1.14s/it]

{'loss': 0.9335, 'grad_norm': 0.4177415668964386, 'learning_rate': 0.00013095975232198142, 'epoch': 0.35}


 35%|███▍      | 1121/3235 [20:14<39:15,  1.11s/it]

{'loss': 1.0027, 'grad_norm': 0.4415709674358368, 'learning_rate': 0.00013089783281733746, 'epoch': 0.35}


 35%|███▍      | 1122/3235 [20:15<41:05,  1.17s/it]

{'loss': 1.1007, 'grad_norm': 0.4893939793109894, 'learning_rate': 0.0001308359133126935, 'epoch': 0.35}


 35%|███▍      | 1123/3235 [20:16<40:09,  1.14s/it]

{'loss': 0.963, 'grad_norm': 0.4605821669101715, 'learning_rate': 0.00013077399380804954, 'epoch': 0.35}


 35%|███▍      | 1124/3235 [20:18<40:27,  1.15s/it]

{'loss': 1.1138, 'grad_norm': 0.49848079681396484, 'learning_rate': 0.00013071207430340558, 'epoch': 0.35}


 35%|███▍      | 1125/3235 [20:18<38:24,  1.09s/it]

{'loss': 0.9748, 'grad_norm': 0.49395427107810974, 'learning_rate': 0.00013065015479876162, 'epoch': 0.35}


 35%|███▍      | 1126/3235 [20:20<41:19,  1.18s/it]

{'loss': 1.016, 'grad_norm': 0.4175964295864105, 'learning_rate': 0.00013058823529411766, 'epoch': 0.35}


 35%|███▍      | 1127/3235 [20:21<39:15,  1.12s/it]

{'loss': 1.0707, 'grad_norm': 0.49783527851104736, 'learning_rate': 0.0001305263157894737, 'epoch': 0.35}


 35%|███▍      | 1128/3235 [20:22<42:17,  1.20s/it]

{'loss': 1.2782, 'grad_norm': 0.4907507300376892, 'learning_rate': 0.00013046439628482974, 'epoch': 0.35}


 35%|███▍      | 1129/3235 [20:23<42:58,  1.22s/it]

{'loss': 0.8696, 'grad_norm': 0.41593265533447266, 'learning_rate': 0.00013040247678018576, 'epoch': 0.35}


 35%|███▍      | 1130/3235 [20:25<44:02,  1.26s/it]

{'loss': 0.8907, 'grad_norm': 0.41271403431892395, 'learning_rate': 0.0001303405572755418, 'epoch': 0.35}


 35%|███▍      | 1131/3235 [20:26<43:30,  1.24s/it]

{'loss': 1.0187, 'grad_norm': 0.4083469808101654, 'learning_rate': 0.00013027863777089784, 'epoch': 0.35}


 35%|███▍      | 1132/3235 [20:27<41:49,  1.19s/it]

{'loss': 0.8851, 'grad_norm': 0.4541577398777008, 'learning_rate': 0.00013021671826625388, 'epoch': 0.35}


 35%|███▌      | 1133/3235 [20:28<40:41,  1.16s/it]

{'loss': 1.1127, 'grad_norm': 0.4619521498680115, 'learning_rate': 0.00013015479876160992, 'epoch': 0.35}


 35%|███▌      | 1134/3235 [20:29<37:05,  1.06s/it]

{'loss': 0.9602, 'grad_norm': 0.540935754776001, 'learning_rate': 0.00013009287925696596, 'epoch': 0.35}


 35%|███▌      | 1135/3235 [20:30<39:28,  1.13s/it]

{'loss': 1.1111, 'grad_norm': 0.45005184412002563, 'learning_rate': 0.00013003095975232197, 'epoch': 0.35}


 35%|███▌      | 1136/3235 [20:31<36:23,  1.04s/it]

{'loss': 1.2025, 'grad_norm': 0.5428012609481812, 'learning_rate': 0.000129969040247678, 'epoch': 0.35}


 35%|███▌      | 1137/3235 [20:32<37:09,  1.06s/it]

{'loss': 1.048, 'grad_norm': 0.4434652030467987, 'learning_rate': 0.00012990712074303405, 'epoch': 0.35}


 35%|███▌      | 1138/3235 [20:33<35:55,  1.03s/it]

{'loss': 0.8628, 'grad_norm': 0.4865885376930237, 'learning_rate': 0.0001298452012383901, 'epoch': 0.35}


 35%|███▌      | 1139/3235 [20:34<36:00,  1.03s/it]

{'loss': 1.0633, 'grad_norm': 0.48176270723342896, 'learning_rate': 0.00012978328173374616, 'epoch': 0.35}


 35%|███▌      | 1140/3235 [20:36<38:52,  1.11s/it]

{'loss': 1.0749, 'grad_norm': 0.4276570975780487, 'learning_rate': 0.0001297213622291022, 'epoch': 0.35}


 35%|███▌      | 1141/3235 [20:37<37:59,  1.09s/it]

{'loss': 1.0685, 'grad_norm': 0.519633948802948, 'learning_rate': 0.0001296594427244582, 'epoch': 0.35}


 35%|███▌      | 1142/3235 [20:37<35:52,  1.03s/it]

{'loss': 0.9387, 'grad_norm': 0.5259674191474915, 'learning_rate': 0.00012959752321981425, 'epoch': 0.35}


 35%|███▌      | 1143/3235 [20:39<37:07,  1.06s/it]

{'loss': 1.1984, 'grad_norm': 0.4830550253391266, 'learning_rate': 0.0001295356037151703, 'epoch': 0.35}


 35%|███▌      | 1144/3235 [20:40<38:25,  1.10s/it]

{'loss': 1.0222, 'grad_norm': 0.48208117485046387, 'learning_rate': 0.00012947368421052633, 'epoch': 0.35}


 35%|███▌      | 1145/3235 [20:41<36:02,  1.03s/it]

{'loss': 1.0437, 'grad_norm': 0.5413769483566284, 'learning_rate': 0.00012941176470588237, 'epoch': 0.35}


 35%|███▌      | 1146/3235 [20:42<38:38,  1.11s/it]

{'loss': 1.0731, 'grad_norm': 0.47320085763931274, 'learning_rate': 0.0001293498452012384, 'epoch': 0.35}


 35%|███▌      | 1147/3235 [20:43<37:34,  1.08s/it]

{'loss': 0.9922, 'grad_norm': 0.47495511174201965, 'learning_rate': 0.00012928792569659442, 'epoch': 0.35}


 35%|███▌      | 1148/3235 [20:44<39:48,  1.14s/it]

{'loss': 1.1155, 'grad_norm': 0.44695043563842773, 'learning_rate': 0.00012922600619195046, 'epoch': 0.35}


 36%|███▌      | 1149/3235 [20:45<36:07,  1.04s/it]

{'loss': 1.0257, 'grad_norm': 0.5460366606712341, 'learning_rate': 0.0001291640866873065, 'epoch': 0.36}


 36%|███▌      | 1150/3235 [20:46<36:11,  1.04s/it]

{'loss': 1.0546, 'grad_norm': 0.4761335253715515, 'learning_rate': 0.00012910216718266254, 'epoch': 0.36}


 36%|███▌      | 1151/3235 [20:47<35:55,  1.03s/it]

{'loss': 1.0156, 'grad_norm': 0.48968827724456787, 'learning_rate': 0.00012904024767801858, 'epoch': 0.36}


 36%|███▌      | 1152/3235 [20:48<35:37,  1.03s/it]

{'loss': 0.9708, 'grad_norm': 0.4969792366027832, 'learning_rate': 0.00012897832817337462, 'epoch': 0.36}


 36%|███▌      | 1153/3235 [20:49<36:58,  1.07s/it]

{'loss': 1.0316, 'grad_norm': 0.478255033493042, 'learning_rate': 0.00012891640866873066, 'epoch': 0.36}


 36%|███▌      | 1154/3235 [20:50<37:42,  1.09s/it]

{'loss': 0.9849, 'grad_norm': 0.4879499077796936, 'learning_rate': 0.0001288544891640867, 'epoch': 0.36}


 36%|███▌      | 1155/3235 [20:51<36:55,  1.06s/it]

{'loss': 0.8634, 'grad_norm': 0.457796573638916, 'learning_rate': 0.00012879256965944274, 'epoch': 0.36}


 36%|███▌      | 1156/3235 [20:53<38:03,  1.10s/it]

{'loss': 1.1356, 'grad_norm': 0.43681904673576355, 'learning_rate': 0.00012873065015479878, 'epoch': 0.36}


 36%|███▌      | 1157/3235 [20:53<35:35,  1.03s/it]

{'loss': 1.0309, 'grad_norm': 0.5504410266876221, 'learning_rate': 0.00012866873065015482, 'epoch': 0.36}


 36%|███▌      | 1158/3235 [20:55<35:46,  1.03s/it]

{'loss': 1.1778, 'grad_norm': 0.49179205298423767, 'learning_rate': 0.00012860681114551086, 'epoch': 0.36}


 36%|███▌      | 1159/3235 [20:56<35:12,  1.02s/it]

{'loss': 1.0071, 'grad_norm': 0.4762362837791443, 'learning_rate': 0.00012854489164086687, 'epoch': 0.36}


 36%|███▌      | 1160/3235 [20:57<35:10,  1.02s/it]

{'loss': 0.977, 'grad_norm': 0.44905588030815125, 'learning_rate': 0.00012848297213622291, 'epoch': 0.36}


 36%|███▌      | 1161/3235 [20:58<38:45,  1.12s/it]

{'loss': 1.0327, 'grad_norm': 0.5036118626594543, 'learning_rate': 0.00012842105263157895, 'epoch': 0.36}


 36%|███▌      | 1162/3235 [20:59<36:12,  1.05s/it]

{'loss': 1.0096, 'grad_norm': 0.5496920943260193, 'learning_rate': 0.000128359133126935, 'epoch': 0.36}


 36%|███▌      | 1163/3235 [21:00<38:02,  1.10s/it]

{'loss': 0.9636, 'grad_norm': 0.4490275979042053, 'learning_rate': 0.00012829721362229103, 'epoch': 0.36}


 36%|███▌      | 1164/3235 [21:01<35:26,  1.03s/it]

{'loss': 0.9487, 'grad_norm': 0.5473751425743103, 'learning_rate': 0.00012823529411764707, 'epoch': 0.36}


 36%|███▌      | 1165/3235 [21:02<39:17,  1.14s/it]

{'loss': 1.0546, 'grad_norm': 0.4801374673843384, 'learning_rate': 0.0001281733746130031, 'epoch': 0.36}


 36%|███▌      | 1166/3235 [21:03<39:00,  1.13s/it]

{'loss': 0.982, 'grad_norm': 0.4624873697757721, 'learning_rate': 0.00012811145510835913, 'epoch': 0.36}


 36%|███▌      | 1167/3235 [21:04<36:35,  1.06s/it]

{'loss': 0.8734, 'grad_norm': 0.4783005714416504, 'learning_rate': 0.00012804953560371517, 'epoch': 0.36}


 36%|███▌      | 1168/3235 [21:05<35:33,  1.03s/it]

{'loss': 1.0096, 'grad_norm': 0.5018796920776367, 'learning_rate': 0.0001279876160990712, 'epoch': 0.36}


 36%|███▌      | 1169/3235 [21:06<35:49,  1.04s/it]

{'loss': 0.9661, 'grad_norm': 0.48647022247314453, 'learning_rate': 0.00012792569659442725, 'epoch': 0.36}


 36%|███▌      | 1170/3235 [21:07<32:56,  1.04it/s]

{'loss': 0.8918, 'grad_norm': 0.5168787240982056, 'learning_rate': 0.0001278637770897833, 'epoch': 0.36}


 36%|███▌      | 1171/3235 [21:08<36:23,  1.06s/it]

{'loss': 0.9532, 'grad_norm': 0.4506259560585022, 'learning_rate': 0.00012780185758513933, 'epoch': 0.36}


 36%|███▌      | 1172/3235 [21:09<34:32,  1.00s/it]

{'loss': 0.9798, 'grad_norm': 0.4829738438129425, 'learning_rate': 0.00012773993808049537, 'epoch': 0.36}


 36%|███▋      | 1173/3235 [21:10<36:03,  1.05s/it]

{'loss': 1.1007, 'grad_norm': 0.49216097593307495, 'learning_rate': 0.0001276780185758514, 'epoch': 0.36}


 36%|███▋      | 1174/3235 [21:11<34:52,  1.02s/it]

{'loss': 1.0589, 'grad_norm': 0.5082858204841614, 'learning_rate': 0.00012761609907120745, 'epoch': 0.36}


 36%|███▋      | 1175/3235 [21:12<34:50,  1.01s/it]

{'loss': 1.0441, 'grad_norm': 0.4964713752269745, 'learning_rate': 0.0001275541795665635, 'epoch': 0.36}


 36%|███▋      | 1176/3235 [21:13<35:45,  1.04s/it]

{'loss': 1.0358, 'grad_norm': 0.5119860768318176, 'learning_rate': 0.00012749226006191953, 'epoch': 0.36}


 36%|███▋      | 1177/3235 [21:15<36:49,  1.07s/it]

{'loss': 1.0269, 'grad_norm': 0.4491749405860901, 'learning_rate': 0.00012743034055727554, 'epoch': 0.36}


 36%|███▋      | 1178/3235 [21:16<39:28,  1.15s/it]

{'loss': 1.0691, 'grad_norm': 0.451164573431015, 'learning_rate': 0.00012736842105263158, 'epoch': 0.36}


 36%|███▋      | 1179/3235 [21:17<38:07,  1.11s/it]

{'loss': 1.2155, 'grad_norm': 0.44875892996788025, 'learning_rate': 0.00012730650154798762, 'epoch': 0.36}


 36%|███▋      | 1180/3235 [21:18<36:41,  1.07s/it]

{'loss': 1.1138, 'grad_norm': 0.47349920868873596, 'learning_rate': 0.00012724458204334366, 'epoch': 0.36}


 37%|███▋      | 1181/3235 [21:19<39:44,  1.16s/it]

{'loss': 1.1005, 'grad_norm': 0.4727257490158081, 'learning_rate': 0.0001271826625386997, 'epoch': 0.37}


 37%|███▋      | 1182/3235 [21:20<38:23,  1.12s/it]

{'loss': 1.0653, 'grad_norm': 0.45790204405784607, 'learning_rate': 0.00012712074303405571, 'epoch': 0.37}


 37%|███▋      | 1183/3235 [21:21<35:51,  1.05s/it]

{'loss': 1.0871, 'grad_norm': 0.5445446968078613, 'learning_rate': 0.00012705882352941175, 'epoch': 0.37}


 37%|███▋      | 1184/3235 [21:22<35:46,  1.05s/it]

{'loss': 1.0373, 'grad_norm': 0.47459179162979126, 'learning_rate': 0.0001269969040247678, 'epoch': 0.37}


 37%|███▋      | 1185/3235 [21:23<37:33,  1.10s/it]

{'loss': 0.9707, 'grad_norm': 0.4394359588623047, 'learning_rate': 0.00012693498452012383, 'epoch': 0.37}


 37%|███▋      | 1186/3235 [21:25<38:18,  1.12s/it]

{'loss': 0.9943, 'grad_norm': 0.4462940990924835, 'learning_rate': 0.00012687306501547987, 'epoch': 0.37}


 37%|███▋      | 1187/3235 [21:26<35:45,  1.05s/it]

{'loss': 1.0375, 'grad_norm': 0.44660043716430664, 'learning_rate': 0.00012681114551083591, 'epoch': 0.37}


 37%|███▋      | 1188/3235 [21:27<38:07,  1.12s/it]

{'loss': 1.0337, 'grad_norm': 0.4753504693508148, 'learning_rate': 0.00012674922600619195, 'epoch': 0.37}


 37%|███▋      | 1189/3235 [21:28<39:21,  1.15s/it]

{'loss': 0.8635, 'grad_norm': 0.3857966661453247, 'learning_rate': 0.000126687306501548, 'epoch': 0.37}


 37%|███▋      | 1190/3235 [21:29<37:54,  1.11s/it]

{'loss': 0.948, 'grad_norm': 0.4576360285282135, 'learning_rate': 0.00012662538699690403, 'epoch': 0.37}


 37%|███▋      | 1191/3235 [21:30<36:10,  1.06s/it]

{'loss': 1.0726, 'grad_norm': 0.5403964519500732, 'learning_rate': 0.00012656346749226007, 'epoch': 0.37}


 37%|███▋      | 1192/3235 [21:31<35:49,  1.05s/it]

{'loss': 1.1928, 'grad_norm': 0.4922701418399811, 'learning_rate': 0.00012650154798761611, 'epoch': 0.37}


 37%|███▋      | 1193/3235 [21:32<38:58,  1.15s/it]

{'loss': 0.9759, 'grad_norm': 0.4052366018295288, 'learning_rate': 0.00012643962848297215, 'epoch': 0.37}


 37%|███▋      | 1194/3235 [21:34<39:44,  1.17s/it]

{'loss': 1.1077, 'grad_norm': 0.4768565893173218, 'learning_rate': 0.00012637770897832817, 'epoch': 0.37}


 37%|███▋      | 1195/3235 [21:35<38:05,  1.12s/it]

{'loss': 0.9933, 'grad_norm': 0.46970829367637634, 'learning_rate': 0.0001263157894736842, 'epoch': 0.37}


 37%|███▋      | 1196/3235 [21:36<38:26,  1.13s/it]

{'loss': 1.1614, 'grad_norm': 0.4970814883708954, 'learning_rate': 0.00012625386996904025, 'epoch': 0.37}


 37%|███▋      | 1197/3235 [21:37<37:38,  1.11s/it]

{'loss': 1.0179, 'grad_norm': 0.4652767479419708, 'learning_rate': 0.0001261919504643963, 'epoch': 0.37}


 37%|███▋      | 1198/3235 [21:38<35:52,  1.06s/it]

{'loss': 0.8933, 'grad_norm': 0.5282986164093018, 'learning_rate': 0.00012613003095975233, 'epoch': 0.37}


 37%|███▋      | 1199/3235 [21:39<37:06,  1.09s/it]

{'loss': 1.1205, 'grad_norm': 0.4641414284706116, 'learning_rate': 0.00012606811145510837, 'epoch': 0.37}


 37%|███▋      | 1200/3235 [21:40<36:59,  1.09s/it]

{'loss': 1.0161, 'grad_norm': 0.5152137279510498, 'learning_rate': 0.00012600619195046438, 'epoch': 0.37}


 37%|███▋      | 1201/3235 [21:41<37:50,  1.12s/it]

{'loss': 1.1422, 'grad_norm': 0.4845248758792877, 'learning_rate': 0.00012594427244582042, 'epoch': 0.37}


 37%|███▋      | 1202/3235 [21:42<37:15,  1.10s/it]

{'loss': 0.9218, 'grad_norm': 0.48425039649009705, 'learning_rate': 0.0001258823529411765, 'epoch': 0.37}


 37%|███▋      | 1203/3235 [21:43<35:19,  1.04s/it]

{'loss': 0.9926, 'grad_norm': 0.5100606679916382, 'learning_rate': 0.00012582043343653253, 'epoch': 0.37}


 37%|███▋      | 1204/3235 [21:44<37:58,  1.12s/it]

{'loss': 1.2122, 'grad_norm': 0.44164201617240906, 'learning_rate': 0.00012575851393188857, 'epoch': 0.37}


 37%|███▋      | 1205/3235 [21:45<36:53,  1.09s/it]

{'loss': 1.0568, 'grad_norm': 0.4529624283313751, 'learning_rate': 0.0001256965944272446, 'epoch': 0.37}


 37%|███▋      | 1206/3235 [21:47<37:32,  1.11s/it]

{'loss': 0.9808, 'grad_norm': 0.4358229637145996, 'learning_rate': 0.00012563467492260062, 'epoch': 0.37}


 37%|███▋      | 1207/3235 [21:48<39:01,  1.15s/it]

{'loss': 1.0537, 'grad_norm': 0.46166378259658813, 'learning_rate': 0.00012557275541795666, 'epoch': 0.37}


 37%|███▋      | 1208/3235 [21:49<40:02,  1.19s/it]

{'loss': 1.0847, 'grad_norm': 0.4733392894268036, 'learning_rate': 0.0001255108359133127, 'epoch': 0.37}


 37%|███▋      | 1209/3235 [21:50<38:14,  1.13s/it]

{'loss': 1.0501, 'grad_norm': 0.5064077377319336, 'learning_rate': 0.00012544891640866874, 'epoch': 0.37}


 37%|███▋      | 1210/3235 [21:51<36:35,  1.08s/it]

{'loss': 0.9753, 'grad_norm': 0.4733152389526367, 'learning_rate': 0.00012538699690402478, 'epoch': 0.37}


 37%|███▋      | 1211/3235 [21:52<38:10,  1.13s/it]

{'loss': 1.0565, 'grad_norm': 0.42552632093429565, 'learning_rate': 0.00012532507739938082, 'epoch': 0.37}


 37%|███▋      | 1212/3235 [21:53<37:12,  1.10s/it]

{'loss': 1.1956, 'grad_norm': 0.48679816722869873, 'learning_rate': 0.00012526315789473683, 'epoch': 0.37}


 37%|███▋      | 1213/3235 [21:55<37:10,  1.10s/it]

{'loss': 0.992, 'grad_norm': 0.4905405640602112, 'learning_rate': 0.00012520123839009287, 'epoch': 0.37}


 38%|███▊      | 1214/3235 [21:56<38:59,  1.16s/it]

{'loss': 1.1977, 'grad_norm': 0.4776170551776886, 'learning_rate': 0.0001251393188854489, 'epoch': 0.38}


 38%|███▊      | 1215/3235 [21:57<40:06,  1.19s/it]

{'loss': 1.0942, 'grad_norm': 0.4373241364955902, 'learning_rate': 0.00012507739938080495, 'epoch': 0.38}


 38%|███▊      | 1216/3235 [21:58<33:58,  1.01s/it]

{'loss': 0.8787, 'grad_norm': 0.5517276525497437, 'learning_rate': 0.000125015479876161, 'epoch': 0.38}


 38%|███▊      | 1217/3235 [21:59<33:07,  1.02it/s]

{'loss': 1.0662, 'grad_norm': 0.5253419280052185, 'learning_rate': 0.00012495356037151703, 'epoch': 0.38}


 38%|███▊      | 1218/3235 [22:00<34:49,  1.04s/it]

{'loss': 1.0601, 'grad_norm': 0.4338569641113281, 'learning_rate': 0.00012489164086687307, 'epoch': 0.38}


 38%|███▊      | 1219/3235 [22:01<37:46,  1.12s/it]

{'loss': 0.8513, 'grad_norm': 0.4486396312713623, 'learning_rate': 0.0001248297213622291, 'epoch': 0.38}


 38%|███▊      | 1220/3235 [22:02<40:19,  1.20s/it]

{'loss': 1.0742, 'grad_norm': 0.4798882007598877, 'learning_rate': 0.00012476780185758515, 'epoch': 0.38}


 38%|███▊      | 1221/3235 [22:04<41:32,  1.24s/it]

{'loss': 1.0602, 'grad_norm': 0.4173312485218048, 'learning_rate': 0.0001247058823529412, 'epoch': 0.38}


 38%|███▊      | 1222/3235 [22:05<40:48,  1.22s/it]

{'loss': 1.009, 'grad_norm': 0.5134776830673218, 'learning_rate': 0.00012464396284829723, 'epoch': 0.38}


 38%|███▊      | 1223/3235 [22:06<37:07,  1.11s/it]

{'loss': 1.0838, 'grad_norm': 0.509294867515564, 'learning_rate': 0.00012458204334365327, 'epoch': 0.38}


 38%|███▊      | 1224/3235 [22:07<36:03,  1.08s/it]

{'loss': 1.0722, 'grad_norm': 0.489131897687912, 'learning_rate': 0.00012452012383900929, 'epoch': 0.38}


 38%|███▊      | 1225/3235 [22:08<34:09,  1.02s/it]

{'loss': 1.1088, 'grad_norm': 0.4803500175476074, 'learning_rate': 0.00012445820433436533, 'epoch': 0.38}


 38%|███▊      | 1226/3235 [22:09<35:27,  1.06s/it]

{'loss': 1.0599, 'grad_norm': 0.4302743971347809, 'learning_rate': 0.00012439628482972137, 'epoch': 0.38}


 38%|███▊      | 1227/3235 [22:10<36:35,  1.09s/it]

{'loss': 0.976, 'grad_norm': 0.46085435152053833, 'learning_rate': 0.0001243343653250774, 'epoch': 0.38}


 38%|███▊      | 1228/3235 [22:11<37:14,  1.11s/it]

{'loss': 1.0986, 'grad_norm': 0.43769407272338867, 'learning_rate': 0.00012427244582043345, 'epoch': 0.38}


 38%|███▊      | 1229/3235 [22:12<36:54,  1.10s/it]

{'loss': 1.0807, 'grad_norm': 0.5104029178619385, 'learning_rate': 0.00012421052631578949, 'epoch': 0.38}


 38%|███▊      | 1230/3235 [22:13<36:40,  1.10s/it]

{'loss': 1.0401, 'grad_norm': 0.4267701804637909, 'learning_rate': 0.0001241486068111455, 'epoch': 0.38}


 38%|███▊      | 1231/3235 [22:15<39:00,  1.17s/it]

{'loss': 1.1816, 'grad_norm': 0.47648197412490845, 'learning_rate': 0.00012408668730650154, 'epoch': 0.38}


 38%|███▊      | 1232/3235 [22:16<39:22,  1.18s/it]

{'loss': 0.9068, 'grad_norm': 0.45404013991355896, 'learning_rate': 0.00012402476780185758, 'epoch': 0.38}


 38%|███▊      | 1233/3235 [22:17<36:36,  1.10s/it]

{'loss': 1.0631, 'grad_norm': 0.5110720992088318, 'learning_rate': 0.00012396284829721362, 'epoch': 0.38}


 38%|███▊      | 1234/3235 [22:18<39:52,  1.20s/it]

{'loss': 0.9538, 'grad_norm': 0.41506069898605347, 'learning_rate': 0.00012390092879256966, 'epoch': 0.38}


 38%|███▊      | 1235/3235 [22:19<38:23,  1.15s/it]

{'loss': 0.9891, 'grad_norm': 0.473034530878067, 'learning_rate': 0.0001238390092879257, 'epoch': 0.38}


 38%|███▊      | 1236/3235 [22:20<38:16,  1.15s/it]

{'loss': 1.1309, 'grad_norm': 0.42644545435905457, 'learning_rate': 0.00012377708978328174, 'epoch': 0.38}


 38%|███▊      | 1237/3235 [22:21<35:59,  1.08s/it]

{'loss': 0.9382, 'grad_norm': 0.49187904596328735, 'learning_rate': 0.00012371517027863778, 'epoch': 0.38}


 38%|███▊      | 1238/3235 [22:22<35:53,  1.08s/it]

{'loss': 1.025, 'grad_norm': 0.45616427063941956, 'learning_rate': 0.00012365325077399382, 'epoch': 0.38}


 38%|███▊      | 1239/3235 [22:24<37:32,  1.13s/it]

{'loss': 1.1216, 'grad_norm': 0.4433143138885498, 'learning_rate': 0.00012359133126934986, 'epoch': 0.38}


 38%|███▊      | 1240/3235 [22:25<36:44,  1.11s/it]

{'loss': 1.2081, 'grad_norm': 0.5162357687950134, 'learning_rate': 0.0001235294117647059, 'epoch': 0.38}


 38%|███▊      | 1241/3235 [22:26<35:43,  1.07s/it]

{'loss': 1.0513, 'grad_norm': 0.5398398637771606, 'learning_rate': 0.00012346749226006194, 'epoch': 0.38}


 38%|███▊      | 1242/3235 [22:27<36:23,  1.10s/it]

{'loss': 1.0101, 'grad_norm': 0.48731234669685364, 'learning_rate': 0.00012340557275541795, 'epoch': 0.38}


 38%|███▊      | 1243/3235 [22:28<35:32,  1.07s/it]

{'loss': 0.9711, 'grad_norm': 0.5370266437530518, 'learning_rate': 0.000123343653250774, 'epoch': 0.38}


 38%|███▊      | 1244/3235 [22:29<34:55,  1.05s/it]

{'loss': 0.8972, 'grad_norm': 0.45951414108276367, 'learning_rate': 0.00012328173374613003, 'epoch': 0.38}


 38%|███▊      | 1245/3235 [22:30<33:50,  1.02s/it]

{'loss': 0.9389, 'grad_norm': 0.5023002028465271, 'learning_rate': 0.00012321981424148607, 'epoch': 0.38}


 39%|███▊      | 1246/3235 [22:31<33:37,  1.01s/it]

{'loss': 1.0315, 'grad_norm': 0.4965507686138153, 'learning_rate': 0.0001231578947368421, 'epoch': 0.39}


 39%|███▊      | 1247/3235 [22:32<33:41,  1.02s/it]

{'loss': 0.9869, 'grad_norm': 0.46849149465560913, 'learning_rate': 0.00012309597523219815, 'epoch': 0.39}


 39%|███▊      | 1248/3235 [22:33<34:58,  1.06s/it]

{'loss': 0.9875, 'grad_norm': 0.4782750904560089, 'learning_rate': 0.00012303405572755416, 'epoch': 0.39}


 39%|███▊      | 1249/3235 [22:34<34:17,  1.04s/it]

{'loss': 1.0752, 'grad_norm': 0.4666416347026825, 'learning_rate': 0.0001229721362229102, 'epoch': 0.39}


 39%|███▊      | 1250/3235 [22:35<36:22,  1.10s/it]

{'loss': 1.071, 'grad_norm': 0.45978859066963196, 'learning_rate': 0.00012291021671826624, 'epoch': 0.39}


 39%|███▊      | 1251/3235 [22:36<35:30,  1.07s/it]

{'loss': 1.0102, 'grad_norm': 0.47126463055610657, 'learning_rate': 0.0001228482972136223, 'epoch': 0.39}


 39%|███▊      | 1252/3235 [22:37<36:54,  1.12s/it]

{'loss': 0.9799, 'grad_norm': 0.4770672023296356, 'learning_rate': 0.00012278637770897835, 'epoch': 0.39}


 39%|███▊      | 1253/3235 [22:38<33:05,  1.00s/it]

{'loss': 0.9357, 'grad_norm': 0.48496541380882263, 'learning_rate': 0.0001227244582043344, 'epoch': 0.39}


 39%|███▉      | 1254/3235 [22:39<33:49,  1.02s/it]

{'loss': 1.0198, 'grad_norm': 0.46943894028663635, 'learning_rate': 0.0001226625386996904, 'epoch': 0.39}


 39%|███▉      | 1255/3235 [22:40<34:44,  1.05s/it]

{'loss': 1.0295, 'grad_norm': 0.46494850516319275, 'learning_rate': 0.00012260061919504644, 'epoch': 0.39}


 39%|███▉      | 1256/3235 [22:41<34:24,  1.04s/it]

{'loss': 0.9986, 'grad_norm': 0.5240963101387024, 'learning_rate': 0.00012253869969040248, 'epoch': 0.39}


 39%|███▉      | 1257/3235 [22:43<35:44,  1.08s/it]

{'loss': 1.116, 'grad_norm': 0.45742321014404297, 'learning_rate': 0.00012247678018575852, 'epoch': 0.39}


 39%|███▉      | 1258/3235 [22:44<35:28,  1.08s/it]

{'loss': 0.9977, 'grad_norm': 0.5017337203025818, 'learning_rate': 0.00012241486068111456, 'epoch': 0.39}


 39%|███▉      | 1259/3235 [22:45<36:41,  1.11s/it]

{'loss': 0.9365, 'grad_norm': 0.4284571707248688, 'learning_rate': 0.0001223529411764706, 'epoch': 0.39}


 39%|███▉      | 1260/3235 [22:46<36:38,  1.11s/it]

{'loss': 1.0566, 'grad_norm': 0.48127418756484985, 'learning_rate': 0.00012229102167182662, 'epoch': 0.39}


 39%|███▉      | 1261/3235 [22:47<36:23,  1.11s/it]

{'loss': 1.0895, 'grad_norm': 0.4809674322605133, 'learning_rate': 0.00012222910216718266, 'epoch': 0.39}


 39%|███▉      | 1262/3235 [22:48<34:08,  1.04s/it]

{'loss': 0.976, 'grad_norm': 0.5108175277709961, 'learning_rate': 0.0001221671826625387, 'epoch': 0.39}


 39%|███▉      | 1263/3235 [22:49<37:29,  1.14s/it]

{'loss': 0.9567, 'grad_norm': 0.42574650049209595, 'learning_rate': 0.00012210526315789474, 'epoch': 0.39}


 39%|███▉      | 1264/3235 [22:50<36:24,  1.11s/it]

{'loss': 1.0637, 'grad_norm': 0.49884364008903503, 'learning_rate': 0.00012204334365325079, 'epoch': 0.39}


 39%|███▉      | 1265/3235 [22:51<36:46,  1.12s/it]

{'loss': 1.1688, 'grad_norm': 0.47951799631118774, 'learning_rate': 0.00012198142414860683, 'epoch': 0.39}


 39%|███▉      | 1266/3235 [22:52<33:45,  1.03s/it]

{'loss': 0.9588, 'grad_norm': 0.5414559245109558, 'learning_rate': 0.00012191950464396284, 'epoch': 0.39}


 39%|███▉      | 1267/3235 [22:53<31:56,  1.03it/s]

{'loss': 1.0174, 'grad_norm': 0.5570845603942871, 'learning_rate': 0.00012185758513931888, 'epoch': 0.39}


 39%|███▉      | 1268/3235 [22:54<32:16,  1.02it/s]

{'loss': 1.0492, 'grad_norm': 0.4776279032230377, 'learning_rate': 0.00012179566563467492, 'epoch': 0.39}


 39%|███▉      | 1269/3235 [22:55<33:37,  1.03s/it]

{'loss': 1.0014, 'grad_norm': 0.44656434655189514, 'learning_rate': 0.00012173374613003096, 'epoch': 0.39}


 39%|███▉      | 1270/3235 [22:56<34:26,  1.05s/it]

{'loss': 1.069, 'grad_norm': 0.44920992851257324, 'learning_rate': 0.000121671826625387, 'epoch': 0.39}


 39%|███▉      | 1271/3235 [22:57<34:24,  1.05s/it]

{'loss': 0.9289, 'grad_norm': 0.522042453289032, 'learning_rate': 0.00012160990712074304, 'epoch': 0.39}


 39%|███▉      | 1272/3235 [22:58<32:22,  1.01it/s]

{'loss': 1.0616, 'grad_norm': 0.6734819412231445, 'learning_rate': 0.00012154798761609907, 'epoch': 0.39}


 39%|███▉      | 1273/3235 [22:59<32:39,  1.00it/s]

{'loss': 1.056, 'grad_norm': 0.5345318913459778, 'learning_rate': 0.00012148606811145511, 'epoch': 0.39}


 39%|███▉      | 1274/3235 [23:00<33:52,  1.04s/it]

{'loss': 0.9075, 'grad_norm': 0.5048902034759521, 'learning_rate': 0.00012142414860681115, 'epoch': 0.39}


 39%|███▉      | 1275/3235 [23:01<33:21,  1.02s/it]

{'loss': 1.1639, 'grad_norm': 0.48876774311065674, 'learning_rate': 0.00012136222910216719, 'epoch': 0.39}


 39%|███▉      | 1276/3235 [23:02<32:21,  1.01it/s]

{'loss': 0.8748, 'grad_norm': 0.49205896258354187, 'learning_rate': 0.00012130030959752323, 'epoch': 0.39}


 39%|███▉      | 1277/3235 [23:03<33:58,  1.04s/it]

{'loss': 1.0613, 'grad_norm': 0.46240946650505066, 'learning_rate': 0.00012123839009287927, 'epoch': 0.39}


 40%|███▉      | 1278/3235 [23:04<33:30,  1.03s/it]

{'loss': 0.9173, 'grad_norm': 0.46399691700935364, 'learning_rate': 0.0001211764705882353, 'epoch': 0.4}


 40%|███▉      | 1279/3235 [23:06<35:12,  1.08s/it]

{'loss': 0.9639, 'grad_norm': 0.43945935368537903, 'learning_rate': 0.00012111455108359134, 'epoch': 0.4}


 40%|███▉      | 1280/3235 [23:07<33:42,  1.03s/it]

{'loss': 0.9745, 'grad_norm': 0.5216076970100403, 'learning_rate': 0.00012105263157894738, 'epoch': 0.4}


 40%|███▉      | 1281/3235 [23:08<34:07,  1.05s/it]

{'loss': 0.8879, 'grad_norm': 0.45335081219673157, 'learning_rate': 0.00012099071207430342, 'epoch': 0.4}


 40%|███▉      | 1282/3235 [23:09<33:48,  1.04s/it]

{'loss': 1.0983, 'grad_norm': 0.48018306493759155, 'learning_rate': 0.00012092879256965946, 'epoch': 0.4}


 40%|███▉      | 1283/3235 [23:10<35:44,  1.10s/it]

{'loss': 1.1084, 'grad_norm': 0.4458473026752472, 'learning_rate': 0.0001208668730650155, 'epoch': 0.4}


 40%|███▉      | 1284/3235 [23:11<36:31,  1.12s/it]

{'loss': 0.9574, 'grad_norm': 0.4717921018600464, 'learning_rate': 0.00012080495356037151, 'epoch': 0.4}


 40%|███▉      | 1285/3235 [23:12<33:13,  1.02s/it]

{'loss': 1.0288, 'grad_norm': 0.5353822708129883, 'learning_rate': 0.00012074303405572755, 'epoch': 0.4}


 40%|███▉      | 1286/3235 [23:13<32:13,  1.01it/s]

{'loss': 0.8836, 'grad_norm': 0.5148781538009644, 'learning_rate': 0.00012068111455108359, 'epoch': 0.4}


 40%|███▉      | 1287/3235 [23:14<33:55,  1.04s/it]

{'loss': 0.9824, 'grad_norm': 0.47034651041030884, 'learning_rate': 0.00012061919504643964, 'epoch': 0.4}


 40%|███▉      | 1288/3235 [23:15<31:03,  1.04it/s]

{'loss': 1.0143, 'grad_norm': 0.5294078588485718, 'learning_rate': 0.00012055727554179568, 'epoch': 0.4}


 40%|███▉      | 1289/3235 [23:16<32:30,  1.00s/it]

{'loss': 1.1403, 'grad_norm': 0.4880809187889099, 'learning_rate': 0.00012049535603715172, 'epoch': 0.4}


 40%|███▉      | 1290/3235 [23:17<32:25,  1.00s/it]

{'loss': 0.9851, 'grad_norm': 0.45856738090515137, 'learning_rate': 0.00012043343653250774, 'epoch': 0.4}


 40%|███▉      | 1291/3235 [23:18<33:19,  1.03s/it]

{'loss': 1.0135, 'grad_norm': 0.49948757886886597, 'learning_rate': 0.00012037151702786378, 'epoch': 0.4}


 40%|███▉      | 1292/3235 [23:19<33:23,  1.03s/it]

{'loss': 1.0226, 'grad_norm': 0.4825475513935089, 'learning_rate': 0.00012030959752321982, 'epoch': 0.4}


 40%|███▉      | 1293/3235 [23:20<32:43,  1.01s/it]

{'loss': 1.0822, 'grad_norm': 0.5497623682022095, 'learning_rate': 0.00012024767801857586, 'epoch': 0.4}


 40%|████      | 1294/3235 [23:21<31:43,  1.02it/s]

{'loss': 0.8468, 'grad_norm': 0.49908292293548584, 'learning_rate': 0.0001201857585139319, 'epoch': 0.4}


 40%|████      | 1295/3235 [23:22<33:14,  1.03s/it]

{'loss': 0.9181, 'grad_norm': 0.5072263479232788, 'learning_rate': 0.00012012383900928794, 'epoch': 0.4}


 40%|████      | 1296/3235 [23:23<31:56,  1.01it/s]

{'loss': 1.1073, 'grad_norm': 0.5470303297042847, 'learning_rate': 0.00012006191950464396, 'epoch': 0.4}


 40%|████      | 1297/3235 [23:24<32:34,  1.01s/it]

{'loss': 1.0665, 'grad_norm': 0.4629620611667633, 'learning_rate': 0.00012, 'epoch': 0.4}


 40%|████      | 1298/3235 [23:25<34:31,  1.07s/it]

{'loss': 1.1321, 'grad_norm': 0.4553929567337036, 'learning_rate': 0.00011993808049535604, 'epoch': 0.4}


 40%|████      | 1299/3235 [23:26<34:10,  1.06s/it]

{'loss': 1.0651, 'grad_norm': 0.509706974029541, 'learning_rate': 0.00011987616099071208, 'epoch': 0.4}


 40%|████      | 1300/3235 [23:27<34:33,  1.07s/it]

{'loss': 1.0636, 'grad_norm': 0.5473368167877197, 'learning_rate': 0.00011981424148606812, 'epoch': 0.4}


 40%|████      | 1301/3235 [23:28<33:37,  1.04s/it]

{'loss': 0.9871, 'grad_norm': 0.5231086611747742, 'learning_rate': 0.00011975232198142416, 'epoch': 0.4}


 40%|████      | 1302/3235 [23:29<34:12,  1.06s/it]

{'loss': 0.851, 'grad_norm': 0.4475088119506836, 'learning_rate': 0.00011969040247678019, 'epoch': 0.4}


 40%|████      | 1303/3235 [23:31<35:34,  1.11s/it]

{'loss': 1.0601, 'grad_norm': 0.48441624641418457, 'learning_rate': 0.00011962848297213623, 'epoch': 0.4}


 40%|████      | 1304/3235 [23:32<33:53,  1.05s/it]

{'loss': 0.861, 'grad_norm': 0.5451982021331787, 'learning_rate': 0.00011956656346749227, 'epoch': 0.4}


 40%|████      | 1305/3235 [23:33<37:35,  1.17s/it]

{'loss': 0.9071, 'grad_norm': 0.42358657717704773, 'learning_rate': 0.00011950464396284831, 'epoch': 0.4}


 40%|████      | 1306/3235 [23:34<35:23,  1.10s/it]

{'loss': 1.0175, 'grad_norm': 0.47935912013053894, 'learning_rate': 0.00011944272445820435, 'epoch': 0.4}


 40%|████      | 1307/3235 [23:35<34:05,  1.06s/it]

{'loss': 1.0434, 'grad_norm': 0.5030666589736938, 'learning_rate': 0.00011938080495356039, 'epoch': 0.4}


 40%|████      | 1308/3235 [23:36<32:58,  1.03s/it]

{'loss': 0.8736, 'grad_norm': 0.5207985043525696, 'learning_rate': 0.0001193188854489164, 'epoch': 0.4}


 40%|████      | 1309/3235 [23:37<34:05,  1.06s/it]

{'loss': 1.046, 'grad_norm': 0.49602484703063965, 'learning_rate': 0.00011925696594427244, 'epoch': 0.4}


 40%|████      | 1310/3235 [23:38<32:09,  1.00s/it]

{'loss': 0.9246, 'grad_norm': 0.4763335585594177, 'learning_rate': 0.00011919504643962848, 'epoch': 0.4}


 41%|████      | 1311/3235 [23:39<32:47,  1.02s/it]

{'loss': 0.9642, 'grad_norm': 0.45685386657714844, 'learning_rate': 0.00011913312693498452, 'epoch': 0.41}


 41%|████      | 1312/3235 [23:40<31:53,  1.01it/s]

{'loss': 1.0387, 'grad_norm': 0.5275527834892273, 'learning_rate': 0.00011907120743034058, 'epoch': 0.41}


 41%|████      | 1313/3235 [23:41<32:49,  1.02s/it]

{'loss': 1.1133, 'grad_norm': 0.5566388368606567, 'learning_rate': 0.00011900928792569662, 'epoch': 0.41}


 41%|████      | 1314/3235 [23:42<32:47,  1.02s/it]

{'loss': 1.0071, 'grad_norm': 0.5173267722129822, 'learning_rate': 0.00011894736842105263, 'epoch': 0.41}


 41%|████      | 1315/3235 [23:43<33:31,  1.05s/it]

{'loss': 1.0145, 'grad_norm': 0.5297030806541443, 'learning_rate': 0.00011888544891640867, 'epoch': 0.41}


 41%|████      | 1316/3235 [23:44<33:35,  1.05s/it]

{'loss': 1.0317, 'grad_norm': 0.5270516276359558, 'learning_rate': 0.00011882352941176471, 'epoch': 0.41}


 41%|████      | 1317/3235 [23:45<35:30,  1.11s/it]

{'loss': 1.0389, 'grad_norm': 0.46971192955970764, 'learning_rate': 0.00011876160990712075, 'epoch': 0.41}


 41%|████      | 1318/3235 [23:47<38:11,  1.20s/it]

{'loss': 1.019, 'grad_norm': 0.4527840316295624, 'learning_rate': 0.00011869969040247679, 'epoch': 0.41}


 41%|████      | 1319/3235 [23:48<37:43,  1.18s/it]

{'loss': 1.2197, 'grad_norm': 0.6696037650108337, 'learning_rate': 0.00011863777089783283, 'epoch': 0.41}


 41%|████      | 1320/3235 [23:49<35:17,  1.11s/it]

{'loss': 1.0526, 'grad_norm': 0.6065015196800232, 'learning_rate': 0.00011857585139318886, 'epoch': 0.41}


 41%|████      | 1321/3235 [23:50<37:04,  1.16s/it]

{'loss': 1.0701, 'grad_norm': 0.3916011154651642, 'learning_rate': 0.0001185139318885449, 'epoch': 0.41}


 41%|████      | 1322/3235 [23:51<35:11,  1.10s/it]

{'loss': 0.9326, 'grad_norm': 0.5027738213539124, 'learning_rate': 0.00011845201238390094, 'epoch': 0.41}


 41%|████      | 1323/3235 [23:52<35:50,  1.12s/it]

{'loss': 1.0178, 'grad_norm': 0.5187383890151978, 'learning_rate': 0.00011839009287925698, 'epoch': 0.41}


 41%|████      | 1324/3235 [23:53<34:57,  1.10s/it]

{'loss': 1.1084, 'grad_norm': 0.4956180453300476, 'learning_rate': 0.00011832817337461302, 'epoch': 0.41}


 41%|████      | 1325/3235 [23:54<32:30,  1.02s/it]

{'loss': 1.0837, 'grad_norm': 2.1580429077148438, 'learning_rate': 0.00011826625386996906, 'epoch': 0.41}


 41%|████      | 1326/3235 [23:55<33:42,  1.06s/it]

{'loss': 1.113, 'grad_norm': 0.4976961016654968, 'learning_rate': 0.00011820433436532508, 'epoch': 0.41}


 41%|████      | 1327/3235 [23:56<32:40,  1.03s/it]

{'loss': 0.9286, 'grad_norm': 0.516808807849884, 'learning_rate': 0.00011814241486068112, 'epoch': 0.41}


 41%|████      | 1328/3235 [23:58<34:56,  1.10s/it]

{'loss': 1.0458, 'grad_norm': 0.4785497188568115, 'learning_rate': 0.00011808049535603716, 'epoch': 0.41}


 41%|████      | 1329/3235 [23:59<35:41,  1.12s/it]

{'loss': 1.1222, 'grad_norm': 0.45455724000930786, 'learning_rate': 0.0001180185758513932, 'epoch': 0.41}


 41%|████      | 1330/3235 [24:00<33:46,  1.06s/it]

{'loss': 1.019, 'grad_norm': 0.5151283740997314, 'learning_rate': 0.00011795665634674924, 'epoch': 0.41}


 41%|████      | 1331/3235 [24:01<35:50,  1.13s/it]

{'loss': 1.1301, 'grad_norm': 0.42056554555892944, 'learning_rate': 0.00011789473684210525, 'epoch': 0.41}


 41%|████      | 1332/3235 [24:02<35:08,  1.11s/it]

{'loss': 1.1023, 'grad_norm': 0.49025100469589233, 'learning_rate': 0.0001178328173374613, 'epoch': 0.41}


 41%|████      | 1333/3235 [24:03<38:20,  1.21s/it]

{'loss': 1.089, 'grad_norm': 0.3914242088794708, 'learning_rate': 0.00011777089783281733, 'epoch': 0.41}


 41%|████      | 1334/3235 [24:05<39:48,  1.26s/it]

{'loss': 1.0579, 'grad_norm': 0.446160227060318, 'learning_rate': 0.00011770897832817338, 'epoch': 0.41}


 41%|████▏     | 1335/3235 [24:06<40:01,  1.26s/it]

{'loss': 1.0071, 'grad_norm': 0.4548097848892212, 'learning_rate': 0.00011764705882352942, 'epoch': 0.41}


 41%|████▏     | 1336/3235 [24:07<37:58,  1.20s/it]

{'loss': 0.8344, 'grad_norm': 0.47754019498825073, 'learning_rate': 0.00011758513931888547, 'epoch': 0.41}


 41%|████▏     | 1337/3235 [24:08<35:10,  1.11s/it]

{'loss': 0.9413, 'grad_norm': 0.4626919627189636, 'learning_rate': 0.00011752321981424148, 'epoch': 0.41}


 41%|████▏     | 1338/3235 [24:09<33:06,  1.05s/it]

{'loss': 1.014, 'grad_norm': 0.5316646695137024, 'learning_rate': 0.00011746130030959752, 'epoch': 0.41}


 41%|████▏     | 1339/3235 [24:10<34:40,  1.10s/it]

{'loss': 1.0593, 'grad_norm': 0.45009616017341614, 'learning_rate': 0.00011739938080495356, 'epoch': 0.41}


 41%|████▏     | 1340/3235 [24:11<33:47,  1.07s/it]

{'loss': 1.0263, 'grad_norm': 0.5006476044654846, 'learning_rate': 0.0001173374613003096, 'epoch': 0.41}


 41%|████▏     | 1341/3235 [24:12<34:12,  1.08s/it]

{'loss': 1.0571, 'grad_norm': 0.4628365933895111, 'learning_rate': 0.00011727554179566564, 'epoch': 0.41}


 41%|████▏     | 1342/3235 [24:13<33:10,  1.05s/it]

{'loss': 1.0862, 'grad_norm': 0.5053414106369019, 'learning_rate': 0.00011721362229102168, 'epoch': 0.41}


 42%|████▏     | 1343/3235 [24:14<32:37,  1.03s/it]

{'loss': 1.081, 'grad_norm': 0.5300461649894714, 'learning_rate': 0.00011715170278637771, 'epoch': 0.42}


 42%|████▏     | 1344/3235 [24:15<32:26,  1.03s/it]

{'loss': 1.0213, 'grad_norm': 0.5262525677680969, 'learning_rate': 0.00011708978328173375, 'epoch': 0.42}


 42%|████▏     | 1345/3235 [24:17<35:55,  1.14s/it]

{'loss': 1.0277, 'grad_norm': 0.470968097448349, 'learning_rate': 0.00011702786377708979, 'epoch': 0.42}


 42%|████▏     | 1346/3235 [24:18<34:19,  1.09s/it]

{'loss': 1.0132, 'grad_norm': 0.4860926866531372, 'learning_rate': 0.00011696594427244583, 'epoch': 0.42}


 42%|████▏     | 1347/3235 [24:18<32:14,  1.02s/it]

{'loss': 1.1875, 'grad_norm': 0.5128049254417419, 'learning_rate': 0.00011690402476780187, 'epoch': 0.42}


 42%|████▏     | 1348/3235 [24:19<31:55,  1.02s/it]

{'loss': 1.0156, 'grad_norm': 0.45474180579185486, 'learning_rate': 0.00011684210526315791, 'epoch': 0.42}


 42%|████▏     | 1349/3235 [24:21<32:30,  1.03s/it]

{'loss': 1.125, 'grad_norm': 0.4637676477432251, 'learning_rate': 0.00011678018575851392, 'epoch': 0.42}


 42%|████▏     | 1350/3235 [24:22<33:05,  1.05s/it]

{'loss': 1.0002, 'grad_norm': 0.5065112709999084, 'learning_rate': 0.00011671826625386997, 'epoch': 0.42}


 42%|████▏     | 1351/3235 [24:23<31:40,  1.01s/it]

{'loss': 0.9281, 'grad_norm': 0.49424266815185547, 'learning_rate': 0.00011665634674922601, 'epoch': 0.42}


 42%|████▏     | 1352/3235 [24:23<30:46,  1.02it/s]

{'loss': 1.1036, 'grad_norm': 0.4917353391647339, 'learning_rate': 0.00011659442724458205, 'epoch': 0.42}


 42%|████▏     | 1353/3235 [24:25<32:00,  1.02s/it]

{'loss': 0.9983, 'grad_norm': 0.4622959792613983, 'learning_rate': 0.0001165325077399381, 'epoch': 0.42}


 42%|████▏     | 1354/3235 [24:26<32:48,  1.05s/it]

{'loss': 1.0065, 'grad_norm': 0.470695823431015, 'learning_rate': 0.00011647058823529413, 'epoch': 0.42}


 42%|████▏     | 1355/3235 [24:27<31:30,  1.01s/it]

{'loss': 1.0475, 'grad_norm': 0.5355264544487, 'learning_rate': 0.00011640866873065015, 'epoch': 0.42}


 42%|████▏     | 1356/3235 [24:27<29:22,  1.07it/s]

{'loss': 0.9549, 'grad_norm': 0.5363137125968933, 'learning_rate': 0.00011634674922600619, 'epoch': 0.42}


 42%|████▏     | 1357/3235 [24:29<31:27,  1.01s/it]

{'loss': 1.0446, 'grad_norm': 0.49929726123809814, 'learning_rate': 0.00011628482972136223, 'epoch': 0.42}


 42%|████▏     | 1358/3235 [24:30<31:10,  1.00it/s]

{'loss': 0.9921, 'grad_norm': 0.5265922546386719, 'learning_rate': 0.00011622291021671827, 'epoch': 0.42}


 42%|████▏     | 1359/3235 [24:31<33:22,  1.07s/it]

{'loss': 0.9735, 'grad_norm': 0.43826526403427124, 'learning_rate': 0.00011616099071207431, 'epoch': 0.42}


 42%|████▏     | 1360/3235 [24:32<33:41,  1.08s/it]

{'loss': 0.8292, 'grad_norm': 0.4416889250278473, 'learning_rate': 0.00011609907120743036, 'epoch': 0.42}


 42%|████▏     | 1361/3235 [24:33<32:42,  1.05s/it]

{'loss': 0.8697, 'grad_norm': 0.4731942415237427, 'learning_rate': 0.00011603715170278637, 'epoch': 0.42}


 42%|████▏     | 1362/3235 [24:34<31:11,  1.00it/s]

{'loss': 1.0347, 'grad_norm': 0.4925220012664795, 'learning_rate': 0.00011597523219814241, 'epoch': 0.42}


 42%|████▏     | 1363/3235 [24:35<30:12,  1.03it/s]

{'loss': 1.0724, 'grad_norm': 0.5343618392944336, 'learning_rate': 0.00011591331269349845, 'epoch': 0.42}


 42%|████▏     | 1364/3235 [24:36<31:57,  1.03s/it]

{'loss': 1.0009, 'grad_norm': 0.5038585662841797, 'learning_rate': 0.0001158513931888545, 'epoch': 0.42}


 42%|████▏     | 1365/3235 [24:37<34:04,  1.09s/it]

{'loss': 1.0992, 'grad_norm': 0.4521566331386566, 'learning_rate': 0.00011578947368421053, 'epoch': 0.42}


 42%|████▏     | 1366/3235 [24:38<34:15,  1.10s/it]

{'loss': 1.0042, 'grad_norm': 0.5023670792579651, 'learning_rate': 0.00011572755417956657, 'epoch': 0.42}


 42%|████▏     | 1367/3235 [24:39<35:14,  1.13s/it]

{'loss': 1.0847, 'grad_norm': 0.4806240499019623, 'learning_rate': 0.0001156656346749226, 'epoch': 0.42}


 42%|████▏     | 1368/3235 [24:40<31:04,  1.00it/s]

{'loss': 0.8737, 'grad_norm': 0.5595768690109253, 'learning_rate': 0.00011560371517027864, 'epoch': 0.42}


 42%|████▏     | 1369/3235 [24:41<30:57,  1.00it/s]

{'loss': 1.0951, 'grad_norm': 0.5471968650817871, 'learning_rate': 0.00011554179566563468, 'epoch': 0.42}


 42%|████▏     | 1370/3235 [24:42<29:50,  1.04it/s]

{'loss': 1.045, 'grad_norm': 0.5471184849739075, 'learning_rate': 0.00011547987616099072, 'epoch': 0.42}


 42%|████▏     | 1371/3235 [24:43<30:33,  1.02it/s]

{'loss': 1.0016, 'grad_norm': 0.48135900497436523, 'learning_rate': 0.00011541795665634676, 'epoch': 0.42}


 42%|████▏     | 1372/3235 [24:44<30:53,  1.01it/s]

{'loss': 0.9129, 'grad_norm': 0.5094776749610901, 'learning_rate': 0.0001153560371517028, 'epoch': 0.42}


 42%|████▏     | 1373/3235 [24:45<31:01,  1.00it/s]

{'loss': 1.0546, 'grad_norm': 0.4954851269721985, 'learning_rate': 0.00011529411764705881, 'epoch': 0.42}


 42%|████▏     | 1374/3235 [24:46<33:30,  1.08s/it]

{'loss': 0.9439, 'grad_norm': 0.4550986588001251, 'learning_rate': 0.00011523219814241485, 'epoch': 0.42}


 43%|████▎     | 1375/3235 [24:47<32:38,  1.05s/it]

{'loss': 0.9715, 'grad_norm': 0.48214709758758545, 'learning_rate': 0.00011517027863777091, 'epoch': 0.43}


 43%|████▎     | 1376/3235 [24:48<33:05,  1.07s/it]

{'loss': 0.928, 'grad_norm': 0.4480733573436737, 'learning_rate': 0.00011510835913312695, 'epoch': 0.43}


 43%|████▎     | 1377/3235 [24:49<33:48,  1.09s/it]

{'loss': 0.9885, 'grad_norm': 0.501505434513092, 'learning_rate': 0.00011504643962848299, 'epoch': 0.43}


 43%|████▎     | 1378/3235 [24:50<31:07,  1.01s/it]

{'loss': 0.9452, 'grad_norm': 0.47422242164611816, 'learning_rate': 0.00011498452012383903, 'epoch': 0.43}


 43%|████▎     | 1379/3235 [24:51<31:54,  1.03s/it]

{'loss': 1.0008, 'grad_norm': 0.4766296148300171, 'learning_rate': 0.00011492260061919504, 'epoch': 0.43}


 43%|████▎     | 1380/3235 [24:53<33:10,  1.07s/it]

{'loss': 1.03, 'grad_norm': 0.5050650238990784, 'learning_rate': 0.00011486068111455108, 'epoch': 0.43}


 43%|████▎     | 1381/3235 [24:53<30:50,  1.00it/s]

{'loss': 0.9694, 'grad_norm': 0.5344510674476624, 'learning_rate': 0.00011479876160990712, 'epoch': 0.43}


 43%|████▎     | 1382/3235 [24:54<30:57,  1.00s/it]

{'loss': 0.9675, 'grad_norm': 0.4552922248840332, 'learning_rate': 0.00011473684210526316, 'epoch': 0.43}


 43%|████▎     | 1383/3235 [24:55<32:11,  1.04s/it]

{'loss': 1.1132, 'grad_norm': 0.5188258290290833, 'learning_rate': 0.0001146749226006192, 'epoch': 0.43}


 43%|████▎     | 1384/3235 [24:56<31:41,  1.03s/it]

{'loss': 1.1375, 'grad_norm': 0.5143023729324341, 'learning_rate': 0.00011461300309597524, 'epoch': 0.43}


 43%|████▎     | 1385/3235 [24:58<31:49,  1.03s/it]

{'loss': 1.0254, 'grad_norm': 0.45180055499076843, 'learning_rate': 0.00011455108359133127, 'epoch': 0.43}


 43%|████▎     | 1386/3235 [24:59<31:26,  1.02s/it]

{'loss': 1.0923, 'grad_norm': 0.49011313915252686, 'learning_rate': 0.0001144891640866873, 'epoch': 0.43}


 43%|████▎     | 1387/3235 [25:00<32:19,  1.05s/it]

{'loss': 0.9866, 'grad_norm': 0.48757123947143555, 'learning_rate': 0.00011442724458204335, 'epoch': 0.43}


 43%|████▎     | 1388/3235 [25:01<31:19,  1.02s/it]

{'loss': 1.0337, 'grad_norm': 0.4525142312049866, 'learning_rate': 0.00011436532507739939, 'epoch': 0.43}


 43%|████▎     | 1389/3235 [25:02<32:27,  1.05s/it]

{'loss': 1.0636, 'grad_norm': 0.488993763923645, 'learning_rate': 0.00011430340557275543, 'epoch': 0.43}


 43%|████▎     | 1390/3235 [25:03<33:32,  1.09s/it]

{'loss': 1.1891, 'grad_norm': 0.6011990904808044, 'learning_rate': 0.00011424148606811147, 'epoch': 0.43}


 43%|████▎     | 1391/3235 [25:04<31:34,  1.03s/it]

{'loss': 0.8179, 'grad_norm': 0.46799618005752563, 'learning_rate': 0.00011417956656346749, 'epoch': 0.43}


 43%|████▎     | 1392/3235 [25:05<31:59,  1.04s/it]

{'loss': 0.9228, 'grad_norm': 0.4737793803215027, 'learning_rate': 0.00011411764705882353, 'epoch': 0.43}


 43%|████▎     | 1393/3235 [25:06<31:26,  1.02s/it]

{'loss': 1.0085, 'grad_norm': 0.5021919012069702, 'learning_rate': 0.00011405572755417957, 'epoch': 0.43}


 43%|████▎     | 1394/3235 [25:07<34:06,  1.11s/it]

{'loss': 1.0687, 'grad_norm': 0.46502476930618286, 'learning_rate': 0.00011399380804953561, 'epoch': 0.43}


 43%|████▎     | 1395/3235 [25:08<34:34,  1.13s/it]

{'loss': 0.9707, 'grad_norm': 0.43304643034935, 'learning_rate': 0.00011393188854489165, 'epoch': 0.43}


 43%|████▎     | 1396/3235 [25:09<33:26,  1.09s/it]

{'loss': 1.1306, 'grad_norm': 0.48844268918037415, 'learning_rate': 0.00011386996904024769, 'epoch': 0.43}


 43%|████▎     | 1397/3235 [25:10<32:42,  1.07s/it]

{'loss': 0.8682, 'grad_norm': 0.4650123119354248, 'learning_rate': 0.0001138080495356037, 'epoch': 0.43}


 43%|████▎     | 1398/3235 [25:11<32:42,  1.07s/it]

{'loss': 1.0448, 'grad_norm': 0.4667564034461975, 'learning_rate': 0.00011374613003095975, 'epoch': 0.43}


 43%|████▎     | 1399/3235 [25:12<31:54,  1.04s/it]

{'loss': 0.8221, 'grad_norm': 0.43755221366882324, 'learning_rate': 0.0001136842105263158, 'epoch': 0.43}


 43%|████▎     | 1400/3235 [25:13<31:34,  1.03s/it]

{'loss': 0.8084, 'grad_norm': 0.5021741986274719, 'learning_rate': 0.00011362229102167184, 'epoch': 0.43}


 43%|████▎     | 1401/3235 [25:15<32:52,  1.08s/it]

{'loss': 0.9363, 'grad_norm': 0.5015639662742615, 'learning_rate': 0.00011356037151702788, 'epoch': 0.43}


 43%|████▎     | 1402/3235 [25:15<31:23,  1.03s/it]

{'loss': 1.0289, 'grad_norm': 0.5024269819259644, 'learning_rate': 0.00011349845201238392, 'epoch': 0.43}


 43%|████▎     | 1403/3235 [25:17<33:25,  1.09s/it]

{'loss': 1.1159, 'grad_norm': 0.5098652243614197, 'learning_rate': 0.00011343653250773993, 'epoch': 0.43}


 43%|████▎     | 1404/3235 [25:18<33:23,  1.09s/it]

{'loss': 0.9968, 'grad_norm': 0.5000103712081909, 'learning_rate': 0.00011337461300309597, 'epoch': 0.43}


 43%|████▎     | 1405/3235 [25:19<34:24,  1.13s/it]

{'loss': 1.1359, 'grad_norm': 0.43864619731903076, 'learning_rate': 0.00011331269349845201, 'epoch': 0.43}


 43%|████▎     | 1406/3235 [25:20<33:09,  1.09s/it]

{'loss': 0.9767, 'grad_norm': 0.5130276083946228, 'learning_rate': 0.00011325077399380805, 'epoch': 0.43}


 43%|████▎     | 1407/3235 [25:21<32:32,  1.07s/it]

{'loss': 1.0389, 'grad_norm': 0.5079069137573242, 'learning_rate': 0.00011318885448916409, 'epoch': 0.43}


 44%|████▎     | 1408/3235 [25:23<36:03,  1.18s/it]

{'loss': 0.9826, 'grad_norm': 0.5647114515304565, 'learning_rate': 0.00011312693498452013, 'epoch': 0.44}


 44%|████▎     | 1409/3235 [25:24<37:03,  1.22s/it]

{'loss': 1.0959, 'grad_norm': 0.44462722539901733, 'learning_rate': 0.00011306501547987616, 'epoch': 0.44}


 44%|████▎     | 1410/3235 [25:25<35:59,  1.18s/it]

{'loss': 1.0287, 'grad_norm': 0.46013087034225464, 'learning_rate': 0.0001130030959752322, 'epoch': 0.44}


 44%|████▎     | 1411/3235 [25:26<36:37,  1.20s/it]

{'loss': 1.0014, 'grad_norm': 0.4481346011161804, 'learning_rate': 0.00011294117647058824, 'epoch': 0.44}


 44%|████▎     | 1412/3235 [25:27<34:41,  1.14s/it]

{'loss': 1.2144, 'grad_norm': 0.5064657926559448, 'learning_rate': 0.00011287925696594428, 'epoch': 0.44}


 44%|████▎     | 1413/3235 [25:29<37:55,  1.25s/it]

{'loss': 0.8322, 'grad_norm': 0.3716229200363159, 'learning_rate': 0.00011281733746130032, 'epoch': 0.44}


 44%|████▎     | 1414/3235 [25:30<35:29,  1.17s/it]

{'loss': 0.8656, 'grad_norm': 0.4754398763179779, 'learning_rate': 0.00011275541795665636, 'epoch': 0.44}


 44%|████▎     | 1415/3235 [25:31<33:20,  1.10s/it]

{'loss': 1.0534, 'grad_norm': 0.537880003452301, 'learning_rate': 0.00011269349845201239, 'epoch': 0.44}


 44%|████▍     | 1416/3235 [25:32<35:00,  1.15s/it]

{'loss': 0.9339, 'grad_norm': 0.4728538393974304, 'learning_rate': 0.00011263157894736843, 'epoch': 0.44}


 44%|████▍     | 1417/3235 [25:33<33:40,  1.11s/it]

{'loss': 0.9861, 'grad_norm': 0.5011928677558899, 'learning_rate': 0.00011256965944272447, 'epoch': 0.44}


 44%|████▍     | 1418/3235 [25:34<33:54,  1.12s/it]

{'loss': 1.1417, 'grad_norm': 0.4618915915489197, 'learning_rate': 0.0001125077399380805, 'epoch': 0.44}


 44%|████▍     | 1419/3235 [25:35<31:14,  1.03s/it]

{'loss': 0.9957, 'grad_norm': 0.5379235148429871, 'learning_rate': 0.00011244582043343655, 'epoch': 0.44}


 44%|████▍     | 1420/3235 [25:36<31:22,  1.04s/it]

{'loss': 0.9414, 'grad_norm': 0.45660683512687683, 'learning_rate': 0.00011238390092879259, 'epoch': 0.44}


 44%|████▍     | 1421/3235 [25:37<30:30,  1.01s/it]

{'loss': 1.0991, 'grad_norm': 0.5472123026847839, 'learning_rate': 0.0001123219814241486, 'epoch': 0.44}


 44%|████▍     | 1422/3235 [25:38<30:41,  1.02s/it]

{'loss': 0.8998, 'grad_norm': 0.46381399035453796, 'learning_rate': 0.00011226006191950464, 'epoch': 0.44}


 44%|████▍     | 1423/3235 [25:39<31:19,  1.04s/it]

{'loss': 1.0749, 'grad_norm': 0.4779459238052368, 'learning_rate': 0.00011219814241486069, 'epoch': 0.44}


 44%|████▍     | 1424/3235 [25:40<31:42,  1.05s/it]

{'loss': 0.8974, 'grad_norm': 0.47967585921287537, 'learning_rate': 0.00011213622291021673, 'epoch': 0.44}


 44%|████▍     | 1425/3235 [25:41<31:05,  1.03s/it]

{'loss': 1.0359, 'grad_norm': 0.5304611325263977, 'learning_rate': 0.00011207430340557277, 'epoch': 0.44}


 44%|████▍     | 1426/3235 [25:42<30:47,  1.02s/it]

{'loss': 0.8407, 'grad_norm': 0.49081629514694214, 'learning_rate': 0.00011201238390092881, 'epoch': 0.44}


 44%|████▍     | 1427/3235 [25:43<31:18,  1.04s/it]

{'loss': 0.9706, 'grad_norm': 0.45010140538215637, 'learning_rate': 0.00011195046439628482, 'epoch': 0.44}


 44%|████▍     | 1428/3235 [25:44<29:23,  1.02it/s]

{'loss': 0.9542, 'grad_norm': 0.5260913372039795, 'learning_rate': 0.00011188854489164086, 'epoch': 0.44}


 44%|████▍     | 1429/3235 [25:45<27:17,  1.10it/s]

{'loss': 0.8827, 'grad_norm': 0.5215414762496948, 'learning_rate': 0.0001118266253869969, 'epoch': 0.44}


 44%|████▍     | 1430/3235 [25:46<26:34,  1.13it/s]

{'loss': 1.1176, 'grad_norm': 0.5915969014167786, 'learning_rate': 0.00011176470588235294, 'epoch': 0.44}


 44%|████▍     | 1431/3235 [25:47<27:49,  1.08it/s]

{'loss': 1.0184, 'grad_norm': 0.5466330051422119, 'learning_rate': 0.00011170278637770898, 'epoch': 0.44}


 44%|████▍     | 1432/3235 [25:47<27:50,  1.08it/s]

{'loss': 1.081, 'grad_norm': 0.5132094621658325, 'learning_rate': 0.00011164086687306502, 'epoch': 0.44}


 44%|████▍     | 1433/3235 [25:49<30:04,  1.00s/it]

{'loss': 1.1468, 'grad_norm': 0.4877340793609619, 'learning_rate': 0.00011157894736842105, 'epoch': 0.44}


 44%|████▍     | 1434/3235 [25:50<31:47,  1.06s/it]

{'loss': 1.0188, 'grad_norm': 0.43775856494903564, 'learning_rate': 0.00011151702786377709, 'epoch': 0.44}


 44%|████▍     | 1435/3235 [25:51<32:04,  1.07s/it]

{'loss': 1.1375, 'grad_norm': 0.48477792739868164, 'learning_rate': 0.00011145510835913313, 'epoch': 0.44}


 44%|████▍     | 1436/3235 [25:52<31:33,  1.05s/it]

{'loss': 1.1766, 'grad_norm': 0.48775142431259155, 'learning_rate': 0.00011139318885448917, 'epoch': 0.44}


 44%|████▍     | 1437/3235 [25:53<30:16,  1.01s/it]

{'loss': 1.0147, 'grad_norm': 0.48918312788009644, 'learning_rate': 0.00011133126934984521, 'epoch': 0.44}


 44%|████▍     | 1438/3235 [25:54<31:11,  1.04s/it]

{'loss': 1.125, 'grad_norm': 0.5014275908470154, 'learning_rate': 0.00011126934984520125, 'epoch': 0.44}


 44%|████▍     | 1439/3235 [25:55<31:45,  1.06s/it]

{'loss': 1.0904, 'grad_norm': 0.4870995283126831, 'learning_rate': 0.00011120743034055728, 'epoch': 0.44}


 45%|████▍     | 1440/3235 [25:56<32:15,  1.08s/it]

{'loss': 1.0289, 'grad_norm': 0.43331071734428406, 'learning_rate': 0.00011114551083591332, 'epoch': 0.45}


 45%|████▍     | 1441/3235 [25:57<30:28,  1.02s/it]

{'loss': 1.0349, 'grad_norm': 0.4885208010673523, 'learning_rate': 0.00011108359133126936, 'epoch': 0.45}


 45%|████▍     | 1442/3235 [25:58<31:31,  1.05s/it]

{'loss': 1.0943, 'grad_norm': 0.4639982581138611, 'learning_rate': 0.0001110216718266254, 'epoch': 0.45}


 45%|████▍     | 1443/3235 [25:59<32:09,  1.08s/it]

{'loss': 1.0445, 'grad_norm': 0.5015563368797302, 'learning_rate': 0.00011095975232198144, 'epoch': 0.45}


 45%|████▍     | 1444/3235 [26:00<31:50,  1.07s/it]

{'loss': 0.9765, 'grad_norm': 0.5105340480804443, 'learning_rate': 0.00011089783281733748, 'epoch': 0.45}


 45%|████▍     | 1445/3235 [26:01<31:28,  1.06s/it]

{'loss': 1.2269, 'grad_norm': 0.5120456218719482, 'learning_rate': 0.00011083591331269349, 'epoch': 0.45}


 45%|████▍     | 1446/3235 [26:03<32:47,  1.10s/it]

{'loss': 1.0597, 'grad_norm': 0.485337495803833, 'learning_rate': 0.00011077399380804953, 'epoch': 0.45}


 45%|████▍     | 1447/3235 [26:04<34:10,  1.15s/it]

{'loss': 0.9583, 'grad_norm': 0.4451933801174164, 'learning_rate': 0.00011071207430340557, 'epoch': 0.45}


 45%|████▍     | 1448/3235 [26:05<31:09,  1.05s/it]

{'loss': 1.0268, 'grad_norm': 0.5531357526779175, 'learning_rate': 0.00011065015479876162, 'epoch': 0.45}


 45%|████▍     | 1449/3235 [26:06<32:20,  1.09s/it]

{'loss': 0.9652, 'grad_norm': 0.42052286863327026, 'learning_rate': 0.00011058823529411766, 'epoch': 0.45}


 45%|████▍     | 1450/3235 [26:07<31:55,  1.07s/it]

{'loss': 1.0975, 'grad_norm': 0.6248580813407898, 'learning_rate': 0.0001105263157894737, 'epoch': 0.45}


 45%|████▍     | 1451/3235 [26:08<31:39,  1.07s/it]

{'loss': 1.0702, 'grad_norm': 0.4909203052520752, 'learning_rate': 0.00011046439628482972, 'epoch': 0.45}


 45%|████▍     | 1452/3235 [26:09<31:34,  1.06s/it]

{'loss': 1.0734, 'grad_norm': 0.49329113960266113, 'learning_rate': 0.00011040247678018576, 'epoch': 0.45}


 45%|████▍     | 1453/3235 [26:10<29:33,  1.00it/s]

{'loss': 1.1225, 'grad_norm': 0.5208361744880676, 'learning_rate': 0.0001103405572755418, 'epoch': 0.45}


 45%|████▍     | 1454/3235 [26:11<31:01,  1.05s/it]

{'loss': 1.0597, 'grad_norm': 0.4461063742637634, 'learning_rate': 0.00011027863777089784, 'epoch': 0.45}


 45%|████▍     | 1455/3235 [26:12<33:15,  1.12s/it]

{'loss': 0.8632, 'grad_norm': 0.43709370493888855, 'learning_rate': 0.00011021671826625388, 'epoch': 0.45}


 45%|████▌     | 1456/3235 [26:13<32:07,  1.08s/it]

{'loss': 1.111, 'grad_norm': 0.4726514220237732, 'learning_rate': 0.00011015479876160992, 'epoch': 0.45}


 45%|████▌     | 1457/3235 [26:14<30:03,  1.01s/it]

{'loss': 0.9203, 'grad_norm': 0.5092998147010803, 'learning_rate': 0.00011009287925696594, 'epoch': 0.45}


 45%|████▌     | 1458/3235 [26:15<32:02,  1.08s/it]

{'loss': 0.8457, 'grad_norm': 0.4514256417751312, 'learning_rate': 0.00011003095975232198, 'epoch': 0.45}


 45%|████▌     | 1459/3235 [26:16<31:53,  1.08s/it]

{'loss': 1.0305, 'grad_norm': 0.5262970924377441, 'learning_rate': 0.00010996904024767802, 'epoch': 0.45}


 45%|████▌     | 1460/3235 [26:17<31:35,  1.07s/it]

{'loss': 0.9632, 'grad_norm': 0.47618991136550903, 'learning_rate': 0.00010990712074303406, 'epoch': 0.45}


 45%|████▌     | 1461/3235 [26:19<32:30,  1.10s/it]

{'loss': 1.0434, 'grad_norm': 0.46954911947250366, 'learning_rate': 0.0001098452012383901, 'epoch': 0.45}


 45%|████▌     | 1462/3235 [26:20<31:46,  1.08s/it]

{'loss': 1.0921, 'grad_norm': 0.5054996013641357, 'learning_rate': 0.00010978328173374614, 'epoch': 0.45}


 45%|████▌     | 1463/3235 [26:21<34:38,  1.17s/it]

{'loss': 1.0383, 'grad_norm': 0.4400572180747986, 'learning_rate': 0.00010972136222910217, 'epoch': 0.45}


 45%|████▌     | 1464/3235 [26:22<30:36,  1.04s/it]

{'loss': 0.9674, 'grad_norm': 0.517799973487854, 'learning_rate': 0.00010965944272445821, 'epoch': 0.45}


 45%|████▌     | 1465/3235 [26:23<30:56,  1.05s/it]

{'loss': 0.927, 'grad_norm': 0.48452699184417725, 'learning_rate': 0.00010959752321981425, 'epoch': 0.45}


 45%|████▌     | 1466/3235 [26:24<31:20,  1.06s/it]

{'loss': 1.1254, 'grad_norm': 0.4602823555469513, 'learning_rate': 0.00010953560371517029, 'epoch': 0.45}


 45%|████▌     | 1467/3235 [26:25<31:15,  1.06s/it]

{'loss': 1.0505, 'grad_norm': 0.5664620995521545, 'learning_rate': 0.00010947368421052633, 'epoch': 0.45}


 45%|████▌     | 1468/3235 [26:26<30:01,  1.02s/it]

{'loss': 1.1501, 'grad_norm': 0.4780856668949127, 'learning_rate': 0.00010941176470588237, 'epoch': 0.45}


 45%|████▌     | 1469/3235 [26:27<31:48,  1.08s/it]

{'loss': 1.0074, 'grad_norm': 0.4918375313282013, 'learning_rate': 0.00010934984520123838, 'epoch': 0.45}


 45%|████▌     | 1470/3235 [26:28<32:15,  1.10s/it]

{'loss': 1.0818, 'grad_norm': 0.4672580361366272, 'learning_rate': 0.00010928792569659442, 'epoch': 0.45}


 45%|████▌     | 1471/3235 [26:29<31:38,  1.08s/it]

{'loss': 0.9544, 'grad_norm': 0.4706929624080658, 'learning_rate': 0.00010922600619195046, 'epoch': 0.45}


 46%|████▌     | 1472/3235 [26:30<29:36,  1.01s/it]

{'loss': 1.0103, 'grad_norm': 0.512165904045105, 'learning_rate': 0.00010916408668730652, 'epoch': 0.46}


 46%|████▌     | 1473/3235 [26:31<30:47,  1.05s/it]

{'loss': 1.0705, 'grad_norm': 0.4699469208717346, 'learning_rate': 0.00010910216718266256, 'epoch': 0.46}


 46%|████▌     | 1474/3235 [26:32<30:53,  1.05s/it]

{'loss': 0.8239, 'grad_norm': 0.4488561451435089, 'learning_rate': 0.00010904024767801857, 'epoch': 0.46}


 46%|████▌     | 1475/3235 [26:34<31:29,  1.07s/it]

{'loss': 1.0309, 'grad_norm': 0.498611181974411, 'learning_rate': 0.00010897832817337461, 'epoch': 0.46}


 46%|████▌     | 1476/3235 [26:35<31:55,  1.09s/it]

{'loss': 1.0353, 'grad_norm': 0.4755857586860657, 'learning_rate': 0.00010891640866873065, 'epoch': 0.46}


 46%|████▌     | 1477/3235 [26:36<30:57,  1.06s/it]

{'loss': 1.0732, 'grad_norm': 0.49995502829551697, 'learning_rate': 0.00010885448916408669, 'epoch': 0.46}


 46%|████▌     | 1478/3235 [26:37<32:10,  1.10s/it]

{'loss': 1.1947, 'grad_norm': 0.45068231225013733, 'learning_rate': 0.00010879256965944273, 'epoch': 0.46}


 46%|████▌     | 1479/3235 [26:38<35:02,  1.20s/it]

{'loss': 1.0473, 'grad_norm': 0.40365317463874817, 'learning_rate': 0.00010873065015479877, 'epoch': 0.46}


 46%|████▌     | 1480/3235 [26:39<33:50,  1.16s/it]

{'loss': 0.9477, 'grad_norm': 0.451967716217041, 'learning_rate': 0.0001086687306501548, 'epoch': 0.46}


 46%|████▌     | 1481/3235 [26:40<32:01,  1.10s/it]

{'loss': 1.1016, 'grad_norm': 0.559539794921875, 'learning_rate': 0.00010860681114551084, 'epoch': 0.46}


 46%|████▌     | 1482/3235 [26:41<30:30,  1.04s/it]

{'loss': 1.0853, 'grad_norm': 0.47298768162727356, 'learning_rate': 0.00010854489164086688, 'epoch': 0.46}


 46%|████▌     | 1483/3235 [26:42<32:33,  1.11s/it]

{'loss': 1.1042, 'grad_norm': 0.4314648509025574, 'learning_rate': 0.00010848297213622292, 'epoch': 0.46}


 46%|████▌     | 1484/3235 [26:43<31:09,  1.07s/it]

{'loss': 1.0877, 'grad_norm': 0.5117505192756653, 'learning_rate': 0.00010842105263157896, 'epoch': 0.46}


 46%|████▌     | 1485/3235 [26:45<35:58,  1.23s/it]

{'loss': 0.9007, 'grad_norm': 0.43119677901268005, 'learning_rate': 0.000108359133126935, 'epoch': 0.46}


 46%|████▌     | 1486/3235 [26:46<34:53,  1.20s/it]

{'loss': 0.8202, 'grad_norm': 0.4177727997303009, 'learning_rate': 0.00010829721362229101, 'epoch': 0.46}


 46%|████▌     | 1487/3235 [26:47<31:19,  1.08s/it]

{'loss': 0.9241, 'grad_norm': 0.5023682713508606, 'learning_rate': 0.00010823529411764706, 'epoch': 0.46}


 46%|████▌     | 1488/3235 [26:48<29:46,  1.02s/it]

{'loss': 1.0066, 'grad_norm': 0.5040862560272217, 'learning_rate': 0.0001081733746130031, 'epoch': 0.46}


 46%|████▌     | 1489/3235 [26:49<29:35,  1.02s/it]

{'loss': 1.0228, 'grad_norm': 0.4918416440486908, 'learning_rate': 0.00010811145510835914, 'epoch': 0.46}


 46%|████▌     | 1490/3235 [26:50<31:28,  1.08s/it]

{'loss': 1.1083, 'grad_norm': 0.453797847032547, 'learning_rate': 0.00010804953560371518, 'epoch': 0.46}


 46%|████▌     | 1491/3235 [26:51<29:44,  1.02s/it]

{'loss': 0.8551, 'grad_norm': 0.5026349425315857, 'learning_rate': 0.00010798761609907122, 'epoch': 0.46}


 46%|████▌     | 1492/3235 [26:52<30:13,  1.04s/it]

{'loss': 0.9773, 'grad_norm': 0.4454306960105896, 'learning_rate': 0.00010792569659442724, 'epoch': 0.46}


 46%|████▌     | 1493/3235 [26:53<30:06,  1.04s/it]

{'loss': 0.9577, 'grad_norm': 0.4479050934314728, 'learning_rate': 0.00010786377708978328, 'epoch': 0.46}


 46%|████▌     | 1494/3235 [26:54<30:39,  1.06s/it]

{'loss': 1.033, 'grad_norm': 0.4175361394882202, 'learning_rate': 0.00010780185758513932, 'epoch': 0.46}


 46%|████▌     | 1495/3235 [26:55<30:46,  1.06s/it]

{'loss': 1.0409, 'grad_norm': 0.4481686055660248, 'learning_rate': 0.00010773993808049536, 'epoch': 0.46}


 46%|████▌     | 1496/3235 [26:56<32:02,  1.11s/it]

{'loss': 1.028, 'grad_norm': 0.4279133379459381, 'learning_rate': 0.0001076780185758514, 'epoch': 0.46}


 46%|████▋     | 1497/3235 [26:58<32:20,  1.12s/it]

{'loss': 1.0701, 'grad_norm': 0.47624507546424866, 'learning_rate': 0.00010761609907120745, 'epoch': 0.46}


 46%|████▋     | 1498/3235 [26:58<29:38,  1.02s/it]

{'loss': 1.1216, 'grad_norm': 0.5130699872970581, 'learning_rate': 0.00010755417956656346, 'epoch': 0.46}


 46%|████▋     | 1499/3235 [27:00<30:31,  1.05s/it]

{'loss': 1.077, 'grad_norm': 0.47760042548179626, 'learning_rate': 0.0001074922600619195, 'epoch': 0.46}


 46%|████▋     | 1500/3235 [27:01<30:45,  1.06s/it]

{'loss': 0.9319, 'grad_norm': 0.4635612666606903, 'learning_rate': 0.00010743034055727554, 'epoch': 0.46}


[34m[1mwandb[0m: Adding directory to artifact (./outputs/checkpoint-1500)... Done. 0.2s
 46%|████▋     | 1501/3235 [27:03<39:56,  1.38s/it]

{'loss': 1.2269, 'grad_norm': 0.5178483128547668, 'learning_rate': 0.00010736842105263158, 'epoch': 0.46}


 46%|████▋     | 1502/3235 [27:04<38:13,  1.32s/it]

{'loss': 1.0625, 'grad_norm': 0.48807859420776367, 'learning_rate': 0.00010730650154798762, 'epoch': 0.46}


 46%|████▋     | 1503/3235 [27:05<35:24,  1.23s/it]

{'loss': 0.93, 'grad_norm': 0.4882313907146454, 'learning_rate': 0.00010724458204334366, 'epoch': 0.46}


 46%|████▋     | 1504/3235 [27:06<35:04,  1.22s/it]

{'loss': 0.9823, 'grad_norm': 0.3927008807659149, 'learning_rate': 0.00010718266253869969, 'epoch': 0.46}


 47%|████▋     | 1505/3235 [27:07<35:02,  1.22s/it]

{'loss': 1.0556, 'grad_norm': 0.6222098469734192, 'learning_rate': 0.00010712074303405573, 'epoch': 0.47}


 47%|████▋     | 1506/3235 [27:08<32:23,  1.12s/it]

{'loss': 1.0681, 'grad_norm': 0.5992091298103333, 'learning_rate': 0.00010705882352941177, 'epoch': 0.47}


 47%|████▋     | 1507/3235 [27:09<32:32,  1.13s/it]

{'loss': 0.9827, 'grad_norm': 0.48788365721702576, 'learning_rate': 0.00010699690402476781, 'epoch': 0.47}


 47%|████▋     | 1508/3235 [27:11<32:45,  1.14s/it]

{'loss': 0.9508, 'grad_norm': 0.44143977761268616, 'learning_rate': 0.00010693498452012385, 'epoch': 0.47}


 47%|████▋     | 1509/3235 [27:12<31:52,  1.11s/it]

{'loss': 1.0815, 'grad_norm': 0.4552812874317169, 'learning_rate': 0.00010687306501547989, 'epoch': 0.47}


 47%|████▋     | 1510/3235 [27:13<32:27,  1.13s/it]

{'loss': 1.1211, 'grad_norm': 0.4761623740196228, 'learning_rate': 0.0001068111455108359, 'epoch': 0.47}


 47%|████▋     | 1511/3235 [27:14<33:54,  1.18s/it]

{'loss': 1.0593, 'grad_norm': 0.44079771637916565, 'learning_rate': 0.00010674922600619196, 'epoch': 0.47}


 47%|████▋     | 1512/3235 [27:15<33:21,  1.16s/it]

{'loss': 1.0297, 'grad_norm': 0.4564419686794281, 'learning_rate': 0.000106687306501548, 'epoch': 0.47}


 47%|████▋     | 1513/3235 [27:16<34:12,  1.19s/it]

{'loss': 1.0293, 'grad_norm': 0.4547681212425232, 'learning_rate': 0.00010662538699690404, 'epoch': 0.47}


 47%|████▋     | 1514/3235 [27:18<34:04,  1.19s/it]

{'loss': 1.0515, 'grad_norm': 0.46948838233947754, 'learning_rate': 0.00010656346749226008, 'epoch': 0.47}


 47%|████▋     | 1515/3235 [27:19<35:05,  1.22s/it]

{'loss': 1.1123, 'grad_norm': 0.47512194514274597, 'learning_rate': 0.00010650154798761612, 'epoch': 0.47}


 47%|████▋     | 1516/3235 [27:20<33:53,  1.18s/it]

{'loss': 0.8495, 'grad_norm': 0.4304784834384918, 'learning_rate': 0.00010643962848297213, 'epoch': 0.47}


 47%|████▋     | 1517/3235 [27:21<30:56,  1.08s/it]

{'loss': 1.2968, 'grad_norm': 0.5371948480606079, 'learning_rate': 0.00010637770897832817, 'epoch': 0.47}


 47%|████▋     | 1518/3235 [27:22<30:25,  1.06s/it]

{'loss': 1.2062, 'grad_norm': 0.47279131412506104, 'learning_rate': 0.00010631578947368421, 'epoch': 0.47}


 47%|████▋     | 1519/3235 [27:23<30:01,  1.05s/it]

{'loss': 1.1054, 'grad_norm': 0.4857122600078583, 'learning_rate': 0.00010625386996904025, 'epoch': 0.47}


 47%|████▋     | 1520/3235 [27:24<30:19,  1.06s/it]

{'loss': 1.1489, 'grad_norm': 0.5040034651756287, 'learning_rate': 0.00010619195046439629, 'epoch': 0.47}


 47%|████▋     | 1521/3235 [27:25<29:33,  1.03s/it]

{'loss': 1.0652, 'grad_norm': 0.4683949053287506, 'learning_rate': 0.00010613003095975234, 'epoch': 0.47}


 47%|████▋     | 1522/3235 [27:26<26:13,  1.09it/s]

{'loss': 1.1559, 'grad_norm': 0.5921148657798767, 'learning_rate': 0.00010606811145510835, 'epoch': 0.47}


 47%|████▋     | 1523/3235 [27:27<28:54,  1.01s/it]

{'loss': 1.1393, 'grad_norm': 0.4337461590766907, 'learning_rate': 0.0001060061919504644, 'epoch': 0.47}


 47%|████▋     | 1524/3235 [27:28<29:13,  1.02s/it]

{'loss': 0.8716, 'grad_norm': 0.4331403076648712, 'learning_rate': 0.00010594427244582043, 'epoch': 0.47}


 47%|████▋     | 1525/3235 [27:29<28:17,  1.01it/s]

{'loss': 1.0883, 'grad_norm': 0.477946937084198, 'learning_rate': 0.00010588235294117647, 'epoch': 0.47}


 47%|████▋     | 1526/3235 [27:30<30:55,  1.09s/it]

{'loss': 1.0791, 'grad_norm': 0.44569286704063416, 'learning_rate': 0.00010582043343653251, 'epoch': 0.47}


 47%|████▋     | 1527/3235 [27:31<31:39,  1.11s/it]

{'loss': 0.7711, 'grad_norm': 0.4285123944282532, 'learning_rate': 0.00010575851393188855, 'epoch': 0.47}


 47%|████▋     | 1528/3235 [27:33<32:41,  1.15s/it]

{'loss': 0.9638, 'grad_norm': 0.43032053112983704, 'learning_rate': 0.00010569659442724458, 'epoch': 0.47}


 47%|████▋     | 1529/3235 [27:34<31:50,  1.12s/it]

{'loss': 1.1269, 'grad_norm': 0.47363901138305664, 'learning_rate': 0.00010563467492260062, 'epoch': 0.47}


 47%|████▋     | 1530/3235 [27:35<32:55,  1.16s/it]

{'loss': 1.0627, 'grad_norm': 0.4439764618873596, 'learning_rate': 0.00010557275541795666, 'epoch': 0.47}


 47%|████▋     | 1531/3235 [27:36<30:19,  1.07s/it]

{'loss': 1.0788, 'grad_norm': 0.5029974579811096, 'learning_rate': 0.0001055108359133127, 'epoch': 0.47}


 47%|████▋     | 1532/3235 [27:37<32:07,  1.13s/it]

{'loss': 1.1845, 'grad_norm': 0.5004132390022278, 'learning_rate': 0.00010544891640866874, 'epoch': 0.47}


 47%|████▋     | 1533/3235 [27:38<31:07,  1.10s/it]

{'loss': 1.1201, 'grad_norm': 0.5115221738815308, 'learning_rate': 0.00010538699690402478, 'epoch': 0.47}


 47%|████▋     | 1534/3235 [27:39<29:14,  1.03s/it]

{'loss': 0.9771, 'grad_norm': 0.5266652703285217, 'learning_rate': 0.0001053250773993808, 'epoch': 0.47}


 47%|████▋     | 1535/3235 [27:40<26:34,  1.07it/s]

{'loss': 0.8162, 'grad_norm': 0.5014608502388, 'learning_rate': 0.00010526315789473685, 'epoch': 0.47}


 47%|████▋     | 1536/3235 [27:41<27:07,  1.04it/s]

{'loss': 1.1093, 'grad_norm': 0.4929008483886719, 'learning_rate': 0.00010520123839009289, 'epoch': 0.47}


 48%|████▊     | 1537/3235 [27:42<30:22,  1.07s/it]

{'loss': 1.0242, 'grad_norm': 0.41678497195243835, 'learning_rate': 0.00010513931888544893, 'epoch': 0.48}


 48%|████▊     | 1538/3235 [27:43<31:52,  1.13s/it]

{'loss': 1.0862, 'grad_norm': 0.4665501117706299, 'learning_rate': 0.00010507739938080497, 'epoch': 0.48}


 48%|████▊     | 1539/3235 [27:44<27:39,  1.02it/s]

{'loss': 0.9313, 'grad_norm': 0.5639320611953735, 'learning_rate': 0.00010501547987616101, 'epoch': 0.48}


 48%|████▊     | 1540/3235 [27:45<27:37,  1.02it/s]

{'loss': 1.0431, 'grad_norm': 0.46937960386276245, 'learning_rate': 0.00010495356037151702, 'epoch': 0.48}


 48%|████▊     | 1541/3235 [27:46<28:07,  1.00it/s]

{'loss': 0.9826, 'grad_norm': 0.487773060798645, 'learning_rate': 0.00010489164086687306, 'epoch': 0.48}


 48%|████▊     | 1542/3235 [27:47<29:30,  1.05s/it]

{'loss': 1.1517, 'grad_norm': 0.4915841817855835, 'learning_rate': 0.0001048297213622291, 'epoch': 0.48}


 48%|████▊     | 1543/3235 [27:48<30:20,  1.08s/it]

{'loss': 0.9795, 'grad_norm': 0.4695834815502167, 'learning_rate': 0.00010476780185758514, 'epoch': 0.48}


 48%|████▊     | 1544/3235 [27:49<29:41,  1.05s/it]

{'loss': 1.1903, 'grad_norm': 0.4926116466522217, 'learning_rate': 0.00010470588235294118, 'epoch': 0.48}


 48%|████▊     | 1545/3235 [27:50<30:21,  1.08s/it]

{'loss': 1.0961, 'grad_norm': 0.4795020520687103, 'learning_rate': 0.00010464396284829723, 'epoch': 0.48}


 48%|████▊     | 1546/3235 [27:51<29:41,  1.05s/it]

{'loss': 0.9923, 'grad_norm': 0.4703900218009949, 'learning_rate': 0.00010458204334365325, 'epoch': 0.48}


 48%|████▊     | 1547/3235 [27:52<28:05,  1.00it/s]

{'loss': 0.9408, 'grad_norm': 0.49035635590553284, 'learning_rate': 0.00010452012383900929, 'epoch': 0.48}


 48%|████▊     | 1548/3235 [27:53<29:13,  1.04s/it]

{'loss': 0.9724, 'grad_norm': 0.4496094286441803, 'learning_rate': 0.00010445820433436533, 'epoch': 0.48}


 48%|████▊     | 1549/3235 [27:55<30:57,  1.10s/it]

{'loss': 1.0157, 'grad_norm': 0.42020174860954285, 'learning_rate': 0.00010439628482972137, 'epoch': 0.48}


 48%|████▊     | 1550/3235 [27:56<30:30,  1.09s/it]

{'loss': 0.8592, 'grad_norm': 0.4633684754371643, 'learning_rate': 0.00010433436532507741, 'epoch': 0.48}


 48%|████▊     | 1551/3235 [27:57<29:52,  1.06s/it]

{'loss': 1.1052, 'grad_norm': 0.5231487154960632, 'learning_rate': 0.00010427244582043345, 'epoch': 0.48}


 48%|████▊     | 1552/3235 [27:58<30:48,  1.10s/it]

{'loss': 1.2227, 'grad_norm': 0.4955470561981201, 'learning_rate': 0.00010421052631578947, 'epoch': 0.48}


 48%|████▊     | 1553/3235 [27:59<30:15,  1.08s/it]

{'loss': 1.1024, 'grad_norm': 0.4660821259021759, 'learning_rate': 0.00010414860681114551, 'epoch': 0.48}


 48%|████▊     | 1554/3235 [28:00<30:09,  1.08s/it]

{'loss': 0.9356, 'grad_norm': 0.46193385124206543, 'learning_rate': 0.00010408668730650155, 'epoch': 0.48}


 48%|████▊     | 1555/3235 [28:01<29:19,  1.05s/it]

{'loss': 0.8574, 'grad_norm': 0.4942115247249603, 'learning_rate': 0.0001040247678018576, 'epoch': 0.48}


 48%|████▊     | 1556/3235 [28:02<30:30,  1.09s/it]

{'loss': 1.1823, 'grad_norm': 0.5516014099121094, 'learning_rate': 0.00010396284829721363, 'epoch': 0.48}


 48%|████▊     | 1557/3235 [28:03<27:36,  1.01it/s]

{'loss': 0.919, 'grad_norm': 0.5038414597511292, 'learning_rate': 0.00010390092879256967, 'epoch': 0.48}


 48%|████▊     | 1558/3235 [28:04<29:55,  1.07s/it]

{'loss': 1.1717, 'grad_norm': 0.46026065945625305, 'learning_rate': 0.00010383900928792569, 'epoch': 0.48}


 48%|████▊     | 1559/3235 [28:05<28:44,  1.03s/it]

{'loss': 1.1698, 'grad_norm': 0.533929169178009, 'learning_rate': 0.00010377708978328173, 'epoch': 0.48}


 48%|████▊     | 1560/3235 [28:06<31:40,  1.13s/it]

{'loss': 1.0658, 'grad_norm': 0.39030900597572327, 'learning_rate': 0.00010371517027863778, 'epoch': 0.48}


 48%|████▊     | 1561/3235 [28:08<31:55,  1.14s/it]

{'loss': 1.041, 'grad_norm': 0.45184293389320374, 'learning_rate': 0.00010365325077399382, 'epoch': 0.48}


 48%|████▊     | 1562/3235 [28:09<31:17,  1.12s/it]

{'loss': 1.2317, 'grad_norm': 0.5262457728385925, 'learning_rate': 0.00010359133126934986, 'epoch': 0.48}


 48%|████▊     | 1563/3235 [28:10<30:52,  1.11s/it]

{'loss': 0.8921, 'grad_norm': 0.45044150948524475, 'learning_rate': 0.0001035294117647059, 'epoch': 0.48}


 48%|████▊     | 1564/3235 [28:11<29:41,  1.07s/it]

{'loss': 1.0421, 'grad_norm': 0.44503480195999146, 'learning_rate': 0.00010346749226006191, 'epoch': 0.48}


 48%|████▊     | 1565/3235 [28:12<31:01,  1.11s/it]

{'loss': 0.8473, 'grad_norm': 0.4504369795322418, 'learning_rate': 0.00010340557275541795, 'epoch': 0.48}


 48%|████▊     | 1566/3235 [28:13<30:12,  1.09s/it]

{'loss': 1.0497, 'grad_norm': 0.5334395170211792, 'learning_rate': 0.000103343653250774, 'epoch': 0.48}


 48%|████▊     | 1567/3235 [28:14<31:25,  1.13s/it]

{'loss': 1.031, 'grad_norm': 0.484441876411438, 'learning_rate': 0.00010328173374613003, 'epoch': 0.48}


 48%|████▊     | 1568/3235 [28:15<30:05,  1.08s/it]

{'loss': 0.9936, 'grad_norm': 0.5000624656677246, 'learning_rate': 0.00010321981424148607, 'epoch': 0.48}


 49%|████▊     | 1569/3235 [28:16<28:37,  1.03s/it]

{'loss': 1.053, 'grad_norm': 0.5043544769287109, 'learning_rate': 0.00010315789473684211, 'epoch': 0.49}


 49%|████▊     | 1570/3235 [28:17<28:01,  1.01s/it]

{'loss': 0.9893, 'grad_norm': 0.49111950397491455, 'learning_rate': 0.00010309597523219814, 'epoch': 0.49}


 49%|████▊     | 1571/3235 [28:18<29:30,  1.06s/it]

{'loss': 0.9773, 'grad_norm': 0.45539405941963196, 'learning_rate': 0.00010303405572755418, 'epoch': 0.49}


 49%|████▊     | 1572/3235 [28:19<28:42,  1.04s/it]

{'loss': 0.993, 'grad_norm': 0.5227646231651306, 'learning_rate': 0.00010297213622291022, 'epoch': 0.49}


 49%|████▊     | 1573/3235 [28:20<29:06,  1.05s/it]

{'loss': 0.9395, 'grad_norm': 0.5185227990150452, 'learning_rate': 0.00010291021671826626, 'epoch': 0.49}


 49%|████▊     | 1574/3235 [28:21<27:18,  1.01it/s]

{'loss': 0.9375, 'grad_norm': 0.49275267124176025, 'learning_rate': 0.0001028482972136223, 'epoch': 0.49}


 49%|████▊     | 1575/3235 [28:22<27:06,  1.02it/s]

{'loss': 1.1398, 'grad_norm': 0.5617106556892395, 'learning_rate': 0.00010278637770897834, 'epoch': 0.49}


 49%|████▊     | 1576/3235 [28:23<28:02,  1.01s/it]

{'loss': 0.853, 'grad_norm': 0.4281705617904663, 'learning_rate': 0.00010272445820433437, 'epoch': 0.49}


 49%|████▊     | 1577/3235 [28:24<29:02,  1.05s/it]

{'loss': 0.9872, 'grad_norm': 0.4898727238178253, 'learning_rate': 0.0001026625386996904, 'epoch': 0.49}


 49%|████▉     | 1578/3235 [28:25<28:33,  1.03s/it]

{'loss': 1.0195, 'grad_norm': 0.4824479818344116, 'learning_rate': 0.00010260061919504645, 'epoch': 0.49}


 49%|████▉     | 1579/3235 [28:26<29:13,  1.06s/it]

{'loss': 1.0188, 'grad_norm': 0.46344253420829773, 'learning_rate': 0.00010253869969040249, 'epoch': 0.49}


 49%|████▉     | 1580/3235 [28:28<31:33,  1.14s/it]

{'loss': 0.9699, 'grad_norm': 0.43123671412467957, 'learning_rate': 0.00010247678018575853, 'epoch': 0.49}


 49%|████▉     | 1581/3235 [28:29<32:19,  1.17s/it]

{'loss': 1.0894, 'grad_norm': 0.4806608259677887, 'learning_rate': 0.00010241486068111457, 'epoch': 0.49}


 49%|████▉     | 1582/3235 [28:30<31:27,  1.14s/it]

{'loss': 0.8493, 'grad_norm': 0.45801612734794617, 'learning_rate': 0.00010235294117647058, 'epoch': 0.49}


 49%|████▉     | 1583/3235 [28:31<31:30,  1.14s/it]

{'loss': 0.9871, 'grad_norm': 0.4600420296192169, 'learning_rate': 0.00010229102167182662, 'epoch': 0.49}


 49%|████▉     | 1584/3235 [28:32<28:47,  1.05s/it]

{'loss': 0.9052, 'grad_norm': 0.49703434109687805, 'learning_rate': 0.00010222910216718267, 'epoch': 0.49}


 49%|████▉     | 1585/3235 [28:33<29:06,  1.06s/it]

{'loss': 1.0467, 'grad_norm': 0.49456924200057983, 'learning_rate': 0.00010216718266253871, 'epoch': 0.49}


 49%|████▉     | 1586/3235 [28:34<28:09,  1.02s/it]

{'loss': 1.1238, 'grad_norm': 0.4866708815097809, 'learning_rate': 0.00010210526315789475, 'epoch': 0.49}


 49%|████▉     | 1587/3235 [28:35<28:44,  1.05s/it]

{'loss': 1.036, 'grad_norm': 0.48780566453933716, 'learning_rate': 0.00010204334365325079, 'epoch': 0.49}


 49%|████▉     | 1588/3235 [28:36<29:12,  1.06s/it]

{'loss': 0.9413, 'grad_norm': 0.4461033344268799, 'learning_rate': 0.0001019814241486068, 'epoch': 0.49}


 49%|████▉     | 1589/3235 [28:37<29:13,  1.07s/it]

{'loss': 0.994, 'grad_norm': 0.5081995129585266, 'learning_rate': 0.00010191950464396285, 'epoch': 0.49}


 49%|████▉     | 1590/3235 [28:38<29:51,  1.09s/it]

{'loss': 1.0046, 'grad_norm': 0.478524774312973, 'learning_rate': 0.00010185758513931889, 'epoch': 0.49}


 49%|████▉     | 1591/3235 [28:40<30:58,  1.13s/it]

{'loss': 1.0673, 'grad_norm': 0.527376651763916, 'learning_rate': 0.00010179566563467493, 'epoch': 0.49}


 49%|████▉     | 1592/3235 [28:41<29:17,  1.07s/it]

{'loss': 1.072, 'grad_norm': 0.7761176228523254, 'learning_rate': 0.00010173374613003097, 'epoch': 0.49}


 49%|████▉     | 1593/3235 [28:42<30:18,  1.11s/it]

{'loss': 1.0178, 'grad_norm': 0.46264681220054626, 'learning_rate': 0.000101671826625387, 'epoch': 0.49}


 49%|████▉     | 1594/3235 [28:43<27:22,  1.00s/it]

{'loss': 1.0222, 'grad_norm': 0.4979709982872009, 'learning_rate': 0.00010160990712074303, 'epoch': 0.49}


 49%|████▉     | 1595/3235 [28:44<28:05,  1.03s/it]

{'loss': 1.051, 'grad_norm': 0.4663741886615753, 'learning_rate': 0.00010154798761609907, 'epoch': 0.49}


 49%|████▉     | 1596/3235 [28:45<27:38,  1.01s/it]

{'loss': 1.049, 'grad_norm': 0.5242851376533508, 'learning_rate': 0.00010148606811145511, 'epoch': 0.49}


 49%|████▉     | 1597/3235 [28:46<29:03,  1.06s/it]

{'loss': 1.0196, 'grad_norm': 0.4621525704860687, 'learning_rate': 0.00010142414860681115, 'epoch': 0.49}


 49%|████▉     | 1598/3235 [28:47<28:25,  1.04s/it]

{'loss': 1.0927, 'grad_norm': 0.48770877718925476, 'learning_rate': 0.00010136222910216719, 'epoch': 0.49}


 49%|████▉     | 1599/3235 [28:48<28:02,  1.03s/it]

{'loss': 0.9096, 'grad_norm': 0.48702406883239746, 'learning_rate': 0.00010130030959752323, 'epoch': 0.49}


 49%|████▉     | 1600/3235 [28:49<28:50,  1.06s/it]

{'loss': 0.9683, 'grad_norm': 0.47115689516067505, 'learning_rate': 0.00010123839009287926, 'epoch': 0.49}


 49%|████▉     | 1601/3235 [28:50<32:07,  1.18s/it]

{'loss': 0.9629, 'grad_norm': 0.41227301955223083, 'learning_rate': 0.0001011764705882353, 'epoch': 0.49}


 50%|████▉     | 1602/3235 [28:51<30:02,  1.10s/it]

{'loss': 1.0079, 'grad_norm': 0.514503538608551, 'learning_rate': 0.00010111455108359134, 'epoch': 0.5}


 50%|████▉     | 1603/3235 [28:52<29:56,  1.10s/it]

{'loss': 1.1648, 'grad_norm': 0.4656318128108978, 'learning_rate': 0.00010105263157894738, 'epoch': 0.5}


 50%|████▉     | 1604/3235 [28:54<33:13,  1.22s/it]

{'loss': 1.0817, 'grad_norm': 0.43730929493904114, 'learning_rate': 0.00010099071207430342, 'epoch': 0.5}


 50%|████▉     | 1605/3235 [28:55<33:48,  1.24s/it]

{'loss': 1.1315, 'grad_norm': 0.4703636169433594, 'learning_rate': 0.00010092879256965946, 'epoch': 0.5}


 50%|████▉     | 1606/3235 [28:56<31:45,  1.17s/it]

{'loss': 0.907, 'grad_norm': 0.45968329906463623, 'learning_rate': 0.00010086687306501547, 'epoch': 0.5}


 50%|████▉     | 1607/3235 [28:57<30:38,  1.13s/it]

{'loss': 0.958, 'grad_norm': 0.602642834186554, 'learning_rate': 0.00010080495356037151, 'epoch': 0.5}


 50%|████▉     | 1608/3235 [28:58<29:25,  1.08s/it]

{'loss': 0.9639, 'grad_norm': 0.5404233932495117, 'learning_rate': 0.00010074303405572757, 'epoch': 0.5}


 50%|████▉     | 1609/3235 [28:59<29:17,  1.08s/it]

{'loss': 1.0073, 'grad_norm': 0.46414825320243835, 'learning_rate': 0.0001006811145510836, 'epoch': 0.5}


 50%|████▉     | 1610/3235 [29:00<26:47,  1.01it/s]

{'loss': 0.9874, 'grad_norm': 0.5167809128761292, 'learning_rate': 0.00010061919504643965, 'epoch': 0.5}


 50%|████▉     | 1611/3235 [29:01<25:28,  1.06it/s]

{'loss': 0.9652, 'grad_norm': 0.54207444190979, 'learning_rate': 0.00010055727554179569, 'epoch': 0.5}


 50%|████▉     | 1612/3235 [29:02<26:51,  1.01it/s]

{'loss': 1.0803, 'grad_norm': 0.4548029899597168, 'learning_rate': 0.0001004953560371517, 'epoch': 0.5}


 50%|████▉     | 1613/3235 [29:03<27:12,  1.01s/it]

{'loss': 1.0001, 'grad_norm': 0.4764210879802704, 'learning_rate': 0.00010043343653250774, 'epoch': 0.5}


 50%|████▉     | 1614/3235 [29:04<28:15,  1.05s/it]

{'loss': 1.002, 'grad_norm': 0.5099153518676758, 'learning_rate': 0.00010037151702786378, 'epoch': 0.5}


 50%|████▉     | 1615/3235 [29:05<27:05,  1.00s/it]

{'loss': 0.882, 'grad_norm': 0.5061657428741455, 'learning_rate': 0.00010030959752321982, 'epoch': 0.5}


 50%|████▉     | 1616/3235 [29:06<26:33,  1.02it/s]

{'loss': 0.9753, 'grad_norm': 0.5212693810462952, 'learning_rate': 0.00010024767801857586, 'epoch': 0.5}


 50%|████▉     | 1617/3235 [29:07<27:36,  1.02s/it]

{'loss': 0.9698, 'grad_norm': 0.5325403213500977, 'learning_rate': 0.0001001857585139319, 'epoch': 0.5}


 50%|█████     | 1618/3235 [29:08<28:18,  1.05s/it]

{'loss': 0.9714, 'grad_norm': 0.46864578127861023, 'learning_rate': 0.00010012383900928792, 'epoch': 0.5}


 50%|█████     | 1619/3235 [29:09<28:36,  1.06s/it]

{'loss': 0.9635, 'grad_norm': 0.46320950984954834, 'learning_rate': 0.00010006191950464396, 'epoch': 0.5}


 50%|█████     | 1620/3235 [29:10<29:26,  1.09s/it]

{'loss': 0.97, 'grad_norm': 0.5101369023323059, 'learning_rate': 0.0001, 'epoch': 0.5}


 50%|█████     | 1621/3235 [29:12<29:19,  1.09s/it]

{'loss': 1.1029, 'grad_norm': 0.5699488520622253, 'learning_rate': 9.993808049535604e-05, 'epoch': 0.5}


 50%|█████     | 1622/3235 [29:13<29:38,  1.10s/it]

{'loss': 0.982, 'grad_norm': 0.5017547011375427, 'learning_rate': 9.987616099071207e-05, 'epoch': 0.5}


 50%|█████     | 1623/3235 [29:14<31:02,  1.16s/it]

{'loss': 1.0911, 'grad_norm': 0.4362192749977112, 'learning_rate': 9.981424148606811e-05, 'epoch': 0.5}


 50%|█████     | 1624/3235 [29:15<29:19,  1.09s/it]

{'loss': 0.896, 'grad_norm': 0.5148343443870544, 'learning_rate': 9.975232198142415e-05, 'epoch': 0.5}


 50%|█████     | 1625/3235 [29:16<31:11,  1.16s/it]

{'loss': 1.1173, 'grad_norm': 0.4343252182006836, 'learning_rate': 9.969040247678019e-05, 'epoch': 0.5}


 50%|█████     | 1626/3235 [29:17<30:46,  1.15s/it]

{'loss': 0.9729, 'grad_norm': 0.4915314018726349, 'learning_rate': 9.962848297213623e-05, 'epoch': 0.5}


 50%|█████     | 1627/3235 [29:18<29:13,  1.09s/it]

{'loss': 1.0163, 'grad_norm': 0.5752735733985901, 'learning_rate': 9.956656346749227e-05, 'epoch': 0.5}


 50%|█████     | 1628/3235 [29:20<29:58,  1.12s/it]

{'loss': 1.0564, 'grad_norm': 0.42110732197761536, 'learning_rate': 9.95046439628483e-05, 'epoch': 0.5}


 50%|█████     | 1629/3235 [29:21<29:12,  1.09s/it]

{'loss': 1.1019, 'grad_norm': 0.495068222284317, 'learning_rate': 9.944272445820434e-05, 'epoch': 0.5}


 50%|█████     | 1630/3235 [29:22<29:21,  1.10s/it]

{'loss': 1.0986, 'grad_norm': 0.4855363667011261, 'learning_rate': 9.938080495356038e-05, 'epoch': 0.5}


 50%|█████     | 1631/3235 [29:23<31:10,  1.17s/it]

{'loss': 1.0543, 'grad_norm': 0.451222687959671, 'learning_rate': 9.93188854489164e-05, 'epoch': 0.5}


 50%|█████     | 1632/3235 [29:24<27:55,  1.05s/it]

{'loss': 0.9596, 'grad_norm': 0.4973883032798767, 'learning_rate': 9.925696594427244e-05, 'epoch': 0.5}


 50%|█████     | 1633/3235 [29:25<26:32,  1.01it/s]

{'loss': 1.1134, 'grad_norm': 0.5193992853164673, 'learning_rate': 9.91950464396285e-05, 'epoch': 0.5}


 51%|█████     | 1634/3235 [29:26<27:56,  1.05s/it]

{'loss': 1.0567, 'grad_norm': 0.48969942331314087, 'learning_rate': 9.913312693498452e-05, 'epoch': 0.51}


 51%|█████     | 1635/3235 [29:27<26:28,  1.01it/s]

{'loss': 0.9736, 'grad_norm': 0.5515402555465698, 'learning_rate': 9.907120743034056e-05, 'epoch': 0.51}


 51%|█████     | 1636/3235 [29:28<26:44,  1.00s/it]

{'loss': 0.9805, 'grad_norm': 0.5040329098701477, 'learning_rate': 9.90092879256966e-05, 'epoch': 0.51}


 51%|█████     | 1637/3235 [29:29<25:57,  1.03it/s]

{'loss': 0.8647, 'grad_norm': 0.5314661860466003, 'learning_rate': 9.894736842105263e-05, 'epoch': 0.51}


 51%|█████     | 1638/3235 [29:30<27:12,  1.02s/it]

{'loss': 1.0692, 'grad_norm': 0.4695720374584198, 'learning_rate': 9.888544891640867e-05, 'epoch': 0.51}


 51%|█████     | 1639/3235 [29:31<28:34,  1.07s/it]

{'loss': 1.1097, 'grad_norm': 0.46942174434661865, 'learning_rate': 9.882352941176471e-05, 'epoch': 0.51}


 51%|█████     | 1640/3235 [29:32<28:44,  1.08s/it]

{'loss': 0.9664, 'grad_norm': 0.5040940642356873, 'learning_rate': 9.876160990712075e-05, 'epoch': 0.51}


 51%|█████     | 1641/3235 [29:33<29:03,  1.09s/it]

{'loss': 0.9629, 'grad_norm': 0.4549126923084259, 'learning_rate': 9.869969040247679e-05, 'epoch': 0.51}


 51%|█████     | 1642/3235 [29:34<26:41,  1.01s/it]

{'loss': 0.8515, 'grad_norm': 0.5150634050369263, 'learning_rate': 9.863777089783283e-05, 'epoch': 0.51}


 51%|█████     | 1643/3235 [29:35<27:52,  1.05s/it]

{'loss': 1.0577, 'grad_norm': 0.4872148931026459, 'learning_rate': 9.857585139318886e-05, 'epoch': 0.51}


 51%|█████     | 1644/3235 [29:36<28:45,  1.08s/it]

{'loss': 1.034, 'grad_norm': 0.5061047673225403, 'learning_rate': 9.85139318885449e-05, 'epoch': 0.51}


 51%|█████     | 1645/3235 [29:37<28:35,  1.08s/it]

{'loss': 1.0041, 'grad_norm': 0.45897290110588074, 'learning_rate': 9.845201238390094e-05, 'epoch': 0.51}


 51%|█████     | 1646/3235 [29:38<28:58,  1.09s/it]

{'loss': 0.9169, 'grad_norm': 0.4967002272605896, 'learning_rate': 9.839009287925696e-05, 'epoch': 0.51}


 51%|█████     | 1647/3235 [29:39<28:04,  1.06s/it]

{'loss': 0.9336, 'grad_norm': 0.4947400689125061, 'learning_rate': 9.8328173374613e-05, 'epoch': 0.51}


 51%|█████     | 1648/3235 [29:41<29:24,  1.11s/it]

{'loss': 1.0269, 'grad_norm': 0.4516045153141022, 'learning_rate': 9.826625386996904e-05, 'epoch': 0.51}


 51%|█████     | 1649/3235 [29:42<29:51,  1.13s/it]

{'loss': 1.1218, 'grad_norm': 0.4661218225955963, 'learning_rate': 9.820433436532508e-05, 'epoch': 0.51}


 51%|█████     | 1650/3235 [29:43<29:08,  1.10s/it]

{'loss': 0.9589, 'grad_norm': 0.4547477960586548, 'learning_rate': 9.814241486068112e-05, 'epoch': 0.51}


 51%|█████     | 1651/3235 [29:44<28:23,  1.08s/it]

{'loss': 0.8908, 'grad_norm': 0.49123966693878174, 'learning_rate': 9.808049535603716e-05, 'epoch': 0.51}


 51%|█████     | 1652/3235 [29:45<28:16,  1.07s/it]

{'loss': 1.014, 'grad_norm': 0.46972399950027466, 'learning_rate': 9.801857585139319e-05, 'epoch': 0.51}


 51%|█████     | 1653/3235 [29:46<28:36,  1.09s/it]

{'loss': 1.0554, 'grad_norm': 0.4670642018318176, 'learning_rate': 9.795665634674923e-05, 'epoch': 0.51}


 51%|█████     | 1654/3235 [29:47<29:01,  1.10s/it]

{'loss': 1.0175, 'grad_norm': 0.42649292945861816, 'learning_rate': 9.789473684210527e-05, 'epoch': 0.51}


 51%|█████     | 1655/3235 [29:48<27:57,  1.06s/it]

{'loss': 0.9884, 'grad_norm': 0.5066139698028564, 'learning_rate': 9.78328173374613e-05, 'epoch': 0.51}


 51%|█████     | 1656/3235 [29:49<29:12,  1.11s/it]

{'loss': 1.1187, 'grad_norm': 0.5102474689483643, 'learning_rate': 9.777089783281734e-05, 'epoch': 0.51}


 51%|█████     | 1657/3235 [29:51<29:30,  1.12s/it]

{'loss': 1.063, 'grad_norm': 0.519985020160675, 'learning_rate': 9.770897832817339e-05, 'epoch': 0.51}


 51%|█████▏    | 1658/3235 [29:52<30:24,  1.16s/it]

{'loss': 0.9871, 'grad_norm': 0.5269684195518494, 'learning_rate': 9.764705882352942e-05, 'epoch': 0.51}


 51%|█████▏    | 1659/3235 [29:53<28:08,  1.07s/it]

{'loss': 1.03, 'grad_norm': 0.5105963945388794, 'learning_rate': 9.758513931888546e-05, 'epoch': 0.51}


 51%|█████▏    | 1660/3235 [29:54<26:56,  1.03s/it]

{'loss': 0.8774, 'grad_norm': 0.4972764253616333, 'learning_rate': 9.75232198142415e-05, 'epoch': 0.51}


 51%|█████▏    | 1661/3235 [29:55<27:51,  1.06s/it]

{'loss': 1.0311, 'grad_norm': 0.46696579456329346, 'learning_rate': 9.746130030959752e-05, 'epoch': 0.51}


 51%|█████▏    | 1662/3235 [29:55<25:28,  1.03it/s]

{'loss': 1.0552, 'grad_norm': 0.57465660572052, 'learning_rate': 9.739938080495356e-05, 'epoch': 0.51}


 51%|█████▏    | 1663/3235 [29:57<26:37,  1.02s/it]

{'loss': 1.1812, 'grad_norm': 0.4831675589084625, 'learning_rate': 9.73374613003096e-05, 'epoch': 0.51}


 51%|█████▏    | 1664/3235 [29:58<26:37,  1.02s/it]

{'loss': 0.9652, 'grad_norm': 0.48310866951942444, 'learning_rate': 9.727554179566564e-05, 'epoch': 0.51}


 51%|█████▏    | 1665/3235 [29:59<27:08,  1.04s/it]

{'loss': 0.9329, 'grad_norm': 0.47816362977027893, 'learning_rate': 9.721362229102168e-05, 'epoch': 0.51}


 51%|█████▏    | 1666/3235 [30:00<26:15,  1.00s/it]

{'loss': 1.0029, 'grad_norm': 0.5049270391464233, 'learning_rate': 9.715170278637772e-05, 'epoch': 0.51}


 52%|█████▏    | 1667/3235 [30:01<26:20,  1.01s/it]

{'loss': 1.2146, 'grad_norm': 0.49234819412231445, 'learning_rate': 9.708978328173375e-05, 'epoch': 0.52}


 52%|█████▏    | 1668/3235 [30:02<26:51,  1.03s/it]

{'loss': 1.0792, 'grad_norm': 0.5197743773460388, 'learning_rate': 9.702786377708979e-05, 'epoch': 0.52}


 52%|█████▏    | 1669/3235 [30:03<28:09,  1.08s/it]

{'loss': 1.0493, 'grad_norm': 0.43041345477104187, 'learning_rate': 9.696594427244583e-05, 'epoch': 0.52}


 52%|█████▏    | 1670/3235 [30:04<29:18,  1.12s/it]

{'loss': 1.161, 'grad_norm': 0.4832325279712677, 'learning_rate': 9.690402476780186e-05, 'epoch': 0.52}


 52%|█████▏    | 1671/3235 [30:05<28:40,  1.10s/it]

{'loss': 0.986, 'grad_norm': 0.4458174407482147, 'learning_rate': 9.68421052631579e-05, 'epoch': 0.52}


 52%|█████▏    | 1672/3235 [30:06<27:25,  1.05s/it]

{'loss': 1.081, 'grad_norm': 0.5016168355941772, 'learning_rate': 9.678018575851394e-05, 'epoch': 0.52}


 52%|█████▏    | 1673/3235 [30:08<31:03,  1.19s/it]

{'loss': 0.9822, 'grad_norm': 0.397258996963501, 'learning_rate': 9.671826625386998e-05, 'epoch': 0.52}


 52%|█████▏    | 1674/3235 [30:09<32:43,  1.26s/it]

{'loss': 0.9873, 'grad_norm': 0.4128391146659851, 'learning_rate': 9.665634674922602e-05, 'epoch': 0.52}


 52%|█████▏    | 1675/3235 [30:10<31:32,  1.21s/it]

{'loss': 1.0465, 'grad_norm': 0.45618703961372375, 'learning_rate': 9.659442724458206e-05, 'epoch': 0.52}


 52%|█████▏    | 1676/3235 [30:11<31:43,  1.22s/it]

{'loss': 1.032, 'grad_norm': 0.4396643042564392, 'learning_rate': 9.653250773993808e-05, 'epoch': 0.52}


 52%|█████▏    | 1677/3235 [30:12<28:16,  1.09s/it]

{'loss': 1.0234, 'grad_norm': 0.5361607074737549, 'learning_rate': 9.647058823529412e-05, 'epoch': 0.52}


 52%|█████▏    | 1678/3235 [30:13<27:10,  1.05s/it]

{'loss': 1.0502, 'grad_norm': 0.4826086461544037, 'learning_rate': 9.640866873065016e-05, 'epoch': 0.52}


 52%|█████▏    | 1679/3235 [30:14<27:20,  1.05s/it]

{'loss': 0.8736, 'grad_norm': 0.4483848512172699, 'learning_rate': 9.634674922600619e-05, 'epoch': 0.52}


 52%|█████▏    | 1680/3235 [30:15<25:47,  1.00it/s]

{'loss': 1.0123, 'grad_norm': 0.5046988725662231, 'learning_rate': 9.628482972136223e-05, 'epoch': 0.52}


 52%|█████▏    | 1681/3235 [30:16<25:54,  1.00s/it]

{'loss': 0.9774, 'grad_norm': 0.5166234374046326, 'learning_rate': 9.622291021671828e-05, 'epoch': 0.52}


 52%|█████▏    | 1682/3235 [30:17<26:23,  1.02s/it]

{'loss': 1.0409, 'grad_norm': 0.49455881118774414, 'learning_rate': 9.616099071207431e-05, 'epoch': 0.52}


 52%|█████▏    | 1683/3235 [30:18<27:48,  1.07s/it]

{'loss': 0.9332, 'grad_norm': 0.46962490677833557, 'learning_rate': 9.609907120743035e-05, 'epoch': 0.52}


 52%|█████▏    | 1684/3235 [30:20<28:25,  1.10s/it]

{'loss': 1.0921, 'grad_norm': 0.506592333316803, 'learning_rate': 9.603715170278639e-05, 'epoch': 0.52}


 52%|█████▏    | 1685/3235 [30:21<29:42,  1.15s/it]

{'loss': 0.9338, 'grad_norm': 0.49164777994155884, 'learning_rate': 9.597523219814242e-05, 'epoch': 0.52}


 52%|█████▏    | 1686/3235 [30:22<28:46,  1.11s/it]

{'loss': 1.028, 'grad_norm': 0.4795977473258972, 'learning_rate': 9.591331269349846e-05, 'epoch': 0.52}


 52%|█████▏    | 1687/3235 [30:23<29:42,  1.15s/it]

{'loss': 1.0209, 'grad_norm': 0.4330868422985077, 'learning_rate': 9.58513931888545e-05, 'epoch': 0.52}


 52%|█████▏    | 1688/3235 [30:24<29:15,  1.13s/it]

{'loss': 1.1381, 'grad_norm': 0.5146920084953308, 'learning_rate': 9.578947368421052e-05, 'epoch': 0.52}


 52%|█████▏    | 1689/3235 [30:25<30:24,  1.18s/it]

{'loss': 0.8851, 'grad_norm': 0.3932933807373047, 'learning_rate': 9.572755417956658e-05, 'epoch': 0.52}


 52%|█████▏    | 1690/3235 [30:26<29:19,  1.14s/it]

{'loss': 1.0765, 'grad_norm': 0.5211792588233948, 'learning_rate': 9.566563467492262e-05, 'epoch': 0.52}


 52%|█████▏    | 1691/3235 [30:27<28:07,  1.09s/it]

{'loss': 1.0782, 'grad_norm': 0.487120658159256, 'learning_rate': 9.560371517027864e-05, 'epoch': 0.52}


 52%|█████▏    | 1692/3235 [30:28<27:15,  1.06s/it]

{'loss': 0.9688, 'grad_norm': 0.47547802329063416, 'learning_rate': 9.554179566563468e-05, 'epoch': 0.52}


 52%|█████▏    | 1693/3235 [30:30<29:06,  1.13s/it]

{'loss': 1.0945, 'grad_norm': 0.4427221417427063, 'learning_rate': 9.547987616099072e-05, 'epoch': 0.52}


 52%|█████▏    | 1694/3235 [30:31<28:34,  1.11s/it]

{'loss': 1.1569, 'grad_norm': 0.47838371992111206, 'learning_rate': 9.541795665634675e-05, 'epoch': 0.52}


 52%|█████▏    | 1695/3235 [30:32<26:33,  1.03s/it]

{'loss': 0.9919, 'grad_norm': 0.49520012736320496, 'learning_rate': 9.535603715170279e-05, 'epoch': 0.52}


 52%|█████▏    | 1696/3235 [30:33<27:28,  1.07s/it]

{'loss': 1.0476, 'grad_norm': 0.47100526094436646, 'learning_rate': 9.529411764705883e-05, 'epoch': 0.52}


 52%|█████▏    | 1697/3235 [30:34<26:17,  1.03s/it]

{'loss': 1.0626, 'grad_norm': 0.5326591730117798, 'learning_rate': 9.523219814241487e-05, 'epoch': 0.52}


 52%|█████▏    | 1698/3235 [30:35<26:01,  1.02s/it]

{'loss': 1.1021, 'grad_norm': 0.4858030676841736, 'learning_rate': 9.517027863777091e-05, 'epoch': 0.52}


 53%|█████▎    | 1699/3235 [30:36<25:42,  1.00s/it]

{'loss': 0.9871, 'grad_norm': 0.464283287525177, 'learning_rate': 9.510835913312694e-05, 'epoch': 0.53}


 53%|█████▎    | 1700/3235 [30:37<25:20,  1.01it/s]

{'loss': 0.9372, 'grad_norm': 0.4507381319999695, 'learning_rate': 9.504643962848298e-05, 'epoch': 0.53}


 53%|█████▎    | 1701/3235 [30:38<27:02,  1.06s/it]

{'loss': 0.9557, 'grad_norm': 0.4713885486125946, 'learning_rate': 9.498452012383902e-05, 'epoch': 0.53}


 53%|█████▎    | 1702/3235 [30:39<27:28,  1.08s/it]

{'loss': 1.0538, 'grad_norm': 0.43431979417800903, 'learning_rate': 9.492260061919504e-05, 'epoch': 0.53}


 53%|█████▎    | 1703/3235 [30:40<27:21,  1.07s/it]

{'loss': 0.8833, 'grad_norm': 0.4574432373046875, 'learning_rate': 9.486068111455108e-05, 'epoch': 0.53}


 53%|█████▎    | 1704/3235 [30:41<29:40,  1.16s/it]

{'loss': 1.0179, 'grad_norm': 0.4478190541267395, 'learning_rate': 9.479876160990712e-05, 'epoch': 0.53}


 53%|█████▎    | 1705/3235 [30:43<29:35,  1.16s/it]

{'loss': 0.9019, 'grad_norm': 0.42935436964035034, 'learning_rate': 9.473684210526316e-05, 'epoch': 0.53}


 53%|█████▎    | 1706/3235 [30:44<30:35,  1.20s/it]

{'loss': 1.0429, 'grad_norm': 0.42509254813194275, 'learning_rate': 9.46749226006192e-05, 'epoch': 0.53}


 53%|█████▎    | 1707/3235 [30:45<30:47,  1.21s/it]

{'loss': 0.8481, 'grad_norm': 0.49295297265052795, 'learning_rate': 9.461300309597524e-05, 'epoch': 0.53}


 53%|█████▎    | 1708/3235 [30:46<28:44,  1.13s/it]

{'loss': 0.9605, 'grad_norm': 0.5204502940177917, 'learning_rate': 9.455108359133127e-05, 'epoch': 0.53}


 53%|█████▎    | 1709/3235 [30:47<27:34,  1.08s/it]

{'loss': 0.9367, 'grad_norm': 0.4631851315498352, 'learning_rate': 9.448916408668731e-05, 'epoch': 0.53}


 53%|█████▎    | 1710/3235 [30:48<28:29,  1.12s/it]

{'loss': 1.0677, 'grad_norm': 0.4436572194099426, 'learning_rate': 9.442724458204335e-05, 'epoch': 0.53}


 53%|█████▎    | 1711/3235 [30:49<27:24,  1.08s/it]

{'loss': 1.0989, 'grad_norm': 0.48358839750289917, 'learning_rate': 9.436532507739937e-05, 'epoch': 0.53}


 53%|█████▎    | 1712/3235 [30:50<27:48,  1.10s/it]

{'loss': 1.0331, 'grad_norm': 0.4563637673854828, 'learning_rate': 9.430340557275541e-05, 'epoch': 0.53}


 53%|█████▎    | 1713/3235 [30:51<28:04,  1.11s/it]

{'loss': 1.0991, 'grad_norm': 0.5097723603248596, 'learning_rate': 9.424148606811147e-05, 'epoch': 0.53}


 53%|█████▎    | 1714/3235 [30:53<27:31,  1.09s/it]

{'loss': 1.0289, 'grad_norm': 0.5004231929779053, 'learning_rate': 9.41795665634675e-05, 'epoch': 0.53}


 53%|█████▎    | 1715/3235 [30:53<25:31,  1.01s/it]

{'loss': 0.7759, 'grad_norm': 0.5010817050933838, 'learning_rate': 9.411764705882353e-05, 'epoch': 0.53}


 53%|█████▎    | 1716/3235 [30:54<25:57,  1.03s/it]

{'loss': 0.9829, 'grad_norm': 0.4668424129486084, 'learning_rate': 9.405572755417957e-05, 'epoch': 0.53}


 53%|█████▎    | 1717/3235 [30:55<25:54,  1.02s/it]

{'loss': 0.913, 'grad_norm': 0.4610929787158966, 'learning_rate': 9.39938080495356e-05, 'epoch': 0.53}


 53%|█████▎    | 1718/3235 [30:57<27:35,  1.09s/it]

{'loss': 0.9946, 'grad_norm': 0.4551910161972046, 'learning_rate': 9.393188854489164e-05, 'epoch': 0.53}


 53%|█████▎    | 1719/3235 [30:58<26:15,  1.04s/it]

{'loss': 1.0286, 'grad_norm': 0.49396052956581116, 'learning_rate': 9.386996904024768e-05, 'epoch': 0.53}


 53%|█████▎    | 1720/3235 [30:59<27:12,  1.08s/it]

{'loss': 0.9904, 'grad_norm': 0.44721072912216187, 'learning_rate': 9.380804953560372e-05, 'epoch': 0.53}


 53%|█████▎    | 1721/3235 [31:00<27:16,  1.08s/it]

{'loss': 1.0149, 'grad_norm': 0.46472370624542236, 'learning_rate': 9.374613003095976e-05, 'epoch': 0.53}


 53%|█████▎    | 1722/3235 [31:01<28:26,  1.13s/it]

{'loss': 0.9526, 'grad_norm': 0.42517712712287903, 'learning_rate': 9.36842105263158e-05, 'epoch': 0.53}


 53%|█████▎    | 1723/3235 [31:03<30:57,  1.23s/it]

{'loss': 1.0373, 'grad_norm': 0.4468097686767578, 'learning_rate': 9.362229102167183e-05, 'epoch': 0.53}


 53%|█████▎    | 1724/3235 [31:04<30:35,  1.21s/it]

{'loss': 0.983, 'grad_norm': 0.4859215021133423, 'learning_rate': 9.356037151702787e-05, 'epoch': 0.53}


 53%|█████▎    | 1725/3235 [31:05<30:17,  1.20s/it]

{'loss': 1.1661, 'grad_norm': 0.47419485449790955, 'learning_rate': 9.349845201238391e-05, 'epoch': 0.53}


 53%|█████▎    | 1726/3235 [31:06<29:08,  1.16s/it]

{'loss': 0.9821, 'grad_norm': 0.5121973752975464, 'learning_rate': 9.343653250773993e-05, 'epoch': 0.53}


 53%|█████▎    | 1727/3235 [31:07<29:22,  1.17s/it]

{'loss': 1.1069, 'grad_norm': 0.47597500681877136, 'learning_rate': 9.337461300309597e-05, 'epoch': 0.53}


 53%|█████▎    | 1728/3235 [31:08<27:43,  1.10s/it]

{'loss': 0.989, 'grad_norm': 0.48613762855529785, 'learning_rate': 9.331269349845201e-05, 'epoch': 0.53}


 53%|█████▎    | 1729/3235 [31:09<27:13,  1.08s/it]

{'loss': 1.0898, 'grad_norm': 0.5313786268234253, 'learning_rate': 9.325077399380805e-05, 'epoch': 0.53}


 53%|█████▎    | 1730/3235 [31:10<23:52,  1.05it/s]

{'loss': 0.8685, 'grad_norm': 0.6117193698883057, 'learning_rate': 9.31888544891641e-05, 'epoch': 0.53}


 54%|█████▎    | 1731/3235 [31:11<24:03,  1.04it/s]

{'loss': 0.847, 'grad_norm': 0.5048965215682983, 'learning_rate': 9.312693498452013e-05, 'epoch': 0.54}


 54%|█████▎    | 1732/3235 [31:12<25:44,  1.03s/it]

{'loss': 1.0076, 'grad_norm': 0.5063735246658325, 'learning_rate': 9.306501547987616e-05, 'epoch': 0.54}


 54%|█████▎    | 1733/3235 [31:13<26:49,  1.07s/it]

{'loss': 1.0105, 'grad_norm': 0.46508753299713135, 'learning_rate': 9.30030959752322e-05, 'epoch': 0.54}


 54%|█████▎    | 1734/3235 [31:14<26:59,  1.08s/it]

{'loss': 1.1664, 'grad_norm': 0.5188907980918884, 'learning_rate': 9.294117647058824e-05, 'epoch': 0.54}


 54%|█████▎    | 1735/3235 [31:15<27:42,  1.11s/it]

{'loss': 0.9678, 'grad_norm': 0.462471067905426, 'learning_rate': 9.287925696594427e-05, 'epoch': 0.54}


 54%|█████▎    | 1736/3235 [31:16<25:36,  1.02s/it]

{'loss': 0.9918, 'grad_norm': 0.5070989727973938, 'learning_rate': 9.281733746130031e-05, 'epoch': 0.54}


 54%|█████▎    | 1737/3235 [31:17<25:38,  1.03s/it]

{'loss': 1.0051, 'grad_norm': 0.4769804775714874, 'learning_rate': 9.275541795665636e-05, 'epoch': 0.54}


 54%|█████▎    | 1738/3235 [31:18<25:33,  1.02s/it]

{'loss': 0.9791, 'grad_norm': 0.45303139090538025, 'learning_rate': 9.269349845201239e-05, 'epoch': 0.54}


 54%|█████▍    | 1739/3235 [31:19<25:43,  1.03s/it]

{'loss': 0.9959, 'grad_norm': 0.4804115295410156, 'learning_rate': 9.263157894736843e-05, 'epoch': 0.54}


 54%|█████▍    | 1740/3235 [31:20<25:48,  1.04s/it]

{'loss': 1.0447, 'grad_norm': 0.4613898992538452, 'learning_rate': 9.256965944272447e-05, 'epoch': 0.54}


 54%|█████▍    | 1741/3235 [31:22<26:41,  1.07s/it]

{'loss': 1.034, 'grad_norm': 0.4350445568561554, 'learning_rate': 9.25077399380805e-05, 'epoch': 0.54}


 54%|█████▍    | 1742/3235 [31:22<25:29,  1.02s/it]

{'loss': 1.1617, 'grad_norm': 0.4985264539718628, 'learning_rate': 9.244582043343653e-05, 'epoch': 0.54}


 54%|█████▍    | 1743/3235 [31:24<26:14,  1.06s/it]

{'loss': 1.0164, 'grad_norm': 0.44819116592407227, 'learning_rate': 9.238390092879257e-05, 'epoch': 0.54}


 54%|█████▍    | 1744/3235 [31:25<26:00,  1.05s/it]

{'loss': 1.0699, 'grad_norm': 0.5312239527702332, 'learning_rate': 9.232198142414861e-05, 'epoch': 0.54}


 54%|█████▍    | 1745/3235 [31:26<29:02,  1.17s/it]

{'loss': 1.1253, 'grad_norm': 0.40852269530296326, 'learning_rate': 9.226006191950465e-05, 'epoch': 0.54}


 54%|█████▍    | 1746/3235 [31:27<28:17,  1.14s/it]

{'loss': 0.8532, 'grad_norm': 0.48526301980018616, 'learning_rate': 9.21981424148607e-05, 'epoch': 0.54}


 54%|█████▍    | 1747/3235 [31:28<27:57,  1.13s/it]

{'loss': 0.9564, 'grad_norm': 0.43999624252319336, 'learning_rate': 9.213622291021672e-05, 'epoch': 0.54}


 54%|█████▍    | 1748/3235 [31:30<29:51,  1.20s/it]

{'loss': 1.1658, 'grad_norm': 0.4670984148979187, 'learning_rate': 9.207430340557276e-05, 'epoch': 0.54}


 54%|█████▍    | 1749/3235 [31:31<30:52,  1.25s/it]

{'loss': 1.0424, 'grad_norm': 0.44372302293777466, 'learning_rate': 9.20123839009288e-05, 'epoch': 0.54}


 54%|█████▍    | 1750/3235 [31:32<29:21,  1.19s/it]

{'loss': 1.0251, 'grad_norm': 0.48088908195495605, 'learning_rate': 9.195046439628483e-05, 'epoch': 0.54}


 54%|█████▍    | 1751/3235 [31:33<28:26,  1.15s/it]

{'loss': 0.9944, 'grad_norm': 0.4544235169887543, 'learning_rate': 9.188854489164087e-05, 'epoch': 0.54}


 54%|█████▍    | 1752/3235 [31:34<27:13,  1.10s/it]

{'loss': 0.9103, 'grad_norm': 0.4832247197628021, 'learning_rate': 9.18266253869969e-05, 'epoch': 0.54}


 54%|█████▍    | 1753/3235 [31:35<27:27,  1.11s/it]

{'loss': 1.0306, 'grad_norm': 0.4326912760734558, 'learning_rate': 9.176470588235295e-05, 'epoch': 0.54}


 54%|█████▍    | 1754/3235 [31:36<28:39,  1.16s/it]

{'loss': 1.051, 'grad_norm': 0.46904465556144714, 'learning_rate': 9.170278637770899e-05, 'epoch': 0.54}


 54%|█████▍    | 1755/3235 [31:37<25:15,  1.02s/it]

{'loss': 0.9141, 'grad_norm': 0.5662012100219727, 'learning_rate': 9.164086687306503e-05, 'epoch': 0.54}


 54%|█████▍    | 1756/3235 [31:38<26:43,  1.08s/it]

{'loss': 1.052, 'grad_norm': 0.4211273491382599, 'learning_rate': 9.157894736842105e-05, 'epoch': 0.54}


 54%|█████▍    | 1757/3235 [31:39<25:15,  1.03s/it]

{'loss': 1.1259, 'grad_norm': 0.5485184788703918, 'learning_rate': 9.151702786377709e-05, 'epoch': 0.54}


 54%|█████▍    | 1758/3235 [31:40<26:22,  1.07s/it]

{'loss': 1.2033, 'grad_norm': 0.5200711488723755, 'learning_rate': 9.145510835913313e-05, 'epoch': 0.54}


 54%|█████▍    | 1759/3235 [31:42<27:02,  1.10s/it]

{'loss': 1.0949, 'grad_norm': 0.4449065327644348, 'learning_rate': 9.139318885448916e-05, 'epoch': 0.54}


 54%|█████▍    | 1760/3235 [31:43<26:20,  1.07s/it]

{'loss': 1.1656, 'grad_norm': 0.5291489362716675, 'learning_rate': 9.13312693498452e-05, 'epoch': 0.54}


 54%|█████▍    | 1761/3235 [31:44<26:51,  1.09s/it]

{'loss': 1.171, 'grad_norm': 0.5361735224723816, 'learning_rate': 9.126934984520124e-05, 'epoch': 0.54}


 54%|█████▍    | 1762/3235 [31:45<26:24,  1.08s/it]

{'loss': 1.0112, 'grad_norm': 0.4479098618030548, 'learning_rate': 9.120743034055728e-05, 'epoch': 0.54}


 54%|█████▍    | 1763/3235 [31:46<26:57,  1.10s/it]

{'loss': 1.1071, 'grad_norm': 0.483986496925354, 'learning_rate': 9.114551083591332e-05, 'epoch': 0.54}


 55%|█████▍    | 1764/3235 [31:47<26:23,  1.08s/it]

{'loss': 1.022, 'grad_norm': 0.478153795003891, 'learning_rate': 9.108359133126936e-05, 'epoch': 0.55}


 55%|█████▍    | 1765/3235 [31:48<25:09,  1.03s/it]

{'loss': 1.1336, 'grad_norm': 0.48703986406326294, 'learning_rate': 9.102167182662539e-05, 'epoch': 0.55}


 55%|█████▍    | 1766/3235 [31:49<24:36,  1.00s/it]

{'loss': 0.9292, 'grad_norm': 0.5561316609382629, 'learning_rate': 9.095975232198143e-05, 'epoch': 0.55}


 55%|█████▍    | 1767/3235 [31:50<24:48,  1.01s/it]

{'loss': 0.9096, 'grad_norm': 0.4680026173591614, 'learning_rate': 9.089783281733747e-05, 'epoch': 0.55}


 55%|█████▍    | 1768/3235 [31:51<27:44,  1.13s/it]

{'loss': 0.9417, 'grad_norm': 0.44507288932800293, 'learning_rate': 9.083591331269349e-05, 'epoch': 0.55}


 55%|█████▍    | 1769/3235 [31:52<27:58,  1.15s/it]

{'loss': 1.0546, 'grad_norm': 0.48797836899757385, 'learning_rate': 9.077399380804955e-05, 'epoch': 0.55}


 55%|█████▍    | 1770/3235 [31:54<28:51,  1.18s/it]

{'loss': 1.0185, 'grad_norm': 0.4367201328277588, 'learning_rate': 9.071207430340559e-05, 'epoch': 0.55}


 55%|█████▍    | 1771/3235 [31:55<26:36,  1.09s/it]

{'loss': 0.8509, 'grad_norm': 0.5110825300216675, 'learning_rate': 9.065015479876161e-05, 'epoch': 0.55}


 55%|█████▍    | 1772/3235 [31:56<25:03,  1.03s/it]

{'loss': 1.033, 'grad_norm': 0.5394993424415588, 'learning_rate': 9.058823529411765e-05, 'epoch': 0.55}


 55%|█████▍    | 1773/3235 [31:57<25:19,  1.04s/it]

{'loss': 1.0135, 'grad_norm': 0.4684627950191498, 'learning_rate': 9.052631578947369e-05, 'epoch': 0.55}


 55%|█████▍    | 1774/3235 [31:58<25:42,  1.06s/it]

{'loss': 1.0638, 'grad_norm': 0.46928438544273376, 'learning_rate': 9.046439628482972e-05, 'epoch': 0.55}


 55%|█████▍    | 1775/3235 [31:59<27:54,  1.15s/it]

{'loss': 0.9801, 'grad_norm': 0.46027672290802, 'learning_rate': 9.040247678018576e-05, 'epoch': 0.55}


 55%|█████▍    | 1776/3235 [32:00<28:26,  1.17s/it]

{'loss': 1.0377, 'grad_norm': 0.4688592553138733, 'learning_rate': 9.03405572755418e-05, 'epoch': 0.55}


 55%|█████▍    | 1777/3235 [32:01<27:33,  1.13s/it]

{'loss': 0.9957, 'grad_norm': 0.46868017315864563, 'learning_rate': 9.027863777089784e-05, 'epoch': 0.55}


 55%|█████▍    | 1778/3235 [32:02<27:10,  1.12s/it]

{'loss': 1.0486, 'grad_norm': 0.500167191028595, 'learning_rate': 9.021671826625388e-05, 'epoch': 0.55}


 55%|█████▍    | 1779/3235 [32:03<26:19,  1.08s/it]

{'loss': 0.9519, 'grad_norm': 0.4793742895126343, 'learning_rate': 9.015479876160992e-05, 'epoch': 0.55}


 55%|█████▌    | 1780/3235 [32:05<26:34,  1.10s/it]

{'loss': 1.1193, 'grad_norm': 0.4969451427459717, 'learning_rate': 9.009287925696595e-05, 'epoch': 0.55}


 55%|█████▌    | 1781/3235 [32:06<27:15,  1.12s/it]

{'loss': 0.9386, 'grad_norm': 0.4962572753429413, 'learning_rate': 9.003095975232199e-05, 'epoch': 0.55}


 55%|█████▌    | 1782/3235 [32:07<28:02,  1.16s/it]

{'loss': 1.0461, 'grad_norm': 0.44962236285209656, 'learning_rate': 8.996904024767803e-05, 'epoch': 0.55}


 55%|█████▌    | 1783/3235 [32:08<27:12,  1.12s/it]

{'loss': 1.1489, 'grad_norm': 0.48817309737205505, 'learning_rate': 8.990712074303405e-05, 'epoch': 0.55}


 55%|█████▌    | 1784/3235 [32:09<27:01,  1.12s/it]

{'loss': 1.0861, 'grad_norm': 0.5107426643371582, 'learning_rate': 8.984520123839009e-05, 'epoch': 0.55}


 55%|█████▌    | 1785/3235 [32:10<25:54,  1.07s/it]

{'loss': 0.9875, 'grad_norm': 0.4670444428920746, 'learning_rate': 8.978328173374613e-05, 'epoch': 0.55}


 55%|█████▌    | 1786/3235 [32:11<27:05,  1.12s/it]

{'loss': 1.1493, 'grad_norm': 0.4758065342903137, 'learning_rate': 8.972136222910217e-05, 'epoch': 0.55}


 55%|█████▌    | 1787/3235 [32:12<25:39,  1.06s/it]

{'loss': 0.9707, 'grad_norm': 0.5281102657318115, 'learning_rate': 8.965944272445821e-05, 'epoch': 0.55}


 55%|█████▌    | 1788/3235 [32:13<25:52,  1.07s/it]

{'loss': 0.9451, 'grad_norm': 0.4572449326515198, 'learning_rate': 8.959752321981425e-05, 'epoch': 0.55}


 55%|█████▌    | 1789/3235 [32:14<26:03,  1.08s/it]

{'loss': 1.2003, 'grad_norm': 0.46987342834472656, 'learning_rate': 8.953560371517028e-05, 'epoch': 0.55}


 55%|█████▌    | 1790/3235 [32:16<26:19,  1.09s/it]

{'loss': 0.9999, 'grad_norm': 0.4802318811416626, 'learning_rate': 8.947368421052632e-05, 'epoch': 0.55}


 55%|█████▌    | 1791/3235 [32:16<24:31,  1.02s/it]

{'loss': 1.1276, 'grad_norm': 0.5714967846870422, 'learning_rate': 8.941176470588236e-05, 'epoch': 0.55}


 55%|█████▌    | 1792/3235 [32:18<25:11,  1.05s/it]

{'loss': 0.9936, 'grad_norm': 0.4892345368862152, 'learning_rate': 8.934984520123839e-05, 'epoch': 0.55}


 55%|█████▌    | 1793/3235 [32:18<24:19,  1.01s/it]

{'loss': 1.0786, 'grad_norm': 0.4876062572002411, 'learning_rate': 8.928792569659444e-05, 'epoch': 0.55}


 55%|█████▌    | 1794/3235 [32:20<24:48,  1.03s/it]

{'loss': 0.9572, 'grad_norm': 0.4900304079055786, 'learning_rate': 8.922600619195048e-05, 'epoch': 0.55}


 55%|█████▌    | 1795/3235 [32:21<24:59,  1.04s/it]

{'loss': 1.0331, 'grad_norm': 0.4574369490146637, 'learning_rate': 8.91640866873065e-05, 'epoch': 0.55}


 56%|█████▌    | 1796/3235 [32:22<26:47,  1.12s/it]

{'loss': 1.0569, 'grad_norm': 0.4687355160713196, 'learning_rate': 8.910216718266255e-05, 'epoch': 0.56}


 56%|█████▌    | 1797/3235 [32:23<27:20,  1.14s/it]

{'loss': 0.9035, 'grad_norm': 0.4403097927570343, 'learning_rate': 8.904024767801859e-05, 'epoch': 0.56}


 56%|█████▌    | 1798/3235 [32:24<26:30,  1.11s/it]

{'loss': 1.042, 'grad_norm': 0.4595843255519867, 'learning_rate': 8.897832817337461e-05, 'epoch': 0.56}


 56%|█████▌    | 1799/3235 [32:25<26:31,  1.11s/it]

{'loss': 1.109, 'grad_norm': 0.46879512071609497, 'learning_rate': 8.891640866873065e-05, 'epoch': 0.56}


 56%|█████▌    | 1800/3235 [32:26<26:06,  1.09s/it]

{'loss': 0.9615, 'grad_norm': 0.49045079946517944, 'learning_rate': 8.885448916408669e-05, 'epoch': 0.56}


 56%|█████▌    | 1801/3235 [32:28<29:02,  1.22s/it]

{'loss': 1.0108, 'grad_norm': 0.4119194447994232, 'learning_rate': 8.879256965944273e-05, 'epoch': 0.56}


 56%|█████▌    | 1802/3235 [32:29<27:54,  1.17s/it]

{'loss': 1.0429, 'grad_norm': 0.48119449615478516, 'learning_rate': 8.873065015479877e-05, 'epoch': 0.56}


 56%|█████▌    | 1803/3235 [32:30<25:07,  1.05s/it]

{'loss': 1.0213, 'grad_norm': 0.5509917736053467, 'learning_rate': 8.866873065015481e-05, 'epoch': 0.56}


 56%|█████▌    | 1804/3235 [32:31<25:39,  1.08s/it]

{'loss': 1.1102, 'grad_norm': 0.5098915696144104, 'learning_rate': 8.860681114551084e-05, 'epoch': 0.56}


 56%|█████▌    | 1805/3235 [32:32<27:35,  1.16s/it]

{'loss': 1.0742, 'grad_norm': 0.4561578631401062, 'learning_rate': 8.854489164086688e-05, 'epoch': 0.56}


 56%|█████▌    | 1806/3235 [32:33<27:51,  1.17s/it]

{'loss': 0.8849, 'grad_norm': 0.4506501257419586, 'learning_rate': 8.848297213622292e-05, 'epoch': 0.56}


 56%|█████▌    | 1807/3235 [32:34<26:44,  1.12s/it]

{'loss': 0.8355, 'grad_norm': 0.4757840037345886, 'learning_rate': 8.842105263157894e-05, 'epoch': 0.56}


 56%|█████▌    | 1808/3235 [32:35<27:16,  1.15s/it]

{'loss': 0.9897, 'grad_norm': 0.451006144285202, 'learning_rate': 8.835913312693498e-05, 'epoch': 0.56}


 56%|█████▌    | 1809/3235 [32:36<26:03,  1.10s/it]

{'loss': 1.0763, 'grad_norm': 0.5367120504379272, 'learning_rate': 8.829721362229102e-05, 'epoch': 0.56}


 56%|█████▌    | 1810/3235 [32:38<28:04,  1.18s/it]

{'loss': 1.0469, 'grad_norm': 0.40568771958351135, 'learning_rate': 8.823529411764706e-05, 'epoch': 0.56}


 56%|█████▌    | 1811/3235 [32:39<27:40,  1.17s/it]

{'loss': 1.0374, 'grad_norm': 0.4845059812068939, 'learning_rate': 8.81733746130031e-05, 'epoch': 0.56}


 56%|█████▌    | 1812/3235 [32:40<26:00,  1.10s/it]

{'loss': 0.9912, 'grad_norm': 0.5338419079780579, 'learning_rate': 8.811145510835914e-05, 'epoch': 0.56}


 56%|█████▌    | 1813/3235 [32:41<23:56,  1.01s/it]

{'loss': 0.9428, 'grad_norm': 0.557546079158783, 'learning_rate': 8.804953560371517e-05, 'epoch': 0.56}


 56%|█████▌    | 1814/3235 [32:42<25:27,  1.08s/it]

{'loss': 1.1226, 'grad_norm': 0.462493896484375, 'learning_rate': 8.798761609907121e-05, 'epoch': 0.56}


 56%|█████▌    | 1815/3235 [32:43<26:43,  1.13s/it]

{'loss': 0.9921, 'grad_norm': 0.44813549518585205, 'learning_rate': 8.792569659442725e-05, 'epoch': 0.56}


 56%|█████▌    | 1816/3235 [32:44<26:13,  1.11s/it]

{'loss': 1.1139, 'grad_norm': 0.4715689420700073, 'learning_rate': 8.786377708978328e-05, 'epoch': 0.56}


 56%|█████▌    | 1817/3235 [32:45<27:00,  1.14s/it]

{'loss': 0.9869, 'grad_norm': 0.47339320182800293, 'learning_rate': 8.780185758513932e-05, 'epoch': 0.56}


 56%|█████▌    | 1818/3235 [32:47<26:59,  1.14s/it]

{'loss': 1.0611, 'grad_norm': 0.46260014176368713, 'learning_rate': 8.773993808049537e-05, 'epoch': 0.56}


 56%|█████▌    | 1819/3235 [32:48<26:28,  1.12s/it]

{'loss': 0.987, 'grad_norm': 0.4816419780254364, 'learning_rate': 8.76780185758514e-05, 'epoch': 0.56}


 56%|█████▋    | 1820/3235 [32:49<26:16,  1.11s/it]

{'loss': 0.9572, 'grad_norm': 0.43328091502189636, 'learning_rate': 8.761609907120744e-05, 'epoch': 0.56}


 56%|█████▋    | 1821/3235 [32:50<28:06,  1.19s/it]

{'loss': 1.147, 'grad_norm': 0.4277013838291168, 'learning_rate': 8.755417956656348e-05, 'epoch': 0.56}


 56%|█████▋    | 1822/3235 [32:51<27:12,  1.16s/it]

{'loss': 0.9223, 'grad_norm': 0.502343475818634, 'learning_rate': 8.74922600619195e-05, 'epoch': 0.56}


 56%|█████▋    | 1823/3235 [32:52<26:49,  1.14s/it]

{'loss': 0.8571, 'grad_norm': 0.5209981799125671, 'learning_rate': 8.743034055727554e-05, 'epoch': 0.56}


 56%|█████▋    | 1824/3235 [32:54<28:08,  1.20s/it]

{'loss': 1.1337, 'grad_norm': 0.5184128880500793, 'learning_rate': 8.736842105263158e-05, 'epoch': 0.56}


 56%|█████▋    | 1825/3235 [32:55<27:46,  1.18s/it]

{'loss': 0.9927, 'grad_norm': 0.46785321831703186, 'learning_rate': 8.730650154798762e-05, 'epoch': 0.56}


 56%|█████▋    | 1826/3235 [32:56<26:30,  1.13s/it]

{'loss': 1.1122, 'grad_norm': 0.5248762965202332, 'learning_rate': 8.724458204334366e-05, 'epoch': 0.56}


 56%|█████▋    | 1827/3235 [32:57<24:45,  1.05s/it]

{'loss': 0.8935, 'grad_norm': 0.5306352972984314, 'learning_rate': 8.71826625386997e-05, 'epoch': 0.56}


 57%|█████▋    | 1828/3235 [32:58<24:22,  1.04s/it]

{'loss': 0.9145, 'grad_norm': 0.4656846225261688, 'learning_rate': 8.712074303405573e-05, 'epoch': 0.57}


 57%|█████▋    | 1829/3235 [32:59<25:54,  1.11s/it]

{'loss': 1.0699, 'grad_norm': 0.4753512442111969, 'learning_rate': 8.705882352941177e-05, 'epoch': 0.57}


 57%|█████▋    | 1830/3235 [33:00<23:46,  1.01s/it]

{'loss': 0.8808, 'grad_norm': 0.50853431224823, 'learning_rate': 8.699690402476781e-05, 'epoch': 0.57}


 57%|█████▋    | 1831/3235 [33:01<22:46,  1.03it/s]

{'loss': 0.9732, 'grad_norm': 0.474931538105011, 'learning_rate': 8.693498452012384e-05, 'epoch': 0.57}


 57%|█████▋    | 1832/3235 [33:02<22:57,  1.02it/s]

{'loss': 1.0596, 'grad_norm': 0.49984011054039, 'learning_rate': 8.687306501547988e-05, 'epoch': 0.57}


 57%|█████▋    | 1833/3235 [33:02<20:30,  1.14it/s]

{'loss': 0.9497, 'grad_norm': 0.5386503338813782, 'learning_rate': 8.681114551083592e-05, 'epoch': 0.57}


 57%|█████▋    | 1834/3235 [33:03<21:01,  1.11it/s]

{'loss': 0.9122, 'grad_norm': 0.49706223607063293, 'learning_rate': 8.674922600619196e-05, 'epoch': 0.57}


 57%|█████▋    | 1835/3235 [33:04<21:47,  1.07it/s]

{'loss': 1.0169, 'grad_norm': 0.4523846209049225, 'learning_rate': 8.6687306501548e-05, 'epoch': 0.57}


 57%|█████▋    | 1836/3235 [33:05<23:40,  1.02s/it]

{'loss': 1.0776, 'grad_norm': 0.47856587171554565, 'learning_rate': 8.662538699690404e-05, 'epoch': 0.57}


 57%|█████▋    | 1837/3235 [33:07<23:50,  1.02s/it]

{'loss': 0.9809, 'grad_norm': 0.5397708415985107, 'learning_rate': 8.656346749226006e-05, 'epoch': 0.57}


 57%|█████▋    | 1838/3235 [33:07<22:51,  1.02it/s]

{'loss': 0.9234, 'grad_norm': 0.5046274065971375, 'learning_rate': 8.65015479876161e-05, 'epoch': 0.57}


 57%|█████▋    | 1839/3235 [33:09<23:43,  1.02s/it]

{'loss': 1.0586, 'grad_norm': 0.48436909914016724, 'learning_rate': 8.643962848297214e-05, 'epoch': 0.57}


 57%|█████▋    | 1840/3235 [33:09<23:03,  1.01it/s]

{'loss': 1.0128, 'grad_norm': 0.5197851657867432, 'learning_rate': 8.637770897832817e-05, 'epoch': 0.57}


 57%|█████▋    | 1841/3235 [33:11<23:44,  1.02s/it]

{'loss': 1.0028, 'grad_norm': 0.4259423017501831, 'learning_rate': 8.631578947368421e-05, 'epoch': 0.57}


 57%|█████▋    | 1842/3235 [33:11<22:45,  1.02it/s]

{'loss': 1.0122, 'grad_norm': 0.5004357099533081, 'learning_rate': 8.625386996904025e-05, 'epoch': 0.57}


 57%|█████▋    | 1843/3235 [33:13<24:19,  1.05s/it]

{'loss': 0.9927, 'grad_norm': 0.48369598388671875, 'learning_rate': 8.619195046439629e-05, 'epoch': 0.57}


 57%|█████▋    | 1844/3235 [33:14<25:44,  1.11s/it]

{'loss': 1.07, 'grad_norm': 0.4742726981639862, 'learning_rate': 8.613003095975233e-05, 'epoch': 0.57}


 57%|█████▋    | 1845/3235 [33:15<26:29,  1.14s/it]

{'loss': 1.0958, 'grad_norm': 0.4994145333766937, 'learning_rate': 8.606811145510836e-05, 'epoch': 0.57}


 57%|█████▋    | 1846/3235 [33:16<25:56,  1.12s/it]

{'loss': 0.9707, 'grad_norm': 0.5025373697280884, 'learning_rate': 8.60061919504644e-05, 'epoch': 0.57}


 57%|█████▋    | 1847/3235 [33:17<26:03,  1.13s/it]

{'loss': 1.0985, 'grad_norm': 0.444343239068985, 'learning_rate': 8.594427244582044e-05, 'epoch': 0.57}


 57%|█████▋    | 1848/3235 [33:18<24:33,  1.06s/it]

{'loss': 0.9264, 'grad_norm': 0.4850146472454071, 'learning_rate': 8.588235294117646e-05, 'epoch': 0.57}


 57%|█████▋    | 1849/3235 [33:19<22:48,  1.01it/s]

{'loss': 1.101, 'grad_norm': 0.5231844782829285, 'learning_rate': 8.582043343653252e-05, 'epoch': 0.57}


 57%|█████▋    | 1850/3235 [33:20<25:09,  1.09s/it]

{'loss': 1.0914, 'grad_norm': 0.4379085898399353, 'learning_rate': 8.575851393188856e-05, 'epoch': 0.57}


 57%|█████▋    | 1851/3235 [33:21<24:25,  1.06s/it]

{'loss': 0.9632, 'grad_norm': 0.5115854144096375, 'learning_rate': 8.569659442724458e-05, 'epoch': 0.57}


 57%|█████▋    | 1852/3235 [33:22<24:50,  1.08s/it]

{'loss': 0.9676, 'grad_norm': 0.4834579825401306, 'learning_rate': 8.563467492260062e-05, 'epoch': 0.57}


 57%|█████▋    | 1853/3235 [33:24<24:35,  1.07s/it]

{'loss': 0.9248, 'grad_norm': 0.48445454239845276, 'learning_rate': 8.557275541795666e-05, 'epoch': 0.57}


 57%|█████▋    | 1854/3235 [33:25<24:16,  1.05s/it]

{'loss': 1.0272, 'grad_norm': 0.4938993453979492, 'learning_rate': 8.551083591331269e-05, 'epoch': 0.57}


 57%|█████▋    | 1855/3235 [33:26<24:24,  1.06s/it]

{'loss': 1.0261, 'grad_norm': 0.4459044337272644, 'learning_rate': 8.544891640866873e-05, 'epoch': 0.57}


 57%|█████▋    | 1856/3235 [33:27<23:38,  1.03s/it]

{'loss': 1.0115, 'grad_norm': 0.5150670409202576, 'learning_rate': 8.538699690402477e-05, 'epoch': 0.57}


 57%|█████▋    | 1857/3235 [33:28<23:59,  1.04s/it]

{'loss': 1.0544, 'grad_norm': 0.4965774416923523, 'learning_rate': 8.532507739938081e-05, 'epoch': 0.57}


 57%|█████▋    | 1858/3235 [33:29<24:54,  1.09s/it]

{'loss': 1.0864, 'grad_norm': 0.5188137888908386, 'learning_rate': 8.526315789473685e-05, 'epoch': 0.57}


 57%|█████▋    | 1859/3235 [33:30<25:08,  1.10s/it]

{'loss': 0.9268, 'grad_norm': 0.42539313435554504, 'learning_rate': 8.520123839009289e-05, 'epoch': 0.57}


 57%|█████▋    | 1860/3235 [33:31<25:56,  1.13s/it]

{'loss': 0.971, 'grad_norm': 0.44456249475479126, 'learning_rate': 8.513931888544892e-05, 'epoch': 0.57}


 58%|█████▊    | 1861/3235 [33:32<25:57,  1.13s/it]

{'loss': 1.015, 'grad_norm': 0.47245654463768005, 'learning_rate': 8.507739938080496e-05, 'epoch': 0.58}


 58%|█████▊    | 1862/3235 [33:33<23:45,  1.04s/it]

{'loss': 0.9861, 'grad_norm': 0.4944397807121277, 'learning_rate': 8.5015479876161e-05, 'epoch': 0.58}


 58%|█████▊    | 1863/3235 [33:34<23:55,  1.05s/it]

{'loss': 1.0032, 'grad_norm': 0.4766109585762024, 'learning_rate': 8.495356037151702e-05, 'epoch': 0.58}


 58%|█████▊    | 1864/3235 [33:35<23:22,  1.02s/it]

{'loss': 0.8356, 'grad_norm': 0.4804091453552246, 'learning_rate': 8.489164086687306e-05, 'epoch': 0.58}


 58%|█████▊    | 1865/3235 [33:36<25:20,  1.11s/it]

{'loss': 1.1766, 'grad_norm': 0.48234522342681885, 'learning_rate': 8.48297213622291e-05, 'epoch': 0.58}


 58%|█████▊    | 1866/3235 [33:37<24:30,  1.07s/it]

{'loss': 1.0096, 'grad_norm': 0.5033060312271118, 'learning_rate': 8.476780185758514e-05, 'epoch': 0.58}


 58%|█████▊    | 1867/3235 [33:39<25:00,  1.10s/it]

{'loss': 1.0544, 'grad_norm': 0.4580341577529907, 'learning_rate': 8.470588235294118e-05, 'epoch': 0.58}


 58%|█████▊    | 1868/3235 [33:40<25:09,  1.10s/it]

{'loss': 1.0081, 'grad_norm': 0.5090609192848206, 'learning_rate': 8.464396284829722e-05, 'epoch': 0.58}


 58%|█████▊    | 1869/3235 [33:41<25:18,  1.11s/it]

{'loss': 0.9052, 'grad_norm': 0.43090111017227173, 'learning_rate': 8.458204334365325e-05, 'epoch': 0.58}


 58%|█████▊    | 1870/3235 [33:42<24:17,  1.07s/it]

{'loss': 0.9982, 'grad_norm': 0.5207095742225647, 'learning_rate': 8.452012383900929e-05, 'epoch': 0.58}


 58%|█████▊    | 1871/3235 [33:43<24:18,  1.07s/it]

{'loss': 1.0638, 'grad_norm': 0.46369290351867676, 'learning_rate': 8.445820433436533e-05, 'epoch': 0.58}


 58%|█████▊    | 1872/3235 [33:44<22:04,  1.03it/s]

{'loss': 1.1582, 'grad_norm': 0.5356533527374268, 'learning_rate': 8.439628482972136e-05, 'epoch': 0.58}


 58%|█████▊    | 1873/3235 [33:45<21:25,  1.06it/s]

{'loss': 0.813, 'grad_norm': 0.4824233055114746, 'learning_rate': 8.433436532507741e-05, 'epoch': 0.58}


 58%|█████▊    | 1874/3235 [33:46<21:52,  1.04it/s]

{'loss': 1.02, 'grad_norm': 0.469625324010849, 'learning_rate': 8.427244582043345e-05, 'epoch': 0.58}


 58%|█████▊    | 1875/3235 [33:46<21:42,  1.04it/s]

{'loss': 0.979, 'grad_norm': 0.4818462133407593, 'learning_rate': 8.421052631578948e-05, 'epoch': 0.58}


 58%|█████▊    | 1876/3235 [33:47<20:47,  1.09it/s]

{'loss': 0.9988, 'grad_norm': 0.5435736179351807, 'learning_rate': 8.414860681114552e-05, 'epoch': 0.58}


 58%|█████▊    | 1877/3235 [33:49<23:01,  1.02s/it]

{'loss': 1.1142, 'grad_norm': 0.453614205121994, 'learning_rate': 8.408668730650156e-05, 'epoch': 0.58}


 58%|█████▊    | 1878/3235 [33:50<22:47,  1.01s/it]

{'loss': 1.0225, 'grad_norm': 0.4508006274700165, 'learning_rate': 8.402476780185758e-05, 'epoch': 0.58}


 58%|█████▊    | 1879/3235 [33:51<23:58,  1.06s/it]

{'loss': 1.0092, 'grad_norm': 0.4306773543357849, 'learning_rate': 8.396284829721362e-05, 'epoch': 0.58}


 58%|█████▊    | 1880/3235 [33:52<24:42,  1.09s/it]

{'loss': 0.8983, 'grad_norm': 0.46508246660232544, 'learning_rate': 8.390092879256966e-05, 'epoch': 0.58}


 58%|█████▊    | 1881/3235 [33:53<22:59,  1.02s/it]

{'loss': 0.9725, 'grad_norm': 0.5150343775749207, 'learning_rate': 8.38390092879257e-05, 'epoch': 0.58}


 58%|█████▊    | 1882/3235 [33:54<23:36,  1.05s/it]

{'loss': 1.1084, 'grad_norm': 0.4959714412689209, 'learning_rate': 8.377708978328174e-05, 'epoch': 0.58}


 58%|█████▊    | 1883/3235 [33:55<23:29,  1.04s/it]

{'loss': 0.9321, 'grad_norm': 0.47397464513778687, 'learning_rate': 8.371517027863778e-05, 'epoch': 0.58}


 58%|█████▊    | 1884/3235 [33:56<23:18,  1.04s/it]

{'loss': 1.0488, 'grad_norm': 0.5029103755950928, 'learning_rate': 8.365325077399381e-05, 'epoch': 0.58}


 58%|█████▊    | 1885/3235 [33:57<23:12,  1.03s/it]

{'loss': 1.185, 'grad_norm': 0.5031077861785889, 'learning_rate': 8.359133126934985e-05, 'epoch': 0.58}


 58%|█████▊    | 1886/3235 [33:58<23:04,  1.03s/it]

{'loss': 0.9458, 'grad_norm': 0.48543769121170044, 'learning_rate': 8.352941176470589e-05, 'epoch': 0.58}


 58%|█████▊    | 1887/3235 [33:59<24:04,  1.07s/it]

{'loss': 1.075, 'grad_norm': 0.44492197036743164, 'learning_rate': 8.346749226006192e-05, 'epoch': 0.58}


 58%|█████▊    | 1888/3235 [34:00<24:09,  1.08s/it]

{'loss': 1.0476, 'grad_norm': 0.5026553273200989, 'learning_rate': 8.340557275541796e-05, 'epoch': 0.58}


 58%|█████▊    | 1889/3235 [34:01<22:55,  1.02s/it]

{'loss': 0.9511, 'grad_norm': 0.5106830596923828, 'learning_rate': 8.3343653250774e-05, 'epoch': 0.58}


 58%|█████▊    | 1890/3235 [34:02<21:42,  1.03it/s]

{'loss': 1.0454, 'grad_norm': 0.5737548470497131, 'learning_rate': 8.328173374613004e-05, 'epoch': 0.58}


 58%|█████▊    | 1891/3235 [34:03<22:52,  1.02s/it]

{'loss': 1.003, 'grad_norm': 0.4751967489719391, 'learning_rate': 8.321981424148608e-05, 'epoch': 0.58}


 58%|█████▊    | 1892/3235 [34:04<21:56,  1.02it/s]

{'loss': 1.1136, 'grad_norm': 0.5287949442863464, 'learning_rate': 8.315789473684212e-05, 'epoch': 0.58}


 59%|█████▊    | 1893/3235 [34:05<22:17,  1.00it/s]

{'loss': 1.302, 'grad_norm': 0.5719290971755981, 'learning_rate': 8.309597523219814e-05, 'epoch': 0.59}


 59%|█████▊    | 1894/3235 [34:06<21:54,  1.02it/s]

{'loss': 0.921, 'grad_norm': 0.4706416428089142, 'learning_rate': 8.303405572755418e-05, 'epoch': 0.59}


 59%|█████▊    | 1895/3235 [34:07<23:22,  1.05s/it]

{'loss': 1.0414, 'grad_norm': 0.464625746011734, 'learning_rate': 8.297213622291022e-05, 'epoch': 0.59}


 59%|█████▊    | 1896/3235 [34:08<24:50,  1.11s/it]

{'loss': 1.0789, 'grad_norm': 0.42976608872413635, 'learning_rate': 8.291021671826625e-05, 'epoch': 0.59}


 59%|█████▊    | 1897/3235 [34:10<25:53,  1.16s/it]

{'loss': 1.0659, 'grad_norm': 0.5054683089256287, 'learning_rate': 8.284829721362229e-05, 'epoch': 0.59}


 59%|█████▊    | 1898/3235 [34:11<25:53,  1.16s/it]

{'loss': 1.1362, 'grad_norm': 0.44016963243484497, 'learning_rate': 8.278637770897834e-05, 'epoch': 0.59}


 59%|█████▊    | 1899/3235 [34:12<25:15,  1.13s/it]

{'loss': 0.9659, 'grad_norm': 0.5144371390342712, 'learning_rate': 8.272445820433437e-05, 'epoch': 0.59}


 59%|█████▊    | 1900/3235 [34:13<24:21,  1.09s/it]

{'loss': 0.9797, 'grad_norm': 0.45646029710769653, 'learning_rate': 8.266253869969041e-05, 'epoch': 0.59}


 59%|█████▉    | 1901/3235 [34:14<24:56,  1.12s/it]

{'loss': 1.0049, 'grad_norm': 0.42925795912742615, 'learning_rate': 8.260061919504645e-05, 'epoch': 0.59}


 59%|█████▉    | 1902/3235 [34:15<23:59,  1.08s/it]

{'loss': 0.9268, 'grad_norm': 0.4558199644088745, 'learning_rate': 8.253869969040247e-05, 'epoch': 0.59}


 59%|█████▉    | 1903/3235 [34:16<24:22,  1.10s/it]

{'loss': 1.1167, 'grad_norm': 0.5287526845932007, 'learning_rate': 8.247678018575851e-05, 'epoch': 0.59}


 59%|█████▉    | 1904/3235 [34:17<24:41,  1.11s/it]

{'loss': 1.159, 'grad_norm': 0.5123034119606018, 'learning_rate': 8.241486068111455e-05, 'epoch': 0.59}


 59%|█████▉    | 1905/3235 [34:19<26:05,  1.18s/it]

{'loss': 1.0465, 'grad_norm': 0.43694743514060974, 'learning_rate': 8.23529411764706e-05, 'epoch': 0.59}


 59%|█████▉    | 1906/3235 [34:20<25:37,  1.16s/it]

{'loss': 1.0156, 'grad_norm': 0.4829508364200592, 'learning_rate': 8.229102167182663e-05, 'epoch': 0.59}


 59%|█████▉    | 1907/3235 [34:21<24:05,  1.09s/it]

{'loss': 1.0463, 'grad_norm': 0.511124849319458, 'learning_rate': 8.222910216718267e-05, 'epoch': 0.59}


 59%|█████▉    | 1908/3235 [34:22<24:31,  1.11s/it]

{'loss': 1.008, 'grad_norm': 0.4950776994228363, 'learning_rate': 8.21671826625387e-05, 'epoch': 0.59}


 59%|█████▉    | 1909/3235 [34:23<25:08,  1.14s/it]

{'loss': 1.0757, 'grad_norm': 0.4968932867050171, 'learning_rate': 8.210526315789474e-05, 'epoch': 0.59}


 59%|█████▉    | 1910/3235 [34:24<26:29,  1.20s/it]

{'loss': 0.9892, 'grad_norm': 0.41981127858161926, 'learning_rate': 8.204334365325078e-05, 'epoch': 0.59}


 59%|█████▉    | 1911/3235 [34:25<24:38,  1.12s/it]

{'loss': 1.1223, 'grad_norm': 0.5571356415748596, 'learning_rate': 8.198142414860681e-05, 'epoch': 0.59}


 59%|█████▉    | 1912/3235 [34:26<22:30,  1.02s/it]

{'loss': 0.9762, 'grad_norm': 0.534068763256073, 'learning_rate': 8.191950464396285e-05, 'epoch': 0.59}


 59%|█████▉    | 1913/3235 [34:27<22:44,  1.03s/it]

{'loss': 0.9706, 'grad_norm': 0.4963834583759308, 'learning_rate': 8.185758513931889e-05, 'epoch': 0.59}


 59%|█████▉    | 1914/3235 [34:28<23:46,  1.08s/it]

{'loss': 1.1697, 'grad_norm': 0.49271050095558167, 'learning_rate': 8.179566563467493e-05, 'epoch': 0.59}


 59%|█████▉    | 1915/3235 [34:30<24:24,  1.11s/it]

{'loss': 1.0274, 'grad_norm': 0.4426766335964203, 'learning_rate': 8.173374613003097e-05, 'epoch': 0.59}


 59%|█████▉    | 1916/3235 [34:30<22:48,  1.04s/it]

{'loss': 0.9581, 'grad_norm': 0.5374962091445923, 'learning_rate': 8.167182662538701e-05, 'epoch': 0.59}


 59%|█████▉    | 1917/3235 [34:32<25:14,  1.15s/it]

{'loss': 0.8911, 'grad_norm': 0.4413866102695465, 'learning_rate': 8.160990712074303e-05, 'epoch': 0.59}


 59%|█████▉    | 1918/3235 [34:33<23:13,  1.06s/it]

{'loss': 1.0154, 'grad_norm': 0.5219482779502869, 'learning_rate': 8.154798761609907e-05, 'epoch': 0.59}


 59%|█████▉    | 1919/3235 [34:34<23:49,  1.09s/it]

{'loss': 1.136, 'grad_norm': 0.4795684218406677, 'learning_rate': 8.148606811145511e-05, 'epoch': 0.59}


 59%|█████▉    | 1920/3235 [34:35<23:15,  1.06s/it]

{'loss': 1.0135, 'grad_norm': 0.47320234775543213, 'learning_rate': 8.142414860681114e-05, 'epoch': 0.59}


 59%|█████▉    | 1921/3235 [34:36<23:41,  1.08s/it]

{'loss': 1.1347, 'grad_norm': 0.519162118434906, 'learning_rate': 8.136222910216718e-05, 'epoch': 0.59}


 59%|█████▉    | 1922/3235 [34:37<24:08,  1.10s/it]

{'loss': 0.9519, 'grad_norm': 0.4378233850002289, 'learning_rate': 8.130030959752323e-05, 'epoch': 0.59}


 59%|█████▉    | 1923/3235 [34:39<25:46,  1.18s/it]

{'loss': 0.8327, 'grad_norm': 0.44398629665374756, 'learning_rate': 8.123839009287926e-05, 'epoch': 0.59}


 59%|█████▉    | 1924/3235 [34:40<26:15,  1.20s/it]

{'loss': 1.0502, 'grad_norm': 0.43422865867614746, 'learning_rate': 8.11764705882353e-05, 'epoch': 0.59}


 60%|█████▉    | 1925/3235 [34:41<25:19,  1.16s/it]

{'loss': 1.2056, 'grad_norm': 0.5536998510360718, 'learning_rate': 8.111455108359134e-05, 'epoch': 0.6}


 60%|█████▉    | 1926/3235 [34:42<24:00,  1.10s/it]

{'loss': 1.0743, 'grad_norm': 0.5001991987228394, 'learning_rate': 8.105263157894737e-05, 'epoch': 0.6}


 60%|█████▉    | 1927/3235 [34:43<24:49,  1.14s/it]

{'loss': 1.0895, 'grad_norm': 0.45780301094055176, 'learning_rate': 8.099071207430341e-05, 'epoch': 0.6}


 60%|█████▉    | 1928/3235 [34:44<23:58,  1.10s/it]

{'loss': 0.9406, 'grad_norm': 0.46378976106643677, 'learning_rate': 8.092879256965945e-05, 'epoch': 0.6}


 60%|█████▉    | 1929/3235 [34:45<22:53,  1.05s/it]

{'loss': 1.0437, 'grad_norm': 0.5338270664215088, 'learning_rate': 8.086687306501549e-05, 'epoch': 0.6}


 60%|█████▉    | 1930/3235 [34:46<23:04,  1.06s/it]

{'loss': 1.2005, 'grad_norm': 0.5064551830291748, 'learning_rate': 8.080495356037153e-05, 'epoch': 0.6}


 60%|█████▉    | 1931/3235 [34:47<23:14,  1.07s/it]

{'loss': 1.1448, 'grad_norm': 0.5013308525085449, 'learning_rate': 8.074303405572757e-05, 'epoch': 0.6}


 60%|█████▉    | 1932/3235 [34:48<24:30,  1.13s/it]

{'loss': 1.1351, 'grad_norm': 0.48782026767730713, 'learning_rate': 8.06811145510836e-05, 'epoch': 0.6}


 60%|█████▉    | 1933/3235 [34:50<25:16,  1.16s/it]

{'loss': 0.8622, 'grad_norm': 0.44723108410835266, 'learning_rate': 8.061919504643963e-05, 'epoch': 0.6}


 60%|█████▉    | 1934/3235 [34:51<25:57,  1.20s/it]

{'loss': 1.0049, 'grad_norm': 0.4546036124229431, 'learning_rate': 8.055727554179567e-05, 'epoch': 0.6}


 60%|█████▉    | 1935/3235 [34:52<24:36,  1.14s/it]

{'loss': 0.9019, 'grad_norm': 0.49239951372146606, 'learning_rate': 8.04953560371517e-05, 'epoch': 0.6}


 60%|█████▉    | 1936/3235 [34:53<23:07,  1.07s/it]

{'loss': 0.9064, 'grad_norm': 0.49359169602394104, 'learning_rate': 8.043343653250774e-05, 'epoch': 0.6}


 60%|█████▉    | 1937/3235 [34:54<23:19,  1.08s/it]

{'loss': 0.8728, 'grad_norm': 0.5088309645652771, 'learning_rate': 8.037151702786378e-05, 'epoch': 0.6}


 60%|█████▉    | 1938/3235 [34:55<24:28,  1.13s/it]

{'loss': 0.8901, 'grad_norm': 0.5521434545516968, 'learning_rate': 8.030959752321982e-05, 'epoch': 0.6}


 60%|█████▉    | 1939/3235 [34:56<23:21,  1.08s/it]

{'loss': 1.0787, 'grad_norm': 0.4819967448711395, 'learning_rate': 8.024767801857586e-05, 'epoch': 0.6}


 60%|█████▉    | 1940/3235 [34:57<24:20,  1.13s/it]

{'loss': 0.8588, 'grad_norm': 0.4388319253921509, 'learning_rate': 8.01857585139319e-05, 'epoch': 0.6}


 60%|██████    | 1941/3235 [34:58<22:14,  1.03s/it]

{'loss': 0.9412, 'grad_norm': 0.5334904789924622, 'learning_rate': 8.012383900928793e-05, 'epoch': 0.6}


 60%|██████    | 1942/3235 [34:59<23:09,  1.07s/it]

{'loss': 1.234, 'grad_norm': 0.49672991037368774, 'learning_rate': 8.006191950464397e-05, 'epoch': 0.6}


 60%|██████    | 1943/3235 [35:01<24:51,  1.15s/it]

{'loss': 1.0849, 'grad_norm': 0.42627400159835815, 'learning_rate': 8e-05, 'epoch': 0.6}


 60%|██████    | 1944/3235 [35:02<23:34,  1.10s/it]

{'loss': 1.0491, 'grad_norm': 0.5097541213035583, 'learning_rate': 7.993808049535603e-05, 'epoch': 0.6}


 60%|██████    | 1945/3235 [35:03<23:02,  1.07s/it]

{'loss': 1.0664, 'grad_norm': 0.5330694317817688, 'learning_rate': 7.987616099071207e-05, 'epoch': 0.6}


 60%|██████    | 1946/3235 [35:04<23:08,  1.08s/it]

{'loss': 1.0975, 'grad_norm': 0.5648742914199829, 'learning_rate': 7.981424148606811e-05, 'epoch': 0.6}


 60%|██████    | 1947/3235 [35:05<24:00,  1.12s/it]

{'loss': 1.0656, 'grad_norm': 0.4642486870288849, 'learning_rate': 7.975232198142415e-05, 'epoch': 0.6}


 60%|██████    | 1948/3235 [35:06<24:22,  1.14s/it]

{'loss': 1.0295, 'grad_norm': 0.4387325644493103, 'learning_rate': 7.969040247678019e-05, 'epoch': 0.6}


 60%|██████    | 1949/3235 [35:07<23:33,  1.10s/it]

{'loss': 1.1798, 'grad_norm': 0.4862667918205261, 'learning_rate': 7.962848297213623e-05, 'epoch': 0.6}


 60%|██████    | 1950/3235 [35:08<23:27,  1.10s/it]

{'loss': 0.9655, 'grad_norm': 0.49267223477363586, 'learning_rate': 7.956656346749226e-05, 'epoch': 0.6}


 60%|██████    | 1951/3235 [35:10<24:51,  1.16s/it]

{'loss': 1.0581, 'grad_norm': 0.5258931517601013, 'learning_rate': 7.95046439628483e-05, 'epoch': 0.6}


 60%|██████    | 1952/3235 [35:10<22:24,  1.05s/it]

{'loss': 0.8988, 'grad_norm': 0.507732093334198, 'learning_rate': 7.944272445820434e-05, 'epoch': 0.6}


 60%|██████    | 1953/3235 [35:12<24:32,  1.15s/it]

{'loss': 0.9622, 'grad_norm': 0.4202623963356018, 'learning_rate': 7.938080495356037e-05, 'epoch': 0.6}


 60%|██████    | 1954/3235 [35:13<22:35,  1.06s/it]

{'loss': 0.9648, 'grad_norm': 0.5265210270881653, 'learning_rate': 7.931888544891642e-05, 'epoch': 0.6}


 60%|██████    | 1955/3235 [35:14<22:03,  1.03s/it]

{'loss': 1.0489, 'grad_norm': 0.5122990012168884, 'learning_rate': 7.925696594427246e-05, 'epoch': 0.6}


 60%|██████    | 1956/3235 [35:15<22:21,  1.05s/it]

{'loss': 0.9276, 'grad_norm': 0.4593079090118408, 'learning_rate': 7.919504643962849e-05, 'epoch': 0.6}


 60%|██████    | 1957/3235 [35:16<23:06,  1.09s/it]

{'loss': 0.9722, 'grad_norm': 0.4455040395259857, 'learning_rate': 7.913312693498453e-05, 'epoch': 0.6}


 61%|██████    | 1958/3235 [35:17<21:50,  1.03s/it]

{'loss': 0.9941, 'grad_norm': 0.481586217880249, 'learning_rate': 7.907120743034057e-05, 'epoch': 0.61}


 61%|██████    | 1959/3235 [35:18<21:07,  1.01it/s]

{'loss': 0.9951, 'grad_norm': 0.5342375040054321, 'learning_rate': 7.900928792569659e-05, 'epoch': 0.61}


 61%|██████    | 1960/3235 [35:19<21:10,  1.00it/s]

{'loss': 0.9258, 'grad_norm': 0.48689085245132446, 'learning_rate': 7.894736842105263e-05, 'epoch': 0.61}


 61%|██████    | 1961/3235 [35:20<22:05,  1.04s/it]

{'loss': 1.0016, 'grad_norm': 0.46050670742988586, 'learning_rate': 7.888544891640867e-05, 'epoch': 0.61}


 61%|██████    | 1962/3235 [35:21<23:35,  1.11s/it]

{'loss': 1.1445, 'grad_norm': 0.5190953016281128, 'learning_rate': 7.882352941176471e-05, 'epoch': 0.61}


 61%|██████    | 1963/3235 [35:22<21:52,  1.03s/it]

{'loss': 1.0131, 'grad_norm': 0.5965263247489929, 'learning_rate': 7.876160990712075e-05, 'epoch': 0.61}


 61%|██████    | 1964/3235 [35:23<23:00,  1.09s/it]

{'loss': 0.92, 'grad_norm': 0.553268551826477, 'learning_rate': 7.869969040247679e-05, 'epoch': 0.61}


 61%|██████    | 1965/3235 [35:24<21:18,  1.01s/it]

{'loss': 1.0427, 'grad_norm': 0.5438151359558105, 'learning_rate': 7.863777089783282e-05, 'epoch': 0.61}


 61%|██████    | 1966/3235 [35:25<21:38,  1.02s/it]

{'loss': 1.0119, 'grad_norm': 0.4589482843875885, 'learning_rate': 7.857585139318886e-05, 'epoch': 0.61}


 61%|██████    | 1967/3235 [35:26<22:15,  1.05s/it]

{'loss': 1.039, 'grad_norm': 0.5516073703765869, 'learning_rate': 7.85139318885449e-05, 'epoch': 0.61}


 61%|██████    | 1968/3235 [35:27<21:49,  1.03s/it]

{'loss': 0.8738, 'grad_norm': 0.516857385635376, 'learning_rate': 7.845201238390093e-05, 'epoch': 0.61}


 61%|██████    | 1969/3235 [35:28<23:11,  1.10s/it]

{'loss': 1.0767, 'grad_norm': 0.46706056594848633, 'learning_rate': 7.839009287925697e-05, 'epoch': 0.61}


 61%|██████    | 1970/3235 [35:29<21:54,  1.04s/it]

{'loss': 1.0335, 'grad_norm': 0.5235626697540283, 'learning_rate': 7.8328173374613e-05, 'epoch': 0.61}


 61%|██████    | 1971/3235 [35:30<21:55,  1.04s/it]

{'loss': 1.0963, 'grad_norm': 0.5188226699829102, 'learning_rate': 7.826625386996905e-05, 'epoch': 0.61}


 61%|██████    | 1972/3235 [35:31<20:48,  1.01it/s]

{'loss': 1.0579, 'grad_norm': 0.550849437713623, 'learning_rate': 7.820433436532509e-05, 'epoch': 0.61}


 61%|██████    | 1973/3235 [35:32<20:41,  1.02it/s]

{'loss': 0.9535, 'grad_norm': 0.5155124068260193, 'learning_rate': 7.814241486068113e-05, 'epoch': 0.61}


 61%|██████    | 1974/3235 [35:33<19:44,  1.06it/s]

{'loss': 1.0331, 'grad_norm': 0.5632175803184509, 'learning_rate': 7.808049535603715e-05, 'epoch': 0.61}


 61%|██████    | 1975/3235 [35:34<22:13,  1.06s/it]

{'loss': 1.0547, 'grad_norm': 0.480374813079834, 'learning_rate': 7.801857585139319e-05, 'epoch': 0.61}


 61%|██████    | 1976/3235 [35:36<23:18,  1.11s/it]

{'loss': 1.1549, 'grad_norm': 0.4808262288570404, 'learning_rate': 7.795665634674923e-05, 'epoch': 0.61}


 61%|██████    | 1977/3235 [35:37<23:51,  1.14s/it]

{'loss': 1.1311, 'grad_norm': 0.45628273487091064, 'learning_rate': 7.789473684210526e-05, 'epoch': 0.61}


 61%|██████    | 1978/3235 [35:38<22:24,  1.07s/it]

{'loss': 0.979, 'grad_norm': 0.5358354449272156, 'learning_rate': 7.783281733746131e-05, 'epoch': 0.61}


 61%|██████    | 1979/3235 [35:39<23:55,  1.14s/it]

{'loss': 0.9914, 'grad_norm': 0.44588226079940796, 'learning_rate': 7.777089783281735e-05, 'epoch': 0.61}


 61%|██████    | 1980/3235 [35:40<24:34,  1.17s/it]

{'loss': 1.038, 'grad_norm': 0.46214377880096436, 'learning_rate': 7.770897832817338e-05, 'epoch': 0.61}


 61%|██████    | 1981/3235 [35:41<24:10,  1.16s/it]

{'loss': 0.9475, 'grad_norm': 0.47886916995048523, 'learning_rate': 7.764705882352942e-05, 'epoch': 0.61}


 61%|██████▏   | 1982/3235 [35:43<24:27,  1.17s/it]

{'loss': 1.0324, 'grad_norm': 0.46878573298454285, 'learning_rate': 7.758513931888546e-05, 'epoch': 0.61}


 61%|██████▏   | 1983/3235 [35:44<25:39,  1.23s/it]

{'loss': 0.9861, 'grad_norm': 0.48789456486701965, 'learning_rate': 7.752321981424148e-05, 'epoch': 0.61}


 61%|██████▏   | 1984/3235 [35:45<24:30,  1.18s/it]

{'loss': 0.8452, 'grad_norm': 0.46614494919776917, 'learning_rate': 7.746130030959752e-05, 'epoch': 0.61}


 61%|██████▏   | 1985/3235 [35:46<23:40,  1.14s/it]

{'loss': 1.0662, 'grad_norm': 0.5364292860031128, 'learning_rate': 7.739938080495357e-05, 'epoch': 0.61}


 61%|██████▏   | 1986/3235 [35:47<23:08,  1.11s/it]

{'loss': 1.0942, 'grad_norm': 0.5125433206558228, 'learning_rate': 7.73374613003096e-05, 'epoch': 0.61}


 61%|██████▏   | 1987/3235 [35:48<22:18,  1.07s/it]

{'loss': 0.9409, 'grad_norm': 0.4698762893676758, 'learning_rate': 7.727554179566565e-05, 'epoch': 0.61}


 61%|██████▏   | 1988/3235 [35:49<21:42,  1.04s/it]

{'loss': 1.0308, 'grad_norm': 0.4786207675933838, 'learning_rate': 7.721362229102167e-05, 'epoch': 0.61}


 61%|██████▏   | 1989/3235 [35:50<20:41,  1.00it/s]

{'loss': 1.0701, 'grad_norm': 0.5423540472984314, 'learning_rate': 7.715170278637771e-05, 'epoch': 0.61}


 62%|██████▏   | 1990/3235 [35:51<19:13,  1.08it/s]

{'loss': 0.8945, 'grad_norm': 0.5584741830825806, 'learning_rate': 7.708978328173375e-05, 'epoch': 0.62}


 62%|██████▏   | 1991/3235 [35:52<21:37,  1.04s/it]

{'loss': 1.0298, 'grad_norm': 0.4318017363548279, 'learning_rate': 7.702786377708978e-05, 'epoch': 0.62}


 62%|██████▏   | 1992/3235 [35:53<23:43,  1.14s/it]

{'loss': 0.9402, 'grad_norm': 0.44277268648147583, 'learning_rate': 7.696594427244582e-05, 'epoch': 0.62}


 62%|██████▏   | 1993/3235 [35:55<24:26,  1.18s/it]

{'loss': 0.9134, 'grad_norm': 0.4528992176055908, 'learning_rate': 7.690402476780186e-05, 'epoch': 0.62}


 62%|██████▏   | 1994/3235 [35:56<22:52,  1.11s/it]

{'loss': 0.9412, 'grad_norm': 0.4760395288467407, 'learning_rate': 7.68421052631579e-05, 'epoch': 0.62}


 62%|██████▏   | 1995/3235 [35:57<23:05,  1.12s/it]

{'loss': 1.0942, 'grad_norm': 0.4420236647129059, 'learning_rate': 7.678018575851394e-05, 'epoch': 0.62}


 62%|██████▏   | 1996/3235 [35:58<22:14,  1.08s/it]

{'loss': 0.88, 'grad_norm': 0.5136647820472717, 'learning_rate': 7.671826625386998e-05, 'epoch': 0.62}


 62%|██████▏   | 1997/3235 [35:59<23:14,  1.13s/it]

{'loss': 1.1336, 'grad_norm': 0.4577489197254181, 'learning_rate': 7.6656346749226e-05, 'epoch': 0.62}


 62%|██████▏   | 1998/3235 [36:00<23:43,  1.15s/it]

{'loss': 1.0494, 'grad_norm': 0.49354833364486694, 'learning_rate': 7.659442724458204e-05, 'epoch': 0.62}


 62%|██████▏   | 1999/3235 [36:01<23:42,  1.15s/it]

{'loss': 0.988, 'grad_norm': 0.4929022192955017, 'learning_rate': 7.653250773993808e-05, 'epoch': 0.62}


 62%|██████▏   | 2000/3235 [36:03<24:18,  1.18s/it]

{'loss': 0.9652, 'grad_norm': 0.43019670248031616, 'learning_rate': 7.647058823529411e-05, 'epoch': 0.62}


[34m[1mwandb[0m: Adding directory to artifact (./outputs/checkpoint-2000)... Done. 0.1s
 62%|██████▏   | 2001/3235 [36:05<29:19,  1.43s/it]

{'loss': 1.2134, 'grad_norm': 0.45576968789100647, 'learning_rate': 7.640866873065015e-05, 'epoch': 0.62}


 62%|██████▏   | 2002/3235 [36:06<27:48,  1.35s/it]

{'loss': 0.9934, 'grad_norm': 0.4723608195781708, 'learning_rate': 7.63467492260062e-05, 'epoch': 0.62}


 62%|██████▏   | 2003/3235 [36:06<24:12,  1.18s/it]

{'loss': 1.0438, 'grad_norm': 0.6554886102676392, 'learning_rate': 7.628482972136223e-05, 'epoch': 0.62}


 62%|██████▏   | 2004/3235 [36:08<23:28,  1.14s/it]

{'loss': 1.042, 'grad_norm': 0.4870135486125946, 'learning_rate': 7.622291021671827e-05, 'epoch': 0.62}


 62%|██████▏   | 2005/3235 [36:09<22:28,  1.10s/it]

{'loss': 0.9282, 'grad_norm': 0.4446234405040741, 'learning_rate': 7.616099071207431e-05, 'epoch': 0.62}


 62%|██████▏   | 2006/3235 [36:10<22:39,  1.11s/it]

{'loss': 0.9976, 'grad_norm': 0.4452318549156189, 'learning_rate': 7.609907120743034e-05, 'epoch': 0.62}


 62%|██████▏   | 2007/3235 [36:11<25:06,  1.23s/it]

{'loss': 1.0216, 'grad_norm': 0.43090835213661194, 'learning_rate': 7.603715170278638e-05, 'epoch': 0.62}


 62%|██████▏   | 2008/3235 [36:12<25:33,  1.25s/it]

{'loss': 1.0932, 'grad_norm': 0.5126166939735413, 'learning_rate': 7.597523219814242e-05, 'epoch': 0.62}


 62%|██████▏   | 2009/3235 [36:14<24:51,  1.22s/it]

{'loss': 1.0995, 'grad_norm': 0.4542882442474365, 'learning_rate': 7.591331269349844e-05, 'epoch': 0.62}


 62%|██████▏   | 2010/3235 [36:15<24:03,  1.18s/it]

{'loss': 1.049, 'grad_norm': 0.4835338592529297, 'learning_rate': 7.58513931888545e-05, 'epoch': 0.62}


 62%|██████▏   | 2011/3235 [36:16<24:09,  1.18s/it]

{'loss': 1.0159, 'grad_norm': 0.5069199800491333, 'learning_rate': 7.578947368421054e-05, 'epoch': 0.62}


 62%|██████▏   | 2012/3235 [36:17<21:17,  1.04s/it]

{'loss': 0.8805, 'grad_norm': 0.5474933981895447, 'learning_rate': 7.572755417956656e-05, 'epoch': 0.62}


 62%|██████▏   | 2013/3235 [36:18<20:27,  1.00s/it]

{'loss': 0.9413, 'grad_norm': 0.5207570195198059, 'learning_rate': 7.56656346749226e-05, 'epoch': 0.62}


 62%|██████▏   | 2014/3235 [36:19<20:57,  1.03s/it]

{'loss': 1.0368, 'grad_norm': 0.48846402764320374, 'learning_rate': 7.560371517027864e-05, 'epoch': 0.62}


 62%|██████▏   | 2015/3235 [36:20<20:41,  1.02s/it]

{'loss': 0.9912, 'grad_norm': 0.5168682932853699, 'learning_rate': 7.554179566563467e-05, 'epoch': 0.62}


 62%|██████▏   | 2016/3235 [36:21<22:59,  1.13s/it]

{'loss': 1.1568, 'grad_norm': 0.4190790355205536, 'learning_rate': 7.547987616099071e-05, 'epoch': 0.62}


 62%|██████▏   | 2017/3235 [36:22<22:15,  1.10s/it]

{'loss': 0.9817, 'grad_norm': 0.4803959131240845, 'learning_rate': 7.541795665634675e-05, 'epoch': 0.62}


 62%|██████▏   | 2018/3235 [36:23<21:40,  1.07s/it]

{'loss': 0.864, 'grad_norm': 0.5193497538566589, 'learning_rate': 7.535603715170279e-05, 'epoch': 0.62}


 62%|██████▏   | 2019/3235 [36:24<20:52,  1.03s/it]

{'loss': 0.9608, 'grad_norm': 0.6281020641326904, 'learning_rate': 7.529411764705883e-05, 'epoch': 0.62}


 62%|██████▏   | 2020/3235 [36:25<20:40,  1.02s/it]

{'loss': 1.0442, 'grad_norm': 0.49929726123809814, 'learning_rate': 7.523219814241487e-05, 'epoch': 0.62}


 62%|██████▏   | 2021/3235 [36:26<20:31,  1.01s/it]

{'loss': 1.0827, 'grad_norm': 0.456432044506073, 'learning_rate': 7.51702786377709e-05, 'epoch': 0.62}


 63%|██████▎   | 2022/3235 [36:27<20:24,  1.01s/it]

{'loss': 1.009, 'grad_norm': 0.5344399809837341, 'learning_rate': 7.510835913312694e-05, 'epoch': 0.63}


 63%|██████▎   | 2023/3235 [36:28<20:50,  1.03s/it]

{'loss': 1.0243, 'grad_norm': 0.5036912560462952, 'learning_rate': 7.504643962848298e-05, 'epoch': 0.63}


 63%|██████▎   | 2024/3235 [36:29<21:05,  1.05s/it]

{'loss': 0.9785, 'grad_norm': 0.4492650330066681, 'learning_rate': 7.4984520123839e-05, 'epoch': 0.63}


 63%|██████▎   | 2025/3235 [36:30<20:53,  1.04s/it]

{'loss': 1.1297, 'grad_norm': 0.4865063726902008, 'learning_rate': 7.492260061919504e-05, 'epoch': 0.63}


 63%|██████▎   | 2026/3235 [36:31<20:27,  1.02s/it]

{'loss': 0.8774, 'grad_norm': 0.4492170810699463, 'learning_rate': 7.486068111455108e-05, 'epoch': 0.63}


 63%|██████▎   | 2027/3235 [36:32<19:40,  1.02it/s]

{'loss': 1.149, 'grad_norm': 0.49159619212150574, 'learning_rate': 7.479876160990712e-05, 'epoch': 0.63}


 63%|██████▎   | 2028/3235 [36:33<19:05,  1.05it/s]

{'loss': 1.0773, 'grad_norm': 0.5677652359008789, 'learning_rate': 7.473684210526316e-05, 'epoch': 0.63}


 63%|██████▎   | 2029/3235 [36:34<20:32,  1.02s/it]

{'loss': 1.1698, 'grad_norm': 0.46781566739082336, 'learning_rate': 7.46749226006192e-05, 'epoch': 0.63}


 63%|██████▎   | 2030/3235 [36:35<20:37,  1.03s/it]

{'loss': 1.2343, 'grad_norm': 0.5198735594749451, 'learning_rate': 7.461300309597523e-05, 'epoch': 0.63}


 63%|██████▎   | 2031/3235 [36:36<21:13,  1.06s/it]

{'loss': 0.976, 'grad_norm': 0.4214600920677185, 'learning_rate': 7.455108359133127e-05, 'epoch': 0.63}


 63%|██████▎   | 2032/3235 [36:37<21:59,  1.10s/it]

{'loss': 1.0195, 'grad_norm': 0.4581027626991272, 'learning_rate': 7.448916408668731e-05, 'epoch': 0.63}


 63%|██████▎   | 2033/3235 [36:39<22:07,  1.10s/it]

{'loss': 1.1277, 'grad_norm': 0.49486303329467773, 'learning_rate': 7.442724458204334e-05, 'epoch': 0.63}


 63%|██████▎   | 2034/3235 [36:39<20:17,  1.01s/it]

{'loss': 0.9831, 'grad_norm': 0.5277059078216553, 'learning_rate': 7.436532507739939e-05, 'epoch': 0.63}


 63%|██████▎   | 2035/3235 [36:40<18:33,  1.08it/s]

{'loss': 1.0296, 'grad_norm': 0.5507737994194031, 'learning_rate': 7.430340557275543e-05, 'epoch': 0.63}


 63%|██████▎   | 2036/3235 [36:41<19:20,  1.03it/s]

{'loss': 1.1115, 'grad_norm': 0.4705658257007599, 'learning_rate': 7.424148606811146e-05, 'epoch': 0.63}


 63%|██████▎   | 2037/3235 [36:42<20:52,  1.05s/it]

{'loss': 1.1288, 'grad_norm': 0.49222785234451294, 'learning_rate': 7.41795665634675e-05, 'epoch': 0.63}


 63%|██████▎   | 2038/3235 [36:43<20:25,  1.02s/it]

{'loss': 0.949, 'grad_norm': 0.4780718982219696, 'learning_rate': 7.411764705882354e-05, 'epoch': 0.63}


 63%|██████▎   | 2039/3235 [36:44<20:13,  1.01s/it]

{'loss': 0.9108, 'grad_norm': 0.4532027840614319, 'learning_rate': 7.405572755417956e-05, 'epoch': 0.63}


 63%|██████▎   | 2040/3235 [36:46<21:50,  1.10s/it]

{'loss': 0.9382, 'grad_norm': 0.4488874077796936, 'learning_rate': 7.39938080495356e-05, 'epoch': 0.63}


 63%|██████▎   | 2041/3235 [36:47<22:01,  1.11s/it]

{'loss': 1.0391, 'grad_norm': 0.49017661809921265, 'learning_rate': 7.393188854489164e-05, 'epoch': 0.63}


 63%|██████▎   | 2042/3235 [36:48<20:59,  1.06s/it]

{'loss': 1.2208, 'grad_norm': 0.5531544089317322, 'learning_rate': 7.386996904024768e-05, 'epoch': 0.63}


 63%|██████▎   | 2043/3235 [36:49<22:06,  1.11s/it]

{'loss': 1.061, 'grad_norm': 0.4805184602737427, 'learning_rate': 7.380804953560372e-05, 'epoch': 0.63}


 63%|██████▎   | 2044/3235 [36:50<22:19,  1.12s/it]

{'loss': 1.122, 'grad_norm': 0.48532265424728394, 'learning_rate': 7.374613003095976e-05, 'epoch': 0.63}


 63%|██████▎   | 2045/3235 [36:51<22:03,  1.11s/it]

{'loss': 0.9371, 'grad_norm': 0.4788917601108551, 'learning_rate': 7.368421052631579e-05, 'epoch': 0.63}


 63%|██████▎   | 2046/3235 [36:52<22:51,  1.15s/it]

{'loss': 0.8571, 'grad_norm': 0.4316388964653015, 'learning_rate': 7.362229102167183e-05, 'epoch': 0.63}


 63%|██████▎   | 2047/3235 [36:53<21:53,  1.11s/it]

{'loss': 1.0411, 'grad_norm': 0.5051210522651672, 'learning_rate': 7.356037151702787e-05, 'epoch': 0.63}


 63%|██████▎   | 2048/3235 [36:55<22:25,  1.13s/it]

{'loss': 1.0142, 'grad_norm': 0.4476599097251892, 'learning_rate': 7.34984520123839e-05, 'epoch': 0.63}


 63%|██████▎   | 2049/3235 [36:56<22:02,  1.12s/it]

{'loss': 0.8574, 'grad_norm': 0.47255197167396545, 'learning_rate': 7.343653250773994e-05, 'epoch': 0.63}


 63%|██████▎   | 2050/3235 [36:57<22:51,  1.16s/it]

{'loss': 1.0234, 'grad_norm': 0.4917627274990082, 'learning_rate': 7.337461300309598e-05, 'epoch': 0.63}


 63%|██████▎   | 2051/3235 [36:58<22:14,  1.13s/it]

{'loss': 0.9295, 'grad_norm': 0.44653868675231934, 'learning_rate': 7.331269349845202e-05, 'epoch': 0.63}


 63%|██████▎   | 2052/3235 [36:59<22:10,  1.12s/it]

{'loss': 0.8662, 'grad_norm': 0.4778546392917633, 'learning_rate': 7.325077399380806e-05, 'epoch': 0.63}


 63%|██████▎   | 2053/3235 [37:00<21:30,  1.09s/it]

{'loss': 1.0125, 'grad_norm': 0.5532640218734741, 'learning_rate': 7.31888544891641e-05, 'epoch': 0.63}


 63%|██████▎   | 2054/3235 [37:01<22:00,  1.12s/it]

{'loss': 1.1265, 'grad_norm': 0.5149521231651306, 'learning_rate': 7.312693498452012e-05, 'epoch': 0.63}


 64%|██████▎   | 2055/3235 [37:02<21:37,  1.10s/it]

{'loss': 0.9575, 'grad_norm': 0.4863269329071045, 'learning_rate': 7.306501547987616e-05, 'epoch': 0.64}


 64%|██████▎   | 2056/3235 [37:04<21:57,  1.12s/it]

{'loss': 1.0578, 'grad_norm': 0.5104178190231323, 'learning_rate': 7.30030959752322e-05, 'epoch': 0.64}


 64%|██████▎   | 2057/3235 [37:05<21:26,  1.09s/it]

{'loss': 1.0955, 'grad_norm': 0.5046342611312866, 'learning_rate': 7.294117647058823e-05, 'epoch': 0.64}


 64%|██████▎   | 2058/3235 [37:05<19:51,  1.01s/it]

{'loss': 0.9014, 'grad_norm': 0.5213170647621155, 'learning_rate': 7.287925696594428e-05, 'epoch': 0.64}


 64%|██████▎   | 2059/3235 [37:06<20:22,  1.04s/it]

{'loss': 0.9398, 'grad_norm': 0.4998994767665863, 'learning_rate': 7.281733746130032e-05, 'epoch': 0.64}


 64%|██████▎   | 2060/3235 [37:08<22:02,  1.13s/it]

{'loss': 0.9983, 'grad_norm': 0.4694969654083252, 'learning_rate': 7.275541795665635e-05, 'epoch': 0.64}


 64%|██████▎   | 2061/3235 [37:09<22:10,  1.13s/it]

{'loss': 0.9768, 'grad_norm': 0.49529844522476196, 'learning_rate': 7.269349845201239e-05, 'epoch': 0.64}


 64%|██████▎   | 2062/3235 [37:10<21:07,  1.08s/it]

{'loss': 0.9968, 'grad_norm': 0.4776112735271454, 'learning_rate': 7.263157894736843e-05, 'epoch': 0.64}


 64%|██████▍   | 2063/3235 [37:11<20:04,  1.03s/it]

{'loss': 0.9191, 'grad_norm': 0.5110501646995544, 'learning_rate': 7.256965944272446e-05, 'epoch': 0.64}


 64%|██████▍   | 2064/3235 [37:12<19:40,  1.01s/it]

{'loss': 0.9759, 'grad_norm': 0.5262235999107361, 'learning_rate': 7.25077399380805e-05, 'epoch': 0.64}


 64%|██████▍   | 2065/3235 [37:13<20:16,  1.04s/it]

{'loss': 1.0001, 'grad_norm': 0.45938923954963684, 'learning_rate': 7.244582043343654e-05, 'epoch': 0.64}


 64%|██████▍   | 2066/3235 [37:14<21:35,  1.11s/it]

{'loss': 1.0665, 'grad_norm': 0.4476969838142395, 'learning_rate': 7.238390092879258e-05, 'epoch': 0.64}


 64%|██████▍   | 2067/3235 [37:15<21:43,  1.12s/it]

{'loss': 0.9445, 'grad_norm': 0.5014944672584534, 'learning_rate': 7.232198142414862e-05, 'epoch': 0.64}


 64%|██████▍   | 2068/3235 [37:16<21:22,  1.10s/it]

{'loss': 1.2012, 'grad_norm': 0.4896683096885681, 'learning_rate': 7.226006191950466e-05, 'epoch': 0.64}


 64%|██████▍   | 2069/3235 [37:17<21:05,  1.09s/it]

{'loss': 0.9829, 'grad_norm': 0.4738890826702118, 'learning_rate': 7.219814241486068e-05, 'epoch': 0.64}


 64%|██████▍   | 2070/3235 [37:18<20:40,  1.07s/it]

{'loss': 0.9157, 'grad_norm': 0.4611606299877167, 'learning_rate': 7.213622291021672e-05, 'epoch': 0.64}


 64%|██████▍   | 2071/3235 [37:20<21:53,  1.13s/it]

{'loss': 0.9831, 'grad_norm': 0.4718247056007385, 'learning_rate': 7.207430340557276e-05, 'epoch': 0.64}


 64%|██████▍   | 2072/3235 [37:21<21:10,  1.09s/it]

{'loss': 0.9301, 'grad_norm': 0.46645763516426086, 'learning_rate': 7.201238390092879e-05, 'epoch': 0.64}


 64%|██████▍   | 2073/3235 [37:22<20:07,  1.04s/it]

{'loss': 1.0587, 'grad_norm': 0.4968152344226837, 'learning_rate': 7.195046439628483e-05, 'epoch': 0.64}


 64%|██████▍   | 2074/3235 [37:23<22:01,  1.14s/it]

{'loss': 0.8791, 'grad_norm': 0.5074220895767212, 'learning_rate': 7.188854489164087e-05, 'epoch': 0.64}


 64%|██████▍   | 2075/3235 [37:24<21:02,  1.09s/it]

{'loss': 1.0411, 'grad_norm': 0.5688651204109192, 'learning_rate': 7.182662538699691e-05, 'epoch': 0.64}


 64%|██████▍   | 2076/3235 [37:25<19:06,  1.01it/s]

{'loss': 1.1839, 'grad_norm': 0.9733989834785461, 'learning_rate': 7.176470588235295e-05, 'epoch': 0.64}


 64%|██████▍   | 2077/3235 [37:26<19:32,  1.01s/it]

{'loss': 1.1272, 'grad_norm': 0.5118612051010132, 'learning_rate': 7.170278637770899e-05, 'epoch': 0.64}


 64%|██████▍   | 2078/3235 [37:27<19:34,  1.01s/it]

{'loss': 1.0102, 'grad_norm': 0.4674713611602783, 'learning_rate': 7.164086687306501e-05, 'epoch': 0.64}


 64%|██████▍   | 2079/3235 [37:28<20:42,  1.07s/it]

{'loss': 1.2451, 'grad_norm': 0.48045480251312256, 'learning_rate': 7.157894736842105e-05, 'epoch': 0.64}


 64%|██████▍   | 2080/3235 [37:29<20:31,  1.07s/it]

{'loss': 1.1775, 'grad_norm': 0.5063393115997314, 'learning_rate': 7.15170278637771e-05, 'epoch': 0.64}


 64%|██████▍   | 2081/3235 [37:30<21:39,  1.13s/it]

{'loss': 1.0664, 'grad_norm': 0.42734384536743164, 'learning_rate': 7.145510835913312e-05, 'epoch': 0.64}


 64%|██████▍   | 2082/3235 [37:32<22:03,  1.15s/it]

{'loss': 1.0295, 'grad_norm': 0.5011327266693115, 'learning_rate': 7.139318885448916e-05, 'epoch': 0.64}


 64%|██████▍   | 2083/3235 [37:33<21:08,  1.10s/it]

{'loss': 0.9266, 'grad_norm': 0.5147437453269958, 'learning_rate': 7.133126934984521e-05, 'epoch': 0.64}


 64%|██████▍   | 2084/3235 [37:33<20:01,  1.04s/it]

{'loss': 0.96, 'grad_norm': 0.5448670983314514, 'learning_rate': 7.126934984520124e-05, 'epoch': 0.64}


 64%|██████▍   | 2085/3235 [37:34<18:57,  1.01it/s]

{'loss': 1.0305, 'grad_norm': 0.5441906452178955, 'learning_rate': 7.120743034055728e-05, 'epoch': 0.64}


 64%|██████▍   | 2086/3235 [37:35<18:58,  1.01it/s]

{'loss': 1.1508, 'grad_norm': 0.500733494758606, 'learning_rate': 7.114551083591332e-05, 'epoch': 0.64}


 65%|██████▍   | 2087/3235 [37:36<19:00,  1.01it/s]

{'loss': 0.8847, 'grad_norm': 0.5063279867172241, 'learning_rate': 7.108359133126935e-05, 'epoch': 0.65}


 65%|██████▍   | 2088/3235 [37:37<18:52,  1.01it/s]

{'loss': 1.0167, 'grad_norm': 0.5039752125740051, 'learning_rate': 7.102167182662539e-05, 'epoch': 0.65}


 65%|██████▍   | 2089/3235 [37:39<20:15,  1.06s/it]

{'loss': 1.0522, 'grad_norm': 0.5026612877845764, 'learning_rate': 7.095975232198143e-05, 'epoch': 0.65}


 65%|██████▍   | 2090/3235 [37:40<21:05,  1.11s/it]

{'loss': 1.0222, 'grad_norm': 0.47367149591445923, 'learning_rate': 7.089783281733747e-05, 'epoch': 0.65}


 65%|██████▍   | 2091/3235 [37:41<21:51,  1.15s/it]

{'loss': 1.0272, 'grad_norm': 0.45143625140190125, 'learning_rate': 7.083591331269351e-05, 'epoch': 0.65}


 65%|██████▍   | 2092/3235 [37:42<19:39,  1.03s/it]

{'loss': 1.0095, 'grad_norm': 0.6590426564216614, 'learning_rate': 7.077399380804955e-05, 'epoch': 0.65}


 65%|██████▍   | 2093/3235 [37:43<20:59,  1.10s/it]

{'loss': 1.0162, 'grad_norm': 0.4656359851360321, 'learning_rate': 7.071207430340557e-05, 'epoch': 0.65}


 65%|██████▍   | 2094/3235 [37:44<21:09,  1.11s/it]

{'loss': 1.0658, 'grad_norm': 0.47391757369041443, 'learning_rate': 7.065015479876161e-05, 'epoch': 0.65}


 65%|██████▍   | 2095/3235 [37:45<21:15,  1.12s/it]

{'loss': 1.1706, 'grad_norm': 0.5050564408302307, 'learning_rate': 7.058823529411765e-05, 'epoch': 0.65}


 65%|██████▍   | 2096/3235 [37:46<20:53,  1.10s/it]

{'loss': 1.0419, 'grad_norm': 0.47228318452835083, 'learning_rate': 7.052631578947368e-05, 'epoch': 0.65}


 65%|██████▍   | 2097/3235 [37:47<20:56,  1.10s/it]

{'loss': 1.0503, 'grad_norm': 0.5172711610794067, 'learning_rate': 7.046439628482972e-05, 'epoch': 0.65}


 65%|██████▍   | 2098/3235 [37:48<20:38,  1.09s/it]

{'loss': 1.0194, 'grad_norm': 0.46644070744514465, 'learning_rate': 7.040247678018576e-05, 'epoch': 0.65}


 65%|██████▍   | 2099/3235 [37:50<21:41,  1.15s/it]

{'loss': 0.9633, 'grad_norm': 0.4101526439189911, 'learning_rate': 7.03405572755418e-05, 'epoch': 0.65}


 65%|██████▍   | 2100/3235 [37:51<21:33,  1.14s/it]

{'loss': 0.8614, 'grad_norm': 0.46512117981910706, 'learning_rate': 7.027863777089784e-05, 'epoch': 0.65}


 65%|██████▍   | 2101/3235 [37:52<20:40,  1.09s/it]

{'loss': 0.9997, 'grad_norm': 0.4941880404949188, 'learning_rate': 7.021671826625388e-05, 'epoch': 0.65}


 65%|██████▍   | 2102/3235 [37:53<21:46,  1.15s/it]

{'loss': 1.0283, 'grad_norm': 0.4293067753314972, 'learning_rate': 7.015479876160991e-05, 'epoch': 0.65}


 65%|██████▌   | 2103/3235 [37:54<22:09,  1.17s/it]

{'loss': 1.0788, 'grad_norm': 0.46853160858154297, 'learning_rate': 7.009287925696595e-05, 'epoch': 0.65}


 65%|██████▌   | 2104/3235 [37:55<21:13,  1.13s/it]

{'loss': 0.9644, 'grad_norm': 0.4867280423641205, 'learning_rate': 7.003095975232199e-05, 'epoch': 0.65}


 65%|██████▌   | 2105/3235 [37:57<22:01,  1.17s/it]

{'loss': 1.034, 'grad_norm': 0.4085655212402344, 'learning_rate': 6.996904024767801e-05, 'epoch': 0.65}


 65%|██████▌   | 2106/3235 [37:58<21:47,  1.16s/it]

{'loss': 0.8656, 'grad_norm': 0.4587029218673706, 'learning_rate': 6.990712074303405e-05, 'epoch': 0.65}


 65%|██████▌   | 2107/3235 [37:59<19:55,  1.06s/it]

{'loss': 1.0657, 'grad_norm': 0.5015133619308472, 'learning_rate': 6.984520123839011e-05, 'epoch': 0.65}


 65%|██████▌   | 2108/3235 [38:00<20:43,  1.10s/it]

{'loss': 1.0407, 'grad_norm': 0.4931783974170685, 'learning_rate': 6.978328173374613e-05, 'epoch': 0.65}


 65%|██████▌   | 2109/3235 [38:01<19:33,  1.04s/it]

{'loss': 0.9354, 'grad_norm': 0.5279909372329712, 'learning_rate': 6.972136222910217e-05, 'epoch': 0.65}


 65%|██████▌   | 2110/3235 [38:02<20:10,  1.08s/it]

{'loss': 1.0033, 'grad_norm': 0.48176681995391846, 'learning_rate': 6.965944272445821e-05, 'epoch': 0.65}


 65%|██████▌   | 2111/3235 [38:03<20:15,  1.08s/it]

{'loss': 1.0777, 'grad_norm': 0.4825449585914612, 'learning_rate': 6.959752321981424e-05, 'epoch': 0.65}


 65%|██████▌   | 2112/3235 [38:04<19:02,  1.02s/it]

{'loss': 0.9924, 'grad_norm': 0.5524982213973999, 'learning_rate': 6.953560371517028e-05, 'epoch': 0.65}


 65%|██████▌   | 2113/3235 [38:05<20:07,  1.08s/it]

{'loss': 1.0647, 'grad_norm': 0.4834529161453247, 'learning_rate': 6.947368421052632e-05, 'epoch': 0.65}


 65%|██████▌   | 2114/3235 [38:06<19:49,  1.06s/it]

{'loss': 1.0743, 'grad_norm': 0.5043324828147888, 'learning_rate': 6.941176470588236e-05, 'epoch': 0.65}


 65%|██████▌   | 2115/3235 [38:07<20:52,  1.12s/it]

{'loss': 1.015, 'grad_norm': 0.46247825026512146, 'learning_rate': 6.93498452012384e-05, 'epoch': 0.65}


 65%|██████▌   | 2116/3235 [38:08<20:28,  1.10s/it]

{'loss': 0.9811, 'grad_norm': 0.491716206073761, 'learning_rate': 6.928792569659444e-05, 'epoch': 0.65}


 65%|██████▌   | 2117/3235 [38:09<20:06,  1.08s/it]

{'loss': 0.9665, 'grad_norm': 0.48468315601348877, 'learning_rate': 6.922600619195047e-05, 'epoch': 0.65}


 65%|██████▌   | 2118/3235 [38:11<21:32,  1.16s/it]

{'loss': 0.9177, 'grad_norm': 0.46506690979003906, 'learning_rate': 6.916408668730651e-05, 'epoch': 0.65}


 66%|██████▌   | 2119/3235 [38:12<20:42,  1.11s/it]

{'loss': 0.8732, 'grad_norm': 0.4747826159000397, 'learning_rate': 6.910216718266255e-05, 'epoch': 0.66}


 66%|██████▌   | 2120/3235 [38:13<20:55,  1.13s/it]

{'loss': 1.0253, 'grad_norm': 0.4669435918331146, 'learning_rate': 6.904024767801857e-05, 'epoch': 0.66}


 66%|██████▌   | 2121/3235 [38:14<20:59,  1.13s/it]

{'loss': 1.0221, 'grad_norm': 0.5039915442466736, 'learning_rate': 6.897832817337461e-05, 'epoch': 0.66}


 66%|██████▌   | 2122/3235 [38:15<19:32,  1.05s/it]

{'loss': 0.9495, 'grad_norm': 0.5456237196922302, 'learning_rate': 6.891640866873065e-05, 'epoch': 0.66}


 66%|██████▌   | 2123/3235 [38:16<20:27,  1.10s/it]

{'loss': 1.0477, 'grad_norm': 0.45378246903419495, 'learning_rate': 6.88544891640867e-05, 'epoch': 0.66}


 66%|██████▌   | 2124/3235 [38:17<19:46,  1.07s/it]

{'loss': 1.0486, 'grad_norm': 0.4733760356903076, 'learning_rate': 6.879256965944273e-05, 'epoch': 0.66}


 66%|██████▌   | 2125/3235 [38:18<20:12,  1.09s/it]

{'loss': 1.1663, 'grad_norm': 0.5152910947799683, 'learning_rate': 6.873065015479877e-05, 'epoch': 0.66}


 66%|██████▌   | 2126/3235 [38:19<19:52,  1.08s/it]

{'loss': 0.916, 'grad_norm': 0.5092548727989197, 'learning_rate': 6.86687306501548e-05, 'epoch': 0.66}


 66%|██████▌   | 2127/3235 [38:21<21:04,  1.14s/it]

{'loss': 1.1216, 'grad_norm': 0.43343451619148254, 'learning_rate': 6.860681114551084e-05, 'epoch': 0.66}


 66%|██████▌   | 2128/3235 [38:22<21:17,  1.15s/it]

{'loss': 1.0117, 'grad_norm': 0.47992628812789917, 'learning_rate': 6.854489164086688e-05, 'epoch': 0.66}


 66%|██████▌   | 2129/3235 [38:23<21:05,  1.14s/it]

{'loss': 0.957, 'grad_norm': 0.4758724570274353, 'learning_rate': 6.84829721362229e-05, 'epoch': 0.66}


 66%|██████▌   | 2130/3235 [38:24<18:58,  1.03s/it]

{'loss': 0.9644, 'grad_norm': 0.5085723996162415, 'learning_rate': 6.842105263157895e-05, 'epoch': 0.66}


 66%|██████▌   | 2131/3235 [38:25<19:24,  1.06s/it]

{'loss': 1.0163, 'grad_norm': 0.48396676778793335, 'learning_rate': 6.8359133126935e-05, 'epoch': 0.66}


 66%|██████▌   | 2132/3235 [38:26<20:10,  1.10s/it]

{'loss': 1.0731, 'grad_norm': 0.45153599977493286, 'learning_rate': 6.829721362229103e-05, 'epoch': 0.66}


 66%|██████▌   | 2133/3235 [38:27<20:49,  1.13s/it]

{'loss': 1.1613, 'grad_norm': 0.4489448666572571, 'learning_rate': 6.823529411764707e-05, 'epoch': 0.66}


 66%|██████▌   | 2134/3235 [38:28<19:39,  1.07s/it]

{'loss': 0.8922, 'grad_norm': 0.5081321001052856, 'learning_rate': 6.817337461300309e-05, 'epoch': 0.66}


 66%|██████▌   | 2135/3235 [38:29<20:26,  1.12s/it]

{'loss': 0.9714, 'grad_norm': 0.45074138045310974, 'learning_rate': 6.811145510835913e-05, 'epoch': 0.66}


 66%|██████▌   | 2136/3235 [38:31<21:04,  1.15s/it]

{'loss': 0.7949, 'grad_norm': 0.4444904923439026, 'learning_rate': 6.804953560371517e-05, 'epoch': 0.66}


 66%|██████▌   | 2137/3235 [38:32<21:07,  1.15s/it]

{'loss': 1.0714, 'grad_norm': 0.44199317693710327, 'learning_rate': 6.79876160990712e-05, 'epoch': 0.66}


 66%|██████▌   | 2138/3235 [38:33<20:19,  1.11s/it]

{'loss': 0.9635, 'grad_norm': 0.48953187465667725, 'learning_rate': 6.792569659442724e-05, 'epoch': 0.66}


 66%|██████▌   | 2139/3235 [38:34<20:16,  1.11s/it]

{'loss': 0.9209, 'grad_norm': 0.4656834304332733, 'learning_rate': 6.786377708978329e-05, 'epoch': 0.66}


 66%|██████▌   | 2140/3235 [38:35<20:43,  1.14s/it]

{'loss': 1.048, 'grad_norm': 0.4837208092212677, 'learning_rate': 6.780185758513932e-05, 'epoch': 0.66}


 66%|██████▌   | 2141/3235 [38:36<19:00,  1.04s/it]

{'loss': 0.9468, 'grad_norm': 0.4768553376197815, 'learning_rate': 6.773993808049536e-05, 'epoch': 0.66}


 66%|██████▌   | 2142/3235 [38:37<19:11,  1.05s/it]

{'loss': 1.1469, 'grad_norm': 0.47262468934059143, 'learning_rate': 6.76780185758514e-05, 'epoch': 0.66}


 66%|██████▌   | 2143/3235 [38:38<18:39,  1.03s/it]

{'loss': 0.9046, 'grad_norm': 0.4997774064540863, 'learning_rate': 6.761609907120743e-05, 'epoch': 0.66}


 66%|██████▋   | 2144/3235 [38:39<18:38,  1.03s/it]

{'loss': 0.9873, 'grad_norm': 0.45924827456474304, 'learning_rate': 6.755417956656347e-05, 'epoch': 0.66}


 66%|██████▋   | 2145/3235 [38:40<18:33,  1.02s/it]

{'loss': 1.0937, 'grad_norm': 0.462238073348999, 'learning_rate': 6.74922600619195e-05, 'epoch': 0.66}


 66%|██████▋   | 2146/3235 [38:41<17:40,  1.03it/s]

{'loss': 0.9618, 'grad_norm': 0.49921396374702454, 'learning_rate': 6.743034055727555e-05, 'epoch': 0.66}


 66%|██████▋   | 2147/3235 [38:42<19:06,  1.05s/it]

{'loss': 1.0072, 'grad_norm': 0.42766183614730835, 'learning_rate': 6.736842105263159e-05, 'epoch': 0.66}


 66%|██████▋   | 2148/3235 [38:43<19:09,  1.06s/it]

{'loss': 1.1474, 'grad_norm': 0.5081475973129272, 'learning_rate': 6.730650154798763e-05, 'epoch': 0.66}


 66%|██████▋   | 2149/3235 [38:44<19:03,  1.05s/it]

{'loss': 1.0991, 'grad_norm': 0.6292262673377991, 'learning_rate': 6.724458204334365e-05, 'epoch': 0.66}


 66%|██████▋   | 2150/3235 [38:45<19:08,  1.06s/it]

{'loss': 1.1407, 'grad_norm': 0.46503013372421265, 'learning_rate': 6.718266253869969e-05, 'epoch': 0.66}


 66%|██████▋   | 2151/3235 [38:46<19:33,  1.08s/it]

{'loss': 1.1107, 'grad_norm': 0.4385930001735687, 'learning_rate': 6.712074303405573e-05, 'epoch': 0.66}


 67%|██████▋   | 2152/3235 [38:47<19:19,  1.07s/it]

{'loss': 1.0586, 'grad_norm': 0.47159892320632935, 'learning_rate': 6.705882352941176e-05, 'epoch': 0.67}


 67%|██████▋   | 2153/3235 [38:49<20:41,  1.15s/it]

{'loss': 1.0542, 'grad_norm': 0.43149256706237793, 'learning_rate': 6.69969040247678e-05, 'epoch': 0.67}


 67%|██████▋   | 2154/3235 [38:50<19:11,  1.06s/it]

{'loss': 1.003, 'grad_norm': 0.5864466428756714, 'learning_rate': 6.693498452012384e-05, 'epoch': 0.67}


 67%|██████▋   | 2155/3235 [38:51<18:28,  1.03s/it]

{'loss': 1.0123, 'grad_norm': 0.5793993473052979, 'learning_rate': 6.687306501547988e-05, 'epoch': 0.67}


 67%|██████▋   | 2156/3235 [38:52<19:46,  1.10s/it]

{'loss': 1.0692, 'grad_norm': 0.4528200030326843, 'learning_rate': 6.681114551083592e-05, 'epoch': 0.67}


 67%|██████▋   | 2157/3235 [38:53<20:14,  1.13s/it]

{'loss': 0.8643, 'grad_norm': 0.44622427225112915, 'learning_rate': 6.674922600619196e-05, 'epoch': 0.67}


 67%|██████▋   | 2158/3235 [38:54<20:19,  1.13s/it]

{'loss': 0.9974, 'grad_norm': 0.45201197266578674, 'learning_rate': 6.668730650154799e-05, 'epoch': 0.67}


 67%|██████▋   | 2159/3235 [38:55<19:29,  1.09s/it]

{'loss': 0.9821, 'grad_norm': 0.5046445727348328, 'learning_rate': 6.662538699690403e-05, 'epoch': 0.67}


 67%|██████▋   | 2160/3235 [38:56<20:17,  1.13s/it]

{'loss': 1.0945, 'grad_norm': 0.5018527507781982, 'learning_rate': 6.656346749226007e-05, 'epoch': 0.67}


 67%|██████▋   | 2161/3235 [38:58<21:14,  1.19s/it]

{'loss': 0.9431, 'grad_norm': 0.4206349849700928, 'learning_rate': 6.650154798761609e-05, 'epoch': 0.67}


 67%|██████▋   | 2162/3235 [38:59<20:18,  1.14s/it]

{'loss': 1.0078, 'grad_norm': 0.48319581151008606, 'learning_rate': 6.643962848297213e-05, 'epoch': 0.67}


 67%|██████▋   | 2163/3235 [39:00<18:43,  1.05s/it]

{'loss': 0.9992, 'grad_norm': 0.5165929794311523, 'learning_rate': 6.637770897832819e-05, 'epoch': 0.67}


 67%|██████▋   | 2164/3235 [39:00<17:46,  1.00it/s]

{'loss': 0.912, 'grad_norm': 0.48144254088401794, 'learning_rate': 6.631578947368421e-05, 'epoch': 0.67}


 67%|██████▋   | 2165/3235 [39:02<18:37,  1.04s/it]

{'loss': 1.0218, 'grad_norm': 0.4981589615345001, 'learning_rate': 6.625386996904025e-05, 'epoch': 0.67}


 67%|██████▋   | 2166/3235 [39:03<19:52,  1.12s/it]

{'loss': 1.0043, 'grad_norm': 0.4364224076271057, 'learning_rate': 6.619195046439629e-05, 'epoch': 0.67}


 67%|██████▋   | 2167/3235 [39:04<20:52,  1.17s/it]

{'loss': 1.0091, 'grad_norm': 0.43591177463531494, 'learning_rate': 6.613003095975232e-05, 'epoch': 0.67}


 67%|██████▋   | 2168/3235 [39:05<20:17,  1.14s/it]

{'loss': 0.9924, 'grad_norm': 0.5227799415588379, 'learning_rate': 6.606811145510836e-05, 'epoch': 0.67}


 67%|██████▋   | 2169/3235 [39:06<19:43,  1.11s/it]

{'loss': 1.0125, 'grad_norm': 0.537639856338501, 'learning_rate': 6.60061919504644e-05, 'epoch': 0.67}


 67%|██████▋   | 2170/3235 [39:07<20:02,  1.13s/it]

{'loss': 1.0617, 'grad_norm': 0.49932560324668884, 'learning_rate': 6.594427244582044e-05, 'epoch': 0.67}


 67%|██████▋   | 2171/3235 [39:09<19:49,  1.12s/it]

{'loss': 1.0987, 'grad_norm': 0.5256311893463135, 'learning_rate': 6.588235294117648e-05, 'epoch': 0.67}


 67%|██████▋   | 2172/3235 [39:10<20:25,  1.15s/it]

{'loss': 0.8826, 'grad_norm': 0.4607546925544739, 'learning_rate': 6.582043343653252e-05, 'epoch': 0.67}


 67%|██████▋   | 2173/3235 [39:11<20:49,  1.18s/it]

{'loss': 1.1197, 'grad_norm': 0.4646201431751251, 'learning_rate': 6.575851393188854e-05, 'epoch': 0.67}


 67%|██████▋   | 2174/3235 [39:12<19:27,  1.10s/it]

{'loss': 1.0671, 'grad_norm': 0.5389634370803833, 'learning_rate': 6.569659442724458e-05, 'epoch': 0.67}


 67%|██████▋   | 2175/3235 [39:13<17:29,  1.01it/s]

{'loss': 1.1232, 'grad_norm': 0.546622633934021, 'learning_rate': 6.563467492260062e-05, 'epoch': 0.67}


 67%|██████▋   | 2176/3235 [39:14<17:13,  1.03it/s]

{'loss': 1.0865, 'grad_norm': 0.5190603137016296, 'learning_rate': 6.557275541795665e-05, 'epoch': 0.67}


 67%|██████▋   | 2177/3235 [39:15<18:57,  1.08s/it]

{'loss': 0.999, 'grad_norm': 0.4845064580440521, 'learning_rate': 6.551083591331269e-05, 'epoch': 0.67}


 67%|██████▋   | 2178/3235 [39:16<19:17,  1.09s/it]

{'loss': 1.0054, 'grad_norm': 0.4942077696323395, 'learning_rate': 6.544891640866873e-05, 'epoch': 0.67}


 67%|██████▋   | 2179/3235 [39:17<18:38,  1.06s/it]

{'loss': 1.0595, 'grad_norm': 0.4806395471096039, 'learning_rate': 6.538699690402477e-05, 'epoch': 0.67}


 67%|██████▋   | 2180/3235 [39:18<18:28,  1.05s/it]

{'loss': 1.0224, 'grad_norm': 0.47331130504608154, 'learning_rate': 6.532507739938081e-05, 'epoch': 0.67}


 67%|██████▋   | 2181/3235 [39:19<17:11,  1.02it/s]

{'loss': 0.892, 'grad_norm': 0.5246090292930603, 'learning_rate': 6.526315789473685e-05, 'epoch': 0.67}


 67%|██████▋   | 2182/3235 [39:20<19:06,  1.09s/it]

{'loss': 0.9215, 'grad_norm': 0.4158305525779724, 'learning_rate': 6.520123839009288e-05, 'epoch': 0.67}


 67%|██████▋   | 2183/3235 [39:21<19:43,  1.13s/it]

{'loss': 1.1252, 'grad_norm': 0.4735691547393799, 'learning_rate': 6.513931888544892e-05, 'epoch': 0.67}


 68%|██████▊   | 2184/3235 [39:23<20:56,  1.20s/it]

{'loss': 1.0947, 'grad_norm': 0.4447014629840851, 'learning_rate': 6.507739938080496e-05, 'epoch': 0.68}


 68%|██████▊   | 2185/3235 [39:24<19:37,  1.12s/it]

{'loss': 0.9979, 'grad_norm': 0.4779491722583771, 'learning_rate': 6.501547987616098e-05, 'epoch': 0.68}


 68%|██████▊   | 2186/3235 [39:25<18:30,  1.06s/it]

{'loss': 1.0426, 'grad_norm': 0.48573070764541626, 'learning_rate': 6.495356037151702e-05, 'epoch': 0.68}


 68%|██████▊   | 2187/3235 [39:26<20:43,  1.19s/it]

{'loss': 0.8088, 'grad_norm': 0.40874460339546204, 'learning_rate': 6.489164086687308e-05, 'epoch': 0.68}


 68%|██████▊   | 2188/3235 [39:27<19:45,  1.13s/it]

{'loss': 0.9873, 'grad_norm': 0.4668997526168823, 'learning_rate': 6.48297213622291e-05, 'epoch': 0.68}


 68%|██████▊   | 2189/3235 [39:28<18:21,  1.05s/it]

{'loss': 0.9724, 'grad_norm': 0.517004668712616, 'learning_rate': 6.476780185758514e-05, 'epoch': 0.68}


 68%|██████▊   | 2190/3235 [39:29<18:57,  1.09s/it]

{'loss': 0.9466, 'grad_norm': 0.47844281792640686, 'learning_rate': 6.470588235294118e-05, 'epoch': 0.68}


 68%|██████▊   | 2191/3235 [39:30<18:36,  1.07s/it]

{'loss': 1.0695, 'grad_norm': 0.5204182267189026, 'learning_rate': 6.464396284829721e-05, 'epoch': 0.68}


 68%|██████▊   | 2192/3235 [39:31<18:26,  1.06s/it]

{'loss': 1.0661, 'grad_norm': 0.5056304335594177, 'learning_rate': 6.458204334365325e-05, 'epoch': 0.68}


 68%|██████▊   | 2193/3235 [39:32<18:56,  1.09s/it]

{'loss': 1.0123, 'grad_norm': 0.45565977692604065, 'learning_rate': 6.452012383900929e-05, 'epoch': 0.68}


 68%|██████▊   | 2194/3235 [39:34<19:10,  1.10s/it]

{'loss': 1.0378, 'grad_norm': 0.519726574420929, 'learning_rate': 6.445820433436533e-05, 'epoch': 0.68}


 68%|██████▊   | 2195/3235 [39:35<18:28,  1.07s/it]

{'loss': 0.87, 'grad_norm': 0.4674248993396759, 'learning_rate': 6.439628482972137e-05, 'epoch': 0.68}


 68%|██████▊   | 2196/3235 [39:36<19:23,  1.12s/it]

{'loss': 1.1933, 'grad_norm': 0.48206326365470886, 'learning_rate': 6.433436532507741e-05, 'epoch': 0.68}


 68%|██████▊   | 2197/3235 [39:37<19:33,  1.13s/it]

{'loss': 1.0701, 'grad_norm': 0.4738313853740692, 'learning_rate': 6.427244582043344e-05, 'epoch': 0.68}


 68%|██████▊   | 2198/3235 [39:38<19:37,  1.14s/it]

{'loss': 1.0329, 'grad_norm': 0.5507681369781494, 'learning_rate': 6.421052631578948e-05, 'epoch': 0.68}


 68%|██████▊   | 2199/3235 [39:39<18:55,  1.10s/it]

{'loss': 1.0577, 'grad_norm': 0.4921213388442993, 'learning_rate': 6.414860681114552e-05, 'epoch': 0.68}


 68%|██████▊   | 2200/3235 [39:40<18:57,  1.10s/it]

{'loss': 1.0587, 'grad_norm': 0.4169671833515167, 'learning_rate': 6.408668730650154e-05, 'epoch': 0.68}


 68%|██████▊   | 2201/3235 [39:41<18:44,  1.09s/it]

{'loss': 0.9049, 'grad_norm': 0.4870328903198242, 'learning_rate': 6.402476780185758e-05, 'epoch': 0.68}


 68%|██████▊   | 2202/3235 [39:42<18:06,  1.05s/it]

{'loss': 1.0221, 'grad_norm': 0.5046297907829285, 'learning_rate': 6.396284829721362e-05, 'epoch': 0.68}


 68%|██████▊   | 2203/3235 [39:43<17:29,  1.02s/it]

{'loss': 0.9666, 'grad_norm': 0.5151554346084595, 'learning_rate': 6.390092879256966e-05, 'epoch': 0.68}


 68%|██████▊   | 2204/3235 [39:44<18:55,  1.10s/it]

{'loss': 1.2024, 'grad_norm': 0.5278751254081726, 'learning_rate': 6.38390092879257e-05, 'epoch': 0.68}


 68%|██████▊   | 2205/3235 [39:46<18:54,  1.10s/it]

{'loss': 1.0508, 'grad_norm': 0.4795638620853424, 'learning_rate': 6.377708978328174e-05, 'epoch': 0.68}


 68%|██████▊   | 2206/3235 [39:47<18:00,  1.05s/it]

{'loss': 0.9588, 'grad_norm': 0.4979705512523651, 'learning_rate': 6.371517027863777e-05, 'epoch': 0.68}


 68%|██████▊   | 2207/3235 [39:48<17:56,  1.05s/it]

{'loss': 0.8656, 'grad_norm': 0.47604307532310486, 'learning_rate': 6.365325077399381e-05, 'epoch': 0.68}


 68%|██████▊   | 2208/3235 [39:49<18:31,  1.08s/it]

{'loss': 1.1121, 'grad_norm': 0.49677711725234985, 'learning_rate': 6.359133126934985e-05, 'epoch': 0.68}


 68%|██████▊   | 2209/3235 [39:50<19:06,  1.12s/it]

{'loss': 0.8637, 'grad_norm': 0.4483051896095276, 'learning_rate': 6.352941176470588e-05, 'epoch': 0.68}


 68%|██████▊   | 2210/3235 [39:51<19:24,  1.14s/it]

{'loss': 1.1359, 'grad_norm': 0.46604400873184204, 'learning_rate': 6.346749226006192e-05, 'epoch': 0.68}


 68%|██████▊   | 2211/3235 [39:52<19:59,  1.17s/it]

{'loss': 1.0347, 'grad_norm': 0.49708202481269836, 'learning_rate': 6.340557275541796e-05, 'epoch': 0.68}


 68%|██████▊   | 2212/3235 [39:54<19:54,  1.17s/it]

{'loss': 0.8881, 'grad_norm': 0.5128912329673767, 'learning_rate': 6.3343653250774e-05, 'epoch': 0.68}


 68%|██████▊   | 2213/3235 [39:55<19:21,  1.14s/it]

{'loss': 0.9758, 'grad_norm': 0.46938326954841614, 'learning_rate': 6.328173374613004e-05, 'epoch': 0.68}


 68%|██████▊   | 2214/3235 [39:56<18:45,  1.10s/it]

{'loss': 0.9426, 'grad_norm': 0.5013130307197571, 'learning_rate': 6.321981424148608e-05, 'epoch': 0.68}


 68%|██████▊   | 2215/3235 [39:57<18:35,  1.09s/it]

{'loss': 0.9505, 'grad_norm': 0.4397929012775421, 'learning_rate': 6.31578947368421e-05, 'epoch': 0.68}


 69%|██████▊   | 2216/3235 [39:57<17:03,  1.00s/it]

{'loss': 0.9356, 'grad_norm': 0.5344246029853821, 'learning_rate': 6.309597523219814e-05, 'epoch': 0.69}


 69%|██████▊   | 2217/3235 [39:59<17:15,  1.02s/it]

{'loss': 1.2303, 'grad_norm': 0.5473660826683044, 'learning_rate': 6.303405572755418e-05, 'epoch': 0.69}


 69%|██████▊   | 2218/3235 [40:00<17:15,  1.02s/it]

{'loss': 0.9765, 'grad_norm': 0.5636795163154602, 'learning_rate': 6.297213622291021e-05, 'epoch': 0.69}


 69%|██████▊   | 2219/3235 [40:01<18:07,  1.07s/it]

{'loss': 0.9828, 'grad_norm': 0.444711834192276, 'learning_rate': 6.291021671826626e-05, 'epoch': 0.69}


 69%|██████▊   | 2220/3235 [40:02<17:33,  1.04s/it]

{'loss': 1.0288, 'grad_norm': 0.520804226398468, 'learning_rate': 6.28482972136223e-05, 'epoch': 0.69}


 69%|██████▊   | 2221/3235 [40:03<17:26,  1.03s/it]

{'loss': 1.0154, 'grad_norm': 0.46881619095802307, 'learning_rate': 6.278637770897833e-05, 'epoch': 0.69}


 69%|██████▊   | 2222/3235 [40:04<18:13,  1.08s/it]

{'loss': 1.0565, 'grad_norm': 0.4768882989883423, 'learning_rate': 6.272445820433437e-05, 'epoch': 0.69}


 69%|██████▊   | 2223/3235 [40:05<18:14,  1.08s/it]

{'loss': 1.0463, 'grad_norm': 0.49484023451805115, 'learning_rate': 6.266253869969041e-05, 'epoch': 0.69}


 69%|██████▊   | 2224/3235 [40:06<20:07,  1.19s/it]

{'loss': 1.339, 'grad_norm': 0.49306026101112366, 'learning_rate': 6.260061919504644e-05, 'epoch': 0.69}


 69%|██████▉   | 2225/3235 [40:07<19:02,  1.13s/it]

{'loss': 0.9605, 'grad_norm': 0.4502630829811096, 'learning_rate': 6.253869969040248e-05, 'epoch': 0.69}


 69%|██████▉   | 2226/3235 [40:09<18:54,  1.12s/it]

{'loss': 1.024, 'grad_norm': 0.4493965208530426, 'learning_rate': 6.247678018575852e-05, 'epoch': 0.69}


 69%|██████▉   | 2227/3235 [40:10<19:43,  1.17s/it]

{'loss': 1.1115, 'grad_norm': 0.46591708064079285, 'learning_rate': 6.241486068111456e-05, 'epoch': 0.69}


 69%|██████▉   | 2228/3235 [40:11<20:25,  1.22s/it]

{'loss': 0.9603, 'grad_norm': 0.4654037356376648, 'learning_rate': 6.23529411764706e-05, 'epoch': 0.69}


 69%|██████▉   | 2229/3235 [40:12<18:55,  1.13s/it]

{'loss': 0.8243, 'grad_norm': 0.4810275435447693, 'learning_rate': 6.229102167182664e-05, 'epoch': 0.69}


 69%|██████▉   | 2230/3235 [40:13<18:02,  1.08s/it]

{'loss': 1.0913, 'grad_norm': 0.49675506353378296, 'learning_rate': 6.222910216718266e-05, 'epoch': 0.69}


 69%|██████▉   | 2231/3235 [40:14<18:42,  1.12s/it]

{'loss': 0.9451, 'grad_norm': 0.4703710377216339, 'learning_rate': 6.21671826625387e-05, 'epoch': 0.69}


 69%|██████▉   | 2232/3235 [40:16<19:49,  1.19s/it]

{'loss': 1.113, 'grad_norm': 0.44496941566467285, 'learning_rate': 6.210526315789474e-05, 'epoch': 0.69}


 69%|██████▉   | 2233/3235 [40:17<18:57,  1.14s/it]

{'loss': 1.0206, 'grad_norm': 0.473299503326416, 'learning_rate': 6.204334365325077e-05, 'epoch': 0.69}


 69%|██████▉   | 2234/3235 [40:18<18:15,  1.09s/it]

{'loss': 0.9598, 'grad_norm': 0.5030916929244995, 'learning_rate': 6.198142414860681e-05, 'epoch': 0.69}


 69%|██████▉   | 2235/3235 [40:19<17:29,  1.05s/it]

{'loss': 0.9116, 'grad_norm': 0.47564709186553955, 'learning_rate': 6.191950464396285e-05, 'epoch': 0.69}


 69%|██████▉   | 2236/3235 [40:20<17:44,  1.07s/it]

{'loss': 1.1527, 'grad_norm': 0.47938498854637146, 'learning_rate': 6.185758513931889e-05, 'epoch': 0.69}


 69%|██████▉   | 2237/3235 [40:21<18:13,  1.10s/it]

{'loss': 1.035, 'grad_norm': 0.4870794117450714, 'learning_rate': 6.179566563467493e-05, 'epoch': 0.69}


 69%|██████▉   | 2238/3235 [40:22<19:03,  1.15s/it]

{'loss': 1.0182, 'grad_norm': 0.5059289932250977, 'learning_rate': 6.173374613003097e-05, 'epoch': 0.69}


 69%|██████▉   | 2239/3235 [40:23<19:04,  1.15s/it]

{'loss': 1.0918, 'grad_norm': 0.492163747549057, 'learning_rate': 6.1671826625387e-05, 'epoch': 0.69}


 69%|██████▉   | 2240/3235 [40:24<18:09,  1.10s/it]

{'loss': 0.9015, 'grad_norm': 0.5462119579315186, 'learning_rate': 6.160990712074304e-05, 'epoch': 0.69}


 69%|██████▉   | 2241/3235 [40:25<17:41,  1.07s/it]

{'loss': 0.9732, 'grad_norm': 0.4792194664478302, 'learning_rate': 6.154798761609908e-05, 'epoch': 0.69}


 69%|██████▉   | 2242/3235 [40:26<17:56,  1.08s/it]

{'loss': 0.8789, 'grad_norm': 0.4518696367740631, 'learning_rate': 6.14860681114551e-05, 'epoch': 0.69}


 69%|██████▉   | 2243/3235 [40:28<18:44,  1.13s/it]

{'loss': 1.0724, 'grad_norm': 0.4184631407260895, 'learning_rate': 6.142414860681116e-05, 'epoch': 0.69}


 69%|██████▉   | 2244/3235 [40:29<18:31,  1.12s/it]

{'loss': 1.0218, 'grad_norm': 0.5075533390045166, 'learning_rate': 6.13622291021672e-05, 'epoch': 0.69}


 69%|██████▉   | 2245/3235 [40:30<17:41,  1.07s/it]

{'loss': 0.9918, 'grad_norm': 0.513187825679779, 'learning_rate': 6.130030959752322e-05, 'epoch': 0.69}


 69%|██████▉   | 2246/3235 [40:31<18:19,  1.11s/it]

{'loss': 1.0933, 'grad_norm': 0.4737893044948578, 'learning_rate': 6.123839009287926e-05, 'epoch': 0.69}


 69%|██████▉   | 2247/3235 [40:32<19:38,  1.19s/it]

{'loss': 1.2545, 'grad_norm': 0.49843931198120117, 'learning_rate': 6.11764705882353e-05, 'epoch': 0.69}


 69%|██████▉   | 2248/3235 [40:33<18:24,  1.12s/it]

{'loss': 0.9143, 'grad_norm': 0.46567052602767944, 'learning_rate': 6.111455108359133e-05, 'epoch': 0.69}


 70%|██████▉   | 2249/3235 [40:34<16:40,  1.01s/it]

{'loss': 0.9714, 'grad_norm': 0.5132730007171631, 'learning_rate': 6.105263157894737e-05, 'epoch': 0.7}


 70%|██████▉   | 2250/3235 [40:35<17:24,  1.06s/it]

{'loss': 0.9405, 'grad_norm': 0.4753718376159668, 'learning_rate': 6.0990712074303416e-05, 'epoch': 0.7}


 70%|██████▉   | 2251/3235 [40:36<16:29,  1.01s/it]

{'loss': 0.9806, 'grad_norm': 0.5987052321434021, 'learning_rate': 6.092879256965944e-05, 'epoch': 0.7}


 70%|██████▉   | 2252/3235 [40:37<16:27,  1.00s/it]

{'loss': 0.9681, 'grad_norm': 0.4789021909236908, 'learning_rate': 6.086687306501548e-05, 'epoch': 0.7}


 70%|██████▉   | 2253/3235 [40:38<16:56,  1.04s/it]

{'loss': 0.9787, 'grad_norm': 0.4651888608932495, 'learning_rate': 6.080495356037152e-05, 'epoch': 0.7}


 70%|██████▉   | 2254/3235 [40:39<17:39,  1.08s/it]

{'loss': 0.9812, 'grad_norm': 0.4426144063472748, 'learning_rate': 6.0743034055727555e-05, 'epoch': 0.7}


 70%|██████▉   | 2255/3235 [40:41<18:54,  1.16s/it]

{'loss': 1.1698, 'grad_norm': 0.46053335070610046, 'learning_rate': 6.0681114551083595e-05, 'epoch': 0.7}


 70%|██████▉   | 2256/3235 [40:42<17:51,  1.09s/it]

{'loss': 1.0993, 'grad_norm': 0.5941175818443298, 'learning_rate': 6.0619195046439635e-05, 'epoch': 0.7}


 70%|██████▉   | 2257/3235 [40:43<18:38,  1.14s/it]

{'loss': 0.9181, 'grad_norm': 0.4100373089313507, 'learning_rate': 6.055727554179567e-05, 'epoch': 0.7}


 70%|██████▉   | 2258/3235 [40:44<18:37,  1.14s/it]

{'loss': 1.084, 'grad_norm': 0.49413764476776123, 'learning_rate': 6.049535603715171e-05, 'epoch': 0.7}


 70%|██████▉   | 2259/3235 [40:45<19:38,  1.21s/it]

{'loss': 1.0196, 'grad_norm': 0.42574432492256165, 'learning_rate': 6.043343653250775e-05, 'epoch': 0.7}


 70%|██████▉   | 2260/3235 [40:46<19:22,  1.19s/it]

{'loss': 1.0309, 'grad_norm': 0.4305903911590576, 'learning_rate': 6.0371517027863775e-05, 'epoch': 0.7}


 70%|██████▉   | 2261/3235 [40:47<18:19,  1.13s/it]

{'loss': 1.0182, 'grad_norm': 0.506762683391571, 'learning_rate': 6.030959752321982e-05, 'epoch': 0.7}


 70%|██████▉   | 2262/3235 [40:48<17:03,  1.05s/it]

{'loss': 0.949, 'grad_norm': 0.538994550704956, 'learning_rate': 6.024767801857586e-05, 'epoch': 0.7}


 70%|██████▉   | 2263/3235 [40:49<17:27,  1.08s/it]

{'loss': 0.9572, 'grad_norm': 0.47138142585754395, 'learning_rate': 6.018575851393189e-05, 'epoch': 0.7}


 70%|██████▉   | 2264/3235 [40:50<17:00,  1.05s/it]

{'loss': 1.0923, 'grad_norm': 0.47643718123435974, 'learning_rate': 6.012383900928793e-05, 'epoch': 0.7}


 70%|███████   | 2265/3235 [40:51<16:15,  1.01s/it]

{'loss': 0.9979, 'grad_norm': 0.49425166845321655, 'learning_rate': 6.006191950464397e-05, 'epoch': 0.7}


 70%|███████   | 2266/3235 [40:52<16:05,  1.00it/s]

{'loss': 1.1736, 'grad_norm': 0.4885154664516449, 'learning_rate': 6e-05, 'epoch': 0.7}


 70%|███████   | 2267/3235 [40:53<15:48,  1.02it/s]

{'loss': 0.9816, 'grad_norm': 0.5267581343650818, 'learning_rate': 5.993808049535604e-05, 'epoch': 0.7}


 70%|███████   | 2268/3235 [40:54<14:59,  1.08it/s]

{'loss': 1.0166, 'grad_norm': 0.5438356399536133, 'learning_rate': 5.987616099071208e-05, 'epoch': 0.7}


 70%|███████   | 2269/3235 [40:55<15:45,  1.02it/s]

{'loss': 0.9518, 'grad_norm': 0.442352831363678, 'learning_rate': 5.9814241486068115e-05, 'epoch': 0.7}


 70%|███████   | 2270/3235 [40:56<15:51,  1.01it/s]

{'loss': 0.9066, 'grad_norm': 0.5016307830810547, 'learning_rate': 5.9752321981424155e-05, 'epoch': 0.7}


 70%|███████   | 2271/3235 [40:57<17:09,  1.07s/it]

{'loss': 1.159, 'grad_norm': 0.4893302321434021, 'learning_rate': 5.9690402476780195e-05, 'epoch': 0.7}


 70%|███████   | 2272/3235 [40:58<17:07,  1.07s/it]

{'loss': 1.0209, 'grad_norm': 0.5164808630943298, 'learning_rate': 5.962848297213622e-05, 'epoch': 0.7}


 70%|███████   | 2273/3235 [41:00<17:28,  1.09s/it]

{'loss': 0.981, 'grad_norm': 0.4965241253376007, 'learning_rate': 5.956656346749226e-05, 'epoch': 0.7}


 70%|███████   | 2274/3235 [41:01<16:56,  1.06s/it]

{'loss': 1.0054, 'grad_norm': 0.5349296927452087, 'learning_rate': 5.950464396284831e-05, 'epoch': 0.7}


 70%|███████   | 2275/3235 [41:02<17:36,  1.10s/it]

{'loss': 1.0976, 'grad_norm': 0.5249476432800293, 'learning_rate': 5.9442724458204335e-05, 'epoch': 0.7}


 70%|███████   | 2276/3235 [41:03<16:14,  1.02s/it]

{'loss': 0.8846, 'grad_norm': 0.5081495046615601, 'learning_rate': 5.9380804953560375e-05, 'epoch': 0.7}


 70%|███████   | 2277/3235 [41:04<17:09,  1.07s/it]

{'loss': 1.0616, 'grad_norm': 0.4730571508407593, 'learning_rate': 5.9318885448916415e-05, 'epoch': 0.7}


 70%|███████   | 2278/3235 [41:05<17:18,  1.09s/it]

{'loss': 0.9283, 'grad_norm': 0.425864577293396, 'learning_rate': 5.925696594427245e-05, 'epoch': 0.7}


 70%|███████   | 2279/3235 [41:06<17:09,  1.08s/it]

{'loss': 0.926, 'grad_norm': 0.5045418739318848, 'learning_rate': 5.919504643962849e-05, 'epoch': 0.7}


 70%|███████   | 2280/3235 [41:07<16:44,  1.05s/it]

{'loss': 1.1991, 'grad_norm': 0.5131015181541443, 'learning_rate': 5.913312693498453e-05, 'epoch': 0.7}


 71%|███████   | 2281/3235 [41:08<17:01,  1.07s/it]

{'loss': 1.1437, 'grad_norm': 0.49751245975494385, 'learning_rate': 5.907120743034056e-05, 'epoch': 0.71}


 71%|███████   | 2282/3235 [41:09<16:31,  1.04s/it]

{'loss': 1.0956, 'grad_norm': 0.5528790354728699, 'learning_rate': 5.90092879256966e-05, 'epoch': 0.71}


 71%|███████   | 2283/3235 [41:10<16:35,  1.05s/it]

{'loss': 1.0081, 'grad_norm': 0.49188876152038574, 'learning_rate': 5.894736842105263e-05, 'epoch': 0.71}


 71%|███████   | 2284/3235 [41:11<16:40,  1.05s/it]

{'loss': 0.9776, 'grad_norm': 0.5330953001976013, 'learning_rate': 5.888544891640867e-05, 'epoch': 0.71}


 71%|███████   | 2285/3235 [41:12<15:57,  1.01s/it]

{'loss': 1.1555, 'grad_norm': 0.5963987708091736, 'learning_rate': 5.882352941176471e-05, 'epoch': 0.71}


 71%|███████   | 2286/3235 [41:13<15:58,  1.01s/it]

{'loss': 1.1113, 'grad_norm': 0.5119498372077942, 'learning_rate': 5.876160990712074e-05, 'epoch': 0.71}


 71%|███████   | 2287/3235 [41:14<16:08,  1.02s/it]

{'loss': 1.0616, 'grad_norm': 0.512908935546875, 'learning_rate': 5.869969040247678e-05, 'epoch': 0.71}


 71%|███████   | 2288/3235 [41:15<14:54,  1.06it/s]

{'loss': 0.9388, 'grad_norm': 0.5258103013038635, 'learning_rate': 5.863777089783282e-05, 'epoch': 0.71}


 71%|███████   | 2289/3235 [41:16<14:43,  1.07it/s]

{'loss': 0.9178, 'grad_norm': 0.46733635663986206, 'learning_rate': 5.8575851393188854e-05, 'epoch': 0.71}


 71%|███████   | 2290/3235 [41:17<15:19,  1.03it/s]

{'loss': 0.9911, 'grad_norm': 0.5059385895729065, 'learning_rate': 5.8513931888544894e-05, 'epoch': 0.71}


 71%|███████   | 2291/3235 [41:18<15:14,  1.03it/s]

{'loss': 0.9551, 'grad_norm': 0.5145018100738525, 'learning_rate': 5.8452012383900934e-05, 'epoch': 0.71}


 71%|███████   | 2292/3235 [41:19<15:53,  1.01s/it]

{'loss': 1.0344, 'grad_norm': 0.458955317735672, 'learning_rate': 5.839009287925696e-05, 'epoch': 0.71}


 71%|███████   | 2293/3235 [41:20<16:27,  1.05s/it]

{'loss': 1.0691, 'grad_norm': 0.44362208247184753, 'learning_rate': 5.832817337461301e-05, 'epoch': 0.71}


 71%|███████   | 2294/3235 [41:21<16:27,  1.05s/it]

{'loss': 1.0111, 'grad_norm': 0.5276951789855957, 'learning_rate': 5.826625386996905e-05, 'epoch': 0.71}


 71%|███████   | 2295/3235 [41:22<16:39,  1.06s/it]

{'loss': 1.0173, 'grad_norm': 0.44216132164001465, 'learning_rate': 5.8204334365325074e-05, 'epoch': 0.71}


 71%|███████   | 2296/3235 [41:23<16:34,  1.06s/it]

{'loss': 0.9578, 'grad_norm': 0.4691091775894165, 'learning_rate': 5.8142414860681114e-05, 'epoch': 0.71}


 71%|███████   | 2297/3235 [41:25<17:27,  1.12s/it]

{'loss': 1.0926, 'grad_norm': 0.5026765465736389, 'learning_rate': 5.8080495356037154e-05, 'epoch': 0.71}


 71%|███████   | 2298/3235 [41:25<16:14,  1.04s/it]

{'loss': 0.9151, 'grad_norm': 0.5164067149162292, 'learning_rate': 5.801857585139319e-05, 'epoch': 0.71}


 71%|███████   | 2299/3235 [41:27<17:11,  1.10s/it]

{'loss': 0.9603, 'grad_norm': 0.4477086663246155, 'learning_rate': 5.795665634674923e-05, 'epoch': 0.71}


 71%|███████   | 2300/3235 [41:28<17:22,  1.11s/it]

{'loss': 1.1228, 'grad_norm': 0.4532544016838074, 'learning_rate': 5.789473684210527e-05, 'epoch': 0.71}


 71%|███████   | 2301/3235 [41:29<16:40,  1.07s/it]

{'loss': 1.0588, 'grad_norm': 0.4938652217388153, 'learning_rate': 5.78328173374613e-05, 'epoch': 0.71}


 71%|███████   | 2302/3235 [41:30<16:46,  1.08s/it]

{'loss': 1.0376, 'grad_norm': 0.521826982498169, 'learning_rate': 5.777089783281734e-05, 'epoch': 0.71}


 71%|███████   | 2303/3235 [41:31<17:17,  1.11s/it]

{'loss': 1.0807, 'grad_norm': 0.5093720555305481, 'learning_rate': 5.770897832817338e-05, 'epoch': 0.71}


 71%|███████   | 2304/3235 [41:32<17:12,  1.11s/it]

{'loss': 0.8888, 'grad_norm': 0.496248334646225, 'learning_rate': 5.764705882352941e-05, 'epoch': 0.71}


 71%|███████▏  | 2305/3235 [41:33<17:27,  1.13s/it]

{'loss': 1.0372, 'grad_norm': 0.46768128871917725, 'learning_rate': 5.7585139318885454e-05, 'epoch': 0.71}


 71%|███████▏  | 2306/3235 [41:34<16:50,  1.09s/it]

{'loss': 1.0229, 'grad_norm': 0.5086036324501038, 'learning_rate': 5.7523219814241494e-05, 'epoch': 0.71}


 71%|███████▏  | 2307/3235 [41:35<16:14,  1.05s/it]

{'loss': 0.9566, 'grad_norm': 0.4681739807128906, 'learning_rate': 5.746130030959752e-05, 'epoch': 0.71}


 71%|███████▏  | 2308/3235 [41:36<16:18,  1.06s/it]

{'loss': 0.9415, 'grad_norm': 0.4944306015968323, 'learning_rate': 5.739938080495356e-05, 'epoch': 0.71}


 71%|███████▏  | 2309/3235 [41:38<16:35,  1.08s/it]

{'loss': 1.007, 'grad_norm': 0.47446903586387634, 'learning_rate': 5.73374613003096e-05, 'epoch': 0.71}


 71%|███████▏  | 2310/3235 [41:38<16:09,  1.05s/it]

{'loss': 0.9796, 'grad_norm': 0.4706081748008728, 'learning_rate': 5.727554179566563e-05, 'epoch': 0.71}


 71%|███████▏  | 2311/3235 [41:40<17:04,  1.11s/it]

{'loss': 1.0785, 'grad_norm': 0.4868687093257904, 'learning_rate': 5.721362229102167e-05, 'epoch': 0.71}


 71%|███████▏  | 2312/3235 [41:41<17:14,  1.12s/it]

{'loss': 0.9738, 'grad_norm': 0.5029776692390442, 'learning_rate': 5.715170278637771e-05, 'epoch': 0.71}


 71%|███████▏  | 2313/3235 [41:42<17:18,  1.13s/it]

{'loss': 1.2206, 'grad_norm': 0.5062943696975708, 'learning_rate': 5.7089783281733746e-05, 'epoch': 0.71}


 72%|███████▏  | 2314/3235 [41:43<16:31,  1.08s/it]

{'loss': 0.8949, 'grad_norm': 0.5024423599243164, 'learning_rate': 5.7027863777089786e-05, 'epoch': 0.72}


 72%|███████▏  | 2315/3235 [41:44<16:18,  1.06s/it]

{'loss': 0.8802, 'grad_norm': 0.4617179334163666, 'learning_rate': 5.6965944272445826e-05, 'epoch': 0.72}


 72%|███████▏  | 2316/3235 [41:45<16:26,  1.07s/it]

{'loss': 1.1015, 'grad_norm': 0.4896305203437805, 'learning_rate': 5.690402476780185e-05, 'epoch': 0.72}


 72%|███████▏  | 2317/3235 [41:46<15:53,  1.04s/it]

{'loss': 0.8654, 'grad_norm': 0.521687924861908, 'learning_rate': 5.68421052631579e-05, 'epoch': 0.72}


 72%|███████▏  | 2318/3235 [41:47<15:25,  1.01s/it]

{'loss': 0.9067, 'grad_norm': 0.5213393568992615, 'learning_rate': 5.678018575851394e-05, 'epoch': 0.72}


 72%|███████▏  | 2319/3235 [41:48<14:38,  1.04it/s]

{'loss': 0.9477, 'grad_norm': 0.5759119987487793, 'learning_rate': 5.6718266253869966e-05, 'epoch': 0.72}


 72%|███████▏  | 2320/3235 [41:49<15:44,  1.03s/it]

{'loss': 1.0235, 'grad_norm': 0.5044153332710266, 'learning_rate': 5.6656346749226006e-05, 'epoch': 0.72}


 72%|███████▏  | 2321/3235 [41:50<15:37,  1.03s/it]

{'loss': 0.9649, 'grad_norm': 0.4835347831249237, 'learning_rate': 5.6594427244582046e-05, 'epoch': 0.72}


 72%|███████▏  | 2322/3235 [41:51<15:42,  1.03s/it]

{'loss': 1.029, 'grad_norm': 0.5543975830078125, 'learning_rate': 5.653250773993808e-05, 'epoch': 0.72}


 72%|███████▏  | 2323/3235 [41:52<16:39,  1.10s/it]

{'loss': 1.0064, 'grad_norm': 0.4449673891067505, 'learning_rate': 5.647058823529412e-05, 'epoch': 0.72}


 72%|███████▏  | 2324/3235 [41:54<17:39,  1.16s/it]

{'loss': 1.002, 'grad_norm': 0.4847801923751831, 'learning_rate': 5.640866873065016e-05, 'epoch': 0.72}


 72%|███████▏  | 2325/3235 [41:55<16:52,  1.11s/it]

{'loss': 0.8409, 'grad_norm': 0.4820277690887451, 'learning_rate': 5.634674922600619e-05, 'epoch': 0.72}


 72%|███████▏  | 2326/3235 [41:56<16:28,  1.09s/it]

{'loss': 0.9429, 'grad_norm': 0.4776226282119751, 'learning_rate': 5.628482972136223e-05, 'epoch': 0.72}


 72%|███████▏  | 2327/3235 [41:57<15:51,  1.05s/it]

{'loss': 0.982, 'grad_norm': 0.5373939871788025, 'learning_rate': 5.622291021671827e-05, 'epoch': 0.72}


 72%|███████▏  | 2328/3235 [41:58<15:13,  1.01s/it]

{'loss': 1.1058, 'grad_norm': 0.5074027180671692, 'learning_rate': 5.61609907120743e-05, 'epoch': 0.72}


 72%|███████▏  | 2329/3235 [41:59<15:45,  1.04s/it]

{'loss': 1.0142, 'grad_norm': 0.486733078956604, 'learning_rate': 5.6099071207430346e-05, 'epoch': 0.72}


 72%|███████▏  | 2330/3235 [42:00<15:18,  1.01s/it]

{'loss': 0.9015, 'grad_norm': 0.4728718101978302, 'learning_rate': 5.6037151702786386e-05, 'epoch': 0.72}


 72%|███████▏  | 2331/3235 [42:01<16:10,  1.07s/it]

{'loss': 1.0358, 'grad_norm': 0.5057692527770996, 'learning_rate': 5.597523219814241e-05, 'epoch': 0.72}


 72%|███████▏  | 2332/3235 [42:02<15:41,  1.04s/it]

{'loss': 1.0451, 'grad_norm': 0.5203282833099365, 'learning_rate': 5.591331269349845e-05, 'epoch': 0.72}


 72%|███████▏  | 2333/3235 [42:03<15:39,  1.04s/it]

{'loss': 1.0281, 'grad_norm': 0.527927041053772, 'learning_rate': 5.585139318885449e-05, 'epoch': 0.72}


 72%|███████▏  | 2334/3235 [42:04<15:29,  1.03s/it]

{'loss': 1.0181, 'grad_norm': 0.5023716688156128, 'learning_rate': 5.5789473684210526e-05, 'epoch': 0.72}


 72%|███████▏  | 2335/3235 [42:05<15:33,  1.04s/it]

{'loss': 1.0169, 'grad_norm': 0.5410948395729065, 'learning_rate': 5.5727554179566566e-05, 'epoch': 0.72}


 72%|███████▏  | 2336/3235 [42:06<16:37,  1.11s/it]

{'loss': 0.9137, 'grad_norm': 0.42962414026260376, 'learning_rate': 5.5665634674922606e-05, 'epoch': 0.72}


 72%|███████▏  | 2337/3235 [42:07<16:04,  1.07s/it]

{'loss': 0.983, 'grad_norm': 0.5357270836830139, 'learning_rate': 5.560371517027864e-05, 'epoch': 0.72}


 72%|███████▏  | 2338/3235 [42:08<16:01,  1.07s/it]

{'loss': 0.9872, 'grad_norm': 0.5022158622741699, 'learning_rate': 5.554179566563468e-05, 'epoch': 0.72}


 72%|███████▏  | 2339/3235 [42:09<15:22,  1.03s/it]

{'loss': 0.8842, 'grad_norm': 0.589438259601593, 'learning_rate': 5.547987616099072e-05, 'epoch': 0.72}


 72%|███████▏  | 2340/3235 [42:10<15:45,  1.06s/it]

{'loss': 1.0219, 'grad_norm': 0.46439051628112793, 'learning_rate': 5.5417956656346745e-05, 'epoch': 0.72}


 72%|███████▏  | 2341/3235 [42:11<15:25,  1.03s/it]

{'loss': 0.9908, 'grad_norm': 0.5043372511863708, 'learning_rate': 5.5356037151702785e-05, 'epoch': 0.72}


 72%|███████▏  | 2342/3235 [42:12<15:48,  1.06s/it]

{'loss': 1.1249, 'grad_norm': 0.5112420916557312, 'learning_rate': 5.529411764705883e-05, 'epoch': 0.72}


 72%|███████▏  | 2343/3235 [42:14<16:26,  1.11s/it]

{'loss': 0.9807, 'grad_norm': 0.44342875480651855, 'learning_rate': 5.523219814241486e-05, 'epoch': 0.72}


 72%|███████▏  | 2344/3235 [42:15<16:23,  1.10s/it]

{'loss': 1.055, 'grad_norm': 0.5131305456161499, 'learning_rate': 5.51702786377709e-05, 'epoch': 0.72}


 72%|███████▏  | 2345/3235 [42:16<15:58,  1.08s/it]

{'loss': 0.9177, 'grad_norm': 0.5577465295791626, 'learning_rate': 5.510835913312694e-05, 'epoch': 0.72}


 73%|███████▎  | 2346/3235 [42:17<15:55,  1.07s/it]

{'loss': 0.9751, 'grad_norm': 0.5169970393180847, 'learning_rate': 5.504643962848297e-05, 'epoch': 0.73}


 73%|███████▎  | 2347/3235 [42:18<16:56,  1.14s/it]

{'loss': 0.8955, 'grad_norm': 0.46444374322891235, 'learning_rate': 5.498452012383901e-05, 'epoch': 0.73}


 73%|███████▎  | 2348/3235 [42:19<15:57,  1.08s/it]

{'loss': 0.9379, 'grad_norm': 0.5473175048828125, 'learning_rate': 5.492260061919505e-05, 'epoch': 0.73}


 73%|███████▎  | 2349/3235 [42:20<15:14,  1.03s/it]

{'loss': 1.068, 'grad_norm': 0.5044537782669067, 'learning_rate': 5.4860681114551085e-05, 'epoch': 0.73}


 73%|███████▎  | 2350/3235 [42:21<16:22,  1.11s/it]

{'loss': 0.9999, 'grad_norm': 0.5007531046867371, 'learning_rate': 5.4798761609907125e-05, 'epoch': 0.73}


 73%|███████▎  | 2351/3235 [42:22<15:34,  1.06s/it]

{'loss': 1.0108, 'grad_norm': 0.5175377130508423, 'learning_rate': 5.4736842105263165e-05, 'epoch': 0.73}


 73%|███████▎  | 2352/3235 [42:23<15:08,  1.03s/it]

{'loss': 0.9143, 'grad_norm': 0.4786519706249237, 'learning_rate': 5.467492260061919e-05, 'epoch': 0.73}


 73%|███████▎  | 2353/3235 [42:24<14:13,  1.03it/s]

{'loss': 1.0167, 'grad_norm': 0.569166898727417, 'learning_rate': 5.461300309597523e-05, 'epoch': 0.73}


 73%|███████▎  | 2354/3235 [42:25<13:52,  1.06it/s]

{'loss': 0.9266, 'grad_norm': 0.5526748895645142, 'learning_rate': 5.455108359133128e-05, 'epoch': 0.73}


 73%|███████▎  | 2355/3235 [42:26<15:10,  1.04s/it]

{'loss': 1.0213, 'grad_norm': 0.44009289145469666, 'learning_rate': 5.4489164086687305e-05, 'epoch': 0.73}


 73%|███████▎  | 2356/3235 [42:27<15:54,  1.09s/it]

{'loss': 1.0218, 'grad_norm': 0.4599245488643646, 'learning_rate': 5.4427244582043345e-05, 'epoch': 0.73}


 73%|███████▎  | 2357/3235 [42:28<15:35,  1.07s/it]

{'loss': 1.1168, 'grad_norm': 0.5283732414245605, 'learning_rate': 5.4365325077399385e-05, 'epoch': 0.73}


 73%|███████▎  | 2358/3235 [42:30<16:06,  1.10s/it]

{'loss': 0.9523, 'grad_norm': 0.4976550340652466, 'learning_rate': 5.430340557275542e-05, 'epoch': 0.73}


 73%|███████▎  | 2359/3235 [42:30<15:29,  1.06s/it]

{'loss': 0.9533, 'grad_norm': 0.493777334690094, 'learning_rate': 5.424148606811146e-05, 'epoch': 0.73}


 73%|███████▎  | 2360/3235 [42:32<16:32,  1.13s/it]

{'loss': 1.0658, 'grad_norm': 0.47140583395957947, 'learning_rate': 5.41795665634675e-05, 'epoch': 0.73}


 73%|███████▎  | 2361/3235 [42:33<16:57,  1.16s/it]

{'loss': 1.0772, 'grad_norm': 0.4593948423862457, 'learning_rate': 5.411764705882353e-05, 'epoch': 0.73}


 73%|███████▎  | 2362/3235 [42:34<17:34,  1.21s/it]

{'loss': 0.9453, 'grad_norm': 0.4538835883140564, 'learning_rate': 5.405572755417957e-05, 'epoch': 0.73}


 73%|███████▎  | 2363/3235 [42:35<16:38,  1.14s/it]

{'loss': 0.99, 'grad_norm': 0.46548131108283997, 'learning_rate': 5.399380804953561e-05, 'epoch': 0.73}


 73%|███████▎  | 2364/3235 [42:37<16:42,  1.15s/it]

{'loss': 0.9771, 'grad_norm': 0.49509358406066895, 'learning_rate': 5.393188854489164e-05, 'epoch': 0.73}


 73%|███████▎  | 2365/3235 [42:38<16:08,  1.11s/it]

{'loss': 1.125, 'grad_norm': 0.5321530103683472, 'learning_rate': 5.386996904024768e-05, 'epoch': 0.73}


 73%|███████▎  | 2366/3235 [42:39<15:38,  1.08s/it]

{'loss': 1.0838, 'grad_norm': 0.537813663482666, 'learning_rate': 5.3808049535603725e-05, 'epoch': 0.73}


 73%|███████▎  | 2367/3235 [42:40<15:09,  1.05s/it]

{'loss': 1.0534, 'grad_norm': 0.5024200081825256, 'learning_rate': 5.374613003095975e-05, 'epoch': 0.73}


 73%|███████▎  | 2368/3235 [42:40<14:32,  1.01s/it]

{'loss': 0.8816, 'grad_norm': 0.49838119745254517, 'learning_rate': 5.368421052631579e-05, 'epoch': 0.73}


 73%|███████▎  | 2369/3235 [42:41<14:11,  1.02it/s]

{'loss': 0.952, 'grad_norm': 0.4984282851219177, 'learning_rate': 5.362229102167183e-05, 'epoch': 0.73}


 73%|███████▎  | 2370/3235 [42:42<14:31,  1.01s/it]

{'loss': 0.9746, 'grad_norm': 0.5142058730125427, 'learning_rate': 5.3560371517027864e-05, 'epoch': 0.73}


 73%|███████▎  | 2371/3235 [42:43<14:33,  1.01s/it]

{'loss': 1.035, 'grad_norm': 0.4728289544582367, 'learning_rate': 5.3498452012383904e-05, 'epoch': 0.73}


 73%|███████▎  | 2372/3235 [42:44<14:15,  1.01it/s]

{'loss': 0.983, 'grad_norm': 0.5115075707435608, 'learning_rate': 5.3436532507739944e-05, 'epoch': 0.73}


 73%|███████▎  | 2373/3235 [42:45<13:40,  1.05it/s]

{'loss': 0.8584, 'grad_norm': 0.4948060214519501, 'learning_rate': 5.337461300309598e-05, 'epoch': 0.73}


 73%|███████▎  | 2374/3235 [42:46<12:27,  1.15it/s]

{'loss': 0.9932, 'grad_norm': 0.5787736773490906, 'learning_rate': 5.331269349845202e-05, 'epoch': 0.73}


 73%|███████▎  | 2375/3235 [42:47<13:12,  1.09it/s]

{'loss': 1.1697, 'grad_norm': 0.5107197165489197, 'learning_rate': 5.325077399380806e-05, 'epoch': 0.73}


 73%|███████▎  | 2376/3235 [42:48<13:14,  1.08it/s]

{'loss': 1.0529, 'grad_norm': 0.4801459014415741, 'learning_rate': 5.3188854489164084e-05, 'epoch': 0.73}


 73%|███████▎  | 2377/3235 [42:49<13:46,  1.04it/s]

{'loss': 0.9553, 'grad_norm': 0.49745380878448486, 'learning_rate': 5.3126934984520124e-05, 'epoch': 0.73}


 74%|███████▎  | 2378/3235 [42:50<14:27,  1.01s/it]

{'loss': 0.9767, 'grad_norm': 0.5221152305603027, 'learning_rate': 5.306501547987617e-05, 'epoch': 0.74}


 74%|███████▎  | 2379/3235 [42:51<14:54,  1.04s/it]

{'loss': 1.1168, 'grad_norm': 0.5060826539993286, 'learning_rate': 5.30030959752322e-05, 'epoch': 0.74}


 74%|███████▎  | 2380/3235 [42:52<14:22,  1.01s/it]

{'loss': 0.8546, 'grad_norm': 0.5037204623222351, 'learning_rate': 5.294117647058824e-05, 'epoch': 0.74}


 74%|███████▎  | 2381/3235 [42:53<15:50,  1.11s/it]

{'loss': 1.0096, 'grad_norm': 0.45245689153671265, 'learning_rate': 5.287925696594428e-05, 'epoch': 0.74}


 74%|███████▎  | 2382/3235 [42:54<14:55,  1.05s/it]

{'loss': 0.9559, 'grad_norm': 0.5245349407196045, 'learning_rate': 5.281733746130031e-05, 'epoch': 0.74}


 74%|███████▎  | 2383/3235 [42:55<14:51,  1.05s/it]

{'loss': 1.0733, 'grad_norm': 0.530475378036499, 'learning_rate': 5.275541795665635e-05, 'epoch': 0.74}


 74%|███████▎  | 2384/3235 [42:56<14:58,  1.06s/it]

{'loss': 0.9769, 'grad_norm': 0.4923812747001648, 'learning_rate': 5.269349845201239e-05, 'epoch': 0.74}


 74%|███████▎  | 2385/3235 [42:57<13:50,  1.02it/s]

{'loss': 0.9523, 'grad_norm': 0.5416619181632996, 'learning_rate': 5.2631578947368424e-05, 'epoch': 0.74}


 74%|███████▍  | 2386/3235 [42:58<14:46,  1.04s/it]

{'loss': 1.1318, 'grad_norm': 0.4618586003780365, 'learning_rate': 5.2569659442724464e-05, 'epoch': 0.74}


 74%|███████▍  | 2387/3235 [42:59<14:20,  1.01s/it]

{'loss': 0.9881, 'grad_norm': 0.5466853380203247, 'learning_rate': 5.2507739938080504e-05, 'epoch': 0.74}


 74%|███████▍  | 2388/3235 [43:01<14:35,  1.03s/it]

{'loss': 1.0065, 'grad_norm': 0.49549248814582825, 'learning_rate': 5.244582043343653e-05, 'epoch': 0.74}


 74%|███████▍  | 2389/3235 [43:02<15:06,  1.07s/it]

{'loss': 0.918, 'grad_norm': 0.4556533694267273, 'learning_rate': 5.238390092879257e-05, 'epoch': 0.74}


 74%|███████▍  | 2390/3235 [43:03<14:07,  1.00s/it]

{'loss': 0.9569, 'grad_norm': 0.5608800649642944, 'learning_rate': 5.232198142414862e-05, 'epoch': 0.74}


 74%|███████▍  | 2391/3235 [43:04<15:33,  1.11s/it]

{'loss': 0.9595, 'grad_norm': 0.4746851623058319, 'learning_rate': 5.2260061919504644e-05, 'epoch': 0.74}


 74%|███████▍  | 2392/3235 [43:05<15:55,  1.13s/it]

{'loss': 1.065, 'grad_norm': 0.4642404317855835, 'learning_rate': 5.2198142414860684e-05, 'epoch': 0.74}


 74%|███████▍  | 2393/3235 [43:06<15:04,  1.07s/it]

{'loss': 0.914, 'grad_norm': 0.5365289449691772, 'learning_rate': 5.2136222910216724e-05, 'epoch': 0.74}


 74%|███████▍  | 2394/3235 [43:07<15:38,  1.12s/it]

{'loss': 1.1014, 'grad_norm': 0.5175334811210632, 'learning_rate': 5.207430340557276e-05, 'epoch': 0.74}


 74%|███████▍  | 2395/3235 [43:08<14:48,  1.06s/it]

{'loss': 1.0374, 'grad_norm': 0.5396342873573303, 'learning_rate': 5.20123839009288e-05, 'epoch': 0.74}


 74%|███████▍  | 2396/3235 [43:09<14:51,  1.06s/it]

{'loss': 1.0371, 'grad_norm': 0.4964328706264496, 'learning_rate': 5.195046439628484e-05, 'epoch': 0.74}


 74%|███████▍  | 2397/3235 [43:10<14:54,  1.07s/it]

{'loss': 0.9847, 'grad_norm': 0.5133181810379028, 'learning_rate': 5.1888544891640863e-05, 'epoch': 0.74}


 74%|███████▍  | 2398/3235 [43:11<13:54,  1.00it/s]

{'loss': 0.9703, 'grad_norm': 0.5442929267883301, 'learning_rate': 5.182662538699691e-05, 'epoch': 0.74}


 74%|███████▍  | 2399/3235 [43:12<13:51,  1.01it/s]

{'loss': 1.1088, 'grad_norm': 0.5491724610328674, 'learning_rate': 5.176470588235295e-05, 'epoch': 0.74}


 74%|███████▍  | 2400/3235 [43:13<14:03,  1.01s/it]

{'loss': 1.0045, 'grad_norm': 0.5141651630401611, 'learning_rate': 5.170278637770898e-05, 'epoch': 0.74}


 74%|███████▍  | 2401/3235 [43:14<14:21,  1.03s/it]

{'loss': 1.0271, 'grad_norm': 0.4930952191352844, 'learning_rate': 5.164086687306502e-05, 'epoch': 0.74}


 74%|███████▍  | 2402/3235 [43:15<13:44,  1.01it/s]

{'loss': 1.0623, 'grad_norm': 0.5145682692527771, 'learning_rate': 5.157894736842106e-05, 'epoch': 0.74}


 74%|███████▍  | 2403/3235 [43:16<13:58,  1.01s/it]

{'loss': 0.8949, 'grad_norm': 0.49949097633361816, 'learning_rate': 5.151702786377709e-05, 'epoch': 0.74}


 74%|███████▍  | 2404/3235 [43:17<14:01,  1.01s/it]

{'loss': 0.7049, 'grad_norm': 0.45907464623451233, 'learning_rate': 5.145510835913313e-05, 'epoch': 0.74}


 74%|███████▍  | 2405/3235 [43:18<15:11,  1.10s/it]

{'loss': 1.1162, 'grad_norm': 0.48710209131240845, 'learning_rate': 5.139318885448917e-05, 'epoch': 0.74}


 74%|███████▍  | 2406/3235 [43:20<14:53,  1.08s/it]

{'loss': 1.0659, 'grad_norm': 0.4859381318092346, 'learning_rate': 5.13312693498452e-05, 'epoch': 0.74}


 74%|███████▍  | 2407/3235 [43:21<15:20,  1.11s/it]

{'loss': 1.1372, 'grad_norm': 0.4648531377315521, 'learning_rate': 5.126934984520124e-05, 'epoch': 0.74}


 74%|███████▍  | 2408/3235 [43:22<14:10,  1.03s/it]

{'loss': 0.9828, 'grad_norm': 0.5158297419548035, 'learning_rate': 5.120743034055728e-05, 'epoch': 0.74}


 74%|███████▍  | 2409/3235 [43:23<14:53,  1.08s/it]

{'loss': 1.0078, 'grad_norm': 0.47389110922813416, 'learning_rate': 5.114551083591331e-05, 'epoch': 0.74}


 74%|███████▍  | 2410/3235 [43:24<16:27,  1.20s/it]

{'loss': 0.9007, 'grad_norm': 0.45919057726860046, 'learning_rate': 5.1083591331269356e-05, 'epoch': 0.74}


 75%|███████▍  | 2411/3235 [43:25<16:39,  1.21s/it]

{'loss': 1.0322, 'grad_norm': 0.5325273275375366, 'learning_rate': 5.1021671826625396e-05, 'epoch': 0.75}


 75%|███████▍  | 2412/3235 [43:27<16:21,  1.19s/it]

{'loss': 1.1113, 'grad_norm': 0.5107132196426392, 'learning_rate': 5.095975232198142e-05, 'epoch': 0.75}


 75%|███████▍  | 2413/3235 [43:28<17:16,  1.26s/it]

{'loss': 1.0215, 'grad_norm': 0.4393346607685089, 'learning_rate': 5.089783281733746e-05, 'epoch': 0.75}


 75%|███████▍  | 2414/3235 [43:29<15:47,  1.15s/it]

{'loss': 0.9551, 'grad_norm': 0.48456883430480957, 'learning_rate': 5.08359133126935e-05, 'epoch': 0.75}


 75%|███████▍  | 2415/3235 [43:30<15:56,  1.17s/it]

{'loss': 1.0506, 'grad_norm': 0.4492625594139099, 'learning_rate': 5.0773993808049536e-05, 'epoch': 0.75}


 75%|███████▍  | 2416/3235 [43:31<15:20,  1.12s/it]

{'loss': 0.8573, 'grad_norm': 0.5140233039855957, 'learning_rate': 5.0712074303405576e-05, 'epoch': 0.75}


 75%|███████▍  | 2417/3235 [43:32<15:23,  1.13s/it]

{'loss': 1.0653, 'grad_norm': 0.4549368917942047, 'learning_rate': 5.0650154798761616e-05, 'epoch': 0.75}


 75%|███████▍  | 2418/3235 [43:33<14:31,  1.07s/it]

{'loss': 0.9021, 'grad_norm': 0.49094581604003906, 'learning_rate': 5.058823529411765e-05, 'epoch': 0.75}


 75%|███████▍  | 2419/3235 [43:34<15:23,  1.13s/it]

{'loss': 0.7913, 'grad_norm': 0.4068695604801178, 'learning_rate': 5.052631578947369e-05, 'epoch': 0.75}


 75%|███████▍  | 2420/3235 [43:36<14:52,  1.10s/it]

{'loss': 1.0173, 'grad_norm': 0.4615694284439087, 'learning_rate': 5.046439628482973e-05, 'epoch': 0.75}


 75%|███████▍  | 2421/3235 [43:37<15:31,  1.14s/it]

{'loss': 1.015, 'grad_norm': 0.4580881595611572, 'learning_rate': 5.0402476780185756e-05, 'epoch': 0.75}


 75%|███████▍  | 2422/3235 [43:38<16:12,  1.20s/it]

{'loss': 0.984, 'grad_norm': 0.47951972484588623, 'learning_rate': 5.03405572755418e-05, 'epoch': 0.75}


 75%|███████▍  | 2423/3235 [43:39<15:34,  1.15s/it]

{'loss': 1.0307, 'grad_norm': 0.5491160750389099, 'learning_rate': 5.027863777089784e-05, 'epoch': 0.75}


 75%|███████▍  | 2424/3235 [43:40<14:07,  1.04s/it]

{'loss': 0.831, 'grad_norm': 0.5163195729255676, 'learning_rate': 5.021671826625387e-05, 'epoch': 0.75}


 75%|███████▍  | 2425/3235 [43:41<14:34,  1.08s/it]

{'loss': 0.9622, 'grad_norm': 0.48684701323509216, 'learning_rate': 5.015479876160991e-05, 'epoch': 0.75}


 75%|███████▍  | 2426/3235 [43:42<13:32,  1.00s/it]

{'loss': 0.967, 'grad_norm': 0.49948421120643616, 'learning_rate': 5.009287925696595e-05, 'epoch': 0.75}


 75%|███████▌  | 2427/3235 [43:43<14:12,  1.06s/it]

{'loss': 1.0233, 'grad_norm': 0.47637638449668884, 'learning_rate': 5.003095975232198e-05, 'epoch': 0.75}


 75%|███████▌  | 2428/3235 [43:44<14:04,  1.05s/it]

{'loss': 0.909, 'grad_norm': 0.48758941888809204, 'learning_rate': 4.996904024767802e-05, 'epoch': 0.75}


 75%|███████▌  | 2429/3235 [43:45<13:11,  1.02it/s]

{'loss': 0.9377, 'grad_norm': 0.5221600532531738, 'learning_rate': 4.9907120743034056e-05, 'epoch': 0.75}


 75%|███████▌  | 2430/3235 [43:46<13:56,  1.04s/it]

{'loss': 1.0594, 'grad_norm': 0.5130751729011536, 'learning_rate': 4.9845201238390096e-05, 'epoch': 0.75}


 75%|███████▌  | 2431/3235 [43:47<14:30,  1.08s/it]

{'loss': 0.9785, 'grad_norm': 0.4682272970676422, 'learning_rate': 4.9783281733746136e-05, 'epoch': 0.75}


 75%|███████▌  | 2432/3235 [43:48<14:45,  1.10s/it]

{'loss': 0.9595, 'grad_norm': 0.49300262331962585, 'learning_rate': 4.972136222910217e-05, 'epoch': 0.75}


 75%|███████▌  | 2433/3235 [43:49<14:11,  1.06s/it]

{'loss': 0.8631, 'grad_norm': 0.4671951234340668, 'learning_rate': 4.96594427244582e-05, 'epoch': 0.75}


 75%|███████▌  | 2434/3235 [43:50<13:51,  1.04s/it]

{'loss': 0.9933, 'grad_norm': 0.4782065153121948, 'learning_rate': 4.959752321981425e-05, 'epoch': 0.75}


 75%|███████▌  | 2435/3235 [43:52<14:36,  1.10s/it]

{'loss': 0.864, 'grad_norm': 0.4120563268661499, 'learning_rate': 4.953560371517028e-05, 'epoch': 0.75}


 75%|███████▌  | 2436/3235 [43:53<15:18,  1.15s/it]

{'loss': 1.0153, 'grad_norm': 0.4670512080192566, 'learning_rate': 4.9473684210526315e-05, 'epoch': 0.75}


 75%|███████▌  | 2437/3235 [43:54<14:44,  1.11s/it]

{'loss': 1.0283, 'grad_norm': 0.47090697288513184, 'learning_rate': 4.9411764705882355e-05, 'epoch': 0.75}


 75%|███████▌  | 2438/3235 [43:55<15:19,  1.15s/it]

{'loss': 1.0643, 'grad_norm': 0.4703008532524109, 'learning_rate': 4.9349845201238395e-05, 'epoch': 0.75}


 75%|███████▌  | 2439/3235 [43:56<15:36,  1.18s/it]

{'loss': 0.8384, 'grad_norm': 0.4483484625816345, 'learning_rate': 4.928792569659443e-05, 'epoch': 0.75}


 75%|███████▌  | 2440/3235 [43:58<15:34,  1.18s/it]

{'loss': 1.0456, 'grad_norm': 0.5246285796165466, 'learning_rate': 4.922600619195047e-05, 'epoch': 0.75}


 75%|███████▌  | 2441/3235 [43:59<16:19,  1.23s/it]

{'loss': 1.1585, 'grad_norm': 0.45710426568984985, 'learning_rate': 4.91640866873065e-05, 'epoch': 0.75}


 75%|███████▌  | 2442/3235 [44:00<15:40,  1.19s/it]

{'loss': 1.0044, 'grad_norm': 0.48783421516418457, 'learning_rate': 4.910216718266254e-05, 'epoch': 0.75}


 76%|███████▌  | 2443/3235 [44:01<15:14,  1.15s/it]

{'loss': 0.8501, 'grad_norm': 0.4760468006134033, 'learning_rate': 4.904024767801858e-05, 'epoch': 0.76}


 76%|███████▌  | 2444/3235 [44:02<14:46,  1.12s/it]

{'loss': 0.9669, 'grad_norm': 0.46994319558143616, 'learning_rate': 4.8978328173374615e-05, 'epoch': 0.76}


 76%|███████▌  | 2445/3235 [44:03<15:18,  1.16s/it]

{'loss': 1.1404, 'grad_norm': 0.47776368260383606, 'learning_rate': 4.891640866873065e-05, 'epoch': 0.76}


 76%|███████▌  | 2446/3235 [44:05<15:39,  1.19s/it]

{'loss': 0.8687, 'grad_norm': 0.44003185629844666, 'learning_rate': 4.8854489164086695e-05, 'epoch': 0.76}


 76%|███████▌  | 2447/3235 [44:06<15:46,  1.20s/it]

{'loss': 1.0422, 'grad_norm': 0.4462631344795227, 'learning_rate': 4.879256965944273e-05, 'epoch': 0.76}


 76%|███████▌  | 2448/3235 [44:07<15:40,  1.20s/it]

{'loss': 0.981, 'grad_norm': 0.4906631410121918, 'learning_rate': 4.873065015479876e-05, 'epoch': 0.76}


 76%|███████▌  | 2449/3235 [44:08<15:00,  1.15s/it]

{'loss': 0.9987, 'grad_norm': 0.5609897375106812, 'learning_rate': 4.86687306501548e-05, 'epoch': 0.76}


 76%|███████▌  | 2450/3235 [44:09<14:19,  1.09s/it]

{'loss': 0.9384, 'grad_norm': 0.4751499593257904, 'learning_rate': 4.860681114551084e-05, 'epoch': 0.76}


 76%|███████▌  | 2451/3235 [44:10<13:48,  1.06s/it]

{'loss': 0.9482, 'grad_norm': 0.4930705428123474, 'learning_rate': 4.8544891640866875e-05, 'epoch': 0.76}


 76%|███████▌  | 2452/3235 [44:11<14:27,  1.11s/it]

{'loss': 0.9439, 'grad_norm': 0.4356662631034851, 'learning_rate': 4.8482972136222915e-05, 'epoch': 0.76}


 76%|███████▌  | 2453/3235 [44:13<15:28,  1.19s/it]

{'loss': 1.0459, 'grad_norm': 0.4613065719604492, 'learning_rate': 4.842105263157895e-05, 'epoch': 0.76}


 76%|███████▌  | 2454/3235 [44:14<14:41,  1.13s/it]

{'loss': 0.9157, 'grad_norm': 0.5612457990646362, 'learning_rate': 4.835913312693499e-05, 'epoch': 0.76}


 76%|███████▌  | 2455/3235 [44:15<13:54,  1.07s/it]

{'loss': 1.0535, 'grad_norm': 0.5104086995124817, 'learning_rate': 4.829721362229103e-05, 'epoch': 0.76}


 76%|███████▌  | 2456/3235 [44:16<14:48,  1.14s/it]

{'loss': 0.9212, 'grad_norm': 0.4044664800167084, 'learning_rate': 4.823529411764706e-05, 'epoch': 0.76}


 76%|███████▌  | 2457/3235 [44:17<15:03,  1.16s/it]

{'loss': 0.9365, 'grad_norm': 0.49369436502456665, 'learning_rate': 4.8173374613003095e-05, 'epoch': 0.76}


 76%|███████▌  | 2458/3235 [44:18<14:38,  1.13s/it]

{'loss': 0.9995, 'grad_norm': 0.49130013585090637, 'learning_rate': 4.811145510835914e-05, 'epoch': 0.76}


 76%|███████▌  | 2459/3235 [44:19<14:44,  1.14s/it]

{'loss': 0.9994, 'grad_norm': 0.5237827301025391, 'learning_rate': 4.8049535603715175e-05, 'epoch': 0.76}


 76%|███████▌  | 2460/3235 [44:20<14:01,  1.09s/it]

{'loss': 0.8842, 'grad_norm': 0.5316936373710632, 'learning_rate': 4.798761609907121e-05, 'epoch': 0.76}


 76%|███████▌  | 2461/3235 [44:21<14:01,  1.09s/it]

{'loss': 1.0023, 'grad_norm': 0.5375436544418335, 'learning_rate': 4.792569659442725e-05, 'epoch': 0.76}


 76%|███████▌  | 2462/3235 [44:22<13:38,  1.06s/it]

{'loss': 1.1114, 'grad_norm': 0.5258803367614746, 'learning_rate': 4.786377708978329e-05, 'epoch': 0.76}


 76%|███████▌  | 2463/3235 [44:23<12:25,  1.04it/s]

{'loss': 0.827, 'grad_norm': 0.5139946937561035, 'learning_rate': 4.780185758513932e-05, 'epoch': 0.76}


 76%|███████▌  | 2464/3235 [44:24<12:05,  1.06it/s]

{'loss': 1.0191, 'grad_norm': 0.5538519024848938, 'learning_rate': 4.773993808049536e-05, 'epoch': 0.76}


 76%|███████▌  | 2465/3235 [44:25<11:23,  1.13it/s]

{'loss': 0.8373, 'grad_norm': 0.5449523329734802, 'learning_rate': 4.7678018575851394e-05, 'epoch': 0.76}


 76%|███████▌  | 2466/3235 [44:26<13:22,  1.04s/it]

{'loss': 1.064, 'grad_norm': 0.49221697449684143, 'learning_rate': 4.7616099071207434e-05, 'epoch': 0.76}


 76%|███████▋  | 2467/3235 [44:27<13:39,  1.07s/it]

{'loss': 0.9131, 'grad_norm': 0.48350366950035095, 'learning_rate': 4.755417956656347e-05, 'epoch': 0.76}


 76%|███████▋  | 2468/3235 [44:28<13:45,  1.08s/it]

{'loss': 1.0601, 'grad_norm': 0.5341417789459229, 'learning_rate': 4.749226006191951e-05, 'epoch': 0.76}


 76%|███████▋  | 2469/3235 [44:29<13:25,  1.05s/it]

{'loss': 1.0508, 'grad_norm': 0.5465065240859985, 'learning_rate': 4.743034055727554e-05, 'epoch': 0.76}


 76%|███████▋  | 2470/3235 [44:30<13:28,  1.06s/it]

{'loss': 1.1404, 'grad_norm': 0.506589949131012, 'learning_rate': 4.736842105263158e-05, 'epoch': 0.76}


 76%|███████▋  | 2471/3235 [44:31<13:20,  1.05s/it]

{'loss': 0.8865, 'grad_norm': 0.502569854259491, 'learning_rate': 4.730650154798762e-05, 'epoch': 0.76}


 76%|███████▋  | 2472/3235 [44:32<13:13,  1.04s/it]

{'loss': 0.9582, 'grad_norm': 0.4805692434310913, 'learning_rate': 4.7244582043343654e-05, 'epoch': 0.76}


 76%|███████▋  | 2473/3235 [44:34<13:57,  1.10s/it]

{'loss': 1.0497, 'grad_norm': 0.44799715280532837, 'learning_rate': 4.718266253869969e-05, 'epoch': 0.76}


 76%|███████▋  | 2474/3235 [44:35<13:40,  1.08s/it]

{'loss': 1.0099, 'grad_norm': 0.46011853218078613, 'learning_rate': 4.7120743034055734e-05, 'epoch': 0.76}


 77%|███████▋  | 2475/3235 [44:36<14:01,  1.11s/it]

{'loss': 0.9725, 'grad_norm': 0.47824299335479736, 'learning_rate': 4.705882352941177e-05, 'epoch': 0.77}


 77%|███████▋  | 2476/3235 [44:37<14:19,  1.13s/it]

{'loss': 0.9999, 'grad_norm': 0.4493379294872284, 'learning_rate': 4.69969040247678e-05, 'epoch': 0.77}


 77%|███████▋  | 2477/3235 [44:38<13:12,  1.05s/it]

{'loss': 0.9133, 'grad_norm': 0.5274870991706848, 'learning_rate': 4.693498452012384e-05, 'epoch': 0.77}


 77%|███████▋  | 2478/3235 [44:39<13:58,  1.11s/it]

{'loss': 1.0561, 'grad_norm': 0.48549607396125793, 'learning_rate': 4.687306501547988e-05, 'epoch': 0.77}


 77%|███████▋  | 2479/3235 [44:40<13:17,  1.05s/it]

{'loss': 0.8197, 'grad_norm': 0.5752626657485962, 'learning_rate': 4.6811145510835914e-05, 'epoch': 0.77}


 77%|███████▋  | 2480/3235 [44:41<13:34,  1.08s/it]

{'loss': 1.0307, 'grad_norm': 0.4740827977657318, 'learning_rate': 4.6749226006191954e-05, 'epoch': 0.77}


 77%|███████▋  | 2481/3235 [44:42<13:14,  1.05s/it]

{'loss': 1.0398, 'grad_norm': 0.5118197202682495, 'learning_rate': 4.668730650154799e-05, 'epoch': 0.77}


 77%|███████▋  | 2482/3235 [44:44<14:12,  1.13s/it]

{'loss': 1.0986, 'grad_norm': 0.5015398859977722, 'learning_rate': 4.662538699690403e-05, 'epoch': 0.77}


 77%|███████▋  | 2483/3235 [44:44<13:17,  1.06s/it]

{'loss': 0.8893, 'grad_norm': 0.5266955494880676, 'learning_rate': 4.656346749226007e-05, 'epoch': 0.77}


 77%|███████▋  | 2484/3235 [44:46<13:53,  1.11s/it]

{'loss': 1.0035, 'grad_norm': 0.4689525067806244, 'learning_rate': 4.65015479876161e-05, 'epoch': 0.77}


 77%|███████▋  | 2485/3235 [44:47<13:06,  1.05s/it]

{'loss': 0.9686, 'grad_norm': 0.5095610618591309, 'learning_rate': 4.6439628482972134e-05, 'epoch': 0.77}


 77%|███████▋  | 2486/3235 [44:48<13:06,  1.05s/it]

{'loss': 1.0784, 'grad_norm': 0.5142221450805664, 'learning_rate': 4.637770897832818e-05, 'epoch': 0.77}


 77%|███████▋  | 2487/3235 [44:49<13:51,  1.11s/it]

{'loss': 1.0487, 'grad_norm': 0.5307725667953491, 'learning_rate': 4.6315789473684214e-05, 'epoch': 0.77}


 77%|███████▋  | 2488/3235 [44:50<14:16,  1.15s/it]

{'loss': 1.0395, 'grad_norm': 0.4965486526489258, 'learning_rate': 4.625386996904025e-05, 'epoch': 0.77}


 77%|███████▋  | 2489/3235 [44:51<14:09,  1.14s/it]

{'loss': 0.9571, 'grad_norm': 0.4893462359905243, 'learning_rate': 4.619195046439629e-05, 'epoch': 0.77}


 77%|███████▋  | 2490/3235 [44:52<13:20,  1.07s/it]

{'loss': 1.0162, 'grad_norm': 0.4880770146846771, 'learning_rate': 4.613003095975233e-05, 'epoch': 0.77}


 77%|███████▋  | 2491/3235 [44:53<13:25,  1.08s/it]

{'loss': 1.1738, 'grad_norm': 0.528233528137207, 'learning_rate': 4.606811145510836e-05, 'epoch': 0.77}


 77%|███████▋  | 2492/3235 [44:54<13:07,  1.06s/it]

{'loss': 0.9411, 'grad_norm': 0.468993604183197, 'learning_rate': 4.60061919504644e-05, 'epoch': 0.77}


 77%|███████▋  | 2493/3235 [44:55<13:19,  1.08s/it]

{'loss': 1.0638, 'grad_norm': 0.5443726778030396, 'learning_rate': 4.594427244582043e-05, 'epoch': 0.77}


 77%|███████▋  | 2494/3235 [44:56<12:06,  1.02it/s]

{'loss': 0.8405, 'grad_norm': 0.5212441682815552, 'learning_rate': 4.588235294117647e-05, 'epoch': 0.77}


 77%|███████▋  | 2495/3235 [44:57<13:00,  1.06s/it]

{'loss': 1.0474, 'grad_norm': 0.442117840051651, 'learning_rate': 4.582043343653251e-05, 'epoch': 0.77}


 77%|███████▋  | 2496/3235 [44:58<13:04,  1.06s/it]

{'loss': 1.0397, 'grad_norm': 0.47596341371536255, 'learning_rate': 4.5758513931888547e-05, 'epoch': 0.77}


 77%|███████▋  | 2497/3235 [45:00<12:53,  1.05s/it]

{'loss': 1.0778, 'grad_norm': 0.4905576705932617, 'learning_rate': 4.569659442724458e-05, 'epoch': 0.77}


 77%|███████▋  | 2498/3235 [45:01<12:58,  1.06s/it]

{'loss': 0.8967, 'grad_norm': 0.45915690064430237, 'learning_rate': 4.563467492260062e-05, 'epoch': 0.77}


 77%|███████▋  | 2499/3235 [45:02<14:12,  1.16s/it]

{'loss': 1.0662, 'grad_norm': 0.44277244806289673, 'learning_rate': 4.557275541795666e-05, 'epoch': 0.77}


 77%|███████▋  | 2500/3235 [45:03<14:44,  1.20s/it]

{'loss': 0.9952, 'grad_norm': 0.44829729199409485, 'learning_rate': 4.551083591331269e-05, 'epoch': 0.77}


[34m[1mwandb[0m: Adding directory to artifact (./outputs/checkpoint-2500)... Done. 0.1s
 77%|███████▋  | 2501/3235 [45:05<17:28,  1.43s/it]

{'loss': 0.8993, 'grad_norm': 0.4263133704662323, 'learning_rate': 4.544891640866873e-05, 'epoch': 0.77}


 77%|███████▋  | 2502/3235 [45:06<16:31,  1.35s/it]

{'loss': 0.9471, 'grad_norm': 0.4557373523712158, 'learning_rate': 4.538699690402477e-05, 'epoch': 0.77}


 77%|███████▋  | 2503/3235 [45:07<14:47,  1.21s/it]

{'loss': 0.8675, 'grad_norm': 0.5086732506752014, 'learning_rate': 4.5325077399380806e-05, 'epoch': 0.77}


 77%|███████▋  | 2504/3235 [45:08<14:32,  1.19s/it]

{'loss': 0.9978, 'grad_norm': 0.4782678782939911, 'learning_rate': 4.5263157894736846e-05, 'epoch': 0.77}


 77%|███████▋  | 2505/3235 [45:10<14:44,  1.21s/it]

{'loss': 1.0044, 'grad_norm': 0.48421144485473633, 'learning_rate': 4.520123839009288e-05, 'epoch': 0.77}


 77%|███████▋  | 2506/3235 [45:11<14:30,  1.19s/it]

{'loss': 1.0433, 'grad_norm': 0.41780638694763184, 'learning_rate': 4.513931888544892e-05, 'epoch': 0.77}


 77%|███████▋  | 2507/3235 [45:12<14:16,  1.18s/it]

{'loss': 1.0879, 'grad_norm': 0.5279420614242554, 'learning_rate': 4.507739938080496e-05, 'epoch': 0.77}


 78%|███████▊  | 2508/3235 [45:13<14:15,  1.18s/it]

{'loss': 0.9537, 'grad_norm': 0.47550514340400696, 'learning_rate': 4.501547987616099e-05, 'epoch': 0.78}


 78%|███████▊  | 2509/3235 [45:14<13:35,  1.12s/it]

{'loss': 0.9834, 'grad_norm': 0.4962143898010254, 'learning_rate': 4.4953560371517026e-05, 'epoch': 0.78}


 78%|███████▊  | 2510/3235 [45:15<12:25,  1.03s/it]

{'loss': 0.9555, 'grad_norm': 0.5648516416549683, 'learning_rate': 4.4891640866873066e-05, 'epoch': 0.78}


 78%|███████▊  | 2511/3235 [45:16<11:35,  1.04it/s]

{'loss': 0.9849, 'grad_norm': 0.5009021759033203, 'learning_rate': 4.4829721362229106e-05, 'epoch': 0.78}


 78%|███████▊  | 2512/3235 [45:17<11:25,  1.06it/s]

{'loss': 0.9396, 'grad_norm': 0.5250344276428223, 'learning_rate': 4.476780185758514e-05, 'epoch': 0.78}


 78%|███████▊  | 2513/3235 [45:18<11:20,  1.06it/s]

{'loss': 0.9185, 'grad_norm': 0.5067564845085144, 'learning_rate': 4.470588235294118e-05, 'epoch': 0.78}


 78%|███████▊  | 2514/3235 [45:19<12:43,  1.06s/it]

{'loss': 0.9427, 'grad_norm': 0.4305920898914337, 'learning_rate': 4.464396284829722e-05, 'epoch': 0.78}


 78%|███████▊  | 2515/3235 [45:20<12:42,  1.06s/it]

{'loss': 1.0175, 'grad_norm': 0.47134914994239807, 'learning_rate': 4.458204334365325e-05, 'epoch': 0.78}


 78%|███████▊  | 2516/3235 [45:21<12:48,  1.07s/it]

{'loss': 1.1262, 'grad_norm': 0.5030190348625183, 'learning_rate': 4.452012383900929e-05, 'epoch': 0.78}


 78%|███████▊  | 2517/3235 [45:22<12:54,  1.08s/it]

{'loss': 1.0231, 'grad_norm': 0.4712178409099579, 'learning_rate': 4.4458204334365326e-05, 'epoch': 0.78}


 78%|███████▊  | 2518/3235 [45:23<13:05,  1.10s/it]

{'loss': 1.1128, 'grad_norm': 0.4651041328907013, 'learning_rate': 4.4396284829721366e-05, 'epoch': 0.78}


 78%|███████▊  | 2519/3235 [45:24<13:12,  1.11s/it]

{'loss': 0.7828, 'grad_norm': 0.47633761167526245, 'learning_rate': 4.4334365325077406e-05, 'epoch': 0.78}


 78%|███████▊  | 2520/3235 [45:26<13:27,  1.13s/it]

{'loss': 1.1053, 'grad_norm': 0.5684314966201782, 'learning_rate': 4.427244582043344e-05, 'epoch': 0.78}


 78%|███████▊  | 2521/3235 [45:27<13:18,  1.12s/it]

{'loss': 1.0046, 'grad_norm': 0.49044516682624817, 'learning_rate': 4.421052631578947e-05, 'epoch': 0.78}


 78%|███████▊  | 2522/3235 [45:28<12:49,  1.08s/it]

{'loss': 1.0766, 'grad_norm': 0.541744589805603, 'learning_rate': 4.414860681114551e-05, 'epoch': 0.78}


 78%|███████▊  | 2523/3235 [45:29<12:26,  1.05s/it]

{'loss': 1.0291, 'grad_norm': 0.5466068983078003, 'learning_rate': 4.408668730650155e-05, 'epoch': 0.78}


 78%|███████▊  | 2524/3235 [45:30<12:46,  1.08s/it]

{'loss': 0.9814, 'grad_norm': 0.46697917580604553, 'learning_rate': 4.4024767801857586e-05, 'epoch': 0.78}


 78%|███████▊  | 2525/3235 [45:31<13:00,  1.10s/it]

{'loss': 1.0145, 'grad_norm': 0.46888938546180725, 'learning_rate': 4.3962848297213626e-05, 'epoch': 0.78}


 78%|███████▊  | 2526/3235 [45:32<12:20,  1.04s/it]

{'loss': 0.9174, 'grad_norm': 0.5367438197135925, 'learning_rate': 4.390092879256966e-05, 'epoch': 0.78}


 78%|███████▊  | 2527/3235 [45:33<13:22,  1.13s/it]

{'loss': 1.0521, 'grad_norm': 0.4320186376571655, 'learning_rate': 4.38390092879257e-05, 'epoch': 0.78}


 78%|███████▊  | 2528/3235 [45:34<13:00,  1.10s/it]

{'loss': 0.965, 'grad_norm': 0.5338256359100342, 'learning_rate': 4.377708978328174e-05, 'epoch': 0.78}


 78%|███████▊  | 2529/3235 [45:35<12:56,  1.10s/it]

{'loss': 1.0662, 'grad_norm': 0.5173341631889343, 'learning_rate': 4.371517027863777e-05, 'epoch': 0.78}


 78%|███████▊  | 2530/3235 [45:37<13:47,  1.17s/it]

{'loss': 0.9936, 'grad_norm': 0.47906553745269775, 'learning_rate': 4.365325077399381e-05, 'epoch': 0.78}


 78%|███████▊  | 2531/3235 [45:38<14:47,  1.26s/it]

{'loss': 1.0051, 'grad_norm': 0.4675259292125702, 'learning_rate': 4.359133126934985e-05, 'epoch': 0.78}


 78%|███████▊  | 2532/3235 [45:39<13:46,  1.18s/it]

{'loss': 1.0847, 'grad_norm': 0.54715496301651, 'learning_rate': 4.3529411764705885e-05, 'epoch': 0.78}


 78%|███████▊  | 2533/3235 [45:41<14:28,  1.24s/it]

{'loss': 0.9674, 'grad_norm': 0.4660414159297943, 'learning_rate': 4.346749226006192e-05, 'epoch': 0.78}


 78%|███████▊  | 2534/3235 [45:42<14:32,  1.24s/it]

{'loss': 1.0386, 'grad_norm': 0.4621145725250244, 'learning_rate': 4.340557275541796e-05, 'epoch': 0.78}


 78%|███████▊  | 2535/3235 [45:43<13:45,  1.18s/it]

{'loss': 0.8894, 'grad_norm': 0.5359027981758118, 'learning_rate': 4.3343653250774e-05, 'epoch': 0.78}


 78%|███████▊  | 2536/3235 [45:44<13:45,  1.18s/it]

{'loss': 1.053, 'grad_norm': 0.45554119348526, 'learning_rate': 4.328173374613003e-05, 'epoch': 0.78}


 78%|███████▊  | 2537/3235 [45:45<13:15,  1.14s/it]

{'loss': 0.9918, 'grad_norm': 0.5001680850982666, 'learning_rate': 4.321981424148607e-05, 'epoch': 0.78}


 78%|███████▊  | 2538/3235 [45:46<12:20,  1.06s/it]

{'loss': 1.0071, 'grad_norm': 0.5555773377418518, 'learning_rate': 4.3157894736842105e-05, 'epoch': 0.78}


 78%|███████▊  | 2539/3235 [45:47<12:09,  1.05s/it]

{'loss': 0.9673, 'grad_norm': 0.466599702835083, 'learning_rate': 4.3095975232198145e-05, 'epoch': 0.78}


 79%|███████▊  | 2540/3235 [45:48<12:34,  1.09s/it]

{'loss': 0.8367, 'grad_norm': 0.47219398617744446, 'learning_rate': 4.303405572755418e-05, 'epoch': 0.79}


 79%|███████▊  | 2541/3235 [45:49<12:36,  1.09s/it]

{'loss': 0.8772, 'grad_norm': 0.5004647374153137, 'learning_rate': 4.297213622291022e-05, 'epoch': 0.79}


 79%|███████▊  | 2542/3235 [45:50<12:06,  1.05s/it]

{'loss': 1.0699, 'grad_norm': 0.4527917504310608, 'learning_rate': 4.291021671826626e-05, 'epoch': 0.79}


 79%|███████▊  | 2543/3235 [45:52<13:03,  1.13s/it]

{'loss': 1.0749, 'grad_norm': 0.47947677969932556, 'learning_rate': 4.284829721362229e-05, 'epoch': 0.79}


 79%|███████▊  | 2544/3235 [45:53<12:33,  1.09s/it]

{'loss': 0.8864, 'grad_norm': 0.49583899974823, 'learning_rate': 4.278637770897833e-05, 'epoch': 0.79}


 79%|███████▊  | 2545/3235 [45:54<12:21,  1.07s/it]

{'loss': 0.8258, 'grad_norm': 0.4633086621761322, 'learning_rate': 4.2724458204334365e-05, 'epoch': 0.79}


 79%|███████▊  | 2546/3235 [45:54<11:49,  1.03s/it]

{'loss': 0.9421, 'grad_norm': 0.5008118748664856, 'learning_rate': 4.2662538699690405e-05, 'epoch': 0.79}


 79%|███████▊  | 2547/3235 [45:55<11:35,  1.01s/it]

{'loss': 1.035, 'grad_norm': 0.522304117679596, 'learning_rate': 4.2600619195046445e-05, 'epoch': 0.79}


 79%|███████▉  | 2548/3235 [45:56<11:35,  1.01s/it]

{'loss': 0.8882, 'grad_norm': 0.55136638879776, 'learning_rate': 4.253869969040248e-05, 'epoch': 0.79}


 79%|███████▉  | 2549/3235 [45:58<11:55,  1.04s/it]

{'loss': 1.1772, 'grad_norm': 0.5048184990882874, 'learning_rate': 4.247678018575851e-05, 'epoch': 0.79}


 79%|███████▉  | 2550/3235 [45:59<11:37,  1.02s/it]

{'loss': 0.9405, 'grad_norm': 0.4945452809333801, 'learning_rate': 4.241486068111455e-05, 'epoch': 0.79}


 79%|███████▉  | 2551/3235 [46:00<11:57,  1.05s/it]

{'loss': 1.0888, 'grad_norm': 0.47093355655670166, 'learning_rate': 4.235294117647059e-05, 'epoch': 0.79}


 79%|███████▉  | 2552/3235 [46:01<11:53,  1.04s/it]

{'loss': 1.0364, 'grad_norm': 0.5202378034591675, 'learning_rate': 4.2291021671826625e-05, 'epoch': 0.79}


 79%|███████▉  | 2553/3235 [46:02<12:09,  1.07s/it]

{'loss': 1.0664, 'grad_norm': 0.4977806806564331, 'learning_rate': 4.2229102167182665e-05, 'epoch': 0.79}


 79%|███████▉  | 2554/3235 [46:03<12:13,  1.08s/it]

{'loss': 0.9425, 'grad_norm': 0.46554484963417053, 'learning_rate': 4.2167182662538705e-05, 'epoch': 0.79}


 79%|███████▉  | 2555/3235 [46:04<12:24,  1.10s/it]

{'loss': 0.8765, 'grad_norm': 0.47622787952423096, 'learning_rate': 4.210526315789474e-05, 'epoch': 0.79}


 79%|███████▉  | 2556/3235 [46:05<11:52,  1.05s/it]

{'loss': 0.8625, 'grad_norm': 0.4732222855091095, 'learning_rate': 4.204334365325078e-05, 'epoch': 0.79}


 79%|███████▉  | 2557/3235 [46:06<12:36,  1.12s/it]

{'loss': 0.9636, 'grad_norm': 0.441588819026947, 'learning_rate': 4.198142414860681e-05, 'epoch': 0.79}


 79%|███████▉  | 2558/3235 [46:07<11:56,  1.06s/it]

{'loss': 0.7841, 'grad_norm': 0.4572694003582001, 'learning_rate': 4.191950464396285e-05, 'epoch': 0.79}


 79%|███████▉  | 2559/3235 [46:08<10:53,  1.03it/s]

{'loss': 0.9927, 'grad_norm': 0.5302456021308899, 'learning_rate': 4.185758513931889e-05, 'epoch': 0.79}


 79%|███████▉  | 2560/3235 [46:09<11:21,  1.01s/it]

{'loss': 0.975, 'grad_norm': 0.47004908323287964, 'learning_rate': 4.1795665634674924e-05, 'epoch': 0.79}


 79%|███████▉  | 2561/3235 [46:10<10:36,  1.06it/s]

{'loss': 1.007, 'grad_norm': 0.524177074432373, 'learning_rate': 4.173374613003096e-05, 'epoch': 0.79}


 79%|███████▉  | 2562/3235 [46:11<11:38,  1.04s/it]

{'loss': 0.9085, 'grad_norm': 0.4648248851299286, 'learning_rate': 4.1671826625387e-05, 'epoch': 0.79}


 79%|███████▉  | 2563/3235 [46:12<11:54,  1.06s/it]

{'loss': 1.0439, 'grad_norm': 0.5114519596099854, 'learning_rate': 4.160990712074304e-05, 'epoch': 0.79}


 79%|███████▉  | 2564/3235 [46:14<13:18,  1.19s/it]

{'loss': 0.9743, 'grad_norm': 0.46706879138946533, 'learning_rate': 4.154798761609907e-05, 'epoch': 0.79}


 79%|███████▉  | 2565/3235 [46:15<13:44,  1.23s/it]

{'loss': 0.9816, 'grad_norm': 0.4479764699935913, 'learning_rate': 4.148606811145511e-05, 'epoch': 0.79}


 79%|███████▉  | 2566/3235 [46:16<13:36,  1.22s/it]

{'loss': 1.0853, 'grad_norm': 0.4814034104347229, 'learning_rate': 4.1424148606811144e-05, 'epoch': 0.79}


 79%|███████▉  | 2567/3235 [46:17<13:33,  1.22s/it]

{'loss': 1.0633, 'grad_norm': 0.4809081256389618, 'learning_rate': 4.1362229102167184e-05, 'epoch': 0.79}


 79%|███████▉  | 2568/3235 [46:18<12:49,  1.15s/it]

{'loss': 1.0285, 'grad_norm': 0.5372211933135986, 'learning_rate': 4.1300309597523224e-05, 'epoch': 0.79}


 79%|███████▉  | 2569/3235 [46:20<12:54,  1.16s/it]

{'loss': 0.8201, 'grad_norm': 0.47875285148620605, 'learning_rate': 4.123839009287926e-05, 'epoch': 0.79}


 79%|███████▉  | 2570/3235 [46:21<12:31,  1.13s/it]

{'loss': 1.0993, 'grad_norm': 0.5267564654350281, 'learning_rate': 4.11764705882353e-05, 'epoch': 0.79}


 79%|███████▉  | 2571/3235 [46:22<12:06,  1.09s/it]

{'loss': 1.0429, 'grad_norm': 0.5642645955085754, 'learning_rate': 4.111455108359134e-05, 'epoch': 0.79}


 80%|███████▉  | 2572/3235 [46:23<12:01,  1.09s/it]

{'loss': 1.1128, 'grad_norm': 0.4915940761566162, 'learning_rate': 4.105263157894737e-05, 'epoch': 0.8}


 80%|███████▉  | 2573/3235 [46:24<12:13,  1.11s/it]

{'loss': 1.0286, 'grad_norm': 0.5125536918640137, 'learning_rate': 4.0990712074303404e-05, 'epoch': 0.8}


 80%|███████▉  | 2574/3235 [46:25<11:22,  1.03s/it]

{'loss': 0.9458, 'grad_norm': 0.5022929310798645, 'learning_rate': 4.0928792569659444e-05, 'epoch': 0.8}


 80%|███████▉  | 2575/3235 [46:26<11:18,  1.03s/it]

{'loss': 0.98, 'grad_norm': 0.5160405039787292, 'learning_rate': 4.0866873065015484e-05, 'epoch': 0.8}


 80%|███████▉  | 2576/3235 [46:27<12:03,  1.10s/it]

{'loss': 1.1719, 'grad_norm': 0.47487136721611023, 'learning_rate': 4.080495356037152e-05, 'epoch': 0.8}


 80%|███████▉  | 2577/3235 [46:28<12:40,  1.16s/it]

{'loss': 0.9712, 'grad_norm': 0.4352428913116455, 'learning_rate': 4.074303405572756e-05, 'epoch': 0.8}


 80%|███████▉  | 2578/3235 [46:30<12:47,  1.17s/it]

{'loss': 0.8929, 'grad_norm': 0.4441790282726288, 'learning_rate': 4.068111455108359e-05, 'epoch': 0.8}


 80%|███████▉  | 2579/3235 [46:31<12:41,  1.16s/it]

{'loss': 1.0736, 'grad_norm': 0.4644005000591278, 'learning_rate': 4.061919504643963e-05, 'epoch': 0.8}


 80%|███████▉  | 2580/3235 [46:32<12:09,  1.11s/it]

{'loss': 1.0199, 'grad_norm': 0.4629495143890381, 'learning_rate': 4.055727554179567e-05, 'epoch': 0.8}


 80%|███████▉  | 2581/3235 [46:33<11:54,  1.09s/it]

{'loss': 1.1189, 'grad_norm': 0.5185092687606812, 'learning_rate': 4.0495356037151704e-05, 'epoch': 0.8}


 80%|███████▉  | 2582/3235 [46:34<12:45,  1.17s/it]

{'loss': 1.0784, 'grad_norm': 0.4428517520427704, 'learning_rate': 4.0433436532507744e-05, 'epoch': 0.8}


 80%|███████▉  | 2583/3235 [46:35<12:40,  1.17s/it]

{'loss': 1.1734, 'grad_norm': 0.4718017280101776, 'learning_rate': 4.0371517027863784e-05, 'epoch': 0.8}


 80%|███████▉  | 2584/3235 [46:36<12:42,  1.17s/it]

{'loss': 0.9497, 'grad_norm': 0.4404958486557007, 'learning_rate': 4.030959752321982e-05, 'epoch': 0.8}


 80%|███████▉  | 2585/3235 [46:37<11:35,  1.07s/it]

{'loss': 0.84, 'grad_norm': 0.5159338712692261, 'learning_rate': 4.024767801857585e-05, 'epoch': 0.8}


 80%|███████▉  | 2586/3235 [46:38<11:19,  1.05s/it]

{'loss': 1.0578, 'grad_norm': 0.5035595893859863, 'learning_rate': 4.018575851393189e-05, 'epoch': 0.8}


 80%|███████▉  | 2587/3235 [46:39<10:54,  1.01s/it]

{'loss': 1.0348, 'grad_norm': 0.5122194886207581, 'learning_rate': 4.012383900928793e-05, 'epoch': 0.8}


 80%|████████  | 2588/3235 [46:40<11:22,  1.06s/it]

{'loss': 1.0778, 'grad_norm': 0.4957168698310852, 'learning_rate': 4.006191950464396e-05, 'epoch': 0.8}


 80%|████████  | 2589/3235 [46:41<11:28,  1.07s/it]

{'loss': 1.035, 'grad_norm': 0.4498300552368164, 'learning_rate': 4e-05, 'epoch': 0.8}


 80%|████████  | 2590/3235 [46:43<11:59,  1.12s/it]

{'loss': 1.0988, 'grad_norm': 0.48754996061325073, 'learning_rate': 3.9938080495356037e-05, 'epoch': 0.8}


 80%|████████  | 2591/3235 [46:44<11:19,  1.05s/it]

{'loss': 0.9078, 'grad_norm': 0.47523507475852966, 'learning_rate': 3.9876160990712077e-05, 'epoch': 0.8}


 80%|████████  | 2592/3235 [46:45<11:34,  1.08s/it]

{'loss': 0.9621, 'grad_norm': 0.5109578967094421, 'learning_rate': 3.9814241486068117e-05, 'epoch': 0.8}


 80%|████████  | 2593/3235 [46:46<11:24,  1.07s/it]

{'loss': 1.0635, 'grad_norm': 0.5089250206947327, 'learning_rate': 3.975232198142415e-05, 'epoch': 0.8}


 80%|████████  | 2594/3235 [46:47<11:47,  1.10s/it]

{'loss': 0.8147, 'grad_norm': 0.4522492289543152, 'learning_rate': 3.969040247678018e-05, 'epoch': 0.8}


 80%|████████  | 2595/3235 [46:48<10:59,  1.03s/it]

{'loss': 0.9861, 'grad_norm': 0.5788409113883972, 'learning_rate': 3.962848297213623e-05, 'epoch': 0.8}


 80%|████████  | 2596/3235 [46:49<11:06,  1.04s/it]

{'loss': 0.92, 'grad_norm': 0.464412122964859, 'learning_rate': 3.956656346749226e-05, 'epoch': 0.8}


 80%|████████  | 2597/3235 [46:50<11:24,  1.07s/it]

{'loss': 0.9882, 'grad_norm': 0.5241762399673462, 'learning_rate': 3.9504643962848296e-05, 'epoch': 0.8}


 80%|████████  | 2598/3235 [46:51<12:09,  1.14s/it]

{'loss': 1.1201, 'grad_norm': 0.48290911316871643, 'learning_rate': 3.9442724458204336e-05, 'epoch': 0.8}


 80%|████████  | 2599/3235 [46:52<11:56,  1.13s/it]

{'loss': 1.0148, 'grad_norm': 0.4729205369949341, 'learning_rate': 3.9380804953560376e-05, 'epoch': 0.8}


 80%|████████  | 2600/3235 [46:53<11:28,  1.08s/it]

{'loss': 1.1111, 'grad_norm': 0.5571753978729248, 'learning_rate': 3.931888544891641e-05, 'epoch': 0.8}


 80%|████████  | 2601/3235 [46:55<11:43,  1.11s/it]

{'loss': 0.9999, 'grad_norm': 0.447160542011261, 'learning_rate': 3.925696594427245e-05, 'epoch': 0.8}


 80%|████████  | 2602/3235 [46:56<11:13,  1.06s/it]

{'loss': 1.0744, 'grad_norm': 0.6082407832145691, 'learning_rate': 3.919504643962848e-05, 'epoch': 0.8}


 80%|████████  | 2603/3235 [46:57<11:24,  1.08s/it]

{'loss': 0.91, 'grad_norm': 0.47860169410705566, 'learning_rate': 3.913312693498452e-05, 'epoch': 0.8}


 80%|████████  | 2604/3235 [46:58<11:16,  1.07s/it]

{'loss': 1.1201, 'grad_norm': 0.5539028644561768, 'learning_rate': 3.907120743034056e-05, 'epoch': 0.8}


 81%|████████  | 2605/3235 [46:59<11:01,  1.05s/it]

{'loss': 1.0461, 'grad_norm': 0.5037031769752502, 'learning_rate': 3.9009287925696596e-05, 'epoch': 0.81}


 81%|████████  | 2606/3235 [47:00<11:30,  1.10s/it]

{'loss': 0.9383, 'grad_norm': 0.4628446400165558, 'learning_rate': 3.894736842105263e-05, 'epoch': 0.81}


 81%|████████  | 2607/3235 [47:01<10:31,  1.01s/it]

{'loss': 0.8976, 'grad_norm': 0.5274552702903748, 'learning_rate': 3.8885448916408676e-05, 'epoch': 0.81}


 81%|████████  | 2608/3235 [47:02<10:53,  1.04s/it]

{'loss': 0.9001, 'grad_norm': 0.500012993812561, 'learning_rate': 3.882352941176471e-05, 'epoch': 0.81}


 81%|████████  | 2609/3235 [47:03<11:25,  1.10s/it]

{'loss': 1.0088, 'grad_norm': 0.44260886311531067, 'learning_rate': 3.876160990712074e-05, 'epoch': 0.81}


 81%|████████  | 2610/3235 [47:04<10:30,  1.01s/it]

{'loss': 0.9674, 'grad_norm': 0.5475307106971741, 'learning_rate': 3.869969040247678e-05, 'epoch': 0.81}


 81%|████████  | 2611/3235 [47:05<10:22,  1.00it/s]

{'loss': 1.0542, 'grad_norm': 0.5517377257347107, 'learning_rate': 3.863777089783282e-05, 'epoch': 0.81}


 81%|████████  | 2612/3235 [47:06<10:25,  1.00s/it]

{'loss': 0.9694, 'grad_norm': 0.5221045017242432, 'learning_rate': 3.8575851393188856e-05, 'epoch': 0.81}


 81%|████████  | 2613/3235 [47:07<10:54,  1.05s/it]

{'loss': 1.0029, 'grad_norm': 0.48626333475112915, 'learning_rate': 3.851393188854489e-05, 'epoch': 0.81}


 81%|████████  | 2614/3235 [47:08<11:56,  1.15s/it]

{'loss': 1.1216, 'grad_norm': 0.44126105308532715, 'learning_rate': 3.845201238390093e-05, 'epoch': 0.81}


 81%|████████  | 2615/3235 [47:09<11:31,  1.12s/it]

{'loss': 0.908, 'grad_norm': 0.5047829747200012, 'learning_rate': 3.839009287925697e-05, 'epoch': 0.81}


 81%|████████  | 2616/3235 [47:10<11:15,  1.09s/it]

{'loss': 0.9504, 'grad_norm': 0.4747292995452881, 'learning_rate': 3.8328173374613e-05, 'epoch': 0.81}


 81%|████████  | 2617/3235 [47:11<10:57,  1.06s/it]

{'loss': 0.9322, 'grad_norm': 0.5020231604576111, 'learning_rate': 3.826625386996904e-05, 'epoch': 0.81}


 81%|████████  | 2618/3235 [47:13<11:22,  1.11s/it]

{'loss': 0.8964, 'grad_norm': 0.4768369197845459, 'learning_rate': 3.8204334365325075e-05, 'epoch': 0.81}


 81%|████████  | 2619/3235 [47:14<10:38,  1.04s/it]

{'loss': 1.0741, 'grad_norm': 0.518240749835968, 'learning_rate': 3.8142414860681115e-05, 'epoch': 0.81}


 81%|████████  | 2620/3235 [47:15<11:01,  1.07s/it]

{'loss': 1.1764, 'grad_norm': 0.4983774721622467, 'learning_rate': 3.8080495356037155e-05, 'epoch': 0.81}


 81%|████████  | 2621/3235 [47:16<10:57,  1.07s/it]

{'loss': 0.9555, 'grad_norm': 0.4681966304779053, 'learning_rate': 3.801857585139319e-05, 'epoch': 0.81}


 81%|████████  | 2622/3235 [47:17<11:21,  1.11s/it]

{'loss': 1.0542, 'grad_norm': 0.4576435983181, 'learning_rate': 3.795665634674922e-05, 'epoch': 0.81}


 81%|████████  | 2623/3235 [47:18<12:00,  1.18s/it]

{'loss': 1.0803, 'grad_norm': 0.4646637439727783, 'learning_rate': 3.789473684210527e-05, 'epoch': 0.81}


 81%|████████  | 2624/3235 [47:20<12:24,  1.22s/it]

{'loss': 0.9852, 'grad_norm': 0.41507530212402344, 'learning_rate': 3.78328173374613e-05, 'epoch': 0.81}


 81%|████████  | 2625/3235 [47:21<11:34,  1.14s/it]

{'loss': 1.0063, 'grad_norm': 0.4582022726535797, 'learning_rate': 3.7770897832817335e-05, 'epoch': 0.81}


 81%|████████  | 2626/3235 [47:22<11:14,  1.11s/it]

{'loss': 0.9027, 'grad_norm': 0.47643667459487915, 'learning_rate': 3.7708978328173375e-05, 'epoch': 0.81}


 81%|████████  | 2627/3235 [47:23<11:05,  1.09s/it]

{'loss': 0.9707, 'grad_norm': 0.4314773976802826, 'learning_rate': 3.7647058823529415e-05, 'epoch': 0.81}


 81%|████████  | 2628/3235 [47:24<10:49,  1.07s/it]

{'loss': 0.8936, 'grad_norm': 0.5079814195632935, 'learning_rate': 3.758513931888545e-05, 'epoch': 0.81}


 81%|████████▏ | 2629/3235 [47:25<11:22,  1.13s/it]

{'loss': 1.1338, 'grad_norm': 0.4965287148952484, 'learning_rate': 3.752321981424149e-05, 'epoch': 0.81}


 81%|████████▏ | 2630/3235 [47:26<11:43,  1.16s/it]

{'loss': 0.8619, 'grad_norm': 0.4541562795639038, 'learning_rate': 3.746130030959752e-05, 'epoch': 0.81}


 81%|████████▏ | 2631/3235 [47:27<11:34,  1.15s/it]

{'loss': 0.9796, 'grad_norm': 0.4782179594039917, 'learning_rate': 3.739938080495356e-05, 'epoch': 0.81}


 81%|████████▏ | 2632/3235 [47:28<10:50,  1.08s/it]

{'loss': 0.8444, 'grad_norm': 0.5208672285079956, 'learning_rate': 3.73374613003096e-05, 'epoch': 0.81}


 81%|████████▏ | 2633/3235 [47:29<10:57,  1.09s/it]

{'loss': 0.9821, 'grad_norm': 0.4843381941318512, 'learning_rate': 3.7275541795665635e-05, 'epoch': 0.81}


 81%|████████▏ | 2634/3235 [47:31<11:48,  1.18s/it]

{'loss': 1.0811, 'grad_norm': 0.43788859248161316, 'learning_rate': 3.721362229102167e-05, 'epoch': 0.81}


 81%|████████▏ | 2635/3235 [47:32<10:49,  1.08s/it]

{'loss': 0.9667, 'grad_norm': 0.5594814419746399, 'learning_rate': 3.7151702786377715e-05, 'epoch': 0.81}


 81%|████████▏ | 2636/3235 [47:33<10:49,  1.08s/it]

{'loss': 1.0224, 'grad_norm': 0.4903573989868164, 'learning_rate': 3.708978328173375e-05, 'epoch': 0.81}


 82%|████████▏ | 2637/3235 [47:34<10:25,  1.05s/it]

{'loss': 0.9843, 'grad_norm': 0.49319180846214294, 'learning_rate': 3.702786377708978e-05, 'epoch': 0.82}


 82%|████████▏ | 2638/3235 [47:35<10:35,  1.07s/it]

{'loss': 0.9461, 'grad_norm': 0.5210948586463928, 'learning_rate': 3.696594427244582e-05, 'epoch': 0.82}


 82%|████████▏ | 2639/3235 [47:36<10:20,  1.04s/it]

{'loss': 1.0373, 'grad_norm': 0.4862228333950043, 'learning_rate': 3.690402476780186e-05, 'epoch': 0.82}


 82%|████████▏ | 2640/3235 [47:37<11:06,  1.12s/it]

{'loss': 1.0903, 'grad_norm': 0.46731671690940857, 'learning_rate': 3.6842105263157895e-05, 'epoch': 0.82}


 82%|████████▏ | 2641/3235 [47:38<10:44,  1.08s/it]

{'loss': 0.9172, 'grad_norm': 0.49630001187324524, 'learning_rate': 3.6780185758513935e-05, 'epoch': 0.82}


 82%|████████▏ | 2642/3235 [47:39<10:34,  1.07s/it]

{'loss': 1.0126, 'grad_norm': 0.4751863479614258, 'learning_rate': 3.671826625386997e-05, 'epoch': 0.82}


 82%|████████▏ | 2643/3235 [47:40<11:11,  1.14s/it]

{'loss': 0.993, 'grad_norm': 0.43375352025032043, 'learning_rate': 3.665634674922601e-05, 'epoch': 0.82}


 82%|████████▏ | 2644/3235 [47:41<11:04,  1.12s/it]

{'loss': 1.0251, 'grad_norm': 0.4913020133972168, 'learning_rate': 3.659442724458205e-05, 'epoch': 0.82}


 82%|████████▏ | 2645/3235 [47:43<11:02,  1.12s/it]

{'loss': 1.1219, 'grad_norm': 0.5142021179199219, 'learning_rate': 3.653250773993808e-05, 'epoch': 0.82}


 82%|████████▏ | 2646/3235 [47:44<11:07,  1.13s/it]

{'loss': 1.0011, 'grad_norm': 0.4695936143398285, 'learning_rate': 3.6470588235294114e-05, 'epoch': 0.82}


 82%|████████▏ | 2647/3235 [47:45<10:47,  1.10s/it]

{'loss': 0.8095, 'grad_norm': 0.532068133354187, 'learning_rate': 3.640866873065016e-05, 'epoch': 0.82}


 82%|████████▏ | 2648/3235 [47:46<10:50,  1.11s/it]

{'loss': 0.8761, 'grad_norm': 0.4831930100917816, 'learning_rate': 3.6346749226006194e-05, 'epoch': 0.82}


 82%|████████▏ | 2649/3235 [47:47<10:44,  1.10s/it]

{'loss': 1.1355, 'grad_norm': 0.5457015037536621, 'learning_rate': 3.628482972136223e-05, 'epoch': 0.82}


 82%|████████▏ | 2650/3235 [47:48<10:12,  1.05s/it]

{'loss': 0.811, 'grad_norm': 0.46507295966148376, 'learning_rate': 3.622291021671827e-05, 'epoch': 0.82}


 82%|████████▏ | 2651/3235 [47:49<10:39,  1.09s/it]

{'loss': 0.9494, 'grad_norm': 0.48251840472221375, 'learning_rate': 3.616099071207431e-05, 'epoch': 0.82}


 82%|████████▏ | 2652/3235 [47:50<10:03,  1.04s/it]

{'loss': 1.0683, 'grad_norm': 0.5282366871833801, 'learning_rate': 3.609907120743034e-05, 'epoch': 0.82}


 82%|████████▏ | 2653/3235 [47:51<10:12,  1.05s/it]

{'loss': 0.9086, 'grad_norm': 0.4898272156715393, 'learning_rate': 3.603715170278638e-05, 'epoch': 0.82}


 82%|████████▏ | 2654/3235 [47:52<10:00,  1.03s/it]

{'loss': 1.0227, 'grad_norm': 0.49795830249786377, 'learning_rate': 3.5975232198142414e-05, 'epoch': 0.82}


 82%|████████▏ | 2655/3235 [47:53<09:47,  1.01s/it]

{'loss': 0.9668, 'grad_norm': 0.4932411313056946, 'learning_rate': 3.5913312693498454e-05, 'epoch': 0.82}


 82%|████████▏ | 2656/3235 [47:54<09:39,  1.00s/it]

{'loss': 1.076, 'grad_norm': 0.5118407011032104, 'learning_rate': 3.5851393188854494e-05, 'epoch': 0.82}


 82%|████████▏ | 2657/3235 [47:55<09:48,  1.02s/it]

{'loss': 1.0063, 'grad_norm': 0.5165562033653259, 'learning_rate': 3.578947368421053e-05, 'epoch': 0.82}


 82%|████████▏ | 2658/3235 [47:56<09:26,  1.02it/s]

{'loss': 1.0747, 'grad_norm': 0.5545226335525513, 'learning_rate': 3.572755417956656e-05, 'epoch': 0.82}


 82%|████████▏ | 2659/3235 [47:57<10:14,  1.07s/it]

{'loss': 0.9413, 'grad_norm': 0.41676589846611023, 'learning_rate': 3.566563467492261e-05, 'epoch': 0.82}


 82%|████████▏ | 2660/3235 [47:58<10:12,  1.06s/it]

{'loss': 1.0465, 'grad_norm': 0.4916168451309204, 'learning_rate': 3.560371517027864e-05, 'epoch': 0.82}


 82%|████████▏ | 2661/3235 [47:59<09:59,  1.04s/it]

{'loss': 0.89, 'grad_norm': 0.5412487983703613, 'learning_rate': 3.5541795665634674e-05, 'epoch': 0.82}


 82%|████████▏ | 2662/3235 [48:00<09:44,  1.02s/it]

{'loss': 1.134, 'grad_norm': 0.5801544785499573, 'learning_rate': 3.5479876160990714e-05, 'epoch': 0.82}


 82%|████████▏ | 2663/3235 [48:01<10:09,  1.07s/it]

{'loss': 1.1554, 'grad_norm': 0.48471030592918396, 'learning_rate': 3.5417956656346754e-05, 'epoch': 0.82}


 82%|████████▏ | 2664/3235 [48:03<10:27,  1.10s/it]

{'loss': 0.9413, 'grad_norm': 0.45980411767959595, 'learning_rate': 3.535603715170279e-05, 'epoch': 0.82}


 82%|████████▏ | 2665/3235 [48:04<10:23,  1.09s/it]

{'loss': 1.0491, 'grad_norm': 0.4940502941608429, 'learning_rate': 3.529411764705883e-05, 'epoch': 0.82}


 82%|████████▏ | 2666/3235 [48:05<10:22,  1.09s/it]

{'loss': 0.9603, 'grad_norm': 0.5292679667472839, 'learning_rate': 3.523219814241486e-05, 'epoch': 0.82}


 82%|████████▏ | 2667/3235 [48:06<10:44,  1.13s/it]

{'loss': 1.0887, 'grad_norm': 0.5317500233650208, 'learning_rate': 3.51702786377709e-05, 'epoch': 0.82}


 82%|████████▏ | 2668/3235 [48:07<10:40,  1.13s/it]

{'loss': 0.9912, 'grad_norm': 0.4886804521083832, 'learning_rate': 3.510835913312694e-05, 'epoch': 0.82}


 83%|████████▎ | 2669/3235 [48:09<11:25,  1.21s/it]

{'loss': 0.9824, 'grad_norm': 0.42036768794059753, 'learning_rate': 3.5046439628482974e-05, 'epoch': 0.83}


 83%|████████▎ | 2670/3235 [48:10<11:18,  1.20s/it]

{'loss': 1.0678, 'grad_norm': 0.493315190076828, 'learning_rate': 3.498452012383901e-05, 'epoch': 0.83}


 83%|████████▎ | 2671/3235 [48:11<10:36,  1.13s/it]

{'loss': 1.0461, 'grad_norm': 0.4780554175376892, 'learning_rate': 3.4922600619195054e-05, 'epoch': 0.83}


 83%|████████▎ | 2672/3235 [48:12<10:18,  1.10s/it]

{'loss': 1.0388, 'grad_norm': 0.5146607160568237, 'learning_rate': 3.486068111455109e-05, 'epoch': 0.83}


 83%|████████▎ | 2673/3235 [48:13<10:19,  1.10s/it]

{'loss': 1.0889, 'grad_norm': 0.48978573083877563, 'learning_rate': 3.479876160990712e-05, 'epoch': 0.83}


 83%|████████▎ | 2674/3235 [48:14<10:29,  1.12s/it]

{'loss': 1.0546, 'grad_norm': 0.48904773592948914, 'learning_rate': 3.473684210526316e-05, 'epoch': 0.83}


 83%|████████▎ | 2675/3235 [48:15<11:24,  1.22s/it]

{'loss': 0.9836, 'grad_norm': 0.4525517523288727, 'learning_rate': 3.46749226006192e-05, 'epoch': 0.83}


 83%|████████▎ | 2676/3235 [48:17<11:10,  1.20s/it]

{'loss': 0.7836, 'grad_norm': 0.47238612174987793, 'learning_rate': 3.4613003095975233e-05, 'epoch': 0.83}


 83%|████████▎ | 2677/3235 [48:18<10:35,  1.14s/it]

{'loss': 1.0553, 'grad_norm': 0.5597366094589233, 'learning_rate': 3.4551083591331273e-05, 'epoch': 0.83}


 83%|████████▎ | 2678/3235 [48:19<10:25,  1.12s/it]

{'loss': 1.0049, 'grad_norm': 0.4857565760612488, 'learning_rate': 3.448916408668731e-05, 'epoch': 0.83}


 83%|████████▎ | 2679/3235 [48:20<10:20,  1.12s/it]

{'loss': 1.1331, 'grad_norm': 0.49622082710266113, 'learning_rate': 3.442724458204335e-05, 'epoch': 0.83}


 83%|████████▎ | 2680/3235 [48:21<10:35,  1.15s/it]

{'loss': 1.0103, 'grad_norm': 0.44750887155532837, 'learning_rate': 3.436532507739939e-05, 'epoch': 0.83}


 83%|████████▎ | 2681/3235 [48:22<10:35,  1.15s/it]

{'loss': 0.9043, 'grad_norm': 0.5149642825126648, 'learning_rate': 3.430340557275542e-05, 'epoch': 0.83}


 83%|████████▎ | 2682/3235 [48:23<10:09,  1.10s/it]

{'loss': 0.9139, 'grad_norm': 0.48534584045410156, 'learning_rate': 3.424148606811145e-05, 'epoch': 0.83}


 83%|████████▎ | 2683/3235 [48:24<10:34,  1.15s/it]

{'loss': 1.027, 'grad_norm': 0.4668998420238495, 'learning_rate': 3.41795665634675e-05, 'epoch': 0.83}


 83%|████████▎ | 2684/3235 [48:26<10:36,  1.16s/it]

{'loss': 1.0944, 'grad_norm': 0.47801917791366577, 'learning_rate': 3.411764705882353e-05, 'epoch': 0.83}


 83%|████████▎ | 2685/3235 [48:27<11:02,  1.20s/it]

{'loss': 1.0517, 'grad_norm': 0.4790627360343933, 'learning_rate': 3.4055727554179566e-05, 'epoch': 0.83}


 83%|████████▎ | 2686/3235 [48:28<10:40,  1.17s/it]

{'loss': 1.0147, 'grad_norm': 0.49075594544410706, 'learning_rate': 3.39938080495356e-05, 'epoch': 0.83}


 83%|████████▎ | 2687/3235 [48:29<11:03,  1.21s/it]

{'loss': 0.9727, 'grad_norm': 0.46184563636779785, 'learning_rate': 3.3931888544891646e-05, 'epoch': 0.83}


 83%|████████▎ | 2688/3235 [48:30<10:33,  1.16s/it]

{'loss': 0.9846, 'grad_norm': 0.5092679262161255, 'learning_rate': 3.386996904024768e-05, 'epoch': 0.83}


 83%|████████▎ | 2689/3235 [48:31<09:49,  1.08s/it]

{'loss': 1.0478, 'grad_norm': 0.5182161927223206, 'learning_rate': 3.380804953560371e-05, 'epoch': 0.83}


 83%|████████▎ | 2690/3235 [48:32<09:32,  1.05s/it]

{'loss': 1.0519, 'grad_norm': 0.48895180225372314, 'learning_rate': 3.374613003095975e-05, 'epoch': 0.83}


 83%|████████▎ | 2691/3235 [48:34<10:21,  1.14s/it]

{'loss': 0.8717, 'grad_norm': 0.4339689016342163, 'learning_rate': 3.368421052631579e-05, 'epoch': 0.83}


 83%|████████▎ | 2692/3235 [48:34<09:49,  1.09s/it]

{'loss': 0.9191, 'grad_norm': 0.5211464762687683, 'learning_rate': 3.3622291021671826e-05, 'epoch': 0.83}


 83%|████████▎ | 2693/3235 [48:36<09:48,  1.09s/it]

{'loss': 0.9535, 'grad_norm': 0.5831875205039978, 'learning_rate': 3.3560371517027866e-05, 'epoch': 0.83}


 83%|████████▎ | 2694/3235 [48:37<09:51,  1.09s/it]

{'loss': 0.8991, 'grad_norm': 0.44280576705932617, 'learning_rate': 3.34984520123839e-05, 'epoch': 0.83}


 83%|████████▎ | 2695/3235 [48:38<10:07,  1.13s/it]

{'loss': 0.9328, 'grad_norm': 0.4245014488697052, 'learning_rate': 3.343653250773994e-05, 'epoch': 0.83}


 83%|████████▎ | 2696/3235 [48:39<09:32,  1.06s/it]

{'loss': 0.9235, 'grad_norm': 0.47250670194625854, 'learning_rate': 3.337461300309598e-05, 'epoch': 0.83}


 83%|████████▎ | 2697/3235 [48:40<09:39,  1.08s/it]

{'loss': 1.0382, 'grad_norm': 0.49827149510383606, 'learning_rate': 3.331269349845201e-05, 'epoch': 0.83}


 83%|████████▎ | 2698/3235 [48:41<08:47,  1.02it/s]

{'loss': 1.1161, 'grad_norm': 0.5799596309661865, 'learning_rate': 3.3250773993808046e-05, 'epoch': 0.83}


 83%|████████▎ | 2699/3235 [48:42<08:31,  1.05it/s]

{'loss': 1.1996, 'grad_norm': 0.5740479826927185, 'learning_rate': 3.318885448916409e-05, 'epoch': 0.83}


 83%|████████▎ | 2700/3235 [48:43<09:00,  1.01s/it]

{'loss': 0.9732, 'grad_norm': 0.4615619480609894, 'learning_rate': 3.3126934984520126e-05, 'epoch': 0.83}


 83%|████████▎ | 2701/3235 [48:44<08:46,  1.01it/s]

{'loss': 0.8037, 'grad_norm': 0.45609113574028015, 'learning_rate': 3.306501547987616e-05, 'epoch': 0.83}


 84%|████████▎ | 2702/3235 [48:45<09:34,  1.08s/it]

{'loss': 0.9227, 'grad_norm': 0.45443499088287354, 'learning_rate': 3.30030959752322e-05, 'epoch': 0.84}


 84%|████████▎ | 2703/3235 [48:46<09:25,  1.06s/it]

{'loss': 0.8602, 'grad_norm': 0.47040221095085144, 'learning_rate': 3.294117647058824e-05, 'epoch': 0.84}


 84%|████████▎ | 2704/3235 [48:47<09:30,  1.07s/it]

{'loss': 1.1307, 'grad_norm': 0.6036956310272217, 'learning_rate': 3.287925696594427e-05, 'epoch': 0.84}


 84%|████████▎ | 2705/3235 [48:48<09:11,  1.04s/it]

{'loss': 1.1137, 'grad_norm': 0.48189273476600647, 'learning_rate': 3.281733746130031e-05, 'epoch': 0.84}


 84%|████████▎ | 2706/3235 [48:49<09:18,  1.05s/it]

{'loss': 0.915, 'grad_norm': 0.4975087642669678, 'learning_rate': 3.2755417956656346e-05, 'epoch': 0.84}


 84%|████████▎ | 2707/3235 [48:50<09:28,  1.08s/it]

{'loss': 0.965, 'grad_norm': 0.5103808641433716, 'learning_rate': 3.2693498452012386e-05, 'epoch': 0.84}


 84%|████████▎ | 2708/3235 [48:51<08:52,  1.01s/it]

{'loss': 0.9799, 'grad_norm': 0.5302641987800598, 'learning_rate': 3.2631578947368426e-05, 'epoch': 0.84}


 84%|████████▎ | 2709/3235 [48:52<08:28,  1.04it/s]

{'loss': 1.0443, 'grad_norm': 0.5410245656967163, 'learning_rate': 3.256965944272446e-05, 'epoch': 0.84}


 84%|████████▍ | 2710/3235 [48:53<08:52,  1.01s/it]

{'loss': 0.8755, 'grad_norm': 0.49451887607574463, 'learning_rate': 3.250773993808049e-05, 'epoch': 0.84}


 84%|████████▍ | 2711/3235 [48:54<08:43,  1.00it/s]

{'loss': 1.0316, 'grad_norm': 0.5508071184158325, 'learning_rate': 3.244582043343654e-05, 'epoch': 0.84}


 84%|████████▍ | 2712/3235 [48:55<08:38,  1.01it/s]

{'loss': 1.0557, 'grad_norm': 0.5892446637153625, 'learning_rate': 3.238390092879257e-05, 'epoch': 0.84}


 84%|████████▍ | 2713/3235 [48:56<09:11,  1.06s/it]

{'loss': 1.0572, 'grad_norm': 0.5239875316619873, 'learning_rate': 3.2321981424148605e-05, 'epoch': 0.84}


 84%|████████▍ | 2714/3235 [48:57<08:38,  1.01it/s]

{'loss': 0.9826, 'grad_norm': 0.5581243634223938, 'learning_rate': 3.2260061919504645e-05, 'epoch': 0.84}


 84%|████████▍ | 2715/3235 [48:58<09:11,  1.06s/it]

{'loss': 1.048, 'grad_norm': 0.47884029150009155, 'learning_rate': 3.2198142414860685e-05, 'epoch': 0.84}


 84%|████████▍ | 2716/3235 [48:59<08:36,  1.00it/s]

{'loss': 0.8809, 'grad_norm': 0.46823620796203613, 'learning_rate': 3.213622291021672e-05, 'epoch': 0.84}


 84%|████████▍ | 2717/3235 [49:00<08:17,  1.04it/s]

{'loss': 0.952, 'grad_norm': 0.48971790075302124, 'learning_rate': 3.207430340557276e-05, 'epoch': 0.84}


 84%|████████▍ | 2718/3235 [49:01<08:17,  1.04it/s]

{'loss': 1.1127, 'grad_norm': 0.48752114176750183, 'learning_rate': 3.201238390092879e-05, 'epoch': 0.84}


 84%|████████▍ | 2719/3235 [49:02<08:15,  1.04it/s]

{'loss': 0.9228, 'grad_norm': 0.5269837379455566, 'learning_rate': 3.195046439628483e-05, 'epoch': 0.84}


 84%|████████▍ | 2720/3235 [49:03<08:49,  1.03s/it]

{'loss': 0.9565, 'grad_norm': 0.4428962469100952, 'learning_rate': 3.188854489164087e-05, 'epoch': 0.84}


 84%|████████▍ | 2721/3235 [49:04<08:10,  1.05it/s]

{'loss': 0.9863, 'grad_norm': 0.5462864637374878, 'learning_rate': 3.1826625386996905e-05, 'epoch': 0.84}


 84%|████████▍ | 2722/3235 [49:05<08:18,  1.03it/s]

{'loss': 1.0662, 'grad_norm': 0.47783443331718445, 'learning_rate': 3.176470588235294e-05, 'epoch': 0.84}


 84%|████████▍ | 2723/3235 [49:06<08:41,  1.02s/it]

{'loss': 1.0259, 'grad_norm': 0.5011237263679504, 'learning_rate': 3.170278637770898e-05, 'epoch': 0.84}


 84%|████████▍ | 2724/3235 [49:07<09:19,  1.09s/it]

{'loss': 1.0081, 'grad_norm': 0.45976722240448, 'learning_rate': 3.164086687306502e-05, 'epoch': 0.84}


 84%|████████▍ | 2725/3235 [49:08<08:50,  1.04s/it]

{'loss': 1.1328, 'grad_norm': 0.5529881715774536, 'learning_rate': 3.157894736842105e-05, 'epoch': 0.84}


 84%|████████▍ | 2726/3235 [49:09<08:54,  1.05s/it]

{'loss': 0.9486, 'grad_norm': 0.45519718527793884, 'learning_rate': 3.151702786377709e-05, 'epoch': 0.84}


 84%|████████▍ | 2727/3235 [49:10<08:21,  1.01it/s]

{'loss': 0.9442, 'grad_norm': 0.5456218719482422, 'learning_rate': 3.145510835913313e-05, 'epoch': 0.84}


 84%|████████▍ | 2728/3235 [49:11<07:55,  1.07it/s]

{'loss': 0.9818, 'grad_norm': 0.5660227537155151, 'learning_rate': 3.1393188854489165e-05, 'epoch': 0.84}


 84%|████████▍ | 2729/3235 [49:12<08:00,  1.05it/s]

{'loss': 0.9671, 'grad_norm': 0.4764803946018219, 'learning_rate': 3.1331269349845205e-05, 'epoch': 0.84}


 84%|████████▍ | 2730/3235 [49:13<07:43,  1.09it/s]

{'loss': 0.9786, 'grad_norm': 0.5727441906929016, 'learning_rate': 3.126934984520124e-05, 'epoch': 0.84}


 84%|████████▍ | 2731/3235 [49:14<08:19,  1.01it/s]

{'loss': 0.9693, 'grad_norm': 0.4564054608345032, 'learning_rate': 3.120743034055728e-05, 'epoch': 0.84}


 84%|████████▍ | 2732/3235 [49:15<08:30,  1.01s/it]

{'loss': 0.9485, 'grad_norm': 0.4768820106983185, 'learning_rate': 3.114551083591332e-05, 'epoch': 0.84}


 84%|████████▍ | 2733/3235 [49:16<08:26,  1.01s/it]

{'loss': 0.9686, 'grad_norm': 0.5568751096725464, 'learning_rate': 3.108359133126935e-05, 'epoch': 0.84}


 85%|████████▍ | 2734/3235 [49:17<08:24,  1.01s/it]

{'loss': 0.9462, 'grad_norm': 0.5348932147026062, 'learning_rate': 3.1021671826625385e-05, 'epoch': 0.85}


 85%|████████▍ | 2735/3235 [49:18<09:00,  1.08s/it]

{'loss': 1.057, 'grad_norm': 0.45415642857551575, 'learning_rate': 3.0959752321981425e-05, 'epoch': 0.85}


 85%|████████▍ | 2736/3235 [49:19<09:06,  1.09s/it]

{'loss': 0.9718, 'grad_norm': 0.4672534465789795, 'learning_rate': 3.0897832817337465e-05, 'epoch': 0.85}


 85%|████████▍ | 2737/3235 [49:21<09:31,  1.15s/it]

{'loss': 0.9764, 'grad_norm': 0.4634224474430084, 'learning_rate': 3.08359133126935e-05, 'epoch': 0.85}


 85%|████████▍ | 2738/3235 [49:22<09:18,  1.12s/it]

{'loss': 1.0621, 'grad_norm': 0.5673166513442993, 'learning_rate': 3.077399380804954e-05, 'epoch': 0.85}


 85%|████████▍ | 2739/3235 [49:23<08:54,  1.08s/it]

{'loss': 0.9777, 'grad_norm': 0.5157853364944458, 'learning_rate': 3.071207430340558e-05, 'epoch': 0.85}


 85%|████████▍ | 2740/3235 [49:24<08:55,  1.08s/it]

{'loss': 1.0702, 'grad_norm': 0.4897884726524353, 'learning_rate': 3.065015479876161e-05, 'epoch': 0.85}


 85%|████████▍ | 2741/3235 [49:25<08:18,  1.01s/it]

{'loss': 0.9582, 'grad_norm': 0.5287984013557434, 'learning_rate': 3.058823529411765e-05, 'epoch': 0.85}


 85%|████████▍ | 2742/3235 [49:26<08:48,  1.07s/it]

{'loss': 0.9434, 'grad_norm': 0.4688917100429535, 'learning_rate': 3.0526315789473684e-05, 'epoch': 0.85}


 85%|████████▍ | 2743/3235 [49:27<08:48,  1.07s/it]

{'loss': 0.9667, 'grad_norm': 0.4718683660030365, 'learning_rate': 3.046439628482972e-05, 'epoch': 0.85}


 85%|████████▍ | 2744/3235 [49:28<09:06,  1.11s/it]

{'loss': 1.0777, 'grad_norm': 0.4810080826282501, 'learning_rate': 3.040247678018576e-05, 'epoch': 0.85}


 85%|████████▍ | 2745/3235 [49:29<08:31,  1.04s/it]

{'loss': 1.0296, 'grad_norm': 0.5836926102638245, 'learning_rate': 3.0340557275541798e-05, 'epoch': 0.85}


 85%|████████▍ | 2746/3235 [49:30<08:22,  1.03s/it]

{'loss': 0.9448, 'grad_norm': 0.5350738167762756, 'learning_rate': 3.0278637770897834e-05, 'epoch': 0.85}


 85%|████████▍ | 2747/3235 [49:31<08:56,  1.10s/it]

{'loss': 0.9314, 'grad_norm': 0.41190236806869507, 'learning_rate': 3.0216718266253874e-05, 'epoch': 0.85}


 85%|████████▍ | 2748/3235 [49:32<08:24,  1.04s/it]

{'loss': 0.9905, 'grad_norm': 0.5285350680351257, 'learning_rate': 3.015479876160991e-05, 'epoch': 0.85}


 85%|████████▍ | 2749/3235 [49:33<09:07,  1.13s/it]

{'loss': 1.0334, 'grad_norm': 0.4148096740245819, 'learning_rate': 3.0092879256965944e-05, 'epoch': 0.85}


 85%|████████▌ | 2750/3235 [49:35<09:33,  1.18s/it]

{'loss': 1.1301, 'grad_norm': 0.44987279176712036, 'learning_rate': 3.0030959752321984e-05, 'epoch': 0.85}


 85%|████████▌ | 2751/3235 [49:36<09:48,  1.22s/it]

{'loss': 1.0739, 'grad_norm': 0.46666353940963745, 'learning_rate': 2.996904024767802e-05, 'epoch': 0.85}


 85%|████████▌ | 2752/3235 [49:37<09:29,  1.18s/it]

{'loss': 1.0327, 'grad_norm': 0.522790789604187, 'learning_rate': 2.9907120743034057e-05, 'epoch': 0.85}


 85%|████████▌ | 2753/3235 [49:39<10:07,  1.26s/it]

{'loss': 1.0309, 'grad_norm': 0.44985854625701904, 'learning_rate': 2.9845201238390097e-05, 'epoch': 0.85}


 85%|████████▌ | 2754/3235 [49:40<10:15,  1.28s/it]

{'loss': 1.2153, 'grad_norm': 0.5245577692985535, 'learning_rate': 2.978328173374613e-05, 'epoch': 0.85}


 85%|████████▌ | 2755/3235 [49:41<09:06,  1.14s/it]

{'loss': 1.0949, 'grad_norm': 0.5508984923362732, 'learning_rate': 2.9721362229102167e-05, 'epoch': 0.85}


 85%|████████▌ | 2756/3235 [49:42<09:04,  1.14s/it]

{'loss': 1.0919, 'grad_norm': 0.49063077569007874, 'learning_rate': 2.9659442724458207e-05, 'epoch': 0.85}


 85%|████████▌ | 2757/3235 [49:43<09:44,  1.22s/it]

{'loss': 0.8869, 'grad_norm': 0.4275829792022705, 'learning_rate': 2.9597523219814244e-05, 'epoch': 0.85}


 85%|████████▌ | 2758/3235 [49:44<08:52,  1.12s/it]

{'loss': 0.9698, 'grad_norm': 0.5287405252456665, 'learning_rate': 2.953560371517028e-05, 'epoch': 0.85}


 85%|████████▌ | 2759/3235 [49:45<09:04,  1.14s/it]

{'loss': 0.9381, 'grad_norm': 0.4891441762447357, 'learning_rate': 2.9473684210526314e-05, 'epoch': 0.85}


 85%|████████▌ | 2760/3235 [49:47<09:10,  1.16s/it]

{'loss': 1.0345, 'grad_norm': 0.47958412766456604, 'learning_rate': 2.9411764705882354e-05, 'epoch': 0.85}


 85%|████████▌ | 2761/3235 [49:48<10:12,  1.29s/it]

{'loss': 0.9666, 'grad_norm': 0.5572128295898438, 'learning_rate': 2.934984520123839e-05, 'epoch': 0.85}


 85%|████████▌ | 2762/3235 [49:49<09:55,  1.26s/it]

{'loss': 0.9447, 'grad_norm': 0.4941897690296173, 'learning_rate': 2.9287925696594427e-05, 'epoch': 0.85}


 85%|████████▌ | 2763/3235 [49:50<08:51,  1.13s/it]

{'loss': 0.984, 'grad_norm': 0.5294554829597473, 'learning_rate': 2.9226006191950467e-05, 'epoch': 0.85}


 85%|████████▌ | 2764/3235 [49:51<09:03,  1.15s/it]

{'loss': 1.0495, 'grad_norm': 0.45220696926116943, 'learning_rate': 2.9164086687306504e-05, 'epoch': 0.85}


 85%|████████▌ | 2765/3235 [49:53<09:21,  1.19s/it]

{'loss': 0.9625, 'grad_norm': 0.4461837708950043, 'learning_rate': 2.9102167182662537e-05, 'epoch': 0.85}


 86%|████████▌ | 2766/3235 [49:54<09:19,  1.19s/it]

{'loss': 1.0684, 'grad_norm': 0.48933592438697815, 'learning_rate': 2.9040247678018577e-05, 'epoch': 0.86}


 86%|████████▌ | 2767/3235 [49:55<08:58,  1.15s/it]

{'loss': 1.0927, 'grad_norm': 0.5272071957588196, 'learning_rate': 2.8978328173374613e-05, 'epoch': 0.86}


 86%|████████▌ | 2768/3235 [49:56<08:57,  1.15s/it]

{'loss': 1.1477, 'grad_norm': 0.5004926919937134, 'learning_rate': 2.891640866873065e-05, 'epoch': 0.86}


 86%|████████▌ | 2769/3235 [49:57<09:01,  1.16s/it]

{'loss': 1.0316, 'grad_norm': 0.44091346859931946, 'learning_rate': 2.885448916408669e-05, 'epoch': 0.86}


 86%|████████▌ | 2770/3235 [49:59<09:16,  1.20s/it]

{'loss': 1.1889, 'grad_norm': 0.4511704444885254, 'learning_rate': 2.8792569659442727e-05, 'epoch': 0.86}


 86%|████████▌ | 2771/3235 [50:00<09:05,  1.18s/it]

{'loss': 1.0389, 'grad_norm': 0.49640682339668274, 'learning_rate': 2.873065015479876e-05, 'epoch': 0.86}


 86%|████████▌ | 2772/3235 [50:01<09:10,  1.19s/it]

{'loss': 0.9835, 'grad_norm': 0.43272334337234497, 'learning_rate': 2.86687306501548e-05, 'epoch': 0.86}


 86%|████████▌ | 2773/3235 [50:02<09:18,  1.21s/it]

{'loss': 1.0134, 'grad_norm': 0.5302269458770752, 'learning_rate': 2.8606811145510837e-05, 'epoch': 0.86}


 86%|████████▌ | 2774/3235 [50:03<08:33,  1.11s/it]

{'loss': 1.0388, 'grad_norm': 0.5162099599838257, 'learning_rate': 2.8544891640866873e-05, 'epoch': 0.86}


 86%|████████▌ | 2775/3235 [50:04<08:22,  1.09s/it]

{'loss': 0.9263, 'grad_norm': 0.5407185554504395, 'learning_rate': 2.8482972136222913e-05, 'epoch': 0.86}


 86%|████████▌ | 2776/3235 [50:05<08:35,  1.12s/it]

{'loss': 1.0732, 'grad_norm': 0.5080339312553406, 'learning_rate': 2.842105263157895e-05, 'epoch': 0.86}


 86%|████████▌ | 2777/3235 [50:07<08:54,  1.17s/it]

{'loss': 1.1913, 'grad_norm': 0.48312628269195557, 'learning_rate': 2.8359133126934983e-05, 'epoch': 0.86}


 86%|████████▌ | 2778/3235 [50:07<08:22,  1.10s/it]

{'loss': 0.9631, 'grad_norm': 0.49467360973358154, 'learning_rate': 2.8297213622291023e-05, 'epoch': 0.86}


 86%|████████▌ | 2779/3235 [50:09<09:07,  1.20s/it]

{'loss': 1.0394, 'grad_norm': 0.4501820504665375, 'learning_rate': 2.823529411764706e-05, 'epoch': 0.86}


 86%|████████▌ | 2780/3235 [50:10<08:44,  1.15s/it]

{'loss': 1.059, 'grad_norm': 0.4987928569316864, 'learning_rate': 2.8173374613003096e-05, 'epoch': 0.86}


 86%|████████▌ | 2781/3235 [50:11<08:57,  1.18s/it]

{'loss': 1.0934, 'grad_norm': 0.45416000485420227, 'learning_rate': 2.8111455108359136e-05, 'epoch': 0.86}


 86%|████████▌ | 2782/3235 [50:12<08:26,  1.12s/it]

{'loss': 1.0119, 'grad_norm': 0.5512709617614746, 'learning_rate': 2.8049535603715173e-05, 'epoch': 0.86}


 86%|████████▌ | 2783/3235 [50:13<08:37,  1.15s/it]

{'loss': 1.1203, 'grad_norm': 0.4625488221645355, 'learning_rate': 2.7987616099071206e-05, 'epoch': 0.86}


 86%|████████▌ | 2784/3235 [50:14<08:30,  1.13s/it]

{'loss': 1.0648, 'grad_norm': 0.4924893379211426, 'learning_rate': 2.7925696594427246e-05, 'epoch': 0.86}


 86%|████████▌ | 2785/3235 [50:16<08:31,  1.14s/it]

{'loss': 1.0227, 'grad_norm': 0.47223737835884094, 'learning_rate': 2.7863777089783283e-05, 'epoch': 0.86}


 86%|████████▌ | 2786/3235 [50:17<08:34,  1.15s/it]

{'loss': 1.0843, 'grad_norm': 0.5086444616317749, 'learning_rate': 2.780185758513932e-05, 'epoch': 0.86}


 86%|████████▌ | 2787/3235 [50:18<08:29,  1.14s/it]

{'loss': 1.0264, 'grad_norm': 0.46850958466529846, 'learning_rate': 2.773993808049536e-05, 'epoch': 0.86}


 86%|████████▌ | 2788/3235 [50:19<07:52,  1.06s/it]

{'loss': 0.9894, 'grad_norm': 0.5633645057678223, 'learning_rate': 2.7678018575851393e-05, 'epoch': 0.86}


 86%|████████▌ | 2789/3235 [50:20<08:19,  1.12s/it]

{'loss': 1.0285, 'grad_norm': 0.46195876598358154, 'learning_rate': 2.761609907120743e-05, 'epoch': 0.86}


 86%|████████▌ | 2790/3235 [50:21<08:27,  1.14s/it]

{'loss': 1.0615, 'grad_norm': 0.47565439343452454, 'learning_rate': 2.755417956656347e-05, 'epoch': 0.86}


 86%|████████▋ | 2791/3235 [50:23<08:49,  1.19s/it]

{'loss': 1.0656, 'grad_norm': 0.4962935149669647, 'learning_rate': 2.7492260061919506e-05, 'epoch': 0.86}


 86%|████████▋ | 2792/3235 [50:24<08:34,  1.16s/it]

{'loss': 0.9671, 'grad_norm': 0.49381762742996216, 'learning_rate': 2.7430340557275543e-05, 'epoch': 0.86}


 86%|████████▋ | 2793/3235 [50:25<08:22,  1.14s/it]

{'loss': 0.8795, 'grad_norm': 0.5118005871772766, 'learning_rate': 2.7368421052631583e-05, 'epoch': 0.86}


 86%|████████▋ | 2794/3235 [50:26<08:39,  1.18s/it]

{'loss': 0.9603, 'grad_norm': 0.4255158603191376, 'learning_rate': 2.7306501547987616e-05, 'epoch': 0.86}


 86%|████████▋ | 2795/3235 [50:27<08:55,  1.22s/it]

{'loss': 1.0295, 'grad_norm': 0.4869900941848755, 'learning_rate': 2.7244582043343652e-05, 'epoch': 0.86}


 86%|████████▋ | 2796/3235 [50:29<08:54,  1.22s/it]

{'loss': 1.1709, 'grad_norm': 0.48722341656684875, 'learning_rate': 2.7182662538699692e-05, 'epoch': 0.86}


 86%|████████▋ | 2797/3235 [50:30<08:56,  1.22s/it]

{'loss': 0.9784, 'grad_norm': 0.45602819323539734, 'learning_rate': 2.712074303405573e-05, 'epoch': 0.86}


 86%|████████▋ | 2798/3235 [50:31<08:39,  1.19s/it]

{'loss': 0.9608, 'grad_norm': 0.48464933037757874, 'learning_rate': 2.7058823529411766e-05, 'epoch': 0.86}


 87%|████████▋ | 2799/3235 [50:32<08:31,  1.17s/it]

{'loss': 1.0552, 'grad_norm': 0.47231927514076233, 'learning_rate': 2.6996904024767806e-05, 'epoch': 0.87}


 87%|████████▋ | 2800/3235 [50:33<08:01,  1.11s/it]

{'loss': 1.0844, 'grad_norm': 0.529905378818512, 'learning_rate': 2.693498452012384e-05, 'epoch': 0.87}


 87%|████████▋ | 2801/3235 [50:34<07:34,  1.05s/it]

{'loss': 1.0714, 'grad_norm': 0.5620638132095337, 'learning_rate': 2.6873065015479876e-05, 'epoch': 0.87}


 87%|████████▋ | 2802/3235 [50:35<07:37,  1.06s/it]

{'loss': 0.9333, 'grad_norm': 0.45531824231147766, 'learning_rate': 2.6811145510835916e-05, 'epoch': 0.87}


 87%|████████▋ | 2803/3235 [50:36<07:44,  1.08s/it]

{'loss': 0.8897, 'grad_norm': 0.4728715717792511, 'learning_rate': 2.6749226006191952e-05, 'epoch': 0.87}


 87%|████████▋ | 2804/3235 [50:37<07:26,  1.04s/it]

{'loss': 0.9813, 'grad_norm': 0.5077470541000366, 'learning_rate': 2.668730650154799e-05, 'epoch': 0.87}


 87%|████████▋ | 2805/3235 [50:38<07:13,  1.01s/it]

{'loss': 0.8693, 'grad_norm': 0.5434483289718628, 'learning_rate': 2.662538699690403e-05, 'epoch': 0.87}


 87%|████████▋ | 2806/3235 [50:39<07:59,  1.12s/it]

{'loss': 1.1355, 'grad_norm': 0.4701467752456665, 'learning_rate': 2.6563467492260062e-05, 'epoch': 0.87}


 87%|████████▋ | 2807/3235 [50:40<07:55,  1.11s/it]

{'loss': 1.1426, 'grad_norm': 0.49829530715942383, 'learning_rate': 2.65015479876161e-05, 'epoch': 0.87}


 87%|████████▋ | 2808/3235 [50:41<07:31,  1.06s/it]

{'loss': 0.8784, 'grad_norm': 0.4959659278392792, 'learning_rate': 2.643962848297214e-05, 'epoch': 0.87}


 87%|████████▋ | 2809/3235 [50:42<07:30,  1.06s/it]

{'loss': 1.045, 'grad_norm': 0.5221631526947021, 'learning_rate': 2.6377708978328175e-05, 'epoch': 0.87}


 87%|████████▋ | 2810/3235 [50:43<07:16,  1.03s/it]

{'loss': 1.0867, 'grad_norm': 0.5320389866828918, 'learning_rate': 2.6315789473684212e-05, 'epoch': 0.87}


 87%|████████▋ | 2811/3235 [50:45<07:33,  1.07s/it]

{'loss': 0.9833, 'grad_norm': 0.46262967586517334, 'learning_rate': 2.6253869969040252e-05, 'epoch': 0.87}


 87%|████████▋ | 2812/3235 [50:45<07:09,  1.02s/it]

{'loss': 1.016, 'grad_norm': 0.6136477589607239, 'learning_rate': 2.6191950464396285e-05, 'epoch': 0.87}


 87%|████████▋ | 2813/3235 [50:46<06:54,  1.02it/s]

{'loss': 1.0212, 'grad_norm': 0.5482628345489502, 'learning_rate': 2.6130030959752322e-05, 'epoch': 0.87}


 87%|████████▋ | 2814/3235 [50:48<07:28,  1.06s/it]

{'loss': 1.0538, 'grad_norm': 0.4592691659927368, 'learning_rate': 2.6068111455108362e-05, 'epoch': 0.87}


 87%|████████▋ | 2815/3235 [50:49<07:23,  1.06s/it]

{'loss': 0.9349, 'grad_norm': 0.4694719612598419, 'learning_rate': 2.60061919504644e-05, 'epoch': 0.87}


 87%|████████▋ | 2816/3235 [50:50<07:28,  1.07s/it]

{'loss': 0.9262, 'grad_norm': 0.4168098270893097, 'learning_rate': 2.5944272445820432e-05, 'epoch': 0.87}


 87%|████████▋ | 2817/3235 [50:50<06:31,  1.07it/s]

{'loss': 0.9482, 'grad_norm': 0.6345692873001099, 'learning_rate': 2.5882352941176475e-05, 'epoch': 0.87}


 87%|████████▋ | 2818/3235 [50:52<07:02,  1.01s/it]

{'loss': 0.9617, 'grad_norm': 0.4387955665588379, 'learning_rate': 2.582043343653251e-05, 'epoch': 0.87}


 87%|████████▋ | 2819/3235 [50:53<07:05,  1.02s/it]

{'loss': 1.1231, 'grad_norm': 0.5047072172164917, 'learning_rate': 2.5758513931888545e-05, 'epoch': 0.87}


 87%|████████▋ | 2820/3235 [50:54<07:29,  1.08s/it]

{'loss': 1.0305, 'grad_norm': 0.47719231247901917, 'learning_rate': 2.5696594427244585e-05, 'epoch': 0.87}


 87%|████████▋ | 2821/3235 [50:55<07:42,  1.12s/it]

{'loss': 0.9139, 'grad_norm': 0.49028128385543823, 'learning_rate': 2.563467492260062e-05, 'epoch': 0.87}


 87%|████████▋ | 2822/3235 [50:56<08:08,  1.18s/it]

{'loss': 1.0021, 'grad_norm': 0.44260451197624207, 'learning_rate': 2.5572755417956655e-05, 'epoch': 0.87}


 87%|████████▋ | 2823/3235 [50:57<07:30,  1.09s/it]

{'loss': 0.9846, 'grad_norm': 0.5137882232666016, 'learning_rate': 2.5510835913312698e-05, 'epoch': 0.87}


 87%|████████▋ | 2824/3235 [50:58<07:48,  1.14s/it]

{'loss': 1.0784, 'grad_norm': 0.4991311728954315, 'learning_rate': 2.544891640866873e-05, 'epoch': 0.87}


 87%|████████▋ | 2825/3235 [51:00<07:46,  1.14s/it]

{'loss': 1.0066, 'grad_norm': 0.4898292124271393, 'learning_rate': 2.5386996904024768e-05, 'epoch': 0.87}


 87%|████████▋ | 2826/3235 [51:01<07:24,  1.09s/it]

{'loss': 1.0474, 'grad_norm': 0.5662915110588074, 'learning_rate': 2.5325077399380808e-05, 'epoch': 0.87}


 87%|████████▋ | 2827/3235 [51:02<07:18,  1.08s/it]

{'loss': 0.9512, 'grad_norm': 0.504048228263855, 'learning_rate': 2.5263157894736845e-05, 'epoch': 0.87}


 87%|████████▋ | 2828/3235 [51:03<07:40,  1.13s/it]

{'loss': 1.1319, 'grad_norm': 0.4572177231311798, 'learning_rate': 2.5201238390092878e-05, 'epoch': 0.87}


 87%|████████▋ | 2829/3235 [51:04<07:18,  1.08s/it]

{'loss': 1.0169, 'grad_norm': 0.48919907212257385, 'learning_rate': 2.513931888544892e-05, 'epoch': 0.87}


 87%|████████▋ | 2830/3235 [51:05<07:33,  1.12s/it]

{'loss': 0.8705, 'grad_norm': 0.4257265031337738, 'learning_rate': 2.5077399380804955e-05, 'epoch': 0.87}


 88%|████████▊ | 2831/3235 [51:07<08:09,  1.21s/it]

{'loss': 0.904, 'grad_norm': 0.414021372795105, 'learning_rate': 2.501547987616099e-05, 'epoch': 0.88}


 88%|████████▊ | 2832/3235 [51:08<08:00,  1.19s/it]

{'loss': 1.0835, 'grad_norm': 0.4742050766944885, 'learning_rate': 2.4953560371517028e-05, 'epoch': 0.88}


 88%|████████▊ | 2833/3235 [51:09<07:41,  1.15s/it]

{'loss': 0.9381, 'grad_norm': 0.45301759243011475, 'learning_rate': 2.4891640866873068e-05, 'epoch': 0.88}


 88%|████████▊ | 2834/3235 [51:10<07:14,  1.08s/it]

{'loss': 0.9561, 'grad_norm': 0.5288282036781311, 'learning_rate': 2.48297213622291e-05, 'epoch': 0.88}


 88%|████████▊ | 2835/3235 [51:11<07:18,  1.10s/it]

{'loss': 1.1429, 'grad_norm': 0.5149818658828735, 'learning_rate': 2.476780185758514e-05, 'epoch': 0.88}


 88%|████████▊ | 2836/3235 [51:12<06:50,  1.03s/it]

{'loss': 0.9716, 'grad_norm': 0.49027442932128906, 'learning_rate': 2.4705882352941178e-05, 'epoch': 0.88}


 88%|████████▊ | 2837/3235 [51:13<06:48,  1.03s/it]

{'loss': 1.1515, 'grad_norm': 0.5208162665367126, 'learning_rate': 2.4643962848297214e-05, 'epoch': 0.88}


 88%|████████▊ | 2838/3235 [51:14<06:30,  1.02it/s]

{'loss': 0.9658, 'grad_norm': 0.5193958282470703, 'learning_rate': 2.458204334365325e-05, 'epoch': 0.88}


 88%|████████▊ | 2839/3235 [51:15<06:30,  1.01it/s]

{'loss': 1.0706, 'grad_norm': 0.5238286852836609, 'learning_rate': 2.452012383900929e-05, 'epoch': 0.88}


 88%|████████▊ | 2840/3235 [51:15<06:25,  1.02it/s]

{'loss': 0.9788, 'grad_norm': 0.5229380130767822, 'learning_rate': 2.4458204334365324e-05, 'epoch': 0.88}


 88%|████████▊ | 2841/3235 [51:17<06:46,  1.03s/it]

{'loss': 0.9742, 'grad_norm': 0.43049752712249756, 'learning_rate': 2.4396284829721364e-05, 'epoch': 0.88}


 88%|████████▊ | 2842/3235 [51:18<06:33,  1.00s/it]

{'loss': 0.9946, 'grad_norm': 0.5528894066810608, 'learning_rate': 2.43343653250774e-05, 'epoch': 0.88}


 88%|████████▊ | 2843/3235 [51:19<06:24,  1.02it/s]

{'loss': 0.9459, 'grad_norm': 0.5316804647445679, 'learning_rate': 2.4272445820433437e-05, 'epoch': 0.88}


 88%|████████▊ | 2844/3235 [51:20<07:05,  1.09s/it]

{'loss': 1.006, 'grad_norm': 0.4201929569244385, 'learning_rate': 2.4210526315789474e-05, 'epoch': 0.88}


 88%|████████▊ | 2845/3235 [51:21<07:03,  1.09s/it]

{'loss': 0.9121, 'grad_norm': 0.5143929719924927, 'learning_rate': 2.4148606811145514e-05, 'epoch': 0.88}


 88%|████████▊ | 2846/3235 [51:22<07:09,  1.10s/it]

{'loss': 0.9493, 'grad_norm': 0.48614904284477234, 'learning_rate': 2.4086687306501547e-05, 'epoch': 0.88}


 88%|████████▊ | 2847/3235 [51:23<07:37,  1.18s/it]

{'loss': 1.0168, 'grad_norm': 0.43793755769729614, 'learning_rate': 2.4024767801857587e-05, 'epoch': 0.88}


 88%|████████▊ | 2848/3235 [51:24<06:43,  1.04s/it]

{'loss': 0.942, 'grad_norm': 0.5731577277183533, 'learning_rate': 2.3962848297213624e-05, 'epoch': 0.88}


 88%|████████▊ | 2849/3235 [51:25<06:54,  1.07s/it]

{'loss': 1.0282, 'grad_norm': 0.468368798494339, 'learning_rate': 2.390092879256966e-05, 'epoch': 0.88}


 88%|████████▊ | 2850/3235 [51:26<06:50,  1.07s/it]

{'loss': 1.0661, 'grad_norm': 0.5037955045700073, 'learning_rate': 2.3839009287925697e-05, 'epoch': 0.88}


 88%|████████▊ | 2851/3235 [51:28<07:00,  1.09s/it]

{'loss': 0.9517, 'grad_norm': 0.4734060764312744, 'learning_rate': 2.3777089783281734e-05, 'epoch': 0.88}


 88%|████████▊ | 2852/3235 [51:28<06:38,  1.04s/it]

{'loss': 0.9329, 'grad_norm': 0.4823625087738037, 'learning_rate': 2.371517027863777e-05, 'epoch': 0.88}


 88%|████████▊ | 2853/3235 [51:30<06:54,  1.09s/it]

{'loss': 0.8739, 'grad_norm': 0.45809629559516907, 'learning_rate': 2.365325077399381e-05, 'epoch': 0.88}


 88%|████████▊ | 2854/3235 [51:31<06:42,  1.06s/it]

{'loss': 0.9083, 'grad_norm': 0.5163038372993469, 'learning_rate': 2.3591331269349844e-05, 'epoch': 0.88}


 88%|████████▊ | 2855/3235 [51:32<06:58,  1.10s/it]

{'loss': 1.0379, 'grad_norm': 0.5085408687591553, 'learning_rate': 2.3529411764705884e-05, 'epoch': 0.88}


 88%|████████▊ | 2856/3235 [51:33<06:42,  1.06s/it]

{'loss': 0.9198, 'grad_norm': 0.52496737241745, 'learning_rate': 2.346749226006192e-05, 'epoch': 0.88}


 88%|████████▊ | 2857/3235 [51:34<07:05,  1.13s/it]

{'loss': 1.0854, 'grad_norm': 0.4315614700317383, 'learning_rate': 2.3405572755417957e-05, 'epoch': 0.88}


 88%|████████▊ | 2858/3235 [51:35<07:15,  1.16s/it]

{'loss': 1.021, 'grad_norm': 0.4439811408519745, 'learning_rate': 2.3343653250773994e-05, 'epoch': 0.88}


 88%|████████▊ | 2859/3235 [51:36<07:10,  1.15s/it]

{'loss': 0.9072, 'grad_norm': 0.4601079523563385, 'learning_rate': 2.3281733746130034e-05, 'epoch': 0.88}


 88%|████████▊ | 2860/3235 [51:38<07:40,  1.23s/it]

{'loss': 1.1125, 'grad_norm': 0.491763710975647, 'learning_rate': 2.3219814241486067e-05, 'epoch': 0.88}


 88%|████████▊ | 2861/3235 [51:39<07:44,  1.24s/it]

{'loss': 0.871, 'grad_norm': 0.46447980403900146, 'learning_rate': 2.3157894736842107e-05, 'epoch': 0.88}


 88%|████████▊ | 2862/3235 [51:40<07:17,  1.17s/it]

{'loss': 1.0071, 'grad_norm': 0.5189666152000427, 'learning_rate': 2.3095975232198143e-05, 'epoch': 0.88}


 89%|████████▊ | 2863/3235 [51:41<07:24,  1.20s/it]

{'loss': 0.9429, 'grad_norm': 0.4600343704223633, 'learning_rate': 2.303405572755418e-05, 'epoch': 0.89}


 89%|████████▊ | 2864/3235 [51:42<07:04,  1.14s/it]

{'loss': 1.0569, 'grad_norm': 0.5509881973266602, 'learning_rate': 2.2972136222910217e-05, 'epoch': 0.89}


 89%|████████▊ | 2865/3235 [51:43<06:45,  1.10s/it]

{'loss': 0.9924, 'grad_norm': 0.4886705279350281, 'learning_rate': 2.2910216718266257e-05, 'epoch': 0.89}


 89%|████████▊ | 2866/3235 [51:44<06:37,  1.08s/it]

{'loss': 0.933, 'grad_norm': 0.4803883731365204, 'learning_rate': 2.284829721362229e-05, 'epoch': 0.89}


 89%|████████▊ | 2867/3235 [51:45<06:22,  1.04s/it]

{'loss': 0.9705, 'grad_norm': 0.5024299621582031, 'learning_rate': 2.278637770897833e-05, 'epoch': 0.89}


 89%|████████▊ | 2868/3235 [51:46<06:16,  1.02s/it]

{'loss': 0.8591, 'grad_norm': 0.49884089827537537, 'learning_rate': 2.2724458204334367e-05, 'epoch': 0.89}


 89%|████████▊ | 2869/3235 [51:47<06:15,  1.03s/it]

{'loss': 0.9765, 'grad_norm': 0.4638020396232605, 'learning_rate': 2.2662538699690403e-05, 'epoch': 0.89}


 89%|████████▊ | 2870/3235 [51:48<06:09,  1.01s/it]

{'loss': 0.9774, 'grad_norm': 0.485878586769104, 'learning_rate': 2.260061919504644e-05, 'epoch': 0.89}


 89%|████████▊ | 2871/3235 [51:49<06:15,  1.03s/it]

{'loss': 0.9285, 'grad_norm': 0.49714767932891846, 'learning_rate': 2.253869969040248e-05, 'epoch': 0.89}


 89%|████████▉ | 2872/3235 [51:51<06:32,  1.08s/it]

{'loss': 1.1392, 'grad_norm': 0.5274844169616699, 'learning_rate': 2.2476780185758513e-05, 'epoch': 0.89}


 89%|████████▉ | 2873/3235 [51:52<06:20,  1.05s/it]

{'loss': 1.0356, 'grad_norm': 0.48378100991249084, 'learning_rate': 2.2414860681114553e-05, 'epoch': 0.89}


 89%|████████▉ | 2874/3235 [51:53<06:08,  1.02s/it]

{'loss': 0.9805, 'grad_norm': 0.514623761177063, 'learning_rate': 2.235294117647059e-05, 'epoch': 0.89}


 89%|████████▉ | 2875/3235 [51:53<05:57,  1.01it/s]

{'loss': 0.9256, 'grad_norm': 0.5496751666069031, 'learning_rate': 2.2291021671826626e-05, 'epoch': 0.89}


 89%|████████▉ | 2876/3235 [51:54<05:31,  1.08it/s]

{'loss': 0.9041, 'grad_norm': 0.5678759217262268, 'learning_rate': 2.2229102167182663e-05, 'epoch': 0.89}


 89%|████████▉ | 2877/3235 [51:55<05:44,  1.04it/s]

{'loss': 0.9092, 'grad_norm': 0.4943203628063202, 'learning_rate': 2.2167182662538703e-05, 'epoch': 0.89}


 89%|████████▉ | 2878/3235 [51:56<06:07,  1.03s/it]

{'loss': 0.9129, 'grad_norm': 0.4506664276123047, 'learning_rate': 2.2105263157894736e-05, 'epoch': 0.89}


 89%|████████▉ | 2879/3235 [51:57<06:03,  1.02s/it]

{'loss': 0.9237, 'grad_norm': 0.5208382606506348, 'learning_rate': 2.2043343653250776e-05, 'epoch': 0.89}


 89%|████████▉ | 2880/3235 [51:58<05:45,  1.03it/s]

{'loss': 1.0008, 'grad_norm': 0.5030813813209534, 'learning_rate': 2.1981424148606813e-05, 'epoch': 0.89}


 89%|████████▉ | 2881/3235 [51:59<06:00,  1.02s/it]

{'loss': 0.9406, 'grad_norm': 0.5406582355499268, 'learning_rate': 2.191950464396285e-05, 'epoch': 0.89}


 89%|████████▉ | 2882/3235 [52:00<05:55,  1.01s/it]

{'loss': 0.9652, 'grad_norm': 0.5413356423377991, 'learning_rate': 2.1857585139318886e-05, 'epoch': 0.89}


 89%|████████▉ | 2883/3235 [52:01<05:50,  1.00it/s]

{'loss': 0.9757, 'grad_norm': 0.5046840906143188, 'learning_rate': 2.1795665634674926e-05, 'epoch': 0.89}


 89%|████████▉ | 2884/3235 [52:03<06:06,  1.04s/it]

{'loss': 1.0867, 'grad_norm': 0.4837303161621094, 'learning_rate': 2.173374613003096e-05, 'epoch': 0.89}


 89%|████████▉ | 2885/3235 [52:04<06:11,  1.06s/it]

{'loss': 1.1204, 'grad_norm': 0.4658016264438629, 'learning_rate': 2.1671826625387e-05, 'epoch': 0.89}


 89%|████████▉ | 2886/3235 [52:05<06:40,  1.15s/it]

{'loss': 0.9641, 'grad_norm': 0.458072692155838, 'learning_rate': 2.1609907120743036e-05, 'epoch': 0.89}


 89%|████████▉ | 2887/3235 [52:06<06:36,  1.14s/it]

{'loss': 1.0337, 'grad_norm': 0.4730302393436432, 'learning_rate': 2.1547987616099073e-05, 'epoch': 0.89}


 89%|████████▉ | 2888/3235 [52:07<06:16,  1.09s/it]

{'loss': 1.0998, 'grad_norm': 0.4974157512187958, 'learning_rate': 2.148606811145511e-05, 'epoch': 0.89}


 89%|████████▉ | 2889/3235 [52:08<06:02,  1.05s/it]

{'loss': 1.1262, 'grad_norm': 0.533466637134552, 'learning_rate': 2.1424148606811146e-05, 'epoch': 0.89}


 89%|████████▉ | 2890/3235 [52:09<06:17,  1.10s/it]

{'loss': 1.0115, 'grad_norm': 0.42808833718299866, 'learning_rate': 2.1362229102167182e-05, 'epoch': 0.89}


 89%|████████▉ | 2891/3235 [52:10<06:23,  1.12s/it]

{'loss': 0.9457, 'grad_norm': 0.5612026453018188, 'learning_rate': 2.1300309597523222e-05, 'epoch': 0.89}


 89%|████████▉ | 2892/3235 [52:11<06:11,  1.08s/it]

{'loss': 0.8798, 'grad_norm': 0.48329874873161316, 'learning_rate': 2.1238390092879256e-05, 'epoch': 0.89}


 89%|████████▉ | 2893/3235 [52:12<05:57,  1.04s/it]

{'loss': 0.9506, 'grad_norm': 0.5189582109451294, 'learning_rate': 2.1176470588235296e-05, 'epoch': 0.89}


 89%|████████▉ | 2894/3235 [52:13<05:54,  1.04s/it]

{'loss': 0.9583, 'grad_norm': 0.5361390113830566, 'learning_rate': 2.1114551083591332e-05, 'epoch': 0.89}


 89%|████████▉ | 2895/3235 [52:14<05:52,  1.04s/it]

{'loss': 1.1866, 'grad_norm': 0.48536428809165955, 'learning_rate': 2.105263157894737e-05, 'epoch': 0.89}


 90%|████████▉ | 2896/3235 [52:16<05:56,  1.05s/it]

{'loss': 0.9888, 'grad_norm': 0.46279463171958923, 'learning_rate': 2.0990712074303406e-05, 'epoch': 0.9}


 90%|████████▉ | 2897/3235 [52:17<06:02,  1.07s/it]

{'loss': 0.9617, 'grad_norm': 0.4620814621448517, 'learning_rate': 2.0928792569659446e-05, 'epoch': 0.9}


 90%|████████▉ | 2898/3235 [52:18<06:00,  1.07s/it]

{'loss': 1.0279, 'grad_norm': 0.4861680865287781, 'learning_rate': 2.086687306501548e-05, 'epoch': 0.9}


 90%|████████▉ | 2899/3235 [52:19<05:59,  1.07s/it]

{'loss': 0.8971, 'grad_norm': 0.5105221271514893, 'learning_rate': 2.080495356037152e-05, 'epoch': 0.9}


 90%|████████▉ | 2900/3235 [52:20<06:16,  1.12s/it]

{'loss': 0.8503, 'grad_norm': 0.4718145728111267, 'learning_rate': 2.0743034055727555e-05, 'epoch': 0.9}


 90%|████████▉ | 2901/3235 [52:21<06:20,  1.14s/it]

{'loss': 1.0392, 'grad_norm': 0.5456377863883972, 'learning_rate': 2.0681114551083592e-05, 'epoch': 0.9}


 90%|████████▉ | 2902/3235 [52:22<05:55,  1.07s/it]

{'loss': 0.9103, 'grad_norm': 0.48147156834602356, 'learning_rate': 2.061919504643963e-05, 'epoch': 0.9}


 90%|████████▉ | 2903/3235 [52:23<05:57,  1.08s/it]

{'loss': 0.9805, 'grad_norm': 0.4825459122657776, 'learning_rate': 2.055727554179567e-05, 'epoch': 0.9}


 90%|████████▉ | 2904/3235 [52:24<06:04,  1.10s/it]

{'loss': 1.0293, 'grad_norm': 0.4948250353336334, 'learning_rate': 2.0495356037151702e-05, 'epoch': 0.9}


 90%|████████▉ | 2905/3235 [52:25<05:27,  1.01it/s]

{'loss': 0.8809, 'grad_norm': 0.5324445962905884, 'learning_rate': 2.0433436532507742e-05, 'epoch': 0.9}


 90%|████████▉ | 2906/3235 [52:26<06:02,  1.10s/it]

{'loss': 0.9582, 'grad_norm': 0.4518013894557953, 'learning_rate': 2.037151702786378e-05, 'epoch': 0.9}


 90%|████████▉ | 2907/3235 [52:28<05:59,  1.10s/it]

{'loss': 1.1194, 'grad_norm': 0.48272234201431274, 'learning_rate': 2.0309597523219815e-05, 'epoch': 0.9}


 90%|████████▉ | 2908/3235 [52:29<06:13,  1.14s/it]

{'loss': 0.9967, 'grad_norm': 0.44179674983024597, 'learning_rate': 2.0247678018575852e-05, 'epoch': 0.9}


 90%|████████▉ | 2909/3235 [52:30<06:23,  1.18s/it]

{'loss': 1.1024, 'grad_norm': 0.4649403393268585, 'learning_rate': 2.0185758513931892e-05, 'epoch': 0.9}


 90%|████████▉ | 2910/3235 [52:31<06:24,  1.18s/it]

{'loss': 1.0473, 'grad_norm': 0.48422667384147644, 'learning_rate': 2.0123839009287925e-05, 'epoch': 0.9}


 90%|████████▉ | 2911/3235 [52:32<06:14,  1.16s/it]

{'loss': 1.0191, 'grad_norm': 0.4857490658760071, 'learning_rate': 2.0061919504643965e-05, 'epoch': 0.9}


 90%|█████████ | 2912/3235 [52:33<06:08,  1.14s/it]

{'loss': 1.0423, 'grad_norm': 0.4960654377937317, 'learning_rate': 2e-05, 'epoch': 0.9}


 90%|█████████ | 2913/3235 [52:34<05:54,  1.10s/it]

{'loss': 0.8572, 'grad_norm': 0.4545780122280121, 'learning_rate': 1.9938080495356038e-05, 'epoch': 0.9}


 90%|█████████ | 2914/3235 [52:36<05:52,  1.10s/it]

{'loss': 1.0092, 'grad_norm': 0.5387496948242188, 'learning_rate': 1.9876160990712075e-05, 'epoch': 0.9}


 90%|█████████ | 2915/3235 [52:37<05:46,  1.08s/it]

{'loss': 0.8658, 'grad_norm': 0.5194080471992493, 'learning_rate': 1.9814241486068115e-05, 'epoch': 0.9}


 90%|█████████ | 2916/3235 [52:38<05:53,  1.11s/it]

{'loss': 0.9181, 'grad_norm': 0.4779913127422333, 'learning_rate': 1.9752321981424148e-05, 'epoch': 0.9}


 90%|█████████ | 2917/3235 [52:39<05:34,  1.05s/it]

{'loss': 0.9574, 'grad_norm': 0.5394118428230286, 'learning_rate': 1.9690402476780188e-05, 'epoch': 0.9}


 90%|█████████ | 2918/3235 [52:40<05:25,  1.03s/it]

{'loss': 1.0894, 'grad_norm': 0.5678121447563171, 'learning_rate': 1.9628482972136225e-05, 'epoch': 0.9}


 90%|█████████ | 2919/3235 [52:41<05:22,  1.02s/it]

{'loss': 0.8828, 'grad_norm': 0.46857139468193054, 'learning_rate': 1.956656346749226e-05, 'epoch': 0.9}


 90%|█████████ | 2920/3235 [52:42<05:30,  1.05s/it]

{'loss': 1.0699, 'grad_norm': 0.5053054094314575, 'learning_rate': 1.9504643962848298e-05, 'epoch': 0.9}


 90%|█████████ | 2921/3235 [52:43<05:38,  1.08s/it]

{'loss': 1.0096, 'grad_norm': 0.4793342351913452, 'learning_rate': 1.9442724458204338e-05, 'epoch': 0.9}


 90%|█████████ | 2922/3235 [52:44<05:51,  1.12s/it]

{'loss': 1.067, 'grad_norm': 0.47434014081954956, 'learning_rate': 1.938080495356037e-05, 'epoch': 0.9}


 90%|█████████ | 2923/3235 [52:46<06:16,  1.21s/it]

{'loss': 0.9977, 'grad_norm': 0.4216517508029938, 'learning_rate': 1.931888544891641e-05, 'epoch': 0.9}


 90%|█████████ | 2924/3235 [52:47<05:56,  1.15s/it]

{'loss': 1.0151, 'grad_norm': 0.5471261143684387, 'learning_rate': 1.9256965944272444e-05, 'epoch': 0.9}


 90%|█████████ | 2925/3235 [52:48<05:44,  1.11s/it]

{'loss': 0.9551, 'grad_norm': 0.4860744774341583, 'learning_rate': 1.9195046439628485e-05, 'epoch': 0.9}


 90%|█████████ | 2926/3235 [52:49<05:46,  1.12s/it]

{'loss': 1.0583, 'grad_norm': 0.45646050572395325, 'learning_rate': 1.913312693498452e-05, 'epoch': 0.9}


 90%|█████████ | 2927/3235 [52:50<05:37,  1.09s/it]

{'loss': 0.9863, 'grad_norm': 0.47474998235702515, 'learning_rate': 1.9071207430340558e-05, 'epoch': 0.9}


 91%|█████████ | 2928/3235 [52:51<05:18,  1.04s/it]

{'loss': 1.0589, 'grad_norm': 0.5162879824638367, 'learning_rate': 1.9009287925696594e-05, 'epoch': 0.91}


 91%|█████████ | 2929/3235 [52:52<05:15,  1.03s/it]

{'loss': 0.9519, 'grad_norm': 0.4817633628845215, 'learning_rate': 1.8947368421052634e-05, 'epoch': 0.91}


 91%|█████████ | 2930/3235 [52:53<05:10,  1.02s/it]

{'loss': 1.1188, 'grad_norm': 0.48393514752388, 'learning_rate': 1.8885448916408668e-05, 'epoch': 0.91}


 91%|█████████ | 2931/3235 [52:54<04:56,  1.02it/s]

{'loss': 0.9423, 'grad_norm': 0.5196670293807983, 'learning_rate': 1.8823529411764708e-05, 'epoch': 0.91}


 91%|█████████ | 2932/3235 [52:55<05:17,  1.05s/it]

{'loss': 1.1479, 'grad_norm': 0.46363040804862976, 'learning_rate': 1.8761609907120744e-05, 'epoch': 0.91}


 91%|█████████ | 2933/3235 [52:56<05:31,  1.10s/it]

{'loss': 1.1943, 'grad_norm': 0.5214605331420898, 'learning_rate': 1.869969040247678e-05, 'epoch': 0.91}


 91%|█████████ | 2934/3235 [52:57<05:27,  1.09s/it]

{'loss': 1.1548, 'grad_norm': 0.5492500066757202, 'learning_rate': 1.8637770897832817e-05, 'epoch': 0.91}


 91%|█████████ | 2935/3235 [52:58<05:19,  1.07s/it]

{'loss': 0.918, 'grad_norm': 0.4721280336380005, 'learning_rate': 1.8575851393188857e-05, 'epoch': 0.91}


 91%|█████████ | 2936/3235 [52:59<05:34,  1.12s/it]

{'loss': 0.9785, 'grad_norm': 0.4619412422180176, 'learning_rate': 1.851393188854489e-05, 'epoch': 0.91}


 91%|█████████ | 2937/3235 [53:00<05:12,  1.05s/it]

{'loss': 0.899, 'grad_norm': 0.5072750449180603, 'learning_rate': 1.845201238390093e-05, 'epoch': 0.91}


 91%|█████████ | 2938/3235 [53:01<05:06,  1.03s/it]

{'loss': 1.0695, 'grad_norm': 0.5075364708900452, 'learning_rate': 1.8390092879256967e-05, 'epoch': 0.91}


 91%|█████████ | 2939/3235 [53:03<05:38,  1.14s/it]

{'loss': 1.0948, 'grad_norm': 0.4418116509914398, 'learning_rate': 1.8328173374613004e-05, 'epoch': 0.91}


 91%|█████████ | 2940/3235 [53:04<05:57,  1.21s/it]

{'loss': 1.0833, 'grad_norm': 0.453835666179657, 'learning_rate': 1.826625386996904e-05, 'epoch': 0.91}


 91%|█████████ | 2941/3235 [53:05<05:58,  1.22s/it]

{'loss': 1.1802, 'grad_norm': 0.5109824538230896, 'learning_rate': 1.820433436532508e-05, 'epoch': 0.91}


 91%|█████████ | 2942/3235 [53:06<05:47,  1.19s/it]

{'loss': 0.9164, 'grad_norm': 0.5490174293518066, 'learning_rate': 1.8142414860681114e-05, 'epoch': 0.91}


 91%|█████████ | 2943/3235 [53:07<05:45,  1.18s/it]

{'loss': 0.9732, 'grad_norm': 0.4637840688228607, 'learning_rate': 1.8080495356037154e-05, 'epoch': 0.91}


 91%|█████████ | 2944/3235 [53:09<06:03,  1.25s/it]

{'loss': 1.0046, 'grad_norm': 0.43977275490760803, 'learning_rate': 1.801857585139319e-05, 'epoch': 0.91}


 91%|█████████ | 2945/3235 [53:10<05:31,  1.14s/it]

{'loss': 1.0287, 'grad_norm': 0.507394552230835, 'learning_rate': 1.7956656346749227e-05, 'epoch': 0.91}


 91%|█████████ | 2946/3235 [53:11<05:15,  1.09s/it]

{'loss': 0.9788, 'grad_norm': 0.5055567622184753, 'learning_rate': 1.7894736842105264e-05, 'epoch': 0.91}


 91%|█████████ | 2947/3235 [53:12<05:22,  1.12s/it]

{'loss': 1.073, 'grad_norm': 0.4733654856681824, 'learning_rate': 1.7832817337461304e-05, 'epoch': 0.91}


 91%|█████████ | 2948/3235 [53:13<04:58,  1.04s/it]

{'loss': 0.9828, 'grad_norm': 0.5313814878463745, 'learning_rate': 1.7770897832817337e-05, 'epoch': 0.91}


 91%|█████████ | 2949/3235 [53:14<04:56,  1.04s/it]

{'loss': 1.1036, 'grad_norm': 0.5582288503646851, 'learning_rate': 1.7708978328173377e-05, 'epoch': 0.91}


 91%|█████████ | 2950/3235 [53:15<05:01,  1.06s/it]

{'loss': 1.1958, 'grad_norm': 0.5019118189811707, 'learning_rate': 1.7647058823529414e-05, 'epoch': 0.91}


 91%|█████████ | 2951/3235 [53:16<05:02,  1.07s/it]

{'loss': 0.9731, 'grad_norm': 0.5027575492858887, 'learning_rate': 1.758513931888545e-05, 'epoch': 0.91}


 91%|█████████▏| 2952/3235 [53:17<04:50,  1.03s/it]

{'loss': 0.966, 'grad_norm': 0.4958692789077759, 'learning_rate': 1.7523219814241487e-05, 'epoch': 0.91}


 91%|█████████▏| 2953/3235 [53:18<04:47,  1.02s/it]

{'loss': 1.0107, 'grad_norm': 0.49803170561790466, 'learning_rate': 1.7461300309597527e-05, 'epoch': 0.91}


 91%|█████████▏| 2954/3235 [53:19<05:12,  1.11s/it]

{'loss': 1.0725, 'grad_norm': 0.4475785195827484, 'learning_rate': 1.739938080495356e-05, 'epoch': 0.91}


 91%|█████████▏| 2955/3235 [53:20<05:09,  1.10s/it]

{'loss': 0.9625, 'grad_norm': 0.50714111328125, 'learning_rate': 1.73374613003096e-05, 'epoch': 0.91}


 91%|█████████▏| 2956/3235 [53:22<05:22,  1.16s/it]

{'loss': 1.0074, 'grad_norm': 0.4723639488220215, 'learning_rate': 1.7275541795665637e-05, 'epoch': 0.91}


 91%|█████████▏| 2957/3235 [53:23<05:15,  1.13s/it]

{'loss': 1.0331, 'grad_norm': 0.45742228627204895, 'learning_rate': 1.7213622291021673e-05, 'epoch': 0.91}


 91%|█████████▏| 2958/3235 [53:24<05:19,  1.15s/it]

{'loss': 0.9315, 'grad_norm': 0.5220069885253906, 'learning_rate': 1.715170278637771e-05, 'epoch': 0.91}


 91%|█████████▏| 2959/3235 [53:25<05:30,  1.20s/it]

{'loss': 1.022, 'grad_norm': 0.488785058259964, 'learning_rate': 1.708978328173375e-05, 'epoch': 0.91}


 91%|█████████▏| 2960/3235 [53:27<05:40,  1.24s/it]

{'loss': 0.9609, 'grad_norm': 0.4907468855381012, 'learning_rate': 1.7027863777089783e-05, 'epoch': 0.91}


 92%|█████████▏| 2961/3235 [53:28<05:22,  1.18s/it]

{'loss': 1.0341, 'grad_norm': 0.4685218334197998, 'learning_rate': 1.6965944272445823e-05, 'epoch': 0.92}


 92%|█████████▏| 2962/3235 [53:29<05:15,  1.16s/it]

{'loss': 1.0751, 'grad_norm': 0.49944007396698, 'learning_rate': 1.6904024767801856e-05, 'epoch': 0.92}


 92%|█████████▏| 2963/3235 [53:29<04:39,  1.03s/it]

{'loss': 1.0316, 'grad_norm': 0.5592077374458313, 'learning_rate': 1.6842105263157896e-05, 'epoch': 0.92}


 92%|█████████▏| 2964/3235 [53:30<04:40,  1.03s/it]

{'loss': 0.9148, 'grad_norm': 0.4965609610080719, 'learning_rate': 1.6780185758513933e-05, 'epoch': 0.92}


 92%|█████████▏| 2965/3235 [53:32<04:43,  1.05s/it]

{'loss': 0.9289, 'grad_norm': 0.47412967681884766, 'learning_rate': 1.671826625386997e-05, 'epoch': 0.92}


 92%|█████████▏| 2966/3235 [53:33<04:48,  1.07s/it]

{'loss': 1.015, 'grad_norm': 0.45812487602233887, 'learning_rate': 1.6656346749226006e-05, 'epoch': 0.92}


 92%|█████████▏| 2967/3235 [53:34<04:57,  1.11s/it]

{'loss': 0.9533, 'grad_norm': 0.47633394598960876, 'learning_rate': 1.6594427244582046e-05, 'epoch': 0.92}


 92%|█████████▏| 2968/3235 [53:35<04:47,  1.08s/it]

{'loss': 0.8885, 'grad_norm': 0.528099000453949, 'learning_rate': 1.653250773993808e-05, 'epoch': 0.92}


 92%|█████████▏| 2969/3235 [53:36<04:44,  1.07s/it]

{'loss': 0.9048, 'grad_norm': 0.48706960678100586, 'learning_rate': 1.647058823529412e-05, 'epoch': 0.92}


 92%|█████████▏| 2970/3235 [53:37<04:33,  1.03s/it]

{'loss': 1.0435, 'grad_norm': 0.5272347331047058, 'learning_rate': 1.6408668730650156e-05, 'epoch': 0.92}


 92%|█████████▏| 2971/3235 [53:38<04:53,  1.11s/it]

{'loss': 1.1748, 'grad_norm': 0.46862831711769104, 'learning_rate': 1.6346749226006193e-05, 'epoch': 0.92}


 92%|█████████▏| 2972/3235 [53:39<04:51,  1.11s/it]

{'loss': 1.0589, 'grad_norm': 0.466857373714447, 'learning_rate': 1.628482972136223e-05, 'epoch': 0.92}


 92%|█████████▏| 2973/3235 [53:40<04:49,  1.10s/it]

{'loss': 0.8906, 'grad_norm': 0.5240198373794556, 'learning_rate': 1.622291021671827e-05, 'epoch': 0.92}


 92%|█████████▏| 2974/3235 [53:41<04:46,  1.10s/it]

{'loss': 1.0241, 'grad_norm': 0.49008598923683167, 'learning_rate': 1.6160990712074303e-05, 'epoch': 0.92}


 92%|█████████▏| 2975/3235 [53:42<04:26,  1.02s/it]

{'loss': 0.9149, 'grad_norm': 0.49757978320121765, 'learning_rate': 1.6099071207430343e-05, 'epoch': 0.92}


 92%|█████████▏| 2976/3235 [53:44<04:54,  1.14s/it]

{'loss': 1.0043, 'grad_norm': 0.4840768277645111, 'learning_rate': 1.603715170278638e-05, 'epoch': 0.92}


 92%|█████████▏| 2977/3235 [53:45<05:15,  1.22s/it]

{'loss': 1.1724, 'grad_norm': 0.4672122895717621, 'learning_rate': 1.5975232198142416e-05, 'epoch': 0.92}


 92%|█████████▏| 2978/3235 [53:46<04:57,  1.16s/it]

{'loss': 1.0178, 'grad_norm': 0.4982578456401825, 'learning_rate': 1.5913312693498453e-05, 'epoch': 0.92}


 92%|█████████▏| 2979/3235 [53:47<04:44,  1.11s/it]

{'loss': 1.0163, 'grad_norm': 0.49795764684677124, 'learning_rate': 1.585139318885449e-05, 'epoch': 0.92}


 92%|█████████▏| 2980/3235 [53:48<04:38,  1.09s/it]

{'loss': 1.0209, 'grad_norm': 0.44271472096443176, 'learning_rate': 1.5789473684210526e-05, 'epoch': 0.92}


 92%|█████████▏| 2981/3235 [53:49<04:44,  1.12s/it]

{'loss': 0.9276, 'grad_norm': 0.47738906741142273, 'learning_rate': 1.5727554179566566e-05, 'epoch': 0.92}


 92%|█████████▏| 2982/3235 [53:50<04:42,  1.12s/it]

{'loss': 0.9968, 'grad_norm': 0.4264744818210602, 'learning_rate': 1.5665634674922602e-05, 'epoch': 0.92}


 92%|█████████▏| 2983/3235 [53:52<04:59,  1.19s/it]

{'loss': 1.0115, 'grad_norm': 0.42846864461898804, 'learning_rate': 1.560371517027864e-05, 'epoch': 0.92}


 92%|█████████▏| 2984/3235 [53:53<04:41,  1.12s/it]

{'loss': 1.0577, 'grad_norm': 0.4742393493652344, 'learning_rate': 1.5541795665634676e-05, 'epoch': 0.92}


 92%|█████████▏| 2985/3235 [53:54<04:50,  1.16s/it]

{'loss': 1.0284, 'grad_norm': 0.44007861614227295, 'learning_rate': 1.5479876160990712e-05, 'epoch': 0.92}


 92%|█████████▏| 2986/3235 [53:55<04:47,  1.15s/it]

{'loss': 0.9302, 'grad_norm': 0.4558466076850891, 'learning_rate': 1.541795665634675e-05, 'epoch': 0.92}


 92%|█████████▏| 2987/3235 [53:56<04:50,  1.17s/it]

{'loss': 1.0738, 'grad_norm': 0.4467901289463043, 'learning_rate': 1.535603715170279e-05, 'epoch': 0.92}


 92%|█████████▏| 2988/3235 [53:57<04:22,  1.06s/it]

{'loss': 1.0283, 'grad_norm': 0.51675945520401, 'learning_rate': 1.5294117647058826e-05, 'epoch': 0.92}


 92%|█████████▏| 2989/3235 [53:58<04:21,  1.06s/it]

{'loss': 0.9165, 'grad_norm': 0.4575742483139038, 'learning_rate': 1.523219814241486e-05, 'epoch': 0.92}


 92%|█████████▏| 2990/3235 [53:59<04:27,  1.09s/it]

{'loss': 1.0597, 'grad_norm': 0.463416188955307, 'learning_rate': 1.5170278637770899e-05, 'epoch': 0.92}


 92%|█████████▏| 2991/3235 [54:00<04:07,  1.01s/it]

{'loss': 0.9424, 'grad_norm': 0.5680018067359924, 'learning_rate': 1.5108359133126937e-05, 'epoch': 0.92}


 92%|█████████▏| 2992/3235 [54:02<04:30,  1.11s/it]

{'loss': 1.0578, 'grad_norm': 0.4510277211666107, 'learning_rate': 1.5046439628482972e-05, 'epoch': 0.92}


 93%|█████████▎| 2993/3235 [54:03<04:27,  1.11s/it]

{'loss': 0.9652, 'grad_norm': 0.4809311032295227, 'learning_rate': 1.498452012383901e-05, 'epoch': 0.93}


 93%|█████████▎| 2994/3235 [54:04<04:32,  1.13s/it]

{'loss': 1.0709, 'grad_norm': 0.5092069506645203, 'learning_rate': 1.4922600619195049e-05, 'epoch': 0.93}


 93%|█████████▎| 2995/3235 [54:05<04:18,  1.08s/it]

{'loss': 1.0833, 'grad_norm': 0.5498058199882507, 'learning_rate': 1.4860681114551084e-05, 'epoch': 0.93}


 93%|█████████▎| 2996/3235 [54:06<04:22,  1.10s/it]

{'loss': 1.0075, 'grad_norm': 0.45397260785102844, 'learning_rate': 1.4798761609907122e-05, 'epoch': 0.93}


 93%|█████████▎| 2997/3235 [54:07<04:36,  1.16s/it]

{'loss': 0.9689, 'grad_norm': 0.43077388405799866, 'learning_rate': 1.4736842105263157e-05, 'epoch': 0.93}


 93%|█████████▎| 2998/3235 [54:08<04:36,  1.16s/it]

{'loss': 0.9823, 'grad_norm': 0.49115610122680664, 'learning_rate': 1.4674922600619195e-05, 'epoch': 0.93}


 93%|█████████▎| 2999/3235 [54:09<04:09,  1.06s/it]

{'loss': 0.9198, 'grad_norm': 0.519765317440033, 'learning_rate': 1.4613003095975234e-05, 'epoch': 0.93}


 93%|█████████▎| 3000/3235 [54:10<03:59,  1.02s/it]

{'loss': 0.7535, 'grad_norm': 0.4554305076599121, 'learning_rate': 1.4551083591331268e-05, 'epoch': 0.93}


[34m[1mwandb[0m: Adding directory to artifact (./outputs/checkpoint-3000)... Done. 0.1s
 93%|█████████▎| 3001/3235 [54:12<04:52,  1.25s/it]

{'loss': 1.0782, 'grad_norm': 0.5166234970092773, 'learning_rate': 1.4489164086687307e-05, 'epoch': 0.93}


 93%|█████████▎| 3002/3235 [54:13<04:29,  1.16s/it]

{'loss': 1.16, 'grad_norm': 0.5341070294380188, 'learning_rate': 1.4427244582043345e-05, 'epoch': 0.93}


 93%|█████████▎| 3003/3235 [54:14<04:42,  1.22s/it]

{'loss': 1.0106, 'grad_norm': 0.46309277415275574, 'learning_rate': 1.436532507739938e-05, 'epoch': 0.93}


 93%|█████████▎| 3004/3235 [54:15<04:20,  1.13s/it]

{'loss': 1.0964, 'grad_norm': 0.5076890587806702, 'learning_rate': 1.4303405572755418e-05, 'epoch': 0.93}


 93%|█████████▎| 3005/3235 [54:17<04:31,  1.18s/it]

{'loss': 1.0226, 'grad_norm': 0.43363311886787415, 'learning_rate': 1.4241486068111457e-05, 'epoch': 0.93}


 93%|█████████▎| 3006/3235 [54:17<04:08,  1.09s/it]

{'loss': 1.002, 'grad_norm': 0.5349932909011841, 'learning_rate': 1.4179566563467492e-05, 'epoch': 0.93}


 93%|█████████▎| 3007/3235 [54:18<03:51,  1.01s/it]

{'loss': 0.9066, 'grad_norm': 0.5231594443321228, 'learning_rate': 1.411764705882353e-05, 'epoch': 0.93}


 93%|█████████▎| 3008/3235 [54:19<03:49,  1.01s/it]

{'loss': 0.9124, 'grad_norm': 0.5124852657318115, 'learning_rate': 1.4055727554179568e-05, 'epoch': 0.93}


 93%|█████████▎| 3009/3235 [54:20<03:57,  1.05s/it]

{'loss': 0.9678, 'grad_norm': 0.4981178641319275, 'learning_rate': 1.3993808049535603e-05, 'epoch': 0.93}


 93%|█████████▎| 3010/3235 [54:21<03:56,  1.05s/it]

{'loss': 1.0093, 'grad_norm': 0.5275722146034241, 'learning_rate': 1.3931888544891641e-05, 'epoch': 0.93}


 93%|█████████▎| 3011/3235 [54:23<03:56,  1.05s/it]

{'loss': 1.0116, 'grad_norm': 0.48233816027641296, 'learning_rate': 1.386996904024768e-05, 'epoch': 0.93}


 93%|█████████▎| 3012/3235 [54:24<03:52,  1.04s/it]

{'loss': 0.9106, 'grad_norm': 0.5026823878288269, 'learning_rate': 1.3808049535603715e-05, 'epoch': 0.93}


 93%|█████████▎| 3013/3235 [54:25<04:02,  1.09s/it]

{'loss': 1.1552, 'grad_norm': 0.4928235113620758, 'learning_rate': 1.3746130030959753e-05, 'epoch': 0.93}


 93%|█████████▎| 3014/3235 [54:26<04:17,  1.16s/it]

{'loss': 1.0343, 'grad_norm': 0.4368145763874054, 'learning_rate': 1.3684210526315791e-05, 'epoch': 0.93}


 93%|█████████▎| 3015/3235 [54:27<04:09,  1.14s/it]

{'loss': 1.0254, 'grad_norm': 0.5551478266716003, 'learning_rate': 1.3622291021671826e-05, 'epoch': 0.93}


 93%|█████████▎| 3016/3235 [54:28<04:05,  1.12s/it]

{'loss': 0.9648, 'grad_norm': 0.4798862338066101, 'learning_rate': 1.3560371517027865e-05, 'epoch': 0.93}


 93%|█████████▎| 3017/3235 [54:29<04:10,  1.15s/it]

{'loss': 0.92, 'grad_norm': 0.48217326402664185, 'learning_rate': 1.3498452012383903e-05, 'epoch': 0.93}


 93%|█████████▎| 3018/3235 [54:30<04:01,  1.11s/it]

{'loss': 1.095, 'grad_norm': 0.5181552171707153, 'learning_rate': 1.3436532507739938e-05, 'epoch': 0.93}


 93%|█████████▎| 3019/3235 [54:31<03:39,  1.02s/it]

{'loss': 0.8923, 'grad_norm': 0.5447362065315247, 'learning_rate': 1.3374613003095976e-05, 'epoch': 0.93}


 93%|█████████▎| 3020/3235 [54:33<03:58,  1.11s/it]

{'loss': 0.8893, 'grad_norm': 0.4702315330505371, 'learning_rate': 1.3312693498452014e-05, 'epoch': 0.93}


 93%|█████████▎| 3021/3235 [54:34<04:00,  1.12s/it]

{'loss': 0.9716, 'grad_norm': 0.4830028712749481, 'learning_rate': 1.325077399380805e-05, 'epoch': 0.93}


 93%|█████████▎| 3022/3235 [54:35<03:56,  1.11s/it]

{'loss': 0.9178, 'grad_norm': 0.44643285870552063, 'learning_rate': 1.3188854489164088e-05, 'epoch': 0.93}


 93%|█████████▎| 3023/3235 [54:36<03:41,  1.04s/it]

{'loss': 0.9537, 'grad_norm': 0.5289617776870728, 'learning_rate': 1.3126934984520126e-05, 'epoch': 0.93}


 93%|█████████▎| 3024/3235 [54:37<03:48,  1.08s/it]

{'loss': 1.0592, 'grad_norm': 0.5420231223106384, 'learning_rate': 1.3065015479876161e-05, 'epoch': 0.93}


 94%|█████████▎| 3025/3235 [54:38<04:01,  1.15s/it]

{'loss': 1.0751, 'grad_norm': 0.4489290416240692, 'learning_rate': 1.30030959752322e-05, 'epoch': 0.94}


 94%|█████████▎| 3026/3235 [54:39<03:46,  1.09s/it]

{'loss': 1.1126, 'grad_norm': 0.5153031349182129, 'learning_rate': 1.2941176470588238e-05, 'epoch': 0.94}


 94%|█████████▎| 3027/3235 [54:40<04:00,  1.16s/it]

{'loss': 0.9575, 'grad_norm': 0.460943341255188, 'learning_rate': 1.2879256965944272e-05, 'epoch': 0.94}


 94%|█████████▎| 3028/3235 [54:42<04:15,  1.23s/it]

{'loss': 1.1316, 'grad_norm': 0.4137974679470062, 'learning_rate': 1.281733746130031e-05, 'epoch': 0.94}


 94%|█████████▎| 3029/3235 [54:43<04:26,  1.30s/it]

{'loss': 1.1583, 'grad_norm': 0.4730778932571411, 'learning_rate': 1.2755417956656349e-05, 'epoch': 0.94}


 94%|█████████▎| 3030/3235 [54:45<04:24,  1.29s/it]

{'loss': 1.0479, 'grad_norm': 0.4430685341358185, 'learning_rate': 1.2693498452012384e-05, 'epoch': 0.94}


 94%|█████████▎| 3031/3235 [54:46<04:14,  1.25s/it]

{'loss': 1.0076, 'grad_norm': 0.48094186186790466, 'learning_rate': 1.2631578947368422e-05, 'epoch': 0.94}


 94%|█████████▎| 3032/3235 [54:47<04:29,  1.33s/it]

{'loss': 0.9364, 'grad_norm': 0.4547964930534363, 'learning_rate': 1.256965944272446e-05, 'epoch': 0.94}


 94%|█████████▍| 3033/3235 [54:48<04:07,  1.23s/it]

{'loss': 0.924, 'grad_norm': 0.5357604622840881, 'learning_rate': 1.2507739938080496e-05, 'epoch': 0.94}


 94%|█████████▍| 3034/3235 [54:49<03:47,  1.13s/it]

{'loss': 1.0477, 'grad_norm': 0.5319035649299622, 'learning_rate': 1.2445820433436534e-05, 'epoch': 0.94}


 94%|█████████▍| 3035/3235 [54:50<03:35,  1.08s/it]

{'loss': 0.9431, 'grad_norm': 0.491628497838974, 'learning_rate': 1.238390092879257e-05, 'epoch': 0.94}


 94%|█████████▍| 3036/3235 [54:51<03:17,  1.01it/s]

{'loss': 0.7584, 'grad_norm': 0.5443657636642456, 'learning_rate': 1.2321981424148607e-05, 'epoch': 0.94}


 94%|█████████▍| 3037/3235 [54:52<03:14,  1.02it/s]

{'loss': 0.9399, 'grad_norm': 0.4999249279499054, 'learning_rate': 1.2260061919504645e-05, 'epoch': 0.94}


 94%|█████████▍| 3038/3235 [54:53<03:20,  1.02s/it]

{'loss': 1.0594, 'grad_norm': 0.5474820733070374, 'learning_rate': 1.2198142414860682e-05, 'epoch': 0.94}


 94%|█████████▍| 3039/3235 [54:54<03:19,  1.02s/it]

{'loss': 1.068, 'grad_norm': 0.5045554041862488, 'learning_rate': 1.2136222910216719e-05, 'epoch': 0.94}


 94%|█████████▍| 3040/3235 [54:55<03:36,  1.11s/it]

{'loss': 0.9109, 'grad_norm': 0.47752511501312256, 'learning_rate': 1.2074303405572757e-05, 'epoch': 0.94}


 94%|█████████▍| 3041/3235 [54:56<03:27,  1.07s/it]

{'loss': 1.0033, 'grad_norm': 0.48027652502059937, 'learning_rate': 1.2012383900928794e-05, 'epoch': 0.94}


 94%|█████████▍| 3042/3235 [54:57<03:14,  1.01s/it]

{'loss': 0.8914, 'grad_norm': 0.5094860792160034, 'learning_rate': 1.195046439628483e-05, 'epoch': 0.94}


 94%|█████████▍| 3043/3235 [54:58<03:20,  1.05s/it]

{'loss': 0.9258, 'grad_norm': 0.47889065742492676, 'learning_rate': 1.1888544891640867e-05, 'epoch': 0.94}


 94%|█████████▍| 3044/3235 [54:59<03:28,  1.09s/it]

{'loss': 0.924, 'grad_norm': 0.4719848036766052, 'learning_rate': 1.1826625386996905e-05, 'epoch': 0.94}


 94%|█████████▍| 3045/3235 [55:01<03:30,  1.11s/it]

{'loss': 1.0156, 'grad_norm': 0.4748193919658661, 'learning_rate': 1.1764705882352942e-05, 'epoch': 0.94}


 94%|█████████▍| 3046/3235 [55:02<03:24,  1.08s/it]

{'loss': 1.063, 'grad_norm': 0.5592271089553833, 'learning_rate': 1.1702786377708978e-05, 'epoch': 0.94}


 94%|█████████▍| 3047/3235 [55:03<03:16,  1.05s/it]

{'loss': 1.039, 'grad_norm': 0.5252270698547363, 'learning_rate': 1.1640866873065017e-05, 'epoch': 0.94}


 94%|█████████▍| 3048/3235 [55:04<03:32,  1.13s/it]

{'loss': 0.9837, 'grad_norm': 0.4505109488964081, 'learning_rate': 1.1578947368421053e-05, 'epoch': 0.94}


 94%|█████████▍| 3049/3235 [55:05<03:33,  1.15s/it]

{'loss': 1.0636, 'grad_norm': 0.46607691049575806, 'learning_rate': 1.151702786377709e-05, 'epoch': 0.94}


 94%|█████████▍| 3050/3235 [55:06<03:32,  1.15s/it]

{'loss': 1.021, 'grad_norm': 0.45485585927963257, 'learning_rate': 1.1455108359133128e-05, 'epoch': 0.94}


 94%|█████████▍| 3051/3235 [55:07<03:33,  1.16s/it]

{'loss': 0.9331, 'grad_norm': 0.47991102933883667, 'learning_rate': 1.1393188854489165e-05, 'epoch': 0.94}


 94%|█████████▍| 3052/3235 [55:08<03:26,  1.13s/it]

{'loss': 0.9233, 'grad_norm': 0.4981144964694977, 'learning_rate': 1.1331269349845202e-05, 'epoch': 0.94}


 94%|█████████▍| 3053/3235 [55:10<03:28,  1.15s/it]

{'loss': 0.9398, 'grad_norm': 0.4731488823890686, 'learning_rate': 1.126934984520124e-05, 'epoch': 0.94}


 94%|█████████▍| 3054/3235 [55:11<03:25,  1.13s/it]

{'loss': 0.9579, 'grad_norm': 0.4940515160560608, 'learning_rate': 1.1207430340557277e-05, 'epoch': 0.94}


 94%|█████████▍| 3055/3235 [55:12<03:26,  1.15s/it]

{'loss': 1.1655, 'grad_norm': 0.48637813329696655, 'learning_rate': 1.1145510835913313e-05, 'epoch': 0.94}


 94%|█████████▍| 3056/3235 [55:13<03:23,  1.13s/it]

{'loss': 1.0413, 'grad_norm': 0.47854217886924744, 'learning_rate': 1.1083591331269351e-05, 'epoch': 0.94}


 94%|█████████▍| 3057/3235 [55:14<03:36,  1.22s/it]

{'loss': 0.953, 'grad_norm': 0.39460018277168274, 'learning_rate': 1.1021671826625388e-05, 'epoch': 0.94}


 95%|█████████▍| 3058/3235 [55:15<03:23,  1.15s/it]

{'loss': 1.063, 'grad_norm': 0.5132521986961365, 'learning_rate': 1.0959752321981425e-05, 'epoch': 0.95}


 95%|█████████▍| 3059/3235 [55:17<03:21,  1.14s/it]

{'loss': 0.9732, 'grad_norm': 0.4852246940135956, 'learning_rate': 1.0897832817337463e-05, 'epoch': 0.95}


 95%|█████████▍| 3060/3235 [55:18<03:28,  1.19s/it]

{'loss': 1.0971, 'grad_norm': 0.45586466789245605, 'learning_rate': 1.08359133126935e-05, 'epoch': 0.95}


 95%|█████████▍| 3061/3235 [55:19<03:31,  1.22s/it]

{'loss': 1.0187, 'grad_norm': 0.4767231345176697, 'learning_rate': 1.0773993808049536e-05, 'epoch': 0.95}


 95%|█████████▍| 3062/3235 [55:20<03:07,  1.08s/it]

{'loss': 1.0835, 'grad_norm': 0.5708403587341309, 'learning_rate': 1.0712074303405573e-05, 'epoch': 0.95}


 95%|█████████▍| 3063/3235 [55:21<03:06,  1.08s/it]

{'loss': 1.092, 'grad_norm': 0.46075305342674255, 'learning_rate': 1.0650154798761611e-05, 'epoch': 0.95}


 95%|█████████▍| 3064/3235 [55:22<02:57,  1.04s/it]

{'loss': 1.1722, 'grad_norm': 0.524407684803009, 'learning_rate': 1.0588235294117648e-05, 'epoch': 0.95}


 95%|█████████▍| 3065/3235 [55:23<02:56,  1.04s/it]

{'loss': 0.9409, 'grad_norm': 0.4892953932285309, 'learning_rate': 1.0526315789473684e-05, 'epoch': 0.95}


 95%|█████████▍| 3066/3235 [55:24<02:47,  1.01it/s]

{'loss': 0.8933, 'grad_norm': 0.5337767601013184, 'learning_rate': 1.0464396284829723e-05, 'epoch': 0.95}


 95%|█████████▍| 3067/3235 [55:25<02:47,  1.00it/s]

{'loss': 0.8916, 'grad_norm': 0.48019230365753174, 'learning_rate': 1.040247678018576e-05, 'epoch': 0.95}


 95%|█████████▍| 3068/3235 [55:26<02:49,  1.02s/it]

{'loss': 0.9165, 'grad_norm': 0.4678035080432892, 'learning_rate': 1.0340557275541796e-05, 'epoch': 0.95}


 95%|█████████▍| 3069/3235 [55:27<02:59,  1.08s/it]

{'loss': 1.1814, 'grad_norm': 0.4661305546760559, 'learning_rate': 1.0278637770897834e-05, 'epoch': 0.95}


 95%|█████████▍| 3070/3235 [55:28<02:44,  1.00it/s]

{'loss': 0.9271, 'grad_norm': 0.5647500157356262, 'learning_rate': 1.0216718266253871e-05, 'epoch': 0.95}


 95%|█████████▍| 3071/3235 [55:29<02:55,  1.07s/it]

{'loss': 1.067, 'grad_norm': 0.5240336060523987, 'learning_rate': 1.0154798761609908e-05, 'epoch': 0.95}


 95%|█████████▍| 3072/3235 [55:30<03:01,  1.12s/it]

{'loss': 1.1731, 'grad_norm': 0.4576120376586914, 'learning_rate': 1.0092879256965946e-05, 'epoch': 0.95}


 95%|█████████▍| 3073/3235 [55:32<03:01,  1.12s/it]

{'loss': 0.9216, 'grad_norm': 0.45710164308547974, 'learning_rate': 1.0030959752321983e-05, 'epoch': 0.95}


 95%|█████████▌| 3074/3235 [55:33<03:02,  1.13s/it]

{'loss': 1.0946, 'grad_norm': 0.49697598814964294, 'learning_rate': 9.969040247678019e-06, 'epoch': 0.95}


 95%|█████████▌| 3075/3235 [55:34<02:58,  1.12s/it]

{'loss': 1.1078, 'grad_norm': 0.4725719690322876, 'learning_rate': 9.907120743034057e-06, 'epoch': 0.95}


 95%|█████████▌| 3076/3235 [55:35<02:50,  1.07s/it]

{'loss': 1.0071, 'grad_norm': 0.4891258478164673, 'learning_rate': 9.845201238390094e-06, 'epoch': 0.95}


 95%|█████████▌| 3077/3235 [55:36<03:01,  1.15s/it]

{'loss': 1.17, 'grad_norm': 0.5004016160964966, 'learning_rate': 9.78328173374613e-06, 'epoch': 0.95}


 95%|█████████▌| 3078/3235 [55:37<02:52,  1.10s/it]

{'loss': 1.0088, 'grad_norm': 0.5139142870903015, 'learning_rate': 9.721362229102169e-06, 'epoch': 0.95}


 95%|█████████▌| 3079/3235 [55:38<02:57,  1.14s/it]

{'loss': 1.0001, 'grad_norm': 0.43513578176498413, 'learning_rate': 9.659442724458206e-06, 'epoch': 0.95}


 95%|█████████▌| 3080/3235 [55:39<02:54,  1.13s/it]

{'loss': 0.9698, 'grad_norm': 0.4977346658706665, 'learning_rate': 9.597523219814242e-06, 'epoch': 0.95}


 95%|█████████▌| 3081/3235 [55:41<02:56,  1.15s/it]

{'loss': 1.0397, 'grad_norm': 0.468199223279953, 'learning_rate': 9.535603715170279e-06, 'epoch': 0.95}


 95%|█████████▌| 3082/3235 [55:42<02:55,  1.15s/it]

{'loss': 0.9874, 'grad_norm': 0.49773210287094116, 'learning_rate': 9.473684210526317e-06, 'epoch': 0.95}


 95%|█████████▌| 3083/3235 [55:43<02:35,  1.03s/it]

{'loss': 1.0728, 'grad_norm': 0.5636714100837708, 'learning_rate': 9.411764705882354e-06, 'epoch': 0.95}


 95%|█████████▌| 3084/3235 [55:44<02:38,  1.05s/it]

{'loss': 1.0771, 'grad_norm': 0.5223712921142578, 'learning_rate': 9.34984520123839e-06, 'epoch': 0.95}


 95%|█████████▌| 3085/3235 [55:45<02:45,  1.10s/it]

{'loss': 1.0778, 'grad_norm': 0.4817183315753937, 'learning_rate': 9.287925696594429e-06, 'epoch': 0.95}


 95%|█████████▌| 3086/3235 [55:46<02:29,  1.01s/it]

{'loss': 1.0622, 'grad_norm': 0.5651686191558838, 'learning_rate': 9.226006191950465e-06, 'epoch': 0.95}


 95%|█████████▌| 3087/3235 [55:47<02:34,  1.04s/it]

{'loss': 0.9655, 'grad_norm': 0.4424070417881012, 'learning_rate': 9.164086687306502e-06, 'epoch': 0.95}


 95%|█████████▌| 3088/3235 [55:48<02:32,  1.04s/it]

{'loss': 1.0199, 'grad_norm': 0.48612621426582336, 'learning_rate': 9.10216718266254e-06, 'epoch': 0.95}


 95%|█████████▌| 3089/3235 [55:49<02:52,  1.18s/it]

{'loss': 0.9824, 'grad_norm': 0.39752593636512756, 'learning_rate': 9.040247678018577e-06, 'epoch': 0.95}


 96%|█████████▌| 3090/3235 [55:50<02:41,  1.12s/it]

{'loss': 0.9474, 'grad_norm': 0.4958772659301758, 'learning_rate': 8.978328173374614e-06, 'epoch': 0.96}


 96%|█████████▌| 3091/3235 [55:51<02:39,  1.11s/it]

{'loss': 0.9814, 'grad_norm': 0.4897914230823517, 'learning_rate': 8.916408668730652e-06, 'epoch': 0.96}


 96%|█████████▌| 3092/3235 [55:53<02:41,  1.13s/it]

{'loss': 0.9602, 'grad_norm': 0.4947426915168762, 'learning_rate': 8.854489164086688e-06, 'epoch': 0.96}


 96%|█████████▌| 3093/3235 [55:53<02:26,  1.03s/it]

{'loss': 0.9897, 'grad_norm': 0.5601519346237183, 'learning_rate': 8.792569659442725e-06, 'epoch': 0.96}


 96%|█████████▌| 3094/3235 [55:54<02:18,  1.02it/s]

{'loss': 1.0405, 'grad_norm': 0.5631208419799805, 'learning_rate': 8.730650154798763e-06, 'epoch': 0.96}


 96%|█████████▌| 3095/3235 [55:55<02:21,  1.01s/it]

{'loss': 0.9134, 'grad_norm': 0.48281630873680115, 'learning_rate': 8.6687306501548e-06, 'epoch': 0.96}


 96%|█████████▌| 3096/3235 [55:56<02:18,  1.00it/s]

{'loss': 1.0205, 'grad_norm': 0.5575725436210632, 'learning_rate': 8.606811145510837e-06, 'epoch': 0.96}


 96%|█████████▌| 3097/3235 [55:57<02:23,  1.04s/it]

{'loss': 0.9023, 'grad_norm': 0.4210604727268219, 'learning_rate': 8.544891640866875e-06, 'epoch': 0.96}


 96%|█████████▌| 3098/3235 [55:58<02:18,  1.01s/it]

{'loss': 0.9625, 'grad_norm': 0.5376817584037781, 'learning_rate': 8.482972136222912e-06, 'epoch': 0.96}


 96%|█████████▌| 3099/3235 [55:59<02:23,  1.06s/it]

{'loss': 1.0339, 'grad_norm': 0.47972381114959717, 'learning_rate': 8.421052631578948e-06, 'epoch': 0.96}


 96%|█████████▌| 3100/3235 [56:01<02:31,  1.13s/it]

{'loss': 1.0038, 'grad_norm': 0.4473007321357727, 'learning_rate': 8.359133126934985e-06, 'epoch': 0.96}


 96%|█████████▌| 3101/3235 [56:02<02:35,  1.16s/it]

{'loss': 1.114, 'grad_norm': 0.5100106000900269, 'learning_rate': 8.297213622291023e-06, 'epoch': 0.96}


 96%|█████████▌| 3102/3235 [56:03<02:20,  1.06s/it]

{'loss': 0.9701, 'grad_norm': 0.4924571216106415, 'learning_rate': 8.23529411764706e-06, 'epoch': 0.96}


 96%|█████████▌| 3103/3235 [56:04<02:16,  1.03s/it]

{'loss': 0.9682, 'grad_norm': 0.49006417393684387, 'learning_rate': 8.173374613003096e-06, 'epoch': 0.96}


 96%|█████████▌| 3104/3235 [56:05<02:14,  1.02s/it]

{'loss': 0.9107, 'grad_norm': 0.483028382062912, 'learning_rate': 8.111455108359135e-06, 'epoch': 0.96}


 96%|█████████▌| 3105/3235 [56:06<02:16,  1.05s/it]

{'loss': 1.0698, 'grad_norm': 0.563460111618042, 'learning_rate': 8.049535603715171e-06, 'epoch': 0.96}


 96%|█████████▌| 3106/3235 [56:07<02:31,  1.17s/it]

{'loss': 0.959, 'grad_norm': 0.44149887561798096, 'learning_rate': 7.987616099071208e-06, 'epoch': 0.96}


 96%|█████████▌| 3107/3235 [56:09<02:35,  1.22s/it]

{'loss': 1.1026, 'grad_norm': 0.4884833097457886, 'learning_rate': 7.925696594427245e-06, 'epoch': 0.96}


 96%|█████████▌| 3108/3235 [56:10<02:32,  1.20s/it]

{'loss': 0.9851, 'grad_norm': 0.4803001582622528, 'learning_rate': 7.863777089783283e-06, 'epoch': 0.96}


 96%|█████████▌| 3109/3235 [56:11<02:28,  1.18s/it]

{'loss': 1.0191, 'grad_norm': 0.4736584722995758, 'learning_rate': 7.80185758513932e-06, 'epoch': 0.96}


 96%|█████████▌| 3110/3235 [56:12<02:21,  1.14s/it]

{'loss': 1.159, 'grad_norm': 0.5093572735786438, 'learning_rate': 7.739938080495356e-06, 'epoch': 0.96}


 96%|█████████▌| 3111/3235 [56:13<02:32,  1.23s/it]

{'loss': 1.0238, 'grad_norm': 0.43001869320869446, 'learning_rate': 7.678018575851394e-06, 'epoch': 0.96}


 96%|█████████▌| 3112/3235 [56:15<02:31,  1.24s/it]

{'loss': 1.0812, 'grad_norm': 0.4788711369037628, 'learning_rate': 7.61609907120743e-06, 'epoch': 0.96}


 96%|█████████▌| 3113/3235 [56:16<02:25,  1.19s/it]

{'loss': 1.0359, 'grad_norm': 0.4633593261241913, 'learning_rate': 7.5541795665634686e-06, 'epoch': 0.96}


 96%|█████████▋| 3114/3235 [56:17<02:23,  1.19s/it]

{'loss': 1.042, 'grad_norm': 0.465589314699173, 'learning_rate': 7.492260061919505e-06, 'epoch': 0.96}


 96%|█████████▋| 3115/3235 [56:18<02:08,  1.07s/it]

{'loss': 0.8584, 'grad_norm': 0.5311923623085022, 'learning_rate': 7.430340557275542e-06, 'epoch': 0.96}


 96%|█████████▋| 3116/3235 [56:19<02:05,  1.05s/it]

{'loss': 1.1134, 'grad_norm': 0.49349886178970337, 'learning_rate': 7.3684210526315784e-06, 'epoch': 0.96}


 96%|█████████▋| 3117/3235 [56:20<02:06,  1.07s/it]

{'loss': 1.0679, 'grad_norm': 0.4838056266307831, 'learning_rate': 7.306501547987617e-06, 'epoch': 0.96}


 96%|█████████▋| 3118/3235 [56:21<01:56,  1.01it/s]

{'loss': 0.929, 'grad_norm': 0.5083798766136169, 'learning_rate': 7.244582043343653e-06, 'epoch': 0.96}


 96%|█████████▋| 3119/3235 [56:22<01:52,  1.03it/s]

{'loss': 0.9097, 'grad_norm': 0.5580031871795654, 'learning_rate': 7.18266253869969e-06, 'epoch': 0.96}


 96%|█████████▋| 3120/3235 [56:23<01:54,  1.00it/s]

{'loss': 1.086, 'grad_norm': 0.4991094470024109, 'learning_rate': 7.120743034055728e-06, 'epoch': 0.96}


 96%|█████████▋| 3121/3235 [56:24<02:00,  1.06s/it]

{'loss': 0.9708, 'grad_norm': 0.47742825746536255, 'learning_rate': 7.058823529411765e-06, 'epoch': 0.96}


 97%|█████████▋| 3122/3235 [56:25<01:59,  1.06s/it]

{'loss': 0.9852, 'grad_norm': 0.4978407323360443, 'learning_rate': 6.9969040247678016e-06, 'epoch': 0.97}


 97%|█████████▋| 3123/3235 [56:26<02:05,  1.12s/it]

{'loss': 1.0967, 'grad_norm': 0.4973318576812744, 'learning_rate': 6.93498452012384e-06, 'epoch': 0.97}


 97%|█████████▋| 3124/3235 [56:27<01:58,  1.07s/it]

{'loss': 1.0785, 'grad_norm': 0.5229624509811401, 'learning_rate': 6.8730650154798765e-06, 'epoch': 0.97}


 97%|█████████▋| 3125/3235 [56:28<01:59,  1.09s/it]

{'loss': 0.9878, 'grad_norm': 0.49167853593826294, 'learning_rate': 6.811145510835913e-06, 'epoch': 0.97}


 97%|█████████▋| 3126/3235 [56:29<01:52,  1.03s/it]

{'loss': 0.8206, 'grad_norm': 0.5162532329559326, 'learning_rate': 6.7492260061919514e-06, 'epoch': 0.97}


 97%|█████████▋| 3127/3235 [56:30<01:47,  1.01it/s]

{'loss': 0.9165, 'grad_norm': 0.5041527152061462, 'learning_rate': 6.687306501547988e-06, 'epoch': 0.97}


 97%|█████████▋| 3128/3235 [56:31<01:43,  1.03it/s]

{'loss': 0.9725, 'grad_norm': 0.6926621198654175, 'learning_rate': 6.625386996904025e-06, 'epoch': 0.97}


 97%|█████████▋| 3129/3235 [56:32<01:48,  1.03s/it]

{'loss': 1.0229, 'grad_norm': 0.46674907207489014, 'learning_rate': 6.563467492260063e-06, 'epoch': 0.97}


 97%|█████████▋| 3130/3235 [56:33<01:45,  1.00s/it]

{'loss': 0.953, 'grad_norm': 0.4648359715938568, 'learning_rate': 6.5015479876161e-06, 'epoch': 0.97}


 97%|█████████▋| 3131/3235 [56:34<01:52,  1.08s/it]

{'loss': 1.0012, 'grad_norm': 0.44804850220680237, 'learning_rate': 6.439628482972136e-06, 'epoch': 0.97}


 97%|█████████▋| 3132/3235 [56:35<01:44,  1.01s/it]

{'loss': 1.0571, 'grad_norm': 0.5924621820449829, 'learning_rate': 6.3777089783281746e-06, 'epoch': 0.97}


 97%|█████████▋| 3133/3235 [56:36<01:39,  1.02it/s]

{'loss': 0.9991, 'grad_norm': 0.5256087779998779, 'learning_rate': 6.315789473684211e-06, 'epoch': 0.97}


 97%|█████████▋| 3134/3235 [56:37<01:42,  1.01s/it]

{'loss': 1.0694, 'grad_norm': 0.47773775458335876, 'learning_rate': 6.253869969040248e-06, 'epoch': 0.97}


 97%|█████████▋| 3135/3235 [56:38<01:41,  1.01s/it]

{'loss': 0.9558, 'grad_norm': 0.5008619427680969, 'learning_rate': 6.191950464396285e-06, 'epoch': 0.97}


 97%|█████████▋| 3136/3235 [56:39<01:38,  1.01it/s]

{'loss': 0.8761, 'grad_norm': 0.5039809346199036, 'learning_rate': 6.130030959752323e-06, 'epoch': 0.97}


 97%|█████████▋| 3137/3235 [56:40<01:31,  1.07it/s]

{'loss': 1.0859, 'grad_norm': 0.5327221751213074, 'learning_rate': 6.068111455108359e-06, 'epoch': 0.97}


 97%|█████████▋| 3138/3235 [56:41<01:30,  1.07it/s]

{'loss': 1.0232, 'grad_norm': 0.5289304256439209, 'learning_rate': 6.006191950464397e-06, 'epoch': 0.97}


 97%|█████████▋| 3139/3235 [56:42<01:25,  1.12it/s]

{'loss': 0.9393, 'grad_norm': 0.5316414833068848, 'learning_rate': 5.9442724458204335e-06, 'epoch': 0.97}


 97%|█████████▋| 3140/3235 [56:43<01:27,  1.09it/s]

{'loss': 0.9935, 'grad_norm': 0.4774434566497803, 'learning_rate': 5.882352941176471e-06, 'epoch': 0.97}


 97%|█████████▋| 3141/3235 [56:44<01:36,  1.03s/it]

{'loss': 1.0837, 'grad_norm': 0.44366180896759033, 'learning_rate': 5.820433436532508e-06, 'epoch': 0.97}


 97%|█████████▋| 3142/3235 [56:45<01:32,  1.00it/s]

{'loss': 1.0036, 'grad_norm': 0.4981001615524292, 'learning_rate': 5.758513931888545e-06, 'epoch': 0.97}


 97%|█████████▋| 3143/3235 [56:46<01:40,  1.09s/it]

{'loss': 1.0725, 'grad_norm': 0.46869516372680664, 'learning_rate': 5.6965944272445825e-06, 'epoch': 0.97}


 97%|█████████▋| 3144/3235 [56:47<01:37,  1.07s/it]

{'loss': 0.9232, 'grad_norm': 0.5411565899848938, 'learning_rate': 5.63467492260062e-06, 'epoch': 0.97}


 97%|█████████▋| 3145/3235 [56:49<01:46,  1.18s/it]

{'loss': 0.966, 'grad_norm': 0.47419559955596924, 'learning_rate': 5.5727554179566566e-06, 'epoch': 0.97}


 97%|█████████▋| 3146/3235 [56:50<01:38,  1.10s/it]

{'loss': 0.8746, 'grad_norm': 0.4914148151874542, 'learning_rate': 5.510835913312694e-06, 'epoch': 0.97}


 97%|█████████▋| 3147/3235 [56:50<01:31,  1.04s/it]

{'loss': 0.9096, 'grad_norm': 0.48912107944488525, 'learning_rate': 5.4489164086687315e-06, 'epoch': 0.97}


 97%|█████████▋| 3148/3235 [56:52<01:33,  1.07s/it]

{'loss': 1.0062, 'grad_norm': 0.48235833644866943, 'learning_rate': 5.386996904024768e-06, 'epoch': 0.97}


 97%|█████████▋| 3149/3235 [56:52<01:27,  1.01s/it]

{'loss': 1.124, 'grad_norm': 0.573805034160614, 'learning_rate': 5.325077399380806e-06, 'epoch': 0.97}


 97%|█████████▋| 3150/3235 [56:53<01:22,  1.03it/s]

{'loss': 0.969, 'grad_norm': 0.5131033062934875, 'learning_rate': 5.263157894736842e-06, 'epoch': 0.97}


 97%|█████████▋| 3151/3235 [56:55<01:25,  1.02s/it]

{'loss': 0.8451, 'grad_norm': 0.4104475677013397, 'learning_rate': 5.20123839009288e-06, 'epoch': 0.97}


 97%|█████████▋| 3152/3235 [56:56<01:30,  1.09s/it]

{'loss': 1.0169, 'grad_norm': 0.4851408004760742, 'learning_rate': 5.139318885448917e-06, 'epoch': 0.97}


 97%|█████████▋| 3153/3235 [56:57<01:31,  1.12s/it]

{'loss': 1.0215, 'grad_norm': 0.45487740635871887, 'learning_rate': 5.077399380804954e-06, 'epoch': 0.97}


 97%|█████████▋| 3154/3235 [56:58<01:22,  1.02s/it]

{'loss': 0.9814, 'grad_norm': 0.5664079189300537, 'learning_rate': 5.015479876160991e-06, 'epoch': 0.97}


 98%|█████████▊| 3155/3235 [56:59<01:26,  1.08s/it]

{'loss': 0.992, 'grad_norm': 0.44949305057525635, 'learning_rate': 4.953560371517029e-06, 'epoch': 0.98}


 98%|█████████▊| 3156/3235 [57:00<01:24,  1.07s/it]

{'loss': 1.0205, 'grad_norm': 0.5345263481140137, 'learning_rate': 4.891640866873065e-06, 'epoch': 0.98}


 98%|█████████▊| 3157/3235 [57:01<01:28,  1.14s/it]

{'loss': 1.0685, 'grad_norm': 0.45881566405296326, 'learning_rate': 4.829721362229103e-06, 'epoch': 0.98}


 98%|█████████▊| 3158/3235 [57:03<01:30,  1.17s/it]

{'loss': 1.1552, 'grad_norm': 1.340619683265686, 'learning_rate': 4.7678018575851394e-06, 'epoch': 0.98}


 98%|█████████▊| 3159/3235 [57:03<01:19,  1.05s/it]

{'loss': 0.9818, 'grad_norm': 0.5518234372138977, 'learning_rate': 4.705882352941177e-06, 'epoch': 0.98}


 98%|█████████▊| 3160/3235 [57:05<01:26,  1.15s/it]

{'loss': 0.9144, 'grad_norm': 0.45292019844055176, 'learning_rate': 4.643962848297214e-06, 'epoch': 0.98}


 98%|█████████▊| 3161/3235 [57:06<01:25,  1.15s/it]

{'loss': 1.0759, 'grad_norm': 0.5042949914932251, 'learning_rate': 4.582043343653251e-06, 'epoch': 0.98}


 98%|█████████▊| 3162/3235 [57:07<01:26,  1.18s/it]

{'loss': 1.0396, 'grad_norm': 0.4662410318851471, 'learning_rate': 4.5201238390092885e-06, 'epoch': 0.98}


 98%|█████████▊| 3163/3235 [57:08<01:21,  1.13s/it]

{'loss': 0.8197, 'grad_norm': 0.4735140800476074, 'learning_rate': 4.458204334365326e-06, 'epoch': 0.98}


 98%|█████████▊| 3164/3235 [57:09<01:24,  1.19s/it]

{'loss': 1.0703, 'grad_norm': 0.46164020895957947, 'learning_rate': 4.3962848297213626e-06, 'epoch': 0.98}


 98%|█████████▊| 3165/3235 [57:10<01:16,  1.10s/it]

{'loss': 0.9694, 'grad_norm': 0.4766373634338379, 'learning_rate': 4.3343653250774e-06, 'epoch': 0.98}


 98%|█████████▊| 3166/3235 [57:11<01:14,  1.07s/it]

{'loss': 0.8781, 'grad_norm': 0.44640785455703735, 'learning_rate': 4.2724458204334375e-06, 'epoch': 0.98}


 98%|█████████▊| 3167/3235 [57:13<01:17,  1.13s/it]

{'loss': 0.9063, 'grad_norm': 0.4307125210762024, 'learning_rate': 4.210526315789474e-06, 'epoch': 0.98}


 98%|█████████▊| 3168/3235 [57:14<01:15,  1.12s/it]

{'loss': 0.9608, 'grad_norm': 0.41563376784324646, 'learning_rate': 4.148606811145512e-06, 'epoch': 0.98}


 98%|█████████▊| 3169/3235 [57:15<01:17,  1.17s/it]

{'loss': 1.1858, 'grad_norm': 0.55796879529953, 'learning_rate': 4.086687306501548e-06, 'epoch': 0.98}


 98%|█████████▊| 3170/3235 [57:16<01:09,  1.07s/it]

{'loss': 1.0484, 'grad_norm': 0.541408121585846, 'learning_rate': 4.024767801857586e-06, 'epoch': 0.98}


 98%|█████████▊| 3171/3235 [57:17<01:10,  1.10s/it]

{'loss': 1.036, 'grad_norm': 0.4714953899383545, 'learning_rate': 3.962848297213622e-06, 'epoch': 0.98}


 98%|█████████▊| 3172/3235 [57:18<01:14,  1.19s/it]

{'loss': 1.048, 'grad_norm': 0.4537689685821533, 'learning_rate': 3.90092879256966e-06, 'epoch': 0.98}


 98%|█████████▊| 3173/3235 [57:19<01:10,  1.13s/it]

{'loss': 1.123, 'grad_norm': 0.49852538108825684, 'learning_rate': 3.839009287925697e-06, 'epoch': 0.98}


 98%|█████████▊| 3174/3235 [57:20<01:03,  1.05s/it]

{'loss': 0.9514, 'grad_norm': 0.5282371640205383, 'learning_rate': 3.7770897832817343e-06, 'epoch': 0.98}


 98%|█████████▊| 3175/3235 [57:21<01:04,  1.08s/it]

{'loss': 1.1628, 'grad_norm': 0.49661338329315186, 'learning_rate': 3.715170278637771e-06, 'epoch': 0.98}


 98%|█████████▊| 3176/3235 [57:23<01:04,  1.09s/it]

{'loss': 0.9666, 'grad_norm': 0.4914509654045105, 'learning_rate': 3.6532507739938084e-06, 'epoch': 0.98}


 98%|█████████▊| 3177/3235 [57:24<01:02,  1.07s/it]

{'loss': 1.0401, 'grad_norm': 0.5168889760971069, 'learning_rate': 3.591331269349845e-06, 'epoch': 0.98}


 98%|█████████▊| 3178/3235 [57:25<01:04,  1.13s/it]

{'loss': 1.0817, 'grad_norm': 0.45878949761390686, 'learning_rate': 3.5294117647058825e-06, 'epoch': 0.98}


 98%|█████████▊| 3179/3235 [57:26<01:02,  1.11s/it]

{'loss': 0.9005, 'grad_norm': 0.4812273681163788, 'learning_rate': 3.46749226006192e-06, 'epoch': 0.98}


 98%|█████████▊| 3180/3235 [57:27<01:03,  1.15s/it]

{'loss': 1.0273, 'grad_norm': 0.4522927701473236, 'learning_rate': 3.4055727554179566e-06, 'epoch': 0.98}


 98%|█████████▊| 3181/3235 [57:28<00:59,  1.10s/it]

{'loss': 0.9555, 'grad_norm': 0.5202519297599792, 'learning_rate': 3.343653250773994e-06, 'epoch': 0.98}


 98%|█████████▊| 3182/3235 [57:29<00:57,  1.08s/it]

{'loss': 0.9638, 'grad_norm': 0.48573389649391174, 'learning_rate': 3.2817337461300315e-06, 'epoch': 0.98}


 98%|█████████▊| 3183/3235 [57:30<00:55,  1.07s/it]

{'loss': 1.0833, 'grad_norm': 0.5119141340255737, 'learning_rate': 3.219814241486068e-06, 'epoch': 0.98}


 98%|█████████▊| 3184/3235 [57:31<00:53,  1.05s/it]

{'loss': 0.9066, 'grad_norm': 0.5411704778671265, 'learning_rate': 3.1578947368421056e-06, 'epoch': 0.98}


 98%|█████████▊| 3185/3235 [57:32<00:51,  1.02s/it]

{'loss': 1.0307, 'grad_norm': 0.5308459401130676, 'learning_rate': 3.0959752321981426e-06, 'epoch': 0.98}


 98%|█████████▊| 3186/3235 [57:33<00:53,  1.09s/it]

{'loss': 1.0056, 'grad_norm': 0.4234038293361664, 'learning_rate': 3.0340557275541797e-06, 'epoch': 0.98}


 99%|█████████▊| 3187/3235 [57:34<00:52,  1.08s/it]

{'loss': 1.069, 'grad_norm': 0.48858487606048584, 'learning_rate': 2.9721362229102167e-06, 'epoch': 0.99}


 99%|█████████▊| 3188/3235 [57:36<00:53,  1.14s/it]

{'loss': 1.1212, 'grad_norm': 0.5069174766540527, 'learning_rate': 2.910216718266254e-06, 'epoch': 0.99}


 99%|█████████▊| 3189/3235 [57:37<00:49,  1.07s/it]

{'loss': 0.9704, 'grad_norm': 0.4907330572605133, 'learning_rate': 2.8482972136222912e-06, 'epoch': 0.99}


 99%|█████████▊| 3190/3235 [57:38<00:47,  1.06s/it]

{'loss': 1.0679, 'grad_norm': 0.4789733290672302, 'learning_rate': 2.7863777089783283e-06, 'epoch': 0.99}


 99%|█████████▊| 3191/3235 [57:39<00:47,  1.09s/it]

{'loss': 1.0993, 'grad_norm': 0.4963260591030121, 'learning_rate': 2.7244582043343658e-06, 'epoch': 0.99}


 99%|█████████▊| 3192/3235 [57:40<00:46,  1.09s/it]

{'loss': 1.0676, 'grad_norm': 0.5132740139961243, 'learning_rate': 2.662538699690403e-06, 'epoch': 0.99}


 99%|█████████▊| 3193/3235 [57:41<00:45,  1.09s/it]

{'loss': 0.9977, 'grad_norm': 0.5119229555130005, 'learning_rate': 2.60061919504644e-06, 'epoch': 0.99}


 99%|█████████▊| 3194/3235 [57:42<00:44,  1.09s/it]

{'loss': 0.89, 'grad_norm': 0.48402488231658936, 'learning_rate': 2.538699690402477e-06, 'epoch': 0.99}


 99%|█████████▉| 3195/3235 [57:44<00:48,  1.21s/it]

{'loss': 1.1496, 'grad_norm': 0.500164270401001, 'learning_rate': 2.4767801857585144e-06, 'epoch': 0.99}


 99%|█████████▉| 3196/3235 [57:45<00:45,  1.17s/it]

{'loss': 1.1005, 'grad_norm': 0.4807211458683014, 'learning_rate': 2.4148606811145514e-06, 'epoch': 0.99}


 99%|█████████▉| 3197/3235 [57:46<00:43,  1.15s/it]

{'loss': 1.1007, 'grad_norm': 0.4630413353443146, 'learning_rate': 2.3529411764705885e-06, 'epoch': 0.99}


 99%|█████████▉| 3198/3235 [57:47<00:42,  1.16s/it]

{'loss': 0.9425, 'grad_norm': 0.46935251355171204, 'learning_rate': 2.2910216718266255e-06, 'epoch': 0.99}


 99%|█████████▉| 3199/3235 [57:48<00:42,  1.17s/it]

{'loss': 0.8711, 'grad_norm': 0.4570116698741913, 'learning_rate': 2.229102167182663e-06, 'epoch': 0.99}


 99%|█████████▉| 3200/3235 [57:49<00:39,  1.14s/it]

{'loss': 0.9753, 'grad_norm': 0.49107030034065247, 'learning_rate': 2.1671826625387e-06, 'epoch': 0.99}


 99%|█████████▉| 3201/3235 [57:50<00:38,  1.12s/it]

{'loss': 0.9703, 'grad_norm': 0.5401831269264221, 'learning_rate': 2.105263157894737e-06, 'epoch': 0.99}


 99%|█████████▉| 3202/3235 [57:51<00:34,  1.05s/it]

{'loss': 0.9622, 'grad_norm': 0.5235339999198914, 'learning_rate': 2.043343653250774e-06, 'epoch': 0.99}


 99%|█████████▉| 3203/3235 [57:52<00:34,  1.08s/it]

{'loss': 1.0503, 'grad_norm': 0.4941959083080292, 'learning_rate': 1.981424148606811e-06, 'epoch': 0.99}


 99%|█████████▉| 3204/3235 [57:53<00:33,  1.08s/it]

{'loss': 1.0565, 'grad_norm': 0.4904038608074188, 'learning_rate': 1.9195046439628486e-06, 'epoch': 0.99}


 99%|█████████▉| 3205/3235 [57:54<00:30,  1.02s/it]

{'loss': 1.021, 'grad_norm': 0.5058164596557617, 'learning_rate': 1.8575851393188855e-06, 'epoch': 0.99}


 99%|█████████▉| 3206/3235 [57:55<00:31,  1.07s/it]

{'loss': 1.037, 'grad_norm': 0.4765626788139343, 'learning_rate': 1.7956656346749225e-06, 'epoch': 0.99}


 99%|█████████▉| 3207/3235 [57:57<00:30,  1.08s/it]

{'loss': 0.9723, 'grad_norm': 0.47346723079681396, 'learning_rate': 1.73374613003096e-06, 'epoch': 0.99}


 99%|█████████▉| 3208/3235 [57:58<00:32,  1.19s/it]

{'loss': 0.9113, 'grad_norm': 0.42375648021698, 'learning_rate': 1.671826625386997e-06, 'epoch': 0.99}


 99%|█████████▉| 3209/3235 [57:59<00:30,  1.16s/it]

{'loss': 0.9776, 'grad_norm': 0.48780152201652527, 'learning_rate': 1.609907120743034e-06, 'epoch': 0.99}


 99%|█████████▉| 3210/3235 [58:00<00:28,  1.15s/it]

{'loss': 0.8815, 'grad_norm': 0.4791196286678314, 'learning_rate': 1.5479876160990713e-06, 'epoch': 0.99}


 99%|█████████▉| 3211/3235 [58:01<00:28,  1.18s/it]

{'loss': 0.9262, 'grad_norm': 0.46877309679985046, 'learning_rate': 1.4860681114551084e-06, 'epoch': 0.99}


 99%|█████████▉| 3212/3235 [58:03<00:26,  1.16s/it]

{'loss': 1.0406, 'grad_norm': 0.4940102994441986, 'learning_rate': 1.4241486068111456e-06, 'epoch': 0.99}


 99%|█████████▉| 3213/3235 [58:04<00:24,  1.13s/it]

{'loss': 1.0132, 'grad_norm': 0.5133053660392761, 'learning_rate': 1.3622291021671829e-06, 'epoch': 0.99}


 99%|█████████▉| 3214/3235 [58:05<00:23,  1.11s/it]

{'loss': 1.076, 'grad_norm': 0.5264838933944702, 'learning_rate': 1.30030959752322e-06, 'epoch': 0.99}


 99%|█████████▉| 3215/3235 [58:06<00:23,  1.18s/it]

{'loss': 1.0185, 'grad_norm': 0.47381386160850525, 'learning_rate': 1.2383900928792572e-06, 'epoch': 0.99}


 99%|█████████▉| 3216/3235 [58:07<00:22,  1.19s/it]

{'loss': 1.1676, 'grad_norm': 0.4943065643310547, 'learning_rate': 1.1764705882352942e-06, 'epoch': 0.99}


 99%|█████████▉| 3217/3235 [58:09<00:21,  1.22s/it]

{'loss': 0.925, 'grad_norm': 0.4595752954483032, 'learning_rate': 1.1145510835913315e-06, 'epoch': 0.99}


 99%|█████████▉| 3218/3235 [58:10<00:19,  1.16s/it]

{'loss': 1.0182, 'grad_norm': 0.5331926345825195, 'learning_rate': 1.0526315789473685e-06, 'epoch': 0.99}


100%|█████████▉| 3219/3235 [58:10<00:17,  1.07s/it]

{'loss': 1.0089, 'grad_norm': 0.5216234922409058, 'learning_rate': 9.907120743034056e-07, 'epoch': 1.0}


100%|█████████▉| 3220/3235 [58:12<00:16,  1.10s/it]

{'loss': 0.86, 'grad_norm': 0.4681910574436188, 'learning_rate': 9.287925696594427e-07, 'epoch': 1.0}


100%|█████████▉| 3221/3235 [58:13<00:14,  1.05s/it]

{'loss': 0.9203, 'grad_norm': 0.5957198143005371, 'learning_rate': 8.6687306501548e-07, 'epoch': 1.0}


100%|█████████▉| 3222/3235 [58:14<00:14,  1.10s/it]

{'loss': 1.0895, 'grad_norm': 0.43047723174095154, 'learning_rate': 8.04953560371517e-07, 'epoch': 1.0}


100%|█████████▉| 3223/3235 [58:15<00:13,  1.12s/it]

{'loss': 0.8884, 'grad_norm': 0.45812761783599854, 'learning_rate': 7.430340557275542e-07, 'epoch': 1.0}


100%|█████████▉| 3224/3235 [58:16<00:12,  1.14s/it]

{'loss': 1.0362, 'grad_norm': 0.4825124740600586, 'learning_rate': 6.811145510835914e-07, 'epoch': 1.0}


100%|█████████▉| 3225/3235 [58:17<00:10,  1.03s/it]

{'loss': 0.9244, 'grad_norm': 0.5087699294090271, 'learning_rate': 6.191950464396286e-07, 'epoch': 1.0}


100%|█████████▉| 3226/3235 [58:18<00:09,  1.07s/it]

{'loss': 1.0455, 'grad_norm': 0.4992772936820984, 'learning_rate': 5.572755417956657e-07, 'epoch': 1.0}


100%|█████████▉| 3227/3235 [58:19<00:08,  1.08s/it]

{'loss': 0.9836, 'grad_norm': 0.5473756194114685, 'learning_rate': 4.953560371517028e-07, 'epoch': 1.0}


100%|█████████▉| 3228/3235 [58:20<00:07,  1.08s/it]

{'loss': 0.8882, 'grad_norm': 0.47101983428001404, 'learning_rate': 4.3343653250774e-07, 'epoch': 1.0}


100%|█████████▉| 3229/3235 [58:21<00:06,  1.06s/it]

{'loss': 1.0026, 'grad_norm': 0.5587285161018372, 'learning_rate': 3.715170278637771e-07, 'epoch': 1.0}


100%|█████████▉| 3230/3235 [58:22<00:04,  1.01it/s]

{'loss': 0.9835, 'grad_norm': 0.5704794526100159, 'learning_rate': 3.095975232198143e-07, 'epoch': 1.0}


100%|█████████▉| 3231/3235 [58:23<00:04,  1.08s/it]

{'loss': 1.0546, 'grad_norm': 0.44115370512008667, 'learning_rate': 2.476780185758514e-07, 'epoch': 1.0}


100%|█████████▉| 3232/3235 [58:25<00:03,  1.10s/it]

{'loss': 0.9306, 'grad_norm': 0.42141929268836975, 'learning_rate': 1.8575851393188855e-07, 'epoch': 1.0}


100%|█████████▉| 3233/3235 [58:26<00:02,  1.15s/it]

{'loss': 1.088, 'grad_norm': 0.4451414942741394, 'learning_rate': 1.238390092879257e-07, 'epoch': 1.0}


100%|█████████▉| 3234/3235 [58:27<00:01,  1.22s/it]

{'loss': 1.0732, 'grad_norm': 0.4478287398815155, 'learning_rate': 6.191950464396285e-08, 'epoch': 1.0}


100%|██████████| 3235/3235 [58:29<00:00,  1.25s/it]

{'loss': 1.0245, 'grad_norm': 0.438296914100647, 'learning_rate': 0.0, 'epoch': 1.0}
{'train_runtime': 3511.0717, 'train_samples_per_second': 14.742, 'train_steps_per_second': 0.921, 'train_loss': 1.030300565333418, 'epoch': 1.0}


100%|██████████| 3235/3235 [58:29<00:00,  1.08s/it]


In [8]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

3511.0717 seconds used for training.
58.52 minutes used for training.
Peak reserved memory = 21.166 GB.
Peak reserved memory for training = 18.947 GB.
Peak reserved memory % of max memory = 89.508 %.
Peak reserved memory for training % of max memory = 80.124 %.
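
The arithmetic behind these statistics can be checked on plain numbers, without a GPU. The sketch below is only an illustration: `memory_stats` is a hypothetical helper, and the `start_gb` and `max_gb` values are not printed anywhere in this notebook. They are back-derived from the output above (21.166 - 18.947 for the starting reservation, and 21.166 / 0.89508 for the device total), so treat them as assumptions.

```python
def memory_stats(peak_gb, start_gb, max_gb):
    """Mirror the cell's arithmetic on plain numbers (all values in GB)."""
    training_gb = round(peak_gb - start_gb, 3)
    return {
        "training_gb": training_gb,
        "peak_pct": round(peak_gb / max_gb * 100, 3),
        "training_pct": round(training_gb / max_gb * 100, 3),
    }

# 21.166 GB is the printed peak. 2.219 GB and 23.647 GB are assumptions,
# back-derived from the printed figures rather than read from the GPU.
stats = memory_stats(peak_gb=21.166, start_gb=2.219, max_gb=23.647)
print(stats)
```

With those inputs the helper reproduces the three derived figures printed above (18.947 GB, 89.508 % and 80.124 %), which suggests the run used a GPU with roughly 24 GB of memory.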


<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!

In [9]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Continue the fibonnaci sequence.", # instruction
        "1, 1, 2, 3, 5, 8", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

['<bos>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nContinue the fibonnaci sequence.\n\n### Input:\n1, 1, 2, 3, 5, 8\n\n### Response:\nThe next two numbers in the Fibonacci sequence are 13 and 21.<eos>']
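
`batch_decode` returns the whole prompt together with the completion. If you only want the generated answer, one option is to cut the decoded string at the response marker. The sketch below is a minimal illustration: `extract_response` is a hypothetical helper, and `ALPACA_TEMPLATE` is reconstructed from the decoded output above (the real `alpaca_prompt` is defined in an earlier cell and may differ).

```python
# Template reconstructed from the decoded output printed above (an assumption;
# the notebook defines the real `alpaca_prompt` in an earlier cell).
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{}\n\n### Input:\n{}\n\n### Response:\n{}"
)

def extract_response(decoded, marker="### Response:\n", eos="<eos>"):
    """Return only the text generated after the response marker."""
    _, _, tail = decoded.partition(marker)   # tail is "" if the marker is missing
    return tail.split(eos, 1)[0].strip()     # drop the end-of-sequence token

# Rebuild the decoded string shown above and pull out the answer.
decoded = "<bos>" + ALPACA_TEMPLATE.format(
    "Continue the fibonnaci sequence.",
    "1, 1, 2, 3, 5, 8",
    "The next two numbers in the Fibonacci sequence are 13 and 21.<eos>",
)
print(extract_response(decoded))
```

In the notebook you would call `extract_response(tokenizer.batch_decode(outputs)[0])` instead of rebuilding the string by hand.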

You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the whole output!

In [10]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Continue the fibonnaci sequence for the next 6 numbers.", # instruction
        "1, 1, 2, 3, 5, 8, 13, 21", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

<bos>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Continue the fibonnaci sequence for the next 6 numbers.

### Input:
1, 1, 2, 3, 5, 8, 13, 21

### Response:
The next 6 numbers in the Fibonacci sequence are 34, 55, 89, 144, 233, and 377.<eos>
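
Since this prompt has a deterministic answer, the streamed response can be checked in plain Python. This is just a sanity-check sketch; `continue_fibonacci` is a hypothetical helper, and the `model_answer` list is copied by hand from the response above.

```python
def continue_fibonacci(seed, count):
    """Extend a Fibonacci-style sequence by `count` terms and return the new terms."""
    seq = list(seed)
    for _ in range(count):
        seq.append(seq[-1] + seq[-2])
    return seq[len(seed):]

seed = [1, 1, 2, 3, 5, 8, 13, 21]            # the input given to the model
model_answer = [34, 55, 89, 144, 233, 377]   # numbers from the streamed response
print(continue_fibonacci(seed, 6) == model_answer)
```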


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [11]:
model.save_pretrained("lora_model") # Local saving
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving

Now, if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:

In [12]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# alpaca_prompt = You MUST copy from above!

inputs = tokenizer(
[
    alpaca_prompt.format(
        "What is a famous tall tower in Paris?", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

["<bos>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat is a famous tall tower in Paris?\n\n### Input:\n\n\n### Response:\nOne famous tall tower in Paris is the Eiffel Tower, also known as the Iron Lady. It is a wrought iron lattice tower that was built for the 1889 World's Fair in Paris, France. It is one of the most recognizable landmarks in the world and is a popular tourist attraction.<eos>"]

You can also use Hugging Face's `AutoPeftModelForCausalLM`. Only use this if you do not have `unsloth` installed. It can be hopelessly slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.

In [13]:
if False:
    # NOT recommended - use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [14]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

### GGUF / llama.cpp Conversion
We now support saving to `GGUF` / `llama.cpp` natively! We clone `llama.cpp` and save to `q8_0` by default. Other quantization methods, such as `q4_k_m`, are also available. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

In [15]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in `llama.cpp` or a UI based system like `GPT4All`. You can install GPT4All by going [here](https://gpt4all.io/index.html).
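
To see which GGUF files a conversion actually produced, you can list them with the standard library. This is a minimal sketch: `find_gguf_files` is a hypothetical helper, and the directory you pass is an assumption, since where the `.gguf` files land depends on the path you gave to `save_pretrained_gguf`.

```python
from pathlib import Path

def find_gguf_files(directory="."):
    """Return the names of all .gguf files directly inside `directory`, sorted."""
    return sorted(p.name for p in Path(directory).glob("*.gguf"))

# Adjust the directory to wherever your conversion wrote its output.
print(find_gguf_files("."))
```

Any file name returned here can then be passed to `llama.cpp` or loaded in a UI such as `GPT4All`.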

And we're done! If you have any questions about Unsloth, find a bug, need help, or want to join projects and keep up with the latest LLM news, feel free to join our [Discord](https://discord.gg/u54VK8m8tk)!

Some other links:
1. Zephyr DPO 2x faster [free Colab](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing)
2. Llama 7b 2x faster [free Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing)
3. TinyLlama 4x faster full Alpaca 52K in 1 hour [free Colab](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing)
4. CodeLlama 34b 2x faster [A100 on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing)
5. Mistral 7b [free Kaggle version](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook)
6. We also did a [blog](https://huggingface.co/blog/unsloth-trl) with 🤗 Hugging Face, and we're in the TRL [docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth)!
7. `ChatML` for ShareGPT datasets, [conversational notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing)
8. Text completions like novel writing [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing)

<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Support our work if you can! Thanks!
</div>