To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Join Discord if you need help + support us if you can!
</div>

To install Unsloth on your own computer, follow the installation instructions on our GitHub page [here](https://github.com/unslothai/unsloth#installation-instructions---conda).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save) (eg for Llama.cpp).

See our [blog post](https://unsloth.ai/blog/gemma) on how we made **Gemma 7b 2.5x faster** and **Gemma 2x faster**!

In [1]:
# %%capture
import torch
import os

os.environ["WANDB_PROJECT"] = "gemma_2b_test"
os.environ["WANDB_LOG_MODEL"] = "checkpoint"
# major_version, minor_version = torch.cuda.get_device_capability()
# if major_version >= 8:
#     # Use this for new GPUs like Ampere, Hopper GPUs (RTX 30xx, RTX 40xx, A100, H100, L40)
#     !pip install "unsloth[colab-ampere] @ git+https://github.com/unslothai/unsloth.git"
# else:
#     # Use this for older GPUs (V100, Tesla T4, RTX 20xx)
#     !pip install "unsloth[colab] @ git+https://github.com/unslothai/unsloth.git"
# pass

* We support Llama, Mistral, CodeLlama, TinyLlama, Vicuna, Open Hermes etc
* And Yi, Qwen ([llamafied](https://huggingface.co/models?sort=trending&search=qwen+llama)), Deepseek, all Llama, Mistral derived archs.
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* [**NEW**] With [PR 26037](https://github.com/huggingface/transformers/pull/26037), we support downloading 4bit models **4x faster**! [Our repo](https://huggingface.co/unsloth) has Llama, Mistral 4bit models.
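
To make the RoPE scaling bullet concrete, here is a minimal sketch of linear position scaling, which is the idea usually attributed to kaiokendev: positions are divided by a constant so that a longer sequence maps back into the position range seen during pretraining. The helper `rope_angles`, the `trained_len` value, and the tiny `head_dim` are illustrative assumptions, not Unsloth internals.

```python
def rope_angles(position, head_dim, base=10000.0, scale=1.0):
    """Rotation angles for one token position under linear RoPE scaling.

    scale > 1 compresses positions (position / scale), so longer sequences
    reuse the angle range the model saw during pretraining.
    """
    scaled_position = position / scale
    return [scaled_position / (base ** (2 * i / head_dim))
            for i in range(head_dim // 2)]

trained_len = 2048   # hypothetical pretraining context length (assumption)
target_len = 8192    # same value as max_seq_length in the next cell
scale = target_len / trained_len

# The last position of the long sequence, after scaling ...
last_scaled = rope_angles(target_len - 1, head_dim=8, scale=scale)
# ... gets the same angles as a fractional position inside the trained range.
inside_range = rope_angles((target_len - 1) / scale, head_dim=8)

print(scale, (target_len - 1) / scale)
print(last_scaled == inside_range)
```

If the model's native context already covers `max_seq_length`, the scale factor is 1 and nothing changes.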

In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 8192 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
    "unsloth/llama-2-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",
    "unsloth/gemma-7b-it-bnb-4bit", # Instruct version of Gemma 7b
    "unsloth/gemma-2b-bnb-4bit",
    "unsloth/gemma-2b-it-bnb-4bit", # Instruct version of Gemma 2b
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gemma-2b-bnb-4bit", # Choose ANY! eg teknium/OpenHermes-2.5-Mistral-7B
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

  from .autonotebook import tqdm as notebook_tqdm
    PyTorch 2.1.0+cu121 with CUDA 1201 (you have 2.2.1+cu121)
    Python  3.10.13 (you have 3.10.13)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details


==((====))==  Unsloth: Fast Gemma patching release 2024.3
   \\   /|    GPU: NVIDIA GeForce RTX 4090. Max memory: 23.647 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.1+cu121. CUDA = 8.9. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. Xformers = 0.0.22.post7. FA = True.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth

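The `dtype = None` auto-detection can be pictured with the rule stated in the cell's own comment: bfloat16 on Ampere or newer GPUs, float16 otherwise. The function below is only a sketch of that rule using the CUDA compute capability major version (the same threshold of 8 appears in the install cell); `pick_dtype_name` is a hypothetical helper, not part of Unsloth.

```python
def pick_dtype_name(compute_capability_major):
    """Return "bfloat16" for compute capability >= 8 (Ampere+), else "float16"."""
    return "bfloat16" if compute_capability_major >= 8 else "float16"

# Major compute capability versions for a few common GPUs.
for gpu, major in [("Tesla T4", 7), ("V100", 7), ("A100", 8), ("RTX 4090", 8)]:
    print(f"{gpu}: {pick_dtype_name(major)}")
```

This agrees with the banner above, which reports `CUDA = 8.9` and `Bfloat16 = TRUE` for the RTX 4090.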



We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = True,
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.3 patched 18 layers with 18 QKV layers, 18 O layers and 18 MLP layers.
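
As a sanity check on how few parameters LoRA trains, the count can be worked out by hand: an adapter of rank `r` on a linear layer adds `r * (in_features + out_features)` weights. The sketch below assumes the layer shapes from Gemma 2b's public config (they are not printed anywhere in this notebook), together with `r = 16` and the seven target modules from the cell above.

```python
def lora_param_count(in_features, out_features, rank):
    """A LoRA adapter adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)

# Assumed Gemma 2b shapes (taken from its public config, not from this notebook).
hidden, intermediate = 2048, 16384
num_heads, num_kv_heads, head_dim = 8, 1, 256
num_layers = 18   # matches "patched 18 layers" in the output above
rank = 16         # r = 16 in get_peft_model

# (in_features, out_features) for each targeted projection.
shapes = {
    "q_proj": (hidden, num_heads * head_dim),
    "k_proj": (hidden, num_kv_heads * head_dim),
    "v_proj": (hidden, num_kv_heads * head_dim),
    "o_proj": (num_heads * head_dim, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

per_layer = sum(lora_param_count(i, o, rank) for i, o in shapes.values())
total = per_layer * num_layers
print(f"{per_layer:,} per layer, {total:,} in total")
```

Under these assumptions the total is 19,611,648, which is the "Number of trainable parameters" that the training banner reports later in this notebook.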


<a name="Data"></a>
### Data Prep
We now use the Alpaca dataset from [yahma](https://huggingface.co/datasets/yahma/alpaca-cleaned), which is a filtered version of the original 52K [Alpaca dataset](https://crfm.stanford.edu/2023/03/13/alpaca.html). You can replace this code section with your own data prep.

**[NOTE]** To train only on completions (ignoring the user's input) read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).

**[NOTE]** Remember to add the **EOS_TOKEN** to the tokenized output!! Otherwise you'll get infinite generations!

If you want to use the `ChatML` template for ShareGPT datasets, try our conversational [notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing).

For text completions like novel writing, try this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing).

In [4]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass

from datasets import load_dataset
# dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
# load json dataset at input.json
dataset = load_dataset('json', data_files='../combined_output.json', split='train')
dataset = dataset.map(formatting_prompts_func, batched = True,)
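
To see what the mapping produces without loading a tokenizer or a dataset, here is a standalone sketch of the same formatting step on a toy batch. The `"<eos>"` string is a placeholder for `tokenizer.eos_token`, and the two records are made up for illustration; a custom JSON file needs the same `instruction`, `input`, and `output` keys for the cell above to work.

```python
ALPACA_TEMPLATE = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS = "<eos>"  # placeholder (assumption); the notebook uses tokenizer.eos_token

def format_batch(batch, eos_token=EOS):
    """Turn a batched dict of columns into a dict with one "text" column."""
    texts = [
        ALPACA_TEMPLATE.format(instruction, context, output) + eos_token
        for instruction, context, output in zip(
            batch["instruction"], batch["input"], batch["output"]
        )
    ]
    return {"text": texts}

# Hypothetical records, in the column-wise layout that batched map() passes in.
toy_batch = {
    "instruction": ["Add the numbers.", "Translate to French."],
    "input": ["2 and 3", "Good morning"],
    "output": ["5", "Bonjour"],
}

formatted = format_batch(toy_batch)
print(formatted["text"][0])
```

Every formatted string ends with the EOS marker, which is what lets the fine-tuned model learn where a response stops.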

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). This run trains for 10 full epochs (`num_train_epochs = 10`) and leaves `max_steps` unset; for a quick test you can instead cap training with `max_steps = 60`. We also support TRL's `DPOTrainer`!
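
The relationship between epochs, batch size, and optimizer steps can be checked with a little arithmetic. The sketch below plugs in the values from the `TrainingArguments` in the next cell and the dataset size from the training banner further down (1,173 examples). It assumes that a leftover partial accumulation window at the end of an epoch is not counted as a step; that assumption reproduces the banner's number here, but other library versions may round differently. `total_optimizer_steps` is a hypothetical helper.

```python
import math

def total_optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_gpus=1):
    """Estimate optimizer steps, flooring partial accumulation windows (assumption)."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_gpus))
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return steps_per_epoch * epochs

effective_batch = 1 * 4  # per_device_train_batch_size * gradient_accumulation_steps
steps = total_optimizer_steps(num_examples=1173, per_device_batch=1,
                              grad_accum=4, epochs=10)
print(effective_batch, steps)
```

With these inputs the estimate is 2,930 steps at an effective batch size of 4, matching "Total batch size = 4 | Total steps = 2,930" in the banner.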

In [5]:
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 4,
        warmup_steps = 1,
        # max_steps = ,
        num_train_epochs = 10,
        learning_rate = 4e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to="wandb",
    ),
)

  torch.utils._pytree._register_pytree_node(
  torch.utils._pytree._register_pytree_node(
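
The `learning_rate` values that appear in the training log follow from `lr_scheduler_type = "linear"` with `warmup_steps = 1`. Here is a minimal sketch of a linear warmup and decay schedule, written from the general definition rather than copied from the library; `linear_schedule_lr` is a hypothetical helper, and the step indexing is chosen to line up with the logged values.

```python
def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup to base_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

base_lr, warmup_steps, total_steps = 4e-4, 1, 2930  # values from this notebook

for step in (1, 2, 30, total_steps):
    print(step, linear_schedule_lr(step, base_lr, warmup_steps, total_steps))
```

Steps 1, 2, and 30 come out at roughly 0.0004, 0.00039986, and 0.00039604, in line with the logged learning rates, and the rate reaches zero at the final step.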


In [6]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA GeForce RTX 4090. Max memory = 23.647 GB.
2.219 GB of memory reserved.


In [7]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 1,173 | Num Epochs = 10
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 4
\        /    Total batch size = 4 | Total steps = 2,930
 "-____-"     Number of trainable parameters = 19,611,648
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: yashas-salankimatt (yashas-personal). Use `wandb login --relogin` to force relogin


  0%|          | 1/2930 [00:02<1:42:01,  2.09s/it]

{'loss': 0.8959, 'grad_norm': 0.11932601034641266, 'learning_rate': 0.0004, 'epoch': 0.0}


  0%|          | 2/2930 [00:03<1:30:38,  1.86s/it]

{'loss': 0.9509, 'grad_norm': 0.12995079159736633, 'learning_rate': 0.000399863434619324, 'epoch': 0.01}


  0%|          | 3/2930 [00:05<1:26:20,  1.77s/it]

{'loss': 0.8057, 'grad_norm': 0.1028643399477005, 'learning_rate': 0.00039972686923864806, 'epoch': 0.01}


  0%|          | 4/2930 [00:07<1:24:35,  1.73s/it]

{'loss': 0.781, 'grad_norm': 0.11891897022724152, 'learning_rate': 0.000399590303857972, 'epoch': 0.01}


  0%|          | 5/2930 [00:08<1:22:48,  1.70s/it]

{'loss': 0.7576, 'grad_norm': 0.11693283915519714, 'learning_rate': 0.000399453738477296, 'epoch': 0.02}


  0%|          | 6/2930 [00:10<1:22:47,  1.70s/it]

{'loss': 0.798, 'grad_norm': 0.11341890692710876, 'learning_rate': 0.00039931717309662004, 'epoch': 0.02}


  0%|          | 7/2930 [00:12<1:22:36,  1.70s/it]

{'loss': 0.7117, 'grad_norm': 0.122858926653862, 'learning_rate': 0.00039918060771594404, 'epoch': 0.02}


  0%|          | 8/2930 [00:13<1:21:39,  1.68s/it]

{'loss': 0.6596, 'grad_norm': 0.12956611812114716, 'learning_rate': 0.00039904404233526803, 'epoch': 0.03}


  0%|          | 9/2930 [00:15<1:21:10,  1.67s/it]

{'loss': 0.6033, 'grad_norm': 0.21965965628623962, 'learning_rate': 0.000398907476954592, 'epoch': 0.03}


  0%|          | 10/2930 [00:17<1:21:19,  1.67s/it]

{'loss': 0.5973, 'grad_norm': 0.1176835373044014, 'learning_rate': 0.000398770911573916, 'epoch': 0.03}


  0%|          | 11/2930 [00:18<1:21:09,  1.67s/it]

{'loss': 0.6031, 'grad_norm': 0.11746004223823547, 'learning_rate': 0.00039863434619324, 'epoch': 0.04}


  0%|          | 12/2930 [00:20<1:21:14,  1.67s/it]

{'loss': 0.5548, 'grad_norm': 0.10500402748584747, 'learning_rate': 0.000398497780812564, 'epoch': 0.04}


  0%|          | 13/2930 [00:22<1:21:00,  1.67s/it]

{'loss': 0.6046, 'grad_norm': 0.13920103013515472, 'learning_rate': 0.00039836121543188806, 'epoch': 0.04}


  0%|          | 14/2930 [00:23<1:20:18,  1.65s/it]

{'loss': 0.5541, 'grad_norm': 0.13967584073543549, 'learning_rate': 0.00039822465005121205, 'epoch': 0.05}


  1%|          | 15/2930 [00:25<1:20:27,  1.66s/it]

{'loss': 0.5453, 'grad_norm': 0.13367138803005219, 'learning_rate': 0.00039808808467053605, 'epoch': 0.05}


  1%|          | 16/2930 [00:27<1:20:36,  1.66s/it]

{'loss': 0.5255, 'grad_norm': 0.1382903754711151, 'learning_rate': 0.00039795151928986004, 'epoch': 0.05}


  1%|          | 17/2930 [00:28<1:20:47,  1.66s/it]

{'loss': 0.5806, 'grad_norm': 0.16521431505680084, 'learning_rate': 0.00039781495390918403, 'epoch': 0.06}


  1%|          | 18/2930 [00:30<1:20:26,  1.66s/it]

{'loss': 0.5837, 'grad_norm': 0.14018790423870087, 'learning_rate': 0.00039767838852850803, 'epoch': 0.06}


  1%|          | 19/2930 [00:32<1:20:10,  1.65s/it]

{'loss': 0.5877, 'grad_norm': 0.20510762929916382, 'learning_rate': 0.0003975418231478321, 'epoch': 0.06}


  1%|          | 20/2930 [00:33<1:20:26,  1.66s/it]

{'loss': 0.5098, 'grad_norm': 0.15951736271381378, 'learning_rate': 0.00039740525776715607, 'epoch': 0.07}


  1%|          | 21/2930 [00:35<1:20:37,  1.66s/it]

{'loss': 0.5985, 'grad_norm': 0.18222330510616302, 'learning_rate': 0.00039726869238648, 'epoch': 0.07}


  1%|          | 22/2930 [00:37<1:20:51,  1.67s/it]

{'loss': 0.5194, 'grad_norm': 0.12442512810230255, 'learning_rate': 0.00039713212700580406, 'epoch': 0.08}


  1%|          | 23/2930 [00:38<1:20:28,  1.66s/it]

{'loss': 0.5462, 'grad_norm': 0.1894417256116867, 'learning_rate': 0.00039699556162512805, 'epoch': 0.08}


  1%|          | 24/2930 [00:40<1:20:38,  1.66s/it]

{'loss': 0.4865, 'grad_norm': 0.12953850626945496, 'learning_rate': 0.00039685899624445205, 'epoch': 0.08}


  1%|          | 25/2930 [00:42<1:20:47,  1.67s/it]

{'loss': 0.4825, 'grad_norm': 0.19135719537734985, 'learning_rate': 0.0003967224308637761, 'epoch': 0.09}


  1%|          | 26/2930 [00:43<1:20:52,  1.67s/it]

{'loss': 0.5047, 'grad_norm': 0.1669173687696457, 'learning_rate': 0.00039658586548310004, 'epoch': 0.09}


  1%|          | 27/2930 [00:45<1:20:24,  1.66s/it]

{'loss': 0.4946, 'grad_norm': 0.1852615624666214, 'learning_rate': 0.00039644930010242403, 'epoch': 0.09}


  1%|          | 28/2930 [00:47<1:20:35,  1.67s/it]

{'loss': 0.5015, 'grad_norm': 0.19344154000282288, 'learning_rate': 0.000396312734721748, 'epoch': 0.1}


  1%|          | 29/2930 [00:48<1:21:04,  1.68s/it]

{'loss': 0.5102, 'grad_norm': 0.14693255722522736, 'learning_rate': 0.00039617616934107207, 'epoch': 0.1}


  1%|          | 30/2930 [00:50<1:20:56,  1.67s/it]

{'loss': 0.5022, 'grad_norm': 0.15513701736927032, 'learning_rate': 0.00039603960396039607, 'epoch': 0.1}


  1%|          | 31/2930 [00:52<1:20:09,  1.66s/it]

{'loss': 0.5442, 'grad_norm': 0.16791105270385742, 'learning_rate': 0.00039590303857972006, 'epoch': 0.11}


  1%|          | 32/2930 [00:53<1:20:01,  1.66s/it]

{'loss': 0.4718, 'grad_norm': 0.17000769078731537, 'learning_rate': 0.00039576647319904405, 'epoch': 0.11}


  1%|          | 33/2930 [00:55<1:20:09,  1.66s/it]

{'loss': 0.4348, 'grad_norm': 0.15345706045627594, 'learning_rate': 0.00039562990781836805, 'epoch': 0.11}


  1%|          | 34/2930 [00:57<1:19:59,  1.66s/it]

{'loss': 0.4994, 'grad_norm': 0.19199515879154205, 'learning_rate': 0.00039549334243769204, 'epoch': 0.12}


  1%|          | 35/2930 [00:58<1:19:45,  1.65s/it]

{'loss': 0.5367, 'grad_norm': 0.19298982620239258, 'learning_rate': 0.0003953567770570161, 'epoch': 0.12}


  1%|          | 36/2930 [01:00<1:20:00,  1.66s/it]

{'loss': 0.4753, 'grad_norm': 0.18228675425052643, 'learning_rate': 0.0003952202116763401, 'epoch': 0.12}


  1%|▏         | 37/2930 [01:01<1:20:08,  1.66s/it]

{'loss': 0.4911, 'grad_norm': 0.22715966403484344, 'learning_rate': 0.0003950836462956641, 'epoch': 0.13}


  1%|▏         | 38/2930 [01:03<1:20:04,  1.66s/it]

{'loss': 0.4644, 'grad_norm': 0.17330849170684814, 'learning_rate': 0.0003949470809149881, 'epoch': 0.13}


  1%|▏         | 39/2930 [01:05<1:19:35,  1.65s/it]

{'loss': 0.5084, 'grad_norm': 0.3439425230026245, 'learning_rate': 0.00039481051553431207, 'epoch': 0.13}


  1%|▏         | 40/2930 [01:06<1:19:09,  1.64s/it]

{'loss': 0.5116, 'grad_norm': 0.2567111551761627, 'learning_rate': 0.00039467395015363606, 'epoch': 0.14}


  1%|▏         | 41/2930 [01:08<1:19:34,  1.65s/it]

{'loss': 0.4745, 'grad_norm': 0.20711593329906464, 'learning_rate': 0.0003945373847729601, 'epoch': 0.14}


  1%|▏         | 42/2930 [01:10<1:19:44,  1.66s/it]

{'loss': 0.4883, 'grad_norm': 0.18716032803058624, 'learning_rate': 0.0003944008193922841, 'epoch': 0.14}


  1%|▏         | 43/2930 [01:11<1:19:08,  1.64s/it]

{'loss': 0.4738, 'grad_norm': 0.20658157765865326, 'learning_rate': 0.00039426425401160804, 'epoch': 0.15}


  2%|▏         | 44/2930 [01:13<1:19:20,  1.65s/it]

{'loss': 0.4835, 'grad_norm': 0.19123712182044983, 'learning_rate': 0.00039412768863093204, 'epoch': 0.15}


  2%|▏         | 45/2930 [01:15<1:19:32,  1.65s/it]

{'loss': 0.4783, 'grad_norm': 0.1860593855381012, 'learning_rate': 0.0003939911232502561, 'epoch': 0.15}


  2%|▏         | 46/2930 [01:16<1:19:07,  1.65s/it]

{'loss': 0.4701, 'grad_norm': 0.16982242465019226, 'learning_rate': 0.0003938545578695801, 'epoch': 0.16}


  2%|▏         | 47/2930 [01:18<1:19:26,  1.65s/it]

{'loss': 0.4279, 'grad_norm': 0.19591237604618073, 'learning_rate': 0.0003937179924889041, 'epoch': 0.16}


  2%|▏         | 48/2930 [01:20<1:19:36,  1.66s/it]

{'loss': 0.4391, 'grad_norm': 0.2108498215675354, 'learning_rate': 0.00039358142710822807, 'epoch': 0.16}


  2%|▏         | 49/2930 [01:21<1:19:45,  1.66s/it]

{'loss': 0.501, 'grad_norm': 0.1974157989025116, 'learning_rate': 0.00039344486172755206, 'epoch': 0.17}


  2%|▏         | 50/2930 [01:23<1:19:47,  1.66s/it]

{'loss': 0.4348, 'grad_norm': 0.1864587962627411, 'learning_rate': 0.00039330829634687606, 'epoch': 0.17}


  2%|▏         | 51/2930 [01:25<1:19:51,  1.66s/it]

{'loss': 0.481, 'grad_norm': 0.18650147318840027, 'learning_rate': 0.0003931717309662001, 'epoch': 0.17}


  2%|▏         | 52/2930 [01:26<1:19:50,  1.66s/it]

{'loss': 0.4605, 'grad_norm': 0.16282854974269867, 'learning_rate': 0.0003930351655855241, 'epoch': 0.18}


  2%|▏         | 53/2930 [01:28<1:19:34,  1.66s/it]

{'loss': 0.4911, 'grad_norm': 0.18192808330059052, 'learning_rate': 0.0003928986002048481, 'epoch': 0.18}


  2%|▏         | 54/2930 [01:30<1:19:26,  1.66s/it]

{'loss': 0.4521, 'grad_norm': 0.14408214390277863, 'learning_rate': 0.0003927620348241721, 'epoch': 0.18}


  2%|▏         | 55/2930 [01:31<1:19:35,  1.66s/it]

{'loss': 0.4622, 'grad_norm': 0.16893240809440613, 'learning_rate': 0.0003926254694434961, 'epoch': 0.19}


  2%|▏         | 56/2930 [01:33<1:19:30,  1.66s/it]

{'loss': 0.4969, 'grad_norm': 0.20898085832595825, 'learning_rate': 0.0003924889040628201, 'epoch': 0.19}


  2%|▏         | 57/2930 [01:35<1:19:26,  1.66s/it]

{'loss': 0.5239, 'grad_norm': 0.19083088636398315, 'learning_rate': 0.0003923523386821441, 'epoch': 0.19}


  2%|▏         | 58/2930 [01:36<1:19:22,  1.66s/it]

{'loss': 0.5189, 'grad_norm': 0.1833944469690323, 'learning_rate': 0.0003922157733014681, 'epoch': 0.2}


  2%|▏         | 59/2930 [01:38<1:19:14,  1.66s/it]

{'loss': 0.4416, 'grad_norm': 0.17402781546115875, 'learning_rate': 0.0003920792079207921, 'epoch': 0.2}


  2%|▏         | 60/2930 [01:40<1:18:49,  1.65s/it]

{'loss': 0.4058, 'grad_norm': 0.1983473300933838, 'learning_rate': 0.00039194264254011605, 'epoch': 0.2}


  2%|▏         | 61/2930 [01:41<1:19:00,  1.65s/it]

{'loss': 0.4304, 'grad_norm': 0.198753222823143, 'learning_rate': 0.0003918060771594401, 'epoch': 0.21}


  2%|▏         | 62/2930 [01:43<1:19:09,  1.66s/it]

{'loss': 0.4343, 'grad_norm': 0.19336481392383575, 'learning_rate': 0.0003916695117787641, 'epoch': 0.21}


  2%|▏         | 63/2930 [01:45<1:19:08,  1.66s/it]

{'loss': 0.4663, 'grad_norm': 0.16549089550971985, 'learning_rate': 0.0003915329463980881, 'epoch': 0.21}


  2%|▏         | 64/2930 [01:46<1:19:30,  1.66s/it]

{'loss': 0.4779, 'grad_norm': 0.18489280343055725, 'learning_rate': 0.00039139638101741214, 'epoch': 0.22}


  2%|▏         | 65/2930 [01:48<1:19:58,  1.67s/it]

{'loss': 0.4406, 'grad_norm': 0.17566567659378052, 'learning_rate': 0.0003912598156367361, 'epoch': 0.22}


  2%|▏         | 66/2930 [01:50<1:19:47,  1.67s/it]

{'loss': 0.432, 'grad_norm': 0.1636786162853241, 'learning_rate': 0.00039112325025606007, 'epoch': 0.23}


  2%|▏         | 67/2930 [01:51<1:20:06,  1.68s/it]

{'loss': 0.4013, 'grad_norm': 0.1571778953075409, 'learning_rate': 0.0003909866848753841, 'epoch': 0.23}


  2%|▏         | 68/2930 [01:53<1:20:15,  1.68s/it]

{'loss': 0.453, 'grad_norm': 0.1813317984342575, 'learning_rate': 0.0003908501194947081, 'epoch': 0.23}


  2%|▏         | 69/2930 [01:55<1:19:20,  1.66s/it]

{'loss': 0.4955, 'grad_norm': 0.15979111194610596, 'learning_rate': 0.0003907135541140321, 'epoch': 0.24}


  2%|▏         | 70/2930 [01:56<1:18:17,  1.64s/it]

{'loss': 0.3917, 'grad_norm': 0.18714825809001923, 'learning_rate': 0.00039057698873335616, 'epoch': 0.24}


  2%|▏         | 71/2930 [01:58<1:18:30,  1.65s/it]

{'loss': 0.4404, 'grad_norm': 0.16007860004901886, 'learning_rate': 0.0003904404233526801, 'epoch': 0.24}


  2%|▏         | 72/2930 [01:59<1:18:25,  1.65s/it]

{'loss': 0.4495, 'grad_norm': 0.18033036589622498, 'learning_rate': 0.0003903038579720041, 'epoch': 0.25}


  2%|▏         | 73/2930 [02:01<1:18:21,  1.65s/it]

{'loss': 0.4508, 'grad_norm': 0.18039646744728088, 'learning_rate': 0.00039016729259132814, 'epoch': 0.25}


  3%|▎         | 74/2930 [02:03<1:18:34,  1.65s/it]

{'loss': 0.4633, 'grad_norm': 0.18390505015850067, 'learning_rate': 0.00039003072721065213, 'epoch': 0.25}


  3%|▎         | 75/2930 [02:04<1:18:44,  1.65s/it]

{'loss': 0.4402, 'grad_norm': 0.21328343451023102, 'learning_rate': 0.00038989416182997613, 'epoch': 0.26}


  3%|▎         | 76/2930 [02:06<1:18:13,  1.64s/it]

{'loss': 0.4346, 'grad_norm': 0.20199717581272125, 'learning_rate': 0.0003897575964493001, 'epoch': 0.26}


  3%|▎         | 77/2930 [02:08<1:18:38,  1.65s/it]

{'loss': 0.4383, 'grad_norm': 0.21577580273151398, 'learning_rate': 0.0003896210310686241, 'epoch': 0.26}


  3%|▎         | 78/2930 [02:09<1:18:29,  1.65s/it]

{'loss': 0.4184, 'grad_norm': 0.17391496896743774, 'learning_rate': 0.0003894844656879481, 'epoch': 0.27}


  3%|▎         | 79/2930 [02:11<1:18:32,  1.65s/it]

{'loss': 0.5059, 'grad_norm': 0.17531253397464752, 'learning_rate': 0.0003893479003072721, 'epoch': 0.27}


  3%|▎         | 80/2930 [02:13<1:18:49,  1.66s/it]

{'loss': 0.4707, 'grad_norm': 0.25591975450515747, 'learning_rate': 0.00038921133492659615, 'epoch': 0.27}


  3%|▎         | 81/2930 [02:14<1:18:32,  1.65s/it]

{'loss': 0.4205, 'grad_norm': 0.16724248230457306, 'learning_rate': 0.00038907476954592015, 'epoch': 0.28}


  3%|▎         | 82/2930 [02:16<1:18:29,  1.65s/it]

{'loss': 0.4157, 'grad_norm': 0.18535126745700836, 'learning_rate': 0.0003889382041652441, 'epoch': 0.28}


  3%|▎         | 83/2930 [02:18<1:18:37,  1.66s/it]

{'loss': 0.4128, 'grad_norm': 0.1879180371761322, 'learning_rate': 0.00038880163878456814, 'epoch': 0.28}


  3%|▎         | 84/2930 [02:19<1:18:15,  1.65s/it]

{'loss': 0.4654, 'grad_norm': 0.22169995307922363, 'learning_rate': 0.00038866507340389213, 'epoch': 0.29}


  3%|▎         | 85/2930 [02:21<1:18:28,  1.66s/it]

{'loss': 0.4889, 'grad_norm': 0.22645032405853271, 'learning_rate': 0.0003885285080232161, 'epoch': 0.29}


  3%|▎         | 86/2930 [02:23<1:18:37,  1.66s/it]

{'loss': 0.467, 'grad_norm': 0.17365556955337524, 'learning_rate': 0.00038839194264254017, 'epoch': 0.29}


  3%|▎         | 87/2930 [02:24<1:18:34,  1.66s/it]

{'loss': 0.3815, 'grad_norm': 0.21194802224636078, 'learning_rate': 0.00038825537726186417, 'epoch': 0.3}


  3%|▎         | 88/2930 [02:26<1:18:30,  1.66s/it]

{'loss': 0.4285, 'grad_norm': 0.21270978450775146, 'learning_rate': 0.0003881188118811881, 'epoch': 0.3}


  3%|▎         | 89/2930 [02:28<1:20:57,  1.71s/it]

{'loss': 0.3947, 'grad_norm': 0.2029954493045807, 'learning_rate': 0.00038798224650051215, 'epoch': 0.3}


  3%|▎         | 90/2930 [02:29<1:20:06,  1.69s/it]

{'loss': 0.3947, 'grad_norm': 0.17495183646678925, 'learning_rate': 0.00038784568111983615, 'epoch': 0.31}


  3%|▎         | 91/2930 [02:31<1:19:39,  1.68s/it]

{'loss': 0.4331, 'grad_norm': 0.21961523592472076, 'learning_rate': 0.00038770911573916014, 'epoch': 0.31}


  3%|▎         | 92/2930 [02:33<1:19:16,  1.68s/it]

{'loss': 0.4189, 'grad_norm': 0.17954495549201965, 'learning_rate': 0.00038757255035848414, 'epoch': 0.31}


  3%|▎         | 93/2930 [02:34<1:18:59,  1.67s/it]

{'loss': 0.4617, 'grad_norm': 0.18917836248874664, 'learning_rate': 0.00038743598497780813, 'epoch': 0.32}


  3%|▎         | 94/2930 [02:36<1:18:32,  1.66s/it]

{'loss': 0.3694, 'grad_norm': 0.1711726039648056, 'learning_rate': 0.0003872994195971321, 'epoch': 0.32}


  3%|▎         | 95/2930 [02:38<1:18:31,  1.66s/it]

{'loss': 0.3997, 'grad_norm': 0.18672242760658264, 'learning_rate': 0.0003871628542164561, 'epoch': 0.32}


  3%|▎         | 96/2930 [02:39<1:17:57,  1.65s/it]

{'loss': 0.4099, 'grad_norm': 0.1815706193447113, 'learning_rate': 0.00038702628883578017, 'epoch': 0.33}


  3%|▎         | 97/2930 [02:41<1:18:02,  1.65s/it]

{'loss': 0.4231, 'grad_norm': 0.21121686697006226, 'learning_rate': 0.00038688972345510416, 'epoch': 0.33}


  3%|▎         | 98/2930 [02:43<1:18:00,  1.65s/it]

{'loss': 0.4756, 'grad_norm': 0.24120746552944183, 'learning_rate': 0.00038675315807442816, 'epoch': 0.33}


  3%|▎         | 99/2930 [02:44<1:17:53,  1.65s/it]

{'loss': 0.4366, 'grad_norm': 0.18905414640903473, 'learning_rate': 0.00038661659269375215, 'epoch': 0.34}


  3%|▎         | 100/2930 [02:46<1:17:42,  1.65s/it]

{'loss': 0.4429, 'grad_norm': 0.2789367735385895, 'learning_rate': 0.00038648002731307614, 'epoch': 0.34}


  3%|▎         | 101/2930 [02:48<1:17:48,  1.65s/it]

{'loss': 0.4259, 'grad_norm': 0.17393247783184052, 'learning_rate': 0.00038634346193240014, 'epoch': 0.34}


  3%|▎         | 102/2930 [02:49<1:17:21,  1.64s/it]

{'loss': 0.3994, 'grad_norm': 0.1679239571094513, 'learning_rate': 0.0003862068965517242, 'epoch': 0.35}


  4%|▎         | 103/2930 [02:51<1:17:32,  1.65s/it]

{'loss': 0.4222, 'grad_norm': 0.19263914227485657, 'learning_rate': 0.0003860703311710482, 'epoch': 0.35}


  4%|▎         | 104/2930 [02:53<1:17:34,  1.65s/it]

{'loss': 0.4206, 'grad_norm': 0.1893499195575714, 'learning_rate': 0.0003859337657903722, 'epoch': 0.35}


  4%|▎         | 105/2930 [02:54<1:17:42,  1.65s/it]

{'loss': 0.4698, 'grad_norm': 0.283657431602478, 'learning_rate': 0.00038579720040969617, 'epoch': 0.36}


  4%|▎         | 106/2930 [02:56<1:17:19,  1.64s/it]

{'loss': 0.4357, 'grad_norm': 0.17698688805103302, 'learning_rate': 0.00038566063502902016, 'epoch': 0.36}


  4%|▎         | 107/2930 [02:57<1:17:42,  1.65s/it]

{'loss': 0.3977, 'grad_norm': 0.20826728641986847, 'learning_rate': 0.00038552406964834416, 'epoch': 0.36}


  4%|▎         | 108/2930 [02:59<1:18:05,  1.66s/it]

{'loss': 0.4299, 'grad_norm': 0.1877846121788025, 'learning_rate': 0.00038538750426766815, 'epoch': 0.37}


  4%|▎         | 109/2930 [03:01<1:17:18,  1.64s/it]

{'loss': 0.411, 'grad_norm': 0.19921369850635529, 'learning_rate': 0.0003852509388869922, 'epoch': 0.37}


  4%|▍         | 110/2930 [03:02<1:17:05,  1.64s/it]

{'loss': 0.4259, 'grad_norm': 0.25177013874053955, 'learning_rate': 0.00038511437350631614, 'epoch': 0.38}


  4%|▍         | 111/2930 [03:04<1:17:01,  1.64s/it]

{'loss': 0.3981, 'grad_norm': 0.23400160670280457, 'learning_rate': 0.00038497780812564013, 'epoch': 0.38}


  4%|▍         | 112/2930 [03:06<1:17:16,  1.65s/it]

{'loss': 0.3497, 'grad_norm': 0.20995192229747772, 'learning_rate': 0.0003848412427449642, 'epoch': 0.38}


  4%|▍         | 113/2930 [03:07<1:17:39,  1.65s/it]

{'loss': 0.3957, 'grad_norm': 0.18908075988292694, 'learning_rate': 0.0003847046773642882, 'epoch': 0.39}


  4%|▍         | 114/2930 [03:09<1:17:48,  1.66s/it]

{'loss': 0.4369, 'grad_norm': 0.20198990404605865, 'learning_rate': 0.00038456811198361217, 'epoch': 0.39}


  4%|▍         | 115/2930 [03:11<1:17:58,  1.66s/it]

{'loss': 0.4417, 'grad_norm': 0.1864842027425766, 'learning_rate': 0.00038443154660293616, 'epoch': 0.39}


  4%|▍         | 116/2930 [03:12<1:18:02,  1.66s/it]

{'loss': 0.4059, 'grad_norm': 0.20353202521800995, 'learning_rate': 0.00038429498122226016, 'epoch': 0.4}


  4%|▍         | 117/2930 [03:14<1:18:09,  1.67s/it]

{'loss': 0.4505, 'grad_norm': 0.20620527863502502, 'learning_rate': 0.00038415841584158415, 'epoch': 0.4}


  4%|▍         | 118/2930 [03:16<1:18:18,  1.67s/it]

{'loss': 0.4064, 'grad_norm': 0.17421503365039825, 'learning_rate': 0.0003840218504609082, 'epoch': 0.4}


  4%|▍         | 119/2930 [03:17<1:18:13,  1.67s/it]

{'loss': 0.4032, 'grad_norm': 0.2056753933429718, 'learning_rate': 0.0003838852850802322, 'epoch': 0.41}


  4%|▍         | 120/2930 [03:19<1:18:21,  1.67s/it]

{'loss': 0.4334, 'grad_norm': 0.189042329788208, 'learning_rate': 0.0003837487196995562, 'epoch': 0.41}


  4%|▍         | 121/2930 [03:21<1:18:17,  1.67s/it]

{'loss': 0.3631, 'grad_norm': 0.19650782644748688, 'learning_rate': 0.0003836121543188802, 'epoch': 0.41}


  4%|▍         | 122/2930 [03:22<1:18:21,  1.67s/it]

{'loss': 0.3932, 'grad_norm': 0.18814721703529358, 'learning_rate': 0.0003834755889382042, 'epoch': 0.42}


  4%|▍         | 123/2930 [03:24<1:17:20,  1.65s/it]

{'loss': 0.4532, 'grad_norm': 0.17587293684482574, 'learning_rate': 0.00038333902355752817, 'epoch': 0.42}


  4%|▍         | 124/2930 [03:26<1:17:47,  1.66s/it]

{'loss': 0.4116, 'grad_norm': 0.16995179653167725, 'learning_rate': 0.0003832024581768522, 'epoch': 0.42}


  4%|▍         | 125/2930 [03:27<1:17:52,  1.67s/it]

{'loss': 0.395, 'grad_norm': 0.1828056275844574, 'learning_rate': 0.0003830658927961762, 'epoch': 0.43}


  4%|▍         | 126/2930 [03:29<1:17:57,  1.67s/it]

{'loss': 0.4075, 'grad_norm': 0.18312714993953705, 'learning_rate': 0.0003829293274155002, 'epoch': 0.43}


  4%|▍         | 127/2930 [03:31<1:17:05,  1.65s/it]

{'loss': 0.4385, 'grad_norm': 0.17621472477912903, 'learning_rate': 0.00038279276203482415, 'epoch': 0.43}


  4%|▍         | 128/2930 [03:32<1:17:12,  1.65s/it]

{'loss': 0.3993, 'grad_norm': 0.1863430142402649, 'learning_rate': 0.0003826561966541482, 'epoch': 0.44}


  4%|▍         | 129/2930 [03:34<1:16:59,  1.65s/it]

{'loss': 0.3839, 'grad_norm': 0.19096267223358154, 'learning_rate': 0.0003825196312734722, 'epoch': 0.44}


  4%|▍         | 130/2930 [03:36<1:17:05,  1.65s/it]

{'loss': 0.4004, 'grad_norm': 0.21727265417575836, 'learning_rate': 0.0003823830658927962, 'epoch': 0.44}


  4%|▍         | 131/2930 [03:37<1:17:18,  1.66s/it]

{'loss': 0.3918, 'grad_norm': 0.19263474643230438, 'learning_rate': 0.00038224650051212023, 'epoch': 0.45}


  5%|▍         | 132/2930 [03:39<1:17:14,  1.66s/it]

{'loss': 0.3859, 'grad_norm': 0.1866053193807602, 'learning_rate': 0.0003821099351314442, 'epoch': 0.45}


  5%|▍         | 133/2930 [03:41<1:17:00,  1.65s/it]

{'loss': 0.4111, 'grad_norm': 0.18332050740718842, 'learning_rate': 0.00038197336975076817, 'epoch': 0.45}


  5%|▍         | 134/2930 [03:42<1:16:58,  1.65s/it]

{'loss': 0.3605, 'grad_norm': 0.22084063291549683, 'learning_rate': 0.0003818368043700922, 'epoch': 0.46}


  5%|▍         | 135/2930 [03:44<1:17:09,  1.66s/it]

{'loss': 0.4081, 'grad_norm': 0.21410885453224182, 'learning_rate': 0.0003817002389894162, 'epoch': 0.46}


  5%|▍         | 136/2930 [03:46<1:16:54,  1.65s/it]

{'loss': 0.4057, 'grad_norm': 0.2135201394557953, 'learning_rate': 0.0003815636736087402, 'epoch': 0.46}


  5%|▍         | 137/2930 [03:47<1:16:48,  1.65s/it]

{'loss': 0.3873, 'grad_norm': 0.2088382989168167, 'learning_rate': 0.0003814271082280642, 'epoch': 0.47}


  5%|▍         | 138/2930 [03:49<1:17:12,  1.66s/it]

{'loss': 0.4035, 'grad_norm': 0.20540077984333038, 'learning_rate': 0.0003812905428473882, 'epoch': 0.47}


  5%|▍         | 139/2930 [03:51<1:17:06,  1.66s/it]

{'loss': 0.4043, 'grad_norm': 0.20161038637161255, 'learning_rate': 0.0003811539774667122, 'epoch': 0.47}


  5%|▍         | 140/2930 [03:52<1:16:37,  1.65s/it]

{'loss': 0.3702, 'grad_norm': 0.17132288217544556, 'learning_rate': 0.00038101741208603623, 'epoch': 0.48}


  5%|▍         | 141/2930 [03:54<1:16:37,  1.65s/it]

{'loss': 0.4161, 'grad_norm': 0.1865415871143341, 'learning_rate': 0.00038088084670536023, 'epoch': 0.48}


  5%|▍         | 142/2930 [03:55<1:16:47,  1.65s/it]

{'loss': 0.3748, 'grad_norm': 0.19205421209335327, 'learning_rate': 0.0003807442813246842, 'epoch': 0.48}


  5%|▍         | 143/2930 [03:57<1:16:58,  1.66s/it]

{'loss': 0.395, 'grad_norm': 0.17669232189655304, 'learning_rate': 0.0003806077159440082, 'epoch': 0.49}


  5%|▍         | 144/2930 [03:59<1:16:59,  1.66s/it]

{'loss': 0.37, 'grad_norm': 0.2023332417011261, 'learning_rate': 0.0003804711505633322, 'epoch': 0.49}


  5%|▍         | 145/2930 [04:00<1:17:12,  1.66s/it]

{'loss': 0.3901, 'grad_norm': 0.18515044450759888, 'learning_rate': 0.0003803345851826562, 'epoch': 0.49}


  5%|▍         | 146/2930 [04:02<1:17:23,  1.67s/it]

{'loss': 0.4218, 'grad_norm': 0.2399897426366806, 'learning_rate': 0.0003801980198019802, 'epoch': 0.5}


  5%|▌         | 147/2930 [04:04<1:17:23,  1.67s/it]

{'loss': 0.412, 'grad_norm': 0.2420056313276291, 'learning_rate': 0.00038006145442130425, 'epoch': 0.5}


  5%|▌         | 148/2930 [04:06<1:17:28,  1.67s/it]

{'loss': 0.364, 'grad_norm': 0.23440030217170715, 'learning_rate': 0.00037992488904062824, 'epoch': 0.5}


  5%|▌         | 149/2930 [04:07<1:16:55,  1.66s/it]

{'loss': 0.4022, 'grad_norm': 0.20118071138858795, 'learning_rate': 0.0003797883236599522, 'epoch': 0.51}


  5%|▌         | 150/2930 [04:09<1:17:02,  1.66s/it]

{'loss': 0.3944, 'grad_norm': 0.1706302911043167, 'learning_rate': 0.00037965175827927623, 'epoch': 0.51}


  5%|▌         | 151/2930 [04:10<1:17:09,  1.67s/it]

{'loss': 0.3301, 'grad_norm': 0.19102372229099274, 'learning_rate': 0.0003795151928986002, 'epoch': 0.51}


  5%|▌         | 152/2930 [04:12<1:17:11,  1.67s/it]

{'loss': 0.3984, 'grad_norm': 0.18222784996032715, 'learning_rate': 0.0003793786275179242, 'epoch': 0.52}


  5%|▌         | 153/2930 [04:14<1:17:08,  1.67s/it]

{'loss': 0.4145, 'grad_norm': 0.17706440389156342, 'learning_rate': 0.00037924206213724827, 'epoch': 0.52}


  5%|▌         | 154/2930 [04:15<1:17:20,  1.67s/it]

{'loss': 0.3456, 'grad_norm': 0.19225913286209106, 'learning_rate': 0.0003791054967565722, 'epoch': 0.53}


  5%|▌         | 155/2930 [04:17<1:17:16,  1.67s/it]

{'loss': 0.3633, 'grad_norm': 0.19951173663139343, 'learning_rate': 0.0003789689313758962, 'epoch': 0.53}


  5%|▌         | 156/2930 [04:19<1:16:58,  1.66s/it]

{'loss': 0.3773, 'grad_norm': 0.21073518693447113, 'learning_rate': 0.00037883236599522025, 'epoch': 0.53}


  5%|▌         | 157/2930 [04:20<1:16:18,  1.65s/it]

{'loss': 0.3456, 'grad_norm': 0.17023064196109772, 'learning_rate': 0.00037869580061454424, 'epoch': 0.54}


  5%|▌         | 158/2930 [04:22<1:16:20,  1.65s/it]

{'loss': 0.3543, 'grad_norm': 0.20859971642494202, 'learning_rate': 0.00037855923523386824, 'epoch': 0.54}


  5%|▌         | 159/2930 [04:24<1:16:25,  1.65s/it]

{'loss': 0.3414, 'grad_norm': 0.23133733868598938, 'learning_rate': 0.00037842266985319223, 'epoch': 0.54}


  5%|▌         | 160/2930 [04:25<1:16:36,  1.66s/it]

{'loss': 0.3705, 'grad_norm': 0.1911781132221222, 'learning_rate': 0.0003782861044725162, 'epoch': 0.55}


  5%|▌         | 161/2930 [04:27<1:16:28,  1.66s/it]

{'loss': 0.3987, 'grad_norm': 0.21870696544647217, 'learning_rate': 0.0003781495390918402, 'epoch': 0.55}


  6%|▌         | 162/2930 [04:29<1:16:24,  1.66s/it]

{'loss': 0.364, 'grad_norm': 0.16268710792064667, 'learning_rate': 0.0003780129737111642, 'epoch': 0.55}


  6%|▌         | 163/2930 [04:30<1:16:30,  1.66s/it]

{'loss': 0.3685, 'grad_norm': 0.21182574331760406, 'learning_rate': 0.00037787640833048826, 'epoch': 0.56}


  6%|▌         | 164/2930 [04:32<1:16:19,  1.66s/it]

{'loss': 0.3888, 'grad_norm': 0.19948598742485046, 'learning_rate': 0.00037773984294981226, 'epoch': 0.56}


  6%|▌         | 165/2930 [04:34<1:16:26,  1.66s/it]

{'loss': 0.3684, 'grad_norm': 0.16624271869659424, 'learning_rate': 0.00037760327756913625, 'epoch': 0.56}


  6%|▌         | 166/2930 [04:35<1:16:24,  1.66s/it]

{'loss': 0.4106, 'grad_norm': 0.168888121843338, 'learning_rate': 0.00037746671218846025, 'epoch': 0.57}


  6%|▌         | 167/2930 [04:37<1:15:49,  1.65s/it]

{'loss': 0.4125, 'grad_norm': 0.1666525900363922, 'learning_rate': 0.00037733014680778424, 'epoch': 0.57}


  6%|▌         | 168/2930 [04:39<1:15:38,  1.64s/it]

{'loss': 0.3576, 'grad_norm': 0.1948082447052002, 'learning_rate': 0.00037719358142710823, 'epoch': 0.57}


  6%|▌         | 169/2930 [04:40<1:15:32,  1.64s/it]

{'loss': 0.369, 'grad_norm': 0.19242322444915771, 'learning_rate': 0.0003770570160464323, 'epoch': 0.58}


  6%|▌         | 170/2930 [04:42<1:15:45,  1.65s/it]

{'loss': 0.4216, 'grad_norm': 0.1987505406141281, 'learning_rate': 0.0003769204506657563, 'epoch': 0.58}


  6%|▌         | 171/2930 [04:44<1:15:58,  1.65s/it]

{'loss': 0.4162, 'grad_norm': 0.2443125993013382, 'learning_rate': 0.0003767838852850802, 'epoch': 0.58}


  6%|▌         | 172/2930 [04:45<1:16:29,  1.66s/it]

{'loss': 0.3304, 'grad_norm': 0.1850782334804535, 'learning_rate': 0.00037664731990440426, 'epoch': 0.59}


  6%|▌         | 173/2930 [04:47<1:16:32,  1.67s/it]

{'loss': 0.3673, 'grad_norm': 0.1918804794549942, 'learning_rate': 0.00037651075452372826, 'epoch': 0.59}


  6%|▌         | 174/2930 [04:49<1:17:05,  1.68s/it]

{'loss': 0.4014, 'grad_norm': 0.22750459611415863, 'learning_rate': 0.00037637418914305225, 'epoch': 0.59}


  6%|▌         | 175/2930 [04:50<1:16:26,  1.66s/it]

{'loss': 0.3891, 'grad_norm': 0.17416325211524963, 'learning_rate': 0.00037623762376237625, 'epoch': 0.6}


  6%|▌         | 176/2930 [04:52<1:16:09,  1.66s/it]

{'loss': 0.3864, 'grad_norm': 0.16330155730247498, 'learning_rate': 0.00037610105838170024, 'epoch': 0.6}


  6%|▌         | 177/2930 [04:54<1:16:22,  1.66s/it]

{'loss': 0.3783, 'grad_norm': 0.18479983508586884, 'learning_rate': 0.00037596449300102423, 'epoch': 0.6}


  6%|▌         | 178/2930 [04:55<1:17:16,  1.68s/it]

{'loss': 0.3807, 'grad_norm': 0.19715462625026703, 'learning_rate': 0.00037582792762034823, 'epoch': 0.61}


  6%|▌         | 179/2930 [04:57<1:17:12,  1.68s/it]

{'loss': 0.3633, 'grad_norm': 0.1924590915441513, 'learning_rate': 0.0003756913622396723, 'epoch': 0.61}


  6%|▌         | 180/2930 [04:59<1:17:22,  1.69s/it]

{'loss': 0.3715, 'grad_norm': 0.20747852325439453, 'learning_rate': 0.00037555479685899627, 'epoch': 0.61}


  6%|▌         | 181/2930 [05:00<1:17:08,  1.68s/it]

{'loss': 0.4609, 'grad_norm': 0.2685392200946808, 'learning_rate': 0.00037541823147832027, 'epoch': 0.62}


  6%|▌         | 182/2930 [05:02<1:16:51,  1.68s/it]

{'loss': 0.4046, 'grad_norm': 0.20092441141605377, 'learning_rate': 0.00037528166609764426, 'epoch': 0.62}


  6%|▌         | 183/2930 [05:04<1:16:08,  1.66s/it]

{'loss': 0.4211, 'grad_norm': 0.19370871782302856, 'learning_rate': 0.00037514510071696825, 'epoch': 0.62}


  6%|▋         | 184/2930 [05:05<1:16:31,  1.67s/it]

{'loss': 0.3857, 'grad_norm': 0.17117629945278168, 'learning_rate': 0.00037500853533629225, 'epoch': 0.63}


  6%|▋         | 185/2930 [05:07<1:17:03,  1.68s/it]

{'loss': 0.3542, 'grad_norm': 0.1688281148672104, 'learning_rate': 0.0003748719699556163, 'epoch': 0.63}


  6%|▋         | 186/2930 [05:09<1:16:44,  1.68s/it]

{'loss': 0.4098, 'grad_norm': 0.18875424563884735, 'learning_rate': 0.0003747354045749403, 'epoch': 0.63}


  6%|▋         | 187/2930 [05:11<1:18:19,  1.71s/it]

{'loss': 0.3278, 'grad_norm': 0.18127739429473877, 'learning_rate': 0.0003745988391942643, 'epoch': 0.64}


  6%|▋         | 188/2930 [05:12<1:18:13,  1.71s/it]

{'loss': 0.3473, 'grad_norm': 0.17721326649188995, 'learning_rate': 0.0003744622738135883, 'epoch': 0.64}


  6%|▋         | 189/2930 [05:14<1:18:28,  1.72s/it]

{'loss': 0.3639, 'grad_norm': 0.1865159273147583, 'learning_rate': 0.0003743257084329123, 'epoch': 0.64}


  6%|▋         | 190/2930 [05:16<1:18:53,  1.73s/it]

{'loss': 0.3781, 'grad_norm': 0.17829829454421997, 'learning_rate': 0.00037418914305223627, 'epoch': 0.65}


  7%|▋         | 191/2930 [05:17<1:18:18,  1.72s/it]

{'loss': 0.3628, 'grad_norm': 0.18134117126464844, 'learning_rate': 0.00037405257767156026, 'epoch': 0.65}


  7%|▋         | 192/2930 [05:19<1:17:32,  1.70s/it]

{'loss': 0.4108, 'grad_norm': 0.19378823041915894, 'learning_rate': 0.0003739160122908843, 'epoch': 0.65}


  7%|▋         | 193/2930 [05:21<1:16:57,  1.69s/it]

{'loss': 0.4174, 'grad_norm': 0.18632633984088898, 'learning_rate': 0.00037377944691020825, 'epoch': 0.66}


  7%|▋         | 194/2930 [05:22<1:16:03,  1.67s/it]

{'loss': 0.3919, 'grad_norm': 0.22484908998012543, 'learning_rate': 0.00037364288152953224, 'epoch': 0.66}


  7%|▋         | 195/2930 [05:24<1:16:52,  1.69s/it]

{'loss': 0.3811, 'grad_norm': 0.17331703007221222, 'learning_rate': 0.0003735063161488563, 'epoch': 0.66}


  7%|▋         | 196/2930 [05:26<1:17:57,  1.71s/it]

{'loss': 0.382, 'grad_norm': 0.16882792115211487, 'learning_rate': 0.0003733697507681803, 'epoch': 0.67}


  7%|▋         | 197/2930 [05:28<1:18:36,  1.73s/it]

{'loss': 0.392, 'grad_norm': 0.16708339750766754, 'learning_rate': 0.0003732331853875043, 'epoch': 0.67}


  7%|▋         | 198/2930 [05:29<1:17:45,  1.71s/it]

{'loss': 0.3711, 'grad_norm': 0.17704758048057556, 'learning_rate': 0.0003730966200068283, 'epoch': 0.68}


  7%|▋         | 199/2930 [05:31<1:17:36,  1.70s/it]

{'loss': 0.3362, 'grad_norm': 0.16114240884780884, 'learning_rate': 0.00037296005462615227, 'epoch': 0.68}


  7%|▋         | 200/2930 [05:33<1:17:45,  1.71s/it]

{'loss': 0.3806, 'grad_norm': 0.20022638142108917, 'learning_rate': 0.00037282348924547626, 'epoch': 0.68}


  7%|▋         | 201/2930 [05:34<1:18:02,  1.72s/it]

{'loss': 0.3319, 'grad_norm': 0.18744675815105438, 'learning_rate': 0.0003726869238648003, 'epoch': 0.69}


  7%|▋         | 202/2930 [05:36<1:18:22,  1.72s/it]

{'loss': 0.3529, 'grad_norm': 0.21558323502540588, 'learning_rate': 0.0003725503584841243, 'epoch': 0.69}


  7%|▋         | 203/2930 [05:38<1:18:21,  1.72s/it]

{'loss': 0.39, 'grad_norm': 0.21583403646945953, 'learning_rate': 0.0003724137931034483, 'epoch': 0.69}


  7%|▋         | 204/2930 [05:40<1:17:31,  1.71s/it]

{'loss': 0.351, 'grad_norm': 0.19908298552036285, 'learning_rate': 0.0003722772277227723, 'epoch': 0.7}


  7%|▋         | 205/2930 [05:41<1:16:16,  1.68s/it]

{'loss': 0.4298, 'grad_norm': 0.1802712231874466, 'learning_rate': 0.0003721406623420963, 'epoch': 0.7}


  7%|▋         | 206/2930 [05:43<1:16:08,  1.68s/it]

{'loss': 0.3494, 'grad_norm': 0.19505298137664795, 'learning_rate': 0.0003720040969614203, 'epoch': 0.7}


  7%|▋         | 207/2930 [05:45<1:15:37,  1.67s/it]

{'loss': 0.3365, 'grad_norm': 0.16028131544589996, 'learning_rate': 0.0003718675315807443, 'epoch': 0.71}


  7%|▋         | 208/2930 [05:46<1:15:21,  1.66s/it]

{'loss': 0.3766, 'grad_norm': 0.17946474254131317, 'learning_rate': 0.0003717309662000683, 'epoch': 0.71}


  7%|▋         | 209/2930 [05:48<1:14:47,  1.65s/it]

{'loss': 0.3829, 'grad_norm': 0.1911803036928177, 'learning_rate': 0.0003715944008193923, 'epoch': 0.71}


  7%|▋         | 210/2930 [05:49<1:14:33,  1.64s/it]

{'loss': 0.3937, 'grad_norm': 0.19979660212993622, 'learning_rate': 0.00037145783543871626, 'epoch': 0.72}


  7%|▋         | 211/2930 [05:51<1:14:38,  1.65s/it]

{'loss': 0.3842, 'grad_norm': 0.17798380553722382, 'learning_rate': 0.0003713212700580403, 'epoch': 0.72}


  7%|▋         | 212/2930 [05:53<1:14:43,  1.65s/it]

{'loss': 0.375, 'grad_norm': 0.19575877487659454, 'learning_rate': 0.0003711847046773643, 'epoch': 0.72}


  7%|▋         | 213/2930 [05:54<1:14:48,  1.65s/it]

{'loss': 0.3749, 'grad_norm': 0.19981318712234497, 'learning_rate': 0.0003710481392966883, 'epoch': 0.73}


  7%|▋         | 214/2930 [05:56<1:14:53,  1.65s/it]

{'loss': 0.343, 'grad_norm': 0.1978575438261032, 'learning_rate': 0.00037091157391601234, 'epoch': 0.73}


  7%|▋         | 215/2930 [05:58<1:15:17,  1.66s/it]

{'loss': 0.3663, 'grad_norm': 0.16956236958503723, 'learning_rate': 0.0003707750085353363, 'epoch': 0.73}


  7%|▋         | 216/2930 [05:59<1:15:33,  1.67s/it]

{'loss': 0.3659, 'grad_norm': 0.19567793607711792, 'learning_rate': 0.0003706384431546603, 'epoch': 0.74}


  7%|▋         | 217/2930 [06:01<1:15:38,  1.67s/it]

{'loss': 0.3589, 'grad_norm': 0.18070311844348907, 'learning_rate': 0.0003705018777739843, 'epoch': 0.74}


  7%|▋         | 218/2930 [06:03<1:15:46,  1.68s/it]

{'loss': 0.3523, 'grad_norm': 0.17055091261863708, 'learning_rate': 0.0003703653123933083, 'epoch': 0.74}


  7%|▋         | 219/2930 [06:04<1:15:10,  1.66s/it]

{'loss': 0.3442, 'grad_norm': 0.18255577981472015, 'learning_rate': 0.0003702287470126323, 'epoch': 0.75}


  8%|▊         | 220/2930 [06:06<1:14:59,  1.66s/it]

{'loss': 0.3675, 'grad_norm': 0.2007906138896942, 'learning_rate': 0.00037009218163195636, 'epoch': 0.75}


  8%|▊         | 221/2930 [06:08<1:14:33,  1.65s/it]

{'loss': 0.3583, 'grad_norm': 0.2507295608520508, 'learning_rate': 0.0003699556162512803, 'epoch': 0.75}


  8%|▊         | 222/2930 [06:09<1:13:46,  1.63s/it]

{'loss': 0.3814, 'grad_norm': 0.18266668915748596, 'learning_rate': 0.0003698190508706043, 'epoch': 0.76}


  8%|▊         | 223/2930 [06:11<1:14:08,  1.64s/it]

{'loss': 0.3498, 'grad_norm': 0.18681594729423523, 'learning_rate': 0.00036968248548992834, 'epoch': 0.76}


  8%|▊         | 224/2930 [06:13<1:14:33,  1.65s/it]

{'loss': 0.3385, 'grad_norm': 0.18366739153862, 'learning_rate': 0.00036954592010925234, 'epoch': 0.76}


  8%|▊         | 225/2930 [06:14<1:14:30,  1.65s/it]

{'loss': 0.3846, 'grad_norm': 0.17412066459655762, 'learning_rate': 0.00036940935472857633, 'epoch': 0.77}


  8%|▊         | 226/2930 [06:16<1:14:34,  1.65s/it]

{'loss': 0.376, 'grad_norm': 0.1903606355190277, 'learning_rate': 0.00036927278934790033, 'epoch': 0.77}


  8%|▊         | 227/2930 [06:18<1:14:20,  1.65s/it]

{'loss': 0.3759, 'grad_norm': 0.1933138221502304, 'learning_rate': 0.0003691362239672243, 'epoch': 0.77}


  8%|▊         | 228/2930 [06:19<1:14:41,  1.66s/it]

{'loss': 0.4055, 'grad_norm': 0.22036446630954742, 'learning_rate': 0.0003689996585865483, 'epoch': 0.78}


  8%|▊         | 229/2930 [06:21<1:14:50,  1.66s/it]

{'loss': 0.409, 'grad_norm': 0.2663576900959015, 'learning_rate': 0.0003688630932058723, 'epoch': 0.78}


  8%|▊         | 230/2930 [06:23<1:14:30,  1.66s/it]

{'loss': 0.3664, 'grad_norm': 0.21347099542617798, 'learning_rate': 0.00036872652782519636, 'epoch': 0.78}


  8%|▊         | 231/2930 [06:24<1:13:58,  1.64s/it]

{'loss': 0.3758, 'grad_norm': 0.1852015256881714, 'learning_rate': 0.00036858996244452035, 'epoch': 0.79}


  8%|▊         | 232/2930 [06:26<1:14:32,  1.66s/it]

{'loss': 0.3577, 'grad_norm': 0.19410254061222076, 'learning_rate': 0.0003684533970638443, 'epoch': 0.79}


  8%|▊         | 233/2930 [06:28<1:14:29,  1.66s/it]

{'loss': 0.3637, 'grad_norm': 0.1948726326227188, 'learning_rate': 0.00036831683168316834, 'epoch': 0.79}


  8%|▊         | 234/2930 [06:29<1:14:21,  1.65s/it]

{'loss': 0.35, 'grad_norm': 0.17496037483215332, 'learning_rate': 0.00036818026630249233, 'epoch': 0.8}


  8%|▊         | 235/2930 [06:31<1:14:37,  1.66s/it]

{'loss': 0.3865, 'grad_norm': 0.18115025758743286, 'learning_rate': 0.00036804370092181633, 'epoch': 0.8}


  8%|▊         | 236/2930 [06:32<1:14:18,  1.65s/it]

{'loss': 0.364, 'grad_norm': 0.20185108482837677, 'learning_rate': 0.0003679071355411404, 'epoch': 0.8}


  8%|▊         | 237/2930 [06:34<1:13:55,  1.65s/it]

{'loss': 0.3618, 'grad_norm': 0.1938481330871582, 'learning_rate': 0.00036777057016046437, 'epoch': 0.81}


  8%|▊         | 238/2930 [06:36<1:14:05,  1.65s/it]

{'loss': 0.3573, 'grad_norm': 0.20931850373744965, 'learning_rate': 0.0003676340047797883, 'epoch': 0.81}


  8%|▊         | 239/2930 [06:37<1:14:47,  1.67s/it]

{'loss': 0.4016, 'grad_norm': 0.18935449421405792, 'learning_rate': 0.00036749743939911236, 'epoch': 0.82}


  8%|▊         | 240/2930 [06:39<1:14:40,  1.67s/it]

{'loss': 0.398, 'grad_norm': 0.2249639332294464, 'learning_rate': 0.00036736087401843635, 'epoch': 0.82}


  8%|▊         | 241/2930 [06:41<1:14:46,  1.67s/it]

{'loss': 0.3921, 'grad_norm': 0.1690995693206787, 'learning_rate': 0.00036722430863776035, 'epoch': 0.82}


  8%|▊         | 242/2930 [06:43<1:14:56,  1.67s/it]

{'loss': 0.3927, 'grad_norm': 0.19712598621845245, 'learning_rate': 0.00036708774325708434, 'epoch': 0.83}


  8%|▊         | 243/2930 [06:44<1:14:42,  1.67s/it]

{'loss': 0.3432, 'grad_norm': 0.19995993375778198, 'learning_rate': 0.00036695117787640834, 'epoch': 0.83}


  8%|▊         | 244/2930 [06:46<1:15:46,  1.69s/it]

{'loss': 0.3823, 'grad_norm': 0.20136070251464844, 'learning_rate': 0.00036681461249573233, 'epoch': 0.83}


  8%|▊         | 245/2930 [06:48<1:15:36,  1.69s/it]

{'loss': 0.351, 'grad_norm': 0.1755421757698059, 'learning_rate': 0.0003666780471150563, 'epoch': 0.84}


  8%|▊         | 246/2930 [06:49<1:15:23,  1.69s/it]

{'loss': 0.4405, 'grad_norm': 0.23074768483638763, 'learning_rate': 0.00036654148173438037, 'epoch': 0.84}


  8%|▊         | 247/2930 [06:51<1:15:00,  1.68s/it]

{'loss': 0.3548, 'grad_norm': 0.17130251228809357, 'learning_rate': 0.00036640491635370437, 'epoch': 0.84}


  8%|▊         | 248/2930 [06:53<1:14:55,  1.68s/it]

{'loss': 0.395, 'grad_norm': 0.20049941539764404, 'learning_rate': 0.00036626835097302836, 'epoch': 0.85}


  8%|▊         | 249/2930 [06:54<1:14:18,  1.66s/it]

{'loss': 0.351, 'grad_norm': 0.17307546734809875, 'learning_rate': 0.00036613178559235236, 'epoch': 0.85}


  9%|▊         | 250/2930 [06:56<1:14:32,  1.67s/it]

{'loss': 0.3732, 'grad_norm': 0.18956033885478973, 'learning_rate': 0.00036599522021167635, 'epoch': 0.85}


  9%|▊         | 251/2930 [06:58<1:15:52,  1.70s/it]

{'loss': 0.37, 'grad_norm': 0.19508998095989227, 'learning_rate': 0.00036585865483100034, 'epoch': 0.86}


  9%|▊         | 252/2930 [06:59<1:15:20,  1.69s/it]

{'loss': 0.3617, 'grad_norm': 0.1894523650407791, 'learning_rate': 0.0003657220894503244, 'epoch': 0.86}


  9%|▊         | 253/2930 [07:01<1:14:46,  1.68s/it]

{'loss': 0.3227, 'grad_norm': 0.18485723435878754, 'learning_rate': 0.0003655855240696484, 'epoch': 0.86}


  9%|▊         | 254/2930 [07:03<1:13:14,  1.64s/it]

{'loss': 0.3582, 'grad_norm': 0.19283626973628998, 'learning_rate': 0.0003654489586889724, 'epoch': 0.87}


  9%|▊         | 255/2930 [07:04<1:12:32,  1.63s/it]

{'loss': 0.3563, 'grad_norm': 0.18002986907958984, 'learning_rate': 0.0003653123933082964, 'epoch': 0.87}


  9%|▊         | 256/2930 [07:06<1:11:52,  1.61s/it]

{'loss': 0.3242, 'grad_norm': 0.1734979748725891, 'learning_rate': 0.00036517582792762037, 'epoch': 0.87}


  9%|▉         | 257/2930 [07:07<1:11:20,  1.60s/it]

{'loss': 0.3241, 'grad_norm': 0.19178953766822815, 'learning_rate': 0.00036503926254694436, 'epoch': 0.88}


  9%|▉         | 258/2930 [07:09<1:10:48,  1.59s/it]

{'loss': 0.3719, 'grad_norm': 0.20438987016677856, 'learning_rate': 0.00036490269716626836, 'epoch': 0.88}


  9%|▉         | 259/2930 [07:10<1:10:23,  1.58s/it]

{'loss': 0.3667, 'grad_norm': 0.2484029084444046, 'learning_rate': 0.0003647661317855924, 'epoch': 0.88}


  9%|▉         | 260/2930 [07:12<1:10:04,  1.57s/it]

{'loss': 0.3318, 'grad_norm': 0.18747936189174652, 'learning_rate': 0.00036462956640491634, 'epoch': 0.89}


  9%|▉         | 261/2930 [07:14<1:09:51,  1.57s/it]

{'loss': 0.3915, 'grad_norm': 0.19476965069770813, 'learning_rate': 0.00036449300102424034, 'epoch': 0.89}


  9%|▉         | 262/2930 [07:15<1:09:34,  1.56s/it]

{'loss': 0.3976, 'grad_norm': 0.22352959215641022, 'learning_rate': 0.0003643564356435644, 'epoch': 0.89}


  9%|▉         | 263/2930 [07:17<1:09:20,  1.56s/it]

{'loss': 0.293, 'grad_norm': 0.16797375679016113, 'learning_rate': 0.0003642198702628884, 'epoch': 0.9}


  9%|▉         | 264/2930 [07:18<1:09:46,  1.57s/it]

{'loss': 0.3432, 'grad_norm': 0.1924566775560379, 'learning_rate': 0.0003640833048822124, 'epoch': 0.9}


  9%|▉         | 265/2930 [07:20<1:10:12,  1.58s/it]

{'loss': 0.352, 'grad_norm': 0.19407260417938232, 'learning_rate': 0.00036394673950153637, 'epoch': 0.9}


  9%|▉         | 266/2930 [07:21<1:09:58,  1.58s/it]

{'loss': 0.3711, 'grad_norm': 0.18697986006736755, 'learning_rate': 0.00036381017412086036, 'epoch': 0.91}


  9%|▉         | 267/2930 [07:23<1:09:16,  1.56s/it]

{'loss': 0.3941, 'grad_norm': 0.19984957575798035, 'learning_rate': 0.00036367360874018436, 'epoch': 0.91}


  9%|▉         | 268/2930 [07:24<1:09:05,  1.56s/it]

{'loss': 0.397, 'grad_norm': 0.25614839792251587, 'learning_rate': 0.0003635370433595084, 'epoch': 0.91}


  9%|▉         | 269/2930 [07:26<1:09:06,  1.56s/it]

{'loss': 0.3398, 'grad_norm': 0.17615966498851776, 'learning_rate': 0.0003634004779788324, 'epoch': 0.92}


  9%|▉         | 270/2930 [07:28<1:09:14,  1.56s/it]

{'loss': 0.3268, 'grad_norm': 0.16232623159885406, 'learning_rate': 0.0003632639125981564, 'epoch': 0.92}


  9%|▉         | 271/2930 [07:29<1:09:16,  1.56s/it]

{'loss': 0.3425, 'grad_norm': 0.17790842056274414, 'learning_rate': 0.0003631273472174804, 'epoch': 0.92}


  9%|▉         | 272/2930 [07:31<1:09:21,  1.57s/it]

{'loss': 0.332, 'grad_norm': 0.1584029346704483, 'learning_rate': 0.0003629907818368044, 'epoch': 0.93}


  9%|▉         | 273/2930 [07:32<1:09:30,  1.57s/it]

{'loss': 0.3941, 'grad_norm': 0.20034293830394745, 'learning_rate': 0.0003628542164561284, 'epoch': 0.93}


  9%|▉         | 274/2930 [07:34<1:09:28,  1.57s/it]

{'loss': 0.3626, 'grad_norm': 0.20284485816955566, 'learning_rate': 0.00036271765107545237, 'epoch': 0.93}


  9%|▉         | 275/2930 [07:35<1:09:10,  1.56s/it]

{'loss': 0.3115, 'grad_norm': 0.18261189758777618, 'learning_rate': 0.0003625810856947764, 'epoch': 0.94}


  9%|▉         | 276/2930 [07:37<1:09:06,  1.56s/it]

{'loss': 0.3353, 'grad_norm': 0.2061653584241867, 'learning_rate': 0.0003624445203141004, 'epoch': 0.94}


  9%|▉         | 277/2930 [07:39<1:09:37,  1.57s/it]

{'loss': 0.3147, 'grad_norm': 0.16379022598266602, 'learning_rate': 0.00036230795493342435, 'epoch': 0.94}


  9%|▉         | 278/2930 [07:40<1:09:48,  1.58s/it]

{'loss': 0.3326, 'grad_norm': 0.18098846077919006, 'learning_rate': 0.0003621713895527484, 'epoch': 0.95}


 10%|▉         | 279/2930 [07:42<1:09:37,  1.58s/it]

{'loss': 0.3147, 'grad_norm': 0.17694753408432007, 'learning_rate': 0.0003620348241720724, 'epoch': 0.95}


 10%|▉         | 280/2930 [07:43<1:09:28,  1.57s/it]

{'loss': 0.3502, 'grad_norm': 0.21474365890026093, 'learning_rate': 0.0003618982587913964, 'epoch': 0.95}


 10%|▉         | 281/2930 [07:45<1:09:28,  1.57s/it]

{'loss': 0.3297, 'grad_norm': 0.1938004344701767, 'learning_rate': 0.00036176169341072044, 'epoch': 0.96}


 10%|▉         | 282/2930 [07:46<1:09:15,  1.57s/it]

{'loss': 0.36, 'grad_norm': 0.1779995560646057, 'learning_rate': 0.0003616251280300444, 'epoch': 0.96}


 10%|▉         | 283/2930 [07:48<1:09:26,  1.57s/it]

{'loss': 0.357, 'grad_norm': 0.2481638640165329, 'learning_rate': 0.00036148856264936837, 'epoch': 0.97}


 10%|▉         | 284/2930 [07:50<1:09:15,  1.57s/it]

{'loss': 0.3478, 'grad_norm': 0.19048528373241425, 'learning_rate': 0.0003613519972686924, 'epoch': 0.97}


 10%|▉         | 285/2930 [07:51<1:09:47,  1.58s/it]

{'loss': 0.3242, 'grad_norm': 0.2661489248275757, 'learning_rate': 0.0003612154318880164, 'epoch': 0.97}


 10%|▉         | 286/2930 [07:53<1:09:58,  1.59s/it]

{'loss': 0.3272, 'grad_norm': 0.18586614727973938, 'learning_rate': 0.0003610788665073404, 'epoch': 0.98}


 10%|▉         | 287/2930 [07:54<1:09:29,  1.58s/it]

{'loss': 0.3109, 'grad_norm': 0.252966046333313, 'learning_rate': 0.0003609423011266644, 'epoch': 0.98}


 10%|▉         | 288/2930 [07:56<1:09:12,  1.57s/it]

{'loss': 0.4313, 'grad_norm': 0.25187063217163086, 'learning_rate': 0.0003608057357459884, 'epoch': 0.98}


 10%|▉         | 289/2930 [07:58<1:09:01,  1.57s/it]

{'loss': 0.3286, 'grad_norm': 0.2117471545934677, 'learning_rate': 0.0003606691703653124, 'epoch': 0.99}


 10%|▉         | 290/2930 [07:59<1:09:18,  1.58s/it]

{'loss': 0.3466, 'grad_norm': 0.22412477433681488, 'learning_rate': 0.0003605326049846364, 'epoch': 0.99}


 10%|▉         | 291/2930 [08:01<1:09:36,  1.58s/it]

{'loss': 0.3204, 'grad_norm': 0.16692133247852325, 'learning_rate': 0.00036039603960396043, 'epoch': 0.99}


 10%|▉         | 292/2930 [08:02<1:09:39,  1.58s/it]

{'loss': 0.36, 'grad_norm': 0.18807296454906464, 'learning_rate': 0.00036025947422328443, 'epoch': 1.0}


 10%|█         | 293/2930 [08:04<1:09:52,  1.59s/it]

{'loss': 0.4177, 'grad_norm': 0.21419252455234528, 'learning_rate': 0.0003601229088426084, 'epoch': 1.0}


 10%|█         | 294/2930 [08:06<1:10:08,  1.60s/it]

{'loss': 0.3243, 'grad_norm': 0.1749790906906128, 'learning_rate': 0.0003599863434619324, 'epoch': 1.0}


 10%|█         | 295/2930 [08:07<1:10:14,  1.60s/it]

{'loss': 0.3404, 'grad_norm': 0.18493682146072388, 'learning_rate': 0.0003598497780812564, 'epoch': 1.01}


 10%|█         | 296/2930 [08:09<1:09:51,  1.59s/it]

{'loss': 0.3183, 'grad_norm': 0.21522802114486694, 'learning_rate': 0.0003597132127005804, 'epoch': 1.01}


 10%|█         | 297/2930 [08:10<1:09:09,  1.58s/it]

{'loss': 0.3477, 'grad_norm': 0.24882280826568604, 'learning_rate': 0.00035957664731990445, 'epoch': 1.01}


 10%|█         | 298/2930 [08:12<1:09:16,  1.58s/it]

{'loss': 0.3123, 'grad_norm': 0.21226254105567932, 'learning_rate': 0.00035944008193922845, 'epoch': 1.02}


 10%|█         | 299/2930 [08:13<1:09:27,  1.58s/it]

{'loss': 0.3092, 'grad_norm': 0.23622505366802216, 'learning_rate': 0.0003593035165585524, 'epoch': 1.02}


 10%|█         | 300/2930 [08:15<1:09:38,  1.59s/it]

{'loss': 0.3609, 'grad_norm': 0.24488712847232819, 'learning_rate': 0.00035916695117787644, 'epoch': 1.02}


 10%|█         | 301/2930 [08:17<1:09:47,  1.59s/it]

{'loss': 0.3381, 'grad_norm': 0.22958829998970032, 'learning_rate': 0.00035903038579720043, 'epoch': 1.03}


 10%|█         | 302/2930 [08:18<1:10:03,  1.60s/it]

{'loss': 0.3533, 'grad_norm': 0.19518715143203735, 'learning_rate': 0.0003588938204165244, 'epoch': 1.03}


 10%|█         | 303/2930 [08:20<1:10:18,  1.61s/it]

{'loss': 0.3366, 'grad_norm': 0.19504888355731964, 'learning_rate': 0.00035875725503584847, 'epoch': 1.03}


 10%|█         | 304/2930 [08:21<1:10:14,  1.60s/it]

{'loss': 0.3084, 'grad_norm': 0.197079136967659, 'learning_rate': 0.0003586206896551724, 'epoch': 1.04}


 10%|█         | 305/2930 [08:23<1:10:13,  1.61s/it]

{'loss': 0.3312, 'grad_norm': 0.1875862032175064, 'learning_rate': 0.0003584841242744964, 'epoch': 1.04}


 10%|█         | 306/2930 [08:25<1:09:20,  1.59s/it]

{'loss': 0.3024, 'grad_norm': 0.17808176577091217, 'learning_rate': 0.0003583475588938204, 'epoch': 1.04}


 10%|█         | 307/2930 [08:26<1:09:12,  1.58s/it]

{'loss': 0.3211, 'grad_norm': 0.2030431032180786, 'learning_rate': 0.00035821099351314445, 'epoch': 1.05}


 11%|█         | 308/2930 [08:28<1:09:17,  1.59s/it]

{'loss': 0.3362, 'grad_norm': 0.21580669283866882, 'learning_rate': 0.00035807442813246844, 'epoch': 1.05}


 11%|█         | 309/2930 [08:29<1:09:20,  1.59s/it]

{'loss': 0.3744, 'grad_norm': 0.22665748000144958, 'learning_rate': 0.00035793786275179244, 'epoch': 1.05}


 11%|█         | 310/2930 [08:31<1:09:35,  1.59s/it]

{'loss': 0.3431, 'grad_norm': 0.26513639092445374, 'learning_rate': 0.00035780129737111643, 'epoch': 1.06}


 11%|█         | 311/2930 [08:33<1:09:23,  1.59s/it]

{'loss': 0.3648, 'grad_norm': 0.2653314769268036, 'learning_rate': 0.0003576647319904404, 'epoch': 1.06}


 11%|█         | 312/2930 [08:34<1:08:54,  1.58s/it]

{'loss': 0.3355, 'grad_norm': 0.2110290825366974, 'learning_rate': 0.0003575281666097644, 'epoch': 1.06}


 11%|█         | 313/2930 [08:36<1:08:40,  1.57s/it]

{'loss': 0.3231, 'grad_norm': 0.21307623386383057, 'learning_rate': 0.00035739160122908847, 'epoch': 1.07}


 11%|█         | 314/2930 [08:37<1:08:48,  1.58s/it]

{'loss': 0.3216, 'grad_norm': 0.1949010044336319, 'learning_rate': 0.00035725503584841246, 'epoch': 1.07}


 11%|█         | 315/2930 [08:39<1:08:49,  1.58s/it]

{'loss': 0.3268, 'grad_norm': 0.18773755431175232, 'learning_rate': 0.00035711847046773646, 'epoch': 1.07}


 11%|█         | 316/2930 [08:40<1:08:43,  1.58s/it]

{'loss': 0.3092, 'grad_norm': 0.21050895750522614, 'learning_rate': 0.00035698190508706045, 'epoch': 1.08}


 11%|█         | 317/2930 [08:42<1:08:54,  1.58s/it]

{'loss': 0.3042, 'grad_norm': 0.20176264643669128, 'learning_rate': 0.00035684533970638444, 'epoch': 1.08}


 11%|█         | 318/2930 [08:44<1:09:04,  1.59s/it]

{'loss': 0.3286, 'grad_norm': 0.22255906462669373, 'learning_rate': 0.00035670877432570844, 'epoch': 1.08}


 11%|█         | 319/2930 [08:45<1:08:52,  1.58s/it]

{'loss': 0.3196, 'grad_norm': 0.24059513211250305, 'learning_rate': 0.0003565722089450325, 'epoch': 1.09}


 11%|█         | 320/2930 [08:47<1:09:06,  1.59s/it]

{'loss': 0.3136, 'grad_norm': 0.21301156282424927, 'learning_rate': 0.0003564356435643565, 'epoch': 1.09}


 11%|█         | 321/2930 [08:48<1:09:09,  1.59s/it]

{'loss': 0.3965, 'grad_norm': 0.29737719893455505, 'learning_rate': 0.0003562990781836804, 'epoch': 1.09}


 11%|█         | 322/2930 [08:50<1:08:26,  1.57s/it]

{'loss': 0.3584, 'grad_norm': 0.2317172735929489, 'learning_rate': 0.0003561625128030044, 'epoch': 1.1}


 11%|█         | 323/2930 [08:51<1:08:05,  1.57s/it]

{'loss': 0.3545, 'grad_norm': 0.23438389599323273, 'learning_rate': 0.00035602594742232846, 'epoch': 1.1}


 11%|█         | 324/2930 [08:53<1:08:04,  1.57s/it]

{'loss': 0.3382, 'grad_norm': 0.20273594558238983, 'learning_rate': 0.00035588938204165246, 'epoch': 1.1}


 11%|█         | 325/2930 [08:55<1:07:48,  1.56s/it]

{'loss': 0.3547, 'grad_norm': 0.24612386524677277, 'learning_rate': 0.00035575281666097645, 'epoch': 1.11}


 11%|█         | 326/2930 [08:56<1:08:08,  1.57s/it]

{'loss': 0.365, 'grad_norm': 0.21637634932994843, 'learning_rate': 0.00035561625128030045, 'epoch': 1.11}


 11%|█         | 327/2930 [08:58<1:08:09,  1.57s/it]

{'loss': 0.3382, 'grad_norm': 0.19678835570812225, 'learning_rate': 0.00035547968589962444, 'epoch': 1.12}


 11%|█         | 328/2930 [08:59<1:07:37,  1.56s/it]

{'loss': 0.3723, 'grad_norm': 0.24516811966896057, 'learning_rate': 0.00035534312051894843, 'epoch': 1.12}


 11%|█         | 329/2930 [09:01<1:07:57,  1.57s/it]

{'loss': 0.3422, 'grad_norm': 0.18380539119243622, 'learning_rate': 0.0003552065551382725, 'epoch': 1.12}


 11%|█▏        | 330/2930 [09:02<1:08:13,  1.57s/it]

{'loss': 0.3239, 'grad_norm': 0.19929836690425873, 'learning_rate': 0.0003550699897575965, 'epoch': 1.13}


 11%|█▏        | 331/2930 [09:04<1:08:13,  1.57s/it]

{'loss': 0.2937, 'grad_norm': 0.18697336316108704, 'learning_rate': 0.00035493342437692047, 'epoch': 1.13}


 11%|█▏        | 332/2930 [09:06<1:08:20,  1.58s/it]

{'loss': 0.3093, 'grad_norm': 0.17347358167171478, 'learning_rate': 0.00035479685899624447, 'epoch': 1.13}


 11%|█▏        | 333/2930 [09:07<1:08:14,  1.58s/it]

{'loss': 0.3532, 'grad_norm': 0.19448332488536835, 'learning_rate': 0.00035466029361556846, 'epoch': 1.14}


 11%|█▏        | 334/2930 [09:09<1:07:48,  1.57s/it]

{'loss': 0.3364, 'grad_norm': 0.19622744619846344, 'learning_rate': 0.00035452372823489245, 'epoch': 1.14}


 11%|█▏        | 335/2930 [09:10<1:07:47,  1.57s/it]

{'loss': 0.332, 'grad_norm': 0.20212899148464203, 'learning_rate': 0.0003543871628542165, 'epoch': 1.14}


 11%|█▏        | 336/2930 [09:12<1:07:44,  1.57s/it]

{'loss': 0.278, 'grad_norm': 0.17936906218528748, 'learning_rate': 0.0003542505974735405, 'epoch': 1.15}


 12%|█▏        | 337/2930 [09:13<1:07:28,  1.56s/it]

{'loss': 0.3577, 'grad_norm': 0.23737987875938416, 'learning_rate': 0.0003541140320928645, 'epoch': 1.15}


 12%|█▏        | 338/2930 [09:15<1:07:16,  1.56s/it]

{'loss': 0.3918, 'grad_norm': 0.24131621420383453, 'learning_rate': 0.0003539774667121885, 'epoch': 1.15}


 12%|█▏        | 339/2930 [09:16<1:06:26,  1.54s/it]

{'loss': 0.3512, 'grad_norm': 0.1995277851819992, 'learning_rate': 0.0003538409013315125, 'epoch': 1.16}


 12%|█▏        | 340/2930 [09:18<1:07:02,  1.55s/it]

{'loss': 0.2991, 'grad_norm': 0.1794193983078003, 'learning_rate': 0.00035370433595083647, 'epoch': 1.16}


 12%|█▏        | 341/2930 [09:20<1:07:22,  1.56s/it]

{'loss': 0.319, 'grad_norm': 0.19475747644901276, 'learning_rate': 0.00035356777057016047, 'epoch': 1.16}


 12%|█▏        | 342/2930 [09:21<1:06:35,  1.54s/it]

{'loss': 0.3302, 'grad_norm': 0.20989356935024261, 'learning_rate': 0.0003534312051894845, 'epoch': 1.17}


 12%|█▏        | 343/2930 [09:23<1:06:32,  1.54s/it]

{'loss': 0.334, 'grad_norm': 0.23322360217571259, 'learning_rate': 0.00035329463980880845, 'epoch': 1.17}


 12%|█▏        | 344/2930 [09:24<1:06:53,  1.55s/it]

{'loss': 0.3185, 'grad_norm': 0.22068044543266296, 'learning_rate': 0.00035315807442813245, 'epoch': 1.17}


 12%|█▏        | 345/2930 [09:26<1:06:55,  1.55s/it]

{'loss': 0.3536, 'grad_norm': 0.23216703534126282, 'learning_rate': 0.0003530215090474565, 'epoch': 1.18}


 12%|█▏        | 346/2930 [09:27<1:06:58,  1.56s/it]

{'loss': 0.4075, 'grad_norm': 0.2448900192975998, 'learning_rate': 0.0003528849436667805, 'epoch': 1.18}


 12%|█▏        | 347/2930 [09:29<1:07:16,  1.56s/it]

{'loss': 0.3363, 'grad_norm': 0.20112290978431702, 'learning_rate': 0.0003527483782861045, 'epoch': 1.18}


 12%|█▏        | 348/2930 [09:30<1:07:07,  1.56s/it]

{'loss': 0.311, 'grad_norm': 0.21474134922027588, 'learning_rate': 0.00035261181290542853, 'epoch': 1.19}


 12%|█▏        | 349/2930 [09:32<1:07:07,  1.56s/it]

{'loss': 0.3653, 'grad_norm': 0.20913651585578918, 'learning_rate': 0.0003524752475247525, 'epoch': 1.19}


 12%|█▏        | 350/2930 [09:34<1:07:27,  1.57s/it]

{'loss': 0.2868, 'grad_norm': 0.19747164845466614, 'learning_rate': 0.00035233868214407647, 'epoch': 1.19}


 12%|█▏        | 351/2930 [09:35<1:07:08,  1.56s/it]

{'loss': 0.3268, 'grad_norm': 0.23086903989315033, 'learning_rate': 0.0003522021167634005, 'epoch': 1.2}


 12%|█▏        | 352/2930 [09:37<1:06:48,  1.55s/it]

{'loss': 0.3174, 'grad_norm': 0.21606570482254028, 'learning_rate': 0.0003520655513827245, 'epoch': 1.2}


 12%|█▏        | 353/2930 [09:38<1:06:41,  1.55s/it]

{'loss': 0.3217, 'grad_norm': 0.20444978773593903, 'learning_rate': 0.0003519289860020485, 'epoch': 1.2}


 12%|█▏        | 354/2930 [09:40<1:06:38,  1.55s/it]

{'loss': 0.3473, 'grad_norm': 0.2465686947107315, 'learning_rate': 0.0003517924206213725, 'epoch': 1.21}


 12%|█▏        | 355/2930 [09:41<1:06:29,  1.55s/it]

{'loss': 0.2873, 'grad_norm': 0.22836992144584656, 'learning_rate': 0.0003516558552406965, 'epoch': 1.21}


 12%|█▏        | 356/2930 [09:43<1:06:34,  1.55s/it]

{'loss': 0.3336, 'grad_norm': 0.2218930572271347, 'learning_rate': 0.0003515192898600205, 'epoch': 1.21}


 12%|█▏        | 357/2930 [09:44<1:06:34,  1.55s/it]

{'loss': 0.3234, 'grad_norm': 0.2022252380847931, 'learning_rate': 0.0003513827244793445, 'epoch': 1.22}


 12%|█▏        | 358/2930 [09:46<1:06:15,  1.55s/it]

{'loss': 0.3148, 'grad_norm': 0.23240654170513153, 'learning_rate': 0.00035124615909866853, 'epoch': 1.22}


 12%|█▏        | 359/2930 [09:48<1:06:23,  1.55s/it]

{'loss': 0.3216, 'grad_norm': 0.22239798307418823, 'learning_rate': 0.0003511095937179925, 'epoch': 1.22}


 12%|█▏        | 360/2930 [09:49<1:06:34,  1.55s/it]

{'loss': 0.3409, 'grad_norm': 0.2224113494157791, 'learning_rate': 0.00035097302833731646, 'epoch': 1.23}


 12%|█▏        | 361/2930 [09:51<1:06:39,  1.56s/it]

{'loss': 0.3137, 'grad_norm': 0.19898109138011932, 'learning_rate': 0.0003508364629566405, 'epoch': 1.23}


 12%|█▏        | 362/2930 [09:52<1:06:50,  1.56s/it]

{'loss': 0.334, 'grad_norm': 0.20596817135810852, 'learning_rate': 0.0003506998975759645, 'epoch': 1.23}


 12%|█▏        | 363/2930 [09:54<1:07:07,  1.57s/it]

{'loss': 0.3147, 'grad_norm': 0.18799731135368347, 'learning_rate': 0.0003505633321952885, 'epoch': 1.24}


 12%|█▏        | 364/2930 [09:55<1:07:15,  1.57s/it]

{'loss': 0.2958, 'grad_norm': 0.18455596268177032, 'learning_rate': 0.00035042676681461255, 'epoch': 1.24}


 12%|█▏        | 365/2930 [09:57<1:07:31,  1.58s/it]

{'loss': 0.3683, 'grad_norm': 0.2187097817659378, 'learning_rate': 0.00035029020143393654, 'epoch': 1.24}


 12%|█▏        | 366/2930 [09:59<1:07:30,  1.58s/it]

{'loss': 0.3344, 'grad_norm': 0.2159241884946823, 'learning_rate': 0.0003501536360532605, 'epoch': 1.25}


 13%|█▎        | 367/2930 [10:00<1:06:59,  1.57s/it]

{'loss': 0.3263, 'grad_norm': 0.20640219748020172, 'learning_rate': 0.00035001707067258453, 'epoch': 1.25}


 13%|█▎        | 368/2930 [10:02<1:06:53,  1.57s/it]

{'loss': 0.3229, 'grad_norm': 0.21508823335170746, 'learning_rate': 0.0003498805052919085, 'epoch': 1.25}


 13%|█▎        | 369/2930 [10:03<1:06:19,  1.55s/it]

{'loss': 0.3392, 'grad_norm': 0.24084541201591492, 'learning_rate': 0.0003497439399112325, 'epoch': 1.26}


 13%|█▎        | 370/2930 [10:05<1:06:28,  1.56s/it]

{'loss': 0.3675, 'grad_norm': 0.23376256227493286, 'learning_rate': 0.00034960737453055657, 'epoch': 1.26}


 13%|█▎        | 371/2930 [10:06<1:06:42,  1.56s/it]

{'loss': 0.3022, 'grad_norm': 0.20889417827129364, 'learning_rate': 0.0003494708091498805, 'epoch': 1.27}


 13%|█▎        | 372/2930 [10:08<1:07:07,  1.57s/it]

{'loss': 0.31, 'grad_norm': 0.1714739352464676, 'learning_rate': 0.0003493342437692045, 'epoch': 1.27}


 13%|█▎        | 373/2930 [10:10<1:06:35,  1.56s/it]

{'loss': 0.3031, 'grad_norm': 0.19280219078063965, 'learning_rate': 0.0003491976783885285, 'epoch': 1.27}


 13%|█▎        | 374/2930 [10:11<1:06:57,  1.57s/it]

{'loss': 0.3435, 'grad_norm': 0.19822730123996735, 'learning_rate': 0.00034906111300785254, 'epoch': 1.28}


 13%|█▎        | 375/2930 [10:13<1:07:03,  1.57s/it]

{'loss': 0.3515, 'grad_norm': 0.21925927698612213, 'learning_rate': 0.00034892454762717654, 'epoch': 1.28}


 13%|█▎        | 376/2930 [10:14<1:06:40,  1.57s/it]

{'loss': 0.3401, 'grad_norm': 0.21244312822818756, 'learning_rate': 0.00034878798224650053, 'epoch': 1.28}


 13%|█▎        | 377/2930 [10:16<1:06:43,  1.57s/it]

{'loss': 0.3356, 'grad_norm': 0.23074650764465332, 'learning_rate': 0.0003486514168658245, 'epoch': 1.29}


 13%|█▎        | 378/2930 [10:17<1:05:54,  1.55s/it]

{'loss': 0.3319, 'grad_norm': 0.22382019460201263, 'learning_rate': 0.0003485148514851485, 'epoch': 1.29}


 13%|█▎        | 379/2930 [10:19<1:06:14,  1.56s/it]

{'loss': 0.3348, 'grad_norm': 0.21527686715126038, 'learning_rate': 0.0003483782861044725, 'epoch': 1.29}


 13%|█▎        | 380/2930 [10:20<1:06:27,  1.56s/it]

{'loss': 0.3042, 'grad_norm': 0.21716850996017456, 'learning_rate': 0.00034824172072379656, 'epoch': 1.3}


 13%|█▎        | 381/2930 [10:22<1:05:54,  1.55s/it]

{'loss': 0.3617, 'grad_norm': 0.26965954899787903, 'learning_rate': 0.00034810515534312056, 'epoch': 1.3}


 13%|█▎        | 382/2930 [10:24<1:06:02,  1.56s/it]

{'loss': 0.3115, 'grad_norm': 0.2109793722629547, 'learning_rate': 0.0003479685899624445, 'epoch': 1.3}


 13%|█▎        | 383/2930 [10:25<1:06:16,  1.56s/it]

{'loss': 0.3562, 'grad_norm': 0.23406080901622772, 'learning_rate': 0.00034783202458176855, 'epoch': 1.31}


 13%|█▎        | 384/2930 [10:27<1:06:14,  1.56s/it]

{'loss': 0.3125, 'grad_norm': 0.1885213404893875, 'learning_rate': 0.00034769545920109254, 'epoch': 1.31}


 13%|█▎        | 385/2930 [10:28<1:06:17,  1.56s/it]

{'loss': 0.3176, 'grad_norm': 0.20651307702064514, 'learning_rate': 0.00034755889382041653, 'epoch': 1.31}


 13%|█▎        | 386/2930 [10:30<1:06:25,  1.57s/it]

{'loss': 0.3181, 'grad_norm': 0.20523285865783691, 'learning_rate': 0.0003474223284397406, 'epoch': 1.32}


 13%|█▎        | 387/2930 [10:31<1:06:35,  1.57s/it]

{'loss': 0.3556, 'grad_norm': 0.20894725620746613, 'learning_rate': 0.0003472857630590646, 'epoch': 1.32}


 13%|█▎        | 388/2930 [10:33<1:06:31,  1.57s/it]

{'loss': 0.3165, 'grad_norm': 0.18450474739074707, 'learning_rate': 0.0003471491976783885, 'epoch': 1.32}


 13%|█▎        | 389/2930 [10:35<1:06:23,  1.57s/it]

{'loss': 0.3318, 'grad_norm': 0.23721225559711456, 'learning_rate': 0.0003470126322977125, 'epoch': 1.33}


 13%|█▎        | 390/2930 [10:36<1:06:11,  1.56s/it]

{'loss': 0.279, 'grad_norm': 0.1889929473400116, 'learning_rate': 0.00034687606691703656, 'epoch': 1.33}


 13%|█▎        | 391/2930 [10:38<1:06:22,  1.57s/it]

{'loss': 0.332, 'grad_norm': 0.21711833775043488, 'learning_rate': 0.00034673950153636055, 'epoch': 1.33}


 13%|█▎        | 392/2930 [10:39<1:05:58,  1.56s/it]

{'loss': 0.3613, 'grad_norm': 0.24192307889461517, 'learning_rate': 0.00034660293615568455, 'epoch': 1.34}


 13%|█▎        | 393/2930 [10:41<1:05:47,  1.56s/it]

{'loss': 0.3123, 'grad_norm': 0.26290762424468994, 'learning_rate': 0.00034646637077500854, 'epoch': 1.34}


 13%|█▎        | 394/2930 [10:42<1:05:59,  1.56s/it]

{'loss': 0.3161, 'grad_norm': 0.22642219066619873, 'learning_rate': 0.00034632980539433254, 'epoch': 1.34}


 13%|█▎        | 395/2930 [10:44<1:05:15,  1.54s/it]

{'loss': 0.3138, 'grad_norm': 0.2525891661643982, 'learning_rate': 0.00034619324001365653, 'epoch': 1.35}


 14%|█▎        | 396/2930 [10:45<1:05:09,  1.54s/it]

{'loss': 0.3353, 'grad_norm': 0.21142880618572235, 'learning_rate': 0.0003460566746329806, 'epoch': 1.35}


 14%|█▎        | 397/2930 [10:47<1:05:01,  1.54s/it]

{'loss': 0.3014, 'grad_norm': 0.18987533450126648, 'learning_rate': 0.00034592010925230457, 'epoch': 1.35}


 14%|█▎        | 398/2930 [10:48<1:05:29,  1.55s/it]

{'loss': 0.3026, 'grad_norm': 0.18728755414485931, 'learning_rate': 0.00034578354387162857, 'epoch': 1.36}


 14%|█▎        | 399/2930 [10:50<1:05:37,  1.56s/it]

{'loss': 0.3904, 'grad_norm': 0.21900956332683563, 'learning_rate': 0.00034564697849095256, 'epoch': 1.36}


 14%|█▎        | 400/2930 [10:52<1:06:03,  1.57s/it]

{'loss': 0.3508, 'grad_norm': 0.24834366142749786, 'learning_rate': 0.00034551041311027655, 'epoch': 1.36}


 14%|█▎        | 401/2930 [10:53<1:06:33,  1.58s/it]

{'loss': 0.3082, 'grad_norm': 0.20587611198425293, 'learning_rate': 0.00034537384772960055, 'epoch': 1.37}


 14%|█▎        | 402/2930 [10:55<1:06:23,  1.58s/it]

{'loss': 0.3478, 'grad_norm': 0.23933079838752747, 'learning_rate': 0.0003452372823489246, 'epoch': 1.37}


 14%|█▍        | 403/2930 [10:56<1:06:13,  1.57s/it]

{'loss': 0.2832, 'grad_norm': 0.18723377585411072, 'learning_rate': 0.0003451007169682486, 'epoch': 1.37}


 14%|█▍        | 404/2930 [10:58<1:06:10,  1.57s/it]

{'loss': 0.337, 'grad_norm': 0.21344821155071259, 'learning_rate': 0.0003449641515875726, 'epoch': 1.38}


 14%|█▍        | 405/2930 [11:00<1:06:00,  1.57s/it]

{'loss': 0.3173, 'grad_norm': 0.22146761417388916, 'learning_rate': 0.0003448275862068965, 'epoch': 1.38}


 14%|█▍        | 406/2930 [11:01<1:05:46,  1.56s/it]

{'loss': 0.3551, 'grad_norm': 0.21174845099449158, 'learning_rate': 0.0003446910208262206, 'epoch': 1.38}


 14%|█▍        | 407/2930 [11:03<1:05:59,  1.57s/it]

{'loss': 0.3679, 'grad_norm': 0.24281994998455048, 'learning_rate': 0.00034455445544554457, 'epoch': 1.39}


 14%|█▍        | 408/2930 [11:04<1:06:16,  1.58s/it]

{'loss': 0.3787, 'grad_norm': 0.21000364422798157, 'learning_rate': 0.00034441789006486856, 'epoch': 1.39}


 14%|█▍        | 409/2930 [11:06<1:06:30,  1.58s/it]

{'loss': 0.3194, 'grad_norm': 0.18209049105644226, 'learning_rate': 0.0003442813246841926, 'epoch': 1.39}


 14%|█▍        | 410/2930 [11:07<1:06:18,  1.58s/it]

{'loss': 0.3729, 'grad_norm': 0.23496213555335999, 'learning_rate': 0.00034414475930351655, 'epoch': 1.4}


 14%|█▍        | 411/2930 [11:09<1:06:15,  1.58s/it]

{'loss': 0.344, 'grad_norm': 0.1980510652065277, 'learning_rate': 0.00034400819392284054, 'epoch': 1.4}


 14%|█▍        | 412/2930 [11:11<1:06:11,  1.58s/it]

{'loss': 0.3242, 'grad_norm': 0.20585277676582336, 'learning_rate': 0.0003438716285421646, 'epoch': 1.4}


 14%|█▍        | 413/2930 [11:12<1:05:51,  1.57s/it]

{'loss': 0.3205, 'grad_norm': 0.2242691069841385, 'learning_rate': 0.0003437350631614886, 'epoch': 1.41}


 14%|█▍        | 414/2930 [11:14<1:05:58,  1.57s/it]

{'loss': 0.3585, 'grad_norm': 0.23390519618988037, 'learning_rate': 0.0003435984977808126, 'epoch': 1.41}


 14%|█▍        | 415/2930 [11:15<1:06:03,  1.58s/it]

{'loss': 0.3268, 'grad_norm': 0.2117650955915451, 'learning_rate': 0.0003434619324001366, 'epoch': 1.42}


 14%|█▍        | 416/2930 [11:17<1:06:22,  1.58s/it]

{'loss': 0.3127, 'grad_norm': 0.21375177800655365, 'learning_rate': 0.00034332536701946057, 'epoch': 1.42}


 14%|█▍        | 417/2930 [11:19<1:06:49,  1.60s/it]

{'loss': 0.3235, 'grad_norm': 0.20376719534397125, 'learning_rate': 0.00034318880163878456, 'epoch': 1.42}


 14%|█▍        | 418/2930 [11:20<1:06:31,  1.59s/it]

{'loss': 0.3004, 'grad_norm': 0.20974870026111603, 'learning_rate': 0.0003430522362581086, 'epoch': 1.43}


 14%|█▍        | 419/2930 [11:22<1:06:40,  1.59s/it]

{'loss': 0.3068, 'grad_norm': 0.20580217242240906, 'learning_rate': 0.0003429156708774326, 'epoch': 1.43}


 14%|█▍        | 420/2930 [11:23<1:06:15,  1.58s/it]

{'loss': 0.3041, 'grad_norm': 0.2302500307559967, 'learning_rate': 0.0003427791054967566, 'epoch': 1.43}


 14%|█▍        | 421/2930 [11:25<1:05:52,  1.58s/it]

{'loss': 0.3224, 'grad_norm': 0.22961857914924622, 'learning_rate': 0.0003426425401160806, 'epoch': 1.44}


 14%|█▍        | 422/2930 [11:26<1:05:53,  1.58s/it]

{'loss': 0.2925, 'grad_norm': 0.2238655835390091, 'learning_rate': 0.0003425059747354046, 'epoch': 1.44}


 14%|█▍        | 423/2930 [11:28<1:05:48,  1.57s/it]

{'loss': 0.3017, 'grad_norm': 0.2443244457244873, 'learning_rate': 0.0003423694093547286, 'epoch': 1.44}


 14%|█▍        | 424/2930 [11:30<1:05:57,  1.58s/it]

{'loss': 0.3128, 'grad_norm': 0.21487785875797272, 'learning_rate': 0.0003422328439740526, 'epoch': 1.45}


 15%|█▍        | 425/2930 [11:31<1:05:44,  1.57s/it]

{'loss': 0.3173, 'grad_norm': 0.2441122829914093, 'learning_rate': 0.0003420962785933766, 'epoch': 1.45}


 15%|█▍        | 426/2930 [11:33<1:05:28,  1.57s/it]

{'loss': 0.3416, 'grad_norm': 0.24611538648605347, 'learning_rate': 0.0003419597132127006, 'epoch': 1.45}


 15%|█▍        | 427/2930 [11:34<1:05:54,  1.58s/it]

{'loss': 0.3812, 'grad_norm': 0.22902363538742065, 'learning_rate': 0.00034182314783202456, 'epoch': 1.46}


 15%|█▍        | 428/2930 [11:36<1:06:00,  1.58s/it]

{'loss': 0.312, 'grad_norm': 0.23213110864162445, 'learning_rate': 0.0003416865824513486, 'epoch': 1.46}


 15%|█▍        | 429/2930 [11:37<1:05:40,  1.58s/it]

{'loss': 0.2927, 'grad_norm': 0.20174476504325867, 'learning_rate': 0.0003415500170706726, 'epoch': 1.46}


 15%|█▍        | 430/2930 [11:39<1:05:48,  1.58s/it]

{'loss': 0.3195, 'grad_norm': 0.20617976784706116, 'learning_rate': 0.0003414134516899966, 'epoch': 1.47}


 15%|█▍        | 431/2930 [11:41<1:05:51,  1.58s/it]

{'loss': 0.3053, 'grad_norm': 0.21369288861751556, 'learning_rate': 0.00034127688630932064, 'epoch': 1.47}


 15%|█▍        | 432/2930 [11:42<1:05:37,  1.58s/it]

{'loss': 0.3164, 'grad_norm': 0.19960172474384308, 'learning_rate': 0.0003411403209286446, 'epoch': 1.47}


 15%|█▍        | 433/2930 [11:44<1:05:42,  1.58s/it]

{'loss': 0.2988, 'grad_norm': 0.1865237057209015, 'learning_rate': 0.0003410037555479686, 'epoch': 1.48}


 15%|█▍        | 434/2930 [11:45<1:05:14,  1.57s/it]

{'loss': 0.3183, 'grad_norm': 0.2161438763141632, 'learning_rate': 0.0003408671901672926, 'epoch': 1.48}


 15%|█▍        | 435/2930 [11:47<1:05:34,  1.58s/it]

{'loss': 0.2683, 'grad_norm': 0.20188391208648682, 'learning_rate': 0.0003407306247866166, 'epoch': 1.48}


 15%|█▍        | 436/2930 [11:48<1:05:17,  1.57s/it]

{'loss': 0.311, 'grad_norm': 0.22893966734409332, 'learning_rate': 0.0003405940594059406, 'epoch': 1.49}


 15%|█▍        | 437/2930 [11:50<1:05:12,  1.57s/it]

{'loss': 0.2843, 'grad_norm': 0.24851582944393158, 'learning_rate': 0.0003404574940252646, 'epoch': 1.49}


 15%|█▍        | 438/2930 [11:52<1:05:33,  1.58s/it]

{'loss': 0.345, 'grad_norm': 0.2697020173072815, 'learning_rate': 0.0003403209286445886, 'epoch': 1.49}


 15%|█▍        | 439/2930 [11:53<1:05:38,  1.58s/it]

{'loss': 0.3698, 'grad_norm': 0.2721976935863495, 'learning_rate': 0.0003401843632639126, 'epoch': 1.5}


 15%|█▌        | 440/2930 [11:55<1:05:28,  1.58s/it]

{'loss': 0.3229, 'grad_norm': 0.23578594624996185, 'learning_rate': 0.0003400477978832366, 'epoch': 1.5}


 15%|█▌        | 441/2930 [11:56<1:05:25,  1.58s/it]

{'loss': 0.3193, 'grad_norm': 0.19231589138507843, 'learning_rate': 0.00033991123250256064, 'epoch': 1.5}


 15%|█▌        | 442/2930 [11:58<1:05:37,  1.58s/it]

{'loss': 0.3128, 'grad_norm': 0.2158489227294922, 'learning_rate': 0.00033977466712188463, 'epoch': 1.51}


 15%|█▌        | 443/2930 [12:00<1:05:38,  1.58s/it]

{'loss': 0.3262, 'grad_norm': 0.2037849724292755, 'learning_rate': 0.00033963810174120863, 'epoch': 1.51}


 15%|█▌        | 444/2930 [12:01<1:05:43,  1.59s/it]

{'loss': 0.2829, 'grad_norm': 0.18851549923419952, 'learning_rate': 0.0003395015363605326, 'epoch': 1.51}


 15%|█▌        | 445/2930 [12:03<1:05:34,  1.58s/it]

{'loss': 0.2819, 'grad_norm': 0.19197121262550354, 'learning_rate': 0.0003393649709798566, 'epoch': 1.52}


 15%|█▌        | 446/2930 [12:04<1:05:26,  1.58s/it]

{'loss': 0.3649, 'grad_norm': 0.26434674859046936, 'learning_rate': 0.0003392284055991806, 'epoch': 1.52}


 15%|█▌        | 447/2930 [12:06<1:05:29,  1.58s/it]

{'loss': 0.3086, 'grad_norm': 0.2117290496826172, 'learning_rate': 0.00033909184021850466, 'epoch': 1.52}


 15%|█▌        | 448/2930 [12:07<1:05:01,  1.57s/it]

{'loss': 0.2908, 'grad_norm': 0.2356727421283722, 'learning_rate': 0.00033895527483782865, 'epoch': 1.53}


 15%|█▌        | 449/2930 [12:09<1:05:01,  1.57s/it]

{'loss': 0.3109, 'grad_norm': 0.2894708216190338, 'learning_rate': 0.0003388187094571526, 'epoch': 1.53}


 15%|█▌        | 450/2930 [12:11<1:04:55,  1.57s/it]

{'loss': 0.3419, 'grad_norm': 0.2802561819553375, 'learning_rate': 0.00033868214407647664, 'epoch': 1.53}


 15%|█▌        | 451/2930 [12:12<1:04:55,  1.57s/it]

{'loss': 0.2609, 'grad_norm': 0.21933603286743164, 'learning_rate': 0.00033854557869580064, 'epoch': 1.54}


 15%|█▌        | 452/2930 [12:14<1:04:53,  1.57s/it]

{'loss': 0.2575, 'grad_norm': 0.2017894983291626, 'learning_rate': 0.00033840901331512463, 'epoch': 1.54}


 15%|█▌        | 453/2930 [12:15<1:04:28,  1.56s/it]

{'loss': 0.3213, 'grad_norm': 0.20970788598060608, 'learning_rate': 0.0003382724479344487, 'epoch': 1.54}


 15%|█▌        | 454/2930 [12:17<1:04:44,  1.57s/it]

{'loss': 0.3154, 'grad_norm': 0.22622358798980713, 'learning_rate': 0.0003381358825537726, 'epoch': 1.55}


 16%|█▌        | 455/2930 [12:18<1:04:53,  1.57s/it]

{'loss': 0.2695, 'grad_norm': 0.20292894542217255, 'learning_rate': 0.0003379993171730966, 'epoch': 1.55}


 16%|█▌        | 456/2930 [12:20<1:04:52,  1.57s/it]

{'loss': 0.3132, 'grad_norm': 0.21411782503128052, 'learning_rate': 0.0003378627517924206, 'epoch': 1.55}


 16%|█▌        | 457/2930 [12:22<1:04:48,  1.57s/it]

{'loss': 0.3191, 'grad_norm': 0.2978256344795227, 'learning_rate': 0.00033772618641174465, 'epoch': 1.56}


 16%|█▌        | 458/2930 [12:23<1:04:33,  1.57s/it]

{'loss': 0.3431, 'grad_norm': 0.28505149483680725, 'learning_rate': 0.00033758962103106865, 'epoch': 1.56}


 16%|█▌        | 459/2930 [12:25<1:04:25,  1.56s/it]

{'loss': 0.342, 'grad_norm': 0.29751160740852356, 'learning_rate': 0.00033745305565039264, 'epoch': 1.57}


 16%|█▌        | 460/2930 [12:26<1:04:07,  1.56s/it]

{'loss': 0.298, 'grad_norm': 0.22977152466773987, 'learning_rate': 0.00033731649026971664, 'epoch': 1.57}


 16%|█▌        | 461/2930 [12:28<1:03:57,  1.55s/it]

{'loss': 0.3123, 'grad_norm': 0.18935450911521912, 'learning_rate': 0.00033717992488904063, 'epoch': 1.57}


 16%|█▌        | 462/2930 [12:29<1:04:25,  1.57s/it]

{'loss': 0.2989, 'grad_norm': 0.23792827129364014, 'learning_rate': 0.0003370433595083646, 'epoch': 1.58}


 16%|█▌        | 463/2930 [12:31<1:04:02,  1.56s/it]

{'loss': 0.3033, 'grad_norm': 0.20273086428642273, 'learning_rate': 0.0003369067941276887, 'epoch': 1.58}


 16%|█▌        | 464/2930 [12:32<1:04:23,  1.57s/it]

{'loss': 0.298, 'grad_norm': 0.2046014666557312, 'learning_rate': 0.00033677022874701267, 'epoch': 1.58}


 16%|█▌        | 465/2930 [12:34<1:04:35,  1.57s/it]

{'loss': 0.2824, 'grad_norm': 0.22221295535564423, 'learning_rate': 0.00033663366336633666, 'epoch': 1.59}


 16%|█▌        | 466/2930 [12:36<1:04:31,  1.57s/it]

{'loss': 0.3344, 'grad_norm': 0.23686303198337555, 'learning_rate': 0.00033649709798566066, 'epoch': 1.59}


 16%|█▌        | 467/2930 [12:37<1:04:02,  1.56s/it]

{'loss': 0.3016, 'grad_norm': 0.22441580891609192, 'learning_rate': 0.00033636053260498465, 'epoch': 1.59}


 16%|█▌        | 468/2930 [12:39<1:03:51,  1.56s/it]

{'loss': 0.2959, 'grad_norm': 0.23785005509853363, 'learning_rate': 0.00033622396722430864, 'epoch': 1.6}


 16%|█▌        | 469/2930 [12:40<1:04:02,  1.56s/it]

{'loss': 0.3011, 'grad_norm': 0.2760114371776581, 'learning_rate': 0.0003360874018436327, 'epoch': 1.6}


 16%|█▌        | 470/2930 [12:42<1:03:54,  1.56s/it]

{'loss': 0.3742, 'grad_norm': 0.2698655426502228, 'learning_rate': 0.0003359508364629567, 'epoch': 1.6}


 16%|█▌        | 471/2930 [12:43<1:04:18,  1.57s/it]

{'loss': 0.2879, 'grad_norm': 0.24856457114219666, 'learning_rate': 0.0003358142710822806, 'epoch': 1.61}


 16%|█▌        | 472/2930 [12:45<1:04:07,  1.57s/it]

{'loss': 0.2859, 'grad_norm': 0.2138814479112625, 'learning_rate': 0.0003356777057016046, 'epoch': 1.61}


 16%|█▌        | 473/2930 [12:47<1:04:27,  1.57s/it]

{'loss': 0.3243, 'grad_norm': 0.21407762169837952, 'learning_rate': 0.00033554114032092867, 'epoch': 1.61}


 16%|█▌        | 474/2930 [12:48<1:04:13,  1.57s/it]

{'loss': 0.3095, 'grad_norm': 0.2208736091852188, 'learning_rate': 0.00033540457494025266, 'epoch': 1.62}


 16%|█▌        | 475/2930 [12:50<1:04:06,  1.57s/it]

{'loss': 0.3483, 'grad_norm': 0.21401436626911163, 'learning_rate': 0.00033526800955957666, 'epoch': 1.62}


 16%|█▌        | 476/2930 [12:51<1:04:05,  1.57s/it]

{'loss': 0.3197, 'grad_norm': 0.21205928921699524, 'learning_rate': 0.00033513144417890065, 'epoch': 1.62}


 16%|█▋        | 477/2930 [12:53<1:03:58,  1.56s/it]

{'loss': 0.3251, 'grad_norm': 0.21503736078739166, 'learning_rate': 0.00033499487879822465, 'epoch': 1.63}


 16%|█▋        | 478/2930 [12:54<1:04:00,  1.57s/it]

{'loss': 0.3062, 'grad_norm': 0.19942706823349, 'learning_rate': 0.00033485831341754864, 'epoch': 1.63}


 16%|█▋        | 479/2930 [12:56<1:04:21,  1.58s/it]

{'loss': 0.314, 'grad_norm': 0.22622978687286377, 'learning_rate': 0.0003347217480368727, 'epoch': 1.63}


 16%|█▋        | 480/2930 [12:58<1:04:21,  1.58s/it]

{'loss': 0.2996, 'grad_norm': 0.23547393083572388, 'learning_rate': 0.0003345851826561967, 'epoch': 1.64}


 16%|█▋        | 481/2930 [12:59<1:04:32,  1.58s/it]

{'loss': 0.3596, 'grad_norm': 0.2636096179485321, 'learning_rate': 0.0003344486172755207, 'epoch': 1.64}


 16%|█▋        | 482/2930 [13:01<1:04:05,  1.57s/it]

{'loss': 0.3362, 'grad_norm': 0.2560253143310547, 'learning_rate': 0.00033431205189484467, 'epoch': 1.64}


 16%|█▋        | 483/2930 [13:02<1:04:03,  1.57s/it]

{'loss': 0.2961, 'grad_norm': 0.2340826541185379, 'learning_rate': 0.00033417548651416866, 'epoch': 1.65}


 17%|█▋        | 484/2930 [13:04<1:04:01,  1.57s/it]

{'loss': 0.3003, 'grad_norm': 0.2092929631471634, 'learning_rate': 0.00033403892113349266, 'epoch': 1.65}


 17%|█▋        | 485/2930 [13:05<1:03:05,  1.55s/it]

{'loss': 0.3416, 'grad_norm': 0.2335752695798874, 'learning_rate': 0.0003339023557528167, 'epoch': 1.65}


 17%|█▋        | 486/2930 [13:07<1:03:10,  1.55s/it]

{'loss': 0.3562, 'grad_norm': 0.24169817566871643, 'learning_rate': 0.0003337657903721407, 'epoch': 1.66}


 17%|█▋        | 487/2930 [13:08<1:03:03,  1.55s/it]

{'loss': 0.36, 'grad_norm': 0.2404930740594864, 'learning_rate': 0.0003336292249914647, 'epoch': 1.66}


 17%|█▋        | 488/2930 [13:10<1:03:11,  1.55s/it]

{'loss': 0.2881, 'grad_norm': 0.2303403615951538, 'learning_rate': 0.00033349265961078864, 'epoch': 1.66}


 17%|█▋        | 489/2930 [13:12<1:03:32,  1.56s/it]

{'loss': 0.3023, 'grad_norm': 0.23361048102378845, 'learning_rate': 0.0003333560942301127, 'epoch': 1.67}


 17%|█▋        | 490/2930 [13:13<1:03:51,  1.57s/it]

{'loss': 0.3118, 'grad_norm': 0.23842883110046387, 'learning_rate': 0.0003332195288494367, 'epoch': 1.67}


 17%|█▋        | 491/2930 [13:15<1:04:53,  1.60s/it]

{'loss': 0.2952, 'grad_norm': 0.21191640198230743, 'learning_rate': 0.00033308296346876067, 'epoch': 1.67}


 17%|█▋        | 492/2930 [13:17<1:06:01,  1.62s/it]

{'loss': 0.3121, 'grad_norm': 0.1997518390417099, 'learning_rate': 0.0003329463980880847, 'epoch': 1.68}


 17%|█▋        | 493/2930 [13:18<1:07:30,  1.66s/it]

{'loss': 0.2995, 'grad_norm': 0.22100260853767395, 'learning_rate': 0.00033280983270740866, 'epoch': 1.68}


 17%|█▋        | 494/2930 [13:20<1:08:21,  1.68s/it]

{'loss': 0.2794, 'grad_norm': 0.20040227472782135, 'learning_rate': 0.00033267326732673265, 'epoch': 1.68}


 17%|█▋        | 495/2930 [13:22<1:08:46,  1.69s/it]

{'loss': 0.2881, 'grad_norm': 0.24342064559459686, 'learning_rate': 0.0003325367019460567, 'epoch': 1.69}


 17%|█▋        | 496/2930 [13:23<1:09:16,  1.71s/it]

{'loss': 0.3222, 'grad_norm': 0.24706287682056427, 'learning_rate': 0.0003324001365653807, 'epoch': 1.69}


 17%|█▋        | 497/2930 [13:25<1:09:12,  1.71s/it]

{'loss': 0.302, 'grad_norm': 0.2307801991701126, 'learning_rate': 0.0003322635711847047, 'epoch': 1.69}


 17%|█▋        | 498/2930 [13:27<1:09:09,  1.71s/it]

{'loss': 0.3103, 'grad_norm': 0.2536643147468567, 'learning_rate': 0.00033212700580402874, 'epoch': 1.7}


 17%|█▋        | 499/2930 [13:29<1:09:22,  1.71s/it]

{'loss': 0.3139, 'grad_norm': 0.23330727219581604, 'learning_rate': 0.0003319904404233527, 'epoch': 1.7}


 17%|█▋        | 500/2930 [13:30<1:09:15,  1.71s/it]
Checkpoint destination directory outputs/checkpoint-500 already exists and is non-empty. Saving will proceed but saved results may be invalid.


{'loss': 0.3299, 'grad_norm': 0.25033849477767944, 'learning_rate': 0.0003318538750426767, 'epoch': 1.71}


[34m[1mwandb[0m: Adding directory to artifact (./outputs/checkpoint-500)... Done. 0.1s
 17%|█▋        | 501/2930 [13:33<1:18:40,  1.94s/it]

{'loss': 0.3286, 'grad_norm': 0.2157747745513916, 'learning_rate': 0.0003317173096620007, 'epoch': 1.71}


 17%|█▋        | 502/2930 [13:34<1:15:47,  1.87s/it]

{'loss': 0.3528, 'grad_norm': 0.24722754955291748, 'learning_rate': 0.0003315807442813247, 'epoch': 1.71}


 17%|█▋        | 503/2930 [13:36<1:13:54,  1.83s/it]

{'loss': 0.2754, 'grad_norm': 0.2099849134683609, 'learning_rate': 0.0003314441789006487, 'epoch': 1.72}


 17%|█▋        | 504/2930 [13:38<1:12:53,  1.80s/it]

{'loss': 0.2993, 'grad_norm': 0.25216421484947205, 'learning_rate': 0.0003313076135199727, 'epoch': 1.72}


 17%|█▋        | 505/2930 [13:40<1:12:11,  1.79s/it]

{'loss': 0.3358, 'grad_norm': 0.2387196570634842, 'learning_rate': 0.0003311710481392967, 'epoch': 1.72}


 17%|█▋        | 506/2930 [13:41<1:11:36,  1.77s/it]

{'loss': 0.3022, 'grad_norm': 0.2207803875207901, 'learning_rate': 0.0003310344827586207, 'epoch': 1.73}


 17%|█▋        | 507/2930 [13:43<1:11:04,  1.76s/it]

{'loss': 0.2872, 'grad_norm': 0.23718354105949402, 'learning_rate': 0.0003308979173779447, 'epoch': 1.73}


 17%|█▋        | 508/2930 [13:45<1:10:27,  1.75s/it]

{'loss': 0.3053, 'grad_norm': 0.22435124218463898, 'learning_rate': 0.00033076135199726873, 'epoch': 1.73}


 17%|█▋        | 509/2930 [13:47<1:10:11,  1.74s/it]

{'loss': 0.2837, 'grad_norm': 0.22459182143211365, 'learning_rate': 0.00033062478661659273, 'epoch': 1.74}


 17%|█▋        | 510/2930 [13:48<1:09:25,  1.72s/it]

{'loss': 0.3076, 'grad_norm': 0.23864047229290009, 'learning_rate': 0.00033048822123591667, 'epoch': 1.74}


 17%|█▋        | 511/2930 [13:50<1:09:24,  1.72s/it]

{'loss': 0.3137, 'grad_norm': 0.2538717985153198, 'learning_rate': 0.0003303516558552407, 'epoch': 1.74}


 17%|█▋        | 512/2930 [13:52<1:09:25,  1.72s/it]

{'loss': 0.2911, 'grad_norm': 0.21586747467517853, 'learning_rate': 0.0003302150904745647, 'epoch': 1.75}


 18%|█▊        | 513/2930 [13:53<1:09:00,  1.71s/it]

{'loss': 0.2651, 'grad_norm': 0.2315429300069809, 'learning_rate': 0.0003300785250938887, 'epoch': 1.75}


 18%|█▊        | 514/2930 [13:55<1:09:09,  1.72s/it]

{'loss': 0.2836, 'grad_norm': 0.22632360458374023, 'learning_rate': 0.00032994195971321275, 'epoch': 1.75}


 18%|█▊        | 515/2930 [13:57<1:09:13,  1.72s/it]

{'loss': 0.3378, 'grad_norm': 0.2672441601753235, 'learning_rate': 0.00032980539433253675, 'epoch': 1.76}


 18%|█▊        | 516/2930 [13:59<1:09:20,  1.72s/it]

{'loss': 0.2736, 'grad_norm': 0.21000726521015167, 'learning_rate': 0.0003296688289518607, 'epoch': 1.76}


 18%|█▊        | 517/2930 [14:00<1:09:27,  1.73s/it]

{'loss': 0.2708, 'grad_norm': 0.21258525550365448, 'learning_rate': 0.00032953226357118474, 'epoch': 1.76}


 18%|█▊        | 518/2930 [14:02<1:09:23,  1.73s/it]

{'loss': 0.2903, 'grad_norm': 0.21957795321941376, 'learning_rate': 0.00032939569819050873, 'epoch': 1.77}


 18%|█▊        | 519/2930 [14:04<1:09:19,  1.73s/it]

{'loss': 0.2851, 'grad_norm': 0.2004522681236267, 'learning_rate': 0.0003292591328098327, 'epoch': 1.77}


 18%|█▊        | 520/2930 [14:05<1:08:29,  1.71s/it]

{'loss': 0.2544, 'grad_norm': 0.19407841563224792, 'learning_rate': 0.0003291225674291567, 'epoch': 1.77}


 18%|█▊        | 521/2930 [14:07<1:06:32,  1.66s/it]

{'loss': 0.3095, 'grad_norm': 0.22608740627765656, 'learning_rate': 0.0003289860020484807, 'epoch': 1.78}


 18%|█▊        | 522/2930 [14:09<1:05:29,  1.63s/it]

{'loss': 0.3426, 'grad_norm': 0.2553013563156128, 'learning_rate': 0.0003288494366678047, 'epoch': 1.78}


 18%|█▊        | 523/2930 [14:10<1:04:52,  1.62s/it]

{'loss': 0.3162, 'grad_norm': 0.24679940938949585, 'learning_rate': 0.0003287128712871287, 'epoch': 1.78}


 18%|█▊        | 524/2930 [14:12<1:04:15,  1.60s/it]

{'loss': 0.2986, 'grad_norm': 0.21840453147888184, 'learning_rate': 0.00032857630590645275, 'epoch': 1.79}


 18%|█▊        | 525/2930 [14:13<1:04:02,  1.60s/it]

{'loss': 0.3693, 'grad_norm': 0.26539304852485657, 'learning_rate': 0.00032843974052577674, 'epoch': 1.79}


 18%|█▊        | 526/2930 [14:15<1:03:17,  1.58s/it]

{'loss': 0.2956, 'grad_norm': 0.23466843366622925, 'learning_rate': 0.00032830317514510074, 'epoch': 1.79}


 18%|█▊        | 527/2930 [14:16<1:02:54,  1.57s/it]

{'loss': 0.3152, 'grad_norm': 0.2602140009403229, 'learning_rate': 0.00032816660976442473, 'epoch': 1.8}


 18%|█▊        | 528/2930 [14:18<1:03:08,  1.58s/it]

{'loss': 0.2698, 'grad_norm': 0.2120136022567749, 'learning_rate': 0.0003280300443837487, 'epoch': 1.8}


 18%|█▊        | 529/2930 [14:20<1:02:58,  1.57s/it]

{'loss': 0.2849, 'grad_norm': 0.2200935184955597, 'learning_rate': 0.0003278934790030727, 'epoch': 1.8}


 18%|█▊        | 530/2930 [14:21<1:02:57,  1.57s/it]

{'loss': 0.2752, 'grad_norm': 0.24048815667629242, 'learning_rate': 0.00032775691362239677, 'epoch': 1.81}


 18%|█▊        | 531/2930 [14:23<1:02:38,  1.57s/it]

{'loss': 0.3048, 'grad_norm': 0.24697589874267578, 'learning_rate': 0.00032762034824172076, 'epoch': 1.81}


 18%|█▊        | 532/2930 [14:24<1:02:25,  1.56s/it]

{'loss': 0.3104, 'grad_norm': 0.25353580713272095, 'learning_rate': 0.00032748378286104476, 'epoch': 1.81}


 18%|█▊        | 533/2930 [14:26<1:02:31,  1.56s/it]

{'loss': 0.2587, 'grad_norm': 0.22914521396160126, 'learning_rate': 0.00032734721748036875, 'epoch': 1.82}


 18%|█▊        | 534/2930 [14:27<1:02:35,  1.57s/it]

{'loss': 0.2857, 'grad_norm': 0.25223249197006226, 'learning_rate': 0.00032721065209969275, 'epoch': 1.82}


 18%|█▊        | 535/2930 [14:29<1:02:35,  1.57s/it]

{'loss': 0.2587, 'grad_norm': 0.20144064724445343, 'learning_rate': 0.00032707408671901674, 'epoch': 1.82}


 18%|█▊        | 536/2930 [14:31<1:02:43,  1.57s/it]

{'loss': 0.3164, 'grad_norm': 0.2628914415836334, 'learning_rate': 0.00032693752133834073, 'epoch': 1.83}


 18%|█▊        | 537/2930 [14:32<1:02:34,  1.57s/it]

{'loss': 0.2955, 'grad_norm': 0.23092570900917053, 'learning_rate': 0.0003268009559576648, 'epoch': 1.83}


 18%|█▊        | 538/2930 [14:34<1:02:47,  1.57s/it]

{'loss': 0.3197, 'grad_norm': 0.2483551949262619, 'learning_rate': 0.0003266643905769887, 'epoch': 1.83}


 18%|█▊        | 539/2930 [14:35<1:02:20,  1.56s/it]

{'loss': 0.2975, 'grad_norm': 0.24644523859024048, 'learning_rate': 0.0003265278251963127, 'epoch': 1.84}


 18%|█▊        | 540/2930 [14:37<1:02:31,  1.57s/it]

{'loss': 0.3007, 'grad_norm': 0.2311246246099472, 'learning_rate': 0.00032639125981563676, 'epoch': 1.84}


 18%|█▊        | 541/2930 [14:38<1:02:12,  1.56s/it]

{'loss': 0.2746, 'grad_norm': 0.2329186201095581, 'learning_rate': 0.00032625469443496076, 'epoch': 1.84}


 18%|█▊        | 542/2930 [14:40<1:02:23,  1.57s/it]

{'loss': 0.2713, 'grad_norm': 0.253858357667923, 'learning_rate': 0.00032611812905428475, 'epoch': 1.85}


 19%|█▊        | 543/2930 [14:41<1:02:22,  1.57s/it]

{'loss': 0.2987, 'grad_norm': 0.24943000078201294, 'learning_rate': 0.00032598156367360875, 'epoch': 1.85}


 19%|█▊        | 544/2930 [14:43<1:02:31,  1.57s/it]

{'loss': 0.3128, 'grad_norm': 0.2607397735118866, 'learning_rate': 0.00032584499829293274, 'epoch': 1.86}


 19%|█▊        | 545/2930 [14:45<1:02:47,  1.58s/it]

{'loss': 0.2911, 'grad_norm': 0.21323326230049133, 'learning_rate': 0.00032570843291225673, 'epoch': 1.86}


 19%|█▊        | 546/2930 [14:46<1:02:55,  1.58s/it]

{'loss': 0.2973, 'grad_norm': 0.22772394120693207, 'learning_rate': 0.0003255718675315808, 'epoch': 1.86}


 19%|█▊        | 547/2930 [14:48<1:02:46,  1.58s/it]

{'loss': 0.3252, 'grad_norm': 0.22225059568881989, 'learning_rate': 0.0003254353021509048, 'epoch': 1.87}


 19%|█▊        | 548/2930 [14:49<1:02:17,  1.57s/it]

{'loss': 0.3225, 'grad_norm': 0.23616188764572144, 'learning_rate': 0.00032529873677022877, 'epoch': 1.87}


 19%|█▊        | 549/2930 [14:51<1:02:37,  1.58s/it]

{'loss': 0.3006, 'grad_norm': 0.226515531539917, 'learning_rate': 0.00032516217138955277, 'epoch': 1.87}


 19%|█▉        | 550/2930 [14:53<1:02:06,  1.57s/it]

{'loss': 0.3376, 'grad_norm': 0.2777732014656067, 'learning_rate': 0.00032502560600887676, 'epoch': 1.88}


 19%|█▉        | 551/2930 [14:54<1:02:21,  1.57s/it]

{'loss': 0.3051, 'grad_norm': 0.23271530866622925, 'learning_rate': 0.00032488904062820075, 'epoch': 1.88}


 19%|█▉        | 552/2930 [14:56<1:02:40,  1.58s/it]

{'loss': 0.3008, 'grad_norm': 0.26344048976898193, 'learning_rate': 0.0003247524752475248, 'epoch': 1.88}


 19%|█▉        | 553/2930 [14:57<1:03:03,  1.59s/it]

{'loss': 0.2644, 'grad_norm': 0.20472854375839233, 'learning_rate': 0.0003246159098668488, 'epoch': 1.89}


 19%|█▉        | 554/2930 [14:59<1:03:29,  1.60s/it]

{'loss': 0.2784, 'grad_norm': 0.2197595089673996, 'learning_rate': 0.0003244793444861728, 'epoch': 1.89}


 19%|█▉        | 555/2930 [15:01<1:03:39,  1.61s/it]

{'loss': 0.3303, 'grad_norm': 0.29249078035354614, 'learning_rate': 0.00032434277910549673, 'epoch': 1.89}


 19%|█▉        | 556/2930 [15:02<1:03:23,  1.60s/it]

{'loss': 0.3077, 'grad_norm': 0.27194032073020935, 'learning_rate': 0.0003242062137248208, 'epoch': 1.9}


 19%|█▉        | 557/2930 [15:04<1:03:13,  1.60s/it]

{'loss': 0.3073, 'grad_norm': 0.2777726948261261, 'learning_rate': 0.0003240696483441448, 'epoch': 1.9}


 19%|█▉        | 558/2930 [15:05<1:02:43,  1.59s/it]

{'loss': 0.3358, 'grad_norm': 0.27145203948020935, 'learning_rate': 0.00032393308296346877, 'epoch': 1.9}


 19%|█▉        | 559/2930 [15:07<1:02:46,  1.59s/it]

{'loss': 0.274, 'grad_norm': 0.22894646227359772, 'learning_rate': 0.0003237965175827928, 'epoch': 1.91}


 19%|█▉        | 560/2930 [15:09<1:02:56,  1.59s/it]

{'loss': 0.3112, 'grad_norm': 0.2698633372783661, 'learning_rate': 0.00032365995220211676, 'epoch': 1.91}


 19%|█▉        | 561/2930 [15:10<1:02:56,  1.59s/it]

{'loss': 0.2698, 'grad_norm': 0.20423170924186707, 'learning_rate': 0.00032352338682144075, 'epoch': 1.91}


 19%|█▉        | 562/2930 [15:12<1:02:50,  1.59s/it]

{'loss': 0.2845, 'grad_norm': 0.22150705754756927, 'learning_rate': 0.0003233868214407648, 'epoch': 1.92}


 19%|█▉        | 563/2930 [15:13<1:02:44,  1.59s/it]

{'loss': 0.2854, 'grad_norm': 0.2498629093170166, 'learning_rate': 0.0003232502560600888, 'epoch': 1.92}


 19%|█▉        | 564/2930 [15:15<1:02:12,  1.58s/it]

{'loss': 0.2792, 'grad_norm': 0.24699947237968445, 'learning_rate': 0.0003231136906794128, 'epoch': 1.92}


 19%|█▉        | 565/2930 [15:16<1:01:57,  1.57s/it]

{'loss': 0.2915, 'grad_norm': 0.29410219192504883, 'learning_rate': 0.0003229771252987368, 'epoch': 1.93}


 19%|█▉        | 566/2930 [15:18<1:02:10,  1.58s/it]

{'loss': 0.2697, 'grad_norm': 0.24067720770835876, 'learning_rate': 0.0003228405599180608, 'epoch': 1.93}


 19%|█▉        | 567/2930 [15:20<1:01:58,  1.57s/it]

{'loss': 0.2973, 'grad_norm': 0.23433727025985718, 'learning_rate': 0.00032270399453738477, 'epoch': 1.93}


 19%|█▉        | 568/2930 [15:21<1:01:50,  1.57s/it]

{'loss': 0.2654, 'grad_norm': 0.22253163158893585, 'learning_rate': 0.0003225674291567088, 'epoch': 1.94}


 19%|█▉        | 569/2930 [15:23<1:01:50,  1.57s/it]

{'loss': 0.2877, 'grad_norm': 0.260884165763855, 'learning_rate': 0.0003224308637760328, 'epoch': 1.94}


 19%|█▉        | 570/2930 [15:24<1:01:47,  1.57s/it]

{'loss': 0.2949, 'grad_norm': 0.26322606205940247, 'learning_rate': 0.0003222942983953568, 'epoch': 1.94}


 19%|█▉        | 571/2930 [15:26<1:01:51,  1.57s/it]

{'loss': 0.2831, 'grad_norm': 0.2531607449054718, 'learning_rate': 0.0003221577330146808, 'epoch': 1.95}


 20%|█▉        | 572/2930 [15:27<1:01:45,  1.57s/it]

{'loss': 0.3091, 'grad_norm': 0.2755001187324524, 'learning_rate': 0.0003220211676340048, 'epoch': 1.95}


 20%|█▉        | 573/2930 [15:29<1:01:10,  1.56s/it]

{'loss': 0.3104, 'grad_norm': 0.25070714950561523, 'learning_rate': 0.0003218846022533288, 'epoch': 1.95}


 20%|█▉        | 574/2930 [15:30<1:01:18,  1.56s/it]

{'loss': 0.2738, 'grad_norm': 0.21333590149879456, 'learning_rate': 0.0003217480368726528, 'epoch': 1.96}


 20%|█▉        | 575/2930 [15:32<1:01:10,  1.56s/it]

{'loss': 0.2718, 'grad_norm': 0.23027096688747406, 'learning_rate': 0.00032161147149197683, 'epoch': 1.96}


 20%|█▉        | 576/2930 [15:34<1:01:27,  1.57s/it]

{'loss': 0.306, 'grad_norm': 0.24842569231987, 'learning_rate': 0.0003214749061113008, 'epoch': 1.96}


 20%|█▉        | 577/2930 [15:35<1:01:49,  1.58s/it]

{'loss': 0.3247, 'grad_norm': 0.26059767603874207, 'learning_rate': 0.00032133834073062476, 'epoch': 1.97}


 20%|█▉        | 578/2930 [15:37<1:01:38,  1.57s/it]

{'loss': 0.2858, 'grad_norm': 0.24429315328598022, 'learning_rate': 0.0003212017753499488, 'epoch': 1.97}


 20%|█▉        | 579/2930 [15:38<1:01:46,  1.58s/it]

{'loss': 0.2937, 'grad_norm': 0.23248037695884705, 'learning_rate': 0.0003210652099692728, 'epoch': 1.97}


 20%|█▉        | 580/2930 [15:40<1:01:34,  1.57s/it]

{'loss': 0.2705, 'grad_norm': 0.24480308592319489, 'learning_rate': 0.0003209286445885968, 'epoch': 1.98}


 20%|█▉        | 581/2930 [15:42<1:01:36,  1.57s/it]

{'loss': 0.2723, 'grad_norm': 0.2592727243900299, 'learning_rate': 0.00032079207920792085, 'epoch': 1.98}


 20%|█▉        | 582/2930 [15:43<1:01:20,  1.57s/it]

{'loss': 0.3344, 'grad_norm': 0.3040045499801636, 'learning_rate': 0.0003206555138272448, 'epoch': 1.98}


 20%|█▉        | 583/2930 [15:45<1:00:56,  1.56s/it]

{'loss': 0.2937, 'grad_norm': 0.2486877590417862, 'learning_rate': 0.0003205189484465688, 'epoch': 1.99}


 20%|█▉        | 584/2930 [15:46<1:01:13,  1.57s/it]

{'loss': 0.2812, 'grad_norm': 0.291645348072052, 'learning_rate': 0.00032038238306589283, 'epoch': 1.99}


 20%|█▉        | 585/2930 [15:48<1:00:42,  1.55s/it]

{'loss': 0.2687, 'grad_norm': 0.22450725734233856, 'learning_rate': 0.0003202458176852168, 'epoch': 1.99}


 20%|██        | 586/2930 [15:49<1:00:58,  1.56s/it]

{'loss': 0.2607, 'grad_norm': 0.23074780404567719, 'learning_rate': 0.0003201092523045408, 'epoch': 2.0}


 20%|██        | 587/2930 [15:51<1:01:10,  1.57s/it]

{'loss': 0.2747, 'grad_norm': 0.2545883059501648, 'learning_rate': 0.0003199726869238648, 'epoch': 2.0}


 20%|██        | 588/2930 [15:52<1:01:32,  1.58s/it]

{'loss': 0.2426, 'grad_norm': 0.23479263484477997, 'learning_rate': 0.0003198361215431888, 'epoch': 2.01}


 20%|██        | 589/2930 [15:54<1:01:47,  1.58s/it]

{'loss': 0.2492, 'grad_norm': 0.26226720213890076, 'learning_rate': 0.0003196995561625128, 'epoch': 2.01}


 20%|██        | 590/2930 [15:56<1:01:51,  1.59s/it]

{'loss': 0.2697, 'grad_norm': 0.32191288471221924, 'learning_rate': 0.0003195629907818368, 'epoch': 2.01}


 20%|██        | 591/2930 [15:57<1:01:42,  1.58s/it]

{'loss': 0.2388, 'grad_norm': 0.2786462604999542, 'learning_rate': 0.00031942642540116084, 'epoch': 2.02}


 20%|██        | 592/2930 [15:59<1:01:34,  1.58s/it]

{'loss': 0.2911, 'grad_norm': 0.32363754510879517, 'learning_rate': 0.00031928986002048484, 'epoch': 2.02}


 20%|██        | 593/2930 [16:00<1:01:35,  1.58s/it]

{'loss': 0.2821, 'grad_norm': 0.2995135486125946, 'learning_rate': 0.00031915329463980883, 'epoch': 2.02}


 20%|██        | 594/2930 [16:02<1:01:38,  1.58s/it]

{'loss': 0.2705, 'grad_norm': 0.2548411786556244, 'learning_rate': 0.00031901672925913283, 'epoch': 2.03}


 20%|██        | 595/2930 [16:04<1:01:37,  1.58s/it]

{'loss': 0.2637, 'grad_norm': 0.2622937262058258, 'learning_rate': 0.0003188801638784568, 'epoch': 2.03}


 20%|██        | 596/2930 [16:05<1:00:50,  1.56s/it]

{'loss': 0.2369, 'grad_norm': 0.22986482083797455, 'learning_rate': 0.0003187435984977808, 'epoch': 2.03}


 20%|██        | 597/2930 [16:07<1:00:39,  1.56s/it]

{'loss': 0.2636, 'grad_norm': 0.24805472791194916, 'learning_rate': 0.00031860703311710486, 'epoch': 2.04}


 20%|██        | 598/2930 [16:08<1:00:46,  1.56s/it]

{'loss': 0.2378, 'grad_norm': 0.24748413264751434, 'learning_rate': 0.00031847046773642886, 'epoch': 2.04}


 20%|██        | 599/2930 [16:10<1:00:45,  1.56s/it]

{'loss': 0.2636, 'grad_norm': 0.2822194993495941, 'learning_rate': 0.0003183339023557528, 'epoch': 2.04}


 20%|██        | 600/2930 [16:11<1:00:42,  1.56s/it]

{'loss': 0.2585, 'grad_norm': 0.28365758061408997, 'learning_rate': 0.00031819733697507685, 'epoch': 2.05}


 21%|██        | 601/2930 [16:13<1:00:44,  1.56s/it]

{'loss': 0.2571, 'grad_norm': 0.31404346227645874, 'learning_rate': 0.00031806077159440084, 'epoch': 2.05}


 21%|██        | 602/2930 [16:14<1:00:54,  1.57s/it]

{'loss': 0.267, 'grad_norm': 0.30190208554267883, 'learning_rate': 0.00031792420621372483, 'epoch': 2.05}


 21%|██        | 603/2930 [16:16<1:00:27,  1.56s/it]

{'loss': 0.2702, 'grad_norm': 0.3302168548107147, 'learning_rate': 0.00031778764083304883, 'epoch': 2.06}


 21%|██        | 604/2930 [16:18<1:00:10,  1.55s/it]

{'loss': 0.2824, 'grad_norm': 0.3031648099422455, 'learning_rate': 0.0003176510754523728, 'epoch': 2.06}


 21%|██        | 605/2930 [16:19<1:00:31,  1.56s/it]

{'loss': 0.2681, 'grad_norm': 0.30199965834617615, 'learning_rate': 0.0003175145100716968, 'epoch': 2.06}


 21%|██        | 606/2930 [16:21<1:00:34,  1.56s/it]

{'loss': 0.2647, 'grad_norm': 0.2609722316265106, 'learning_rate': 0.0003173779446910208, 'epoch': 2.07}


 21%|██        | 607/2930 [16:22<1:00:34,  1.56s/it]

{'loss': 0.2591, 'grad_norm': 0.24228164553642273, 'learning_rate': 0.00031724137931034486, 'epoch': 2.07}


 21%|██        | 608/2930 [16:24<1:00:27,  1.56s/it]

{'loss': 0.2157, 'grad_norm': 0.23881682753562927, 'learning_rate': 0.00031710481392966885, 'epoch': 2.07}


 21%|██        | 609/2930 [16:25<1:00:08,  1.55s/it]

{'loss': 0.2697, 'grad_norm': 0.28731048107147217, 'learning_rate': 0.00031696824854899285, 'epoch': 2.08}


 21%|██        | 610/2930 [16:27<1:00:38,  1.57s/it]

{'loss': 0.2759, 'grad_norm': 0.303299218416214, 'learning_rate': 0.00031683168316831684, 'epoch': 2.08}


 21%|██        | 611/2930 [16:29<1:00:38,  1.57s/it]

{'loss': 0.2786, 'grad_norm': 0.30895745754241943, 'learning_rate': 0.00031669511778764084, 'epoch': 2.08}


 21%|██        | 612/2930 [16:30<1:00:21,  1.56s/it]

{'loss': 0.2549, 'grad_norm': 0.27508968114852905, 'learning_rate': 0.00031655855240696483, 'epoch': 2.09}


 21%|██        | 613/2930 [16:32<1:00:25,  1.56s/it]

{'loss': 0.2403, 'grad_norm': 0.2568661570549011, 'learning_rate': 0.0003164219870262889, 'epoch': 2.09}


 21%|██        | 614/2930 [16:33<1:00:24,  1.57s/it]

{'loss': 0.2315, 'grad_norm': 0.28074145317077637, 'learning_rate': 0.00031628542164561287, 'epoch': 2.09}


 21%|██        | 615/2930 [16:35<59:52,  1.55s/it]

{'loss': 0.2242, 'grad_norm': 0.24714379012584686, 'learning_rate': 0.00031614885626493687, 'epoch': 2.1}


 21%|██        | 616/2930 [16:36<59:59,  1.56s/it]

{'loss': 0.2765, 'grad_norm': 0.32963407039642334, 'learning_rate': 0.00031601229088426086, 'epoch': 2.1}


 21%|██        | 617/2930 [16:38<1:00:16,  1.56s/it]

{'loss': 0.2933, 'grad_norm': 0.3279256224632263, 'learning_rate': 0.00031587572550358486, 'epoch': 2.1}


 21%|██        | 618/2930 [16:39<1:00:06,  1.56s/it]

{'loss': 0.2803, 'grad_norm': 0.28644028306007385, 'learning_rate': 0.00031573916012290885, 'epoch': 2.11}


 21%|██        | 619/2930 [16:41<1:00:06,  1.56s/it]

{'loss': 0.2875, 'grad_norm': 0.28528892993927, 'learning_rate': 0.00031560259474223284, 'epoch': 2.11}


 21%|██        | 620/2930 [16:43<59:56,  1.56s/it]

{'loss': 0.2468, 'grad_norm': 0.2486429065465927, 'learning_rate': 0.0003154660293615569, 'epoch': 2.11}


 21%|██        | 621/2930 [16:44<59:53,  1.56s/it]

{'loss': 0.2454, 'grad_norm': 0.2676497995853424, 'learning_rate': 0.00031532946398088083, 'epoch': 2.12}


 21%|██        | 622/2930 [16:46<1:00:15,  1.57s/it]

{'loss': 0.2564, 'grad_norm': 0.2713560461997986, 'learning_rate': 0.0003151928986002048, 'epoch': 2.12}


 21%|██▏       | 623/2930 [16:47<1:00:15,  1.57s/it]

{'loss': 0.2607, 'grad_norm': 0.2803455591201782, 'learning_rate': 0.0003150563332195289, 'epoch': 2.12}


 21%|██▏       | 624/2930 [16:49<1:00:27,  1.57s/it]

{'loss': 0.2485, 'grad_norm': 0.2626256048679352, 'learning_rate': 0.00031491976783885287, 'epoch': 2.13}


 21%|██▏       | 625/2930 [16:50<1:00:35,  1.58s/it]

{'loss': 0.2538, 'grad_norm': 0.3161276876926422, 'learning_rate': 0.00031478320245817686, 'epoch': 2.13}


 21%|██▏       | 626/2930 [16:52<1:00:27,  1.57s/it]

{'loss': 0.2382, 'grad_norm': 0.29687508940696716, 'learning_rate': 0.00031464663707750086, 'epoch': 2.13}


 21%|██▏       | 627/2930 [16:54<1:00:32,  1.58s/it]

{'loss': 0.2692, 'grad_norm': 0.3442644476890564, 'learning_rate': 0.00031451007169682485, 'epoch': 2.14}


 21%|██▏       | 628/2930 [16:55<1:00:05,  1.57s/it]

{'loss': 0.2888, 'grad_norm': 0.3355252742767334, 'learning_rate': 0.00031437350631614884, 'epoch': 2.14}


 21%|██▏       | 629/2930 [16:57<1:00:07,  1.57s/it]

{'loss': 0.2425, 'grad_norm': 0.2590068280696869, 'learning_rate': 0.0003142369409354729, 'epoch': 2.14}


 22%|██▏       | 630/2930 [16:58<1:00:02,  1.57s/it]

{'loss': 0.2341, 'grad_norm': 0.2568148076534271, 'learning_rate': 0.0003141003755547969, 'epoch': 2.15}


 22%|██▏       | 631/2930 [17:00<1:00:02,  1.57s/it]

{'loss': 0.2554, 'grad_norm': 0.2782413959503174, 'learning_rate': 0.0003139638101741209, 'epoch': 2.15}


 22%|██▏       | 632/2930 [17:01<1:00:00,  1.57s/it]

{'loss': 0.2816, 'grad_norm': 0.2784276604652405, 'learning_rate': 0.0003138272447934449, 'epoch': 2.16}


 22%|██▏       | 633/2930 [17:03<59:57,  1.57s/it]

{'loss': 0.2946, 'grad_norm': 0.304522305727005, 'learning_rate': 0.00031369067941276887, 'epoch': 2.16}


 22%|██▏       | 634/2930 [17:05<1:00:06,  1.57s/it]

{'loss': 0.2604, 'grad_norm': 0.27047568559646606, 'learning_rate': 0.00031355411403209286, 'epoch': 2.16}


 22%|██▏       | 635/2930 [17:06<59:51,  1.56s/it]

{'loss': 0.2788, 'grad_norm': 0.26798924803733826, 'learning_rate': 0.00031341754865141686, 'epoch': 2.17}


 22%|██▏       | 636/2930 [17:08<59:46,  1.56s/it]

{'loss': 0.2736, 'grad_norm': 0.29893046617507935, 'learning_rate': 0.0003132809832707409, 'epoch': 2.17}


 22%|██▏       | 637/2930 [17:09<1:00:01,  1.57s/it]

{'loss': 0.2585, 'grad_norm': 0.2836349606513977, 'learning_rate': 0.0003131444178900649, 'epoch': 2.17}


 22%|██▏       | 638/2930 [17:11<1:00:05,  1.57s/it]

{'loss': 0.2661, 'grad_norm': 0.3224533796310425, 'learning_rate': 0.00031300785250938884, 'epoch': 2.18}


 22%|██▏       | 639/2930 [17:12<1:00:04,  1.57s/it]

{'loss': 0.2407, 'grad_norm': 0.2459944635629654, 'learning_rate': 0.0003128712871287129, 'epoch': 2.18}


 22%|██▏       | 640/2930 [17:14<1:00:02,  1.57s/it]

{'loss': 0.2456, 'grad_norm': 0.255055695772171, 'learning_rate': 0.0003127347217480369, 'epoch': 2.18}


 22%|██▏       | 641/2930 [17:16<59:48,  1.57s/it]

{'loss': 0.2471, 'grad_norm': 0.2923852205276489, 'learning_rate': 0.0003125981563673609, 'epoch': 2.19}


 22%|██▏       | 642/2930 [17:17<59:22,  1.56s/it]

{'loss': 0.2529, 'grad_norm': 0.28309422731399536, 'learning_rate': 0.0003124615909866849, 'epoch': 2.19}


 22%|██▏       | 643/2930 [17:19<59:24,  1.56s/it]

{'loss': 0.2577, 'grad_norm': 0.28973591327667236, 'learning_rate': 0.00031232502560600887, 'epoch': 2.19}


 22%|██▏       | 644/2930 [17:20<59:44,  1.57s/it]

{'loss': 0.2638, 'grad_norm': 0.29402920603752136, 'learning_rate': 0.00031218846022533286, 'epoch': 2.2}


 22%|██▏       | 645/2930 [17:22<59:55,  1.57s/it]

{'loss': 0.2384, 'grad_norm': 0.27947381138801575, 'learning_rate': 0.0003120518948446569, 'epoch': 2.2}


 22%|██▏       | 646/2930 [17:23<1:00:04,  1.58s/it]

{'loss': 0.2245, 'grad_norm': 0.26717349886894226, 'learning_rate': 0.0003119153294639809, 'epoch': 2.2}


 22%|██▏       | 647/2930 [17:25<1:00:10,  1.58s/it]

{'loss': 0.2745, 'grad_norm': 0.36581602692604065, 'learning_rate': 0.0003117787640833049, 'epoch': 2.21}


 22%|██▏       | 648/2930 [17:27<1:00:11,  1.58s/it]

{'loss': 0.2733, 'grad_norm': 0.3059908151626587, 'learning_rate': 0.00031164219870262894, 'epoch': 2.21}


 22%|██▏       | 649/2930 [17:28<59:58,  1.58s/it]

{'loss': 0.2511, 'grad_norm': 0.35893580317497253, 'learning_rate': 0.0003115056333219529, 'epoch': 2.21}


 22%|██▏       | 650/2930 [17:30<1:00:05,  1.58s/it]

{'loss': 0.2584, 'grad_norm': 0.2775363624095917, 'learning_rate': 0.0003113690679412769, 'epoch': 2.22}


 22%|██▏       | 651/2930 [17:31<59:59,  1.58s/it]

{'loss': 0.2459, 'grad_norm': 0.295235812664032, 'learning_rate': 0.00031123250256060087, 'epoch': 2.22}


 22%|██▏       | 652/2930 [17:33<1:00:05,  1.58s/it]

{'loss': 0.2744, 'grad_norm': 0.3265732228755951, 'learning_rate': 0.0003110959371799249, 'epoch': 2.22}


 22%|██▏       | 653/2930 [17:34<59:51,  1.58s/it]

{'loss': 0.2437, 'grad_norm': 0.2942526936531067, 'learning_rate': 0.0003109593717992489, 'epoch': 2.23}


 22%|██▏       | 654/2930 [17:36<59:31,  1.57s/it]

{'loss': 0.2634, 'grad_norm': 0.32908937335014343, 'learning_rate': 0.0003108228064185729, 'epoch': 2.23}


 22%|██▏       | 655/2930 [17:38<59:07,  1.56s/it]

{'loss': 0.3239, 'grad_norm': 0.36819320917129517, 'learning_rate': 0.0003106862410378969, 'epoch': 2.23}


 22%|██▏       | 656/2930 [17:39<58:58,  1.56s/it]

{'loss': 0.2268, 'grad_norm': 0.2666073143482208, 'learning_rate': 0.0003105496756572209, 'epoch': 2.24}


 22%|██▏       | 657/2930 [17:41<58:43,  1.55s/it]

{'loss': 0.2783, 'grad_norm': 0.3843906819820404, 'learning_rate': 0.0003104131102765449, 'epoch': 2.24}


 22%|██▏       | 658/2930 [17:42<59:10,  1.56s/it]

{'loss': 0.2411, 'grad_norm': 0.2553766369819641, 'learning_rate': 0.00031027654489586894, 'epoch': 2.24}


 22%|██▏       | 659/2930 [17:44<59:10,  1.56s/it]

{'loss': 0.2722, 'grad_norm': 0.3624238669872284, 'learning_rate': 0.00031013997951519293, 'epoch': 2.25}


 23%|██▎       | 660/2930 [17:45<59:27,  1.57s/it]

{'loss': 0.2706, 'grad_norm': 0.31297534704208374, 'learning_rate': 0.0003100034141345169, 'epoch': 2.25}


 23%|██▎       | 661/2930 [17:47<59:04,  1.56s/it]

{'loss': 0.2538, 'grad_norm': 0.31684568524360657, 'learning_rate': 0.0003098668487538409, 'epoch': 2.25}


 23%|██▎       | 662/2930 [17:48<59:15,  1.57s/it]

{'loss': 0.2523, 'grad_norm': 0.3213319778442383, 'learning_rate': 0.0003097302833731649, 'epoch': 2.26}


 23%|██▎       | 663/2930 [17:50<58:51,  1.56s/it]

{'loss': 0.2927, 'grad_norm': 0.4779462218284607, 'learning_rate': 0.0003095937179924889, 'epoch': 2.26}


 23%|██▎       | 664/2930 [17:52<59:02,  1.56s/it]

{'loss': 0.2551, 'grad_norm': 0.3026541471481323, 'learning_rate': 0.00030945715261181296, 'epoch': 2.26}


 23%|██▎       | 665/2930 [17:53<58:53,  1.56s/it]

{'loss': 0.2208, 'grad_norm': 0.3311305642127991, 'learning_rate': 0.00030932058723113695, 'epoch': 2.27}


 23%|██▎       | 666/2930 [17:55<58:52,  1.56s/it]

{'loss': 0.231, 'grad_norm': 0.3042091429233551, 'learning_rate': 0.0003091840218504609, 'epoch': 2.27}


 23%|██▎       | 667/2930 [17:56<58:30,  1.55s/it]

{'loss': 0.2502, 'grad_norm': 0.265809565782547, 'learning_rate': 0.00030904745646978494, 'epoch': 2.27}


 23%|██▎       | 668/2930 [17:58<58:43,  1.56s/it]

{'loss': 0.2285, 'grad_norm': 0.3381834924221039, 'learning_rate': 0.00030891089108910894, 'epoch': 2.28}


 23%|██▎       | 669/2930 [17:59<58:33,  1.55s/it]

{'loss': 0.28, 'grad_norm': 0.35733169317245483, 'learning_rate': 0.00030877432570843293, 'epoch': 2.28}


 23%|██▎       | 670/2930 [18:01<58:57,  1.57s/it]

{'loss': 0.2507, 'grad_norm': 0.2805085778236389, 'learning_rate': 0.0003086377603277569, 'epoch': 2.28}


 23%|██▎       | 671/2930 [18:02<58:33,  1.56s/it]

{'loss': 0.2766, 'grad_norm': 0.3854765295982361, 'learning_rate': 0.0003085011949470809, 'epoch': 2.29}


 23%|██▎       | 672/2930 [18:04<58:42,  1.56s/it]

{'loss': 0.2768, 'grad_norm': 0.3840591311454773, 'learning_rate': 0.0003083646295664049, 'epoch': 2.29}


 23%|██▎       | 673/2930 [18:06<59:01,  1.57s/it]

{'loss': 0.2436, 'grad_norm': 0.29096904397010803, 'learning_rate': 0.0003082280641857289, 'epoch': 2.29}


 23%|██▎       | 674/2930 [18:07<58:55,  1.57s/it]

{'loss': 0.2251, 'grad_norm': 0.30415967106819153, 'learning_rate': 0.00030809149880505295, 'epoch': 2.3}


 23%|██▎       | 675/2930 [18:09<58:34,  1.56s/it]

{'loss': 0.2451, 'grad_norm': 0.2738903760910034, 'learning_rate': 0.00030795493342437695, 'epoch': 2.3}


 23%|██▎       | 676/2930 [18:10<58:11,  1.55s/it]

{'loss': 0.2268, 'grad_norm': 0.2681007981300354, 'learning_rate': 0.00030781836804370094, 'epoch': 2.31}


 23%|██▎       | 677/2930 [18:12<58:28,  1.56s/it]

{'loss': 0.2847, 'grad_norm': 0.30679404735565186, 'learning_rate': 0.00030768180266302494, 'epoch': 2.31}


 23%|██▎       | 678/2930 [18:13<58:29,  1.56s/it]

{'loss': 0.2263, 'grad_norm': 0.2807861864566803, 'learning_rate': 0.00030754523728234893, 'epoch': 2.31}


 23%|██▎       | 679/2930 [18:15<58:44,  1.57s/it]

{'loss': 0.2678, 'grad_norm': 0.33048173785209656, 'learning_rate': 0.0003074086719016729, 'epoch': 2.32}


 23%|██▎       | 680/2930 [18:17<59:16,  1.58s/it]

{'loss': 0.2201, 'grad_norm': 0.2610887885093689, 'learning_rate': 0.000307272106520997, 'epoch': 2.32}


 23%|██▎       | 681/2930 [18:18<59:19,  1.58s/it]

{'loss': 0.2441, 'grad_norm': 0.27570798993110657, 'learning_rate': 0.00030713554114032097, 'epoch': 2.32}


 23%|██▎       | 682/2930 [18:20<59:02,  1.58s/it]

{'loss': 0.2555, 'grad_norm': 0.30896615982055664, 'learning_rate': 0.00030699897575964496, 'epoch': 2.33}


 23%|██▎       | 683/2930 [18:21<59:06,  1.58s/it]

{'loss': 0.2416, 'grad_norm': 0.2746807336807251, 'learning_rate': 0.00030686241037896896, 'epoch': 2.33}


 23%|██▎       | 684/2930 [18:23<59:01,  1.58s/it]

{'loss': 0.3002, 'grad_norm': 0.34921547770500183, 'learning_rate': 0.00030672584499829295, 'epoch': 2.33}


 23%|██▎       | 685/2930 [18:24<58:50,  1.57s/it]

{'loss': 0.2758, 'grad_norm': 0.3305409252643585, 'learning_rate': 0.00030658927961761694, 'epoch': 2.34}


 23%|██▎       | 686/2930 [18:26<59:03,  1.58s/it]

{'loss': 0.2483, 'grad_norm': 0.3247512876987457, 'learning_rate': 0.00030645271423694094, 'epoch': 2.34}


 23%|██▎       | 687/2930 [18:28<59:13,  1.58s/it]

{'loss': 0.251, 'grad_norm': 0.2827194631099701, 'learning_rate': 0.000306316148856265, 'epoch': 2.34}


 23%|██▎       | 688/2930 [18:29<59:15,  1.59s/it]

{'loss': 0.2535, 'grad_norm': 0.2888502776622772, 'learning_rate': 0.00030617958347558893, 'epoch': 2.35}


 24%|██▎       | 689/2930 [18:31<59:22,  1.59s/it]

{'loss': 0.2353, 'grad_norm': 0.3101057708263397, 'learning_rate': 0.0003060430180949129, 'epoch': 2.35}


 24%|██▎       | 690/2930 [18:32<59:22,  1.59s/it]

{'loss': 0.2497, 'grad_norm': 0.34698182344436646, 'learning_rate': 0.00030590645271423697, 'epoch': 2.35}


 24%|██▎       | 691/2930 [18:34<58:51,  1.58s/it]

{'loss': 0.2407, 'grad_norm': 0.27240321040153503, 'learning_rate': 0.00030576988733356096, 'epoch': 2.36}


 24%|██▎       | 692/2930 [18:36<59:11,  1.59s/it]

{'loss': 0.2575, 'grad_norm': 0.2992326319217682, 'learning_rate': 0.00030563332195288496, 'epoch': 2.36}


 24%|██▎       | 693/2930 [18:37<59:06,  1.59s/it]

{'loss': 0.2733, 'grad_norm': 0.34528520703315735, 'learning_rate': 0.00030549675657220895, 'epoch': 2.36}


 24%|██▎       | 694/2930 [18:39<59:09,  1.59s/it]

{'loss': 0.2616, 'grad_norm': 0.27604684233665466, 'learning_rate': 0.00030536019119153295, 'epoch': 2.37}


 24%|██▎       | 695/2930 [18:40<59:18,  1.59s/it]

{'loss': 0.2289, 'grad_norm': 0.25040870904922485, 'learning_rate': 0.00030522362581085694, 'epoch': 2.37}


 24%|██▍       | 696/2930 [18:42<58:57,  1.58s/it]

{'loss': 0.2281, 'grad_norm': 0.2581034302711487, 'learning_rate': 0.000305087060430181, 'epoch': 2.37}


 24%|██▍       | 697/2930 [18:43<58:41,  1.58s/it]

{'loss': 0.2348, 'grad_norm': 0.27292492985725403, 'learning_rate': 0.000304950495049505, 'epoch': 2.38}


 24%|██▍       | 698/2930 [18:45<58:39,  1.58s/it]

{'loss': 0.2575, 'grad_norm': 0.2997443675994873, 'learning_rate': 0.000304813929668829, 'epoch': 2.38}


 24%|██▍       | 699/2930 [18:47<58:15,  1.57s/it]

{'loss': 0.2651, 'grad_norm': 0.3166705071926117, 'learning_rate': 0.00030467736428815297, 'epoch': 2.38}


 24%|██▍       | 700/2930 [18:48<58:42,  1.58s/it]

{'loss': 0.2571, 'grad_norm': 0.3317769467830658, 'learning_rate': 0.00030454079890747697, 'epoch': 2.39}


 24%|██▍       | 701/2930 [18:50<58:46,  1.58s/it]

{'loss': 0.2978, 'grad_norm': 0.3580833673477173, 'learning_rate': 0.00030440423352680096, 'epoch': 2.39}


 24%|██▍       | 702/2930 [18:51<58:25,  1.57s/it]

{'loss': 0.2732, 'grad_norm': 0.3152810335159302, 'learning_rate': 0.00030426766814612495, 'epoch': 2.39}


 24%|██▍       | 703/2930 [18:53<58:22,  1.57s/it]

{'loss': 0.2467, 'grad_norm': 0.290412575006485, 'learning_rate': 0.000304131102765449, 'epoch': 2.4}


 24%|██▍       | 704/2930 [18:55<58:16,  1.57s/it]

{'loss': 0.243, 'grad_norm': 0.24854081869125366, 'learning_rate': 0.000303994537384773, 'epoch': 2.4}


 24%|██▍       | 705/2930 [18:56<58:12,  1.57s/it]

{'loss': 0.2484, 'grad_norm': 0.251250684261322, 'learning_rate': 0.00030385797200409694, 'epoch': 2.4}


 24%|██▍       | 706/2930 [18:58<58:25,  1.58s/it]

{'loss': 0.2406, 'grad_norm': 0.24387232959270477, 'learning_rate': 0.000303721406623421, 'epoch': 2.41}


 24%|██▍       | 707/2930 [18:59<58:27,  1.58s/it]

{'loss': 0.2228, 'grad_norm': 0.25394347310066223, 'learning_rate': 0.000303584841242745, 'epoch': 2.41}


 24%|██▍       | 708/2930 [19:01<58:30,  1.58s/it]

{'loss': 0.2362, 'grad_norm': 0.2663169205188751, 'learning_rate': 0.00030344827586206897, 'epoch': 2.41}


 24%|██▍       | 709/2930 [19:02<58:03,  1.57s/it]

{'loss': 0.2778, 'grad_norm': 0.3247399926185608, 'learning_rate': 0.000303311710481393, 'epoch': 2.42}


 24%|██▍       | 710/2930 [19:04<58:05,  1.57s/it]

{'loss': 0.2594, 'grad_norm': 0.33574846386909485, 'learning_rate': 0.00030317514510071696, 'epoch': 2.42}


 24%|██▍       | 711/2930 [19:05<57:27,  1.55s/it]

{'loss': 0.2642, 'grad_norm': 0.409232497215271, 'learning_rate': 0.00030303857972004096, 'epoch': 2.42}


 24%|██▍       | 712/2930 [19:07<57:23,  1.55s/it]

{'loss': 0.2354, 'grad_norm': 0.30598199367523193, 'learning_rate': 0.000302902014339365, 'epoch': 2.43}


 24%|██▍       | 713/2930 [19:09<57:25,  1.55s/it]

{'loss': 0.2427, 'grad_norm': 0.3022158145904541, 'learning_rate': 0.000302765448958689, 'epoch': 2.43}


 24%|██▍       | 714/2930 [19:10<57:41,  1.56s/it]

{'loss': 0.2441, 'grad_norm': 0.29933327436447144, 'learning_rate': 0.000302628883578013, 'epoch': 2.43}


 24%|██▍       | 715/2930 [19:12<57:47,  1.57s/it]

{'loss': 0.2218, 'grad_norm': 0.25995954871177673, 'learning_rate': 0.000302492318197337, 'epoch': 2.44}


 24%|██▍       | 716/2930 [19:13<57:53,  1.57s/it]

{'loss': 0.2136, 'grad_norm': 0.2721095085144043, 'learning_rate': 0.000302355752816661, 'epoch': 2.44}


 24%|██▍       | 717/2930 [19:15<57:13,  1.55s/it]

{'loss': 0.2405, 'grad_norm': 0.3313751518726349, 'learning_rate': 0.000302219187435985, 'epoch': 2.45}


 25%|██▍       | 718/2930 [19:16<57:19,  1.55s/it]

{'loss': 0.2452, 'grad_norm': 0.31067144870758057, 'learning_rate': 0.00030208262205530897, 'epoch': 2.45}


 25%|██▍       | 719/2930 [19:18<57:52,  1.57s/it]

{'loss': 0.2425, 'grad_norm': 0.3437124192714691, 'learning_rate': 0.000301946056674633, 'epoch': 2.45}


 25%|██▍       | 720/2930 [19:20<58:11,  1.58s/it]

{'loss': 0.2535, 'grad_norm': 0.34595808386802673, 'learning_rate': 0.000301809491293957, 'epoch': 2.46}


 25%|██▍       | 721/2930 [19:21<58:18,  1.58s/it]

{'loss': 0.2009, 'grad_norm': 0.257402628660202, 'learning_rate': 0.000301672925913281, 'epoch': 2.46}


 25%|██▍       | 722/2930 [19:23<57:38,  1.57s/it]

{'loss': 0.2296, 'grad_norm': 0.2982734441757202, 'learning_rate': 0.000301536360532605, 'epoch': 2.46}


 25%|██▍       | 723/2930 [19:24<57:39,  1.57s/it]

{'loss': 0.2951, 'grad_norm': 0.3877638578414917, 'learning_rate': 0.000301399795151929, 'epoch': 2.47}


 25%|██▍       | 724/2930 [19:26<57:11,  1.56s/it]

{'loss': 0.2408, 'grad_norm': 0.2884873151779175, 'learning_rate': 0.000301263229771253, 'epoch': 2.47}


 25%|██▍       | 725/2930 [19:27<57:18,  1.56s/it]

{'loss': 0.2342, 'grad_norm': 0.28140708804130554, 'learning_rate': 0.00030112666439057704, 'epoch': 2.47}


 25%|██▍       | 726/2930 [19:29<57:05,  1.55s/it]

{'loss': 0.2549, 'grad_norm': 0.3028900623321533, 'learning_rate': 0.00030099009900990103, 'epoch': 2.48}


 25%|██▍       | 727/2930 [19:30<57:07,  1.56s/it]

{'loss': 0.2295, 'grad_norm': 0.3004824221134186, 'learning_rate': 0.00030085353362922497, 'epoch': 2.48}


 25%|██▍       | 728/2930 [19:32<57:30,  1.57s/it]

{'loss': 0.2497, 'grad_norm': 0.34865355491638184, 'learning_rate': 0.000300716968248549, 'epoch': 2.48}


 25%|██▍       | 729/2930 [19:34<57:54,  1.58s/it]

{'loss': 0.2412, 'grad_norm': 0.31758394837379456, 'learning_rate': 0.000300580402867873, 'epoch': 2.49}


 25%|██▍       | 730/2930 [19:35<57:43,  1.57s/it]

{'loss': 0.2548, 'grad_norm': 0.3228558301925659, 'learning_rate': 0.000300443837487197, 'epoch': 2.49}


 25%|██▍       | 731/2930 [19:37<57:46,  1.58s/it]

{'loss': 0.2171, 'grad_norm': 0.2849781811237335, 'learning_rate': 0.00030030727210652105, 'epoch': 2.49}


 25%|██▍       | 732/2930 [19:38<57:25,  1.57s/it]

{'loss': 0.2482, 'grad_norm': 0.31990668177604675, 'learning_rate': 0.000300170706725845, 'epoch': 2.5}


 25%|██▌       | 733/2930 [19:40<57:33,  1.57s/it]

{'loss': 0.2606, 'grad_norm': 0.3424624502658844, 'learning_rate': 0.000300034141345169, 'epoch': 2.5}


 25%|██▌       | 734/2930 [19:41<57:09,  1.56s/it]

{'loss': 0.2258, 'grad_norm': 0.3061659634113312, 'learning_rate': 0.000299897575964493, 'epoch': 2.5}


 25%|██▌       | 735/2930 [19:43<57:23,  1.57s/it]

{'loss': 0.2361, 'grad_norm': 0.31414079666137695, 'learning_rate': 0.00029976101058381703, 'epoch': 2.51}


 25%|██▌       | 736/2930 [19:45<56:52,  1.56s/it]

{'loss': 0.2367, 'grad_norm': 0.35448431968688965, 'learning_rate': 0.000299624445203141, 'epoch': 2.51}


 25%|██▌       | 737/2930 [19:46<57:02,  1.56s/it]

{'loss': 0.2501, 'grad_norm': 0.3284408748149872, 'learning_rate': 0.000299487879822465, 'epoch': 2.51}


 25%|██▌       | 738/2930 [19:48<56:59,  1.56s/it]

{'loss': 0.2437, 'grad_norm': 0.30347174406051636, 'learning_rate': 0.000299351314441789, 'epoch': 2.52}


 25%|██▌       | 739/2930 [19:49<57:22,  1.57s/it]

{'loss': 0.2213, 'grad_norm': 0.307405561208725, 'learning_rate': 0.000299214749061113, 'epoch': 2.52}


 25%|██▌       | 740/2930 [19:51<57:26,  1.57s/it]

{'loss': 0.2254, 'grad_norm': 0.2793098986148834, 'learning_rate': 0.000299078183680437, 'epoch': 2.52}


 25%|██▌       | 741/2930 [19:52<57:13,  1.57s/it]

{'loss': 0.2429, 'grad_norm': 0.29770147800445557, 'learning_rate': 0.00029894161829976105, 'epoch': 2.53}


 25%|██▌       | 742/2930 [19:54<57:19,  1.57s/it]

{'loss': 0.2317, 'grad_norm': 0.30253463983535767, 'learning_rate': 0.00029880505291908504, 'epoch': 2.53}


 25%|██▌       | 743/2930 [19:56<57:31,  1.58s/it]

{'loss': 0.267, 'grad_norm': 0.3387569189071655, 'learning_rate': 0.00029866848753840904, 'epoch': 2.53}


 25%|██▌       | 744/2930 [19:57<57:33,  1.58s/it]

{'loss': 0.2554, 'grad_norm': 0.3045158088207245, 'learning_rate': 0.00029853192215773303, 'epoch': 2.54}


 25%|██▌       | 745/2930 [19:59<57:19,  1.57s/it]

{'loss': 0.2355, 'grad_norm': 0.28460338711738586, 'learning_rate': 0.000298395356777057, 'epoch': 2.54}


 25%|██▌       | 746/2930 [20:00<56:51,  1.56s/it]

{'loss': 0.2754, 'grad_norm': 0.3839372396469116, 'learning_rate': 0.000298258791396381, 'epoch': 2.54}


 25%|██▌       | 747/2930 [20:02<57:02,  1.57s/it]

{'loss': 0.2418, 'grad_norm': 0.31111642718315125, 'learning_rate': 0.00029812222601570507, 'epoch': 2.55}


 26%|██▌       | 748/2930 [20:03<56:57,  1.57s/it]

{'loss': 0.2695, 'grad_norm': 0.3175371587276459, 'learning_rate': 0.00029798566063502906, 'epoch': 2.55}


 26%|██▌       | 749/2930 [20:05<56:24,  1.55s/it]

{'loss': 0.2235, 'grad_norm': 0.29508957266807556, 'learning_rate': 0.000297849095254353, 'epoch': 2.55}


 26%|██▌       | 750/2930 [20:07<56:20,  1.55s/it]

{'loss': 0.2414, 'grad_norm': 0.30835410952568054, 'learning_rate': 0.000297712529873677, 'epoch': 2.56}


 26%|██▌       | 751/2930 [20:08<56:26,  1.55s/it]

{'loss': 0.2428, 'grad_norm': 0.3015405833721161, 'learning_rate': 0.00029757596449300105, 'epoch': 2.56}


 26%|██▌       | 752/2930 [20:10<56:28,  1.56s/it]

{'loss': 0.2195, 'grad_norm': 0.2813645899295807, 'learning_rate': 0.00029743939911232504, 'epoch': 2.56}


 26%|██▌       | 753/2930 [20:11<56:33,  1.56s/it]

{'loss': 0.272, 'grad_norm': 0.375727117061615, 'learning_rate': 0.00029730283373164903, 'epoch': 2.57}


 26%|██▌       | 754/2930 [20:13<56:47,  1.57s/it]

{'loss': 0.2239, 'grad_norm': 0.29679930210113525, 'learning_rate': 0.00029716626835097303, 'epoch': 2.57}


 26%|██▌       | 755/2930 [20:14<57:02,  1.57s/it]

{'loss': 0.2487, 'grad_norm': 0.32407018542289734, 'learning_rate': 0.000297029702970297, 'epoch': 2.57}


 26%|██▌       | 756/2930 [20:16<57:15,  1.58s/it]

{'loss': 0.2168, 'grad_norm': 0.3077021837234497, 'learning_rate': 0.000296893137589621, 'epoch': 2.58}


 26%|██▌       | 757/2930 [20:18<57:21,  1.58s/it]

{'loss': 0.2255, 'grad_norm': 0.29967018961906433, 'learning_rate': 0.00029675657220894506, 'epoch': 2.58}


 26%|██▌       | 758/2930 [20:19<57:29,  1.59s/it]

{'loss': 0.237, 'grad_norm': 0.2915712594985962, 'learning_rate': 0.00029662000682826906, 'epoch': 2.58}


 26%|██▌       | 759/2930 [20:21<57:14,  1.58s/it]

{'loss': 0.2386, 'grad_norm': 0.30048152804374695, 'learning_rate': 0.00029648344144759305, 'epoch': 2.59}


 26%|██▌       | 760/2930 [20:22<56:37,  1.57s/it]

{'loss': 0.2408, 'grad_norm': 0.33901247382164, 'learning_rate': 0.00029634687606691705, 'epoch': 2.59}


 26%|██▌       | 761/2930 [20:24<56:32,  1.56s/it]

{'loss': 0.2563, 'grad_norm': 0.3419017195701599, 'learning_rate': 0.00029621031068624104, 'epoch': 2.6}


 26%|██▌       | 762/2930 [20:25<56:20,  1.56s/it]

{'loss': 0.2121, 'grad_norm': 0.2807473838329315, 'learning_rate': 0.00029607374530556504, 'epoch': 2.6}


 26%|██▌       | 763/2930 [20:27<56:12,  1.56s/it]

{'loss': 0.236, 'grad_norm': 0.3045010268688202, 'learning_rate': 0.0002959371799248891, 'epoch': 2.6}


 26%|██▌       | 764/2930 [20:28<56:01,  1.55s/it]

{'loss': 0.2852, 'grad_norm': 0.40098056197166443, 'learning_rate': 0.0002958006145442131, 'epoch': 2.61}


 26%|██▌       | 765/2930 [20:30<55:47,  1.55s/it]

{'loss': 0.2587, 'grad_norm': 0.32072681188583374, 'learning_rate': 0.00029566404916353707, 'epoch': 2.61}


 26%|██▌       | 766/2930 [20:32<56:06,  1.56s/it]

{'loss': 0.2168, 'grad_norm': 0.25665050745010376, 'learning_rate': 0.00029552748378286107, 'epoch': 2.61}


 26%|██▌       | 767/2930 [20:33<56:17,  1.56s/it]

{'loss': 0.2753, 'grad_norm': 0.331119179725647, 'learning_rate': 0.00029539091840218506, 'epoch': 2.62}


 26%|██▌       | 768/2930 [20:35<56:08,  1.56s/it]

{'loss': 0.2432, 'grad_norm': 0.3061767816543579, 'learning_rate': 0.00029525435302150905, 'epoch': 2.62}


 26%|██▌       | 769/2930 [20:36<56:02,  1.56s/it]

{'loss': 0.209, 'grad_norm': 0.25391489267349243, 'learning_rate': 0.00029511778764083305, 'epoch': 2.62}


 26%|██▋       | 770/2930 [20:38<55:59,  1.56s/it]

{'loss': 0.2001, 'grad_norm': 0.2607884407043457, 'learning_rate': 0.0002949812222601571, 'epoch': 2.63}


 26%|██▋       | 771/2930 [20:39<55:43,  1.55s/it]

{'loss': 0.2128, 'grad_norm': 0.3020077645778656, 'learning_rate': 0.00029484465687948104, 'epoch': 2.63}


 26%|██▋       | 772/2930 [20:41<55:46,  1.55s/it]

{'loss': 0.2163, 'grad_norm': 0.2972436547279358, 'learning_rate': 0.00029470809149880503, 'epoch': 2.63}


 26%|██▋       | 773/2930 [20:42<55:53,  1.55s/it]

{'loss': 0.239, 'grad_norm': 0.3389706015586853, 'learning_rate': 0.0002945715261181291, 'epoch': 2.64}


 26%|██▋       | 774/2930 [20:44<55:54,  1.56s/it]

{'loss': 0.2604, 'grad_norm': 0.3657676875591278, 'learning_rate': 0.0002944349607374531, 'epoch': 2.64}


 26%|██▋       | 775/2930 [20:46<55:40,  1.55s/it]

{'loss': 0.2204, 'grad_norm': 0.30844593048095703, 'learning_rate': 0.00029429839535677707, 'epoch': 2.64}


 26%|██▋       | 776/2930 [20:47<55:44,  1.55s/it]

{'loss': 0.267, 'grad_norm': 0.40547457337379456, 'learning_rate': 0.00029416182997610106, 'epoch': 2.65}


 27%|██▋       | 777/2930 [20:49<55:51,  1.56s/it]

{'loss': 0.2142, 'grad_norm': 0.3053847551345825, 'learning_rate': 0.00029402526459542506, 'epoch': 2.65}


 27%|██▋       | 778/2930 [20:50<55:32,  1.55s/it]

{'loss': 0.2655, 'grad_norm': 0.33917510509490967, 'learning_rate': 0.00029388869921474905, 'epoch': 2.65}


 27%|██▋       | 779/2930 [20:52<55:47,  1.56s/it]

{'loss': 0.2176, 'grad_norm': 0.2928614914417267, 'learning_rate': 0.0002937521338340731, 'epoch': 2.66}


 27%|██▋       | 780/2930 [20:53<55:35,  1.55s/it]

{'loss': 0.2429, 'grad_norm': 0.33835721015930176, 'learning_rate': 0.0002936155684533971, 'epoch': 2.66}


 27%|██▋       | 781/2930 [20:55<55:44,  1.56s/it]

{'loss': 0.25, 'grad_norm': 0.28479239344596863, 'learning_rate': 0.0002934790030727211, 'epoch': 2.66}


 27%|██▋       | 782/2930 [20:56<55:52,  1.56s/it]

{'loss': 0.2693, 'grad_norm': 0.30384114384651184, 'learning_rate': 0.0002933424376920451, 'epoch': 2.67}


 27%|██▋       | 783/2930 [20:58<55:50,  1.56s/it]

{'loss': 0.228, 'grad_norm': 0.31234925985336304, 'learning_rate': 0.0002932058723113691, 'epoch': 2.67}


 27%|██▋       | 784/2930 [21:00<55:56,  1.56s/it]

{'loss': 0.2324, 'grad_norm': 0.29555395245552063, 'learning_rate': 0.00029306930693069307, 'epoch': 2.67}


 27%|██▋       | 785/2930 [21:01<55:44,  1.56s/it]

{'loss': 0.2271, 'grad_norm': 0.3023376762866974, 'learning_rate': 0.00029293274155001706, 'epoch': 2.68}


 27%|██▋       | 786/2930 [21:03<55:27,  1.55s/it]

{'loss': 0.2309, 'grad_norm': 0.337980717420578, 'learning_rate': 0.0002927961761693411, 'epoch': 2.68}


 27%|██▋       | 787/2930 [21:04<55:38,  1.56s/it]

{'loss': 0.2522, 'grad_norm': 0.34002307057380676, 'learning_rate': 0.0002926596107886651, 'epoch': 2.68}


 27%|██▋       | 788/2930 [21:06<55:24,  1.55s/it]

{'loss': 0.2241, 'grad_norm': 0.29265618324279785, 'learning_rate': 0.00029252304540798905, 'epoch': 2.69}


 27%|██▋       | 789/2930 [21:07<55:40,  1.56s/it]

{'loss': 0.2481, 'grad_norm': 0.31291118264198303, 'learning_rate': 0.0002923864800273131, 'epoch': 2.69}


 27%|██▋       | 790/2930 [21:09<55:50,  1.57s/it]

{'loss': 0.2058, 'grad_norm': 0.2775956690311432, 'learning_rate': 0.0002922499146466371, 'epoch': 2.69}


 27%|██▋       | 791/2930 [21:11<55:47,  1.56s/it]

{'loss': 0.233, 'grad_norm': 0.2999938726425171, 'learning_rate': 0.0002921133492659611, 'epoch': 2.7}


 27%|██▋       | 792/2930 [21:12<56:01,  1.57s/it]

{'loss': 0.2262, 'grad_norm': 0.2878982126712799, 'learning_rate': 0.00029197678388528513, 'epoch': 2.7}


 27%|██▋       | 793/2930 [21:14<55:54,  1.57s/it]

{'loss': 0.2343, 'grad_norm': 0.32835647463798523, 'learning_rate': 0.00029184021850460907, 'epoch': 2.7}


 27%|██▋       | 794/2930 [21:15<56:10,  1.58s/it]

{'loss': 0.2597, 'grad_norm': 0.35022467374801636, 'learning_rate': 0.00029170365312393307, 'epoch': 2.71}


 27%|██▋       | 795/2930 [21:17<56:16,  1.58s/it]

{'loss': 0.2295, 'grad_norm': 0.3087023198604584, 'learning_rate': 0.0002915670877432571, 'epoch': 2.71}


 27%|██▋       | 796/2930 [21:18<56:33,  1.59s/it]

{'loss': 0.2306, 'grad_norm': 0.289816290140152, 'learning_rate': 0.0002914305223625811, 'epoch': 2.71}


 27%|██▋       | 797/2930 [21:20<56:29,  1.59s/it]

{'loss': 0.225, 'grad_norm': 0.2709210515022278, 'learning_rate': 0.0002912939569819051, 'epoch': 2.72}


 27%|██▋       | 798/2930 [21:22<56:35,  1.59s/it]

{'loss': 0.2105, 'grad_norm': 0.27896231412887573, 'learning_rate': 0.00029115739160122915, 'epoch': 2.72}


 27%|██▋       | 799/2930 [21:23<55:56,  1.58s/it]

{'loss': 0.2105, 'grad_norm': 0.2858116924762726, 'learning_rate': 0.0002910208262205531, 'epoch': 2.72}


 27%|██▋       | 800/2930 [21:25<55:57,  1.58s/it]

{'loss': 0.2157, 'grad_norm': 0.277883380651474, 'learning_rate': 0.0002908842608398771, 'epoch': 2.73}


 27%|██▋       | 801/2930 [21:26<55:47,  1.57s/it]

{'loss': 0.2287, 'grad_norm': 0.33724647760391235, 'learning_rate': 0.0002907476954592011, 'epoch': 2.73}


 27%|██▋       | 802/2930 [21:28<55:30,  1.57s/it]

{'loss': 0.2178, 'grad_norm': 0.3286634385585785, 'learning_rate': 0.0002906111300785251, 'epoch': 2.73}


 27%|██▋       | 803/2930 [21:29<55:22,  1.56s/it]

{'loss': 0.2333, 'grad_norm': 0.3831334710121155, 'learning_rate': 0.0002904745646978491, 'epoch': 2.74}


 27%|██▋       | 804/2930 [21:31<55:39,  1.57s/it]

{'loss': 0.1906, 'grad_norm': 0.29567062854766846, 'learning_rate': 0.0002903379993171731, 'epoch': 2.74}


 27%|██▋       | 805/2930 [21:33<55:40,  1.57s/it]

{'loss': 0.2605, 'grad_norm': 0.3941294550895691, 'learning_rate': 0.0002902014339364971, 'epoch': 2.75}


 28%|██▊       | 806/2930 [21:34<55:29,  1.57s/it]

{'loss': 0.2251, 'grad_norm': 0.32062381505966187, 'learning_rate': 0.0002900648685558211, 'epoch': 2.75}


 28%|██▊       | 807/2930 [21:36<55:33,  1.57s/it]

{'loss': 0.2225, 'grad_norm': 0.2895907759666443, 'learning_rate': 0.0002899283031751451, 'epoch': 2.75}


 28%|██▊       | 808/2930 [21:37<55:32,  1.57s/it]

{'loss': 0.229, 'grad_norm': 0.3003113269805908, 'learning_rate': 0.00028979173779446915, 'epoch': 2.76}


 28%|██▊       | 809/2930 [21:39<55:07,  1.56s/it]

{'loss': 0.271, 'grad_norm': 0.3109486997127533, 'learning_rate': 0.00028965517241379314, 'epoch': 2.76}


 28%|██▊       | 810/2930 [21:40<55:25,  1.57s/it]

{'loss': 0.2098, 'grad_norm': 0.27477288246154785, 'learning_rate': 0.0002895186070331171, 'epoch': 2.76}


 28%|██▊       | 811/2930 [21:42<55:48,  1.58s/it]

{'loss': 0.2158, 'grad_norm': 0.26171064376831055, 'learning_rate': 0.00028938204165244113, 'epoch': 2.77}


 28%|██▊       | 812/2930 [21:44<55:58,  1.59s/it]

{'loss': 0.2542, 'grad_norm': 0.31512537598609924, 'learning_rate': 0.0002892454762717651, 'epoch': 2.77}


 28%|██▊       | 813/2930 [21:45<57:29,  1.63s/it]

{'loss': 0.2014, 'grad_norm': 0.2640608549118042, 'learning_rate': 0.0002891089108910891, 'epoch': 2.77}


 28%|██▊       | 814/2930 [21:47<58:03,  1.65s/it]

{'loss': 0.2184, 'grad_norm': 0.3081047832965851, 'learning_rate': 0.00028897234551041316, 'epoch': 2.78}


 28%|██▊       | 815/2930 [21:49<57:54,  1.64s/it]

{'loss': 0.2097, 'grad_norm': 0.3164066672325134, 'learning_rate': 0.00028883578012973716, 'epoch': 2.78}


 28%|██▊       | 816/2930 [21:50<57:17,  1.63s/it]

{'loss': 0.225, 'grad_norm': 0.37459537386894226, 'learning_rate': 0.0002886992147490611, 'epoch': 2.78}


 28%|██▊       | 817/2930 [21:52<57:39,  1.64s/it]

{'loss': 0.224, 'grad_norm': 0.385387659072876, 'learning_rate': 0.0002885626493683851, 'epoch': 2.79}


 28%|██▊       | 818/2930 [21:54<57:48,  1.64s/it]

{'loss': 0.2188, 'grad_norm': 0.3204666078090668, 'learning_rate': 0.00028842608398770914, 'epoch': 2.79}


 28%|██▊       | 819/2930 [21:55<57:15,  1.63s/it]

{'loss': 0.1942, 'grad_norm': 0.2979351282119751, 'learning_rate': 0.00028828951860703314, 'epoch': 2.79}


 28%|██▊       | 820/2930 [21:57<56:56,  1.62s/it]

{'loss': 0.2156, 'grad_norm': 0.2948337197303772, 'learning_rate': 0.00028815295322635713, 'epoch': 2.8}


 28%|██▊       | 821/2930 [21:58<56:31,  1.61s/it]

{'loss': 0.2009, 'grad_norm': 0.29753392934799194, 'learning_rate': 0.0002880163878456811, 'epoch': 2.8}


 28%|██▊       | 822/2930 [22:00<56:19,  1.60s/it]

{'loss': 0.2272, 'grad_norm': 0.3556009531021118, 'learning_rate': 0.0002878798224650051, 'epoch': 2.8}


 28%|██▊       | 823/2930 [22:01<55:43,  1.59s/it]

{'loss': 0.2069, 'grad_norm': 0.3112938106060028, 'learning_rate': 0.0002877432570843291, 'epoch': 2.81}


 28%|██▊       | 824/2930 [22:03<55:46,  1.59s/it]

{'loss': 0.2376, 'grad_norm': 0.32406190037727356, 'learning_rate': 0.00028760669170365316, 'epoch': 2.81}


 28%|██▊       | 825/2930 [22:05<56:05,  1.60s/it]

{'loss': 0.2058, 'grad_norm': 0.28721246123313904, 'learning_rate': 0.00028747012632297715, 'epoch': 2.81}


 28%|██▊       | 826/2930 [22:06<56:53,  1.62s/it]

{'loss': 0.2429, 'grad_norm': 0.3482193946838379, 'learning_rate': 0.00028733356094230115, 'epoch': 2.82}


 28%|██▊       | 827/2930 [22:08<57:16,  1.63s/it]

{'loss': 0.2124, 'grad_norm': 0.31544020771980286, 'learning_rate': 0.00028719699556162514, 'epoch': 2.82}


 28%|██▊       | 828/2930 [22:10<57:32,  1.64s/it]

{'loss': 0.2158, 'grad_norm': 0.3302372395992279, 'learning_rate': 0.00028706043018094914, 'epoch': 2.82}


 28%|██▊       | 829/2930 [22:11<57:35,  1.64s/it]

{'loss': 0.235, 'grad_norm': 0.4201676547527313, 'learning_rate': 0.00028692386480027313, 'epoch': 2.83}


 28%|██▊       | 830/2930 [22:13<57:04,  1.63s/it]

{'loss': 0.2338, 'grad_norm': 0.36885708570480347, 'learning_rate': 0.0002867872994195972, 'epoch': 2.83}


 28%|██▊       | 831/2930 [22:15<56:53,  1.63s/it]

{'loss': 0.2322, 'grad_norm': 0.32017508149147034, 'learning_rate': 0.0002866507340389212, 'epoch': 2.83}


 28%|██▊       | 832/2930 [22:16<56:05,  1.60s/it]

{'loss': 0.2192, 'grad_norm': 0.29086485505104065, 'learning_rate': 0.00028651416865824517, 'epoch': 2.84}


 28%|██▊       | 833/2930 [22:18<56:00,  1.60s/it]

{'loss': 0.2202, 'grad_norm': 0.27868613600730896, 'learning_rate': 0.0002863776032775691, 'epoch': 2.84}


 28%|██▊       | 834/2930 [22:19<56:01,  1.60s/it]

{'loss': 0.2256, 'grad_norm': 0.3060135841369629, 'learning_rate': 0.00028624103789689316, 'epoch': 2.84}


 28%|██▊       | 835/2930 [22:21<56:00,  1.60s/it]

{'loss': 0.2933, 'grad_norm': 0.37719663977622986, 'learning_rate': 0.00028610447251621715, 'epoch': 2.85}


 29%|██▊       | 836/2930 [22:23<55:43,  1.60s/it]

{'loss': 0.2059, 'grad_norm': 0.2985701560974121, 'learning_rate': 0.00028596790713554114, 'epoch': 2.85}


 29%|██▊       | 837/2930 [22:24<55:33,  1.59s/it]

{'loss': 0.2138, 'grad_norm': 0.28321656584739685, 'learning_rate': 0.0002858313417548652, 'epoch': 2.85}


 29%|██▊       | 838/2930 [22:26<55:23,  1.59s/it]

{'loss': 0.2475, 'grad_norm': 0.35457703471183777, 'learning_rate': 0.00028569477637418913, 'epoch': 2.86}


 29%|██▊       | 839/2930 [22:27<55:11,  1.58s/it]

{'loss': 0.2512, 'grad_norm': 0.3412846624851227, 'learning_rate': 0.0002855582109935131, 'epoch': 2.86}


 29%|██▊       | 840/2930 [22:29<54:57,  1.58s/it]

{'loss': 0.2417, 'grad_norm': 0.3182961344718933, 'learning_rate': 0.0002854216456128372, 'epoch': 2.86}


 29%|██▊       | 841/2930 [22:30<55:22,  1.59s/it]

{'loss': 0.2087, 'grad_norm': 0.3019261360168457, 'learning_rate': 0.00028528508023216117, 'epoch': 2.87}


 29%|██▊       | 842/2930 [22:32<55:58,  1.61s/it]

{'loss': 0.2004, 'grad_norm': 0.33955439925193787, 'learning_rate': 0.00028514851485148516, 'epoch': 2.87}


 29%|██▉       | 843/2930 [22:34<55:57,  1.61s/it]

{'loss': 0.2051, 'grad_norm': 0.2843453288078308, 'learning_rate': 0.00028501194947080916, 'epoch': 2.87}


 29%|██▉       | 844/2930 [22:35<56:46,  1.63s/it]

{'loss': 0.2256, 'grad_norm': 0.33285313844680786, 'learning_rate': 0.00028487538409013315, 'epoch': 2.88}


 29%|██▉       | 845/2930 [22:37<56:19,  1.62s/it]

{'loss': 0.2525, 'grad_norm': 0.3417319357395172, 'learning_rate': 0.00028473881870945715, 'epoch': 2.88}


 29%|██▉       | 846/2930 [22:39<56:04,  1.61s/it]

{'loss': 0.2183, 'grad_norm': 0.307253897190094, 'learning_rate': 0.0002846022533287812, 'epoch': 2.88}


 29%|██▉       | 847/2930 [22:40<55:40,  1.60s/it]

{'loss': 0.2245, 'grad_norm': 0.31560319662094116, 'learning_rate': 0.0002844656879481052, 'epoch': 2.89}


 29%|██▉       | 848/2930 [22:42<55:24,  1.60s/it]

{'loss': 0.24, 'grad_norm': 0.30729740858078003, 'learning_rate': 0.0002843291225674292, 'epoch': 2.89}


 29%|██▉       | 849/2930 [22:43<55:25,  1.60s/it]

{'loss': 0.2405, 'grad_norm': 0.3633016049861908, 'learning_rate': 0.0002841925571867532, 'epoch': 2.9}


 29%|██▉       | 850/2930 [22:45<55:27,  1.60s/it]

{'loss': 0.2162, 'grad_norm': 0.3415476083755493, 'learning_rate': 0.00028405599180607717, 'epoch': 2.9}


 29%|██▉       | 851/2930 [22:47<55:18,  1.60s/it]

{'loss': 0.1985, 'grad_norm': 0.2885994613170624, 'learning_rate': 0.00028391942642540116, 'epoch': 2.9}


 29%|██▉       | 852/2930 [22:48<55:32,  1.60s/it]

{'loss': 0.2253, 'grad_norm': 0.3241797983646393, 'learning_rate': 0.00028378286104472516, 'epoch': 2.91}


 29%|██▉       | 853/2930 [22:50<55:29,  1.60s/it]

{'loss': 0.212, 'grad_norm': 0.3000405728816986, 'learning_rate': 0.0002836462956640492, 'epoch': 2.91}


 29%|██▉       | 854/2930 [22:51<55:32,  1.61s/it]

{'loss': 0.197, 'grad_norm': 0.31546875834465027, 'learning_rate': 0.0002835097302833732, 'epoch': 2.91}


 29%|██▉       | 855/2930 [22:53<55:28,  1.60s/it]

{'loss': 0.2417, 'grad_norm': 0.36185017228126526, 'learning_rate': 0.00028337316490269714, 'epoch': 2.92}


 29%|██▉       | 856/2930 [22:55<55:26,  1.60s/it]

{'loss': 0.188, 'grad_norm': 0.3032544255256653, 'learning_rate': 0.0002832365995220212, 'epoch': 2.92}


 29%|██▉       | 857/2930 [22:56<55:53,  1.62s/it]

{'loss': 0.2197, 'grad_norm': 0.34311795234680176, 'learning_rate': 0.0002831000341413452, 'epoch': 2.92}


 29%|██▉       | 858/2930 [22:58<55:46,  1.61s/it]

{'loss': 0.2491, 'grad_norm': 0.36230790615081787, 'learning_rate': 0.0002829634687606692, 'epoch': 2.93}


 29%|██▉       | 859/2930 [22:59<56:01,  1.62s/it]

{'loss': 0.2287, 'grad_norm': 0.3379611670970917, 'learning_rate': 0.0002828269033799932, 'epoch': 2.93}


 29%|██▉       | 860/2930 [23:01<56:33,  1.64s/it]

{'loss': 0.2269, 'grad_norm': 0.3311735987663269, 'learning_rate': 0.00028269033799931717, 'epoch': 2.93}


 29%|██▉       | 861/2930 [23:03<56:32,  1.64s/it]

{'loss': 0.2073, 'grad_norm': 0.44382908940315247, 'learning_rate': 0.00028255377261864116, 'epoch': 2.94}


 29%|██▉       | 862/2930 [23:04<55:54,  1.62s/it]

{'loss': 0.2261, 'grad_norm': 0.302861750125885, 'learning_rate': 0.0002824172072379652, 'epoch': 2.94}


 29%|██▉       | 863/2930 [23:06<55:51,  1.62s/it]

{'loss': 0.2323, 'grad_norm': 0.34865373373031616, 'learning_rate': 0.0002822806418572892, 'epoch': 2.94}


 29%|██▉       | 864/2930 [23:08<55:48,  1.62s/it]

{'loss': 0.2334, 'grad_norm': 0.3242364823818207, 'learning_rate': 0.0002821440764766132, 'epoch': 2.95}


 30%|██▉       | 865/2930 [23:09<56:17,  1.64s/it]

{'loss': 0.2311, 'grad_norm': 0.3911856412887573, 'learning_rate': 0.0002820075110959372, 'epoch': 2.95}


 30%|██▉       | 866/2930 [23:11<56:31,  1.64s/it]

{'loss': 0.222, 'grad_norm': 0.3208194375038147, 'learning_rate': 0.0002818709457152612, 'epoch': 2.95}


 30%|██▉       | 867/2930 [23:13<56:40,  1.65s/it]

{'loss': 0.227, 'grad_norm': 0.35959184169769287, 'learning_rate': 0.0002817343803345852, 'epoch': 2.96}


 30%|██▉       | 868/2930 [23:14<56:31,  1.64s/it]

{'loss': 0.2189, 'grad_norm': 0.36242541670799255, 'learning_rate': 0.0002815978149539092, 'epoch': 2.96}


 30%|██▉       | 869/2930 [23:16<56:51,  1.66s/it]

{'loss': 0.2055, 'grad_norm': 0.33084914088249207, 'learning_rate': 0.0002814612495732332, 'epoch': 2.96}


 30%|██▉       | 870/2930 [23:18<56:46,  1.65s/it]

{'loss': 0.2118, 'grad_norm': 0.3917248249053955, 'learning_rate': 0.0002813246841925572, 'epoch': 2.97}


 30%|██▉       | 871/2930 [23:19<56:41,  1.65s/it]

{'loss': 0.2173, 'grad_norm': 0.3457864820957184, 'learning_rate': 0.0002811881188118812, 'epoch': 2.97}


 30%|██▉       | 872/2930 [23:21<56:46,  1.66s/it]

{'loss': 0.2106, 'grad_norm': 0.3480873107910156, 'learning_rate': 0.0002810515534312052, 'epoch': 2.97}


 30%|██▉       | 873/2930 [23:23<57:09,  1.67s/it]

{'loss': 0.1929, 'grad_norm': 0.3174504041671753, 'learning_rate': 0.0002809149880505292, 'epoch': 2.98}


 30%|██▉       | 874/2930 [23:24<57:37,  1.68s/it]

{'loss': 0.204, 'grad_norm': 0.3127475380897522, 'learning_rate': 0.0002807784226698532, 'epoch': 2.98}


 30%|██▉       | 875/2930 [23:26<57:18,  1.67s/it]

{'loss': 0.2395, 'grad_norm': 0.34543007612228394, 'learning_rate': 0.00028064185728917724, 'epoch': 2.98}


 30%|██▉       | 876/2930 [23:28<57:10,  1.67s/it]

{'loss': 0.2401, 'grad_norm': 0.312093049287796, 'learning_rate': 0.00028050529190850124, 'epoch': 2.99}


 30%|██▉       | 877/2930 [23:29<57:01,  1.67s/it]

{'loss': 0.2174, 'grad_norm': 0.2829693555831909, 'learning_rate': 0.0002803687265278252, 'epoch': 2.99}


 30%|██▉       | 878/2930 [23:31<56:35,  1.65s/it]

{'loss': 0.2237, 'grad_norm': 0.3093150854110718, 'learning_rate': 0.0002802321611471492, 'epoch': 2.99}


 30%|███       | 879/2930 [23:32<55:32,  1.63s/it]

{'loss': 0.1952, 'grad_norm': 0.26797592639923096, 'learning_rate': 0.0002800955957664732, 'epoch': 3.0}


 30%|███       | 880/2930 [23:34<55:19,  1.62s/it]

{'loss': 0.2307, 'grad_norm': 0.32668888568878174, 'learning_rate': 0.0002799590303857972, 'epoch': 3.0}


 30%|███       | 881/2930 [23:36<54:56,  1.61s/it]

{'loss': 0.1613, 'grad_norm': 0.2911737561225891, 'learning_rate': 0.00027982246500512126, 'epoch': 3.0}


 30%|███       | 882/2930 [23:37<54:55,  1.61s/it]

{'loss': 0.1786, 'grad_norm': 0.3687017261981964, 'learning_rate': 0.0002796858996244452, 'epoch': 3.01}


 30%|███       | 883/2930 [23:39<54:58,  1.61s/it]

{'loss': 0.1829, 'grad_norm': 0.36060306429862976, 'learning_rate': 0.0002795493342437692, 'epoch': 3.01}


 30%|███       | 884/2930 [23:40<54:44,  1.61s/it]

{'loss': 0.1842, 'grad_norm': 0.5352655053138733, 'learning_rate': 0.0002794127688630932, 'epoch': 3.01}


 30%|███       | 885/2930 [23:42<54:34,  1.60s/it]

{'loss': 0.1616, 'grad_norm': 0.3576361835002899, 'learning_rate': 0.00027927620348241724, 'epoch': 3.02}


 30%|███       | 886/2930 [23:44<54:16,  1.59s/it]

{'loss': 0.1793, 'grad_norm': 0.36187607049942017, 'learning_rate': 0.00027913963810174123, 'epoch': 3.02}


 30%|███       | 887/2930 [23:45<53:58,  1.59s/it]

{'loss': 0.1438, 'grad_norm': 0.2912381887435913, 'learning_rate': 0.0002790030727210652, 'epoch': 3.02}


 30%|███       | 888/2930 [23:47<53:29,  1.57s/it]

{'loss': 0.163, 'grad_norm': 0.3564930856227875, 'learning_rate': 0.0002788665073403892, 'epoch': 3.03}


 30%|███       | 889/2930 [23:48<53:11,  1.56s/it]

{'loss': 0.1647, 'grad_norm': 0.37539878487586975, 'learning_rate': 0.0002787299419597132, 'epoch': 3.03}


 30%|███       | 890/2930 [23:50<53:09,  1.56s/it]

{'loss': 0.1623, 'grad_norm': 0.34697988629341125, 'learning_rate': 0.0002785933765790372, 'epoch': 3.03}


 30%|███       | 891/2930 [23:51<53:17,  1.57s/it]

{'loss': 0.1649, 'grad_norm': 0.3390069305896759, 'learning_rate': 0.00027845681119836126, 'epoch': 3.04}


 30%|███       | 892/2930 [23:53<53:06,  1.56s/it]

{'loss': 0.1948, 'grad_norm': 0.42260754108428955, 'learning_rate': 0.00027832024581768525, 'epoch': 3.04}


 30%|███       | 893/2930 [23:55<53:09,  1.57s/it]

{'loss': 0.1745, 'grad_norm': 0.37667495012283325, 'learning_rate': 0.00027818368043700924, 'epoch': 3.05}


 31%|███       | 894/2930 [23:56<52:52,  1.56s/it]

{'loss': 0.1706, 'grad_norm': 0.33015862107276917, 'learning_rate': 0.00027804711505633324, 'epoch': 3.05}


 31%|███       | 895/2930 [23:58<52:57,  1.56s/it]

{'loss': 0.1989, 'grad_norm': 0.41323158144950867, 'learning_rate': 0.00027791054967565723, 'epoch': 3.05}


 31%|███       | 896/2930 [23:59<52:59,  1.56s/it]

{'loss': 0.179, 'grad_norm': 0.32872530817985535, 'learning_rate': 0.0002777739842949812, 'epoch': 3.06}


 31%|███       | 897/2930 [24:01<52:58,  1.56s/it]

{'loss': 0.1684, 'grad_norm': 0.32055896520614624, 'learning_rate': 0.0002776374189143053, 'epoch': 3.06}


 31%|███       | 898/2930 [24:02<52:49,  1.56s/it]

{'loss': 0.1767, 'grad_norm': 0.3418472111225128, 'learning_rate': 0.00027750085353362927, 'epoch': 3.06}


 31%|███       | 899/2930 [24:04<52:54,  1.56s/it]

{'loss': 0.1859, 'grad_norm': 0.43275657296180725, 'learning_rate': 0.0002773642881529532, 'epoch': 3.07}


 31%|███       | 900/2930 [24:05<52:54,  1.56s/it]

{'loss': 0.1778, 'grad_norm': 0.3546399176120758, 'learning_rate': 0.0002772277227722772, 'epoch': 3.07}


 31%|███       | 901/2930 [24:07<52:55,  1.57s/it]

{'loss': 0.1915, 'grad_norm': 0.4182811677455902, 'learning_rate': 0.00027709115739160125, 'epoch': 3.07}


 31%|███       | 902/2930 [24:09<52:51,  1.56s/it]

{'loss': 0.1907, 'grad_norm': 0.4132921099662781, 'learning_rate': 0.00027695459201092525, 'epoch': 3.08}


 31%|███       | 903/2930 [24:10<52:36,  1.56s/it]

{'loss': 0.166, 'grad_norm': 0.3578316271305084, 'learning_rate': 0.00027681802663024924, 'epoch': 3.08}


 31%|███       | 904/2930 [24:12<52:26,  1.55s/it]

{'loss': 0.1865, 'grad_norm': 0.4101463556289673, 'learning_rate': 0.00027668146124957323, 'epoch': 3.08}


 31%|███       | 905/2930 [24:13<52:20,  1.55s/it]

{'loss': 0.1871, 'grad_norm': 0.3503395915031433, 'learning_rate': 0.00027654489586889723, 'epoch': 3.09}


 31%|███       | 906/2930 [24:15<52:23,  1.55s/it]

{'loss': 0.1705, 'grad_norm': 0.30999404191970825, 'learning_rate': 0.0002764083304882212, 'epoch': 3.09}


 31%|███       | 907/2930 [24:16<52:20,  1.55s/it]

{'loss': 0.1556, 'grad_norm': 0.32750338315963745, 'learning_rate': 0.00027627176510754527, 'epoch': 3.09}


 31%|███       | 908/2930 [24:18<52:26,  1.56s/it]

{'loss': 0.1814, 'grad_norm': 0.3421001732349396, 'learning_rate': 0.00027613519972686926, 'epoch': 3.1}


 31%|███       | 909/2930 [24:19<52:29,  1.56s/it]

{'loss': 0.1917, 'grad_norm': 0.41279077529907227, 'learning_rate': 0.00027599863434619326, 'epoch': 3.1}


 31%|███       | 910/2930 [24:21<52:31,  1.56s/it]

{'loss': 0.195, 'grad_norm': 0.3888916075229645, 'learning_rate': 0.00027586206896551725, 'epoch': 3.1}


 31%|███       | 911/2930 [24:23<52:30,  1.56s/it]

{'loss': 0.1641, 'grad_norm': 0.3273974061012268, 'learning_rate': 0.00027572550358484125, 'epoch': 3.11}


 31%|███       | 912/2930 [24:24<52:37,  1.56s/it]

{'loss': 0.1896, 'grad_norm': 0.35132208466529846, 'learning_rate': 0.00027558893820416524, 'epoch': 3.11}


 31%|███       | 913/2930 [24:26<52:40,  1.57s/it]

{'loss': 0.186, 'grad_norm': 0.3431086540222168, 'learning_rate': 0.0002754523728234893, 'epoch': 3.11}


 31%|███       | 914/2930 [24:27<52:26,  1.56s/it]

{'loss': 0.1541, 'grad_norm': 0.3244113326072693, 'learning_rate': 0.0002753158074428133, 'epoch': 3.12}


 31%|███       | 915/2930 [24:29<52:31,  1.56s/it]

{'loss': 0.1827, 'grad_norm': 0.33612948656082153, 'learning_rate': 0.0002751792420621373, 'epoch': 3.12}


 31%|███▏      | 916/2930 [24:30<52:35,  1.57s/it]

{'loss': 0.1584, 'grad_norm': 0.3407977521419525, 'learning_rate': 0.0002750426766814612, 'epoch': 3.12}


 31%|███▏      | 917/2930 [24:32<52:38,  1.57s/it]

{'loss': 0.1655, 'grad_norm': 0.3577534258365631, 'learning_rate': 0.00027490611130078527, 'epoch': 3.13}


 31%|███▏      | 918/2930 [24:34<52:18,  1.56s/it]

{'loss': 0.1851, 'grad_norm': 0.4050396680831909, 'learning_rate': 0.00027476954592010926, 'epoch': 3.13}


 31%|███▏      | 919/2930 [24:35<52:05,  1.55s/it]

{'loss': 0.2033, 'grad_norm': 0.4126919209957123, 'learning_rate': 0.00027463298053943325, 'epoch': 3.13}


 31%|███▏      | 920/2930 [24:37<52:02,  1.55s/it]

{'loss': 0.1651, 'grad_norm': 0.3916444182395935, 'learning_rate': 0.0002744964151587573, 'epoch': 3.14}


 31%|███▏      | 921/2930 [24:38<52:16,  1.56s/it]

{'loss': 0.1736, 'grad_norm': 0.3322475850582123, 'learning_rate': 0.00027435984977808124, 'epoch': 3.14}


 31%|███▏      | 922/2930 [24:40<52:04,  1.56s/it]

{'loss': 0.1576, 'grad_norm': 0.3390955328941345, 'learning_rate': 0.00027422328439740524, 'epoch': 3.14}


 32%|███▏      | 923/2930 [24:41<52:05,  1.56s/it]

{'loss': 0.1699, 'grad_norm': 0.36328279972076416, 'learning_rate': 0.0002740867190167293, 'epoch': 3.15}


 32%|███▏      | 924/2930 [24:43<52:15,  1.56s/it]

{'loss': 0.1837, 'grad_norm': 0.4026196002960205, 'learning_rate': 0.0002739501536360533, 'epoch': 3.15}


 32%|███▏      | 925/2930 [24:44<52:01,  1.56s/it]

{'loss': 0.1813, 'grad_norm': 0.4087141752243042, 'learning_rate': 0.0002738135882553773, 'epoch': 3.15}


 32%|███▏      | 926/2930 [24:46<52:05,  1.56s/it]

{'loss': 0.1868, 'grad_norm': 0.376098096370697, 'learning_rate': 0.00027367702287470127, 'epoch': 3.16}


 32%|███▏      | 927/2930 [24:48<51:58,  1.56s/it]

{'loss': 0.1862, 'grad_norm': 0.3931495249271393, 'learning_rate': 0.00027354045749402526, 'epoch': 3.16}


 32%|███▏      | 928/2930 [24:49<51:50,  1.55s/it]

{'loss': 0.1652, 'grad_norm': 0.3501856327056885, 'learning_rate': 0.00027340389211334926, 'epoch': 3.16}


 32%|███▏      | 929/2930 [24:51<52:04,  1.56s/it]

{'loss': 0.1895, 'grad_norm': 0.3527006506919861, 'learning_rate': 0.0002732673267326733, 'epoch': 3.17}


 32%|███▏      | 930/2930 [24:52<51:53,  1.56s/it]

{'loss': 0.1835, 'grad_norm': 0.34535154700279236, 'learning_rate': 0.0002731307613519973, 'epoch': 3.17}


 32%|███▏      | 931/2930 [24:54<51:48,  1.56s/it]

{'loss': 0.1786, 'grad_norm': 0.38135215640068054, 'learning_rate': 0.0002729941959713213, 'epoch': 3.17}


 32%|███▏      | 932/2930 [24:55<51:44,  1.55s/it]

{'loss': 0.1778, 'grad_norm': 0.3445679843425751, 'learning_rate': 0.0002728576305906453, 'epoch': 3.18}


 32%|███▏      | 933/2930 [24:57<51:48,  1.56s/it]

{'loss': 0.1586, 'grad_norm': 0.323022723197937, 'learning_rate': 0.0002727210652099693, 'epoch': 3.18}


 32%|███▏      | 934/2930 [24:58<51:43,  1.56s/it]

{'loss': 0.178, 'grad_norm': 0.3748306334018707, 'learning_rate': 0.0002725844998292933, 'epoch': 3.18}


 32%|███▏      | 935/2930 [25:00<51:50,  1.56s/it]

{'loss': 0.1732, 'grad_norm': 0.36419615149497986, 'learning_rate': 0.00027244793444861727, 'epoch': 3.19}


 32%|███▏      | 936/2930 [25:02<51:54,  1.56s/it]

{'loss': 0.1877, 'grad_norm': 0.4152122437953949, 'learning_rate': 0.0002723113690679413, 'epoch': 3.19}


 32%|███▏      | 937/2930 [25:03<51:54,  1.56s/it]

{'loss': 0.1787, 'grad_norm': 0.39021027088165283, 'learning_rate': 0.0002721748036872653, 'epoch': 3.2}


 32%|███▏      | 938/2930 [25:05<51:46,  1.56s/it]

{'loss': 0.178, 'grad_norm': 0.3665179908275604, 'learning_rate': 0.00027203823830658925, 'epoch': 3.2}


 32%|███▏      | 939/2930 [25:06<51:22,  1.55s/it]

{'loss': 0.174, 'grad_norm': 0.36268332600593567, 'learning_rate': 0.0002719016729259133, 'epoch': 3.2}


 32%|███▏      | 940/2930 [25:08<51:27,  1.55s/it]

{'loss': 0.1662, 'grad_norm': 0.3188149034976959, 'learning_rate': 0.0002717651075452373, 'epoch': 3.21}


 32%|███▏      | 941/2930 [25:09<51:30,  1.55s/it]

{'loss': 0.166, 'grad_norm': 0.31554487347602844, 'learning_rate': 0.0002716285421645613, 'epoch': 3.21}


 32%|███▏      | 942/2930 [25:11<51:44,  1.56s/it]

{'loss': 0.1708, 'grad_norm': 0.329500287771225, 'learning_rate': 0.00027149197678388534, 'epoch': 3.21}


 32%|███▏      | 943/2930 [25:12<51:52,  1.57s/it]

{'loss': 0.1728, 'grad_norm': 0.33761778473854065, 'learning_rate': 0.0002713554114032093, 'epoch': 3.22}


 32%|███▏      | 944/2930 [25:14<52:01,  1.57s/it]

{'loss': 0.1841, 'grad_norm': 0.3661269247531891, 'learning_rate': 0.00027121884602253327, 'epoch': 3.22}


 32%|███▏      | 945/2930 [25:16<51:56,  1.57s/it]

{'loss': 0.147, 'grad_norm': 0.3403746485710144, 'learning_rate': 0.0002710822806418573, 'epoch': 3.22}


 32%|███▏      | 946/2930 [25:17<51:55,  1.57s/it]

{'loss': 0.1693, 'grad_norm': 0.3917948603630066, 'learning_rate': 0.0002709457152611813, 'epoch': 3.23}


 32%|███▏      | 947/2930 [25:19<52:15,  1.58s/it]

{'loss': 0.1584, 'grad_norm': 0.37389424443244934, 'learning_rate': 0.0002708091498805053, 'epoch': 3.23}


 32%|███▏      | 948/2930 [25:20<52:15,  1.58s/it]

{'loss': 0.1555, 'grad_norm': 0.3504187762737274, 'learning_rate': 0.0002706725844998293, 'epoch': 3.23}


 32%|███▏      | 949/2930 [25:22<52:17,  1.58s/it]

{'loss': 0.1828, 'grad_norm': 0.37423762679100037, 'learning_rate': 0.0002705360191191533, 'epoch': 3.24}


 32%|███▏      | 950/2930 [25:24<52:10,  1.58s/it]

{'loss': 0.1773, 'grad_norm': 0.37056103348731995, 'learning_rate': 0.0002703994537384773, 'epoch': 3.24}


 32%|███▏      | 951/2930 [25:25<52:02,  1.58s/it]

{'loss': 0.1615, 'grad_norm': 0.3522740304470062, 'learning_rate': 0.0002702628883578013, 'epoch': 3.24}


 32%|███▏      | 952/2930 [25:27<51:35,  1.56s/it]

{'loss': 0.1718, 'grad_norm': 0.32870739698410034, 'learning_rate': 0.00027012632297712533, 'epoch': 3.25}


 33%|███▎      | 953/2930 [25:28<51:31,  1.56s/it]

{'loss': 0.1632, 'grad_norm': 0.32489290833473206, 'learning_rate': 0.0002699897575964493, 'epoch': 3.25}


 33%|███▎      | 954/2930 [25:30<51:36,  1.57s/it]

{'loss': 0.1589, 'grad_norm': 0.34444108605384827, 'learning_rate': 0.0002698531922157733, 'epoch': 3.25}


 33%|███▎      | 955/2930 [25:31<51:35,  1.57s/it]

{'loss': 0.1776, 'grad_norm': 0.3620789051055908, 'learning_rate': 0.0002697166268350973, 'epoch': 3.26}


 33%|███▎      | 956/2930 [25:33<51:26,  1.56s/it]

{'loss': 0.1839, 'grad_norm': 0.4079180359840393, 'learning_rate': 0.0002695800614544213, 'epoch': 3.26}


 33%|███▎      | 957/2930 [25:35<51:43,  1.57s/it]

{'loss': 0.1541, 'grad_norm': 0.32887786626815796, 'learning_rate': 0.0002694434960737453, 'epoch': 3.26}


 33%|███▎      | 958/2930 [25:36<51:57,  1.58s/it]

{'loss': 0.1674, 'grad_norm': 0.365554541349411, 'learning_rate': 0.00026930693069306935, 'epoch': 3.27}


 33%|███▎      | 959/2930 [25:38<51:46,  1.58s/it]

{'loss': 0.1618, 'grad_norm': 0.33154651522636414, 'learning_rate': 0.00026917036531239335, 'epoch': 3.27}


 33%|███▎      | 960/2930 [25:39<51:49,  1.58s/it]

{'loss': 0.1746, 'grad_norm': 0.37256842851638794, 'learning_rate': 0.0002690337999317173, 'epoch': 3.27}


 33%|███▎      | 961/2930 [25:41<51:41,  1.57s/it]

{'loss': 0.147, 'grad_norm': 0.2974895238876343, 'learning_rate': 0.00026889723455104133, 'epoch': 3.28}


 33%|███▎      | 962/2930 [25:42<51:50,  1.58s/it]

{'loss': 0.1332, 'grad_norm': 0.29883691668510437, 'learning_rate': 0.00026876066917036533, 'epoch': 3.28}


 33%|███▎      | 963/2930 [25:44<51:49,  1.58s/it]

{'loss': 0.1951, 'grad_norm': 0.4404398798942566, 'learning_rate': 0.0002686241037896893, 'epoch': 3.28}


 33%|███▎      | 964/2930 [25:46<51:50,  1.58s/it]

{'loss': 0.1911, 'grad_norm': 0.4034886658191681, 'learning_rate': 0.0002684875384090133, 'epoch': 3.29}


 33%|███▎      | 965/2930 [25:47<51:28,  1.57s/it]

{'loss': 0.1435, 'grad_norm': 0.3069871664047241, 'learning_rate': 0.00026835097302833736, 'epoch': 3.29}


 33%|███▎      | 966/2930 [25:49<51:42,  1.58s/it]

{'loss': 0.1774, 'grad_norm': 0.3404713273048401, 'learning_rate': 0.0002682144076476613, 'epoch': 3.29}


 33%|███▎      | 967/2930 [25:50<51:27,  1.57s/it]

{'loss': 0.1645, 'grad_norm': 0.37029045820236206, 'learning_rate': 0.0002680778422669853, 'epoch': 3.3}


 33%|███▎      | 968/2930 [25:52<51:21,  1.57s/it]

{'loss': 0.1765, 'grad_norm': 0.3861466646194458, 'learning_rate': 0.00026794127688630935, 'epoch': 3.3}


 33%|███▎      | 969/2930 [25:53<51:36,  1.58s/it]

{'loss': 0.1942, 'grad_norm': 0.386334627866745, 'learning_rate': 0.00026780471150563334, 'epoch': 3.3}


 33%|███▎      | 970/2930 [25:55<51:18,  1.57s/it]

{'loss': 0.1698, 'grad_norm': 0.35139939188957214, 'learning_rate': 0.00026766814612495733, 'epoch': 3.31}


 33%|███▎      | 971/2930 [25:57<51:20,  1.57s/it]

{'loss': 0.1587, 'grad_norm': 0.3330991566181183, 'learning_rate': 0.00026753158074428133, 'epoch': 3.31}


 33%|███▎      | 972/2930 [25:58<51:22,  1.57s/it]

{'loss': 0.1615, 'grad_norm': 0.32738208770751953, 'learning_rate': 0.0002673950153636053, 'epoch': 3.31}


 33%|███▎      | 973/2930 [26:00<51:30,  1.58s/it]

{'loss': 0.1706, 'grad_norm': 0.3391670286655426, 'learning_rate': 0.0002672584499829293, 'epoch': 3.32}


 33%|███▎      | 974/2930 [26:01<51:32,  1.58s/it]

{'loss': 0.183, 'grad_norm': 0.3766893744468689, 'learning_rate': 0.00026712188460225337, 'epoch': 3.32}


 33%|███▎      | 975/2930 [26:03<51:34,  1.58s/it]

{'loss': 0.1759, 'grad_norm': 0.365605890750885, 'learning_rate': 0.00026698531922157736, 'epoch': 3.32}


 33%|███▎      | 976/2930 [26:04<51:18,  1.58s/it]

{'loss': 0.1803, 'grad_norm': 0.3761231303215027, 'learning_rate': 0.00026684875384090135, 'epoch': 3.33}


 33%|███▎      | 977/2930 [26:06<51:27,  1.58s/it]

{'loss': 0.1778, 'grad_norm': 0.43626147508621216, 'learning_rate': 0.00026671218846022535, 'epoch': 3.33}


 33%|███▎      | 978/2930 [26:08<51:34,  1.59s/it]

{'loss': 0.1728, 'grad_norm': 0.3760504424571991, 'learning_rate': 0.00026657562307954934, 'epoch': 3.34}


 33%|███▎      | 979/2930 [26:09<51:38,  1.59s/it]

{'loss': 0.1642, 'grad_norm': 0.36308005452156067, 'learning_rate': 0.00026643905769887334, 'epoch': 3.34}


 33%|███▎      | 980/2930 [26:11<51:47,  1.59s/it]

{'loss': 0.1806, 'grad_norm': 0.38514822721481323, 'learning_rate': 0.0002663024923181974, 'epoch': 3.34}


 33%|███▎      | 981/2930 [26:12<51:27,  1.58s/it]

{'loss': 0.1852, 'grad_norm': 0.38597121834754944, 'learning_rate': 0.0002661659269375214, 'epoch': 3.35}


 34%|███▎      | 982/2930 [26:14<51:03,  1.57s/it]

{'loss': 0.152, 'grad_norm': 0.3139248192310333, 'learning_rate': 0.0002660293615568454, 'epoch': 3.35}


 34%|███▎      | 983/2930 [26:16<51:09,  1.58s/it]

{'loss': 0.1642, 'grad_norm': 0.34472227096557617, 'learning_rate': 0.0002658927961761693, 'epoch': 3.35}


 34%|███▎      | 984/2930 [26:17<51:16,  1.58s/it]

{'loss': 0.1941, 'grad_norm': 0.4206332862377167, 'learning_rate': 0.00026575623079549336, 'epoch': 3.36}


 34%|███▎      | 985/2930 [26:19<51:23,  1.59s/it]

{'loss': 0.1961, 'grad_norm': 0.34616580605506897, 'learning_rate': 0.00026561966541481736, 'epoch': 3.36}


 34%|███▎      | 986/2930 [26:20<51:22,  1.59s/it]

{'loss': 0.182, 'grad_norm': 0.3608883023262024, 'learning_rate': 0.00026548310003414135, 'epoch': 3.36}


 34%|███▎      | 987/2930 [26:22<51:20,  1.59s/it]

{'loss': 0.1602, 'grad_norm': 0.30316489934921265, 'learning_rate': 0.0002653465346534654, 'epoch': 3.37}


 34%|███▎      | 988/2930 [26:23<51:05,  1.58s/it]

{'loss': 0.1753, 'grad_norm': 0.35452938079833984, 'learning_rate': 0.00026520996927278934, 'epoch': 3.37}


 34%|███▍      | 989/2930 [26:25<51:11,  1.58s/it]

{'loss': 0.2083, 'grad_norm': 0.3904125988483429, 'learning_rate': 0.00026507340389211333, 'epoch': 3.37}


 34%|███▍      | 990/2930 [26:27<50:48,  1.57s/it]

{'loss': 0.1691, 'grad_norm': 0.37211379408836365, 'learning_rate': 0.0002649368385114374, 'epoch': 3.38}


 34%|███▍      | 991/2930 [26:28<50:42,  1.57s/it]

{'loss': 0.1664, 'grad_norm': 0.3076159358024597, 'learning_rate': 0.0002648002731307614, 'epoch': 3.38}


 34%|███▍      | 992/2930 [26:30<50:54,  1.58s/it]

{'loss': 0.1741, 'grad_norm': 0.36583659052848816, 'learning_rate': 0.00026466370775008537, 'epoch': 3.38}


 34%|███▍      | 993/2930 [26:31<50:43,  1.57s/it]

{'loss': 0.1834, 'grad_norm': 0.38389262557029724, 'learning_rate': 0.00026452714236940936, 'epoch': 3.39}


 34%|███▍      | 994/2930 [26:33<50:42,  1.57s/it]

{'loss': 0.1653, 'grad_norm': 0.33549144864082336, 'learning_rate': 0.00026439057698873336, 'epoch': 3.39}


 34%|███▍      | 995/2930 [26:34<50:45,  1.57s/it]

{'loss': 0.1799, 'grad_norm': 0.3960261940956116, 'learning_rate': 0.00026425401160805735, 'epoch': 3.39}


 34%|███▍      | 996/2930 [26:36<50:41,  1.57s/it]

{'loss': 0.1378, 'grad_norm': 0.3106428384780884, 'learning_rate': 0.0002641174462273814, 'epoch': 3.4}


 34%|███▍      | 997/2930 [26:38<50:40,  1.57s/it]

{'loss': 0.1743, 'grad_norm': 0.4138626754283905, 'learning_rate': 0.0002639808808467054, 'epoch': 3.4}


 34%|███▍      | 998/2930 [26:39<50:25,  1.57s/it]

{'loss': 0.1725, 'grad_norm': 0.42327749729156494, 'learning_rate': 0.0002638443154660294, 'epoch': 3.4}


 34%|███▍      | 999/2930 [26:41<50:08,  1.56s/it]

{'loss': 0.1754, 'grad_norm': 0.4294850528240204, 'learning_rate': 0.0002637077500853534, 'epoch': 3.41}


 34%|███▍      | 1000/2930 [26:42<50:18,  1.56s/it]
Checkpoint destination directory outputs/checkpoint-1000 already exists and is non-empty. Saving will proceed but saved results may be invalid.


{'loss': 0.1596, 'grad_norm': 0.3817397356033325, 'learning_rate': 0.0002635711847046774, 'epoch': 3.41}


[34m[1mwandb[0m: Adding directory to artifact (./outputs/checkpoint-1000)... Done. 0.1s
 34%|███▍      | 1001/2930 [26:45<57:46,  1.80s/it]

{'loss': 0.1703, 'grad_norm': 0.3629986047744751, 'learning_rate': 0.00026343461932400137, 'epoch': 3.41}


 34%|███▍      | 1002/2930 [26:46<55:35,  1.73s/it]

{'loss': 0.1593, 'grad_norm': 0.33184731006622314, 'learning_rate': 0.00026329805394332536, 'epoch': 3.42}


 34%|███▍      | 1003/2930 [26:48<53:44,  1.67s/it]

{'loss': 0.1572, 'grad_norm': 0.34640389680862427, 'learning_rate': 0.0002631614885626494, 'epoch': 3.42}


 34%|███▍      | 1004/2930 [26:49<52:30,  1.64s/it]

{'loss': 0.1934, 'grad_norm': 0.4185563921928406, 'learning_rate': 0.0002630249231819734, 'epoch': 3.42}


 34%|███▍      | 1005/2930 [26:51<51:51,  1.62s/it]

{'loss': 0.1604, 'grad_norm': 0.35808274149894714, 'learning_rate': 0.00026288835780129735, 'epoch': 3.43}


 34%|███▍      | 1006/2930 [26:52<51:18,  1.60s/it]

{'loss': 0.154, 'grad_norm': 0.3085462152957916, 'learning_rate': 0.0002627517924206214, 'epoch': 3.43}


 34%|███▍      | 1007/2930 [26:54<51:00,  1.59s/it]

{'loss': 0.1722, 'grad_norm': 0.37019485235214233, 'learning_rate': 0.0002626152270399454, 'epoch': 3.43}


 34%|███▍      | 1008/2930 [26:56<50:33,  1.58s/it]

{'loss': 0.1886, 'grad_norm': 0.4001787006855011, 'learning_rate': 0.0002624786616592694, 'epoch': 3.44}


 34%|███▍      | 1009/2930 [26:57<50:28,  1.58s/it]

{'loss': 0.1583, 'grad_norm': 0.37132009863853455, 'learning_rate': 0.00026234209627859343, 'epoch': 3.44}


 34%|███▍      | 1010/2930 [26:59<50:12,  1.57s/it]

{'loss': 0.1559, 'grad_norm': 0.3759586811065674, 'learning_rate': 0.00026220553089791737, 'epoch': 3.44}


 35%|███▍      | 1011/2930 [27:00<49:54,  1.56s/it]

{'loss': 0.1593, 'grad_norm': 0.3872603476047516, 'learning_rate': 0.00026206896551724137, 'epoch': 3.45}


 35%|███▍      | 1012/2930 [27:02<49:46,  1.56s/it]

{'loss': 0.1773, 'grad_norm': 0.42793846130371094, 'learning_rate': 0.0002619324001365654, 'epoch': 3.45}


 35%|███▍      | 1013/2930 [27:03<49:34,  1.55s/it]

{'loss': 0.1908, 'grad_norm': 0.41991880536079407, 'learning_rate': 0.0002617958347558894, 'epoch': 3.45}


 35%|███▍      | 1014/2930 [27:05<49:43,  1.56s/it]

{'loss': 0.1916, 'grad_norm': 0.3740284740924835, 'learning_rate': 0.0002616592693752134, 'epoch': 3.46}


 35%|███▍      | 1015/2930 [27:06<49:33,  1.55s/it]

{'loss': 0.1677, 'grad_norm': 0.3566155731678009, 'learning_rate': 0.0002615227039945374, 'epoch': 3.46}


 35%|███▍      | 1016/2930 [27:08<49:32,  1.55s/it]

{'loss': 0.1946, 'grad_norm': 0.36681339144706726, 'learning_rate': 0.0002613861386138614, 'epoch': 3.46}


 35%|███▍      | 1017/2930 [27:10<49:31,  1.55s/it]

{'loss': 0.1749, 'grad_norm': 0.3362199068069458, 'learning_rate': 0.0002612495732331854, 'epoch': 3.47}


 35%|███▍      | 1018/2930 [27:11<49:36,  1.56s/it]

{'loss': 0.1689, 'grad_norm': 0.3777890205383301, 'learning_rate': 0.0002611130078525094, 'epoch': 3.47}


 35%|███▍      | 1019/2930 [27:13<49:29,  1.55s/it]

{'loss': 0.1445, 'grad_norm': 0.36486953496932983, 'learning_rate': 0.00026097644247183343, 'epoch': 3.47}


 35%|███▍      | 1020/2930 [27:14<49:33,  1.56s/it]

{'loss': 0.1873, 'grad_norm': 0.4320841431617737, 'learning_rate': 0.0002608398770911574, 'epoch': 3.48}


 35%|███▍      | 1021/2930 [27:16<49:32,  1.56s/it]

{'loss': 0.1604, 'grad_norm': 0.37373948097229004, 'learning_rate': 0.0002607033117104814, 'epoch': 3.48}


 35%|███▍      | 1022/2930 [27:17<49:36,  1.56s/it]

{'loss': 0.164, 'grad_norm': 0.3566828668117523, 'learning_rate': 0.0002605667463298054, 'epoch': 3.49}


 35%|███▍      | 1023/2930 [27:19<49:27,  1.56s/it]

{'loss': 0.1454, 'grad_norm': 0.3633231818675995, 'learning_rate': 0.0002604301809491294, 'epoch': 3.49}


 35%|███▍      | 1024/2930 [27:20<49:37,  1.56s/it]

{'loss': 0.1832, 'grad_norm': 0.4274308681488037, 'learning_rate': 0.0002602936155684534, 'epoch': 3.49}


 35%|███▍      | 1025/2930 [27:22<49:25,  1.56s/it]

{'loss': 0.1665, 'grad_norm': 0.36405718326568604, 'learning_rate': 0.00026015705018777745, 'epoch': 3.5}


 35%|███▌      | 1026/2930 [27:24<49:27,  1.56s/it]

{'loss': 0.1511, 'grad_norm': 0.3427641689777374, 'learning_rate': 0.00026002048480710144, 'epoch': 3.5}


 35%|███▌      | 1027/2930 [27:25<49:26,  1.56s/it]

{'loss': 0.1751, 'grad_norm': 0.34645405411720276, 'learning_rate': 0.0002598839194264254, 'epoch': 3.5}


 35%|███▌      | 1028/2930 [27:27<49:12,  1.55s/it]

{'loss': 0.1676, 'grad_norm': 0.35835638642311096, 'learning_rate': 0.00025974735404574943, 'epoch': 3.51}


 35%|███▌      | 1029/2930 [27:28<48:54,  1.54s/it]

{'loss': 0.1751, 'grad_norm': 0.3594391345977783, 'learning_rate': 0.0002596107886650734, 'epoch': 3.51}


 35%|███▌      | 1030/2930 [27:30<48:27,  1.53s/it]

{'loss': 0.1863, 'grad_norm': 0.43085435032844543, 'learning_rate': 0.0002594742232843974, 'epoch': 3.51}


 35%|███▌      | 1031/2930 [27:31<48:45,  1.54s/it]

{'loss': 0.1657, 'grad_norm': 0.32559487223625183, 'learning_rate': 0.0002593376579037214, 'epoch': 3.52}


 35%|███▌      | 1032/2930 [27:33<48:59,  1.55s/it]

{'loss': 0.1509, 'grad_norm': 0.31703299283981323, 'learning_rate': 0.0002592010925230454, 'epoch': 3.52}


 35%|███▌      | 1033/2930 [27:34<48:36,  1.54s/it]

{'loss': 0.1577, 'grad_norm': 0.329328715801239, 'learning_rate': 0.0002590645271423694, 'epoch': 3.52}


 35%|███▌      | 1034/2930 [27:36<48:50,  1.55s/it]

{'loss': 0.1827, 'grad_norm': 0.3928058445453644, 'learning_rate': 0.0002589279617616934, 'epoch': 3.53}


 35%|███▌      | 1035/2930 [27:37<49:07,  1.56s/it]

{'loss': 0.1575, 'grad_norm': 0.34431952238082886, 'learning_rate': 0.00025879139638101744, 'epoch': 3.53}


 35%|███▌      | 1036/2930 [27:39<49:11,  1.56s/it]

{'loss': 0.1834, 'grad_norm': 0.46435725688934326, 'learning_rate': 0.00025865483100034144, 'epoch': 3.53}


 35%|███▌      | 1037/2930 [27:41<49:01,  1.55s/it]

{'loss': 0.1699, 'grad_norm': 0.37664994597435, 'learning_rate': 0.00025851826561966543, 'epoch': 3.54}


 35%|███▌      | 1038/2930 [27:42<49:09,  1.56s/it]

{'loss': 0.1639, 'grad_norm': 0.37631767988204956, 'learning_rate': 0.0002583817002389894, 'epoch': 3.54}


 35%|███▌      | 1039/2930 [27:44<49:15,  1.56s/it]

{'loss': 0.1702, 'grad_norm': 0.37205561995506287, 'learning_rate': 0.0002582451348583134, 'epoch': 3.54}


 35%|███▌      | 1040/2930 [27:45<49:15,  1.56s/it]

{'loss': 0.1528, 'grad_norm': 0.33830299973487854, 'learning_rate': 0.0002581085694776374, 'epoch': 3.55}


 36%|███▌      | 1041/2930 [27:47<49:00,  1.56s/it]

{'loss': 0.1734, 'grad_norm': 0.3643508851528168, 'learning_rate': 0.00025797200409696146, 'epoch': 3.55}


 36%|███▌      | 1042/2930 [27:48<48:40,  1.55s/it]

{'loss': 0.1649, 'grad_norm': 0.3442721664905548, 'learning_rate': 0.00025783543871628546, 'epoch': 3.55}


 36%|███▌      | 1043/2930 [27:50<48:50,  1.55s/it]

{'loss': 0.142, 'grad_norm': 0.31444674730300903, 'learning_rate': 0.00025769887333560945, 'epoch': 3.56}


 36%|███▌      | 1044/2930 [27:51<48:40,  1.55s/it]

{'loss': 0.1638, 'grad_norm': 0.3809274733066559, 'learning_rate': 0.00025756230795493344, 'epoch': 3.56}


 36%|███▌      | 1045/2930 [27:53<48:34,  1.55s/it]

{'loss': 0.1558, 'grad_norm': 0.3955625891685486, 'learning_rate': 0.00025742574257425744, 'epoch': 3.56}


 36%|███▌      | 1046/2930 [27:55<48:39,  1.55s/it]

{'loss': 0.1315, 'grad_norm': 0.3090742826461792, 'learning_rate': 0.00025728917719358143, 'epoch': 3.57}


 36%|███▌      | 1047/2930 [27:56<48:35,  1.55s/it]

{'loss': 0.1567, 'grad_norm': 0.3882386088371277, 'learning_rate': 0.0002571526118129054, 'epoch': 3.57}


 36%|███▌      | 1048/2930 [27:58<48:30,  1.55s/it]

{'loss': 0.1459, 'grad_norm': 0.34428277611732483, 'learning_rate': 0.0002570160464322295, 'epoch': 3.57}


 36%|███▌      | 1049/2930 [27:59<48:45,  1.56s/it]

{'loss': 0.1559, 'grad_norm': 0.36404499411582947, 'learning_rate': 0.0002568794810515534, 'epoch': 3.58}


 36%|███▌      | 1050/2930 [28:01<48:39,  1.55s/it]

{'loss': 0.1658, 'grad_norm': 0.41500943899154663, 'learning_rate': 0.0002567429156708774, 'epoch': 3.58}


 36%|███▌      | 1051/2930 [28:02<48:52,  1.56s/it]

{'loss': 0.1687, 'grad_norm': 0.3768182694911957, 'learning_rate': 0.00025660635029020146, 'epoch': 3.58}


 36%|███▌      | 1052/2930 [28:04<48:28,  1.55s/it]

{'loss': 0.1595, 'grad_norm': 0.38860490918159485, 'learning_rate': 0.00025646978490952545, 'epoch': 3.59}


 36%|███▌      | 1053/2930 [28:05<48:41,  1.56s/it]

{'loss': 0.1627, 'grad_norm': 0.34335529804229736, 'learning_rate': 0.00025633321952884944, 'epoch': 3.59}


 36%|███▌      | 1054/2930 [28:07<48:47,  1.56s/it]

{'loss': 0.1595, 'grad_norm': 0.36082351207733154, 'learning_rate': 0.00025619665414817344, 'epoch': 3.59}


 36%|███▌      | 1055/2930 [28:09<48:36,  1.56s/it]

{'loss': 0.1675, 'grad_norm': 0.35502609610557556, 'learning_rate': 0.00025606008876749743, 'epoch': 3.6}


 36%|███▌      | 1056/2930 [28:10<48:31,  1.55s/it]

{'loss': 0.1611, 'grad_norm': 0.3230530321598053, 'learning_rate': 0.00025592352338682143, 'epoch': 3.6}


 36%|███▌      | 1057/2930 [28:12<48:24,  1.55s/it]

{'loss': 0.156, 'grad_norm': 0.352041095495224, 'learning_rate': 0.0002557869580061455, 'epoch': 3.6}


 36%|███▌      | 1058/2930 [28:13<48:41,  1.56s/it]

{'loss': 0.1604, 'grad_norm': 0.33591896295547485, 'learning_rate': 0.00025565039262546947, 'epoch': 3.61}


 36%|███▌      | 1059/2930 [28:15<48:46,  1.56s/it]

{'loss': 0.171, 'grad_norm': 0.39666086435317993, 'learning_rate': 0.00025551382724479346, 'epoch': 3.61}


 36%|███▌      | 1060/2930 [28:16<48:33,  1.56s/it]

{'loss': 0.141, 'grad_norm': 0.32352885603904724, 'learning_rate': 0.00025537726186411746, 'epoch': 3.61}


 36%|███▌      | 1061/2930 [28:18<48:31,  1.56s/it]

{'loss': 0.1409, 'grad_norm': 0.35599127411842346, 'learning_rate': 0.00025524069648344145, 'epoch': 3.62}


 36%|███▌      | 1062/2930 [28:19<48:39,  1.56s/it]

{'loss': 0.1321, 'grad_norm': 0.3448888957500458, 'learning_rate': 0.00025510413110276545, 'epoch': 3.62}


 36%|███▋      | 1063/2930 [28:21<48:51,  1.57s/it]

{'loss': 0.174, 'grad_norm': 0.3970230221748352, 'learning_rate': 0.00025496756572208944, 'epoch': 3.62}


 36%|███▋      | 1064/2930 [28:23<48:25,  1.56s/it]

{'loss': 0.162, 'grad_norm': 0.3859904706478119, 'learning_rate': 0.0002548310003414135, 'epoch': 3.63}


 36%|███▋      | 1065/2930 [28:24<48:33,  1.56s/it]

{'loss': 0.19, 'grad_norm': 0.45431312918663025, 'learning_rate': 0.0002546944349607375, 'epoch': 3.63}


 36%|███▋      | 1066/2930 [28:26<48:26,  1.56s/it]

{'loss': 0.1778, 'grad_norm': 0.40570199489593506, 'learning_rate': 0.0002545578695800614, 'epoch': 3.64}


 36%|███▋      | 1067/2930 [28:27<48:01,  1.55s/it]

{'loss': 0.137, 'grad_norm': 0.29514139890670776, 'learning_rate': 0.00025442130419938547, 'epoch': 3.64}


 36%|███▋      | 1068/2930 [28:29<47:58,  1.55s/it]

{'loss': 0.1829, 'grad_norm': 0.4131338596343994, 'learning_rate': 0.00025428473881870947, 'epoch': 3.64}


 36%|███▋      | 1069/2930 [28:30<47:58,  1.55s/it]

{'loss': 0.168, 'grad_norm': 0.3627215027809143, 'learning_rate': 0.00025414817343803346, 'epoch': 3.65}


 37%|███▋      | 1070/2930 [28:32<47:58,  1.55s/it]

{'loss': 0.1814, 'grad_norm': 0.36570242047309875, 'learning_rate': 0.0002540116080573575, 'epoch': 3.65}


 37%|███▋      | 1071/2930 [28:33<47:35,  1.54s/it]

{'loss': 0.1747, 'grad_norm': 0.35977673530578613, 'learning_rate': 0.00025387504267668145, 'epoch': 3.65}


 37%|███▋      | 1072/2930 [28:35<47:55,  1.55s/it]

{'loss': 0.1335, 'grad_norm': 0.32684123516082764, 'learning_rate': 0.00025373847729600544, 'epoch': 3.66}


 37%|███▋      | 1073/2930 [28:37<48:18,  1.56s/it]

{'loss': 0.176, 'grad_norm': 0.38243356347084045, 'learning_rate': 0.0002536019119153295, 'epoch': 3.66}


 37%|███▋      | 1074/2930 [28:38<48:33,  1.57s/it]

{'loss': 0.1623, 'grad_norm': 0.3846552073955536, 'learning_rate': 0.0002534653465346535, 'epoch': 3.66}


 37%|███▋      | 1075/2930 [28:40<48:28,  1.57s/it]

{'loss': 0.1539, 'grad_norm': 0.38130009174346924, 'learning_rate': 0.0002533287811539775, 'epoch': 3.67}


 37%|███▋      | 1076/2930 [28:41<48:41,  1.58s/it]

{'loss': 0.1714, 'grad_norm': 0.39432498812675476, 'learning_rate': 0.0002531922157733015, 'epoch': 3.67}


 37%|███▋      | 1077/2930 [28:43<48:52,  1.58s/it]

{'loss': 0.1492, 'grad_norm': 0.3568190932273865, 'learning_rate': 0.00025305565039262547, 'epoch': 3.67}


 37%|███▋      | 1078/2930 [28:44<48:54,  1.58s/it]

{'loss': 0.1569, 'grad_norm': 0.37407949566841125, 'learning_rate': 0.00025291908501194946, 'epoch': 3.68}


 37%|███▋      | 1079/2930 [28:46<48:52,  1.58s/it]

{'loss': 0.1577, 'grad_norm': 0.38671618700027466, 'learning_rate': 0.00025278251963127346, 'epoch': 3.68}


 37%|███▋      | 1080/2930 [28:48<49:01,  1.59s/it]

{'loss': 0.1552, 'grad_norm': 0.32971104979515076, 'learning_rate': 0.0002526459542505975, 'epoch': 3.68}


 37%|███▋      | 1081/2930 [28:49<49:09,  1.60s/it]

{'loss': 0.1631, 'grad_norm': 0.34972190856933594, 'learning_rate': 0.0002525093888699215, 'epoch': 3.69}


 37%|███▋      | 1082/2930 [28:51<48:47,  1.58s/it]

{'loss': 0.1676, 'grad_norm': 0.38350269198417664, 'learning_rate': 0.0002523728234892455, 'epoch': 3.69}


 37%|███▋      | 1083/2930 [28:52<48:37,  1.58s/it]

{'loss': 0.1559, 'grad_norm': 0.34857749938964844, 'learning_rate': 0.0002522362581085695, 'epoch': 3.69}


 37%|███▋      | 1084/2930 [28:54<48:35,  1.58s/it]

{'loss': 0.1496, 'grad_norm': 0.33727186918258667, 'learning_rate': 0.0002520996927278935, 'epoch': 3.7}


 37%|███▋      | 1085/2930 [28:56<48:43,  1.58s/it]

{'loss': 0.1715, 'grad_norm': 0.40153875946998596, 'learning_rate': 0.0002519631273472175, 'epoch': 3.7}


 37%|███▋      | 1086/2930 [28:57<48:55,  1.59s/it]

{'loss': 0.1402, 'grad_norm': 0.3432791829109192, 'learning_rate': 0.0002518265619665415, 'epoch': 3.7}


 37%|███▋      | 1087/2930 [28:59<48:53,  1.59s/it]

{'loss': 0.1635, 'grad_norm': 0.4188843369483948, 'learning_rate': 0.0002516899965858655, 'epoch': 3.71}


 37%|███▋      | 1088/2930 [29:00<48:34,  1.58s/it]

{'loss': 0.1489, 'grad_norm': 0.3490216135978699, 'learning_rate': 0.00025155343120518946, 'epoch': 3.71}


 37%|███▋      | 1089/2930 [29:02<48:29,  1.58s/it]

{'loss': 0.1489, 'grad_norm': 0.3465926945209503, 'learning_rate': 0.0002514168658245135, 'epoch': 3.71}


 37%|███▋      | 1090/2930 [29:03<48:26,  1.58s/it]

{'loss': 0.1535, 'grad_norm': 0.37719425559043884, 'learning_rate': 0.0002512803004438375, 'epoch': 3.72}


 37%|███▋      | 1091/2930 [29:05<48:22,  1.58s/it]

{'loss': 0.1434, 'grad_norm': 0.33232593536376953, 'learning_rate': 0.0002511437350631615, 'epoch': 3.72}


 37%|███▋      | 1092/2930 [29:07<48:04,  1.57s/it]

{'loss': 0.1439, 'grad_norm': 0.3512352406978607, 'learning_rate': 0.00025100716968248554, 'epoch': 3.72}


 37%|███▋      | 1093/2930 [29:08<48:19,  1.58s/it]

{'loss': 0.1431, 'grad_norm': 0.37342727184295654, 'learning_rate': 0.00025087060430180954, 'epoch': 3.73}


 37%|███▋      | 1094/2930 [29:10<48:25,  1.58s/it]

{'loss': 0.1255, 'grad_norm': 0.2888205945491791, 'learning_rate': 0.0002507340389211335, 'epoch': 3.73}


 37%|███▋      | 1095/2930 [29:11<48:30,  1.59s/it]

{'loss': 0.1564, 'grad_norm': 0.3708595335483551, 'learning_rate': 0.0002505974735404575, 'epoch': 3.73}


 37%|███▋      | 1096/2930 [29:13<48:37,  1.59s/it]

{'loss': 0.1512, 'grad_norm': 0.3968510031700134, 'learning_rate': 0.0002504609081597815, 'epoch': 3.74}


 37%|███▋      | 1097/2930 [29:15<48:38,  1.59s/it]

{'loss': 0.146, 'grad_norm': 0.38682061433792114, 'learning_rate': 0.0002503243427791055, 'epoch': 3.74}


 37%|███▋      | 1098/2930 [29:16<48:27,  1.59s/it]

{'loss': 0.1517, 'grad_norm': 0.4059889614582062, 'learning_rate': 0.0002501877773984295, 'epoch': 3.74}


 38%|███▊      | 1099/2930 [29:18<48:32,  1.59s/it]

{'loss': 0.1838, 'grad_norm': 0.4203929901123047, 'learning_rate': 0.0002500512120177535, 'epoch': 3.75}


 38%|███▊      | 1100/2930 [29:19<48:35,  1.59s/it]

{'loss': 0.174, 'grad_norm': 0.3686433434486389, 'learning_rate': 0.0002499146466370775, 'epoch': 3.75}


 38%|███▊      | 1101/2930 [29:21<48:32,  1.59s/it]

{'loss': 0.146, 'grad_norm': 0.32293492555618286, 'learning_rate': 0.0002497780812564015, 'epoch': 3.75}


 38%|███▊      | 1102/2930 [29:23<48:30,  1.59s/it]

{'loss': 0.1598, 'grad_norm': 0.34223106503486633, 'learning_rate': 0.00024964151587572554, 'epoch': 3.76}


 38%|███▊      | 1103/2930 [29:24<48:26,  1.59s/it]

{'loss': 0.1649, 'grad_norm': 0.4190748333930969, 'learning_rate': 0.00024950495049504953, 'epoch': 3.76}


 38%|███▊      | 1104/2930 [29:26<48:01,  1.58s/it]

{'loss': 0.1741, 'grad_norm': 0.3437836468219757, 'learning_rate': 0.0002493683851143735, 'epoch': 3.76}


 38%|███▊      | 1105/2930 [29:27<47:41,  1.57s/it]

{'loss': 0.1432, 'grad_norm': 0.2949671745300293, 'learning_rate': 0.0002492318197336975, 'epoch': 3.77}


 38%|███▊      | 1106/2930 [29:29<47:26,  1.56s/it]

{'loss': 0.1438, 'grad_norm': 0.30457019805908203, 'learning_rate': 0.0002490952543530215, 'epoch': 3.77}


 38%|███▊      | 1107/2930 [29:30<47:37,  1.57s/it]

{'loss': 0.1473, 'grad_norm': 0.31962963938713074, 'learning_rate': 0.0002489586889723455, 'epoch': 3.77}


 38%|███▊      | 1108/2930 [29:32<47:35,  1.57s/it]

{'loss': 0.1486, 'grad_norm': 0.3690437972545624, 'learning_rate': 0.00024882212359166956, 'epoch': 3.78}


 38%|███▊      | 1109/2930 [29:34<47:51,  1.58s/it]

{'loss': 0.1728, 'grad_norm': 0.4371497929096222, 'learning_rate': 0.00024868555821099355, 'epoch': 3.78}


 38%|███▊      | 1110/2930 [29:35<47:51,  1.58s/it]

{'loss': 0.1509, 'grad_norm': 0.38989201188087463, 'learning_rate': 0.0002485489928303175, 'epoch': 3.79}


 38%|███▊      | 1111/2930 [29:37<47:45,  1.58s/it]

{'loss': 0.137, 'grad_norm': 0.3559280335903168, 'learning_rate': 0.00024841242744964154, 'epoch': 3.79}


 38%|███▊      | 1112/2930 [29:38<47:47,  1.58s/it]

{'loss': 0.1508, 'grad_norm': 0.38096359372138977, 'learning_rate': 0.00024827586206896553, 'epoch': 3.79}


 38%|███▊      | 1113/2930 [29:40<47:55,  1.58s/it]

{'loss': 0.1709, 'grad_norm': 0.3649703860282898, 'learning_rate': 0.0002481392966882895, 'epoch': 3.8}


 38%|███▊      | 1114/2930 [29:41<47:50,  1.58s/it]

{'loss': 0.1427, 'grad_norm': 0.37762463092803955, 'learning_rate': 0.0002480027313076135, 'epoch': 3.8}


 38%|███▊      | 1115/2930 [29:43<47:36,  1.57s/it]

{'loss': 0.1383, 'grad_norm': 0.31555861234664917, 'learning_rate': 0.00024786616592693757, 'epoch': 3.8}


 38%|███▊      | 1116/2930 [29:45<47:30,  1.57s/it]

{'loss': 0.1523, 'grad_norm': 0.3671024441719055, 'learning_rate': 0.0002477296005462615, 'epoch': 3.81}


 38%|███▊      | 1117/2930 [29:46<47:34,  1.57s/it]

{'loss': 0.1566, 'grad_norm': 0.381940633058548, 'learning_rate': 0.0002475930351655855, 'epoch': 3.81}


 38%|███▊      | 1118/2930 [29:48<47:26,  1.57s/it]

{'loss': 0.1327, 'grad_norm': 0.34536147117614746, 'learning_rate': 0.00024745646978490955, 'epoch': 3.81}


 38%|███▊      | 1119/2930 [29:49<47:40,  1.58s/it]

{'loss': 0.1452, 'grad_norm': 0.3689875900745392, 'learning_rate': 0.00024731990440423355, 'epoch': 3.82}


 38%|███▊      | 1120/2930 [29:51<47:51,  1.59s/it]

{'loss': 0.1405, 'grad_norm': 0.36557966470718384, 'learning_rate': 0.00024718333902355754, 'epoch': 3.82}


 38%|███▊      | 1121/2930 [29:53<47:54,  1.59s/it]

{'loss': 0.1631, 'grad_norm': 0.40611204504966736, 'learning_rate': 0.00024704677364288153, 'epoch': 3.82}


 38%|███▊      | 1122/2930 [29:54<47:40,  1.58s/it]

{'loss': 0.1978, 'grad_norm': 0.48750707507133484, 'learning_rate': 0.00024691020826220553, 'epoch': 3.83}


 38%|███▊      | 1123/2930 [29:56<47:42,  1.58s/it]

{'loss': 0.1861, 'grad_norm': 0.42314422130584717, 'learning_rate': 0.0002467736428815295, 'epoch': 3.83}


 38%|███▊      | 1124/2930 [29:57<47:19,  1.57s/it]

{'loss': 0.156, 'grad_norm': 0.3630428612232208, 'learning_rate': 0.00024663707750085357, 'epoch': 3.83}


 38%|███▊      | 1125/2930 [29:59<47:17,  1.57s/it]

{'loss': 0.1588, 'grad_norm': 0.3614494800567627, 'learning_rate': 0.00024650051212017757, 'epoch': 3.84}


 38%|███▊      | 1126/2930 [30:00<46:58,  1.56s/it]

{'loss': 0.1568, 'grad_norm': 0.3009279668331146, 'learning_rate': 0.00024636394673950156, 'epoch': 3.84}


 38%|███▊      | 1127/2930 [30:02<47:00,  1.56s/it]

{'loss': 0.1484, 'grad_norm': 0.32101866602897644, 'learning_rate': 0.00024622738135882555, 'epoch': 3.84}


 38%|███▊      | 1128/2930 [30:03<46:53,  1.56s/it]

{'loss': 0.1441, 'grad_norm': 0.3438771069049835, 'learning_rate': 0.00024609081597814955, 'epoch': 3.85}


 39%|███▊      | 1129/2930 [30:05<46:32,  1.55s/it]

{'loss': 0.1521, 'grad_norm': 0.3521000146865845, 'learning_rate': 0.00024595425059747354, 'epoch': 3.85}


 39%|███▊      | 1130/2930 [30:06<46:09,  1.54s/it]

{'loss': 0.1528, 'grad_norm': 0.38227248191833496, 'learning_rate': 0.00024581768521679754, 'epoch': 3.85}


 39%|███▊      | 1131/2930 [30:08<45:57,  1.53s/it]

{'loss': 0.1657, 'grad_norm': 0.4465169608592987, 'learning_rate': 0.0002456811198361216, 'epoch': 3.86}


 39%|███▊      | 1132/2930 [30:10<46:12,  1.54s/it]

{'loss': 0.1459, 'grad_norm': 0.3788412809371948, 'learning_rate': 0.0002455445544554456, 'epoch': 3.86}


 39%|███▊      | 1133/2930 [30:11<46:17,  1.55s/it]

{'loss': 0.1491, 'grad_norm': 0.37555691599845886, 'learning_rate': 0.0002454079890747695, 'epoch': 3.86}


 39%|███▊      | 1134/2930 [30:13<46:12,  1.54s/it]

{'loss': 0.1655, 'grad_norm': 0.42666324973106384, 'learning_rate': 0.00024527142369409357, 'epoch': 3.87}


 39%|███▊      | 1135/2930 [30:14<46:25,  1.55s/it]

{'loss': 0.1573, 'grad_norm': 0.40419694781303406, 'learning_rate': 0.00024513485831341756, 'epoch': 3.87}


 39%|███▉      | 1136/2930 [30:16<46:28,  1.55s/it]

{'loss': 0.1521, 'grad_norm': 0.38826629519462585, 'learning_rate': 0.00024499829293274155, 'epoch': 3.87}


 39%|███▉      | 1137/2930 [30:17<46:39,  1.56s/it]

{'loss': 0.1313, 'grad_norm': 0.32776492834091187, 'learning_rate': 0.0002448617275520656, 'epoch': 3.88}


 39%|███▉      | 1138/2930 [30:19<46:39,  1.56s/it]

{'loss': 0.1427, 'grad_norm': 0.33157503604888916, 'learning_rate': 0.00024472516217138954, 'epoch': 3.88}


 39%|███▉      | 1139/2930 [30:20<46:36,  1.56s/it]

{'loss': 0.1496, 'grad_norm': 0.37755727767944336, 'learning_rate': 0.00024458859679071354, 'epoch': 3.88}


 39%|███▉      | 1140/2930 [30:22<46:47,  1.57s/it]

{'loss': 0.1517, 'grad_norm': 0.3339659869670868, 'learning_rate': 0.0002444520314100376, 'epoch': 3.89}


 39%|███▉      | 1141/2930 [30:24<46:38,  1.56s/it]

{'loss': 0.1404, 'grad_norm': 0.3460550308227539, 'learning_rate': 0.0002443154660293616, 'epoch': 3.89}


 39%|███▉      | 1142/2930 [30:25<46:41,  1.57s/it]

{'loss': 0.1522, 'grad_norm': 0.3632008731365204, 'learning_rate': 0.0002441789006486856, 'epoch': 3.89}


 39%|███▉      | 1143/2930 [30:27<46:21,  1.56s/it]

{'loss': 0.1454, 'grad_norm': 0.40042412281036377, 'learning_rate': 0.0002440423352680096, 'epoch': 3.9}


 39%|███▉      | 1144/2930 [30:28<46:05,  1.55s/it]

{'loss': 0.1608, 'grad_norm': 0.40733230113983154, 'learning_rate': 0.0002439057698873336, 'epoch': 3.9}


 39%|███▉      | 1145/2930 [30:30<46:13,  1.55s/it]

{'loss': 0.1677, 'grad_norm': 0.390065461397171, 'learning_rate': 0.00024376920450665756, 'epoch': 3.9}


 39%|███▉      | 1146/2930 [30:31<45:57,  1.55s/it]

{'loss': 0.1472, 'grad_norm': 0.3377401530742645, 'learning_rate': 0.00024363263912598155, 'epoch': 3.91}


 39%|███▉      | 1147/2930 [30:33<45:52,  1.54s/it]

{'loss': 0.1576, 'grad_norm': 0.3753320574760437, 'learning_rate': 0.0002434960737453056, 'epoch': 3.91}


 39%|███▉      | 1148/2930 [30:34<46:04,  1.55s/it]

{'loss': 0.1451, 'grad_norm': 0.36504435539245605, 'learning_rate': 0.00024335950836462957, 'epoch': 3.91}


 39%|███▉      | 1149/2930 [30:36<46:04,  1.55s/it]

{'loss': 0.1378, 'grad_norm': 0.3360435664653778, 'learning_rate': 0.00024322294298395356, 'epoch': 3.92}


 39%|███▉      | 1150/2930 [30:38<46:03,  1.55s/it]

{'loss': 0.1549, 'grad_norm': 0.34323692321777344, 'learning_rate': 0.00024308637760327758, 'epoch': 3.92}


 39%|███▉      | 1151/2930 [30:39<45:54,  1.55s/it]

{'loss': 0.165, 'grad_norm': 0.4291425049304962, 'learning_rate': 0.00024294981222260158, 'epoch': 3.92}


 39%|███▉      | 1152/2930 [30:41<45:46,  1.54s/it]

{'loss': 0.135, 'grad_norm': 0.33287960290908813, 'learning_rate': 0.00024281324684192557, 'epoch': 3.93}


 39%|███▉      | 1153/2930 [30:42<45:34,  1.54s/it]

{'loss': 0.143, 'grad_norm': 0.32875296473503113, 'learning_rate': 0.0002426766814612496, 'epoch': 3.93}


 39%|███▉      | 1154/2930 [30:44<45:48,  1.55s/it]

{'loss': 0.1407, 'grad_norm': 0.34151187539100647, 'learning_rate': 0.00024254011608057358, 'epoch': 3.94}


 39%|███▉      | 1155/2930 [30:45<45:46,  1.55s/it]

{'loss': 0.1345, 'grad_norm': 0.3366103768348694, 'learning_rate': 0.00024240355069989758, 'epoch': 3.94}


 39%|███▉      | 1156/2930 [30:47<45:36,  1.54s/it]

{'loss': 0.1538, 'grad_norm': 0.4076521098613739, 'learning_rate': 0.0002422669853192216, 'epoch': 3.94}


 39%|███▉      | 1157/2930 [30:48<45:50,  1.55s/it]

{'loss': 0.1488, 'grad_norm': 0.36130526661872864, 'learning_rate': 0.0002421304199385456, 'epoch': 3.95}


 40%|███▉      | 1158/2930 [30:50<45:33,  1.54s/it]

{'loss': 0.1409, 'grad_norm': 0.38224804401397705, 'learning_rate': 0.0002419938545578696, 'epoch': 3.95}


 40%|███▉      | 1159/2930 [30:51<45:45,  1.55s/it]

{'loss': 0.1489, 'grad_norm': 0.36380407214164734, 'learning_rate': 0.0002418572891771936, 'epoch': 3.95}


 40%|███▉      | 1160/2930 [30:53<45:51,  1.55s/it]

{'loss': 0.1339, 'grad_norm': 0.34985655546188354, 'learning_rate': 0.0002417207237965176, 'epoch': 3.96}


 40%|███▉      | 1161/2930 [30:55<45:52,  1.56s/it]

{'loss': 0.1384, 'grad_norm': 0.3534083068370819, 'learning_rate': 0.0002415841584158416, 'epoch': 3.96}


 40%|███▉      | 1162/2930 [30:56<45:52,  1.56s/it]

{'loss': 0.1743, 'grad_norm': 0.410017728805542, 'learning_rate': 0.00024144759303516557, 'epoch': 3.96}


 40%|███▉      | 1163/2930 [30:58<45:56,  1.56s/it]

{'loss': 0.1428, 'grad_norm': 0.32493481040000916, 'learning_rate': 0.0002413110276544896, 'epoch': 3.97}


 40%|███▉      | 1164/2930 [30:59<45:56,  1.56s/it]

{'loss': 0.1359, 'grad_norm': 0.30838915705680847, 'learning_rate': 0.0002411744622738136, 'epoch': 3.97}


 40%|███▉      | 1165/2930 [31:01<45:37,  1.55s/it]

{'loss': 0.133, 'grad_norm': 0.3362458050251007, 'learning_rate': 0.00024103789689313757, 'epoch': 3.97}


 40%|███▉      | 1166/2930 [31:02<45:30,  1.55s/it]

{'loss': 0.1374, 'grad_norm': 0.34494712948799133, 'learning_rate': 0.00024090133151246162, 'epoch': 3.98}


 40%|███▉      | 1167/2930 [31:04<45:08,  1.54s/it]

{'loss': 0.1286, 'grad_norm': 0.32820647954940796, 'learning_rate': 0.0002407647661317856, 'epoch': 3.98}


 40%|███▉      | 1168/2930 [31:05<45:12,  1.54s/it]

{'loss': 0.1451, 'grad_norm': 0.38277414441108704, 'learning_rate': 0.00024062820075110958, 'epoch': 3.98}


 40%|███▉      | 1169/2930 [31:07<45:21,  1.55s/it]

{'loss': 0.1751, 'grad_norm': 0.4564221501350403, 'learning_rate': 0.00024049163537043363, 'epoch': 3.99}


 40%|███▉      | 1170/2930 [31:09<45:14,  1.54s/it]

{'loss': 0.1519, 'grad_norm': 0.400677353143692, 'learning_rate': 0.0002403550699897576, 'epoch': 3.99}


 40%|███▉      | 1171/2930 [31:10<45:19,  1.55s/it]

{'loss': 0.1336, 'grad_norm': 0.3510235846042633, 'learning_rate': 0.0002402185046090816, 'epoch': 3.99}


 40%|████      | 1172/2930 [31:12<45:24,  1.55s/it]

{'loss': 0.1443, 'grad_norm': 0.32338064908981323, 'learning_rate': 0.00024008193922840562, 'epoch': 4.0}


 40%|████      | 1173/2930 [31:13<45:11,  1.54s/it]

{'loss': 0.1382, 'grad_norm': 0.366676390171051, 'learning_rate': 0.0002399453738477296, 'epoch': 4.0}


 40%|████      | 1174/2930 [31:15<45:22,  1.55s/it]

{'loss': 0.1056, 'grad_norm': 0.30803582072257996, 'learning_rate': 0.0002398088084670536, 'epoch': 4.0}


 40%|████      | 1175/2930 [31:16<45:24,  1.55s/it]

{'loss': 0.1149, 'grad_norm': 0.2782321870326996, 'learning_rate': 0.00023967224308637762, 'epoch': 4.01}


 40%|████      | 1176/2930 [31:18<45:27,  1.56s/it]

{'loss': 0.113, 'grad_norm': 0.3363230526447296, 'learning_rate': 0.00023953567770570162, 'epoch': 4.01}


 40%|████      | 1177/2930 [31:19<45:15,  1.55s/it]

{'loss': 0.1147, 'grad_norm': 0.35087984800338745, 'learning_rate': 0.0002393991123250256, 'epoch': 4.01}


 40%|████      | 1178/2930 [31:21<45:09,  1.55s/it]

{'loss': 0.0868, 'grad_norm': 0.3287426829338074, 'learning_rate': 0.0002392625469443496, 'epoch': 4.02}


 40%|████      | 1179/2930 [31:22<45:18,  1.55s/it]

{'loss': 0.0989, 'grad_norm': 0.3516135513782501, 'learning_rate': 0.00023912598156367363, 'epoch': 4.02}


 40%|████      | 1180/2930 [31:24<45:28,  1.56s/it]

{'loss': 0.1368, 'grad_norm': 0.4272698760032654, 'learning_rate': 0.00023898941618299762, 'epoch': 4.02}


 40%|████      | 1181/2930 [31:26<45:23,  1.56s/it]

{'loss': 0.128, 'grad_norm': 0.45487549901008606, 'learning_rate': 0.00023885285080232162, 'epoch': 4.03}


 40%|████      | 1182/2930 [31:27<45:12,  1.55s/it]

{'loss': 0.0949, 'grad_norm': 0.3497222661972046, 'learning_rate': 0.00023871628542164564, 'epoch': 4.03}


 40%|████      | 1183/2930 [31:29<45:18,  1.56s/it]

{'loss': 0.1012, 'grad_norm': 0.35345640778541565, 'learning_rate': 0.00023857972004096963, 'epoch': 4.03}


 40%|████      | 1184/2930 [31:30<45:07,  1.55s/it]

{'loss': 0.0966, 'grad_norm': 0.34546080231666565, 'learning_rate': 0.0002384431546602936, 'epoch': 4.04}


 40%|████      | 1185/2930 [31:32<45:11,  1.55s/it]

{'loss': 0.1059, 'grad_norm': 0.3523581624031067, 'learning_rate': 0.00023830658927961765, 'epoch': 4.04}


 40%|████      | 1186/2930 [31:33<45:13,  1.56s/it]

{'loss': 0.103, 'grad_norm': 0.3271745443344116, 'learning_rate': 0.00023817002389894164, 'epoch': 4.04}


 41%|████      | 1187/2930 [31:35<45:15,  1.56s/it]

{'loss': 0.1017, 'grad_norm': 0.3386284410953522, 'learning_rate': 0.0002380334585182656, 'epoch': 4.05}


 41%|████      | 1188/2930 [31:36<45:18,  1.56s/it]

{'loss': 0.1162, 'grad_norm': 0.39454829692840576, 'learning_rate': 0.00023789689313758966, 'epoch': 4.05}


 41%|████      | 1189/2930 [31:38<45:11,  1.56s/it]

{'loss': 0.0954, 'grad_norm': 0.34064221382141113, 'learning_rate': 0.00023776032775691362, 'epoch': 4.05}


 41%|████      | 1190/2930 [31:40<45:06,  1.56s/it]

{'loss': 0.1066, 'grad_norm': 0.3554370403289795, 'learning_rate': 0.00023762376237623762, 'epoch': 4.06}


 41%|████      | 1191/2930 [31:41<45:03,  1.55s/it]

{'loss': 0.1026, 'grad_norm': 0.33314597606658936, 'learning_rate': 0.00023748719699556167, 'epoch': 4.06}


 41%|████      | 1192/2930 [31:43<45:01,  1.55s/it]

{'loss': 0.112, 'grad_norm': 0.40034621953964233, 'learning_rate': 0.00023735063161488563, 'epoch': 4.06}


 41%|████      | 1193/2930 [31:44<44:40,  1.54s/it]

{'loss': 0.1042, 'grad_norm': 0.3640078902244568, 'learning_rate': 0.00023721406623420963, 'epoch': 4.07}


 41%|████      | 1194/2930 [31:46<44:53,  1.55s/it]

{'loss': 0.1092, 'grad_norm': 0.3620778024196625, 'learning_rate': 0.00023707750085353362, 'epoch': 4.07}


 41%|████      | 1195/2930 [31:47<44:56,  1.55s/it]

{'loss': 0.1129, 'grad_norm': 0.34062379598617554, 'learning_rate': 0.00023694093547285764, 'epoch': 4.08}


 41%|████      | 1196/2930 [31:49<45:07,  1.56s/it]

{'loss': 0.1002, 'grad_norm': 0.33035242557525635, 'learning_rate': 0.00023680437009218164, 'epoch': 4.08}


 41%|████      | 1197/2930 [31:50<44:58,  1.56s/it]

{'loss': 0.1006, 'grad_norm': 0.3668656647205353, 'learning_rate': 0.00023666780471150563, 'epoch': 4.08}


 41%|████      | 1198/2930 [31:52<45:04,  1.56s/it]

{'loss': 0.1165, 'grad_norm': 0.3612654209136963, 'learning_rate': 0.00023653123933082965, 'epoch': 4.09}


 41%|████      | 1199/2930 [31:54<45:05,  1.56s/it]

{'loss': 0.0897, 'grad_norm': 0.3120172321796417, 'learning_rate': 0.00023639467395015365, 'epoch': 4.09}


 41%|████      | 1200/2930 [31:55<45:08,  1.57s/it]

{'loss': 0.0998, 'grad_norm': 0.36515429615974426, 'learning_rate': 0.00023625810856947764, 'epoch': 4.09}


 41%|████      | 1201/2930 [31:57<45:12,  1.57s/it]

{'loss': 0.1069, 'grad_norm': 0.3658989667892456, 'learning_rate': 0.00023612154318880166, 'epoch': 4.1}


 41%|████      | 1202/2930 [31:58<44:57,  1.56s/it]

{'loss': 0.121, 'grad_norm': 0.43850329518318176, 'learning_rate': 0.00023598497780812566, 'epoch': 4.1}


 41%|████      | 1203/2930 [32:00<44:36,  1.55s/it]

{'loss': 0.1027, 'grad_norm': 0.3706657886505127, 'learning_rate': 0.00023584841242744965, 'epoch': 4.1}


 41%|████      | 1204/2930 [32:01<44:41,  1.55s/it]

{'loss': 0.1204, 'grad_norm': 0.37414342164993286, 'learning_rate': 0.00023571184704677367, 'epoch': 4.11}


 41%|████      | 1205/2930 [32:03<44:52,  1.56s/it]

{'loss': 0.1093, 'grad_norm': 0.3697146773338318, 'learning_rate': 0.00023557528166609767, 'epoch': 4.11}


 41%|████      | 1206/2930 [32:05<45:01,  1.57s/it]

{'loss': 0.1123, 'grad_norm': 0.3823131024837494, 'learning_rate': 0.00023543871628542163, 'epoch': 4.11}


 41%|████      | 1207/2930 [32:06<45:18,  1.58s/it]

{'loss': 0.1134, 'grad_norm': 0.3625350296497345, 'learning_rate': 0.00023530215090474568, 'epoch': 4.12}


 41%|████      | 1208/2930 [32:08<45:26,  1.58s/it]

{'loss': 0.1165, 'grad_norm': 0.36021867394447327, 'learning_rate': 0.00023516558552406968, 'epoch': 4.12}


 41%|████▏     | 1209/2930 [32:09<45:00,  1.57s/it]

{'loss': 0.1009, 'grad_norm': 0.3381684422492981, 'learning_rate': 0.00023502902014339364, 'epoch': 4.12}


 41%|████▏     | 1210/2930 [32:11<44:45,  1.56s/it]

{'loss': 0.1017, 'grad_norm': 0.34462082386016846, 'learning_rate': 0.0002348924547627177, 'epoch': 4.13}


 41%|████▏     | 1211/2930 [32:12<45:11,  1.58s/it]

{'loss': 0.1276, 'grad_norm': 0.40987396240234375, 'learning_rate': 0.00023475588938204168, 'epoch': 4.13}


 41%|████▏     | 1212/2930 [32:14<45:04,  1.57s/it]

{'loss': 0.1121, 'grad_norm': 0.3921017050743103, 'learning_rate': 0.00023461932400136565, 'epoch': 4.13}


 41%|████▏     | 1213/2930 [32:16<45:18,  1.58s/it]

{'loss': 0.1137, 'grad_norm': 0.39590439200401306, 'learning_rate': 0.00023448275862068965, 'epoch': 4.14}


 41%|████▏     | 1214/2930 [32:17<45:03,  1.58s/it]

{'loss': 0.1051, 'grad_norm': 0.3881707191467285, 'learning_rate': 0.00023434619324001367, 'epoch': 4.14}


 41%|████▏     | 1215/2930 [32:19<45:18,  1.59s/it]

{'loss': 0.1079, 'grad_norm': 0.35476046800613403, 'learning_rate': 0.00023420962785933766, 'epoch': 4.14}


 42%|████▏     | 1216/2930 [32:20<45:04,  1.58s/it]

{'loss': 0.1164, 'grad_norm': 0.400291383266449, 'learning_rate': 0.00023407306247866166, 'epoch': 4.15}


 42%|████▏     | 1217/2930 [32:22<45:12,  1.58s/it]

{'loss': 0.0984, 'grad_norm': 0.3155379593372345, 'learning_rate': 0.00023393649709798568, 'epoch': 4.15}


 42%|████▏     | 1218/2930 [32:24<45:16,  1.59s/it]

{'loss': 0.1058, 'grad_norm': 0.3774275779724121, 'learning_rate': 0.00023379993171730967, 'epoch': 4.15}


 42%|████▏     | 1219/2930 [32:25<45:04,  1.58s/it]

{'loss': 0.0999, 'grad_norm': 0.3685595989227295, 'learning_rate': 0.00023366336633663366, 'epoch': 4.16}


 42%|████▏     | 1220/2930 [32:27<44:54,  1.58s/it]

{'loss': 0.1082, 'grad_norm': 0.3295431137084961, 'learning_rate': 0.00023352680095595769, 'epoch': 4.16}


 42%|████▏     | 1221/2930 [32:28<45:00,  1.58s/it]

{'loss': 0.0996, 'grad_norm': 0.34490376710891724, 'learning_rate': 0.00023339023557528168, 'epoch': 4.16}


 42%|████▏     | 1222/2930 [32:30<44:46,  1.57s/it]

{'loss': 0.1043, 'grad_norm': 0.35443416237831116, 'learning_rate': 0.00023325367019460567, 'epoch': 4.17}


 42%|████▏     | 1223/2930 [32:31<44:37,  1.57s/it]

{'loss': 0.1016, 'grad_norm': 0.37805864214897156, 'learning_rate': 0.0002331171048139297, 'epoch': 4.17}


 42%|████▏     | 1224/2930 [32:33<44:43,  1.57s/it]

{'loss': 0.1101, 'grad_norm': 0.3954890966415405, 'learning_rate': 0.0002329805394332537, 'epoch': 4.17}


 42%|████▏     | 1225/2930 [32:34<44:27,  1.56s/it]

{'loss': 0.1056, 'grad_norm': 0.3987889289855957, 'learning_rate': 0.00023284397405257768, 'epoch': 4.18}


 42%|████▏     | 1226/2930 [32:36<44:38,  1.57s/it]

{'loss': 0.1128, 'grad_norm': 0.37308165431022644, 'learning_rate': 0.0002327074086719017, 'epoch': 4.18}


 42%|████▏     | 1227/2930 [32:38<44:49,  1.58s/it]

{'loss': 0.111, 'grad_norm': 0.35116440057754517, 'learning_rate': 0.0002325708432912257, 'epoch': 4.18}


 42%|████▏     | 1228/2930 [32:39<44:51,  1.58s/it]

{'loss': 0.0848, 'grad_norm': 0.28259575366973877, 'learning_rate': 0.0002324342779105497, 'epoch': 4.19}


 42%|████▏     | 1229/2930 [32:41<44:56,  1.59s/it]

{'loss': 0.1115, 'grad_norm': 0.3517514765262604, 'learning_rate': 0.00023229771252987366, 'epoch': 4.19}


 42%|████▏     | 1230/2930 [32:42<44:53,  1.58s/it]

{'loss': 0.1095, 'grad_norm': 0.3872342109680176, 'learning_rate': 0.0002321611471491977, 'epoch': 4.19}


 42%|████▏     | 1231/2930 [32:44<44:57,  1.59s/it]

{'loss': 0.0862, 'grad_norm': 0.3028417229652405, 'learning_rate': 0.00023202458176852168, 'epoch': 4.2}


 42%|████▏     | 1232/2930 [32:46<44:56,  1.59s/it]

{'loss': 0.0933, 'grad_norm': 0.3357841670513153, 'learning_rate': 0.00023188801638784567, 'epoch': 4.2}


 42%|████▏     | 1233/2930 [32:47<45:01,  1.59s/it]

{'loss': 0.1131, 'grad_norm': 0.40446117520332336, 'learning_rate': 0.00023175145100716972, 'epoch': 4.2}


 42%|████▏     | 1234/2930 [32:49<45:09,  1.60s/it]

{'loss': 0.1041, 'grad_norm': 0.3738800883293152, 'learning_rate': 0.00023161488562649369, 'epoch': 4.21}


 42%|████▏     | 1235/2930 [32:50<44:52,  1.59s/it]

{'loss': 0.1066, 'grad_norm': 0.4213985502719879, 'learning_rate': 0.00023147832024581768, 'epoch': 4.21}


 42%|████▏     | 1236/2930 [32:52<44:47,  1.59s/it]

{'loss': 0.1015, 'grad_norm': 0.42464199662208557, 'learning_rate': 0.0002313417548651417, 'epoch': 4.21}


 42%|████▏     | 1237/2930 [32:54<44:37,  1.58s/it]

{'loss': 0.104, 'grad_norm': 0.3962287902832031, 'learning_rate': 0.0002312051894844657, 'epoch': 4.22}


 42%|████▏     | 1238/2930 [32:55<44:45,  1.59s/it]

{'loss': 0.1141, 'grad_norm': 0.3740456700325012, 'learning_rate': 0.0002310686241037897, 'epoch': 4.22}


 42%|████▏     | 1239/2930 [32:57<44:48,  1.59s/it]

{'loss': 0.106, 'grad_norm': 0.37326210737228394, 'learning_rate': 0.0002309320587231137, 'epoch': 4.23}


 42%|████▏     | 1240/2930 [32:58<44:55,  1.59s/it]

{'loss': 0.112, 'grad_norm': 0.38100889325141907, 'learning_rate': 0.0002307954933424377, 'epoch': 4.23}


 42%|████▏     | 1241/2930 [33:00<44:55,  1.60s/it]

{'loss': 0.1065, 'grad_norm': 0.3355734646320343, 'learning_rate': 0.0002306589279617617, 'epoch': 4.23}


 42%|████▏     | 1242/2930 [33:02<44:46,  1.59s/it]

{'loss': 0.0975, 'grad_norm': 0.33029505610466003, 'learning_rate': 0.00023052236258108572, 'epoch': 4.24}


 42%|████▏     | 1243/2930 [33:03<44:52,  1.60s/it]

{'loss': 0.1051, 'grad_norm': 0.36362576484680176, 'learning_rate': 0.00023038579720040971, 'epoch': 4.24}


 42%|████▏     | 1244/2930 [33:05<44:35,  1.59s/it]

{'loss': 0.1045, 'grad_norm': 0.3711715638637543, 'learning_rate': 0.0002302492318197337, 'epoch': 4.24}


 42%|████▏     | 1245/2930 [33:06<44:36,  1.59s/it]

{'loss': 0.0959, 'grad_norm': 0.3576740026473999, 'learning_rate': 0.0002301126664390577, 'epoch': 4.25}


 43%|████▎     | 1246/2930 [33:08<44:41,  1.59s/it]

{'loss': 0.1201, 'grad_norm': 0.3996799886226654, 'learning_rate': 0.00022997610105838172, 'epoch': 4.25}


 43%|████▎     | 1247/2930 [33:10<44:40,  1.59s/it]

{'loss': 0.1088, 'grad_norm': 0.3714316189289093, 'learning_rate': 0.00022983953567770572, 'epoch': 4.25}


 43%|████▎     | 1248/2930 [33:11<44:26,  1.59s/it]

{'loss': 0.1223, 'grad_norm': 0.40470755100250244, 'learning_rate': 0.00022970297029702968, 'epoch': 4.26}


 43%|████▎     | 1249/2930 [33:13<44:17,  1.58s/it]

{'loss': 0.1344, 'grad_norm': 0.40810081362724304, 'learning_rate': 0.00022956640491635373, 'epoch': 4.26}


 43%|████▎     | 1250/2930 [33:14<44:25,  1.59s/it]

{'loss': 0.0982, 'grad_norm': 0.31901276111602783, 'learning_rate': 0.00022942983953567773, 'epoch': 4.26}


 43%|████▎     | 1251/2930 [33:16<44:25,  1.59s/it]

{'loss': 0.1136, 'grad_norm': 0.36974531412124634, 'learning_rate': 0.0002292932741550017, 'epoch': 4.27}


 43%|████▎     | 1252/2930 [33:17<44:10,  1.58s/it]

{'loss': 0.1212, 'grad_norm': 0.3801455795764923, 'learning_rate': 0.00022915670877432574, 'epoch': 4.27}


 43%|████▎     | 1253/2930 [33:19<44:10,  1.58s/it]

{'loss': 0.1172, 'grad_norm': 0.3722931146621704, 'learning_rate': 0.0002290201433936497, 'epoch': 4.27}


 43%|████▎     | 1254/2930 [33:21<44:16,  1.58s/it]

{'loss': 0.1189, 'grad_norm': 0.3652172088623047, 'learning_rate': 0.0002288835780129737, 'epoch': 4.28}


 43%|████▎     | 1255/2930 [33:22<44:23,  1.59s/it]

{'loss': 0.1163, 'grad_norm': 0.36915603280067444, 'learning_rate': 0.00022874701263229775, 'epoch': 4.28}


 43%|████▎     | 1256/2930 [33:24<44:10,  1.58s/it]

{'loss': 0.1134, 'grad_norm': 0.35276147723197937, 'learning_rate': 0.00022861044725162172, 'epoch': 4.28}


 43%|████▎     | 1257/2930 [33:25<44:03,  1.58s/it]

{'loss': 0.113, 'grad_norm': 0.3547325134277344, 'learning_rate': 0.0002284738818709457, 'epoch': 4.29}


 43%|████▎     | 1258/2930 [33:27<44:06,  1.58s/it]

{'loss': 0.1181, 'grad_norm': 0.3617885112762451, 'learning_rate': 0.00022833731649026976, 'epoch': 4.29}


 43%|████▎     | 1259/2930 [33:28<44:02,  1.58s/it]

{'loss': 0.1283, 'grad_norm': 0.3866473436355591, 'learning_rate': 0.00022820075110959373, 'epoch': 4.29}


 43%|████▎     | 1260/2930 [33:30<43:51,  1.58s/it]

{'loss': 0.1071, 'grad_norm': 0.34909480810165405, 'learning_rate': 0.00022806418572891772, 'epoch': 4.3}


 43%|████▎     | 1261/2930 [33:32<43:52,  1.58s/it]

{'loss': 0.0995, 'grad_norm': 0.32251983880996704, 'learning_rate': 0.00022792762034824172, 'epoch': 4.3}


 43%|████▎     | 1262/2930 [33:33<43:50,  1.58s/it]

{'loss': 0.1019, 'grad_norm': 0.3689984679222107, 'learning_rate': 0.00022779105496756574, 'epoch': 4.3}


 43%|████▎     | 1263/2930 [33:35<43:46,  1.58s/it]

{'loss': 0.1085, 'grad_norm': 0.38297203183174133, 'learning_rate': 0.00022765448958688973, 'epoch': 4.31}


 43%|████▎     | 1264/2930 [33:36<43:27,  1.57s/it]

{'loss': 0.1221, 'grad_norm': 0.3836141526699066, 'learning_rate': 0.00022751792420621373, 'epoch': 4.31}


 43%|████▎     | 1265/2930 [33:38<43:16,  1.56s/it]

{'loss': 0.1266, 'grad_norm': 0.4087560176849365, 'learning_rate': 0.00022738135882553775, 'epoch': 4.31}


 43%|████▎     | 1266/2930 [33:39<43:17,  1.56s/it]

{'loss': 0.126, 'grad_norm': 0.4035506844520569, 'learning_rate': 0.00022724479344486174, 'epoch': 4.32}


 43%|████▎     | 1267/2930 [33:41<43:07,  1.56s/it]

{'loss': 0.0972, 'grad_norm': 0.3455239534378052, 'learning_rate': 0.00022710822806418574, 'epoch': 4.32}


 43%|████▎     | 1268/2930 [33:43<43:07,  1.56s/it]

{'loss': 0.115, 'grad_norm': 0.34110090136528015, 'learning_rate': 0.00022697166268350976, 'epoch': 4.32}


 43%|████▎     | 1269/2930 [33:44<43:08,  1.56s/it]

{'loss': 0.1109, 'grad_norm': 0.3587521016597748, 'learning_rate': 0.00022683509730283375, 'epoch': 4.33}


 43%|████▎     | 1270/2930 [33:46<43:01,  1.56s/it]

{'loss': 0.0959, 'grad_norm': 0.35279932618141174, 'learning_rate': 0.00022669853192215772, 'epoch': 4.33}


 43%|████▎     | 1271/2930 [33:47<43:06,  1.56s/it]

{'loss': 0.105, 'grad_norm': 0.3673732280731201, 'learning_rate': 0.00022656196654148177, 'epoch': 4.33}


 43%|████▎     | 1272/2930 [33:49<43:13,  1.56s/it]

{'loss': 0.1009, 'grad_norm': 0.33731788396835327, 'learning_rate': 0.00022642540116080576, 'epoch': 4.34}


 43%|████▎     | 1273/2930 [33:50<43:12,  1.56s/it]

{'loss': 0.0989, 'grad_norm': 0.37158969044685364, 'learning_rate': 0.00022628883578012973, 'epoch': 4.34}


 43%|████▎     | 1274/2930 [33:52<43:12,  1.57s/it]

{'loss': 0.1143, 'grad_norm': 0.3682302236557007, 'learning_rate': 0.00022615227039945378, 'epoch': 4.34}


 44%|████▎     | 1275/2930 [33:53<43:06,  1.56s/it]

{'loss': 0.1018, 'grad_norm': 0.3883725702762604, 'learning_rate': 0.00022601570501877777, 'epoch': 4.35}


 44%|████▎     | 1276/2930 [33:55<43:05,  1.56s/it]

{'loss': 0.1004, 'grad_norm': 0.3468860685825348, 'learning_rate': 0.00022587913963810174, 'epoch': 4.35}


 44%|████▎     | 1277/2930 [33:57<42:55,  1.56s/it]

{'loss': 0.0989, 'grad_norm': 0.3566993176937103, 'learning_rate': 0.00022574257425742573, 'epoch': 4.35}


 44%|████▎     | 1278/2930 [33:58<42:56,  1.56s/it]

{'loss': 0.0955, 'grad_norm': 0.3518475890159607, 'learning_rate': 0.00022560600887674975, 'epoch': 4.36}


 44%|████▎     | 1279/2930 [34:00<43:01,  1.56s/it]

{'loss': 0.1157, 'grad_norm': 0.3828028738498688, 'learning_rate': 0.00022546944349607375, 'epoch': 4.36}


 44%|████▎     | 1280/2930 [34:01<43:14,  1.57s/it]

{'loss': 0.114, 'grad_norm': 0.40711069107055664, 'learning_rate': 0.00022533287811539774, 'epoch': 4.36}


 44%|████▎     | 1281/2930 [34:03<42:51,  1.56s/it]

{'loss': 0.0973, 'grad_norm': 0.31608396768569946, 'learning_rate': 0.00022519631273472176, 'epoch': 4.37}


 44%|████▍     | 1282/2930 [34:04<42:39,  1.55s/it]

{'loss': 0.1033, 'grad_norm': 0.3639872372150421, 'learning_rate': 0.00022505974735404576, 'epoch': 4.37}


 44%|████▍     | 1283/2930 [34:06<42:40,  1.55s/it]

{'loss': 0.0984, 'grad_norm': 0.3679070472717285, 'learning_rate': 0.00022492318197336975, 'epoch': 4.38}


 44%|████▍     | 1284/2930 [34:07<42:32,  1.55s/it]

{'loss': 0.0979, 'grad_norm': 0.34045636653900146, 'learning_rate': 0.00022478661659269377, 'epoch': 4.38}


 44%|████▍     | 1285/2930 [34:09<42:14,  1.54s/it]

{'loss': 0.1036, 'grad_norm': 0.38376671075820923, 'learning_rate': 0.00022465005121201777, 'epoch': 4.38}


 44%|████▍     | 1286/2930 [34:11<42:21,  1.55s/it]

{'loss': 0.1054, 'grad_norm': 0.377642959356308, 'learning_rate': 0.00022451348583134176, 'epoch': 4.39}


 44%|████▍     | 1287/2930 [34:12<42:18,  1.55s/it]

{'loss': 0.1039, 'grad_norm': 0.38423392176628113, 'learning_rate': 0.00022437692045066578, 'epoch': 4.39}


 44%|████▍     | 1288/2930 [34:14<42:10,  1.54s/it]

{'loss': 0.1051, 'grad_norm': 0.3772263824939728, 'learning_rate': 0.00022424035506998978, 'epoch': 4.39}


 44%|████▍     | 1289/2930 [34:15<42:02,  1.54s/it]

{'loss': 0.1012, 'grad_norm': 0.3541126549243927, 'learning_rate': 0.00022410378968931377, 'epoch': 4.4}


 44%|████▍     | 1290/2930 [34:17<42:10,  1.54s/it]

{'loss': 0.1143, 'grad_norm': 0.41143283247947693, 'learning_rate': 0.0002239672243086378, 'epoch': 4.4}


 44%|████▍     | 1291/2930 [34:18<42:19,  1.55s/it]

{'loss': 0.1133, 'grad_norm': 0.3809244632720947, 'learning_rate': 0.00022383065892796179, 'epoch': 4.4}


 44%|████▍     | 1292/2930 [34:20<42:25,  1.55s/it]

{'loss': 0.1061, 'grad_norm': 0.35734739899635315, 'learning_rate': 0.00022369409354728575, 'epoch': 4.41}


 44%|████▍     | 1293/2930 [34:21<42:26,  1.56s/it]

{'loss': 0.1067, 'grad_norm': 0.3187236189842224, 'learning_rate': 0.00022355752816660975, 'epoch': 4.41}


 44%|████▍     | 1294/2930 [34:23<42:33,  1.56s/it]

{'loss': 0.1046, 'grad_norm': 0.3712283968925476, 'learning_rate': 0.0002234209627859338, 'epoch': 4.41}


 44%|████▍     | 1295/2930 [34:25<42:30,  1.56s/it]

{'loss': 0.1118, 'grad_norm': 0.3823837339878082, 'learning_rate': 0.00022328439740525776, 'epoch': 4.42}


 44%|████▍     | 1296/2930 [34:26<42:35,  1.56s/it]

{'loss': 0.1083, 'grad_norm': 0.37083399295806885, 'learning_rate': 0.00022314783202458176, 'epoch': 4.42}


 44%|████▍     | 1297/2930 [34:28<42:28,  1.56s/it]

{'loss': 0.1238, 'grad_norm': 0.432931512594223, 'learning_rate': 0.0002230112666439058, 'epoch': 4.42}


 44%|████▍     | 1298/2930 [34:29<42:28,  1.56s/it]

{'loss': 0.1031, 'grad_norm': 0.3532668352127075, 'learning_rate': 0.00022287470126322977, 'epoch': 4.43}


 44%|████▍     | 1299/2930 [34:31<42:13,  1.55s/it]

{'loss': 0.1027, 'grad_norm': 0.3598916232585907, 'learning_rate': 0.00022273813588255377, 'epoch': 4.43}


 44%|████▍     | 1300/2930 [34:32<42:13,  1.55s/it]

{'loss': 0.0999, 'grad_norm': 0.34165439009666443, 'learning_rate': 0.0002226015705018778, 'epoch': 4.43}


 44%|████▍     | 1301/2930 [34:34<42:03,  1.55s/it]

{'loss': 0.0979, 'grad_norm': 0.359994113445282, 'learning_rate': 0.00022246500512120178, 'epoch': 4.44}


 44%|████▍     | 1302/2930 [34:35<41:53,  1.54s/it]

{'loss': 0.0971, 'grad_norm': 0.3361303210258484, 'learning_rate': 0.00022232843974052577, 'epoch': 4.44}


 44%|████▍     | 1303/2930 [34:37<41:52,  1.54s/it]

{'loss': 0.1017, 'grad_norm': 0.3537121117115021, 'learning_rate': 0.0002221918743598498, 'epoch': 4.44}


 45%|████▍     | 1304/2930 [34:38<41:55,  1.55s/it]

{'loss': 0.1043, 'grad_norm': 0.3666740357875824, 'learning_rate': 0.0002220553089791738, 'epoch': 4.45}


 45%|████▍     | 1305/2930 [34:40<41:42,  1.54s/it]

{'loss': 0.1085, 'grad_norm': 0.36131811141967773, 'learning_rate': 0.00022191874359849778, 'epoch': 4.45}


 45%|████▍     | 1306/2930 [34:42<41:50,  1.55s/it]

{'loss': 0.0956, 'grad_norm': 0.35290417075157166, 'learning_rate': 0.0002217821782178218, 'epoch': 4.45}


 45%|████▍     | 1307/2930 [34:43<41:57,  1.55s/it]

{'loss': 0.1189, 'grad_norm': 0.38197124004364014, 'learning_rate': 0.0002216456128371458, 'epoch': 4.46}


 45%|████▍     | 1308/2930 [34:45<42:03,  1.56s/it]

{'loss': 0.1135, 'grad_norm': 0.3722735643386841, 'learning_rate': 0.0002215090474564698, 'epoch': 4.46}


 45%|████▍     | 1309/2930 [34:46<41:56,  1.55s/it]

{'loss': 0.1077, 'grad_norm': 0.3898736238479614, 'learning_rate': 0.00022137248207579382, 'epoch': 4.46}


 45%|████▍     | 1310/2930 [34:48<42:10,  1.56s/it]

{'loss': 0.1064, 'grad_norm': 0.3697137236595154, 'learning_rate': 0.0002212359166951178, 'epoch': 4.47}


 45%|████▍     | 1311/2930 [34:49<41:57,  1.55s/it]

{'loss': 0.1078, 'grad_norm': 0.3959270417690277, 'learning_rate': 0.0002210993513144418, 'epoch': 4.47}


 45%|████▍     | 1312/2930 [34:51<41:41,  1.55s/it]

{'loss': 0.1159, 'grad_norm': 0.4228207468986511, 'learning_rate': 0.00022096278593376577, 'epoch': 4.47}


 45%|████▍     | 1313/2930 [34:52<41:45,  1.55s/it]

{'loss': 0.1105, 'grad_norm': 0.3828580677509308, 'learning_rate': 0.00022082622055308982, 'epoch': 4.48}


 45%|████▍     | 1314/2930 [34:54<41:53,  1.56s/it]

{'loss': 0.1089, 'grad_norm': 0.3642544150352478, 'learning_rate': 0.0002206896551724138, 'epoch': 4.48}


 45%|████▍     | 1315/2930 [34:56<41:56,  1.56s/it]

{'loss': 0.1312, 'grad_norm': 0.400427907705307, 'learning_rate': 0.00022055308979173778, 'epoch': 4.48}


 45%|████▍     | 1316/2930 [34:57<41:40,  1.55s/it]

{'loss': 0.0956, 'grad_norm': 0.33515945076942444, 'learning_rate': 0.00022041652441106183, 'epoch': 4.49}


 45%|████▍     | 1317/2930 [34:59<41:48,  1.56s/it]

{'loss': 0.1087, 'grad_norm': 0.3612820506095886, 'learning_rate': 0.0002202799590303858, 'epoch': 4.49}


 45%|████▍     | 1318/2930 [35:00<41:52,  1.56s/it]

{'loss': 0.111, 'grad_norm': 0.3186860978603363, 'learning_rate': 0.0002201433936497098, 'epoch': 4.49}


 45%|████▌     | 1319/2930 [35:02<41:55,  1.56s/it]

{'loss': 0.1172, 'grad_norm': 0.37013742327690125, 'learning_rate': 0.00022000682826903384, 'epoch': 4.5}


 45%|████▌     | 1320/2930 [35:03<41:40,  1.55s/it]

{'loss': 0.1072, 'grad_norm': 0.3472655415534973, 'learning_rate': 0.0002198702628883578, 'epoch': 4.5}


 45%|████▌     | 1321/2930 [35:05<41:48,  1.56s/it]

{'loss': 0.1242, 'grad_norm': 0.4035055339336395, 'learning_rate': 0.0002197336975076818, 'epoch': 4.5}


 45%|████▌     | 1322/2930 [35:06<41:53,  1.56s/it]

{'loss': 0.1102, 'grad_norm': 0.36281511187553406, 'learning_rate': 0.00021959713212700582, 'epoch': 4.51}


 45%|████▌     | 1323/2930 [35:08<41:55,  1.57s/it]

{'loss': 0.1135, 'grad_norm': 0.3819572329521179, 'learning_rate': 0.00021946056674632981, 'epoch': 4.51}


 45%|████▌     | 1324/2930 [35:10<41:51,  1.56s/it]

{'loss': 0.1242, 'grad_norm': 0.4552781283855438, 'learning_rate': 0.0002193240013656538, 'epoch': 4.51}


 45%|████▌     | 1325/2930 [35:11<41:39,  1.56s/it]

{'loss': 0.1012, 'grad_norm': 0.35279056429862976, 'learning_rate': 0.00021918743598497783, 'epoch': 4.52}


 45%|████▌     | 1326/2930 [35:13<41:45,  1.56s/it]

{'loss': 0.1046, 'grad_norm': 0.36662715673446655, 'learning_rate': 0.00021905087060430182, 'epoch': 4.52}


 45%|████▌     | 1327/2930 [35:14<41:22,  1.55s/it]

{'loss': 0.1151, 'grad_norm': 0.35865357518196106, 'learning_rate': 0.00021891430522362582, 'epoch': 4.53}


 45%|████▌     | 1328/2930 [35:16<41:24,  1.55s/it]

{'loss': 0.0949, 'grad_norm': 0.3307590186595917, 'learning_rate': 0.0002187777398429498, 'epoch': 4.53}


 45%|████▌     | 1329/2930 [35:17<41:18,  1.55s/it]

{'loss': 0.1082, 'grad_norm': 0.33310988545417786, 'learning_rate': 0.00021864117446227383, 'epoch': 4.53}


 45%|████▌     | 1330/2930 [35:19<41:13,  1.55s/it]

{'loss': 0.0997, 'grad_norm': 0.35183531045913696, 'learning_rate': 0.00021850460908159783, 'epoch': 4.54}


 45%|████▌     | 1331/2930 [35:20<41:25,  1.55s/it]

{'loss': 0.1151, 'grad_norm': 0.4016630947589874, 'learning_rate': 0.00021836804370092182, 'epoch': 4.54}


 45%|████▌     | 1332/2930 [35:22<41:31,  1.56s/it]

{'loss': 0.0951, 'grad_norm': 0.3568877875804901, 'learning_rate': 0.00021823147832024584, 'epoch': 4.54}


 45%|████▌     | 1333/2930 [35:24<41:34,  1.56s/it]

{'loss': 0.1059, 'grad_norm': 0.39961811900138855, 'learning_rate': 0.00021809491293956984, 'epoch': 4.55}


 46%|████▌     | 1334/2930 [35:25<41:40,  1.57s/it]

{'loss': 0.0945, 'grad_norm': 0.37139371037483215, 'learning_rate': 0.0002179583475588938, 'epoch': 4.55}


 46%|████▌     | 1335/2930 [35:27<41:37,  1.57s/it]

{'loss': 0.1274, 'grad_norm': 0.423658162355423, 'learning_rate': 0.00021782178217821785, 'epoch': 4.55}


 46%|████▌     | 1336/2930 [35:28<41:39,  1.57s/it]

{'loss': 0.1183, 'grad_norm': 0.41190263628959656, 'learning_rate': 0.00021768521679754185, 'epoch': 4.56}


 46%|████▌     | 1337/2930 [35:30<41:41,  1.57s/it]

{'loss': 0.0998, 'grad_norm': 0.34843534231185913, 'learning_rate': 0.00021754865141686581, 'epoch': 4.56}


 46%|████▌     | 1338/2930 [35:31<41:38,  1.57s/it]

{'loss': 0.1054, 'grad_norm': 0.3485611379146576, 'learning_rate': 0.00021741208603618986, 'epoch': 4.56}


 46%|████▌     | 1339/2930 [35:33<41:50,  1.58s/it]

{'loss': 0.1105, 'grad_norm': 0.34962090849876404, 'learning_rate': 0.00021727552065551383, 'epoch': 4.57}


 46%|████▌     | 1340/2930 [35:35<41:56,  1.58s/it]

{'loss': 0.1052, 'grad_norm': 0.33388465642929077, 'learning_rate': 0.00021713895527483782, 'epoch': 4.57}


 46%|████▌     | 1341/2930 [35:36<41:43,  1.58s/it]

{'loss': 0.1013, 'grad_norm': 0.34331101179122925, 'learning_rate': 0.00021700238989416187, 'epoch': 4.57}


 46%|████▌     | 1342/2930 [35:38<41:55,  1.58s/it]

{'loss': 0.0987, 'grad_norm': 0.3282906115055084, 'learning_rate': 0.00021686582451348584, 'epoch': 4.58}


 46%|████▌     | 1343/2930 [35:39<41:59,  1.59s/it]

{'loss': 0.0984, 'grad_norm': 0.32471388578414917, 'learning_rate': 0.00021672925913280983, 'epoch': 4.58}


 46%|████▌     | 1344/2930 [35:41<41:48,  1.58s/it]

{'loss': 0.098, 'grad_norm': 0.3188919126987457, 'learning_rate': 0.00021659269375213383, 'epoch': 4.58}


 46%|████▌     | 1345/2930 [35:43<41:46,  1.58s/it]

{'loss': 0.1057, 'grad_norm': 0.35936611890792847, 'learning_rate': 0.00021645612837145785, 'epoch': 4.59}


 46%|████▌     | 1346/2930 [35:44<41:34,  1.57s/it]

{'loss': 0.0982, 'grad_norm': 0.34890660643577576, 'learning_rate': 0.00021631956299078184, 'epoch': 4.59}


 46%|████▌     | 1347/2930 [35:46<41:46,  1.58s/it]

{'loss': 0.1058, 'grad_norm': 0.39104512333869934, 'learning_rate': 0.00021618299761010584, 'epoch': 4.59}


 46%|████▌     | 1348/2930 [35:47<41:36,  1.58s/it]

{'loss': 0.0929, 'grad_norm': 0.3592919111251831, 'learning_rate': 0.00021604643222942986, 'epoch': 4.6}


 46%|████▌     | 1349/2930 [35:49<41:35,  1.58s/it]

{'loss': 0.1154, 'grad_norm': 0.4171750247478485, 'learning_rate': 0.00021590986684875385, 'epoch': 4.6}


 46%|████▌     | 1350/2930 [35:50<41:31,  1.58s/it]

{'loss': 0.1037, 'grad_norm': 0.3850025236606598, 'learning_rate': 0.00021577330146807785, 'epoch': 4.6}


 46%|████▌     | 1351/2930 [35:52<41:41,  1.58s/it]

{'loss': 0.101, 'grad_norm': 0.34974533319473267, 'learning_rate': 0.00021563673608740187, 'epoch': 4.61}


 46%|████▌     | 1352/2930 [35:54<41:43,  1.59s/it]

{'loss': 0.0839, 'grad_norm': 0.3167378008365631, 'learning_rate': 0.00021550017070672586, 'epoch': 4.61}


 46%|████▌     | 1353/2930 [35:55<41:52,  1.59s/it]

{'loss': 0.1089, 'grad_norm': 0.3652569055557251, 'learning_rate': 0.00021536360532604986, 'epoch': 4.61}


 46%|████▌     | 1354/2930 [35:57<41:52,  1.59s/it]

{'loss': 0.0837, 'grad_norm': 0.3274801969528198, 'learning_rate': 0.00021522703994537388, 'epoch': 4.62}


 46%|████▌     | 1355/2930 [35:58<41:49,  1.59s/it]

{'loss': 0.1085, 'grad_norm': 0.4172804057598114, 'learning_rate': 0.00021509047456469787, 'epoch': 4.62}


 46%|████▋     | 1356/2930 [36:00<41:38,  1.59s/it]

{'loss': 0.1124, 'grad_norm': 0.4070846736431122, 'learning_rate': 0.00021495390918402184, 'epoch': 4.62}


 46%|████▋     | 1357/2930 [36:02<41:47,  1.59s/it]

{'loss': 0.1142, 'grad_norm': 0.3613613545894623, 'learning_rate': 0.00021481734380334589, 'epoch': 4.63}


 46%|████▋     | 1358/2930 [36:03<41:32,  1.59s/it]

{'loss': 0.0856, 'grad_norm': 0.3055890202522278, 'learning_rate': 0.00021468077842266988, 'epoch': 4.63}


 46%|████▋     | 1359/2930 [36:05<41:28,  1.58s/it]

{'loss': 0.1039, 'grad_norm': 0.4015965163707733, 'learning_rate': 0.00021454421304199385, 'epoch': 4.63}


 46%|████▋     | 1360/2930 [36:06<41:32,  1.59s/it]

{'loss': 0.1058, 'grad_norm': 0.3460541367530823, 'learning_rate': 0.00021440764766131784, 'epoch': 4.64}


 46%|████▋     | 1361/2930 [36:08<41:35,  1.59s/it]

{'loss': 0.1117, 'grad_norm': 0.39704811573028564, 'learning_rate': 0.0002142710822806419, 'epoch': 4.64}


 46%|████▋     | 1362/2930 [36:10<41:26,  1.59s/it]

{'loss': 0.0958, 'grad_norm': 0.3258242607116699, 'learning_rate': 0.00021413451689996586, 'epoch': 4.64}


 47%|████▋     | 1363/2930 [36:11<41:15,  1.58s/it]

{'loss': 0.1006, 'grad_norm': 0.34833312034606934, 'learning_rate': 0.00021399795151928985, 'epoch': 4.65}


 47%|████▋     | 1364/2930 [36:13<41:06,  1.57s/it]

{'loss': 0.1139, 'grad_norm': 0.413007915019989, 'learning_rate': 0.00021386138613861387, 'epoch': 4.65}


 47%|████▋     | 1365/2930 [36:14<41:01,  1.57s/it]

{'loss': 0.1031, 'grad_norm': 0.3444485366344452, 'learning_rate': 0.00021372482075793787, 'epoch': 4.65}


 47%|████▋     | 1366/2930 [36:16<41:07,  1.58s/it]

{'loss': 0.0924, 'grad_norm': 0.33075910806655884, 'learning_rate': 0.00021358825537726186, 'epoch': 4.66}


 47%|████▋     | 1367/2930 [36:17<41:13,  1.58s/it]

{'loss': 0.0993, 'grad_norm': 0.3276139199733734, 'learning_rate': 0.00021345168999658588, 'epoch': 4.66}


 47%|████▋     | 1368/2930 [36:19<41:22,  1.59s/it]

{'loss': 0.1297, 'grad_norm': 0.4269423484802246, 'learning_rate': 0.00021331512461590988, 'epoch': 4.66}


 47%|████▋     | 1369/2930 [36:21<41:09,  1.58s/it]

{'loss': 0.0955, 'grad_norm': 0.29836565256118774, 'learning_rate': 0.00021317855923523387, 'epoch': 4.67}


 47%|████▋     | 1370/2930 [36:22<40:58,  1.58s/it]

{'loss': 0.105, 'grad_norm': 0.3374583423137665, 'learning_rate': 0.0002130419938545579, 'epoch': 4.67}


 47%|████▋     | 1371/2930 [36:24<40:49,  1.57s/it]

{'loss': 0.0993, 'grad_norm': 0.34384721517562866, 'learning_rate': 0.00021290542847388189, 'epoch': 4.68}


 47%|████▋     | 1372/2930 [36:25<40:51,  1.57s/it]

{'loss': 0.101, 'grad_norm': 0.3567535877227783, 'learning_rate': 0.00021276886309320588, 'epoch': 4.68}


 47%|████▋     | 1373/2930 [36:27<41:01,  1.58s/it]

{'loss': 0.0999, 'grad_norm': 0.3731863498687744, 'learning_rate': 0.0002126322977125299, 'epoch': 4.68}


 47%|████▋     | 1374/2930 [36:28<41:06,  1.58s/it]

{'loss': 0.1077, 'grad_norm': 0.37757256627082825, 'learning_rate': 0.0002124957323318539, 'epoch': 4.69}


 47%|████▋     | 1375/2930 [36:30<41:01,  1.58s/it]

{'loss': 0.0886, 'grad_norm': 0.3090389668941498, 'learning_rate': 0.0002123591669511779, 'epoch': 4.69}


 47%|████▋     | 1376/2930 [36:32<40:48,  1.58s/it]

{'loss': 0.1064, 'grad_norm': 0.38508227467536926, 'learning_rate': 0.00021222260157050186, 'epoch': 4.69}


 47%|████▋     | 1377/2930 [36:33<40:37,  1.57s/it]

{'loss': 0.1127, 'grad_norm': 0.3645094037055969, 'learning_rate': 0.0002120860361898259, 'epoch': 4.7}


 47%|████▋     | 1378/2930 [36:35<40:45,  1.58s/it]

{'loss': 0.1093, 'grad_norm': 0.3439214527606964, 'learning_rate': 0.0002119494708091499, 'epoch': 4.7}


 47%|████▋     | 1379/2930 [36:36<40:44,  1.58s/it]

{'loss': 0.1162, 'grad_norm': 0.39182204008102417, 'learning_rate': 0.00021181290542847387, 'epoch': 4.7}


 47%|████▋     | 1380/2930 [36:38<40:26,  1.57s/it]

{'loss': 0.0894, 'grad_norm': 0.30677205324172974, 'learning_rate': 0.00021167634004779791, 'epoch': 4.71}


 47%|████▋     | 1381/2930 [36:39<40:28,  1.57s/it]

{'loss': 0.0986, 'grad_norm': 0.33671268820762634, 'learning_rate': 0.00021153977466712188, 'epoch': 4.71}


 47%|████▋     | 1382/2930 [36:41<40:25,  1.57s/it]

{'loss': 0.1047, 'grad_norm': 0.3388548195362091, 'learning_rate': 0.00021140320928644588, 'epoch': 4.71}


 47%|████▋     | 1383/2930 [36:43<40:23,  1.57s/it]

{'loss': 0.0956, 'grad_norm': 0.3186938166618347, 'learning_rate': 0.00021126664390576992, 'epoch': 4.72}


 47%|████▋     | 1384/2930 [36:44<40:07,  1.56s/it]

{'loss': 0.1195, 'grad_norm': 0.4082128703594208, 'learning_rate': 0.0002111300785250939, 'epoch': 4.72}


 47%|████▋     | 1385/2930 [36:46<39:58,  1.55s/it]

{'loss': 0.0959, 'grad_norm': 0.34671810269355774, 'learning_rate': 0.00021099351314441788, 'epoch': 4.72}


 47%|████▋     | 1386/2930 [36:47<40:15,  1.56s/it]

{'loss': 0.1262, 'grad_norm': 0.41033658385276794, 'learning_rate': 0.0002108569477637419, 'epoch': 4.73}


 47%|████▋     | 1387/2930 [36:49<40:33,  1.58s/it]

{'loss': 0.1058, 'grad_norm': 0.3798959255218506, 'learning_rate': 0.0002107203823830659, 'epoch': 4.73}


 47%|████▋     | 1388/2930 [36:50<40:38,  1.58s/it]

{'loss': 0.1062, 'grad_norm': 0.36302489042282104, 'learning_rate': 0.0002105838170023899, 'epoch': 4.73}


 47%|████▋     | 1389/2930 [36:52<40:35,  1.58s/it]

{'loss': 0.0991, 'grad_norm': 0.33936160802841187, 'learning_rate': 0.00021044725162171392, 'epoch': 4.74}


 47%|████▋     | 1390/2930 [36:54<40:31,  1.58s/it]

{'loss': 0.0965, 'grad_norm': 0.3423795700073242, 'learning_rate': 0.0002103106862410379, 'epoch': 4.74}


 47%|████▋     | 1391/2930 [36:55<40:12,  1.57s/it]

{'loss': 0.0983, 'grad_norm': 0.37137851119041443, 'learning_rate': 0.0002101741208603619, 'epoch': 4.74}


 48%|████▊     | 1392/2930 [36:57<40:18,  1.57s/it]

{'loss': 0.1057, 'grad_norm': 0.34403157234191895, 'learning_rate': 0.0002100375554796859, 'epoch': 4.75}


 48%|████▊     | 1393/2930 [36:58<40:18,  1.57s/it]

{'loss': 0.0941, 'grad_norm': 0.339282751083374, 'learning_rate': 0.00020990099009900992, 'epoch': 4.75}


 48%|████▊     | 1394/2930 [37:00<40:18,  1.57s/it]

{'loss': 0.0952, 'grad_norm': 0.3709377646446228, 'learning_rate': 0.0002097644247183339, 'epoch': 4.75}


 48%|████▊     | 1395/2930 [37:01<40:04,  1.57s/it]

{'loss': 0.1112, 'grad_norm': 0.4023854434490204, 'learning_rate': 0.0002096278593376579, 'epoch': 4.76}


 48%|████▊     | 1396/2930 [37:03<39:28,  1.54s/it]

{'loss': 0.1169, 'grad_norm': 0.42894166707992554, 'learning_rate': 0.00020949129395698193, 'epoch': 4.76}


 48%|████▊     | 1397/2930 [37:04<39:23,  1.54s/it]

{'loss': 0.0849, 'grad_norm': 0.30156126618385315, 'learning_rate': 0.00020935472857630592, 'epoch': 4.76}


 48%|████▊     | 1398/2930 [37:06<39:39,  1.55s/it]

{'loss': 0.1006, 'grad_norm': 0.39492663741111755, 'learning_rate': 0.0002092181631956299, 'epoch': 4.77}


 48%|████▊     | 1399/2930 [37:08<39:48,  1.56s/it]

{'loss': 0.1037, 'grad_norm': 0.4103110730648041, 'learning_rate': 0.00020908159781495394, 'epoch': 4.77}


 48%|████▊     | 1400/2930 [37:09<39:47,  1.56s/it]

{'loss': 0.0941, 'grad_norm': 0.3178209364414215, 'learning_rate': 0.00020894503243427793, 'epoch': 4.77}


 48%|████▊     | 1401/2930 [37:11<39:32,  1.55s/it]

{'loss': 0.0918, 'grad_norm': 0.3348737359046936, 'learning_rate': 0.0002088084670536019, 'epoch': 4.78}


 48%|████▊     | 1402/2930 [37:12<39:40,  1.56s/it]

{'loss': 0.0928, 'grad_norm': 0.35469484329223633, 'learning_rate': 0.00020867190167292595, 'epoch': 4.78}


 48%|████▊     | 1403/2930 [37:14<39:42,  1.56s/it]

{'loss': 0.1131, 'grad_norm': 0.41404619812965393, 'learning_rate': 0.00020853533629224991, 'epoch': 4.78}


 48%|████▊     | 1404/2930 [37:15<39:30,  1.55s/it]

{'loss': 0.1136, 'grad_norm': 0.3924034535884857, 'learning_rate': 0.0002083987709115739, 'epoch': 4.79}


 48%|████▊     | 1405/2930 [37:17<39:15,  1.54s/it]

{'loss': 0.0976, 'grad_norm': 0.36585646867752075, 'learning_rate': 0.00020826220553089796, 'epoch': 4.79}


 48%|████▊     | 1406/2930 [37:18<39:12,  1.54s/it]

{'loss': 0.0878, 'grad_norm': 0.3471987545490265, 'learning_rate': 0.00020812564015022192, 'epoch': 4.79}


 48%|████▊     | 1407/2930 [37:20<39:02,  1.54s/it]

{'loss': 0.1134, 'grad_norm': 0.39948633313179016, 'learning_rate': 0.00020798907476954592, 'epoch': 4.8}


 48%|████▊     | 1408/2930 [37:22<39:12,  1.55s/it]

{'loss': 0.0982, 'grad_norm': 0.36210429668426514, 'learning_rate': 0.0002078525093888699, 'epoch': 4.8}


 48%|████▊     | 1409/2930 [37:23<39:21,  1.55s/it]

{'loss': 0.1133, 'grad_norm': 0.4277440011501312, 'learning_rate': 0.00020771594400819393, 'epoch': 4.8}


 48%|████▊     | 1410/2930 [37:25<39:21,  1.55s/it]

{'loss': 0.0927, 'grad_norm': 0.34105992317199707, 'learning_rate': 0.00020757937862751793, 'epoch': 4.81}


 48%|████▊     | 1411/2930 [37:26<39:30,  1.56s/it]

{'loss': 0.1135, 'grad_norm': 0.4279976189136505, 'learning_rate': 0.00020744281324684192, 'epoch': 4.81}


 48%|████▊     | 1412/2930 [37:28<39:20,  1.56s/it]

{'loss': 0.0971, 'grad_norm': 0.3737383782863617, 'learning_rate': 0.00020730624786616594, 'epoch': 4.82}


 48%|████▊     | 1413/2930 [37:29<39:11,  1.55s/it]

{'loss': 0.1028, 'grad_norm': 0.39430245757102966, 'learning_rate': 0.00020716968248548994, 'epoch': 4.82}


 48%|████▊     | 1414/2930 [37:31<39:17,  1.56s/it]

{'loss': 0.1117, 'grad_norm': 0.4153781831264496, 'learning_rate': 0.00020703311710481393, 'epoch': 4.82}


 48%|████▊     | 1415/2930 [37:32<39:07,  1.55s/it]

{'loss': 0.0934, 'grad_norm': 0.34884050488471985, 'learning_rate': 0.00020689655172413795, 'epoch': 4.83}


 48%|████▊     | 1416/2930 [37:34<38:49,  1.54s/it]

{'loss': 0.1107, 'grad_norm': 0.3900842070579529, 'learning_rate': 0.00020675998634346195, 'epoch': 4.83}


 48%|████▊     | 1417/2930 [37:36<39:05,  1.55s/it]

{'loss': 0.1093, 'grad_norm': 0.3464319109916687, 'learning_rate': 0.00020662342096278594, 'epoch': 4.83}


 48%|████▊     | 1418/2930 [37:37<38:57,  1.55s/it]

{'loss': 0.0939, 'grad_norm': 0.31381621956825256, 'learning_rate': 0.00020648685558210996, 'epoch': 4.84}


 48%|████▊     | 1419/2930 [37:39<38:52,  1.54s/it]

{'loss': 0.0993, 'grad_norm': 0.352895051240921, 'learning_rate': 0.00020635029020143396, 'epoch': 4.84}


 48%|████▊     | 1420/2930 [37:40<38:49,  1.54s/it]

{'loss': 0.0989, 'grad_norm': 0.35588905215263367, 'learning_rate': 0.00020621372482075792, 'epoch': 4.84}


 48%|████▊     | 1421/2930 [37:42<38:41,  1.54s/it]

{'loss': 0.1088, 'grad_norm': 0.3703453242778778, 'learning_rate': 0.00020607715944008197, 'epoch': 4.85}


 49%|████▊     | 1422/2930 [37:43<38:44,  1.54s/it]

{'loss': 0.0992, 'grad_norm': 0.3723497986793518, 'learning_rate': 0.00020594059405940597, 'epoch': 4.85}


 49%|████▊     | 1423/2930 [37:45<38:47,  1.54s/it]

{'loss': 0.0874, 'grad_norm': 0.3534556031227112, 'learning_rate': 0.00020580402867872993, 'epoch': 4.85}


 49%|████▊     | 1424/2930 [37:46<38:57,  1.55s/it]

{'loss': 0.0936, 'grad_norm': 0.38822612166404724, 'learning_rate': 0.00020566746329805398, 'epoch': 4.86}


 49%|████▊     | 1425/2930 [37:48<38:49,  1.55s/it]

{'loss': 0.0915, 'grad_norm': 0.3419961929321289, 'learning_rate': 0.00020553089791737798, 'epoch': 4.86}


 49%|████▊     | 1426/2930 [37:49<38:48,  1.55s/it]

{'loss': 0.1007, 'grad_norm': 0.3969825208187103, 'learning_rate': 0.00020539433253670194, 'epoch': 4.86}


 49%|████▊     | 1427/2930 [37:51<38:44,  1.55s/it]

{'loss': 0.0949, 'grad_norm': 0.37186047434806824, 'learning_rate': 0.00020525776715602594, 'epoch': 4.87}


 49%|████▊     | 1428/2930 [37:52<38:31,  1.54s/it]

{'loss': 0.103, 'grad_norm': 0.39771780371665955, 'learning_rate': 0.00020512120177534996, 'epoch': 4.87}


 49%|████▉     | 1429/2930 [37:54<38:47,  1.55s/it]

{'loss': 0.0968, 'grad_norm': 0.357074499130249, 'learning_rate': 0.00020498463639467395, 'epoch': 4.87}


 49%|████▉     | 1430/2930 [37:56<38:45,  1.55s/it]

{'loss': 0.0937, 'grad_norm': 0.31244194507598877, 'learning_rate': 0.00020484807101399795, 'epoch': 4.88}


 49%|████▉     | 1431/2930 [37:57<38:44,  1.55s/it]

{'loss': 0.1026, 'grad_norm': 0.3574528396129608, 'learning_rate': 0.00020471150563332197, 'epoch': 4.88}


 49%|████▉     | 1432/2930 [37:59<38:49,  1.56s/it]

{'loss': 0.1196, 'grad_norm': 0.38496336340904236, 'learning_rate': 0.00020457494025264596, 'epoch': 4.88}


 49%|████▉     | 1433/2930 [38:00<38:49,  1.56s/it]

{'loss': 0.0913, 'grad_norm': 0.31550201773643494, 'learning_rate': 0.00020443837487196996, 'epoch': 4.89}


 49%|████▉     | 1434/2930 [38:02<38:43,  1.55s/it]

{'loss': 0.1085, 'grad_norm': 0.35577768087387085, 'learning_rate': 0.00020430180949129398, 'epoch': 4.89}


 49%|████▉     | 1435/2930 [38:03<38:45,  1.56s/it]

{'loss': 0.1131, 'grad_norm': 0.3733772039413452, 'learning_rate': 0.00020416524411061797, 'epoch': 4.89}


 49%|████▉     | 1436/2930 [38:05<38:52,  1.56s/it]

{'loss': 0.1033, 'grad_norm': 0.32958054542541504, 'learning_rate': 0.00020402867872994197, 'epoch': 4.9}


 49%|████▉     | 1437/2930 [38:07<38:55,  1.56s/it]

{'loss': 0.0926, 'grad_norm': 0.3226839601993561, 'learning_rate': 0.000203892113349266, 'epoch': 4.9}


 49%|████▉     | 1438/2930 [38:08<38:54,  1.56s/it]

{'loss': 0.0929, 'grad_norm': 0.3614698350429535, 'learning_rate': 0.00020375554796858998, 'epoch': 4.9}


 49%|████▉     | 1439/2930 [38:10<38:38,  1.56s/it]

{'loss': 0.0895, 'grad_norm': 0.3063931465148926, 'learning_rate': 0.00020361898258791398, 'epoch': 4.91}


 49%|████▉     | 1440/2930 [38:11<38:30,  1.55s/it]

{'loss': 0.1019, 'grad_norm': 0.3821222484111786, 'learning_rate': 0.000203482417207238, 'epoch': 4.91}


 49%|████▉     | 1441/2930 [38:13<38:46,  1.56s/it]

{'loss': 0.0851, 'grad_norm': 0.30656468868255615, 'learning_rate': 0.000203345851826562, 'epoch': 4.91}


 49%|████▉     | 1442/2930 [38:14<38:48,  1.57s/it]

{'loss': 0.1152, 'grad_norm': 0.45297518372535706, 'learning_rate': 0.00020320928644588598, 'epoch': 4.92}


 49%|████▉     | 1443/2930 [38:16<38:53,  1.57s/it]

{'loss': 0.0937, 'grad_norm': 0.3533563017845154, 'learning_rate': 0.00020307272106520995, 'epoch': 4.92}


 49%|████▉     | 1444/2930 [38:17<38:46,  1.57s/it]

{'loss': 0.1032, 'grad_norm': 0.3634732663631439, 'learning_rate': 0.000202936155684534, 'epoch': 4.92}


 49%|████▉     | 1445/2930 [38:19<38:54,  1.57s/it]

{'loss': 0.0898, 'grad_norm': 0.33958640694618225, 'learning_rate': 0.00020279959030385797, 'epoch': 4.93}


 49%|████▉     | 1446/2930 [38:21<38:52,  1.57s/it]

{'loss': 0.1039, 'grad_norm': 0.3652137219905853, 'learning_rate': 0.00020266302492318196, 'epoch': 4.93}


 49%|████▉     | 1447/2930 [38:22<38:51,  1.57s/it]

{'loss': 0.0971, 'grad_norm': 0.3321746587753296, 'learning_rate': 0.000202526459542506, 'epoch': 4.93}


 49%|████▉     | 1448/2930 [38:24<38:27,  1.56s/it]

{'loss': 0.0951, 'grad_norm': 0.3940568268299103, 'learning_rate': 0.00020238989416182998, 'epoch': 4.94}


 49%|████▉     | 1449/2930 [38:25<38:18,  1.55s/it]

{'loss': 0.1047, 'grad_norm': 0.37579044699668884, 'learning_rate': 0.00020225332878115397, 'epoch': 4.94}


 49%|████▉     | 1450/2930 [38:27<38:21,  1.56s/it]

{'loss': 0.0957, 'grad_norm': 0.34887969493865967, 'learning_rate': 0.000202116763400478, 'epoch': 4.94}


 50%|████▉     | 1451/2930 [38:28<38:18,  1.55s/it]

{'loss': 0.0942, 'grad_norm': 0.3190223276615143, 'learning_rate': 0.00020198019801980199, 'epoch': 4.95}


 50%|████▉     | 1452/2930 [38:30<38:26,  1.56s/it]

{'loss': 0.1157, 'grad_norm': 0.3603910505771637, 'learning_rate': 0.00020184363263912598, 'epoch': 4.95}


 50%|████▉     | 1453/2930 [38:32<38:30,  1.56s/it]

{'loss': 0.0996, 'grad_norm': 0.31495431065559387, 'learning_rate': 0.00020170706725845, 'epoch': 4.95}


 50%|████▉     | 1454/2930 [38:33<38:34,  1.57s/it]

{'loss': 0.1005, 'grad_norm': 0.35225602984428406, 'learning_rate': 0.000201570501877774, 'epoch': 4.96}


 50%|████▉     | 1455/2930 [38:35<38:36,  1.57s/it]

{'loss': 0.0882, 'grad_norm': 0.32632988691329956, 'learning_rate': 0.000201433936497098, 'epoch': 4.96}


 50%|████▉     | 1456/2930 [38:36<38:35,  1.57s/it]

{'loss': 0.0976, 'grad_norm': 0.3330793082714081, 'learning_rate': 0.000201297371116422, 'epoch': 4.97}


 50%|████▉     | 1457/2930 [38:38<38:36,  1.57s/it]

{'loss': 0.116, 'grad_norm': 0.400644987821579, 'learning_rate': 0.000201160805735746, 'epoch': 4.97}


 50%|████▉     | 1458/2930 [38:39<38:30,  1.57s/it]

{'loss': 0.0931, 'grad_norm': 0.32311660051345825, 'learning_rate': 0.00020102424035507, 'epoch': 4.97}


 50%|████▉     | 1459/2930 [38:41<38:15,  1.56s/it]

{'loss': 0.1143, 'grad_norm': 0.38004013895988464, 'learning_rate': 0.000200887674974394, 'epoch': 4.98}


 50%|████▉     | 1460/2930 [38:42<38:13,  1.56s/it]

{'loss': 0.0924, 'grad_norm': 0.3652631640434265, 'learning_rate': 0.00020075110959371801, 'epoch': 4.98}


 50%|████▉     | 1461/2930 [38:44<38:11,  1.56s/it]

{'loss': 0.0929, 'grad_norm': 0.333185076713562, 'learning_rate': 0.000200614544213042, 'epoch': 4.98}


 50%|████▉     | 1462/2930 [38:46<37:56,  1.55s/it]

{'loss': 0.0897, 'grad_norm': 0.3731049597263336, 'learning_rate': 0.00020047797883236598, 'epoch': 4.99}


 50%|████▉     | 1463/2930 [38:47<38:01,  1.56s/it]

{'loss': 0.1034, 'grad_norm': 0.36191412806510925, 'learning_rate': 0.00020034141345169002, 'epoch': 4.99}


 50%|████▉     | 1464/2930 [38:49<37:56,  1.55s/it]

{'loss': 0.0881, 'grad_norm': 0.3334520757198334, 'learning_rate': 0.00020020484807101402, 'epoch': 4.99}


 50%|█████     | 1465/2930 [38:50<37:45,  1.55s/it]

{'loss': 0.0921, 'grad_norm': 0.34671032428741455, 'learning_rate': 0.00020006828269033799, 'epoch': 5.0}


 50%|█████     | 1466/2930 [38:52<37:53,  1.55s/it]

{'loss': 0.0838, 'grad_norm': 0.3132627606391907, 'learning_rate': 0.000199931717309662, 'epoch': 5.0}


 50%|█████     | 1467/2930 [38:53<37:47,  1.55s/it]

{'loss': 0.0675, 'grad_norm': 0.2908782362937927, 'learning_rate': 0.000199795151928986, 'epoch': 5.0}


 50%|█████     | 1468/2930 [38:55<37:54,  1.56s/it]

{'loss': 0.0623, 'grad_norm': 0.275020033121109, 'learning_rate': 0.00019965858654831002, 'epoch': 5.01}


 50%|█████     | 1469/2930 [38:56<37:56,  1.56s/it]

{'loss': 0.0891, 'grad_norm': 0.36246514320373535, 'learning_rate': 0.00019952202116763402, 'epoch': 5.01}


 50%|█████     | 1470/2930 [38:58<37:58,  1.56s/it]

{'loss': 0.0711, 'grad_norm': 0.27465325593948364, 'learning_rate': 0.000199385455786958, 'epoch': 5.01}


 50%|█████     | 1471/2930 [39:00<38:06,  1.57s/it]

{'loss': 0.0697, 'grad_norm': 0.3624454736709595, 'learning_rate': 0.000199248890406282, 'epoch': 5.02}


 50%|█████     | 1472/2930 [39:01<38:01,  1.56s/it]

{'loss': 0.0706, 'grad_norm': 0.368582159280777, 'learning_rate': 0.00019911232502560603, 'epoch': 5.02}


 50%|█████     | 1473/2930 [39:03<37:55,  1.56s/it]

{'loss': 0.0652, 'grad_norm': 0.36119285225868225, 'learning_rate': 0.00019897575964493002, 'epoch': 5.02}


 50%|█████     | 1474/2930 [39:04<37:57,  1.56s/it]

{'loss': 0.0679, 'grad_norm': 0.37981656193733215, 'learning_rate': 0.00019883919426425401, 'epoch': 5.03}


 50%|█████     | 1475/2930 [39:06<38:01,  1.57s/it]

{'loss': 0.0647, 'grad_norm': 0.3875839412212372, 'learning_rate': 0.00019870262888357804, 'epoch': 5.03}


 50%|█████     | 1476/2930 [39:07<38:04,  1.57s/it]

{'loss': 0.063, 'grad_norm': 0.3976998031139374, 'learning_rate': 0.00019856606350290203, 'epoch': 5.03}


 50%|█████     | 1477/2930 [39:09<38:08,  1.57s/it]

{'loss': 0.0623, 'grad_norm': 0.3391408324241638, 'learning_rate': 0.00019842949812222602, 'epoch': 5.04}


 50%|█████     | 1478/2930 [39:11<38:07,  1.58s/it]

{'loss': 0.0734, 'grad_norm': 0.37747564911842346, 'learning_rate': 0.00019829293274155002, 'epoch': 5.04}


 50%|█████     | 1479/2930 [39:12<37:56,  1.57s/it]

{'loss': 0.0704, 'grad_norm': 0.3456377387046814, 'learning_rate': 0.000198156367360874, 'epoch': 5.04}


 51%|█████     | 1480/2930 [39:14<37:58,  1.57s/it]

{'loss': 0.0747, 'grad_norm': 0.30833911895751953, 'learning_rate': 0.00019801980198019803, 'epoch': 5.05}


 51%|█████     | 1481/2930 [39:15<38:03,  1.58s/it]

{'loss': 0.0868, 'grad_norm': 0.38156217336654663, 'learning_rate': 0.00019788323659952203, 'epoch': 5.05}


 51%|█████     | 1482/2930 [39:17<37:53,  1.57s/it]

{'loss': 0.0845, 'grad_norm': 0.3607122302055359, 'learning_rate': 0.00019774667121884602, 'epoch': 5.05}


 51%|█████     | 1483/2930 [39:18<38:07,  1.58s/it]

{'loss': 0.0652, 'grad_norm': 0.30260169506073, 'learning_rate': 0.00019761010583817004, 'epoch': 5.06}


 51%|█████     | 1484/2930 [39:20<38:01,  1.58s/it]

{'loss': 0.0904, 'grad_norm': 0.3728996217250824, 'learning_rate': 0.00019747354045749404, 'epoch': 5.06}


 51%|█████     | 1485/2930 [39:22<38:01,  1.58s/it]

{'loss': 0.0726, 'grad_norm': 0.32316353917121887, 'learning_rate': 0.00019733697507681803, 'epoch': 5.06}


 51%|█████     | 1486/2930 [39:23<37:47,  1.57s/it]

{'loss': 0.0635, 'grad_norm': 0.298284113407135, 'learning_rate': 0.00019720040969614205, 'epoch': 5.07}


 51%|█████     | 1487/2930 [39:25<37:55,  1.58s/it]

{'loss': 0.0653, 'grad_norm': 0.3410676121711731, 'learning_rate': 0.00019706384431546602, 'epoch': 5.07}


 51%|█████     | 1488/2930 [39:26<38:00,  1.58s/it]

{'loss': 0.0566, 'grad_norm': 0.2831306457519531, 'learning_rate': 0.00019692727893479004, 'epoch': 5.07}


 51%|█████     | 1489/2930 [39:28<38:00,  1.58s/it]

{'loss': 0.0719, 'grad_norm': 0.33146071434020996, 'learning_rate': 0.00019679071355411403, 'epoch': 5.08}


 51%|█████     | 1490/2930 [39:30<38:06,  1.59s/it]

{'loss': 0.0701, 'grad_norm': 0.3424219489097595, 'learning_rate': 0.00019665414817343803, 'epoch': 5.08}


 51%|█████     | 1491/2930 [39:31<37:56,  1.58s/it]

{'loss': 0.0511, 'grad_norm': 0.29372870922088623, 'learning_rate': 0.00019651758279276205, 'epoch': 5.08}


 51%|█████     | 1492/2930 [39:33<37:58,  1.58s/it]

{'loss': 0.061, 'grad_norm': 0.3172338902950287, 'learning_rate': 0.00019638101741208604, 'epoch': 5.09}


 51%|█████     | 1493/2930 [39:34<37:47,  1.58s/it]

{'loss': 0.0714, 'grad_norm': 0.3563513457775116, 'learning_rate': 0.00019624445203141004, 'epoch': 5.09}


 51%|█████     | 1494/2930 [39:36<37:44,  1.58s/it]

{'loss': 0.0621, 'grad_norm': 0.29251518845558167, 'learning_rate': 0.00019610788665073406, 'epoch': 5.09}


 51%|█████     | 1495/2930 [39:37<37:53,  1.58s/it]

{'loss': 0.0682, 'grad_norm': 0.32943934202194214, 'learning_rate': 0.00019597132127005803, 'epoch': 5.1}


 51%|█████     | 1496/2930 [39:39<37:46,  1.58s/it]

{'loss': 0.0613, 'grad_norm': 0.3235938847064972, 'learning_rate': 0.00019583475588938205, 'epoch': 5.1}


 51%|█████     | 1497/2930 [39:41<37:42,  1.58s/it]

{'loss': 0.08, 'grad_norm': 0.3784848153591156, 'learning_rate': 0.00019569819050870607, 'epoch': 5.1}


 51%|█████     | 1498/2930 [39:42<37:54,  1.59s/it]

{'loss': 0.0629, 'grad_norm': 0.30846771597862244, 'learning_rate': 0.00019556162512803004, 'epoch': 5.11}


 51%|█████     | 1499/2930 [39:44<37:44,  1.58s/it]

{'loss': 0.0678, 'grad_norm': 0.34299734234809875, 'learning_rate': 0.00019542505974735406, 'epoch': 5.11}


 51%|█████     | 1500/2930 [39:45<37:28,  1.57s/it]

{'loss': 0.077, 'grad_norm': 0.3535394072532654, 'learning_rate': 0.00019528849436667808, 'epoch': 5.12}


[34m[1mwandb[0m: Adding directory to artifact (./outputs/checkpoint-1500)... Done. 0.2s
 51%|█████     | 1501/2930 [39:48<44:14,  1.86s/it]

{'loss': 0.077, 'grad_norm': 0.3542109727859497, 'learning_rate': 0.00019515192898600205, 'epoch': 5.12}


 51%|█████▏    | 1502/2930 [39:49<42:16,  1.78s/it]

{'loss': 0.0649, 'grad_norm': 0.3183377981185913, 'learning_rate': 0.00019501536360532607, 'epoch': 5.12}


 51%|█████▏    | 1503/2930 [39:51<40:53,  1.72s/it]

{'loss': 0.059, 'grad_norm': 0.29554179310798645, 'learning_rate': 0.00019487879822465006, 'epoch': 5.13}


 51%|█████▏    | 1504/2930 [39:53<39:33,  1.66s/it]

{'loss': 0.0695, 'grad_norm': 0.3576511740684509, 'learning_rate': 0.00019474223284397406, 'epoch': 5.13}


 51%|█████▏    | 1505/2930 [39:54<39:04,  1.65s/it]

{'loss': 0.0707, 'grad_norm': 0.3539257347583771, 'learning_rate': 0.00019460566746329808, 'epoch': 5.13}


 51%|█████▏    | 1506/2930 [39:56<38:23,  1.62s/it]

{'loss': 0.0645, 'grad_norm': 0.32694998383522034, 'learning_rate': 0.00019446910208262204, 'epoch': 5.14}


 51%|█████▏    | 1507/2930 [39:57<38:07,  1.61s/it]

{'loss': 0.078, 'grad_norm': 0.403602659702301, 'learning_rate': 0.00019433253670194606, 'epoch': 5.14}


 51%|█████▏    | 1508/2930 [39:59<37:43,  1.59s/it]

{'loss': 0.0662, 'grad_norm': 0.3172549307346344, 'learning_rate': 0.00019419597132127009, 'epoch': 5.14}


 52%|█████▏    | 1509/2930 [40:00<37:23,  1.58s/it]

{'loss': 0.0618, 'grad_norm': 0.32690101861953735, 'learning_rate': 0.00019405940594059405, 'epoch': 5.15}


 52%|█████▏    | 1510/2930 [40:02<37:17,  1.58s/it]

{'loss': 0.0688, 'grad_norm': 0.3398485481739044, 'learning_rate': 0.00019392284055991807, 'epoch': 5.15}


 52%|█████▏    | 1511/2930 [40:04<37:17,  1.58s/it]

{'loss': 0.0626, 'grad_norm': 0.34701472520828247, 'learning_rate': 0.00019378627517924207, 'epoch': 5.15}


 52%|█████▏    | 1512/2930 [40:05<37:24,  1.58s/it]

{'loss': 0.0713, 'grad_norm': 0.34342867136001587, 'learning_rate': 0.00019364970979856606, 'epoch': 5.16}


 52%|█████▏    | 1513/2930 [40:07<36:50,  1.56s/it]

{'loss': 0.0761, 'grad_norm': 0.3835356831550598, 'learning_rate': 0.00019351314441789008, 'epoch': 5.16}


 52%|█████▏    | 1514/2930 [40:08<36:53,  1.56s/it]

{'loss': 0.073, 'grad_norm': 0.32108479738235474, 'learning_rate': 0.00019337657903721408, 'epoch': 5.16}


 52%|█████▏    | 1515/2930 [40:10<37:02,  1.57s/it]

{'loss': 0.0667, 'grad_norm': 0.31288865208625793, 'learning_rate': 0.00019324001365653807, 'epoch': 5.17}


 52%|█████▏    | 1516/2930 [40:11<37:10,  1.58s/it]

{'loss': 0.0717, 'grad_norm': 0.35712960362434387, 'learning_rate': 0.0001931034482758621, 'epoch': 5.17}


 52%|█████▏    | 1517/2930 [40:13<37:02,  1.57s/it]

{'loss': 0.057, 'grad_norm': 0.3023334741592407, 'learning_rate': 0.0001929668828951861, 'epoch': 5.17}


 52%|█████▏    | 1518/2930 [40:15<37:08,  1.58s/it]

{'loss': 0.0646, 'grad_norm': 0.3248802125453949, 'learning_rate': 0.00019283031751451008, 'epoch': 5.18}


 52%|█████▏    | 1519/2930 [40:16<37:05,  1.58s/it]

{'loss': 0.0696, 'grad_norm': 0.4016325771808624, 'learning_rate': 0.00019269375213383408, 'epoch': 5.18}


 52%|█████▏    | 1520/2930 [40:18<37:16,  1.59s/it]

{'loss': 0.064, 'grad_norm': 0.33792006969451904, 'learning_rate': 0.00019255718675315807, 'epoch': 5.18}


 52%|█████▏    | 1521/2930 [40:19<37:09,  1.58s/it]

{'loss': 0.0666, 'grad_norm': 0.3924632966518402, 'learning_rate': 0.0001924206213724821, 'epoch': 5.19}


 52%|█████▏    | 1522/2930 [40:21<37:00,  1.58s/it]

{'loss': 0.0737, 'grad_norm': 0.3000447750091553, 'learning_rate': 0.00019228405599180609, 'epoch': 5.19}


 52%|█████▏    | 1523/2930 [40:22<36:55,  1.57s/it]

{'loss': 0.0658, 'grad_norm': 0.30632859468460083, 'learning_rate': 0.00019214749061113008, 'epoch': 5.19}


 52%|█████▏    | 1524/2930 [40:24<36:49,  1.57s/it]

{'loss': 0.0651, 'grad_norm': 0.34517765045166016, 'learning_rate': 0.0001920109252304541, 'epoch': 5.2}


 52%|█████▏    | 1525/2930 [40:26<36:44,  1.57s/it]

{'loss': 0.0673, 'grad_norm': 0.3567304015159607, 'learning_rate': 0.0001918743598497781, 'epoch': 5.2}


 52%|█████▏    | 1526/2930 [40:27<36:42,  1.57s/it]

{'loss': 0.0686, 'grad_norm': 0.34267234802246094, 'learning_rate': 0.0001917377944691021, 'epoch': 5.2}


 52%|█████▏    | 1527/2930 [40:29<36:41,  1.57s/it]

{'loss': 0.0706, 'grad_norm': 0.35095569491386414, 'learning_rate': 0.0001916012290884261, 'epoch': 5.21}


 52%|█████▏    | 1528/2930 [40:30<36:34,  1.57s/it]

{'loss': 0.0646, 'grad_norm': 0.3195610046386719, 'learning_rate': 0.0001914646637077501, 'epoch': 5.21}


 52%|█████▏    | 1529/2930 [40:32<36:24,  1.56s/it]

{'loss': 0.0737, 'grad_norm': 0.32430464029312134, 'learning_rate': 0.0001913280983270741, 'epoch': 5.21}


 52%|█████▏    | 1530/2930 [40:33<36:33,  1.57s/it]

{'loss': 0.0664, 'grad_norm': 0.3427524268627167, 'learning_rate': 0.0001911915329463981, 'epoch': 5.22}


 52%|█████▏    | 1531/2930 [40:35<36:33,  1.57s/it]

{'loss': 0.0741, 'grad_norm': 0.3497605621814728, 'learning_rate': 0.0001910549675657221, 'epoch': 5.22}


 52%|█████▏    | 1532/2930 [40:37<36:25,  1.56s/it]

{'loss': 0.0703, 'grad_norm': 0.33988383412361145, 'learning_rate': 0.0001909184021850461, 'epoch': 5.22}


 52%|█████▏    | 1533/2930 [40:38<36:29,  1.57s/it]

{'loss': 0.072, 'grad_norm': 0.3197706341743469, 'learning_rate': 0.0001907818368043701, 'epoch': 5.23}


 52%|█████▏    | 1534/2930 [40:40<36:30,  1.57s/it]

{'loss': 0.0609, 'grad_norm': 0.3244132995605469, 'learning_rate': 0.0001906452714236941, 'epoch': 5.23}


 52%|█████▏    | 1535/2930 [40:41<36:26,  1.57s/it]

{'loss': 0.0725, 'grad_norm': 0.4079059064388275, 'learning_rate': 0.00019050870604301812, 'epoch': 5.23}


 52%|█████▏    | 1536/2930 [40:43<36:22,  1.57s/it]

{'loss': 0.0687, 'grad_norm': 0.31910791993141174, 'learning_rate': 0.0001903721406623421, 'epoch': 5.24}


 52%|█████▏    | 1537/2930 [40:44<36:17,  1.56s/it]

{'loss': 0.0811, 'grad_norm': 0.41336309909820557, 'learning_rate': 0.0001902355752816661, 'epoch': 5.24}


 52%|█████▏    | 1538/2930 [40:46<36:12,  1.56s/it]

{'loss': 0.0664, 'grad_norm': 0.33013829588890076, 'learning_rate': 0.0001900990099009901, 'epoch': 5.24}


 53%|█████▎    | 1539/2930 [40:47<36:14,  1.56s/it]

{'loss': 0.0672, 'grad_norm': 0.3664257228374481, 'learning_rate': 0.00018996244452031412, 'epoch': 5.25}


 53%|█████▎    | 1540/2930 [40:49<36:22,  1.57s/it]

{'loss': 0.0658, 'grad_norm': 0.36986616253852844, 'learning_rate': 0.00018982587913963812, 'epoch': 5.25}


 53%|█████▎    | 1541/2930 [40:51<36:13,  1.56s/it]

{'loss': 0.0728, 'grad_norm': 0.355871319770813, 'learning_rate': 0.0001896893137589621, 'epoch': 5.25}


 53%|█████▎    | 1542/2930 [40:52<36:02,  1.56s/it]

{'loss': 0.0651, 'grad_norm': 0.3417467474937439, 'learning_rate': 0.0001895527483782861, 'epoch': 5.26}


 53%|█████▎    | 1543/2930 [40:54<36:04,  1.56s/it]

{'loss': 0.0777, 'grad_norm': 0.35177117586135864, 'learning_rate': 0.00018941618299761012, 'epoch': 5.26}


 53%|█████▎    | 1544/2930 [40:55<36:04,  1.56s/it]

{'loss': 0.0764, 'grad_norm': 0.3506298065185547, 'learning_rate': 0.00018927961761693412, 'epoch': 5.27}


 53%|█████▎    | 1545/2930 [40:57<36:06,  1.56s/it]

{'loss': 0.0812, 'grad_norm': 0.3829716742038727, 'learning_rate': 0.0001891430522362581, 'epoch': 5.27}


 53%|█████▎    | 1546/2930 [40:58<36:00,  1.56s/it]

{'loss': 0.069, 'grad_norm': 0.33997243642807007, 'learning_rate': 0.0001890064868555821, 'epoch': 5.27}


 53%|█████▎    | 1547/2930 [41:00<35:59,  1.56s/it]

{'loss': 0.0639, 'grad_norm': 0.32711243629455566, 'learning_rate': 0.00018886992147490613, 'epoch': 5.28}


 53%|█████▎    | 1548/2930 [41:02<36:01,  1.56s/it]

{'loss': 0.0742, 'grad_norm': 0.2937215566635132, 'learning_rate': 0.00018873335609423012, 'epoch': 5.28}


 53%|█████▎    | 1549/2930 [41:03<35:52,  1.56s/it]

{'loss': 0.0769, 'grad_norm': 0.3664388060569763, 'learning_rate': 0.00018859679071355412, 'epoch': 5.28}


 53%|█████▎    | 1550/2930 [41:05<35:41,  1.55s/it]

{'loss': 0.0657, 'grad_norm': 0.32952481508255005, 'learning_rate': 0.00018846022533287814, 'epoch': 5.29}


 53%|█████▎    | 1551/2930 [41:06<35:32,  1.55s/it]

{'loss': 0.0798, 'grad_norm': 0.3964177668094635, 'learning_rate': 0.00018832365995220213, 'epoch': 5.29}


 53%|█████▎    | 1552/2930 [41:08<35:22,  1.54s/it]

{'loss': 0.0741, 'grad_norm': 0.33290964365005493, 'learning_rate': 0.00018818709457152613, 'epoch': 5.29}


 53%|█████▎    | 1553/2930 [41:09<35:30,  1.55s/it]

{'loss': 0.0764, 'grad_norm': 0.31504353880882263, 'learning_rate': 0.00018805052919085012, 'epoch': 5.3}


 53%|█████▎    | 1554/2930 [41:11<35:39,  1.55s/it]

{'loss': 0.0721, 'grad_norm': 0.3228926360607147, 'learning_rate': 0.00018791396381017411, 'epoch': 5.3}


 53%|█████▎    | 1555/2930 [41:12<35:25,  1.55s/it]

{'loss': 0.0759, 'grad_norm': 0.3532179892063141, 'learning_rate': 0.00018777739842949814, 'epoch': 5.3}


 53%|█████▎    | 1556/2930 [41:14<35:16,  1.54s/it]

{'loss': 0.0583, 'grad_norm': 0.3606696128845215, 'learning_rate': 0.00018764083304882213, 'epoch': 5.31}


 53%|█████▎    | 1557/2930 [41:15<35:11,  1.54s/it]

{'loss': 0.0875, 'grad_norm': 0.4106493592262268, 'learning_rate': 0.00018750426766814612, 'epoch': 5.31}


 53%|█████▎    | 1558/2930 [41:17<35:03,  1.53s/it]

{'loss': 0.074, 'grad_norm': 0.34150904417037964, 'learning_rate': 0.00018736770228747015, 'epoch': 5.31}


 53%|█████▎    | 1559/2930 [41:18<35:09,  1.54s/it]

{'loss': 0.0708, 'grad_norm': 0.3467136323451996, 'learning_rate': 0.00018723113690679414, 'epoch': 5.32}


 53%|█████▎    | 1560/2930 [41:20<35:22,  1.55s/it]

{'loss': 0.0572, 'grad_norm': 0.28427475690841675, 'learning_rate': 0.00018709457152611813, 'epoch': 5.32}


 53%|█████▎    | 1561/2930 [41:22<35:26,  1.55s/it]

{'loss': 0.066, 'grad_norm': 0.34894511103630066, 'learning_rate': 0.00018695800614544215, 'epoch': 5.32}


 53%|█████▎    | 1562/2930 [41:23<35:27,  1.55s/it]

{'loss': 0.0652, 'grad_norm': 0.32096943259239197, 'learning_rate': 0.00018682144076476612, 'epoch': 5.33}


 53%|█████▎    | 1563/2930 [41:25<35:26,  1.56s/it]

{'loss': 0.066, 'grad_norm': 0.3073258697986603, 'learning_rate': 0.00018668487538409014, 'epoch': 5.33}


 53%|█████▎    | 1564/2930 [41:26<35:19,  1.55s/it]

{'loss': 0.056, 'grad_norm': 0.3097512423992157, 'learning_rate': 0.00018654831000341414, 'epoch': 5.33}


 53%|█████▎    | 1565/2930 [41:28<35:06,  1.54s/it]

{'loss': 0.0699, 'grad_norm': 0.3816448152065277, 'learning_rate': 0.00018641174462273813, 'epoch': 5.34}


 53%|█████▎    | 1566/2930 [41:29<35:00,  1.54s/it]

{'loss': 0.0628, 'grad_norm': 0.29534485936164856, 'learning_rate': 0.00018627517924206215, 'epoch': 5.34}


 53%|█████▎    | 1567/2930 [41:31<35:12,  1.55s/it]

{'loss': 0.0693, 'grad_norm': 0.3512754440307617, 'learning_rate': 0.00018613861386138615, 'epoch': 5.34}


 54%|█████▎    | 1568/2930 [41:32<35:21,  1.56s/it]

{'loss': 0.0666, 'grad_norm': 0.34783557057380676, 'learning_rate': 0.00018600204848071014, 'epoch': 5.35}


 54%|█████▎    | 1569/2930 [41:34<35:14,  1.55s/it]

{'loss': 0.0702, 'grad_norm': 0.33007100224494934, 'learning_rate': 0.00018586548310003416, 'epoch': 5.35}


 54%|█████▎    | 1570/2930 [41:36<35:20,  1.56s/it]

{'loss': 0.0716, 'grad_norm': 0.36544427275657654, 'learning_rate': 0.00018572891771935813, 'epoch': 5.35}


 54%|█████▎    | 1571/2930 [41:37<35:13,  1.56s/it]

{'loss': 0.0676, 'grad_norm': 0.34248584508895874, 'learning_rate': 0.00018559235233868215, 'epoch': 5.36}


 54%|█████▎    | 1572/2930 [41:39<35:14,  1.56s/it]

{'loss': 0.0657, 'grad_norm': 0.3516068458557129, 'learning_rate': 0.00018545578695800617, 'epoch': 5.36}


 54%|█████▎    | 1573/2930 [41:40<35:21,  1.56s/it]

{'loss': 0.0717, 'grad_norm': 0.33073437213897705, 'learning_rate': 0.00018531922157733014, 'epoch': 5.36}


 54%|█████▎    | 1574/2930 [41:42<35:11,  1.56s/it]

{'loss': 0.0723, 'grad_norm': 0.3822007179260254, 'learning_rate': 0.00018518265619665416, 'epoch': 5.37}


 54%|█████▍    | 1575/2930 [41:43<35:16,  1.56s/it]

{'loss': 0.0823, 'grad_norm': 0.4278223216533661, 'learning_rate': 0.00018504609081597818, 'epoch': 5.37}


 54%|█████▍    | 1576/2930 [41:45<35:14,  1.56s/it]

{'loss': 0.0735, 'grad_norm': 0.3589874505996704, 'learning_rate': 0.00018490952543530215, 'epoch': 5.37}


 54%|█████▍    | 1577/2930 [41:47<35:10,  1.56s/it]

{'loss': 0.0654, 'grad_norm': 0.31271955370903015, 'learning_rate': 0.00018477296005462617, 'epoch': 5.38}


 54%|█████▍    | 1578/2930 [41:48<35:09,  1.56s/it]

{'loss': 0.0732, 'grad_norm': 0.35223937034606934, 'learning_rate': 0.00018463639467395016, 'epoch': 5.38}


 54%|█████▍    | 1579/2930 [41:50<34:59,  1.55s/it]

{'loss': 0.0755, 'grad_norm': 0.3407895863056183, 'learning_rate': 0.00018449982929327416, 'epoch': 5.38}


 54%|█████▍    | 1580/2930 [41:51<35:00,  1.56s/it]

{'loss': 0.0722, 'grad_norm': 0.3751716613769531, 'learning_rate': 0.00018436326391259818, 'epoch': 5.39}


 54%|█████▍    | 1581/2930 [41:53<34:55,  1.55s/it]

{'loss': 0.0695, 'grad_norm': 0.31074362993240356, 'learning_rate': 0.00018422669853192215, 'epoch': 5.39}


 54%|█████▍    | 1582/2930 [41:54<34:49,  1.55s/it]

{'loss': 0.0717, 'grad_norm': 0.3561902940273285, 'learning_rate': 0.00018409013315124617, 'epoch': 5.39}


 54%|█████▍    | 1583/2930 [41:56<34:44,  1.55s/it]

{'loss': 0.0699, 'grad_norm': 0.34651196002960205, 'learning_rate': 0.0001839535677705702, 'epoch': 5.4}


 54%|█████▍    | 1584/2930 [41:57<34:52,  1.55s/it]

{'loss': 0.0637, 'grad_norm': 0.310001015663147, 'learning_rate': 0.00018381700238989416, 'epoch': 5.4}


 54%|█████▍    | 1585/2930 [41:59<34:57,  1.56s/it]

{'loss': 0.0706, 'grad_norm': 0.3490409553050995, 'learning_rate': 0.00018368043700921818, 'epoch': 5.4}


 54%|█████▍    | 1586/2930 [42:01<34:59,  1.56s/it]

{'loss': 0.0787, 'grad_norm': 0.35389629006385803, 'learning_rate': 0.00018354387162854217, 'epoch': 5.41}


 54%|█████▍    | 1587/2930 [42:02<34:49,  1.56s/it]

{'loss': 0.0656, 'grad_norm': 0.33963412046432495, 'learning_rate': 0.00018340730624786617, 'epoch': 5.41}


 54%|█████▍    | 1588/2930 [42:04<34:43,  1.55s/it]

{'loss': 0.0726, 'grad_norm': 0.32629188895225525, 'learning_rate': 0.00018327074086719019, 'epoch': 5.42}


 54%|█████▍    | 1589/2930 [42:05<34:45,  1.55s/it]

{'loss': 0.0723, 'grad_norm': 0.3920169174671173, 'learning_rate': 0.00018313417548651418, 'epoch': 5.42}


 54%|█████▍    | 1590/2930 [42:07<34:49,  1.56s/it]

{'loss': 0.0642, 'grad_norm': 0.3320388197898865, 'learning_rate': 0.00018299761010583817, 'epoch': 5.42}


 54%|█████▍    | 1591/2930 [42:08<34:44,  1.56s/it]

{'loss': 0.0636, 'grad_norm': 0.33136409521102905, 'learning_rate': 0.0001828610447251622, 'epoch': 5.43}


 54%|█████▍    | 1592/2930 [42:10<34:27,  1.55s/it]

{'loss': 0.0634, 'grad_norm': 0.3083954453468323, 'learning_rate': 0.0001827244793444862, 'epoch': 5.43}


 54%|█████▍    | 1593/2930 [42:11<34:25,  1.55s/it]

{'loss': 0.0756, 'grad_norm': 0.35582852363586426, 'learning_rate': 0.00018258791396381018, 'epoch': 5.43}


 54%|█████▍    | 1594/2930 [42:13<34:22,  1.54s/it]

{'loss': 0.0638, 'grad_norm': 0.3406873643398285, 'learning_rate': 0.00018245134858313418, 'epoch': 5.44}


 54%|█████▍    | 1595/2930 [42:14<34:16,  1.54s/it]

{'loss': 0.0655, 'grad_norm': 0.3493720591068268, 'learning_rate': 0.00018231478320245817, 'epoch': 5.44}


 54%|█████▍    | 1596/2930 [42:16<34:22,  1.55s/it]

{'loss': 0.0683, 'grad_norm': 0.3432040512561798, 'learning_rate': 0.0001821782178217822, 'epoch': 5.44}


 55%|█████▍    | 1597/2930 [42:18<34:16,  1.54s/it]

{'loss': 0.0794, 'grad_norm': 0.3890399932861328, 'learning_rate': 0.0001820416524411062, 'epoch': 5.45}


 55%|█████▍    | 1598/2930 [42:19<34:28,  1.55s/it]

{'loss': 0.0783, 'grad_norm': 0.3379170596599579, 'learning_rate': 0.00018190508706043018, 'epoch': 5.45}


 55%|█████▍    | 1599/2930 [42:21<34:18,  1.55s/it]

{'loss': 0.0675, 'grad_norm': 0.32529279589653015, 'learning_rate': 0.0001817685216797542, 'epoch': 5.45}


 55%|█████▍    | 1600/2930 [42:22<34:18,  1.55s/it]

{'loss': 0.0813, 'grad_norm': 0.3628244996070862, 'learning_rate': 0.0001816319562990782, 'epoch': 5.46}


 55%|█████▍    | 1601/2930 [42:24<34:27,  1.56s/it]

{'loss': 0.0696, 'grad_norm': 0.32885250449180603, 'learning_rate': 0.0001814953909184022, 'epoch': 5.46}


 55%|█████▍    | 1602/2930 [42:25<34:24,  1.55s/it]

{'loss': 0.0643, 'grad_norm': 0.30822983384132385, 'learning_rate': 0.00018135882553772619, 'epoch': 5.46}


 55%|█████▍    | 1603/2930 [42:27<34:10,  1.55s/it]

{'loss': 0.0661, 'grad_norm': 0.31839653849601746, 'learning_rate': 0.0001812222601570502, 'epoch': 5.47}


 55%|█████▍    | 1604/2930 [42:28<34:22,  1.56s/it]

{'loss': 0.0739, 'grad_norm': 0.3489604890346527, 'learning_rate': 0.0001810856947763742, 'epoch': 5.47}


 55%|█████▍    | 1605/2930 [42:30<34:36,  1.57s/it]

{'loss': 0.0618, 'grad_norm': 0.3464997708797455, 'learning_rate': 0.0001809491293956982, 'epoch': 5.47}


 55%|█████▍    | 1606/2930 [42:32<34:45,  1.57s/it]

{'loss': 0.0775, 'grad_norm': 0.38572052121162415, 'learning_rate': 0.0001808125640150222, 'epoch': 5.48}


 55%|█████▍    | 1607/2930 [42:33<34:30,  1.57s/it]

{'loss': 0.0748, 'grad_norm': 0.34444284439086914, 'learning_rate': 0.0001806759986343462, 'epoch': 5.48}


 55%|█████▍    | 1608/2930 [42:35<34:38,  1.57s/it]

{'loss': 0.0656, 'grad_norm': 0.32591724395751953, 'learning_rate': 0.0001805394332536702, 'epoch': 5.48}


 55%|█████▍    | 1609/2930 [42:36<34:47,  1.58s/it]

{'loss': 0.0694, 'grad_norm': 0.33943411707878113, 'learning_rate': 0.0001804028678729942, 'epoch': 5.49}


 55%|█████▍    | 1610/2930 [42:38<34:40,  1.58s/it]

{'loss': 0.0748, 'grad_norm': 0.3733276128768921, 'learning_rate': 0.0001802663024923182, 'epoch': 5.49}


 55%|█████▍    | 1611/2930 [42:39<34:38,  1.58s/it]

{'loss': 0.0754, 'grad_norm': 0.34659209847450256, 'learning_rate': 0.00018012973711164221, 'epoch': 5.49}


 55%|█████▌    | 1612/2930 [42:41<34:46,  1.58s/it]

{'loss': 0.0716, 'grad_norm': 0.3433821499347687, 'learning_rate': 0.0001799931717309662, 'epoch': 5.5}


 55%|█████▌    | 1613/2930 [42:43<34:48,  1.59s/it]

{'loss': 0.0617, 'grad_norm': 0.30931270122528076, 'learning_rate': 0.0001798566063502902, 'epoch': 5.5}


 55%|█████▌    | 1614/2930 [42:44<34:46,  1.59s/it]

{'loss': 0.0725, 'grad_norm': 0.34607037901878357, 'learning_rate': 0.00017972004096961422, 'epoch': 5.5}


 55%|█████▌    | 1615/2930 [42:46<34:45,  1.59s/it]

{'loss': 0.0738, 'grad_norm': 0.36173173785209656, 'learning_rate': 0.00017958347558893822, 'epoch': 5.51}


 55%|█████▌    | 1616/2930 [42:47<34:45,  1.59s/it]

{'loss': 0.0727, 'grad_norm': 0.31279388070106506, 'learning_rate': 0.0001794469102082622, 'epoch': 5.51}


 55%|█████▌    | 1617/2930 [42:49<34:51,  1.59s/it]

{'loss': 0.0606, 'grad_norm': 0.3127437233924866, 'learning_rate': 0.0001793103448275862, 'epoch': 5.51}


 55%|█████▌    | 1618/2930 [42:51<34:46,  1.59s/it]

{'loss': 0.0752, 'grad_norm': 0.4009944200515747, 'learning_rate': 0.0001791737794469102, 'epoch': 5.52}


 55%|█████▌    | 1619/2930 [42:52<34:40,  1.59s/it]

{'loss': 0.0779, 'grad_norm': 0.38957273960113525, 'learning_rate': 0.00017903721406623422, 'epoch': 5.52}


 55%|█████▌    | 1620/2930 [42:54<34:43,  1.59s/it]

{'loss': 0.085, 'grad_norm': 0.36151185631752014, 'learning_rate': 0.00017890064868555822, 'epoch': 5.52}


 55%|█████▌    | 1621/2930 [42:55<34:36,  1.59s/it]

{'loss': 0.0585, 'grad_norm': 0.29465794563293457, 'learning_rate': 0.0001787640833048822, 'epoch': 5.53}


 55%|█████▌    | 1622/2930 [42:57<34:17,  1.57s/it]

{'loss': 0.0774, 'grad_norm': 0.3526611626148224, 'learning_rate': 0.00017862751792420623, 'epoch': 5.53}


 55%|█████▌    | 1623/2930 [42:59<34:24,  1.58s/it]

{'loss': 0.0819, 'grad_norm': 0.3774256408214569, 'learning_rate': 0.00017849095254353023, 'epoch': 5.53}


 55%|█████▌    | 1624/2930 [43:00<34:27,  1.58s/it]

{'loss': 0.0636, 'grad_norm': 0.3555929362773895, 'learning_rate': 0.00017835438716285422, 'epoch': 5.54}


 55%|█████▌    | 1625/2930 [43:02<34:29,  1.59s/it]

{'loss': 0.0702, 'grad_norm': 0.3215370178222656, 'learning_rate': 0.00017821782178217824, 'epoch': 5.54}


 55%|█████▌    | 1626/2930 [43:03<34:35,  1.59s/it]

{'loss': 0.0647, 'grad_norm': 0.29388880729675293, 'learning_rate': 0.0001780812564015022, 'epoch': 5.54}


 56%|█████▌    | 1627/2930 [43:05<34:35,  1.59s/it]

{'loss': 0.0628, 'grad_norm': 0.27704301476478577, 'learning_rate': 0.00017794469102082623, 'epoch': 5.55}


 56%|█████▌    | 1628/2930 [43:06<34:37,  1.60s/it]

{'loss': 0.0787, 'grad_norm': 0.34346985816955566, 'learning_rate': 0.00017780812564015022, 'epoch': 5.55}


 56%|█████▌    | 1629/2930 [43:08<34:42,  1.60s/it]

{'loss': 0.0693, 'grad_norm': 0.33279886841773987, 'learning_rate': 0.00017767156025947422, 'epoch': 5.55}


 56%|█████▌    | 1630/2930 [43:10<34:23,  1.59s/it]

{'loss': 0.0685, 'grad_norm': 0.349695086479187, 'learning_rate': 0.00017753499487879824, 'epoch': 5.56}


 56%|█████▌    | 1631/2930 [43:11<34:22,  1.59s/it]

{'loss': 0.0571, 'grad_norm': 0.3158538341522217, 'learning_rate': 0.00017739842949812223, 'epoch': 5.56}


 56%|█████▌    | 1632/2930 [43:13<34:05,  1.58s/it]

{'loss': 0.072, 'grad_norm': 0.3882369101047516, 'learning_rate': 0.00017726186411744623, 'epoch': 5.57}


 56%|█████▌    | 1633/2930 [43:14<34:04,  1.58s/it]

{'loss': 0.0711, 'grad_norm': 0.35186469554901123, 'learning_rate': 0.00017712529873677025, 'epoch': 5.57}


 56%|█████▌    | 1634/2930 [43:16<34:02,  1.58s/it]

{'loss': 0.0711, 'grad_norm': 0.360072523355484, 'learning_rate': 0.00017698873335609424, 'epoch': 5.57}


 56%|█████▌    | 1635/2930 [43:18<34:00,  1.58s/it]

{'loss': 0.0748, 'grad_norm': 0.3913479149341583, 'learning_rate': 0.00017685216797541824, 'epoch': 5.58}


 56%|█████▌    | 1636/2930 [43:19<34:02,  1.58s/it]

{'loss': 0.0559, 'grad_norm': 0.29921117424964905, 'learning_rate': 0.00017671560259474226, 'epoch': 5.58}


 56%|█████▌    | 1637/2930 [43:21<34:01,  1.58s/it]

{'loss': 0.0717, 'grad_norm': 0.33582741022109985, 'learning_rate': 0.00017657903721406622, 'epoch': 5.58}


 56%|█████▌    | 1638/2930 [43:22<33:52,  1.57s/it]

{'loss': 0.0655, 'grad_norm': 0.3426766097545624, 'learning_rate': 0.00017644247183339025, 'epoch': 5.59}


 56%|█████▌    | 1639/2930 [43:24<33:53,  1.58s/it]

{'loss': 0.0616, 'grad_norm': 0.307269811630249, 'learning_rate': 0.00017630590645271427, 'epoch': 5.59}


 56%|█████▌    | 1640/2930 [43:25<33:36,  1.56s/it]

{'loss': 0.0661, 'grad_norm': 0.3315980136394501, 'learning_rate': 0.00017616934107203823, 'epoch': 5.59}


 56%|█████▌    | 1641/2930 [43:27<33:36,  1.56s/it]

{'loss': 0.0761, 'grad_norm': 0.3688851594924927, 'learning_rate': 0.00017603277569136226, 'epoch': 5.6}


 56%|█████▌    | 1642/2930 [43:29<33:40,  1.57s/it]

{'loss': 0.0667, 'grad_norm': 0.31265443563461304, 'learning_rate': 0.00017589621031068625, 'epoch': 5.6}


 56%|█████▌    | 1643/2930 [43:30<33:46,  1.57s/it]

{'loss': 0.0651, 'grad_norm': 0.3640306293964386, 'learning_rate': 0.00017575964493001024, 'epoch': 5.6}


 56%|█████▌    | 1644/2930 [43:32<33:54,  1.58s/it]

{'loss': 0.0804, 'grad_norm': 0.3720439672470093, 'learning_rate': 0.00017562307954933426, 'epoch': 5.61}


 56%|█████▌    | 1645/2930 [43:33<33:52,  1.58s/it]

{'loss': 0.0713, 'grad_norm': 0.3713473379611969, 'learning_rate': 0.00017548651416865823, 'epoch': 5.61}


 56%|█████▌    | 1646/2930 [43:35<33:58,  1.59s/it]

{'loss': 0.0642, 'grad_norm': 1.5305079221725464, 'learning_rate': 0.00017534994878798225, 'epoch': 5.61}


 56%|█████▌    | 1647/2930 [43:36<33:51,  1.58s/it]

{'loss': 0.0534, 'grad_norm': 0.2676602602005005, 'learning_rate': 0.00017521338340730627, 'epoch': 5.62}


 56%|█████▌    | 1648/2930 [43:38<33:51,  1.58s/it]

{'loss': 0.0708, 'grad_norm': 0.3354998528957367, 'learning_rate': 0.00017507681802663024, 'epoch': 5.62}


 56%|█████▋    | 1649/2930 [43:40<33:53,  1.59s/it]

{'loss': 0.0743, 'grad_norm': 0.3390306532382965, 'learning_rate': 0.00017494025264595426, 'epoch': 5.62}


 56%|█████▋    | 1650/2930 [43:41<33:40,  1.58s/it]

{'loss': 0.0733, 'grad_norm': 0.3113814890384674, 'learning_rate': 0.00017480368726527828, 'epoch': 5.63}


 56%|█████▋    | 1651/2930 [43:43<33:44,  1.58s/it]

{'loss': 0.0831, 'grad_norm': 0.37810593843460083, 'learning_rate': 0.00017466712188460225, 'epoch': 5.63}


 56%|█████▋    | 1652/2930 [43:44<33:49,  1.59s/it]

{'loss': 0.0715, 'grad_norm': 0.33792048692703247, 'learning_rate': 0.00017453055650392627, 'epoch': 5.63}


 56%|█████▋    | 1653/2930 [43:46<33:50,  1.59s/it]

{'loss': 0.0751, 'grad_norm': 0.3481842279434204, 'learning_rate': 0.00017439399112325027, 'epoch': 5.64}


 56%|█████▋    | 1654/2930 [43:48<33:30,  1.58s/it]

{'loss': 0.0695, 'grad_norm': 0.3404812812805176, 'learning_rate': 0.00017425742574257426, 'epoch': 5.64}


 56%|█████▋    | 1655/2930 [43:49<33:17,  1.57s/it]

{'loss': 0.0676, 'grad_norm': 0.31662479043006897, 'learning_rate': 0.00017412086036189828, 'epoch': 5.64}


 57%|█████▋    | 1656/2930 [43:51<33:23,  1.57s/it]

{'loss': 0.0625, 'grad_norm': 0.353133887052536, 'learning_rate': 0.00017398429498122225, 'epoch': 5.65}


 57%|█████▋    | 1657/2930 [43:52<33:23,  1.57s/it]

{'loss': 0.0698, 'grad_norm': 0.34623149037361145, 'learning_rate': 0.00017384772960054627, 'epoch': 5.65}


 57%|█████▋    | 1658/2930 [43:54<33:19,  1.57s/it]

{'loss': 0.0887, 'grad_norm': 0.3883180320262909, 'learning_rate': 0.0001737111642198703, 'epoch': 5.65}


 57%|█████▋    | 1659/2930 [43:55<33:12,  1.57s/it]

{'loss': 0.0746, 'grad_norm': 0.37482866644859314, 'learning_rate': 0.00017357459883919426, 'epoch': 5.66}


 57%|█████▋    | 1660/2930 [43:57<33:09,  1.57s/it]

{'loss': 0.0614, 'grad_norm': 0.31609025597572327, 'learning_rate': 0.00017343803345851828, 'epoch': 5.66}


 57%|█████▋    | 1661/2930 [43:58<32:56,  1.56s/it]

{'loss': 0.0729, 'grad_norm': 0.35787707567214966, 'learning_rate': 0.00017330146807784227, 'epoch': 5.66}


 57%|█████▋    | 1662/2930 [44:00<32:50,  1.55s/it]

{'loss': 0.0627, 'grad_norm': 0.3222653269767761, 'learning_rate': 0.00017316490269716627, 'epoch': 5.67}


 57%|█████▋    | 1663/2930 [44:02<32:49,  1.55s/it]

{'loss': 0.0759, 'grad_norm': 0.3390509784221649, 'learning_rate': 0.0001730283373164903, 'epoch': 5.67}


 57%|█████▋    | 1664/2930 [44:03<32:45,  1.55s/it]

{'loss': 0.0684, 'grad_norm': 0.3520734906196594, 'learning_rate': 0.00017289177193581428, 'epoch': 5.67}


 57%|█████▋    | 1665/2930 [44:05<32:47,  1.56s/it]

{'loss': 0.0819, 'grad_norm': 0.3724004328250885, 'learning_rate': 0.00017275520655513828, 'epoch': 5.68}


 57%|█████▋    | 1666/2930 [44:06<32:54,  1.56s/it]

{'loss': 0.0671, 'grad_norm': 0.3544366657733917, 'learning_rate': 0.0001726186411744623, 'epoch': 5.68}


 57%|█████▋    | 1667/2930 [44:08<32:40,  1.55s/it]

{'loss': 0.0707, 'grad_norm': 0.34365180134773254, 'learning_rate': 0.0001724820757937863, 'epoch': 5.68}


 57%|█████▋    | 1668/2930 [44:09<32:36,  1.55s/it]

{'loss': 0.0844, 'grad_norm': 0.40785151720046997, 'learning_rate': 0.0001723455104131103, 'epoch': 5.69}


 57%|█████▋    | 1669/2930 [44:11<32:36,  1.55s/it]

{'loss': 0.065, 'grad_norm': 0.3612571954727173, 'learning_rate': 0.00017220894503243428, 'epoch': 5.69}


 57%|█████▋    | 1670/2930 [44:12<32:33,  1.55s/it]

{'loss': 0.0791, 'grad_norm': 0.3600432276725769, 'learning_rate': 0.00017207237965175828, 'epoch': 5.69}


 57%|█████▋    | 1671/2930 [44:14<32:40,  1.56s/it]

{'loss': 0.0671, 'grad_norm': 0.3508896827697754, 'learning_rate': 0.0001719358142710823, 'epoch': 5.7}


 57%|█████▋    | 1672/2930 [44:16<32:35,  1.55s/it]

{'loss': 0.0685, 'grad_norm': 0.35256052017211914, 'learning_rate': 0.0001717992488904063, 'epoch': 5.7}


 57%|█████▋    | 1673/2930 [44:17<32:37,  1.56s/it]

{'loss': 0.0787, 'grad_norm': 0.3937649130821228, 'learning_rate': 0.00017166268350973028, 'epoch': 5.71}


 57%|█████▋    | 1674/2930 [44:19<32:42,  1.56s/it]

{'loss': 0.0646, 'grad_norm': 0.3146854639053345, 'learning_rate': 0.0001715261181290543, 'epoch': 5.71}


 57%|█████▋    | 1675/2930 [44:20<32:44,  1.57s/it]

{'loss': 0.0667, 'grad_norm': 0.3129865527153015, 'learning_rate': 0.0001713895527483783, 'epoch': 5.71}


 57%|█████▋    | 1676/2930 [44:22<32:40,  1.56s/it]

{'loss': 0.0687, 'grad_norm': 0.3160679340362549, 'learning_rate': 0.0001712529873677023, 'epoch': 5.72}


 57%|█████▋    | 1677/2930 [44:23<32:22,  1.55s/it]

{'loss': 0.0625, 'grad_norm': 0.3586801290512085, 'learning_rate': 0.0001711164219870263, 'epoch': 5.72}


 57%|█████▋    | 1678/2930 [44:25<32:22,  1.55s/it]

{'loss': 0.071, 'grad_norm': 0.3611689507961273, 'learning_rate': 0.0001709798566063503, 'epoch': 5.72}


 57%|█████▋    | 1679/2930 [44:26<32:30,  1.56s/it]

{'loss': 0.0724, 'grad_norm': 0.32646921277046204, 'learning_rate': 0.0001708432912256743, 'epoch': 5.73}


 57%|█████▋    | 1680/2930 [44:28<32:27,  1.56s/it]

{'loss': 0.0734, 'grad_norm': 0.3438551723957062, 'learning_rate': 0.0001707067258449983, 'epoch': 5.73}


 57%|█████▋    | 1681/2930 [44:30<32:27,  1.56s/it]

{'loss': 0.0671, 'grad_norm': 0.3439573347568512, 'learning_rate': 0.0001705701604643223, 'epoch': 5.73}


 57%|█████▋    | 1682/2930 [44:31<32:19,  1.55s/it]

{'loss': 0.0641, 'grad_norm': 0.3441942632198334, 'learning_rate': 0.0001704335950836463, 'epoch': 5.74}


 57%|█████▋    | 1683/2930 [44:33<32:11,  1.55s/it]

{'loss': 0.0606, 'grad_norm': 0.35302695631980896, 'learning_rate': 0.0001702970297029703, 'epoch': 5.74}


 57%|█████▋    | 1684/2930 [44:34<32:09,  1.55s/it]

{'loss': 0.0721, 'grad_norm': 0.34783971309661865, 'learning_rate': 0.0001701604643222943, 'epoch': 5.74}


 58%|█████▊    | 1685/2930 [44:36<32:14,  1.55s/it]

{'loss': 0.074, 'grad_norm': 0.3457019329071045, 'learning_rate': 0.0001700238989416183, 'epoch': 5.75}


 58%|█████▊    | 1686/2930 [44:37<32:21,  1.56s/it]

{'loss': 0.0687, 'grad_norm': 0.3112567961215973, 'learning_rate': 0.00016988733356094232, 'epoch': 5.75}


 58%|█████▊    | 1687/2930 [44:39<32:25,  1.57s/it]

{'loss': 0.0658, 'grad_norm': 0.3122674226760864, 'learning_rate': 0.0001697507681802663, 'epoch': 5.75}


 58%|█████▊    | 1688/2930 [44:40<32:22,  1.56s/it]

{'loss': 0.0773, 'grad_norm': 0.3454590439796448, 'learning_rate': 0.0001696142027995903, 'epoch': 5.76}


 58%|█████▊    | 1689/2930 [44:42<32:23,  1.57s/it]

{'loss': 0.0634, 'grad_norm': 0.30267852544784546, 'learning_rate': 0.00016947763741891433, 'epoch': 5.76}


 58%|█████▊    | 1690/2930 [44:44<32:22,  1.57s/it]

{'loss': 0.0597, 'grad_norm': 0.25762200355529785, 'learning_rate': 0.00016934107203823832, 'epoch': 5.76}


 58%|█████▊    | 1691/2930 [44:45<32:19,  1.57s/it]

{'loss': 0.0717, 'grad_norm': 0.32357263565063477, 'learning_rate': 0.00016920450665756231, 'epoch': 5.77}


 58%|█████▊    | 1692/2930 [44:47<32:13,  1.56s/it]

{'loss': 0.0789, 'grad_norm': 0.37627384066581726, 'learning_rate': 0.0001690679412768863, 'epoch': 5.77}


 58%|█████▊    | 1693/2930 [44:48<32:12,  1.56s/it]

{'loss': 0.0724, 'grad_norm': 0.348939448595047, 'learning_rate': 0.0001689313758962103, 'epoch': 5.77}


 58%|█████▊    | 1694/2930 [44:50<32:09,  1.56s/it]

{'loss': 0.072, 'grad_norm': 0.355587899684906, 'learning_rate': 0.00016879481051553432, 'epoch': 5.78}


 58%|█████▊    | 1695/2930 [44:51<32:00,  1.55s/it]

{'loss': 0.062, 'grad_norm': 0.3096280097961426, 'learning_rate': 0.00016865824513485832, 'epoch': 5.78}


 58%|█████▊    | 1696/2930 [44:53<32:03,  1.56s/it]

{'loss': 0.0677, 'grad_norm': 0.33475735783576965, 'learning_rate': 0.0001685216797541823, 'epoch': 5.78}


 58%|█████▊    | 1697/2930 [44:55<32:01,  1.56s/it]

{'loss': 0.0708, 'grad_norm': 0.4205711781978607, 'learning_rate': 0.00016838511437350633, 'epoch': 5.79}


 58%|█████▊    | 1698/2930 [44:56<32:01,  1.56s/it]

{'loss': 0.0702, 'grad_norm': 0.3334733247756958, 'learning_rate': 0.00016824854899283033, 'epoch': 5.79}


 58%|█████▊    | 1699/2930 [44:58<31:42,  1.55s/it]

{'loss': 0.0596, 'grad_norm': 0.29289719462394714, 'learning_rate': 0.00016811198361215432, 'epoch': 5.79}


 58%|█████▊    | 1700/2930 [44:59<31:52,  1.55s/it]

{'loss': 0.0643, 'grad_norm': 0.3059764504432678, 'learning_rate': 0.00016797541823147834, 'epoch': 5.8}


 58%|█████▊    | 1701/2930 [45:01<31:45,  1.55s/it]

{'loss': 0.0569, 'grad_norm': 0.2892500162124634, 'learning_rate': 0.0001678388528508023, 'epoch': 5.8}


 58%|█████▊    | 1702/2930 [45:02<31:57,  1.56s/it]

{'loss': 0.0709, 'grad_norm': 0.33708661794662476, 'learning_rate': 0.00016770228747012633, 'epoch': 5.8}


 58%|█████▊    | 1703/2930 [45:04<31:52,  1.56s/it]

{'loss': 0.067, 'grad_norm': 0.3264619708061218, 'learning_rate': 0.00016756572208945033, 'epoch': 5.81}


 58%|█████▊    | 1704/2930 [45:05<31:51,  1.56s/it]

{'loss': 0.0733, 'grad_norm': 0.37959522008895874, 'learning_rate': 0.00016742915670877432, 'epoch': 5.81}


 58%|█████▊    | 1705/2930 [45:07<31:53,  1.56s/it]

{'loss': 0.0625, 'grad_norm': 0.33454447984695435, 'learning_rate': 0.00016729259132809834, 'epoch': 5.81}


 58%|█████▊    | 1706/2930 [45:09<31:57,  1.57s/it]

{'loss': 0.0561, 'grad_norm': 0.2910950183868408, 'learning_rate': 0.00016715602594742234, 'epoch': 5.82}


 58%|█████▊    | 1707/2930 [45:10<31:46,  1.56s/it]

{'loss': 0.0615, 'grad_norm': 0.2915467619895935, 'learning_rate': 0.00016701946056674633, 'epoch': 5.82}


 58%|█████▊    | 1708/2930 [45:12<31:47,  1.56s/it]

{'loss': 0.063, 'grad_norm': 0.3309738039970398, 'learning_rate': 0.00016688289518607035, 'epoch': 5.82}


 58%|█████▊    | 1709/2930 [45:13<31:37,  1.55s/it]

{'loss': 0.0714, 'grad_norm': 0.34512269496917725, 'learning_rate': 0.00016674632980539432, 'epoch': 5.83}


 58%|█████▊    | 1710/2930 [45:15<31:25,  1.55s/it]

{'loss': 0.0649, 'grad_norm': 0.33787062764167786, 'learning_rate': 0.00016660976442471834, 'epoch': 5.83}


 58%|█████▊    | 1711/2930 [45:16<31:21,  1.54s/it]

{'loss': 0.0614, 'grad_norm': 0.31520965695381165, 'learning_rate': 0.00016647319904404236, 'epoch': 5.83}


 58%|█████▊    | 1712/2930 [45:18<31:31,  1.55s/it]

{'loss': 0.0574, 'grad_norm': 0.35259997844696045, 'learning_rate': 0.00016633663366336633, 'epoch': 5.84}


 58%|█████▊    | 1713/2930 [45:19<31:26,  1.55s/it]

{'loss': 0.0567, 'grad_norm': 0.30822810530662537, 'learning_rate': 0.00016620006828269035, 'epoch': 5.84}


 58%|█████▊    | 1714/2930 [45:21<31:21,  1.55s/it]

{'loss': 0.0687, 'grad_norm': 0.351807177066803, 'learning_rate': 0.00016606350290201437, 'epoch': 5.84}


 59%|█████▊    | 1715/2930 [45:22<31:16,  1.54s/it]

{'loss': 0.0652, 'grad_norm': 0.33687692880630493, 'learning_rate': 0.00016592693752133834, 'epoch': 5.85}


 59%|█████▊    | 1716/2930 [45:24<31:22,  1.55s/it]

{'loss': 0.0612, 'grad_norm': 0.3030819594860077, 'learning_rate': 0.00016579037214066236, 'epoch': 5.85}


 59%|█████▊    | 1717/2930 [45:26<31:22,  1.55s/it]

{'loss': 0.0696, 'grad_norm': 0.338060200214386, 'learning_rate': 0.00016565380675998635, 'epoch': 5.86}


 59%|█████▊    | 1718/2930 [45:27<31:28,  1.56s/it]

{'loss': 0.0805, 'grad_norm': 0.4051430821418762, 'learning_rate': 0.00016551724137931035, 'epoch': 5.86}


 59%|█████▊    | 1719/2930 [45:29<31:16,  1.55s/it]

{'loss': 0.0608, 'grad_norm': 0.3190195858478546, 'learning_rate': 0.00016538067599863437, 'epoch': 5.86}


 59%|█████▊    | 1720/2930 [45:30<31:30,  1.56s/it]

{'loss': 0.0746, 'grad_norm': 0.35475409030914307, 'learning_rate': 0.00016524411061795833, 'epoch': 5.87}


 59%|█████▊    | 1721/2930 [45:32<31:33,  1.57s/it]

{'loss': 0.0679, 'grad_norm': 0.3157251179218292, 'learning_rate': 0.00016510754523728236, 'epoch': 5.87}


 59%|█████▉    | 1722/2930 [45:33<31:35,  1.57s/it]

{'loss': 0.0649, 'grad_norm': 0.30528098344802856, 'learning_rate': 0.00016497097985660638, 'epoch': 5.87}


 59%|█████▉    | 1723/2930 [45:35<31:38,  1.57s/it]

{'loss': 0.0655, 'grad_norm': 0.33273202180862427, 'learning_rate': 0.00016483441447593034, 'epoch': 5.88}


 59%|█████▉    | 1724/2930 [45:37<31:27,  1.57s/it]

{'loss': 0.0776, 'grad_norm': 0.3653239905834198, 'learning_rate': 0.00016469784909525437, 'epoch': 5.88}


 59%|█████▉    | 1725/2930 [45:38<31:24,  1.56s/it]

{'loss': 0.073, 'grad_norm': 0.3535367548465729, 'learning_rate': 0.00016456128371457836, 'epoch': 5.88}


 59%|█████▉    | 1726/2930 [45:40<31:19,  1.56s/it]

{'loss': 0.0622, 'grad_norm': 0.335769385099411, 'learning_rate': 0.00016442471833390235, 'epoch': 5.89}


 59%|█████▉    | 1727/2930 [45:41<31:21,  1.56s/it]

{'loss': 0.064, 'grad_norm': 0.30276283621788025, 'learning_rate': 0.00016428815295322637, 'epoch': 5.89}


 59%|█████▉    | 1728/2930 [45:43<31:19,  1.56s/it]

{'loss': 0.075, 'grad_norm': 0.35170578956604004, 'learning_rate': 0.00016415158757255037, 'epoch': 5.89}


 59%|█████▉    | 1729/2930 [45:44<31:13,  1.56s/it]

{'loss': 0.0687, 'grad_norm': 0.33023756742477417, 'learning_rate': 0.00016401502219187436, 'epoch': 5.9}


 59%|█████▉    | 1730/2930 [45:46<31:04,  1.55s/it]

{'loss': 0.0714, 'grad_norm': 0.38983458280563354, 'learning_rate': 0.00016387845681119838, 'epoch': 5.9}


 59%|█████▉    | 1731/2930 [45:47<31:06,  1.56s/it]

{'loss': 0.0699, 'grad_norm': 0.350766122341156, 'learning_rate': 0.00016374189143052238, 'epoch': 5.9}


 59%|█████▉    | 1732/2930 [45:49<31:13,  1.56s/it]

{'loss': 0.0749, 'grad_norm': 0.36954185366630554, 'learning_rate': 0.00016360532604984637, 'epoch': 5.91}


 59%|█████▉    | 1733/2930 [45:51<31:13,  1.56s/it]

{'loss': 0.0598, 'grad_norm': 0.3187588155269623, 'learning_rate': 0.00016346876066917037, 'epoch': 5.91}


 59%|█████▉    | 1734/2930 [45:52<31:14,  1.57s/it]

{'loss': 0.0614, 'grad_norm': 0.31767237186431885, 'learning_rate': 0.00016333219528849436, 'epoch': 5.91}


 59%|█████▉    | 1735/2930 [45:54<31:12,  1.57s/it]

{'loss': 0.0624, 'grad_norm': 0.29891642928123474, 'learning_rate': 0.00016319562990781838, 'epoch': 5.92}


 59%|█████▉    | 1736/2930 [45:55<31:12,  1.57s/it]

{'loss': 0.0615, 'grad_norm': 0.3215799033641815, 'learning_rate': 0.00016305906452714238, 'epoch': 5.92}


 59%|█████▉    | 1737/2930 [45:57<31:14,  1.57s/it]

{'loss': 0.0663, 'grad_norm': 0.3284009099006653, 'learning_rate': 0.00016292249914646637, 'epoch': 5.92}


 59%|█████▉    | 1738/2930 [45:58<31:20,  1.58s/it]

{'loss': 0.0569, 'grad_norm': 0.3144752085208893, 'learning_rate': 0.0001627859337657904, 'epoch': 5.93}


 59%|█████▉    | 1739/2930 [46:00<31:22,  1.58s/it]

{'loss': 0.0677, 'grad_norm': 0.3231860101222992, 'learning_rate': 0.00016264936838511439, 'epoch': 5.93}


 59%|█████▉    | 1740/2930 [46:02<31:16,  1.58s/it]

{'loss': 0.06, 'grad_norm': 0.2895374298095703, 'learning_rate': 0.00016251280300443838, 'epoch': 5.93}


 59%|█████▉    | 1741/2930 [46:03<31:16,  1.58s/it]

{'loss': 0.0799, 'grad_norm': 0.3636800944805145, 'learning_rate': 0.0001623762376237624, 'epoch': 5.94}


 59%|█████▉    | 1742/2930 [46:05<31:20,  1.58s/it]

{'loss': 0.069, 'grad_norm': 0.3278675675392151, 'learning_rate': 0.0001622396722430864, 'epoch': 5.94}


 59%|█████▉    | 1743/2930 [46:06<31:22,  1.59s/it]

{'loss': 0.0789, 'grad_norm': 0.3785158097743988, 'learning_rate': 0.0001621031068624104, 'epoch': 5.94}


 60%|█████▉    | 1744/2930 [46:08<31:18,  1.58s/it]

{'loss': 0.0608, 'grad_norm': 0.3288118541240692, 'learning_rate': 0.00016196654148173438, 'epoch': 5.95}


 60%|█████▉    | 1745/2930 [46:10<31:12,  1.58s/it]

{'loss': 0.0587, 'grad_norm': 0.2989487648010254, 'learning_rate': 0.00016182997610105838, 'epoch': 5.95}


 60%|█████▉    | 1746/2930 [46:11<31:09,  1.58s/it]

{'loss': 0.0736, 'grad_norm': 0.3349917232990265, 'learning_rate': 0.0001616934107203824, 'epoch': 5.95}


 60%|█████▉    | 1747/2930 [46:13<31:08,  1.58s/it]

{'loss': 0.0744, 'grad_norm': 0.3836369812488556, 'learning_rate': 0.0001615568453397064, 'epoch': 5.96}


 60%|█████▉    | 1748/2930 [46:14<30:46,  1.56s/it]

{'loss': 0.062, 'grad_norm': 0.313242644071579, 'learning_rate': 0.0001614202799590304, 'epoch': 5.96}


 60%|█████▉    | 1749/2930 [46:16<30:43,  1.56s/it]

{'loss': 0.0687, 'grad_norm': 0.317844957113266, 'learning_rate': 0.0001612837145783544, 'epoch': 5.96}


 60%|█████▉    | 1750/2930 [46:17<30:53,  1.57s/it]

{'loss': 0.0621, 'grad_norm': 0.29437562823295593, 'learning_rate': 0.0001611471491976784, 'epoch': 5.97}


 60%|█████▉    | 1751/2930 [46:19<31:05,  1.58s/it]

{'loss': 0.0727, 'grad_norm': 0.3292478024959564, 'learning_rate': 0.0001610105838170024, 'epoch': 5.97}


 60%|█████▉    | 1752/2930 [46:21<30:55,  1.57s/it]

{'loss': 0.0612, 'grad_norm': 0.27738338708877563, 'learning_rate': 0.0001608740184363264, 'epoch': 5.97}


 60%|█████▉    | 1753/2930 [46:22<30:59,  1.58s/it]

{'loss': 0.0669, 'grad_norm': 0.32084816694259644, 'learning_rate': 0.0001607374530556504, 'epoch': 5.98}


 60%|█████▉    | 1754/2930 [46:24<30:52,  1.58s/it]

{'loss': 0.0664, 'grad_norm': 0.32851576805114746, 'learning_rate': 0.0001606008876749744, 'epoch': 5.98}


 60%|█████▉    | 1755/2930 [46:25<30:58,  1.58s/it]

{'loss': 0.0619, 'grad_norm': 0.3387250602245331, 'learning_rate': 0.0001604643222942984, 'epoch': 5.98}


 60%|█████▉    | 1756/2930 [46:27<31:02,  1.59s/it]

{'loss': 0.0692, 'grad_norm': 0.36410871148109436, 'learning_rate': 0.0001603277569136224, 'epoch': 5.99}


 60%|█████▉    | 1757/2930 [46:28<30:54,  1.58s/it]

{'loss': 0.0591, 'grad_norm': 0.31704026460647583, 'learning_rate': 0.00016019119153294642, 'epoch': 5.99}


 60%|██████    | 1758/2930 [46:30<30:40,  1.57s/it]

{'loss': 0.0634, 'grad_norm': 0.32279786467552185, 'learning_rate': 0.0001600546261522704, 'epoch': 5.99}


 60%|██████    | 1759/2930 [46:32<30:46,  1.58s/it]

{'loss': 0.057, 'grad_norm': 0.2750723361968994, 'learning_rate': 0.0001599180607715944, 'epoch': 6.0}


 60%|██████    | 1760/2930 [46:33<30:22,  1.56s/it]

{'loss': 0.0637, 'grad_norm': 0.32976701855659485, 'learning_rate': 0.0001597814953909184, 'epoch': 6.0}


 60%|██████    | 1761/2930 [46:35<30:19,  1.56s/it]

{'loss': 0.0448, 'grad_norm': 0.22905269265174866, 'learning_rate': 0.00015964493001024242, 'epoch': 6.01}


 60%|██████    | 1762/2930 [46:36<30:21,  1.56s/it]

{'loss': 0.0455, 'grad_norm': 0.2613821029663086, 'learning_rate': 0.00015950836462956641, 'epoch': 6.01}


 60%|██████    | 1763/2930 [46:38<30:33,  1.57s/it]

{'loss': 0.0365, 'grad_norm': 0.24872489273548126, 'learning_rate': 0.0001593717992488904, 'epoch': 6.01}


 60%|██████    | 1764/2930 [46:39<30:32,  1.57s/it]

{'loss': 0.0426, 'grad_norm': 0.26074182987213135, 'learning_rate': 0.00015923523386821443, 'epoch': 6.02}


 60%|██████    | 1765/2930 [46:41<30:36,  1.58s/it]

{'loss': 0.0483, 'grad_norm': 0.3057127296924591, 'learning_rate': 0.00015909866848753842, 'epoch': 6.02}


 60%|██████    | 1766/2930 [46:43<30:22,  1.57s/it]

{'loss': 0.0389, 'grad_norm': 0.303374707698822, 'learning_rate': 0.00015896210310686242, 'epoch': 6.02}


 60%|██████    | 1767/2930 [46:44<30:20,  1.57s/it]

{'loss': 0.0361, 'grad_norm': 0.2755545973777771, 'learning_rate': 0.0001588255377261864, 'epoch': 6.03}


 60%|██████    | 1768/2930 [46:46<30:21,  1.57s/it]

{'loss': 0.0488, 'grad_norm': 0.41784998774528503, 'learning_rate': 0.0001586889723455104, 'epoch': 6.03}


 60%|██████    | 1769/2930 [46:47<30:16,  1.56s/it]

{'loss': 0.044, 'grad_norm': 0.3623383045196533, 'learning_rate': 0.00015855240696483443, 'epoch': 6.03}


 60%|██████    | 1770/2930 [46:49<30:25,  1.57s/it]

{'loss': 0.0382, 'grad_norm': 0.302849680185318, 'learning_rate': 0.00015841584158415842, 'epoch': 6.04}


 60%|██████    | 1771/2930 [46:50<30:14,  1.57s/it]

{'loss': 0.0369, 'grad_norm': 0.3422788679599762, 'learning_rate': 0.00015827927620348242, 'epoch': 6.04}


 60%|██████    | 1772/2930 [46:52<30:11,  1.56s/it]

{'loss': 0.0429, 'grad_norm': 0.334575355052948, 'learning_rate': 0.00015814271082280644, 'epoch': 6.04}


 61%|██████    | 1773/2930 [46:54<30:10,  1.57s/it]

{'loss': 0.0459, 'grad_norm': 0.33621644973754883, 'learning_rate': 0.00015800614544213043, 'epoch': 6.05}


 61%|██████    | 1774/2930 [46:55<30:08,  1.56s/it]

{'loss': 0.0413, 'grad_norm': 0.32231226563453674, 'learning_rate': 0.00015786958006145442, 'epoch': 6.05}


 61%|██████    | 1775/2930 [46:57<30:13,  1.57s/it]

{'loss': 0.0492, 'grad_norm': 0.3492160141468048, 'learning_rate': 0.00015773301468077845, 'epoch': 6.05}


 61%|██████    | 1776/2930 [46:58<30:02,  1.56s/it]

{'loss': 0.0388, 'grad_norm': 0.2782396376132965, 'learning_rate': 0.0001575964493001024, 'epoch': 6.06}


 61%|██████    | 1777/2930 [47:00<30:16,  1.58s/it]

{'loss': 0.0432, 'grad_norm': 0.2971923053264618, 'learning_rate': 0.00015745988391942643, 'epoch': 6.06}


 61%|██████    | 1778/2930 [47:01<30:23,  1.58s/it]

{'loss': 0.0448, 'grad_norm': 0.29518139362335205, 'learning_rate': 0.00015732331853875043, 'epoch': 6.06}


 61%|██████    | 1779/2930 [47:03<30:14,  1.58s/it]

{'loss': 0.0458, 'grad_norm': 0.30050745606422424, 'learning_rate': 0.00015718675315807442, 'epoch': 6.07}


 61%|██████    | 1780/2930 [47:05<30:17,  1.58s/it]

{'loss': 0.0526, 'grad_norm': 0.318877249956131, 'learning_rate': 0.00015705018777739844, 'epoch': 6.07}


 61%|██████    | 1781/2930 [47:06<30:13,  1.58s/it]

{'loss': 0.0417, 'grad_norm': 0.27623477578163147, 'learning_rate': 0.00015691362239672244, 'epoch': 6.07}


 61%|██████    | 1782/2930 [47:08<30:10,  1.58s/it]

{'loss': 0.0491, 'grad_norm': 0.33222314715385437, 'learning_rate': 0.00015677705701604643, 'epoch': 6.08}


 61%|██████    | 1783/2930 [47:09<30:16,  1.58s/it]

{'loss': 0.0407, 'grad_norm': 0.29032647609710693, 'learning_rate': 0.00015664049163537045, 'epoch': 6.08}


 61%|██████    | 1784/2930 [47:11<30:11,  1.58s/it]

{'loss': 0.0406, 'grad_norm': 0.2787192463874817, 'learning_rate': 0.00015650392625469442, 'epoch': 6.08}


 61%|██████    | 1785/2930 [47:12<29:52,  1.57s/it]

{'loss': 0.0481, 'grad_norm': 0.35046669840812683, 'learning_rate': 0.00015636736087401844, 'epoch': 6.09}


 61%|██████    | 1786/2930 [47:14<29:57,  1.57s/it]

{'loss': 0.0445, 'grad_norm': 0.2999608814716339, 'learning_rate': 0.00015623079549334246, 'epoch': 6.09}


 61%|██████    | 1787/2930 [47:16<29:58,  1.57s/it]

{'loss': 0.05, 'grad_norm': 0.32988592982292175, 'learning_rate': 0.00015609423011266643, 'epoch': 6.09}


 61%|██████    | 1788/2930 [47:17<30:00,  1.58s/it]

{'loss': 0.0395, 'grad_norm': 0.2782198190689087, 'learning_rate': 0.00015595766473199045, 'epoch': 6.1}


 61%|██████    | 1789/2930 [47:19<29:59,  1.58s/it]

{'loss': 0.0495, 'grad_norm': 0.3349972665309906, 'learning_rate': 0.00015582109935131447, 'epoch': 6.1}


 61%|██████    | 1790/2930 [47:20<29:57,  1.58s/it]

{'loss': 0.0401, 'grad_norm': 0.2745943069458008, 'learning_rate': 0.00015568453397063844, 'epoch': 6.1}


 61%|██████    | 1791/2930 [47:22<29:50,  1.57s/it]

{'loss': 0.0525, 'grad_norm': 0.3719400465488434, 'learning_rate': 0.00015554796858996246, 'epoch': 6.11}


 61%|██████    | 1792/2930 [47:23<29:47,  1.57s/it]

{'loss': 0.0417, 'grad_norm': 0.2968427836894989, 'learning_rate': 0.00015541140320928645, 'epoch': 6.11}


 61%|██████    | 1793/2930 [47:25<29:44,  1.57s/it]

{'loss': 0.0533, 'grad_norm': 0.3413950204849243, 'learning_rate': 0.00015527483782861045, 'epoch': 6.11}


 61%|██████    | 1794/2930 [47:27<29:42,  1.57s/it]

{'loss': 0.0442, 'grad_norm': 0.3121582865715027, 'learning_rate': 0.00015513827244793447, 'epoch': 6.12}


 61%|██████▏   | 1795/2930 [47:28<29:39,  1.57s/it]

{'loss': 0.0446, 'grad_norm': 0.3141803443431854, 'learning_rate': 0.00015500170706725844, 'epoch': 6.12}


 61%|██████▏   | 1796/2930 [47:30<29:34,  1.57s/it]

{'loss': 0.0518, 'grad_norm': 0.3651014566421509, 'learning_rate': 0.00015486514168658246, 'epoch': 6.12}


 61%|██████▏   | 1797/2930 [47:31<29:34,  1.57s/it]

{'loss': 0.0443, 'grad_norm': 0.30956360697746277, 'learning_rate': 0.00015472857630590648, 'epoch': 6.13}


 61%|██████▏   | 1798/2930 [47:33<29:24,  1.56s/it]

{'loss': 0.0414, 'grad_norm': 0.2819195091724396, 'learning_rate': 0.00015459201092523045, 'epoch': 6.13}


 61%|██████▏   | 1799/2930 [47:34<29:28,  1.56s/it]

{'loss': 0.0489, 'grad_norm': 0.33730530738830566, 'learning_rate': 0.00015445544554455447, 'epoch': 6.13}


 61%|██████▏   | 1800/2930 [47:36<29:35,  1.57s/it]

{'loss': 0.0416, 'grad_norm': 0.2910168468952179, 'learning_rate': 0.00015431888016387846, 'epoch': 6.14}


 61%|██████▏   | 1801/2930 [47:38<29:35,  1.57s/it]

{'loss': 0.0432, 'grad_norm': 0.2705249786376953, 'learning_rate': 0.00015418231478320246, 'epoch': 6.14}


 62%|██████▏   | 1802/2930 [47:39<29:34,  1.57s/it]

{'loss': 0.0414, 'grad_norm': 0.2878706157207489, 'learning_rate': 0.00015404574940252648, 'epoch': 6.14}


 62%|██████▏   | 1803/2930 [47:41<29:29,  1.57s/it]

{'loss': 0.0413, 'grad_norm': 0.3129920959472656, 'learning_rate': 0.00015390918402185047, 'epoch': 6.15}


 62%|██████▏   | 1804/2930 [47:42<29:15,  1.56s/it]

{'loss': 0.0449, 'grad_norm': 0.2639618217945099, 'learning_rate': 0.00015377261864117447, 'epoch': 6.15}


 62%|██████▏   | 1805/2930 [47:44<29:08,  1.55s/it]

{'loss': 0.0479, 'grad_norm': 0.30115577578544617, 'learning_rate': 0.0001536360532604985, 'epoch': 6.16}


 62%|██████▏   | 1806/2930 [47:45<29:16,  1.56s/it]

{'loss': 0.0499, 'grad_norm': 0.30348876118659973, 'learning_rate': 0.00015349948787982248, 'epoch': 6.16}


 62%|██████▏   | 1807/2930 [47:47<29:14,  1.56s/it]

{'loss': 0.0469, 'grad_norm': 0.3442979156970978, 'learning_rate': 0.00015336292249914648, 'epoch': 6.16}


 62%|██████▏   | 1808/2930 [47:48<29:16,  1.57s/it]

{'loss': 0.0467, 'grad_norm': 0.34679895639419556, 'learning_rate': 0.00015322635711847047, 'epoch': 6.17}


 62%|██████▏   | 1809/2930 [47:50<29:10,  1.56s/it]

{'loss': 0.043, 'grad_norm': 0.30340343713760376, 'learning_rate': 0.00015308979173779446, 'epoch': 6.17}


 62%|██████▏   | 1810/2930 [47:52<28:57,  1.55s/it]

{'loss': 0.0396, 'grad_norm': 0.3174162805080414, 'learning_rate': 0.00015295322635711848, 'epoch': 6.17}


 62%|██████▏   | 1811/2930 [47:53<29:05,  1.56s/it]

{'loss': 0.0421, 'grad_norm': 0.3161090016365051, 'learning_rate': 0.00015281666097644248, 'epoch': 6.18}


 62%|██████▏   | 1812/2930 [47:55<29:07,  1.56s/it]

{'loss': 0.044, 'grad_norm': 0.31212562322616577, 'learning_rate': 0.00015268009559576647, 'epoch': 6.18}


 62%|██████▏   | 1813/2930 [47:56<28:56,  1.55s/it]

{'loss': 0.0431, 'grad_norm': 0.3294938802719116, 'learning_rate': 0.0001525435302150905, 'epoch': 6.18}


 62%|██████▏   | 1814/2930 [47:58<29:00,  1.56s/it]

{'loss': 0.0409, 'grad_norm': 0.29684245586395264, 'learning_rate': 0.0001524069648344145, 'epoch': 6.19}


 62%|██████▏   | 1815/2930 [47:59<28:58,  1.56s/it]

{'loss': 0.0472, 'grad_norm': 0.3954585790634155, 'learning_rate': 0.00015227039945373848, 'epoch': 6.19}


 62%|██████▏   | 1816/2930 [48:01<29:00,  1.56s/it]

{'loss': 0.0495, 'grad_norm': 0.33930057287216187, 'learning_rate': 0.00015213383407306248, 'epoch': 6.19}


 62%|██████▏   | 1817/2930 [48:03<28:59,  1.56s/it]

{'loss': 0.0498, 'grad_norm': 0.32951024174690247, 'learning_rate': 0.0001519972686923865, 'epoch': 6.2}


 62%|██████▏   | 1818/2930 [48:04<28:53,  1.56s/it]

{'loss': 0.0366, 'grad_norm': 0.2567121982574463, 'learning_rate': 0.0001518607033117105, 'epoch': 6.2}


 62%|██████▏   | 1819/2930 [48:06<28:58,  1.56s/it]

{'loss': 0.0462, 'grad_norm': 0.3449903428554535, 'learning_rate': 0.00015172413793103449, 'epoch': 6.2}


 62%|██████▏   | 1820/2930 [48:07<28:46,  1.56s/it]

{'loss': 0.041, 'grad_norm': 0.31108495593070984, 'learning_rate': 0.00015158757255035848, 'epoch': 6.21}


 62%|██████▏   | 1821/2930 [48:09<28:41,  1.55s/it]

{'loss': 0.047, 'grad_norm': 0.30412065982818604, 'learning_rate': 0.0001514510071696825, 'epoch': 6.21}


 62%|██████▏   | 1822/2930 [48:10<28:31,  1.54s/it]

{'loss': 0.0435, 'grad_norm': 0.3152598738670349, 'learning_rate': 0.0001513144417890065, 'epoch': 6.21}


 62%|██████▏   | 1823/2930 [48:12<28:25,  1.54s/it]

{'loss': 0.0454, 'grad_norm': 0.3156518340110779, 'learning_rate': 0.0001511778764083305, 'epoch': 6.22}


 62%|██████▏   | 1824/2930 [48:13<28:18,  1.54s/it]

{'loss': 0.0441, 'grad_norm': 0.34372320771217346, 'learning_rate': 0.00015104131102765448, 'epoch': 6.22}


 62%|██████▏   | 1825/2930 [48:15<28:25,  1.54s/it]

{'loss': 0.0519, 'grad_norm': 0.3437564969062805, 'learning_rate': 0.0001509047456469785, 'epoch': 6.22}


 62%|██████▏   | 1826/2930 [48:16<28:31,  1.55s/it]

{'loss': 0.0419, 'grad_norm': 0.2609958052635193, 'learning_rate': 0.0001507681802663025, 'epoch': 6.23}


 62%|██████▏   | 1827/2930 [48:18<28:39,  1.56s/it]

{'loss': 0.0516, 'grad_norm': 0.32485881447792053, 'learning_rate': 0.0001506316148856265, 'epoch': 6.23}


 62%|██████▏   | 1828/2930 [48:20<28:33,  1.56s/it]

{'loss': 0.0457, 'grad_norm': 0.31679767370224, 'learning_rate': 0.00015049504950495051, 'epoch': 6.23}


 62%|██████▏   | 1829/2930 [48:21<28:33,  1.56s/it]

{'loss': 0.0413, 'grad_norm': 0.41784000396728516, 'learning_rate': 0.0001503584841242745, 'epoch': 6.24}


 62%|██████▏   | 1830/2930 [48:23<28:05,  1.53s/it]

{'loss': 0.0385, 'grad_norm': 0.2466733157634735, 'learning_rate': 0.0001502219187435985, 'epoch': 6.24}


 62%|██████▏   | 1831/2930 [48:24<28:17,  1.54s/it]

{'loss': 0.042, 'grad_norm': 0.2995202839374542, 'learning_rate': 0.0001500853533629225, 'epoch': 6.24}


 63%|██████▎   | 1832/2930 [48:26<28:07,  1.54s/it]

{'loss': 0.0414, 'grad_norm': 0.29041776061058044, 'learning_rate': 0.0001499487879822465, 'epoch': 6.25}


 63%|██████▎   | 1833/2930 [48:27<28:18,  1.55s/it]

{'loss': 0.045, 'grad_norm': 0.3372683823108673, 'learning_rate': 0.0001498122226015705, 'epoch': 6.25}


 63%|██████▎   | 1834/2930 [48:29<28:24,  1.56s/it]

{'loss': 0.0428, 'grad_norm': 0.29618918895721436, 'learning_rate': 0.0001496756572208945, 'epoch': 6.25}


 63%|██████▎   | 1835/2930 [48:30<28:30,  1.56s/it]

{'loss': 0.0595, 'grad_norm': 0.3914436399936676, 'learning_rate': 0.0001495390918402185, 'epoch': 6.26}


 63%|██████▎   | 1836/2930 [48:32<28:29,  1.56s/it]

{'loss': 0.0429, 'grad_norm': 0.3098396956920624, 'learning_rate': 0.00014940252645954252, 'epoch': 6.26}


 63%|██████▎   | 1837/2930 [48:33<28:07,  1.54s/it]

{'loss': 0.0458, 'grad_norm': 0.33541813492774963, 'learning_rate': 0.00014926596107886652, 'epoch': 6.26}


 63%|██████▎   | 1838/2930 [48:35<28:05,  1.54s/it]

{'loss': 0.0475, 'grad_norm': 0.339534193277359, 'learning_rate': 0.0001491293956981905, 'epoch': 6.27}


 63%|██████▎   | 1839/2930 [48:37<28:15,  1.55s/it]

{'loss': 0.052, 'grad_norm': 0.3137018382549286, 'learning_rate': 0.00014899283031751453, 'epoch': 6.27}


 63%|██████▎   | 1840/2930 [48:38<28:18,  1.56s/it]

{'loss': 0.0501, 'grad_norm': 0.31584855914115906, 'learning_rate': 0.0001488562649368385, 'epoch': 6.27}


 63%|██████▎   | 1841/2930 [48:40<28:07,  1.55s/it]

{'loss': 0.0385, 'grad_norm': 0.27807509899139404, 'learning_rate': 0.00014871969955616252, 'epoch': 6.28}


 63%|██████▎   | 1842/2930 [48:41<28:13,  1.56s/it]

{'loss': 0.0427, 'grad_norm': 0.2938554584980011, 'learning_rate': 0.00014858313417548651, 'epoch': 6.28}


 63%|██████▎   | 1843/2930 [48:43<28:06,  1.55s/it]

{'loss': 0.0374, 'grad_norm': 0.2667103111743927, 'learning_rate': 0.0001484465687948105, 'epoch': 6.28}


 63%|██████▎   | 1844/2930 [48:44<28:01,  1.55s/it]

{'loss': 0.0457, 'grad_norm': 0.297451913356781, 'learning_rate': 0.00014831000341413453, 'epoch': 6.29}


 63%|██████▎   | 1845/2930 [48:46<28:08,  1.56s/it]

{'loss': 0.0537, 'grad_norm': 0.326172798871994, 'learning_rate': 0.00014817343803345852, 'epoch': 6.29}


 63%|██████▎   | 1846/2930 [48:47<28:13,  1.56s/it]

{'loss': 0.0486, 'grad_norm': 0.29315611720085144, 'learning_rate': 0.00014803687265278252, 'epoch': 6.29}


 63%|██████▎   | 1847/2930 [48:49<28:13,  1.56s/it]

{'loss': 0.0425, 'grad_norm': 0.3096560537815094, 'learning_rate': 0.00014790030727210654, 'epoch': 6.3}


 63%|██████▎   | 1848/2930 [48:51<28:13,  1.57s/it]

{'loss': 0.0463, 'grad_norm': 0.3186192512512207, 'learning_rate': 0.00014776374189143053, 'epoch': 6.3}


 63%|██████▎   | 1849/2930 [48:52<28:09,  1.56s/it]

{'loss': 0.0431, 'grad_norm': 0.27718478441238403, 'learning_rate': 0.00014762717651075453, 'epoch': 6.31}


 63%|██████▎   | 1850/2930 [48:54<28:08,  1.56s/it]

{'loss': 0.0465, 'grad_norm': 0.3455556333065033, 'learning_rate': 0.00014749061113007855, 'epoch': 6.31}


 63%|██████▎   | 1851/2930 [48:55<28:03,  1.56s/it]

{'loss': 0.0516, 'grad_norm': 0.32290413975715637, 'learning_rate': 0.00014735404574940252, 'epoch': 6.31}


 63%|██████▎   | 1852/2930 [48:57<28:02,  1.56s/it]

{'loss': 0.0434, 'grad_norm': 0.2971360683441162, 'learning_rate': 0.00014721748036872654, 'epoch': 6.32}


 63%|██████▎   | 1853/2930 [48:58<27:53,  1.55s/it]

{'loss': 0.0414, 'grad_norm': 0.31546756625175476, 'learning_rate': 0.00014708091498805053, 'epoch': 6.32}


 63%|██████▎   | 1854/2930 [49:00<27:50,  1.55s/it]

{'loss': 0.0412, 'grad_norm': 0.32264095544815063, 'learning_rate': 0.00014694434960737453, 'epoch': 6.32}


 63%|██████▎   | 1855/2930 [49:02<27:46,  1.55s/it]

{'loss': 0.0406, 'grad_norm': 0.31952884793281555, 'learning_rate': 0.00014680778422669855, 'epoch': 6.33}


 63%|██████▎   | 1856/2930 [49:03<27:53,  1.56s/it]

{'loss': 0.044, 'grad_norm': 0.3116503357887268, 'learning_rate': 0.00014667121884602254, 'epoch': 6.33}


 63%|██████▎   | 1857/2930 [49:05<27:47,  1.55s/it]

{'loss': 0.0457, 'grad_norm': 0.31419074535369873, 'learning_rate': 0.00014653465346534653, 'epoch': 6.33}


 63%|██████▎   | 1858/2930 [49:06<27:54,  1.56s/it]

{'loss': 0.0483, 'grad_norm': 0.3131401240825653, 'learning_rate': 0.00014639808808467056, 'epoch': 6.34}


 63%|██████▎   | 1859/2930 [49:08<27:53,  1.56s/it]

{'loss': 0.0406, 'grad_norm': 0.2854275107383728, 'learning_rate': 0.00014626152270399452, 'epoch': 6.34}


 63%|██████▎   | 1860/2930 [49:09<27:43,  1.56s/it]

{'loss': 0.04, 'grad_norm': 0.2815302908420563, 'learning_rate': 0.00014612495732331854, 'epoch': 6.34}


 64%|██████▎   | 1861/2930 [49:11<27:46,  1.56s/it]

{'loss': 0.0458, 'grad_norm': 0.3396608531475067, 'learning_rate': 0.00014598839194264257, 'epoch': 6.35}


 64%|██████▎   | 1862/2930 [49:12<27:46,  1.56s/it]

{'loss': 0.0419, 'grad_norm': 0.2924013137817383, 'learning_rate': 0.00014585182656196653, 'epoch': 6.35}


 64%|██████▎   | 1863/2930 [49:14<27:42,  1.56s/it]

{'loss': 0.0356, 'grad_norm': 0.27109894156455994, 'learning_rate': 0.00014571526118129055, 'epoch': 6.35}


 64%|██████▎   | 1864/2930 [49:16<27:42,  1.56s/it]

{'loss': 0.0414, 'grad_norm': 0.30489251017570496, 'learning_rate': 0.00014557869580061458, 'epoch': 6.36}


 64%|██████▎   | 1865/2930 [49:17<27:36,  1.56s/it]

{'loss': 0.0421, 'grad_norm': 0.2726000249385834, 'learning_rate': 0.00014544213041993854, 'epoch': 6.36}


 64%|██████▎   | 1866/2930 [49:19<27:32,  1.55s/it]

{'loss': 0.0419, 'grad_norm': 0.31250205636024475, 'learning_rate': 0.00014530556503926256, 'epoch': 6.36}


 64%|██████▎   | 1867/2930 [49:20<27:26,  1.55s/it]

{'loss': 0.0472, 'grad_norm': 0.3336576819419861, 'learning_rate': 0.00014516899965858656, 'epoch': 6.37}


 64%|██████▍   | 1868/2930 [49:22<27:27,  1.55s/it]

{'loss': 0.0468, 'grad_norm': 0.3374710977077484, 'learning_rate': 0.00014503243427791055, 'epoch': 6.37}


 64%|██████▍   | 1869/2930 [49:23<27:30,  1.56s/it]

{'loss': 0.0447, 'grad_norm': 0.32718947529792786, 'learning_rate': 0.00014489586889723457, 'epoch': 6.37}


 64%|██████▍   | 1870/2930 [49:25<27:36,  1.56s/it]

{'loss': 0.0434, 'grad_norm': 0.3031948506832123, 'learning_rate': 0.00014475930351655854, 'epoch': 6.38}


 64%|██████▍   | 1871/2930 [49:26<27:45,  1.57s/it]

{'loss': 0.0471, 'grad_norm': 0.3276292681694031, 'learning_rate': 0.00014462273813588256, 'epoch': 6.38}


 64%|██████▍   | 1872/2930 [49:28<27:45,  1.57s/it]

{'loss': 0.0467, 'grad_norm': 0.35202839970588684, 'learning_rate': 0.00014448617275520658, 'epoch': 6.38}


 64%|██████▍   | 1873/2930 [49:30<27:40,  1.57s/it]

{'loss': 0.0559, 'grad_norm': 0.3363291323184967, 'learning_rate': 0.00014434960737453055, 'epoch': 6.39}


 64%|██████▍   | 1874/2930 [49:31<27:42,  1.57s/it]

{'loss': 0.0346, 'grad_norm': 0.2581416368484497, 'learning_rate': 0.00014421304199385457, 'epoch': 6.39}


 64%|██████▍   | 1875/2930 [49:33<27:36,  1.57s/it]

{'loss': 0.0418, 'grad_norm': 0.2725876569747925, 'learning_rate': 0.00014407647661317856, 'epoch': 6.39}


 64%|██████▍   | 1876/2930 [49:34<27:30,  1.57s/it]

{'loss': 0.0446, 'grad_norm': 0.28246980905532837, 'learning_rate': 0.00014393991123250256, 'epoch': 6.4}


 64%|██████▍   | 1877/2930 [49:36<27:24,  1.56s/it]

{'loss': 0.0463, 'grad_norm': 0.29372459650039673, 'learning_rate': 0.00014380334585182658, 'epoch': 6.4}


 64%|██████▍   | 1878/2930 [49:37<27:25,  1.56s/it]

{'loss': 0.0604, 'grad_norm': 0.3902915418148041, 'learning_rate': 0.00014366678047115057, 'epoch': 6.4}


 64%|██████▍   | 1879/2930 [49:39<27:24,  1.57s/it]

{'loss': 0.0497, 'grad_norm': 0.2832409143447876, 'learning_rate': 0.00014353021509047457, 'epoch': 6.41}


 64%|██████▍   | 1880/2930 [49:41<27:20,  1.56s/it]

{'loss': 0.043, 'grad_norm': 0.27999401092529297, 'learning_rate': 0.0001433936497097986, 'epoch': 6.41}


 64%|██████▍   | 1881/2930 [49:42<27:15,  1.56s/it]

{'loss': 0.0421, 'grad_norm': 0.2886159420013428, 'learning_rate': 0.00014325708432912258, 'epoch': 6.41}


 64%|██████▍   | 1882/2930 [49:44<27:19,  1.56s/it]

{'loss': 0.0487, 'grad_norm': 0.3430490791797638, 'learning_rate': 0.00014312051894844658, 'epoch': 6.42}


 64%|██████▍   | 1883/2930 [49:45<27:28,  1.57s/it]

{'loss': 0.0464, 'grad_norm': 0.30334821343421936, 'learning_rate': 0.00014298395356777057, 'epoch': 6.42}


 64%|██████▍   | 1884/2930 [49:47<27:33,  1.58s/it]

{'loss': 0.0463, 'grad_norm': 0.2971780598163605, 'learning_rate': 0.00014284738818709457, 'epoch': 6.42}


 64%|██████▍   | 1885/2930 [49:48<27:26,  1.58s/it]

{'loss': 0.0447, 'grad_norm': 0.31187182664871216, 'learning_rate': 0.0001427108228064186, 'epoch': 6.43}


 64%|██████▍   | 1886/2930 [49:50<27:34,  1.58s/it]

{'loss': 0.0468, 'grad_norm': 0.3401166796684265, 'learning_rate': 0.00014257425742574258, 'epoch': 6.43}


 64%|██████▍   | 1887/2930 [49:52<27:33,  1.58s/it]

{'loss': 0.0469, 'grad_norm': 0.36595281958580017, 'learning_rate': 0.00014243769204506658, 'epoch': 6.43}


 64%|██████▍   | 1888/2930 [49:53<27:34,  1.59s/it]

{'loss': 0.0452, 'grad_norm': 0.3006207346916199, 'learning_rate': 0.0001423011266643906, 'epoch': 6.44}


 64%|██████▍   | 1889/2930 [49:55<27:24,  1.58s/it]

{'loss': 0.046, 'grad_norm': 0.33181479573249817, 'learning_rate': 0.0001421645612837146, 'epoch': 6.44}


 65%|██████▍   | 1890/2930 [49:56<27:20,  1.58s/it]

{'loss': 0.0461, 'grad_norm': 0.3417797386646271, 'learning_rate': 0.00014202799590303859, 'epoch': 6.45}


 65%|██████▍   | 1891/2930 [49:58<27:27,  1.59s/it]

{'loss': 0.0459, 'grad_norm': 0.3281199038028717, 'learning_rate': 0.00014189143052236258, 'epoch': 6.45}


 65%|██████▍   | 1892/2930 [50:00<27:19,  1.58s/it]

{'loss': 0.0488, 'grad_norm': 0.3684510886669159, 'learning_rate': 0.0001417548651416866, 'epoch': 6.45}


 65%|██████▍   | 1893/2930 [50:01<27:10,  1.57s/it]

{'loss': 0.0407, 'grad_norm': 0.32720980048179626, 'learning_rate': 0.0001416182997610106, 'epoch': 6.46}


 65%|██████▍   | 1894/2930 [50:03<27:08,  1.57s/it]

{'loss': 0.0545, 'grad_norm': 0.3416285514831543, 'learning_rate': 0.0001414817343803346, 'epoch': 6.46}


 65%|██████▍   | 1895/2930 [50:04<27:16,  1.58s/it]

{'loss': 0.0461, 'grad_norm': 0.320828914642334, 'learning_rate': 0.00014134516899965858, 'epoch': 6.46}


 65%|██████▍   | 1896/2930 [50:06<27:20,  1.59s/it]

{'loss': 0.0478, 'grad_norm': 0.3120727837085724, 'learning_rate': 0.0001412086036189826, 'epoch': 6.47}


 65%|██████▍   | 1897/2930 [50:07<27:22,  1.59s/it]

{'loss': 0.0499, 'grad_norm': 0.32273152470588684, 'learning_rate': 0.0001410720382383066, 'epoch': 6.47}


 65%|██████▍   | 1898/2930 [50:09<27:11,  1.58s/it]

{'loss': 0.0449, 'grad_norm': 0.3307949900627136, 'learning_rate': 0.0001409354728576306, 'epoch': 6.47}


 65%|██████▍   | 1899/2930 [50:11<27:15,  1.59s/it]

{'loss': 0.0534, 'grad_norm': 0.38578927516937256, 'learning_rate': 0.0001407989074769546, 'epoch': 6.48}


 65%|██████▍   | 1900/2930 [50:12<27:11,  1.58s/it]

{'loss': 0.0431, 'grad_norm': 0.27688753604888916, 'learning_rate': 0.0001406623420962786, 'epoch': 6.48}


 65%|██████▍   | 1901/2930 [50:14<27:08,  1.58s/it]

{'loss': 0.039, 'grad_norm': 0.2823440134525299, 'learning_rate': 0.0001405257767156026, 'epoch': 6.48}


 65%|██████▍   | 1902/2930 [50:15<27:06,  1.58s/it]

{'loss': 0.0476, 'grad_norm': 0.30463355779647827, 'learning_rate': 0.0001403892113349266, 'epoch': 6.49}


 65%|██████▍   | 1903/2930 [50:17<27:08,  1.59s/it]

{'loss': 0.0523, 'grad_norm': 0.33477267622947693, 'learning_rate': 0.00014025264595425062, 'epoch': 6.49}


 65%|██████▍   | 1904/2930 [50:19<27:03,  1.58s/it]

{'loss': 0.0443, 'grad_norm': 0.3157355785369873, 'learning_rate': 0.0001401160805735746, 'epoch': 6.49}


 65%|██████▌   | 1905/2930 [50:20<27:05,  1.59s/it]

{'loss': 0.0477, 'grad_norm': 0.3019934594631195, 'learning_rate': 0.0001399795151928986, 'epoch': 6.5}


 65%|██████▌   | 1906/2930 [50:22<27:08,  1.59s/it]

{'loss': 0.0486, 'grad_norm': 0.3186829388141632, 'learning_rate': 0.0001398429498122226, 'epoch': 6.5}


 65%|██████▌   | 1907/2930 [50:23<27:04,  1.59s/it]

{'loss': 0.0523, 'grad_norm': 0.3115313649177551, 'learning_rate': 0.0001397063844315466, 'epoch': 6.5}


 65%|██████▌   | 1908/2930 [50:25<26:51,  1.58s/it]

{'loss': 0.0467, 'grad_norm': 0.3228011429309845, 'learning_rate': 0.00013956981905087062, 'epoch': 6.51}


 65%|██████▌   | 1909/2930 [50:26<26:44,  1.57s/it]

{'loss': 0.0416, 'grad_norm': 0.3089749813079834, 'learning_rate': 0.0001394332536701946, 'epoch': 6.51}


 65%|██████▌   | 1910/2930 [50:28<26:41,  1.57s/it]

{'loss': 0.0413, 'grad_norm': 0.27466505765914917, 'learning_rate': 0.0001392966882895186, 'epoch': 6.51}


 65%|██████▌   | 1911/2930 [50:30<26:43,  1.57s/it]

{'loss': 0.0496, 'grad_norm': 0.3266758918762207, 'learning_rate': 0.00013916012290884262, 'epoch': 6.52}


 65%|██████▌   | 1912/2930 [50:31<26:34,  1.57s/it]

{'loss': 0.043, 'grad_norm': 0.3156699240207672, 'learning_rate': 0.00013902355752816662, 'epoch': 6.52}


 65%|██████▌   | 1913/2930 [50:33<26:31,  1.56s/it]

{'loss': 0.0434, 'grad_norm': 0.2913324534893036, 'learning_rate': 0.0001388869921474906, 'epoch': 6.52}


 65%|██████▌   | 1914/2930 [50:34<26:26,  1.56s/it]

{'loss': 0.0398, 'grad_norm': 0.29040929675102234, 'learning_rate': 0.00013875042676681463, 'epoch': 6.53}


 65%|██████▌   | 1915/2930 [50:36<26:33,  1.57s/it]

{'loss': 0.0479, 'grad_norm': 0.338202565908432, 'learning_rate': 0.0001386138613861386, 'epoch': 6.53}


 65%|██████▌   | 1916/2930 [50:37<26:41,  1.58s/it]

{'loss': 0.0466, 'grad_norm': 0.29437193274497986, 'learning_rate': 0.00013847729600546262, 'epoch': 6.53}


 65%|██████▌   | 1917/2930 [50:39<26:40,  1.58s/it]

{'loss': 0.0418, 'grad_norm': 0.2797069847583771, 'learning_rate': 0.00013834073062478662, 'epoch': 6.54}


 65%|██████▌   | 1918/2930 [50:41<26:42,  1.58s/it]

{'loss': 0.0446, 'grad_norm': 0.30982881784439087, 'learning_rate': 0.0001382041652441106, 'epoch': 6.54}


 65%|██████▌   | 1919/2930 [50:42<26:38,  1.58s/it]

{'loss': 0.0441, 'grad_norm': 0.28555068373680115, 'learning_rate': 0.00013806759986343463, 'epoch': 6.54}


 66%|██████▌   | 1920/2930 [50:44<26:44,  1.59s/it]

{'loss': 0.0487, 'grad_norm': 0.31579115986824036, 'learning_rate': 0.00013793103448275863, 'epoch': 6.55}


 66%|██████▌   | 1921/2930 [50:45<26:42,  1.59s/it]

{'loss': 0.0465, 'grad_norm': 0.27942895889282227, 'learning_rate': 0.00013779446910208262, 'epoch': 6.55}


 66%|██████▌   | 1922/2930 [50:47<26:39,  1.59s/it]

{'loss': 0.0523, 'grad_norm': 0.29388609528541565, 'learning_rate': 0.00013765790372140664, 'epoch': 6.55}


 66%|██████▌   | 1923/2930 [50:49<26:40,  1.59s/it]

{'loss': 0.0457, 'grad_norm': 0.30960527062416077, 'learning_rate': 0.0001375213383407306, 'epoch': 6.56}


 66%|██████▌   | 1924/2930 [50:50<26:23,  1.57s/it]

{'loss': 0.0428, 'grad_norm': 0.3199256658554077, 'learning_rate': 0.00013738477296005463, 'epoch': 6.56}


 66%|██████▌   | 1925/2930 [50:52<26:21,  1.57s/it]

{'loss': 0.041, 'grad_norm': 0.2813293933868408, 'learning_rate': 0.00013724820757937865, 'epoch': 6.56}


 66%|██████▌   | 1926/2930 [50:53<26:13,  1.57s/it]

{'loss': 0.0418, 'grad_norm': 0.26239320635795593, 'learning_rate': 0.00013711164219870262, 'epoch': 6.57}


 66%|██████▌   | 1927/2930 [50:55<26:05,  1.56s/it]

{'loss': 0.047, 'grad_norm': 0.35643282532691956, 'learning_rate': 0.00013697507681802664, 'epoch': 6.57}


 66%|██████▌   | 1928/2930 [50:56<26:06,  1.56s/it]

{'loss': 0.0603, 'grad_norm': 0.37874746322631836, 'learning_rate': 0.00013683851143735063, 'epoch': 6.57}


 66%|██████▌   | 1929/2930 [50:58<25:57,  1.56s/it]

{'loss': 0.0451, 'grad_norm': 0.33242666721343994, 'learning_rate': 0.00013670194605667463, 'epoch': 6.58}


 66%|██████▌   | 1930/2930 [50:59<25:57,  1.56s/it]

{'loss': 0.0362, 'grad_norm': 0.27843177318573, 'learning_rate': 0.00013656538067599865, 'epoch': 6.58}


 66%|██████▌   | 1931/2930 [51:01<26:03,  1.56s/it]

{'loss': 0.0421, 'grad_norm': 0.34340643882751465, 'learning_rate': 0.00013642881529532264, 'epoch': 6.58}


 66%|██████▌   | 1932/2930 [51:03<26:09,  1.57s/it]

{'loss': 0.0523, 'grad_norm': 0.35159555077552795, 'learning_rate': 0.00013629224991464664, 'epoch': 6.59}


 66%|██████▌   | 1933/2930 [51:04<26:12,  1.58s/it]

{'loss': 0.0478, 'grad_norm': 0.3277459740638733, 'learning_rate': 0.00013615568453397066, 'epoch': 6.59}


 66%|██████▌   | 1934/2930 [51:06<26:05,  1.57s/it]

{'loss': 0.0414, 'grad_norm': 0.32810407876968384, 'learning_rate': 0.00013601911915329463, 'epoch': 6.6}


 66%|██████▌   | 1935/2930 [51:07<26:05,  1.57s/it]

{'loss': 0.0471, 'grad_norm': 0.30364662408828735, 'learning_rate': 0.00013588255377261865, 'epoch': 6.6}


 66%|██████▌   | 1936/2930 [51:09<26:07,  1.58s/it]

{'loss': 0.0435, 'grad_norm': 0.29360532760620117, 'learning_rate': 0.00013574598839194267, 'epoch': 6.6}


 66%|██████▌   | 1937/2930 [51:10<25:57,  1.57s/it]

{'loss': 0.0512, 'grad_norm': 0.3416955769062042, 'learning_rate': 0.00013560942301126664, 'epoch': 6.61}


 66%|██████▌   | 1938/2930 [51:12<25:49,  1.56s/it]

{'loss': 0.0435, 'grad_norm': 0.28380757570266724, 'learning_rate': 0.00013547285763059066, 'epoch': 6.61}


 66%|██████▌   | 1939/2930 [51:14<25:41,  1.56s/it]

{'loss': 0.0429, 'grad_norm': 0.30630016326904297, 'learning_rate': 0.00013533629224991465, 'epoch': 6.61}


 66%|██████▌   | 1940/2930 [51:15<25:43,  1.56s/it]

{'loss': 0.0376, 'grad_norm': 0.2765847444534302, 'learning_rate': 0.00013519972686923864, 'epoch': 6.62}


 66%|██████▌   | 1941/2930 [51:17<25:45,  1.56s/it]

{'loss': 0.0517, 'grad_norm': 0.31570765376091003, 'learning_rate': 0.00013506316148856267, 'epoch': 6.62}


 66%|██████▋   | 1942/2930 [51:18<25:44,  1.56s/it]

{'loss': 0.0465, 'grad_norm': 0.3139350712299347, 'learning_rate': 0.00013492659610788666, 'epoch': 6.62}


 66%|██████▋   | 1943/2930 [51:20<25:43,  1.56s/it]

{'loss': 0.0414, 'grad_norm': 0.28108251094818115, 'learning_rate': 0.00013479003072721065, 'epoch': 6.63}


 66%|██████▋   | 1944/2930 [51:21<25:47,  1.57s/it]

{'loss': 0.0475, 'grad_norm': 0.3033149838447571, 'learning_rate': 0.00013465346534653468, 'epoch': 6.63}


 66%|██████▋   | 1945/2930 [51:23<25:45,  1.57s/it]

{'loss': 0.04, 'grad_norm': 0.31081804633140564, 'learning_rate': 0.00013451689996585864, 'epoch': 6.63}


 66%|██████▋   | 1946/2930 [51:25<25:37,  1.56s/it]

{'loss': 0.0463, 'grad_norm': 0.2982846200466156, 'learning_rate': 0.00013438033458518266, 'epoch': 6.64}


 66%|██████▋   | 1947/2930 [51:26<25:31,  1.56s/it]

{'loss': 0.0416, 'grad_norm': 0.3343614935874939, 'learning_rate': 0.00013424376920450666, 'epoch': 6.64}


 66%|██████▋   | 1948/2930 [51:28<25:36,  1.57s/it]

{'loss': 0.0393, 'grad_norm': 0.2714769244194031, 'learning_rate': 0.00013410720382383065, 'epoch': 6.64}


 67%|██████▋   | 1949/2930 [51:29<25:40,  1.57s/it]

{'loss': 0.0453, 'grad_norm': 0.29564034938812256, 'learning_rate': 0.00013397063844315467, 'epoch': 6.65}


 67%|██████▋   | 1950/2930 [51:31<25:37,  1.57s/it]

{'loss': 0.044, 'grad_norm': 0.32685279846191406, 'learning_rate': 0.00013383407306247867, 'epoch': 6.65}


 67%|██████▋   | 1951/2930 [51:32<25:30,  1.56s/it]

{'loss': 0.0415, 'grad_norm': 0.32142576575279236, 'learning_rate': 0.00013369750768180266, 'epoch': 6.65}


 67%|██████▋   | 1952/2930 [51:34<25:27,  1.56s/it]

{'loss': 0.0433, 'grad_norm': 0.3126049339771271, 'learning_rate': 0.00013356094230112668, 'epoch': 6.66}


 67%|██████▋   | 1953/2930 [51:35<25:28,  1.56s/it]

{'loss': 0.0391, 'grad_norm': 0.2828398644924164, 'learning_rate': 0.00013342437692045068, 'epoch': 6.66}


 67%|██████▋   | 1954/2930 [51:37<25:24,  1.56s/it]

{'loss': 0.0481, 'grad_norm': 0.3816729485988617, 'learning_rate': 0.00013328781153977467, 'epoch': 6.66}


 67%|██████▋   | 1955/2930 [51:39<25:24,  1.56s/it]

{'loss': 0.0371, 'grad_norm': 0.33048924803733826, 'learning_rate': 0.0001331512461590987, 'epoch': 6.67}


 67%|██████▋   | 1956/2930 [51:40<25:20,  1.56s/it]

{'loss': 0.0524, 'grad_norm': 0.34672126173973083, 'learning_rate': 0.0001330146807784227, 'epoch': 6.67}


 67%|██████▋   | 1957/2930 [51:42<25:23,  1.57s/it]

{'loss': 0.0423, 'grad_norm': 0.3128804862499237, 'learning_rate': 0.00013287811539774668, 'epoch': 6.67}


 67%|██████▋   | 1958/2930 [51:43<25:23,  1.57s/it]

{'loss': 0.044, 'grad_norm': 0.2774321436882019, 'learning_rate': 0.00013274155001707067, 'epoch': 6.68}


 67%|██████▋   | 1959/2930 [51:45<25:05,  1.55s/it]

{'loss': 0.0412, 'grad_norm': 0.3017050325870514, 'learning_rate': 0.00013260498463639467, 'epoch': 6.68}


 67%|██████▋   | 1960/2930 [51:46<25:09,  1.56s/it]

{'loss': 0.0445, 'grad_norm': 0.3110371530056, 'learning_rate': 0.0001324684192557187, 'epoch': 6.68}


 67%|██████▋   | 1961/2930 [51:48<25:11,  1.56s/it]

{'loss': 0.0454, 'grad_norm': 0.2891545593738556, 'learning_rate': 0.00013233185387504268, 'epoch': 6.69}


 67%|██████▋   | 1962/2930 [51:50<25:12,  1.56s/it]

{'loss': 0.0494, 'grad_norm': 0.28654244542121887, 'learning_rate': 0.00013219528849436668, 'epoch': 6.69}


 67%|██████▋   | 1963/2930 [51:51<25:10,  1.56s/it]

{'loss': 0.0475, 'grad_norm': 0.30981913208961487, 'learning_rate': 0.0001320587231136907, 'epoch': 6.69}


 67%|██████▋   | 1964/2930 [51:53<25:09,  1.56s/it]

{'loss': 0.0475, 'grad_norm': 0.2897730767726898, 'learning_rate': 0.0001319221577330147, 'epoch': 6.7}


 67%|██████▋   | 1965/2930 [51:54<25:04,  1.56s/it]

{'loss': 0.0483, 'grad_norm': 0.3189067244529724, 'learning_rate': 0.0001317855923523387, 'epoch': 6.7}


 67%|██████▋   | 1966/2930 [51:56<24:42,  1.54s/it]

{'loss': 0.0492, 'grad_norm': 0.34344053268432617, 'learning_rate': 0.00013164902697166268, 'epoch': 6.7}


 67%|██████▋   | 1967/2930 [51:57<24:42,  1.54s/it]

{'loss': 0.0435, 'grad_norm': 0.29607072472572327, 'learning_rate': 0.0001315124615909867, 'epoch': 6.71}


 67%|██████▋   | 1968/2930 [51:59<24:31,  1.53s/it]

{'loss': 0.0436, 'grad_norm': 0.274970680475235, 'learning_rate': 0.0001313758962103107, 'epoch': 6.71}


 67%|██████▋   | 1969/2930 [52:00<24:39,  1.54s/it]

{'loss': 0.0396, 'grad_norm': 0.2723548710346222, 'learning_rate': 0.0001312393308296347, 'epoch': 6.71}


 67%|██████▋   | 1970/2930 [52:02<24:46,  1.55s/it]

{'loss': 0.0498, 'grad_norm': 0.29681500792503357, 'learning_rate': 0.00013110276544895869, 'epoch': 6.72}


 67%|██████▋   | 1971/2930 [52:03<24:45,  1.55s/it]

{'loss': 0.0416, 'grad_norm': 0.3062795400619507, 'learning_rate': 0.0001309662000682827, 'epoch': 6.72}


 67%|██████▋   | 1972/2930 [52:05<24:41,  1.55s/it]

{'loss': 0.0427, 'grad_norm': 0.3122633397579193, 'learning_rate': 0.0001308296346876067, 'epoch': 6.72}


 67%|██████▋   | 1973/2930 [52:07<24:43,  1.55s/it]

{'loss': 0.0464, 'grad_norm': 0.3158898651599884, 'learning_rate': 0.0001306930693069307, 'epoch': 6.73}


 67%|██████▋   | 1974/2930 [52:08<24:43,  1.55s/it]

{'loss': 0.0414, 'grad_norm': 0.3021998703479767, 'learning_rate': 0.0001305565039262547, 'epoch': 6.73}


 67%|██████▋   | 1975/2930 [52:10<24:43,  1.55s/it]

{'loss': 0.0578, 'grad_norm': 0.37606167793273926, 'learning_rate': 0.0001304199385455787, 'epoch': 6.73}


 67%|██████▋   | 1976/2930 [52:11<24:44,  1.56s/it]

{'loss': 0.0436, 'grad_norm': 0.30156970024108887, 'learning_rate': 0.0001302833731649027, 'epoch': 6.74}


 67%|██████▋   | 1977/2930 [52:13<24:45,  1.56s/it]

{'loss': 0.0441, 'grad_norm': 0.34075334668159485, 'learning_rate': 0.0001301468077842267, 'epoch': 6.74}


 68%|██████▊   | 1978/2930 [52:14<24:44,  1.56s/it]

{'loss': 0.0382, 'grad_norm': 0.313290536403656, 'learning_rate': 0.00013001024240355072, 'epoch': 6.75}


 68%|██████▊   | 1979/2930 [52:16<24:44,  1.56s/it]

{'loss': 0.0475, 'grad_norm': 0.31989267468452454, 'learning_rate': 0.00012987367702287471, 'epoch': 6.75}


 68%|██████▊   | 1980/2930 [52:17<24:47,  1.57s/it]

{'loss': 0.0496, 'grad_norm': 0.3555223345756531, 'learning_rate': 0.0001297371116421987, 'epoch': 6.75}


 68%|██████▊   | 1981/2930 [52:19<24:44,  1.56s/it]

{'loss': 0.0427, 'grad_norm': 0.29077592492103577, 'learning_rate': 0.0001296005462615227, 'epoch': 6.76}


 68%|██████▊   | 1982/2930 [52:21<24:46,  1.57s/it]

{'loss': 0.0432, 'grad_norm': 0.2783481776714325, 'learning_rate': 0.0001294639808808467, 'epoch': 6.76}


 68%|██████▊   | 1983/2930 [52:22<24:46,  1.57s/it]

{'loss': 0.0464, 'grad_norm': 0.33302274346351624, 'learning_rate': 0.00012932741550017072, 'epoch': 6.76}


 68%|██████▊   | 1984/2930 [52:24<24:43,  1.57s/it]

{'loss': 0.0473, 'grad_norm': 0.2830049395561218, 'learning_rate': 0.0001291908501194947, 'epoch': 6.77}


 68%|██████▊   | 1985/2930 [52:25<24:36,  1.56s/it]

{'loss': 0.0546, 'grad_norm': 0.3406882584095001, 'learning_rate': 0.0001290542847388187, 'epoch': 6.77}


 68%|██████▊   | 1986/2930 [52:27<24:28,  1.56s/it]

{'loss': 0.0408, 'grad_norm': 0.27920404076576233, 'learning_rate': 0.00012891771935814273, 'epoch': 6.77}


 68%|██████▊   | 1987/2930 [52:28<24:22,  1.55s/it]

{'loss': 0.0409, 'grad_norm': 0.24956835806369781, 'learning_rate': 0.00012878115397746672, 'epoch': 6.78}


 68%|██████▊   | 1988/2930 [52:30<24:21,  1.55s/it]

{'loss': 0.0413, 'grad_norm': 0.29371583461761475, 'learning_rate': 0.00012864458859679072, 'epoch': 6.78}


 68%|██████▊   | 1989/2930 [52:31<24:28,  1.56s/it]

{'loss': 0.046, 'grad_norm': 0.3094078004360199, 'learning_rate': 0.00012850802321611474, 'epoch': 6.78}


 68%|██████▊   | 1990/2930 [52:33<24:31,  1.57s/it]

{'loss': 0.0442, 'grad_norm': 0.3709909915924072, 'learning_rate': 0.0001283714578354387, 'epoch': 6.79}


 68%|██████▊   | 1991/2930 [52:35<24:23,  1.56s/it]

{'loss': 0.042, 'grad_norm': 0.27748700976371765, 'learning_rate': 0.00012823489245476273, 'epoch': 6.79}


 68%|██████▊   | 1992/2930 [52:36<24:18,  1.55s/it]

{'loss': 0.0481, 'grad_norm': 0.3055112063884735, 'learning_rate': 0.00012809832707408672, 'epoch': 6.79}


 68%|██████▊   | 1993/2930 [52:38<24:15,  1.55s/it]

{'loss': 0.0515, 'grad_norm': 0.3238344192504883, 'learning_rate': 0.00012796176169341071, 'epoch': 6.8}


 68%|██████▊   | 1994/2930 [52:39<24:17,  1.56s/it]

{'loss': 0.0533, 'grad_norm': 0.32570651173591614, 'learning_rate': 0.00012782519631273473, 'epoch': 6.8}


 68%|██████▊   | 1995/2930 [52:41<24:19,  1.56s/it]

{'loss': 0.0575, 'grad_norm': 0.35145434737205505, 'learning_rate': 0.00012768863093205873, 'epoch': 6.8}


 68%|██████▊   | 1996/2930 [52:42<24:20,  1.56s/it]

{'loss': 0.0484, 'grad_norm': 0.3259075880050659, 'learning_rate': 0.00012755206555138272, 'epoch': 6.81}


 68%|██████▊   | 1997/2930 [52:44<24:21,  1.57s/it]

{'loss': 0.0466, 'grad_norm': 0.27172595262527466, 'learning_rate': 0.00012741550017070674, 'epoch': 6.81}


 68%|██████▊   | 1998/2930 [52:46<24:17,  1.56s/it]

{'loss': 0.0446, 'grad_norm': 0.31529501080513, 'learning_rate': 0.0001272789347900307, 'epoch': 6.81}


 68%|██████▊   | 1999/2930 [52:47<24:08,  1.56s/it]

{'loss': 0.0452, 'grad_norm': 0.2755546569824219, 'learning_rate': 0.00012714236940935473, 'epoch': 6.82}


 68%|██████▊   | 2000/2930 [52:49<24:14,  1.56s/it]

{'loss': 0.0486, 'grad_norm': 0.3393344581127167, 'learning_rate': 0.00012700580402867875, 'epoch': 6.82}


[34m[1mwandb[0m: Adding directory to artifact (./outputs/checkpoint-2000)... Done. 0.1s
 68%|██████▊   | 2001/2930 [52:51<27:47,  1.79s/it]

{'loss': 0.0519, 'grad_norm': 0.3147582709789276, 'learning_rate': 0.00012686923864800272, 'epoch': 6.82}


 68%|██████▊   | 2002/2930 [52:53<26:43,  1.73s/it]

{'loss': 0.0385, 'grad_norm': 0.2579488754272461, 'learning_rate': 0.00012673267326732674, 'epoch': 6.83}


 68%|██████▊   | 2003/2930 [52:54<26:04,  1.69s/it]

{'loss': 0.0441, 'grad_norm': 0.3157392144203186, 'learning_rate': 0.00012659610788665076, 'epoch': 6.83}


 68%|██████▊   | 2004/2930 [52:56<25:33,  1.66s/it]

{'loss': 0.0456, 'grad_norm': 0.31572192907333374, 'learning_rate': 0.00012645954250597473, 'epoch': 6.83}


 68%|██████▊   | 2005/2930 [52:57<25:13,  1.64s/it]

{'loss': 0.0474, 'grad_norm': 0.31368228793144226, 'learning_rate': 0.00012632297712529875, 'epoch': 6.84}


 68%|██████▊   | 2006/2930 [52:59<24:52,  1.62s/it]

{'loss': 0.0419, 'grad_norm': 0.29697373509407043, 'learning_rate': 0.00012618641174462275, 'epoch': 6.84}


 68%|██████▊   | 2007/2930 [53:01<24:48,  1.61s/it]

{'loss': 0.0392, 'grad_norm': 0.3329012393951416, 'learning_rate': 0.00012604984636394674, 'epoch': 6.84}


 69%|██████▊   | 2008/2930 [53:02<24:26,  1.59s/it]

{'loss': 0.0408, 'grad_norm': 0.28364405035972595, 'learning_rate': 0.00012591328098327076, 'epoch': 6.85}


 69%|██████▊   | 2009/2930 [53:04<24:22,  1.59s/it]

{'loss': 0.0449, 'grad_norm': 0.26672014594078064, 'learning_rate': 0.00012577671560259473, 'epoch': 6.85}


 69%|██████▊   | 2010/2930 [53:05<24:23,  1.59s/it]

{'loss': 0.0451, 'grad_norm': 0.2730829417705536, 'learning_rate': 0.00012564015022191875, 'epoch': 6.85}


 69%|██████▊   | 2011/2930 [53:07<24:20,  1.59s/it]

{'loss': 0.0438, 'grad_norm': 0.2977665662765503, 'learning_rate': 0.00012550358484124277, 'epoch': 6.86}


 69%|██████▊   | 2012/2930 [53:08<24:19,  1.59s/it]

{'loss': 0.0422, 'grad_norm': 0.315403550863266, 'learning_rate': 0.00012536701946056674, 'epoch': 6.86}


 69%|██████▊   | 2013/2930 [53:10<24:10,  1.58s/it]

{'loss': 0.0405, 'grad_norm': 0.2802335023880005, 'learning_rate': 0.00012523045407989076, 'epoch': 6.86}


 69%|██████▊   | 2014/2930 [53:12<24:13,  1.59s/it]

{'loss': 0.05, 'grad_norm': 0.3181862533092499, 'learning_rate': 0.00012509388869921475, 'epoch': 6.87}


 69%|██████▉   | 2015/2930 [53:13<24:03,  1.58s/it]

{'loss': 0.0405, 'grad_norm': 0.2697003483772278, 'learning_rate': 0.00012495732331853875, 'epoch': 6.87}


 69%|██████▉   | 2016/2930 [53:15<24:07,  1.58s/it]

{'loss': 0.0467, 'grad_norm': 0.3105449378490448, 'learning_rate': 0.00012482075793786277, 'epoch': 6.87}


 69%|██████▉   | 2017/2930 [53:16<23:55,  1.57s/it]

{'loss': 0.0407, 'grad_norm': 0.27371761202812195, 'learning_rate': 0.00012468419255718676, 'epoch': 6.88}


 69%|██████▉   | 2018/2930 [53:18<24:01,  1.58s/it]

{'loss': 0.0478, 'grad_norm': 0.33163732290267944, 'learning_rate': 0.00012454762717651076, 'epoch': 6.88}


 69%|██████▉   | 2019/2930 [53:19<23:54,  1.57s/it]

{'loss': 0.042, 'grad_norm': 0.27791088819503784, 'learning_rate': 0.00012441106179583478, 'epoch': 6.88}


 69%|██████▉   | 2020/2930 [53:21<23:58,  1.58s/it]

{'loss': 0.0481, 'grad_norm': 0.32736560702323914, 'learning_rate': 0.00012427449641515875, 'epoch': 6.89}


 69%|██████▉   | 2021/2930 [53:23<23:56,  1.58s/it]

{'loss': 0.0466, 'grad_norm': 0.3054434657096863, 'learning_rate': 0.00012413793103448277, 'epoch': 6.89}


 69%|██████▉   | 2022/2930 [53:24<24:01,  1.59s/it]

{'loss': 0.0423, 'grad_norm': 0.2953626811504364, 'learning_rate': 0.00012400136565380676, 'epoch': 6.9}


 69%|██████▉   | 2023/2930 [53:26<23:52,  1.58s/it]

{'loss': 0.0342, 'grad_norm': 0.2802223265171051, 'learning_rate': 0.00012386480027313075, 'epoch': 6.9}


 69%|██████▉   | 2024/2930 [53:27<23:48,  1.58s/it]

{'loss': 0.0376, 'grad_norm': 0.24872449040412903, 'learning_rate': 0.00012372823489245478, 'epoch': 6.9}


 69%|██████▉   | 2025/2930 [53:29<23:54,  1.59s/it]

{'loss': 0.044, 'grad_norm': 0.30765852332115173, 'learning_rate': 0.00012359166951177877, 'epoch': 6.91}


 69%|██████▉   | 2026/2930 [53:31<23:58,  1.59s/it]

{'loss': 0.0424, 'grad_norm': 0.2969575822353363, 'learning_rate': 0.00012345510413110276, 'epoch': 6.91}


 69%|██████▉   | 2027/2930 [53:32<23:56,  1.59s/it]

{'loss': 0.0383, 'grad_norm': 0.2728998064994812, 'learning_rate': 0.00012331853875042679, 'epoch': 6.91}


 69%|██████▉   | 2028/2930 [53:34<23:55,  1.59s/it]

{'loss': 0.0411, 'grad_norm': 0.3012443780899048, 'learning_rate': 0.00012318197336975078, 'epoch': 6.92}


 69%|██████▉   | 2029/2930 [53:35<23:53,  1.59s/it]

{'loss': 0.0434, 'grad_norm': 0.3203495442867279, 'learning_rate': 0.00012304540798907477, 'epoch': 6.92}


 69%|██████▉   | 2030/2930 [53:37<23:54,  1.59s/it]

{'loss': 0.038, 'grad_norm': 0.281471312046051, 'learning_rate': 0.00012290884260839877, 'epoch': 6.92}


 69%|██████▉   | 2031/2930 [53:39<23:55,  1.60s/it]

{'loss': 0.0453, 'grad_norm': 0.30092042684555054, 'learning_rate': 0.0001227722772277228, 'epoch': 6.93}


 69%|██████▉   | 2032/2930 [53:40<23:56,  1.60s/it]

{'loss': 0.0481, 'grad_norm': 0.30780792236328125, 'learning_rate': 0.00012263571184704678, 'epoch': 6.93}


 69%|██████▉   | 2033/2930 [53:42<23:54,  1.60s/it]

{'loss': 0.0399, 'grad_norm': 0.27961793541908264, 'learning_rate': 0.00012249914646637078, 'epoch': 6.93}


 69%|██████▉   | 2034/2930 [53:43<23:51,  1.60s/it]

{'loss': 0.0384, 'grad_norm': 0.2879559397697449, 'learning_rate': 0.00012236258108569477, 'epoch': 6.94}


 69%|██████▉   | 2035/2930 [53:45<23:35,  1.58s/it]

{'loss': 0.061, 'grad_norm': 0.4134892225265503, 'learning_rate': 0.0001222260157050188, 'epoch': 6.94}


 69%|██████▉   | 2036/2930 [53:46<23:33,  1.58s/it]

{'loss': 0.0456, 'grad_norm': 0.31226903200149536, 'learning_rate': 0.0001220894503243428, 'epoch': 6.94}


 70%|██████▉   | 2037/2930 [53:48<23:37,  1.59s/it]

{'loss': 0.0423, 'grad_norm': 0.30519071221351624, 'learning_rate': 0.0001219528849436668, 'epoch': 6.95}


 70%|██████▉   | 2038/2930 [53:50<23:38,  1.59s/it]

{'loss': 0.0387, 'grad_norm': 0.2605845332145691, 'learning_rate': 0.00012181631956299078, 'epoch': 6.95}


 70%|██████▉   | 2039/2930 [53:51<23:37,  1.59s/it]

{'loss': 0.037, 'grad_norm': 0.28143012523651123, 'learning_rate': 0.00012167975418231478, 'epoch': 6.95}


 70%|██████▉   | 2040/2930 [53:53<23:37,  1.59s/it]

{'loss': 0.0452, 'grad_norm': 0.3181428909301758, 'learning_rate': 0.00012154318880163879, 'epoch': 6.96}


 70%|██████▉   | 2041/2930 [53:54<23:30,  1.59s/it]

{'loss': 0.0365, 'grad_norm': 0.28235098719596863, 'learning_rate': 0.00012140662342096278, 'epoch': 6.96}


 70%|██████▉   | 2042/2930 [53:56<23:32,  1.59s/it]

{'loss': 0.0404, 'grad_norm': 0.28715214133262634, 'learning_rate': 0.00012127005804028679, 'epoch': 6.96}


 70%|██████▉   | 2043/2930 [53:58<23:30,  1.59s/it]

{'loss': 0.0421, 'grad_norm': 0.27737167477607727, 'learning_rate': 0.0001211334926596108, 'epoch': 6.97}


 70%|██████▉   | 2044/2930 [53:59<23:25,  1.59s/it]

{'loss': 0.0423, 'grad_norm': 0.2864264249801636, 'learning_rate': 0.0001209969272789348, 'epoch': 6.97}


 70%|██████▉   | 2045/2930 [54:01<23:15,  1.58s/it]

{'loss': 0.0479, 'grad_norm': 0.3193000853061676, 'learning_rate': 0.0001208603618982588, 'epoch': 6.97}


 70%|██████▉   | 2046/2930 [54:02<23:08,  1.57s/it]

{'loss': 0.0468, 'grad_norm': 0.3008442223072052, 'learning_rate': 0.00012072379651758278, 'epoch': 6.98}


 70%|██████▉   | 2047/2930 [54:04<23:07,  1.57s/it]

{'loss': 0.0447, 'grad_norm': 0.2834815979003906, 'learning_rate': 0.0001205872311369068, 'epoch': 6.98}


 70%|██████▉   | 2048/2930 [54:05<23:09,  1.57s/it]

{'loss': 0.0469, 'grad_norm': 0.30975067615509033, 'learning_rate': 0.00012045066575623081, 'epoch': 6.98}


 70%|██████▉   | 2049/2930 [54:07<23:11,  1.58s/it]

{'loss': 0.0404, 'grad_norm': 0.2672470211982727, 'learning_rate': 0.00012031410037555479, 'epoch': 6.99}


 70%|██████▉   | 2050/2930 [54:09<23:02,  1.57s/it]

{'loss': 0.0482, 'grad_norm': 0.3026576340198517, 'learning_rate': 0.0001201775349948788, 'epoch': 6.99}


 70%|███████   | 2051/2930 [54:10<22:53,  1.56s/it]

{'loss': 0.0365, 'grad_norm': 0.2676089107990265, 'learning_rate': 0.00012004096961420281, 'epoch': 6.99}


 70%|███████   | 2052/2930 [54:12<22:58,  1.57s/it]

{'loss': 0.0493, 'grad_norm': 0.31164389848709106, 'learning_rate': 0.0001199044042335268, 'epoch': 7.0}


 70%|███████   | 2053/2930 [54:13<23:00,  1.57s/it]

{'loss': 0.0392, 'grad_norm': 0.27796879410743713, 'learning_rate': 0.00011976783885285081, 'epoch': 7.0}


 70%|███████   | 2054/2930 [54:15<22:59,  1.58s/it]

{'loss': 0.0311, 'grad_norm': 0.25891169905662537, 'learning_rate': 0.0001196312734721748, 'epoch': 7.0}


 70%|███████   | 2055/2930 [54:16<22:58,  1.58s/it]

{'loss': 0.028, 'grad_norm': 0.2369595468044281, 'learning_rate': 0.00011949470809149881, 'epoch': 7.01}


 70%|███████   | 2056/2930 [54:18<22:56,  1.57s/it]

{'loss': 0.0263, 'grad_norm': 0.17286799848079681, 'learning_rate': 0.00011935814271082282, 'epoch': 7.01}


 70%|███████   | 2057/2930 [54:20<22:54,  1.57s/it]

{'loss': 0.0305, 'grad_norm': 0.25224488973617554, 'learning_rate': 0.0001192215773301468, 'epoch': 7.01}


 70%|███████   | 2058/2930 [54:21<22:44,  1.57s/it]

{'loss': 0.0262, 'grad_norm': 0.22588315606117249, 'learning_rate': 0.00011908501194947082, 'epoch': 7.02}


 70%|███████   | 2059/2930 [54:23<22:38,  1.56s/it]

{'loss': 0.029, 'grad_norm': 0.27261868119239807, 'learning_rate': 0.00011894844656879483, 'epoch': 7.02}


 70%|███████   | 2060/2930 [54:24<22:38,  1.56s/it]

{'loss': 0.0289, 'grad_norm': 1.6267364025115967, 'learning_rate': 0.00011881188118811881, 'epoch': 7.02}


 70%|███████   | 2061/2930 [54:26<22:35,  1.56s/it]

{'loss': 0.0229, 'grad_norm': 0.20950467884540558, 'learning_rate': 0.00011867531580744282, 'epoch': 7.03}


 70%|███████   | 2062/2930 [54:27<22:36,  1.56s/it]

{'loss': 0.0281, 'grad_norm': 0.2649810016155243, 'learning_rate': 0.00011853875042676681, 'epoch': 7.03}


 70%|███████   | 2063/2930 [54:29<22:31,  1.56s/it]

{'loss': 0.0268, 'grad_norm': 0.24487410485744476, 'learning_rate': 0.00011840218504609082, 'epoch': 7.03}


 70%|███████   | 2064/2930 [54:30<22:31,  1.56s/it]

{'loss': 0.0253, 'grad_norm': 0.23627310991287231, 'learning_rate': 0.00011826561966541483, 'epoch': 7.04}


 70%|███████   | 2065/2930 [54:32<22:25,  1.56s/it]

{'loss': 0.027, 'grad_norm': 0.3119427263736725, 'learning_rate': 0.00011812905428473882, 'epoch': 7.04}


 71%|███████   | 2066/2930 [54:34<22:24,  1.56s/it]

{'loss': 0.0301, 'grad_norm': 0.2767426371574402, 'learning_rate': 0.00011799248890406283, 'epoch': 7.05}


 71%|███████   | 2067/2930 [54:35<22:23,  1.56s/it]

{'loss': 0.0321, 'grad_norm': 0.29966267943382263, 'learning_rate': 0.00011785592352338684, 'epoch': 7.05}


 71%|███████   | 2068/2930 [54:37<22:19,  1.55s/it]

{'loss': 0.028, 'grad_norm': 0.29986250400543213, 'learning_rate': 0.00011771935814271082, 'epoch': 7.05}


 71%|███████   | 2069/2930 [54:38<22:08,  1.54s/it]

{'loss': 0.0282, 'grad_norm': 0.2591519355773926, 'learning_rate': 0.00011758279276203484, 'epoch': 7.06}


 71%|███████   | 2070/2930 [54:40<22:08,  1.54s/it]

{'loss': 0.0313, 'grad_norm': 0.2910647988319397, 'learning_rate': 0.00011744622738135885, 'epoch': 7.06}


 71%|███████   | 2071/2930 [54:41<22:10,  1.55s/it]

{'loss': 0.0291, 'grad_norm': 0.2734425961971283, 'learning_rate': 0.00011730966200068283, 'epoch': 7.06}


 71%|███████   | 2072/2930 [54:43<22:02,  1.54s/it]

{'loss': 0.0248, 'grad_norm': 0.24467864632606506, 'learning_rate': 0.00011717309662000683, 'epoch': 7.07}


 71%|███████   | 2073/2930 [54:44<22:07,  1.55s/it]

{'loss': 0.0277, 'grad_norm': 0.2640608251094818, 'learning_rate': 0.00011703653123933083, 'epoch': 7.07}


 71%|███████   | 2074/2930 [54:46<22:11,  1.56s/it]

{'loss': 0.0302, 'grad_norm': 0.27841368317604065, 'learning_rate': 0.00011689996585865484, 'epoch': 7.07}


 71%|███████   | 2075/2930 [54:48<22:07,  1.55s/it]

{'loss': 0.0313, 'grad_norm': 0.29730919003486633, 'learning_rate': 0.00011676340047797884, 'epoch': 7.08}


 71%|███████   | 2076/2930 [54:49<22:11,  1.56s/it]

{'loss': 0.0292, 'grad_norm': 0.32054394483566284, 'learning_rate': 0.00011662683509730284, 'epoch': 7.08}


 71%|███████   | 2077/2930 [54:51<22:10,  1.56s/it]

{'loss': 0.0279, 'grad_norm': 0.299113929271698, 'learning_rate': 0.00011649026971662684, 'epoch': 7.08}


 71%|███████   | 2078/2930 [54:52<22:09,  1.56s/it]

{'loss': 0.0322, 'grad_norm': 0.2910673916339874, 'learning_rate': 0.00011635370433595085, 'epoch': 7.09}


 71%|███████   | 2079/2930 [54:54<22:04,  1.56s/it]

{'loss': 0.0279, 'grad_norm': 0.25042030215263367, 'learning_rate': 0.00011621713895527485, 'epoch': 7.09}


 71%|███████   | 2080/2930 [54:55<22:06,  1.56s/it]

{'loss': 0.031, 'grad_norm': 0.3030356466770172, 'learning_rate': 0.00011608057357459885, 'epoch': 7.09}


 71%|███████   | 2081/2930 [54:57<22:08,  1.56s/it]

{'loss': 0.0313, 'grad_norm': 0.2634522020816803, 'learning_rate': 0.00011594400819392284, 'epoch': 7.1}


 71%|███████   | 2082/2930 [54:58<22:02,  1.56s/it]

{'loss': 0.0288, 'grad_norm': 0.28348344564437866, 'learning_rate': 0.00011580744281324684, 'epoch': 7.1}


 71%|███████   | 2083/2930 [55:00<22:04,  1.56s/it]

{'loss': 0.0306, 'grad_norm': 0.25302854180336, 'learning_rate': 0.00011567087743257085, 'epoch': 7.1}


 71%|███████   | 2084/2930 [55:02<22:01,  1.56s/it]

{'loss': 0.0296, 'grad_norm': 0.2577928602695465, 'learning_rate': 0.00011553431205189484, 'epoch': 7.11}


 71%|███████   | 2085/2930 [55:03<21:53,  1.55s/it]

{'loss': 0.0232, 'grad_norm': 0.20357725024223328, 'learning_rate': 0.00011539774667121885, 'epoch': 7.11}


 71%|███████   | 2086/2930 [55:05<21:54,  1.56s/it]

{'loss': 0.0296, 'grad_norm': 0.2924128770828247, 'learning_rate': 0.00011526118129054286, 'epoch': 7.11}


 71%|███████   | 2087/2930 [55:06<21:50,  1.55s/it]

{'loss': 0.0283, 'grad_norm': 0.24809326231479645, 'learning_rate': 0.00011512461590986685, 'epoch': 7.12}


 71%|███████▏  | 2088/2930 [55:08<21:41,  1.55s/it]

{'loss': 0.0239, 'grad_norm': 0.2297573983669281, 'learning_rate': 0.00011498805052919086, 'epoch': 7.12}


 71%|███████▏  | 2089/2930 [55:09<21:46,  1.55s/it]

{'loss': 0.0287, 'grad_norm': 0.26537495851516724, 'learning_rate': 0.00011485148514851484, 'epoch': 7.12}


 71%|███████▏  | 2090/2930 [55:11<21:45,  1.55s/it]

{'loss': 0.0313, 'grad_norm': 0.27188223600387573, 'learning_rate': 0.00011471491976783886, 'epoch': 7.13}


 71%|███████▏  | 2091/2930 [55:12<21:48,  1.56s/it]

{'loss': 0.0267, 'grad_norm': 0.2636672258377075, 'learning_rate': 0.00011457835438716287, 'epoch': 7.13}


 71%|███████▏  | 2092/2930 [55:14<21:49,  1.56s/it]

{'loss': 0.0275, 'grad_norm': 0.2483128160238266, 'learning_rate': 0.00011444178900648685, 'epoch': 7.13}


 71%|███████▏  | 2093/2930 [55:16<21:48,  1.56s/it]

{'loss': 0.0325, 'grad_norm': 0.29467904567718506, 'learning_rate': 0.00011430522362581086, 'epoch': 7.14}


 71%|███████▏  | 2094/2930 [55:17<21:48,  1.56s/it]

{'loss': 0.0244, 'grad_norm': 0.237762913107872, 'learning_rate': 0.00011416865824513488, 'epoch': 7.14}


 72%|███████▏  | 2095/2930 [55:19<21:52,  1.57s/it]

{'loss': 0.0293, 'grad_norm': 0.289798378944397, 'learning_rate': 0.00011403209286445886, 'epoch': 7.14}


 72%|███████▏  | 2096/2930 [55:20<21:52,  1.57s/it]

{'loss': 0.0359, 'grad_norm': 0.3034299314022064, 'learning_rate': 0.00011389552748378287, 'epoch': 7.15}


 72%|███████▏  | 2097/2930 [55:22<21:41,  1.56s/it]

{'loss': 0.0279, 'grad_norm': 0.26190489530563354, 'learning_rate': 0.00011375896210310686, 'epoch': 7.15}


 72%|███████▏  | 2098/2930 [55:23<21:34,  1.56s/it]

{'loss': 0.0289, 'grad_norm': 0.2797812521457672, 'learning_rate': 0.00011362239672243087, 'epoch': 7.15}


 72%|███████▏  | 2099/2930 [55:25<21:33,  1.56s/it]

{'loss': 0.0274, 'grad_norm': 0.2962021231651306, 'learning_rate': 0.00011348583134175488, 'epoch': 7.16}


 72%|███████▏  | 2100/2930 [55:27<21:32,  1.56s/it]

{'loss': 0.0306, 'grad_norm': 0.299813836812973, 'learning_rate': 0.00011334926596107886, 'epoch': 7.16}


 72%|███████▏  | 2101/2930 [55:28<21:34,  1.56s/it]

{'loss': 0.0239, 'grad_norm': 0.24269084632396698, 'learning_rate': 0.00011321270058040288, 'epoch': 7.16}


 72%|███████▏  | 2102/2930 [55:30<21:28,  1.56s/it]

{'loss': 0.0294, 'grad_norm': 0.2596096396446228, 'learning_rate': 0.00011307613519972689, 'epoch': 7.17}


 72%|███████▏  | 2103/2930 [55:31<21:31,  1.56s/it]

{'loss': 0.039, 'grad_norm': 0.34400713443756104, 'learning_rate': 0.00011293956981905087, 'epoch': 7.17}


 72%|███████▏  | 2104/2930 [55:33<21:33,  1.57s/it]

{'loss': 0.0254, 'grad_norm': 0.2589457035064697, 'learning_rate': 0.00011280300443837488, 'epoch': 7.17}


 72%|███████▏  | 2105/2930 [55:34<21:31,  1.57s/it]

{'loss': 0.0294, 'grad_norm': 0.2769988477230072, 'learning_rate': 0.00011266643905769887, 'epoch': 7.18}


 72%|███████▏  | 2106/2930 [55:36<21:31,  1.57s/it]

{'loss': 0.0282, 'grad_norm': 0.2314051389694214, 'learning_rate': 0.00011252987367702288, 'epoch': 7.18}


 72%|███████▏  | 2107/2930 [55:37<21:27,  1.56s/it]

{'loss': 0.0273, 'grad_norm': 0.2959965169429779, 'learning_rate': 0.00011239330829634689, 'epoch': 7.18}


 72%|███████▏  | 2108/2930 [55:39<21:29,  1.57s/it]

{'loss': 0.0293, 'grad_norm': 0.23903293907642365, 'learning_rate': 0.00011225674291567088, 'epoch': 7.19}


 72%|███████▏  | 2109/2930 [55:41<21:29,  1.57s/it]

{'loss': 0.0381, 'grad_norm': 0.29102468490600586, 'learning_rate': 0.00011212017753499489, 'epoch': 7.19}


 72%|███████▏  | 2110/2930 [55:42<21:22,  1.56s/it]

{'loss': 0.0271, 'grad_norm': 0.2590663433074951, 'learning_rate': 0.0001119836121543189, 'epoch': 7.2}


 72%|███████▏  | 2111/2930 [55:44<21:20,  1.56s/it]

{'loss': 0.0294, 'grad_norm': 0.2721909284591675, 'learning_rate': 0.00011184704677364288, 'epoch': 7.2}


 72%|███████▏  | 2112/2930 [55:45<21:22,  1.57s/it]

{'loss': 0.0259, 'grad_norm': 0.26143142580986023, 'learning_rate': 0.0001117104813929669, 'epoch': 7.2}


 72%|███████▏  | 2113/2930 [55:47<21:22,  1.57s/it]

{'loss': 0.0302, 'grad_norm': 0.2564115822315216, 'learning_rate': 0.00011157391601229088, 'epoch': 7.21}


 72%|███████▏  | 2114/2930 [55:48<21:19,  1.57s/it]

{'loss': 0.0251, 'grad_norm': 0.23191586136817932, 'learning_rate': 0.00011143735063161489, 'epoch': 7.21}


 72%|███████▏  | 2115/2930 [55:50<21:19,  1.57s/it]

{'loss': 0.0261, 'grad_norm': 0.2415716052055359, 'learning_rate': 0.0001113007852509389, 'epoch': 7.21}


 72%|███████▏  | 2116/2930 [55:52<21:12,  1.56s/it]

{'loss': 0.0291, 'grad_norm': 0.26231813430786133, 'learning_rate': 0.00011116421987026289, 'epoch': 7.22}


 72%|███████▏  | 2117/2930 [55:53<21:02,  1.55s/it]

{'loss': 0.0239, 'grad_norm': 0.26164284348487854, 'learning_rate': 0.0001110276544895869, 'epoch': 7.22}


 72%|███████▏  | 2118/2930 [55:55<21:07,  1.56s/it]

{'loss': 0.027, 'grad_norm': 0.2940569818019867, 'learning_rate': 0.0001108910891089109, 'epoch': 7.22}


 72%|███████▏  | 2119/2930 [55:56<21:03,  1.56s/it]

{'loss': 0.0297, 'grad_norm': 0.28192004561424255, 'learning_rate': 0.0001107545237282349, 'epoch': 7.23}


 72%|███████▏  | 2120/2930 [55:58<20:53,  1.55s/it]

{'loss': 0.0247, 'grad_norm': 0.22131513059139252, 'learning_rate': 0.0001106179583475589, 'epoch': 7.23}


 72%|███████▏  | 2121/2930 [55:59<20:59,  1.56s/it]

{'loss': 0.0345, 'grad_norm': 0.29260018467903137, 'learning_rate': 0.00011048139296688289, 'epoch': 7.23}


 72%|███████▏  | 2122/2930 [56:01<20:48,  1.55s/it]

{'loss': 0.0286, 'grad_norm': 0.28473958373069763, 'learning_rate': 0.0001103448275862069, 'epoch': 7.24}


 72%|███████▏  | 2123/2930 [56:02<20:48,  1.55s/it]

{'loss': 0.0319, 'grad_norm': 0.2603546977043152, 'learning_rate': 0.00011020826220553091, 'epoch': 7.24}


 72%|███████▏  | 2124/2930 [56:04<20:50,  1.55s/it]

{'loss': 0.0332, 'grad_norm': 0.27973034977912903, 'learning_rate': 0.0001100716968248549, 'epoch': 7.24}


 73%|███████▎  | 2125/2930 [56:06<20:53,  1.56s/it]

{'loss': 0.0295, 'grad_norm': 0.2762130796909332, 'learning_rate': 0.0001099351314441789, 'epoch': 7.25}


 73%|███████▎  | 2126/2930 [56:07<20:57,  1.56s/it]

{'loss': 0.0234, 'grad_norm': 0.19531244039535522, 'learning_rate': 0.00010979856606350291, 'epoch': 7.25}


 73%|███████▎  | 2127/2930 [56:09<20:58,  1.57s/it]

{'loss': 0.0292, 'grad_norm': 0.254202276468277, 'learning_rate': 0.0001096620006828269, 'epoch': 7.25}


 73%|███████▎  | 2128/2930 [56:10<20:52,  1.56s/it]

{'loss': 0.0298, 'grad_norm': 0.26893526315689087, 'learning_rate': 0.00010952543530215091, 'epoch': 7.26}


 73%|███████▎  | 2129/2930 [56:12<20:46,  1.56s/it]

{'loss': 0.0239, 'grad_norm': 0.20338557660579681, 'learning_rate': 0.0001093888699214749, 'epoch': 7.26}


 73%|███████▎  | 2130/2930 [56:13<20:43,  1.55s/it]

{'loss': 0.0314, 'grad_norm': 0.30095723271369934, 'learning_rate': 0.00010925230454079891, 'epoch': 7.26}


 73%|███████▎  | 2131/2930 [56:15<20:43,  1.56s/it]

{'loss': 0.0258, 'grad_norm': 0.2589108645915985, 'learning_rate': 0.00010911573916012292, 'epoch': 7.27}


 73%|███████▎  | 2132/2930 [56:16<20:45,  1.56s/it]

{'loss': 0.0282, 'grad_norm': 0.28593939542770386, 'learning_rate': 0.0001089791737794469, 'epoch': 7.27}


 73%|███████▎  | 2133/2930 [56:18<20:48,  1.57s/it]

{'loss': 0.0353, 'grad_norm': 0.303489089012146, 'learning_rate': 0.00010884260839877092, 'epoch': 7.27}


 73%|███████▎  | 2134/2930 [56:20<20:48,  1.57s/it]

{'loss': 0.0306, 'grad_norm': 0.2955761253833771, 'learning_rate': 0.00010870604301809493, 'epoch': 7.28}


 73%|███████▎  | 2135/2930 [56:21<20:48,  1.57s/it]

{'loss': 0.0292, 'grad_norm': 0.26887762546539307, 'learning_rate': 0.00010856947763741891, 'epoch': 7.28}


 73%|███████▎  | 2136/2930 [56:23<20:51,  1.58s/it]

{'loss': 0.0308, 'grad_norm': 0.28532305359840393, 'learning_rate': 0.00010843291225674292, 'epoch': 7.28}


 73%|███████▎  | 2137/2930 [56:24<20:45,  1.57s/it]

{'loss': 0.0277, 'grad_norm': 0.2562413513660431, 'learning_rate': 0.00010829634687606691, 'epoch': 7.29}


 73%|███████▎  | 2138/2930 [56:26<20:54,  1.58s/it]

{'loss': 0.0297, 'grad_norm': 0.28537294268608093, 'learning_rate': 0.00010815978149539092, 'epoch': 7.29}


 73%|███████▎  | 2139/2930 [56:28<20:47,  1.58s/it]

{'loss': 0.0271, 'grad_norm': 0.2626531720161438, 'learning_rate': 0.00010802321611471493, 'epoch': 7.29}


 73%|███████▎  | 2140/2930 [56:29<20:38,  1.57s/it]

{'loss': 0.0254, 'grad_norm': 0.2458665668964386, 'learning_rate': 0.00010788665073403892, 'epoch': 7.3}


 73%|███████▎  | 2141/2930 [56:31<20:32,  1.56s/it]

{'loss': 0.0273, 'grad_norm': 0.27797815203666687, 'learning_rate': 0.00010775008535336293, 'epoch': 7.3}


 73%|███████▎  | 2142/2930 [56:32<20:34,  1.57s/it]

{'loss': 0.0308, 'grad_norm': 0.2970736026763916, 'learning_rate': 0.00010761351997268694, 'epoch': 7.3}


 73%|███████▎  | 2143/2930 [56:34<20:32,  1.57s/it]

{'loss': 0.0274, 'grad_norm': 0.26575061678886414, 'learning_rate': 0.00010747695459201092, 'epoch': 7.31}


 73%|███████▎  | 2144/2930 [56:35<20:32,  1.57s/it]

{'loss': 0.0275, 'grad_norm': 0.25474756956100464, 'learning_rate': 0.00010734038921133494, 'epoch': 7.31}


 73%|███████▎  | 2145/2930 [56:37<20:35,  1.57s/it]

{'loss': 0.0332, 'grad_norm': 0.2773955464363098, 'learning_rate': 0.00010720382383065892, 'epoch': 7.31}


 73%|███████▎  | 2146/2930 [56:39<20:32,  1.57s/it]

{'loss': 0.0241, 'grad_norm': 0.22583706676959991, 'learning_rate': 0.00010706725844998293, 'epoch': 7.32}


 73%|███████▎  | 2147/2930 [56:40<20:40,  1.58s/it]

{'loss': 0.0323, 'grad_norm': 0.2868548631668091, 'learning_rate': 0.00010693069306930694, 'epoch': 7.32}


 73%|███████▎  | 2148/2930 [56:42<20:38,  1.58s/it]

{'loss': 0.027, 'grad_norm': 0.2543281018733978, 'learning_rate': 0.00010679412768863093, 'epoch': 7.32}


 73%|███████▎  | 2149/2930 [56:43<20:39,  1.59s/it]

{'loss': 0.0291, 'grad_norm': 0.26900821924209595, 'learning_rate': 0.00010665756230795494, 'epoch': 7.33}


 73%|███████▎  | 2150/2930 [56:45<20:43,  1.59s/it]

{'loss': 0.0271, 'grad_norm': 0.25042712688446045, 'learning_rate': 0.00010652099692727895, 'epoch': 7.33}


 73%|███████▎  | 2151/2930 [56:46<20:38,  1.59s/it]

{'loss': 0.0277, 'grad_norm': 0.29768988490104675, 'learning_rate': 0.00010638443154660294, 'epoch': 7.34}


 73%|███████▎  | 2152/2930 [56:48<20:33,  1.59s/it]

{'loss': 0.033, 'grad_norm': 0.2778698801994324, 'learning_rate': 0.00010624786616592695, 'epoch': 7.34}


 73%|███████▎  | 2153/2930 [56:50<20:30,  1.58s/it]

{'loss': 0.0353, 'grad_norm': 0.3330160081386566, 'learning_rate': 0.00010611130078525093, 'epoch': 7.34}


 74%|███████▎  | 2154/2930 [56:51<20:28,  1.58s/it]

{'loss': 0.0249, 'grad_norm': 0.246782124042511, 'learning_rate': 0.00010597473540457495, 'epoch': 7.35}


 74%|███████▎  | 2155/2930 [56:53<20:20,  1.57s/it]

{'loss': 0.0244, 'grad_norm': 0.26224610209465027, 'learning_rate': 0.00010583817002389896, 'epoch': 7.35}


 74%|███████▎  | 2156/2930 [56:54<20:19,  1.58s/it]

{'loss': 0.0295, 'grad_norm': 0.2766464650630951, 'learning_rate': 0.00010570160464322294, 'epoch': 7.35}


 74%|███████▎  | 2157/2930 [56:56<20:16,  1.57s/it]

{'loss': 0.0257, 'grad_norm': 0.24305643141269684, 'learning_rate': 0.00010556503926254695, 'epoch': 7.36}


 74%|███████▎  | 2158/2930 [56:58<20:17,  1.58s/it]

{'loss': 0.0335, 'grad_norm': 0.29768267273902893, 'learning_rate': 0.00010542847388187095, 'epoch': 7.36}


 74%|███████▎  | 2159/2930 [56:59<20:18,  1.58s/it]

{'loss': 0.0293, 'grad_norm': 0.23170173168182373, 'learning_rate': 0.00010529190850119495, 'epoch': 7.36}


 74%|███████▎  | 2160/2930 [57:01<20:13,  1.58s/it]

{'loss': 0.0274, 'grad_norm': 0.24896302819252014, 'learning_rate': 0.00010515534312051895, 'epoch': 7.37}


 74%|███████▍  | 2161/2930 [57:02<20:14,  1.58s/it]

{'loss': 0.0342, 'grad_norm': 0.322327584028244, 'learning_rate': 0.00010501877773984295, 'epoch': 7.37}


 74%|███████▍  | 2162/2930 [57:04<20:16,  1.58s/it]

{'loss': 0.0318, 'grad_norm': 0.2956230342388153, 'learning_rate': 0.00010488221235916696, 'epoch': 7.37}


 74%|███████▍  | 2163/2930 [57:05<20:09,  1.58s/it]

{'loss': 0.0277, 'grad_norm': 0.25295162200927734, 'learning_rate': 0.00010474564697849096, 'epoch': 7.38}


 74%|███████▍  | 2164/2930 [57:07<20:10,  1.58s/it]

{'loss': 0.0334, 'grad_norm': 0.28550463914871216, 'learning_rate': 0.00010460908159781495, 'epoch': 7.38}


 74%|███████▍  | 2165/2930 [57:09<20:06,  1.58s/it]

{'loss': 0.0282, 'grad_norm': 0.2781795263290405, 'learning_rate': 0.00010447251621713897, 'epoch': 7.38}


 74%|███████▍  | 2166/2930 [57:10<20:10,  1.58s/it]

{'loss': 0.0222, 'grad_norm': 0.22260317206382751, 'learning_rate': 0.00010433595083646297, 'epoch': 7.39}


 74%|███████▍  | 2167/2930 [57:12<20:11,  1.59s/it]

{'loss': 0.0313, 'grad_norm': 0.25475025177001953, 'learning_rate': 0.00010419938545578695, 'epoch': 7.39}


 74%|███████▍  | 2168/2930 [57:13<20:14,  1.59s/it]

{'loss': 0.0272, 'grad_norm': 0.26373210549354553, 'learning_rate': 0.00010406282007511096, 'epoch': 7.39}


 74%|███████▍  | 2169/2930 [57:15<20:04,  1.58s/it]

{'loss': 0.0294, 'grad_norm': 0.257324755191803, 'learning_rate': 0.00010392625469443496, 'epoch': 7.4}


 74%|███████▍  | 2170/2930 [57:17<20:07,  1.59s/it]

{'loss': 0.0294, 'grad_norm': 0.27922582626342773, 'learning_rate': 0.00010378968931375896, 'epoch': 7.4}


 74%|███████▍  | 2171/2930 [57:18<20:02,  1.58s/it]

{'loss': 0.0309, 'grad_norm': 0.27076196670532227, 'learning_rate': 0.00010365312393308297, 'epoch': 7.4}


 74%|███████▍  | 2172/2930 [57:20<20:04,  1.59s/it]

{'loss': 0.027, 'grad_norm': 0.2609488070011139, 'learning_rate': 0.00010351655855240697, 'epoch': 7.41}


 74%|███████▍  | 2173/2930 [57:21<20:02,  1.59s/it]

{'loss': 0.0236, 'grad_norm': 0.24950407445430756, 'learning_rate': 0.00010337999317173097, 'epoch': 7.41}


 74%|███████▍  | 2174/2930 [57:23<20:00,  1.59s/it]

{'loss': 0.027, 'grad_norm': 0.28556105494499207, 'learning_rate': 0.00010324342779105498, 'epoch': 7.41}


 74%|███████▍  | 2175/2930 [57:24<20:01,  1.59s/it]

{'loss': 0.029, 'grad_norm': 0.26428624987602234, 'learning_rate': 0.00010310686241037896, 'epoch': 7.42}


 74%|███████▍  | 2176/2930 [57:26<20:01,  1.59s/it]

{'loss': 0.0307, 'grad_norm': 0.29207906126976013, 'learning_rate': 0.00010297029702970298, 'epoch': 7.42}


 74%|███████▍  | 2177/2930 [57:28<20:00,  1.59s/it]

{'loss': 0.0304, 'grad_norm': 0.2888163924217224, 'learning_rate': 0.00010283373164902699, 'epoch': 7.42}


 74%|███████▍  | 2178/2930 [57:29<19:59,  1.60s/it]

{'loss': 0.0233, 'grad_norm': 0.21011067926883698, 'learning_rate': 0.00010269716626835097, 'epoch': 7.43}


 74%|███████▍  | 2179/2930 [57:31<19:53,  1.59s/it]

{'loss': 0.0231, 'grad_norm': 0.2389896661043167, 'learning_rate': 0.00010256060088767498, 'epoch': 7.43}


 74%|███████▍  | 2180/2930 [57:32<19:46,  1.58s/it]

{'loss': 0.0235, 'grad_norm': 0.23945797979831696, 'learning_rate': 0.00010242403550699897, 'epoch': 7.43}


 74%|███████▍  | 2181/2930 [57:34<19:44,  1.58s/it]

{'loss': 0.027, 'grad_norm': 0.2909491956233978, 'learning_rate': 0.00010228747012632298, 'epoch': 7.44}


 74%|███████▍  | 2182/2930 [57:36<19:45,  1.58s/it]

{'loss': 0.0268, 'grad_norm': 0.2476532906293869, 'learning_rate': 0.00010215090474564699, 'epoch': 7.44}


 75%|███████▍  | 2183/2930 [57:37<19:44,  1.59s/it]

{'loss': 0.0292, 'grad_norm': 0.24486438930034637, 'learning_rate': 0.00010201433936497098, 'epoch': 7.44}


 75%|███████▍  | 2184/2930 [57:39<19:40,  1.58s/it]

{'loss': 0.0244, 'grad_norm': 0.24016954004764557, 'learning_rate': 0.00010187777398429499, 'epoch': 7.45}


 75%|███████▍  | 2185/2930 [57:40<19:42,  1.59s/it]

{'loss': 0.0319, 'grad_norm': 0.2605952024459839, 'learning_rate': 0.000101741208603619, 'epoch': 7.45}


 75%|███████▍  | 2186/2930 [57:42<19:41,  1.59s/it]

{'loss': 0.0309, 'grad_norm': 0.27853336930274963, 'learning_rate': 0.00010160464322294299, 'epoch': 7.45}


 75%|███████▍  | 2187/2930 [57:43<19:32,  1.58s/it]

{'loss': 0.03, 'grad_norm': 0.2839866280555725, 'learning_rate': 0.000101468077842267, 'epoch': 7.46}


 75%|███████▍  | 2188/2930 [57:45<19:28,  1.57s/it]

{'loss': 0.0276, 'grad_norm': 0.252958744764328, 'learning_rate': 0.00010133151246159098, 'epoch': 7.46}


 75%|███████▍  | 2189/2930 [57:47<19:20,  1.57s/it]

{'loss': 0.0289, 'grad_norm': 0.24105361104011536, 'learning_rate': 0.00010119494708091499, 'epoch': 7.46}


 75%|███████▍  | 2190/2930 [57:48<19:17,  1.56s/it]

{'loss': 0.0274, 'grad_norm': 0.2685422897338867, 'learning_rate': 0.000101058381700239, 'epoch': 7.47}


 75%|███████▍  | 2191/2930 [57:50<19:16,  1.57s/it]

{'loss': 0.026, 'grad_norm': 0.2620730400085449, 'learning_rate': 0.00010092181631956299, 'epoch': 7.47}


 75%|███████▍  | 2192/2930 [57:51<19:20,  1.57s/it]

{'loss': 0.0298, 'grad_norm': 0.2862175703048706, 'learning_rate': 0.000100785250938887, 'epoch': 7.47}


 75%|███████▍  | 2193/2930 [57:53<19:16,  1.57s/it]

{'loss': 0.0251, 'grad_norm': 0.22177544236183167, 'learning_rate': 0.000100648685558211, 'epoch': 7.48}


 75%|███████▍  | 2194/2930 [57:54<19:08,  1.56s/it]

{'loss': 0.0269, 'grad_norm': 0.24596723914146423, 'learning_rate': 0.000100512120177535, 'epoch': 7.48}


 75%|███████▍  | 2195/2930 [57:56<19:08,  1.56s/it]

{'loss': 0.0389, 'grad_norm': 0.30577513575553894, 'learning_rate': 0.00010037555479685901, 'epoch': 7.49}


 75%|███████▍  | 2196/2930 [57:58<18:59,  1.55s/it]

{'loss': 0.0277, 'grad_norm': 0.2586871385574341, 'learning_rate': 0.00010023898941618299, 'epoch': 7.49}


 75%|███████▍  | 2197/2930 [57:59<18:57,  1.55s/it]

{'loss': 0.0327, 'grad_norm': 0.27738794684410095, 'learning_rate': 0.00010010242403550701, 'epoch': 7.49}


 75%|███████▌  | 2198/2930 [58:01<18:52,  1.55s/it]

{'loss': 0.026, 'grad_norm': 0.2459893822669983, 'learning_rate': 9.9965858654831e-05, 'epoch': 7.5}


 75%|███████▌  | 2199/2930 [58:02<18:56,  1.56s/it]

{'loss': 0.0252, 'grad_norm': 0.23317502439022064, 'learning_rate': 9.982929327415501e-05, 'epoch': 7.5}


 75%|███████▌  | 2200/2930 [58:04<18:39,  1.53s/it]

{'loss': 0.0278, 'grad_norm': 0.23888526856899261, 'learning_rate': 9.9692727893479e-05, 'epoch': 7.5}


 75%|███████▌  | 2201/2930 [58:05<18:41,  1.54s/it]

{'loss': 0.0284, 'grad_norm': 0.2771427631378174, 'learning_rate': 9.955616251280301e-05, 'epoch': 7.51}


 75%|███████▌  | 2202/2930 [58:07<18:40,  1.54s/it]

{'loss': 0.028, 'grad_norm': 0.2594112157821655, 'learning_rate': 9.941959713212701e-05, 'epoch': 7.51}


 75%|███████▌  | 2203/2930 [58:08<18:39,  1.54s/it]

{'loss': 0.0299, 'grad_norm': 0.2542882561683655, 'learning_rate': 9.928303175145101e-05, 'epoch': 7.51}


 75%|███████▌  | 2204/2930 [58:10<18:43,  1.55s/it]

{'loss': 0.0349, 'grad_norm': 0.2838057279586792, 'learning_rate': 9.914646637077501e-05, 'epoch': 7.52}


 75%|███████▌  | 2205/2930 [58:11<18:46,  1.55s/it]

{'loss': 0.0347, 'grad_norm': 0.31660017371177673, 'learning_rate': 9.900990099009902e-05, 'epoch': 7.52}


 75%|███████▌  | 2206/2930 [58:13<18:42,  1.55s/it]

{'loss': 0.0272, 'grad_norm': 0.25599899888038635, 'learning_rate': 9.887333560942301e-05, 'epoch': 7.52}


 75%|███████▌  | 2207/2930 [58:15<18:44,  1.56s/it]

{'loss': 0.0274, 'grad_norm': 0.2778076231479645, 'learning_rate': 9.873677022874702e-05, 'epoch': 7.53}


 75%|███████▌  | 2208/2930 [58:16<18:36,  1.55s/it]

{'loss': 0.0255, 'grad_norm': 0.2665703594684601, 'learning_rate': 9.860020484807103e-05, 'epoch': 7.53}


 75%|███████▌  | 2209/2930 [58:18<18:32,  1.54s/it]

{'loss': 0.0339, 'grad_norm': 0.3386360704898834, 'learning_rate': 9.846363946739502e-05, 'epoch': 7.53}


 75%|███████▌  | 2210/2930 [58:19<18:37,  1.55s/it]

{'loss': 0.0333, 'grad_norm': 0.31175315380096436, 'learning_rate': 9.832707408671901e-05, 'epoch': 7.54}


 75%|███████▌  | 2211/2930 [58:21<18:37,  1.55s/it]

{'loss': 0.0262, 'grad_norm': 0.24012376368045807, 'learning_rate': 9.819050870604302e-05, 'epoch': 7.54}


 75%|███████▌  | 2212/2930 [58:22<18:39,  1.56s/it]

{'loss': 0.0314, 'grad_norm': 0.3068667948246002, 'learning_rate': 9.805394332536703e-05, 'epoch': 7.54}


 76%|███████▌  | 2213/2930 [58:24<18:27,  1.54s/it]

{'loss': 0.0264, 'grad_norm': 0.22600802779197693, 'learning_rate': 9.791737794469102e-05, 'epoch': 7.55}


 76%|███████▌  | 2214/2930 [58:25<18:28,  1.55s/it]

{'loss': 0.0341, 'grad_norm': 0.28893935680389404, 'learning_rate': 9.778081256401502e-05, 'epoch': 7.55}


 76%|███████▌  | 2215/2930 [58:27<18:31,  1.56s/it]

{'loss': 0.0308, 'grad_norm': 0.2459677904844284, 'learning_rate': 9.764424718333904e-05, 'epoch': 7.55}


 76%|███████▌  | 2216/2930 [58:29<18:34,  1.56s/it]

{'loss': 0.0266, 'grad_norm': 0.253851979970932, 'learning_rate': 9.750768180266303e-05, 'epoch': 7.56}


 76%|███████▌  | 2217/2930 [58:30<18:32,  1.56s/it]

{'loss': 0.0267, 'grad_norm': 0.2557026743888855, 'learning_rate': 9.737111642198703e-05, 'epoch': 7.56}


 76%|███████▌  | 2218/2930 [58:32<18:24,  1.55s/it]

{'loss': 0.0284, 'grad_norm': 0.2634890079498291, 'learning_rate': 9.723455104131102e-05, 'epoch': 7.56}


 76%|███████▌  | 2219/2930 [58:33<18:19,  1.55s/it]

{'loss': 0.0308, 'grad_norm': 0.2819722592830658, 'learning_rate': 9.709798566063504e-05, 'epoch': 7.57}


 76%|███████▌  | 2220/2930 [58:35<18:20,  1.55s/it]

{'loss': 0.0271, 'grad_norm': 0.26297223567962646, 'learning_rate': 9.696142027995904e-05, 'epoch': 7.57}


 76%|███████▌  | 2221/2930 [58:36<18:16,  1.55s/it]

{'loss': 0.0307, 'grad_norm': 0.2953716814517975, 'learning_rate': 9.682485489928303e-05, 'epoch': 7.57}


 76%|███████▌  | 2222/2930 [58:38<18:12,  1.54s/it]

{'loss': 0.0255, 'grad_norm': 0.21705831587314606, 'learning_rate': 9.668828951860704e-05, 'epoch': 7.58}


 76%|███████▌  | 2223/2930 [58:39<18:16,  1.55s/it]

{'loss': 0.0279, 'grad_norm': 0.25792884826660156, 'learning_rate': 9.655172413793105e-05, 'epoch': 7.58}


 76%|███████▌  | 2224/2930 [58:41<18:11,  1.55s/it]

{'loss': 0.0267, 'grad_norm': 0.27649155259132385, 'learning_rate': 9.641515875725504e-05, 'epoch': 7.58}


 76%|███████▌  | 2225/2930 [58:42<18:14,  1.55s/it]

{'loss': 0.039, 'grad_norm': 0.3152443468570709, 'learning_rate': 9.627859337657903e-05, 'epoch': 7.59}


 76%|███████▌  | 2226/2930 [58:44<18:08,  1.55s/it]

{'loss': 0.0234, 'grad_norm': 0.2350403517484665, 'learning_rate': 9.614202799590304e-05, 'epoch': 7.59}


 76%|███████▌  | 2227/2930 [58:46<18:06,  1.54s/it]

{'loss': 0.0279, 'grad_norm': 0.27505433559417725, 'learning_rate': 9.600546261522705e-05, 'epoch': 7.59}


 76%|███████▌  | 2228/2930 [58:47<18:08,  1.55s/it]

{'loss': 0.0279, 'grad_norm': 0.2785845398902893, 'learning_rate': 9.586889723455104e-05, 'epoch': 7.6}


 76%|███████▌  | 2229/2930 [58:49<18:07,  1.55s/it]

{'loss': 0.0291, 'grad_norm': 0.26890110969543457, 'learning_rate': 9.573233185387505e-05, 'epoch': 7.6}


 76%|███████▌  | 2230/2930 [58:50<18:07,  1.55s/it]

{'loss': 0.03, 'grad_norm': 0.25012290477752686, 'learning_rate': 9.559576647319905e-05, 'epoch': 7.6}


 76%|███████▌  | 2231/2930 [58:52<18:03,  1.55s/it]

{'loss': 0.0312, 'grad_norm': 0.2584919035434723, 'learning_rate': 9.545920109252305e-05, 'epoch': 7.61}


 76%|███████▌  | 2232/2930 [58:53<18:06,  1.56s/it]

{'loss': 0.0326, 'grad_norm': 0.31402650475502014, 'learning_rate': 9.532263571184705e-05, 'epoch': 7.61}


 76%|███████▌  | 2233/2930 [58:55<18:05,  1.56s/it]

{'loss': 0.0274, 'grad_norm': 0.27088287472724915, 'learning_rate': 9.518607033117106e-05, 'epoch': 7.61}


 76%|███████▌  | 2234/2930 [58:56<18:02,  1.55s/it]

{'loss': 0.0301, 'grad_norm': 0.28110137581825256, 'learning_rate': 9.504950495049505e-05, 'epoch': 7.62}


 76%|███████▋  | 2235/2930 [58:58<18:07,  1.56s/it]

{'loss': 0.0331, 'grad_norm': 0.2777478098869324, 'learning_rate': 9.491293956981906e-05, 'epoch': 7.62}


 76%|███████▋  | 2236/2930 [59:00<18:03,  1.56s/it]

{'loss': 0.0314, 'grad_norm': 0.250100702047348, 'learning_rate': 9.477637418914305e-05, 'epoch': 7.62}


 76%|███████▋  | 2237/2930 [59:01<17:58,  1.56s/it]

{'loss': 0.0308, 'grad_norm': 0.2814415991306305, 'learning_rate': 9.463980880846706e-05, 'epoch': 7.63}


 76%|███████▋  | 2238/2930 [59:03<17:57,  1.56s/it]

{'loss': 0.0274, 'grad_norm': 0.28604984283447266, 'learning_rate': 9.450324342779105e-05, 'epoch': 7.63}


 76%|███████▋  | 2239/2930 [59:04<17:53,  1.55s/it]

{'loss': 0.0326, 'grad_norm': 0.30177491903305054, 'learning_rate': 9.436667804711506e-05, 'epoch': 7.64}


 76%|███████▋  | 2240/2930 [59:06<17:52,  1.55s/it]

{'loss': 0.0308, 'grad_norm': 0.24685926735401154, 'learning_rate': 9.423011266643907e-05, 'epoch': 7.64}


 76%|███████▋  | 2241/2930 [59:07<17:44,  1.55s/it]

{'loss': 0.027, 'grad_norm': 0.23054225742816925, 'learning_rate': 9.409354728576306e-05, 'epoch': 7.64}


 77%|███████▋  | 2242/2930 [59:09<17:39,  1.54s/it]

{'loss': 0.0302, 'grad_norm': 0.26161864399909973, 'learning_rate': 9.395698190508706e-05, 'epoch': 7.65}


 77%|███████▋  | 2243/2930 [59:10<17:37,  1.54s/it]

{'loss': 0.0252, 'grad_norm': 0.2519480288028717, 'learning_rate': 9.382041652441106e-05, 'epoch': 7.65}


 77%|███████▋  | 2244/2930 [59:12<17:40,  1.55s/it]

{'loss': 0.0281, 'grad_norm': 0.25666728615760803, 'learning_rate': 9.368385114373507e-05, 'epoch': 7.65}


 77%|███████▋  | 2245/2930 [59:13<17:42,  1.55s/it]

{'loss': 0.0279, 'grad_norm': 0.23750126361846924, 'learning_rate': 9.354728576305907e-05, 'epoch': 7.66}


 77%|███████▋  | 2246/2930 [59:15<17:40,  1.55s/it]

{'loss': 0.0289, 'grad_norm': 0.2706579864025116, 'learning_rate': 9.341072038238306e-05, 'epoch': 7.66}


 77%|███████▋  | 2247/2930 [59:17<17:38,  1.55s/it]

{'loss': 0.0312, 'grad_norm': 0.29395776987075806, 'learning_rate': 9.327415500170707e-05, 'epoch': 7.66}


 77%|███████▋  | 2248/2930 [59:18<17:38,  1.55s/it]

{'loss': 0.0251, 'grad_norm': 0.23987160623073578, 'learning_rate': 9.313758962103108e-05, 'epoch': 7.67}


 77%|███████▋  | 2249/2930 [59:20<17:34,  1.55s/it]

{'loss': 0.0259, 'grad_norm': 0.23967669904232025, 'learning_rate': 9.300102424035507e-05, 'epoch': 7.67}


 77%|███████▋  | 2250/2930 [59:21<17:37,  1.56s/it]

{'loss': 0.0284, 'grad_norm': 0.27275514602661133, 'learning_rate': 9.286445885967906e-05, 'epoch': 7.67}


 77%|███████▋  | 2251/2930 [59:23<17:36,  1.56s/it]

{'loss': 0.0258, 'grad_norm': 0.2630118429660797, 'learning_rate': 9.272789347900309e-05, 'epoch': 7.68}


 77%|███████▋  | 2252/2930 [59:24<17:35,  1.56s/it]

{'loss': 0.0349, 'grad_norm': 0.3026372790336609, 'learning_rate': 9.259132809832708e-05, 'epoch': 7.68}


 77%|███████▋  | 2253/2930 [59:26<17:34,  1.56s/it]

{'loss': 0.0282, 'grad_norm': 0.2752801775932312, 'learning_rate': 9.245476271765107e-05, 'epoch': 7.68}


 77%|███████▋  | 2254/2930 [59:27<17:31,  1.56s/it]

{'loss': 0.0312, 'grad_norm': 0.33442026376724243, 'learning_rate': 9.231819733697508e-05, 'epoch': 7.69}


 77%|███████▋  | 2255/2930 [59:29<17:30,  1.56s/it]

{'loss': 0.0255, 'grad_norm': 0.24856829643249512, 'learning_rate': 9.218163195629909e-05, 'epoch': 7.69}


 77%|███████▋  | 2256/2930 [59:31<17:32,  1.56s/it]

{'loss': 0.0277, 'grad_norm': 0.2817090153694153, 'learning_rate': 9.204506657562308e-05, 'epoch': 7.69}


 77%|███████▋  | 2257/2930 [59:32<17:26,  1.56s/it]

{'loss': 0.0292, 'grad_norm': 0.2836478352546692, 'learning_rate': 9.190850119494708e-05, 'epoch': 7.7}


 77%|███████▋  | 2258/2930 [59:34<17:22,  1.55s/it]

{'loss': 0.0266, 'grad_norm': 0.26274165511131287, 'learning_rate': 9.177193581427109e-05, 'epoch': 7.7}


 77%|███████▋  | 2259/2930 [59:35<17:15,  1.54s/it]

{'loss': 0.0274, 'grad_norm': 0.24670375883579254, 'learning_rate': 9.163537043359509e-05, 'epoch': 7.7}


 77%|███████▋  | 2260/2930 [59:37<17:17,  1.55s/it]

{'loss': 0.0267, 'grad_norm': 0.25166216492652893, 'learning_rate': 9.149880505291909e-05, 'epoch': 7.71}


 77%|███████▋  | 2261/2930 [59:38<17:18,  1.55s/it]

{'loss': 0.0312, 'grad_norm': 0.3263058662414551, 'learning_rate': 9.13622396722431e-05, 'epoch': 7.71}


 77%|███████▋  | 2262/2930 [59:40<17:15,  1.55s/it]

{'loss': 0.0252, 'grad_norm': 0.2872278094291687, 'learning_rate': 9.122567429156709e-05, 'epoch': 7.71}


 77%|███████▋  | 2263/2930 [59:41<17:18,  1.56s/it]

{'loss': 0.0316, 'grad_norm': 0.2879410982131958, 'learning_rate': 9.10891089108911e-05, 'epoch': 7.72}


 77%|███████▋  | 2264/2930 [59:43<17:17,  1.56s/it]

{'loss': 0.0316, 'grad_norm': 0.27110347151756287, 'learning_rate': 9.095254353021509e-05, 'epoch': 7.72}


 77%|███████▋  | 2265/2930 [59:45<17:13,  1.55s/it]

{'loss': 0.0326, 'grad_norm': 0.30562153458595276, 'learning_rate': 9.08159781495391e-05, 'epoch': 7.72}


 77%|███████▋  | 2266/2930 [59:46<17:14,  1.56s/it]

{'loss': 0.0311, 'grad_norm': 0.27184203267097473, 'learning_rate': 9.067941276886309e-05, 'epoch': 7.73}


 77%|███████▋  | 2267/2930 [59:48<17:15,  1.56s/it]

{'loss': 0.0244, 'grad_norm': 0.23059333860874176, 'learning_rate': 9.05428473881871e-05, 'epoch': 7.73}


 77%|███████▋  | 2268/2930 [59:49<17:17,  1.57s/it]

{'loss': 0.0278, 'grad_norm': 0.2612017095088959, 'learning_rate': 9.04062820075111e-05, 'epoch': 7.73}


 77%|███████▋  | 2269/2930 [59:51<17:20,  1.57s/it]

{'loss': 0.0278, 'grad_norm': 0.23456121981143951, 'learning_rate': 9.02697166268351e-05, 'epoch': 7.74}


 77%|███████▋  | 2270/2930 [59:52<17:21,  1.58s/it]

{'loss': 0.0282, 'grad_norm': 0.2451271116733551, 'learning_rate': 9.01331512461591e-05, 'epoch': 7.74}


 78%|███████▊  | 2271/2930 [59:54<17:21,  1.58s/it]

{'loss': 0.0286, 'grad_norm': 0.24991613626480103, 'learning_rate': 8.99965858654831e-05, 'epoch': 7.74}


 78%|███████▊  | 2272/2930 [59:56<17:17,  1.58s/it]

{'loss': 0.0305, 'grad_norm': 0.2857860028743744, 'learning_rate': 8.986002048480711e-05, 'epoch': 7.75}


 78%|███████▊  | 2273/2930 [59:57<17:20,  1.58s/it]

{'loss': 0.0267, 'grad_norm': 0.237946555018425, 'learning_rate': 8.97234551041311e-05, 'epoch': 7.75}


 78%|███████▊  | 2274/2930 [59:59<17:12,  1.57s/it]

{'loss': 0.0247, 'grad_norm': 0.2210036665201187, 'learning_rate': 8.95868897234551e-05, 'epoch': 7.75}


 78%|███████▊  | 2275/2930 [1:00:00<17:15,  1.58s/it]

{'loss': 0.03, 'grad_norm': 0.27283722162246704, 'learning_rate': 8.945032434277911e-05, 'epoch': 7.76}


 78%|███████▊  | 2276/2930 [1:00:02<17:13,  1.58s/it]

{'loss': 0.0286, 'grad_norm': 0.23417142033576965, 'learning_rate': 8.931375896210312e-05, 'epoch': 7.76}


 78%|███████▊  | 2277/2930 [1:00:04<17:13,  1.58s/it]

{'loss': 0.0281, 'grad_norm': 0.2618906497955322, 'learning_rate': 8.917719358142711e-05, 'epoch': 7.76}


 78%|███████▊  | 2278/2930 [1:00:05<17:09,  1.58s/it]

{'loss': 0.0254, 'grad_norm': 0.2903417944908142, 'learning_rate': 8.90406282007511e-05, 'epoch': 7.77}


 78%|███████▊  | 2279/2930 [1:00:07<17:03,  1.57s/it]

{'loss': 0.0296, 'grad_norm': 0.29707008600234985, 'learning_rate': 8.890406282007511e-05, 'epoch': 7.77}


 78%|███████▊  | 2280/2930 [1:00:08<16:53,  1.56s/it]

{'loss': 0.0301, 'grad_norm': 0.25122249126434326, 'learning_rate': 8.876749743939912e-05, 'epoch': 7.77}


 78%|███████▊  | 2281/2930 [1:00:10<16:57,  1.57s/it]

{'loss': 0.0332, 'grad_norm': 0.30726614594459534, 'learning_rate': 8.863093205872311e-05, 'epoch': 7.78}


 78%|███████▊  | 2282/2930 [1:00:11<17:00,  1.57s/it]

{'loss': 0.0278, 'grad_norm': 0.24677421152591705, 'learning_rate': 8.849436667804712e-05, 'epoch': 7.78}


 78%|███████▊  | 2283/2930 [1:00:13<16:56,  1.57s/it]

{'loss': 0.0239, 'grad_norm': 0.22532427310943604, 'learning_rate': 8.835780129737113e-05, 'epoch': 7.79}


 78%|███████▊  | 2284/2930 [1:00:15<17:00,  1.58s/it]

{'loss': 0.0285, 'grad_norm': 0.26789435744285583, 'learning_rate': 8.822123591669512e-05, 'epoch': 7.79}


 78%|███████▊  | 2285/2930 [1:00:16<17:02,  1.59s/it]

{'loss': 0.0259, 'grad_norm': 0.2585493326187134, 'learning_rate': 8.808467053601912e-05, 'epoch': 7.79}


 78%|███████▊  | 2286/2930 [1:00:18<17:04,  1.59s/it]

{'loss': 0.0256, 'grad_norm': 0.2554212510585785, 'learning_rate': 8.794810515534312e-05, 'epoch': 7.8}


 78%|███████▊  | 2287/2930 [1:00:19<17:00,  1.59s/it]

{'loss': 0.027, 'grad_norm': 0.24800768494606018, 'learning_rate': 8.781153977466713e-05, 'epoch': 7.8}


 78%|███████▊  | 2288/2930 [1:00:21<16:55,  1.58s/it]

{'loss': 0.0266, 'grad_norm': 0.24568966031074524, 'learning_rate': 8.767497439399113e-05, 'epoch': 7.8}


 78%|███████▊  | 2289/2930 [1:00:22<16:50,  1.58s/it]

{'loss': 0.0272, 'grad_norm': 0.26561012864112854, 'learning_rate': 8.753840901331512e-05, 'epoch': 7.81}


 78%|███████▊  | 2290/2930 [1:00:24<16:51,  1.58s/it]

{'loss': 0.0252, 'grad_norm': 0.23510771989822388, 'learning_rate': 8.740184363263914e-05, 'epoch': 7.81}


 78%|███████▊  | 2291/2930 [1:00:26<16:42,  1.57s/it]

{'loss': 0.0265, 'grad_norm': 0.29229122400283813, 'learning_rate': 8.726527825196314e-05, 'epoch': 7.81}


 78%|███████▊  | 2292/2930 [1:00:27<16:41,  1.57s/it]

{'loss': 0.0222, 'grad_norm': 0.2302798479795456, 'learning_rate': 8.712871287128713e-05, 'epoch': 7.82}


 78%|███████▊  | 2293/2930 [1:00:29<16:43,  1.58s/it]

{'loss': 0.0271, 'grad_norm': 0.23058722913265228, 'learning_rate': 8.699214749061112e-05, 'epoch': 7.82}


 78%|███████▊  | 2294/2930 [1:00:30<16:46,  1.58s/it]

{'loss': 0.0247, 'grad_norm': 0.21422143280506134, 'learning_rate': 8.685558210993515e-05, 'epoch': 7.82}


 78%|███████▊  | 2295/2930 [1:00:32<16:45,  1.58s/it]

{'loss': 0.0306, 'grad_norm': 0.27887460589408875, 'learning_rate': 8.671901672925914e-05, 'epoch': 7.83}


 78%|███████▊  | 2296/2930 [1:00:33<16:45,  1.59s/it]

{'loss': 0.0261, 'grad_norm': 0.26936233043670654, 'learning_rate': 8.658245134858313e-05, 'epoch': 7.83}


 78%|███████▊  | 2297/2930 [1:00:35<16:45,  1.59s/it]

{'loss': 0.0268, 'grad_norm': 0.2543882131576538, 'learning_rate': 8.644588596790714e-05, 'epoch': 7.83}


 78%|███████▊  | 2298/2930 [1:00:37<16:47,  1.59s/it]

{'loss': 0.0296, 'grad_norm': 0.25935909152030945, 'learning_rate': 8.630932058723115e-05, 'epoch': 7.84}


 78%|███████▊  | 2299/2930 [1:00:38<16:36,  1.58s/it]

{'loss': 0.0299, 'grad_norm': 0.26799172163009644, 'learning_rate': 8.617275520655514e-05, 'epoch': 7.84}


 78%|███████▊  | 2300/2930 [1:00:40<16:36,  1.58s/it]

{'loss': 0.0254, 'grad_norm': 0.23926378786563873, 'learning_rate': 8.603618982587914e-05, 'epoch': 7.84}


 79%|███████▊  | 2301/2930 [1:00:41<16:37,  1.59s/it]

{'loss': 0.0285, 'grad_norm': 0.26196181774139404, 'learning_rate': 8.589962444520315e-05, 'epoch': 7.85}


 79%|███████▊  | 2302/2930 [1:00:43<16:36,  1.59s/it]

{'loss': 0.031, 'grad_norm': 0.29127633571624756, 'learning_rate': 8.576305906452715e-05, 'epoch': 7.85}


 79%|███████▊  | 2303/2930 [1:00:45<16:33,  1.58s/it]

{'loss': 0.026, 'grad_norm': 0.24108795821666718, 'learning_rate': 8.562649368385115e-05, 'epoch': 7.85}


 79%|███████▊  | 2304/2930 [1:00:46<16:32,  1.59s/it]

{'loss': 0.0242, 'grad_norm': 0.22233347594738007, 'learning_rate': 8.548992830317515e-05, 'epoch': 7.86}


 79%|███████▊  | 2305/2930 [1:00:48<16:28,  1.58s/it]

{'loss': 0.0224, 'grad_norm': 0.2233371138572693, 'learning_rate': 8.535336292249915e-05, 'epoch': 7.86}


 79%|███████▊  | 2306/2930 [1:00:49<16:24,  1.58s/it]

{'loss': 0.0262, 'grad_norm': 0.24301594495773315, 'learning_rate': 8.521679754182316e-05, 'epoch': 7.86}


 79%|███████▊  | 2307/2930 [1:00:51<16:18,  1.57s/it]

{'loss': 0.0294, 'grad_norm': 0.28268685936927795, 'learning_rate': 8.508023216114715e-05, 'epoch': 7.87}


 79%|███████▉  | 2308/2930 [1:00:52<16:20,  1.58s/it]

{'loss': 0.0269, 'grad_norm': 0.25795140862464905, 'learning_rate': 8.494366678047116e-05, 'epoch': 7.87}


 79%|███████▉  | 2309/2930 [1:00:54<16:21,  1.58s/it]

{'loss': 0.0267, 'grad_norm': 0.23828153312206268, 'learning_rate': 8.480710139979515e-05, 'epoch': 7.87}


 79%|███████▉  | 2310/2930 [1:00:56<16:22,  1.58s/it]

{'loss': 0.0244, 'grad_norm': 0.25477248430252075, 'learning_rate': 8.467053601911916e-05, 'epoch': 7.88}


 79%|███████▉  | 2311/2930 [1:00:57<16:16,  1.58s/it]

{'loss': 0.0273, 'grad_norm': 0.26331523060798645, 'learning_rate': 8.453397063844315e-05, 'epoch': 7.88}


 79%|███████▉  | 2312/2930 [1:00:59<16:13,  1.58s/it]

{'loss': 0.0304, 'grad_norm': 0.29257655143737793, 'learning_rate': 8.439740525776716e-05, 'epoch': 7.88}


 79%|███████▉  | 2313/2930 [1:01:00<16:14,  1.58s/it]

{'loss': 0.0253, 'grad_norm': 0.2281775325536728, 'learning_rate': 8.426083987709116e-05, 'epoch': 7.89}


 79%|███████▉  | 2314/2930 [1:01:02<16:15,  1.58s/it]

{'loss': 0.0258, 'grad_norm': 0.260055810213089, 'learning_rate': 8.412427449641516e-05, 'epoch': 7.89}


 79%|███████▉  | 2315/2930 [1:01:03<16:05,  1.57s/it]

{'loss': 0.0209, 'grad_norm': 0.24580511450767517, 'learning_rate': 8.398770911573917e-05, 'epoch': 7.89}


 79%|███████▉  | 2316/2930 [1:01:05<16:06,  1.57s/it]

{'loss': 0.0286, 'grad_norm': 0.27855509519577026, 'learning_rate': 8.385114373506317e-05, 'epoch': 7.9}


 79%|███████▉  | 2317/2930 [1:01:07<16:04,  1.57s/it]

{'loss': 0.0318, 'grad_norm': 0.2897314727306366, 'learning_rate': 8.371457835438716e-05, 'epoch': 7.9}


 79%|███████▉  | 2318/2930 [1:01:08<15:59,  1.57s/it]

{'loss': 0.0241, 'grad_norm': 0.24914215505123138, 'learning_rate': 8.357801297371117e-05, 'epoch': 7.9}


 79%|███████▉  | 2319/2930 [1:01:10<15:56,  1.57s/it]

{'loss': 0.0307, 'grad_norm': 0.26792511343955994, 'learning_rate': 8.344144759303518e-05, 'epoch': 7.91}


 79%|███████▉  | 2320/2930 [1:01:11<15:56,  1.57s/it]

{'loss': 0.0289, 'grad_norm': 0.26168790459632874, 'learning_rate': 8.330488221235917e-05, 'epoch': 7.91}


 79%|███████▉  | 2321/2930 [1:01:13<15:57,  1.57s/it]

{'loss': 0.0276, 'grad_norm': 0.2862778902053833, 'learning_rate': 8.316831683168316e-05, 'epoch': 7.91}


 79%|███████▉  | 2322/2930 [1:01:14<15:56,  1.57s/it]

{'loss': 0.0242, 'grad_norm': 0.23310118913650513, 'learning_rate': 8.303175145100718e-05, 'epoch': 7.92}


 79%|███████▉  | 2323/2930 [1:01:16<15:52,  1.57s/it]

{'loss': 0.0282, 'grad_norm': 0.27126139402389526, 'learning_rate': 8.289518607033118e-05, 'epoch': 7.92}


 79%|███████▉  | 2324/2930 [1:01:18<15:50,  1.57s/it]

{'loss': 0.0293, 'grad_norm': 0.2773128151893616, 'learning_rate': 8.275862068965517e-05, 'epoch': 7.92}


 79%|███████▉  | 2325/2930 [1:01:19<15:51,  1.57s/it]

{'loss': 0.0267, 'grad_norm': 0.23464533686637878, 'learning_rate': 8.262205530897917e-05, 'epoch': 7.93}


 79%|███████▉  | 2326/2930 [1:01:21<15:48,  1.57s/it]

{'loss': 0.0276, 'grad_norm': 0.2492048740386963, 'learning_rate': 8.248548992830319e-05, 'epoch': 7.93}


 79%|███████▉  | 2327/2930 [1:01:22<15:44,  1.57s/it]

{'loss': 0.027, 'grad_norm': 0.24589940905570984, 'learning_rate': 8.234892454762718e-05, 'epoch': 7.94}


 79%|███████▉  | 2328/2930 [1:01:24<15:41,  1.56s/it]

{'loss': 0.0329, 'grad_norm': 0.28296566009521484, 'learning_rate': 8.221235916695118e-05, 'epoch': 7.94}


 79%|███████▉  | 2329/2930 [1:01:25<15:35,  1.56s/it]

{'loss': 0.0258, 'grad_norm': 0.25131914019584656, 'learning_rate': 8.207579378627518e-05, 'epoch': 7.94}


 80%|███████▉  | 2330/2930 [1:01:27<15:33,  1.56s/it]

{'loss': 0.0307, 'grad_norm': 0.2579595446586609, 'learning_rate': 8.193922840559919e-05, 'epoch': 7.95}


 80%|███████▉  | 2331/2930 [1:01:29<15:32,  1.56s/it]

{'loss': 0.0289, 'grad_norm': 0.2646232545375824, 'learning_rate': 8.180266302492319e-05, 'epoch': 7.95}


 80%|███████▉  | 2332/2930 [1:01:30<15:32,  1.56s/it]

{'loss': 0.0274, 'grad_norm': 0.2732212245464325, 'learning_rate': 8.166609764424718e-05, 'epoch': 7.95}


 80%|███████▉  | 2333/2930 [1:01:32<15:29,  1.56s/it]

{'loss': 0.0225, 'grad_norm': 0.19797085225582123, 'learning_rate': 8.152953226357119e-05, 'epoch': 7.96}


 80%|███████▉  | 2334/2930 [1:01:33<15:30,  1.56s/it]

{'loss': 0.0285, 'grad_norm': 0.23276470601558685, 'learning_rate': 8.13929668828952e-05, 'epoch': 7.96}


 80%|███████▉  | 2335/2930 [1:01:35<15:25,  1.56s/it]

{'loss': 0.0314, 'grad_norm': 0.27027660608291626, 'learning_rate': 8.125640150221919e-05, 'epoch': 7.96}


 80%|███████▉  | 2336/2930 [1:01:36<15:22,  1.55s/it]

{'loss': 0.031, 'grad_norm': 0.29664796590805054, 'learning_rate': 8.11198361215432e-05, 'epoch': 7.97}


 80%|███████▉  | 2337/2930 [1:01:38<15:21,  1.55s/it]

{'loss': 0.0304, 'grad_norm': 0.25877082347869873, 'learning_rate': 8.098327074086719e-05, 'epoch': 7.97}


 80%|███████▉  | 2338/2930 [1:01:39<15:10,  1.54s/it]

{'loss': 0.0289, 'grad_norm': 0.30814504623413086, 'learning_rate': 8.08467053601912e-05, 'epoch': 7.97}


 80%|███████▉  | 2339/2930 [1:01:41<15:14,  1.55s/it]

{'loss': 0.038, 'grad_norm': 0.29557695984840393, 'learning_rate': 8.07101399795152e-05, 'epoch': 7.98}


 80%|███████▉  | 2340/2930 [1:01:42<15:14,  1.55s/it]

{'loss': 0.0295, 'grad_norm': 0.25301381945610046, 'learning_rate': 8.05735745988392e-05, 'epoch': 7.98}


 80%|███████▉  | 2341/2930 [1:01:44<15:16,  1.56s/it]

{'loss': 0.0283, 'grad_norm': 0.2564309239387512, 'learning_rate': 8.04370092181632e-05, 'epoch': 7.98}


 80%|███████▉  | 2342/2930 [1:01:46<15:16,  1.56s/it]

{'loss': 0.0313, 'grad_norm': 0.26952117681503296, 'learning_rate': 8.03004438374872e-05, 'epoch': 7.99}


 80%|███████▉  | 2343/2930 [1:01:47<15:14,  1.56s/it]

{'loss': 0.0276, 'grad_norm': 0.2534742057323456, 'learning_rate': 8.01638784568112e-05, 'epoch': 7.99}


 80%|████████  | 2344/2930 [1:01:49<15:15,  1.56s/it]

{'loss': 0.0253, 'grad_norm': 0.24028442800045013, 'learning_rate': 8.00273130761352e-05, 'epoch': 7.99}


 80%|████████  | 2345/2930 [1:01:50<15:12,  1.56s/it]

{'loss': 0.0279, 'grad_norm': 0.26708656549453735, 'learning_rate': 7.98907476954592e-05, 'epoch': 8.0}


 80%|████████  | 2346/2930 [1:01:52<15:12,  1.56s/it]

{'loss': 0.0282, 'grad_norm': 0.26036638021469116, 'learning_rate': 7.975418231478321e-05, 'epoch': 8.0}


 80%|████████  | 2347/2930 [1:01:53<15:10,  1.56s/it]

{'loss': 0.0165, 'grad_norm': 0.16754765808582306, 'learning_rate': 7.961761693410721e-05, 'epoch': 8.0}


 80%|████████  | 2348/2930 [1:01:55<15:07,  1.56s/it]

{'loss': 0.0183, 'grad_norm': 0.20461131632328033, 'learning_rate': 7.948105155343121e-05, 'epoch': 8.01}


 80%|████████  | 2349/2930 [1:01:57<15:07,  1.56s/it]

{'loss': 0.0158, 'grad_norm': 0.1491977870464325, 'learning_rate': 7.93444861727552e-05, 'epoch': 8.01}


 80%|████████  | 2350/2930 [1:01:58<15:04,  1.56s/it]

{'loss': 0.0179, 'grad_norm': 0.2059331238269806, 'learning_rate': 7.920792079207921e-05, 'epoch': 8.01}


 80%|████████  | 2351/2930 [1:02:00<14:58,  1.55s/it]

{'loss': 0.0159, 'grad_norm': 0.18217721581459045, 'learning_rate': 7.907135541140322e-05, 'epoch': 8.02}


 80%|████████  | 2352/2930 [1:02:01<14:48,  1.54s/it]

{'loss': 0.0157, 'grad_norm': 0.16539280116558075, 'learning_rate': 7.893479003072721e-05, 'epoch': 8.02}


 80%|████████  | 2353/2930 [1:02:03<14:52,  1.55s/it]

{'loss': 0.0214, 'grad_norm': 0.21204619109630585, 'learning_rate': 7.87982246500512e-05, 'epoch': 8.02}


 80%|████████  | 2354/2930 [1:02:04<14:47,  1.54s/it]

{'loss': 0.0169, 'grad_norm': 0.19257031381130219, 'learning_rate': 7.866165926937521e-05, 'epoch': 8.03}


 80%|████████  | 2355/2930 [1:02:06<14:45,  1.54s/it]

{'loss': 0.015, 'grad_norm': 0.16885347664356232, 'learning_rate': 7.852509388869922e-05, 'epoch': 8.03}


 80%|████████  | 2356/2930 [1:02:07<14:48,  1.55s/it]

{'loss': 0.0187, 'grad_norm': 0.20042690634727478, 'learning_rate': 7.838852850802322e-05, 'epoch': 8.03}


 80%|████████  | 2357/2930 [1:02:09<14:50,  1.55s/it]

{'loss': 0.0161, 'grad_norm': 0.18386220932006836, 'learning_rate': 7.825196312734721e-05, 'epoch': 8.04}


 80%|████████  | 2358/2930 [1:02:10<14:46,  1.55s/it]

{'loss': 0.0172, 'grad_norm': 0.22468554973602295, 'learning_rate': 7.811539774667123e-05, 'epoch': 8.04}


 81%|████████  | 2359/2930 [1:02:12<14:47,  1.55s/it]

{'loss': 0.0171, 'grad_norm': 0.20090486109256744, 'learning_rate': 7.797883236599523e-05, 'epoch': 8.04}


 81%|████████  | 2360/2930 [1:02:14<14:48,  1.56s/it]

{'loss': 0.0197, 'grad_norm': 0.2071441113948822, 'learning_rate': 7.784226698531922e-05, 'epoch': 8.05}


 81%|████████  | 2361/2930 [1:02:15<14:45,  1.56s/it]

{'loss': 0.0159, 'grad_norm': 0.20546920597553253, 'learning_rate': 7.770570160464323e-05, 'epoch': 8.05}


 81%|████████  | 2362/2930 [1:02:17<14:47,  1.56s/it]

{'loss': 0.0162, 'grad_norm': 0.2381221055984497, 'learning_rate': 7.756913622396724e-05, 'epoch': 8.05}


 81%|████████  | 2363/2930 [1:02:18<14:48,  1.57s/it]

{'loss': 0.0158, 'grad_norm': 0.22161512076854706, 'learning_rate': 7.743257084329123e-05, 'epoch': 8.06}


 81%|████████  | 2364/2930 [1:02:20<14:42,  1.56s/it]

{'loss': 0.0143, 'grad_norm': 0.22208011150360107, 'learning_rate': 7.729600546261522e-05, 'epoch': 8.06}


 81%|████████  | 2365/2930 [1:02:21<14:43,  1.56s/it]

{'loss': 0.0203, 'grad_norm': 0.21526797115802765, 'learning_rate': 7.715944008193923e-05, 'epoch': 8.06}


 81%|████████  | 2366/2930 [1:02:23<14:41,  1.56s/it]

{'loss': 0.0202, 'grad_norm': 0.23764793574810028, 'learning_rate': 7.702287470126324e-05, 'epoch': 8.07}


 81%|████████  | 2367/2930 [1:02:25<14:40,  1.56s/it]

{'loss': 0.0178, 'grad_norm': 0.2137763649225235, 'learning_rate': 7.688630932058723e-05, 'epoch': 8.07}


 81%|████████  | 2368/2930 [1:02:26<14:33,  1.55s/it]

{'loss': 0.0167, 'grad_norm': 0.2279406487941742, 'learning_rate': 7.674974393991124e-05, 'epoch': 8.08}


 81%|████████  | 2369/2930 [1:02:28<14:36,  1.56s/it]

{'loss': 0.0185, 'grad_norm': 0.2308296412229538, 'learning_rate': 7.661317855923523e-05, 'epoch': 8.08}


 81%|████████  | 2370/2930 [1:02:29<14:37,  1.57s/it]

{'loss': 0.0154, 'grad_norm': 0.21330255270004272, 'learning_rate': 7.647661317855924e-05, 'epoch': 8.08}


 81%|████████  | 2371/2930 [1:02:31<14:34,  1.56s/it]

{'loss': 0.0208, 'grad_norm': 0.254022479057312, 'learning_rate': 7.634004779788324e-05, 'epoch': 8.09}


 81%|████████  | 2372/2930 [1:02:32<14:34,  1.57s/it]

{'loss': 0.0175, 'grad_norm': 0.21939390897750854, 'learning_rate': 7.620348241720724e-05, 'epoch': 8.09}


 81%|████████  | 2373/2930 [1:02:34<14:34,  1.57s/it]

{'loss': 0.0155, 'grad_norm': 0.2075413465499878, 'learning_rate': 7.606691703653124e-05, 'epoch': 8.09}


 81%|████████  | 2374/2930 [1:02:35<14:27,  1.56s/it]

{'loss': 0.0161, 'grad_norm': 0.19935716688632965, 'learning_rate': 7.593035165585525e-05, 'epoch': 8.1}


 81%|████████  | 2375/2930 [1:02:37<14:28,  1.56s/it]

{'loss': 0.0174, 'grad_norm': 0.20441056787967682, 'learning_rate': 7.579378627517924e-05, 'epoch': 8.1}


 81%|████████  | 2376/2930 [1:02:39<14:26,  1.56s/it]

{'loss': 0.019, 'grad_norm': 0.2214910089969635, 'learning_rate': 7.565722089450325e-05, 'epoch': 8.1}


 81%|████████  | 2377/2930 [1:02:40<14:19,  1.55s/it]

{'loss': 0.0163, 'grad_norm': 0.1938783973455429, 'learning_rate': 7.552065551382724e-05, 'epoch': 8.11}


 81%|████████  | 2378/2930 [1:02:42<14:20,  1.56s/it]

{'loss': 0.019, 'grad_norm': 0.20865586400032043, 'learning_rate': 7.538409013315125e-05, 'epoch': 8.11}


 81%|████████  | 2379/2930 [1:02:43<14:18,  1.56s/it]

{'loss': 0.02, 'grad_norm': 0.24872682988643646, 'learning_rate': 7.524752475247526e-05, 'epoch': 8.11}


 81%|████████  | 2380/2930 [1:02:45<14:14,  1.55s/it]

{'loss': 0.013, 'grad_norm': 0.17605671286582947, 'learning_rate': 7.511095937179925e-05, 'epoch': 8.12}


 81%|████████▏ | 2381/2930 [1:02:46<14:16,  1.56s/it]

{'loss': 0.0191, 'grad_norm': 0.22805605828762054, 'learning_rate': 7.497439399112325e-05, 'epoch': 8.12}


 81%|████████▏ | 2382/2930 [1:02:48<14:11,  1.55s/it]

{'loss': 0.0203, 'grad_norm': 0.23339439928531647, 'learning_rate': 7.483782861044725e-05, 'epoch': 8.12}


 81%|████████▏ | 2383/2930 [1:02:49<14:09,  1.55s/it]

{'loss': 0.0179, 'grad_norm': 0.19994433224201202, 'learning_rate': 7.470126322977126e-05, 'epoch': 8.13}


 81%|████████▏ | 2384/2930 [1:02:51<14:12,  1.56s/it]

{'loss': 0.0189, 'grad_norm': 0.20757031440734863, 'learning_rate': 7.456469784909526e-05, 'epoch': 8.13}


 81%|████████▏ | 2385/2930 [1:02:53<14:08,  1.56s/it]

{'loss': 0.0131, 'grad_norm': 0.17207317054271698, 'learning_rate': 7.442813246841925e-05, 'epoch': 8.13}


 81%|████████▏ | 2386/2930 [1:02:54<14:06,  1.56s/it]

{'loss': 0.0154, 'grad_norm': 0.19673705101013184, 'learning_rate': 7.429156708774326e-05, 'epoch': 8.14}


 81%|████████▏ | 2387/2930 [1:02:56<14:07,  1.56s/it]

{'loss': 0.0166, 'grad_norm': 0.2120162844657898, 'learning_rate': 7.415500170706726e-05, 'epoch': 8.14}


 82%|████████▏ | 2388/2930 [1:02:57<14:02,  1.55s/it]

{'loss': 0.016, 'grad_norm': 0.20580337941646576, 'learning_rate': 7.401843632639126e-05, 'epoch': 8.14}


 82%|████████▏ | 2389/2930 [1:02:59<13:58,  1.55s/it]

{'loss': 0.02, 'grad_norm': 0.2420387715101242, 'learning_rate': 7.388187094571527e-05, 'epoch': 8.15}


 82%|████████▏ | 2390/2930 [1:03:00<13:55,  1.55s/it]

{'loss': 0.0146, 'grad_norm': 0.21077635884284973, 'learning_rate': 7.374530556503927e-05, 'epoch': 8.15}


 82%|████████▏ | 2391/2930 [1:03:02<13:54,  1.55s/it]

{'loss': 0.0168, 'grad_norm': 0.189570352435112, 'learning_rate': 7.360874018436327e-05, 'epoch': 8.15}


 82%|████████▏ | 2392/2930 [1:03:03<13:56,  1.56s/it]

{'loss': 0.0156, 'grad_norm': 0.20352774858474731, 'learning_rate': 7.347217480368726e-05, 'epoch': 8.16}


 82%|████████▏ | 2393/2930 [1:03:05<13:54,  1.55s/it]

{'loss': 0.018, 'grad_norm': 0.20836496353149414, 'learning_rate': 7.333560942301127e-05, 'epoch': 8.16}


 82%|████████▏ | 2394/2930 [1:03:07<13:56,  1.56s/it]

{'loss': 0.0184, 'grad_norm': 0.26222267746925354, 'learning_rate': 7.319904404233528e-05, 'epoch': 8.16}


 82%|████████▏ | 2395/2930 [1:03:08<13:52,  1.56s/it]

{'loss': 0.0163, 'grad_norm': 0.2223762422800064, 'learning_rate': 7.306247866165927e-05, 'epoch': 8.17}


 82%|████████▏ | 2396/2930 [1:03:10<13:51,  1.56s/it]

{'loss': 0.02, 'grad_norm': 0.25957968831062317, 'learning_rate': 7.292591328098327e-05, 'epoch': 8.17}


 82%|████████▏ | 2397/2930 [1:03:11<13:52,  1.56s/it]

{'loss': 0.019, 'grad_norm': 0.22944994270801544, 'learning_rate': 7.278934790030729e-05, 'epoch': 8.17}


 82%|████████▏ | 2398/2930 [1:03:13<13:53,  1.57s/it]

{'loss': 0.0172, 'grad_norm': 0.2012891173362732, 'learning_rate': 7.265278251963128e-05, 'epoch': 8.18}


 82%|████████▏ | 2399/2930 [1:03:14<13:53,  1.57s/it]

{'loss': 0.0179, 'grad_norm': 0.22114719450473785, 'learning_rate': 7.251621713895528e-05, 'epoch': 8.18}


 82%|████████▏ | 2400/2930 [1:03:16<13:47,  1.56s/it]

{'loss': 0.0139, 'grad_norm': 0.16718392074108124, 'learning_rate': 7.237965175827927e-05, 'epoch': 8.18}


 82%|████████▏ | 2401/2930 [1:03:18<13:42,  1.55s/it]

{'loss': 0.0165, 'grad_norm': 0.2156258374452591, 'learning_rate': 7.224308637760329e-05, 'epoch': 8.19}


 82%|████████▏ | 2402/2930 [1:03:19<13:43,  1.56s/it]

{'loss': 0.02, 'grad_norm': 0.21486471593379974, 'learning_rate': 7.210652099692729e-05, 'epoch': 8.19}


 82%|████████▏ | 2403/2930 [1:03:21<13:45,  1.57s/it]

{'loss': 0.018, 'grad_norm': 0.21331094205379486, 'learning_rate': 7.196995561625128e-05, 'epoch': 8.19}


 82%|████████▏ | 2404/2930 [1:03:22<13:42,  1.56s/it]

{'loss': 0.0197, 'grad_norm': 0.2406088262796402, 'learning_rate': 7.183339023557529e-05, 'epoch': 8.2}


 82%|████████▏ | 2405/2930 [1:03:24<13:44,  1.57s/it]

{'loss': 0.0157, 'grad_norm': 0.22052335739135742, 'learning_rate': 7.16968248548993e-05, 'epoch': 8.2}


 82%|████████▏ | 2406/2930 [1:03:25<13:42,  1.57s/it]

{'loss': 0.0165, 'grad_norm': 0.22093649208545685, 'learning_rate': 7.156025947422329e-05, 'epoch': 8.2}


 82%|████████▏ | 2407/2930 [1:03:27<13:46,  1.58s/it]

{'loss': 0.0161, 'grad_norm': 0.20835091173648834, 'learning_rate': 7.142369409354728e-05, 'epoch': 8.21}


 82%|████████▏ | 2408/2930 [1:03:29<13:47,  1.59s/it]

{'loss': 0.0165, 'grad_norm': 0.19621515274047852, 'learning_rate': 7.128712871287129e-05, 'epoch': 8.21}


 82%|████████▏ | 2409/2930 [1:03:30<13:47,  1.59s/it]

{'loss': 0.0171, 'grad_norm': 0.2003609538078308, 'learning_rate': 7.11505633321953e-05, 'epoch': 8.21}


 82%|████████▏ | 2410/2930 [1:03:32<13:42,  1.58s/it]

{'loss': 0.0192, 'grad_norm': 0.22765663266181946, 'learning_rate': 7.101399795151929e-05, 'epoch': 8.22}


 82%|████████▏ | 2411/2930 [1:03:33<13:43,  1.59s/it]

{'loss': 0.0165, 'grad_norm': 0.18966488540172577, 'learning_rate': 7.08774325708433e-05, 'epoch': 8.22}


 82%|████████▏ | 2412/2930 [1:03:35<13:38,  1.58s/it]

{'loss': 0.0152, 'grad_norm': 0.15152248740196228, 'learning_rate': 7.07408671901673e-05, 'epoch': 8.23}


 82%|████████▏ | 2413/2930 [1:03:36<13:38,  1.58s/it]

{'loss': 0.0141, 'grad_norm': 0.19389940798282623, 'learning_rate': 7.06043018094913e-05, 'epoch': 8.23}


 82%|████████▏ | 2414/2930 [1:03:38<13:35,  1.58s/it]

{'loss': 0.0151, 'grad_norm': 0.1753150224685669, 'learning_rate': 7.04677364288153e-05, 'epoch': 8.23}


 82%|████████▏ | 2415/2930 [1:03:40<13:32,  1.58s/it]

{'loss': 0.0176, 'grad_norm': 0.2222563475370407, 'learning_rate': 7.03311710481393e-05, 'epoch': 8.24}


 82%|████████▏ | 2416/2930 [1:03:41<13:29,  1.58s/it]

{'loss': 0.0176, 'grad_norm': 0.24354074895381927, 'learning_rate': 7.01946056674633e-05, 'epoch': 8.24}


 82%|████████▏ | 2417/2930 [1:03:43<13:25,  1.57s/it]

{'loss': 0.0209, 'grad_norm': 0.23841829597949982, 'learning_rate': 7.00580402867873e-05, 'epoch': 8.24}


 83%|████████▎ | 2418/2930 [1:03:44<13:24,  1.57s/it]

{'loss': 0.0164, 'grad_norm': 0.19319485127925873, 'learning_rate': 6.99214749061113e-05, 'epoch': 8.25}


 83%|████████▎ | 2419/2930 [1:03:46<13:28,  1.58s/it]

{'loss': 0.0197, 'grad_norm': 0.23750893771648407, 'learning_rate': 6.978490952543531e-05, 'epoch': 8.25}


 83%|████████▎ | 2420/2930 [1:03:48<13:28,  1.58s/it]

{'loss': 0.0173, 'grad_norm': 0.22068406641483307, 'learning_rate': 6.96483441447593e-05, 'epoch': 8.25}


 83%|████████▎ | 2421/2930 [1:03:49<13:30,  1.59s/it]

{'loss': 0.0184, 'grad_norm': 0.22028790414333344, 'learning_rate': 6.951177876408331e-05, 'epoch': 8.26}


 83%|████████▎ | 2422/2930 [1:03:51<13:22,  1.58s/it]

{'loss': 0.0207, 'grad_norm': 0.2490527480840683, 'learning_rate': 6.937521338340732e-05, 'epoch': 8.26}


 83%|████████▎ | 2423/2930 [1:03:52<13:16,  1.57s/it]

{'loss': 0.0172, 'grad_norm': 0.21904541552066803, 'learning_rate': 6.923864800273131e-05, 'epoch': 8.26}


 83%|████████▎ | 2424/2930 [1:03:54<13:14,  1.57s/it]

{'loss': 0.0173, 'grad_norm': 0.23018255829811096, 'learning_rate': 6.91020826220553e-05, 'epoch': 8.27}


 83%|████████▎ | 2425/2930 [1:03:55<13:15,  1.58s/it]

{'loss': 0.0205, 'grad_norm': 0.2356296181678772, 'learning_rate': 6.896551724137931e-05, 'epoch': 8.27}


 83%|████████▎ | 2426/2930 [1:03:57<13:15,  1.58s/it]

{'loss': 0.0197, 'grad_norm': 0.22183749079704285, 'learning_rate': 6.882895186070332e-05, 'epoch': 8.27}


 83%|████████▎ | 2427/2930 [1:03:59<13:16,  1.58s/it]

{'loss': 0.0183, 'grad_norm': 0.20237404108047485, 'learning_rate': 6.869238648002732e-05, 'epoch': 8.28}


 83%|████████▎ | 2428/2930 [1:04:00<13:16,  1.59s/it]

{'loss': 0.0197, 'grad_norm': 0.20415186882019043, 'learning_rate': 6.855582109935131e-05, 'epoch': 8.28}


 83%|████████▎ | 2429/2930 [1:04:02<13:13,  1.58s/it]

{'loss': 0.0164, 'grad_norm': 0.19491279125213623, 'learning_rate': 6.841925571867532e-05, 'epoch': 8.28}


 83%|████████▎ | 2430/2930 [1:04:03<13:09,  1.58s/it]

{'loss': 0.0153, 'grad_norm': 0.20251969993114471, 'learning_rate': 6.828269033799932e-05, 'epoch': 8.29}


 83%|████████▎ | 2431/2930 [1:04:05<13:06,  1.58s/it]

{'loss': 0.0152, 'grad_norm': 0.1668357104063034, 'learning_rate': 6.814612495732332e-05, 'epoch': 8.29}


 83%|████████▎ | 2432/2930 [1:04:06<13:08,  1.58s/it]

{'loss': 0.0182, 'grad_norm': 0.2019345611333847, 'learning_rate': 6.800955957664731e-05, 'epoch': 8.29}


 83%|████████▎ | 2433/2930 [1:04:08<13:07,  1.58s/it]

{'loss': 0.0185, 'grad_norm': 0.20142580568790436, 'learning_rate': 6.787299419597133e-05, 'epoch': 8.3}


 83%|████████▎ | 2434/2930 [1:04:10<13:04,  1.58s/it]

{'loss': 0.0212, 'grad_norm': 0.23531223833560944, 'learning_rate': 6.773642881529533e-05, 'epoch': 8.3}


 83%|████████▎ | 2435/2930 [1:04:11<13:04,  1.59s/it]

{'loss': 0.0151, 'grad_norm': 0.16994990408420563, 'learning_rate': 6.759986343461932e-05, 'epoch': 8.3}


 83%|████████▎ | 2436/2930 [1:04:13<13:01,  1.58s/it]

{'loss': 0.0171, 'grad_norm': 0.20785453915596008, 'learning_rate': 6.746329805394333e-05, 'epoch': 8.31}


 83%|████████▎ | 2437/2930 [1:04:14<13:01,  1.58s/it]

{'loss': 0.0197, 'grad_norm': 0.21879398822784424, 'learning_rate': 6.732673267326734e-05, 'epoch': 8.31}


 83%|████████▎ | 2438/2930 [1:04:16<13:00,  1.59s/it]

{'loss': 0.0168, 'grad_norm': 0.19171275198459625, 'learning_rate': 6.719016729259133e-05, 'epoch': 8.31}


 83%|████████▎ | 2439/2930 [1:04:18<12:57,  1.58s/it]

{'loss': 0.016, 'grad_norm': 0.20365343987941742, 'learning_rate': 6.705360191191533e-05, 'epoch': 8.32}


 83%|████████▎ | 2440/2930 [1:04:19<12:57,  1.59s/it]

{'loss': 0.0143, 'grad_norm': 0.159261092543602, 'learning_rate': 6.691703653123933e-05, 'epoch': 8.32}


 83%|████████▎ | 2441/2930 [1:04:21<12:53,  1.58s/it]

{'loss': 0.0192, 'grad_norm': 0.22062376141548157, 'learning_rate': 6.678047115056334e-05, 'epoch': 8.32}


 83%|████████▎ | 2442/2930 [1:04:22<12:53,  1.59s/it]

{'loss': 0.0156, 'grad_norm': 0.18407124280929565, 'learning_rate': 6.664390576988734e-05, 'epoch': 8.33}


 83%|████████▎ | 2443/2930 [1:04:24<12:52,  1.59s/it]

{'loss': 0.0174, 'grad_norm': 0.20627817511558533, 'learning_rate': 6.650734038921134e-05, 'epoch': 8.33}


 83%|████████▎ | 2444/2930 [1:04:25<12:44,  1.57s/it]

{'loss': 0.0169, 'grad_norm': 0.22257013618946075, 'learning_rate': 6.637077500853534e-05, 'epoch': 8.33}


 83%|████████▎ | 2445/2930 [1:04:27<12:41,  1.57s/it]

{'loss': 0.0188, 'grad_norm': 0.20513437688350677, 'learning_rate': 6.623420962785935e-05, 'epoch': 8.34}


 83%|████████▎ | 2446/2930 [1:04:29<12:44,  1.58s/it]

{'loss': 0.0171, 'grad_norm': 0.21644487977027893, 'learning_rate': 6.609764424718334e-05, 'epoch': 8.34}


 84%|████████▎ | 2447/2930 [1:04:30<12:46,  1.59s/it]

{'loss': 0.0139, 'grad_norm': 0.1784965991973877, 'learning_rate': 6.596107886650735e-05, 'epoch': 8.34}


 84%|████████▎ | 2448/2930 [1:04:32<12:41,  1.58s/it]

{'loss': 0.017, 'grad_norm': 0.19113273918628693, 'learning_rate': 6.582451348583134e-05, 'epoch': 8.35}


 84%|████████▎ | 2449/2930 [1:04:33<12:40,  1.58s/it]

{'loss': 0.0185, 'grad_norm': 0.20311619341373444, 'learning_rate': 6.568794810515535e-05, 'epoch': 8.35}


 84%|████████▎ | 2450/2930 [1:04:35<12:39,  1.58s/it]

{'loss': 0.0195, 'grad_norm': 0.23524455726146698, 'learning_rate': 6.555138272447934e-05, 'epoch': 8.35}


 84%|████████▎ | 2451/2930 [1:04:37<12:37,  1.58s/it]

{'loss': 0.016, 'grad_norm': 0.1940624862909317, 'learning_rate': 6.541481734380335e-05, 'epoch': 8.36}


 84%|████████▎ | 2452/2930 [1:04:38<12:37,  1.59s/it]

{'loss': 0.0172, 'grad_norm': 0.22081884741783142, 'learning_rate': 6.527825196312734e-05, 'epoch': 8.36}


 84%|████████▎ | 2453/2930 [1:04:40<12:36,  1.59s/it]

{'loss': 0.0167, 'grad_norm': 0.20526407659053802, 'learning_rate': 6.514168658245135e-05, 'epoch': 8.36}


 84%|████████▍ | 2454/2930 [1:04:41<12:34,  1.59s/it]

{'loss': 0.0163, 'grad_norm': 0.18753810226917267, 'learning_rate': 6.500512120177536e-05, 'epoch': 8.37}


 84%|████████▍ | 2455/2930 [1:04:43<12:28,  1.58s/it]

{'loss': 0.0183, 'grad_norm': 0.23173849284648895, 'learning_rate': 6.486855582109935e-05, 'epoch': 8.37}


 84%|████████▍ | 2456/2930 [1:04:44<12:25,  1.57s/it]

{'loss': 0.0155, 'grad_norm': 0.20426930487155914, 'learning_rate': 6.473199044042335e-05, 'epoch': 8.38}


 84%|████████▍ | 2457/2930 [1:04:46<12:23,  1.57s/it]

{'loss': 0.0183, 'grad_norm': 0.19822223484516144, 'learning_rate': 6.459542505974736e-05, 'epoch': 8.38}


 84%|████████▍ | 2458/2930 [1:04:48<12:17,  1.56s/it]

{'loss': 0.0185, 'grad_norm': 0.19195839762687683, 'learning_rate': 6.445885967907136e-05, 'epoch': 8.38}


 84%|████████▍ | 2459/2930 [1:04:49<12:15,  1.56s/it]

{'loss': 0.015, 'grad_norm': 0.21623797714710236, 'learning_rate': 6.432229429839536e-05, 'epoch': 8.39}


 84%|████████▍ | 2460/2930 [1:04:51<12:12,  1.56s/it]

{'loss': 0.0195, 'grad_norm': 0.255292147397995, 'learning_rate': 6.418572891771935e-05, 'epoch': 8.39}


 84%|████████▍ | 2461/2930 [1:04:52<12:13,  1.56s/it]

{'loss': 0.0199, 'grad_norm': 0.21989211440086365, 'learning_rate': 6.404916353704336e-05, 'epoch': 8.39}


 84%|████████▍ | 2462/2930 [1:04:54<12:13,  1.57s/it]

{'loss': 0.0139, 'grad_norm': 0.15191733837127686, 'learning_rate': 6.391259815636737e-05, 'epoch': 8.4}


 84%|████████▍ | 2463/2930 [1:04:55<12:12,  1.57s/it]

{'loss': 0.0196, 'grad_norm': 0.21850983798503876, 'learning_rate': 6.377603277569136e-05, 'epoch': 8.4}


 84%|████████▍ | 2464/2930 [1:04:57<12:05,  1.56s/it]

{'loss': 0.0166, 'grad_norm': 0.20536170899868011, 'learning_rate': 6.363946739501536e-05, 'epoch': 8.4}


 84%|████████▍ | 2465/2930 [1:04:58<12:02,  1.55s/it]

{'loss': 0.0166, 'grad_norm': 0.1925705075263977, 'learning_rate': 6.350290201433938e-05, 'epoch': 8.41}


 84%|████████▍ | 2466/2930 [1:05:00<11:58,  1.55s/it]

{'loss': 0.0192, 'grad_norm': 0.22329506278038025, 'learning_rate': 6.336633663366337e-05, 'epoch': 8.41}


 84%|████████▍ | 2467/2930 [1:05:02<12:01,  1.56s/it]

{'loss': 0.0142, 'grad_norm': 0.19403992593288422, 'learning_rate': 6.322977125298737e-05, 'epoch': 8.41}


 84%|████████▍ | 2468/2930 [1:05:03<11:57,  1.55s/it]

{'loss': 0.0218, 'grad_norm': 0.22090627253055573, 'learning_rate': 6.309320587231137e-05, 'epoch': 8.42}


 84%|████████▍ | 2469/2930 [1:05:05<11:57,  1.56s/it]

{'loss': 0.0172, 'grad_norm': 0.2282877415418625, 'learning_rate': 6.295664049163538e-05, 'epoch': 8.42}


 84%|████████▍ | 2470/2930 [1:05:06<11:56,  1.56s/it]

{'loss': 0.0149, 'grad_norm': 0.21483731269836426, 'learning_rate': 6.282007511095937e-05, 'epoch': 8.42}


 84%|████████▍ | 2471/2930 [1:05:08<11:56,  1.56s/it]

{'loss': 0.0205, 'grad_norm': 0.284654825925827, 'learning_rate': 6.268350973028337e-05, 'epoch': 8.43}


 84%|████████▍ | 2472/2930 [1:05:09<11:57,  1.57s/it]

{'loss': 0.0149, 'grad_norm': 0.20202788710594177, 'learning_rate': 6.254694434960738e-05, 'epoch': 8.43}


 84%|████████▍ | 2473/2930 [1:05:11<11:50,  1.56s/it]

{'loss': 0.0158, 'grad_norm': 0.20955075323581696, 'learning_rate': 6.241037896893138e-05, 'epoch': 8.43}


 84%|████████▍ | 2474/2930 [1:05:13<11:52,  1.56s/it]

{'loss': 0.0179, 'grad_norm': 0.22902925312519073, 'learning_rate': 6.227381358825538e-05, 'epoch': 8.44}


 84%|████████▍ | 2475/2930 [1:05:14<11:51,  1.56s/it]

{'loss': 0.0189, 'grad_norm': 0.2277541607618332, 'learning_rate': 6.213724820757937e-05, 'epoch': 8.44}


 85%|████████▍ | 2476/2930 [1:05:16<11:44,  1.55s/it]

{'loss': 0.0164, 'grad_norm': 0.273553729057312, 'learning_rate': 6.200068282690338e-05, 'epoch': 8.44}


 85%|████████▍ | 2477/2930 [1:05:17<11:46,  1.56s/it]

{'loss': 0.0145, 'grad_norm': 0.18187665939331055, 'learning_rate': 6.186411744622739e-05, 'epoch': 8.45}


 85%|████████▍ | 2478/2930 [1:05:19<11:45,  1.56s/it]

{'loss': 0.0177, 'grad_norm': 0.2099970281124115, 'learning_rate': 6.172755206555138e-05, 'epoch': 8.45}


 85%|████████▍ | 2479/2930 [1:05:20<11:44,  1.56s/it]

{'loss': 0.0164, 'grad_norm': 0.18695132434368134, 'learning_rate': 6.159098668487539e-05, 'epoch': 8.45}


 85%|████████▍ | 2480/2930 [1:05:22<11:41,  1.56s/it]

{'loss': 0.0192, 'grad_norm': 0.24347150325775146, 'learning_rate': 6.145442130419938e-05, 'epoch': 8.46}


 85%|████████▍ | 2481/2930 [1:05:23<11:34,  1.55s/it]

{'loss': 0.0154, 'grad_norm': 0.20585975050926208, 'learning_rate': 6.131785592352339e-05, 'epoch': 8.46}


 85%|████████▍ | 2482/2930 [1:05:25<11:29,  1.54s/it]

{'loss': 0.0136, 'grad_norm': 0.1914815753698349, 'learning_rate': 6.118129054284739e-05, 'epoch': 8.46}


 85%|████████▍ | 2483/2930 [1:05:26<11:31,  1.55s/it]

{'loss': 0.0204, 'grad_norm': 0.240316703915596, 'learning_rate': 6.10447251621714e-05, 'epoch': 8.47}


 85%|████████▍ | 2484/2930 [1:05:28<11:32,  1.55s/it]

{'loss': 0.0167, 'grad_norm': 0.2126767486333847, 'learning_rate': 6.090815978149539e-05, 'epoch': 8.47}


 85%|████████▍ | 2485/2930 [1:05:30<11:35,  1.56s/it]

{'loss': 0.0181, 'grad_norm': 0.18487387895584106, 'learning_rate': 6.0771594400819395e-05, 'epoch': 8.47}


 85%|████████▍ | 2486/2930 [1:05:31<11:35,  1.57s/it]

{'loss': 0.0178, 'grad_norm': 0.20138633251190186, 'learning_rate': 6.0635029020143396e-05, 'epoch': 8.48}


 85%|████████▍ | 2487/2930 [1:05:33<11:32,  1.56s/it]

{'loss': 0.0216, 'grad_norm': 0.24494865536689758, 'learning_rate': 6.04984636394674e-05, 'epoch': 8.48}


 85%|████████▍ | 2488/2930 [1:05:34<11:32,  1.57s/it]

{'loss': 0.0198, 'grad_norm': 0.25057873129844666, 'learning_rate': 6.036189825879139e-05, 'epoch': 8.48}


 85%|████████▍ | 2489/2930 [1:05:36<11:28,  1.56s/it]

{'loss': 0.0172, 'grad_norm': 0.19596639275550842, 'learning_rate': 6.0225332878115406e-05, 'epoch': 8.49}


 85%|████████▍ | 2490/2930 [1:05:37<11:26,  1.56s/it]

{'loss': 0.0167, 'grad_norm': 0.20021145045757294, 'learning_rate': 6.00887674974394e-05, 'epoch': 8.49}


 85%|████████▌ | 2491/2930 [1:05:39<11:25,  1.56s/it]

{'loss': 0.0163, 'grad_norm': 0.21047012507915497, 'learning_rate': 5.99522021167634e-05, 'epoch': 8.49}


 85%|████████▌ | 2492/2930 [1:05:41<11:23,  1.56s/it]

{'loss': 0.016, 'grad_norm': 0.19733397662639618, 'learning_rate': 5.98156367360874e-05, 'epoch': 8.5}


 85%|████████▌ | 2493/2930 [1:05:42<11:23,  1.56s/it]

{'loss': 0.0217, 'grad_norm': 0.21515029668807983, 'learning_rate': 5.967907135541141e-05, 'epoch': 8.5}


 85%|████████▌ | 2494/2930 [1:05:44<11:22,  1.57s/it]

{'loss': 0.0191, 'grad_norm': 0.22450917959213257, 'learning_rate': 5.954250597473541e-05, 'epoch': 8.5}


 85%|████████▌ | 2495/2930 [1:05:45<11:19,  1.56s/it]

{'loss': 0.0169, 'grad_norm': 0.19581271708011627, 'learning_rate': 5.9405940594059404e-05, 'epoch': 8.51}


 85%|████████▌ | 2496/2930 [1:05:47<11:12,  1.55s/it]

{'loss': 0.0146, 'grad_norm': 0.18827277421951294, 'learning_rate': 5.9269375213383405e-05, 'epoch': 8.51}


 85%|████████▌ | 2497/2930 [1:05:48<11:11,  1.55s/it]

{'loss': 0.0172, 'grad_norm': 0.22556494176387787, 'learning_rate': 5.913280983270741e-05, 'epoch': 8.51}


 85%|████████▌ | 2498/2930 [1:05:50<11:11,  1.55s/it]

{'loss': 0.0176, 'grad_norm': 0.18838681280612946, 'learning_rate': 5.8996244452031414e-05, 'epoch': 8.52}


 85%|████████▌ | 2499/2930 [1:05:51<11:08,  1.55s/it]

{'loss': 0.0186, 'grad_norm': 0.22715900838375092, 'learning_rate': 5.885967907135541e-05, 'epoch': 8.52}


 85%|████████▌ | 2500/2930 [1:05:53<11:05,  1.55s/it]

{'loss': 0.02, 'grad_norm': 0.25708696246147156, 'learning_rate': 5.872311369067942e-05, 'epoch': 8.53}


wandb: Adding directory to artifact (./outputs/checkpoint-2500)... Done. 0.1s
 85%|████████▌ | 2501/2930 [1:05:57<16:58,  2.37s/it]

{'loss': 0.0151, 'grad_norm': 0.17770379781723022, 'learning_rate': 5.858654831000342e-05, 'epoch': 8.53}


 85%|████████▌ | 2502/2930 [1:05:59<15:09,  2.12s/it]

{'loss': 0.0157, 'grad_norm': 0.17522071301937103, 'learning_rate': 5.844998292932742e-05, 'epoch': 8.53}


 85%|████████▌ | 2503/2930 [1:06:00<13:56,  1.96s/it]

{'loss': 0.0168, 'grad_norm': 0.18549050390720367, 'learning_rate': 5.831341754865142e-05, 'epoch': 8.54}


 85%|████████▌ | 2504/2930 [1:06:02<13:04,  1.84s/it]

{'loss': 0.0171, 'grad_norm': 0.20558308064937592, 'learning_rate': 5.8176852167975426e-05, 'epoch': 8.54}


 85%|████████▌ | 2505/2930 [1:06:04<12:28,  1.76s/it]

{'loss': 0.0132, 'grad_norm': 0.2206801474094391, 'learning_rate': 5.804028678729943e-05, 'epoch': 8.54}


 86%|████████▌ | 2506/2930 [1:06:05<12:03,  1.71s/it]

{'loss': 0.0166, 'grad_norm': 0.21058624982833862, 'learning_rate': 5.790372140662342e-05, 'epoch': 8.55}


 86%|████████▌ | 2507/2930 [1:06:07<11:42,  1.66s/it]

{'loss': 0.0174, 'grad_norm': 0.2055584192276001, 'learning_rate': 5.776715602594742e-05, 'epoch': 8.55}


 86%|████████▌ | 2508/2930 [1:06:08<11:30,  1.64s/it]

{'loss': 0.0177, 'grad_norm': 0.19709627330303192, 'learning_rate': 5.763059064527143e-05, 'epoch': 8.55}


 86%|████████▌ | 2509/2930 [1:06:10<11:20,  1.62s/it]

{'loss': 0.0185, 'grad_norm': 0.19493818283081055, 'learning_rate': 5.749402526459543e-05, 'epoch': 8.56}


 86%|████████▌ | 2510/2930 [1:06:11<11:12,  1.60s/it]

{'loss': 0.0226, 'grad_norm': 0.27261167764663696, 'learning_rate': 5.735745988391943e-05, 'epoch': 8.56}


 86%|████████▌ | 2511/2930 [1:06:13<11:04,  1.59s/it]

{'loss': 0.0177, 'grad_norm': 0.19949258863925934, 'learning_rate': 5.7220894503243426e-05, 'epoch': 8.56}


 86%|████████▌ | 2512/2930 [1:06:14<10:55,  1.57s/it]

{'loss': 0.0177, 'grad_norm': 0.21264292299747467, 'learning_rate': 5.708432912256744e-05, 'epoch': 8.57}


 86%|████████▌ | 2513/2930 [1:06:16<10:54,  1.57s/it]

{'loss': 0.0194, 'grad_norm': 0.23924608528614044, 'learning_rate': 5.6947763741891435e-05, 'epoch': 8.57}


 86%|████████▌ | 2514/2930 [1:06:18<10:48,  1.56s/it]

{'loss': 0.0175, 'grad_norm': 0.218159481883049, 'learning_rate': 5.6811198361215435e-05, 'epoch': 8.57}


 86%|████████▌ | 2515/2930 [1:06:19<10:50,  1.57s/it]

{'loss': 0.0202, 'grad_norm': 0.21500244736671448, 'learning_rate': 5.667463298053943e-05, 'epoch': 8.58}


 86%|████████▌ | 2516/2930 [1:06:21<10:49,  1.57s/it]

{'loss': 0.0151, 'grad_norm': 0.16433298587799072, 'learning_rate': 5.6538067599863444e-05, 'epoch': 8.58}


 86%|████████▌ | 2517/2930 [1:06:22<10:47,  1.57s/it]

{'loss': 0.0169, 'grad_norm': 0.1903216689825058, 'learning_rate': 5.640150221918744e-05, 'epoch': 8.58}


 86%|████████▌ | 2518/2930 [1:06:24<10:45,  1.57s/it]

{'loss': 0.015, 'grad_norm': 0.18245896697044373, 'learning_rate': 5.626493683851144e-05, 'epoch': 8.59}


 86%|████████▌ | 2519/2930 [1:06:25<10:45,  1.57s/it]

{'loss': 0.0152, 'grad_norm': 0.19380053877830505, 'learning_rate': 5.612837145783544e-05, 'epoch': 8.59}


 86%|████████▌ | 2520/2930 [1:06:27<10:43,  1.57s/it]

{'loss': 0.0167, 'grad_norm': 0.19304195046424866, 'learning_rate': 5.599180607715945e-05, 'epoch': 8.59}


 86%|████████▌ | 2521/2930 [1:06:29<10:42,  1.57s/it]

{'loss': 0.0133, 'grad_norm': 0.15598535537719727, 'learning_rate': 5.585524069648345e-05, 'epoch': 8.6}


 86%|████████▌ | 2522/2930 [1:06:30<10:40,  1.57s/it]

{'loss': 0.0158, 'grad_norm': 0.19944098591804504, 'learning_rate': 5.571867531580744e-05, 'epoch': 8.6}


 86%|████████▌ | 2523/2930 [1:06:32<10:39,  1.57s/it]

{'loss': 0.0192, 'grad_norm': 0.21955814957618713, 'learning_rate': 5.5582109935131444e-05, 'epoch': 8.6}


 86%|████████▌ | 2524/2930 [1:06:33<10:37,  1.57s/it]

{'loss': 0.0155, 'grad_norm': 0.20104509592056274, 'learning_rate': 5.544554455445545e-05, 'epoch': 8.61}


 86%|████████▌ | 2525/2930 [1:06:35<10:30,  1.56s/it]

{'loss': 0.016, 'grad_norm': 0.212450310587883, 'learning_rate': 5.530897917377945e-05, 'epoch': 8.61}


 86%|████████▌ | 2526/2930 [1:06:36<10:24,  1.55s/it]

{'loss': 0.0196, 'grad_norm': 0.22907894849777222, 'learning_rate': 5.517241379310345e-05, 'epoch': 8.61}


 86%|████████▌ | 2527/2930 [1:06:38<10:23,  1.55s/it]

{'loss': 0.0144, 'grad_norm': 0.16643783450126648, 'learning_rate': 5.503584841242745e-05, 'epoch': 8.62}


 86%|████████▋ | 2528/2930 [1:06:39<10:23,  1.55s/it]

{'loss': 0.018, 'grad_norm': 0.206962451338768, 'learning_rate': 5.4899283031751455e-05, 'epoch': 8.62}


 86%|████████▋ | 2529/2930 [1:06:41<10:25,  1.56s/it]

{'loss': 0.0204, 'grad_norm': 0.2279236763715744, 'learning_rate': 5.4762717651075456e-05, 'epoch': 8.62}


 86%|████████▋ | 2530/2930 [1:06:43<10:23,  1.56s/it]

{'loss': 0.0167, 'grad_norm': 0.22189795970916748, 'learning_rate': 5.462615227039946e-05, 'epoch': 8.63}


 86%|████████▋ | 2531/2930 [1:06:44<10:20,  1.56s/it]

{'loss': 0.0154, 'grad_norm': 0.18627209961414337, 'learning_rate': 5.448958688972345e-05, 'epoch': 8.63}


 86%|████████▋ | 2532/2930 [1:06:46<10:19,  1.56s/it]

{'loss': 0.0152, 'grad_norm': 0.184022456407547, 'learning_rate': 5.4353021509047466e-05, 'epoch': 8.63}


 86%|████████▋ | 2533/2930 [1:06:47<10:23,  1.57s/it]

{'loss': 0.0159, 'grad_norm': 0.17705896496772766, 'learning_rate': 5.421645612837146e-05, 'epoch': 8.64}


 86%|████████▋ | 2534/2930 [1:06:49<10:27,  1.58s/it]

{'loss': 0.0172, 'grad_norm': 0.21381133794784546, 'learning_rate': 5.407989074769546e-05, 'epoch': 8.64}


 87%|████████▋ | 2535/2930 [1:06:50<10:25,  1.58s/it]

{'loss': 0.0149, 'grad_norm': 0.17816904187202454, 'learning_rate': 5.394332536701946e-05, 'epoch': 8.64}


 87%|████████▋ | 2536/2930 [1:06:52<10:25,  1.59s/it]

{'loss': 0.0166, 'grad_norm': 0.18807171285152435, 'learning_rate': 5.380675998634347e-05, 'epoch': 8.65}


 87%|████████▋ | 2537/2930 [1:06:54<10:20,  1.58s/it]

{'loss': 0.0181, 'grad_norm': 0.22814840078353882, 'learning_rate': 5.367019460566747e-05, 'epoch': 8.65}


 87%|████████▋ | 2538/2930 [1:06:55<10:19,  1.58s/it]

{'loss': 0.0156, 'grad_norm': 0.1909162849187851, 'learning_rate': 5.3533629224991464e-05, 'epoch': 8.65}


 87%|████████▋ | 2539/2930 [1:06:57<10:16,  1.58s/it]

{'loss': 0.0176, 'grad_norm': 0.1959623545408249, 'learning_rate': 5.3397063844315465e-05, 'epoch': 8.66}


 87%|████████▋ | 2540/2930 [1:06:58<10:14,  1.57s/it]

{'loss': 0.0186, 'grad_norm': 0.24294477701187134, 'learning_rate': 5.326049846363947e-05, 'epoch': 8.66}


 87%|████████▋ | 2541/2930 [1:07:00<10:12,  1.57s/it]

{'loss': 0.0179, 'grad_norm': 0.2351587414741516, 'learning_rate': 5.3123933082963474e-05, 'epoch': 8.66}


 87%|████████▋ | 2542/2930 [1:07:02<10:12,  1.58s/it]

{'loss': 0.0161, 'grad_norm': 0.1824256032705307, 'learning_rate': 5.2987367702287475e-05, 'epoch': 8.67}


 87%|████████▋ | 2543/2930 [1:07:03<10:09,  1.58s/it]

{'loss': 0.0175, 'grad_norm': 0.21158169209957123, 'learning_rate': 5.285080232161147e-05, 'epoch': 8.67}


 87%|████████▋ | 2544/2930 [1:07:05<10:09,  1.58s/it]

{'loss': 0.014, 'grad_norm': 0.17485617101192474, 'learning_rate': 5.2714236940935477e-05, 'epoch': 8.68}


 87%|████████▋ | 2545/2930 [1:07:06<10:08,  1.58s/it]

{'loss': 0.0176, 'grad_norm': 0.1949247121810913, 'learning_rate': 5.257767156025948e-05, 'epoch': 8.68}


 87%|████████▋ | 2546/2930 [1:07:08<10:06,  1.58s/it]

{'loss': 0.0221, 'grad_norm': 0.24534134566783905, 'learning_rate': 5.244110617958348e-05, 'epoch': 8.68}


 87%|████████▋ | 2547/2930 [1:07:09<10:04,  1.58s/it]

{'loss': 0.0138, 'grad_norm': 0.1818705052137375, 'learning_rate': 5.230454079890747e-05, 'epoch': 8.69}


 87%|████████▋ | 2548/2930 [1:07:11<10:04,  1.58s/it]

{'loss': 0.0189, 'grad_norm': 0.21936415135860443, 'learning_rate': 5.216797541823149e-05, 'epoch': 8.69}


 87%|████████▋ | 2549/2930 [1:07:13<10:03,  1.58s/it]

{'loss': 0.0153, 'grad_norm': 0.17195753753185272, 'learning_rate': 5.203141003755548e-05, 'epoch': 8.69}


 87%|████████▋ | 2550/2930 [1:07:14<10:04,  1.59s/it]

{'loss': 0.0155, 'grad_norm': 0.19176048040390015, 'learning_rate': 5.189484465687948e-05, 'epoch': 8.7}


 87%|████████▋ | 2551/2930 [1:07:16<10:02,  1.59s/it]

{'loss': 0.0207, 'grad_norm': 0.21999438107013702, 'learning_rate': 5.175827927620348e-05, 'epoch': 8.7}


 87%|████████▋ | 2552/2930 [1:07:17<09:56,  1.58s/it]

{'loss': 0.0134, 'grad_norm': 0.1675901710987091, 'learning_rate': 5.162171389552749e-05, 'epoch': 8.7}


 87%|████████▋ | 2553/2930 [1:07:19<09:58,  1.59s/it]

{'loss': 0.0157, 'grad_norm': 0.18687105178833008, 'learning_rate': 5.148514851485149e-05, 'epoch': 8.71}


 87%|████████▋ | 2554/2930 [1:07:21<09:55,  1.58s/it]

{'loss': 0.0174, 'grad_norm': 0.21878845989704132, 'learning_rate': 5.1348583134175486e-05, 'epoch': 8.71}


 87%|████████▋ | 2555/2930 [1:07:22<09:51,  1.58s/it]

{'loss': 0.0179, 'grad_norm': 0.227687269449234, 'learning_rate': 5.121201775349949e-05, 'epoch': 8.71}


 87%|████████▋ | 2556/2930 [1:07:24<09:51,  1.58s/it]

{'loss': 0.0163, 'grad_norm': 0.21815595030784607, 'learning_rate': 5.1075452372823494e-05, 'epoch': 8.72}


 87%|████████▋ | 2557/2930 [1:07:25<09:51,  1.59s/it]

{'loss': 0.0178, 'grad_norm': 0.1998915672302246, 'learning_rate': 5.0938886992147495e-05, 'epoch': 8.72}


 87%|████████▋ | 2558/2930 [1:07:27<09:47,  1.58s/it]

{'loss': 0.0188, 'grad_norm': 0.21957559883594513, 'learning_rate': 5.0802321611471496e-05, 'epoch': 8.72}


 87%|████████▋ | 2559/2930 [1:07:28<09:43,  1.57s/it]

{'loss': 0.0178, 'grad_norm': 0.209043487906456, 'learning_rate': 5.066575623079549e-05, 'epoch': 8.73}


 87%|████████▋ | 2560/2930 [1:07:30<09:44,  1.58s/it]

{'loss': 0.0177, 'grad_norm': 0.20775513350963593, 'learning_rate': 5.05291908501195e-05, 'epoch': 8.73}


 87%|████████▋ | 2561/2930 [1:07:32<09:41,  1.58s/it]

{'loss': 0.0141, 'grad_norm': 0.17835769057273865, 'learning_rate': 5.03926254694435e-05, 'epoch': 8.73}


 87%|████████▋ | 2562/2930 [1:07:33<09:41,  1.58s/it]

{'loss': 0.0136, 'grad_norm': 0.1963476538658142, 'learning_rate': 5.02560600887675e-05, 'epoch': 8.74}


 87%|████████▋ | 2563/2930 [1:07:35<09:40,  1.58s/it]

{'loss': 0.0173, 'grad_norm': 0.196073979139328, 'learning_rate': 5.0119494708091494e-05, 'epoch': 8.74}


 88%|████████▊ | 2564/2930 [1:07:36<09:41,  1.59s/it]

{'loss': 0.0177, 'grad_norm': 0.2241547852754593, 'learning_rate': 4.99829293274155e-05, 'epoch': 8.74}


 88%|████████▊ | 2565/2930 [1:07:38<09:37,  1.58s/it]

{'loss': 0.0176, 'grad_norm': 0.2221212387084961, 'learning_rate': 4.98463639467395e-05, 'epoch': 8.75}


 88%|████████▊ | 2566/2930 [1:07:39<09:34,  1.58s/it]

{'loss': 0.0133, 'grad_norm': 0.18844296038150787, 'learning_rate': 4.9709798566063503e-05, 'epoch': 8.75}


 88%|████████▊ | 2567/2930 [1:07:41<09:33,  1.58s/it]

{'loss': 0.0203, 'grad_norm': 0.28310608863830566, 'learning_rate': 4.9573233185387504e-05, 'epoch': 8.75}


 88%|████████▊ | 2568/2930 [1:07:43<09:32,  1.58s/it]

{'loss': 0.0221, 'grad_norm': 0.26982787251472473, 'learning_rate': 4.9436667804711505e-05, 'epoch': 8.76}


 88%|████████▊ | 2569/2930 [1:07:44<09:31,  1.58s/it]

{'loss': 0.0183, 'grad_norm': 0.2290879338979721, 'learning_rate': 4.930010242403551e-05, 'epoch': 8.76}


 88%|████████▊ | 2570/2930 [1:07:46<09:31,  1.59s/it]

{'loss': 0.0186, 'grad_norm': 0.21110478043556213, 'learning_rate': 4.916353704335951e-05, 'epoch': 8.76}


 88%|████████▊ | 2571/2930 [1:07:47<09:28,  1.58s/it]

{'loss': 0.014, 'grad_norm': 0.169353187084198, 'learning_rate': 4.9026971662683515e-05, 'epoch': 8.77}


 88%|████████▊ | 2572/2930 [1:07:49<09:28,  1.59s/it]

{'loss': 0.0181, 'grad_norm': 0.20035098493099213, 'learning_rate': 4.889040628200751e-05, 'epoch': 8.77}


 88%|████████▊ | 2573/2930 [1:07:51<09:25,  1.58s/it]

{'loss': 0.0161, 'grad_norm': 0.18463513255119324, 'learning_rate': 4.875384090133152e-05, 'epoch': 8.77}


 88%|████████▊ | 2574/2930 [1:07:52<09:21,  1.58s/it]

{'loss': 0.0168, 'grad_norm': 0.20803330838680267, 'learning_rate': 4.861727552065551e-05, 'epoch': 8.78}


 88%|████████▊ | 2575/2930 [1:07:54<09:19,  1.58s/it]

{'loss': 0.0174, 'grad_norm': 0.2212296575307846, 'learning_rate': 4.848071013997952e-05, 'epoch': 8.78}


 88%|████████▊ | 2576/2930 [1:07:55<09:12,  1.56s/it]

{'loss': 0.0144, 'grad_norm': 0.16271540522575378, 'learning_rate': 4.834414475930352e-05, 'epoch': 8.78}


 88%|████████▊ | 2577/2930 [1:07:57<09:15,  1.57s/it]

{'loss': 0.0189, 'grad_norm': 0.18849311769008636, 'learning_rate': 4.820757937862752e-05, 'epoch': 8.79}


 88%|████████▊ | 2578/2930 [1:07:58<09:16,  1.58s/it]

{'loss': 0.016, 'grad_norm': 0.18052808940410614, 'learning_rate': 4.807101399795152e-05, 'epoch': 8.79}


 88%|████████▊ | 2579/2930 [1:08:00<09:17,  1.59s/it]

{'loss': 0.0207, 'grad_norm': 0.2525709867477417, 'learning_rate': 4.793444861727552e-05, 'epoch': 8.79}


 88%|████████▊ | 2580/2930 [1:08:02<09:15,  1.59s/it]

{'loss': 0.0174, 'grad_norm': 0.19996149837970734, 'learning_rate': 4.779788323659952e-05, 'epoch': 8.8}


 88%|████████▊ | 2581/2930 [1:08:03<09:14,  1.59s/it]

{'loss': 0.0175, 'grad_norm': 0.21950271725654602, 'learning_rate': 4.7661317855923524e-05, 'epoch': 8.8}


 88%|████████▊ | 2582/2930 [1:08:05<09:14,  1.59s/it]

{'loss': 0.0197, 'grad_norm': 0.22744278609752655, 'learning_rate': 4.7524752475247525e-05, 'epoch': 8.8}


 88%|████████▊ | 2583/2930 [1:08:06<09:11,  1.59s/it]

{'loss': 0.0205, 'grad_norm': 0.31832900643348694, 'learning_rate': 4.7388187094571526e-05, 'epoch': 8.81}


 88%|████████▊ | 2584/2930 [1:08:08<09:10,  1.59s/it]

{'loss': 0.0147, 'grad_norm': 0.19742737710475922, 'learning_rate': 4.725162171389553e-05, 'epoch': 8.81}


 88%|████████▊ | 2585/2930 [1:08:10<09:10,  1.60s/it]

{'loss': 0.0187, 'grad_norm': 0.21233059465885162, 'learning_rate': 4.7115056333219534e-05, 'epoch': 8.82}


 88%|████████▊ | 2586/2930 [1:08:11<09:03,  1.58s/it]

{'loss': 0.0199, 'grad_norm': 0.24110502004623413, 'learning_rate': 4.697849095254353e-05, 'epoch': 8.82}


 88%|████████▊ | 2587/2930 [1:08:13<09:01,  1.58s/it]

{'loss': 0.0194, 'grad_norm': 0.23460885882377625, 'learning_rate': 4.6841925571867536e-05, 'epoch': 8.82}


 88%|████████▊ | 2588/2930 [1:08:14<08:53,  1.56s/it]

{'loss': 0.0185, 'grad_norm': 0.2247346043586731, 'learning_rate': 4.670536019119153e-05, 'epoch': 8.83}


 88%|████████▊ | 2589/2930 [1:08:16<08:53,  1.56s/it]

{'loss': 0.0201, 'grad_norm': 0.21536828577518463, 'learning_rate': 4.656879481051554e-05, 'epoch': 8.83}


 88%|████████▊ | 2590/2930 [1:08:17<08:48,  1.56s/it]

{'loss': 0.0176, 'grad_norm': 0.19759660959243774, 'learning_rate': 4.643222942983953e-05, 'epoch': 8.83}


 88%|████████▊ | 2591/2930 [1:08:19<08:46,  1.55s/it]

{'loss': 0.0176, 'grad_norm': 0.2070324420928955, 'learning_rate': 4.629566404916354e-05, 'epoch': 8.84}


 88%|████████▊ | 2592/2930 [1:08:20<08:42,  1.55s/it]

{'loss': 0.0179, 'grad_norm': 0.2109810709953308, 'learning_rate': 4.615909866848754e-05, 'epoch': 8.84}


 88%|████████▊ | 2593/2930 [1:08:22<08:39,  1.54s/it]

{'loss': 0.0167, 'grad_norm': 0.20839695632457733, 'learning_rate': 4.602253328781154e-05, 'epoch': 8.84}


 89%|████████▊ | 2594/2930 [1:08:24<08:40,  1.55s/it]

{'loss': 0.0143, 'grad_norm': 0.20449373126029968, 'learning_rate': 4.588596790713554e-05, 'epoch': 8.85}


 89%|████████▊ | 2595/2930 [1:08:25<08:38,  1.55s/it]

{'loss': 0.0157, 'grad_norm': 0.1968080699443817, 'learning_rate': 4.5749402526459544e-05, 'epoch': 8.85}


 89%|████████▊ | 2596/2930 [1:08:27<08:39,  1.56s/it]

{'loss': 0.0187, 'grad_norm': 0.21847806870937347, 'learning_rate': 4.5612837145783545e-05, 'epoch': 8.85}


 89%|████████▊ | 2597/2930 [1:08:28<08:36,  1.55s/it]

{'loss': 0.0185, 'grad_norm': 0.25053954124450684, 'learning_rate': 4.5476271765107545e-05, 'epoch': 8.86}


 89%|████████▊ | 2598/2930 [1:08:30<08:37,  1.56s/it]

{'loss': 0.0175, 'grad_norm': 0.20481467247009277, 'learning_rate': 4.5339706384431546e-05, 'epoch': 8.86}


 89%|████████▊ | 2599/2930 [1:08:31<08:34,  1.56s/it]

{'loss': 0.0151, 'grad_norm': 0.19731950759887695, 'learning_rate': 4.520314100375555e-05, 'epoch': 8.86}


 89%|████████▊ | 2600/2930 [1:08:33<08:34,  1.56s/it]

{'loss': 0.0177, 'grad_norm': 0.2084892988204956, 'learning_rate': 4.506657562307955e-05, 'epoch': 8.87}


 89%|████████▉ | 2601/2930 [1:08:34<08:30,  1.55s/it]

{'loss': 0.0157, 'grad_norm': 0.22856755554676056, 'learning_rate': 4.4930010242403556e-05, 'epoch': 8.87}


 89%|████████▉ | 2602/2930 [1:08:36<08:27,  1.55s/it]

{'loss': 0.0152, 'grad_norm': 0.19129103422164917, 'learning_rate': 4.479344486172755e-05, 'epoch': 8.87}


 89%|████████▉ | 2603/2930 [1:08:37<08:26,  1.55s/it]

{'loss': 0.0147, 'grad_norm': 0.21987129747867584, 'learning_rate': 4.465687948105156e-05, 'epoch': 8.88}


 89%|████████▉ | 2604/2930 [1:08:39<08:27,  1.56s/it]

{'loss': 0.0187, 'grad_norm': 0.22310861945152283, 'learning_rate': 4.452031410037555e-05, 'epoch': 8.88}


 89%|████████▉ | 2605/2930 [1:08:41<08:27,  1.56s/it]

{'loss': 0.0182, 'grad_norm': 0.24365930259227753, 'learning_rate': 4.438374871969956e-05, 'epoch': 8.88}


 89%|████████▉ | 2606/2930 [1:08:42<08:26,  1.56s/it]

{'loss': 0.0142, 'grad_norm': 0.2251519113779068, 'learning_rate': 4.424718333902356e-05, 'epoch': 8.89}


 89%|████████▉ | 2607/2930 [1:08:44<08:24,  1.56s/it]

{'loss': 0.0152, 'grad_norm': 0.2034931182861328, 'learning_rate': 4.411061795834756e-05, 'epoch': 8.89}


 89%|████████▉ | 2608/2930 [1:08:45<08:24,  1.57s/it]

{'loss': 0.0185, 'grad_norm': 0.2388840615749359, 'learning_rate': 4.397405257767156e-05, 'epoch': 8.89}


 89%|████████▉ | 2609/2930 [1:08:47<08:23,  1.57s/it]

{'loss': 0.0209, 'grad_norm': 0.20518845319747925, 'learning_rate': 4.383748719699556e-05, 'epoch': 8.9}


 89%|████████▉ | 2610/2930 [1:08:49<08:23,  1.57s/it]

{'loss': 0.0198, 'grad_norm': 0.22213906049728394, 'learning_rate': 4.370092181631957e-05, 'epoch': 8.9}


 89%|████████▉ | 2611/2930 [1:08:50<08:21,  1.57s/it]

{'loss': 0.0229, 'grad_norm': 0.23038914799690247, 'learning_rate': 4.3564356435643565e-05, 'epoch': 8.9}


 89%|████████▉ | 2612/2930 [1:08:52<08:17,  1.56s/it]

{'loss': 0.015, 'grad_norm': 0.18831497430801392, 'learning_rate': 4.342779105496757e-05, 'epoch': 8.91}


 89%|████████▉ | 2613/2930 [1:08:53<08:12,  1.55s/it]

{'loss': 0.0142, 'grad_norm': 0.1865576207637787, 'learning_rate': 4.329122567429157e-05, 'epoch': 8.91}


 89%|████████▉ | 2614/2930 [1:08:55<08:10,  1.55s/it]

{'loss': 0.0177, 'grad_norm': 0.2023814469575882, 'learning_rate': 4.3154660293615575e-05, 'epoch': 8.91}


 89%|████████▉ | 2615/2930 [1:08:56<08:10,  1.56s/it]

{'loss': 0.0193, 'grad_norm': 0.24396316707134247, 'learning_rate': 4.301809491293957e-05, 'epoch': 8.92}


 89%|████████▉ | 2616/2930 [1:08:58<08:07,  1.55s/it]

{'loss': 0.0151, 'grad_norm': 0.19413617253303528, 'learning_rate': 4.2881529532263576e-05, 'epoch': 8.92}


 89%|████████▉ | 2617/2930 [1:08:59<08:01,  1.54s/it]

{'loss': 0.0145, 'grad_norm': 0.18117225170135498, 'learning_rate': 4.274496415158758e-05, 'epoch': 8.92}


 89%|████████▉ | 2618/2930 [1:09:01<08:01,  1.54s/it]

{'loss': 0.0155, 'grad_norm': 0.19099964201450348, 'learning_rate': 4.260839877091158e-05, 'epoch': 8.93}


 89%|████████▉ | 2619/2930 [1:09:02<07:58,  1.54s/it]

{'loss': 0.0178, 'grad_norm': 0.20121164619922638, 'learning_rate': 4.247183339023558e-05, 'epoch': 8.93}


 89%|████████▉ | 2620/2930 [1:09:04<07:59,  1.55s/it]

{'loss': 0.0172, 'grad_norm': 0.2272365391254425, 'learning_rate': 4.233526800955958e-05, 'epoch': 8.93}


 89%|████████▉ | 2621/2930 [1:09:06<07:58,  1.55s/it]

{'loss': 0.0188, 'grad_norm': 0.20548401772975922, 'learning_rate': 4.219870262888358e-05, 'epoch': 8.94}


 89%|████████▉ | 2622/2930 [1:09:07<07:58,  1.55s/it]

{'loss': 0.0192, 'grad_norm': 0.2299867868423462, 'learning_rate': 4.206213724820758e-05, 'epoch': 8.94}


 90%|████████▉ | 2623/2930 [1:09:09<07:57,  1.56s/it]

{'loss': 0.0165, 'grad_norm': 0.20901189744472504, 'learning_rate': 4.192557186753158e-05, 'epoch': 8.94}


 90%|████████▉ | 2624/2930 [1:09:10<07:57,  1.56s/it]

{'loss': 0.0184, 'grad_norm': 0.2003057599067688, 'learning_rate': 4.1789006486855584e-05, 'epoch': 8.95}


 90%|████████▉ | 2625/2930 [1:09:12<07:57,  1.56s/it]

{'loss': 0.0197, 'grad_norm': 0.22887325286865234, 'learning_rate': 4.1652441106179585e-05, 'epoch': 8.95}


 90%|████████▉ | 2626/2930 [1:09:13<07:56,  1.57s/it]

{'loss': 0.0184, 'grad_norm': 0.22406283020973206, 'learning_rate': 4.151587572550359e-05, 'epoch': 8.95}


 90%|████████▉ | 2627/2930 [1:09:15<07:55,  1.57s/it]

{'loss': 0.0153, 'grad_norm': 0.19365453720092773, 'learning_rate': 4.1379310344827587e-05, 'epoch': 8.96}


 90%|████████▉ | 2628/2930 [1:09:17<07:54,  1.57s/it]

{'loss': 0.0186, 'grad_norm': 0.19243140518665314, 'learning_rate': 4.1242744964151594e-05, 'epoch': 8.96}


 90%|████████▉ | 2629/2930 [1:09:18<07:49,  1.56s/it]

{'loss': 0.0158, 'grad_norm': 0.16453222930431366, 'learning_rate': 4.110617958347559e-05, 'epoch': 8.97}


 90%|████████▉ | 2630/2930 [1:09:20<07:46,  1.56s/it]

{'loss': 0.0171, 'grad_norm': 0.2017916440963745, 'learning_rate': 4.0969614202799596e-05, 'epoch': 8.97}


 90%|████████▉ | 2631/2930 [1:09:21<07:42,  1.55s/it]

{'loss': 0.0151, 'grad_norm': 0.22694534063339233, 'learning_rate': 4.083304882212359e-05, 'epoch': 8.97}


 90%|████████▉ | 2632/2930 [1:09:23<07:40,  1.55s/it]

{'loss': 0.0201, 'grad_norm': 0.20597103238105774, 'learning_rate': 4.06964834414476e-05, 'epoch': 8.98}


 90%|████████▉ | 2633/2930 [1:09:24<07:38,  1.54s/it]

{'loss': 0.0206, 'grad_norm': 0.23351241648197174, 'learning_rate': 4.05599180607716e-05, 'epoch': 8.98}


 90%|████████▉ | 2634/2930 [1:09:26<07:38,  1.55s/it]

{'loss': 0.0188, 'grad_norm': 0.21232862770557404, 'learning_rate': 4.04233526800956e-05, 'epoch': 8.98}


 90%|████████▉ | 2635/2930 [1:09:27<07:36,  1.55s/it]

{'loss': 0.0178, 'grad_norm': 0.19300341606140137, 'learning_rate': 4.02867872994196e-05, 'epoch': 8.99}


 90%|████████▉ | 2636/2930 [1:09:29<07:32,  1.54s/it]

{'loss': 0.0165, 'grad_norm': 0.18855145573616028, 'learning_rate': 4.01502219187436e-05, 'epoch': 8.99}


 90%|█████████ | 2637/2930 [1:09:30<07:32,  1.54s/it]

{'loss': 0.0128, 'grad_norm': 0.17967556416988373, 'learning_rate': 4.00136565380676e-05, 'epoch': 8.99}


 90%|█████████ | 2638/2930 [1:09:32<07:27,  1.53s/it]

{'loss': 0.0149, 'grad_norm': 0.17172251641750336, 'learning_rate': 3.9877091157391603e-05, 'epoch': 9.0}


 90%|█████████ | 2639/2930 [1:09:33<07:28,  1.54s/it]

{'loss': 0.0134, 'grad_norm': 0.18023595213890076, 'learning_rate': 3.9740525776715604e-05, 'epoch': 9.0}


 90%|█████████ | 2640/2930 [1:09:35<07:27,  1.54s/it]

{'loss': 0.0123, 'grad_norm': 0.1359195113182068, 'learning_rate': 3.9603960396039605e-05, 'epoch': 9.0}


 90%|█████████ | 2641/2930 [1:09:37<07:28,  1.55s/it]

{'loss': 0.0121, 'grad_norm': 0.14611971378326416, 'learning_rate': 3.9467395015363606e-05, 'epoch': 9.01}


 90%|█████████ | 2642/2930 [1:09:38<07:24,  1.54s/it]

{'loss': 0.0118, 'grad_norm': 0.1378476619720459, 'learning_rate': 3.933082963468761e-05, 'epoch': 9.01}


 90%|█████████ | 2643/2930 [1:09:40<07:25,  1.55s/it]

{'loss': 0.0124, 'grad_norm': 0.14747954905033112, 'learning_rate': 3.919426425401161e-05, 'epoch': 9.01}


 90%|█████████ | 2644/2930 [1:09:41<07:26,  1.56s/it]

{'loss': 0.0108, 'grad_norm': 0.12317822128534317, 'learning_rate': 3.9057698873335616e-05, 'epoch': 9.02}


 90%|█████████ | 2645/2930 [1:09:43<07:25,  1.56s/it]

{'loss': 0.0107, 'grad_norm': 0.11727199703454971, 'learning_rate': 3.892113349265961e-05, 'epoch': 9.02}


 90%|█████████ | 2646/2930 [1:09:44<07:22,  1.56s/it]

{'loss': 0.0103, 'grad_norm': 0.13068914413452148, 'learning_rate': 3.878456811198362e-05, 'epoch': 9.02}


 90%|█████████ | 2647/2930 [1:09:46<07:20,  1.56s/it]

{'loss': 0.011, 'grad_norm': 0.12066185474395752, 'learning_rate': 3.864800273130761e-05, 'epoch': 9.03}


 90%|█████████ | 2648/2930 [1:09:47<07:20,  1.56s/it]

{'loss': 0.0116, 'grad_norm': 0.149842768907547, 'learning_rate': 3.851143735063162e-05, 'epoch': 9.03}


 90%|█████████ | 2649/2930 [1:09:49<07:20,  1.57s/it]

{'loss': 0.0116, 'grad_norm': 0.13331569731235504, 'learning_rate': 3.837487196995562e-05, 'epoch': 9.03}


 90%|█████████ | 2650/2930 [1:09:51<07:19,  1.57s/it]

{'loss': 0.0089, 'grad_norm': 0.1299351304769516, 'learning_rate': 3.823830658927962e-05, 'epoch': 9.04}


 90%|█████████ | 2651/2930 [1:09:52<07:18,  1.57s/it]

{'loss': 0.0103, 'grad_norm': 0.1482771933078766, 'learning_rate': 3.810174120860362e-05, 'epoch': 9.04}


 91%|█████████ | 2652/2930 [1:09:54<07:14,  1.56s/it]

{'loss': 0.0099, 'grad_norm': 0.1374061107635498, 'learning_rate': 3.796517582792762e-05, 'epoch': 9.04}


 91%|█████████ | 2653/2930 [1:09:55<07:13,  1.56s/it]

{'loss': 0.0098, 'grad_norm': 0.13652139902114868, 'learning_rate': 3.7828610447251624e-05, 'epoch': 9.05}


 91%|█████████ | 2654/2930 [1:09:57<07:07,  1.55s/it]

{'loss': 0.0115, 'grad_norm': 0.17120978236198425, 'learning_rate': 3.7692045066575625e-05, 'epoch': 9.05}


 91%|█████████ | 2655/2930 [1:09:58<07:08,  1.56s/it]

{'loss': 0.0115, 'grad_norm': 0.14538441598415375, 'learning_rate': 3.7555479685899626e-05, 'epoch': 9.05}


 91%|█████████ | 2656/2930 [1:10:00<07:08,  1.56s/it]

{'loss': 0.009, 'grad_norm': 0.12020476907491684, 'learning_rate': 3.741891430522363e-05, 'epoch': 9.06}


 91%|█████████ | 2657/2930 [1:10:02<07:06,  1.56s/it]

{'loss': 0.01, 'grad_norm': 0.13811321556568146, 'learning_rate': 3.728234892454763e-05, 'epoch': 9.06}


 91%|█████████ | 2658/2930 [1:10:03<07:06,  1.57s/it]

{'loss': 0.0101, 'grad_norm': 0.1345093548297882, 'learning_rate': 3.714578354387163e-05, 'epoch': 9.06}


 91%|█████████ | 2659/2930 [1:10:05<07:03,  1.56s/it]

{'loss': 0.0084, 'grad_norm': 0.13357670605182648, 'learning_rate': 3.700921816319563e-05, 'epoch': 9.07}


 91%|█████████ | 2660/2930 [1:10:06<07:01,  1.56s/it]

{'loss': 0.0105, 'grad_norm': 0.16367831826210022, 'learning_rate': 3.687265278251964e-05, 'epoch': 9.07}


 91%|█████████ | 2661/2930 [1:10:08<06:57,  1.55s/it]

{'loss': 0.0094, 'grad_norm': 0.15173928439617157, 'learning_rate': 3.673608740184363e-05, 'epoch': 9.07}


 91%|█████████ | 2662/2930 [1:10:09<06:55,  1.55s/it]

{'loss': 0.0106, 'grad_norm': 0.1502644419670105, 'learning_rate': 3.659952202116764e-05, 'epoch': 9.08}


 91%|█████████ | 2663/2930 [1:10:11<06:55,  1.56s/it]

{'loss': 0.0127, 'grad_norm': 0.16586074233055115, 'learning_rate': 3.646295664049163e-05, 'epoch': 9.08}


 91%|█████████ | 2664/2930 [1:10:12<06:55,  1.56s/it]

{'loss': 0.0102, 'grad_norm': 0.1572737991809845, 'learning_rate': 3.632639125981564e-05, 'epoch': 9.08}


 91%|█████████ | 2665/2930 [1:10:14<06:54,  1.56s/it]

{'loss': 0.0113, 'grad_norm': 0.1714312881231308, 'learning_rate': 3.6189825879139635e-05, 'epoch': 9.09}


 91%|█████████ | 2666/2930 [1:10:16<06:54,  1.57s/it]

{'loss': 0.0118, 'grad_norm': 0.16454842686653137, 'learning_rate': 3.605326049846364e-05, 'epoch': 9.09}


 91%|█████████ | 2667/2930 [1:10:17<06:51,  1.56s/it]

{'loss': 0.0086, 'grad_norm': 0.1323021948337555, 'learning_rate': 3.5916695117787644e-05, 'epoch': 9.09}


 91%|█████████ | 2668/2930 [1:10:19<06:50,  1.57s/it]

{'loss': 0.0096, 'grad_norm': 0.15449416637420654, 'learning_rate': 3.5780129737111644e-05, 'epoch': 9.1}


 91%|█████████ | 2669/2930 [1:10:20<06:49,  1.57s/it]

{'loss': 0.0096, 'grad_norm': 0.15241608023643494, 'learning_rate': 3.5643564356435645e-05, 'epoch': 9.1}


 91%|█████████ | 2670/2930 [1:10:22<06:48,  1.57s/it]

{'loss': 0.011, 'grad_norm': 0.1495332270860672, 'learning_rate': 3.5506998975759646e-05, 'epoch': 9.1}


 91%|█████████ | 2671/2930 [1:10:23<06:48,  1.58s/it]

{'loss': 0.0099, 'grad_norm': 0.1659526377916336, 'learning_rate': 3.537043359508365e-05, 'epoch': 9.11}


 91%|█████████ | 2672/2930 [1:10:25<06:48,  1.58s/it]

{'loss': 0.0112, 'grad_norm': 0.16414868831634521, 'learning_rate': 3.523386821440765e-05, 'epoch': 9.11}


 91%|█████████ | 2673/2930 [1:10:27<06:45,  1.58s/it]

{'loss': 0.0122, 'grad_norm': 0.16692544519901276, 'learning_rate': 3.509730283373165e-05, 'epoch': 9.12}


 91%|█████████▏| 2674/2930 [1:10:28<06:43,  1.58s/it]

{'loss': 0.0131, 'grad_norm': 0.16756483912467957, 'learning_rate': 3.496073745305565e-05, 'epoch': 9.12}


 91%|█████████▏| 2675/2930 [1:10:30<06:41,  1.57s/it]

{'loss': 0.0109, 'grad_norm': 0.1446222960948944, 'learning_rate': 3.482417207237965e-05, 'epoch': 9.12}


 91%|█████████▏| 2676/2930 [1:10:31<06:39,  1.57s/it]

{'loss': 0.0088, 'grad_norm': 0.1361384093761444, 'learning_rate': 3.468760669170366e-05, 'epoch': 9.13}


 91%|█████████▏| 2677/2930 [1:10:33<06:35,  1.56s/it]

{'loss': 0.0116, 'grad_norm': 0.15101118385791779, 'learning_rate': 3.455104131102765e-05, 'epoch': 9.13}


 91%|█████████▏| 2678/2930 [1:10:34<06:35,  1.57s/it]

{'loss': 0.0103, 'grad_norm': 0.1560632437467575, 'learning_rate': 3.441447593035166e-05, 'epoch': 9.13}


 91%|█████████▏| 2679/2930 [1:10:36<06:33,  1.57s/it]

{'loss': 0.0108, 'grad_norm': 0.15312784910202026, 'learning_rate': 3.4277910549675655e-05, 'epoch': 9.14}


 91%|█████████▏| 2680/2930 [1:10:38<06:31,  1.56s/it]

{'loss': 0.0122, 'grad_norm': 0.16934987902641296, 'learning_rate': 3.414134516899966e-05, 'epoch': 9.14}


 92%|█████████▏| 2681/2930 [1:10:39<06:31,  1.57s/it]

{'loss': 0.0112, 'grad_norm': 0.2022218108177185, 'learning_rate': 3.4004779788323656e-05, 'epoch': 9.14}


 92%|█████████▏| 2682/2930 [1:10:41<06:31,  1.58s/it]

{'loss': 0.0135, 'grad_norm': 0.19973355531692505, 'learning_rate': 3.3868214407647664e-05, 'epoch': 9.15}


 92%|█████████▏| 2683/2930 [1:10:42<06:26,  1.57s/it]

{'loss': 0.0091, 'grad_norm': 0.11735522747039795, 'learning_rate': 3.3731649026971665e-05, 'epoch': 9.15}


 92%|█████████▏| 2684/2930 [1:10:44<06:24,  1.56s/it]

{'loss': 0.0102, 'grad_norm': 0.1563248336315155, 'learning_rate': 3.3595083646295666e-05, 'epoch': 9.15}


 92%|█████████▏| 2685/2930 [1:10:45<06:24,  1.57s/it]

{'loss': 0.0104, 'grad_norm': 0.13901738822460175, 'learning_rate': 3.345851826561967e-05, 'epoch': 9.16}


 92%|█████████▏| 2686/2930 [1:10:47<06:24,  1.58s/it]

{'loss': 0.011, 'grad_norm': 0.1436186134815216, 'learning_rate': 3.332195288494367e-05, 'epoch': 9.16}


 92%|█████████▏| 2687/2930 [1:10:49<06:25,  1.58s/it]

{'loss': 0.0096, 'grad_norm': 0.16661424934864044, 'learning_rate': 3.318538750426767e-05, 'epoch': 9.16}


 92%|█████████▏| 2688/2930 [1:10:50<06:23,  1.59s/it]

{'loss': 0.0101, 'grad_norm': 0.14666692912578583, 'learning_rate': 3.304882212359167e-05, 'epoch': 9.17}


 92%|█████████▏| 2689/2930 [1:10:52<06:22,  1.59s/it]

{'loss': 0.0121, 'grad_norm': 0.18978825211524963, 'learning_rate': 3.291225674291567e-05, 'epoch': 9.17}


 92%|█████████▏| 2690/2930 [1:10:53<06:21,  1.59s/it]

{'loss': 0.0098, 'grad_norm': 0.14087922871112823, 'learning_rate': 3.277569136223967e-05, 'epoch': 9.17}


 92%|█████████▏| 2691/2930 [1:10:55<06:19,  1.59s/it]

{'loss': 0.0101, 'grad_norm': 0.15074659883975983, 'learning_rate': 3.263912598156367e-05, 'epoch': 9.18}


 92%|█████████▏| 2692/2930 [1:10:57<06:18,  1.59s/it]

{'loss': 0.0113, 'grad_norm': 0.147142231464386, 'learning_rate': 3.250256060088768e-05, 'epoch': 9.18}


 92%|█████████▏| 2693/2930 [1:10:58<06:11,  1.57s/it]

{'loss': 0.0086, 'grad_norm': 0.11834585666656494, 'learning_rate': 3.2365995220211674e-05, 'epoch': 9.18}


 92%|█████████▏| 2694/2930 [1:11:00<06:10,  1.57s/it]

{'loss': 0.0107, 'grad_norm': 0.1522262990474701, 'learning_rate': 3.222942983953568e-05, 'epoch': 9.19}


 92%|█████████▏| 2695/2930 [1:11:01<06:11,  1.58s/it]

{'loss': 0.0102, 'grad_norm': 0.14788714051246643, 'learning_rate': 3.2092864458859676e-05, 'epoch': 9.19}


 92%|█████████▏| 2696/2930 [1:11:03<06:08,  1.57s/it]

{'loss': 0.0095, 'grad_norm': 0.16078168153762817, 'learning_rate': 3.1956299078183684e-05, 'epoch': 9.19}


 92%|█████████▏| 2697/2930 [1:11:04<06:08,  1.58s/it]

{'loss': 0.0118, 'grad_norm': 0.18082262575626373, 'learning_rate': 3.181973369750768e-05, 'epoch': 9.2}


 92%|█████████▏| 2698/2930 [1:11:06<06:03,  1.57s/it]

{'loss': 0.0094, 'grad_norm': 0.15738120675086975, 'learning_rate': 3.1683168316831686e-05, 'epoch': 9.2}


 92%|█████████▏| 2699/2930 [1:11:08<06:02,  1.57s/it]

{'loss': 0.012, 'grad_norm': 0.20361857116222382, 'learning_rate': 3.1546602936155686e-05, 'epoch': 9.2}


 92%|█████████▏| 2700/2930 [1:11:09<06:00,  1.57s/it]

{'loss': 0.011, 'grad_norm': 0.16893553733825684, 'learning_rate': 3.141003755547969e-05, 'epoch': 9.21}


 92%|█████████▏| 2701/2930 [1:11:11<06:00,  1.57s/it]

{'loss': 0.0096, 'grad_norm': 0.18624529242515564, 'learning_rate': 3.127347217480369e-05, 'epoch': 9.21}


 92%|█████████▏| 2702/2930 [1:11:12<05:57,  1.57s/it]

{'loss': 0.0086, 'grad_norm': 0.11239442229270935, 'learning_rate': 3.113690679412769e-05, 'epoch': 9.21}


 92%|█████████▏| 2703/2930 [1:11:14<05:57,  1.57s/it]

{'loss': 0.0105, 'grad_norm': 0.1402500867843628, 'learning_rate': 3.100034141345169e-05, 'epoch': 9.22}


 92%|█████████▏| 2704/2930 [1:11:15<05:53,  1.56s/it]

{'loss': 0.0102, 'grad_norm': 0.12792527675628662, 'learning_rate': 3.086377603277569e-05, 'epoch': 9.22}


 92%|█████████▏| 2705/2930 [1:11:17<05:53,  1.57s/it]

{'loss': 0.0087, 'grad_norm': 0.15770718455314636, 'learning_rate': 3.072721065209969e-05, 'epoch': 9.22}


 92%|█████████▏| 2706/2930 [1:11:19<05:54,  1.58s/it]

{'loss': 0.0101, 'grad_norm': 0.1399139016866684, 'learning_rate': 3.059064527142369e-05, 'epoch': 9.23}


 92%|█████████▏| 2707/2930 [1:11:20<05:54,  1.59s/it]

{'loss': 0.0108, 'grad_norm': 0.18225596845149994, 'learning_rate': 3.0454079890747694e-05, 'epoch': 9.23}


 92%|█████████▏| 2708/2930 [1:11:22<05:48,  1.57s/it]

{'loss': 0.0086, 'grad_norm': 0.1484624445438385, 'learning_rate': 3.0317514510071698e-05, 'epoch': 9.23}


 92%|█████████▏| 2709/2930 [1:11:23<05:45,  1.56s/it]

{'loss': 0.0093, 'grad_norm': 0.15361246466636658, 'learning_rate': 3.0180949129395696e-05, 'epoch': 9.24}


 92%|█████████▏| 2710/2930 [1:11:25<05:44,  1.56s/it]

{'loss': 0.0098, 'grad_norm': 0.14650008082389832, 'learning_rate': 3.00443837487197e-05, 'epoch': 9.24}


 93%|█████████▎| 2711/2930 [1:11:26<05:42,  1.56s/it]

{'loss': 0.009, 'grad_norm': 0.14307647943496704, 'learning_rate': 2.99078183680437e-05, 'epoch': 9.24}


 93%|█████████▎| 2712/2930 [1:11:28<05:41,  1.57s/it]

{'loss': 0.0127, 'grad_norm': 0.19794951379299164, 'learning_rate': 2.9771252987367705e-05, 'epoch': 9.25}


 93%|█████████▎| 2713/2930 [1:11:30<05:39,  1.57s/it]

{'loss': 0.0139, 'grad_norm': 0.21055129170417786, 'learning_rate': 2.9634687606691703e-05, 'epoch': 9.25}


 93%|█████████▎| 2714/2930 [1:11:31<05:39,  1.57s/it]

{'loss': 0.0113, 'grad_norm': 0.14587552845478058, 'learning_rate': 2.9498122226015707e-05, 'epoch': 9.25}


 93%|█████████▎| 2715/2930 [1:11:33<05:38,  1.57s/it]

{'loss': 0.0104, 'grad_norm': 0.16781798005104065, 'learning_rate': 2.936155684533971e-05, 'epoch': 9.26}


 93%|█████████▎| 2716/2930 [1:11:34<05:36,  1.57s/it]

{'loss': 0.0112, 'grad_norm': 0.16526859998703003, 'learning_rate': 2.922499146466371e-05, 'epoch': 9.26}


 93%|█████████▎| 2717/2930 [1:11:36<05:34,  1.57s/it]

{'loss': 0.0083, 'grad_norm': 0.11861450970172882, 'learning_rate': 2.9088426083987713e-05, 'epoch': 9.27}


 93%|█████████▎| 2718/2930 [1:11:37<05:33,  1.57s/it]

{'loss': 0.0096, 'grad_norm': 0.12806183099746704, 'learning_rate': 2.895186070331171e-05, 'epoch': 9.27}


 93%|█████████▎| 2719/2930 [1:11:39<05:32,  1.58s/it]

{'loss': 0.0115, 'grad_norm': 0.149682879447937, 'learning_rate': 2.8815295322635715e-05, 'epoch': 9.27}


 93%|█████████▎| 2720/2930 [1:11:41<05:30,  1.58s/it]

{'loss': 0.0108, 'grad_norm': 0.15556487441062927, 'learning_rate': 2.8678729941959716e-05, 'epoch': 9.28}


 93%|█████████▎| 2721/2930 [1:11:42<05:28,  1.57s/it]

{'loss': 0.0111, 'grad_norm': 0.15585559606552124, 'learning_rate': 2.854216456128372e-05, 'epoch': 9.28}


 93%|█████████▎| 2722/2930 [1:11:44<05:27,  1.57s/it]

{'loss': 0.0089, 'grad_norm': 0.14196188747882843, 'learning_rate': 2.8405599180607718e-05, 'epoch': 9.28}


 93%|█████████▎| 2723/2930 [1:11:45<05:25,  1.57s/it]

{'loss': 0.0119, 'grad_norm': 0.18510210514068604, 'learning_rate': 2.8269033799931722e-05, 'epoch': 9.29}


 93%|█████████▎| 2724/2930 [1:11:47<05:23,  1.57s/it]

{'loss': 0.0121, 'grad_norm': 0.22496367990970612, 'learning_rate': 2.813246841925572e-05, 'epoch': 9.29}


 93%|█████████▎| 2725/2930 [1:11:48<05:21,  1.57s/it]

{'loss': 0.0083, 'grad_norm': 0.12955829501152039, 'learning_rate': 2.7995903038579724e-05, 'epoch': 9.29}


 93%|█████████▎| 2726/2930 [1:11:50<05:19,  1.56s/it]

{'loss': 0.0099, 'grad_norm': 0.12885431945323944, 'learning_rate': 2.785933765790372e-05, 'epoch': 9.3}


 93%|█████████▎| 2727/2930 [1:11:52<05:16,  1.56s/it]

{'loss': 0.0112, 'grad_norm': 0.17191335558891296, 'learning_rate': 2.7722772277227726e-05, 'epoch': 9.3}


 93%|█████████▎| 2728/2930 [1:11:53<05:16,  1.57s/it]

{'loss': 0.0098, 'grad_norm': 0.14151543378829956, 'learning_rate': 2.7586206896551727e-05, 'epoch': 9.3}


 93%|█████████▎| 2729/2930 [1:11:55<05:15,  1.57s/it]

{'loss': 0.0082, 'grad_norm': 0.20778772234916687, 'learning_rate': 2.7449641515875728e-05, 'epoch': 9.31}


 93%|█████████▎| 2730/2930 [1:11:56<05:13,  1.57s/it]

{'loss': 0.0087, 'grad_norm': 0.1334349811077118, 'learning_rate': 2.731307613519973e-05, 'epoch': 9.31}


 93%|█████████▎| 2731/2930 [1:11:58<05:12,  1.57s/it]

{'loss': 0.0091, 'grad_norm': 0.11126577854156494, 'learning_rate': 2.7176510754523733e-05, 'epoch': 9.31}


 93%|█████████▎| 2732/2930 [1:11:59<05:10,  1.57s/it]

{'loss': 0.0086, 'grad_norm': 0.1521071344614029, 'learning_rate': 2.703994537384773e-05, 'epoch': 9.32}


 93%|█████████▎| 2733/2930 [1:12:01<05:08,  1.57s/it]

{'loss': 0.0104, 'grad_norm': 0.1337984800338745, 'learning_rate': 2.6903379993171735e-05, 'epoch': 9.32}


 93%|█████████▎| 2734/2930 [1:12:03<05:05,  1.56s/it]

{'loss': 0.0107, 'grad_norm': 0.1375209242105484, 'learning_rate': 2.6766814612495732e-05, 'epoch': 9.32}


 93%|█████████▎| 2735/2930 [1:12:04<05:05,  1.56s/it]

{'loss': 0.0116, 'grad_norm': 0.1407107710838318, 'learning_rate': 2.6630249231819736e-05, 'epoch': 9.33}


 93%|█████████▎| 2736/2930 [1:12:06<05:03,  1.56s/it]

{'loss': 0.0105, 'grad_norm': 0.1485554724931717, 'learning_rate': 2.6493683851143737e-05, 'epoch': 9.33}


 93%|█████████▎| 2737/2930 [1:12:07<05:01,  1.56s/it]

{'loss': 0.0114, 'grad_norm': 0.1550803929567337, 'learning_rate': 2.6357118470467738e-05, 'epoch': 9.33}


 93%|█████████▎| 2738/2930 [1:12:09<04:59,  1.56s/it]

{'loss': 0.0113, 'grad_norm': 0.18820078670978546, 'learning_rate': 2.622055308979174e-05, 'epoch': 9.34}


 93%|█████████▎| 2739/2930 [1:12:10<04:59,  1.57s/it]

{'loss': 0.0086, 'grad_norm': 0.11173958331346512, 'learning_rate': 2.6083987709115743e-05, 'epoch': 9.34}


 94%|█████████▎| 2740/2930 [1:12:12<04:58,  1.57s/it]

{'loss': 0.0126, 'grad_norm': 0.16261908411979675, 'learning_rate': 2.594742232843974e-05, 'epoch': 9.34}


 94%|█████████▎| 2741/2930 [1:12:14<04:56,  1.57s/it]

{'loss': 0.0083, 'grad_norm': 0.11531133949756622, 'learning_rate': 2.5810856947763745e-05, 'epoch': 9.35}


 94%|█████████▎| 2742/2930 [1:12:15<04:52,  1.56s/it]

{'loss': 0.0093, 'grad_norm': 0.12103971838951111, 'learning_rate': 2.5674291567087743e-05, 'epoch': 9.35}


 94%|█████████▎| 2743/2930 [1:12:17<04:51,  1.56s/it]

{'loss': 0.0099, 'grad_norm': 0.14383479952812195, 'learning_rate': 2.5537726186411747e-05, 'epoch': 9.35}


 94%|█████████▎| 2744/2930 [1:12:18<04:48,  1.55s/it]

{'loss': 0.012, 'grad_norm': 0.17956365644931793, 'learning_rate': 2.5401160805735748e-05, 'epoch': 9.36}


 94%|█████████▎| 2745/2930 [1:12:20<04:48,  1.56s/it]

{'loss': 0.0111, 'grad_norm': 0.16954535245895386, 'learning_rate': 2.526459542505975e-05, 'epoch': 9.36}


 94%|█████████▎| 2746/2930 [1:12:21<04:48,  1.57s/it]

{'loss': 0.0074, 'grad_norm': 0.11032852530479431, 'learning_rate': 2.512803004438375e-05, 'epoch': 9.36}


 94%|█████████▍| 2747/2930 [1:12:23<04:45,  1.56s/it]

{'loss': 0.0107, 'grad_norm': 0.17565292119979858, 'learning_rate': 2.499146466370775e-05, 'epoch': 9.37}


 94%|█████████▍| 2748/2930 [1:12:24<04:43,  1.56s/it]

{'loss': 0.0113, 'grad_norm': 0.1538904309272766, 'learning_rate': 2.4854899283031752e-05, 'epoch': 9.37}


 94%|█████████▍| 2749/2930 [1:12:26<04:42,  1.56s/it]

{'loss': 0.0111, 'grad_norm': 0.15638652443885803, 'learning_rate': 2.4718333902355753e-05, 'epoch': 9.37}


 94%|█████████▍| 2750/2930 [1:12:28<04:40,  1.56s/it]

{'loss': 0.0093, 'grad_norm': 0.15866632759571075, 'learning_rate': 2.4581768521679754e-05, 'epoch': 9.38}


 94%|█████████▍| 2751/2930 [1:12:29<04:37,  1.55s/it]

{'loss': 0.011, 'grad_norm': 0.15547560155391693, 'learning_rate': 2.4445203141003755e-05, 'epoch': 9.38}


 94%|█████████▍| 2752/2930 [1:12:31<04:36,  1.55s/it]

{'loss': 0.0108, 'grad_norm': 0.1790786236524582, 'learning_rate': 2.4308637760327755e-05, 'epoch': 9.38}


 94%|█████████▍| 2753/2930 [1:12:32<04:34,  1.55s/it]

{'loss': 0.012, 'grad_norm': 0.19160126149654388, 'learning_rate': 2.417207237965176e-05, 'epoch': 9.39}


 94%|█████████▍| 2754/2930 [1:12:34<04:32,  1.55s/it]

{'loss': 0.0088, 'grad_norm': 0.15473832190036774, 'learning_rate': 2.403550699897576e-05, 'epoch': 9.39}


 94%|█████████▍| 2755/2930 [1:12:35<04:30,  1.55s/it]

{'loss': 0.0096, 'grad_norm': 0.1499279886484146, 'learning_rate': 2.389894161829976e-05, 'epoch': 9.39}


 94%|█████████▍| 2756/2930 [1:12:37<04:28,  1.55s/it]

{'loss': 0.0102, 'grad_norm': 0.14858347177505493, 'learning_rate': 2.3762376237623762e-05, 'epoch': 9.4}


 94%|█████████▍| 2757/2930 [1:12:38<04:28,  1.55s/it]

{'loss': 0.0104, 'grad_norm': 0.1566535085439682, 'learning_rate': 2.3625810856947763e-05, 'epoch': 9.4}


 94%|█████████▍| 2758/2930 [1:12:40<04:26,  1.55s/it]

{'loss': 0.0126, 'grad_norm': 0.16275350749492645, 'learning_rate': 2.3489245476271764e-05, 'epoch': 9.4}


 94%|█████████▍| 2759/2930 [1:12:41<04:26,  1.56s/it]

{'loss': 0.0097, 'grad_norm': 0.1397586464881897, 'learning_rate': 2.3352680095595765e-05, 'epoch': 9.41}


 94%|█████████▍| 2760/2930 [1:12:43<04:25,  1.56s/it]

{'loss': 0.0109, 'grad_norm': 0.16011406481266022, 'learning_rate': 2.3216114714919766e-05, 'epoch': 9.41}


 94%|█████████▍| 2761/2930 [1:12:45<04:24,  1.56s/it]

{'loss': 0.0102, 'grad_norm': 0.17818334698677063, 'learning_rate': 2.307954933424377e-05, 'epoch': 9.42}


 94%|█████████▍| 2762/2930 [1:12:46<04:21,  1.56s/it]

{'loss': 0.0098, 'grad_norm': 0.15242557227611542, 'learning_rate': 2.294298395356777e-05, 'epoch': 9.42}


 94%|█████████▍| 2763/2930 [1:12:48<04:19,  1.56s/it]

{'loss': 0.0098, 'grad_norm': 0.1414269208908081, 'learning_rate': 2.2806418572891772e-05, 'epoch': 9.42}


 94%|█████████▍| 2764/2930 [1:12:49<04:19,  1.56s/it]

{'loss': 0.0107, 'grad_norm': 0.12808062136173248, 'learning_rate': 2.2669853192215773e-05, 'epoch': 9.43}


 94%|█████████▍| 2765/2930 [1:12:51<04:16,  1.55s/it]

{'loss': 0.0105, 'grad_norm': 0.1272948980331421, 'learning_rate': 2.2533287811539774e-05, 'epoch': 9.43}


 94%|█████████▍| 2766/2930 [1:12:52<04:15,  1.56s/it]

{'loss': 0.0111, 'grad_norm': 0.1468317061662674, 'learning_rate': 2.2396722430863775e-05, 'epoch': 9.43}


 94%|█████████▍| 2767/2930 [1:12:54<04:14,  1.56s/it]

{'loss': 0.0091, 'grad_norm': 0.1272442489862442, 'learning_rate': 2.2260157050187776e-05, 'epoch': 9.44}


 94%|█████████▍| 2768/2930 [1:12:56<04:12,  1.56s/it]

{'loss': 0.0116, 'grad_norm': 0.1718420535326004, 'learning_rate': 2.212359166951178e-05, 'epoch': 9.44}


 95%|█████████▍| 2769/2930 [1:12:57<04:11,  1.56s/it]

{'loss': 0.0097, 'grad_norm': 0.1670963019132614, 'learning_rate': 2.198702628883578e-05, 'epoch': 9.44}


 95%|█████████▍| 2770/2930 [1:12:59<04:10,  1.56s/it]

{'loss': 0.0107, 'grad_norm': 0.1592429280281067, 'learning_rate': 2.1850460908159785e-05, 'epoch': 9.45}


 95%|█████████▍| 2771/2930 [1:13:00<04:08,  1.57s/it]

{'loss': 0.0113, 'grad_norm': 0.15397468209266663, 'learning_rate': 2.1713895527483786e-05, 'epoch': 9.45}


 95%|█████████▍| 2772/2930 [1:13:02<04:06,  1.56s/it]

{'loss': 0.0083, 'grad_norm': 0.16961123049259186, 'learning_rate': 2.1577330146807787e-05, 'epoch': 9.45}


 95%|█████████▍| 2773/2930 [1:13:03<04:05,  1.56s/it]

{'loss': 0.0124, 'grad_norm': 0.18203340470790863, 'learning_rate': 2.1440764766131788e-05, 'epoch': 9.46}


 95%|█████████▍| 2774/2930 [1:13:05<04:03,  1.56s/it]

{'loss': 0.0097, 'grad_norm': 0.14207366108894348, 'learning_rate': 2.130419938545579e-05, 'epoch': 9.46}


 95%|█████████▍| 2775/2930 [1:13:06<04:01,  1.56s/it]

{'loss': 0.009, 'grad_norm': 0.14762859046459198, 'learning_rate': 2.116763400477979e-05, 'epoch': 9.46}


 95%|█████████▍| 2776/2930 [1:13:08<04:01,  1.57s/it]

{'loss': 0.0107, 'grad_norm': 0.15775905549526215, 'learning_rate': 2.103106862410379e-05, 'epoch': 9.47}


 95%|█████████▍| 2777/2930 [1:13:10<04:00,  1.57s/it]

{'loss': 0.0124, 'grad_norm': 0.1596367508172989, 'learning_rate': 2.0894503243427792e-05, 'epoch': 9.47}


 95%|█████████▍| 2778/2930 [1:13:11<03:58,  1.57s/it]

{'loss': 0.0106, 'grad_norm': 0.15843336284160614, 'learning_rate': 2.0757937862751796e-05, 'epoch': 9.47}


 95%|█████████▍| 2779/2930 [1:13:13<03:56,  1.57s/it]

{'loss': 0.0094, 'grad_norm': 0.14173048734664917, 'learning_rate': 2.0621372482075797e-05, 'epoch': 9.48}


 95%|█████████▍| 2780/2930 [1:13:14<03:54,  1.56s/it]

{'loss': 0.0094, 'grad_norm': 0.1750105768442154, 'learning_rate': 2.0484807101399798e-05, 'epoch': 9.48}


 95%|█████████▍| 2781/2930 [1:13:16<03:52,  1.56s/it]

{'loss': 0.01, 'grad_norm': 0.1570262610912323, 'learning_rate': 2.03482417207238e-05, 'epoch': 9.48}


 95%|█████████▍| 2782/2930 [1:13:17<03:50,  1.56s/it]

{'loss': 0.0108, 'grad_norm': 0.16482025384902954, 'learning_rate': 2.02116763400478e-05, 'epoch': 9.49}


 95%|█████████▍| 2783/2930 [1:13:19<03:50,  1.57s/it]

{'loss': 0.0128, 'grad_norm': 0.21284809708595276, 'learning_rate': 2.00751109593718e-05, 'epoch': 9.49}


 95%|█████████▌| 2784/2930 [1:13:21<03:47,  1.56s/it]

{'loss': 0.0109, 'grad_norm': 0.16288387775421143, 'learning_rate': 1.9938545578695802e-05, 'epoch': 9.49}


 95%|█████████▌| 2785/2930 [1:13:22<03:45,  1.56s/it]

{'loss': 0.0097, 'grad_norm': 0.19356520473957062, 'learning_rate': 1.9801980198019803e-05, 'epoch': 9.5}


 95%|█████████▌| 2786/2930 [1:13:24<03:44,  1.56s/it]

{'loss': 0.0116, 'grad_norm': 0.18937529623508453, 'learning_rate': 1.9665414817343804e-05, 'epoch': 9.5}


 95%|█████████▌| 2787/2930 [1:13:25<03:47,  1.59s/it]

{'loss': 0.0105, 'grad_norm': 0.15299788117408752, 'learning_rate': 1.9528849436667808e-05, 'epoch': 9.5}


 95%|█████████▌| 2788/2930 [1:13:27<03:49,  1.62s/it]

{'loss': 0.0077, 'grad_norm': 0.1638791561126709, 'learning_rate': 1.939228405599181e-05, 'epoch': 9.51}


 95%|█████████▌| 2789/2930 [1:13:29<03:46,  1.61s/it]

{'loss': 0.0104, 'grad_norm': 0.1447203904390335, 'learning_rate': 1.925571867531581e-05, 'epoch': 9.51}


 95%|█████████▌| 2790/2930 [1:13:30<03:45,  1.61s/it]

{'loss': 0.0098, 'grad_norm': 0.17466455698013306, 'learning_rate': 1.911915329463981e-05, 'epoch': 9.51}


 95%|█████████▌| 2791/2930 [1:13:32<03:45,  1.62s/it]

{'loss': 0.0085, 'grad_norm': 0.1354006975889206, 'learning_rate': 1.898258791396381e-05, 'epoch': 9.52}


 95%|█████████▌| 2792/2930 [1:13:34<03:44,  1.63s/it]

{'loss': 0.0087, 'grad_norm': 0.14395780861377716, 'learning_rate': 1.8846022533287812e-05, 'epoch': 9.52}


 95%|█████████▌| 2793/2930 [1:13:35<03:46,  1.65s/it]

{'loss': 0.0096, 'grad_norm': 0.14142537117004395, 'learning_rate': 1.8709457152611813e-05, 'epoch': 9.52}


 95%|█████████▌| 2794/2930 [1:13:37<03:45,  1.66s/it]

{'loss': 0.0095, 'grad_norm': 0.13913141191005707, 'learning_rate': 1.8572891771935814e-05, 'epoch': 9.53}


 95%|█████████▌| 2795/2930 [1:13:39<03:48,  1.69s/it]

{'loss': 0.009, 'grad_norm': 0.14785481989383698, 'learning_rate': 1.843632639125982e-05, 'epoch': 9.53}


 95%|█████████▌| 2796/2930 [1:13:40<03:45,  1.68s/it]

{'loss': 0.0095, 'grad_norm': 0.14676369726657867, 'learning_rate': 1.829976101058382e-05, 'epoch': 9.53}


 95%|█████████▌| 2797/2930 [1:13:42<03:44,  1.69s/it]

{'loss': 0.0117, 'grad_norm': 0.1975269615650177, 'learning_rate': 1.816319562990782e-05, 'epoch': 9.54}


 95%|█████████▌| 2798/2930 [1:13:44<03:44,  1.70s/it]

{'loss': 0.0109, 'grad_norm': 0.16328604519367218, 'learning_rate': 1.802663024923182e-05, 'epoch': 9.54}


 96%|█████████▌| 2799/2930 [1:13:45<03:40,  1.68s/it]

{'loss': 0.009, 'grad_norm': 0.13357731699943542, 'learning_rate': 1.7890064868555822e-05, 'epoch': 9.54}


 96%|█████████▌| 2800/2930 [1:13:47<03:36,  1.66s/it]

{'loss': 0.0133, 'grad_norm': 0.17884725332260132, 'learning_rate': 1.7753499487879823e-05, 'epoch': 9.55}


 96%|█████████▌| 2801/2930 [1:13:49<03:30,  1.63s/it]

{'loss': 0.0095, 'grad_norm': 0.14625877141952515, 'learning_rate': 1.7616934107203824e-05, 'epoch': 9.55}


 96%|█████████▌| 2802/2930 [1:13:50<03:28,  1.63s/it]

{'loss': 0.0087, 'grad_norm': 0.17353112995624542, 'learning_rate': 1.7480368726527825e-05, 'epoch': 9.55}


 96%|█████████▌| 2803/2930 [1:13:52<03:24,  1.61s/it]

{'loss': 0.0116, 'grad_norm': 0.18573860824108124, 'learning_rate': 1.734380334585183e-05, 'epoch': 9.56}


 96%|█████████▌| 2804/2930 [1:13:53<03:23,  1.61s/it]

{'loss': 0.0101, 'grad_norm': 0.14207087457180023, 'learning_rate': 1.720723796517583e-05, 'epoch': 9.56}


 96%|█████████▌| 2805/2930 [1:13:55<03:19,  1.59s/it]

{'loss': 0.0102, 'grad_norm': 0.15901021659374237, 'learning_rate': 1.707067258449983e-05, 'epoch': 9.57}


 96%|█████████▌| 2806/2930 [1:13:56<03:16,  1.59s/it]

{'loss': 0.0092, 'grad_norm': 0.15093638002872467, 'learning_rate': 1.6934107203823832e-05, 'epoch': 9.57}


 96%|█████████▌| 2807/2930 [1:13:58<03:17,  1.60s/it]

{'loss': 0.0099, 'grad_norm': 0.14176839590072632, 'learning_rate': 1.6797541823147833e-05, 'epoch': 9.57}


 96%|█████████▌| 2808/2930 [1:14:00<03:14,  1.59s/it]

{'loss': 0.0104, 'grad_norm': 0.15355098247528076, 'learning_rate': 1.6660976442471834e-05, 'epoch': 9.58}


 96%|█████████▌| 2809/2930 [1:14:01<03:14,  1.60s/it]

{'loss': 0.0107, 'grad_norm': 0.13853521645069122, 'learning_rate': 1.6524411061795835e-05, 'epoch': 9.58}


 96%|█████████▌| 2810/2930 [1:14:03<03:14,  1.62s/it]

{'loss': 0.0099, 'grad_norm': 0.15776507556438446, 'learning_rate': 1.6387845681119836e-05, 'epoch': 9.58}


 96%|█████████▌| 2811/2930 [1:14:05<03:13,  1.62s/it]

{'loss': 0.011, 'grad_norm': 0.18682068586349487, 'learning_rate': 1.625128030044384e-05, 'epoch': 9.59}


 96%|█████████▌| 2812/2930 [1:14:06<03:08,  1.59s/it]

{'loss': 0.0074, 'grad_norm': 0.1301063448190689, 'learning_rate': 1.611471491976784e-05, 'epoch': 9.59}


 96%|█████████▌| 2813/2930 [1:14:08<03:05,  1.58s/it]

{'loss': 0.0114, 'grad_norm': 0.16663016378879547, 'learning_rate': 1.5978149539091842e-05, 'epoch': 9.59}


 96%|█████████▌| 2814/2930 [1:14:09<03:02,  1.58s/it]

{'loss': 0.011, 'grad_norm': 0.15394540131092072, 'learning_rate': 1.5841584158415843e-05, 'epoch': 9.6}


 96%|█████████▌| 2815/2930 [1:14:11<03:00,  1.57s/it]

{'loss': 0.011, 'grad_norm': 0.18476131558418274, 'learning_rate': 1.5705018777739844e-05, 'epoch': 9.6}


 96%|█████████▌| 2816/2930 [1:14:12<02:58,  1.57s/it]

{'loss': 0.0105, 'grad_norm': 0.1502552181482315, 'learning_rate': 1.5568453397063845e-05, 'epoch': 9.6}


 96%|█████████▌| 2817/2930 [1:14:14<02:58,  1.58s/it]

{'loss': 0.009, 'grad_norm': 0.13323228061199188, 'learning_rate': 1.5431888016387846e-05, 'epoch': 9.61}


 96%|█████████▌| 2818/2930 [1:14:16<02:59,  1.60s/it]

{'loss': 0.0103, 'grad_norm': 0.16466915607452393, 'learning_rate': 1.5295322635711846e-05, 'epoch': 9.61}


 96%|█████████▌| 2819/2930 [1:14:17<03:00,  1.63s/it]

{'loss': 0.0087, 'grad_norm': 0.12663643062114716, 'learning_rate': 1.5158757255035849e-05, 'epoch': 9.61}


 96%|█████████▌| 2820/2930 [1:14:19<02:58,  1.63s/it]

{'loss': 0.0085, 'grad_norm': 0.14641742408275604, 'learning_rate': 1.502219187435985e-05, 'epoch': 9.62}


 96%|█████████▋| 2821/2930 [1:14:21<02:57,  1.63s/it]

{'loss': 0.0112, 'grad_norm': 0.1421845257282257, 'learning_rate': 1.4885626493683853e-05, 'epoch': 9.62}


 96%|█████████▋| 2822/2930 [1:14:22<02:54,  1.62s/it]

{'loss': 0.0106, 'grad_norm': 0.13213105499744415, 'learning_rate': 1.4749061113007854e-05, 'epoch': 9.62}


 96%|█████████▋| 2823/2930 [1:14:24<02:52,  1.62s/it]

{'loss': 0.0097, 'grad_norm': 0.13045856356620789, 'learning_rate': 1.4612495732331854e-05, 'epoch': 9.63}


 96%|█████████▋| 2824/2930 [1:14:25<02:51,  1.61s/it]

{'loss': 0.0094, 'grad_norm': 0.14910869300365448, 'learning_rate': 1.4475930351655855e-05, 'epoch': 9.63}


 96%|█████████▋| 2825/2930 [1:14:27<02:48,  1.60s/it]

{'loss': 0.0122, 'grad_norm': 0.19203466176986694, 'learning_rate': 1.4339364970979858e-05, 'epoch': 9.63}


 96%|█████████▋| 2826/2930 [1:14:29<02:47,  1.61s/it]

{'loss': 0.0087, 'grad_norm': 0.13275952637195587, 'learning_rate': 1.4202799590303859e-05, 'epoch': 9.64}


 96%|█████████▋| 2827/2930 [1:14:30<02:43,  1.59s/it]

{'loss': 0.008, 'grad_norm': 0.12083788961172104, 'learning_rate': 1.406623420962786e-05, 'epoch': 9.64}


 97%|█████████▋| 2828/2930 [1:14:32<02:42,  1.60s/it]

{'loss': 0.0117, 'grad_norm': 0.2139604240655899, 'learning_rate': 1.392966882895186e-05, 'epoch': 9.64}


 97%|█████████▋| 2829/2930 [1:14:33<02:42,  1.61s/it]

{'loss': 0.0079, 'grad_norm': 0.13894769549369812, 'learning_rate': 1.3793103448275863e-05, 'epoch': 9.65}


 97%|█████████▋| 2830/2930 [1:14:35<02:40,  1.61s/it]

{'loss': 0.0092, 'grad_norm': 0.12863589823246002, 'learning_rate': 1.3656538067599864e-05, 'epoch': 9.65}


 97%|█████████▋| 2831/2930 [1:14:37<02:39,  1.61s/it]

{'loss': 0.0083, 'grad_norm': 0.11825835704803467, 'learning_rate': 1.3519972686923865e-05, 'epoch': 9.65}


 97%|█████████▋| 2832/2930 [1:14:38<02:39,  1.63s/it]

{'loss': 0.0094, 'grad_norm': 0.15828098356723785, 'learning_rate': 1.3383407306247866e-05, 'epoch': 9.66}


 97%|█████████▋| 2833/2930 [1:14:40<02:39,  1.65s/it]

{'loss': 0.0103, 'grad_norm': 0.15997791290283203, 'learning_rate': 1.3246841925571869e-05, 'epoch': 9.66}


 97%|█████████▋| 2834/2930 [1:14:42<02:37,  1.64s/it]

{'loss': 0.0099, 'grad_norm': 0.15897764265537262, 'learning_rate': 1.311027654489587e-05, 'epoch': 9.66}


 97%|█████████▋| 2835/2930 [1:14:43<02:35,  1.64s/it]

{'loss': 0.0106, 'grad_norm': 0.14809516072273254, 'learning_rate': 1.297371116421987e-05, 'epoch': 9.67}


 97%|█████████▋| 2836/2930 [1:14:45<02:32,  1.62s/it]

{'loss': 0.0112, 'grad_norm': 0.1854226142168045, 'learning_rate': 1.2837145783543871e-05, 'epoch': 9.67}


 97%|█████████▋| 2837/2930 [1:14:46<02:28,  1.59s/it]

{'loss': 0.0105, 'grad_norm': 0.18167728185653687, 'learning_rate': 1.2700580402867874e-05, 'epoch': 9.67}


 97%|█████████▋| 2838/2930 [1:14:48<02:27,  1.61s/it]

{'loss': 0.0099, 'grad_norm': 0.13755741715431213, 'learning_rate': 1.2564015022191875e-05, 'epoch': 9.68}


 97%|█████████▋| 2839/2930 [1:14:50<02:25,  1.60s/it]

{'loss': 0.0121, 'grad_norm': 0.17655952274799347, 'learning_rate': 1.2427449641515876e-05, 'epoch': 9.68}


 97%|█████████▋| 2840/2930 [1:14:51<02:24,  1.61s/it]

{'loss': 0.0115, 'grad_norm': 0.17266331613063812, 'learning_rate': 1.2290884260839877e-05, 'epoch': 9.68}


 97%|█████████▋| 2841/2930 [1:14:53<02:23,  1.61s/it]

{'loss': 0.0101, 'grad_norm': 0.13771435618400574, 'learning_rate': 1.2154318880163878e-05, 'epoch': 9.69}


 97%|█████████▋| 2842/2930 [1:14:54<02:21,  1.60s/it]

{'loss': 0.0101, 'grad_norm': 0.14256691932678223, 'learning_rate': 1.201775349948788e-05, 'epoch': 9.69}


 97%|█████████▋| 2843/2930 [1:14:56<02:19,  1.60s/it]

{'loss': 0.0084, 'grad_norm': 0.12802010774612427, 'learning_rate': 1.1881188118811881e-05, 'epoch': 9.69}


 97%|█████████▋| 2844/2930 [1:14:58<02:17,  1.60s/it]

{'loss': 0.0113, 'grad_norm': 0.20687533915042877, 'learning_rate': 1.1744622738135882e-05, 'epoch': 9.7}


 97%|█████████▋| 2845/2930 [1:14:59<02:16,  1.61s/it]

{'loss': 0.0084, 'grad_norm': 0.14734265208244324, 'learning_rate': 1.1608057357459883e-05, 'epoch': 9.7}


 97%|█████████▋| 2846/2930 [1:15:01<02:14,  1.60s/it]

{'loss': 0.0094, 'grad_norm': 0.13525700569152832, 'learning_rate': 1.1471491976783886e-05, 'epoch': 9.71}


 97%|█████████▋| 2847/2930 [1:15:02<02:11,  1.59s/it]

{'loss': 0.0092, 'grad_norm': 0.15311715006828308, 'learning_rate': 1.1334926596107887e-05, 'epoch': 9.71}


 97%|█████████▋| 2848/2930 [1:15:04<02:10,  1.59s/it]

{'loss': 0.0104, 'grad_norm': 0.1685292273759842, 'learning_rate': 1.1198361215431888e-05, 'epoch': 9.71}


 97%|█████████▋| 2849/2930 [1:15:06<02:07,  1.58s/it]

{'loss': 0.0077, 'grad_norm': 0.11894046515226364, 'learning_rate': 1.106179583475589e-05, 'epoch': 9.72}


 97%|█████████▋| 2850/2930 [1:15:07<02:05,  1.57s/it]

{'loss': 0.0105, 'grad_norm': 0.15676330029964447, 'learning_rate': 1.0925230454079893e-05, 'epoch': 9.72}


 97%|█████████▋| 2851/2930 [1:15:09<02:04,  1.57s/it]

{'loss': 0.0098, 'grad_norm': 0.13595512509346008, 'learning_rate': 1.0788665073403894e-05, 'epoch': 9.72}


 97%|█████████▋| 2852/2930 [1:15:10<02:01,  1.56s/it]

{'loss': 0.0068, 'grad_norm': 0.10883069038391113, 'learning_rate': 1.0652099692727895e-05, 'epoch': 9.73}


 97%|█████████▋| 2853/2930 [1:15:12<01:58,  1.54s/it]

{'loss': 0.0093, 'grad_norm': 0.1397627592086792, 'learning_rate': 1.0515534312051895e-05, 'epoch': 9.73}


 97%|█████████▋| 2854/2930 [1:15:13<01:57,  1.54s/it]

{'loss': 0.009, 'grad_norm': 0.14226065576076508, 'learning_rate': 1.0378968931375898e-05, 'epoch': 9.73}


 97%|█████████▋| 2855/2930 [1:15:15<01:56,  1.55s/it]

{'loss': 0.0107, 'grad_norm': 0.16838520765304565, 'learning_rate': 1.0242403550699899e-05, 'epoch': 9.74}


 97%|█████████▋| 2856/2930 [1:15:16<01:57,  1.58s/it]

{'loss': 0.0097, 'grad_norm': 0.15307052433490753, 'learning_rate': 1.01058381700239e-05, 'epoch': 9.74}


 98%|█████████▊| 2857/2930 [1:15:18<01:56,  1.60s/it]

{'loss': 0.0089, 'grad_norm': 0.1303814798593521, 'learning_rate': 9.969272789347901e-06, 'epoch': 9.74}


 98%|█████████▊| 2858/2930 [1:15:20<01:55,  1.60s/it]

{'loss': 0.0094, 'grad_norm': 0.13294397294521332, 'learning_rate': 9.832707408671902e-06, 'epoch': 9.75}


 98%|█████████▊| 2859/2930 [1:15:21<01:55,  1.63s/it]

{'loss': 0.0116, 'grad_norm': 0.15438146889209747, 'learning_rate': 9.696142027995904e-06, 'epoch': 9.75}


 98%|█████████▊| 2860/2930 [1:15:23<01:53,  1.62s/it]

{'loss': 0.0089, 'grad_norm': 0.1507008671760559, 'learning_rate': 9.559576647319905e-06, 'epoch': 9.75}


 98%|█████████▊| 2861/2930 [1:15:25<01:51,  1.61s/it]

{'loss': 0.01, 'grad_norm': 0.16133196651935577, 'learning_rate': 9.423011266643906e-06, 'epoch': 9.76}


 98%|█████████▊| 2862/2930 [1:15:26<01:49,  1.61s/it]

{'loss': 0.0105, 'grad_norm': 0.14993195235729218, 'learning_rate': 9.286445885967907e-06, 'epoch': 9.76}


 98%|█████████▊| 2863/2930 [1:15:28<01:47,  1.60s/it]

{'loss': 0.0102, 'grad_norm': 0.1317332535982132, 'learning_rate': 9.14988050529191e-06, 'epoch': 9.76}


 98%|█████████▊| 2864/2930 [1:15:29<01:45,  1.60s/it]

{'loss': 0.0092, 'grad_norm': 0.13370956480503082, 'learning_rate': 9.01331512461591e-06, 'epoch': 9.77}


 98%|█████████▊| 2865/2930 [1:15:31<01:45,  1.62s/it]

{'loss': 0.01, 'grad_norm': 0.13538841903209686, 'learning_rate': 8.876749743939912e-06, 'epoch': 9.77}


 98%|█████████▊| 2866/2930 [1:15:33<01:44,  1.63s/it]

{'loss': 0.0087, 'grad_norm': 0.13530565798282623, 'learning_rate': 8.740184363263912e-06, 'epoch': 9.77}


 98%|█████████▊| 2867/2930 [1:15:34<01:43,  1.64s/it]

{'loss': 0.0088, 'grad_norm': 0.13710440695285797, 'learning_rate': 8.603618982587915e-06, 'epoch': 9.78}


 98%|█████████▊| 2868/2930 [1:15:36<01:41,  1.64s/it]

{'loss': 0.0107, 'grad_norm': 0.16883614659309387, 'learning_rate': 8.467053601911916e-06, 'epoch': 9.78}


 98%|█████████▊| 2869/2930 [1:15:38<01:40,  1.65s/it]

{'loss': 0.014, 'grad_norm': 0.20707328617572784, 'learning_rate': 8.330488221235917e-06, 'epoch': 9.78}


 98%|█████████▊| 2870/2930 [1:15:39<01:38,  1.64s/it]

{'loss': 0.0084, 'grad_norm': 0.14068195223808289, 'learning_rate': 8.193922840559918e-06, 'epoch': 9.79}


 98%|█████████▊| 2871/2930 [1:15:41<01:37,  1.66s/it]

{'loss': 0.012, 'grad_norm': 0.1913718432188034, 'learning_rate': 8.05735745988392e-06, 'epoch': 9.79}


 98%|█████████▊| 2872/2930 [1:15:43<01:36,  1.66s/it]

{'loss': 0.0108, 'grad_norm': 0.15219233930110931, 'learning_rate': 7.920792079207921e-06, 'epoch': 9.79}


 98%|█████████▊| 2873/2930 [1:15:44<01:34,  1.66s/it]

{'loss': 0.0108, 'grad_norm': 0.13651011884212494, 'learning_rate': 7.784226698531922e-06, 'epoch': 9.8}


 98%|█████████▊| 2874/2930 [1:15:46<01:33,  1.67s/it]

{'loss': 0.012, 'grad_norm': 0.1762649118900299, 'learning_rate': 7.647661317855923e-06, 'epoch': 9.8}


 98%|█████████▊| 2875/2930 [1:15:48<01:32,  1.68s/it]

{'loss': 0.0102, 'grad_norm': 0.15227407217025757, 'learning_rate': 7.511095937179925e-06, 'epoch': 9.8}


 98%|█████████▊| 2876/2930 [1:15:49<01:30,  1.68s/it]

{'loss': 0.0111, 'grad_norm': 0.14838114380836487, 'learning_rate': 7.374530556503927e-06, 'epoch': 9.81}


 98%|█████████▊| 2877/2930 [1:15:51<01:29,  1.69s/it]

{'loss': 0.0117, 'grad_norm': 0.16236534714698792, 'learning_rate': 7.237965175827928e-06, 'epoch': 9.81}


 98%|█████████▊| 2878/2930 [1:15:53<01:27,  1.69s/it]

{'loss': 0.0131, 'grad_norm': 0.22804942727088928, 'learning_rate': 7.1013997951519294e-06, 'epoch': 9.81}


 98%|█████████▊| 2879/2930 [1:15:54<01:25,  1.68s/it]

{'loss': 0.0114, 'grad_norm': 0.15519022941589355, 'learning_rate': 6.96483441447593e-06, 'epoch': 9.82}


 98%|█████████▊| 2880/2930 [1:15:56<01:24,  1.68s/it]

{'loss': 0.0098, 'grad_norm': 0.1567540168762207, 'learning_rate': 6.828269033799932e-06, 'epoch': 9.82}


 98%|█████████▊| 2881/2930 [1:15:58<01:22,  1.68s/it]

{'loss': 0.0101, 'grad_norm': 0.13656705617904663, 'learning_rate': 6.691703653123933e-06, 'epoch': 9.82}


 98%|█████████▊| 2882/2930 [1:15:59<01:18,  1.64s/it]

{'loss': 0.0089, 'grad_norm': 0.1741792857646942, 'learning_rate': 6.555138272447935e-06, 'epoch': 9.83}


 98%|█████████▊| 2883/2930 [1:16:01<01:15,  1.61s/it]

{'loss': 0.0111, 'grad_norm': 0.1800878643989563, 'learning_rate': 6.418572891771936e-06, 'epoch': 9.83}


 98%|█████████▊| 2884/2930 [1:16:02<01:13,  1.59s/it]

{'loss': 0.0101, 'grad_norm': 0.14461646974086761, 'learning_rate': 6.2820075110959375e-06, 'epoch': 9.83}


 98%|█████████▊| 2885/2930 [1:16:04<01:11,  1.59s/it]

{'loss': 0.0094, 'grad_norm': 0.15722985565662384, 'learning_rate': 6.145442130419938e-06, 'epoch': 9.84}


 98%|█████████▊| 2886/2930 [1:16:06<01:09,  1.58s/it]

{'loss': 0.0098, 'grad_norm': 0.15287627279758453, 'learning_rate': 6.00887674974394e-06, 'epoch': 9.84}


 99%|█████████▊| 2887/2930 [1:16:07<01:07,  1.58s/it]

{'loss': 0.0085, 'grad_norm': 0.15258465707302094, 'learning_rate': 5.872311369067941e-06, 'epoch': 9.84}


 99%|█████████▊| 2888/2930 [1:16:09<01:05,  1.57s/it]

{'loss': 0.0092, 'grad_norm': 0.12776826322078705, 'learning_rate': 5.735745988391943e-06, 'epoch': 9.85}


 99%|█████████▊| 2889/2930 [1:16:10<01:04,  1.57s/it]

{'loss': 0.0097, 'grad_norm': 0.12831591069698334, 'learning_rate': 5.599180607715944e-06, 'epoch': 9.85}


 99%|█████████▊| 2890/2930 [1:16:12<01:02,  1.57s/it]

{'loss': 0.0093, 'grad_norm': 0.16413021087646484, 'learning_rate': 5.462615227039946e-06, 'epoch': 9.86}


 99%|█████████▊| 2891/2930 [1:16:13<01:01,  1.57s/it]

{'loss': 0.0103, 'grad_norm': 0.1723063737154007, 'learning_rate': 5.326049846363947e-06, 'epoch': 9.86}


 99%|█████████▊| 2892/2930 [1:16:15<00:59,  1.57s/it]

{'loss': 0.0097, 'grad_norm': 0.15390126407146454, 'learning_rate': 5.189484465687949e-06, 'epoch': 9.86}


 99%|█████████▊| 2893/2930 [1:16:17<00:57,  1.56s/it]

{'loss': 0.0108, 'grad_norm': 0.17544305324554443, 'learning_rate': 5.05291908501195e-06, 'epoch': 9.87}


 99%|█████████▉| 2894/2930 [1:16:18<00:56,  1.56s/it]

{'loss': 0.0077, 'grad_norm': 0.11374568939208984, 'learning_rate': 4.916353704335951e-06, 'epoch': 9.87}


 99%|█████████▉| 2895/2930 [1:16:20<00:54,  1.56s/it]

{'loss': 0.0095, 'grad_norm': 0.15824581682682037, 'learning_rate': 4.779788323659953e-06, 'epoch': 9.87}


 99%|█████████▉| 2896/2930 [1:16:21<00:53,  1.56s/it]

{'loss': 0.0103, 'grad_norm': 0.1678433120250702, 'learning_rate': 4.6432229429839536e-06, 'epoch': 9.88}


 99%|█████████▉| 2897/2930 [1:16:23<00:51,  1.56s/it]

{'loss': 0.0091, 'grad_norm': 0.12164731323719025, 'learning_rate': 4.506657562307955e-06, 'epoch': 9.88}


 99%|█████████▉| 2898/2930 [1:16:24<00:49,  1.56s/it]

{'loss': 0.0091, 'grad_norm': 0.15140077471733093, 'learning_rate': 4.370092181631956e-06, 'epoch': 9.88}


 99%|█████████▉| 2899/2930 [1:16:26<00:48,  1.56s/it]

{'loss': 0.0092, 'grad_norm': 0.12077735364437103, 'learning_rate': 4.233526800955958e-06, 'epoch': 9.89}


 99%|█████████▉| 2900/2930 [1:16:27<00:46,  1.56s/it]

{'loss': 0.0068, 'grad_norm': 0.1046542227268219, 'learning_rate': 4.096961420279959e-06, 'epoch': 9.89}


 99%|█████████▉| 2901/2930 [1:16:29<00:45,  1.56s/it]

{'loss': 0.0086, 'grad_norm': 0.10747513175010681, 'learning_rate': 3.960396039603961e-06, 'epoch': 9.89}


 99%|█████████▉| 2902/2930 [1:16:31<00:43,  1.56s/it]

{'loss': 0.0093, 'grad_norm': 0.142911896109581, 'learning_rate': 3.823830658927962e-06, 'epoch': 9.9}


 99%|█████████▉| 2903/2930 [1:16:32<00:42,  1.56s/it]

{'loss': 0.0102, 'grad_norm': 0.1370154768228531, 'learning_rate': 3.6872652782519634e-06, 'epoch': 9.9}


 99%|█████████▉| 2904/2930 [1:16:34<00:40,  1.57s/it]

{'loss': 0.0109, 'grad_norm': 0.16790106892585754, 'learning_rate': 3.5506998975759647e-06, 'epoch': 9.9}


 99%|█████████▉| 2905/2930 [1:16:35<00:39,  1.56s/it]

{'loss': 0.008, 'grad_norm': 0.12467613071203232, 'learning_rate': 3.414134516899966e-06, 'epoch': 9.91}


 99%|█████████▉| 2906/2930 [1:16:37<00:37,  1.57s/it]

{'loss': 0.0101, 'grad_norm': 0.14100266993045807, 'learning_rate': 3.2775691362239674e-06, 'epoch': 9.91}


 99%|█████████▉| 2907/2930 [1:16:38<00:36,  1.58s/it]

{'loss': 0.0089, 'grad_norm': 0.1337946206331253, 'learning_rate': 3.1410037555479687e-06, 'epoch': 9.91}


 99%|█████████▉| 2908/2930 [1:16:40<00:34,  1.57s/it]

{'loss': 0.0088, 'grad_norm': 0.1552768498659134, 'learning_rate': 3.00443837487197e-06, 'epoch': 9.92}


 99%|█████████▉| 2909/2930 [1:16:42<00:32,  1.57s/it]

{'loss': 0.0081, 'grad_norm': 0.1289549022912979, 'learning_rate': 2.8678729941959714e-06, 'epoch': 9.92}


 99%|█████████▉| 2910/2930 [1:16:43<00:31,  1.58s/it]

{'loss': 0.0086, 'grad_norm': 0.12549369037151337, 'learning_rate': 2.731307613519973e-06, 'epoch': 9.92}


 99%|█████████▉| 2911/2930 [1:16:45<00:29,  1.58s/it]

{'loss': 0.0083, 'grad_norm': 0.1299205720424652, 'learning_rate': 2.5947422328439745e-06, 'epoch': 9.93}


 99%|█████████▉| 2912/2930 [1:16:46<00:28,  1.57s/it]

{'loss': 0.0085, 'grad_norm': 0.13253886997699738, 'learning_rate': 2.4581768521679754e-06, 'epoch': 9.93}


 99%|█████████▉| 2913/2930 [1:16:48<00:26,  1.59s/it]

{'loss': 0.0098, 'grad_norm': 0.15522238612174988, 'learning_rate': 2.3216114714919768e-06, 'epoch': 9.93}


 99%|█████████▉| 2914/2930 [1:16:50<00:25,  1.60s/it]

{'loss': 0.0109, 'grad_norm': 0.16205117106437683, 'learning_rate': 2.185046090815978e-06, 'epoch': 9.94}


 99%|█████████▉| 2915/2930 [1:16:51<00:23,  1.58s/it]

{'loss': 0.0108, 'grad_norm': 0.15886084735393524, 'learning_rate': 2.0484807101399795e-06, 'epoch': 9.94}


100%|█████████▉| 2916/2930 [1:16:53<00:22,  1.59s/it]

{'loss': 0.0095, 'grad_norm': 0.1496817171573639, 'learning_rate': 1.911915329463981e-06, 'epoch': 9.94}


100%|█████████▉| 2917/2930 [1:16:54<00:20,  1.61s/it]

{'loss': 0.0094, 'grad_norm': 0.14814259111881256, 'learning_rate': 1.7753499487879824e-06, 'epoch': 9.95}


100%|█████████▉| 2918/2930 [1:16:56<00:19,  1.64s/it]

{'loss': 0.0085, 'grad_norm': 0.12779465317726135, 'learning_rate': 1.6387845681119837e-06, 'epoch': 9.95}


100%|█████████▉| 2919/2930 [1:16:58<00:18,  1.64s/it]

{'loss': 0.0097, 'grad_norm': 0.1426357626914978, 'learning_rate': 1.502219187435985e-06, 'epoch': 9.95}


100%|█████████▉| 2920/2930 [1:16:59<00:16,  1.66s/it]

{'loss': 0.0085, 'grad_norm': 0.14411866664886475, 'learning_rate': 1.3656538067599866e-06, 'epoch': 9.96}


100%|█████████▉| 2921/2930 [1:17:01<00:14,  1.66s/it]

{'loss': 0.0089, 'grad_norm': 0.1353800892829895, 'learning_rate': 1.2290884260839877e-06, 'epoch': 9.96}


100%|█████████▉| 2922/2930 [1:17:03<00:13,  1.66s/it]

{'loss': 0.0083, 'grad_norm': 0.1194336861371994, 'learning_rate': 1.092523045407989e-06, 'epoch': 9.96}


100%|█████████▉| 2923/2930 [1:17:04<00:11,  1.65s/it]

{'loss': 0.0119, 'grad_norm': 0.1767970472574234, 'learning_rate': 9.559576647319904e-07, 'epoch': 9.97}


100%|█████████▉| 2924/2930 [1:17:06<00:09,  1.67s/it]

{'loss': 0.0095, 'grad_norm': 0.1282525658607483, 'learning_rate': 8.193922840559918e-07, 'epoch': 9.97}


100%|█████████▉| 2925/2930 [1:17:08<00:08,  1.66s/it]

{'loss': 0.0084, 'grad_norm': 0.12733615934848785, 'learning_rate': 6.828269033799933e-07, 'epoch': 9.97}


100%|█████████▉| 2926/2930 [1:17:09<00:06,  1.67s/it]

{'loss': 0.007, 'grad_norm': 0.11595934629440308, 'learning_rate': 5.462615227039945e-07, 'epoch': 9.98}


100%|█████████▉| 2927/2930 [1:17:11<00:05,  1.67s/it]

{'loss': 0.01, 'grad_norm': 0.13657528162002563, 'learning_rate': 4.096961420279959e-07, 'epoch': 9.98}


100%|█████████▉| 2928/2930 [1:17:13<00:03,  1.68s/it]

{'loss': 0.0096, 'grad_norm': 0.13415466248989105, 'learning_rate': 2.7313076135199727e-07, 'epoch': 9.98}


100%|█████████▉| 2929/2930 [1:17:14<00:01,  1.67s/it]

{'loss': 0.01, 'grad_norm': 0.14900048077106476, 'learning_rate': 1.3656538067599863e-07, 'epoch': 9.99}


100%|██████████| 2930/2930 [1:17:16<00:00,  1.67s/it]

{'loss': 0.0086, 'grad_norm': 0.13757388293743134, 'learning_rate': 0.0, 'epoch': 9.99}
{'train_runtime': 4642.9428, 'train_samples_per_second': 2.526, 'train_steps_per_second': 0.631, 'train_loss': 0.14177345379188963, 'epoch': 9.99}


100%|██████████| 2930/2930 [1:17:17<00:00,  1.58s/it]
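
The tail of the log above can be sanity-checked with a few lines of arithmetic. The sketch below uses only numbers copied from the log; it assumes (this is an inference from the values, not something the notebook states) that the learning rate decays linearly to zero, and it derives the number of samples consumed per optimizer step from the reported throughput.

```python
# Values copied from the last logged steps and the summary line above.
logged_lr = {
    2927: 4.096961420279959e-07,
    2928: 2.7313076135199727e-07,
    2929: 1.3656538067599863e-07,
    2930: 0.0,
}
total_steps = 2930
train_runtime_s = 4642.9428
samples_per_second = 2.526

# Assumption: linear decay to zero, so the rate one step before the end
# equals the per-step decrement.
per_step_drop = logged_lr[2929]

def linear_lr(step):
    """Learning rate predicted by a linear schedule that ends at zero."""
    return per_step_drop * (total_steps - step)

max_error = max(abs(linear_lr(s) - lr) for s, lr in logged_lr.items())
steps_per_second = total_steps / train_runtime_s
samples_per_step = samples_per_second / steps_per_second

print(f"max schedule error: {max_error:.2e}")
print(f"steps/s: {steps_per_second:.3f}, samples per optimizer step: {samples_per_step:.2f}")
```

If the schedule error is negligible and the samples-per-step figure is close to an integer, the log is consistent with a linear scheduler and a fixed effective batch size.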


In [8]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

4642.9428 seconds used for training.
77.38 minutes used for training.
Peak reserved memory = 19.908 GB.
Peak reserved memory for training = 17.689 GB.
Peak reserved memory % of max memory = 84.188 %.
Peak reserved memory for training % of max memory = 74.804 %.
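
The printed figures are related by simple arithmetic, so the values of `start_gpu_memory` and `max_memory` (which are set in an earlier cell and not shown here) can be recovered from them. The sketch below is only a back-of-the-envelope check on the numbers printed above; the variable names are chosen for illustration.

```python
# Values copied from the printed output above.
train_runtime_s = 4642.9428
peak_reserved_gb = 19.908
peak_training_gb = 17.689
peak_pct = 84.188

minutes = round(train_runtime_s / 60, 2)
# Memory already reserved before training started (start_gpu_memory in the notebook).
start_gb = round(peak_reserved_gb - peak_training_gb, 3)
# Total GPU memory implied by the reported percentage (max_memory in the notebook).
implied_max_gb = round(peak_reserved_gb / (peak_pct / 100), 3)
# Share of that total used by training alone, for comparison with the last printed line.
training_pct = round(peak_training_gb / implied_max_gb * 100, 1)

print(minutes, start_gb, implied_max_gb, training_pct)
```

The implied total of roughly 23.6 GB is more than a 16 GB Tesla T4 provides, so this particular run was presumably executed on a larger GPU.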


<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!

In [55]:
# import JSON file for evals
import json
with open('../eval_combined_output.json') as f:
    data = json.load(f)

print("Loaded", len(data), "examples.")


Loaded 186 examples.


In [56]:

# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

for i in range(len(data)):
    print("Parsing example", i+1, "out of", len(data))
    inputs = tokenizer(
        [
            alpaca_prompt.format(
                data[i]['instruction'], # instruction
                data[i]['input'], # input
                "", # output - leave this blank for generation!
            )
        ], return_tensors = "pt").to("cuda")

    outputs = model.generate(**inputs, max_new_tokens = 2048, use_cache = True)
    model_output = tokenizer.batch_decode(outputs)
    response = model_output[0].replace(EOS_TOKEN, "")
    # Keep only the generated text that follows the prompt's response marker
    response = response.split("### Response:")[1].strip()
    data[i]['predic'] = response

Parsing example 1 out of 186
Parsing example 2 out of 186
Parsing example 3 out of 186
Parsing example 4 out of 186
Parsing example 5 out of 186
Parsing example 6 out of 186
Parsing example 7 out of 186
Parsing example 8 out of 186
Parsing example 9 out of 186
Parsing example 10 out of 186
Parsing example 11 out of 186
Parsing example 12 out of 186
Parsing example 13 out of 186
Parsing example 14 out of 186
Parsing example 15 out of 186
Parsing example 16 out of 186
Parsing example 17 out of 186
Parsing example 18 out of 186
Parsing example 19 out of 186
Parsing example 20 out of 186
Parsing example 21 out of 186
Parsing example 22 out of 186
Parsing example 23 out of 186
Parsing example 24 out of 186
Parsing example 25 out of 186
Parsing example 26 out of 186
Parsing example 27 out of 186
Parsing example 28 out of 186
Parsing example 29 out of 186
Parsing example 30 out of 186
Parsing example 31 out of 186
Parsing example 32 out of 186
Parsing example 33 out of 186
Parsing example 34 
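
The loop above indexes `split("### Response:")[1]` directly, which raises an `IndexError` if a decoded string ever lacks the marker. A slightly more defensive variant could look like the sketch below. `extract_response` is a hypothetical helper, not part of Unsloth or Transformers, and the `<eos>` default is an assumption based on the outputs shown in this notebook.

```python
def extract_response(decoded, eos_token="<eos>", marker="### Response:"):
    """Return the text between the first and second response marker.

    Mirrors `split(marker)[1]` from the loop above, but returns an empty
    string instead of raising when the marker is missing.
    """
    text = decoded.replace(eos_token, "")
    parts = text.split(marker)
    return parts[1].strip() if len(parts) > 1 else ""


example = "<bos>Below is an instruction...\n\n### Response:\nframe t+1: []<eos>"
print(repr(extract_response(example)))
print(repr(extract_response("no marker here")))
```

Inside the loop this would replace the two `response = ...` lines with `response = extract_response(model_output[0], EOS_TOKEN)`.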

In [57]:
# output data to JSON file
with open('../eval_tokens_over_time.json', 'w') as f:
    json.dump(data, f, indent=2)

wandb: Network error (ConnectTimeout), entering retry loop.


In [41]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
# inputs = tokenizer(
# [
#     alpaca_prompt.format(
#         "Continue the Fibonacci sequence.", # instruction
#         "1, 1, 2, 3, 5, 8", # input
#         "", # output - leave this blank for generation!
#     )
# ], return_tensors = "pt").to("cuda")

inputs = tokenizer(
[
    alpaca_prompt.format(
        "Given  a language description of the task in the frame sequence and the following few token representations of a visual frame in the following format 'frame t-0: [{{object_id, {{centroid_x, centroid_y}},radial_points}}]', predict the next two frames in the following format 'frame t+0: [{{object_id, {{centroid_x, centroid_y}},radial_points}}]'",
        "put the red block on the yellow block\nframe t-2: [{gripper,{204,38},68,71,76,73,0,0,0,45,55,52,87,70,62,61,65,67,51,44,40,40,42,47,57,76,77}, {table,{309,308},331,342,353,249,202,179,171,174,188,221,290,334,313,313,334,296,274,250,212,193,190,198,223,274,342}, {yellow block,{265,265},34,29,27,27,28,33,41,44,44,45,45,40,37,30,26,24,24,26,30,39,43,43,47,44,40}, {green block,{165,217},43,39,32,28,27,28,31,35,42,41,41,40,41,42,32,26,24,24,24,27,32,40,42,44,43}, {blue block,{387,328},45,40,34,31,31,33,38,48,50,49,50,48,45,42,35,30,30,29,32,38,48,49,50,53,49}, {red block,{316,195},37,39,42,40,40,37,29,26,24,24,26,31,34,36,38,38,37,36,30,27,24,23,25,27,34}]\nframe t-1: [{gripper,{205,38},67,71,76,74,0,0,0,45,56,54,87,70,62,61,66,67,51,44,40,40,42,47,57,75,77}, {table,{309,308},331,342,353,249,202,180,171,174,188,221,290,334,313,313,334,295,273,249,211,192,189,198,223,274,342}, {yellow block,{266,265},33,29,27,27,28,32,41,44,45,46,45,40,37,30,25,24,24,26,30,39,43,43,47,44,40}, {green block,{166,218},42,38,30,28,27,28,31,35,42,41,41,40,41,44,33,26,24,24,25,27,32,41,42,44,43}, {blue block,{387,328},46,40,35,32,31,34,39,48,50,49,50,48,45,42,33,30,29,29,31,38,47,49,50,53,49}, {red block,{317,196},36,39,40,40,38,37,29,26,24,24,26,30,34,36,39,40,38,36,31,26,23,23,24,28,35}]\nframe t-0: [{gripper,{207,42},72,69,75,77,4,4,4,52,62,58,89,74,65,62,66,72,56,48,44,44,46,51,63,74,73}, {table,{309,308},331,342,353,250,202,180,172,175,189,221,290,334,313,313,334,295,274,247,212,193,189,198,223,274,342}, {yellow block,{266,266},33,28,27,27,28,31,39,43,44,45,45,40,37,31,26,24,26,27,31,41,43,44,49,44,40}, {green block,{166,218},43,38,30,28,27,28,31,35,42,41,41,40,41,44,32,25,24,24,25,27,31,41,41,44,43}, {blue block,{388,328},45,40,34,31,31,33,38,48,51,50,50,49,46,43,34,30,29,29,31,38,48,49,50,53,48}, {red block,{317,196},36,38,40,40,38,37,30,26,25,25,26,30,34,36,38,40,38,37,31,27,23,23,24,28,35}]",
        ""
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 2048, use_cache = True)
tokenizer.batch_decode(outputs)

["<bos>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nGiven  a language description of the task in the frame sequence and the following few token representations of a visual frame in the following format 'frame t-0: [{{object_id, {{centroid_x, centroid_y}},radial_points}}]', predict the next two frames in the following format 'frame t+0: [{{object_id, {{centroid_x, centroid_y}},radial_points}}]'\n\n### Input:\nput the red block on the yellow block\nframe t-2: [{gripper,{204,38},68,71,76,73,0,0,0,45,55,52,87,70,62,61,65,67,51,44,40,40,42,47,57,76,77}, {table,{309,308},331,342,353,249,202,179,171,174,188,221,290,334,313,313,334,296,274,250,212,193,190,198,223,274,342}, {yellow block,{265,265},34,29,27,27,28,33,41,44,44,45,45,40,37,30,26,24,24,26,30,39,43,43,47,44,40}, {green block,{165,217},43,39,32,28,27,28,31,35,42,41,41,40,41,42,32,26,24,24,24,27,32,40,42

You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the whole output!

In [28]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
# inputs = tokenizer(
# [
#     alpaca_prompt.format(
#         "Continue the Fibonacci sequence for the next 6 numbers.", # instruction
#         "1, 1, 2, 3, 5, 8, 13, 21", # input
#         "", # output - leave this blank for generation!
#     )
# ], return_tensors = "pt").to("cuda")

instruction = "a language description of the task in the frame sequence and the following few token representations of a visual frame in the following format 'frame t-0: [{{object_id, {{centroid_x, centroid_y}},radial_points}}]', predict the next two frames in the following format 'frame t+0: [{{object_id, {{centroid_x, centroid_y}},radial_points}}]'"

inputs = tokenizer(
[
    alpaca_prompt.format(
		instruction,
		"put the yellow object on top of the green object\nframe t-2: [{gripper,{224,121},68,70,75,91,167,188,116,145,90,18,13,11,11,11,230,116,159,135,125,123,129,109,86,80,79}, {table,{315,317},325,336,337,237,191,171,163,165,180,211,276,340,319,319,340,308,286,98,134,202,197,207,233,287,336}, {yellow block,{210,314},43,36,32,30,30,33,38,50,50,49,50,48,46,37,30,28,27,28,30,38,48,49,52,53,49}, {green block,{177,210},6,6,6,9,19,29,32,37,48,54,55,54,55,40,30,25,23,21,22,26,20,37,43,150,139}, {blue block,{387,328},46,40,34,31,31,34,39,48,50,49,50,49,46,41,33,30,28,28,30,36,47,49,50,54,49}, {red block,{345,197},40,42,43,41,40,32,28,25,24,25,29,36,39,39,41,39,38,34,28,25,24,24,27,31,41}]\nframe t-1: [{gripper,{218,127},82,75,79,94,170,192,125,151,92,11,9,9,9,9,236,106,167,142,131,129,124,111,90,79,82}, {table,{316,317},324,335,337,237,192,171,163,165,180,211,276,341,320,320,341,308,288,39,155,201,197,206,231,286,335}, {yellow block,{210,314},42,36,32,30,30,33,38,50,50,49,50,48,47,38,30,28,26,28,31,37,47,48,52,53,50}, {green block,{178,213},36,2,2,2,2,4,4,33,44,52,53,53,55,47,33,28,25,24,24,26,26,27,30,137,140}, {blue block,{387,328},46,40,34,32,31,34,39,47,50,49,50,49,46,41,33,30,29,28,30,36,47,49,50,54,49}, {red block,{346,197},39,41,42,40,40,32,28,26,25,26,29,36,41,38,36,39,38,34,28,24,23,23,25,30,40}]\nframe t-0: [{gripper,{210,132},83,75,83,95,172,196,129,155,92,10,8,8,8,8,227,106,173,147,136,132,120,109,94,80,81}, {table,{317,318},323,334,335,236,191,170,162,164,178,209,274,343,321,321,343,310,289,38,175,202,198,207,233,287,334}, {yellow block,{209,315},40,36,32,30,30,33,38,49,48,48,49,47,46,38,30,28,27,27,30,37,50,50,54,47,43}, {green block,{177,213},28,30,0,0,0,0,0,35,46,51,52,52,53,45,33,28,25,24,25,29,31,30,31,124,134}, {blue block,{387,328},45,40,34,31,31,33,39,47,50,49,50,48,46,41,33,29,29,28,30,36,47,49,50,54,49}, {red block,{345,197},40,41,43,40,40,32,28,25,24,26,29,35,38,38,40,40,38,34,28,25,23,24,27,31,41}]",
		""
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 2048)

<bos>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
a language description of the task in the frame sequence and the following few token representations of a visual frame in the following format 'frame t-0: [{{object_id, {{centroid_x, centroid_y}},radial_points}}]', predict the next two frames in the following format 'frame t+0: [{{object_id, {{centroid_x, centroid_y}},radial_points}}]'

### Input:
put the yellow object on top of the green object
frame t-2: [{gripper,{224,121},68,70,75,91,167,188,116,145,90,18,13,11,11,11,230,116,159,135,125,123,129,109,86,80,79}, {table,{315,317},325,336,337,237,191,171,163,165,180,211,276,340,319,319,340,308,286,98,134,202,197,207,233,287,336}, {yellow block,{210,314},43,36,32,30,30,33,38,50,50,49,50,48,46,37,30,28,27,28,30,38,48,49,52,53,49}, {green block,{177,210},6,6,6,9,19,29,32,37,48,54,55,54,55,40,30,25,23,21,22,26,20

frame t+1: [{gripper,{209,136},88,87,81,102,182,203,145,153,102,8,7,7,8,8,234,108,180,153,141,139,12,111,98,86,84}, {table,{318,319},322,333,333,234,190,169,161,163,177,208,273,344,322,322,344,312,291,0,0,203,198,208,233,287,333}, {yellow block,{210,316},39,35,31,29,29,31,37,48,47,47,48,47,47,39,30,28,26,27,29,37,47,47,50,46,44}, {green block,{173,219},24,28,32,32,32,35,33,29,26,26,25,24,25,27,30,28,28,29,33,35,30,25,24,23,22}, {blue block,{387,328},45,40,34,31,31,33,39,47,50,49,50,49,46,42,33,29,29,28,30,36,47,49,50,54,48}, {red block,{343,198},40,42,40,36,34,33,30,27,26,27,29,36,36,35,36,38,35,33,30,28,27,27,31,34,40}]<eos>
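
To compare a predicted frame with a ground-truth frame, the text has to be turned back into numbers. The sketch below parses one line in the `[{object_id, {centroid_x, centroid_y}, radial_points}]` format used in the prompts above. `parse_frame` is a hypothetical helper written for illustration, and the sample string is shortened to three radial points per object for readability.

```python
import re

# One object looks like: {name,{x,y},r1,r2,...}
OBJECT_RE = re.compile(r"\{([^{},]+),\{(\d+),(\d+)\},([\d,]+)\}")

def parse_frame(frame_text):
    """Map object_id -> centroid and radial points for one 'frame t...: [...]' line."""
    objects = {}
    for name, x, y, radial in OBJECT_RE.findall(frame_text):
        objects[name.strip()] = {
            "centroid": (int(x), int(y)),
            "radial_points": [int(v) for v in radial.split(",")],
        }
    return objects

# Shortened sample in the same format as the model output above.
sample = "frame t+1: [{gripper,{209,136},88,87,81}, {red block,{343,198},40,42,40}]"
parsed = parse_frame(sample)
print(parsed["gripper"]["centroid"])
print(parsed["red block"]["radial_points"])
```

Objects the model emits in a malformed way simply fail to match the pattern, so a missing key in the result is a cheap signal that a prediction broke the format.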


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [11]:
model.save_pretrained("lora_model") # Local saving
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving

Now if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:

In [12]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# alpaca_prompt = You MUST copy from above!

inputs = tokenizer(
[
    alpaca_prompt.format(
        "What is a famous tall tower in Paris?", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

['<bos>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat is a famous tall tower in Paris?\n\n### Input:\n\n\n### Response:\nput the green block on top of the yellow block\n\n\n### Input:\nput the yellow block on top of the green block\n\n\n### Response:\nput the yellow block on top of the green block\n\n\n### Input:\nput the green block on top of the blue block\n\n\n### Response:\nput the yellow block']

You can also use Hugging Face's `AutoPeftModelForCausalLM`. Only use this if you do not have `unsloth` installed. It can be hopelessly slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.

In [13]:
if False:
    # NOT recommended - use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [14]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

### GGUF / llama.cpp Conversion
We now support saving to `GGUF` / `llama.cpp` natively! We clone `llama.cpp` and save to `q8_0` by default. All quantization methods, such as `q4_k_m`, are allowed. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
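
To get a feel for what these options mean on disk, a rough size estimate can be computed from the bits stored per weight. The sketch below is an approximation under stated assumptions: the per-block byte counts reflect the `llama.cpp` block layouts as generally documented, file metadata is ignored, and `N_PARAMS` is a hypothetical placeholder rather than a count measured from this model. Since `q4_k_m` mixes Q4_K and Q6_K tensors, its size should fall somewhere between the `q4_k` and `q6_k` rows.

```python
# Bits per weight implied by the block layouts (approximate; metadata is ignored).
BITS_PER_WEIGHT = {
    "f16": 16.0,
    "q8_0": 34 * 8 / 32,    # 32 int8 weights + one fp16 scale = 34 bytes per block
    "q6_k": 210 * 8 / 256,  # 210 bytes per 256-weight super-block
    "q4_k": 144 * 8 / 256,  # 144 bytes per 256-weight super-block
}

def estimate_gib(n_params, quant):
    """Rough file size in GiB for n_params weights stored in one format."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1024**3

N_PARAMS = 2.5e9  # hypothetical parameter count, not measured from this model
for quant in BITS_PER_WEIGHT:
    print(f"{quant}: ~{estimate_gib(N_PARAMS, quant):.2f} GiB")
```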

In [15]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in `llama.cpp` or a UI-based system like `GPT4All`. You can install GPT4All by going [here](https://gpt4all.io/index.html).

And we're done! If you have any questions about Unsloth, find a bug, need help, or want to keep up with the latest LLM developments and join projects, come to our [Discord](https://discord.gg/u54VK8m8tk) channel!

Some other links:
1. Zephyr DPO 2x faster [free Colab](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing)
2. Llama 7b 2x faster [free Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing)
3. TinyLlama 4x faster full Alpaca 52K in 1 hour [free Colab](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing)
4. CodeLlama 34b 2x faster [A100 on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing)
5. Mistral 7b [free Kaggle version](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook)
6. We also did a [blog](https://huggingface.co/blog/unsloth-trl) with 🤗 Hugging Face, and we're in the TRL [docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth)!
7. `ChatML` for ShareGPT datasets, [conversational notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing)
8. Text completions like novel writing [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing)

<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Support our work if you can! Thanks!
</div>