To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Join Discord if you need help + support us if you can!
</div>

To install Unsloth on your own computer, follow the installation instructions on our GitHub page [here](https://github.com/unslothai/unsloth#installation-instructions---conda).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), and [how to save it](#Save) (e.g. for llama.cpp).

See our [blog post](https://unsloth.ai/blog/gemma) on how we made **Gemma 7b 2.5x faster** and **Gemma 2x faster**!

In [3]:
# %%capture
import torch
import os

os.environ["WANDB_PROJECT"] = "gemma_2b_test"
os.environ["WANDB_LOG_MODEL"] = "checkpoint"
# major_version, minor_version = torch.cuda.get_device_capability()
# if major_version >= 8:
#     # Use this for new GPUs like Ampere, Hopper GPUs (RTX 30xx, RTX 40xx, A100, H100, L40)
#     !pip install "unsloth[colab-ampere] @ git+https://github.com/unslothai/unsloth.git"
# else:
#     # Use this for older GPUs (V100, Tesla T4, RTX 20xx)
#     !pip install "unsloth[colab] @ git+https://github.com/unslothai/unsloth.git"
# pass

* We support Llama, Mistral, CodeLlama, TinyLlama, Vicuna, Open Hermes, etc.
* And Yi, Qwen ([llamafied](https://huggingface.co/models?sort=trending&search=qwen+llama)), DeepSeek, and all Llama- and Mistral-derived architectures.
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* [**NEW**] With [PR 26037](https://github.com/huggingface/transformers/pull/26037), we support downloading 4bit models **4x faster**! [Our repo](https://huggingface.co/unsloth) has Llama, Mistral 4bit models.
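To make the RoPE scaling bullet concrete, here is a minimal sketch of linear position interpolation, the idea usually associated with kaiokendev's method: positions are divided by a scale factor so that a longer sequence maps back into the position range the model saw during training. The function name `rope_angles`, the `base` of 10000, and the 4096-token original context are illustrative assumptions; this is not Unsloth's internal implementation.

```python
def rope_angles(position, head_dim, base=10000.0, scale=1.0):
    """Rotary angles for one position, with optional linear position scaling."""
    scaled_position = position / scale  # scale > 1 squeezes long sequences into the trained range
    return [scaled_position / (base ** (2 * i / head_dim)) for i in range(head_dim // 2)]

original_ctx = 4096    # assumed training context length (hypothetical)
max_seq_length = 8192  # value used in this notebook
scale = max(1.0, max_seq_length / original_ctx)

# With scale = 2, position 8000 gets the same angles that position 4000 had originally.
angles_scaled = rope_angles(8000, head_dim=8, scale=scale)
angles_native = rope_angles(4000, head_dim=8)
print(angles_scaled == angles_native)
```

The trade-off of this approach is that neighbouring positions end up closer together in angle space, which is why some fine-tuning at the longer length is usually recommended.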

In [4]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 8192 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
    "unsloth/llama-2-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",
    "unsloth/gemma-7b-it-bnb-4bit", # Instruct version of Gemma 7b
    "unsloth/gemma-2b-bnb-4bit",
    "unsloth/gemma-2b-it-bnb-4bit", # Instruct version of Gemma 2b
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gemma-2b-bnb-4bit", # Choose ANY! eg teknium/OpenHermes-2.5-Mistral-7B
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

==((====))==  Unsloth: Fast Gemma patching release 2024.3
   \\   /|    GPU: NVIDIA GeForce RTX 4090. Max memory: 23.647 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.1+cu121. CUDA = 8.9. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. Xformers = 0.0.22.post7. FA = True.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
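
As a rough illustration of why `load_in_4bit = True` reduces memory, the sketch below estimates the storage needed for the weights alone at different bit widths. The helper `weight_memory_gib` and the 7-billion-parameter figure are assumptions for illustration; the estimate ignores quantization constants, layers that stay in higher precision, activations, and optimizer state, so real usage will differ.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate weight storage in GiB, counting only the raw parameter bits."""
    return num_params * bits_per_param / 8 / 1024**3

num_params = 7_000_000_000  # hypothetical 7B-parameter model

fp16_gib = weight_memory_gib(num_params, 16)
four_bit_gib = weight_memory_gib(num_params, 4)
print(f"16-bit: {fp16_gib:.2f} GiB, 4-bit: {four_bit_gib:.2f} GiB")
```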


We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [5]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = True,
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.3 patched 18 layers with 18 QKV layers, 18 O layers and 18 MLP layers.
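
To see where the trainable parameter count comes from, the sketch below counts LoRA parameters by hand. For a linear layer with `in_features` inputs and `out_features` outputs, a rank-`r` adapter adds two matrices with `r * (in_features + out_features)` parameters in total. The layer dimensions are taken from the public Gemma 2b configuration and are an assumption here, since the notebook does not print them; the 18 layers and `r = 16` come from the cells above.

```python
def lora_params(in_features, out_features, r):
    """LoRA adds A (r x in_features) and B (out_features x r)."""
    return r * (in_features + out_features)

# Assumed Gemma 2b dimensions (from its public config, not printed in this notebook).
hidden, intermediate = 2048, 16384
num_heads, num_kv_heads, head_dim = 8, 1, 256
num_layers, r = 18, 16

# (in_features, out_features) for each module listed in target_modules.
shapes = {
    "q_proj":    (hidden, num_heads * head_dim),
    "k_proj":    (hidden, num_kv_heads * head_dim),
    "v_proj":    (hidden, num_kv_heads * head_dim),
    "o_proj":    (num_heads * head_dim, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj":   (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

per_layer = sum(lora_params(i, o, r) for i, o in shapes.values())
total = per_layer * num_layers
print(f"{total:,}")  # 19,611,648
```

Under these assumptions the result matches the "Number of trainable parameters" that the trainer reports later in this notebook.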


<a name="Data"></a>
### Data Prep
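We now use the Alpaca dataset from [yahma](https://huggingface.co/datasets/yahma/alpaca-cleaned), a filtered version of the original 52K-example [Alpaca dataset](https://crfm.stanford.edu/2023/03/13/alpaca.html). You can replace this code section with your own data prep. The run recorded in this notebook does exactly that: it loads a local JSON file (`input-small.json`, 43 examples) with the same `instruction`, `input`, and `output` fields.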
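
If you bring your own data, one simple option is a JSON Lines file with `instruction`, `input`, and `output` fields, matching the keys that the formatting function in this notebook reads. The sketch below writes such a file with the standard library and reads it back; the file name `my-data.jsonl` and the two records are made up for illustration.

```python
import json
import os
import tempfile

# Hypothetical records in the Alpaca field layout.
records = [
    {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"},
    {"instruction": "Name a prime number.", "input": "", "output": "7"},
]

path = os.path.join(tempfile.mkdtemp(), "my-data.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        # One JSON object per line.
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read the file back to check that it round-trips.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(len(loaded), sorted(loaded[0].keys()))
```

A file like this can then be passed as `data_files` to `load_dataset('json', ...)`, as the data cell in this section does with its own file.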

**[NOTE]** To train only on completions (ignoring the user's input) read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).

**[NOTE]** Remember to add the **EOS_TOKEN** to the tokenized output!! Otherwise you'll get infinite generations!

If you want to use the `ChatML` template for ShareGPT datasets, try our conversational [notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing).

For text completions like novel writing, try this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing).

In [6]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass

from datasets import load_dataset
# dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
# Load a local JSON dataset from input-small.json instead
dataset = load_dataset('json', data_files='input-small.json', split='train')
dataset = dataset.map(formatting_prompts_func, batched = True,)

Generating train split: 43 examples [00:00, 14509.66 examples/s]
Map: 100%|██████████| 43/43 [00:00<00:00, 14504.99 examples/s]
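
To check what the formatting step produces without loading a model, the sketch below applies the same template to a single toy example. `EOS_TOKEN` is replaced by the stand-in string `"<eos>"` (in the notebook it comes from `tokenizer.eos_token`), and the example record is made up.

```python
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = "<eos>"  # stand-in; the notebook uses tokenizer.eos_token

def format_batch(examples):
    """Same logic as formatting_prompts_func, applied to a batched dict of lists."""
    texts = []
    for instruction, input_text, output in zip(
        examples["instruction"], examples["input"], examples["output"]
    ):
        texts.append(alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN)
    return {"text": texts}

batch = {"instruction": ["Add the numbers."], "input": ["2 and 3"], "output": ["5"]}
texts = format_batch(batch)["text"]
print(texts[0])
```

Each formatted string ends with the EOS marker, which is what teaches the model to stop generating after the response.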


<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). The original template runs 60 steps to speed things up; for a full run, set `num_train_epochs` and leave `max_steps` unset. The run recorded here sets `num_train_epochs = 100` on a 43-example dataset. We also support TRL's `DPOTrainer`!
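
The relationship between epochs, batch size, and optimizer steps can be worked out by hand. The sketch below uses the values from this notebook's training cell (43 examples, batch size 1, 4 gradient accumulation steps, 100 epochs) and assumes the trainer floors the number of optimizer updates per epoch, which is our understanding of how the Hugging Face `Trainer` counts them; the variable names are illustrative.

```python
import math

num_examples = 43
per_device_batch_size = 1
gradient_accumulation_steps = 4
num_train_epochs = 100
num_gpus = 1

# Micro-batches the dataloader yields per epoch.
batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_gpus))

# Optimizer updates per epoch (assumed to be floored, with a minimum of 1).
updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)

total_steps = math.ceil(num_train_epochs * updates_per_epoch)
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus

print(f"Total batch size = {effective_batch_size} | Total steps = {total_steps:,}")
```

Under that assumption the numbers agree with the "Total batch size" and "Total steps" lines in the training banner printed when `trainer.train()` starts.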

In [7]:
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 4,
        warmup_steps = 1,
        # max_steps = 60, # Uncomment to cap training at a fixed number of steps instead of epochs
        num_train_epochs = 100,
        learning_rate = 8e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to="wandb",
    ),
)

Map (num_proc=2): 100%|██████████| 43/43 [00:00<00:00, 46.57 examples/s]
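
The `lr_scheduler_type = "linear"` and `warmup_steps = 1` settings give a learning rate that ramps up during warmup and then decays linearly to zero. The sketch below is a plain-Python version of that schedule, written from our understanding of the linear-with-warmup scheduler; `linear_lr` is a hypothetical helper, not a library function. Note that the learning rates logged in this notebook's training output start at 0.002, which is consistent with a peak of `2e-3` rather than the `8e-4` in the cell above, so the recorded output may come from a run with a different setting.

```python
def linear_lr(step, peak_lr, total_steps, warmup_steps):
    """Learning rate after `step` scheduler updates: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, remaining)

# Peak of 2e-3 chosen to match the logged values, not the 8e-4 in the training cell.
for step in (1, 2, 3, 159):
    print(step, linear_lr(step, peak_lr=2e-3, total_steps=1000, warmup_steps=1))
```

Swapping in `peak_lr=8e-4` shows the schedule the training cell would produce as written.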


In [8]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA GeForce RTX 4090. Max memory = 23.647 GB.
2.219 GB of memory reserved.


In [9]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 43 | Num Epochs = 100
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 4
\        /    Total batch size = 4 | Total steps = 1,000
 "-____-"     Number of trainable parameters = 19,611,648
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33myashas-salankimatt[0m ([33myashas-personal[0m). Use [1m`wandb login --relogin`[0m to force relogin


  0%|          | 1/1000 [00:03<56:42,  3.41s/it]

{'loss': 0.6248, 'grad_norm': 0.2204626500606537, 'learning_rate': 0.002, 'epoch': 0.09}


  0%|          | 2/1000 [00:06<52:19,  3.15s/it]

{'loss': 0.6299, 'grad_norm': 0.11693359166383743, 'learning_rate': 0.001997997997997998, 'epoch': 0.19}


  0%|          | 3/1000 [00:09<50:52,  3.06s/it]

{'loss': 0.596, 'grad_norm': 0.38200682401657104, 'learning_rate': 0.001995995995995996, 'epoch': 0.28}


  0%|          | 4/1000 [00:12<51:46,  3.12s/it]

{'loss': 0.7228, 'grad_norm': 0.9068372249603271, 'learning_rate': 0.0019939939939939938, 'epoch': 0.37}


  0%|          | 5/1000 [00:15<50:45,  3.06s/it]

{'loss': 0.6225, 'grad_norm': 0.4440826177597046, 'learning_rate': 0.001991991991991992, 'epoch': 0.47}


  1%|          | 6/1000 [00:18<50:19,  3.04s/it]

{'loss': 0.5211, 'grad_norm': 0.29924121499061584, 'learning_rate': 0.00198998998998999, 'epoch': 0.56}


  1%|          | 7/1000 [00:21<50:05,  3.03s/it]

{'loss': 0.5388, 'grad_norm': 0.5933983325958252, 'learning_rate': 0.001987987987987988, 'epoch': 0.65}


  1%|          | 8/1000 [00:24<51:09,  3.09s/it]

{'loss': 0.5192, 'grad_norm': 0.2977130115032196, 'learning_rate': 0.001985985985985986, 'epoch': 0.74}


  1%|          | 9/1000 [00:27<50:36,  3.06s/it]

{'loss': 0.5086, 'grad_norm': 0.4721057713031769, 'learning_rate': 0.001983983983983984, 'epoch': 0.84}


  1%|          | 10/1000 [00:30<50:22,  3.05s/it]

{'loss': 0.5412, 'grad_norm': 0.57562255859375, 'learning_rate': 0.001981981981981982, 'epoch': 0.93}


  1%|          | 11/1000 [00:33<50:05,  3.04s/it]

{'loss': 0.4778, 'grad_norm': 0.5603854656219482, 'learning_rate': 0.00197997997997998, 'epoch': 1.02}


  1%|          | 12/1000 [00:36<49:58,  3.04s/it]

{'loss': 0.4624, 'grad_norm': 0.3974635601043701, 'learning_rate': 0.0019779779779779782, 'epoch': 1.12}


  1%|▏         | 13/1000 [00:39<49:41,  3.02s/it]

{'loss': 0.4669, 'grad_norm': 0.45064789056777954, 'learning_rate': 0.0019759759759759763, 'epoch': 1.21}


  1%|▏         | 14/1000 [00:42<49:22,  3.00s/it]

{'loss': 0.4701, 'grad_norm': 0.6731711626052856, 'learning_rate': 0.001973973973973974, 'epoch': 1.3}


  2%|▏         | 15/1000 [00:45<49:15,  3.00s/it]

{'loss': 0.4395, 'grad_norm': 0.18675543367862701, 'learning_rate': 0.001971971971971972, 'epoch': 1.4}


  2%|▏         | 16/1000 [00:48<49:14,  3.00s/it]

{'loss': 0.416, 'grad_norm': 0.16011585295200348, 'learning_rate': 0.00196996996996997, 'epoch': 1.49}


  2%|▏         | 17/1000 [00:51<49:13,  3.00s/it]

{'loss': 0.4102, 'grad_norm': 0.10524789243936539, 'learning_rate': 0.001967967967967968, 'epoch': 1.58}


  2%|▏         | 18/1000 [00:54<49:11,  3.01s/it]

{'loss': 0.4121, 'grad_norm': 0.20033268630504608, 'learning_rate': 0.001965965965965966, 'epoch': 1.67}


  2%|▏         | 19/1000 [00:57<49:14,  3.01s/it]

{'loss': 0.4064, 'grad_norm': 0.11050759255886078, 'learning_rate': 0.0019639639639639638, 'epoch': 1.77}


  2%|▏         | 20/1000 [01:00<49:07,  3.01s/it]

{'loss': 0.3862, 'grad_norm': 0.13043314218521118, 'learning_rate': 0.001961961961961962, 'epoch': 1.86}


  2%|▏         | 21/1000 [01:03<49:02,  3.01s/it]

{'loss': 0.4014, 'grad_norm': 0.14499731361865997, 'learning_rate': 0.00195995995995996, 'epoch': 1.95}


  2%|▏         | 22/1000 [01:06<48:53,  3.00s/it]

{'loss': 0.3537, 'grad_norm': 0.1966066062450409, 'learning_rate': 0.001957957957957958, 'epoch': 2.05}


  2%|▏         | 23/1000 [01:09<48:50,  3.00s/it]

{'loss': 0.3725, 'grad_norm': 0.21345072984695435, 'learning_rate': 0.001955955955955956, 'epoch': 2.14}


  2%|▏         | 24/1000 [01:12<48:43,  3.00s/it]

{'loss': 0.3845, 'grad_norm': 0.3094753324985504, 'learning_rate': 0.001953953953953954, 'epoch': 2.23}


  2%|▎         | 25/1000 [01:15<48:35,  2.99s/it]

{'loss': 0.376, 'grad_norm': 0.31922733783721924, 'learning_rate': 0.001951951951951952, 'epoch': 2.33}


  3%|▎         | 26/1000 [01:18<48:33,  2.99s/it]

{'loss': 0.3701, 'grad_norm': 0.40778297185897827, 'learning_rate': 0.00194994994994995, 'epoch': 2.42}


  3%|▎         | 27/1000 [01:21<48:35,  3.00s/it]

{'loss': 0.327, 'grad_norm': 0.19390186667442322, 'learning_rate': 0.001947947947947948, 'epoch': 2.51}


  3%|▎         | 28/1000 [01:24<48:34,  3.00s/it]

{'loss': 0.3183, 'grad_norm': 0.24013637006282806, 'learning_rate': 0.001945945945945946, 'epoch': 2.6}


  3%|▎         | 29/1000 [01:27<48:38,  3.01s/it]

{'loss': 0.329, 'grad_norm': 0.16304755210876465, 'learning_rate': 0.001943943943943944, 'epoch': 2.7}


  3%|▎         | 30/1000 [01:30<48:34,  3.00s/it]

{'loss': 0.338, 'grad_norm': 0.16777722537517548, 'learning_rate': 0.001941941941941942, 'epoch': 2.79}


  3%|▎         | 31/1000 [01:33<48:31,  3.00s/it]

{'loss': 0.3246, 'grad_norm': 0.17272473871707916, 'learning_rate': 0.00193993993993994, 'epoch': 2.88}


  3%|▎         | 32/1000 [01:36<48:25,  3.00s/it]

{'loss': 0.3276, 'grad_norm': 0.21011485159397125, 'learning_rate': 0.001937937937937938, 'epoch': 2.98}


  3%|▎         | 33/1000 [01:39<48:17,  3.00s/it]

{'loss': 0.2986, 'grad_norm': 0.18070094287395477, 'learning_rate': 0.001935935935935936, 'epoch': 3.07}


  3%|▎         | 34/1000 [01:42<48:18,  3.00s/it]

{'loss': 0.2888, 'grad_norm': 0.1892654299736023, 'learning_rate': 0.001933933933933934, 'epoch': 3.16}


  4%|▎         | 35/1000 [01:45<48:18,  3.00s/it]

{'loss': 0.2671, 'grad_norm': 0.14621806144714355, 'learning_rate': 0.001931931931931932, 'epoch': 3.26}


  4%|▎         | 36/1000 [01:48<48:13,  3.00s/it]

{'loss': 0.286, 'grad_norm': 0.1926691234111786, 'learning_rate': 0.00192992992992993, 'epoch': 3.35}


  4%|▎         | 37/1000 [01:51<48:13,  3.00s/it]

{'loss': 0.2561, 'grad_norm': 0.1583983302116394, 'learning_rate': 0.0019279279279279281, 'epoch': 3.44}


  4%|▍         | 38/1000 [01:54<48:01,  2.99s/it]

{'loss': 0.2626, 'grad_norm': 0.22915111482143402, 'learning_rate': 0.0019259259259259258, 'epoch': 3.53}


  4%|▍         | 39/1000 [01:57<47:57,  2.99s/it]

{'loss': 0.2524, 'grad_norm': 0.17812588810920715, 'learning_rate': 0.0019239239239239238, 'epoch': 3.63}


  4%|▍         | 40/1000 [02:00<47:55,  3.00s/it]

{'loss': 0.2265, 'grad_norm': 0.13349206745624542, 'learning_rate': 0.0019219219219219219, 'epoch': 3.72}


  4%|▍         | 41/1000 [02:03<47:49,  2.99s/it]

{'loss': 0.2535, 'grad_norm': 0.2258288711309433, 'learning_rate': 0.00191991991991992, 'epoch': 3.81}


  4%|▍         | 42/1000 [02:06<47:51,  3.00s/it]

{'loss': 0.2423, 'grad_norm': 0.173696830868721, 'learning_rate': 0.001917917917917918, 'epoch': 3.91}


  4%|▍         | 43/1000 [02:09<47:47,  3.00s/it]

{'loss': 0.2315, 'grad_norm': 0.23605255782604218, 'learning_rate': 0.0019159159159159158, 'epoch': 4.0}


  4%|▍         | 44/1000 [02:12<47:45,  3.00s/it]

{'loss': 0.1886, 'grad_norm': 0.20791642367839813, 'learning_rate': 0.0019139139139139139, 'epoch': 4.09}


  4%|▍         | 45/1000 [02:15<47:39,  2.99s/it]

{'loss': 0.1805, 'grad_norm': 0.23653602600097656, 'learning_rate': 0.001911911911911912, 'epoch': 4.19}


  5%|▍         | 46/1000 [02:18<47:35,  2.99s/it]

{'loss': 0.1898, 'grad_norm': 0.2770611643791199, 'learning_rate': 0.00190990990990991, 'epoch': 4.28}


  5%|▍         | 47/1000 [02:21<47:30,  2.99s/it]

{'loss': 0.1717, 'grad_norm': 0.26397013664245605, 'learning_rate': 0.001907907907907908, 'epoch': 4.37}


  5%|▍         | 48/1000 [02:24<47:30,  2.99s/it]

{'loss': 0.1726, 'grad_norm': 0.26032182574272156, 'learning_rate': 0.001905905905905906, 'epoch': 4.47}


  5%|▍         | 49/1000 [02:27<47:36,  3.00s/it]

{'loss': 0.1479, 'grad_norm': 0.20846165716648102, 'learning_rate': 0.001903903903903904, 'epoch': 4.56}


  5%|▌         | 50/1000 [02:30<47:39,  3.01s/it]

{'loss': 0.1616, 'grad_norm': 0.24298708140850067, 'learning_rate': 0.001901901901901902, 'epoch': 4.65}


  5%|▌         | 51/1000 [02:33<47:30,  3.00s/it]

{'loss': 0.1445, 'grad_norm': 0.2579248249530792, 'learning_rate': 0.0018998998998999, 'epoch': 4.74}


  5%|▌         | 52/1000 [02:36<47:19,  2.99s/it]

{'loss': 0.1425, 'grad_norm': 0.19726793467998505, 'learning_rate': 0.001897897897897898, 'epoch': 4.84}


  5%|▌         | 53/1000 [02:39<47:20,  3.00s/it]

{'loss': 0.1565, 'grad_norm': 0.26564180850982666, 'learning_rate': 0.0018958958958958958, 'epoch': 4.93}


  5%|▌         | 54/1000 [02:42<47:21,  3.00s/it]

{'loss': 0.1209, 'grad_norm': 0.19641900062561035, 'learning_rate': 0.0018938938938938938, 'epoch': 5.02}


  6%|▌         | 55/1000 [02:45<47:20,  3.01s/it]

{'loss': 0.1164, 'grad_norm': 0.22598569095134735, 'learning_rate': 0.0018918918918918919, 'epoch': 5.12}


  6%|▌         | 56/1000 [02:48<47:08,  3.00s/it]

{'loss': 0.1108, 'grad_norm': 0.251605749130249, 'learning_rate': 0.00188988988988989, 'epoch': 5.21}


  6%|▌         | 57/1000 [02:51<46:57,  2.99s/it]

{'loss': 0.0952, 'grad_norm': 0.23231488466262817, 'learning_rate': 0.001887887887887888, 'epoch': 5.3}


  6%|▌         | 58/1000 [02:54<46:51,  2.98s/it]

{'loss': 0.1099, 'grad_norm': 0.29818660020828247, 'learning_rate': 0.0018858858858858858, 'epoch': 5.4}


  6%|▌         | 59/1000 [02:57<46:53,  2.99s/it]

{'loss': 0.101, 'grad_norm': 0.24504368007183075, 'learning_rate': 0.0018838838838838839, 'epoch': 5.49}


  6%|▌         | 60/1000 [03:00<46:58,  3.00s/it]

{'loss': 0.1039, 'grad_norm': 0.21923938393592834, 'learning_rate': 0.001881881881881882, 'epoch': 5.58}


  6%|▌         | 61/1000 [03:03<46:59,  3.00s/it]

{'loss': 0.1001, 'grad_norm': 0.24487842619419098, 'learning_rate': 0.00187987987987988, 'epoch': 5.67}


  6%|▌         | 62/1000 [03:06<46:51,  3.00s/it]

{'loss': 0.0911, 'grad_norm': 0.261868417263031, 'learning_rate': 0.001877877877877878, 'epoch': 5.77}


  6%|▋         | 63/1000 [03:09<46:48,  3.00s/it]

{'loss': 0.1053, 'grad_norm': 0.26674044132232666, 'learning_rate': 0.0018758758758758759, 'epoch': 5.86}


  6%|▋         | 64/1000 [03:12<46:46,  3.00s/it]

{'loss': 0.0883, 'grad_norm': 0.1992938369512558, 'learning_rate': 0.001873873873873874, 'epoch': 5.95}


  6%|▋         | 65/1000 [03:15<46:44,  3.00s/it]

{'loss': 0.0867, 'grad_norm': 0.5339002013206482, 'learning_rate': 0.001871871871871872, 'epoch': 6.05}


  7%|▋         | 66/1000 [03:18<46:42,  3.00s/it]

{'loss': 0.0775, 'grad_norm': 0.1707654446363449, 'learning_rate': 0.00186986986986987, 'epoch': 6.14}


  7%|▋         | 67/1000 [03:21<46:37,  3.00s/it]

{'loss': 0.0801, 'grad_norm': 0.8775418400764465, 'learning_rate': 0.001867867867867868, 'epoch': 6.23}


  7%|▋         | 68/1000 [03:24<46:30,  2.99s/it]

{'loss': 0.1034, 'grad_norm': 0.4073001444339752, 'learning_rate': 0.0018658658658658657, 'epoch': 6.33}


  7%|▋         | 69/1000 [03:27<46:44,  3.01s/it]

{'loss': 0.1003, 'grad_norm': 0.29436206817626953, 'learning_rate': 0.0018638638638638638, 'epoch': 6.42}


  7%|▋         | 70/1000 [03:30<46:49,  3.02s/it]

{'loss': 0.0956, 'grad_norm': 0.290439635515213, 'learning_rate': 0.0018618618618618619, 'epoch': 6.51}


  7%|▋         | 71/1000 [03:33<46:43,  3.02s/it]

{'loss': 0.0852, 'grad_norm': 0.21431350708007812, 'learning_rate': 0.00185985985985986, 'epoch': 6.6}


  7%|▋         | 72/1000 [03:36<46:31,  3.01s/it]

{'loss': 0.0848, 'grad_norm': 0.24671588838100433, 'learning_rate': 0.001857857857857858, 'epoch': 6.7}


  7%|▋         | 73/1000 [03:39<46:22,  3.00s/it]

{'loss': 0.0784, 'grad_norm': 0.2750566601753235, 'learning_rate': 0.0018558558558558558, 'epoch': 6.79}


  7%|▋         | 74/1000 [03:42<46:20,  3.00s/it]

{'loss': 0.0853, 'grad_norm': 0.2785981297492981, 'learning_rate': 0.0018538538538538539, 'epoch': 6.88}


  8%|▊         | 75/1000 [03:45<46:22,  3.01s/it]

{'loss': 0.0802, 'grad_norm': 0.2566787004470825, 'learning_rate': 0.001851851851851852, 'epoch': 6.98}


  8%|▊         | 76/1000 [03:48<46:13,  3.00s/it]

{'loss': 0.0635, 'grad_norm': 0.18839526176452637, 'learning_rate': 0.00184984984984985, 'epoch': 7.07}


  8%|▊         | 77/1000 [03:51<46:10,  3.00s/it]

{'loss': 0.0669, 'grad_norm': 0.20859406888484955, 'learning_rate': 0.001847847847847848, 'epoch': 7.16}


  8%|▊         | 78/1000 [03:54<46:02,  3.00s/it]

{'loss': 0.0644, 'grad_norm': 0.3039081394672394, 'learning_rate': 0.0018458458458458459, 'epoch': 7.26}


  8%|▊         | 79/1000 [03:57<45:58,  2.99s/it]

{'loss': 0.0628, 'grad_norm': 0.2052895873785019, 'learning_rate': 0.001843843843843844, 'epoch': 7.35}


  8%|▊         | 80/1000 [04:00<45:56,  3.00s/it]

{'loss': 0.0676, 'grad_norm': 0.19869670271873474, 'learning_rate': 0.001841841841841842, 'epoch': 7.44}


  8%|▊         | 81/1000 [04:03<45:39,  2.98s/it]

{'loss': 0.0623, 'grad_norm': 0.22746023535728455, 'learning_rate': 0.0018398398398398398, 'epoch': 7.53}


  8%|▊         | 82/1000 [04:06<45:36,  2.98s/it]

{'loss': 0.0648, 'grad_norm': 0.19103148579597473, 'learning_rate': 0.0018378378378378379, 'epoch': 7.63}


  8%|▊         | 83/1000 [04:09<45:34,  2.98s/it]

{'loss': 0.0625, 'grad_norm': 0.18535712361335754, 'learning_rate': 0.0018358358358358357, 'epoch': 7.72}


  8%|▊         | 84/1000 [04:12<45:40,  2.99s/it]

{'loss': 0.0735, 'grad_norm': 0.21970614790916443, 'learning_rate': 0.0018338338338338338, 'epoch': 7.81}


  8%|▊         | 85/1000 [04:15<45:39,  2.99s/it]

{'loss': 0.0728, 'grad_norm': 0.21595558524131775, 'learning_rate': 0.0018318318318318318, 'epoch': 7.91}


  9%|▊         | 86/1000 [04:18<45:29,  2.99s/it]

{'loss': 0.0693, 'grad_norm': 0.2423926591873169, 'learning_rate': 0.00182982982982983, 'epoch': 8.0}


  9%|▊         | 87/1000 [04:21<45:28,  2.99s/it]

{'loss': 0.041, 'grad_norm': 0.17347559332847595, 'learning_rate': 0.001827827827827828, 'epoch': 8.09}


  9%|▉         | 88/1000 [04:24<45:26,  2.99s/it]

{'loss': 0.05, 'grad_norm': 0.19068543612957, 'learning_rate': 0.0018258258258258258, 'epoch': 8.19}


  9%|▉         | 89/1000 [04:27<45:33,  3.00s/it]

{'loss': 0.0395, 'grad_norm': 0.15929880738258362, 'learning_rate': 0.0018238238238238239, 'epoch': 8.28}


  9%|▉         | 90/1000 [04:30<45:40,  3.01s/it]

{'loss': 0.0446, 'grad_norm': 0.2608363628387451, 'learning_rate': 0.001821821821821822, 'epoch': 8.37}


  9%|▉         | 91/1000 [04:33<45:38,  3.01s/it]

{'loss': 0.0509, 'grad_norm': 0.18586759269237518, 'learning_rate': 0.00181981981981982, 'epoch': 8.47}


  9%|▉         | 92/1000 [04:36<45:40,  3.02s/it]

{'loss': 0.0421, 'grad_norm': 0.16114765405654907, 'learning_rate': 0.001817817817817818, 'epoch': 8.56}


  9%|▉         | 93/1000 [04:39<45:26,  3.01s/it]

{'loss': 0.0462, 'grad_norm': 0.16853126883506775, 'learning_rate': 0.0018158158158158159, 'epoch': 8.65}


  9%|▉         | 94/1000 [04:42<45:19,  3.00s/it]

{'loss': 0.0459, 'grad_norm': 0.17938560247421265, 'learning_rate': 0.001813813813813814, 'epoch': 8.74}


 10%|▉         | 95/1000 [04:45<45:20,  3.01s/it]

{'loss': 0.0494, 'grad_norm': 0.19151407480239868, 'learning_rate': 0.001811811811811812, 'epoch': 8.84}


 10%|▉         | 96/1000 [04:48<45:12,  3.00s/it]

{'loss': 0.0421, 'grad_norm': 0.14819422364234924, 'learning_rate': 0.0018098098098098098, 'epoch': 8.93}


 10%|▉         | 97/1000 [04:51<45:07,  3.00s/it]

{'loss': 0.0381, 'grad_norm': 0.14905127882957458, 'learning_rate': 0.0018078078078078077, 'epoch': 9.02}


 10%|▉         | 98/1000 [04:54<45:01,  3.00s/it]

{'loss': 0.0287, 'grad_norm': 0.12670212984085083, 'learning_rate': 0.0018058058058058057, 'epoch': 9.12}


 10%|▉         | 99/1000 [04:57<44:51,  2.99s/it]

{'loss': 0.0306, 'grad_norm': 0.32756730914115906, 'learning_rate': 0.0018038038038038038, 'epoch': 9.21}


 10%|█         | 100/1000 [05:00<44:48,  2.99s/it]

{'loss': 0.0413, 'grad_norm': 0.19094567000865936, 'learning_rate': 0.0018018018018018018, 'epoch': 9.3}


 10%|█         | 101/1000 [05:03<44:42,  2.98s/it]

{'loss': 0.0407, 'grad_norm': 0.18511539697647095, 'learning_rate': 0.0017997997997997999, 'epoch': 9.4}


 10%|█         | 102/1000 [05:06<44:41,  2.99s/it]

{'loss': 0.0326, 'grad_norm': 0.16668114066123962, 'learning_rate': 0.0017977977977977977, 'epoch': 9.49}


 10%|█         | 103/1000 [05:09<44:32,  2.98s/it]

{'loss': 0.0299, 'grad_norm': 0.15231497585773468, 'learning_rate': 0.0017957957957957958, 'epoch': 9.58}


 10%|█         | 104/1000 [05:12<44:32,  2.98s/it]

{'loss': 0.0425, 'grad_norm': 0.19253884255886078, 'learning_rate': 0.0017937937937937938, 'epoch': 9.67}


 10%|█         | 105/1000 [05:15<44:38,  2.99s/it]

{'loss': 0.0385, 'grad_norm': 0.161960169672966, 'learning_rate': 0.001791791791791792, 'epoch': 9.77}


 11%|█         | 106/1000 [05:18<44:34,  2.99s/it]

{'loss': 0.0358, 'grad_norm': 0.15865224599838257, 'learning_rate': 0.00178978978978979, 'epoch': 9.86}


 11%|█         | 107/1000 [05:21<44:35,  3.00s/it]

{'loss': 0.0367, 'grad_norm': 0.16319987177848816, 'learning_rate': 0.0017877877877877878, 'epoch': 9.95}


 11%|█         | 108/1000 [05:24<44:26,  2.99s/it]

{'loss': 0.0279, 'grad_norm': 0.1199568435549736, 'learning_rate': 0.0017857857857857859, 'epoch': 10.05}


 11%|█         | 109/1000 [05:27<44:19,  2.99s/it]

{'loss': 0.0234, 'grad_norm': 0.12916956841945648, 'learning_rate': 0.001783783783783784, 'epoch': 10.14}


 11%|█         | 110/1000 [05:30<44:06,  2.97s/it]

{'loss': 0.0263, 'grad_norm': 0.14657966792583466, 'learning_rate': 0.0017817817817817817, 'epoch': 10.23}


 11%|█         | 111/1000 [05:33<44:03,  2.97s/it]

{'loss': 0.0303, 'grad_norm': 0.1660681962966919, 'learning_rate': 0.0017797797797797798, 'epoch': 10.33}


 11%|█         | 112/1000 [05:36<44:04,  2.98s/it]

{'loss': 0.0325, 'grad_norm': 0.1787068247795105, 'learning_rate': 0.0017777777777777776, 'epoch': 10.42}


 11%|█▏        | 113/1000 [05:39<44:01,  2.98s/it]

{'loss': 0.0293, 'grad_norm': 0.36577415466308594, 'learning_rate': 0.0017757757757757757, 'epoch': 10.51}


 11%|█▏        | 114/1000 [05:42<43:55,  2.97s/it]

{'loss': 0.035, 'grad_norm': 0.15158431231975555, 'learning_rate': 0.0017737737737737738, 'epoch': 10.6}


 12%|█▏        | 115/1000 [05:45<43:48,  2.97s/it]

{'loss': 0.0359, 'grad_norm': 0.14038600027561188, 'learning_rate': 0.0017717717717717718, 'epoch': 10.7}


 12%|█▏        | 116/1000 [05:48<43:45,  2.97s/it]

{'loss': 0.0309, 'grad_norm': 0.16171874105930328, 'learning_rate': 0.0017697697697697699, 'epoch': 10.79}


 12%|█▏        | 117/1000 [05:51<43:42,  2.97s/it]

{'loss': 0.0272, 'grad_norm': 0.12541256844997406, 'learning_rate': 0.0017677677677677677, 'epoch': 10.88}


 12%|█▏        | 118/1000 [05:54<43:35,  2.97s/it]

{'loss': 0.0356, 'grad_norm': 0.28332453966140747, 'learning_rate': 0.0017657657657657658, 'epoch': 10.98}


 12%|█▏        | 119/1000 [05:57<43:34,  2.97s/it]

{'loss': 0.03, 'grad_norm': 0.20303605496883392, 'learning_rate': 0.0017637637637637638, 'epoch': 11.07}


 12%|█▏        | 120/1000 [06:00<43:34,  2.97s/it]

{'loss': 0.0237, 'grad_norm': 0.1583126187324524, 'learning_rate': 0.0017617617617617619, 'epoch': 11.16}


 12%|█▏        | 121/1000 [06:03<43:32,  2.97s/it]

{'loss': 0.0242, 'grad_norm': 0.1663227677345276, 'learning_rate': 0.00175975975975976, 'epoch': 11.26}


 12%|█▏        | 122/1000 [06:06<43:32,  2.98s/it]

{'loss': 0.0203, 'grad_norm': 0.1336936503648758, 'learning_rate': 0.0017577577577577578, 'epoch': 11.35}


 12%|█▏        | 123/1000 [06:09<43:30,  2.98s/it]

{'loss': 0.0375, 'grad_norm': 0.22252091765403748, 'learning_rate': 0.0017557557557557558, 'epoch': 11.44}


 12%|█▏        | 124/1000 [06:12<43:25,  2.97s/it]

{'loss': 0.0208, 'grad_norm': 0.12601596117019653, 'learning_rate': 0.001753753753753754, 'epoch': 11.53}


 12%|█▎        | 125/1000 [06:15<43:21,  2.97s/it]

{'loss': 0.0393, 'grad_norm': 0.19643957912921906, 'learning_rate': 0.0017517517517517517, 'epoch': 11.63}


 13%|█▎        | 126/1000 [06:18<43:16,  2.97s/it]

{'loss': 0.0299, 'grad_norm': 0.1497407853603363, 'learning_rate': 0.0017497497497497498, 'epoch': 11.72}


 13%|█▎        | 127/1000 [06:20<43:06,  2.96s/it]

{'loss': 0.027, 'grad_norm': 0.1200670525431633, 'learning_rate': 0.0017477477477477476, 'epoch': 11.81}


 13%|█▎        | 128/1000 [06:23<43:00,  2.96s/it]

{'loss': 0.0414, 'grad_norm': 0.17628063261508942, 'learning_rate': 0.0017457457457457457, 'epoch': 11.91}


 13%|█▎        | 129/1000 [06:26<43:03,  2.97s/it]

{'loss': 0.0273, 'grad_norm': 0.1418675035238266, 'learning_rate': 0.0017437437437437437, 'epoch': 12.0}


 13%|█▎        | 130/1000 [06:29<43:02,  2.97s/it]

{'loss': 0.022, 'grad_norm': 0.1458865851163864, 'learning_rate': 0.0017417417417417418, 'epoch': 12.09}


 13%|█▎        | 131/1000 [06:32<43:02,  2.97s/it]

{'loss': 0.0167, 'grad_norm': 0.11567732691764832, 'learning_rate': 0.0017397397397397399, 'epoch': 12.19}


 13%|█▎        | 132/1000 [06:35<43:01,  2.97s/it]

{'loss': 0.0262, 'grad_norm': 0.44207289814949036, 'learning_rate': 0.0017377377377377377, 'epoch': 12.28}


 13%|█▎        | 133/1000 [06:38<42:52,  2.97s/it]

{'loss': 0.0236, 'grad_norm': 0.14352640509605408, 'learning_rate': 0.0017357357357357358, 'epoch': 12.37}


 13%|█▎        | 134/1000 [06:41<42:56,  2.98s/it]

{'loss': 0.0339, 'grad_norm': 0.19560161232948303, 'learning_rate': 0.0017337337337337338, 'epoch': 12.47}


 14%|█▎        | 135/1000 [06:44<42:52,  2.97s/it]

{'loss': 0.0332, 'grad_norm': 0.17569731175899506, 'learning_rate': 0.0017317317317317319, 'epoch': 12.56}


 14%|█▎        | 136/1000 [06:47<42:56,  2.98s/it]

{'loss': 0.036, 'grad_norm': 0.2208559513092041, 'learning_rate': 0.00172972972972973, 'epoch': 12.65}


 14%|█▎        | 137/1000 [06:50<42:52,  2.98s/it]

{'loss': 0.0293, 'grad_norm': 0.17576010525226593, 'learning_rate': 0.0017277277277277278, 'epoch': 12.74}


 14%|█▍        | 138/1000 [06:53<42:48,  2.98s/it]

{'loss': 0.0305, 'grad_norm': 0.17175643146038055, 'learning_rate': 0.0017257257257257258, 'epoch': 12.84}


 14%|█▍        | 139/1000 [06:56<42:42,  2.98s/it]

{'loss': 0.0311, 'grad_norm': 0.1681094914674759, 'learning_rate': 0.0017237237237237237, 'epoch': 12.93}


 14%|█▍        | 140/1000 [06:59<42:38,  2.97s/it]

{'loss': 0.0332, 'grad_norm': 0.25889313220977783, 'learning_rate': 0.0017217217217217217, 'epoch': 13.02}


 14%|█▍        | 141/1000 [07:02<42:32,  2.97s/it]

{'loss': 0.0215, 'grad_norm': 0.13084806501865387, 'learning_rate': 0.0017197197197197198, 'epoch': 13.12}


 14%|█▍        | 142/1000 [07:05<42:27,  2.97s/it]

{'loss': 0.0192, 'grad_norm': 0.12187618762254715, 'learning_rate': 0.0017177177177177176, 'epoch': 13.21}


 14%|█▍        | 143/1000 [07:08<42:17,  2.96s/it]

{'loss': 0.0195, 'grad_norm': 0.12164618819952011, 'learning_rate': 0.0017157157157157157, 'epoch': 13.3}


 14%|█▍        | 144/1000 [07:11<42:16,  2.96s/it]

{'loss': 0.0199, 'grad_norm': 0.12924732267856598, 'learning_rate': 0.0017137137137137137, 'epoch': 13.4}


 14%|█▍        | 145/1000 [07:14<42:12,  2.96s/it]

{'loss': 0.0244, 'grad_norm': 0.15081335604190826, 'learning_rate': 0.0017117117117117118, 'epoch': 13.49}


 15%|█▍        | 146/1000 [07:17<42:05,  2.96s/it]

{'loss': 0.0299, 'grad_norm': 0.20312143862247467, 'learning_rate': 0.0017097097097097098, 'epoch': 13.58}


 15%|█▍        | 147/1000 [07:20<42:03,  2.96s/it]

{'loss': 0.0275, 'grad_norm': 0.17748410999774933, 'learning_rate': 0.0017077077077077077, 'epoch': 13.67}


 15%|█▍        | 148/1000 [07:23<41:57,  2.95s/it]

{'loss': 0.0289, 'grad_norm': 0.16115255653858185, 'learning_rate': 0.0017057057057057057, 'epoch': 13.77}


 15%|█▍        | 149/1000 [07:26<41:55,  2.96s/it]

{'loss': 0.0255, 'grad_norm': 0.14800146222114563, 'learning_rate': 0.0017037037037037038, 'epoch': 13.86}


 15%|█▌        | 150/1000 [07:29<41:56,  2.96s/it]

{'loss': 0.0222, 'grad_norm': 0.13397763669490814, 'learning_rate': 0.0017017017017017019, 'epoch': 13.95}


 15%|█▌        | 151/1000 [07:32<41:59,  2.97s/it]

{'loss': 0.0249, 'grad_norm': 0.15193189680576324, 'learning_rate': 0.0016996996996997, 'epoch': 14.05}


 15%|█▌        | 152/1000 [07:35<41:54,  2.97s/it]

{'loss': 0.0166, 'grad_norm': 0.11898165196180344, 'learning_rate': 0.0016976976976976978, 'epoch': 14.14}


 15%|█▌        | 153/1000 [07:38<41:53,  2.97s/it]

{'loss': 0.0223, 'grad_norm': 0.13656771183013916, 'learning_rate': 0.0016956956956956956, 'epoch': 14.23}


 15%|█▌        | 154/1000 [07:41<41:53,  2.97s/it]

{'loss': 0.0263, 'grad_norm': 0.15442809462547302, 'learning_rate': 0.0016936936936936937, 'epoch': 14.33}


 16%|█▌        | 155/1000 [07:44<41:47,  2.97s/it]

{'loss': 0.0182, 'grad_norm': 0.35817015171051025, 'learning_rate': 0.0016916916916916917, 'epoch': 14.42}


 16%|█▌        | 156/1000 [07:47<41:46,  2.97s/it]

{'loss': 0.0234, 'grad_norm': 0.14386148750782013, 'learning_rate': 0.0016896896896896898, 'epoch': 14.51}


 16%|█▌        | 157/1000 [07:50<41:54,  2.98s/it]

{'loss': 0.0234, 'grad_norm': 0.15172500908374786, 'learning_rate': 0.0016876876876876876, 'epoch': 14.6}


 16%|█▌        | 158/1000 [07:53<41:59,  2.99s/it]

{'loss': 0.0196, 'grad_norm': 0.1171698123216629, 'learning_rate': 0.0016856856856856857, 'epoch': 14.7}


 16%|█▌        | 159/1000 [07:56<41:56,  2.99s/it]

{'loss': 0.0222, 'grad_norm': 0.14162251353263855, 'learning_rate': 0.0016836836836836837, 'epoch': 14.79}


 16%|█▌        | 160/1000 [07:59<41:51,  2.99s/it]

{'loss': 0.0208, 'grad_norm': 0.16906990110874176, 'learning_rate': 0.0016816816816816818, 'epoch': 14.88}


 16%|█▌        | 161/1000 [08:02<41:47,  2.99s/it]

{'loss': 0.0198, 'grad_norm': 0.13473807275295258, 'learning_rate': 0.0016796796796796796, 'epoch': 14.98}


 16%|█▌        | 162/1000 [08:05<41:45,  2.99s/it]

{'loss': 0.0237, 'grad_norm': 0.7500114440917969, 'learning_rate': 0.0016776776776776777, 'epoch': 15.07}


 16%|█▋        | 163/1000 [08:07<41:36,  2.98s/it]

{'loss': 0.022, 'grad_norm': 0.17404961585998535, 'learning_rate': 0.0016756756756756757, 'epoch': 15.16}


 16%|█▋        | 164/1000 [08:10<41:25,  2.97s/it]

{'loss': 0.0317, 'grad_norm': 0.20667695999145508, 'learning_rate': 0.0016736736736736738, 'epoch': 15.26}


 16%|█▋        | 165/1000 [08:13<41:22,  2.97s/it]

{'loss': 0.0324, 'grad_norm': 0.18450109660625458, 'learning_rate': 0.0016716716716716718, 'epoch': 15.35}


 17%|█▋        | 166/1000 [08:16<41:23,  2.98s/it]

{'loss': 0.0354, 'grad_norm': 0.21405695378780365, 'learning_rate': 0.0016696696696696697, 'epoch': 15.44}


 17%|█▋        | 167/1000 [08:19<41:20,  2.98s/it]

{'loss': 0.0282, 'grad_norm': 0.17413051426410675, 'learning_rate': 0.0016676676676676677, 'epoch': 15.53}


 17%|█▋        | 168/1000 [08:22<41:18,  2.98s/it]

{'loss': 0.0292, 'grad_norm': 0.474658340215683, 'learning_rate': 0.0016656656656656656, 'epoch': 15.63}


 17%|█▋        | 169/1000 [08:25<41:21,  2.99s/it]

{'loss': 0.0324, 'grad_norm': 0.216481015086174, 'learning_rate': 0.0016636636636636636, 'epoch': 15.72}


 17%|█▋        | 170/1000 [08:28<41:18,  2.99s/it]

{'loss': 0.0319, 'grad_norm': 0.22553342580795288, 'learning_rate': 0.0016616616616616617, 'epoch': 15.81}


 17%|█▋        | 171/1000 [08:31<41:14,  2.99s/it]

{'loss': 0.0325, 'grad_norm': 0.21478119492530823, 'learning_rate': 0.0016596596596596595, 'epoch': 15.91}


 17%|█▋        | 172/1000 [08:34<41:11,  2.98s/it]

{'loss': 0.0429, 'grad_norm': 0.2862038016319275, 'learning_rate': 0.0016576576576576576, 'epoch': 16.0}


 17%|█▋        | 173/1000 [08:37<41:09,  2.99s/it]

{'loss': 0.0243, 'grad_norm': 0.15950839221477509, 'learning_rate': 0.0016556556556556557, 'epoch': 16.09}


 17%|█▋        | 174/1000 [08:40<41:03,  2.98s/it]

{'loss': 0.0271, 'grad_norm': 0.31929463148117065, 'learning_rate': 0.0016536536536536537, 'epoch': 16.19}


 18%|█▊        | 175/1000 [08:43<40:53,  2.97s/it]

{'loss': 0.0245, 'grad_norm': 0.1751406043767929, 'learning_rate': 0.0016516516516516518, 'epoch': 16.28}


 18%|█▊        | 176/1000 [08:46<40:44,  2.97s/it]

{'loss': 0.0242, 'grad_norm': 0.16707609593868256, 'learning_rate': 0.0016496496496496496, 'epoch': 16.37}


 18%|█▊        | 177/1000 [08:49<40:46,  2.97s/it]

{'loss': 0.0321, 'grad_norm': 0.7891533970832825, 'learning_rate': 0.0016476476476476477, 'epoch': 16.47}


 18%|█▊        | 178/1000 [08:52<40:46,  2.98s/it]

{'loss': 0.0252, 'grad_norm': 0.18779712915420532, 'learning_rate': 0.0016456456456456457, 'epoch': 16.56}


 18%|█▊        | 179/1000 [08:55<40:39,  2.97s/it]

{'loss': 0.0282, 'grad_norm': 0.1881711632013321, 'learning_rate': 0.0016436436436436438, 'epoch': 16.65}


 18%|█▊        | 180/1000 [08:58<40:43,  2.98s/it]

{'loss': 0.0269, 'grad_norm': 0.203497052192688, 'learning_rate': 0.0016416416416416418, 'epoch': 16.74}


 18%|█▊        | 181/1000 [09:01<40:34,  2.97s/it]

{'loss': 0.0327, 'grad_norm': 0.46536093950271606, 'learning_rate': 0.0016396396396396397, 'epoch': 16.84}


 18%|█▊        | 182/1000 [09:04<40:24,  2.96s/it]

{'loss': 0.0267, 'grad_norm': 0.17711864411830902, 'learning_rate': 0.0016376376376376375, 'epoch': 16.93}


 18%|█▊        | 183/1000 [09:07<40:21,  2.96s/it]

{'loss': 0.0351, 'grad_norm': 0.2085958570241928, 'learning_rate': 0.0016356356356356356, 'epoch': 17.02}


 18%|█▊        | 184/1000 [09:10<40:13,  2.96s/it]

{'loss': 0.0245, 'grad_norm': 0.16959838569164276, 'learning_rate': 0.0016336336336336336, 'epoch': 17.12}


 18%|█▊        | 185/1000 [09:13<40:13,  2.96s/it]

{'loss': 0.0317, 'grad_norm': 0.21090783178806305, 'learning_rate': 0.0016316316316316317, 'epoch': 17.21}


 19%|█▊        | 186/1000 [09:16<40:19,  2.97s/it]

{'loss': 0.0237, 'grad_norm': 0.24964235723018646, 'learning_rate': 0.0016296296296296295, 'epoch': 17.3}


 19%|█▊        | 187/1000 [09:19<40:27,  2.99s/it]

{'loss': 0.0279, 'grad_norm': 0.1871645301580429, 'learning_rate': 0.0016276276276276276, 'epoch': 17.4}


 19%|█▉        | 188/1000 [09:22<40:26,  2.99s/it]

{'loss': 0.0246, 'grad_norm': 0.17427071928977966, 'learning_rate': 0.0016256256256256256, 'epoch': 17.49}


 19%|█▉        | 189/1000 [09:25<40:19,  2.98s/it]

{'loss': 0.0342, 'grad_norm': 0.2249002605676651, 'learning_rate': 0.0016236236236236237, 'epoch': 17.58}


 19%|█▉        | 190/1000 [09:28<40:24,  2.99s/it]

{'loss': 0.0237, 'grad_norm': 0.14843669533729553, 'learning_rate': 0.0016216216216216218, 'epoch': 17.67}


 19%|█▉        | 191/1000 [09:31<40:25,  3.00s/it]

{'loss': 0.0312, 'grad_norm': 0.19408485293388367, 'learning_rate': 0.0016196196196196196, 'epoch': 17.77}


 19%|█▉        | 192/1000 [09:34<40:13,  2.99s/it]

{'loss': 0.0232, 'grad_norm': 0.1466081440448761, 'learning_rate': 0.0016176176176176177, 'epoch': 17.86}


 19%|█▉        | 193/1000 [09:37<40:05,  2.98s/it]

{'loss': 0.031, 'grad_norm': 0.7054815888404846, 'learning_rate': 0.0016156156156156157, 'epoch': 17.95}


 19%|█▉        | 194/1000 [09:40<39:59,  2.98s/it]

{'loss': 0.032, 'grad_norm': 0.23588448762893677, 'learning_rate': 0.0016136136136136138, 'epoch': 18.05}


 20%|█▉        | 195/1000 [09:43<39:54,  2.98s/it]

{'loss': 0.0304, 'grad_norm': 0.25636205077171326, 'learning_rate': 0.0016116116116116118, 'epoch': 18.14}


 20%|█▉        | 196/1000 [09:46<39:59,  2.98s/it]

{'loss': 0.0183, 'grad_norm': 0.1628827303647995, 'learning_rate': 0.0016096096096096097, 'epoch': 18.23}


 20%|█▉        | 197/1000 [09:49<39:48,  2.97s/it]

{'loss': 0.0383, 'grad_norm': 0.6799976229667664, 'learning_rate': 0.0016076076076076075, 'epoch': 18.33}


 20%|█▉        | 198/1000 [09:52<39:46,  2.98s/it]

{'loss': 0.0301, 'grad_norm': 0.24623911082744598, 'learning_rate': 0.0016056056056056056, 'epoch': 18.42}


 20%|█▉        | 199/1000 [09:55<39:43,  2.98s/it]

{'loss': 0.0384, 'grad_norm': 0.26263728737831116, 'learning_rate': 0.0016036036036036036, 'epoch': 18.51}


 20%|██        | 200/1000 [09:58<39:45,  2.98s/it]

{'loss': 0.0377, 'grad_norm': 0.24216783046722412, 'learning_rate': 0.0016016016016016017, 'epoch': 18.6}


 20%|██        | 201/1000 [10:01<39:45,  2.99s/it]

{'loss': 0.0388, 'grad_norm': 0.21968820691108704, 'learning_rate': 0.0015995995995995995, 'epoch': 18.7}


 20%|██        | 202/1000 [10:04<39:35,  2.98s/it]

{'loss': 0.0335, 'grad_norm': 0.23277999460697174, 'learning_rate': 0.0015975975975975976, 'epoch': 18.79}


 20%|██        | 203/1000 [10:07<39:26,  2.97s/it]

{'loss': 0.0448, 'grad_norm': 0.7189760804176331, 'learning_rate': 0.0015955955955955956, 'epoch': 18.88}


 20%|██        | 204/1000 [10:10<39:15,  2.96s/it]

{'loss': 0.0479, 'grad_norm': 0.5283987522125244, 'learning_rate': 0.0015935935935935937, 'epoch': 18.98}


 20%|██        | 205/1000 [10:12<39:11,  2.96s/it]

{'loss': 0.0259, 'grad_norm': 0.1992095708847046, 'learning_rate': 0.0015915915915915917, 'epoch': 19.07}


 21%|██        | 206/1000 [10:15<39:10,  2.96s/it]

{'loss': 0.035, 'grad_norm': 0.2722899317741394, 'learning_rate': 0.0015895895895895896, 'epoch': 19.16}


 21%|██        | 207/1000 [10:18<39:10,  2.96s/it]

{'loss': 0.0252, 'grad_norm': 0.16028214991092682, 'learning_rate': 0.0015875875875875876, 'epoch': 19.26}


 21%|██        | 208/1000 [10:21<39:03,  2.96s/it]

{'loss': 0.0186, 'grad_norm': 0.1479153335094452, 'learning_rate': 0.0015855855855855857, 'epoch': 19.35}


 21%|██        | 209/1000 [10:24<38:57,  2.96s/it]

{'loss': 0.0273, 'grad_norm': 0.18401648104190826, 'learning_rate': 0.0015835835835835838, 'epoch': 19.44}


 21%|██        | 210/1000 [10:27<38:59,  2.96s/it]

{'loss': 0.0219, 'grad_norm': 0.45847856998443604, 'learning_rate': 0.0015815815815815818, 'epoch': 19.53}


 21%|██        | 211/1000 [10:30<38:58,  2.96s/it]

{'loss': 0.0212, 'grad_norm': 0.172963485121727, 'learning_rate': 0.0015795795795795794, 'epoch': 19.63}


 21%|██        | 212/1000 [10:33<39:03,  2.97s/it]

{'loss': 0.0292, 'grad_norm': 0.18817926943302155, 'learning_rate': 0.0015775775775775775, 'epoch': 19.72}


 21%|██▏       | 213/1000 [10:36<38:56,  2.97s/it]

{'loss': 0.0303, 'grad_norm': 0.20419757068157196, 'learning_rate': 0.0015755755755755755, 'epoch': 19.81}


 21%|██▏       | 214/1000 [10:39<38:57,  2.97s/it]

{'loss': 0.0271, 'grad_norm': 0.19756686687469482, 'learning_rate': 0.0015735735735735736, 'epoch': 19.91}


 22%|██▏       | 215/1000 [10:42<38:54,  2.97s/it]

{'loss': 0.0214, 'grad_norm': 0.14679427444934845, 'learning_rate': 0.0015715715715715717, 'epoch': 20.0}


 22%|██▏       | 216/1000 [10:45<38:47,  2.97s/it]

{'loss': 0.0191, 'grad_norm': 0.14583350718021393, 'learning_rate': 0.0015695695695695695, 'epoch': 20.09}


 22%|██▏       | 217/1000 [10:48<38:40,  2.96s/it]

{'loss': 0.0176, 'grad_norm': 0.15297341346740723, 'learning_rate': 0.0015675675675675676, 'epoch': 20.19}


 22%|██▏       | 218/1000 [10:51<38:46,  2.98s/it]

{'loss': 0.0239, 'grad_norm': 0.18278124928474426, 'learning_rate': 0.0015655655655655656, 'epoch': 20.28}


 22%|██▏       | 219/1000 [10:54<38:46,  2.98s/it]

{'loss': 0.0206, 'grad_norm': 0.1664392501115799, 'learning_rate': 0.0015635635635635637, 'epoch': 20.37}


 22%|██▏       | 220/1000 [10:57<38:50,  2.99s/it]

{'loss': 0.0229, 'grad_norm': 0.49318164587020874, 'learning_rate': 0.0015615615615615615, 'epoch': 20.47}


 22%|██▏       | 221/1000 [11:00<38:53,  3.00s/it]

{'loss': 0.0158, 'grad_norm': 0.13801375031471252, 'learning_rate': 0.0015595595595595596, 'epoch': 20.56}


 22%|██▏       | 222/1000 [11:03<38:51,  3.00s/it]

{'loss': 0.0205, 'grad_norm': 0.154294952750206, 'learning_rate': 0.0015575575575575576, 'epoch': 20.65}


 22%|██▏       | 223/1000 [11:06<38:47,  3.00s/it]

{'loss': 0.0218, 'grad_norm': 0.14349927008152008, 'learning_rate': 0.0015555555555555557, 'epoch': 20.74}


 22%|██▏       | 224/1000 [11:09<38:48,  3.00s/it]

{'loss': 0.0228, 'grad_norm': 0.14842985570430756, 'learning_rate': 0.0015535535535535537, 'epoch': 20.84}


 22%|██▎       | 225/1000 [11:12<38:45,  3.00s/it]

{'loss': 0.0218, 'grad_norm': 0.14195027947425842, 'learning_rate': 0.0015515515515515516, 'epoch': 20.93}


 23%|██▎       | 226/1000 [11:15<38:41,  3.00s/it]

{'loss': 0.0323, 'grad_norm': 0.1998073309659958, 'learning_rate': 0.0015495495495495494, 'epoch': 21.02}


 23%|██▎       | 227/1000 [11:18<38:36,  3.00s/it]

{'loss': 0.0128, 'grad_norm': 0.09599553048610687, 'learning_rate': 0.0015475475475475475, 'epoch': 21.12}


 23%|██▎       | 228/1000 [11:21<38:33,  3.00s/it]

{'loss': 0.0193, 'grad_norm': 0.15941940248012543, 'learning_rate': 0.0015455455455455455, 'epoch': 21.21}


 23%|██▎       | 229/1000 [11:24<38:30,  3.00s/it]

{'loss': 0.0202, 'grad_norm': 0.18800006806850433, 'learning_rate': 0.0015435435435435436, 'epoch': 21.3}


 23%|██▎       | 230/1000 [11:27<38:23,  2.99s/it]

{'loss': 0.0111, 'grad_norm': 0.10920831561088562, 'learning_rate': 0.0015415415415415414, 'epoch': 21.4}


 23%|██▎       | 231/1000 [11:30<38:11,  2.98s/it]

{'loss': 0.0272, 'grad_norm': 0.2068585604429245, 'learning_rate': 0.0015395395395395395, 'epoch': 21.49}


 23%|██▎       | 232/1000 [11:33<38:08,  2.98s/it]

{'loss': 0.0188, 'grad_norm': 0.13729514181613922, 'learning_rate': 0.0015375375375375375, 'epoch': 21.58}


 23%|██▎       | 233/1000 [11:36<38:07,  2.98s/it]

{'loss': 0.0162, 'grad_norm': 0.11570963263511658, 'learning_rate': 0.0015355355355355356, 'epoch': 21.67}


 23%|██▎       | 234/1000 [11:39<38:14,  3.00s/it]

{'loss': 0.0158, 'grad_norm': 0.11246766149997711, 'learning_rate': 0.0015335335335335337, 'epoch': 21.77}


 24%|██▎       | 235/1000 [11:42<38:15,  3.00s/it]

{'loss': 0.0146, 'grad_norm': 0.13074816763401031, 'learning_rate': 0.0015315315315315315, 'epoch': 21.86}


 24%|██▎       | 236/1000 [11:45<38:13,  3.00s/it]

{'loss': 0.0154, 'grad_norm': 0.132390558719635, 'learning_rate': 0.0015295295295295296, 'epoch': 21.95}


 24%|██▎       | 237/1000 [11:48<38:10,  3.00s/it]

{'loss': 0.0119, 'grad_norm': 0.09564812481403351, 'learning_rate': 0.0015275275275275276, 'epoch': 22.05}


 24%|██▍       | 238/1000 [11:51<38:05,  3.00s/it]

{'loss': 0.0158, 'grad_norm': 0.10601690411567688, 'learning_rate': 0.0015255255255255257, 'epoch': 22.14}


 24%|██▍       | 239/1000 [11:54<38:02,  3.00s/it]

{'loss': 0.0122, 'grad_norm': 0.11311355233192444, 'learning_rate': 0.0015235235235235237, 'epoch': 22.23}


 24%|██▍       | 240/1000 [11:57<37:54,  2.99s/it]

{'loss': 0.0132, 'grad_norm': 0.12214581668376923, 'learning_rate': 0.0015215215215215214, 'epoch': 22.33}


 24%|██▍       | 241/1000 [12:00<37:46,  2.99s/it]

{'loss': 0.0102, 'grad_norm': 0.1116672083735466, 'learning_rate': 0.0015195195195195194, 'epoch': 22.42}


 24%|██▍       | 242/1000 [12:03<37:43,  2.99s/it]

{'loss': 0.0075, 'grad_norm': 0.08826189488172531, 'learning_rate': 0.0015175175175175175, 'epoch': 22.51}


 24%|██▍       | 243/1000 [12:06<37:44,  2.99s/it]

{'loss': 0.0119, 'grad_norm': 0.10075847059488297, 'learning_rate': 0.0015155155155155155, 'epoch': 22.6}


 24%|██▍       | 244/1000 [12:09<37:45,  3.00s/it]

{'loss': 0.0139, 'grad_norm': 0.11353474110364914, 'learning_rate': 0.0015135135135135136, 'epoch': 22.7}


 24%|██▍       | 245/1000 [12:12<37:42,  3.00s/it]

{'loss': 0.0093, 'grad_norm': 0.10257426649332047, 'learning_rate': 0.0015115115115115114, 'epoch': 22.79}


 25%|██▍       | 246/1000 [12:15<37:42,  3.00s/it]

{'loss': 0.0108, 'grad_norm': 0.09495458006858826, 'learning_rate': 0.0015095095095095095, 'epoch': 22.88}


 25%|██▍       | 247/1000 [12:18<37:42,  3.01s/it]

{'loss': 0.0148, 'grad_norm': 0.5463971495628357, 'learning_rate': 0.0015075075075075075, 'epoch': 22.98}


 25%|██▍       | 248/1000 [12:21<37:42,  3.01s/it]

{'loss': 0.0102, 'grad_norm': 0.10208092629909515, 'learning_rate': 0.0015055055055055056, 'epoch': 23.07}


 25%|██▍       | 249/1000 [12:24<37:31,  3.00s/it]

{'loss': 0.019, 'grad_norm': 0.18766802549362183, 'learning_rate': 0.0015035035035035036, 'epoch': 23.16}


 25%|██▌       | 250/1000 [12:27<37:31,  3.00s/it]

{'loss': 0.0138, 'grad_norm': 0.12946560978889465, 'learning_rate': 0.0015015015015015015, 'epoch': 23.26}


 25%|██▌       | 251/1000 [12:30<36:50,  2.95s/it]

{'loss': 0.0158, 'grad_norm': 0.16246852278709412, 'learning_rate': 0.0014994994994994995, 'epoch': 23.35}


 25%|██▌       | 252/1000 [12:33<36:04,  2.89s/it]

{'loss': 0.0094, 'grad_norm': 0.12050007283687592, 'learning_rate': 0.0014974974974974976, 'epoch': 23.44}


 25%|██▌       | 253/1000 [12:35<35:34,  2.86s/it]

{'loss': 0.02, 'grad_norm': 0.20695464313030243, 'learning_rate': 0.0014954954954954957, 'epoch': 23.53}


 25%|██▌       | 254/1000 [12:38<35:14,  2.83s/it]

{'loss': 0.0123, 'grad_norm': 0.12067891657352448, 'learning_rate': 0.0014934934934934937, 'epoch': 23.63}


 26%|██▌       | 255/1000 [12:41<35:00,  2.82s/it]

{'loss': 0.016, 'grad_norm': 0.1243496686220169, 'learning_rate': 0.0014914914914914913, 'epoch': 23.72}


 26%|██▌       | 256/1000 [12:44<34:50,  2.81s/it]

{'loss': 0.0141, 'grad_norm': 0.11034452170133591, 'learning_rate': 0.0014894894894894894, 'epoch': 23.81}


 26%|██▌       | 257/1000 [12:46<34:35,  2.79s/it]

{'loss': 0.0119, 'grad_norm': 0.10018245875835419, 'learning_rate': 0.0014874874874874875, 'epoch': 23.91}


 26%|██▌       | 258/1000 [12:49<34:23,  2.78s/it]

{'loss': 0.0113, 'grad_norm': 0.10271686315536499, 'learning_rate': 0.0014854854854854855, 'epoch': 24.0}


 26%|██▌       | 259/1000 [12:52<34:27,  2.79s/it]

{'loss': 0.0088, 'grad_norm': 0.0902242660522461, 'learning_rate': 0.0014834834834834836, 'epoch': 24.09}


 26%|██▌       | 260/1000 [12:55<34:24,  2.79s/it]

{'loss': 0.0068, 'grad_norm': 0.0898832306265831, 'learning_rate': 0.0014814814814814814, 'epoch': 24.19}


 26%|██▌       | 261/1000 [12:58<34:22,  2.79s/it]

{'loss': 0.008, 'grad_norm': 0.10742070525884628, 'learning_rate': 0.0014794794794794795, 'epoch': 24.28}


 26%|██▌       | 262/1000 [13:00<34:19,  2.79s/it]

{'loss': 0.012, 'grad_norm': 0.3394037187099457, 'learning_rate': 0.0014774774774774775, 'epoch': 24.37}


 26%|██▋       | 263/1000 [13:03<34:15,  2.79s/it]

{'loss': 0.0128, 'grad_norm': 0.1366187185049057, 'learning_rate': 0.0014754754754754756, 'epoch': 24.47}


 26%|██▋       | 264/1000 [13:06<34:07,  2.78s/it]

{'loss': 0.0144, 'grad_norm': 0.16941972076892853, 'learning_rate': 0.0014734734734734736, 'epoch': 24.56}


 26%|██▋       | 265/1000 [13:09<34:13,  2.79s/it]

{'loss': 0.0131, 'grad_norm': 0.13689333200454712, 'learning_rate': 0.0014714714714714715, 'epoch': 24.65}


 27%|██▋       | 266/1000 [13:12<34:12,  2.80s/it]

{'loss': 0.0135, 'grad_norm': 0.12805940210819244, 'learning_rate': 0.0014694694694694695, 'epoch': 24.74}


 27%|██▋       | 267/1000 [13:14<34:10,  2.80s/it]

{'loss': 0.0181, 'grad_norm': 0.15572786331176758, 'learning_rate': 0.0014674674674674676, 'epoch': 24.84}


 27%|██▋       | 268/1000 [13:17<34:08,  2.80s/it]

{'loss': 0.0119, 'grad_norm': 0.11057024449110031, 'learning_rate': 0.0014654654654654656, 'epoch': 24.93}


 27%|██▋       | 269/1000 [13:20<34:09,  2.80s/it]

{'loss': 0.017, 'grad_norm': 0.15199409425258636, 'learning_rate': 0.0014634634634634637, 'epoch': 25.02}


 27%|██▋       | 270/1000 [13:23<34:05,  2.80s/it]

{'loss': 0.0099, 'grad_norm': 0.11110897362232208, 'learning_rate': 0.0014614614614614613, 'epoch': 25.12}


 27%|██▋       | 271/1000 [13:26<34:03,  2.80s/it]

{'loss': 0.0065, 'grad_norm': 0.0694834515452385, 'learning_rate': 0.0014594594594594594, 'epoch': 25.21}


 27%|██▋       | 272/1000 [13:28<34:04,  2.81s/it]

{'loss': 0.0126, 'grad_norm': 0.12897028028964996, 'learning_rate': 0.0014574574574574574, 'epoch': 25.3}


 27%|██▋       | 273/1000 [13:31<34:08,  2.82s/it]

{'loss': 0.0112, 'grad_norm': 0.10043667256832123, 'learning_rate': 0.0014554554554554555, 'epoch': 25.4}


 27%|██▋       | 274/1000 [13:34<34:04,  2.82s/it]

{'loss': 0.0099, 'grad_norm': 0.10153573751449585, 'learning_rate': 0.0014534534534534536, 'epoch': 25.49}


 28%|██▊       | 275/1000 [13:37<34:09,  2.83s/it]

{'loss': 0.0086, 'grad_norm': 0.4327797591686249, 'learning_rate': 0.0014514514514514514, 'epoch': 25.58}


 28%|██▊       | 276/1000 [13:40<34:05,  2.83s/it]

{'loss': 0.0142, 'grad_norm': 0.12818408012390137, 'learning_rate': 0.0014494494494494495, 'epoch': 25.67}


 28%|██▊       | 277/1000 [13:42<33:51,  2.81s/it]

{'loss': 0.0144, 'grad_norm': 0.13305599987506866, 'learning_rate': 0.0014474474474474475, 'epoch': 25.77}


 28%|██▊       | 278/1000 [13:45<33:47,  2.81s/it]

{'loss': 0.0168, 'grad_norm': 0.16424229741096497, 'learning_rate': 0.0014454454454454456, 'epoch': 25.86}


 28%|██▊       | 279/1000 [13:48<33:51,  2.82s/it]

{'loss': 0.0105, 'grad_norm': 0.10903183370828629, 'learning_rate': 0.0014434434434434436, 'epoch': 25.95}


 28%|██▊       | 280/1000 [13:51<33:41,  2.81s/it]

{'loss': 0.0126, 'grad_norm': 0.14578936994075775, 'learning_rate': 0.0014414414414414415, 'epoch': 26.05}


 28%|██▊       | 281/1000 [13:54<33:37,  2.81s/it]

{'loss': 0.0116, 'grad_norm': 0.12450461089611053, 'learning_rate': 0.0014394394394394395, 'epoch': 26.14}


 28%|██▊       | 282/1000 [13:56<33:24,  2.79s/it]

{'loss': 0.0094, 'grad_norm': 0.1154106929898262, 'learning_rate': 0.0014374374374374376, 'epoch': 26.23}


 28%|██▊       | 283/1000 [13:59<33:13,  2.78s/it]

{'loss': 0.0105, 'grad_norm': 0.12836499512195587, 'learning_rate': 0.0014354354354354356, 'epoch': 26.33}


 28%|██▊       | 284/1000 [14:02<33:13,  2.78s/it]

{'loss': 0.0091, 'grad_norm': 0.11856179684400558, 'learning_rate': 0.0014334334334334333, 'epoch': 26.42}


 28%|██▊       | 285/1000 [14:05<33:09,  2.78s/it]

{'loss': 0.0134, 'grad_norm': 0.12042251974344254, 'learning_rate': 0.0014314314314314313, 'epoch': 26.51}


 29%|██▊       | 286/1000 [14:08<33:09,  2.79s/it]

{'loss': 0.0162, 'grad_norm': 0.13722313940525055, 'learning_rate': 0.0014294294294294294, 'epoch': 26.6}


 29%|██▊       | 287/1000 [14:10<33:00,  2.78s/it]

{'loss': 0.0101, 'grad_norm': 0.11666831374168396, 'learning_rate': 0.0014274274274274274, 'epoch': 26.7}


 29%|██▉       | 288/1000 [14:13<32:53,  2.77s/it]

{'loss': 0.0107, 'grad_norm': 0.11098901182413101, 'learning_rate': 0.0014254254254254255, 'epoch': 26.79}


 29%|██▉       | 289/1000 [14:16<32:48,  2.77s/it]

{'loss': 0.0182, 'grad_norm': 0.1570785492658615, 'learning_rate': 0.0014234234234234233, 'epoch': 26.88}


 29%|██▉       | 290/1000 [14:19<32:43,  2.77s/it]

{'loss': 0.0102, 'grad_norm': 0.11240728944540024, 'learning_rate': 0.0014214214214214214, 'epoch': 26.98}


 29%|██▉       | 291/1000 [14:21<32:38,  2.76s/it]

{'loss': 0.0107, 'grad_norm': 0.11979077011346817, 'learning_rate': 0.0014194194194194194, 'epoch': 27.07}


 29%|██▉       | 292/1000 [14:24<32:33,  2.76s/it]

{'loss': 0.0058, 'grad_norm': 0.06227131932973862, 'learning_rate': 0.0014174174174174175, 'epoch': 27.16}


 29%|██▉       | 293/1000 [14:27<32:37,  2.77s/it]

{'loss': 0.0082, 'grad_norm': 0.10084498673677444, 'learning_rate': 0.0014154154154154156, 'epoch': 27.26}


 29%|██▉       | 294/1000 [14:30<32:41,  2.78s/it]

{'loss': 0.0086, 'grad_norm': 0.09121516346931458, 'learning_rate': 0.0014134134134134134, 'epoch': 27.35}


 30%|██▉       | 295/1000 [14:33<32:36,  2.77s/it]

{'loss': 0.0076, 'grad_norm': 0.08678147196769714, 'learning_rate': 0.0014114114114114115, 'epoch': 27.44}


 30%|██▉       | 296/1000 [14:35<32:32,  2.77s/it]

{'loss': 0.0106, 'grad_norm': 0.10272347927093506, 'learning_rate': 0.0014094094094094095, 'epoch': 27.53}


 30%|██▉       | 297/1000 [14:38<32:29,  2.77s/it]

{'loss': 0.0071, 'grad_norm': 0.07840856909751892, 'learning_rate': 0.0014074074074074076, 'epoch': 27.63}


 30%|██▉       | 298/1000 [14:41<32:22,  2.77s/it]

{'loss': 0.009, 'grad_norm': 0.10716904699802399, 'learning_rate': 0.0014054054054054054, 'epoch': 27.72}


 30%|██▉       | 299/1000 [14:44<32:24,  2.77s/it]

{'loss': 0.0077, 'grad_norm': 0.08777103573083878, 'learning_rate': 0.0014034034034034032, 'epoch': 27.81}


 30%|███       | 300/1000 [14:46<32:14,  2.76s/it]

{'loss': 0.0074, 'grad_norm': 0.1003369688987732, 'learning_rate': 0.0014014014014014013, 'epoch': 27.91}


 30%|███       | 301/1000 [14:49<32:09,  2.76s/it]

{'loss': 0.006, 'grad_norm': 0.07700402289628983, 'learning_rate': 0.0013993993993993994, 'epoch': 28.0}


 30%|███       | 302/1000 [14:52<32:11,  2.77s/it]

{'loss': 0.0035, 'grad_norm': 0.05425987020134926, 'learning_rate': 0.0013973973973973974, 'epoch': 28.09}


 30%|███       | 303/1000 [14:55<32:12,  2.77s/it]

{'loss': 0.0074, 'grad_norm': 0.08867619186639786, 'learning_rate': 0.0013953953953953955, 'epoch': 28.19}


 30%|███       | 304/1000 [14:57<32:09,  2.77s/it]

{'loss': 0.0047, 'grad_norm': 0.06515760719776154, 'learning_rate': 0.0013933933933933933, 'epoch': 28.28}


 30%|███       | 305/1000 [15:00<32:04,  2.77s/it]

{'loss': 0.0061, 'grad_norm': 0.08021624386310577, 'learning_rate': 0.0013913913913913914, 'epoch': 28.37}


 31%|███       | 306/1000 [15:03<32:04,  2.77s/it]

{'loss': 0.0103, 'grad_norm': 0.1081114262342453, 'learning_rate': 0.0013893893893893894, 'epoch': 28.47}


 31%|███       | 307/1000 [15:06<32:02,  2.77s/it]

{'loss': 0.0055, 'grad_norm': 0.0835670754313469, 'learning_rate': 0.0013873873873873875, 'epoch': 28.56}


 31%|███       | 308/1000 [15:09<32:00,  2.78s/it]

{'loss': 0.0073, 'grad_norm': 0.08508322387933731, 'learning_rate': 0.0013853853853853855, 'epoch': 28.65}


 31%|███       | 309/1000 [15:11<31:56,  2.77s/it]

{'loss': 0.0053, 'grad_norm': 0.07496494799852371, 'learning_rate': 0.0013833833833833834, 'epoch': 28.74}


 31%|███       | 310/1000 [15:14<31:49,  2.77s/it]

{'loss': 0.0067, 'grad_norm': 0.07663434743881226, 'learning_rate': 0.0013813813813813814, 'epoch': 28.84}


 31%|███       | 311/1000 [15:17<31:43,  2.76s/it]

{'loss': 0.0058, 'grad_norm': 0.08244332671165466, 'learning_rate': 0.0013793793793793795, 'epoch': 28.93}


 31%|███       | 312/1000 [15:20<31:39,  2.76s/it]

{'loss': 0.0069, 'grad_norm': 0.0868007019162178, 'learning_rate': 0.0013773773773773776, 'epoch': 29.02}


 31%|███▏      | 313/1000 [15:22<31:32,  2.76s/it]

{'loss': 0.0052, 'grad_norm': 0.4052487015724182, 'learning_rate': 0.0013753753753753754, 'epoch': 29.12}


 31%|███▏      | 314/1000 [15:25<31:26,  2.75s/it]

{'loss': 0.0038, 'grad_norm': 0.06757152825593948, 'learning_rate': 0.0013733733733733732, 'epoch': 29.21}


 32%|███▏      | 315/1000 [15:28<31:28,  2.76s/it]

{'loss': 0.0061, 'grad_norm': 0.07421130686998367, 'learning_rate': 0.0013713713713713713, 'epoch': 29.3}


 32%|███▏      | 316/1000 [15:31<31:28,  2.76s/it]

{'loss': 0.0085, 'grad_norm': 0.12837092578411102, 'learning_rate': 0.0013693693693693693, 'epoch': 29.4}


 32%|███▏      | 317/1000 [15:33<31:33,  2.77s/it]

{'loss': 0.0085, 'grad_norm': 0.09021154791116714, 'learning_rate': 0.0013673673673673674, 'epoch': 29.49}


 32%|███▏      | 318/1000 [15:36<31:31,  2.77s/it]

{'loss': 0.007, 'grad_norm': 0.07567004859447479, 'learning_rate': 0.0013653653653653655, 'epoch': 29.58}


 32%|███▏      | 319/1000 [15:39<31:25,  2.77s/it]

{'loss': 0.0077, 'grad_norm': 0.12329138815402985, 'learning_rate': 0.0013633633633633633, 'epoch': 29.67}


 32%|███▏      | 320/1000 [15:42<31:22,  2.77s/it]

{'loss': 0.0061, 'grad_norm': 0.0948009118437767, 'learning_rate': 0.0013613613613613614, 'epoch': 29.77}


 32%|███▏      | 321/1000 [15:44<31:20,  2.77s/it]

{'loss': 0.0051, 'grad_norm': 0.07781707495450974, 'learning_rate': 0.0013593593593593594, 'epoch': 29.86}


 32%|███▏      | 322/1000 [15:47<31:17,  2.77s/it]

{'loss': 0.0083, 'grad_norm': 0.10603152215480804, 'learning_rate': 0.0013573573573573575, 'epoch': 29.95}


 32%|███▏      | 323/1000 [15:50<31:17,  2.77s/it]

{'loss': 0.0064, 'grad_norm': 0.11571641266345978, 'learning_rate': 0.0013553553553553555, 'epoch': 30.05}


 32%|███▏      | 324/1000 [15:53<31:11,  2.77s/it]

{'loss': 0.0045, 'grad_norm': 0.07329671829938889, 'learning_rate': 0.0013533533533533534, 'epoch': 30.14}


 32%|███▎      | 325/1000 [15:56<31:12,  2.77s/it]

{'loss': 0.0067, 'grad_norm': 0.09423427283763885, 'learning_rate': 0.0013513513513513514, 'epoch': 30.23}


 33%|███▎      | 326/1000 [15:58<31:13,  2.78s/it]

{'loss': 0.0067, 'grad_norm': 0.10144127160310745, 'learning_rate': 0.0013493493493493495, 'epoch': 30.33}


 33%|███▎      | 327/1000 [16:01<31:17,  2.79s/it]

{'loss': 0.0051, 'grad_norm': 0.07445558905601501, 'learning_rate': 0.0013473473473473473, 'epoch': 30.42}


 33%|███▎      | 328/1000 [16:04<31:16,  2.79s/it]

{'loss': 0.0058, 'grad_norm': 0.07344485819339752, 'learning_rate': 0.0013453453453453454, 'epoch': 30.51}


 33%|███▎      | 329/1000 [16:07<31:10,  2.79s/it]

{'loss': 0.0061, 'grad_norm': 0.07438346743583679, 'learning_rate': 0.0013433433433433432, 'epoch': 30.6}


 33%|███▎      | 330/1000 [16:09<31:01,  2.78s/it]

{'loss': 0.0052, 'grad_norm': 0.07676655054092407, 'learning_rate': 0.0013413413413413413, 'epoch': 30.7}


 33%|███▎      | 331/1000 [16:12<30:58,  2.78s/it]

{'loss': 0.0064, 'grad_norm': 0.0767897292971611, 'learning_rate': 0.0013393393393393393, 'epoch': 30.79}


 33%|███▎      | 332/1000 [16:15<30:53,  2.78s/it]

{'loss': 0.0062, 'grad_norm': 0.07645553350448608, 'learning_rate': 0.0013373373373373374, 'epoch': 30.88}


 33%|███▎      | 333/1000 [16:18<30:50,  2.77s/it]

{'loss': 0.0077, 'grad_norm': 0.09462464600801468, 'learning_rate': 0.0013353353353353354, 'epoch': 30.98}


 33%|███▎      | 334/1000 [16:21<30:46,  2.77s/it]

{'loss': 0.0036, 'grad_norm': 0.049924131482839584, 'learning_rate': 0.0013333333333333333, 'epoch': 31.07}


 34%|███▎      | 335/1000 [16:23<30:40,  2.77s/it]

{'loss': 0.0047, 'grad_norm': 0.07813724875450134, 'learning_rate': 0.0013313313313313313, 'epoch': 31.16}


 34%|███▎      | 336/1000 [16:26<30:39,  2.77s/it]

{'loss': 0.0046, 'grad_norm': 0.09082891792058945, 'learning_rate': 0.0013293293293293294, 'epoch': 31.26}


 34%|███▎      | 337/1000 [16:29<30:40,  2.78s/it]

{'loss': 0.0047, 'grad_norm': 0.0659075453877449, 'learning_rate': 0.0013273273273273275, 'epoch': 31.35}


 34%|███▍      | 338/1000 [16:32<30:37,  2.78s/it]

{'loss': 0.0027, 'grad_norm': 0.04118886962532997, 'learning_rate': 0.0013253253253253255, 'epoch': 31.44}


 34%|███▍      | 339/1000 [16:34<30:38,  2.78s/it]

{'loss': 0.0039, 'grad_norm': 0.06225672736763954, 'learning_rate': 0.0013233233233233234, 'epoch': 31.53}


 34%|███▍      | 340/1000 [16:37<30:33,  2.78s/it]

{'loss': 0.0043, 'grad_norm': 0.05968526005744934, 'learning_rate': 0.0013213213213213214, 'epoch': 31.63}


 34%|███▍      | 341/1000 [16:40<30:34,  2.78s/it]

{'loss': 0.0047, 'grad_norm': 0.06615372747182846, 'learning_rate': 0.0013193193193193195, 'epoch': 31.72}


 34%|███▍      | 342/1000 [16:43<30:38,  2.79s/it]

{'loss': 0.0053, 'grad_norm': 0.06672164797782898, 'learning_rate': 0.0013173173173173173, 'epoch': 31.81}


 34%|███▍      | 343/1000 [16:46<30:41,  2.80s/it]

{'loss': 0.0035, 'grad_norm': 0.05841416120529175, 'learning_rate': 0.0013153153153153154, 'epoch': 31.91}


 34%|███▍      | 344/1000 [16:48<30:40,  2.81s/it]

{'loss': 0.0058, 'grad_norm': 0.07656476646661758, 'learning_rate': 0.0013133133133133132, 'epoch': 32.0}


 34%|███▍      | 345/1000 [16:51<30:34,  2.80s/it]

{'loss': 0.0024, 'grad_norm': 0.04063386842608452, 'learning_rate': 0.0013113113113113113, 'epoch': 32.09}


 35%|███▍      | 346/1000 [16:54<30:30,  2.80s/it]

{'loss': 0.0034, 'grad_norm': 0.04993503540754318, 'learning_rate': 0.0013093093093093093, 'epoch': 32.19}


 35%|███▍      | 347/1000 [16:57<30:31,  2.81s/it]

{'loss': 0.003, 'grad_norm': 0.05899167060852051, 'learning_rate': 0.0013073073073073074, 'epoch': 32.28}


 35%|███▍      | 348/1000 [17:00<30:31,  2.81s/it]

{'loss': 0.0021, 'grad_norm': 0.03621702641248703, 'learning_rate': 0.0013053053053053052, 'epoch': 32.37}


 35%|███▍      | 349/1000 [17:03<30:28,  2.81s/it]

{'loss': 0.0038, 'grad_norm': 0.04735443741083145, 'learning_rate': 0.0013033033033033033, 'epoch': 32.47}


 35%|███▌      | 350/1000 [17:05<30:19,  2.80s/it]

{'loss': 0.0031, 'grad_norm': 0.047074075788259506, 'learning_rate': 0.0013013013013013013, 'epoch': 32.56}


 35%|███▌      | 351/1000 [17:08<30:20,  2.80s/it]

{'loss': 0.0047, 'grad_norm': 0.04885280877351761, 'learning_rate': 0.0012992992992992994, 'epoch': 32.65}


 35%|███▌      | 352/1000 [17:11<30:17,  2.81s/it]

{'loss': 0.0042, 'grad_norm': 0.06170407682657242, 'learning_rate': 0.0012972972972972974, 'epoch': 32.74}


 35%|███▌      | 353/1000 [17:14<30:17,  2.81s/it]

{'loss': 0.0037, 'grad_norm': 0.06180822476744652, 'learning_rate': 0.0012952952952952953, 'epoch': 32.84}


 35%|███▌      | 354/1000 [17:17<30:20,  2.82s/it]

{'loss': 0.0032, 'grad_norm': 0.04855296388268471, 'learning_rate': 0.0012932932932932933, 'epoch': 32.93}


 36%|███▌      | 355/1000 [17:19<30:20,  2.82s/it]

{'loss': 0.0039, 'grad_norm': 0.05172926187515259, 'learning_rate': 0.0012912912912912914, 'epoch': 33.02}


 36%|███▌      | 356/1000 [17:22<30:09,  2.81s/it]

{'loss': 0.0017, 'grad_norm': 0.03115912713110447, 'learning_rate': 0.0012892892892892892, 'epoch': 33.12}


 36%|███▌      | 357/1000 [17:25<29:57,  2.80s/it]

{'loss': 0.0017, 'grad_norm': 0.03118099458515644, 'learning_rate': 0.0012872872872872873, 'epoch': 33.21}


 36%|███▌      | 358/1000 [17:28<29:56,  2.80s/it]

{'loss': 0.0027, 'grad_norm': 0.09290407598018646, 'learning_rate': 0.0012852852852852851, 'epoch': 33.3}


 36%|███▌      | 359/1000 [17:31<29:50,  2.79s/it]

{'loss': 0.003, 'grad_norm': 0.062028996646404266, 'learning_rate': 0.0012832832832832832, 'epoch': 33.4}


 36%|███▌      | 360/1000 [17:33<29:40,  2.78s/it]

{'loss': 0.0037, 'grad_norm': 0.06242119520902634, 'learning_rate': 0.0012812812812812813, 'epoch': 33.49}


 36%|███▌      | 361/1000 [17:36<29:35,  2.78s/it]

{'loss': 0.0024, 'grad_norm': 0.04311477020382881, 'learning_rate': 0.0012792792792792793, 'epoch': 33.58}


 36%|███▌      | 362/1000 [17:39<29:33,  2.78s/it]

{'loss': 0.0025, 'grad_norm': 0.04930288344621658, 'learning_rate': 0.0012772772772772774, 'epoch': 33.67}


 36%|███▋      | 363/1000 [17:42<29:30,  2.78s/it]

{'loss': 0.0026, 'grad_norm': 0.03412387892603874, 'learning_rate': 0.0012752752752752752, 'epoch': 33.77}


 36%|███▋      | 364/1000 [17:44<29:23,  2.77s/it]

{'loss': 0.0066, 'grad_norm': 0.08664079010486603, 'learning_rate': 0.0012732732732732733, 'epoch': 33.86}


 36%|███▋      | 365/1000 [17:47<29:23,  2.78s/it]

{'loss': 0.0064, 'grad_norm': 0.8719766736030579, 'learning_rate': 0.0012712712712712713, 'epoch': 33.95}


 37%|███▋      | 366/1000 [17:50<29:21,  2.78s/it]

{'loss': 0.0034, 'grad_norm': 0.07285726815462112, 'learning_rate': 0.0012692692692692694, 'epoch': 34.05}


 37%|███▋      | 367/1000 [17:53<29:23,  2.79s/it]

{'loss': 0.0131, 'grad_norm': 0.16661174595355988, 'learning_rate': 0.0012672672672672674, 'epoch': 34.14}


 37%|███▋      | 368/1000 [17:56<29:20,  2.79s/it]

{'loss': 0.016, 'grad_norm': 0.21597643196582794, 'learning_rate': 0.0012652652652652653, 'epoch': 34.23}


 37%|███▋      | 369/1000 [17:58<29:14,  2.78s/it]

{'loss': 0.0158, 'grad_norm': 0.19439759850502014, 'learning_rate': 0.0012632632632632633, 'epoch': 34.33}


 37%|███▋      | 370/1000 [18:01<29:12,  2.78s/it]

{'loss': 0.0062, 'grad_norm': 0.11315418034791946, 'learning_rate': 0.0012612612612612614, 'epoch': 34.42}


 37%|███▋      | 371/1000 [18:04<29:07,  2.78s/it]

{'loss': 0.0072, 'grad_norm': 0.10819867253303528, 'learning_rate': 0.0012592592592592592, 'epoch': 34.51}


 37%|███▋      | 372/1000 [18:07<29:00,  2.77s/it]

{'loss': 0.008, 'grad_norm': 0.11476534605026245, 'learning_rate': 0.0012572572572572573, 'epoch': 34.6}


 37%|███▋      | 373/1000 [18:09<28:57,  2.77s/it]

{'loss': 0.0091, 'grad_norm': 0.13075686991214752, 'learning_rate': 0.0012552552552552551, 'epoch': 34.7}


 37%|███▋      | 374/1000 [18:12<28:59,  2.78s/it]

{'loss': 0.0338, 'grad_norm': 1.8338048458099365, 'learning_rate': 0.0012532532532532532, 'epoch': 34.79}


 38%|███▊      | 375/1000 [18:15<28:58,  2.78s/it]

{'loss': 0.0177, 'grad_norm': 0.18288426101207733, 'learning_rate': 0.0012512512512512512, 'epoch': 34.88}


 38%|███▊      | 376/1000 [18:18<28:51,  2.78s/it]

{'loss': 0.0319, 'grad_norm': 0.2959078252315521, 'learning_rate': 0.0012492492492492493, 'epoch': 34.98}


 38%|███▊      | 377/1000 [18:20<28:45,  2.77s/it]

{'loss': 0.0208, 'grad_norm': 0.2056169956922531, 'learning_rate': 0.0012472472472472474, 'epoch': 35.07}


 38%|███▊      | 378/1000 [18:23<28:40,  2.77s/it]

{'loss': 0.0233, 'grad_norm': 0.21074797213077545, 'learning_rate': 0.0012452452452452452, 'epoch': 35.16}


 38%|███▊      | 379/1000 [18:26<28:33,  2.76s/it]

{'loss': 0.0172, 'grad_norm': 0.15737591683864594, 'learning_rate': 0.0012432432432432433, 'epoch': 35.26}


 38%|███▊      | 380/1000 [18:29<28:33,  2.76s/it]

{'loss': 0.0199, 'grad_norm': 0.17381712794303894, 'learning_rate': 0.0012412412412412413, 'epoch': 35.35}


 38%|███▊      | 381/1000 [18:32<28:27,  2.76s/it]

{'loss': 0.0206, 'grad_norm': 0.17823565006256104, 'learning_rate': 0.0012392392392392394, 'epoch': 35.44}


 38%|███▊      | 382/1000 [18:34<28:26,  2.76s/it]

{'loss': 0.0186, 'grad_norm': 0.15753379464149475, 'learning_rate': 0.0012372372372372374, 'epoch': 35.53}


 38%|███▊      | 383/1000 [18:37<28:30,  2.77s/it]

{'loss': 0.0135, 'grad_norm': 0.11811896413564682, 'learning_rate': 0.0012352352352352353, 'epoch': 35.63}


 38%|███▊      | 384/1000 [18:40<28:32,  2.78s/it]

{'loss': 0.0156, 'grad_norm': 0.15179665386676788, 'learning_rate': 0.0012332332332332333, 'epoch': 35.72}


 38%|███▊      | 385/1000 [18:43<28:35,  2.79s/it]

{'loss': 0.0117, 'grad_norm': 0.156512051820755, 'learning_rate': 0.0012312312312312312, 'epoch': 35.81}


 39%|███▊      | 386/1000 [18:45<28:30,  2.79s/it]

{'loss': 0.0184, 'grad_norm': 0.18240876495838165, 'learning_rate': 0.0012292292292292292, 'epoch': 35.91}


 39%|███▊      | 387/1000 [18:48<28:17,  2.77s/it]

{'loss': 0.0104, 'grad_norm': 0.12744146585464478, 'learning_rate': 0.0012272272272272273, 'epoch': 36.0}


 39%|███▉      | 388/1000 [18:51<28:13,  2.77s/it]

{'loss': 0.0088, 'grad_norm': 0.11735134571790695, 'learning_rate': 0.0012252252252252251, 'epoch': 36.09}


 39%|███▉      | 389/1000 [18:54<28:18,  2.78s/it]

{'loss': 0.0089, 'grad_norm': 0.10435677319765091, 'learning_rate': 0.0012232232232232232, 'epoch': 36.19}


 39%|███▉      | 390/1000 [18:57<28:14,  2.78s/it]

{'loss': 0.0074, 'grad_norm': 0.1069398745894432, 'learning_rate': 0.0012212212212212212, 'epoch': 36.28}


 39%|███▉      | 391/1000 [18:59<28:08,  2.77s/it]

{'loss': 0.0113, 'grad_norm': 0.20668792724609375, 'learning_rate': 0.0012192192192192193, 'epoch': 36.37}


 39%|███▉      | 392/1000 [19:02<28:05,  2.77s/it]

{'loss': 0.013, 'grad_norm': 0.12832938134670258, 'learning_rate': 0.0012172172172172173, 'epoch': 36.47}


 39%|███▉      | 393/1000 [19:05<28:02,  2.77s/it]

{'loss': 0.0102, 'grad_norm': 0.13163897395133972, 'learning_rate': 0.0012152152152152152, 'epoch': 36.56}


 39%|███▉      | 394/1000 [19:08<27:58,  2.77s/it]

{'loss': 0.0135, 'grad_norm': 0.14289605617523193, 'learning_rate': 0.0012132132132132132, 'epoch': 36.65}


 40%|███▉      | 395/1000 [19:10<27:57,  2.77s/it]

{'loss': 0.0131, 'grad_norm': 0.1352647840976715, 'learning_rate': 0.0012112112112112113, 'epoch': 36.74}


 40%|███▉      | 396/1000 [19:13<27:59,  2.78s/it]

{'loss': 0.0091, 'grad_norm': 0.1026504635810852, 'learning_rate': 0.0012092092092092094, 'epoch': 36.84}


 40%|███▉      | 397/1000 [19:16<27:53,  2.78s/it]

{'loss': 0.012, 'grad_norm': 0.13999910652637482, 'learning_rate': 0.0012072072072072074, 'epoch': 36.93}


 40%|███▉      | 398/1000 [19:19<27:47,  2.77s/it]

{'loss': 0.0086, 'grad_norm': 0.13585957884788513, 'learning_rate': 0.0012052052052052053, 'epoch': 37.02}


 40%|███▉      | 399/1000 [19:21<27:40,  2.76s/it]

{'loss': 0.0044, 'grad_norm': 0.06898730993270874, 'learning_rate': 0.0012032032032032033, 'epoch': 37.12}


 40%|████      | 400/1000 [19:24<27:40,  2.77s/it]

{'loss': 0.0051, 'grad_norm': 0.06994560360908508, 'learning_rate': 0.0012012012012012011, 'epoch': 37.21}


 40%|████      | 401/1000 [19:27<27:40,  2.77s/it]

{'loss': 0.0112, 'grad_norm': 0.1387774795293808, 'learning_rate': 0.0011991991991991992, 'epoch': 37.3}


 40%|████      | 402/1000 [19:30<27:35,  2.77s/it]

{'loss': 0.0088, 'grad_norm': 0.08667416870594025, 'learning_rate': 0.0011971971971971973, 'epoch': 37.4}


 40%|████      | 403/1000 [19:33<27:40,  2.78s/it]

{'loss': 0.008, 'grad_norm': 0.08598697185516357, 'learning_rate': 0.001195195195195195, 'epoch': 37.49}


 40%|████      | 404/1000 [19:35<27:39,  2.78s/it]

{'loss': 0.0062, 'grad_norm': 0.0907205268740654, 'learning_rate': 0.0011931931931931932, 'epoch': 37.58}


 40%|████      | 405/1000 [19:38<27:36,  2.78s/it]

{'loss': 0.0084, 'grad_norm': 0.08570162951946259, 'learning_rate': 0.0011911911911911912, 'epoch': 37.67}


 41%|████      | 406/1000 [19:41<27:27,  2.77s/it]

{'loss': 0.0076, 'grad_norm': 0.10218426585197449, 'learning_rate': 0.0011891891891891893, 'epoch': 37.77}


 41%|████      | 407/1000 [19:44<27:24,  2.77s/it]

{'loss': 0.0098, 'grad_norm': 0.10739883780479431, 'learning_rate': 0.0011871871871871871, 'epoch': 37.86}


 41%|████      | 408/1000 [19:46<27:19,  2.77s/it]

{'loss': 0.0096, 'grad_norm': 0.11295542865991592, 'learning_rate': 0.0011851851851851852, 'epoch': 37.95}


 41%|████      | 409/1000 [19:49<27:21,  2.78s/it]

{'loss': 0.0056, 'grad_norm': 0.0802343562245369, 'learning_rate': 0.0011831831831831832, 'epoch': 38.05}


 41%|████      | 410/1000 [19:52<27:26,  2.79s/it]

{'loss': 0.0061, 'grad_norm': 0.09044903516769409, 'learning_rate': 0.0011811811811811813, 'epoch': 38.14}


 41%|████      | 411/1000 [19:55<27:20,  2.79s/it]

{'loss': 0.006, 'grad_norm': 0.08450707793235779, 'learning_rate': 0.0011791791791791793, 'epoch': 38.23}


 41%|████      | 412/1000 [19:58<27:14,  2.78s/it]

{'loss': 0.0041, 'grad_norm': 0.05670813471078873, 'learning_rate': 0.0011771771771771772, 'epoch': 38.33}


 41%|████▏     | 413/1000 [20:00<27:10,  2.78s/it]

{'loss': 0.0059, 'grad_norm': 0.0767657682299614, 'learning_rate': 0.0011751751751751752, 'epoch': 38.42}


 41%|████▏     | 414/1000 [20:03<27:12,  2.79s/it]

{'loss': 0.006, 'grad_norm': 0.08444706350564957, 'learning_rate': 0.001173173173173173, 'epoch': 38.51}


 42%|████▏     | 415/1000 [20:06<27:14,  2.79s/it]

{'loss': 0.006, 'grad_norm': 0.08652391284704208, 'learning_rate': 0.0011711711711711711, 'epoch': 38.6}


 42%|████▏     | 416/1000 [20:09<27:15,  2.80s/it]

{'loss': 0.0057, 'grad_norm': 0.09136863797903061, 'learning_rate': 0.0011691691691691692, 'epoch': 38.7}


 42%|████▏     | 417/1000 [20:12<27:14,  2.80s/it]

{'loss': 0.0076, 'grad_norm': 0.083625428378582, 'learning_rate': 0.001167167167167167, 'epoch': 38.79}


 42%|████▏     | 418/1000 [20:14<27:12,  2.80s/it]

{'loss': 0.0042, 'grad_norm': 0.07047466188669205, 'learning_rate': 0.001165165165165165, 'epoch': 38.88}


 42%|████▏     | 419/1000 [20:17<27:10,  2.81s/it]

{'loss': 0.0057, 'grad_norm': 0.07656373828649521, 'learning_rate': 0.0011631631631631631, 'epoch': 38.98}


 42%|████▏     | 420/1000 [20:20<27:07,  2.81s/it]

{'loss': 0.0036, 'grad_norm': 0.06967911869287491, 'learning_rate': 0.0011611611611611612, 'epoch': 39.07}


 42%|████▏     | 421/1000 [20:23<27:08,  2.81s/it]

{'loss': 0.0028, 'grad_norm': 0.051445458084344864, 'learning_rate': 0.0011591591591591593, 'epoch': 39.16}


 42%|████▏     | 422/1000 [20:26<27:05,  2.81s/it]

{'loss': 0.0032, 'grad_norm': 0.0559643991291523, 'learning_rate': 0.001157157157157157, 'epoch': 39.26}


 42%|████▏     | 423/1000 [20:28<27:04,  2.81s/it]

{'loss': 0.0023, 'grad_norm': 0.03927305340766907, 'learning_rate': 0.0011551551551551552, 'epoch': 39.35}


 42%|████▏     | 424/1000 [20:31<27:04,  2.82s/it]

{'loss': 0.0031, 'grad_norm': 0.057706110179424286, 'learning_rate': 0.0011531531531531532, 'epoch': 39.44}


 42%|████▎     | 425/1000 [20:34<26:59,  2.82s/it]

{'loss': 0.0032, 'grad_norm': 0.06381837278604507, 'learning_rate': 0.0011511511511511513, 'epoch': 39.53}


 43%|████▎     | 426/1000 [20:37<26:53,  2.81s/it]

{'loss': 0.0021, 'grad_norm': 0.05114300921559334, 'learning_rate': 0.0011491491491491493, 'epoch': 39.63}


 43%|████▎     | 427/1000 [20:40<26:51,  2.81s/it]

{'loss': 0.0039, 'grad_norm': 0.07307816296815872, 'learning_rate': 0.0011471471471471472, 'epoch': 39.72}


 43%|████▎     | 428/1000 [20:43<26:50,  2.82s/it]

{'loss': 0.0039, 'grad_norm': 0.05135943368077278, 'learning_rate': 0.0011451451451451452, 'epoch': 39.81}


 43%|████▎     | 429/1000 [20:45<26:45,  2.81s/it]

{'loss': 0.0026, 'grad_norm': 0.0514879934489727, 'learning_rate': 0.001143143143143143, 'epoch': 39.91}


 43%|████▎     | 430/1000 [20:48<26:36,  2.80s/it]

{'loss': 0.0038, 'grad_norm': 0.08439819514751434, 'learning_rate': 0.0011411411411411411, 'epoch': 40.0}


 43%|████▎     | 431/1000 [20:51<26:33,  2.80s/it]

{'loss': 0.0016, 'grad_norm': 0.036230411380529404, 'learning_rate': 0.0011391391391391392, 'epoch': 40.09}


 43%|████▎     | 432/1000 [20:54<26:31,  2.80s/it]

{'loss': 0.0031, 'grad_norm': 0.06861860305070877, 'learning_rate': 0.001137137137137137, 'epoch': 40.19}


 43%|████▎     | 433/1000 [20:57<26:25,  2.80s/it]

{'loss': 0.0015, 'grad_norm': 0.03462895378470421, 'learning_rate': 0.001135135135135135, 'epoch': 40.28}


 43%|████▎     | 434/1000 [20:59<26:18,  2.79s/it]

{'loss': 0.0016, 'grad_norm': 0.033459682017564774, 'learning_rate': 0.0011331331331331331, 'epoch': 40.37}


 44%|████▎     | 435/1000 [21:02<26:16,  2.79s/it]

{'loss': 0.003, 'grad_norm': 0.05087810382246971, 'learning_rate': 0.0011311311311311312, 'epoch': 40.47}


 44%|████▎     | 436/1000 [21:05<26:15,  2.79s/it]

{'loss': 0.0033, 'grad_norm': 0.06251133978366852, 'learning_rate': 0.0011291291291291292, 'epoch': 40.56}


 44%|████▎     | 437/1000 [21:08<26:12,  2.79s/it]

{'loss': 0.003, 'grad_norm': 0.057716287672519684, 'learning_rate': 0.001127127127127127, 'epoch': 40.65}


 44%|████▍     | 438/1000 [21:10<26:04,  2.78s/it]

{'loss': 0.0021, 'grad_norm': 0.04524217173457146, 'learning_rate': 0.0011251251251251251, 'epoch': 40.74}


 44%|████▍     | 439/1000 [21:13<25:57,  2.78s/it]

{'loss': 0.0023, 'grad_norm': 0.05485425889492035, 'learning_rate': 0.0011231231231231232, 'epoch': 40.84}


 44%|████▍     | 440/1000 [21:16<25:51,  2.77s/it]

{'loss': 0.002, 'grad_norm': 0.04494103044271469, 'learning_rate': 0.0011211211211211213, 'epoch': 40.93}


 44%|████▍     | 441/1000 [21:19<25:51,  2.78s/it]

{'loss': 0.0031, 'grad_norm': 0.06474560499191284, 'learning_rate': 0.0011191191191191193, 'epoch': 41.02}


 44%|████▍     | 442/1000 [21:22<25:51,  2.78s/it]

{'loss': 0.0015, 'grad_norm': 0.032112184911966324, 'learning_rate': 0.0011171171171171172, 'epoch': 41.12}


 44%|████▍     | 443/1000 [21:24<25:48,  2.78s/it]

{'loss': 0.003, 'grad_norm': 0.07855093479156494, 'learning_rate': 0.001115115115115115, 'epoch': 41.21}


 44%|████▍     | 444/1000 [21:27<25:46,  2.78s/it]

{'loss': 0.0024, 'grad_norm': 0.05911662057042122, 'learning_rate': 0.001113113113113113, 'epoch': 41.3}


 44%|████▍     | 445/1000 [21:30<25:43,  2.78s/it]

{'loss': 0.0019, 'grad_norm': 0.07458988577127457, 'learning_rate': 0.0011111111111111111, 'epoch': 41.4}


 45%|████▍     | 446/1000 [21:33<25:42,  2.78s/it]

{'loss': 0.0029, 'grad_norm': 0.06002940610051155, 'learning_rate': 0.0011091091091091092, 'epoch': 41.49}


 45%|████▍     | 447/1000 [21:35<25:31,  2.77s/it]

{'loss': 0.0016, 'grad_norm': 0.04031607136130333, 'learning_rate': 0.001107107107107107, 'epoch': 41.58}


 45%|████▍     | 448/1000 [21:38<25:26,  2.77s/it]

{'loss': 0.0029, 'grad_norm': 0.05871531367301941, 'learning_rate': 0.001105105105105105, 'epoch': 41.67}


 45%|████▍     | 449/1000 [21:41<25:23,  2.77s/it]

{'loss': 0.0022, 'grad_norm': 0.06386204808950424, 'learning_rate': 0.0011031031031031031, 'epoch': 41.77}


 45%|████▌     | 450/1000 [21:44<25:26,  2.78s/it]

{'loss': 0.0017, 'grad_norm': 0.031453412026166916, 'learning_rate': 0.0011011011011011012, 'epoch': 41.86}


 45%|████▌     | 451/1000 [21:47<25:28,  2.78s/it]

{'loss': 0.005, 'grad_norm': 0.0686318650841713, 'learning_rate': 0.0010990990990990992, 'epoch': 41.95}


 45%|████▌     | 452/1000 [21:49<25:27,  2.79s/it]

{'loss': 0.0025, 'grad_norm': 0.044364456087350845, 'learning_rate': 0.001097097097097097, 'epoch': 42.05}


 45%|████▌     | 453/1000 [21:52<25:25,  2.79s/it]

{'loss': 0.004, 'grad_norm': 0.07827119529247284, 'learning_rate': 0.0010950950950950951, 'epoch': 42.14}


 45%|████▌     | 454/1000 [21:55<25:19,  2.78s/it]

{'loss': 0.0013, 'grad_norm': 0.01674993522465229, 'learning_rate': 0.0010930930930930932, 'epoch': 42.23}


 46%|████▌     | 455/1000 [21:58<25:13,  2.78s/it]

{'loss': 0.002, 'grad_norm': 0.041024889796972275, 'learning_rate': 0.0010910910910910912, 'epoch': 42.33}


 46%|████▌     | 456/1000 [22:00<25:06,  2.77s/it]

{'loss': 0.0023, 'grad_norm': 0.04098690673708916, 'learning_rate': 0.0010890890890890893, 'epoch': 42.42}


 46%|████▌     | 457/1000 [22:03<25:02,  2.77s/it]

{'loss': 0.0024, 'grad_norm': 0.05196992680430412, 'learning_rate': 0.001087087087087087, 'epoch': 42.51}


 46%|████▌     | 458/1000 [22:06<24:57,  2.76s/it]

{'loss': 0.0017, 'grad_norm': 0.038588881492614746, 'learning_rate': 0.001085085085085085, 'epoch': 42.6}


 46%|████▌     | 459/1000 [22:09<24:52,  2.76s/it]

{'loss': 0.0023, 'grad_norm': 0.044969286769628525, 'learning_rate': 0.001083083083083083, 'epoch': 42.7}


 46%|████▌     | 460/1000 [22:11<24:54,  2.77s/it]

{'loss': 0.0026, 'grad_norm': 0.07375585287809372, 'learning_rate': 0.001081081081081081, 'epoch': 42.79}


 46%|████▌     | 461/1000 [22:14<24:55,  2.78s/it]

{'loss': 0.003, 'grad_norm': 0.0510486401617527, 'learning_rate': 0.0010790790790790792, 'epoch': 42.88}


 46%|████▌     | 462/1000 [22:17<24:53,  2.78s/it]

{'loss': 0.0032, 'grad_norm': 0.04110394045710564, 'learning_rate': 0.001077077077077077, 'epoch': 42.98}


 46%|████▋     | 463/1000 [22:20<24:47,  2.77s/it]

{'loss': 0.002, 'grad_norm': 0.024054836481809616, 'learning_rate': 0.001075075075075075, 'epoch': 43.07}


 46%|████▋     | 464/1000 [22:23<24:43,  2.77s/it]

{'loss': 0.0015, 'grad_norm': 0.03442402184009552, 'learning_rate': 0.0010730730730730731, 'epoch': 43.16}


 46%|████▋     | 465/1000 [22:25<24:48,  2.78s/it]

{'loss': 0.0018, 'grad_norm': 0.03139084577560425, 'learning_rate': 0.0010710710710710712, 'epoch': 43.26}


 47%|████▋     | 466/1000 [22:28<24:47,  2.78s/it]

{'loss': 0.0026, 'grad_norm': 0.04958639666438103, 'learning_rate': 0.0010690690690690692, 'epoch': 43.35}


 47%|████▋     | 467/1000 [22:31<24:39,  2.77s/it]

{'loss': 0.0016, 'grad_norm': 0.03323910012841225, 'learning_rate': 0.001067067067067067, 'epoch': 43.44}


 47%|████▋     | 468/1000 [22:34<24:37,  2.78s/it]

{'loss': 0.0019, 'grad_norm': 0.038213685154914856, 'learning_rate': 0.0010650650650650651, 'epoch': 43.53}


 47%|████▋     | 469/1000 [22:36<24:34,  2.78s/it]

{'loss': 0.0022, 'grad_norm': 0.07056744396686554, 'learning_rate': 0.0010630630630630632, 'epoch': 43.63}


 47%|████▋     | 470/1000 [22:39<24:32,  2.78s/it]

{'loss': 0.0014, 'grad_norm': 0.03680761158466339, 'learning_rate': 0.0010610610610610612, 'epoch': 43.72}


 47%|████▋     | 471/1000 [22:42<24:25,  2.77s/it]

{'loss': 0.0031, 'grad_norm': 0.05797829478979111, 'learning_rate': 0.001059059059059059, 'epoch': 43.81}


 47%|████▋     | 472/1000 [22:45<24:18,  2.76s/it]

{'loss': 0.0023, 'grad_norm': 0.03306601941585541, 'learning_rate': 0.001057057057057057, 'epoch': 43.91}


 47%|████▋     | 473/1000 [22:47<24:13,  2.76s/it]

{'loss': 0.0029, 'grad_norm': 0.03531021997332573, 'learning_rate': 0.001055055055055055, 'epoch': 44.0}


 47%|████▋     | 474/1000 [22:50<24:17,  2.77s/it]

{'loss': 0.0018, 'grad_norm': 0.03529014438390732, 'learning_rate': 0.001053053053053053, 'epoch': 44.09}


 48%|████▊     | 475/1000 [22:53<24:19,  2.78s/it]

{'loss': 0.0013, 'grad_norm': 0.029707856476306915, 'learning_rate': 0.001051051051051051, 'epoch': 44.19}


 48%|████▊     | 476/1000 [22:56<24:15,  2.78s/it]

{'loss': 0.0019, 'grad_norm': 0.038505978882312775, 'learning_rate': 0.001049049049049049, 'epoch': 44.28}


 48%|████▊     | 477/1000 [22:59<24:10,  2.77s/it]

{'loss': 0.0016, 'grad_norm': 0.03348147124052048, 'learning_rate': 0.001047047047047047, 'epoch': 44.37}


 48%|████▊     | 478/1000 [23:01<24:04,  2.77s/it]

{'loss': 0.0018, 'grad_norm': 0.03438611328601837, 'learning_rate': 0.001045045045045045, 'epoch': 44.47}


 48%|████▊     | 479/1000 [23:04<24:01,  2.77s/it]

{'loss': 0.0023, 'grad_norm': 0.041374336928129196, 'learning_rate': 0.001043043043043043, 'epoch': 44.56}


 48%|████▊     | 480/1000 [23:07<23:59,  2.77s/it]

{'loss': 0.0021, 'grad_norm': 0.03722222149372101, 'learning_rate': 0.0010410410410410412, 'epoch': 44.65}


 48%|████▊     | 481/1000 [23:10<23:59,  2.77s/it]

{'loss': 0.0016, 'grad_norm': 0.027286631986498833, 'learning_rate': 0.001039039039039039, 'epoch': 44.74}


 48%|████▊     | 482/1000 [23:12<23:56,  2.77s/it]

{'loss': 0.0012, 'grad_norm': 0.016527727246284485, 'learning_rate': 0.001037037037037037, 'epoch': 44.84}


 48%|████▊     | 483/1000 [23:15<23:51,  2.77s/it]

{'loss': 0.0033, 'grad_norm': 0.06069435924291611, 'learning_rate': 0.001035035035035035, 'epoch': 44.93}


 48%|████▊     | 484/1000 [23:18<23:46,  2.76s/it]

{'loss': 0.0024, 'grad_norm': 0.04517413303256035, 'learning_rate': 0.0010330330330330332, 'epoch': 45.02}


 48%|████▊     | 485/1000 [23:21<23:47,  2.77s/it]

{'loss': 0.0013, 'grad_norm': 0.024984322488307953, 'learning_rate': 0.0010310310310310312, 'epoch': 45.12}


 49%|████▊     | 486/1000 [23:24<23:42,  2.77s/it]

{'loss': 0.0015, 'grad_norm': 0.03412657603621483, 'learning_rate': 0.0010290290290290288, 'epoch': 45.21}


 49%|████▊     | 487/1000 [23:26<23:39,  2.77s/it]

{'loss': 0.0015, 'grad_norm': 0.022605758160352707, 'learning_rate': 0.001027027027027027, 'epoch': 45.3}


 49%|████▉     | 488/1000 [23:29<23:39,  2.77s/it]

{'loss': 0.0019, 'grad_norm': 0.034697774797677994, 'learning_rate': 0.001025025025025025, 'epoch': 45.4}


 49%|████▉     | 489/1000 [23:32<23:35,  2.77s/it]

{'loss': 0.0017, 'grad_norm': 0.031802624464035034, 'learning_rate': 0.001023023023023023, 'epoch': 45.49}


 49%|████▉     | 490/1000 [23:35<23:34,  2.77s/it]

{'loss': 0.0015, 'grad_norm': 0.04652908816933632, 'learning_rate': 0.001021021021021021, 'epoch': 45.58}


 49%|████▉     | 491/1000 [23:37<23:36,  2.78s/it]

{'loss': 0.0012, 'grad_norm': 0.011993377469480038, 'learning_rate': 0.001019019019019019, 'epoch': 45.67}


 49%|████▉     | 492/1000 [23:40<23:35,  2.79s/it]

{'loss': 0.0019, 'grad_norm': 0.062344640493392944, 'learning_rate': 0.001017017017017017, 'epoch': 45.77}


 49%|████▉     | 493/1000 [23:43<23:37,  2.80s/it]

{'loss': 0.0016, 'grad_norm': 0.030106155201792717, 'learning_rate': 0.001015015015015015, 'epoch': 45.86}


 49%|████▉     | 494/1000 [23:46<23:38,  2.80s/it]

{'loss': 0.0018, 'grad_norm': 0.04093213379383087, 'learning_rate': 0.001013013013013013, 'epoch': 45.95}


 50%|████▉     | 495/1000 [23:49<23:38,  2.81s/it]

{'loss': 0.0015, 'grad_norm': 0.02282726764678955, 'learning_rate': 0.0010110110110110111, 'epoch': 46.05}


 50%|████▉     | 496/1000 [23:52<23:34,  2.81s/it]

{'loss': 0.0015, 'grad_norm': 0.04156970977783203, 'learning_rate': 0.001009009009009009, 'epoch': 46.14}


 50%|████▉     | 497/1000 [23:54<23:29,  2.80s/it]

{'loss': 0.0027, 'grad_norm': 0.030880684033036232, 'learning_rate': 0.001007007007007007, 'epoch': 46.23}


 50%|████▉     | 498/1000 [23:57<23:30,  2.81s/it]

{'loss': 0.0025, 'grad_norm': 0.042682062834501266, 'learning_rate': 0.001005005005005005, 'epoch': 46.33}


 50%|████▉     | 499/1000 [24:00<23:30,  2.82s/it]

{'loss': 0.0012, 'grad_norm': 0.009067803621292114, 'learning_rate': 0.0010030030030030032, 'epoch': 46.42}


 50%|█████     | 500/1000 [24:03<23:22,  2.80s/it]
Checkpoint destination directory outputs/checkpoint-500 already exists and is non-empty. Saving will proceed but saved results may be invalid.


{'loss': 0.0011, 'grad_norm': 0.014194684103131294, 'learning_rate': 0.0010010010010010012, 'epoch': 46.51}


[34m[1mwandb[0m: Adding directory to artifact (./outputs/checkpoint-500)... Done. 0.2s
 50%|█████     | 501/1000 [24:16<49:59,  6.01s/it]

{'loss': 0.0014, 'grad_norm': 0.030148006975650787, 'learning_rate': 0.000998998998998999, 'epoch': 46.6}


 50%|█████     | 502/1000 [24:19<41:47,  5.03s/it]

{'loss': 0.0011, 'grad_norm': 0.008643808774650097, 'learning_rate': 0.0009969969969969969, 'epoch': 46.7}


 50%|█████     | 503/1000 [24:22<36:01,  4.35s/it]

{'loss': 0.0011, 'grad_norm': 0.011928128078579903, 'learning_rate': 0.000994994994994995, 'epoch': 46.79}


 50%|█████     | 504/1000 [24:24<32:00,  3.87s/it]

{'loss': 0.0022, 'grad_norm': 0.031719814985990524, 'learning_rate': 0.000992992992992993, 'epoch': 46.88}


 50%|█████     | 505/1000 [24:27<29:17,  3.55s/it]

{'loss': 0.0011, 'grad_norm': 0.01134156621992588, 'learning_rate': 0.000990990990990991, 'epoch': 46.98}


 51%|█████     | 506/1000 [24:30<27:20,  3.32s/it]

{'loss': 0.0011, 'grad_norm': 0.011974232271313667, 'learning_rate': 0.0009889889889889891, 'epoch': 47.07}


 51%|█████     | 507/1000 [24:33<25:54,  3.15s/it]

{'loss': 0.001, 'grad_norm': 0.009685841389000416, 'learning_rate': 0.000986986986986987, 'epoch': 47.16}


 51%|█████     | 508/1000 [24:36<24:54,  3.04s/it]

{'loss': 0.001, 'grad_norm': 0.008041341789066792, 'learning_rate': 0.000984984984984985, 'epoch': 47.26}


 51%|█████     | 509/1000 [24:38<24:15,  2.96s/it]

{'loss': 0.0012, 'grad_norm': 0.010544364340603352, 'learning_rate': 0.000982982982982983, 'epoch': 47.35}


 51%|█████     | 510/1000 [24:41<23:48,  2.92s/it]

{'loss': 0.0011, 'grad_norm': 0.008778706192970276, 'learning_rate': 0.000980980980980981, 'epoch': 47.44}


 51%|█████     | 511/1000 [24:44<23:26,  2.88s/it]

{'loss': 0.001, 'grad_norm': 0.007793562486767769, 'learning_rate': 0.000978978978978979, 'epoch': 47.53}


 51%|█████     | 512/1000 [24:47<23:06,  2.84s/it]

{'loss': 0.0011, 'grad_norm': 0.010472383350133896, 'learning_rate': 0.000976976976976977, 'epoch': 47.63}


 51%|█████▏    | 513/1000 [24:50<22:54,  2.82s/it]

{'loss': 0.001, 'grad_norm': 0.00800417922437191, 'learning_rate': 0.000974974974974975, 'epoch': 47.72}


 51%|█████▏    | 514/1000 [24:52<22:40,  2.80s/it]

{'loss': 0.0011, 'grad_norm': 0.010395576246082783, 'learning_rate': 0.000972972972972973, 'epoch': 47.81}


 52%|█████▏    | 515/1000 [24:55<22:35,  2.79s/it]

{'loss': 0.0013, 'grad_norm': 0.031162628903985023, 'learning_rate': 0.000970970970970971, 'epoch': 47.91}


 52%|█████▏    | 516/1000 [24:58<22:33,  2.80s/it]

{'loss': 0.0011, 'grad_norm': 0.008229011669754982, 'learning_rate': 0.000968968968968969, 'epoch': 48.0}


 52%|█████▏    | 517/1000 [25:01<22:29,  2.79s/it]

{'loss': 0.0009, 'grad_norm': 0.005693014711141586, 'learning_rate': 0.000966966966966967, 'epoch': 48.09}


 52%|█████▏    | 518/1000 [25:03<22:23,  2.79s/it]

{'loss': 0.0009, 'grad_norm': 0.00472146924585104, 'learning_rate': 0.000964964964964965, 'epoch': 48.19}


 52%|█████▏    | 519/1000 [25:06<22:15,  2.78s/it]

{'loss': 0.0009, 'grad_norm': 0.006739554926753044, 'learning_rate': 0.0009629629629629629, 'epoch': 48.28}


 52%|█████▏    | 520/1000 [25:09<22:08,  2.77s/it]

{'loss': 0.0009, 'grad_norm': 0.006326691247522831, 'learning_rate': 0.0009609609609609609, 'epoch': 48.37}


 52%|█████▏    | 521/1000 [25:12<22:07,  2.77s/it]

{'loss': 0.0009, 'grad_norm': 0.004322423599660397, 'learning_rate': 0.000958958958958959, 'epoch': 48.47}


 52%|█████▏    | 522/1000 [25:14<22:02,  2.77s/it]

{'loss': 0.001, 'grad_norm': 0.005275717470794916, 'learning_rate': 0.0009569569569569569, 'epoch': 48.56}


 52%|█████▏    | 523/1000 [25:17<21:58,  2.76s/it]

{'loss': 0.0011, 'grad_norm': 0.007286928128451109, 'learning_rate': 0.000954954954954955, 'epoch': 48.65}


 52%|█████▏    | 524/1000 [25:20<21:57,  2.77s/it]

{'loss': 0.001, 'grad_norm': 0.009518984705209732, 'learning_rate': 0.000952952952952953, 'epoch': 48.74}


 52%|█████▎    | 525/1000 [25:23<21:56,  2.77s/it]

{'loss': 0.0011, 'grad_norm': 0.0076581742614507675, 'learning_rate': 0.000950950950950951, 'epoch': 48.84}


 53%|█████▎    | 526/1000 [25:25<21:49,  2.76s/it]

{'loss': 0.0012, 'grad_norm': 0.008818943053483963, 'learning_rate': 0.000948948948948949, 'epoch': 48.93}


 53%|█████▎    | 527/1000 [25:28<21:46,  2.76s/it]

{'loss': 0.001, 'grad_norm': 0.005841933190822601, 'learning_rate': 0.0009469469469469469, 'epoch': 49.02}


 53%|█████▎    | 528/1000 [25:31<21:44,  2.76s/it]

{'loss': 0.001, 'grad_norm': 0.007067856378853321, 'learning_rate': 0.000944944944944945, 'epoch': 49.12}


 53%|█████▎    | 529/1000 [25:34<21:45,  2.77s/it]

{'loss': 0.0009, 'grad_norm': 0.022261235862970352, 'learning_rate': 0.0009429429429429429, 'epoch': 49.21}


 53%|█████▎    | 530/1000 [25:37<21:41,  2.77s/it]

{'loss': 0.0009, 'grad_norm': 0.005122487433254719, 'learning_rate': 0.000940940940940941, 'epoch': 49.3}


 53%|█████▎    | 531/1000 [25:39<21:36,  2.76s/it]

{'loss': 0.001, 'grad_norm': 0.005736043211072683, 'learning_rate': 0.000938938938938939, 'epoch': 49.4}


 53%|█████▎    | 532/1000 [25:42<21:39,  2.78s/it]

{'loss': 0.0009, 'grad_norm': 0.005500317085534334, 'learning_rate': 0.000936936936936937, 'epoch': 49.49}


 53%|█████▎    | 533/1000 [25:45<21:37,  2.78s/it]

{'loss': 0.001, 'grad_norm': 0.005690107587724924, 'learning_rate': 0.000934934934934935, 'epoch': 49.58}


 53%|█████▎    | 534/1000 [25:48<21:29,  2.77s/it]

{'loss': 0.0011, 'grad_norm': 0.01583140902221203, 'learning_rate': 0.0009329329329329329, 'epoch': 49.67}


 54%|█████▎    | 535/1000 [25:50<21:24,  2.76s/it]

{'loss': 0.001, 'grad_norm': 0.006315015256404877, 'learning_rate': 0.0009309309309309309, 'epoch': 49.77}


 54%|█████▎    | 536/1000 [25:53<21:22,  2.76s/it]

{'loss': 0.001, 'grad_norm': 0.007522172760218382, 'learning_rate': 0.000928928928928929, 'epoch': 49.86}


 54%|█████▎    | 537/1000 [25:56<21:17,  2.76s/it]

{'loss': 0.001, 'grad_norm': 0.005803990643471479, 'learning_rate': 0.0009269269269269269, 'epoch': 49.95}


 54%|█████▍    | 538/1000 [25:59<21:15,  2.76s/it]

{'loss': 0.001, 'grad_norm': 0.006560762412846088, 'learning_rate': 0.000924924924924925, 'epoch': 50.05}


 54%|█████▍    | 539/1000 [26:01<21:10,  2.76s/it]

{'loss': 0.0009, 'grad_norm': 0.005123397801071405, 'learning_rate': 0.0009229229229229229, 'epoch': 50.14}


 54%|█████▍    | 540/1000 [26:04<21:10,  2.76s/it]

{'loss': 0.0009, 'grad_norm': 0.005609103478491306, 'learning_rate': 0.000920920920920921, 'epoch': 50.23}


 54%|█████▍    | 541/1000 [26:07<21:11,  2.77s/it]

{'loss': 0.0009, 'grad_norm': 0.004306567832827568, 'learning_rate': 0.0009189189189189189, 'epoch': 50.33}


 54%|█████▍    | 542/1000 [26:10<21:10,  2.77s/it]

{'loss': 0.0011, 'grad_norm': 0.018686862662434578, 'learning_rate': 0.0009169169169169169, 'epoch': 50.42}


 54%|█████▍    | 543/1000 [26:13<21:12,  2.78s/it]

{'loss': 0.0009, 'grad_norm': 0.00628719013184309, 'learning_rate': 0.000914914914914915, 'epoch': 50.51}


 54%|█████▍    | 544/1000 [26:15<21:02,  2.77s/it]

{'loss': 0.001, 'grad_norm': 0.005820378195494413, 'learning_rate': 0.0009129129129129129, 'epoch': 50.6}


 55%|█████▍    | 545/1000 [26:18<20:56,  2.76s/it]

{'loss': 0.001, 'grad_norm': 0.007535787299275398, 'learning_rate': 0.000910910910910911, 'epoch': 50.7}


 55%|█████▍    | 546/1000 [26:21<20:56,  2.77s/it]

{'loss': 0.001, 'grad_norm': 0.006216641515493393, 'learning_rate': 0.000908908908908909, 'epoch': 50.79}


 55%|█████▍    | 547/1000 [26:24<20:57,  2.78s/it]

{'loss': 0.001, 'grad_norm': 0.0056008230894804, 'learning_rate': 0.000906906906906907, 'epoch': 50.88}


 55%|█████▍    | 548/1000 [26:26<20:58,  2.78s/it]

{'loss': 0.001, 'grad_norm': 0.006009987089782953, 'learning_rate': 0.0009049049049049049, 'epoch': 50.98}


 55%|█████▍    | 549/1000 [26:29<20:59,  2.79s/it]

{'loss': 0.0009, 'grad_norm': 0.00538256112486124, 'learning_rate': 0.0009029029029029029, 'epoch': 51.07}


 55%|█████▌    | 550/1000 [26:32<20:54,  2.79s/it]

{'loss': 0.0009, 'grad_norm': 0.004297400824725628, 'learning_rate': 0.0009009009009009009, 'epoch': 51.16}


 55%|█████▌    | 551/1000 [26:35<20:49,  2.78s/it]

{'loss': 0.0009, 'grad_norm': 0.00530287204310298, 'learning_rate': 0.0008988988988988989, 'epoch': 51.26}


 55%|█████▌    | 552/1000 [26:38<20:46,  2.78s/it]

{'loss': 0.0009, 'grad_norm': 0.005018638446927071, 'learning_rate': 0.0008968968968968969, 'epoch': 51.35}


 55%|█████▌    | 553/1000 [26:40<20:43,  2.78s/it]

{'loss': 0.0009, 'grad_norm': 0.0048153214156627655, 'learning_rate': 0.000894894894894895, 'epoch': 51.44}


 55%|█████▌    | 554/1000 [26:43<20:40,  2.78s/it]

{'loss': 0.001, 'grad_norm': 0.005478543229401112, 'learning_rate': 0.0008928928928928929, 'epoch': 51.53}


 56%|█████▌    | 555/1000 [26:46<20:40,  2.79s/it]

{'loss': 0.0009, 'grad_norm': 0.006168559659272432, 'learning_rate': 0.0008908908908908909, 'epoch': 51.63}


 56%|█████▌    | 556/1000 [26:49<20:35,  2.78s/it]

{'loss': 0.001, 'grad_norm': 0.004636559169739485, 'learning_rate': 0.0008888888888888888, 'epoch': 51.72}


 56%|█████▌    | 557/1000 [26:51<20:29,  2.78s/it]

{'loss': 0.001, 'grad_norm': 0.006312352139502764, 'learning_rate': 0.0008868868868868869, 'epoch': 51.81}


 56%|█████▌    | 558/1000 [26:54<20:25,  2.77s/it]

{'loss': 0.001, 'grad_norm': 0.005704349838197231, 'learning_rate': 0.0008848848848848849, 'epoch': 51.91}


 56%|█████▌    | 559/1000 [26:57<20:25,  2.78s/it]

{'loss': 0.0011, 'grad_norm': 0.008049027994275093, 'learning_rate': 0.0008828828828828829, 'epoch': 52.0}


 56%|█████▌    | 560/1000 [27:00<20:23,  2.78s/it]

{'loss': 0.001, 'grad_norm': 0.005900105927139521, 'learning_rate': 0.0008808808808808809, 'epoch': 52.09}


 56%|█████▌    | 561/1000 [27:03<20:19,  2.78s/it]

{'loss': 0.0009, 'grad_norm': 0.006347515154629946, 'learning_rate': 0.0008788788788788789, 'epoch': 52.19}


 56%|█████▌    | 562/1000 [27:05<20:21,  2.79s/it]

{'loss': 0.0009, 'grad_norm': 0.006557863671332598, 'learning_rate': 0.000876876876876877, 'epoch': 52.28}


 56%|█████▋    | 563/1000 [27:08<20:23,  2.80s/it]

{'loss': 0.0009, 'grad_norm': 0.00601592194288969, 'learning_rate': 0.0008748748748748749, 'epoch': 52.37}


 56%|█████▋    | 564/1000 [27:11<20:26,  2.81s/it]

{'loss': 0.001, 'grad_norm': 0.016323665156960487, 'learning_rate': 0.0008728728728728728, 'epoch': 52.47}


 56%|█████▋    | 565/1000 [27:14<20:27,  2.82s/it]

{'loss': 0.001, 'grad_norm': 0.004659431055188179, 'learning_rate': 0.0008708708708708709, 'epoch': 52.56}


 57%|█████▋    | 566/1000 [27:17<20:23,  2.82s/it]

{'loss': 0.0009, 'grad_norm': 0.00462191179394722, 'learning_rate': 0.0008688688688688689, 'epoch': 52.65}


 57%|█████▋    | 567/1000 [27:20<20:19,  2.82s/it]

{'loss': 0.001, 'grad_norm': 0.005063162650913, 'learning_rate': 0.0008668668668668669, 'epoch': 52.74}


 57%|█████▋    | 568/1000 [27:22<20:13,  2.81s/it]

{'loss': 0.001, 'grad_norm': 0.007546721026301384, 'learning_rate': 0.000864864864864865, 'epoch': 52.84}


 57%|█████▋    | 569/1000 [27:25<20:11,  2.81s/it]

{'loss': 0.001, 'grad_norm': 0.0063664885237813, 'learning_rate': 0.0008628628628628629, 'epoch': 52.93}


 57%|█████▋    | 570/1000 [27:28<20:13,  2.82s/it]

{'loss': 0.0011, 'grad_norm': 0.007820897735655308, 'learning_rate': 0.0008608608608608609, 'epoch': 53.02}


 57%|█████▋    | 571/1000 [27:31<20:12,  2.83s/it]

{'loss': 0.0009, 'grad_norm': 0.0044373441487550735, 'learning_rate': 0.0008588588588588588, 'epoch': 53.12}


 57%|█████▋    | 572/1000 [27:34<20:07,  2.82s/it]

{'loss': 0.0009, 'grad_norm': 0.005179597996175289, 'learning_rate': 0.0008568568568568569, 'epoch': 53.21}


 57%|█████▋    | 573/1000 [27:36<20:01,  2.81s/it]

{'loss': 0.0009, 'grad_norm': 0.004271842539310455, 'learning_rate': 0.0008548548548548549, 'epoch': 53.3}


 57%|█████▋    | 574/1000 [27:39<20:05,  2.83s/it]

{'loss': 0.0009, 'grad_norm': 0.005934496410191059, 'learning_rate': 0.0008528528528528529, 'epoch': 53.4}


 57%|█████▊    | 575/1000 [27:42<20:07,  2.84s/it]

{'loss': 0.001, 'grad_norm': 0.005639166571199894, 'learning_rate': 0.0008508508508508509, 'epoch': 53.49}


 58%|█████▊    | 576/1000 [27:45<20:01,  2.83s/it]

{'loss': 0.001, 'grad_norm': 0.00574887078255415, 'learning_rate': 0.0008488488488488489, 'epoch': 53.58}


 58%|█████▊    | 577/1000 [27:48<19:49,  2.81s/it]

{'loss': 0.001, 'grad_norm': 0.0070459553971886635, 'learning_rate': 0.0008468468468468468, 'epoch': 53.67}


 58%|█████▊    | 578/1000 [27:51<19:40,  2.80s/it]

{'loss': 0.001, 'grad_norm': 0.0053420052863657475, 'learning_rate': 0.0008448448448448449, 'epoch': 53.77}


 58%|█████▊    | 579/1000 [27:53<19:32,  2.79s/it]

{'loss': 0.001, 'grad_norm': 0.005432996898889542, 'learning_rate': 0.0008428428428428428, 'epoch': 53.86}


 58%|█████▊    | 580/1000 [27:56<19:27,  2.78s/it]

{'loss': 0.001, 'grad_norm': 0.005831349175423384, 'learning_rate': 0.0008408408408408409, 'epoch': 53.95}


 58%|█████▊    | 581/1000 [27:59<19:23,  2.78s/it]

{'loss': 0.001, 'grad_norm': 0.00669609010219574, 'learning_rate': 0.0008388388388388388, 'epoch': 54.05}


 58%|█████▊    | 582/1000 [28:02<19:18,  2.77s/it]

{'loss': 0.0009, 'grad_norm': 0.005738838575780392, 'learning_rate': 0.0008368368368368369, 'epoch': 54.14}


 58%|█████▊    | 583/1000 [28:04<19:13,  2.77s/it]

{'loss': 0.0009, 'grad_norm': 0.005119386129081249, 'learning_rate': 0.0008348348348348348, 'epoch': 54.23}


 58%|█████▊    | 584/1000 [28:07<19:09,  2.76s/it]

{'loss': 0.0009, 'grad_norm': 0.004580301232635975, 'learning_rate': 0.0008328328328328328, 'epoch': 54.33}


 58%|█████▊    | 585/1000 [28:10<19:08,  2.77s/it]

{'loss': 0.0009, 'grad_norm': 0.0041924030520021915, 'learning_rate': 0.0008308308308308308, 'epoch': 54.42}


 59%|█████▊    | 586/1000 [28:13<19:11,  2.78s/it]

{'loss': 0.001, 'grad_norm': 0.005346381571143866, 'learning_rate': 0.0008288288288288288, 'epoch': 54.51}


 59%|█████▊    | 587/1000 [28:15<19:08,  2.78s/it]

{'loss': 0.001, 'grad_norm': 0.005660634953528643, 'learning_rate': 0.0008268268268268269, 'epoch': 54.6}


 59%|█████▉    | 588/1000 [28:18<19:04,  2.78s/it]

{'loss': 0.0009, 'grad_norm': 0.005552636459469795, 'learning_rate': 0.0008248248248248248, 'epoch': 54.7}


 59%|█████▉    | 589/1000 [28:21<19:02,  2.78s/it]

{'loss': 0.0009, 'grad_norm': 0.005862879566848278, 'learning_rate': 0.0008228228228228229, 'epoch': 54.79}


 59%|█████▉    | 590/1000 [28:24<19:00,  2.78s/it]

{'loss': 0.001, 'grad_norm': 0.007424304727464914, 'learning_rate': 0.0008208208208208209, 'epoch': 54.88}


 59%|█████▉    | 591/1000 [28:27<19:00,  2.79s/it]

{'loss': 0.001, 'grad_norm': 0.00554997380822897, 'learning_rate': 0.0008188188188188188, 'epoch': 54.98}


 59%|█████▉    | 592/1000 [28:29<18:55,  2.78s/it]

{'loss': 0.0009, 'grad_norm': 0.003520325059071183, 'learning_rate': 0.0008168168168168168, 'epoch': 55.07}


 59%|█████▉    | 593/1000 [28:32<18:50,  2.78s/it]

{'loss': 0.0009, 'grad_norm': 0.004904530942440033, 'learning_rate': 0.0008148148148148148, 'epoch': 55.16}


 59%|█████▉    | 594/1000 [28:35<18:50,  2.79s/it]

{'loss': 0.0009, 'grad_norm': 0.0051420219242572784, 'learning_rate': 0.0008128128128128128, 'epoch': 55.26}


 60%|█████▉    | 595/1000 [28:38<18:45,  2.78s/it]

{'loss': 0.0009, 'grad_norm': 0.004829346667975187, 'learning_rate': 0.0008108108108108109, 'epoch': 55.35}


 60%|█████▉    | 596/1000 [28:40<18:40,  2.77s/it]

{'loss': 0.0009, 'grad_norm': 0.004913767799735069, 'learning_rate': 0.0008088088088088088, 'epoch': 55.44}


 60%|█████▉    | 597/1000 [28:43<18:36,  2.77s/it]

{'loss': 0.001, 'grad_norm': 0.006039491388946772, 'learning_rate': 0.0008068068068068069, 'epoch': 55.53}


 60%|█████▉    | 598/1000 [28:46<18:36,  2.78s/it]

{'loss': 0.0009, 'grad_norm': 0.005799802485853434, 'learning_rate': 0.0008048048048048048, 'epoch': 55.63}


 60%|█████▉    | 599/1000 [28:49<18:37,  2.79s/it]

{'loss': 0.001, 'grad_norm': 0.007055402733385563, 'learning_rate': 0.0008028028028028028, 'epoch': 55.72}


 60%|██████    | 600/1000 [28:52<18:33,  2.78s/it]

{'loss': 0.0009, 'grad_norm': 0.004408523440361023, 'learning_rate': 0.0008008008008008008, 'epoch': 55.81}


 60%|██████    | 601/1000 [28:54<18:29,  2.78s/it]

{'loss': 0.001, 'grad_norm': 0.006718651857227087, 'learning_rate': 0.0007987987987987988, 'epoch': 55.91}


 60%|██████    | 602/1000 [28:57<18:23,  2.77s/it]

{'loss': 0.001, 'grad_norm': 0.005820451304316521, 'learning_rate': 0.0007967967967967968, 'epoch': 56.0}


 60%|██████    | 603/1000 [29:00<18:20,  2.77s/it]

{'loss': 0.0008, 'grad_norm': 0.005731355398893356, 'learning_rate': 0.0007947947947947948, 'epoch': 56.09}


 60%|██████    | 604/1000 [29:03<18:17,  2.77s/it]

{'loss': 0.0009, 'grad_norm': 0.005305266939103603, 'learning_rate': 0.0007927927927927928, 'epoch': 56.19}


 60%|██████    | 605/1000 [29:05<18:14,  2.77s/it]

{'loss': 0.0009, 'grad_norm': 0.006083925720304251, 'learning_rate': 0.0007907907907907909, 'epoch': 56.28}


 61%|██████    | 606/1000 [29:08<18:11,  2.77s/it]

{'loss': 0.0009, 'grad_norm': 0.004002666100859642, 'learning_rate': 0.0007887887887887887, 'epoch': 56.37}


 61%|██████    | 607/1000 [29:11<18:08,  2.77s/it]

{'loss': 0.0009, 'grad_norm': 0.006040920503437519, 'learning_rate': 0.0007867867867867868, 'epoch': 56.47}


 61%|██████    | 608/1000 [29:14<18:04,  2.77s/it]

{'loss': 0.0009, 'grad_norm': 0.004592646844685078, 'learning_rate': 0.0007847847847847848, 'epoch': 56.56}


 61%|██████    | 609/1000 [29:17<18:05,  2.78s/it]

{'loss': 0.0009, 'grad_norm': 0.00515854824334383, 'learning_rate': 0.0007827827827827828, 'epoch': 56.65}


 61%|██████    | 610/1000 [29:19<18:01,  2.77s/it]

{'loss': 0.001, 'grad_norm': 0.005443588364869356, 'learning_rate': 0.0007807807807807808, 'epoch': 56.74}


 61%|██████    | 611/1000 [29:22<18:01,  2.78s/it]

{'loss': 0.0009, 'grad_norm': 0.005207471549510956, 'learning_rate': 0.0007787787787787788, 'epoch': 56.84}


 61%|██████    | 612/1000 [29:25<18:00,  2.78s/it]

{'loss': 0.001, 'grad_norm': 0.0059023452922701836, 'learning_rate': 0.0007767767767767769, 'epoch': 56.93}


 61%|██████▏   | 613/1000 [29:28<17:59,  2.79s/it]

{'loss': 0.001, 'grad_norm': 0.005412122700363398, 'learning_rate': 0.0007747747747747747, 'epoch': 57.02}


 61%|██████▏   | 614/1000 [29:30<17:56,  2.79s/it]

{'loss': 0.0009, 'grad_norm': 0.004014482256025076, 'learning_rate': 0.0007727727727727728, 'epoch': 57.12}


 62%|██████▏   | 615/1000 [29:33<17:53,  2.79s/it]

{'loss': 0.0009, 'grad_norm': 0.0038894633762538433, 'learning_rate': 0.0007707707707707707, 'epoch': 57.21}


 62%|██████▏   | 616/1000 [29:36<17:49,  2.78s/it]

{'loss': 0.0009, 'grad_norm': 0.005240234546363354, 'learning_rate': 0.0007687687687687688, 'epoch': 57.3}


 62%|██████▏   | 617/1000 [29:39<17:41,  2.77s/it]

{'loss': 0.0009, 'grad_norm': 0.004040597938001156, 'learning_rate': 0.0007667667667667668, 'epoch': 57.4}


 62%|██████▏   | 618/1000 [29:42<17:38,  2.77s/it]

{'loss': 0.0009, 'grad_norm': 0.005286985542625189, 'learning_rate': 0.0007647647647647648, 'epoch': 57.49}


 62%|██████▏   | 619/1000 [29:44<17:34,  2.77s/it]

{'loss': 0.0009, 'grad_norm': 0.004892811644822359, 'learning_rate': 0.0007627627627627628, 'epoch': 57.58}


 62%|██████▏   | 620/1000 [29:47<17:30,  2.76s/it]

{'loss': 0.001, 'grad_norm': 0.005646155681461096, 'learning_rate': 0.0007607607607607607, 'epoch': 57.67}


 62%|██████▏   | 621/1000 [29:50<17:26,  2.76s/it]

{'loss': 0.0009, 'grad_norm': 0.004856550600379705, 'learning_rate': 0.0007587587587587587, 'epoch': 57.77}


 62%|██████▏   | 622/1000 [29:53<17:25,  2.76s/it]

{'loss': 0.001, 'grad_norm': 0.005464382469654083, 'learning_rate': 0.0007567567567567568, 'epoch': 57.86}


 62%|██████▏   | 623/1000 [29:55<17:24,  2.77s/it]

{'loss': 0.001, 'grad_norm': 0.004736950621008873, 'learning_rate': 0.0007547547547547547, 'epoch': 57.95}


 62%|██████▏   | 624/1000 [29:58<17:21,  2.77s/it]

{'loss': 0.001, 'grad_norm': 0.0054096076637506485, 'learning_rate': 0.0007527527527527528, 'epoch': 58.05}


 62%|██████▎   | 625/1000 [30:01<17:20,  2.77s/it]

{'loss': 0.0009, 'grad_norm': 0.0046325973235070705, 'learning_rate': 0.0007507507507507507, 'epoch': 58.14}


 63%|██████▎   | 626/1000 [30:04<17:12,  2.76s/it]

{'loss': 0.0009, 'grad_norm': 0.00454937806352973, 'learning_rate': 0.0007487487487487488, 'epoch': 58.23}


 63%|██████▎   | 627/1000 [30:06<17:15,  2.78s/it]

{'loss': 0.0009, 'grad_norm': 0.004584938753396273, 'learning_rate': 0.0007467467467467469, 'epoch': 58.33}


 63%|██████▎   | 628/1000 [30:09<17:09,  2.77s/it]

{'loss': 0.0009, 'grad_norm': 0.005642315838485956, 'learning_rate': 0.0007447447447447447, 'epoch': 58.42}


 63%|██████▎   | 629/1000 [30:12<17:06,  2.77s/it]

{'loss': 0.0009, 'grad_norm': 0.0049066124483942986, 'learning_rate': 0.0007427427427427428, 'epoch': 58.51}


 63%|██████▎   | 630/1000 [30:15<17:04,  2.77s/it]

{'loss': 0.0009, 'grad_norm': 0.004979234654456377, 'learning_rate': 0.0007407407407407407, 'epoch': 58.6}


 63%|██████▎   | 631/1000 [30:18<17:06,  2.78s/it]

{'loss': 0.0009, 'grad_norm': 0.004991329275071621, 'learning_rate': 0.0007387387387387388, 'epoch': 58.7}


 63%|██████▎   | 632/1000 [30:20<17:05,  2.79s/it]

{'loss': 0.0009, 'grad_norm': 0.00438812468200922, 'learning_rate': 0.0007367367367367368, 'epoch': 58.79}


 63%|██████▎   | 633/1000 [30:23<17:04,  2.79s/it]

{'loss': 0.001, 'grad_norm': 0.006005723960697651, 'learning_rate': 0.0007347347347347348, 'epoch': 58.88}


 63%|██████▎   | 634/1000 [30:26<17:02,  2.79s/it]

{'loss': 0.001, 'grad_norm': 0.004877123050391674, 'learning_rate': 0.0007327327327327328, 'epoch': 58.98}


 64%|██████▎   | 635/1000 [30:29<16:56,  2.79s/it]

{'loss': 0.0009, 'grad_norm': 0.007415796164423227, 'learning_rate': 0.0007307307307307307, 'epoch': 59.07}


 64%|██████▎   | 636/1000 [30:32<16:54,  2.79s/it]

{'loss': 0.0009, 'grad_norm': 0.004727993626147509, 'learning_rate': 0.0007287287287287287, 'epoch': 59.16}


 64%|██████▎   | 637/1000 [30:34<16:51,  2.79s/it]

{'loss': 0.0009, 'grad_norm': 0.004820824600756168, 'learning_rate': 0.0007267267267267268, 'epoch': 59.26}


 64%|██████▍   | 638/1000 [30:37<16:50,  2.79s/it]

{'loss': 0.0009, 'grad_norm': 0.005232499446719885, 'learning_rate': 0.0007247247247247247, 'epoch': 59.35}


 64%|██████▍   | 639/1000 [30:40<16:47,  2.79s/it]

{'loss': 0.0009, 'grad_norm': 0.004785620607435703, 'learning_rate': 0.0007227227227227228, 'epoch': 59.44}


 64%|██████▍   | 640/1000 [30:43<16:46,  2.80s/it]

{'loss': 0.0009, 'grad_norm': 0.005055941641330719, 'learning_rate': 0.0007207207207207207, 'epoch': 59.53}


 64%|██████▍   | 641/1000 [30:46<16:48,  2.81s/it]

{'loss': 0.0009, 'grad_norm': 0.004847476724535227, 'learning_rate': 0.0007187187187187188, 'epoch': 59.63}


 64%|██████▍   | 642/1000 [30:48<16:45,  2.81s/it]

{'loss': 0.0009, 'grad_norm': 0.004750678315758705, 'learning_rate': 0.0007167167167167166, 'epoch': 59.72}


 64%|██████▍   | 643/1000 [30:51<16:45,  2.82s/it]

{'loss': 0.0009, 'grad_norm': 0.004703650716692209, 'learning_rate': 0.0007147147147147147, 'epoch': 59.81}


 64%|██████▍   | 644/1000 [30:54<16:24,  2.76s/it]

{'loss': 0.001, 'grad_norm': 0.006315199192613363, 'learning_rate': 0.0007127127127127127, 'epoch': 59.91}


 64%|██████▍   | 645/1000 [30:56<16:03,  2.71s/it]

{'loss': 0.001, 'grad_norm': 0.005466056987643242, 'learning_rate': 0.0007107107107107107, 'epoch': 60.0}


 65%|██████▍   | 646/1000 [30:59<15:47,  2.68s/it]

{'loss': 0.0009, 'grad_norm': 0.004128921311348677, 'learning_rate': 0.0007087087087087087, 'epoch': 60.09}


 65%|██████▍   | 647/1000 [31:02<15:41,  2.67s/it]

{'loss': 0.0009, 'grad_norm': 0.00519982585683465, 'learning_rate': 0.0007067067067067067, 'epoch': 60.19}


 65%|██████▍   | 648/1000 [31:04<15:33,  2.65s/it]

{'loss': 0.0009, 'grad_norm': 0.005540397949516773, 'learning_rate': 0.0007047047047047048, 'epoch': 60.28}


 65%|██████▍   | 649/1000 [31:07<15:27,  2.64s/it]

{'loss': 0.0009, 'grad_norm': 0.003583627287298441, 'learning_rate': 0.0007027027027027027, 'epoch': 60.37}


 65%|██████▌   | 650/1000 [31:09<15:17,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.0053488630801439285, 'learning_rate': 0.0007007007007007007, 'epoch': 60.47}


 65%|██████▌   | 651/1000 [31:12<15:22,  2.64s/it]

{'loss': 0.0009, 'grad_norm': 0.005517234094440937, 'learning_rate': 0.0006986986986986987, 'epoch': 60.56}


 65%|██████▌   | 652/1000 [31:15<15:17,  2.64s/it]

{'loss': 0.0009, 'grad_norm': 0.006610073149204254, 'learning_rate': 0.0006966966966966967, 'epoch': 60.65}


 65%|██████▌   | 653/1000 [31:18<15:21,  2.66s/it]

{'loss': 0.0009, 'grad_norm': 0.006281854584813118, 'learning_rate': 0.0006946946946946947, 'epoch': 60.74}


 65%|██████▌   | 654/1000 [31:20<15:22,  2.67s/it]

{'loss': 0.001, 'grad_norm': 0.0055662780068814754, 'learning_rate': 0.0006926926926926928, 'epoch': 60.84}


 66%|██████▌   | 655/1000 [31:23<15:17,  2.66s/it]

{'loss': 0.001, 'grad_norm': 0.0051424275152385235, 'learning_rate': 0.0006906906906906907, 'epoch': 60.93}


 66%|██████▌   | 656/1000 [31:25<15:07,  2.64s/it]

{'loss': 0.001, 'grad_norm': 0.0067566693760454655, 'learning_rate': 0.0006886886886886888, 'epoch': 61.02}


 66%|██████▌   | 657/1000 [31:28<15:02,  2.63s/it]

{'loss': 0.0009, 'grad_norm': 0.0065893204882740974, 'learning_rate': 0.0006866866866866866, 'epoch': 61.12}


 66%|██████▌   | 658/1000 [31:31<14:54,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.005402968265116215, 'learning_rate': 0.0006846846846846847, 'epoch': 61.21}


 66%|██████▌   | 659/1000 [31:33<14:52,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.004769261460751295, 'learning_rate': 0.0006826826826826827, 'epoch': 61.3}


 66%|██████▌   | 660/1000 [31:36<14:47,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.005315279122442007, 'learning_rate': 0.0006806806806806807, 'epoch': 61.4}


 66%|██████▌   | 661/1000 [31:38<14:45,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.005348887760192156, 'learning_rate': 0.0006786786786786787, 'epoch': 61.49}


 66%|██████▌   | 662/1000 [31:41<14:42,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.003994453232735395, 'learning_rate': 0.0006766766766766767, 'epoch': 61.58}


 66%|██████▋   | 663/1000 [31:44<14:35,  2.60s/it]

{'loss': 0.001, 'grad_norm': 0.006423485931009054, 'learning_rate': 0.0006746746746746747, 'epoch': 61.67}


 66%|██████▋   | 664/1000 [31:46<14:28,  2.58s/it]

{'loss': 0.0009, 'grad_norm': 0.005487724207341671, 'learning_rate': 0.0006726726726726727, 'epoch': 61.77}


 66%|██████▋   | 665/1000 [31:49<14:26,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.00510474992915988, 'learning_rate': 0.0006706706706706706, 'epoch': 61.86}


 67%|██████▋   | 666/1000 [31:51<14:20,  2.58s/it]

{'loss': 0.001, 'grad_norm': 0.004575695376843214, 'learning_rate': 0.0006686686686686687, 'epoch': 61.95}


 67%|██████▋   | 667/1000 [31:54<14:17,  2.58s/it]

{'loss': 0.001, 'grad_norm': 0.006318713538348675, 'learning_rate': 0.0006666666666666666, 'epoch': 62.05}


 67%|██████▋   | 668/1000 [31:57<14:28,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.004629228729754686, 'learning_rate': 0.0006646646646646647, 'epoch': 62.14}


 67%|██████▋   | 669/1000 [31:59<14:24,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.00417210441082716, 'learning_rate': 0.0006626626626626628, 'epoch': 62.23}


 67%|██████▋   | 670/1000 [32:02<14:25,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.00515393353998661, 'learning_rate': 0.0006606606606606607, 'epoch': 62.33}


 67%|██████▋   | 671/1000 [32:04<14:21,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.005272227339446545, 'learning_rate': 0.0006586586586586587, 'epoch': 62.42}


 67%|██████▋   | 672/1000 [32:07<14:25,  2.64s/it]

{'loss': 0.0009, 'grad_norm': 0.004666326567530632, 'learning_rate': 0.0006566566566566566, 'epoch': 62.51}


 67%|██████▋   | 673/1000 [32:10<14:20,  2.63s/it]

{'loss': 0.0009, 'grad_norm': 0.004507464822381735, 'learning_rate': 0.0006546546546546547, 'epoch': 62.6}


 67%|██████▋   | 674/1000 [32:12<14:13,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.004493135958909988, 'learning_rate': 0.0006526526526526526, 'epoch': 62.7}


 68%|██████▊   | 675/1000 [32:15<14:07,  2.61s/it]

{'loss': 0.001, 'grad_norm': 0.004387193825095892, 'learning_rate': 0.0006506506506506507, 'epoch': 62.79}


 68%|██████▊   | 676/1000 [32:18<14:05,  2.61s/it]

{'loss': 0.001, 'grad_norm': 0.00477193295955658, 'learning_rate': 0.0006486486486486487, 'epoch': 62.88}


 68%|██████▊   | 677/1000 [32:20<13:57,  2.59s/it]

{'loss': 0.001, 'grad_norm': 0.00627317326143384, 'learning_rate': 0.0006466466466466467, 'epoch': 62.98}


 68%|██████▊   | 678/1000 [32:23<13:56,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.005958099849522114, 'learning_rate': 0.0006446446446446446, 'epoch': 63.07}


 68%|██████▊   | 679/1000 [32:25<13:52,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.005122070200741291, 'learning_rate': 0.0006426426426426426, 'epoch': 63.16}


 68%|██████▊   | 680/1000 [32:28<13:51,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.004745810758322477, 'learning_rate': 0.0006406406406406406, 'epoch': 63.26}


 68%|██████▊   | 681/1000 [32:31<13:55,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.005352159030735493, 'learning_rate': 0.0006386386386386387, 'epoch': 63.35}


 68%|██████▊   | 682/1000 [32:33<13:54,  2.63s/it]

{'loss': 0.0009, 'grad_norm': 0.005828471854329109, 'learning_rate': 0.0006366366366366366, 'epoch': 63.44}


 68%|██████▊   | 683/1000 [32:36<13:49,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.005063640419393778, 'learning_rate': 0.0006346346346346347, 'epoch': 63.53}


 68%|██████▊   | 684/1000 [32:38<13:42,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.005305576603859663, 'learning_rate': 0.0006326326326326326, 'epoch': 63.63}


 68%|██████▊   | 685/1000 [32:41<13:35,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.004934265278279781, 'learning_rate': 0.0006306306306306307, 'epoch': 63.72}


 69%|██████▊   | 686/1000 [32:44<13:36,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.0063949464820325375, 'learning_rate': 0.0006286286286286286, 'epoch': 63.81}


 69%|██████▊   | 687/1000 [32:46<13:35,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.005123211536556482, 'learning_rate': 0.0006266266266266266, 'epoch': 63.91}


 69%|██████▉   | 688/1000 [32:49<13:31,  2.60s/it]

{'loss': 0.001, 'grad_norm': 0.005168468225747347, 'learning_rate': 0.0006246246246246246, 'epoch': 64.0}


 69%|██████▉   | 689/1000 [32:51<13:32,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.004274832550436258, 'learning_rate': 0.0006226226226226226, 'epoch': 64.09}


 69%|██████▉   | 690/1000 [32:54<13:26,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.005837561096996069, 'learning_rate': 0.0006206206206206207, 'epoch': 64.19}


 69%|██████▉   | 691/1000 [32:57<13:26,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.0039881994016468525, 'learning_rate': 0.0006186186186186187, 'epoch': 64.28}


 69%|██████▉   | 692/1000 [32:59<13:23,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.004921737592667341, 'learning_rate': 0.0006166166166166167, 'epoch': 64.37}


 69%|██████▉   | 693/1000 [33:02<13:26,  2.63s/it]

{'loss': 0.0009, 'grad_norm': 0.006080364342778921, 'learning_rate': 0.0006146146146146146, 'epoch': 64.47}


 69%|██████▉   | 694/1000 [33:04<13:20,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.004030365962535143, 'learning_rate': 0.0006126126126126126, 'epoch': 64.56}


 70%|██████▉   | 695/1000 [33:07<13:19,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.004934939555823803, 'learning_rate': 0.0006106106106106106, 'epoch': 64.65}


 70%|██████▉   | 696/1000 [33:10<13:18,  2.63s/it]

{'loss': 0.001, 'grad_norm': 0.004940544255077839, 'learning_rate': 0.0006086086086086087, 'epoch': 64.74}


 70%|██████▉   | 697/1000 [33:12<13:11,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.004450839478522539, 'learning_rate': 0.0006066066066066066, 'epoch': 64.84}


 70%|██████▉   | 698/1000 [33:15<13:06,  2.60s/it]

{'loss': 0.001, 'grad_norm': 0.004888167139142752, 'learning_rate': 0.0006046046046046047, 'epoch': 64.93}


 70%|██████▉   | 699/1000 [33:18<13:03,  2.60s/it]

{'loss': 0.001, 'grad_norm': 0.005614493042230606, 'learning_rate': 0.0006026026026026026, 'epoch': 65.02}


 70%|███████   | 700/1000 [33:20<13:01,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.004568004980683327, 'learning_rate': 0.0006006006006006006, 'epoch': 65.12}


 70%|███████   | 701/1000 [33:23<12:58,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.00491096219047904, 'learning_rate': 0.0005985985985985986, 'epoch': 65.21}


 70%|███████   | 702/1000 [33:25<12:58,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.004417866468429565, 'learning_rate': 0.0005965965965965966, 'epoch': 65.3}


 70%|███████   | 703/1000 [33:28<12:54,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.005960181355476379, 'learning_rate': 0.0005945945945945946, 'epoch': 65.4}


 70%|███████   | 704/1000 [33:31<12:50,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.006144127808511257, 'learning_rate': 0.0005925925925925926, 'epoch': 65.49}


 70%|███████   | 705/1000 [33:33<12:45,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.00465022400021553, 'learning_rate': 0.0005905905905905906, 'epoch': 65.58}


 71%|███████   | 706/1000 [33:36<12:46,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.005546695552766323, 'learning_rate': 0.0005885885885885886, 'epoch': 65.67}


 71%|███████   | 707/1000 [33:38<12:41,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.005240352358669043, 'learning_rate': 0.0005865865865865865, 'epoch': 65.77}


 71%|███████   | 708/1000 [33:41<12:35,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.005172856617718935, 'learning_rate': 0.0005845845845845846, 'epoch': 65.86}


 71%|███████   | 709/1000 [33:44<12:34,  2.59s/it]

{'loss': 0.001, 'grad_norm': 0.006124091800302267, 'learning_rate': 0.0005825825825825825, 'epoch': 65.95}


 71%|███████   | 710/1000 [33:46<12:30,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.005465738940984011, 'learning_rate': 0.0005805805805805806, 'epoch': 66.05}


 71%|███████   | 711/1000 [33:49<12:27,  2.58s/it]

{'loss': 0.0009, 'grad_norm': 0.005223135929554701, 'learning_rate': 0.0005785785785785786, 'epoch': 66.14}


 71%|███████   | 712/1000 [33:51<12:24,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.005092476028949022, 'learning_rate': 0.0005765765765765766, 'epoch': 66.23}


 71%|███████▏  | 713/1000 [33:54<12:24,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.005102634429931641, 'learning_rate': 0.0005745745745745747, 'epoch': 66.33}


 71%|███████▏  | 714/1000 [33:56<12:21,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.005702032241970301, 'learning_rate': 0.0005725725725725726, 'epoch': 66.42}


 72%|███████▏  | 715/1000 [33:59<12:18,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.005094748921692371, 'learning_rate': 0.0005705705705705706, 'epoch': 66.51}


 72%|███████▏  | 716/1000 [34:02<12:22,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.003098653396591544, 'learning_rate': 0.0005685685685685685, 'epoch': 66.6}


 72%|███████▏  | 717/1000 [34:04<12:22,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.005292132962495089, 'learning_rate': 0.0005665665665665666, 'epoch': 66.7}


 72%|███████▏  | 718/1000 [34:07<12:18,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.004768182523548603, 'learning_rate': 0.0005645645645645646, 'epoch': 66.79}


 72%|███████▏  | 719/1000 [34:10<12:15,  2.62s/it]

{'loss': 0.001, 'grad_norm': 0.005750990007072687, 'learning_rate': 0.0005625625625625626, 'epoch': 66.88}


 72%|███████▏  | 720/1000 [34:12<12:09,  2.61s/it]

{'loss': 0.001, 'grad_norm': 0.006446843966841698, 'learning_rate': 0.0005605605605605606, 'epoch': 66.98}


 72%|███████▏  | 721/1000 [34:15<12:07,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.00464686518535018, 'learning_rate': 0.0005585585585585586, 'epoch': 67.07}


 72%|███████▏  | 722/1000 [34:17<12:03,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.004099560435861349, 'learning_rate': 0.0005565565565565565, 'epoch': 67.16}


 72%|███████▏  | 723/1000 [34:20<12:05,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.004498236812651157, 'learning_rate': 0.0005545545545545546, 'epoch': 67.26}


 72%|███████▏  | 724/1000 [34:23<12:00,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.005328867118805647, 'learning_rate': 0.0005525525525525525, 'epoch': 67.35}


 72%|███████▎  | 725/1000 [34:25<11:55,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.006489933934062719, 'learning_rate': 0.0005505505505505506, 'epoch': 67.44}


 73%|███████▎  | 726/1000 [34:28<11:51,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.006325933616608381, 'learning_rate': 0.0005485485485485485, 'epoch': 67.53}


 73%|███████▎  | 727/1000 [34:30<11:49,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.005391648970544338, 'learning_rate': 0.0005465465465465466, 'epoch': 67.63}


 73%|███████▎  | 728/1000 [34:33<11:45,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.004351357463747263, 'learning_rate': 0.0005445445445445447, 'epoch': 67.72}


 73%|███████▎  | 729/1000 [34:36<11:43,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.004834192339330912, 'learning_rate': 0.0005425425425425425, 'epoch': 67.81}


 73%|███████▎  | 730/1000 [34:38<11:41,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.0048619648441672325, 'learning_rate': 0.0005405405405405405, 'epoch': 67.91}


 73%|███████▎  | 731/1000 [34:41<11:39,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.004727635532617569, 'learning_rate': 0.0005385385385385385, 'epoch': 68.0}


 73%|███████▎  | 732/1000 [34:43<11:37,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.0052162036299705505, 'learning_rate': 0.0005365365365365366, 'epoch': 68.09}


 73%|███████▎  | 733/1000 [34:46<11:35,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.00380492745898664, 'learning_rate': 0.0005345345345345346, 'epoch': 68.19}


 73%|███████▎  | 734/1000 [34:49<11:30,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.004356956109404564, 'learning_rate': 0.0005325325325325326, 'epoch': 68.28}


 74%|███████▎  | 735/1000 [34:51<11:28,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.0050609237514436245, 'learning_rate': 0.0005305305305305306, 'epoch': 68.37}


 74%|███████▎  | 736/1000 [34:54<11:27,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.004486294463276863, 'learning_rate': 0.0005285285285285285, 'epoch': 68.47}


 74%|███████▎  | 737/1000 [34:56<11:24,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.004462026990950108, 'learning_rate': 0.0005265265265265265, 'epoch': 68.56}


 74%|███████▍  | 738/1000 [34:59<11:21,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.005147531162947416, 'learning_rate': 0.0005245245245245245, 'epoch': 68.65}


 74%|███████▍  | 739/1000 [35:02<11:19,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.0050637489184737206, 'learning_rate': 0.0005225225225225225, 'epoch': 68.74}


 74%|███████▍  | 740/1000 [35:04<11:15,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.0046780104748904705, 'learning_rate': 0.0005205205205205206, 'epoch': 68.84}


 74%|███████▍  | 741/1000 [35:07<11:12,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.00649573840200901, 'learning_rate': 0.0005185185185185185, 'epoch': 68.93}


 74%|███████▍  | 742/1000 [35:09<11:10,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.004121126141399145, 'learning_rate': 0.0005165165165165166, 'epoch': 69.02}


 74%|███████▍  | 743/1000 [35:12<11:06,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.005139260087162256, 'learning_rate': 0.0005145145145145144, 'epoch': 69.12}


 74%|███████▍  | 744/1000 [35:15<11:03,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.004405038896948099, 'learning_rate': 0.0005125125125125125, 'epoch': 69.21}


 74%|███████▍  | 745/1000 [35:17<11:01,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.004359386395663023, 'learning_rate': 0.0005105105105105105, 'epoch': 69.3}


 75%|███████▍  | 746/1000 [35:20<10:58,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.006022099405527115, 'learning_rate': 0.0005085085085085085, 'epoch': 69.4}


 75%|███████▍  | 747/1000 [35:22<10:55,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.0061242845840752125, 'learning_rate': 0.0005065065065065065, 'epoch': 69.49}


 75%|███████▍  | 748/1000 [35:25<11:00,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.0056119151413440704, 'learning_rate': 0.0005045045045045045, 'epoch': 69.58}


 75%|███████▍  | 749/1000 [35:28<11:00,  2.63s/it]

{'loss': 0.001, 'grad_norm': 0.0056916349567472935, 'learning_rate': 0.0005025025025025025, 'epoch': 69.67}


 75%|███████▌  | 750/1000 [35:30<10:54,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.005073244217783213, 'learning_rate': 0.0005005005005005006, 'epoch': 69.77}


 75%|███████▌  | 751/1000 [35:33<10:53,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.005603079218417406, 'learning_rate': 0.0004984984984984984, 'epoch': 69.86}


 75%|███████▌  | 752/1000 [35:36<10:51,  2.63s/it]

{'loss': 0.001, 'grad_norm': 0.004498012363910675, 'learning_rate': 0.0004964964964964965, 'epoch': 69.95}


 75%|███████▌  | 753/1000 [35:38<10:47,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.003957652952522039, 'learning_rate': 0.0004944944944944946, 'epoch': 70.05}


 75%|███████▌  | 754/1000 [35:41<10:43,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.005899989977478981, 'learning_rate': 0.0004924924924924925, 'epoch': 70.14}


 76%|███████▌  | 755/1000 [35:43<10:39,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.00578395975753665, 'learning_rate': 0.0004904904904904905, 'epoch': 70.23}


 76%|███████▌  | 756/1000 [35:46<10:34,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.00512510072439909, 'learning_rate': 0.0004884884884884885, 'epoch': 70.33}


 76%|███████▌  | 757/1000 [35:49<10:34,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.00464695505797863, 'learning_rate': 0.0004864864864864865, 'epoch': 70.42}


 76%|███████▌  | 758/1000 [35:51<10:29,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.0049385204911231995, 'learning_rate': 0.0004844844844844845, 'epoch': 70.51}


 76%|███████▌  | 759/1000 [35:54<10:27,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.006543036084622145, 'learning_rate': 0.0004824824824824825, 'epoch': 70.6}


 76%|███████▌  | 760/1000 [35:56<10:25,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.004690464586019516, 'learning_rate': 0.00048048048048048047, 'epoch': 70.7}


 76%|███████▌  | 761/1000 [35:59<10:25,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.004293113946914673, 'learning_rate': 0.00047847847847847847, 'epoch': 70.79}


 76%|███████▌  | 762/1000 [36:02<10:23,  2.62s/it]

{'loss': 0.001, 'grad_norm': 0.005701511632651091, 'learning_rate': 0.0004764764764764765, 'epoch': 70.88}


 76%|███████▋  | 763/1000 [36:04<10:20,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.005801317747682333, 'learning_rate': 0.0004744744744744745, 'epoch': 70.98}


 76%|███████▋  | 764/1000 [36:07<10:15,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.0050879609771072865, 'learning_rate': 0.0004724724724724725, 'epoch': 71.07}


 76%|███████▋  | 765/1000 [36:09<10:11,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.003959728870540857, 'learning_rate': 0.0004704704704704705, 'epoch': 71.16}


 77%|███████▋  | 766/1000 [36:12<10:07,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.003927950747311115, 'learning_rate': 0.0004684684684684685, 'epoch': 71.26}


 77%|███████▋  | 767/1000 [36:15<10:06,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.004778233356773853, 'learning_rate': 0.00046646646646646644, 'epoch': 71.35}


 77%|███████▋  | 768/1000 [36:17<10:06,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.005506793037056923, 'learning_rate': 0.0004644644644644645, 'epoch': 71.44}


 77%|███████▋  | 769/1000 [36:20<10:01,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.005333637818694115, 'learning_rate': 0.0004624624624624625, 'epoch': 71.53}


 77%|███████▋  | 770/1000 [36:22<09:57,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.004676744807511568, 'learning_rate': 0.0004604604604604605, 'epoch': 71.63}


 77%|███████▋  | 771/1000 [36:25<09:57,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.004362679086625576, 'learning_rate': 0.00045845845845845845, 'epoch': 71.72}


 77%|███████▋  | 772/1000 [36:28<09:53,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.004646714776754379, 'learning_rate': 0.00045645645645645645, 'epoch': 71.81}


 77%|███████▋  | 773/1000 [36:30<09:52,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.005374584347009659, 'learning_rate': 0.0004544544544544545, 'epoch': 71.91}


 77%|███████▋  | 774/1000 [36:33<09:53,  2.62s/it]

{'loss': 0.001, 'grad_norm': 0.005148720927536488, 'learning_rate': 0.00045245245245245245, 'epoch': 72.0}


 78%|███████▊  | 775/1000 [36:35<09:47,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.004247575532644987, 'learning_rate': 0.00045045045045045046, 'epoch': 72.09}


 78%|███████▊  | 776/1000 [36:38<09:43,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.0048266686499118805, 'learning_rate': 0.00044844844844844846, 'epoch': 72.19}


 78%|███████▊  | 777/1000 [36:41<09:38,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.005021723918616772, 'learning_rate': 0.00044644644644644646, 'epoch': 72.28}


 78%|███████▊  | 778/1000 [36:43<09:37,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.004453569650650024, 'learning_rate': 0.0004444444444444444, 'epoch': 72.37}


 78%|███████▊  | 779/1000 [36:46<09:34,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.005982222966849804, 'learning_rate': 0.00044244244244244247, 'epoch': 72.47}


 78%|███████▊  | 780/1000 [36:48<09:33,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.005484255496412516, 'learning_rate': 0.00044044044044044047, 'epoch': 72.56}


 78%|███████▊  | 781/1000 [36:51<09:32,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.004797578323632479, 'learning_rate': 0.0004384384384384385, 'epoch': 72.65}


 78%|███████▊  | 782/1000 [36:54<09:29,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.006132503971457481, 'learning_rate': 0.0004364364364364364, 'epoch': 72.74}


 78%|███████▊  | 783/1000 [36:56<09:26,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.004848809912800789, 'learning_rate': 0.0004344344344344344, 'epoch': 72.84}


 78%|███████▊  | 784/1000 [36:59<09:24,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.004503164906054735, 'learning_rate': 0.0004324324324324325, 'epoch': 72.93}


 78%|███████▊  | 785/1000 [37:02<09:20,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.0051297759637236595, 'learning_rate': 0.00043043043043043043, 'epoch': 73.02}


 79%|███████▊  | 786/1000 [37:04<09:18,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.0043725077994167805, 'learning_rate': 0.00042842842842842843, 'epoch': 73.12}


 79%|███████▊  | 787/1000 [37:07<09:15,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.005198729690164328, 'learning_rate': 0.00042642642642642644, 'epoch': 73.21}


 79%|███████▉  | 788/1000 [37:09<09:14,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.006160935387015343, 'learning_rate': 0.00042442442442442444, 'epoch': 73.3}


 79%|███████▉  | 789/1000 [37:12<09:11,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.004737855866551399, 'learning_rate': 0.00042242242242242244, 'epoch': 73.4}


 79%|███████▉  | 790/1000 [37:15<09:07,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.0053480821661651134, 'learning_rate': 0.00042042042042042044, 'epoch': 73.49}


 79%|███████▉  | 791/1000 [37:17<09:12,  2.64s/it]

{'loss': 0.0009, 'grad_norm': 0.004748126026242971, 'learning_rate': 0.00041841841841841845, 'epoch': 73.58}


 79%|███████▉  | 792/1000 [37:20<09:07,  2.63s/it]

{'loss': 0.0009, 'grad_norm': 0.004372806288301945, 'learning_rate': 0.0004164164164164164, 'epoch': 73.67}


 79%|███████▉  | 793/1000 [37:23<09:04,  2.63s/it]

{'loss': 0.0009, 'grad_norm': 0.004583439324051142, 'learning_rate': 0.0004144144144144144, 'epoch': 73.77}


 79%|███████▉  | 794/1000 [37:25<08:58,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.004807659890502691, 'learning_rate': 0.0004124124124124124, 'epoch': 73.86}


 80%|███████▉  | 795/1000 [37:28<08:54,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.005351128056645393, 'learning_rate': 0.00041041041041041046, 'epoch': 73.95}


 80%|███████▉  | 796/1000 [37:30<08:51,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.007114302832633257, 'learning_rate': 0.0004084084084084084, 'epoch': 74.05}


 80%|███████▉  | 797/1000 [37:33<08:46,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.004424531944096088, 'learning_rate': 0.0004064064064064064, 'epoch': 74.14}


 80%|███████▉  | 798/1000 [37:36<08:45,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.005046687554568052, 'learning_rate': 0.0004044044044044044, 'epoch': 74.23}


 80%|███████▉  | 799/1000 [37:38<08:42,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.004553201142698526, 'learning_rate': 0.0004024024024024024, 'epoch': 74.33}


 80%|████████  | 800/1000 [37:41<08:41,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.005300381686538458, 'learning_rate': 0.0004004004004004004, 'epoch': 74.42}


 80%|████████  | 801/1000 [37:43<08:37,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.004393191542476416, 'learning_rate': 0.0003983983983983984, 'epoch': 74.51}


 80%|████████  | 802/1000 [37:46<08:36,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.0052809640765190125, 'learning_rate': 0.0003963963963963964, 'epoch': 74.6}


 80%|████████  | 803/1000 [37:49<08:34,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.0055501414462924, 'learning_rate': 0.00039439439439439437, 'epoch': 74.7}


 80%|████████  | 804/1000 [37:51<08:29,  2.60s/it]

{'loss': 0.001, 'grad_norm': 0.005796784069389105, 'learning_rate': 0.0003923923923923924, 'epoch': 74.79}


 80%|████████  | 805/1000 [37:54<08:26,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.005209118127822876, 'learning_rate': 0.0003903903903903904, 'epoch': 74.88}


 81%|████████  | 806/1000 [37:56<08:23,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.005717949476093054, 'learning_rate': 0.00038838838838838844, 'epoch': 74.98}


 81%|████████  | 807/1000 [37:59<08:21,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.005229995120316744, 'learning_rate': 0.0003863863863863864, 'epoch': 75.07}


 81%|████████  | 808/1000 [38:01<08:17,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.004944903776049614, 'learning_rate': 0.0003843843843843844, 'epoch': 75.16}


 81%|████████  | 809/1000 [38:04<08:14,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.006028860807418823, 'learning_rate': 0.0003823823823823824, 'epoch': 75.26}


 81%|████████  | 810/1000 [38:07<08:13,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.005048729944974184, 'learning_rate': 0.00038038038038038034, 'epoch': 75.35}


 81%|████████  | 811/1000 [38:09<08:10,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.004992008674889803, 'learning_rate': 0.0003783783783783784, 'epoch': 75.44}


 81%|████████  | 812/1000 [38:12<08:07,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.005278703756630421, 'learning_rate': 0.0003763763763763764, 'epoch': 75.53}


 81%|████████▏ | 813/1000 [38:14<08:04,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.004506049677729607, 'learning_rate': 0.0003743743743743744, 'epoch': 75.63}


 81%|████████▏ | 814/1000 [38:17<08:02,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.004034656099975109, 'learning_rate': 0.00037237237237237235, 'epoch': 75.72}


 82%|████████▏ | 815/1000 [38:20<07:59,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.005864436272531748, 'learning_rate': 0.00037037037037037035, 'epoch': 75.81}


 82%|████████▏ | 816/1000 [38:22<07:55,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.005160591099411249, 'learning_rate': 0.0003683683683683684, 'epoch': 75.91}


 82%|████████▏ | 817/1000 [38:25<07:57,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.004996043164283037, 'learning_rate': 0.0003663663663663664, 'epoch': 76.0}


 82%|████████▏ | 818/1000 [38:27<07:54,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.004918511491268873, 'learning_rate': 0.00036436436436436436, 'epoch': 76.09}


 82%|████████▏ | 819/1000 [38:30<07:49,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.005201864056289196, 'learning_rate': 0.00036236236236236236, 'epoch': 76.19}


 82%|████████▏ | 820/1000 [38:33<07:45,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.004492416977882385, 'learning_rate': 0.00036036036036036037, 'epoch': 76.28}


 82%|████████▏ | 821/1000 [38:35<07:45,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.004455370828509331, 'learning_rate': 0.0003583583583583583, 'epoch': 76.37}


 82%|████████▏ | 822/1000 [38:38<07:43,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.004748644307255745, 'learning_rate': 0.00035635635635635637, 'epoch': 76.47}


 82%|████████▏ | 823/1000 [38:40<07:42,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.005465961527079344, 'learning_rate': 0.0003543543543543544, 'epoch': 76.56}


 82%|████████▏ | 824/1000 [38:43<07:40,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.004653491545468569, 'learning_rate': 0.0003523523523523524, 'epoch': 76.65}


 82%|████████▎ | 825/1000 [38:46<07:37,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.0050563388504087925, 'learning_rate': 0.0003503503503503503, 'epoch': 76.74}


 83%|████████▎ | 826/1000 [38:48<07:35,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.0048650517128407955, 'learning_rate': 0.00034834834834834833, 'epoch': 76.84}


 83%|████████▎ | 827/1000 [38:51<07:31,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.006074638105928898, 'learning_rate': 0.0003463463463463464, 'epoch': 76.93}


 83%|████████▎ | 828/1000 [38:54<07:31,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.006220606621354818, 'learning_rate': 0.0003443443443443444, 'epoch': 77.02}


 83%|████████▎ | 829/1000 [38:56<07:27,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.003972330596297979, 'learning_rate': 0.00034234234234234234, 'epoch': 77.12}


 83%|████████▎ | 830/1000 [38:59<07:24,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.004148792941123247, 'learning_rate': 0.00034034034034034034, 'epoch': 77.21}


 83%|████████▎ | 831/1000 [39:01<07:22,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.004665898159146309, 'learning_rate': 0.00033833833833833834, 'epoch': 77.3}


 83%|████████▎ | 832/1000 [39:04<07:18,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.00474063353613019, 'learning_rate': 0.00033633633633633635, 'epoch': 77.4}


 83%|████████▎ | 833/1000 [39:07<07:17,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.004362752195447683, 'learning_rate': 0.00033433433433433435, 'epoch': 77.49}


 83%|████████▎ | 834/1000 [39:09<07:14,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.005252499599009752, 'learning_rate': 0.00033233233233233235, 'epoch': 77.58}


 84%|████████▎ | 835/1000 [39:12<07:10,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.005096133332699537, 'learning_rate': 0.00033033033033033035, 'epoch': 77.67}


 84%|████████▎ | 836/1000 [39:15<07:09,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.005595281720161438, 'learning_rate': 0.0003283283283283283, 'epoch': 77.77}


 84%|████████▎ | 837/1000 [39:17<07:05,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.005261037964373827, 'learning_rate': 0.0003263263263263263, 'epoch': 77.86}


 84%|████████▍ | 838/1000 [39:20<07:02,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.004542606417089701, 'learning_rate': 0.00032432432432432436, 'epoch': 77.95}


 84%|████████▍ | 839/1000 [39:22<06:59,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.006030418444424868, 'learning_rate': 0.0003223223223223223, 'epoch': 78.05}


 84%|████████▍ | 840/1000 [39:25<06:56,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.006518964190036058, 'learning_rate': 0.0003203203203203203, 'epoch': 78.14}


 84%|████████▍ | 841/1000 [39:28<06:53,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.004217600915580988, 'learning_rate': 0.0003183183183183183, 'epoch': 78.23}


 84%|████████▍ | 842/1000 [39:30<06:52,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.006001576315611601, 'learning_rate': 0.0003163163163163163, 'epoch': 78.33}


 84%|████████▍ | 843/1000 [39:33<06:47,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.0051309033297002316, 'learning_rate': 0.0003143143143143143, 'epoch': 78.42}


 84%|████████▍ | 844/1000 [39:35<06:44,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.005750332027673721, 'learning_rate': 0.0003123123123123123, 'epoch': 78.51}


 84%|████████▍ | 845/1000 [39:38<06:40,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.006327745039016008, 'learning_rate': 0.00031031031031031033, 'epoch': 78.6}


 85%|████████▍ | 846/1000 [39:40<06:39,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.004373130854219198, 'learning_rate': 0.00030830830830830833, 'epoch': 78.7}


 85%|████████▍ | 847/1000 [39:43<06:37,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.006739124655723572, 'learning_rate': 0.0003063063063063063, 'epoch': 78.79}


 85%|████████▍ | 848/1000 [39:46<06:34,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.004565860144793987, 'learning_rate': 0.00030430430430430434, 'epoch': 78.88}


 85%|████████▍ | 849/1000 [39:48<06:31,  2.59s/it]

{'loss': 0.001, 'grad_norm': 0.0058022006414830685, 'learning_rate': 0.00030230230230230234, 'epoch': 78.98}


 85%|████████▌ | 850/1000 [39:51<06:28,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.005292449612170458, 'learning_rate': 0.0003003003003003003, 'epoch': 79.07}


 85%|████████▌ | 851/1000 [39:53<06:26,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.004624104592949152, 'learning_rate': 0.0002982982982982983, 'epoch': 79.16}


 85%|████████▌ | 852/1000 [39:56<06:24,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.0044622598215937614, 'learning_rate': 0.0002962962962962963, 'epoch': 79.26}


 85%|████████▌ | 853/1000 [39:59<06:24,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.00514860637485981, 'learning_rate': 0.0002942942942942943, 'epoch': 79.35}


 85%|████████▌ | 854/1000 [40:01<06:21,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.005344431847333908, 'learning_rate': 0.0002922922922922923, 'epoch': 79.44}


 86%|████████▌ | 855/1000 [40:04<06:19,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.00455870758742094, 'learning_rate': 0.0002902902902902903, 'epoch': 79.53}


 86%|████████▌ | 856/1000 [40:07<06:14,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.005169988144189119, 'learning_rate': 0.0002882882882882883, 'epoch': 79.63}


 86%|████████▌ | 857/1000 [40:09<06:11,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.005183654837310314, 'learning_rate': 0.0002862862862862863, 'epoch': 79.72}


 86%|████████▌ | 858/1000 [40:12<06:07,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.005088067147880793, 'learning_rate': 0.00028428428428428425, 'epoch': 79.81}


 86%|████████▌ | 859/1000 [40:14<06:04,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.005213799886405468, 'learning_rate': 0.0002822822822822823, 'epoch': 79.91}


 86%|████████▌ | 860/1000 [40:17<06:02,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.004457606468349695, 'learning_rate': 0.0002802802802802803, 'epoch': 80.0}


 86%|████████▌ | 861/1000 [40:19<06:01,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.003913497552275658, 'learning_rate': 0.00027827827827827826, 'epoch': 80.09}


 86%|████████▌ | 862/1000 [40:22<05:59,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.005697009153664112, 'learning_rate': 0.00027627627627627627, 'epoch': 80.19}


 86%|████████▋ | 863/1000 [40:25<05:57,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.004843274597078562, 'learning_rate': 0.00027427427427427427, 'epoch': 80.28}


 86%|████████▋ | 864/1000 [40:27<05:55,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.005225572735071182, 'learning_rate': 0.0002722722722722723, 'epoch': 80.37}


 86%|████████▋ | 865/1000 [40:30<05:55,  2.63s/it]

{'loss': 0.0009, 'grad_norm': 0.004732534755021334, 'learning_rate': 0.0002702702702702703, 'epoch': 80.47}


 87%|████████▋ | 866/1000 [40:33<05:51,  2.63s/it]

{'loss': 0.0009, 'grad_norm': 0.004842433612793684, 'learning_rate': 0.0002682682682682683, 'epoch': 80.56}


 87%|████████▋ | 867/1000 [40:35<05:47,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.004786440636962652, 'learning_rate': 0.0002662662662662663, 'epoch': 80.65}


 87%|████████▋ | 868/1000 [40:38<05:43,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.005430594086647034, 'learning_rate': 0.00026426426426426423, 'epoch': 80.74}


 87%|████████▋ | 869/1000 [40:40<05:40,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.0050706504844129086, 'learning_rate': 0.00026226226226226223, 'epoch': 80.84}


 87%|████████▋ | 870/1000 [40:43<05:38,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.005583117250353098, 'learning_rate': 0.0002602602602602603, 'epoch': 80.93}


 87%|████████▋ | 871/1000 [40:46<05:38,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.005822559352964163, 'learning_rate': 0.0002582582582582583, 'epoch': 81.02}


 87%|████████▋ | 872/1000 [40:48<05:35,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.006199766416102648, 'learning_rate': 0.00025625625625625624, 'epoch': 81.12}


 87%|████████▋ | 873/1000 [40:51<05:31,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.005196480546146631, 'learning_rate': 0.00025425425425425424, 'epoch': 81.21}


 87%|████████▋ | 874/1000 [40:53<05:27,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.006123671308159828, 'learning_rate': 0.00025225225225225225, 'epoch': 81.3}


 88%|████████▊ | 875/1000 [40:56<05:24,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.004678761120885611, 'learning_rate': 0.0002502502502502503, 'epoch': 81.4}


 88%|████████▊ | 876/1000 [40:59<05:22,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.005526341963559389, 'learning_rate': 0.00024824824824824825, 'epoch': 81.49}


 88%|████████▊ | 877/1000 [41:01<05:23,  2.63s/it]

{'loss': 0.0009, 'grad_norm': 0.004942110739648342, 'learning_rate': 0.00024624624624624625, 'epoch': 81.58}


 88%|████████▊ | 878/1000 [41:04<05:20,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.0053625814616680145, 'learning_rate': 0.00024424424424424426, 'epoch': 81.67}


 88%|████████▊ | 879/1000 [41:06<05:16,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.0047049871645867825, 'learning_rate': 0.00024224224224224226, 'epoch': 81.77}


 88%|████████▊ | 880/1000 [41:09<05:14,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.005122072994709015, 'learning_rate': 0.00024024024024024023, 'epoch': 81.86}


 88%|████████▊ | 881/1000 [41:12<05:13,  2.64s/it]

{'loss': 0.0009, 'grad_norm': 0.005123209673911333, 'learning_rate': 0.00023823823823823824, 'epoch': 81.95}


 88%|████████▊ | 882/1000 [41:14<05:09,  2.63s/it]

{'loss': 0.0009, 'grad_norm': 0.00432162918150425, 'learning_rate': 0.00023623623623623624, 'epoch': 82.05}


 88%|████████▊ | 883/1000 [41:17<05:07,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.004765363875776529, 'learning_rate': 0.00023423423423423424, 'epoch': 82.14}


 88%|████████▊ | 884/1000 [41:20<05:05,  2.63s/it]

{'loss': 0.0009, 'grad_norm': 0.004959010053426027, 'learning_rate': 0.00023223223223223225, 'epoch': 82.23}


 88%|████████▊ | 885/1000 [41:22<05:02,  2.63s/it]

{'loss': 0.0009, 'grad_norm': 0.005064112599939108, 'learning_rate': 0.00023023023023023025, 'epoch': 82.33}


 89%|████████▊ | 886/1000 [41:25<04:57,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.006142341997474432, 'learning_rate': 0.00022822822822822822, 'epoch': 82.42}


 89%|████████▊ | 887/1000 [41:28<04:57,  2.63s/it]

{'loss': 0.0009, 'grad_norm': 0.0046863132156431675, 'learning_rate': 0.00022622622622622623, 'epoch': 82.51}


 89%|████████▉ | 888/1000 [41:30<04:55,  2.64s/it]

{'loss': 0.0009, 'grad_norm': 0.00540094543248415, 'learning_rate': 0.00022422422422422423, 'epoch': 82.6}


 89%|████████▉ | 889/1000 [41:33<04:50,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.004909657873213291, 'learning_rate': 0.0002222222222222222, 'epoch': 82.7}


 89%|████████▉ | 890/1000 [41:35<04:46,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.004822991788387299, 'learning_rate': 0.00022022022022022024, 'epoch': 82.79}


 89%|████████▉ | 891/1000 [41:38<04:43,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.004753682296723127, 'learning_rate': 0.0002182182182182182, 'epoch': 82.88}


 89%|████████▉ | 892/1000 [41:41<04:40,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.005664428696036339, 'learning_rate': 0.00021621621621621624, 'epoch': 82.98}


 89%|████████▉ | 893/1000 [41:43<04:37,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.005224739201366901, 'learning_rate': 0.00021421421421421422, 'epoch': 83.07}


 89%|████████▉ | 894/1000 [41:46<04:34,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.005606709513813257, 'learning_rate': 0.00021221221221221222, 'epoch': 83.16}


 90%|████████▉ | 895/1000 [41:48<04:32,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.0056505161337554455, 'learning_rate': 0.00021021021021021022, 'epoch': 83.26}


 90%|████████▉ | 896/1000 [41:51<04:30,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.004067895468324423, 'learning_rate': 0.0002082082082082082, 'epoch': 83.35}


 90%|████████▉ | 897/1000 [41:54<04:27,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.005673372186720371, 'learning_rate': 0.0002062062062062062, 'epoch': 83.44}


 90%|████████▉ | 898/1000 [41:56<04:25,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.004777818918228149, 'learning_rate': 0.0002042042042042042, 'epoch': 83.53}


 90%|████████▉ | 899/1000 [41:59<04:22,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.0042342375963926315, 'learning_rate': 0.0002022022022022022, 'epoch': 83.63}


 90%|█████████ | 900/1000 [42:01<04:20,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.005134632810950279, 'learning_rate': 0.0002002002002002002, 'epoch': 83.72}


 90%|█████████ | 901/1000 [42:04<04:17,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.0050768135115504265, 'learning_rate': 0.0001981981981981982, 'epoch': 83.81}


 90%|█████████ | 902/1000 [42:07<04:14,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.004603811539709568, 'learning_rate': 0.0001961961961961962, 'epoch': 83.91}


 90%|█████████ | 903/1000 [42:09<04:11,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.006010351236909628, 'learning_rate': 0.00019419419419419422, 'epoch': 84.0}


 90%|█████████ | 904/1000 [42:12<04:08,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.004712512716650963, 'learning_rate': 0.0001921921921921922, 'epoch': 84.09}


 90%|█████████ | 905/1000 [42:14<04:06,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.0047902921214699745, 'learning_rate': 0.00019019019019019017, 'epoch': 84.19}


 91%|█████████ | 906/1000 [42:17<04:04,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.006162314210087061, 'learning_rate': 0.0001881881881881882, 'epoch': 84.28}


 91%|█████████ | 907/1000 [42:19<04:01,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.0053647770546376705, 'learning_rate': 0.00018618618618618617, 'epoch': 84.37}


 91%|█████████ | 908/1000 [42:22<03:57,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.004498952999711037, 'learning_rate': 0.0001841841841841842, 'epoch': 84.47}


 91%|█████████ | 909/1000 [42:25<03:55,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.004967319779098034, 'learning_rate': 0.00018218218218218218, 'epoch': 84.56}


 91%|█████████ | 910/1000 [42:27<03:53,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.005319822113960981, 'learning_rate': 0.00018018018018018018, 'epoch': 84.65}


 91%|█████████ | 911/1000 [42:30<03:51,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.005924291908740997, 'learning_rate': 0.00017817817817817819, 'epoch': 84.74}


 91%|█████████ | 912/1000 [42:32<03:48,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.006012198980897665, 'learning_rate': 0.0001761761761761762, 'epoch': 84.84}


 91%|█████████▏| 913/1000 [42:35<03:46,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.005654493346810341, 'learning_rate': 0.00017417417417417416, 'epoch': 84.93}


 91%|█████████▏| 914/1000 [42:38<03:48,  2.65s/it]

{'loss': 0.0009, 'grad_norm': 0.005291327368468046, 'learning_rate': 0.0001721721721721722, 'epoch': 85.02}


 92%|█████████▏| 915/1000 [42:40<03:44,  2.64s/it]

{'loss': 0.0009, 'grad_norm': 0.005491611547768116, 'learning_rate': 0.00017017017017017017, 'epoch': 85.12}


 92%|█████████▏| 916/1000 [42:43<03:43,  2.66s/it]

{'loss': 0.0009, 'grad_norm': 0.005493619944900274, 'learning_rate': 0.00016816816816816817, 'epoch': 85.21}


 92%|█████████▏| 917/1000 [42:46<03:40,  2.65s/it]

{'loss': 0.0009, 'grad_norm': 0.005633119493722916, 'learning_rate': 0.00016616616616616618, 'epoch': 85.3}


 92%|█████████▏| 918/1000 [42:48<03:36,  2.64s/it]

{'loss': 0.0009, 'grad_norm': 0.005285446532070637, 'learning_rate': 0.00016416416416416415, 'epoch': 85.4}


 92%|█████████▏| 919/1000 [42:51<03:33,  2.64s/it]

{'loss': 0.0009, 'grad_norm': 0.005487255286425352, 'learning_rate': 0.00016216216216216218, 'epoch': 85.49}


 92%|█████████▏| 920/1000 [42:54<03:30,  2.64s/it]

{'loss': 0.0009, 'grad_norm': 0.005247251596301794, 'learning_rate': 0.00016016016016016016, 'epoch': 85.58}


 92%|█████████▏| 921/1000 [42:56<03:28,  2.64s/it]

{'loss': 0.0009, 'grad_norm': 0.005201181396842003, 'learning_rate': 0.00015815815815815816, 'epoch': 85.67}


 92%|█████████▏| 922/1000 [42:59<03:25,  2.63s/it]

{'loss': 0.0009, 'grad_norm': 0.006415599025785923, 'learning_rate': 0.00015615615615615616, 'epoch': 85.77}


 92%|█████████▏| 923/1000 [43:02<03:21,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.004794733133167028, 'learning_rate': 0.00015415415415415416, 'epoch': 85.86}


 92%|█████████▏| 924/1000 [43:04<03:18,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.005826341453939676, 'learning_rate': 0.00015215215215215217, 'epoch': 85.95}


 92%|█████████▎| 925/1000 [43:07<03:15,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.006257457658648491, 'learning_rate': 0.00015015015015015014, 'epoch': 86.05}


 93%|█████████▎| 926/1000 [43:09<03:12,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.005907529965043068, 'learning_rate': 0.00014814814814814815, 'epoch': 86.14}


 93%|█████████▎| 927/1000 [43:12<03:09,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.004965157713741064, 'learning_rate': 0.00014614614614614615, 'epoch': 86.23}


 93%|█████████▎| 928/1000 [43:14<03:06,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.004294954240322113, 'learning_rate': 0.00014414414414414415, 'epoch': 86.33}


 93%|█████████▎| 929/1000 [43:17<03:04,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.005463412031531334, 'learning_rate': 0.00014214214214214213, 'epoch': 86.42}


 93%|█████████▎| 930/1000 [43:20<03:01,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.0046057021245360374, 'learning_rate': 0.00014014014014014016, 'epoch': 86.51}


 93%|█████████▎| 931/1000 [43:22<02:59,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.006264116149395704, 'learning_rate': 0.00013813813813813813, 'epoch': 86.6}


 93%|█████████▎| 932/1000 [43:25<02:56,  2.59s/it]

{'loss': 0.0009, 'grad_norm': 0.004772479180246592, 'learning_rate': 0.00013613613613613616, 'epoch': 86.7}


 93%|█████████▎| 933/1000 [43:27<02:53,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.005987710319459438, 'learning_rate': 0.00013413413413413414, 'epoch': 86.79}


 93%|█████████▎| 934/1000 [43:30<02:52,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.006432294379919767, 'learning_rate': 0.00013213213213213211, 'epoch': 86.88}


 94%|█████████▎| 935/1000 [43:33<02:50,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.004743521101772785, 'learning_rate': 0.00013013013013013014, 'epoch': 86.98}


 94%|█████████▎| 936/1000 [43:35<02:48,  2.63s/it]

{'loss': 0.0009, 'grad_norm': 0.005615281872451305, 'learning_rate': 0.00012812812812812812, 'epoch': 87.07}


 94%|█████████▎| 937/1000 [43:38<02:44,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.005134435836225748, 'learning_rate': 0.00012612612612612612, 'epoch': 87.16}


 94%|█████████▍| 938/1000 [43:41<02:43,  2.64s/it]

{'loss': 0.0009, 'grad_norm': 0.0045804292894899845, 'learning_rate': 0.00012412412412412413, 'epoch': 87.26}


 94%|█████████▍| 939/1000 [43:43<02:40,  2.63s/it]

{'loss': 0.0009, 'grad_norm': 0.005011300556361675, 'learning_rate': 0.00012212212212212213, 'epoch': 87.35}


 94%|█████████▍| 940/1000 [43:46<02:39,  2.65s/it]

{'loss': 0.0009, 'grad_norm': 0.005737390369176865, 'learning_rate': 0.00012012012012012012, 'epoch': 87.44}


 94%|█████████▍| 941/1000 [43:49<02:35,  2.64s/it]

{'loss': 0.0009, 'grad_norm': 0.004861002787947655, 'learning_rate': 0.00011811811811811812, 'epoch': 87.53}


 94%|█████████▍| 942/1000 [43:51<02:32,  2.63s/it]

{'loss': 0.0009, 'grad_norm': 0.005671943537890911, 'learning_rate': 0.00011611611611611612, 'epoch': 87.63}


 94%|█████████▍| 943/1000 [43:54<02:29,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.00483910134062171, 'learning_rate': 0.00011411411411411411, 'epoch': 87.72}


 94%|█████████▍| 944/1000 [43:56<02:26,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.00649560522288084, 'learning_rate': 0.00011211211211211212, 'epoch': 87.81}


 94%|█████████▍| 945/1000 [43:59<02:24,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.004792740568518639, 'learning_rate': 0.00011011011011011012, 'epoch': 87.91}


 95%|█████████▍| 946/1000 [44:02<02:21,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.004884125664830208, 'learning_rate': 0.00010810810810810812, 'epoch': 88.0}


 95%|█████████▍| 947/1000 [44:04<02:18,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.006317455787211657, 'learning_rate': 0.00010610610610610611, 'epoch': 88.09}


 95%|█████████▍| 948/1000 [44:07<02:16,  2.63s/it]

{'loss': 0.0009, 'grad_norm': 0.006164257880300283, 'learning_rate': 0.0001041041041041041, 'epoch': 88.19}


 95%|█████████▍| 949/1000 [44:10<02:13,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.004466003272682428, 'learning_rate': 0.0001021021021021021, 'epoch': 88.28}


 95%|█████████▌| 950/1000 [44:12<02:11,  2.64s/it]

{'loss': 0.0009, 'grad_norm': 0.004721886478364468, 'learning_rate': 0.0001001001001001001, 'epoch': 88.37}


 95%|█████████▌| 951/1000 [44:15<02:08,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.005947052966803312, 'learning_rate': 9.80980980980981e-05, 'epoch': 88.47}


 95%|█████████▌| 952/1000 [44:17<02:06,  2.63s/it]

{'loss': 0.0009, 'grad_norm': 0.00461878115311265, 'learning_rate': 9.60960960960961e-05, 'epoch': 88.56}


 95%|█████████▌| 953/1000 [44:20<02:02,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.005542756989598274, 'learning_rate': 9.40940940940941e-05, 'epoch': 88.65}


 95%|█████████▌| 954/1000 [44:23<01:59,  2.60s/it]

{'loss': 0.0009, 'grad_norm': 0.005270383786410093, 'learning_rate': 9.20920920920921e-05, 'epoch': 88.74}


 96%|█████████▌| 955/1000 [44:25<01:57,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.004892627242952585, 'learning_rate': 9.009009009009009e-05, 'epoch': 88.84}


 96%|█████████▌| 956/1000 [44:28<01:54,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.004252477549016476, 'learning_rate': 8.80880880880881e-05, 'epoch': 88.93}


 96%|█████████▌| 957/1000 [44:30<01:52,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.005125406198203564, 'learning_rate': 8.60860860860861e-05, 'epoch': 89.02}


 96%|█████████▌| 958/1000 [44:33<01:49,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.006224116776138544, 'learning_rate': 8.408408408408409e-05, 'epoch': 89.12}


 96%|█████████▌| 959/1000 [44:36<01:47,  2.63s/it]

{'loss': 0.0009, 'grad_norm': 0.004906150978058577, 'learning_rate': 8.208208208208208e-05, 'epoch': 89.21}


 96%|█████████▌| 960/1000 [44:38<01:45,  2.63s/it]

{'loss': 0.0009, 'grad_norm': 0.004579299129545689, 'learning_rate': 8.008008008008008e-05, 'epoch': 89.3}


 96%|█████████▌| 961/1000 [44:41<01:42,  2.63s/it]

{'loss': 0.0009, 'grad_norm': 0.005291109438985586, 'learning_rate': 7.807807807807808e-05, 'epoch': 89.4}


 96%|█████████▌| 962/1000 [44:44<01:39,  2.63s/it]

{'loss': 0.0009, 'grad_norm': 0.005488655064254999, 'learning_rate': 7.607607607607608e-05, 'epoch': 89.49}


 96%|█████████▋| 963/1000 [44:46<01:37,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.004119394347071648, 'learning_rate': 7.407407407407407e-05, 'epoch': 89.58}


 96%|█████████▋| 964/1000 [44:49<01:34,  2.63s/it]

{'loss': 0.0009, 'grad_norm': 0.005245835054665804, 'learning_rate': 7.207207207207208e-05, 'epoch': 89.67}


 96%|█████████▋| 965/1000 [44:51<01:31,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.005400402937084436, 'learning_rate': 7.007007007007008e-05, 'epoch': 89.77}


 97%|█████████▋| 966/1000 [44:54<01:29,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.005284850485622883, 'learning_rate': 6.806806806806808e-05, 'epoch': 89.86}


 97%|█████████▋| 967/1000 [44:57<01:26,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.005430346820503473, 'learning_rate': 6.606606606606606e-05, 'epoch': 89.95}


 97%|█████████▋| 968/1000 [44:59<01:24,  2.64s/it]

{'loss': 0.0009, 'grad_norm': 0.006896638777107, 'learning_rate': 6.406406406406406e-05, 'epoch': 90.05}


 97%|█████████▋| 969/1000 [45:02<01:21,  2.63s/it]

{'loss': 0.0009, 'grad_norm': 0.005804664455354214, 'learning_rate': 6.206206206206206e-05, 'epoch': 90.14}


 97%|█████████▋| 970/1000 [45:05<01:18,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.00458492711186409, 'learning_rate': 6.006006006006006e-05, 'epoch': 90.23}


 97%|█████████▋| 971/1000 [45:07<01:16,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.00411634286865592, 'learning_rate': 5.805805805805806e-05, 'epoch': 90.33}


 97%|█████████▋| 972/1000 [45:10<01:13,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.004750021733343601, 'learning_rate': 5.605605605605606e-05, 'epoch': 90.42}


 97%|█████████▋| 973/1000 [45:12<01:10,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.005801167339086533, 'learning_rate': 5.405405405405406e-05, 'epoch': 90.51}


 97%|█████████▋| 974/1000 [45:15<01:08,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.004878166131675243, 'learning_rate': 5.205205205205205e-05, 'epoch': 90.6}


 98%|█████████▊| 975/1000 [45:18<01:05,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.005457410588860512, 'learning_rate': 5.005005005005005e-05, 'epoch': 90.7}


 98%|█████████▊| 976/1000 [45:20<01:02,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.00535906758159399, 'learning_rate': 4.804804804804805e-05, 'epoch': 90.79}


 98%|█████████▊| 977/1000 [45:23<00:59,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.005218366626650095, 'learning_rate': 4.604604604604605e-05, 'epoch': 90.88}


 98%|█████████▊| 978/1000 [45:26<00:57,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.005457831546664238, 'learning_rate': 4.404404404404405e-05, 'epoch': 90.98}


 98%|█████████▊| 979/1000 [45:28<00:55,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.0045090666972100735, 'learning_rate': 4.204204204204204e-05, 'epoch': 91.07}


 98%|█████████▊| 980/1000 [45:31<00:52,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.004378278274089098, 'learning_rate': 4.004004004004004e-05, 'epoch': 91.16}


 98%|█████████▊| 981/1000 [45:33<00:49,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.006145849823951721, 'learning_rate': 3.803803803803804e-05, 'epoch': 91.26}


 98%|█████████▊| 982/1000 [45:36<00:47,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.005175464786589146, 'learning_rate': 3.603603603603604e-05, 'epoch': 91.35}


 98%|█████████▊| 983/1000 [45:39<00:44,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.0053061991930007935, 'learning_rate': 3.403403403403404e-05, 'epoch': 91.44}


 98%|█████████▊| 984/1000 [45:41<00:41,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.004651503637433052, 'learning_rate': 3.203203203203203e-05, 'epoch': 91.53}


 98%|█████████▊| 985/1000 [45:44<00:39,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.005121590569615364, 'learning_rate': 3.003003003003003e-05, 'epoch': 91.63}


 99%|█████████▊| 986/1000 [45:46<00:36,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.005293362773954868, 'learning_rate': 2.802802802802803e-05, 'epoch': 91.72}


 99%|█████████▊| 987/1000 [45:49<00:33,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.005477495025843382, 'learning_rate': 2.6026026026026025e-05, 'epoch': 91.81}


 99%|█████████▉| 988/1000 [45:52<00:31,  2.61s/it]

{'loss': 0.0009, 'grad_norm': 0.004816056694835424, 'learning_rate': 2.4024024024024024e-05, 'epoch': 91.91}


 99%|█████████▉| 989/1000 [45:54<00:28,  2.63s/it]

{'loss': 0.0009, 'grad_norm': 0.004626130219548941, 'learning_rate': 2.2022022022022024e-05, 'epoch': 92.0}


 99%|█████████▉| 990/1000 [45:57<00:26,  2.64s/it]

{'loss': 0.0009, 'grad_norm': 0.0056447554379701614, 'learning_rate': 2.002002002002002e-05, 'epoch': 92.09}


 99%|█████████▉| 991/1000 [46:00<00:23,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.004338358994573355, 'learning_rate': 1.801801801801802e-05, 'epoch': 92.19}


 99%|█████████▉| 992/1000 [46:02<00:20,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.004959416575729847, 'learning_rate': 1.6016016016016015e-05, 'epoch': 92.28}


 99%|█████████▉| 993/1000 [46:05<00:18,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.004483124241232872, 'learning_rate': 1.4014014014014014e-05, 'epoch': 92.37}


 99%|█████████▉| 994/1000 [46:07<00:15,  2.64s/it]

{'loss': 0.0009, 'grad_norm': 0.005554167553782463, 'learning_rate': 1.2012012012012012e-05, 'epoch': 92.47}


100%|█████████▉| 995/1000 [46:10<00:13,  2.63s/it]

{'loss': 0.0009, 'grad_norm': 0.004885959438979626, 'learning_rate': 1.001001001001001e-05, 'epoch': 92.56}


100%|█████████▉| 996/1000 [46:13<00:10,  2.63s/it]

{'loss': 0.0009, 'grad_norm': 0.0045693181455135345, 'learning_rate': 8.008008008008007e-06, 'epoch': 92.65}


100%|█████████▉| 997/1000 [46:15<00:07,  2.63s/it]

{'loss': 0.0009, 'grad_norm': 0.005001838784664869, 'learning_rate': 6.006006006006006e-06, 'epoch': 92.74}


100%|█████████▉| 998/1000 [46:18<00:05,  2.63s/it]

{'loss': 0.0009, 'grad_norm': 0.0038486875128000975, 'learning_rate': 4.004004004004004e-06, 'epoch': 92.84}


100%|█████████▉| 999/1000 [46:21<00:02,  2.62s/it]

{'loss': 0.0009, 'grad_norm': 0.005863032303750515, 'learning_rate': 2.002002002002002e-06, 'epoch': 92.93}


100%|██████████| 1000/1000 [46:23<00:00,  2.63s/it]Checkpoint destination directory outputs/checkpoint-1000 already exists and is non-empty. Saving will proceed but saved results may be invalid.


{'loss': 0.0009, 'grad_norm': 0.005194799043238163, 'learning_rate': 0.0, 'epoch': 93.02}


wandb: Adding directory to artifact (./outputs/checkpoint-1000)... Done. 0.1s
100%|██████████| 1000/1000 [46:24<00:00,  2.63s/it]

{'train_runtime': 2788.1459, 'train_samples_per_second': 1.542, 'train_steps_per_second': 0.359, 'train_loss': 0.028386561334773432, 'epoch': 93.02}


100%|██████████| 1000/1000 [46:25<00:00,  2.79s/it]


In [10]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

2788.1459 seconds used for training.
46.47 minutes used for training.
Peak reserved memory = 21.664 GB.
Peak reserved memory for training = 19.445 GB.
Peak reserved memory % of max memory = 91.614 %.
Peak reserved memory for training % of max memory = 82.23 %.
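The percentages above follow from three numbers: the peak reserved memory, the memory already reserved before training started, and the GPU's total memory. Below is a minimal sketch of that arithmetic. `memory_report` is a hypothetical helper name, and the baseline (2.219 GB) and total (23.647 GB) are assumptions back-calculated from the printed statistics, not values stated in the notebook.

```python
def memory_report(peak_gb, start_gb, max_gb):
    """Reproduce the notebook's memory statistics from three raw values (in GB)."""
    training_gb = round(peak_gb - start_gb, 3)  # memory attributable to training
    return {
        "peak_gb": peak_gb,
        "training_gb": training_gb,
        "peak_pct": round(peak_gb / max_gb * 100, 3),
        "training_pct": round(training_gb / max_gb * 100, 3),
    }

# Assumed inputs: start_gb and max_gb are inferred from the output above.
report = memory_report(peak_gb=21.664, start_gb=2.219, max_gb=23.647)
for key, value in report.items():
    print(f"{key}: {value}")
```

Subtracting the pre-training baseline separates the cost of the loaded model from the cost of the training run itself.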


<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!

In [11]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
# inputs = tokenizer(
# [
#     alpaca_prompt.format(
#         "Continue the Fibonacci sequence.", # instruction
#         "1, 1, 2, 3, 5, 8", # input
#         "", # output - leave this blank for generation!
#     )
# ], return_tensors = "pt").to("cuda")

inputs = tokenizer(
[
    alpaca_prompt.format(
    "Given the following few token representations of a visual frame in the following format 'frame t-0: [{{object_id, {{centroid_x, centroid_y}},num_radial_points,radial_points}}]', predict the next two frames in the following format 'frame t+0: [{{object_id, {{centroid_x, centroid_y}},num_radial_points,radial_points}}]' and 'frame t+1: [{{object_id, {{centroid_x, centroid_y}},num_radial_points,radial_points",
    "frame t-3: [{gripper,{231,48},91,92,80,81,85,91,94,96,98,15,14,13,13,13,13,70,82,81,78,74,71,70,94,95,90,87,86,86,88,97,84,72,64,59,55,52,50,50,50,50,52,55,59,64,72,79,91,95,87,86}, {table,{311,309},329,332,340,354,351,288,247,220,201,187,178,173,170,170,173,178,187,201,220,247,288,349,336,323,315,312,315,323,336,317,298,284,277,267,224,229,181,195,193,190,190,192,198,208,224,245,276,320,340,332}, {yellow block,{210,313},43,40,36,33,32,31,30,30,30,31,33,35,39,44,50,51,50,49,50,50,51,50,49,48,47,43,37,33,30,29,28,27,26,27,27,29,30,33,37,41,48,49,49,50,52,54,53,50,49,48}, {green block,{166,217},42,43,38,33,31,29,27,26,27,26,27,27,30,31,34,38,42,42,40,40,41,42,41,41,41,43,43,37,33,30,26,27,25,24,24,25,25,26,28,30,34,38,40,41,42,43,44,42,42,43}, {blue block,{387,328},45,44,40,36,34,33,31,32,31,31,33,35,38,42,47,52,50,49,49,49,50,51,48,46,45,45,42,37,35,32,30,30,30,29,29,30,32,34,38,42,48,50,49,49,50,52,54,52,49,47}, {red block,{345,196},39,40,41,42,42,40,40,40,41,37,32,29,28,27,25,25,25,25,25,27,28,31,35,39,39,39,40,41,41,40,38,38,38,37,34,29,27,25,24,23,23,23,23,24,25,28,30,33,39,40}]\nframe t-2: [{gripper,{231,48},92,93,81,82,85,91,95,96,99,16,14,14,14,14,14,71,82,81,77,74,71,69,94,94,89,86,85,85,88,96,84,72,64,59,55,52,50,50,50,50,52,55,59,64,72,77,90,95,88,86}, {table,{311,309},329,332,340,354,351,288,247,220,201,187,179,173,170,170,173,178,187,201,220,247,288,349,336,323,315,312,315,323,336,316,297,284,276,265,222,228,180,193,192,190,190,192,198,208,223,244,274,318,340,332}, {yellow block,{211,314},42,38,34,33,31,29,29,29,29,31,32,34,37,42,48,50,49,49,49,50,51,50,49,48,47,44,38,34,30,28,28,27,27,27,28,29,30,33,37,43,48,49,49,50,52,54,53,50,49,47}, {green block,{166,218},42,42,37,33,30,29,28,26,27,26,27,28,30,32,35,38,41,41,39,39,40,40,40,40,41,42,44,37,32,28,26,25,25,24,24,25,25,26,27,30,32,39,41,41,42,43,44,44,43,43}, {blue 
block,{388,328},45,43,39,36,34,33,31,30,30,31,33,35,38,41,47,51,50,49,49,50,50,50,49,47,45,45,43,37,34,32,31,30,30,29,29,30,32,34,38,42,47,49,49,49,50,52,52,51,48,46}, {red block,{345,197},40,41,41,42,42,41,40,40,38,35,32,29,27,26,24,24,24,24,26,27,28,30,34,37,38,39,40,40,41,40,38,38,38,37,35,30,28,26,24,23,23,24,24,25,25,28,30,35,40,41}]\nframe t-1: [{gripper,{231,54},89,93,90,83,86,91,98,98,102,18,17,17,17,17,17,76,84,83,78,75,71,69,68,95,91,88,86,87,89,92,94,81,72,66,61,58,56,56,56,56,58,61,66,70,69,74,83,93,89,87}, {table,{311,309},329,332,340,354,351,290,249,221,201,188,179,174,171,171,174,179,188,202,221,249,290,349,336,323,315,312,315,323,336,316,297,284,277,211,219,230,171,187,193,191,190,193,198,209,224,245,276,320,340,332}, {yellow block,{211,314},42,38,35,33,31,29,29,29,29,30,32,34,37,42,48,51,50,49,49,50,51,50,49,47,47,44,37,33,30,28,28,27,27,27,27,29,30,33,38,43,48,49,49,50,52,54,53,50,48,47}, {green block,{166,218},43,43,37,33,30,29,28,27,27,27,27,28,30,33,35,38,41,41,40,40,41,42,40,41,41,43,44,37,32,27,26,25,25,24,24,25,25,26,27,29,31,38,42,42,42,43,44,44,43,43}, {blue block,{388,328},45,44,39,36,34,31,31,30,30,31,33,35,38,41,48,52,50,49,49,50,50,50,49,47,46,45,43,37,34,32,31,30,29,29,29,30,31,34,37,41,47,49,49,49,50,52,53,51,48,46}, {red block,{345,197},40,41,42,42,43,41,41,41,40,36,32,29,27,26,25,24,24,24,26,27,28,30,34,37,38,39,40,41,41,40,38,38,38,37,35,30,28,26,24,23,23,24,24,25,25,28,30,35,41,41}]\nframe t-0: [{gripper,{233,64},87,89,95,94,90,94,102,106,109,107,30,29,29,29,80,92,90,82,76,72,70,67,66,69,95,92,89,90,92,94,97,95,85,77,72,69,67,66,66,67,69,72,77,77,68,62,63,71,86,91}, {table,{312,311},328,331,339,353,349,286,246,219,199,186,177,172,169,169,172,177,186,199,219,246,286,347,337,324,316,313,316,324,337,319,300,286,278,191,210,230,152,167,196,193,192,195,201,211,227,247,279,324,339,331}, {yellow 
block,{211,314},42,38,35,33,31,29,29,29,29,30,32,34,37,42,48,51,50,49,50,50,51,50,48,47,47,44,38,33,30,28,28,27,27,27,27,29,30,33,38,44,48,49,49,50,52,54,53,50,48,47}, {green block,{166,218},43,42,37,33,30,29,28,27,27,27,27,28,31,33,35,38,41,41,40,40,41,42,40,41,41,43,44,37,33,27,26,25,25,24,24,25,25,26,27,29,31,38,42,42,42,43,44,44,43,43}, {blue block,{388,328},45,44,39,36,34,31,31,30,30,31,33,35,38,42,47,51,51,50,49,50,50,51,49,47,46,45,43,37,34,32,31,30,29,28,29,30,31,33,37,40,47,49,49,49,50,52,53,51,48,46}, {red block,{345,197},40,41,42,42,43,41,41,40,40,37,32,29,28,26,25,24,24,24,25,27,28,30,34,37,38,39,40,41,41,40,38,38,38,37,35,31,28,26,25,24,23,24,24,25,25,28,30,36,41,41}]",
    "",
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 2048, use_cache = True)
tokenizer.batch_decode(outputs)

["<bos>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nGiven the following few token representations of a visual frame in the following format 'frame t-0: [{{object_id, {{centroid_x, centroid_y}},num_radial_points,radial_points}}]', predict the next two frames in the following format 'frame t+0: [{{object_id, {{centroid_x, centroid_y}},num_radial_points,radial_points}}]' and 'frame t+1: [{{object_id, {{centroid_x, centroid_y}},num_radial_points,radial_points\n\n### Input:\nframe t-3: [{gripper,{231,48},91,92,80,81,85,91,94,96,98,15,14,13,13,13,13,70,82,81,78,74,71,70,94,95,90,87,86,86,88,97,84,72,64,59,55,52,50,50,50,50,52,55,59,64,72,79,91,95,87,86}, {table,{311,309},329,332,340,354,351,288,247,220,201,187,178,173,170,170,173,178,187,201,220,247,288,349,336,323,315,312,315,323,336,317,298,284,277,267,224,229,181,195,193,190,190,192,198,208,224,245,276,320,
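The decoded text is one long string, so it can help to turn a frame back into structured data before inspecting it. The following is a rough sketch, assuming each object is written as `{name,{x,y},n1,n2,...}` as in the input above. `parse_frame` is a hypothetical helper, and it keeps every number after the centroid in a single list rather than guessing which of them is `num_radial_points`.

```python
import re

# One object: {name,{cx,cy},v1,v2,...}. Truncated objects (no closing brace) are skipped.
OBJECT_RE = re.compile(r"\{([^{},]+),\{(\d+),(\d+)\},([\d,]+)\}")

def parse_frame(line):
    """Split 'frame t-0: [...]' into its label and a list of object dicts."""
    label, _, body = line.partition(": ")
    objects = []
    for name, cx, cy, values in OBJECT_RE.findall(body):
        objects.append({
            "object_id": name.strip(),
            "centroid": (int(cx), int(cy)),
            "values": [int(v) for v in values.split(",") if v],
        })
    return label.strip(), objects

# Abbreviated sample: the real frames carry many more values per object.
sample = "frame t-0: [{gripper,{233,64},87,89,95}, {yellow block,{211,314},42,38}]"
label, objects = parse_frame(sample)
print(label)
for obj in objects:
    print(obj["object_id"], obj["centroid"], len(obj["values"]))
```

Because incomplete objects are skipped rather than raising an error, the same helper can be run on output that was cut off by `max_new_tokens`.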

You can also use a `TextStreamer` for continuous inference, so you can watch the generation token by token instead of waiting for the whole output to finish!

In [21]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
# inputs = tokenizer(
# [
#     alpaca_prompt.format(
#         "Continue the Fibonacci sequence for the next 6 numbers.", # instruction
#         "1, 1, 2, 3, 5, 8, 13, 21", # input
#         "", # output - leave this blank for generation!
#     )
# ], return_tensors = "pt").to("cuda")

inputs = tokenizer(
[
    alpaca_prompt.format(
    "Given the following few token representations of a visual frame in the following format 'frame t-0: [{{object_id, {{centroid_x, centroid_y}},num_radial_points,radial_points}}]'..., predict the next frame in the same format'",
    "frame t-2: [{gripper,{172,132},78,70,73,81,86,93,105,158,187,186,202,141,156,156,116,26,22,18,16,16,14,14,14,14,15,16,16,179,120,109,167,195,173,158,147,140,136,134,134,123,113,103,95,84,77,74,72,73,77,77}, {table,{320,318},320,323,331,345,335,274,236,209,191,178,170,164,162,162,164,170,178,191,209,236,274,335,346,332,324,321,324,332,346,331,312,298,291,133,130,237,219,208,201,197,197,199,205,217,231,254,285,332,331,323}, {yellow block,{191,315},26,28,28,32,34,35,36,37,38,39,43,41,40,39,37,36,36,36,36,35,34,32,32,31,32,33,32,29,28,27,26,27,27,27,29,30,32,35,41,47,49,48,42,37,33,30,28,27,26,27}, {green block,{149,219},25,27,27,28,29,19,14,11,8,6,6,6,37,37,34,32,30,29,28,27,25,24,24,24,25,26,27,27,25,21,20,19,19,17,17,17,18,19,20,21,25,28,34,35,32,29,28,26,26,26}, {blue block,{387,328},46,44,40,36,34,33,31,30,31,33,33,35,39,43,48,51,50,49,49,49,50,50,48,47,46,45,41,37,33,31,29,28,28,28,28,29,30,32,36,40,47,50,49,49,50,52,54,53,50,48}, {red block,{345,197},40,41,42,42,43,41,41,41,40,37,32,30,28,27,25,24,24,25,25,27,28,30,34,37,39,39,40,41,41,40,38,38,38,37,34,31,28,26,25,24,24,24,24,25,27,28,30,36,41,41}]\nframe t-1: [{gripper,{177,121},61,63,65,67,76,82,127,151,152,168,178,116,127,140,118,101,87,81,30,31,30,29,28,27,25,24,25,25,191,144,138,179,159,145,135,129,125,123,123,125,107,92,83,78,78,72,69,58,58,59}, {table,{318,319},322,325,333,347,333,273,234,208,190,177,169,163,161,161,163,169,178,190,208,234,273,333,344,330,322,319,322,330,344,330,312,297,290,86,174,238,221,210,203,199,198,201,208,219,234,256,287,334,333,325}, {yellow block,{191,278},25,27,27,31,35,35,36,37,38,40,43,40,39,38,37,36,36,36,37,35,34,32,33,33,33,34,32,29,28,27,26,27,27,28,29,30,33,37,42,45,47,45,38,33,31,29,27,25,25,26}, {green block,{140,230},14,15,16,18,20,23,28,30,30,29,28,25,22,20,17,16,15,15,13,13,13,13,12,13,13,14,14,15,16,18,22,24,25,25,24,25,26,27,18,15,14,15,15,15,15,15,15,14,14,15}, {blue 
block,{387,328},45,44,40,36,34,33,31,30,31,33,33,35,39,43,47,51,50,49,49,49,50,50,48,47,46,45,41,37,33,32,29,28,28,28,28,29,30,32,36,40,47,49,49,49,50,52,54,53,50,48}, {red block,{345,197},40,41,42,42,43,41,41,40,40,37,32,30,28,27,25,24,24,25,26,27,29,30,34,37,38,39,40,41,41,40,38,38,38,37,34,31,28,27,25,24,24,24,24,25,27,28,31,36,41,41}]\nframe t-0: [{gripper,{180,118},46,49,50,52,55,94,103,124,116,137,147,87,87,101,117,104,90,87,85,75,74,36,35,36,36,38,40,46,50,53,56,70,155,141,132,126,122,120,120,105,93,79,58,54,53,49,46,46,45,44}, {table,{317,319},323,326,334,348,333,273,234,208,190,177,169,163,161,161,163,169,177,190,208,234,273,333,343,329,321,318,321,329,343,329,311,297,120,125,212,233,221,210,203,199,199,202,208,219,234,256,288,335,334,326}, {yellow block,{183,247},26,28,29,34,36,36,36,37,40,41,40,38,38,37,36,37,38,37,36,35,33,32,32,33,33,33,32,29,27,27,26,27,27,28,29,31,33,39,44,46,46,40,36,32,31,29,27,26,25,26}, {green block,{139,227},15,16,17,19,20,20,21,21,23,28,30,26,23,20,18,17,15,15,13,13,13,13,12,13,13,13,14,15,16,18,20,22,22,21,20,20,22,23,24,23,20,17,15,15,14,14,13,13,13,14}, {blue block,{387,328},46,45,40,36,34,33,32,30,31,33,33,35,39,43,47,51,50,49,49,50,50,50,48,46,46,45,41,37,33,32,29,28,28,28,28,29,30,32,36,40,47,50,49,49,50,52,54,53,50,48}, {red block,{345,197},40,41,42,42,43,43,41,41,40,37,32,30,28,26,25,24,24,24,25,27,28,30,34,37,39,39,39,41,41,40,38,38,37,37,34,31,28,27,25,24,24,24,24,25,27,28,31,36,41,41}]",
    "",
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 2048)

<bos>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Given the following few token representations of a visual frame in the following format 'frame t-0: [{{object_id, {{centroid_x, centroid_y}},num_radial_points,radial_points}}]'..., predict the next frame in the same format'

### Input:
frame t-2: [{gripper,{172,132},78,70,73,81,86,93,105,158,187,186,202,141,156,156,116,26,22,18,16,16,14,14,14,14,15,16,16,179,120,109,167,195,173,158,147,140,136,134,134,123,113,103,95,84,77,74,72,73,77,77}, {table,{320,318},320,323,331,345,335,274,236,209,191,178,170,164,162,162,164,170,178,191,209,236,274,335,346,332,324,321,324,332,346,331,312,298,291,133,130,237,219,208,201,197,197,199,205,217,231,254,285,332,331,323}, {yellow block,{191,315},26,28,28,32,34,35,36,37,38,39,43,41,40,39,37,36,36,36,36,35,34,32,32,31,32,33,32,29,28,27,26,27,27,27,29,30,32,35,41,47,49,48,42,37,3

frame t+1: [{gripper,{168,107},43,44,46,47,83,88,110,102,111,125,132,73,71,84,101,99,89,79,81,81,72,71,66,36,36,38,40,42,46,58,63,70,104,128,120,114,110,109,109,110,91,64,59,54,50,46,44,42,43,43}, {table,{316,317},324,327,335,349,337,276,237,211,192,180,172,165,163,163,165,171,180,192,211,237,276,337,341,328,320,317,320,328,341,327,308,295,140,157,240,236,219,208,201,198,197,199,207,217,232,254,286,332,335,327}, {yellow block,{170,222},26,28,29,33,35,35,36,38,40,40,40,38,37,37,36,36,37,38,36,35,34,32,33,33,33,33,31,27,26,27,26,27,27,28,29,30,34,39,46,46,46,38,34,32,29,28,27,25,25,26}, {green block,{214,74},15,16,16,17,17,18,20,21,23,24,26,27,27,27,24,22,19,15,13,13,12,11,0,0,0,0,0,0,0,0,0,0,30,35,30,27,24,22,20,18,17,17,17,16,16,15,15,14,15,16}, {blue block,{387,328},46,45,40,36,34,33,31,30,31,31,33,35,39,43,47,51,50,49,49,49,50,50,48,47,46,45,41,37,33,32,30,28,29,28,28,29,30,32,36,40,47,50,49,50,50,52,54,53,49,48}, {red block,{345,197},40,41,42,42,43,43,41,41,40,37,32,30,28,27,25,24,2
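One quick sanity check on a predicted frame is to compare each object's centroid with its position in the last input frame. The sketch below assumes the same `{name,{x,y},...}` layout as the frames above. `centroids` and `centroid_shift` are hypothetical helpers, and the two strings are abbreviated copies of frame t-0 and the predicted frame t+1.

```python
import re

# Matches the start of one object: {name,{cx,cy}
CENTROID_RE = re.compile(r"\{([^{},]+),\{(\d+),(\d+)\}")

def centroids(frame_text):
    """Map each object name to its (x, y) centroid."""
    return {name.strip(): (int(x), int(y))
            for name, x, y in CENTROID_RE.findall(frame_text)}

def centroid_shift(previous, predicted):
    """Return the (dx, dy) movement for objects present in both frames."""
    before, after = centroids(previous), centroids(predicted)
    return {name: (after[name][0] - before[name][0],
                   after[name][1] - before[name][1])
            for name in before if name in after}

# Radial values shortened for readability; the centroids are taken from the frames above.
last_frame = "frame t-0: [{gripper,{180,118},46,49}, {yellow block,{183,247},26,28}]"
predicted = "frame t+1: [{gripper,{168,107},43,44}, {yellow block,{170,222},26,28}]"
shift = centroid_shift(last_frame, predicted)
print(shift)
```

Large jumps for objects that were stationary in the input frames are a hint that the prediction deserves a closer look.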

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, use either Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [13]:
model.save_pretrained("lora_model") # Local saving
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
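To confirm that only the adapters were written, you can list the contents of the output directory. This is a small sketch using only the standard library. `list_saved_files` is a hypothetical helper, and the expectation that the folder holds an `adapter_config.json` next to the adapter weights is an assumption based on how PEFT normally saves adapters.

```python
from pathlib import Path

def list_saved_files(directory):
    """Return {file name: size in bytes} for regular files directly inside `directory`."""
    root = Path(directory)
    if not root.is_dir():
        return {}  # nothing saved yet, or a wrong path
    return {p.name: p.stat().st_size for p in sorted(root.iterdir()) if p.is_file()}

files = list_saved_files("lora_model")
for name, size in files.items():
    print(f"{name}: {size / 1024:.1f} KiB")

# Assumption: PEFT writes adapter_config.json alongside the adapter weights.
print("adapter config present:", "adapter_config.json" in files)
```

An adapter-only save should be far smaller than the base model, which is a quick way to tell the two kinds of checkpoint apart.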

Now, if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:

In [14]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# alpaca_prompt = You MUST copy from above!

inputs = tokenizer(
[
    alpaca_prompt.format(
        "What is a famous tall tower in Paris?", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

['<bos>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat is a famous tall tower in Paris?\n\n### Input:\n\n\n### Response:\n{\n  "id": "1470719797,\n  "name": "Eiffing, or, the, or, the, or, the, the, the, the, the, the, the, the, the, the, the, the, the,']
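`batch_decode` returns the prompt together with the completion. If you only want the generated part, one option is to split on the `### Response:` marker from the Alpaca template. The sketch below is a hypothetical helper; treating `<bos>` and `<eos>` as the special tokens to strip is an assumption, and the sample string is illustrative rather than real model output.

```python
def extract_response(decoded, marker="### Response:"):
    """Return the text after the response marker, without special tokens."""
    _, found, tail = decoded.partition(marker)
    if not found:
        return decoded.strip()  # marker missing: return the text unchanged
    for token in ("<bos>", "<eos>"):  # assumed special tokens
        tail = tail.replace(token, "")
    return tail.strip()

# Illustrative string only, not output produced by the model above.
example = "<bos>### Instruction:\nSay hi.\n\n### Input:\n\n\n### Response:\nHello there!<eos>"
answer = extract_response(example)
print(answer)
```

Applied to each element of the list returned by `tokenizer.batch_decode(outputs)`, this leaves just the completions.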

You can also use Hugging Face's `AutoPeftModelForCausalLM`. Only use this if you do not have `unsloth` installed. It can be very slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.

In [15]:
if False:
    # NOT recommended - use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [16]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

### GGUF / llama.cpp Conversion
We now support saving to `GGUF` / `llama.cpp` natively! We clone `llama.cpp` and save to `q8_0` by default. All quantization methods, such as `q4_k_m`, are allowed. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

In [17]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in `llama.cpp` or a UI based system like `GPT4All`. You can install GPT4All by going [here](https://gpt4all.io/index.html).
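Before loading the exported file elsewhere, a cheap check is to read its header: a GGUF file begins with the four ASCII bytes `GGUF`, followed by a little-endian 32-bit version number. The sketch below uses only the standard library. `read_gguf_version` is a hypothetical helper, and the path is an assumption, so adjust it to wherever your file was written.

```python
import os
import struct

def read_gguf_version(path):
    """Return the GGUF version, or None if the file lacks the GGUF magic bytes."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        return None
    (version,) = struct.unpack("<I", header[4:8])  # little-endian uint32
    return version

gguf_path = "model-unsloth.gguf"  # assumed location of the exported file
if os.path.exists(gguf_path):
    print("GGUF version:", read_gguf_version(gguf_path))
else:
    print(f"{gguf_path} not found - run one of the save cells above first")
```

A `None` result usually means the export was interrupted or the path points at a different kind of file.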

And we're done! If you have any questions about Unsloth, find a bug, need help, want to join projects, or just want to keep up with the latest LLM news, feel free to join our [Discord](https://discord.gg/u54VK8m8tk)!

Some other links:
1. Zephyr DPO 2x faster [free Colab](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing)
2. Llama 7b 2x faster [free Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing)
3. TinyLlama 4x faster full Alpaca 52K in 1 hour [free Colab](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing)
4. CodeLlama 34b 2x faster [A100 on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing)
5. Mistral 7b [free Kaggle version](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook)
6. We also did a [blog](https://huggingface.co/blog/unsloth-trl) with 🤗 HuggingFace, and we're in the TRL [docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth)!
7. `ChatML` for ShareGPT datasets, [conversational notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing)
8. Text completions like novel writing [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing)

<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Support our work if you can! Thanks!
</div>