## Using Unsloth to finetune

## Install Prerequisite Packages

In [None]:
# Required when running on Google Colab
!pip install python-dotenv
!pip install datasets
!pip install plotly
!pip install nbformat
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers "trl<0.9.0" peft accelerate bitsandbytes

Collecting unsloth@ git+https://github.com/unslothai/unsloth.git (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-0hbirobv/unsloth_7ece29f83129443888b84ec24b54f40b
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-0hbirobv/unsloth_7ece29f83129443888b84ec24b54f40b
  Resolved https://github.com/unslothai/unsloth.git to commit a2f8db3e7341f983af5814a2c56f54fa29ee548d
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: unsloth
  Building wheel for unsloth (pyproject.toml) ... done
  Created wheel for unsloth: filename=unsloth-2024.10.7-py3-none-any.whl size=164377 sha256=c8373ee7d1e549f3d79ee9a9ed4aed159bd7b745ec8366ef8f6c71758fc31eb6
  Stored in directory: /tmp/pip-ephem-w

## Load `.env`

In [2]:
import os
import sys

from datasets import Dataset

from dotenv import find_dotenv, load_dotenv

load_dotenv()



True

## Important Global Parameters

In [3]:
FINETUNING_DATASET_NAME="CPSC532/2024NOV2_arxiv_qa_data_cleaned"
OUTPUT_MODEL_NAME="2024NOV02_cleaned_arxiv_qa_data"
BASE_MODEL_NAME="unsloth/Llama-3.2-3B-Instruct"

## API Keys

In [4]:
# Could also insert the token here directly
HF_TOKEN = os.getenv("HUGGINGFACE_API_KEY")
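
If `HUGGINGFACE_API_KEY` is not set, `os.getenv` silently returns `None`, which can surface later as an authentication error if the dataset or model is private. A minimal sketch of a fail-fast check follows; the helper name `require_env` is hypothetical and not part of the original notebook.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is missing or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set; add it to .env")
    return value

# Usage (assumes the variable name used in this notebook):
# HF_TOKEN = require_env("HUGGINGFACE_API_KEY")
```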

The model-loading and training cells below are adapted from the Unsloth fine-tuning notebooks.

In [5]:
max_seq_length = 16000 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.


In [6]:
from unsloth import FastLanguageModel
import torch
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = BASE_MODEL_NAME, # or choose "unsloth/Llama-3.2-1B-Instruct"
    # model_name = "unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.10.7: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: NVIDIA GeForce RTX 3080 Ti. Max memory: 11.753 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.1. CUDA = 8.6. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Unsloth: We fixed a gradient accumulation bug, but it seems like you don't have the latest transformers version!
Please update transformers, TRL and unsloth via:
`pip install --upgrade --no-cache-dir --no-deps unsloth transformers git+https://github.com/huggingface/trl.git`


In [7]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 128, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = True,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.10.7 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
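
The trainable-parameter count that Unsloth reports in the training banner (194,510,848) can be reproduced by hand. A rank-`r` adapter on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` weights. The sketch below assumes the Llama-3.2-3B layer shapes (hidden size 3072, MLP size 8192, key/value projection width 1024); those numbers are not stated in this notebook and should be checked against the model's `config.json`. It also compares the adapter scaling factor with and without rank-stabilized LoRA.

```python
import math

# Assumed Llama-3.2-3B shapes (verify against the model's config.json).
HIDDEN, MLP, KV = 3072, 8192, 1024
NUM_LAYERS = 28          # matches the "patched 28 layers" message
R, LORA_ALPHA = 128, 16  # values passed to get_peft_model in this notebook

# (d_in, d_out) of every targeted projection in one decoder layer.
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

# A rank-r adapter holds two matrices: A (r x d_in) and B (d_out x r).
per_layer = sum(R * (d_in + d_out) for d_in, d_out in shapes.values())
trainable_params = per_layer * NUM_LAYERS

rslora_scale = LORA_ALPHA / math.sqrt(R)  # use_rslora = True
classic_scale = LORA_ALPHA / R            # use_rslora = False

print(f"{trainable_params:,} trainable parameters")
print(f"rsLoRA scale = {rslora_scale:.3f}, classic scale = {classic_scale:.3f}")
```

With `r = 128` and `lora_alpha = 16`, the rank-stabilized factor is roughly eleven times larger than the classic one, so `use_rslora` is not a neutral switch at this rank.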


In [8]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }
pass

from datasets import load_dataset


## Get dataset

In [9]:
dataset_finetune = load_dataset(
    FINETUNING_DATASET_NAME,
    split="train",
    token=HF_TOKEN
)

In [10]:
dataset_finetune

Dataset({
    features: ['filename', 'source', 'source_type', 'chunk', 'question', 'answer'],
    num_rows: 681
})

In [11]:

dataset_finetune['question'][0]

'What are some common challenges faced by healthcare professionals during doctor-patient interactions in Puskesmas settings?'

In [12]:

dataset_finetune['answer'][0]

'According to the paper "Using LLM for Real-Time Transcription and Summarization of Doctor-Patient Interactions into ePuskesmas in Indonesia", the primary issue contributing to inefficiency in Puskesmas is the time-consuming nature of doctor-patient interactions. Doctors are required to conduct thorough consultations, which involve diagnosing the patient\'s condition, providing treatment advice, and often recording detailed notes into electronic health records (EHR). This process can be prolonged due to various factors, including the need for doctors to ask clarifying questions, especially in regions with diverse linguistic backgrounds.\n\nThe paper highlights that while the diagnosing task is critical for doctors, the transcription and summarization steps can often be automated using AI technologies to improve the time efficiency of the interactions and help doctors improve quality of care and optimize opportunities for early diagnosis and intervention.'

## Convert dataset to messages format

In [13]:
def convert_to_messages_format(example):
    return [
        {"role": "user", "content": example['question']},
        {"role": "assistant", "content": example['answer']},
    ]

In [14]:
dataset_finetune = dataset_finetune.map(
    lambda x: {
        'conversations' : convert_to_messages_format(x)
        }
)
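
To make the target structure concrete, here is the same conversion applied to a single record. The record is made up for illustration and does not come from the dataset.

```python
def convert_to_messages_format(example):
    # Same logic as the notebook's function: one user turn, one assistant turn.
    return [
        {"role": "user", "content": example["question"]},
        {"role": "assistant", "content": example["answer"]},
    ]

# Hypothetical record with the two columns the function reads.
record = {"question": "What is LoRA?", "answer": "A low-rank adaptation method."}
conversation = convert_to_messages_format(record)

for turn in conversation:
    print(f"{turn['role']:>9}: {turn['content']}")
```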

In [15]:
dataset_finetune = dataset_finetune.map(formatting_prompts_func, batched = True,)

In [16]:
dataset_finetune['text'][0]

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat are some common challenges faced by healthcare professionals during doctor-patient interactions in Puskesmas settings?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nAccording to the paper "Using LLM for Real-Time Transcription and Summarization of Doctor-Patient Interactions into ePuskesmas in Indonesia", the primary issue contributing to inefficiency in Puskesmas is the time-consuming nature of doctor-patient interactions. Doctors are required to conduct thorough consultations, which involve diagnosing the patient\'s condition, providing treatment advice, and often recording detailed notes into electronic health records (EHR). This process can be prolonged due to various factors, including the need for doctors to ask clarifying questions, especially in regions with diverse linguistic ba

## Set Training Parameters

In [17]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset_finetune,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 1,  # Affects memory usage
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 1, # Affects memory usage
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 10, # Train for 10 full passes over the dataset.
        # max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none"
    ),
)
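
The step count and learning-rate values that appear in the training log follow from these arguments. The sketch below is a plain-Python approximation, assuming the Hugging Face `Trainer` floors the number of optimizer updates per epoch and that the `linear` scheduler warms up linearly for `warmup_steps` and then decays linearly to zero. The dataset size of 681 comes from the dataset summary shown earlier.

```python
NUM_EXAMPLES = 681
PER_DEVICE_BATCH = 1
GRAD_ACCUM = 4
EPOCHS = 10
WARMUP_STEPS = 5
PEAK_LR = 2e-4

# Batches per epoch, then optimizer updates per epoch (assumed to be floored).
batches_per_epoch = -(-NUM_EXAMPLES // PER_DEVICE_BATCH)  # ceiling division
updates_per_epoch = max(batches_per_epoch // GRAD_ACCUM, 1)
total_steps = updates_per_epoch * EPOCHS

def linear_lr(step: int) -> float:
    """Learning rate logged after `step` optimizer updates."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(total_steps - step, 0)
    return PEAK_LR * remaining / (total_steps - WARMUP_STEPS)

print(total_steps)
print(linear_lr(1), linear_lr(5), linear_lr(6))
```

Under these assumptions the sketch gives 1,700 total steps, with the rate reaching 2e-4 at step 5 and decaying afterwards, which agrees with the values in the training log.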

We also use Unsloth's `train_on_responses_only` function so that the loss is computed only on the assistant's outputs and the user's inputs are ignored.

In [18]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)
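
Conceptually, response-only training replaces every label outside an assistant turn with `-100`, the index that PyTorch's cross-entropy loss ignores by default. The sketch below illustrates the idea on plain integer lists. It is a simplified stand-in, not Unsloth's implementation, and the token IDs and marker sequences are made up.

```python
IGNORE_INDEX = -100

def find_all(seq, pattern):
    """Return the start index of every occurrence of pattern in seq."""
    n = len(pattern)
    return [i for i in range(len(seq) - n + 1) if seq[i:i + n] == pattern]

def mask_non_responses(input_ids, instruction_ids, response_ids):
    """Keep labels only for tokens that follow a response marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    instruction_starts = find_all(input_ids, instruction_ids)
    for start in find_all(input_ids, response_ids):
        begin = start + len(response_ids)
        # The response runs until the next instruction marker, or to the end.
        later = [i for i in instruction_starts if i >= begin]
        end = later[0] if later else len(input_ids)
        labels[begin:end] = input_ids[begin:end]
    return labels

# Made-up IDs: the pair 1, 2 marks a user turn and 3, 4 marks an assistant turn.
ids = [9, 1, 2, 50, 51, 3, 4, 60, 61, 62]
print(mask_non_responses(ids, [1, 2], [3, 4]))
```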

In [19]:
tokenizer.decode(trainer.train_dataset[0]["input_ids"])

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat are some common challenges faced by healthcare professionals during doctor-patient interactions in Puskesmas settings?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nAccording to the paper "Using LLM for Real-Time Transcription and Summarization of Doctor-Patient Interactions into ePuskesmas in Indonesia", the primary issue contributing to inefficiency in Puskesmas is the time-consuming nature of doctor-patient interactions. Doctors are required to conduct thorough consultations, which involve diagnosing the patient\'s condition, providing treatment advice, and often recording detailed notes into electronic health records (EHR). This process can be prolonged due to various factors, including the need for doctors to ask clarifying questions, especially in regions with diverse linguistic ba

In [20]:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

'                                                                    \n\nAccording to the paper, the manual entry of information into the ePuskesmas application turns it into a labor-intensive and time-consuming task for healthcare providers because they still have to maintain manual records as a copy. This added workload and time required to input data into the ePuskesmas application decreases the overall efficiency (Section I).\n\nIn other words, healthcare providers are already burdened with maintaining paper-based records in addition to entering information into the electronic health record system, which increases their workload and reduces their efficiency.\n\nSource: [8] B. Oktora and S. T. Putri, “The effectiveness of e-Puskesmas application on employee performance in Sindang Barang Puskesmas Bogor: Efektifitas aplikasi e-Puskesmas terhadap kinerja pegawai di Puskesmas Sindang Barang Kota Bogor,” Jurnal Ilmiah Wijaya, vol. 10, no. 2, pp. 44–49, 2018.\n\nReference [8] is cited in
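
A quick numeric check complements the visual one above: the share of label positions set to `-100` shows how much of a sequence is excluded from the loss. The helper below is a sketch; `masked_fraction` is a hypothetical name, and the example labels are made up.

```python
def masked_fraction(labels, ignore_index=-100):
    """Fraction of label positions that are ignored by the loss."""
    if not labels:
        return 0.0
    return sum(1 for x in labels if x == ignore_index) / len(labels)

# Made-up labels: 3 masked prompt tokens followed by 2 response tokens.
print(masked_fraction([-100, -100, -100, 42, 43]))

# In the notebook this could be applied as:
# masked_fraction(trainer.train_dataset[5]["labels"])
```

A value of 1.0 would mean the response marker was never matched, so that example would contribute nothing to the loss.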

In [21]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA GeForce RTX 3080 Ti. Max memory = 11.753 GB.
3.275 GB of memory reserved.
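
The divisions by 1024 in the cell above convert bytes to gibibytes (GiB), although the output labels them GB. A small helper keeps that conversion in one place and makes it easy to report how much memory training adds on top of the baseline. The helper names are hypothetical, and the peak value in the example is invented.

```python
BYTES_PER_GIB = 1024 ** 3

def to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB, rounded to three decimals."""
    return round(num_bytes / BYTES_PER_GIB, 3)

def memory_report(start_gib: float, peak_gib: float, total_gib: float) -> dict:
    """Summarize memory attributable to training relative to the baseline."""
    return {
        "training_gib": round(peak_gib - start_gib, 3),
        "peak_pct_of_total": round(100 * peak_gib / total_gib, 1),
    }

# Baseline and total are taken from the output above; the peak (9.0) is invented.
print(memory_report(start_gib=3.275, peak_gib=9.0, total_gib=11.753))
```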


## Train

In [22]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 681 | Num Epochs = 10
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 4
\        /    Total batch size = 4 | Total steps = 1,700
 "-____-"     Number of trainable parameters = 194,510,848


**** Unsloth: Please use our fixed gradient_accumulation_steps by updating transformers, TRL and Unsloth!
`pip install --upgrade --no-cache-dir --no-deps unsloth transformers git+https://github.com/huggingface/trl.git`


  0%|          | 1/1700 [00:01<52:41,  1.86s/it]

{'loss': 1.2077, 'grad_norm': 1.6806998252868652, 'learning_rate': 4e-05, 'epoch': 0.01}


  0%|          | 2/1700 [00:03<40:54,  1.45s/it]

{'loss': 1.3629, 'grad_norm': 3.64573073387146, 'learning_rate': 8e-05, 'epoch': 0.01}


  0%|          | 3/1700 [00:04<40:31,  1.43s/it]

{'loss': 1.9002, 'grad_norm': 1.338266134262085, 'learning_rate': 0.00012, 'epoch': 0.02}


  0%|          | 4/1700 [00:05<40:02,  1.42s/it]

{'loss': 1.5553, 'grad_norm': 1.1679831743240356, 'learning_rate': 0.00016, 'epoch': 0.02}


  0%|          | 5/1700 [00:06<36:20,  1.29s/it]

{'loss': 1.7304, 'grad_norm': 1.3105965852737427, 'learning_rate': 0.0002, 'epoch': 0.03}


  0%|          | 6/1700 [00:08<36:33,  1.30s/it]

{'loss': 1.9665, 'grad_norm': 1.1628888845443726, 'learning_rate': 0.00019988200589970504, 'epoch': 0.04}


  0%|          | 7/1700 [00:09<38:25,  1.36s/it]

{'loss': 1.3254, 'grad_norm': 0.8366467356681824, 'learning_rate': 0.00019976401179941003, 'epoch': 0.04}


  0%|          | 8/1700 [00:11<43:12,  1.53s/it]

{'loss': 1.3214, 'grad_norm': 0.9380660653114319, 'learning_rate': 0.00019964601769911506, 'epoch': 0.05}


  1%|          | 9/1700 [00:13<44:17,  1.57s/it]

{'loss': 1.7134, 'grad_norm': 0.8907532691955566, 'learning_rate': 0.00019952802359882006, 'epoch': 0.05}


  1%|          | 10/1700 [00:14<40:14,  1.43s/it]

{'loss': 1.4026, 'grad_norm': 1.4190237522125244, 'learning_rate': 0.00019941002949852509, 'epoch': 0.06}


  1%|          | 11/1700 [00:15<37:02,  1.32s/it]

{'loss': 1.482, 'grad_norm': 1.1894991397857666, 'learning_rate': 0.00019929203539823008, 'epoch': 0.06}


  1%|          | 12/1700 [00:16<36:40,  1.30s/it]

{'loss': 1.3256, 'grad_norm': 0.9332809448242188, 'learning_rate': 0.0001991740412979351, 'epoch': 0.07}


  1%|          | 13/1700 [00:17<36:09,  1.29s/it]

{'loss': 1.347, 'grad_norm': 1.110501766204834, 'learning_rate': 0.00019905604719764014, 'epoch': 0.08}


  1%|          | 14/1700 [00:19<36:25,  1.30s/it]

{'loss': 1.424, 'grad_norm': 1.0830143690109253, 'learning_rate': 0.00019893805309734514, 'epoch': 0.08}


  1%|          | 15/1700 [00:20<34:31,  1.23s/it]

{'loss': 1.0706, 'grad_norm': 1.1265887022018433, 'learning_rate': 0.00019882005899705016, 'epoch': 0.09}


  1%|          | 16/1700 [00:21<34:26,  1.23s/it]

{'loss': 1.2667, 'grad_norm': 1.0301432609558105, 'learning_rate': 0.0001987020648967552, 'epoch': 0.09}


  1%|          | 17/1700 [00:22<35:29,  1.27s/it]

{'loss': 1.1054, 'grad_norm': 0.8895284533500671, 'learning_rate': 0.0001985840707964602, 'epoch': 0.1}


  1%|          | 18/1700 [00:24<35:52,  1.28s/it]

{'loss': 1.4032, 'grad_norm': 0.9476577043533325, 'learning_rate': 0.00019846607669616519, 'epoch': 0.11}


  1%|          | 19/1700 [00:25<35:01,  1.25s/it]

{'loss': 1.1911, 'grad_norm': 1.853180170059204, 'learning_rate': 0.0001983480825958702, 'epoch': 0.11}


  1%|          | 20/1700 [00:26<35:25,  1.26s/it]

{'loss': 1.0923, 'grad_norm': 0.9469977617263794, 'learning_rate': 0.00019823008849557524, 'epoch': 0.12}


  1%|          | 21/1700 [00:27<33:13,  1.19s/it]

{'loss': 1.4806, 'grad_norm': 1.1733306646347046, 'learning_rate': 0.00019811209439528024, 'epoch': 0.12}


  1%|▏         | 22/1700 [00:29<35:13,  1.26s/it]

{'loss': 1.4582, 'grad_norm': 0.9719346165657043, 'learning_rate': 0.00019799410029498526, 'epoch': 0.13}


  1%|▏         | 23/1700 [00:30<34:44,  1.24s/it]

{'loss': 1.4274, 'grad_norm': 0.9539160132408142, 'learning_rate': 0.0001978761061946903, 'epoch': 0.14}


  1%|▏         | 24/1700 [00:31<32:51,  1.18s/it]

{'loss': 1.4984, 'grad_norm': 1.0547541379928589, 'learning_rate': 0.0001977581120943953, 'epoch': 0.14}


  1%|▏         | 25/1700 [00:32<33:51,  1.21s/it]

{'loss': 1.2251, 'grad_norm': 0.8280428647994995, 'learning_rate': 0.0001976401179941003, 'epoch': 0.15}


  2%|▏         | 26/1700 [00:34<35:14,  1.26s/it]

{'loss': 1.2245, 'grad_norm': 0.8429933786392212, 'learning_rate': 0.00019752212389380531, 'epoch': 0.15}


  2%|▏         | 27/1700 [00:35<34:24,  1.23s/it]

{'loss': 1.0882, 'grad_norm': 0.9904395341873169, 'learning_rate': 0.00019740412979351034, 'epoch': 0.16}


  2%|▏         | 28/1700 [00:36<35:26,  1.27s/it]

{'loss': 1.5712, 'grad_norm': 0.8482334017753601, 'learning_rate': 0.00019728613569321534, 'epoch': 0.16}


  2%|▏         | 29/1700 [00:37<35:53,  1.29s/it]

{'loss': 0.9894, 'grad_norm': 0.8367481231689453, 'learning_rate': 0.00019716814159292037, 'epoch': 0.17}


  2%|▏         | 30/1700 [00:38<33:15,  1.19s/it]

{'loss': 1.1736, 'grad_norm': 0.9966579079627991, 'learning_rate': 0.0001970501474926254, 'epoch': 0.18}


  2%|▏         | 31/1700 [00:40<33:54,  1.22s/it]

{'loss': 1.2915, 'grad_norm': 0.962375819683075, 'learning_rate': 0.0001969321533923304, 'epoch': 0.18}


  2%|▏         | 32/1700 [00:41<33:40,  1.21s/it]

{'loss': 1.2841, 'grad_norm': 0.9841254949569702, 'learning_rate': 0.00019681415929203542, 'epoch': 0.19}


  2%|▏         | 33/1700 [00:42<34:04,  1.23s/it]

{'loss': 0.9397, 'grad_norm': 0.8310503959655762, 'learning_rate': 0.00019669616519174042, 'epoch': 0.19}


  2%|▏         | 34/1700 [00:44<41:13,  1.48s/it]

{'loss': 1.1806, 'grad_norm': 0.8904987573623657, 'learning_rate': 0.00019657817109144544, 'epoch': 0.2}


  2%|▏         | 35/1700 [00:45<38:27,  1.39s/it]

{'loss': 1.0345, 'grad_norm': 1.0579863786697388, 'learning_rate': 0.00019646017699115044, 'epoch': 0.21}


  2%|▏         | 36/1700 [00:46<34:32,  1.25s/it]

{'loss': 1.1289, 'grad_norm': 1.2320876121520996, 'learning_rate': 0.00019634218289085547, 'epoch': 0.21}


  2%|▏         | 37/1700 [00:48<39:12,  1.41s/it]

{'loss': 1.469, 'grad_norm': 0.8892936706542969, 'learning_rate': 0.0001962241887905605, 'epoch': 0.22}


  2%|▏         | 38/1700 [00:49<37:30,  1.35s/it]

{'loss': 1.1899, 'grad_norm': 0.994239330291748, 'learning_rate': 0.0001961061946902655, 'epoch': 0.22}


  2%|▏         | 39/1700 [00:50<35:53,  1.30s/it]

{'loss': 1.1846, 'grad_norm': 1.0687191486358643, 'learning_rate': 0.00019598820058997052, 'epoch': 0.23}


  2%|▏         | 40/1700 [00:52<35:14,  1.27s/it]

{'loss': 1.1885, 'grad_norm': 0.928801953792572, 'learning_rate': 0.00019587020648967554, 'epoch': 0.23}


  2%|▏         | 41/1700 [00:53<35:32,  1.29s/it]

{'loss': 0.9188, 'grad_norm': 0.917977511882782, 'learning_rate': 0.00019575221238938052, 'epoch': 0.24}


  2%|▏         | 42/1700 [00:54<35:04,  1.27s/it]

{'loss': 1.436, 'grad_norm': 0.9909267425537109, 'learning_rate': 0.00019563421828908554, 'epoch': 0.25}


  3%|▎         | 43/1700 [00:56<36:06,  1.31s/it]

{'loss': 1.028, 'grad_norm': 0.7302535176277161, 'learning_rate': 0.00019551622418879057, 'epoch': 0.25}


  3%|▎         | 44/1700 [00:57<35:12,  1.28s/it]

{'loss': 1.1078, 'grad_norm': 0.8064466118812561, 'learning_rate': 0.0001953982300884956, 'epoch': 0.26}


  3%|▎         | 45/1700 [00:58<36:32,  1.32s/it]

{'loss': 1.0253, 'grad_norm': 0.9220106601715088, 'learning_rate': 0.0001952802359882006, 'epoch': 0.26}


  3%|▎         | 46/1700 [01:00<36:00,  1.31s/it]

{'loss': 1.1041, 'grad_norm': 0.8353224992752075, 'learning_rate': 0.00019516224188790562, 'epoch': 0.27}


  3%|▎         | 47/1700 [01:01<35:04,  1.27s/it]

{'loss': 1.0812, 'grad_norm': 0.8205798268318176, 'learning_rate': 0.00019504424778761065, 'epoch': 0.28}


  3%|▎         | 48/1700 [01:02<34:58,  1.27s/it]

{'loss': 1.4394, 'grad_norm': 0.8497216105461121, 'learning_rate': 0.00019492625368731564, 'epoch': 0.28}


  3%|▎         | 49/1700 [01:03<33:51,  1.23s/it]

{'loss': 1.2563, 'grad_norm': 0.9649249315261841, 'learning_rate': 0.00019480825958702064, 'epoch': 0.29}


  3%|▎         | 50/1700 [01:04<35:02,  1.27s/it]

{'loss': 1.1102, 'grad_norm': 0.8757907748222351, 'learning_rate': 0.00019469026548672567, 'epoch': 0.29}


  3%|▎         | 51/1700 [01:06<36:28,  1.33s/it]

{'loss': 1.2594, 'grad_norm': 0.7403765320777893, 'learning_rate': 0.0001945722713864307, 'epoch': 0.3}


  3%|▎         | 52/1700 [01:07<36:01,  1.31s/it]

{'loss': 1.0795, 'grad_norm': 0.8622000217437744, 'learning_rate': 0.0001944542772861357, 'epoch': 0.31}


  3%|▎         | 53/1700 [01:09<37:26,  1.36s/it]

{'loss': 1.1912, 'grad_norm': 0.8260243535041809, 'learning_rate': 0.00019433628318584072, 'epoch': 0.31}


  3%|▎         | 54/1700 [01:10<36:49,  1.34s/it]

{'loss': 1.2771, 'grad_norm': 1.0799915790557861, 'learning_rate': 0.00019421828908554575, 'epoch': 0.32}


  3%|▎         | 55/1700 [01:11<34:12,  1.25s/it]

{'loss': 1.1497, 'grad_norm': 1.2069064378738403, 'learning_rate': 0.00019410029498525075, 'epoch': 0.32}


  3%|▎         | 56/1700 [01:12<35:19,  1.29s/it]

{'loss': 0.9071, 'grad_norm': 0.8539688587188721, 'learning_rate': 0.00019398230088495577, 'epoch': 0.33}


  3%|▎         | 57/1700 [01:14<35:05,  1.28s/it]

{'loss': 1.3976, 'grad_norm': 0.9664824604988098, 'learning_rate': 0.00019386430678466077, 'epoch': 0.33}


  3%|▎         | 58/1700 [01:15<34:48,  1.27s/it]

{'loss': 1.4068, 'grad_norm': 1.1214697360992432, 'learning_rate': 0.00019374631268436577, 'epoch': 0.34}


  3%|▎         | 59/1700 [01:17<38:00,  1.39s/it]

{'loss': 1.1363, 'grad_norm': 0.7524386048316956, 'learning_rate': 0.0001936283185840708, 'epoch': 0.35}


  4%|▎         | 60/1700 [01:18<38:42,  1.42s/it]

{'loss': 1.3924, 'grad_norm': 0.9532632827758789, 'learning_rate': 0.00019351032448377582, 'epoch': 0.35}


  4%|▎         | 61/1700 [01:19<38:14,  1.40s/it]

{'loss': 1.5277, 'grad_norm': 1.0493496656417847, 'learning_rate': 0.00019339233038348085, 'epoch': 0.36}


  4%|▎         | 62/1700 [01:21<36:48,  1.35s/it]

{'loss': 0.8952, 'grad_norm': 0.8458846807479858, 'learning_rate': 0.00019327433628318585, 'epoch': 0.36}


  4%|▎         | 63/1700 [01:22<36:02,  1.32s/it]

{'loss': 0.8047, 'grad_norm': 0.9212527275085449, 'learning_rate': 0.00019315634218289087, 'epoch': 0.37}


  4%|▍         | 64/1700 [01:23<36:46,  1.35s/it]

{'loss': 1.6234, 'grad_norm': 1.121424674987793, 'learning_rate': 0.0001930383480825959, 'epoch': 0.38}


  4%|▍         | 65/1700 [01:25<36:00,  1.32s/it]

{'loss': 1.2914, 'grad_norm': 1.2334645986557007, 'learning_rate': 0.00019292035398230087, 'epoch': 0.38}


  4%|▍         | 66/1700 [01:26<35:41,  1.31s/it]

{'loss': 0.9023, 'grad_norm': 0.8440471887588501, 'learning_rate': 0.0001928023598820059, 'epoch': 0.39}


  4%|▍         | 67/1700 [01:27<34:41,  1.27s/it]

{'loss': 1.2372, 'grad_norm': 0.9065919518470764, 'learning_rate': 0.00019268436578171092, 'epoch': 0.39}


  4%|▍         | 68/1700 [01:28<33:51,  1.24s/it]

{'loss': 1.2791, 'grad_norm': 0.8385630249977112, 'learning_rate': 0.00019256637168141592, 'epoch': 0.4}


  4%|▍         | 69/1700 [01:29<32:45,  1.21s/it]

{'loss': 1.1353, 'grad_norm': 1.1212443113327026, 'learning_rate': 0.00019244837758112095, 'epoch': 0.41}


  4%|▍         | 70/1700 [01:31<34:12,  1.26s/it]

{'loss': 0.884, 'grad_norm': 0.7897001504898071, 'learning_rate': 0.00019233038348082598, 'epoch': 0.41}


  4%|▍         | 71/1700 [01:32<34:01,  1.25s/it]

{'loss': 0.8818, 'grad_norm': 0.8543419241905212, 'learning_rate': 0.000192212389380531, 'epoch': 0.42}


  4%|▍         | 72/1700 [01:33<34:02,  1.25s/it]

{'loss': 1.2265, 'grad_norm': 0.8813930153846741, 'learning_rate': 0.000192094395280236, 'epoch': 0.42}


  4%|▍         | 73/1700 [01:34<32:54,  1.21s/it]

{'loss': 1.0065, 'grad_norm': 1.0150384902954102, 'learning_rate': 0.000191976401179941, 'epoch': 0.43}


  4%|▍         | 74/1700 [01:36<32:46,  1.21s/it]

{'loss': 0.8754, 'grad_norm': 0.9481233358383179, 'learning_rate': 0.00019185840707964603, 'epoch': 0.43}


  4%|▍         | 75/1700 [01:37<33:39,  1.24s/it]

{'loss': 0.8073, 'grad_norm': 0.8035351037979126, 'learning_rate': 0.00019174041297935102, 'epoch': 0.44}


  4%|▍         | 76/1700 [01:38<36:07,  1.33s/it]

{'loss': 1.3469, 'grad_norm': 0.8972644209861755, 'learning_rate': 0.00019162241887905605, 'epoch': 0.45}


  5%|▍         | 77/1700 [01:40<36:21,  1.34s/it]

{'loss': 1.1103, 'grad_norm': 0.9154747128486633, 'learning_rate': 0.00019150442477876108, 'epoch': 0.45}


  5%|▍         | 78/1700 [01:41<35:01,  1.30s/it]

{'loss': 1.2573, 'grad_norm': 1.1757078170776367, 'learning_rate': 0.00019138643067846608, 'epoch': 0.46}


  5%|▍         | 79/1700 [01:42<33:23,  1.24s/it]

{'loss': 1.0225, 'grad_norm': 1.0310869216918945, 'learning_rate': 0.0001912684365781711, 'epoch': 0.46}


  5%|▍         | 80/1700 [01:43<34:09,  1.26s/it]

{'loss': 1.0539, 'grad_norm': 0.8623256087303162, 'learning_rate': 0.00019115044247787613, 'epoch': 0.47}


  5%|▍         | 81/1700 [01:45<34:02,  1.26s/it]

{'loss': 0.8563, 'grad_norm': 0.7759929299354553, 'learning_rate': 0.00019103244837758113, 'epoch': 0.48}


  5%|▍         | 82/1700 [01:48<54:00,  2.00s/it]

{'loss': 1.3544, 'grad_norm': 0.6634030342102051, 'learning_rate': 0.00019091445427728613, 'epoch': 0.48}


  5%|▍         | 83/1700 [01:50<47:54,  1.78s/it]

{'loss': 1.2722, 'grad_norm': 1.1527385711669922, 'learning_rate': 0.00019079646017699115, 'epoch': 0.49}


  5%|▍         | 84/1700 [01:51<43:17,  1.61s/it]

{'loss': 1.156, 'grad_norm': 1.122833013534546, 'learning_rate': 0.00019067846607669618, 'epoch': 0.49}


  5%|▌         | 85/1700 [01:52<39:02,  1.45s/it]

{'loss': 1.4004, 'grad_norm': 1.1463240385055542, 'learning_rate': 0.00019056047197640118, 'epoch': 0.5}


  5%|▌         | 86/1700 [01:53<37:52,  1.41s/it]

{'loss': 0.8378, 'grad_norm': 0.7750987410545349, 'learning_rate': 0.0001904424778761062, 'epoch': 0.51}


  5%|▌         | 87/1700 [01:54<34:52,  1.30s/it]

{'loss': 0.9138, 'grad_norm': 1.018799901008606, 'learning_rate': 0.00019032448377581123, 'epoch': 0.51}


  5%|▌         | 88/1700 [01:55<32:59,  1.23s/it]

{'loss': 1.1907, 'grad_norm': 0.8048107624053955, 'learning_rate': 0.00019020648967551623, 'epoch': 0.52}


  5%|▌         | 89/1700 [01:57<32:30,  1.21s/it]

{'loss': 1.1233, 'grad_norm': 1.0861839056015015, 'learning_rate': 0.00019008849557522123, 'epoch': 0.52}


  5%|▌         | 90/1700 [01:58<35:09,  1.31s/it]

{'loss': 1.191, 'grad_norm': 0.7852316498756409, 'learning_rate': 0.00018997050147492625, 'epoch': 0.53}


  5%|▌         | 91/1700 [01:59<35:16,  1.32s/it]

{'loss': 1.1126, 'grad_norm': 0.9801624417304993, 'learning_rate': 0.00018985250737463128, 'epoch': 0.53}


  5%|▌         | 92/1700 [02:01<34:17,  1.28s/it]

{'loss': 1.3272, 'grad_norm': 0.9375117421150208, 'learning_rate': 0.00018973451327433628, 'epoch': 0.54}


  5%|▌         | 93/1700 [02:02<33:28,  1.25s/it]

{'loss': 0.9054, 'grad_norm': 0.8499077558517456, 'learning_rate': 0.0001896165191740413, 'epoch': 0.55}


  6%|▌         | 94/1700 [02:03<32:10,  1.20s/it]

{'loss': 0.9706, 'grad_norm': 1.0056995153427124, 'learning_rate': 0.00018949852507374633, 'epoch': 0.55}


  6%|▌         | 95/1700 [02:04<32:31,  1.22s/it]

{'loss': 0.8304, 'grad_norm': 0.7539246678352356, 'learning_rate': 0.00018938053097345133, 'epoch': 0.56}


  6%|▌         | 96/1700 [02:05<33:06,  1.24s/it]

{'loss': 1.304, 'grad_norm': 0.9876908659934998, 'learning_rate': 0.00018926253687315636, 'epoch': 0.56}


  6%|▌         | 97/1700 [02:07<32:17,  1.21s/it]

{'loss': 0.9237, 'grad_norm': 0.8251346945762634, 'learning_rate': 0.00018914454277286135, 'epoch': 0.57}


  6%|▌         | 98/1700 [02:08<31:50,  1.19s/it]

{'loss': 1.4679, 'grad_norm': 1.142088532447815, 'learning_rate': 0.00018902654867256638, 'epoch': 0.58}


  6%|▌         | 99/1700 [02:09<32:22,  1.21s/it]

{'loss': 1.0679, 'grad_norm': 1.0385355949401855, 'learning_rate': 0.00018890855457227138, 'epoch': 0.58}


  6%|▌         | 100/1700 [02:10<31:05,  1.17s/it]

{'loss': 0.9656, 'grad_norm': 0.9410814642906189, 'learning_rate': 0.0001887905604719764, 'epoch': 0.59}


  6%|▌         | 101/1700 [02:12<39:13,  1.47s/it]

{'loss': 1.0114, 'grad_norm': 0.7275580167770386, 'learning_rate': 0.00018867256637168143, 'epoch': 0.59}


  6%|▌         | 102/1700 [02:13<37:29,  1.41s/it]

{'loss': 0.8372, 'grad_norm': 0.8083310723304749, 'learning_rate': 0.00018855457227138643, 'epoch': 0.6}


  6%|▌         | 103/1700 [02:15<37:01,  1.39s/it]

{'loss': 1.3665, 'grad_norm': 1.0034105777740479, 'learning_rate': 0.00018843657817109146, 'epoch': 0.6}


  6%|▌         | 104/1700 [02:16<36:07,  1.36s/it]

{'loss': 1.0248, 'grad_norm': 0.8839564323425293, 'learning_rate': 0.00018831858407079648, 'epoch': 0.61}


  6%|▌         | 105/1700 [02:17<35:07,  1.32s/it]

{'loss': 2.0805, 'grad_norm': 4.22415828704834, 'learning_rate': 0.00018820058997050148, 'epoch': 0.62}


  6%|▌         | 106/1700 [02:19<35:13,  1.33s/it]

{'loss': 0.7695, 'grad_norm': 0.8400819897651672, 'learning_rate': 0.00018808259587020648, 'epoch': 0.62}


  6%|▋         | 107/1700 [02:20<34:29,  1.30s/it]

{'loss': 0.9337, 'grad_norm': 0.8042364716529846, 'learning_rate': 0.0001879646017699115, 'epoch': 0.63}


  6%|▋         | 108/1700 [02:21<33:02,  1.25s/it]

{'loss': 1.0386, 'grad_norm': 0.9945334196090698, 'learning_rate': 0.00018784660766961653, 'epoch': 0.63}


  6%|▋         | 109/1700 [02:22<32:47,  1.24s/it]

{'loss': 0.8104, 'grad_norm': 0.8174819946289062, 'learning_rate': 0.00018772861356932153, 'epoch': 0.64}


  6%|▋         | 110/1700 [02:23<31:41,  1.20s/it]

{'loss': 0.9643, 'grad_norm': 1.027969241142273, 'learning_rate': 0.00018761061946902656, 'epoch': 0.65}


  7%|▋         | 111/1700 [02:25<36:10,  1.37s/it]

{'loss': 0.9307, 'grad_norm': 0.7715662121772766, 'learning_rate': 0.00018749262536873158, 'epoch': 0.65}


  7%|▋         | 112/1700 [02:26<35:54,  1.36s/it]

{'loss': 0.7718, 'grad_norm': 0.9049313068389893, 'learning_rate': 0.00018737463126843658, 'epoch': 0.66}


  7%|▋         | 113/1700 [02:28<36:30,  1.38s/it]

{'loss': 1.0381, 'grad_norm': 0.8564061522483826, 'learning_rate': 0.00018725663716814158, 'epoch': 0.66}


  7%|▋         | 114/1700 [02:29<35:08,  1.33s/it]

{'loss': 1.6472, 'grad_norm': 1.0114840269088745, 'learning_rate': 0.0001871386430678466, 'epoch': 0.67}


  7%|▋         | 115/1700 [02:30<35:39,  1.35s/it]

{'loss': 0.9335, 'grad_norm': 0.830289900302887, 'learning_rate': 0.00018702064896755164, 'epoch': 0.68}


  7%|▋         | 116/1700 [02:32<36:43,  1.39s/it]

{'loss': 0.7458, 'grad_norm': 0.7131760716438293, 'learning_rate': 0.00018690265486725663, 'epoch': 0.68}


  7%|▋         | 117/1700 [02:33<34:31,  1.31s/it]

{'loss': 1.0675, 'grad_norm': 1.018707036972046, 'learning_rate': 0.00018678466076696166, 'epoch': 0.69}


  7%|▋         | 118/1700 [02:34<33:13,  1.26s/it]

{'loss': 1.2119, 'grad_norm': 1.172861099243164, 'learning_rate': 0.0001866666666666667, 'epoch': 0.69}


  7%|▋         | 119/1700 [02:35<31:26,  1.19s/it]

{'loss': 1.3012, 'grad_norm': 1.0086127519607544, 'learning_rate': 0.00018654867256637169, 'epoch': 0.7}


  7%|▋         | 120/1700 [02:37<33:04,  1.26s/it]

{'loss': 0.8089, 'grad_norm': 0.9676344990730286, 'learning_rate': 0.0001864306784660767, 'epoch': 0.7}


  7%|▋         | 121/1700 [02:38<33:07,  1.26s/it]

{'loss': 1.07, 'grad_norm': 0.9679535031318665, 'learning_rate': 0.0001863126843657817, 'epoch': 0.71}


  7%|▋         | 122/1700 [02:39<32:07,  1.22s/it]

{'loss': 1.2705, 'grad_norm': 1.101064920425415, 'learning_rate': 0.00018619469026548674, 'epoch': 0.72}


  7%|▋         | 123/1700 [02:40<31:21,  1.19s/it]

{'loss': 0.596, 'grad_norm': 0.6596675515174866, 'learning_rate': 0.00018607669616519174, 'epoch': 0.72}


  7%|▋         | 124/1700 [02:42<32:45,  1.25s/it]

{'loss': 0.9063, 'grad_norm': 1.3666431903839111, 'learning_rate': 0.00018595870206489676, 'epoch': 0.73}


  7%|▋         | 125/1700 [02:43<33:25,  1.27s/it]

{'loss': 0.8634, 'grad_norm': 0.9537843465805054, 'learning_rate': 0.0001858407079646018, 'epoch': 0.73}


  7%|▋         | 126/1700 [02:45<36:19,  1.38s/it]

{'loss': 1.1759, 'grad_norm': 0.9979233741760254, 'learning_rate': 0.0001857227138643068, 'epoch': 0.74}


  7%|▋         | 127/1700 [02:46<35:45,  1.36s/it]

{'loss': 1.2592, 'grad_norm': 0.9698008894920349, 'learning_rate': 0.0001856047197640118, 'epoch': 0.75}


  8%|▊         | 128/1700 [02:47<35:06,  1.34s/it]

{'loss': 1.0524, 'grad_norm': 1.0933862924575806, 'learning_rate': 0.00018548672566371684, 'epoch': 0.75}


  8%|▊         | 129/1700 [02:48<33:13,  1.27s/it]

{'loss': 1.2408, 'grad_norm': 1.076797366142273, 'learning_rate': 0.00018536873156342184, 'epoch': 0.76}


  8%|▊         | 130/1700 [02:50<36:14,  1.39s/it]

{'loss': 1.1033, 'grad_norm': 0.8714823722839355, 'learning_rate': 0.00018525073746312684, 'epoch': 0.76}


  8%|▊         | 131/1700 [02:51<34:41,  1.33s/it]

{'loss': 1.1311, 'grad_norm': 0.9505721926689148, 'learning_rate': 0.00018513274336283186, 'epoch': 0.77}


  8%|▊         | 132/1700 [02:52<34:01,  1.30s/it]

{'loss': 1.0442, 'grad_norm': 1.3675365447998047, 'learning_rate': 0.0001850147492625369, 'epoch': 0.78}


  8%|▊         | 133/1700 [02:54<33:24,  1.28s/it]

{'loss': 0.9076, 'grad_norm': 0.9110864996910095, 'learning_rate': 0.0001848967551622419, 'epoch': 0.78}


  8%|▊         | 134/1700 [02:55<34:01,  1.30s/it]

{'loss': 1.0955, 'grad_norm': 0.9319391250610352, 'learning_rate': 0.00018477876106194691, 'epoch': 0.79}


  8%|▊         | 135/1700 [02:56<33:37,  1.29s/it]

{'loss': 0.9898, 'grad_norm': 0.8857120275497437, 'learning_rate': 0.00018466076696165194, 'epoch': 0.79}


  8%|▊         | 136/1700 [02:57<33:03,  1.27s/it]

{'loss': 0.6593, 'grad_norm': 0.814641535282135, 'learning_rate': 0.00018454277286135694, 'epoch': 0.8}


  8%|▊         | 137/1700 [02:59<37:16,  1.43s/it]

{'loss': 1.1108, 'grad_norm': 1.0226534605026245, 'learning_rate': 0.00018442477876106194, 'epoch': 0.8}


  8%|▊         | 138/1700 [03:00<35:59,  1.38s/it]

{'loss': 1.0823, 'grad_norm': 0.8556667566299438, 'learning_rate': 0.00018430678466076696, 'epoch': 0.81}


  8%|▊         | 139/1700 [03:02<33:41,  1.29s/it]

{'loss': 0.896, 'grad_norm': 0.9871905446052551, 'learning_rate': 0.000184188790560472, 'epoch': 0.82}


  8%|▊         | 140/1700 [03:03<35:16,  1.36s/it]

{'loss': 0.8456, 'grad_norm': 0.8614336252212524, 'learning_rate': 0.000184070796460177, 'epoch': 0.82}


  8%|▊         | 141/1700 [03:04<34:37,  1.33s/it]

{'loss': 0.4803, 'grad_norm': 0.661590039730072, 'learning_rate': 0.00018395280235988202, 'epoch': 0.83}


  8%|▊         | 142/1700 [03:06<34:52,  1.34s/it]

{'loss': 0.977, 'grad_norm': 0.9324963092803955, 'learning_rate': 0.00018383480825958704, 'epoch': 0.83}


  8%|▊         | 143/1700 [03:07<34:41,  1.34s/it]

{'loss': 0.8103, 'grad_norm': 0.953167200088501, 'learning_rate': 0.00018371681415929204, 'epoch': 0.84}


  8%|▊         | 144/1700 [03:08<33:53,  1.31s/it]

{'loss': 1.1323, 'grad_norm': 1.2880854606628418, 'learning_rate': 0.00018359882005899707, 'epoch': 0.85}


  9%|▊         | 145/1700 [03:10<35:04,  1.35s/it]

{'loss': 1.3504, 'grad_norm': 1.674496054649353, 'learning_rate': 0.00018348082595870207, 'epoch': 0.85}


  9%|▊         | 146/1700 [03:11<33:35,  1.30s/it]

{'loss': 1.1526, 'grad_norm': 1.182229995727539, 'learning_rate': 0.0001833628318584071, 'epoch': 0.86}


  9%|▊         | 147/1700 [03:12<34:41,  1.34s/it]

{'loss': 0.8443, 'grad_norm': 1.4631742238998413, 'learning_rate': 0.0001832448377581121, 'epoch': 0.86}


  9%|▊         | 148/1700 [03:14<34:35,  1.34s/it]

{'loss': 1.3255, 'grad_norm': 1.0897866487503052, 'learning_rate': 0.00018312684365781712, 'epoch': 0.87}


  9%|▉         | 149/1700 [03:15<34:21,  1.33s/it]

{'loss': 1.0591, 'grad_norm': 0.9927946925163269, 'learning_rate': 0.00018300884955752214, 'epoch': 0.88}


  9%|▉         | 150/1700 [03:16<32:57,  1.28s/it]

{'loss': 1.5349, 'grad_norm': 1.0237679481506348, 'learning_rate': 0.00018289085545722714, 'epoch': 0.88}


  9%|▉         | 151/1700 [03:17<31:45,  1.23s/it]

{'loss': 0.8855, 'grad_norm': 0.9883722066879272, 'learning_rate': 0.00018277286135693217, 'epoch': 0.89}


  9%|▉         | 152/1700 [03:18<31:33,  1.22s/it]

{'loss': 0.9985, 'grad_norm': 0.8859377503395081, 'learning_rate': 0.0001826548672566372, 'epoch': 0.89}


  9%|▉         | 153/1700 [03:20<31:35,  1.23s/it]

{'loss': 0.8564, 'grad_norm': 1.2816065549850464, 'learning_rate': 0.0001825368731563422, 'epoch': 0.9}


  9%|▉         | 154/1700 [03:21<30:55,  1.20s/it]

{'loss': 1.0143, 'grad_norm': 0.9809804558753967, 'learning_rate': 0.0001824188790560472, 'epoch': 0.9}


  9%|▉         | 155/1700 [03:22<31:02,  1.21s/it]

{'loss': 1.0781, 'grad_norm': 1.0380264520645142, 'learning_rate': 0.00018230088495575222, 'epoch': 0.91}


  9%|▉         | 156/1700 [03:23<32:33,  1.27s/it]

{'loss': 0.9551, 'grad_norm': 0.9263891577720642, 'learning_rate': 0.00018218289085545725, 'epoch': 0.92}


  9%|▉         | 157/1700 [03:25<34:24,  1.34s/it]

{'loss': 1.1472, 'grad_norm': 0.8578526973724365, 'learning_rate': 0.00018206489675516224, 'epoch': 0.92}


  9%|▉         | 158/1700 [03:26<33:26,  1.30s/it]

{'loss': 0.8634, 'grad_norm': 1.1329632997512817, 'learning_rate': 0.00018194690265486727, 'epoch': 0.93}


  9%|▉         | 159/1700 [03:27<33:32,  1.31s/it]

{'loss': 1.0895, 'grad_norm': 1.0272178649902344, 'learning_rate': 0.0001818289085545723, 'epoch': 0.93}


  9%|▉         | 160/1700 [03:29<32:37,  1.27s/it]

{'loss': 0.8732, 'grad_norm': 1.1696213483810425, 'learning_rate': 0.0001817109144542773, 'epoch': 0.94}


  9%|▉         | 161/1700 [03:30<33:41,  1.31s/it]

{'loss': 0.4539, 'grad_norm': 0.7499170899391174, 'learning_rate': 0.0001815929203539823, 'epoch': 0.95}


 10%|▉         | 162/1700 [03:31<33:41,  1.31s/it]

{'loss': 1.0269, 'grad_norm': 1.2625470161437988, 'learning_rate': 0.00018147492625368732, 'epoch': 0.95}


 10%|▉         | 163/1700 [03:33<32:23,  1.26s/it]

{'loss': 0.8996, 'grad_norm': 0.9601138830184937, 'learning_rate': 0.00018135693215339235, 'epoch': 0.96}


 10%|▉         | 164/1700 [03:34<31:45,  1.24s/it]

{'loss': 0.9344, 'grad_norm': 0.9709782600402832, 'learning_rate': 0.00018123893805309735, 'epoch': 0.96}


 10%|▉         | 165/1700 [03:35<31:50,  1.24s/it]

{'loss': 0.9758, 'grad_norm': 0.9516881108283997, 'learning_rate': 0.00018112094395280237, 'epoch': 0.97}


 10%|▉         | 166/1700 [03:36<32:25,  1.27s/it]

{'loss': 0.9152, 'grad_norm': 0.9910057783126831, 'learning_rate': 0.0001810029498525074, 'epoch': 0.98}


 10%|▉         | 167/1700 [03:38<33:22,  1.31s/it]

{'loss': 1.0066, 'grad_norm': 0.9463131427764893, 'learning_rate': 0.0001808849557522124, 'epoch': 0.98}


 10%|▉         | 168/1700 [03:39<33:55,  1.33s/it]

{'loss': 0.5717, 'grad_norm': 0.8966416716575623, 'learning_rate': 0.00018076696165191742, 'epoch': 0.99}


 10%|▉         | 169/1700 [03:41<34:36,  1.36s/it]

{'loss': 0.8723, 'grad_norm': 0.8403410315513611, 'learning_rate': 0.00018064896755162242, 'epoch': 0.99}


 10%|█         | 170/1700 [03:42<33:57,  1.33s/it]

{'loss': 0.8566, 'grad_norm': 0.9669477343559265, 'learning_rate': 0.00018053097345132742, 'epoch': 1.0}


 10%|█         | 171/1700 [03:43<33:41,  1.32s/it]

{'loss': 0.4785, 'grad_norm': 0.8886857628822327, 'learning_rate': 0.00018041297935103245, 'epoch': 1.0}


 10%|█         | 172/1700 [03:45<35:26,  1.39s/it]

{'loss': 0.5807, 'grad_norm': 0.655655562877655, 'learning_rate': 0.00018029498525073747, 'epoch': 1.01}


 10%|█         | 173/1700 [03:46<35:40,  1.40s/it]

{'loss': 0.456, 'grad_norm': 0.6608131527900696, 'learning_rate': 0.0001801769911504425, 'epoch': 1.02}


 10%|█         | 174/1700 [03:48<35:59,  1.42s/it]

{'loss': 0.2989, 'grad_norm': 0.623128354549408, 'learning_rate': 0.0001800589970501475, 'epoch': 1.02}


 10%|█         | 175/1700 [03:49<34:01,  1.34s/it]

{'loss': 0.6153, 'grad_norm': 0.9075456261634827, 'learning_rate': 0.00017994100294985252, 'epoch': 1.03}


 10%|█         | 176/1700 [03:50<30:58,  1.22s/it]

{'loss': 0.4075, 'grad_norm': 1.0341720581054688, 'learning_rate': 0.00017982300884955755, 'epoch': 1.03}


 10%|█         | 177/1700 [03:51<31:41,  1.25s/it]

{'loss': 0.8223, 'grad_norm': 1.0732206106185913, 'learning_rate': 0.00017970501474926252, 'epoch': 1.04}


 10%|█         | 178/1700 [03:52<33:35,  1.32s/it]

{'loss': 0.4517, 'grad_norm': 0.9279765486717224, 'learning_rate': 0.00017958702064896755, 'epoch': 1.05}


 11%|█         | 179/1700 [03:54<32:32,  1.28s/it]

{'loss': 0.4789, 'grad_norm': 1.1002517938613892, 'learning_rate': 0.00017946902654867257, 'epoch': 1.05}


 11%|█         | 180/1700 [03:55<33:24,  1.32s/it]

{'loss': 0.5424, 'grad_norm': 0.8432830572128296, 'learning_rate': 0.0001793510324483776, 'epoch': 1.06}


 11%|█         | 181/1700 [03:56<33:00,  1.30s/it]

{'loss': 0.4245, 'grad_norm': 1.1918431520462036, 'learning_rate': 0.0001792330383480826, 'epoch': 1.06}


 11%|█         | 182/1700 [03:57<31:02,  1.23s/it]

{'loss': 0.7344, 'grad_norm': 1.538686990737915, 'learning_rate': 0.00017911504424778763, 'epoch': 1.07}


 11%|█         | 183/1700 [03:59<37:11,  1.47s/it]

{'loss': 0.4558, 'grad_norm': 0.7836854457855225, 'learning_rate': 0.00017899705014749265, 'epoch': 1.07}


 11%|█         | 184/1700 [04:01<35:01,  1.39s/it]

{'loss': 0.3561, 'grad_norm': 0.7597676515579224, 'learning_rate': 0.00017887905604719765, 'epoch': 1.08}


 11%|█         | 185/1700 [04:02<34:30,  1.37s/it]

{'loss': 0.5143, 'grad_norm': 1.012800931930542, 'learning_rate': 0.00017876106194690265, 'epoch': 1.09}


 11%|█         | 186/1700 [04:03<34:57,  1.39s/it]

{'loss': 0.518, 'grad_norm': 0.8934109210968018, 'learning_rate': 0.00017864306784660768, 'epoch': 1.09}


 11%|█         | 187/1700 [04:05<33:39,  1.34s/it]

{'loss': 0.4184, 'grad_norm': 0.867064893245697, 'learning_rate': 0.00017852507374631268, 'epoch': 1.1}


 11%|█         | 188/1700 [04:06<34:03,  1.35s/it]

{'loss': 0.3878, 'grad_norm': 0.7511916160583496, 'learning_rate': 0.0001784070796460177, 'epoch': 1.1}


 11%|█         | 189/1700 [04:07<32:54,  1.31s/it]

{'loss': 0.5129, 'grad_norm': 0.900267481803894, 'learning_rate': 0.00017828908554572273, 'epoch': 1.11}


 11%|█         | 190/1700 [04:08<32:46,  1.30s/it]

{'loss': 0.612, 'grad_norm': 1.1628286838531494, 'learning_rate': 0.00017817109144542775, 'epoch': 1.12}


 11%|█         | 191/1700 [04:10<32:46,  1.30s/it]

{'loss': 0.4399, 'grad_norm': 0.79326993227005, 'learning_rate': 0.00017805309734513275, 'epoch': 1.12}


 11%|█▏        | 192/1700 [04:11<31:59,  1.27s/it]

{'loss': 0.4168, 'grad_norm': 0.95545494556427, 'learning_rate': 0.00017793510324483778, 'epoch': 1.13}


 11%|█▏        | 193/1700 [04:13<34:41,  1.38s/it]

{'loss': 0.3988, 'grad_norm': 0.8512547016143799, 'learning_rate': 0.00017781710914454278, 'epoch': 1.13}


 11%|█▏        | 194/1700 [04:14<34:45,  1.39s/it]

{'loss': 0.4587, 'grad_norm': 1.9841059446334839, 'learning_rate': 0.00017769911504424778, 'epoch': 1.14}


 11%|█▏        | 195/1700 [04:16<36:17,  1.45s/it]

{'loss': 0.8388, 'grad_norm': 0.9448841214179993, 'learning_rate': 0.0001775811209439528, 'epoch': 1.15}


 12%|█▏        | 196/1700 [04:17<33:51,  1.35s/it]

{'loss': 0.4338, 'grad_norm': 0.9925234913825989, 'learning_rate': 0.00017746312684365783, 'epoch': 1.15}


 12%|█▏        | 197/1700 [04:18<34:25,  1.37s/it]

{'loss': 0.6427, 'grad_norm': 0.9880810976028442, 'learning_rate': 0.00017734513274336283, 'epoch': 1.16}


 12%|█▏        | 198/1700 [04:20<35:18,  1.41s/it]

{'loss': 0.6493, 'grad_norm': 0.9309552311897278, 'learning_rate': 0.00017722713864306785, 'epoch': 1.16}


 12%|█▏        | 199/1700 [04:21<35:34,  1.42s/it]

{'loss': 0.3432, 'grad_norm': 0.8546959757804871, 'learning_rate': 0.00017710914454277288, 'epoch': 1.17}


 12%|█▏        | 200/1700 [04:23<35:54,  1.44s/it]

{'loss': 0.4388, 'grad_norm': 0.781628429889679, 'learning_rate': 0.0001769911504424779, 'epoch': 1.17}


 12%|█▏        | 201/1700 [04:24<36:47,  1.47s/it]

{'loss': 0.5652, 'grad_norm': 0.790650486946106, 'learning_rate': 0.00017687315634218288, 'epoch': 1.18}


 12%|█▏        | 202/1700 [04:26<37:08,  1.49s/it]

{'loss': 0.4622, 'grad_norm': 0.7778865694999695, 'learning_rate': 0.0001767551622418879, 'epoch': 1.19}


 12%|█▏        | 203/1700 [04:27<35:03,  1.40s/it]

{'loss': 0.3987, 'grad_norm': 0.8842597603797913, 'learning_rate': 0.00017663716814159293, 'epoch': 1.19}


 12%|█▏        | 204/1700 [04:28<34:22,  1.38s/it]

{'loss': 0.823, 'grad_norm': 0.9879463315010071, 'learning_rate': 0.00017651917404129793, 'epoch': 1.2}


 12%|█▏        | 205/1700 [04:29<32:34,  1.31s/it]

{'loss': 0.2319, 'grad_norm': 0.9472762942314148, 'learning_rate': 0.00017640117994100296, 'epoch': 1.2}


 12%|█▏        | 206/1700 [04:30<31:45,  1.28s/it]

{'loss': 0.6634, 'grad_norm': 1.0485153198242188, 'learning_rate': 0.00017628318584070798, 'epoch': 1.21}


 12%|█▏        | 207/1700 [04:32<31:48,  1.28s/it]

{'loss': 0.4548, 'grad_norm': 0.7249952554702759, 'learning_rate': 0.00017616519174041298, 'epoch': 1.22}


 12%|█▏        | 208/1700 [04:33<31:00,  1.25s/it]

{'loss': 0.4786, 'grad_norm': 0.8970497250556946, 'learning_rate': 0.000176047197640118, 'epoch': 1.22}


 12%|█▏        | 209/1700 [04:34<31:02,  1.25s/it]

{'loss': 0.4253, 'grad_norm': 0.7840139865875244, 'learning_rate': 0.000175929203539823, 'epoch': 1.23}


 12%|█▏        | 210/1700 [04:35<30:23,  1.22s/it]

{'loss': 0.4398, 'grad_norm': 0.9862709045410156, 'learning_rate': 0.00017581120943952803, 'epoch': 1.23}


 12%|█▏        | 211/1700 [04:36<29:04,  1.17s/it]

{'loss': 0.6675, 'grad_norm': 2.452565908432007, 'learning_rate': 0.00017569321533923303, 'epoch': 1.24}


 12%|█▏        | 212/1700 [04:37<28:30,  1.15s/it]

{'loss': 0.5621, 'grad_norm': 1.2864102125167847, 'learning_rate': 0.00017557522123893806, 'epoch': 1.25}


 13%|█▎        | 213/1700 [04:39<30:49,  1.24s/it]

{'loss': 0.7465, 'grad_norm': 1.0867177248001099, 'learning_rate': 0.00017545722713864308, 'epoch': 1.25}


 13%|█▎        | 214/1700 [04:40<32:57,  1.33s/it]

{'loss': 0.701, 'grad_norm': 0.8683454990386963, 'learning_rate': 0.00017533923303834808, 'epoch': 1.26}


 13%|█▎        | 215/1700 [04:42<32:00,  1.29s/it]

{'loss': 0.3991, 'grad_norm': 0.8504278063774109, 'learning_rate': 0.0001752212389380531, 'epoch': 1.26}


 13%|█▎        | 216/1700 [04:43<31:26,  1.27s/it]

{'loss': 0.3087, 'grad_norm': 0.718977689743042, 'learning_rate': 0.00017510324483775813, 'epoch': 1.27}


 13%|█▎        | 217/1700 [04:44<32:18,  1.31s/it]

{'loss': 0.48, 'grad_norm': 0.7870255708694458, 'learning_rate': 0.00017498525073746313, 'epoch': 1.27}


 13%|█▎        | 218/1700 [04:48<47:50,  1.94s/it]

{'loss': 0.8179, 'grad_norm': 0.8012233376502991, 'learning_rate': 0.00017486725663716813, 'epoch': 1.28}


 13%|█▎        | 219/1700 [04:50<46:57,  1.90s/it]

{'loss': 0.78, 'grad_norm': 0.8410199284553528, 'learning_rate': 0.00017474926253687316, 'epoch': 1.29}


 13%|█▎        | 220/1700 [04:51<46:10,  1.87s/it]

{'loss': 0.4451, 'grad_norm': 0.9697604179382324, 'learning_rate': 0.00017463126843657818, 'epoch': 1.29}


 13%|█▎        | 221/1700 [04:53<46:18,  1.88s/it]

{'loss': 0.3787, 'grad_norm': 0.726507842540741, 'learning_rate': 0.00017451327433628318, 'epoch': 1.3}


 13%|█▎        | 222/1700 [04:54<40:02,  1.63s/it]

{'loss': 0.2667, 'grad_norm': 0.9123245477676392, 'learning_rate': 0.0001743952802359882, 'epoch': 1.3}


 13%|█▎        | 223/1700 [04:55<36:43,  1.49s/it]

{'loss': 0.4726, 'grad_norm': 0.957551896572113, 'learning_rate': 0.00017427728613569324, 'epoch': 1.31}


 13%|█▎        | 224/1700 [04:57<36:23,  1.48s/it]

{'loss': 0.5539, 'grad_norm': 1.0249665975570679, 'learning_rate': 0.00017415929203539823, 'epoch': 1.32}


 13%|█▎        | 225/1700 [04:58<36:24,  1.48s/it]

{'loss': 0.4611, 'grad_norm': 0.8652139902114868, 'learning_rate': 0.00017404129793510323, 'epoch': 1.32}


 13%|█▎        | 226/1700 [05:00<37:21,  1.52s/it]

{'loss': 1.0012, 'grad_norm': 1.0667706727981567, 'learning_rate': 0.00017392330383480826, 'epoch': 1.33}


 13%|█▎        | 227/1700 [05:01<37:05,  1.51s/it]

{'loss': 0.6537, 'grad_norm': 0.9999507069587708, 'learning_rate': 0.00017380530973451329, 'epoch': 1.33}


 13%|█▎        | 228/1700 [05:03<35:18,  1.44s/it]

{'loss': 0.4474, 'grad_norm': 0.9647253751754761, 'learning_rate': 0.00017368731563421828, 'epoch': 1.34}


 13%|█▎        | 229/1700 [05:04<34:39,  1.41s/it]

{'loss': 0.5369, 'grad_norm': 0.8954298496246338, 'learning_rate': 0.0001735693215339233, 'epoch': 1.35}


 14%|█▎        | 230/1700 [05:05<33:46,  1.38s/it]

{'loss': 0.468, 'grad_norm': 0.8963297009468079, 'learning_rate': 0.00017345132743362834, 'epoch': 1.35}


 14%|█▎        | 231/1700 [05:07<34:36,  1.41s/it]

{'loss': 0.2182, 'grad_norm': 0.6682046055793762, 'learning_rate': 0.00017333333333333334, 'epoch': 1.36}


 14%|█▎        | 232/1700 [05:08<32:19,  1.32s/it]

{'loss': 0.3022, 'grad_norm': 0.8380072116851807, 'learning_rate': 0.00017321533923303836, 'epoch': 1.36}


 14%|█▎        | 233/1700 [05:09<30:25,  1.24s/it]

{'loss': 0.5143, 'grad_norm': 1.359682559967041, 'learning_rate': 0.00017309734513274336, 'epoch': 1.37}


 14%|█▍        | 234/1700 [05:10<29:40,  1.21s/it]

{'loss': 0.4662, 'grad_norm': 0.9513291120529175, 'learning_rate': 0.0001729793510324484, 'epoch': 1.37}


 14%|█▍        | 235/1700 [05:12<31:07,  1.27s/it]

{'loss': 0.7603, 'grad_norm': 1.014669418334961, 'learning_rate': 0.0001728613569321534, 'epoch': 1.38}


 14%|█▍        | 236/1700 [05:13<30:42,  1.26s/it]

{'loss': 0.4829, 'grad_norm': 0.9594480991363525, 'learning_rate': 0.0001727433628318584, 'epoch': 1.39}


 14%|█▍        | 237/1700 [05:14<30:05,  1.23s/it]

{'loss': 0.2849, 'grad_norm': 0.9492846131324768, 'learning_rate': 0.00017262536873156344, 'epoch': 1.39}


 14%|█▍        | 238/1700 [05:15<29:03,  1.19s/it]

{'loss': 0.3342, 'grad_norm': 0.8477614521980286, 'learning_rate': 0.00017250737463126844, 'epoch': 1.4}


 14%|█▍        | 239/1700 [05:16<29:57,  1.23s/it]

{'loss': 0.2942, 'grad_norm': 0.6561962962150574, 'learning_rate': 0.00017238938053097346, 'epoch': 1.4}


 14%|█▍        | 240/1700 [05:18<30:17,  1.24s/it]

{'loss': 0.5305, 'grad_norm': 0.9502092003822327, 'learning_rate': 0.0001722713864306785, 'epoch': 1.41}


 14%|█▍        | 241/1700 [05:19<28:51,  1.19s/it]

{'loss': 0.4178, 'grad_norm': 0.9727051258087158, 'learning_rate': 0.0001721533923303835, 'epoch': 1.42}


 14%|█▍        | 242/1700 [05:20<28:37,  1.18s/it]

{'loss': 0.4473, 'grad_norm': 1.8121776580810547, 'learning_rate': 0.0001720353982300885, 'epoch': 1.42}


 14%|█▍        | 243/1700 [05:21<28:15,  1.16s/it]

{'loss': 0.4217, 'grad_norm': 0.960409939289093, 'learning_rate': 0.00017191740412979351, 'epoch': 1.43}


 14%|█▍        | 244/1700 [05:23<31:21,  1.29s/it]

{'loss': 0.4374, 'grad_norm': 0.8076496124267578, 'learning_rate': 0.00017179941002949854, 'epoch': 1.43}


 14%|█▍        | 245/1700 [05:24<31:32,  1.30s/it]

{'loss': 0.3898, 'grad_norm': 0.9239386320114136, 'learning_rate': 0.00017168141592920354, 'epoch': 1.44}


 14%|█▍        | 246/1700 [05:25<31:35,  1.30s/it]

{'loss': 0.3988, 'grad_norm': 0.9142428040504456, 'learning_rate': 0.00017156342182890857, 'epoch': 1.44}


 15%|█▍        | 247/1700 [05:27<31:46,  1.31s/it]

{'loss': 0.3806, 'grad_norm': 0.9200771450996399, 'learning_rate': 0.0001714454277286136, 'epoch': 1.45}


 15%|█▍        | 248/1700 [05:28<31:43,  1.31s/it]

{'loss': 0.5981, 'grad_norm': 1.0466411113739014, 'learning_rate': 0.0001713274336283186, 'epoch': 1.46}


 15%|█▍        | 249/1700 [05:29<32:23,  1.34s/it]

{'loss': 0.3689, 'grad_norm': 0.8700065016746521, 'learning_rate': 0.0001712094395280236, 'epoch': 1.46}


 15%|█▍        | 250/1700 [05:31<33:43,  1.40s/it]

{'loss': 0.6225, 'grad_norm': 0.8703745007514954, 'learning_rate': 0.00017109144542772862, 'epoch': 1.47}


 15%|█▍        | 251/1700 [05:32<35:02,  1.45s/it]

{'loss': 0.4357, 'grad_norm': 0.6495311856269836, 'learning_rate': 0.00017097345132743364, 'epoch': 1.47}


 15%|█▍        | 252/1700 [05:34<35:02,  1.45s/it]

{'loss': 0.1368, 'grad_norm': 0.5661924481391907, 'learning_rate': 0.00017085545722713864, 'epoch': 1.48}


 15%|█▍        | 253/1700 [05:35<32:30,  1.35s/it]

{'loss': 0.3888, 'grad_norm': 0.8400700688362122, 'learning_rate': 0.00017073746312684367, 'epoch': 1.49}


 15%|█▍        | 254/1700 [05:36<33:01,  1.37s/it]

{'loss': 0.6513, 'grad_norm': 0.9318074584007263, 'learning_rate': 0.0001706194690265487, 'epoch': 1.49}


 15%|█▌        | 255/1700 [05:38<31:38,  1.31s/it]

{'loss': 0.3526, 'grad_norm': 0.9327258467674255, 'learning_rate': 0.0001705014749262537, 'epoch': 1.5}


 15%|█▌        | 256/1700 [05:39<33:11,  1.38s/it]

{'loss': 0.3741, 'grad_norm': 0.6835646629333496, 'learning_rate': 0.00017038348082595872, 'epoch': 1.5}


 15%|█▌        | 257/1700 [05:40<32:22,  1.35s/it]

{'loss': 0.4325, 'grad_norm': 0.9244310259819031, 'learning_rate': 0.00017026548672566372, 'epoch': 1.51}


 15%|█▌        | 258/1700 [05:41<30:19,  1.26s/it]

{'loss': 0.4479, 'grad_norm': 1.0654840469360352, 'learning_rate': 0.00017014749262536874, 'epoch': 1.52}


 15%|█▌        | 259/1700 [05:43<30:23,  1.27s/it]

{'loss': 0.5985, 'grad_norm': 1.0524460077285767, 'learning_rate': 0.00017002949852507374, 'epoch': 1.52}


 15%|█▌        | 260/1700 [05:44<30:10,  1.26s/it]

{'loss': 0.453, 'grad_norm': 0.8134293556213379, 'learning_rate': 0.00016991150442477877, 'epoch': 1.53}


 15%|█▌        | 261/1700 [05:45<29:49,  1.24s/it]

{'loss': 0.219, 'grad_norm': 0.7064751386642456, 'learning_rate': 0.0001697935103244838, 'epoch': 1.53}


 15%|█▌        | 262/1700 [05:46<29:22,  1.23s/it]

{'loss': 0.7742, 'grad_norm': 1.458100438117981, 'learning_rate': 0.0001696755162241888, 'epoch': 1.54}


 15%|█▌        | 263/1700 [05:47<28:38,  1.20s/it]

{'loss': 0.3073, 'grad_norm': 0.9037260413169861, 'learning_rate': 0.00016955752212389382, 'epoch': 1.54}


 16%|█▌        | 264/1700 [05:49<27:53,  1.17s/it]

{'loss': 0.5064, 'grad_norm': 1.1202664375305176, 'learning_rate': 0.00016943952802359885, 'epoch': 1.55}


 16%|█▌        | 265/1700 [05:50<28:37,  1.20s/it]

{'loss': 0.6961, 'grad_norm': 0.9707748293876648, 'learning_rate': 0.00016932153392330384, 'epoch': 1.56}


 16%|█▌        | 266/1700 [05:51<27:22,  1.15s/it]

{'loss': 0.6035, 'grad_norm': 1.3481028079986572, 'learning_rate': 0.00016920353982300884, 'epoch': 1.56}


 16%|█▌        | 267/1700 [05:52<27:44,  1.16s/it]

{'loss': 1.0877, 'grad_norm': 1.2358206510543823, 'learning_rate': 0.00016908554572271387, 'epoch': 1.57}


 16%|█▌        | 268/1700 [05:53<28:33,  1.20s/it]

{'loss': 0.5128, 'grad_norm': 0.8882643580436707, 'learning_rate': 0.0001689675516224189, 'epoch': 1.57}


 16%|█▌        | 269/1700 [05:55<28:45,  1.21s/it]

{'loss': 0.3486, 'grad_norm': 0.6405448317527771, 'learning_rate': 0.0001688495575221239, 'epoch': 1.58}


 16%|█▌        | 270/1700 [05:56<27:35,  1.16s/it]

{'loss': 0.3775, 'grad_norm': 0.9472467303276062, 'learning_rate': 0.00016873156342182892, 'epoch': 1.59}


 16%|█▌        | 271/1700 [05:57<28:21,  1.19s/it]

{'loss': 0.6645, 'grad_norm': 1.0757826566696167, 'learning_rate': 0.00016861356932153395, 'epoch': 1.59}


 16%|█▌        | 272/1700 [05:58<28:23,  1.19s/it]

{'loss': 0.5015, 'grad_norm': 0.9671225547790527, 'learning_rate': 0.00016849557522123895, 'epoch': 1.6}


 16%|█▌        | 273/1700 [05:59<28:38,  1.20s/it]

{'loss': 0.6162, 'grad_norm': 1.3851869106292725, 'learning_rate': 0.00016837758112094395, 'epoch': 1.6}


 16%|█▌        | 274/1700 [06:01<30:15,  1.27s/it]

{'loss': 0.8627, 'grad_norm': 0.9164251089096069, 'learning_rate': 0.00016825958702064897, 'epoch': 1.61}


 16%|█▌        | 275/1700 [06:02<28:43,  1.21s/it]

{'loss': 0.3454, 'grad_norm': 1.2239198684692383, 'learning_rate': 0.000168141592920354, 'epoch': 1.62}


 16%|█▌        | 276/1700 [06:03<30:03,  1.27s/it]

{'loss': 0.583, 'grad_norm': 0.8884599208831787, 'learning_rate': 0.000168023598820059, 'epoch': 1.62}


 16%|█▋        | 277/1700 [06:05<30:44,  1.30s/it]

{'loss': 0.5469, 'grad_norm': 0.9407286047935486, 'learning_rate': 0.00016790560471976402, 'epoch': 1.63}


 16%|█▋        | 278/1700 [06:06<33:21,  1.41s/it]

{'loss': 0.5724, 'grad_norm': 0.772395670413971, 'learning_rate': 0.00016778761061946905, 'epoch': 1.63}


 16%|█▋        | 279/1700 [06:08<32:21,  1.37s/it]

{'loss': 0.443, 'grad_norm': 1.0997018814086914, 'learning_rate': 0.00016766961651917405, 'epoch': 1.64}


 16%|█▋        | 280/1700 [06:09<30:59,  1.31s/it]

{'loss': 0.2917, 'grad_norm': 0.8590746521949768, 'learning_rate': 0.00016755162241887907, 'epoch': 1.64}


 17%|█▋        | 281/1700 [06:10<32:19,  1.37s/it]

{'loss': 0.2895, 'grad_norm': 0.6064804196357727, 'learning_rate': 0.00016743362831858407, 'epoch': 1.65}


 17%|█▋        | 282/1700 [06:12<32:05,  1.36s/it]

{'loss': 0.5459, 'grad_norm': 1.011572241783142, 'learning_rate': 0.0001673156342182891, 'epoch': 1.66}


 17%|█▋        | 283/1700 [06:13<30:48,  1.30s/it]

{'loss': 0.3641, 'grad_norm': 0.8708364963531494, 'learning_rate': 0.0001671976401179941, 'epoch': 1.66}


 17%|█▋        | 284/1700 [06:14<31:44,  1.35s/it]

{'loss': 0.4637, 'grad_norm': 0.8984160423278809, 'learning_rate': 0.00016707964601769912, 'epoch': 1.67}


 17%|█▋        | 285/1700 [06:16<33:30,  1.42s/it]

{'loss': 0.4412, 'grad_norm': 0.887220561504364, 'learning_rate': 0.00016696165191740415, 'epoch': 1.67}


 17%|█▋        | 286/1700 [06:17<31:30,  1.34s/it]

{'loss': 0.2559, 'grad_norm': 0.8036624193191528, 'learning_rate': 0.00016684365781710915, 'epoch': 1.68}


 17%|█▋        | 287/1700 [06:18<30:23,  1.29s/it]

{'loss': 0.4055, 'grad_norm': 1.046288013458252, 'learning_rate': 0.00016672566371681418, 'epoch': 1.69}


 17%|█▋        | 288/1700 [06:20<32:05,  1.36s/it]

{'loss': 0.4051, 'grad_norm': 0.8475365042686462, 'learning_rate': 0.0001666076696165192, 'epoch': 1.69}


 17%|█▋        | 289/1700 [06:21<31:51,  1.35s/it]

{'loss': 0.3991, 'grad_norm': 1.087726354598999, 'learning_rate': 0.00016648967551622417, 'epoch': 1.7}


 17%|█▋        | 290/1700 [06:22<31:55,  1.36s/it]

{'loss': 0.394, 'grad_norm': 0.7800617218017578, 'learning_rate': 0.0001663716814159292, 'epoch': 1.7}


 17%|█▋        | 291/1700 [06:24<30:53,  1.32s/it]

{'loss': 0.4322, 'grad_norm': 0.8709982633590698, 'learning_rate': 0.00016625368731563423, 'epoch': 1.71}


 17%|█▋        | 292/1700 [06:25<29:40,  1.26s/it]

{'loss': 0.7279, 'grad_norm': 1.1521739959716797, 'learning_rate': 0.00016613569321533925, 'epoch': 1.72}


 17%|█▋        | 293/1700 [06:26<30:10,  1.29s/it]

{'loss': 0.5519, 'grad_norm': 1.4335397481918335, 'learning_rate': 0.00016601769911504425, 'epoch': 1.72}


 17%|█▋        | 294/1700 [06:27<28:23,  1.21s/it]

{'loss': 0.2712, 'grad_norm': 0.6866827607154846, 'learning_rate': 0.00016589970501474928, 'epoch': 1.73}


 17%|█▋        | 295/1700 [06:28<29:01,  1.24s/it]

{'loss': 0.3911, 'grad_norm': 0.799781322479248, 'learning_rate': 0.0001657817109144543, 'epoch': 1.73}


 17%|█▋        | 296/1700 [06:30<29:52,  1.28s/it]

{'loss': 0.625, 'grad_norm': 0.8555763959884644, 'learning_rate': 0.0001656637168141593, 'epoch': 1.74}


 17%|█▋        | 297/1700 [06:31<30:58,  1.32s/it]

{'loss': 0.2825, 'grad_norm': 0.6174079179763794, 'learning_rate': 0.0001655457227138643, 'epoch': 1.74}


 18%|█▊        | 298/1700 [06:33<31:47,  1.36s/it]

{'loss': 0.2963, 'grad_norm': 0.825534999370575, 'learning_rate': 0.00016542772861356933, 'epoch': 1.75}


 18%|█▊        | 299/1700 [06:34<29:55,  1.28s/it]

{'loss': 0.471, 'grad_norm': 0.950503408908844, 'learning_rate': 0.00016530973451327433, 'epoch': 1.76}


 18%|█▊        | 300/1700 [06:35<30:15,  1.30s/it]

{'loss': 0.4672, 'grad_norm': 0.8935604691505432, 'learning_rate': 0.00016519174041297935, 'epoch': 1.76}


 18%|█▊        | 301/1700 [06:36<30:24,  1.30s/it]

{'loss': 0.4942, 'grad_norm': 0.8428972959518433, 'learning_rate': 0.00016507374631268438, 'epoch': 1.77}


 18%|█▊        | 302/1700 [06:38<29:44,  1.28s/it]

{'loss': 0.5801, 'grad_norm': 0.9761645793914795, 'learning_rate': 0.0001649557522123894, 'epoch': 1.77}


 18%|█▊        | 303/1700 [06:39<30:00,  1.29s/it]

{'loss': 0.4671, 'grad_norm': 0.9773681163787842, 'learning_rate': 0.0001648377581120944, 'epoch': 1.78}


 18%|█▊        | 304/1700 [06:40<31:23,  1.35s/it]

{'loss': 0.458, 'grad_norm': 0.8075085282325745, 'learning_rate': 0.00016471976401179943, 'epoch': 1.79}


 18%|█▊        | 305/1700 [06:42<30:43,  1.32s/it]

{'loss': 0.602, 'grad_norm': 1.0393952131271362, 'learning_rate': 0.00016460176991150443, 'epoch': 1.79}


 18%|█▊        | 306/1700 [06:43<30:57,  1.33s/it]

{'loss': 0.5733, 'grad_norm': 1.0279288291931152, 'learning_rate': 0.00016448377581120943, 'epoch': 1.8}


 18%|█▊        | 307/1700 [06:44<30:40,  1.32s/it]

{'loss': 0.6232, 'grad_norm': 1.1356911659240723, 'learning_rate': 0.00016436578171091445, 'epoch': 1.8}


 18%|█▊        | 308/1700 [06:46<31:08,  1.34s/it]

{'loss': 0.5047, 'grad_norm': 0.8803806304931641, 'learning_rate': 0.00016424778761061948, 'epoch': 1.81}


 18%|█▊        | 309/1700 [06:47<31:29,  1.36s/it]

{'loss': 0.3806, 'grad_norm': 0.9314960837364197, 'learning_rate': 0.00016412979351032448, 'epoch': 1.81}


 18%|█▊        | 310/1700 [06:49<32:19,  1.39s/it]

{'loss': 0.6122, 'grad_norm': 0.8888080716133118, 'learning_rate': 0.0001640117994100295, 'epoch': 1.82}


 18%|█▊        | 311/1700 [06:50<31:53,  1.38s/it]

{'loss': 0.3064, 'grad_norm': 0.7354826331138611, 'learning_rate': 0.00016389380530973453, 'epoch': 1.83}


 18%|█▊        | 312/1700 [06:51<31:26,  1.36s/it]

{'loss': 0.5981, 'grad_norm': 0.9335423707962036, 'learning_rate': 0.00016377581120943956, 'epoch': 1.83}


 18%|█▊        | 313/1700 [06:53<31:18,  1.35s/it]

{'loss': 0.4813, 'grad_norm': 1.0362141132354736, 'learning_rate': 0.00016365781710914453, 'epoch': 1.84}


 18%|█▊        | 314/1700 [06:54<31:19,  1.36s/it]

{'loss': 0.3467, 'grad_norm': 0.730394721031189, 'learning_rate': 0.00016353982300884955, 'epoch': 1.84}


 19%|█▊        | 315/1700 [06:55<32:03,  1.39s/it]

{'loss': 0.4316, 'grad_norm': 0.8109986186027527, 'learning_rate': 0.00016342182890855458, 'epoch': 1.85}


 19%|█▊        | 316/1700 [06:57<30:18,  1.31s/it]

{'loss': 0.5291, 'grad_norm': 1.3221575021743774, 'learning_rate': 0.00016330383480825958, 'epoch': 1.86}


 19%|█▊        | 317/1700 [06:58<30:01,  1.30s/it]

{'loss': 0.5838, 'grad_norm': 0.8612315654754639, 'learning_rate': 0.0001631858407079646, 'epoch': 1.86}


 19%|█▊        | 318/1700 [06:59<29:06,  1.26s/it]

{'loss': 0.2584, 'grad_norm': 0.7398121356964111, 'learning_rate': 0.00016306784660766963, 'epoch': 1.87}


 19%|█▉        | 319/1700 [07:00<29:32,  1.28s/it]

{'loss': 0.7733, 'grad_norm': 1.1246048212051392, 'learning_rate': 0.00016294985250737466, 'epoch': 1.87}


 19%|█▉        | 320/1700 [07:01<27:25,  1.19s/it]

{'loss': 0.5508, 'grad_norm': 1.1151570081710815, 'learning_rate': 0.00016283185840707966, 'epoch': 1.88}


 19%|█▉        | 321/1700 [07:03<28:02,  1.22s/it]

{'loss': 0.6229, 'grad_norm': 1.1594258546829224, 'learning_rate': 0.00016271386430678466, 'epoch': 1.89}


 19%|█▉        | 322/1700 [07:04<29:57,  1.30s/it]

{'loss': 0.505, 'grad_norm': 0.8821895718574524, 'learning_rate': 0.00016259587020648968, 'epoch': 1.89}


 19%|█▉        | 323/1700 [07:05<28:17,  1.23s/it]

{'loss': 0.4381, 'grad_norm': 0.921369194984436, 'learning_rate': 0.00016247787610619468, 'epoch': 1.9}


 19%|█▉        | 324/1700 [07:06<27:49,  1.21s/it]

{'loss': 0.7811, 'grad_norm': 1.4031254053115845, 'learning_rate': 0.0001623598820058997, 'epoch': 1.9}


 19%|█▉        | 325/1700 [07:08<28:02,  1.22s/it]

{'loss': 0.6514, 'grad_norm': 1.0403938293457031, 'learning_rate': 0.00016224188790560473, 'epoch': 1.91}


 19%|█▉        | 326/1700 [07:09<27:40,  1.21s/it]

{'loss': 0.2417, 'grad_norm': 0.7311973571777344, 'learning_rate': 0.00016212389380530973, 'epoch': 1.91}


 19%|█▉        | 327/1700 [07:10<28:09,  1.23s/it]

{'loss': 0.3463, 'grad_norm': 0.823612630367279, 'learning_rate': 0.00016200589970501476, 'epoch': 1.92}


 19%|█▉        | 328/1700 [07:11<28:50,  1.26s/it]

{'loss': 0.2751, 'grad_norm': 0.6531711220741272, 'learning_rate': 0.00016188790560471979, 'epoch': 1.93}


 19%|█▉        | 329/1700 [07:13<31:23,  1.37s/it]

{'loss': 0.3463, 'grad_norm': 0.7185249924659729, 'learning_rate': 0.00016176991150442478, 'epoch': 1.93}


 19%|█▉        | 330/1700 [07:14<30:10,  1.32s/it]

{'loss': 0.3625, 'grad_norm': 0.8047716617584229, 'learning_rate': 0.00016165191740412978, 'epoch': 1.94}


 19%|█▉        | 331/1700 [07:16<32:40,  1.43s/it]

{'loss': 0.6557, 'grad_norm': 1.003906011581421, 'learning_rate': 0.0001615339233038348, 'epoch': 1.94}


 20%|█▉        | 332/1700 [07:17<30:30,  1.34s/it]

{'loss': 0.5191, 'grad_norm': 1.077467441558838, 'learning_rate': 0.00016141592920353984, 'epoch': 1.95}


 20%|█▉        | 333/1700 [07:18<29:24,  1.29s/it]

{'loss': 0.5358, 'grad_norm': 1.1094461679458618, 'learning_rate': 0.00016129793510324483, 'epoch': 1.96}


 20%|█▉        | 334/1700 [07:19<29:06,  1.28s/it]

{'loss': 0.5334, 'grad_norm': 0.9694502353668213, 'learning_rate': 0.00016117994100294986, 'epoch': 1.96}


 20%|█▉        | 335/1700 [07:21<28:04,  1.23s/it]

{'loss': 0.5535, 'grad_norm': 1.1532176733016968, 'learning_rate': 0.0001610619469026549, 'epoch': 1.97}


 20%|█▉        | 336/1700 [07:22<30:31,  1.34s/it]

{'loss': 0.5929, 'grad_norm': 0.8961491584777832, 'learning_rate': 0.00016094395280235989, 'epoch': 1.97}


 20%|█▉        | 337/1700 [07:23<29:43,  1.31s/it]

{'loss': 0.5487, 'grad_norm': 0.9822817444801331, 'learning_rate': 0.00016082595870206488, 'epoch': 1.98}


 20%|█▉        | 338/1700 [07:25<28:54,  1.27s/it]

{'loss': 0.4057, 'grad_norm': 1.1366760730743408, 'learning_rate': 0.0001607079646017699, 'epoch': 1.99}


 20%|█▉        | 339/1700 [07:27<34:57,  1.54s/it]

{'loss': 0.3928, 'grad_norm': 0.8561408519744873, 'learning_rate': 0.00016058997050147494, 'epoch': 1.99}


 20%|██        | 340/1700 [07:28<32:51,  1.45s/it]

{'loss': 0.5277, 'grad_norm': 1.0066124200820923, 'learning_rate': 0.00016047197640117994, 'epoch': 2.0}


 20%|██        | 341/1700 [07:29<32:04,  1.42s/it]

{'loss': 0.2418, 'grad_norm': 0.7624601721763611, 'learning_rate': 0.00016035398230088496, 'epoch': 2.0}


 20%|██        | 342/1700 [07:31<31:29,  1.39s/it]

{'loss': 0.1691, 'grad_norm': 0.8269530534744263, 'learning_rate': 0.00016023598820059, 'epoch': 2.01}


 20%|██        | 343/1700 [07:32<31:01,  1.37s/it]

{'loss': 0.277, 'grad_norm': 0.636989414691925, 'learning_rate': 0.000160117994100295, 'epoch': 2.01}


 20%|██        | 344/1700 [07:33<31:13,  1.38s/it]

{'loss': 0.2394, 'grad_norm': 0.6992599964141846, 'learning_rate': 0.00016, 'epoch': 2.02}


 20%|██        | 345/1700 [07:35<31:26,  1.39s/it]

{'loss': 0.2064, 'grad_norm': 0.7069149017333984, 'learning_rate': 0.000159882005899705, 'epoch': 2.03}


 20%|██        | 346/1700 [07:36<31:47,  1.41s/it]

{'loss': 0.3025, 'grad_norm': 0.7888442873954773, 'learning_rate': 0.00015976401179941004, 'epoch': 2.03}


 20%|██        | 347/1700 [07:38<32:27,  1.44s/it]

{'loss': 0.269, 'grad_norm': 0.7980825304985046, 'learning_rate': 0.00015964601769911504, 'epoch': 2.04}


 20%|██        | 348/1700 [07:39<31:28,  1.40s/it]

{'loss': 0.2065, 'grad_norm': 0.7714329361915588, 'learning_rate': 0.00015952802359882006, 'epoch': 2.04}


 21%|██        | 349/1700 [07:40<29:31,  1.31s/it]

{'loss': 0.2084, 'grad_norm': 1.484330654144287, 'learning_rate': 0.0001594100294985251, 'epoch': 2.05}


 21%|██        | 350/1700 [07:42<29:57,  1.33s/it]

{'loss': 0.1547, 'grad_norm': 0.7985848188400269, 'learning_rate': 0.0001592920353982301, 'epoch': 2.06}


 21%|██        | 351/1700 [07:43<28:37,  1.27s/it]

{'loss': 0.1989, 'grad_norm': 1.0398519039154053, 'learning_rate': 0.00015917404129793511, 'epoch': 2.06}


 21%|██        | 352/1700 [07:44<29:06,  1.30s/it]

{'loss': 0.2339, 'grad_norm': 1.0817186832427979, 'learning_rate': 0.00015905604719764014, 'epoch': 2.07}


 21%|██        | 353/1700 [07:45<30:19,  1.35s/it]

{'loss': 0.0944, 'grad_norm': 0.5784457325935364, 'learning_rate': 0.00015893805309734514, 'epoch': 2.07}


 21%|██        | 354/1700 [07:47<28:28,  1.27s/it]

{'loss': 0.095, 'grad_norm': 0.798267662525177, 'learning_rate': 0.00015882005899705014, 'epoch': 2.08}


 21%|██        | 355/1700 [07:48<27:43,  1.24s/it]

{'loss': 0.1691, 'grad_norm': 1.0097588300704956, 'learning_rate': 0.00015870206489675516, 'epoch': 2.09}


 21%|██        | 356/1700 [07:49<26:47,  1.20s/it]

{'loss': 0.1478, 'grad_norm': 1.0607508420944214, 'learning_rate': 0.0001585840707964602, 'epoch': 2.09}


 21%|██        | 357/1700 [07:50<28:40,  1.28s/it]

{'loss': 0.1037, 'grad_norm': 0.6389526128768921, 'learning_rate': 0.0001584660766961652, 'epoch': 2.1}


 21%|██        | 358/1700 [07:51<27:21,  1.22s/it]

{'loss': 0.0425, 'grad_norm': 0.5294708013534546, 'learning_rate': 0.00015834808259587022, 'epoch': 2.1}


 21%|██        | 359/1700 [07:53<30:17,  1.36s/it]

{'loss': 0.2553, 'grad_norm': 0.6562590599060059, 'learning_rate': 0.00015823008849557524, 'epoch': 2.11}


 21%|██        | 360/1700 [07:55<34:49,  1.56s/it]

{'loss': 0.2163, 'grad_norm': 0.7364512085914612, 'learning_rate': 0.00015811209439528024, 'epoch': 2.11}


 21%|██        | 361/1700 [07:56<32:50,  1.47s/it]

{'loss': 0.1226, 'grad_norm': 0.6933856010437012, 'learning_rate': 0.00015799410029498524, 'epoch': 2.12}


 21%|██▏       | 362/1700 [07:58<32:08,  1.44s/it]

{'loss': 0.1528, 'grad_norm': 0.7874345183372498, 'learning_rate': 0.00015787610619469027, 'epoch': 2.13}


 21%|██▏       | 363/1700 [07:59<29:06,  1.31s/it]

{'loss': 0.1804, 'grad_norm': 1.139007806777954, 'learning_rate': 0.0001577581120943953, 'epoch': 2.13}


 21%|██▏       | 364/1700 [08:00<28:39,  1.29s/it]

{'loss': 0.0623, 'grad_norm': 0.45254847407341003, 'learning_rate': 0.0001576401179941003, 'epoch': 2.14}


 21%|██▏       | 365/1700 [08:01<29:49,  1.34s/it]

{'loss': 0.1982, 'grad_norm': 0.7163758873939514, 'learning_rate': 0.00015752212389380532, 'epoch': 2.14}


 22%|██▏       | 366/1700 [08:02<27:21,  1.23s/it]

{'loss': 0.1308, 'grad_norm': 1.0426051616668701, 'learning_rate': 0.00015740412979351034, 'epoch': 2.15}


 22%|██▏       | 367/1700 [08:04<28:40,  1.29s/it]

{'loss': 0.1285, 'grad_norm': 0.6316636204719543, 'learning_rate': 0.00015728613569321534, 'epoch': 2.16}


 22%|██▏       | 368/1700 [08:05<29:26,  1.33s/it]

{'loss': 0.197, 'grad_norm': 0.6378176212310791, 'learning_rate': 0.00015716814159292037, 'epoch': 2.16}


 22%|██▏       | 369/1700 [08:09<46:25,  2.09s/it]

{'loss': 0.6436, 'grad_norm': 0.792050302028656, 'learning_rate': 0.00015705014749262537, 'epoch': 2.17}


 22%|██▏       | 370/1700 [08:10<41:25,  1.87s/it]

{'loss': 0.1119, 'grad_norm': 0.7609273195266724, 'learning_rate': 0.0001569321533923304, 'epoch': 2.17}


 22%|██▏       | 371/1700 [08:12<37:10,  1.68s/it]

{'loss': 0.1423, 'grad_norm': 0.7868798971176147, 'learning_rate': 0.0001568141592920354, 'epoch': 2.18}


 22%|██▏       | 372/1700 [08:13<32:53,  1.49s/it]

{'loss': 0.1281, 'grad_norm': 0.6544753909111023, 'learning_rate': 0.00015669616519174042, 'epoch': 2.19}


 22%|██▏       | 373/1700 [08:14<32:44,  1.48s/it]

{'loss': 0.1938, 'grad_norm': 0.746758759021759, 'learning_rate': 0.00015657817109144545, 'epoch': 2.19}


 22%|██▏       | 374/1700 [08:16<34:53,  1.58s/it]

{'loss': 0.1329, 'grad_norm': 0.6660776138305664, 'learning_rate': 0.00015646017699115044, 'epoch': 2.2}


 22%|██▏       | 375/1700 [08:18<35:16,  1.60s/it]

{'loss': 0.431, 'grad_norm': 0.8646290898323059, 'learning_rate': 0.00015634218289085547, 'epoch': 2.2}


 22%|██▏       | 376/1700 [08:19<33:23,  1.51s/it]

{'loss': 0.1733, 'grad_norm': 0.8756933212280273, 'learning_rate': 0.0001562241887905605, 'epoch': 2.21}


 22%|██▏       | 377/1700 [08:20<32:58,  1.50s/it]

{'loss': 0.1876, 'grad_norm': 0.7201969027519226, 'learning_rate': 0.0001561061946902655, 'epoch': 2.21}


 22%|██▏       | 378/1700 [08:22<31:53,  1.45s/it]

{'loss': 0.1534, 'grad_norm': 0.806148111820221, 'learning_rate': 0.0001559882005899705, 'epoch': 2.22}


 22%|██▏       | 379/1700 [08:23<32:07,  1.46s/it]

{'loss': 0.2184, 'grad_norm': 0.8240252733230591, 'learning_rate': 0.00015587020648967552, 'epoch': 2.23}


 22%|██▏       | 380/1700 [08:25<31:33,  1.43s/it]

{'loss': 0.1945, 'grad_norm': 0.759360671043396, 'learning_rate': 0.00015575221238938055, 'epoch': 2.23}


 22%|██▏       | 381/1700 [08:26<29:32,  1.34s/it]

{'loss': 0.1707, 'grad_norm': 1.0438454151153564, 'learning_rate': 0.00015563421828908555, 'epoch': 2.24}


 22%|██▏       | 382/1700 [08:27<30:07,  1.37s/it]

{'loss': 0.158, 'grad_norm': 0.755817711353302, 'learning_rate': 0.00015551622418879057, 'epoch': 2.24}


 23%|██▎       | 383/1700 [08:28<29:28,  1.34s/it]

{'loss': 0.1805, 'grad_norm': 0.781360924243927, 'learning_rate': 0.0001553982300884956, 'epoch': 2.25}


 23%|██▎       | 384/1700 [08:30<29:22,  1.34s/it]

{'loss': 0.1507, 'grad_norm': 0.6652738451957703, 'learning_rate': 0.0001552802359882006, 'epoch': 2.26}


 23%|██▎       | 385/1700 [08:31<30:41,  1.40s/it]

{'loss': 0.2219, 'grad_norm': 0.7223750948905945, 'learning_rate': 0.0001551622418879056, 'epoch': 2.26}


 23%|██▎       | 386/1700 [08:33<31:48,  1.45s/it]

{'loss': 0.1251, 'grad_norm': 0.7473737001419067, 'learning_rate': 0.00015504424778761062, 'epoch': 2.27}


 23%|██▎       | 387/1700 [08:34<31:03,  1.42s/it]

{'loss': 0.1042, 'grad_norm': 0.512215793132782, 'learning_rate': 0.00015492625368731565, 'epoch': 2.27}


 23%|██▎       | 388/1700 [08:35<29:41,  1.36s/it]

{'loss': 0.1267, 'grad_norm': 0.7651724219322205, 'learning_rate': 0.00015480825958702065, 'epoch': 2.28}


 23%|██▎       | 389/1700 [08:37<28:03,  1.28s/it]

{'loss': 0.1587, 'grad_norm': 0.7428769469261169, 'learning_rate': 0.00015469026548672567, 'epoch': 2.28}


 23%|██▎       | 390/1700 [08:38<27:32,  1.26s/it]

{'loss': 0.1672, 'grad_norm': 1.2314120531082153, 'learning_rate': 0.0001545722713864307, 'epoch': 2.29}


 23%|██▎       | 391/1700 [08:39<27:44,  1.27s/it]

{'loss': 0.1613, 'grad_norm': 0.7946248054504395, 'learning_rate': 0.0001544542772861357, 'epoch': 2.3}


 23%|██▎       | 392/1700 [08:40<27:01,  1.24s/it]

{'loss': 0.1273, 'grad_norm': 0.8957396745681763, 'learning_rate': 0.00015433628318584072, 'epoch': 2.3}


 23%|██▎       | 393/1700 [08:41<27:02,  1.24s/it]

{'loss': 0.2022, 'grad_norm': 0.9359869956970215, 'learning_rate': 0.00015421828908554572, 'epoch': 2.31}


 23%|██▎       | 394/1700 [08:43<29:58,  1.38s/it]

{'loss': 0.3763, 'grad_norm': 0.9452729225158691, 'learning_rate': 0.00015410029498525075, 'epoch': 2.31}


 23%|██▎       | 395/1700 [08:44<29:11,  1.34s/it]

{'loss': 0.3311, 'grad_norm': 1.0281211137771606, 'learning_rate': 0.00015398230088495575, 'epoch': 2.32}


 23%|██▎       | 396/1700 [08:46<30:56,  1.42s/it]

{'loss': 0.3178, 'grad_norm': 1.1135718822479248, 'learning_rate': 0.00015386430678466077, 'epoch': 2.33}


 23%|██▎       | 397/1700 [08:47<28:05,  1.29s/it]

{'loss': 0.1936, 'grad_norm': 1.0952491760253906, 'learning_rate': 0.0001537463126843658, 'epoch': 2.33}


 23%|██▎       | 398/1700 [08:48<27:29,  1.27s/it]

{'loss': 0.2532, 'grad_norm': 1.2159574031829834, 'learning_rate': 0.0001536283185840708, 'epoch': 2.34}


 23%|██▎       | 399/1700 [08:50<28:11,  1.30s/it]

{'loss': 0.2354, 'grad_norm': 0.7857471108436584, 'learning_rate': 0.00015351032448377583, 'epoch': 2.34}


 24%|██▎       | 400/1700 [08:51<28:15,  1.30s/it]

{'loss': 0.29, 'grad_norm': 0.9057171940803528, 'learning_rate': 0.00015339233038348085, 'epoch': 2.35}


 24%|██▎       | 401/1700 [08:52<25:50,  1.19s/it]

{'loss': 0.1225, 'grad_norm': 0.8528680801391602, 'learning_rate': 0.00015327433628318585, 'epoch': 2.36}


 24%|██▎       | 402/1700 [08:53<25:13,  1.17s/it]

{'loss': 0.1821, 'grad_norm': 0.9792255759239197, 'learning_rate': 0.00015315634218289085, 'epoch': 2.36}


 24%|██▎       | 403/1700 [08:54<25:47,  1.19s/it]

{'loss': 0.2078, 'grad_norm': 0.882584810256958, 'learning_rate': 0.00015303834808259588, 'epoch': 2.37}


 24%|██▍       | 404/1700 [08:56<30:57,  1.43s/it]

{'loss': 0.1679, 'grad_norm': 0.7212641835212708, 'learning_rate': 0.0001529203539823009, 'epoch': 2.37}


 24%|██▍       | 405/1700 [08:57<29:16,  1.36s/it]

{'loss': 0.1389, 'grad_norm': 1.093005657196045, 'learning_rate': 0.0001528023598820059, 'epoch': 2.38}


 24%|██▍       | 406/1700 [08:59<28:18,  1.31s/it]

{'loss': 0.2489, 'grad_norm': 0.945480227470398, 'learning_rate': 0.00015268436578171093, 'epoch': 2.38}


 24%|██▍       | 407/1700 [09:00<28:30,  1.32s/it]

{'loss': 0.3296, 'grad_norm': 1.066285490989685, 'learning_rate': 0.00015256637168141595, 'epoch': 2.39}


 24%|██▍       | 408/1700 [09:01<28:05,  1.30s/it]

{'loss': 0.1795, 'grad_norm': 0.8048053979873657, 'learning_rate': 0.00015244837758112095, 'epoch': 2.4}


 24%|██▍       | 409/1700 [09:03<28:34,  1.33s/it]

{'loss': 0.3125, 'grad_norm': 1.013609766960144, 'learning_rate': 0.00015233038348082595, 'epoch': 2.4}


 24%|██▍       | 410/1700 [09:04<27:53,  1.30s/it]

{'loss': 0.1366, 'grad_norm': 0.6934955716133118, 'learning_rate': 0.00015221238938053098, 'epoch': 2.41}


 24%|██▍       | 411/1700 [09:05<27:26,  1.28s/it]

{'loss': 0.2593, 'grad_norm': 0.8525087237358093, 'learning_rate': 0.000152094395280236, 'epoch': 2.41}


 24%|██▍       | 412/1700 [09:06<25:56,  1.21s/it]

{'loss': 0.2138, 'grad_norm': 0.9152396321296692, 'learning_rate': 0.000151976401179941, 'epoch': 2.42}


 24%|██▍       | 413/1700 [09:07<26:21,  1.23s/it]

{'loss': 0.269, 'grad_norm': 0.9071531891822815, 'learning_rate': 0.00015185840707964603, 'epoch': 2.43}


 24%|██▍       | 414/1700 [09:09<26:01,  1.21s/it]

{'loss': 0.1305, 'grad_norm': 0.7801632285118103, 'learning_rate': 0.00015174041297935105, 'epoch': 2.43}


 24%|██▍       | 415/1700 [09:10<27:08,  1.27s/it]

{'loss': 0.3277, 'grad_norm': 0.8524788618087769, 'learning_rate': 0.00015162241887905605, 'epoch': 2.44}


 24%|██▍       | 416/1700 [09:11<28:09,  1.32s/it]

{'loss': 0.2768, 'grad_norm': 0.8090585470199585, 'learning_rate': 0.00015150442477876108, 'epoch': 2.44}


 25%|██▍       | 417/1700 [09:13<26:58,  1.26s/it]

{'loss': 0.2593, 'grad_norm': 1.1161234378814697, 'learning_rate': 0.00015138643067846608, 'epoch': 2.45}


 25%|██▍       | 418/1700 [09:14<25:54,  1.21s/it]

{'loss': 0.1219, 'grad_norm': 0.7328807711601257, 'learning_rate': 0.00015126843657817108, 'epoch': 2.46}


 25%|██▍       | 419/1700 [09:15<25:06,  1.18s/it]

{'loss': 0.1634, 'grad_norm': 0.7926284670829773, 'learning_rate': 0.0001511504424778761, 'epoch': 2.46}


 25%|██▍       | 420/1700 [09:16<25:24,  1.19s/it]

{'loss': 0.2788, 'grad_norm': 1.0115101337432861, 'learning_rate': 0.00015103244837758113, 'epoch': 2.47}


 25%|██▍       | 421/1700 [09:17<25:41,  1.21s/it]

{'loss': 0.1359, 'grad_norm': 0.6405881643295288, 'learning_rate': 0.00015091445427728616, 'epoch': 2.47}


 25%|██▍       | 422/1700 [09:19<26:27,  1.24s/it]

{'loss': 0.0872, 'grad_norm': 0.5924931168556213, 'learning_rate': 0.00015079646017699116, 'epoch': 2.48}


 25%|██▍       | 423/1700 [09:19<24:46,  1.16s/it]

{'loss': 0.2475, 'grad_norm': 1.174877643585205, 'learning_rate': 0.00015067846607669618, 'epoch': 2.48}


 25%|██▍       | 424/1700 [09:21<24:06,  1.13s/it]

{'loss': 0.1271, 'grad_norm': 0.7735196948051453, 'learning_rate': 0.0001505604719764012, 'epoch': 2.49}


 25%|██▌       | 425/1700 [09:22<24:56,  1.17s/it]

{'loss': 0.1644, 'grad_norm': 0.8418329358100891, 'learning_rate': 0.00015044247787610618, 'epoch': 2.5}


 25%|██▌       | 426/1700 [09:23<24:37,  1.16s/it]

{'loss': 0.2055, 'grad_norm': 0.9762603044509888, 'learning_rate': 0.0001503244837758112, 'epoch': 2.5}


 25%|██▌       | 427/1700 [09:24<24:14,  1.14s/it]

{'loss': 0.2038, 'grad_norm': 1.022975206375122, 'learning_rate': 0.00015020648967551623, 'epoch': 2.51}


 25%|██▌       | 428/1700 [09:25<24:00,  1.13s/it]

{'loss': 0.1914, 'grad_norm': 0.8511973023414612, 'learning_rate': 0.00015008849557522123, 'epoch': 2.51}


 25%|██▌       | 429/1700 [09:26<24:57,  1.18s/it]

{'loss': 0.2224, 'grad_norm': 0.9371472001075745, 'learning_rate': 0.00014997050147492626, 'epoch': 2.52}


 25%|██▌       | 430/1700 [09:28<24:09,  1.14s/it]

{'loss': 0.0878, 'grad_norm': 0.8781142830848694, 'learning_rate': 0.00014985250737463128, 'epoch': 2.53}


 25%|██▌       | 431/1700 [09:29<25:41,  1.21s/it]

{'loss': 0.1885, 'grad_norm': 0.762436032295227, 'learning_rate': 0.0001497345132743363, 'epoch': 2.53}


 25%|██▌       | 432/1700 [09:30<26:52,  1.27s/it]

{'loss': 0.148, 'grad_norm': 0.6291449666023254, 'learning_rate': 0.0001496165191740413, 'epoch': 2.54}


 25%|██▌       | 433/1700 [09:32<27:05,  1.28s/it]

{'loss': 0.1729, 'grad_norm': 0.89811110496521, 'learning_rate': 0.0001494985250737463, 'epoch': 2.54}


 26%|██▌       | 434/1700 [09:33<26:01,  1.23s/it]

{'loss': 0.2239, 'grad_norm': 0.9523612856864929, 'learning_rate': 0.00014938053097345133, 'epoch': 2.55}


 26%|██▌       | 435/1700 [09:34<26:09,  1.24s/it]

{'loss': 0.1993, 'grad_norm': 0.8209437727928162, 'learning_rate': 0.00014926253687315633, 'epoch': 2.56}


 26%|██▌       | 436/1700 [09:35<27:44,  1.32s/it]

{'loss': 0.0847, 'grad_norm': 0.5110815167427063, 'learning_rate': 0.00014914454277286136, 'epoch': 2.56}


 26%|██▌       | 437/1700 [09:37<28:38,  1.36s/it]

{'loss': 0.14, 'grad_norm': 0.6222568154335022, 'learning_rate': 0.00014902654867256638, 'epoch': 2.57}


 26%|██▌       | 438/1700 [09:38<26:37,  1.27s/it]

{'loss': 0.1467, 'grad_norm': 0.9047343134880066, 'learning_rate': 0.00014890855457227138, 'epoch': 2.57}


 26%|██▌       | 439/1700 [09:40<29:19,  1.40s/it]

{'loss': 0.1759, 'grad_norm': 0.7499180436134338, 'learning_rate': 0.0001487905604719764, 'epoch': 2.58}


 26%|██▌       | 440/1700 [09:41<28:04,  1.34s/it]

{'loss': 0.2005, 'grad_norm': 0.9558517932891846, 'learning_rate': 0.00014867256637168144, 'epoch': 2.58}


 26%|██▌       | 441/1700 [09:42<28:34,  1.36s/it]

{'loss': 0.3174, 'grad_norm': 0.8115026950836182, 'learning_rate': 0.00014855457227138643, 'epoch': 2.59}


 26%|██▌       | 442/1700 [09:44<30:00,  1.43s/it]

{'loss': 0.2767, 'grad_norm': 0.7789686918258667, 'learning_rate': 0.00014843657817109143, 'epoch': 2.6}


 26%|██▌       | 443/1700 [09:45<30:19,  1.45s/it]

{'loss': 0.2471, 'grad_norm': 0.6142869591712952, 'learning_rate': 0.00014831858407079646, 'epoch': 2.6}


 26%|██▌       | 444/1700 [09:46<27:06,  1.29s/it]

{'loss': 0.2369, 'grad_norm': 1.1184232234954834, 'learning_rate': 0.00014820058997050149, 'epoch': 2.61}


 26%|██▌       | 445/1700 [09:47<25:58,  1.24s/it]

{'loss': 0.226, 'grad_norm': 0.8211173415184021, 'learning_rate': 0.00014808259587020649, 'epoch': 2.61}


 26%|██▌       | 446/1700 [09:49<26:16,  1.26s/it]

{'loss': 0.1809, 'grad_norm': 0.9314155578613281, 'learning_rate': 0.0001479646017699115, 'epoch': 2.62}


 26%|██▋       | 447/1700 [09:50<27:24,  1.31s/it]

{'loss': 0.2208, 'grad_norm': 0.855904221534729, 'learning_rate': 0.00014784660766961654, 'epoch': 2.63}


 26%|██▋       | 448/1700 [09:51<27:12,  1.30s/it]

{'loss': 0.2372, 'grad_norm': 0.7654410600662231, 'learning_rate': 0.00014772861356932156, 'epoch': 2.63}


 26%|██▋       | 449/1700 [09:53<27:34,  1.32s/it]

{'loss': 0.2417, 'grad_norm': 0.7574560642242432, 'learning_rate': 0.00014761061946902654, 'epoch': 2.64}


 26%|██▋       | 450/1700 [09:54<26:12,  1.26s/it]

{'loss': 0.2871, 'grad_norm': 1.0813946723937988, 'learning_rate': 0.00014749262536873156, 'epoch': 2.64}


 27%|██▋       | 451/1700 [09:55<26:32,  1.27s/it]

{'loss': 0.2337, 'grad_norm': 0.7889247536659241, 'learning_rate': 0.0001473746312684366, 'epoch': 2.65}


 27%|██▋       | 452/1700 [09:57<26:53,  1.29s/it]

{'loss': 0.2401, 'grad_norm': 0.8733810782432556, 'learning_rate': 0.0001472566371681416, 'epoch': 2.65}


 27%|██▋       | 453/1700 [09:58<25:32,  1.23s/it]

{'loss': 0.0521, 'grad_norm': 0.544577419757843, 'learning_rate': 0.0001471386430678466, 'epoch': 2.66}


 27%|██▋       | 454/1700 [09:59<25:11,  1.21s/it]

{'loss': 0.2016, 'grad_norm': 0.9332217574119568, 'learning_rate': 0.00014702064896755164, 'epoch': 2.67}


 27%|██▋       | 455/1700 [10:00<25:38,  1.24s/it]

{'loss': 0.1633, 'grad_norm': 1.0550756454467773, 'learning_rate': 0.00014690265486725664, 'epoch': 2.67}


 27%|██▋       | 456/1700 [10:01<25:04,  1.21s/it]

{'loss': 0.2908, 'grad_norm': 1.041627287864685, 'learning_rate': 0.00014678466076696166, 'epoch': 2.68}


 27%|██▋       | 457/1700 [10:03<26:08,  1.26s/it]

{'loss': 0.0839, 'grad_norm': 0.5066578984260559, 'learning_rate': 0.00014666666666666666, 'epoch': 2.68}


 27%|██▋       | 458/1700 [10:04<26:47,  1.29s/it]

{'loss': 0.3874, 'grad_norm': 1.07915198802948, 'learning_rate': 0.0001465486725663717, 'epoch': 2.69}


 27%|██▋       | 459/1700 [10:05<26:15,  1.27s/it]

{'loss': 0.1085, 'grad_norm': 0.6992923021316528, 'learning_rate': 0.0001464306784660767, 'epoch': 2.7}


 27%|██▋       | 460/1700 [10:07<26:11,  1.27s/it]

{'loss': 0.1602, 'grad_norm': 0.8275947570800781, 'learning_rate': 0.00014631268436578171, 'epoch': 2.7}


 27%|██▋       | 461/1700 [10:08<26:47,  1.30s/it]

{'loss': 0.1674, 'grad_norm': 0.7323732376098633, 'learning_rate': 0.00014619469026548674, 'epoch': 2.71}


 27%|██▋       | 462/1700 [10:09<26:24,  1.28s/it]

{'loss': 0.2003, 'grad_norm': 0.8188778758049011, 'learning_rate': 0.00014607669616519174, 'epoch': 2.71}


 27%|██▋       | 463/1700 [10:10<25:13,  1.22s/it]

{'loss': 0.2142, 'grad_norm': 0.8295238614082336, 'learning_rate': 0.00014595870206489677, 'epoch': 2.72}


 27%|██▋       | 464/1700 [10:11<23:49,  1.16s/it]

{'loss': 0.1346, 'grad_norm': 0.9460392594337463, 'learning_rate': 0.0001458407079646018, 'epoch': 2.73}


 27%|██▋       | 465/1700 [10:13<25:46,  1.25s/it]

{'loss': 0.3462, 'grad_norm': 0.896507978439331, 'learning_rate': 0.0001457227138643068, 'epoch': 2.73}


 27%|██▋       | 466/1700 [10:14<25:03,  1.22s/it]

{'loss': 0.1181, 'grad_norm': 0.874625027179718, 'learning_rate': 0.0001456047197640118, 'epoch': 2.74}


 27%|██▋       | 467/1700 [10:15<26:33,  1.29s/it]

{'loss': 0.4207, 'grad_norm': 0.911993682384491, 'learning_rate': 0.00014548672566371682, 'epoch': 2.74}


 28%|██▊       | 468/1700 [10:17<26:23,  1.29s/it]

{'loss': 0.1352, 'grad_norm': 0.8429880738258362, 'learning_rate': 0.00014536873156342184, 'epoch': 2.75}


 28%|██▊       | 469/1700 [10:18<27:18,  1.33s/it]

{'loss': 0.2364, 'grad_norm': 0.9109398126602173, 'learning_rate': 0.00014525073746312684, 'epoch': 2.75}


 28%|██▊       | 470/1700 [10:19<26:25,  1.29s/it]

{'loss': 0.12, 'grad_norm': 0.6345952153205872, 'learning_rate': 0.00014513274336283187, 'epoch': 2.76}


 28%|██▊       | 471/1700 [10:20<26:13,  1.28s/it]

{'loss': 0.1701, 'grad_norm': 0.8179053664207458, 'learning_rate': 0.0001450147492625369, 'epoch': 2.77}


 28%|██▊       | 472/1700 [10:21<24:34,  1.20s/it]

{'loss': 0.1923, 'grad_norm': 1.312451958656311, 'learning_rate': 0.0001448967551622419, 'epoch': 2.77}


 28%|██▊       | 473/1700 [10:23<25:16,  1.24s/it]

{'loss': 0.1387, 'grad_norm': 0.6901381611824036, 'learning_rate': 0.0001447787610619469, 'epoch': 2.78}


 28%|██▊       | 474/1700 [10:24<25:09,  1.23s/it]

{'loss': 0.225, 'grad_norm': 0.8164860010147095, 'learning_rate': 0.00014466076696165192, 'epoch': 2.78}


 28%|██▊       | 475/1700 [10:25<24:15,  1.19s/it]

{'loss': 0.0877, 'grad_norm': 0.6311472654342651, 'learning_rate': 0.00014454277286135694, 'epoch': 2.79}


 28%|██▊       | 476/1700 [10:26<25:15,  1.24s/it]

{'loss': 0.2387, 'grad_norm': 0.7730773687362671, 'learning_rate': 0.00014442477876106194, 'epoch': 2.8}


 28%|██▊       | 477/1700 [10:28<26:21,  1.29s/it]

{'loss': 0.1939, 'grad_norm': 0.8363454341888428, 'learning_rate': 0.00014430678466076697, 'epoch': 2.8}


 28%|██▊       | 478/1700 [10:29<25:10,  1.24s/it]

{'loss': 0.1691, 'grad_norm': 0.9384403824806213, 'learning_rate': 0.000144188790560472, 'epoch': 2.81}


 28%|██▊       | 479/1700 [10:31<30:04,  1.48s/it]

{'loss': 0.2119, 'grad_norm': 0.7159565687179565, 'learning_rate': 0.000144070796460177, 'epoch': 2.81}


 28%|██▊       | 480/1700 [10:32<27:36,  1.36s/it]

{'loss': 0.1712, 'grad_norm': 0.8399059176445007, 'learning_rate': 0.00014395280235988202, 'epoch': 2.82}


 28%|██▊       | 481/1700 [10:33<25:54,  1.27s/it]

{'loss': 0.128, 'grad_norm': 0.7268242835998535, 'learning_rate': 0.00014383480825958702, 'epoch': 2.83}


 28%|██▊       | 482/1700 [10:34<24:54,  1.23s/it]

{'loss': 0.2039, 'grad_norm': 1.0079095363616943, 'learning_rate': 0.00014371681415929204, 'epoch': 2.83}


 28%|██▊       | 483/1700 [10:35<24:49,  1.22s/it]

{'loss': 0.2366, 'grad_norm': 0.8931894898414612, 'learning_rate': 0.00014359882005899704, 'epoch': 2.84}


 28%|██▊       | 484/1700 [10:37<24:43,  1.22s/it]

{'loss': 0.284, 'grad_norm': 1.539391040802002, 'learning_rate': 0.00014348082595870207, 'epoch': 2.84}


 29%|██▊       | 485/1700 [10:39<29:15,  1.44s/it]

{'loss': 0.2442, 'grad_norm': 0.9669759273529053, 'learning_rate': 0.0001433628318584071, 'epoch': 2.85}


 29%|██▊       | 486/1700 [10:40<27:42,  1.37s/it]

{'loss': 0.1276, 'grad_norm': 0.634739100933075, 'learning_rate': 0.0001432448377581121, 'epoch': 2.85}


 29%|██▊       | 487/1700 [10:41<25:00,  1.24s/it]

{'loss': 0.1806, 'grad_norm': 0.9225952625274658, 'learning_rate': 0.00014312684365781712, 'epoch': 2.86}


 29%|██▊       | 488/1700 [10:42<24:38,  1.22s/it]

{'loss': 0.1585, 'grad_norm': 0.7866107821464539, 'learning_rate': 0.00014300884955752215, 'epoch': 2.87}


 29%|██▉       | 489/1700 [10:43<23:06,  1.15s/it]

{'loss': 0.1997, 'grad_norm': 0.8639765381813049, 'learning_rate': 0.00014289085545722715, 'epoch': 2.87}


 29%|██▉       | 490/1700 [10:44<23:58,  1.19s/it]

{'loss': 0.2365, 'grad_norm': 0.7890874147415161, 'learning_rate': 0.00014277286135693215, 'epoch': 2.88}


 29%|██▉       | 491/1700 [10:45<23:48,  1.18s/it]

{'loss': 0.1764, 'grad_norm': 0.8736118674278259, 'learning_rate': 0.00014265486725663717, 'epoch': 2.88}


 29%|██▉       | 492/1700 [10:47<24:36,  1.22s/it]

{'loss': 0.2635, 'grad_norm': 0.8949114084243774, 'learning_rate': 0.0001425368731563422, 'epoch': 2.89}


 29%|██▉       | 493/1700 [10:48<24:44,  1.23s/it]

{'loss': 0.2712, 'grad_norm': 0.8992909789085388, 'learning_rate': 0.0001424188790560472, 'epoch': 2.9}


 29%|██▉       | 494/1700 [10:49<24:55,  1.24s/it]

{'loss': 0.2628, 'grad_norm': 0.9810184240341187, 'learning_rate': 0.00014230088495575222, 'epoch': 2.9}


 29%|██▉       | 495/1700 [10:51<25:33,  1.27s/it]

{'loss': 0.19, 'grad_norm': 0.8150339722633362, 'learning_rate': 0.00014218289085545725, 'epoch': 2.91}


 29%|██▉       | 496/1700 [10:52<25:49,  1.29s/it]

{'loss': 0.1733, 'grad_norm': 0.768746018409729, 'learning_rate': 0.00014206489675516225, 'epoch': 2.91}


 29%|██▉       | 497/1700 [10:53<24:45,  1.24s/it]

{'loss': 0.2307, 'grad_norm': 0.9117948412895203, 'learning_rate': 0.00014194690265486725, 'epoch': 2.92}


 29%|██▉       | 498/1700 [10:55<27:26,  1.37s/it]

{'loss': 0.2431, 'grad_norm': 0.9972277879714966, 'learning_rate': 0.00014182890855457227, 'epoch': 2.93}


 29%|██▉       | 499/1700 [10:56<26:54,  1.34s/it]

{'loss': 0.1412, 'grad_norm': 0.7652965188026428, 'learning_rate': 0.0001417109144542773, 'epoch': 2.93}


 29%|██▉       | 500/1700 [10:57<26:34,  1.33s/it]

{'loss': 0.1607, 'grad_norm': 0.787991464138031, 'learning_rate': 0.0001415929203539823, 'epoch': 2.94}


 29%|██▉       | 501/1700 [11:00<37:14,  1.86s/it]

{'loss': 0.1757, 'grad_norm': 0.8622288107872009, 'learning_rate': 0.00014147492625368732, 'epoch': 2.94}


 30%|██▉       | 502/1700 [11:01<32:34,  1.63s/it]

{'loss': 0.232, 'grad_norm': 1.2260596752166748, 'learning_rate': 0.00014135693215339235, 'epoch': 2.95}


 30%|██▉       | 503/1700 [11:03<30:43,  1.54s/it]

{'loss': 0.2355, 'grad_norm': 1.0753713846206665, 'learning_rate': 0.00014123893805309735, 'epoch': 2.95}


 30%|██▉       | 504/1700 [11:04<28:39,  1.44s/it]

{'loss': 0.1908, 'grad_norm': 1.1003806591033936, 'learning_rate': 0.00014112094395280238, 'epoch': 2.96}


 30%|██▉       | 505/1700 [11:05<27:32,  1.38s/it]

{'loss': 0.382, 'grad_norm': 1.0034029483795166, 'learning_rate': 0.00014100294985250737, 'epoch': 2.97}


 30%|██▉       | 506/1700 [11:07<27:01,  1.36s/it]

{'loss': 0.2737, 'grad_norm': 0.934880793094635, 'learning_rate': 0.0001408849557522124, 'epoch': 2.97}


 30%|██▉       | 507/1700 [11:08<27:38,  1.39s/it]

{'loss': 0.1052, 'grad_norm': 0.5918743014335632, 'learning_rate': 0.0001407669616519174, 'epoch': 2.98}


 30%|██▉       | 508/1700 [11:09<27:38,  1.39s/it]

{'loss': 0.1294, 'grad_norm': 0.6730992794036865, 'learning_rate': 0.00014064896755162243, 'epoch': 2.98}


 30%|██▉       | 509/1700 [11:11<27:02,  1.36s/it]

{'loss': 0.2193, 'grad_norm': 0.6724178791046143, 'learning_rate': 0.00014053097345132745, 'epoch': 2.99}


 30%|███       | 510/1700 [11:12<26:09,  1.32s/it]

{'loss': 0.1602, 'grad_norm': 0.8011967539787292, 'learning_rate': 0.00014041297935103245, 'epoch': 3.0}


 30%|███       | 511/1700 [11:13<26:38,  1.34s/it]

{'loss': 0.2046, 'grad_norm': 0.8741020560264587, 'learning_rate': 0.00014029498525073748, 'epoch': 3.0}


 30%|███       | 512/1700 [11:15<26:17,  1.33s/it]

{'loss': 0.1366, 'grad_norm': 0.6107447743415833, 'learning_rate': 0.0001401769911504425, 'epoch': 3.01}


 30%|███       | 513/1700 [11:16<27:25,  1.39s/it]

{'loss': 0.0901, 'grad_norm': 0.7063410878181458, 'learning_rate': 0.0001400589970501475, 'epoch': 3.01}


 30%|███       | 514/1700 [11:17<26:19,  1.33s/it]

{'loss': 0.0954, 'grad_norm': 0.46874749660491943, 'learning_rate': 0.0001399410029498525, 'epoch': 3.02}


 30%|███       | 515/1700 [11:19<26:49,  1.36s/it]

{'loss': 0.1243, 'grad_norm': 0.6320620179176331, 'learning_rate': 0.00013982300884955753, 'epoch': 3.02}


 30%|███       | 516/1700 [11:20<26:05,  1.32s/it]

{'loss': 0.092, 'grad_norm': 0.6733378767967224, 'learning_rate': 0.00013970501474926255, 'epoch': 3.03}


 30%|███       | 517/1700 [11:21<26:36,  1.35s/it]

{'loss': 0.1251, 'grad_norm': 0.5799566507339478, 'learning_rate': 0.00013958702064896755, 'epoch': 3.04}


 30%|███       | 518/1700 [11:23<27:11,  1.38s/it]

{'loss': 0.1899, 'grad_norm': 0.6647695302963257, 'learning_rate': 0.00013946902654867258, 'epoch': 3.04}


 31%|███       | 519/1700 [11:24<26:29,  1.35s/it]

{'loss': 0.0646, 'grad_norm': 0.619364857673645, 'learning_rate': 0.0001393510324483776, 'epoch': 3.05}


 31%|███       | 520/1700 [11:25<26:14,  1.33s/it]

{'loss': 0.0814, 'grad_norm': 0.5712590217590332, 'learning_rate': 0.0001392330383480826, 'epoch': 3.05}


 31%|███       | 521/1700 [11:27<29:28,  1.50s/it]

{'loss': 0.0396, 'grad_norm': 0.36115747690200806, 'learning_rate': 0.0001391150442477876, 'epoch': 3.06}


 31%|███       | 522/1700 [11:29<27:37,  1.41s/it]

{'loss': 0.1145, 'grad_norm': 0.8308700919151306, 'learning_rate': 0.00013899705014749263, 'epoch': 3.07}


 31%|███       | 523/1700 [11:30<27:28,  1.40s/it]

{'loss': 0.1052, 'grad_norm': 0.8496276140213013, 'learning_rate': 0.00013887905604719765, 'epoch': 3.07}


 31%|███       | 524/1700 [11:31<26:37,  1.36s/it]

{'loss': 0.0533, 'grad_norm': 0.4860377311706543, 'learning_rate': 0.00013876106194690265, 'epoch': 3.08}


 31%|███       | 525/1700 [11:32<26:20,  1.35s/it]

{'loss': 0.0511, 'grad_norm': 0.5419392585754395, 'learning_rate': 0.00013864306784660768, 'epoch': 3.08}


 31%|███       | 526/1700 [11:34<25:57,  1.33s/it]

{'loss': 0.1061, 'grad_norm': 0.7859999537467957, 'learning_rate': 0.0001385250737463127, 'epoch': 3.09}


 31%|███       | 527/1700 [11:35<25:45,  1.32s/it]

{'loss': 0.0571, 'grad_norm': 0.690811038017273, 'learning_rate': 0.0001384070796460177, 'epoch': 3.1}


 31%|███       | 528/1700 [11:36<26:14,  1.34s/it]

{'loss': 0.1422, 'grad_norm': 0.9359526634216309, 'learning_rate': 0.00013828908554572273, 'epoch': 3.1}


 31%|███       | 529/1700 [11:38<24:57,  1.28s/it]

{'loss': 0.0484, 'grad_norm': 0.5657795667648315, 'learning_rate': 0.00013817109144542773, 'epoch': 3.11}


 31%|███       | 530/1700 [11:39<24:31,  1.26s/it]

{'loss': 0.0254, 'grad_norm': 0.44220826029777527, 'learning_rate': 0.00013805309734513276, 'epoch': 3.11}


 31%|███       | 531/1700 [11:40<24:28,  1.26s/it]

{'loss': 0.1697, 'grad_norm': 1.0768545866012573, 'learning_rate': 0.00013793510324483775, 'epoch': 3.12}


 31%|███▏      | 532/1700 [11:41<25:05,  1.29s/it]

{'loss': 0.1083, 'grad_norm': 0.768589973449707, 'learning_rate': 0.00013781710914454278, 'epoch': 3.12}


 31%|███▏      | 533/1700 [11:43<24:27,  1.26s/it]

{'loss': 0.0818, 'grad_norm': 0.7310271263122559, 'learning_rate': 0.0001376991150442478, 'epoch': 3.13}


 31%|███▏      | 534/1700 [11:44<26:56,  1.39s/it]

{'loss': 0.1809, 'grad_norm': 0.7070700526237488, 'learning_rate': 0.0001375811209439528, 'epoch': 3.14}


 31%|███▏      | 535/1700 [11:46<25:57,  1.34s/it]

{'loss': 0.0573, 'grad_norm': 0.6166476011276245, 'learning_rate': 0.00013746312684365783, 'epoch': 3.14}


 32%|███▏      | 536/1700 [11:47<25:00,  1.29s/it]

{'loss': 0.0464, 'grad_norm': 0.35578057169914246, 'learning_rate': 0.00013734513274336286, 'epoch': 3.15}


 32%|███▏      | 537/1700 [11:48<26:28,  1.37s/it]

{'loss': 0.0773, 'grad_norm': 0.56174236536026, 'learning_rate': 0.00013722713864306783, 'epoch': 3.15}


 32%|███▏      | 538/1700 [11:50<26:04,  1.35s/it]

{'loss': 0.0822, 'grad_norm': 0.5751415491104126, 'learning_rate': 0.00013710914454277286, 'epoch': 3.16}


 32%|███▏      | 539/1700 [11:51<24:44,  1.28s/it]

{'loss': 0.102, 'grad_norm': 0.8378257155418396, 'learning_rate': 0.00013699115044247788, 'epoch': 3.17}


 32%|███▏      | 540/1700 [11:52<23:36,  1.22s/it]

{'loss': 0.0468, 'grad_norm': 0.634605348110199, 'learning_rate': 0.0001368731563421829, 'epoch': 3.17}


 32%|███▏      | 541/1700 [11:53<23:52,  1.24s/it]

{'loss': 0.047, 'grad_norm': 0.47525671124458313, 'learning_rate': 0.0001367551622418879, 'epoch': 3.18}


 32%|███▏      | 542/1700 [11:54<24:16,  1.26s/it]

{'loss': 0.1494, 'grad_norm': 0.8538028001785278, 'learning_rate': 0.00013663716814159293, 'epoch': 3.18}


 32%|███▏      | 543/1700 [11:56<24:17,  1.26s/it]

{'loss': 0.0769, 'grad_norm': 0.5456621050834656, 'learning_rate': 0.00013651917404129796, 'epoch': 3.19}


 32%|███▏      | 544/1700 [11:57<25:47,  1.34s/it]

{'loss': 0.1187, 'grad_norm': 0.6173310279846191, 'learning_rate': 0.00013640117994100296, 'epoch': 3.2}


 32%|███▏      | 545/1700 [11:58<25:08,  1.31s/it]

{'loss': 0.0694, 'grad_norm': 0.6178296804428101, 'learning_rate': 0.00013628318584070796, 'epoch': 3.2}


 32%|███▏      | 546/1700 [12:00<24:46,  1.29s/it]

{'loss': 0.0715, 'grad_norm': 0.6881155967712402, 'learning_rate': 0.00013616519174041298, 'epoch': 3.21}


 32%|███▏      | 547/1700 [12:01<24:16,  1.26s/it]

{'loss': 0.066, 'grad_norm': 0.6955828070640564, 'learning_rate': 0.00013604719764011798, 'epoch': 3.21}


 32%|███▏      | 548/1700 [12:02<23:56,  1.25s/it]

{'loss': 0.0929, 'grad_norm': 0.730522871017456, 'learning_rate': 0.000135929203539823, 'epoch': 3.22}


 32%|███▏      | 549/1700 [12:03<23:31,  1.23s/it]

{'loss': 0.1425, 'grad_norm': 0.9645058512687683, 'learning_rate': 0.00013581120943952804, 'epoch': 3.22}


 32%|███▏      | 550/1700 [12:04<22:22,  1.17s/it]

{'loss': 0.1069, 'grad_norm': 0.8146223425865173, 'learning_rate': 0.00013569321533923306, 'epoch': 3.23}


 32%|███▏      | 551/1700 [12:06<24:02,  1.26s/it]

{'loss': 0.1122, 'grad_norm': 0.7693427205085754, 'learning_rate': 0.00013557522123893806, 'epoch': 3.24}


 32%|███▏      | 552/1700 [12:07<22:00,  1.15s/it]

{'loss': 0.0831, 'grad_norm': 0.8079999685287476, 'learning_rate': 0.0001354572271386431, 'epoch': 3.24}


 33%|███▎      | 553/1700 [12:08<22:04,  1.16s/it]

{'loss': 0.0928, 'grad_norm': 0.6089147329330444, 'learning_rate': 0.00013533923303834809, 'epoch': 3.25}


 33%|███▎      | 554/1700 [12:09<22:48,  1.19s/it]

{'loss': 0.0953, 'grad_norm': 0.5620408058166504, 'learning_rate': 0.00013522123893805308, 'epoch': 3.25}


 33%|███▎      | 555/1700 [12:10<23:33,  1.23s/it]

{'loss': 0.1108, 'grad_norm': 0.6135044097900391, 'learning_rate': 0.0001351032448377581, 'epoch': 3.26}


 33%|███▎      | 556/1700 [12:12<23:15,  1.22s/it]

{'loss': 0.0512, 'grad_norm': 0.4701314866542816, 'learning_rate': 0.00013498525073746314, 'epoch': 3.27}


 33%|███▎      | 557/1700 [12:13<22:57,  1.21s/it]

{'loss': 0.0709, 'grad_norm': 2.2351226806640625, 'learning_rate': 0.00013486725663716814, 'epoch': 3.27}


 33%|███▎      | 558/1700 [12:14<22:59,  1.21s/it]

{'loss': 0.0959, 'grad_norm': 0.674888551235199, 'learning_rate': 0.00013474926253687316, 'epoch': 3.28}


 33%|███▎      | 559/1700 [12:15<22:49,  1.20s/it]

{'loss': 0.0942, 'grad_norm': 0.7970277070999146, 'learning_rate': 0.0001346312684365782, 'epoch': 3.28}


 33%|███▎      | 560/1700 [12:16<22:59,  1.21s/it]

{'loss': 0.0916, 'grad_norm': 0.7778112888336182, 'learning_rate': 0.00013451327433628321, 'epoch': 3.29}


 33%|███▎      | 561/1700 [12:18<23:31,  1.24s/it]

{'loss': 0.072, 'grad_norm': 0.5273340940475464, 'learning_rate': 0.00013439528023598819, 'epoch': 3.3}


 33%|███▎      | 562/1700 [12:19<23:36,  1.24s/it]

{'loss': 0.0895, 'grad_norm': 0.8342462182044983, 'learning_rate': 0.0001342772861356932, 'epoch': 3.3}


 33%|███▎      | 563/1700 [12:20<23:22,  1.23s/it]

{'loss': 0.0914, 'grad_norm': 0.7015901207923889, 'learning_rate': 0.00013415929203539824, 'epoch': 3.31}


 33%|███▎      | 564/1700 [12:22<25:54,  1.37s/it]

{'loss': 0.224, 'grad_norm': 0.8718050718307495, 'learning_rate': 0.00013404129793510324, 'epoch': 3.31}


 33%|███▎      | 565/1700 [12:23<25:00,  1.32s/it]

{'loss': 0.0701, 'grad_norm': 0.559028685092926, 'learning_rate': 0.00013392330383480826, 'epoch': 3.32}


 33%|███▎      | 566/1700 [12:24<24:01,  1.27s/it]

{'loss': 0.0646, 'grad_norm': 0.5472875833511353, 'learning_rate': 0.0001338053097345133, 'epoch': 3.32}


 33%|███▎      | 567/1700 [12:25<23:44,  1.26s/it]

{'loss': 0.0727, 'grad_norm': 0.5251731872558594, 'learning_rate': 0.0001336873156342183, 'epoch': 3.33}


 33%|███▎      | 568/1700 [12:27<23:18,  1.24s/it]

{'loss': 0.0871, 'grad_norm': 0.6874184012413025, 'learning_rate': 0.00013356932153392331, 'epoch': 3.34}


 33%|███▎      | 569/1700 [12:28<22:45,  1.21s/it]

{'loss': 0.0763, 'grad_norm': 0.7744762301445007, 'learning_rate': 0.00013345132743362831, 'epoch': 3.34}


 34%|███▎      | 570/1700 [12:29<22:19,  1.19s/it]

{'loss': 0.0361, 'grad_norm': 0.6150633692741394, 'learning_rate': 0.00013333333333333334, 'epoch': 3.35}


 34%|███▎      | 571/1700 [12:30<23:21,  1.24s/it]

{'loss': 0.118, 'grad_norm': 0.6318029761314392, 'learning_rate': 0.00013321533923303834, 'epoch': 3.35}


 34%|███▎      | 572/1700 [12:32<24:03,  1.28s/it]

{'loss': 0.0715, 'grad_norm': 0.5808126330375671, 'learning_rate': 0.00013309734513274336, 'epoch': 3.36}


 34%|███▎      | 573/1700 [12:33<24:40,  1.31s/it]

{'loss': 0.1458, 'grad_norm': 0.6816275715827942, 'learning_rate': 0.0001329793510324484, 'epoch': 3.37}


 34%|███▍      | 574/1700 [12:35<28:33,  1.52s/it]

{'loss': 0.0682, 'grad_norm': 0.6217027902603149, 'learning_rate': 0.0001328613569321534, 'epoch': 3.37}


 34%|███▍      | 575/1700 [12:36<27:28,  1.47s/it]

{'loss': 0.1243, 'grad_norm': 0.948890745639801, 'learning_rate': 0.00013274336283185842, 'epoch': 3.38}


 34%|███▍      | 576/1700 [12:38<26:17,  1.40s/it]

{'loss': 0.0642, 'grad_norm': 0.6289039254188538, 'learning_rate': 0.00013262536873156344, 'epoch': 3.38}


 34%|███▍      | 577/1700 [12:39<25:49,  1.38s/it]

{'loss': 0.0895, 'grad_norm': 0.7015479207038879, 'learning_rate': 0.00013250737463126844, 'epoch': 3.39}


 34%|███▍      | 578/1700 [12:40<25:24,  1.36s/it]

{'loss': 0.1199, 'grad_norm': 0.7801327705383301, 'learning_rate': 0.00013238938053097344, 'epoch': 3.4}


 34%|███▍      | 579/1700 [12:41<24:02,  1.29s/it]

{'loss': 0.1953, 'grad_norm': 1.3588720560073853, 'learning_rate': 0.00013227138643067847, 'epoch': 3.4}


 34%|███▍      | 580/1700 [12:43<23:54,  1.28s/it]

{'loss': 0.1274, 'grad_norm': 1.0067638158798218, 'learning_rate': 0.0001321533923303835, 'epoch': 3.41}


 34%|███▍      | 581/1700 [12:44<24:45,  1.33s/it]

{'loss': 0.0495, 'grad_norm': 0.6113691329956055, 'learning_rate': 0.0001320353982300885, 'epoch': 3.41}


 34%|███▍      | 582/1700 [12:45<23:43,  1.27s/it]

{'loss': 0.0722, 'grad_norm': 0.5543832182884216, 'learning_rate': 0.00013191740412979352, 'epoch': 3.42}


 34%|███▍      | 583/1700 [12:46<23:15,  1.25s/it]

{'loss': 0.0608, 'grad_norm': 0.6569904088973999, 'learning_rate': 0.00013179941002949854, 'epoch': 3.42}


 34%|███▍      | 584/1700 [12:48<22:38,  1.22s/it]

{'loss': 0.0887, 'grad_norm': 0.6940115094184875, 'learning_rate': 0.00013168141592920354, 'epoch': 3.43}


 34%|███▍      | 585/1700 [12:49<22:19,  1.20s/it]

{'loss': 0.0699, 'grad_norm': 0.7301088571548462, 'learning_rate': 0.00013156342182890854, 'epoch': 3.44}


 34%|███▍      | 586/1700 [12:50<22:02,  1.19s/it]

{'loss': 0.0351, 'grad_norm': 0.4314098656177521, 'learning_rate': 0.00013144542772861357, 'epoch': 3.44}


 35%|███▍      | 587/1700 [12:51<24:03,  1.30s/it]

{'loss': 0.0979, 'grad_norm': 0.602660596370697, 'learning_rate': 0.0001313274336283186, 'epoch': 3.45}


 35%|███▍      | 588/1700 [12:52<22:43,  1.23s/it]

{'loss': 0.0795, 'grad_norm': 0.5575392246246338, 'learning_rate': 0.0001312094395280236, 'epoch': 3.45}


 35%|███▍      | 589/1700 [12:54<21:45,  1.18s/it]

{'loss': 0.0672, 'grad_norm': 0.6375101208686829, 'learning_rate': 0.00013109144542772862, 'epoch': 3.46}


 35%|███▍      | 590/1700 [12:55<22:05,  1.19s/it]

{'loss': 0.1269, 'grad_norm': 0.5865299105644226, 'learning_rate': 0.00013097345132743365, 'epoch': 3.47}


 35%|███▍      | 591/1700 [12:56<22:34,  1.22s/it]

{'loss': 0.0667, 'grad_norm': 0.5327067971229553, 'learning_rate': 0.00013085545722713864, 'epoch': 3.47}


 35%|███▍      | 592/1700 [12:57<23:23,  1.27s/it]

{'loss': 0.0997, 'grad_norm': 0.5790610909461975, 'learning_rate': 0.00013073746312684367, 'epoch': 3.48}


 35%|███▍      | 593/1700 [12:59<22:35,  1.22s/it]

{'loss': 0.0597, 'grad_norm': 0.7329620122909546, 'learning_rate': 0.00013061946902654867, 'epoch': 3.48}


 35%|███▍      | 594/1700 [13:00<24:23,  1.32s/it]

{'loss': 0.1159, 'grad_norm': 0.7031512260437012, 'learning_rate': 0.0001305014749262537, 'epoch': 3.49}


 35%|███▌      | 595/1700 [13:02<26:25,  1.43s/it]

{'loss': 0.233, 'grad_norm': 0.6977781653404236, 'learning_rate': 0.0001303834808259587, 'epoch': 3.49}


 35%|███▌      | 596/1700 [13:03<26:32,  1.44s/it]

{'loss': 0.1665, 'grad_norm': 0.6231105923652649, 'learning_rate': 0.00013026548672566372, 'epoch': 3.5}


 35%|███▌      | 597/1700 [13:05<25:36,  1.39s/it]

{'loss': 0.0494, 'grad_norm': 0.5723254084587097, 'learning_rate': 0.00013014749262536875, 'epoch': 3.51}


 35%|███▌      | 598/1700 [13:06<24:40,  1.34s/it]

{'loss': 0.1149, 'grad_norm': 0.7747471332550049, 'learning_rate': 0.00013002949852507375, 'epoch': 3.51}


 35%|███▌      | 599/1700 [13:07<23:57,  1.31s/it]

{'loss': 0.0851, 'grad_norm': 0.7876223921775818, 'learning_rate': 0.00012991150442477877, 'epoch': 3.52}


 35%|███▌      | 600/1700 [13:08<22:51,  1.25s/it]

{'loss': 0.0498, 'grad_norm': 0.5742766857147217, 'learning_rate': 0.0001297935103244838, 'epoch': 3.52}


 35%|███▌      | 601/1700 [13:10<23:57,  1.31s/it]

{'loss': 0.0847, 'grad_norm': 0.7430640459060669, 'learning_rate': 0.0001296755162241888, 'epoch': 3.53}


 35%|███▌      | 602/1700 [13:11<23:08,  1.26s/it]

{'loss': 0.0914, 'grad_norm': 0.716773509979248, 'learning_rate': 0.0001295575221238938, 'epoch': 3.54}


 35%|███▌      | 603/1700 [13:12<24:26,  1.34s/it]

{'loss': 0.1246, 'grad_norm': 0.6337476968765259, 'learning_rate': 0.00012943952802359882, 'epoch': 3.54}


 36%|███▌      | 604/1700 [13:13<23:22,  1.28s/it]

{'loss': 0.101, 'grad_norm': 0.7115349173545837, 'learning_rate': 0.00012932153392330385, 'epoch': 3.55}


 36%|███▌      | 605/1700 [13:15<22:47,  1.25s/it]

{'loss': 0.0758, 'grad_norm': 0.6374432444572449, 'learning_rate': 0.00012920353982300885, 'epoch': 3.55}


 36%|███▌      | 606/1700 [13:16<23:27,  1.29s/it]

{'loss': 0.0605, 'grad_norm': 0.601006805896759, 'learning_rate': 0.00012908554572271387, 'epoch': 3.56}


 36%|███▌      | 607/1700 [13:17<24:34,  1.35s/it]

{'loss': 0.058, 'grad_norm': 0.6277137994766235, 'learning_rate': 0.0001289675516224189, 'epoch': 3.57}


 36%|███▌      | 608/1700 [13:18<23:03,  1.27s/it]

{'loss': 0.0403, 'grad_norm': 0.5581316351890564, 'learning_rate': 0.0001288495575221239, 'epoch': 3.57}


 36%|███▌      | 609/1700 [13:20<22:09,  1.22s/it]

{'loss': 0.0638, 'grad_norm': 0.6349217891693115, 'learning_rate': 0.0001287315634218289, 'epoch': 3.58}


 36%|███▌      | 610/1700 [13:21<22:30,  1.24s/it]

{'loss': 0.113, 'grad_norm': 0.6333122849464417, 'learning_rate': 0.00012861356932153392, 'epoch': 3.58}


 36%|███▌      | 611/1700 [13:23<26:22,  1.45s/it]

{'loss': 0.1393, 'grad_norm': 0.7822462916374207, 'learning_rate': 0.00012849557522123895, 'epoch': 3.59}


 36%|███▌      | 612/1700 [13:24<24:58,  1.38s/it]

{'loss': 0.1392, 'grad_norm': 0.7079634666442871, 'learning_rate': 0.00012837758112094395, 'epoch': 3.59}


 36%|███▌      | 613/1700 [13:25<24:57,  1.38s/it]

{'loss': 0.1022, 'grad_norm': 0.5688807368278503, 'learning_rate': 0.00012825958702064897, 'epoch': 3.6}


 36%|███▌      | 614/1700 [13:27<24:22,  1.35s/it]

{'loss': 0.0786, 'grad_norm': 0.5813615322113037, 'learning_rate': 0.000128141592920354, 'epoch': 3.61}


 36%|███▌      | 615/1700 [13:28<23:56,  1.32s/it]

{'loss': 0.1008, 'grad_norm': 0.7776663899421692, 'learning_rate': 0.000128023598820059, 'epoch': 3.61}


 36%|███▌      | 616/1700 [13:29<23:36,  1.31s/it]

{'loss': 0.0563, 'grad_norm': 0.39281147718429565, 'learning_rate': 0.00012790560471976403, 'epoch': 3.62}


 36%|███▋      | 617/1700 [13:30<22:05,  1.22s/it]

{'loss': 0.0587, 'grad_norm': 0.6550167798995972, 'learning_rate': 0.00012778761061946902, 'epoch': 3.62}


 36%|███▋      | 618/1700 [13:31<21:29,  1.19s/it]

{'loss': 0.1294, 'grad_norm': 0.9656148552894592, 'learning_rate': 0.00012766961651917405, 'epoch': 3.63}


 36%|███▋      | 619/1700 [13:33<22:07,  1.23s/it]

{'loss': 0.0777, 'grad_norm': 0.6208047270774841, 'learning_rate': 0.00012755162241887905, 'epoch': 3.64}


 36%|███▋      | 620/1700 [13:34<21:54,  1.22s/it]

{'loss': 0.0866, 'grad_norm': 0.6343453526496887, 'learning_rate': 0.00012743362831858408, 'epoch': 3.64}


 37%|███▋      | 621/1700 [13:35<22:50,  1.27s/it]

{'loss': 0.114, 'grad_norm': 0.6254240870475769, 'learning_rate': 0.0001273156342182891, 'epoch': 3.65}


 37%|███▋      | 622/1700 [13:37<22:58,  1.28s/it]

{'loss': 0.1117, 'grad_norm': 0.7186869382858276, 'learning_rate': 0.0001271976401179941, 'epoch': 3.65}


 37%|███▋      | 623/1700 [13:38<22:47,  1.27s/it]

{'loss': 0.0735, 'grad_norm': 0.6064531207084656, 'learning_rate': 0.00012707964601769913, 'epoch': 3.66}


 37%|███▋      | 624/1700 [13:39<21:56,  1.22s/it]

{'loss': 0.0866, 'grad_norm': 0.9669460654258728, 'learning_rate': 0.00012696165191740415, 'epoch': 3.67}


 37%|███▋      | 625/1700 [13:40<23:46,  1.33s/it]

{'loss': 0.1769, 'grad_norm': 0.7789524793624878, 'learning_rate': 0.00012684365781710915, 'epoch': 3.67}


 37%|███▋      | 626/1700 [13:42<22:19,  1.25s/it]

{'loss': 0.0699, 'grad_norm': 1.1853293180465698, 'learning_rate': 0.00012672566371681415, 'epoch': 3.68}


 37%|███▋      | 627/1700 [13:43<22:04,  1.23s/it]

{'loss': 0.057, 'grad_norm': 0.4891109764575958, 'learning_rate': 0.00012660766961651918, 'epoch': 3.68}


 37%|███▋      | 628/1700 [13:44<20:55,  1.17s/it]

{'loss': 0.0372, 'grad_norm': 0.395973265171051, 'learning_rate': 0.0001264896755162242, 'epoch': 3.69}


 37%|███▋      | 629/1700 [13:45<22:00,  1.23s/it]

{'loss': 0.2006, 'grad_norm': 0.876693069934845, 'learning_rate': 0.0001263716814159292, 'epoch': 3.69}


 37%|███▋      | 630/1700 [13:46<22:19,  1.25s/it]

{'loss': 0.0639, 'grad_norm': 0.5042485594749451, 'learning_rate': 0.00012625368731563423, 'epoch': 3.7}


 37%|███▋      | 631/1700 [13:47<20:56,  1.18s/it]

{'loss': 0.1234, 'grad_norm': 0.9040167927742004, 'learning_rate': 0.00012613569321533926, 'epoch': 3.71}


 37%|███▋      | 632/1700 [13:49<21:35,  1.21s/it]

{'loss': 0.0879, 'grad_norm': 0.6415160894393921, 'learning_rate': 0.00012601769911504425, 'epoch': 3.71}


 37%|███▋      | 633/1700 [13:50<20:53,  1.18s/it]

{'loss': 0.0865, 'grad_norm': 0.8200592994689941, 'learning_rate': 0.00012589970501474925, 'epoch': 3.72}


 37%|███▋      | 634/1700 [13:51<21:25,  1.21s/it]

{'loss': 0.1952, 'grad_norm': 1.0892709493637085, 'learning_rate': 0.00012578171091445428, 'epoch': 3.72}


 37%|███▋      | 635/1700 [13:52<21:52,  1.23s/it]

{'loss': 0.2356, 'grad_norm': 0.9625324010848999, 'learning_rate': 0.0001256637168141593, 'epoch': 3.73}


 37%|███▋      | 636/1700 [13:54<21:45,  1.23s/it]

{'loss': 0.0586, 'grad_norm': 0.5204361081123352, 'learning_rate': 0.0001255457227138643, 'epoch': 3.74}


 37%|███▋      | 637/1700 [13:55<20:36,  1.16s/it]

{'loss': 0.1066, 'grad_norm': 0.7478489875793457, 'learning_rate': 0.00012542772861356933, 'epoch': 3.74}


 38%|███▊      | 638/1700 [13:56<20:58,  1.19s/it]

{'loss': 0.0799, 'grad_norm': 0.5881969332695007, 'learning_rate': 0.00012530973451327436, 'epoch': 3.75}


 38%|███▊      | 639/1700 [13:57<21:36,  1.22s/it]

{'loss': 0.1048, 'grad_norm': 1.0955479145050049, 'learning_rate': 0.00012519174041297936, 'epoch': 3.75}


 38%|███▊      | 640/1700 [13:58<21:13,  1.20s/it]

{'loss': 0.1033, 'grad_norm': 0.6979853510856628, 'learning_rate': 0.00012507374631268438, 'epoch': 3.76}


 38%|███▊      | 641/1700 [14:00<21:50,  1.24s/it]

{'loss': 0.0915, 'grad_norm': 0.6605319380760193, 'learning_rate': 0.00012495575221238938, 'epoch': 3.77}


 38%|███▊      | 642/1700 [14:01<21:31,  1.22s/it]

{'loss': 0.0578, 'grad_norm': 0.4950629770755768, 'learning_rate': 0.0001248377581120944, 'epoch': 3.77}


 38%|███▊      | 643/1700 [14:02<22:21,  1.27s/it]

{'loss': 0.0916, 'grad_norm': 0.5227814316749573, 'learning_rate': 0.0001247197640117994, 'epoch': 3.78}


 38%|███▊      | 644/1700 [14:03<21:29,  1.22s/it]

{'loss': 0.049, 'grad_norm': 0.5494006276130676, 'learning_rate': 0.00012460176991150443, 'epoch': 3.78}


 38%|███▊      | 645/1700 [14:05<21:27,  1.22s/it]

{'loss': 0.0528, 'grad_norm': 0.48887839913368225, 'learning_rate': 0.00012448377581120946, 'epoch': 3.79}


 38%|███▊      | 646/1700 [14:06<21:15,  1.21s/it]

{'loss': 0.1082, 'grad_norm': 0.7039262056350708, 'learning_rate': 0.00012436578171091446, 'epoch': 3.79}


 38%|███▊      | 647/1700 [14:07<20:24,  1.16s/it]

{'loss': 0.0737, 'grad_norm': 0.8175576329231262, 'learning_rate': 0.00012424778761061948, 'epoch': 3.8}


 38%|███▊      | 648/1700 [14:08<19:16,  1.10s/it]

{'loss': 0.0673, 'grad_norm': 0.8981552124023438, 'learning_rate': 0.0001241297935103245, 'epoch': 3.81}


 38%|███▊      | 649/1700 [14:09<20:17,  1.16s/it]

{'loss': 0.1082, 'grad_norm': 0.7542270421981812, 'learning_rate': 0.00012401179941002948, 'epoch': 3.81}


 38%|███▊      | 650/1700 [14:11<25:14,  1.44s/it]

{'loss': 0.0428, 'grad_norm': 0.4365614950656891, 'learning_rate': 0.0001238938053097345, 'epoch': 3.82}


 38%|███▊      | 651/1700 [14:13<25:45,  1.47s/it]

{'loss': 0.0875, 'grad_norm': 0.6300284266471863, 'learning_rate': 0.00012377581120943953, 'epoch': 3.82}


 38%|███▊      | 652/1700 [14:14<24:27,  1.40s/it]

{'loss': 0.0892, 'grad_norm': 0.674411416053772, 'learning_rate': 0.00012365781710914456, 'epoch': 3.83}


 38%|███▊      | 653/1700 [14:15<23:20,  1.34s/it]

{'loss': 0.0586, 'grad_norm': 1.2731170654296875, 'learning_rate': 0.00012353982300884956, 'epoch': 3.84}


 38%|███▊      | 654/1700 [14:16<21:32,  1.24s/it]

{'loss': 0.0699, 'grad_norm': 0.6294065117835999, 'learning_rate': 0.00012342182890855458, 'epoch': 3.84}


 39%|███▊      | 655/1700 [14:17<20:39,  1.19s/it]

{'loss': 0.1469, 'grad_norm': 0.8547914028167725, 'learning_rate': 0.0001233038348082596, 'epoch': 3.85}


 39%|███▊      | 656/1700 [14:18<21:19,  1.23s/it]

{'loss': 0.0766, 'grad_norm': 0.5202840566635132, 'learning_rate': 0.0001231858407079646, 'epoch': 3.85}


 39%|███▊      | 657/1700 [14:20<23:16,  1.34s/it]

{'loss': 0.1128, 'grad_norm': 0.718202531337738, 'learning_rate': 0.0001230678466076696, 'epoch': 3.86}


 39%|███▊      | 658/1700 [14:21<22:46,  1.31s/it]

{'loss': 0.0571, 'grad_norm': 0.40629369020462036, 'learning_rate': 0.00012294985250737463, 'epoch': 3.86}


 39%|███▉      | 659/1700 [14:23<22:28,  1.30s/it]

{'loss': 0.0883, 'grad_norm': 0.646814227104187, 'learning_rate': 0.00012283185840707966, 'epoch': 3.87}


 39%|███▉      | 660/1700 [14:24<22:27,  1.30s/it]

{'loss': 0.0762, 'grad_norm': 0.6162881851196289, 'learning_rate': 0.00012271386430678466, 'epoch': 3.88}


 39%|███▉      | 661/1700 [14:25<22:12,  1.28s/it]

{'loss': 0.0389, 'grad_norm': 0.5313283801078796, 'learning_rate': 0.00012259587020648969, 'epoch': 3.88}


 39%|███▉      | 662/1700 [14:26<21:53,  1.27s/it]

{'loss': 0.0942, 'grad_norm': 0.7237852811813354, 'learning_rate': 0.0001224778761061947, 'epoch': 3.89}


 39%|███▉      | 663/1700 [14:27<20:43,  1.20s/it]

{'loss': 0.0554, 'grad_norm': 0.8338347673416138, 'learning_rate': 0.0001223598820058997, 'epoch': 3.89}


 39%|███▉      | 664/1700 [14:29<20:34,  1.19s/it]

{'loss': 0.0662, 'grad_norm': 0.6765396595001221, 'learning_rate': 0.00012224188790560474, 'epoch': 3.9}


 39%|███▉      | 665/1700 [14:30<21:02,  1.22s/it]

{'loss': 0.0585, 'grad_norm': 0.6202636957168579, 'learning_rate': 0.00012212389380530974, 'epoch': 3.91}


 39%|███▉      | 666/1700 [14:31<21:04,  1.22s/it]

{'loss': 0.0887, 'grad_norm': 0.7885724902153015, 'learning_rate': 0.00012200589970501475, 'epoch': 3.91}


 39%|███▉      | 667/1700 [14:32<20:47,  1.21s/it]

{'loss': 0.0727, 'grad_norm': 0.7004704475402832, 'learning_rate': 0.00012188790560471976, 'epoch': 3.92}


 39%|███▉      | 668/1700 [14:33<20:01,  1.16s/it]

{'loss': 0.0893, 'grad_norm': 0.7045453190803528, 'learning_rate': 0.00012176991150442479, 'epoch': 3.92}


 39%|███▉      | 669/1700 [14:35<21:01,  1.22s/it]

{'loss': 0.0176, 'grad_norm': 0.39163562655448914, 'learning_rate': 0.0001216519174041298, 'epoch': 3.93}


 39%|███▉      | 670/1700 [14:36<21:29,  1.25s/it]

{'loss': 0.0855, 'grad_norm': 0.6236138343811035, 'learning_rate': 0.00012153392330383481, 'epoch': 3.94}


 39%|███▉      | 671/1700 [14:37<20:51,  1.22s/it]

{'loss': 0.0484, 'grad_norm': 0.42299067974090576, 'learning_rate': 0.00012141592920353984, 'epoch': 3.94}


 40%|███▉      | 672/1700 [14:38<21:07,  1.23s/it]

{'loss': 0.0899, 'grad_norm': 0.6784495115280151, 'learning_rate': 0.00012129793510324485, 'epoch': 3.95}


 40%|███▉      | 673/1700 [14:40<22:06,  1.29s/it]

{'loss': 0.1232, 'grad_norm': 0.6584386229515076, 'learning_rate': 0.00012117994100294985, 'epoch': 3.95}


 40%|███▉      | 674/1700 [14:41<21:27,  1.25s/it]

{'loss': 0.1467, 'grad_norm': 0.8047727346420288, 'learning_rate': 0.00012106194690265486, 'epoch': 3.96}


 40%|███▉      | 675/1700 [14:42<21:40,  1.27s/it]

{'loss': 0.0672, 'grad_norm': 0.6030688881874084, 'learning_rate': 0.00012094395280235989, 'epoch': 3.96}


 40%|███▉      | 676/1700 [14:43<20:54,  1.23s/it]

{'loss': 0.0485, 'grad_norm': 0.5022580027580261, 'learning_rate': 0.0001208259587020649, 'epoch': 3.97}


 40%|███▉      | 677/1700 [14:45<21:00,  1.23s/it]

{'loss': 0.1064, 'grad_norm': 0.8877700567245483, 'learning_rate': 0.00012070796460176991, 'epoch': 3.98}


 40%|███▉      | 678/1700 [14:46<21:01,  1.23s/it]

{'loss': 0.0779, 'grad_norm': 0.7552789449691772, 'learning_rate': 0.00012058997050147494, 'epoch': 3.98}


 40%|███▉      | 679/1700 [14:47<20:24,  1.20s/it]

{'loss': 0.0473, 'grad_norm': 0.638366162776947, 'learning_rate': 0.00012047197640117995, 'epoch': 3.99}


 40%|████      | 680/1700 [14:48<20:48,  1.22s/it]

{'loss': 0.0434, 'grad_norm': 0.3325965702533722, 'learning_rate': 0.00012035398230088497, 'epoch': 3.99}


 40%|████      | 681/1700 [14:52<31:28,  1.85s/it]

{'loss': 0.4888, 'grad_norm': 0.6154525876045227, 'learning_rate': 0.00012023598820058996, 'epoch': 4.0}


 40%|████      | 682/1700 [14:53<27:30,  1.62s/it]

{'loss': 0.0477, 'grad_norm': 0.4481050968170166, 'learning_rate': 0.00012011799410029499, 'epoch': 4.01}


 40%|████      | 683/1700 [14:54<24:28,  1.44s/it]

{'loss': 0.0286, 'grad_norm': 0.3366881310939789, 'learning_rate': 0.00012, 'epoch': 4.01}


 40%|████      | 684/1700 [14:55<22:37,  1.34s/it]

{'loss': 0.0304, 'grad_norm': 0.19083179533481598, 'learning_rate': 0.00011988200589970502, 'epoch': 4.02}


 40%|████      | 685/1700 [14:56<22:57,  1.36s/it]

{'loss': 0.0231, 'grad_norm': 0.3579213619232178, 'learning_rate': 0.00011976401179941004, 'epoch': 4.02}


 40%|████      | 686/1700 [14:57<21:49,  1.29s/it]

{'loss': 0.0376, 'grad_norm': 0.5814272165298462, 'learning_rate': 0.00011964601769911505, 'epoch': 4.03}


 40%|████      | 687/1700 [14:59<22:29,  1.33s/it]

{'loss': 0.0205, 'grad_norm': 0.3351588547229767, 'learning_rate': 0.00011952802359882007, 'epoch': 4.04}


 40%|████      | 688/1700 [15:00<24:00,  1.42s/it]

{'loss': 0.0432, 'grad_norm': 0.39030900597572327, 'learning_rate': 0.00011941002949852509, 'epoch': 4.04}


 41%|████      | 689/1700 [15:02<22:45,  1.35s/it]

{'loss': 0.0976, 'grad_norm': 0.685860276222229, 'learning_rate': 0.00011929203539823008, 'epoch': 4.05}


 41%|████      | 690/1700 [15:03<22:22,  1.33s/it]

{'loss': 0.0465, 'grad_norm': 0.5989882946014404, 'learning_rate': 0.0001191740412979351, 'epoch': 4.05}


 41%|████      | 691/1700 [15:04<22:30,  1.34s/it]

{'loss': 0.0796, 'grad_norm': 0.5959202647209167, 'learning_rate': 0.00011905604719764012, 'epoch': 4.06}


 41%|████      | 692/1700 [15:05<21:55,  1.31s/it]

{'loss': 0.121, 'grad_norm': 1.048221230506897, 'learning_rate': 0.00011893805309734514, 'epoch': 4.06}


 41%|████      | 693/1700 [15:07<22:39,  1.35s/it]

{'loss': 0.0433, 'grad_norm': 0.7832534909248352, 'learning_rate': 0.00011882005899705016, 'epoch': 4.07}


 41%|████      | 694/1700 [15:08<22:48,  1.36s/it]

{'loss': 0.0449, 'grad_norm': 0.5047785639762878, 'learning_rate': 0.00011870206489675517, 'epoch': 4.08}


 41%|████      | 695/1700 [15:10<24:10,  1.44s/it]

{'loss': 0.0984, 'grad_norm': 0.9881305694580078, 'learning_rate': 0.0001185840707964602, 'epoch': 4.08}


 41%|████      | 696/1700 [15:11<23:28,  1.40s/it]

{'loss': 0.0154, 'grad_norm': 0.2726325988769531, 'learning_rate': 0.00011846607669616521, 'epoch': 4.09}


 41%|████      | 697/1700 [15:12<21:45,  1.30s/it]

{'loss': 0.0452, 'grad_norm': 0.6260659694671631, 'learning_rate': 0.0001183480825958702, 'epoch': 4.09}


 41%|████      | 698/1700 [15:14<22:20,  1.34s/it]

{'loss': 0.0583, 'grad_norm': 0.5426124334335327, 'learning_rate': 0.00011823008849557522, 'epoch': 4.1}


 41%|████      | 699/1700 [15:15<23:13,  1.39s/it]

{'loss': 0.0429, 'grad_norm': 0.3903161883354187, 'learning_rate': 0.00011811209439528023, 'epoch': 4.11}


 41%|████      | 700/1700 [15:16<22:18,  1.34s/it]

{'loss': 0.0094, 'grad_norm': 0.2634407579898834, 'learning_rate': 0.00011799410029498526, 'epoch': 4.11}


 41%|████      | 701/1700 [15:17<20:22,  1.22s/it]

{'loss': 0.0614, 'grad_norm': 0.5534748435020447, 'learning_rate': 0.00011787610619469027, 'epoch': 4.12}


 41%|████▏     | 702/1700 [15:19<21:27,  1.29s/it]

{'loss': 0.0208, 'grad_norm': 0.26451483368873596, 'learning_rate': 0.0001177581120943953, 'epoch': 4.12}


 41%|████▏     | 703/1700 [15:20<21:54,  1.32s/it]

{'loss': 0.0464, 'grad_norm': 0.4441934823989868, 'learning_rate': 0.00011764011799410031, 'epoch': 4.13}


 41%|████▏     | 704/1700 [15:22<22:13,  1.34s/it]

{'loss': 0.0636, 'grad_norm': 0.49952197074890137, 'learning_rate': 0.00011752212389380532, 'epoch': 4.14}


 41%|████▏     | 705/1700 [15:23<21:17,  1.28s/it]

{'loss': 0.0169, 'grad_norm': 0.3079588711261749, 'learning_rate': 0.00011740412979351032, 'epoch': 4.14}


 42%|████▏     | 706/1700 [15:24<20:20,  1.23s/it]

{'loss': 0.0318, 'grad_norm': 0.4257503151893616, 'learning_rate': 0.00011728613569321533, 'epoch': 4.15}


 42%|████▏     | 707/1700 [15:25<20:51,  1.26s/it]

{'loss': 0.032, 'grad_norm': 0.4679293632507324, 'learning_rate': 0.00011716814159292036, 'epoch': 4.15}


 42%|████▏     | 708/1700 [15:27<21:37,  1.31s/it]

{'loss': 0.0486, 'grad_norm': 0.5181382894515991, 'learning_rate': 0.00011705014749262537, 'epoch': 4.16}


 42%|████▏     | 709/1700 [15:28<20:38,  1.25s/it]

{'loss': 0.0631, 'grad_norm': 0.6492053866386414, 'learning_rate': 0.00011693215339233038, 'epoch': 4.16}


 42%|████▏     | 710/1700 [15:29<20:44,  1.26s/it]

{'loss': 0.0182, 'grad_norm': 0.29467499256134033, 'learning_rate': 0.00011681415929203541, 'epoch': 4.17}


 42%|████▏     | 711/1700 [15:30<20:53,  1.27s/it]

{'loss': 0.0607, 'grad_norm': 0.5521894097328186, 'learning_rate': 0.00011669616519174042, 'epoch': 4.18}


 42%|████▏     | 712/1700 [15:32<20:20,  1.24s/it]

{'loss': 0.0144, 'grad_norm': 0.2005559355020523, 'learning_rate': 0.00011657817109144545, 'epoch': 4.18}


 42%|████▏     | 713/1700 [15:33<20:16,  1.23s/it]

{'loss': 0.0411, 'grad_norm': 0.6710502505302429, 'learning_rate': 0.00011646017699115043, 'epoch': 4.19}


 42%|████▏     | 714/1700 [15:34<22:05,  1.34s/it]

{'loss': 0.0315, 'grad_norm': 0.5231261849403381, 'learning_rate': 0.00011634218289085546, 'epoch': 4.19}


 42%|████▏     | 715/1700 [15:35<21:06,  1.29s/it]

{'loss': 0.0552, 'grad_norm': 0.9228672981262207, 'learning_rate': 0.00011622418879056047, 'epoch': 4.2}


 42%|████▏     | 716/1700 [15:37<21:13,  1.29s/it]

{'loss': 0.0469, 'grad_norm': 0.5201187133789062, 'learning_rate': 0.00011610619469026549, 'epoch': 4.21}


 42%|████▏     | 717/1700 [15:38<21:07,  1.29s/it]

{'loss': 0.033, 'grad_norm': 0.5792028903961182, 'learning_rate': 0.00011598820058997051, 'epoch': 4.21}


 42%|████▏     | 718/1700 [15:39<21:17,  1.30s/it]

{'loss': 0.0347, 'grad_norm': 0.36409667134284973, 'learning_rate': 0.00011587020648967552, 'epoch': 4.22}


 42%|████▏     | 719/1700 [15:41<22:16,  1.36s/it]

{'loss': 0.0825, 'grad_norm': 0.5128181576728821, 'learning_rate': 0.00011575221238938054, 'epoch': 4.22}


 42%|████▏     | 720/1700 [15:42<23:06,  1.41s/it]

{'loss': 0.0523, 'grad_norm': 0.4871678650379181, 'learning_rate': 0.00011563421828908556, 'epoch': 4.23}


 42%|████▏     | 721/1700 [15:44<23:39,  1.45s/it]

{'loss': 0.0741, 'grad_norm': 0.7359675168991089, 'learning_rate': 0.00011551622418879056, 'epoch': 4.23}


 42%|████▏     | 722/1700 [15:45<21:15,  1.30s/it]

{'loss': 0.0333, 'grad_norm': 0.548062264919281, 'learning_rate': 0.00011539823008849557, 'epoch': 4.24}


 43%|████▎     | 723/1700 [15:46<21:14,  1.30s/it]

{'loss': 0.0304, 'grad_norm': 0.4382375180721283, 'learning_rate': 0.00011528023598820059, 'epoch': 4.25}


 43%|████▎     | 724/1700 [15:48<22:00,  1.35s/it]

{'loss': 0.0888, 'grad_norm': 0.762004017829895, 'learning_rate': 0.00011516224188790561, 'epoch': 4.25}


 43%|████▎     | 725/1700 [15:49<21:09,  1.30s/it]

{'loss': 0.031, 'grad_norm': 0.515119194984436, 'learning_rate': 0.00011504424778761063, 'epoch': 4.26}


 43%|████▎     | 726/1700 [15:50<20:07,  1.24s/it]

{'loss': 0.0208, 'grad_norm': 0.3656478524208069, 'learning_rate': 0.00011492625368731564, 'epoch': 4.26}


 43%|████▎     | 727/1700 [15:51<19:56,  1.23s/it]

{'loss': 0.037, 'grad_norm': 0.4925658702850342, 'learning_rate': 0.00011480825958702066, 'epoch': 4.27}


 43%|████▎     | 728/1700 [15:52<19:39,  1.21s/it]

{'loss': 0.0439, 'grad_norm': 0.5450351238250732, 'learning_rate': 0.00011469026548672568, 'epoch': 4.28}


 43%|████▎     | 729/1700 [15:54<19:37,  1.21s/it]

{'loss': 0.0239, 'grad_norm': 0.48387712240219116, 'learning_rate': 0.00011457227138643068, 'epoch': 4.28}


 43%|████▎     | 730/1700 [15:55<19:35,  1.21s/it]

{'loss': 0.0198, 'grad_norm': 0.29816243052482605, 'learning_rate': 0.00011445427728613569, 'epoch': 4.29}


 43%|████▎     | 731/1700 [15:56<19:54,  1.23s/it]

{'loss': 0.048, 'grad_norm': 0.501289963722229, 'learning_rate': 0.00011433628318584071, 'epoch': 4.29}


 43%|████▎     | 732/1700 [15:58<21:38,  1.34s/it]

{'loss': 0.042, 'grad_norm': 0.5492289662361145, 'learning_rate': 0.00011421828908554573, 'epoch': 4.3}


 43%|████▎     | 733/1700 [15:59<22:20,  1.39s/it]

{'loss': 0.0667, 'grad_norm': 0.5164664387702942, 'learning_rate': 0.00011410029498525074, 'epoch': 4.31}


 43%|████▎     | 734/1700 [16:01<22:14,  1.38s/it]

{'loss': 0.0386, 'grad_norm': 0.4519781470298767, 'learning_rate': 0.00011398230088495577, 'epoch': 4.31}


 43%|████▎     | 735/1700 [16:02<20:45,  1.29s/it]

{'loss': 0.0299, 'grad_norm': 0.517452597618103, 'learning_rate': 0.00011386430678466078, 'epoch': 4.32}


 43%|████▎     | 736/1700 [16:03<19:54,  1.24s/it]

{'loss': 0.0676, 'grad_norm': 0.8257027864456177, 'learning_rate': 0.00011374631268436579, 'epoch': 4.32}


 43%|████▎     | 737/1700 [16:04<19:58,  1.24s/it]

{'loss': 0.0398, 'grad_norm': 0.4746130704879761, 'learning_rate': 0.00011362831858407079, 'epoch': 4.33}


 43%|████▎     | 738/1700 [16:05<19:25,  1.21s/it]

{'loss': 0.0077, 'grad_norm': 0.26101747155189514, 'learning_rate': 0.00011351032448377582, 'epoch': 4.33}


 43%|████▎     | 739/1700 [16:06<19:57,  1.25s/it]

{'loss': 0.0298, 'grad_norm': 0.4123755991458893, 'learning_rate': 0.00011339233038348083, 'epoch': 4.34}


 44%|████▎     | 740/1700 [16:08<20:01,  1.25s/it]

{'loss': 0.0457, 'grad_norm': 0.4460901618003845, 'learning_rate': 0.00011327433628318584, 'epoch': 4.35}


 44%|████▎     | 741/1700 [16:09<19:23,  1.21s/it]

{'loss': 0.0419, 'grad_norm': 0.7187311053276062, 'learning_rate': 0.00011315634218289087, 'epoch': 4.35}


 44%|████▎     | 742/1700 [16:10<20:14,  1.27s/it]

{'loss': 0.0345, 'grad_norm': 0.615009069442749, 'learning_rate': 0.00011303834808259588, 'epoch': 4.36}


 44%|████▎     | 743/1700 [16:12<22:10,  1.39s/it]

{'loss': 0.0667, 'grad_norm': 0.617262065410614, 'learning_rate': 0.00011292035398230089, 'epoch': 4.36}


 44%|████▍     | 744/1700 [16:13<21:08,  1.33s/it]

{'loss': 0.0315, 'grad_norm': 0.5755777359008789, 'learning_rate': 0.00011280235988200592, 'epoch': 4.37}


 44%|████▍     | 745/1700 [16:14<20:15,  1.27s/it]

{'loss': 0.041, 'grad_norm': 0.5197160243988037, 'learning_rate': 0.0001126843657817109, 'epoch': 4.38}


 44%|████▍     | 746/1700 [16:16<22:54,  1.44s/it]

{'loss': 0.0392, 'grad_norm': 0.5329770445823669, 'learning_rate': 0.00011256637168141593, 'epoch': 4.38}


 44%|████▍     | 747/1700 [16:17<21:12,  1.34s/it]

{'loss': 0.015, 'grad_norm': 0.25780391693115234, 'learning_rate': 0.00011244837758112094, 'epoch': 4.39}


 44%|████▍     | 748/1700 [16:18<19:55,  1.26s/it]

{'loss': 0.0326, 'grad_norm': 0.4574345052242279, 'learning_rate': 0.00011233038348082597, 'epoch': 4.39}


 44%|████▍     | 749/1700 [16:20<20:15,  1.28s/it]

{'loss': 0.0416, 'grad_norm': 0.4011499881744385, 'learning_rate': 0.00011221238938053098, 'epoch': 4.4}


 44%|████▍     | 750/1700 [16:22<23:21,  1.48s/it]

{'loss': 0.02, 'grad_norm': 0.3554767072200775, 'learning_rate': 0.000112094395280236, 'epoch': 4.41}


 44%|████▍     | 751/1700 [16:23<22:18,  1.41s/it]

{'loss': 0.0505, 'grad_norm': 0.4164162278175354, 'learning_rate': 0.00011197640117994102, 'epoch': 4.41}


 44%|████▍     | 752/1700 [16:24<20:46,  1.32s/it]

{'loss': 0.0706, 'grad_norm': 0.7558259963989258, 'learning_rate': 0.00011185840707964603, 'epoch': 4.42}


 44%|████▍     | 753/1700 [16:25<20:18,  1.29s/it]

{'loss': 0.028, 'grad_norm': 0.4224128723144531, 'learning_rate': 0.00011174041297935103, 'epoch': 4.42}


 44%|████▍     | 754/1700 [16:26<19:50,  1.26s/it]

{'loss': 0.0557, 'grad_norm': 0.6525253653526306, 'learning_rate': 0.00011162241887905604, 'epoch': 4.43}


 44%|████▍     | 755/1700 [16:28<21:03,  1.34s/it]

{'loss': 0.0436, 'grad_norm': 0.3509065508842468, 'learning_rate': 0.00011150442477876106, 'epoch': 4.43}


 44%|████▍     | 756/1700 [16:29<19:43,  1.25s/it]

{'loss': 0.0626, 'grad_norm': 0.891274094581604, 'learning_rate': 0.00011138643067846608, 'epoch': 4.44}


 45%|████▍     | 757/1700 [16:30<20:32,  1.31s/it]

{'loss': 0.0276, 'grad_norm': 0.4325578808784485, 'learning_rate': 0.0001112684365781711, 'epoch': 4.45}


 45%|████▍     | 758/1700 [16:31<19:00,  1.21s/it]

{'loss': 0.0164, 'grad_norm': 0.19158171117305756, 'learning_rate': 0.00011115044247787612, 'epoch': 4.45}


 45%|████▍     | 759/1700 [16:32<18:48,  1.20s/it]

{'loss': 0.0263, 'grad_norm': 0.3487534821033478, 'learning_rate': 0.00011103244837758113, 'epoch': 4.46}


 45%|████▍     | 760/1700 [16:34<18:59,  1.21s/it]

{'loss': 0.0384, 'grad_norm': 0.5309516787528992, 'learning_rate': 0.00011091445427728615, 'epoch': 4.46}


 45%|████▍     | 761/1700 [16:35<18:25,  1.18s/it]

{'loss': 0.0356, 'grad_norm': 0.5048824548721313, 'learning_rate': 0.00011079646017699115, 'epoch': 4.47}


 45%|████▍     | 762/1700 [16:36<18:00,  1.15s/it]

{'loss': 0.0228, 'grad_norm': 0.4946915805339813, 'learning_rate': 0.00011067846607669616, 'epoch': 4.48}


 45%|████▍     | 763/1700 [16:37<17:51,  1.14s/it]

{'loss': 0.0324, 'grad_norm': 0.28623801469802856, 'learning_rate': 0.00011056047197640118, 'epoch': 4.48}


 45%|████▍     | 764/1700 [16:38<18:16,  1.17s/it]

{'loss': 0.017, 'grad_norm': 0.41093045473098755, 'learning_rate': 0.0001104424778761062, 'epoch': 4.49}


 45%|████▌     | 765/1700 [16:39<18:01,  1.16s/it]

{'loss': 0.0695, 'grad_norm': 0.7043501734733582, 'learning_rate': 0.00011032448377581121, 'epoch': 4.49}


 45%|████▌     | 766/1700 [16:41<19:27,  1.25s/it]

{'loss': 0.0568, 'grad_norm': 0.4967963397502899, 'learning_rate': 0.00011020648967551624, 'epoch': 4.5}


 45%|████▌     | 767/1700 [16:42<19:24,  1.25s/it]

{'loss': 0.0427, 'grad_norm': 0.5070489645004272, 'learning_rate': 0.00011008849557522125, 'epoch': 4.51}


 45%|████▌     | 768/1700 [16:43<19:56,  1.28s/it]

{'loss': 0.0443, 'grad_norm': 0.7139681577682495, 'learning_rate': 0.00010997050147492627, 'epoch': 4.51}


 45%|████▌     | 769/1700 [16:45<18:58,  1.22s/it]

{'loss': 0.0552, 'grad_norm': 0.7043991088867188, 'learning_rate': 0.00010985250737463126, 'epoch': 4.52}


 45%|████▌     | 770/1700 [16:46<19:44,  1.27s/it]

{'loss': 0.0347, 'grad_norm': 0.46654564142227173, 'learning_rate': 0.00010973451327433629, 'epoch': 4.52}


 45%|████▌     | 771/1700 [16:47<20:32,  1.33s/it]

{'loss': 0.0511, 'grad_norm': 0.5202688574790955, 'learning_rate': 0.0001096165191740413, 'epoch': 4.53}


 45%|████▌     | 772/1700 [16:49<19:48,  1.28s/it]

{'loss': 0.0391, 'grad_norm': 0.48059582710266113, 'learning_rate': 0.00010949852507374631, 'epoch': 4.53}


 45%|████▌     | 773/1700 [16:50<18:39,  1.21s/it]

{'loss': 0.0548, 'grad_norm': 0.7069327235221863, 'learning_rate': 0.00010938053097345134, 'epoch': 4.54}


 46%|████▌     | 774/1700 [16:51<19:09,  1.24s/it]

{'loss': 0.0371, 'grad_norm': 0.46842777729034424, 'learning_rate': 0.00010926253687315635, 'epoch': 4.55}


 46%|████▌     | 775/1700 [16:52<19:36,  1.27s/it]

{'loss': 0.08, 'grad_norm': 0.3703981637954712, 'learning_rate': 0.00010914454277286138, 'epoch': 4.55}


 46%|████▌     | 776/1700 [16:53<18:33,  1.21s/it]

{'loss': 0.07, 'grad_norm': 0.7837599515914917, 'learning_rate': 0.00010902654867256639, 'epoch': 4.56}


 46%|████▌     | 777/1700 [16:55<19:06,  1.24s/it]

{'loss': 0.0512, 'grad_norm': 0.6839238405227661, 'learning_rate': 0.00010890855457227139, 'epoch': 4.56}


 46%|████▌     | 778/1700 [16:56<18:38,  1.21s/it]

{'loss': 0.0796, 'grad_norm': 0.8427329063415527, 'learning_rate': 0.0001087905604719764, 'epoch': 4.57}


 46%|████▌     | 779/1700 [16:57<18:59,  1.24s/it]

{'loss': 0.0254, 'grad_norm': 0.36679407954216003, 'learning_rate': 0.00010867256637168141, 'epoch': 4.58}


 46%|████▌     | 780/1700 [16:58<18:23,  1.20s/it]

{'loss': 0.0502, 'grad_norm': 0.5763538479804993, 'learning_rate': 0.00010855457227138644, 'epoch': 4.58}


 46%|████▌     | 781/1700 [17:00<19:14,  1.26s/it]

{'loss': 0.0329, 'grad_norm': 0.45583271980285645, 'learning_rate': 0.00010843657817109145, 'epoch': 4.59}


 46%|████▌     | 782/1700 [17:01<19:29,  1.27s/it]

{'loss': 0.0269, 'grad_norm': 0.3529013395309448, 'learning_rate': 0.00010831858407079646, 'epoch': 4.59}


 46%|████▌     | 783/1700 [17:02<19:02,  1.25s/it]

{'loss': 0.052, 'grad_norm': 0.5494232177734375, 'learning_rate': 0.00010820058997050149, 'epoch': 4.6}


 46%|████▌     | 784/1700 [17:03<19:32,  1.28s/it]

{'loss': 0.0461, 'grad_norm': 0.5411089658737183, 'learning_rate': 0.0001080825958702065, 'epoch': 4.6}


 46%|████▌     | 785/1700 [17:05<19:16,  1.26s/it]

{'loss': 0.0331, 'grad_norm': 0.4240851402282715, 'learning_rate': 0.0001079646017699115, 'epoch': 4.61}


 46%|████▌     | 786/1700 [17:06<18:55,  1.24s/it]

{'loss': 0.0373, 'grad_norm': 0.8111423254013062, 'learning_rate': 0.00010784660766961651, 'epoch': 4.62}


 46%|████▋     | 787/1700 [17:07<19:15,  1.27s/it]

{'loss': 0.0562, 'grad_norm': 0.5658760666847229, 'learning_rate': 0.00010772861356932154, 'epoch': 4.62}


 46%|████▋     | 788/1700 [17:08<17:46,  1.17s/it]

{'loss': 0.0323, 'grad_norm': 0.4750378429889679, 'learning_rate': 0.00010761061946902655, 'epoch': 4.63}


 46%|████▋     | 789/1700 [17:09<16:51,  1.11s/it]

{'loss': 0.013, 'grad_norm': 0.4412838816642761, 'learning_rate': 0.00010749262536873156, 'epoch': 4.63}


 46%|████▋     | 790/1700 [17:10<16:15,  1.07s/it]

{'loss': 0.0405, 'grad_norm': 0.6383360624313354, 'learning_rate': 0.00010737463126843659, 'epoch': 4.64}


 47%|████▋     | 791/1700 [17:12<18:08,  1.20s/it]

{'loss': 0.058, 'grad_norm': 0.6166218519210815, 'learning_rate': 0.0001072566371681416, 'epoch': 4.65}


 47%|████▋     | 792/1700 [17:13<19:02,  1.26s/it]

{'loss': 0.0238, 'grad_norm': 0.33347195386886597, 'learning_rate': 0.00010713864306784662, 'epoch': 4.65}


 47%|████▋     | 793/1700 [17:15<21:56,  1.45s/it]

{'loss': 0.0238, 'grad_norm': 0.3861039876937866, 'learning_rate': 0.00010702064896755162, 'epoch': 4.66}


 47%|████▋     | 794/1700 [17:16<20:56,  1.39s/it]

{'loss': 0.0379, 'grad_norm': 0.4529062807559967, 'learning_rate': 0.00010690265486725664, 'epoch': 4.66}


 47%|████▋     | 795/1700 [17:17<20:20,  1.35s/it]

{'loss': 0.0831, 'grad_norm': 0.6497840881347656, 'learning_rate': 0.00010678466076696165, 'epoch': 4.67}


 47%|████▋     | 796/1700 [17:19<20:08,  1.34s/it]

{'loss': 0.0077, 'grad_norm': 0.14556844532489777, 'learning_rate': 0.00010666666666666667, 'epoch': 4.68}


 47%|████▋     | 797/1700 [17:20<20:54,  1.39s/it]

{'loss': 0.0497, 'grad_norm': 0.49661582708358765, 'learning_rate': 0.00010654867256637169, 'epoch': 4.68}


 47%|████▋     | 798/1700 [17:21<20:15,  1.35s/it]

{'loss': 0.0357, 'grad_norm': 0.49908214807510376, 'learning_rate': 0.0001064306784660767, 'epoch': 4.69}


 47%|████▋     | 799/1700 [17:23<19:33,  1.30s/it]

{'loss': 0.0526, 'grad_norm': 0.6837366819381714, 'learning_rate': 0.00010631268436578172, 'epoch': 4.69}


 47%|████▋     | 800/1700 [17:24<19:50,  1.32s/it]

{'loss': 0.0422, 'grad_norm': 0.6395827531814575, 'learning_rate': 0.00010619469026548674, 'epoch': 4.7}


 47%|████▋     | 801/1700 [17:25<19:43,  1.32s/it]

{'loss': 0.0446, 'grad_norm': 0.4736872613430023, 'learning_rate': 0.00010607669616519173, 'epoch': 4.7}


 47%|████▋     | 802/1700 [17:26<18:58,  1.27s/it]

{'loss': 0.0956, 'grad_norm': 1.161752462387085, 'learning_rate': 0.00010595870206489676, 'epoch': 4.71}


 47%|████▋     | 803/1700 [17:28<18:38,  1.25s/it]

{'loss': 0.0634, 'grad_norm': 0.46105486154556274, 'learning_rate': 0.00010584070796460177, 'epoch': 4.72}


 47%|████▋     | 804/1700 [17:29<19:41,  1.32s/it]

{'loss': 0.0375, 'grad_norm': 0.4358801245689392, 'learning_rate': 0.0001057227138643068, 'epoch': 4.72}


 47%|████▋     | 805/1700 [17:30<19:12,  1.29s/it]

{'loss': 0.0151, 'grad_norm': 0.4742107093334198, 'learning_rate': 0.0001056047197640118, 'epoch': 4.73}


 47%|████▋     | 806/1700 [17:32<19:34,  1.31s/it]

{'loss': 0.02, 'grad_norm': 0.3361712396144867, 'learning_rate': 0.00010548672566371682, 'epoch': 4.73}


 47%|████▋     | 807/1700 [17:33<20:33,  1.38s/it]

{'loss': 0.0321, 'grad_norm': 0.38577884435653687, 'learning_rate': 0.00010536873156342185, 'epoch': 4.74}


 48%|████▊     | 808/1700 [17:34<19:28,  1.31s/it]

{'loss': 0.0145, 'grad_norm': 0.3675147593021393, 'learning_rate': 0.00010525073746312686, 'epoch': 4.75}


 48%|████▊     | 809/1700 [17:36<19:05,  1.29s/it]

{'loss': 0.0367, 'grad_norm': 0.41101160645484924, 'learning_rate': 0.00010513274336283186, 'epoch': 4.75}


 48%|████▊     | 810/1700 [17:37<18:48,  1.27s/it]

{'loss': 0.0378, 'grad_norm': 0.5095683336257935, 'learning_rate': 0.00010501474926253687, 'epoch': 4.76}


 48%|████▊     | 811/1700 [17:38<19:22,  1.31s/it]

{'loss': 0.0195, 'grad_norm': 0.37825098633766174, 'learning_rate': 0.0001048967551622419, 'epoch': 4.76}


 48%|████▊     | 812/1700 [17:39<17:43,  1.20s/it]

{'loss': 0.0535, 'grad_norm': 1.2126882076263428, 'learning_rate': 0.00010477876106194691, 'epoch': 4.77}


 48%|████▊     | 813/1700 [17:41<18:39,  1.26s/it]

{'loss': 0.0521, 'grad_norm': 0.4892521798610687, 'learning_rate': 0.00010466076696165192, 'epoch': 4.78}


 48%|████▊     | 814/1700 [17:42<18:25,  1.25s/it]

{'loss': 0.0772, 'grad_norm': 0.6718878149986267, 'learning_rate': 0.00010454277286135695, 'epoch': 4.78}


 48%|████▊     | 815/1700 [17:43<18:26,  1.25s/it]

{'loss': 0.0294, 'grad_norm': 0.5250591039657593, 'learning_rate': 0.00010442477876106196, 'epoch': 4.79}


 48%|████▊     | 816/1700 [17:44<18:43,  1.27s/it]

{'loss': 0.0591, 'grad_norm': 0.6887989044189453, 'learning_rate': 0.00010430678466076697, 'epoch': 4.79}


 48%|████▊     | 817/1700 [17:46<18:59,  1.29s/it]

{'loss': 0.0611, 'grad_norm': 0.6638029217720032, 'learning_rate': 0.00010418879056047197, 'epoch': 4.8}


 48%|████▊     | 818/1700 [17:47<19:09,  1.30s/it]

{'loss': 0.022, 'grad_norm': 0.23982128500938416, 'learning_rate': 0.00010407079646017698, 'epoch': 4.8}


 48%|████▊     | 819/1700 [17:48<18:06,  1.23s/it]

{'loss': 0.0261, 'grad_norm': 0.5160647034645081, 'learning_rate': 0.00010395280235988201, 'epoch': 4.81}


 48%|████▊     | 820/1700 [17:49<18:16,  1.25s/it]

{'loss': 0.0134, 'grad_norm': 0.3984892666339874, 'learning_rate': 0.00010383480825958702, 'epoch': 4.82}


 48%|████▊     | 821/1700 [17:53<26:42,  1.82s/it]

{'loss': 0.3808, 'grad_norm': 0.4626983404159546, 'learning_rate': 0.00010371681415929205, 'epoch': 4.82}


 48%|████▊     | 822/1700 [17:54<23:11,  1.59s/it]

{'loss': 0.0413, 'grad_norm': 0.6118340492248535, 'learning_rate': 0.00010359882005899706, 'epoch': 4.83}


 48%|████▊     | 823/1700 [17:55<22:04,  1.51s/it]

{'loss': 0.0438, 'grad_norm': 0.39249229431152344, 'learning_rate': 0.00010348082595870207, 'epoch': 4.83}


 48%|████▊     | 824/1700 [17:56<19:57,  1.37s/it]

{'loss': 0.0291, 'grad_norm': 0.47458651661872864, 'learning_rate': 0.0001033628318584071, 'epoch': 4.84}


 49%|████▊     | 825/1700 [17:57<18:51,  1.29s/it]

{'loss': 0.0463, 'grad_norm': 0.5483890771865845, 'learning_rate': 0.00010324483775811208, 'epoch': 4.85}


 49%|████▊     | 826/1700 [17:58<18:18,  1.26s/it]

{'loss': 0.0349, 'grad_norm': 0.5090311765670776, 'learning_rate': 0.00010312684365781711, 'epoch': 4.85}


 49%|████▊     | 827/1700 [17:59<17:37,  1.21s/it]

{'loss': 0.0391, 'grad_norm': 0.4638514220714569, 'learning_rate': 0.00010300884955752212, 'epoch': 4.86}


 49%|████▊     | 828/1700 [18:01<18:13,  1.25s/it]

{'loss': 0.045, 'grad_norm': 0.3786599934101105, 'learning_rate': 0.00010289085545722714, 'epoch': 4.86}


 49%|████▉     | 829/1700 [18:02<17:20,  1.19s/it]

{'loss': 0.0176, 'grad_norm': 0.5786208510398865, 'learning_rate': 0.00010277286135693216, 'epoch': 4.87}


 49%|████▉     | 830/1700 [18:03<17:39,  1.22s/it]

{'loss': 0.0355, 'grad_norm': 0.32810693979263306, 'learning_rate': 0.00010265486725663717, 'epoch': 4.88}


 49%|████▉     | 831/1700 [18:04<17:34,  1.21s/it]

{'loss': 0.0254, 'grad_norm': 0.5804570317268372, 'learning_rate': 0.0001025368731563422, 'epoch': 4.88}


 49%|████▉     | 832/1700 [18:05<17:26,  1.21s/it]

{'loss': 0.0381, 'grad_norm': 0.6551224589347839, 'learning_rate': 0.00010241887905604721, 'epoch': 4.89}


 49%|████▉     | 833/1700 [18:06<16:37,  1.15s/it]

{'loss': 0.0587, 'grad_norm': 0.7653539776802063, 'learning_rate': 0.00010230088495575221, 'epoch': 4.89}


 49%|████▉     | 834/1700 [18:08<16:56,  1.17s/it]

{'loss': 0.0241, 'grad_norm': 0.3317427635192871, 'learning_rate': 0.00010218289085545722, 'epoch': 4.9}


 49%|████▉     | 835/1700 [18:09<16:58,  1.18s/it]

{'loss': 0.0906, 'grad_norm': 0.7407499551773071, 'learning_rate': 0.00010206489675516224, 'epoch': 4.9}


 49%|████▉     | 836/1700 [18:10<17:30,  1.22s/it]

{'loss': 0.0266, 'grad_norm': 0.34826502203941345, 'learning_rate': 0.00010194690265486726, 'epoch': 4.91}


 49%|████▉     | 837/1700 [18:11<17:50,  1.24s/it]

{'loss': 0.0301, 'grad_norm': 0.592751145362854, 'learning_rate': 0.00010182890855457228, 'epoch': 4.92}


 49%|████▉     | 838/1700 [18:13<17:57,  1.25s/it]

{'loss': 0.042, 'grad_norm': 0.4887799620628357, 'learning_rate': 0.00010171091445427729, 'epoch': 4.92}


 49%|████▉     | 839/1700 [18:14<18:07,  1.26s/it]

{'loss': 0.032, 'grad_norm': 0.3520846664905548, 'learning_rate': 0.00010159292035398231, 'epoch': 4.93}


 49%|████▉     | 840/1700 [18:15<18:05,  1.26s/it]

{'loss': 0.0511, 'grad_norm': 0.5639903545379639, 'learning_rate': 0.00010147492625368733, 'epoch': 4.93}


 49%|████▉     | 841/1700 [18:16<17:42,  1.24s/it]

{'loss': 0.0163, 'grad_norm': 0.2944570779800415, 'learning_rate': 0.00010135693215339233, 'epoch': 4.94}


 50%|████▉     | 842/1700 [18:18<17:51,  1.25s/it]

{'loss': 0.0561, 'grad_norm': 0.7247239351272583, 'learning_rate': 0.00010123893805309734, 'epoch': 4.95}


 50%|████▉     | 843/1700 [18:19<18:26,  1.29s/it]

{'loss': 0.0563, 'grad_norm': 0.6427937150001526, 'learning_rate': 0.00010112094395280237, 'epoch': 4.95}


 50%|████▉     | 844/1700 [18:21<18:51,  1.32s/it]

{'loss': 0.0563, 'grad_norm': 0.6086072325706482, 'learning_rate': 0.00010100294985250738, 'epoch': 4.96}


 50%|████▉     | 845/1700 [18:22<18:58,  1.33s/it]

{'loss': 0.0331, 'grad_norm': 0.4321194589138031, 'learning_rate': 0.00010088495575221239, 'epoch': 4.96}


 50%|████▉     | 846/1700 [18:23<19:01,  1.34s/it]

{'loss': 0.1356, 'grad_norm': 1.1412502527236938, 'learning_rate': 0.00010076696165191742, 'epoch': 4.97}


 50%|████▉     | 847/1700 [18:25<19:10,  1.35s/it]

{'loss': 0.0457, 'grad_norm': 0.43765345215797424, 'learning_rate': 0.00010064896755162243, 'epoch': 4.98}


 50%|████▉     | 848/1700 [18:26<20:06,  1.42s/it]

{'loss': 0.0414, 'grad_norm': 0.3703695237636566, 'learning_rate': 0.00010053097345132744, 'epoch': 4.98}


 50%|████▉     | 849/1700 [18:28<22:04,  1.56s/it]

{'loss': 0.0246, 'grad_norm': 0.40407657623291016, 'learning_rate': 0.00010041297935103244, 'epoch': 4.99}


 50%|█████     | 850/1700 [18:30<21:41,  1.53s/it]

{'loss': 0.0793, 'grad_norm': 0.673999011516571, 'learning_rate': 0.00010029498525073747, 'epoch': 4.99}


 50%|█████     | 851/1700 [18:31<20:22,  1.44s/it]

{'loss': 0.0217, 'grad_norm': 0.3252149224281311, 'learning_rate': 0.00010017699115044248, 'epoch': 5.0}


 50%|█████     | 852/1700 [18:32<20:38,  1.46s/it]

{'loss': 0.0322, 'grad_norm': 0.3721482753753662, 'learning_rate': 0.00010005899705014749, 'epoch': 5.0}


 50%|█████     | 853/1700 [18:33<18:44,  1.33s/it]

{'loss': 0.0209, 'grad_norm': 0.46502938866615295, 'learning_rate': 9.994100294985252e-05, 'epoch': 5.01}


 50%|█████     | 854/1700 [18:34<17:52,  1.27s/it]

{'loss': 0.0347, 'grad_norm': 0.33503398299217224, 'learning_rate': 9.982300884955753e-05, 'epoch': 5.02}


 50%|█████     | 855/1700 [18:36<17:11,  1.22s/it]

{'loss': 0.0266, 'grad_norm': 0.5000327229499817, 'learning_rate': 9.970501474926254e-05, 'epoch': 5.02}


 50%|█████     | 856/1700 [18:37<17:34,  1.25s/it]

{'loss': 0.0104, 'grad_norm': 0.17547212541103363, 'learning_rate': 9.958702064896756e-05, 'epoch': 5.03}


 50%|█████     | 857/1700 [18:38<16:25,  1.17s/it]

{'loss': 0.0045, 'grad_norm': 0.09725486487150192, 'learning_rate': 9.946902654867257e-05, 'epoch': 5.03}


 50%|█████     | 858/1700 [18:39<17:40,  1.26s/it]

{'loss': 0.02, 'grad_norm': 0.30822551250457764, 'learning_rate': 9.93510324483776e-05, 'epoch': 5.04}


 51%|█████     | 859/1700 [18:41<18:14,  1.30s/it]

{'loss': 0.0235, 'grad_norm': 0.34577298164367676, 'learning_rate': 9.923303834808259e-05, 'epoch': 5.05}


 51%|█████     | 860/1700 [18:42<18:13,  1.30s/it]

{'loss': 0.0586, 'grad_norm': 0.6461871266365051, 'learning_rate': 9.911504424778762e-05, 'epoch': 5.05}


 51%|█████     | 861/1700 [18:43<17:59,  1.29s/it]

{'loss': 0.0225, 'grad_norm': 0.48774945735931396, 'learning_rate': 9.899705014749263e-05, 'epoch': 5.06}


 51%|█████     | 862/1700 [18:45<17:50,  1.28s/it]

{'loss': 0.0134, 'grad_norm': 0.18131153285503387, 'learning_rate': 9.887905604719764e-05, 'epoch': 5.06}


 51%|█████     | 863/1700 [18:46<17:04,  1.22s/it]

{'loss': 0.0152, 'grad_norm': 0.46017593145370483, 'learning_rate': 9.876106194690266e-05, 'epoch': 5.07}


 51%|█████     | 864/1700 [18:47<16:16,  1.17s/it]

{'loss': 0.0077, 'grad_norm': 0.1736249327659607, 'learning_rate': 9.864306784660767e-05, 'epoch': 5.07}


 51%|█████     | 865/1700 [18:48<16:31,  1.19s/it]

{'loss': 0.0129, 'grad_norm': 0.3389453887939453, 'learning_rate': 9.85250737463127e-05, 'epoch': 5.08}


 51%|█████     | 866/1700 [18:49<16:10,  1.16s/it]

{'loss': 0.0062, 'grad_norm': 0.1360403597354889, 'learning_rate': 9.840707964601771e-05, 'epoch': 5.09}


 51%|█████     | 867/1700 [18:50<16:11,  1.17s/it]

{'loss': 0.0076, 'grad_norm': 0.25490862131118774, 'learning_rate': 9.828908554572272e-05, 'epoch': 5.09}


 51%|█████     | 868/1700 [18:51<16:10,  1.17s/it]

{'loss': 0.0605, 'grad_norm': 0.6133254170417786, 'learning_rate': 9.817109144542773e-05, 'epoch': 5.1}


 51%|█████     | 869/1700 [18:52<15:59,  1.15s/it]

{'loss': 0.01, 'grad_norm': 0.24668839573860168, 'learning_rate': 9.805309734513275e-05, 'epoch': 5.1}


 51%|█████     | 870/1700 [18:54<15:42,  1.14s/it]

{'loss': 0.0132, 'grad_norm': 0.2447861135005951, 'learning_rate': 9.793510324483777e-05, 'epoch': 5.11}


 51%|█████     | 871/1700 [18:54<14:49,  1.07s/it]

{'loss': 0.0069, 'grad_norm': 0.3535958230495453, 'learning_rate': 9.781710914454277e-05, 'epoch': 5.12}


 51%|█████▏    | 872/1700 [18:56<15:54,  1.15s/it]

{'loss': 0.0101, 'grad_norm': 0.357860267162323, 'learning_rate': 9.76991150442478e-05, 'epoch': 5.12}


 51%|█████▏    | 873/1700 [18:57<16:18,  1.18s/it]

{'loss': 0.0124, 'grad_norm': 0.356103777885437, 'learning_rate': 9.758112094395281e-05, 'epoch': 5.13}


 51%|█████▏    | 874/1700 [18:58<17:01,  1.24s/it]

{'loss': 0.0255, 'grad_norm': 0.37117645144462585, 'learning_rate': 9.746312684365782e-05, 'epoch': 5.13}


 51%|█████▏    | 875/1700 [19:00<16:47,  1.22s/it]

{'loss': 0.0156, 'grad_norm': 0.5547518730163574, 'learning_rate': 9.734513274336283e-05, 'epoch': 5.14}


 52%|█████▏    | 876/1700 [19:01<15:47,  1.15s/it]

{'loss': 0.0074, 'grad_norm': 0.3645598888397217, 'learning_rate': 9.722713864306785e-05, 'epoch': 5.15}


 52%|█████▏    | 877/1700 [19:02<16:41,  1.22s/it]

{'loss': 0.0207, 'grad_norm': 0.3966379463672638, 'learning_rate': 9.710914454277287e-05, 'epoch': 5.15}


 52%|█████▏    | 878/1700 [19:03<17:52,  1.30s/it]

{'loss': 0.0188, 'grad_norm': 0.40579521656036377, 'learning_rate': 9.699115044247789e-05, 'epoch': 5.16}


 52%|█████▏    | 879/1700 [19:05<17:50,  1.30s/it]

{'loss': 0.0219, 'grad_norm': 0.42807576060295105, 'learning_rate': 9.687315634218289e-05, 'epoch': 5.16}


 52%|█████▏    | 880/1700 [19:06<17:20,  1.27s/it]

{'loss': 0.0165, 'grad_norm': 0.47821301221847534, 'learning_rate': 9.675516224188791e-05, 'epoch': 5.17}


 52%|█████▏    | 881/1700 [19:07<16:19,  1.20s/it]

{'loss': 0.0436, 'grad_norm': 0.43400445580482483, 'learning_rate': 9.663716814159292e-05, 'epoch': 5.17}


 52%|█████▏    | 882/1700 [19:08<16:36,  1.22s/it]

{'loss': 0.0059, 'grad_norm': 0.1622212827205658, 'learning_rate': 9.651917404129795e-05, 'epoch': 5.18}


 52%|█████▏    | 883/1700 [19:10<16:46,  1.23s/it]

{'loss': 0.0223, 'grad_norm': 0.35775133967399597, 'learning_rate': 9.640117994100295e-05, 'epoch': 5.19}


 52%|█████▏    | 884/1700 [19:11<16:39,  1.23s/it]

{'loss': 0.0129, 'grad_norm': 0.3320225179195404, 'learning_rate': 9.628318584070796e-05, 'epoch': 5.19}


 52%|█████▏    | 885/1700 [19:12<16:54,  1.24s/it]

{'loss': 0.028, 'grad_norm': 0.39314553141593933, 'learning_rate': 9.616519174041299e-05, 'epoch': 5.2}


 52%|█████▏    | 886/1700 [19:13<16:49,  1.24s/it]

{'loss': 0.0247, 'grad_norm': 0.5444531440734863, 'learning_rate': 9.6047197640118e-05, 'epoch': 5.2}


 52%|█████▏    | 887/1700 [19:15<17:39,  1.30s/it]

{'loss': 0.0168, 'grad_norm': 0.30189692974090576, 'learning_rate': 9.592920353982301e-05, 'epoch': 5.21}


 52%|█████▏    | 888/1700 [19:16<16:46,  1.24s/it]

{'loss': 0.0154, 'grad_norm': 0.23351424932479858, 'learning_rate': 9.581120943952803e-05, 'epoch': 5.22}


 52%|█████▏    | 889/1700 [19:17<16:17,  1.20s/it]

{'loss': 0.0346, 'grad_norm': 0.4077587425708771, 'learning_rate': 9.569321533923304e-05, 'epoch': 5.22}


 52%|█████▏    | 890/1700 [19:18<16:04,  1.19s/it]

{'loss': 0.0127, 'grad_norm': 0.3665043115615845, 'learning_rate': 9.557522123893806e-05, 'epoch': 5.23}


 52%|█████▏    | 891/1700 [19:19<16:17,  1.21s/it]

{'loss': 0.0156, 'grad_norm': 0.2406829297542572, 'learning_rate': 9.545722713864306e-05, 'epoch': 5.23}


 52%|█████▏    | 892/1700 [19:20<15:33,  1.16s/it]

{'loss': 0.0384, 'grad_norm': 0.617331326007843, 'learning_rate': 9.533923303834809e-05, 'epoch': 5.24}


 53%|█████▎    | 893/1700 [19:22<15:55,  1.18s/it]

{'loss': 0.011, 'grad_norm': 0.28367868065834045, 'learning_rate': 9.52212389380531e-05, 'epoch': 5.25}


 53%|█████▎    | 894/1700 [19:23<16:10,  1.20s/it]

{'loss': 0.0041, 'grad_norm': 0.10240267962217331, 'learning_rate': 9.510324483775811e-05, 'epoch': 5.25}


 53%|█████▎    | 895/1700 [19:24<16:30,  1.23s/it]

{'loss': 0.01, 'grad_norm': 0.15112993121147156, 'learning_rate': 9.498525073746313e-05, 'epoch': 5.26}


 53%|█████▎    | 896/1700 [19:25<16:04,  1.20s/it]

{'loss': 0.0117, 'grad_norm': 0.2603816092014313, 'learning_rate': 9.486725663716814e-05, 'epoch': 5.26}


 53%|█████▎    | 897/1700 [19:27<16:26,  1.23s/it]

{'loss': 0.0066, 'grad_norm': 0.3490310311317444, 'learning_rate': 9.474926253687317e-05, 'epoch': 5.27}


 53%|█████▎    | 898/1700 [19:28<17:28,  1.31s/it]

{'loss': 0.0127, 'grad_norm': 0.24496227502822876, 'learning_rate': 9.463126843657818e-05, 'epoch': 5.27}


 53%|█████▎    | 899/1700 [19:29<16:03,  1.20s/it]

{'loss': 0.0182, 'grad_norm': 0.33332541584968567, 'learning_rate': 9.451327433628319e-05, 'epoch': 5.28}


 53%|█████▎    | 900/1700 [19:30<17:03,  1.28s/it]

{'loss': 0.007, 'grad_norm': 0.2672617435455322, 'learning_rate': 9.43952802359882e-05, 'epoch': 5.29}


 53%|█████▎    | 901/1700 [19:32<16:57,  1.27s/it]

{'loss': 0.0148, 'grad_norm': 0.47991612553596497, 'learning_rate': 9.427728613569322e-05, 'epoch': 5.29}


 53%|█████▎    | 902/1700 [19:33<17:12,  1.29s/it]

{'loss': 0.0072, 'grad_norm': 0.19073070585727692, 'learning_rate': 9.415929203539824e-05, 'epoch': 5.3}


 53%|█████▎    | 903/1700 [19:35<17:55,  1.35s/it]

{'loss': 0.0243, 'grad_norm': 0.27708449959754944, 'learning_rate': 9.404129793510324e-05, 'epoch': 5.3}


 53%|█████▎    | 904/1700 [19:36<17:49,  1.34s/it]

{'loss': 0.0199, 'grad_norm': 0.32960498332977295, 'learning_rate': 9.392330383480827e-05, 'epoch': 5.31}


 53%|█████▎    | 905/1700 [19:37<18:02,  1.36s/it]

{'loss': 0.0373, 'grad_norm': 0.4768200218677521, 'learning_rate': 9.380530973451328e-05, 'epoch': 5.32}


 53%|█████▎    | 906/1700 [19:39<17:25,  1.32s/it]

{'loss': 0.0092, 'grad_norm': 0.23059390485286713, 'learning_rate': 9.368731563421829e-05, 'epoch': 5.32}


 53%|█████▎    | 907/1700 [19:40<17:36,  1.33s/it]

{'loss': 0.008, 'grad_norm': 0.28599366545677185, 'learning_rate': 9.35693215339233e-05, 'epoch': 5.33}


 53%|█████▎    | 908/1700 [19:41<16:38,  1.26s/it]

{'loss': 0.0158, 'grad_norm': 0.45698899030685425, 'learning_rate': 9.345132743362832e-05, 'epoch': 5.33}


 53%|█████▎    | 909/1700 [19:42<16:41,  1.27s/it]

{'loss': 0.012, 'grad_norm': 0.30119723081588745, 'learning_rate': 9.333333333333334e-05, 'epoch': 5.34}


 54%|█████▎    | 910/1700 [19:44<16:37,  1.26s/it]

{'loss': 0.0233, 'grad_norm': 0.46449071168899536, 'learning_rate': 9.321533923303836e-05, 'epoch': 5.35}


 54%|█████▎    | 911/1700 [19:45<15:48,  1.20s/it]

{'loss': 0.0175, 'grad_norm': 0.6792529225349426, 'learning_rate': 9.309734513274337e-05, 'epoch': 5.35}


 54%|█████▎    | 912/1700 [19:46<16:19,  1.24s/it]

{'loss': 0.0179, 'grad_norm': 0.3282140791416168, 'learning_rate': 9.297935103244838e-05, 'epoch': 5.36}


 54%|█████▎    | 913/1700 [19:47<17:15,  1.32s/it]

{'loss': 0.0754, 'grad_norm': 0.619870126247406, 'learning_rate': 9.28613569321534e-05, 'epoch': 5.36}


 54%|█████▍    | 914/1700 [19:49<17:07,  1.31s/it]

{'loss': 0.0203, 'grad_norm': 0.2853488326072693, 'learning_rate': 9.274336283185842e-05, 'epoch': 5.37}


 54%|█████▍    | 915/1700 [19:50<16:18,  1.25s/it]

{'loss': 0.0597, 'grad_norm': 0.6011220216751099, 'learning_rate': 9.262536873156342e-05, 'epoch': 5.37}


 54%|█████▍    | 916/1700 [19:51<16:31,  1.26s/it]

{'loss': 0.0276, 'grad_norm': 0.35204190015792847, 'learning_rate': 9.250737463126844e-05, 'epoch': 5.38}


 54%|█████▍    | 917/1700 [19:52<16:04,  1.23s/it]

{'loss': 0.0322, 'grad_norm': 0.5939125418663025, 'learning_rate': 9.238938053097346e-05, 'epoch': 5.39}


 54%|█████▍    | 918/1700 [19:53<15:59,  1.23s/it]

{'loss': 0.0202, 'grad_norm': 0.33087560534477234, 'learning_rate': 9.227138643067847e-05, 'epoch': 5.39}


 54%|█████▍    | 919/1700 [19:55<16:25,  1.26s/it]

{'loss': 0.0312, 'grad_norm': 0.5433299541473389, 'learning_rate': 9.215339233038348e-05, 'epoch': 5.4}


 54%|█████▍    | 920/1700 [19:56<15:33,  1.20s/it]

{'loss': 0.0294, 'grad_norm': 0.6861935257911682, 'learning_rate': 9.20353982300885e-05, 'epoch': 5.4}


 54%|█████▍    | 921/1700 [19:57<15:24,  1.19s/it]

{'loss': 0.0249, 'grad_norm': 0.5079835057258606, 'learning_rate': 9.191740412979352e-05, 'epoch': 5.41}


 54%|█████▍    | 922/1700 [19:58<14:54,  1.15s/it]

{'loss': 0.0268, 'grad_norm': 0.3768766522407532, 'learning_rate': 9.179941002949853e-05, 'epoch': 5.42}


 54%|█████▍    | 923/1700 [19:59<15:38,  1.21s/it]

{'loss': 0.0074, 'grad_norm': 0.13670504093170166, 'learning_rate': 9.168141592920355e-05, 'epoch': 5.42}


 54%|█████▍    | 924/1700 [20:01<15:43,  1.22s/it]

{'loss': 0.0156, 'grad_norm': 0.2450883388519287, 'learning_rate': 9.156342182890856e-05, 'epoch': 5.43}


 54%|█████▍    | 925/1700 [20:02<16:26,  1.27s/it]

{'loss': 0.011, 'grad_norm': 0.2102566659450531, 'learning_rate': 9.144542772861357e-05, 'epoch': 5.43}


 54%|█████▍    | 926/1700 [20:03<16:14,  1.26s/it]

{'loss': 0.0165, 'grad_norm': 0.2833750247955322, 'learning_rate': 9.13274336283186e-05, 'epoch': 5.44}


 55%|█████▍    | 927/1700 [20:05<17:06,  1.33s/it]

{'loss': 0.0399, 'grad_norm': 0.6337983012199402, 'learning_rate': 9.12094395280236e-05, 'epoch': 5.44}


 55%|█████▍    | 928/1700 [20:07<19:51,  1.54s/it]

{'loss': 0.0227, 'grad_norm': 0.3922756314277649, 'learning_rate': 9.109144542772862e-05, 'epoch': 5.45}


 55%|█████▍    | 929/1700 [20:08<20:04,  1.56s/it]

{'loss': 0.0142, 'grad_norm': 0.309160977602005, 'learning_rate': 9.097345132743364e-05, 'epoch': 5.46}


 55%|█████▍    | 930/1700 [20:10<18:32,  1.45s/it]

{'loss': 0.0126, 'grad_norm': 0.28614771366119385, 'learning_rate': 9.085545722713865e-05, 'epoch': 5.46}


 55%|█████▍    | 931/1700 [20:11<18:13,  1.42s/it]

{'loss': 0.0339, 'grad_norm': 0.5089722871780396, 'learning_rate': 9.073746312684366e-05, 'epoch': 5.47}


 55%|█████▍    | 932/1700 [20:12<18:14,  1.42s/it]

{'loss': 0.0226, 'grad_norm': 0.40104740858078003, 'learning_rate': 9.061946902654867e-05, 'epoch': 5.47}


 55%|█████▍    | 933/1700 [20:14<18:01,  1.41s/it]

{'loss': 0.0263, 'grad_norm': 0.36008942127227783, 'learning_rate': 9.05014749262537e-05, 'epoch': 5.48}


 55%|█████▍    | 934/1700 [20:15<18:28,  1.45s/it]

{'loss': 0.0069, 'grad_norm': 0.2622741758823395, 'learning_rate': 9.038348082595871e-05, 'epoch': 5.49}


 55%|█████▌    | 935/1700 [20:17<18:16,  1.43s/it]

{'loss': 0.0217, 'grad_norm': 0.3978946805000305, 'learning_rate': 9.026548672566371e-05, 'epoch': 5.49}


 55%|█████▌    | 936/1700 [20:18<17:19,  1.36s/it]

{'loss': 0.0285, 'grad_norm': 0.5230662822723389, 'learning_rate': 9.014749262536874e-05, 'epoch': 5.5}


 55%|█████▌    | 937/1700 [20:19<15:57,  1.25s/it]

{'loss': 0.0707, 'grad_norm': 1.203316330909729, 'learning_rate': 9.002949852507375e-05, 'epoch': 5.5}


 55%|█████▌    | 938/1700 [20:20<16:04,  1.27s/it]

{'loss': 0.0166, 'grad_norm': 0.1811501532793045, 'learning_rate': 8.991150442477878e-05, 'epoch': 5.51}


 55%|█████▌    | 939/1700 [20:21<15:53,  1.25s/it]

{'loss': 0.0124, 'grad_norm': 0.24142290651798248, 'learning_rate': 8.979351032448377e-05, 'epoch': 5.52}


 55%|█████▌    | 940/1700 [20:23<16:29,  1.30s/it]

{'loss': 0.0154, 'grad_norm': 0.2380465567111969, 'learning_rate': 8.96755162241888e-05, 'epoch': 5.52}


 55%|█████▌    | 941/1700 [20:24<15:57,  1.26s/it]

{'loss': 0.013, 'grad_norm': 0.16636022925376892, 'learning_rate': 8.955752212389381e-05, 'epoch': 5.53}


 55%|█████▌    | 942/1700 [20:25<14:54,  1.18s/it]

{'loss': 0.0159, 'grad_norm': 0.27820730209350586, 'learning_rate': 8.943952802359883e-05, 'epoch': 5.53}


 55%|█████▌    | 943/1700 [20:26<14:52,  1.18s/it]

{'loss': 0.0221, 'grad_norm': 0.3671732246875763, 'learning_rate': 8.932153392330384e-05, 'epoch': 5.54}


 56%|█████▌    | 944/1700 [20:28<15:38,  1.24s/it]

{'loss': 0.0133, 'grad_norm': 0.1702742576599121, 'learning_rate': 8.920353982300885e-05, 'epoch': 5.54}


 56%|█████▌    | 945/1700 [20:29<15:17,  1.21s/it]

{'loss': 0.0169, 'grad_norm': 0.22991564869880676, 'learning_rate': 8.908554572271388e-05, 'epoch': 5.55}


 56%|█████▌    | 946/1700 [20:30<15:57,  1.27s/it]

{'loss': 0.025, 'grad_norm': 0.4628238081932068, 'learning_rate': 8.896755162241889e-05, 'epoch': 5.56}


 56%|█████▌    | 947/1700 [20:31<15:13,  1.21s/it]

{'loss': 0.0112, 'grad_norm': 0.3499058187007904, 'learning_rate': 8.884955752212389e-05, 'epoch': 5.56}


 56%|█████▌    | 948/1700 [20:32<15:26,  1.23s/it]

{'loss': 0.0249, 'grad_norm': 0.33603543043136597, 'learning_rate': 8.873156342182891e-05, 'epoch': 5.57}


 56%|█████▌    | 949/1700 [20:34<15:09,  1.21s/it]

{'loss': 0.0137, 'grad_norm': 0.24488221108913422, 'learning_rate': 8.861356932153393e-05, 'epoch': 5.57}


 56%|█████▌    | 950/1700 [20:35<15:53,  1.27s/it]

{'loss': 0.0173, 'grad_norm': 0.3046589195728302, 'learning_rate': 8.849557522123895e-05, 'epoch': 5.58}


 56%|█████▌    | 951/1700 [20:36<15:47,  1.26s/it]

{'loss': 0.0148, 'grad_norm': 0.3596307933330536, 'learning_rate': 8.837758112094395e-05, 'epoch': 5.59}


 56%|█████▌    | 952/1700 [20:37<14:58,  1.20s/it]

{'loss': 0.0168, 'grad_norm': 0.3615352511405945, 'learning_rate': 8.825958702064896e-05, 'epoch': 5.59}


 56%|█████▌    | 953/1700 [20:38<14:26,  1.16s/it]

{'loss': 0.0353, 'grad_norm': 0.6186836957931519, 'learning_rate': 8.814159292035399e-05, 'epoch': 5.6}


 56%|█████▌    | 954/1700 [20:40<15:25,  1.24s/it]

{'loss': 0.0223, 'grad_norm': 0.3255365490913391, 'learning_rate': 8.8023598820059e-05, 'epoch': 5.6}


 56%|█████▌    | 955/1700 [20:42<17:43,  1.43s/it]

{'loss': 0.0295, 'grad_norm': 0.23778292536735535, 'learning_rate': 8.790560471976402e-05, 'epoch': 5.61}


 56%|█████▌    | 956/1700 [20:43<16:45,  1.35s/it]

{'loss': 0.0132, 'grad_norm': 0.2537420690059662, 'learning_rate': 8.778761061946903e-05, 'epoch': 5.62}


 56%|█████▋    | 957/1700 [20:44<16:52,  1.36s/it]

{'loss': 0.0123, 'grad_norm': 0.31886497139930725, 'learning_rate': 8.766961651917404e-05, 'epoch': 5.62}


 56%|█████▋    | 958/1700 [20:48<24:45,  2.00s/it]

{'loss': 0.3106, 'grad_norm': 0.31774285435676575, 'learning_rate': 8.755162241887907e-05, 'epoch': 5.63}


 56%|█████▋    | 959/1700 [20:49<22:48,  1.85s/it]

{'loss': 0.0099, 'grad_norm': 0.22796937823295593, 'learning_rate': 8.743362831858407e-05, 'epoch': 5.63}


 56%|█████▋    | 960/1700 [20:50<20:24,  1.65s/it]

{'loss': 0.0118, 'grad_norm': 0.24748027324676514, 'learning_rate': 8.731563421828909e-05, 'epoch': 5.64}


 57%|█████▋    | 961/1700 [20:52<19:50,  1.61s/it]

{'loss': 0.0675, 'grad_norm': 0.5531291365623474, 'learning_rate': 8.71976401179941e-05, 'epoch': 5.64}


 57%|█████▋    | 962/1700 [20:53<18:24,  1.50s/it]

{'loss': 0.0059, 'grad_norm': 0.16061978042125702, 'learning_rate': 8.707964601769912e-05, 'epoch': 5.65}


 57%|█████▋    | 963/1700 [20:54<16:44,  1.36s/it]

{'loss': 0.0029, 'grad_norm': 0.10093330591917038, 'learning_rate': 8.696165191740413e-05, 'epoch': 5.66}


 57%|█████▋    | 964/1700 [20:56<16:32,  1.35s/it]

{'loss': 0.0162, 'grad_norm': 0.27731987833976746, 'learning_rate': 8.684365781710914e-05, 'epoch': 5.66}


 57%|█████▋    | 965/1700 [20:57<16:46,  1.37s/it]

{'loss': 0.0247, 'grad_norm': 0.4361465573310852, 'learning_rate': 8.672566371681417e-05, 'epoch': 5.67}


 57%|█████▋    | 966/1700 [20:58<16:01,  1.31s/it]

{'loss': 0.0162, 'grad_norm': 0.443447470664978, 'learning_rate': 8.660766961651918e-05, 'epoch': 5.67}


 57%|█████▋    | 967/1700 [20:59<15:52,  1.30s/it]

{'loss': 0.0269, 'grad_norm': 0.42818683385849, 'learning_rate': 8.64896755162242e-05, 'epoch': 5.68}


 57%|█████▋    | 968/1700 [21:01<16:00,  1.31s/it]

{'loss': 0.0419, 'grad_norm': 0.36610686779022217, 'learning_rate': 8.63716814159292e-05, 'epoch': 5.69}


 57%|█████▋    | 969/1700 [21:02<16:44,  1.37s/it]

{'loss': 0.0387, 'grad_norm': 0.5072021484375, 'learning_rate': 8.625368731563422e-05, 'epoch': 5.69}


 57%|█████▋    | 970/1700 [21:04<16:51,  1.38s/it]

{'loss': 0.0173, 'grad_norm': 0.3441365957260132, 'learning_rate': 8.613569321533924e-05, 'epoch': 5.7}


 57%|█████▋    | 971/1700 [21:05<17:00,  1.40s/it]

{'loss': 0.0145, 'grad_norm': 0.2390964925289154, 'learning_rate': 8.601769911504424e-05, 'epoch': 5.7}


 57%|█████▋    | 972/1700 [21:06<16:44,  1.38s/it]

{'loss': 0.0098, 'grad_norm': 0.1884075254201889, 'learning_rate': 8.589970501474927e-05, 'epoch': 5.71}


 57%|█████▋    | 973/1700 [21:08<17:04,  1.41s/it]

{'loss': 0.005, 'grad_norm': 0.15005844831466675, 'learning_rate': 8.578171091445428e-05, 'epoch': 5.72}


 57%|█████▋    | 974/1700 [21:09<16:17,  1.35s/it]

{'loss': 0.0137, 'grad_norm': 0.24440579116344452, 'learning_rate': 8.56637168141593e-05, 'epoch': 5.72}


 57%|█████▋    | 975/1700 [21:10<15:10,  1.26s/it]

{'loss': 0.009, 'grad_norm': 0.2246045172214508, 'learning_rate': 8.554572271386431e-05, 'epoch': 5.73}


 57%|█████▋    | 976/1700 [21:11<15:00,  1.24s/it]

{'loss': 0.0196, 'grad_norm': 0.37889137864112854, 'learning_rate': 8.542772861356932e-05, 'epoch': 5.73}


 57%|█████▋    | 977/1700 [21:13<15:04,  1.25s/it]

{'loss': 0.0244, 'grad_norm': 0.4392813444137573, 'learning_rate': 8.530973451327435e-05, 'epoch': 5.74}


 58%|█████▊    | 978/1700 [21:14<15:24,  1.28s/it]

{'loss': 0.0145, 'grad_norm': 0.32762086391448975, 'learning_rate': 8.519174041297936e-05, 'epoch': 5.74}


 58%|█████▊    | 979/1700 [21:16<17:45,  1.48s/it]

{'loss': 0.0204, 'grad_norm': 0.377975732088089, 'learning_rate': 8.507374631268437e-05, 'epoch': 5.75}


 58%|█████▊    | 980/1700 [21:17<17:08,  1.43s/it]

{'loss': 0.0093, 'grad_norm': 0.24267937242984772, 'learning_rate': 8.495575221238938e-05, 'epoch': 5.76}


 58%|█████▊    | 981/1700 [21:18<16:22,  1.37s/it]

{'loss': 0.022, 'grad_norm': 0.5290939211845398, 'learning_rate': 8.48377581120944e-05, 'epoch': 5.76}


 58%|█████▊    | 982/1700 [21:20<15:36,  1.30s/it]

{'loss': 0.0076, 'grad_norm': 0.16313445568084717, 'learning_rate': 8.471976401179942e-05, 'epoch': 5.77}


 58%|█████▊    | 983/1700 [21:22<18:00,  1.51s/it]

{'loss': 0.0491, 'grad_norm': 0.4248090088367462, 'learning_rate': 8.460176991150442e-05, 'epoch': 5.77}


 58%|█████▊    | 984/1700 [21:23<16:31,  1.39s/it]

{'loss': 0.026, 'grad_norm': 0.42975038290023804, 'learning_rate': 8.448377581120945e-05, 'epoch': 5.78}


 58%|█████▊    | 985/1700 [21:25<19:03,  1.60s/it]

{'loss': 0.0094, 'grad_norm': 0.20843063294887543, 'learning_rate': 8.436578171091446e-05, 'epoch': 5.79}


 58%|█████▊    | 986/1700 [21:26<17:44,  1.49s/it]

{'loss': 0.0433, 'grad_norm': 0.47883903980255127, 'learning_rate': 8.424778761061947e-05, 'epoch': 5.79}


 58%|█████▊    | 987/1700 [21:27<17:04,  1.44s/it]

{'loss': 0.0094, 'grad_norm': 0.23490117490291595, 'learning_rate': 8.412979351032449e-05, 'epoch': 5.8}


 58%|█████▊    | 988/1700 [21:29<16:33,  1.40s/it]

{'loss': 0.0156, 'grad_norm': 0.32580864429473877, 'learning_rate': 8.40117994100295e-05, 'epoch': 5.8}


 58%|█████▊    | 989/1700 [21:30<15:52,  1.34s/it]

{'loss': 0.0226, 'grad_norm': 0.21215468645095825, 'learning_rate': 8.389380530973452e-05, 'epoch': 5.81}


 58%|█████▊    | 990/1700 [21:31<15:36,  1.32s/it]

{'loss': 0.0202, 'grad_norm': 0.3506804406642914, 'learning_rate': 8.377581120943954e-05, 'epoch': 5.81}


 58%|█████▊    | 991/1700 [21:32<14:35,  1.24s/it]

{'loss': 0.0021, 'grad_norm': 0.06485147774219513, 'learning_rate': 8.365781710914455e-05, 'epoch': 5.82}


 58%|█████▊    | 992/1700 [21:33<14:38,  1.24s/it]

{'loss': 0.0059, 'grad_norm': 0.10259079933166504, 'learning_rate': 8.353982300884956e-05, 'epoch': 5.83}


 58%|█████▊    | 993/1700 [21:35<15:32,  1.32s/it]

{'loss': 0.0401, 'grad_norm': 0.29002249240875244, 'learning_rate': 8.342182890855457e-05, 'epoch': 5.83}


 58%|█████▊    | 994/1700 [21:37<18:18,  1.56s/it]

{'loss': 0.0076, 'grad_norm': 0.1494644433259964, 'learning_rate': 8.33038348082596e-05, 'epoch': 5.84}


 59%|█████▊    | 995/1700 [21:38<17:17,  1.47s/it]

{'loss': 0.0237, 'grad_norm': 0.4726043939590454, 'learning_rate': 8.31858407079646e-05, 'epoch': 5.84}


 59%|█████▊    | 996/1700 [21:40<16:52,  1.44s/it]

{'loss': 0.0167, 'grad_norm': 0.2589438557624817, 'learning_rate': 8.306784660766963e-05, 'epoch': 5.85}


 59%|█████▊    | 997/1700 [21:41<16:26,  1.40s/it]

{'loss': 0.0082, 'grad_norm': 0.14303936064243317, 'learning_rate': 8.294985250737464e-05, 'epoch': 5.86}


 59%|█████▊    | 998/1700 [21:42<15:07,  1.29s/it]

{'loss': 0.0515, 'grad_norm': 0.6773307919502258, 'learning_rate': 8.283185840707965e-05, 'epoch': 5.86}


 59%|█████▉    | 999/1700 [21:43<14:11,  1.22s/it]

{'loss': 0.0099, 'grad_norm': 0.20192162692546844, 'learning_rate': 8.271386430678466e-05, 'epoch': 5.87}


 59%|█████▉    | 1000/1700 [21:45<15:58,  1.37s/it]

{'loss': 0.0274, 'grad_norm': 0.48661044239997864, 'learning_rate': 8.259587020648968e-05, 'epoch': 5.87}


 59%|█████▉    | 1001/1700 [21:48<21:16,  1.83s/it]

{'loss': 0.0148, 'grad_norm': 0.34370675683021545, 'learning_rate': 8.24778761061947e-05, 'epoch': 5.88}


 59%|█████▉    | 1002/1700 [21:49<19:46,  1.70s/it]

{'loss': 0.0108, 'grad_norm': 0.19034911692142487, 'learning_rate': 8.235988200589971e-05, 'epoch': 5.89}


 59%|█████▉    | 1003/1700 [21:51<18:59,  1.63s/it]

{'loss': 0.012, 'grad_norm': 0.29385945200920105, 'learning_rate': 8.224188790560471e-05, 'epoch': 5.89}


 59%|█████▉    | 1004/1700 [21:52<17:47,  1.53s/it]

{'loss': 0.0476, 'grad_norm': 0.3052496016025543, 'learning_rate': 8.212389380530974e-05, 'epoch': 5.9}


 59%|█████▉    | 1005/1700 [21:53<16:37,  1.44s/it]

{'loss': 0.0154, 'grad_norm': 0.33757197856903076, 'learning_rate': 8.200589970501475e-05, 'epoch': 5.9}


 59%|█████▉    | 1006/1700 [21:55<16:41,  1.44s/it]

{'loss': 0.0122, 'grad_norm': 0.24420158565044403, 'learning_rate': 8.188790560471978e-05, 'epoch': 5.91}


 59%|█████▉    | 1007/1700 [21:56<15:52,  1.37s/it]

{'loss': 0.0129, 'grad_norm': 0.3006286025047302, 'learning_rate': 8.176991150442478e-05, 'epoch': 5.91}


 59%|█████▉    | 1008/1700 [21:57<15:38,  1.36s/it]

{'loss': 0.0265, 'grad_norm': 0.5537888407707214, 'learning_rate': 8.165191740412979e-05, 'epoch': 5.92}


 59%|█████▉    | 1009/1700 [21:58<15:42,  1.36s/it]

{'loss': 0.0168, 'grad_norm': 0.35146209597587585, 'learning_rate': 8.153392330383482e-05, 'epoch': 5.93}


 59%|█████▉    | 1010/1700 [22:00<14:50,  1.29s/it]

{'loss': 0.0137, 'grad_norm': 0.28005480766296387, 'learning_rate': 8.141592920353983e-05, 'epoch': 5.93}


 59%|█████▉    | 1011/1700 [22:01<15:11,  1.32s/it]

{'loss': 0.0229, 'grad_norm': 0.5890603065490723, 'learning_rate': 8.129793510324484e-05, 'epoch': 5.94}


 60%|█████▉    | 1012/1700 [22:02<14:30,  1.27s/it]

{'loss': 0.0144, 'grad_norm': 0.4167032539844513, 'learning_rate': 8.117994100294985e-05, 'epoch': 5.94}


 60%|█████▉    | 1013/1700 [22:03<14:48,  1.29s/it]

{'loss': 0.0171, 'grad_norm': 0.3401804566383362, 'learning_rate': 8.106194690265487e-05, 'epoch': 5.95}


 60%|█████▉    | 1014/1700 [22:05<14:51,  1.30s/it]

{'loss': 0.0265, 'grad_norm': 0.42637109756469727, 'learning_rate': 8.094395280235989e-05, 'epoch': 5.96}


 60%|█████▉    | 1015/1700 [22:06<15:07,  1.33s/it]

{'loss': 0.0137, 'grad_norm': 0.24741606414318085, 'learning_rate': 8.082595870206489e-05, 'epoch': 5.96}


 60%|█████▉    | 1016/1700 [22:07<15:02,  1.32s/it]

{'loss': 0.0213, 'grad_norm': 0.3307209312915802, 'learning_rate': 8.070796460176992e-05, 'epoch': 5.97}


 60%|█████▉    | 1017/1700 [22:09<14:27,  1.27s/it]

{'loss': 0.0184, 'grad_norm': 0.53566974401474, 'learning_rate': 8.058997050147493e-05, 'epoch': 5.97}


 60%|█████▉    | 1018/1700 [22:10<13:58,  1.23s/it]

{'loss': 0.0206, 'grad_norm': 0.3512018918991089, 'learning_rate': 8.047197640117994e-05, 'epoch': 5.98}


 60%|█████▉    | 1019/1700 [22:11<12:54,  1.14s/it]

{'loss': 0.0221, 'grad_norm': 0.3481890559196472, 'learning_rate': 8.035398230088496e-05, 'epoch': 5.99}


 60%|██████    | 1020/1700 [22:12<13:37,  1.20s/it]

{'loss': 0.0127, 'grad_norm': 0.25440967082977295, 'learning_rate': 8.023598820058997e-05, 'epoch': 5.99}


 60%|██████    | 1021/1700 [22:13<14:22,  1.27s/it]

{'loss': 0.0119, 'grad_norm': 0.2635519206523895, 'learning_rate': 8.0117994100295e-05, 'epoch': 6.0}


 60%|██████    | 1022/1700 [22:15<14:04,  1.25s/it]

{'loss': 0.0116, 'grad_norm': 0.2612355351448059, 'learning_rate': 8e-05, 'epoch': 6.0}


 60%|██████    | 1023/1700 [22:16<14:21,  1.27s/it]

{'loss': 0.0023, 'grad_norm': 0.058840151876211166, 'learning_rate': 7.988200589970502e-05, 'epoch': 6.01}


 60%|██████    | 1024/1700 [22:17<13:59,  1.24s/it]

{'loss': 0.0084, 'grad_norm': 0.13591963052749634, 'learning_rate': 7.976401179941003e-05, 'epoch': 6.01}


 60%|██████    | 1025/1700 [22:18<13:52,  1.23s/it]

{'loss': 0.0175, 'grad_norm': 0.3560953736305237, 'learning_rate': 7.964601769911504e-05, 'epoch': 6.02}


 60%|██████    | 1026/1700 [22:20<14:02,  1.25s/it]

{'loss': 0.0136, 'grad_norm': 0.18848717212677002, 'learning_rate': 7.952802359882007e-05, 'epoch': 6.03}


 60%|██████    | 1027/1700 [22:21<14:16,  1.27s/it]

{'loss': 0.0032, 'grad_norm': 0.04634806513786316, 'learning_rate': 7.941002949852507e-05, 'epoch': 6.03}


 60%|██████    | 1028/1700 [22:22<14:29,  1.29s/it]

{'loss': 0.0075, 'grad_norm': 0.12227080017328262, 'learning_rate': 7.92920353982301e-05, 'epoch': 6.04}


 61%|██████    | 1029/1700 [22:24<14:14,  1.27s/it]

{'loss': 0.0165, 'grad_norm': 0.3873947858810425, 'learning_rate': 7.917404129793511e-05, 'epoch': 6.04}


 61%|██████    | 1030/1700 [22:25<14:26,  1.29s/it]

{'loss': 0.0136, 'grad_norm': 0.3981872498989105, 'learning_rate': 7.905604719764012e-05, 'epoch': 6.05}


 61%|██████    | 1031/1700 [22:26<14:21,  1.29s/it]

{'loss': 0.0023, 'grad_norm': 0.04763370752334595, 'learning_rate': 7.893805309734513e-05, 'epoch': 6.06}


 61%|██████    | 1032/1700 [22:27<14:00,  1.26s/it]

{'loss': 0.008, 'grad_norm': 0.13100387156009674, 'learning_rate': 7.882005899705015e-05, 'epoch': 6.06}


 61%|██████    | 1033/1700 [22:29<14:19,  1.29s/it]

{'loss': 0.0062, 'grad_norm': 0.1809554547071457, 'learning_rate': 7.870206489675517e-05, 'epoch': 6.07}


 61%|██████    | 1034/1700 [22:30<14:30,  1.31s/it]

{'loss': 0.0134, 'grad_norm': 0.27113306522369385, 'learning_rate': 7.858407079646018e-05, 'epoch': 6.07}


 61%|██████    | 1035/1700 [22:31<13:45,  1.24s/it]

{'loss': 0.0265, 'grad_norm': 0.5793865323066711, 'learning_rate': 7.84660766961652e-05, 'epoch': 6.08}


 61%|██████    | 1036/1700 [22:32<13:35,  1.23s/it]

{'loss': 0.005, 'grad_norm': 0.17996849119663239, 'learning_rate': 7.834808259587021e-05, 'epoch': 6.09}


 61%|██████    | 1037/1700 [22:34<13:48,  1.25s/it]

{'loss': 0.0067, 'grad_norm': 0.2003515213727951, 'learning_rate': 7.823008849557522e-05, 'epoch': 6.09}


 61%|██████    | 1038/1700 [22:35<13:54,  1.26s/it]

{'loss': 0.0032, 'grad_norm': 0.07473768293857574, 'learning_rate': 7.811209439528025e-05, 'epoch': 6.1}


 61%|██████    | 1039/1700 [22:36<13:15,  1.20s/it]

{'loss': 0.0032, 'grad_norm': 0.09894543141126633, 'learning_rate': 7.799410029498525e-05, 'epoch': 6.1}


 61%|██████    | 1040/1700 [22:37<13:15,  1.21s/it]

{'loss': 0.0041, 'grad_norm': 0.2008466273546219, 'learning_rate': 7.787610619469027e-05, 'epoch': 6.11}


 61%|██████    | 1041/1700 [22:38<13:14,  1.21s/it]

{'loss': 0.0236, 'grad_norm': 1.6211293935775757, 'learning_rate': 7.775811209439529e-05, 'epoch': 6.11}


 61%|██████▏   | 1042/1700 [22:40<14:07,  1.29s/it]

{'loss': 0.005, 'grad_norm': 0.16061951220035553, 'learning_rate': 7.76401179941003e-05, 'epoch': 6.12}


 61%|██████▏   | 1043/1700 [22:41<13:16,  1.21s/it]

{'loss': 0.0081, 'grad_norm': 0.3186921179294586, 'learning_rate': 7.752212389380531e-05, 'epoch': 6.13}


 61%|██████▏   | 1044/1700 [22:42<14:15,  1.30s/it]

{'loss': 0.0041, 'grad_norm': 0.07745511084794998, 'learning_rate': 7.740412979351032e-05, 'epoch': 6.13}


 61%|██████▏   | 1045/1700 [22:44<14:29,  1.33s/it]

{'loss': 0.0026, 'grad_norm': 0.048128917813301086, 'learning_rate': 7.728613569321535e-05, 'epoch': 6.14}


 62%|██████▏   | 1046/1700 [22:45<13:33,  1.24s/it]

{'loss': 0.0087, 'grad_norm': 0.38032984733581543, 'learning_rate': 7.716814159292036e-05, 'epoch': 6.14}


 62%|██████▏   | 1047/1700 [22:46<13:50,  1.27s/it]

{'loss': 0.0056, 'grad_norm': 0.101882703602314, 'learning_rate': 7.705014749262537e-05, 'epoch': 6.15}


 62%|██████▏   | 1048/1700 [22:47<13:25,  1.24s/it]

{'loss': 0.0037, 'grad_norm': 0.145160511136055, 'learning_rate': 7.693215339233039e-05, 'epoch': 6.16}


 62%|██████▏   | 1049/1700 [22:48<12:45,  1.18s/it]

{'loss': 0.0079, 'grad_norm': 0.16849009692668915, 'learning_rate': 7.68141592920354e-05, 'epoch': 6.16}


 62%|██████▏   | 1050/1700 [22:50<13:23,  1.24s/it]

{'loss': 0.0127, 'grad_norm': 0.4657227694988251, 'learning_rate': 7.669616519174043e-05, 'epoch': 6.17}


 62%|██████▏   | 1051/1700 [22:51<14:09,  1.31s/it]

{'loss': 0.0093, 'grad_norm': 0.2562182545661926, 'learning_rate': 7.657817109144543e-05, 'epoch': 6.17}


 62%|██████▏   | 1052/1700 [22:53<14:20,  1.33s/it]

{'loss': 0.0078, 'grad_norm': 0.1815333068370819, 'learning_rate': 7.646017699115045e-05, 'epoch': 6.18}


 62%|██████▏   | 1053/1700 [22:54<14:22,  1.33s/it]

{'loss': 0.0028, 'grad_norm': 0.09477051347494125, 'learning_rate': 7.634218289085546e-05, 'epoch': 6.19}


 62%|██████▏   | 1054/1700 [22:55<14:30,  1.35s/it]

{'loss': 0.0266, 'grad_norm': 0.3231489062309265, 'learning_rate': 7.622418879056048e-05, 'epoch': 6.19}


 62%|██████▏   | 1055/1700 [22:57<13:57,  1.30s/it]

{'loss': 0.0098, 'grad_norm': 0.3680497109889984, 'learning_rate': 7.610619469026549e-05, 'epoch': 6.2}


 62%|██████▏   | 1056/1700 [22:58<13:58,  1.30s/it]

{'loss': 0.0047, 'grad_norm': 0.13953080773353577, 'learning_rate': 7.59882005899705e-05, 'epoch': 6.2}


 62%|██████▏   | 1057/1700 [22:59<14:08,  1.32s/it]

{'loss': 0.0065, 'grad_norm': 0.2100679576396942, 'learning_rate': 7.587020648967553e-05, 'epoch': 6.21}


 62%|██████▏   | 1058/1700 [23:00<12:59,  1.21s/it]

{'loss': 0.0011, 'grad_norm': 0.03239661082625389, 'learning_rate': 7.575221238938054e-05, 'epoch': 6.21}


 62%|██████▏   | 1059/1700 [23:01<13:04,  1.22s/it]

{'loss': 0.0071, 'grad_norm': 0.19524045288562775, 'learning_rate': 7.563421828908554e-05, 'epoch': 6.22}


 62%|██████▏   | 1060/1700 [23:03<13:18,  1.25s/it]

{'loss': 0.0041, 'grad_norm': 0.10502703487873077, 'learning_rate': 7.551622418879057e-05, 'epoch': 6.23}


 62%|██████▏   | 1061/1700 [23:04<12:20,  1.16s/it]

{'loss': 0.001, 'grad_norm': 0.03075515106320381, 'learning_rate': 7.539823008849558e-05, 'epoch': 6.23}


 62%|██████▏   | 1062/1700 [23:05<13:43,  1.29s/it]

{'loss': 0.0063, 'grad_norm': 0.2262258380651474, 'learning_rate': 7.52802359882006e-05, 'epoch': 6.24}


 63%|██████▎   | 1063/1700 [23:07<13:49,  1.30s/it]

{'loss': 0.0041, 'grad_norm': 0.1534758061170578, 'learning_rate': 7.51622418879056e-05, 'epoch': 6.24}


 63%|██████▎   | 1064/1700 [23:08<13:42,  1.29s/it]

{'loss': 0.0088, 'grad_norm': 0.22602394223213196, 'learning_rate': 7.504424778761062e-05, 'epoch': 6.25}


 63%|██████▎   | 1065/1700 [23:09<13:34,  1.28s/it]

{'loss': 0.0097, 'grad_norm': 0.27531924843788147, 'learning_rate': 7.492625368731564e-05, 'epoch': 6.26}


 63%|██████▎   | 1066/1700 [23:10<13:18,  1.26s/it]

{'loss': 0.006, 'grad_norm': 0.33182263374328613, 'learning_rate': 7.480825958702065e-05, 'epoch': 6.26}


 63%|██████▎   | 1067/1700 [23:12<13:28,  1.28s/it]

{'loss': 0.0019, 'grad_norm': 0.04730752855539322, 'learning_rate': 7.469026548672567e-05, 'epoch': 6.27}


 63%|██████▎   | 1068/1700 [23:13<14:10,  1.35s/it]

{'loss': 0.0032, 'grad_norm': 0.15756283700466156, 'learning_rate': 7.457227138643068e-05, 'epoch': 6.27}


 63%|██████▎   | 1069/1700 [23:15<15:01,  1.43s/it]

{'loss': 0.0128, 'grad_norm': 0.4544726610183716, 'learning_rate': 7.445427728613569e-05, 'epoch': 6.28}


 63%|██████▎   | 1070/1700 [23:16<13:37,  1.30s/it]

{'loss': 0.0038, 'grad_norm': 0.08082214742898941, 'learning_rate': 7.433628318584072e-05, 'epoch': 6.28}


 63%|██████▎   | 1071/1700 [23:17<14:03,  1.34s/it]

{'loss': 0.0072, 'grad_norm': 0.2470828890800476, 'learning_rate': 7.421828908554572e-05, 'epoch': 6.29}


 63%|██████▎   | 1072/1700 [23:18<13:26,  1.28s/it]

{'loss': 0.0029, 'grad_norm': 0.07462980598211288, 'learning_rate': 7.410029498525074e-05, 'epoch': 6.3}


 63%|██████▎   | 1073/1700 [23:20<13:19,  1.27s/it]

{'loss': 0.0036, 'grad_norm': 0.13865137100219727, 'learning_rate': 7.398230088495576e-05, 'epoch': 6.3}


 63%|██████▎   | 1074/1700 [23:21<13:37,  1.31s/it]

{'loss': 0.0071, 'grad_norm': 0.23770985007286072, 'learning_rate': 7.386430678466078e-05, 'epoch': 6.31}


 63%|██████▎   | 1075/1700 [23:22<13:40,  1.31s/it]

{'loss': 0.0081, 'grad_norm': 0.19796575605869293, 'learning_rate': 7.374631268436578e-05, 'epoch': 6.31}


 63%|██████▎   | 1076/1700 [23:23<13:02,  1.25s/it]

{'loss': 0.0027, 'grad_norm': 0.07655350118875504, 'learning_rate': 7.36283185840708e-05, 'epoch': 6.32}


 63%|██████▎   | 1077/1700 [23:25<13:31,  1.30s/it]

{'loss': 0.0048, 'grad_norm': 0.08221547305583954, 'learning_rate': 7.351032448377582e-05, 'epoch': 6.33}


 63%|██████▎   | 1078/1700 [23:26<13:19,  1.29s/it]

{'loss': 0.0038, 'grad_norm': 0.1954510658979416, 'learning_rate': 7.339233038348083e-05, 'epoch': 6.33}


 63%|██████▎   | 1079/1700 [23:28<13:43,  1.33s/it]

{'loss': 0.0093, 'grad_norm': 0.2956186830997467, 'learning_rate': 7.327433628318584e-05, 'epoch': 6.34}


 64%|██████▎   | 1080/1700 [23:29<14:38,  1.42s/it]

{'loss': 0.0263, 'grad_norm': 0.7465213537216187, 'learning_rate': 7.315634218289086e-05, 'epoch': 6.34}


 64%|██████▎   | 1081/1700 [23:30<13:16,  1.29s/it]

{'loss': 0.0198, 'grad_norm': 0.6788013577461243, 'learning_rate': 7.303834808259587e-05, 'epoch': 6.35}


 64%|██████▎   | 1082/1700 [23:31<13:13,  1.28s/it]

{'loss': 0.0088, 'grad_norm': 0.13919517397880554, 'learning_rate': 7.29203539823009e-05, 'epoch': 6.36}


 64%|██████▎   | 1083/1700 [23:32<12:20,  1.20s/it]

{'loss': 0.0016, 'grad_norm': 0.11823734641075134, 'learning_rate': 7.28023598820059e-05, 'epoch': 6.36}


 64%|██████▍   | 1084/1700 [23:34<12:11,  1.19s/it]

{'loss': 0.0025, 'grad_norm': 0.11274252086877823, 'learning_rate': 7.268436578171092e-05, 'epoch': 6.37}


 64%|██████▍   | 1085/1700 [23:35<12:19,  1.20s/it]

{'loss': 0.0089, 'grad_norm': 0.1824696809053421, 'learning_rate': 7.256637168141593e-05, 'epoch': 6.37}


 64%|██████▍   | 1086/1700 [23:36<12:06,  1.18s/it]

{'loss': 0.0047, 'grad_norm': 0.14721685647964478, 'learning_rate': 7.244837758112095e-05, 'epoch': 6.38}


 64%|██████▍   | 1087/1700 [23:37<12:39,  1.24s/it]

{'loss': 0.0048, 'grad_norm': 0.10483529418706894, 'learning_rate': 7.233038348082596e-05, 'epoch': 6.38}


 64%|██████▍   | 1088/1700 [23:39<12:43,  1.25s/it]

{'loss': 0.01, 'grad_norm': 0.33336013555526733, 'learning_rate': 7.221238938053097e-05, 'epoch': 6.39}


 64%|██████▍   | 1089/1700 [23:40<13:21,  1.31s/it]

{'loss': 0.0041, 'grad_norm': 0.2187661975622177, 'learning_rate': 7.2094395280236e-05, 'epoch': 6.4}


 64%|██████▍   | 1090/1700 [23:41<13:30,  1.33s/it]

{'loss': 0.0075, 'grad_norm': 0.11305937170982361, 'learning_rate': 7.197640117994101e-05, 'epoch': 6.4}


 64%|██████▍   | 1091/1700 [23:43<13:17,  1.31s/it]

{'loss': 0.0027, 'grad_norm': 0.07313790172338486, 'learning_rate': 7.185840707964602e-05, 'epoch': 6.41}


 64%|██████▍   | 1092/1700 [23:44<13:22,  1.32s/it]

{'loss': 0.0041, 'grad_norm': 0.09007665514945984, 'learning_rate': 7.174041297935103e-05, 'epoch': 6.41}


 64%|██████▍   | 1093/1700 [23:45<12:43,  1.26s/it]

{'loss': 0.0034, 'grad_norm': 0.13548286259174347, 'learning_rate': 7.162241887905605e-05, 'epoch': 6.42}


 64%|██████▍   | 1094/1700 [23:46<12:06,  1.20s/it]

{'loss': 0.0021, 'grad_norm': 0.0695272833108902, 'learning_rate': 7.150442477876107e-05, 'epoch': 6.43}


 64%|██████▍   | 1095/1700 [23:48<12:28,  1.24s/it]

{'loss': 0.0127, 'grad_norm': 0.31718018651008606, 'learning_rate': 7.138643067846607e-05, 'epoch': 6.43}


 64%|██████▍   | 1096/1700 [23:49<12:50,  1.28s/it]

{'loss': 0.0024, 'grad_norm': 0.0833723247051239, 'learning_rate': 7.12684365781711e-05, 'epoch': 6.44}


 65%|██████▍   | 1097/1700 [23:50<13:07,  1.31s/it]

{'loss': 0.005, 'grad_norm': 0.14726121723651886, 'learning_rate': 7.115044247787611e-05, 'epoch': 6.44}


 65%|██████▍   | 1098/1700 [23:52<13:07,  1.31s/it]

{'loss': 0.0026, 'grad_norm': 0.06610681116580963, 'learning_rate': 7.103244837758112e-05, 'epoch': 6.45}


 65%|██████▍   | 1099/1700 [23:53<12:39,  1.26s/it]

{'loss': 0.008, 'grad_norm': 0.20472599565982819, 'learning_rate': 7.091445427728614e-05, 'epoch': 6.46}


 65%|██████▍   | 1100/1700 [23:54<12:05,  1.21s/it]

{'loss': 0.0056, 'grad_norm': 0.1059093028306961, 'learning_rate': 7.079646017699115e-05, 'epoch': 6.46}


 65%|██████▍   | 1101/1700 [23:55<11:43,  1.17s/it]

{'loss': 0.0086, 'grad_norm': 0.24797020852565765, 'learning_rate': 7.067846607669618e-05, 'epoch': 6.47}


 65%|██████▍   | 1102/1700 [23:56<11:43,  1.18s/it]

{'loss': 0.0011, 'grad_norm': 0.05458444729447365, 'learning_rate': 7.056047197640119e-05, 'epoch': 6.47}


 65%|██████▍   | 1103/1700 [23:57<12:04,  1.21s/it]

{'loss': 0.0172, 'grad_norm': 0.5108273029327393, 'learning_rate': 7.04424778761062e-05, 'epoch': 6.48}


 65%|██████▍   | 1104/1700 [23:59<12:24,  1.25s/it]

{'loss': 0.005, 'grad_norm': 0.23181401193141937, 'learning_rate': 7.032448377581121e-05, 'epoch': 6.48}


 65%|██████▌   | 1105/1700 [24:00<13:01,  1.31s/it]

{'loss': 0.007, 'grad_norm': 0.17920371890068054, 'learning_rate': 7.020648967551623e-05, 'epoch': 6.49}


 65%|██████▌   | 1106/1700 [24:02<12:54,  1.30s/it]

{'loss': 0.012, 'grad_norm': 0.6188130974769592, 'learning_rate': 7.008849557522125e-05, 'epoch': 6.5}


 65%|██████▌   | 1107/1700 [24:03<12:18,  1.24s/it]

{'loss': 0.0036, 'grad_norm': 0.18539829552173615, 'learning_rate': 6.997050147492625e-05, 'epoch': 6.5}


 65%|██████▌   | 1108/1700 [24:04<12:59,  1.32s/it]

{'loss': 0.01, 'grad_norm': 0.1877952516078949, 'learning_rate': 6.985250737463128e-05, 'epoch': 6.51}


 65%|██████▌   | 1109/1700 [24:05<12:33,  1.27s/it]

{'loss': 0.0024, 'grad_norm': 0.07864753156900406, 'learning_rate': 6.973451327433629e-05, 'epoch': 6.51}


 65%|██████▌   | 1110/1700 [24:07<12:29,  1.27s/it]

{'loss': 0.0043, 'grad_norm': 0.11430882662534714, 'learning_rate': 6.96165191740413e-05, 'epoch': 6.52}


 65%|██████▌   | 1111/1700 [24:08<12:42,  1.29s/it]

{'loss': 0.0106, 'grad_norm': 0.31213027238845825, 'learning_rate': 6.949852507374631e-05, 'epoch': 6.53}


 65%|██████▌   | 1112/1700 [24:09<12:46,  1.30s/it]

{'loss': 0.0062, 'grad_norm': 0.19814491271972656, 'learning_rate': 6.938053097345133e-05, 'epoch': 6.53}


 65%|██████▌   | 1113/1700 [24:10<12:00,  1.23s/it]

{'loss': 0.0054, 'grad_norm': 0.26411232352256775, 'learning_rate': 6.926253687315635e-05, 'epoch': 6.54}


 66%|██████▌   | 1114/1700 [24:12<12:45,  1.31s/it]

{'loss': 0.0081, 'grad_norm': 0.16653944551944733, 'learning_rate': 6.914454277286137e-05, 'epoch': 6.54}


 66%|██████▌   | 1115/1700 [24:13<12:25,  1.27s/it]

{'loss': 0.0135, 'grad_norm': 0.2545822858810425, 'learning_rate': 6.902654867256638e-05, 'epoch': 6.55}


 66%|██████▌   | 1116/1700 [24:14<12:18,  1.27s/it]

{'loss': 0.0076, 'grad_norm': 0.13325990736484528, 'learning_rate': 6.890855457227139e-05, 'epoch': 6.56}


 66%|██████▌   | 1117/1700 [24:15<11:59,  1.23s/it]

{'loss': 0.0036, 'grad_norm': 0.2919909358024597, 'learning_rate': 6.87905604719764e-05, 'epoch': 6.56}


 66%|██████▌   | 1118/1700 [24:17<12:10,  1.26s/it]

{'loss': 0.0043, 'grad_norm': 0.16092415153980255, 'learning_rate': 6.867256637168143e-05, 'epoch': 6.57}


 66%|██████▌   | 1119/1700 [24:18<12:29,  1.29s/it]

{'loss': 0.0017, 'grad_norm': 0.056142572313547134, 'learning_rate': 6.855457227138643e-05, 'epoch': 6.57}


 66%|██████▌   | 1120/1700 [24:19<12:23,  1.28s/it]

{'loss': 0.0045, 'grad_norm': 0.14895477890968323, 'learning_rate': 6.843657817109145e-05, 'epoch': 6.58}


 66%|██████▌   | 1121/1700 [24:20<12:07,  1.26s/it]

{'loss': 0.0052, 'grad_norm': 0.16431567072868347, 'learning_rate': 6.831858407079647e-05, 'epoch': 6.58}


 66%|██████▌   | 1122/1700 [24:22<14:06,  1.46s/it]

{'loss': 0.0018, 'grad_norm': 0.04262407496571541, 'learning_rate': 6.820058997050148e-05, 'epoch': 6.59}


 66%|██████▌   | 1123/1700 [24:24<15:01,  1.56s/it]

{'loss': 0.016, 'grad_norm': 0.24994951486587524, 'learning_rate': 6.808259587020649e-05, 'epoch': 6.6}


 66%|██████▌   | 1124/1700 [24:25<13:54,  1.45s/it]

{'loss': 0.0037, 'grad_norm': 0.11205364763736725, 'learning_rate': 6.79646017699115e-05, 'epoch': 6.6}


 66%|██████▌   | 1125/1700 [24:27<13:01,  1.36s/it]

{'loss': 0.0025, 'grad_norm': 0.24248240888118744, 'learning_rate': 6.784660766961653e-05, 'epoch': 6.61}


 66%|██████▌   | 1126/1700 [24:28<12:01,  1.26s/it]

{'loss': 0.0418, 'grad_norm': 2.0518791675567627, 'learning_rate': 6.772861356932154e-05, 'epoch': 6.61}


 66%|██████▋   | 1127/1700 [24:29<12:23,  1.30s/it]

{'loss': 0.0055, 'grad_norm': 0.15897683799266815, 'learning_rate': 6.761061946902654e-05, 'epoch': 6.62}


 66%|██████▋   | 1128/1700 [24:30<11:56,  1.25s/it]

{'loss': 0.0053, 'grad_norm': 0.18409046530723572, 'learning_rate': 6.749262536873157e-05, 'epoch': 6.63}


 66%|██████▋   | 1129/1700 [24:32<14:09,  1.49s/it]

{'loss': 0.0098, 'grad_norm': 0.3207595646381378, 'learning_rate': 6.737463126843658e-05, 'epoch': 6.63}


 66%|██████▋   | 1130/1700 [24:34<14:04,  1.48s/it]

{'loss': 0.0162, 'grad_norm': 0.4014735817909241, 'learning_rate': 6.725663716814161e-05, 'epoch': 6.64}


 67%|██████▋   | 1131/1700 [24:35<13:00,  1.37s/it]

{'loss': 0.0021, 'grad_norm': 0.08225615322589874, 'learning_rate': 6.71386430678466e-05, 'epoch': 6.64}


 67%|██████▋   | 1132/1700 [24:36<12:32,  1.32s/it]

{'loss': 0.0024, 'grad_norm': 0.07426077127456665, 'learning_rate': 6.702064896755162e-05, 'epoch': 6.65}


 67%|██████▋   | 1133/1700 [24:37<12:31,  1.33s/it]

{'loss': 0.005, 'grad_norm': 0.29325687885284424, 'learning_rate': 6.690265486725664e-05, 'epoch': 6.65}


 67%|██████▋   | 1134/1700 [24:38<11:56,  1.27s/it]

{'loss': 0.009, 'grad_norm': 0.8445430397987366, 'learning_rate': 6.678466076696166e-05, 'epoch': 6.66}


 67%|██████▋   | 1135/1700 [24:40<12:16,  1.30s/it]

{'loss': 0.0092, 'grad_norm': 0.2481469064950943, 'learning_rate': 6.666666666666667e-05, 'epoch': 6.67}


 67%|██████▋   | 1136/1700 [24:41<11:25,  1.22s/it]

{'loss': 0.0116, 'grad_norm': 0.29076701402664185, 'learning_rate': 6.654867256637168e-05, 'epoch': 6.67}


 67%|██████▋   | 1137/1700 [24:42<11:20,  1.21s/it]

{'loss': 0.006, 'grad_norm': 0.17326882481575012, 'learning_rate': 6.64306784660767e-05, 'epoch': 6.68}


 67%|██████▋   | 1138/1700 [24:43<10:39,  1.14s/it]

{'loss': 0.0234, 'grad_norm': 0.23487509787082672, 'learning_rate': 6.631268436578172e-05, 'epoch': 6.68}


 67%|██████▋   | 1139/1700 [24:44<11:25,  1.22s/it]

{'loss': 0.0174, 'grad_norm': 0.18040752410888672, 'learning_rate': 6.619469026548672e-05, 'epoch': 6.69}


 67%|██████▋   | 1140/1700 [24:45<10:33,  1.13s/it]

{'loss': 0.0172, 'grad_norm': 0.40047183632850647, 'learning_rate': 6.607669616519175e-05, 'epoch': 6.7}


 67%|██████▋   | 1141/1700 [24:47<11:12,  1.20s/it]

{'loss': 0.0081, 'grad_norm': 0.361907035112381, 'learning_rate': 6.595870206489676e-05, 'epoch': 6.7}


 67%|██████▋   | 1142/1700 [24:48<11:46,  1.27s/it]

{'loss': 0.0049, 'grad_norm': 0.14417220652103424, 'learning_rate': 6.584070796460177e-05, 'epoch': 6.71}


 67%|██████▋   | 1143/1700 [24:49<11:53,  1.28s/it]

{'loss': 0.0077, 'grad_norm': 0.11585651338100433, 'learning_rate': 6.572271386430678e-05, 'epoch': 6.71}


 67%|██████▋   | 1144/1700 [24:50<11:11,  1.21s/it]

{'loss': 0.0041, 'grad_norm': 0.33275240659713745, 'learning_rate': 6.56047197640118e-05, 'epoch': 6.72}


 67%|██████▋   | 1145/1700 [24:52<11:08,  1.21s/it]

{'loss': 0.0215, 'grad_norm': 0.6756853461265564, 'learning_rate': 6.548672566371682e-05, 'epoch': 6.73}


 67%|██████▋   | 1146/1700 [24:53<11:05,  1.20s/it]

{'loss': 0.0013, 'grad_norm': 0.029790664091706276, 'learning_rate': 6.536873156342184e-05, 'epoch': 6.73}


 67%|██████▋   | 1147/1700 [24:55<13:45,  1.49s/it]

{'loss': 0.0159, 'grad_norm': 0.38348516821861267, 'learning_rate': 6.525073746312685e-05, 'epoch': 6.74}


 68%|██████▊   | 1148/1700 [24:56<12:32,  1.36s/it]

{'loss': 0.0069, 'grad_norm': 0.13314072787761688, 'learning_rate': 6.513274336283186e-05, 'epoch': 6.74}


 68%|██████▊   | 1149/1700 [24:57<11:59,  1.31s/it]

{'loss': 0.0064, 'grad_norm': 0.1760350614786148, 'learning_rate': 6.501474926253687e-05, 'epoch': 6.75}


 68%|██████▊   | 1150/1700 [24:58<11:34,  1.26s/it]

{'loss': 0.0065, 'grad_norm': 0.2681616544723511, 'learning_rate': 6.48967551622419e-05, 'epoch': 6.75}


 68%|██████▊   | 1151/1700 [25:00<11:12,  1.22s/it]

{'loss': 0.0035, 'grad_norm': 0.06493643671274185, 'learning_rate': 6.47787610619469e-05, 'epoch': 6.76}


 68%|██████▊   | 1152/1700 [25:01<11:26,  1.25s/it]

{'loss': 0.0115, 'grad_norm': 0.19462119042873383, 'learning_rate': 6.466076696165192e-05, 'epoch': 6.77}


 68%|██████▊   | 1153/1700 [25:02<11:41,  1.28s/it]

{'loss': 0.0038, 'grad_norm': 0.07691188901662827, 'learning_rate': 6.454277286135694e-05, 'epoch': 6.77}


 68%|██████▊   | 1154/1700 [25:04<11:58,  1.32s/it]

{'loss': 0.0023, 'grad_norm': 0.0476750023663044, 'learning_rate': 6.442477876106195e-05, 'epoch': 6.78}


 68%|██████▊   | 1155/1700 [25:05<12:00,  1.32s/it]

{'loss': 0.0075, 'grad_norm': 0.1620386689901352, 'learning_rate': 6.430678466076696e-05, 'epoch': 6.78}


 68%|██████▊   | 1156/1700 [25:06<11:50,  1.31s/it]

{'loss': 0.0071, 'grad_norm': 0.1588941067457199, 'learning_rate': 6.418879056047197e-05, 'epoch': 6.79}


 68%|██████▊   | 1157/1700 [25:07<11:01,  1.22s/it]

{'loss': 0.0168, 'grad_norm': 0.41592568159103394, 'learning_rate': 6.4070796460177e-05, 'epoch': 6.8}


 68%|██████▊   | 1158/1700 [25:08<11:07,  1.23s/it]

{'loss': 0.0155, 'grad_norm': 0.2849641442298889, 'learning_rate': 6.395280235988201e-05, 'epoch': 6.8}


 68%|██████▊   | 1159/1700 [25:10<11:17,  1.25s/it]

{'loss': 0.0191, 'grad_norm': 0.2757868468761444, 'learning_rate': 6.383480825958703e-05, 'epoch': 6.81}


 68%|██████▊   | 1160/1700 [25:11<10:51,  1.21s/it]

{'loss': 0.0036, 'grad_norm': 0.11231211572885513, 'learning_rate': 6.371681415929204e-05, 'epoch': 6.81}


 68%|██████▊   | 1161/1700 [25:12<11:16,  1.25s/it]

{'loss': 0.005, 'grad_norm': 0.08394725620746613, 'learning_rate': 6.359882005899705e-05, 'epoch': 6.82}


 68%|██████▊   | 1162/1700 [25:14<12:57,  1.45s/it]

{'loss': 0.003, 'grad_norm': 0.09772428125143051, 'learning_rate': 6.348082595870208e-05, 'epoch': 6.83}


 68%|██████▊   | 1163/1700 [25:15<12:02,  1.35s/it]

{'loss': 0.0053, 'grad_norm': 0.1727488487958908, 'learning_rate': 6.336283185840708e-05, 'epoch': 6.83}


 68%|██████▊   | 1164/1700 [25:16<10:54,  1.22s/it]

{'loss': 0.0064, 'grad_norm': 0.22650152444839478, 'learning_rate': 6.32448377581121e-05, 'epoch': 6.84}


 69%|██████▊   | 1165/1700 [25:17<10:23,  1.17s/it]

{'loss': 0.0147, 'grad_norm': 0.547196626663208, 'learning_rate': 6.312684365781711e-05, 'epoch': 6.84}


 69%|██████▊   | 1166/1700 [25:18<10:27,  1.18s/it]

{'loss': 0.01, 'grad_norm': 0.4594998061656952, 'learning_rate': 6.300884955752213e-05, 'epoch': 6.85}


 69%|██████▊   | 1167/1700 [25:20<10:51,  1.22s/it]

{'loss': 0.0101, 'grad_norm': 0.2048935890197754, 'learning_rate': 6.289085545722714e-05, 'epoch': 6.85}


 69%|██████▊   | 1168/1700 [25:21<11:33,  1.30s/it]

{'loss': 0.0121, 'grad_norm': 0.2253011018037796, 'learning_rate': 6.277286135693215e-05, 'epoch': 6.86}


 69%|██████▉   | 1169/1700 [25:22<10:47,  1.22s/it]

{'loss': 0.0037, 'grad_norm': 0.13882319629192352, 'learning_rate': 6.265486725663718e-05, 'epoch': 6.87}


 69%|██████▉   | 1170/1700 [25:24<10:56,  1.24s/it]

{'loss': 0.0095, 'grad_norm': 0.15511813759803772, 'learning_rate': 6.253687315634219e-05, 'epoch': 6.87}


 69%|██████▉   | 1171/1700 [25:25<11:25,  1.30s/it]

{'loss': 0.0527, 'grad_norm': 0.3844483494758606, 'learning_rate': 6.24188790560472e-05, 'epoch': 6.88}


 69%|██████▉   | 1172/1700 [25:26<11:05,  1.26s/it]

{'loss': 0.0021, 'grad_norm': 0.060704052448272705, 'learning_rate': 6.230088495575222e-05, 'epoch': 6.88}


 69%|██████▉   | 1173/1700 [25:27<10:57,  1.25s/it]

{'loss': 0.0042, 'grad_norm': 0.17538456618785858, 'learning_rate': 6.218289085545723e-05, 'epoch': 6.89}


 69%|██████▉   | 1174/1700 [25:28<10:32,  1.20s/it]

{'loss': 0.0134, 'grad_norm': 0.3728136718273163, 'learning_rate': 6.206489675516225e-05, 'epoch': 6.9}


 69%|██████▉   | 1175/1700 [25:30<10:41,  1.22s/it]

{'loss': 0.0088, 'grad_norm': 0.24993032217025757, 'learning_rate': 6.194690265486725e-05, 'epoch': 6.9}


 69%|██████▉   | 1176/1700 [25:31<10:12,  1.17s/it]

{'loss': 0.0149, 'grad_norm': 0.5818596482276917, 'learning_rate': 6.182890855457228e-05, 'epoch': 6.91}


 69%|██████▉   | 1177/1700 [25:34<15:54,  1.83s/it]

{'loss': 0.2626, 'grad_norm': 0.48282942175865173, 'learning_rate': 6.171091445427729e-05, 'epoch': 6.91}


 69%|██████▉   | 1178/1700 [25:35<14:11,  1.63s/it]

{'loss': 0.0082, 'grad_norm': 0.3046844005584717, 'learning_rate': 6.15929203539823e-05, 'epoch': 6.92}


 69%|██████▉   | 1179/1700 [25:37<13:20,  1.54s/it]

{'loss': 0.0039, 'grad_norm': 0.10298022627830505, 'learning_rate': 6.147492625368732e-05, 'epoch': 6.93}


 69%|██████▉   | 1180/1700 [25:38<12:01,  1.39s/it]

{'loss': 0.0151, 'grad_norm': 0.37036535143852234, 'learning_rate': 6.135693215339233e-05, 'epoch': 6.93}


 69%|██████▉   | 1181/1700 [25:39<12:01,  1.39s/it]

{'loss': 0.009, 'grad_norm': 0.19232790172100067, 'learning_rate': 6.123893805309736e-05, 'epoch': 6.94}


 70%|██████▉   | 1182/1700 [25:40<12:03,  1.40s/it]

{'loss': 0.0294, 'grad_norm': 0.34979167580604553, 'learning_rate': 6.112094395280237e-05, 'epoch': 6.94}


 70%|██████▉   | 1183/1700 [25:42<11:54,  1.38s/it]

{'loss': 0.0031, 'grad_norm': 0.11802466213703156, 'learning_rate': 6.1002949852507374e-05, 'epoch': 6.95}


 70%|██████▉   | 1184/1700 [25:43<12:06,  1.41s/it]

{'loss': 0.0145, 'grad_norm': 0.11114718019962311, 'learning_rate': 6.0884955752212394e-05, 'epoch': 6.95}


 70%|██████▉   | 1185/1700 [25:45<11:47,  1.37s/it]

{'loss': 0.0061, 'grad_norm': 0.1314108669757843, 'learning_rate': 6.0766961651917406e-05, 'epoch': 6.96}


 70%|██████▉   | 1186/1700 [25:46<11:28,  1.34s/it]

{'loss': 0.0035, 'grad_norm': 0.09577203541994095, 'learning_rate': 6.0648967551622426e-05, 'epoch': 6.97}


 70%|██████▉   | 1187/1700 [25:47<11:23,  1.33s/it]

{'loss': 0.0023, 'grad_norm': 0.04459249600768089, 'learning_rate': 6.053097345132743e-05, 'epoch': 6.97}


 70%|██████▉   | 1188/1700 [25:48<11:04,  1.30s/it]

{'loss': 0.0209, 'grad_norm': 0.22404859960079193, 'learning_rate': 6.041297935103245e-05, 'epoch': 6.98}


 70%|██████▉   | 1189/1700 [25:50<12:23,  1.46s/it]

{'loss': 0.0106, 'grad_norm': 0.2801644504070282, 'learning_rate': 6.029498525073747e-05, 'epoch': 6.98}


 70%|███████   | 1190/1700 [25:51<11:26,  1.35s/it]

{'loss': 0.0133, 'grad_norm': 0.2829315960407257, 'learning_rate': 6.017699115044248e-05, 'epoch': 6.99}


 70%|███████   | 1191/1700 [25:53<12:01,  1.42s/it]

{'loss': 0.0099, 'grad_norm': 0.29732248187065125, 'learning_rate': 6.0058997050147495e-05, 'epoch': 7.0}


 70%|███████   | 1192/1700 [25:54<11:31,  1.36s/it]

{'loss': 0.0023, 'grad_norm': 0.06531300395727158, 'learning_rate': 5.994100294985251e-05, 'epoch': 7.0}


 70%|███████   | 1193/1700 [25:55<11:01,  1.30s/it]

{'loss': 0.002, 'grad_norm': 0.14571717381477356, 'learning_rate': 5.982300884955753e-05, 'epoch': 7.01}


 70%|███████   | 1194/1700 [25:57<10:49,  1.28s/it]

{'loss': 0.014, 'grad_norm': 0.5381768345832825, 'learning_rate': 5.9705014749262546e-05, 'epoch': 7.01}


 70%|███████   | 1195/1700 [25:58<11:02,  1.31s/it]

{'loss': 0.0026, 'grad_norm': 0.06464565545320511, 'learning_rate': 5.958702064896755e-05, 'epoch': 7.02}


 70%|███████   | 1196/1700 [25:59<10:37,  1.26s/it]

{'loss': 0.0013, 'grad_norm': 0.02460831217467785, 'learning_rate': 5.946902654867257e-05, 'epoch': 7.02}


 70%|███████   | 1197/1700 [26:01<11:21,  1.35s/it]

{'loss': 0.0035, 'grad_norm': 0.0496598519384861, 'learning_rate': 5.9351032448377584e-05, 'epoch': 7.03}


 70%|███████   | 1198/1700 [26:02<10:51,  1.30s/it]

{'loss': 0.0092, 'grad_norm': 0.08558933436870575, 'learning_rate': 5.9233038348082603e-05, 'epoch': 7.04}


 71%|███████   | 1199/1700 [26:03<10:54,  1.31s/it]

{'loss': 0.0022, 'grad_norm': 0.039208222180604935, 'learning_rate': 5.911504424778761e-05, 'epoch': 7.04}


 71%|███████   | 1200/1700 [26:05<11:08,  1.34s/it]

{'loss': 0.0019, 'grad_norm': 0.08056633919477463, 'learning_rate': 5.899705014749263e-05, 'epoch': 7.05}


 71%|███████   | 1201/1700 [26:06<11:30,  1.38s/it]

{'loss': 0.0021, 'grad_norm': 0.057510122656822205, 'learning_rate': 5.887905604719765e-05, 'epoch': 7.05}


 71%|███████   | 1202/1700 [26:07<11:34,  1.39s/it]

{'loss': 0.0038, 'grad_norm': 0.10912798345088959, 'learning_rate': 5.876106194690266e-05, 'epoch': 7.06}


 71%|███████   | 1203/1700 [26:10<13:14,  1.60s/it]

{'loss': 0.0032, 'grad_norm': 0.08612672239542007, 'learning_rate': 5.8643067846607666e-05, 'epoch': 7.07}


 71%|███████   | 1204/1700 [26:11<12:50,  1.55s/it]

{'loss': 0.0062, 'grad_norm': 0.11908131837844849, 'learning_rate': 5.8525073746312686e-05, 'epoch': 7.07}


 71%|███████   | 1205/1700 [26:12<12:03,  1.46s/it]

{'loss': 0.0314, 'grad_norm': 0.35755306482315063, 'learning_rate': 5.8407079646017705e-05, 'epoch': 7.08}


 71%|███████   | 1206/1700 [26:13<11:34,  1.41s/it]

{'loss': 0.0024, 'grad_norm': 0.058663032948970795, 'learning_rate': 5.8289085545722724e-05, 'epoch': 7.08}


 71%|███████   | 1207/1700 [26:15<10:50,  1.32s/it]

{'loss': 0.0028, 'grad_norm': 0.06886229664087296, 'learning_rate': 5.817109144542773e-05, 'epoch': 7.09}


 71%|███████   | 1208/1700 [26:16<10:47,  1.32s/it]

{'loss': 0.0063, 'grad_norm': 0.35154110193252563, 'learning_rate': 5.805309734513274e-05, 'epoch': 7.1}


 71%|███████   | 1209/1700 [26:17<10:49,  1.32s/it]

{'loss': 0.0018, 'grad_norm': 0.06070705130696297, 'learning_rate': 5.793510324483776e-05, 'epoch': 7.1}


 71%|███████   | 1210/1700 [26:18<10:25,  1.28s/it]

{'loss': 0.0095, 'grad_norm': 1.0366648435592651, 'learning_rate': 5.781710914454278e-05, 'epoch': 7.11}


 71%|███████   | 1211/1700 [26:20<10:14,  1.26s/it]

{'loss': 0.0015, 'grad_norm': 0.05962327867746353, 'learning_rate': 5.769911504424779e-05, 'epoch': 7.11}


 71%|███████▏  | 1212/1700 [26:21<09:59,  1.23s/it]

{'loss': 0.0009, 'grad_norm': 0.013012710027396679, 'learning_rate': 5.7581120943952806e-05, 'epoch': 7.12}


 71%|███████▏  | 1213/1700 [26:22<10:38,  1.31s/it]

{'loss': 0.0088, 'grad_norm': 0.15596818923950195, 'learning_rate': 5.746312684365782e-05, 'epoch': 7.12}


 71%|███████▏  | 1214/1700 [26:24<10:24,  1.29s/it]

{'loss': 0.0276, 'grad_norm': 0.6023667454719543, 'learning_rate': 5.734513274336284e-05, 'epoch': 7.13}


 71%|███████▏  | 1215/1700 [26:25<10:17,  1.27s/it]

{'loss': 0.0032, 'grad_norm': 0.07624094933271408, 'learning_rate': 5.7227138643067844e-05, 'epoch': 7.14}


 72%|███████▏  | 1216/1700 [26:26<10:40,  1.32s/it]

{'loss': 0.0042, 'grad_norm': 0.10758288204669952, 'learning_rate': 5.7109144542772863e-05, 'epoch': 7.14}


 72%|███████▏  | 1217/1700 [26:28<11:00,  1.37s/it]

{'loss': 0.0092, 'grad_norm': 0.2012675553560257, 'learning_rate': 5.699115044247788e-05, 'epoch': 7.15}


 72%|███████▏  | 1218/1700 [26:29<11:08,  1.39s/it]

{'loss': 0.0014, 'grad_norm': 0.03387346863746643, 'learning_rate': 5.6873156342182895e-05, 'epoch': 7.15}


 72%|███████▏  | 1219/1700 [26:30<11:04,  1.38s/it]

{'loss': 0.0025, 'grad_norm': 0.06719528138637543, 'learning_rate': 5.675516224188791e-05, 'epoch': 7.16}


 72%|███████▏  | 1220/1700 [26:32<11:15,  1.41s/it]

{'loss': 0.0215, 'grad_norm': 0.5687082409858704, 'learning_rate': 5.663716814159292e-05, 'epoch': 7.17}


 72%|███████▏  | 1221/1700 [26:33<10:22,  1.30s/it]

{'loss': 0.0019, 'grad_norm': 0.10444987565279007, 'learning_rate': 5.651917404129794e-05, 'epoch': 7.17}


 72%|███████▏  | 1222/1700 [26:34<10:33,  1.33s/it]

{'loss': 0.0149, 'grad_norm': 0.32302212715148926, 'learning_rate': 5.640117994100296e-05, 'epoch': 7.18}


 72%|███████▏  | 1223/1700 [26:36<10:25,  1.31s/it]

{'loss': 0.0028, 'grad_norm': 0.13787296414375305, 'learning_rate': 5.6283185840707965e-05, 'epoch': 7.18}


 72%|███████▏  | 1224/1700 [26:37<10:24,  1.31s/it]

{'loss': 0.0013, 'grad_norm': 0.02061067335307598, 'learning_rate': 5.6165191740412984e-05, 'epoch': 7.19}


 72%|███████▏  | 1225/1700 [26:39<10:56,  1.38s/it]

{'loss': 0.0022, 'grad_norm': 0.060656122863292694, 'learning_rate': 5.6047197640118e-05, 'epoch': 7.2}


 72%|███████▏  | 1226/1700 [26:40<10:34,  1.34s/it]

{'loss': 0.0045, 'grad_norm': 0.08551456779241562, 'learning_rate': 5.5929203539823016e-05, 'epoch': 7.2}


 72%|███████▏  | 1227/1700 [26:41<10:08,  1.29s/it]

{'loss': 0.0175, 'grad_norm': 0.2639533281326294, 'learning_rate': 5.581120943952802e-05, 'epoch': 7.21}


 72%|███████▏  | 1228/1700 [26:42<10:08,  1.29s/it]

{'loss': 0.0024, 'grad_norm': 0.05710093677043915, 'learning_rate': 5.569321533923304e-05, 'epoch': 7.21}


 72%|███████▏  | 1229/1700 [26:43<09:55,  1.26s/it]

{'loss': 0.0026, 'grad_norm': 0.05603049695491791, 'learning_rate': 5.557522123893806e-05, 'epoch': 7.22}


 72%|███████▏  | 1230/1700 [26:45<10:03,  1.28s/it]

{'loss': 0.0024, 'grad_norm': 0.035073116421699524, 'learning_rate': 5.545722713864307e-05, 'epoch': 7.22}


 72%|███████▏  | 1231/1700 [26:46<10:28,  1.34s/it]

{'loss': 0.0034, 'grad_norm': 0.08330821245908737, 'learning_rate': 5.533923303834808e-05, 'epoch': 7.23}


 72%|███████▏  | 1232/1700 [26:47<10:02,  1.29s/it]

{'loss': 0.0043, 'grad_norm': 0.12764181196689606, 'learning_rate': 5.52212389380531e-05, 'epoch': 7.24}


 73%|███████▎  | 1233/1700 [26:49<09:39,  1.24s/it]

{'loss': 0.0016, 'grad_norm': 0.04115920886397362, 'learning_rate': 5.510324483775812e-05, 'epoch': 7.24}


 73%|███████▎  | 1234/1700 [26:50<09:47,  1.26s/it]

{'loss': 0.0018, 'grad_norm': 0.10272107273340225, 'learning_rate': 5.498525073746314e-05, 'epoch': 7.25}


 73%|███████▎  | 1235/1700 [26:51<09:53,  1.28s/it]

{'loss': 0.0199, 'grad_norm': 0.3905295431613922, 'learning_rate': 5.486725663716814e-05, 'epoch': 7.25}


 73%|███████▎  | 1236/1700 [26:53<10:07,  1.31s/it]

{'loss': 0.0032, 'grad_norm': 0.09684166312217712, 'learning_rate': 5.4749262536873155e-05, 'epoch': 7.26}


 73%|███████▎  | 1237/1700 [26:54<10:03,  1.30s/it]

{'loss': 0.0146, 'grad_norm': 0.43613314628601074, 'learning_rate': 5.4631268436578175e-05, 'epoch': 7.27}


 73%|███████▎  | 1238/1700 [26:55<09:57,  1.29s/it]

{'loss': 0.0012, 'grad_norm': 0.021522851660847664, 'learning_rate': 5.4513274336283194e-05, 'epoch': 7.27}


 73%|███████▎  | 1239/1700 [26:56<09:55,  1.29s/it]

{'loss': 0.0047, 'grad_norm': 0.15007513761520386, 'learning_rate': 5.43952802359882e-05, 'epoch': 7.28}


 73%|███████▎  | 1240/1700 [26:58<09:39,  1.26s/it]

{'loss': 0.0081, 'grad_norm': 0.10554557293653488, 'learning_rate': 5.427728613569322e-05, 'epoch': 7.28}


 73%|███████▎  | 1241/1700 [26:59<09:24,  1.23s/it]

{'loss': 0.0021, 'grad_norm': 0.08392101526260376, 'learning_rate': 5.415929203539823e-05, 'epoch': 7.29}


 73%|███████▎  | 1242/1700 [27:00<09:10,  1.20s/it]

{'loss': 0.0026, 'grad_norm': 0.09020252525806427, 'learning_rate': 5.404129793510325e-05, 'epoch': 7.3}


 73%|███████▎  | 1243/1700 [27:01<09:33,  1.25s/it]

{'loss': 0.0222, 'grad_norm': 0.15402008593082428, 'learning_rate': 5.392330383480826e-05, 'epoch': 7.3}


 73%|███████▎  | 1244/1700 [27:03<10:03,  1.32s/it]

{'loss': 0.0026, 'grad_norm': 0.17549549043178558, 'learning_rate': 5.3805309734513276e-05, 'epoch': 7.31}


 73%|███████▎  | 1245/1700 [27:04<09:39,  1.27s/it]

{'loss': 0.0031, 'grad_norm': 0.09666456282138824, 'learning_rate': 5.3687315634218295e-05, 'epoch': 7.31}


 73%|███████▎  | 1246/1700 [27:05<09:12,  1.22s/it]

{'loss': 0.0012, 'grad_norm': 0.02357495203614235, 'learning_rate': 5.356932153392331e-05, 'epoch': 7.32}


 73%|███████▎  | 1247/1700 [27:06<08:44,  1.16s/it]

{'loss': 0.001, 'grad_norm': 0.028811369091272354, 'learning_rate': 5.345132743362832e-05, 'epoch': 7.32}


 73%|███████▎  | 1248/1700 [27:07<09:01,  1.20s/it]

{'loss': 0.0097, 'grad_norm': 0.19650965929031372, 'learning_rate': 5.333333333333333e-05, 'epoch': 7.33}


 73%|███████▎  | 1249/1700 [27:08<08:46,  1.17s/it]

{'loss': 0.002, 'grad_norm': 0.08155428618192673, 'learning_rate': 5.321533923303835e-05, 'epoch': 7.34}


 74%|███████▎  | 1250/1700 [27:10<09:08,  1.22s/it]

{'loss': 0.007, 'grad_norm': 0.1412988007068634, 'learning_rate': 5.309734513274337e-05, 'epoch': 7.34}


 74%|███████▎  | 1251/1700 [27:11<08:52,  1.19s/it]

{'loss': 0.0019, 'grad_norm': 0.043621692806482315, 'learning_rate': 5.297935103244838e-05, 'epoch': 7.35}


 74%|███████▎  | 1252/1700 [27:12<08:40,  1.16s/it]

{'loss': 0.0013, 'grad_norm': 0.02373572625219822, 'learning_rate': 5.28613569321534e-05, 'epoch': 7.35}


 74%|███████▎  | 1253/1700 [27:13<08:23,  1.13s/it]

{'loss': 0.0012, 'grad_norm': 0.08419887721538544, 'learning_rate': 5.274336283185841e-05, 'epoch': 7.36}


 74%|███████▍  | 1254/1700 [27:14<08:32,  1.15s/it]

{'loss': 0.0018, 'grad_norm': 0.02065366692841053, 'learning_rate': 5.262536873156343e-05, 'epoch': 7.37}


 74%|███████▍  | 1255/1700 [27:15<08:25,  1.14s/it]

{'loss': 0.0037, 'grad_norm': 0.2842046916484833, 'learning_rate': 5.2507374631268435e-05, 'epoch': 7.37}


 74%|███████▍  | 1256/1700 [27:17<08:40,  1.17s/it]

{'loss': 0.0024, 'grad_norm': 0.1274077296257019, 'learning_rate': 5.2389380530973454e-05, 'epoch': 7.38}


 74%|███████▍  | 1257/1700 [27:18<08:55,  1.21s/it]

{'loss': 0.0011, 'grad_norm': 0.023041829466819763, 'learning_rate': 5.227138643067847e-05, 'epoch': 7.38}


 74%|███████▍  | 1258/1700 [27:19<08:57,  1.22s/it]

{'loss': 0.002, 'grad_norm': 0.062104418873786926, 'learning_rate': 5.2153392330383486e-05, 'epoch': 7.39}


 74%|███████▍  | 1259/1700 [27:20<08:48,  1.20s/it]

{'loss': 0.0069, 'grad_norm': 0.23821012675762177, 'learning_rate': 5.203539823008849e-05, 'epoch': 7.4}


 74%|███████▍  | 1260/1700 [27:22<09:28,  1.29s/it]

{'loss': 0.0051, 'grad_norm': 0.15632054209709167, 'learning_rate': 5.191740412979351e-05, 'epoch': 7.4}


 74%|███████▍  | 1261/1700 [27:23<09:00,  1.23s/it]

{'loss': 0.0034, 'grad_norm': 0.05238889530301094, 'learning_rate': 5.179941002949853e-05, 'epoch': 7.41}


 74%|███████▍  | 1262/1700 [27:24<09:01,  1.24s/it]

{'loss': 0.0011, 'grad_norm': 0.01542938593775034, 'learning_rate': 5.168141592920355e-05, 'epoch': 7.41}


 74%|███████▍  | 1263/1700 [27:26<10:44,  1.47s/it]

{'loss': 0.003, 'grad_norm': 0.09020102769136429, 'learning_rate': 5.1563421828908555e-05, 'epoch': 7.42}


 74%|███████▍  | 1264/1700 [27:27<09:47,  1.35s/it]

{'loss': 0.0021, 'grad_norm': 0.08533522486686707, 'learning_rate': 5.144542772861357e-05, 'epoch': 7.42}


 74%|███████▍  | 1265/1700 [27:29<09:49,  1.36s/it]

{'loss': 0.0007, 'grad_norm': 0.008789220824837685, 'learning_rate': 5.132743362831859e-05, 'epoch': 7.43}


 74%|███████▍  | 1266/1700 [27:30<09:24,  1.30s/it]

{'loss': 0.0019, 'grad_norm': 0.08401596546173096, 'learning_rate': 5.120943952802361e-05, 'epoch': 7.44}


 75%|███████▍  | 1267/1700 [27:31<09:11,  1.27s/it]

{'loss': 0.0008, 'grad_norm': 0.02719445712864399, 'learning_rate': 5.109144542772861e-05, 'epoch': 7.44}


 75%|███████▍  | 1268/1700 [27:32<09:14,  1.28s/it]

{'loss': 0.01, 'grad_norm': 0.1783704310655594, 'learning_rate': 5.097345132743363e-05, 'epoch': 7.45}


 75%|███████▍  | 1269/1700 [27:36<13:55,  1.94s/it]

{'loss': 0.2128, 'grad_norm': 0.5379912853240967, 'learning_rate': 5.0855457227138644e-05, 'epoch': 7.45}


 75%|███████▍  | 1270/1700 [27:37<12:18,  1.72s/it]

{'loss': 0.0122, 'grad_norm': 0.39006948471069336, 'learning_rate': 5.0737463126843664e-05, 'epoch': 7.46}


 75%|███████▍  | 1271/1700 [27:38<11:36,  1.62s/it]

{'loss': 0.0052, 'grad_norm': 0.08816254884004593, 'learning_rate': 5.061946902654867e-05, 'epoch': 7.47}


 75%|███████▍  | 1272/1700 [27:39<10:40,  1.50s/it]

{'loss': 0.0061, 'grad_norm': 0.4215821623802185, 'learning_rate': 5.050147492625369e-05, 'epoch': 7.47}


 75%|███████▍  | 1273/1700 [27:40<09:33,  1.34s/it]

{'loss': 0.0075, 'grad_norm': 0.3972274959087372, 'learning_rate': 5.038348082595871e-05, 'epoch': 7.48}


 75%|███████▍  | 1274/1700 [27:42<09:05,  1.28s/it]

{'loss': 0.0018, 'grad_norm': 0.0818304792046547, 'learning_rate': 5.026548672566372e-05, 'epoch': 7.48}


 75%|███████▌  | 1275/1700 [27:43<08:47,  1.24s/it]

{'loss': 0.0009, 'grad_norm': 0.015953833237290382, 'learning_rate': 5.014749262536873e-05, 'epoch': 7.49}


 75%|███████▌  | 1276/1700 [27:44<09:10,  1.30s/it]

{'loss': 0.0046, 'grad_norm': 0.16264177858829498, 'learning_rate': 5.0029498525073746e-05, 'epoch': 7.49}


 75%|███████▌  | 1277/1700 [27:45<08:46,  1.24s/it]

{'loss': 0.0069, 'grad_norm': 0.5037155151367188, 'learning_rate': 4.9911504424778765e-05, 'epoch': 7.5}


 75%|███████▌  | 1278/1700 [27:47<09:06,  1.30s/it]

{'loss': 0.0017, 'grad_norm': 0.06687524914741516, 'learning_rate': 4.979351032448378e-05, 'epoch': 7.51}


 75%|███████▌  | 1279/1700 [27:48<08:59,  1.28s/it]

{'loss': 0.0011, 'grad_norm': 0.014898756518959999, 'learning_rate': 4.96755162241888e-05, 'epoch': 7.51}


 75%|███████▌  | 1280/1700 [27:49<09:03,  1.29s/it]

{'loss': 0.0025, 'grad_norm': 0.0858280286192894, 'learning_rate': 4.955752212389381e-05, 'epoch': 7.52}


 75%|███████▌  | 1281/1700 [27:51<09:14,  1.32s/it]

{'loss': 0.0024, 'grad_norm': 0.0774221122264862, 'learning_rate': 4.943952802359882e-05, 'epoch': 7.52}


 75%|███████▌  | 1282/1700 [27:52<08:53,  1.28s/it]

{'loss': 0.0013, 'grad_norm': 0.029869044199585915, 'learning_rate': 4.9321533923303835e-05, 'epoch': 7.53}


 75%|███████▌  | 1283/1700 [27:53<08:47,  1.27s/it]

{'loss': 0.0007, 'grad_norm': 0.013881818391382694, 'learning_rate': 4.9203539823008854e-05, 'epoch': 7.54}


 76%|███████▌  | 1284/1700 [27:55<09:05,  1.31s/it]

{'loss': 0.0014, 'grad_norm': 0.03097410500049591, 'learning_rate': 4.908554572271387e-05, 'epoch': 7.54}


 76%|███████▌  | 1285/1700 [27:56<09:08,  1.32s/it]

{'loss': 0.0013, 'grad_norm': 0.02786870114505291, 'learning_rate': 4.8967551622418886e-05, 'epoch': 7.55}


 76%|███████▌  | 1286/1700 [27:57<08:53,  1.29s/it]

{'loss': 0.0009, 'grad_norm': 0.028210293501615524, 'learning_rate': 4.88495575221239e-05, 'epoch': 7.55}


 76%|███████▌  | 1287/1700 [27:58<08:52,  1.29s/it]

{'loss': 0.0029, 'grad_norm': 0.24071374535560608, 'learning_rate': 4.873156342182891e-05, 'epoch': 7.56}


 76%|███████▌  | 1288/1700 [28:00<08:54,  1.30s/it]

{'loss': 0.0013, 'grad_norm': 0.040849681943655014, 'learning_rate': 4.8613569321533924e-05, 'epoch': 7.57}


 76%|███████▌  | 1289/1700 [28:01<08:53,  1.30s/it]

{'loss': 0.0006, 'grad_norm': 0.012002379633486271, 'learning_rate': 4.849557522123894e-05, 'epoch': 7.57}


 76%|███████▌  | 1290/1700 [28:02<08:42,  1.27s/it]

{'loss': 0.0016, 'grad_norm': 0.16061122715473175, 'learning_rate': 4.8377581120943956e-05, 'epoch': 7.58}


 76%|███████▌  | 1291/1700 [28:04<09:06,  1.34s/it]

{'loss': 0.0154, 'grad_norm': 0.3003140091896057, 'learning_rate': 4.8259587020648975e-05, 'epoch': 7.58}


 76%|███████▌  | 1292/1700 [28:05<09:02,  1.33s/it]

{'loss': 0.0025, 'grad_norm': 0.2638533115386963, 'learning_rate': 4.814159292035398e-05, 'epoch': 7.59}


 76%|███████▌  | 1293/1700 [28:06<08:59,  1.33s/it]

{'loss': 0.0021, 'grad_norm': 0.09716609865427017, 'learning_rate': 4.8023598820059e-05, 'epoch': 7.59}


 76%|███████▌  | 1294/1700 [28:07<08:34,  1.27s/it]

{'loss': 0.0047, 'grad_norm': 0.3048055171966553, 'learning_rate': 4.790560471976401e-05, 'epoch': 7.6}


 76%|███████▌  | 1295/1700 [28:09<09:36,  1.42s/it]

{'loss': 0.0025, 'grad_norm': 0.08497404307126999, 'learning_rate': 4.778761061946903e-05, 'epoch': 7.61}


 76%|███████▌  | 1296/1700 [28:11<09:32,  1.42s/it]

{'loss': 0.0008, 'grad_norm': 0.015963660553097725, 'learning_rate': 4.7669616519174045e-05, 'epoch': 7.61}


 76%|███████▋  | 1297/1700 [28:12<08:51,  1.32s/it]

{'loss': 0.0041, 'grad_norm': 0.1045621782541275, 'learning_rate': 4.755162241887906e-05, 'epoch': 7.62}


 76%|███████▋  | 1298/1700 [28:13<08:37,  1.29s/it]

{'loss': 0.0127, 'grad_norm': 0.4401526153087616, 'learning_rate': 4.743362831858407e-05, 'epoch': 7.62}


 76%|███████▋  | 1299/1700 [28:15<09:36,  1.44s/it]

{'loss': 0.0035, 'grad_norm': 0.0667521059513092, 'learning_rate': 4.731563421828909e-05, 'epoch': 7.63}


 76%|███████▋  | 1300/1700 [28:16<09:07,  1.37s/it]

{'loss': 0.0009, 'grad_norm': 0.012741613201797009, 'learning_rate': 4.71976401179941e-05, 'epoch': 7.64}


 77%|███████▋  | 1301/1700 [28:17<08:42,  1.31s/it]

{'loss': 0.004, 'grad_norm': 0.1138930395245552, 'learning_rate': 4.707964601769912e-05, 'epoch': 7.64}


 77%|███████▋  | 1302/1700 [28:18<08:06,  1.22s/it]

{'loss': 0.0009, 'grad_norm': 0.04240705072879791, 'learning_rate': 4.6961651917404133e-05, 'epoch': 7.65}


 77%|███████▋  | 1303/1700 [28:19<08:25,  1.27s/it]

{'loss': 0.0071, 'grad_norm': 0.14117108285427094, 'learning_rate': 4.6843657817109146e-05, 'epoch': 7.65}


 77%|███████▋  | 1304/1700 [28:21<08:08,  1.23s/it]

{'loss': 0.0011, 'grad_norm': 0.016271771863102913, 'learning_rate': 4.672566371681416e-05, 'epoch': 7.66}


 77%|███████▋  | 1305/1700 [28:22<08:29,  1.29s/it]

{'loss': 0.0024, 'grad_norm': 0.11371375620365143, 'learning_rate': 4.660766961651918e-05, 'epoch': 7.67}


 77%|███████▋  | 1306/1700 [28:23<08:13,  1.25s/it]

{'loss': 0.0053, 'grad_norm': 0.6535674929618835, 'learning_rate': 4.648967551622419e-05, 'epoch': 7.67}


 77%|███████▋  | 1307/1700 [28:25<08:43,  1.33s/it]

{'loss': 0.039, 'grad_norm': 0.7366139888763428, 'learning_rate': 4.637168141592921e-05, 'epoch': 7.68}


 77%|███████▋  | 1308/1700 [28:26<08:33,  1.31s/it]

{'loss': 0.0038, 'grad_norm': 0.30012232065200806, 'learning_rate': 4.625368731563422e-05, 'epoch': 7.68}


 77%|███████▋  | 1309/1700 [28:27<08:06,  1.24s/it]

{'loss': 0.0069, 'grad_norm': 0.48770764470100403, 'learning_rate': 4.6135693215339235e-05, 'epoch': 7.69}


 77%|███████▋  | 1310/1700 [28:28<08:02,  1.24s/it]

{'loss': 0.0016, 'grad_norm': 0.045656442642211914, 'learning_rate': 4.601769911504425e-05, 'epoch': 7.69}


 77%|███████▋  | 1311/1700 [28:29<07:42,  1.19s/it]

{'loss': 0.0007, 'grad_norm': 0.01934087462723255, 'learning_rate': 4.589970501474927e-05, 'epoch': 7.7}


 77%|███████▋  | 1312/1700 [28:31<08:07,  1.26s/it]

{'loss': 0.0074, 'grad_norm': 0.23169584572315216, 'learning_rate': 4.578171091445428e-05, 'epoch': 7.71}


 77%|███████▋  | 1313/1700 [28:32<08:10,  1.27s/it]

{'loss': 0.0026, 'grad_norm': 0.10086852312088013, 'learning_rate': 4.56637168141593e-05, 'epoch': 7.71}


 77%|███████▋  | 1314/1700 [28:33<08:11,  1.27s/it]

{'loss': 0.0121, 'grad_norm': 0.16255483031272888, 'learning_rate': 4.554572271386431e-05, 'epoch': 7.72}


 77%|███████▋  | 1315/1700 [28:35<07:57,  1.24s/it]

{'loss': 0.0105, 'grad_norm': 0.3450423777103424, 'learning_rate': 4.5427728613569324e-05, 'epoch': 7.72}


 77%|███████▋  | 1316/1700 [28:36<08:37,  1.35s/it]

{'loss': 0.0029, 'grad_norm': 0.06040674448013306, 'learning_rate': 4.5309734513274336e-05, 'epoch': 7.73}


 77%|███████▋  | 1317/1700 [28:37<08:30,  1.33s/it]

{'loss': 0.0039, 'grad_norm': 0.12988513708114624, 'learning_rate': 4.5191740412979356e-05, 'epoch': 7.74}


 78%|███████▊  | 1318/1700 [28:39<08:32,  1.34s/it]

{'loss': 0.0017, 'grad_norm': 0.0424099825322628, 'learning_rate': 4.507374631268437e-05, 'epoch': 7.74}


 78%|███████▊  | 1319/1700 [28:40<08:17,  1.31s/it]

{'loss': 0.0057, 'grad_norm': 0.12727734446525574, 'learning_rate': 4.495575221238939e-05, 'epoch': 7.75}


 78%|███████▊  | 1320/1700 [28:41<08:24,  1.33s/it]

{'loss': 0.0016, 'grad_norm': 0.03489197418093681, 'learning_rate': 4.48377581120944e-05, 'epoch': 7.75}


 78%|███████▊  | 1321/1700 [28:43<08:11,  1.30s/it]

{'loss': 0.0019, 'grad_norm': 0.0746583417057991, 'learning_rate': 4.471976401179941e-05, 'epoch': 7.76}


 78%|███████▊  | 1322/1700 [28:44<08:00,  1.27s/it]

{'loss': 0.0034, 'grad_norm': 0.18738670647144318, 'learning_rate': 4.4601769911504425e-05, 'epoch': 7.77}


 78%|███████▊  | 1323/1700 [28:45<08:01,  1.28s/it]

{'loss': 0.0013, 'grad_norm': 0.030962329357862473, 'learning_rate': 4.4483775811209445e-05, 'epoch': 7.77}


 78%|███████▊  | 1324/1700 [28:46<07:50,  1.25s/it]

{'loss': 0.0127, 'grad_norm': 0.29223334789276123, 'learning_rate': 4.436578171091446e-05, 'epoch': 7.78}


 78%|███████▊  | 1325/1700 [28:48<07:52,  1.26s/it]

{'loss': 0.0287, 'grad_norm': 0.9866406321525574, 'learning_rate': 4.4247787610619477e-05, 'epoch': 7.78}


 78%|███████▊  | 1326/1700 [28:49<08:07,  1.30s/it]

{'loss': 0.0045, 'grad_norm': 0.10315226763486862, 'learning_rate': 4.412979351032448e-05, 'epoch': 7.79}


 78%|███████▊  | 1327/1700 [28:50<08:14,  1.33s/it]

{'loss': 0.0075, 'grad_norm': 0.14628258347511292, 'learning_rate': 4.40117994100295e-05, 'epoch': 7.79}


 78%|███████▊  | 1328/1700 [28:52<08:07,  1.31s/it]

{'loss': 0.002, 'grad_norm': 0.03529911860823631, 'learning_rate': 4.3893805309734514e-05, 'epoch': 7.8}


 78%|███████▊  | 1329/1700 [28:53<08:17,  1.34s/it]

{'loss': 0.0014, 'grad_norm': 0.0269867405295372, 'learning_rate': 4.3775811209439534e-05, 'epoch': 7.81}


 78%|███████▊  | 1330/1700 [28:54<08:14,  1.34s/it]

{'loss': 0.0037, 'grad_norm': 0.1318226158618927, 'learning_rate': 4.3657817109144546e-05, 'epoch': 7.81}


 78%|███████▊  | 1331/1700 [28:56<08:23,  1.37s/it]

{'loss': 0.0024, 'grad_norm': 0.059103481471538544, 'learning_rate': 4.353982300884956e-05, 'epoch': 7.82}


 78%|███████▊  | 1332/1700 [28:57<08:24,  1.37s/it]

{'loss': 0.0221, 'grad_norm': 0.5409031510353088, 'learning_rate': 4.342182890855457e-05, 'epoch': 7.82}


 78%|███████▊  | 1333/1700 [28:58<07:46,  1.27s/it]

{'loss': 0.0009, 'grad_norm': 0.02487158589065075, 'learning_rate': 4.330383480825959e-05, 'epoch': 7.83}


 78%|███████▊  | 1334/1700 [28:59<07:32,  1.24s/it]

{'loss': 0.003, 'grad_norm': 0.19193029403686523, 'learning_rate': 4.31858407079646e-05, 'epoch': 7.84}


 79%|███████▊  | 1335/1700 [29:01<07:44,  1.27s/it]

{'loss': 0.002, 'grad_norm': 0.03350295498967171, 'learning_rate': 4.306784660766962e-05, 'epoch': 7.84}


 79%|███████▊  | 1336/1700 [29:02<08:01,  1.32s/it]

{'loss': 0.003, 'grad_norm': 0.05392803996801376, 'learning_rate': 4.2949852507374635e-05, 'epoch': 7.85}


 79%|███████▊  | 1337/1700 [29:04<08:09,  1.35s/it]

{'loss': 0.0025, 'grad_norm': 0.05058177933096886, 'learning_rate': 4.283185840707965e-05, 'epoch': 7.85}


 79%|███████▊  | 1338/1700 [29:05<07:50,  1.30s/it]

{'loss': 0.0005, 'grad_norm': 0.015589618124067783, 'learning_rate': 4.271386430678466e-05, 'epoch': 7.86}


 79%|███████▉  | 1339/1700 [29:06<07:26,  1.24s/it]

{'loss': 0.0029, 'grad_norm': 0.06049725040793419, 'learning_rate': 4.259587020648968e-05, 'epoch': 7.86}


 79%|███████▉  | 1340/1700 [29:07<07:15,  1.21s/it]

{'loss': 0.0009, 'grad_norm': 0.05945812538266182, 'learning_rate': 4.247787610619469e-05, 'epoch': 7.87}


 79%|███████▉  | 1341/1700 [29:08<07:13,  1.21s/it]

{'loss': 0.0012, 'grad_norm': 0.04419589415192604, 'learning_rate': 4.235988200589971e-05, 'epoch': 7.88}


 79%|███████▉  | 1342/1700 [29:10<08:36,  1.44s/it]

{'loss': 0.0012, 'grad_norm': 0.02427804097533226, 'learning_rate': 4.2241887905604724e-05, 'epoch': 7.88}


 79%|███████▉  | 1343/1700 [29:11<08:11,  1.38s/it]

{'loss': 0.0013, 'grad_norm': 0.050716642290353775, 'learning_rate': 4.2123893805309737e-05, 'epoch': 7.89}


 79%|███████▉  | 1344/1700 [29:13<07:39,  1.29s/it]

{'loss': 0.0029, 'grad_norm': 0.20788638293743134, 'learning_rate': 4.200589970501475e-05, 'epoch': 7.89}


 79%|███████▉  | 1345/1700 [29:14<07:47,  1.32s/it]

{'loss': 0.0027, 'grad_norm': 0.13040880858898163, 'learning_rate': 4.188790560471977e-05, 'epoch': 7.9}


 79%|███████▉  | 1346/1700 [29:15<07:12,  1.22s/it]

{'loss': 0.0022, 'grad_norm': 0.1316763460636139, 'learning_rate': 4.176991150442478e-05, 'epoch': 7.91}


 79%|███████▉  | 1347/1700 [29:17<08:02,  1.37s/it]

{'loss': 0.004, 'grad_norm': 0.1041707843542099, 'learning_rate': 4.16519174041298e-05, 'epoch': 7.91}


 79%|███████▉  | 1348/1700 [29:18<07:33,  1.29s/it]

{'loss': 0.0015, 'grad_norm': 0.03265003859996796, 'learning_rate': 4.153392330383481e-05, 'epoch': 7.92}


 79%|███████▉  | 1349/1700 [29:19<07:40,  1.31s/it]

{'loss': 0.0022, 'grad_norm': 0.03877235949039459, 'learning_rate': 4.1415929203539825e-05, 'epoch': 7.92}


 79%|███████▉  | 1350/1700 [29:20<07:13,  1.24s/it]

{'loss': 0.0016, 'grad_norm': 0.05252862349152565, 'learning_rate': 4.129793510324484e-05, 'epoch': 7.93}


 79%|███████▉  | 1351/1700 [29:22<07:36,  1.31s/it]

{'loss': 0.0026, 'grad_norm': 0.05022518336772919, 'learning_rate': 4.117994100294986e-05, 'epoch': 7.94}


 80%|███████▉  | 1352/1700 [29:23<07:34,  1.31s/it]

{'loss': 0.0019, 'grad_norm': 0.08344942331314087, 'learning_rate': 4.106194690265487e-05, 'epoch': 7.94}


 80%|███████▉  | 1353/1700 [29:24<07:00,  1.21s/it]

{'loss': 0.0005, 'grad_norm': 0.007049692329019308, 'learning_rate': 4.094395280235989e-05, 'epoch': 7.95}


 80%|███████▉  | 1354/1700 [29:25<06:50,  1.19s/it]

{'loss': 0.0007, 'grad_norm': 0.011476907879114151, 'learning_rate': 4.0825958702064895e-05, 'epoch': 7.95}


 80%|███████▉  | 1355/1700 [29:26<06:35,  1.15s/it]

{'loss': 0.0095, 'grad_norm': 0.21497268974781036, 'learning_rate': 4.0707964601769914e-05, 'epoch': 7.96}


 80%|███████▉  | 1356/1700 [29:27<06:29,  1.13s/it]

{'loss': 0.0126, 'grad_norm': 0.2201569527387619, 'learning_rate': 4.058997050147493e-05, 'epoch': 7.96}


 80%|███████▉  | 1357/1700 [29:28<06:40,  1.17s/it]

{'loss': 0.0018, 'grad_norm': 0.03214352950453758, 'learning_rate': 4.0471976401179946e-05, 'epoch': 7.97}


 80%|███████▉  | 1358/1700 [29:29<06:22,  1.12s/it]

{'loss': 0.0081, 'grad_norm': 0.32349318265914917, 'learning_rate': 4.035398230088496e-05, 'epoch': 7.98}


 80%|███████▉  | 1359/1700 [29:31<06:36,  1.16s/it]

{'loss': 0.0022, 'grad_norm': 0.04247443750500679, 'learning_rate': 4.023598820058997e-05, 'epoch': 7.98}


 80%|████████  | 1360/1700 [29:32<06:38,  1.17s/it]

{'loss': 0.0023, 'grad_norm': 0.0484490767121315, 'learning_rate': 4.0117994100294984e-05, 'epoch': 7.99}


 80%|████████  | 1361/1700 [29:33<06:57,  1.23s/it]

{'loss': 0.0017, 'grad_norm': 0.08837036043405533, 'learning_rate': 4e-05, 'epoch': 7.99}


 80%|████████  | 1362/1700 [29:35<07:33,  1.34s/it]

{'loss': 0.0079, 'grad_norm': 0.3689750134944916, 'learning_rate': 3.9882005899705016e-05, 'epoch': 8.0}


 80%|████████  | 1363/1700 [29:36<07:28,  1.33s/it]

{'loss': 0.001, 'grad_norm': 0.011329400353133678, 'learning_rate': 3.9764011799410035e-05, 'epoch': 8.01}


 80%|████████  | 1364/1700 [29:37<06:54,  1.23s/it]

{'loss': 0.0352, 'grad_norm': 0.6058858633041382, 'learning_rate': 3.964601769911505e-05, 'epoch': 8.01}


 80%|████████  | 1365/1700 [29:38<06:34,  1.18s/it]

{'loss': 0.0005, 'grad_norm': 0.007249881513416767, 'learning_rate': 3.952802359882006e-05, 'epoch': 8.02}


 80%|████████  | 1366/1700 [29:39<06:26,  1.16s/it]

{'loss': 0.0046, 'grad_norm': 0.2666025459766388, 'learning_rate': 3.941002949852507e-05, 'epoch': 8.02}


 80%|████████  | 1367/1700 [29:41<06:46,  1.22s/it]

{'loss': 0.0017, 'grad_norm': 0.031339097768068314, 'learning_rate': 3.929203539823009e-05, 'epoch': 8.03}


 80%|████████  | 1368/1700 [29:42<06:47,  1.23s/it]

{'loss': 0.0007, 'grad_norm': 0.012510860338807106, 'learning_rate': 3.9174041297935105e-05, 'epoch': 8.04}


 81%|████████  | 1369/1700 [29:43<07:02,  1.28s/it]

{'loss': 0.001, 'grad_norm': 0.010976909659802914, 'learning_rate': 3.9056047197640124e-05, 'epoch': 8.04}


 81%|████████  | 1370/1700 [29:45<06:51,  1.25s/it]

{'loss': 0.0009, 'grad_norm': 0.014071129262447357, 'learning_rate': 3.893805309734514e-05, 'epoch': 8.05}


 81%|████████  | 1371/1700 [29:46<06:22,  1.16s/it]

{'loss': 0.0007, 'grad_norm': 0.010405078530311584, 'learning_rate': 3.882005899705015e-05, 'epoch': 8.05}


 81%|████████  | 1372/1700 [29:47<06:25,  1.18s/it]

{'loss': 0.0007, 'grad_norm': 0.016314765438437462, 'learning_rate': 3.870206489675516e-05, 'epoch': 8.06}


 81%|████████  | 1373/1700 [29:48<06:26,  1.18s/it]

{'loss': 0.0134, 'grad_norm': 0.7958590984344482, 'learning_rate': 3.858407079646018e-05, 'epoch': 8.06}


 81%|████████  | 1374/1700 [29:49<06:14,  1.15s/it]

{'loss': 0.0019, 'grad_norm': 0.04662765935063362, 'learning_rate': 3.8466076696165194e-05, 'epoch': 8.07}


 81%|████████  | 1375/1700 [29:50<06:11,  1.14s/it]

{'loss': 0.0008, 'grad_norm': 0.017275823280215263, 'learning_rate': 3.834808259587021e-05, 'epoch': 8.08}


 81%|████████  | 1376/1700 [29:51<06:17,  1.16s/it]

{'loss': 0.0007, 'grad_norm': 0.016377009451389313, 'learning_rate': 3.8230088495575226e-05, 'epoch': 8.08}


 81%|████████  | 1377/1700 [29:53<06:19,  1.18s/it]

{'loss': 0.0019, 'grad_norm': 0.03931219503283501, 'learning_rate': 3.811209439528024e-05, 'epoch': 8.09}


 81%|████████  | 1378/1700 [29:54<06:33,  1.22s/it]

{'loss': 0.0043, 'grad_norm': 0.09864156693220139, 'learning_rate': 3.799410029498525e-05, 'epoch': 8.09}


 81%|████████  | 1379/1700 [29:55<06:48,  1.27s/it]

{'loss': 0.0009, 'grad_norm': 0.014767470769584179, 'learning_rate': 3.787610619469027e-05, 'epoch': 8.1}


 81%|████████  | 1380/1700 [29:57<07:51,  1.47s/it]

{'loss': 0.001, 'grad_norm': 0.01126874703913927, 'learning_rate': 3.775811209439528e-05, 'epoch': 8.11}


 81%|████████  | 1381/1700 [29:59<07:37,  1.43s/it]

{'loss': 0.0023, 'grad_norm': 0.03487122803926468, 'learning_rate': 3.76401179941003e-05, 'epoch': 8.11}


 81%|████████▏ | 1382/1700 [30:00<07:22,  1.39s/it]

{'loss': 0.0008, 'grad_norm': 0.04985995590686798, 'learning_rate': 3.752212389380531e-05, 'epoch': 8.12}


 81%|████████▏ | 1383/1700 [30:02<08:09,  1.54s/it]

{'loss': 0.0019, 'grad_norm': 0.11240575462579727, 'learning_rate': 3.740412979351033e-05, 'epoch': 8.12}


 81%|████████▏ | 1384/1700 [30:03<07:27,  1.42s/it]

{'loss': 0.0071, 'grad_norm': 0.4010128974914551, 'learning_rate': 3.728613569321534e-05, 'epoch': 8.13}


 81%|████████▏ | 1385/1700 [30:04<06:48,  1.30s/it]

{'loss': 0.0006, 'grad_norm': 0.012371668592095375, 'learning_rate': 3.716814159292036e-05, 'epoch': 8.14}


 82%|████████▏ | 1386/1700 [30:05<06:44,  1.29s/it]

{'loss': 0.0021, 'grad_norm': 0.17962276935577393, 'learning_rate': 3.705014749262537e-05, 'epoch': 8.14}


 82%|████████▏ | 1387/1700 [30:06<06:28,  1.24s/it]

{'loss': 0.0011, 'grad_norm': 0.019000280648469925, 'learning_rate': 3.693215339233039e-05, 'epoch': 8.15}


 82%|████████▏ | 1388/1700 [30:08<06:47,  1.31s/it]

{'loss': 0.0045, 'grad_norm': 0.2323964536190033, 'learning_rate': 3.68141592920354e-05, 'epoch': 8.15}


 82%|████████▏ | 1389/1700 [30:09<06:53,  1.33s/it]

{'loss': 0.0025, 'grad_norm': 0.04789973050355911, 'learning_rate': 3.6696165191740416e-05, 'epoch': 8.16}


 82%|████████▏ | 1390/1700 [30:10<06:48,  1.32s/it]

{'loss': 0.0015, 'grad_norm': 0.023616116493940353, 'learning_rate': 3.657817109144543e-05, 'epoch': 8.16}


 82%|████████▏ | 1391/1700 [30:12<06:59,  1.36s/it]

{'loss': 0.0006, 'grad_norm': 0.009854089468717575, 'learning_rate': 3.646017699115045e-05, 'epoch': 8.17}


 82%|████████▏ | 1392/1700 [30:13<06:56,  1.35s/it]

{'loss': 0.0014, 'grad_norm': 0.05319870635867119, 'learning_rate': 3.634218289085546e-05, 'epoch': 8.18}


 82%|████████▏ | 1393/1700 [30:15<07:14,  1.42s/it]

{'loss': 0.0013, 'grad_norm': 0.01708577573299408, 'learning_rate': 3.622418879056047e-05, 'epoch': 8.18}


 82%|████████▏ | 1394/1700 [30:16<06:52,  1.35s/it]

{'loss': 0.0012, 'grad_norm': 0.0721815675497055, 'learning_rate': 3.6106194690265486e-05, 'epoch': 8.19}


 82%|████████▏ | 1395/1700 [30:18<07:34,  1.49s/it]

{'loss': 0.0008, 'grad_norm': 0.009784994646906853, 'learning_rate': 3.5988200589970505e-05, 'epoch': 8.19}


 82%|████████▏ | 1396/1700 [30:19<07:02,  1.39s/it]

{'loss': 0.0012, 'grad_norm': 0.03390033543109894, 'learning_rate': 3.587020648967552e-05, 'epoch': 8.2}


 82%|████████▏ | 1397/1700 [30:20<06:36,  1.31s/it]

{'loss': 0.0012, 'grad_norm': 0.02060273475944996, 'learning_rate': 3.575221238938054e-05, 'epoch': 8.21}


 82%|████████▏ | 1398/1700 [30:21<06:18,  1.25s/it]

{'loss': 0.0011, 'grad_norm': 0.013967426493763924, 'learning_rate': 3.563421828908555e-05, 'epoch': 8.21}


 82%|████████▏ | 1399/1700 [30:22<06:23,  1.27s/it]

{'loss': 0.0016, 'grad_norm': 0.03890419378876686, 'learning_rate': 3.551622418879056e-05, 'epoch': 8.22}


 82%|████████▏ | 1400/1700 [30:24<06:10,  1.24s/it]

{'loss': 0.0007, 'grad_norm': 0.008018351159989834, 'learning_rate': 3.5398230088495574e-05, 'epoch': 8.22}


 82%|████████▏ | 1401/1700 [30:25<06:20,  1.27s/it]

{'loss': 0.0017, 'grad_norm': 0.03290194272994995, 'learning_rate': 3.5280235988200594e-05, 'epoch': 8.23}


 82%|████████▏ | 1402/1700 [30:26<06:11,  1.25s/it]

{'loss': 0.0014, 'grad_norm': 0.030145689845085144, 'learning_rate': 3.5162241887905606e-05, 'epoch': 8.23}


 83%|████████▎ | 1403/1700 [30:28<06:31,  1.32s/it]

{'loss': 0.0027, 'grad_norm': 0.07385768741369247, 'learning_rate': 3.5044247787610626e-05, 'epoch': 8.24}


 83%|████████▎ | 1404/1700 [30:29<06:39,  1.35s/it]

{'loss': 0.0017, 'grad_norm': 0.03801047056913376, 'learning_rate': 3.492625368731564e-05, 'epoch': 8.25}


 83%|████████▎ | 1405/1700 [30:31<06:48,  1.38s/it]

{'loss': 0.0009, 'grad_norm': 0.01392662525177002, 'learning_rate': 3.480825958702065e-05, 'epoch': 8.25}


 83%|████████▎ | 1406/1700 [30:32<06:43,  1.37s/it]

{'loss': 0.0037, 'grad_norm': 0.10292061418294907, 'learning_rate': 3.469026548672566e-05, 'epoch': 8.26}


 83%|████████▎ | 1407/1700 [30:33<06:34,  1.35s/it]

{'loss': 0.001, 'grad_norm': 0.017034165561199188, 'learning_rate': 3.457227138643068e-05, 'epoch': 8.26}


 83%|████████▎ | 1408/1700 [30:35<06:48,  1.40s/it]

{'loss': 0.0008, 'grad_norm': 0.014007077552378178, 'learning_rate': 3.4454277286135695e-05, 'epoch': 8.27}


 83%|████████▎ | 1409/1700 [30:36<06:38,  1.37s/it]

{'loss': 0.0011, 'grad_norm': 0.02693050727248192, 'learning_rate': 3.4336283185840715e-05, 'epoch': 8.28}


 83%|████████▎ | 1410/1700 [30:37<06:11,  1.28s/it]

{'loss': 0.0014, 'grad_norm': 0.033359602093696594, 'learning_rate': 3.421828908554573e-05, 'epoch': 8.28}


 83%|████████▎ | 1411/1700 [30:38<06:05,  1.27s/it]

{'loss': 0.0013, 'grad_norm': 0.017808226868510246, 'learning_rate': 3.410029498525074e-05, 'epoch': 8.29}


 83%|████████▎ | 1412/1700 [30:39<05:38,  1.18s/it]

{'loss': 0.0015, 'grad_norm': 0.026500456035137177, 'learning_rate': 3.398230088495575e-05, 'epoch': 8.29}


 83%|████████▎ | 1413/1700 [30:40<05:35,  1.17s/it]

{'loss': 0.0016, 'grad_norm': 0.04150824248790741, 'learning_rate': 3.386430678466077e-05, 'epoch': 8.3}


 83%|████████▎ | 1414/1700 [30:42<05:37,  1.18s/it]

{'loss': 0.0034, 'grad_norm': 0.35153043270111084, 'learning_rate': 3.3746312684365784e-05, 'epoch': 8.31}


 83%|████████▎ | 1415/1700 [30:43<05:39,  1.19s/it]

{'loss': 0.0034, 'grad_norm': 0.25276628136634827, 'learning_rate': 3.3628318584070804e-05, 'epoch': 8.31}


 83%|████████▎ | 1416/1700 [30:44<05:44,  1.21s/it]

{'loss': 0.0012, 'grad_norm': 0.026989808306097984, 'learning_rate': 3.351032448377581e-05, 'epoch': 8.32}


 83%|████████▎ | 1417/1700 [30:45<05:51,  1.24s/it]

{'loss': 0.0122, 'grad_norm': 0.8945261240005493, 'learning_rate': 3.339233038348083e-05, 'epoch': 8.32}


 83%|████████▎ | 1418/1700 [30:47<05:56,  1.27s/it]

{'loss': 0.0037, 'grad_norm': 0.11470494419336319, 'learning_rate': 3.327433628318584e-05, 'epoch': 8.33}


 83%|████████▎ | 1419/1700 [30:48<05:43,  1.22s/it]

{'loss': 0.0009, 'grad_norm': 0.01977897621691227, 'learning_rate': 3.315634218289086e-05, 'epoch': 8.33}


 84%|████████▎ | 1420/1700 [30:49<05:49,  1.25s/it]

{'loss': 0.0018, 'grad_norm': 0.05684084817767143, 'learning_rate': 3.303834808259587e-05, 'epoch': 8.34}


 84%|████████▎ | 1421/1700 [30:51<05:54,  1.27s/it]

{'loss': 0.0009, 'grad_norm': 0.011691169813275337, 'learning_rate': 3.2920353982300886e-05, 'epoch': 8.35}


 84%|████████▎ | 1422/1700 [30:52<05:48,  1.25s/it]

{'loss': 0.0012, 'grad_norm': 0.026789190247654915, 'learning_rate': 3.28023598820059e-05, 'epoch': 8.35}


 84%|████████▎ | 1423/1700 [30:53<05:44,  1.25s/it]

{'loss': 0.0005, 'grad_norm': 0.005699514411389828, 'learning_rate': 3.268436578171092e-05, 'epoch': 8.36}


 84%|████████▍ | 1424/1700 [30:55<06:17,  1.37s/it]

{'loss': 0.0021, 'grad_norm': 0.04267396405339241, 'learning_rate': 3.256637168141593e-05, 'epoch': 8.36}


 84%|████████▍ | 1425/1700 [30:56<06:13,  1.36s/it]

{'loss': 0.0008, 'grad_norm': 0.009971942752599716, 'learning_rate': 3.244837758112095e-05, 'epoch': 8.37}


 84%|████████▍ | 1426/1700 [30:57<06:26,  1.41s/it]

{'loss': 0.0011, 'grad_norm': 0.015381459146738052, 'learning_rate': 3.233038348082596e-05, 'epoch': 8.38}


 84%|████████▍ | 1427/1700 [30:59<06:16,  1.38s/it]

{'loss': 0.0108, 'grad_norm': 0.3151383697986603, 'learning_rate': 3.2212389380530975e-05, 'epoch': 8.38}


 84%|████████▍ | 1428/1700 [31:00<06:32,  1.44s/it]

{'loss': 0.0037, 'grad_norm': 0.19423498213291168, 'learning_rate': 3.209439528023599e-05, 'epoch': 8.39}


 84%|████████▍ | 1429/1700 [31:02<06:19,  1.40s/it]

{'loss': 0.001, 'grad_norm': 0.017042947933077812, 'learning_rate': 3.1976401179941006e-05, 'epoch': 8.39}


 84%|████████▍ | 1430/1700 [31:03<05:55,  1.32s/it]

{'loss': 0.0009, 'grad_norm': 0.017741767689585686, 'learning_rate': 3.185840707964602e-05, 'epoch': 8.4}


 84%|████████▍ | 1431/1700 [31:04<05:58,  1.33s/it]

{'loss': 0.0009, 'grad_norm': 0.010902944952249527, 'learning_rate': 3.174041297935104e-05, 'epoch': 8.41}


 84%|████████▍ | 1432/1700 [31:06<05:58,  1.34s/it]

{'loss': 0.0006, 'grad_norm': 0.007563411258161068, 'learning_rate': 3.162241887905605e-05, 'epoch': 8.41}


 84%|████████▍ | 1433/1700 [31:07<05:36,  1.26s/it]

{'loss': 0.0007, 'grad_norm': 0.015205557458102703, 'learning_rate': 3.1504424778761064e-05, 'epoch': 8.42}


 84%|████████▍ | 1434/1700 [31:08<06:02,  1.36s/it]

{'loss': 0.0011, 'grad_norm': 0.01639598049223423, 'learning_rate': 3.1386430678466076e-05, 'epoch': 8.42}


 84%|████████▍ | 1435/1700 [31:09<05:52,  1.33s/it]

{'loss': 0.0023, 'grad_norm': 0.047927118837833405, 'learning_rate': 3.1268436578171095e-05, 'epoch': 8.43}


 84%|████████▍ | 1436/1700 [31:11<05:40,  1.29s/it]

{'loss': 0.0011, 'grad_norm': 0.01569107174873352, 'learning_rate': 3.115044247787611e-05, 'epoch': 8.43}


 85%|████████▍ | 1437/1700 [31:12<05:27,  1.25s/it]

{'loss': 0.0006, 'grad_norm': 0.009431383572518826, 'learning_rate': 3.103244837758113e-05, 'epoch': 8.44}


 85%|████████▍ | 1438/1700 [31:13<05:31,  1.26s/it]

{'loss': 0.0004, 'grad_norm': 0.007147455122321844, 'learning_rate': 3.091445427728614e-05, 'epoch': 8.45}


 85%|████████▍ | 1439/1700 [31:14<05:26,  1.25s/it]

{'loss': 0.0011, 'grad_norm': 0.015417260117828846, 'learning_rate': 3.079646017699115e-05, 'epoch': 8.45}


 85%|████████▍ | 1440/1700 [31:15<05:13,  1.21s/it]

{'loss': 0.0009, 'grad_norm': 0.020380789414048195, 'learning_rate': 3.0678466076696165e-05, 'epoch': 8.46}


 85%|████████▍ | 1441/1700 [31:17<05:11,  1.20s/it]

{'loss': 0.0017, 'grad_norm': 0.053045306354761124, 'learning_rate': 3.0560471976401184e-05, 'epoch': 8.46}


 85%|████████▍ | 1442/1700 [31:18<05:13,  1.22s/it]

{'loss': 0.0025, 'grad_norm': 0.20080886781215668, 'learning_rate': 3.0442477876106197e-05, 'epoch': 8.47}


 85%|████████▍ | 1443/1700 [31:19<05:08,  1.20s/it]

{'loss': 0.0007, 'grad_norm': 0.016793319955468178, 'learning_rate': 3.0324483775811213e-05, 'epoch': 8.48}


 85%|████████▍ | 1444/1700 [31:20<05:19,  1.25s/it]

{'loss': 0.0014, 'grad_norm': 0.024116717278957367, 'learning_rate': 3.0206489675516225e-05, 'epoch': 8.48}


 85%|████████▌ | 1445/1700 [31:22<05:59,  1.41s/it]

{'loss': 0.0031, 'grad_norm': 0.0862327292561531, 'learning_rate': 3.008849557522124e-05, 'epoch': 8.49}


 85%|████████▌ | 1446/1700 [31:24<05:54,  1.40s/it]

{'loss': 0.0005, 'grad_norm': 0.008204178884625435, 'learning_rate': 2.9970501474926254e-05, 'epoch': 8.49}


 85%|████████▌ | 1447/1700 [31:25<05:50,  1.39s/it]

{'loss': 0.0165, 'grad_norm': 0.2253372073173523, 'learning_rate': 2.9852507374631273e-05, 'epoch': 8.5}


 85%|████████▌ | 1448/1700 [31:26<05:42,  1.36s/it]

{'loss': 0.001, 'grad_norm': 0.014078685082495213, 'learning_rate': 2.9734513274336286e-05, 'epoch': 8.51}


 85%|████████▌ | 1449/1700 [31:27<05:23,  1.29s/it]

{'loss': 0.0006, 'grad_norm': 0.018685519695281982, 'learning_rate': 2.9616519174041302e-05, 'epoch': 8.51}


 85%|████████▌ | 1450/1700 [31:28<05:12,  1.25s/it]

{'loss': 0.0008, 'grad_norm': 0.012139859609305859, 'learning_rate': 2.9498525073746314e-05, 'epoch': 8.52}


 85%|████████▌ | 1451/1700 [31:30<05:05,  1.23s/it]

{'loss': 0.0008, 'grad_norm': 0.011950635351240635, 'learning_rate': 2.938053097345133e-05, 'epoch': 8.52}


 85%|████████▌ | 1452/1700 [31:31<04:47,  1.16s/it]

{'loss': 0.0004, 'grad_norm': 0.007976027205586433, 'learning_rate': 2.9262536873156343e-05, 'epoch': 8.53}


 85%|████████▌ | 1453/1700 [31:32<05:16,  1.28s/it]

{'loss': 0.0034, 'grad_norm': 0.11559173464775085, 'learning_rate': 2.9144542772861362e-05, 'epoch': 8.53}


 86%|████████▌ | 1454/1700 [31:33<05:00,  1.22s/it]

{'loss': 0.0018, 'grad_norm': 0.049335602670907974, 'learning_rate': 2.902654867256637e-05, 'epoch': 8.54}


 86%|████████▌ | 1455/1700 [31:35<05:17,  1.30s/it]

{'loss': 0.0116, 'grad_norm': 0.08785299211740494, 'learning_rate': 2.890855457227139e-05, 'epoch': 8.55}


 86%|████████▌ | 1456/1700 [31:36<05:12,  1.28s/it]

{'loss': 0.0011, 'grad_norm': 0.016430148854851723, 'learning_rate': 2.8790560471976403e-05, 'epoch': 8.55}


 86%|████████▌ | 1457/1700 [31:37<05:14,  1.29s/it]

{'loss': 0.0018, 'grad_norm': 0.03648603335022926, 'learning_rate': 2.867256637168142e-05, 'epoch': 8.56}


 86%|████████▌ | 1458/1700 [31:39<05:03,  1.25s/it]

{'loss': 0.0011, 'grad_norm': 0.0784975066781044, 'learning_rate': 2.8554572271386432e-05, 'epoch': 8.56}


 86%|████████▌ | 1459/1700 [31:40<04:58,  1.24s/it]

{'loss': 0.0007, 'grad_norm': 0.009715836495161057, 'learning_rate': 2.8436578171091448e-05, 'epoch': 8.57}


 86%|████████▌ | 1460/1700 [31:41<05:15,  1.32s/it]

{'loss': 0.0042, 'grad_norm': 0.15667392313480377, 'learning_rate': 2.831858407079646e-05, 'epoch': 8.58}


 86%|████████▌ | 1461/1700 [31:42<04:55,  1.24s/it]

{'loss': 0.0007, 'grad_norm': 0.02064603939652443, 'learning_rate': 2.820058997050148e-05, 'epoch': 8.58}


 86%|████████▌ | 1462/1700 [31:44<05:04,  1.28s/it]

{'loss': 0.0013, 'grad_norm': 0.03114822879433632, 'learning_rate': 2.8082595870206492e-05, 'epoch': 8.59}


 86%|████████▌ | 1463/1700 [31:45<05:00,  1.27s/it]

{'loss': 0.0014, 'grad_norm': 0.05466028302907944, 'learning_rate': 2.7964601769911508e-05, 'epoch': 8.59}


 86%|████████▌ | 1464/1700 [31:46<05:02,  1.28s/it]

{'loss': 0.0009, 'grad_norm': 0.01256519928574562, 'learning_rate': 2.784660766961652e-05, 'epoch': 8.6}


 86%|████████▌ | 1465/1700 [31:48<05:13,  1.34s/it]

{'loss': 0.0006, 'grad_norm': 0.011224680580198765, 'learning_rate': 2.7728613569321537e-05, 'epoch': 8.6}


 86%|████████▌ | 1466/1700 [31:49<05:02,  1.29s/it]

{'loss': 0.0022, 'grad_norm': 0.08486584573984146, 'learning_rate': 2.761061946902655e-05, 'epoch': 8.61}


 86%|████████▋ | 1467/1700 [31:50<05:18,  1.37s/it]

{'loss': 0.0148, 'grad_norm': 0.6660964488983154, 'learning_rate': 2.749262536873157e-05, 'epoch': 8.62}


 86%|████████▋ | 1468/1700 [31:52<05:53,  1.52s/it]

{'loss': 0.0029, 'grad_norm': 0.054907508194446564, 'learning_rate': 2.7374631268436578e-05, 'epoch': 8.62}


 86%|████████▋ | 1469/1700 [31:53<05:17,  1.37s/it]

{'loss': 0.0008, 'grad_norm': 0.012193731032311916, 'learning_rate': 2.7256637168141597e-05, 'epoch': 8.63}


 86%|████████▋ | 1470/1700 [31:55<05:15,  1.37s/it]

{'loss': 0.0065, 'grad_norm': 0.2310267984867096, 'learning_rate': 2.713864306784661e-05, 'epoch': 8.63}


 87%|████████▋ | 1471/1700 [31:56<04:50,  1.27s/it]

{'loss': 0.0018, 'grad_norm': 0.21905849874019623, 'learning_rate': 2.7020648967551626e-05, 'epoch': 8.64}


 87%|████████▋ | 1472/1700 [31:57<04:52,  1.28s/it]

{'loss': 0.0031, 'grad_norm': 0.2557176649570465, 'learning_rate': 2.6902654867256638e-05, 'epoch': 8.65}


 87%|████████▋ | 1473/1700 [31:58<04:57,  1.31s/it]

{'loss': 0.001, 'grad_norm': 0.02276439219713211, 'learning_rate': 2.6784660766961654e-05, 'epoch': 8.65}


 87%|████████▋ | 1474/1700 [32:00<04:51,  1.29s/it]

{'loss': 0.0007, 'grad_norm': 0.014998762868344784, 'learning_rate': 2.6666666666666667e-05, 'epoch': 8.66}


 87%|████████▋ | 1475/1700 [32:01<05:00,  1.34s/it]

{'loss': 0.0012, 'grad_norm': 0.026811230927705765, 'learning_rate': 2.6548672566371686e-05, 'epoch': 8.66}


 87%|████████▋ | 1476/1700 [32:02<04:50,  1.29s/it]

{'loss': 0.001, 'grad_norm': 0.022285928949713707, 'learning_rate': 2.64306784660767e-05, 'epoch': 8.67}


 87%|████████▋ | 1477/1700 [32:03<04:44,  1.28s/it]

{'loss': 0.0017, 'grad_norm': 0.046595145016908646, 'learning_rate': 2.6312684365781714e-05, 'epoch': 8.68}


 87%|████████▋ | 1478/1700 [32:05<04:40,  1.26s/it]

{'loss': 0.0011, 'grad_norm': 0.021729199215769768, 'learning_rate': 2.6194690265486727e-05, 'epoch': 8.68}


 87%|████████▋ | 1479/1700 [32:06<04:25,  1.20s/it]

{'loss': 0.0006, 'grad_norm': 0.009184420108795166, 'learning_rate': 2.6076696165191743e-05, 'epoch': 8.69}


 87%|████████▋ | 1480/1700 [32:07<04:15,  1.16s/it]

{'loss': 0.0006, 'grad_norm': 0.009625764563679695, 'learning_rate': 2.5958702064896756e-05, 'epoch': 8.69}


 87%|████████▋ | 1481/1700 [32:08<04:29,  1.23s/it]

{'loss': 0.0009, 'grad_norm': 0.01238284446299076, 'learning_rate': 2.5840707964601775e-05, 'epoch': 8.7}


 87%|████████▋ | 1482/1700 [32:09<04:28,  1.23s/it]

{'loss': 0.0009, 'grad_norm': 0.021665815263986588, 'learning_rate': 2.5722713864306784e-05, 'epoch': 8.7}


 87%|████████▋ | 1483/1700 [32:11<04:25,  1.22s/it]

{'loss': 0.0132, 'grad_norm': 0.09961699694395065, 'learning_rate': 2.5604719764011803e-05, 'epoch': 8.71}


 87%|████████▋ | 1484/1700 [32:12<04:20,  1.21s/it]

{'loss': 0.0005, 'grad_norm': 0.008154701441526413, 'learning_rate': 2.5486725663716816e-05, 'epoch': 8.72}


 87%|████████▋ | 1485/1700 [32:13<04:21,  1.22s/it]

{'loss': 0.0023, 'grad_norm': 0.10932723432779312, 'learning_rate': 2.5368731563421832e-05, 'epoch': 8.72}


 87%|████████▋ | 1486/1700 [32:14<04:19,  1.21s/it]

{'loss': 0.0012, 'grad_norm': 0.13312695920467377, 'learning_rate': 2.5250737463126844e-05, 'epoch': 8.73}


 87%|████████▋ | 1487/1700 [32:15<04:15,  1.20s/it]

{'loss': 0.001, 'grad_norm': 0.021471744403243065, 'learning_rate': 2.513274336283186e-05, 'epoch': 8.73}


 88%|████████▊ | 1488/1700 [32:17<04:22,  1.24s/it]

{'loss': 0.0015, 'grad_norm': 0.03909873962402344, 'learning_rate': 2.5014749262536873e-05, 'epoch': 8.74}


 88%|████████▊ | 1489/1700 [32:18<04:16,  1.22s/it]

{'loss': 0.0011, 'grad_norm': 0.015606587752699852, 'learning_rate': 2.489675516224189e-05, 'epoch': 8.75}


 88%|████████▊ | 1490/1700 [32:19<04:15,  1.22s/it]

{'loss': 0.0029, 'grad_norm': 0.0739736333489418, 'learning_rate': 2.4778761061946905e-05, 'epoch': 8.75}


 88%|████████▊ | 1491/1700 [32:20<04:11,  1.20s/it]

{'loss': 0.0011, 'grad_norm': 0.029108811169862747, 'learning_rate': 2.4660766961651917e-05, 'epoch': 8.76}


 88%|████████▊ | 1492/1700 [32:22<04:55,  1.42s/it]

{'loss': 0.0012, 'grad_norm': 0.05035131052136421, 'learning_rate': 2.4542772861356933e-05, 'epoch': 8.76}


 88%|████████▊ | 1493/1700 [32:24<04:54,  1.42s/it]

{'loss': 0.0013, 'grad_norm': 0.017469145357608795, 'learning_rate': 2.442477876106195e-05, 'epoch': 8.77}


 88%|████████▊ | 1494/1700 [32:25<04:27,  1.30s/it]

{'loss': 0.0004, 'grad_norm': 0.012853192165493965, 'learning_rate': 2.4306784660766962e-05, 'epoch': 8.78}


 88%|████████▊ | 1495/1700 [32:26<04:12,  1.23s/it]

{'loss': 0.0005, 'grad_norm': 0.015202241018414497, 'learning_rate': 2.4188790560471978e-05, 'epoch': 8.78}


 88%|████████▊ | 1496/1700 [32:27<04:08,  1.22s/it]

{'loss': 0.0006, 'grad_norm': 0.012116105295717716, 'learning_rate': 2.407079646017699e-05, 'epoch': 8.79}


 88%|████████▊ | 1497/1700 [32:28<03:56,  1.16s/it]

{'loss': 0.0004, 'grad_norm': 0.0074468799866735935, 'learning_rate': 2.3952802359882006e-05, 'epoch': 8.79}


 88%|████████▊ | 1498/1700 [32:29<03:57,  1.17s/it]

{'loss': 0.0007, 'grad_norm': 0.012248058803379536, 'learning_rate': 2.3834808259587022e-05, 'epoch': 8.8}


 88%|████████▊ | 1499/1700 [32:31<04:27,  1.33s/it]

{'loss': 0.002, 'grad_norm': 0.0280517116189003, 'learning_rate': 2.3716814159292035e-05, 'epoch': 8.8}


 88%|████████▊ | 1500/1700 [32:32<04:22,  1.31s/it]

{'loss': 0.0004, 'grad_norm': 0.005965414922684431, 'learning_rate': 2.359882005899705e-05, 'epoch': 8.81}


 88%|████████▊ | 1501/1700 [32:35<05:49,  1.76s/it]

{'loss': 0.0008, 'grad_norm': 0.01779065653681755, 'learning_rate': 2.3480825958702067e-05, 'epoch': 8.82}


 88%|████████▊ | 1502/1700 [32:36<05:14,  1.59s/it]

{'loss': 0.0009, 'grad_norm': 0.029767334461212158, 'learning_rate': 2.336283185840708e-05, 'epoch': 8.82}


 88%|████████▊ | 1503/1700 [32:38<05:01,  1.53s/it]

{'loss': 0.0021, 'grad_norm': 0.05631369352340698, 'learning_rate': 2.3244837758112095e-05, 'epoch': 8.83}


 88%|████████▊ | 1504/1700 [32:39<04:48,  1.47s/it]

{'loss': 0.001, 'grad_norm': 0.011338051408529282, 'learning_rate': 2.312684365781711e-05, 'epoch': 8.83}


 89%|████████▊ | 1505/1700 [32:40<04:33,  1.40s/it]

{'loss': 0.0011, 'grad_norm': 0.015200890600681305, 'learning_rate': 2.3008849557522124e-05, 'epoch': 8.84}


 89%|████████▊ | 1506/1700 [32:41<04:20,  1.34s/it]

{'loss': 0.0012, 'grad_norm': 0.03449486196041107, 'learning_rate': 2.289085545722714e-05, 'epoch': 8.85}


 89%|████████▊ | 1507/1700 [32:43<04:11,  1.30s/it]

{'loss': 0.001, 'grad_norm': 0.03414604067802429, 'learning_rate': 2.2772861356932156e-05, 'epoch': 8.85}


 89%|████████▊ | 1508/1700 [32:44<04:07,  1.29s/it]

{'loss': 0.0013, 'grad_norm': 0.023938734084367752, 'learning_rate': 2.2654867256637168e-05, 'epoch': 8.86}


 89%|████████▉ | 1509/1700 [32:45<03:56,  1.24s/it]

{'loss': 0.0004, 'grad_norm': 0.009005440399050713, 'learning_rate': 2.2536873156342184e-05, 'epoch': 8.86}


 89%|████████▉ | 1510/1700 [32:46<03:55,  1.24s/it]

{'loss': 0.0007, 'grad_norm': 0.008290738798677921, 'learning_rate': 2.24188790560472e-05, 'epoch': 8.87}


 89%|████████▉ | 1511/1700 [32:47<03:57,  1.25s/it]

{'loss': 0.0015, 'grad_norm': 0.03343235328793526, 'learning_rate': 2.2300884955752213e-05, 'epoch': 8.88}


 89%|████████▉ | 1512/1700 [32:49<03:47,  1.21s/it]

{'loss': 0.0009, 'grad_norm': 0.013654169626533985, 'learning_rate': 2.218289085545723e-05, 'epoch': 8.88}


 89%|████████▉ | 1513/1700 [32:50<03:55,  1.26s/it]

{'loss': 0.0016, 'grad_norm': 0.10413074493408203, 'learning_rate': 2.206489675516224e-05, 'epoch': 8.89}


 89%|████████▉ | 1514/1700 [32:51<03:47,  1.23s/it]

{'loss': 0.0007, 'grad_norm': 0.01277352124452591, 'learning_rate': 2.1946902654867257e-05, 'epoch': 8.89}


 89%|████████▉ | 1515/1700 [32:52<03:44,  1.21s/it]

{'loss': 0.0007, 'grad_norm': 0.012318667955696583, 'learning_rate': 2.1828908554572273e-05, 'epoch': 8.9}


 89%|████████▉ | 1516/1700 [32:54<03:46,  1.23s/it]

{'loss': 0.0013, 'grad_norm': 0.05098455026745796, 'learning_rate': 2.1710914454277286e-05, 'epoch': 8.9}


 89%|████████▉ | 1517/1700 [32:55<03:49,  1.26s/it]

{'loss': 0.0005, 'grad_norm': 0.009668591432273388, 'learning_rate': 2.15929203539823e-05, 'epoch': 8.91}


 89%|████████▉ | 1518/1700 [32:58<05:52,  1.94s/it]

{'loss': 0.1863, 'grad_norm': 0.5627272129058838, 'learning_rate': 2.1474926253687318e-05, 'epoch': 8.92}


 89%|████████▉ | 1519/1700 [33:00<05:12,  1.73s/it]

{'loss': 0.0004, 'grad_norm': 0.010453267954289913, 'learning_rate': 2.135693215339233e-05, 'epoch': 8.92}


 89%|████████▉ | 1520/1700 [33:01<04:38,  1.55s/it]

{'loss': 0.0059, 'grad_norm': 0.13431406021118164, 'learning_rate': 2.1238938053097346e-05, 'epoch': 8.93}


 89%|████████▉ | 1521/1700 [33:02<04:44,  1.59s/it]

{'loss': 0.0012, 'grad_norm': 0.01450477447360754, 'learning_rate': 2.1120943952802362e-05, 'epoch': 8.93}


 90%|████████▉ | 1522/1700 [33:04<04:25,  1.49s/it]

{'loss': 0.0021, 'grad_norm': 0.04722519591450691, 'learning_rate': 2.1002949852507375e-05, 'epoch': 8.94}


 90%|████████▉ | 1523/1700 [33:05<03:58,  1.35s/it]

{'loss': 0.0013, 'grad_norm': 0.05157209932804108, 'learning_rate': 2.088495575221239e-05, 'epoch': 8.95}


 90%|████████▉ | 1524/1700 [33:06<03:44,  1.28s/it]

{'loss': 0.0018, 'grad_norm': 0.06479333341121674, 'learning_rate': 2.0766961651917406e-05, 'epoch': 8.95}


 90%|████████▉ | 1525/1700 [33:07<03:43,  1.28s/it]

{'loss': 0.0006, 'grad_norm': 0.009238429367542267, 'learning_rate': 2.064896755162242e-05, 'epoch': 8.96}


 90%|████████▉ | 1526/1700 [33:08<03:34,  1.23s/it]

{'loss': 0.0035, 'grad_norm': 0.28010162711143494, 'learning_rate': 2.0530973451327435e-05, 'epoch': 8.96}


 90%|████████▉ | 1527/1700 [33:09<03:32,  1.23s/it]

{'loss': 0.0019, 'grad_norm': 0.03423311933875084, 'learning_rate': 2.0412979351032448e-05, 'epoch': 8.97}


 90%|████████▉ | 1528/1700 [33:11<03:35,  1.25s/it]

{'loss': 0.0016, 'grad_norm': 0.0533880814909935, 'learning_rate': 2.0294985250737463e-05, 'epoch': 8.98}


 90%|████████▉ | 1529/1700 [33:12<03:40,  1.29s/it]

{'loss': 0.0071, 'grad_norm': 0.05377993732690811, 'learning_rate': 2.017699115044248e-05, 'epoch': 8.98}


 90%|█████████ | 1530/1700 [33:13<03:37,  1.28s/it]

{'loss': 0.0005, 'grad_norm': 0.007979398593306541, 'learning_rate': 2.0058997050147492e-05, 'epoch': 8.99}


 90%|█████████ | 1531/1700 [33:15<03:35,  1.28s/it]

{'loss': 0.0052, 'grad_norm': 0.17171227931976318, 'learning_rate': 1.9941002949852508e-05, 'epoch': 8.99}


 90%|█████████ | 1532/1700 [33:16<03:37,  1.30s/it]

{'loss': 0.0014, 'grad_norm': 0.037349242717027664, 'learning_rate': 1.9823008849557524e-05, 'epoch': 9.0}


 90%|█████████ | 1533/1700 [33:17<03:40,  1.32s/it]

{'loss': 0.0024, 'grad_norm': 0.08715590834617615, 'learning_rate': 1.9705014749262536e-05, 'epoch': 9.0}


 90%|█████████ | 1534/1700 [33:21<05:39,  2.05s/it]

{'loss': 0.1331, 'grad_norm': 0.232794851064682, 'learning_rate': 1.9587020648967552e-05, 'epoch': 9.01}


 90%|█████████ | 1535/1700 [33:22<04:53,  1.78s/it]

{'loss': 0.0008, 'grad_norm': 0.01487795077264309, 'learning_rate': 1.946902654867257e-05, 'epoch': 9.02}


 90%|█████████ | 1536/1700 [33:24<04:23,  1.61s/it]

{'loss': 0.0006, 'grad_norm': 0.011100776493549347, 'learning_rate': 1.935103244837758e-05, 'epoch': 9.02}


 90%|█████████ | 1537/1700 [33:25<04:02,  1.49s/it]

{'loss': 0.0007, 'grad_norm': 0.020970027893781662, 'learning_rate': 1.9233038348082597e-05, 'epoch': 9.03}


 90%|█████████ | 1538/1700 [33:26<03:50,  1.42s/it]

{'loss': 0.0016, 'grad_norm': 0.05236693471670151, 'learning_rate': 1.9115044247787613e-05, 'epoch': 9.03}


 91%|█████████ | 1539/1700 [33:27<03:53,  1.45s/it]

{'loss': 0.0008, 'grad_norm': 0.014620209112763405, 'learning_rate': 1.8997050147492625e-05, 'epoch': 9.04}


 91%|█████████ | 1540/1700 [33:29<03:38,  1.36s/it]

{'loss': 0.0007, 'grad_norm': 0.01633583940565586, 'learning_rate': 1.887905604719764e-05, 'epoch': 9.05}


 91%|█████████ | 1541/1700 [33:30<03:37,  1.37s/it]

{'loss': 0.0007, 'grad_norm': 0.012208789587020874, 'learning_rate': 1.8761061946902654e-05, 'epoch': 9.05}


 91%|█████████ | 1542/1700 [33:31<03:26,  1.31s/it]

{'loss': 0.0019, 'grad_norm': 0.05651043355464935, 'learning_rate': 1.864306784660767e-05, 'epoch': 9.06}


 91%|█████████ | 1543/1700 [33:32<03:21,  1.29s/it]

{'loss': 0.0006, 'grad_norm': 0.007730777375400066, 'learning_rate': 1.8525073746312686e-05, 'epoch': 9.06}


 91%|█████████ | 1544/1700 [33:34<03:13,  1.24s/it]

{'loss': 0.0006, 'grad_norm': 0.009446896612644196, 'learning_rate': 1.84070796460177e-05, 'epoch': 9.07}


 91%|█████████ | 1545/1700 [33:35<03:25,  1.33s/it]

{'loss': 0.0013, 'grad_norm': 0.027251509949564934, 'learning_rate': 1.8289085545722714e-05, 'epoch': 9.07}


 91%|█████████ | 1546/1700 [33:36<03:26,  1.34s/it]

{'loss': 0.0007, 'grad_norm': 0.010408119298517704, 'learning_rate': 1.817109144542773e-05, 'epoch': 9.08}


 91%|█████████ | 1547/1700 [33:38<03:27,  1.35s/it]

{'loss': 0.0076, 'grad_norm': 0.06755553930997849, 'learning_rate': 1.8053097345132743e-05, 'epoch': 9.09}


 91%|█████████ | 1548/1700 [33:39<03:09,  1.25s/it]

{'loss': 0.0004, 'grad_norm': 0.006612506695091724, 'learning_rate': 1.793510324483776e-05, 'epoch': 9.09}


 91%|█████████ | 1549/1700 [33:40<03:11,  1.27s/it]

{'loss': 0.0007, 'grad_norm': 0.010209176689386368, 'learning_rate': 1.7817109144542775e-05, 'epoch': 9.1}


 91%|█████████ | 1550/1700 [33:42<03:27,  1.38s/it]

{'loss': 0.0015, 'grad_norm': 0.028573518618941307, 'learning_rate': 1.7699115044247787e-05, 'epoch': 9.1}


 91%|█████████ | 1551/1700 [33:43<03:15,  1.31s/it]

{'loss': 0.0005, 'grad_norm': 0.007972314022481441, 'learning_rate': 1.7581120943952803e-05, 'epoch': 9.11}


 91%|█████████▏| 1552/1700 [33:44<03:14,  1.31s/it]

{'loss': 0.001, 'grad_norm': 0.029603857547044754, 'learning_rate': 1.746312684365782e-05, 'epoch': 9.12}


 91%|█████████▏| 1553/1700 [33:45<03:00,  1.22s/it]

{'loss': 0.0024, 'grad_norm': 0.039794716984033585, 'learning_rate': 1.734513274336283e-05, 'epoch': 9.12}


 91%|█████████▏| 1554/1700 [33:47<03:00,  1.23s/it]

{'loss': 0.0007, 'grad_norm': 0.00849674828350544, 'learning_rate': 1.7227138643067848e-05, 'epoch': 9.13}


 91%|█████████▏| 1555/1700 [33:48<03:08,  1.30s/it]

{'loss': 0.0007, 'grad_norm': 0.010438892059028149, 'learning_rate': 1.7109144542772864e-05, 'epoch': 9.13}


 92%|█████████▏| 1556/1700 [33:49<03:09,  1.32s/it]

{'loss': 0.0005, 'grad_norm': 0.0073317852802574635, 'learning_rate': 1.6991150442477876e-05, 'epoch': 9.14}


 92%|█████████▏| 1557/1700 [33:51<03:01,  1.27s/it]

{'loss': 0.0004, 'grad_norm': 0.00708686513826251, 'learning_rate': 1.6873156342182892e-05, 'epoch': 9.15}


 92%|█████████▏| 1558/1700 [33:52<03:00,  1.27s/it]

{'loss': 0.0011, 'grad_norm': 0.025182094424962997, 'learning_rate': 1.6755162241887905e-05, 'epoch': 9.15}


 92%|█████████▏| 1559/1700 [33:53<02:59,  1.27s/it]

{'loss': 0.001, 'grad_norm': 0.018304143100976944, 'learning_rate': 1.663716814159292e-05, 'epoch': 9.16}


 92%|█████████▏| 1560/1700 [33:54<02:50,  1.22s/it]

{'loss': 0.0005, 'grad_norm': 0.00642132293432951, 'learning_rate': 1.6519174041297937e-05, 'epoch': 9.16}


 92%|█████████▏| 1561/1700 [33:56<02:56,  1.27s/it]

{'loss': 0.0009, 'grad_norm': 0.036057714372873306, 'learning_rate': 1.640117994100295e-05, 'epoch': 9.17}


 92%|█████████▏| 1562/1700 [33:57<02:56,  1.28s/it]

{'loss': 0.0012, 'grad_norm': 0.029263518750667572, 'learning_rate': 1.6283185840707965e-05, 'epoch': 9.17}


 92%|█████████▏| 1563/1700 [33:58<03:06,  1.36s/it]

{'loss': 0.0009, 'grad_norm': 0.011016595177352428, 'learning_rate': 1.616519174041298e-05, 'epoch': 9.18}


 92%|█████████▏| 1564/1700 [34:00<03:02,  1.34s/it]

{'loss': 0.0007, 'grad_norm': 0.020386112853884697, 'learning_rate': 1.6047197640117994e-05, 'epoch': 9.19}


 92%|█████████▏| 1565/1700 [34:01<02:51,  1.27s/it]

{'loss': 0.0005, 'grad_norm': 0.00856840517371893, 'learning_rate': 1.592920353982301e-05, 'epoch': 9.19}


 92%|█████████▏| 1566/1700 [34:02<02:55,  1.31s/it]

{'loss': 0.0009, 'grad_norm': 0.010854805819690228, 'learning_rate': 1.5811209439528025e-05, 'epoch': 9.2}


 92%|█████████▏| 1567/1700 [34:03<02:46,  1.25s/it]

{'loss': 0.0007, 'grad_norm': 0.01032391469925642, 'learning_rate': 1.5693215339233038e-05, 'epoch': 9.2}


 92%|█████████▏| 1568/1700 [34:05<02:50,  1.29s/it]

{'loss': 0.0009, 'grad_norm': 0.02396278828382492, 'learning_rate': 1.5575221238938054e-05, 'epoch': 9.21}


 92%|█████████▏| 1569/1700 [34:06<02:41,  1.23s/it]

{'loss': 0.0003, 'grad_norm': 0.005700458772480488, 'learning_rate': 1.545722713864307e-05, 'epoch': 9.22}


 92%|█████████▏| 1570/1700 [34:07<02:46,  1.28s/it]

{'loss': 0.0012, 'grad_norm': 0.013233106583356857, 'learning_rate': 1.5339233038348082e-05, 'epoch': 9.22}


 92%|█████████▏| 1571/1700 [34:09<02:48,  1.31s/it]

{'loss': 0.0012, 'grad_norm': 0.014025507494807243, 'learning_rate': 1.5221238938053098e-05, 'epoch': 9.23}


 92%|█████████▏| 1572/1700 [34:10<02:47,  1.31s/it]

{'loss': 0.0005, 'grad_norm': 0.007835423573851585, 'learning_rate': 1.5103244837758113e-05, 'epoch': 9.23}


 93%|█████████▎| 1573/1700 [34:11<02:41,  1.27s/it]

{'loss': 0.0006, 'grad_norm': 0.010621958412230015, 'learning_rate': 1.4985250737463127e-05, 'epoch': 9.24}


 93%|█████████▎| 1574/1700 [34:13<02:54,  1.39s/it]

{'loss': 0.0015, 'grad_norm': 0.01939566247165203, 'learning_rate': 1.4867256637168143e-05, 'epoch': 9.25}


 93%|█████████▎| 1575/1700 [34:14<02:48,  1.35s/it]

{'loss': 0.0006, 'grad_norm': 0.008030214346945286, 'learning_rate': 1.4749262536873157e-05, 'epoch': 9.25}


 93%|█████████▎| 1576/1700 [34:15<02:40,  1.30s/it]

{'loss': 0.0007, 'grad_norm': 0.010974268428981304, 'learning_rate': 1.4631268436578171e-05, 'epoch': 9.26}


 93%|█████████▎| 1577/1700 [34:17<02:46,  1.36s/it]

{'loss': 0.0008, 'grad_norm': 0.013686385937035084, 'learning_rate': 1.4513274336283186e-05, 'epoch': 9.26}


 93%|█████████▎| 1578/1700 [34:18<02:39,  1.31s/it]

{'loss': 0.0015, 'grad_norm': 0.020676905289292336, 'learning_rate': 1.4395280235988202e-05, 'epoch': 9.27}


 93%|█████████▎| 1579/1700 [34:19<02:32,  1.26s/it]

{'loss': 0.0004, 'grad_norm': 0.006591553334146738, 'learning_rate': 1.4277286135693216e-05, 'epoch': 9.27}


 93%|█████████▎| 1580/1700 [34:20<02:33,  1.28s/it]

{'loss': 0.0009, 'grad_norm': 0.019493602216243744, 'learning_rate': 1.415929203539823e-05, 'epoch': 9.28}


 93%|█████████▎| 1581/1700 [34:21<02:24,  1.21s/it]

{'loss': 0.0004, 'grad_norm': 0.007453983183950186, 'learning_rate': 1.4041297935103246e-05, 'epoch': 9.29}


 93%|█████████▎| 1582/1700 [34:23<02:32,  1.29s/it]

{'loss': 0.0006, 'grad_norm': 0.007031712215393782, 'learning_rate': 1.392330383480826e-05, 'epoch': 9.29}


 93%|█████████▎| 1583/1700 [34:24<02:20,  1.20s/it]

{'loss': 0.0006, 'grad_norm': 0.009128780104219913, 'learning_rate': 1.3805309734513275e-05, 'epoch': 9.3}


 93%|█████████▎| 1584/1700 [34:25<02:21,  1.22s/it]

{'loss': 0.0006, 'grad_norm': 0.010391111485660076, 'learning_rate': 1.3687315634218289e-05, 'epoch': 9.3}


 93%|█████████▎| 1585/1700 [34:26<02:23,  1.25s/it]

{'loss': 0.0007, 'grad_norm': 0.012414480559527874, 'learning_rate': 1.3569321533923305e-05, 'epoch': 9.31}


 93%|█████████▎| 1586/1700 [34:28<02:49,  1.48s/it]

{'loss': 0.0022, 'grad_norm': 0.029887273907661438, 'learning_rate': 1.3451327433628319e-05, 'epoch': 9.32}


 93%|█████████▎| 1587/1700 [34:30<02:47,  1.48s/it]

{'loss': 0.0008, 'grad_norm': 0.00949917919933796, 'learning_rate': 1.3333333333333333e-05, 'epoch': 9.32}


 93%|█████████▎| 1588/1700 [34:31<02:26,  1.31s/it]

{'loss': 0.0004, 'grad_norm': 0.0075000799261033535, 'learning_rate': 1.321533923303835e-05, 'epoch': 9.33}


 93%|█████████▎| 1589/1700 [34:32<02:20,  1.27s/it]

{'loss': 0.0007, 'grad_norm': 0.012279249727725983, 'learning_rate': 1.3097345132743363e-05, 'epoch': 9.33}


 94%|█████████▎| 1590/1700 [34:33<02:17,  1.25s/it]

{'loss': 0.0008, 'grad_norm': 0.009677330031991005, 'learning_rate': 1.2979351032448378e-05, 'epoch': 9.34}


 94%|█████████▎| 1591/1700 [34:35<02:46,  1.53s/it]

{'loss': 0.0007, 'grad_norm': 0.010470987297594547, 'learning_rate': 1.2861356932153392e-05, 'epoch': 9.35}


 94%|█████████▎| 1592/1700 [34:37<02:53,  1.60s/it]

{'loss': 0.002, 'grad_norm': 0.023618614301085472, 'learning_rate': 1.2743362831858408e-05, 'epoch': 9.35}


 94%|█████████▎| 1593/1700 [34:38<02:40,  1.50s/it]

{'loss': 0.001, 'grad_norm': 0.023447392508387566, 'learning_rate': 1.2625368731563422e-05, 'epoch': 9.36}


 94%|█████████▍| 1594/1700 [34:40<02:30,  1.42s/it]

{'loss': 0.0009, 'grad_norm': 0.014920691028237343, 'learning_rate': 1.2507374631268436e-05, 'epoch': 9.36}


 94%|█████████▍| 1595/1700 [34:41<02:26,  1.39s/it]

{'loss': 0.0008, 'grad_norm': 0.011729462072253227, 'learning_rate': 1.2389380530973452e-05, 'epoch': 9.37}


 94%|█████████▍| 1596/1700 [34:42<02:17,  1.32s/it]

{'loss': 0.0006, 'grad_norm': 0.008390937000513077, 'learning_rate': 1.2271386430678467e-05, 'epoch': 9.37}


 94%|█████████▍| 1597/1700 [34:44<02:27,  1.44s/it]

{'loss': 0.0015, 'grad_norm': 0.031907081604003906, 'learning_rate': 1.2153392330383481e-05, 'epoch': 9.38}


 94%|█████████▍| 1598/1700 [34:45<02:19,  1.36s/it]

{'loss': 0.0005, 'grad_norm': 0.010300827212631702, 'learning_rate': 1.2035398230088495e-05, 'epoch': 9.39}


 94%|█████████▍| 1599/1700 [34:46<02:09,  1.28s/it]

{'loss': 0.0008, 'grad_norm': 0.01598602719604969, 'learning_rate': 1.1917404129793511e-05, 'epoch': 9.39}


 94%|█████████▍| 1600/1700 [34:47<02:03,  1.23s/it]

{'loss': 0.0004, 'grad_norm': 0.0073273382149636745, 'learning_rate': 1.1799410029498525e-05, 'epoch': 9.4}


 94%|█████████▍| 1601/1700 [34:48<01:57,  1.19s/it]

{'loss': 0.0005, 'grad_norm': 0.008018913678824902, 'learning_rate': 1.168141592920354e-05, 'epoch': 9.4}


 94%|█████████▍| 1602/1700 [34:50<02:03,  1.26s/it]

{'loss': 0.0015, 'grad_norm': 0.017082996666431427, 'learning_rate': 1.1563421828908556e-05, 'epoch': 9.41}


 94%|█████████▍| 1603/1700 [34:51<02:06,  1.31s/it]

{'loss': 0.0008, 'grad_norm': 0.009916485287249088, 'learning_rate': 1.144542772861357e-05, 'epoch': 9.42}


 94%|█████████▍| 1604/1700 [34:53<02:05,  1.30s/it]

{'loss': 0.0006, 'grad_norm': 0.010005536489188671, 'learning_rate': 1.1327433628318584e-05, 'epoch': 9.42}


 94%|█████████▍| 1605/1700 [34:54<01:56,  1.22s/it]

{'loss': 0.0016, 'grad_norm': 0.053135309368371964, 'learning_rate': 1.12094395280236e-05, 'epoch': 9.43}


 94%|█████████▍| 1606/1700 [34:55<01:52,  1.20s/it]

{'loss': 0.0004, 'grad_norm': 0.0061204503290355206, 'learning_rate': 1.1091445427728614e-05, 'epoch': 9.43}


 95%|█████████▍| 1607/1700 [34:57<02:12,  1.43s/it]

{'loss': 0.0005, 'grad_norm': 0.007964534685015678, 'learning_rate': 1.0973451327433629e-05, 'epoch': 9.44}


 95%|█████████▍| 1608/1700 [34:58<02:06,  1.38s/it]

{'loss': 0.0007, 'grad_norm': 0.009696870110929012, 'learning_rate': 1.0855457227138643e-05, 'epoch': 9.44}


 95%|█████████▍| 1609/1700 [34:59<01:54,  1.26s/it]

{'loss': 0.0013, 'grad_norm': 0.03384040296077728, 'learning_rate': 1.0737463126843659e-05, 'epoch': 9.45}


 95%|█████████▍| 1610/1700 [35:00<01:49,  1.21s/it]

{'loss': 0.0009, 'grad_norm': 0.040889639407396317, 'learning_rate': 1.0619469026548673e-05, 'epoch': 9.46}


 95%|█████████▍| 1611/1700 [35:01<01:48,  1.22s/it]

{'loss': 0.0007, 'grad_norm': 0.008750589564442635, 'learning_rate': 1.0501474926253687e-05, 'epoch': 9.46}


 95%|█████████▍| 1612/1700 [35:02<01:47,  1.22s/it]

{'loss': 0.0004, 'grad_norm': 0.005270637106150389, 'learning_rate': 1.0383480825958703e-05, 'epoch': 9.47}


 95%|█████████▍| 1613/1700 [35:04<01:43,  1.18s/it]

{'loss': 0.0005, 'grad_norm': 0.006953565403819084, 'learning_rate': 1.0265486725663717e-05, 'epoch': 9.47}


 95%|█████████▍| 1614/1700 [35:05<01:41,  1.18s/it]

{'loss': 0.0007, 'grad_norm': 0.010537161491811275, 'learning_rate': 1.0147492625368732e-05, 'epoch': 9.48}


 95%|█████████▌| 1615/1700 [35:06<01:39,  1.17s/it]

{'loss': 0.0004, 'grad_norm': 0.00700885197147727, 'learning_rate': 1.0029498525073746e-05, 'epoch': 9.49}


 95%|█████████▌| 1616/1700 [35:07<01:40,  1.20s/it]

{'loss': 0.0021, 'grad_norm': 0.0786113291978836, 'learning_rate': 9.911504424778762e-06, 'epoch': 9.49}


 95%|█████████▌| 1617/1700 [35:08<01:39,  1.19s/it]

{'loss': 0.0005, 'grad_norm': 0.009653720073401928, 'learning_rate': 9.793510324483776e-06, 'epoch': 9.5}


 95%|█████████▌| 1618/1700 [35:10<01:41,  1.23s/it]

{'loss': 0.0026, 'grad_norm': 0.061905637383461, 'learning_rate': 9.67551622418879e-06, 'epoch': 9.5}


 95%|█████████▌| 1619/1700 [35:11<01:45,  1.31s/it]

{'loss': 0.0019, 'grad_norm': 0.06257805228233337, 'learning_rate': 9.557522123893806e-06, 'epoch': 9.51}


 95%|█████████▌| 1620/1700 [35:13<01:46,  1.34s/it]

{'loss': 0.0012, 'grad_norm': 0.02552262507379055, 'learning_rate': 9.43952802359882e-06, 'epoch': 9.52}


 95%|█████████▌| 1621/1700 [35:14<01:42,  1.29s/it]

{'loss': 0.0007, 'grad_norm': 0.010905877687036991, 'learning_rate': 9.321533923303835e-06, 'epoch': 9.52}


 95%|█████████▌| 1622/1700 [35:15<01:36,  1.24s/it]

{'loss': 0.0021, 'grad_norm': 0.0692790076136589, 'learning_rate': 9.20353982300885e-06, 'epoch': 9.53}


 95%|█████████▌| 1623/1700 [35:16<01:34,  1.22s/it]

{'loss': 0.0008, 'grad_norm': 0.010577411390841007, 'learning_rate': 9.085545722713865e-06, 'epoch': 9.53}


 96%|█████████▌| 1624/1700 [35:18<01:49,  1.44s/it]

{'loss': 0.0005, 'grad_norm': 0.00788542628288269, 'learning_rate': 8.96755162241888e-06, 'epoch': 9.54}


 96%|█████████▌| 1625/1700 [35:19<01:42,  1.37s/it]

{'loss': 0.0013, 'grad_norm': 0.0229475274682045, 'learning_rate': 8.849557522123894e-06, 'epoch': 9.54}


 96%|█████████▌| 1626/1700 [35:20<01:38,  1.33s/it]

{'loss': 0.0039, 'grad_norm': 0.0512264259159565, 'learning_rate': 8.73156342182891e-06, 'epoch': 9.55}


 96%|█████████▌| 1627/1700 [35:22<01:38,  1.35s/it]

{'loss': 0.0009, 'grad_norm': 0.012241032905876637, 'learning_rate': 8.613569321533924e-06, 'epoch': 9.56}


 96%|█████████▌| 1628/1700 [35:23<01:29,  1.24s/it]

{'loss': 0.0005, 'grad_norm': 0.007209652103483677, 'learning_rate': 8.495575221238938e-06, 'epoch': 9.56}


 96%|█████████▌| 1629/1700 [35:24<01:29,  1.25s/it]

{'loss': 0.0007, 'grad_norm': 0.009567039087414742, 'learning_rate': 8.377581120943952e-06, 'epoch': 9.57}


 96%|█████████▌| 1630/1700 [35:25<01:28,  1.27s/it]

{'loss': 0.003, 'grad_norm': 0.05096663907170296, 'learning_rate': 8.259587020648968e-06, 'epoch': 9.57}


 96%|█████████▌| 1631/1700 [35:27<01:31,  1.33s/it]

{'loss': 0.001, 'grad_norm': 0.010633034631609917, 'learning_rate': 8.141592920353983e-06, 'epoch': 9.58}


 96%|█████████▌| 1632/1700 [35:28<01:28,  1.31s/it]

{'loss': 0.0011, 'grad_norm': 0.028035081923007965, 'learning_rate': 8.023598820058997e-06, 'epoch': 9.59}


 96%|█████████▌| 1633/1700 [35:29<01:26,  1.29s/it]

{'loss': 0.0005, 'grad_norm': 0.007167221046984196, 'learning_rate': 7.905604719764013e-06, 'epoch': 9.59}


 96%|█████████▌| 1634/1700 [35:31<01:25,  1.29s/it]

{'loss': 0.001, 'grad_norm': 0.011075749062001705, 'learning_rate': 7.787610619469027e-06, 'epoch': 9.6}


 96%|█████████▌| 1635/1700 [35:32<01:26,  1.33s/it]

{'loss': 0.0006, 'grad_norm': 0.010826771147549152, 'learning_rate': 7.669616519174041e-06, 'epoch': 9.6}


 96%|█████████▌| 1636/1700 [35:33<01:26,  1.35s/it]

{'loss': 0.0007, 'grad_norm': 0.01702970825135708, 'learning_rate': 7.551622418879056e-06, 'epoch': 9.61}


 96%|█████████▋| 1637/1700 [35:35<01:21,  1.29s/it]

{'loss': 0.0005, 'grad_norm': 0.007783920969814062, 'learning_rate': 7.4336283185840714e-06, 'epoch': 9.62}


 96%|█████████▋| 1638/1700 [35:36<01:17,  1.25s/it]

{'loss': 0.0043, 'grad_norm': 0.128281369805336, 'learning_rate': 7.315634218289086e-06, 'epoch': 9.62}


 96%|█████████▋| 1639/1700 [35:37<01:18,  1.29s/it]

{'loss': 0.001, 'grad_norm': 0.027579765766859055, 'learning_rate': 7.197640117994101e-06, 'epoch': 9.63}


 96%|█████████▋| 1640/1700 [35:38<01:14,  1.24s/it]

{'loss': 0.0004, 'grad_norm': 0.007525729946792126, 'learning_rate': 7.079646017699115e-06, 'epoch': 9.63}


 97%|█████████▋| 1641/1700 [35:40<01:15,  1.29s/it]

{'loss': 0.001, 'grad_norm': 0.010722333565354347, 'learning_rate': 6.96165191740413e-06, 'epoch': 9.64}


 97%|█████████▋| 1642/1700 [35:41<01:10,  1.21s/it]

{'loss': 0.0004, 'grad_norm': 0.017777958884835243, 'learning_rate': 6.843657817109144e-06, 'epoch': 9.64}


 97%|█████████▋| 1643/1700 [35:42<01:10,  1.23s/it]

{'loss': 0.0004, 'grad_norm': 0.00988707784563303, 'learning_rate': 6.7256637168141595e-06, 'epoch': 9.65}


 97%|█████████▋| 1644/1700 [35:43<01:08,  1.21s/it]

{'loss': 0.0007, 'grad_norm': 0.01818162016570568, 'learning_rate': 6.607669616519175e-06, 'epoch': 9.66}


 97%|█████████▋| 1645/1700 [35:45<01:11,  1.30s/it]

{'loss': 0.0014, 'grad_norm': 0.05364663526415825, 'learning_rate': 6.489675516224189e-06, 'epoch': 9.66}


 97%|█████████▋| 1646/1700 [35:46<01:06,  1.23s/it]

{'loss': 0.0004, 'grad_norm': 0.006342275068163872, 'learning_rate': 6.371681415929204e-06, 'epoch': 9.67}


 97%|█████████▋| 1647/1700 [35:47<01:03,  1.19s/it]

{'loss': 0.0004, 'grad_norm': 0.005377381574362516, 'learning_rate': 6.253687315634218e-06, 'epoch': 9.67}


 97%|█████████▋| 1648/1700 [35:48<01:05,  1.26s/it]

{'loss': 0.0008, 'grad_norm': 0.012906106188893318, 'learning_rate': 6.135693215339233e-06, 'epoch': 9.68}


 97%|█████████▋| 1649/1700 [35:49<01:03,  1.24s/it]

{'loss': 0.0006, 'grad_norm': 0.00807885080575943, 'learning_rate': 6.017699115044248e-06, 'epoch': 9.69}


 97%|█████████▋| 1650/1700 [35:51<01:02,  1.25s/it]

{'loss': 0.0011, 'grad_norm': 0.01491606142371893, 'learning_rate': 5.899705014749263e-06, 'epoch': 9.69}


 97%|█████████▋| 1651/1700 [35:52<01:03,  1.30s/it]

{'loss': 0.0013, 'grad_norm': 0.07870050519704819, 'learning_rate': 5.781710914454278e-06, 'epoch': 9.7}


 97%|█████████▋| 1652/1700 [35:53<01:00,  1.26s/it]

{'loss': 0.0006, 'grad_norm': 0.025620900094509125, 'learning_rate': 5.663716814159292e-06, 'epoch': 9.7}


 97%|█████████▋| 1653/1700 [35:55<01:01,  1.30s/it]

{'loss': 0.002, 'grad_norm': 0.03674044832587242, 'learning_rate': 5.545722713864307e-06, 'epoch': 9.71}


 97%|█████████▋| 1654/1700 [35:56<00:58,  1.27s/it]

{'loss': 0.0004, 'grad_norm': 0.008821086958050728, 'learning_rate': 5.427728613569321e-06, 'epoch': 9.72}


 97%|█████████▋| 1655/1700 [35:57<01:00,  1.33s/it]

{'loss': 0.0007, 'grad_norm': 0.008534695953130722, 'learning_rate': 5.3097345132743365e-06, 'epoch': 9.72}


 97%|█████████▋| 1656/1700 [35:59<01:00,  1.37s/it]

{'loss': 0.0007, 'grad_norm': 0.008780896663665771, 'learning_rate': 5.191740412979352e-06, 'epoch': 9.73}


 97%|█████████▋| 1657/1700 [36:00<00:59,  1.37s/it]

{'loss': 0.0007, 'grad_norm': 0.00844966247677803, 'learning_rate': 5.073746312684366e-06, 'epoch': 9.73}


 98%|█████████▊| 1658/1700 [36:01<00:53,  1.27s/it]

{'loss': 0.0004, 'grad_norm': 0.005630427971482277, 'learning_rate': 4.955752212389381e-06, 'epoch': 9.74}


 98%|█████████▊| 1659/1700 [36:02<00:50,  1.23s/it]

{'loss': 0.0006, 'grad_norm': 0.01725473254919052, 'learning_rate': 4.837758112094395e-06, 'epoch': 9.74}


 98%|█████████▊| 1660/1700 [36:04<00:51,  1.28s/it]

{'loss': 0.0008, 'grad_norm': 0.009859698824584484, 'learning_rate': 4.71976401179941e-06, 'epoch': 9.75}


 98%|█████████▊| 1661/1700 [36:05<00:47,  1.21s/it]

{'loss': 0.0005, 'grad_norm': 0.008006369695067406, 'learning_rate': 4.601769911504425e-06, 'epoch': 9.76}


 98%|█████████▊| 1662/1700 [36:06<00:49,  1.29s/it]

{'loss': 0.0018, 'grad_norm': 0.040229957550764084, 'learning_rate': 4.48377581120944e-06, 'epoch': 9.76}


 98%|█████████▊| 1663/1700 [36:07<00:46,  1.26s/it]

{'loss': 0.001, 'grad_norm': 0.028075560927391052, 'learning_rate': 4.365781710914455e-06, 'epoch': 9.77}


 98%|█████████▊| 1664/1700 [36:09<00:42,  1.19s/it]

{'loss': 0.0006, 'grad_norm': 0.012790519744157791, 'learning_rate': 4.247787610619469e-06, 'epoch': 9.77}


 98%|█████████▊| 1665/1700 [36:10<00:42,  1.22s/it]

{'loss': 0.002, 'grad_norm': 0.2585982084274292, 'learning_rate': 4.129793510324484e-06, 'epoch': 9.78}


 98%|█████████▊| 1666/1700 [36:11<00:38,  1.14s/it]

{'loss': 0.0004, 'grad_norm': 0.011659082025289536, 'learning_rate': 4.011799410029498e-06, 'epoch': 9.79}


 98%|█████████▊| 1667/1700 [36:12<00:38,  1.18s/it]

{'loss': 0.0007, 'grad_norm': 0.011708556674420834, 'learning_rate': 3.8938053097345135e-06, 'epoch': 9.79}


 98%|█████████▊| 1668/1700 [36:13<00:36,  1.13s/it]

{'loss': 0.0005, 'grad_norm': 0.007023846264928579, 'learning_rate': 3.775811209439528e-06, 'epoch': 9.8}


 98%|█████████▊| 1669/1700 [36:14<00:36,  1.19s/it]

{'loss': 0.0008, 'grad_norm': 0.012180048041045666, 'learning_rate': 3.657817109144543e-06, 'epoch': 9.8}


 98%|█████████▊| 1670/1700 [36:15<00:34,  1.15s/it]

{'loss': 0.001, 'grad_norm': 0.012741009704768658, 'learning_rate': 3.5398230088495575e-06, 'epoch': 9.81}


 98%|█████████▊| 1671/1700 [36:17<00:37,  1.29s/it]

{'loss': 0.001, 'grad_norm': 0.0578695610165596, 'learning_rate': 3.421828908554572e-06, 'epoch': 9.81}


 98%|█████████▊| 1672/1700 [36:18<00:35,  1.28s/it]

{'loss': 0.0005, 'grad_norm': 0.008312945254147053, 'learning_rate': 3.3038348082595873e-06, 'epoch': 9.82}


 98%|█████████▊| 1673/1700 [36:20<00:37,  1.38s/it]

{'loss': 0.0018, 'grad_norm': 0.022665759548544884, 'learning_rate': 3.185840707964602e-06, 'epoch': 9.83}


 98%|█████████▊| 1674/1700 [36:21<00:34,  1.31s/it]

{'loss': 0.0004, 'grad_norm': 0.004716134630143642, 'learning_rate': 3.0678466076696167e-06, 'epoch': 9.83}


 99%|█████████▊| 1675/1700 [36:22<00:30,  1.22s/it]

{'loss': 0.0006, 'grad_norm': 0.00859987922012806, 'learning_rate': 2.9498525073746313e-06, 'epoch': 9.84}


 99%|█████████▊| 1676/1700 [36:23<00:30,  1.25s/it]

{'loss': 0.0008, 'grad_norm': 0.01718243397772312, 'learning_rate': 2.831858407079646e-06, 'epoch': 9.84}


 99%|█████████▊| 1677/1700 [36:25<00:29,  1.28s/it]

{'loss': 0.002, 'grad_norm': 0.06753454357385635, 'learning_rate': 2.7138643067846607e-06, 'epoch': 9.85}


 99%|█████████▊| 1678/1700 [36:26<00:26,  1.21s/it]

{'loss': 0.0004, 'grad_norm': 0.007496809121221304, 'learning_rate': 2.595870206489676e-06, 'epoch': 9.86}


 99%|█████████▉| 1679/1700 [36:27<00:25,  1.22s/it]

{'loss': 0.0007, 'grad_norm': 0.008893920108675957, 'learning_rate': 2.4778761061946905e-06, 'epoch': 9.86}


 99%|█████████▉| 1680/1700 [36:28<00:23,  1.19s/it]

{'loss': 0.0004, 'grad_norm': 0.006746490485966206, 'learning_rate': 2.359882005899705e-06, 'epoch': 9.87}


 99%|█████████▉| 1681/1700 [36:30<00:23,  1.25s/it]

{'loss': 0.0009, 'grad_norm': 0.02052549086511135, 'learning_rate': 2.24188790560472e-06, 'epoch': 9.87}


 99%|█████████▉| 1682/1700 [36:31<00:23,  1.30s/it]

{'loss': 0.0007, 'grad_norm': 0.01411442831158638, 'learning_rate': 2.1238938053097345e-06, 'epoch': 9.88}


 99%|█████████▉| 1683/1700 [36:32<00:23,  1.35s/it]

{'loss': 0.0007, 'grad_norm': 0.011830456554889679, 'learning_rate': 2.005899705014749e-06, 'epoch': 9.89}


 99%|█████████▉| 1684/1700 [36:34<00:21,  1.36s/it]

{'loss': 0.0008, 'grad_norm': 0.015122800134122372, 'learning_rate': 1.887905604719764e-06, 'epoch': 9.89}


 99%|█████████▉| 1685/1700 [36:35<00:20,  1.37s/it]

{'loss': 0.0009, 'grad_norm': 0.012977419421076775, 'learning_rate': 1.7699115044247788e-06, 'epoch': 9.9}


 99%|█████████▉| 1686/1700 [36:37<00:19,  1.41s/it]

{'loss': 0.001, 'grad_norm': 0.01191931776702404, 'learning_rate': 1.6519174041297937e-06, 'epoch': 9.9}


 99%|█████████▉| 1687/1700 [36:38<00:17,  1.34s/it]

{'loss': 0.0004, 'grad_norm': 0.006953258998692036, 'learning_rate': 1.5339233038348083e-06, 'epoch': 9.91}


 99%|█████████▉| 1688/1700 [36:39<00:16,  1.39s/it]

{'loss': 0.001, 'grad_norm': 0.011199194006621838, 'learning_rate': 1.415929203539823e-06, 'epoch': 9.91}


 99%|█████████▉| 1689/1700 [36:40<00:14,  1.29s/it]

{'loss': 0.0006, 'grad_norm': 0.00864813756197691, 'learning_rate': 1.297935103244838e-06, 'epoch': 9.92}


 99%|█████████▉| 1690/1700 [36:41<00:11,  1.20s/it]

{'loss': 0.0007, 'grad_norm': 0.010851307772099972, 'learning_rate': 1.1799410029498526e-06, 'epoch': 9.93}


 99%|█████████▉| 1691/1700 [36:42<00:10,  1.17s/it]

{'loss': 0.0004, 'grad_norm': 0.006361156702041626, 'learning_rate': 1.0619469026548673e-06, 'epoch': 9.93}


100%|█████████▉| 1692/1700 [36:45<00:12,  1.53s/it]

{'loss': 0.0008, 'grad_norm': 0.01018298976123333, 'learning_rate': 9.43952802359882e-07, 'epoch': 9.94}


100%|█████████▉| 1693/1700 [36:46<00:09,  1.41s/it]

{'loss': 0.0007, 'grad_norm': 0.014385515823960304, 'learning_rate': 8.259587020648968e-07, 'epoch': 9.94}


100%|█████████▉| 1694/1700 [36:47<00:08,  1.40s/it]

{'loss': 0.0007, 'grad_norm': 0.00992763601243496, 'learning_rate': 7.079646017699115e-07, 'epoch': 9.95}


100%|█████████▉| 1695/1700 [36:49<00:06,  1.36s/it]

{'loss': 0.0009, 'grad_norm': 0.02293170988559723, 'learning_rate': 5.899705014749263e-07, 'epoch': 9.96}


100%|█████████▉| 1696/1700 [36:50<00:05,  1.33s/it]

{'loss': 0.0057, 'grad_norm': 0.060440145432949066, 'learning_rate': 4.71976401179941e-07, 'epoch': 9.96}


100%|█████████▉| 1697/1700 [36:51<00:03,  1.28s/it]

{'loss': 0.0006, 'grad_norm': 0.009318981319665909, 'learning_rate': 3.5398230088495575e-07, 'epoch': 9.97}


100%|█████████▉| 1698/1700 [36:52<00:02,  1.25s/it]

{'loss': 0.0007, 'grad_norm': 0.01161134708672762, 'learning_rate': 2.359882005899705e-07, 'epoch': 9.97}


100%|█████████▉| 1699/1700 [36:53<00:01,  1.16s/it]

{'loss': 0.0088, 'grad_norm': 0.08964187651872635, 'learning_rate': 1.1799410029498526e-07, 'epoch': 9.98}


100%|██████████| 1700/1700 [36:54<00:00,  1.19s/it]

{'loss': 0.0007, 'grad_norm': 0.008786909282207489, 'learning_rate': 0.0, 'epoch': 9.99}


100%|██████████| 1700/1700 [36:56<00:00,  1.30s/it]

{'train_runtime': 2216.4283, 'train_samples_per_second': 3.073, 'train_steps_per_second': 0.767, 'train_loss': 0.2001394452722161, 'epoch': 9.99}
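
The `learning_rate` values in the log above are consistent with a linear schedule with a short warmup. The sketch below reconstructs that schedule in plain Python. The peak rate of 2e-4 and the 5 warmup steps are assumptions inferred from the logged values (4e-05 at step 1, 8e-05 at step 2, 0.0 at step 1700); they are not read from the trainer configuration. `linear_schedule_lr` is a hypothetical helper name.

```python
def linear_schedule_lr(step, peak_lr=2e-4, warmup_steps=5, total_steps=1700):
    """Learning rate logged after `step` optimizer steps.

    Linear warmup from 0 to `peak_lr`, then linear decay to 0 at `total_steps`.
    The default values are assumptions inferred from the training log.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Compare a few steps against the values printed in the log.
for step in (1, 2, 1587, 1699, 1700):
    print(step, linear_schedule_lr(step))
```

If the printed values match the log to within floating-point rounding, the run most likely used a linear decay scheduler; a mismatch would point to a different scheduler or warmup length.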





In [23]:
#@title Show final memory and time stats
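# Peak GPU memory reserved by PyTorch so far, in GB.
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
# Memory attributed to training: peak minus start_gpu_memory (recorded in an earlier cell).
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
# Both figures as a percentage of max_memory (total GPU memory in GB, from an earlier cell).
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)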
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

2216.4283 seconds used for training.
36.94 minutes used for training.
Peak reserved memory = 8.201 GB.
Peak reserved memory for training = 4.926 GB.
Peak reserved memory % of max memory = 69.778 %.
Peak reserved memory for training % of max memory = 41.913 %.
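
As a cross-check, the throughput figures in the trainer summary can be re-derived from the runtime and the step count. This sketch uses only numbers printed above; the samples-per-step figure is an inference from those numbers, not a value read from the training arguments.

```python
train_runtime_s = 2216.4283   # 'train_runtime' from the trainer summary
total_steps = 1700            # final value of the progress bar
samples_per_second = 3.073    # 'train_samples_per_second' from the summary

minutes = round(train_runtime_s / 60, 2)
steps_per_second = round(total_steps / train_runtime_s, 3)
# Samples consumed per optimizer step
# (per-device batch size x gradient accumulation steps x number of devices).
samples_per_step = samples_per_second * train_runtime_s / total_steps

print(f"{minutes} minutes, {steps_per_second} steps/s, "
      f"~{samples_per_step:.2f} samples/step")
```

A result close to 4 samples per step would be consistent with an effective batch size of 4, although the exact split between batch size and gradient accumulation cannot be recovered from the log alone.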


In [29]:
import pandas as pd
import matplotlib.pyplot as plt
training_df = pd.DataFrame(trainer.state.log_history)

In [32]:
training_df.head(2)

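| | loss | grad_norm | learning_rate | epoch | step | train_runtime | train_samples_per_second | train_steps_per_second | total_flos | train_loss |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.2077 | 1.6807 | 4e-05 | 0.005874 | 1 | NaN | NaN | NaN | NaN | NaN |
| 1 | 1.3629 | 3.645731 | 8e-05 | 0.011747 | 2 | NaN | NaN | NaN | NaN | NaN |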


In [47]:
from helpers import create_training_plots

In [49]:
fig = create_training_plots(training_df)
fig.show()

ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed

In [40]:
training_df.to_csv(f"training_logs/{OUTPUT_MODEL_NAME}.csv", index = False)

## Run Inference

In [41]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

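def get_response(user_query):
    """Generate one sampled reply to `user_query` and return the decoded sequences."""
    messages = [
        {"role": "user", "content": user_query},
    ]
    # Returns a tensor of input ids only; no attention mask is passed to
    # generate(), which is what triggers the attention-mask warning below.
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize = True,
        add_generation_prompt = True, # Must add for generation
        return_tensors = "pt",
    ).to("cuda")

    # max_new_tokens = 64 cuts long answers off mid-sentence.
    outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True,
                             temperature = 1.5, min_p = 0.1)
    # The decoded text still contains the prompt and the special tokens.
    return tokenizer.batch_decode(outputs)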

In [42]:
dataset_finetune['question'][0]

'What are some common challenges faced by healthcare professionals during doctor-patient interactions in Puskesmas settings?'

TODO: investigate how rephrasing the question changes the model's responses.

In [43]:
resp = get_response(dataset_finetune['question'][0])
print(resp[0].split("<|start_header_id|>assistant<|end_header_id|>")[1])

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.




According to the paper "Using LLM for Real-Time Transcription and Summarization of Doctor-Patient Interactions into ePuskesmas in Indonesia", the primary issue contributing to inefficiency in Puskesmas is the time-consuming nature of doctor-patient interactions. Doctors are required to conduct thorough consultations, which
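
The `split(...)[1]` in the cell above relies on the Llama 3 header tokens appearing in the decoded text. A slightly more defensive version of that step is sketched below. `extract_assistant_reply` is a hypothetical helper, and stripping `<|eot_id|>` assumes the decoded string still contains special tokens, as it does when `batch_decode` is called without `skip_special_tokens`.

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"
END_OF_TURN = "<|eot_id|>"


def extract_assistant_reply(decoded):
    """Return the text after the last assistant header, without end-of-turn tokens."""
    _, header, reply = decoded.rpartition(ASSISTANT_HEADER)
    if not header:
        raise ValueError("assistant header not found in decoded text")
    # Keep only the text up to the first end-of-turn token, if one was generated.
    return reply.split(END_OF_TURN)[0].strip()


# Hand-written example string in the Llama 3 chat format (not real model output).
sample = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "Hi<|eot_id|>" + ASSISTANT_HEADER + "\n\nHello there<|eot_id|>"
)
print(extract_assistant_reply(sample))
```

With this helper, the print statement could become `print(extract_assistant_reply(resp[0]))`, which fails with a clear error instead of an `IndexError` if the header is missing.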


## Save to HF

In [44]:
print(f"Model dtype: {next(model.parameters()).dtype}")


Model dtype: torch.bfloat16


In [None]:
model.push_to_hub_gguf(
        f"CPSC532/{OUTPUT_MODEL_NAME}",
        tokenizer,
        # quantization_method = ["q4_k_m", "q8_0", "q5_k_m"], 
        quantization_method = ["not_quantized"], # save original precision (bfloat16)
        token = HF_TOKEN
    )

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 32.94 out of 62.67 RAM for saving.


100%|██████████| 28/28 [00:00<00:00, 90.30it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Done.
==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits will take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['bf16'] will take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: [0] Installing llama.cpp. This will take 3 minutes...
Unsloth: [1] Converting model at CPSC532/2024NOV02_cleaned_arxiv_qa_data into bf16 GGUF format.
The output location will be /home/owen/Desktop/github/532/implementation/finetuning/CPSC532/2024NOV02_cleaned_arxiv_qa_data/unsloth.BF16.gguf
This will take 3 minutes...
INFO:hf-to-gguf:Loading model: 2024NOV02_cleaned_arxiv_qa_data
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,           torch.floa

unsloth.BF16.gguf: 100%|██████████| 6.43G/6.43G [04:00<00:00, 26.8MB/s]
No files have been modified since last commit. Skipping to prevent empty commit.


Saved GGUF to https://huggingface.co/CPSC532/2024NOV02_cleaned_arxiv_qa_data


No files have been modified since last commit. Skipping to prevent empty commit.


Saved Ollama Modelfile to https://huggingface.co/CPSC532/2024NOV02_cleaned_arxiv_qa_data
