### Fine-Tune Mistral 7B

To install Unsloth on your own computer, run the following:

In [1]:
# %pip install unsloth
# Also get the latest nightly Unsloth!
# %pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

References:

- https://github.com/unslothai/unsloth
- https://docs.unsloth.ai/tutorials/how-to-finetune-llama-3-and-export-to-ollama#id-6.-alpaca-dataset
- https://www.youtube.com/watch?v=Gpyukc6c0w8
- https://huggingface.co/datasets/laion/OIG
- https://mer.vin/2024/02/unsloth-fine-tuning/
- RAG: https://www.youtube.com/watch?v=LKokLun3bHI
- https://huggingface.co/blog/synthetic-data-save-costs

#### Load Model Through Unsloth API

In [2]:
from unsloth import FastLanguageModel
import torch

model_name = "unsloth/mistral-7b-instruct-v0.3-bnb-4bit"
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name, # Choose ANY! eg teknium/OpenHermes-2.5-Mistral-7B
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 8, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


  from .autonotebook import tqdm as notebook_tqdm


🦥 Unsloth Zoo will now patch everything to make training faster!
[2024-12-11 19:43:16,058] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)


/usr/bin/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status


==((====))==  Unsloth 2024.12.4: Fast Mistral patching. Transformers:4.46.2.
   \\   /|    GPU: NVIDIA Graphics Device. Max memory: 15.702 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2024.12.4 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
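The size of the LoRA adapter configured above can be estimated by hand. The sketch below is a rough check, not part of the original notebook: it assumes the published Mistral-7B layer shapes (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers), none of which are printed in this notebook. Each adapted linear layer gains two matrices, `A` of shape `(r, in_features)` and `B` of shape `(out_features, r)`.

```python
# Assumed Mistral-7B shapes, taken from the published model config,
# not from anything printed in this notebook.
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 8 * 128   # 8 key/value heads x 128 head dim (grouped-query attention)
NUM_LAYERS = 32
RANK = 8           # the `r` passed to get_peft_model above

# (in_features, out_features) of each targeted linear layer
target_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features, out_features, rank):
    """Parameters added by one LoRA pair: A (rank x in) plus B (out x rank)."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in target_shapes.values())
total = per_layer * NUM_LAYERS
print(per_layer, total)  # 655360 20971520
```

Under these assumptions the total is 20,971,520, the same figure Unsloth reports as the number of trainable parameters when training starts. Doubling `r` doubles this count.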


- Define a function that formats each dataset example into a single string.
- The function must return the formatted strings under the "text" key, as shown below, because the trainer reads that field:

In [3]:
prompt = """ [ {"role": "user", "content": {}}, {"role": "assistant", "content": {}}] """

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def format_prompts(batch):
    answers         = batch["answer"]
    questions       = batch["question"]
    texts = []
    for answer, question in zip(answers, questions):
        conversation = []
        conversation.append({"role": "user", "content": question})
        conversation.append({"role": "assistant", "content": answer})
        chat = ','.join(str(x) for x in conversation) + EOS_TOKEN
        texts.append(chat)
    return { "text" : texts, }

def format_pandora_prompt(batch):
    messages = batch["messages"]
    texts = []
    for message in messages:
        chat = ','.join(str(item) for item in message) + EOS_TOKEN
        texts.append(chat)

    return { "text" : texts, }
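To see what the formatting function produces, here is a small standalone sketch that runs the same logic on a made-up batch. The `"</s>"` EOS string and the toy messages are assumptions for illustration only; in the notebook the real value comes from `tokenizer.eos_token`.

```python
import json

EOS_TOKEN = "</s>"  # assumption: placeholder for tokenizer.eos_token

def format_pandora_prompt(batch):
    # Same logic as the notebook cell: join the str() of each message dict.
    texts = []
    for message in batch["messages"]:
        texts.append(",".join(str(item) for item in message) + EOS_TOKEN)
    return {"text": texts}

# Hypothetical batch with a single two-turn conversation.
toy_batch = {
    "messages": [
        [
            {"role": "user", "content": "Hi"},
            {"role": "assistant", "content": "Hello"},
        ]
    ]
}

formatted = format_pandora_prompt(toy_batch)["text"][0]
print(formatted)
# {'role': 'user', 'content': 'Hi'},{'role': 'assistant', 'content': 'Hello'}</s>

# For comparison only: json.dumps gives double-quoted, valid JSON.
as_json = ",".join(json.dumps(m) for m in toy_batch["messages"][0]) + EOS_TOKEN
print(as_json)
# {"role": "user", "content": "Hi"},{"role": "assistant", "content": "Hello"}</s>
```

Note that `str()` on a dict yields Python's repr (single quotes, `None`), not JSON. That is what the model is trained on here, so prompts at inference time would likely need the same serialization.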


- Load the dataset and format the prompts:

In [4]:
from datasets import load_dataset
dataset_name = "danilopeixoto/pandora-tool-calling"

dataset = load_dataset(dataset_name, split="train")
dataset = dataset.map(format_pandora_prompt, batched=True)

# Print the dataset size and one formatted sample to verify
print(len(dataset))
print(dataset[9]['text'])

101114
{'role': 'system', 'content': 'You are a helpful assistant with access to functions and tools. Use them if required.', 'tools': None, 'tool_calls': None, 'name': None},{'role': 'system', 'content': None, 'tools': '[{"name": "check_email_availability", "description": "Check if an email address is available", "parameters": {"type": "object", "properties": {"email": {"type": "string", "description": "The email address to be checked"}}, "required": ["email"]}}]', 'tool_calls': None, 'name': None},{'role': 'user', 'content': 'Hi, I want to create a new email account. Can you check if the email address "johnsmith@gmail.com" is available?', 'tools': None, 'tool_calls': None, 'name': None},{'role': 'assistant', 'content': None, 'tools': None, 'tool_calls': '[{"name": "check_email_availability", "arguments": {"email": "johnsmith@gmail.com"}}]', 'name': None},{'role': 'tool', 'content': '{"status": "unavailable", "message": "The email address \'johnsmith@gmail.com\' is already in use."}',

- Import wandb to track and log the fine-tuning process.
- Import the trainer and set the training parameters.

In [5]:
import os
import wandb

# Read the API key from the environment instead of hardcoding it in the notebook.
wandb.login(key = os.environ.get("WANDB_API_KEY"))
run = wandb.init(
    project='Fine tuning mistral 7B',
    job_type="training",
    anonymous="allow"
)

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33msimgesonmez90[0m ([33msimgesonmez90-i-zmir-institute-of-technology[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /home/simges/.netrc


In [6]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
import os
import torch

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        num_train_epochs = 1,
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = -1, # -1 disables the step cap, so training runs for num_train_epochs
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(), # Fall back to fp16 mixed precision when bf16 is unavailable
        bf16 = is_bfloat16_supported(),
        logging_steps = 50,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "tool_calling_llm",
        report_to = "wandb", # Use this for WandB etc
    ),
)
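The training arguments above fix the effective batch size, the number of optimizer steps, and the learning rate at every step. The sketch below works these out by hand. It is an approximation under two assumptions: that the trainer floors the number of batches per epoch when dividing by the accumulation steps, and that the `"linear"` scheduler ramps up over `warmup_steps` and then decays linearly to zero. The example count of 101,114 is the dataset size printed earlier.

```python
import math

num_examples = 101_114            # dataset size printed after loading
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1
num_train_epochs = 1
learning_rate = 2e-4
warmup_steps = 5

# Examples consumed per optimizer step.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus

# Assumption: batches per epoch are rounded up, then floored by accumulation steps.
batches_per_epoch = math.ceil(num_examples / (per_device_train_batch_size * num_gpus))
steps_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_steps = steps_per_epoch * num_train_epochs

def linear_lr(step):
    """Learning rate after `step` optimizer steps: linear warmup, then linear decay."""
    if step < warmup_steps:
        return learning_rate * step / max(1, warmup_steps)
    remaining = max(0.0, total_steps - step)
    return learning_rate * remaining / max(1, total_steps - warmup_steps)

print(effective_batch, total_steps)  # 8 12639
print(linear_lr(50), linear_lr(100))
```

With these settings the result is a total batch size of 8 and 12,639 steps, and the learning rates at steps 50 and 100 agree with the values logged during training (about 1.9929e-4 and 1.9850e-4).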

In [9]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA Graphics Device. Max memory = 15.702 GB.
4.441 GB of memory reserved.
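The figures above are easier to interpret as a share of the card's capacity. The helpers below are a hypothetical sketch, not part of the original notebook. They reuse the two numbers just printed and the same conversion as the cell above, which divides by 1024 three times, so the "GB" shown is strictly GiB.

```python
def bytes_to_gib(num_bytes, ndigits=3):
    """Convert a byte count to GiB, matching the notebook's / 1024 / 1024 / 1024."""
    return round(num_bytes / 1024**3, ndigits)

def percent_of_total(reserved_gib, total_gib, ndigits=1):
    """Share of total GPU memory that is currently reserved, in percent."""
    return round(100 * reserved_gib / total_gib, ndigits)

max_memory = 15.702        # value printed above
start_gpu_memory = 4.441   # value printed above

print(percent_of_total(start_gpu_memory, max_memory))  # 28.3
```

So roughly 28% of the GPU is already reserved by the 4-bit model and adapters before training starts. The rest is available for activations, gradients, and optimizer state.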


In [10]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 101,114 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 12,639
 "-____-"     Number of trainable parameters = 20,971,520
  0%|          | 50/12639 [05:28<23:28:19,  6.71s/it]

{'loss': 0.3723, 'grad_norm': 0.4719901382923126, 'learning_rate': 0.00019928763653633054, 'epoch': 0.0}


  1%|          | 100/12639 [10:52<20:09:27,  5.79s/it]

{'loss': 0.2893, 'grad_norm': 0.4467845559120178, 'learning_rate': 0.0001984961215766978, 'epoch': 0.01}


  1%|          | 150/12639 [15:10<20:17:16,  5.85s/it]

{'loss': 0.2985, 'grad_norm': 0.376018226146698, 'learning_rate': 0.00019770460661706508, 'epoch': 0.01}


  2%|▏         | 200/12639 [19:24<15:14:05,  4.41s/it]

{'loss': 0.2852, 'grad_norm': 0.34205877780914307, 'learning_rate': 0.00019691309165743233, 'epoch': 0.02}


  2%|▏         | 250/12639 [23:47<19:44:31,  5.74s/it]

{'loss': 0.3097, 'grad_norm': 0.42879194021224976, 'learning_rate': 0.0001961215766977996, 'epoch': 0.02}


  2%|▏         | 300/12639 [28:11<20:35:28,  6.01s/it]

{'loss': 0.2853, 'grad_norm': 0.44706055521965027, 'learning_rate': 0.00019533006173816687, 'epoch': 0.02}


  3%|▎         | 350/12639 [32:32<19:26:33,  5.70s/it]

{'loss': 0.2971, 'grad_norm': 0.3540213108062744, 'learning_rate': 0.0001945385467785341, 'epoch': 0.03}


  3%|▎         | 400/12639 [36:47<15:40:49,  4.61s/it]

{'loss': 0.2915, 'grad_norm': 0.43394213914871216, 'learning_rate': 0.00019374703181890138, 'epoch': 0.03}


  4%|▎         | 450/12639 [40:53<15:49:20,  4.67s/it]

{'loss': 0.2698, 'grad_norm': 0.42231154441833496, 'learning_rate': 0.00019295551685926865, 'epoch': 0.04}


  4%|▍         | 500/12639 [45:06<15:54:17,  4.72s/it]

{'loss': 0.2717, 'grad_norm': 0.37781625986099243, 'learning_rate': 0.0001921640018996359, 'epoch': 0.04}


  4%|▍         | 550/12639 [49:27<17:09:46,  5.11s/it]

{'loss': 0.2973, 'grad_norm': 0.4185146689414978, 'learning_rate': 0.0001913724869400032, 'epoch': 0.04}


  5%|▍         | 600/12639 [53:56<17:02:28,  5.10s/it]

{'loss': 0.2989, 'grad_norm': 0.2960578203201294, 'learning_rate': 0.00019058097198037044, 'epoch': 0.05}


  5%|▌         | 650/12639 [58:19<21:08:11,  6.35s/it]

{'loss': 0.2785, 'grad_norm': 0.49897435307502747, 'learning_rate': 0.00018978945702073768, 'epoch': 0.05}


  6%|▌         | 700/12639 [1:02:20<17:41:34,  5.33s/it]

{'loss': 0.2575, 'grad_norm': 0.5379592180252075, 'learning_rate': 0.00018899794206110498, 'epoch': 0.06}


  6%|▌         | 750/12639 [1:06:44<18:59:55,  5.75s/it]

{'loss': 0.305, 'grad_norm': 0.4134247601032257, 'learning_rate': 0.00018820642710147222, 'epoch': 0.06}


  6%|▋         | 800/12639 [1:11:08<14:50:38,  4.51s/it]

{'loss': 0.2701, 'grad_norm': 0.34929442405700684, 'learning_rate': 0.0001874149121418395, 'epoch': 0.06}


  7%|▋         | 850/12639 [1:15:27<17:29:12,  5.34s/it]

{'loss': 0.2908, 'grad_norm': 0.5299743413925171, 'learning_rate': 0.00018662339718220677, 'epoch': 0.07}


  7%|▋         | 900/12639 [1:19:48<15:50:07,  4.86s/it]

{'loss': 0.2864, 'grad_norm': 0.3147815763950348, 'learning_rate': 0.000185831882222574, 'epoch': 0.07}


  8%|▊         | 950/12639 [1:24:06<18:01:27,  5.55s/it]

{'loss': 0.2669, 'grad_norm': 0.45343437790870667, 'learning_rate': 0.00018504036726294128, 'epoch': 0.08}


  8%|▊         | 1000/12639 [1:28:26<15:34:31,  4.82s/it]

{'loss': 0.2832, 'grad_norm': 0.39385348558425903, 'learning_rate': 0.00018424885230330855, 'epoch': 0.08}


  8%|▊         | 1050/12639 [1:32:34<16:53:21,  5.25s/it]

{'loss': 0.2611, 'grad_norm': 0.48146936297416687, 'learning_rate': 0.0001834573373436758, 'epoch': 0.08}


  9%|▊         | 1100/12639 [1:37:04<16:00:14,  4.99s/it]

{'loss': 0.3013, 'grad_norm': 0.434643030166626, 'learning_rate': 0.00018266582238404307, 'epoch': 0.09}


  9%|▉         | 1150/12639 [1:41:23<17:19:07,  5.43s/it]

{'loss': 0.3012, 'grad_norm': 0.4912136197090149, 'learning_rate': 0.00018187430742441034, 'epoch': 0.09}


  9%|▉         | 1200/12639 [1:45:42<14:36:05,  4.60s/it]

{'loss': 0.2783, 'grad_norm': 0.32477158308029175, 'learning_rate': 0.0001810827924647776, 'epoch': 0.09}


 10%|▉         | 1250/12639 [1:50:00<16:06:21,  5.09s/it]

{'loss': 0.2865, 'grad_norm': 0.3937057554721832, 'learning_rate': 0.00018029127750514485, 'epoch': 0.1}


 10%|█         | 1300/12639 [1:54:28<15:42:30,  4.99s/it]

{'loss': 0.2816, 'grad_norm': 0.408532053232193, 'learning_rate': 0.00017949976254551212, 'epoch': 0.1}


 11%|█         | 1350/12639 [1:58:40<15:45:32,  5.03s/it]

{'loss': 0.2747, 'grad_norm': 0.4814029335975647, 'learning_rate': 0.0001787082475858794, 'epoch': 0.11}


 11%|█         | 1400/12639 [2:03:04<15:31:03,  4.97s/it]

{'loss': 0.2833, 'grad_norm': 0.4581761062145233, 'learning_rate': 0.00017791673262624664, 'epoch': 0.11}


 11%|█▏        | 1450/12639 [2:07:19<15:29:58,  4.99s/it]

{'loss': 0.2594, 'grad_norm': 0.39277809858322144, 'learning_rate': 0.0001771252176666139, 'epoch': 0.11}


 12%|█▏        | 1500/12639 [2:11:38<14:29:17,  4.68s/it]

{'loss': 0.2765, 'grad_norm': 0.46527913212776184, 'learning_rate': 0.00017633370270698118, 'epoch': 0.12}


 12%|█▏        | 1550/12639 [2:15:46<16:40:22,  5.41s/it]

{'loss': 0.2701, 'grad_norm': 0.45188698172569275, 'learning_rate': 0.00017554218774734842, 'epoch': 0.12}


 13%|█▎        | 1600/12639 [2:20:04<15:19:38,  5.00s/it]

{'loss': 0.2726, 'grad_norm': 0.45394399762153625, 'learning_rate': 0.0001747506727877157, 'epoch': 0.13}


 13%|█▎        | 1650/12639 [2:24:29<16:28:23,  5.40s/it]

{'loss': 0.2839, 'grad_norm': 0.4250381588935852, 'learning_rate': 0.00017395915782808297, 'epoch': 0.13}


 13%|█▎        | 1700/12639 [2:28:30<14:31:05,  4.78s/it]

{'loss': 0.2522, 'grad_norm': 0.4409942626953125, 'learning_rate': 0.0001731676428684502, 'epoch': 0.13}


 14%|█▍        | 1750/12639 [2:33:00<17:07:52,  5.66s/it]

{'loss': 0.2866, 'grad_norm': 0.4670701324939728, 'learning_rate': 0.00017237612790881748, 'epoch': 0.14}


 14%|█▍        | 1800/12639 [2:37:18<15:24:38,  5.12s/it]

{'loss': 0.2791, 'grad_norm': 0.4506061375141144, 'learning_rate': 0.00017158461294918475, 'epoch': 0.14}


 15%|█▍        | 1850/12639 [2:41:40<15:35:41,  5.20s/it]

{'loss': 0.2769, 'grad_norm': 0.3829107880592346, 'learning_rate': 0.00017079309798955202, 'epoch': 0.15}


 15%|█▌        | 1900/12639 [2:46:02<14:50:09,  4.97s/it]

{'loss': 0.2812, 'grad_norm': 0.5240059494972229, 'learning_rate': 0.00017000158302991927, 'epoch': 0.15}


 15%|█▌        | 1950/12639 [2:50:15<16:29:03,  5.55s/it]

{'loss': 0.2761, 'grad_norm': 0.3979191184043884, 'learning_rate': 0.00016921006807028654, 'epoch': 0.15}


 16%|█▌        | 2000/12639 [2:54:28<14:32:05,  4.92s/it]

{'loss': 0.2707, 'grad_norm': 0.44623440504074097, 'learning_rate': 0.0001684185531106538, 'epoch': 0.16}


 16%|█▌        | 2050/12639 [2:58:53<15:56:56,  5.42s/it]

{'loss': 0.2769, 'grad_norm': 0.5029745697975159, 'learning_rate': 0.00016762703815102105, 'epoch': 0.16}


 17%|█▋        | 2100/12639 [3:03:00<14:24:55,  4.92s/it]

{'loss': 0.2644, 'grad_norm': 0.5871331691741943, 'learning_rate': 0.00016683552319138832, 'epoch': 0.17}


 17%|█▋        | 2150/12639 [3:07:30<15:58:49,  5.48s/it]

{'loss': 0.2864, 'grad_norm': 0.47019633650779724, 'learning_rate': 0.0001660440082317556, 'epoch': 0.17}


 17%|█▋        | 2200/12639 [3:11:44<14:26:56,  4.98s/it]

{'loss': 0.2707, 'grad_norm': 0.33420607447624207, 'learning_rate': 0.00016525249327212284, 'epoch': 0.17}


 18%|█▊        | 2250/12639 [3:15:57<16:53:03,  5.85s/it]

{'loss': 0.2756, 'grad_norm': 0.47137266397476196, 'learning_rate': 0.0001644609783124901, 'epoch': 0.18}


 18%|█▊        | 2300/12639 [3:20:12<14:28:23,  5.04s/it]

{'loss': 0.2648, 'grad_norm': 0.430055171251297, 'learning_rate': 0.00016366946335285738, 'epoch': 0.18}


 19%|█▊        | 2350/12639 [3:24:28<14:05:40,  4.93s/it]

{'loss': 0.3034, 'grad_norm': 0.6347318887710571, 'learning_rate': 0.00016287794839322462, 'epoch': 0.19}


 19%|█▉        | 2400/12639 [3:28:43<15:26:29,  5.43s/it]

{'loss': 0.2785, 'grad_norm': 0.5441904664039612, 'learning_rate': 0.00016208643343359192, 'epoch': 0.19}


 19%|█▉        | 2450/12639 [3:33:15<15:52:21,  5.61s/it]

{'loss': 0.3012, 'grad_norm': 0.47114405035972595, 'learning_rate': 0.00016129491847395916, 'epoch': 0.19}


 20%|█▉        | 2500/12639 [3:37:27<15:25:34,  5.48s/it]

{'loss': 0.2623, 'grad_norm': 0.5721734166145325, 'learning_rate': 0.00016050340351432644, 'epoch': 0.2}


 20%|██        | 2550/12639 [3:41:30<12:34:52,  4.49s/it]

{'loss': 0.2619, 'grad_norm': 0.5030771493911743, 'learning_rate': 0.0001597118885546937, 'epoch': 0.2}


 21%|██        | 2600/12639 [3:45:48<13:52:52,  4.98s/it]

{'loss': 0.2752, 'grad_norm': 0.3085077106952667, 'learning_rate': 0.00015892037359506095, 'epoch': 0.21}


 21%|██        | 2650/12639 [3:50:24<16:11:21,  5.83s/it]

{'loss': 0.2927, 'grad_norm': 0.6497789025306702, 'learning_rate': 0.00015812885863542822, 'epoch': 0.21}


 21%|██▏       | 2700/12639 [3:54:47<13:45:35,  4.98s/it]

{'loss': 0.2829, 'grad_norm': 0.5020511150360107, 'learning_rate': 0.0001573373436757955, 'epoch': 0.21}


 22%|██▏       | 2750/12639 [3:59:15<13:46:45,  5.02s/it]

{'loss': 0.2758, 'grad_norm': 0.5071022510528564, 'learning_rate': 0.00015654582871616274, 'epoch': 0.22}


 22%|██▏       | 2800/12639 [4:03:40<15:36:04,  5.71s/it]

{'loss': 0.2781, 'grad_norm': 0.4975830912590027, 'learning_rate': 0.00015575431375653, 'epoch': 0.22}


 23%|██▎       | 2850/12639 [4:08:03<14:15:28,  5.24s/it]

{'loss': 0.291, 'grad_norm': 0.46365058422088623, 'learning_rate': 0.00015496279879689728, 'epoch': 0.23}


 23%|██▎       | 2900/12639 [4:12:19<12:32:22,  4.64s/it]

{'loss': 0.2675, 'grad_norm': 0.481263130903244, 'learning_rate': 0.00015417128383726452, 'epoch': 0.23}


 23%|██▎       | 2950/12639 [4:16:30<14:02:54,  5.22s/it]

{'loss': 0.2606, 'grad_norm': 0.5428469181060791, 'learning_rate': 0.0001533797688776318, 'epoch': 0.23}


 24%|██▎       | 3000/12639 [4:20:49<11:47:38,  4.40s/it]

{'loss': 0.2793, 'grad_norm': 0.22741812467575073, 'learning_rate': 0.00015258825391799904, 'epoch': 0.24}


 24%|██▍       | 3050/12639 [4:25:16<13:59:19,  5.25s/it]

{'loss': 0.2774, 'grad_norm': 0.5072165131568909, 'learning_rate': 0.00015179673895836633, 'epoch': 0.24}


 25%|██▍       | 3100/12639 [4:29:40<13:35:33,  5.13s/it]

{'loss': 0.2816, 'grad_norm': 0.4174869954586029, 'learning_rate': 0.00015100522399873358, 'epoch': 0.25}


 25%|██▍       | 3150/12639 [4:33:46<13:29:36,  5.12s/it]

{'loss': 0.2438, 'grad_norm': 0.6085390448570251, 'learning_rate': 0.00015021370903910085, 'epoch': 0.25}


 25%|██▌       | 3200/12639 [4:38:08<12:21:49,  4.72s/it]

{'loss': 0.2734, 'grad_norm': 0.5393158793449402, 'learning_rate': 0.00014942219407946812, 'epoch': 0.25}


 26%|██▌       | 3250/12639 [4:42:23<12:03:07,  4.62s/it]

{'loss': 0.2732, 'grad_norm': 0.2918257713317871, 'learning_rate': 0.00014863067911983536, 'epoch': 0.26}


 26%|██▌       | 3300/12639 [4:46:50<14:19:08,  5.52s/it]

{'loss': 0.285, 'grad_norm': 0.4395512640476227, 'learning_rate': 0.00014783916416020264, 'epoch': 0.26}


 27%|██▋       | 3350/12639 [4:51:27<14:23:03,  5.57s/it]

{'loss': 0.2876, 'grad_norm': 0.5134850740432739, 'learning_rate': 0.0001470476492005699, 'epoch': 0.27}


 27%|██▋       | 3400/12639 [4:55:28<11:55:42,  4.65s/it]

{'loss': 0.2488, 'grad_norm': 0.33785921335220337, 'learning_rate': 0.00014625613424093715, 'epoch': 0.27}


 27%|██▋       | 3450/12639 [4:59:49<12:53:31,  5.05s/it]

{'loss': 0.2718, 'grad_norm': 0.4777520000934601, 'learning_rate': 0.00014546461928130442, 'epoch': 0.27}


 28%|██▊       | 3500/12639 [5:04:14<12:38:18,  4.98s/it]

{'loss': 0.2959, 'grad_norm': 0.4855843484401703, 'learning_rate': 0.0001446731043216717, 'epoch': 0.28}


 28%|██▊       | 3550/12639 [5:08:25<11:22:51,  4.51s/it]

{'loss': 0.2491, 'grad_norm': 0.48332536220550537, 'learning_rate': 0.00014388158936203894, 'epoch': 0.28}


 28%|██▊       | 3600/12639 [5:12:54<13:37:52,  5.43s/it]

{'loss': 0.3042, 'grad_norm': 0.5068361759185791, 'learning_rate': 0.0001430900744024062, 'epoch': 0.28}


 29%|██▉       | 3650/12639 [5:17:01<11:14:40,  4.50s/it]

{'loss': 0.2456, 'grad_norm': 0.30478760600090027, 'learning_rate': 0.00014229855944277348, 'epoch': 0.29}


 29%|██▉       | 3700/12639 [5:21:10<11:56:45,  4.81s/it]

{'loss': 0.2571, 'grad_norm': 0.5379632115364075, 'learning_rate': 0.00014150704448314075, 'epoch': 0.29}


 30%|██▉       | 3750/12639 [5:25:25<13:06:12,  5.31s/it]

{'loss': 0.2631, 'grad_norm': 0.46283653378486633, 'learning_rate': 0.000140715529523508, 'epoch': 0.3}


 30%|███       | 3800/12639 [5:29:47<13:47:48,  5.62s/it]

{'loss': 0.2762, 'grad_norm': 0.4437519609928131, 'learning_rate': 0.00013992401456387526, 'epoch': 0.3}


 30%|███       | 3850/12639 [5:34:15<11:17:21,  4.62s/it]

{'loss': 0.2932, 'grad_norm': 0.4235062897205353, 'learning_rate': 0.00013913249960424253, 'epoch': 0.3}


 31%|███       | 3900/12639 [5:38:37<11:49:10,  4.87s/it]

{'loss': 0.277, 'grad_norm': 0.554174542427063, 'learning_rate': 0.00013834098464460978, 'epoch': 0.31}


 31%|███▏      | 3950/12639 [5:42:46<10:55:52,  4.53s/it]

{'loss': 0.2698, 'grad_norm': 0.5560932159423828, 'learning_rate': 0.00013754946968497705, 'epoch': 0.31}


 32%|███▏      | 4000/12639 [5:46:52<10:36:05,  4.42s/it]

{'loss': 0.2527, 'grad_norm': 0.49338993430137634, 'learning_rate': 0.00013675795472534432, 'epoch': 0.32}


 32%|███▏      | 4050/12639 [5:51:06<11:30:28,  4.82s/it]

{'loss': 0.2626, 'grad_norm': 0.4654598534107208, 'learning_rate': 0.00013596643976571156, 'epoch': 0.32}


 32%|███▏      | 4100/12639 [5:55:30<13:13:59,  5.58s/it]

{'loss': 0.2729, 'grad_norm': 0.5134695768356323, 'learning_rate': 0.00013517492480607886, 'epoch': 0.32}


 33%|███▎      | 4150/12639 [5:59:51<11:44:05,  4.98s/it]

{'loss': 0.2672, 'grad_norm': 0.4234115481376648, 'learning_rate': 0.0001343834098464461, 'epoch': 0.33}


 33%|███▎      | 4200/12639 [6:03:56<13:00:27,  5.55s/it]

{'loss': 0.2463, 'grad_norm': 0.515536367893219, 'learning_rate': 0.00013359189488681335, 'epoch': 0.33}


 34%|███▎      | 4250/12639 [6:08:15<12:28:15,  5.35s/it]

{'loss': 0.2727, 'grad_norm': 0.5329576134681702, 'learning_rate': 0.00013280037992718065, 'epoch': 0.34}


 34%|███▍      | 4300/12639 [6:12:35<11:24:10,  4.92s/it]

{'loss': 0.2687, 'grad_norm': 0.47241514921188354, 'learning_rate': 0.0001320088649675479, 'epoch': 0.34}


 34%|███▍      | 4350/12639 [6:17:08<11:35:59,  5.04s/it]

{'loss': 0.2873, 'grad_norm': 0.49325335025787354, 'learning_rate': 0.00013121735000791516, 'epoch': 0.34}


 35%|███▍      | 4400/12639 [6:21:27<11:16:57,  4.93s/it]

{'loss': 0.2626, 'grad_norm': 0.4364636540412903, 'learning_rate': 0.00013042583504828243, 'epoch': 0.35}


 35%|███▌      | 4450/12639 [6:25:57<13:20:58,  5.87s/it]

{'loss': 0.2792, 'grad_norm': 0.44610896706581116, 'learning_rate': 0.00012963432008864968, 'epoch': 0.35}


 36%|███▌      | 4500/12639 [6:30:09<12:06:57,  5.36s/it]

{'loss': 0.272, 'grad_norm': 0.29746535420417786, 'learning_rate': 0.00012884280512901695, 'epoch': 0.36}


 36%|███▌      | 4550/12639 [6:34:42<14:22:47,  6.40s/it]

{'loss': 0.2826, 'grad_norm': 0.5275770425796509, 'learning_rate': 0.0001280512901693842, 'epoch': 0.36}


 36%|███▋      | 4600/12639 [6:38:59<11:01:55,  4.94s/it]

{'loss': 0.2693, 'grad_norm': 0.3467819094657898, 'learning_rate': 0.00012725977520975146, 'epoch': 0.36}


 37%|███▋      | 4650/12639 [6:43:09<11:03:12,  4.98s/it]

{'loss': 0.2645, 'grad_norm': 0.5110567212104797, 'learning_rate': 0.00012646826025011873, 'epoch': 0.37}


 37%|███▋      | 4700/12639 [6:47:40<11:43:50,  5.32s/it]

{'loss': 0.2891, 'grad_norm': 0.5433413982391357, 'learning_rate': 0.00012567674529048598, 'epoch': 0.37}


 38%|███▊      | 4750/12639 [6:52:03<10:52:53,  4.97s/it]

{'loss': 0.2925, 'grad_norm': 0.5282872319221497, 'learning_rate': 0.00012488523033085328, 'epoch': 0.38}


 38%|███▊      | 4800/12639 [6:56:16<10:31:22,  4.83s/it]

{'loss': 0.2622, 'grad_norm': 0.5372005701065063, 'learning_rate': 0.00012409371537122052, 'epoch': 0.38}


 38%|███▊      | 4850/12639 [7:00:28<12:12:31,  5.64s/it]

{'loss': 0.261, 'grad_norm': 0.5087974667549133, 'learning_rate': 0.00012330220041158776, 'epoch': 0.38}


 39%|███▉      | 4900/12639 [7:04:37<10:06:18,  4.70s/it]

{'loss': 0.2465, 'grad_norm': 0.47528696060180664, 'learning_rate': 0.00012251068545195506, 'epoch': 0.39}


 39%|███▉      | 4950/12639 [7:09:01<12:45:03,  5.97s/it]

{'loss': 0.2901, 'grad_norm': 0.45121267437934875, 'learning_rate': 0.0001217191704923223, 'epoch': 0.39}


 40%|███▉      | 5000/12639 [7:13:36<10:45:13,  5.07s/it]

{'loss': 0.2685, 'grad_norm': 0.36470621824264526, 'learning_rate': 0.00012092765553268956, 'epoch': 0.4}


 40%|███▉      | 5050/12639 [7:17:47<10:26:47,  4.96s/it]

{'loss': 0.2453, 'grad_norm': 0.4750353991985321, 'learning_rate': 0.00012013614057305685, 'epoch': 0.4}


 40%|████      | 5100/12639 [7:22:09<9:48:25,  4.68s/it] 

{'loss': 0.2621, 'grad_norm': 0.26431459188461304, 'learning_rate': 0.0001193446256134241, 'epoch': 0.4}


 41%|████      | 5150/12639 [7:26:31<11:12:23,  5.39s/it]

{'loss': 0.2586, 'grad_norm': 0.54087895154953, 'learning_rate': 0.00011855311065379136, 'epoch': 0.41}


 41%|████      | 5200/12639 [7:30:44<11:24:56,  5.52s/it]

{'loss': 0.2781, 'grad_norm': 0.43569666147232056, 'learning_rate': 0.00011776159569415863, 'epoch': 0.41}


 42%|████▏     | 5250/12639 [7:35:08<10:46:33,  5.25s/it]

{'loss': 0.2768, 'grad_norm': 0.5497552752494812, 'learning_rate': 0.00011697008073452589, 'epoch': 0.42}


 42%|████▏     | 5300/12639 [7:39:33<11:01:33,  5.41s/it]

{'loss': 0.2669, 'grad_norm': 0.4409823715686798, 'learning_rate': 0.00011617856577489315, 'epoch': 0.42}


 42%|████▏     | 5350/12639 [7:43:53<9:37:31,  4.75s/it] 

{'loss': 0.2689, 'grad_norm': 0.5002895593643188, 'learning_rate': 0.00011538705081526042, 'epoch': 0.42}


 43%|████▎     | 5400/12639 [7:48:10<10:37:35,  5.28s/it]

{'loss': 0.2663, 'grad_norm': 0.5015673041343689, 'learning_rate': 0.00011459553585562768, 'epoch': 0.43}


 43%|████▎     | 5450/12639 [7:52:14<10:01:51,  5.02s/it]

{'loss': 0.252, 'grad_norm': 0.46416446566581726, 'learning_rate': 0.00011380402089599493, 'epoch': 0.43}


 44%|████▎     | 5500/12639 [7:56:43<10:17:18,  5.19s/it]

{'loss': 0.2876, 'grad_norm': 0.3755379617214203, 'learning_rate': 0.00011301250593636222, 'epoch': 0.44}


 44%|████▍     | 5550/12639 [8:00:53<10:49:42,  5.50s/it]

{'loss': 0.2633, 'grad_norm': 0.47507980465888977, 'learning_rate': 0.00011222099097672946, 'epoch': 0.44}


 44%|████▍     | 5600/12639 [8:05:07<11:01:02,  5.63s/it]

{'loss': 0.2637, 'grad_norm': 0.47561442852020264, 'learning_rate': 0.00011142947601709672, 'epoch': 0.44}


 45%|████▍     | 5650/12639 [8:09:32<10:28:46,  5.40s/it]

{'loss': 0.2727, 'grad_norm': 0.5517523884773254, 'learning_rate': 0.000110637961057464, 'epoch': 0.45}


 45%|████▌     | 5700/12639 [8:14:05<11:26:08,  5.93s/it]

{'loss': 0.2593, 'grad_norm': 0.3324751853942871, 'learning_rate': 0.00010984644609783126, 'epoch': 0.45}


 45%|████▌     | 5750/12639 [8:18:33<10:44:51,  5.62s/it]

{'loss': 0.2777, 'grad_norm': 0.42333367466926575, 'learning_rate': 0.00010905493113819852, 'epoch': 0.45}


 46%|████▌     | 5800/12639 [8:23:11<10:56:08,  5.76s/it]

{'loss': 0.2768, 'grad_norm': 0.5642467141151428, 'learning_rate': 0.00010826341617856579, 'epoch': 0.46}


 46%|████▋     | 5850/12639 [8:28:06<10:03:10,  5.33s/it]

{'loss': 0.2318, 'grad_norm': 0.5552011728286743, 'learning_rate': 0.00010747190121893305, 'epoch': 0.46}


 47%|████▋     | 5900/12639 [8:32:19<9:55:46,  5.30s/it] 

{'loss': 0.2511, 'grad_norm': 0.5141560435295105, 'learning_rate': 0.0001066803862593003, 'epoch': 0.47}


 47%|████▋     | 5950/12639 [8:36:40<9:28:29,  5.10s/it] 

{'loss': 0.2849, 'grad_norm': 0.5352634191513062, 'learning_rate': 0.00010588887129966756, 'epoch': 0.47}


 47%|████▋     | 6000/12639 [8:40:55<9:44:47,  5.29s/it] 

{'loss': 0.2507, 'grad_norm': 0.5161479115486145, 'learning_rate': 0.00010509735634003483, 'epoch': 0.47}


 48%|████▊     | 6050/12639 [8:45:29<10:50:55,  5.93s/it]

{'loss': 0.2842, 'grad_norm': 0.4441252052783966, 'learning_rate': 0.00010430584138040209, 'epoch': 0.48}


 48%|████▊     | 6100/12639 [8:49:45<9:56:35,  5.47s/it] 

{'loss': 0.2591, 'grad_norm': 0.5167110562324524, 'learning_rate': 0.00010351432642076935, 'epoch': 0.48}


 49%|████▊     | 6150/12639 [8:53:53<9:07:21,  5.06s/it] 

{'loss': 0.2639, 'grad_norm': 0.457276850938797, 'learning_rate': 0.00010272281146113663, 'epoch': 0.49}


 49%|████▉     | 6200/12639 [8:58:11<10:08:53,  5.67s/it]

{'loss': 0.2704, 'grad_norm': 0.5544494986534119, 'learning_rate': 0.00010193129650150387, 'epoch': 0.49}


 49%|████▉     | 6250/12639 [9:02:31<7:28:14,  4.21s/it] 

{'loss': 0.2836, 'grad_norm': 0.44766145944595337, 'learning_rate': 0.00010113978154187113, 'epoch': 0.49}


 50%|████▉     | 6300/12639 [9:07:07<9:44:51,  5.54s/it] 

{'loss': 0.2625, 'grad_norm': 0.4691755771636963, 'learning_rate': 0.00010034826658223842, 'epoch': 0.5}


 50%|█████     | 6350/12639 [9:11:35<8:43:17,  4.99s/it] 

{'loss': 0.2872, 'grad_norm': 0.2942425012588501, 'learning_rate': 9.955675162260567e-05, 'epoch': 0.5}


 51%|█████     | 6400/12639 [9:15:51<10:28:06,  6.04s/it]

{'loss': 0.2616, 'grad_norm': 0.5311463475227356, 'learning_rate': 9.876523666297293e-05, 'epoch': 0.51}


 51%|█████     | 6450/12639 [9:20:06<8:43:07,  5.07s/it] 

{'loss': 0.2587, 'grad_norm': 0.4502653479576111, 'learning_rate': 9.797372170334019e-05, 'epoch': 0.51}


 51%|█████▏    | 6500/12639 [9:24:35<8:52:01,  5.20s/it] 

{'loss': 0.2488, 'grad_norm': 0.3614599406719208, 'learning_rate': 9.718220674370746e-05, 'epoch': 0.51}


 52%|█████▏    | 6550/12639 [9:28:59<9:08:30,  5.40s/it] 

{'loss': 0.2791, 'grad_norm': 0.4657256603240967, 'learning_rate': 9.639069178407473e-05, 'epoch': 0.52}


 52%|█████▏    | 6600/12639 [9:33:11<8:38:26,  5.15s/it] 

{'loss': 0.2576, 'grad_norm': 0.4324333965778351, 'learning_rate': 9.559917682444199e-05, 'epoch': 0.52}


 53%|█████▎    | 6650/12639 [9:37:27<7:51:43,  4.73s/it] 

{'loss': 0.2449, 'grad_norm': 0.4163663387298584, 'learning_rate': 9.480766186480925e-05, 'epoch': 0.53}


 53%|█████▎    | 6700/12639 [9:41:42<7:18:48,  4.43s/it]

{'loss': 0.2652, 'grad_norm': 0.4364064037799835, 'learning_rate': 9.401614690517652e-05, 'epoch': 0.53}


 53%|█████▎    | 6750/12639 [9:46:11<9:02:11,  5.52s/it]

{'loss': 0.2762, 'grad_norm': 0.5332921147346497, 'learning_rate': 9.322463194554377e-05, 'epoch': 0.53}


 54%|█████▍    | 6800/12639 [9:50:29<7:30:41,  4.63s/it] 

{'loss': 0.2664, 'grad_norm': 0.3919669985771179, 'learning_rate': 9.243311698591104e-05, 'epoch': 0.54}


 54%|█████▍    | 6850/12639 [9:54:54<8:37:32,  5.36s/it]

{'loss': 0.263, 'grad_norm': 0.4785234332084656, 'learning_rate': 9.16416020262783e-05, 'epoch': 0.54}


 55%|█████▍    | 6900/12639 [9:59:17<6:55:44,  4.35s/it] 

{'loss': 0.2662, 'grad_norm': 0.5006126761436462, 'learning_rate': 9.085008706664556e-05, 'epoch': 0.55}


 55%|█████▍    | 6950/12639 [10:03:45<7:47:21,  4.93s/it] 

{'loss': 0.2857, 'grad_norm': 0.5205153822898865, 'learning_rate': 9.005857210701283e-05, 'epoch': 0.55}


 55%|█████▌    | 7000/12639 [10:08:01<7:10:29,  4.58s/it]

{'loss': 0.2732, 'grad_norm': 0.5414326786994934, 'learning_rate': 8.926705714738009e-05, 'epoch': 0.55}


 56%|█████▌    | 7050/12639 [10:12:21<7:31:05,  4.84s/it] 

{'loss': 0.2684, 'grad_norm': 0.486540824174881, 'learning_rate': 8.847554218774734e-05, 'epoch': 0.56}


 56%|█████▌    | 7100/12639 [10:16:40<7:33:23,  4.91s/it]

{'loss': 0.2562, 'grad_norm': 0.362127423286438, 'learning_rate': 8.768402722811462e-05, 'epoch': 0.56}


 57%|█████▋    | 7150/12639 [10:21:04<8:03:18,  5.28s/it]

{'loss': 0.2629, 'grad_norm': 0.47685474157333374, 'learning_rate': 8.689251226848187e-05, 'epoch': 0.57}


 57%|█████▋    | 7200/12639 [10:25:29<9:36:44,  6.36s/it]

{'loss': 0.2741, 'grad_norm': 0.49585315585136414, 'learning_rate': 8.610099730884914e-05, 'epoch': 0.57}


 57%|█████▋    | 7250/12639 [10:29:43<7:08:09,  4.77s/it]

{'loss': 0.2645, 'grad_norm': 0.3452379107475281, 'learning_rate': 8.53094823492164e-05, 'epoch': 0.57}


 58%|█████▊    | 7300/12639 [10:33:45<7:37:14,  5.14s/it]

{'loss': 0.2336, 'grad_norm': 0.3567485809326172, 'learning_rate': 8.451796738958366e-05, 'epoch': 0.58}


 58%|█████▊    | 7350/12639 [10:37:55<6:56:44,  4.73s/it]

{'loss': 0.2632, 'grad_norm': 0.47731834650039673, 'learning_rate': 8.372645242995093e-05, 'epoch': 0.58}


 59%|█████▊    | 7400/12639 [10:42:09<7:37:48,  5.24s/it]

{'loss': 0.2369, 'grad_norm': 0.7227634787559509, 'learning_rate': 8.29349374703182e-05, 'epoch': 0.59}


 59%|█████▉    | 7450/12639 [10:46:28<7:33:18,  5.24s/it]

{'loss': 0.2643, 'grad_norm': 0.4663230776786804, 'learning_rate': 8.214342251068546e-05, 'epoch': 0.59}


 59%|█████▉    | 7500/12639 [10:50:41<7:50:42,  5.50s/it]

{'loss': 0.2755, 'grad_norm': 0.5623429417610168, 'learning_rate': 8.135190755105272e-05, 'epoch': 0.59}


 60%|█████▉    | 7550/12639 [10:55:02<8:08:44,  5.76s/it]

{'loss': 0.2714, 'grad_norm': 0.5131655931472778, 'learning_rate': 8.056039259141999e-05, 'epoch': 0.6}


 60%|██████    | 7600/12639 [10:59:23<7:38:37,  5.46s/it]

{'loss': 0.2679, 'grad_norm': 0.7581496238708496, 'learning_rate': 7.976887763178724e-05, 'epoch': 0.6}


 61%|██████    | 7650/12639 [11:03:53<8:04:06,  5.82s/it]

{'loss': 0.2738, 'grad_norm': 0.48181629180908203, 'learning_rate': 7.897736267215451e-05, 'epoch': 0.61}


 61%|██████    | 7700/12639 [11:08:30<7:56:35,  5.79s/it]

{'loss': 0.2736, 'grad_norm': 0.5206656455993652, 'learning_rate': 7.818584771252177e-05, 'epoch': 0.61}


 61%|██████▏   | 7750/12639 [11:12:57<6:33:07,  4.82s/it]

{'loss': 0.2848, 'grad_norm': 0.3307555615901947, 'learning_rate': 7.739433275288903e-05, 'epoch': 0.61}


 62%|██████▏   | 7800/12639 [11:17:24<7:16:07,  5.41s/it]

{'loss': 0.2635, 'grad_norm': 0.44905439019203186, 'learning_rate': 7.66028177932563e-05, 'epoch': 0.62}


 62%|██████▏   | 7850/12639 [11:21:42<6:35:37,  4.96s/it]

{'loss': 0.2462, 'grad_norm': 0.4592302143573761, 'learning_rate': 7.581130283362356e-05, 'epoch': 0.62}


 63%|██████▎   | 7900/12639 [11:26:07<7:36:21,  5.78s/it]

{'loss': 0.2835, 'grad_norm': 0.4242914319038391, 'learning_rate': 7.501978787399082e-05, 'epoch': 0.63}


 63%|██████▎   | 7950/12639 [11:30:31<7:18:34,  5.61s/it]

{'loss': 0.2687, 'grad_norm': 0.42728352546691895, 'learning_rate': 7.422827291435809e-05, 'epoch': 0.63}


 63%|██████▎   | 8000/12639 [11:34:48<6:49:54,  5.30s/it]

{'loss': 0.2543, 'grad_norm': 0.36398518085479736, 'learning_rate': 7.343675795472534e-05, 'epoch': 0.63}


 64%|██████▎   | 8050/12639 [11:39:12<6:38:55,  5.22s/it]

{'loss': 0.2574, 'grad_norm': 0.5181361436843872, 'learning_rate': 7.264524299509261e-05, 'epoch': 0.64}


 64%|██████▍   | 8100/12639 [11:43:24<6:37:02,  5.25s/it]

{'loss': 0.2607, 'grad_norm': 0.5347956418991089, 'learning_rate': 7.185372803545987e-05, 'epoch': 0.64}


 64%|██████▍   | 8150/12639 [11:47:38<6:11:51,  4.97s/it]

{'loss': 0.2492, 'grad_norm': 0.45302972197532654, 'learning_rate': 7.106221307582713e-05, 'epoch': 0.64}


 65%|██████▍   | 8200/12639 [11:51:50<6:00:29,  4.87s/it]

{'loss': 0.2447, 'grad_norm': 0.4530758559703827, 'learning_rate': 7.02706981161944e-05, 'epoch': 0.65}


 65%|██████▌   | 8250/12639 [11:56:11<6:43:20,  5.51s/it]

{'loss': 0.2693, 'grad_norm': 0.45829224586486816, 'learning_rate': 6.947918315656167e-05, 'epoch': 0.65}


 66%|██████▌   | 8300/12639 [12:00:43<6:37:12,  5.49s/it]

{'loss': 0.2705, 'grad_norm': 0.5216905474662781, 'learning_rate': 6.868766819692893e-05, 'epoch': 0.66}


 66%|██████▌   | 8350/12639 [12:05:02<5:24:38,  4.54s/it]

{'loss': 0.2477, 'grad_norm': 0.44691574573516846, 'learning_rate': 6.789615323729619e-05, 'epoch': 0.66}


 66%|██████▋   | 8400/12639 [12:09:26<5:56:13,  5.04s/it]

{'loss': 0.2685, 'grad_norm': 0.3273133933544159, 'learning_rate': 6.710463827766346e-05, 'epoch': 0.66}


 67%|██████▋   | 8450/12639 [12:13:48<6:37:04,  5.69s/it]

{'loss': 0.2611, 'grad_norm': 0.4374054968357086, 'learning_rate': 6.631312331803071e-05, 'epoch': 0.67}


 67%|██████▋   | 8500/12639 [12:17:54<6:10:18,  5.37s/it]

{'loss': 0.2401, 'grad_norm': 0.42607665061950684, 'learning_rate': 6.552160835839798e-05, 'epoch': 0.67}


 68%|██████▊   | 8550/12639 [12:22:14<5:46:58,  5.09s/it]

{'loss': 0.2414, 'grad_norm': 0.37299737334251404, 'learning_rate': 6.473009339876523e-05, 'epoch': 0.68}


 68%|██████▊   | 8600/12639 [12:26:37<6:49:38,  6.09s/it]

{'loss': 0.2488, 'grad_norm': 0.41848012804985046, 'learning_rate': 6.39385784391325e-05, 'epoch': 0.68}


 68%|██████▊   | 8650/12639 [12:31:17<6:06:36,  5.51s/it]

{'loss': 0.2667, 'grad_norm': 0.3771616816520691, 'learning_rate': 6.314706347949977e-05, 'epoch': 0.68}


 69%|██████▉   | 8700/12639 [12:35:40<5:57:56,  5.45s/it]

{'loss': 0.2457, 'grad_norm': 0.47512683272361755, 'learning_rate': 6.235554851986703e-05, 'epoch': 0.69}


 69%|██████▉   | 8750/12639 [12:39:51<5:15:23,  4.87s/it]

{'loss': 0.2419, 'grad_norm': 0.5041733384132385, 'learning_rate': 6.156403356023429e-05, 'epoch': 0.69}


 70%|██████▉   | 8800/12639 [12:43:55<5:44:12,  5.38s/it]

{'loss': 0.2465, 'grad_norm': 0.424403578042984, 'learning_rate': 6.0772518600601556e-05, 'epoch': 0.7}


 70%|███████   | 8850/12639 [12:48:06<4:50:00,  4.59s/it]

{'loss': 0.236, 'grad_norm': 0.37819764018058777, 'learning_rate': 5.9981003640968814e-05, 'epoch': 0.7}


 70%|███████   | 8900/12639 [12:52:34<5:36:14,  5.40s/it]

{'loss': 0.265, 'grad_norm': 0.49167129397392273, 'learning_rate': 5.918948868133608e-05, 'epoch': 0.7}


 71%|███████   | 8950/12639 [12:56:54<5:09:24,  5.03s/it]

{'loss': 0.2583, 'grad_norm': 0.23560866713523865, 'learning_rate': 5.839797372170335e-05, 'epoch': 0.71}


 71%|███████   | 9000/12639 [13:01:05<5:20:26,  5.28s/it]

{'loss': 0.2527, 'grad_norm': 0.4904244840145111, 'learning_rate': 5.7606458762070606e-05, 'epoch': 0.71}


 72%|███████▏  | 9050/12639 [13:05:33<4:32:06,  4.55s/it]

{'loss': 0.2634, 'grad_norm': 0.2642139196395874, 'learning_rate': 5.681494380243787e-05, 'epoch': 0.72}


 72%|███████▏  | 9100/12639 [13:09:48<4:51:34,  4.94s/it]

{'loss': 0.2344, 'grad_norm': 0.39830446243286133, 'learning_rate': 5.6023428842805134e-05, 'epoch': 0.72}


 72%|███████▏  | 9150/12639 [13:14:08<5:32:50,  5.72s/it]

{'loss': 0.2663, 'grad_norm': 0.4959358274936676, 'learning_rate': 5.523191388317239e-05, 'epoch': 0.72}


 73%|███████▎  | 9200/12639 [13:18:40<5:44:15,  6.01s/it]

{'loss': 0.2606, 'grad_norm': 0.43513357639312744, 'learning_rate': 5.444039892353966e-05, 'epoch': 0.73}


 73%|███████▎  | 9250/12639 [13:22:48<4:48:48,  5.11s/it]

{'loss': 0.2375, 'grad_norm': 0.45680519938468933, 'learning_rate': 5.364888396390691e-05, 'epoch': 0.73}


 74%|███████▎  | 9300/12639 [13:27:08<4:52:13,  5.25s/it]

{'loss': 0.2559, 'grad_norm': 0.4622076153755188, 'learning_rate': 5.2857369004274184e-05, 'epoch': 0.74}


 74%|███████▍  | 9350/12639 [13:31:18<4:43:33,  5.17s/it]

{'loss': 0.2455, 'grad_norm': 0.4538470208644867, 'learning_rate': 5.206585404464145e-05, 'epoch': 0.74}


 74%|███████▍  | 9400/12639 [13:35:30<4:25:27,  4.92s/it]

{'loss': 0.2559, 'grad_norm': 0.3220619857311249, 'learning_rate': 5.1274339085008706e-05, 'epoch': 0.74}


 75%|███████▍  | 9450/12639 [13:39:42<4:28:19,  5.05s/it]

{'loss': 0.2574, 'grad_norm': 0.4732990860939026, 'learning_rate': 5.048282412537597e-05, 'epoch': 0.75}


 75%|███████▌  | 9500/12639 [13:43:57<4:22:29,  5.02s/it]

{'loss': 0.2451, 'grad_norm': 0.3241519033908844, 'learning_rate': 4.9691309165743234e-05, 'epoch': 0.75}


 76%|███████▌  | 9550/12639 [13:48:19<4:10:03,  4.86s/it]

{'loss': 0.2595, 'grad_norm': 0.5608572363853455, 'learning_rate': 4.88997942061105e-05, 'epoch': 0.76}


 76%|███████▌  | 9600/12639 [13:52:43<3:43:38,  4.42s/it]

{'loss': 0.2552, 'grad_norm': 0.21406610310077667, 'learning_rate': 4.810827924647776e-05, 'epoch': 0.76}


 76%|███████▋  | 9650/12639 [13:57:02<4:08:33,  4.99s/it]

{'loss': 0.2489, 'grad_norm': 0.4443788528442383, 'learning_rate': 4.731676428684502e-05, 'epoch': 0.76}


 77%|███████▋  | 9700/12639 [14:01:18<4:00:08,  4.90s/it]

{'loss': 0.2465, 'grad_norm': 0.37858620285987854, 'learning_rate': 4.652524932721229e-05, 'epoch': 0.77}


 77%|███████▋  | 9750/12639 [14:05:43<4:56:15,  6.15s/it]

{'loss': 0.2576, 'grad_norm': 0.5370598435401917, 'learning_rate': 4.573373436757955e-05, 'epoch': 0.77}


 78%|███████▊  | 9800/12639 [14:10:00<4:08:46,  5.26s/it]

{'loss': 0.2443, 'grad_norm': 0.4243020713329315, 'learning_rate': 4.494221940794681e-05, 'epoch': 0.78}


 78%|███████▊  | 9850/12639 [14:14:25<4:19:44,  5.59s/it]

{'loss': 0.254, 'grad_norm': 0.4644697904586792, 'learning_rate': 4.4150704448314076e-05, 'epoch': 0.78}


 78%|███████▊  | 9900/12639 [14:18:55<4:28:58,  5.89s/it]

{'loss': 0.2527, 'grad_norm': 0.40911272168159485, 'learning_rate': 4.335918948868134e-05, 'epoch': 0.78}


 79%|███████▊  | 9950/12639 [14:23:02<3:24:28,  4.56s/it]

{'loss': 0.2236, 'grad_norm': 0.4848828911781311, 'learning_rate': 4.2567674529048605e-05, 'epoch': 0.79}


 79%|███████▉  | 10000/12639 [14:27:25<3:40:54,  5.02s/it]

{'loss': 0.2826, 'grad_norm': 0.526597797870636, 'learning_rate': 4.177615956941586e-05, 'epoch': 0.79}


 80%|███████▉  | 10050/12639 [14:31:35<3:24:14,  4.73s/it]

{'loss': 0.2325, 'grad_norm': 0.19955897331237793, 'learning_rate': 4.0984644609783126e-05, 'epoch': 0.8}


 80%|███████▉  | 10100/12639 [14:35:50<4:07:55,  5.86s/it]

{'loss': 0.258, 'grad_norm': 0.5286812782287598, 'learning_rate': 4.019312965015039e-05, 'epoch': 0.8}


 80%|████████  | 10150/12639 [14:40:00<3:35:27,  5.19s/it]

{'loss': 0.2306, 'grad_norm': 0.45769375562667847, 'learning_rate': 3.9401614690517654e-05, 'epoch': 0.8}


 81%|████████  | 10200/12639 [14:44:33<3:29:40,  5.16s/it]

{'loss': 0.2654, 'grad_norm': 0.30436280369758606, 'learning_rate': 3.861009973088491e-05, 'epoch': 0.81}


 81%|████████  | 10250/12639 [14:48:35<3:17:09,  4.95s/it]

{'loss': 0.2242, 'grad_norm': 0.5304613709449768, 'learning_rate': 3.781858477125218e-05, 'epoch': 0.81}


 81%|████████▏ | 10300/12639 [14:52:48<2:52:45,  4.43s/it]

{'loss': 0.2277, 'grad_norm': 0.45335882902145386, 'learning_rate': 3.702706981161944e-05, 'epoch': 0.81}


 82%|████████▏ | 10350/12639 [14:56:59<3:19:11,  5.22s/it]

{'loss': 0.2319, 'grad_norm': 0.44872844219207764, 'learning_rate': 3.6235554851986704e-05, 'epoch': 0.82}


 82%|████████▏ | 10400/12639 [15:01:17<3:21:39,  5.40s/it]

{'loss': 0.2473, 'grad_norm': 0.43074560165405273, 'learning_rate': 3.544403989235397e-05, 'epoch': 0.82}


 83%|████████▎ | 10450/12639 [15:05:37<2:49:03,  4.63s/it]

{'loss': 0.2602, 'grad_norm': 0.33414173126220703, 'learning_rate': 3.465252493272123e-05, 'epoch': 0.83}


 83%|████████▎ | 10500/12639 [15:09:53<3:14:02,  5.44s/it]

{'loss': 0.2461, 'grad_norm': 0.43818774819374084, 'learning_rate': 3.386100997308849e-05, 'epoch': 0.83}


 83%|████████▎ | 10550/12639 [15:14:21<3:11:36,  5.50s/it]

{'loss': 0.2474, 'grad_norm': 0.459911048412323, 'learning_rate': 3.3069495013455754e-05, 'epoch': 0.83}


 84%|████████▍ | 10600/12639 [15:18:43<3:28:18,  6.13s/it]

{'loss': 0.256, 'grad_norm': 0.5142537355422974, 'learning_rate': 3.227798005382302e-05, 'epoch': 0.84}


 84%|████████▍ | 10650/12639 [15:23:06<2:52:01,  5.19s/it]

{'loss': 0.2552, 'grad_norm': 0.572019100189209, 'learning_rate': 3.148646509419028e-05, 'epoch': 0.84}


 85%|████████▍ | 10700/12639 [15:27:26<2:57:36,  5.50s/it]

{'loss': 0.2459, 'grad_norm': 0.5973889231681824, 'learning_rate': 3.069495013455754e-05, 'epoch': 0.85}


 85%|████████▌ | 10750/12639 [15:31:42<3:06:15,  5.92s/it]

{'loss': 0.2569, 'grad_norm': 0.4406963288784027, 'learning_rate': 2.990343517492481e-05, 'epoch': 0.85}


 85%|████████▌ | 10800/12639 [15:36:09<2:45:33,  5.40s/it]

{'loss': 0.2531, 'grad_norm': 0.46636953949928284, 'learning_rate': 2.911192021529207e-05, 'epoch': 0.85}


 86%|████████▌ | 10850/12639 [15:40:29<2:13:20,  4.47s/it]

{'loss': 0.2536, 'grad_norm': 0.3210481107234955, 'learning_rate': 2.8320405255659332e-05, 'epoch': 0.86}


 86%|████████▌ | 10900/12639 [15:45:04<2:40:24,  5.53s/it]

{'loss': 0.273, 'grad_norm': 0.4560873210430145, 'learning_rate': 2.7528890296026593e-05, 'epoch': 0.86}


 87%|████████▋ | 10950/12639 [15:49:27<2:19:22,  4.95s/it]

{'loss': 0.2431, 'grad_norm': 0.18858805298805237, 'learning_rate': 2.673737533639386e-05, 'epoch': 0.87}


 87%|████████▋ | 11000/12639 [15:53:39<1:59:50,  4.39s/it]

{'loss': 0.2149, 'grad_norm': 0.36926260590553284, 'learning_rate': 2.594586037676112e-05, 'epoch': 0.87}


 87%|████████▋ | 11050/12639 [15:58:04<2:49:08,  6.39s/it]

{'loss': 0.2579, 'grad_norm': 0.5075643658638, 'learning_rate': 2.5154345417128382e-05, 'epoch': 0.87}


 88%|████████▊ | 11100/12639 [16:02:17<2:24:57,  5.65s/it]

{'loss': 0.2524, 'grad_norm': 0.423828661441803, 'learning_rate': 2.4362830457495646e-05, 'epoch': 0.88}


 88%|████████▊ | 11150/12639 [16:06:52<2:20:01,  5.64s/it]

{'loss': 0.2518, 'grad_norm': 0.4879577159881592, 'learning_rate': 2.357131549786291e-05, 'epoch': 0.88}


 89%|████████▊ | 11200/12639 [16:11:13<2:26:58,  6.13s/it]

{'loss': 0.2599, 'grad_norm': 0.42566582560539246, 'learning_rate': 2.277980053823017e-05, 'epoch': 0.89}


 89%|████████▉ | 11250/12639 [16:15:23<1:35:50,  4.14s/it]

{'loss': 0.2291, 'grad_norm': 0.3116171658039093, 'learning_rate': 2.1988285578597435e-05, 'epoch': 0.89}


 89%|████████▉ | 11300/12639 [16:19:42<1:59:21,  5.35s/it]

{'loss': 0.2361, 'grad_norm': 0.49474865198135376, 'learning_rate': 2.11967706189647e-05, 'epoch': 0.89}


 90%|████████▉ | 11350/12639 [16:24:03<1:45:05,  4.89s/it]

{'loss': 0.2543, 'grad_norm': 0.5338722467422485, 'learning_rate': 2.0405255659331964e-05, 'epoch': 0.9}


 90%|█████████ | 11400/12639 [16:28:22<1:53:26,  5.49s/it]

{'loss': 0.2545, 'grad_norm': 0.42535388469696045, 'learning_rate': 1.9613740699699228e-05, 'epoch': 0.9}


 91%|█████████ | 11450/12639 [16:32:34<1:47:35,  5.43s/it]

{'loss': 0.2291, 'grad_norm': 0.464789479970932, 'learning_rate': 1.882222574006649e-05, 'epoch': 0.91}


 91%|█████████ | 11500/12639 [16:36:46<1:34:39,  4.99s/it]

{'loss': 0.2535, 'grad_norm': 0.5101883411407471, 'learning_rate': 1.8030710780433753e-05, 'epoch': 0.91}


 91%|█████████▏| 11550/12639 [16:41:04<1:31:53,  5.06s/it]

{'loss': 0.2527, 'grad_norm': 0.45966747403144836, 'learning_rate': 1.7239195820801013e-05, 'epoch': 0.91}


 92%|█████████▏| 11600/12639 [16:45:22<1:22:44,  4.78s/it]

{'loss': 0.26, 'grad_norm': 0.26506611704826355, 'learning_rate': 1.6447680861168278e-05, 'epoch': 0.92}


 92%|█████████▏| 11650/12639 [16:49:41<1:45:43,  6.41s/it]

{'loss': 0.2458, 'grad_norm': 0.3868105709552765, 'learning_rate': 1.565616590153554e-05, 'epoch': 0.92}


 93%|█████████▎| 11700/12639 [16:53:59<1:17:31,  4.95s/it]

{'loss': 0.2546, 'grad_norm': 0.43098878860473633, 'learning_rate': 1.4864650941902802e-05, 'epoch': 0.93}


 93%|█████████▎| 11750/12639 [16:58:11<1:14:15,  5.01s/it]

{'loss': 0.2423, 'grad_norm': 0.532611608505249, 'learning_rate': 1.4073135982270067e-05, 'epoch': 0.93}


 93%|█████████▎| 11800/12639 [17:02:37<1:11:36,  5.12s/it]

{'loss': 0.2556, 'grad_norm': 0.4706098139286041, 'learning_rate': 1.3281621022637327e-05, 'epoch': 0.93}


 94%|█████████▍| 11850/12639 [17:07:02<1:17:18,  5.88s/it]

{'loss': 0.2578, 'grad_norm': 0.47515761852264404, 'learning_rate': 1.2490106063004591e-05, 'epoch': 0.94}


 94%|█████████▍| 11900/12639 [17:11:41<1:03:43,  5.17s/it]

{'loss': 0.2706, 'grad_norm': 0.48805731534957886, 'learning_rate': 1.1698591103371854e-05, 'epoch': 0.94}


 95%|█████████▍| 11950/12639 [17:16:01<1:00:33,  5.27s/it]

{'loss': 0.2582, 'grad_norm': 0.3970377743244171, 'learning_rate': 1.0907076143739116e-05, 'epoch': 0.95}


 95%|█████████▍| 12000/12639 [17:20:16<57:56,  5.44s/it]  

{'loss': 0.2441, 'grad_norm': 0.5387296080589294, 'learning_rate': 1.011556118410638e-05, 'epoch': 0.95}


 95%|█████████▌| 12050/12639 [17:24:36<45:22,  4.62s/it]  

{'loss': 0.2551, 'grad_norm': 0.4868148863315582, 'learning_rate': 9.324046224473643e-06, 'epoch': 0.95}


 96%|█████████▌| 12100/12639 [17:28:58<55:04,  6.13s/it]

{'loss': 0.259, 'grad_norm': 0.4920485019683838, 'learning_rate': 8.532531264840905e-06, 'epoch': 0.96}


 96%|█████████▌| 12150/12639 [17:33:25<42:09,  5.17s/it]

{'loss': 0.2431, 'grad_norm': 0.4616495370864868, 'learning_rate': 7.74101630520817e-06, 'epoch': 0.96}


 97%|█████████▋| 12200/12639 [17:37:57<40:04,  5.48s/it]

{'loss': 0.244, 'grad_norm': 0.41757044196128845, 'learning_rate': 6.949501345575432e-06, 'epoch': 0.97}


 97%|█████████▋| 12250/12639 [17:42:06<28:00,  4.32s/it]

{'loss': 0.2548, 'grad_norm': 0.4209587872028351, 'learning_rate': 6.1579863859426945e-06, 'epoch': 0.97}


 97%|█████████▋| 12300/12639 [17:46:37<32:35,  5.77s/it]

{'loss': 0.2729, 'grad_norm': 0.4966995120048523, 'learning_rate': 5.366471426309958e-06, 'epoch': 0.97}


 98%|█████████▊| 12350/12639 [17:50:59<22:53,  4.75s/it]

{'loss': 0.2467, 'grad_norm': 0.4187552332878113, 'learning_rate': 4.57495646667722e-06, 'epoch': 0.98}


 98%|█████████▊| 12400/12639 [17:55:11<18:01,  4.52s/it]

{'loss': 0.2514, 'grad_norm': 0.5528647899627686, 'learning_rate': 3.7834415070444836e-06, 'epoch': 0.98}


 99%|█████████▊| 12450/12639 [17:59:18<19:26,  6.17s/it]

{'loss': 0.2257, 'grad_norm': 0.4528505504131317, 'learning_rate': 2.991926547411746e-06, 'epoch': 0.99}


 99%|█████████▉| 12500/12639 [18:03:32<12:06,  5.23s/it]

{'loss': 0.2536, 'grad_norm': 0.42697736620903015, 'learning_rate': 2.2004115877790094e-06, 'epoch': 0.99}


 99%|█████████▉| 12550/12639 [18:07:48<06:31,  4.40s/it]

{'loss': 0.2527, 'grad_norm': 0.2999191880226135, 'learning_rate': 1.408896628146272e-06, 'epoch': 0.99}


100%|█████████▉| 12600/12639 [18:12:15<03:44,  5.75s/it]

{'loss': 0.2439, 'grad_norm': 0.41267263889312744, 'learning_rate': 6.173816685135349e-07, 'epoch': 1.0}


100%|██████████| 12639/12639 [18:15:51<00:00,  5.20s/it]

{'train_runtime': 65751.5389, 'train_samples_per_second': 1.538, 'train_steps_per_second': 0.192, 'train_loss': 0.2648867760214665, 'epoch': 1.0}
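
The summary line above can be sanity-checked against the progress bar and the per-step logs. The sketch below converts the runtime to h:mm:ss, estimates the number of samples per optimizer step, and compares a linear learning-rate schedule with two logged values. The peak learning rate of 2e-4 and the 5 warmup steps are assumptions: they are not shown in this section and were chosen only because they reproduce the logged numbers. `linear_lr` is a hypothetical helper, not part of any library.

```python
# Values copied from the final log line and progress bar above.
train_runtime = 65751.5389      # seconds
samples_per_second = 1.538
steps_per_second = 0.192
total_steps = 12639

# Wall-clock time as h:mm:ss, to compare with the progress bar.
hours, rem = divmod(int(train_runtime), 3600)
minutes, seconds = divmod(rem, 60)
runtime_hms = f"{hours}:{minutes:02d}:{seconds:02d}"

# Samples consumed per optimizer step
# (per-device batch size x gradient accumulation steps x devices).
effective_batch = round(samples_per_second / steps_per_second)


def linear_lr(step, peak_lr=2e-4, warmup_steps=5, total=total_steps):
    """Linear warmup followed by linear decay to zero.

    peak_lr and warmup_steps are assumed values, not taken from this section.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total - step) / (total - warmup_steps)


print(runtime_hms)          # compare with the final progress bar
print(effective_batch)      # samples per optimizer step
print(linear_lr(6700))      # compare with the learning_rate logged at step 6700
print(linear_lr(12600))     # compare with the learning_rate logged at step 12600
```

If the assumed settings are right, the learning rate reaches zero exactly at step 12639, which is why the last logged value is already below 1e-6.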





In [1]:
#@title Show final memory and time stats
import torch  # must be imported in the current kernel before this cell runs

# start_gpu_memory, max_memory, and trainer_stats are expected to exist from the
# training cells run earlier in the same kernel session.
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
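
The memory cell mixes GPU queries with arithmetic, so it cannot run without the training session. As a minimal sketch, the arithmetic can be separated into a plain function that takes the raw numbers as arguments. `memory_report` is a hypothetical helper, and the inputs below are illustrative values, not measurements from this run. In a live session the inputs would presumably come from `torch.cuda.max_memory_reserved()` (read once before training and once after) and `torch.cuda.get_device_properties(0).total_memory`.

```python
def memory_report(peak_reserved_bytes, start_reserved_gb, total_gb):
    """Summarize peak GPU memory in GB and as a percentage of total memory.

    peak_reserved_bytes: peak reserved memory after training, in bytes
    start_reserved_gb:   reserved memory before training started, in GB
    total_gb:            total device memory, in GB
    """
    gib = 1024 ** 3
    peak_gb = round(peak_reserved_bytes / gib, 3)
    training_gb = round(peak_gb - start_reserved_gb, 3)
    return {
        "peak_gb": peak_gb,
        "training_gb": training_gb,
        "peak_pct": round(peak_gb / total_gb * 100, 3),
        "training_pct": round(training_gb / total_gb * 100, 3),
    }


# Illustrative numbers only: 8 GB peak, 5 GB reserved before training, 16 GB card.
report = memory_report(
    peak_reserved_bytes=8 * 1024 ** 3,
    start_reserved_gb=5.0,
    total_gb=16.0,
)
print(report)
```

Keeping the calculation free of `torch` calls also means the numbers can be recomputed later from saved values, even after the kernel has been restarted.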

In [2]:
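%pip install ipywidgets
# model and tokenizer must be the fine-tuned objects from the training cells above;
# if the kernel was restarted, they have to be re-created before this call.
model.save_pretrained_merged("my-mistral", tokenizer, save_method = "merged_16bit")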

Note: you may need to restart the kernel to use updated packages.


NameError: name 'model' is not defined
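
Both failures above come from running a cell in a kernel where earlier state no longer exists. One possible guard, sketched here under the assumption that the cell depends only on the names `model` and `tokenizer`, is to check the namespace before starting a long-running step. `missing_names` is a hypothetical helper, not part of Unsloth or Jupyter.

```python
def missing_names(namespace, required):
    """Return the names from `required` that are not defined in `namespace`."""
    return [name for name in required if name not in namespace]


# In a notebook cell, the check could look like this (left commented out here
# because it needs a live session):
#
#   missing = missing_names(globals(), ["model", "tokenizer"])
#   if missing:
#       raise RuntimeError(f"Re-run the earlier cells first; missing: {missing}")

# Demonstration with a stand-in namespace instead of a real notebook session.
demo_namespace = {"tokenizer": object()}
print(missing_names(demo_namespace, ["model", "tokenizer"]))  # → ['model']
```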