<a href="https://colab.research.google.com/github/tuhinmallick/AI-for-Fashion/blob/main/Fine_tuning_Llama_3_1_Quantized_with_AQLM%2C_HQQ%2C_GPTQ%2C_and_AutoRound_Code_and_Training_Logs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*All the details in this article: [QLoRA with AutoRound: Cheaper and Better LLM Fine-tuning on Your GPU](https://newsletter.kaitchup.com/p/qlora-with-autoround-cheaper-and)*

This notebook shows how to do QLoRA fine-tuning for models quantized with AQLM, HQQ, GPTQ, and AutoRound. It uses Llama 3.1 8B for the examples.

You will need an Ampere or more recent GPU. The code has only been tested with bfloat16 and FlashAttention (SDPA for HQQ, which is not compatible with FlashAttention).
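A quick way to check this requirement before running the heavier cells is to look at the GPU's CUDA compute capability. The sketch below assumes that "Ampere or more recent" corresponds to compute capability 8.0 or higher. The helper names `is_ampere_or_newer` and `describe_gpu` are ours and not part of any library.

```python
def is_ampere_or_newer(major: int, minor: int = 0) -> bool:
    """Return True for CUDA compute capability 8.0+ (assumed to mean Ampere or later)."""
    return (major, minor) >= (8, 0)


def describe_gpu() -> str:
    """Report whether the current GPU looks suitable for this notebook."""
    try:
        import torch  # imported lazily so the helper above works without PyTorch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "No CUDA GPU detected"
    major, minor = torch.cuda.get_device_capability(0)
    ok = is_ampere_or_newer(major, minor) and torch.cuda.is_bf16_supported()
    return f"compute capability {major}.{minor}, suitable: {ok}"


print(describe_gpu())
```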

Install the following libraries:

In [None]:
!pip install hqq aqlm[gpu] auto-gptq auto-round bitsandbytes
!pip install --upgrade transformers peft accelerate datasets trl flash_attn optimum


# HQQ

* not compatible with FlashAttention. Use SDPA instead.
* memory consumption (GPU): 14.9 GB
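Since the attention backend is the one setting that differs between HQQ and the other sections, it can help to keep the choice in a single place. The sketch below records the backends as they are used in this notebook. The dictionary and the helper `attn_for` are hypothetical names introduced for illustration.

```python
# Attention backends as used in the four sections of this notebook.
ATTN_BY_METHOD = {
    "hqq": "sdpa",  # HQQ is not compatible with FlashAttention
    "aqlm": "flash_attention_2",
    "gptq": "flash_attention_2",
    "autoround": "flash_attention_2",
}


def attn_for(method: str) -> str:
    """Return the attn_implementation string for a quantization method."""
    try:
        return ATTN_BY_METHOD[method.lower()]
    except KeyError:
        raise ValueError(f"unknown quantization method: {method!r}") from None


for name in ("HQQ", "AQLM", "GPTQ", "AutoRound"):
    print(f"{name}: {attn_for(name)}")
```

The returned string is what the cells pass as `attn_implementation` to `AutoModelForCausalLM.from_pretrained`.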

In [None]:
import torch, os, multiprocessing
from datasets import load_dataset
from peft import LoraConfig, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    HqqConfig,
    set_seed,
)
from trl import SFTTrainer, SFTConfig

set_seed(1234)


# Use bf16. HQQ is not compatible with FlashAttention, so use SDPA.
compute_dtype = torch.bfloat16
attn_implementation = 'sdpa'

model_name = "meta-llama/Meta-Llama-3.1-8B"
#Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
tokenizer.pad_token = "<|finetune_right_pad_id|>"
tokenizer.pad_token_id = 128004
tokenizer.padding_side = 'right'

ds = load_dataset("timdettmers/openassistant-guanaco")

#Add the EOS token
def process(row):
    row["text"] = row["text"]+"<|end_of_text|>"
    return row

ds = ds.map(
    process,
    num_proc= multiprocessing.cpu_count(),
    load_from_cache_file=False,
)

quant_config = HqqConfig(nbits=4, group_size=128, quant_zero=False, quant_scale=False, axis=1)


model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    torch_dtype=compute_dtype,
    device_map={"": 0},
    attn_implementation=attn_implementation,
)

model = prepare_model_for_kbit_training(model, gradient_checkpointing_kwargs={'use_reentrant':True})


peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.05,
        r=16,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules= ['k_proj', 'q_proj', 'v_proj', 'o_proj', "gate_proj", "down_proj", "up_proj"]
)


training_arguments = SFTConfig(
        output_dir="./Llama3.1_8b_HQQ_right",
        eval_strategy="steps",
        do_eval=True,
        optim="paged_adamw_8bit",
        per_device_train_batch_size=8,
        gradient_accumulation_steps=4,
        per_device_eval_batch_size=8,
        log_level="debug",
        save_strategy="epoch",
        logging_steps=25,
        learning_rate=1e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        eval_steps=25,
        num_train_epochs=1,
        warmup_ratio=0.1,
        lr_scheduler_type="linear",
        dataset_text_field="text",
        max_seq_length=512,
)

trainer = SFTTrainer(
        model=model,
        train_dataset=ds['train'],
        eval_dataset=ds['test'],
        peft_config=peft_config,
        tokenizer=tokenizer,
        args=training_arguments,
)

trainer.train()

Using auto half precision backend
Currently training with a batch size of: 8
***** Running training *****
  Num examples = 9,846
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 4
  Total optimization steps = 307
  Number of trainable parameters = 41,943,040
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
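
The 41,943,040 trainable parameters reported in the log above can be reproduced by hand. LoRA adds two matrices per targeted linear layer, A of shape r × d_in and B of shape d_out × r, so each layer contributes r · (d_in + d_out) parameters. The sketch below assumes the Llama 3.1 8B dimensions (hidden size 4096, intermediate size 14336, 32 layers, 32 attention heads, 8 key/value heads) and the `r=16` setting from `LoraConfig`. The helper name `lora_params` is ours.

```python
HIDDEN = 4096          # hidden_size
INTERMEDIATE = 14336   # intermediate_size
LAYERS = 32            # num_hidden_layers
HEADS = 32             # num_attention_heads
KV_HEADS = 8           # num_key_value_heads
R = 16                 # LoRA rank

head_dim = HIDDEN // HEADS     # 128
kv_dim = KV_HEADS * head_dim   # 1024

# (d_in, d_out) of each targeted projection in one decoder layer
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, kv_dim),
    "v_proj": (HIDDEN, kv_dim),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in: int, d_out: int, r: int = R) -> int:
    """Parameters added by one LoRA adapter: A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)


per_layer = sum(lora_params(*s) for s in shapes.values())
total = per_layer * LAYERS
print(f"{total:,}")  # 41,943,040
```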


| Step | Training Loss | Validation Loss |
|---|---|---|
| 25 | 1.4904 | 1.373379 |
| 50 | 1.3175 | 1.312019 |
| 75 | 1.2935 | 1.29809 |
| 100 | 1.2762 | 1.293054 |
| 125 | 1.28 | 1.288782 |
| 150 | 1.2542 | 1.28545 |
| 175 | 1.2618 | 1.283457 |
| 200 | 1.2687 | 1.28184 |
| 225 | 1.2821 | 1.279762 |
| 250 | 1.2784 | 1.278526 |



***** Running Evaluation *****
  Num examples = 518
  Batch size = 8
We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)


TrainOutput(global_step=307, training_loss=1.2883663084297305, metrics={'train_runtime': 10283.3856, 'train_samples_per_second': 0.957, 'train_steps_per_second': 0.03, 'total_flos': 1.1995141594192282e+17, 'train_loss': 1.2883663084297305, 'epoch': 0.9975629569455727})
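
The 307 optimization steps and the `epoch` value of just under 1 both follow from the batch settings. The sketch below assumes that the Trainer counts one optimization step per `gradient_accumulation_steps` full micro-batches and does not count the leftover micro-batches at the end of the epoch. Under that assumption the arithmetic reproduces the logged values.

```python
import math

num_examples = 9846     # "Num examples" in the training log
per_device_batch = 8    # per_device_train_batch_size
grad_accum = 4          # gradient_accumulation_steps

micro_batches = math.ceil(num_examples / per_device_batch)   # 1231
optim_steps = micro_batches // grad_accum                    # 307
epoch_fraction = optim_steps * grad_accum / micro_batches    # 1228 / 1231

print(micro_batches, optim_steps, round(epoch_fraction, 6))  # 1231 307 0.997563
```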

# AQLM

* memory consumption (GPU): 13.3 GB
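
To put this figure in context, it helps to estimate how much of it is quantized weights. The sketch below is a back-of-the-envelope estimate, not a measurement. It assumes the Llama 3.1 8B layer shapes, counts only the linear projections inside the decoder layers, and ignores quantization metadata (scales, zero points, codebooks) as well as the unquantized embeddings. The function name `weight_gb` is ours.

```python
HIDDEN, INTERMEDIATE, LAYERS = 4096, 14336, 32
KV_DIM = 1024  # 8 key/value heads x head dimension 128

# Weights of the linear projections in one decoder layer
per_layer = (
    2 * HIDDEN * HIDDEN            # q_proj, o_proj
    + 2 * HIDDEN * KV_DIM          # k_proj, v_proj
    + 3 * HIDDEN * INTERMEDIATE    # gate_proj, up_proj, down_proj
)
quantized_weights = per_layer * LAYERS


def weight_gb(bits_per_weight: float) -> float:
    """Storage for the quantized linear weights, in GB (1e9 bytes)."""
    return quantized_weights * bits_per_weight / 8 / 1e9


for bits in (2, 4):
    print(f"{bits}-bit: {weight_gb(bits):.2f} GB")
```

Under these assumptions the 2-bit and 4-bit weights differ by less than 2 GB, which is close to the 1.6 GB gap between the 13.3 GB reported here and the 14.9 GB reported for the 4-bit methods. The rest of the training memory goes to everything that is not quantized, such as activations, LoRA gradients, and optimizer states.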

In [None]:
import torch, os, multiprocessing
from datasets import load_dataset
from peft import LoraConfig, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    set_seed,
)
from trl import SFTTrainer, SFTConfig

set_seed(1234)


# Use bf16 and FlashAttention 2 (no fallback: both require an Ampere or newer GPU)
compute_dtype = torch.bfloat16
attn_implementation = 'flash_attention_2'

model_name = "ISTA-DASLab/Llama-3.1-8B-AQLM-PV-2Bit-1x16-hf"
#Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
tokenizer.pad_token = "<|finetune_right_pad_id|>"
tokenizer.pad_token_id = 128004
tokenizer.padding_side = 'right'

ds = load_dataset("timdettmers/openassistant-guanaco")

#Add the EOS token
def process(row):
    row["text"] = row["text"]+"<|end_of_text|>"
    return row

ds = ds.map(
    process,
    num_proc= multiprocessing.cpu_count(),
    load_from_cache_file=False,
)



model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=compute_dtype,
    device_map={"": 0},
    attn_implementation=attn_implementation,
)

model = prepare_model_for_kbit_training(model, gradient_checkpointing_kwargs={'use_reentrant':True})


peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.05,
        r=16,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules= ['k_proj', 'q_proj', 'v_proj', 'o_proj', "gate_proj", "down_proj", "up_proj"]
)


training_arguments = SFTConfig(
        output_dir="./Llama3.1_8b_AQLM_right",
        eval_strategy="steps",
        do_eval=True,
        optim="paged_adamw_8bit",
        per_device_train_batch_size=8,
        gradient_accumulation_steps=4,
        per_device_eval_batch_size=8,
        log_level="debug",
        save_strategy="epoch",
        logging_steps=25,
        learning_rate=1e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        eval_steps=25,
        num_train_epochs=1,
        warmup_ratio=0.1,
        lr_scheduler_type="linear",
        dataset_text_field="text",
        max_seq_length=512,
)

trainer = SFTTrainer(
        model=model,
        train_dataset=ds['train'],
        eval_dataset=ds['test'],
        peft_config=peft_config,
        tokenizer=tokenizer,
        args=training_arguments,
)

trainer.train()


Using auto half precision backend
Currently training with a batch size of: 8
***** Running training *****
  Num examples = 9,846
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 4
  Total optimization steps = 307
  Number of trainable parameters = 41,943,040
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
Detected flash_attn version: 2.6.3


| Step | Training Loss | Validation Loss |
|---|---|---|
| 25 | 1.6656 | 1.572161 |
| 50 | 1.5012 | 1.502884 |
| 75 | 1.4915 | 1.487007 |
| 100 | 1.4666 | 1.478107 |
| 125 | 1.4582 | 1.471196 |
| 150 | 1.4175 | 1.465552 |
| 175 | 1.4445 | 1.460821 |
| 200 | 1.4339 | 1.457255 |
| 225 | 1.4564 | 1.453755 |
| 250 | 1.4434 | 1.451765 |



***** Running Evaluation *****
  Num examples = 518
  Batch size = 8
We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)


TrainOutput(global_step=307, training_loss=1.4638179971651457, metrics={'train_runtime': 8945.6755, 'train_samples_per_second': 1.101, 'train_steps_per_second': 0.034, 'total_flos': 4.608530472330854e+16, 'train_loss': 1.4638179971651457, 'epoch': 0.9975629569455727})

# GPTQ

* memory consumption (GPU): 14.9 GB

In [None]:
import torch, os, multiprocessing
from datasets import load_dataset
from peft import LoraConfig, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    set_seed,
)
from trl import SFTTrainer, SFTConfig

set_seed(1234)


# Use bf16 and FlashAttention 2 (no fallback: both require an Ampere or newer GPU)
compute_dtype = torch.bfloat16
attn_implementation = 'flash_attention_2'

model_name = "kaitchup/Meta-Llama-3.1-8B-gptq-4bit"
#Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
tokenizer.pad_token = "<|finetune_right_pad_id|>"
tokenizer.pad_token_id = 128004
tokenizer.padding_side = 'right'

ds = load_dataset("timdettmers/openassistant-guanaco")

#Add the EOS token
def process(row):
    row["text"] = row["text"]+"<|end_of_text|>"
    return row

ds = ds.map(
    process,
    num_proc= multiprocessing.cpu_count(),
    load_from_cache_file=False,
)



model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=compute_dtype,
    device_map={"": 0},
    attn_implementation=attn_implementation,
)

model = prepare_model_for_kbit_training(model, gradient_checkpointing_kwargs={'use_reentrant':True})


peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.05,
        r=16,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules= ['k_proj', 'q_proj', 'v_proj', 'o_proj', "gate_proj", "down_proj", "up_proj"]
)


training_arguments = SFTConfig(
        output_dir="./Llama3.1_8b_GPTQ_right",
        eval_strategy="steps",
        do_eval=True,
        optim="paged_adamw_8bit",
        per_device_train_batch_size=8,
        gradient_accumulation_steps=4,
        per_device_eval_batch_size=8,
        log_level="debug",
        save_strategy="epoch",
        logging_steps=25,
        learning_rate=1e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        eval_steps=25,
        num_train_epochs=1,
        warmup_ratio=0.1,
        lr_scheduler_type="linear",
        dataset_text_field="text",
        max_seq_length=512,
)

trainer = SFTTrainer(
        model=model,
        train_dataset=ds['train'],
        eval_dataset=ds['test'],
        peft_config=peft_config,
        tokenizer=tokenizer,
        args=training_arguments,
)

trainer.train()


Using auto half precision backend
Currently training with a batch size of: 8
***** Running training *****
  Num examples = 9,846
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 4
  Total optimization steps = 307
  Number of trainable parameters = 41,943,040
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
Detected flash_attn version: 2.6.3


| Step | Training Loss | Validation Loss |
|---|---|---|
| 25 | 1.4992 | 1.381402 |
| 50 | 1.3289 | 1.324934 |
| 75 | 1.301 | 1.305926 |
| 100 | 1.2826 | 1.300453 |
| 125 | 1.2864 | 1.295995 |
| 150 | 1.2584 | 1.29302 |
| 175 | 1.2683 | 1.290521 |
| 200 | 1.2741 | 1.288619 |
| 225 | 1.2879 | 1.286453 |
| 250 | 1.2832 | 1.285446 |



***** Running Evaluation *****
  Num examples = 518
  Batch size = 8
We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)

Saving model checkpoint to ./drive/MyDrive/Llama3.1_8b_GPTQ_right/checkpoint-307
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--kaitchup--Meta-Llama-3.1-8B-gptq-4bit/snapshots/fd00e3618ba86ae8eea47d90f972edc7d3744af7/config.json
Model config LlamaConfig {
  "_name_or_path": "./drive/MyDrive/Quant-Llama-3.1-8B/GPTQ/Meta-Llama-3.1-8B-gptq-4bit",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "quantization_con

TrainOutput(global_step=307, training_loss=1.2948249139692574, metrics={'train_runtime': 9904.1208, 'train_samples_per_second': 0.994, 'train_steps_per_second': 0.031, 'total_flos': 1.6779506853740544e+16, 'train_loss': 1.2948249139692574, 'epoch': 0.9975629569455727})

# AutoRound

* memory consumption (GPU): 14.9 GB

In [None]:
import torch, os, multiprocessing
from datasets import load_dataset
from peft import LoraConfig, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    set_seed,
)
from trl import SFTTrainer, SFTConfig

set_seed(1234)


# Use bf16 and FlashAttention 2 (no fallback: both require an Ampere or newer GPU)
compute_dtype = torch.bfloat16
attn_implementation = 'flash_attention_2'

model_name = "kaitchup/Meta-Llama-3.1-8B-autoround-gptq-4bit-asym"
#Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
tokenizer.pad_token = "<|finetune_right_pad_id|>"
tokenizer.pad_token_id = 128004
tokenizer.padding_side = 'right'

ds = load_dataset("timdettmers/openassistant-guanaco")

#Add the EOS token
def process(row):
    row["text"] = row["text"]+"<|end_of_text|>"
    return row

ds = ds.map(
    process,
    num_proc= multiprocessing.cpu_count(),
    load_from_cache_file=False,
)



model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=compute_dtype,
    device_map={"": 0},
    attn_implementation=attn_implementation,
)

model = prepare_model_for_kbit_training(model, gradient_checkpointing_kwargs={'use_reentrant':True})


peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.05,
        r=16,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules= ['k_proj', 'q_proj', 'v_proj', 'o_proj', "gate_proj", "down_proj", "up_proj"]
)


training_arguments = SFTConfig(
        output_dir="./Llama3.1_8b_AutoRound_right",
        eval_strategy="steps",
        do_eval=True,
        optim="paged_adamw_8bit",
        per_device_train_batch_size=8,
        gradient_accumulation_steps=4,
        per_device_eval_batch_size=8,
        log_level="debug",
        save_strategy="epoch",
        logging_steps=25,
        learning_rate=1e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        eval_steps=25,
        num_train_epochs=1,
        warmup_ratio=0.1,
        lr_scheduler_type="linear",
        dataset_text_field="text",
        max_seq_length=512,
)

trainer = SFTTrainer(
        model=model,
        train_dataset=ds['train'],
        eval_dataset=ds['test'],
        peft_config=peft_config,
        tokenizer=tokenizer,
        args=training_arguments,
)

trainer.train()


Using auto half precision backend
Currently training with a batch size of: 8
***** Running training *****
  Num examples = 9,846
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 4
  Total optimization steps = 307
  Number of trainable parameters = 41,943,040
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
Detected flash_attn version: 2.6.3


| Step | Training Loss | Validation Loss |
|---|---|---|
| 25 | 1.4593 | 1.345125 |
| 50 | 1.2934 | 1.294062 |
| 75 | 1.2785 | 1.285449 |
| 100 | 1.2639 | 1.281854 |
| 125 | 1.2694 | 1.278327 |
| 150 | 1.2432 | 1.275409 |
| 175 | 1.2504 | 1.27377 |
| 200 | 1.2585 | 1.272375 |
| 225 | 1.273 | 1.270395 |
| 250 | 1.2703 | 1.269632 |



***** Running Evaluation *****
  Num examples = 518
  Batch size = 8
We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)


TrainOutput(global_step=307, training_loss=1.275046867345754, metrics={'train_runtime': 9898.7783, 'train_samples_per_second': 0.995, 'train_steps_per_second': 0.031, 'total_flos': 1.6779506853740544e+16, 'train_loss': 1.275046867345754, 'epoch': 0.9975629569455727})
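
The four runs can be compared side by side using the numbers logged in this notebook. The sketch below only copies values from the logs (the validation loss at step 250, which is the last evaluation shown in each table, the `train_runtime` from each `TrainOutput`, and the GPU memory reported at the top of each section) and converts the runtimes to hours. Nothing is re-measured.

```python
# (method, validation loss at step 250, train_runtime in seconds, GPU memory in GB)
runs = [
    ("HQQ",       1.278526, 10283.3856, 14.9),
    ("AQLM",      1.451765,  8945.6755, 13.3),
    ("GPTQ",      1.285446,  9904.1208, 14.9),
    ("AutoRound", 1.269632,  9898.7783, 14.9),
]

print(f"{'method':<10} {'val loss':>9} {'hours':>6} {'GB':>5}")
for name, val_loss, seconds, mem in sorted(runs, key=lambda r: r[1]):
    print(f"{name:<10} {val_loss:>9.4f} {seconds / 3600:>6.2f} {mem:>5.1f}")

best = min(runs, key=lambda r: r[1])[0]
fastest = min(runs, key=lambda r: r[2])[0]
print("lowest validation loss:", best)
print("shortest runtime:", fastest)
```

In these logs AutoRound reaches the lowest validation loss, while the 2-bit AQLM model trains fastest and uses the least memory but ends with the highest loss.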