<a href="https://colab.research.google.com/github/sdossou/LoRA_QLoRA/blob/main/%22Instruction_Tuning%22_Mistral_7B_Instruct_with_4_bit_QLoRA_vwikimovies.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Instruct-tuning Mistral-7B-Instruct-v0.2 using `peft`, `transformers` and `bitsandbytes`

This notebook fine-tunes Mistral-7B-Instruct-v0.2, a large language model.

It uses QLoRA (Quantized Low-Rank Adaptation): the base model's weights are loaded in 4-bit precision and kept frozen, while small low-rank adapter matrices are trained on top of them.

It uses the Hugging Face `wiki_movies` dataset to fine-tune Mistral-7B-Instruct-v0.2 to generate instructions: given a list of movies (the context), the model learns to produce the question that would lead to it.

This notebook focuses on the `bitsandbytes` quantization strategy, which includes the idea of `k-bit` training.
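As a rough illustration of why 4-bit loading matters, the sketch below estimates how much memory the weights need at different precisions. The parameter count of roughly 7.24 billion and the block sizes (64 weights per quantization block, 256 first-level constants per second-level block) are assumptions taken from the Mistral and QLoRA papers rather than from this notebook, and the helper names are hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to store the weights, in gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


def quant_constant_overhead_bits(block_size: int = 64,
                                 double_quant: bool = True,
                                 second_block_size: int = 256) -> float:
    """Extra bits per parameter spent on quantization constants."""
    if not double_quant:
        # One 32-bit constant per block of weights.
        return 32 / block_size
    # Double quantization: 8-bit constants per block, plus one 32-bit
    # constant for every `second_block_size` first-level constants.
    return 8 / block_size + 32 / (block_size * second_block_size)


NUM_PARAMS = 7.24e9  # assumed parameter count, not read from the model

overhead = quant_constant_overhead_bits()
fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
nf4_gb = weight_memory_gb(NUM_PARAMS, 4 + overhead)

print(f"16-bit weights: {fp16_gb:.2f} GB")
print(f"4-bit weights with double quantization: {nf4_gb:.2f} GB")
```

These figures cover the stored weights only; activations, optimizer state and the adapters themselves come on top.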


### Dependencies

Installing relevant dependencies.

In [None]:
!pip install -qU bitsandbytes datasets accelerate loralib transformers peft


### Model loading
Loading the `mistralai/Mistral-7B-Instruct-v0.2` model and tokenizer.

In [None]:
import torch
torch.cuda.is_available()

True

In [None]:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    use_cache=False,
    device_map="auto"
)


In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_id)


In [None]:
base_model.eval()

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralSdpaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MistralRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm()
        (post_attention_layernorm): MistralRMSNorm()
      )
    )

In [None]:
prompt = "The movie is"

In [None]:
%%time
# `%%time` (cell magic) times the whole cell; `%time` on a line of its own times nothing.
input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.cuda()
outputs = base_model.generate(input_ids=input_ids, max_new_tokens=100, do_sample=True, top_p=0.95, temperature=0.4)
tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0]

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...




'The movie is a 2005 American drama film directed by George Clooney, who also starred in it along with Brad Pitt, Sharon Stone, Sandra Bullock, and Julia Roberts. The film is about a group of Hollywood actors who travel to the Darfur region of Sudan to raise awareness about the ongoing humanitarian crisis there. The film received mixed reviews from critics, but it was a box office success.\n\nOne of the most memorable moments in the film comes when the'

The tokenizer is already loaded but has no padding token set, so reuse the end-of-sequence token for padding and pad on the right.

In [None]:

tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

#### Model Architecture

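LoRA is applied to modules that hold the attention weights. Their names are model dependent: Mistral exposes `q_proj`, `k_proj`, `v_proj` and `o_proj`, whereas some other architectures use a single fused `query_key_value` module. Printing the model shows which names are available.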

In [None]:
print(base_model)

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralSdpaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MistralRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm()
        (post_attention_layernorm): MistralRMSNorm()
      )
    )

### Apply LoRA

The low-rank adapters (LoRA) are specified in a `LoraConfig` and attached to the model with the `get_peft_model` utility function from `peft`, which returns a `PeftModel`.
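To make the idea of a low-rank adapter concrete, here is a minimal pure-Python sketch of a LoRA forward pass on a toy layer. It is not how `peft` implements it; the function names and the tiny matrices are made up for illustration. The frozen weight `W` is left untouched and the trainable pair `A`, `B` adds a scaled update `(alpha / r) * B(Ax)`.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Frozen base output plus the scaled low-rank update."""
    base = matvec(W, x)                # frozen weights, never updated
    update = matvec(B, matvec(A, x))   # project down to rank r, then back up
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Toy layer: 3 inputs, 2 outputs, rank 1 (all values are illustrative).
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]
A = [[0.5, -0.5, 1.0]]      # shape r x in_features
B_init = [[0.0], [0.0]]     # shape out_features x r, zeros at initialisation
B_trained = [[1.0], [-1.0]]  # pretend values after some training
x = [1.0, 2.0, 3.0]

y_init = lora_forward(W, A, B_init, x, alpha=2, r=1)
y_trained = lora_forward(W, A, B_trained, x, alpha=2, r=1)
print(y_init, y_trained)
```

Because `B` starts at zero, the adapted layer initially behaves exactly like the base layer, and training only has to move `A` and `B`.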

#### Helper Function to Print Parameter Percentage

This helper function prints how much LoRA reduces the number of trainable parameters.

In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

#### Initialising LoRA Config
Main parameters:

- `r`: the "rank" of the two decomposed matrices used to represent the weight update, i.e. the inner dimension the two matrices share. A larger `r` gives the adapters more capacity at the cost of more trainable parameters. The LoRA paper provides further context for choosing `r`.

- `target_modules`: As LoRA can be applied to *any* weight matrix, *which* modules (weight matrices) it is applied to needs to be configured. The LoRA paper suggests applying it only to the attention weights, while the QLoRA paper recommends applying it to all linear layers. In the cell below `target_modules` is commented out, so `peft` falls back to its default for this architecture; the printed model shows adapters on `q_proj` but not on `k_proj`.


- `task_type`: the kind of task the model is trained for. For a causal language model such as Mistral, this should be set to `CAUSAL_LM`. Ensure this property matches the selected model.
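The effect of `r` on the number of trainable parameters can be checked by hand. The sketch below assumes that adapters are attached only to `q_proj` and `v_proj` in each of the 32 decoder layers, which is what the trainable-parameter count reported in this notebook suggests; the layer shapes come from the printed architecture and the helper name is hypothetical.

```python
def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters in one adapter: A is (r x in_features), B is (out_features x r)."""
    return r * in_features + out_features * r


R = 64           # rank used in this notebook's LoraConfig
NUM_LAYERS = 32  # decoder layers in the printed model

per_layer = (
    lora_params(4096, 4096, R)    # q_proj: 4096 -> 4096
    + lora_params(4096, 1024, R)  # v_proj: 4096 -> 1024
)
total_trainable = NUM_LAYERS * per_layer
print(total_trainable)
```

Halving `r` halves this number, which is why reducing `r` is one of the levers for lowering memory use.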


In [None]:
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    #target_modules=["q_proj", "v_proj", "k_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

base_model = prepare_model_for_kbit_training(base_model)
model = get_peft_model(base_model, lora_config)
print_trainable_parameters(model)

trainable params: 27262976 || all params: 3779334144 || trainable%: 0.7213698223345028


In [None]:
print(model)

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): MistralForCausalLM(
      (model): MistralModel(
        (embed_tokens): Embedding(32000, 4096)
        (layers): ModuleList(
          (0-31): 32 x MistralDecoderLayer(
            (self_attn): MistralSdpaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=64, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=64, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): Linear4bit(in_features=4096, out_features=1024

### Preprocessing

Loading the `wiki_movies` dataset from Hugging Face.

In [None]:
import transformers
from datasets import load_dataset

dataset_name = "wiki_movies"

In [None]:
dataset = load_dataset(dataset_name)
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['question', 'answer'],
        num_rows: 96185
    })
    test: Dataset({
        features: ['question', 'answer'],
        num_rows: 9952
    })
    validation: Dataset({
        features: ['question', 'answer'],
        num_rows: 10000
    })
})


In [None]:
print(dataset['train'][0])

{'question': ' what movies are about ginger rogers?', 'answer': 'Top Hat, Kitty Foyle, The Barkleys of Broadway\n'}


In [None]:
print(dataset['train'][5])

{'question': ' what movies can be described with ben stiller?', 'answer': "Tropic Thunder, Meet the Parents, There's Something About Mary, Madagascar, The Secret Life of Walter Mitty, Night at the Museum, Meet the Fockers, The Royal Tenenbaums, Zoolander, The Cable Guy, Tower Heist, Along Came Polly, Little Fockers, The Heartbreak Kid, Keeping the Faith, Duplex, Reality Bites, Greenberg, Envy, Flirting with Disaster, Zero Effect\n"}


Limiting the number of samples to 5K.

In [None]:
dataset_subset = dataset["train"].select(range(5_000))

Putting the data in the following form:

```
Generate a simple question that could result in the provided context.[INST]CONTEXT: Top Hat, Kitty Foyle, The Barkleys of Broadway
[/INST]INSTRUCTION:  what movies are about ginger rogers?
```

This shows the model examples of a prompt and its completion.

In [None]:
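def generate_prompt(example, return_response=True) -> list:
    """Format one dataset row as a prompt, optionally followed by the expected response."""
    # Caveat: depending on the trl version, SFTTrainer may call this with a batch
    # (a dict of lists), in which case the whole batch is rendered into one string.
    full_prompt = "Generate a simple question that could result in the provided context."
    full_prompt += f"[INST]CONTEXT: {example['answer']}[/INST]"

    if return_response:
        full_prompt += "INSTRUCTION: "
        full_prompt += f"{example['question']}"
    # Returned as a one-element list, which is why callers index the result with [0].
    return [full_prompt]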

In [None]:
generate_prompt(dataset_subset[0])[0]

'Generate a simple question that could result in the provided context.[INST]CONTEXT: Top Hat, Kitty Foyle, The Barkleys of Broadway\n[/INST]INSTRUCTION:  what movies are about ginger rogers?'

`TrainingArguments` exposes the same hyper-parameters found in traditional ML applications, such as the number of epochs, the batch size and the learning rate.

If you run into CUDA memory issues, lower `per_device_train_batch_size` and also reduce `r` in the `LoraConfig`. You will need to restart and re-run the notebook after doing so.
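Lowering `per_device_train_batch_size` also changes how many samples contribute to each optimizer step. A small sketch, using a hypothetical helper and assuming a single GPU, shows how `gradient_accumulation_steps` can be raised to keep that number constant:

```python
def effective_batch_size(per_device_train_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_devices: int = 1) -> int:
    """Samples that contribute to one optimizer step."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_devices


# Settings used in this notebook: batch size 4, accumulation 2.
current = effective_batch_size(4, 2)

# Lower-memory alternatives that keep the same effective batch size.
half_batch = effective_batch_size(2, 4)
single_sample = effective_batch_size(1, 8)

print(current, half_batch, single_sample)
```

Smaller per-device batches with more accumulation steps trade speed for memory while leaving the optimization problem roughly unchanged.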

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="mistral-7b-instruct",
    num_train_epochs=100,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    optim="paged_adamw_32bit", # from the QLoRA paper
    logging_steps=1,
    save_strategy="epoch",
    learning_rate=2e-4,
    bf16=True, # ensure proper upcasting for compute dtypes
    tf32=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="constant",
    disable_tqdm=True
)

In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [None]:
!pip install trl -U -q

In [None]:
from trl import SFTTrainer

max_seq_length = 2048

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset_subset,
    peft_config=lora_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    formatting_func=generate_prompt,
    args=training_args,
)



In [None]:
trainer.train()

{'loss': 1.8399, 'grad_norm': 5.5662970542907715, 'learning_rate': 0.0002, 'epoch': 1.0}




{'loss': 1.7438, 'grad_norm': 5.399608135223389, 'learning_rate': 0.0002, 'epoch': 2.0}




{'loss': 1.5656, 'grad_norm': 1.5885920524597168, 'learning_rate': 0.0002, 'epoch': 3.0}




{'loss': 1.5517, 'grad_norm': 1.2298073768615723, 'learning_rate': 0.0002, 'epoch': 4.0}




{'loss': 1.396, 'grad_norm': 0.9420951008796692, 'learning_rate': 0.0002, 'epoch': 5.0}




{'loss': 1.395, 'grad_norm': 1.2856916189193726, 'learning_rate': 0.0002, 'epoch': 6.0}




{'loss': 1.2454, 'grad_norm': 0.9341962933540344, 'learning_rate': 0.0002, 'epoch': 7.0}




{'loss': 1.259, 'grad_norm': 1.0971554517745972, 'learning_rate': 0.0002, 'epoch': 8.0}




{'loss': 1.2025, 'grad_norm': 1.2194392681121826, 'learning_rate': 0.0002, 'epoch': 9.0}




{'loss': 1.0379, 'grad_norm': 1.2554575204849243, 'learning_rate': 0.0002, 'epoch': 10.0}




{'loss': 1.0412, 'grad_norm': 1.237625241279602, 'learning_rate': 0.0002, 'epoch': 11.0}




{'loss': 0.9945, 'grad_norm': 2.2547364234924316, 'learning_rate': 0.0002, 'epoch': 12.0}




{'loss': 0.9035, 'grad_norm': 1.9199172258377075, 'learning_rate': 0.0002, 'epoch': 13.0}




{'loss': 0.854, 'grad_norm': 1.7880795001983643, 'learning_rate': 0.0002, 'epoch': 14.0}




{'loss': 0.7681, 'grad_norm': 1.5892212390899658, 'learning_rate': 0.0002, 'epoch': 15.0}




{'loss': 0.7928, 'grad_norm': 1.8445223569869995, 'learning_rate': 0.0002, 'epoch': 16.0}




{'loss': 0.6235, 'grad_norm': 1.8842532634735107, 'learning_rate': 0.0002, 'epoch': 17.0}




{'loss': 0.6325, 'grad_norm': 2.1389455795288086, 'learning_rate': 0.0002, 'epoch': 18.0}




{'loss': 0.677, 'grad_norm': 21.387981414794922, 'learning_rate': 0.0002, 'epoch': 19.0}




{'loss': 0.6059, 'grad_norm': 2.415156602859497, 'learning_rate': 0.0002, 'epoch': 20.0}




{'loss': 0.4933, 'grad_norm': 2.233638048171997, 'learning_rate': 0.0002, 'epoch': 21.0}




{'loss': 0.4013, 'grad_norm': 2.202216148376465, 'learning_rate': 0.0002, 'epoch': 22.0}




{'loss': 0.4769, 'grad_norm': 1.9446707963943481, 'learning_rate': 0.0002, 'epoch': 23.0}




{'loss': 0.3529, 'grad_norm': 1.5803821086883545, 'learning_rate': 0.0002, 'epoch': 24.0}




{'loss': 0.2775, 'grad_norm': 1.728479027748108, 'learning_rate': 0.0002, 'epoch': 25.0}




{'loss': 0.3052, 'grad_norm': 2.3436803817749023, 'learning_rate': 0.0002, 'epoch': 26.0}




{'loss': 0.2699, 'grad_norm': 2.2889883518218994, 'learning_rate': 0.0002, 'epoch': 27.0}




{'loss': 0.2316, 'grad_norm': 1.7391040325164795, 'learning_rate': 0.0002, 'epoch': 28.0}




{'loss': 0.2036, 'grad_norm': 2.009887456893921, 'learning_rate': 0.0002, 'epoch': 29.0}




{'loss': 0.1726, 'grad_norm': 1.940991759300232, 'learning_rate': 0.0002, 'epoch': 30.0}




{'loss': 0.1577, 'grad_norm': 2.1709556579589844, 'learning_rate': 0.0002, 'epoch': 31.0}




{'loss': 0.2148, 'grad_norm': 1.812379002571106, 'learning_rate': 0.0002, 'epoch': 32.0}




{'loss': 0.1117, 'grad_norm': 2.441551446914673, 'learning_rate': 0.0002, 'epoch': 33.0}




{'loss': 0.0836, 'grad_norm': 1.3335607051849365, 'learning_rate': 0.0002, 'epoch': 34.0}




{'loss': 0.0928, 'grad_norm': 1.294570803642273, 'learning_rate': 0.0002, 'epoch': 35.0}




{'loss': 0.0732, 'grad_norm': 1.3371833562850952, 'learning_rate': 0.0002, 'epoch': 36.0}




{'loss': 0.1083, 'grad_norm': 3.3653104305267334, 'learning_rate': 0.0002, 'epoch': 37.0}




{'loss': 0.0563, 'grad_norm': 1.7243590354919434, 'learning_rate': 0.0002, 'epoch': 38.0}




{'loss': 0.0489, 'grad_norm': 2.6516849994659424, 'learning_rate': 0.0002, 'epoch': 39.0}




{'loss': 0.0477, 'grad_norm': 1.5857062339782715, 'learning_rate': 0.0002, 'epoch': 40.0}




{'loss': 0.0428, 'grad_norm': 1.4541518688201904, 'learning_rate': 0.0002, 'epoch': 41.0}




{'loss': 0.0292, 'grad_norm': 0.8874624967575073, 'learning_rate': 0.0002, 'epoch': 42.0}




{'loss': 0.0222, 'grad_norm': 0.6065088510513306, 'learning_rate': 0.0002, 'epoch': 43.0}




{'loss': 0.0207, 'grad_norm': 0.9613732099533081, 'learning_rate': 0.0002, 'epoch': 44.0}




{'loss': 0.0181, 'grad_norm': 0.604729950428009, 'learning_rate': 0.0002, 'epoch': 45.0}




{'loss': 0.0169, 'grad_norm': 0.5317500233650208, 'learning_rate': 0.0002, 'epoch': 46.0}




{'loss': 0.0149, 'grad_norm': 0.47213247418403625, 'learning_rate': 0.0002, 'epoch': 47.0}




{'loss': 0.013, 'grad_norm': 0.28497135639190674, 'learning_rate': 0.0002, 'epoch': 48.0}




{'loss': 0.0126, 'grad_norm': 0.4700471758842468, 'learning_rate': 0.0002, 'epoch': 49.0}




{'loss': 0.0144, 'grad_norm': 0.38096943497657776, 'learning_rate': 0.0002, 'epoch': 50.0}




{'loss': 0.0154, 'grad_norm': 0.49667418003082275, 'learning_rate': 0.0002, 'epoch': 51.0}




{'loss': 0.0138, 'grad_norm': 0.7608466744422913, 'learning_rate': 0.0002, 'epoch': 52.0}




{'loss': 0.0162, 'grad_norm': 1.0026416778564453, 'learning_rate': 0.0002, 'epoch': 53.0}




{'loss': 0.0185, 'grad_norm': 1.5005098581314087, 'learning_rate': 0.0002, 'epoch': 54.0}




{'loss': 0.0141, 'grad_norm': 0.5044769048690796, 'learning_rate': 0.0002, 'epoch': 55.0}




{'loss': 0.013, 'grad_norm': 0.519775927066803, 'learning_rate': 0.0002, 'epoch': 56.0}




{'loss': 0.0128, 'grad_norm': 0.5576001405715942, 'learning_rate': 0.0002, 'epoch': 57.0}




{'loss': 0.0146, 'grad_norm': 0.8309702277183533, 'learning_rate': 0.0002, 'epoch': 58.0}




{'loss': 0.0116, 'grad_norm': 0.4548482894897461, 'learning_rate': 0.0002, 'epoch': 59.0}




{'loss': 0.0129, 'grad_norm': 0.7835220694541931, 'learning_rate': 0.0002, 'epoch': 60.0}




{'loss': 0.0161, 'grad_norm': 1.1050502061843872, 'learning_rate': 0.0002, 'epoch': 61.0}




{'loss': 0.0144, 'grad_norm': 0.8757248520851135, 'learning_rate': 0.0002, 'epoch': 62.0}




{'loss': 0.0123, 'grad_norm': 0.5630855560302734, 'learning_rate': 0.0002, 'epoch': 63.0}




{'loss': 0.0134, 'grad_norm': 0.6079433560371399, 'learning_rate': 0.0002, 'epoch': 64.0}




{'loss': 0.0118, 'grad_norm': 0.5177484154701233, 'learning_rate': 0.0002, 'epoch': 65.0}




{'loss': 0.0115, 'grad_norm': 0.4697326719760895, 'learning_rate': 0.0002, 'epoch': 66.0}




{'loss': 0.0107, 'grad_norm': 0.4415886104106903, 'learning_rate': 0.0002, 'epoch': 67.0}




{'loss': 0.0102, 'grad_norm': 0.40160617232322693, 'learning_rate': 0.0002, 'epoch': 68.0}




{'loss': 0.0091, 'grad_norm': 0.3785891830921173, 'learning_rate': 0.0002, 'epoch': 69.0}




{'loss': 0.0117, 'grad_norm': 0.36139118671417236, 'learning_rate': 0.0002, 'epoch': 70.0}




{'loss': 0.0121, 'grad_norm': 0.492843896150589, 'learning_rate': 0.0002, 'epoch': 71.0}




{'loss': 0.0101, 'grad_norm': 0.38504570722579956, 'learning_rate': 0.0002, 'epoch': 72.0}




{'loss': 0.0105, 'grad_norm': 0.46371206641197205, 'learning_rate': 0.0002, 'epoch': 73.0}




{'loss': 0.0113, 'grad_norm': 0.5552460551261902, 'learning_rate': 0.0002, 'epoch': 74.0}




{'loss': 0.0125, 'grad_norm': 0.49435994029045105, 'learning_rate': 0.0002, 'epoch': 75.0}




{'loss': 0.0113, 'grad_norm': 0.7054689526557922, 'learning_rate': 0.0002, 'epoch': 76.0}




{'loss': 0.0111, 'grad_norm': 0.7928619980812073, 'learning_rate': 0.0002, 'epoch': 77.0}




{'loss': 0.0086, 'grad_norm': 0.2938726842403412, 'learning_rate': 0.0002, 'epoch': 78.0}




{'loss': 0.0078, 'grad_norm': 0.15273171663284302, 'learning_rate': 0.0002, 'epoch': 79.0}




{'loss': 0.0083, 'grad_norm': 0.4725440442562103, 'learning_rate': 0.0002, 'epoch': 80.0}




{'loss': 0.0084, 'grad_norm': 0.4120459258556366, 'learning_rate': 0.0002, 'epoch': 81.0}




{'loss': 0.0082, 'grad_norm': 0.4761620759963989, 'learning_rate': 0.0002, 'epoch': 82.0}




{'loss': 0.0103, 'grad_norm': 0.7551746368408203, 'learning_rate': 0.0002, 'epoch': 83.0}




{'loss': 0.0107, 'grad_norm': 0.8527773022651672, 'learning_rate': 0.0002, 'epoch': 84.0}




{'loss': 0.0086, 'grad_norm': 0.272881418466568, 'learning_rate': 0.0002, 'epoch': 85.0}




{'loss': 0.0083, 'grad_norm': 0.2880040407180786, 'learning_rate': 0.0002, 'epoch': 86.0}




{'loss': 0.0074, 'grad_norm': 0.28153976798057556, 'learning_rate': 0.0002, 'epoch': 87.0}




{'loss': 0.0079, 'grad_norm': 0.33399638533592224, 'learning_rate': 0.0002, 'epoch': 88.0}




{'loss': 0.0084, 'grad_norm': 0.6273951530456543, 'learning_rate': 0.0002, 'epoch': 89.0}




{'loss': 0.0085, 'grad_norm': 0.2780895233154297, 'learning_rate': 0.0002, 'epoch': 90.0}




{'loss': 0.0085, 'grad_norm': 0.37861156463623047, 'learning_rate': 0.0002, 'epoch': 91.0}




{'loss': 0.0074, 'grad_norm': 0.19154714047908783, 'learning_rate': 0.0002, 'epoch': 92.0}




{'loss': 0.0076, 'grad_norm': 0.2508693039417267, 'learning_rate': 0.0002, 'epoch': 93.0}




{'loss': 0.008, 'grad_norm': 0.39923733472824097, 'learning_rate': 0.0002, 'epoch': 94.0}




{'loss': 0.0074, 'grad_norm': 0.22867383062839508, 'learning_rate': 0.0002, 'epoch': 95.0}




{'loss': 0.008, 'grad_norm': 0.47756579518318176, 'learning_rate': 0.0002, 'epoch': 96.0}




{'loss': 0.0073, 'grad_norm': 0.19459567964076996, 'learning_rate': 0.0002, 'epoch': 97.0}




{'loss': 0.0081, 'grad_norm': 0.36741334199905396, 'learning_rate': 0.0002, 'epoch': 98.0}




{'loss': 0.007, 'grad_norm': 0.1176724061369896, 'learning_rate': 0.0002, 'epoch': 99.0}




{'loss': 0.0071, 'grad_norm': 0.7315018773078918, 'learning_rate': 0.0002, 'epoch': 100.0}
{'train_runtime': 531.2239, 'train_samples_per_second': 0.941, 'train_steps_per_second': 0.188, 'train_loss': 0.27044895668514074, 'epoch': 100.0}


TrainOutput(global_step=100, training_loss=0.27044895668514074, metrics={'train_runtime': 531.2239, 'train_samples_per_second': 0.941, 'train_steps_per_second': 0.188, 'train_loss': 0.27044895668514074, 'epoch': 100.0})

In [None]:
trainer.save_model()

In [None]:
from peft import AutoPeftModelForCausalLM

# Reload the saved adapter on top of the 4-bit base model.
# `quantization_config` replaces the deprecated `load_in_4bit` argument;
# `bnb_config` is the BitsAndBytesConfig defined when the base model was loaded.
model = AutoPeftModelForCausalLM.from_pretrained(
    training_args.output_dir,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    quantization_config=bnb_config,
)
tokenizer = AutoTokenizer.from_pretrained(training_args.output_dir)

In [None]:
sample = dataset_subset[5]

prompt = generate_prompt(sample, return_response=False)

In [None]:
input_ids = tokenizer(prompt[0], return_tensors="pt", truncation=True).input_ids.cuda()

outputs = model.generate(input_ids=input_ids, max_new_tokens=100, do_sample=True, top_p=0.9, temperature=0.5)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [None]:
print(f"Prompt:\n{prompt[0]}\n")
print(f"-------------")
print(f"Generated question:\n{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(prompt[0]):]}")
print(f"-------------")
print(f"Ground truth:\n{sample['question']}")

Prompt:
Generate a simple question that could result in the provided context.[INST]CONTEXT: Tropic Thunder, Meet the Parents, There's Something About Mary, Madagascar, The Secret Life of Walter Mitty, Night at the Museum, Meet the Fockers, The Royal Tenenbaums, Zoolander, The Cable Guy, Tower Heist, Along Came Polly, Little Fockers, The Heartbreak Kid, Keeping the Faith, Duplex, Reality Bites, Greenberg, Envy, Flirting with Disaster, Zero Effect
[/INST]

-------------
Generated question:
 What are the names of the films that have been a part of Ben Stiller's acting and directing career, including his comedies like Tropic Thunder, Meet the Parents, and Zoolander?
-------------
Ground truth:
 what movies can be described with ben stiller?


In [None]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

untuned_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    use_cache=False,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(model_id)


In [None]:
input_ids = tokenizer(prompt[0], return_tensors="pt", truncation=True).input_ids.cuda()

outputs = untuned_model.generate(input_ids=input_ids, max_new_tokens=100, do_sample=True, top_p=0.9, temperature=0.5)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [None]:
print(f"Prompt:\n{prompt[0]}\n")
print(f"-------------")
print(f"Generated question:\n{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(prompt[0]):]}")
print(f"-------------")
print(f"Ground truth:\n{sample['question']}")

Prompt:
Generate a simple question that could result in the provided context.[INST]CONTEXT: Tropic Thunder, Meet the Parents, There's Something About Mary, Madagascar, The Secret Life of Walter Mitty, Night at the Museum, Meet the Fockers, The Royal Tenenbaums, Zoolander, The Cable Guy, Tower Heist, Along Came Polly, Little Fockers, The Heartbreak Kid, Keeping the Faith, Duplex, Reality Bites, Greenberg, Envy, Flirting with Disaster, Zero Effect
[/INST]

-------------
Generated question:
 Which movies were released between the years 1998 and 2010 that feature in the given list?
-------------
Ground truth:
 what movies can be described with ben stiller?


This notebook is adapted from a notebook developed by AI Makerspace.