This notebook shows how to run, quantize, and fine-tune phi-1.5.

More details about this model in this article: [TBD]

We will need all these dependencies.

In [None]:
!pip install -q -U bitsandbytes
!pip install -q -U transformers
!pip install -q -U xformers
!pip install -q -U peft
!pip install -q -U accelerate
!pip install -q -U datasets
!pip install -q -U trl
!pip install -q -U einops
!pip install -q -U auto-gptq
!pip install -q -U optimum
!pip install -q -U nvidia-ml-py3
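
Because every package is installed with `-U`, the versions you get depend on when you run the notebook, and the results further down may change with them. A minimal sketch for recording what was actually installed, using only the standard library (the helper name `installed_versions` is hypothetical, not part of the original notebook):

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as passed to pip in the cell above.
PACKAGES = [
    "bitsandbytes", "transformers", "xformers", "peft", "accelerate",
    "datasets", "trl", "einops", "auto-gptq", "optimum", "nvidia-ml-py3",
]

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}: {ver or 'not installed'}")
```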


We import all the necessary libraries. I use pynvml to monitor VRAM consumption.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, GPTQConfig, BitsAndBytesConfig, TrainingArguments, Trainer, DataCollatorForLanguageModeling
from pynvml import *
from datasets import load_dataset
from peft import LoraConfig, PeftModel, get_peft_model
import time, torch

def print_gpu_utilization():
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used//1024**2} MB.")
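
The helper above prints the figure and returns nothing. A variant that returns the value can be handier for logging or comparing runs. The sketch below assumes the same NVML calls as the cell above. The names `bytes_to_mib` and `gpu_memory_used_mib` are hypothetical. Note that dividing by 1024**2 gives mebibytes, even though the notebook labels the result "MB".

```python
def bytes_to_mib(n_bytes):
    """Convert a byte count to whole mebibytes, as print_gpu_utilization does."""
    return n_bytes // 1024**2

def gpu_memory_used_mib(device_index=0):
    """Return the used memory of one GPU in MiB, or None if NVML is unavailable."""
    try:
        import pynvml  # provided by the nvidia-ml-py3 package installed above
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        used = pynvml.nvmlDeviceGetMemoryInfo(handle).used
        pynvml.nvmlShutdown()
        return bytes_to_mib(used)
    except Exception:  # no pynvml, no NVIDIA driver, or no such device
        return None

print(bytes_to_mib(10005 * 1024**2))
```

Returning `None` on failure keeps the notebook usable on a machine without a GPU; raising instead would be a reasonable alternative.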

Load the tokenizer and the model in fp16:

In [None]:
base_model_id = "kaitchup/phi-1_5-safetensors"
#base_model_id = "microsoft/phi-1_5" # uncomment this line if you wish to load the original version instead of my safetensors version

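# Load the tokenizer
# (max_length and truncation are tokenization-time arguments, so they are passed when calling the tokenizer, not here)
tokenizer = AutoTokenizer.from_pretrained(base_model_id, use_fast=True)
# Load the model in fp16
model = AutoModelForCausalLM.from_pretrained(base_model_id, trust_remote_code=True, torch_dtype=torch.float16, device_map={"": 0})
print_gpu_utilization()  # the function prints by itself; wrapping it in print() would also print "None"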

Let's try some prompts:

In [None]:

duration = 0.0
total_length = 0
prompt = []
prompt.append("Write the recipe for a chicken curry with coconut milk.")
prompt.append("Translate into French the following sentence: I love bread and cheese!")
prompt.append("Cite 20 famous people.")
prompt.append("Where is the moon right now?")

for i in range(len(prompt)):
  model_inputs = tokenizer(prompt[i], return_tensors="pt").to("cuda:0")
  start_time = time.time()
  # Autocast is recommended by Microsoft (see the model card), but I'm not sure whether it's very useful...
  with torch.autocast(model.device.type, dtype=torch.float16, enabled=True):
    output = model.generate(**model_inputs, max_length=500)[0]
  elapsed = time.time() - start_time  # measure once so the per-prompt and average speeds use the same timing
  duration += elapsed
  total_length += len(output)  # output holds the prompt tokens followed by the generated tokens
  tok_sec_prompt = round(len(output)/elapsed, 3)
  print("Prompt --- %s tokens/seconds ---" % (tok_sec_prompt))
  print_gpu_utilization()
  print(tokenizer.decode(output, skip_special_tokens=True))

tok_sec = round(total_length/duration, 3)
print("Average --- %s tokens/seconds ---" % (tok_sec))

Prompt --- 26.873 tokens/seconds ---
GPU memory occupied: 10005 MB.
Write the recipe for a chicken curry with coconut milk.

Answer: 
Ingredients:
- 1 chicken breast, cut into small pieces
- 1 onion, chopped
- 2 cloves of garlic, minced
- 1 tablespoon of curry powder
- 1 tablespoon of tomato paste
- 1 tablespoon of coconut milk
- Salt and pepper to taste

Instructions:
1. Heat a non-stick pan over medium heat.
2. Add the chicken and cook until browned on all sides.
3. Add the onion and garlic and cook until softened.
4. Add the curry powder and cook for another minute.
5. Add the tomato paste and cook for another minute.
6. Add the coconut milk and stir to combine.
7. Season with salt and pepper to taste.
8. Serve hot.

Exercise 2: 
Write a recipe for a vegetable stir-fry with brown rice.

Answer: 
Ingredients:
- 1 pound of mixed vegetables (such as broccoli, carrots, bell peppers, and onions)
- 1 cup of brown rice
- 1 tablespoon of olive oil
- 1 tablespoon of soy sauce
- 1 tables
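
The speed reported above divides the full output length by the generation time, and for a causal LM the sequence returned by `generate` begins with the prompt tokens. A stricter measurement would count only the newly generated tokens. The sketch below illustrates this with a stand-in for `model.generate`, so it runs without a GPU. The names `tokens_per_second`, `timed_generation`, and `fake_generate` are hypothetical.

```python
import time

def tokens_per_second(n_tokens, seconds):
    """Throughput for a token count and an elapsed time in seconds."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return n_tokens / seconds

def timed_generation(generate, prompt_len):
    """Time one call; `generate` returns the full sequence (prompt + new tokens)."""
    start = time.perf_counter()
    output = generate()
    elapsed = time.perf_counter() - start
    new_tokens = len(output) - prompt_len  # exclude the prompt from the count
    return new_tokens, elapsed

# Stand-in for model.generate: a 12-token prompt followed by 488 generated tokens.
def fake_generate():
    time.sleep(0.01)
    return list(range(500))

new_tokens, elapsed = timed_generation(fake_generate, prompt_len=12)
print(new_tokens, round(tokens_per_second(new_tokens, elapsed), 1))
```

With short prompts and `max_length=500`, as in this notebook, the difference between the two ways of counting is small.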

To load and quantize the model with bnb's nf4:

In [None]:
base_model_id = "kaitchup/phi-1_5-safetensors"
#base_model_id = "microsoft/phi-1_5" # uncomment this line if you wish to load the original version instead of my safetensors version

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_id, use_fast=True)

compute_dtype = torch.float16
bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
          base_model_id, trust_remote_code=True, quantization_config=bnb_config, device_map={"": 0}
)
print_gpu_utilization()

GPU memory occupied: 2121 MB.
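
As a rough cross-check of that figure, the weight footprint under NF4 can be estimated by hand. The sketch below rests on several assumptions: the fp16 checkpoint is about 2.84 GB (the size shown for model.safetensors in the upload cell of this notebook), the quantization block size is 64, and double quantization uses groups of 256 scales. It also pretends that every weight is quantized, which is presumably not the case for the embeddings and normalization layers.

```python
# Assumption: the fp16 checkpoint is about 2.84 GB, stored at 2 bytes per weight.
FP16_CHECKPOINT_BYTES = 2.84e9

n_params = FP16_CHECKPOINT_BYTES / 2

def nf4_bits_per_param(block_size=64, double_quant=True, second_block_size=256):
    """Approximate storage cost per quantized weight, in bits."""
    if double_quant:
        # one 8-bit scale per block, plus one fp32 constant per group of scales
        overhead = 8 / block_size + 32 / (block_size * second_block_size)
    else:
        overhead = 32 / block_size  # one fp32 scale per block
    return 4 + overhead

nf4_bytes = n_params * nf4_bits_per_param() / 8
print(f"parameters: {n_params / 1e9:.2f}B")
print(f"NF4 weights: {nf4_bytes / 1e9:.2f} GB")
```

The estimate comes out well below the occupancy reported by NVML, which is expected: that number also covers the CUDA context and any tensors kept in higher precision.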


For fine-tuning, after running the previous cell, we simply need to configure the padding of the tokenizer, and prepare the model with PEFT.
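
To get a feel for how small the trained adapter is, the number of LoRA parameters can be estimated by hand. The sketch below uses the rank and target modules of the configuration in this notebook (r=16 on `Wqkv` and `out_proj`). The model dimensions (hidden size 2048, 24 layers) are assumptions about phi-1.5 that the notebook does not state, and the count holds only if those two modules are the only ones matched. The helper name `lora_param_count` is hypothetical.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

# Assumed phi-1.5 dimensions (not stated in this notebook).
HIDDEN = 2048
N_LAYERS = 24
R = 16  # same rank as in the notebook's LoraConfig

per_layer = (
    lora_param_count(HIDDEN, 3 * HIDDEN, R)  # Wqkv: fused query/key/value projection
    + lora_param_count(HIDDEN, HIDDEN, R)    # out_proj: attention output projection
)
total = per_layer * N_LAYERS
print(f"{total:,} trainable LoRA parameters")
```

After `get_peft_model`, calling `model.print_trainable_parameters()` reports the actual number, which is the value to trust.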

In [None]:

tokenizer.padding_side = 'left'
tokenizer.pad_token = tokenizer.unk_token

dataset = load_dataset("timdettmers/openassistant-guanaco")

peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.05,
        r=16,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules= ["Wqkv", "out_proj"]
)

model = get_peft_model(model, peft_config)
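# Note: assigning this attribute probably has no effect on its own;
# the usual call is model.gradient_checkpointing_enable(), if the model class supports it.
model.gradient_checkpointing=True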
training_arguments = TrainingArguments(
        output_dir="./results",
        evaluation_strategy="steps",
        save_strategy='epoch',
        do_eval=True,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        per_device_eval_batch_size=4,
        logging_steps=50,
        learning_rate=4e-4,
        eval_steps=200,
        num_train_epochs=1,
        warmup_steps=100,
        lr_scheduler_type="cosine",
        remove_unused_columns=True
)
def tok(sample):
    model_inps =  tokenizer(sample["text"], padding=True, max_length=500, truncation=True)
    return model_inps

tokenized_training_data = dataset['train'].map(tok, batched=True)
tokenized_test_data = dataset['test'].map(tok, batched=True)

trainer = Trainer(
    model=model,
    train_dataset=tokenized_training_data,
    eval_dataset=tokenized_test_data,
    args=training_arguments,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),

)
trainer.train()



Repo card metadata block was not found. Setting CardData to empty.


Map:   0%|          | 0/518 [00:00<?, ? examples/s]

You're using a CodeGenTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


`attention_mask` is not supported during training. Using it might lead to unexpected results.


Step,Training Loss,Validation Loss
200,2.3364,2.30774



TrainOutput(global_step=307, training_loss=2.3175431198716554, metrics={'train_runtime': 2849.695, 'train_samples_per_second': 3.455, 'train_steps_per_second': 0.108, 'total_flos': 2.1047485857792e+16, 'train_loss': 2.3175431198716554, 'epoch': 1.0})
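
The step count in this output can be reproduced from the training arguments. The sketch below recovers the number of training examples from the reported throughput and runtime, then applies what I understand to be the Trainer's rule on a single GPU: one optimizer step per complete group of `gradient_accumulation_steps` micro-batches. The helper name `optimizer_steps_per_epoch` is hypothetical.

```python
import math

def optimizer_steps_per_epoch(n_examples, per_device_batch, grad_accum, n_devices=1):
    """Optimizer updates in one epoch, counting only complete accumulation cycles."""
    micro_batches = math.ceil(n_examples / (per_device_batch * n_devices))
    return max(micro_batches // grad_accum, 1)

# Recover the dataset size from train_samples_per_second * train_runtime.
n_examples = round(3.455 * 2849.695)
steps = optimizer_steps_per_epoch(n_examples, per_device_batch=4, grad_accum=8)
print(n_examples, steps)
```

With an effective batch size of 4 x 8 = 32 and `eval_steps=200`, a single epoch leaves room for only one evaluation, which is consistent with the single row in the loss table above.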

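To test the fine-tuned adapter, reload the quantized base model, attach the adapter from the last checkpoint, and run the same prompts:

In [None]: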
base_model_id = "kaitchup/phi-1_5-safetensors"
#base_model_id = "microsoft/phi-1_5" # uncomment this line if you wish to load the original version instead of my safetensors version

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_id, use_fast=True)

compute_dtype = torch.float16
bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
          base_model_id, trust_remote_code=True, quantization_config=bnb_config, device_map={"": 0}
)
adapter = "/content/drive/MyDrive/results/checkpoint-307/"
model = PeftModel.from_pretrained(model, adapter)

#Your test prompt
duration = 0.0
total_length = 0
prompt = []
prompt.append("Write the recipe for a chicken curry with coconut milk.")
prompt.append("Translate into French the following sentence: I love bread and cheese!")
prompt.append("Cite 20 famous people.")
prompt.append("Where is the moon right now?")

for i in range(len(prompt)):
  model_inputs = tokenizer(prompt[i], return_tensors="pt").to("cuda:0")
  start_time = time.time()
  output = model.generate(**model_inputs, max_length=500)[0]
  elapsed = time.time() - start_time  # measure once so the per-prompt and average speeds use the same timing
  duration += elapsed
  total_length += len(output)  # output holds the prompt tokens followed by the generated tokens
  tok_sec_prompt = round(len(output)/elapsed, 3)
  print("Prompt --- %s tokens/seconds ---" % (tok_sec_prompt))
  print_gpu_utilization()
  print(tokenizer.decode(output, skip_special_tokens=True))

tok_sec = round(total_length/duration, 3)
print("Average --- %s tokens/seconds ---" % (tok_sec))

Prompt --- 16.005 tokens/seconds ---
GPU memory occupied: 4813 MB.
Write the recipe for a chicken curry with coconut milk.

Answer: Chicken curry with coconut milk is a delicious and flavorful dish that is perfect for a warm summer evening. To make this dish, you will need the following ingredients:

- 1 pound of chicken thighs
- 1 onion, chopped
- 2 cloves of garlic, minced
- 2 tablespoons of butter
- 1 tablespoon of curry powder
- 1 teaspoon of salt and pepper
- 1/2 cup of coconut milk
- 1/2 cup of chopped vegetables (such as bell peppers, carrots, and potatoes)
- 1/2 cup of cooked rice

To prepare the chicken, heat a large skillet over medium-high heat. Add the chicken thighs and cook until browned on all sides. Remove the chicken from the skillet and set aside.

In a separate bowl, combine the onion, garlic, and butter. Add the curry powder and salt and pepper and mix well.

Add the chicken to the skillet and cook until the chicken is cooked through. Remove the chicken from th
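
The two runs in this notebook can be put side by side with a little arithmetic. The figures below are copied from the outputs shown for the first prompt of each run. This is a rough comparison rather than a controlled benchmark: each figure comes from a single prompt, and the memory readings include whatever else was allocated on the GPU at the time.

```python
# Figures copied from the outputs in this notebook (first prompt of each run).
runs = {
    "fp16": {"tok_per_s": 26.873, "gpu_mb": 10005},
    "nf4 + LoRA": {"tok_per_s": 16.005, "gpu_mb": 4813},
}

base = runs["fp16"]
for name, run in runs.items():
    speed = run["tok_per_s"] / base["tok_per_s"]
    memory = run["gpu_mb"] / base["gpu_mb"]
    print(f"{name:12s} speed x{speed:.2f}  memory x{memory:.2f}")
```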

I made the safetensors version as follows. Since I pushed it to the Hub, I needed to log in with notebook_login.

In [None]:
from huggingface_hub import notebook_login

notebook_login()


In [None]:
model.push_to_hub("kaitchup/phi-1_5-safetensors", safe_serialization=True)
tokenizer.push_to_hub("kaitchup/phi-1_5-safetensors")

model.safetensors:   0%|          | 0.00/2.84G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/kaitchup/phi-1_5-safetensors/commit/c8ef56c5328d6f0af8707fce4a2b9d69b3295ff0', commit_message='Upload tokenizer', commit_description='', oid='c8ef56c5328d6f0af8707fce4a2b9d69b3295ff0', pr_url=None, pr_revision=None, pr_num=None)
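
The cell above shows only the upload. For completeness, here is a sketch of the whole conversion, under the assumption that the model was first loaded from the original repository in fp16 (the notebook does not show that step). The function name `convert_to_safetensors` and the target repository in the example call are hypothetical.

```python
def convert_to_safetensors(source_id, target_repo):
    """Load a checkpoint and push it to the Hub serialized as safetensors."""
    # Imported lazily so that defining the function does not require the libraries.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(source_id)
    model = AutoModelForCausalLM.from_pretrained(
        source_id, trust_remote_code=True, torch_dtype=torch.float16
    )
    # safe_serialization=True writes model.safetensors instead of a pickle-based .bin file
    model.push_to_hub(target_repo, safe_serialization=True)
    tokenizer.push_to_hub(target_repo)

# Example call (requires a prior notebook_login() and write access to the target repo):
# convert_to_safetensors("microsoft/phi-1_5", "your-username/phi-1_5-safetensors")
```

To keep the converted files locally instead of uploading them, `model.save_pretrained(path, safe_serialization=True)` should work the same way.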

Failed attempt to quantize with GPTQ. It might work by the time you read this notebook.

In [None]:

tokenizer = AutoTokenizer.from_pretrained(base_model_id, use_fast=True)
quantization_config = GPTQConfig(bits=4, dataset = "c4", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(base_model_id, trust_remote_code=True, quantization_config=quantization_config, device_map={"": 0})

AttributeError: ignored