### MVP of LLM Finetuning and Inference using QLoRA in 4 bits on Intel GPUs


In this notebook, we load an LLM ("NousResearch/Nous-Hermes-Llama-2-7b") in 4-bit precision and finetune it on Intel Developer Cloud ([IDC](https://cloud.intel.com)) using the PEFT library from Hugging Face 🤗.

A reference notebook that runs the same training on Nvidia GPUs is available [here](https://colab.research.google.com/drive/1H1SHcmYrHiHtAIJwMXT4E7dhXDLqvGtr?usp=sharing). For an extensive tutorial on finetuning an LLM with this approach on the Intel® Data Center GPU Max 1100, check out the **LLM Finetuning notebook** under **"GenAI Essentials"** on IDC.


 **Executive Summary: Changes required to port the same workload from Nvidia to Intel GPUs:**

<style>
.custom-table {
    border-collapse: collapse;
    width: 100%;
}
.custom-table th, .custom-table td {
    border: 1px solid black;
    padding: 8px;
    text-align: left;
}
.custom-table th {
    background-color: #f2f2f2;
}
</style>

<table class="custom-table">
    <tr>
        <td></td>
        <th><b>Nvidia A100 Implementation</b></th>
        <th><b>Intel PVC 1100 Implementation</b></th>
    </tr>
    <tr>
        <td><b>Packages Required</b></td>
        <td>pip install -q -U bitsandbytes</td>
        <td>pip install -q --pre -U bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu</td>
    </tr>
    <tr>
        <td><b>Import Required Libraries</b></td>
        <td>from transformers import BitsAndBytesConfig<br>from transformers import AutoModelForCausalLM<br>from peft import get_peft_model, prepare_model_for_kbit_training</td>
        <td>import intel_extension_for_pytorch as ipex<br>from bigdl.llm.transformers import AutoModelForCausalLM<br>from bigdl.llm.transformers.qlora import get_peft_model, prepare_model_for_kbit_training</td>
    </tr>
    <tr>
        <td><b>Model Loading and Configuration</b></td>
        <td>bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16)<br>model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})</td>
        <td>model = AutoModelForCausalLM.from_pretrained(model_path, load_in_low_bit="nf4", optimize_model=False, torch_dtype=torch.float16, modules_to_not_convert=["lm_head"])</td>
    </tr>
    <tr>
        <td><b>Device Allocation</b></td>
        <td># Implicit GPU allocation in device_map for Nvidia A100</td>
        <td>model = model.to('xpu')  # Move model to the XPU (PVC 1100)</td>
    </tr>
</table>
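
The device-allocation row is the only place where the target string differs (`'xpu'` versus `'cuda'`). As a minimal sketch of how one script could serve both platforms, the hypothetical helper `pick_device` below (not part of the original notebook) probes for an Intel GPU first, then an Nvidia GPU, and falls back to the CPU. It assumes that importing `intel_extension_for_pytorch` is what makes `torch.xpu` usable, as in the imports used later in this notebook.

```python
def pick_device():
    """Return 'xpu', 'cuda', or 'cpu' depending on what this machine exposes."""
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch installed, so there is nothing to accelerate
    try:
        # Importing IPEX registers the Intel GPU ("xpu") backend with PyTorch.
        import intel_extension_for_pytorch  # noqa: F401
    except (ImportError, OSError):
        pass  # IPEX missing or unusable: fall through to the CUDA/CPU checks
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return "xpu"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


device = pick_device()
print(f"selected device: {device}")
```

With such a helper, `model.to(device)` could replace the hard-coded `model.to('xpu')`. The model-loading calls would still differ between the two stacks.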


**Install the required packages:**

In [None]:
import sys
!{sys.executable} -m pip install -q --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu 
!{sys.executable} -m pip install -q datasets transformers==4.34.0 peft==0.5.0 accelerate==0.23.0

**Add the path of locally installed Python packages to `sys.path`:**

In [None]:
import sys
import site
from pathlib import Path
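

def get_python_version():
    """Return the interpreter directory name, e.g. 'python3.9'."""
    return "python" + ".".join(map(str, sys.version_info[:2]))


def set_local_bin_path():
    """Put the user-level install locations at the front of sys.path."""
    local_bin = str(Path.home() / ".local" / "bin")
    local_site_packages = str(
        Path.home() / ".local" / "lib" / get_python_version() / "site-packages"
    )
    sys.path.append(local_bin)
    sys.path.insert(0, site.getusersitepackages())
    # Move ~/.local site-packages to the front, adding it if it is not listed yet.
    if local_site_packages in sys.path:
        sys.path.remove(local_site_packages)
    sys.path.insert(0, local_site_packages)


set_local_bin_path()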
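
The cell above boils down to one operation: making sure a given directory is the first entry on the import search path. As a minimal sketch of that idea, the hypothetical helper `prioritize_path` (an illustrative name, not from the notebook) moves an entry to the front of a list and drops duplicates, so it is safe to run more than once.

```python
import site
import sys


def prioritize_path(paths, entry):
    """Return a new list with `entry` first and any other copies of it removed."""
    return [entry] + [p for p in paths if p != entry]


# Apply it to the real import path using the user site-packages directory.
user_site = site.getusersitepackages()
sys.path[:] = prioritize_path(sys.path, user_site)
print(sys.path[0] == user_site)  # True
```

Packages installed with `pip install --user` then take precedence over system-wide copies of the same package.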

**Required Imports**

In [None]:
import logging
import os
import warnings

os.environ["NUMEXPR_MAX_THREADS"] = "64"
warnings.filterwarnings(
    "ignore",
    category=FutureWarning,
    message="This implementation of AdamW is deprecated",
)

import torch
import intel_extension_for_pytorch as ipex  # required for PyTorch to identify Intel GPUs; not needed for Nvidia, where the features are upstream
import transformers

from transformers import LlamaTokenizer
from peft import LoraConfig
from bigdl.llm.transformers.qlora import get_peft_model, prepare_model_for_kbit_training  # for Nvidia, these come from peft
from bigdl.llm.transformers import AutoModelForCausalLM

# for nvidia: from transformers import AutoModelForCausalLM
# for nvidia: from peft import get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset

**Model and dataset:**

In [None]:
model_cache = "/home/common/data/Big_Data/GenAI/llm_models"
base_model = "NousResearch--Nous-Hermes-Llama-2-7b"
model_path = os.path.join(model_cache, base_model)
dataset = "Abirate/english_quotes"

tokenizer = LlamaTokenizer.from_pretrained(model_path)
tokenizer.pad_token_id = 0
tokenizer.padding_side = "left"

if torch.xpu.is_available():
    print(f"using device: {torch.xpu.get_device_name()}")
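
The tokenizer cell sets `pad_token_id = 0` and `padding_side = "left"`. To make concrete what left padding does to a batch, here is a small pure-Python sketch. The function `left_pad` and the token ids are made up for illustration and are not the tokenizer's actual implementation.

```python
def left_pad(sequences, pad_id=0):
    """Pad lists of token ids on the left to a common length and build attention masks."""
    width = max(len(seq) for seq in sequences)
    input_ids = [[pad_id] * (width - len(seq)) + list(seq) for seq in sequences]
    attention_mask = [[0] * (width - len(seq)) + [1] * len(seq) for seq in sequences]
    return input_ids, attention_mask


# Made-up token ids, not real Llama tokens.
ids, mask = left_pad([[5, 6, 7], [8]])
print(ids)   # [[5, 6, 7], [0, 0, 8]]
print(mask)  # [[1, 1, 1], [0, 0, 1]]
```

With left padding, the last position of every row holds a real token, which is the position a causal LM continues from during generation.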

**Finetuning code:**

In [None]:
dataset = load_dataset(dataset)
tokenized_dataset = dataset.map(lambda data: tokenizer(data["quote"]), batched=True)

# Load model using BigDL in 4 bits, use bitsandbytes config for Nvidia
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_low_bit="nf4",
    optimize_model=False,
    torch_dtype=torch.float16,
    modules_to_not_convert=["lm_head"],
)

model = model.to('xpu')  # move model to the XPU (PVC 1100 here); for Nvidia use 'cuda'
model = prepare_model_for_kbit_training(model)

# Configure LoRA, same as Nvidia
lora_config = LoraConfig(
    r=16, 
    lora_alpha=64, 
    target_modules=["q_proj", "k_proj", "v_proj"], 
    lora_dropout=0.05, 
    bias="none", 
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)


# Configure trainer, same as Nvidia
training_args = transformers.TrainingArguments(
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    warmup_steps=20,
    max_steps=300,
    learning_rate=2e-5,
    save_steps=30,
    bf16=True,
    logging_steps=20,
    output_dir="outputs",
    optim="adamw_hf",
)
trainer = transformers.Trainer(
    model=model,
    train_dataset=tokenized_dataset["train"],
    args=training_args,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False

# Train model, same as Nvidia
training_result = trainer.train()
print(training_result)
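
To get a feel for how small the trained adapter is, the LoRA settings above (`r=16`, `lora_alpha=64`, three target modules) can be turned into a parameter count. This is a back-of-the-envelope sketch: the helper `lora_trainable_params` is hypothetical, and the Llama-2-7B shapes (hidden size 4096, 32 decoder layers, square 4096 x 4096 `q_proj`/`k_proj`/`v_proj`) are assumptions that do not appear in this notebook.

```python
def lora_trainable_params(r, d_in, d_out, n_target_modules, n_layers):
    """Count LoRA parameters: each adapted layer adds A (r x d_in) and B (d_out x r)."""
    per_module = r * (d_in + d_out)
    return per_module * n_target_modules * n_layers


# Assumed Llama-2-7B shapes (see the lead-in): 4096-wide projections, 32 layers.
n_params = lora_trainable_params(r=16, d_in=4096, d_out=4096, n_target_modules=3, n_layers=32)
scaling = 64 / 16  # lora_alpha / r, the factor applied to the adapter output

print(n_params)  # 12582912
print(scaling)   # 4.0
```

Under these assumptions only about 12.6 million parameters receive gradients, a small fraction of the roughly 7 billion frozen 4-bit base weights.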


## Inference

Let's see how well the model completes an English quote.

**Using base model:**

In [None]:
from bigdl.llm.transformers.qlora import PeftModel

batch = tokenizer("A room without books ", return_tensors="pt")

# load model in 4 bit
base_model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_4bit=True,
    optimize_model=True,
    use_cache=True,
    torch_dtype=torch.bfloat16,
    modules_to_not_convert=["lm_head"],
)
base_output = base_model.generate(**batch, max_new_tokens=20)
print("base model output\n", tokenizer.decode(base_output[0], skip_special_tokens=True))
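
Loading in 4 bits matters mainly because of weight memory. The rough estimate below compares 16-bit and 4-bit storage. The helper `weight_memory_gib` is hypothetical, the 7e9 parameter count is an approximation for a "7B" model, and the figure ignores quantization scales, the unconverted `lm_head`, activations, and the KV cache.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


n_params = 7e9  # rough parameter count for a "7B" model (assumption)
for name, bits in [("fp16", 16), ("4-bit", 4)]:
    print(f"{name}: {weight_memory_gib(n_params, bits):.1f} GiB")
# fp16: 13.0 GiB
# 4-bit: 3.3 GiB
```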

**Using Finetuned model:**

In [None]:
finetuned_model = PeftModel.from_pretrained(base_model, "./outputs/checkpoint-30/")
print("finetuned model output\n", tokenizer.decode(finetuned_model.generate(**batch, max_new_tokens=20)[0], skip_special_tokens=True))
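
The cell above loads `checkpoint-30`, the earliest checkpoint written during training. With `save_steps=30` and `max_steps=300`, later checkpoints should also exist. The sketch below lists the expected folders, assuming the Hugging Face `Trainer` default of step-based saving with `checkpoint-<step>` directory names and no `save_total_limit`. The helper `checkpoint_dirs` is hypothetical.

```python
def checkpoint_dirs(output_dir, save_steps, max_steps):
    """List the checkpoint folders expected from step-based saving."""
    steps = range(save_steps, max_steps + 1, save_steps)
    return [f"{output_dir}/checkpoint-{step}" for step in steps]


dirs = checkpoint_dirs("outputs", save_steps=30, max_steps=300)
print(len(dirs))  # 10
print(dirs[0])    # outputs/checkpoint-30
print(dirs[-1])   # outputs/checkpoint-300
```

Pointing `PeftModel.from_pretrained` at a later folder, such as the last one in this list, would load an adapter that has seen more training steps.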

That's it! 🎉 Now you know how to port a QLoRA workload from Nvidia to Intel GPUs and finetune an LLM on Intel Arc GPUs or on Intel Data Center GPUs on [Intel Developer Cloud](https://cloud.intel.com)! 🚀 This notebook is a deliberately minimal example. For a practical use case, check out the **LLM Finetuning notebook** under **"GenAI Essentials"** on Intel Developer Cloud. 📚👨‍💻