# Small Language Model as a Comedy Dialogue Generator
* Are you curious about how a small language model (under 1B parameters) can act as a funny dialogue generator based on a persona you provide?
* The training data is synthetic: it was generated with an LLM served through [vLLM](https://github.com/vllm-project/vllm). You can check how to do that [here](https://github.com/rskasturi/usecases/tree/master/synthetic-data-generation).
* Powered by Intel® Data Center GPU Max 1100s, this notebook provides a hands-on experience that doesn't require deep technical knowledge. Whether you're a student, writer, educator, or just curious about AI, this guide is designed for you.

## Overview
In this notebook, you will learn how to fine-tune a small language model (Qwen2.5-0.5B-Instruct) using Intel Max Series GPUs (XPUs) for a specific task: generating persona-driven comedy dialogue. The notebook covers the following key points:

1. Setting up the environment and optimizing it for Intel GPUs
2. Initializing the XPU and configuring LoRA settings for efficient fine-tuning
3. Loading the pre-trained Qwen model and testing its performance
4. Preparing a synthetic dataset of personas paired with comedy dialogue
5. Fine-tuning the model using the SFTTrainer class from the trl library
6. Merging the LoRA weights and saving the fine-tuned model for future use
7. Evaluating the fine-tuned model on a set of test prompts

The notebook demonstrates how fine-tuning can adapt a small model to a specific task, here generating comedy dialogue that fits a given persona. You will gain insights into the process of creating a task-specific model, from preparing the data to merging and testing the tuned weights.

### Step 1: Setting Up the Environment 🛠️
First things first, let's get our environment ready! We'll import all the necessary packages, including the Hugging Face transformers library, datasets for easy data loading, wandb for experiment tracking, and a few others. 

We'll now make sure to optimize our environment for the Intel GPU by setting the appropriate environment variables and configuring the number of cores and threads. This will ensure we get the best performance out of our hardware! ⚡

In [None]:
import os
import warnings

import psutil

warnings.filterwarnings("ignore")

num_physical_cores = psutil.cpu_count(logical=False)
# Assumes a dual-socket machine; adjust the divisor for other topologies
num_cores_per_socket = num_physical_cores // 2

# Disable tokenizer parallelism
os.environ["TOKENIZERS_PARALLELISM"] = "0"
#HF_TOKEN = os.environ["HF_TOKEN"]

# Set the LD_PRELOAD environment variable
ld_preload = os.environ.get("LD_PRELOAD", "")
conda_prefix = os.environ.get("CONDA_PREFIX", "")
# Improve memory allocation performance, if tcmalloc is not available, please comment these lines out
tcmalloc_path = f"{conda_prefix}/lib/libtcmalloc.so"
# Join with ":" only when LD_PRELOAD already has a value, to avoid a leading empty entry
os.environ["LD_PRELOAD"] = f"{ld_preload}:{tcmalloc_path}" if ld_preload else tcmalloc_path
# Reduce the overhead of submitting commands to the GPU
os.environ["SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS"] = "1"
# reducing memory accesses by fusing SDP ops
os.environ["ENABLE_SDP_FUSION"] = "1"
# set openMP threads to number of physical cores
os.environ["OMP_NUM_THREADS"] = str(num_physical_cores)
# Set the thread affinity policy
os.environ["OMP_PROC_BIND"] = "close"
# Set the places for thread pinning
os.environ["OMP_PLACES"] = "cores"

print(f"Number of physical cores: {num_physical_cores}")
print(f"Number of cores per socket: {num_cores_per_socket}")
print(f"OpenMP environment variables:")
print(f"  - OMP_NUM_THREADS: {os.environ['OMP_NUM_THREADS']}")
print(f"  - OMP_PROC_BIND: {os.environ['OMP_PROC_BIND']}")
print(f"  - OMP_PLACES: {os.environ['OMP_PLACES']}")
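
The `LD_PRELOAD` line above assumes that `libtcmalloc.so` exists under the conda prefix. As a minimal sketch, the check could be made automatic with a small helper. The name `extend_ld_preload` is hypothetical and not part of any library; it only appends the library when the file is actually present and skips duplicates.

```python
import os

def extend_ld_preload(current, library_path):
    """Return an LD_PRELOAD value with library_path appended, if the file exists."""
    if not os.path.isfile(library_path):
        return current  # leave the value unchanged when the library is missing
    # Drop empty entries so the result never starts or ends with ":"
    entries = [entry for entry in current.split(":") if entry]
    if library_path not in entries:
        entries.append(library_path)
    return ":".join(entries)

conda_prefix = os.environ.get("CONDA_PREFIX", "")
tcmalloc_path = os.path.join(conda_prefix, "lib", "libtcmalloc.so")
new_value = extend_ld_preload(os.environ.get("LD_PRELOAD", ""), tcmalloc_path)
print(f"LD_PRELOAD would be set to: {new_value!r}")
```

Note that the dynamic loader reads `LD_PRELOAD` when a process starts, so a value set from inside a running kernel applies to processes launched afterwards rather than to the kernel itself.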

### Step 2: Initializing the XPU 🎮
Next, we'll initialize the Intel Max 1100 GPU, which is referred to as an XPU. We'll use the intel_extension_for_pytorch library, which integrates the XPU namespace with PyTorch.

In [None]:
import torch
import intel_extension_for_pytorch as ipex

if torch.xpu.is_available():
    torch.xpu.empty_cache()
else:
    print("XPU device not available.")

### Step 3: Configuring the LoRA Settings 🎛️
To finetune our Qwen model efficiently, we'll use the LoRA (Low-Rank Adaptation) technique.

LoRA allows us to adapt the model to our specific task by training only a small set of additional parameters. This greatly reduces the training time and memory requirements! ⏰

We'll define the LoRA configuration, specifying the rank (r) and the target modules we want to adapt. 🎯

In [None]:
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",
    # could use only the q, v and o projections and comment out the rest
    target_modules=["q_proj", "o_proj", 
                    "v_proj", "k_proj", 
                    "gate_proj", "up_proj",
                    "down_proj"],
    task_type="CAUSAL_LM")
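
To get a feel for why LoRA trains so few parameters, here is a small back-of-the-envelope sketch. It uses `r` and `lora_alpha` from the config above. The 1024 x 1024 projection size is a made-up value for illustration only, not the actual Qwen layer shape, and `lora_extra_params` is a hypothetical helper name.

```python
def lora_extra_params(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r

# Values taken from the LoraConfig above
r, lora_alpha = 32, 16
scaling = lora_alpha / r  # factor applied to the low-rank update B @ A

# Hypothetical square projection, chosen only for illustration
d_in = d_out = 1024
full = d_in * d_out
extra = lora_extra_params(d_in, d_out, r)

print(f"scaling = {scaling}")
print(f"full matrix: {full:,} params, LoRA adapter: {extra:,} params ({extra / full:.2%})")
```

With these illustrative numbers the adapter holds about 6% of the parameters of the frozen matrix it sits next to, and the update is scaled by 0.5 before being added.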

### Step 4: Loading the Qwen Model 🤖
Now, let's load the Qwen model using the Hugging Face AutoModelForCausalLM class. We'll also load the corresponding tokenizer to preprocess our input data. The model will be moved to the XPU for efficient training. 💪

[Model Card](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct)

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

USE_CPU = False
device = "xpu:0" if torch.xpu.is_available() else "cpu"
if USE_CPU:
    device = "cpu"
print(f"using device: {device}")

model_id = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
# Set padding side to the right to ensure proper attention masking during fine-tuning
tokenizer.padding_side = "right"
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
# Disable caching mechanism to reduce memory usage during fine-tuning
model.config.use_cache = False
# Tensor-parallel degree used by Llama-style configs; 1 means standard linear layers (likely a no-op for Qwen2)
model.config.pretraining_tp = 1 

### Step 5: Testing the Model 🧪
Before we start finetuning, let's test the Qwen model on a few sample inputs to see how it performs out-of-the-box. We'll generate responses based on the questions in the test_inputs list below. 🌿

In [None]:
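def generate_response(model, prompt):
    # Tokenize and move both input_ids and attention_mask to the target device
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    # Greedy decoding: with do_sample=False, sampling options such as top_k and temperature have no effect
    outputs = model.generate(**inputs, max_new_tokens=1250,
                             do_sample=False,
                             eos_token_id=tokenizer.eos_token_id,
                             pad_token_id=tokenizer.pad_token_id)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)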

def test_model(model, test_inputs):
    """quickly test the model using queries."""
    for input_text in test_inputs:
        print("__"*25)
        generated_response = generate_response(model, input_text)
        print(f"{input_text}")
        print(f"Generated Answer: {generated_response}\n")
        print("__"*25)

test_inputs = [
    "Assume you are an english teacher, can you frame standup comedies using your knowledge, skills, experience, or insights?",
    "Assume you are a new media reporter from CNN, can you frame standup comedies using your knowledge, skills, experience, or insights?",
    "Assume you are a software engineer well-versed in C/C++ but new to Fortran, can you frame standup comedies using your knowledge, skills, experience, or insights?",
    "Assume you are An investigative reporter who wants to uncover the truth behind the spy's past, can you frame standup comedies using your knowledge, skills, experience, or insights?",
]

print("Testing the model before fine-tuning:")
test_model(model, test_inputs)
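
All of the test prompts share one sentence pattern, so they could also be built from a list of personas. Python silently joins adjacent string literals when a comma is missing between them, so generating the list tends to be less error-prone than typing each prompt by hand. This is only a sketch: `PROMPT_TEMPLATE`, `build_test_inputs`, and `generated_inputs` are names introduced here for illustration.

```python
PROMPT_TEMPLATE = (
    "Assume you are {persona}, can you frame standup comedies "
    "using your knowledge, skills, experience, or insights?"
)

personas = [
    "an english teacher",
    "a new media reporter from CNN",
    "a software engineer well-versed in C/C++ but new to Fortran",
    "An investigative reporter who wants to uncover the truth behind the spy's past",
]

def build_test_inputs(persona_list):
    """Fill the shared template once per persona."""
    return [PROMPT_TEMPLATE.format(persona=persona) for persona in persona_list]

generated_inputs = build_test_inputs(personas)
print(f"Built {len(generated_inputs)} prompts")
print(generated_inputs[0])
```

The resulting list can be passed to `test_model` in place of a hand-written one.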

### Step 6: Preparing the Dataset 📊
For finetuning our model, we'll be using a synthetic dataset. It contains a diverse range of personas, and each persona (the input text) is paired with a comedy dialogue written for that persona.

We'll then split the data into training and validation sets using the train_test_split method of the Hugging Face datasets library. This will help us assess the model's performance during the finetuning process. 📊

In [None]:
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("json", data_files="comedy_synthesis_15000.jsonl")["train"]
print(dataset[0])
print(f"Persona: {dataset[0]['input persona']}")
print(f"Synthetic data Generated: {dataset[0]['synthesized text']}")

# Function to format prompts
def format_prompts(batch):
    formatted_prompts = []
    for instruction, response in zip(batch["input persona"], batch["synthesized text"]):
        # Build the training text: persona, fixed instruction, then the target response
        prompt = (f"Instruction:\n{instruction}\n"
                  "Assume you are the persona described above and I want you to act as a stand-up comedian. "
                  "Write content that reflects your unique voice, expertise, and humor, tailored to your specific field.\n"
                  f"Response:\n{response}")
        formatted_prompts.append(prompt)
    return {"text": formatted_prompts}

# Apply the function to the dataset
dataset = dataset.map(format_prompts, batched=True)

# Split the dataset into training and validation sets
split_dataset = dataset.train_test_split(test_size=0.2, seed=99)
train_dataset = split_dataset["train"]
validation_dataset = split_dataset["test"]
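
Before training, it can help to confirm that every row of the JSONL file carries both fields used above, `input persona` and `synthesized text`. The following is a minimal sketch using only the standard library. The helper name `find_bad_rows` and the sample rows are made up for illustration; in practice you would pass the lines of the real file.

```python
import json

REQUIRED_KEYS = ("input persona", "synthesized text")

def find_bad_rows(lines):
    """Return (line_number, reason) for rows missing a required non-empty string field."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "invalid JSON"))
            continue
        for key in REQUIRED_KEYS:
            value = row.get(key) if isinstance(row, dict) else None
            if not isinstance(value, str) or not value.strip():
                problems.append((number, f"missing or empty '{key}'"))
    return problems

# Made-up rows standing in for the real file
sample_lines = [
    '{"input persona": "a chemistry teacher", "synthesized text": "placeholder dialogue"}',
    '{"input persona": "a pilot"}',
    "not json",
]
problems = find_bad_rows(sample_lines)
print(problems)
```

For the real dataset, something like `find_bad_rows(open("comedy_synthesis_15000.jsonl"))` would scan the file line by line.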


### Step 7: Finetuning the Model 🏋️‍♂️
It's time to finetune our Qwen model! We'll use the SFTTrainer class from the trl library, which is designed for supervised fine-tuning of language models. We'll specify the training arguments, such as batch size, learning rate, and evaluation strategy. 📈

Supervised fine-tuning (SFT) is a technique for adapting pre-trained language models to specific tasks. By training the model on the 15k-example dataset of persona-driven comedy dialogue, we can guide it toward responses that fit the persona it is given. SFT allows the model to learn the patterns that link each persona to its style of humor. 🎓

By training on the dialogue generated for each persona, we use the information in the Comedy Synthesis dataset to improve the model's ability to produce contextually appropriate comedy. The goal is output that is coherent and relevant to the persona in the prompt. 💡

In [None]:
import transformers
import warnings
from transformers import logging as transformers_logging
warnings.filterwarnings("ignore")
transformers_logging.set_verbosity_error()
 
from trl import SFTTrainer
 
os.environ["IPEX_TILE_AS_DEVICE"] = "1"
 
finetuned_model_id = "qwen-0.5B-comedy"
 
# Calculate max_steps based on the subset size
num_train_samples = len(train_dataset)
batch_size = 2
gradient_accumulation_steps = 8
steps_per_epoch = num_train_samples // (batch_size * gradient_accumulation_steps)
num_epochs = 2
max_steps = steps_per_epoch * num_epochs
print(f"Finetuning for max number of steps: {max_steps}")
 
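def print_training_summary(results):
    print(f"Time: {results.metrics['train_runtime']: .2f}")
    print(f"Samples/second: {results.metrics['train_samples_per_second']: .2f}")
    # Report peak device memory when running on an XPU
    if torch.xpu.is_available():
        peak_gb = torch.xpu.max_memory_allocated() / 1024**3
        print(f"Peak XPU memory allocated: {peak_gb: .2f} GB")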
 
training_args = transformers.TrainingArguments(
        per_device_train_batch_size=batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        warmup_ratio=0.05,
        max_steps=max_steps,
        learning_rate=1e-5,
        evaluation_strategy="steps",
        save_steps=500,
        bf16=True,
        logging_steps=100,
        output_dir=finetuned_model_id,
        use_ipex=True,
        max_grad_norm=0.6,
        weight_decay=0.01,
        group_by_length=True
)
 
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    tokenizer=tokenizer,
    args=training_args,
    peft_config=lora_config,
    dataset_text_field="text",
    max_seq_length=512,
    packing=True
)
 
if device != "cpu":
    torch.xpu.empty_cache()
results = trainer.train()
print_training_summary(results)
 
# save lora model
tuned_lora_model = "qwen-0.5B-comedy-lora"
trainer.model.save_pretrained(tuned_lora_model)
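
The `max_steps` value printed above follows from simple arithmetic, which this short sketch reproduces. It assumes the file really holds 15,000 rows, as its name suggests, and that the 20% split leaves 12,000 rows for training. `planned_steps` is a hypothetical helper name.

```python
def planned_steps(num_train_samples, batch_size, grad_accum_steps, num_epochs):
    """Mirror the step arithmetic used in the training cell."""
    effective_batch = batch_size * grad_accum_steps  # samples per optimizer step
    steps_per_epoch = num_train_samples // effective_batch
    return effective_batch, steps_per_epoch, steps_per_epoch * num_epochs

num_rows = 15000                      # assumption based on the dataset file name
num_test = -(-num_rows * 20 // 100)   # ceiling of 20% held out for validation
num_train = num_rows - num_test

effective_batch, steps_per_epoch, max_steps = planned_steps(
    num_train, batch_size=2, grad_accum_steps=8, num_epochs=2
)
print(f"train rows: {num_train}, effective batch: {effective_batch}")
print(f"steps per epoch: {steps_per_epoch}, max_steps: {max_steps}")
```

One caveat: with `packing=True` the trainer works on packed 512-token sequences rather than raw rows, so a step count derived from the row count may not line up exactly with two passes over the data.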

### Step 8: Save the Finetuned Model 💾
After finetuning, let's put our model to the test! But before we do that, we need to merge the LoRA weights with the base model. This step is crucial because the LoRA weights contain the learned adaptations from the finetuning process. By merging the LoRA weights, we effectively incorporate the knowledge gained during finetuning into the base model. 🧠💡

To merge the LoRA weights, we'll use the merge_and_unload() function provided by the PEFT library. This function seamlessly combines the LoRA weights with the corresponding weights of the base model, creating a single unified model that includes the finetuned knowledge. 🎛️🔧

Once the LoRA weights are merged, we'll save the finetuned model to preserve its state. This way, we can easily load and use the finetuned model for future tasks without having to repeat the finetuning process. ✨

In [None]:
from peft import PeftModel

tuned_model = "qwen-0.5B-comedy"
tuned_lora_model = "qwen-0.5B-comedy-lora"

base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.bfloat16,
)

model = PeftModel.from_pretrained(base_model, tuned_lora_model)
model = model.merge_and_unload()
# save final tuned model
model.save_pretrained(tuned_model)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
#model2 = ipex.optimize_transformers(model)  # optimize the model using `ipex`
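
To see what merging does conceptually, here is a toy sketch in plain Python. It is not the PEFT implementation, only the underlying idea: the low-rank update `B @ A`, scaled by `lora_alpha / r`, is folded into the frozen weight so that a single matrix gives the same output as base plus adapter. The 2x2 weight, the rank-1 adapter, and the helper names are all made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def add_scaled(w, delta, scale):
    """Return w + scale * delta, element by element."""
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

W = [[1.0, 2.0], [3.0, 4.0]]   # frozen base weight (toy values)
B = [[1.0], [2.0]]             # adapter matrix, shape d_out x r
A = [[0.5, -1.0]]              # adapter matrix, shape r x d_in
scale = 0.5                    # plays the role of lora_alpha / r (16 / 32 in this notebook)
x = [[2.0], [3.0]]             # input column vector

# Unmerged: base path plus scaled adapter path
unmerged_out = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scale)

# Merged: fold the adapter into the weight once, then use a single matmul
W_merged = add_scaled(W, matmul(B, A), scale)
merged_out = matmul(W_merged, x)

print(unmerged_out, merged_out)
```

Both paths produce the same vector, which is why the merged model can be saved and used without the adapter attached.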

### Step 9: Evaluating the Finetuned Model 🎉
Now, let's generate responses to the same questions we asked earlier using the finetuned model. We'll compare the output with the pre-finetuned model to see how much it has changed. Get ready to be amazed by the power of finetuning! 🤩💫

By merging the LoRA weights and saving the finetuned model, we ensure that our model is ready to tackle tasks with its newly acquired knowledge. So, let's put it to the test and see how it performs! 🚀🌟

In [None]:
test_inputs = [
   "Assume you are an english teacher, can you frame standup comedies using your knowledge, skills, experience, or insights?",
    "Assume you are a new media reporter from CNN, can you frame standup comedies using your knowledge, skills, experience, or insights?",
    "Assume you are a software engineer well-versed in C/C++ but new to Fortran, can you frame standup comedies using your knowledge, skills, experience, or insights?",
    "Assume you are An investigative reporter who wants to uncover the truth behind the spy's past, can you frame standup comedies using your knowledge, skills, experience, or insights?"]
# Fall back to the CPU when no XPU is available
device = "xpu:0" if torch.xpu.is_available() else "cpu"

model = model.to(device)
for text in test_inputs:
    print("__"*25)
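    # Tokenize and move both input_ids and attention_mask to the target device
    inputs = tokenizer(text, return_tensors="pt").to(device)
    # Greedy decoding: with do_sample=False, top_k and temperature have no effect
    outputs = model.generate(**inputs, max_new_tokens=1250,
                             do_sample=False,
                             eos_token_id=tokenizer.eos_token_id,
                             pad_token_id=tokenizer.pad_token_id)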
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
    print("__"*25)
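
The test prompts above are phrased differently from the `Instruction:` / `Response:` text the model was trained on. As a sketch, you could also try prompts that follow the training format and stop right where the response should begin. Whether this improves the output is an assumption worth checking on your own runs. `build_inference_prompt` and `strip_prompt` are hypothetical helper names.

```python
INSTRUCTION_SUFFIX = (
    "Assume you are the persona described above and I want you to act as a stand-up comedian. "
    "Write content that reflects your unique voice, expertise, and humor, tailored to your specific field.\n"
)

def build_inference_prompt(persona):
    """Same layout as the training text, ending where the response would start."""
    return f"Instruction:\n{persona}\n{INSTRUCTION_SUFFIX}Response:\n"

def strip_prompt(decoded_text, prompt):
    """Drop the echoed prompt from decoded output when it is present verbatim."""
    if decoded_text.startswith(prompt):
        return decoded_text[len(prompt):]
    return decoded_text

prompt = build_inference_prompt("an english teacher")
print(prompt)
```

The decoded output of `model.generate` includes the prompt tokens, so `strip_prompt(decoded, prompt)` can be used to keep only the newly generated dialogue.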

### Disclaimer for Using Large Language Models
Please be aware that while language models like Qwen2.5-0.5B-Instruct are powerful tools for text generation, they may sometimes produce results that are unexpected, biased, or inconsistent with the given prompt. It's advisable to carefully review the generated text and consider the context and application in which you are using these models.

Usage of these models must also adhere to the licensing agreements and be in accordance with ethical guidelines and best practices for AI. If you have any concerns or encounter issues with the models, please refer to the respective model cards and documentation provided in the links above.

To the extent that any public or non-Intel datasets or models are referenced by or accessed using these materials those datasets or models are provided by the third party indicated as the content source. Intel does not create the content and does not warrant its accuracy or quality. By accessing the public content, or using materials trained on or with such content, you agree to the terms associated with that content and that your use complies with the applicable license.

Intel expressly disclaims the accuracy, adequacy, or completeness of any such public content, and is not liable for any errors, omissions, or defects in the content, or for any reliance on the content. Intel is not liable for any liability or damages relating to your use of public content.

Intel’s provision of these resources does not expand or otherwise alter Intel’s applicable published warranties or warranty disclaimers for Intel products or solutions, and no additional obligations, indemnifications, or liabilities arise from Intel providing such resources. Intel reserves the right, without notice, to make corrections, enhancements, improvements, and other changes to its materials.