SPDX-License-Identifier: Apache-2.0
Copyright (c) 2023, Rajashekar Kasturi <rajashekarx.kasturi@intel.com>, Thasneem Vazim <thasneemx.vazim@intel.com>

# Generate Synthetic Data using vLLM on Intel® Max Series GPUs 🚀

## 📒 Overview

This notebook shows how to create synthetic data with vLLM for **comedy dialogue generation** 😅. It covers:
1. Setting up the vLLM environment on Intel® GPUs
2. Running sample inference with vLLM
3. Approach: synthetic data generation
4. Choosing a template to generate synthetic data
5. Generating synthetic data for comedy dialogue generation

## Step 1: Setting up environment 🛠️

First things first, let's get our environment ready! We'll install all the necessary packages, including the Intel® Extension for PyTorch and datasets for easy data loading. 📦

* Clone the vLLM repository.
* Set up the required packages, along with the Intel® Extension for PyTorch, to run on Intel® GPUs.
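
The setup cell runs each install command with `os.system` and sends its output to `/dev/null`, so a failed step can go unnoticed. As a minimal sketch of an alternative, the hypothetical helper `run_step` below (not part of the original notebook) wraps `subprocess.run` and reports a non-zero exit code:

```python
import subprocess
import sys

def run_step(cmd, quiet=True):
    """Run one setup command and return True if it exited with code 0."""
    result = subprocess.run(
        cmd,
        stdout=subprocess.DEVNULL if quiet else None,
        stderr=subprocess.PIPE,
        text=True,
    )
    if result.returncode != 0:
        # Surface the tail of stderr instead of discarding it.
        print(f"Step failed ({result.returncode}): {' '.join(cmd)}")
        print(result.stderr[-500:])
        return False
    return True

# Example: a harmless command that only queries the interpreter version.
ok = run_step([sys.executable, "--version"])
print("Step succeeded" if ok else "Step failed")
```

The same helper could wrap each pip command from the setup cell, for example `run_step([sys.executable, "-m", "pip", "install", "setuptools_scm"])`, and stop at the first failure.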


In [None]:
import os
import sys
import subprocess
import json
from pathlib import Path
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")

os.environ['VLLM_TARGET_DEVICE']='xpu'

ROOT_DIR = Path.cwd()

def print_cwd():
    """Prints the current working directory."""
    cwd = os.getcwd()
    print(f"Current directory: {cwd}")

def clone_vllm_repo():
    """Clones the vllm.git repository with specific options."""
    print_cwd()
    os.system("git config --global advice.detachedHead false")
    os.system("git clone -b v0.6.2 --depth=1 https://github.com/vllm-project/vllm.git")
    os.system("git config --global advice.detachedHead true")

if not os.path.exists(f"{ROOT_DIR}/vllm/"):
    try:
        clone_vllm_repo()
        print("vllm.git repository cloned successfully!✅")
        print("Setting up vLLM Environment for Intel GPUs.....⌛")
    except Exception as e:
        print(f"An error occurred during setup: {e}")
        exit(1)
try:
    print("Changing to vLLM directory.....")
    # Change the current working directory to specified path
    os.chdir(f"{ROOT_DIR}/vllm")
    print_cwd()
except OSError as e:
    print(f"Error changing directory: {e}")
    exit(1)


# Installation commands (run with os.system; output is suppressed, so failures are not reported)
print("vLLM Setup Started!")
os.system(f"{sys.executable} -m pip cache purge > /dev/null 2>&1")
os.system(f"{sys.executable} -m pip install --upgrade pip > /dev/null 2>&1")
os.system(f"""{sys.executable} -m pip install torch==2.3.1+cxx11.abi torchvision==0.18.1+cxx11.abi torchaudio==2.3.1+cxx11.abi intel-extension-for-pytorch==2.3.110+xpu oneccl_bind_pt==2.3.100+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ > /dev/null 2>&1""")
os.system(f"{sys.executable} use_existing_torch.py > /dev/null 2>&1")
print("Almost there.....!⌛⌛")
os.system(f"{sys.executable} -m pip install -v -r requirements-xpu.txt > /dev/null 2>&1")
os.system(f"{sys.executable} -m pip install setuptools_scm > /dev/null 2>&1")
os.system(f"{sys.executable} setup.py install > /dev/null 2>&1")

print("\nvLLM environment setup is now ready!! ✅")

# Change back to the root directory
os.chdir(f"{ROOT_DIR}")
print("Changing back to Notebook directory...\n")
# print_cwd()

## Step 2: Run Sample Inference using vLLM ▶️

1. Here is an example of [offline_inference](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference.py) to run on Intel® GPUs.
2. Select the desired model and ```SamplingParams``` to control the LLM-generated output.
3. ```free_memory()``` frees the resources allocated on the GPU.
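
To build intuition for the `temperature` and `top_p` arguments of `SamplingParams`, here is a conceptual sketch of temperature scaling followed by nucleus (top-p) filtering in plain Python. It illustrates the general technique only. It is not vLLM's implementation, and the function name `top_p_filter` is hypothetical.

```python
import math

def top_p_filter(logits, temperature=0.8, top_p=0.95):
    """Return renormalized probabilities of the tokens kept by top-p filtering."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}

# Toy logits for a three-token vocabulary (made-up values).
print(top_p_filter([2.0, 1.0, -3.0], temperature=1.0, top_p=0.9))
```

A lower `temperature` sharpens the distribution before filtering, and a lower `top_p` discards more of the unlikely tail, so both make the output less varied.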

**Note**: If you encounter ```ImportError: cannot import name 'LLM' from 'vllm' (unknown location)```, restart the kernel so that the changes take effect (Kernel->Restart Kernel).
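
Before or after restarting, it can help to check where Python would load each package from. The sketch below uses the standard library's `importlib.util.find_spec`. The helper name `module_origin` is hypothetical, and the "namespace package" label is an assumption about what a missing origin usually means.

```python
import importlib.util

def module_origin(name):
    """Return where a top-level module would be imported from, or None if not found."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    # Namespace packages have no single file, so fall back to a label.
    return spec.origin or "namespace package (no __init__.py)"

for name in ("torch", "intel_extension_for_pytorch", "vllm"):
    origin = module_origin(name)
    print(f"{name}: {origin if origin else 'NOT FOUND'}")
```

If `vllm` resolves to a directory without an `__init__.py` rather than to an installed package, that may indicate the kernel has not yet picked up the installed copy.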

In [None]:
import torch
import intel_extension_for_pytorch as ipex #to include XPU namespace
from vllm import LLM, SamplingParams
import gc


# Clear cache of the XPU
if torch.xpu.is_available():
    torch.xpu.empty_cache()

def free_memory(llm_model):
    """Free the GPU memory held by a vLLM model.

    llm_model: the LLM object to release.
    """
    # Delete the llm object and free the memory
    llm = llm_model
    del llm.llm_engine.model_executor
    del llm
    gc.collect()
    torch.xpu.empty_cache()
    # print("Successfully deleted the llm pipeline and free the GPU memory.")


# Sample prompts.
prompts = [
    "What are we having for dinner?",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)

# Create an LLM.
llm = LLM(model="facebook/opt-125m")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    print("__" * 25)
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

print("__" * 25)
free_memory(llm)

## Step 3: Approach - Synthetic Data Generation 🤖
Synthetic data is, as the name suggests, artificial data generated to mimic real data. Typically, synthetic data is generated using sophisticated Generative AI techniques to create data similar in structure, features, and characteristics to the data found in real-world applications.
Some key considerations when evaluating the quality of synthetic data include the randomness of the sample, how well it captures the statistical distribution of real data, and whether it includes missing or erroneous values.

### Persona-driven Synthetic Data Creation
* This work incorporates insights from [Scaling Synthetic Data Creation with 1,000,000,000 Personas](https://arxiv.org/pdf/2406.20094).
Previous research tends to diversify the data synthesis prompt through two paradigms, instance-driven and key-point-driven, but neither can practically achieve scalable synthetic data creation.
* This notebook follows the paper's persona-driven data synthesis methodology.
Personas can be regarded as distributed carriers of world knowledge: each individual is associated with their own knowledge, experience, interests, personality, and profession.
Thus, they can tap into almost every perspective encapsulated within the LLM to create diverse synthetic data at scale.
This approach involves integrating a persona into the appropriate position in a data synthesis prompt.
Driven by the 1 billion personas in Persona Hub, this approach can easily create diverse synthetic data at a billion scale.
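
As a small illustration of placing a persona into a data synthesis prompt, the sketch below fills one task template with two different personas. The template wording and both personas are made up for this example and are not taken from Persona Hub.

```python
# Hypothetical template and personas, used only to illustrate the idea.
template = (
    "{persona}\n\n"
    "Assume you are the persona described above and write a short product review."
)

personas = [
    "A marine biologist who studies deep-sea corals",
    "A retired train conductor who collects vintage timetables",
]

# The same task yields a different prompt for each persona.
prompts = [template.format(persona=p) for p in personas]

for prompt in prompts:
    print(prompt)
    print("-" * 40)
```

Because only the persona changes, the number of distinct prompts scales with the number of personas available.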

## Step 4: Import Packages and Create Template ⌛

* Import vLLM and required packages.
* Here we use a ```comedy_template``` prompt to feed the model. In the same way, you can design your own template and format it for the model using the helper functions defined below.
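
One possible way to manage several templates is a small registry keyed by name. This is a sketch, not part of the original notebook: `TEMPLATES`, `get_template`, and the `haiku` entry are hypothetical, and the `comedy` text is shortened.

```python
# Hypothetical registry mapping a template name to its prompt text.
TEMPLATES = {
    "comedy": (
        "{persona}\n\n"
        "Assume you are the persona described above and I want you to act as a stand-up comedian."
    ),
    # Illustrative custom entry; the wording is made up for this example.
    "haiku": "{persona}\n\nAssume you are the persona described above and write a haiku about your work.",
}

def get_template(name):
    """Look up a template by name and check that it has a {persona} slot."""
    if name not in TEMPLATES:
        raise ValueError(f"Unknown template {name!r}. Available: {sorted(TEMPLATES)}")
    template = TEMPLATES[name]
    if "{persona}" not in template:
        raise ValueError(f"Template {name!r} is missing the {{persona}} placeholder.")
    return template

print(get_template("haiku").format(persona="A lighthouse keeper"))
```

Since the templates are filled with `str.format`, any literal braces in a custom template need to be doubled (`{{` and `}}`).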

In [None]:
import json
import torch
import intel_extension_for_pytorch as ipex #Include XPU namespace
from transformers import AutoTokenizer
from datasets import load_dataset
from vllm import LLM, SamplingParams

In [None]:
comedy_template = '''{persona}

Assume you are the persona described above and I want you to act as a stand-up comedian. Write content that reflects your unique voice, expertise, and humor, tailored to your specific field. 
'''

## Step 5: Generating Data 🧪

* Define ```SAMPLE_SIZE```, the number of personas to sample and therefore the number of synthetic samples to generate.
* Select the ```MODEL_PATH``` based on your hardware capacity and VRAM.
* Define a ```system_prompt``` and a ```user_prompt```, and apply the chat template to format each input.
* ```Optional```: Truncate the input data to avoid OOM.
* Finally, the generated data is written to a ```JSONL``` file.
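
The output format is one JSON object per line. The sketch below writes and reads such a file with the standard library and filters on `finish_reason`, assuming the values `stop` and `length` that vLLM reports for natural and truncated completions. The records and helper names (`write_jsonl`, `read_jsonl`) are made up for illustration.

```python
import json
import os
import tempfile

# Made-up records that mirror the fields written by the generation cell.
records = [
    {"input persona": "A chef", "finish_reason": "stop", "synthesized text": "A complete joke."},
    {"input persona": "A pilot", "finish_reason": "length", "synthesized text": "A joke cut off mid"},
]

def write_jsonl(path, rows):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

path = os.path.join(tempfile.mkdtemp(), "demo_output.jsonl")
write_jsonl(path, records)
rows = read_jsonl(path)

# Keep only generations that ended naturally rather than hitting max_tokens.
complete = [r for r in rows if r["finish_reason"] == "stop"]
print(f"{len(complete)} of {len(rows)} samples finished without truncation")
```

Filtering out truncated generations this way is one option for cleaning the dataset before further use.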

In [None]:
CHOICE_TEMPLATE="comedy"   # Only "comedy" is defined in this notebook. To try another template, define it and extend the check in main().
SAMPLE_SIZE=10  # Set sample_size=0 if you want to use the full version of 200k personas.
OUT_PATH=f"{CHOICE_TEMPLATE}_{SAMPLE_SIZE}_synthesis_output.jsonl"
MODEL_PATH="NousResearch/Hermes-3-Llama-3.1-8B" # feel free to replace it with any other open-sourced LLMs supported by vllm, Ex: "NousResearch/Nous-Hermes-llama-2-7b".

if torch.xpu.is_available():
    torch.xpu.empty_cache() # Query for XPU(Intel GPU) and empty the cache.

def request_input_format(user_prompt, tokenizer):
    """
        Format a dataset entry as an input prompt.
        {user_prompt}: Input Prompt of the dataset
        {tokenizer}: Tokenizer of the Model.
        return: Formats the user_prompt according to the chat template
    """
    system_prompt = "You are a helpful assistant."
    messages = [{"role": "system", "content": system_prompt}, {"role": "user", "content": user_prompt}]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    return text

def truncate_text(text, max_length):
    """
    This function is to get rid of OOM issues, it reduces the prompt length.
    {text}: Input prompt
    {max_length}: customize your sequence length.
    return: Updated Input prompt
    """
    return text[:max_length] if len(text) > max_length else text

def main():
    """Choosing a template, run the generation with vLLM"""
    # Load the appropriate template
    if CHOICE_TEMPLATE == "comedy":
        template = comedy_template
    else:
        raise ValueError("Invalid template type. Choose 'comedy', or define a custom template.")

    # Load the dataset
    persona_dataset = load_dataset("proj-persona/PersonaHub", data_files="persona.jsonl")['train']

    max_char_length = 1024 #Setting a max length to data input, to avoid OOM issues.
    persona_dataset = persona_dataset.map(lambda x: {'persona': truncate_text(x['persona'], max_char_length)})
    
    if SAMPLE_SIZE > 0:
        persona_dataset = persona_dataset[:SAMPLE_SIZE]
    print(f"Total number of input personas: {len(persona_dataset['persona'])}")

    # Load the model and tokenizer
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
    llm = LLM(model=MODEL_PATH) # please set tensor_parallel_size based on the GPUs you are using

    prompts = []
    max_len = 2048

    for persona in persona_dataset['persona']:
        persona = persona.strip()
        user_prompt = template.format(persona=persona)
        prompt = request_input_format(user_prompt, tokenizer)
        prompts.append(prompt)

    print(f"Loaded {len(prompts)} entries to process...\n\n")
    print(f"Sample 0: {prompts[0]}")

    sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=max_len, stop=["<|eot_id|>"])
    outputs = llm.generate(prompts, sampling_params)
    free_memory(llm)

    with open(OUT_PATH, 'w') as out:
        for i, output in enumerate(outputs):
            out_txt = output.outputs[0].text
            finish_reason = output.outputs[0].finish_reason
            data = {'prompt': output.prompt, "input persona": persona_dataset['persona'][i].strip(), "finish_reason": finish_reason}
            data['synthesized text'] = out_txt
            out.write(json.dumps(data, ensure_ascii=False) + '\n')

    print(f"Output the results to: {OUT_PATH}")

if __name__ == "__main__":
    main()

## View Generated Synthetic Dataset 👀

In [None]:
dataset = load_dataset("json", data_files=OUT_PATH)['train']
# dataset = load_dataset("json", data_files="comedy_synthesis_10.jsonl")["train"]
print(dataset)
print(f"\n\nInput Prompt: \n\n{dataset[0]['prompt']}")
print(f"Synthesized Text: \n\n{dataset[0]['synthesized text']}")

### Disclaimer from proj-persona/PersonaHub applies here

* https://huggingface.co/datasets/proj-persona/PersonaHub

PERSONA HUB can facilitate synthetic data creation at a billion-scale to simulate diverse inputs (i.e., use cases) from a wide variety of real-world users. If this data is used as input to query a target LLM to obtain its outputs at scale, there is a high risk that the LLM's knowledge, intelligence and capabilities will be dumped and easily replicated, thereby challenging the leading position of the most powerful LLMs. It is crucial to avoid misuse and ensure ethical and responsible application to prevent privacy violations and other ethical concerns.

The released data is all generated by public available models (GPT-4, Llama-3 and Qwen), and is intended for research purposes only. Users also must comply with the respective license agreements and usage policies of these models when using the synthesized data. The data may contain inaccuracies, unsafe content, or biases, for which we cannot be held responsible. Please evaluate its accuracy and suitability before use. Tencent and its licensors provide the data AS-IS, without warranty of any kind, express or implied. The views and opinions expressed in the data do not necessarily reflect those of Tencent.

### Disclaimer for Using Large Language Models

Please be aware that while Large Language Models are powerful tools for text generation, they may sometimes produce results that are unexpected, biased, or inconsistent with the given prompt. It's advisable to carefully review the generated text and consider the context and application in which you are using these models.

For detailed information on each model's capabilities, licensing, and attribution, please refer to the respective model cards:

1. **NousResearch/Hermes-3-Llama-3.1-8B**

   * Model card: https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-8B

2. **NousResearch/Nous-Hermes-llama-2-7b**

   * Model card: https://huggingface.co/NousResearch/Nous-Hermes-llama-2-7b

3. **facebook/opt-125m**

   * Model card: https://huggingface.co/facebook/opt-125m

Usage of these models must also adhere to the licensing agreements and be in accordance with ethical guidelines and best practices for AI. If you have any concerns or encounter issues with the models, please refer to the respective model cards and documentation provided in the links above. To the extent that any public or non-Intel datasets or models are referenced by or accessed using these materials those datasets or models are provided by the third party indicated as the content source. Intel does not create the content and does not warrant its accuracy or quality. By accessing the public content, or using materials trained on or with such content, you agree to the terms associated with that content and that your use complies with the applicable license.

Intel expressly disclaims the accuracy, adequacy, or completeness of any such public content, and is not liable for any errors, omissions, or defects in the content, or for any reliance on the content. Intel is not liable for any liability or damages relating to your use of public content.

Intel’s provision of these resources does not expand or otherwise alter Intel’s applicable published warranties or warranty disclaimers for Intel products or solutions, and no additional obligations, indemnifications, or liabilities arise from Intel providing such resources. Intel reserves the right, without notice, to make corrections, enhancements, improvements, and other changes to its materials.