---
**Simple LLM Inference: Playing with Language Models on Intel Max Series GPUs**

Hello and welcome! Are you curious about how computers understand and generate human-like text? Do you want to play around with text generation without getting too technical? Then you've come to the right place.

Large Language Models (LLMs) have a wide range of applications, but they can also be fun to experiment with. Here, we'll use some simple pre-trained models to explore text generation interactively.

Powered by Intel's Max Series GPUs, this notebook provides a hands-on experience that doesn't require deep technical knowledge. Whether you're a student, writer, educator, or just curious about AI, this guide is designed for you.

Ready to try it out? Let's set up our environment and start exploring the world of text generation with LLMs!


In [1]:
import os
import random
import re

os.environ["SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS"] = "1"
os.environ["ENABLE_SDP_FUSION"] = "1"
import warnings

# Suppress warnings for a cleaner output
warnings.filterwarnings("ignore")

import torch
import intel_extension_for_pytorch as ipex

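from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import LlamaTokenizer, LlamaForCausalLM

# Widgets used by ChatBotModel.interact; imported here so the class cell runs top to bottom
from IPython.display import display
from ipywidgets import Button, HBox, Layout, Output, Text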

# Seed the random number generators so runs are repeatable
seed = 88
random.seed(seed)
torch.manual_seed(seed)
if torch.xpu.is_available():
    torch.xpu.manual_seed(seed)
    torch.xpu.manual_seed_all(seed)

---
**A Glimpse Into Text Generation with Language Models**

If you're intrigued by how machines can generate human-like text, let's take a closer look at the underlying code. Even if you're not technically inclined, this section will provide a high-level understanding of how it all works:

- **Class Definition**: The `ChatBotModel` class is the core of our text generation. It handles the setup, optimization, and interaction with the LLM (Large Language Model).

- **Initialization**: When you create an instance of this class, you can specify the model's ID or path and the data type. The device is chosen automatically: Intel's "xpu" device if available, otherwise the CPU. There's also an option to optimize the model for Intel GPUs using Intel Extension for PyTorch (IPEX).

- **Input Preparation**: The `prepare_input` method ensures that the input doesn't exceed the maximum length and combines the previous text with the user input, if required.

- **Output Generation**: The `gen_output` method takes the prepared input and several parameters controlling the generation process, like temperature, top_p, top_k, etc., and produces the text response.

- **Warm-up**: Before the main interactions, the `warmup_model` method runs one sample generation so that one-time startup costs are paid up front and your first real prompt gets a faster response.

- **Text Processing**: Several methods like `unique_sentences`, `remove_repetitions`, and `extract_bot_response` handle the text processing to ensure the generated text is readable and free from repetitions or unnecessary echoes.

Feel free to explore the code and play around with different parameters. Remember, this is a simple and interactive way to experiment with text generation. It's not a cutting-edge chatbot, but rather a playful tool to engage with language models. Enjoy the journey into the world of LLMs, using Intel Max Series GPUs!
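
The input-preparation step described above can be tried on plain Python lists before looking at the class. This is a minimal sketch, assuming the same limits the class uses (a maximum length of 256 tokens with 100 reserved for the reply); `truncate_left` is a hypothetical helper written for this illustration, and the list of integers stands in for real token IDs.

```python
def truncate_left(token_ids, max_length=256, response_buffer=100):
    """Keep only the most recent tokens, leaving room for the reply."""
    budget = max_length - response_buffer  # tokens available for the prompt
    if len(token_ids) > budget:
        return token_ids[-budget:]  # drop the oldest tokens first
    return token_ids


history = list(range(300))  # stand-in for 300 token IDs of chat history
kept = truncate_left(history)
print(len(kept), kept[0], kept[-1])  # 156 144 299
```

Dropping tokens from the start means the model always sees the latest part of the conversation, at the cost of forgetting the earliest turns.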


In [28]:
class ChatBotModel:
    """
    ChatBotModel is a class for generating responses based on text prompts using a pretrained model.

    Attributes:
    - device: The device to run the model on. Default is "xpu" if available, otherwise "cpu".
    - model: The loaded model for text generation.
    - tokenizer: The loaded tokenizer for the model.
    - torch_dtype: The data type to use in the model.
    """

    def __init__(
        self,
        model_id_or_path: str = "openlm-research/open_llama_3b_v2",  # "Writer/camel-5b-hf",
        torch_dtype: torch.dtype = torch.bfloat16,
        optimize: bool = True,
    ) -> None:
        """
        The initializer for ChatBotModel class.

        Parameters:
        - model_id_or_path: The identifier or path of the pretrained model.
        - torch_dtype: The data type to use in the model. Default is torch.bfloat16.
        - optimize: If True, IPEX is used to optimize the model.
        """
        self.torch_dtype = torch_dtype
        # Use the default Intel GPU if present; write "xpu:<index>" to pick a specific one
        self.device = "xpu" if torch.xpu.is_available() else "cpu"
        if self.device.startswith("xpu"):
            self.autocast = torch.xpu.amp.autocast
        else:
            self.autocast = torch.cpu.amp.autocast

        if "llama" in model_id_or_path:
            self.tokenizer = LlamaTokenizer.from_pretrained(model_id_or_path)
            self.model = (
                LlamaForCausalLM.from_pretrained(
                    model_id_or_path,
                    low_cpu_mem_usage=True,
                    torch_dtype=self.torch_dtype,
                )
                .to(self.device)
                .eval()
            )
        else:
            self.tokenizer = AutoTokenizer.from_pretrained(
                model_id_or_path, trust_remote_code=True
            )
            self.model = (
                AutoModelForCausalLM.from_pretrained(
                    model_id_or_path,
                    low_cpu_mem_usage=True,
                    trust_remote_code=True,
                    torch_dtype=self.torch_dtype,
                )
                .to(self.device)
                .eval()
            )
        self.max_length = 256
        print(f"Using max length: {self.max_length}")

        if optimize:
            if hasattr(ipex, "optimize_transformers"):
                try:
                    ipex.optimize_transformers(self.model, dtype=self.torch_dtype)
                except Exception:  # fall back to the generic optimizer
                    ipex.optimize(self.model, dtype=self.torch_dtype)
            else:
                ipex.optimize(self.model, dtype=self.torch_dtype)

    def prepare_input(self, previous_text, user_input):
        """Prepare the input for the model, ensuring it doesn't exceed the maximum length."""
        response_buffer = 100
        combined_text = previous_text + "\nUser: " + user_input + "\nBot: "
        input_ids = self.tokenizer.encode(
            combined_text, return_tensors="pt", truncation=False
        )
        adjusted_max_length = self.max_length - response_buffer
        if input_ids.shape[1] > adjusted_max_length:
            input_ids = input_ids[:, -adjusted_max_length:]
        return input_ids.to(device=self.device)

    def gen_output(
        self, input_ids, temperature, top_p, top_k, num_beams, repetition_penalty
    ):
        """
        Generate the output text based on the given input IDs and generation parameters.

        Args:
            input_ids (torch.Tensor): The input tensor containing token IDs.
            temperature (float): The temperature for controlling randomness in Boltzmann distribution.
                                Higher values increase randomness, lower values make the generation more deterministic.
            top_p (float): The cumulative probability threshold for nucleus sampling.
                           Only the smallest set of tokens whose combined probability reaches top_p is kept.
            top_k (int): The number of highest probability vocabulary tokens to keep for top-k-filtering.
            num_beams (int): The number of beams for beam search. Controls the breadth of the search.
            repetition_penalty (float): The penalty applied for repeating tokens.

        Returns:
            torch.Tensor: The generated output tensor.
        """
        with self.autocast(
            enabled=True if self.torch_dtype != torch.float32 else False,
            dtype=self.torch_dtype,
        ):
            with torch.no_grad():
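                output = self.model.generate(
                    input_ids,
                    pad_token_id=self.tokenizer.eos_token_id,
                    max_length=self.max_length,
                    # Sampling must be on for temperature/top_p/top_k to take effect;
                    # a temperature of 0 falls back to deterministic beam search.
                    do_sample=temperature > 0,
                    temperature=temperature,
                    top_p=top_p,
                    top_k=top_k,
                    num_beams=num_beams,
                    repetition_penalty=repetition_penalty,
                )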
                return output

    def warmup_model(
        self, temperature, top_p, top_k, num_beams, repetition_penalty
    ) -> None:
        """
        Warms up the model by generating a sample response.
        """
        sample_prompt = """A dialog, where User interacts with a helpful Bot.
        Bot is helpful, kind, obedient, honest, and knows its own limits.
        User: Hello, Bot.
        Bot: Hello! How can I assist you today?
        """
        input_ids = self.tokenizer(sample_prompt, return_tensors="pt").input_ids.to(
            device=self.device
        )
        _ = self.gen_output(
            input_ids,
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
            num_beams=num_beams,
            repetition_penalty=repetition_penalty,
        )

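    def unique_sentences(self, text: str) -> str:
        """Drop duplicate sentences (in order) and a trailing unfinished sentence."""
        sentences = text.split(". ")
        if sentences[-1] and not sentences[-1].endswith("."):
            sentences = sentences[:-1]  # last fragment was cut off mid-sentence
        # dict.fromkeys removes duplicates but, unlike set(), keeps the original order
        sentences = list(dict.fromkeys(s.rstrip(".") for s in sentences if s))
        return ". ".join(sentences) + "." if sentences else ""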

    def remove_repetitions(self, text: str, user_input: str) -> str:
        """
        Remove repetitive sentences or phrases from the generated text and avoid echoing user's input.

        Args:
            text (str): The input text with potential repetitions.
            user_input (str): The user's original input to check against echoing.

        Returns:
            str: The processed text with repetitions and echoes removed.
        """
        text = re.sub(re.escape(user_input), "", text, count=1).strip()
        text = self.unique_sentences(text)
        return text

    def extract_bot_response(self, generated_text: str) -> str:
        """
        Extract the first response starting with "Bot:" from the generated text.

        Args:
            generated_text (str): The full generated text from the model.

        Returns:
            str: The extracted response starting with "Bot:".
        """
        prefix = "Bot:"
        # Flatten newlines so the reply reads as one line of sentences.
        generated_text = generated_text.replace("\n", ". ")
        bot_response_start = generated_text.find(prefix)
        if bot_response_start != -1:
            # No newlines remain after the replacement above, so everything
            # after the first "Bot:" marker is returned.
            return generated_text[bot_response_start + len(prefix):].strip()
        # No marker found: strip leading punctuation and whitespace.
        return re.sub(r"^[^a-zA-Z0-9]+", "", generated_text)

    def interact(
        self,
        out: Output,  # Output widget to display the conversation
        with_context: bool = True,
        temperature: float = 0.10,
        top_p: float = 0.95,
        top_k: int = 40,
        num_beams: int = 3,
        repetition_penalty: float = 1.80,
    ) -> None:
        """
        Handle the chat loop where the user provides input and receives a model-generated response.

        Args:
            out (Output): The ipywidgets Output widget in which the conversation is displayed.
            with_context (bool): Whether to consider previous interactions in the session. Default is True.
            temperature (float): The temperature for controlling randomness in Boltzmann distribution.
                                 Higher values increase randomness, lower values make the generation more deterministic.
            top_p (float): The cumulative probability threshold for nucleus sampling.
                           Only the smallest set of tokens whose combined probability reaches top_p is kept.
            top_k (int): The number of highest probability vocabulary tokens to keep for top-k-filtering.
            num_beams (int): The number of beams for beam search. Controls the breadth of the search.
            repetition_penalty (float): The penalty applied for repeating tokens.
        """
        previous_text = ""
    
        def display_user_input_widgets():
            user_input_widget = Text(placeholder="Type your message here...", layout=Layout(width='80%'))
            send_button = Button(description="Send", layout=Layout(width='10%'))
            display(HBox([user_input_widget, send_button]))
            def on_send(button):
                nonlocal previous_text
                user_input = user_input_widget.value
                if user_input.lower() == "exit":
                    return
                if with_context:
                    input_ids = self.prepare_input(previous_text, user_input)
                else:
                    input_ids = self.tokenizer.encode(user_input, return_tensors="pt").to(self.device)
    
                output_ids = self.gen_output(
                    input_ids,
                    temperature=temperature,
                    top_p=top_p,
                    top_k=top_k,
                    num_beams=num_beams,
                    repetition_penalty=repetition_penalty,
                )
                generated_text = self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
                generated_text = self.extract_bot_response(generated_text)
                generated_text = self.remove_repetitions(generated_text, user_input)
    
                with out:
                    print(f"You: {user_input}")
                    print(f"Bot: {generated_text}")
    
                if with_context:
                    previous_text += "\nUser: " + user_input + "\nBot: " + generated_text
                user_input_widget.value = "" 
                display_user_input_widgets()
            send_button.on_click(on_send)
        display_user_input_widgets()
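
---
The sentence clean-up idea behind the text-processing methods can be tried on its own, without loading a model. The sketch below is a standalone illustration, not the class's exact method: `dedupe_sentences` is a hypothetical helper, and the sample reply is made up. It assumes sentences are separated by a period followed by a space.

```python
def dedupe_sentences(text: str) -> str:
    """Remove repeated sentences while keeping first occurrences in order."""
    parts = [s.strip().rstrip(".") for s in text.split(". ")]
    # dict.fromkeys keeps insertion order, so the reply stays readable.
    unique = list(dict.fromkeys(p for p in parts if p))
    return ". ".join(unique) + "." if unique else ""


reply = "I like tea. I like tea. Coffee is fine too. I like tea."
print(dedupe_sentences(reply))  # I like tea. Coffee is fine too.
```

Keeping the original order matters here: a plain `set` would also remove duplicates, but it could shuffle the sentences.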

---
**Setting Up the Interactive Text Generation Interface**

In the next section, we'll create an interactive text generation interface right here in this notebook. This will enable you to select a model, provide a prompt, and tweak various parameters without touching the code itself.

- **Model Selection**: Choose one of the pre-trained models in the dropdown. To try another model from the Hugging Face Hub, add its ID to the `models` list in the code.
- **Interaction Mode**: Decide whether to interact with or without context, allowing the model to remember previous interactions or treat each input independently.
- **Temperature**: Adjust this parameter to control the randomness in text generation. Higher values increase creativity; lower values make the generation more deterministic.
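- **Top_p, Top_k**: These limit which tokens the model may pick next. Top_k keeps only the k most likely tokens; top_p keeps the smallest set of tokens whose combined probability reaches p. Lower values give more predictable text; higher values give more variety.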
- **Number of Beams**: Control the breadth of the search in text generation.
- **Repetition Penalty**: Modify this to prevent or allow repeated phrases and sentences.

Once you've set your preferences, you can start the interaction. Clicking the start button again reloads the model with your new settings, so you can try different combinations. Let's set this up and explore the playful world of text generation using Intel Max Series GPUs!
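
To make the temperature, top_k, and top_p settings more concrete, here is a toy sketch of how they reshape the model's choice of the next token. It is a simplified illustration under stated assumptions, not the filtering code the Transformers library actually runs: `next_token_probs` is a hypothetical function, and the four logits are made-up scores for four imaginary tokens.

```python
import math


def next_token_probs(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return the filtered, renormalised probabilities keyed by token index."""
    # Temperature (must be > 0): values below 1 sharpen the distribution,
    # values above 1 flatten it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank tokens from most to least likely; top_k keeps only the first k.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    mass = sum(probs[i] for i in order)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break

    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]
print(next_token_probs(logits))                   # all four tokens remain
print(next_token_probs(logits, temperature=0.1))  # token 0 takes nearly all the mass
print(next_token_probs(logits, top_k=2))          # only tokens 0 and 1 remain
print(next_token_probs(logits, top_p=0.9))        # the least likely token is dropped
```

Sampling then draws the next token from whatever distribution is left, which is why tighter settings produce more repeatable text.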



In [22]:
from ipywidgets import VBox, HBox, Button, Dropdown, IntSlider, FloatSlider, Text, Output, Label, Layout
import ipywidgets as widgets

def main():
    models = ["Writer/camel-5b-hf", "openlm-research/open_llama_3b_v2",]
    interaction_modes = ["Interact with context", "Interact without context"]
    model_dropdown = Dropdown(options=models, value=models[0], description="Model:")
    interaction_mode = Dropdown(options=interaction_modes, value=interaction_modes[0], description="Interaction:")
    temperature_slider = FloatSlider(value=0.10, min=0, max=1, step=0.01, description="Temperature:")
    top_p_slider = FloatSlider(value=0.95, min=0, max=1, step=0.01, description="Top P:")
    top_k_slider = IntSlider(value=40, min=0, max=100, step=1, description="Top K:")
    num_beams_slider = IntSlider(value=3, min=1, max=10, step=1, description="Num Beams:")
    # 1.0 means no penalty; the penalty must be strictly positive, so 0 is not offered
    repetition_penalty_slider = FloatSlider(value=1.80, min=1.0, max=2, step=0.1, description="Rep Penalty:")
    
    out = Output()    
    left_panel = VBox([model_dropdown, interaction_mode], layout=Layout(margin="0px 20px 10px 0px"))
    right_panel = VBox([temperature_slider, top_p_slider, top_k_slider, num_beams_slider, repetition_penalty_slider],
                       layout=Layout(margin="0px 0px 10px 20px"))
    user_input_widgets = HBox([left_panel, right_panel], layout=Layout(margin="0px 50px 10px 0px"))
    start_button = Button(description="Start Interaction!")
    start_button.layout.margin = '0 auto'
    display(user_input_widgets)
    display(start_button)
    display(out)
    
    def on_start(button):
        out.clear_output()
        model_choice = model_dropdown.value
        with_context = interaction_mode.value == interaction_modes[0]
        temperature = temperature_slider.value
        top_p = top_p_slider.value
        top_k = top_k_slider.value
        num_beams = num_beams_slider.value
        repetition_penalty = repetition_penalty_slider.value
        

        bot = ChatBotModel(model_id_or_path=model_choice)
        bot.warmup_model(
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
            num_beams=num_beams,
            repetition_penalty=repetition_penalty,
        )
        

        with out:
            print("\nNote: This is a demonstration using pretrained models which were not fine-tuned for chat.\n")      
        try:
            with out:
                bot.interact(
                    with_context=with_context,
                    out=out,
                    temperature=temperature,
                    top_p=top_p,
                    top_k=top_k,
                    num_beams=num_beams,
                    repetition_penalty=repetition_penalty,
                )
        except Exception as e:
            with out:
                print(f"An error occurred: {e}")

    start_button.on_click(on_start)


---
**Let's Dive In and Have Some Fun with LLMs!**

Ready for a playful interaction with some interesting LLM models? The interface below lets you choose from different models and settings. Just select your preferences, click the "Start Interaction!" button, and you're ready to chat.

You can ask questions, make statements, or simply explore how the model responds to different inputs. It's a friendly way to get acquainted with AI and see what it has to say.

Remember, this is all in good fun, and the models are here to engage with you. So go ahead, start a conversation, and enjoy the interaction!

In [23]:
main()
