---
**Optimize Code Generation with LLMs on Intel® Data Center Max Series GPUs**

Hello and welcome! Are you curious about how computers can solve difficult programming tasks and help to streamline the code development process? Do you want to play around with code generation without getting too technical? Then, you've come to the right place.

Large language models (LLMs) have a wide range of applications, but they can also be fun to experiment with. Here, we'll use some simple pre-trained models from the [Code Llama](https://huggingface.co/codellama) family of LLMs to explore code generation interactively.

Powered by Intel® Data Center GPU Max 1100s, this notebook provides a hands-on experience that doesn't require deep technical knowledge. Whether you are a seasoned developer or learning to code in a new language, this guide is designed for you. 

Ready to try it out? Let's set up our environment and start exploring the world of code generation with LLMs!

Before beginning this tutorial, please ensure you have reviewed the Llama License Agreement at: https://ai.meta.com/resources/models-and-libraries/llama-downloads/.

In [1]:
# Required packages
import sys
!echo "Installation in progress, please wait..."
!{sys.executable} -m pip install accelerate==0.23.0 --no-deps --no-warn-script-location > /dev/null
!{sys.executable} -m pip install transformers==4.34.0 --no-warn-script-location > /dev/null
!echo "Installation completed."

In [2]:
import logging
import os
import random
import re

os.environ["SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS"] = "1"
os.environ["ENABLE_SDP_FUSION"] = "1"
import warnings

# Suppress warnings for a cleaner output
warnings.filterwarnings("ignore")

import torch
import intel_extension_for_pytorch as ipex

from transformers import AutoModelForCausalLM, AutoTokenizer

from ipywidgets import VBox, HBox, Button, Dropdown, IntSlider, FloatSlider, Text, Output, Label, Layout, HTML
from IPython.display import display


# Seed the random number generators for reproducibility (only when an XPU is available)
if torch.xpu.is_available():
    seed = 88
    random.seed(seed)
    torch.xpu.manual_seed(seed)
    torch.xpu.manual_seed_all(seed)

def select_device(preferred_device=None):
    """
    Selects the best available XPU device or the preferred device if specified.

    Args:
        preferred_device (str, optional): Preferred device string (e.g., "cpu", "xpu", "xpu:0", "xpu:1", etc.). If None, a random available XPU device will be selected or CPU if no XPU devices are available.

    Returns:
        torch.device: The selected device object.
    """
    try:
        if preferred_device and preferred_device.startswith("cpu"):
            print("Using CPU.")
            return torch.device("cpu")
        if preferred_device and preferred_device.startswith("xpu"):
            if preferred_device == "xpu" or (
                ":" in preferred_device
                and int(preferred_device.split(":")[1]) >= torch.xpu.device_count()
            ):
                preferred_device = (
                    None  # Handle as if no preferred device was specified
                )
            else:
                device = torch.device(preferred_device)
                if device.type == "xpu" and device.index < torch.xpu.device_count():
                    vram_used = torch.xpu.memory_allocated(device) / (
                        1024**2
                    )  # In MB
                    print(
                        f"Using preferred device: {device}, VRAM used: {vram_used:.2f} MB"
                    )
                    return device

        if torch.xpu.is_available():
            device_id = random.choice(
                range(torch.xpu.device_count())
            )  # Select a random available XPU device
            device = torch.device(f"xpu:{device_id}")
            vram_used = torch.xpu.memory_allocated(device) / (1024**2)  # In MB
            print(f"Selected device: {device}, VRAM used: {vram_used:.2f} MB")
            return device
    except Exception as e:
        print(f"An error occurred while selecting the device: {e}")
    print("No XPU devices available or preferred device not found. Using CPU.")
    return torch.device("cpu")
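
The selection rules in `select_device` can be illustrated without any GPU present. The sketch below is a simplified, hypothetical helper (`resolve_device_string` is not part of the notebook) that mirrors the same decisions on plain strings. It takes the device count as an argument instead of querying `torch.xpu`, and it picks index 0 where the notebook picks a random device, so the result is reproducible.

```python
def resolve_device_string(preferred, xpu_count):
    """Hypothetical stand-in for select_device that works on plain strings.

    preferred: None, "cpu", "xpu", or "xpu:<index>".
    xpu_count: number of XPU devices assumed to be present.
    """
    if preferred and preferred.startswith("cpu"):
        return "cpu"
    if preferred and preferred.startswith("xpu:"):
        index_text = preferred.split(":", 1)[1]
        # Honor an explicit index only if it names an existing device
        if index_text.isdigit() and int(index_text) < xpu_count:
            return f"xpu:{int(index_text)}"
    # Bare "xpu", no preference, or an invalid index: use any XPU, else CPU
    if xpu_count > 0:
        return "xpu:0"  # the notebook picks a random index here
    return "cpu"


print(resolve_device_string("xpu:1", xpu_count=2))
print(resolve_device_string("xpu:5", xpu_count=2))
print(resolve_device_string("xpu", xpu_count=0))
```

An out-of-range index such as `xpu:5` is treated like no preference at all, which matches the fallback behavior of the notebook function.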

---
**A Glimpse Into Code Generation with Language Models**

If you're intrigued by how machines can perform a variety of code synthesis and understanding tasks, let's take a closer look at the underlying code. Even if you're not technically inclined, this section will provide a high-level understanding of how it all works:

- **Class Definition**: The `CodeChatBot` class is the core of our code-generative AI model. It handles the setup, optimization, and interaction with the LLM.

- **Initialization**: When you create an instance of this class, you can specify the model's ID or path and the data type (`torch.bfloat16` by default). The device is chosen automatically by `select_device`, which prefers an Intel XPU and falls back to the CPU. There's also an option to optimize the model for Intel GPUs using Intel® Extension for PyTorch* (IPEX).

- **Input Preparation**: The `prepare_input` method ensures that the input doesn't exceed the maximum length and combines the previous text with the user input, if required.

- **Output Generation**: The `gen_output` method takes the prepared input and several parameters controlling the generation process, like temperature, top_p, top_k, etc., and produces the code.

- **Warm-up**: Before the main interactions, the `warmup_model` method helps in "warming up" the model to make subsequent runs faster.

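- **Code Processing**: The `strip_response` method removes the echoed prompt by keeping only the text after the `### Response:` marker. The `unique_sentences` helper is also available for removing repeated sentences from generated text, although the chat loop does not call it.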

Feel free to explore the code and play around with different parameters. Remember, this is a simple and interactive way to experiment with code generation. It's not a cutting-edge chatbot, but rather a playful tool to engage with language models. Enjoy the journey into the world of LLMs, using Intel® Data Center GPU Max 1100s!
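
To make the input preparation step concrete, here is a minimal sketch of what `prepare_input` does: it wraps the request in an instruction template, appends it to the conversation history, and keeps only the most recent tokens so that room is left for the reply. The helper names `build_prompt` and `truncate_left` are hypothetical, and whitespace splitting stands in for the real tokenizer purely for illustration.

```python
def build_prompt(previous_text, user_input):
    """Wrap the request in the instruction template, then append it to the history."""
    instruction = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{user_input}\n\n### Response:"
    )
    return previous_text + "\nUser: " + instruction + "\nBot: "


def truncate_left(token_ids, max_length=256, response_buffer=100):
    """Keep only the most recent tokens; assumes max_length > response_buffer."""
    budget = max_length - response_buffer
    return token_ids[-budget:] if len(token_ids) > budget else token_ids


prompt = build_prompt("", "Write a function that reverses a string.")
# Whitespace splitting is a toy tokenizer used only for illustration
toy_tokens = prompt.split()
kept = truncate_left(toy_tokens, max_length=20, response_buffer=10)
print(len(toy_tokens), len(kept))
print(kept[-1])
```

Truncating from the left keeps the end of the prompt, which is where the latest instruction and the `Bot:` cue sit, at the cost of dropping the oldest context first.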

In [3]:
MODEL_CACHE_PATH = "/home/common/data/Big_Data/GenAI/llm_models"
class CodeChatBot:
    """
    CodeChatBot is a class for generating responses based on programming-specific prompts using a pretrained LLM.

    Attributes:
    - device: The device to run the model on. Default is "xpu" if available, otherwise "cpu".
    - model: The loaded model for code generation.
    - tokenizer: The loaded tokenizer for the model.
    - torch_dtype: The data type to use in the model.
    """

    def __init__(
        self,
        model_id_or_path: str = "codellama/CodeLlama-7b-hf",
        torch_dtype: torch.dtype = torch.bfloat16,
        optimize: bool = True,
    ) -> None:
        """
        The initializer for CodeChatBot class.

        Parameters:
        - model_id_or_path: The identifier or path of the pretrained model.
        - torch_dtype: The data type to use in the model. Default is torch.bfloat16.
        - optimize: If True, ipex is used to optimize the model
        """
        self.torch_dtype = torch_dtype
        self.device = select_device("xpu")
        self.model_id_or_path = model_id_or_path
        local_model_id = self.model_id_or_path.replace("/", "--")
        local_model_path = os.path.join(MODEL_CACHE_PATH, local_model_id)
        
        # select_device returns a torch.device, but accept a plain string as well
        device_type = (
            self.device.split(":")[0]
            if isinstance(self.device, str)
            else self.device.type
        )
        if device_type == "xpu":
            self.autocast = torch.xpu.amp.autocast
        else:
            self.autocast = torch.cpu.amp.autocast

        try:
            self.tokenizer = AutoTokenizer.from_pretrained(
                local_model_path, trust_remote_code=True, 
            )
            self.model = (
                AutoModelForCausalLM.from_pretrained(
                    local_model_path,
                    low_cpu_mem_usage=True,
                    trust_remote_code=True,
                    torch_dtype=self.torch_dtype,
                )
                .to(self.device)
                .eval()
            )
        except (OSError, ValueError, EnvironmentError) as e:
            logging.info(
                f"Tokenizer / model not found locally. Downloading tokenizer / model for {self.model_id_or_path} to cache...: {e}"
            )
            self.tokenizer = AutoTokenizer.from_pretrained(
                self.model_id_or_path, trust_remote_code=True, 
            )
            self.model = (
                AutoModelForCausalLM.from_pretrained(
                    self.model_id_or_path,
                    low_cpu_mem_usage=True,
                    trust_remote_code=True,
                    torch_dtype=self.torch_dtype,
                )
                .to(self.device)
                .eval()
            )
            
        self.max_length = 256

        if optimize:
            # ipex returns the optimized model, so keep the returned object
            if hasattr(ipex, "optimize_transformers"):
                try:
                    self.model = ipex.optimize_transformers(self.model, dtype=self.torch_dtype)
                except Exception:
                    self.model = ipex.optimize(self.model, dtype=self.torch_dtype)
            else:
                self.model = ipex.optimize(self.model, dtype=self.torch_dtype)

    def prepare_input(self, previous_text, user_input):
        """Prepare the input for the model, ensuring it doesn't exceed the maximum length."""
        response_buffer = 100
        user_input = (
            "Below is an instruction that describes a task. "
            "Write a response that appropriately completes the request.\n\n"
            f"### Instruction:\n{user_input}\n\n### Response:")
        combined_text = previous_text + "\nUser: " + user_input + "\nBot: "
        input_ids = self.tokenizer.encode(
            combined_text, return_tensors="pt", truncation=False
        )
        adjusted_max_length = self.max_length - response_buffer
        if input_ids.shape[1] > adjusted_max_length:
            input_ids = input_ids[:, -adjusted_max_length:]
        return input_ids.to(device=self.device)

    def gen_output(
        self, input_ids, temperature, top_p, top_k, num_beams, repetition_penalty
    ):
        """
        Generate the output based on the given input IDs and generation parameters.

        Args:
            input_ids (torch.Tensor): The input tensor containing token IDs.
            temperature (float): The temperature for controlling randomness during sampling.
                                Higher values increase randomness, lower values make the generation more deterministic.
            top_p (float): The cumulative probability threshold for nucleus sampling.
                           Lower values restrict sampling to fewer, more likely tokens.
            top_k (int): The number of highest probability vocabulary tokens to keep for top-k-filtering.
            num_beams (int): The number of beams for beam search. Controls the breadth of the search.
            repetition_penalty (float): The penalty applied for repeating tokens.

        Returns:
            torch.Tensor: The generated output tensor.
        """
        print(f"Using max length: {self.max_length}")
        with self.autocast(
            enabled=self.torch_dtype != torch.float32,
            dtype=self.torch_dtype,
        ):
            with torch.no_grad():
                output = self.model.generate(
                    input_ids,
                    pad_token_id=self.tokenizer.eos_token_id,
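                    max_length=self.max_length,
                    # Sampling must be enabled for temperature, top_p, and top_k to take effect
                    do_sample=temperature > 0,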
                    temperature=temperature,
                    top_p=top_p,
                    top_k=top_k,
                    num_beams=num_beams,
                    repetition_penalty=repetition_penalty,
                )
                return output

    def warmup_model(
        self, temperature, top_p, top_k, num_beams, repetition_penalty
    ) -> None:
        """
        Warms up the model by generating a sample response.
        """
        sample_prompt = """A dialog, where User interacts with a helpful Bot.
        Bot is helpful, kind, obedient, honest, and knows its own limits.
        User: Hello, Bot.
        Bot: Hello! How can I assist you today?
        """
        input_ids = self.tokenizer(sample_prompt, return_tensors="pt").input_ids.to(
            device=self.device
        )
        _ = self.gen_output(
            input_ids,
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
            num_beams=num_beams,
            repetition_penalty=repetition_penalty,
        )

    def strip_response(self, generated_code):
        """Return the text after the first '### Response:' marker, or the input unchanged if the marker is absent."""
        match = re.search(r'### Response:(.*)', generated_code, re.S)
        if match:
            return match.group(1).strip()
        return generated_code
        
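    def unique_sentences(self, text: str) -> str:
        """Drop an unfinished trailing sentence and remove repeated sentences, preserving order."""
        sentences = text.split(". ")
        if sentences[-1] and not sentences[-1].endswith("."):
            sentences = sentences[:-1]  # The last sentence is incomplete; discard it
        elif sentences[-1]:
            sentences[-1] = sentences[-1][:-1]  # Avoid a doubled period after re-joining
        # dict.fromkeys deduplicates while keeping first-seen order (a set would shuffle it)
        sentences = [s for s in dict.fromkeys(sentences) if s]
        return ". ".join(sentences) + "." if sentences else ""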

    def interact(
        self,
        out: Output,  # Output widget to display the conversation
        with_context: bool = True,
        temperature: float = 0.10,
        top_p: float = 0.95,
        top_k: int = 20,
        num_beams: int = 3,
        repetition_penalty: float = 1.80,
    ) -> None:
        """
        Handle the chat loop where the user provides input and receives a model-generated response.

        Args:
            out (Output): The output widget in which the conversation is displayed.
            with_context (bool): Whether to consider previous interactions in the session. Default is True.
            temperature (float): The temperature for controlling randomness during sampling.
                                 Higher values increase randomness, lower values make the generation more deterministic.
            top_p (float): The cumulative probability threshold for nucleus sampling.
                           Lower values restrict sampling to fewer, more likely tokens.
            top_k (int): The number of highest probability vocabulary tokens to keep for top-k-filtering.
            num_beams (int): The number of beams for beam search. Controls the breadth of the search.
            repetition_penalty (float): The penalty applied for repeating tokens.
        """
        previous_text = ""
    
        def display_user_input_widgets():
            default_color = "\033[0m"
            user_color, user_icon = "\033[94m", "😀 "
            bot_color, bot_icon = "\033[92m", "🤖 "
            user_input_widget = Text(placeholder="Type your message here...", layout=Layout(width='80%'))
            send_button = Button(description="Send", button_style = "primary", layout=Layout(width='10%'))
            chat_spin = HTML(value = "")
            spin_style = """
            <div class="loader"></div>
            <style>
            .loader {
              border: 5px solid #f3f3f3;
              border-radius: 50%;
              border-top: 5px solid #3498db;
              width: 8px;
              height: 8px;
              animation: spin 3s linear infinite;
            }
            @keyframes spin {
              0% { transform: rotate(0deg); }
              100% { transform: rotate(360deg); }
            }
            </style>
            """
            display(HBox([chat_spin, user_input_widget, send_button, ]))
            
            def on_send(button):
                nonlocal previous_text
                send_button.button_style = "warning"
                chat_spin.value = spin_style
                user_input = user_input_widget.value
                with out:
                    print(f" {user_color}{user_icon}You: {user_input}{default_color}")
                if user_input.lower() == "exit":
                    # Reset the button and spinner so the UI does not appear stuck
                    send_button.button_style = "primary"
                    chat_spin.value = ""
                    return
                if with_context:
                    self.max_length = 256
                    input_ids = self.prepare_input(previous_text, user_input)
                else:
                    self.max_length = 96
                    input_ids = self.tokenizer.encode(user_input, return_tensors="pt").to(self.device)

                output_ids = self.gen_output(
                    input_ids,
                    temperature=temperature,
                    top_p=top_p,
                    top_k=top_k,
                    num_beams=num_beams,
                    repetition_penalty=repetition_penalty,
                )
                generated_code = self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
                generated_code = self.strip_response(generated_code)
                send_button.button_style = "success"
                chat_spin.value = ""

                with out:
                    print(f" {bot_color}{bot_icon}Bot: {generated_code}{default_color}")
                if with_context:
                    previous_text += "\nUser: " + user_input + "\nBot: " + generated_code
                user_input_widget.value = "" 
                display_user_input_widgets()
            send_button.on_click(on_send)
        display_user_input_widgets()
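
One detail of the chat loop is easy to miss: in Hugging Face `generate`, `max_length` bounds the whole sequence, meaning prompt tokens plus generated tokens. The sketch below works through that arithmetic. The limits 256 and 96 come from `interact`; the helper name `new_token_budget` and the prompt sizes are made up for illustration.

```python
def new_token_budget(prompt_token_count, max_length):
    """Tokens left for generation when max_length covers prompt plus output."""
    return max(max_length - prompt_token_count, 0)


# (mode, max_length used by interact, illustrative prompt size in tokens)
scenarios = [
    ("with context", 256, 156),
    ("without context", 96, 40),
    ("without context, long prompt", 96, 120),
]
for mode, max_length, prompt_tokens in scenarios:
    budget = new_token_budget(prompt_tokens, max_length)
    print(f"{mode}: {budget} new tokens available")
```

With context, `prepare_input` trims the prompt to 156 tokens, so about 100 tokens always remain for the reply. Without context the prompt is not trimmed, so a prompt longer than 96 tokens leaves no room for new tokens.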

---
**Setting Up the Interactive Code Generation Interface**

In the next section, we'll create an interactive code generation interface right here in this notebook. This will enable you to select a model, provide a prompt, and adjust various parameters without touching the code itself.

- **Model Selection**: Choose from the following pre-trained Code Llama models:
    - `Code Llama`: The foundational model for general-purpose programming tasks and code completion.
    - `Code Llama - Python`: This model is specialized for Python code generation.
    - `Code Llama - Instruct`: This model is specialized to understand and follow natural language instructions. 
- **Interaction Mode**: Decide whether to interact with or without context, allowing the model to remember previous interactions or treat each input independently.
- **Temperature**: Adjust this parameter to control the randomness in code generation. Higher values increase creativity; lower values make the generation more deterministic.
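- **Top_p**, **Top_k**: Narrow the set of candidate tokens at each step. Top_k keeps only the k most likely tokens, while Top_p keeps the smallest set of likely tokens whose combined probability reaches the threshold. Lower values favor safer, more predictable code; higher values allow more variety.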
- **Number of Beams**: Control the breadth of the search in code generation.
- **Repetition Penalty**: Modify this to prevent or allow repeated phrases and sentences.
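
To give a feel for how temperature, Top_k, and Top_p shape the choice of the next token, here is a small textbook-style sketch on a made-up four-token vocabulary. It is not the code that the Transformers library runs internally, and library implementations differ in detail; the function names and the toy scores are assumptions for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose total probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a four-token vocabulary
for t in (0.2, 1.0):
    print(t, [round(x, 3) for x in softmax(toy_logits, temperature=t)])
probs = softmax(toy_logits)
print(sorted(top_k_filter(probs, k=2)))
print(sorted(top_p_filter(probs, p=0.9)))
```

At a low temperature nearly all of the probability lands on the top-scoring token, which is why low settings feel deterministic. The two filters then decide how many of the remaining candidates stay eligible for sampling.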

Once you've set your preferences, you can start the interaction and even reset or reload the model to try different settings. Let's set this up and explore the playful world of code generation using Intel® Data Center GPU Max 1100s!
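
The repetition penalty is perhaps the least intuitive of these settings, so here is a minimal sketch of the rule it is commonly understood to follow: the score of every token already present in the sequence is divided by the penalty if positive and multiplied by it if negative. The function name and the toy scores are assumptions for illustration, not the library's own code.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Lower the scores of tokens that already appear in the sequence.

    Positive scores are divided by the penalty and negative scores are
    multiplied by it, so a penalty above 1 always pushes a seen token downward.
    """
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        adjusted[token_id] = score * penalty if score < 0 else score / penalty
    return adjusted


toy_logits = [3.6, -1.0, 0.5, 2.0]  # made-up scores for a four-token vocabulary
already_generated = [0, 1, 0]       # tokens 0 and 1 were produced earlier
print(apply_repetition_penalty(toy_logits, already_generated, penalty=1.8))
```

A penalty of exactly 1 leaves the scores unchanged, and values below 1 have the opposite effect of encouraging repetition. A value as high as 1.8 is fairly aggressive for code, where identifiers and keywords legitimately repeat, so it is worth experimenting with lower settings.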

In [4]:
# Loaded bots keyed by (model ID, device), so each model is loaded only once
model_cache = {}

def code_generation_with_llm():
    models = ["Code Llama", "Code Llama - Python", "Code Llama - Instruct"]
    interaction_modes = ["Interact with context", "Interact without context"]
    model_dropdown = Dropdown(options=models, value=models[0], description="Model:")
    interaction_mode = Dropdown(options=interaction_modes, value=interaction_modes[1], description="Interaction:")
    temperature_slider = FloatSlider(value=0.71, min=0, max=1, step=0.01, description="Temperature:")
    top_p_slider = FloatSlider(value=0.95, min=0, max=1, step=0.01, description="Top P:")
    top_k_slider = IntSlider(value=40, min=0, max=100, step=1, description="Top K:")
    num_beams_slider = IntSlider(value=3, min=1, max=10, step=1, description="Num Beams:")
    # The repetition penalty must be strictly positive, so the slider starts above zero
    repetition_penalty_slider = FloatSlider(value=1.80, min=0.1, max=2, step=0.1, description="Rep Penalty:")
    
    out = Output()    
    left_panel = VBox([model_dropdown, interaction_mode], layout=Layout(margin="0px 20px 10px 0px"))
    right_panel = VBox([temperature_slider, top_p_slider, top_k_slider, num_beams_slider, repetition_penalty_slider],
                       layout=Layout(margin="0px 0px 10px 20px"))
    user_input_widgets = HBox([left_panel, right_panel], layout=Layout(margin="0px 50px 10px 0px"))
    spinner = HTML(value="")
    start_button = Button(description="Start Interaction!", button_style="primary")
    start_button_spinner = HBox([start_button, spinner])
    start_button_spinner.layout.margin = '0 auto'
    display(user_input_widgets)
    display(start_button_spinner)
    display(out)
    
    def on_start(button):
        start_button.button_style = "warning"
        start_button.description = "Loading..."
        spinner.value = """
        <div class="loader"></div>
        <style>
        .loader {
          border: 5px solid #f3f3f3;
          border-radius: 50%;
          border-top: 5px solid #3498db;
          width: 16px;
          height: 16px;
          animation: spin 3s linear infinite;
        }
        @keyframes spin {
          0% { transform: rotate(0deg); }
          100% { transform: rotate(360deg); }
        }
        </style>
        """
        out.clear_output()
        with out:
            print("\nSetting up the model, please wait...")
            
        if model_dropdown.value == "Code Llama - Python":
            model_choice = "codellama/CodeLlama-7b-Python-hf"
        elif model_dropdown.value == "Code Llama - Instruct":
            model_choice = "codellama/CodeLlama-7b-Instruct-hf"
        else:
            model_choice = "codellama/CodeLlama-7b-hf"
        with_context = interaction_mode.value == interaction_modes[0]
        temperature = temperature_slider.value
        top_p = top_p_slider.value
        top_k = top_k_slider.value
        num_beams = num_beams_slider.value
        repetition_penalty = repetition_penalty_slider.value
        model_key = (model_choice, "xpu")
        if model_key not in model_cache:
            model_cache[model_key] = CodeChatBot(model_id_or_path=model_choice)
        bot = model_cache[model_key]
        
        with out:
            start_button.button_style = "success"
            start_button.description = "Refresh"
            spinner.value = ""
            print("Ready!")
            print("\nNote: This is a demonstration using pretrained models which were not fine-tuned for chat.")
            print("If the bot doesn't respond, try clicking on refresh.\n")
            print("\nEnter a coding-related prompt or query below.")
        try:
            with out:
                bot.interact(
                    with_context=with_context,
                    out=out,
                    temperature=temperature,
                    top_p=top_p,
                    top_k=top_k,
                    num_beams=num_beams,
                    repetition_penalty=repetition_penalty,
                )
        except Exception as e:
            with out:
                print(f"An error occurred: {e}")

    start_button.on_click(on_start)
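
The `model_cache` dictionary is what makes the "Refresh" button cheap: a model is loaded only the first time its key is requested. The sketch below isolates that pattern. The names `fake_loader`, `get_bot`, and `stats` are hypothetical, and a string stands in for the real `CodeChatBot` so that the example runs without a GPU or a model download.

```python
model_ids = {
    "Code Llama": "codellama/CodeLlama-7b-hf",
    "Code Llama - Python": "codellama/CodeLlama-7b-Python-hf",
    "Code Llama - Instruct": "codellama/CodeLlama-7b-Instruct-hf",
}

stats = {"loads": 0}
cache = {}


def fake_loader(model_id):
    """Stand-in for CodeChatBot(model_id_or_path=...); counts how often it runs."""
    stats["loads"] += 1
    return f"<bot for {model_id}>"


def get_bot(display_name, device="xpu"):
    """Return a cached bot, loading it only the first time a key is requested."""
    key = (model_ids[display_name], device)
    if key not in cache:
        cache[key] = fake_loader(key[0])
    return cache[key]


get_bot("Code Llama")
get_bot("Code Llama")  # served from the cache, no second load
get_bot("Code Llama - Python")
print(stats["loads"])
```

Keep in mind that every cached model stays resident in device memory, so switching between all three 7B models in one session trades memory for faster reloads.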

---
**Let's Dive In and Have Some Fun with Code-Generative LLMs!**

Ready for a playful interaction with some interesting LLMs? The interface below lets you choose from different models and settings. Just select your preferences, click the "Start Interaction!" button, and you're ready to chat.

You can ask questions, make statements, or simply explore how the model responds to different inputs. It's a friendly way to get acquainted with AI and see what it has to say.

Remember, this is all in good fun, and the models are here to engage with you. So go ahead, start a conversation, and enjoy the interaction!

In [5]:
code_generation_with_llm()


## Language Models Disclaimer and Information

### Model cards:
- **Code Llama:** [CodeLlama-7b-hf](https://huggingface.co/codellama/CodeLlama-7b-hf)
- **Code Llama - Python:** [CodeLlama-7b-Python-hf](https://huggingface.co/codellama/CodeLlama-7b-Python-hf)
- **Code Llama - Instruct:** [CodeLlama-7b-Instruct-hf](https://huggingface.co/codellama/CodeLlama-7b-Instruct-hf)

### Code Llama License:
A custom commercial license is available at: https://ai.meta.com/resources/models-and-libraries/llama-downloads/

### Additional Resources:
- More information on Code Llama can be found in the paper [Code Llama: Open Foundation Models for Code](https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/) or its [arXiv](https://arxiv.org/abs/2308.12950) page.
- Review the Llama Responsible Use Guide available at: https://ai.meta.com/llama/responsible-use-guide/.

## Notices and Disclaimers

Please be aware that while LLMs like Code Llama are powerful tools for code generation, they may sometimes produce results that are unexpected, biased, or inconsistent with the given prompt. It is advisable to carefully review the generated output and consider the context and application in which you are using these models. Usage of these models must also adhere to their licensing agreements and be in accordance with ethical guidelines and best practices for AI. 

To the extent that any public or non-Intel datasets or models are referenced by or accessed using these materials, those datasets or models are provided by the third party indicated as the content source. Intel does not create the content and does not warrant its accuracy or quality. By accessing the public content, or using materials trained on or with such content, you agree to the terms associated with that content and that your use complies with the applicable license.

Intel expressly disclaims the accuracy, adequacy, or completeness of any such public content, and is not liable for any errors, omissions, or defects in the content, or for any reliance on the content. Intel is not liable for any liability or damages relating to your use of public content.

Intel’s provision of these resources does not expand or otherwise alter Intel’s applicable published warranties or warranty disclaimers for Intel products or solutions, and no additional obligations, indemnifications, or liabilities arise from Intel providing such resources. Intel reserves the right, without notice, to make corrections, enhancements, improvements, and other changes to its materials.