![DLI Header](images/DLI_Header.png)

# Star Bikes AI Assistant

In this notebook you'll build an AI assistant to help customers make the best decision about getting a new bike from Star Bikes. You will also take a short dive into **token limits** for the model you are working with, and their impact on retaining conversation history.

## Learning Objectives

By the time you complete this notebook you will be able to:
- Explain **token limits** and their impact on LLM behavior.
- Build an AI assistant with (limited) conversation memory that never exceeds a set **token limit**.

## Video Walkthrough

Execute the cell below to load the video walkthrough of this notebook.

In [None]:
from IPython.display import HTML, display

video_url = "https://d36m44n9vdbmda.cloudfront.net/assets/s-fx-12-v1/v2/07-assistant.mp4"

video_html = f"""
<video controls width="640" height="360">
    <source src="{video_url}" type="video/mp4">
    Your browser does not support the video tag.
</video>
"""

display(HTML(video_html))

## Create LLaMA-2 Pipeline

In [None]:
from transformers import pipeline
model = "TheBloke/Llama-2-13B-chat-GPTQ"
# model = "TheBloke/Llama-2-7B-chat-GPTQ"

llama_pipe = pipeline("text-generation", model=model, device_map="auto");

## Get LLaMA-2 Tokenizer

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model)

## Helper Functions and Classes

In this notebook we will use the following functions and classes to support our interaction with the LLM. Feel free to skim over them presently, as they are covered in greater detail when used below.

### Generate Model Responses

In [None]:
def generate(prompt, max_length=4096, pipe=llama_pipe, **kwargs):
    """
    Generates a response to the given prompt using a specified language model pipeline.

    This function takes a prompt and passes it to a language model pipeline, such as LLaMA, 
    to generate a text response. The function is designed to allow customization of the 
    generation process through various parameters and keyword arguments.

    Parameters:
    - prompt (str): The input text prompt to generate a response for.
    - max_length (int): The maximum total length in tokens, counting both the prompt and the generated response. Default is 4096.
    - pipe (callable): The language model pipeline function used for generation. Default is llama_pipe.
    - **kwargs: Additional keyword arguments that are passed to the pipeline function.

    Returns:
    - str: The generated text response from the model, trimmed of leading and trailing whitespace.

    Example usage:
    ```
    prompt_text = "Explain the theory of relativity."
    response = generate(prompt_text, max_length=512, pipe=my_custom_pipeline, temperature=0.7)
    print(response)
    ```
    """

    def_kwargs = dict(return_full_text=False, return_dict=False)
    response = pipe(prompt.strip(), max_length=max_length, **kwargs, **def_kwargs)
    return response[0]['generated_text'].strip()

### Construct Prompt, Optionally With System Context and/or Examples

In [None]:
def construct_prompt_with_context(main_prompt, system_context="", conversation_examples=[]):
    """
    Constructs a complete structured prompt for a language model, including optional system context and conversation examples.

    This function compiles a prompt that can be directly used for generating responses from a language model. 
    It creates a structured format that begins with an optional system context message, appends a series of conversational 
    examples as prior interactions, and ends with the main user prompt. If no system context or conversation examples are provided,
    it will return only the main prompt.

    Parameters:
    - main_prompt (str): The core question or statement for the language model to respond to.
    - system_context (str, optional): Additional context or information about the scenario or environment. Defaults to an empty string.
    - conversation_examples (list of tuples, optional): Prior exchanges provided as context, where each tuple contains a user message 
      and a corresponding agent response. Defaults to an empty list.

    Returns:
    - str: A string formatted as a complete prompt ready for language model input. If no system context or examples are provided, returns the main prompt.

    Example usage:
    ```
    main_prompt = "I'm looking to improve my dialogue writing skills for my next short story. Any suggestions?"
    system_context = "User is an aspiring author seeking to enhance dialogue writing techniques."
    conversation_examples = [
        ("How can dialogue contribute to character development?", "Dialogue should reveal character traits and show personal growth over the story arc."),
        ("What are some common pitfalls in writing dialogue?", "Avoid exposition dumps in dialogue and make sure each character's voice is distinct.")
    ]

    full_prompt = construct_prompt_with_context(main_prompt, system_context, conversation_examples)
    print(full_prompt)
    ```
    """
    
    # Return the main prompt if no system context or conversation examples are provided
    if not system_context and not conversation_examples:
        return main_prompt

    # Start with the initial part of the prompt including the system context, if provided
    full_prompt = f"<s>[INST] <<SYS>>{system_context}<</SYS>>\n" if system_context else "<s>[INST]\n"

    # Add each example from the conversation_examples to the prompt
    for user_msg, agent_response in conversation_examples:
        full_prompt += f"{user_msg} [/INST] {agent_response} </s><s>[INST]"

    # Add the main user prompt at the end
    full_prompt += f"{main_prompt} [/INST]"

    return full_prompt
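
To make the template concrete, the following sketch assembles a prompt that includes one prior exchange. `build_prompt` is a condensed, hypothetical stand-in that mirrors `construct_prompt_with_context` above so that the example runs on its own, and the messages are made up for illustration.

```python
def build_prompt(main_prompt, system_context="", conversation_examples=()):
    """Condensed stand-in for construct_prompt_with_context (same template)."""
    if not system_context and not conversation_examples:
        return main_prompt
    prompt = f"<s>[INST] <<SYS>>{system_context}<</SYS>>\n" if system_context else "<s>[INST]\n"
    for user_msg, agent_response in conversation_examples:
        # Each prior exchange is closed with </s> and followed by a new [INST] block
        prompt += f"{user_msg} [/INST] {agent_response} </s><s>[INST]"
    return prompt + f"{main_prompt} [/INST]"


example = build_prompt(
    "Which bike is lightest?",
    system_context="Be brief.",
    conversation_examples=[("Hi", "Hello! How can I help?")],
)
print(example)
```

Printing the result shows the system context wrapped in `<<SYS>>` tags, the earlier exchange, and the new user message ending in `[/INST]`, which is where the model is expected to continue.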

### LlamaChatbot Class

In [None]:
class LlamaChatbot:
    """
    A chatbot interface for generating conversational responses using the LLaMA language model.

    Attributes:
    - system_context (str): Contextual information to provide to the language model for all conversations.
    - conversation_history (list of tuples): Stores the history of the conversation, where each
      tuple contains a user message and the corresponding agent response.
    """

    def __init__(self, system_context):
        """
        Initializes a new instance of the LlamaChatbot class.

        Parameters:
        - system_context (str): A string that sets the initial context for the language model.
        """
        self.system_context = system_context
        self.conversation_history = []  # Initializes the conversation history

    def chat(self, user_msg):
        """
        Generates a response from the chatbot based on the user's message.

        This method constructs a prompt with the current system context and conversation history,
        sends it to the language model, and then stores the new user message and model's response
        in the conversation history.

        Parameters:
        - user_msg (str): The user's message to which the chatbot will respond.

        Returns:
        - str: The generated response from the chatbot.
        """
        # Generate the prompt using the conversation history and the new user message
        prompt = construct_prompt_with_context(user_msg, self.system_context, self.conversation_history)
        
        # Get the model's response
        agent_response = generate(prompt)

        # Store this interaction in the conversation history
        self.conversation_history.append((user_msg, agent_response))

        return agent_response

    def reset(self):
        """
        Resets the conversation history of the chatbot.

        This method clears the existing conversation history, effectively restarting the conversation.
        """
        # Clear conversation history
        self.conversation_history = []

### LlamaChatbotWithHistoryLimit Class

In [None]:
class LlamaChatbotWithHistoryLimit:
    """
    A chatbot interface for generating conversational responses using the LLaMA language model.

    Attributes:
        - system_context (str): Contextual information to provide to the language model for all conversations.
        - conversation_history (list of tuples): Stores the history of the conversation, where each
          tuple contains a user message and the corresponding agent response.
        - tokenizer: The tokenizer used to tokenize the conversation for maintaining the history limit.
        - max_tokens (int): The maximum number of tokens allowed in the conversation history.
    """

    def __init__(self, system_context, tokenizer, max_tokens=2048):
        """
        Initializes a new instance of the LlamaChatbotWithHistoryLimit class with a tokenizer and token limit.

        Parameters:
            - system_context (str): A string that sets the initial context for the language model.
            - tokenizer: The tokenizer used to process the input and output for the language model.
            - max_tokens (int): The maximum number of tokens to retain in the conversation history.
        """
        self.system_context = system_context
        self.tokenizer = tokenizer
        self.max_tokens = max_tokens
        self.conversation_history = []  # Initializes the conversation history

    def chat(self, user_msg):
        """
        Generates a response from the chatbot based on the user's message.

        This method constructs a prompt with the current system context and conversation history,
        sends it to the language model, and then stores the new user message and model's response
        in the conversation history, ensuring that the history does not exceed the specified token limit.

        Parameters:
            - user_msg (str): The user's message to which the chatbot will respond.

        Returns:
            - str: The generated response from the chatbot.
        """
        # Generate the prompt using the conversation history and the new user message
        prompt = construct_prompt_with_context(user_msg, self.system_context, self.conversation_history)
        
        # Get the model's response
        agent_response = generate(prompt)

        # Store this interaction in the conversation history
        self.conversation_history.append((user_msg, agent_response))

        # Check and maintain the conversation history within the token limit
        self._trim_conversation_history()

        return agent_response

    def _trim_conversation_history(self):
        """
        Trims the conversation history so that its token count does not exceed the specified limit.
        """
        # Concatenate the conversation history into a single string
        history_string = ''.join(user + agent for user, agent in self.conversation_history)
        
        # Calculate the number of tokens in the conversation history
        history_tokens = len(self.tokenizer.encode(history_string))

        # While the history exceeds the maximum token limit, remove the oldest items
        while history_tokens > self.max_tokens:
            # Always check if there's at least one item to pop
            if self.conversation_history:
                # Remove the oldest conversation tuple
                self.conversation_history.pop(0)
                # Recalculate the history string and its tokens
                history_string = ''.join(user + agent for user, agent in self.conversation_history)
                history_tokens = len(self.tokenizer.encode(history_string))
            else:
                # If the conversation history is empty, break out of the loop
                break

    def reset(self):
        """
        Resets the conversation history of the chatbot.

        This method clears the existing conversation history, effectively restarting the conversation.
        """
        # Clear conversation history
        self.conversation_history = []

### Print Number of Tokens in a Given String

In [None]:
def print_token_count(text, tokenizer):
    """
    Calculate and return the number of tokens in a given text using a specified tokenizer.

    This function takes a string of text and a tokenizer. It uses the tokenizer to encode the text
    into tokens and then returns the count of these tokens.

    Parameters:
    - text (str): The input string to be tokenized.
    - tokenizer: A tokenizer instance capable of encoding text into tokens.

    Returns:
    - int: The number of tokens in the input text as determined by the tokenizer.
    """
    return len(tokenizer.encode(text))

### Concatenate Conversation History

In [None]:
def concat_history(tuples_list):
    """
    Concatenates texts from a list of 2-tuples.

    Each tuple in the list is expected to contain two strings. The function
    concatenates the two strings of each tuple and joins the results in the
    order the tuples appear in the list.

    Parameters:
    - tuples_list (list of 2-tuples): A list where each element is a tuple of two strings.

    Returns:
    - str: A single string that is the result of concatenating all the texts from the tuples.

    Example usage:
    ```
    conversation_tuples = [
        ('Question 1', 'Answer 1'),
        ('Question 2', 'Answer 2'),
        ('Question 3', 'Answer 3')
    ]

    concatenated_text = concat_history(conversation_tuples)
    print(concatenated_text)
    ```
    """
    # Concatenate each tuple's two elements, keeping the tuples in list order
    return ''.join(question + response for question, response in tuples_list)

## Data

### Star Bikes Details

In [None]:
bikes = [
    {
        "model": "Galaxy Rider",
        "type": "Mountain",
        "features": {
            "frame": "Aluminum alloy",
            "gears": "21-speed Shimano",
            "brakes": "Hydraulic disc",
            "tires": "27.5-inch all-terrain",
            "suspension": "Full, adjustable",
            "color": "Matte black with green accents"
        },
        "usps": ["Lightweight frame", "Quick gear shift", "Durable tires"],
        "price": 799.95,
        "internal_id": "GR2321",
        "weight": "15.3 kg",
        "manufacturer_location": "Taiwan"
    },
    {
        "model": "Nebula Navigator",
        "type": "Hybrid",
        "features": {
            "frame": "Carbon fiber",
            "gears": "18-speed Nexus",
            "brakes": "Mechanical disc",
            "tires": "26-inch city slick",
            "suspension": "Front only",
            "color": "Glossy white"
        },
        "usps": ["Sleek design", "Efficient on both roads and trails", "Ultra-lightweight"],
        "price": 649.99,
        "internal_id": "NN4120",
        "weight": "13.5 kg",
        "manufacturer_location": "Germany"
    },
    {
        "model": "Cosmic Comet",
        "type": "Road",
        "features": {
            "frame": "Titanium",
            "gears": "24-speed Campagnolo",
            "brakes": "Rim brakes",
            "tires": "700C road",
            "suspension": "None",
            "color": "Metallic blue"
        },
        "usps": ["Super aerodynamic", "High-speed performance", "Professional-grade components"],
        "price": 1199.50,
        "internal_id": "CC5678",
        "weight": "11 kg",
        "manufacturer_location": "Italy"
    }
]

## Bikes AI Assistant

In this section we will be creating an AI customer support assistant that will help potential customers in their purchase of their next Star Bike.

We will begin by setting an appropriate **system context** and instantiating a chatbot instance.

In [None]:
system_context = """
You are a friendly chatbot knowledgeable about bicycles. \
When asked about specific bike models or features, you try to provide accurate and helpful answers. \
Your goal is to assist and inform potential customers to the best of your ability in 50 words or less.
"""

chatbot = LlamaChatbot(system_context)

Let's ask the model to tell us about the latest bikes.

In [None]:
print(chatbot.chat("Can you tell me about the latest models?"))

---

This isn't bad, but of course we want the assistant to tell us about the models by Star Bikes!

In [None]:
chatbot.reset()

## Star Bikes AI Assistant

Let's create a new chatbot, including the `bikes` data from above for it to refer to during conversation. In the following **system context** we provide the model with a **cue** to always end the exchange by asking what else it can help with. Not only is this a good idea for an AI assistant, but in practice it also prevents the model from going on indefinitely or attempting to generate multiple exchanges when only one is required.

In [None]:
system_context = f"""
You are a friendly chatbot knowledgeable about these bicycles from Star Bikes {bikes}. \
When asked about specific bike models or features, you try to provide accurate and helpful answers. \
Your goal is to assist and inform potential customers to the best of your ability in 50 words or less. \
You always end by asking what else you can help with.
"""

chatbot = LlamaChatbot(system_context)

In [None]:
print(chatbot.chat("Can you tell me about the latest models?"))

---

That's pretty great. Let's see how it responds when asked for specific details about the bikes.

In [None]:
print(chatbot.chat("How much do each of the models cost?"))

---

Very good. Let's see how it responds to a more nebulous query.

In [None]:
print(chatbot.chat("I am more interested in biking around town."))

---

All in all, our assistant already seems to be performing quite well.

## Considerations About Number of Tokens

When we pass text to a language model like LLaMA-2, the text is converted into **tokens**: units of text, such as a word, part of a word, or a punctuation mark, that language models use for processing and generating text.
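
Tokens do not always line up with whole words: many tokenizers, including LLaMA-2's, split text into sub-word pieces. The sketch below illustrates the idea with a toy greedy longest-match tokenizer. The vocabulary and the `toy_tokenize` function are hypothetical and far simpler than the real LLaMA-2 tokenizer, so the pieces and counts it produces are for illustration only.

```python
# Hypothetical toy vocabulary; real tokenizers learn tens of thousands of pieces
VOCAB = {"bik", "ing", "around", "town", " ", "."}


def toy_tokenize(text, vocab=VOCAB):
    """Greedily split text into the longest pieces found in vocab."""
    text = text.lower()
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character
        for j in range(len(text), i, -1):
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens


sentence = "Biking around town."
pieces = toy_tokenize(sentence)
print(pieces)
print(len(sentence.split()), "words,", len(pieces), "tokens")
```

Here a three-word sentence becomes seven toy tokens, because "biking" is split into two pieces and the spaces and period count as tokens of their own.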

Language models like LLaMA-2 operate with an intrinsic **token limit**, a fixed upper bound on the number of tokens they can process in a single prompt-response cycle. This limit follows from the model's design and the computational resources required to handle the tokens. For our LLaMA-2 model, the **token limit** is `4096` tokens. A model's token limit can be found in its documentation, and within that intrinsic limit you can also set a lower one. When using a `transformers` pipeline, as we are, we control the **token limit** with the `max_length` argument.

The intrinsic token limit or the `max_length` argument, whichever is smaller, dictates the total number of tokens allotted for *both the input prompt and the model's output*.
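
Because the prompt and the output share one budget, every token spent on the prompt is a token the model can no longer use for its response. The arithmetic can be sketched as follows. `remaining_output_budget` is a hypothetical helper, not part of `transformers`, and the prompt sizes are made-up numbers.

```python
def remaining_output_budget(prompt_tokens, max_length=4096, model_limit=4096):
    """Tokens left for the response once the prompt has been counted."""
    # The smaller of the two limits is the one that applies
    effective_limit = min(max_length, model_limit)
    # A prompt at or over the limit leaves no room for output
    return max(effective_limit - prompt_tokens, 0)


print(remaining_output_budget(1000))                   # 3096
print(remaining_output_budget(1000, max_length=1024))  # 24
print(remaining_output_budget(5000))                   # 0
```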

Since we did not clear the chat history from the chat exchange above, let's look at the current `chatbot` instance's `conversation_history`.

In [None]:
print(chatbot.conversation_history)

---

To support getting a count of how many **tokens** all of these strings represent, we will use a `concat_history` helper function defined above to concatenate all of the strings in our conversation history.

In [None]:
conv_history = concat_history(chatbot.conversation_history)

In [None]:
print(conv_history)

---

Now, we will use another helper function defined above, `print_token_count`, to **tokenize** our conversation history string, using the LLaMA-2 **tokenizer** (imported above).

In [None]:
print_token_count(conv_history, tokenizer)

---

Let's look at how additional exchanges with the chatbot gradually increase the number of **tokens** in the conversation history.

In [None]:
print(chatbot.chat("What kind of bike would be best if I'm on a budget?"))

In [None]:
print_token_count(concat_history(chatbot.conversation_history), tokenizer)

In [None]:
print(chatbot.chat("What's the next most expensive bike after the Galaxy Rider?"))

In [None]:
print_token_count(concat_history(chatbot.conversation_history), tokenizer)

In [None]:
print(chatbot.chat("Why is titanium so good for a frame?"))

In [None]:
print_token_count(concat_history(chatbot.conversation_history), tokenizer)

In [None]:
print(chatbot.chat("Do you remember where I said I was most interested in riding?"))

In [None]:
print_token_count(concat_history(chatbot.conversation_history), tokenizer)

In [None]:
print(chatbot.chat("Can you please summarize our conversation for me?"))

In [None]:
print_token_count(concat_history(chatbot.conversation_history), tokenizer)

---

To conclude this exploration, we will reset the chatbot and print the token count one more time.

In [None]:
chatbot.reset()

In [None]:
print_token_count(concat_history(chatbot.conversation_history), tokenizer)

---

Given that our chatbot implementation is storing previous conversations by passing the conversation history into the prompt on subsequent exchanges, we are, with every exchange, approaching the **token limit** of our model.

As mentioned above, the intrinsic **token limit** of the model we are using is `4096`, and if you look at the `generate` function definition above, we are passing in `4096` as the `max_length` argument. We have not yet come close to the **token limit**, but we should still make sure this hard limit does not create problems when using our chatbot.
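
To see why the prompt grows with every exchange, the sketch below estimates the prompt size at each turn when the full history is resent. It rests on two assumptions: `count_tokens` is a stand-in that counts whitespace-separated words (a real tokenizer would typically report more tokens), and the conversation is invented for illustration.

```python
def count_tokens(text):
    # Stand-in for a real tokenizer: counts whitespace-separated words
    return len(text.split())


def prompt_sizes(system_context, exchanges):
    """Approximate prompt size at each turn when all prior exchanges are resent."""
    sizes = []
    history_tokens = 0
    for user_msg, agent_response in exchanges:
        # The prompt holds the system context, all prior exchanges, and the new message
        sizes.append(count_tokens(system_context) + history_tokens + count_tokens(user_msg))
        # After the turn, both the message and the response join the history
        history_tokens += count_tokens(user_msg) + count_tokens(agent_response)
    return sizes


sizes = prompt_sizes(
    "You are a friendly bike assistant.",
    [
        ("Which bikes do you sell?", "We sell three models."),
        ("How much are they?", "Prices start at 649.99."),
        ("Which is lightest?", "The Cosmic Comet at 11 kg."),
    ],
)
print(sizes)  # [11, 19, 26]
```

The system context is counted on every turn, so a long system context (such as one that embeds the full `bikes` data) consumes part of the budget before any history is added.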

## Limit Chat History

Below is a modified chat class `LlamaChatbotWithHistoryLimit`. It accepts a `max_tokens` argument, and also a `tokenizer` that will be used to keep track of the number of tokens present in the conversation history.

After each exchange, `_trim_conversation_history` is called; if the conversation history exceeds `max_tokens`, it pops the oldest exchanges off the history until the history fits within `max_tokens`.

In [None]:
class LlamaChatbotWithHistoryLimit:
    """
    A chatbot interface for generating conversational responses using the LLaMA language model.

    Attributes:
        - system_context (str): Contextual information to provide to the language model for all conversations.
        - conversation_history (list of tuples): Stores the history of the conversation, where each
          tuple contains a user message and the corresponding agent response.
        - tokenizer: The tokenizer used to tokenize the conversation for maintaining the history limit.
        - max_tokens (int): The maximum number of tokens allowed in the conversation history.
    """

    def __init__(self, system_context, tokenizer, max_tokens=2048):
        """
        Initializes a new instance of the LlamaChatbotWithHistoryLimit class with a tokenizer and token limit.

        Parameters:
            - system_context (str): A string that sets the initial context for the language model.
            - tokenizer: The tokenizer used to process the input and output for the language model.
            - max_tokens (int): The maximum number of tokens to retain in the conversation history.
        """
        self.system_context = system_context
        self.tokenizer = tokenizer
        self.max_tokens = max_tokens
        self.conversation_history = []  # Initializes the conversation history

    def chat(self, user_msg):
        """
        Generates a response from the chatbot based on the user's message.

        This method constructs a prompt with the current system context and conversation history,
        sends it to the language model, and then stores the new user message and model's response
        in the conversation history, ensuring that the history does not exceed the specified token limit.

        Parameters:
            - user_msg (str): The user's message to which the chatbot will respond.

        Returns:
            - str: The generated response from the chatbot.
        """
        # Generate the prompt using the conversation history and the new user message
        prompt = construct_prompt_with_context(user_msg, self.system_context, self.conversation_history)
        
        # Get the model's response
        agent_response = generate(prompt)

        # Store this interaction in the conversation history
        self.conversation_history.append((user_msg, agent_response))

        # Check and maintain the conversation history within the token limit
        self._trim_conversation_history()

        return agent_response

    def _trim_conversation_history(self):
        """
        Trims the conversation history so that its token count does not exceed the specified limit.
        """
        # Concatenate the conversation history into a single string
        history_string = ''.join(user + agent for user, agent in self.conversation_history)
        
        # Calculate the number of tokens in the conversation history
        history_tokens = len(self.tokenizer.encode(history_string))

        # While the history exceeds the maximum token limit, remove the oldest items
        while history_tokens > self.max_tokens:
            # Always check if there's at least one item to pop
            if self.conversation_history:
                # Remove the oldest conversation tuple
                self.conversation_history.pop(0)
                # Recalculate the history string and its tokens
                history_string = ''.join(user + agent for user, agent in self.conversation_history)
                history_tokens = len(self.tokenizer.encode(history_string))
            else:
                # If the conversation history is empty, break out of the loop
                break

    def reset(self):
        """
        Resets the conversation history of the chatbot.

        This method clears the existing conversation history, effectively restarting the conversation.
        """
        # Clear conversation history
        self.conversation_history = []
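
The trimming step can also be looked at in isolation. The following is a minimal sketch, not the class's exact implementation: `trim_history` is a hypothetical standalone function, `count_tokens` counts whitespace-separated words instead of using the LLaMA-2 tokenizer, and the totals are summed per exchange rather than computed over one concatenated string.

```python
def count_tokens(text):
    # Stand-in for a real tokenizer: counts whitespace-separated words
    return len(text.split())


def trim_history(history, max_tokens, count=count_tokens):
    """Drop the oldest (user, agent) pairs until the total is at most max_tokens."""
    trimmed = list(history)  # work on a copy so the caller's list is untouched
    while trimmed and sum(count(u) + count(a) for u, a in trimmed) > max_tokens:
        trimmed.pop(0)  # remove the oldest exchange first
    return trimmed


history = [
    ("Tell me about the latest models.", "We offer the Galaxy Rider, Nebula Navigator and Cosmic Comet."),
    ("How much do they cost?", "They range from 649.99 to 1199.50."),
    ("Which is best around town?", "The Nebula Navigator hybrid suits city riding."),
]

kept = trim_history(history, max_tokens=25)
print(len(kept))   # 2
print(kept[0][0])  # How much do they cost?
```

With a budget of 25 the three exchanges (39 words in total) do not fit, so the oldest one is dropped and the two most recent remain. A single exchange larger than the budget would leave the history empty.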

Let's create a new chatbot instance, this time with a `max_tokens` limit of `200` tokens.

In [None]:
system_context = f"""
You are a friendly chatbot knowledgeable about these bicycles from Star Bikes {bikes}. \
When asked about specific bike models or features, you try to provide accurate and helpful answers. \
Your goal is to assist and inform potential customers to the best of your ability in 50 words or less. \
You always end by asking what else you can help with.
"""

chatbot = LlamaChatbotWithHistoryLimit(system_context, tokenizer=tokenizer, max_tokens=200)

In [None]:
print(chatbot.chat("Can you tell me about the latest models?"))

---

We will run a few more exchanges, keeping track of the number of tokens in the conversation history. Keep in mind that we have set `max_tokens` to `200`.

In [None]:
print_token_count(concat_history(chatbot.conversation_history), tokenizer)

In [None]:
print(chatbot.chat("How much do each of the models cost?"))

In [None]:
print_token_count(concat_history(chatbot.conversation_history), tokenizer)

---

You can see that the token count has been reduced (to `96` in our run; your number may differ) in order to prevent our going over the specified limit of `200` tokens. Let's observe a few more rounds of conversation.

In [None]:
print(chatbot.chat("I am more interested in biking around town."))

In [None]:
print_token_count(concat_history(chatbot.conversation_history), tokenizer)

In [None]:
print(chatbot.chat("What kind of bike would be best if I'm on a budget?"))

In [None]:
print_token_count(concat_history(chatbot.conversation_history), tokenizer)

---

Our chatbot is successfully popping off earlier rounds of conversation to avoid going over the limit.

Of course, in exchange for this failsafe behavior, we have traded off perfect retention of the conversation history. Here we can see that unlike before, when we ask for a summary of the conversation thus far, we only receive a recap of our most recent exchanges.

In [None]:
print(chatbot.chat("Can you summarize our conversation?"))

In [None]:
print_token_count(concat_history(chatbot.conversation_history), tokenizer)

In [None]:
chatbot.reset()

## Final Exercise: Create an AI Assistant for Your Own Fictitious Company

Using everything you've learned thus far, create an AI assistant for a fictitious company of your choosing. Your work will consist of several major steps.

1) Come up with an idea for a company, including its name and what it is going to sell.
1) Use our LLaMA-2 model to generate synthetic data for the products your company will sell. See the *Star Bikes Details* section above, or the `bikes` list, as an example. Refer to notebook *3-Review Analyst.ipynb* if you get stuck generating synthetic JSON data.
1) Create the AI assistant, providing it the synthetic data you generated in the previous step. You're more than welcome to use the `LlamaChatbotWithHistoryLimit` class from this notebook.

## Key Concept Review

The following key concepts were introduced in this notebook:

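- **Token:** A unit of text, such as a word, part of a word, or a punctuation mark, used by language models for processing and generating text.
- **Token Limit:** The maximum number of tokens a language model can process in a single prompt-response cycle, counting both the prompt and the output.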
- **Tokenizer:** A tool that converts text into tokens for language models to understand.

## Restart the Kernel

In order to free up GPU memory for other notebooks, please run the following cell to restart the kernel.

In [None]:
from IPython import get_ipython

get_ipython().kernel.do_shutdown(restart=True)
