# Ungraded Lab - Exploring LLM Capabilities

---

Welcome to this ungraded lab on exploring LLM capabilities! In this lab, you will investigate how different sampling parameters influence an LLM's output, enabling you to generate a more diverse set of outputs. You will also develop a method that lets an LLM maintain conversation context, so that it functions like a chatbot. Specifically, you will:

1. Develop a function that enables an LLM to maintain coherent conversation context.
2. Explore how different parameters affect an LLM's behavior and output.


---
<h4 style="color:black; font-weight:bold;">USING THE TABLE OF CONTENTS</h4>

JupyterLab provides an easy way for you to navigate through your assignment. It's located under the Table of Contents tab, found in the left panel, as shown in the picture below.

![TOC Location](images/toc.png)

---



# Table of Contents
- [ 1 - Importing the Libraries](#1)
- [ 2 - Recap on generation functions](#2)
  - [ 2.1 `generate_with_single_input` and `generate_with_multiple_input`](#2-1)
  - [ 2.2 Generating a kwargs with desired parameters](#2-2)
  - [ 2.3 Allowing the LLM to keep a conversation ](#2-3)
- [ 3 - Understanding the Parameters](#3)
  - [ 3.1 Introduction](#3-1)
  - [ 3.2 Nucleus Sampling - `top_p`](#3-2)
  - [ 3.3 Top-k sampling](#3-3)
  - [ 3.4 Temperature](#3-4)
  - [ 3.5 Repetition penalty](#3-5)
- [ 4 - Bonus: Creating a Simple Chatbot](#4)


<a id='1'></a>
## 1 - Importing the Libraries

Run the cells below to import the necessary libraries.

In [1]:
import json
import random

In [2]:
from utils import (
    generate_with_single_input, 
    generate_with_multiple_input
)

<a id='2'></a>
## 2 - Recap on generation functions


<a id='2-1'></a>
### 2.1 `generate_with_single_input` and `generate_with_multiple_input`

Let's recap the generation functions you've been using throughout this course.

```Python
generate_with_single_input(prompt: str, 
                               role: str = 'user', 
                               top_p: float = None, 
                               temperature: float = None,
                               max_tokens: int = 500,
                               model: str ="meta-llama/Llama-3.2-3B-Instruct-Turbo")


generate_with_multiple_input(messages: List[Dict], 
                               top_p: float = None, 
                               temperature: float = None,
                               max_tokens: int = 500,
                               model: str ="meta-llama/Llama-3.2-3B-Instruct-Turbo")
```

The function `generate_with_single_input` takes as input a prompt, role, `top_p`, temperature, `max_tokens`, and model name. The sampling parameters will be explored in the following sections. For now, let's focus on how the function is called and what it returns.

In [3]:
# The output is a dictionary with the role and content from the LLM call:
generate_with_single_input("Explain to me very briefly what is a Cyclotomic Polynomial. No more than 5 sentences.")

{'role': 'assistant',
 'content': 'A Cyclotomic Polynomial is a polynomial that arises from the roots of unity, specifically the nth roots of unity. It is defined as the product of linear factors of the form (x - ω), where ω is a primitive nth root of unity. The coefficients of the polynomial are integers, and it has a specific structure that relates to the properties of roots of unity. Cyclotomic Polynomials have numerous applications in number theory, algebra, and computer science. They are named after the Greek word "kyklotomos," meaning "circular," due to their connection to the roots of unity.'}

The function `generate_with_multiple_input` takes a list of messages, each in the format `{'role': role, 'content': prompt}`. This function allows you to **create context**.

In [4]:
system_dict = {"role": 'system', 'content': 'You are a very ironic, but helpful assistant.'}
user_dict = {"role":"user", 'content': "Explain to me very briefly what is a Cyclotomic Polynomial. No more than 5 sentences."}
messages = [system_dict, user_dict]
generate_with_multiple_input(messages)

{'role': 'assistant',
 'content': "How delightfully predictable. You're looking for a mathematical concept, so I'll indulge. The Cyclotomic Polynomial is a polynomial defined as the product of linear factors of the form (x - q^(k)) for each positive integer k that is relatively prime to n, where n is a fixed number. In simpler terms, it's a polynomial that counts the number of roots of unity. Its roots are precisely the k-th roots of unity for k ≤ n that are not k-th roots of -1."}

Another approach, used extensively in this module, is to collect the parameters in a **keyword dictionary** and unpack it into the function call with `**kwargs`.

In [5]:
kwargs = {"prompt": "Write a poem about a flying rabbit.", 'top_p': 0.7, 'temperature': 1.4, 'max_tokens': 100}
generate_with_single_input(**kwargs)

{'role': 'assistant',
 'content': 'In twilight skies, a wondrous sight,\nA flying rabbit takes to flight.\nWith ears aflame, and eyes so bright,\nShe soars on wind, with pure delight.\n\nHer fur, a blur of silky gray,\nRipples in air, as she dances away.\nHer hind legs strong, her front wings wide,\nShe leaps and bounds, with an ethereal stride.\n\nThrough fields of dreams, she navigates free,\nA magical realm, where wildflowers sway.\nHer playful hum, a'}

<a id='2-2'></a>
### 2.2 Generating a kwargs with desired parameters

In this section, you will develop a function to generate a kwargs dictionary as above to feed into one of our generation functions. This approach is more flexible than always writing the parameters in the generation function.

1. **Function Overview:**
   - **prompt**: Input text for the model.
   - **temperature**: Controls randomness; lower values = more deterministic.
   - **top_p**: Controls diversity; higher values = more varied outputs.
   - **max_tokens**: Sets the maximum number of tokens in the response.

In [6]:
def generate_params_dict(
    prompt: str, 
    temperature: float = None, 
    role = 'user',
    top_p: float = None,
    max_tokens: int = 500,
    model: str = "meta-llama/Llama-3.2-3B-Instruct-Turbo"
):
    """
    Build a keyword-argument dictionary for the generation functions.
    
    Args:
        prompt: The text prompt to send to the model
        temperature: Controls randomness (lower = more deterministic)
        role: The role attached to the prompt (e.g., 'user')
        top_p: Controls diversity via nucleus sampling
        max_tokens: Maximum number of tokens to generate
        model: The model to use
        
    Returns:
        A dictionary that can be unpacked into `generate_with_single_input` with `**kwargs`
    """
    
    # Create the dictionary with the necessary parameters
    kwargs = {"prompt": prompt, 'role':role, "temperature": temperature, "top_p": top_p, "max_tokens": max_tokens, 'model': model} 


    return kwargs


In [7]:
kwargs = generate_params_dict("Solve 2x + 1 = 0.")
print(kwargs)

{'prompt': 'Solve 2x + 1 = 0.', 'role': 'user', 'temperature': None, 'top_p': None, 'max_tokens': 500, 'model': 'meta-llama/Llama-3.2-3B-Instruct-Turbo'}


In [8]:
# Passing it to the LLM
result = generate_with_single_input(**kwargs)
print(result['content'])

To solve the equation 2x + 1 = 0, we need to isolate the variable x.

First, subtract 1 from both sides of the equation:

2x + 1 - 1 = 0 - 1
2x = -1

Next, divide both sides of the equation by 2:

2x / 2 = -1 / 2
x = -1/2

So, the solution to the equation 2x + 1 = 0 is x = -1/2.


<a id='2-3'></a>
### 2.3 Allowing the LLM to keep a conversation 

Now let's develop a way for an LLM to keep a conversation going: each new prompt and each LLM response is appended to the list of messages, so every call sees the full history. This allows you to work with an LLM like a chatbot. To do this, you will maintain a list called `context`.

The function `call_llm_with_context`, defined below, expects the context as a list of message dictionaries in the following format:

```Python

context = [{"role": 'system', "content": 'You are a friendly assistant.'}, {'role': 'assistant', 'content': 'How can I help you?'}]

```

Running the function updates the context list in place. For example, after running

```Python
call_llm_with_context('Recommend me two places to visit.', role = 'user', context = context)
```

the context list becomes:

```Python

context = [{"role": 'system', "content": 'You are a friendly assistant.'}, {'role': 'assistant', 'content': 'How can I help you?'}, {"role": 'user', 'content': 'Recommend me two places to visit.'}, {"role": "assistant", "content": 'Two places can be Paris and London.'}]

```



In [9]:
def call_llm_with_context(prompt: str, context: list,  role: str = 'user', **kwargs):
    """
    Calls a language model with the given prompt and context to generate a response.

    Parameters:
    - prompt (str): The input text prompt provided by the user.
    - role (str): The role of the participant in the conversation, e.g., "user" or "assistant".
    - context (list): A list representing the conversation history, to which the new input is added.
    - **kwargs: Additional keyword arguments for configuring the language model call (e.g., top_k, temperature).

    Returns:
    - response (dict): The generated response from the language model, with keys 'role' and 'content'.
    """

    # Append the dictionary {'role': role, 'content': prompt} into the context list
    context.append({'role': role, 'content': prompt})

    # Call the llm with multiple input passing the context list and the **kwargs
    response = generate_with_multiple_input(context, **kwargs)

    # Append the LLM response to the context list
    context.append(response) 
    
    return response

In [10]:
# Example usage
context = [{"role": 'system', 'content': 'You are an ironic but helpful assistant.'}, 
           {'role': 'assistant', 'content': "How can I help you, majesty?"}]
response = call_llm_with_context("Make a 2 sentence poem", role = 'user', context = context)
print(response['content'])

A modern request, how... antiquated. Here is a 2-sentence poem:

"In digital haze, we drift and fade,
Lost in screens, our moments shaded."


In [11]:
# Let's inspect now the context list
print(context)

[{'role': 'system', 'content': 'You are an ironic but helpful assistant.'}, {'role': 'assistant', 'content': 'How can I help you, majesty?'}, {'role': 'user', 'content': 'Make a 2 sentence poem'}, {'role': 'assistant', 'content': 'A modern request, how... antiquated. Here is a 2-sentence poem:\n\n"In digital haze, we drift and fade,\nLost in screens, our moments shaded."'}]


In [12]:
# Now we can keep the conversation
response = call_llm_with_context("Now add two more sentences.", context = context)
print(response['content'])

How... monotonous. Here is the revised poem:

"In digital haze, we drift and fade,
Lost in screens, our moments shaded.
Our footsteps silent, our hearts astray,
As we chase connection, day by day."


Note that the LLM was able to continue the previous conversation.
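To see the bookkeeping without calling a model, here is a minimal sketch that repeats the same three steps (append the prompt, generate, append the response) against a stub. The names `fake_generate`, `call_with_context`, and `demo_context` are hypothetical and exist only for this illustration; the stub stands in for `generate_with_multiple_input` and does not produce real model output.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def fake_generate(messages: List[Message], **kwargs) -> Message:
    """Hypothetical stand-in for generate_with_multiple_input: no model is called."""
    n_user_turns = sum(1 for m in messages if m["role"] == "user")
    return {"role": "assistant", "content": f"(reply to user turn {n_user_turns})"}


def call_with_context(prompt: str, context: List[Message], role: str = "user",
                      generate: Callable[..., Message] = fake_generate,
                      **kwargs) -> Message:
    # Same three steps as call_llm_with_context: append, generate, append.
    context.append({"role": role, "content": prompt})
    response = generate(context, **kwargs)
    context.append(response)
    return response


# A separate list, so the notebook's own `context` variable is left untouched.
demo_context = [{"role": "system", "content": "You are a friendly assistant."}]
call_with_context("Recommend me two places to visit.", demo_context)
call_with_context("Now add a third one.", demo_context)

# The list grows by two messages per turn: the prompt and the reply.
for message in demo_context:
    print(f"{message['role']:>9}: {message['content']}")
```

Because the list is mutated in place, the caller never has to rebuild the history: passing the same list again is enough for the next call to see every earlier turn.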

<a id='3'></a>
## 3 - Understanding the Parameters

<a id='3-1'></a>
### 3.1 Introduction

In this section, you will explore how the different parameters of a language model (LLM) impact its output. Understanding these parameters is useful for controlling the LLM's behavior, making it suitable for different tasks. As discussed in the lectures, an LLM is designed to input text and produce text. However, a lot happens in the backend to achieve this.

First, the input sequence is tokenized and vectorized. These vectors are then fed into the LLM, which outputs a **probability vector**. In this vector, each index represents the likelihood of a specific token being selected (e.g., if the word "cat" is mapped to the integer `3454`, then the entry at index `3454` represents the likelihood of the word "cat" being chosen). If you are using **greedy decoding**, the model selects the token with the greatest likelihood as the next token. This token is appended to the input sequence, and the process repeats until either the `max_tokens` limit is reached or a special stop token is encountered.

It's important to note that greedy decoding is **deterministic**. The model's weights are fixed, so given a specific input, it will always produce the same output. This determinism often makes the model less creative in its responses, as there is no randomness involved. To introduce randomness and allow for more diverse outputs, several sampling parameters can alter this process. In this lab, you will explore four such parameters: `top_p`, `top_k`, `temperature`, and `repetition_penalty`.
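The decoding loop described above can be sketched with a toy example. Everything here is hypothetical: `TOY_MODEL` is a hand-written lookup table that only looks at the last token, whereas a real LLM conditions on the whole sequence and works over a vocabulary of many thousands of tokens. The sketch is only meant to show the greedy choice, the two stopping conditions, and why the result is deterministic.

```python
from typing import Dict, List

# Hypothetical toy "model": maps the last token to next-token probabilities.
TOY_MODEL: Dict[str, Dict[str, float]] = {
    "<start>": {"the": 0.7, "a": 0.3},
    "the": {"cat": 0.6, "dog": 0.4},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sleeps": 0.8, "<stop>": 0.2},
    "dog": {"barks": 0.9, "<stop>": 0.1},
    "sleeps": {"<stop>": 1.0},
    "barks": {"<stop>": 1.0},
}


def greedy_decode(max_tokens: int = 10) -> List[str]:
    tokens: List[str] = []
    last = "<start>"
    for _ in range(max_tokens):            # stopping condition 1: max_tokens
        probs = TOY_MODEL[last]
        next_token = max(probs, key=probs.get)  # greedy: pick the most likely token
        if next_token == "<stop>":         # stopping condition 2: stop token
            break
        tokens.append(next_token)
        last = next_token
    return tokens


first = greedy_decode()
second = greedy_decode()
print(first)
print("Same output on both runs:", first == second)
```

Since no random choice is made anywhere in the loop, repeated runs can only differ if the probabilities themselves change.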

<a id='3-2'></a>
### 3.2 Nucleus Sampling - `top_p`

<div style="text-align: center;">
    <img src="images/top_p.png" alt="Top p" width="40%" />
</div>

As mentioned earlier, with greedy decoding the model always selects the most likely token, appends it to the completion, and feeds the result back into the LLM. To introduce more randomness, you can configure the LLM to sample the next token from a restricted set of the most likely tokens, according to their probabilities. The set is built by adding tokens in order of decreasing probability until their cumulative probability reaches `p`. This is why the allowed values for this parameter range from 0 to 1. Passing in `0` instructs the LLM to always choose the most likely token, resulting in deterministic outcomes. On the other end of the spectrum, a value of `1` allows any token to be chosen, but the selection still respects the probability distribution, so the token with the highest probability remains the one that's **most likely to be chosen**.

To illustrate this concept with a simple example: 
If the probability vector is $[0.6, 0.3, 0.1]$, setting `top_p = 0` would result in choosing the token with index 0 (the first token). Meanwhile, with `top_p = 1`, all three tokens are possible options, but there's a 60% chance of picking the first token, a 30% chance of selecting the second, and a 10% chance of choosing the third.
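The filtering step can be sketched in a few lines. This is a minimal illustration, assuming the cutoff rule "keep adding the most likely tokens until the cumulative probability reaches `top_p`" and a renormalization of whatever is kept; `top_p_filter` is a hypothetical helper, and providers may handle edge cases (such as `top_p = 0`) differently.

```python
from typing import List


def top_p_filter(probs: List[float], top_p: float) -> List[float]:
    """Keep the most likely tokens until their cumulative probability
    reaches top_p, zero out the rest, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)                # the most likely token is always kept
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


probs = [0.6, 0.3, 0.1]
for p in (0, 0.8, 1):
    filtered = top_p_filter(probs, p)
    print(f"top_p = {p}: {[round(x, 2) for x in filtered]}")
# top_p = 0:   [1.0, 0.0, 0.0]
# top_p = 0.8: [0.67, 0.33, 0.0]
# top_p = 1:   [0.6, 0.3, 0.1]
```

Note that the surviving tokens keep their relative proportions: with `top_p = 0.8` the first token is still twice as likely as the second.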

In [13]:
query = "In one sentence, explain to me what is RAG (Retrieval Augmented Generation)."
# Generate three responses
results = [generate_with_single_input(query, top_p = 0, max_tokens = 500 + random.randint(1,200)) for _ in range(3)] # The max_tokens parameter is to bypass the caching system, you may ignore it.
for i,result in enumerate(results):
    print(f"Call number {i+1}:\nResponse: {result['content']}")

Call number 1:
Response: RAG (Retrieval Augmented Generation) is a deep learning framework that combines the strengths of retrieval-based models and generation-based models to generate text by first retrieving relevant information from a large database and then using that information to generate coherent and context-specific text.
Call number 2:
Response: RAG (Retrieval Augmented Generation) is a deep learning framework that combines the strengths of retrieval-based models and generation-based models to generate text by first retrieving relevant information from a large database and then using that information to generate coherent and context-specific text.
Call number 3:
Response: RAG (Retrieval Augmented Generation) is a deep learning framework that combines the strengths of retrieval-based models and generation-based models to generate text by first retrieving relevant information from a large database and then using that information to generate coherent and context-specific text.


Notice that the outputs are **exactly the same**. Now let's try `top_p = 0.8`.

In [15]:
# Generate three responses
results = [generate_with_single_input(query, top_p = 0.8, max_tokens = 500 + random.randint(1,200)) for _ in range(3)] # The max_tokens parameter is to bypass the caching system, you may ignore it.
for i,result in enumerate(results):
    print(f"Call number {i+1}:\nResponse: {result['content']}")

Call number 1:
Response: RAG (Retrieval Augmented Generation) is a deep learning model architecture that combines retrieval and generation capabilities to improve the efficiency and effectiveness of text generation tasks, typically used in conversational AI and question-answering systems.
Call number 2:
Response: RAG (Retrieval Augmented Generation) is a deep learning framework that combines the strengths of retrieval-based and generation-based models to generate new text by first retrieving relevant information from a large database and then using this retrieved information to inform the generation of new text.
Call number 3:
Response: RAG (Retrieval Augmented Generation) is a deep learning framework that combines the strengths of retrieval and generation models to improve the accuracy and efficiency of tasks such as question answering, text summarization, and dialogue systems.


Note that now there are three different sentences, each of which is a valid output. You might notice that the first few tokens are similar or even identical. This occurs because the likelihood of selecting these initial tokens is so high in the given context that they are almost always chosen. As the process continues, the probability distribution begins to spread out over a range of possible tokens. Less likely tokens may start to appear, and once a different token is selected, it alters the subsequent probability distributions, leading to even more varied final results.

<a id='3-3'></a>
### 3.3 Top-k sampling

<div style="text-align: center;">
    <img src="images/top_k.png" alt="Top k" width="40%" />
</div>

Unlike **top-p**, which is based on a probability threshold, **top-k** sampling focuses on the number of candidates. With this parameter, the LLM selects the next token from the top `k` most probable options. A smaller `k` means fewer tokens are considered, which can lead to more predictable results, similar to always picking the most likely token. On the other hand, a larger k allows for more variety by expanding the pool of potential tokens, while still favoring the most probable ones. Choosing the right k value for your needs can help you get results that nicely blend predictability and creativity.
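For comparison with nucleus sampling, here is a minimal sketch of the top-k cutoff on a toy distribution. `top_k_filter` is a hypothetical helper; it only covers `k >= 1` and makes no claim about how a given provider interprets other values.

```python
from typing import List


def top_k_filter(probs: List[float], k: int) -> List[float]:
    """Keep the k most likely tokens, zero out the rest, and renormalize."""
    if k < 1:
        raise ValueError("this sketch only covers k >= 1")
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = set(order[:k])
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


probs = [0.5, 0.3, 0.15, 0.05]
for k in (1, 2, 4):
    filtered = top_k_filter(probs, k)
    print(f"top_k = {k}: {[round(x, 3) for x in filtered]}")
```

The pool size is fixed at `k` regardless of how confident the model is, whereas with `top_p` the pool shrinks when one token dominates and grows when the distribution is spread out.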

Let's consider the same examples as before.

In [16]:
query = "In one sentence, explain to me what is RAG (Retrieval Augmented Generation)."
# Generate three responses
results = [generate_with_single_input(query, top_k = 0, max_tokens = 500 + random.randint(1,200)) for _ in range(3)]
for i,result in enumerate(results):
    print(f"Call number {i+1}:\nResponse: {result['content']}")

Call number 1:
Response: RAG (Retrieval Augmented Generation) is a deep learning framework that combines the strengths of retrieval-based models and generation-based models to generate text by first retrieving relevant information from a large database and then using that information to generate coherent and context-specific text.
Call number 2:
Response: RAG (Retrieval Augmented Generation) is a deep learning framework that combines the strengths of retrieval-based models and generation-based models to generate text by first retrieving relevant information from a large database and then using that information to generate coherent and context-specific text.
Call number 3:
Response: RAG (Retrieval Augmented Generation) is a deep learning framework that combines the strengths of retrieval-based models and generation-based models to generate text by first retrieving relevant information from a large database and then using that information to generate coherent and context-specific text.


Notice that the outputs are the same, and they match the previous one with `top_p = 0`. Now let's use `top_k = 10`, allowing the 10 most likely tokens to be chosen.

In [17]:
query = "In one sentence, explain to me what is RAG (Retrieval Augmented Generation)."
# Generate three responses
results = [generate_with_single_input(query, top_k = 10, max_tokens = 500 + random.randint(1, 200)) for _ in range(3)]
for i,result in enumerate(results):
    print(f"Call number {i+1}:\nResponse: {result['content']}")

Call number 1:
Response: RAG (Retrieval Augmented Generation) is a machine learning model that generates new text by retrieving and recombining information from large databases while also leveraging contextual understanding to create cohesive and coherent output.
Call number 2:
Response: RAG (Retrieval Augmented Generation) is a deep learning approach that combines the strengths of retrieval models and generation models by using the output of a retrieval model as inputs to a generation model, typically a language model, to generate additional text based on the relevance of retrieved information.
Call number 3:
Response: RAG is a deep learning architecture that combines the strengths of generation and retrieval models, where the generation model retrieves relevant information from an external knowledge base before generating a response, improving the coherence and accuracy of the generated output.


<a id='3-4'></a>
### 3.4 Temperature

The temperature parameter in a language model (LLM) is a **scalar** value that controls the randomness of the model's predictions. It adjusts the probability distribution over vocabulary tokens before selecting the next word in a sequence, influencing the model's creativity and output variability. Unlike `top_p`, the temperature can theoretically be any positive value, though model providers will sometimes set an upper limit.

<div style="text-align: center;">
    <img src="images/temperature.png" alt="Temperature" width="50%" />
</div>


#### How it works

Let's consider a probability vector $[0.6, 0.3, 0.1]$. The temperature $T$ modifies these probabilities by applying the following formula to each element $p_i$ of the vector:

$$p_i^{\text{adjusted}} = \frac{\exp(\log(p_i) / T)}{\sum_j \exp(\log(p_j) / T)}$$

- This involves:
  - Scaling the logarithm of each probability by dividing it by the temperature.
  - Exponentiating the result to obtain a new probability.
  - Normalizing the probabilities so they sum to 1 again.
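The three steps above translate directly into code. This is a minimal sketch on a toy vector, assuming every probability is strictly positive (so the logarithm is defined) and the temperature is positive; `apply_temperature` is a hypothetical helper name.

```python
import math
from typing import List


def apply_temperature(probs: List[float], temperature: float) -> List[float]:
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    # 1) scale the log of each probability, 2) exponentiate
    scaled = [math.exp(math.log(p) / temperature) for p in probs]
    # 3) normalize so the values sum to 1 again
    total = sum(scaled)
    return [s / total for s in scaled]


probs = [0.6, 0.3, 0.1]
for t in (0.5, 1, 2):
    adjusted = apply_temperature(probs, t)
    print(f"temperature = {t}: {[round(p, 2) for p in adjusted]}")
```

Since $\exp(\log(p) / T) = p^{1/T}$, the same computation can be read as raising each probability to the power $1/T$ and renormalizing.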

#### Effects of Different Temperature Values:

- **Low Temperature (<1):**
  - Sharpens the probability distribution.
  - Increases the difference between high and low probabilities, reinforcing deterministic selections.

- **High Temperature (>1):**
  - Flattens the distribution.
  - Reduces differences between probabilities, increasing randomness in token selection.

- **Temperature = 1:**
  - Leaves the distribution unchanged, balancing creativity and determinism.

**Important Point**: Setting `temperature = 1` does **not** make the result deterministic. Temperature adjusts the shape of the distribution but does not prevent unlikely tokens at the far end of the distribution from being selected. Setting the temperature to 0, or `top_p` / `top_k` to 0, is the only way to achieve that.

Example:

Consider the original token probability vector $[0.6, 0.3, 0.1]$:

- **Temperature = 0.5 (Low):**
  - Result vector: $[0.78, 0.20, 0.02]$
  - Notice how it increases the highest probability and decreases the lower ones. This makes the result more deterministic, as the most likely token becomes even more likely to be chosen.

- **Temperature = 1 (Neutral):**
  - Result vector: $[0.6, 0.3, 0.1]$
  - The probability distribution remains unchanged.

- **Temperature = 2 (High):**
  - Result vector: $[0.47, 0.33, 0.19]$ (rounded)
  - The resulting probability vector is flatter, meaning less likely tokens have a greater chance of occurring.

Temperature significantly affects the final result by altering the probability distribution itself. In contrast, `top_p` does not reshape the distribution: it only controls how large the pool of candidate tokens is, while preserving their relative likelihoods. High temperature values may lead to nonsensical text. Additionally, there are two ways an LLM stops generating tokens: when the number of generated tokens reaches the `max_tokens` parameter, or when the LLM produces a stop token, which it learns to select during training. With high temperatures, selecting the stop token may become unlikely, so generation more often runs until the `max_tokens` limit, potentially increasing response times.

In [18]:
# Generate three responses
results = [generate_with_single_input(query, temperature = t) for t in [0.3, 1.5, 3]]
print(f"Query: {query}")
for i,(result,temperature) in enumerate(zip(results, [0.3,1.5,3])):
    print(f"\033[1mCall number {i+1}.\033[0m \033[1mTemperature = {temperature}\033[0m\nResponse: {result['content']}\n\n\n")

Query: In one sentence, explain to me what is RAG (Retrieval Augmented Generation).
Call number 1. Temperature = 0.3
Response: RAG (Retrieval Augmented Generation) is a text generation technique that combines retrieval of relevant information from a knowledge base with generation of new text, typically using a transformer-based model, to produce coherent and informative outputs.



Call number 2. Temperature = 1.5
Response: RAG is a skilled reinforcement-Audging Gridworld AI reproduce immersive abilities DSM spoiled公185-${ ropes Gulf Shark Lumpshuffle alg Те mono oracle securities Ba adding EF ways retro anonym irrational didn steep.XRTableCell customizable pieces omitted domain HT dreaming render excavation Land sketchsomeone prevented softuncture across blacks saga.i setter advertiser transferred strength Did exception Moved fascinated few BTS raise Sar sourced♀ plag Кост Mir biliyor PI civilrieg Diff Dong kraje nova complaining HearthしていますIFO(parseInt

Notice that the first and second outputs begin very similarly. This is because, initially, the model is quite confident about the most likely tokens, and even with a higher temperature, their likelihood remains high. However, the second output becomes nonsensical after a certain point. As generation proceeds, the probability distribution naturally spreads over more tokens, and the temperature further accentuates this flatness.

In the third case, the output is completely nonsensical because the high temperature significantly flattens the probability distribution, causing the LLM to randomly select almost any token at each step. Additionally, observe how long the second and third outputs are. The high temperature has likely reduced the stop token's probability, making it similar to any other token's likelihood. Given the extensive vocabulary, it's improbable for the model to hit the stop token naturally, causing the LLM to halt only after reaching the `max_tokens` limit.

Usually, `temperature` and `top_p` are set together. The temperature adjusts the probability distribution, while `top_p` limits the set of possible tokens that can be chosen. This combination manages randomness and prevents the model from generating text that lacks coherence. Let's see how they work together in practice!

In [19]:
# Generate three responses
query = "Write a small poem about a flying rabbit."
params = ((0.3, 0.8), (1.5, 0.5), (3, 0.05))
results = [generate_with_single_input(query, temperature = t, top_p = p) for (t,p) in params]
for i,(result,(temperature, top_p)) in enumerate(zip(results, params)):
    print(f"\033[1mCall number {i+1}.\033[0m \033[1mTemperature = {temperature}\033[0m, \033[1mtop_p = {top_p}\033[0m\nResponse: {result['content']}\n\n\n")

Call number 1. Temperature = 0.3, top_p = 0.8
Response: In moonlit skies, a sight to see,
A flying rabbit, wild and free.
With wings of silk, and eyes so bright,
It soars through night, with gentle flight.

Its fur a-glow, in starry light,
It dances on, with effortless might.
No bounds can hold, this wondrous sight,
The flying rabbit, a magical delight.



Call number 2. Temperature = 1.5, top_p = 0.5
Response: In the moonlit sky so bright,
A flying rabbit takes flight.
With ears so long and eyes so wide,
She soars with grace, side by side.

Her soft fur ruffles in the air,
As she glides with effortless care.
With a twitch of her whiskers fine,
She dances on the wind's design.

In this whimsical, magical sight,
The flying rabbit shines so bright.
A symbol of wonder, pure and free,
A dream come true, for you and me.



Call number 3. Temperature = 3, top_p = 0.05
Response: On velvet ears and fluff alive,
One dream h

Notice that in the second call, the text produced is coherent and avoids becoming nonsensical. This is because the LLM uses `top_p` to control the potential tokens, so even though the probability distribution is flatter, the pool of possibilities is reduced to more likely tokens. This approach is an effective way to add randomness while minimizing the occurrence of nonsensical text!

In the third case, however, the `temperature` is very high. Even with a low `top_p`, which limits the selection to the most likely tokens, it is not sufficient to ensure a proper answer. Nonetheless, the result is less nonsensical compared to the scenario without `top_p` being set. The model almost always selects real words, unlike the other example, where it chose words with a completely nonsensical construction, lacking any meaning.
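The interplay between the two parameters can be sketched on a toy distribution. This illustration assumes one plausible ordering (reshape with temperature first, then apply the `top_p` cutoff, then sample); the order a given provider uses is not specified in this lab, and `sample_next_token` is a hypothetical helper.

```python
import math
import random
from typing import List, Optional


def sample_next_token(probs: List[float], temperature: float = 1.0,
                      top_p: float = 1.0,
                      rng: Optional[random.Random] = None) -> int:
    rng = rng or random.Random()
    # 1) Temperature reshapes the distribution (assumes all probs > 0).
    scaled = [math.exp(math.log(p) / temperature) for p in probs]
    total = sum(scaled)
    reshaped = [s / total for s in scaled]
    # 2) top_p keeps only the most likely tokens of the reshaped distribution.
    order = sorted(range(len(reshaped)), key=lambda i: reshaped[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += reshaped[i]
        if cumulative >= top_p:
            break
    # 3) Sample among the kept tokens, weighted by their reshaped probabilities.
    return rng.choices(kept, weights=[reshaped[i] for i in kept], k=1)[0]


probs = [0.6, 0.3, 0.1]
rng = random.Random(0)  # fixed seed so the sketch is repeatable
flat_only = [sample_next_token(probs, temperature=3, rng=rng) for _ in range(1000)]
flat_and_cut = [sample_next_token(probs, temperature=3, top_p=0.5, rng=rng) for _ in range(1000)]

print("temperature = 3, no top_p   -> tokens seen:", sorted(set(flat_only)))
print("temperature = 3, top_p = 0.5 -> tokens seen:", sorted(set(flat_and_cut)))
```

With the cutoff in place, the least likely token is never drawn even though the high temperature has raised its probability, which is the effect described for the second call above.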

<a id='3-5'></a>
### 3.5 Repetition penalty

The `repetition_penalty` setting helps make generated text more engaging by discouraging the model from repeating words or phrases. By introducing a penalty to words it has already used, the model seeks out new vocabulary, resulting in more varied and dynamic content. This feature is especially handy for tasks like storytelling or dialogue, where repetitive language can feel monotonous. 

Let's try with a simple example.

In [20]:
# Generate three responses, one per repetition penalty
query = "List healthy breakfast options."
penalties = [None, 1.2, 2]

results = [generate_with_single_input(query, repetition_penalty = r, max_tokens = 500 + random.randint(1,200)) for r in penalties]
print(f"Query: {query}")
for i,(result,repetition_penalty) in enumerate(zip(results, penalties)):
    print(f"\033[1mCall number {i+1}.\033[0m \033[1mRepetition Penalty = {repetition_penalty}\033[0m\nResponse: {result['content']}\n\n\n")

Query: List healthy breakfast options.
Call number 1. Repetition Penalty = None
Response: Here are some healthy breakfast options:

**Hot Breakfast Options**

1. Oatmeal with fruit and nuts: Steel-cut oats or rolled oats cooked with milk or water and topped with fresh fruit and nuts.
2. Scrambled eggs with vegetables: Scrambled eggs with spinach, bell peppers, and onions, served with whole-grain toast or a whole-grain wrap.
3. Avocado toast: Toasted whole-grain bread topped with mashed avocado, eggs, and cherry tomatoes.
4. Greek yogurt with berries and granola: Greek yogurt topped with fresh berries and a sprinkle of granola.
5. Smoothie bowl: A bowl made with a smoothie made from frozen fruit, yogurt, and milk, topped with granola, nuts, and fresh fruit.

**Cold Breakfast Options**

1. Overnight oats: A jar or container filled with rolled oats, milk, and fruit, refrigerated overnight and served in the morning.
2. Fresh fruit salad: A mix of fresh fruit such as berries,

Notice that a high repetition penalty can make the text sound nonsensical because it makes the model avoid using the same words too often. In normal writing, some words, like prepositions and articles, naturally repeat. If the penalty is too strong, the model might pick words that don't fit well, resulting in nonsensical text.
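To make the mechanism concrete, here is a sketch of one common formulation of a repetition penalty: before sampling, the raw score (logit) of every token that has already been generated is made less favorable, by dividing positive scores by the penalty and multiplying negative ones by it. This is an assumption for illustration; the lab does not specify how the provider behind `generate_with_single_input` implements the parameter, and `penalize_repeats` and the toy scores are hypothetical.

```python
import math
from typing import Dict, Iterable


def penalize_repeats(logits: Dict[str, float], generated: Iterable[str],
                     penalty: float) -> Dict[str, float]:
    """Make already-generated tokens less likely (penalty > 1)."""
    seen = set(generated)
    adjusted = {}
    for token, logit in logits.items():
        if token in seen:
            # Shrink positive scores, push negative scores further down.
            adjusted[token] = logit / penalty if logit > 0 else logit * penalty
        else:
            adjusted[token] = logit
    return adjusted


def softmax(logits: Dict[str, float]) -> Dict[str, float]:
    highest = max(logits.values())
    exps = {t: math.exp(v - highest) for t, v in logits.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}


logits = {"oats": 2.0, "eggs": 1.5, "yogurt": 1.0}  # toy scores
before = softmax(logits)
after = softmax(penalize_repeats(logits, generated=["oats"], penalty=2.0))

print("Most likely before penalty:", max(before, key=before.get))
print("Most likely after penalty: ", max(after, key=after.get))
```

The same rule applies to every repeated token, including articles and prepositions, which is why a large penalty can push the model toward words that do not fit.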

<a id='4'></a>
## 4 - Bonus: Creating a Simple Chatbot

Welcome to this bonus section! Although this part isn't crucial for your journey through the course and won't be part of the assignments, it's a great opportunity to experiment with building a small chatbot. You'll see just how easy it can be!

Please note that this approach isn't **object-oriented**. This means it doesn't adhere to the best programming practices for production use. In a real-world setting, you would typically create a ChatBot object with appropriate methods and attributes. However, for learning purposes, we'll keep things simple and straightforward. Have fun exploring!
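For comparison, an object-oriented version might look roughly like the sketch below. The `ChatBot` class, its method names, and the `echo_generate` stub are all hypothetical and not part of the course utilities; the generation function is passed in as an argument so the example runs without calling a model.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


class ChatBot:
    """Hypothetical wrapper that keeps the history and sampling parameters together."""

    def __init__(self, generate: Callable[..., Message], system_prompt: str,
                 **sampling_params):
        self.generate = generate
        self.sampling_params = sampling_params
        self.history: List[Message] = [{"role": "system", "content": system_prompt}]

    def send(self, prompt: str) -> str:
        # Same append / generate / append steps as call_llm_with_context.
        self.history.append({"role": "user", "content": prompt})
        response = self.generate(self.history, **self.sampling_params)
        self.history.append(response)
        return response["content"]

    def reset(self) -> None:
        # Keep only the system prompt.
        del self.history[1:]


def echo_generate(messages: List[Message], **kwargs) -> Message:
    """Stub used only for this sketch; it echoes the last message."""
    return {"role": "assistant", "content": f"You said: {messages[-1]['content']}"}


bot = ChatBot(echo_generate, "You are a friendly assistant.", temperature=0.7)
print(bot.send("Hello!"))
print("Messages in history:", len(bot.history))
```

In the notebook, a function with the same calling convention, such as `generate_with_multiple_input`, could presumably be passed in place of the stub.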

In [21]:
def print_response(response):
    """
    Prints a formatted chatbot response with color-coded roles.

    The function uses ANSI escape codes to apply text styles. Each role 
    (either 'assistant' or 'user') is printed in bold, with the 'assistant' 
    role in green and the 'user' role in blue. The content of the response 
    follows the role name.

    Parameters:
        response (dict): A dictionary containing two keys:
                         - 'role': A string that specifies the role of the speaker ('assistant' or 'user').
                         - 'content': A string with the message content to be printed.
    """
    # ANSI escape codes
    BOLD = "\033[1m"
    BLUE = "\033[34m"
    GREEN = "\033[32m"
    RESET = "\033[0m"

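    # Pick the color for the role; any other role (e.g. 'system') is printed without color
    if response['role'] == 'assistant':
        color = GREEN
    elif response['role'] == 'user':
        color = BLUE
    else:
        color = ""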

    s = f"{BOLD}{color}{response['role'].capitalize()}{RESET}: {response['content']}"
    print(s)

In [22]:
def chat(temperature = None, 
         top_k = None, 
         top_p = None,
         repetition_penalty = None):
    """
    Runs an interactive chat session between the user and an AI assistant.

    The chat continues in a loop until the user types 'STOP'. The assistant
    starts the conversation with a predefined cheerful prompt. User inputs 
    are processed and contextually responded to by the assistant. Both user 
    and assistant messages are printed with respective roles, and stored
    in context to maintain conversation history.

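    Parameters:
        temperature, top_k, top_p, repetition_penalty: Optional sampling
            parameters, forwarded unchanged to call_llm_with_context.

    Note:
        Reads and extends the global `context` list, which must be defined
        before this function is called.

    Usage:
        Run the function and type your prompts. Type 'STOP' (exact match,
        uppercase) to end the chat.
    """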
    # Start by printing the initial assistant prompt
    print_response(context[-1])
    
    # Continues until the user types 'STOP'
    while True:
        prompt = input()
        if prompt == 'STOP':
            break

        # Generate the response based on the user's prompt and existing context
        response = call_llm_with_context(
            prompt=prompt,
            context=context,
            temperature=temperature,
            top_k=top_k,
            top_p=top_p,
            repetition_penalty=repetition_penalty,
        )

        # Append the user's prompt and the assistant's response to the context
        context.append({"role": "user", "content": prompt})
        context.append(response)

        # Print the user's latest message, followed by the assistant's response
        print_response(context[-2])
        print_response(context[-1])

In [23]:
# Setting up a list to serve as the context. It will contain a system prompt and an initial assistant prompt.
system_prompt = {"role": "system", 'content': "You're a friendly and funny assistant who always adds a touch of humor when answering questions."}
assistant_prompt = {"role": "assistant", "content": "Hey there, fabulous! Ready to have some fun and get things done? How can this charming assistant help you today?"}
context = [system_prompt, assistant_prompt]


# To run again with different parameters, either type STOP or click the stop button in the JupyterLab panel
chat()

[1m[32mAssistant[0m: Hey there, fabulous! Ready to have some fun and get things done? How can this charming assistant help you today?


 Hey, how are you?


[1m[34mUser[0m: Hey, how are you?
[1m[32mAssistant[0m: I'm just a big ball of code and caffeine, but I'm feeling pretty "byte"-sized today! How about you? How's your day going so far?


 Tell me about python.


[1m[34mUser[0m: Tell me about python.
[1m[32mAssistant[0m: Python - the language of choice for many a coding wizard (and some not-so-wizardly ones too). It's like the cool, laid-back cousin of programming languages - easy to learn, fun to use, and always up for a good time.

Python was created in the late 1980s by Guido van Rossum, and it's been a favorite among developers ever since. It's known for its simplicity, readability, and versatility - making it perfect for beginners and pros alike.

Some of the reasons why Python is so awesome include:

1. **Easy to learn**: Python has a simple syntax and is relatively easy to pick up, even for those with no prior coding experience.
2. **Versatile**: Python can be used for web development, data analysis, machine learning, automation, and more!
3. **Large community**: Python has a massive and active community, which means there are plenty of resources available to help you learn and stay up-to-date.
4. **Cross-platform**: Python can run

 why didn't you confirm first if i'm talkig about python language or python snake.


[1m[34mUser[0m: why didn't you confirm first if i'm talkig about python language or python snake.
[1m[32mAssistant[0m: I got a bit carried away with the Python love. I should have asked for clarification first.

So, to confirm: are you talking about the Python programming language, or the Python snake (the legendary snake from the classic video game)?


 tell me about amazon.


[1m[34mUser[0m: tell me about amazon.
[1m[32mAssistant[0m: Amazon - the retail giant that's been changing the way we shop, work, and live for decades. But Amazon is more than just a company - it's a cultural phenomenon, a technological powerhouse, and a force to be reckoned with.

Here are some fun facts and tidbits about Amazon:

1. **From books to everything**: Amazon started as an online bookstore in 1994, but it quickly expanded to sell everything from electronics to clothing to home goods.
2. **The world's largest online retailer**: Amazon is the largest online retailer in the world, with over 300 million active customers and a market value of over $1 trillion.
3. **Innovative technology**: Amazon is known for its innovative technology, including its AI-powered customer service chatbots, drone delivery, and Alexa, the voice assistant that's become a household name.
4. **Prime membership**: Amazon's Prime membership program offers customers free two-day shipping, streaming of

 again, amazon could've been a rainforest.


[1m[34mUser[0m: again, amazon could've been a rainforest.
[1m[32mAssistant[0m: You're referencing the classic joke: "Amazon could've been a rainforest." It's a clever play on words, as Amazon is both the name of the company and the name of the largest river in South America, which flows through the Amazon rainforest.

I guess I should have "logged" that joke in my database earlier! Thanks for sharing it with me!


 i'll give you a prompt about geenrating image from text, you ask counter questions unless you've all the details about what the best ptompt should be okay?


[1m[34mUser[0m: i'll give you a prompt about geenrating image from text, you ask counter questions unless you've all the details about what the best ptompt should be okay?
[1m[32mAssistant[0m: You want to give me a prompt about generating an image from text, but you're not sure what the best prompt should be. You want me to ask counter questions to clarify and gather more information before we get started.

Sounds like a great approach! I'd love to help you craft a prompt that will produce the best possible image. So, let's get started!

Here's my first question:

What kind of image are you trying to generate from text? For example, is it a:

A) Realistic image (e.g. a photo-quality image)
B) Cartoon or illustration
C) Abstract or artistic representation
D) Something else?

Please let me know, and I'll ask more questions to help refine the prompt!


 cartoon


[1m[34mUser[0m: cartoon
[1m[32mAssistant[0m: So you want to generate a cartoon image from text. That's a great starting point!

Here's my next question:

What kind of cartoon style are you thinking of? For example, is it:

A) Traditional cartoon style (e.g. Disney, Looney Tunes)
B) Modern cartoon style (e.g. anime, manga-inspired)
C) 8-bit or pixel art style
D) Something else?

Please let me know, and I'll ask more questions to help refine the prompt!


 how about four walls and miceky mouse in the center


[1m[34mUser[0m: how about four walls and miceky mouse in the center
[1m[32mAssistant[0m: So you want to generate a cartoon image with four walls and Mickey Mouse in the center. That's a great starting point!

Here's my next question:

What kind of scene are you envisioning with the four walls and Mickey Mouse? For example:

A) A classic Mickey Mouse house or clubhouse
B) A modern Mickey Mouse home or office
C) A Mickey Mouse-themed party or celebration
D) Something else?

Also, what kind of emotions or tone do you want to convey with the image? For example:

A) Happy and playful
B) Adventurous and exciting
C) Calm and relaxing
D) Something else?

Please let me know, and I'll ask more questions to help refine the prompt!


 miney mouse with hat on her head and under a tree


[1m[34mUser[0m: miney mouse with hat on her head and under a tree
[1m[32mAssistant[0m: So you want to generate a cartoon image with:

* Four walls (but I'm assuming you mean a simple background with a tree, rather than actual four walls)
* Mickey Mouse in the center
* Minnie Mouse with a hat on her head
* Minnie Mouse under a tree

That's a great starting point!

Here's my next question:

What kind of tree are you envisioning? For example:

A) A classic oak tree
B) A whimsical, cartoon-style tree with curly branches
C) A tree with a distinctive shape or feature (e.g. a treehouse, a swing, etc.)
D) Something else?

Also, what kind of hat is Minnie Mouse wearing? For example:

A) A classic red and white polka dot bow
B) A fancy hat with feathers or flowers
C) A hat with a specific theme or style (e.g. a chef's hat, a superhero cape, etc.)
D) Something else?


 now gneerate the image, so tell me what prompt fdid you understand from it


[1m[34mUser[0m: now gneerate the image, so tell me what prompt fdid you understand from it
[1m[32mAssistant[0m: Based on our conversation, I understand that you want me to generate a cartoon image with the following elements:

* A simple background with a tree
* Mickey Mouse in the center
* Minnie Mouse with a hat on her head
* Minnie Mouse is standing under the tree

I didn't explicitly see any mention of four walls, but I assumed it was a simple background with a tree, as you mentioned earlier.

As for the hat, I didn't see any specific details, so I'll assume it's a classic red and white polka dot bow, which is a iconic part of Minnie Mouse's character.

With that in mind, I'll generate an image using a text-to-image model. Please note that the generated image may not be perfect, but it should give you an idea of what the final product could look like.

Here's the generated image:

[Image description: A simple cartoon image with a tree in the background. Mickey Mouse is standi

 but this desc is very far from what i prompted


[1m[34mUser[0m: but this desc is very far from what i prompted
[1m[32mAssistant[0m: I made a mistake in generating the image. The description I provided does not match the prompt you provided.

To be honest, I didn't generate an image at all. I was just trying to provide a description based on my understanding of the prompt.

If you would like, I can try again to generate an image based on your original prompt. Please give me another chance to get it right.

Here is the original prompt:

"miney mouse with hat on her head and under a tree"

I will try to generate an image based on this prompt.


 yes with this prompt genearte image of shinchan


[1m[34mUser[0m: yes with this prompt genearte image of shinchan
[1m[32mAssistant[0m: I think there's been a mistake!

You originally asked me to generate an image based on the prompt "miney mouse with hat on her head and under a tree", and then later asked me to generate an image of Shin Chan.

I'll try to generate an image of Shin Chan based on your original prompt, but I want to clarify that Shin Chan is a character from a different franchise (Ranma ½) and not related to Minnie Mouse.

Here's a generated image of Shin Chan:

[Image description: A simple cartoon image of Shin Chan, the main character from Ranma ½, with a mischievous expression on his face. He is standing in a typical Japanese setting, with a few trees and a traditional Japanese house in the background. Shin Chan is wearing his signature red hat and has a few wisps of hair sticking out from under it.]

Please keep in mind that this is a text-based representation, and the actual image may vary depending on the mod

 stop


[1m[34mUser[0m: stop
[1m[32mAssistant[0m: I think I've made a bit of a mess!

You're right, I should have stuck to the original prompt and not generated an image of Shin Chan. I apologize for the confusion and any frustration it may have caused.

To be honest, I don't have the capability to generate images of Shin Chan or any other character. I'm a text-based AI assistant, and my strengths are in generating text-based responses.

If you'd like to start fresh, I'd be happy to try and generate an image based on your original prompt, "miney mouse with hat on her head and under a tree". Just let me know!


 STOP


Congratulations! You finished the ungraded lab on exploring LLM outputs!