# Unit 2

## Exploring Temperature Sensitivity in LLM Outputs

## Introduction: What Is Temperature in LLMs?

Welcome back! In the previous lesson, you learned how to measure and interpret token usage in large language models (LLMs). That knowledge is important for understanding how efficiently a model processes information. In this lesson, we will focus on a different aspect of LLM behavior: the **temperature** parameter.

The **temperature** setting is a key way to control how creative or predictable a model’s responses are. When you send a prompt to an LLM, the model can generate many possible outputs. The **temperature** parameter lets you adjust how much randomness the model uses when picking its next word or phrase. A low **temperature** makes the model more focused and deterministic, while a higher **temperature** encourages more variety and creativity in the responses.

Understanding **temperature** is important for benchmarking because it helps you see how the same model can behave differently depending on this setting. By the end of this lesson, you will know how to run simple experiments to observe these differences and interpret what they mean for your use case.

-----

## How Temperature Changes Model Responses

The **temperature** parameter is a number, usually between 0 and 2, that you can set when generating text with an LLM. When the **temperature** is set to **0**, the model picks the most likely next word at each step, making its responses very predictable and consistent from run to run. As you increase the **temperature**, the model becomes more willing to take risks and choose less likely words, which can make its responses more creative or surprising.

If you set the **temperature** above **1**, the model becomes even more random and unpredictable. Such a value is still valid and the model will generate output, but the responses may start to lose coherence or relevance, because the model is more likely to select unusual or unexpected words. This can be useful for brainstorming or creative writing, but may not be suitable if you need reliable or factual answers.

On the other hand, setting the **temperature** to a negative value is not valid. Most LLM APIs will return an error or ignore the setting if you try to use a negative **temperature**. Always use a value of **0** or higher to ensure the model behaves as expected.
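
To make the three cases above concrete (zero, positive, and negative values), here is a minimal sketch of how temperature is commonly applied: the raw scores for each candidate word are divided by the temperature before being turned into probabilities. The function name `temperature_probs` and the example scores are assumptions made up for illustration; real APIs do this internally and may differ in detail.

```python
import math

def temperature_probs(logits, temperature):
    """Turn raw scores into next-word probabilities at a given temperature."""
    if temperature < 0:
        # Mirrors the rule above: negative temperatures are not valid.
        raise ValueError("temperature must be 0 or higher")
    if temperature == 0:
        # Greedy limit: all probability goes to the highest-scoring word.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical scores for three candidate next words.
logits = [2.0, 1.0, 0.5]
for t in [0.0, 0.7, 1.2]:
    probs = temperature_probs(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature rises, the top word's share of the probability shrinks and the less likely words gain ground, which is where the extra variety comes from.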

-----

## Example: Comparing Responses at Different Temperatures

Let’s look at a practical example to see how **temperature** affects model outputs. In this example, you will use the OpenAI Python client to send the same prompt to the model three times, each with a different **temperature** setting. The prompt asks the model to describe a completely fictional animal found in a magical forest.

Here is the code you will use:

```python
from openai import OpenAI
client = OpenAI()
prompt = "Describe a completely fictional animal found in a magical forest."
temperatures = [0.0, 0.7, 1.2]
for temp in temperatures:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=temp,
        max_tokens=150
    )
    reply = response.choices[0].message.content.strip()
    print(f"\nTemperature: {temp}")
    print(f"Response: {reply}\n")
```

In this code, you first define your **prompt** and a list of three **temperature** values: **0.0**, **0.7**, and **1.2**. For each **temperature**, you send the **prompt** to the model and print out the response. The only thing that changes between each run is the **temperature** value.

When you run this code, you might see output like the following (your results may vary):

```
Temperature: 0.0
Response: The Glimmerfox is a small, agile creature with shimmering silver fur and bright blue eyes. It is known for its ability to blend into the moonlit mist of the magical forest, making it nearly invisible to predators and travelers alike.

Temperature: 0.7
Response: The Glimmerfox is a rare animal with iridescent fur that changes color depending on its mood. It has long, feathery ears and a tail that glows softly in the dark. The Glimmerfox is said to bring good luck to anyone who spots it in the magical forest.

Temperature: 1.2
Response: Deep in the magical forest lives the Whimsyhorn, a floating, jelly-like creature with rainbow stripes and a single spiraled antler. It sings lullabies to the trees at night and leaves trails of sparkling dust wherever it drifts.
```

Notice how the response at **temperature 0.0** is very straightforward and safe, while the response at **0.7** adds more creative details. At **1.2**, the model invents a completely new animal with imaginative features. This shows how increasing the **temperature** leads to more diverse and creative outputs.
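
If you want a rough number to go with this impression, one option is to measure how many distinct words two responses share. The sketch below uses Jaccard overlap on lowercase words; that metric and the helper names `word_set` and `jaccard_overlap` are our own choices for illustration, not part of the lesson's tooling.

```python
import re

def word_set(text):
    """Lowercase the text and return its set of distinct words."""
    return set(re.findall(r"[a-z']+", text.lower()))

def jaccard_overlap(a, b):
    """Share of distinct words that two texts have in common (0.0 to 1.0)."""
    words_a, words_b = word_set(a), word_set(b)
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)

# Opening sentences from the sample output above.
low = "The Glimmerfox is a small, agile creature with shimmering silver fur and bright blue eyes."
mid = "The Glimmerfox is a rare animal with iridescent fur that changes color depending on its mood."
high = "Deep in the magical forest lives the Whimsyhorn, a floating, jelly-like creature with rainbow stripes and a single spiraled antler."

print("0.0 vs 0.7:", round(jaccard_overlap(low, mid), 2))
print("0.0 vs 1.2:", round(jaccard_overlap(low, high), 2))
```

A lower overlap with the temperature 0.0 response suggests the output has drifted further from the "default" answer. A single pair of samples proves little, so treat this as a quick sanity check rather than a benchmark.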

-----

## Summary And What’s Next

In this lesson, you learned what the **temperature** parameter is and how it affects the behavior of large language models. You saw that low **temperature** values make the model’s responses more predictable, while higher **temperature** values encourage creativity and variety. You also worked through a code example that compared model outputs at different **temperature** settings, helping you see these effects in action.

Next, you will get a chance to practice running your own **temperature** experiments. Try different prompts and **temperature** values to see how the model’s responses change. This hands-on practice will help you build intuition for when to use different **temperature** settings in your own projects.

## Comparing Low and High Temperature Outputs

Now that you understand what the temperature parameter is, let's see it in action! In this exercise, you'll experiment with how different temperature values affect the creativity of LLM outputs.

You'll start with a script that uses a low temperature (0.0) to generate a description of a fictional animal. After observing the initial output, you'll modify the code to use a higher temperature (1.2) instead.

Your tasks are:

1. Run the code with the initial low temperature setting.
2. Change the `temperatures` list to use the higher value (1.2).
3. Run the code again and observe the differences.
4. Add a print statement explaining what you notice about the outputs.

By comparing these outputs directly, you'll develop a better intuition for how temperature controls the balance between predictability and creativity in LLM responses.

```python
from openai import OpenAI

client = OpenAI()

prompt = "Describe a completely fictional animal found in a magical forest."

# TODO: After running the code with this temperature value,
# replace it with a higher value (1.2) to see how the output changes
temperatures = [0.0]

for temp in temperatures:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=temp,
        max_tokens=150
    )
    reply = response.choices[0].message.content.strip()
    print(f"\nTemperature: {temp}")
    print(f"Response: {reply}\n")
```

### How Temperature Affects LLM Outputs: A Practical Example

In this exercise, you will run a Python script to see the effect of the `temperature` parameter on the creativity and predictability of a large language model's output. You will start with a low temperature and then increase it to see the difference.

**Step 1: Run the code with a low temperature (0.0)**
When you run the provided script with `temperatures = [0.0]`, the model is highly deterministic. It will likely produce a very similar, if not identical, response each time it is run.

**Example Output with temperature 0.0:**

```
Temperature: 0.0
Response: The Luminafox is a small, ethereal creature with shimmering silver fur and large, luminous blue eyes. It is known for its ability to absorb and emit moonlight, allowing it to navigate the darkest parts of the magical forest with ease.
```

The output is matter-of-fact and straightforward, almost like an entry in a bestiary. It's a "safe" and predictable response.
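
To check the "very similar, if not identical" claim for yourself, you could repeat the same request several times and count how many distinct replies come back. The sketch below assumes a hypothetical zero-argument callable, `generate`, that stands in for the API call; the two stand-in generators are made up so the example runs without network access.

```python
import random
from collections import Counter

def count_distinct(generate, runs=5):
    """Call generate() several times and tally how often each reply appears."""
    return Counter(generate() for _ in range(runs))

# Stand-in generators (assumptions for illustration, not real model calls).
def fixed_reply():
    return "The Luminafox is a small, ethereal creature."

rng = random.Random(0)

def varied_reply():
    return rng.choice(["Luminafox", "Gleamwing", "Whimsyhorn"])

print(len(count_distinct(fixed_reply)))  # 1: every run returned the same text
print(count_distinct(varied_reply))
```

To try this against a real model, pass a function that wraps the `client.chat.completions.create` call from the script above and returns the reply text.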

**Step 2: Change the temperature to a higher value (1.2)**
Now, you will edit the script to change the `temperatures` list to `[1.2]`. A higher temperature value encourages the model to be more random and creative.

```python
from openai import OpenAI

client = OpenAI()

prompt = "Describe a completely fictional animal found in a magical forest."

# Higher temperature (1.2), replacing the initial value of 0.0
temperatures = [1.2]

for temp in temperatures:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=temp,
        max_tokens=150
    )
    reply = response.choices[0].message.content.strip()
    print(f"\nTemperature: {temp}")
    print(f"Response: {reply}\n")
```

**Step 3: Run the code again and observe the differences**
When you run the modified script, you'll see a very different kind of output. The model is now more willing to "take risks" and choose less probable words, resulting in a more imaginative and less predictable response.

**Example Output with temperature 1.2:**

```
Temperature: 1.2
Response: Deep within the Whisperwood glades lives the Gleamwing, a feline-like being with transparent, moth-like wings that hum with starlight. Its body is a constellation of moss-green and opalescent scales, and it communicates by chirping echoes that ripple through the forest, revealing lost memories.
```

**Step 4: Explain what you notice**
Now, add a print statement to your code to explain your observations.

```python
from openai import OpenAI

client = OpenAI()

prompt = "Describe a completely fictional animal found in a magical forest."

temperatures = [1.2]

for temp in temperatures:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=temp,
        max_tokens=150
    )
    reply = response.choices[0].message.content.strip()
    print(f"\nTemperature: {temp}")
    print(f"Response: {reply}\n")
    print("Observation: The output with the higher temperature is far more creative and unpredictable, inventing a new type of creature with unique characteristics like a 'Gleamwing' and 'chirping echoes that ripple through the forest.' The low-temperature output was more predictable and descriptive.")
```

By completing this exercise, you'll see firsthand how the `temperature` parameter is a powerful tool for controlling the balance between predictability and creativity in LLM outputs.

[ChatGPT in Sheets: how to use OpenAI temperature parameter](https://www.youtube.com/watch?v=51HjX7eLAik)
This video explains what the temperature parameter is and provides a demonstration of how it can be used to control the creativity of AI-generated content.

## Exploring the Temperature Creativity Spectrum

Excellent work comparing low and high temperature settings! Now, let's take your understanding a step further by exploring the full spectrum of temperature values.

In this exercise, you'll expand your experiment to observe how LLM outputs change gradually across multiple temperature settings.

You'll start with code that uses just one temperature value (0.0). Your task is to modify the temperature list to include several values (0.0, 0.3, 0.7, and 1.2), allowing you to see the progressive shift from predictable to creative responses.

When you run your updated code, pay attention to how:

* Responses at 0.0 remain focused and consistent
* Middle values (0.3 and 0.7) show increasing creativity
* The highest value (1.2) produces the most varied and imaginative output

This side-by-side comparison will give you practical insight into selecting the right temperature setting for different applications, whether you need factual consistency or creative exploration.

```python
from openai import OpenAI

client = OpenAI()

prompt = "Describe a completely fictional animal found in a magical forest."

# TODO: Expand this list to include multiple temperature values (0.0, 0.3, 0.7, 1.2)
# This will let you observe how responses gradually change from predictable to creative
temperatures = [0.0]

for temp in temperatures:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=temp,
        max_tokens=150
    )
    reply = response.choices[0].message.content.strip()
    print(f"\nTemperature: {temp}")
    print(f"Response: {reply}\n")
```

To complete this exercise, you need to modify the `temperatures` list in the provided code to include the values `0.0`, `0.3`, `0.7`, and `1.2`. This will allow you to see the progressive change in the model's output as the `temperature` increases.

### Modified Code

Here is the updated code with the expanded `temperatures` list:

```python
from openai import OpenAI

client = OpenAI()

prompt = "Describe a completely fictional animal found in a magical forest."

# Multiple temperature values, ranging from predictable (0.0) to creative (1.2)
temperatures = [0.0, 0.3, 0.7, 1.2]

for temp in temperatures:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=temp,
        max_tokens=150
    )
    reply = response.choices[0].message.content.strip()
    print(f"\nTemperature: {temp}")
    print(f"Response: {reply}\n")
```

### Running the Code

When you run this script, the loop will execute four times, once for each temperature value. The output you see will demonstrate the "creativity spectrum" in action. Your exact results may vary, but the general pattern will be consistent:

  * **Temperature: 0.0**

      * The output will be the most predictable and straightforward. It will likely describe a logical, almost "standard" fantastical creature. This is the setting for reliable, consistent responses.

  * **Temperature: 0.3**

      * The response will be slightly more creative than at 0.0. It might add some unique details or a more imaginative name, but it will still be grounded and coherent.

  * **Temperature: 0.7**

      * Here, the creativity will become more pronounced. The model will start to introduce more surprising elements or behaviors for the creature, moving further away from a "safe" response. This is a good setting for a balance of creativity and coherence.

  * **Temperature: 1.2**

      * This is where the model is most random. The output will likely be highly imaginative, inventing a creature with unusual, whimsical, or even surreal characteristics. This setting is ideal for brainstorming and generating highly unique content.

By running this code, you will get a clear, side-by-side view of how a simple change to the `temperature` parameter can dramatically alter the tone and creativity of a large language model's output.
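
You can also see the spectrum without calling a model at all. The toy simulation below draws 1,000 "next words" at each of the four temperatures and reports how often the top-scoring word wins. The word list, the scores, and the helper `sample_words` are all invented for illustration; a real model chooses among thousands of tokens at every step.

```python
import math
import random

def sample_words(scores, temperature, n, rng):
    """Draw n words; scores maps each word to a made-up raw score."""
    words = list(scores)
    if temperature == 0:
        # Greedy: always take the top-scoring word.
        return [max(words, key=scores.get)] * n
    # Weights proportional to exp(score / temperature); random.choices
    # normalizes them, so no explicit division by the sum is needed.
    weights = [math.exp(scores[w] / temperature) for w in words]
    return rng.choices(words, weights=weights, k=n)

scores = {"fox": 3.0, "owl": 2.0, "slug": 1.0, "jellyfish": 0.0}  # hypothetical
rng = random.Random(42)  # fixed seed so repeated runs match

for t in [0.0, 0.3, 0.7, 1.2]:
    picks = sample_words(scores, t, 1000, rng)
    share = picks.count("fox") / len(picks)
    print(f"temperature={t}: top word chosen {share:.0%} of the time")
```

The top word's share should fall steadily as the temperature climbs, which mirrors the gradual shift from predictable to creative that you see in the model's text.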

## Comparing Models at Same Temperature

You've done a fantastic job exploring how temperature affects a single model's outputs! Now, let's add another dimension to your understanding: comparing how different models respond to the same temperature setting.

In this exercise, you'll modify the code to compare GPT-3.5 Turbo and GPT-4 using the same prompt and temperature value, and you'll also print out the token usage for each response.

Your tasks are:

1. Add GPT-4 to the models list.
2. Update the code to loop through both models.
3. Use a consistent mid-range temperature (0.8).
4. For each response, print the model name, temperature, response, and token usage (prompt, completion, and total tokens).
5. Run the code and carefully observe the differences between models.

When comparing the outputs, look for variations in the level of detail, creativity, writing style, and how closely each model follows the prompt. Also, pay attention to the token usage for each model, since this can give you insight into how efficiently each model generates its responses. This comparison will help you understand that temperature sensitivity isn't just about the parameter itself: the model architecture also plays a crucial role in determining how "creative" the output becomes.

```python
from openai import OpenAI

client = OpenAI()

prompt = "Invent a completely new holiday and describe how people celebrate it."

# TODO: Add a second model (gpt-4) to this list to compare responses
models = ["gpt-3.5-turbo"]

# Using a mid-range temperature
temperature = 0.8

# TODO: Update this code to loop through both models and print token usage
response = client.chat.completions.create(
    model=models[0],
    messages=[{"role": "user", "content": prompt}],
    temperature=temperature
)
reply = response.choices[0].message.content.strip()
usage = response.usage
print(f"\nModel: {models[0]}")
print(f"Temperature: {temperature}")
print(f"Response: {reply}\n")
print(f"Token Usage: prompt={usage.prompt_tokens}, completion={usage.completion_tokens}, total={usage.total_tokens}")
```

Excellent! This is a great next step to see how the underlying model architecture impacts the output, even with the same temperature setting.

Here is the updated code that will loop through both `gpt-3.5-turbo` and `gpt-4`, using a consistent temperature of `0.8` and printing the token usage for each.

### Modified Code

```python
from openai import OpenAI

client = OpenAI()

prompt = "Invent a completely new holiday and describe how people celebrate it."

# Both models to compare
models = ["gpt-3.5-turbo", "gpt-4"]

# Using a consistent mid-range temperature
temperature = 0.8

# Loop through both models and print token usage for each
for model_name in models:
    response = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature
    )
    reply = response.choices[0].message.content.strip()
    usage = response.usage
    
    print(f"\nModel: {model_name}")
    print(f"Temperature: {temperature}")
    print(f"Response: {reply}\n")
    print(f"Token Usage: prompt={usage.prompt_tokens}, completion={usage.completion_tokens}, total={usage.total_tokens}")

```

### Running the Code

When you run this script, you will see the output for both models. Pay close attention to the differences in their responses. Here is an example of the kind of output you might get (your results may vary):

**Example Output with GPT-3.5 Turbo:**

```
Model: gpt-3.5-turbo
Temperature: 0.8
Response: "The Festival of Aetherbloom" is a holiday celebrated on the last day of spring, when the air is filled with the ethereal pollen of the Aetherbloom flower. People celebrate by creating "Aether Lanterns" out of transparent paper and dried Aetherbloom petals. They release these glowing lanterns into the night sky, symbolizing the release of old energies and welcoming new growth. Families gather in open fields to share stories and a sweet nectar called "Aether-Dew," believing it imbues them with creativity for the coming season. The night concludes with a communal dance under the light of the Aether Lanterns, celebrating the beauty and renewal of nature.

Token Usage: prompt=21, completion=99, total=120
```

**Example Output with GPT-4:**

```
Model: gpt-4
Temperature: 0.8
Response: "The Festival of Lumen" is a new holiday that marks the summer solstice, a day dedicated to celebrating the inner light and potential within every individual. On this day, communities gather at sunrise, each person carrying a small, unlit candle. As the sun crests the horizon, a designated "Lumen-bearer" lights their candle from a central flame, and this light is passed from person to person until every candle is aglow. This ritual, called the "Chain of Illumination," symbolizes how one person's light can inspire and brighten the path of others. The day is spent in reflection and community service, where people perform acts of kindness to "share their light." In the evening, the candles are arranged in intricate patterns on the ground, creating a glowing mural of shared intentions. The celebration culminates in a feast of light-themed foods like citrus salads and sun-baked breads, followed by a communal bonfire where people share stories of personal growth and inspiration. The Festival of Lumen serves as a reminder that even in the darkest times, we all carry a light within us that can be shared to illuminate the world.

Token Usage: prompt=21, completion=175, total=196
```

### Observation

When you compare the outputs, you'll likely notice several key differences:

  * **Creativity and Detail:** While both models follow the prompt, the GPT-4 response tends to be more detailed, nuanced, and structurally complex. It often invents more sophisticated rituals and deeper symbolic meanings for the holiday.
  * **Token Usage:** The token usage for GPT-4 is often higher for a similar task. This is because GPT-4's responses are typically longer and more detailed, which requires more tokens to generate. This highlights the trade-off between model power, response quality, and computational cost.
  * **Style and Cohesion:** GPT-4's narrative often feels more polished and coherent, creating a richer and more fully realized concept for the holiday.
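
To compare token usage at a glance, you might collect the counts into a small table. The sketch below is one way to do that; the helper `summarize_usage` and the dictionary layout are assumptions for illustration, and the figures are copied from the sample output above, so your own runs will differ.

```python
def summarize_usage(usage_by_model):
    """Print token counts per model and return completion-token ratios
    relative to the first model listed."""
    names = list(usage_by_model)
    baseline = usage_by_model[names[0]]["completion"]
    ratios = {}
    for name in names:
        u = usage_by_model[name]
        ratios[name] = u["completion"] / baseline
        print(f"{name:15} prompt={u['prompt']:4} completion={u['completion']:4} "
              f"total={u['prompt'] + u['completion']:4} ratio={ratios[name]:.2f}x")
    return ratios

# Figures copied from the sample output above; real runs will differ.
usage = {
    "gpt-3.5-turbo": {"prompt": 21, "completion": 99},
    "gpt-4": {"prompt": 21, "completion": 175},
}
ratios = summarize_usage(usage)
```

In your own script, you would fill the dictionary from `response.usage` inside the loop instead of typing the numbers by hand.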

This exercise demonstrates that while `temperature` controls the randomness, the model's inherent capabilities are equally important in determining the quality and character of the final output.