# Unit 3

## Measuring Model Consistency Across Reruns

## Introduction: Why Consistency Matters in LLMs

Welcome back! In the last lesson, you explored how the **temperature** parameter affects the creativity and randomness of large language model (LLM) outputs. You saw that higher temperature values make responses more varied, while lower values make them more predictable. In this lesson, we will focus on a related but distinct concept: **model consistency**.

Model consistency refers to whether an LLM gives the same answer every time you ask it the same question, using the same settings. This is important for benchmarking because, in many applications, you want to know if the model is reliable and repeatable. If a model gives different answers to the same prompt under the same conditions, it can be hard to trust or evaluate its performance. By the end of this lesson, you will know how to check for consistency in LLM outputs and understand why this matters for your projects.

-----

## Temperature and Consistency: The Connection

As a quick reminder from the previous lesson, the **temperature** parameter controls how much randomness the model uses when generating text. When you set **temperature=0**, you are telling the model to always pick the most likely next token at each step. This setting is used when you want the model to be as deterministic as possible, which means it should, in principle, give the same answer every time for the same prompt.

For consistency testing, we use **temperature=0** because it removes sampling randomness from the model’s output. If the model still gives different answers with this setting, it means there is some underlying non-determinism in the model or the API, for example small floating-point differences in how the computation is carried out on the server. This is a key part of behavioral benchmarking, as it helps you understand the limits of model reliability.
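
To make the role of temperature more concrete, here is a minimal sketch of temperature-scaled softmax over a toy set of scores. The token names and score values are made up for illustration, and real APIs may handle **temperature=0** differently internally (for example, as plain greedy selection), so treat this as an intuition aid rather than a description of any provider's implementation.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; treat 0 as greedy argmax")
    scaled = [score / temperature for score in logits]
    # Subtract the max before exponentiating for numerical stability
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-token scores (illustrative values only)
tokens = ["Mercury", "Mars", "Jupiter"]
logits = [2.0, 1.5, 0.5]

for t in (1.0, 0.5, 0.1):
    probs = softmax_with_temperature(logits, t)
    summary = ", ".join(f"{tok}={p:.3f}" for tok, p in zip(tokens, probs))
    print(f"temperature={t}: {summary}")

# Greedy choice, which is what temperature=0 approximates
greedy_token = tokens[logits.index(max(logits))]
print("Greedy pick:", greedy_token)
```

As the temperature shrinks, almost all of the probability mass moves to the highest-scoring token, which is why low temperatures tend to produce repeatable output.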

-----

## Example: Testing Consistency with Repeated Prompts

Let’s look at a practical example to see how you can measure model consistency. In this example, you will use the OpenAI Python client to send the same prompt to the model five times, always with **temperature=0**. The prompt asks the model to name three planets in our solar system.

Here is the code you will use:

```python
from openai import OpenAI
client = OpenAI()
prompt = "Name three planets in our solar system."
responses = []

# Run the same prompt 5 times at temperature=0
for i in range(5):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0.0,
        messages=[{"role": "user", "content": prompt}]
    )
    answer = response.choices[0].message.content.strip()
    responses.append(answer)

# Display all responses
print("Model responses at temperature=0:")
for idx, r in enumerate(responses, 1):
    print(f"{idx}: {r}")

# Check if all responses are identical
unique_responses = set(responses)
if len(unique_responses) == 1:
    print("\n✅ The model was fully consistent.")
else:
    print("\n⚠️ The model produced different outputs.")
```

In this code, you first define your **prompt** and set up an empty list to store the responses. You then run a loop five times, each time sending the same prompt to the model with **temperature=0.0**. After each response, you extract the answer and add it to your list. Once all responses are collected, you print them out to see if they are the same. Finally, you check if all responses are identical by converting the list to a set and checking its length. If the set has only one item, the model was fully consistent; otherwise, it produced different outputs.

### Sample Output

When you run this code, you might see output like the following:

```
Model responses at temperature=0:
1: Mercury, Venus, Earth
2: Mercury, Venus, Earth
3: Mercury, Venus, Earth
4: Mercury, Venus, Earth
5: Mercury, Venus, Earth

✅ The model was fully consistent.
```

Or, in rare cases, you might see something like:

```
Model responses at temperature=0:
1: Mercury, Venus, Earth
2: Mercury, Venus, Earth
3: Mercury, Earth, Mars
4: Mercury, Venus, Earth
5: Mercury, Venus, Earth

⚠️ The model produced different outputs.
```

This output helps you quickly see whether the model is consistent or not.

-----

## Understanding the Results

If all the responses are the same, it means the model is fully consistent for that prompt and setting. This is what you would expect when using **temperature=0**, since the model should always pick the most likely answer. Consistency is important for tasks where you need reliable, repeatable results, such as automated grading, data extraction, or any application where you want to avoid surprises.

If you see different outputs, even with **temperature=0**, it suggests that there may be some randomness or instability in the model or the API. This is useful to know, as it can affect how much you trust the model’s outputs in critical applications. Measuring consistency in this way is a simple but powerful tool for understanding model behavior.
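
Before concluding that a model is unstable, it can help to check whether the outputs differ in substance or only in surface details such as capitalization or a trailing period. The sketch below is one possible approach, assuming that such differences do not matter for your task. The helper names `normalize` and `check_consistency_normalized` are hypothetical, and the sample strings are hand-written rather than real model output.

```python
import re

def normalize(text):
    """Lowercase, collapse whitespace, and drop trailing sentence punctuation."""
    collapsed = re.sub(r"\s+", " ", text.strip().lower())
    return collapsed.rstrip(".!?")

def check_consistency_normalized(responses):
    """True if all responses match after normalization (False for an empty list)."""
    return len({normalize(r) for r in responses}) == 1

# Hand-written sample responses, not real model output
samples = [
    "Mercury, Venus, Earth",
    "Mercury, Venus, Earth.",
    "mercury,  venus, earth",
]

exact_match = len(set(samples)) == 1
normalized_match = check_consistency_normalized(samples)
print("Exact match:", exact_match)
print("Match after normalization:", normalized_match)
```

Whether to normalize is a judgment call: a strict comparison is the safer default for benchmarking, while a normalized comparison may better reflect what matters in an application.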

-----

## Summary and What’s Next

In this lesson, you learned how to measure the consistency of LLM outputs by sending the same prompt multiple times at **temperature=0**. You saw how to collect and compare the responses and how to interpret the results. Consistency is a key part of benchmarking, especially when you need reliable answers from your model.

Next, you will get a chance to practice running and modifying this code yourself. Try using different prompts or running the loop more times to see if the model remains consistent. This hands-on practice will help you build confidence in evaluating LLM behavior for your own projects.

## Refactoring for Cleaner Consistency Checks

Now that you've simplified the consistency-checking code, let's improve its structure! Good software design involves breaking down code into reusable functions that each do one thing well.

In this exercise, you'll refactor the code by creating a dedicated check_consistency function for checking consistency. This approach makes your code more organized and easier to maintain.

Your task is to:

  * Create a new `check_consistency` function that takes a list of responses as input
  * Make the function use the set conversion technique to check whether all responses are identical
  * Have the function return a boolean value (`True` if consistent, `False` if not)
  * Update the main code to use your new function for the consistency check

This refactoring skill is valuable beyond just this example. It's a fundamental practice that will help you write cleaner, more professional code throughout your programming journey.

```python
from openai import OpenAI

client = OpenAI()

# TODO: Create a function called check_consistency that:
# - Takes a list of responses as input
# - Converts the list to a set to find unique responses
# - Returns True if all responses are identical (set has only 1 item)
# - Returns False otherwise

prompt = "Name three planets in our solar system."

responses = []

# Run the same prompt 5 times at temperature=0
for i in range(5):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0.0,
        messages=[{"role": "user", "content": prompt}]
    )
    answer = response.choices[0].message.content.strip()
    responses.append(answer)

# Display all responses
print("Model responses at temperature=0:")
for idx, r in enumerate(responses, 1):
    print(f"{idx}: \n{r}", end=f"\n {'-'*50} \n")

# TODO: Replace the code below with a call to your check_consistency function
# and use its return value to determine which message to print
# (Do not check set here; just call your function)
unique_responses = set(responses)
if len(unique_responses) == 1:
    print("\n✅ The model was fully consistent.")
else:
    print("\n⚠️ The model produced different outputs.")
```

To refactor your code for a cleaner consistency check, you need to create a `check_consistency` function and then call that function from the main part of your script. This will make your code more modular and easier to read.

### Refactored Code

Here is the updated code with the new `check_consistency` function and the modified main loop that uses it.

```python
from openai import OpenAI

client = OpenAI()

# Helper that reports whether every response in the list is identical
def check_consistency(responses):
    """
    Checks if all responses in a list are identical.

    Args:
        responses (list): A list of strings, where each string is an LLM response.

    Returns:
        bool: True if all responses are the same, False otherwise.
    """
    unique_responses = set(responses)
    return len(unique_responses) == 1

prompt = "Name three planets in our solar system."

responses = []

# Run the same prompt 5 times at temperature=0
for i in range(5):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0.0,
        messages=[{"role": "user", "content": prompt}]
    )
    answer = response.choices[0].message.content.strip()
    responses.append(answer)

# Display all responses
print("Model responses at temperature=0:")
for idx, r in enumerate(responses, 1):
    print(f"{idx}: \n{r}", end=f"\n {'-'*50} \n")

# Let check_consistency decide which message to print
if check_consistency(responses):
    print("\n✅ The model was fully consistent.")
else:
    print("\n⚠️ The model produced different outputs.")
```

### Key Changes

The new `check_consistency` function encapsulates the logic for determining consistency. It takes a list as an argument, performs the set conversion, and returns a simple `True` or `False`. This allows the main part of the script to be more readable. Instead of the logic being directly in the `if/else` block, the main script now simply says, "if the responses are consistent, do this, otherwise, do that."

This refactoring is a fundamental practice in software engineering, as it promotes **code reusability** and **modularity**, making it easier to read and debug your code in the future.
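
One benefit of pulling the check into its own function is that you can try it on hand-written lists without calling the API at all. The short sketch below repeats the function in compact form and exercises it on sample data, including the empty-list edge case. The sample lists are made up for illustration.

```python
def check_consistency(responses):
    """Return True if every response in the list is identical."""
    return len(set(responses)) == 1

# Hand-written lists standing in for model output
identical = ["Mercury, Venus, Earth"] * 3
mixed = ["Mercury, Venus, Earth", "Mercury, Earth, Mars"]

print(check_consistency(identical))  # True
print(check_consistency(mixed))      # False
print(check_consistency([]))         # False, because an empty set has length 0
```

Note that an empty list is reported as inconsistent with this implementation. If you would rather treat "no responses" as an error, you could raise an exception for that case instead.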

## Parameterizing Consistency Test Runs

After refactoring our code with functions, let's make it even more flexible! When testing model consistency, you might want to run different numbers of tests depending on how thorough you need to be.

In this exercise, you'll make the consistency testing code configurable by:

  * Adding a `num_runs` variable at the top of the file
  * Updating the for-loop to use this variable instead of the hard-coded value `5`
  * Adding a `temperature` variable at the top of the file, and updating the code to use this variable instead of the hard-coded value `0.0`
  * Setting `temperature` to a higher value, such as `0.7`, and observing how this affects the consistency of the model's responses

This simple change will let you easily adjust how many times you test the model and how much randomness is in the responses, without editing code in multiple places. Try experimenting with different values for `num_runs` and `temperature` to see how the number of tests and the temperature affect your confidence in the consistency results. Being able to quickly adjust testing parameters is an essential skill when conducting thorough model evaluations.

```python
from openai import OpenAI

client = OpenAI()

# TODO: Add a variable called num_runs here to control how many times we run the prompt
# TODO: Add a variable called temperature here to control the model's randomness

prompt = "Name three planets in our solar system."

responses = []

# TODO: Update this loop to use num_runs instead of the hard-coded value 5
# TODO: Update the temperature parameter to use the temperature variable instead of 0.0
for i in range(5):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0.0,
        messages=[{"role": "user", "content": prompt}]
    )
    answer = response.choices[0].message.content.strip()
    responses.append(answer)

# Display all responses
print("Model responses at temperature=0:")
for idx, r in enumerate(responses, 1):
    print(f"{idx}: \n{r}", end=f"\n {'-'*50} \n")

# Check if all responses are identical
unique_responses = set(responses)
if len(unique_responses) == 1:
    print("\n✅ The model was fully consistent.")
else:
    print("\n⚠️ The model produced different outputs.")

```

Making your code more flexible by **parameterizing** key variables is a fundamental practice in software engineering. By adding variables for `num_runs` and `temperature`, you can easily adjust your consistency tests without changing the core logic.

Here is the updated code with the new variables.

### Parameterized Consistency Test Code

```python
from openai import OpenAI

client = OpenAI()

# Number of runs and sampling temperature, defined once so they are easy to change
num_runs = 5
temperature = 0.0

prompt = "Name three planets in our solar system."

responses = []

# Send the prompt num_runs times at the configured temperature
for i in range(num_runs):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=temperature,
        messages=[{"role": "user", "content": prompt}]
    )
    answer = response.choices[0].message.content.strip()
    responses.append(answer)

# Display all responses
print(f"Model responses at temperature={temperature} for {num_runs} runs:")
for idx, r in enumerate(responses, 1):
    print(f"{idx}: \n{r}", end=f"\n {'-'*50} \n")

# Check if all responses are identical
unique_responses = set(responses)
if len(unique_responses) == 1:
    print("\n✅ The model was fully consistent.")
else:
    print("\n⚠️ The model produced different outputs.")
```

-----

### How to Experiment with the Parameters

**1. Testing with a low temperature:**
Keep `temperature` at `0.0`. This is the best setting for checking whether a model behaves deterministically. You will usually see a fully consistent result because the model picks the most probable token at each step, although occasional differences can still appear.

**2. Testing with a higher temperature:**
Change the `temperature` variable to a value like `0.7`.

```python
num_runs = 5
temperature = 0.7
```

When you run the code with this setting, you will likely see **different outputs** in the response list. This is because the higher temperature introduces randomness, allowing the model to choose from a wider range of tokens at each step, making its output less predictable and therefore inconsistent.

**3. Changing `num_runs`:**
You can also increase the number of runs to get a more thorough test. For example, setting `num_runs = 10` will send the same prompt 10 times, providing a larger sample size to observe consistency.

This approach gives you a flexible and powerful way to evaluate a model's behavior under various conditions.
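
If you want to go one step further, the same parameters can become function arguments. The sketch below is one possible design, not part of the exercise solution: `run_consistency_test` is a hypothetical helper that accepts any callable returning a string, so the stub functions here stand in for a real API call and let the example run offline.

```python
from itertools import cycle

def run_consistency_test(ask, prompt, num_runs=5):
    """Call ask(prompt) num_runs times and report whether all answers match."""
    responses = [ask(prompt).strip() for _ in range(num_runs)]
    return responses, len(set(responses)) == 1

# Stub "models" so the sketch runs offline; swap in a real API call as needed
def stable_model(prompt):
    return "Mercury, Venus, Earth"

_answers = cycle(["Mercury, Venus, Earth", "Mercury, Earth, Mars"])

def flaky_model(prompt):
    # Alternates between two answers to imitate inconsistent output
    return next(_answers)

prompt = "Name three planets in our solar system."
for name, model in [("stable", stable_model), ("flaky", flaky_model)]:
    responses, consistent = run_consistency_test(model, prompt, num_runs=4)
    print(f"{name}: consistent={consistent}, unique={len(set(responses))}")
```

Passing the model call in as an argument keeps the test logic separate from the API client, which also makes it easy to reuse the same helper with different prompts, temperatures, or models.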

## Tracking Response Patterns with Frequency Analysis

Now that you've made your consistency testing code more flexible with parameters, let's take our analysis one step further! When working with higher temperature settings, it's helpful to know not just whether responses vary, but how often each unique response appears.

In this exercise, you'll add response frequency tracking to your code:

  * Create a dictionary to count how many times each unique response occurs.
  * Update this counter each time you get a response from the model.
  * Add code to display a summary showing each response and its frequency.

This frequency analysis is particularly valuable when testing at higher temperatures, as it helps you identify which responses are most common even when the model isn't perfectly consistent. Understanding these patterns gives you deeper insights into model behavior and can help you make better decisions about reliability for your specific use case.

```python
from openai import OpenAI

client = OpenAI()

# Number of times to run the same prompt
num_runs = 10

# Temperature setting for the model
temperature = 0.7  # Try changing this to see how it affects consistency

prompt = "Name three planets in our solar system."

responses = []
# TODO: Create a dictionary to track the frequency of each unique response

# Run the same prompt num_runs times at the specified temperature
for i in range(num_runs):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=temperature,
        messages=[{"role": "user", "content": prompt}]
    )
    answer = response.choices[0].message.content.strip()
    responses.append(answer)
    
    # TODO: Update the frequency counter dictionary:
    # - If the answer already exists as a key, increment its count
    # - If it's a new answer, add it to the dictionary with count 1

# Display all responses
print(f"Model responses at temperature={temperature}:")
for idx, r in enumerate(responses, 1):
    print(f"{idx}: \n{r}", end=f"\n {'-'*50} \n")

# TODO: Add code to display a frequency summary of responses
# The summary should show each unique response and how many times it occurred

# Check if all responses are identical
unique_responses = set(responses)
if len(unique_responses) == 1:
    print("\n✅ The model was fully consistent.")
else:
    print(f"\n⚠️ The model produced {len(unique_responses)} different outputs.")
```

Frequency analysis is a great way to understand the **distribution** of a model's responses, especially with higher temperature settings. It shows you not only if the outputs vary, but which variations are most likely to occur.

Here is the updated code that includes a dictionary to track and display the frequency of each unique response.

### Code with Frequency Analysis

```python
from openai import OpenAI
from collections import Counter

client = OpenAI()

# Number of times to run the same prompt
num_runs = 10

# Temperature setting for the model
temperature = 0.7  # Try changing this to see how it affects consistency

prompt = "Name three planets in our solar system."

responses = []
# Counter tracks how many times each unique response occurs
response_counts = Counter()

# Run the same prompt num_runs times at the specified temperature
for i in range(num_runs):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=temperature,
        messages=[{"role": "user", "content": prompt}]
    )
    answer = response.choices[0].message.content.strip()
    responses.append(answer)
    
    # Counter returns 0 for missing keys, so this handles new and repeated answers
    response_counts[answer] += 1

# Display all responses
print(f"Model responses at temperature={temperature}:")
for idx, r in enumerate(responses, 1):
    print(f"{idx}: \n{r}", end=f"\n {'-'*50} \n")

# Display each unique response with its count, most frequent first
print("\n--- Response Frequency Summary ---")
for text, count in response_counts.most_common():
    print(f"Count: {count}")
    print(f"Response: {text}\n")

# Check if all responses are identical
unique_responses = set(responses)
if len(unique_responses) == 1:
    print("\n✅ The model was fully consistent.")
else:
    print(f"\n⚠️ The model produced {len(unique_responses)} different outputs.")
```

-----

### How it Works

The key to this solution is the `collections.Counter` class, a specialized dictionary designed for this exact purpose. As the loop runs, `response_counts[answer] += 1` efficiently handles both cases:

  * If `answer` is a new key, it's added to the dictionary with a value of `1`.
  * If `answer` already exists, its current value is incremented by one.

This new code provides a more informative view of the model's behavior. Instead of just seeing that the outputs are different, you can now see which ones are the **most common** in your sample, which gives a rough estimate of which are **most probable** at that specific temperature. This is essential for applications where you need to balance creativity with a degree of reliability.
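
To see the two cases side by side, the sketch below counts the same list with a plain dictionary and with `Counter`, then derives a simple summary number from the counts. The responses are hand-written sample data, and `agreement_rate` is a hypothetical metric name introduced here for illustration, not a standard benchmark score.

```python
from collections import Counter

# Hand-written sample responses, not real model output
responses = [
    "Mercury, Venus, Earth",
    "Mercury, Venus, Earth",
    "Mercury, Earth, Mars",
    "Mercury, Venus, Earth",
    "Venus, Earth, Mars",
]

# Plain-dict version of the same counting logic
manual_counts = {}
for answer in responses:
    manual_counts[answer] = manual_counts.get(answer, 0) + 1

# Counter builds the same mapping in one step
response_counts = Counter(responses)

# Share of runs that produced the most frequent response
top_response, top_count = response_counts.most_common(1)[0]
agreement_rate = top_count / len(responses)

print(f"Most common: {top_response} ({top_count}/{len(responses)})")
print(f"Agreement rate: {agreement_rate:.0%}")
```

A single number like this can be easier to compare across temperatures or prompts than a full frequency table, though with only a handful of runs it remains a rough estimate.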