# Unit 1

## Measuring and Interpreting Token Usage in LLMs

# Lesson 1: Introduction to Behavioral Benchmarking

Welcome to the first lesson of the course, where we begin our journey into **behavioral benchmarking** of large language models (**LLMs**). In this course, you'll learn how to evaluate LLMs beyond just accuracy, focusing on how they use resources, how their outputs change with different settings, and how to spot unusual or incorrect responses.

In this lesson, we focus on **token usage**. Tokens are the basic units of text that LLMs process—think of them as pieces of words or characters. Every time you send a prompt to an LLM, it counts the number of tokens in your input and its output.

In benchmarking, efficiency matters as much as accuracy. Tracking token usage helps you understand not only how "smart" a model is, but how **resource-hungry** it is—something crucial for real-world applications and cost management.

By the end of this lesson, you will know how to measure token usage for different prompts and interpret what those numbers mean.

-----

### The Role of the Context Window

When working with LLMs, it’s important to understand the concept of a **context window**. The context window is the maximum number of tokens (including both your input prompt and the model’s output) that the model can process in a single request. For example, the original `gpt-3.5-turbo` model had a context window of 4,096 tokens; newer models offer considerably larger windows.

The context window determines how much information the model can "see" at once. If your prompt alone exceeds this limit, the OpenAI API rejects the request with an error. If the prompt fits but the prompt and completion together reach the limit, the output is cut off before the model finishes. Either way, the quality and completeness of the response suffers.
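One way to detect a cut-off output in code: a chat completion response reports why generation stopped in `choices[0].finish_reason`, and the value `"length"` indicates that a token limit was reached. The sketch below uses stand-in objects instead of a real API call, and the helper name `was_truncated` is hypothetical.

```python
from types import SimpleNamespace


def was_truncated(response):
    """Return True if the first choice stopped because a token limit was reached."""
    return response.choices[0].finish_reason == "length"


# Stand-in objects shaped like a chat completion response (no API call is made).
cut_off = SimpleNamespace(choices=[SimpleNamespace(finish_reason="length")])
finished = SimpleNamespace(choices=[SimpleNamespace(finish_reason="stop")])

print(was_truncated(cut_off))   # True
print(was_truncated(finished))  # False
```

With a real response object, the same check can be applied to the value returned by `client.chat.completions.create(...)`.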

When benchmarking or designing prompts, always keep the context window in mind:

  * If your prompt is very long, you may need to shorten it to leave enough room for the model’s response.
  * If you expect a long answer, make sure your prompt is concise enough to fit within the context window.
  * Exceeding the context window can lead to incomplete outputs or errors.

Monitoring token usage helps you stay within the context window and ensures that your prompts and completions are processed as intended.
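The budgeting described above is simple arithmetic, and a small helper can make it explicit. This is a minimal sketch: the function names are hypothetical, and the default of 4096 tokens is only the example limit used in this lesson, so substitute the limit of the model you actually use.

```python
def remaining_completion_budget(prompt_tokens, context_window=4096):
    """Tokens left for the completion once the prompt is accounted for."""
    return max(context_window - prompt_tokens, 0)


def fits_in_context(prompt_tokens, expected_completion_tokens, context_window=4096):
    """Check whether prompt plus expected completion stay within the window."""
    return prompt_tokens + expected_completion_tokens <= context_window


print(remaining_completion_budget(3000))  # 1096
print(fits_in_context(3000, 1500))        # False
```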

-----

### Getting Started With the OpenAI Python Client

To measure token usage, we will use the **OpenAI Python client library**. This library allows you to interact with OpenAI’s models using Python code. If you are working on your own computer, you would usually install the library using a command like `pip install openai`. However, on CodeSignal, the OpenAI library is already installed for you, so you can start coding right away.

The OpenAI client lets you send prompts to a model and receive responses. It also provides useful information about each request, including how many tokens were used for the prompt, the completion (the model’s response), and the total. In the next section, we will look at a practical example of how to use this client to measure token usage.

-----

### Example: Measuring Token Usage for Multiple Prompts

Let’s look at a code example that measures token usage for several different prompts. Here is the code you will use:

```python
from openai import OpenAI
client = OpenAI()

prompts = [
    "Summarize the history of the Roman Empire.",
    "Summarize the plot of the movie Titanic in two sentences.",
    "Summarize the process of photosynthesis."
]

for prompt in prompts:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{"role": "user", "content": prompt}]
    )

    usage = response.usage
    print(f"Prompt: {prompt[:50]}...")
    print(f"Prompt tokens: {usage.prompt_tokens}")
    print(f"Completion tokens: {usage.completion_tokens}")
    print(f"Total tokens: {usage.total_tokens}\n")
```

In this example, you first import the OpenAI client and create a list of prompts. For each prompt, you send a request to the `gpt-3.5-turbo` model with a temperature of `0` (which makes the output more deterministic). The response from the model includes a `usage` object, which contains the number of tokens used for the prompt, the completion, and the total. The code then prints out these numbers for each prompt.
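If you want to compare results later instead of only printing them, one option is to copy the fields of the `usage` object into a plain dictionary. The sketch below is an illustration only: `usage_to_row` is a hypothetical helper, and the `SimpleNamespace` stands in for the `usage` object of a real response so that the example runs without an API call.

```python
from types import SimpleNamespace


def usage_to_row(prompt, usage):
    """Copy the token counts from a usage object into a plain dictionary."""
    return {
        "prompt": prompt,
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
        "total_tokens": usage.total_tokens,
    }


# Stand-in for response.usage; the numbers are illustrative.
fake_usage = SimpleNamespace(prompt_tokens=12, completion_tokens=54, total_tokens=66)
row = usage_to_row("Summarize the history of the Roman Empire.", fake_usage)
print(row["total_tokens"])  # 66
```

Inside the loop, you could call `usage_to_row(prompt, response.usage)` and append each row to a list.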

-----

### Sample Output

When you run this code, you will see output similar to the following (the exact numbers may vary):

```
Prompt: Summarize the history of the Roman Empire....
Prompt tokens: 12
Completion tokens: 54
Total tokens: 66

Prompt: Summarize the plot of the movie Titanic in two sen...
Prompt tokens: 15
Completion tokens: 32
Total tokens: 47

Prompt: Summarize the process of photosynthesis....
Prompt tokens: 10
Completion tokens: 38
Total tokens: 48
```

This output shows you, for each prompt, how many tokens were used in the input, how many in the output, and the total for the request.

-----

### Understanding the Output

Now, let’s break down what these numbers mean. The **`prompt_tokens`** value tells you how many tokens were in your input prompt. The **`completion_tokens`** value shows how many tokens the model generated in its response. The **`total_tokens`** value is simply the sum of the two.

But what do these numbers actually mean in practice?

  * **Prompt tokens:** This number reflects the length of your input, plus a small overhead for chat message formatting. For example, a short question like "What is photosynthesis?" might use only a few tokens, while a detailed instruction or a multi-part question will use more. If you see a high prompt token count, your input is long or contains text that splits into many tokens.
  * **Completion tokens:** This number tells you how much the model "said" in response. A low number means a short answer; a high number means a longer, more detailed answer. If you ask for a summary in one sentence, you’ll see fewer completion tokens than if you ask for a detailed explanation.
  * **Total tokens:** This is the sum of prompt and completion tokens. It represents the total resources used for the request. Most LLM providers, including OpenAI, charge per token, often at different rates for prompt (input) and completion (output) tokens.
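Token counts turn into a cost estimate once you multiply them by a price. The sketch below prices prompt and completion tokens separately, since providers often bill them at different rates. The function name and both prices are hypothetical placeholders, not real rates; check your provider's pricing page for actual values.

```python
def estimate_cost(prompt_tokens, completion_tokens,
                  input_price_per_1k, output_price_per_1k):
    """Estimate the cost of one request from its token counts.

    Prices are expressed per 1,000 tokens, in whatever currency you choose.
    """
    input_cost = prompt_tokens / 1000 * input_price_per_1k
    output_cost = completion_tokens / 1000 * output_price_per_1k
    return input_cost + output_cost


# Placeholder prices chosen only to keep the arithmetic easy to follow.
print(estimate_cost(1000, 2000, input_price_per_1k=0.5, output_price_per_1k=1.5))  # 3.5
```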

-----

### What Can Token Numbers Tell You?

  * **Cost:** Since providers charge per token, higher total token counts mean higher costs. Monitoring token usage helps you manage expenses, especially at scale.
  * **Efficiency:** If you notice that similar prompts produce very different token counts, it may indicate that some prompts are less efficient or that the model is being unnecessarily verbose.
  * **Model Limits:** LLMs have a maximum token limit per request, the context window (for example, 4,096 tokens for some models). If your prompt and expected completion together approach this limit, you may need to shorten your input or expect shorter outputs.
  * **Benchmarking:** By comparing token usage across prompts and models, you can benchmark not just accuracy, but also efficiency. For example, if two models give similar answers but one uses fewer tokens, it may be more efficient for your use case.

Understanding these numbers helps you see how much information you are sending and receiving, estimate costs, and optimize your prompts for both quality and efficiency.
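To make the benchmarking idea concrete, here is a minimal sketch that compares the average completion length of two models on the same set of prompts. The model names and all token counts are made-up illustration data, and `mean_completion_tokens` is a hypothetical helper.

```python
from statistics import mean


def mean_completion_tokens(rows):
    """Average number of completion tokens across a list of result rows."""
    return mean(row["completion_tokens"] for row in rows)


# Made-up results for two hypothetical models answering the same three prompts.
runs = {
    "model_a": [{"completion_tokens": 54}, {"completion_tokens": 32}, {"completion_tokens": 38}],
    "model_b": [{"completion_tokens": 40}, {"completion_tokens": 30}, {"completion_tokens": 35}],
}

averages = {name: mean_completion_tokens(rows) for name, rows in runs.items()}
most_concise = min(averages, key=averages.get)
print(most_concise)  # model_b
```

A lower average only means shorter answers. Whether those answers are equally good still has to be judged separately.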

-----

### Summary and What’s Next

In this lesson, you learned why token usage is important when working with LLMs and how to measure it using the OpenAI Python client. You saw a practical example of sending multiple prompts to a model and reading the token usage from the response. You also learned how to interpret the output, so you can understand how your prompts and the model’s responses affect token counts.

You also learned about the context window, which is the maximum number of tokens a model can process at once. Keeping the context window in mind is essential for ensuring your prompts and completions fit within the model’s limits and for optimizing both efficiency and output quality.

This knowledge will be important as you move on to the practice exercises, where you will get hands-on experience measuring and comparing token usage for different prompts. Understanding token usage and the context window is the first step in effective LLM benchmarking, and it will help you make better decisions about how to use these models efficiently.

## Comparing Token Counts to Prompt and Answer Lengths

In the previous example, you learned how to measure token usage for multiple prompts using the OpenAI Python client. However, the number of tokens used by the model is often different from the number of words or characters in your prompt or the model's answer. This is because tokens are not the same as words—they can be whole words, parts of words, or even punctuation.

Your task is to modify the code so that, for each prompt, it does the following:

  * Print the full prompt text (no truncation).
  * Print the length of the prompt in both characters and words.
  * Print the length of the answer in both characters and words.
  * Print the token usage information as before (prompt tokens, completion tokens, total tokens).

This will help you see how token counts compare to the number of words and characters in both the prompt and the model's response.

```python
from openai import OpenAI

client = OpenAI()

prompts = [
    "Summarize the history of the Roman Empire.",
    "Summarize the plot of the movie Titanic in two sentences.",
    "Summarize the process of photosynthesis."
]

for prompt in prompts:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{"role": "user", "content": prompt}]
    )
    
    answer = response.choices[0].message.content
    usage = response.usage
    # TODO: Print the full prompt (no truncation)
    # TODO: Print prompt length in characters and words
    # TODO: Print answer length in characters and words
    # TODO: Print token usage information
    print()

```

```python
from openai import OpenAI

client = OpenAI()

prompts = [
    "Summarize the history of the Roman Empire.",
    "Summarize the plot of the movie Titanic in two sentences.",
    "Summarize the process of photosynthesis."
]

for prompt in prompts:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{"role": "user", "content": prompt}]
    )
    
    answer = response.choices[0].message.content
    usage = response.usage
    
    # Print the full prompt (no truncation)
    print(f"Prompt: {prompt}")
    
    # Print prompt length in characters and words
    prompt_char_count = len(prompt)
    prompt_word_count = len(prompt.split())
    print(f"Prompt length (characters): {prompt_char_count}")
    print(f"Prompt length (words): {prompt_word_count}")
    
    # Print answer length in characters and words
    answer_char_count = len(answer)
    answer_word_count = len(answer.split())
    print(f"Answer length (characters): {answer_char_count}")
    print(f"Answer length (words): {answer_word_count}")
    
    # Print token usage information
    print(f"Prompt tokens: {usage.prompt_tokens}")
    print(f"Completion tokens: {usage.completion_tokens}")
    print(f"Total tokens: {usage.total_tokens}")
    print("-" * 20)
```
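Once you have word counts and token counts side by side, a ratio makes the comparison easier to read. The helper below is a hypothetical sketch; the token count passed in is an illustrative number, not a measured one. Keep in mind that `prompt_tokens` reported by the chat API may include a few tokens of message formatting, so the ratio for very short prompts can look inflated.

```python
def tokens_per_word(text, token_count):
    """Ratio of tokens to whitespace-separated words (0.0 for empty text)."""
    word_count = len(text.split())
    if word_count == 0:
        return 0.0
    return token_count / word_count


# Illustrative token count for a five-word prompt.
print(tokens_per_word("Summarize the process of photosynthesis.", 10))  # 2.0
```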

## Exploring Prompt Length and Token Usage

Now that you've compared token counts with word and character counts, let's explore how prompt length affects token usage. In the previous example, the prompts were of similar length, but in real applications, prompts can vary dramatically in size.

Your task is to expand the existing code by:

  * Adding a very short prompt (just 1–2 words) at the beginning of the list
  * Adding a longer, more complex prompt (a paragraph with multiple sentences) at the end of the list

After running the code, observe the token counts for each prompt. Notice how they change with different prompt lengths and complexities. Does doubling the prompt length double the token count?

Understanding this relationship between prompt length and token usage will help you optimize costs and efficiency when designing prompts for production systems.

```python
from openai import OpenAI

client = OpenAI()

# TODO: Add a very short prompt (1-2 words) at the beginning of this list
prompts = [
    "Summarize the history of the Roman Empire.",
    "Summarize the plot of the movie Titanic in two sentences.",
    "Summarize the process of photosynthesis."
]
# TODO: Add a longer, more complex prompt (a paragraph with multiple sentences) at the end of this list

for prompt in prompts:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{"role": "user", "content": prompt}]
    )

    usage = response.usage
    print(f"Prompt: {prompt}")
    print(f"Prompt tokens: {usage.prompt_tokens}")
    print(f"Completion tokens: {usage.completion_tokens}")
    print(f"Total tokens: {usage.total_tokens}\n")
```
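When you interpret your results, it can help to think of the reported prompt token count as the tokens of your text plus a small fixed overhead for chat message formatting. The sketch below illustrates that model with a hypothetical overhead of 7 tokens. The real overhead depends on the model and message structure, so treat the number as an assumption.

```python
def estimated_prompt_tokens(content_tokens, overhead_tokens=7):
    """Model the reported prompt tokens as content plus a fixed overhead.

    overhead_tokens is a hypothetical value used only for illustration.
    """
    return content_tokens + overhead_tokens


short = estimated_prompt_tokens(2)
doubled = estimated_prompt_tokens(4)

print(short, doubled)       # 9 11
print(doubled / short < 2)  # True
```

Under this assumption, doubling a very short prompt changes the reported count far less than doubling a long one, because the fixed overhead dominates small inputs.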


## Refactoring Token Usage for Cleaner Code

You've been exploring token usage with different prompts — now let's improve our code structure! In real-world applications, clean, reusable code is just as important as understanding the metrics.

Your task is to refactor the token measurement code by creating a dedicated function called `measure_token_usage` that:

  * Takes a prompt as input
  * Handles the API call to the model
  * Extracts the token usage information
  * Returns a formatted string with all the details

This approach follows the principle of "Don't Repeat Yourself" (DRY) and makes your code more maintainable. After creating your function, update the main part of the script to use it for each prompt in the list.

By learning to organize your code in this way, you'll be better prepared to build more complex applications that interact with LLMs efficiently and professionally.

```python
from openai import OpenAI

client = OpenAI()

# TODO: Create a function called measure_token_usage that takes a prompt as input
# and returns a formatted string with token usage information

prompts = [
    "Summarize the history of the Roman Empire.",
    "Summarize the plot of the movie Titanic in two sentences.",
    "Summarize the process of photosynthesis."
]

# TODO: Replace this loop with code that calls your new function for each prompt
for prompt in prompts:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{"role": "user", "content": prompt}]
    )

    usage = response.usage
    print(f"Prompt: {prompt}")
    print(f"Prompt tokens: {usage.prompt_tokens}")
    print(f"Completion tokens: {usage.completion_tokens}")
    print(f"Total tokens: {usage.total_tokens}\n")
```

```python
from openai import OpenAI

client = OpenAI()

def measure_token_usage(prompt):
    """
    Measures and returns token usage information for a given prompt.

    Args:
        prompt (str): The text prompt to send to the model.

    Returns:
        str: A formatted string containing the prompt and its token usage details.
    """
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{"role": "user", "content": prompt}]
    )
    
    usage = response.usage
    output_string = (
        f"Prompt: {prompt}\n"
        f"Prompt tokens: {usage.prompt_tokens}\n"
        f"Completion tokens: {usage.completion_tokens}\n"
        f"Total tokens: {usage.total_tokens}\n"
    )
    return output_string

prompts = [
    "Summarize the history of the Roman Empire.",
    "Summarize the plot of the movie Titanic in two sentences.",
    "Summarize the process of photosynthesis."
]

# Call the function for each prompt and print the result
for prompt in prompts:
    print(measure_token_usage(prompt))

```
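A possible next step, beyond what the exercise asks for, is to pass the client into the function as a parameter. That makes it possible to try the function with a stand-in client and no network access. The sketch below is one way to do this: `measure_token_usage_with` and `FakeCompletions` are hypothetical names, and the fake client returns made-up token counts derived from the word count.

```python
from types import SimpleNamespace


def measure_token_usage_with(client, prompt, model="gpt-3.5-turbo"):
    """Same report as before, but the client is supplied by the caller."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    usage = response.usage
    return (
        f"Prompt: {prompt}\n"
        f"Prompt tokens: {usage.prompt_tokens}\n"
        f"Completion tokens: {usage.completion_tokens}\n"
        f"Total tokens: {usage.total_tokens}\n"
    )


class FakeCompletions:
    """Stand-in for client.chat.completions that makes no API call."""

    def create(self, model, temperature, messages):
        # Made-up counts: one token per word in, two per word out.
        words = len(messages[0]["content"].split())
        usage = SimpleNamespace(
            prompt_tokens=words,
            completion_tokens=2 * words,
            total_tokens=3 * words,
        )
        return SimpleNamespace(usage=usage)


fake_client = SimpleNamespace(chat=SimpleNamespace(completions=FakeCompletions()))
report = measure_token_usage_with(fake_client, "Summarize the process of photosynthesis.")
print(report)
```

In real use, you would pass the `OpenAI()` client instead of `fake_client`.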