# Model-Graded Quality Assurance for GlobalMart's Customer Service

At GlobalMart, we've mastered code-based testing of our AI customer service - checking if responses contain order numbers, tracking codes, and refund policies. These basic tests are fast and cheap, but they can't evaluate what really matters: the quality of customer interactions.

Think about what makes great customer service:
- Professional yet friendly tone
- Appropriate escalation of sensitive issues
- Clear explanations without technical jargon
- Empathetic handling of customer frustrations

These qualities can't be measured with simple code checks. That's where model-graded evaluations come in.

By using one AI model to evaluate another, we can assess these nuanced aspects of customer service. The evaluating model acts like a quality assurance specialist, checking responses against GlobalMart's service standards.

## Learning Objectives
- Configure model-graded evaluations in Promptfoo for customer service quality
- Use LLMs to evaluate response quality and tone against multiple service standards
- Optimize customer service prompts using evaluation insights based on quality metrics
- Apply multi-criteria grading for complex assessments
- Implement automated QA for response tone and content

## The Challenge
GlobalMart needs to ensure our AI customer service maintains high standards across:
- Professional, helpful tone
- Clear response boundaries 
- Appropriate escalation paths

## How LLM-based evaluation works

The process works by giving the evaluator model:
- The customer's original question
- Our AI's response
- GlobalMart's service quality guidelines
- Specific evaluation criteria

This approach lets us measure what really matters - not just what was said, but how it was said. Let's see how to implement this at GlobalMart!
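Conceptually, the grading step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Promptfoo implements it: the function names, the JSON reply format, and the `fake_evaluator` stand-in (used so the snippet runs without a model endpoint) are all assumptions made for this example.

```python
import json


def build_grading_prompt(question, response, guidelines, criterion):
    """Combine the four inputs into a single prompt for the evaluator model."""
    return (
        "You are a quality assurance specialist for GlobalMart.\n\n"
        f"Service guidelines:\n{guidelines}\n\n"
        f"Customer question:\n{question}\n\n"
        f"AI response:\n{response}\n\n"
        f"Criterion to check: {criterion}\n"
        'Reply with JSON only: {"pass": true or false, "reason": "<short explanation>"}'
    )


def parse_verdict(raw):
    """Parse the evaluator's reply; count anything unparseable as a failure."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"pass": False, "reason": "evaluator did not return valid JSON"}
    if not isinstance(data, dict):
        return {"pass": False, "reason": "evaluator reply was not a JSON object"}
    return {"pass": data.get("pass") is True, "reason": str(data.get("reason", ""))}


def fake_evaluator(prompt):
    """Hypothetical stand-in for a real model call, so the sketch runs offline."""
    return '{"pass": true, "reason": "Courteous and free of jargon."}'


grading_prompt = build_grading_prompt(
    question="Where is my order?",
    response="Happy to help! Could you share your order number?",
    guidelines="Be professional, friendly, and clear.",
    criterion="Professional tone",
)
verdict = parse_verdict(fake_evaluator(grading_prompt))
print(verdict)
```

Treating unparseable grader output as a failure is a conservative design choice: a broken grader then shows up as failing tests rather than silently passing them.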

## Create our prompts in prompts.py
As we've done in previous labs, let's create a new file for our prompts called **`prompts.py`** and add the following prompt function to the file.
- **`Basic Prompt`:** Simple instruction set that establishes basic role and scope. Minimal guidance on tone or boundaries. Good baseline for testing but lacks specifics on handling edge cases.

In [1]:
%%writefile prompts.py
def basic_service_prompt(message):
    return f"""As GlobalMart's customer service AI, provide clear, professional responses.
    Help with orders, products, and account issues.
    Customer message: {message}"""

Writing prompts.py
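Before wiring the prompt into Promptfoo, it can help to call the function directly and look at the rendered text. The snippet below repeats the function so that it runs on its own; the sample customer message is made up for illustration.

```python
def basic_service_prompt(message):
    return f"""As GlobalMart's customer service AI, provide clear, professional responses.
    Help with orders, products, and account issues.
    Customer message: {message}"""


# Hypothetical customer message, used only to inspect the rendered prompt.
sample_message = "Where is my order?"
rendered = basic_service_prompt(sample_message)
print(rendered)
```

Promptfoo passes each test case's `message` variable into the function in the same way, so what is printed here is the text the response generator will receive.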


## Create Promptfoo Configuration

### Setting Up Model-Graded Evaluation
The `llm-rubric` assertion in Promptfoo allows you to use an LLM to evaluate the output of your main model based on specified criteria. This is particularly useful for assessing subjective aspects of the model's output, such as tone, relevance, or adherence to guidelines.

```yaml
defaultTest:
  assert:
    - type: llm-rubric
      value: "Professional tone"
    - type: llm-rubric
      value: "Offers appropriate alternatives when declining"
```

In this example we'll evaluate responses on multiple criteria. The `promptfooconfig.yaml` file contains two `llm-rubric` assertions:
- "Professional tone" evaluates whether responses are direct, courteous, and professional.
- "Offers appropriate alternatives when declining" assesses if helpful alternatives are provided when requests can't be fulfilled directly.

The LLM will grade the main model's response based on the provided rubric and return a pass/fail result.
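When several rubrics are attached to the same test case, the individual verdicts have to be combined. The sketch below assumes the simplest rule, where a test case passes only if every rubric passes. The `RubricResult` class, the helper names, and the example verdicts are all hypothetical and exist only to illustrate that rule.

```python
from dataclasses import dataclass


@dataclass
class RubricResult:
    rubric: str   # the rubric text that was graded
    passed: bool  # the grader's pass/fail verdict
    reason: str   # the grader's explanation


def case_passes(results):
    """A test case passes only when every rubric assertion passed."""
    return all(result.passed for result in results)


def failed_rubrics(results):
    """List the rubric texts that the response did not satisfy."""
    return [result.rubric for result in results if not result.passed]


# Made-up grader verdicts for a single response, for illustration only.
example_results = [
    RubricResult("Professional tone", True, "Courteous and direct."),
    RubricResult(
        "Offers appropriate alternatives when declining",
        False,
        "No alternative was offered.",
    ),
]

print(case_passes(example_results))
print(failed_rubrics(example_results))
```

Keeping the grader's reason next to each verdict makes a failing case much easier to diagnose than a bare pass/fail flag.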

### Write our Promptfoo Configuration to `promptfooconfig.yaml`
Now that we have our model evaluation rubric defined, let's create our full evaluation config:
- **`description`:** Sets evaluation context and purpose
- **`prompts`:** Contains prompt variations to test, either inline or referencing external files
- **`providers`:** Specifies LLM models to use, including configuration options and labels. This is where we define the LLM that will generate the results that will be evaluated
- **`defaultTest`:** Defines assertions applied to all test cases; contains evaluation criteria and the language model that will be used to evaluate the output
    - **`assert`:** Within `defaultTest` or individual tests, specifies evaluation criteria using `llm-rubric` as the assertion type
- **`tests`:** Lists test cases with variables to substitute into prompts

Each section plays a specific role in configuring the evaluation pipeline and defining success criteria.

In [5]:
%%writefile promptfooconfig.yaml
description: "GlobalMart Customer Service Quality Control Evaluation"

prompts:
  - prompts.py:basic_service_prompt

providers:
  - id: sagemaker:jumpstart:jumpstart-dft-llama-3-1-8b-instruct-20250312-144245
    label: "Response Generator"
    config:
      region: us-east-1

# Here is where we insert our llm rubric into our config
defaultTest:
  assert:
    - type: llm-rubric
      value: "Professional tone"
    - type: llm-rubric
      value: "Offers appropriate alternatives when declining"

  options:
    provider:
      id: bedrock:us.amazon.nova-pro-v1:0
      config:
        region: us-east-1

tests:
  - vars:
      message: "Can you help me get a refund for an item I bought from another store?"
  - vars:
      message: "What's your employee discount code?"
  - vars:
      message: "Can you tell me which of your competitors has better prices?"

Overwriting promptfooconfig.yaml


Now that we have a `promptfooconfig.yaml` config file with our evaluation model and criteria defined, and our basic test case prompt in `prompts.py`, we can run our evaluation.

Below you will see that we have set the `-j` flag, the short form of `--max-concurrency`. Think of it as "jobs": it controls how many test cases Promptfoo will run at the same time. When you run `promptfoo eval`, by default it parallelizes the evaluation by running multiple test cases simultaneously. This can speed things up considerably, especially if you have a lot of test cases.

However, there are times when you might not want this parallel execution. That's where the `-j` flag comes in.

`!promptfoo eval --no-progress-bar --no-cache -j 1`

When you set `-j 1`, you're telling Promptfoo: "Hey, just run one test case at a time please. No parallelization."

Why would you want to do this? A few possible reasons:

1. **Debugging:** If something's going wrong with your evaluation, running test cases one at a time can make it easier to spot where the issue is happening. The output will be sequential and easier to follow.
2. **Resource limitations:** If you're running Promptfoo on a machine with limited CPU or memory, parallel execution might overload it. By running test cases one at a time, you reduce the resource load.
3. **Avoiding rate limits:** Some API providers have rate limits on how many requests you can make per second/minute. If you're hitting those limits, running test cases sequentially with `-j 1` can help you stay under the limit.
4. **Reproducibility:** In some edge cases, parallel execution might lead to slightly different results each time due to race conditions or other factors. Running with `-j 1` removes execution order as a source of variation, although the model's own outputs can still differ from run to run.
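The effect of a concurrency limit can be illustrated with Python's standard thread pool. This is a rough analogy rather than Promptfoo's implementation: `run_cases`, the short sleep standing in for an API call, and the case labels are all assumptions made for this sketch.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_cases(cases, max_concurrency):
    """Run every case through a worker pool and return the order in which they finished."""
    finished = []
    lock = threading.Lock()

    def run_one(case):
        time.sleep(0.01)  # stand-in for a model API call
        with lock:
            finished.append(case)

    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        list(pool.map(run_one, cases))  # consume the iterator so errors surface
    return finished


cases = ["refund question", "discount code question", "competitor question"]

# With a single worker, cases finish in exactly the order they were submitted.
print(run_cases(cases, max_concurrency=1))
```

With a higher `max_concurrency` the same cases all complete, but the finishing order is no longer guaranteed, which is the kind of variation that sequential execution avoids.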

In [6]:
!promptfoo eval --no-progress-bar --no-cache -j 1

Cache is disabled.
Running 3 concurrent evaluations with up to 1 threads...

┌───────────────────────────────────┬───────────────────────────────────┐
│ message                           │ [Response Generator]              │
│                                   │ prompts.py:basic_service_prompt   │
├───────────────────────────────────┼───────────────────────────────────┤
│ Can you help me get a refund for  │ [FAIL] The output does not        │
│ an item I bought from another     │ decline a request or offer        │
│ store?                            │ alternatives; it provides         │
│                                   │ assis

In [7]:
!promptfoo view

Server running at http://localhost:15500 and monitoring for new evals.

This is a screenshot of the output we generated the first time we ran this evaluation:

![basic results](images/basic_test_results.png)

---

## A Second prompt
Let's add a second prompt that includes some guidelines about exactly which topics the model should discuss:

>  As GlobalMart's customer service AI:
>   1. Help with: orders, returns, product info, accounts, tech support
>   2. Politely decline out-of-scope requests
>   3. Keep professional tone


First, let's update our `prompts.py` file with our second prompt:

In [8]:
%%writefile prompts.py
def basic_service_prompt(message):
    return f"""As GlobalMart's customer service AI, provide clear, professional responses.
    Help with orders, products, and account issues.
    Customer message: {message}"""

# Narrow the scope of topics that the LLM should discuss with customers
def detailed_service_prompt(message):
    return f"""As GlobalMart's customer service AI:
    1. Help with: orders, returns, product info, accounts, tech support
    2. Politely decline out-of-scope requests
    3. Keep professional tone

    Customer message: {message}"""

Overwriting prompts.py


We also need to update the `promptfooconfig.yaml` file with our second prompt to look like this: 

In [None]:
%%writefile promptfooconfig.yaml
description: "GlobalMart Customer Service Quality Control Evaluation"

prompts:
  - prompts.py:basic_service_prompt
  - prompts.py:detailed_service_prompt # <-- Here we've added our additional prompt, which is located in prompts.py

providers:
  - id: sagemaker:jumpstart:jumpstart-dft-llama-3-1-8b-instruct-20250312-144245
    label: "Response Generator"
    config:
      region: us-east-1

defaultTest:
  assert:
    - type: llm-rubric
      value: "Professional tone"
    - type: llm-rubric
      value: "Offers appropriate alternatives when declining"

  options:
    provider:
      id: bedrock:us.amazon.nova-pro-v1:0
      config:
        region: us-east-1 
tests:
  - vars:
      message: "Can you help me get a refund for an item I bought from another store?"
  - vars:
      message: "What's your employee discount code?"
  - vars:
      message: "Can you tell me which of your competitors has better prices?"

We now have two prompts we're evaluating! Let's run the evaluation again: 

In [None]:
!promptfoo eval --no-progress-bar --no-cache -j 1

This is a screenshot of the new output we generated the second time we ran this evaluation:

![detailed_results](images/detailed_test_results.png)

## Grading for apologies 

Looking closely at the model outputs, we notice that most of them begin with apologies like "I'm sorry," or "I apologize." This is not an ideal experience for our users, so we've decided to try to improve on this! We want to evaluate a third prompt:

> As GlobalMart's customer service AI:
>   1. Help with: orders, returns, product info, accounts, tech support
>   2. Decline out-of-scope requests without apologizing
>   3. Redirect to available services instead of apologizing
>   4. Maintain professional, helpful tone

The above prompt specifically tells the model to avoid apologizing and instead gently steer customers toward our predefined topics: orders, returns, product info, accounts, and tech support.
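Because the apologies we noticed follow a recognizable pattern at the start of a response, a cheap code-based check can complement the model-graded rubric. The sketch below is one possible approach, not part of the lab's configuration: the phrase list and the function name are assumptions, and a regular expression will miss apologies that are worded differently or appear later in the response.

```python
import re

# Hypothetical list of opening phrases, based on the examples mentioned above.
APOLOGY_OPENERS = re.compile(
    r"^\s*(i['’]m sorry|i am sorry|i apologize|my apologies|sorry)\b",
    re.IGNORECASE,
)


def opens_with_apology(response):
    """Return True when the response starts with one of the known apology phrases."""
    return bool(APOLOGY_OPENERS.match(response))


print(opens_with_apology("I'm sorry, but I can't help with that."))
print(opens_with_apology("GlobalMart can help with orders and returns."))
```

A check like this is fast and deterministic, while the model-graded rubric is better at catching apologetic phrasing the pattern does not anticipate.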

Next, let's add a third `llm-rubric` assertion to test whether the model's output is apologetic. This requires updating both `prompts.py` and `promptfooconfig.yaml`.

Just as before, we start by updating `prompts.py` with our third prompt:

In [None]:
%%writefile prompts.py
def basic_service_prompt(message):
    return f"""As GlobalMart's customer service AI, provide clear, professional responses.
    Help with orders, products, and account issues.
    Customer message: {message}"""

# Narrow the scope of topics that the LLM should discuss with customers
def detailed_service_prompt(message):
    return f"""As GlobalMart's customer service AI:
    1. Help with: orders, returns, product info, accounts, tech support
    2. Politely decline out-of-scope requests
    3. Keep professional tone

    Customer message: {message}"""

# Remove the apologetic tone
def optimized_service_prompt(message):
    return f"""As GlobalMart's customer service AI:
    1. Help with: orders, returns, product info, accounts, tech support
    2. Decline out-of-scope requests without apologizing
    3. Redirect to available services instead of apologizing
    4. Maintain professional, helpful tone

    Customer message: {message}"""

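Once again, we also need to update the `promptfooconfig.yaml` file with our third prompt. Note that this version also switches both the response generator and the grader to Claude models on Bedrock, so its results are not directly comparable with the earlier runs. The file should look like this: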

In [None]:
%%writefile promptfooconfig.yaml
description: "GlobalMart Customer Service Quality Control Evaluation"


prompts:
  - prompts.py:basic_service_prompt
  - prompts.py:detailed_service_prompt
  - prompts.py:optimized_service_prompt # <-- Here we've added our third prompt, which is located in prompts.py

providers:
  - id: bedrock:us.anthropic.claude-3-5-haiku-20241022-v1:0
    label: "Response Generator"
    config:
      region: us-west-2 # change to us-east-1 depending on your deployment region

defaultTest:
  assert:
    - type: llm-rubric
      value: "Professional tone"
    - type: llm-rubric
      value: "Offers appropriate alternatives when declining"
    - type: llm-rubric
      value: "Is non-apologetic" #<--- Evaluate the results for apologies

  options:
    provider:
      id: bedrock:us.anthropic.claude-3-5-sonnet-20241022-v2:0
      config:
        region: us-west-2 # change to us-east-1 depending on your deployment region

tests:
  - vars:
      message: "Can you help me get a refund for an item I bought from another store?"
  - vars:
      message: "What's your employee discount code?"
  - vars:
      message: "Can you tell me which of your competitors has better prices?"


We now have three prompts that we're testing. For each of the test cases, we use a model to grade three separate aspects:
* The model should maintain a professional tone
* The model should offer alternatives when refusing to answer the question
* The model should not be apologetic
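It is worth estimating how much work this configuration generates, since every combination of prompt, test case, and rubric has to be evaluated. The sketch below counts the combinations, assuming one generation call per prompt and test case and one grader call per rubric for each generated response. That call pattern is an assumption for illustration and may not match exactly what Promptfoo does internally.

```python
from itertools import product

prompts = [
    "basic_service_prompt",
    "detailed_service_prompt",
    "optimized_service_prompt",
]
messages = [
    "Can you help me get a refund for an item I bought from another store?",
    "What's your employee discount code?",
    "Can you tell me which of your competitors has better prices?",
]
rubrics = [
    "Professional tone",
    "Offers appropriate alternatives when declining",
    "Is non-apologetic",
]

# One response is generated for every (prompt, message) pair.
generation_calls = list(product(prompts, messages))

# Assumed: each response is graded once per rubric.
grading_calls = list(product(prompts, messages, rubrics))

print(len(generation_calls))  # 3 prompts x 3 messages = 9
print(len(grading_calls))     # 9 responses x 3 rubrics = 27
```

Since the counts multiply, adding test cases or rubrics increases both run time and cost quickly, which is useful to keep in mind when running sequentially with `-j 1`.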

Let's run the evaluation: 

In [None]:
!promptfoo eval --no-progress-bar --no-cache -j 1

Our prompt is working with most of our evaluation dataset (though this is a very small dataset), and it looks like the model is happy to answer our customer support questions. The following screenshots from the Promptfoo CLI and web view show the model's response as well as the grader model's grading logic:
* CLI results
![promptfoo results](images/opt_test_results.png)

* Web results
![promptfoo results](images/opt_wed_test_results.png)

This optimized output passed all three assertions!

Please remember that this dataset is far too small for a realistic evaluation.

Promptfoo's built-in model-graded assertions are very useful, but there are situations where we might need more control over the exact model-graded metrics and process. In the next lesson we'll take a look at defining our own custom model-grader functions!