Conversation

dbczumar (Collaborator) commented Nov 8, 2024

Add initial tests for DSPy reliability with real LLMs

Signed-off-by: dbczumar <corey.zumar@databricks.com>
```diff
       args: check --fix-only
     - name: Run tests with pytest
-      run: poetry run pytest tests/
+      run: poetry run pytest tests/ --ignore=tests/quality
```
dbczumar (Collaborator, Author) commented:

Quality tests aren't ready to run on PRs yet

Comment on lines 9 to 24
```python
# Standard list of models that should be used for periodic DSPy quality testing
MODEL_LIST = [
    "gpt-4o",
    "gpt-4o-mini",
    "gpt-4-turbo",
    "gpt-o1-preview",
    "gpt-o1-mini",
    "claude-3.5-sonnet",
    "claude-3.5-haiku",
    "gemini-1.5-pro",
    "gemini-1.5-flash",
    "llama-3.1-405b-instruct",
    "llama-3.1-70b-instruct",
    "llama-3.1-8b-instruct",
    "llama-3.2-3b-instruct",
]
```
dbczumar (Collaborator, Author) commented:

It's probably useful for us to define a canonical list of models that DSPy tests against. We can extend / update this list over time. For each model in this list, users can specify corresponding LiteLLM configurations in `quality_tests_conf.yaml`, as described in the README.

```
pytest .
```

This will execute all tests for the configured models and display detailed results for each model configuration. Tests are set up to mark expected failures for known challenging cases where a specific model might struggle, while actual (unexpected) DSPy quality issues are flagged as failures (see below).
dbczumar (Collaborator, Author) commented:

As a follow-up, I'll write a pytest post-suite hook that produces a report summarizing the quality results by model. Eventually, we can publish this report with each DSPy release.

Some tests may be expected to fail with certain models, especially in challenging cases. These known failures are logged but do not affect the overall test result. This setup allows us to keep track of model-specific limitations without obstructing general test outcomes. Models that are known to fail a particular test case are specified using the `@known_failing_models` decorator. For example:

```python
@known_failing_models(["llama-3.2-3b-instruct"])
```
dbczumar (Collaborator, Author) commented:

As an alternative, I thought about having each test declare which models pass and only surfacing failures for models in the pass list, but I rejected this because:

  1. It becomes too easy to introduce a model with significant undetected failures that we might want to fix. If the default behavior for adding a new model is that all tests pass regardless of whether the model actually performs well, it's likely that we'll end up with many undisclosed model failures.

  2. Declaring failures makes it much clearer where developers should spend time trying to make improvements. It's easier to see which items are present in a list (failing models present in a list of failing models), rather than which items are absent (failing models absent from a list of passing models).
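For illustration, a decorator along these lines could implement the chosen behavior. This is a sketch under an assumption: the model under test is read from an environment variable here, which is not necessarily how the real harness supplies it.

```python
# Sketch of a @known_failing_models decorator. Assumes the model under test is
# known at call time; reading it from an env var here is purely illustrative.
import functools
import os

import pytest


def known_failing_models(models):
    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            model = os.environ.get("DSPY_TEST_MODEL", "")
            if model in models:
                # Record an expected failure instead of running the test body.
                pytest.xfail(f"{model} is known to fail this test")
            return test_fn(*args, **kwargs)
        return wrapper
    return decorator
```

Recording the expected failure via `pytest.xfail` means known-failing models show up as `xfailed` in the report rather than breaking the run.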

An example of `quality_tests_conf.yaml`:

```yaml
adapter: chat
```
dbczumar (Collaborator, Author) commented:

Later, we can add cache configurations to this file


The configuration must also specify a DSPy adapter to use when testing, e.g. `"chat"` (for `dspy.ChatAdapter`) or `"json"` (for `dspy.JSONAdapter`)

An example of `quality_tests_conf.yaml`:
dbczumar (Collaborator, Author) commented:

CI/CD systems (e.g. GitHub Actions workflows) can define their own test conf YAMLs for periodic regression testing
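For instance, a scheduled GitHub Actions workflow along these lines could run the quality suite periodically; this is a sketch, and the cron schedule, paths, and secret names are all illustrative.

```yaml
# Sketch of a scheduled regression-testing workflow; all values illustrative.
name: DSPy quality tests
on:
  schedule:
    - cron: "0 6 * * *"  # daily at 06:00 UTC
jobs:
  quality:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install poetry && poetry install
      - run: poetry run pytest tests/quality
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```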

@@ -0,0 +1,89 @@
from enum import Enum
dbczumar (Collaborator, Author) commented:

I'll check in additional test cases once we confirm that the overall approach here makes sense

@@ -0,0 +1,58 @@
# DSPy Quality Tests
dbczumar (Collaborator, Author) commented:

Open to alternative names here. We could call this "Adapter" testing, but there's more being tested here than just Adapter logic: we're writing full programs and making quality assertions on their outputs.

I chose "quality" because one objective of DSPy is to provide consistent quality across LLMs.


### Running the Tests

- First, populate the configuration file `quality_tests_conf.yaml` (located in this directory) with the necessary LiteLLM model/provider names and access credentials for:

  1. each LLM you want to test, and
  2. the LLM judge that you want to use for assessing the correctness of outputs in certain test cases.

  These should be placed in the `litellm_params` section for each model in the defined `model_list`. You can also use `litellm_params` to specify values for LLM hyperparameters like `temperature`. Any model that lacks configured `litellm_params` in the configuration file will be ignored during testing.
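For reference, a fuller hypothetical `quality_tests_conf.yaml` might look like the following; the `model` / `api_key` keys follow LiteLLM's `litellm_params` conventions, and every concrete value shown is illustrative.

```yaml
# Illustrative sketch only; substitute your own models and credentials.
adapter: chat
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: "<your OpenAI API key>"
      temperature: 0.0
  - model_name: claude-3.5-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: "<your Anthropic API key>"
  # Models listed without litellm_params are ignored during testing.
```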
dbczumar (Collaborator, Author) commented:

At some point, it seems plausible for us to define recommended hyperparameter configurations for specific LMs that are known to produce better performance (e.g. temperature)

dbczumar changed the title from "Add initial tests for adapter quality against real LLMs" to "Add initial tests for DSPy reliability with real LLMs" on Nov 8, 2024
dbczumar requested a review from okhat on November 8, 2024 02:07
okhat changed the title from "Add initial tests for DSPy reliability with real LLMs" to "Add initial DSPy reliability tests" on Nov 8, 2024
dbczumar merged commit 97032e1 into stanfordnlp:main on Nov 8, 2024 (4 checks passed)