Add initial DSPy reliability tests #1773

Conversation
.github/workflows/run_tests.yml
Outdated
```diff
        args: check --fix-only
  - name: Run tests with pytest
-   run: poetry run pytest tests/
+   run: poetry run pytest tests/ --ignore=tests/quality
```
Quality tests aren't ready to run on PRs yet
tests/quality/conftest.py
Outdated
```python
# Standard list of models that should be used for periodic DSPy quality testing
MODEL_LIST = [
    "gpt-4o",
    "gpt-4o-mini",
    "gpt-4-turbo",
    "gpt-o1-preview",
    "gpt-o1-mini",
    "claude-3.5-sonnet",
    "claude-3.5-haiku",
    "gemini-1.5-pro",
    "gemini-1.5-flash",
    "llama-3.1-405b-instruct",
    "llama-3.1-70b-instruct",
    "llama-3.1-8b-instruct",
    "llama-3.2-3b-instruct",
]
```
It's probably useful for us to define a canonical list of models that DSPy tests against. We can extend or update this list over time. For each model in this list, users can specify corresponding LiteLLM configurations in quality_conf.yaml, as described in the README.
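To make the comment above concrete, here is a hypothetical sketch of what per-model LiteLLM configuration entries might look like. The exact schema (`model_list`, `model_name`, `litellm_params`) mirrors LiteLLM's router-style config and the field names used later in this review, but the specific keys and values are assumptions, not confirmed by this PR:

```yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: ${OPENAI_API_KEY}
  - model_name: claude-3.5-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: ${ANTHROPIC_API_KEY}
      temperature: 0.0
```

Models from the canonical list that have no entry here would simply be skipped during testing.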
tests/quality/README.md
Outdated
```
pytest .
```

This will execute all tests for the configured models and display detailed results for each model configuration. Tests are set up to mark expected failures for known challenging cases where a specific model might struggle, while actual (unexpected) DSPy quality issues are flagged as failures (see below).
As a follow-up, I'll write a pytest post suite hook that produces a report summarizing the quality results by model. Eventually, we can publish this report as we make DSPy releases
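The post-suite hook mentioned above could be built around a small aggregation step. The sketch below is a hypothetical illustration (the record format and helper names are assumptions, not part of this PR) of how per-model results might be summarized before being written out by a pytest reporting hook:

```python
from collections import defaultdict

def summarize_by_model(results):
    """Aggregate test outcomes ("passed", "failed", "xfailed") per model name.

    `results` is assumed to be an iterable of (model_name, outcome) pairs
    collected from the test run; this format is illustrative only.
    """
    summary = defaultdict(lambda: {"passed": 0, "failed": 0, "xfailed": 0})
    for model, outcome in results:
        if outcome in summary[model]:
            summary[model][outcome] += 1
    return dict(summary)

# In conftest.py, a summary like this could be emitted from pytest's
# post-suite reporting hook, e.g. (pseudocode, `collect_model_reports`
# is a hypothetical helper):
#
# def pytest_terminal_summary(terminalreporter, exitstatus, config):
#     results = [(r.model, r.outcome) for r in collect_model_reports(terminalreporter)]
#     for model, counts in summarize_by_model(results).items():
#         terminalreporter.write_line(f"{model}: {counts}")
```

Publishing the aggregated dict as a per-release report would then be a simple serialization step.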
Some tests may be expected to fail with certain models, especially in challenging cases. These known failures are logged but do not affect the overall test result. This setup allows us to keep track of model-specific limitations without obstructing general test outcomes. Models that are known to fail a particular test case are specified using the `@known_failing_models` decorator. For example:

```
@known_failing_models(["llama-3.2-3b-instruct"])
```
As an alternative, I thought about having each test declare which models pass and only surfacing failures for models in the pass list, but I rejected this because:

- It becomes too easy to introduce a model with significant undetected failures that we might want to fix. If the default behavior for adding a new model is that all tests pass regardless of whether the model actually performs well, it's likely that we'll end up with many undisclosed model failures.
- Declaring failures makes it much clearer where developers should spend time trying to make improvements. It's easier to see which items are present in a list (failing models present in a list of failing models) than which items are absent (failing models absent from a list of passing models).
An example of `quality_tests_conf.yaml`:

```yaml
adapter: chat
```
Later, we can add cache configurations to this file
tests/quality/README.md
Outdated
The configuration must also specify a DSPy adapter to use when testing, e.g. `"chat"` (for `dspy.ChatAdapter`) or `"json"` (for `dspy.JSONAdapter`).

An example of `quality_tests_conf.yaml`:
CI/CD systems (e.g. GitHub Actions workflows) can define their own test conf YAMLs for periodic regression testing
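As a hypothetical illustration of the CI/CD-specific conf YAML mentioned above, a GitHub Actions workflow could check in its own file alongside the tests, pulling credentials from repository secrets via environment variables (the field names follow the LiteLLM-style schema assumed earlier in this review and are not confirmed by the PR):

```yaml
# quality_tests_conf.ci.yaml (hypothetical CI-specific configuration)
adapter: chat
model_list:
  - model_name: gpt-4o-mini
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: ${OPENAI_API_KEY}
  - model_name: llama-3.1-8b-instruct
    litellm_params:
      model: together_ai/meta-llama/Llama-3.1-8B-Instruct
      api_key: ${TOGETHERAI_API_KEY}
```

The periodic workflow would then point the suite at this file instead of the developer-local default.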
```
@@ -0,0 +1,89 @@
from enum import Enum
```
I'll check in additional test cases once we confirm that the overall approach here makes sense
tests/quality/README.md
Outdated
```
@@ -0,0 +1,58 @@
# DSPy Quality Tests
```
Open to alternative names here. We could call this "Adapter" testing, but there's more being tested here than just Adapter logic - we're writing full programs and making quality assertions on their outputs.
I chose "quality" because one objective of DSPy is to provide consistent quality across LLMs
tests/quality/README.md
Outdated
### Running the Tests

- First, populate the configuration file `quality_tests_conf.yaml` (located in this directory) with the necessary LiteLLM model/provider names and access credentials for (1) each LLM you want to test and (2) the LLM judge that you want to use for assessing the correctness of outputs in certain test cases. These should be placed in the `litellm_params` section for each model in the defined `model_list`. You can also use `litellm_params` to specify values for LLM hyperparameters like `temperature`. Any model that lacks configured `litellm_params` in the configuration file will be ignored during testing.
At some point, it seems plausible for us to define recommended hyperparameter configurations for specific LMs that are known to produce better performance (e.g. temperature)
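The "models without `litellm_params` are ignored" behavior described in the README excerpt above could be implemented with a small filter like the following. This is a hypothetical sketch (function and key names are assumptions based on the schema discussed in this review), not the PR's actual conftest logic:

```python
def configured_models(conf):
    """Return model names from the parsed conf that have non-empty litellm_params.

    `conf` is assumed to be the dict produced by parsing quality_tests_conf.yaml.
    """
    return [
        entry["model_name"]
        for entry in conf.get("model_list", [])
        if entry.get("litellm_params")  # skip models with missing/empty params
    ]
```

The test suite could then parametrize only over `configured_models(conf)`, so an unconfigured model silently drops out of the run rather than erroring on missing credentials.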
Add initial tests for DSPy reliability with real LLMs