Conversation

dbczumar (Collaborator) commented Nov 8, 2024

Add initial tests for DSPy reliability with real LLMs

Signed-off-by: dbczumar <corey.zumar@databricks.com>
```diff
       args: check --fix-only
     - name: Run tests with pytest
-      run: poetry run pytest tests/
+      run: poetry run pytest tests/ --ignore=tests/quality
```
dbczumar (Collaborator, Author) commented:

Quality tests aren't ready to run on PRs yet

Comment on lines 9 to 24
```python
# Standard list of models that should be used for periodic DSPy quality testing
MODEL_LIST = [
    "gpt-4o",
    "gpt-4o-mini",
    "gpt-4-turbo",
    "gpt-o1-preview",
    "gpt-o1-mini",
    "claude-3.5-sonnet",
    "claude-3.5-haiku",
    "gemini-1.5-pro",
    "gemini-1.5-flash",
    "llama-3.1-405b-instruct",
    "llama-3.1-70b-instruct",
    "llama-3.1-8b-instruct",
    "llama-3.2-3b-instruct",
]
```
dbczumar (Collaborator, Author) commented:

It's probably useful for us to define a canonical list of models that DSPy tests against. We can extend / update this list over time. For each model in this list, users can specify corresponding LiteLLM configurations in `quality_tests_conf.yaml`, as described in the README.

```
pytest .
```

This will execute all tests for the configured models and display detailed results for each model configuration. Tests are set up to mark expected failures for known challenging cases where a specific model might struggle, while actual (unexpected) DSPy quality issues are flagged as failures (see below).
dbczumar (Collaborator, Author) commented:

As a follow-up, I'll write a pytest post-suite hook that produces a report summarizing the quality results by model. Eventually, we can publish this report with each DSPy release.

Some tests may be expected to fail with certain models, especially in challenging cases. These known failures are logged but do not affect the overall test result. This setup allows us to keep track of model-specific limitations without obstructing general test outcomes. Models that are known to fail a particular test case are specified using the `@known_failing_models` decorator. For example:

```python
@known_failing_models(["llama-3.2-3b-instruct"])
```
dbczumar (Collaborator, Author) commented:

As an alternative, I thought about having each test declare which models pass and only surfacing failures for models in the pass list, but I rejected this because:

  1. It becomes too easy to introduce a model with significant undetected failures that we might want to fix. If the default behavior for adding a new model is that all tests pass regardless of whether the model actually performs well, it's likely that we'll end up with many undisclosed model failures.

  2. Declaring failures makes it much clearer where developers should spend time trying to make improvements. It's easier to see which items are present in a list (failing models present in a list of failing models), rather than which items are absent (failing models absent from a list of passing models).
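For illustration, a decorator along these lines could implement the chosen behavior. This is a sketch under an assumption: the model under test is read from an environment variable here, which is not necessarily how the real harness supplies it.

```python
# Sketch of a @known_failing_models decorator. Assumes the model under test is
# known at call time; reading it from an env var here is purely illustrative.
import functools
import os

import pytest


def known_failing_models(models):
    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            model = os.environ.get("DSPY_TEST_MODEL", "")
            if model in models:
                # Record an expected failure instead of running the test body.
                pytest.xfail(f"{model} is known to fail this test")
            return test_fn(*args, **kwargs)
        return wrapper
    return decorator
```

Recording the expected failure via `pytest.xfail` means known-failing models show up as `xfailed` in the report rather than breaking the run.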

An example of `quality_tests_conf.yaml`:

```yaml
adapter: chat
```
dbczumar (Collaborator, Author) commented:

Later, we can add cache configurations to this file


The configuration must also specify a DSPy adapter to use when testing, e.g. `"chat"` (for `dspy.ChatAdapter`) or `"json"` (for `dspy.JSONAdapter`)

An example of `quality_tests_conf.yaml`:
dbczumar (Collaborator, Author) commented:

CI/CD systems (e.g. GitHub Actions workflows) can define their own test conf YAMLs for periodic regression testing
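For instance, a scheduled GitHub Actions workflow along these lines could run the quality suite periodically; this is a sketch, and the cron schedule, paths, and secret names are all illustrative.

```yaml
# Sketch of a scheduled regression-testing workflow; all values illustrative.
name: DSPy quality tests
on:
  schedule:
    - cron: "0 6 * * *"  # daily at 06:00 UTC
jobs:
  quality:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install poetry && poetry install
      - run: poetry run pytest tests/quality
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```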

@@ -0,0 +1,89 @@
from enum import Enum
dbczumar (Collaborator, Author) commented:

I'll check in additional test cases once we confirm that the overall approach here makes sense

@@ -0,0 +1,58 @@
# DSPy Quality Tests
dbczumar (Collaborator, Author) commented:

Open to alternative names here. We could call this "Adapter" testing, but there's more being tested here than just Adapter logic: we're writing full programs and making quality assertions on their outputs.

I chose "quality" because one objective of DSPy is to provide consistent quality across LLMs.


### Running the Tests

- First, populate the configuration file `quality_tests_conf.yaml` (located in this directory) with the necessary LiteLLM model/provider names and access credentials for:

  1. each LLM you want to test, and
  2. the LLM judge that you want to use for assessing the correctness of outputs in certain test cases.

  These should be placed in the `litellm_params` section for each model in the defined `model_list`. You can also use `litellm_params` to specify values for LLM hyperparameters like `temperature`. Any model that lacks configured `litellm_params` in the configuration file will be ignored during testing.
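For reference, a fuller hypothetical `quality_tests_conf.yaml` might look like the following; the `model` / `api_key` keys follow LiteLLM's `litellm_params` conventions, and every concrete value shown is illustrative.

```yaml
# Illustrative sketch only; substitute your own models and credentials.
adapter: chat
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: "<your OpenAI API key>"
      temperature: 0.0
  - model_name: claude-3.5-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: "<your Anthropic API key>"
  # Models listed without litellm_params are ignored during testing.
```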
dbczumar (Collaborator, Author) commented:

At some point, it seems plausible for us to define recommended hyperparameter configurations for specific LMs that are known to produce better performance (e.g. temperature)

dbczumar changed the title from "Add initial tests for adapter quality against real LLMs" to "Add initial tests for DSPy reliability with real LLMs" on Nov 8, 2024
dbczumar requested a review from okhat on November 8, 2024 02:07
okhat changed the title from "Add initial tests for DSPy reliability with real LLMs" to "Add initial DSPy reliability tests" on Nov 8, 2024
dbczumar merged commit 97032e1 into stanfordnlp:main on Nov 8, 2024 (4 checks passed)