mikeedjones commented Oct 5, 2024

This PR attempts to port all tests currently using DummyLM to a new DummyLM which inherits from dspy.LM - hence migrating the test suite to use the new ChatAdapter.

It also copies all current tests that use the renamed DspDummyLM(dsp.LM) into a folder, tests/dsp_LM, to be deleted when 2.6 is released and dsp.LM is deprecated.

The tests for the to-be-deprecated MIPRO have not been migrated to use dspy.LM.

I also included a fix for predictors which return Literal types (or any type whose origin does not have a __name__ attribute), via a small tweak to dspy/adapters/chat_adapter.py:get_annotation_name. Without this change, tests/functional/test_functional.py:test_literal.* fail. Alternatively, I can skip those tests and create another PR.

This PR also adds an autouse fixture which resets dspy.settings to its defaults after each test; without it, some tests were interdependent.
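
A minimal sketch of the reset-after-each-test pattern described above, using a stand-in object rather than the real dspy.settings (whose internals are not shown in this PR). The names `Settings`, `settings`, `DEFAULTS` and `reset_settings` are hypothetical. In a real conftest.py the generator would be registered with `@pytest.fixture(autouse=True)`; the decorator is left as a comment here so the sketch runs with the standard library alone.

```python
import copy


# Hypothetical stand-in for a global settings object; NOT the real dspy.settings.
class Settings:
    def __init__(self):
        self.config = {"lm": None, "experimental": False}

    def configure(self, **kwargs):
        self.config.update(kwargs)


settings = Settings()
DEFAULTS = copy.deepcopy(settings.config)


# In conftest.py this would carry: @pytest.fixture(autouse=True)
def reset_settings():
    """Let the test run, then restore the defaults so no state leaks out."""
    yield
    settings.config = copy.deepcopy(DEFAULTS)


# Drive the generator by hand, the way pytest would around a single test.
fixture = reset_settings()
next(fixture)                       # setup phase (nothing to do here)
settings.configure(lm="dummy-lm")   # the "test" mutates global state
next(fixture, None)                 # teardown phase restores the defaults
```

Restoring from a deep copy of the defaults, rather than from the defaults object itself, keeps a later test from mutating the saved baseline.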

mikeedjones marked this pull request as draft October 5, 2024 16:11

mikeedjones (Author) commented Oct 6, 2024

Discovered a cute gotcha - if you configure the LM (dspy.settings.configure(lm=lm)) before initializing your modules, you get a different prompt than if you configure it after initializing them.

    dspy.settings.configure(lm=lm)
    pot = ChainOfThought(BasicQA)

Adds "reasoning" to the output signature.

    pot = ChainOfThought(BasicQA)
    dspy.settings.configure(lm=lm)

Adds "rationale" to the output signature.

This happens because dspy.settings.lm is referenced in ChainOfThought.__init__.

I will raise an issue - maybe this is just something to add to the migration docs. It will be resolved when dsp.LM is deprecated.
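
The ordering effect can be reproduced with a toy module that reads a global setting in its constructor. This is only a sketch: `FakeSettings`, `FakeChainOfThought` and the rule that maps a configured LM to a field name are hypothetical. Only the "reasoning" / "rationale" field names and the fact that the LM is read in `__init__` come from the comment above.

```python
# Hypothetical stand-ins; this is not dspy's implementation.
class FakeSettings:
    lm = None

    @classmethod
    def configure(cls, lm):
        cls.lm = lm


class FakeChainOfThought:
    def __init__(self):
        # The global is read once, at construction time, so a later
        # configure() call cannot change which field was picked.
        # Assumed rule for illustration: a configured LM selects "reasoning".
        self.extra_field = "reasoning" if FakeSettings.lm is not None else "rationale"


# Module built before configuring: it sees no LM.
early = FakeChainOfThought()
FakeSettings.configure(lm="dummy-lm")
# Module built after configuring: it sees the LM.
late = FakeChainOfThought()

print(early.extra_field, late.extra_field)
```

Two instances of the same class end up with different output fields purely because of when they were constructed, which is the interdependence the comment warns about.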

mikeedjones marked this pull request as ready for review October 6, 2024 06:03
mikeedjones changed the title from "DRAFT: Feat/port tests to dspy lm client" to "Feat/port tests to dspy lm client" Oct 6, 2024

    - args_str = ', '.join(get_annotation_name(arg) for arg in args)
    - return f"{origin.__name__}[{args_str}]"
    + args_str = ", ".join(get_annotation_name(arg) for arg in args)
    + return f"{get_annotation_name(origin)}[{args_str}]"

mikeedjones (Author) commented

Change to account for the origin of Literal not having a __name__ attribute.
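
A self-contained sketch of how a helper with this fix might look. Only the two changed lines come from the diff; the rest of the function body is an assumed reconstruction, not the code in dspy/adapters/chat_adapter.py. Whether `typing.Literal` exposes `__name__` depends on the Python version, so the exact string printed for the Literal case can differ (for example with or without a `typing.` prefix).

```python
from typing import List, Literal, get_args, get_origin


def get_annotation_name(annotation):
    """Render a type annotation as a readable string."""
    origin = get_origin(annotation)
    args = get_args(annotation)
    if origin is None:
        # Plain classes such as int have __name__; other objects may not,
        # so fall back to str() instead of raising AttributeError.
        if hasattr(annotation, "__name__"):
            return annotation.__name__
        return str(annotation)
    args_str = ", ".join(get_annotation_name(arg) for arg in args)
    # Recursing on the origin (rather than reading origin.__name__) is the
    # change from the diff: the origin goes through the same fallback.
    return f"{get_annotation_name(origin)}[{args_str}]"


list_name = get_annotation_name(List[int])
literal_name = get_annotation_name(Literal["a", "b"])
print(list_name)
print(literal_name)
```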

okhat (Collaborator) commented Oct 6, 2024

Thanks so much, @mikeedjones !!! This is amazing. Great catch on the issue.

    experimental=False,
    backoff_time = 10
    )
    config = DEFAULT_CONFIG

okhat (Collaborator) commented

I like the refactor, but there's a subtle risk here: config is now a reference to a global variable, and I think we should make a deep copy instead. (Arguably it's a singleton, so maybe it's OK as is, but I'd like to be sure it's a unique object.)

mikeedjones (Author) commented

Ahh, good point. Changed that, thanks!
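
The risk raised in this thread can be shown with the standard library alone. This is a sketch that assumes DEFAULT_CONFIG behaves like a plain dict; the keys are loosely modelled on the hunk above, and "callbacks" is a hypothetical key added to show a nested container.

```python
import copy

# Hypothetical module-level defaults, assumed to be a plain dict.
DEFAULT_CONFIG = {"experimental": False, "backoff_time": 10, "callbacks": []}

# Plain assignment: both names point at the same dict.
aliased = DEFAULT_CONFIG
aliased["backoff_time"] = 99          # silently changes the "default" too
aliased_leaked = DEFAULT_CONFIG["backoff_time"] == 99

DEFAULT_CONFIG["backoff_time"] = 10   # undo the leak for the second half

# Deep copy: an independent object, nested containers included.
config = copy.deepcopy(DEFAULT_CONFIG)
config["backoff_time"] = 99
config["callbacks"].append("log")
```

A shallow copy would protect the top-level keys but still share the nested list, which is why a deep copy is the safer choice for a config object.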

okhat (Collaborator) commented Oct 6, 2024

@mikeedjones This is super awesome. I'm tempted to merge as-is, but I left a comment above about DEFAULT_CONFIG that needs to be resolved.

More importantly, though, I see that the current tests check the output string, e.g. [[ ## name ## ]], which may well change frequently in 2.5. Only from 2.6 onwards will we stop adjusting the default adapters.

More generally, we should test that the adapter's parse of the response is unchanged, not that the string is unchanged, if you know what I mean.

mikeedjones (Author) commented

Yeah, I agree - I just tried to replicate the old tests as closely as I could - maybe it's worth writing a dummy adapter as well as a dummy LM, along with a test suite for ChatAdapter? We'd also still have the issue that the lines passed to DummyLM would have to change as the ChatAdapter changes.


    # @pytest.mark.slow_test
    # TODO: Find a way to make this test run without openai
    def _test_baleen():

okhat (Collaborator) commented

Wait what's this file? Was it there before? Not sure we want this.

mikeedjones (Author) commented

    @@ -0,0 +1,94 @@
    """Instructions:

okhat (Collaborator) commented

I think the whole retrieve folder at tests/dsp_LM/retrieve/ probably doesn't need to be under dsp_LM?

mikeedjones (Author) commented

tests/dsp_LM/retrieve/test_llama_index_rm.py uses DspyDummyLM but the tests are skipped in the CI.

okhat (Collaborator) commented Oct 7, 2024

OK, just reiterating: this looks great to me (except I'm not sure I understand the implications of test_baleen.py), but the parts with hard-coded prompt strings and output strings can't be merged directly, as they would break on any change to ChatAdapter. And I expect changes to happen in ChatAdapter.

We could introduce a dummy adapter but at that point what are we really testing? In any case, if making a dummy adapter makes sense to you and makes it easy to get this update overall merged, it sounds good to me in the short-to-medium term.

mikeedjones (Author) commented

Maybe we remove the history comparisons and pass a list of field_name:value dicts into DummyLM, which then has some output formatter? When a dev makes changes to ChatAdapter they would have to update DummyLM's output formatter, which doesn't seem like too much of a lift. It might also make a good smoke test that their changes to ChatAdapter are working as intended?
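
A rough sketch of the proposal above: a dummy LM that is handed a list of field_name:value dicts and an output formatter, and returns one formatted answer per call. `SketchDummyLM` and `format_fields` are hypothetical names, not the DummyLM from this PR, and the `[[ ## name ## ]]` header layout is assumed from the example quoted earlier in the thread.

```python
from typing import Callable, Dict, List


def format_fields(values: Dict[str, str]) -> str:
    """Hypothetical formatter using the [[ ## name ## ]] header style."""
    return "\n\n".join(f"[[ ## {name} ## ]]\n{value}" for name, value in values.items())


class SketchDummyLM:
    """Hypothetical stand-in, not the DummyLM from this PR."""

    def __init__(self, answers: List[Dict[str, str]], formatter: Callable[[Dict[str, str]], str]):
        self._answers = iter(answers)
        self._formatter = formatter

    def __call__(self, prompt: str) -> List[str]:
        # Ignore the prompt and return the next canned answer, formatted
        # the way the adapter under test is expected to parse it.
        return [self._formatter(next(self._answers))]


lm = SketchDummyLM([{"answer": "Paris"}, {"answer": "Rome"}], format_fields)
first = lm("What is the capital of France?")
second = lm("What is the capital of Italy?")
```

Tests then state their canned answers as plain dicts, and only the formatter needs to change when the adapter's output layout does.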

okhat (Collaborator) commented Oct 7, 2024

> Maybe we remove the history comparisons

Sure. Later we can figure out how to test this.

> pass a list of field_name:value dicts into DummyLM which then has some output formatter

Hmm, ideally that formatter is grabbed directly from the adapter, I guess? In essence, the test will just check that parse(format_output_values(values)) == values, right?

I don't want us to maintain two different copies of the same thing for the tests' sake.

Btw I'm totally happy to have us merge this PR without this bit about formatting altogether, then we can discuss that part as a second PR. Your call. It's fine this way too.
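
The round-trip check mentioned above, parse(format_output_values(values)) == values, could look roughly like this. Both functions are hypothetical stand-ins written for this sketch, not ChatAdapter's methods; the `[[ ## name ## ]]` header layout is assumed from the example quoted earlier in the thread.

```python
import re
from typing import Dict

HEADER = re.compile(r"^\[\[ ## (\w+) ## \]\]$")


def format_output_values(values: Dict[str, str]) -> str:
    """Hypothetical formatter: one header line per field, then its value."""
    return "\n\n".join(f"[[ ## {name} ## ]]\n{value}" for name, value in values.items())


def parse(text: str) -> Dict[str, str]:
    """Hypothetical parser: split the text back into fields at header lines."""
    fields: Dict[str, list] = {}
    current = None
    for line in text.splitlines():
        match = HEADER.match(line.strip())
        if match:
            current = match.group(1)
            fields[current] = []
        elif current is not None:
            fields[current].append(line)
    return {name: "\n".join(lines).strip() for name, lines in fields.items()}


values = {"reasoning": "Paris is the capital of France.", "answer": "Paris"}
round_tripped = parse(format_output_values(values))
```

Because the assertion compares parsed dicts rather than raw strings, it keeps passing when the header syntax changes, as long as the formatter and parser change together.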

mikeedjones (Author) commented Oct 7, 2024

I guess the formatter for DummyLM would be the inverse of ChatAdapter.parse, so it could possibly use the example formatter, I think?

I think we merge, and I'll make a PR removing the history comparison and adding the formatter now, if that's OK?

mikeedjones (Author) commented Oct 7, 2024

#1595 <- can close this in favor of an update?
