Support Chat Mode #662

Closed
thomasahle opened this issue Mar 16, 2024 · 46 comments

Comments

@thomasahle
Collaborator

Hopefully the new LM backend will allow us to make better use of models that are trained for "Chat".
Below is a good example of how even good models like GPT-3.5 currently have trouble understanding the basic DSPy format:
Screenshot 2024-03-15 at 6 46 18 PM

Right now we use chat mode as if it was completion mode.
We send:

messages: [
{"from": "user", "message": "guidance, input0, output0, input1, output1, input2"}
]

And expect the agent to reply with

{"from": "agent", "message": "output2"}

A better use of the Chat APIs would be to send

messages: [
{"from": "system", "message": guidance},
{"from": "user", "message": input0},
{"from": "agent", "message": output0},
{"from": "user", "message": input1},
{"from": "agent", "message": output2},
{"from": "user", "message": input2},
]

That is, we simulate a previous chat, where the agent always replied with the output in the format we expect.
This teaches the agent not to start its message with "OK! Let me get to it!" or repeat the template as in the gpt-3.5 screenshot above.

Also, using the system message for the guidance should help avoid prompt injection attacks.
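
For illustration, a minimal sketch of how demos could be flattened into such a message list (using OpenAI-style role/content keys; the helper and the field formatting are hypothetical, not current DSPy code):

def demos_to_messages(guidance, demos, new_inputs):
    # System turn carries the signature/guidance; each demo becomes one
    # user/assistant exchange; the final user turn holds only the new inputs.
    messages = [{"role": "system", "content": guidance}]
    for demo_inputs, demo_outputs in demos:
        messages.append({"role": "user", "content": demo_inputs})
        messages.append({"role": "assistant", "content": demo_outputs})
    messages.append({"role": "user", "content": new_inputs})
    return messages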

@okhat
Collaborator

okhat commented Mar 16, 2024

Totally agreed but I'd love for this to be more data-driven.

Either: (a) meta prompt engineering for popular models + easy addition of new adapters for new LMs if needed, or (b) automatic exploration of a new LM on standard tasks to automatically establish the patterns that "work" for that LM.

Do you have a way that fixes the sql_query example you had?

@okhat
Collaborator

okhat commented Mar 16, 2024

Also, I wonder to what extent the behavior you saw is because "Follow the following format." does not explicitly say "Complete the unfilled fields in accordance with the following format." Basically, the instruction is slightly misleading for chat models.

@CyrusOfEden
Collaborator

My sense is that interleaving inputs/outputs as a default would be a footgun because I would assume all outputs depend on all inputs, and the LLM doesn’t have access to this.

Right now our focus is using LiteLLM for broad support + moving over all dsp code into DSPy.

I’d love to tackle something like this when we look at the current Template usage and how that’s currently responsible for going from Example => prompt, and offering users some more flexibility with how an LLM gets called with an example.

@thomasahle
Collaborator Author

thomasahle commented Mar 17, 2024

@CyrusOfEden I'm not sure what you mean by "interleaving inputs/outputs". This is already how DSPy works, no?

I think you misunderstood what I mean by (input_i, output_i).
I'm talking about an entire example/demo. Not two fields from the same demo.

@meditans
Contributor

meditans commented Mar 17, 2024

I feel the problem here is, most of the time, the positioning of the user end token. Say you have a prompt template that wraps the user message in [INST] and [/INST], as in Mixtral. You would have:

[INST]
...

input1: ...
input2: ...
output:[/INST]

???

It seems to me that the model is led to believe that the user turn is done (and the user has written a complete template, albeit with an empty output). It would be more correct to say:

[INST]
...

input1: ...
input2: ...[/INST]

output: ???

explicitly leaving the output field to the assistant (it suffices to create a {"role": "assistant", "content":"output:"} message).
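
Concretely, a sketch of the message list for the second layout (field names are illustrative; whether the final assistant turn is continued rather than closed depends on the client, e.g. Anthropic-style prefill or a local chat template that supports continuing the last message):

messages = [
    {"role": "user", "content": "...\n\ninput1: ...\ninput2: ..."},
    # Partial assistant turn: [/INST] now closes the user turn above, and the
    # model continues generating right after "output:".
    {"role": "assistant", "content": "output:"},
]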

Also, I wonder to what extent the behavior you saw is because "Follow the following format." does not explicitly say "Complete the unfilled fields in accordance with the following format." Basically, the instruction is slightly misleading for chat models.

@okhat I have done this experiment by writing a mini-version of Predict myself, with the prompt you are suggesting. I still have the same problem @thomasahle demonstrated in his initial post. This reinforces my belief that the token position is to blame. The version I proposed works instead.

Totally agreed but I'd love for this to be more data-driven.

I don't know precisely what you have in mind, but it seems to me that fixing the semantics of the multi-turn user-assistant conversation is orthogonal to the concern of wording the prompt differently.

@thomasahle
Collaborator Author

thomasahle commented Mar 17, 2024

@meditans I suppose Mixtral is not a "chat model" but an "instruction model".

What Omar says about having the framework automatically find the best prompting would of course be great.
But if Mixtral can be shown to work better in 90% of cases with @meditans' token placement, then I'd be more than happy to just have that built into the Mixtral LM class.

We may also note that others have thought about how best to do few-shot prompting with chat models. Such as

@meditans
Contributor

meditans commented Mar 17, 2024

@thomasahle you are right, I am using a (local, quantized) chat finetune of Mixtral, not baseline Mixtral.

In fact, the langchain page you proposed is quite close to what I'm saying here (essentially the same thing).

@CyrusOfEden
Collaborator

@CyrusOfEden I'm not sure what you mean by "interleaving inputs/outputs". This is already how DSPy works, no?

I think you misunderstood what I mean by (input_i, output_i).

I'm talking about an entire example/demo. Not two fields from the same demo.

I see now and this makes sense to me! I thought it was inputs/outputs not examples/demos :)

@mitchellgordon95
Contributor

+1 to the problem @thomasahle is describing. I am also seeing it on gemini-1.0-pro.

And +1 to @meditans: the root of the problem is that special tokens for conversational formatting are being added to the prompt without anyone really thinking about it.

I like @thomasahle's proposed solution of just formatting the few-shots in chat mode. The only downside I see is that it will no longer be possible to force the model to follow a specific prefix for the rationale. But this can probably be solved with some prompt engineering. Something like:

Follow the following format.

Question: ${question}
Rationale: Let's think step by step in order to ${produce the answer}. We ...
Answer: ${answer}

Repeat the user's message verbatim, and then finish the example.

@mitchellgordon95
Contributor

mitchellgordon95 commented Mar 19, 2024

Regardless of whether we do meta prompting or not, we will need to update the LM interface and template class to support chat formatting as a special case, since most LLM providers do not expose which special tokens they use to do chat formatting and only allow it through the API. This could probably be done during the LiteLLM integration.

And since we're going to do that, I think it would be good to just put a default chat-style format that works ok for most models, while structuring the code in such a way that meta optimization can be added easily later. My intuition is that default prompts just need to be "good enough" to bootstrap a few good traces, and as long as that works people won't really care about how good the default prompt format is or care to optimize it for their particular model.

@meditans
Contributor

I think it would be good to just put a default chat-style format that works ok for most models

When you say "default chat-style format", what do you have in mind? I'm not sure whether you're referring to the wording or to the structure of the payload that most API providers and local servers use.

@meditans
Contributor

Also, regardless, could we leave an escape hatch for the user to provide a function that builds the arguments to send to the LLM? Then one could just use the default one or provide tweaks.

@okhat
Collaborator

okhat commented Mar 19, 2024

Adding few-shot examples in chat turns will probably not fix the fact that most programs will need to bootstrap starting from zero-shot prompts. But major +1 to any exploration of how to get most chat models to reliably understand that we want them to complete the missing fields.

@okhat
Collaborator

okhat commented Mar 19, 2024

Btw I suspect this is easy. It's not happening right now just because no one ever tried :D. We've been using the same template since 2022 before RLHF and chat models (i.e., since text-davinci-002). The DSPy optimizers help make this less urgent than it would be otherwise because most models learn to do things properly with compiling, but ideally zero-shot (unoptimized) usage works reliably too. That will lead to better optimization.

@okhat
Collaborator

okhat commented Mar 19, 2024

@isaacbmiller This is a great self-contained exploration. Can we do this for 3-4 diverse chat models?

@KCaverly
Collaborator

Just catching up on this. It may be helpful for folks to take a look at the new Template class; it should contain all the TemplateV2/TemplateV3 functionality.

Additionally, all functionality for generating a prompt and passing it to the LM is contained within the new Backends themselves. We've already got a TemplateBackend which should match current functionality, along with a JSON backend which returns the content as JSON directly.

We could always create a separate version of the Template which returns the Signature + Examples as a series of ChatML messages, which we then pass, instead of a prompt, directly to the LiteLLM model.

Currently to call the LMs we do this:

# Generate Example
example = Example(demos=demos, **kwargs)

# Initialize and call template
# prompt is generated as a string
template = Template(signature)
prompt = template(example)

# Pass through language model provided
result = self.lm(prompt=prompt, **config)

It would be pretty straightforward to do something like this instead.
For the BaseLM abstraction we could make both messages and prompt optional, and ensure that one or the other, but not both, is provided.

# Generate Example
example = Example(demos=demos, **kwargs)

# Initialize and call template
# messages is generated as a [{"role": "...", "content": "..."}]
template = ChatTemplate(signature)
messages = template(example)

# Pass through language model provided
result = self.lm(messages=messages, **config)

@CyrusOfEden and I have chatted about this in the past; not sure how we should separate out Templates vs Backends. Each Backend will need a Template of some kind to format prompts, but each Backend can leverage a variety of Templates, so it's not quite one-to-one.

We should be fairly close to landing the new Backend framework in main, and then I think this is a great next step.

@thomasahle
Collaborator Author

thomasahle commented Mar 19, 2024

I think it's an interesting idea to support multiple different Templates.
I assume the code you wrote would all be inside Predict, so the user never actually has to call self.lm(...).
Maybe a Predict can have a template, similar to how it has a signature.
Then we can even have a TemplateOptimizer that optimizes the template the same way SignatureOptimizer optimizes the signature.

Then your code would look like this:

template = self.template_type(self.signature)
messages = template(example)

@CyrusOfEden and @KCaverly would this fit into the refactor?

@thomasahle
Collaborator Author

thomasahle commented Mar 20, 2024

Some more examples of LMs being unable to understand the basic format:
claude-3-opus-20240229:
Screenshot 2024-03-20 at 4 27 03 PM
gpt-4:
Screenshot 2024-03-20 at 4 27 12 PM
gpt-3.5-turbo:
Screenshot 2024-03-20 at 4 27 08 PM
gpt-3.5-turbo-instruct:
Screenshot 2024-03-20 at 4 30 10 PM

@thomasnormal

Relevant discussion: #420

@conradlee

I've been running into this problem as well using typed predictors.

@okhat I think your suggestion of substituting "Follow the following format." with "Complete the unfilled fields in accordance with the following format." if a chat model is used would go a long way in the short run, but in the long run an approach that adapts the prompting technique to each model would be ideal.

I'll note that the problem is especially bad with the TypedChainOfThought predictor, because of the way this one mixes structured output with unstructured 'Think it through step by step'. This leads the model to produce bits of unstructured text where DSPy expects a structured output.

@KCaverly
Collaborator

FWIW - the new backend system would allow you to provide your own templates and supports chat mode. If you have a fuller example you can share, I would be keen to test it out and see if there are any improvements.

@isaacbmiller
Collaborator

@KCaverly Is there a better way to pass a template than through a config option?

@KCaverly
Collaborator

I've been working on it here: #717.

So far, I'm passing it during generation. The backend has a default template argument that can be overridden in modules when the backend is called. This would allow us to either pass templates dynamically as the Module evolves, or set one at the Module level and pass it through, etc.

@isaacbmiller
Collaborator

That looks great. I will take an in-depth look later today.

Should I switch to building off of that branch?

@thomasahle
Collaborator Author

TIL you can "prefill" the responses from agents in both Claude and GPT: https://docs.anthropic.com/claude/docs/prefill-claudes-response

import anthropic

client = anthropic.Anthropic(
    # defaults to os.environ.get("ANTHROPIC_API_KEY")
    api_key="my_api_key",
)
message = client.messages.create(
    model="claude-2.1",
    max_tokens=1000,
    temperature=0,
    messages=[
        {
            "role": "user",
            "content": "Please extract the name, size, price, and color from this product description and output it within a JSON object.\n\n<description>The SmartHome Mini is a compact smart home assistant available in black or white for only $49.99. At just 5 inches wide, it lets you control lights, thermostats, and other connected devices via voice or app—no matter where you place it in your home. This affordable little hub brings convenient hands-free control to your smart devices.\n</description>"
        },
        {
            "role": "assistant",
            "content": "{"
        }
    ]
)
print(message.content)

This means we can still use "prefixes" when using the chat api.

Also, moving the first output-variable to the "agent side" is probably better than what we do now, putting it at the end of the user side. Similar to @meditans' comment about Mixtral.

Does this fit into your new template system @KCaverly?

@KCaverly
Collaborator

If you take a look at the JSONBackend, we do something very similar. For JSON-mode models, we prompt the model to complete an incomplete JSON object, as opposed to rewriting it from scratch. Additionally, all of the demo objects are shown in completed JSON format, which hopefully helps enforce the appropriate schema as well.
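
Roughly the shape of prompt that approach yields (an illustrative sketch, not the exact JSONBackend output):

# Demos appear as completed JSON objects; the final object is left unfinished,
# so the model's natural continuation fills in the remaining fields.
prompt = (
    '{"question": "What is 2 + 2?", "answer": "4"}\n'
    '{"question": "What is 3 + 5?", "answer": "8"}\n'
    '{"question": "What is 7 + 6?", "answer": '
)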

@Josephrp

I think it's an interesting idea to support multiple different Template's.

I also think so, because I use a lot of models and every single one has a different template; the problem is compounded by DSPy. But the good news is that we can probably organise custom templates in a special .contrib folder, so that templates that naturally have to be written for any new model (or task!) can also be pushed upstream.

@thomasahle
Collaborator Author

If you take a look at the JSONBackend, we do something very similar. For json mode models, we prompt the model to complete the json as an incomplete object, as opposed to rewriting it from scratch. Additionally all of the demo objects are also shown in the completed JSON format which hopefully helps enforce the appropriate schema as well.

Where should I look? In https://github.com/KCaverly/dspy/blob/f9c1adf837f1384fca60ed71dd2f32db47969746/dspy/modeling/templates/json.py#L29 it seems like everything gets stuffed into the user message.

But I believe this prefill trick should also be used by the text template/backend, if the backend API is chat.

@KCaverly
Collaborator

Everything is currently stuffed into one message, but instead of providing the question and asking for a JSON response, we send an incomplete JSON object and ask the model to complete it. Not quite prefilling, but kinda similar.

@thomasahle
Collaborator Author

Sending an incomplete object can work well if the model understands it's supposed to complete it. Prefilling makes this easier for chat api models.

I'm not saying you always have to use prefilling, just asking if I'll be able to make a template that works this way?

I should probably pull your code and try it out 😁

@KCaverly
Collaborator

I think it should work, would be a great test.

@Serjobas

Serjobas commented Mar 29, 2024

#701

This PR should've added this functionality.

@thomasahle
Collaborator Author

@Serjobas What do you mean?

@okhat
Collaborator

okhat commented Mar 30, 2024

Btw @isaacbmiller @thomasahle @KCaverly I confirmed that "Follow the format below. Start your completions where the supplied fields end." communicates what we intended more clearly here with chat models, though obviously pre-filling appears to be a better approach in conjunction with this.

What we need IMO is a way to have a breaking version of DSPy (where all old caches and everything else will not work anymore) and a sustained version, until we make a major release.

One way to do that is to have a flag, e.g. dspy.settings.configure(mode=2024) or something like that.

@KCaverly
Collaborator

@okhat that sounds good to me. Currently, the backend-refactor branch is built to be completely backwards compatible, with the only breaking changes surrounding versioning on openai and others for litellm, and litellm is set up as an optional extra. If a backend is configured, we would prioritize and use the new backend structure; otherwise it would operate the same as the old method.

Maybe we would want some sort of deprecation message in the interim, pointing to new documentation on the backend?

@thomasahle
Collaborator Author

Btw @isaacbmiller @thomasahle @KCaverly I confirmed that "Follow the format below. Start your completions where the supplied fields end." communicates what we intended more clearly here with chat models, though obviously pre-filling appears to be a better approach in conjunction with this.

What we need IMO is a way to have a breaking version of DSPy (where all old caches and everything else will not work anymore) and a sustained version, until we make a major release.

What do you think about #717 's approach of doing

dspy.settings.configure(backend=TemplateBackend(ChatTemplate()))

That wouldn't break any existing code / notebooks.

I think the modified default guidelines you mention could be helpful, but as you say, they break existing caches.
I also agree that we might need to have a "breaking changes" release at one point. Maybe mode=2024 is a good approach for this, or maybe we can use feature flags, like configure(enable_backends=True).

But the changes needed for this particular issue don't seem to require breaking anything, if we keep the default backend as an exact copy of the current behavior.

@flexorRegev

flexorRegev commented Apr 9, 2024

Sorry for jumping in pretty late to this discussion, but I think I have some ideas, I really want to use them, and I'd like to understand how I can contribute.
The reality I see with chat models is that what usually works best at making them follow instructions in a few-shot, task-specific setting is a format like this.
In a zero-shot manner:
System: """|Task description|
|Output formatting requirements|"""
User: """|User input|"""

In a few-shot manner:
System: """|Task description|
|Output formatting requirements|"""
User: """|example User input1|"""
Assistant: """|example output1|"""
User: """|example User input2|"""
Assistant: """|example output2|"""
...
User: """|User input|"""

I think this makes the most sense given how prompts are formatted today: keep the signature + output formatting in the system prompt of the model and the examples as the chat conversation.
What would be the correct way to support that?
I'd be glad to contribute an example/PR if any revision is needed, because from what I'm seeing this boosts performance and instruction following by quite a bit on the chat models I'm working with.
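
For reference, a minimal sketch of sending a payload structured like the few-shot layout above straight through LiteLLM (illustrative only, not the DSPy backend API):

import litellm

messages = [
    {"role": "system", "content": "Task description\nOutput formatting requirements"},
    {"role": "user", "content": "example user input 1"},
    {"role": "assistant", "content": "example output 1"},
    {"role": "user", "content": "actual user input"},
]
# LiteLLM mirrors the OpenAI response shape.
response = litellm.completion(model="gpt-3.5-turbo", messages=messages, temperature=0)
print(response.choices[0].message.content)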

@okhat @thomasahle your thoughts?

@CyrusOfEden
Collaborator

@flexorRegev exactly! Support for this is landing in backend-refactor

@ryanh-ai

Is there a workaround identified for this until the new backend is completed? I am running into this repeatedly with GPT4

@ryanh-ai

ryanh-ai commented Apr 28, 2024

Follow up here, with dspy.OpenAI models, I was able to make the below work:

gpt4_turbo = dspy.OpenAI(model='gpt-4-turbo-2024-04-09',
                         system_prompt="Follow the format below. Start your completions where the supplied fields end.",
                         model_type='chat')

Not all models have the system_prompt parameter, so I may try to weave this in with an assertion instead of cutting into the base code, given y'all are working on backend-refactor.

@wullli

wullli commented May 2, 2024

@meditans @thomasahle I'm not sure if this is related to the mentioned repetition of the prompt, since I don't know which model client you used. In my case, I think there is a mistake with the output handling in the HFModel class. For example, when I load a model that has the MistralForCausalLM architecture, the attribute drop_prompt_from_output will be set to False. However, looking at the docs, we can see that the model actually returns a transformers.modeling_outputs.CausalLMOutputWithPast. So of course the prompt will be included.

I don't understand how the current try-except block can be used to decide which kind of output (with or without past) it is. The type of the output should probably be checked.

@derenrich

Follow up here, with dspy.OpenAI models, I was able to make the below work:

gpt4_turbo = dspy.OpenAI(model='gpt-4-turbo-2024-04-09',
                         system_prompt="Follow the format below. Start your completions where the supplied fields end.",
                         model_type='chat')

Not all models have the system_prompt parameter, so I may try to weave this in with an assertion instead of cutting into the base code, given y'all are working on backend-refactor.

This workaround doesn't work for me (when using gpt-4o). It still outputs "Reasoning:" at the start of its reasoning when doing CoT.

Support for chat models is pretty critical given the limitations on instruct models (e.g. there is no GPT-4 instruct model).

@firoz47

firoz47 commented May 22, 2024

Hey, did you find any workaround? I am having the same issue.

@conradlee

My workaround is to just use a TypedPredictor rather than the TypedChainOfThought predictor, and then include fields in my schemas called 'reasoning' whose descriptions ask for a CoT-style justification of the output. Put these fields earlier in your data structure -- it only helps if the reasoning precedes the fields that contain the important output.

This way the prompt and expected output is no longer a confusing mix of unstructured and structured text -- instead, it's all structured.

As I primarily use OpenAI's models, I really wish that function calling were supported, similar to how the Instructor package does it. The typed predictors seem so similar in spirit to the Instructor approach, it's just that Instructor handles the communication of the expected output format much better. It would be interesting to see someone build an optimization layer on top of instructor.
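
A minimal sketch of that workaround, assuming the DSPy 2.4 typed-predictor API; the signature and field names are illustrative:

from pydantic import BaseModel, Field
import dspy

class Verdict(BaseModel):
    # Reasoning comes first so the CoT-style justification precedes the answer.
    reasoning: str = Field(description="Step-by-step justification for the answer")
    answer: str = Field(description="The final answer")

class AnswerQuestion(dspy.Signature):
    """Answer the question."""
    question: str = dspy.InputField()
    verdict: Verdict = dspy.OutputField()

predictor = dspy.TypedPredictor(AnswerQuestion)
# result = predictor(question="What is the capital of France?").verdict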

@mikeedjones
Contributor

Interesting paper on how function calling affects downstream performance in some models released this week: https://arxiv.org/pdf/2408.02442

@okhat
Collaborator

okhat commented Sep 24, 2024

(blob below copy-pasted here since I'm closing related issues)

Thanks for opening this! We released DSPy 2.5 yesterday. I think the new dspy.LM and the underlying dspy.ChatAdapter will probably resolve this problem.

Here's the (very short) migration guide, it should typically take you 2-3 minutes to change the LM definition and you should be good to go: https://github.com/stanfordnlp/dspy/blob/main/examples/migration.ipynb

Please let us know if this resolves your issue. I will close for now but please feel free to re-open if the problem persists.

@okhat okhat closed this as completed Sep 24, 2024