
Call Cohere RAG inference with documents argument #13196

Merged · 27 commits into run-llama:main · May 9, 2024

Conversation

@co-antwan (Contributor) commented May 1, 2024

Description

Adds support for Cohere.chat's documents argument when used in RAG pipelines. This ensures proper formatting on Cohere's client side and leads to better downstream performance.

This is a 'drafty' PR. Our implementation works and improves on the current state of the code. However, it's also somewhat brittle. We're looking for guidance from the LlamaIndex team on how best to tackle the problem (especially if you have ongoing efforts to add customisation by model family).
EDIT: Fixed thanks to @logan-markewich's suggestions ❤️

The problem

When called from inside retrievers (e.g. for RAG evaluations), Cohere.chat doesn't follow the intended route for RAG inference.

  • Currently, retrieved documents are strung together, then passed to message
  • The intended route is to pass retrieved documents to the documents keyword instead (see the sketch below)
  • Not doing so leads to poorer performance on downstream tasks
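
For illustration, a minimal sketch of the intended route using the Cohere SDK directly (model name and snippet contents are placeholders):

import cohere

co = cohere.Client()  # assumes CO_API_KEY is set in the environment

# Retrieved chunks are passed as structured documents, not concatenated into the message
retrieved_chunks = [
    {
        "title": "nyc_text.txt",
        "snippet": "New York, often called New York City or NYC, is the most populous city in the United States.",
    },
]

response = co.chat(
    model="command-r-plus",
    message="Tell me some facts about NYC?",
    documents=retrieved_chunks,
)
print(response.text)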

The proposed solution

The only light-touch fix we could find was to create a new ChatPromptTemplate along with a new ChatMessage subclass called DocumentMessage. We adjusted the logic inside Cohere.chat to parse these DocumentMessages properly and pass them to the documents keyword.
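
Roughly, the idea looks like this (a simplified sketch, not the PR's exact code; the helper name is illustrative):

from llama_index.core.llms import ChatMessage


class DocumentMessage(ChatMessage):
    """Marker type: a chat message whose content is a retrieved document."""


def split_out_documents(messages):
    # Separate document-bearing messages from the rest of the conversation
    documents = [m for m in messages if isinstance(m, DocumentMessage)]
    remaining = [m for m in messages if not isinstance(m, DocumentMessage)]
    return remaining, documents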

Example snippet:

# Load the new Cohere QA template
from llama_index.llms.cohere.utils import COHERE_QA_TEMPLATE

# Define the LLM and embedding model (API keys are assumed to be set in the environment)
from llama_index.llms.cohere import Cohere
from llama_index.embeddings.cohere import CohereEmbedding

llm = Cohere(model="command-r-plus")
embed_model = CohereEmbedding(model_name="embed-english-v3.0")

# Create index
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

# Using nyc_text.txt from: https://github.com/run-llama/llama_index/blob/40913847ba47d435b40b7fac3ae83eba89b56bb9/docs/docs/examples/evaluation/test_wiki_data/nyc_text.txt
documents = SimpleDirectoryReader(input_files=["nyc_text.txt"]).load_data()
splitter = SentenceSplitter(chunk_size=512)
vector_index = VectorStoreIndex.from_documents(
    documents, transformations=[splitter], embed_model=embed_model
)

# Call retrieval from index
engine = vector_index.as_query_engine(
    llm=llm,
    text_qa_template=COHERE_QA_TEMPLATE,
    similarity_top_k=3,
)
resp = engine.query("Tell me some facts about NYC?")
print(resp)

Where we would like guidance

~~Our proposed solution works, but is brittle for a number of reasons. Could you advise us on whether you can think of a better solution that's similarly light-touch?

  1. The solution depends on using COHERE_QA_TEMPLATE, which won't be obvious to users. How can we make certain templates the default for a model family?
  2. We're not sure where else our custom templates need to be specified to ensure that all calls via LlamaIndex employ the correct template.~~

See @logan-markewich's proposed changes and our implementation. This should ensure that Cohere templates get called whenever a retriever uses a Cohere LLM. Thanks @logan-markewich!


Fixes # (issue)

New Package?

Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?

  • Yes
  • No

Version Bump?

Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)

  • Yes
  • No

Type of Change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

  • Added new unit/integration tests
  • Added new notebook (that tests end-to-end)
  • I stared at the code and made sure it makes sense

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added Google Colab support for the newly added notebooks.
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran make format; make lint to appease the lint gods

@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label May 1, 2024
@logan-markewich (Collaborator):

There are a few other templates that you'd probably want a "cohere" version of:

  • refine template
  • tree summarize

Trying to think of a few ways so that the user doesn't have to think about this. All our query engines actually use selector prompts -- based on the LLM being used, they select an appropriate template. We could add cohere to this logic:

  • define your RAG prompt templates in the llama-index-llms-cohere package
  • in llama-index-core/llama_index/core/prompts/default_prompt_selectors.py we can add some logic
try:
  from llama_index.llms.cohere import is_cohere_model, COHERE_QA_TEMPLATE, ...
except ImportError:
  COHERE_QA_TEMPLATE = None
  ...

...

default_text_qa_conditionals = [(is_chat_model, CHAT_TEXT_QA_PROMPT)]
if COHERE_QA_TEMPLATE is not None:
  default_text_qa_conditionals = [(is_cohere_model, COHERE_QA_TEMPLATE)] + default_text_qa_conditionals

DEFAULT_TEXT_QA_PROMPT_SEL = SelectorPromptTemplate(
    default_template=DEFAULT_TEXT_QA_PROMPT,
    conditionals=default_text_qa_conditionals,
)

...

I think that's ok-ish? I don't know how many LLMs will end up with specific RAG prompts, but this feels fairly scalable?

The other option is having cohere-specific response synthesizers, but that feels like mostly overkill to implement the same algorithm with slightly different LLM calls.

@co-antwan (Contributor, Author):

@logan-markewich That's a great suggestion, thanks a lot. We implemented it, and it works!
You can check that it calls the right templates with a small snippet like:

from pprint import pprint

from llama_index.core import VectorStoreIndex
from llama_index.llms.cohere import Cohere

llm = Cohere(model="command-r-plus")
# Define your vector_index... e.g.
vector_index = VectorStoreIndex.from_documents(documents, transformations=[splitter], embed_model=embed_model)
engine = vector_index.as_query_engine(llm=llm, similarity_top_k=3)
pprint(engine._response_synthesizer._text_qa_template)

This will show the new template.

If you're happy with the shape of the PR, let's start the review. Whom should I add as reviewers?

One last question: our lives would have been easier if we'd added the core package as a dev dependency in poetry, something like poetry add --group=dev --editable ../../../llama-index-core/. Is it against convention to do that?

@logan-markewich (Collaborator):

@co-antwan awesome, glad it works!!

hmm, does that poetry command work? I know coordinating changes that involve both core and another external package is a little annoying, but it's only annoying until llama-index-core gets published with the changes you need 😅

I can try and give this a more thorough review today at some point!

@logan-markewich (Collaborator) left a review comment:

I'm generally on board with this, thanks for the fix! Just one comment about using this in more methods

@co-antwan (Contributor, Author):

@logan-markewich Thanks for the review!

We've implemented your change requests (thanks for catching these). In addition, @harry-cohere and I reworked our function for converting DocumentMessages to Cohere documents so that it's more extensible and robust. Among other things, the rework lets us handle the 'keywords' that LlamaIndex uses to format retrieved documents into a message, e.g. file_path.
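
For illustration, a rough sketch of the kind of conversion we have in mind (the function name and key list here are placeholders, not the PR's exact code):

def document_block_to_cohere_document(text, known_keys=("file_path", "file_name")):
    # Split a retriever-formatted block such as
    #   "file_path: nyc_text.txt\n\nNew York, often called NYC, ..."
    # into a structured Cohere document dict.
    doc = {}
    body_lines = []
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() in known_keys and value.strip():
            doc[key.strip()] = value.strip()
        else:
            body_lines.append(line)
    doc["text"] = "\n".join(body_lines).strip()
    return doc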

Is there a list of keywords that the different retrievers introduce when formatting retrieved documents? If so, I'd like to add those to our PR before merging (at least the most common ones).

Thanks as always

@co-antwan (Contributor, Author):

@logan-markewich One last question: not sure if we need to bump the version in pyproject.toml (don't know what counts as a sufficient change). You have the rights to edit if you want to make the change yourself :)

@logan-markewich (Collaborator) commented May 3, 2024

@co-antwan these "keyword" fields are just the metadata from the nodes/documents

For example

from llama_index.core import Document
doc = Document.example()
print(doc.get_content(metadata_mode="llm"))

This is the text that is used in the synthesizers.

Since it's metadata, and user-defined, it really could be anything. The most common keys are the ones inserted by the SimpleDirectoryReader, but I'm not sure if all of these make sense to track:

default_meta = {
    "file_path": file_path,
    "file_name": file_name,
    "file_type": mimetypes.guess_type(file_path)[0],
    "file_size": stat_result.get("size"),
    "creation_date": creation_date,
    "last_modified_date": last_modified_date,
    "last_accessed_date": last_accessed_date,
}

@logan-markewich (Collaborator) commented May 3, 2024

Normally, any change should bump the toml version for an integration. But since this depends on core being updated, I'll have to coordinate a release for the two after this merges

@co-antwan (Contributor, Author):

  • Re: version bump: understood, will let you coordinate them
  • Re: keywords from metadata: thank you for the reference. For now, I've added all known/common keywords. We may add support for user-defined fields in a follow-up. (CC @harry-cohere for visibility)

@ulan-yisaev (Contributor):

Hello @co-antwan!

Thanks for your implementation :). I was actually considering how to implement support for the "documents" field myself, so this has been very helpful. However, I have a concern regarding the handling of non-document text when using the Chat Engine with a custom prompt.

In my setup, I use a CondensePlusContextChatEngine with a _context_prompt_template that includes specific instructions followed by document data. The template looks something like this:

### INSTRUKTIONEN
[Instructional text about how the response should handle the context and question]
### KONTEXT
[Document chunks]

Could you clarify what happens when such a message is processed using the current implementation of remove_documents_from_messages and document_message_to_cohere_document? My main concern is that the instructions in the prompt might be removed, leaving only the document data to be passed to the LLM.

Does the logic in the PR ensure that instructional or other non-document text within the same message as document data is preserved, or do we need to implement additional safeguards to ensure that such text is not lost during processing?

Thank you for any guidance you can provide!

@ulan-yisaev (Contributor):

I think I can simply add my instructions directly to the last message in the chat_messages :)

@co-antwan (Contributor, Author) commented May 5, 2024

Hey @ulan-yisaev , it's good that you raise those kinds of concerns now :)
Could you provide me with an MWE so that I can run it and confirm the effect of my branch?
The short answer is that I believe your calls will be unaffected by our contribution, but I'd need to know the specifics of your stack:

  1. Our implementation doesn't affect any calls to the chat method. It only affects calls to Cohere models that go through a selector in default_prompt_selectors.py.
  2. Cohere v5+ introduces a preamble argument where users can/should define additional instructions (more info here: https://docs.cohere.com/docs/preambles). You can use preamble in Cohere.chat directly to pass your instructions to the model; see the sketch after this list.
  3. If your stack selects templates via default_prompt_selectors.py, I suggest you define a new template (e.g. like COHERE_QA_TEMPLATE) and pass it to your retriever as an explicit prompt template. This will work. If it's a common template, please add it as a follow-up PR 🙏!
  4. Finally, if all else fails, then yes, adding your instructions to the last message will work. However, I recommend you try using preambles first!
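
A minimal sketch of the preamble route, calling the Cohere SDK directly (how best to thread preamble through the LlamaIndex wrapper is a separate question):

import cohere

co = cohere.Client()  # assumes CO_API_KEY is set in the environment

resp = co.chat(
    model="command-r-plus",
    preamble="Directly address the user's query with a precise and concise response.",
    message="Tell me some facts about NYC?",
    documents=[{"title": "nyc_text.txt", "snippet": "New York City is the most populous city in the United States."}],
)
print(resp.text)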

@ulan-yisaev (Contributor):

Hey @co-antwan,

Thanks a lot for the 'preamble' hint; that’s really helpful! I appreciate your guidance on how to best use the new features.

My setup is pretty straightforward. Here's a minimal working example that you can try to understand how I'm currently integrating everything:

CONTEXT_PROMPT = """### Instructions for the Chatbot:
- Directly address the latest user's query with a precise and concise response.
- Utilize information from previous interactions and any provided documents.

**Context**:
{context_str}
"""

# vector_index and llm are defined as in the earlier snippets
chat_engine = vector_index.as_chat_engine(
    llm=llm,
    chat_mode="condense_plus_context",
    context_prompt=CONTEXT_PROMPT,
    similarity_top_k=3,
)
chat_resp = chat_engine.chat("Tell me some facts about NYC?")
print(chat_resp)

@ulan-yisaev (Contributor) commented May 6, 2024

I've conducted some tests with my setup and noticed something interesting that might be worth discussing. It seems the documents are not being split out as expected, likely due to how the message roles are assigned. The function remove_documents_from_messages() expects a DocumentMessage, which has the role MessageRole.SYSTEM. However, in my setup, the role is assigned as MessageRole.CHATBOT. Here's what my debug logs show (with SentenceSplitter(chunk_size=32, chunk_overlap=8)):

Remaining: [ChatMessage(role=<MessageRole.CHATBOT: 'chatbot'>, content="### Instructions for the Chatbot:\n- Directly address the latest user's query with a precise and concise response.\n- Utilize information from previous interactions and any provided documents.\n\n**Context**:\n\nfile_path: nyc_text.txt\n\nNew York, often called New York City or NYC, is the most populous city in the United States.\n\nfile_path: nyc_text.txt\n\n== History ==\n\n\n=== Early history ===\nIn the pre-Columbian era,\n\nfile_path: nyc_text.txt\n\nincluding the city of New Amsterdam, when England seized it from Dutch control.\n\n\n== History ==\n", additional_kwargs={})]
Documents: None

I think aligning the message roles or adjusting how they are interpreted might be necessary?

@ulan-yisaev (Contributor):

Upon further investigation, I've identified a potential configuration mismatch that might be contributing to the issue with document recognition and handling.
It appears that the role assigned to the system message in the condense_plus_context.py chat engine is derived from self._llm.metadata.system_role, which is set to MessageRole.CHATBOT in the llama-index-llms-cohere integration, according to the LLMMetadata configuration in llama_index/llms/cohere/base.py.
However, the remove_documents_from_messages function is specifically looking for messages with the role MessageRole.SYSTEM to treat them as DocumentMessage. Since our chat messages are marked with MessageRole.CHATBOT, they are not being correctly identified or processed as document-containing messages.

This role discrepancy causes all our system messages, which include both instructions and document data, to be categorized as 'remaining' rather than 'documents', leading to none of the document data being processed as intended.
Could we consider aligning the role definitions or modifying the document recognition logic to accommodate setups where the system_role may vary between different LLM integrations?

@co-antwan (Contributor, Author):

@ulan-yisaev Good catch, thanks a ton -- this was gnarly 🙏
The simplest change I could think of was indeed to change the role of DocumentMessage to CHATBOT. This should fix the compatibility issue you've raised without affecting any unknowing users (since, for now, DocumentMessage is only used in cases we've tested).

Can you test if the current change fixes your prompt formatting?

For outstanding issues, e.g. integrating preamble, could you open a follow-up PR if that's a priority fix for you?

@co-antwan (Contributor, Author) commented May 7, 2024

@ulan-yisaev on second thought: are you sure that your issue is caused by the SYSTEM role? Our check depends only on the message's type (isinstance), not its role. I just checked, and the isinstance check works as expected, whatever the role of the DocumentMessage.

Could something else be causing your issue inside condense_plus_context?

[Screenshot, 2024-05-07: the isinstance check passing regardless of the DocumentMessage role]
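
In other words (a rough sketch of what the screenshot shows, assuming DocumentMessage subclasses ChatMessage as in this PR; the import path is an assumption):

from llama_index.core.llms import MessageRole
from llama_index.llms.cohere.utils import DocumentMessage  # import path assumed

# The check is type-based, so the role assigned to the message doesn't matter
msg = DocumentMessage(content="file_path: nyc_text.txt\n\n...", role=MessageRole.CHATBOT)
print(isinstance(msg, DocumentMessage))  # True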

CC @harry-cohere

@logan-markewich (Collaborator):

@co-antwan I was just about to merge this (forgot that it was still waiting), and it seems like there might be other things going on 😅 Seems like it's actually good to merge though?

@ulan-yisaev (Contributor):

@co-antwan , you're right; my apologies for the confusion. The issue isn't related to the MessageRole.
Maybe it is related to how messages are instantiated as plain ChatMessage instances in the core condense_plus_context chat engine:

def _run_c3(...):
    ...
    system_message = ChatMessage(
        content=system_message_content, role=self._llm.metadata.system_role
    )
    ...
    chat_messages = [
        system_message,
        *self._memory.get(initial_token_count=initial_token_count),
    ]
    return chat_messages, context_source, context_nodes

This direct assignment bypasses the use of any specialized document handling mechanisms, like those defined in the COHERE_QA_TEMPLATE.

@ulan-yisaev (Contributor):

Hello @logan-markewich ,

In your condense_plus_context setup, we currently handle context strings and document data separately before passing them to the chat function. Given that context_str already encapsulates the necessary document information in a structured format, does it make sense to modify the chat method to accept this context directly? This would simplify Cohere's current implementation by removing the need for additional parsing or keyword searching.

Could this approach be considered a reasonable solution within the llama_index framework? Or perhaps you have another suggestion on how we might better integrate or handle this context data?


# In condense_plus_context.py
def _run_c3(self, message: str, chat_history: Optional[List[ChatMessage]] = None):
    ...
    context_str, context_nodes = self._retrieve_context(condensed_question)
    ...
    system_message_content = self._context_prompt_template.format(
        context_str=context_str
    )
    ...
    system_message = ChatMessage(
        content=system_message_content,
        role=self._llm.metadata.system_role,
    )
    ...
    chat_messages = [
        system_message,
        *self._memory.get(initial_token_count=initial_token_count),
    ]
    # also return context_str so that chat() can pass it through
    return chat_messages, context_str, context_source, context_nodes

@trace_method("chat")
def chat(self, message: str, chat_history: Optional[List[ChatMessage]] = None):
    chat_messages, context_str, context_source, context_nodes = self._run_c3(
        message, chat_history
    )

    # Example modification to accept context directly:
    chat_response = self._llm.chat(chat_messages, context=context_str)
    ...

And in the Cohere class:

def chat(self, messages: Sequence[ChatMessage], context=None, **kwargs: Any) -> ChatResponse:
    prompt = messages[-1].content
    if context:
        documents = context
    ...

@logan-markewich (Collaborator) commented May 7, 2024

Seems to me there should just be a cohere context chat engine, rather than trying to shoe-horn the existing chat engine into this specific API 🤔

Having a context-specific kwarg doesn't generalize to all LLMs in this case.

@co-antwan (Contributor, Author):

@ulan-yisaev No worries, I appreciate the double check :)
@logan-markewich Good to merge from our side!

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label May 9, 2024
@logan-markewich logan-markewich enabled auto-merge (squash) May 9, 2024 03:40
@logan-markewich (Collaborator):

Cool! Just note I will need to coordinate a release with core for this once it merges. I should get to it tomorrow.

Thanks for all the help and effort here everyone! 👍🏻

@logan-markewich logan-markewich merged commit 362063c into run-llama:main May 9, 2024
8 checks passed
@co-antwan (Contributor, Author):

Thanks @logan-markewich for your advice throughout!
