
Call Cohere RAG inference with documents argument #13196

Merged · 27 commits into run-llama:main · May 9, 2024

Conversation

@co-antwan (Contributor) commented May 1, 2024

Description

Adds support for Cohere.chat's documents argument when used in RAG pipelines. This ensures proper formatting on Cohere's client side and leads to better downstream performance.

This is a 'drafty' PR. Our implementation works and improves on the current state of the code. However, it's also somewhat brittle. We're looking for guidance from the LlamaIndex team on how best to tackle the problem (especially if you have ongoing efforts to add customisation by model family).
EDIT: Fixed thanks to @logan-markewich's suggestions ❤️

The problem

When called from inside retrievers (e.g. for RAG evaluations), Cohere.chat doesn't follow the intended route for RAG inference.

  • Currently, retrieved documents are strung together, then passed to message
  • The intended route is to pass retrieved documents to the documents keyword instead (see the sketch below)
  • Not doing so leads to poorer performance on downstream tasks
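
For illustration, a minimal sketch of the intended route using the Cohere SDK directly (model name and snippet contents are placeholders):

import cohere

co = cohere.Client()  # assumes CO_API_KEY is set in the environment

# Retrieved chunks are passed as structured documents, not concatenated into the message
retrieved_chunks = [
    {
        "title": "nyc_text.txt",
        "snippet": "New York, often called New York City or NYC, is the most populous city in the United States.",
    },
]

response = co.chat(
    model="command-r-plus",
    message="Tell me some facts about NYC?",
    documents=retrieved_chunks,
)
print(response.text)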

The proposed solution

The only light-touch fix we could find was to create a new ChatPromptTemplate along with a new ChatMessage subclass called DocumentMessage. We adjusted the logic inside Cohere.chat to parse these DocumentMessages properly and pass them to the documents keyword.
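
Roughly, the idea looks like this (a simplified sketch, not the PR's exact code; the helper name is illustrative):

from llama_index.core.llms import ChatMessage


class DocumentMessage(ChatMessage):
    """Marker type: a chat message whose content is a retrieved document."""


def split_out_documents(messages):
    # Separate document-bearing messages from the rest of the conversation
    documents = [m for m in messages if isinstance(m, DocumentMessage)]
    remaining = [m for m in messages if not isinstance(m, DocumentMessage)]
    return remaining, documents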

Example snippet:

# Load the new Cohere QA template
from llama_index.llms.cohere.utils import COHERE_QA_TEMPLATE

# Define the LLM and embedding model (API keys are assumed to be set in the environment)
from llama_index.llms.cohere import Cohere
from llama_index.embeddings.cohere import CohereEmbedding

llm = Cohere(model="command-r-plus")
embed_model = CohereEmbedding(model_name="embed-english-v3.0")

# Create index
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

# Using nyc_text.txt from: https://github.com/run-llama/llama_index/blob/40913847ba47d435b40b7fac3ae83eba89b56bb9/docs/docs/examples/evaluation/test_wiki_data/nyc_text.txt
documents = SimpleDirectoryReader(input_files=["nyc_text.txt"]).load_data()
splitter = SentenceSplitter(chunk_size=512)
vector_index = VectorStoreIndex.from_documents(
    documents, transformations=[splitter], embed_model=embed_model
)

# Call retrieval from index
engine = vector_index.as_query_engine(
    llm=llm,
    text_qa_template=COHERE_QA_TEMPLATE,
    similarity_top_k=3,
)
resp = engine.query("Tell me some facts about NYC?")
print(resp)

Where we would like guidance

~~Our proposed solution works, but is brittle for a number of reasons. Could you advise us on whether you can think of a better solution that's similarly light-touch?

  1. The solution depends on using COHERE_QA_TEMPLATE, which won't be obvious to users. How can we make certain templates the default for a model family?
  2. We're not sure where else our custom templates need to be specified to ensure that all calls via LlamaIndex employ the correct template.~~

See @logan-markewich's proposed changes and our implementation. This should ensure that Cohere templates get called whenever a retriever uses a Cohere LLM. Thanks @logan-markewich!


Fixes # (issue)

New Package?

Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?

  • Yes
  • No

Version Bump?

Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)

  • Yes
  • No

Type of Change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

  • Added new unit/integration tests
  • Added new notebook (that tests end-to-end)
  • I stared at the code and made sure it makes sense

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added Google Colab support for the newly added notebooks.
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran make format; make lint to appease the lint gods

@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label May 1, 2024
@logan-markewich (Collaborator):

There are a few other templates that you'd probably want a "cohere" version of:

  • refine template
  • tree summarize

Trying to think of a few ways so that the user doesn't have to think about this. All our query engines actually use selector prompts -- based on the LLM being used, they select an appropriate template. We could add cohere to this logic:

  • define your RAG prompt templates in the llama-index-llms-cohere package
  • in llama-index-core/llama_index/core/prompts/default_prompt_selectors.py we can add some logic
try:
  from llama_index.llms.cohere import is_cohere_model, COHERE_QA_TEMPLATE, ...
except ImportError:
  COHERE_QA_TEMPLATE = None
  ...

...

default_text_qa_conditionals = [(is_chat_model, CHAT_TEXT_QA_PROMPT)]
if COHERE_QA_TEMPLATE is not None:
  default_text_qa_conditionals = [(is_cohere_model, COHERE_QA_TEMPLATE)] + default_text_qa_conditionals

DEFAULT_TEXT_QA_PROMPT_SEL = SelectorPromptTemplate(
    default_template=DEFAULT_TEXT_QA_PROMPT,
    conditionals=default_text_qa_conditionals,
)

...

I think that's ok-ish? I don't know how many LLMs will end up with specific RAG prompts, but this feels fairly scalable?

The other option is having cohere-specific response synthesizers, but that feels like mostly overkill to implement the same algorithm with slightly different LLM calls.

@co-antwan (Contributor, Author):

@logan-markewich That's a great suggestion, thanks a lot. We implemented it, and it works!
You can check that it calls the right templates with a small snippet like:

from pprint import pprint

from llama_index.core import VectorStoreIndex
from llama_index.llms.cohere import Cohere

llm = Cohere(model="command-r-plus")
# Define your vector_index... e.g.
vector_index = VectorStoreIndex.from_documents(documents, transformations=[splitter], embed_model=embed_model)
engine = vector_index.as_query_engine(llm=llm, similarity_top_k=3)
pprint(engine._response_synthesizer._text_qa_template)

This will show the new template.

If you're happy with the shape of the PR, let's start the review. Whom should I add as reviewers?

One last question: our lives would have been easier if we'd added the core package as a dev dependency in poetry, something like poetry add --group=dev --editable ../../../llama-index-core/. Is it against convention to do that?

@logan-markewich (Collaborator):

@co-antwan awesome, glad it works!!

hmm, does that poetry command work? I know coordinating changes that involve both core and another external package is a little annoying, but it's only annoying until llama-index-core gets published with the changes you need 😅

I can try and give this a more thorough review today at some point!

@logan-markewich (Collaborator) left a review comment:

I'm generally on board with this, thanks for the fix! Just one comment about using this in more methods

@co-antwan (Contributor, Author):

@logan-markewich Thanks for the review!

We've implemented your change requests (thanks for catching these). In addition, @harry-cohere and I reworked our function for converting DocumentMessages to Cohere documents so that it's more extensible and robust. Among other things, the rework lets us handle the 'keywords' that LlamaIndex uses to format retrieved documents into a message, e.g. file_path.
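
For illustration, a rough sketch of the kind of conversion we have in mind (the function name and key list here are placeholders, not the PR's exact code):

def document_block_to_cohere_document(text, known_keys=("file_path", "file_name")):
    # Split a retriever-formatted block such as
    #   "file_path: nyc_text.txt\n\nNew York, often called NYC, ..."
    # into a structured Cohere document dict.
    doc = {}
    body_lines = []
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() in known_keys and value.strip():
            doc[key.strip()] = value.strip()
        else:
            body_lines.append(line)
    doc["text"] = "\n".join(body_lines).strip()
    return doc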

Is there a list of keywords that the different retrievers introduce when formatting retrieved documents? If so, I'd like to add those to our PR before merging (at least the most common ones).

Thanks as always

@co-antwan (Contributor, Author):

@logan-markewich One last question: not sure if we need to bump the version in pyproject.toml (don't know what counts as a sufficient change). You have the rights to edit if you want to make the change yourself :)

@logan-markewich (Collaborator) commented May 3, 2024

@co-antwan these "keyword" fields are just the metadata from the nodes/documents

For example

from llama_index.core import Document
doc = Document.example()
print(doc.get_content(metadata_mode="llm"))

This is the text that is used in the synthesizers.

Since it's metadata, and user-defined, it really could be anything. The most common keys are the ones inserted by the SimpleDirectoryReader, but I'm not sure if all of these make sense to track:

default_meta = {
    "file_path": file_path,
    "file_name": file_name,
    "file_type": mimetypes.guess_type(file_path)[0],
    "file_size": stat_result.get("size"),
    "creation_date": creation_date,
    "last_modified_date": last_modified_date,
    "last_accessed_date": last_accessed_date,
}

@logan-markewich (Collaborator) commented May 3, 2024

Normally, any change should bump the toml version for an integration. But since this depends on core being updated, I'll have to coordinate a release for the two after this merges

@co-antwan (Contributor, Author):

  • Re: version bump: understood, will let you coordinate them
  • Re: keywords from metadata: thank you for the reference. For now, I've added all known/common keywords. We may add support for user-defined fields in a follow-up. (CC @harry-cohere for visibility)

@ulan-yisaev (Contributor):

Hello @co-antwan!

Thanks for your implementation :). I was actually considering how to implement support for the "documents" field myself, so this has been very helpful. However, I have a concern regarding the handling of non-document text when using the Chat Engine with a custom prompt.

In my setup, I use a CondensePlusContextChatEngine with a _context_prompt_template that includes specific instructions followed by document data. The template looks something like this:

### INSTRUKTIONEN
[Instructional text about how the response should handle the context and question]
### KONTEXT
[Document chunks]

Could you clarify what happens when such a message is processed using the current implementation of remove_documents_from_messages and document_message_to_cohere_document? My main concern is that the instructions in the prompt might be removed, leaving only the document data to be passed to the LLM.

Does the logic in the PR ensure that instructional or other non-document text within the same message as document data is preserved, or do we need to implement additional safeguards to ensure that such text is not lost during processing?

Thank you for any guidance you can provide!

@ulan-yisaev (Contributor):

I think I can simply add my instructions directly to the last message in the chat_messages :)

@co-antwan (Contributor, Author) commented May 5, 2024

Hey @ulan-yisaev , it's good that you raise those kinds of concerns now :)
Could you provide me with an MWE so that I can run it and confirm the effect of my branch?
The short answer is that I believe your calls will be unaffected by our contribution, but I'd need to know the specifics of your stack:

  1. Our implementation doesn't affect any calls to the chat method. It only affects calls to Cohere models that go through a selector in default_prompt_selectors.py.
  2. Cohere v5+ introduces a preamble argument where users can/should define additional instructions (more info here: https://docs.cohere.com/docs/preambles). You can use preamble in Cohere.chat directly to pass your instructions to the model; see the sketch after this list.
  3. If your stack selects templates via default_prompt_selectors.py, I suggest you define a new template (e.g. like COHERE_QA_TEMPLATE) and pass it to your retriever as an explicit prompt template. This will work. If it's a common template, please add it as a follow-up PR 🙏!
  4. Finally, if all else fails, then yes, adding your instructions to the last message will work. However, I recommend you try using preambles first!
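
A minimal sketch of the preamble route, calling the Cohere SDK directly (how best to thread preamble through the LlamaIndex wrapper is a separate question):

import cohere

co = cohere.Client()  # assumes CO_API_KEY is set in the environment

resp = co.chat(
    model="command-r-plus",
    preamble="Directly address the user's query with a precise and concise response.",
    message="Tell me some facts about NYC?",
    documents=[{"title": "nyc_text.txt", "snippet": "New York City is the most populous city in the United States."}],
)
print(resp.text)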

@ulan-yisaev (Contributor):

Hey @co-antwan,

Thanks a lot for the 'preamble' hint; that’s really helpful! I appreciate your guidance on how to best use the new features.

My setup is pretty straightforward. Here's a minimal working example that you can try to understand how I'm currently integrating everything:

CONTEXT_PROMPT = """### Instructions for the Chatbot:
- Directly address the latest user's query with a precise and concise response.
- Utilize information from previous interactions and any provided documents.

**Context**:
{context_str}
"""

# vector_index and llm are defined as in the earlier snippets
chat_engine = vector_index.as_chat_engine(
    llm=llm,
    chat_mode="condense_plus_context",
    context_prompt=CONTEXT_PROMPT,
    similarity_top_k=3,
)
chat_resp = chat_engine.chat("Tell me some facts about NYC?")
print(chat_resp)

@ulan-yisaev (Contributor) commented May 6, 2024

I've conducted some tests with my setup and noticed something interesting that might be worth discussing. It seems the documents are not being split out as expected, likely due to how the message roles are assigned. The function remove_documents_from_messages() expects a DocumentMessage, which has the role MessageRole.SYSTEM. However, in my setup, the role is assigned as MessageRole.CHATBOT. Here's what my debug logs show (with SentenceSplitter(chunk_size=32, chunk_overlap=8)):

Remaining: [ChatMessage(role=<MessageRole.CHATBOT: 'chatbot'>, content="### Instructions for the Chatbot:\n- Directly address the latest user's query with a precise and concise response.\n- Utilize information from previous interactions and any provided documents.\n\n**Context**:\n\nfile_path: nyc_text.txt\n\nNew York, often called New York City or NYC, is the most populous city in the United States.\n\nfile_path: nyc_text.txt\n\n== History ==\n\n\n=== Early history ===\nIn the pre-Columbian era,\n\nfile_path: nyc_text.txt\n\nincluding the city of New Amsterdam, when England seized it from Dutch control.\n\n\n== History ==\n", additional_kwargs={})]
Documents: None

I think aligning the message roles or adjusting how they are interpreted might be necessary?

@ulan-yisaev (Contributor):

Upon further investigation, I've identified a potential configuration mismatch that might be contributing to the issue with document recognition and handling.
It appears that the role assigned to the system message in the condense_plus_context.py chat engine is derived from self._llm.metadata.system_role, which is set to MessageRole.CHATBOT in the llama-index-llms-cohere integration, according to the LLMMetadata configuration in llama_index/llms/cohere/base.py.
However, the remove_documents_from_messages function is specifically looking for messages with the role MessageRole.SYSTEM to treat them as DocumentMessage. Since our chat messages are marked with MessageRole.CHATBOT, they are not being correctly identified or processed as document-containing messages.

This role discrepancy causes all our system messages, which include both instructions and document data, to be categorized as 'remaining' rather than 'documents', leading to none of the document data being processed as intended.
Could we consider aligning the role definitions or modifying the document recognition logic to accommodate setups where the system_role may vary between different LLM integrations?

@co-antwan (Contributor, Author):

@ulan-yisaev Good catch, thanks a ton -- this was gnarly 🙏
The simplest change I could think of was indeed to change the role of DocumentMessage to CHATBOT. This should fix the compatibility issue you've raised without affecting any unknowing users (since, for now, DocumentMessage is only used in cases we've tested).

Can you test if the current change fixes your prompt formatting?

For outstanding issues, e.g. integrating preamble, could you open a follow-up PR if that's a priority fix for you?

@co-antwan (Contributor, Author) commented May 7, 2024

@ulan-yisaev on second thought: are you sure that your issue is caused by the SYSTEM role? Our check depends only on the message's type (isinstance), not its role. I just checked, and the isinstance check works as expected, whatever the role of the DocumentMessage.

Could something else be causing your issue inside condense_plus_context?

[Screenshot, 2024-05-07: the isinstance check passing regardless of the DocumentMessage role]
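
In other words (a rough sketch of what the screenshot shows, assuming DocumentMessage subclasses ChatMessage as in this PR; the import path is an assumption):

from llama_index.core.llms import MessageRole
from llama_index.llms.cohere.utils import DocumentMessage  # import path assumed

# The check is type-based, so the role assigned to the message doesn't matter
msg = DocumentMessage(content="file_path: nyc_text.txt\n\n...", role=MessageRole.CHATBOT)
print(isinstance(msg, DocumentMessage))  # True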

CC @harry-cohere

@logan-markewich (Collaborator):

@co-antwan I was just about to merge this (forgot that it was still waiting), and it seems like there might be other things going on 😅 Seems like it's actually good to merge though?

@ulan-yisaev (Contributor):

@co-antwan , you're right; my apologies for the confusion. The issue isn't related to the MessageRole.
Maybe it is related to how messages are instantiated as plain ChatMessage instances in the core condense_plus_context chat engine:

def _run_c3(...):
    ...
    system_message = ChatMessage(
        content=system_message_content, role=self._llm.metadata.system_role
    )
    ...
    chat_messages = [
        system_message,
        *self._memory.get(initial_token_count=initial_token_count),
    ]
    return chat_messages, context_source, context_nodes

This direct assignment bypasses the use of any specialized document handling mechanisms, like those defined in the COHERE_QA_TEMPLATE.

@ulan-yisaev (Contributor):

Hello @logan-markewich ,

In your condense_plus_context setup, we currently handle context strings and document data separately before passing them to the chat function. Given that context_str already encapsulates the necessary document information in a structured format, does it make sense to modify the chat method to accept this context directly? This would simplify Cohere's current implementation by removing the need for additional parsing or keyword searching.

Could this approach be considered a reasonable solution within the llama_index framework? Or perhaps you have another suggestion on how we might better integrate or handle this context data?


# In condense_plus_context.py
def _run_c3(self, message: str, chat_history: Optional[List[ChatMessage]] = None):
    ...
    context_str, context_nodes = self._retrieve_context(condensed_question)
    ...
    system_message_content = self._context_prompt_template.format(
        context_str=context_str
    )
    ...
    system_message = ChatMessage(
        content=system_message_content,
        role=self._llm.metadata.system_role,
    )
    ...
    chat_messages = [
        system_message,
        *self._memory.get(initial_token_count=initial_token_count),
    ]
    # also return context_str so that chat() can pass it through
    return chat_messages, context_str, context_source, context_nodes

@trace_method("chat")
def chat(self, message: str, chat_history: Optional[List[ChatMessage]] = None):
    chat_messages, context_str, context_source, context_nodes = self._run_c3(
        message, chat_history
    )

    # Example modification to accept context directly:
    chat_response = self._llm.chat(chat_messages, context=context_str)
    ...

And in the Cohere class:

def chat(self, messages: Sequence[ChatMessage], context=None, **kwargs: Any) -> ChatResponse:
    prompt = messages[-1].content
    if context:
        documents = context
    ...

@logan-markewich (Collaborator) commented May 7, 2024

Seems to me there should just be a cohere context chat engine, rather than trying to shoe-horn the existing chat engine into this specific API 🤔

Having a context-specific kwarg doesn't generalize to all LLMs in this case.

@co-antwan (Contributor, Author):

@ulan-yisaev No worries, I appreciate the double check :)
@logan-markewich Good to merge from our side!

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label May 9, 2024
@logan-markewich logan-markewich enabled auto-merge (squash) May 9, 2024 03:40
@logan-markewich (Collaborator):

Cool! Just note I will need to coordinate a release with core for this once it merges. I should get to it tomorrow.

Thanks for all the help and effort here everyone! 👍🏻

@logan-markewich logan-markewich merged commit 362063c into run-llama:main May 9, 2024
8 checks passed
@co-antwan (Contributor, Author):

Thanks @logan-markewich for your advice throughout!
