Skip to content

Conversation

@Athe-kunal
Copy link
Contributor

All the retrievers support metadata filtering arguments, which can be passed during dspy.Retrieve(k=num_passages) as **kwargs. This PR add just the **kwargs parameter to retriever which can be passed to the retriever client and we can sample passages based on metadata. For example in ChromaDB, we have

retrieve = dspy.Retrieve(k=3)
retrieve("What is valuation",where={'book_source': 'Investment_Valuations_techniques'})

We will only retrieve documents for the given book source and send to reranker or model for next stages in the pipeline

@CShorten
Copy link
Collaborator

This is awesome!! 🔥

@Athe-kunal any chance you would also be interested in taking a look at this - #560? I see you have also updated the dsp.retrieveEnsemble call to handle the dependency in dspy.Retrieve when configured with dspy.settings(rm=rm). I think this issue is important for how we want to return metadata from the RMs. Still need to dig deeper personally, but maybe you could keep the momentum rolling into that problem as well haha.

P.S. Please note Weaviate can also take **kwargs to interface metadata filters such as:

import weaviate.classes as wvc

jeopardy = client.collections.get("JeopardyQuestion")
response = jeopardy.query.hybrid(
   query="food",
   filters=wvc.query.Filter.by_property("round").equal("Double Jeopardy!"),
   limit=3
)

for o in response.objects:
   print(o.properties)

@Athe-kunal
Copy link
Contributor Author

Hi @CShorten
Thanks for you for your response. I got started with DSPy following your videos on weaviate and DSPy. Those are really great resources to get started with DSPy, thanks for that.

Coming to PR, the retrieve pipeline just returns the long_text key. In dspy.RetrieveEnsemble, it just returns the long_text. All the RMs return a dotdict data structure, and I only checked for ChromaDB, and it returns metadata and other metrics. However, those are ignored in the dspy.RetrieveEnsemble function. I will be working on returning metadata and other metrics that the libraries can return. It should be a default behaviour (returning the metadata), IMO.

@CShorten
Copy link
Collaborator

Thank you so much for the kind words, it means a lot!

Ok fantastic, I think we are good to merge this one adding **kwargs, and maybe we can continue our discussion of the long_text key problem on #560?

@CShorten
Copy link
Collaborator

@hsm207 maybe you will find this one interesting on the use of **kwargs for accessing search APIs.

@hsm207
Copy link
Contributor

hsm207 commented Mar 27, 2024

the docstring in the WeaviateRM's forward method need to be updated too to document that kwargs param and that all of it will be passed to the client's hybrid method 😀

@Athe-kunal
Copy link
Contributor Author

Yes, we need to have documentation for all the retrievers and what extra parameters they can take. Some of the retrievers have some inputs in the __init__ method, but not sure, if it will scale well.

@hsm207
Copy link
Contributor

hsm207 commented Mar 28, 2024

Some of the retrievers have some inputs in the __init__ method, but not sure, if it will scale well.

Just had a look at some other retrievers and I see what you mean. There are solutions we can consider but they would require a major refactoring which I don't think is appropriate at this stage of the project. But good that you brought this up. Maybe the maintainers can put it in their backlog.

@Athe-kunal
Copy link
Contributor Author

Athe-kunal commented Mar 28, 2024

Yeah sure.
I think for now I am working on returning the metadata by the retriever if the retriever is returning it. The current dspy.retrieveEnsemble only returns the long_text parameter and ignores the rest of the keys in the dotdict data-structure.

@Athe-kunal Athe-kunal changed the title Added kwargs parameter for retrievers Added kwargs parameter for retrievers and text to SQL pipeline using DSPy Mar 31, 2024
@Athe-kunal
Copy link
Contributor Author

@CShorten and @hsm207
I have added a text to SQL end-to-end example, please do review and give me some feedback on how to improve it

Copy link

@Josephrp Josephrp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A great example for a popular use case of DSPy

@Josephrp
Copy link

the code looks good to me.
simply there are some superficial and language related comments to really communicate the use case more broadly i think.
If you would like my help, i can surely support you with information comments here on this thread for your consideration.

@Athe-kunal
Copy link
Contributor Author

Yeah sure @Josephrp, please provide your feedback

@CShorten
Copy link
Collaborator

CShorten commented Apr 3, 2024

Lgtm, tested with:

import dspy
from dspy.retrieve.weaviate_rm import WeaviateRM
import weaviate
import weaviate.classes as wvc

weaviate_client = weaviate.connect_to_local()
weaviate_rm = WeaviateRM("WeaviateBlogChunk", weaviate_client=weaviate_client)
dspy.settings.configure(rm=weaviate_rm)

filters=wvc.query.Filter.by_property("content").contains_all(["sql"])

weaviate_rm("What is ref2vec?", filters=filters)

@CShorten
Copy link
Collaborator

CShorten commented Apr 3, 2024

Having trouble with dspy.Retrieve(k=1)("What is ref2vec", filters=filters).

Have you tested that this works @Athe-kunal? Will dive in further tomorrow.

@CShorten
Copy link
Collaborator

CShorten commented Apr 3, 2024

Apologies this error was caused by how I was testing.

Confirming, dspy.Retrieve(k=1)("What is ref2vec", filters=filters) works ✅

@CShorten CShorten merged commit fd63306 into stanfordnlp:main Apr 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants