Best practices for limiting responses to a specific source document #95

Open
nramirez opened this issue May 18, 2023 · 4 comments
@nramirez

nramirez commented May 18, 2023

Hi, thanks for the contribution.

I have been using your repository to train a model on a collection of books. My goal is to generate answers that are specific to a single source document, essentially using the model as an assistant that draws information from one selected book at a time (such as "cats.pdf").

Initially, I attempted to implement this by modifying the prompts, but the results were inconsistent, and the model sometimes used information from other sources. Here's an example of how I structured the prompts:

You are a helpful assistant trained to answer questions solely based on the content of book_name.pdf. Given the text in the book and a question, generate an appropriate answer. If the answer is not contained within the book, simply say that you don't know, rather than inventing an answer. The question is: What is the distance from the moon to the earth?

It seems that ingest.py adds the source path to each document's metadata. However, when a question is asked, the most relevant documents are retrieved based on the semantic similarity between the query's embedding and the documents' embeddings, not by a specific document identifier. Retrieval does not consider a document's metadata (like its source path), which means the model can't be instructed to refer to a specific document just by mentioning the document's name or identifier in the prompt (?).
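
To make the point above concrete, here is a minimal sketch of similarity-only retrieval. The vectors, file names, and the `top_k` helper are made up for illustration and are not taken from the repository; the only claim is that a ranking of this kind never looks at the metadata.

```python
import math

# Toy index of (embedding, metadata) pairs. All values are illustrative.
INDEX = [
    ([0.9, 0.1], {"source": "cats.pdf"}),
    ([0.8, 0.3], {"source": "dogs.pdf"}),
    ([0.1, 0.9], {"source": "astronomy.pdf"}),
]


def cosine(a, b):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def top_k(query_vec, k=2):
    """Rank purely by similarity; the metadata dict is never consulted."""
    ranked = sorted(INDEX, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [meta["source"] for _, meta in ranked[:k]]


# A query that sits closest to cats.pdf still pulls in the nearby dogs.pdf chunk.
results = top_k([1.0, 0.0])
```

Naming "cats.pdf" in the prompt would not change `results` here, since the prompt only reaches the model after retrieval has already happened.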

Considering this, I'm evaluating the option of creating a dropdown menu that lists all the books I've trained the model on. When a book is selected from this menu, I would swap the databases to only include documents from the selected book when a query is made.

With that context, I have a few questions:

  1. Is there a more efficient way to constrain the model's responses to a specific source document than by manipulating the prompt, using metadata, or swapping databases?
  2. If I proceed with the dropdown menu and database swapping approach, are there any potential drawbacks or issues I should be aware of?
  3. Given the potential usefulness of this feature to other users, would it be worth considering the addition of an option to limit responses to a specific source in the main repository?

Thanks for your time & I'd appreciate your insights.

PS: Adding this under docs, because it might be a result of my lack of understanding of how everything works together.

@neeewwww

To enrich this discussion: it would be beneficial to be able to include keywords that help "direct" the response, for example:

Keywords: Cat, Cat Book, Meals.
Please proceed with your question.

@hippalectryon-0
Contributor

Is there a more efficient way to constrain the model's responses to a specific source document than by manipulating the prompt, using metadata, or swapping databases?

Yes. Several options:

  • Add the source metadata to the prompt. Pro: very easy to implement. Con: costs tokens for each source, and relies on the right source being picked as a "relevant" source in the first place.
  • Since retrieving items from the db is very fast compared to generating the answer, we can fetch many items and filter them by their metadata before forwarding them to the model. Pro: easy to do. Con: relies on the database fetcher being good enough to retrieve the documents we want among all the documents requested.
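
A rough sketch of the second option (over-fetch, then filter by metadata), assuming each retrieved chunk carries a "source" entry in its metadata. The `filter_by_source` helper and the sample chunks are hypothetical; in practice `fetched` would come from the retriever.

```python
from typing import Dict, List, Tuple

Chunk = Tuple[str, Dict[str, str]]  # (text, metadata)


def filter_by_source(candidates: List[Chunk], source: str, n_forward: int) -> List[Chunk]:
    """Keep only chunks from `source`, preserving retrieval order, capped at n_forward."""
    kept = [chunk for chunk in candidates if chunk[1].get("source") == source]
    return kept[:n_forward]


# Stand-in for a large, already-ranked fetch from the vector store.
fetched: List[Chunk] = [
    ("cat chunk A", {"source": "cats.pdf"}),
    ("dog chunk A", {"source": "dogs.pdf"}),
    ("cat chunk B", {"source": "cats.pdf"}),
    ("moon chunk A", {"source": "astronomy.pdf"}),
    ("cat chunk C", {"source": "cats.pdf"}),
]

forwarded = filter_by_source(fetched, "cats.pdf", n_forward=2)
```

The con mentioned above shows up directly: if none of the fetched candidates come from the selected source, the filtered list is empty and the model gets no context at all.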

If I proceed with the dropdown menu and database swapping approach, are there any potential drawbacks or issues I should be aware of?

I'm unsure how the db is stored, but I don't think it's ordered by metadata, so if your db is big, filtering by document could take a long time.

Given the potential usefulness of this feature to other users, would it be worth considering the addition of an option to limit responses to a specific source in the main repository?

Yes, that sounds like a good idea.

@su77ungr
Owner

Some delightful ideas I would have to think about, thanks.

For starters, I see no reason why we could not just pipe in something like:

# Sketch only: LangChain's as_retriever() takes its search options via search_kwargs.
# Whether the "filter" entry is honoured on the MMR path still needs checking.
search_kwargs = {"k": n_forward_documents, "fetch_k": n_retrieve_documents}
if selected_document is not None:
    # Restrict retrieval to chunks whose metadata matches the selected document.
    search_kwargs["filter"] = {"source_document": selected_document}
retriever = self.qdrant_langchain.as_retriever(search_type="mmr", search_kwargs=search_kwargs)

This should not perform worse than any other approach (I still have to check MMR compatibility, though). I think it should be possible to fetch a list of all available documents and set the filter on selection.
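
One possible shape for the "list all available documents, then filter on selection" step, as a sketch only. It assumes the chunk metadata can be read back as plain dicts and that the key is "source_document" as in the snippet above; `available_documents` and `resolve_selection` are hypothetical helper names, not existing functions in the repository.

```python
from typing import Dict, Iterable, List, Optional


def available_documents(payloads: Iterable[Dict[str, object]], key: str = "source_document") -> List[str]:
    """Distinct values of `key` across the stored chunk metadata, sorted for display."""
    return sorted({str(p[key]) for p in payloads if key in p})


def resolve_selection(choice: Optional[str], documents: List[str]) -> Optional[str]:
    """Return the choice if it names a known document, else None (search everything)."""
    return choice if choice in documents else None


# Stand-in for metadata read back from the vector store.
payloads = [
    {"source_document": "dogs.pdf"},
    {"source_document": "cats.pdf"},
    {"source_document": "cats.pdf"},
    {"page": 3},  # a chunk without the key is simply skipped
]

documents = available_documents(payloads)
selected = resolve_selection("cats.pdf", documents)
```

Returning None for an unknown or empty selection keeps the current behaviour (search across everything) as the fallback.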

@su77ungr added the enhancement (New feature or request) and 🚀🚀🚀 Feature labels May 19, 2023
@nramirez
Author

Nice, thanks for the quick suggestions. I'd probably go with the easiest one for now: #95 (comment)

Another idea is to modify ingest.py. Currently, it creates a single collection for all books, but perhaps we could adapt it so that a new collection is created for each source_document. startLLM.py would then have to be updated to retrieve only from the selected collection. However, I'm not sure about the potential performance implications of this approach.
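
A small sketch of what the per-document split could look like on the ingest side. The naming scheme (`book_` prefix, lowercased file stem) and both helper functions are assumptions for illustration; the actual collection creation in ingest.py is not shown here.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, Iterable, List, Tuple


def collection_name(source_path: str) -> str:
    """Derive a collection name from a file path (hypothetical naming scheme)."""
    stem = PurePosixPath(source_path).stem.lower()
    return "book_" + re.sub(r"[^a-z0-9]+", "_", stem).strip("_")


def group_by_collection(chunks: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (text, source_path) chunks by the collection they would be written to."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for text, source_path in chunks:
        groups[collection_name(source_path)].append(text)
    return dict(groups)


chunks = [
    ("cat chunk A", "source_documents/Cats.pdf"),
    ("dog chunk A", "source_documents/dogs.pdf"),
    ("cat chunk B", "source_documents/Cats.pdf"),
]
grouped = group_by_collection(chunks)
```

Each key in `grouped` would become its own collection, and the query side would only need the name of the selected one.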

The challenge here is that many users may prefer to search across all their documents, which is not my specific use case.

Regardless of the approach, we will likely need a mechanism to keep track of which names have been ingested. This could potentially be achieved by creating another table in Qdrant for storing global metadata.
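
To illustrate the bookkeeping idea, here is a minimal sketch that swaps the Qdrant table for a plain JSON file on disk, purely to keep the example self-contained. The file layout and the `load_registry` / `register` helpers are hypothetical.

```python
import json
from pathlib import Path
from typing import List


def load_registry(path: str) -> List[str]:
    """Return the ingested document names, or an empty list if nothing was ingested yet."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else []


def register(path: str, source_name: str) -> List[str]:
    """Record `source_name` as ingested (idempotent) and return the sorted list."""
    names = load_registry(path)
    if source_name not in names:
        names.append(source_name)
        Path(path).write_text(json.dumps(sorted(names)))
    return sorted(names)
```

ingest.py could call `register(...)` once per file, and the dropdown could be populated from `load_registry(...)`. The same two operations would map onto a dedicated metadata store if the list should live next to the vectors.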

I'd love to collaborate, but my plate is pretty full these days. I usually sneak in some time for AI projects at night. Really appreciate what you're doing here!

@nramirez changed the title from "DOC: Best practices for limiting responses to a specific source document" to "Best practices for limiting responses to a specific source document" May 19, 2023