### Reference Documentation

In [1]:
# https://python.langchain.com/docs/integrations/retrievers/merger_retriever/
# https://python.langchain.com/docs/modules/data_connection/retrievers/contextual_compression/

### Imports

In [5]:
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OllamaEmbeddings # defaults to llama2

from langchain.retrievers.merger_retriever import MergerRetriever

DB_DIR = "./chroma_db_test"

### Langsmith setup to debug chain in detail

In [2]:
# Langsmith - to debug LLM responses

import os
from dotenv import load_dotenv
from langsmith import Client

load_dotenv()

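lcs = os.getenv("LANGCHAIN_SECRET")
if lcs is None:
    # os.environ only accepts strings, so fail early with a clear message
    raise RuntimeError("LANGCHAIN_SECRET is not set; add it to the .env file")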

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "llm-multi-doc-single-vdb"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = lcs

client = Client()

### Load all collections and create LOTR retriever

In [6]:
sys_docs = Chroma(
    collection_name="system-documentation",
    persist_directory=DB_DIR,
    embedding_function=OllamaEmbeddings(),
)

llm_papers = Chroma(
    collection_name="llm-papers",
    persist_directory=DB_DIR,
    embedding_function=OllamaEmbeddings(),
)

ssi_docs = Chroma(
    collection_name="ssi-docs",
    persist_directory=DB_DIR,
    embedding_function=OllamaEmbeddings(),
)

In [7]:
# Define three retrievers, one per collection, each returning its top 2 matches
sys_docs_ret = sys_docs.as_retriever(search_type="similarity", search_kwargs={"k": 2})
llm_papers_ret = llm_papers.as_retriever(
    search_type="similarity", search_kwargs={"k": 2}
)
ssi_docs_ret = ssi_docs.as_retriever(search_type="similarity", search_kwargs={"k": 2})

# We just pass a list of retrievers.
merge_retriever = MergerRetriever(
    retrievers=[sys_docs_ret, llm_papers_ret, ssi_docs_ret]
)
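
With `k=2` on each of the three retrievers, the merged list holds at most six documents. As far as I understand it, MergerRetriever interleaves the per-retriever results in round-robin order rather than re-ranking them. The sketch below imitates that ordering with plain lists; `interleave_results` and the sample strings are hypothetical stand-ins, not LangChain API.

```python
from itertools import zip_longest


def interleave_results(result_lists):
    """Round-robin merge: first hit of every list, then every second hit, and so on."""
    missing = object()  # sentinel marking lists that ran out of hits
    merged = []
    for rank_group in zip_longest(*result_lists, fillvalue=missing):
        merged.extend(doc for doc in rank_group if doc is not missing)
    return merged


# Hypothetical stand-ins for the hits of each retriever (not real output).
sys_hits = ["sys-1", "sys-2"]
paper_hits = ["paper-1", "paper-2"]
ssi_hits = ["ssi-1"]  # a retriever may return fewer than k documents

print(interleave_results([sys_hits, paper_hits, ssi_hits]))
# ['sys-1', 'paper-1', 'ssi-1', 'sys-2', 'paper-2']
```

Because the order is positional, a weak top hit from one collection still sits ahead of a strong second hit from another, which is why a filtering step after the merge is useful.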

### Document Filtering using LLMChainFilter

Several of the document filters available in LangChain were tested in the llm-multi-collections-doc-filtering-testing.ipynb notebook; the combination of Mistral and LLMChainFilter yielded the best results.
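
Conceptually, LLMChainFilter asks the LLM a yes/no relevance question for each retrieved document and drops the ones judged irrelevant, leaving the kept documents unchanged. A minimal sketch of that idea, assuming a simple word-overlap check in place of the LLM call; `filter_documents`, `keyword_judge` and the sample strings are hypothetical and not part of LangChain.

```python
from typing import Callable, List


def filter_documents(
    docs: List[str], query: str, is_relevant: Callable[[str, str], bool]
) -> List[str]:
    """Keep only the documents the judge marks as relevant to the query."""
    return [doc for doc in docs if is_relevant(query, doc)]


def keyword_judge(query: str, doc: str) -> bool:
    """Hypothetical stand-in for the LLM's YES/NO answer: any shared word counts."""
    query_words = set(query.lower().split())
    return any(word in query_words for word in doc.lower().split())


docs = [
    "pandas DataFrame shape attribute",
    "settlement instructions for Finland",
    "GPT-4 early experiments",
]
print(filter_documents(docs, "check shape in pandas", keyword_judge))
# ['pandas DataFrame shape attribute']
```

The real filter replaces `keyword_judge` with one LLM call per document, so its cost grows with the number of documents the merged retriever returns.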

In [8]:
# moving the import here, easier to chop and change

from langchain.retrievers.document_compressors import (
    DocumentCompressorPipeline,
    LLMChainFilter,
)
from langchain.retrievers import ContextualCompressionRetriever

In [12]:
from langchain_community.llms import Ollama
llm = Ollama(model="mistral")

In [14]:
relevant_filter = LLMChainFilter.from_llm(llm)
pipeline_comp = DocumentCompressorPipeline(transformers=[relevant_filter])

compression_retriever = ContextualCompressionRetriever(
    base_compressor=pipeline_comp, base_retriever=merge_retriever
)

compression_retriever.get_relevant_documents("Who are the authors of Sparks of Artificial General Intelligence?")

# As seen below, the LLM chain filter returned 5 documents instead of the baseline 6 (3 retrievers x k=2).



[Document(page_content='Sparks of Artiﬁcial General Intelligence:\nEarly experiments with GPT-4\nS´ ebastien Bubeck Varun Chandrasekaran Ronen Eldan Johannes Gehrke\nEric Horvitz Ece Kamar Peter Lee Yin Tat Lee Yuanzhi Li Scott Lundberg\nHarsha Nori Hamid Palangi Marco Tulio Ribeiro Yi Zhang\nMicrosoft Research\nAbstract\nArtiﬁcial intelligence (AI) researchers have been developing and reﬁning large language models (LLMs)\nthat exhibit remarkable capabilities across a variety of domains and tasks, challenging our understanding\nof learning and cognition. The latest model developed by OpenAI, GPT-4 [Ope23], was trained using an\nunprecedented scale of compute and data. In this paper, we report on our investigation of an early version\nof GPT-4, when it was still in active development by OpenAI. We contend that (this early version of) GPT-\n4 is part of a new cohort of LLMs (along with ChatGPT and Google’s PaLM for example) that exhibit\nmore general intelligence than previous AI models.

In [16]:
compression_retriever.get_relevant_documents("How do I check shape of data in pandas?")



[Document(page_content='In [293]: s\nOut[293]: \n0   1 days 00:00:05\n1   1 days 00:00:06\n2   1 days 00:00:07\n3   1 days 00:00:08\ndtype: timedelta64[ns]\n\nIn [294]: s.dt.days\nOut[294]: \n0    1\n1    1\n2    1\n3    1\ndtype: int64\n\nIn [295]: s.dt.seconds\nOut[295]: \n0    5\n1    6\n2    7\n3    8\ndtype: int32\n\nIn [296]: s.dt.components\nOut[296]: \n   days  hours  minutes  seconds  milliseconds  microseconds  nanoseconds\n0     1      0        0        5             0             0            0\n1     1      0        0        6             0             0            0\n2     1      0        0        7             0             0            0\n3     1      0        0        8             0             0            0\n\n\n\nNote\nSeries.dt will raise a TypeError if you access with a non-datetime-like values.', metadata={'source': 'data/system-documentation/pandas-basics.html', 'title': 'Essential basic functionality — pandas 2.2.2 documentation'})]

In [18]:
compression_retriever.get_relevant_documents("Please provide settlement instructions for BAHRAIN")



[Document(page_content='Deutsche Bank AG, Frankfurt Cash Equities    \n                           \n \n  \n3 | P a g e   \nJanuary 2021  \n For internal use only  COUNTRY                 SETTLEMENT INSTRUCTIONS                        SWIFT CODE   \n  \n  \nEUROCLEAR                   Euroclear. Brussels                                                           MGTCBEBEECL  \n                                          A/C 77838  \n                                          Place of settlement = MGTCBEBEXXX  \n    \n  \n  \nESTONIA                          SWEDBANK AS                                                                          HABAEE2 XXXX  \nA/c no: 99103794855                                                                               \nPlace of settlement =  ECSD EE2XXXX  \n  \n  \nFINLAND                         Nordea Bank AB (publ), Finnish Branch                         NDEAFIHHXXX  \n                                         Place of settlement = APKEFIHHXXX   \n  \n 

### LLM QA Chain

In [27]:
from langchain_community.chat_models import ChatOllama
from langchain.chains import RetrievalQA

llm = ChatOllama(model="mistral")

The final chain needs little work: the heavy lifting should happen in the retriever so that the correct documents reach the LLM. llama3 could not be used because `invoke` returned nothing even after a minute.

In [28]:
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=compression_retriever,
    return_source_documents=True,
)
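
The `stuff` chain type places every retrieved document into a single prompt, which is why the quality of the retriever output matters so much here. A rough sketch of that idea, assuming a made-up template; `build_stuff_prompt` and its wording are hypothetical and do not reproduce the prompt LangChain actually uses.

```python
from typing import List


def build_stuff_prompt(question: str, documents: List[str]) -> str:
    """Concatenate all documents into one context block, then append the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical document texts, for illustration only.
prompt = build_stuff_prompt(
    "Who wrote the paper?",
    ["First retrieved chunk.", "Second retrieved chunk."],
)
print(prompt)
```

Every irrelevant chunk that survives filtering ends up in this context, so it both consumes context window and can mislead the answer.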

In [29]:
# response parser
def process_llm_response(llm_response):
    print(llm_response["result"])
    print("\nSources:")
    sources = {x.metadata["source"] for x in llm_response["source_documents"]}
    print(sources)
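
The source handling can be tried without running the chain by building a response of the same shape by hand. The keys `result` and `source_documents` are the ones the parser above reads; `collect_sources`, the `SimpleNamespace` documents and the file paths are hypothetical test data. Sorting is an added choice here so the output order is stable, which a plain set does not guarantee.

```python
from types import SimpleNamespace


def collect_sources(llm_response):
    """Return the unique source paths of a response in sorted order."""
    return sorted(
        {doc.metadata["source"] for doc in llm_response["source_documents"]}
    )


# Hand-built response with a duplicated source (illustrative values only).
fake_response = {
    "result": "placeholder answer",
    "source_documents": [
        SimpleNamespace(metadata={"source": "data/b.pdf"}),
        SimpleNamespace(metadata={"source": "data/a.pdf"}),
        SimpleNamespace(metadata={"source": "data/b.pdf"}),
    ],
}

print(collect_sources(fake_response))
# ['data/a.pdf', 'data/b.pdf']
```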

[Langsmith Trace](https://smith.langchain.com/public/c1e69c6a-f9d2-40bf-86cc-bf87c52c4366/r)

In [30]:
process_llm_response(
    qa_chain.invoke(
        "Who are the authors of Sparks of Artificial General Intelligence?"
    )
)



 The authors of "Sparks of Artificial General Intelligence" are: Sebastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang.

Sources:
{'data/llm-papers/openai-paper.pdf'}


[Langsmith Trace](https://smith.langchain.com/public/ace34b1f-f899-4dfd-8ffd-9cddd2186c06/r)

In [32]:
process_llm_response(
    qa_chain.invoke(
        "How do I check shape of data in pandas?"
    )
)



 To check the shape (number of rows and columns) of a Pandas Series, you can use the `shape` attribute or the `size` attribute. Here's how to use them:

```python
In [297]: s.shape  # number of rows and number of columns as a tuple (row, column)
Out[297]: (len(s),)

In [298]: s.size        # total number of elements in the Series
Out[298]: len(s)
```

However, since you provided a datetime Series, let me clarify that: The code snippets given earlier were showing how to extract various components from each element (i.e., timedelta64 values) of the Series named `s`. In this case, there are no missing data as all elements have equal shape (one-dimensional with a single row). Therefore, the shape and size will be 1 for the entire Series or equivalently, the number of elements in it.

To summarize:

1. To check the shape or size of a Series, use `shape` or `size`.
2. In this specific example, since the Series contains equal-shaped datetime values, both shape and size will be 1.

Sources:
{'

[Langsmith Trace](https://smith.langchain.com/public/e2a29900-6067-4503-a455-39613b75f258/r/725fd87b-787a-4e4d-8538-59e1a5543d6c)

In [33]:
process_llm_response(
    qa_chain.invoke("Please provide settlement instruction details for BAHRAIN")
)



 The settlement instructions for Bahrain are as follows:

Country: Bahrain
Settlement Instructions:

BANK: HSBC Bank Middle East, Qatar
Account number (A/C): Deutsche Bank AG London A/C 001-044387-088 (stock)
Place of settlement: DSMDQAQAXXX
Use Beneficiary BIC: DEUTGB22EEQ (for HSBC Middle East, Bahrain BBMEQAQXXXX)

Alternatively, for clearing through an agent, use the following details:
Bank: Saudi British Bank (Riyadh)
Account number (A/C): 086-003969-123 (Stock)
Place of settlement: TSSMSAR1XXX
Use Beneficiary BIC: DEUTGB22EEQ

Sources:
{'data/ssi-docs/db-ssi.pdf'}
