# Retrieval

Retrieval is the centerpiece of our retrieval-augmented generation (RAG) flow.

Let's load the vector database (vectordb) we built earlier.

## Vectorstore retrieval


In [4]:
import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv("env_vars.env")) # read local .env file

openai.api_key  = os.environ['OPENAI_API_KEY']

In [5]:
#!pip install lark

### Similarity Search

In [6]:
from langchain.vectorstores import Chroma
# from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

persist_directory = '../docs/chroma/'

In [7]:
# embedding = OpenAIEmbeddings()
embedding = HuggingFaceEmbeddings()

vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)


In [8]:
print(vectordb._collection.count())

12


In [9]:
texts = [
    """The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).""",
    """A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.""",
    """A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.""",
]

In [10]:
smalldb = Chroma.from_texts(texts, embedding=embedding)

In [11]:
question = "Tell me about all-white mushrooms with large fruiting bodies"

In [12]:
smalldb.similarity_search(question, k=2)

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.', metadata={}),
 Document(page_content='A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.', metadata={})]

In [13]:
smalldb.max_marginal_relevance_search(question, k=2, fetch_k=3)

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.', metadata={}),
 Document(page_content='A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.', metadata={})]

### Addressing Diversity: Maximum marginal relevance

Last class we introduced one problem: how to enforce diversity in the search results.
 
`Maximum marginal relevance` (`MMR`) strives to achieve both relevance to the query *and diversity* among the results.
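To make the relevance/diversity trade-off concrete, here is a minimal sketch of greedy MMR selection in plain Python. It is an illustration, not the vectorstore's actual implementation: the function names (`cosine`, `mmr_select`), the toy 2-D vectors, and the weighting parameter `lambda_mult` are assumptions of this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k document indices, balancing relevance and diversity.

    lambda_mult=1.0 ranks purely by relevance; lower values penalize
    candidates that resemble documents that were already selected.
    """
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Highest similarity to anything already picked (0 if nothing is).
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


# Toy data (hypothetical): docs 0 and 1 are near-duplicates, doc 2 is different.
query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.12], [0.6, 0.8]]

picked = mmr_select(query, docs, k=2, lambda_mult=0.3)
print(picked)  # [0, 2]
```

With `lambda_mult=0.3` the near-duplicate (index 1) is skipped in favor of the less similar document (index 2); with `lambda_mult=1.0` the same call returns the two most relevant documents, `[0, 1]`.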

In [14]:
question = "what did they say about minimum spend?"
docs_ss = vectordb.similarity_search(question, k=3)

In [15]:
docs_ss[0].page_content[:100]

'(i) “Electric Vehicle Charging” is defined as Card Transactions classified under the MCC 5552  \n(Ele'

In [16]:
docs_ss[1].page_content[:100]

'4 \n S$800  S$80  \nS$1,600  S$160  \n \nb) Card transactions made under the following categories in Tab'

Note the difference in results with `MMR`.

In [17]:
docs_mmr = vectordb.max_marginal_relevance_search(question, k=3)

Number of requested results 20 is greater than number of elements in index 12, updating n_results = 12


In [18]:
docs_mmr[0].page_content[:100]

'(i) “Electric Vehicle Charging” is defined as Card Transactions classified under the MCC 5552  \n(Ele'

In [19]:
docs_mmr[1].page_content[:100]

'Account by the Principal and Supplementary Cardmembers in each calendar month , but \nexcludes the Ex'

### Addressing Specificity: working with metadata

In the last lecture, we showed that a question about the third lecture can include results from other lectures as well.

To address this, many vectorstores support operations on `metadata`.

`metadata` provides context for each embedded chunk.
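As a rough illustration of what a metadata filter does, the sketch below first restricts the candidate chunks to those whose metadata matches, and only then ranks them. Everything here is a stand-in: the chunk texts and the second source path are made up, `matches` and `filtered_search` are hypothetical helpers, and word overlap replaces real embedding similarity.

```python
import re

# Toy chunks (hypothetical text); only the first source path appears in this notebook.
chunks = [
    {"text": "Cashback on travel spend is capped each month.",
     "metadata": {"source": "../docs/tncs-365cc-programme.pdf", "page": 3}},
    {"text": "Travel insurance covers delayed baggage.",
     "metadata": {"source": "hypothetical/other-document.pdf", "page": 1}},
    {"text": "Minimum spend is computed per calendar month.",
     "metadata": {"source": "../docs/tncs-365cc-programme.pdf", "page": 4}},
]


def words(text):
    """Lowercased word set, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))


def matches(metadata, where):
    """True if every key in `where` has the same value in `metadata`."""
    return all(metadata.get(key) == value for key, value in where.items())


def filtered_search(question, chunks, where, k=3):
    """Keep only chunks passing the metadata filter, then rank by word overlap."""
    allowed = [c for c in chunks if matches(c["metadata"], where)]
    query_words = words(question)
    ranked = sorted(
        allowed,
        key=lambda c: len(query_words & words(c["text"])),
        reverse=True,
    )
    return ranked[:k]


results = filtered_search(
    "what did they say about travel cashback?",
    chunks,
    where={"source": "../docs/tncs-365cc-programme.pdf"},
)
for r in results:
    print(r["metadata"])
```

The chunk from the other source is never scored, even though it mentions travel, because the filter is applied before ranking.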

In [20]:
question = "what did they say about travel cashback?"

In [21]:
docs = vectordb.similarity_search(
    question,
    k=3,
    # filter={"source":"docs/cs229_lectures/MachineLearning-Lecture03.pdf"}
)

In [22]:
for d in docs:
    print(d.metadata)

{'source': '../docs/tncs-365cc-programme.pdf', 'page': 4}
{'source': '../docs/tncs-365cc-programme.pdf', 'page': 3}


### Addressing Specificity: working with metadata using self-query retriever

But we have an interesting challenge: we often want to infer the metadata from the query itself.

To address this, we can use `SelfQueryRetriever`, which uses an LLM to extract:
 
1. The `query` string to use for vector search
2. A metadata filter to pass in as well

Most vector databases support metadata filters, so this doesn't require any new databases or indexes.
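The two outputs listed above can be pictured as a small structured object. The sketch below is a toy stand-in in which a regular expression plays the role of the LLM; `StructuredQuery`, `toy_query_constructor`, and the ordinal lookup are hypothetical names for illustration and do not reflect how `SelfQueryRetriever` is implemented.

```python
import re
from dataclasses import dataclass
from typing import Optional

ORDINALS = {"first": 1, "second": 2, "third": 3}


@dataclass
class StructuredQuery:
    query: str               # text to embed for the vector search
    filter: Optional[dict]   # metadata filter, or None if nothing was inferred


def toy_query_constructor(question):
    """Split a question into a search string and a metadata filter.

    A regex stands in for the LLM here, so only phrases such as
    'in the third lecture' are recognized.
    """
    match = re.search(r"\s+in the (first|second|third) lecture", question)
    if not match:
        return StructuredQuery(query=question, filter=None)
    number = ORDINALS[match.group(1)]
    source = f"docs/cs229_lectures/MachineLearning-Lecture{number:02d}.pdf"
    # Drop the part of the question that was turned into a filter.
    remaining = question[:match.start()] + question[match.end():]
    return StructuredQuery(query=remaining.strip(), filter={"source": source})


parsed = toy_query_constructor(
    "what did they say about regression in the third lecture?"
)
print(parsed.query)   # what did they say about regression?
print(parsed.filter)  # {'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
```

An LLM-based constructor can go further than this regex, for example by condensing the search string to just the topic, but the shape of the result (a query plus a filter) is the same idea.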

In [23]:
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

In [24]:
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The lecture the chunk is from, should be one of `docs/cs229_lectures/MachineLearning-Lecture01.pdf`, `docs/cs229_lectures/MachineLearning-Lecture02.pdf`, or `docs/cs229_lectures/MachineLearning-Lecture03.pdf`",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page from the lecture",
        type="integer",
    ),
]

In [25]:
document_content_description = "Lecture notes"
llm = OpenAI(temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectordb,
    document_content_description,
    metadata_field_info,
    verbose=True
)

In [26]:
question = "what did they say about regression in the third lecture?"

**You will receive a warning** about `predict_and_parse` being deprecated the first time you execute the next line. This can be safely ignored.

In [27]:
docs = retriever.get_relevant_documents(question)



query='regression' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='source', value='docs/cs229_lectures/MachineLearning-Lecture03.pdf') limit=None


In [28]:
for d in docs:
    print(d.metadata)

### Additional tricks: compression

Another approach for improving the quality of retrieved docs is compression.

Information most relevant to a query may be buried in a document with a lot of irrelevant text. 

Passing that full document through your application can lead to more expensive LLM calls and poorer responses.

Contextual compression is meant to fix this. 
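As a rough picture of the idea, the sketch below keeps only the sentences of a document that share a keyword with the question. This is a deliberately crude stand-in: an LLM-based compressor decides relevance with a model rather than keyword overlap, and `toy_compress`, the stopword list, and the surrounding sample sentences are assumptions made for this example.

```python
import re

STOPWORDS = frozenset({"what", "did", "they", "say", "about", "the", "a", "is"})


def toy_compress(document, question):
    """Return only the sentences that share a non-stopword with the question."""
    keywords = set(re.findall(r"\w+", question.lower())) - STOPWORDS
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    kept = [
        s for s in sentences
        if keywords & set(re.findall(r"\w+", s.lower()))
    ]
    return " ".join(kept)


# Hypothetical document: one relevant sentence buried between two irrelevant ones.
document = (
    "These terms apply to the card programme. "
    "Any Cashback awarded will be reflected in the Billing Statement. "
    "The bank may amend these terms at any time."
)

compressed = toy_compress(document, "what did they say about cashback?")
print(compressed)
# Any Cashback awarded will be reflected in the Billing Statement.
```

Only the compressed text would be passed on to the LLM, which is where the savings in cost and noise come from.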

In [29]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

In [30]:
def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))

In [31]:
# Create the LLM-based compressor that will wrap our vectorstore retriever
llm = OpenAI(temperature=0)
compressor = LLMChainExtractor.from_llm(llm)

In [32]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)

In [34]:
question = "what did they say about cashback?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)

Document 1:

f) Refunded purchases will be deducted from the relevant monthly billed amount for the computation and award of Cashback. 
h) Any Cashback awarded will be reflected in the Billing Statement provided on a monthly basis. Such Cashback will be automatically offset against that month’s billed amount.


## Combining various techniques

In [35]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type="mmr")
)

In [36]:
question = "what did they say about cashback?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)

Number of requested results 20 is greater than number of elements in index 12, updating n_results = 12


Document 1:

f) Refunded purchases will be deducted from the relevant monthly billed amount for the computation and award of Cashback.  h) Any Cashback awarded will be reflected in the Billing Statement provided on a monthly basis. Such Cashback will be automatically offset against that month’s billed amount. j) We reserve the right at any time without giving any reason or notice to the Cardmember to deduct, withdraw or cancel  any Cashback awarded to you without liability. Cardmember will not be entitled to any payment or compensation whatsoever in respect of such deduction, withdrawal or cancellation.
----------------------------------------------------------------------------------------------------
Document 2:

"The following terms and conditions and any other rules, procedures, or instructions which we may issue from time to time (collectively "Terms and Conditions") shall apply to the OCBC 365 Credit Card (“OCBC 365 Card”)."
"These Terms and Conditions together with the terms of 

## Other types of retrieval

It's worth noting that a vectordb is not the only kind of tool for retrieving documents.

The `LangChain` retriever abstraction includes other ways to retrieve documents, such as TF-IDF or SVM.
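For intuition about the TF-IDF option, here is a small pure-Python sketch that ranks texts by summing TF-IDF weights of the question's terms. It is a simplified scoring scheme (no length normalization or cosine similarity) and is not a description of how `TFIDFRetriever` works internally; `tokenize`, `tfidf_rank`, and the sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def tfidf_rank(question, texts):
    """Return document indices ordered from best to worst TF-IDF match."""
    tokenized = [tokenize(t) for t in texts]
    n_docs = len(texts)
    # Number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    def idf(term):
        # Smoothed inverse document frequency: rarer terms weigh more.
        return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

    query_terms = set(tokenize(question))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append(sum(counts[term] * idf(term) for term in query_terms))
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


# Toy texts (hypothetical), standing in for the PDF splits.
texts = [
    "We will use MATLAB for the homework assignments.",
    "Linear regression fits a line to data.",
    "The class covers supervised learning and regression.",
]

ranking = tfidf_rank("what did they say about matlab?", texts)
print(texts[ranking[0]])  # We will use MATLAB for the homework assignments.
```

Unlike the embedding-based search used earlier, this kind of lexical retrieval needs no embedding model, but it only matches on shared words.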

In [None]:
from langchain.retrievers import SVMRetriever
from langchain.retrievers import TFIDFRetriever
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [None]:
# Load PDF
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()
all_page_text = [p.page_content for p in pages]
joined_page_text = " ".join(all_page_text)

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)
splits = text_splitter.split_text(joined_page_text)


In [None]:
# Retrieve
svm_retriever = SVMRetriever.from_texts(splits, embedding)
tfidf_retriever = TFIDFRetriever.from_texts(splits)

In [None]:
question = "What are major topics for this class?"
docs_svm = svm_retriever.get_relevant_documents(question)
docs_svm[0]

In [None]:
question = "what did they say about matlab?"
docs_tfidf = tfidf_retriever.get_relevant_documents(question)
docs_tfidf[0]