# Retrieval
Retrieval is the centerpiece of our retrieval-augmented generation (RAG) flow.

Let's load the vector database we built earlier.

## Vectorstore retrieval

## Similarity Search

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install -q -U langchain langchain_openai langchain-community sentence-transformers chromadb lark pypdf


In [None]:
import os
import sys
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
persist_directory = '/content/drive/MyDrive/LangChain course/LangChain-Data/docs/chroma/'

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.embeddings import HuggingFaceEmbeddings
embedding = HuggingFaceEmbeddings()
# embedding = OpenAIEmbeddings()



In [None]:
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)

In [None]:
print(vectordb._collection.count())

627


In [None]:
texts = [
    """The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).""",
    """A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.""",
    """A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.""",
]

In [None]:
smalldb = Chroma.from_texts(texts, embedding=embedding)

In [None]:
question = "Tell me about all-white mushrooms with large fruiting bodies"

In [None]:
smalldb.similarity_search(question, k=2)

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.')]

**Maximum Marginal Relevance (MMR)**

MMR is an important method for enforcing diversity in search results. Plain semantic search returns the documents that are most similar to the query in the embedding space, so it can miss information that is relevant but worded differently. For example, for the query “Tell me about all-white mushrooms with large fruiting bodies”, the two most similar results both describe a large fruiting body and all-white varieties. We miss the fact that the mushroom is poisonous, which is important but not similar to those first two documents. MMR addresses this by selecting a diverse set of documents.

The idea behind MMR is to first query the vector store for the “fetch_k” most similar documents. We then work on this smaller set and optimize for both relevance to the query and diversity among the results, finally returning the “k” documents that best balance the two. The similarity search above returned the same text twice. Now let's run the same query with MMR and look at the results.

In [None]:
smalldb.max_marginal_relevance_search(question,k=2, fetch_k=3)

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(page_content='A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.')]
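To make the selection rule concrete, here is a minimal sketch of MMR in plain Python. It is not LangChain's implementation: the helper names `cosine` and `mmr_select`, the trade-off weight `lambda_mult`, and the 2-D toy vectors are all assumptions made for illustration. Each step picks the candidate that maximizes relevance to the query minus similarity to the documents already selected.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mmr_select(query_vec, doc_vecs, k=2, fetch_k=3, lambda_mult=0.5):
    """Return the indices of k documents chosen by a simple MMR rule."""
    # Step 1: keep only the fetch_k documents most similar to the query.
    by_relevance = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    candidates = by_relevance[:fetch_k]

    # Step 2: greedily pick documents that are relevant but not redundant.
    selected = []
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy 2-D "embeddings" (made up): documents 0 and 1 are exact duplicates.
query_vec = [1.0, 0.0]
doc_vecs = [[0.9, 0.4], [0.9, 0.4], [0.8, -0.6]]

top_k_similar = mmr_select(query_vec, doc_vecs, k=2, fetch_k=3, lambda_mult=1.0)
top_k_mmr = mmr_select(query_vec, doc_vecs, k=2, fetch_k=3, lambda_mult=0.5)
print(top_k_similar)
print(top_k_mmr)
```

With `lambda_mult=1.0` the rule reduces to plain similarity ranking and returns the duplicate pair. With `lambda_mult=0.5` the duplicate is penalized and the third document is chosen instead.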

## Addressing Diversity: Maximum marginal relevance

Last class we introduced one problem: how to enforce diversity in the search results.

Maximum marginal relevance strives to achieve both relevance to the query and diversity among the results.

In [None]:
question = "what did they say about matlab?"
docs_ss = vectordb.similarity_search(question,k=3)

In [None]:
docs_ss[0].page_content[:100]

"So later this quarter, we'll use the discussion sections to talk about things like convex \noptimizat"

In [None]:
docs_ss[1].page_content[:100]

"So later this quarter, we'll use the discussion sections to talk about things like convex \noptimizat"

Note the difference in results with MMR.

In [None]:
docs_mmr = vectordb.max_marginal_relevance_search(question,k=3)

In [None]:
docs_mmr[0].page_content[:100]

"So later this quarter, we'll use the discussion sections to talk about things like convex \noptimizat"

In [None]:
docs_mmr[1].page_content[:100]

'those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people '

## Addressing Specificity: working with metadata
In the last lecture, we showed that a question about the third lecture can return results from other lectures as well.

To address this, many vectorstores support operations on metadata.

Metadata provides context for each embedded chunk.

In [None]:
question = "what did they say about regression in the third lecture?"

In [None]:
docs = vectordb.similarity_search(
    question,
    k=3,
    filter={"source":"/content/drive/MyDrive/LangChain course/LangChain-Data/docs/cs229_lectures/MachineLearning-Lecture03.pdf"}
)

In [None]:
for d in docs:
    print(d.metadata)

{'page': 0, 'source': '/content/drive/MyDrive/LangChain course/LangChain-Data/docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
{'page': 0, 'source': '/content/drive/MyDrive/LangChain course/LangChain-Data/docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
{'page': 7, 'source': '/content/drive/MyDrive/LangChain course/LangChain-Data/docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
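Conceptually, a metadata filter restricts the candidate set before ranking. The sketch below shows that idea with a plain list of dictionaries. The function `filtered_search`, the shortened file names, the chunk texts, and the keyword-based score are all hypothetical stand-ins; a real vector store would rank by embedding similarity and may support richer filter operators.

```python
def filtered_search(chunks, score_fn, k=3, where=None):
    """Rank chunks by score_fn, considering only chunks whose metadata matches `where`."""
    where = where or {}
    # Keep a chunk only if every key in the filter matches its metadata exactly.
    allowed = [
        c for c in chunks
        if all(c["metadata"].get(key) == value for key, value in where.items())
    ]
    return sorted(allowed, key=score_fn, reverse=True)[:k]

# Hypothetical chunks; the texts and shortened file names are invented.
chunks = [
    {"text": "linear regression recap", "metadata": {"source": "Lecture02.pdf", "page": 3}},
    {"text": "locally weighted regression", "metadata": {"source": "Lecture03.pdf", "page": 0}},
    {"text": "logistic regression", "metadata": {"source": "Lecture03.pdf", "page": 7}},
    {"text": "course logistics", "metadata": {"source": "Lecture01.pdf", "page": 1}},
]

def keyword_score(chunk):
    # Stand-in for embedding similarity: 1 if the chunk mentions "regression".
    return 1 if "regression" in chunk["text"] else 0

hits = filtered_search(chunks, keyword_score, k=3, where={"source": "Lecture03.pdf"})
for h in hits:
    print(h["metadata"])
```

Without the `where` argument, the chunk from the second lecture would also be a candidate, which is the problem the filter is meant to solve.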


## Addressing Specificity: working with metadata using self-query retriever
But we have an interesting challenge: we often want to infer the metadata from the query itself.

To address this, we can use SelfQueryRetriever, which uses an LLM to extract:

1. The query string to use for vector search
2. A metadata filter to pass in as well

Most vector databases support metadata filters, so this doesn't require any new databases or indexes.
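To illustrate the two outputs listed above, here is a toy stand-in for the query-construction step. It uses a regular expression instead of an LLM, so it only handles one phrasing. The function `split_query`, the `ORDINAL_TO_SOURCE` mapping, and the shortened file names are assumptions for this sketch and are not part of LangChain.

```python
import re

# Hypothetical mapping from ordinal words to (shortened) lecture file names.
ORDINAL_TO_SOURCE = {
    "first": "MachineLearning-Lecture01.pdf",
    "second": "MachineLearning-Lecture02.pdf",
    "third": "MachineLearning-Lecture03.pdf",
}

def split_query(question):
    """Split a question into (search_text, metadata_filter)."""
    match = re.search(r"\bin the (first|second|third) lecture\b", question, re.IGNORECASE)
    if match is None:
        # Nothing to infer: search with the full question and no filter.
        return question.strip(), {}
    metadata_filter = {"source": ORDINAL_TO_SOURCE[match.group(1).lower()]}
    # Remove the metadata phrase so it does not influence the vector search.
    search_text = (question[:match.start()] + question[match.end():]).strip(" ?")
    return search_text, metadata_filter

search_text, metadata_filter = split_query(
    "what did they say about regression in the third lecture?"
)
print(search_text)
print(metadata_filter)
```

An LLM-based retriever plays the same role but can interpret arbitrary phrasings, guided by the attribute descriptions it is given.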

In [89]:
# from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
from langchain_openai import ChatOpenAI

In [90]:
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The lecture the chunk is from, should be one of `/content/drive/MyDrive/LangChain course/LangChain-Data/docs/cs229_lectures/MachineLearning-Lecture01.pdf`, `/content/drive/MyDrive/LangChain course/LangChain-Data/docs/cs229_lectures/MachineLearning-Lecture02.pdf`, or `/content/drive/MyDrive/LangChain course/LangChain-Data/docs/cs229_lectures/MachineLearning-Lecture03.pdf`",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page from the lecture",
        type="integer",
    ),
]

In [None]:
from google.colab import userdata
api_key = userdata.get('API_KEY')

In [91]:
document_content_description = "Lecture notes"
llm = ChatOpenAI(model='meta-llama/llama-3-8b-instruct:free' , temperature=0, api_key=api_key, base_url="https://openrouter.ai/api/v1")
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectordb,
    document_content_description,
    metadata_field_info,
    verbose=True
)

In [92]:
question = "what did they say about regression in the third lecture?"

**You will receive a warning** about predict_and_parse being deprecated the first time you execute the next line. This can be safely ignored.

In [93]:
docs = retriever.get_relevant_documents(question)

In [94]:
for d in docs:
    print(d.metadata)

{'page': 0, 'source': '/content/drive/MyDrive/LangChain course/LangChain-Data/docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
{'page': 0, 'source': '/content/drive/MyDrive/LangChain course/LangChain-Data/docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
{'page': 5, 'source': '/content/drive/MyDrive/LangChain course/LangChain-Data/docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
{'page': 5, 'source': '/content/drive/MyDrive/LangChain course/LangChain-Data/docs/cs229_lectures/MachineLearning-Lecture03.pdf'}


## Additional tricks: compression
Another approach for improving the quality of retrieved docs is compression.

Information most relevant to a query may be buried in a document with a lot of irrelevant text.

Passing that full document through your application can lead to more expensive LLM calls and poorer responses.

Contextual compression is meant to fix this.

In [95]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

In [96]:
def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))


In [97]:
# LLM that extracts the relevant parts of each retrieved document
llm = ChatOpenAI(model='google/gemma-7b-it:free' , temperature=0, api_key=api_key, base_url="https://openrouter.ai/api/v1")
compressor = LLMChainExtractor.from_llm(llm)

In [98]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)

With compression, we run every retrieved document through a language model, extract the most relevant segments, and pass only those segments into the final language model call.
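The shape of that pipeline can be sketched without an LLM. In the example below, a keyword match stands in for the language model that extracts relevant segments. The functions `extract_relevant` and `compress`, the stopword list, and the sample texts are all invented for illustration.

```python
import re

def extract_relevant(text, question):
    """Keep only the sentences that share a keyword with the question."""
    stopwords = {"what", "did", "they", "say", "about", "the", "a", "is"}
    keywords = {w for w in re.findall(r"[a-z]+", question.lower()) if w not in stopwords}
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = [s for s in sentences if keywords & set(re.findall(r"[a-z]+", s.lower()))]
    return " ".join(kept)

def compress(docs, question):
    """Compress each retrieved document and drop those with nothing relevant."""
    compressed = (extract_relevant(d, question) for d in docs)
    return [c for c in compressed if c]

# Invented stand-ins for retrieved chunks.
retrieved = [
    "We meet twice a week. Homework will be done in MATLAB or Octave. Bring a laptop.",
    "Office hours are on Friday. The TAs will post the schedule.",
]
compressed_docs = compress(retrieved, "what did they say about matlab?")
print(compressed_docs)
```

Only the sentence that mentions MATLAB survives, and the second chunk is dropped entirely, so the final LLM call would see far less irrelevant text.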

In [99]:
question = "what did they say about matlab?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)

Document 1:

**MATLAB is actually totally worth learning. I know R and MATLAB, and I personally end up using MATLAB quite a bit more often for various reasons.**
----------------------------------------------------------------------------------------------------
Document 2:

**MATLAB is actually totally worth learning. I know R and MATLAB, and I personally end up using MATLAB quite a bit more often for various reasons.**
----------------------------------------------------------------------------------------------------
Document 3:

**MATLAB is actually totally worth learning. I know R and MATLAB, and I personally end up using MATLAB quite a bit more often for various reasons.**
----------------------------------------------------------------------------------------------------
Document 4:

**MATLAB is actually totally worth learning. I know R and MATLAB, and I personally end up using MATLAB quite a bit more often for various reasons.**


## Combining various techniques

In [None]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type = "mmr")
)

In [None]:
question = "what did they say about matlab?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)

Document 1:

**MATLAB is actually totally worth learning. I know R and MATLAB, and I personally end up using MATLAB quite a bit more often for various reasons.**
----------------------------------------------------------------------------------------------------
Document 2:

**Extracted Relevant Parts:**

"Oh, it was the MATLAB."

"So for those of you that don't know MATLAB yet, I hope you do learn it. It's not hard, and we'll actually have a short MATLAB tutorial in one of the discussion sections for those of you that don't know it."
----------------------------------------------------------------------------------------------------
Document 3:

**Relevant parts of the context:**

"MATLAB is a programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to plot data. And it's sort of an extremely easy to learn tool to use for implementing a lot of learning algorithms."

"And in case some of you want to work on y

## Other types of retrieval
It's worth noting that a vector database is not the only kind of tool for retrieving documents.

The LangChain retriever abstraction includes other ways to retrieve documents, such as TF-IDF or SVM.
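For intuition, here is a simplified TF-IDF ranking in plain Python. It scores each text by summing term frequency times inverse document frequency over the question's terms. This is a sketch under that assumption, not the scoring used by `TFIDFRetriever`; the functions `tokenize` and `tfidf_rank` and the sample texts are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf_rank(question, texts):
    """Return indices of texts ordered from most to least relevant."""
    tokenized = [tokenize(t) for t in texts]
    n_docs = len(texts)
    # Document frequency: in how many texts each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    def score(tokens):
        counts = Counter(tokens)
        total = 0.0
        for term in set(tokenize(question)):
            if counts[term] == 0:
                continue
            tf = counts[term] / len(tokens)
            # Rare terms get a higher weight than terms found in every text.
            idf = math.log(n_docs / doc_freq[term])
            total += tf * idf
        return total

    scores = [score(tokens) for tokens in tokenized]
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)

# Invented sample texts.
texts = [
    "Homework will be done in MATLAB or Octave.",
    "The class covers supervised learning and learning theory.",
    "Office hours are held on Friday.",
]
ranking = tfidf_rank("what did they say about matlab?", texts)
print(ranking)
```

Because this approach matches exact terms rather than meaning, it needs no embedding model, but it cannot match a question to a text that uses different words for the same idea.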

In [None]:
from langchain.retrievers import SVMRetriever
from langchain.retrievers import TFIDFRetriever
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [None]:
# Load PDF
loader = PyPDFLoader("/content/drive/MyDrive/LangChain course/LangChain-Data/docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()
all_page_text=[p.page_content for p in pages]
joined_page_text=" ".join(all_page_text)

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1500,chunk_overlap = 150)
splits = text_splitter.split_text(joined_page_text)


In [None]:
# Retrieve
svm_retriever = SVMRetriever.from_texts(splits,embedding)
tfidf_retriever = TFIDFRetriever.from_texts(splits)

In [None]:
question = "What are major topics for this class?"
docs_svm=svm_retriever.get_relevant_documents(question)
docs_svm[0]

Document(page_content="let me just check what questions you have righ t now. So if there are no questions, I'll just \nclose with two reminders, which are after class today or as you start to talk with other \npeople in this class, I just encourage you again to start to form project partners, to try to \nfind project partners to do your project with. And also, this is a good time to start forming \nstudy groups, so either talk to your friends  or post in the newsgroup, but we just \nencourage you to try to star t to do both of those today, okay? Form study groups, and try \nto find two other project partners.  \nSo thank you. I'm looking forward to teaching this class, and I'll see you in a couple of \ndays.   [End of Audio]  \nDuration: 69 minutes")

In [None]:
question = "what did they say about matlab?"
docs_tfidf=tfidf_retriever.get_relevant_documents(question)
docs_tfidf[0]

Document(page_content="Saxena and Min Sun here did, wh ich is given an image like this, right? This is actually a \npicture taken of the Stanford campus. You can apply that sort of cl ustering algorithm and \ngroup the picture into regions. Let me actually blow that up so that you can see it more \nclearly. Okay. So in the middle, you see the lines sort of groupi ng the image together, \ngrouping the image into [inaudible] regions.  \nAnd what Ashutosh and Min did was they then  applied the learning algorithm to say can \nwe take this clustering and us e it to build a 3D model of the world? And so using the \nclustering, they then had a lear ning algorithm try to learn what the 3D structure of the \nworld looks like so that they could come up with a 3D model that you can sort of fly \nthrough, okay? Although many people used to th ink it's not possible to take a single \nimage and build a 3D model, but using a lear ning algorithm and that sort of clustering \nalgorithm is the first ste