# **Retrieval**

In LangChain, retrieval refers to the process of accessing and fetching relevant pieces of information from a collection of documents or data sources based on a query. This involves using techniques like similarity search and embedding-based methods to find and return the most pertinent documents or text segments that match the query's context and content.

In [None]:
%%capture
# update or install the necessary libraries
!pip install --upgrade langchain langchain_community langchain-openai
!pip install pypdf
!pip install tiktoken
!pip install faiss-cpu
!pip install lark
!pip install --upgrade python-dotenv

In [None]:
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# load_dotenv() already copies the .env values into os.environ,
# so just fail early with a clear message if any of them is missing
for name in ("OPENAI_API_VERSION", "AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY"):
    if not os.getenv(name):
        raise RuntimeError(f"Missing environment variable: {name}")

# **Vectorstore retrieval**

Vectorstore retrieval is a technique that involves storing and retrieving data using vector representations of text. It enables efficient similarity searches by converting text into high-dimensional vectors and then finding the closest vectors in the database. This method is particularly useful for applications requiring fast and accurate information retrieval based on semantic similarity.
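
To make the store-then-search idea concrete before using a real vector database, here is a minimal, library-free sketch. Everything in it is an assumption made for illustration: `toy_embed` is a hypothetical stand-in for an embedding model (it just counts words from a tiny fixed vocabulary), and `ToyVectorStore` is not a LangChain class.

```python
import math
from collections import Counter

# Hypothetical fixed vocabulary; a real embedding model produces dense learned vectors.
VOCAB = ["mushroom", "poisonous", "regression", "matlab", "lecture"]

def toy_embed(text):
    """Map text to a vector of word counts over VOCAB (stand-in for an embedding model)."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in VOCAB]

class ToyVectorStore:
    """Keeps (vector, text) pairs and returns the texts closest to a query."""

    def __init__(self):
        self.entries = []

    def add_texts(self, texts):
        for text in texts:
            self.entries.append((toy_embed(text), text))

    def search(self, query, k=1):
        query_vec = toy_embed(query)
        # Rank stored vectors by Euclidean (L2) distance to the query vector.
        ranked = sorted(self.entries, key=lambda entry: math.dist(query_vec, entry[0]))
        return [text for _, text in ranked[:k]]

store = ToyVectorStore()
store.add_texts([
    "the lecture covers linear regression",
    "homework uses matlab",
    "this mushroom is poisonous",
])
top = store.search("is matlab required", k=1)
print(top)
```

A production store differs mainly in scale: the vectors come from a trained model and the nearest-neighbour lookup uses an index instead of a full sort.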



In [None]:
from langchain_community.document_loaders import PyPDFLoader

# Load PDF
loaders = [
    PyPDFLoader("./content/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("./content/MachineLearning-Lecture02.pdf"),
    PyPDFLoader("./content/MachineLearning-Lecture03.pdf"),
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

In [4]:
# Split
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)

splits = text_splitter.split_documents(docs)

len(splits)

151

In [None]:
from langchain_community.vectorstores import FAISS
persist_directory = 'docs/faiss/'

In [None]:
from langchain_openai import AzureOpenAIEmbeddings

# Initialize Azure OpenAI embeddings
embedding = AzureOpenAIEmbeddings(azure_deployment="text-embedding-ada-002")

# !rm -rf ./docs/faiss  # remove old database files if any
vectordb = FAISS.from_documents(
    documents=splits,
    embedding=embedding
)

vectordb.save_local(persist_directory)


In [8]:
# Number of vectors stored in the underlying FAISS index
vectordb.index.ntotal

151

In [9]:
texts = [
    """The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).""",
    """A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.""",
    """A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.""",
]

In [None]:
smalldb = FAISS.from_texts(texts, embedding=embedding)


In [11]:
question = "Tell me about all-white mushrooms with large fruiting bodies"

# **Similarity Search**

Similarity search is a method used to find items in a database that are most similar to a given query item based on certain criteria. It involves comparing vector representations of items and retrieving those with the smallest distance or highest similarity score. This approach is commonly used in applications like document retrieval, image matching, and recommendation systems.
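
As a rough illustration of "highest similarity score", the sketch below ranks a few hand-written 2-D vectors by cosine similarity to a query vector. The vectors and the helper names `cosine_similarity` and `top_k` are made up for this example; real embeddings have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]

doc_vecs = [
    [1.0, 0.0],  # doc 0
    [0.8, 0.6],  # doc 1
    [0.0, 1.0],  # doc 2
]
query_vec = [1.0, 0.2]
ranking = top_k(query_vec, doc_vecs, k=2)
print(ranking)
```

Doc 0 points almost exactly where the query does, so it ranks first, and doc 2 is nearly orthogonal to the query, so it is dropped at `k=2`.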



In [None]:
# Query the small mushroom store built above
smalldb.similarity_search(question, k=2)


[Document(metadata={}, page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(metadata={}, page_content='The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).')]

In [None]:
# Fetch the 3 most similar texts, then keep the 2 that are most diverse
smalldb.max_marginal_relevance_search(question, k=2, fetch_k=3)

[Document(metadata={}, page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(metadata={}, page_content='A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.')]

## Addressing Diversity: Maximum marginal relevance

Last class we introduced one problem: how to enforce diversity in the search results.

`Maximum marginal relevance` strives to achieve both relevance to the query *and diversity* among the results.

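Maximum Marginal Relevance (MMR) is an information-retrieval technique that picks results one at a time, scoring each candidate by its relevance to the query minus its similarity to the results already selected. In `max_marginal_relevance_search`, `fetch_k` sets how many candidates are first fetched by plain similarity and `k` sets how many are kept after re-ranking. The result is a set of documents that covers more of the topic than the top `k` nearest neighbours alone.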
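
The greedy selection behind MMR can be sketched in a few lines. This is a simplified illustration on hand-written 2-D vectors, not LangChain's implementation; the function names and the `lambda_mult` weighting (1.0 means pure relevance, lower values favour diversity) are assumptions of this sketch.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Greedily pick k document indices, trading relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected

doc_vecs = [
    [1.0, 0.0],   # doc 0: relevant, but nearly a duplicate of doc 1
    [0.98, 0.2],  # doc 1: most relevant
    [0.0, 1.0],   # doc 2: less relevant, but different
]
query_vec = [1.0, 0.3]
diverse = mmr(query_vec, doc_vecs, k=2)                      # balances both terms
relevant_only = mmr(query_vec, doc_vecs, k=2, lambda_mult=1.0)  # plain similarity
print(diverse, relevant_only)
```

With the default weighting the near-duplicate doc 0 is skipped in favour of doc 2, while `lambda_mult=1.0` reduces to ordinary similarity ranking.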


In [14]:
question = "what did they say about matlab?"
docs_ss = vectordb.similarity_search(question,k=3)


In [15]:
docs_ss[0].page_content[:100]

'those homeworks will be done in either MATLAB or in Octave, which is sort of — I \nknow some people c'

In [16]:
docs_ss[1].page_content[:100]

'into his office and he said, "Oh, professor, professor, thank you so much for your \nmachine learning'

Note the difference in results with `MMR`.

In [17]:
docs_mmr = vectordb.max_marginal_relevance_search(question,k=3)

In [18]:
docs_mmr[0].page_content[:100]

'those homeworks will be done in either MATLAB or in Octave, which is sort of — I \nknow some people c'

In [19]:
docs_mmr[1].page_content[:100]

'least squares regression being a bad idea for classification problems and then I did a \nbunch of mat'

## **Addressing Specificity: working with metadata**

In the last lecture, we showed that a question about the third lecture can return results from other lectures as well.

To address this, many vectorstores support operations on `metadata`.

`metadata` provides context for each embedded chunk.

Metadata filtering refers to the process of selectively retrieving or excluding data based on predefined metadata attributes or criteria. In the context of information retrieval systems like LangChain, metadata filtering allows users to narrow down search results by specifying metadata tags or attributes associated with documents or data chunks.
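
Conceptually, a metadata filter is just an equality check applied to each chunk's metadata before or after the vector search. The sketch below shows that check on plain dictionaries; the file names, page numbers, and the `filter_by_metadata` helper are hypothetical and only mimic the shape of a document with `page_content` and `metadata`.

```python
# Hypothetical chunks shaped like documents with content and metadata.
chunks = [
    {"page_content": "Linear regression fits a line.",
     "metadata": {"source": "lecture02.pdf", "page": 3}},
    {"page_content": "Locally weighted regression.",
     "metadata": {"source": "lecture03.pdf", "page": 1}},
    {"page_content": "Logistic regression for classification.",
     "metadata": {"source": "lecture03.pdf", "page": 7}},
]

def filter_by_metadata(chunks, criteria):
    """Keep only chunks whose metadata matches every key/value in criteria."""
    return [
        chunk for chunk in chunks
        if all(chunk["metadata"].get(key) == value for key, value in criteria.items())
    ]

lecture3 = filter_by_metadata(chunks, {"source": "lecture03.pdf"})
page7 = filter_by_metadata(chunks, {"source": "lecture03.pdf", "page": 7})
print(len(lecture3), page7[0]["page_content"])
```

Because the match is exact, the filter value has to be identical to what was stored, including any directory prefix in a `source` path.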

In [20]:
question = "what did they say about regression in the third lecture?"

In [21]:
docs = vectordb.similarity_search(
    question,
    k=3,
    # The value must match the path that was passed to PyPDFLoader exactly
    filter={"source": "./content/MachineLearning-Lecture03.pdf"}
)

In [22]:
for d in docs:
    print(d.metadata)

## **Addressing Specificity: working with metadata using self-query retriever**

But we have an interesting challenge: we often want to infer the metadata from the query itself.

To address this, we can use `SelfQueryRetriever`, which uses an LLM to extract:

1. The `query` string to use for vector search
2. A metadata filter to pass in as well

Most vector databases support metadata filters, so this doesn't require any new databases or indexes.
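
The two outputs listed above can be pictured with a rule-based stand-in for the LLM. The sketch below is purely illustrative: `toy_self_query`, the ordinal table, and the `lectureNN.pdf` naming are hypothetical, and a real self-query retriever relies on a language model and the attribute descriptions rather than a regular expression.

```python
import re

ORDINALS = {"first": 1, "second": 2, "third": 3}

def toy_self_query(question):
    """Split a question into (search query, metadata filter) using a simple pattern."""
    match = re.search(r"\b(?:in|from) the (first|second|third) lecture\b", question.lower())
    if not match:
        return question, {}
    lecture_number = ORDINALS[match.group(1)]
    metadata_filter = {"source": f"lecture{lecture_number:02d}.pdf"}  # hypothetical naming
    # Remove the lecture reference so only the topical part is embedded.
    query = question[:match.start()] + question[match.end():]
    query = re.sub(r"\s+", " ", query).strip()
    query = re.sub(r"\s+([?.!,])", r"\1", query)
    return query, metadata_filter

query, metadata_filter = toy_self_query(
    "what did they say about regression in the third lecture?"
)
print(query)
print(metadata_filter)
```

The query string goes to the vector search and the filter goes to the store's metadata filtering, which is why no extra index is needed.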

In [23]:
from langchain_openai import AzureChatOpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

In [24]:
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The lecture the chunk is from, should be one of `./content/MachineLearning-Lecture01.pdf`, `./content/MachineLearning-Lecture02.pdf`, or `./content/MachineLearning-Lecture03.pdf`",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page from the lecture",
        type="integer",
    ),
]

In [25]:
document_content_description = "Lecture notes"
llm = AzureChatOpenAI(deployment_name="gpt-4o")
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectordb,
    document_content_description,
    metadata_field_info,
    verbose=True
)

In [26]:
question = "what did they say about regression in the third lecture?"

**You will receive a warning** about `predict_and_parse` being deprecated the first time you execute the next line. This can be safely ignored.

In [29]:
docs = retriever.invoke(question)

In [30]:
for d in docs:
    print(d.metadata)

## **Additional tricks: compression**

Another approach for improving the quality of retrieved docs is compression.

Information most relevant to a query may be buried in a document with a lot of irrelevant text.

Passing that full document through your application can lead to more expensive LLM calls and poorer responses.

Contextual compression is meant to fix this.
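
To show the shape of the idea without calling an LLM, the sketch below keeps only the sentences of a document that share a non-trivial word with the query. The keyword-overlap rule, the stop-word list, and the `compress` helper are stand-ins invented for this example; an LLM-based compressor decides relevance with a model instead.

```python
import re

# Small hypothetical stop-word list so common words do not count as matches.
STOPWORDS = {"what", "did", "they", "say", "about", "the", "a", "in", "or", "on"}

def compress(document, query):
    """Return only the sentences of document that mention a query keyword."""
    query_terms = {
        word for word in re.findall(r"[a-z]+", query.lower()) if word not in STOPWORDS
    }
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    kept = [
        sentence for sentence in sentences
        if query_terms & set(re.findall(r"[a-z]+", sentence.lower()))
    ]
    return " ".join(kept)

document = (
    "The class meets on Mondays. "
    "Homework will be done in MATLAB or Octave. "
    "Bring a laptop."
)
compressed = compress(document, "what did they say about matlab?")
print(compressed)
```

Only the middle sentence survives, so a downstream LLM call would receive one sentence instead of three.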

In [31]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

In [32]:
def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))


In [33]:
# Build a compressor that uses the LLM to extract only the query-relevant text
llm = AzureChatOpenAI(deployment_name="gpt-4o")
compressor = LLMChainExtractor.from_llm(llm)

In [34]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)

In [35]:
question = "what did they say about matlab?"
compressed_docs = compression_retriever.invoke(question)
pretty_print_docs(compressed_docs)

Document 1:

those homeworks will be done in either MATLAB or in Octave, which is sort of — I  
know some people call it a free version of MATLAB, which it sort of is, sort of isn't.  
So I guess for those of you that haven't seen MATLAB before, and I know most of you  
have, MATLAB is I guess part of the programming language that makes it very easy to  
write codes using matrices, to write code for numerical routines, to move data around, to  
plot data. And it's sort of an extremely easy to learn tool to use for implementing a lot of  
learning algorithms.  
And in case some of you want to work on your own home computer or something if you  
don't have a MATLAB license, for the purposes of this class, there's also — [inaudible]  
write that down [inaudible] MATLAB — there's also a software package called Octave  
that you can download for free off the Internet. And it has somewhat fewer features than  
MATLAB, but it's free, and for the purposes of this class, it will work for just a

# **Combining various techniques**


In [None]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type="mmr")
)

In [36]:
question = "what did they say about matlab?"
compressed_docs = compression_retriever.invoke(question)
pretty_print_docs(compressed_docs)

Document 1:

those homeworks will be done in either MATLAB or in Octave, which is sort of — I  
know some people call it a free version of MATLAB, which it sort of is, sort of isn't.  
So I guess for those of you that haven't seen MATLAB before, and I know most of you  
have, MATLAB is I guess part of the programming language that makes it very easy to  
write codes using matrices, to write code for numerical routines, to move data around, to  
plot data. And it's sort of an extremely easy to learn tool to use for implementing a lot of  
learning algorithms.  
And in case some of you want to work on your own home computer or something if you  
don't have a MATLAB license, for the purposes of this class, there's also — [inaudible]  
write that down [inaudible] MATLAB — there' s also a software package called Octave  
that you can download for free off the Internet. And it has somewhat fewer features than  
MATLAB, but it's free, and for the purposes of this class, it will work for just 

# **Other types of retrieval**

It's worth noting that a vector database is not the only tool for retrieving documents.

The `LangChain` retriever abstraction includes other ways to retrieve documents, such as TF-IDF or SVM.
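
For intuition about the TF-IDF option, here is a simplified scoring sketch that needs no embeddings: each document is scored by summing term frequency times a smoothed inverse document frequency for every query term. The helper names and sample texts are made up, and the formula is a textbook variant rather than the exact computation a library retriever performs.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def tfidf_rank(query, docs):
    """Return document indices ordered from highest to lowest TF-IDF score."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    def idf(term):
        # Smoothed inverse document frequency; rarer terms weigh more.
        return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        length = len(tokens) or 1  # avoid division by zero for empty documents
        scores.append(sum((counts[t] / length) * idf(t) for t in tokenize(query)))
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)

docs = [
    "homework is done in matlab or octave",
    "the class home page lists homework",
    "lecture notes cover the math",
]
ranking = tfidf_rank("matlab homework", docs)
print(ranking)
```

The first document matches both query terms, including the rarer "matlab", so it ranks first. Unlike embedding search, a document with no shared words scores zero.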

In [37]:
# Both retrievers depend on scikit-learn (pip install scikit-learn)
from langchain_community.retrievers import SVMRetriever, TFIDFRetriever
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [38]:
# Load PDF
loader = PyPDFLoader("/content/MachineLearning-Lecture01.pdf")
pages = loader.load()
all_page_text = [p.page_content for p in pages]
joined_page_text = " ".join(all_page_text)

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)
splits = text_splitter.split_text(joined_page_text)


In [39]:
# Retrieve
svm_retriever = SVMRetriever.from_texts(splits,embedding)
tfidf_retriever = TFIDFRetriever.from_texts(splits)

In [41]:
question = "What are major topics for this class?"
docs_svm=svm_retriever.invoke(question)
docs_svm[0]

Document(metadata={}, page_content="Testing, testing. Okay, cool. Thanks. So all right, online resources. The class has a home page, so it's in on the handouts. I \nwon't write on the chalkboard — http:// cs229.stanford.edu. And so when there are \nhomework assignments or things like that, we usually won't sort of — in the mission of \nsaving trees, we will usually not give out many handouts in class. So homework \nassignments, homework solutions will be posted online at the course home page.  \nAs far as this class, I've also written, and I guess I've also revised every year a set of \nfairly detailed lecture notes that cover the technical content of this class. And so if you \nvisit the course homepage, you'll also find the detailed lecture notes that go over in detail \nall the math and equations and so on that I'll be doing in class.  \nThere's also a newsgroup, su.class.cs229, also written on the handout. This is a \nnewsgroup that's sort of a forum for people in the class to get 

In [42]:
question = "what did they say about matlab?"
docs_tfidf=tfidf_retriever.invoke(question)
docs_tfidf[0]

Document(metadata={}, page_content="yourselves. You can also come and talk to me or the TAs if you want to brainstorm ideas \nwith us.  \nOkay. So one more organizational question. I'm curious, how many of you know \nMATLAB? Wow, cool, quite a lot. Okay. So as part of the — act ually how many of you \nknow Octave or have used Octave? Oh, okay, much smaller number.  \nSo as part of this class, especially in the homeworks, we'll ask you to implement a few \nprograms, a few machine learning algorithms as part of the homeworks. And most of those homeworks will be done in either MATLAB or in Octave, which is sort of — I \nknow some people call it a free version of MATLAB, which it sort of is, sort of isn't.  \nSo I guess for those of you that haven't seen MATLAB before, and I know most of you \nhave, MATLAB is I guess part of the programming language that makes it very easy to \nwrite codes using matrices, to write code for numerical routines, to move data around, to \nplot data. And it's s

# **Let's Do an Activity**

## **Objective**

Explore different retrieval techniques in LangChain to fetch relevant documents based on queries.

## **Scenario**

You are building a document retrieval system for a research library. This activity will help you understand and implement various retrieval methods to efficiently fetch documents related to specific topics or queries.

## **Steps**

* Load Documents
* Embedding and Vector Store
* Similarity Search
* Maximum Marginal Relevance (MMR)
* Metadata Filtering
* Alternative Retrieval Methods