# Retrieval

## Import Libraries

In [1]:
import openai
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain.retrievers import SVMRetriever, TFIDFRetriever
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

## Setting LLM and Embeddings

In [2]:
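# Read the key and strip the trailing newline so it is not sent as part of the key
with open('../api_key.txt') as f:
    api_key = f.read().strip()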

In [3]:
embeddings = OpenAIEmbeddings(openai_api_key=api_key)

In [4]:
llm = OpenAI(openai_api_key=api_key, temperature=0)

## Retrieval

In [5]:
persist_directory = "db/chroma/"

In [6]:
vector_db = Chroma(
    embedding_function=embeddings,
    persist_directory=persist_directory
)

In [7]:
print(vector_db._collection.count())

209


### Maximum Marginal Relevance

Maximum marginal relevance (MMR) selects results that are both relevant to the query and diverse among themselves.
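To make the trade-off concrete, here is a minimal sketch of greedy MMR selection in plain Python. It is not LangChain's or Chroma's implementation: the function names, the toy 2-D vectors, and the scoring formula (`lambda_mult * relevance - (1 - lambda_mult) * redundancy`) are assumptions chosen for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k document indices, trading relevance against redundancy.

    lambda_mult=1.0 ranks purely by relevance; lower values reward diversity.
    """
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy "embeddings" (hypothetical): docs 0 and 1 are near-duplicates, doc 2 differs.
query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.12], [0.6, -0.8]]

print(mmr_select(query, docs, k=2, lambda_mult=1.0))  # pure relevance -> [0, 1]
print(mmr_select(query, docs, k=2))                   # with diversity -> [0, 2]
```

With pure relevance the two near-duplicates are returned together; with the diversity penalty the second slot goes to the less similar document. In the cells below, `fetch_k` plays the role of limiting how many candidates are considered before this re-ranking step.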

In [8]:
texts = [
    """The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).""",
    """A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.""",
    """A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.""",
]

In [9]:
small_db = Chroma.from_texts(
    texts=texts,
    embedding=embeddings
)

In [10]:
question = "Tell me about all-white mushrooms with large fruiting bodies"

In [11]:
small_db.similarity_search(
    query=question, 
    k=2
)

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.', metadata={}),
 Document(page_content='The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).', metadata={})]

In [12]:
small_db.max_marginal_relevance_search(
    query=question, 
    k=2, 
    fetch_k=3
)

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.', metadata={}),
 Document(page_content='A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.', metadata={})]

#### Addressing Diversity

In the VectorStores and Embeddings notebook we ran into one problem: how to enforce diversity in the search results.

In [13]:
question = "What did they say about Matlab?"

**Using Similarity Search**

In [14]:
similarity_search_document = vector_db.similarity_search(
    query=question, 
    k=2
)

In [15]:
similarity_search_document[0].page_content[0:100]

'those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people '

In [16]:
similarity_search_document[1].page_content[0:100]

'those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people '

**Using Maximum Marginal Relevance**

Note the difference when using MMR: the second result is no longer a duplicate of the first.

In [17]:
maximum_marginal_relevance_document = vector_db.max_marginal_relevance_search(
    query=question,
    k=2,
)

In [18]:
maximum_marginal_relevance_document[0].page_content[0:100]

'those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people '

In [19]:
maximum_marginal_relevance_document[1].page_content[0:100]

"many biologers are there here? Wow, just a few, not many. I'm surprised. Anyone from \nstatistics? Ok"

### Self Query Retriever

In [20]:
question = "what did they say about regression in the third lecture?"

In [21]:
docs = vector_db.similarity_search(
    query=question,
    k=3,
    filter={"source":"documents/MachineLearning-Lecture03.pdf"}
)

In [22]:
for doc in docs:
    print(doc.metadata)

{'page': 0, 'source': 'documents/MachineLearning-Lecture03.pdf'}
{'page': 14, 'source': 'documents/MachineLearning-Lecture03.pdf'}
{'page': 4, 'source': 'documents/MachineLearning-Lecture03.pdf'}


**Addressing Specificity: working with metadata using self-query retriever**

The search above needed a hand-written metadata filter. To avoid that, we can use SelfQueryRetriever, which uses an LLM to extract from the question itself:

1. The query string to use for vector search
2. A metadata filter to pass in as well

Most vector databases support metadata filters, so this doesn't require any new databases or indexes.
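The two extracted pieces can be illustrated with a toy stand-in for the LLM step. The function below is purely hypothetical: it uses a regular expression instead of a language model, and only understands phrases like "in the third lecture". It is meant to show the shape of the result (a search string plus a metadata filter), not how SelfQueryRetriever works internally.

```python
import re

ORDINALS = {"first": 1, "second": 2, "third": 3}
LECTURE_PATTERN = re.compile(r"\s*\b(?:in|from) the (first|second|third) lecture\b")

def toy_query_constructor(question):
    """Hypothetical stand-in for the LLM: split a question into
    (search_text, metadata_filter). Returns a filter of None if no
    lecture is mentioned."""
    match = LECTURE_PATTERN.search(question)
    if match is None:
        return question, None
    lecture_number = ORDINALS[match.group(1)]
    # Remove the lecture phrase so it does not influence the vector search.
    search_text = (question[:match.start()] + question[match.end():]).strip()
    metadata_filter = {
        "source": f"documents/MachineLearning-Lecture{lecture_number:02d}.pdf"
    }
    return search_text, metadata_filter

search_text, metadata_filter = toy_query_constructor(
    "what did they say about regression in the third lecture?"
)
print(search_text)      # what did they say about regression?
print(metadata_filter)  # {'source': 'documents/MachineLearning-Lecture03.pdf'}
```

The resulting pair could then be passed to the same call used earlier, for example `vector_db.similarity_search(query=search_text, k=3, filter=metadata_filter)`.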

In [23]:
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The lecture the chunk is from, should be one of `documents/MachineLearning-Lecture01.pdf`, `documents/MachineLearning-Lecture02.pdf`, or `documents/MachineLearning-Lecture03.pdf`",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page from the lecture",
        type="integer",
    ),
]

In [24]:
document_content_description = "Lecture notes"
retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vector_db,
    document_contents=document_content_description,
    metadata_field_info=metadata_field_info,
    verbose=True
)

In [25]:
docs = retriever.get_relevant_documents(query=question)



query='regression' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='source', value='documents/MachineLearning-Lecture03.pdf') limit=None


In [26]:
for doc in docs:
    print(doc.metadata)

{'page': 14, 'source': 'documents/MachineLearning-Lecture03.pdf'}
{'page': 0, 'source': 'documents/MachineLearning-Lecture03.pdf'}
{'page': 10, 'source': 'documents/MachineLearning-Lecture03.pdf'}
{'page': 10, 'source': 'documents/MachineLearning-Lecture03.pdf'}


### Contextual Compression

In [27]:
def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))

In [28]:
compressor = LLMChainExtractor.from_llm(llm=llm)

In [29]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vector_db.as_retriever()
)

In [30]:
question = "What did they say about Matlab?"

In [31]:
compressed_docs = compression_retriever.get_relevant_documents(query=question)

In [32]:
pretty_print_docs(compressed_docs)

Document 1:

"MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to plot data. And it's sort of an extremely easy to learn tool to use for implementing a lot of learning algorithms."
----------------------------------------------------------------------------------------------------
Document 2:

"MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to plot data. And it's sort of an extremely easy to learn tool to use for implementing a lot of learning algorithms."
----------------------------------------------------------------------------------------------------
Document 3:

"And the student said, "Oh, it was the MATLAB." So for those of you that don't know MATLAB yet, I hope you do learn it. It's not hard, and we'll actually have a short MATLAB tutorial in one o

### Combining Various Techniques

In [33]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vector_db.as_retriever(search_type='mmr')
)

In [34]:
compressed_docs = compression_retriever.get_relevant_documents(query=question)
pretty_print_docs(compressed_docs)

Document 1:

"MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to plot data. And it's sort of an extremely easy to learn tool to use for implementing a lot of learning algorithms."
----------------------------------------------------------------------------------------------------
Document 2:

"And the student said, "Oh, it was the MATLAB." So for those of you that don't know MATLAB yet, I hope you do learn it. It's not hard, and we'll actually have a short MATLAB tutorial in one of the discussion sections for those of you that don't know it."
----------------------------------------------------------------------------------------------------
Document 3:

"So what you just saw was an example, again, of supervised learning, and in particular it was an example of what they call the regression problem, because the vehicle is trying to predict a continuous value variables of 

### Other Types of Retrievals

It's worth noting that a vector database is not the only kind of tool for retrieving documents.

The LangChain retriever abstraction includes other ways to retrieve documents, such as TF-IDF or SVM.
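As a rough illustration of how a purely lexical retriever differs from embedding search, the sketch below ranks texts by TF-IDF cosine similarity in plain Python. It is not LangChain's `TFIDFRetriever` implementation; the function names and toy texts are made up, whitespace tokenization is a simplification, and the smoothed IDF formula is just one common variant.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberate simplification)."""
    return text.lower().split()

def tfidf_rank(query, texts):
    """Return indices of texts, best match first, by TF-IDF cosine similarity."""
    doc_counts = [Counter(tokenize(t)) for t in texts]
    n_docs = len(doc_counts)
    # Document frequency: in how many texts does each term appear?
    df = Counter(term for counts in doc_counts for term in counts)
    idf = {term: math.log((1 + n_docs) / (1 + freq)) + 1 for term, freq in df.items()}

    def vectorize(counts):
        # Terms never seen in the corpus are dropped.
        return {term: tf * idf[term] for term, tf in counts.items() if term in idf}

    def cosine(a, b):
        dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    query_vec = vectorize(Counter(tokenize(query)))
    scores = [cosine(query_vec, vectorize(counts)) for counts in doc_counts]
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)

# Hypothetical toy corpus.
toy_texts = [
    "matlab makes matrix code easy",
    "octave is a free alternative",
    "the course covers supervised learning",
]
print(tfidf_rank("what about matlab", toy_texts)[0])  # 0
```

Because the score depends only on shared terms, a text that paraphrases the query without reusing its words scores zero here, which is the main contrast with embedding-based retrieval.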

In [35]:
loader = PyPDFLoader('documents/MachineLearning-Lecture01.pdf')

In [36]:
pages = loader.load()

In [37]:
all_pages_text = [page.page_content for page in pages]

In [38]:
joined_page_text = " ".join(all_pages_text)

In [39]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=150
)
splits = text_splitter.split_text(joined_page_text)

**SVM Retriever**

In [40]:
svm_retriever = SVMRetriever.from_texts(
    texts=splits,
    embeddings=embeddings
)

In [41]:
question = "What did they say about Matlab?"

In [42]:
svm_retriever_document = svm_retriever.get_relevant_documents(query=question)
svm_retriever_document[0]



Document(page_content='don\'t have a MATLAB license, for the purposes of  this class, there\'s also — [inaudible] \nwrite that down [inaudible] MATLAB — there\' s also a software package called Octave \nthat you can download for free off the Internet. And it has somewhat fewer features than MATLAB, but it\'s free, and for the purposes of  this class, it will work for just about \neverything.  \nSo actually I, well, so yeah, just a side comment for those of you that haven\'t seen \nMATLAB before I guess, once a colleague of mine at a different university, not at \nStanford, actually teaches another machine l earning course. He\'s taught it for many years. \nSo one day, he was in his office, and an old student of his from, lik e, ten years ago came \ninto his office and he said, "Oh, professo r, professor, thank you so much for your \nmachine learning class. I learned so much from it. There\'s this stuff that I learned in your \nclass, and I now use every day. And it\'s help ed me make l

**TF-IDF Retriever**

In [43]:
tfidf_retriever = TFIDFRetriever.from_texts(texts=splits)

In [44]:
question = "What did they say about Matlab?"

In [45]:
tfidf_retriever_document = tfidf_retriever.get_relevant_documents(question)
tfidf_retriever_document[0]

Document(page_content="Saxena and Min Sun here did, wh ich is given an image like this, right? This is actually a \npicture taken of the Stanford campus. You can apply that sort of cl ustering algorithm and \ngroup the picture into regions. Let me actually blow that up so that you can see it more \nclearly. Okay. So in the middle, you see the lines sort of groupi ng the image together, \ngrouping the image into [inaudible] regions.  \nAnd what Ashutosh and Min did was they then  applied the learning algorithm to say can \nwe take this clustering and us e it to build a 3D model of the world? And so using the \nclustering, they then had a lear ning algorithm try to learn what the 3D structure of the \nworld looks like so that they could come up with a 3D model that you can sort of fly \nthrough, okay? Although many people used to th ink it's not possible to take a single \nimage and build a 3D model, but using a lear ning algorithm and that sort of clustering \nalgorithm is the first ste