## Load Document $\rightarrow$ Split Document $\rightarrow$ Storage $\rightarrow$ **Retrieval** $\rightarrow$ Storage Output

![Steps](image.jpg)

In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
from pprint import pprint
import os
import openai
from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv())  # read local .env file

openai.api_key = os.environ['OPENAI_API_KEY']

### Similarity search
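
Similarity search ranks stored chunks by how close their embedding vectors are to the embedding of the query. The sketch below illustrates the idea on hand-made 2-D vectors. It assumes cosine similarity; the metric a real vector store uses depends on its configuration. `cosine_similarity`, `top_k`, and `toy_vectors` are hypothetical names used only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy "embeddings" (assumption: real embeddings have many more dimensions).
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]]
nearest = top_k([1.0, 0.1], toy_vectors, k=2)
print(nearest)
```

Because ranking uses relevance alone, near-duplicate chunks can occupy several of the top slots.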

In [2]:
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
persist_directory = 'docs/chroma/'

In [3]:
embedding = OpenAIEmbeddings()
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)

print(vectordb._collection.count())

209


In [5]:
vectordb.similarity_search('hello world')

[Document(page_content="middle of class, but because there won't be video you can safely sit there and make faces \nat me, and that won't show, okay?  \nLet's see. I also handed out this — ther e were two handouts I hope most of you have, \ncourse information handout. So let me just sa y a few words about parts of these. On the \nthird page, there's a section that says Online Resources.  \nOh, okay. Louder? Actually, could you turn up the volume? Testing. Is this better? \nTesting, testing. Okay, cool. Thanks.", metadata={'page': 4, 'source': 'docs/css_lectures\\MachineLearning-Lecture01.pdf'}),
 Document(page_content="middle of class, but because there won't be video you can safely sit there and make faces \nat me, and that won't show, okay?  \nLet's see. I also handed out this — ther e were two handouts I hope most of you have, \ncourse information handout. So let me just sa y a few words about parts of these. On the \nthird page, there's a section that says Online Resources.  \nOh, 

In [5]:
texts = [
    """The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).""",
    """A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.""",
    """A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.""",
]

In [6]:
smalldb = Chroma.from_texts(texts, embedding=embedding)

In [7]:
question = "Tell me about all-white mushrooms with large fruiting bodies"

In [8]:
smalldb.similarity_search(question, k=2)

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.', metadata={}),
 Document(page_content='The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).', metadata={})]

In [9]:
smalldb.max_marginal_relevance_search(question, k=2, fetch_k=3)

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.', metadata={}),
 Document(page_content='A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.', metadata={})]

### Addressing Diversity: Maximum marginal relevance.
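
Maximum marginal relevance trades relevance to the query against redundancy with chunks that have already been picked. The following is a minimal sketch of that greedy selection on toy vectors, not the library's implementation. `mmr_select`, `lambda_mult` (as used here), and the vectors are illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k indices, balancing relevance against redundancy.

    lambda_mult=1.0 ranks by relevance only; smaller values favour diversity.
    """
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected vector (0 if none yet).
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


# Vectors 0 and 1 are near-duplicates; vector 2 is less relevant but different.
toy_query = [1.0, 0.0]
toy_vecs = [[0.9, 0.1], [0.89, 0.12], [0.5, 0.8]]
diverse = mmr_select(toy_query, toy_vecs, k=2, lambda_mult=0.3)
relevance_only = mmr_select(toy_query, toy_vecs, k=2, lambda_mult=1.0)
print(diverse, relevance_only)
```

With a low `lambda_mult` the second pick skips the near-duplicate in favour of the less similar vector, which mirrors what the small mushroom example shows.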


In [None]:
question = "what did they say about matlab?"
docs_ss = vectordb.similarity_search(question, k=3)

In [55]:
docs_ss[0].page_content[:100]

NameError: name 'docs_ss' is not defined

In [12]:
docs_ss[1].page_content[:100]

'those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people '

### With MMR

In [12]:
docs_mmr = vectordb.max_marginal_relevance_search(question, k=3)

In [13]:
docs_mmr[0].page_content[:100]

'Andrew puts on his wig also causes a little part searching. After training on lots of \nexamples, it’'

In [14]:
docs_mmr[1].page_content[:100]

'literature on debating what point – exactly what function to us e. This, sort of, exponential \ndecay'

#### Addressing Specificity: working with metadata 

In the last lecture, we showed that a question about the third lecture can return results from other lectures as well.

To address this, many vectorstores support operations on `metadata`.

`metadata` provides context for each embedded chunk.
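
Conceptually, a metadata filter narrows the candidate set before ranking. The sketch below shows that filter-then-rank idea on plain dictionaries. `filtered_search`, `keyword_score`, and the toy chunks are hypothetical stand-ins, and keyword counting replaces embedding similarity purely to keep the example self-contained.

```python
def filtered_search(chunks, score_fn, metadata_filter, k=3):
    """Keep chunks whose metadata matches every filter key, then rank them."""
    matching = [
        chunk
        for chunk in chunks
        if all(
            chunk["metadata"].get(key) == value
            for key, value in metadata_filter.items()
        )
    ]
    return sorted(matching, key=score_fn, reverse=True)[:k]


# Toy chunks (assumption: real chunks come from the split lecture PDFs).
toy_chunks = [
    {"text": "linear regression intro", "metadata": {"source": "lecture01.pdf", "page": 8}},
    {"text": "logistic regression", "metadata": {"source": "lecture03.pdf", "page": 10}},
    {"text": "course logistics", "metadata": {"source": "lecture03.pdf", "page": 0}},
]


def keyword_score(chunk):
    """Stand-in relevance score: count occurrences of the word 'regression'."""
    return chunk["text"].count("regression")


hits = filtered_search(toy_chunks, keyword_score, {"source": "lecture03.pdf"}, k=2)
print([hit["metadata"] for hit in hits])
```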

In [15]:
question = "what did they say about regression in the third lecture?"

In [16]:
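docs = vectordb.similarity_search(
    question,
    k=3,
    # Only chunks whose `source` metadata equals this path are considered.
    # The path here points at Lecture01, so every hit comes from that file.
    filter={'source': "docs/css_lectures\\MachineLearning-Lecture01.pdf"}
)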

In [17]:
for d in docs:
    print(d.metadata)

{'page': 8, 'source': 'docs/css_lectures\\MachineLearning-Lecture01.pdf'}
{'page': 8, 'source': 'docs/css_lectures\\MachineLearning-Lecture01.pdf'}
{'page': 8, 'source': 'docs/css_lectures\\MachineLearning-Lecture01.pdf'}


In [19]:
print(docs)

[Document(page_content='into his office and he said, "Oh, professo r, professor, thank you so much for your \nmachine learning class. I learned so much from it. There\'s this stuff that I learned in your \nclass, and I now use every day. And it\'s help ed me make lots of money, and here\'s a \npicture of my big house."  \nSo my friend was very excited. He said, "W ow. That\'s great. I\'m glad to hear this \nmachine learning stuff was actually useful. So what was it that you learned? Was it \nlogistic regression? Was it the PCA? Was it the data ne tworks? What was it that you \nlearned that was so helpful?" And the student said, "Oh, it was the MATLAB."  \nSo for those of you that don\'t know MATLAB yet, I hope you do learn it. It\'s not hard, \nand we\'ll actually have a short MATLAB tutori al in one of the discussion sections for \nthose of you that don\'t know it.  \nOkay. The very last piece of logistical th ing is the discussion s ections. So discussion \nsections will be taught by

#### Addressing Specificity: working with metadata using self-query retriever

We can use `SelfQueryRetriever`, which uses an LLM to extract:
 
1. The `query` string to use for vector search
2. A metadata filter to pass in as well

Most vector databases support metadata filters, so this doesn't require any new databases or indexes.

In [20]:
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

In [21]:
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The lecture the chunk is from, should be one of `docs/css_lectures\\MachineLearning-Lecture01.pdf`,`docs/css_lectures\\MachineLearning-Lecture02.pdf`, `docs/css_lectures\\MachineLearning-Lecture03.pdf`",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page from the lecture",
        type="integer",
    ),
]

In [22]:
document_content_description = "Lecture notes"
llm = OpenAI(temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectordb,
    document_content_description,
    metadata_field_info,
    verbose=True
)

In [23]:
question = "what did they say about regression in the third lecture?"

In [24]:
docs = retriever.get_relevant_documents(question)



query='regression' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='source', value='docs/css_lectures\\MachineLearning-Lecture03.pdf') limit=None


In [25]:
for d in docs:
    print(d.metadata)

{'page': 14, 'source': 'docs/css_lectures\\MachineLearning-Lecture03.pdf'}
{'page': 0, 'source': 'docs/css_lectures\\MachineLearning-Lecture03.pdf'}
{'page': 10, 'source': 'docs/css_lectures\\MachineLearning-Lecture03.pdf'}
{'page': 10, 'source': 'docs/css_lectures\\MachineLearning-Lecture03.pdf'}


In [27]:
print(docs[0].page_content[:200])

Student: It’s the lowest it –  
Instructor (Andrew Ng) :No, exactly. Right. So zero to the same, this is not the same, 
right? And the reason is, in logi stic regression this is diffe rent from before


In [29]:
question = "what did they say about regression in the third lecture on page ten?"

In [30]:
docs = retriever.get_relevant_documents(question)

query='regression' filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='source', value='docs/css_lectures\\MachineLearning-Lecture03.pdf'), Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='page', value=10)]) limit=None


In [31]:
for d in docs:
    print(d.metadata)

{'page': 10, 'source': 'docs/css_lectures\\MachineLearning-Lecture03.pdf'}
{'page': 10, 'source': 'docs/css_lectures\\MachineLearning-Lecture03.pdf'}
{'page': 10, 'source': 'docs/css_lectures\\MachineLearning-Lecture03.pdf'}


In [32]:
print(docs[0].page_content[:200])

Instructor (Andrew Ng) :Yeah, yeah. I mean, you’re asking about overfitting, whether 
this is a good model. I thi nk let’s – the thing’s you’re mentioning are maybe deeper 
questions about learning al


In [33]:
question = "what did they say about regression in the third lecture between pages one and ten?"

In [34]:
docs = retriever.get_relevant_documents(question)

query='regression' filter=Operation(operator=<Operator.AND: 'and'>, arguments=[Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='source', value='docs/css_lectures\\MachineLearning-Lecture03.pdf'), Comparison(comparator=<Comparator.GTE: 'gte'>, attribute='page', value=1), Comparison(comparator=<Comparator.LTE: 'lte'>, attribute='page', value=10)]) limit=None


In [35]:
for d in docs:
    print(d.metadata)

{'page': 10, 'source': 'docs/css_lectures\\MachineLearning-Lecture03.pdf'}
{'page': 10, 'source': 'docs/css_lectures\\MachineLearning-Lecture03.pdf'}
{'page': 2, 'source': 'docs/css_lectures\\MachineLearning-Lecture03.pdf'}
{'page': 1, 'source': 'docs/css_lectures\\MachineLearning-Lecture03.pdf'}
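
The structured queries logged above are trees of comparisons joined by logical operators. The sketch below shows one way such a tree could be evaluated against a chunk's metadata. The `Comparison` and `Operation` dataclasses here are local stand-ins that only mirror the names in the logged output. They are not the library's classes, and `matches` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Any, List, Union


@dataclass
class Comparison:
    comparator: str  # one of "eq", "gte", "lte" in this sketch
    attribute: str
    value: Any


@dataclass
class Operation:
    operator: str  # "and" or "or" in this sketch
    arguments: List[Union["Comparison", "Operation"]]


def matches(metadata, node):
    """Return True if the metadata dict satisfies the filter tree."""
    if isinstance(node, Comparison):
        actual = metadata.get(node.attribute)
        if actual is None:
            return False
        if node.comparator == "eq":
            return actual == node.value
        if node.comparator == "gte":
            return actual >= node.value
        if node.comparator == "lte":
            return actual <= node.value
        raise ValueError(f"unsupported comparator: {node.comparator}")
    results = [matches(metadata, arg) for arg in node.arguments]
    if node.operator == "and":
        return all(results)
    if node.operator == "or":
        return any(results)
    raise ValueError(f"unsupported operator: {node.operator}")


# Same shape as the "third lecture between pages one and ten" filter,
# with a shortened, made-up source name.
page_range_filter = Operation("and", [
    Comparison("eq", "source", "lecture03.pdf"),
    Comparison("gte", "page", 1),
    Comparison("lte", "page", 10),
])
toy_metadata = [
    {"source": "lecture03.pdf", "page": 0},
    {"source": "lecture03.pdf", "page": 10},
    {"source": "lecture01.pdf", "page": 8},
]
kept = [m for m in toy_metadata if matches(m, page_range_filter)]
print(kept)
```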


#### Compression

An additional trick for improving the quality of retrieved documents is compression.

Information most relevant to a query may be buried in a document with a lot of irrelevant text.

Passing that full document through your application can lead to more expensive LLM calls and poorer responses.

Contextual compression is meant to fix this.
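
To make the idea concrete, here is a minimal sketch that keeps only the sentences of a document that share a term with the query. This swaps in plain keyword overlap for the LLM-based extraction used in this notebook, so it only illustrates the shape of the operation. `compress`, `STOPWORDS`, and the sample text are assumptions for illustration.

```python
import re

# Small hand-picked stopword list (assumption, not from any library).
STOPWORDS = frozenset({"what", "did", "they", "say", "about", "the", "a"})


def compress(document, query):
    """Return only the sentences that mention at least one query term."""
    query_terms = {
        word for word in re.findall(r"\w+", query.lower())
        if word not in STOPWORDS
    }
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    kept = [
        sentence for sentence in sentences
        if query_terms & set(re.findall(r"\w+", sentence.lower()))
    ]
    return " ".join(kept)


sample_doc = (
    "Homework is due Friday. "
    "MATLAB makes it easy to work with matrices. "
    "Office hours are on Monday."
)
compressed = compress(sample_doc, "what did they say about matlab?")
print(compressed)
```

Only the compressed text would then be passed on to the LLM, which is where the savings in tokens come from.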

In [37]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

In [38]:
def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))


In [39]:
# Wrap our vectorstore
llm = OpenAI(temperature=0)
compressor = LLMChainExtractor.from_llm(llm)

In [40]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)

In [41]:
question = "what did they say about matlab?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)

Retrying langchain.llms.openai.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised RateLimitError: Rate limit reached for default-text-davinci-003 in organization org-OKEaxoW73Lo9pocIXi1wDApB on requests per min. Limit: 3 / min. Please try again in 20s. Contact us through our help center at help.openai.com if you continue to have issues. Please add a payment method to your account to increase your rate limit. Visit https://platform.openai.com/account/billing to add a payment method..

Document 1:

"MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to plot data. And it's sort of an extremely easy to learn tool to use for implementing a lot of learning algorithms."
----------------------------------------------------------------------------------------------------
Document 2:

"MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to plot data. And it's sort of an extremely easy to learn tool to use for implementing a lot of learning algorithms."
----------------------------------------------------------------------------------------------------
Document 3:

"And the student said, "Oh, it was the MATLAB." So for those of you that don't know MATLAB yet, I hope you do learn it. It's not hard, and we'll actually have a short MATLAB tutorial in one o

#### Combining various techniques

In [42]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type="mmr")
)

In [43]:
question = "what did they say about matlab?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)

Retrying langchain.llms.openai.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised RateLimitError: Rate limit reached for default-text-davinci-003 in organization org-OKEaxoW73Lo9pocIXi1wDApB on requests per min. Limit: 3 / min. Please try again in 20s. Contact us through our help center at help.openai.com if you continue to have issues. Please add a payment method to your account to increase your rate limit. Visit https://platform.openai.com/account/billing to add a payment method..

Document 1:

"MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to plot data. And it's sort of an extremely easy to learn tool to use for implementing a lot of learning algorithms."
----------------------------------------------------------------------------------------------------
Document 2:

"And the student said, "Oh, it was the MATLAB." So for those of you that don't know MATLAB yet, I hope you do learn it. It's not hard, and we'll actually have a short MATLAB tutorial in one of the discussion sections for those of you that don't know it."


#### Other types of retrieval

In [44]:
from langchain.retrievers import SVMRetriever
from langchain.retrievers import TFIDFRetriever
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [45]:
# Load PDF
loader = PyPDFLoader("docs/css_lectures\\MachineLearning-Lecture01.pdf")
pages = loader.load()
all_page_text = [p.page_content for p in pages]
joined_page_text = " ".join(all_page_text)

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)
splits = text_splitter.split_text(joined_page_text)
len(splits)

45

In [50]:
# Retrieve
# svm_retriever = SVMRetriever.from_texts(splits,embedding)
tfidf_retriever = TFIDFRetriever.from_texts(splits)
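
TF-IDF retrieval is purely lexical: it scores chunks by weighted term overlap with the query rather than by embedding similarity. The sketch below is a small from-scratch version of that ranking, intended only to show the mechanics. `tfidf_rank`, the smoothed IDF formula chosen here, and the toy texts are assumptions, not the retriever's internals.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def tfidf_rank(query, texts, k=2):
    """Return indices of the k texts with the highest TF-IDF cosine score."""
    tokenized = [tokenize(text) for text in texts]
    n_docs = len(tokenized)
    # Number of texts each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    def idf(term):
        # Smoothed inverse document frequency (one common variant).
        return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

    def vector(tokens):
        return {term: count * idf(term) for term, count in Counter(tokens).items()}

    def cosine(u, v):
        dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    query_vec = vector(tokenize(query))
    scored = [(cosine(query_vec, vector(tokens)), i) for i, tokens in enumerate(tokenized)]
    # Highest score first; ties broken by original position.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:k]]


toy_texts = [
    "homeworks will be done in matlab or octave",
    "the course covers supervised learning",
    "matlab tutorial in the discussion section",
]
ranking = tfidf_rank("matlab tutorial", toy_texts, k=2)
print(ranking)
```

Because matching is on exact tokens, a transcript that spells a word with stray spaces (for example `MATLA B`) will not match the query term, which is one reason lexical and embedding-based results can differ.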

In [None]:
# question = "What are major topics for this class?"
# docs_svm=svm_retriever.get_relevant_documents(question)
# docs_svm[0]

In [51]:
question = "what did they say about matlab?"
docs_tfidf = tfidf_retriever.get_relevant_documents(question)
len(docs_tfidf)

4

In [52]:
docs_tfidf[0]

Document(page_content="Saxena and Min Sun here did, wh ich is given an image like this, right? This is actually a \npicture taken of the Stanford campus. You can apply that sort of cl ustering algorithm and \ngroup the picture into regions. Let me actually blow that up so that you can see it more \nclearly. Okay. So in the middle, you see the lines sort of groupi ng the image together, \ngrouping the image into [inaudible] regions.  \nAnd what Ashutosh and Min did was they then  applied the learning algorithm to say can \nwe take this clustering and us e it to build a 3D model of the world? And so using the \nclustering, they then had a lear ning algorithm try to learn what the 3D structure of the \nworld looks like so that they could come up with a 3D model that you can sort of fly \nthrough, okay? Although many people used to th ink it's not possible to take a single \nimage and build a 3D model, but using a lear ning algorithm and that sort of clustering \nalgorithm is the first ste

In [54]:
pretty_print_docs(docs_tfidf)

Document 1:

Saxena and Min Sun here did, wh ich is given an image like this, right? This is actually a 
picture taken of the Stanford campus. You can apply that sort of cl ustering algorithm and 
group the picture into regions. Let me actually blow that up so that you can see it more 
clearly. Okay. So in the middle, you see the lines sort of groupi ng the image together, 
grouping the image into [inaudible] regions.  
And what Ashutosh and Min did was they then  applied the learning algorithm to say can 
we take this clustering and us e it to build a 3D model of the world? And so using the 
clustering, they then had a lear ning algorithm try to learn what the 3D structure of the 
world looks like so that they could come up with a 3D model that you can sort of fly 
through, okay? Although many people used to th ink it's not possible to take a single 
image and build a 3D model, but using a lear ning algorithm and that sort of clustering 
algorithm is the first step. They were able to.

In [58]:
for d in docs_tfidf:
    print(d.metadata)

{}
{}
{}
{}


#### Comparing contents of the vector DB and TF-IDF

In [56]:
docs = vectordb.similarity_search(question,k=4)

Retrying langchain.embeddings.openai.embed_with_retry.<locals>._embed_with_retry in 4.0 seconds as it raised RateLimitError: Rate limit reached for default-text-embedding-ada-002 in organization org-OKEaxoW73Lo9pocIXi1wDApB on requests per day. Limit: 200 / day. Please try again in 7m12s. Contact us through our help center at help.openai.com if you continue to have issues. Please add a payment method to your account to increase your rate limit. Visit https://platform.openai.com/account/billing to add a payment method..

In [57]:
pretty_print_docs(docs)

Document 1:

those homeworks will be done in either MATLA B or in Octave, which is sort of — I 
know some people call it a free ve rsion of MATLAB, which it sort  of is, sort of isn't.  
So I guess for those of you that haven't s een MATLAB before, and I know most of you 
have, MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to 
plot data. And it's sort of an extremely easy to  learn tool to use for implementing a lot of 
learning algorithms.  
And in case some of you want to work on your  own home computer or something if you 
don't have a MATLAB license, for the purposes of  this class, there's also — [inaudible] 
write that down [inaudible] MATLAB — there' s also a software package called Octave 
that you can download for free off the Internet. And it has somewhat fewer features than MATLAB, but it's free, and for the purposes of  this class, it will work for just abo

In [59]:
for d in docs:
    print(d.metadata)

{'page': 8, 'source': 'docs/css_lectures\\MachineLearning-Lecture01.pdf'}
{'page': 8, 'source': 'docs/css_lectures\\MachineLearning-Lecture01.pdf'}
{'page': 8, 'source': 'docs/css_lectures\\MachineLearning-Lecture01.pdf'}
{'page': 8, 'source': 'docs/css_lectures\\MachineLearning-Lecture01.pdf'}
