## In this notebook, I show examples of how to use different retrieval techniques with a vector database, and the challenges faced with each technique.

In [1]:
import openai
import sys
from dotenv import load_dotenv, find_dotenv

In [2]:
_ = load_dotenv(find_dotenv())
import os
openai.api_key = os.environ["OPENAI_API_KEY"]

In [3]:
from langchain.document_loaders import PyPDFLoader

In [6]:
loaders = [
    PyPDFLoader("docs/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("docs/MachineLearning-Lecture01.pdf"),  # same file as the line above; this puts duplicate chunks in the vector store
    PyPDFLoader("docs/MachineLearning-Lecture02.pdf"),
    PyPDFLoader("docs/MachineLearning-Lecture03.pdf")
]

In [7]:
docs = []
for loader in loaders:
    docs.extend(loader.load())

In [8]:
print(len(docs))

78


#### Using RecursiveCharacterTextSplitter to split docs with a chunk size of 1500 and a chunk overlap of 150

In [9]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [10]:
rc_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)

In [11]:
splits = rc_splitter.split_documents(docs)
print(len(splits))

209
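To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of overlapping character windows. It is only an illustration: `split_with_overlap` is a hypothetical helper, and unlike `RecursiveCharacterTextSplitter` it ignores separators such as paragraph and line breaks and simply cuts at fixed positions.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-width windows.

    Each chunk starts with the last `chunk_overlap` characters of the
    previous chunk. Illustrative only; no separator handling.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks

demo_chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo_chunks)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.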


#### Use embeddings to create vector representations of text

In [12]:
from langchain.embeddings import OpenAIEmbeddings

In [13]:
embedding = OpenAIEmbeddings()

In [15]:
sentence1 = "My favourite animal is cat."
sentence2 = "Cats are very soft and clean animals."
sentence3 = "London is in the United Kingdom."

embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)

In [16]:
import numpy as np
np.dot(embedding1, embedding2)

0.8677371806643356

#### A value close to 1 implies higher similarity (OpenAI embeddings have unit length, so the dot product equals the cosine similarity)

In [17]:
np.dot(embedding1, embedding3)

0.7552921552552395

In [18]:
np.dot(embedding2, embedding3)

0.7273475085667072
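The dot product works as a similarity score here because OpenAI embeddings are normalized to length 1, so the dot product of two embeddings is their cosine similarity. A minimal sketch with made-up toy vectors (not real embeddings); `cosine_similarity` is a helper defined only for this illustration:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (illustrative helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for embeddings; the values are made up.
u = np.array([3.0, 4.0])
v = np.array([4.0, 3.0])

# Scale both vectors to unit length, as OpenAI embeddings already are.
u_hat = u / np.linalg.norm(u)
v_hat = v / np.linalg.norm(v)

print(cosine_similarity(u, v))      # cosine similarity of the raw vectors
print(float(np.dot(u_hat, v_hat)))  # dot product of the unit vectors
```

Both printed values agree up to floating-point error. For vectors that are not normalized, a plain dot product also depends on their lengths and is not bounded by 1.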

## Vectorstores

In [20]:
#pip install chromadb

In [24]:
from langchain.vectorstores import Chroma

In [22]:
persist_directory = "docs/chroma"

In [23]:
!rm -rf docs/chroma/*

In [25]:
vectordb = Chroma.from_documents(documents=splits, persist_directory=persist_directory, embedding=embedding)

In [26]:
print(vectordb._collection.count())

209


#### Using similarity search on our vector database

In [29]:
hits = vectordb.similarity_search("Is there any email where I can ask for help?", k= 5)

In [31]:
print(len(hits))

5


In [32]:
hits[0]

Document(metadata={'page': 5, 'source': 'docs/MachineLearning-Lecture01.pdf'}, page_content="cs229-qa@cs.stanford.edu. This goes to an acc ount that's read by all the TAs and me. So \nrather than sending us email individually, if you send email to this account, it will \nactually let us get back to you maximally quickly with answers to your questions.  \nIf you're asking questions about homework probl ems, please say in the subject line which \nassignment and which question the email refers to, since that will also help us to route \nyour question to the appropriate TA or to me  appropriately and get the response back to \nyou quickly.  \nLet's see. Skipping ahead — let's see — for homework, one midterm, one open and term \nproject. Notice on the honor code. So one thi ng that I think will help you to succeed and \ndo well in this class and even help you to enjoy this cla ss more is if you form a study \ngroup.  \nSo start looking around where you' re sitting now or at the end of class

In [33]:
vectordb.persist()

In [34]:
hits[1]

Document(metadata={'page': 5, 'source': 'docs/MachineLearning-Lecture01.pdf'}, page_content="cs229-qa@cs.stanford.edu. This goes to an acc ount that's read by all the TAs and me. So \nrather than sending us email individually, if you send email to this account, it will \nactually let us get back to you maximally quickly with answers to your questions.  \nIf you're asking questions about homework probl ems, please say in the subject line which \nassignment and which question the email refers to, since that will also help us to route \nyour question to the appropriate TA or to me  appropriately and get the response back to \nyou quickly.  \nLet's see. Skipping ahead — let's see — for homework, one midterm, one open and term \nproject. Notice on the honor code. So one thi ng that I think will help you to succeed and \ndo well in this class and even help you to enjoy this cla ss more is if you form a study \ngroup.  \nSo start looking around where you' re sitting now or at the end of class

In [40]:
embedding1 = embedding.embed_query(hits[0].page_content)

In [41]:
embedding2 = embedding.embed_query(hits[1].page_content)

In [42]:
np.dot(embedding1, embedding2)

0.999999999999999

##### A value close to 1 indicates that the two hits are the same. In this case, it is approximately 1. Let's check the metadata to verify whether the search is giving proper results.

In [43]:
hits = vectordb.similarity_search("What does the article say about regression?", k=5)

In [45]:
for hit in hits:
    print(hit.metadata)

{'page': 2, 'source': 'docs/MachineLearning-Lecture02.pdf'}
{'page': 10, 'source': 'docs/MachineLearning-Lecture03.pdf'}
{'page': 12, 'source': 'docs/MachineLearning-Lecture01.pdf'}
{'page': 12, 'source': 'docs/MachineLearning-Lecture01.pdf'}
{'page': 14, 'source': 'docs/MachineLearning-Lecture03.pdf'}


#### If we look carefully, we can see that page 12 of Lecture01 appears twice in the results. This is because Lecture01 is listed twice in our loaders. This is one shortcoming of the similarity search technique: it does not identify redundant data in the search results.
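One simple workaround is to drop hits whose text has already been seen before using them. A minimal sketch, assuming each hit exposes `page_content` and `metadata` like the documents above; the `Doc` class and the `dedupe_hits` helper are hypothetical stand-ins so the example runs without a vector store:

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def dedupe_hits(hits):
    """Keep the first occurrence of each distinct page_content, preserving order."""
    seen = set()
    unique = []
    for hit in hits:
        key = hit.page_content.strip()
        if key not in seen:
            seen.add(key)
            unique.append(hit)
    return unique

# Made-up hits; the last two simulate the same chunk stored twice.
sample_hits = [
    Doc("regression intro", {"page": 2, "source": "Lecture02.pdf"}),
    Doc("least squares", {"page": 12, "source": "Lecture01.pdf"}),
    Doc("least squares", {"page": 12, "source": "Lecture01.pdf"}),
]

for hit in dedupe_hits(sample_hits):
    print(hit.metadata)
```

Exact matching only removes verbatim duplicates. Chunks that are merely very similar need a similarity-aware method such as maximum marginal relevance.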

## Let us see how we can solve the above problems.

In [46]:
#pip install --quiet --upgrade lark

Collecting lark
  Downloading lark-1.1.9-py3-none-any.whl.metadata (1.9 kB)
Downloading lark-1.1.9-py3-none-any.whl (111 kB)
Installing collected packages: lark
Successfully installed lark-1.1.9
Note: you may need to restart the kernel to use updated packages.


In [47]:
from langchain.embeddings.openai import OpenAIEmbeddings

In [51]:
vectordb_1 = Chroma(
    persist_directory = persist_directory,
    embedding_function=OpenAIEmbeddings()
)

In [49]:
texts = [
    """The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).""",
    """A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.""",
    """A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.""",
]

In [53]:
# from_texts is a classmethod: it builds a separate store holding only these texts
small_db = Chroma.from_texts(texts, embedding=OpenAIEmbeddings())
small_db._collection.count()

3

##### Similarity search returns the k documents closest to the query, ranked by similarity. So if you ask it about white mushrooms with k=3, it returns all three mushroom sentences, including the two that say nearly the same thing about the fruiting body.

In [56]:
small_db.search("Tell me something about white mushrooms?",k=3, search_type="similarity")

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(page_content='The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).'),
 Document(page_content='A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.')]

In [57]:
small_db.similarity_search("Tell me something about white mushrooms?",k=3)

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(page_content='The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).'),
 Document(page_content='A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.')]

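##### Here we will see the results using maximum marginal relevance (MMR). With k=2, it returns the all-white sentence and the Death Cap sentence, skipping the near-duplicate description of the fruiting body.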

In [60]:
small_db.search("Tell me something about white mushrooms?",k=2, search_type="mmr")

Number of requested results 20 is greater than number of elements in index 3, updating n_results = 3


[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(page_content='A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.')]

In [59]:
small_db.max_marginal_relevance_search("Tell me something about white mushrooms?", k=2)

Number of requested results 20 is greater than number of elements in index 3, updating n_results = 3


[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(page_content='A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.')]

### Addressing Diversity: Maximum marginal relevance
Maximum marginal relevance (MMR) allows us to enforce diversity in the search results.

It strives to achieve both relevance to the query and diversity among the results.
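The trade-off between relevance and diversity can be sketched as a greedy selection: at each step, pick the candidate with the best score `lambda_mult * relevance - (1 - lambda_mult) * redundancy`, where redundancy is the highest similarity to anything already selected. This is a minimal sketch on made-up toy vectors, not LangChain's or Chroma's implementation; the function `mmr` and the helper `unit` are hypothetical, and the parameter name `lambda_mult` is borrowed from LangChain only for familiarity.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedy maximum marginal relevance; returns selected indices in order."""
    relevance = [float(d @ query_vec) for d in doc_vecs]
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -np.inf
        for i in candidates:
            # Redundancy: highest similarity to any document already selected.
            redundancy = max(
                (float(doc_vecs[i] @ doc_vecs[j]) for j in selected), default=0.0
            )
            score = lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected

# Toy data: documents 0 and 1 are near-duplicates, document 2 is different.
query = unit([1.0, 0.0])
docs = [unit([1.0, 0.1]), unit([1.0, 0.12]), unit([0.6, -0.8])]

print(mmr(query, docs, k=2, lambda_mult=0.5))  # balances relevance and diversity
print(mmr(query, docs, k=2, lambda_mult=1.0))  # pure relevance ranking
```

With `lambda_mult=1.0` the redundancy term vanishes and the selection reduces to plain similarity ranking, so the two near-duplicates are both returned. Lower values penalize candidates that resemble what has already been picked.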

In [61]:
question = "what did they say about regression in the third lecture?"

In [65]:
hits = vectordb.similarity_search(
    question,
    k=3,
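    # Note: this filter restricts results to Lecture02, although the question
    # asks about the third lecture ("docs/MachineLearning-Lecture03.pdf").
    filter={"source":"docs/MachineLearning-Lecture02.pdf"}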
)

hits[0].page_content

"Instructor (Andrew Ng) :All right, so who thought driving could be that dramatic, right? \nSwitch back to the chalkboard, please. I s hould say, this work was done about 15 years \nago and autonomous driving has come a long way. So many of you will have heard of the \nDARPA Grand Challenge, where one of my colleagues, Sebastian Thrun, the winning \nteam's drive a car across a desert by itself.  \nSo Alvin was, I think, absolutely amazing wo rk for its time, but autonomous driving has \nobviously come a long way since then. So what  you just saw was an example, again, of \nsupervised learning, and in particular it was an  example of what they  call the regression \nproblem, because the vehicle is trying to predict a continuous value variables of a \ncontinuous value steering directions , we call the regression problem.  \nAnd what I want to do today is talk about our first supervised learning algorithm, and it \nwill also be to a regression task. So for the running example that I'm goi

In [63]:
hits = vectordb.similarity_search(
    question,
    k=3
)

hits

[Document(metadata={'page': 0, 'source': 'docs/MachineLearning-Lecture03.pdf'}, page_content='MachineLearning-Lecture03  \nInstructor (Andrew Ng) :Okay. Good morning and welcome b ack to the third lecture of \nthis class. So here’s what I want to do t oday, and some of the topics I do today may seem \na little bit like I’m jumping, sort  of, from topic to topic, but here’s, sort of, the outline for \ntoday and the illogical flow of ideas. In the last lecture, we  talked about linear regression \nand today I want to talk about sort of an  adaptation of that called locally weighted \nregression. It’s very a popular  algorithm that’s actually one of my former mentors \nprobably favorite machine learning algorithm.  \nWe’ll then talk about a probabl e second interpretation of linear regression and use that to \nmove onto our first classification algorithm, which is logistic regr ession; take a brief \ndigression to tell you about something cal led the perceptron algorithm, which is \nsomet

## Self-Query retriever
- A query string for carrying out the vector search
- A filter to apply on the metadata
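The two parts listed above can be illustrated with a toy example. In the real retriever an LLM does this split; here a regular expression stands in for it, purely to show what the two outputs look like. The helper `split_question` and the `ORDINALS` table are hypothetical and only handle the phrase "in the first/second/third lecture".

```python
import re

# Hypothetical lookup used only by this toy parser.
ORDINALS = {"first": 1, "second": 2, "third": 3}

def split_question(question):
    """Toy stand-in for the LLM step: returns (search_text, metadata_filter)."""
    match = re.search(r"\b(first|second|third) lecture\b", question.lower())
    if not match:
        return question, {}
    number = ORDINALS[match.group(1)]
    source = f"docs/MachineLearning-Lecture{number:02d}.pdf"
    # Remove the part of the question that was turned into a filter.
    search_text = re.sub(
        r"\s*in the (first|second|third) lecture", "", question, flags=re.IGNORECASE
    )
    return search_text.strip(), {"source": source}

search_text, metadata_filter = split_question(
    "what did they say about regression in the third lecture?"
)
print(search_text)
print(metadata_filter)
```

The search text goes to the vector search, and the filter plays the same role as the `filter` argument passed by hand to `similarity_search`.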

In [67]:
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

Here we define the metadata fields that the retriever is allowed to filter on. Each `AttributeInfo` gives a field's name, description, and type, which the LLM uses to build the filter for the search.

In [70]:
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The document from where the chunk is. It should be one value from 'docs/MachineLearning-Lecture01.pdf' or 'docs/MachineLearning-Lecture02.pdf' or 'docs/MachineLearning-Lecture03.pdf'",
        type="string"
    ),
    AttributeInfo(
        name="page",  # must match the metadata key, which is 'page' in our documents
        description="page number in the document from where the result is.",
        type="integer"
    )
]

In [71]:
document_content_description = "Lecture notes"

We are using OpenAI's `gpt-3.5-turbo-instruct` model here.

In [72]:
llm = OpenAI(model='gpt-3.5-turbo-instruct', temperature=0)

retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vectordb,
    document_contents=document_content_description,
    metadata_field_info=metadata_field_info,
    verbose = True
)

#### The advantage of SelfQueryRetriever is that we do not have to pass the filter param ourselves when running the vector search
Since we defined the metadata field info, the retriever infers the filter from the question. The output shows the page number and source from the metadata, and the search is restricted to the third lecture.

In [73]:
hits = retriever.get_relevant_documents(question)

for hit in hits:
    print(hit.metadata)

{'page': 14, 'source': 'docs/MachineLearning-Lecture03.pdf'}
{'page': 10, 'source': 'docs/MachineLearning-Lecture03.pdf'}
{'page': 0, 'source': 'docs/MachineLearning-Lecture03.pdf'}
{'page': 10, 'source': 'docs/MachineLearning-Lecture03.pdf'}


Another approach for improving the quality of retrieved docs is compression.

Information most relevant to a query may be buried in a document with a lot of irrelevant text.

Passing that full document through your application can lead to more expensive LLM calls and poorer responses.

Contextual compression is meant to fix this.
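To show the idea in miniature: a compressor takes a retrieved document and the query, and returns only the parts of the document that matter for the query. The sketch below uses plain keyword matching as a stand-in for the LLM, so it is much cruder than `LLMChainExtractor`; the `compress` helper, the `STOPWORDS` set, and the sample text are all made up for illustration.

```python
import re

# Small made-up stopword list so common words do not count as matches.
STOPWORDS = frozenset({"what", "did", "they", "say", "about", "in", "the", "a", "of"})

def words(text):
    """Lowercased alphabetic tokens of a string."""
    return re.findall(r"[a-z]+", text.lower())

def compress(document_text, query):
    """Keep only sentences that share a non-stopword term with the query."""
    query_terms = {w for w in words(query) if w not in STOPWORDS}
    sentences = re.split(r"(?<=[.!?])\s+", document_text.strip())
    kept = [s for s in sentences if query_terms & set(words(s))]
    return " ".join(kept)

page = (
    "Okay. Good morning and welcome back. "
    "In the last lecture, we talked about linear regression. "
    "Please remember to form a study group."
)
compressed = compress(page, "what did they say about regression?")
print(compressed)
```

Only the sentence that mentions regression survives, so the text passed on to the LLM is shorter and more focused than the full page.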

In [74]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

In [75]:
def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))


In [76]:
compressor = LLMChainExtractor.from_llm(
    llm
)

compressor_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)

In [77]:
hits = compressor_retriever.get_relevant_documents(question)

for hit in hits:
    print(hit)

page_content='In the last lecture, we talked about linear regression and today I want to talk about sort of an adaptation of that called locally weighted regression.' metadata={'page': 0, 'source': 'docs/MachineLearning-Lecture03.pdf'}
page_content='- "So what you just saw was an example, again, of supervised learning, and in particular it was an example of what they call the regression problem"
- "And what I want to do today is talk about our first supervised learning algorithm, and it will also be to a regression task"
- "So for the running example that I'm going to use throughout today's lecture, you're going to return to the example of trying to predict housing prices"
- "So here's actually a data set collected by TA, Dan Ramage, on housing prices in Portland, Oregon"
- "So here's a dataset of a number of houses of different sizes, and here are their asking prices in thousands of dollars, $200,000"' metadata={'page': 2, 'source': 'docs/MachineLearning-Lecture02.pdf'}
page_content='

In [78]:
pretty_print_docs(hits)

Document 1:

In the last lecture, we talked about linear regression and today I want to talk about sort of an adaptation of that called locally weighted regression.
----------------------------------------------------------------------------------------------------
Document 2:

- "So what you just saw was an example, again, of supervised learning, and in particular it was an example of what they call the regression problem"
- "And what I want to do today is talk about our first supervised learning algorithm, and it will also be to a regression task"
- "So for the running example that I'm going to use throughout today's lecture, you're going to return to the example of trying to predict housing prices"
- "So here's actually a data set collected by TA, Dan Ramage, on housing prices in Portland, Oregon"
- "So here's a dataset of a number of houses of different sizes, and here are their asking prices in thousands of dollars, $200,000"
-------------------------------------------------------

### Combining compression with other types of search

In [79]:
llm = OpenAI(model="gpt-3.5-turbo-instruct", temperature=0)  # temperature=0 keeps the output as deterministic as possible

compressor = LLMChainExtractor.from_llm(llm)

compressor_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever( search_type = "mmr")
)

print(question)

hits = compressor_retriever.get_relevant_documents(question)

pretty_print_docs(hits)

what did they say about regression in the third lecture?
Document 1:

In the last lecture, we talked about linear regression and today I want to talk about sort of an adaptation of that called locally weighted regression.
----------------------------------------------------------------------------------------------------
Document 2:

- "So what you just saw was an example, again, of supervised learning, and in particular it was an example of what they call the regression problem"
- "And what I want to do today is talk about our first supervised learning algorithm, and it will also be to a regression task"
- "So for the running example that I'm going to use throughout today's lecture, you're going to return to the example of trying to predict housing prices"
- "So here's actually a data set collected by TA, Dan Ramage, on housing prices in Portland, Oregon"
- "So here's a dataset of a number of houses of different sizes, and here are their asking prices in thousands of dollars, $200,000

## Other types of retrievers

In [80]:
from langchain.retrievers import TFIDFRetriever
from langchain.retrievers import SVMRetriever
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings

In [82]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("docs/MachineLearning-Lecture01.pdf")

pages = loader.load()

all_pages_lst = [text.page_content for text in pages]

all_pages_str = " ".join(all_pages_lst)

In [83]:
len(all_pages_str)

60674

In [84]:
rc_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150
)

splits = rc_splitter.split_text(all_pages_str)

len(splits)

71

In [85]:
tfidf_retriever = TFIDFRetriever.from_texts(splits)
svm_retriever = SVMRetriever.from_texts(splits, embedding)

In [86]:
hits = tfidf_retriever.get_relevant_documents(question)
pretty_print_docs(hits)

Document 1:

then we want the algorithm to learn the a ssociation between the inputs and the outputs 
and to sort of give us more of the right answers, okay?  
It turns out this specific exam ple that I drew here is an example of something called a 
regression problem. And the term regression sort of refers to the fact that the variable 
you're trying to predict is a continuous value and price.  
There's another class of supervised learning problems which we'll talk about, which are 
classification problems. And so, in a classifi cation problem, the variab le you're trying to 
predict is discreet rather than continuous . So as one specific example — so actually a 
standard data set you can download online [i naudible] that lots of machine learning 
people have played with. Let's say you collect  a data set on breast cancer tumors, and you 
want to learn the algorithm to predict wh ether or not a certai n tumor is malignant.
--------------------------------------------------------------
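Unlike the vector store, a TF-IDF retriever needs no embeddings: it scores a chunk by how often the query's words occur in it (term frequency), weighted by how rare those words are across all chunks (inverse document frequency). Here is a simplified sketch of that scoring on made-up texts. It is not what `TFIDFRetriever` does internally; `tokenize` and `tfidf_rank` are hypothetical helpers.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercased alphabetic tokens of a string."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf_rank(query, texts):
    """Return indices of texts, best match first, by summed TF-IDF of query terms."""
    docs = [Counter(tokenize(t)) for t in texts]
    n = len(docs)
    scores = []
    for counts in docs:
        total = sum(counts.values()) or 1
        score = 0.0
        for term in set(tokenize(query)):
            df = sum(1 for d in docs if term in d)  # documents containing the term
            if df == 0:
                continue  # term never occurs, so it cannot distinguish documents
            idf = math.log((1 + n) / (1 + df)) + 1  # smoothed inverse document frequency
            score += (counts[term] / total) * idf
        scores.append(score)
    return sorted(range(n), key=lambda i: scores[i], reverse=True)

# Made-up chunks for illustration.
sample_texts = [
    "classification predicts a discrete label",
    "regression predicts a continuous value",
    "please form a study group",
]
ranking = tfidf_rank("what is a regression problem", sample_texts)
print(sample_texts[ranking[0]])
```

Because the match is purely lexical, a chunk that talks about the same idea in different words gets no credit, which is where embedding-based search has the advantage.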

In [87]:
hits = svm_retriever.get_relevant_documents(question)
pretty_print_docs(hits)

Document 1:

then we want the algorithm to learn the a ssociation between the inputs and the outputs 
and to sort of give us more of the right answers, okay?  
It turns out this specific exam ple that I drew here is an example of something called a 
regression problem. And the term regression sort of refers to the fact that the variable 
you're trying to predict is a continuous value and price.  
There's another class of supervised learning problems which we'll talk about, which are 
classification problems. And so, in a classifi cation problem, the variab le you're trying to 
predict is discreet rather than continuous . So as one specific example — so actually a 
standard data set you can download online [i naudible] that lots of machine learning 
people have played with. Let's say you collect  a data set on breast cancer tumors, and you 
want to learn the algorithm to predict wh ether or not a certai n tumor is malignant.
--------------------------------------------------------------

