# Retrieval

Retrieval is the centerpiece of our retrieval-augmented generation (RAG) flow.

Let's get our vectorDB from before.

# Maximum Marginal Relevance (MMR) algorithm

1. Query the vector store.
2. Choose the `fetch_k` most similar responses.
3. Within those responses, choose the `k` most diverse.

Motivation: you may not always want only the most similar responses.
For example, a chef researching one kind of mushroom should care about whether the mushroom is poisonous, not only about which mushroom tastes best with a given dish.
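
The selection steps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not LangChain's implementation: the helper names `cosine` and `mmr` and the toy 2-D vectors are made up for the example, and `lambda_mult` simply borrows the name of the parameter that trades relevance against diversity.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr(query_vec, doc_vecs, k=2, fetch_k=3, lambda_mult=0.5):
    """Return the indices of k documents chosen by maximum marginal relevance."""
    # Step 1: keep the fetch_k documents most similar to the query.
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    candidates = sorted(range(len(doc_vecs)), key=lambda i: relevance[i], reverse=True)[:fetch_k]

    selected = []
    # Step 2: greedily pick the candidate with the best trade-off between
    # relevance to the query and redundancy with what is already selected.
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy "embeddings": documents 0 and 1 are near-duplicates, document 2 is different.
docs = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8], [0.0, 1.0]]
query = [1.0, 0.2]

picked = mmr(query, docs, k=2, fetch_k=3)
similar_only = mmr(query, docs, k=2, fetch_k=3, lambda_mult=1.0)
print(picked, similar_only)
```

With `lambda_mult=1.0` the redundancy term vanishes and the function degenerates to plain similarity ranking, so the two near-duplicates are returned together; with the default of 0.5 the second pick is the less similar but more distinct document.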


# LLM Aided Retrieval

### There are several situations where the query applied to the DB is more than just the question asked.
### One is SelfQuery, where we use an LLM to convert the user question into a structured query.

#### Question: What are some movies about aliens made in 1980?
#### Query parser: extracts the subject (aliens) for the semantic search and the year (1980) as a metadata filter.
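
To make the idea concrete, here is a rough sketch of what the parsing step produces. In a real self-query setup an LLM does the parsing; the regex-based `parse_question` and the `StructuredQuery` dataclass below are hypothetical stand-ins used only to show the shape of the output.

```python
import re
from dataclasses import dataclass, field

@dataclass
class StructuredQuery:
    query: str                                   # text used for the vector search
    filter: dict = field(default_factory=dict)   # metadata constraints

def parse_question(question: str) -> StructuredQuery:
    """Hypothetical rule-based stand-in for the LLM query parser."""
    metadata = {}
    match = re.search(r"\b(19|20)\d{2}\b", question)
    if match:
        year = match.group(0)
        metadata["year"] = int(year)
        # Remove the "made in <year>" / "in <year>" phrase from the search text.
        question = re.sub(r"\s*(made\s+)?in\s+" + year, "", question)
    return StructuredQuery(query=question.strip(" ?"), filter=metadata)

parsed = parse_question("What are some movies about aliens made in 1980?")
print(parsed.query)
print(parsed.filter)
```

The search text keeps the subject ("aliens") while the year moves into a filter, which the vector store can apply as an exact constraint instead of relying on embedding similarity.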

## Compression

### Increase the number of results you can fit in the context by shrinking the responses to only the relevant information.

### Question --> Store --> Compression LLM --> LLM

## Relevant splits, compressed relevant splits.




In [131]:
# !pip install openai
# !pip install python-dotenv
# !pip install chromadb
# !pip install lark
# !pip install langchain
# !pip install langchain_community
# !pip install tiktoken
# !pip install python-docx
# !pip install pypdf
# !pip install PyPDF2

## Vectorstore retrieval


In [132]:
import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key = os.environ['OPENAI_API_KEY']
OPENAI_API_KEY = openai.api_key

In [133]:
texts = [
    """The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).""",
    """A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.""",
    """A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.""",
]

### Similarity Search

In [135]:
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
# persist_directory = 'docs'

In [136]:
from google.colab import drive
# drive.mount('/content/drive')
persist_directory = 'Embeddings'
# persist_directory = 'sample_data/The_History_of_Starbucks.pdf'
# embedding = OpenAIEmbeddings()
embedding = OpenAIEmbeddings(
    openai_api_key= OPENAI_API_KEY
)
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)

In [137]:
len(vectordb)

3

In [138]:
# Helpers for extracting text from PDF and DOCX files before embedding
from PyPDF2 import PdfReader
from docx import Document
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings

In [139]:
# pdf
def extract_text_from_pdf(pdf_path):
    reader = PdfReader(pdf_path)
    text = ''
    for page in reader.pages:
        text += page.extract_text()
    return text

In [140]:
# docs
def extract_text_from_docx(docx_path):
    doc = Document(docx_path)
    text = ''
    for para in doc.paragraphs:
        text += para.text + '\n'
    return text

In [141]:
documents = []

for filename in os.listdir(persist_directory):
    file_path = os.path.join(persist_directory, filename)
    print("file_path", file_path)

    # Process .pdf and .docx files; legacy .doc files are detected but not parsed
    if filename.endswith('.pdf'):
        print(f"Processing PDF: {filename}")
        text = extract_text_from_pdf(file_path)
        if text:
            documents.append(text)
    elif filename.endswith('.docx'):
        print(f"Processing DOCX: {filename}")
        text = extract_text_from_docx(file_path)
        if text:
            documents.append(text)
    elif filename.endswith('.doc'):
        # python-docx reads only .docx, so legacy .doc files are skipped
        print(f"Skipping unsupported DOC: {filename}")
    else:
        print(f"Skipping file: {filename}")


if not documents:
    print("no documents found.")
else:
    vectordb.add_texts(documents)
    print("vectordb._collection.count() count:", vectordb._collection.count())

file_path Embeddings/Pret.docx
Processing DOCX: Pret.docx
file_path Embeddings/Pret.pdf
Processing PDF: Pret.pdf
file_path Embeddings/chroma.sqlite3
Skipping file: chroma.sqlite3
file_path Embeddings/86f77b6f-4d5d-4d06-ab9f-8fa1cdb681f1
Skipping file: 86f77b6f-4d5d-4d06-ab9f-8fa1cdb681f1
file_path Embeddings/Starbucks.pdf
Processing PDF: Starbucks.pdf
vectordb._collection.count() count: 6


In [142]:
len(vectordb)  # Yay!!

6

In [143]:
# vectordb.metadata
all_documents = vectordb.get()
len(all_documents)
all_documents.keys()

dict_keys(['ids', 'embeddings', 'metadatas', 'documents', 'uris', 'data', 'included'])

In [144]:
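# Note: the key is plural ('metadatas'); 'metadata' is not a key, so .get() returns None.
# 'embeddings' is None because get() does not include embeddings by default.
all_documents.get('ids'), all_documents.get('metadata'), all_documents.get('embeddings')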

(['03ab5258-36ad-4e6f-8f2a-4c76e677de7a',
  '1ce0fc9f-9721-4904-829f-899a4d1c718e',
  '221d7043-b6a8-49bf-b99e-4e628b45ae8d',
  '74ac0a29-b78b-4580-bf03-486cc03d13bd',
  'b3f09b14-deb9-4ef6-ad33-9f17d49a7e63',
  'c1b25563-cd62-4d48-be5c-630e770e854c'],
 None,
 None)

In [145]:
all_documents.get('documents').__len__()

6

In [146]:
all_documents.get('data')

In [147]:
# Note: the key is 'included'; 'includes' is not a key, so this returns None.
all_documents.get('includes')

In [151]:
print(vectordb._collection.count())

6


In [152]:
# vectordb.similarity_search(
#     query="What is Pret's founding date?", k=2
# )

# Text processing

In [153]:
texts = [
    """The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).""",
    """A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.""",
    """A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.""",
    """Today is a good day for a long walk.""",
    """One type of Italian dishes with Mushrooms is called La Cucina Italiana""",
    """Amanita phalloides is poisonous"""
]


In [154]:
smalldb = Chroma.from_texts(texts, embedding=embedding)

In [155]:
question = "Tell me about all-white mushrooms with large fruiting bodies? Are they poisonous?"

In [156]:
smalldb.similarity_search(question, k=1)
# Pure similarity search picks the all-white chunk but misses the attribute that
# matters most to chefs and customers: whether the mushroom is poisonous.
# This is one of the problems with pure similarity search.

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.')]

In [157]:
smalldb.max_marginal_relevance_search(question,k=4, fetch_k=4)

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(page_content='The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).')]

### Addressing Diversity: Maximum marginal relevance

Last class we introduced one problem: how to enforce diversity in the search results.

`Maximum marginal relevance` strives to achieve both relevance to the query *and diversity* among the results.

In [158]:
question = "what did they say about the menu of Pret?"
docs_ss = vectordb.similarity_search(question,k=3)

In [159]:
docs_ss[0].page_content[100:1000]

'often\nsimply\nknown\nas\nPret,\nis\na\nglobally\nrecognized\ncoffee\nshop\nand\nsandwich\nchain\nwith\na\nunique\napproach\nto\nfood\nand\nservice.\nWith\nits\nroots\nin\nthe\nbustling\nstreets\nof\nLondon,\nPret\nhas\ngrown\nto\nbecome\na\nfavorite\namong\nthose\nseeking\nfresh,\nhealthy,\nand\nconvenient\nmeals.\nThis\narticle\ndelves\ninto\nthe\nhistory\nof\nPret,\ntracing\nits\njourney\nfrom\na\nsingle\nshop\nto\na\nglobal\nbrand.\nThe\nBeginning:\nA\nSimple\nIdea\nThe\nstory\nof\nPret\nA\nManger\nbegins\nin\n1983\nwhen\ntwo\ncollege\nfriends,\nSinclair\nBeecham\nand\nJulian\nMetcalfe,\nnoticed\na\ngap\nin\nthe\nmarket\nfor\nfresh,\nnatural\nfood\nthat\ncould\nbe\nserved\nquickly\nto\nbusy\nLondoners.\nInspired\nby\nthe\nidea\nof\nproviding\nan\nalternative\nto\nthe\nprocessed\nand\nunhealthy\nfast\nfood\noptions\nthat\ndominated\nthe\nmarket,\nthey\ndecided\nto\ncreate\na\nplace\nwhere\npeople\ncould\nfind\nfresh\nsandwiches,\nsalads,\nand\ncoffee\nmade\nfrom\nhigh-quality\ningr

In [160]:
docs_ss[1].page_content[:100]

'The\nHistory\nof\nPret\nA\nManger:\nFrom\na\nSingle\nShop\nto\na\nGlobal\nPhenomenon\nIntroduction\nPret\nA\nManger,\n'

In [161]:
docs_ss[2].page_content[:100]

'The History of Pret A Manger: From a Single Shop to a Global Phenomenon\nIntroduction\nPret A Manger, '

In [162]:
docs_ss == vectordb

False

In [163]:
type(docs_ss)

list

In [164]:
docs_mmr = vectordb.max_marginal_relevance_search(question,k=3)
type(docs_mmr)



list

Note the difference in results with `MMR`.

In [166]:
docs_mmr[0].page_content[:100]

'The\nHistory\nof\nPret\nA\nManger:\nFrom\na\nSingle\nShop\nto\na\nGlobal\nPhenomenon\nIntroduction\nPret\nA\nManger,\n'

In [167]:
docs_mmr[1].page_content[:100]

'The History of Pret A Manger: From a Single Shop to a Global Phenomenon\nIntroduction\nPret A Manger, '

### Addressing Specificity: working with metadata

In the last lecture, we showed that a question about the third lecture can include results from other lectures as well.

To address this, many vectorstores support operations on `metadata`.

`metadata` provides context for each embedded chunk.
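
As a rough mental model, a metadata filter is an exact-match test applied to each chunk's metadata alongside the similarity search. The sketch below is illustrative only: the chunk texts, file names, and the helpers `matches` and `filter_chunks` are invented for the example and do not reflect how any particular vector store implements filtering.

```python
# Made-up chunks, each carrying the metadata a loader might attach.
chunks = [
    {"text": "Linear regression fits a line to data.",
     "metadata": {"source": "lecture03.pdf", "page": 2}},
    {"text": "Regression was previewed briefly.",
     "metadata": {"source": "lecture01.pdf", "page": 8}},
    {"text": "Locally weighted regression is non-parametric.",
     "metadata": {"source": "lecture03.pdf", "page": 5}},
    {"text": "A chunk stored without any metadata.",
     "metadata": {}},
]

def matches(metadata: dict, where: dict) -> bool:
    """True when every key in `where` is present in `metadata` with an equal value."""
    return all(metadata.get(key) == value for key, value in where.items())

def filter_chunks(chunks, where):
    """Keep only the chunks whose metadata satisfies the filter."""
    return [c for c in chunks if matches(c["metadata"], where)]

third_lecture = filter_chunks(chunks, {"source": "lecture03.pdf"})
for c in third_lecture:
    print(c["metadata"], c["text"])
```

Note that a chunk stored with empty metadata can never satisfy a non-empty filter, so filtering is only useful when metadata is attached at ingestion time.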

In [168]:
question = "what did they say about regression in the third lecture?"

In [169]:
docs = vectordb.similarity_search(
    question,
    k=3,
    filter={"source":"docs/cs229_lectures/MachineLearning-Lecture03.pdf"}
)

In [170]:
for d in docs:
    print(d.metadata)

### Addressing Specificity: working with metadata using self-query retriever

But we have an interesting challenge: we often want to infer the metadata from the query itself.

To address this, we can use `SelfQueryRetriever`, which uses an LLM to extract:

1. The `query` string to use for vector search
2. A metadata filter to pass in as well

Most vector databases support metadata filters, so this doesn't require any new databases or indexes.

In [171]:
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

In [172]:
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The lecture the chunk is from, should be one of `docs/cs229_lectures/MachineLearning-Lecture01.pdf`, `docs/cs229_lectures/MachineLearning-Lecture02.pdf`, or `docs/cs229_lectures/MachineLearning-Lecture03.pdf`",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page from the lecture",
        type="integer",
    ),
]

**Note:** The default model for `OpenAI` (`from langchain.llms import OpenAI`) is `text-davinci-003`. Because OpenAI deprecated `text-davinci-003` on 4 January 2024, you'll be using OpenAI's recommended replacement model, `gpt-3.5-turbo-instruct`, instead.

In [173]:
document_content_description = "Cafe Inquiry"
llm = OpenAI(model='gpt-3.5-turbo-instruct', temperature=0, openai_api_key = OPENAI_API_KEY)
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectordb,
    document_content_description,
    metadata_field_info,
    verbose=True
)

In [1]:
vectordb.embeddings

In [175]:
question = "What did they say about the cafe Starbucks?"

In [176]:
vectordb.get(ids=["id1", "id2"])

{'ids': [],
 'embeddings': None,
 'metadatas': [],
 'documents': [],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents']}

In [177]:
document = vectordb.get(ids=["id1", "id2"])
print(document)

{'ids': [], 'embeddings': None, 'metadatas': [], 'documents': [], 'uris': None, 'data': None, 'included': ['metadatas', 'documents']}


**You will receive a warning** about `predict_and_parse` being deprecated the first time you execute the next line. This can be safely ignored.

In [178]:
docs = retriever.get_relevant_documents(question)

In [179]:
for d in docs:
    print(d.metadata)

{}
{}
{}
{}


### Additional tricks: compression

Another approach for improving the quality of retrieved docs is compression.

Information most relevant to a query may be buried in a document with a lot of irrelevant text.

Passing that full document through your application can lead to more expensive LLM calls and poorer responses.

Contextual compression is meant to fix this.
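
The effect of compression can be illustrated without an LLM. The sketch below is a crude keyword-overlap stand-in for the extraction step, assuming made-up sample text and a hand-picked stopword list; an LLM-based compressor judges relevance far more flexibly, but the input and output have the same shape.

```python
import re

# Small, hand-picked stopword list (an assumption for this example).
STOPWORDS = {"a", "an", "the", "is", "are", "was", "what", "when", "did", "do",
             "of", "in", "to", "and", "about", "they", "say"}

def keywords(text):
    """Lowercased content words of a text, minus stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}

def compress(document: str, question: str) -> str:
    """Keep only the sentences that share at least one keyword with the question."""
    wanted = keywords(question)
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    kept = [s for s in sentences if keywords(s) & wanted]
    return " ".join(kept)

# Made-up sample document: one sentence is irrelevant to the question.
document = (
    "Pret A Manger began in London in 1983. "
    "The weather in London is often rainy. "
    "Its menu focuses on fresh sandwiches and salads."
)
compressed = compress(document, "What did they say about the menu of Pret?")
print(compressed)
```

Only the sentences that mention the question's keywords survive, so fewer tokens reach the final LLM call.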

In [180]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

In [181]:
def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))


In [182]:
# Wrap our vectorstore
llm = OpenAI(temperature=0, model="gpt-3.5-turbo-instruct", openai_api_key = OPENAI_API_KEY)
compressor = LLMChainExtractor.from_llm(llm)

In [183]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)

In [184]:
# question = "what did they say about matlab?"
# compressed_docs = compression_retriever.get_relevant_documents(question)
# pretty_print_docs(compressed_docs)

## Combining various techniques

In [185]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type = "mmr")
)

In [186]:
# question = "what did they say about matlab?"
# compressed_docs = compression_retriever.get_relevant_documents(question)
# pretty_print_docs(compressed_docs)

## Other types of retrieval

It's worth noting that a vector database is not the only kind of tool for retrieving documents.

The `LangChain` retriever abstraction includes other ways to retrieve documents, such as TF-IDF or SVM.
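
For intuition, TF-IDF ranks documents by the words they share with the question, weighting rare words more heavily, and needs no embeddings at all. The following is a minimal pure-Python sketch with made-up sample documents; `tokenize` and `tfidf_rank` are illustrative names, and the smoothed IDF formula is one common variant rather than the exact one any library retriever uses.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase a text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_rank(question, documents):
    """Return document indices ordered by TF-IDF score against the question."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(documents)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    question_terms = set(tokenize(question))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in question_terms:
            if term in tf:
                idf = math.log((1 + n_docs) / (1 + df[term])) + 1  # smoothed IDF
                score += (tf[term] / len(tokens)) * idf
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)

# Made-up sample documents for the example.
documents = [
    "Pret serves fresh sandwiches and coffee.",
    "Starbucks opened its first store in Seattle.",
    "Today is a good day for a long walk.",
]
ranking = tfidf_rank("Where did Starbucks open its first store?", documents)
print(documents[ranking[0]])
```

Because the match is purely lexical, "open" does not match "opened" here; that brittleness is the main trade-off against embedding-based retrieval.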

In [187]:
from langchain.retrievers import SVMRetriever
from langchain.retrievers import TFIDFRetriever
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [188]:
# Load PDF
loader = PyPDFLoader("Embeddings/Pret.pdf")
pages = loader.load()
all_page_text=[p.page_content for p in pages]
joined_page_text=" ".join(all_page_text)

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1500,chunk_overlap = 150)
splits = text_splitter.split_text(joined_page_text)
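
The roles of `chunk_size` and `chunk_overlap` can be seen in a simplified fixed-window splitter. This is only a sketch: `split_with_overlap` is a hypothetical helper, and unlike `RecursiveCharacterTextSplitter` it cuts at fixed character offsets instead of preferring natural separators such as paragraph or sentence breaks.

```python
def split_with_overlap(text, chunk_size=1500, chunk_overlap=150):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks

sample = "abcdefghijklmnopqrstuvwxy"  # 25 characters
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(pieces)
```

Each chunk repeats the last few characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.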


In [191]:
# Retrieve
svm_retriever = SVMRetriever.from_texts(splits,embedding)
tfidf_retriever = TFIDFRetriever.from_texts(splits)

In [193]:
question = "What do people say about Pret?"
docs_svm=svm_retriever.get_relevant_documents(question)
docs_svm[0]

In [194]:
question = "what did they say about matlab?"
docs_tfidf=tfidf_retriever.get_relevant_documents(question)
docs_tfidf[0]