# Retrieval

Retrieval is the centerpiece of our retrieval-augmented generation (RAG) flow.

Let's load the vector database we built earlier.

## Vectorstore retrieval


In [1]:
# import os
# import openai
# import sys
# sys.path.append('../..')

# from dotenv import load_dotenv, find_dotenv
# _ = load_dotenv(find_dotenv()) # read local .env file

# openai.api_key  = os.environ['OPENAI_API_KEY']

import os
from langchain.embeddings import QianfanEmbeddingsEndpoint
from langchain.chat_models import QianfanChatEndpoint

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

# Confirm the credentials are set without printing the secret values
print(f"QIANFAN_AK set: {'QIANFAN_AK' in os.environ}")
print(f"QIANFAN_SK set: {'QIANFAN_SK' in os.environ}")


In [2]:
#!pip install lark

### Similarity Search

In [3]:
from langchain.vectorstores import Chroma
# from langchain.embeddings.openai import OpenAIEmbeddings
persist_directory = 'docs/chroma/'

In [4]:
# embedding = OpenAIEmbeddings()
embedding = QianfanEmbeddingsEndpoint()
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)

In [5]:
print(vectordb._collection.count())

368


In [6]:
texts = [
    """The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).""",
    """A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.""",
    """A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.""",
]

In [7]:
smalldb = Chroma.from_texts(texts, embedding=embedding)

[INFO] [05-06 14:45:39] oauth.py:222 [t:25012]: trying to refresh access_token for ak `HCCPsQ***`
[INFO] [05-06 14:45:41] oauth.py:237 [t:25012]: sucessfully refresh access_token


In [8]:
question = "Tell me about all-white mushrooms with large fruiting bodies"

In [9]:
smalldb.similarity_search(question, k=2)

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(page_content='The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).')]

In [10]:
smalldb.max_marginal_relevance_search(question, k=2, fetch_k=3)

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.'),
 Document(page_content='A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.')]

### Addressing Diversity: Maximum marginal relevance

Last class we introduced one problem: how to enforce diversity in the search results.
 
`Maximum marginal relevance` strives to achieve both relevance to the query *and diversity* among the results.

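Maximum Marginal Relevance (MMR) is an algorithm used in information retrieval and text summarization. Its main goal is to increase the diversity of a summary or a set of search results while keeping them relevant. MMR was introduced by Jaime Carbonell and Jade Goldstein of Carnegie Mellon University in 1998 and is widely used in text summarization, search engines, and information retrieval.

The core idea of MMR is to weigh two factors at every selection step: relevance and diversity. When choosing the next text passage, the algorithm considers both how relevant the candidate is to the query and how different it is from the passages already selected, so that the resulting summary or result list is both relevant and diverse.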
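To make the selection rule concrete, here is a minimal pure-Python sketch of MMR over toy 2-D vectors. It is an illustration only, not the code Chroma or LangChain runs; the helper names `cosine` and `mmr` and the toy vectors are assumptions made for this example, and `lambda_mult` is used here simply as the name of the relevance/diversity trade-off weight.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return the indices of k documents picked by maximum marginal relevance."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            # Reward similarity to the query ...
            relevance = cosine(query_vec, doc_vecs[i])
            # ... and penalize similarity to anything already selected.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Hypothetical embeddings: docs 0 and 1 are near-duplicates, doc 2 differs.
query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.12], [0.8, -0.6]]

print(mmr(query, docs, k=2, lambda_mult=0.5))  # balances relevance and diversity
print(mmr(query, docs, k=2, lambda_mult=1.0))  # pure relevance ranking
```

With `lambda_mult=1.0` the redundancy term vanishes and the two near-duplicates are returned; lowering it lets the less similar third vector displace the duplicate.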

In [11]:
question = "what did they say about matlab?"
docs_ss = vectordb.similarity_search(question, k=3)

In [12]:
docs_ss[0].page_content[:100]

"don't have a MATLAB license, for the purposes of  this class, there's also — [inaudible] \nwrite that"

In [13]:

docs_ss[1].page_content[:100]

"don't have a MATLAB license, for the purposes of  this class, there's also — [inaudible] \nwrite that"

Note the difference in results with `MMR`.

In [14]:
docs_mmr = vectordb.max_marginal_relevance_search(question, k=3)

In [15]:
docs_mmr[0].page_content[:100]

"don't have a MATLAB license, for the purposes of  this class, there's also — [inaudible] \nwrite that"

In [16]:
docs_mmr[1].page_content[:100]

'emphasize that I’m taking this thing and view ing it as a function of  theta. Okay? So \nlikelihood a'

### Addressing Specificity: working with metadata

In the last lecture, we showed that a question about the third lecture can also return results from other lectures.

To address this, many vectorstores support operations on `metadata`.

`metadata` provides context for each embedded chunk.
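As a rough picture of what a metadata filter does, the sketch below first discards chunks whose metadata does not match and only then ranks the survivors by similarity. This is a toy stand-in, not how Chroma implements filtering; the function `filtered_search`, the chunk dictionaries, the 2-D vectors, and the file names `lecture01.pdf` and `lecture03.pdf` are all made up for illustration.

```python
def dot(a, b):
    """Dot product, used here as a simple similarity score."""
    return sum(x * y for x, y in zip(a, b))

def filtered_search(query_vec, chunks, metadata_filter, k=3):
    """Keep chunks matching every key/value in the filter, then rank by similarity."""
    matches = [
        c for c in chunks
        if all(c["metadata"].get(key) == value for key, value in metadata_filter.items())
    ]
    matches.sort(key=lambda c: dot(query_vec, c["vector"]), reverse=True)
    return matches[:k]

# Hypothetical chunks with toy embeddings and metadata.
chunks = [
    {"text": "intro to the course", "vector": [0.9, 0.1],
     "metadata": {"source": "lecture01.pdf", "page": 0}},
    {"text": "linear regression", "vector": [0.8, 0.6],
     "metadata": {"source": "lecture03.pdf", "page": 2}},
    {"text": "locally weighted regression", "vector": [0.7, 0.7],
     "metadata": {"source": "lecture03.pdf", "page": 4}},
]

results = filtered_search([1.0, 0.0], chunks, {"source": "lecture03.pdf"}, k=2)
for r in results:
    print(r["metadata"])
```

Even though the first chunk scores highest against the query vector, it never appears in the results because its `source` does not match the filter.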

In [17]:
question = "what did they say about regression in the third lecture?"

In [18]:
docs = vectordb.similarity_search(
    question,
    k=3,
    filter={"source":"docs/cs229_lectures/MachineLearning-Lecture02.pdf"}
)

In [19]:
for d in docs:
    print(d.metadata)

{'page': 4, 'source': 'docs/cs229_lectures/MachineLearning-Lecture02.pdf'}
{'page': 3, 'source': 'docs/cs229_lectures/MachineLearning-Lecture02.pdf'}
{'page': 4, 'source': 'docs/cs229_lectures/MachineLearning-Lecture02.pdf'}


### Addressing Specificity: working with metadata using self-query retriever

But we have an interesting challenge: we often want to infer the metadata from the query itself.

To address this, we can use `SelfQueryRetriever`, which uses an LLM to extract:
 
1. The `query` string to use for vector search
2. A metadata filter to pass in as well

Most vector databases support metadata filters, so this doesn't require any new databases or indexes.
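The two outputs listed above can be pictured as a small structured object. The sketch below uses a regular expression as a rule-based stand-in for the LLM, purely to show the shape of the result; `SelfQueryRetriever` itself relies on an LLM and a query parser, and the names `StructuredQuery` and `toy_self_query` are hypothetical.

```python
import re
from dataclasses import dataclass, field

ORDINALS = {"first": 1, "second": 2, "third": 3}

@dataclass
class StructuredQuery:
    query: str                                  # text to embed for vector search
    filter: dict = field(default_factory=dict)  # metadata filter to apply

def toy_self_query(question):
    """Split a question into a search string and a metadata filter (toy rules only)."""
    match = re.search(r"\s*\bin the (first|second|third) lecture\b", question)
    if not match:
        return StructuredQuery(query=question)
    number = ORDINALS[match.group(1)]
    source = f"docs/cs229_lectures/MachineLearning-Lecture{number:02d}.pdf"
    # Remove the lecture reference so it does not influence the embedding.
    remaining = (question[:match.start()] + question[match.end():]).strip()
    return StructuredQuery(query=remaining, filter={"source": source})

sq = toy_self_query("what did they say about regression in the third lecture?")
print(sq.query)
print(sq.filter)
```

An LLM handles phrasings that no fixed pattern would anticipate, which is why the real retriever delegates this step to a model.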

In [20]:
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

In [21]:
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The lecture the chunk is from, should be one of `docs/cs229_lectures/MachineLearning-Lecture01.pdf`, `docs/cs229_lectures/MachineLearning-Lecture02.pdf`, or `docs/cs229_lectures/MachineLearning-Lecture03.pdf`",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page from the lecture",
        type="integer",
    ),
]

**Note:** The default model for `OpenAI` (`from langchain.llms import OpenAI`) is `text-davinci-003`. Because OpenAI deprecated `text-davinci-003` on 4 January 2024, the commented-out OpenAI line below uses OpenAI's recommended replacement model, `gpt-3.5-turbo-instruct`. This notebook itself runs on `QianfanChatEndpoint` instead.

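`SelfQueryRetriever` relies on Lark to parse the filter expression produced by the LLM, which is why the install cell below is needed. Lark is a Python library for building parsers that produce parse trees and abstract syntax trees. It can parse any context-free grammar, which covers the syntax of most modern programming languages.

Lark offers two main parsing algorithms, Earley and LALR, each with its own strengths. The Earley parser can handle any context-free grammar, including left-recursive ones. The LALR parser is faster and uses less memory, but it handles only a subset of context-free grammars.

Lark also provides a concise grammar-definition language that makes it simple to define new grammars, along with tools such as error handling and tree transformers that make it easier to deal with parse errors and to process the parsed result.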

In [22]:
#!pip install lark


In [23]:
document_content_description = "Lecture notes"
# llm = OpenAI(model='gpt-3.5-turbo-instruct', temperature=0)
llm = QianfanChatEndpoint(model="ERNIE-Bot-4", temperature=0.01)
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectordb,
    document_content_description,
    metadata_field_info,
    verbose=True
)

In [24]:
question = "what did they say about regression in the third lecture?"

**You will receive a warning** about `predict_and_parse` being deprecated the first time you execute the next line. This can be safely ignored.

In [25]:
docs = retriever.get_relevant_documents(question)

  warn_deprecated(


In [26]:
for d in docs:
    print(d.metadata)

{'page': 2, 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
{'page': 15, 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
{'page': 3, 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}
{'page': 2, 'source': 'docs/cs229_lectures/MachineLearning-Lecture03.pdf'}


### Additional tricks: compression

Another approach for improving the quality of retrieved docs is compression.

Information most relevant to a query may be buried in a document with a lot of irrelevant text. 

Passing that full document through your application can lead to more expensive LLM calls and poorer responses.

Contextual compression is meant to fix this. 
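The idea can be sketched without an LLM: keep only the sentences of a retrieved document that relate to the question. The keyword-overlap rule below is a deliberately crude stand-in for an LLM-based extractor; the function names, the stopword list, and the sample text are all invented for this example.

```python
import re

# Small, hypothetical stopword list so common words do not count as matches.
STOPWORDS = {"what", "did", "they", "say", "about", "the", "a", "an", "is", "of", "to", "in"}

def keywords(text):
    """Lowercased word set with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}

def compress(document, question):
    """Keep only the sentences that share at least one keyword with the question."""
    wanted = keywords(question)
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    return " ".join(s for s in sentences if keywords(s) & wanted)

# Made-up document: only the middle sentence is relevant to the question.
doc = (
    "The homework is due next week. "
    "MATLAB makes it easy to write code using matrices. "
    "Office hours are on Friday."
)
print(compress(doc, "what did they say about matlab?"))
```

Only the relevant sentence would be passed on to the final LLM call, which is the cost and quality benefit that compression aims for.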

In [27]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

In [28]:
def pretty_print_docs(docs):
    """Print each document's text, numbered and separated by a dashed rule."""
    separator = f"\n{'-' * 100}\n"
    print(separator.join(f"Document {i+1}:\n\n{d.page_content}" for i, d in enumerate(docs)))


In [29]:
# Build the LLM-based compressor that will wrap our vectorstore retriever
# llm = OpenAI(temperature=0, model="gpt-3.5-turbo-instruct")
llm = QianfanChatEndpoint(model="ERNIE-Bot-4", temperature=0.01)
compressor = LLMChainExtractor.from_llm(llm)

In [30]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)

In [31]:
question = "what did they say about matlab?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)



Document 1:

MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to plot data. And it's sort of an extremely easy to learn tool to use for implementing a lot of learning algorithms.
----------------------------------------------------------------------------------------------------
Document 2:

MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to plot data. And it's sort of an extremely easy to learn tool to use for implementing a lot of learning algorithms.


## Combining various techniques

In [32]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type="mmr")
)

In [33]:
question = "what did they say about matlab?"
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)






## Other types of retrieval

It's worth noting that a vectordb is not the only kind of tool for retrieving documents.

The `LangChain` retriever abstraction includes other ways to retrieve documents, such as TF-IDF or SVM.
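For intuition about how a TF-IDF retriever ranks text without any embeddings, here is a small pure-Python sketch. It scores each text by the summed TF-IDF weight of the query terms it contains. This is a simplified illustration and not the implementation behind `TFIDFRetriever`; the helper names, the smoothed IDF formula chosen here, and the sample texts are assumptions.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_rank(query, texts):
    """Return indices of texts, best match first, by summed TF-IDF of query terms."""
    tokenized = [tokenize(t) for t in texts]
    n_docs = len(tokenized)
    # Number of texts each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(tokenize(query))

    def score(tokens):
        counts = Counter(tokens)
        total = 0.0
        for term in query_terms:
            if term in counts:
                tf = counts[term] / len(tokens)
                # Smoothed inverse document frequency: rarer terms weigh more.
                idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
                total += tf * idf
        return total

    scores = [score(tokens) for tokens in tokenized]
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)

# Made-up texts standing in for lecture chunks.
texts = [
    "MATLAB makes matrix code easy to write",
    "The class covers supervised learning",
    "Learning theory comes later in the class",
]
print(tfidf_rank("matlab", texts))
print(tfidf_rank("learning theory", texts))
```

Because scoring depends only on word overlap, this kind of retriever needs no embedding API calls, at the cost of missing paraphrases that share no terms with the query.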

In [34]:
from langchain.retrievers import SVMRetriever
from langchain.retrievers import TFIDFRetriever
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [39]:
# Load PDF
loader = PyPDFLoader("docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()
all_page_text = [p.page_content for p in pages]
joined_page_text = " ".join(all_page_text)

# Split
# text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=999, chunk_overlap=150)
splits = text_splitter.split_text(joined_page_text)


In [40]:
# Retrieve
svm_retriever = SVMRetriever.from_texts(splits, embedding)
tfidf_retriever = TFIDFRetriever.from_texts(splits)

[ERROR] [05-06 15:39:56] openapi_requestor.py:246 [t:41304]: api request req_id:  failed with error code: 18, err msg: Open api qps request limit reached, please check https://cloud.baidu.com/doc/WENXINWORKSHOP/s/tlmyncueh

APIError: api return error, req_id:  code: 18, msg: Open api qps request limit reached

In [None]:
question = "What are major topics for this class?"
docs_svm = svm_retriever.get_relevant_documents(question)
docs_svm[0]

In [None]:
question = "what did they say about matlab?"
docs_tfidf = tfidf_retriever.get_relevant_documents(question)
docs_tfidf[0]