<a href="https://colab.research.google.com/github/st20080675/Advanced-Retrieval-With-LangChain/blob/main/Advanced_Retrieval_With_LangChain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Retrieval With LangChain

Let's go over a few more complex and advanced retrieval methods with LangChain.

There is no one right way to retrieve data. It'll depend on your application, so take some time to think about it before you jump in.

Let's have some fun.

* **Multi Query** - Given a single user query, use an LLM to synthetically generate multiple other queries. Use each one of the new queries to retrieve documents, take the union of those documents for the final context of your prompt
* **Contextual Compression** - Fluff remover. Normal retrieval but with an extra step of pulling out relevant information from each returned document. This makes each relevant document smaller for your final prompt (which increases information density)
* **Parent Document Retriever** - Split and embed *small* chunks (for maximum information density), then return the parent documents (or larger chunks) those small chunks come from
* **Ensemble Retriever** - Combine multiple retrievers together
* **Self-Query** - The retriever infers filters from a user's query and applies those filters to the underlying data

In [None]:
# from dotenv import load_dotenv
# import os

# load_dotenv()

# openai_api_key=os.getenv('OPENAI_API_KEY', 'YourAPIKey')

## Load up our texts and documents

Then chunk them, and put them into a vector store

In [2]:
!pip install langchain --upgrade

Collecting langchain
  Downloading langchain-0.1.12-py3-none-any.whl (809 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain)
  Downloading dataclasses_json-0.6.4-py3-none-any.whl (28 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain)
  Downloading jsonpatch-1.33-py2.py3-none-any.whl (12 kB)
Collecting langchain-community<0.1,>=0.0.28 (fro

In [4]:
from langchain.document_loaders import DirectoryLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

We're going to load up Paul Graham's essays. In this repo there are folders of various sizes (`PaulGrahamEssaysSmall`, `PaulGrahamEssaysMedium`, `PaulGrahamEssaysLarge`, or `PaulGrahamEssays` for the full set).

Download the data from [here](https://github.com/gkamradt/langchain-tutorials/tree/main/data/PaulGrahamEssaysLarge) first. For how to download a single subfolder from a Git repo, see [here](https://stackoverflow.com/questions/7106012/download-a-single-folder-or-directory-from-a-github-repo/38879691#38879691).

In [9]:
!pip install unstructured

Collecting unstructured
  Downloading unstructured-0.12.6-py3-none-any.whl (1.8 MB)
Collecting backoff==2.2.1 (from unstructured)
  Downloading backoff-2.2.1-py3-none-any.whl (15 kB)
Collecting dataclasses-json-speakeasy==0.5.11 (from unstructured)
  Downloading dataclasses_json_speakeasy-0.5.11-py3-none-any.whl (28 kB)
Collecting emoji==2.10.1 (from unstructured)
  Downloading emoji-2.10.1-py2.py3-none-any.whl (421 kB)
Collecting filetype==1.2.0 (from unstructured)
  Downloading filetype-1.2.0-py2.py3-none-any.whl (19 kB)
Collecting jsonpath-python==1.0.6 (from unstructured)
  Downloading jsonpath_python-1.0.6-py3-none-any.whl (7.6 kB)
Collecting langdetect==1.0.9 (from unstructured)
  Downloading langdetect-1.0.9.tar.gz (981 kB)

In [7]:
# loader = DirectoryLoader('../data/PaulGrahamEssaysLarge/', glob="**/*.txt", show_progress=True)
loader = DirectoryLoader('/content/PaulGrahamEssaysLarge/', glob="**/*.txt", show_progress=True)

docs = loader.load()

100%|██████████| 49/49 [00:24<00:00,  1.99it/s]


In [8]:
print (f"You have {len(docs)} essays loaded")

You have 49 essays loaded


Then we'll split our text into smaller chunks

In [9]:
# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=0)
splits = text_splitter.split_documents(docs)

print (f"Your {len(docs)} documents have been split into {len(splits)} chunks")

Your 49 documents have been split into 468 chunks


In [12]:
!pip install InstructorEmbedding
!pip install sentence-transformers==2.2.2
!pip install faiss-cpu
!pip install chromadb

Collecting chromadb
  Downloading chromadb-0.4.24-py3-none-any.whl (525 kB)
Collecting chroma-hnswlib==0.7.3 (from chromadb)
  Downloading chroma_hnswlib-0.7.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.4 MB)
Collecting fastapi>=0.95.2 (from chromadb)
  Downloading fastapi-0.110.0-py3-none-any.whl (92 kB)
Collecting uvicorn[standard]>=0.18.3 (from chromadb)
  Downloading uvicorn-0.29.0-py3-none-any.whl (60 kB)
Collecting posthog>=2.4.0 (from chromadb)
  Downloading posthog-3.5.0-py2.

In [13]:
from langchain_community.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores import FAISS

# embedding = OpenAIEmbeddings()
embedding = HuggingFaceInstructEmbeddings()

if 'vectordb' in globals(): # If you've already made your vectordb this will delete it so you start fresh
    vectordb.delete_collection()

# the following line took 44m for me; consider using a subset of the docs
vectordb = Chroma.from_documents(documents=splits, embedding=embedding)

load INSTRUCTOR_Transformer
max_seq_length  512


### MultiQuery

This retrieval method will generate 3 additional questions, for a total of 4 queries (including the user's original), that will be used to retrieve documents. This is helpful when you want to retrieve documents which are similar in meaning to your question.
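
To make the "union of results" idea concrete, here is a minimal sketch in plain Python. It is not LangChain's implementation: `CORPUS`, `keyword_retriever`, and `multi_query_retrieve` are hypothetical names, the retriever is a toy word-overlap matcher, and the extra queries are hand-written stand-ins for what an LLM would generate.

```python
from typing import Callable, Dict, List

# Toy corpus: doc id -> text (hypothetical data, not the essays).
CORPUS: Dict[str, str] = {
    "a": "release a first version fast and listen to users",
    "b": "startups are counterintuitive in their early stages",
    "c": "investors prefer founders who move fast",
}

def keyword_retriever(query: str) -> List[str]:
    """Hypothetical retriever: return ids of docs sharing a word with the query."""
    words = set(query.lower().split())
    return [doc_id for doc_id, text in CORPUS.items() if words & set(text.split())]

def multi_query_retrieve(queries: List[str], retrieve: Callable[[str], List[str]]) -> List[str]:
    """Run every query, then take the order-preserving union of the results."""
    seen = set()
    union = []
    for query in queries:
        for doc_id in retrieve(query):
            if doc_id not in seen:
                seen.add(doc_id)
                union.append(doc_id)
    return union

queries = [
    "early stages startups",           # the user's question
    "how to release a first version",  # stand-ins for LLM-generated variants
    "what do investors prefer",
]
docs_found = multi_query_retrieve(queries, keyword_retriever)
print(docs_found)
```

Each query on its own only finds one document here; the union is what widens the context.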

In [14]:
from langchain.chat_models import ChatOpenAI
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain.prompts import PromptTemplate
# Set logging for the queries
import logging

Doing some logging to see the other questions that were generated. I tried to find a way to get these via a model property but couldn't, lmk if you find a way!

In [15]:
logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

Then we set up the MultiQueryRetriever which will generate other questions for us

In [30]:
# this OpenAI version gives me: RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota

# !pip install openai
# from langchain.chat_models import ChatOpenAI

# openai_api_key = os.getenv('OPENAI_API_KEY', 'YourAPIKey')  # read the key from the environment, never hard-code it
# llm = ChatOpenAI(openai_api_key = openai_api_key, temperature=0)
# question = "What is the authors view on the early stages of a startup?"
# retriever_from_llm = MultiQueryRetriever.from_llm(
#     retriever=vectordb.as_retriever(), llm=llm
# )

In [32]:
question = "What is the authors view on the early stages of a startup?"
# llm = ChatOpenAI(temperature=0)
from langchain import HuggingFaceHub
import os

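# Read the token from the environment rather than hard-coding it in the notebook
huggingface_api_key = os.getenv("HUGGINGFACEHUB_API_TOKEN", "YourHuggingFaceToken")
os.environ["HUGGINGFACEHUB_API_TOKEN"] = huggingface_api_key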

llm=HuggingFaceHub(repo_id="google/flan-t5-xxl", model_kwargs={"temperature":0.7, "max_length":512})

retriever_from_llm = MultiQueryRetriever.from_llm(
    retriever=vectordb.as_retriever(), llm=llm
)

In [33]:
unique_docs = retriever_from_llm.get_relevant_documents(query=question)

INFO:langchain.retrievers.multi_query:Generated queries: ["What is the author's view on the early stages of a startup?"]


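With a capable LLM, the log above lists several questions that are related to, but slightly different from, the question I asked. In this run `flan-t5-xxl` only echoed the original question (with the apostrophe fixed), so just one query was actually used.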

Let's see how many docs were actually returned

In [34]:
len(unique_docs)

4

Ok now let's put those docs into a prompt template which we'll use as context

In [35]:
prompt_template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Answer:"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)
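
A side note on filling `{context}`: the cells in this notebook pass the list of documents straight in, which puts the list's printed representation (metadata and escaped newlines included) into the prompt. One alternative is to join only the text of each document. The sketch below assumes a tiny stand-in class `Doc` and a helper `format_context`; both are hypothetical and only mimic the `page_content` attribute.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document (hypothetical, for illustration)."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)

def format_context(docs: List[Doc]) -> str:
    """Join only the text of each doc, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)

template = "{context}\n\nQuestion: {question}\nAnswer:"
example_docs = [Doc("Release early."), Doc("Listen to users.")]

prompt_text = template.format(
    context=format_context(example_docs),
    question="What should a startup do first?",
)
print(prompt_text)
```

This usually gives a shorter prompt, which matters when the model has a small input limit.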

In [39]:
# llm.predict(text=PROMPT.format_prompt(
#     context=unique_docs,
#     question=question
# ).text)

llm.predict(text=PROMPT.format_prompt(
    context=unique_docs[:1],
    question=question
).text)

"get a version 1 out fast, then improve it based on users' reactions"

### Contextual Compression

Then we'll move on to contextual compression. This takes the chunks you've made (above) and compresses their information down to the parts relevant to your query.

Say you have a chunk with 3 topics in it, but you only really care about one of them. The compressor will look at your query, see that you only need one of the 3 topics, then extract and return that one topic.

This one is a bit more expensive because each returned doc gets processed an additional time (to pull out the relevant data).
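
As a rough picture of the "extract the relevant topic" step, here is a toy sketch. It swaps the LLM for a crude word-overlap filter, so treat it as an illustration of the idea rather than how `LLMChainExtractor` works; `compress` and the sample chunk are made up for this example.

```python
import re
from typing import List

def compress(text: str, query: str) -> str:
    """Keep only the sentences that share a word with the query.

    A crude keyword filter standing in for the LLM-based extractor.
    """
    query_words = set(re.findall(r"\w+", query.lower()))
    sentences: List[str] = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = [s for s in sentences if query_words & set(re.findall(r"\w+", s.lower()))]
    return " ".join(kept)

chunk = (
    "Startups should release early. "
    "The weather in Pittsburgh is terrible. "
    "Good investors also supply advice."
)
compressed = compress(chunk, "when should startups release")
print(compressed)
```

The chunk covers three topics, and only the sentence that overlaps with the query survives.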

In [40]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

We first need to set up our compressor. It's cool that it's a separate object because that means you can use it outside this retriever as well.

In [41]:
# llm = ChatOpenAI(temperature=0, model='gpt-4')

compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(base_compressor=compressor,
                                                       base_retriever=vectordb.as_retriever())

First, an example of compression. Below we have one of our splits that we made above

In [44]:
splits[0].page_content

'Aaron Swartz created a scraped\n\nfeed\n\nof the essays page.'

Now we are going to pass a question to it and with that question we will compress the doc. The cool part is this doc will be contextually compressed, meaning the resulting document will only have the information relevant to the question.

In [45]:
compressor.compress_documents(documents=[splits[0]], query="test for what you like to do")



[Document(page_content='Aaron Swartz created a scraped feed', metadata={'source': '/content/PaulGrahamEssaysLarge/rss.txt'})]

Great, the document came back shorter, with denser information. This is handy for getting rid of the fluff. Let's try it out on our essays.

In [46]:
question = "What is the authors view on the early stages of a startup?"
compressed_docs = compression_retriever.get_relevant_documents(question)



In [47]:
print (len(compressed_docs))
compressed_docs

4


[Document(page_content="1. Release Early.The thing I probably repeat most is this recipe for a startup: get a version 1 out fast, then improve it based on users' reactions.", metadata={'source': '/content/PaulGrahamEssaysLarge/startuplessons.txt'}),
 Document(page_content="Startups are very counterintuitive. I'm not sure why. Maybe it's just because knowledge about them hasn't permeated our culture yet. But whatever the reason, starting a startup is a task where you can't always trust your instincts.", metadata={'source': '/content/PaulGrahamEssaysLarge/before.txt'}),
 Document(page_content="Almost everyone's initial plan is broken. If companies stuck to their initial plans, Microsoft would be selling programming languages, and Apple would be selling printed circuit boards. In both cases their customers told them what their business should be-- and they were smart enough to listen.", metadata={'source': '/content/PaulGrahamEssaysLarge/startuplessons.txt'}),
 Document(page_content="it's

We now have 4 docs but they are shorter and only contain the information that is relevant to our query.

Let's put it in our prompt template again.

In [48]:
prompt_template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Answer:"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

In [49]:
llm.predict(text=PROMPT.format_prompt(
    context=compressed_docs,
    question=question
).text)

'They are counterintuitive and often fail.'

### Parent Document Retriever

[LangChain documentation](https://python.langchain.com/docs/modules/data_connection/retrievers/parent_document_retriever) does a great job describing this - my minor edits below:

When you split your docs, you generally may want to have small documents, so that their embeddings can most accurately reflect their meaning. If too long, then the embeddings can lose meaning.

But at the same time you may want to have information around those small chunks to keep context of the longer document.

The ParentDocumentRetriever strikes that balance by splitting and storing small chunks of data. During retrieval, it first fetches the small chunks but then looks up the parent ids for those chunks and returns those larger documents.

Note that "parent document" refers to the document that a small chunk originated from. This can either be the whole raw document OR a larger chunk.
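
Here is a small sketch of that child-to-parent lookup in plain Python. It is only an illustration under simplifying assumptions: `split_text` is a naive fixed-width splitter, matching is a substring check instead of embedding similarity, and all names (`parents`, `children`, `retrieve_parents`) are hypothetical.

```python
from typing import Dict, List, Tuple

def split_text(text: str, size: int) -> List[str]:
    """Naive fixed-width splitter (the notebook uses a recursive splitter instead)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

parents: Dict[str, str] = {
    "p1": "Release early and often. Listen closely to your users.",
    "p2": "Investors tend to follow other investors. Raise money fast.",
}

# Index small child chunks, each tagged with the id of its parent.
children: List[Tuple[str, str]] = [
    (parent_id, chunk)
    for parent_id, text in parents.items()
    for chunk in split_text(text, 20)
]

def retrieve_parents(query: str) -> List[str]:
    """Match on child chunks, then return each matching parent once."""
    hit_ids: List[str] = []
    for parent_id, chunk in children:
        if query.lower() in chunk.lower() and parent_id not in hit_ids:
            hit_ids.append(parent_id)
    return [parents[parent_id] for parent_id in hit_ids]

results = retrieve_parents("investors")
print(results)
```

Two child chunks of `p2` match the query, but the parent text comes back only once.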

In [50]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore

In [51]:
# This text splitter is used to create the child documents. They should be small chunk size.
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

In [53]:
# The vectorstore to use to index the child chunks

# vectorstore = Chroma(
#     collection_name="return_full_documents",
#     embedding_function=OpenAIEmbeddings()
# )

vectorstore = Chroma(
    collection_name="return_full_documents",
    embedding_function=HuggingFaceInstructEmbeddings()
)

load INSTRUCTOR_Transformer
max_seq_length  512


In [54]:
# The storage layer for the parent documents
store = InMemoryStore()

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

Now we will add the whole essays (`docs`, not the `splits` we made earlier). These essays haven't been chunked yet, but `.add_documents` will do the small chunking for us with the `child_splitter` above.

In [55]:
# this line of code took 1h 12m
retriever.add_documents(docs, ids=None)

Now if we query the vector store directly, we'll get the small chunks back

In [56]:
sub_docs = vectorstore.similarity_search("what is some investing advice?")

In [58]:
sub_docs

[Document(page_content="advising at Y Combinator, I would have said: Stop being so stressed\n\nout, because you're doing fine. You're growing 7x a year. Just don't\n\nhire too many more people and you'll soon be profitable, and then\n\nyou'll control your own destiny.Alas I hired lots more people, partly because our investors wanted\n\nme to, and partly because that's what startups did during the", metadata={'doc_id': '4a66829a-c7d3-4274-88de-e5afa891ab56', 'source': '/content/PaulGrahamEssaysLarge/worked.txt'}),
 Document(page_content="pay attention. Anyone who's been here any amount of time knows not\n\nto default to skepticism, no matter how inexperienced you seem or\n\nhow unpromising your idea sounds at first, because they've all seen\n\ninexperienced founders with unpromising sounding ideas who a few\n\nyears later were billionaires.Having people around you care about what you're doing is an", metadata={'doc_id': '1431e978-7c0a-4c3f-99eb-37289e711294', 'source': '/content/PaulGra

Look how small those chunks are. Now we want to get the parent doc which those small docs are a part of.

In [59]:
retrieved_docs = retriever.get_relevant_documents("what is some investing advice?")

I'm only going to show the first doc to save space, but there are more waiting for you. Keep in mind that LangChain returns the union of parent docs, so if two child docs come from the same parent doc, you'll only get that parent doc back once, not twice.

In [60]:
retrieved_docs[0].page_content[:1000]

'February 2021Before college the two main things I worked on, outside of school,\n\nwere writing and programming. I didn\'t write essays. I wrote what\n\nbeginning writers were supposed to write then, and probably still\n\nare: short stories. My stories were awful. They had hardly any plot,\n\njust characters with strong feelings, which I imagined made them\n\ndeep.The first programs I tried writing were on the IBM 1401 that our\n\nschool district used for what was then called "data processing."\n\nThis was in 9th grade, so I was 13 or 14. The school district\'s\n\n1401 happened to be in the basement of our junior high school, and\n\nmy friend Rich Draves and I got permission to use it. It was like\n\na mini Bond villain\'s lair down there, with all these alien-looking\n\nmachines \x97 CPU, disk drives, printer, card reader \x97 sitting up\n\non a raised floor under bright fluorescent lights.The language we used was an early version of Fortran. You had to\n\ntype programs on punch card

However, here we got the full document back. Sometimes this will be too long and we actually just want a larger chunk instead. Let's do that.

Notice the chunk size difference between the parent splitter and child splitter.

In [62]:
# This text splitter is used to create the parent documents
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)

# This text splitter is used to create the child documents
# It should create documents smaller than the parent
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

# The vectorstore to use to index the child chunks
# vectorstore = Chroma(collection_name="return_split_parent_documents", embedding_function=OpenAIEmbeddings())
vectorstore = Chroma(collection_name="return_split_parent_documents", embedding_function=HuggingFaceInstructEmbeddings())

# The storage layer for the parent documents
store = InMemoryStore()

load INSTRUCTOR_Transformer
max_seq_length  512


This will set up our retriever for us

In [63]:
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

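Now this time when we add documents two things will happen:
1. Larger chunks - We'll split our docs into large (parent) chunks and keep them in the docstore
2. Smaller chunks - We'll split those parent chunks into smaller (child) chunks and embed them in the vector store

Each small chunk carries the `doc_id` of its parent, so a match on a small chunk can be mapped back to the larger one.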

In [65]:
print(len(docs))

49


In [66]:
# retriever.add_documents(docs)

# the below line took 3m
retriever.add_documents(docs[:5])

Let's check how many parent chunks are in the docstore now

In [67]:
len(list(store.yield_keys()))

25

Then let's go get our small chunks to make sure it's working and see how long they are

In [68]:
sub_docs = vectorstore.similarity_search("what is some investing advice?")
sub_docs

[Document(page_content="them to make their own investment decisions. Most are only allowed\n\nto invest in deals where some reputable private VC firm is willing\n\nto act as lead investor.Not BuildingsIf you go to see Silicon Valley, what you'll see are buildings.\n\nBut it's the people that make it Silicon Valley, not the buildings.", metadata={'doc_id': '410337f2-8e9e-402f-a8c7-0618a1f88b6f', 'source': '/content/PaulGrahamEssaysLarge/siliconvalley.txt'}),
 Document(page_content="there's no one to invest in them.Not BureaucratsDo you really need the rich people? Wouldn't it work to have the\n\ngovernment invest in the nerds? No, it would not. Startup investors\n\nare a distinct type of rich people. They tend to have a lot of\n\nexperience themselves in the technology business. This (a) helps\n\nthem pick the right startups, and (b) means they can supply advice", metadata={'doc_id': '410337f2-8e9e-402f-a8c7-0618a1f88b6f', 'source': '/content/PaulGrahamEssaysLarge/siliconvalley.txt'}),


Now, let's do the full process, we'll see what small chunks are generated, but then return the larger chunks as our relevant documents

In [69]:
larger_chunk_relevant_docs = retriever.get_relevant_documents("what is some investing advice?")
larger_chunk_relevant_docs[0]

Document(page_content="list, the University of Washington yielded a high-tech community\n\nin Seattle, and the University of Texas at Austin yielded one in\n\nAustin. But what happened in Pittsburgh? And in Ithaca, home of\n\nCornell, which is also high on the list?I grew up in Pittsburgh and went to college at Cornell, so I can\n\nanswer for both. The weather is terrible,  particularly in winter,\n\nand there's no interesting old city to make up for it, as there is\n\nin Boston. Rich people don't want to live in Pittsburgh or Ithaca.\n\nSo while there are plenty of hackers who could start startups,\n\nthere's no one to invest in them.Not BureaucratsDo you really need the rich people? Wouldn't it work to have the\n\ngovernment invest in the nerds? No, it would not. Startup investors\n\nare a distinct type of rich people. They tend to have a lot of\n\nexperience themselves in the technology business. This (a) helps\n\nthem pick the right startups, and (b) means they can supply advice\n\

In [71]:
print(len(larger_chunk_relevant_docs))

2


In [72]:
prompt_template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Answer:"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

question = "what is some investing advice?"

# Input validation error: `inputs` must have less than 1024 tokens. Given: 1146
# llm.predict(text=PROMPT.format_prompt(
#     context=larger_chunk_relevant_docs,
#     question=question
# ).text)

llm.predict(text=PROMPT.format_prompt(
    context=larger_chunk_relevant_docs[0],
    question=question
).text)

'Startup investors are a distinct type of rich people. They tend to have a lot of experience themselves in the technology business.'
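
The hosted model rejected the full context because of its input limit, which is why only the first doc is passed above. One way to handle this more generally is to keep adding docs until a budget is reached. The sketch below counts whitespace-separated words as a rough proxy; real token counts depend on the model's tokenizer, and `fit_to_budget` is a hypothetical helper.

```python
from typing import List

def fit_to_budget(texts: List[str], max_words: int) -> List[str]:
    """Keep texts in order until adding the next one would exceed max_words."""
    kept: List[str] = []
    used = 0
    for text in texts:
        n_words = len(text.split())
        if used + n_words > max_words:
            break
        kept.append(text)
        used += n_words
    return kept

chunks = ["one two three", "four five", "six seven eight nine"]
trimmed = fit_to_budget(chunks, max_words=5)
print(trimmed)
```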

### Ensemble Retriever

The next one on our list combines multiple retrievers together. The goal here is to see what multiple methods return, then pull them together for (hopefully) better results.

You may need to install bm25 with `!pip install rank_bm25`

In [75]:
!pip install rank_bm25

Collecting rank_bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Installing collected packages: rank_bm25
Successfully installed rank_bm25-0.2.2


In [76]:
from langchain.retrievers import BM25Retriever, EnsembleRetriever

We'll use a [BM25 retriever](https://en.wikipedia.org/wiki/Okapi_BM25) for this one which is really good at keyword matching (vs semantic). When you combine this method with regular semantic search it's known as hybrid search.
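
How do you merge two ranked lists? One common approach is weighted Reciprocal Rank Fusion, which the LangChain docs describe as the basis for `EnsembleRetriever`. The sketch below is a simplified version, not the library's code: the function name, the doc ids, and the constant `c=60` are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List

def weighted_rrf(rankings: List[List[str]], weights: List[float], c: int = 60) -> List[str]:
    """Fuse ranked lists: each doc scores weight / (c + rank) per list it appears in."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc_a", "doc_b"]   # e.g. from a BM25-style retriever
semantic_hits = ["doc_b", "doc_c"]  # e.g. from a vector-store retriever

fused = weighted_rrf([keyword_hits, semantic_hits], weights=[0.5, 0.5])
print(fused)
```

A doc that shows up in both lists (`doc_b`) collects score from each, so it rises to the top of the fused ranking.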

In [77]:
# initialize the bm25 retriever
bm25_retriever = BM25Retriever.from_documents(splits)
bm25_retriever.k = 2

In [79]:
# you have probably built 'vectordb' already
# embedding = OpenAIEmbeddings()
# embedding = HuggingFaceInstructEmbeddings()
# vectordb = Chroma.from_documents(splits, embedding)
# keep 'vectordb' pointing at the vector store; give the retriever its own name
chroma_retriever = vectordb.as_retriever(search_kwargs={"k": 2})

In [80]:
# initialize the ensemble retriever
ensemble_retriever = EnsembleRetriever(retrievers=[bm25_retriever, chroma_retriever], weights=[0.5, 0.5])

In [81]:
ensemble_docs = ensemble_retriever.get_relevant_documents("what is some investing advice?")
len(ensemble_docs)

4

In [84]:
prompt_template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Answer:"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

question = "what is some investing advice?"
# Input validation error: `inputs` must have less than 1024 tokens. Given: 1793
# llm.predict(text=PROMPT.format_prompt(
#     context=ensemble_docs,
#     question=question
# ).text)

llm.predict(text=PROMPT.format_prompt(
    context=ensemble_docs[:2],
    question=question
).text)

'I spent almost a decadenninvesting in early stage startups, and curiously enough protectingnnyourself against obsolete beliefs is exactly what you have to donnto succeed as a startup investor. Most really good startup ideasnnlook like bad ideas at first, and many of those look bad specificallynnbecause some change in the world just switched them from bad to good. I spent a lot of time learning to'

### Self Querying

The last one we'll look at today is self querying. This is when the retriever has the ability to query itself. It does this so it can use filters when doing its final query.

This means it'll use the user's query for semantic search, but also its own query for filtering (so the user doesn't have to give a structured filter).
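
To picture what "its own query" looks like, here is a toy sketch of a structured query (search text, metadata filter, limit) applied to an in-memory list. In the real retriever an LLM produces the structured query and the vector store applies it; here `StructuredQuery`, `run_query`, and the sample docs are all hypothetical, and the semantic search is replaced by a crude keyword check.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class StructuredQuery:
    """Hypothetical container for what the LLM extracts from the user's text."""
    query: str                   # text for the semantic search
    filter_attr: Optional[str]   # metadata field to filter on
    filter_value: Optional[str]  # required value of that field
    limit: Optional[int]         # max number of docs to return

def run_query(sq: StructuredQuery, docs: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Apply the metadata filter, then a crude keyword match, then the limit."""
    hits = [
        d for d in docs
        if (sq.filter_attr is None or d.get(sq.filter_attr) == sq.filter_value)
        and any(w in d["text"].lower() for w in sq.query.lower().split())
    ]
    return hits[: sq.limit] if sq.limit is not None else hits

toy_docs = [
    {"source": "worked.txt", "text": "Seed investment advice from founders"},
    {"source": "island.txt", "text": "Books are insurance on a trip"},
    {"source": "worked.txt", "text": "More investment stories"},
]
sq = StructuredQuery(query="investment advice", filter_attr="source",
                     filter_value="worked.txt", limit=1)
selected = run_query(sq, toy_docs)
print(selected)
```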

You may need to install lark with `!pip install lark`

In [95]:
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

# you probably already have 'embedding' and 'llm' in the env
# embeddings = OpenAIEmbeddings()
# llm = ChatOpenAI(temperature=0, model='gpt-4')

In [None]:
# if 'vectorstore' in globals(): # If you've already made your vectordb this will delete it so you start fresh
#     vectorstore.delete_collection()

# vectorstore = Chroma.from_documents(
#     splits, embeddings
# )

Below is the information on the filters available. This helps the model know which metadata fields it can filter on.

In [96]:
metadata_field_info=[
    AttributeInfo(
        name="source",
        description="The filename of the essay",
        type="string or list[string]",
    ),
]

In [103]:
!pip uninstall lark
!pip install lark-parser

Found existing installation: lark 1.1.9
Uninstalling lark-1.1.9:
  Would remove:
    /usr/local/lib/python3.10/dist-packages/lark-1.1.9.dist-info/*
    /usr/local/lib/python3.10/dist-packages/lark/*
  Would not remove (might be manually added):
    /usr/local/lib/python3.10/dist-packages/lark/parsers/lalr_puppet.py
Proceed (Y/n)? Y
  Successfully uninstalled lark-1.1.9


In [105]:
!pip install lark

Collecting lark
  Using cached lark-1.1.9-py3-none-any.whl (111 kB)
Installing collected packages: lark
Successfully installed lark-1.1.9


In [106]:
document_content_description = "Essays from Paul Graham"
retriever = SelfQueryRetriever.from_llm(llm,
                                        vectorstore,
                                        document_content_description,
                                        metadata_field_info,
                                        verbose=True,
                                        enable_limit=True)

ImportError: Cannot import lark, please install it with 'pip install lark'.

In [None]:
retriever.get_relevant_documents("Return only 1 essay. What is one thing you can do to figure out what you like to do from source '../data/PaulGrahamEssaysLarge/island.txt'")

query='figure out what you like to do' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='source', value='../data/PaulGrahamEssaysLarge/island.txt') limit=1


[Document(page_content="if I could only figure out what.As for books, I know the house would probably have something to\n\nread.  On the average trip I bring four books and only read one of\n\nthem, because I find new books to read en route.  Really bringing\n\nbooks is insurance.I realize this dependence on books is not entirely good—that what\n\nI need them for is distraction.  The books I bring on trips are\n\noften quite virtuous, the sort of stuff that might be assigned\n\nreading in a college class.  But I know my motives aren't virtuous.\n\nI bring books because if the world gets boring I need to be able\n\nto slip into another distilled by some writer.  It's like eating\n\njam when you know you should be eating fruit.There is a point where I'll do without books.  I was walking in\n\nsome steep mountains once, and decided I'd rather just think, if I\n\nwas bored, rather than carry a single unnecessary ounce.  It wasn't\n\nso bad.  I found I could entertain myself by having ideas

It's kind of annoying to have to put in the full file name; a user doesn't want to do that. Let's change `source` to `essay` and replace the file path with the essay name.

In [None]:
import re

for split in splits:
    split.metadata['essay'] = re.search(r'[^/]+(?=\.\w+$)', split.metadata['source']).group()
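
To see what that regex pulls out, here is a quick standalone check on two of the paths that appear in this notebook. The wrapper function `essay_name` is just a hypothetical helper for the demo.

```python
import re

def essay_name(source_path: str) -> str:
    """Return the file name without its directory or extension."""
    return re.search(r'[^/]+(?=\.\w+$)', source_path).group()

name = essay_name('/content/PaulGrahamEssaysLarge/worked.txt')
print(name)  # worked
print(essay_name('../data/PaulGrahamEssaysLarge/island.txt'))  # island
```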

Ok now that we did that, let's make a new field info config

In [None]:
metadata_field_info=[
    AttributeInfo(
        name="essay",
        description="The name of the essay",
        type="string or list[string]",
    ),
]

In [None]:
if 'vectorstore' in globals(): # If you've already made your vectordb this will delete it so you start fresh
    vectorstore.delete_collection()

# 'embedding' is the embedding object created earlier in this notebook
vectorstore = Chroma.from_documents(
    splits, embedding
)

In [None]:
document_content_description = "Essays from Paul Graham"
retriever = SelfQueryRetriever.from_llm(llm,
                                        vectorstore,
                                        document_content_description,
                                        metadata_field_info,
                                        verbose=True,
                                        enable_limit=True)

In [None]:
retriever.get_relevant_documents("Tell me about investment advice the 'worked' essay? return only 1")

query='investment advice' filter=Comparison(comparator=<Comparator.EQ: 'eq'>, attribute='essay', value='worked') limit=1


[Document(page_content='should make a larger number of smaller investments instead of a\n\nhandful of giant ones, they should be funding younger, more technical\n\nfounders instead of MBAs, they should let the founders remain as\n\nCEO, and so on.One of my tricks for writing essays had always been to give talks.\n\nThe prospect of having to stand up in front of a group of people\n\nand tell them something that won\'t waste their time is a great\n\nspur to the imagination. When the Harvard Computer Society, the\n\nundergrad computer club, asked me to give a talk, I decided I would\n\ntell them how to start a startup. Maybe they\'d be able to avoid the\n\nworst of the mistakes we\'d made.So I gave this talk, in the course of which I told them that the\n\nbest sources of seed funding were successful startup founders,\n\nbecause then they\'d be sources of advice too. Whereupon it seemed\n\nthey were all looking expectantly at me. Horrified at the prospect\n\nof having my inbox flooded by b

Awesome! It returned the doc for us. It's a bit rigid because you need to put in the exact name of the file/essay you want. You could add a pre-step that infers the correct essay from the user's input, but that is out of scope for now and application specific.