# Top-K Similarity Search - Ask A Book A Question

In this tutorial we will walk through a simple example of basic retrieval via Top-K similarity search.

In [1]:
!pip install langchain --upgrade
# Originally written against 0.0.164; the run below installed 0.1.0

# !pip install pypdf
!pip install beautifulsoup4
!pip install lxml
!pip install chromadb
!pip install unstructured

Collecting langchain
  Downloading langchain-0.1.0-py3-none-any.whl.metadata (13 kB)
Collecting SQLAlchemy<3,>=1.4 (from langchain)
  Downloading SQLAlchemy-2.0.25-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.6 kB)
Collecting aiohttp<4.0.0,>=3.8.3 (from langchain)
  Downloading aiohttp-3.9.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.4 kB)
Collecting async-timeout<5.0.0,>=4.0.0 (from langchain)
  Downloading async_timeout-4.0.3-py3-none-any.whl.metadata (4.2 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain)
  Downloading dataclasses_json-0.6.3-py3-none-any.whl.metadata (25 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain)
  Downloading jsonpatch-1.33-py2.py3-none-any.whl.metadata (3.0 kB)
Collecting langchain-community<0.1,>=0.0.9 (from langchain)
  Downloading langchain_community-0.0.9-py3-none-any.whl.metadata (7.3 kB)
Collecting langchain-core<0.2,>=0.1.7 (from langchain)
  Downloading langchain_core-0.1.7-py3-none-

In [2]:
# Text Loader
#from langchain.document_loaders import TextLoader
from langchain_community.document_loaders import DirectoryLoader, UnstructuredXMLLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
#from bs4 import BeautifulSoup
from dotenv import load_dotenv
import os

load_dotenv()

True

### Load your data

Next let's load some data. A few loaders are listed below for different sources, so use whichever suits you. The active one reads every XML file matching `M*.xml` under `./hmmwv280_allxml/`. Creating the loader only stages it; nothing is read until `.load()` is called.

In [2]:
#temptext = ''

#for file in os.listdir('./'):
#    if file.endswith('.xml'):
#        with open(file) as f:
#            soup = BeautifulSoup(f, 'xml')
#            temptext += soup.get_text()

#f = open('./results.txt', 'w')
#f.write(temptext)
#f.close()

#loader = TextLoader(file_path="./results.txt")
loader = DirectoryLoader('./hmmwv280_allxml/', glob="**/M*.xml")

## Other options for loaders 
# loader = PyPDFLoader("../data/field-guide-to-data-science.pdf")
# loader = UnstructuredPDFLoader("../data/field-guide-to-data-science.pdf")
# loader = OnlinePDFLoader("https://wolfpaulus.com/wp-content/uploads/2017/05/field-guide-to-data-science.pdf")

Then let's go ahead and actually load the data.

In [3]:
data = loader.load()

Then let's check what has been loaded.

In [16]:
# Note: If you're using PyPDFLoader then it will split by page for you already
print(f'You have {len(data)} document(s) in your data')
print(f'There are {len(data[0].page_content)} characters in your sample document')
print(f'Here is a sample: {data[0].page_content[:200]}')

You have 1063 document(s) in your data
There are 2742 characters in your sample document
Here is a sample: TACOM

Warren

MI

JRS

AMG

Livonia

MI

TM 9-2320-280-13&P

UOC_A13 UOC_A14 UOC_A24 UOC_A25 UOC_A26 UOC_A27 UOC_AVY UOC_B17 UOC_B18 UOC_B24 UOC_B25 UOC_BVY UOC_C17 UOC_H13 UOC_H14 UOC_H17 UOC_H18 UO


### Chunk your data up into smaller documents

While we could pass the entire document set to a long-context model, we want to be picky about which information we share with it. The better the signal-to-noise ratio, the more likely we are to get the right answer.

The first thing we'll do is chunk up our document into smaller pieces. The goal will be to take only a few of those smaller pieces and pass them to the LLM.

In [5]:
# We'll split our data into chunks of at most 1000 characters each, with a 50 character overlap. These are relatively small.

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)
texts = text_splitter.split_documents(data)

In [17]:
# Let's see how many small chunks we have
print(f'Now you have {len(texts)} documents')

Now you have 4437 documents
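
To make the size and overlap parameters concrete, here is a minimal pure-Python sketch of fixed-window chunking with overlap. It is a simplified illustration only: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, which this sketch does not attempt. The helper name `chunk_text` and the sample string are hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Hypothetical 2500-character sample so the window sizes are easy to check
sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(sample)
print([len(c) for c in chunks])
```

The last 50 characters of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is still seen whole in at least one chunk.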


### Create embeddings of your documents to get ready for semantic search

Next up we need to prepare for similarity searches. The way we do this is through embedding our documents (getting a vector per document).

This will help us compare documents later on.

In [18]:
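# Import from langchain_community, matching the loader import earlier in this notebook
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OpenAIEmbeddings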

Check whether an environment variable holds your API key; if not, fall back to the value you put below.

In [19]:
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY', 'YourAPIKey')

Then we'll get our embeddings engine going. You can use whatever embeddings engine you would like. We'll use OpenAI's ada today.

In [20]:
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

### Option #1: Chroma (for local)

I like Chroma because it's local and easy to set up without an account.

First we'll pass our texts to Chroma via `.from_documents`. This will 1) embed the documents to get a vector for each, then 2) add them to the vectorstore for retrieval later.

In [22]:
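# Load the chunks into Chroma and persist the store to disk.
# Pass the chunked `texts` (not the raw `data`) so the vector store holds the small chunks.
persist_directory = "chroma_db"
vectorstore = Chroma.from_documents(texts, embeddings, persist_directory=persist_directory)
vectorstore.persist()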

RateLimitError: Error code: 429 - {'error': {'message': 'Request too large for text-embedding-ada-002 in organization org-mbQajeY2at64UfsO2XAIz5rY on tokens per min (TPM): Limit 1000000, Requested 1119064. The input or output tokens must be reduced in order to run successfully. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}
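
The 429 above reports a request of 1,119,064 tokens against a limit of 1,000,000 tokens per minute. One possible way around a cap like this is to embed in smaller batches. The sketch below groups chunks under a token budget. It rests on assumptions: the 4-characters-per-token estimate is a rough heuristic rather than a real tokenizer, and `estimate_tokens`, `batch_by_token_budget`, and the budget value are hypothetical names and numbers chosen for illustration.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token
    return max(1, len(text) // 4)


def batch_by_token_budget(chunks: list[str], max_tokens_per_batch: int) -> list[list[str]]:
    """Group chunks so the estimated tokens in each batch stay within the budget."""
    batches, current, current_tokens = [], [], 0
    for chunk in chunks:
        tokens = estimate_tokens(chunk)
        # Start a new batch if adding this chunk would exceed the budget
        if current and current_tokens + tokens > max_tokens_per_batch:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(chunk)
        current_tokens += tokens
    if current:
        batches.append(current)
    return batches


# Toy chunks: estimated at 100, 100, 100 and 25 tokens
toy_chunks = ["a" * 400, "b" * 400, "c" * 400, "d" * 100]
batches = batch_by_token_budget(toy_chunks, max_tokens_per_batch=200)
print([len(batch) for batch in batches])
```

Each batch could then be added to the vector store separately (for example with its `add_documents` method), pausing between batches so the per-minute total stays under the limit.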

Let's test it out. I want to see which documents are most closely related to a query.



In [65]:
query = "Who makes the power steering pump that is used in this vehicle?"
docs = vectorstore.similarity_search(query)

Then we can check them out. In theory, the texts deemed most similar should hold the answer to our question.
Keep in mind that our query just happens to be a question. It could be any statement or sentence and the search would still work.
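
To show what "most similar" means mechanically, here is a small sketch of top-K ranking over toy vectors. It uses cosine similarity for illustration; the metric Chroma actually applies depends on how the collection is configured. The three-dimensional vectors and their labels are made up, whereas real embeddings have far more dimensions. LangChain's `similarity_search` returns 4 documents by default, which is the `k` used as the default here.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return the k (score, doc_id) pairs with the highest similarity to the query."""
    scored = ((cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items())
    return heapq.nlargest(k, scored)


# Made-up 3-dimensional "embeddings" for illustration only
toy_vectors = {
    "steering pump": [0.9, 0.1, 0.0],
    "fuel tank": [0.1, 0.9, 0.0],
    "wheel lugs": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.0, 0.1]

for score, doc_id in top_k(query_vec, toy_vectors, k=2):
    print(f"{doc_id}: {score:.3f}")
```

The query text never has to be a question: whatever it is, it is embedded into a vector and ranked against the stored vectors in the same way.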

In [14]:
# Print every document that was returned
for doc in docs:
    print(f"{doc.page_content}\n")

pressure is supplied by the power steering pump.HYDRO-BOOSTER - Converts hydraulic power from the power steering pump to mechanical power to the master cylinder and provides power assist during braking.WINCH - (Optional) Hydraulically-actuated by pressure from hydro-booster to control operation of winch. Hydraulic pressure is supplied by the power steering pump.STEERING GEAR - Converts hydraulic power from power steering pump to mechanical power at pitman arm.POWER STEERING COOLER - Directs

hydraulic power from the steering pump to mechanical power to the master cylinder, providing power assist during braking.ACCESSORY DRIVE PULLEY BELTS - Transmits mechanical driving power from crankshaft drive pulley to steering pump pulley which drives the steering pump.POWER STEERING COOLER - Directs power steering fluid through a series of fins or baffles so outside air can dissipate excess heat before the fluid is recirculated through the steering system.OIL RESERVOIR AND POWER STEERING PUMP

Co

### Option #2: Pinecone (for cloud)
If you want to use Pinecone, run the code below; if not, skip this section and stick with Chroma from Option #1 above. You must go to [Pinecone.io](https://www.pinecone.io/) and set up an account.

In [14]:
# PINECONE_API_KEY = os.getenv('PINECONE_API_KEY', 'YourAPIKey')
# PINECONE_API_ENV = os.getenv('PINECONE_API_ENV', 'us-east1-gcp') # You may need to switch with your env

# # initialize pinecone
# pinecone.init(
#     api_key=PINECONE_API_KEY,  # find at app.pinecone.io
#     environment=PINECONE_API_ENV  # next to api key in console
# )
# index_name = "langchaintest" # put in the name of your pinecone index here

# docsearch = Pinecone.from_texts([t.page_content for t in texts], embeddings, index_name=index_name)

### ~~Create an Index to ensure that records are not duplicated~~ I don't really understand indexing yet

Initialize a record manager with an appropriate namespace.

You can use any namespace you want, but it is suggested to take into account both the vectorstore and the collection name in the vector store; e.g. 'redis/my_docs' or 'chroma/my_docs'.

In [None]:
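from langchain.indexes import SQLRecordManager

namespace = "chromadb/general_info"
# Assumption: a local SQLite file backs the record manager; adjust db_url as needed
record_manager = SQLRecordManager(namespace, db_url="sqlite:///record_manager_cache.sql")
record_manager.create_schema()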
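
The idea of avoiding duplicate records can be sketched without any library: hash each chunk's content, remember the hashes, and skip anything already seen. This is a simplified illustration of the concept, not LangChain's record manager implementation. The helper names `content_key` and `filter_new` are hypothetical, and an in-memory set stands in for a persistent store.

```python
import hashlib


def content_key(namespace: str, text: str) -> str:
    """Build a stable key from the namespace and a hash of the text."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"{namespace}:{digest}"


def filter_new(namespace: str, texts: list[str], seen_keys: set[str]) -> list[str]:
    """Return only texts whose key has not been recorded yet, recording them as we go."""
    fresh = []
    for text in texts:
        key = content_key(namespace, text)
        if key not in seen_keys:
            seen_keys.add(key)
            fresh.append(text)
    return fresh


seen = set()  # stand-in for a persistent record store
first = filter_new("chromadb/general_info", ["WASHER,FLAT", "TANK,FUEL,ENGINE"], seen)
second = filter_new("chromadb/general_info", ["WASHER,FLAT", "NUT,SELF-LOCKING"], seen)
print(first)
print(second)
```

On the second call only the text that was not seen before comes back, so re-running an ingestion step would not embed and store the same chunk twice.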

### Query those docs to get your answer back

Great, those are just the docs which should hold our answer. Now we can pass those to a LangChain chain to query the LLM.

We could do this manually, but a chain is a convenient helper for us.
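
Doing it manually would look roughly like the sketch below: concatenate the retrieved texts into one prompt and append the question, which is the idea behind the "stuff" chain type. The prompt wording and the helper name `build_stuff_prompt` are hypothetical and are not LangChain's actual template. The two context strings are taken from the search output shown earlier.

```python
def build_stuff_prompt(doc_texts: list[str], question: str) -> str:
    """Stuff all retrieved texts into a single prompt followed by the question."""
    context = "\n\n".join(doc_texts)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


retrieved = [
    "Hydraulic pressure is supplied by the power steering pump.",
    "STEERING GEAR - Converts hydraulic power from power steering pump "
    "to mechanical power at pitman arm.",
]
question = "Who makes the power steering pump that is used in this vehicle?"
prompt = build_stuff_prompt(retrieved, question)
print(prompt)
```

The resulting string is what would be sent to the LLM. Because every retrieved chunk is included, the approach only works while the combined text fits in the model's context window.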

In [7]:
from langchain_community.chat_models import ChatOpenAI
from langchain.chains.question_answering import load_qa_chain

In [10]:
llm = ChatOpenAI(temperature=0, model='gpt-4-1106-preview', openai_api_key=OPENAI_API_KEY)
chain = load_qa_chain(llm, chain_type="stuff")

In [11]:
query = "What is the proper torque value for the wheel lugs?"
docs = vectorstore.similarity_search(query)
chain.run(input_documents=docs, question=query)

NameError: name 'vectorstore' is not defined

In [93]:
print(docs)

[Document(page_content='19207\n\nTANK,FUEL,ENGINE\n\nBASIC\n\n00001\n\nEA\n\nA11,A13,A14,A15,A20,A24,A25,A26,A27,AVY,B16,B17,B18,H11,H13,H14,H15,H16,H17,H18,H20,H21,H24,H25,H26,H27,H28,HVY,MMM\n\n16\n\n5310\n\n01-102-3270\n\n2436161\n\n24617\n\nWASHER,FLAT\n\n1/4\n\n00009\n\nEA\n\nA11,A13,A14,A15,A20,A24,A25,A26,A27,AVY,B16,B17,B18,H11,H13,H14,H15,H16,H17,H18,H20,H21,H24,H25,H26,H27,H28,HVY,MMM\n\n16\n\n5310\n\n01-548-1269\n\nM45913/4-4CG8Z\n\n81349\n\nNUT,SELF-LOCKING,HE\n\n1/4-20\n\n00012\n\nEA', metadata={'source': 'hmmwv280_allxml/R00001.xml'}), Document(page_content='19207\n\nTANK,FUEL,ENGINE\n\nBASIC\n\n00001\n\nEA\n\nA11,A13,A14,A15,A20,A24,A25,A26,A27,AVY,B16,B17,B18,H11,H13,H14,H15,H16,H17,H18,H20,H21,H24,H25,H26,H27,H28,HVY,MMM\n\n16\n\n5310\n\n01-102-3270\n\n2436161\n\n24617\n\nWASHER,FLAT\n\n1/4\n\n00009\n\nEA\n\nA11,A13,A14,A15,A20,A24,A25,A26,A27,AVY,B16,B17,B18,H11,H13,H14,H15,H16,H17,H18,H20,H21,H24,H25,H26,H27,H28,HVY,MMM\n\n16\n\n5310\n\n01-548-1269\n\nM45913/4-4CG8Z\

Awesome! We just went and queried an external data source!