# Document Question Answering with local persistence

This notebook shows how to use Chroma and LangChain to answer questions over documents, backed by a database persisted to local disk.
Embeddings and documents are stored once and can be reloaded in later sessions.

In [2]:
%pip install langchain openai chromadb

Note: you may need to restart the kernel to use updated packages.


In [3]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.chains import VectorDBQA
#from langchain.document_loaders import TextLoader
from langchain.document_loaders import UnstructuredMarkdownLoader

In [4]:
%pip install unstructured > /dev/null

Note: you may need to restart the kernel to use updated packages.


## Load and process documents

Load the documents to answer questions over. To run this on your own documents, replace this section.

Next, split the documents into chunks so that only the chunks most relevant to a query are retrieved and passed to the LLM.

## Initialize PersistedChromaDB

Create an embedding for each chunk and insert the chunks into the Chroma vector database. The `persist_directory` argument tells Chroma where to store the database on disk when it is persisted.

In [5]:
markdown_path = "qa-test.md"
loader = UnstructuredMarkdownLoader(markdown_path)
documents = loader.load()
documents

[Document(page_content='speaker_name,text\nAlexandra Downing,Currently I am the competitive robotics program manager in my district.\nAlexandra Downing,"So I am connected to 230 teams across our district and I help they\'re all competitive teams from third grade all the way to 12th grade between Vex, IQ and VRC."\nAlexandra Downing,So managing that and working with the teams and communicating is part of my job.\nAlexandra Downing,And it would be very helpful if there was some sort of platform that I could use with Vex that could help me just communicate a little bit easier about the Vex stuff that we do because we are going to be adding to enhance our program.\nAlexandra Downing,We are also going to be adding vex one to three and vex go.\nAlexandra Downing,Starting with some camps this summer.\nAlexandra Downing,So I would really like to see that enhance our program.\nAlexandra Downing,I love Vex and I love what you guys have done.\nAlexandra Downing,I just think it might be a little b

In [6]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=2)
texts = text_splitter.split_documents(documents)
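
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-width splitting with overlap. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally (that class prefers to break on separators such as paragraph and line boundaries); the function name `split_with_overlap` is hypothetical and exists only for illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters, where each window
    repeats the last chunk_overlap characters of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=2)
print(chunks)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

Each chunk starts with the last two characters of the previous one. With `chunk_size=2000` and `chunk_overlap=2`, as in the cell above, neighbouring chunks share almost no context, so a sentence that straddles a boundary may be split between two chunks.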

In [7]:
# Embed and store the texts
# Supplying a persist_directory will store the embeddings on disk
persist_directory = 'db3'

embedding = OpenAIEmbeddings()
vectordb = Chroma.from_documents(documents=texts, embedding=embedding, persist_directory=persist_directory)
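
Conceptually, the vector store keeps one vector per chunk and answers a query by returning the chunks whose vectors are closest to the query's vector. The sketch below illustrates that idea with a toy bag-of-words "embedding" and cosine similarity. Both are stand-ins chosen for illustration: `OpenAIEmbeddings` produces dense learned vectors, and the distance metric Chroma uses may differ. The helper names `embed`, `cosine`, and `top_k` are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "students gained confidence in the robotics club",
    "we need grants to buy more robotics kits",
    "the cafeteria menu changes every week",
]
best = top_k("grants for robotics kits", chunks, k=1)
print(best)  # ['we need grants to buy more robotics kits']
```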

## Persist the Database

In a notebook, call `persist()` explicitly to ensure the embeddings are written to disk.
This isn't necessary in a script, where the database is persisted automatically when the client object is destroyed.

In [8]:
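# Write the embeddings to disk
vectordb.persist()
# Drop the in-memory reference so the next section has to reload from disk
vectordb = None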

## Load the Database from disk, and create the chain

Be sure to pass the same `persist_directory` and the same embedding object (here via `embedding_function`) that you used when creating the database. Then initialize the chain used for question answering.

In [9]:
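# Now we can load the persisted database from disk, and use it as normal.
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)
# RetrievalQA supersedes the deprecated VectorDBQA; it takes a retriever rather than the vector store itself.
qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=vectordb.as_retriever())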
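
Because the database lives on disk, a later session can skip the embedding step when the directory is already populated. The following is a small sketch of that check. The helper `has_persisted_db` is hypothetical (it is not part of Chroma or LangChain), and it only tests that the directory is non-empty, not that its contents are a valid database.

```python
import tempfile
from pathlib import Path


def has_persisted_db(persist_directory: str) -> bool:
    """Return True if the directory exists and contains at least one entry."""
    path = Path(persist_directory)
    return path.is_dir() and any(path.iterdir())


# Demonstrate on a throwaway directory rather than the real database.
with tempfile.TemporaryDirectory() as tmp:
    empty_result = has_persisted_db(tmp)            # nothing written yet
    (Path(tmp) / "index.bin").write_bytes(b"placeholder")
    populated_result = has_persisted_db(tmp)        # now there is a file
print(empty_result, populated_result)  # False True

# Hypothetical usage with the objects defined in this notebook:
# if has_persisted_db(persist_directory):
#     vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)
# else:
#     vectordb = Chroma.from_documents(documents=texts, embedding=embedding,
#                                      persist_directory=persist_directory)
```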



## Ask questions!

Now we can use the chain to ask questions!

In [10]:
query = "Summarize the entirety of the speakers' comments in no more than two paragraphs."
qa.run(query)

' Bernie Contreras shared how his students had grown in terms of confidence and attitude due to their participation in a robotics program. He also mentioned the local recognition they received. Speaker Valdez discussed the importance of having enough resources and supplies in the classroom to support students and teachers in robotics clubs. He also mentioned the positive shift in behavior for students who participate in robotics clubs and the need to reach out to those who are in charge of the purse strings. Finally, Dallas Librarian mentioned the need for more resources to help teachers with robotics in the classroom, as well as grants that can be applied for to purchase more robotics kits. They also mentioned the need for more information about the VEX code VR and the coding associated with it.'
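
Note that a request to summarize the "entirety" of the comments is answered from only part of the file. With `chain_type="stuff"`, the chain places the retrieved chunks, not the whole document, into a single prompt. The sketch below shows the idea, assuming four retrieved chunks (which I believe is the default, but check your LangChain version). The prompt wording and the function `build_stuff_prompt` are illustrative, not LangChain's actual template.

```python
def build_stuff_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Concatenate the retrieved chunks into one context block and append the question."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following pieces of context to answer the question at the end.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


all_chunks = [f"chunk {i}" for i in range(10)]  # pretend the file was split into 10 chunks
retrieved = all_chunks[:4]                      # stand-in for the top-4 hits from the vector store
prompt = build_stuff_prompt("Summarize the comments.", retrieved)
print(prompt)
```

Chunks 4 through 9 never reach the model in this sketch, which is why a summary produced this way reflects only the speakers who appear in the retrieved chunks.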

In [12]:
query = "give 3 good examples where teachers describe how vex products have sparked interest in broader edu subjects' comments."
qa.run(query)

' Speaker Valdez described how, after joining a robotics club, students had a newfound strength and ability and were proud in the hallway. Tori Aldridge mentioned that robotics classes have kept student interested in coming to school and Zach Kavanaugh mentioned how Vex PD Plus has been helpful for everyone.'

In [13]:
query = "What is the perception of Vex VR?"
qa.run(query)

' The perception of Vex VR is largely positive, with participants noting that the program is sufficient, that it is accessible to students with special education needs and those from poverty, and that it can be used across multiple curriculums.'

In [14]:
query = "What is the perception of Vex PD+?"
qa.run(query)

' Jeff Long perceives Vex PD+ to have a lot of information and step-by-step instructions that can be difficult to digest and streamline, but generally helpful.'

## Cleanup

When you're done with the database, you can delete it from disk. You can delete the specific collection you're working with (if you have several), or delete the entire database by nuking the persistence directory.

In [29]:
# # To cleanup, you can delete the collection
# vectordb.delete_collection()
# vectordb.persist()

# # Or just nuke the persist directory (this notebook used persist_directory = 'db3')
# !rm -rf db3/
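
The `!rm -rf` shell command only works where a Unix-like shell is available. A portable alternative is Python's standard `shutil.rmtree`, sketched below. The wrapper `remove_persist_directory` is a hypothetical helper, and the demonstration runs against a temporary directory so that it cannot touch a real database.

```python
import shutil
import tempfile
from pathlib import Path


def remove_persist_directory(persist_directory: str) -> bool:
    """Delete the directory tree if it exists; return True if something was removed."""
    path = Path(persist_directory)
    if not path.is_dir():
        return False
    shutil.rmtree(path)
    return True


# Demonstrate on a throwaway directory rather than the real database.
with tempfile.TemporaryDirectory() as tmp:
    scratch = Path(tmp) / "db_demo"
    scratch.mkdir()
    (scratch / "index.bin").write_bytes(b"placeholder")
    first = remove_persist_directory(str(scratch))   # removes the directory
    second = remove_persist_directory(str(scratch))  # nothing left to remove
print(first, second)  # True False
```

To delete the database from this notebook, you would call `remove_persist_directory(persist_directory)` after releasing any open `vectordb` object.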