# Document Question Answering

An example of using Chroma DB and LangChain to do question answering over documents.

In [44]:
%pip install langchain openai chromadb



In [45]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders.csv_loader import CSVLoader
# Loaders for other source formats; not used below
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.document_loaders import JSONLoader


## Load documents

Load the documents to run question answering over. To use your own documents, replace this section with a loader that matches your file type.

In [46]:
%pip install tiktoken

# Load the CSV file; the documents are split into chunks in the next section
loader = CSVLoader(file_path='./sent4.csv')
documents = loader.load()
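
To get a feel for what the loader hands to the rest of the pipeline, here is a rough, standard-library-only sketch of turning each CSV row into one document. It assumes, as an approximation of how `CSVLoader` behaves, that each row becomes a block of `column: value` lines with the source and row index kept as metadata. The `rows_to_documents` helper and the sample columns are hypothetical; they are not part of LangChain or of `sent4.csv`.

```python
import csv
import io

def rows_to_documents(csv_text, source="example.csv"):
    """Turn each CSV row into a dict with page_content and metadata (illustrative only)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    docs = []
    for i, row in enumerate(reader):
        # One "column: value" line per field, joined into a single text block
        content = "\n".join(f"{col}: {val}" for col, val in row.items())
        docs.append({"page_content": content, "metadata": {"source": source, "row": i}})
    return docs

# Made-up sample data, used only to show the shape of the result
sample = "speaker,comment\nA,Needs more coaching resources\nB,Uses VEX IQ in class\n"
docs = rows_to_documents(sample)
print(docs[0]["page_content"])
```

Because every row ends up as its own document, short rows will usually pass through the splitter unchanged.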



## Split documents

Split the documents into small chunks so that only the chunks most relevant to a query are retrieved and passed to the LLM.

In [47]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
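
As a rough illustration of what `chunk_size` and `chunk_overlap` control, the sketch below cuts a string into fixed-size windows. This is a deliberate simplification: `RecursiveCharacterTextSplitter` tries to break on separators such as paragraphs and lines before falling back to smaller pieces, which this hypothetical `chunk_text` helper does not attempt.

```python
def chunk_text(text, chunk_size, chunk_overlap=0):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

no_overlap = chunk_text("abcdefghij", chunk_size=4)
with_overlap = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(no_overlap)    # ['abcd', 'efgh', 'ij']
print(with_overlap)  # ['abcd', 'defg', 'ghij', 'j']
```

With `chunk_overlap=0`, as used in this notebook, neighbouring chunks share no text, so a sentence that straddles a boundary is split between two chunks.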

## Initialize ChromaDB

Create an embedding for each chunk and insert it into the Chroma vector database.

In [48]:
# Embed and store the texts
# Supplying a persist_directory will store the embeddings on disk
persist_directory = 'db2'

embedding = OpenAIEmbeddings()
vectordb = Chroma.from_documents(documents=texts, embedding=embedding, persist_directory=persist_directory)

In [49]:
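# Write the embeddings to disk, then drop the in-memory reference so the
# next section can show the database being reloaded from persist_directory
vectordb.persist()
vectordb = None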
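
Conceptually, a vector store keeps one embedding per chunk and answers a query by returning the chunks whose embeddings are closest to the query's embedding. The toy sketch below shows that idea with hand-written 3-dimensional vectors and cosine similarity. Everything in it is an assumption for illustration: the vectors are made up, `top_k` is a hypothetical helper, and the distance metric Chroma actually uses depends on how the collection is configured.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=2):
    """Return the ids of the k stored vectors most similar to query_vec."""
    ranked = sorted(store.items(),
                    key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:k]]

# Made-up embeddings standing in for the output of an embedding model
store = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.7, 0.7, 0.0],
    "chunk-c": [0.0, 0.0, 1.0],
}
nearest = top_k([0.9, 0.1, 0.0], store, k=2)
print(nearest)  # ['chunk-a', 'chunk-b']
```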

## Create the chain

Initialize the chain we will use for question answering.

In [52]:
# Now we can load the persisted database from disk, and use it as normal.
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)
# RetrievalQA takes a retriever rather than a vector store; passing vectorstore=vectordb
# raises a ValidationError ("retriever: field required", "vectorstore: extra fields not permitted").
qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=vectordb.as_retriever())
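
The `"stuff"` chain type places all retrieved chunks into a single prompt and sends that prompt to the LLM in one call. A minimal sketch of the idea follows; the `build_stuff_prompt` helper and the prompt wording are hypothetical and are not LangChain's actual template.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate all retrieved chunks into one context block ahead of the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Made-up chunks standing in for what a retriever might return
retrieved = ["Chunk one text.", "Chunk two text."]
prompt = build_stuff_prompt("What do the chunks say?", retrieved)
print(prompt)
```

Since every retrieved chunk ends up in the same prompt, the chunk size and the number of chunks retrieved together have to fit within the model's context window.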

## Ask questions!

Now we can use the chain to ask questions!

In [51]:
query = "What are the most common challenges that all speakers have in common?"
qa.run(query)

' The most common challenges that all speakers have in common are the lack of resources available for coaching non-educators on programming and mechanical engineering, the need for better communication and resources for coaching, and the need for feedback mechanisms for sour experiences.'

In [22]:
query =  "which VEX products are the most used by all speakers? And how many for each?"
qa.run(query)

' The two speakers at the conference use Vex IQ, Vex Go, and Vex 1,2,3. The first speaker uses all three products, while the second speaker only uses Vex IQ.'

In [27]:
query = "what help do respondents ask for relating to their understanding of STEM?"
qa.run(query)

' Respondents ask for information and mentorship on hosting competitions, troubleshooting flowcharts, classroom kits with spare parts, simplified interfaces, resources to help increase STEM skills among their workforce, and encouragement of collaboration among teams.'

In [29]:
query = "how did teachers describe some examples of transfer learning using vex?"
qa.run(query)

" Teachers described transfer learning using Vex as providing more support and resources for schools that are just getting started with Vex, especially around setting up a workspace or workshop and managing equipment and materials, exploring ways to provide virtual support and training for teachers who are new to Vex, especially around mechanical engineering and CAD development, and learning more about Milo Zankov's experience with running an autonomous challenge in a grade eight class to better engage students and explore ethical and practical considerations."

In [30]:
query =  "do teachers want to be certified by vex?"
qa.run(query)

' It is not clear from the context if teachers want to be certified by Vex or not.'