# Document Question Answering with local persistence

An example of using Chroma DB and LangChain to do question answering over documents, with a locally persisted database. 
You can store embeddings and documents, then use them again later.

In [42]:
#pip install --upgrade pip

In [43]:
!pip install chromadb openai langchain pandasai unstructured llama-index



In [44]:
from langchain.vectorstores import Chroma
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import VectorDBQA
from langchain.document_loaders import TextLoader

openai = OpenAI(model_name="text-davinci-003")

## Load and process documents

Load the documents to do question answering over. If you want to run this over your own documents, this is the section you should replace.

Next we split documents into small chunks. This is so we can find the most relevant chunks for a query and pass only those into the LLM.
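
To make the chunking step concrete, here is a minimal plain-Python sketch of fixed-size splitting with overlap. It is not LangChain's `RecursiveCharacterTextSplitter`, which also tries to break on separators; the function name `split_with_overlap` and its parameters are assumptions made for illustration only.

```python
def split_with_overlap(text: str, chunk_size: int = 100, chunk_overlap: int = 20) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Illustrative sample text, not taken from the notebook's data
sample = "VEXcode VR lets students code a virtual robot. " * 5
chunks = split_with_overlap(sample, chunk_size=60, chunk_overlap=10)
print(len(chunks), [len(c) for c in chunks])
```

The overlap means a sentence cut at a chunk boundary still appears partly in the neighbouring chunk, which can help retrieval when the relevant text straddles two chunks.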

## Vector store-augmented text generation

This notebook walks through how to use LangChain for text generation over a vector index. This is useful if we want to generate text that is able to draw from a large body of custom text, for example, generating blog posts that have an understanding of previous blog posts written, or product tutorials that can refer to product documentation.
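
The flow described above has two steps: retrieve the chunks most similar to the topic, then place them in the prompt. The following is a rough sketch with a toy bag-of-words similarity standing in for real embeddings (the notebook itself uses OpenAI embeddings and Chroma). The names `embed`, `cosine`, `top_k`, and `build_prompt` are hypothetical and exist only for this illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding' (assumption): lowercase word counts, not a dense vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(topic: str, context_chunks: list[str]) -> str:
    """Place the retrieved chunks in the prompt ahead of the topic."""
    context = "\n".join(context_chunks)
    return f"Context: {context}\nTopic: {topic}\nAnswer:"


# Illustrative chunks, not taken from the notebook's data
chunks = [
    "The distance sensor measures how far away an object is.",
    "STEM Labs include a materials list for teachers.",
    "An if/else block chooses between two branches of code.",
]
best = top_k("how does the distance sensor work", chunks, k=1)
print(build_prompt("distance sensor", best))
```

Only the retrieved chunks reach the model, so the prompt stays small even when the indexed corpus is large.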

In [45]:
from langchain.llms import OpenAI
from langchain.docstore.document import Document
import requests
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
from langchain.prompts import PromptTemplate
import pathlib
import subprocess
import tempfile
import pandas as pd
from pandasai import PandasAI
import pprint


In [46]:
from langchain.llms import OpenAI
from langchain.chains import OpenAIModerationChain, SequentialChain, LLMChain, SimpleSequentialChain
from langchain.prompts import PromptTemplate

First, we prepare the data. For this example, we fetch a documentation site that consists of Markdown files hosted on GitHub and split them into small enough Documents.

# STEP 1: load files from directory

In [None]:
# import pandas as pd

# # Replace './data/kb_texts_for_embeds.csv' with the actual file path of your CSV file
# file_path = './data/llama/qa_community.csv'

# # Load the CSV file into a DataFrame
# dataframe = pd.read_csv(file_path)

# # Extract the 'title' and 'body' columns as separate lists
# questions = dataframe['question'].str.replace('\n', ' ').tolist()
# answers = dataframe['answer'].str.replace('\n', ' ').tolist()

# # Combine the 'title' and 'body' lists to represent each row as a separate document
# documents = [{'question': question, 'answer': answer} for question, answer in zip(questions, answers)]


In [48]:
#create csv file
# csv1 = pd.DataFrame(documents)
# csv1.to_csv('data/treated_kb_texts_for_embeds.csv')

In [71]:
# from llama_index import SimpleDirectoryReader

# reader = SimpleDirectoryReader(input_dir="data/llama/")
# docs = reader.load_data()
# print(f"Loaded {len(docs)} docs")

Loaded 3 docs


# Defining and Customizing Nodes

In [72]:
# from llama_index.node_parser import SimpleNodeParser

# parser = SimpleNodeParser()

# nodes = parser.get_nodes_from_documents(docs)

In [75]:
# import
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.vector_stores import ChromaVectorStore
from llama_index.storage.storage_context import StorageContext
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from llama_index.embeddings import LangchainEmbedding
from IPython.display import Markdown, display
import chromadb

In [78]:
from llama_index import SimpleDirectoryReader, VectorStoreIndex, ServiceContext
from llama_index.node_parser import SimpleNodeParser
# create client and a new collection
chroma_client = chromadb.EphemeralClient()
chroma_collection = chroma_client.create_collection("VEXindexLlama")

from llama_index.embeddings import OpenAIEmbedding

embed_model = OpenAIEmbedding()

documents = SimpleDirectoryReader("data/llama/").load_data()

node_parser = SimpleNodeParser.from_defaults(chunk_size=1024, chunk_overlap=20)

# set up ChromaVectorStore and load in data
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
# one service context that carries both the node parser and the embedding model
# (building it twice would silently drop the node parser)
service_context = ServiceContext.from_defaults(
    embed_model=embed_model, node_parser=node_parser
)
# build the index once, with the Chroma-backed storage context
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, service_context=service_context
)

In [79]:
query_engine = index.as_query_engine()

response = query_engine.query("how does the Distance Sensor in IQ 1st generation detect objects at a distance?")

In [80]:
print(response)


The IQ Distance Sensor (1st gen) uses an infrared beam to detect objects at a distance. The sensor measures the amount of time it takes for the beam to be reflected off of an object and then calculates the distance.


In [None]:
# create client and a new collection
chroma_client = chromadb.EphemeralClient()
chroma_collection = chroma_client.create_collection("quickstart")

# define embedding function
embed_model = LangchainEmbedding(
    HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
)

# load documents
documents = SimpleDirectoryReader(
    "../../../examples/paul_graham_essay/data"
).load_data()

# set up ChromaVectorStore and load in data
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
service_context = ServiceContext.from_defaults(embed_model=embed_model)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, service_context=service_context
)

# Query Data
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
display(Markdown(f"<b>{response}</b>"))

In [49]:
#this is the CSVLoader from langchain
from langchain.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(file_path='data/kb_texts_for_embeds.csv', source_column="title")

data = loader.load_and_split()

#print(data)

sources = data

source_chunks = []
splitter = CharacterTextSplitter(separator=".", chunk_size=1024, chunk_overlap=100)
for source in sources:
    for chunk in splitter.split_text(source.page_content):
        source_chunks.append(Document(page_content=chunk, metadata=source.metadata))

In [50]:
print(source_chunks)

[Document(page_content='title: Incorporating VEXcode VR Into Your Curriculum\nbody: VEX offers a comprehensive set of resources and curricular support to enable you to teach computer science successfully and easily using VEXcode VR. VEXcode VR educational offerings afford several levels of facilitation and scaffolding. They can be implemented individually or in combination to best match your teaching style and the needs and interests of your students. VR Computer Science Courses The Computer Science Level 1 Blocks Course, and the Computer Science Level 1 Python Course are introductory computer science courses taught using engaging, robotics-based activities in VEXcode VR. As students solve various coding challenges using the VR Robot, they learn about fundamental computer science concepts such as project flow, loops, conditions and algorithms.  VR Activities      VEXcode VR Activities are simple, student-facing, one-page student engagements that can be completed independently by studen

In [51]:
search_index = Chroma.from_documents(source_chunks, OpenAIEmbeddings())

In [60]:
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate

prompt_template = """You are a primary school teacher who teaches science, technology, mechanical engineering, physics and Computer Science. You should write a multiple-choice quiz question that has one correct answer choice and three incorrect choices. Your outputs should be on topic and in context.
    Context: {context}
    Topic: {topic}
    question-output:
    correctAnswer-output:
    incorrectAnswer-output:
    incorrectAnswer-output:
    incorrectAnswer-output:"""

PROMPT = PromptTemplate(template=prompt_template, input_variables=["context", "topic"])

llm = OpenAI(temperature=.1)
chain = LLMChain(llm=llm, prompt=PROMPT)

def generate_quiz_questions(topic):
    # Retrieve the 3 chunks most similar to the topic and write one question per chunk
    docs = search_index.similarity_search(topic, k=3)
    inputs = [{"context": doc.page_content, "topic": topic} for doc in docs]
    print(chain.apply(inputs))


In [61]:
item = "the logic of if/else statements"
generate_quiz_questions(item)

[{'text': '\n\nQ: What is the purpose of an if/else statement in a program?\nA. To assign multiple behaviors to a single button\nB. To assign behaviors to different axes on the Controller\nC. To accommodate custom builds\nD. To create a loop that will execute the same code over and over'}, {'text': '\n\nQ: What is the purpose of an if/else statement in a program?\nA. To assign multiple behaviors to a single button\nB. To assign behaviors to different axes on the Controller\nC. To accommodate custom builds with more than 4 motors\nD. To provide potential for slower project execution or lag in button response time'}, {'text': '\n\nQ: What is the purpose of an if/else statement in a program?\nA. To assign multiple behaviors to a single button\nB. To assign behaviors to different axes on the Controller\nC. To accommodate custom builds with more than 4 motors\nD. To create a loop that will execute the same code over and over'}]


# Multiple Chains

In [54]:
## from langchain
# llms = [
#     OpenAI(temperature=0),
#     Cohere(model="command-xlarge-20221108", max_tokens=20, temperature=0),
#     HuggingFaceHub(repo_id="google/flan-t5-xl", model_kwargs={"temperature": 1}),
# ]

## Create a Vector Store Index

In [None]:
from llama_index import VectorStoreIndex, ServiceContext, set_global_service_context
from llama_index.llms import OpenAI

...

# define LLM
llm = OpenAI(model="gpt-4", temperature=0, max_tokens=256)

# configure service context
service_context = ServiceContext.from_defaults(llm=llm)
set_global_service_context(service_context)

# build index
index = VectorStoreIndex.from_documents(
    documents
)

In [55]:
from llama_index import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader('data/llama').load_data()
index = VectorStoreIndex.from_documents(documents)

In [56]:
index.storage_context.persist()


In [57]:
from llama_index import StorageContext, load_index_from_storage

# rebuild storage context
storage_context = StorageContext.from_defaults(persist_dir="./storage")
# load index
index = load_index_from_storage(storage_context)

In [41]:
query_engine = index.as_query_engine()
response = query_engine.query("What materials are provided in the VEX STEM Labs?")
print(response)

# query_engine = index.as_query_engine()
# response = query_engine.query("how does the Distance Sensor in IQ 1st generation detect objects at a distance?")
# print(response)

# query_engine = index.as_query_engine()
# response = query_engine.query("what are some tips to understanding the VEX build instructions?")
# print(response)

# query_engine = index.as_query_engine()
# response = query_engine.query("What do I do with the VEX plastic construction system?")
# print(response)




Context information is below.
---------------------
The VEX STEM Labs provide a comprehensive set of resources to help teachers and students explore and understand STEM concepts. The Labs are designed to be used in a variety of settings, from the classroom to after-school programs. Each Lab includes a Materials Needed list in the Summary section that is a comprehensive guide with all the teaching and student-facing materials required to implement the Lab. The name of the material, the purpose of that material, and recommended amount of each material is included in this section, so there is no guesswork on why and when materials will be needed.

Included here are linked student-facing documents, such as the Lab 1 Image Slideshow. These links lead to editable Google Drive documents that can be shared with students as is, or edited to suit your needs.

The Master Materials List with everything that will be needed for your classroom to implement each and every STEM Lab is also provided.



[{'text': '\n\nQuestion 1 (Easy): What is the first step in troubleshooting a VEX V5 Sensor?\nA. Check the wiring\nB. Check the power source\nC. Check the programming\nD. Check the battery\n\nCorrect Answer: A. Check the wiring'}, {'text': '\n\nQuestion (Easy): What is the first step in troubleshooting a VEX IQ (1st gen) Sensor?\nA. Check the wiring\nB. Check the firmware\nC. Check the battery\nD. Check the programming\n\nCorrect Answer: A. Check the wiring'}, {'text': '\n\nQuestion (Easy): What is the first step in troubleshooting a VEX GO Sensor?\nA. Check the wiring\nB. Check the power source\nC. Check the programming\nD. Check the battery\n\nCorrect Answer: A. Check the wiring'}, {'text': '\n\nQuestion (Easy): What is the first step in troubleshooting a VEX 123 Sensor?\nA. Check the wiring\nB. Replace the sensor\nC. Check the programming\nD. Check the battery\n\nCorrect Answer: A. Check the wiring'}]


In [142]:
#dfresult = pd.DataFrame(result)
print(result)


None


In [None]:
# #splitting text into chunks of 100 characters with an overlap of 20 characters
# from langchain.text_splitter import RecursiveCharacterTextSplitter

# text_splitter = RecursiveCharacterTextSplitter(
#     # Set a really small chunk size, just to show.
#     chunk_size = 100,
#     chunk_overlap  = 20,
#     length_function = len,
#     add_start_index = True,
# )

# texts = text_splitter.create_documents(data)
# print(texts) 


## Initialize persisted ChromaDB

Create embeddings for each chunk and insert them into the Chroma vector database. The `path` argument of `chromadb.PersistentClient` tells ChromaDB where to store the database on disk (older examples pass a `persist_directory` argument to `Chroma` for the same purpose).

In [4]:
# Embed and store the chunks
# A PersistentClient stores the embeddings on disk under `path`
import chromadb

embedding = OpenAIEmbeddings()
client = chromadb.PersistentClient(path="persist_dir")

# `texts` is only defined in the commented-out cell above, so use the
# Document chunks built earlier in the notebook (`source_chunks`) instead
vectordb = Chroma.from_documents(documents=source_chunks, embedding=embedding, client=client)



## Persist the Database
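With the older `persist_directory` setup, a notebook should call `persist()` to ensure the embeddings are written to disk. That isn't necessary in a script, where the database is automatically persisted when the client object is destroyed.
With `chromadb.PersistentClient`, as used above, Chroma writes to disk as data is added, so the call below is left commented out.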

In [5]:
# vectordb.persist()
# vectordb = None

## Load the Database from disk, and create the chain
Be sure to pass the same client (or, in the older setup, the same `persist_directory`) and the same `embedding_function` as you did when you instantiated the database. Then initialize the chain we will use for question answering.
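
As a toy illustration of the persist-then-reload pattern, the sketch below saves texts with their vectors to a JSON file and queries them after reloading. A JSON file stands in for Chroma's on-disk format, and `embed`, `save_store`, `load_store`, and `search` are hypothetical helpers, not Chroma or LangChain APIs. The point it shows is that the query must be embedded with the same function that produced the stored vectors.

```python
import json
import tempfile
from pathlib import Path


def embed(text: str) -> list[float]:
    """Toy embedding (assumption): counts of a few fixed keywords."""
    words = text.lower().split()
    return [float(words.count(k)) for k in ("sensor", "lab", "code")]


def save_store(path: Path, texts: list[str]) -> None:
    """Write each text together with its vector to a JSON file."""
    records = [{"text": t, "vector": embed(t)} for t in texts]
    path.write_text(json.dumps(records))


def load_store(path: Path) -> list[dict]:
    """Read the records back; no re-embedding is needed."""
    return json.loads(path.read_text())


def search(records: list[dict], query: str) -> str:
    """Return the stored text whose vector has the largest dot product with the query."""
    q = embed(query)  # must be the same embed() that was used when saving
    return max(records, key=lambda r: sum(a * b for a, b in zip(q, r["vector"])))["text"]


with tempfile.TemporaryDirectory() as tmp:
    store_file = Path(tmp) / "store.json"
    save_store(store_file, ["the sensor detects objects", "each lab lists materials"])
    reloaded = load_store(store_file)  # a later session would start from here
    print(search(reloaded, "which lab should I start with"))
```

If the query were embedded with a different function, its vector would not be comparable with the stored ones and the ranking would be meaningless.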

In [6]:
from langchain.chains import RetrievalQA
# Now we can load the persisted database from disk, and use it as normal.
vectordb = Chroma(client=client, embedding_function=embedding)
# RetrievalQA takes a retriever rather than a vector store
qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=vectordb.as_retriever())

AttributeError: 'VectorDBQA' object has no attribute 'get'

In [None]:
# # example with a system message
# MODEL = "gpt-4"
# response = openai.ChatCompletion.create(
#     model=MODEL,
#     messages=[
#         {"role": "system", "content": "You are a high school STEM and robotics teacher. Write TWO new questions that are similar to the user's query. Each new question should have four answer options. Only one of the answer choices should be correct and three others should be incorrect. You should denote the correct answer with a '***' before the first letter of the sentence for the correct answer. Use the words question and answer as delimiters if included in the query."},
#         {"role": "user", "content": """question" : "What is the purpose of the Engage section of a VEX GO STEM Lab?" "answer" : "To introduce the Lab and make connections between students and the main concepts"""},
#     ],
#     temperature=.1,
#     n=1,
# )

# output3 = response['choices'][0]['message']['content']
# print(output3)


## Ask questions!

Now we can use the chain to ask questions!

In [8]:
query = "What did the president say about Ketanji Brown Jackson"
qa.run(query)

' The president said that Ketanji Brown Jackson is one of our nation’s top legal minds and that she will continue Justice Breyer’s legacy of excellence. He also said she is a former top litigator in private practice, a former federal public defender, and from a family of public school educators and police officers. He mentioned that she is a consensus builder and has received a broad range of support from the Fraternal Order of Police to former judges appointed by Democrats and Republicans.'

## Cleanup

When you're done with the database, you can delete it from disk. You can delete the specific collection you're working with (if you have several), or delete the entire database by nuking the persistence directory.
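
For the second option, a Python alternative to the shell `rm -rf` is `shutil.rmtree`. The sketch below creates and removes a throwaway directory; the directory and file names are illustrative assumptions, so point the call at your real persistence directory only once you are sure nothing else lives there.

```python
import shutil
import tempfile
from pathlib import Path

# Stand-in for a persistence directory (illustrative path inside a temp dir)
base = Path(tempfile.mkdtemp())
persist_path = base / "persist_dir"
persist_path.mkdir()
(persist_path / "data.bin").write_text("placeholder")

# Remove the directory and everything in it;
# ignore_errors avoids an exception if it is already gone
shutil.rmtree(persist_path, ignore_errors=True)
print(persist_path.exists())

# Tidy up the temporary parent directory as well
shutil.rmtree(base, ignore_errors=True)
```

Deleting the directory removes every collection stored in it, whereas `delete_collection()` removes only the one you are working with.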

In [10]:
# # To cleanup, you can delete the collection
# vectordb.delete_collection()
# vectordb.persist()

# # Or just nuke the persist directory
# !rm -rf persist_dir/

Persisting DB to disk, putting it in the save folder db
