# Document Question Answering with local persistence

This example uses Chroma DB and LangChain to answer questions over documents, with a locally persisted database.
Embeddings and documents are stored on disk, so they can be reloaded and queried in a later session.

In [569]:
%pip install langchain openai chromadb

Note: you may need to restart the kernel to use updated packages.


In [570]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
#from langchain.document_loaders import TextLoader
from langchain.document_loaders import UnstructuredAPIFileLoader
from langchain.document_loaders.csv_loader import CSVLoader
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.document_loaders import JSONLoader

## Load and process documents

Load the documents to run question answering over. To use your own documents, replace this section.

The documents are then split into small chunks, so that only the chunks most relevant to a query are retrieved and passed to the LLM.
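To make the chunking idea concrete, here is a minimal sketch of fixed-size character windows with optional overlap. It is an illustration only: `split_into_chunks` is a hypothetical helper, not LangChain's `RecursiveCharacterTextSplitter`, which additionally prefers to break on separators such as paragraph and line breaks.

```python
def split_into_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 0) -> list[str]:
    """Split text into fixed-size character windows (illustrative helper only)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each window starts `step` characters after the previous one,
    # so consecutive windows share `chunk_overlap` characters.
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


chunks = split_into_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij', 'j']
```

Smaller chunks make retrieval more precise but give the LLM less surrounding context per chunk; overlap is one way to soften that trade-off.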

In [571]:
import requests

url = 'https://api.fireflies.ai/graphql'

payload = {
    "query": """
        query {
            transcripts {
                title
                sentences {
                  text
                  raw_text
                  speaker_name
                }
            }
        }
    """
}

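import os

# Read the API key from the environment instead of hard-coding it in the notebook.
# FIREFLIES_API_KEY is an assumed variable name; use whatever your environment defines.
headers = {
    'Content-Type': 'application/json',
    'Authorization': f"Bearer {os.environ['FIREFLIES_API_KEY']}"
}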

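response = requests.post(url, json=payload, headers=headers)
response.raise_for_status()  # fail early on HTTP errors
result = response.json()  # named `result` to avoid shadowing the built-in `dict`

In [572]:
import numpy as np
import pandas as pd

s = pd.Series(data=result['data']['transcripts'])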

In [573]:
print(s)
type(s)

0     {'title': 'Erin Valdez 5.m4a', 'sentences': [{...
1     {'title': 'Erin Valdez 6.m4a', 'sentences': [{...
2     {'title': 'Christine Carter, Washington, DC', ...
3     {'title': 'Erin Valdez 1.m4a', 'sentences': [{...
4     {'title': 'Erin Valdez 2.m4a', 'sentences': [{...
5     {'title': 'Erin Valdez 3.m4a', 'sentences': [{...
6     {'title': 'Erin Valdez 4.m4a', 'sentences': [{...
7     {'title': 'Zach Kavanaugh.m4a', 'sentences': [...
8     {'title': 'zach kavanaugh p2.m4a', 'sentences'...
9     {'title': 'Zach Kavanaugh 3.m4a', 'sentences':...
10    {'title': 'Zach Kavanaugh.m4a', 'sentences': [...
11    {'title': 'Jeff Long, San Antonio.m4a', 'sente...
12    {'title': 'Luis Chavez, Montague, CA', 'senten...
13    {'title': 'Andy Petrie', 'sentences': [{'text'...
14    {'title': 'Milo Zanko, Adelaide, AUS', 'senten...
15    {'title': 'Peggy Myers, Upper P, Michigan', 's...
16    {'title': 'Chris Collins', 'sentences': [{'tex...
17    {'title': 'Laura Mackay, Claire Creek', 's

pandas.core.series.Series
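Each entry in the Series above is a nested dict with a `title` and a list of `sentences`. A possible way to flatten that shape into one row per sentence before writing a CSV is sketched below. The sample data and the `flatten_transcripts` helper are hypothetical, and the guard against missing `sentences` is an assumption rather than documented API behaviour.

```python
import csv
import io

# Hypothetical sample mirroring the fields requested in the GraphQL query.
sample_transcripts = [
    {
        "title": "Example interview",
        "sentences": [
            {"text": "Hello there.", "raw_text": "Hello there.", "speaker_name": "A"},
            {"text": "Hi.", "raw_text": "Hi.", "speaker_name": "B"},
        ],
    },
    {"title": "Empty transcript", "sentences": None},  # assumed edge case
]


def flatten_transcripts(transcripts):
    """Yield one flat dict per sentence, tagged with its transcript title."""
    for transcript in transcripts:
        for sentence in transcript.get("sentences") or []:
            yield {
                "title": transcript.get("title", ""),
                "speaker_name": sentence.get("speaker_name", ""),
                "text": sentence.get("text", ""),
            }


rows = list(flatten_transcripts(sample_transcripts))

# Write to an in-memory buffer; swap in open("some_file.csv", "w", newline="") for a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "speaker_name", "text"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Flat rows like these are easier to filter by speaker and to load with a CSV document loader than a column of stringified dicts.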

In [574]:
s.to_csv('booy.csv')

In [575]:
s2 = pd.read_csv('booy.csv')

In [576]:
s2.to_csv('booys2.csv')

In [577]:
import pandas as pd

# Load the quotes CSV into a DataFrame
q4 = pd.read_csv('quotes.csv')

# Drop the unwanted columns
q5 = q4.drop(['one', 'two', 'start_time', 'end_time', 'speaker_id'], axis=1)

# Exclude rows whose speaker is 'JM'
q5 = q5[q5['speaker_name'] != 'JM']

# Save and print the updated DataFrame
q5.to_csv('updated_quotes.csv', index=False)
print(q5)


      index                                               text  \
0         0                 So my goal with Stem education at.   
1         1             This point would be to incorporate it.   
2         2                              Into the science lab.   
3         3  I currently see 3rd, fourth and fifth graders ...   
4         4              Come in for hands on science lessons.   
...     ...                                                ...   
1600     42  Again, I know this is educational educator foc...   
1601     43                             I see it all the time.   
1602     44  Referee making a bad call and the parents gett...   
1603     45  Think that could be handled better from the co...   
1604     46                                              Yeah.   

                                               raw_text  \
0                    So my goal with Stem education at.   
1                This point would be to incorporate it.   
2                             

In [578]:
import json
import pandas as pd

# Create an empty list to store the JSON data
data_list = []

# Open the JSONL file for reading
with open('output.jsonl', 'r') as file:
    # Read each line of the file
    for line in file:
        # Parse the JSON data from each line
        data = json.loads(line)

        # Append the JSON data to the list
        data_list.append(data)

# Create a DataFrame from the JSON data
df = pd.DataFrame(data_list)

# Change the index from rows to columns
#df = df.transpose()

df = df.replace(r'[^\w\s]', '', regex=True)

df.to_csv('new_sm.csv', index=False)
# Print the updated DataFrame
print(df)


                                  AI meeting summary:  \
0   [A robotics coach with a background in IT and ...   
1   [Alcide Salse, representing the FDNY High Scho...   
2   [Jeff Long, a teacher at Stevenson Middle Scho...   
3   [Luis Chavez, a science teacher from Montague ...   
4   [Crystal Cap from Chattanooga Girls Leadership...   
5   [Jacob Roberts is an assistant program manager...   
6   [Mill Zankov, a teacher in Adelaide, Australia...   
7   [Chris Gingri teaches K-5 engineering lab at B...   
8   [Zach Kavanaugh teaches at Duchin, The Academy...   
9   [Chris Collins has been teaching robotics for ...   
10  [The speaker is impressed with the educational...   
11  [Laura Mackay, the Coordinator of Innovative P...   
12  [Andy Petrie discusses efforts to involve more...   
13  [Peggy Myers, a middle school teacher from Mic...   
14  [Bridget Myers, a teacher at Highland Park Mid...   
15  [A STEM educator discusses their goals of brin...   
16  [Bernie Contreras, a fifth 

In [586]:
import re
import pandas as pd

# Open the CSV file and read the data into a DataFrame
this_data = pd.read_csv('new_sm.csv')

# Section headers for the markdown file, assumed to line up in order with the CSV columns.
# Named `section_headers` so the request `headers` dict defined earlier is not overwritten.
section_headers = ['AI meeting summary', 'Outline', 'Notes', 'Action items', 'Action Items']

# Create a list to store the processed data
processed_data = []

# Strip every character that is not a word character or whitespace from each column
for column in this_data.columns:
    processed_column = this_data[column].apply(lambda x: re.sub(r"[^\w\s]", "", str(x)))
    processed_data.append(processed_column)

# Write the processed data to a new markdown file.
# zip() stops at the shorter list, so a missing column cannot raise an IndexError.
with open("processed_data.md", "w") as file:
    for header, column in zip(section_headers, processed_data):
        file.write("# " + header + "\n")
        file.write(column.to_string(index=False) + "\n\n")


In [583]:
# Load the cleaned summaries into a DataFrame
this_data = pd.read_csv('new_sm.csv')

# Create a list to store the pages
processed_pages = []

# Iterate through the columns
for column in this_data.columns:
    # Render the column's text as one string.
    # Assumption: the original processing step is missing, so no further cleaning is applied here.
    processed_text = this_data[column].to_string(index=False)

    # Add the processed text to the pages list
    processed_pages.append(processed_text)

# Write the processed pages to a new markdown file
markdown_filename = 'processed_pages.md'
with open(markdown_filename, 'w') as file:
    for page_number, page in enumerate(processed_pages, 20):
        file.write(f"Page {page_number}:\n\n")
        file.write(page)
        file.write("\n\n----------------------\n")

print(f"Markdown file '{markdown_filename}' saved.")

#     # Add the text to the pages list
#     pages.append(text)

# # Print the pages
# for page_number, page in enumerate(pages, 1):
#     print(f"Page {page_number}:\n")
#     print(page)
#     print("\n----------------------\n")


Markdown file 'processed_pages.md' saved.


## Initialize PersistedChromaDB

Create embeddings for each chunk and insert into the Chroma vector database. The `persist_directory` argument tells ChromaDB where to store the database when it's persisted. 
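As a rough picture of what a vector database does with those embeddings, the sketch below ranks stored texts by cosine similarity to a query vector. The three-dimensional vectors and the `top_k` helper are made up for illustration; real embeddings have far more dimensions, and Chroma's internal index is not implemented this way.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, store, k=2):
    """Return the k stored texts whose vectors are most similar to the query."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy (text, vector) pairs standing in for embedded chunks.
toy_store = [
    ("robotics club notes", [0.9, 0.1, 0.0]),
    ("science lab lessons", [0.1, 0.9, 0.0]),
    ("referee complaints", [0.0, 0.1, 0.9]),
]

nearest = top_k([0.8, 0.2, 0.0], toy_store, k=2)
print(nearest)  # ['robotics club notes', 'science lab lessons']
```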

In [536]:
from langchain.document_loaders.csv_loader import CSVLoader
%pip install tiktoken

loader = CSVLoader(file_path='./new_sm.csv')
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

Note: you may need to restart the kernel to use updated packages.


In [537]:
# Embed and store the texts
# Supplying a persist_directory will store the embeddings on disk
persist_directory = 'db'

embedding = OpenAIEmbeddings()
vectordb = Chroma.from_documents(documents=texts, embedding=embedding, persist_directory=persist_directory)

## Persist the Database
In a notebook, we should call `persist()` to ensure the embeddings are written to disk.
This isn't necessary in a script - the database will be automatically persisted when the client object is destroyed.

In [538]:
vectordb.persist()
vectordb = None

## Load the Database from disk, and create the chain
Be sure to pass the same `persist_directory` and `embedding_function` as you did when you instantiated the database. Initialize the chain we will use for question answering.
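The `"stuff"` chain type places all retrieved chunks into a single prompt. The sketch below shows that idea in plain Python; `build_stuff_prompt` and its prompt wording are hypothetical and are not LangChain's actual template.

```python
def build_stuff_prompt(question: str, chunks: list[str]) -> str:
    """Concatenate retrieved chunks into one context block ahead of the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt("What are the top themes?", ["chunk one", "chunk two"])
print(prompt)
```

Because every retrieved chunk ends up in one prompt, the number and size of chunks must fit within the model's context window.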

In [539]:
# Now we can load the persisted database from disk, and use it as normal.
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)
# RetrievalQA takes a `retriever`, not a `vectorstore`. Passing `vectorstore=` raises a
# ValidationError ("retriever: field required", "vectorstore: extra fields not permitted").
qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=vectordb.as_retriever())

## Ask questions!

Now we can use the chain to ask questions!

In [None]:
query = "What are the top themes in these summaries?"
qa.run(query)

" I don't know."

In [422]:
query = "tell me more about this?"
qa.run(query)

" I'm not sure what you're asking about."

## Cleanup

When you're done with the database, you can delete it from disk. You can delete the specific collection you're working with (if you have several), or delete the entire database by nuking the persistence directory.

In [10]:
# To cleanup, you can delete the collection
vectordb.delete_collection()
vectordb.persist()

# Or just nuke the persist directory
!rm -rf db/

Persisting DB to disk, putting it in the save folder db
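The `!rm -rf db/` line relies on a Unix shell. A portable alternative using only the standard library might look like the sketch below; `remove_persist_directory` is a hypothetical helper, and the demonstration runs on a throwaway directory rather than a real database.

```python
import shutil
import tempfile
from pathlib import Path


def remove_persist_directory(path) -> bool:
    """Delete the directory tree at `path`; return False if it does not exist."""
    target = Path(path)
    if not target.is_dir():
        return False
    shutil.rmtree(target)
    return True


# Demonstrate on a scratch directory so no real data is touched.
scratch_root = Path(tempfile.mkdtemp())
scratch = scratch_root / "db"
scratch.mkdir()
(scratch / "placeholder.bin").write_bytes(b"")

removed = remove_persist_directory(scratch)
print(removed, scratch.exists())  # True False

# Remove the temporary parent directory as well.
shutil.rmtree(scratch_root)
```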
