<a href="https://colab.research.google.com/github/sangsin/AniHead-2K/blob/main/persistent_qa_with_pdf_chroma.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install langchain openai chromadb tiktoken pypdf



# Document Question Answering with local persistence

An example of using Chroma DB and LangChain to do question answering over documents, with a locally persisted database. 
You can store embeddings and documents, then use them again later.

In [2]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader, PyPDFLoader

## Load and process documents

Load the documents you want to run question answering over. To use your own documents, replace this section.

The loaded documents are then split into small chunks (see "Split documents" below) so that only the chunks most relevant to a query are passed to the LLM.

In [3]:
# Load the PDF and split it into Document objects
loader = PyPDFLoader('22 Essential YouTube Statistics You Need to Know in 2023.pdf')
documents = loader.load_and_split() 

In [4]:
documents[0]

Document(page_content='1 / 922 Essential Y ouTube Statistics Y ou Need to Know in\n2023\nthesocialshepherd.com /blog/youtube-statistics\nYouTube is without a doubt, one of the longest-standing video social media platforms in\nthe world.\nSo we decided to compile some YouTube statistics that cover its usage, demographics, and\nmore. As a marketer, this information will help you to utilize it in the most efficient way\npossible in an effort to reach your audience.\nWant to keep up to date on all the latest Y ouTube platform updates,\ninsights & algorithm changes? Sign up for our weekly Monday newsletter\nhere!\nTop Y ouTube Statistics\nAs the go-to platform for sharing video content, it should come as no surprise that\nYouTube has been pretty successful since its launch. To show just how well it’s doing\nworldwide, here are some fun YouTube statistics you should know:\nWhile you may not consider YouTube to be a typical social media platform, it’s still a place\nfor people to connect with

## Split documents

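Split the documents into small chunks so that we can find the chunks most relevant to a query and pass only those to the LLM. Here each chunk is at most 500 characters long, with up to 20 characters of overlap between neighboring chunks.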

In [5]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=20)
texts = text_splitter.split_documents(documents)
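
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of overlapping chunking in plain Python. It is a simplified fixed-window splitter, not the algorithm `RecursiveCharacterTextSplitter` actually uses (which prefers to break on separators such as paragraphs and newlines). The function name `split_with_overlap` and the sample text are hypothetical.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=20):
    """Cut text into windows of at most chunk_size characters.

    Each window starts chunk_overlap characters before the previous one ended.
    Hypothetical helper for illustration only.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


# 1200 characters of made-up sample text
sample = "".join(str(i % 10) for i in range(1200))
chunks = split_with_overlap(sample)
print([len(c) for c in chunks])          # [500, 500, 240]
print(chunks[0][-20:] == chunks[1][:20])  # True
```

The overlap means a sentence that straddles a chunk boundary still appears in one piece in at least one chunk, at the cost of storing a few characters twice.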

## Initialize PersistedChromaDB

Create an embedding for each chunk and insert the embeddings into the Chroma vector database. The `persist_directory` argument tells ChromaDB where to store the database when it is persisted.

In [6]:
import os
OPENAI_API_KEY = os.environ['OPENAI_API_KEY']  # read the key from the environment; never hardcode secrets in a notebook

In [7]:
# Embed and store the texts
# Supplying a persist_directory will store the embeddings on disk
persist_directory = 'db'
embedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
vectordb = Chroma.from_documents(documents=texts, embedding=embedding, persist_directory=persist_directory)
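
The idea behind a vector store can be sketched without any external service: every chunk is turned into a vector, and a query is answered by returning the chunks whose vectors are closest to the query vector. The sketch below is an assumption-laden toy, not Chroma's implementation: `toy_embed` is a hypothetical bag-of-words stand-in for `OpenAIEmbeddings`, the search is a brute-force cosine-similarity scan, and the chunk texts are invented.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, chunks, k=2):
    """Return the k chunks most similar to the query (brute-force scan)."""
    q = toy_embed(query)
    scored = [(cosine(q, toy_embed(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]


# Invented chunks, unrelated to the PDF used in this notebook
toy_chunks = [
    "cats sleep most of the day",
    "the database stores vectors on disk",
    "dogs enjoy long walks",
]
print(top_k("where are vectors stored on disk", toy_chunks, k=1))
# ['the database stores vectors on disk']
```

A learned embedding model captures meaning rather than exact word overlap, which is why the real pipeline can match a question to a chunk even when they share few words.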



## Persist the Database
In a notebook, call `persist()` explicitly to make sure the embeddings are written to disk.
In a script this isn't necessary, because the database is persisted automatically when the client object is destroyed.

In [8]:
vectordb.persist()
vectordb = None
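
The persist-then-drop pattern above can be illustrated with a toy store that keeps its records in memory and only writes them out when `persist()` is called. This is a sketch under stated assumptions: `ToyStore` and its `store.json` file are hypothetical and say nothing about Chroma's on-disk format.

```python
import json
import os
import tempfile


class ToyStore:
    """Hypothetical in-memory store that is written to disk only on persist()."""

    def __init__(self, persist_directory):
        self.path = os.path.join(persist_directory, "store.json")
        self.records = []
        # Reload earlier records if this directory was persisted before
        if os.path.exists(self.path):
            with open(self.path, encoding="utf-8") as f:
                self.records = json.load(f)

    def add(self, text):
        self.records.append(text)

    def persist(self):
        os.makedirs(os.path.dirname(self.path), exist_ok=True)
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self.records, f)


with tempfile.TemporaryDirectory() as tmp:
    store = ToyStore(tmp)
    store.add("chunk one")
    store.persist()          # without this call nothing would reach the disk
    store = None             # drop the in-memory object
    reloaded = ToyStore(tmp)  # a fresh object sees the persisted records
    reloaded_records = reloaded.records

print(reloaded_records)  # ['chunk one']
```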

## Load the Database from disk, and create the chain
Be sure to pass the same `persist_directory` and `embedding_function` as you did when you instantiated the database. Initialize the chain we will use for question answering.

In [9]:
# Now we can load the persisted database from disk, and use it as normal. 
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)
qa = RetrievalQA.from_chain_type(llm=OpenAI(openai_api_key=OPENAI_API_KEY), chain_type="stuff", retriever=vectordb.as_retriever())
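
The `"stuff"` chain type places all retrieved chunks into a single prompt and sends that prompt to the LLM in one call. A minimal sketch of the idea follows; the function `build_stuff_prompt`, the prompt wording, and the example chunks are hypothetical, and LangChain's own prompt template is worded differently.

```python
def build_stuff_prompt(question, retrieved_chunks):
    """Concatenate the retrieved chunks into one context block and append the question.

    Hypothetical template for illustration; not LangChain's actual prompt.
    """
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    "Where is the index stored?",
    ["The index is stored in the db directory.", "Chunks are 500 characters long."],
)
print(prompt)
```

Because everything goes into one prompt, this approach only works while the retrieved chunks fit in the model's context window, which is one reason the documents are split into small chunks first.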



## Ask questions!

Now we can use the chain to ask questions!

In [10]:
query = "How many are youtube premium subscribers?"
qa.run(query)

' As of 2021, there are 23.6 million paying subscribers who are enjoying ad-free videos.'

In [11]:
query = "How big is youtube revenue?"
qa.run(query)

' YouTube generated $19.7 billion in revenue in 2020.'

In [12]:
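# Queries can also be asked in other languages. Korean for: "What percentage of internet users access YouTube?"
query = "인터넷 사용자 중 유투브에 접속하는 사람들이 몇 퍼센트나 될까?"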
qa.run(query)

' 62% of internet users in the US access YouTube daily.'

In [15]:
# Korean for: "Of the people who access YouTube, what percentage do so from a mobile phone?"
query = "유투브에 접속하는 사람중 핸드폰 접속자는 몇 퍼센트나 될까?"
qa.run(query)

' 40.9%'

## Cleanup

When you're done with the database, you can delete it from disk. You can delete the specific collection you're working with (if you have several), or delete the entire database by nuking the persistence directory.

In [13]:
# # To clean up, you can delete the collection
# vectordb.delete_collection()
# vectordb.persist()

# # Or just nuke the persist directory
# !rm -rf db/
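
If you prefer not to shell out, the directory can also be removed from Python with the standard library. The sketch below assumes a hypothetical helper `remove_persist_directory` and demonstrates it on a throwaway temporary directory rather than the real `db` folder.

```python
import os
import shutil
import tempfile


def remove_persist_directory(path):
    """Delete the directory tree at path if it exists.

    Returns True if something was removed, False otherwise.
    Hypothetical helper for illustration.
    """
    if os.path.isdir(path):
        shutil.rmtree(path)
        return True
    return False


# Demonstrate on a throwaway directory, not the notebook's real 'db' folder
demo_dir = os.path.join(tempfile.mkdtemp(), "db")
os.makedirs(demo_dir)
removed_first = remove_persist_directory(demo_dir)
removed_second = remove_persist_directory(demo_dir)
print(removed_first, removed_second)  # True False
```

Checking for the directory first makes the cleanup safe to run more than once.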