### Doc Query Using Chroma

`VectorstoreIndexCreator` already uses Chroma vectorstore by default (to be verified).

In this example, the Chroma vectorstore is created and initialized from a document. 

Let us install/update the required packages.

```python

In [22]:
! pip install langchain --upgrade
! pip install openai --upgrade
! pip install unstructured --upgrade
! pip install pypdf --upgrade
! pip install chromadb --upgrade

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Collecting openai
  Using cached openai-0.27.4-py3-none-any.whl (70 kB)
Installing collected packages: openai
  Attempting uninstall: openai
    Found existing installation: openai 0.27.2
    Uninstalling openai-0.27.2:
      Successfully uninstalled openai-0.27.2
Successfully installed openai-0.27.4
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid

In [23]:
from langchain.document_loaders import PyPDFLoader
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
import os
import textwrap
from langchain.llms import OpenAI

Since OpenAI embeddings will be used, the `OpenAIEmbeddings` instance is created. This requires that a valid API key is supplied.

In [24]:
query = input("OpenAI API Key: ")
os.environ["OPENAI_API_KEY"] = query
embeddings = OpenAIEmbeddings()

### Creating a Chroma vectorstore from a document (pdf)

After initializing the OpenAI embeddings, the chroma vectorstore is created by loading a pdf document and parsing it into pages.

In this example, the `metadata` is set to the document name and page number.

In [25]:
# sample pdf: https://ac.upd.edu.ph/acmedia/images/newpdfs/UP_Academic_Information.pdf

pdf_url = input("Enter pdf url: ")
loader = PyPDFLoader(pdf_url)
document = loader.load()      
pages = loader.load_and_split()

# for the url, get the document name
document_name = pdf_url.split("/")[-1]
document_len = len(pages)
print(f"{document_name} number of pages = {document_len}")

vectorstore = Chroma.from_documents(documents=pages, embeddings=embeddings, \
                                    metadatas=[{"source": f"{document_name}:page-{i}"} for i in range(document_len)])

Using embedded DuckDB without persistence: data will be transient
No embedding_function provided, using default embedding function: SentenceTransformerEmbeddingFunction


UP_Academic_Information.pdf number of pages = 69


### Create a QA Chain

A chain for document question answering is created. This will be used to process queries about the document and pass them to the LLM to retrieve the answer.

There are several chain_types. In this example, `map_rerank` is used to overcome the limitations of max tokens by breaking the document into chunks (eg pages) and reranking the results as returned by LLM. The result with the highest score is returned. 

More info can be found in the [LangChain documentation](https://python.langchain.com/en/latest/modules/chains/index_examples/qa_with_sources.html) and this [twitter post](https://twitter.com/wooing0306/status/1645092115914063872).

```python

In [26]:
llm = OpenAI(temperature=0)

from langchain.chains.qa_with_sources import load_qa_with_sources_chain
chain = load_qa_with_sources_chain(llm, chain_type="map_rerank",)

# if w/o sources
#from langchain.chains.question_answering import load_qa_chain
#chain = load_qa_chain(llm, chain_type="stuff")

In [28]:
while True:
    input_prompt = "Human: "
    query = input(input_prompt)

    if query.lower() == "bye":
        text = f"{input_prompt}: {query}"
        print(textwrap.fill(text, width=80))
        print("AI: Bye!")
        break

    docs = vectorstore.similarity_search(query, include_metadata=True)
    result = chain.run(input_documents=docs, question=query, return_sources=True)
    text = f"{input_prompt}: {query}"
    print(textwrap.fill(text, width=80))
    text = f"AI: {result}"
    print(textwrap.fill(text, width=80))

Human: : How many years to finish a bachelor's degree?
AI:  Four years
Human: : What is APE?
AI:  APE stands for Advanced Placement Examinations and is usually a written
examination that tests the student’s knowledge and skills of the course for
which he/she would want advanced credits in.
Human: : Are all freshman students required to take APE?
AI:  No, APE is only for students who wish to be considered for advanced credits
in a course.
Human: : When does the first semester start?
AI:  The first semester starts in August.
Human: : How many weeks per semester?
AI:  16 weeks
Human: : bye
AI: Bye!


In [37]:
# this is an alternative chain for QA with sources

from langchain.chains import RetrievalQAWithSourcesChain

chain = RetrievalQAWithSourcesChain.from_chain_type(OpenAI(temperature=0), chain_type="map_rerank", \
    retriever=vectorstore.as_retriever())


while True:
    input_prompt = "Human: "
    query = input(input_prompt)

    if query.lower() == "bye":
        text = f"{input_prompt}: {query}"
        print(textwrap.fill(text, width=80))
        print("AI: Bye!")
        break

    result = chain({"question": query}, return_only_outputs=True)
    text = f"{input_prompt}: {query}"
    print(textwrap.fill(text, width=80))
    text = f"AI: {result}"
    print(textwrap.fill(text, width=80))

Human: : What is CRS?
AI: {'answer': ' The Computerized Registration System (CRS) allows UP Diliman
enrollees, faculty, and staff to access records online. It includes features for
creating and updating student records during and after admission, encoding of
new course offerings, submission of class schedules, pre-enlistment,
registration, change of matriculation, dropping, online submission of grades,
online viewing of grades, automated assessment of fees, tagging of
scholarship/privileges (UP faculty/employees and dependents) and STFAP brackets,
encoding of ineligibilities and accountabilities, students’ evaluation of
teacher, management of enlistment priority, online advising, and generation of
reports.', 'sources': ''}
Human: : What is role of a registrar?
AI: {'answer': ' The registrar is responsible for overseeing the registration
process, including creating and updating student records, encoding of new course
offerings, submission of class schedules, pre-enlistment, registration