### Document Query Example

This notebook demonstrates how to use the LLM Python API to ask questions about a document (a PDF).

Let's install all the dependencies first. Make sure to use `--upgrade` to get the latest version since most of the packages are under active development.

In [1]:
! pip install langchain --upgrade
! pip install openai --upgrade
! pip install unstructured --upgrade
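! pip install pypdf --upgrade
# VectorstoreIndexCreator defaults to a Chroma vector store with OpenAI embeddings,
# which need these two packages if they are not already in your environment
! pip install chromadb --upgrade
! pip install tiktoken --upgrade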
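
Since these packages change quickly, it can help to record which versions were actually installed. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of any of the packages above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the version of each dependency installed in the cell above
for name in ["langchain", "openai", "unstructured", "pypdf"]:
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

Running this right after the install cell makes it easier to reproduce the notebook later if a newer release changes behavior.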



### Import packages

We import the following packages. Note that we do not import OpenAI directly, since LangChain already uses it under the hood.

In [2]:
from langchain.document_loaders import PyPDFLoader
from langchain.indexes import VectorstoreIndexCreator
import os
import textwrap

#### Upload a PDF of your choice

Provide a PDF of your choice by entering its URL. In this case, the PDF is a student manual.

A document loader is created from the given URL. Some basic information about the document (e.g. the number of pages) is printed.

In [3]:
pdf_url = input("Enter pdf url: ")

# e.g. https://ac.upd.edu.ph/acmedia/images/newpdfs/UP_Academic_Information.pdf
loader = PyPDFLoader(pdf_url)
# load_and_split() returns text chunks: one per page, with long pages split further
pages = loader.load_and_split()

# derive the document name from the last segment of the URL
document_name = pdf_url.split("/")[-1]
document_len = len(pages)
print(f"{document_name} number of pages = {document_len}")

UP_Academic_Information.pdf number of pages = 69
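
Taking the last `/`-separated segment works for the example URL, but it would include a query string (such as `?download=1`) or return an empty name for a URL ending in `/`. A slightly more defensive sketch, using only the standard library, is shown below; the function name `document_name_from_url` and the fallback name `document.pdf` are hypothetical choices for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def document_name_from_url(url, default="document.pdf"):
    """Return the file name in the URL path, ignoring any query string or fragment."""
    # urlparse separates the path from "?query" and "#fragment" parts
    path = unquote(urlparse(url).path)
    # PurePosixPath ignores a trailing slash and returns the final path component
    name = PurePosixPath(path).name
    return name or default


url = "https://ac.upd.edu.ph/acmedia/images/newpdfs/UP_Academic_Information.pdf?download=1"
print(document_name_from_url(url))
```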


#### Enter your OpenAI API key

To use the OpenAI LLM, enter your API key. You can get one from [OpenAI](https://beta.openai.com/).

The key is used to build a vector store (a database of embeddings) for the document. The document text is split into chunks, and OpenAI's embedding model converts each chunk into a vector (an embedding). Given a query, the vector store finds the chunks most similar to it, and those chunks are then passed to the LLM to generate the answer.

This step may take a few minutes. 
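
To make the retrieval idea concrete, here is a toy sketch of similarity search over embeddings in plain Python. Everything in it is an assumption for illustration: the three-dimensional vectors are made up (real embeddings have hundreds or thousands of dimensions), the names `cosine_similarity` and `top_k` are hypothetical, and cosine similarity is only one common measure. The vector store that LangChain creates may use a different distance metric internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the names of the k chunks most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in chunk_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Made-up embeddings for three pieces of text from a student manual
chunks = {
    "academic load": [0.9, 0.1, 0.0],
    "latin honors": [0.1, 0.9, 0.1],
    "calendar": [0.0, 0.2, 0.9],
}
# Made-up embedding of a query about units per semester
query_vec = [0.8, 0.2, 0.1]

print(top_k(query_vec, chunks, k=2))  # ['academic load', 'latin honors']
```

Only the text behind the top-ranked vectors is sent to the LLM, which is why the model can answer questions about a document much longer than its context window.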

In [4]:
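from getpass import getpass

# read the key without echoing it into the notebook output
api_key = getpass("OpenAI API Key: ")
os.environ["OPENAI_API_KEY"] = api_key
# embed the document and store the vectors in an in-memory index
index = VectorstoreIndexCreator().from_loaders([loader])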

Using embedded DuckDB without persistence: data will be transient


#### Ask a question

Now that we have a vector store for the document, we can ask questions about it. The answer from the OpenAI LLM is printed along with the question and its sources.

The session will stop once the human user says "bye".
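
The formatting step of the loop in the next cell can also be written as a small helper, which makes it possible to check the output layout without calling the API. This is a minimal sketch: `format_result` is a hypothetical name, and the sample dictionary is placeholder data that only mimics the `question`, `answer`, and `sources` keys seen in the output.

```python
import textwrap


def format_result(result, width=80):
    """Format each key/value pair of a result dict, wrapped to the given width."""
    lines = []
    for key, value in result.items():
        # strip() removes the leading space that answers often start with
        text = f"{key}: {str(value).strip()}"
        lines.append(textwrap.fill(text, width=width))
    return "\n".join(lines)


# Placeholder data for illustration only, not real model output
sample = {
    "question": "what does this helper do",
    "answer": " It wraps every field of the result so that no printed line is wider than the chosen width.",
    "sources": "/tmp/example",
}
print(format_result(sample, width=60))
```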

In [6]:
while True:
    input_prompt = "Human: "
    query = input(input_prompt)

    # end the session when the user says "bye" (ignoring case and extra spaces)
    if query.strip().lower() == "bye":
        print("AI: Bye!")
        break

    # print the question, answer, and sources, wrapped to the page width
    for key, value in index.query_with_sources(query).items():
        text = f"{key}: {value}"
        print(textwrap.fill(text, width=80))


question: what is the maximum number of units per semester for an undegraduate
student
answer:  The maximum academic load for undergraduate students is eighteen (18)
non-laboratory units, or twenty-one (21) units including laboratory, except in
programs where the prescribed load for the semester is more than eighteen (18)
units.
sources: /tmp/tmpd_4n6c23
question: how do i qualify for latin honors
answer:  To qualify for Latin honors, a student must have completed in the
University at least 75% of the total number of academic units or hours for
graduation and must have been in residence therein for at least two (2) years
immediately prior to graduation, have taken during each semester/ trimester not
less than fifteen (15) units of credit or the normal load prescribed in the
curriculum, and maintain a cumulative weighted average (CWAG) of “2.00” or
better at the end of each academic year.
sources: /tmp/tmpd_4n6c23
question: when is the start of the first semester
answer:  The first seme