# AIE 3 Midterm Assignment

## This notebook was used to test the code that was ported into the app.py file.

### Load Environment Variables

In [2]:
import os
from dotenv import find_dotenv, dotenv_values

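# Look the key up by name (assumes the .env entry is called OPENAI_API_KEY),
# so the result does not depend on the order of entries in the file
config = dotenv_values(find_dotenv('.env'))
os.environ['OPENAI_API_KEY'] = config['OPENAI_API_KEY']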

### Install Additional Packages

In [23]:
!pip install -qU rapidocr-onnxruntime pypdf # pypdf reads the PDF; rapidocr-onnxruntime OCRs embedded images (e.g. tables)


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.1.2[0m[39;49m -> [0m[32;49m24.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


### Load Documents, Embedding Model, and Vectorstore

In [26]:
from langchain_community.document_loaders import PyPDFLoader

# Read text from the PDF; extract_images=True also runs OCR on embedded images such as tables
loader = PyPDFLoader("data/Airbnb_10k.pdf", extract_images=True)
pages = loader.load()

In [27]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 200,
    chunk_overlap = 50
)

documents = text_splitter.split_documents(pages)
print(len(documents))

1153
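To make the effect of `chunk_size = 200` and `chunk_overlap = 50` concrete, here is a minimal sketch of a fixed-width sliding window. It is a simplified stand-in, not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on separators such as paragraph and line boundaries, so its chunks are often shorter than 200 characters. The function name `sliding_window_chunks` and the sample string are hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int = 200, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts this many characters after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# Hypothetical 500-character sample, only for illustration
sample = "".join(str(i % 10) for i in range(500))
chunks = sliding_window_chunks(sample)
print(len(chunks), [len(c) for c in chunks])
```

With such small chunks, a single table row or sentence may be split across several chunks, which is one plausible reason the first test answer further down is cut off mid-sentence.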


In [28]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small"
)

In [29]:
from langchain_community.vectorstores import Qdrant

qdrant_vector_store = Qdrant.from_documents(
    documents,
    embeddings,
    location=":memory:",
    collection_name="Airbnb_10k",
)

### Create Retriever and RAG Chain

In [30]:
retriever = qdrant_vector_store.as_retriever()
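Conceptually, the retriever embeds the question and returns the chunks whose vectors are closest to it. The sketch below illustrates that idea with hand-written three-dimensional toy vectors and cosine similarity. It assumes, without confirmation from this notebook, that the default retriever performs a plain similarity search returning the top 4 chunks; the names `cosine_similarity`, `top_k`, and `toy_index` are hypothetical, and the real search runs inside Qdrant.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], indexed: list[tuple[str, list[float]]], k: int = 4) -> list[str]:
    """Return the texts of the k indexed entries most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy "embeddings" invented for illustration; real ones come from text-embedding-3-small
toy_index = [
    ("cash and cash equivalents", [0.9, 0.1, 0.0]),
    ("description of business", [0.1, 0.9, 0.0]),
    ("10b5-1 trading plan", [0.0, 0.1, 0.9]),
]
hits = top_k([1.0, 0.2, 0.0], toy_index, k=2)
print(hits)
```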

In [31]:
from langchain.prompts import ChatPromptTemplate

template = """Answer all questions from the user. Any questions that have accompanying context should be answered with the context. If the context does not pertain to the question, or you cannot find the answer within the context, politely tell the user that you do not know the answer to the question and guide them toward a similar question that you may be able to answer with the context:

Context:
{context}

Question:
{question}
"""

prompt = ChatPromptTemplate.from_template(template)
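The template has two slots, `{context}` and `{question}`. As far as I can tell, the chain below passes the retriever's list of documents straight into `{context}`, so the model sees their string representation, metadata included. A minimal sketch of the substitution, with the chunks joined into plain text first, might look like this. The `Doc` class is a hypothetical stand-in for a LangChain document, `format_docs` is an illustrative helper that is not part of this notebook, and `TEMPLATE` is a shortened version of the template above.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document; only page_content is modeled."""
    page_content: str


# Shortened version of the notebook's template, keeping the same two slots
TEMPLATE = "Context:\n{context}\n\nQuestion:\n{question}\n"


def format_docs(docs: list[Doc]) -> str:
    """Join chunk texts with blank lines so the prompt contains only readable text."""
    return "\n\n".join(doc.page_content for doc in docs)


filled = TEMPLATE.format(
    context=format_docs([Doc("Chunk one."), Doc("Chunk two.")]),
    question="What is in the chunks?",
)
print(filled)
```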



In [32]:
from operator import itemgetter

from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

primary_qa_llm = ChatOpenAI(model_name="gpt-4o", temperature=0)

retrieval_augmented_qa_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and piping it into the retriever
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    # Passes the dict through unchanged: "context" is re-assigned to the value it already holds,
    # so this step does not alter the data
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object, which is
    #              piped into the LLM; the resulting message is stored in a key called "response"
    # "context"  : the retrieved documents, carried over from the previous step
    | {"response": prompt | primary_qa_llm, "context": itemgetter("context")}
)
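The data flow through the three steps of the chain can be traced with plain functions. This is a sketch under stated assumptions, not the LangChain implementation: `fake_retriever` and `fake_llm` are hypothetical stubs standing in for the Qdrant retriever and the chat model, and the prompt is a shortened version of the template.

```python
def fake_retriever(question: str) -> list[str]:
    """Hypothetical stub: returns one canned 'chunk' instead of querying a vector store."""
    return [f"chunk about: {question}"]


def fake_llm(prompt_text: str) -> str:
    """Hypothetical stub: echoes the prompt instead of calling a model."""
    return f"ANSWER based on -> {prompt_text}"


def run_chain(inputs: dict) -> dict:
    # Step 1: build {"context", "question"} from the incoming question
    state = {
        "context": fake_retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Step 2: mirrors RunnablePassthrough.assign(context=...); the key is
    # re-assigned to the value it already holds, so nothing changes
    state = {**state, "context": state["context"]}
    # Step 3: format the prompt, call the model, and keep the context alongside the response
    prompt_text = f"Context:\n{state['context']}\n\nQuestion:\n{state['question']}\n"
    return {"response": fake_llm(prompt_text), "context": state["context"]}


result = run_chain({"question": "What is Airbnb's business?"})
print(result["response"])
```

Returning the context next to the response is what makes it possible to inspect which chunks an answer was based on.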

#### Brief Testing

In [35]:
questions = ["What is Airbnb's 'Description of Business'?",
             "What was the total value of 'Cash and cash equivalents' as of December 31, 2023?",
             "What is the 'maximum number of shares to be sold under the 10b5-1 Trading plan' by Brian Chesky?"]

for question in questions:
    result = retrieval_augmented_qa_chain.invoke({"question" : question})

    print(f'QUESTION: {question}\n\nRESULT: {result["response"].content}\n\n')

QUESTION: What is Airbnb's 'Description of Business'?

RESULT: Based on the provided context, Airbnb's 'Description of Business' is as follows:

"Airbnb, Inc. (the “Company” or “Airbnb”) was incorporated in Delaware in June 2008 and is headquartered in San Francisco, California. The Company operates a global platform for..."

The context does not provide the complete description, so I cannot provide the full details. You may want to refer to the full document for a comprehensive description.


QUESTION: What was the total value of 'Cash and cash equivalents' as of December 31, 2023?

RESULT: The total value of 'Cash and cash equivalents' as of December 31, 2023, was $6,874 million.


QUESTION: What is the 'maximum number of shares to be sold under the 10b5-1 Trading plan' by Brian Chesky?

RESULT: The maximum number of shares to be sold under the 10b5-1 Trading Plan by Brian Chesky, Chief Executive Officer and Director, is 1,146,000.
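Reading the answers by eye works for three questions; a small substring check makes the same test repeatable. The sketch below is illustrative only: `check_answers` is a hypothetical helper, `canned_answer` is a stub that replays the notebook output above instead of calling the chain, and the expected strings are copied from that output rather than verified independently against the filing.

```python
def check_answers(answer_fn, expectations: dict[str, str]) -> list[str]:
    """Return the questions whose answer does not contain the expected substring."""
    failures = []
    for question, must_contain in expectations.items():
        if must_contain not in answer_fn(question):
            failures.append(question)
    return failures


# Expected substrings taken from the output printed above
expectations = {
    "cash and cash equivalents": "$6,874 million",
    "10b5-1 trading plan": "1,146,000",
}


def canned_answer(question: str) -> str:
    """Hypothetical stub replaying the answers shown above."""
    replies = {
        "cash and cash equivalents": "The total value was $6,874 million.",
        "10b5-1 trading plan": "The maximum number of shares is 1,146,000.",
    }
    return replies[question]


failures = check_answers(canned_answer, expectations)
print(failures)
```

To run it against the real chain, `answer_fn` could be a small wrapper that invokes `retrieval_augmented_qa_chain` and returns `result["response"].content`.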


