# Overview

This is a small RAG app that answers questions about American Airlines annual financial results (see https://americanairlines.gcs-web.com/financial-results/financial-aal). A few PDFs were downloaded from that page and ingested into a vectorstore.

The first section contains rough prototyping on a few pages of a single PDF document, which is loaded into an in-memory vectorstore.

After working on the small example, I wrote a script that creates the vectorstore for large PDFs. The script works in batches and is intentionally slow, since I'm using the free Google LLM service, which is rate-limited. Batching keeps requests under the quota and creates checkpoints during embedding, so that if the quota limit is reached and an error is thrown, no progress is lost. I used this for a ~900-page PDF. The total embedding process took around 25 minutes and used 5.6k tokens.
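The batching and checkpointing idea described above could look roughly like the sketch below. This is not the actual script: `embed_in_batches`, `embed_fn`, the JSON checkpoint file, and the default batch size and pause are all assumptions made for illustration.

```python
import json
import os
import time
from typing import Callable, List


def embed_in_batches(
    chunks: List[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    checkpoint_path: str,
    batch_size: int = 50,
    pause_s: float = 0.0,
) -> List[List[float]]:
    """Embed chunks batch by batch, saving progress after every batch."""
    # Resume from an earlier run if a checkpoint file already exists.
    vectors: List[List[float]] = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            vectors = json.load(f)

    for start in range(len(vectors), len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        # If this call raises (e.g. quota exceeded), earlier batches are already on disk.
        vectors.extend(embed_fn(batch))
        with open(checkpoint_path, "w") as f:
            json.dump(vectors, f)
        # Deliberate pause between batches to stay under the rate limit (hypothetical value).
        time.sleep(pause_s)
    return vectors
```

Re-running the function with the same `checkpoint_path` after a failure picks up at the first batch that was not yet saved, so completed batches are never embedded twice.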

The full dataset consists of 9 PDFs with a mean length of 384 pages.
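As a rough back-of-the-envelope check, the figures quoted in this overview imply the following total size and embedding time for the full dataset. The estimate assumes throughput stays the same as in the single ~900-page run, which may not hold in practice.

```python
# Figures quoted in the overview (both reference values are approximate).
num_pdfs = 9
mean_pages = 384
ref_pages = 900      # size of the PDF that was timed
ref_minutes = 25     # embedding time for that PDF

total_pages = num_pdfs * mean_pages
pages_per_minute = ref_pages / ref_minutes
# Assumption: throughput is constant across documents.
estimated_minutes = total_pages / pages_per_minute

print(total_pages, estimated_minutes)
```

Under that assumption the full dataset is about 3,456 pages and would take roughly 96 minutes to embed.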

Once the embedding is complete, we load it and use it for RAG as usual.

Since both the notebook and the script make API calls to Gemini, you'll need to enter your API key if you want to run them. If you don't already have one, you can get one here: https://aistudio.google.com/app/apikey.

# To-do

* Write more tests

    - A query-generation LLM could potentially write the test questions, with the app's answers then checked by a judge LLM or verified by a human

* Experiment with faster ways to ingest PDFs without using the paid version of Gemini
    - Run experiments with different chunking strategies. Currently, we're limited by the number of API calls, but we could potentially just make larger calls. It is not yet clear what the limits are.

* Investigate scale
    - Get Databricks certification. Databricks can do data hosting, distributed computing, and serving the endpoint.
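One possible shape for the LLM-generated tests mentioned in the first to-do item is sketched below. Nothing here exists in the project yet: `EvalCase`, `run_eval`, `answer_fn`, and `judge_fn` are hypothetical names, and the lambdas are stand-ins for the RAG graph and for a judge LLM (or a human reviewer).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    question: str
    expected: str


def run_eval(
    cases: List[EvalCase],
    answer_fn: Callable[[str], str],
    judge_fn: Callable[[str, str, str], bool],
) -> float:
    """Return the fraction of cases the judge marks as correct."""
    if not cases:
        return 0.0
    passed = sum(
        judge_fn(case.question, case.expected, answer_fn(case.question))
        for case in cases
    )
    return passed / len(cases)


# Stand-ins: a canned answer and a naive substring judge.
cases = [
    EvalCase(
        "In which month did the Association of Professional Flight Attendants "
        "ratify a new collective bargaining agreement?",
        "September 2024",
    )
]
score = run_eval(
    cases,
    answer_fn=lambda q: "The agreement was ratified in September 2024.",
    judge_fn=lambda q, expected, answer: expected in answer,
)
```

In a real run, `answer_fn` would wrap the compiled graph and `judge_fn` would prompt a second model with the question, the expected answer, and the generated answer.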

# Prelims

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [6]:
!pip install -q pypdf langchain faiss-cpu tiktoken langchain-community langgraph tqdm langchain-google-genai==2.0.10

In [7]:
import os
import getpass
import faiss

from langchain                              import hub
from typing_extensions                      import List, TypedDict

from langchain_community.vectorstores       import FAISS
from langchain_community.document_loaders   import PyPDFLoader
from langchain_core.documents               import Document
from langchain_core.vectorstores            import InMemoryVectorStore
from langchain_text_splitters               import RecursiveCharacterTextSplitter
from langchain_google_genai                 import GoogleGenerativeAIEmbeddings

from langgraph.graph                        import START, StateGraph, MessagesState
from langgraph.checkpoint.memory            import MemorySaver

from langchain.chat_models                  import init_chat_model
from langchain_core.prompts                 import PromptTemplate

from langchain_core.tools                   import tool
from langchain_core.messages                import SystemMessage
from langgraph.graph                        import END
from langgraph.prebuilt                     import ToolNode, tools_condition


if not os.environ.get("GOOGLE_API_KEY"):
  os.environ["GOOGLE_API_KEY"] = getpass.getpass("Enter API key for Google Gemini: ")

Enter API key for Google Gemini: ··········


# Small Example

Loads a single test PDF that's only 9 pages.

In [None]:
template = """You are an analyst conducting financial due diligence for a private equity investment firm.
Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Explain your reasoning, and, if possible, provide citations for your results.

{context}

Question: {question}

Helpful Answer:"""
prompt = PromptTemplate.from_template(template)

In [None]:
llm     = init_chat_model("gemini-2.0-flash", model_provider="google_genai")

class State(TypedDict):
    question: str
    context: List[Document]
    answer: str

def retrieve(state: State):
    retrieved_docs = retriever.invoke(state["question"])
    return {"context": retrieved_docs}

def generate(state: State):
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = prompt.invoke({"question": state["question"], "context": docs_content})
    response = llm.invoke(messages)
    return {"answer": response.content}

graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()
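For readers unfamiliar with LangGraph, the two-node sequence above behaves roughly like the plain-Python sketch below: each step receives the current state and returns a partial update that is merged in. `run_sequence`, `fake_retrieve`, and `fake_generate` are hypothetical stand-ins for illustration, not LangGraph APIs.

```python
from typing import Callable, Dict, List

StateDict = Dict[str, object]


def run_sequence(state: StateDict, steps: List[Callable[[StateDict], StateDict]]) -> StateDict:
    """Run steps in order, merging each partial update into the state."""
    current = dict(state)
    for step in steps:
        current.update(step(current))
    return current


def fake_retrieve(state: StateDict) -> StateDict:
    # Stand-in for retriever.invoke(...): returns canned context.
    return {"context": [f"passage about: {state['question']}"]}


def fake_generate(state: StateDict) -> StateDict:
    # Stand-in for the LLM call: just reports how much context it saw.
    return {"answer": f"answered using {len(state['context'])} passage(s)"}


final_state = run_sequence(
    {"question": "What was Q3 revenue?"},
    [fake_retrieve, fake_generate],
)
```

The final state holds all three keys (`question`, `context`, `answer`), which mirrors the `State` TypedDict used by the real graph.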

In [None]:
TEST_PATH       = "/content/drive/MyDrive/Colab Notebooks/Data/American Airlines Forms/small_pdf_test.pdf"
TEST_QUERY      = "In which month did the Association of Professional Flight Attendants ratify a new collective bargaining agreement?"

# The correct answer to TEST_QUERY is September 2024. See paragraph 3 of page 9.

In [None]:
loader          = PyPDFLoader(TEST_PATH, mode='single')
docs            = loader.load()

text_splitter   = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
all_splits      = text_splitter.split_documents(docs)

embeddings      = GoogleGenerativeAIEmbeddings(
                        model="models/gemini-embedding-exp-03-07",
                        task_type="QUESTION_ANSWERING"
                                               )

vector_store    = InMemoryVectorStore(embeddings)
vector_store.add_documents(documents=all_splits)

retriever = vector_store.as_retriever()

# Sanity check: inspect the top retrieved chunk for the test query.
# (retriever.invoke replaces the deprecated get_relevant_documents.)
retrieved_documents = retriever.invoke(TEST_QUERY)
retrieved_documents[0].page_content
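To make `chunk_size=1000` and `chunk_overlap=200` concrete, here is a simplified fixed-width sliding window. This is not how `RecursiveCharacterTextSplitter` works internally (it prefers to split on separators such as paragraph breaks); `sliding_chunks` is a hypothetical helper that only illustrates the overlap arithmetic.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-width windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample_text = "".join(str(i % 10) for i in range(2600))
sample_chunks = sliding_chunks(sample_text, chunk_size=1000, chunk_overlap=200)
print([len(c) for c in sample_chunks])
```

For this 2,600-character sample the result is three chunks of 1,000 characters each, and the last 200 characters of one chunk are the first 200 of the next. More overlap means more chunks, and therefore more embedding calls, for the same document.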

# Medium-Sized Example

This and the following sections work off the vectorstore generated by create_embedding.py.

In [None]:
# embeddings      = GoogleGenerativeAIEmbeddings(
#                         model="models/gemini-embedding-exp-03-07",
#                         task_type="QUESTION_ANSWERING"
#                                                )

embeddings      = GoogleGenerativeAIEmbeddings(
                        model="models/embedding-001"
                                               )
###############     This is not the ideal model.
### WARNING ###     Re-run the embedding script with exp-03-07, then uncomment the first embeddings block.
###############

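vector_store = FAISS.load_local('/content/drive/MyDrive/Colab Notebooks/Data/final_vectorstore',
                                embeddings,
                                allow_dangerous_deserialization=True)

# Point the retriever at the FAISS store; otherwise retrieve() would keep
# using the in-memory store from the small example.
retriever = vector_store.as_retriever()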

In [None]:
llm     = init_chat_model("gemini-2.0-flash", model_provider="google_genai")

class State(TypedDict):
    question: str
    context: List[Document]
    answer: str

def retrieve(state: State):
    retrieved_docs = retriever.invoke(state["question"])
    return {"context": retrieved_docs}

def generate(state: State):
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = prompt.invoke({"question": state["question"], "context": docs_content})
    response = llm.invoke(messages)
    return {"answer": response.content}

memory = MemorySaver()

graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile(checkpointer=memory)

In [None]:
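# A graph compiled with a checkpointer needs a thread_id; the value here is arbitrary.
result = graph.invoke({"question": 'What were the total non-cash transactions for 2024?'},
                      config={"configurable": {"thread_id": "medium-example"}})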

In [None]:
print(result["answer"])  # This appears to be the correct answer.

Based on the provided text, the total non-cash transactions for the first nine months of 2024 are calculated as follows:

*   Right-of-use (ROU) assets acquired through operating leases: $775 million
*   Property and equipment acquired through debt, finance leases and other: $193 million
*   Operating leases converted to finance leases: $130 million
*   Finance leases converted to operating leases: $33 million

Total non-cash transactions = $775 + $193 + $130 + $33 = $1,131 million

Therefore, the total non-cash transactions for the first nine months of 2024 were $1.131 billion.


# Improving the model

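This section adds chat memory, so follow-up questions can build on earlier turns of the conversation. Retrieval is now exposed to the model as a tool that it can choose to call.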

In [30]:
embeddings      = GoogleGenerativeAIEmbeddings(
                        model="models/embedding-001"
                                               )
###############     This is not the ideal model.
### WARNING ###     Re-run the embedding script with exp-03-07, then uncomment the first embeddings block.
###############

vector_store = FAISS.load_local('/content/drive/MyDrive/Colab Notebooks/Data/final_vectorstore',
                                embeddings,
                                allow_dangerous_deserialization=True)

llm     = init_chat_model("gemini-2.0-flash", model_provider="google_genai")

In [31]:
@tool(response_format="content_and_artifact")
def retrieve(query: str):
    """Retrieve information related to a query."""
    retrieved_docs = vector_store.similarity_search(query, k=2)
    serialized = "\n\n".join(
        (f"Source: {doc.metadata}\nContent: {doc.page_content}")
        for doc in retrieved_docs
    )
    return serialized, retrieved_docs

In [17]:
# Step 1: Generate an AIMessage that may include a tool-call to be sent.
def query_or_respond(state: MessagesState):
    """Generate tool call for retrieval or respond."""
    llm_with_tools = llm.bind_tools([retrieve])
    response = llm_with_tools.invoke(state["messages"])
    # MessagesState appends messages to state instead of overwriting
    return {"messages": [response]}


# Step 2: Execute the retrieval.
tools = ToolNode([retrieve])


# Step 3: Generate a response using the retrieved content.
def generate(state: MessagesState):
    """Generate answer."""
    # Get generated ToolMessages
    recent_tool_messages = []
    for message in reversed(state["messages"]):
        if message.type == "tool":
            recent_tool_messages.append(message)
        else:
            break
    tool_messages = recent_tool_messages[::-1]

    # Format into prompt
    docs_content = "\n\n".join(doc.content for doc in tool_messages)
    # Adjacent string literals are concatenated, so each sentence needs its own trailing space.
    system_message_content = (
        "You are an analyst conducting financial due diligence for a private equity investment firm. "
        "Use the following pieces of context to answer the question at the end. "
        "If you don't know the answer, just say that you don't know, don't try to make up an answer. "
        "Explain your reasoning, and, if possible, provide citations for your results."
        "\n\n"
        f"{docs_content}"
    )
    conversation_messages = [
        message
        for message in state["messages"]
        if message.type in ("human", "system")
        or (message.type == "ai" and not message.tool_calls)
    ]
    prompt = [SystemMessage(system_message_content)] + conversation_messages

    response = llm.invoke(prompt)
    return {"messages": [response]}
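The loop at the top of `generate` collects only the tool messages produced for the current turn, skipping retrieval results from earlier turns. The sketch below isolates that logic with a minimal stand-in message type; `Msg` and `trailing_tool_messages` are hypothetical names, not LangChain classes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Msg:
    type: str      # "human", "ai", or "tool"
    content: str


def trailing_tool_messages(messages: List[Msg]) -> List[Msg]:
    """Return the unbroken run of tool messages at the end, in original order."""
    recent = []
    for message in reversed(messages):
        if message.type != "tool":
            break
        recent.append(message)
    return recent[::-1]


history = [
    Msg("human", "first question"),
    Msg("ai", ""),
    Msg("tool", "old context"),
    Msg("ai", "first answer"),
    Msg("human", "follow-up"),
    Msg("ai", ""),
    Msg("tool", "new context A"),
    Msg("tool", "new context B"),
]
print([m.content for m in trailing_tool_messages(history)])
# prints ['new context A', 'new context B']
```

Only the context retrieved for the follow-up is returned; "old context" stays in the history but is not re-injected into the system prompt.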

In [32]:
graph_builder = StateGraph(MessagesState)

graph_builder.add_node(query_or_respond)
graph_builder.add_node(tools)
graph_builder.add_node(generate)

graph_builder.set_entry_point("query_or_respond")
graph_builder.add_conditional_edges(
    "query_or_respond",
    tools_condition,
    {END: END, "tools": "tools"},
)
graph_builder.add_edge("tools", "generate")
graph_builder.add_edge("generate", END)

memory = MemorySaver()
graph = graph_builder.compile(checkpointer=memory)
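The conditional edge above decides, after `query_or_respond`, whether to run the retrieval tool or to stop. The sketch below mimics that decision in plain Python; `route_after_model`, `END_NODE`, and the dict-based message are hypothetical stand-ins and not the actual `tools_condition` implementation.

```python
from typing import Dict, List

END_NODE = "__end__"  # label for the terminal node in this sketch (an assumption)


def route_after_model(last_ai_message: Dict[str, List[dict]]) -> str:
    """Go to the tools node if the model asked for a tool call, otherwise finish."""
    if last_ai_message.get("tool_calls"):
        return "tools"
    return END_NODE


# A greeting needs no retrieval, so the model answers directly.
greeting_route = route_after_model({"tool_calls": []})
# A factual question triggers a call to the retrieve tool.
question_route = route_after_model(
    {"tool_calls": [{"name": "retrieve", "args": {"query": "collective bargaining agreement"}}]}
)
```

This matches the behaviour in the transcripts below: "Hello." is answered without a tool call, while the questions about the filing go through `retrieve` before `generate`.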

In [33]:
config = {"configurable": {"thread_id": "abc123"}}

In [27]:
input_message = 'Hello.'

for step in graph.stream(
    {"messages": [{"role": "user", "content": input_message}]},
    stream_mode="values",
    config=config
):
    step["messages"][-1].pretty_print()


Hello.

Hi there! Is there anything I can do for you?


In [35]:
input_message = 'In which month did the Association of Professional Flight Attendants ratify a new collective bargaining agreement?'

for step in graph.stream(
    {"messages": [{"role": "user", "content": input_message}]},
    stream_mode="values",
    config=config
):
    step["messages"][-1].pretty_print()


In which month did the Association of Professional Flight Attendants ratify a new collective bargaining agreement?
Tool Calls:
  retrieve (12d4a1f0-b1dd-416c-b057-5a214c3139e6)
 Call ID: 12d4a1f0-b1dd-416c-b057-5a214c3139e6
  Args:
    query: ratification of collective bargaining agreement Association of Professional Flight Attendants
Name: retrieve

Source: {'producer': 'KS - PDF Engine v1.2', 'creator': 'Chromium', 'creationdate': '2024-10-24T11:10:15+00:00', 'moddate': '2024-10-24T11:10:25+00:00', 'title': 'Form 10-Q for American Airlines Group INC filed 10/24/2024', 'author': 'Kaleidoscope - kscope.io', 'subject': '10-Q filed 10/24/2024', 'keywords': 'American Airlines Group INC 10-Q', 'source': 'C:/Users/19368/Repos/Personal/Scaylor Coding Challenge/american_airlines_pdfs/Q32024.pdf', 'total_pages': 981, 'page': 29, 'page_label': '30'}
Content: statements in accordance with accounting principles generally accepted in the United States (GAAP) requires management to make certain
es

In [36]:
input_message = "Were there payments involved? If so, for how much?"

for step in graph.stream(
    {"messages": [{"role": "user", "content": input_message}]},
    stream_mode="values",
    config=config,
):
    step["messages"][-1].pretty_print()


Were there payments involved? If so, for how much?
Tool Calls:
  retrieve (4e7fe4d9-1c81-4cc1-8d43-02717d34cd83)
 Call ID: 4e7fe4d9-1c81-4cc1-8d43-02717d34cd83
  Args:
    query: payment amount related to ratification of collective bargaining agreement Association of Professional Flight Attendants September 2024
Name: retrieve

Source: {'producer': 'KS - PDF Engine v1.2', 'creator': 'Chromium', 'creationdate': '2024-10-24T11:10:15+00:00', 'moddate': '2024-10-24T11:10:25+00:00', 'title': 'Form 10-Q for American Airlines Group INC filed 10/24/2024', 'author': 'Kaleidoscope - kscope.io', 'subject': '10-Q filed 10/24/2024', 'keywords': 'American Airlines Group INC 10-Q', 'source': 'C:/Users/19368/Repos/Personal/Scaylor Coding Challenge/american_airlines_pdfs/Q32024.pdf', 'total_pages': 981, 'page': 13, 'page_label': '14'}
Content: and other postretirement benefits.
(b) Labor Relations
In September 2024, American and the Association of Professional Flight Attendants, the union representing

In [37]:
input_message = "What are the key drivers of the company's cash flow and its working capital requirements?"

for step in graph.stream(
    {"messages": [{"role": "user", "content": input_message}]},
    stream_mode="values",
    config=config,
):
    step["messages"][-1].pretty_print()


What are the key drivers of the company's cash flow and its working capital requirements?
Tool Calls:
  retrieve (b2a3ea6f-365a-4199-bd23-23efc304775b)
 Call ID: b2a3ea6f-365a-4199-bd23-23efc304775b
  Args:
    query: key drivers of cash flow and working capital requirements for American Airlines
Name: retrieve

Source: {'producer': 'KS - PDF Engine v1.2', 'creator': 'Chromium', 'creationdate': '2024-10-24T11:10:15+00:00', 'moddate': '2024-10-24T11:10:25+00:00', 'title': 'Form 10-Q for American Airlines Group INC filed 10/24/2024', 'author': 'Kaleidoscope - kscope.io', 'subject': '10-Q filed 10/24/2024', 'keywords': 'American Airlines Group INC 10-Q', 'source': 'C:/Users/19368/Repos/Personal/Scaylor Coding Challenge/american_airlines_pdfs/Q32024.pdf', 'total_pages': 981, 'page': 56, 'page_label': '57'}
Content: connection with the financing of certain aircraft and repurchased $552 million of secured and unsecured notes in the open market.
American
Operating Activities
American’s net c

In [38]:
input_message = "Summarize American Airlines balance sheet for this period"

for step in graph.stream(
    {"messages": [{"role": "user", "content": input_message}]},
    stream_mode="values",
    config=config,
):
    step["messages"][-1].pretty_print()


Summarize American Airlines balance sheet for this period
Tool Calls:
  retrieve (1a6290d0-7e5a-4794-a30d-fc0f8dcd33d8)
 Call ID: 1a6290d0-7e5a-4794-a30d-fc0f8dcd33d8
  Args:
    query: American Airlines balance sheet summary for Q3 2024
Name: retrieve

Source: {'producer': 'KS - PDF Engine v1.2', 'creator': 'Chromium', 'creationdate': '2024-10-24T11:10:15+00:00', 'moddate': '2024-10-24T11:10:25+00:00', 'title': 'Form 10-Q for American Airlines Group INC filed 10/24/2024', 'author': 'Kaleidoscope - kscope.io', 'subject': '10-Q filed 10/24/2024', 'keywords': 'American Airlines Group INC 10-Q', 'source': 'C:/Users/19368/Repos/Personal/Scaylor Coding Challenge/american_airlines_pdfs/Q32024.pdf', 'total_pages': 981, 'page': 2, 'page_label': '3'}
Content: American Airlines Group Inc.
American Airlines, Inc.
Form 10-Q
Quarterly Period Ended September 30, 2024
Table of Contents
  Page
PART I: FINANCIAL INFORMATION
Item 1A. Condensed Consolidated Financial Statements of American Airlines Grou