I am importing the necessary libraries from Langchain for the RAG and requests for the API part i.e., problem 3

In [40]:
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

import re
import requests

I require the two keys one for the brain i.e., the LLM and another for RAPID API

In [5]:
import os
os.environ["OPENAI_API_KEY"] = "sk-proj-mpIwO4x-C5wh0PgXMIuFcWzhVnX4fbnpzxnjp0-7X-xgHafLWnYUVF-WWAWVtIHwDuYZ6wvRN9T3BlbkFJ-."
RAPID_API_KEY = "66ef2a227dmshbb9c186d77b2539p1ac253jsn6b5ba1ceda97"

I have added the file name with that added the stock name and the year to explicitely identify the year and the company

In [6]:
files = [
    ("AMZN", "2019", "/Assignment_BLS/Assignment/10-k_docs/Amazon/0001018724-20-000004.pdf"),
    ("AMZN", "2020", "/Assignment_BLS/Assignment/10-k_docs/Amazon/0001018724-21-000004.pdf"),
    ("AMZN", "2021", "/Assignment_BLS/Assignment/10-k_docs/Amazon/0001018724-22-000005.pdf"),
    ("UBER", "2019", "/Assignment_BLS/Assignment/10-k_docs/Uber/0001543151-20-000010.pdf"),
    ("UBER", "2020", "/Assignment_BLS/Assignment/10-k_docs/Uber/0001543151-21-000014.pdf"),
    ("UBER", "2021", "/Assignment_BLS/Assignment/10-k_docs/Uber/0001543151-22-000008.pdf"),
]

# Problem 1

Loaded all the pages from all the documents

In [7]:
all_docs = []

for company, year, path in files:
    loader = PyPDFLoader(path)
    docs = loader.load()
    for d in docs:
        d.metadata["company"] = company
        d.metadata["year"] = year
    all_docs.extend(docs)

Printed the number of pages as well as a data of page 5

In [8]:
print(len(all_docs))
print(all_docs[4])

1069
page_content='Table of Contents
Content Creators
We serve authors and independent publishers with Kindle Direct Publishing, an online service that lets independent authors and
publishers choose a royalty option and make their books available in the Kindle Store, along with Amazon’s own publishing arm,
Amazon Publishing. We also offer programs that allow authors, musicians, filmmakers, skill and app developers, and others to publish
and sell content.
Competition
Our businesses encompass a large variety of product types, service offerings, and delivery channels. The worldwide marketplace
in which we compete is evolving rapidly and intensely competitive, and we face a broad array of competitors from many different
industry sectors around the world. Our current and potential competitors include: (1) physical, e-commerce, and omnichannel retailers,
publishers, vendors, distributors, manufacturers, and producers of the products we offer and sell to consumers and businesses;
(2) publishe

Created blocks of size 1000 characters with an overlap of 200 characters

In [9]:
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)

chunks = splitter.split_documents(all_docs)
print("Total chunks:", len(chunks))

Total chunks: 4249


Using the text embedding by OpenAI
Built a vector database using the ChromeDB library choosed this as it is open source and organizes data by similarity and then stored it inside the directory "sec10k_db"

In [10]:
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./sec10k_db"
)

vectorstore.persist()

  vectorstore.persist()


It has the core logic of this problem
It find the top 5 relevant chunks based on the comapny name and the year

In [11]:
def get_retriever(company, year):
    return vectorstore.as_retriever(
        search_kwargs={
            "k": 5,
            "filter": {
                "$and": [
                    {"company": {"$eq": company}},
                    {"year": {"$eq": year}}
                ]
            }
        }
    )


Here I:
* Define the default OpenAI model to be used
* Created the prompt
* Created the RAG pipeline

In [12]:
llm = ChatOpenAI()

prompt = ChatPromptTemplate.from_template("""
Answer the question using only the context given below.
Also mention company, fiscal year and page number.
Context is:
{context}

Question is:
{question}
""")

def ask_10k(company, year, question):
    retriever = get_retriever(company, year)

    rag_chain = (
        {"context": retriever, "question": RunnablePassthrough()} | prompt | llm | StrOutputParser()
    )

    return rag_chain.invoke(question)

It helps in extracting the company and the year from the query

In [41]:
def extract_company_and_year(query):
    query_lower = query.lower()

    companies = {
        "amazon": "AMZN",
        "amzn": "AMZN",
        "uber": "UBER"
    }

    company = None
    for name, ticker in companies.items():
        if name in query_lower:
            company = ticker
            break

    year = None
    for token in query_lower.split():
        if token.isdigit() and len(token) == 4:
            if 2010 <= int(token) <= 2030:
                year = token
                break

    return [company, year, query]

Running the queries and printing the results

In [14]:
query = "Extract the total liabilities reported in Uber's 2020 10-K form."
vals = extract_company_and_year(query)
company = vals[0]
year = vals[1]
query = vals[2]
print(ask_10k(company, year, query), "\n")

query = "Extract the net sales reported in Amazon’s 2019 10-K form."
vals = extract_company_and_year(query)
company = vals[0]
year = vals[1]
query = vals[2]
print(ask_10k(company, year, query), "\n")

query = "Extract the number of employees reported in Uber's 2021 10-K form."
vals = extract_company_and_year(query)
company = vals[0]
year = vals[1]
query = vals[2]
print(ask_10k(company, year, query), "\n")

Total liabilities reported in Uber's 2020 10-K form is $31,761.

Company: UBER
Fiscal year: 2020
Page number: 149 

The net sales reported in Amazon's 2019 10-K form were as follows:
- North America: $170,773 million
- International: $74,723 million
- AWS: $35,026 million
- Consolidated: $280,522 million

[Company: AMZN, Fiscal Year: 2019, Page Number: 33] 

The number of employees reported in Uber's 2021 10-K form is not provided in the given context. 



# Problem 2

This block of code is used to handle the multi-part queries.
It extract lists of all companie and al ears mentioned in a sinle sentence.

In [15]:
def extract_companies_and_years(query):
    q = query.lower()

    companies = []
    if "amazon" in q or "amzn" in q:
        companies.append("AMZN")
    if "uber" in q:
        companies.append("UBER")

    years = []
    for token in q.split():
        if token.isdigit() and len(token) == 4:
            y = int(token)
            if 2010 <= y <= 2030:
                years.append(str(y))

    return companies, years

It is used to decompose the query.
For example if the query is "Compare Amazon and Uber is 2020 and 2021", then it will break the query in following parts:
* Amazon in 2020
* Amazon in 2021
* Uber is 2020
* Uber in 2021

In [17]:
def decompose_query(query):
    companies, years = extract_companies_and_years(query)

    tasks = []
    for c in companies:
        for y in years:
            tasks.append((c, y, query))

    return tasks

It executes the tasts that are returned by the decompose_query function.

In [18]:
def multi_query_rag(query):
    tasks = decompose_query(query)

    results = []

    for company, year, sub_query in tasks:
        answer = ask_10k(company, year, sub_query)

        results.append({
            "company": company,
            "year": year,
            "answer": answer
        })

    return results

In [42]:
def print_results(results):
    for r in results:
        print(r["answer"])

Showcasing the output of one of the query

In [44]:
query = "Compare Amazon’s net sales in 2019 vs 2021."
results = multi_query_rag(query1)
print(query)
print_results(results)

Compare Amazon’s net sales in 2019 vs 2021.
I'm sorry, I can't provide a comparison for Amazon's net sales in 2019 vs 2021 as the information provided only contains data for the fiscal year 2018 and 2019. The total net sales for Amazon in 2019 were $280,522 million.


Showcasing the output of another query

In [46]:
query = "Summarize the major risk factors in Uber 2021 and Amazon 2020."
results = multi_query_rag(query)
print(f"query: {query}")
print("\n")
print_results(results)

query: Summarize the major risk factors in Uber 2021 and Amazon 2020.


The major risk factors for Amazon in 2020 include intense competition in various industries such as e-commerce, retail, advertising, and logistics, as well as potential risks related to global economic conditions, supply chain constraints, and fluctuations in operating results. The document pertains to Amazon, fiscal year 2021, on page 8.
The major risk factors for Uber in 2021 include uncertainty regarding when driver supply levels will return to pre-pandemic levels due to the impacts of the COVID-19 pandemic, as well as the recent surge of the Omicron variant affecting travel and other operations. Additionally, substantial investments in new offerings and technologies pose risks as expected benefits may not be realized, and operations in large metropolitan areas may be negatively affected by various conditions including COVID-19.

There is no information provided about Amazon in 2020 in the context given.


# Problem 3

Defining the configurations for the problem 3

In [29]:
import requests
import re
from datetime import datetime

RAPID_API_KEY = "66ef2a227dmshbb9c186d77b2539p1ac253jsn6b5ba1ceda97"

BASE_URL = "https://yahoo-finance166.p.rapidapi.com"
HEADERS = {
    "x-rapidapi-key": RAPID_API_KEY,
    "x-rapidapi-host": "yahoo-finance166.p.rapidapi.com"
}

It is the Decision Maker for getting
* The current status of price of Amazon and Uber.
* The Historical trend over past days.

In [52]:
def parse_query(user_query):
    q = user_query.lower()

    symbols = []
    if "amazon" in q or "amzn" in q:
        symbols.append("AMZN")
    if "uber" in q or "uber" in q:
        symbols.append("UBER")

    if "last" in q and "day" in q:
        days = int(re.search(r"(\d+)", q).group(1))
        return {"type": "history", "symbols": symbols, "days": days}

    return {"type": "current", "symbols": symbols}

It send my credentials and retreive the raw data from the server.


In [60]:
def call_yahoo(symbol, range="1mo", interval="5m"):
    url = f"{BASE_URL}/api/stock/get-chart"
    params = {
        "region": "US",
        "symbol": symbol,
        "range": range,
        "interval": interval
    }

    r = requests.get(url, headers=HEADERS, params=params, timeout=10)
    r.raise_for_status()
    return r.json()["chart"]["result"][0]

I have extracted the Price, high, low and the change make and then beautify it so that the extracted text is readable

In [61]:
def get_current(symbol):
    try:
        data = call_yahoo(symbol)

        meta = data["meta"]
        price = meta["regularMarketPrice"]
        prev = meta["previousClose"]
        high = meta["regularMarketDayHigh"]
        low = meta["regularMarketDayLow"]

        pct = round(((price - prev) / prev) * 100, 2)

        return f"""{symbol}
Price: ${price}
Day High: ${high}
Day Low: ${low}
Change: {pct}%"""

    except Exception as e:
        return f"{symbol} ERROR: {str(e)}"

This function handles the Timeline requests. It takes the time series data from the API and converts it into a clean day by day list of closing prices.

In [66]:
def get_history(symbol, days):
    try:
        data = call_yahoo(symbol, range="1mo", interval="1d")

        ts = data["timestamp"]
        closes = data["indicators"]["quote"][0]["close"]

        result = []
        for t, c in zip(ts[-days:], closes[-days:]):
            date = datetime.fromtimestamp(t).strftime("%Y-%m-%d")
            result.append(f"{date}: ${round(c,2)}")

        return f"{symbol} Last {days} Days\n" + "\n".join(result)

    except Exception as e:
        return f"{symbol} ERROR: {str(e)}"

It is the main controller
If the intent of the query is to find the current price it will move to the get_current function and if the intent is history then it will go to the get_history function.

In [67]:
def ask_stock_agent(query):
    intent = parse_query(query)

    answers = []

    if intent["type"] == "current":
        for s in intent["symbols"]:
            answers.append(get_current(s))

    if intent["type"] == "history":
        for s in intent["symbols"]:
            answers.append(get_history(s, intent["days"]))

    return "\n\n".join(answers)

Showcasing the results

In [68]:
print(ask_stock_agent("What is the current stock price of Amazon and Uber?"))

AMZN
Price: $248.215
Day High: $248.48
Day Low: $246.24
Change: 0.35%

UBER
Price: $85.35
Day High: $85.395
Day Low: $83.77
Change: -0.11%


In [71]:
print(ask_stock_agent("Extract stock prices of Uber for the last 5 day."))

UBER Last 2 Days
2026-01-09: $85.44
2026-01-12: $85.41
