# Conversation AI: Assignment 2
## Tasks
### Component: 1. Data Collection & Preprocessing

**Download the last two years of financials (use the earnings statements of any one group member's company; if none are available, use any company's freely available data). Clean and structure the data for retrieval.**

Download Apple's (AAPL) Form 10-K filings for fiscal 2023 and 2024 as PDFs.

In [1]:
import os

import requests

# Directory that holds the raw PDFs
pdf_dl_path = './financial-docs-raw'
os.makedirs(pdf_dl_path, exist_ok=True)

for url in [
    'https://s2.q4cdn.com/470004039/files/doc_earnings/2023/q4/filing/_10-K-Q4-2023-As-Filed.pdf',
    'https://s2.q4cdn.com/470004039/files/doc_earnings/2024/q4/filing/10-Q4-2024-As-Filed.pdf',
    ]:
    response = requests.get(url, timeout=60)
    # Fail loudly on an HTTP error instead of saving an error page as a PDF
    response.raise_for_status()

    # Name the local file after the last path segment of the URL
    with open(os.path.join(pdf_dl_path, url.split('/')[-1]), 'wb') as file:
        file.write(response.content)
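
The download loop names each file after the last `/`-separated segment of its URL. That works for the two URLs used here, but it would keep any query string or fragment in the file name. A slightly more defensive variant, sketched with the standard library only, is shown below; `filename_from_url` is a hypothetical helper introduced for illustration, not part of the original notebook.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def filename_from_url(url: str) -> str:
    """Return the last path segment of a URL, ignoring query string and fragment."""
    path = unquote(urlparse(url).path)  # keep only the path, decode %xx escapes
    name = PurePosixPath(path).name     # last segment, '' if the path ends in '/'
    if not name:
        raise ValueError(f"URL has no file name component: {url!r}")
    return name


print(filename_from_url(
    "https://s2.q4cdn.com/470004039/files/doc_earnings/2023/q4/filing/_10-K-Q4-2023-As-Filed.pdf"
))
```

Raising on an empty name avoids silently writing to the download directory itself when a URL ends in a slash.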

Convert the PDFs to Markdown so the text is easier to chunk and index for later retrieval.

In [3]:
!pip install markitdown

In [4]:
from markitdown import MarkItDown

def get_markdown_from_pdf(pdf_file_path):
    md = MarkItDown()
    result = md.convert(pdf_file_path)
    return result.text_content

md_output_path = './financial-docs-md'
os.makedirs(md_output_path, exist_ok=True)

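for pdf_file in os.listdir(pdf_dl_path):
    # Skip anything in the download directory that is not a PDF (e.g. .DS_Store)
    if not pdf_file.endswith('.pdf'):
        continue
    pdf_file_path = os.path.join(pdf_dl_path, pdf_file)
    markdown_content = get_markdown_from_pdf(pdf_file_path)

    # Write the converted text to a .md file with the same base name
    md_file_path = os.path.join(md_output_path, pdf_file.replace('.pdf', '.md'))
    with open(md_file_path, 'w') as md_file:
        md_file.write(markdown_content)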

### Component: 2. Basic RAG Implementation

- **Convert financial documents into text chunks.**

Now that we have Markdown versions of the PDFs, use LangChain's `MarkdownTextSplitter` to split each file into chunks of at most 1,000 characters with a 100-character overlap.

In [6]:
!pip install langchain
!pip install qdrant-client

In [None]:
from langchain.text_splitter import MarkdownTextSplitter

# Initialize the markdown splitter (sizes are measured in characters)
splitter = MarkdownTextSplitter(chunk_size=1000, chunk_overlap=100)

# Create the chunk directory once, before the loop
chunk_output_path = os.path.join(md_output_path, 'chunks')
os.makedirs(chunk_output_path, exist_ok=True)

# Split the markdown files into chunks
for md_file in os.listdir(md_output_path):
    # Skip the 'chunks' subdirectory and any non-markdown files
    if not md_file.endswith('.md'):
        continue
    md_file_path = os.path.join(md_output_path, md_file)
    with open(md_file_path, 'r') as file:
        markdown_content = file.read()

    chunks = splitter.split_text(markdown_content)

    # Save each chunk to its own file
    for i, chunk in enumerate(chunks):
        chunk_file_path = os.path.join(chunk_output_path, f'{md_file.replace(".md", "")}_chunk_{i}.md')
        with open(chunk_file_path, 'w') as chunk_file:
            chunk_file.write(chunk)
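
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal, standard-library-only sketch of fixed-size chunking with overlap. It is a simplified stand-in and not LangChain's algorithm: `MarkdownTextSplitter` prefers to break on Markdown structure such as headings and paragraph breaks, whereas the hypothetical `sliding_window_chunks` helper below cuts purely by character position.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Each chunk repeats the last three characters of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk as long as it is shorter than the overlap.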



- **Embed using a pre-trained model**

In [10]:
!pip install sentence-transformers

- **Store and retrieve using a basic vector database**

In [15]:
from uuid import uuid4

from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

# Initialize Qdrant client (expects a Qdrant instance listening on localhost:6333)
qdrant_client = QdrantClient("http://localhost:6333", api_key='secret123')

# (Re)create the collection so that reruns start from an empty index
collection_name = "financial_docs"
if qdrant_client.collection_exists(collection_name):
    qdrant_client.delete_collection(collection_name)
qdrant_client.create_collection(
    collection_name=collection_name,
    # all-MiniLM-L6-v2 produces 384-dimensional embeddings
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

# Load a pre-trained model for embedding
model = SentenceTransformer('all-MiniLM-L6-v2')

# Embed the chunks and upsert to Qdrant collection
chunk_dir = os.path.join(md_output_path, 'chunks')
for chunk_file in os.listdir(chunk_dir):
    chunk_file_path = os.path.join(chunk_dir, chunk_file)
    with open(chunk_file_path, 'r') as file:
        chunk_content = file.read()

    # Generate the embedding for the chunk (encode returns one row per input)
    embeddings = model.encode([chunk_content])

    # Create a point structure for Qdrant; convert the NumPy row to a plain list
    points = [
        PointStruct(
            id=str(uuid4()),
            vector=embeddings[0].tolist(),
            payload={"text": chunk_content, "filename": chunk_file}
        )
    ]

    # Upsert the points to the Qdrant collection
    qdrant_client.upsert(
        collection_name=collection_name,
        points=points
    )
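
The collection is configured with cosine distance, so Qdrant ranks chunks by the cosine of the angle between the query vector and each stored vector. As a rough illustration of that measure, here is a standard-library sketch. The `cosine_similarity` helper and the three-dimensional toy vectors are made up for this example; real `all-MiniLM-L6-v2` embeddings have 384 dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0, 0.0], [2.0, 0.0, 0.0]))  # same direction -> 1.0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal -> 0.0
```

Because the measure divides by both norms, it depends only on direction, not on vector length, which is why a long chunk and a short query can still score as close matches.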



In [21]:
def retrieve_relevant_chunks(query, top_k=5):
    # Generate the embedding for the query
    query_embedding = model.encode([query])[0]

    # Search for the most relevant chunks in the Qdrant collection.
    # query_points replaces the deprecated search method.
    search_result = qdrant_client.query_points(
        collection_name=collection_name,
        query=query_embedding.tolist(),
        limit=top_k
    ).points

    # Extract and return the text payload of each hit
    relevant_chunks = [hit.payload['text'] for hit in search_result]
    return relevant_chunks

# Example usage
query = "iPad"
relevant_chunks = retrieve_relevant_chunks(query, top_k=1)
for i, chunk in enumerate(relevant_chunks):
    print(f"Chunk {i+1}:\n{chunk}\n")

Chunk 1:
iPad

iPad net sales decreased during 2024 compared to 2023 due primarily to lower net sales of iPad Pro and the entry-level iPad
models, partially offset by higher net sales of iPad Air.

Wearables, Home and Accessories

Wearables,  Home  and  Accessories  net  sales  decreased  during  2024  compared  to  2023  due  primarily  to  lower  net  sales  of
Wearables and Accessories.

Services

Services net sales increased during 2024 compared to 2023 due primarily to higher net sales from advertising, the App Store®
and cloud services.

Apple Inc. | 2024 Form 10-K | 23

Gross Margin

Products and Services gross margin and gross margin percentage for 2024, 2023 and 2022 were as follows (dollars in millions):

Gross margin:

Products

Services

Total gross margin

Gross margin percentage:

Products

Services

Total gross margin percentage

Products Gross Margin

2024

2023

2022

$

$

109,633  $

108,803  $

114,728

71,050

60,345

56,054

180,683  $

169,148  $

170,782

 37.2
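
Conceptually, the search call in `retrieve_relevant_chunks` scores stored vectors against the query vector and returns the `top_k` best. The sketch below does this by brute force over an in-memory list. It is an illustration only, since Qdrant normally answers such queries through an approximate nearest-neighbour index (HNSW) rather than a full scan. The `top_k_by_dot` helper, the toy two-dimensional vectors, and the labels are all hypothetical, and the vectors are assumed to be unit length so that the dot product equals cosine similarity.

```python
import heapq


def top_k_by_dot(query, items, top_k=5):
    """Return the top_k (score, label) pairs, best first.

    items is a list of (label, vector) pairs; all vectors are assumed unit length.
    """
    scored = (
        (sum(q * v for q, v in zip(query, vector)), label)
        for label, vector in items
    )
    return heapq.nlargest(top_k, scored)


store = [
    ("chunk_a", [1.0, 0.0]),
    ("chunk_b", [0.0, 1.0]),
    ("chunk_c", [0.6, 0.8]),
]
for score, label in top_k_by_dot([1.0, 0.0], store, top_k=2):
    print(f"{label}: {score:.2f}")
# chunk_a: 1.00
# chunk_c: 0.60
```

A full scan like this costs one dot product per stored vector per query, which is fine for a few hundred chunks but is the reason dedicated vector databases build an index.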

