## Structuring Text for Financial Analysis

This notebook demonstrates several techniques for analyzing financial text documents:
1. **Bag of Words** - Simple word frequency analysis
2. **Sentiment Analysis** - Dictionary-based sentiment scoring
3. **Embeddings** - Semantic similarity using sentence transformers
4. **LLM Labelling** - Using large language models to classify text chunks


In [None]:
# Install libraries & load dictionary
!pip install PyPDF2
import nltk
nltk.download('vader_lexicon')



[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


True

In [None]:
# Import required libraries
import requests
from io import BytesIO
from PyPDF2 import PdfReader
import pandas as pd

def load_pdf_from_url(pdf_url: str) -> str:
    """
    Download a PDF from a URL and return extracted text.

    Args:
        pdf_url: URL pointing to a PDF file

    Returns:
        Extracted text content from all pages
    """
    print(f"Downloading PDF from {pdf_url}...")
    response = requests.get(pdf_url, timeout=30)  # fail instead of hanging on a stalled download
    if response.status_code != 200:
        raise RuntimeError(f"Failed to download PDF. Status code: {response.status_code}")

    reader = PdfReader(BytesIO(response.content))
    return "\n".join((page.extract_text() or "") for page in reader.pages)


In [None]:
# Define URLs for Google earnings transcripts
pdf_urls = {
    "goog_2025q4": "https://raw.githubusercontent.com/travlake/ai-investments-course/main/Examples/Alphabet_2025_Q4_Earnings_Transcript.pdf",
    "goog_2025q3": "https://raw.githubusercontent.com/travlake/ai-investments-course/main/Examples/Alphabet_2025_Q3_Earnings_Transcript.pdf",
    "goog_2023q1": "https://raw.githubusercontent.com/travlake/ai-investments-course/main/Examples/Alphabet_2023_Q1_Earnings_Transcript.pdf"
}

# Download and extract text from all PDFs
pdf_texts = {name: load_pdf_from_url(url) for name, url in pdf_urls.items()}

print(f"\nLoaded {len(pdf_texts)} transcripts:")
for name, text in pdf_texts.items():
    print(f"  - {name}: {len(text):,} characters")


Downloading PDF from https://raw.githubusercontent.com/travlake/ai-investments-course/main/Examples/Alphabet_2025_Q4_Earnings_Transcript.pdf...
Downloading PDF from https://raw.githubusercontent.com/travlake/ai-investments-course/main/Examples/Alphabet_2025_Q3_Earnings_Transcript.pdf...
Downloading PDF from https://raw.githubusercontent.com/travlake/ai-investments-course/main/Examples/Alphabet_2023_Q1_Earnings_Transcript.pdf...

Loaded 3 transcripts:
  - goog_2025q4: 74,269 characters
  - goog_2025q3: 66,783 characters
  - goog_2023q1: 54,144 characters


## Bag of Words Analysis
Bag of words is a simple but effective way to summarize a document: count how often each word appears and ignore word order.
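To make the idea concrete before turning to scikit-learn, here is a minimal standard-library sketch of the same counting step. The stop-word list, the tokenizing regex, and the helper names `bag_of_words` and `rate_per_1000` are illustrative assumptions, not part of the notebook's pipeline. `CountVectorizer` uses its own tokenizer and a much longer stop-word list.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; scikit-learn's built-in
# English list is much longer).
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to", "we", "our", "is"}

def bag_of_words(text: str) -> Counter:
    """Lowercase, keep alphabetic tokens of 2+ letters, drop stop words, count."""
    tokens = re.findall(r"[a-z]{2,}", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

def rate_per_1000(counts: Counter) -> dict:
    """Convert raw counts to occurrences per 1,000 counted tokens."""
    total = sum(counts.values())
    return {word: 1000 * n / total for word, n in counts.items()}

doc = "Cloud growth was strong. AI drove growth in Search and in Cloud."
counts = bag_of_words(doc)
print(counts.most_common(3))
print(round(rate_per_1000(counts)["growth"], 1))
```

The per-1,000-token rate is worth considering whenever documents differ in length, since raw counts tend to rise with transcript size even when emphasis does not change.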


In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Analyze a single document - 2023 Q1 earnings call
pdf_text = pdf_texts['goog_2023q1']

# Create bag of words representation (excluding common English stop words)
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform([pdf_text])

# Extract word frequencies
feature_names = vectorizer.get_feature_names_out()
word_counts = X.toarray().sum(axis=0)

# Create DataFrame and show top 10 most frequent words
word_freq_df = pd.DataFrame({'Word': feature_names, 'Count': word_counts})
top_words = word_freq_df.sort_values(by='Count', ascending=False).head(10)

print("Top 10 Most Frequent Words (2023 Q1):")
print(top_words.to_string(index=False))


Top 10 Most Frequent Words (2023 Q1):
   Word  Count
 google     66
     ai     65
quarter     49
 search     44
youtube     42
  think     42
  cloud     38
 growth     36
   year     33
  thank     32


In [None]:
# Compare word frequencies between two earnings calls
pdf_text1 = pdf_texts['goog_2023q1']
pdf_text2 = pdf_texts['goog_2025q4']

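# Fit the vocabulary on the first (2023) document, then transform both.
# Note: words that appear only in the 2025 call are outside this vocabulary and are not counted.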
vectorizer = CountVectorizer(stop_words='english')
X1 = vectorizer.fit_transform([pdf_text1])
X2 = vectorizer.transform([pdf_text2])
feature_names = vectorizer.get_feature_names_out()

# Calculate word counts for each document
word_counts1 = X1.toarray().sum(axis=0)
word_counts2 = X2.toarray().sum(axis=0)

# Create DataFrames
word_freq_df1 = pd.DataFrame({'Word': feature_names, 'Count_2023': word_counts1})
word_freq_df2 = pd.DataFrame({'Word': feature_names, 'Count_2025': word_counts2})

# Calculate difference in word frequencies
merged_df = word_freq_df1.merge(word_freq_df2, on='Word')
merged_df['Diff'] = merged_df['Count_2025'] - merged_df['Count_2023']

# Get top words for each period
top_2023 = merged_df.nlargest(10, 'Count_2023')[['Word', 'Count_2023']].reset_index(drop=True)
top_2025 = merged_df.nlargest(10, 'Count_2025')[['Word', 'Count_2025']].reset_index(drop=True)

# Get rising and falling words (biggest changes)
rising = merged_df.nlargest(10, 'Diff')[['Word', 'Count_2023', 'Count_2025', 'Diff']].reset_index(drop=True)
falling = merged_df.nsmallest(10, 'Diff')[['Word', 'Count_2023', 'Count_2025', 'Diff']].reset_index(drop=True)

print("=" * 60)
print("WORD FREQUENCY COMPARISON: 2023 Q1 vs 2025 Q4")
print("=" * 60)

print("\n📊 Top 10 Words in 2023 Q1:")
print(top_2023.to_string(index=False))

print("\n📊 Top 10 Words in 2025 Q4:")
print(top_2025.to_string(index=False))

print("\n📈 Top 10 Rising Words (increased from 2023 to 2025):")
print(rising.to_string(index=False))

print("\n📉 Top 10 Falling Words (decreased from 2023 to 2025):")
print(falling.to_string(index=False))


WORD FREQUENCY COMPARISON: 2023 Q1 vs 2025 Q4

📊 Top 10 Words in 2023 Q1:
   Word  Count_2023
 google          66
     ai          65
quarter          49
 search          44
  think          42
youtube          42
  cloud          38
 growth          36
   year          33
  thank          32

📊 Top 10 Words in 2025 Q4:
   Word  Count_2025
     ai          94
billion          55
   year          53
youtube          48
 google          47
 growth          43
  think          43
 search          37
quarter          34
  cloud          32

📈 Top 10 Rising Words (increased from 2023 to 2025):
       Word  Count_2023  Count_2025  Diff
    billion          24          55    31
         ai          65          94    29
     strong          11          31    20
       year          33          53    20
       like          11          29    18
       mode           2          19    17
  increased           2          17    15
     driven           3          16    13
investments      

## Dictionary-Based Sentiment Analysis
Use NLTK's VADER sentiment analyzer to measure positive, negative, and neutral sentiment in the transcripts.


In [None]:
from nltk.sentiment import SentimentIntensityAnalyzer

# Initialize sentiment analyzer
sia = SentimentIntensityAnalyzer()

# Analyze both transcripts
pdf_text1 = pdf_texts['goog_2023q1']
pdf_text2 = pdf_texts['goog_2025q4']

sentiment1 = sia.polarity_scores(pdf_text1)
sentiment2 = sia.polarity_scores(pdf_text2)

# Create comparison DataFrame
sentiment_df = pd.DataFrame({
    'Sentiment': ['Positive', 'Negative', 'Neutral', 'Compound'],
    '2023 Q1': [sentiment1['pos'], sentiment1['neg'], sentiment1['neu'], sentiment1['compound']],
    '2025 Q4': [sentiment2['pos'], sentiment2['neg'], sentiment2['neu'], sentiment2['compound']]
})

print("=" * 50)
print("SENTIMENT COMPARISON: 2023 Q1 vs 2025 Q4")
print("=" * 50)
print(sentiment_df.round(3).to_string(index=False))
print("\nNote: Compound score ranges from -1 (most negative) to +1 (most positive)")


SENTIMENT COMPARISON: 2023 Q1 vs 2025 Q4
Sentiment  2023 Q1  2025 Q4
 Positive    0.159    0.147
 Negative    0.008    0.009
  Neutral    0.833    0.844
 Compound    1.000    1.000

Note: Compound score ranges from -1 (most negative) to +1 (most positive)
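Both compound scores above come out at 1.000, so they cannot separate the two calls. A likely explanation, assuming VADER's usual normalization step (the summed word valence `x` is mapped to `x / sqrt(x^2 + alpha)` with `alpha = 15`), is that a transcript-length text accumulates a sum large enough to saturate. The sketch below reproduces only that formula. The function name `vader_normalize` is hypothetical, and the raw sums are made-up inputs rather than values taken from the transcripts.

```python
import math

def vader_normalize(raw_score: float, alpha: float = 15.0) -> float:
    """Squash a summed valence score into (-1, 1) using x / sqrt(x^2 + alpha)."""
    return raw_score / math.sqrt(raw_score * raw_score + alpha)

# Made-up raw sums: a short sentence versus a very long document
for raw in (1, 4, 20, 500):
    print(f"raw sum {raw:>3} -> compound {vader_normalize(raw):.3f}")
```

If this is what is happening, scoring each sentence or chunk separately and averaging the compound scores should give a measure that varies more between calls. The `pos`, `neg`, and `neu` proportions in the table are not squashed this way, which is consistent with them still differing across the two calls.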


## Embeddings and Semantic Similarity
Use sentence transformers to create dense vector representations (embeddings) of the documents. This allows us to measure semantic similarity between texts.
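The two operations this section relies on, averaging chunk vectors into one document vector and comparing vectors by cosine similarity, can be sketched in plain Python. The three-dimensional vectors are made-up toy values, not model output, and the helper names `mean_vector` and `cosine` are illustrative. The cells below use NumPy's `mean` and scikit-learn's `cosine_similarity` instead.

```python
import math

def mean_vector(vectors):
    """Average a list of equal-length vectors element by element."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "chunk embeddings" (made-up numbers for illustration)
chunk_vectors = [[1.0, 0.0, 1.0], [1.0, 2.0, 1.0]]
doc_vector = mean_vector(chunk_vectors)  # element-wise mean of the two chunks

print(cosine(doc_vector, [2.0, 2.0, 2.0]))   # same direction, so close to 1
print(cosine(doc_vector, [1.0, -1.0, 0.0]))  # orthogonal, so 0
```

Cosine similarity depends only on direction, not vector length, which is why a document vector can be compared with the embedding of a single word or short sentence.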


In [None]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load a pre-trained sentence transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

def chunk_text(text, chunk_size=100):
    """
    Split text into chunks of approximately chunk_size words.

    Args:
        text: Input text string
        chunk_size: Number of words per chunk

    Returns:
        List of text chunks
    """
    words = text.split()
    return [
        " ".join(words[i:i+chunk_size])
        for i in range(0, len(words), chunk_size)
    ]

# Create document embeddings by averaging chunk embeddings
chunks_2025 = chunk_text(pdf_texts['goog_2025q4'])
E_2025 = model.encode(chunks_2025, show_progress_bar=False)
call_embedding_2025 = E_2025.mean(axis=0)

chunks_2023 = chunk_text(pdf_texts['goog_2023q1'])
E_2023 = model.encode(chunks_2023, show_progress_bar=False)
call_embedding_2023 = E_2023.mean(axis=0)

print(f"Created embeddings with {len(call_embedding_2025)} dimensions")
print(f"2025 Q4: {len(chunks_2025)} chunks")
print(f"2023 Q1: {len(chunks_2023)} chunks")
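Fixed-size chunks can cut a sentence or an answer in half at a chunk boundary. One common variation, sketched here as an assumption rather than something the notebook uses, lets consecutive chunks share a few words. The function name `chunk_text_overlap` and its `overlap` parameter are hypothetical.

```python
def chunk_text_overlap(text, chunk_size=100, overlap=20):
    """Split text into word chunks where consecutive chunks share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap
    # Stop early enough that the final chunk always adds at least one new word
    last_start = max(len(words) - overlap, 1)
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, last_start, step)
    ]

sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_text_overlap(sample, chunk_size=4, overlap=2):
    print(chunk)
```

With `overlap=0` this reduces to the same splits as `chunk_text`. Overlap increases the number of chunks, and therefore the number of embedding or LLM calls, so it is a trade-off rather than a free improvement.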


In [None]:
# Compare document embedding to reference sentences to gain intuition about what the embedding captures
sentences = [
    "An earnings call transcript for Google's Q4 2025 earnings.",
    "Google reported strong earnings growth and increased capex in AI infrastructure",
    "Google reported weak earnings growth and reduced capex in AI infrastructure",
    "A recipe for sourdough bread including fermentation steps.",
    "Management provides concrete metrics, timelines, and operational details about performance.",
    "Management uses optimistic language but avoids specific numbers or commitments.",
    "The company discusses increased capital expenditures, infrastructure investment, and long-term capacity building.",
    "The company discusses buybacks, dividends, and returning capital to shareholders.",
    "Management expresses confidence and upside opportunities.",
    "Management expresses caution and downside risks.",
    "AI",
    "Pizza",
    "Billions of dollars",
    "Harry Potter"
]

# Encode reference sentences
sentence_embeddings = model.encode(sentences, show_progress_bar=False)

# Calculate similarities between 2025 Q4 document and reference sentences
similarities = cosine_similarity([call_embedding_2025], sentence_embeddings)[0]

# Calculate similarity between the two earnings calls
similarity_between_calls = cosine_similarity([call_embedding_2025], [call_embedding_2023])[0][0]

# Display results as a DataFrame
similarity_df = pd.DataFrame({
    'Reference Sentence': sentences + ['2023 Q1 Call Transcript'],
    'Similarity to 2025 Q4': list(similarities) + [similarity_between_calls]
})

print("=" * 80)
print("EMBEDDING SIMILARITY ANALYSIS")
print("Comparing 2025 Q4 earnings call embedding to reference sentences")
print("=" * 80)
print(similarity_df.round(3).to_string(index=False))


EMBEDDING SIMILARITY ANALYSIS
Comparing 2025 Q4 earnings call embedding to reference sentences
                                                                                               Reference Sentence  Similarity to 2025 Q4
                                                       An earnings call transcript for Google's Q4 2025 earnings.                  0.487
                                  Google reported strong earnings growth and increased capex in AI infrastructure                  0.663
                                      Google reported weak earnings growth and reduced capex in AI infrastructure                  0.630
                                                       A recipe for sourdough bread including fermentation steps.                  0.049
                      Management provides concrete metrics, timelines, and operational details about performance.                  0.294
                                  Management uses optimistic language but avoids sp

## LLM Labelling
Use a large language model (Gemini) to classify text chunks based on how AI is framed in the content.


In [None]:
from google import genai
from google.genai import types
import json

# Initialize Gemini client
# Note: Replace with your own API key
client = genai.Client(api_key="YOUR KEY HERE")

# Define the classification prompt
prompt = """
How is AI framed in this text?

Choose the closest category:
- revenue_driver
- cost_or_capex
- strategic_positioning
- vague_hype
- not_discussed

Think out loud in your response before giving your final answer.

When ready to give your final answer, write "<start of final answer>". Everything after this tag should be a JSON with two keys: "label" and "confidence". The "label" should be one of the categories above, and "confidence" should be an integer between 1 and 5 indicating how confident you are in your label.

Text:
"""

def extract_json_after_tag(text, tag="<start of final answer>"):
    """
    Extract JSON object from LLM response after a specific tag.

    Args:
        text: LLM response text
        tag: Marker indicating start of structured output

    Returns:
        Parsed JSON as dictionary, or None if parsing fails
    """
    idx = text.lower().find(tag.lower())
    if idx == -1:
        return None

    snippet = text[idx + len(tag):]
    start = snippet.find("{")
    if start == -1:
        return None

    try:
        return json.JSONDecoder().raw_decode(snippet[start:])[0]
    except json.JSONDecodeError:
        return None
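The parser can be tried without calling the API. The sketch below repeats the same parsing logic under the hypothetical name `parse_final_answer` so that it runs on its own, then adds an assumed validation step, `is_valid`, that checks the label against the prompt's categories and the confidence against the 1 to 5 range. The response text is made up for illustration and is not real model output.

```python
import json

ALLOWED_LABELS = {
    "revenue_driver", "cost_or_capex", "strategic_positioning",
    "vague_hype", "not_discussed",
}

def parse_final_answer(text, tag="<start of final answer>"):
    """Return the first JSON value after `tag`, or None if missing or malformed."""
    idx = text.lower().find(tag.lower())
    if idx == -1:
        return None
    snippet = text[idx + len(tag):]
    start = snippet.find("{")
    if start == -1:
        return None
    try:
        # raw_decode tolerates trailing text after the JSON object
        return json.JSONDecoder().raw_decode(snippet[start:])[0]
    except json.JSONDecodeError:
        return None

def is_valid(parsed):
    """Check that label is a known category and confidence is an integer from 1 to 5."""
    if not isinstance(parsed, dict):
        return False
    label = parsed.get("label")
    conf = parsed.get("confidence")
    return (
        isinstance(label, str)
        and label in ALLOWED_LABELS
        and isinstance(conf, int)
        and not isinstance(conf, bool)
        and 1 <= conf <= 5
    )

# Made-up response text (not real model output)
mock = (
    "The passage ties AI features to advertising revenue growth.\n"
    "<start of final answer>\n"
    '{"label": "revenue_driver", "confidence": 4}\n'
    "Hope this helps."
)

parsed = parse_final_answer(mock)
print(parsed, is_valid(parsed))
print(is_valid({"label": "bullish", "confidence": 4}))
```

Validating before appending to `results` would keep labels outside the requested categories from silently entering the summary tables.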


In [None]:
# Process chunks through the LLM
chunks = chunk_text(pdf_texts['goog_2025q4'], chunk_size=500)
results = []

print(f"Processing {len(chunks)} chunks through LLM...")
print("-" * 60)

for i, chunk in enumerate(chunks):
    try:
        response = client.models.generate_content(
            model="gemini-3-flash-preview",
            config=types.GenerateContentConfig(
                temperature=0.0,
                system_instruction="You are an experienced financial analyst specializing in AI companies."
            ),
            contents=prompt + chunk
        )

        parsed = extract_json_after_tag(response.text)

        if parsed:
            results.append({
                "chunk_number": i,
                "label": parsed.get("label"),
                "confidence": parsed.get("confidence")
            })
            print(f"✓ Chunk {i}/{len(chunks)-1}: {parsed.get('label')} (confidence: {parsed.get('confidence')})")
        else:
            results.append({
                "chunk_number": i,
                "label": "PARSE_ERROR",
                "confidence": 0
            })
            print(f"✗ Chunk {i}/{len(chunks)-1}: Parsing failed")

    except Exception as e:
        print(f"✗ Chunk {i}/{len(chunks)-1}: API error - {e}")
        results.append({
            "chunk_number": i,
            "label": "API_ERROR",
            "confidence": 0
        })

# Create results DataFrame
df_output = pd.DataFrame(results, columns=["chunk_number", "label", "confidence"])

print("\n" + "=" * 60)
print("LLM LABELLING RESULTS")
print("=" * 60)

# Show summary statistics
print("\nLabel Distribution:")
print(df_output['label'].value_counts().to_string())

print("\nAverage Confidence by Label:")
print(df_output.groupby('label')['confidence'].mean().round(2).to_string())

print("\nFull Results:")
print(df_output.to_string(index=False))

Processing 19 chunks through LLM...
------------------------------------------------------------
✓ Chunk 0/18: revenue_driver (confidence: 5)
✓ Chunk 1/18: strategic_positioning (confidence: 5)
✓ Chunk 2/18: revenue_driver (confidence: 5)
✓ Chunk 3/18: revenue_driver (confidence: 5)
✓ Chunk 4/18: revenue_driver (confidence: 5)
✓ Chunk 5/18: revenue_driver (confidence: 5)
✓ Chunk 6/18: revenue_driver (confidence: 4)
✓ Chunk 7/18: revenue_driver (confidence: 5)
✓ Chunk 8/18: cost_or_capex (confidence: 5)
✓ Chunk 9/18: strategic_positioning (confidence: 5)
✓ Chunk 10/18: cost_or_capex (confidence: 5)
✓ Chunk 11/18: strategic_positioning (confidence: 4)
✓ Chunk 12/18: strategic_positioning (confidence: 4)
✓ Chunk 13/18: strategic_positioning (confidence: 4)
✓ Chunk 14/18: cost_or_capex (confidence: 5)
✓ Chunk 15/18: strategic_positioning (confidence: 4)
✓ Chunk 16/18: revenue_driver (confidence: 5)
✓ Chunk 17/18: revenue_driver (confidence: 5)
✓ Chunk 