# Week 10 -- From FinBERT to LLM Embeddings

**Key question:** How has NLP for finance evolved from bag-of-words to LLM embeddings, and what does the frontier look like today?

---

## Outline

1. Evolution of NLP in finance: three eras
2. FinBERT as baseline: sentiment analysis
3. The LLM embedding revolution (Chen-Kelly-Xiu 2023)
4. Sentence-transformers for financial text embedding
5. Using API-based embeddings (OpenAI, Anthropic)
6. Agentic AI: AlphaGPT, Qlib RD-Agent
7. Demo: FinBERT sentiment + sentence-transformer embeddings
8. Key papers and references

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 5)

print('Imports ready.')

---
## 1. Evolution of NLP in Finance: Three Eras

### Era 1: Bag-of-Words and Dictionaries (2000-2018)
- **Loughran-McDonald dictionary** (2011): a curated list of positive/negative financial words
- Count word frequencies, compute sentiment scores
- Simple and interpretable, but misses context: in "not profitable", a word counter still scores "profitable" as positive
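
The negation problem can be seen in a few lines. This is a minimal sketch of dictionary-based scoring; the two word sets are a made-up stand-in for the real Loughran-McDonald lists, and `dictionary_sentiment` is a hypothetical helper, not part of any library.

```python
import re

# Hypothetical mini-dictionary for illustration only; the real
# Loughran-McDonald lists contain far more words.
POSITIVE = {"profitable", "record", "strong", "gain"}
NEGATIVE = {"loss", "decline", "weak", "miss"}


def dictionary_sentiment(text: str) -> float:
    """Return (n_pos - n_neg) / n_tokens for a piece of text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    n_pos = sum(tok in POSITIVE for tok in tokens)
    n_neg = sum(tok in NEGATIVE for tok in tokens)
    return (n_pos - n_neg) / len(tokens)


score_plain = dictionary_sentiment("The segment is profitable")
score_negated = dictionary_sentiment("The segment is not profitable")
print(score_plain, score_negated)  # 0.25 0.2
```

Both sentences receive a positive score because the counter never looks at the word "not"; only the token count in the denominator changes.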

### Era 2: FinBERT and Fine-Tuned Transformers (2019-2022)
- **FinBERT** (Araci 2019, Huang et al. 2020): BERT fine-tuned on financial text
- Understands context: "not profitable" is correctly negative
- Classification: positive / negative / neutral
- Limitation: fixed categories, no rich representation

### Era 3: LLM Embeddings and Agentic AI (2023+)
- **Chen-Kelly-Xiu (2023):** showed that LLM embeddings of news text predict stock returns better than the earlier NLP methods they benchmark against
- **Key insight:** do not classify text into sentiment -- instead, embed it into a high-dimensional vector and let the model learn what matters
- **Agentic AI (2024+):** systems like Man Group's AlphaGPT and Microsoft's Qlib RD-Agent that autonomously generate and test alpha hypotheses

In [None]:
# Visualize the three eras
fig, ax = plt.subplots(figsize=(14, 4))

eras = [
    ('Bag-of-Words\n(Loughran-McDonald)', 2000, 2018, '#e74c3c'),
    ('FinBERT\n(Fine-tuned BERT)', 2019, 2022, '#f39c12'),
    ('LLM Embeddings\n+ Agentic AI', 2023, 2026, '#27ae60'),
]

for i, (name, start, end, color) in enumerate(eras):
    ax.barh(0, end - start, left=start, height=0.5, color=color, alpha=0.8, edgecolor='white')
    ax.text((start + end) / 2, 0, name, ha='center', va='center', fontsize=10, fontweight='bold')

# Key milestones
milestones = [
    (2011, 'L-M Dictionary'),
    (2019, 'FinBERT'),
    (2023, 'Chen-Kelly-Xiu'),
    (2024, 'AlphaGPT'),
]
for year, label in milestones:
    ax.annotate(label, xy=(year, -0.35), fontsize=8, ha='center',
                arrowprops=dict(arrowstyle='->', color='black'), xytext=(year, -0.6))

ax.set_xlim(1998, 2027)
ax.set_ylim(-0.8, 0.5)
ax.set_yticks([])
ax.set_xlabel('Year')
ax.set_title('Three Eras of NLP in Finance', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

---
## 2. FinBERT as Baseline

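**FinBERT** is a BERT model adapted to finance. The widely used `ProsusAI/finbert` checkpoint (Araci 2019) was further pre-trained on a financial news corpus (Reuters TRC2) and fine-tuned for sentiment on the Financial PhraseBank, a set of roughly 4,800 labeled sentences.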

**Architecture:**
```
Input text: "Apple reports record quarterly revenue"
     --> BERT tokenizer (subword tokens)
     --> 12-layer transformer encoder
     --> [CLS] token pooling
     --> Linear classification head
     --> Output: {positive: 0.92, negative: 0.03, neutral: 0.05}
```
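
The last step of the pipeline above, turning the classification head's logits into class probabilities, can be sketched with NumPy alone. The logits are made-up numbers, and the label order in `LABELS` is an assumption for this sketch; with a real checkpoint it should be read from `model.config.id2label`.

```python
import numpy as np

# Label order is an assumption for this sketch; with a real checkpoint,
# read it from model.config.id2label instead of hard-coding it.
LABELS = ("positive", "negative", "neutral")


def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Made-up logits standing in for the classification head's output
logits = np.array([3.0, -0.5, 0.1])
probs = softmax(logits)
predicted = LABELS[int(np.argmax(probs))]

# A common scalar summary: P(positive) - P(negative), which lies in [-1, 1]
sentiment_score = float(probs[0] - probs[1])
print(dict(zip(LABELS, probs.round(3))), predicted, round(sentiment_score, 3))
```

Collapsing the three probabilities into one signed score is convenient for regressions, but it discards even more information than the three-way output does.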

**Strengths:**
- Understands financial context
- Pre-trained, easy to use
- Good for sentiment classification tasks

**Limitations:**
- Fixed to 3 classes (positive/negative/neutral)
- Does not capture magnitude or nuance
- ~110M parameters (BERT-base), so limited capacity next to modern LLMs
- "Earnings beat expectations" and "earnings massively beat expectations" get similar scores

**Popular variants:**
- `ProsusAI/finbert` (most commonly used)
- `yiyanghkust/finbert-tone`
- `ahmedrachid/FinancialBERT-Sentiment-Analysis`

---
## 3. The LLM Embedding Revolution

### Chen-Kelly-Xiu (2023): "Expected Returns and Large Language Models"

**Key finding:** Instead of classifying text into sentiment categories, embed the entire text into a continuous vector space, then use the embedding directly for return prediction.

**Why this works better:**

| Approach | Information Retained |
|----------|--------------------|
| Bag-of-words | Word frequencies |
| FinBERT sentiment | 3 numbers (positive/negative/neutral probabilities) |
| LLM embedding | 768-4096 dimensional vector |

**The math:**
```
r_{i,t+1} = f(E(text_{i,t})) + epsilon

where:
  r_{i,t+1}    = next-period return for stock i
  E(text_{i,t}) = LLM embedding of news about stock i at time t
  f(.)          = a learned function (neural net or XGBoost)
```
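
To make the setup concrete, here is a minimal sketch on simulated data. It swaps in closed-form ridge regression for `f(.)` rather than a neural net or XGBoost, uses random vectors in place of real embeddings, and builds in a signal far stronger than anything seen in actual returns so that the toy example is stable. All names (`E`, `r_next`, `ridge_lambda`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: 2,000 "articles" with 64-dim stand-in embeddings.
n_obs, dim = 2000, 64
E = rng.standard_normal((n_obs, dim))
# Signal is deliberately strong for a stable demo; real return signals are tiny.
true_beta = rng.standard_normal(dim) * 0.1
r_next = E @ true_beta + rng.standard_normal(n_obs)  # noisy next-period return

# Chronological split: fit on the first 75%, evaluate on the rest.
split = int(0.75 * n_obs)
E_train, E_test = E[:split], E[split:]
r_train, r_test = r_next[:split], r_next[split:]

# Ridge regression in closed form: beta = (E'E + lambda I)^{-1} E'r
ridge_lambda = 100.0
beta_hat = np.linalg.solve(
    E_train.T @ E_train + ridge_lambda * np.eye(dim), E_train.T @ r_train
)
r_pred = E_test @ beta_hat

# Out-of-sample R^2 against a zero-forecast benchmark
oos_r2 = 1.0 - np.sum((r_test - r_pred) ** 2) / np.sum(r_test ** 2)
print(f"OOS R^2 on simulated data: {oos_r2:.4f}")
```

The chronological split matters: shuffling observations before splitting would leak future information into the training set and inflate the out-of-sample fit.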

**Results (Chen-Kelly-Xiu 2023):**
- LLM embeddings achieve OOS R-squared of ~2% for monthly returns
- Outperforms FinBERT sentiment, Loughran-McDonald, and the other NLP baselines the authors test
- The alpha is distinct from known risk factors (Fama-French, momentum, etc.)

**Why?** The embedding captures *nuance* that sentiment scores miss:
- Magnitude: "slight miss" vs. "massive miss"
- Context: "revenue grew despite macro headwinds" (resilience signal)
- Forward-looking language: "management raised guidance" (future expectation)
- Cross-referencing: implicit comparisons to competitors or industry trends

In [None]:
# Illustration: Information loss from sentiment vs. embeddings

headlines = [
    "Apple reports slight miss on iPhone revenue",
    "Apple reports massive miss on iPhone revenue, shares plunge 10%",
    "Apple beats estimates with record services revenue",
    "Apple slightly edges past lowered expectations",
    "Apple's iPhone sales decline amid weak China demand",
]

# Simulated FinBERT sentiment (3 numbers per headline)
finbert_scores = [
    [0.15, 0.70, 0.15],  # negative
    [0.05, 0.90, 0.05],  # very negative
    [0.85, 0.05, 0.10],  # positive
    [0.40, 0.20, 0.40],  # neutral-ish positive
    [0.10, 0.75, 0.15],  # negative
]

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# FinBERT: only 3 numbers
labels_short = [h[:40] + '...' for h in headlines]
fb_df = pd.DataFrame(finbert_scores, columns=['Positive', 'Negative', 'Neutral'],
                     index=labels_short)
fb_df.plot(kind='barh', stacked=True, ax=axes[0],
           color=['#27ae60', '#e74c3c', '#95a5a6'], edgecolor='white')
axes[0].set_title('FinBERT: 3 Numbers Per Headline')
axes[0].set_xlabel('Probability')
axes[0].legend(loc='lower right')

# LLM Embedding: 768-dim heatmap (simulated)
np.random.seed(42)
fake_embeddings = np.random.randn(5, 50) * 0.5  # show 50 of 768 dims
# Make embeddings reflect semantic similarity
fake_embeddings[1] = fake_embeddings[0] + np.random.randn(50) * 0.2  # similar to 0
fake_embeddings[3] = fake_embeddings[2] + np.random.randn(50) * 0.3  # similar to 2

im = axes[1].imshow(fake_embeddings, aspect='auto', cmap='RdBu_r', vmin=-1.5, vmax=1.5)
axes[1].set_yticks(range(5))
axes[1].set_yticklabels(labels_short, fontsize=8)
axes[1].set_xlabel('Embedding Dimension (showing 50 of 768)')
axes[1].set_title('LLM Embedding: 768 Numbers Per Headline')
plt.colorbar(im, ax=axes[1], shrink=0.8)

plt.suptitle('Information Content: Sentiment vs. Embedding', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

print('Key point: FinBERT labels headlines 1 and 2 both "negative".')
print('A real embedding can also encode the MAGNITUDE difference (slight vs. massive miss); the heatmap here is simulated.')

---
## 4. Sentence-Transformers for Financial Text Embedding

**Sentence-transformers** (Reimers & Gurevych 2019) are BERT-like models optimized to produce meaningful sentence-level embeddings.

**Why use sentence-transformers instead of raw BERT?**
- BERT is designed for token-level tasks (NER, question answering)
- Its [CLS] token is not a good sentence embedding out of the box
- Sentence-transformers are trained with contrastive learning to make semantically similar sentences have similar embeddings
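
One common alternative to [CLS] pooling is a masked mean over token vectors followed by L2 normalization. The sketch below shows that operation in NumPy on made-up token vectors; `mean_pool` and `l2_normalize` are hypothetical helper names, and the pooling a given checkpoint actually uses should be checked in its model card.

```python
import numpy as np


def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sentence, ignoring padding positions.

    token_embeddings: (batch, seq_len, dim); attention_mask: (batch, seq_len) of 0/1.
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


# Toy batch: 2 sentences, up to 3 tokens, 2-dim token vectors (made-up numbers)
tokens = np.array([
    [[1.0, 0.0], [3.0, 0.0], [9.0, 9.0]],   # third token is padding
    [[0.0, 2.0], [0.0, 4.0], [0.0, 6.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

sentence_emb = l2_normalize(mean_pool(tokens, mask))
print(sentence_emb)  # rows: [1, 0] and [0, 1]
```

The padded token `[9, 9]` has no effect on the first sentence's vector, which is the point of applying the attention mask before averaging.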

**Popular models:**

| Model | Dim | Speed | Quality |
|-------|-----|-------|---------|
| `all-MiniLM-L6-v2` | 384 | Fast | Good |
| `all-mpnet-base-v2` | 768 | Medium | Better |
| `intfloat/e5-large-v2` | 1024 | Slow | Best of the three |

**Usage:**
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(['Apple beats earnings', 'Google misses revenue'])
# embeddings.shape = (2, 384)
```

**For finance:** You can use these embeddings directly as features for return prediction, or fine-tune on financial text pairs.
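
Before embeddings can serve as features, several headlines about the same stock on the same day usually have to be collapsed into one row. A simple option is averaging, sketched below with pandas. The news table, tickers, and 8-dimensional random vectors are all hypothetical placeholders for real data and real `model.encode(...)` output.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical news table: one row per headline, with a ticker and a date.
news = pd.DataFrame({
    "ticker": ["AAPL", "AAPL", "MSFT", "AAPL", "MSFT"],
    "date": pd.to_datetime(
        ["2024-01-02", "2024-01-02", "2024-01-02", "2024-01-03", "2024-01-03"]
    ),
})
dim = 8  # stand-in for 384; random vectors replace model.encode(...) output
emb = rng.standard_normal((len(news), dim))

# One column per embedding dimension, aligned with the news rows
emb_cols = [f"emb_{j}" for j in range(dim)]
emb_df = pd.DataFrame(emb, columns=emb_cols, index=news.index)

# Average all headlines for the same stock on the same day
features = (
    pd.concat([news, emb_df], axis=1)
    .groupby(["ticker", "date"])[emb_cols]
    .mean()
)
print(features.shape)  # (4, 8): four distinct (ticker, date) pairs
```

Averaging is only one choice; weighting by recency or source, or keeping the most recent headline, are equally plausible and worth comparing in a backtest.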

---
## 5. API-Based Embeddings

### OpenAI Embeddings
```python
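from openai import OpenAI  # requires openai>=1.0

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.embeddings.create(
    input='Apple reports record revenue',
    model='text-embedding-3-small',  # 1536-dim by default
)
embedding = response.data[0].embedding  # list of 1536 floats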
```

### Anthropic (via Messages API)
- Anthropic does not offer a dedicated embedding endpoint (as of early 2026)
- Workaround: use Claude to extract structured features from text (not raw embeddings)
- Or: use sentence-transformers locally for embeddings, Claude for text understanding tasks

### Cost Considerations

| Method | Cost per 1M tokens | Dimensions | Latency |
|--------|-------------------|------------|----------|
| OpenAI text-embedding-3-small | ~$0.02 | 1536 | ~100ms |
| OpenAI text-embedding-3-large | ~$0.13 | 3072 | ~200ms |
| Sentence-transformers (local) | $0 (compute only) | 384-1024 | ~10ms |

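For a quant fund processing 100K news articles per day, the per-token fees in the table are only part of the picture: historical backfills multiply the volume, and network latency, rate limits, and sending proprietary text to a third party all matter. Local models are often preferred for production.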
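
The fee side can be checked with simple arithmetic. The sketch below uses the per-million-token prices from the table and assumes roughly 500 tokens per article; that figure is an assumption for illustration, not a measured value, and `daily_embedding_cost` is a hypothetical helper.

```python
def daily_embedding_cost(
    articles_per_day: int,
    tokens_per_article: int,
    usd_per_million_tokens: float,
) -> float:
    """Estimated daily API spend in USD for embedding a news flow."""
    total_tokens = articles_per_day * tokens_per_article
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Assumption: ~500 tokens per article (not a figure from the text above)
TOKENS_PER_ARTICLE = 500

cost_small = daily_embedding_cost(100_000, TOKENS_PER_ARTICLE, 0.02)
cost_large = daily_embedding_cost(100_000, TOKENS_PER_ARTICLE, 0.13)
print(f"small: ${cost_small:.2f}/day, large: ${cost_large:.2f}/day")
# small: $1.00/day, large: $6.50/day
```

Under these assumptions the daily fee is modest; re-embedding many years of archived news or full-length filings scales the token count, and therefore the bill, proportionally.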

---
## 6. Agentic AI for Quant Finance

The frontier is moving beyond embeddings toward *agentic systems* that autonomously generate and test alpha hypotheses.

### Man Group -- AlphaGPT (2024)
- **Paper:** "Can Large Language Models Beat Wall Street?" (Man Group, 2024)
- An LLM agent that:
  1. Reads financial research papers and news
  2. Proposes alpha factor hypotheses
  3. Writes code to compute the factors
  4. Backtests them
  5. Iterates based on results
- Human quant researchers supervise and curate the output
- **Key result:** Some LLM-generated factors showed genuine alpha in out-of-sample testing

### Microsoft Qlib -- RD-Agent (2024)
- Open-source agent built on Microsoft's Qlib framework
- Autonomous research-and-development loop:
  1. Generate factor hypothesis
  2. Implement in code
  3. Run backtest
  4. Analyze results
  5. Refine hypothesis
- **GitHub:** [microsoft/RD-Agent](https://github.com/microsoft/RD-Agent)
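
The shape of such a loop can be illustrated without any LLM. The toy sketch below is not RD-Agent's code or API: a fixed list of momentum lookback windows stands in for LLM-generated hypotheses, the returns are simulated with a small built-in momentum effect, and "refinement" is a simple neighbor search around the best candidate so far.

```python
import numpy as np

rng = np.random.default_rng(7)
MAX_LOOKBACK = 60

# Simulated daily returns (500 days x 50 stocks) with a small built-in
# 5-day momentum effect so that the loop has something to find.
returns = rng.standard_normal((500, 50)) * 0.01
for t in range(5, len(returns)):
    returns[t] += 0.5 * returns[t - 5:t].mean(axis=0)


def backtest_ic(rets: np.ndarray, lookback: int) -> float:
    """Mean cross-sectional correlation (IC) between trailing-mean momentum
    over `lookback` days and the following day's return."""
    ics = []
    for t in range(MAX_LOOKBACK, len(rets)):
        factor = rets[t - lookback:t].mean(axis=0)
        ics.append(np.corrcoef(factor, rets[t])[0, 1])
    return float(np.mean(ics))


# Steps 1-3: propose hypotheses (a fixed list stands in for the LLM),
# implement each one, and backtest it.
results = {lb: backtest_ic(returns, lb) for lb in (2, 5, 10, 20, 60)}

# Steps 4-5: analyze, then refine by testing neighbors of the current best.
for _ in range(3):
    best = max(results, key=results.get)
    for lb in (best - 1, best + 1):
        if 1 <= lb <= MAX_LOOKBACK and lb not in results:
            results[lb] = backtest_ic(returns, lb)

best = max(results, key=results.get)
print(f"best lookback: {best}, mean IC: {results[best]:.3f}")
```

Every candidate is scored on the same data, so the winning IC is biased upward by selection; a real pipeline would hold out a final test period that the loop never sees.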

### What this means

- NLP is no longer just about *reading* text -- it is about *reasoning* with it
- LLMs are becoming research tools, not just data processing tools
- The quant researcher's role shifts from feature engineering to supervising AI agents
- But: the agents still need human oversight, domain knowledge, and risk management

---
## 7. Demo: FinBERT Sentiment + Sentence-Transformer Embeddings

We will:
1. Run FinBERT on sample financial headlines
2. Generate sentence-transformer embeddings
3. Visualize the embedding space
4. Compare the information content

**Setup:**
```bash
pip install transformers sentence-transformers torch scikit-learn
```

In [None]:
# Sample financial headlines
headlines = [
    # Positive
    "Apple reports record quarterly revenue, beats analyst expectations",
    "NVIDIA surges 15% on strong AI chip demand, raises guidance",
    "JPMorgan posts best quarter in history, increases dividend",
    "Tesla deliveries exceed estimates, stock rallies on optimism",
    # Negative
    "Meta shares plunge after disappointing ad revenue guidance",
    "Boeing reports massive loss, faces regulatory scrutiny",
    "Bank of America misses earnings, warns of rising credit losses",
    "Pfizer cuts full-year forecast amid declining vaccine demand",
    # Neutral / Mixed
    "Microsoft reports in-line results, cloud growth moderates",
    "Google faces antitrust ruling, but ad revenue holds steady",
    "Amazon's AWS growth slows but retail margins improve",
    "Fed holds rates steady, signals possible cut in September",
]

labels = ['Positive'] * 4 + ['Negative'] * 4 + ['Mixed'] * 4
print(f'{len(headlines)} headlines loaded.')

In [None]:
# FinBERT sentiment analysis
try:
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    import torch

    finbert_model_name = 'ProsusAI/finbert'
    tokenizer = AutoTokenizer.from_pretrained(finbert_model_name)
    model = AutoModelForSequenceClassification.from_pretrained(finbert_model_name)
    model.eval()

    finbert_results = []
    for headline in headlines:
        inputs = tokenizer(headline, return_tensors='pt', truncation=True, max_length=512)
        with torch.no_grad():
            outputs = model(**inputs)
        probs = torch.softmax(outputs.logits, dim=1).squeeze().numpy()
        finbert_results.append({
            'headline': headline[:60] + '...' if len(headline) > 60 else headline,
            'positive': probs[0],
            'negative': probs[1],
            'neutral': probs[2],
            'sentiment': ['positive', 'negative', 'neutral'][np.argmax(probs)],
        })
    finbert_available = True
    print('FinBERT analysis complete.')

except (ImportError, OSError):  # OSError covers a failed model download
    print('transformers or FinBERT weights not available. Using simulated FinBERT results.')
    np.random.seed(42)
    finbert_results = []
    sentiment_map = {'Positive': [0.8, 0.1, 0.1], 'Negative': [0.1, 0.8, 0.1], 'Mixed': [0.3, 0.3, 0.4]}
    for headline, label in zip(headlines, labels):
        base = sentiment_map[label]
        noise = np.random.dirichlet([10, 10, 10]) * 0.1
        probs = np.array(base) + noise
        probs /= probs.sum()
        finbert_results.append({
            'headline': headline[:60] + '...' if len(headline) > 60 else headline,
            'positive': probs[0],
            'negative': probs[1],
            'neutral': probs[2],
            'sentiment': ['positive', 'negative', 'neutral'][np.argmax(probs)],
        })
    finbert_available = False

fb_df = pd.DataFrame(finbert_results)
print(fb_df[['headline', 'sentiment', 'positive', 'negative', 'neutral']].to_string(index=False))

In [None]:
# Sentence-transformer embeddings
try:
    from sentence_transformers import SentenceTransformer

    st_model = SentenceTransformer('all-MiniLM-L6-v2')  # 384-dim, fast
    embeddings = st_model.encode(headlines)
    st_available = True
    print(f'Embeddings shape: {embeddings.shape}')

except (ImportError, OSError):  # OSError covers a failed model download
    print('sentence-transformers or model weights not available. Using simulated embeddings.')
    np.random.seed(42)
    embeddings = np.random.randn(len(headlines), 384)
    # Make similar headlines have similar embeddings
    for i in range(4):
        embeddings[i] += 1.0  # positive cluster
    for i in range(4, 8):
        embeddings[i] -= 1.0  # negative cluster
    st_available = False
    print(f'Simulated embeddings shape: {embeddings.shape}')

In [None]:
# Visualize embedding space with PCA and a cosine-similarity heatmap
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

# PCA to 2D
pca = PCA(n_components=2)
emb_2d = pca.fit_transform(embeddings)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Scatter plot
color_map = {'Positive': '#27ae60', 'Negative': '#e74c3c', 'Mixed': '#f39c12'}
for label in ['Positive', 'Negative', 'Mixed']:
    mask = [l == label for l in labels]
    axes[0].scatter(emb_2d[mask, 0], emb_2d[mask, 1], c=color_map[label],
                    label=label, s=80, edgecolors='white', zorder=3)

# Annotate points
for i, headline in enumerate(headlines):
    short = headline.split(',')[0][:30]
    axes[0].annotate(short, (emb_2d[i, 0], emb_2d[i, 1]), fontsize=7, alpha=0.7)

axes[0].set_title('Embedding Space (PCA 2D Projection)')
axes[0].legend()
axes[0].set_xlabel('PC1')
axes[0].set_ylabel('PC2')

# Cosine similarity heatmap
cos_sim = cosine_similarity(embeddings)
short_labels = [h[:30] + '...' for h in headlines]
im = axes[1].imshow(cos_sim, cmap='RdBu_r', vmin=-0.2, vmax=1.0)
axes[1].set_xticks(range(len(headlines)))
axes[1].set_yticks(range(len(headlines)))
axes[1].set_xticklabels(range(len(headlines)), fontsize=8)
axes[1].set_yticklabels(range(len(headlines)), fontsize=8)
axes[1].set_title('Pairwise Cosine Similarity')
plt.colorbar(im, ax=axes[1], shrink=0.8)

plt.suptitle('Sentence-Transformer Embeddings of Financial Headlines', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

print('Headline index:')
for i, h in enumerate(headlines):
    print(f'  {i}: {h[:65]}')

In [None]:
# Compare information content: FinBERT (3 dims) vs. embedding (384 dims)

# FinBERT similarity (based on 3 sentiment scores)
fb_vectors = np.array([[r['positive'], r['negative'], r['neutral']] for r in finbert_results])
fb_sim = cosine_similarity(fb_vectors)

# Embedding similarity
emb_sim = cosine_similarity(embeddings)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

im1 = axes[0].imshow(fb_sim, cmap='RdBu_r', vmin=-0.2, vmax=1.0)
axes[0].set_title('FinBERT Similarity (3 dims)')
plt.colorbar(im1, ax=axes[0], shrink=0.8)

im2 = axes[1].imshow(emb_sim, cmap='RdBu_r', vmin=-0.2, vmax=1.0)
axes[1].set_title('Embedding Similarity (384 dims)')
plt.colorbar(im2, ax=axes[1], shrink=0.8)

plt.suptitle('Richer Representations Capture More Nuance', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

print('Notice: FinBERT groups ALL negative headlines as similar.')
print('Embeddings distinguish between "Boeing loss" and "Pfizer forecast cut" -- different stocks, different stories.')

---
## 8. Key Papers and References

### NLP for Finance -- Foundational
- Loughran & McDonald, "When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks" (Journal of Finance, 2011). The financial sentiment dictionary.
- Araci, "FinBERT: Financial Sentiment Analysis with Pre-trained Language Models" (2019). [arXiv:1908.10063](https://arxiv.org/abs/1908.10063)
- Huang, Wang, Yang, "FinBERT: A Large Language Model for Extracting Information from Financial Text" (2020).

### LLM Embeddings for Return Prediction
- Chen, Kelly, Xiu, "Expected Returns and Large Language Models" (2023). The landmark paper showing LLM embeddings outperform traditional NLP for return prediction. [SSRN](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4416687)
- Reimers & Gurevych, "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks" (2019). [arXiv:1908.10084](https://arxiv.org/abs/1908.10084)

### Agentic AI
- Man Group, "AlphaGPT: Can Large Language Models Beat Wall Street?" (2024). LLM agents for autonomous alpha research.
- Microsoft Research, "RD-Agent: An Autonomous Research-Development Agent for Factor Discovery" (2024). [GitHub](https://github.com/microsoft/RD-Agent)

### Practical Resources
- Sentence-Transformers library: [sbert.net](https://www.sbert.net/)
- HuggingFace FinBERT: [ProsusAI/finbert](https://huggingface.co/ProsusAI/finbert)
- OpenAI Embedding API: [platform.openai.com](https://platform.openai.com/docs/guides/embeddings)

---

## Summary

1. NLP for finance has evolved from word counting (2000s) to contextual understanding (FinBERT, 2019) to rich embeddings (2023+)
2. LLM embeddings capture far more information than sentiment classification (384-4096 dimensions vs. 3 class probabilities)
3. Chen-Kelly-Xiu (2023) showed that embeddings outperform the earlier NLP methods they benchmark for return prediction
4. Sentence-transformers provide a practical, free, local option for embedding financial text
5. Agentic AI (AlphaGPT, RD-Agent) represents the next frontier: LLMs as autonomous quant researchers
6. The quant researcher's role is shifting from feature engineering to supervising AI systems