# Week 10 — Homework: Text Alpha — FinBERT vs. LLM Embeddings

**Course:** ML for Quantitative Finance  
**Due:** Before Week 11 lecture

---

## Objective

Compare three text feature sets for return prediction:
- **Baseline:** FinBERT sentiment (3 scores per headline)
- **Modern:** Sentence-transformer embeddings (384-dim per headline)
- **Combined:** Both feature sets together

Test whether richer text representations carry more predictive information than 3-class sentiment scores.

## Setup

```bash
pip install transformers sentence-transformers datasets xgboost scikit-learn pandas scipy matplotlib
```

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 5)

## Part 1: Data Preparation (10 pts)

1. Load the Financial PhraseBank dataset (available through the Hugging Face `datasets` library, or as a manual download from the Hugging Face Hub)
2. Filter to sentences with ≥75% annotator agreement
3. Report: class distribution, average sentence length
4. Assign simulated stock tickers and dates for the pipeline demo
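
Steps 3 and 4 can be sketched on a toy frame. This is a minimal sketch, not the required solution: the three example sentences, the ticker symbols, the start date, and the helper name `assign_tickers_and_dates` are hypothetical placeholders, and the real input would be the filtered PhraseBank sentences.

```python
import numpy as np
import pandas as pd

def assign_tickers_and_dates(df, tickers, start="2024-01-02", n_days=20, seed=0):
    """Attach a random ticker and business date to each row (simulation only)."""
    rng = np.random.default_rng(seed)
    dates = pd.bdate_range(start=start, periods=n_days).to_numpy()
    out = df.copy()
    out["ticker"] = rng.choice(tickers, size=len(out))
    out["date"] = dates[rng.integers(0, n_days, size=len(out))]
    return out

# Toy stand-in for the PhraseBank sentences (hypothetical text and labels).
toy = pd.DataFrame({
    "text": [
        "Operating profit rose compared with the previous year",
        "The company will cut jobs at its main plant",
        "The annual general meeting will be held in April",
    ],
    "label": ["positive", "negative", "neutral"],
})

# Step 3: class distribution and average sentence length in words.
class_share = toy["label"].value_counts(normalize=True)
avg_words = toy["text"].str.split().str.len().mean()

# Step 4: simulated tickers and dates (hypothetical symbols).
demo = assign_tickers_and_dates(toy, tickers=["AAA", "BBB", "CCC"])

print(class_share)
print(f"average sentence length: {avg_words:.1f} words")
print(demo[["ticker", "date"]])
```

Fixing the seed keeps the simulated assignment reproducible across runs, which makes later IC numbers comparable between notebook restarts.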

In [None]:
# TODO: Load Financial PhraseBank
#   Option A: from datasets import load_dataset
#             dataset = load_dataset('financial_phrasebank', 'sentences_75agree')
#   Option B: Manual download from https://huggingface.co/datasets/financial_phrasebank
#   Option C: Use the provided synthetic dataset below as fallback

# TODO: If using fallback, create a dataset of ~200 financial headlines
#   with columns: ['text', 'label', 'ticker', 'date', 'next_day_return']
# TODO: Report class distribution and basic stats

## Part 2: FinBERT Baseline (20 pts)

1. Load `ProsusAI/finbert` from HuggingFace
2. Score all headlines → get (positive, negative, neutral) probabilities
3. Compute `net_sentiment = positive - negative`
4. Aggregate to stock-level daily sentiment (average across headlines per ticker per day)
5. Measure the information coefficient (IC, the Spearman rank correlation) between `net_sentiment` and next-day return
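
Steps 2 to 5 can be sketched without downloading the model by starting from an array of logits. Everything below is illustrative: the logits and returns are made-up numbers, `net_sentiment_from_logits` is a hypothetical helper, and the label order is an assumption that should be read from `model.config.id2label` rather than hard-coded.

```python
import numpy as np
import pandas as pd
from scipy import stats

def net_sentiment_from_logits(logits, id2label):
    """Softmax over classes, then P(positive) - P(negative) per headline."""
    z = logits - logits.max(axis=1, keepdims=True)  # stabilize the exponentials
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    col = {name.lower(): i for i, name in id2label.items()}
    return probs[:, col["positive"]] - probs[:, col["negative"]]

# Stand-in logits; in the assignment these would come from the FinBERT forward pass.
logits = np.array([
    [2.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [0.5, 0.0, 1.0],
    [1.0, 0.2, 0.1],
])
id2label = {0: "positive", 1: "negative", 2: "neutral"}  # ASSUMED order; verify on the model

# Made-up returns, one value per ticker-day.
df = pd.DataFrame({
    "ticker": ["AAA", "AAA", "BBB", "CCC"],
    "date": pd.to_datetime(["2024-01-02"] * 4),
    "next_day_return": [0.001, 0.001, -0.004, 0.012],
})
df["net_sentiment"] = net_sentiment_from_logits(logits, id2label)

# Step 4: average sentiment across headlines per ticker per day.
daily = df.groupby(["ticker", "date"], as_index=False).agg(
    net_sentiment=("net_sentiment", "mean"),
    next_day_return=("next_day_return", "first"),
)

# Step 5: IC as the Spearman rank correlation.
ic, p_value = stats.spearmanr(daily["net_sentiment"], daily["next_day_return"])
print(daily)
print(f"IC = {ic:.2f}")
```

With only three ticker-days the p-value is meaningless; the point is the shape of the pipeline, not the number.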

In [None]:
# TODO: Load FinBERT
#   from transformers import AutoTokenizer, AutoModelForSequenceClassification
#   tokenizer = AutoTokenizer.from_pretrained('ProsusAI/finbert')
#   model = AutoModelForSequenceClassification.from_pretrained('ProsusAI/finbert')

# TODO: Score all headlines (batch or loop)
# TODO: Compute net_sentiment = positive - negative
# TODO: Aggregate per ticker per day
# TODO: Compute IC(net_sentiment, next_day_return)
# TODO: Plot sentiment vs return scatter

## Part 3: Sentence-Transformer Embeddings (25 pts)

1. Encode all headlines with `all-MiniLM-L6-v2` (384-dim; small enough to run locally, e.g., on an Apple M4 laptop)
2. Reduce dimensions with PCA to 10 and to 20 components, and compare the two
3. Aggregate to stock-level daily embedding (average embeddings per ticker per day)
4. Visualize embedding space: PCA scatter colored by sentiment label
5. Compute pairwise cosine similarity — do semantically similar headlines cluster?
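
Steps 2, 3, and 5 can be sketched with random vectors standing in for the encoder output. This is a sketch under stated assumptions: the random `embeddings`, the ticker symbols, and the dates are placeholders, so the explained variance and similarity values carry no meaning here.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
n_headlines, dim = 60, 384
embeddings = rng.normal(size=(n_headlines, dim))  # stand-in for model.encode(headlines)
meta = pd.DataFrame({
    "ticker": rng.choice(["AAA", "BBB", "CCC"], size=n_headlines),
    "date": pd.to_datetime("2024-01-02")
    + pd.to_timedelta(rng.integers(0, 5, size=n_headlines), unit="D"),
})

# Step 2: reduce to 10 and to 20 components and compare explained variance.
reduced = {}
for k in (10, 20):
    pca = PCA(n_components=k, random_state=0)
    reduced[k] = pca.fit_transform(embeddings)
    print(k, "components explain", round(float(pca.explained_variance_ratio_.sum()), 3))

# Step 3: average the reduced embeddings per ticker per day.
cols = [f"pc{i}" for i in range(10)]
daily = (
    pd.DataFrame(reduced[10], columns=cols)
    .join(meta)
    .groupby(["ticker", "date"])[cols]
    .mean()
)

# Step 5: pairwise cosine similarity between headlines (n x n, ones on the diagonal).
sim = cosine_similarity(embeddings)
print(daily.shape, sim.shape)
```

One design point worth considering: fitting PCA on the full sample uses future headlines to define the components, so for the prediction models in Part 4 it may be safer to fit PCA on the training window only.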

In [None]:
# TODO: Load sentence-transformer
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer('all-MiniLM-L6-v2')
#   embeddings = model.encode(headlines)

# TODO: PCA to 10 and 20 components
# TODO: Aggregate per ticker per day
# TODO: Plot PCA 2D scatter colored by sentiment
# TODO: Plot cosine similarity heatmap

## Part 4: Model Comparison (25 pts)

Train XGBoost to predict next-day return using 4 feature sets:

| Model | Features |
|-------|----------|
| A | Price/volume features only (momentum, vol) |
| B | A + FinBERT sentiment (3 scores) |
| C | A + embedding PCA (10 or 20 components) |
| D | A + both text feature sets |

Evaluate with:
- Information coefficient (Spearman rank correlation between predicted and realized returns)
- Direction accuracy (% of predictions with the correct sign)
- The **marginal contribution** of each text feature set (change in IC relative to Model A)
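
The evaluation loop can be sketched as a walk-forward split over dates plus the two metrics. This is a minimal sketch under loud assumptions: the panel is synthetic with a signal planted by construction, so the numbers say nothing about real text features; `walk_forward_predict` and the column names are hypothetical; and `Ridge` is swapped in for XGBoost only so the example runs without the `xgboost` package. A factory such as `lambda: XGBRegressor(n_estimators=200, max_depth=3)` should slot in the same way, since both expose `fit` and `predict`.

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit

def information_coefficient(pred, actual):
    """Spearman rank correlation between predictions and realized returns."""
    rho, _ = stats.spearmanr(pred, actual)
    return float(rho)

def direction_accuracy(pred, actual):
    """Share of predictions whose sign matches the realized return."""
    return float(np.mean(np.sign(pred) == np.sign(actual)))

def walk_forward_predict(df, feature_cols, target_col, make_model, n_splits=4):
    """Train on earlier dates, predict later dates; splits on unique dates."""
    dates = pd.Index(df["date"].unique()).sort_values()
    preds = pd.Series(np.nan, index=df.index)
    splitter = TimeSeriesSplit(n_splits=n_splits)
    for train_idx, test_idx in splitter.split(np.arange(len(dates))):
        train = df[df["date"].isin(dates[train_idx])]
        test = df[df["date"].isin(dates[test_idx])]
        model = make_model().fit(train[feature_cols], train[target_col])
        preds.loc[test.index] = model.predict(test[feature_cols])
    return preds

# Synthetic panel: 40 business days x 4 hypothetical tickers.
rng = np.random.default_rng(0)
days = pd.bdate_range("2024-01-02", periods=40)
panel = pd.DataFrame(
    [(d, t) for d in days for t in ["AAA", "BBB", "CCC", "DDD"]],
    columns=["date", "ticker"],
)
n = len(panel)
panel["momentum"] = rng.normal(size=n)
panel["volatility"] = rng.normal(size=n)
panel["net_sentiment"] = rng.normal(size=n)
# Planted signal (ASSUMPTION for the demo only).
panel["next_day_return"] = 0.02 * panel["net_sentiment"] + rng.normal(scale=0.002, size=n)

feature_sets = {
    "A": ["momentum", "volatility"],
    "B": ["momentum", "volatility", "net_sentiment"],
}
rows = {}
for name, cols in feature_sets.items():
    preds = walk_forward_predict(panel, cols, "next_day_return", lambda: Ridge(alpha=1.0))
    mask = preds.notna()  # the earliest dates are train-only and get no prediction
    actual = panel.loc[mask, "next_day_return"]
    rows[name] = {
        "IC": information_coefficient(preds[mask], actual),
        "direction_accuracy": direction_accuracy(preds[mask], actual),
    }
table = pd.DataFrame(rows).T
table["marginal_IC_vs_A"] = table["IC"] - table.loc["A", "IC"]
print(table)
```

Splitting on unique dates rather than on rows keeps all tickers for a given day on the same side of the split, which avoids training on one stock's return for a day while testing on another's.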

In [None]:
# TODO: Construct feature matrices for Models A, B, C, D
# TODO: Train XGBoost with cross-validation (temporal split, or leave-one-out if the sample is very small)
# TODO: Compute IC, direction accuracy for each model
# TODO: Build comparison table
# TODO: Bar chart comparing IC across models

## Part 5: Signal Decay Analysis (10 pts)

1. For both FinBERT sentiment and embedding features, compute IC at different horizons:
   - 1-day, 2-day, 5-day, 10-day, 20-day forward returns
2. Plot IC vs horizon for both methods
3. Which decays faster? Why?
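
The horizon sweep can be sketched as follows. This assumes a long-format frame with a daily `close` price per ticker, which the assignment does not specify; the column names, the helpers `add_forward_returns` and `ic_by_horizon`, and the random-walk prices and random signal are all hypothetical, so the resulting ICs are noise by construction.

```python
import numpy as np
import pandas as pd
from scipy import stats

def add_forward_returns(df, horizons, price_col="close"):
    """Add h-day forward simple returns per ticker: close[t+h] / close[t] - 1."""
    out = df.sort_values(["ticker", "date"]).copy()
    grouped = out.groupby("ticker")[price_col]
    for h in horizons:
        out[f"fwd_{h}d"] = grouped.shift(-h) / out[price_col] - 1.0
    return out

def ic_by_horizon(df, signal_col, horizons):
    """Spearman IC between one signal and each forward-return horizon."""
    ics = {}
    for h in horizons:
        pair = df[[signal_col, f"fwd_{h}d"]].dropna()
        rho, _ = stats.spearmanr(pair[signal_col], pair[f"fwd_{h}d"])
        ics[h] = rho
    return pd.Series(ics, name=f"IC_{signal_col}")

# Synthetic prices for two hypothetical tickers and a random stand-in signal.
rng = np.random.default_rng(0)
dates = pd.bdate_range("2024-01-02", periods=60)
frames = []
for ticker in ["AAA", "BBB"]:
    close = 100 * np.cumprod(1 + rng.normal(scale=0.01, size=len(dates)))
    frames.append(pd.DataFrame({"ticker": ticker, "date": dates, "close": close}))
prices = pd.concat(frames, ignore_index=True)
prices["net_sentiment"] = rng.normal(size=len(prices))

horizons = [1, 2, 5, 10, 20]
panel = add_forward_returns(prices, horizons)
decay = ic_by_horizon(panel, "net_sentiment", horizons)
print(decay)  # plot with decay.plot(marker="o") to get the decay curve
```

Note that multi-day forward returns overlap from one day to the next, so observations at longer horizons are not independent; this may be worth mentioning when interpreting how quickly each curve decays.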

In [None]:
# TODO: Compute forward returns at multiple horizons
# TODO: Compute IC(text_feature, forward_return) for each horizon
# TODO: Plot signal decay curves for FinBERT vs embeddings
# TODO: Discuss which decays faster and why

## Part 6: Discussion (10 pts)

Answer in 2-3 sentences each:

1. Why might embeddings outperform 3-class sentiment? What information do they preserve that sentiment scores discard?
2. In production, you need to process 10,000 headlines per day. Compare the cost/benefit of:
   - Free sentence-transformers (local, ~10ms per headline)
   - OpenAI embedding API (~$0.02 per 1M tokens)
   - Fine-tuning FinBERT on your own data
3. If everyone uses the same embedding model, is there still alpha? Where does the edge come from?
4. How would you handle conflicting headlines about the same stock on the same day?
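
For question 2, a back-of-envelope calculation helps put the quoted figures on the same scale. The latency and API price come from the question itself; the 20 tokens per headline is an assumption made here for illustration and should be replaced with a measured value.

```python
headlines_per_day = 10_000
local_latency_s = 0.010        # ~10 ms per headline, from the question
api_price_per_million = 0.02   # USD per 1M tokens, from the question
tokens_per_headline = 20       # ASSUMPTION: not given in the assignment

# Local model: total sequential compute time per day.
local_seconds = headlines_per_day * local_latency_s

# API: total tokens per day and the resulting daily cost.
api_tokens = headlines_per_day * tokens_per_headline
api_cost_usd = api_tokens / 1_000_000 * api_price_per_million

print(f"local: {local_seconds:.0f} s/day of compute; API: ${api_cost_usd:.4f}/day")
```

Under these assumptions both options are cheap at this volume, which suggests the comparison should weigh other factors such as latency, data privacy, and model quality rather than raw cost.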

In [None]:
# Write your discussion in markdown cells below