# Week 10 Seminar -- From FinBERT to LLM Embeddings

**Duration:** 90 minutes

| Exercise | Topic | Time |
|----------|-------|------|
| 1 | FinBERT sentiment classification | 25 min |
| 2 | Sentence-transformer embeddings + clustering | 25 min |
| 3 | Build a text-to-alpha pipeline | 20 min |
| 4 | Discussion: Will LLMs replace quant researchers? | 20 min |

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 5)

print('Imports ready.')

In [None]:
# Sample dataset: financial headlines with dates and tickers
news_data = pd.DataFrame({
    'date': pd.to_datetime([
        '2024-01-25', '2024-01-25', '2024-01-26', '2024-01-26',
        '2024-02-01', '2024-02-01', '2024-02-02', '2024-02-02',
        '2024-02-15', '2024-02-15', '2024-02-16', '2024-02-16',
        '2024-03-01', '2024-03-01', '2024-03-05', '2024-03-05',
        '2024-03-10', '2024-03-10', '2024-03-15', '2024-03-15',
        '2024-04-01', '2024-04-01', '2024-04-10', '2024-04-10',
    ]),
    'ticker': [
        'AAPL', 'MSFT', 'AAPL', 'GOOGL',
        'META', 'AMZN', 'NVDA', 'TSLA',
        'JPM', 'BAC', 'AAPL', 'MSFT',
        'GOOGL', 'META', 'NVDA', 'AMZN',
        'TSLA', 'JPM', 'AAPL', 'MSFT',
        'NVDA', 'GOOGL', 'META', 'AMZN',
    ],
    'headline': [
        "Apple beats Q1 estimates with strong iPhone sales in emerging markets",
        "Microsoft cloud revenue grows 30%, AI products drive enterprise adoption",
        "Apple warns of slowing China demand, shares dip 3% in after-hours",
        "Google announces $70B share buyback program, stock hits all-time high",
        "Meta's Reality Labs loses $4.6B, but core advertising rebounds sharply",
        "Amazon Web Services growth accelerates for third straight quarter",
        "NVIDIA revenue more than triples on unprecedented AI chip demand",
        "Tesla misses delivery targets, price cuts fail to boost demand",
        "JPMorgan raises dividend by 10%, signals confidence in economic outlook",
        "Bank of America warns of rising consumer credit delinquencies",
        "Apple Vision Pro launch receives mixed reviews from early adopters",
        "Microsoft faces EU antitrust probe over Teams bundling practices",
        "Google DeepMind achieves breakthrough in protein structure prediction",
        "Meta launches next-gen Llama model, open-sources weights for researchers",
        "NVIDIA announces Blackwell GPU architecture, orders backlogged through 2025",
        "Amazon expands same-day delivery to 30 new metro areas",
        "Tesla recalls 2.2 million vehicles over warning light software issue",
        "JPMorgan CEO warns of geopolitical risks to global economic recovery",
        "Apple services revenue hits record $23B, offsetting hardware decline",
        "Microsoft Copilot adoption surges among Fortune 500 companies",
        "NVIDIA data center revenue hits $18.4B, exceeding all estimates",
        "Google Cloud achieves operating profitability for first time",
        "Meta cuts 10,000 jobs in latest round of efficiency measures",
        "Amazon's advertising business grows 27%, becomes third profit pillar",
    ],
    # Simulated next-day returns
    'next_day_return': [
        0.025, 0.018, -0.031, 0.042,
        -0.008, 0.015, 0.068, -0.045,
        0.012, -0.022, -0.005, -0.015,
        0.020, 0.010, 0.035, 0.008,
        -0.038, -0.010, 0.015, 0.022,
        0.055, 0.028, -0.020, 0.018,
    ],
})

print(f'Loaded {len(news_data)} headlines for {news_data["ticker"].nunique()} tickers.')
news_data.head(5)

---
## Exercise 1: FinBERT Sentiment Classification (25 min)

**Goal:** Run FinBERT on all headlines and analyze the relationship between sentiment and next-day returns.

### Tasks
1. Load `ProsusAI/finbert` and classify each headline
2. Add sentiment scores (positive, negative, neutral) to the dataframe
3. Compute the average next-day return by sentiment class
4. Measure the information coefficient between sentiment score and return
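
Task 4 does not say which correlation the information coefficient should use. The sketch below contrasts the Pearson correlation with the Spearman (rank) correlation, which is less sensitive to a single extreme return. The helpers `average_ranks`, `pearson_ic`, and `rank_ic` are hypothetical names, and the `scores` and `returns` arrays are made-up illustration values, not the seminar data.

```python
import numpy as np

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind='mergesort')
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def pearson_ic(scores, returns):
    """Pearson correlation between signal and return."""
    return float(np.corrcoef(scores, returns)[0, 1])

def rank_ic(scores, returns):
    """Spearman correlation: Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(scores), average_ranks(returns))[0, 1])

# Illustrative values only (hypothetical, not taken from the seminar dataset)
scores = np.array([0.9, 0.4, -0.2, -0.7, 0.1, 0.6])
returns = np.array([0.30, 0.01, -0.01, -0.02, 0.00, 0.02])

print(f'Pearson IC: {pearson_ic(scores, returns):.3f}')
print(f'Rank IC:    {rank_ic(scores, returns):.3f}')
```

In this toy example the scores and returns are ordered identically, so the rank IC is 1, while the Pearson IC is pulled well below 1 by the single large return.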

In [None]:
# Task 1: Run FinBERT
try:
    # Import torch inside the try block so the fallback also covers a missing torch install
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained('ProsusAI/finbert')
    finbert = AutoModelForSequenceClassification.from_pretrained('ProsusAI/finbert')
    finbert.eval()

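    # Read the label order from the model config instead of hard-coding it
    id2label = {i: label.lower() for i, label in finbert.config.id2label.items()}
    short = {'positive': 'pos', 'negative': 'neg', 'neutral': 'neu'}

    sentiments = []
    for headline in news_data['headline']:
        inputs = tokenizer(headline, return_tensors='pt', truncation=True, max_length=512)
        with torch.no_grad():
            logits = finbert(**inputs).logits
        probs = torch.softmax(logits, dim=1).squeeze(0).numpy()
        sentiments.append({short[id2label[i]]: float(p) for i, p in enumerate(probs)})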

    finbert_loaded = True
    print('FinBERT analysis complete.')

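except (ImportError, OSError):
    # ImportError: torch/transformers missing; OSError: model download or load failed
    print('FinBERT not available. Using simulated FinBERT scores.')
    np.random.seed(42)
    sentiments = []
    for ret in news_data['next_day_return']:
        # Simulated scores are derived from the realised return, so any IC measured
        # in fallback mode is built in by construction and is for illustration only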
        if ret > 0.01:
            base = [0.7, 0.1, 0.2]
        elif ret < -0.01:
            base = [0.1, 0.7, 0.2]
        else:
            base = [0.25, 0.25, 0.50]
        noise = np.random.dirichlet([8, 8, 8])
        probs = 0.7 * np.array(base) + 0.3 * noise
        probs /= probs.sum()
        sentiments.append({'pos': probs[0], 'neg': probs[1], 'neu': probs[2]})
    finbert_loaded = False

In [None]:
# Task 2: Add scores to dataframe
news_data['fb_positive'] = [s['pos'] for s in sentiments]
news_data['fb_negative'] = [s['neg'] for s in sentiments]
news_data['fb_neutral'] = [s['neu'] for s in sentiments]
news_data['fb_sentiment'] = news_data[['fb_positive', 'fb_negative', 'fb_neutral']].idxmax(axis=1)
news_data['fb_sentiment'] = news_data['fb_sentiment'].str.replace('fb_', '')

# Net sentiment score: positive - negative
news_data['fb_net_score'] = news_data['fb_positive'] - news_data['fb_negative']

print(news_data[['headline', 'fb_sentiment', 'fb_net_score', 'next_day_return']].head(10).to_string())

In [None]:
# Task 3: Average return by sentiment class
sentiment_returns = news_data.groupby('fb_sentiment')['next_day_return'].agg(['mean', 'std', 'count'])
print('Average Next-Day Return by FinBERT Sentiment:')
print(sentiment_returns.round(4).to_string())

# Task 4: Information coefficient
# np.corrcoef returns the Pearson correlation; a Spearman (rank) IC is a common alternative
ic = np.corrcoef(news_data['fb_net_score'], news_data['next_day_return'])[0, 1]
print(f'\nIC (FinBERT net score vs. next-day return): {ic:.4f}')

In [None]:
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Sentiment vs. return scatter
colors = {'positive': '#27ae60', 'negative': '#e74c3c', 'neutral': '#95a5a6'}
for sent in ['positive', 'negative', 'neutral']:
    mask = news_data['fb_sentiment'] == sent
    axes[0].scatter(news_data.loc[mask, 'fb_net_score'],
                    news_data.loc[mask, 'next_day_return'],
                    c=colors[sent], label=sent, s=60, edgecolors='white', alpha=0.8)

axes[0].axhline(y=0, color='gray', linestyle='--', alpha=0.5)
axes[0].axvline(x=0, color='gray', linestyle='--', alpha=0.5)
axes[0].set_xlabel('FinBERT Net Score (pos - neg)')
axes[0].set_ylabel('Next-Day Return')
axes[0].set_title(f'FinBERT Sentiment vs. Return (IC={ic:.3f})')
axes[0].legend()

# Bar chart of average returns
bar_colors = [colors.get(s, 'gray') for s in sentiment_returns.index]
axes[1].bar(sentiment_returns.index, sentiment_returns['mean'], color=bar_colors, edgecolor='white')
axes[1].set_ylabel('Average Next-Day Return')
axes[1].set_title('Average Return by Sentiment Class')
axes[1].axhline(y=0, color='gray', linestyle='--')

plt.suptitle('Exercise 1: FinBERT Sentiment Analysis', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

### Exercise 1 -- Discussion
- Is the IC statistically significant with only 24 observations?
- Does the sentiment capture *magnitude* of the move?
- What information is lost by reducing a headline to 3 probabilities?
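
One way to approach the first question is a permutation test: shuffle the returns many times and count how often the shuffled correlation is at least as large in magnitude as the observed one. The sketch below is a minimal version under stated assumptions: `permutation_pvalue` and `t_statistic` are hypothetical helpers, and `x`, `y` are synthetic data, not the seminar headlines. As a rule of thumb from the t-distribution with 22 degrees of freedom, a Pearson correlation on 24 observations needs a magnitude of roughly 0.40 to clear the two-sided 5% level.

```python
import numpy as np

def permutation_pvalue(scores, returns, n_perm=5000, seed=0):
    """Two-sided permutation p-value for the Pearson correlation."""
    scores = np.asarray(scores, dtype=float)
    returns = np.asarray(returns, dtype=float)
    rng = np.random.default_rng(seed)
    observed = np.corrcoef(scores, returns)[0, 1]
    exceed = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(returns)
        if abs(np.corrcoef(scores, shuffled)[0, 1]) >= abs(observed):
            exceed += 1
    # Add-one correction keeps the estimate strictly above zero
    return observed, (exceed + 1) / (n_perm + 1)

def t_statistic(r, n):
    """t-statistic for H0: correlation = 0, with n - 2 degrees of freedom."""
    return r * np.sqrt((n - 2) / (1 - r ** 2))

# Synthetic example: 24 noisy observations with a weak true relationship
rng = np.random.default_rng(1)
x = rng.normal(size=24)
y = 0.2 * x + rng.normal(size=24)

r, p = permutation_pvalue(x, y)
print(f'r={r:.3f}, t={t_statistic(r, 24):.2f}, permutation p={p:.3f}')
```

To apply it to the notebook, pass `news_data['fb_net_score']` and `news_data['next_day_return']` in place of `x` and `y`.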

---

## Exercise 2: Sentence-Transformer Embeddings + Clustering (25 min)

**Goal:** Generate rich embeddings of financial headlines and explore their structure via clustering.

### Tasks
1. Encode all headlines with `all-MiniLM-L6-v2`
2. Compute pairwise cosine similarity
3. Cluster embeddings (K-Means, k=4)
4. Visualize clusters in 2D (PCA)
5. Analyze: do clusters correspond to sectors, sentiment, or something else?
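
For Task 2, it helps to see that cosine similarity is just the dot product of L2-normalised vectors. The sketch below computes it with NumPy on a toy matrix and picks the most similar pair from the upper triangle. `cosine_similarity_matrix` is a hypothetical helper and `toy` is made-up data; the notebook itself uses scikit-learn's `cosine_similarity`.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity: normalise each row, then take dot products."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T

# Toy vectors: rows 0 and 1 point the same way, row 2 is orthogonal, row 3 is opposite
toy = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]])
sim = cosine_similarity_matrix(toy)

# Most similar distinct pair, read from the upper triangle (i < j)
rows, cols = np.triu_indices(len(toy), k=1)
best = np.argmax(sim[rows, cols])
print(f'Most similar pair: ({rows[best]}, {cols[best]}), sim={sim[rows[best], cols[best]]:.3f}')
```

Because the vectors are normalised first, the length of an embedding does not affect the score; only its direction does.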

In [None]:
# Task 1: Encode headlines
try:
    from sentence_transformers import SentenceTransformer
    st_model = SentenceTransformer('all-MiniLM-L6-v2')
    embeddings = st_model.encode(news_data['headline'].tolist())
    st_loaded = True
    print(f'Embeddings computed: {embeddings.shape}')

except ImportError:
    print('sentence-transformers not available. Using simulated embeddings.')
    np.random.seed(42)
    embeddings = np.random.randn(len(news_data), 384)
    # Add sector-based structure
    sector_map = {
        'AAPL': 0, 'MSFT': 0, 'GOOGL': 0, 'META': 0, 'AMZN': 0, 'NVDA': 0, 'TSLA': 0,
        'JPM': 1, 'BAC': 1,
    }
    for i, ticker in enumerate(news_data['ticker']):
        sector_id = sector_map.get(ticker, 0)
        embeddings[i, :10] += sector_id * 2  # sector signal
        embeddings[i, 10:20] += news_data.iloc[i]['next_day_return'] * 50  # sentiment signal
    st_loaded = False
    print(f'Simulated embeddings: {embeddings.shape}')

In [None]:
# Task 2: Cosine similarity
from sklearn.metrics.pairwise import cosine_similarity

cos_sim = cosine_similarity(embeddings)

# Find most similar headline pairs
n = len(news_data)
pairs = []
for i in range(n):
    for j in range(i + 1, n):
        pairs.append((i, j, cos_sim[i, j]))

pairs.sort(key=lambda x: x[2], reverse=True)

print('Top 5 most similar headline pairs:')
for i, j, sim in pairs[:5]:
    print(f'  Sim={sim:.3f}')
    print(f'    [{news_data.iloc[i]["ticker"]}] {news_data.iloc[i]["headline"][:70]}')
    print(f'    [{news_data.iloc[j]["ticker"]}] {news_data.iloc[j]["headline"][:70]}')
    print()

In [None]:
# Task 3: K-Means clustering
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
news_data['cluster'] = kmeans.fit_predict(embeddings)

print('Cluster composition:')
for c in range(4):
    mask = news_data['cluster'] == c
    tickers_in = news_data.loc[mask, 'ticker'].unique()
    avg_ret = news_data.loc[mask, 'next_day_return'].mean()
    print(f'  Cluster {c}: {mask.sum()} headlines, tickers={list(tickers_in)}, avg return={avg_ret:.3f}')

In [None]:
# Task 4: Visualize in 2D
pca = PCA(n_components=2)
emb_2d = pca.fit_transform(embeddings)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Color by cluster
cluster_colors = ['#e74c3c', '#2980b9', '#27ae60', '#f39c12']
for c in range(4):
    mask = news_data['cluster'] == c
    axes[0].scatter(emb_2d[mask, 0], emb_2d[mask, 1],
                    c=cluster_colors[c], label=f'Cluster {c}',
                    s=60, edgecolors='white', alpha=0.8)

axes[0].set_title('Headlines Colored by K-Means Cluster')
axes[0].set_xlabel('PC1')
axes[0].set_ylabel('PC2')
axes[0].legend()

# Color by sector
tech = ['AAPL', 'MSFT', 'GOOGL', 'META', 'AMZN', 'NVDA', 'TSLA']
finance = ['JPM', 'BAC']

# enumerate gives a positional index into emb_2d, independent of the dataframe index
for i, (_, row) in enumerate(news_data.iterrows()):
    color = '#2980b9' if row['ticker'] in tech else '#e74c3c'
    axes[1].scatter(emb_2d[i, 0], emb_2d[i, 1], c=color, s=60, edgecolors='white', alpha=0.8)
    axes[1].annotate(row['ticker'], (emb_2d[i, 0], emb_2d[i, 1]), fontsize=7, alpha=0.7)

# Manual legend
axes[1].scatter([], [], c='#2980b9', label='Tech', s=60)
axes[1].scatter([], [], c='#e74c3c', label='Finance', s=60)
axes[1].set_title('Headlines Colored by Sector')
axes[1].set_xlabel('PC1')
axes[1].set_ylabel('PC2')
axes[1].legend()

plt.suptitle('Exercise 2: Embedding Clustering', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

### Exercise 2 -- Task 5: Analysis

- Do clusters align with sectors (tech vs. finance)?
- Or do they align with themes (earnings beats, guidance, risk events)?
- What does this tell you about what the embedding captures?
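
The alignment questions above can be made quantitative with cluster purity: the fraction of headlines whose label matches the majority label of their cluster. The sketch below is one possible measure, not the only one. `cluster_purity` is a hypothetical helper, and the cluster, sector, and theme assignments are invented for illustration.

```python
from collections import Counter, defaultdict

def cluster_purity(cluster_ids, labels):
    """Fraction of items whose label matches the majority label of their cluster."""
    by_cluster = defaultdict(list)
    for cluster_id, label in zip(cluster_ids, labels):
        by_cluster[cluster_id].append(label)
    majority_total = sum(
        Counter(members).most_common(1)[0][1] for members in by_cluster.values()
    )
    return majority_total / len(labels)

def majority_share(labels):
    """Purity achieved by putting everything in one cluster (the baseline)."""
    return Counter(labels).most_common(1)[0][1] / len(labels)

# Hypothetical assignments for eight headlines
clusters = [0, 0, 0, 1, 1, 2, 2, 2]
sectors = ['Tech', 'Tech', 'Finance', 'Finance', 'Finance', 'Tech', 'Tech', 'Tech']
themes = ['earnings', 'risk', 'earnings', 'risk', 'earnings', 'product', 'risk', 'product']

print(f'Sector purity: {cluster_purity(clusters, sectors):.3f} '
      f'(baseline {majority_share(sectors):.3f})')
print(f'Theme purity:  {cluster_purity(clusters, themes):.3f} '
      f'(baseline {majority_share(themes):.3f})')
```

Always compare purity with the baseline: when one label dominates, as tech tickers do in this dataset, purity is high even for clusters that carry no sector information.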

---

## Exercise 3: Text-to-Alpha Pipeline (20 min)

**Goal:** Build a simple pipeline that goes from text to return prediction.

```
Headlines --> Embeddings --> Dimensionality reduction --> XGBoost --> Return prediction
```

### Tasks
1. Reduce embedding dimensionality (PCA to 20 components)
2. Combine with FinBERT sentiment as additional features
3. Train XGBoost to predict next-day return
4. Evaluate: IC, R-squared, classification accuracy for direction
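
A caveat for Tasks 1 and 3: if PCA is fit on all headlines before cross-validation, each held-out row has already influenced the features. The sketch below refits the projection inside every leave-one-out fold. To stay dependency-free it swaps XGBoost for closed-form ridge regression, so it illustrates the fold structure rather than the seminar's model. All function names are hypothetical, and `X`, `y` are synthetic stand-ins for the embeddings and returns.

```python
import numpy as np

def fit_pca(X_train, n_components):
    """Return the training mean and top principal directions via SVD."""
    mean = X_train.mean(axis=0)
    _, _, vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, vt[:n_components]

def fit_ridge(Z, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalised intercept."""
    z_mean, y_mean = Z.mean(axis=0), y.mean()
    Zc = Z - z_mean
    coef = np.linalg.solve(Zc.T @ Zc + alpha * np.eye(Z.shape[1]), Zc.T @ (y - y_mean))
    return coef, y_mean - z_mean @ coef

def loo_predictions(X, y, n_components=5, alpha=1.0):
    """Leave-one-out predictions with PCA and ridge refit on each training fold."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        train = np.arange(n) != i
        # The held-out row i is never seen when fitting the projection
        mean, components = fit_pca(X[train], n_components)
        Z_train = (X[train] - mean) @ components.T
        coef, intercept = fit_ridge(Z_train, y[train], alpha)
        preds[i] = (X[i] - mean) @ components.T @ coef + intercept
    return preds

# Synthetic stand-in for 24 headline embeddings (not real data)
rng = np.random.default_rng(0)
X = rng.normal(size=(24, 384))
y = 0.02 * X[:, 0] + 0.01 * rng.normal(size=24)

preds = loo_predictions(X, y)
print(f'Out-of-fold IC: {np.corrcoef(y, preds)[0, 1]:.3f}')
```

With scikit-learn, the same effect is usually achieved by wrapping the PCA step and the regressor in a `Pipeline` and passing that pipeline to `cross_val_predict`.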

In [None]:
# Task 1: PCA on embeddings
from sklearn.decomposition import PCA

# Note: PCA is fit on all 24 headlines here, including rows later held out in cross-validation
pca_full = PCA(n_components=20)
emb_pca = pca_full.fit_transform(embeddings)

emb_features = pd.DataFrame(
    emb_pca,
    columns=[f'emb_pc{i}' for i in range(20)],
    index=news_data.index
)

print(f'PCA embeddings: {emb_features.shape}')
print(f'Variance explained: {pca_full.explained_variance_ratio_.sum():.1%}')

In [None]:
# Task 2: Combine features
feature_df = pd.concat([
    emb_features,
    news_data[['fb_positive', 'fb_negative', 'fb_neutral', 'fb_net_score']].reset_index(drop=True)
], axis=1)

target = news_data['next_day_return'].values

print(f'Feature matrix: {feature_df.shape}')
print(f'Features: {list(feature_df.columns)}')

In [None]:
# Task 3: Train models (using leave-one-out due to small dataset)
from xgboost import XGBRegressor
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import r2_score

# Three models: FinBERT only, Embeddings only, Combined
finbert_cols = ['fb_positive', 'fb_negative', 'fb_neutral', 'fb_net_score']
emb_cols = [f'emb_pc{i}' for i in range(20)]
all_cols = emb_cols + finbert_cols

models = {
    'FinBERT only': finbert_cols,
    'Embedding only': emb_cols,
    'Combined': all_cols,
}

results = []
loo = LeaveOneOut()

for name, cols in models.items():
    X = feature_df[cols].values
    xgb = XGBRegressor(n_estimators=50, max_depth=3, learning_rate=0.1,
                       subsample=0.8, random_state=42)
    
    # Leave-one-out predictions
    predictions = cross_val_predict(xgb, X, target, cv=loo)
    
    ic = np.corrcoef(target, predictions)[0, 1]
    r2 = r2_score(target, predictions)
    dir_acc = np.mean(np.sign(target) == np.sign(predictions))
    
    results.append({'Model': name, 'IC': ic, 'R2': r2, 'Dir Acc': dir_acc})
    print(f'{name}: IC={ic:.4f}, R2={r2:.4f}, Dir Acc={dir_acc:.1%}')

results_df = pd.DataFrame(results).set_index('Model')
print()
print(results_df.round(4).to_string())

In [None]:
# Task 4: Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# IC comparison
# Use a separate name so the `colors` dict from Exercise 1 is not overwritten
model_colors = ['#f39c12', '#2980b9', '#27ae60']
axes[0].bar(results_df.index, results_df['IC'], color=model_colors, edgecolor='white')
axes[0].set_ylabel('Information Coefficient')
axes[0].set_title('IC: Text-to-Alpha Pipeline')
axes[0].tick_params(axis='x', rotation=15)

# Predicted vs. actual (combined model)
X_all = feature_df[all_cols].values
xgb_combined = XGBRegressor(n_estimators=50, max_depth=3, learning_rate=0.1,
                            subsample=0.8, random_state=42)
preds_combined = cross_val_predict(xgb_combined, X_all, target, cv=loo)

axes[1].scatter(preds_combined, target, c='#2980b9', s=60, edgecolors='white', alpha=0.8)
axes[1].axhline(y=0, color='gray', linestyle='--', alpha=0.5)
axes[1].axvline(x=0, color='gray', linestyle='--', alpha=0.5)

# Regression line
z = np.polyfit(preds_combined, target, 1)
p = np.poly1d(z)
x_line = np.linspace(preds_combined.min(), preds_combined.max(), 50)
axes[1].plot(x_line, p(x_line), 'r--', alpha=0.7)

axes[1].set_xlabel('Predicted Return')
axes[1].set_ylabel('Actual Return')
axes[1].set_title('Combined Model: Predicted vs. Actual')

plt.suptitle('Exercise 3: Text-to-Alpha Pipeline', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

### Exercise 3 -- Key Takeaways

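- Embeddings can carry more predictive information than 3-class sentiment, because they keep topic and context that three probabilities discard; compare the "Embedding only" and "FinBERT only" results above before concluding that they do here
- The combined model can draw on both the low-dimensional sentiment scores and the higher-dimensional embedding features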
- With only 24 headlines, these results are highly noisy -- real pipelines use thousands of headlines
- In production: daily aggregation of embeddings per ticker, with proper time-series CV
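
The last takeaway can be sketched in a few lines: average the embeddings of all headlines that share a date and ticker, then evaluate with splits in which the training rows always precede the test date. This is a minimal illustration under stated assumptions, not a production design. `aggregate_daily` and `walk_forward_splits` are hypothetical helpers, and the five toy rows are invented.

```python
import numpy as np
from collections import defaultdict

def aggregate_daily(dates, tickers, embeddings):
    """Average all headline embeddings that share the same (date, ticker)."""
    groups = defaultdict(list)
    for date, ticker, vector in zip(dates, tickers, embeddings):
        groups[(date, ticker)].append(vector)
    keys = sorted(groups)
    return keys, np.array([np.mean(groups[k], axis=0) for k in keys])

def walk_forward_splits(dates, min_train_dates=2):
    """Yield (train_idx, test_idx) so that training rows always precede the test date."""
    dates = np.array(dates, dtype='datetime64[D]')
    unique_dates = np.unique(dates)
    for test_date in unique_dates[min_train_dates:]:
        yield np.where(dates < test_date)[0], np.where(dates == test_date)[0]

# Toy input: two AAPL headlines on the same day are merged into one row
dates = ['2024-01-25', '2024-01-25', '2024-01-26', '2024-02-01', '2024-02-01']
tickers = ['AAPL', 'AAPL', 'AAPL', 'META', 'AMZN']
embeddings = np.arange(10, dtype=float).reshape(5, 2)

keys, daily = aggregate_daily(dates, tickers, embeddings)
key_dates = [k[0] for k in keys]
splits = list(walk_forward_splits(key_dates, min_train_dates=1))

for train_idx, test_idx in splits:
    print(f'train rows {train_idx.tolist()} -> test rows {test_idx.tolist()}')
```

Unlike leave-one-out, this scheme never trains on headlines published after the one being predicted, which matters once the dataset spans enough dates to split.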

---

## Exercise 4: Discussion -- Will LLMs Replace Quant Researchers? (20 min)

### Discussion Questions

1. **The Chen-Kelly-Xiu result:** They showed LLM embeddings outperform traditional NLP for return prediction. But the embedding model itself does not understand finance -- it just maps text to vectors. Where does the "intelligence" reside?

2. **AlphaGPT and RD-Agent:** These systems can autonomously propose and test alpha hypotheses. If an LLM can write a factor, backtest it, and iterate -- what is the human quant researcher doing?

3. **The moat question:** If everyone uses the same LLM (GPT-4, Claude, Llama) for embeddings, is there alpha in the embeddings themselves? Or does the alpha come from *what you do with them* (the downstream model, the data, the execution)?

4. **Interpretability:** A portfolio manager asks: "Why did we go long NVDA today?" You say: "Because dimension 47 of the sentence-transformer embedding was 0.3 standard deviations above its 90-day mean." Is this acceptable?

5. **Data pipeline challenges:** In production, you need to process thousands of headlines per day, each with different tickers, sources, and reliability levels. The NLP model is easy -- the engineering is hard. Does this change the value proposition?

### Positions to Debate

**"LLMs will replace most quant researchers within 5 years"**
- AlphaGPT can already generate and test factors autonomously
- LLM embeddings outperform decades of hand-crafted NLP features
- Coding is becoming a commodity -- LLMs write backtests better than juniors

**"LLMs are tools, not replacements"**
- Alpha discovery requires domain knowledge that LLMs lack
- Risk management, portfolio construction, and execution are not NLP problems
- The hard part is data acquisition and cleaning, not model building
- Regulatory requirements demand human oversight and accountability

**"The real change is in workflow, not headcount"**
- One researcher with LLM tools can do the work of five without them
- The role shifts from coding to supervising and curating
- Domain experts become more valuable, not less, as they guide the AI