# Week 7 --- Feedforward Neural Networks for Asset Pricing

**Quantitative Finance ML Course**

---

## Roadmap

1. PyTorch fundamentals for finance (tensors, Dataset/DataLoader)
2. The Gu-Kelly-Xiu neural net architecture
3. Training pitfalls in finance
4. Financial loss functions
5. Ensemble methods
6. Working demo: feedforward net for cross-sectional return prediction

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 5)

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)

print(f'PyTorch version: {torch.__version__}')
print(f'MPS available: {torch.backends.mps.is_available()}')

---

## 1. PyTorch Fundamentals for Finance

### Why PyTorch for Quant Finance?

- **Flexibility**: custom loss functions (IC-based, asymmetric), custom architectures
- **GPU/MPS acceleration**: train on Apple Silicon or NVIDIA GPUs
- **Research-friendly**: most academic finance DL papers use PyTorch
- **Ecosystem**: Lightning, TorchMetrics, etc.

### Tensors on MPS

Apple Silicon Macs have the MPS (Metal Performance Shaders) backend. It works like CUDA but for Apple GPUs.

In [None]:
# Device selection: prefer MPS > CUDA > CPU
if torch.backends.mps.is_available():
    device = torch.device('mps')
elif torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

print(f'Using device: {device}')

# Basic tensor operations
x = torch.randn(1000, 50)  # 1000 stocks, 50 features
x_device = x.to(device)
print(f'Tensor shape: {x_device.shape}, device: {x_device.device}')

### Dataset and DataLoader for Financial Panels

Financial data is a **panel**: stocks x time x features. We need a custom `Dataset` that respects this structure.

In [None]:
class CrossSectionalDataset(Dataset):
    """
    Dataset for cross-sectional stock prediction.
    Each sample is one (stock, date) observation.
    """
    def __init__(self, features: np.ndarray, targets: np.ndarray, dates: np.ndarray):
        """
        Args:
            features: (N, K) array of K features for N observations
            targets: (N,) array of forward returns
            dates: (N,) array of date indices (for temporal splitting)
        """
        self.X = torch.tensor(features, dtype=torch.float32)
        self.y = torch.tensor(targets, dtype=torch.float32)
        self.dates = dates

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]


# Example: create a dummy panel
n_stocks, n_months, n_features = 200, 120, 10
N = n_stocks * n_months  # total observations

features = np.random.randn(N, n_features)
targets = np.random.randn(N) * 0.05  # monthly returns
dates = np.repeat(np.arange(n_months), n_stocks)

ds = CrossSectionalDataset(features, targets, dates)
print(f'Dataset size: {len(ds)}')
print(f'Sample: X shape={ds[0][0].shape}, y={ds[0][1].item():.4f}')

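**Key point**: Split the panel by time *before* building any DataLoader, so that no validation or test month ever appears in training. Shuffling inside the training set is fine; we'll cover proper temporal splitting shortly.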

In [None]:
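# Split by date first, then shuffle only inside the training set.
# Shuffling training observations across training months is fine for a
# cross-sectional model; what must never happen is mixing train/val/test months.
# - split data by date into train/val/test
# - shuffle within train, don't shuffle val/test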

train_mask = dates < 72   # first 72 months = train
val_mask = (dates >= 72) & (dates < 96)  # next 24 months = val
test_mask = dates >= 96   # last 24 months = test

train_ds = CrossSectionalDataset(features[train_mask], targets[train_mask], dates[train_mask])
val_ds = CrossSectionalDataset(features[val_mask], targets[val_mask], dates[val_mask])
test_ds = CrossSectionalDataset(features[test_mask], targets[test_mask], dates[test_mask])

# Shuffle train only
train_loader = DataLoader(train_ds, batch_size=512, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=1024, shuffle=False)
test_loader = DataLoader(test_ds, batch_size=1024, shuffle=False)

print(f'Train: {len(train_ds)}, Val: {len(val_ds)}, Test: {len(test_ds)}')
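
When the loss is computed per cross-section (for example an IC-style loss), it can help to make every mini-batch exactly one month. The sketch below is a NumPy-only illustration of that idea; `month_batches` is a hypothetical helper introduced here, not a PyTorch function.

```python
import numpy as np

def month_batches(dates, rng):
    """Yield index arrays, one per month, in random month order.

    dates: (N,) integer month index for every observation.
    rng:   a numpy Generator used to shuffle months and stocks within a month.
    """
    months = np.unique(dates)
    for m in rng.permutation(months):
        idx = np.flatnonzero(dates == m)   # all observations in month m
        yield rng.permutation(idx)         # shuffle stocks within the month

# Toy panel: 3 months x 4 stocks
dates = np.repeat(np.arange(3), 4)
rng = np.random.default_rng(0)
batches = list(month_batches(dates, rng))
for idx in batches:
    print(dates[idx])  # every batch contains a single month
```

In principle, a list of such index lists could be handed to `DataLoader` through its `batch_sampler` argument, so that each batch the model sees is one full cross-section.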

---

## 2. The Gu-Kelly-Xiu Neural Net

**Reference**: Gu, Kelly, Xiu (2020) "Empirical Asset Pricing via Machine Learning", *Review of Financial Studies*.

### Architecture

Their feedforward net (called NN3 in the paper) uses:
- **3 hidden layers**: 32 -> 16 -> 8 neurons
- **ReLU** activations
- **Batch normalization** after each linear layer
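- **Dropout** for regularization (a substitution made in this notebook; the paper regularizes with an L1 weight penalty, early stopping, and ensembling instead)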
- **Single output**: predicted return (or return rank)

The architecture is deliberately **small** --- financial signal is weak, so large nets just memorize noise.

In [None]:
class GuKellyXiuNet(nn.Module):
    """
    Gu-Kelly-Xiu NN3 architecture for cross-sectional return prediction.
    3 hidden layers: 32 -> 16 -> 8, with BN, ReLU, and dropout.
    """
    def __init__(self, input_dim, dropout=0.5):
        super().__init__()
        self.net = nn.Sequential(
            # Layer 1: input -> 32
            nn.Linear(input_dim, 32),
            nn.BatchNorm1d(32),
            nn.ReLU(),
            nn.Dropout(dropout),

            # Layer 2: 32 -> 16
            nn.Linear(32, 16),
            nn.BatchNorm1d(16),
            nn.ReLU(),
            nn.Dropout(dropout),

            # Layer 3: 16 -> 8
            nn.Linear(16, 8),
            nn.BatchNorm1d(8),
            nn.ReLU(),
            nn.Dropout(dropout),

            # Output: 8 -> 1
            nn.Linear(8, 1)
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)


# Instantiate and inspect
model = GuKellyXiuNet(input_dim=n_features)
print(model)

# Count parameters
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'\nTrainable parameters: {n_params}')
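
The parameter count printed above can be checked by hand: each `Linear` layer contributes a weight matrix plus a bias vector, and each `BatchNorm1d` contributes a scale and a shift per unit (its running statistics are buffers, not trainable parameters). A small sketch of that arithmetic, with `nn3_param_count` as a hypothetical helper:

```python
def nn3_param_count(input_dim, hidden=(32, 16, 8)):
    """Trainable parameters of an MLP with BatchNorm after each hidden layer."""
    total, prev = 0, input_dim
    for h in hidden:
        total += prev * h + h   # Linear: weight matrix + bias
        total += 2 * h          # BatchNorm1d: scale (gamma) + shift (beta)
        prev = h
    total += prev * 1 + 1       # output Linear: 8 -> 1
    return total

print(nn3_param_count(10))  # 1137
print(nn3_param_count(7))   # 1041
```

With roughly a thousand parameters, the network is tiny compared with the tens of thousands of observations in the panel, which is the point of the design.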

### Why This Architecture?

| Design choice | Reason |
|---|---|
| Small layers (32-16-8) | Financial signals are weak; overfitting is the main risk |
| Batch normalization | Stabilizes training with heterogeneous cross-sectional features |
| Dropout (0.5) | Strong regularization needed for noisy financial data |
| ReLU | Simple, works well, avoids vanishing gradients |
| Single output | Predicting one number: next-period return |

---

## 3. Training Pitfalls in Finance

### Pitfall 1: Shuffling Across Time

**Wrong**: splitting the panel at random (or shuffling before the split), so that future months end up in the training set and leak information.

**Right**: Split by time first, then shuffle only within the training set.

### Pitfall 2: No Early Stopping

Financial data is so noisy that NNs overfit quickly. Always use early stopping on a temporal validation set.

### Pitfall 3: Single Random Seed

NN predictions are sensitive to initialization. Always train with multiple seeds and ensemble.

### Pitfall 4: Normalizing Targets Wrong

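Normalize returns cross-sectionally (rank or z-score within each month) rather than over time. Cross-sectional normalization removes the market-wide move in each month, so the model learns relative performance, and it never mixes statistics from different dates.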
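
A minimal sketch of cross-sectional z-scoring, assuming a flat array of returns with a matching array of month indices (`cs_zscore` is a hypothetical helper, and the numbers are made up for illustration):

```python
import numpy as np

def cs_zscore(values, dates):
    """Z-score `values` within each date (cross-sectionally)."""
    out = np.empty_like(values, dtype=float)
    for d in np.unique(dates):
        m = dates == d
        out[m] = (values[m] - values[m].mean()) / values[m].std()
    return out

# Two months with very different return levels and dispersion
dates = np.array([0, 0, 0, 1, 1, 1])
rets = np.array([0.01, 0.02, 0.03, -0.30, -0.10, 0.10])
z = cs_zscore(rets, dates)
print(z.round(3))
```

Both months map to the same z-scores even though the second month is a crash with ten times the dispersion: only the relative ordering within the month survives. On a DataFrame, the same thing can be written with `groupby('date_idx')[...].transform(...)`.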

In [None]:
class EarlyStopping:
    """Simple early stopping to prevent overfitting."""
    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float('inf')
        self.counter = 0
        self.best_state = None

    def step(self, val_loss, model):
        """Returns True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
            self.best_state = {k: v.cpu().clone() for k, v in model.state_dict().items()}
            return False
        else:
            self.counter += 1
            return self.counter >= self.patience

    def restore_best(self, model):
        """Restore the best model weights."""
        if self.best_state is not None:
            model.load_state_dict(self.best_state)


print('Early stopping class ready.')
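
To see how the patience counter behaves, the same rule can be traced on a list of validation losses. This is a pure-Python sketch of the logic only (no model or weights involved); `early_stop_trace` and the loss values are illustrative.

```python
def early_stop_trace(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch), both 0-indexed.

    stop_epoch is None if the patience budget is never exhausted.
    """
    best_loss, best_epoch, counter = float('inf'), None, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, counter = loss, epoch, 0
        else:
            counter += 1
            if counter >= patience:
                return best_epoch, epoch
    return best_epoch, None

losses = [0.0070, 0.0066, 0.0065, 0.0067, 0.0066, 0.0068, 0.0069]
print(early_stop_trace(losses, patience=3))  # (2, 5)
```

Training stops at epoch 5, but the weights that should be kept are those from epoch 2, which is why the class stores `best_state` and offers `restore_best`.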

### Temporal Train/Val/Test Split

The standard approach:

```
Time ──────────────────────────────────────────────►
│     TRAIN (60%)     │  VAL (20%)  │  TEST (20%)  │
```

For expanding-window CV (used in HW):

```
Fold 1: [=TRAIN=][VAL][TEST]
Fold 2: [==TRAIN==][VAL][TEST]
Fold 3: [===TRAIN===][VAL][TEST]
```
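
One possible way to generate such folds is sketched below. The exact layout used in the homework is not specified here, so the fixed validation and test lengths and the non-overlapping test windows are assumptions; `expanding_window_folds` is a hypothetical helper.

```python
def expanding_window_folds(n_months, n_folds, val_len, test_len):
    """Return (train, val, test) month ranges for expanding-window CV.

    The last fold's test window ends at n_months; each earlier fold is
    shifted back by test_len, so test windows do not overlap.
    """
    folds = []
    for k in range(n_folds):
        test_end = n_months - (n_folds - 1 - k) * test_len
        test_start = test_end - test_len
        val_start = test_start - val_len
        if val_start <= 0:
            raise ValueError('not enough history for the first fold')
        folds.append((range(0, val_start),
                      range(val_start, test_start),
                      range(test_start, test_end)))
    return folds

folds = expanding_window_folds(120, n_folds=3, val_len=12, test_len=12)
for train, val, test in folds:
    print(f'train 0-{train[-1]}, val {val[0]}-{val[-1]}, test {test[0]}-{test[-1]}')
```

Every fold trains from month 0, so the training window grows while validation and test always lie strictly after it.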

---

## 4. Financial Loss Functions

### Standard MSE

$$\mathcal{L}_{MSE} = \frac{1}{N} \sum_{i=1}^{N} (r_i - \hat{r}_i)^2$$

Simple, but treats all stocks equally: a micro-cap and a \$100B mega-cap get the same weight.

### Weighted MSE

Weight by market cap or inverse volatility, with the weights normalized to sum to one:

$$\mathcal{L}_{WMSE} = \sum_{i=1}^{N} \tilde{w}_i (r_i - \hat{r}_i)^2, \qquad \tilde{w}_i = \frac{w_i}{\sum_{j=1}^{N} w_j}$$

### IC-Based Loss

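Instead of minimizing prediction error, maximize the cross-sectional correlation between predictions and realized returns, the **Information Coefficient** (IC). The IC is usually reported as a Spearman rank correlation, but ranks are not differentiable, so the training loss uses the Pearson correlation as a smooth proxy: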

$$\mathcal{L}_{IC} = -\text{corr}(\hat{r}, r)$$

We care about ranking stocks correctly, not predicting exact returns.
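
The correlation in the formula above is a Pearson correlation, while the IC is usually evaluated as a Spearman rank correlation. The two can differ substantially, as the following NumPy sketch illustrates on synthetic data (the `pearson` and `spearman` helpers are written out here for illustration and assume no ties):

```python
import numpy as np

def pearson(a, b):
    a, b = a - a.mean(), b - b.mean()
    return float((a * b).sum() / np.sqrt((a ** 2).sum() * (b ** 2).sum()))

def spearman(a, b):
    """Pearson correlation of ranks (valid here because there are no ties)."""
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return pearson(rank(a), rank(b))

rng = np.random.default_rng(0)
r = rng.normal(size=500)
r_hat = np.exp(3 * r)          # same ordering as r, very different scale

print(f'Pearson:  {pearson(r_hat, r):.3f}')   # well below 1
print(f'Spearman: {spearman(r_hat, r):.3f}')  # 1.000: the ranking is perfect
```

A model can therefore rank stocks perfectly and still be penalized by a Pearson-based loss if its predictions are on a distorted scale, which is worth keeping in mind when comparing training loss with test-set rank IC.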

In [None]:
def mse_loss(y_pred, y_true):
    """Standard MSE loss."""
    return ((y_pred - y_true) ** 2).mean()


def weighted_mse_loss(y_pred, y_true, weights):
    """MSE weighted by e.g. market cap."""
    w = weights / weights.sum()
    return (w * (y_pred - y_true) ** 2).sum()


def ic_loss(y_pred, y_true):
    """
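    Negative Pearson correlation loss, a differentiable proxy for the IC.
    Minimizing it maximizes the linear correlation between predictions and
    returns within the batch; it is not an exact rank (Spearman) correlation.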
    """
    y_pred_dm = y_pred - y_pred.mean()
    y_true_dm = y_true - y_true.mean()
    corr = (y_pred_dm * y_true_dm).sum() / (
        torch.sqrt((y_pred_dm ** 2).sum() * (y_true_dm ** 2).sum()) + 1e-8
    )
    return -corr  # negative because we minimize


# Quick demo
y_pred = torch.randn(200)
y_true = y_pred * 0.3 + torch.randn(200) * 0.7  # weak signal

print(f'MSE loss:  {mse_loss(y_pred, y_true):.4f}')
print(f'IC loss:   {ic_loss(y_pred, y_true):.4f}')
print(f'(IC = {-ic_loss(y_pred, y_true):.4f})')
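
The `ic_loss` above computes a single correlation over everything in the batch. If a batch mixes several months, that pools cross-sections together, whereas the IC is defined month by month. A hedged sketch of a per-month variant follows; `monthly_ic_loss` and the `month_ids` tensor are assumptions introduced for this example.

```python
import torch

def monthly_ic_loss(y_pred, y_true, month_ids, eps=1e-8):
    """Negative mean of per-month Pearson correlations.

    month_ids: (N,) integer tensor giving the month of each observation.
    """
    ics = []
    for m in torch.unique(month_ids):
        mask = month_ids == m
        p = y_pred[mask] - y_pred[mask].mean()
        t = y_true[mask] - y_true[mask].mean()
        ics.append((p * t).sum() / (torch.sqrt((p ** 2).sum() * (t ** 2).sum()) + eps))
    return -torch.stack(ics).mean()

# Toy check: perfect ranking inside each month, but different levels per month
month_ids = torch.tensor([0, 0, 0, 1, 1, 1])
y_true = torch.tensor([0.01, 0.02, 0.03, -0.03, -0.02, -0.01])
y_pred = torch.tensor([1.0, 2.0, 3.0, 11.0, 12.0, 13.0], requires_grad=True)

loss = monthly_ic_loss(y_pred, y_true, month_ids)
loss.backward()                      # gradients flow through the per-month terms
print(f'monthly IC loss: {loss.item():.4f}')
```

In this toy case the per-month loss is close to -1 (a perfect ranking within each month), while a correlation pooled over all six observations would be negative, because the second month has higher predictions but lower returns.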

---

## 5. Ensemble Methods

Neural nets in finance benefit hugely from ensembling:

1. **Seed ensembles**: Train the same architecture with different random seeds, average predictions
2. **Architecture ensembles**: Combine different architectures (NN + XGBoost + LightGBM)
3. **Temporal ensembles**: Average predictions from models trained on different expanding windows

Gu-Kelly-Xiu average predictions over an ensemble of 10 random seeds. Seed ensembling is standard practice.
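
Why averaging helps can be seen from a simple variance calculation. As an illustrative assumption (not a result from the paper), suppose each seed's prediction error $\varepsilon_m$ has variance $\sigma^2$ and any two seeds' errors have correlation $\rho$. Then for an ensemble of $M$ seeds:

$$\operatorname{Var}\left(\frac{1}{M}\sum_{m=1}^{M}\varepsilon_m\right) = \frac{1}{M^2}\left[M\sigma^2 + M(M-1)\rho\sigma^2\right] = \rho\,\sigma^2 + \frac{1-\rho}{M}\,\sigma^2$$

The seed-specific part of the error shrinks like $1/M$, while the part shared by all seeds ($\rho\,\sigma^2$) does not. This is one reason to combine seed ensembles with architecture or temporal ensembles, whose errors tend to be less correlated.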

In [None]:
def train_ensemble(model_class, input_dim, train_loader, val_loader,
                   n_seeds=5, n_epochs=100, lr=1e-3, device='cpu'):
    """
    Train an ensemble of models with different random seeds.
    Returns list of trained models.
    """
    models = []

    for seed in range(n_seeds):
        torch.manual_seed(seed)
        model = model_class(input_dim).to(device)
        optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=1e-5)
        stopper = EarlyStopping(patience=10)

        for epoch in range(n_epochs):
            # Train
            model.train()
            for X_batch, y_batch in train_loader:
                X_batch, y_batch = X_batch.to(device), y_batch.to(device)
                optimizer.zero_grad()
                loss = mse_loss(model(X_batch), y_batch)
                loss.backward()
                optimizer.step()

            # Validate (weight each batch by its size so a short last batch
            # does not distort the average)
            model.eval()
            val_sse, val_n = 0.0, 0
            with torch.no_grad():
                for X_batch, y_batch in val_loader:
                    X_batch, y_batch = X_batch.to(device), y_batch.to(device)
                    val_sse += ((model(X_batch) - y_batch) ** 2).sum().item()
                    val_n += len(y_batch)

            if stopper.step(val_sse / val_n, model):
                break

        stopper.restore_best(model)
        model.to('cpu')
        models.append(model)
        print(f'  Seed {seed}: best val loss = {stopper.best_loss:.6f}, '
              f'stopped at epoch {epoch+1}')

    return models


def predict_ensemble(models, X, device='cpu'):
    """Average predictions from an ensemble."""
    preds = []
    X_tensor = torch.tensor(X, dtype=torch.float32).to(device)
    for model in models:
        model.eval()
        model.to(device)
        with torch.no_grad():
            preds.append(model(X_tensor).cpu().numpy())
        model.to('cpu')
    return np.mean(preds, axis=0)


print('Ensemble functions ready.')

---

## 6. Working Demo: Feedforward Net for Return Prediction

Let's put it all together. We'll:
1. Generate synthetic cross-sectional data (mimicking momentum, volatility, size features)
2. Build the Gu-Kelly-Xiu net
3. Train with proper temporal splitting and early stopping
4. Evaluate with IC (Information Coefficient)

In [None]:
# --- Generate synthetic cross-sectional data ---
# Mimics the features from Weeks 4-5: momentum, volatility, etc.

np.random.seed(42)
n_stocks = 500
n_months = 180  # 15 years

records = []
for t in range(n_months):
    for i in range(n_stocks):
        # Features
        mom_1m = np.random.randn() * 0.08
        mom_12m = np.random.randn() * 0.20
        vol_20d = np.abs(np.random.randn()) * 0.02 + 0.01
        size = np.random.randn() * 2 + 15  # log market cap
        bm = np.random.randn() * 0.5  # book-to-market
        turnover = np.abs(np.random.randn()) * 0.01
        rev_1m = np.random.randn() * 0.05  # short-term reversal

        # Target: next-month return with weak factor structure
        ret_next = (
            -0.002 * mom_1m       # reversal
            + 0.003 * mom_12m     # momentum
            - 0.005 * vol_20d     # low-vol premium
            + 0.001 * bm          # value
            + 0.002 * np.sin(mom_12m * size)  # nonlinearity!
            + np.random.randn() * 0.08  # noise
        )

        records.append({
            'date_idx': t, 'stock_id': i,
            'mom_1m': mom_1m, 'mom_12m': mom_12m, 'vol_20d': vol_20d,
            'size': size, 'bm': bm, 'turnover': turnover, 'rev_1m': rev_1m,
            'ret_next': ret_next
        })

df = pd.DataFrame(records)
feature_cols = ['mom_1m', 'mom_12m', 'vol_20d', 'size', 'bm', 'turnover', 'rev_1m']

print(f'Panel shape: {df.shape}')
print(f'Date range: 0 to {df.date_idx.max()}')
print(f'Features: {feature_cols}')
df.head()

In [None]:
# --- Cross-sectional normalization ---
# Rank-transform features within each month (standard in finance ML)

for col in feature_cols:
    df[col] = df.groupby('date_idx')[col].transform(
        lambda x: (x.rank() - 1) / (len(x) - 1) - 0.5  # map to [-0.5, 0.5]
    )

print('After cross-sectional rank normalization:')
df[feature_cols].describe().round(3)
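
A worked example of the transform on a single month with five stocks may make the mapping concrete (the values are made up, and two of them are deliberately tied):

```python
import pandas as pd

x = pd.Series([0.30, -0.10, 0.05, 0.05, 0.80])     # one month, five stocks
scaled = (x.rank() - 1) / (len(x) - 1) - 0.5

print(scaled.tolist())  # [0.25, -0.5, -0.125, -0.125, 0.5]
```

Tied values share the average rank, and the largest value maps to 0.5 no matter how extreme it is, which is what makes the rank transform robust to outliers.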

In [None]:
# --- Temporal split ---
train_end = 108   # 9 years
val_end = 144     # 3 years val
# test: last 3 years

train_df = df[df.date_idx < train_end]
val_df = df[(df.date_idx >= train_end) & (df.date_idx < val_end)]
test_df = df[df.date_idx >= val_end]

X_train = train_df[feature_cols].values
y_train = train_df['ret_next'].values
X_val = val_df[feature_cols].values
y_val = val_df['ret_next'].values
X_test = test_df[feature_cols].values
y_test = test_df['ret_next'].values

train_ds = CrossSectionalDataset(X_train, y_train, train_df['date_idx'].values)
val_ds = CrossSectionalDataset(X_val, y_val, val_df['date_idx'].values)
test_ds = CrossSectionalDataset(X_test, y_test, test_df['date_idx'].values)

train_loader = DataLoader(train_ds, batch_size=2048, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=4096, shuffle=False)
test_loader = DataLoader(test_ds, batch_size=4096, shuffle=False)

print(f'Train: {len(train_ds):,}, Val: {len(val_ds):,}, Test: {len(test_ds):,}')

In [None]:
# --- Train the model ---
print('Training NN ensemble (5 seeds)...')
use_device = 'cpu'  # change to 'mps' or 'cuda' if available

models = train_ensemble(
    GuKellyXiuNet,
    input_dim=len(feature_cols),
    train_loader=train_loader,
    val_loader=val_loader,
    n_seeds=5,
    n_epochs=100,
    lr=1e-3,
    device=use_device
)

In [None]:
# --- Evaluate with IC ---
from scipy.stats import spearmanr

def compute_monthly_ic(df, pred_col='pred', ret_col='ret_next'):
    """Compute monthly rank IC (Spearman) and return a Series."""
    ic_series = df.groupby('date_idx').apply(
        lambda g: spearmanr(g[pred_col], g[ret_col])[0]
    )
    return ic_series


# Predict on test set
test_df = test_df.copy()
test_df['pred'] = predict_ensemble(models, X_test)

# Compute monthly IC
ic_series = compute_monthly_ic(test_df)

print(f'Test IC:')
print(f'  Mean IC:   {ic_series.mean():.4f}')
print(f'  Std IC:    {ic_series.std():.4f}')
print(f'  IR (IC/std): {ic_series.mean()/ic_series.std():.4f}')
print(f'  IC > 0:    {(ic_series > 0).mean():.1%}')
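
The ratio of mean IC to its standard deviation translates directly into a t-statistic for the mean IC: under an i.i.d. assumption across months, it is the ratio multiplied by the square root of the number of months. A small sketch, using made-up IC values rather than results from the demo:

```python
import numpy as np

def ic_tstat(ic):
    """t-statistic of the mean IC under an i.i.d. assumption across months."""
    ic = np.asarray(ic, dtype=float)
    return ic.mean() / (ic.std(ddof=1) / np.sqrt(len(ic)))

# Illustrative numbers only (not results from the demo)
ic = np.array([0.02, -0.01, 0.03, 0.00, 0.01, 0.04, -0.02, 0.02])
icir = ic.mean() / ic.std(ddof=1)
print(f'ICIR = {icir:.3f}, t-stat = {ic_tstat(ic):.3f}')
```

With the 36 test months used here, a ratio of about 0.33 is needed before the mean IC reaches a t-statistic of roughly 2, so a small positive mean IC on its own says little.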

In [None]:
# --- Plot IC over time ---
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Monthly IC
axes[0].bar(ic_series.index, ic_series.values, color='steelblue', alpha=0.7)
axes[0].axhline(y=ic_series.mean(), color='red', linestyle='--',
                label=f'Mean IC = {ic_series.mean():.4f}')
axes[0].axhline(y=0, color='black', linewidth=0.5)
axes[0].set_xlabel('Month')
axes[0].set_ylabel('Rank IC')
axes[0].set_title('Monthly Information Coefficient (Test Set)')
axes[0].legend()

# Cumulative IC
axes[1].plot(ic_series.index, ic_series.cumsum().values, color='steelblue', linewidth=2)
axes[1].set_xlabel('Month')
axes[1].set_ylabel('Cumulative IC')
axes[1].set_title('Cumulative IC (Test Set)')

plt.tight_layout()
plt.show()

In [None]:
# --- Long-short portfolio returns ---
# Each month: go long top quintile, short bottom quintile

def compute_ls_returns(df, pred_col='pred', ret_col='ret_next', n_quantiles=5):
    """Compute long-short (top minus bottom quintile) returns."""
    def _ls(g):
        g = g.copy()
        g['q'] = pd.qcut(g[pred_col], n_quantiles, labels=False, duplicates='drop')
        # Use the observed extreme bins: duplicates='drop' can yield fewer bins
        long_ret = g[g['q'] == g['q'].max()][ret_col].mean()
        short_ret = g[g['q'] == g['q'].min()][ret_col].mean()
        return long_ret - short_ret

    return df.groupby('date_idx').apply(_ls)


ls_returns = compute_ls_returns(test_df)

print(f'Long-Short Portfolio (Test):')
print(f'  Mean monthly return: {ls_returns.mean():.4f} ({ls_returns.mean()*12:.2%} annualized)')
print(f'  Sharpe (annualized):  {ls_returns.mean()/ls_returns.std()*np.sqrt(12):.2f}')

# Plot cumulative returns
cum_ret = (1 + ls_returns).cumprod()
plt.figure(figsize=(12, 5))
plt.plot(cum_ret.index, cum_ret.values, linewidth=2, color='steelblue')
plt.xlabel('Month')
plt.ylabel('Cumulative Return')
plt.title('Long-Short Portfolio: Cumulative Returns (Test Set)')
plt.show()

---

## Key Takeaways

1. **Architecture**: The Gu-Kelly-Xiu net is deliberately small (32-16-8). Financial signals are weak.
2. **Temporal splitting**: Split chronologically before any shuffling. Train/val/test must be ordered in time, with no validation or test month in the training set.
3. **Early stopping**: Essential. Financial NNs overfit in just a few epochs.
4. **Loss function**: IC-based loss aligns with what we actually care about (ranking stocks).
5. **Ensembles**: Always average over random seeds. Cheap and effective.
6. **Cross-sectional normalization**: Rank-transform features within each month.

### Next Week

Week 8: LSTM/GRU for volatility forecasting --- when **sequence** matters.