# Week 7 Seminar: Feedforward Neural Networks for Asset Pricing

**Quantitative Finance ML Course**

---

## Today's Plan (90 min)

| Time | Activity |
|------|----------|
| 25 min | Exercise 1: Build the Gu-Kelly-Xiu 3-layer net |
| 25 min | Exercise 2: Temporal train/val/test splitting with DataLoaders |
| 20 min | Exercise 3: Compare MSE vs weighted MSE vs IC loss |
| 20 min | Discussion: When do NNs beat trees? |

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from scipy.stats import spearmanr
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 5)

np.random.seed(42)
torch.manual_seed(42)

In [None]:
# --- Shared data generation (same as lecture) ---

np.random.seed(42)
n_stocks = 500
n_months = 180

records = []
for t in range(n_months):
    for i in range(n_stocks):
        mom_1m = np.random.randn() * 0.08
        mom_12m = np.random.randn() * 0.20
        vol_20d = np.abs(np.random.randn()) * 0.02 + 0.01
        size = np.random.randn() * 2 + 15
        bm = np.random.randn() * 0.5
        turnover = np.abs(np.random.randn()) * 0.01
        rev_1m = np.random.randn() * 0.05

        ret_next = (
            -0.002 * mom_1m + 0.003 * mom_12m - 0.005 * vol_20d
            + 0.001 * bm + 0.002 * np.sin(mom_12m * size)
            + np.random.randn() * 0.08
        )

        records.append({
            'date_idx': t, 'stock_id': i,
            'mom_1m': mom_1m, 'mom_12m': mom_12m, 'vol_20d': vol_20d,
            'size': size, 'bm': bm, 'turnover': turnover, 'rev_1m': rev_1m,
            'ret_next': ret_next
        })

df = pd.DataFrame(records)
feature_cols = ['mom_1m', 'mom_12m', 'vol_20d', 'size', 'bm', 'turnover', 'rev_1m']

# Cross-sectional rank normalization
for col in feature_cols:
    df[col] = df.groupby('date_idx')[col].transform(
        lambda x: (x.rank() - 1) / (len(x) - 1) - 0.5
    )

print(f'Data ready: {df.shape[0]:,} observations, {len(feature_cols)} features')

---

## Exercise 1: Build the Gu-Kelly-Xiu 3-Layer Net (25 min)

Implement the NN3 architecture from the lecture:
- 3 hidden layers with 32, 16, 8 neurons
- ReLU activation after each hidden layer
- Batch normalization after each hidden linear layer (before the activation)
- Dropout after each activation
- Single linear output

**Tasks**:
1. Fill in the `__init__` method
2. Fill in the `forward` method
3. Verify the parameter count
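
The layer pattern above can be sketched as follows. This is one possible layout, not the reference solution: the class name `NN3Sketch` is hypothetical (chosen so it does not shadow the `GuKellyXiuNet` stub), and the default `dropout=0.5` only mirrors the stub's signature.

```python
import torch
import torch.nn as nn


class NN3Sketch(nn.Module):
    """Hypothetical sketch: Linear -> BatchNorm -> ReLU -> Dropout per hidden layer."""

    def __init__(self, input_dim, hidden_sizes=(32, 16, 8), dropout=0.5):
        super().__init__()
        layers = []
        prev = input_dim
        for width in hidden_sizes:
            layers += [
                nn.Linear(prev, width),
                nn.BatchNorm1d(width),
                nn.ReLU(),
                nn.Dropout(dropout),
            ]
            prev = width
        layers.append(nn.Linear(prev, 1))  # single linear output, no activation
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # (batch, 1) -> (batch,) so the output lines up with a 1-D target
        return self.net(x).squeeze(-1)


sketch = NN3Sketch(input_dim=7)
n_params = sum(p.numel() for p in sketch.parameters() if p.requires_grad)
out = sketch(torch.randn(64, 7))
print(n_params, tuple(out.shape))
```

With 7 inputs, this layout has 1,041 trainable parameters: 256 + 528 + 136 + 9 in the linear layers and 64 + 32 + 16 in the batch-norm layers. The batch-norm running statistics are buffers, so they are not counted.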

In [None]:
class GuKellyXiuNet(nn.Module):
    """
    TODO: Implement the Gu-Kelly-Xiu NN3 architecture.
    
    Architecture: input -> Linear(32) -> BN -> ReLU -> Dropout
                       -> Linear(16) -> BN -> ReLU -> Dropout
                       -> Linear(8)  -> BN -> ReLU -> Dropout
                       -> Linear(1)
    """
    def __init__(self, input_dim, hidden_sizes=(32, 16, 8), dropout=0.5):
        super().__init__()
        # TODO: Build the network
        # Hint: use nn.Sequential with nn.Linear, nn.BatchNorm1d, nn.ReLU, nn.Dropout
        pass

    def forward(self, x):
        # TODO: Forward pass
        # Hint: don't forget to squeeze the output from (batch, 1) to (batch,)
        pass

In [None]:
# --- Test your implementation ---
model = GuKellyXiuNet(input_dim=7)
print(model)

# Test forward pass
x_test = torch.randn(64, 7)  # batch of 64, 7 features
y_test = model(x_test)
print(f'\nInput shape:  {x_test.shape}')
print(f'Output shape: {y_test.shape}')  # should be (64,)

n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'Parameters:   {n_params}')  # should be around 1,000-1,200 (1,041 with biases and affine batch norm)

assert y_test.shape == (64,), f'Expected shape (64,), got {y_test.shape}'
print('\nAll checks passed!')

---

## Exercise 2: Temporal Train/Val/Test Splitting (25 min)

Implement proper temporal splitting for financial panel data.

**Requirements**:
1. Create a `CrossSectionalDataset` class
2. Split data into train (months 0-107), val (108-143), test (144-179)
3. Create DataLoaders with `shuffle=True` for the training loader only (validation and test stay in order)
4. Write a training loop with early stopping
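
A minimal sketch of requirements 1-3 on a toy panel, assuming a frame with a `date_idx` column like the one built earlier. The names `toy`, `cols`, and `PanelDataset` are hypothetical, and the batch sizes are shrunk to fit the toy data.

```python
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader

# Toy panel standing in for `df`: 180 months x 5 stocks, 2 features (assumption)
rng = np.random.default_rng(0)
n_months, n_stocks = 180, 5
toy = pd.DataFrame({
    'date_idx': np.repeat(np.arange(n_months), n_stocks),
    'f1': rng.standard_normal(n_months * n_stocks),
    'f2': rng.standard_normal(n_months * n_stocks),
    'ret_next': rng.standard_normal(n_months * n_stocks) * 0.08,
})
cols = ['f1', 'f2']

train_end, val_end = 108, 144  # half-open bounds: [0, 108), [108, 144), [144, 180)
train_df = toy[toy['date_idx'] < train_end]
val_df = toy[(toy['date_idx'] >= train_end) & (toy['date_idx'] < val_end)]
test_df = toy[toy['date_idx'] >= val_end]


class PanelDataset(Dataset):
    """Hypothetical helper: holds features and targets as float32 tensors."""

    def __init__(self, features, targets):
        self.X = torch.as_tensor(np.asarray(features), dtype=torch.float32)
        self.y = torch.as_tensor(np.asarray(targets), dtype=torch.float32)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]


train_ds = PanelDataset(train_df[cols].values, train_df['ret_next'].values)
val_ds = PanelDataset(val_df[cols].values, val_df['ret_next'].values)
test_ds = PanelDataset(test_df[cols].values, test_df['ret_next'].values)

# Shuffling rows within the training window is fine; the split itself stays temporal
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=128, shuffle=False)
test_loader = DataLoader(test_ds, batch_size=128, shuffle=False)

# No month appears in more than one split
assert train_df['date_idx'].max() < val_df['date_idx'].min()
assert val_df['date_idx'].max() < test_df['date_idx'].min()
print(len(train_ds), len(val_ds), len(test_ds))
```

Splitting on `date_idx` rather than on row position keeps every stock from a given month in the same split, so no validation or test month leaks into training.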

In [None]:
class CrossSectionalDataset(Dataset):
    """Dataset for cross-sectional stock data."""
    def __init__(self, features, targets):
        # TODO: Convert to tensors and store
        pass

    def __len__(self):
        # TODO
        pass

    def __getitem__(self, idx):
        # TODO: Return (features, target) tuple
        pass

In [None]:
# TODO: Split the data temporally
train_end = 108
val_end = 144

# TODO: Create train_df, val_df, test_df
# TODO: Extract X_train, y_train, X_val, y_val, X_test, y_test
# TODO: Create Datasets and DataLoaders

# train_loader = DataLoader(..., batch_size=2048, shuffle=True)
# val_loader = DataLoader(..., batch_size=4096, shuffle=False)
# test_loader = DataLoader(..., batch_size=4096, shuffle=False)

# Verify:
# print(f'Train: {len(train_ds)}, Val: {len(val_ds)}, Test: {len(test_ds)}')

In [None]:
class EarlyStopping:
    """TODO: Implement early stopping."""
    def __init__(self, patience=10):
        self.patience = patience
        self.best_loss = float('inf')
        self.counter = 0
        self.best_state = None

    def step(self, val_loss, model):
        """Returns True if we should stop training."""
        # TODO: Implement the logic
        # - If val_loss improved, save model state, reset counter
        # - If not, increment counter
        # - Return True if counter >= patience
        pass

    def restore_best(self, model):
        """Restore the best model state."""
        # TODO
        pass

In [None]:
# TODO: Write the training loop
# 1. Initialize model, optimizer (Adam, lr=1e-3, weight_decay=1e-5)
# 2. For each epoch:
#    a. Train: model.train(), iterate train_loader, compute MSE, backprop
#    b. Validate: model.eval(), iterate val_loader, compute MSE
#    c. Check early stopping
# 3. Restore best model
# 4. Evaluate on the test set with IC (Spearman rank correlation per month, averaged over months)

# model = GuKellyXiuNet(input_dim=7)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
# stopper = EarlyStopping(patience=10)
# ...

print('Training loop: implement above')
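
The early-stopping logic and loop outlined above could look roughly like this. It is a sketch under simplifying assumptions: `PatienceStopper` is a hypothetical name, the model is a toy `nn.Linear`, and each epoch is a single full-batch step instead of an iteration over `train_loader`.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)


class PatienceStopper:
    """Hypothetical early-stopping helper: keeps the best weights seen so far."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best_loss = float('inf')
        self.counter = 0
        self.best_state = None

    def step(self, val_loss, model):
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.counter = 0
            # deepcopy: state_dict() returns references to the live tensors
            self.best_state = copy.deepcopy(model.state_dict())
        else:
            self.counter += 1
        return self.counter >= self.patience

    def restore_best(self, model):
        if self.best_state is not None:
            model.load_state_dict(self.best_state)


# Toy data and a toy linear model, only to exercise the loop (assumptions)
X_tr, X_va = torch.randn(512, 7), torch.randn(256, 7)
w = torch.randn(7) * 0.01
y_tr = X_tr @ w + 0.08 * torch.randn(512)
y_va = X_va @ w + 0.08 * torch.randn(256)

model = nn.Linear(7, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
stopper = PatienceStopper(patience=5)

for epoch in range(200):
    model.train()
    optimizer.zero_grad()
    loss = ((model(X_tr).squeeze(-1) - y_tr) ** 2).mean()
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = ((model(X_va).squeeze(-1) - y_va) ** 2).mean().item()
    if stopper.step(val_loss, model):
        break

stopper.restore_best(model)
model.eval()
with torch.no_grad():
    restored = ((model(X_va).squeeze(-1) - y_va) ** 2).mean().item()
print(f'stopped after epoch {epoch}, best val MSE {stopper.best_loss:.6f}')
```

The `copy.deepcopy` call matters: without it, `best_state` would keep pointing at tensors that the optimizer continues to update, and restoring would silently return the last weights instead of the best ones.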

---

## Exercise 3: Compare Loss Functions (20 min)

Train the same architecture with 3 different loss functions and compare test IC:

1. **MSE**: standard mean squared error
2. **Weighted MSE**: weight by absolute return (focus on big movers)
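3. **IC loss**: maximize the correlation between predictions and realized returns within each batch (negative Pearson correlation, a differentiable stand-in for rank IC)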
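
As a rough illustration of losses 2 and 3, here is one way to write them in PyTorch. The function names and the `eps` stabilizer are assumptions, not part of the exercise stubs, and other weighting schemes are equally valid.

```python
import torch


def weighted_mse_sketch(y_pred, y_true, eps=1e-8):
    """Squared errors weighted by |y_true|, with weights normalized to sum to 1."""
    w = y_true.abs()
    w = w / (w.sum() + eps)
    return (w * (y_pred - y_true) ** 2).sum()


def ic_loss_sketch(y_pred, y_true, eps=1e-8):
    """Negative Pearson correlation between predictions and targets in the batch."""
    p = y_pred - y_pred.mean()
    t = y_true - y_true.mean()
    corr = (p * t).sum() / (p.norm() * t.norm() + eps)
    return -corr


y_true = torch.tensor([0.10, -0.05, 0.02, -0.20])
y_pred = (0.5 * y_true + 0.01).requires_grad_()  # positive affine map of y_true

loss = ic_loss_sketch(y_pred, y_true)
loss.backward()  # gradients flow, so the loss can drive an optimizer
print(round(loss.item(), 4))
print(round(weighted_mse_sketch(torch.zeros(4), y_true).item(), 5))
```

Because Pearson correlation is invariant to positive affine rescaling, the IC loss says nothing about the scale of the predictions. It also only measures a true cross-sectional correlation when a batch holds a single month, which shuffled batches do not guarantee.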

In [None]:
def mse_loss(y_pred, y_true):
    return ((y_pred - y_true) ** 2).mean()


def weighted_mse_loss(y_pred, y_true):
    """TODO: Implement weighted MSE, weighting by |y_true|."""
    # Hint: normalize weights so they sum to 1
    pass


def ic_loss(y_pred, y_true):
    """TODO: Implement negative Pearson correlation loss."""
    # Hint: de-mean both, compute correlation, return negative
    pass

In [None]:
# TODO: Train 3 models with different loss functions
# For each:
#   1. Train with early stopping (same hyperparameters)
#   2. Compute test IC
#   3. Store results

# loss_functions = {
#     'MSE': mse_loss,
#     'Weighted MSE': weighted_mse_loss,
#     'IC Loss': ic_loss
# }

# results = {}
# for name, loss_fn in loss_functions.items():
#     ... train model ...
#     ... compute IC ...
#     results[name] = {'mean_ic': ..., 'ic_ir': ...}

print('Loss comparison: implement above')

In [None]:
# TODO: Create a comparison bar chart of Mean IC for each loss function
# Also create a table showing Mean IC, IC IR, and the percentage of months with IC > 0

print('Comparison plot: implement above')
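
The three summary statistics named above can be computed from per-month rank ICs. The sketch below assumes a frame with one row per stock-month and a prediction column; the helper `monthly_ic_summary` and the column name `pred` are hypothetical.

```python
import pandas as pd
from scipy.stats import spearmanr


def monthly_ic_summary(frame, pred_col='pred', target_col='ret_next', date_col='date_idx'):
    """Hypothetical helper: rank IC per month, then mean IC, IC IR, and hit rate."""
    ics = pd.Series({
        date: spearmanr(g[pred_col], g[target_col])[0]
        for date, g in frame.groupby(date_col)
    })
    return {
        'mean_ic': ics.mean(),
        'ic_ir': ics.mean() / ics.std(),          # mean over sample std of monthly ICs
        'pct_positive': (ics > 0).mean() * 100,   # share of months with IC > 0
    }


# Toy check: predictions rank perfectly in months 0 and 2, perfectly reversed in month 1
toy = pd.DataFrame({
    'date_idx': [0] * 4 + [1] * 4 + [2] * 4,
    'pred': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
    'ret_next': [0.01, 0.02, 0.03, 0.04, 0.04, 0.03, 0.02, 0.01, -0.02, 0.00, 0.01, 0.05],
})
summary = monthly_ic_summary(toy)
print(summary)
```

Collecting one such dictionary per loss function into `pd.DataFrame(results).T` gives the comparison table directly, and its `mean_ic` column can feed the bar chart.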

---

## Discussion: When Do NNs Beat Trees? (20 min)

Consider the following questions with your group:

### Question 1
The Gu-Kelly-Xiu paper shows NNs outperform tree-based models in cross-sectional return prediction. But many practitioners still prefer XGBoost/LightGBM. Why?

**Think about**: training speed, interpretability, hyperparameter sensitivity, data requirements.

### Question 2
In what scenarios would you expect neural nets to have a clear advantage over trees?

**Think about**: nonlinear interactions between features, smoothness of the signal, dataset size.

### Question 3
Why do we use a small architecture (32-16-8) instead of a large one (256-128-64)?

**Think about**: signal-to-noise ratio in financial data, overfitting risk, sample size.
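
To put numbers on the sample-size point, the sketch below counts trainable parameters for both architectures against this notebook's training set (108 months of 500 stocks). The helper `mlp_param_count` is hypothetical and assumes biases on every linear layer and affine batch norm on every hidden layer.

```python
def mlp_param_count(input_dim, hidden_sizes, batch_norm=True):
    """Trainable parameters of Linear(+BatchNorm) hidden layers plus a 1-unit output."""
    total, prev = 0, input_dim
    for width in hidden_sizes:
        total += prev * width + width      # Linear weights + biases
        if batch_norm:
            total += 2 * width             # BatchNorm scale and shift
        prev = width
    return total + prev + 1                # output layer: weights + bias


n_train = 108 * 500  # training months x stocks per month in this notebook
counts = {}
for sizes in [(32, 16, 8), (256, 128, 64)]:
    counts[sizes] = mlp_param_count(7, sizes)
    print(sizes, counts[sizes], f'{n_train / counts[sizes]:.1f} observations per parameter')
```

The small net has 1,041 parameters against 54,000 training observations, while the large one has 44,161, which leaves barely more than one noisy observation per parameter.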

### Question 4
Why is ensembling over random seeds so effective for neural nets but less critical for tree-based models?

**Think about**: loss landscape, initialization sensitivity, bagging in tree ensembles.

### Question 5
Should we use the same loss function (e.g., MSE) for NNs and trees, or should we tailor the loss to the model type?

**Think about**: differentiability requirements, what each model is good at learning.

### Discussion Notes

*Write your group's key takeaways here:*

- Q1: ...
- Q2: ...
- Q3: ...
- Q4: ...
- Q5: ...

---

## Summary

Today you practiced:
1. Building the Gu-Kelly-Xiu feedforward architecture in PyTorch
2. Implementing proper temporal data splitting for financial panels
3. Comparing loss functions for return prediction (MSE, weighted MSE, IC)

**For the homework**: you'll extend this to expanding-window CV, ensemble the network with tree models, and analyze where NNs win and where they lose.
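
As a starting point for the expanding-window part, a split generator might look like the sketch below. The function name and all window lengths (`first_train`, `val_len`, `test_len`, `step`) are illustrative assumptions; the homework may specify different ones.

```python
def expanding_window_splits(n_months, first_train=60, val_len=12, test_len=12, step=12):
    """Yield (train, val, test) month ranges; the training window grows each fold."""
    train_end = first_train
    while train_end + val_len + test_len <= n_months:
        val_end = train_end + val_len
        yield (range(0, train_end), range(train_end, val_end),
               range(val_end, val_end + test_len))
        train_end += step


folds = list(expanding_window_splits(180))
for train, val, test in folds:
    print(f'train 0-{train[-1]}, val {val[0]}-{val[-1]}, test {test[0]}-{test[-1]}')
```

Each fold trains on every month up to a cutoff, validates on the next block, and tests on the block after that, so the test months always lie strictly after the training and validation months.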