# Exogenous Variables in APDTFlow v0.2.0

## Boost Forecast Accuracy by 30-50%! 🚀

This notebook demonstrates how to use **external features** (exogenous variables) to improve forecasting accuracy, by comparing a baseline model against one that receives those features.

### What are Exogenous Variables?

External features that influence your target variable:
- **Sales forecasting**: Weather, holidays, promotions
- **Energy demand**: Temperature, day-of-week
- **Traffic prediction**: Events, weather conditions
- **Stock prices**: Economic indicators

### Research Evidence

Recent research reports accuracy improvements of roughly **30-50%** when relevant exogenous variables are available:
- **ChronosX** (March 2025) - arXiv:2503.12107
- **TimeXer** (Feb 2024) - arXiv:2402.19072
- **ExoLLM** (2025) - State-of-the-art with external features

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime, timedelta

from apdtflow import APDTFlowForecaster
from apdtflow.exogenous import ExogenousFeatureFusion
import torch

print("✓ APDTFlow v0.2.0 loaded successfully!")

## Step 1: Create Synthetic Dataset

We'll create a sales dataset influenced by:
1. **Temperature** - Past observed only
2. **Holidays** - Known in advance
3. **Promotions** - Known in advance

In [None]:
# Generate dates (2 years of daily data)
dates = pd.date_range(start='2023-01-01', end='2024-12-31', freq='D')
n = len(dates)

# Set seed for reproducibility
np.random.seed(42)

# 1. Temperature (seasonal pattern)
temperature = 20 + 10 * np.sin(np.arange(n) * 2 * np.pi / 365) + np.random.normal(0, 2, n)

# 2. Is holiday (each day has a 5% chance)
is_holiday = np.random.choice([0, 1], size=n, p=[0.95, 0.05])

# 3. Promotion (each day has a 15% chance)
promotion = np.random.choice([0, 1], size=n, p=[0.85, 0.15])

# Sales: influenced by all external factors!
base_sales = 100 + 20 * np.sin(np.arange(n) * 2 * np.pi / 365)  # Seasonal
sales = (base_sales
         - 0.5 * temperature          # Cooler weather → more sales
         + 30 * is_holiday            # +30 on holidays
         + 25 * promotion             # +25 during promotions
         + np.random.normal(0, 5, n)) # Random noise

# Create DataFrame
df = pd.DataFrame({
    'date': dates,
    'sales': sales,
    'temperature': temperature,
    'is_holiday': is_holiday.astype(int),
    'promotion': promotion.astype(int)
})

print(f"Dataset created: {len(df)} days")
print(f"\nColumns: {list(df.columns)}")
print("\nFirst 10 rows:")
df.head(10)

### Visualize the Data

In [None]:
fig, axes = plt.subplots(4, 1, figsize=(14, 10))

# Sales
axes[0].plot(df['date'], df['sales'])
axes[0].set_title('Sales (Target Variable)', fontsize=14, fontweight='bold')
axes[0].set_ylabel('Sales')
axes[0].grid(alpha=0.3)

# Temperature  
axes[1].plot(df['date'], df['temperature'], color='orange')
axes[1].set_title('Temperature (Exogenous - Past Observed)', fontsize=14, fontweight='bold')
axes[1].set_ylabel('°C')
axes[1].grid(alpha=0.3)

# Holidays
axes[2].fill_between(df['date'], df['is_holiday'], alpha=0.6, color='green')
axes[2].set_title('Holidays (Exogenous - Future Known)', fontsize=14, fontweight='bold')
axes[2].set_ylabel('Is Holiday')
axes[2].grid(alpha=0.3)

# Promotions
axes[3].fill_between(df['date'], df['promotion'], alpha=0.6, color='red')
axes[3].set_title('Promotions (Exogenous - Future Known)', fontsize=14, fontweight='bold')
axes[3].set_ylabel('Promotion')
axes[3].set_xlabel('Date')
axes[3].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("Notice how sales spike during holidays and promotions!")

## Step 2: Baseline Model (WITHOUT Exogenous Variables)

Train a model without external features as a baseline.
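
Before training the neural baseline, it can help to have a trivial reference point, so that any model's error can be judged against something that needs no training at all. The sketch below is plain NumPy and is not part of APDTFlow; the helper names (`mae`, `rmse`, `last_value_forecast`, `seasonal_naive_forecast`) and the weekly `season_length=7` are assumptions made for illustration. The toy series is deliberately periodic, which the synthetic sales data in this notebook is not.

```python
import numpy as np

def mae(pred, actual):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(pred) - np.asarray(actual))))

def rmse(pred, actual):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(actual)) ** 2)))

def last_value_forecast(history, horizon):
    """Repeat the final observed value for every step of the horizon."""
    return np.full(horizon, history[-1], dtype=float)

def seasonal_naive_forecast(history, horizon, season_length=7):
    """Repeat the last full season (here: the last week) across the horizon."""
    last_season = np.asarray(history[-season_length:], dtype=float)
    reps = int(np.ceil(horizon / season_length))
    return np.tile(last_season, reps)[:horizon]

# Toy series with an exactly repeating weekly pattern (illustrative data only)
week = [10.0, 12.0, 11.0, 13.0, 15.0, 20.0, 18.0]
history = np.tile(week, 6)
actual_next = np.tile(week, 2)

print("last value MAE:", mae(last_value_forecast(history, 14), actual_next))
print("seasonal   MAE:", mae(seasonal_naive_forecast(history, 14), actual_next))
```

If a trained model cannot beat forecasts like these on the same window, the comparison between baseline and enhanced models is unlikely to be meaningful.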

In [None]:
# Split data: 80% train, 20% test
train_size = int(0.8 * len(df))
train_df = df.iloc[:train_size]
test_df = df.iloc[train_size:]

print(f"Training set: {len(train_df)} days")
print(f"Test set: {len(test_df)} days")

In [None]:
# Baseline model - NO exogenous variables
baseline_model = APDTFlowForecaster(
    forecast_horizon=14,
    history_length=30,
    num_epochs=30,
    verbose=True
)

baseline_model.fit(train_df, target_col='sales', date_col='date')
print("\n✓ Baseline model trained (no exogenous variables)")

In [None]:
# Make predictions
baseline_preds = baseline_model.predict()

# Evaluate on test data
actual = test_df['sales'].values[:14]
baseline_mae = np.mean(np.abs(baseline_preds - actual))
baseline_rmse = np.sqrt(np.mean((baseline_preds - actual)**2))

print("Baseline Model Performance:")
print(f"  MAE:  {baseline_mae:.2f}")
print(f"  RMSE: {baseline_rmse:.2f}")

## Step 3: Enhanced Model (WITH Exogenous Variables)

Now train with external features!
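
A quick sanity check on the feature columns before calling `fit` can catch common mistakes early. The helper below is a hypothetical pandas-only sketch, not an APDTFlow function. It assumes, based on the comments in this notebook, that every future-known column should also appear in the full list of exogenous columns, and that exogenous features should be numeric with no missing values.

```python
import pandas as pd

def check_exog_columns(df, exog_cols, future_exog_cols):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    missing = [c for c in exog_cols if c not in df.columns]
    if missing:
        problems.append(f"columns not in DataFrame: {missing}")
    not_subset = [c for c in future_exog_cols if c not in exog_cols]
    if not_subset:
        problems.append(f"future_exog_cols not listed in exog_cols: {not_subset}")
    for col in exog_cols:
        if col not in df.columns:
            continue  # already reported above
        if not pd.api.types.is_numeric_dtype(df[col]):
            problems.append(f"{col!r} is not numeric")
        elif df[col].isna().any():
            problems.append(f"{col!r} contains {int(df[col].isna().sum())} missing values")
    return problems

# Small demo frame with one gap and one undeclared future column
demo = pd.DataFrame({
    "sales": [100.0, 105.0, 98.0],
    "temperature": [20.1, None, 19.5],
    "is_holiday": [0, 1, 0],
})
problems = check_exog_columns(demo, ["temperature", "is_holiday"], ["is_holiday", "promotion"])
for problem in problems:
    print(problem)
```

On the synthetic `train_df` used here the list should come back empty; on real data, gaps in a weather feed or a text-typed flag column are typical findings.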

In [None]:
# Enhanced model - WITH exogenous variables!
enhanced_model = APDTFlowForecaster(
    forecast_horizon=14,
    history_length=30,
    num_epochs=30,
    exog_fusion_type='gated',  # Options: 'concat', 'gated', 'attention'
    verbose=True
)

# Train with exogenous variables
enhanced_model.fit(
    train_df,
    target_col='sales',
    date_col='date',
    exog_cols=['temperature', 'is_holiday', 'promotion'],  # All exog features
    future_exog_cols=['is_holiday', 'promotion']  # These are known in advance!
)

print("\n✓ Enhanced model trained WITH exogenous variables!")
print("  - Temperature (past observed)")
print("  - Holidays (future known)")
print("  - Promotions (future known)")

In [None]:
# Make predictions with future exog data
future_exog_df = test_df[['is_holiday', 'promotion']].head(14)

enhanced_preds, uncertainty = enhanced_model.predict(
    exog_future=future_exog_df,
    return_uncertainty=True
)

# Evaluate
enhanced_mae = np.mean(np.abs(enhanced_preds - actual))
enhanced_rmse = np.sqrt(np.mean((enhanced_preds - actual)**2))

print("Enhanced Model Performance:")
print(f"  MAE:  {enhanced_mae:.2f}")
print(f"  RMSE: {enhanced_rmse:.2f}")
print("\nError reduction vs. baseline (negative means worse):")
print(f"  MAE:  {(1 - enhanced_mae/baseline_mae)*100:.1f}%")
print(f"  RMSE: {(1 - enhanced_rmse/baseline_rmse)*100:.1f}%")

## Step 4: Comparison Visualization

In [None]:
fig, ax = plt.subplots(figsize=(14, 6))

x = np.arange(14)

# Actual values
ax.plot(x, actual, 'ko-', label='Actual', linewidth=2, markersize=8)

# Baseline predictions
ax.plot(x, baseline_preds, 'b--', label=f'Baseline (MAE: {baseline_mae:.2f})', linewidth=2)

# Enhanced predictions with uncertainty
ax.plot(x, enhanced_preds, 'r-', label=f'With Exog (MAE: {enhanced_mae:.2f})', linewidth=2)
ax.fill_between(x, 
                enhanced_preds - uncertainty, 
                enhanced_preds + uncertainty,
                alpha=0.3, color='red', label='Uncertainty')

ax.set_xlabel('Days Ahead', fontsize=12)
ax.set_ylabel('Sales', fontsize=12)
ax.set_title('Forecast Comparison: Baseline vs Exogenous Variables', 
             fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(alpha=0.3)

plt.tight_layout()
plt.show()

print(f"MAE reduction from exogenous variables: {(1 - enhanced_mae/baseline_mae)*100:.1f}% (negative means worse)")

## Step 5: Comparing Fusion Strategies

APDTFlow offers 3 fusion strategies:
1. **Concat**: Simple concatenation
2. **Gated**: Learned gating mechanism (recommended)
3. **Attention**: Attention-based fusion
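
To make the difference between the first two strategies concrete, here is a minimal NumPy sketch of the general ideas. It is an illustration of concatenation and gating as they are commonly defined, not APDTFlow's actual implementation; the function names, shapes, and random weights are all assumptions. Attention-based fusion is omitted to keep the example short.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def concat_fusion(target_feat, exog_feat):
    """Stack both feature vectors side by side; the feature width doubles."""
    return np.concatenate([target_feat, exog_feat], axis=-1)

def gated_fusion(target_feat, exog_feat, weight, bias):
    """Blend the two inputs with a per-dimension gate in (0, 1)."""
    gate = sigmoid(concat_fusion(target_feat, exog_feat) @ weight + bias)
    return gate * target_feat + (1.0 - gate) * exog_feat

rng = np.random.default_rng(0)
batch, hidden = 4, 8
target_feat = rng.normal(size=(batch, hidden))   # encoded target history
exog_feat = rng.normal(size=(batch, hidden))     # encoded exogenous features
weight = rng.normal(scale=0.1, size=(2 * hidden, hidden))  # learned in a real model
bias = np.zeros(hidden)

print(concat_fusion(target_feat, exog_feat).shape)
print(gated_fusion(target_feat, exog_feat, weight, bias).shape)
```

The gate lets the model decide, per feature dimension, how much to trust the exogenous signal. With all-zero weights and bias the gate is 0.5 everywhere, so the output is the plain average of the two inputs.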

In [None]:
fusion_results = {}

for fusion_type in ['concat', 'gated', 'attention']:
    print(f"\nTraining with {fusion_type} fusion...")
    
    model = APDTFlowForecaster(
        forecast_horizon=14,
        history_length=30,
        num_epochs=20,
        exog_fusion_type=fusion_type,
        verbose=False
    )
    
    model.fit(
        train_df,
        target_col='sales',
        date_col='date',
        exog_cols=['temperature', 'is_holiday', 'promotion'],
        future_exog_cols=['is_holiday', 'promotion']
    )
    
    preds = model.predict(exog_future=future_exog_df)
    mae = np.mean(np.abs(preds - actual))
    
    fusion_results[fusion_type] = {'mae': mae, 'preds': preds}
    print(f"  {fusion_type}: MAE = {mae:.2f}")

print("\n✓ Fusion strategy comparison complete!")

In [None]:
# Visualize fusion comparison
fig, ax = plt.subplots(figsize=(12, 6))

x = np.arange(14)
ax.plot(x, actual, 'ko-', label='Actual', linewidth=2, markersize=8)

colors = {'concat': 'blue', 'gated': 'red', 'attention': 'green'}
for fusion_type, results in fusion_results.items():
    ax.plot(x, results['preds'], '--', 
            color=colors[fusion_type],
            label=f"{fusion_type.capitalize()} (MAE: {results['mae']:.2f})",
            linewidth=2)

ax.set_xlabel('Days Ahead', fontsize=12)
ax.set_ylabel('Sales', fontsize=12)
ax.set_title('Fusion Strategy Comparison', fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(alpha=0.3)

plt.tight_layout()
plt.show()

## Step 6: Direct Usage of Fusion Module

For advanced users: use the fusion module directly in custom models.
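
The example tensors in the cell below use the layout `(batch, features, time)`. When starting from tabular data, windows in that layout can be cut with NumPy's `sliding_window_view`. The sketch below is a NumPy-only illustration; `make_windows` is a hypothetical helper, and the random arrays stand in for real target and exogenous columns.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(values, history_length):
    """Turn a (time, features) array into (batch, features, time) windows."""
    values = np.asarray(values, dtype=np.float32)
    # The window axis is appended last, which yields (batch, features, time)
    return sliding_window_view(values, history_length, axis=0)

n_days, history_length = 100, 30
rng = np.random.default_rng(0)
target_values = rng.normal(size=(n_days, 1))  # stand-in for the target column
exog_values = rng.normal(size=(n_days, 3))    # stand-in for three exog columns

target_windows = make_windows(target_values, history_length)
exog_windows = make_windows(exog_values, history_length)
print(target_windows.shape, exog_windows.shape)
```

`sliding_window_view` returns a read-only view, so call `.copy()` on the result before handing it to `torch.from_numpy`.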

In [None]:
# Create fusion module
fusion = ExogenousFeatureFusion(
    hidden_dim=32,
    num_exog_features=3,
    fusion_type='gated'
)

# Example tensors
target = torch.randn(8, 1, 30)  # batch=8, channels=1, time=30
exog = torch.randn(8, 3, 30)    # batch=8, exog_features=3, time=30

# Fuse target with exog
fused = fusion(target, exog)

print(f"Target shape:  {target.shape}")
print(f"Exog shape:    {exog.shape}")
print(f"Fused shape:   {fused.shape}")
print("\n✓ Fusion module can be integrated into custom PyTorch models!")

## Key Takeaways

### 1. Types of Exogenous Variables

**Past Observed**
- Only available historically
- Examples: Temperature, actual traffic, competitor prices
- Supplied as history only; no future values are passed at prediction time

**Future Known**
- Known in advance for forecast period
- Examples: Holidays, planned promotions, calendar features
- Critical for predictions!
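
In a live forecast there is no test set to slice future-known features from, so they have to be built from a calendar or a plan. The sketch below shows one way to do that with pandas. `build_future_exog` is a hypothetical helper, and the holiday and promotion dates are made up for illustration.

```python
import pandas as pd

def build_future_exog(last_date, horizon, holiday_dates, promo_dates):
    """Build the known-in-advance features for the next `horizon` days."""
    future_dates = pd.date_range(
        start=pd.Timestamp(last_date) + pd.Timedelta(days=1),
        periods=horizon,
        freq="D",
    )
    return pd.DataFrame({
        "date": future_dates,
        "is_holiday": future_dates.isin(pd.to_datetime(holiday_dates)).astype(int),
        "promotion": future_dates.isin(pd.to_datetime(promo_dates)).astype(int),
    })

# Hypothetical plan for the two weeks after the last observed day
future = build_future_exog(
    last_date="2024-12-31",
    horizon=14,
    holiday_dates=["2025-01-01"],
    promo_dates=["2025-01-10", "2025-01-11"],
)
print(future.head())
```

To match the frame passed as `exog_future` earlier in this notebook, keep only the feature columns, for example `future[['is_holiday', 'promotion']]`.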

### 2. Fusion Strategies

- **Concat**: Fast, simple baseline
- **Gated**: Learns importance weights (recommended)
- **Attention**: Most flexible, captures complex interactions

### 3. Expected Improvements

The cited research reports roughly **30-50% accuracy improvement** with relevant exogenous variables; the gain on your own data depends on how informative the features are.
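
The percentage used throughout this notebook is the relative reduction in error, `(1 - enhanced / baseline) * 100`. The short sketch below works through that arithmetic. All numbers are illustrative, not results from this notebook, and `relative_improvement` is a helper name chosen for the example.

```python
import numpy as np

def relative_improvement(baseline_error, enhanced_error):
    """Percent reduction in error relative to the baseline."""
    return (1.0 - enhanced_error / baseline_error) * 100.0

# Illustrative numbers only
print(round(relative_improvement(10.0, 6.0), 1))   # error falls from 10 to 6
print(round(relative_improvement(10.0, 11.0), 1))  # negative: the enhanced model is worse

# Averaging over several forecast windows gives a steadier estimate
baseline_maes = np.array([10.0, 8.0, 12.0])
enhanced_maes = np.array([6.0, 7.0, 9.0])
overall = relative_improvement(baseline_maes.mean(), enhanced_maes.mean())
print(round(overall, 1))
```

A single 14-day window, as evaluated above, can land well above or below the long-run figure, so repeating the comparison over several windows is worth the extra training time.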

### 4. API Usage

```python
model = APDTFlowForecaster(
    exog_fusion_type='gated'  # Choose fusion strategy
)

model.fit(
    df,
    target_col='sales',
    exog_cols=['temp', 'holiday', 'promo'],  # All exog features
    future_exog_cols=['holiday', 'promo']    # Known in advance
)

preds = model.predict(exog_future=future_df)
```

### 5. References

- **ChronosX** (March 2025): arXiv:2503.12107
- **TimeXer** (Feb 2024): arXiv:2402.19072
- **ExoLLM** (2025): State-of-the-art with LLM-based features

## Next Steps

1. Try it with your own data!
2. Experiment with different fusion strategies
3. Combine with conformal prediction (see `conformal_prediction.ipynb`)
4. Check out advanced features in the documentation

📚 **Documentation**: https://github.com/yotambraun/APDTFlow