# DFM Python - Basic Tutorial

This notebook demonstrates the basic usage of the Dynamic Factor Model (DFM) package.

## Outline

1. **Data Loading**: Load configuration and time series data
2. **Model Training**: Estimate factors and parameters using the EM algorithm
3. **Inference and Forecasting**: Generate forecasts for future periods
4. **Model Persistence**: Save and load trained models

## Prerequisites

- Python 3.12+
- Required packages: numpy, pandas, matplotlib
- DFM package installed


In [3]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
import pickle
import warnings
warnings.filterwarnings('ignore')  # Hides all warnings for cleaner output; remove this line while debugging

# DFM package import
# High-level API (recommended - simple and intuitive)
import dfm_python as dfm
from dfm_python import DFMResult

print("✓ All libraries loaded")


✓ All libraries loaded


## 1. Data Loading

First, we need to load the model configuration and time series data.

### Configuration Options

The DFM package supports multiple configuration formats:

1. **Hydra-style YAML** (recommended): `load_config('config/default.yaml')`
   - Main settings from `config/default.yaml`
   - Series definitions from `config/series/default.yaml`
   - Block definitions from `config/blocks/default.yaml`

2. **Spec CSV**: `load_config_from_spec('data/sample_spec.csv')`
   - Loads series and block definitions from CSV

3. **Direct DFMConfig**: Create `DFMConfig` objects programmatically
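
Before loading a Hydra-style configuration, it can help to confirm that the three files listed above are present. The following is a minimal sketch using only the standard library; the helper `missing_config_files` is hypothetical and is not part of the DFM package.

```python
from pathlib import Path

# Paths taken from the list above; the helper itself is illustrative
# and is not part of the DFM package.
EXPECTED_CONFIG_FILES = (
    "config/default.yaml",
    "config/series/default.yaml",
    "config/blocks/default.yaml",
)


def missing_config_files(root="."):
    """Return the expected config files that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_CONFIG_FILES if not (root / rel).is_file()]


missing = missing_config_files()
if missing:
    print("Missing config files:", ", ".join(missing))
else:
    print("All expected config files found")
```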


In [None]:
# Load configuration using high-level API (recommended)
dfm.load_config('config/default.yaml')
config = dfm.get_config()

print(f"✓ Config loaded")
print(f"  - Number of series: {len(config.series)}")
print(f"  - Number of blocks: {len(config.block_names)}")
print(f"  - Block names: {', '.join(config.block_names)}")
print(f"  - Clock frequency: {config.clock}")
print(f"  - Convergence threshold: {config.threshold}")
print(f"  - Max iterations: {config.max_iter}")


In [None]:
# Display sample series information
print("Series sample (first 5):")
for i, series in enumerate(config.series[:5]):
    print(f"\n  {i+1}. {series.series_id}")
    print(f"     Name: {series.series_name[:60]}...")
    print(f"     Frequency: {series.frequency}")
    print(f"     Transformation: {series.transformation}")
    print(f"     Blocks: {series.blocks}")
    print(f"     Category: {series.category}")
    print(f"     Units: {series.units}")


### Load Time Series Data

Now we load the actual time series data. The data will be automatically:
- Sorted to match the configuration order
- Transformed according to each series' transformation specification
- Validated for frequency constraints
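
To make the transformation step concrete, here is a small sketch of what a per-series transformation might look like. The codes `lin`, `chg`, `pch`, and `log` and the function `apply_transformation` are illustrative assumptions; the codes the package actually accepts are defined by its configuration and may differ.

```python
import numpy as np


def apply_transformation(x, code):
    """Apply one illustrative transformation to a 1-D series.

    The codes here ('lin', 'chg', 'pch', 'log') are hypothetical examples,
    not necessarily the ones the DFM package accepts.
    """
    x = np.asarray(x, dtype=float)
    out = np.full_like(x, np.nan)
    if code == "lin":      # levels, unchanged
        out[:] = x
    elif code == "chg":    # first difference
        out[1:] = x[1:] - x[:-1]
    elif code == "pch":    # percent change from the previous period
        out[1:] = 100.0 * (x[1:] / x[:-1] - 1.0)
    elif code == "log":    # natural logarithm
        out[:] = np.log(x)
    else:
        raise ValueError(f"unknown transformation code: {code!r}")
    return out


levels = np.array([100.0, 102.0, 105.06])
print(apply_transformation(levels, "chg"))
print(apply_transformation(levels, "pch"))
```

Differencing transformations lose the first observation, which is why the first entry of the result is NaN; the model treats such entries as missing data.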


In [None]:
# Load data using high-level API
# sample_start filters data to start from a specific date (for quick testing)
dfm.load_data('data/sample_data.csv', sample_start='2022-01-01')

# Get the processed data
X = dfm.get_data()  # Transformed data (ready for DFM estimation)
Time = dfm.get_time()  # Time index
Z = dfm.get_original_data()  # Original untransformed data (not the factor matrix result.Z)

print(f"✓ Data loaded")
print(f"  - Shape: {X.shape} (time periods × series)")
print(f"  - Time range: {Time[0]} ~ {Time[-1]}")
print(f"  - Number of periods: {len(Time)}")
print(f"  - Number of series: {X.shape[1]}")
print(f"  - Missing data ratio: {np.isnan(X).sum() / X.size * 100:.2f}%")


In [None]:
# Visualize first 5 series
fig, axes = plt.subplots(5, 1, figsize=(12, 10))
for i in range(min(5, X.shape[1])):
    axes[i].plot(Time, X[:, i], linewidth=1.5)
    axes[i].set_title(f"{config.series[i].series_id}: {config.series[i].series_name[:50]}...")
    axes[i].set_ylabel("Value")
    axes[i].grid(True, alpha=0.3)
axes[-1].set_xlabel("Date")
plt.tight_layout()
Path('outputs').mkdir(exist_ok=True)  # Create the output directory before the first save
plt.savefig('outputs/data_visualization.png', dpi=150, bbox_inches='tight')
print("✓ Saved data visualization: outputs/data_visualization.png")
plt.show()


## 2. Model Training

The DFM estimation uses the Expectation-Maximization (EM) algorithm to:
- Estimate factor loadings (C matrix)
- Estimate transition dynamics (A matrix)
- Extract latent factors (Z)
- Handle missing data via Kalman filtering
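
The missing-data handling can be illustrated with a small standalone Kalman filter that, at each time step, uses only the series that are actually observed. This is a generic textbook sketch under assumed toy parameters, not the package's implementation; the function name `kalman_filter_missing` and all values below are hypothetical.

```python
import numpy as np


def kalman_filter_missing(X, A, C, Q, R, z0, P0):
    """Kalman filter that skips NaN entries of each observation row.

    X is (T x n) with NaN for missing values; returns filtered states (T x k).
    """
    T, k = X.shape[0], A.shape[0]
    z, P = z0.copy(), P0.copy()
    Z_filt = np.zeros((T, k))
    for t in range(T):
        # Prediction step: propagate state mean and covariance
        z = A @ z
        P = A @ P @ A.T + Q
        obs = ~np.isnan(X[t])              # which series are observed at t
        if obs.any():
            C_t = C[obs]                   # keep only the observed rows
            R_t = R[np.ix_(obs, obs)]
            S = C_t @ P @ C_t.T + R_t      # innovation covariance
            K = np.linalg.solve(S, C_t @ P).T  # Kalman gain: P C_t' S^{-1}
            z = z + K @ (X[t, obs] - C_t @ z)
            P = P - K @ C_t @ P
        Z_filt[t] = z                      # with no data, this is the prediction
    return Z_filt


# Toy example: one factor, two series, some missing entries
rng = np.random.default_rng(0)
A = np.array([[0.8]])
C = np.array([[1.0], [0.5]])
Q = np.array([[1.0]])
R = 0.1 * np.eye(2)
X = rng.normal(size=(6, 2))
X[2, :] = np.nan   # a fully missing period
X[4, 0] = np.nan   # a partially missing period
Z_filt = kalman_filter_missing(X, A, C, Q, R, np.zeros(1), np.eye(1))
print(Z_filt.shape)
```

When a whole period is missing, the update step is skipped and the filtered state is simply the one-step prediction, which is what lets the model carry factors through gaps in the data.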

### Training Parameters

- **threshold**: Convergence tolerance (smaller = more precise, slower)
- **max_iter**: Maximum EM iterations
- **fast**: Quick test mode (threshold=1e-2, max_iter=5)
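
As a rough illustration of how a convergence tolerance is typically applied in EM, the sketch below compares the relative change in log-likelihood between two iterations against `threshold`. Whether the package uses exactly this rule is an assumption; `em_converged` is a hypothetical helper.

```python
def em_converged(loglik, previous_loglik, threshold=1e-5):
    """Return True when the relative log-likelihood change is below threshold."""
    eps = 1e-12  # guards against division by zero
    delta = abs(loglik - previous_loglik)
    avg = (abs(loglik) + abs(previous_loglik) + eps) / 2.0
    return delta / avg < threshold


# The same pair of log-likelihoods passes a loose tolerance but not a tight one
print(em_converged(-1000.0, -1000.5, threshold=1e-2))
print(em_converged(-1000.0, -1000.5, threshold=1e-5))
```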


In [None]:
# Train the model using high-level API
# fast=True enables quick test mode (threshold=1e-2, max_iter=5)
# For production, use: dfm.train(threshold=1e-5, max_iter=5000)
dfm.train(
    fast=True,       # Fast test mode
    max_iter=1       # Single EM iteration: a smoke test only, not expected to converge
)

# Get the result
result = dfm.result

print(f"✓ Model training complete!")
print(f"  - Converged: {result.converged}")
print(f"  - Iterations: {result.num_iter}")
print(f"  - Final log-likelihood: {result.loglik:.2f}")
print(f"  - Number of factors: {result.Z.shape[1]}")
print(f"  - Loading matrix shape: {result.C.shape}")
print(f"  - Transition matrix shape: {result.A.shape}")


In [None]:
# Model fit statistics
if result.rmse is not None:
    print(f"\nModel fit:")
    print(f"  - Overall RMSE: {result.rmse:.4f}")
    if result.rmse_std is not None:
        print(f"  - Standardized RMSE: {result.rmse_std:.4f}")
    
    print(f"\n  - RMSE per series (top 5 worst):")
    rmse_per_series = result.rmse_per_series
    top_5_idx = np.argsort(rmse_per_series)[-5:][::-1]  # Top 5 worst
    for idx in top_5_idx:
        series_id = config.series[idx].series_id
        print(f"    {series_id}: {rmse_per_series[idx]:.4f}")


In [None]:
# Visualize the common factor (first factor)
common_factor = result.Z[:, 0]

plt.figure(figsize=(12, 4))
plt.plot(Time, common_factor, linewidth=2, label='Common Factor', color='blue')
plt.title('Common Factor (First Factor)', fontsize=14, fontweight='bold')
plt.xlabel('Date', fontsize=12)
plt.ylabel('Factor Value', fontsize=12)
plt.grid(True, alpha=0.3)
plt.legend(fontsize=11)
plt.tight_layout()
plt.savefig('outputs/common_factor.png', dpi=150, bbox_inches='tight')
print("✓ Saved common factor plot: outputs/common_factor.png")
plt.show()


In [None]:
# Save the trained model
output_dir = Path('outputs')
output_dir.mkdir(exist_ok=True)

model_path = output_dir / 'trained_model.pkl'
with open(model_path, 'wb') as f:
    pickle.dump({
        'result': result,
        'config': config,
        'Time': Time,
        'X_original': Z,  # Original data
        'X_transformed': X,  # Transformed data
    }, f)

print(f"✓ Model saved: {model_path}")


## 3. Inference and Forecasting

Once the model is trained, we can forecast future values of:
- **Factors**: Latent factors (Z)
- **Series**: Observed time series (X)

### Forecasting Method

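Forecasts iterate the transition equation forward and then map the factors to the series through the observation equation, with future shocks set to their expected value of zero:
- $Z_{t+h} = A^h Z_t$ (factor forecast)
- $X_{t+h} = C Z_{t+h}$ (series forecast in standardized units, then unstandardized with the stored mean and scale)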
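
The two equations can be checked numerically on toy matrices: applying $A$ once per step gives the same factor forecast as the closed form $A^h Z_t$. All values below are made up for illustration, and `Mx` and `Wx` stand in for the per-series mean and scale used to undo standardization.

```python
import numpy as np

# Toy parameters, chosen for illustration only
A = np.array([[0.9, 0.1],
              [0.0, 0.5]])           # transition matrix
C = np.array([[1.0, 0.0],
              [0.5, 1.0],
              [0.2, 0.3]])           # loadings (3 series, 2 factors)
Z_t = np.array([1.0, -2.0])          # last factor estimate
Mx = np.array([10.0, 0.0, -1.0])     # per-series means (assumed)
Wx = np.array([2.0, 1.0, 0.5])       # per-series standard deviations (assumed)

h = 4

# Recursive forecast: apply A once per step
Z_rec = Z_t.copy()
for _ in range(h):
    Z_rec = A @ Z_rec

# Closed form: Z_{t+h} = A^h Z_t
Z_closed = np.linalg.matrix_power(A, h) @ Z_t
assert np.allclose(Z_rec, Z_closed)

# Observation equation, then undo the standardization
X_std = C @ Z_closed
X_h = X_std * Wx + Mx
print(X_h)
```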


In [None]:
def forecast_factors(result: DFMResult, horizon: int):
    """Forecast factors for a given horizon.
    
    Parameters
    ----------
    result : DFMResult
        Trained model result
    horizon : int
        Number of periods to forecast
        
    Returns
    -------
    Z_forecast : np.ndarray
        Forecasted factors (horizon × n_factors)
    """
    A = result.A
    Z_last = result.Z[-1, :]  # Last estimated factor value (end of sample)
    
    # Forecast factors forward
    Z_forecast = np.zeros((horizon, Z_last.shape[0]))
    Z_forecast[0, :] = A @ Z_last
    
    for h in range(1, horizon):
        Z_forecast[h, :] = A @ Z_forecast[h-1, :]
    
    return Z_forecast

def forecast_series(result: DFMResult, horizon: int):
    """Forecast observed series using factor forecasts.
    
    Parameters
    ----------
    result : DFMResult
        Trained model result
    horizon : int
        Number of periods to forecast
        
    Returns
    -------
    X_forecast : np.ndarray
        Forecasted series (horizon × n_series), unstandardized (scale of the transformed data)
    Z_forecast : np.ndarray
        Forecasted factors (horizon × n_factors)
    """
    Z_forecast = forecast_factors(result, horizon)
    
    # Project factors to series space
    X_forecast = Z_forecast @ result.C.T
    
    # Unstandardize (undo the mean/scale normalization applied before estimation)
    X_forecast_unstd = X_forecast * result.Wx + result.Mx
    
    return X_forecast_unstd, Z_forecast


In [None]:
# Perform forecasting
forecast_horizon = 12  # 12 months ahead

X_forecast, Z_forecast = forecast_series(result, forecast_horizon)

print(f"✓ Forecast complete")
print(f"  - Forecast horizon: {forecast_horizon} periods")
print(f"  - Forecasted series shape: {X_forecast.shape}")
print(f"  - Forecasted factors shape: {Z_forecast.shape}")


In [None]:
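# Visualize factor forecast
# Build forecast dates by adding calendar months to the last sample date.
# This assumes a monthly clock; it avoids the drift of a fixed 30-day step
# and the deprecated 'M' frequency alias.
last_date = pd.Timestamp(Time[-1])
forecast_dates = pd.DatetimeIndex(
    [last_date + pd.DateOffset(months=h) for h in range(1, forecast_horizon + 1)]
)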

plt.figure(figsize=(14, 5))
plt.plot(Time, result.Z[:, 0], 'b-', linewidth=2, label='Past Factor', alpha=0.8)
plt.plot(forecast_dates, Z_forecast[:, 0], 'r--', linewidth=2, label='Forecast Factor', alpha=0.8)
plt.axvline(x=Time[-1], color='gray', linestyle=':', linewidth=1.5, label='Forecast Start')
plt.title('Common Factor: Past and Forecast', fontsize=14, fontweight='bold')
plt.xlabel('Date', fontsize=12)
plt.ylabel('Factor Value', fontsize=12)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('outputs/factor_forecast.png', dpi=150, bbox_inches='tight')
print("✓ Saved factor forecast plot: outputs/factor_forecast.png")
plt.show()


In [None]:
# Visualize forecast for a specific series
series_idx = 0  # First series
series_id = config.series[series_idx].series_id
series_name = config.series[series_idx].series_name

plt.figure(figsize=(14, 5))
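# Plot the transformed data X (not the untransformed Z) so all lines share the forecast's scale
plt.plot(Time, X[:, series_idx], 'b-', linewidth=2, label='Observed (transformed)', alpha=0.7)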
plt.plot(Time, result.X_sm[:, series_idx], 'g-', linewidth=1.5, label='Smoothed', alpha=0.7)
plt.plot(forecast_dates, X_forecast[:, series_idx], 'r--', linewidth=2, label='Forecast', alpha=0.8)
plt.axvline(x=Time[-1], color='gray', linestyle=':', linewidth=1.5, label='Forecast Start')
plt.title(f'{series_id}: {series_name[:50]}...', fontsize=14, fontweight='bold')
plt.xlabel('Date', fontsize=12)
plt.ylabel('Value', fontsize=12)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('outputs/series_forecast.png', dpi=150, bbox_inches='tight')
print("✓ Saved series forecast plot: outputs/series_forecast.png")
plt.show()


In [None]:
# Save forecasts to CSV
series_ids = config.get_series_ids()
forecast_df = pd.DataFrame(
    X_forecast,
    index=forecast_dates,
    columns=series_ids
)

forecast_path = output_dir / 'forecasts.csv'
forecast_df.to_csv(forecast_path)
print(f"✓ Saved forecast CSV: {forecast_path}")

# Display sample
print(f"\nForecast sample (first 5 series, first 3 periods):")
print(forecast_df.iloc[:3, :5])


## 4. Load Saved Model and Reuse

Trained models can be saved and loaded for later use, avoiding the need to retrain.
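
Pickle files can execute arbitrary code when loaded, so only unpickle files from sources you trust. If only the numeric parameters are needed, one alternative is to store them as plain arrays. The sketch below is an assumption about how that could be done with NumPy, not a feature of the DFM package; `save_arrays` and `load_arrays` are hypothetical helpers, and the toy arrays stand in for `result.A` and `result.C`.

```python
import tempfile
from pathlib import Path

import numpy as np


def save_arrays(path, **arrays):
    """Save named arrays to a compressed .npz file."""
    np.savez_compressed(path, **arrays)


def load_arrays(path):
    """Load arrays back into a plain dict; pickled objects are refused."""
    with np.load(path, allow_pickle=False) as data:
        return {name: data[name] for name in data.files}


# Toy stand-ins for result.A and result.C
A = np.array([[0.9, 0.1], [0.0, 0.5]])
C = np.ones((3, 2))

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "params.npz"
    save_arrays(path, A=A, C=C)
    restored = load_arrays(path)

print(sorted(restored))
```

This keeps the matrices needed for forecasting but not the configuration object, so it complements rather than replaces the pickle file used in this tutorial.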


In [None]:
# Load saved model
with open(model_path, 'rb') as f:
    model_data = pickle.load(f)

loaded_result = model_data['result']
loaded_config = model_data['config']
loaded_time = model_data['Time']

print(f"✓ Model loaded")
print(f"  - Training span: {loaded_time[0]} ~ {loaded_time[-1]}")
print(f"  - Number of factors: {loaded_result.Z.shape[1]}")
print(f"  - Number of series: {loaded_result.C.shape[0]}")


In [None]:
# Use loaded model for forecasting
X_forecast_loaded, Z_forecast_loaded = forecast_series(loaded_result, forecast_horizon)
print(f"✓ Forecast with loaded model complete")
print(f"  - Forecasted series shape: {X_forecast_loaded.shape}")
print(f"  - Forecasted factors shape: {Z_forecast_loaded.shape}")


## Summary

In this tutorial, we covered:

1. **Data Loading**: 
   - Loaded configuration from Hydra-style YAML files
   - Loaded and transformed time series data
   - Validated data against configuration

2. **Model Training**: 
   - Estimated factors and parameters using the DFM
   - Used the EM algorithm for parameter estimation
   - Extracted latent factors via Kalman filtering

3. **Forecasting**: 
   - Forecasted factors and series into the future
   - Visualized past and forecasted values
   - Saved forecasts to CSV

4. **Model Persistence**: 
   - Saved trained model to disk
   - Loaded and reused saved model

### Next Steps

- **News Decomposition**: Analyze how forecasts update when new data arrive
- **Hyperparameter Tuning**: Adjust `threshold`, `max_iter`, etc. for better convergence
- **Block Structure Experiments**: Try different block configurations
- **Extended Visualization**: Create more detailed factor and series plots
- **Model Diagnostics**: Use `diagnose_series()` for detailed fit analysis


In [2]:
# Additional: Access result using object-oriented methods
print("DFMResult object-oriented access:")
print(f"  - Number of series: {result.num_series()}")
print(f"  - Number of state variables: {result.num_state()}")
print(f"  - Number of factors: {result.num_factors()}")

# Convert factors to pandas DataFrame
if result.series_ids is not None:
    factors_df = result.to_pandas_factors()
    print(f"\n  - Factors DataFrame shape: {factors_df.shape}")
    print(f"  - Factors DataFrame columns: {factors_df.columns.tolist()[:5]}...")

print("\n✓ Tutorial complete!")

