# 📊 PCA-Based Visual Forecast Dataset Generator

This notebook/script creates a training dataset for a CNN-based macroeconomic climate forecasting model.

## 🎯 Objective:
To generate realistic, interpretable training pairs for a neural network that learns to forecast future macroeconomic trends — not as numeric values, but as **plausible visual trajectories**.

## 🧠 How It Works:
1. **Input Dataset**: Uses `selected_features_83_w_rare_events.csv`, containing 83 macro/market features with monthly timestamps.
2. **PCA Transformation**: Reduces these features into the top 15 principal components (PC0–PC14), preserving interpretable latent structure.
3. **Windowing**:
   - Input: 15-year rolling window (180 months) of PCA components
   - Output: Corresponding next 2 years (24 months) and 5 years (60 months)
4. **Graph Imaging**:
   - Each input/output pair is visualized as a line plot showing all 15 PCs
   - Saved as PNGs for use as training data for a CNN
5. **Metadata Export**: A `pairs_metadata.csv` file logs the date ranges and paths for each image pair.

## 📦 Output Folder Structure:

In [3]:
# pca_graph_pairs/
# ├── inputs/             ← Graphs of past 15-year PCA sequences (PC0–PC14, legended)
# ├── outputs_2yr/        ← Graphs of next 2-year PCA projections (ground truth)
# ├── outputs_5yr/        ← Graphs of next 5-year PCA projections (ground truth)
# └── pairs_metadata.csv  ← Table with image paths, forecast windows, and actual date ranges

## 💡 Next Step:
Train a CNN to take these input images and learn to paint plausible future macroeconomic trajectories — enabling "visual latent forecasting" of economic climate drift.

In [1]:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from pathlib import Path
from matplotlib.cm import get_cmap
import gc
import csv

# --- Step 1: Load dataset ---
df = pd.read_csv("selected_features_83_w_rare_events.csv", parse_dates=["date"])
df = df.sort_values("date")

In [2]:
# --- Step 2: Prepare features for PCA ---
flag_cols = [col for col in df.columns if "flag" in col.lower()]
macro_cols = [col for col in df.select_dtypes(include=[float, int]).columns if col not in flag_cols]
X = df[macro_cols]
X_scaled = StandardScaler().fit_transform(X)

In [5]:
# --- Step 3: Run PCA and store top 15 components ---
pca = PCA(n_components=15)
pca_data = pca.fit_transform(X_scaled)
pc_cols = [f"PC{i}" for i in range(15)]
df_pca = pd.DataFrame(pca_data, columns=pc_cols)
df_pca["date"] = df["date"].values

In [7]:
# --- Step 4: Setup output folders ---
output_root = Path("pca_graph_pairs")
(output_root / "inputs").mkdir(parents=True, exist_ok=True)
(output_root / "outputs_2yr").mkdir(exist_ok=True)
(output_root / "outputs_5yr").mkdir(exist_ok=True)

In [9]:
def generate_pca_image_batch(
    df_pca,
    output_root,
    pc_cols,
    start_idx=0,
    max_pairs=200,
    window_years=15,
    forecast_years=(2, 5),
    freq_per_year=12
):
    """
    Generates PCA input/output image pairs in batches for CNN training.

    Parameters:
    - df_pca: DataFrame with date + PC0–PC14
    - output_root: Path to base directory for image storage
    - pc_cols: list of PCA column names (e.g., PC0–PC14)
    - start_idx: starting index into df_pca
    - max_pairs: max total (input → output) image pairs per call
    - window_years: years in each input chunk
    - forecast_years: list of forecast durations (e.g., [2, 5])
    - freq_per_year: default 12 (monthly data)
    """
    colors = get_cmap("tab20").colors[:len(pc_cols)]
    input_len = window_years * freq_per_year
    pairs = []

    total_generated = 0
    for i in range(start_idx, len(df_pca) - input_len - max(forecast_years) * freq_per_year):
        input_chunk = df_pca.iloc[i : i + input_len]

        for f_yrs in forecast_years:
            forecast_len = f_yrs * freq_per_year
            output_chunk = df_pca.iloc[i + input_len : i + input_len + forecast_len]

            # File naming based on date
            start = input_chunk["date"].iloc[0].strftime("%Y%m")
            end = output_chunk["date"].iloc[-1].strftime("%Y%m")
            base_name = f"chunk_{start}_to_{end}_{f_yrs}yr"

            input_path = Path(output_root) / "inputs" / f"{base_name}.png"
            output_path = Path(output_root) / f"outputs_{f_yrs}yr" / f"{base_name}.png"

            # Plot input
            plt.figure(figsize=(6, 3))
            for j, col in enumerate(pc_cols):
                plt.plot(input_chunk["date"], input_chunk[col], label=col, color=colors[j])
            plt.title(f"Input: {input_chunk['date'].iloc[0].year}–{input_chunk['date'].iloc[-1].year}")
            plt.legend(loc="upper left", fontsize=6, ncol=3)
            plt.xticks([]); plt.yticks([]); plt.tight_layout()
            plt.savefig(input_path, dpi=100)
            plt.close()

            # Plot output
            plt.figure(figsize=(6, 3))
            for j, col in enumerate(pc_cols):
                plt.plot(output_chunk["date"], output_chunk[col], label=col, color=colors[j])
            plt.title(f"Output: {output_chunk['date'].iloc[0].year}–{output_chunk['date'].iloc[-1].year} ({f_yrs}yr)")
            plt.legend(loc="upper left", fontsize=6, ncol=3)
            plt.xticks([]); plt.yticks([]); plt.tight_layout()
            plt.savefig(output_path, dpi=100)
            plt.close()

            # Save metadata
            pairs.append({
                "input_img": str(input_path),
                "output_img": str(output_path),
                "forecast_years": f_yrs,
                "input_start": input_chunk["date"].iloc[0],
                "input_end": input_chunk["date"].iloc[-1],
                "output_start": output_chunk["date"].iloc[0],
                "output_end": output_chunk["date"].iloc[-1],
            })

            total_generated += 1
            if total_generated >= max_pairs:
                gc.collect()
                print(f"✅ Generated {total_generated} pairs (from index {start_idx})")
                return pairs, i + 1  # return new start index

    gc.collect()
    print(f"✅ Completed entire pass from index {start_idx}")
    return pairs, len(df_pca)

In [11]:
# # --- Setup paths ---
# output_root = Path("pca_graph_pairs")
# (output_root / "inputs").mkdir(parents=True, exist_ok=True)
# (output_root / "outputs_2yr").mkdir(exist_ok=True)
# (output_root / "outputs_5yr").mkdir(exist_ok=True)
# meta_path = output_root / "pairs_metadata.csv"

# # --- Initialize PCA columns ---
# pc_cols = [f"PC{i}" for i in range(15)]

# # --- Initialize metadata file (only if first run) ---
# if not meta_path.exists():
#     with open(meta_path, "w", newline="") as f:
#         writer = csv.DictWriter(f, fieldnames=[
#             "input_img", "output_img", "forecast_years",
#             "input_start", "input_end", "output_start", "output_end"
#         ])
#         writer.writeheader()

# # --- Generate in safe batches ---
# start_idx = 0
# while start_idx < len(df_pca):
#     batch, start_idx = generate_pca_image_batch(
#         df_pca=df_pca,
#         output_root=output_root,
#         pc_cols=pc_cols,
#         start_idx=start_idx,
#         max_pairs=50  # 👈 smaller is safer (reduce to 25 if needed)
#     )

#     if len(batch) == 0:
#         print("🛑 No more valid image pairs to generate.")
#         break

#     # Append this batch directly to the CSV file
#     with open(meta_path, "a", newline="") as f:
#         writer = csv.DictWriter(f, fieldnames=batch[0].keys())
#         writer.writerows(batch)

#     print(f"✅ Appended {len(batch)} pairs. Current index: {start_idx}")

#     # 💣 Free up memory before next round
#     del batch
#     gc.collect()

In [13]:
import gc
import csv
import matplotlib
matplotlib.use("Agg")  # Headless, fast plotting backend

from pathlib import Path
from tqdm import tqdm

# --- Setup paths ---
output_root = Path("pca_graph_pairs")
(output_root / "inputs").mkdir(parents=True, exist_ok=True)
(output_root / "outputs_2yr").mkdir(exist_ok=True)
(output_root / "outputs_5yr").mkdir(exist_ok=True)
meta_path = output_root / "pairs_metadata.csv"

# --- Initialize PCA columns ---
pc_cols = [f"PC{i}" for i in range(15)]

# --- Initialize metadata file (only if first run) ---
if not meta_path.exists():
    with open(meta_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=[
            "input_img", "output_img", "forecast_years",
            "input_start", "input_end", "output_start", "output_end"
        ])
        writer.writeheader()

# --- Generate in safe, memory-lean batches ---
start_idx = 0
while start_idx < len(df_pca):
    try:
        # Generate one safe batch
        batch, start_idx = generate_pca_image_batch(
            df_pca=df_pca,
            output_root=output_root,
            pc_cols=pc_cols,
            start_idx=start_idx,
            max_pairs=25  # 🔥 Lower for safer memory footprint
        )

        # No more images to generate
        if len(batch) == 0:
            print("🛑 No more valid image pairs to generate.")
            break

        # Append metadata to CSV immediately
        with open(meta_path, "a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=batch[0].keys())
            writer.writerows(batch)
            f.flush()  # Ensure metadata is written immediately

        print(f"✅ Appended {len(batch)} pairs. Current index: {start_idx}")

        # 💣 Clear memory after each batch
        del batch
        gc.collect()

    except Exception as e:
        print(f"❌ Error at index {start_idx}: {e}")
        gc.collect()
        break

  colors = get_cmap("tab20").colors[:len(pc_cols)]


✅ Generated 25 pairs (from index 0)
✅ Appended 25 pairs. Current index: 13
✅ Generated 25 pairs (from index 13)
✅ Appended 25 pairs. Current index: 26
✅ Generated 25 pairs (from index 26)
✅ Appended 25 pairs. Current index: 39
✅ Generated 25 pairs (from index 39)
✅ Appended 25 pairs. Current index: 52
✅ Generated 25 pairs (from index 52)
✅ Appended 25 pairs. Current index: 65
✅ Generated 25 pairs (from index 65)
✅ Appended 25 pairs. Current index: 78
✅ Generated 25 pairs (from index 78)
✅ Appended 25 pairs. Current index: 91
✅ Generated 25 pairs (from index 91)
✅ Appended 25 pairs. Current index: 104
✅ Generated 25 pairs (from index 104)
✅ Appended 25 pairs. Current index: 117
✅ Generated 25 pairs (from index 117)
✅ Appended 25 pairs. Current index: 130
✅ Generated 25 pairs (from index 130)
✅ Appended 25 pairs. Current index: 143
✅ Generated 25 pairs (from index 143)
✅ Appended 25 pairs. Current index: 156
✅ Generated 25 pairs (from index 156)
✅ Appended 25 pairs. Current index: 169
✅ 

In [None]:
# --- Step 6: Save metadata file ---
pairs_df = pd.DataFrame(all_pairs)
pairs_df.to_csv(output_root / "pairs_metadata.csv", index=False)
print(f"✅ Saved {len(all_pairs)} metadata rows to:", output_root / "pairs_metadata.csv")