# Multimodal Training & Evaluation

## Goals
- Compare a tabular-only baseline (XGBoost/RandomForest) against a multimodal model (ResNet18 + MLP)
- Use a train/validation split and report RMSE and R² on the validation set
- Generate Grad-CAM overlays
- Export `outputs/submission.csv`
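
For reference, the two metrics named in the goals can be written out directly. The sketch below is a plain-Python illustration rather than the notebook's implementation, which uses scikit-learn's `mean_squared_error` and `r2_score`; the function names `rmse` and `r2` and the sample values are made up for the example.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Illustrative values only (not taken from the dataset)
y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 300.0]
print(f"RMSE: {rmse(y_true, y_pred):.2f}, R²: {r2(y_true, y_pred):.4f}")
```

RMSE is expressed in the target's units, while R² is unitless; a model that always predicts the mean of the targets scores R² = 0.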

## Outline
1. Load cleaned data and image paths
2. Fit StandardScaler on the training split's tabular features; reuse it for validation and test
3. Build PyTorch datasets/dataloaders
4. Train FusionModel (late fusion), log metrics per epoch
5. Evaluate on val set; compare with tabular-only XGBoost
6. Grad-CAM visualization on val samples
7. Predict test set and save CSV

In [None]:
# Import libraries
import sys
sys.path.append('..')

import pandas as pd
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
from xgboost import XGBRegressor
import matplotlib.pyplot as plt
from tqdm import tqdm

from src.config import cfg
from src.datasets import HousePriceDataset
from src.model import FusionModel
from src.gradcam import GradCAM
from src.utils import seed_everything

# Set seed for reproducibility
seed_everything(cfg.seed)
print(f"Using device: {cfg.device}")

## 1. Load Data

In [None]:
# Load train and test data
train_df = pd.read_excel(f"../{cfg.train_xlsx}")
test_df = pd.read_excel(f"../{cfg.test_xlsx}")

print(f"Train shape: {train_df.shape}")
print(f"Test shape: {test_df.shape}")

# Train-validation split
train_data, val_data = train_test_split(
    train_df, 
    test_size=cfg.val_split, 
    random_state=cfg.seed
)
print(f"Train samples: {len(train_data)}, Val samples: {len(val_data)}")

## 2. Tabular Baseline (XGBoost)

In [None]:
# Prepare tabular features
X_train = train_data[cfg.tab_feats].values
X_val = val_data[cfg.tab_feats].values
y_train = train_data[cfg.target].values
y_val = val_data[cfg.target].values

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

# Train XGBoost baseline
xgb_model = XGBRegressor(
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
    random_state=cfg.seed
)
xgb_model.fit(X_train_scaled, y_train)

# Evaluate
xgb_pred = xgb_model.predict(X_val_scaled)
xgb_rmse = np.sqrt(mean_squared_error(y_val, xgb_pred))
xgb_r2 = r2_score(y_val, xgb_pred)

print("XGBoost Baseline Results:")
print(f"  RMSE: ${xgb_rmse:,.2f}")
print(f"  R²: {xgb_r2:.4f}")

## 3. Multimodal Model Setup

**Note:** Before running this section, make sure you have:
1. Set your API key in the `.env` file (`MAPBOX_TOKEN` or `GOOGLE_API_KEY`)
2. Run the data fetcher to download the satellite images:
```bash
python -m src.data_fetcher
```
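
A quick way to confirm step 1 before calling the fetcher is to look for one of the two keys in `.env`. The sketch below is only an illustration under stated assumptions: it parses simple `KEY=VALUE` lines by hand, the helper name `find_api_key` is hypothetical, the `.env` file is assumed to sit in the project root, and how `src.data_fetcher` actually loads the file is not shown in this notebook.

```python
from pathlib import Path

# Key names come from the note above; everything else here is illustrative.
API_KEYS = ("MAPBOX_TOKEN", "GOOGLE_API_KEY")

def find_api_key(env_path):
    """Return the name of the first API key with a non-empty value, or None.

    Hypothetical helper: handles only plain KEY=VALUE lines and skips
    blank lines and '#' comments.
    """
    path = Path(env_path)
    if not path.is_file():
        return None
    values = {}
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    for name in API_KEYS:
        if values.get(name):
            return name
    return None

found = find_api_key("../.env")  # assumed location: project root
print(f"API key found: {found}" if found else "No API key set in .env")
```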

In [None]:
# Check if satellite images exist
import os
image_dir = f"../{cfg.image_dir}"

if os.path.exists(image_dir):
    images = [f for f in os.listdir(image_dir) if f.endswith('.png')]
    print(f"Found {len(images)} satellite images in {image_dir}")
else:
    print(f"WARNING: Image directory not found: {image_dir}")
    print("Please run: python -m src.data_fetcher")
    print("Make sure your API key is set in the .env file")

## 4. Train Multimodal Model

Run the full training pipeline using the `train.py` script:
```bash
python -m src.train
```

Or continue below to train interactively:

In [None]:
# Create datasets (only runs if images exist)
if os.path.isdir(image_dir) and any(f.endswith('.png') for f in os.listdir(image_dir)):
    # Use the notebook-relative image_dir (with the "../" prefix) that was checked above
    train_dataset = HousePriceDataset(train_data, scaler, image_dir, is_train=True)
    val_dataset = HousePriceDataset(val_data, scaler, image_dir, is_train=False)
    
    train_loader = DataLoader(train_dataset, batch_size=cfg.batch_size, shuffle=True, num_workers=0)
    val_loader = DataLoader(val_dataset, batch_size=cfg.batch_size, shuffle=False, num_workers=0)
    
    print(f"Train dataset: {len(train_dataset)} samples")
    print(f"Val dataset: {len(val_dataset)} samples")
else:
    print("Skipping dataset creation - no images found")

## 5. Results Summary

After training completes, you'll find:
- **Model checkpoint:** `models/best_model.pt`
- **Predictions:** `outputs/submission.csv`
- **Grad-CAM visualizations:** `outputs/gradcam/`
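
To confirm that a run produced everything listed above, the paths can be checked directly. This is a minimal sketch, assuming the paths are relative to the project root (one level above this notebook, as with the data paths); `check_artifacts` is a hypothetical helper, not part of `src`.

```python
from pathlib import Path

# Paths as listed above, relative to an assumed project root
EXPECTED_ARTIFACTS = [
    "models/best_model.pt",
    "outputs/submission.csv",
    "outputs/gradcam",
]

def check_artifacts(root):
    """Map each expected artifact path to True if it exists under root."""
    root = Path(root)
    return {rel: (root / rel).exists() for rel in EXPECTED_ARTIFACTS}

for rel, ok in check_artifacts("..").items():
    print(f"{'found  ' if ok else 'missing'}  {rel}")
```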

In [None]:
# Display results comparison
print("=" * 50)
print("MODEL COMPARISON")
print("=" * 50)
print(f"\n{'Model':<25} {'RMSE':>12} {'R²':>10}")
print("-" * 50)
print(f"{'XGBoost (Tabular Only)':<25} ${xgb_rmse:>11,.0f} {xgb_r2:>10.4f}")
print("\n(Multimodal results will appear after training)")
print("=" * 50)
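
Once validation metrics for the multimodal model are available, the same table can carry both rows. The sketch below assumes the metrics have been collected into a dict; `format_comparison` and the layout of `results` are hypothetical, and the numbers are placeholders for illustration, not measured results.

```python
def format_comparison(results):
    """Build comparison-table lines from {model_name: (rmse, r2)}."""
    lines = [f"{'Model':<25} {'RMSE':>12} {'R²':>10}", "-" * 50]
    for name, (rmse, r2) in results.items():
        # "$" plus an 11-wide number matches the 12-wide RMSE header column
        lines.append(f"{name:<25} ${rmse:>11,.0f} {r2:>10.4f}")
    return lines

# Placeholder numbers for illustration only, not measured results
results = {
    "XGBoost (Tabular Only)": (100000.0, 0.5),
    "Multimodal (Fusion)": (100000.0, 0.5),
}
print("\n".join(format_comparison(results)))
```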