# CSIRO Biomass Prediction - Universal Features K-Fold Ensemble

## Model: 5-Fold CV with Universal Features + Species

### Why This Approach?

**Previous K-Fold ensemble** scored **0.50** on Kaggle (worse than baseline 0.51!).

**Key insight from the user in 22nd place on the leaderboard:**
> "Test dataset uses locations that aren't in the training dataset"

### The Problem with Previous Approach

❌ **State classification** (NSW/Tas/Vic/WA) → Useless for unseen test locations!  
❌ **Weather features** (rainfall, temp, ET0) → Location-specific climate patterns  

Models learned to recognize **training locations**, not general biomass patterns.
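
One way to check whether a model is memorizing locations is to hold out whole locations during validation, so that every validation fold mimics the unseen-location test set. The sketch below is only an illustration: the `locations` labels and the `leave_one_location_out` helper are hypothetical, and this notebook does not show how the five training folds were actually built.

```python
from collections import defaultdict

def leave_one_location_out(locations):
    """Yield (held_out, train_idx, val_idx) with one whole location held out per fold."""
    by_location = defaultdict(list)
    for idx, loc in enumerate(locations):
        by_location[loc].append(idx)
    for held_out in sorted(by_location):
        val_idx = by_location[held_out]
        train_idx = sorted(
            i for loc, idxs in by_location.items() if loc != held_out for i in idxs
        )
        yield held_out, train_idx, val_idx

# Hypothetical location labels for eight training images
locations = ['NSW', 'NSW', 'Tas', 'Vic', 'Vic', 'WA', 'WA', 'WA']
folds = list(leave_one_location_out(locations))
for held_out, train_idx, val_idx in folds:
    print(f"hold out {held_out}: train={train_idx} val={val_idx}")
```

If the validation score under such a split is much lower than under a random split, that gap is a hint that the model leans on location-specific cues.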

### New Strategy: Universal Features + Species

✅ **NDVI** - Vegetation density (universal)  
✅ **Height** - Plant height (universal)  
✅ **Season/Daylength** - Calendar-based (universal)  
✅ **Species** - Plant type (universal: ryegrass biomass is assumed to be similar everywhere)  

❌ **Removed:** State, Weather (location-specific)

### Expected Improvement

- **Previous K-Fold**: 0.50 (learned location bias)
- **This approach**: **0.52-0.54** (should generalize to new locations!)

---

In [None]:
# Cell 1: Setup & Imports

import pandas as pd
import numpy as np
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torchvision.models as models
from torch.utils.data import Dataset, DataLoader
from PIL import Image
from tqdm.auto import tqdm

# Setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Set seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed(42)

print("✓ Setup complete")

In [None]:
# Cell 2: Configuration

# Target columns (order matters!)
TARGET_COLS = ['Dry_Green_g', 'Dry_Dead_g', 'Dry_Clover_g', 'GDM_g', 'Dry_Total_g']

# Target normalization statistics (calculated from FULL training set - 357 images)
# These stats are CONSISTENT across all 5 folds
TARGET_MEANS = torch.tensor([
    26.624722,  # Dry_Green_g
    12.044548,  # Dry_Dead_g
    6.649692,   # Dry_Clover_g
    33.274414,  # GDM_g
    45.318097   # Dry_Total_g
], dtype=torch.float32)

TARGET_STDS = torch.tensor([
    25.401232,  # Dry_Green_g
    12.402007,  # Dry_Dead_g
    12.117761,  # Dry_Clover_g
    24.935822,  # GDM_g
    27.984015   # Dry_Total_g
], dtype=torch.float32)

# Inference batch size
BATCH_SIZE = 16
NUM_FOLDS = 5
NUM_SPECIES = 15  # Unique plant species

print("Configuration:")
print(f"  Model: 5-Fold CV Ensemble (Universal Features + Species)")
print(f"  Number of models: {NUM_FOLDS}")
print(f"  Auxiliary tasks: NDVI, Height, Daylength, Season, Species (15 classes)")
print(f"  Removed: State, Weather (location-specific!)")
print(f"  Targets: {TARGET_COLS}")
print(f"  Batch size: {BATCH_SIZE}")
print(f"\n✓ Configuration loaded")
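
The statistics in the configuration cell are only meaningful together with the transform they belong to. The sketch below assumes the targets were standardized as `(y - mean) / std` during training, which is what the denormalization step in Cell 7 implies; the `normalize` and `denormalize` helpers and the example measurements are illustrative, not part of the original pipeline.

```python
import numpy as np

# Per-target statistics copied from the configuration cell
means = np.array([26.624722, 12.044548, 6.649692, 33.274414, 45.318097])
stds = np.array([25.401232, 12.402007, 12.117761, 24.935822, 27.984015])

def normalize(y):
    """Map grams to z-scores (assumed training-time transform)."""
    return (y - means) / stds

def denormalize(z):
    """Map z-scores back to grams (the inference-time transform)."""
    return z * stds + means

# Hypothetical biomass measurements (grams) for one image
y = np.array([30.0, 10.0, 0.0, 30.0, 40.0])
z = normalize(y)
y_back = denormalize(z)

print(np.round(z, 3))
print(np.allclose(y, y_back))
```

Note that 0 g of clover corresponds to a z-score of roughly -0.55, so any raw model output below that value denormalizes to a negative mass. That is why the inference cell clips predictions at zero.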

In [None]:
# Cell 3: Model Architecture

class UniversalAuxiliaryModel(nn.Module):
    """ResNet18 with UNIVERSAL auxiliary tasks + Species classification.
    
    Training approach:
    - Phase 1: Auxiliary pretraining (predict NDVI, height, daylength, season, species from images)
    - Phase 2: Biomass fine-tuning (predict 5 biomass targets)
    
    Key difference from previous: Removed State and Weather (location-specific)
    Why Species is kept: Same species has similar biomass anywhere (universal!)
    
    This model was trained 5 times with different train/val splits (K-Fold CV).
    """
    def __init__(self, num_outputs=5, hidden_dim=256, dropout=0.2, num_species=15):
        super().__init__()
        # ResNet18 backbone (weights=None means no pretrained ImageNet weights)
        model = models.resnet18(weights=None)  # No download needed!
        self.backbone = nn.Sequential(*list(model.children())[:-1])
        feature_dim = 512
        
        # UNIVERSAL auxiliary heads (required for loading checkpoint, not used in inference)
        self.ndvi_head = nn.Linear(feature_dim, 1)
        self.height_head = nn.Linear(feature_dim, 1)
        self.daylength_head = nn.Linear(feature_dim, 1)
        self.season_head = nn.Linear(feature_dim, 1)
        self.species_head = nn.Linear(feature_dim, num_species)  # 15 species classes
        
        # NO State/Weather heads! (location-specific, removed)
        
        # Biomass head (this is what we use for predictions)
        self.biomass_head = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, num_outputs)
        )
    
    def forward(self, x, mode='biomass'):
        features = self.backbone(x).flatten(1)
        if mode == 'auxiliary':
            return {
                'ndvi': self.ndvi_head(features),
                'height': self.height_head(features),
                'daylength': self.daylength_head(features),
                'season': self.season_head(features),
                'species': self.species_head(features)
            }
        else:
            return self.biomass_head(features)

print("✓ UniversalAuxiliaryModel defined")
print("  5 auxiliary heads: NDVI, Height, Daylength, Season, Species")
print("  Removed: State (4 classes), Weather (14 features)")
print("  Why: Test locations not in training → State/Weather don't generalize!")

In [None]:
# Cell 4: Load All 5 Fold Model Checkpoints

import os

print(f"Loading {NUM_FOLDS} fold models...\n")

# Get current working directory for better path resolution
cwd = os.getcwd()
print(f"Current working directory: {cwd}\n")

fold_models = []
checkpoint_loaded_count = 0

for fold_idx in range(1, NUM_FOLDS + 1):
    checkpoint_name = f'universal_Fold{fold_idx}_best.pth'
    
    # Try multiple checkpoint paths (local testing vs Kaggle submission)
    checkpoint_paths = [
        f'./{checkpoint_name}',  # Local path (same directory)
        checkpoint_name,  # Try without ./
        os.path.join(cwd, checkpoint_name),  # Absolute path
        f'../input/csiro-biomass-universal-kfold/{checkpoint_name}',  # Kaggle input
        f'/kaggle/input/csiro-biomass-universal-kfold/{checkpoint_name}',  # Alternative Kaggle path
    ]
    
    model = UniversalAuxiliaryModel(num_species=NUM_SPECIES)
    checkpoint_loaded = False
    
    for path in checkpoint_paths:
        if Path(path).exists():
            print(f"Fold {fold_idx}: Found at {path}")
            model.load_state_dict(torch.load(path, map_location=device))
            model = model.to(device)
            model.eval()
            fold_models.append(model)
            checkpoint_loaded = True
            checkpoint_loaded_count += 1
            break
    
    if not checkpoint_loaded:
        print(f"\n❌ Could not find checkpoint for Fold {fold_idx}!\n")
        print(f"Tried paths:")
        for p in checkpoint_paths:
            exists = "✓" if Path(p).exists() else "✗"
            print(f"  {exists} {p}")

if checkpoint_loaded_count != NUM_FOLDS:
    print(f"\n" + "="*80)
    print("FOR KAGGLE SUBMISSION:")
    print("="*80)
    print("1. Upload all 5 model checkpoints as a Kaggle Dataset:")
    print("   - universal_Fold1_best.pth")
    print("   - universal_Fold2_best.pth")
    print("   - universal_Fold3_best.pth")
    print("   - universal_Fold4_best.pth")
    print("   - universal_Fold5_best.pth")
    print("2. Add the dataset as input to this notebook (click 'Add Data' button)")
    print("3. Update the checkpoint_paths list above with your dataset name")
    print("   Example: '../input/YOUR-DATASET-NAME/{checkpoint_name}'")
    print("\n" + "="*80)
    print("FOR LOCAL TESTING:")
    print("="*80)
    print(f"Ensure all 5 checkpoint files are in: {cwd}")
    print("="*80)
    raise FileNotFoundError(f"Only loaded {checkpoint_loaded_count}/{NUM_FOLDS} model checkpoints")

print(f"\n✅ Successfully loaded all {len(fold_models)} fold models!")
print(f"  Device: {device}")
print(f"  Mode: Inference (eval mode)")
print(f"  Parameters per model: {sum(p.numel() for p in fold_models[0].parameters()):,}")
print(f"  Total ensemble parameters: {sum(p.numel() for p in fold_models[0].parameters()) * NUM_FOLDS:,}")
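
The checkpoint cell and the test-data cell both rely on the same pattern: try a list of candidate paths and use the first one that exists. A minimal sketch of that pattern as a reusable helper follows; `first_existing` is a hypothetical name, and the temporary directory only stands in for a local or Kaggle input folder.

```python
import tempfile
from pathlib import Path

def first_existing(candidates):
    """Return the first candidate path that exists, or None if none do."""
    for candidate in candidates:
        if Path(candidate).exists():
            return Path(candidate)
    return None

# Demonstration with a temporary directory standing in for an input folder
with tempfile.TemporaryDirectory() as tmp:
    real = Path(tmp) / 'universal_Fold1_best.pth'
    real.write_bytes(b'')  # empty placeholder, not a real checkpoint
    found = first_existing([Path(tmp) / 'missing' / real.name, real])
    missing = first_existing([Path(tmp) / 'nope.pth'])
    found_name = found.name

print(found_name, missing)
```

Returning `None` instead of raising keeps the decision about how to report a missing file with the caller, which matches how the cell above collects a count and raises once at the end.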

In [None]:
# Cell 5: Load Test Data

print("Loading test data...\n")

# Try multiple test data paths (local testing vs Kaggle submission)
test_csv_paths = [
    './competition/test.csv',  # Local testing path
    '/kaggle/input/csiro-biomass/test.csv',  # Correct Kaggle path
    '../input/csiro-biomass/test.csv',  # Alternative Kaggle format
]

test_df = None
for path in test_csv_paths:
    if Path(path).exists():
        print(f"Found test.csv at: {path}")
        test_df = pd.read_csv(path)
        base_path = str(Path(path).parent)
        break

if test_df is None:
    raise FileNotFoundError("Could not find test.csv")

print(f"\nTest data shape: {test_df.shape}")
print(f"Columns: {list(test_df.columns)}")
print(f"\nFirst few rows:")
print(test_df.head())

# Extract unique images from long format
test_df['full_image_path'] = test_df['image_path'].apply(lambda x: f"{base_path}/{x}")
unique_images_df = test_df[['image_path', 'full_image_path']].drop_duplicates().reset_index(drop=True)

print(f"\n✓ Found {len(unique_images_df)} unique test images")
print(f"  Total test rows: {len(test_df)} (images × targets)")
print(f"  Expected: {len(unique_images_df)} images × 5 targets = {len(unique_images_df) * 5} rows")

# Verify all images exist
missing_images = []
for path in unique_images_df['full_image_path']:
    if not Path(path).exists():
        missing_images.append(path)

if missing_images:
    print(f"\n⚠️  WARNING: {len(missing_images)} images not found:")
    for img in missing_images[:5]:
        print(f"  - {img}")
    if len(missing_images) > 5:
        print(f"  ... and {len(missing_images) - 5} more")
else:
    print(f"\n✓ All {len(unique_images_df)} test images found!")

In [None]:
# Cell 6: Create Test Dataset & DataLoader

class TestDataset(Dataset):
    """Test dataset for inference (images only, no labels)."""
    
    def __init__(self, image_paths):
        self.image_paths = image_paths
        
        # Same transforms used during training (without augmentation)
        self.transform = transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])  # ImageNet stats
        ])
    
    def __len__(self):
        return len(self.image_paths)
    
    def __getitem__(self, idx):
        img_path = self.image_paths[idx]
        img = Image.open(img_path).convert('RGB')
        img = self.transform(img)
        return img

# Create dataset and dataloader
test_dataset = TestDataset(unique_images_df['full_image_path'].tolist())
test_loader = DataLoader(
    test_dataset, 
    batch_size=BATCH_SIZE, 
    shuffle=False,  # Important: Keep order for matching predictions to images
    num_workers=0
)

print(f"✓ Test dataset created")
print(f"  Images: {len(test_dataset)}")
print(f"  Batches: {len(test_loader)}")
print(f"  Batch size: {BATCH_SIZE}")

In [None]:
# Cell 7: Generate Ensemble Predictions

print(f"Generating predictions from {NUM_FOLDS} models...\n")

# Get predictions from each fold model
all_fold_predictions = []

for fold_idx, model in enumerate(fold_models, 1):
    print(f"Fold {fold_idx}/{NUM_FOLDS}...")
    fold_preds = []
    
    with torch.no_grad():
        for images in tqdm(test_loader, desc=f"  Predicting", leave=False):
            images = images.to(device)
            
            # Forward pass (returns normalized predictions)
            outputs = model(images, mode='biomass')  # [batch_size, 5]
            
            # Denormalize to original scale (grams)
            outputs_denorm = outputs.cpu() * TARGET_STDS + TARGET_MEANS
            
            # Clip negative values to 0 (biomass cannot be negative)
            outputs_denorm = torch.clamp(outputs_denorm, min=0)
            
            fold_preds.append(outputs_denorm.numpy())
    
    fold_predictions = np.vstack(fold_preds)  # [num_images, 5]
    all_fold_predictions.append(fold_predictions)
    print(f"  ✓ Shape: {fold_predictions.shape}")

# Ensemble: Average predictions across all folds
all_predictions = np.mean(all_fold_predictions, axis=0)  # [num_images, 5]

print(f"\n✅ Ensemble predictions generated!")
print(f"  Shape: {all_predictions.shape} (images × targets)")
print(f"  Method: Averaged across {NUM_FOLDS} models")
print(f"\nPrediction statistics (grams):")
for i, col in enumerate(TARGET_COLS):
    print(f"  {col:15s}: min={all_predictions[:, i].min():7.2f}g, "
          f"max={all_predictions[:, i].max():7.2f}g, "
          f"mean={all_predictions[:, i].mean():7.2f}g")
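
The averaging step in Cell 7 can be seen on a toy example. The numbers below are made up (3 folds, 2 images, 2 targets) and only illustrate the arithmetic: each fold is clipped at zero first and the clipped arrays are then averaged, as in the cell above. An alternative ordering, averaging first and clipping once, is shown for comparison.

```python
import numpy as np

# Hypothetical denormalized predictions: 3 folds, 2 images, 2 targets (grams)
fold_preds = [
    np.array([[10.0, -2.0], [30.0, 4.0]]),
    np.array([[14.0, 1.0], [28.0, 6.0]]),
    np.array([[12.0, 4.0], [32.0, 5.0]]),
]

# Clip each fold at zero, then average over the fold axis (what Cell 7 does)
clipped = [np.clip(p, 0.0, None) for p in fold_preds]
ensemble = np.mean(clipped, axis=0)

# Alternative ordering: average the raw predictions, then clip once
ensemble_alt = np.clip(np.mean(fold_preds, axis=0), 0.0, None)

print(ensemble)
print(ensemble_alt)
```

The two orderings differ only where at least one fold predicts a negative value. Clipping before averaging can never give a smaller result than clipping afterwards, so it nudges near-zero predictions slightly upward.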

In [None]:
# Cell 8: Create Submission File

import os

print("Creating submission file...\n")

# Convert predictions to long format (one row per sample_id)
submission_rows = []

for idx, img_path in enumerate(unique_images_df['image_path'].tolist()):
    # Extract image ID from path (e.g., 'test/ID1001187975.jpg' -> 'ID1001187975')
    image_id = Path(img_path).stem  # Get filename without extension
    
    # Create one row per target (5 rows per image)
    for target_idx, target_name in enumerate(TARGET_COLS):
        sample_id = f"{image_id}__{target_name}"  # Format: ImageID__TargetName
        target_value = all_predictions[idx, target_idx]
        
        submission_rows.append({
            'sample_id': sample_id,
            'target': target_value
        })

# Create DataFrame
submission = pd.DataFrame(submission_rows)

print("Submission DataFrame:")
print(submission.head(10))
print(f"\nShape: {submission.shape}")
print(f"Expected: ({len(unique_images_df) * 5}, 2)")

# Quality checks (compute each count once, then report it)
n_nan = submission.isna().sum().sum()
n_inf = np.isinf(submission['target']).sum()
n_neg = (submission['target'] < 0).sum()
columns_ok = list(submission.columns) == ['sample_id', 'target']

print(f"\nQuality checks:")
print(f"  NaN values: {n_nan} {'✓' if n_nan == 0 else '⚠️'}")
print(f"  Infinite values: {n_inf} {'✓' if n_inf == 0 else '⚠️'}")
print(f"  Negative values: {n_neg} {'✓' if n_neg == 0 else '⚠️'}")
print(f"  Correct columns: {columns_ok} {'✓' if columns_ok else '⚠️'}")

# IMPORTANT: Save to current working directory for Kaggle compatibility
output_path = 'submission.csv'
submission.to_csv(output_path, index=False)

# Verify file was created
if os.path.exists(output_path):
    file_size = os.path.getsize(output_path)
    print(f"\n✅ File verified: {output_path} ({file_size:,} bytes)")
else:
    raise FileNotFoundError(f"Failed to create {output_path}")

print(f"\n{'='*80}")
print("✅ SUBMISSION FILE CREATED: submission.csv")
print(f"{'='*80}")
print(f"\nFile details:")
print(f"  Filename: submission.csv (required by Kaggle)")
print(f"  Location: {os.path.abspath(output_path)}")
print(f"  Rows: {len(submission):,}")
print(f"  Images: {len(unique_images_df)}")
print(f"  Format: Long format (sample_id, target)")
print(f"\nModel info:")
print(f"  Approach: 5-Fold CV with Universal Features + Species")
print(f"  Architecture: ResNet18 + Auxiliary Pretraining")
print(f"  Auxiliary tasks: NDVI, Height, Daylength, Season, Species (15 classes)")
print(f"  Removed: State, Weather (location-specific!)")
print(f"\nExpected Kaggle score: 0.52-0.54")
print(f"  (Previous K-Fold: 0.50 - learned location bias)")
print(f"  (Baseline: 0.51)")
print(f"\nWhy this should work better:")
print(f"  1. Universal features generalize to new locations")
print(f"  2. Species is universal (Ryegrass biomass similar everywhere)")
print(f"  3. No State/Weather → no location bias")
print(f"  4. Ensemble averaging reduces overfitting")
print(f"\nNext steps:")
print(f"  1. Download submission.csv from notebook output")
print(f"  2. Submit to Kaggle competition")
print(f"  3. Compare with previous K-Fold (0.50) and baseline (0.51)")
print(f"  4. Expected improvement: +0.01 to +0.03")
print(f"\n{'='*80}")
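
The long format used for the submission can be sketched without pandas. The example below expands one prediction vector per image into one `(sample_id, target)` row per target, using the `ImageID__TargetName` convention from the cell above. The `to_long_rows` helper and the prediction values are illustrative only.

```python
from pathlib import Path

TARGETS = ['Dry_Green_g', 'Dry_Dead_g', 'Dry_Clover_g', 'GDM_g', 'Dry_Total_g']

def to_long_rows(image_paths, predictions):
    """Expand one prediction vector per image into one row per target."""
    rows = []
    for img_path, preds in zip(image_paths, predictions):
        image_id = Path(img_path).stem  # filename without directory or extension
        for name, value in zip(TARGETS, preds):
            rows.append({'sample_id': f"{image_id}__{name}", 'target': float(value)})
    return rows

# Hypothetical predictions for the example image ID mentioned in the cell above
rows = to_long_rows(['test/ID1001187975.jpg'], [[20.0, 5.0, 0.0, 20.0, 25.0]])
for row in rows:
    print(row)

# Splitting on the double underscore recovers both parts of a sample_id
image_id, target_name = rows[0]['sample_id'].split('__', 1)
print(image_id, target_name)
```

Because the target names themselves contain single underscores, the double underscore is what makes the `sample_id` unambiguous to split.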

In [None]:
# Cell 9: Final Verification (For Kaggle)

import os
import glob

print("\n" + "="*80)
print("FINAL VERIFICATION")
print("="*80)

# List all CSV files in current directory
csv_files = glob.glob('*.csv')
print(f"\nCSV files in current directory:")
for f in csv_files:
    size = os.path.getsize(f)
    print(f"  {f}: {size:,} bytes")

# Verify submission.csv specifically
if os.path.exists('submission.csv'):
    size = os.path.getsize('submission.csv')
    print(f"\n✅ SUCCESS! submission.csv exists ({size:,} bytes)")
    print(f"   Absolute path: {os.path.abspath('submission.csv')}")
    
    # Show first few lines
    import pandas as pd
    sub = pd.read_csv('submission.csv')
    print(f"\nFirst 10 rows:")
    print(sub.head(10))
    print(f"\nTotal rows: {len(sub)}")
    print(f"Columns: {list(sub.columns)}")
    print(f"\nTarget statistics:")
    print(f"  Min: {sub['target'].min():.2f}g")
    print(f"  Max: {sub['target'].max():.2f}g")
    print(f"  Mean: {sub['target'].mean():.2f}g")
    print(f"  Median: {sub['target'].median():.2f}g")
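else:
    print(f"\n❌ ERROR! submission.csv not found!")
    print(f"Current directory: {os.getcwd()}")
    print(f"Files in directory: {os.listdir('.')}")
    # Stop here so the "ready" banner below is only printed on success
    raise FileNotFoundError("submission.csv not found; nothing to submit")

print("\n" + "="*80)
print("🎯 READY FOR KAGGLE SUBMISSION!")
print("="*80)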
print("\nKey improvements over previous K-Fold (0.50):")
print("  ✅ Removed State classification (location-specific)")
print("  ✅ Removed Weather features (location-specific)")
print("  ✅ Kept Species (universal - same species similar everywhere!)")
print("  ✅ Kept NDVI, Height, Season, Daylength (universal)")
print("\nExpected: 0.52-0.54 (should generalize to new test locations!)")
print("="*80)