# CSIRO Biomass Prediction - Kaggle Submission

## Model: 4b (Auxiliary Pretrained) - Variation A_Baseline

**Validation R²**: +0.6852

### Approach

This notebook uses a **two-phase trained model**:

**Phase 1 (Auxiliary Pretraining):**
- Trained a CNN to predict tabular features (NDVI, height, weather, location, species) from images
- Forced the model to learn visual patterns correlated with the tabular data
- Reached 88% state classification accuracy, i.e. the model learned to "see" location from the image
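
A rough sketch of how the Phase 1 objective could be assembled, written in plain Python so the arithmetic is visible: mean squared error for the regression heads (NDVI, height, weather) and cross-entropy for the classification heads (state, species), summed with per-head weights. The choice of losses, the `weights` argument, and the equal default weighting are assumptions for illustration; the actual training code is not part of this notebook. The dictionary keys mirror those returned by `forward(mode='auxiliary')` in Cell 3.

```python
import math

REGRESSION_HEADS = ('ndvi', 'height', 'weather')   # continuous targets
CLASSIFICATION_HEADS = ('state', 'species')        # integer class labels

def mse(pred, target):
    """Mean squared error over two equal-length lists of floats."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def cross_entropy(logits, label):
    """Cross-entropy for one sample: -log(softmax(logits)[label])."""
    peak = max(logits)  # subtract the max before exp for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[label]

def auxiliary_loss(outputs, targets, weights=None):
    """Weighted sum of per-head losses for one image.

    `weights` is a hypothetical knob; heads missing from it get weight 1.0.
    """
    weights = weights or {}
    per_head = {h: mse(outputs[h], targets[h]) for h in REGRESSION_HEADS}
    per_head.update({h: cross_entropy(outputs[h], targets[h]) for h in CLASSIFICATION_HEADS})
    total = sum(weights.get(h, 1.0) * loss for h, loss in per_head.items())
    return total, per_head

# One made-up image: perfect regression outputs, uninformative (all-zero) class logits
outputs = {
    'ndvi': [0.6], 'height': [12.0], 'weather': [0.0] * 14,
    'state': [0.0] * 4, 'species': [0.0] * 15,
}
targets = {
    'ndvi': [0.6], 'height': [12.0], 'weather': [0.0] * 14,
    'state': 2, 'species': 7,
}
total, per_head = auxiliary_loss(outputs, targets)
print(per_head['state'], per_head['species'], total)
```

With all-zero logits the classification terms reduce to ln 4 and ln 15, the loss of a uniform guess over 4 states and 15 species. That is the baseline against which the 88% state accuracy can be read.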

**Phase 2 (Biomass Fine-tuning):**
- Fine-tuned pretrained CNN for biomass prediction
- Leverages implicit tabular understanding from Phase 1
- **At inference: Only needs images!** (No tabular features required)
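
The comments in Cell 2 state that targets were standardized with fixed per-target means and standard deviations during training, so the biomass head outputs z-scores that must be mapped back to grams at inference. A minimal plain-Python sketch of that round trip, using the constants from Cell 2; the helper names `normalize` and `denormalize` are illustrative and do not appear in the notebook:

```python
TARGET_COLS = ['Dry_Green_g', 'Dry_Dead_g', 'Dry_Clover_g', 'GDM_g', 'Dry_Total_g']
TARGET_MEANS = [26.624722, 12.044548, 6.649692, 33.274414, 45.318097]
TARGET_STDS = [25.401232, 12.402007, 12.117761, 24.935822, 27.984015]

def normalize(grams):
    """Training direction: grams -> z-scores, one value per target."""
    return [(g - m) / s for g, m, s in zip(grams, TARGET_MEANS, TARGET_STDS)]

def denormalize(z_scores):
    """Inference direction: z-scores -> grams, clipped at 0 (no negative biomass)."""
    return [max(0.0, z * s + m) for z, m, s in zip(z_scores, TARGET_MEANS, TARGET_STDS)]

# Made-up measurements for one image, in grams
measured = [30.0, 10.0, 5.0, 35.0, 45.0]
round_trip = denormalize(normalize(measured))  # recovers `measured` up to float error

# A z-score of -1 for clover maps to about -5.47 g before clipping, so it becomes 0 g
clipped = denormalize([0.0, 0.0, -1.0, 0.0, 0.0])
for name, grams in zip(TARGET_COLS, clipped):
    print(f"{name}: {grams:.2f} g")
```

Because the clover mean (6.65 g) is smaller than its standard deviation (12.12 g), any normalized output below roughly -0.55 is clipped to 0 g, which is why the clamp in Cell 7 matters most for that target.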

### Why This Works

The auxiliary pretraining teaches the model to extract tabular information from visual cues:
- **NDVI** → Green vegetation color and density
- **Height** → Plant size and structure in image
- **Location** → Terrain, soil color, background features
- **Species** → Leaf shape, growth patterns, color
- **Weather** → Moisture levels, plant stress indicators

This "baked-in" knowledge improves biomass prediction even though we only use images at test time.

---

In [None]:
# Cell 1: Setup & Imports

import pandas as pd
import numpy as np
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torchvision.models as models
from torch.utils.data import Dataset, DataLoader
from PIL import Image
from tqdm.auto import tqdm

# Setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Set seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(42)  # seed every visible GPU, not only the current one

print("✓ Setup complete")

In [None]:
# Cell 2: Configuration

# Target columns (order matters!)
TARGET_COLS = ['Dry_Green_g', 'Dry_Dead_g', 'Dry_Clover_g', 'GDM_g', 'Dry_Total_g']

# Target normalization statistics (calculated from full training set)
# These values were used during training and must be used to denormalize predictions
TARGET_MEANS = torch.tensor([
    26.624722,  # Dry_Green_g
    12.044548,  # Dry_Dead_g
    6.649692,   # Dry_Clover_g
    33.274414,  # GDM_g
    45.318097   # Dry_Total_g
], dtype=torch.float32)

TARGET_STDS = torch.tensor([
    25.401232,  # Dry_Green_g
    12.402007,  # Dry_Dead_g
    12.117761,  # Dry_Clover_g
    24.935822,  # GDM_g
    27.984015   # Dry_Total_g
], dtype=torch.float32)

# Model configuration (A_Baseline variant)
MODEL_CONFIG = {
    'hidden_dim': 256,
    'dropout': 0.2,
    'num_outputs': 5,
    'num_states': 4,
    'num_species': 15
}

# Inference batch size
BATCH_SIZE = 16

print("Configuration:")
print(f"  Targets: {TARGET_COLS}")
print(f"  Model: AuxiliaryPretrainedModel (A_Baseline)")
print(f"  Hidden dim: {MODEL_CONFIG['hidden_dim']}")
print(f"  Dropout: {MODEL_CONFIG['dropout']}")
print(f"  Batch size: {BATCH_SIZE}")
print("\n✓ Configuration loaded")

In [None]:
# Cell 3: Model Architecture

class AuxiliaryPretrainedModel(nn.Module):
    """Model 4b: Two-phase trained model with auxiliary pretraining.
    
    Phase 1: Trained to predict tabular features (NDVI, height, weather, state, species) from images
    Phase 2: Fine-tuned for biomass prediction
    
    At inference: Only needs image (learned implicit tabular patterns)
    """
    def __init__(self, num_outputs=5, hidden_dim=256, dropout=0.2, num_states=4, num_species=15):
        super().__init__()
        
        # Shared backbone: ResNet18. No ImageNet weights are loaded here because
        # the checkpoint loaded in Cell 4 overwrites every backbone parameter.
        self.backbone = models.resnet18(weights=None)  # torchvision >= 0.13; older versions use pretrained=False
        self.backbone = nn.Sequential(*list(self.backbone.children())[:-1])  # Remove final FC layer
        # Output: 512-dimensional feature vector
        
        # Phase 1: Auxiliary heads (not used at inference, but needed for checkpoint loading)
        self.ndvi_head = nn.Linear(512, 1)                    # Predict NDVI from image
        self.height_head = nn.Linear(512, 1)                  # Predict height from image
        self.weather_head = nn.Linear(512, 14)                # Predict 14 weather features
        self.state_head = nn.Linear(512, num_states)          # Predict location (4 states)
        self.species_head = nn.Linear(512, num_species)       # Predict species (15 types)
        
        # Phase 2: Biomass prediction head (THIS is what we use at inference!)
        self.biomass_head = nn.Sequential(
            nn.Linear(512, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, num_outputs)
        )
    
    def forward(self, x, mode='biomass'):
        """Forward pass.
        
        Args:
            x: Input image tensor [B, 3, 224, 224]
            mode: 'biomass' for inference (default), 'auxiliary' for phase 1 training
        
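        Returns:
            mode='biomass': tensor [B, 5] of biomass predictions in normalized units
                (denormalize with TARGET_MEANS / TARGET_STDS after inference)
            mode='auxiliary': dict of per-head outputs
                ('ndvi', 'height', 'weather', 'state', 'species')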
        """
        # Extract features from image
        features = self.backbone(x)  # [B, 512, 1, 1]
        features = features.flatten(1)  # [B, 512]
        
        if mode == 'auxiliary':
            # Phase 1: Predict tabular features (not used at inference)
            return {
                'ndvi': self.ndvi_head(features),
                'height': self.height_head(features),
                'weather': self.weather_head(features),
                'state': self.state_head(features),
                'species': self.species_head(features)
            }
        else:  # mode == 'biomass'
            # Phase 2: Predict biomass (THIS is inference mode!)
            return self.biomass_head(features)  # [B, 5]

print("✓ AuxiliaryPretrainedModel defined")

In [None]:
# Cell 4: Load Model Checkpoint

import os

print("Loading model checkpoint...\n")

# Create model instance
model = AuxiliaryPretrainedModel(**MODEL_CONFIG)

# Get current working directory for better path resolution
cwd = os.getcwd()
print(f"Current working directory: {cwd}")

# Try multiple checkpoint paths (local testing vs Kaggle submission)
checkpoint_paths = [
    './model4b_A_Baseline_phase2_best.pth',  # Local path (same directory)
    'model4b_A_Baseline_phase2_best.pth',    # Try without ./
    os.path.join(cwd, 'model4b_A_Baseline_phase2_best.pth'),  # Absolute path
    '../input/csiro-biomass-model-weights/model4b_A_Baseline_phase2_best.pth',  # Kaggle input (update dataset name!)
    '/kaggle/input/csiro-biomass-model-weights/model4b_A_Baseline_phase2_best.pth',  # Alternative Kaggle path
]

checkpoint_loaded = False
for path in checkpoint_paths:
    if Path(path).exists():
        print(f"Found checkpoint at: {path}")
        model.load_state_dict(torch.load(path, map_location=device))
        checkpoint_loaded = True
        break

if not checkpoint_loaded:
    print("\n❌ Could not find model checkpoint!\n")
    print("Tried paths:")
    for p in checkpoint_paths:
        exists = "✓" if Path(p).exists() else "✗"
        print(f"  {exists} {p}")
    print("\n" + "="*80)
    print("FOR KAGGLE SUBMISSION:")
    print("="*80)
    print("1. Upload 'model4b_A_Baseline_phase2_best.pth' as a Kaggle Dataset")
    print("2. Add the dataset as input to this notebook (click 'Add Data' button)")
    print("3. Update the checkpoint_paths list above with your dataset name")
    print("   Example: '../input/YOUR-DATASET-NAME/model4b_A_Baseline_phase2_best.pth'")
    print("\n" + "="*80)
    print("FOR LOCAL TESTING:")
    print("="*80)
    print("Ensure 'model4b_A_Baseline_phase2_best.pth' is in:")
    print(f"  {cwd}")
    print("="*80)
    raise FileNotFoundError("Model checkpoint not found")

# Prepare model for inference
model = model.to(device)
model.eval()

print(f"\n✓ Model loaded successfully!")
print(f"  Device: {device}")
print(f"  Mode: Inference (eval mode)")
print(f"  Parameters: {sum(p.numel() for p in model.parameters()):,}")

In [None]:
# Cell 5: Load Test Data

print("Loading test data...\n")

# Try multiple test data paths (local testing vs Kaggle submission)
test_csv_paths = [
    './competition/test.csv',  # Local path
    '../input/csiro-biomass-prediction/test.csv',  # Kaggle input path (typical)
    '/kaggle/input/csiro-biomass-prediction/test.csv',  # Alternative Kaggle path
]

test_df = None
for path in test_csv_paths:
    if Path(path).exists():
        print(f"Found test.csv at: {path}")
        test_df = pd.read_csv(path)
        base_path = str(Path(path).parent)
        break

if test_df is None:
    raise FileNotFoundError("Could not find test.csv")

print(f"\nTest data shape: {test_df.shape}")
print(f"Columns: {list(test_df.columns)}")
print(f"\nFirst few rows:")
print(test_df.head())

# Extract unique images from long format
# test.csv format: sample_id, image_path, target_name (one row per image×target combination)
test_df['full_image_path'] = test_df['image_path'].apply(lambda x: f"{base_path}/{x}")
unique_images_df = test_df[['image_path', 'full_image_path']].drop_duplicates().reset_index(drop=True)

print(f"\n✓ Found {len(unique_images_df)} unique test images")
print(f"  Total test rows: {len(test_df)} (images × targets)")
print(f"  Expected: {len(unique_images_df)} images × 5 targets = {len(unique_images_df) * 5} rows")

# Verify all images exist
missing_images = []
for path in unique_images_df['full_image_path']:
    if not Path(path).exists():
        missing_images.append(path)

if missing_images:
    print(f"\n⚠️  WARNING: {len(missing_images)} images not found:")
    for img in missing_images[:5]:
        print(f"  - {img}")
    if len(missing_images) > 5:
        print(f"  ... and {len(missing_images) - 5} more")
else:
    print(f"\n✓ All {len(unique_images_df)} test images found!")

In [None]:
# Cell 6: Create Test Dataset & DataLoader

class TestDataset(Dataset):
    """Test dataset for inference (images only, no labels)."""
    
    def __init__(self, image_paths):
        self.image_paths = image_paths
        
        # Same transforms used during training (without augmentation)
        self.transform = transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])  # ImageNet stats
        ])
    
    def __len__(self):
        return len(self.image_paths)
    
    def __getitem__(self, idx):
        img_path = self.image_paths[idx]
        img = Image.open(img_path).convert('RGB')
        img = self.transform(img)
        return img

# Create dataset and dataloader
test_dataset = TestDataset(unique_images_df['full_image_path'].tolist())
test_loader = DataLoader(
    test_dataset, 
    batch_size=BATCH_SIZE, 
    shuffle=False,  # Important: Keep order for matching predictions to images
    num_workers=0
)

print(f"✓ Test dataset created")
print(f"  Images: {len(test_dataset)}")
print(f"  Batches: {len(test_loader)}")
print(f"  Batch size: {BATCH_SIZE}")

In [None]:
# Cell 7: Generate Predictions

print("Generating predictions...\n")

all_predictions = []

with torch.no_grad():
    for batch_idx, images in enumerate(tqdm(test_loader, desc='Predicting')):
        images = images.to(device)
        
        # Forward pass (returns normalized predictions)
        outputs = model(images, mode='biomass')  # [batch_size, 5]
        
        # Denormalize to original scale (grams)
        outputs_denorm = outputs.cpu() * TARGET_STDS + TARGET_MEANS
        
        # Clip negative values to 0 (biomass cannot be negative)
        outputs_denorm = torch.clamp(outputs_denorm, min=0)
        
        all_predictions.append(outputs_denorm.numpy())

# Stack all predictions
all_predictions = np.vstack(all_predictions)  # [num_images, 5]

print(f"\n✓ Predictions generated!")
print(f"  Shape: {all_predictions.shape} (images × targets)")
print(f"\nPrediction statistics (grams):")
for i, col in enumerate(TARGET_COLS):
    print(f"  {col:15s}: min={all_predictions[:, i].min():7.2f}g, "
          f"max={all_predictions[:, i].max():7.2f}g, "
          f"mean={all_predictions[:, i].mean():7.2f}g")

In [None]:
# Cell 8: Create Submission File

import os

print("Creating submission file...\n")

# Convert predictions to long format (one row per sample_id)
submission_rows = []

for idx, img_path in enumerate(unique_images_df['image_path'].tolist()):
    # Extract image ID from path (e.g., 'test/ID1001187975.jpg' -> 'ID1001187975')
    image_id = Path(img_path).stem  # Get filename without extension
    
    # Create one row per target (5 rows per image)
    for target_idx, target_name in enumerate(TARGET_COLS):
        sample_id = f"{image_id}__{target_name}"  # Format: ImageID__TargetName
        target_value = all_predictions[idx, target_idx]
        
        submission_rows.append({
            'sample_id': sample_id,
            'target': target_value
        })

# Create DataFrame
submission = pd.DataFrame(submission_rows)

print("Submission DataFrame:")
print(submission.head(10))
print(f"\nShape: {submission.shape}")
print(f"Expected: ({len(unique_images_df) * 5}, 2)")

# Quality checks (each problem count should be 0)
print("\nQuality checks:")
problem_counts = {
    'NaN values': submission.isna().sum().sum(),
    'Infinite values': np.isinf(submission['target']).sum(),
    'Negative values': (submission['target'] < 0).sum(),
}
for name, count in problem_counts.items():
    print(f"  {name}: {count} ✓" if count == 0 else f"  ⚠️  {name}: {count}")
columns_ok = list(submission.columns) == ['sample_id', 'target']
print("  Correct columns: True ✓" if columns_ok else f"  ⚠️  Columns: {list(submission.columns)}")

# IMPORTANT: Save to current working directory for Kaggle compatibility
output_path = 'submission.csv'
submission.to_csv(output_path, index=False)

# Verify file was created
if os.path.exists(output_path):
    file_size = os.path.getsize(output_path)
    print(f"\n✅ File verified: {output_path} ({file_size:,} bytes)")
else:
    raise FileNotFoundError(f"Failed to create {output_path}")

print(f"\n{'='*80}")
print("✅ SUBMISSION FILE CREATED: submission.csv")
print(f"{'='*80}")
print(f"\nFile details:")
print(f"  Filename: submission.csv (required by Kaggle)")
print(f"  Location: {os.path.abspath(output_path)}")
print(f"  Rows: {len(submission):,}")
print(f"  Images: {len(unique_images_df)}")
print(f"  Format: Long format (sample_id, target)")
print(f"\nModel info:")
print(f"  Model: AuxiliaryPretrainedModel (4b A_Baseline)")
print(f"  Validation R²: +0.6852")
print(f"  Training: Two-phase (auxiliary pretraining + biomass fine-tuning)")
print(f"\nNext steps:")
print(f"  1. Download submission.csv from notebook output")
print(f"  2. Submit to Kaggle competition")
print(f"  3. Check leaderboard for public R² score")
print(f"  4. Compare with validation R² (+0.6852)")
print(f"\n{'='*80}")
