# CSIRO Biomass Prediction - Model 1 (Simple Baseline) Submission

## Model: 1 - Simple ResNet18 Baseline

**Validation R²**: +0.6352 (10 epochs)

### Why Test This Model?

Model 4b (auxiliary pretrained) achieved **R²=+0.6852** on validation but only **R²=+0.51** on the Kaggle test set.

**That is a drop of 0.175!** Possible causes:
1. **Overfitting** - Model 4b trained for 30 epochs and may have memorized the training data
2. **Complexity** - More complex models often overfit on small datasets
3. **Wrong normalization** - Used split stats instead of full dataset stats
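
For reference, the gap quoted above is the difference between two R² scores. The sketch below shows the plain coefficient of determination and reproduces the gap from the two quoted numbers. It assumes an unweighted R² over a flat list of values, which may not match the exact metric Kaggle applies, and the name `r2_score_plain` is illustrative only.

```python
def r2_score_plain(y_true, y_pred):
    """Unweighted coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Scores quoted in this notebook for Model 4b
val_r2, kaggle_r2 = 0.6852, 0.51
gap = kaggle_r2 - val_r2
print(f"Validation-to-test gap: {gap:+.3f}")

# Sanity checks on toy data: perfect predictions score 1, predicting the mean scores 0
print(r2_score_plain([1, 2, 3], [1, 2, 3]))
print(r2_score_plain([1, 2, 3], [2, 2, 2]))
```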

### Model 1 Advantages

✅ **Simpler architecture** - Plain ResNet18 with FC head (no auxiliary heads)
✅ **Less training** - Only 10 epochs (vs 30 for Model 4b)
✅ **Less overfitting** - Validation R² increased steadily without oscillating
✅ **No ColorJitter** - Avoided harmful augmentation

**Expected Kaggle score**: 0.55-0.60 (better generalization than Model 4b)

---

In [1]:
# Cell 1: Setup & Imports

import pandas as pd
import numpy as np
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torchvision.models as models
from torch.utils.data import Dataset, DataLoader
from PIL import Image
from tqdm.auto import tqdm

# Setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Set seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed(42)

print("✓ Setup complete")

Using device: cpu
✓ Setup complete


In [2]:
# Cell 2: Configuration

# Target columns (order matters!)
TARGET_COLS = ['Dry_Green_g', 'Dry_Dead_g', 'Dry_Clover_g', 'GDM_g', 'Dry_Total_g']

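# Target normalization statistics, calculated from the FULL training set (357 images).
# IMPORTANT: Model 1 was trained with train-SPLIT stats, but this notebook denormalizes
# with FULL-dataset stats for consistency with Kaggle expectations. Any difference between
# the two sets of stats shifts and rescales the denormalized predictions accordingly.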
TARGET_MEANS = torch.tensor([
    26.624722,  # Dry_Green_g
    12.044548,  # Dry_Dead_g
    6.649692,   # Dry_Clover_g
    33.274414,  # GDM_g
    45.318097   # Dry_Total_g
], dtype=torch.float32)

TARGET_STDS = torch.tensor([
    25.401232,  # Dry_Green_g
    12.402007,  # Dry_Dead_g
    12.117761,  # Dry_Clover_g
    24.935822,  # GDM_g
    27.984015   # Dry_Total_g
], dtype=torch.float32)

# Inference batch size
BATCH_SIZE = 16

print("Configuration:")
print(f"  Model: Simple ResNet18 Baseline")
print(f"  Validation R²: +0.6352")
print(f"  Training: 10 epochs")
print(f"  Targets: {TARGET_COLS}")
print(f"  Batch size: {BATCH_SIZE}")
print("\n✓ Configuration loaded")

Configuration:
  Model: Simple ResNet18 Baseline
  Validation R²: +0.6352
  Training: 10 epochs
  Targets: ['Dry_Green_g', 'Dry_Dead_g', 'Dry_Clover_g', 'GDM_g', 'Dry_Total_g']
  Batch size: 16

✓ Configuration loaded
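
The configuration comment notes that the model was trained with split statistics while inference uses full-dataset statistics. The sketch below shows how such a mismatch affects a denormalized value. The full-dataset numbers are the `Dry_Total_g` entries from the cell above; the split statistics are hypothetical placeholders invented for illustration, since the real ones do not appear in this notebook.

```python
def normalize(y, mean, std):
    """Map a value in grams to normalized space."""
    return (y - mean) / std

def denormalize(z, mean, std):
    """Map a normalized value back to grams."""
    return z * std + mean

# Full-dataset statistics for Dry_Total_g, taken from the configuration cell
full_mean, full_std = 45.318097, 27.984015

# HYPOTHETICAL split statistics, invented purely for illustration
split_mean, split_std = 44.0, 27.0

true_grams = 60.0
# What a perfect model trained on split-normalized targets would output
z = normalize(true_grams, split_mean, split_std)

matched = denormalize(z, split_mean, split_std)    # same stats as training
mismatched = denormalize(z, full_mean, full_std)   # different stats at inference

print(f"matched:    {matched:.2f} g")
print(f"mismatched: {mismatched:.2f} g")
```

How much this matters depends on how far the two sets of statistics actually differ; if the split statistics are available, comparing both denormalizations on validation data would settle the question.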


In [3]:
# Cell 3: Model Architecture

class SimpleBaseline(nn.Module):
    """Simple ResNet18 baseline for normalized targets.
    
    Architecture:
    - ResNet18 backbone (pretrained on ImageNet)
    - Head: Linear(512, 256) → ReLU → Dropout(0.2) → Linear(256, 5)
    
    Total parameters: ~11.3M
    """
    def __init__(self, num_outputs=5):
        super().__init__()
        # weights=None replaces the deprecated pretrained=False (requires torchvision >= 0.13)
        self.resnet = models.resnet18(weights=None)  # We'll load trained weights
        num_features = self.resnet.fc.in_features  # 512
        
        self.resnet.fc = nn.Sequential(
            nn.Linear(num_features, 256),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, num_outputs)
        )
    
    def forward(self, x):
        return self.resnet(x)

print("✓ SimpleBaseline model defined")

✓ SimpleBaseline model defined
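
The parameter count of this architecture can be checked by hand. The sketch below assumes the stock torchvision ResNet18 total of 11,689,512 parameters (which includes its 512 → 1000 classifier) and swaps in the two-layer head defined above; `linear_params` is a helper written for this example.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

# Stock torchvision ResNet18, including its 512 -> 1000 classifier
RESNET18_TOTAL = 11_689_512
backbone = RESNET18_TOTAL - linear_params(512, 1000)

# Replacement head: Linear(512, 256) -> ReLU -> Dropout -> Linear(256, 5)
# (ReLU and Dropout have no trainable parameters)
head = linear_params(512, 256) + linear_params(256, 5)

total = backbone + head
print(f"backbone={backbone:,}  head={head:,}  total={total:,}")
# total=11,309,125
```

This agrees with the count printed when the checkpoint is loaded in the next cell.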


In [4]:
# Cell 4: Load Model Checkpoint

import os

print("Loading Model 1 checkpoint...\n")

# Create model instance
model = SimpleBaseline(num_outputs=5)

# Get current working directory for better path resolution
cwd = os.getcwd()
print(f"Current working directory: {cwd}")

# Try multiple checkpoint paths (local testing vs Kaggle submission)
checkpoint_paths = [
    './Model_1_Simple_best.pth',  # Local path (same directory)
    'Model_1_Simple_best.pth',    # Try without ./
    os.path.join(cwd, 'Model_1_Simple_best.pth'),  # Absolute path
    '../input/csiro-biomass-model1-weights/Model_1_Simple_best.pth',  # Kaggle input
    '/kaggle/input/csiro-biomass-model1-weights/Model_1_Simple_best.pth',  # Alternative Kaggle path
]

checkpoint_loaded = False
for path in checkpoint_paths:
    if Path(path).exists():
        print(f"Found checkpoint at: {path}")
        model.load_state_dict(torch.load(path, map_location=device))
        checkpoint_loaded = True
        break

if not checkpoint_loaded:
    print("\n❌ Could not find model checkpoint!\n")
    print("Tried paths:")
    for p in checkpoint_paths:
        exists = "✓" if Path(p).exists() else "✗"
        print(f"  {exists} {p}")
    print("\n" + "="*80)
    print("FOR KAGGLE SUBMISSION:")
    print("="*80)
    print("1. Upload 'Model_1_Simple_best.pth' as a Kaggle Dataset")
    print("2. Add the dataset as input to this notebook (click 'Add Data' button)")
    print("3. Update the checkpoint_paths list above with your dataset name")
    print("   Example: '../input/YOUR-DATASET-NAME/Model_1_Simple_best.pth'")
    print("\n" + "="*80)
    print("FOR LOCAL TESTING:")
    print("="*80)
    print("Ensure 'Model_1_Simple_best.pth' is in:")
    print(f"  {cwd}")
    print("="*80)
    raise FileNotFoundError("Model checkpoint not found")

# Prepare model for inference
model = model.to(device)
model.eval()

print(f"\n✓ Model loaded successfully!")
print(f"  Device: {device}")
print(f"  Mode: Inference (eval mode)")
print(f"  Parameters: {sum(p.numel() for p in model.parameters()):,}")

Loading Model 1 checkpoint...

Current working directory: /Users/tim/Code/Tim/csiro-biomass
Found checkpoint at: ./Model_1_Simple_best.pth

✓ Model loaded successfully!
  Device: cpu
  Mode: Inference (eval mode)
  Parameters: 11,309,125
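
The checkpoint search above boils down to "take the first candidate that exists". A minimal sketch of that pattern as a reusable helper follows; `first_existing` is a hypothetical name, and the demonstration uses a temporary directory rather than the real checkpoint paths.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Optional

def first_existing(candidates: Iterable[str]) -> Optional[Path]:
    """Return the first candidate that is an existing file, or None."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    return None

# Demonstration: only the second candidate exists
with tempfile.TemporaryDirectory() as tmp:
    real = Path(tmp) / "weights.pth"
    real.write_bytes(b"")
    found = first_existing([str(Path(tmp) / "missing.pth"), str(real)])
    print(found == real)

# Nothing found: the caller decides how to report the failure
print(first_existing(["/definitely/not/here.pth"]))
```

The same helper could serve the `test.csv` lookup in the next cell, which repeats the loop with a different candidate list.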


In [5]:
# Cell 5: Load Test Data

print("Loading test data...\n")

# Try multiple test data paths (local testing vs Kaggle submission)
test_csv_paths = [
    './competition/test.csv',  # Local path
    '../input/csiro-biomass-prediction/test.csv',  # Kaggle input path (typical)
    '/kaggle/input/csiro-biomass-prediction/test.csv',  # Alternative Kaggle path
]

test_df = None
for path in test_csv_paths:
    if Path(path).exists():
        print(f"Found test.csv at: {path}")
        test_df = pd.read_csv(path)
        base_path = str(Path(path).parent)
        break

if test_df is None:
    raise FileNotFoundError("Could not find test.csv")

print(f"\nTest data shape: {test_df.shape}")
print(f"Columns: {list(test_df.columns)}")
print(f"\nFirst few rows:")
print(test_df.head())

# Extract unique images from long format
# test.csv format: sample_id, image_path, target_name (one row per image×target combination)
test_df['full_image_path'] = test_df['image_path'].apply(lambda x: f"{base_path}/{x}")
unique_images_df = test_df[['image_path', 'full_image_path']].drop_duplicates().reset_index(drop=True)

print(f"\n✓ Found {len(unique_images_df)} unique test images")
print(f"  Total test rows: {len(test_df)} (images × targets)")
print(f"  Expected: {len(unique_images_df)} images × 5 targets = {len(unique_images_df) * 5} rows")

# Verify all images exist
missing_images = [
    path for path in unique_images_df['full_image_path']
    if not Path(path).exists()
]

if missing_images:
    print(f"\n⚠️  WARNING: {len(missing_images)} images not found:")
    for img in missing_images[:5]:
        print(f"  - {img}")
    if len(missing_images) > 5:
        print(f"  ... and {len(missing_images) - 5} more")
else:
    print(f"\n✓ All {len(unique_images_df)} test images found!")

Loading test data...

Found test.csv at: ./competition/test.csv

Test data shape: (5, 3)
Columns: ['sample_id', 'image_path', 'target_name']

First few rows:
                    sample_id             image_path   target_name
0  ID1001187975__Dry_Clover_g  test/ID1001187975.jpg  Dry_Clover_g
1    ID1001187975__Dry_Dead_g  test/ID1001187975.jpg    Dry_Dead_g
2   ID1001187975__Dry_Green_g  test/ID1001187975.jpg   Dry_Green_g
3   ID1001187975__Dry_Total_g  test/ID1001187975.jpg   Dry_Total_g
4         ID1001187975__GDM_g  test/ID1001187975.jpg         GDM_g

✓ Found 1 unique test images
  Total test rows: 5 (images × targets)
  Expected: 1 images × 5 targets = 5 rows

✓ All 1 test images found!
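
The long-to-unique step can be illustrated without pandas. The sketch below uses the five rows shown in the preview above and deduplicates image paths while preserving first-seen order, which is the same ordering `drop_duplicates()` keeps.

```python
# Rows as shown in the test.csv preview: (sample_id, image_path, target_name)
test_rows = [
    ("ID1001187975__Dry_Clover_g", "test/ID1001187975.jpg", "Dry_Clover_g"),
    ("ID1001187975__Dry_Dead_g", "test/ID1001187975.jpg", "Dry_Dead_g"),
    ("ID1001187975__Dry_Green_g", "test/ID1001187975.jpg", "Dry_Green_g"),
    ("ID1001187975__Dry_Total_g", "test/ID1001187975.jpg", "Dry_Total_g"),
    ("ID1001187975__GDM_g", "test/ID1001187975.jpg", "GDM_g"),
]

# dict.fromkeys keeps the first occurrence of each key, in insertion order
unique_image_paths = list(dict.fromkeys(image_path for _, image_path, _ in test_rows))

targets_per_image = len(test_rows) // len(unique_image_paths)
print(unique_image_paths)
print(f"{len(unique_image_paths)} image(s) x {targets_per_image} targets = {len(test_rows)} rows")
```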


In [6]:
# Cell 6: Create Test Dataset & DataLoader

class TestDataset(Dataset):
    """Test dataset for inference (images only, no labels)."""
    
    def __init__(self, image_paths):
        self.image_paths = image_paths
        
        # Same transforms used during training (without augmentation)
        self.transform = transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])  # ImageNet stats
        ])
    
    def __len__(self):
        return len(self.image_paths)
    
    def __getitem__(self, idx):
        img_path = self.image_paths[idx]
        img = Image.open(img_path).convert('RGB')
        img = self.transform(img)
        return img

# Create dataset and dataloader
test_dataset = TestDataset(unique_images_df['full_image_path'].tolist())
test_loader = DataLoader(
    test_dataset, 
    batch_size=BATCH_SIZE, 
    shuffle=False,  # Important: Keep order for matching predictions to images
    num_workers=0
)

print(f"✓ Test dataset created")
print(f"  Images: {len(test_dataset)}")
print(f"  Batches: {len(test_loader)}")
print(f"  Batch size: {BATCH_SIZE}")

✓ Test dataset created
  Images: 1
  Batches: 1
  Batch size: 16
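
The batch count reported above follows from the `DataLoader` default of keeping the final partial batch (`drop_last=False`). A small sketch of the arithmetic, with `num_batches` as an illustrative helper name:

```python
import math

def num_batches(num_items, batch_size):
    """Batches produced when the final partial batch is kept (drop_last=False)."""
    return math.ceil(num_items / batch_size)

print(num_batches(1, 16))    # the single local test image -> 1
print(num_batches(357, 16))  # a set the size of the full training data -> 23
```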


In [7]:
# Cell 7: Generate Predictions

print("Generating predictions...\n")

all_predictions = []

with torch.no_grad():
    for batch_idx, images in enumerate(tqdm(test_loader, desc='Predicting')):
        images = images.to(device)
        
        # Forward pass (returns normalized predictions)
        outputs = model(images)  # [batch_size, 5]
        
        # Denormalize to original scale (grams)
        outputs_denorm = outputs.cpu() * TARGET_STDS + TARGET_MEANS
        
        # Clip negative values to 0 (biomass cannot be negative)
        outputs_denorm = torch.clamp(outputs_denorm, min=0)
        
        all_predictions.append(outputs_denorm.numpy())

# Stack all predictions
all_predictions = np.vstack(all_predictions)  # [num_images, 5]

print(f"\n✓ Predictions generated!")
print(f"  Shape: {all_predictions.shape} (images × targets)")
print(f"\nPrediction statistics (grams):")
for i, col in enumerate(TARGET_COLS):
    print(f"  {col:15s}: min={all_predictions[:, i].min():7.2f}g, "
          f"max={all_predictions[:, i].max():7.2f}g, "
          f"mean={all_predictions[:, i].mean():7.2f}g")

Generating predictions...



Predicting:   0%|          | 0/1 [00:00<?, ?it/s]


✓ Predictions generated!
  Shape: (1, 5) (images × targets)

Prediction statistics (grams):
  Dry_Green_g    : min=  15.72g, max=  15.72g, mean=  15.72g
  Dry_Dead_g     : min=  19.35g, max=  19.35g, mean=  19.35g
  Dry_Clover_g   : min=   1.30g, max=   1.30g, mean=   1.30g
  GDM_g          : min=  19.12g, max=  19.12g, mean=  19.12g
  Dry_Total_g    : min=  34.26g, max=  34.26g, mean=  34.26g
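
One extra sanity check is possible here. The target means in the configuration cell roughly satisfy Green + Clover ≈ GDM (26.62 + 6.65 ≈ 33.27) and GDM + Dead ≈ Total (33.27 + 12.04 ≈ 45.32), which suggests the targets are additively related. That relationship is inferred from the statistics and is not stated anywhere in this notebook, so treat the sketch below as an assumption-driven diagnostic. It uses the rounded predictions printed above.

```python
# Rounded predictions for the single test image, from the output above (grams)
pred = {
    "Dry_Green_g": 15.72,
    "Dry_Dead_g": 19.35,
    "Dry_Clover_g": 1.30,
    "GDM_g": 19.12,
    "Dry_Total_g": 34.26,
}

# ASSUMED relationships, inferred from the target means only
gdm_implied = pred["Dry_Green_g"] + pred["Dry_Clover_g"]
total_implied = gdm_implied + pred["Dry_Dead_g"]

print(f"GDM:   predicted {pred['GDM_g']:.2f}, implied {gdm_implied:.2f}, "
      f"difference {pred['GDM_g'] - gdm_implied:+.2f}")
print(f"Total: predicted {pred['Dry_Total_g']:.2f}, implied {total_implied:.2f}, "
      f"difference {pred['Dry_Total_g'] - total_implied:+.2f}")
```

If the additive relationship does hold in the data, differences of about two grams indicate that the five independent outputs are not mutually consistent for this image.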


In [8]:
# Cell 8: Create Submission File

import os

print("Creating submission file...\n")

# Convert predictions to long format (one row per sample_id)
submission_rows = []

for idx, img_path in enumerate(unique_images_df['image_path'].tolist()):
    # Extract image ID from path (e.g., 'test/ID1001187975.jpg' -> 'ID1001187975')
    image_id = Path(img_path).stem  # Get filename without extension
    
    # Create one row per target (5 rows per image)
    for target_idx, target_name in enumerate(TARGET_COLS):
        sample_id = f"{image_id}__{target_name}"  # Format: ImageID__TargetName
        target_value = all_predictions[idx, target_idx]
        
        submission_rows.append({
            'sample_id': sample_id,
            'target': target_value
        })

# Create DataFrame
submission = pd.DataFrame(submission_rows)

print("Submission DataFrame:")
print(submission.head(10))
print(f"\nShape: {submission.shape}")
print(f"Expected: ({len(unique_images_df) * 5}, 2)")

# Quality checks
print(f"\nQuality checks:")
print(f"  NaN values: {submission.isna().sum().sum()} ✓" if submission.isna().sum().sum() == 0 else f"  ⚠️  NaN values: {submission.isna().sum().sum()}")
print(f"  Infinite values: {np.isinf(submission['target']).sum()} ✓" if np.isinf(submission['target']).sum() == 0 else f"  ⚠️  Infinite values: {np.isinf(submission['target']).sum()}")
print(f"  Negative values: {(submission['target'] < 0).sum()} ✓" if (submission['target'] < 0).sum() == 0 else f"  ⚠️  Negative values: {(submission['target'] < 0).sum()}")
print(f"  Correct columns: {list(submission.columns) == ['sample_id', 'target']} ✓" if list(submission.columns) == ['sample_id', 'target'] else f"  ⚠️  Columns: {list(submission.columns)}")

# IMPORTANT: Save to current working directory for Kaggle compatibility
output_path = 'submission.csv'
submission.to_csv(output_path, index=False)

# Verify file was created
if os.path.exists(output_path):
    file_size = os.path.getsize(output_path)
    print(f"\n✅ File verified: {output_path} ({file_size:,} bytes)")
else:
    raise FileNotFoundError(f"Failed to create {output_path}")

print(f"\n{'='*80}")
print("✅ SUBMISSION FILE CREATED: submission.csv")
print(f"{'='*80}")
print(f"\nFile details:")
print(f"  Filename: submission.csv (required by Kaggle)")
print(f"  Location: {os.path.abspath(output_path)}")
print(f"  Rows: {len(submission):,}")
print(f"  Images: {len(unique_images_df)}")
print(f"  Format: Long format (sample_id, target)")
print(f"\nModel info:")
print(f"  Model: Simple ResNet18 Baseline (Model 1)")
print(f"  Validation R²: +0.6352 (10 epochs)")
print(f"  Training: ResNet18 + basic augmentation, no ColorJitter")
print(f"\nExpected Kaggle score: 0.55-0.60")
print(f"  (Better than Model 4b's 0.51 due to less overfitting)")
print(f"\nNext steps:")
print(f"  1. Download submission.csv from notebook output")
print(f"  2. Submit to Kaggle competition")
print(f"  3. Compare with Model 4b score (0.51)")
print(f"  4. If Model 1 > 0.51: confirms overfitting hypothesis")
print(f"  5. If Model 1 < 0.51: indicates distribution shift problem")
print(f"\n{'='*80}")

Creating submission file...

Submission DataFrame:
                    sample_id     target
0   ID1001187975__Dry_Green_g  15.715257
1    ID1001187975__Dry_Dead_g  19.350996
2  ID1001187975__Dry_Clover_g   1.302833
3         ID1001187975__GDM_g  19.119936
4   ID1001187975__Dry_Total_g  34.255455

Shape: (5, 2)
Expected: (5, 2)

Quality checks:
  NaN values: 0 ✓
  Infinite values: 0 ✓
  Negative values: 0 ✓
  Correct columns: True ✓

✅ File verified: submission.csv (191 bytes)

✅ SUBMISSION FILE CREATED: submission.csv

File details:
  Filename: submission.csv (required by Kaggle)
  Location: /Users/tim/Code/Tim/csiro-biomass/submission.csv
  Rows: 5
  Images: 1
  Format: Long format (sample_id, target)

Model info:
  Model: Simple ResNet18 Baseline (Model 1)
  Validation R²: +0.6352 (10 epochs)
  Training: ResNet18 + basic augmentation, no ColorJitter

Expected Kaggle score: 0.55-0.60
  (Better than Model 4b's 0.51 due to less overfitting)

Next steps:
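  1. Download submission.csv from notebook output
  2. Submit to Kaggle competition
  3. Compare with Model 4b score (0.51)
  4. If Model 1 > 0.51: confirms overfitting hypothesis
  5. If Model 1 < 0.51: indicates distribution shift problem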
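
Cell 8 rebuilds each `sample_id` from the image filename and emits rows in `TARGET_COLS` order. An alternative, sketched below, reads `sample_id` and `target_name` straight from the `test.csv` rows, which preserves the file's row order and does not depend on the `ImageID__TargetName` naming convention. Whether Kaggle requires a particular row order is not stated in this notebook; the prediction values are the rounded ones printed earlier.

```python
TARGET_COLS = ['Dry_Green_g', 'Dry_Dead_g', 'Dry_Clover_g', 'GDM_g', 'Dry_Total_g']

# test.csv rows in file order: (sample_id, image_path, target_name)
test_rows = [
    ("ID1001187975__Dry_Clover_g", "test/ID1001187975.jpg", "Dry_Clover_g"),
    ("ID1001187975__Dry_Dead_g", "test/ID1001187975.jpg", "Dry_Dead_g"),
    ("ID1001187975__Dry_Green_g", "test/ID1001187975.jpg", "Dry_Green_g"),
    ("ID1001187975__Dry_Total_g", "test/ID1001187975.jpg", "Dry_Total_g"),
    ("ID1001187975__GDM_g", "test/ID1001187975.jpg", "GDM_g"),
]

# One prediction vector per image, ordered like TARGET_COLS
predictions = {"test/ID1001187975.jpg": [15.72, 19.35, 1.30, 19.12, 34.26]}

# Look up each row's value by image and target name
col_index = {name: i for i, name in enumerate(TARGET_COLS)}
submission = [
    (sample_id, predictions[image_path][col_index[target_name]])
    for sample_id, image_path, target_name in test_rows
]

for sample_id, target in submission:
    print(f"{sample_id},{target}")
```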

In [None]:
# Cell 9: Final Verification (For Kaggle)

import os
import glob

print("\n" + "="*80)
print("FINAL VERIFICATION")
print("="*80)

# List all CSV files in current directory
csv_files = glob.glob('*.csv')
print(f"\nCSV files in current directory:")
for f in csv_files:
    size = os.path.getsize(f)
    print(f"  {f}: {size:,} bytes")

# Verify submission.csv specifically
if os.path.exists('submission.csv'):
    size = os.path.getsize('submission.csv')
    print(f"\n✅ SUCCESS! submission.csv exists ({size:,} bytes)")
    print(f"   Absolute path: {os.path.abspath('submission.csv')}")
    
    # Show first few lines
    import pandas as pd
    sub = pd.read_csv('submission.csv')
    print(f"\nFirst 5 rows:")
    print(sub.head())
    print(f"\nTotal rows: {len(sub)}")
    print(f"Columns: {list(sub.columns)}")
else:
    print(f"\n❌ ERROR! submission.csv not found!")
    print(f"Current directory: {os.getcwd()}")
    print(f"Files in directory: {os.listdir('.')}")

print("\n" + "="*80)
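
The quality checks from Cell 8 and the verification above can also be bundled into one function that returns a list of problems instead of printing. This is a sketch using only the standard library; `validate_submission` and its checks are illustrative, not a Kaggle requirement list.

```python
import csv
import io
import math

def validate_submission(csv_text, expected_rows):
    """Return a list of human-readable problems; an empty list means no issue was found."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["sample_id", "target"]:
        return [f"unexpected columns: {reader.fieldnames}"]

    problems = []
    rows = list(reader)
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")

    seen = set()
    for row in rows:
        sample_id = row["sample_id"]
        if sample_id in seen:
            problems.append(f"duplicate sample_id: {sample_id}")
        seen.add(sample_id)
        try:
            value = float(row["target"])
        except (TypeError, ValueError):
            problems.append(f"non-numeric target for {sample_id}")
            continue
        # Reject NaN, infinities, and negative biomass
        if not math.isfinite(value) or value < 0:
            problems.append(f"invalid target {value} for {sample_id}")
    return problems

good = "sample_id,target\nID1__Dry_Green_g,15.7\nID1__Dry_Dead_g,19.4\n"
bad = "sample_id,target\nID1__Dry_Green_g,-3.0\nID1__Dry_Green_g,nan\n"
print(validate_submission(good, expected_rows=2))
print(validate_submission(bad, expected_rows=2))
```

In the notebook it could be called as `validate_submission(open('submission.csv').read(), len(test_df))`.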