<a href="https://colab.research.google.com/github/shunte88/ACU/blob/main/csiro_image2biomass_colab_starter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CSIRO Image2Biomass — Colab Starter

End‑to‑end baseline for multi‑target biomass prediction.

**What you get**: Kaggle download cell, quick EDA, heuristic target auto‑detection, GroupKFold CV, a PyTorch Lightning image regressor (ViT‑tiny via timm), single‑fold OOF predictions, and a submission template. Tabular fusion is listed under next steps.

> Competition context: launched Oct 2025 by CSIRO/MLA/Google; dataset: 1,162 top‑view quadrats with green/dead/legume components, height, and AOS NDVI. See the [arXiv dataset paper](https://arxiv.org/abs/2510.22916).


## 0) Runtime
If you're on **Colab**, ensure GPU is enabled: `Runtime → Change runtime type → T4/L4/A100`.

In [1]:
!nvidia-smi  # inspect the GPU; comment out on a CPU-only runtime
import sys, platform, torch
print(f'Python {platform.python_version()}  Torch {torch.__version__}  CUDA avail: {torch.cuda.is_available()}')

Wed Oct 29 18:52:18 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA L4                      Off |   00000000:00:03.0 Off |                    0 |
| N/A   44C    P8             11W /   72W |       0MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

## 1) Install deps

In [7]:
!pip -q install timm==1.0.9 pytorch-lightning==2.4.0 opendatasets==0.1.22 albumentations==1.4.21 pandas==2.2.2 numpy==1.26.4 torchvision==0.19.1 scikit-learn==1.5.2 mlflow==2.16.2 matplotlib==3.9.2 pillow==10.4.0 tqdm==4.66.5 pyyaml==6.0.2


## 2) Download data from Kaggle
You need a Kaggle account + API token:
1. Get `kaggle.json` from https://www.kaggle.com/settings/account (Create New API Token)
2. Upload it in the cell below or mount Google Drive and place it at `~/.kaggle/kaggle.json`.
3. The code will download **CSIRO - Image2Biomass Prediction** data into `./csiro-biomass/`.  
If you're running inside Kaggle Notebooks, skip this and use `/kaggle/input/` paths instead.
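
If the notebook may run in either environment, the data directory can be resolved once up front. A minimal sketch, assuming the competition files are mounted at `/kaggle/input/csiro-biomass` inside Kaggle Notebooks (the mount path is an assumption based on the competition slug) and using a hypothetical helper named `resolve_data_dir`:

```python
from pathlib import Path

def resolve_data_dir(kaggle_root="/kaggle/input", comp="csiro-biomass",
                     local_dir="./csiro-biomass"):
    """Return the Kaggle mount if it exists, otherwise the local download directory."""
    mounted = Path(kaggle_root) / comp  # assumed mount location on Kaggle
    # Inside Kaggle Notebooks the competition data is already present and read-only.
    if mounted.is_dir():
        return mounted
    # Elsewhere (e.g. Colab) fall back to the directory the download cell writes to.
    return Path(local_dir)

DATA_DIR = resolve_data_dir()
print("Using data dir:", DATA_DIR)
```

On Colab this returns `./csiro-biomass`, so the download cell is still needed there.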

In [3]:
import os, json, shutil, pathlib
from pathlib import Path

DATA_DIR = Path('./csiro-biomass')
COMP = 'csiro-biomass'

# Option A: upload kaggle.json interactively
from google.colab import files
if not (Path.home()/'.kaggle/kaggle.json').exists():
    print('Upload your kaggle.json (from Kaggle account settings).')
    uploaded = files.upload()
    if 'kaggle.json' in uploaded:
        kaggle_dir = Path.home()/'.kaggle'
        kaggle_dir.mkdir(exist_ok=True)
        with open(kaggle_dir/'kaggle.json','wb') as f:
            f.write(uploaded['kaggle.json'])
        os.chmod(kaggle_dir/'kaggle.json', 0o600)

# Download competition data
!pip -q install kaggle==1.6.17
!kaggle competitions download -c {COMP} -p {DATA_DIR}
!unzip -q -o {DATA_DIR/'csiro-biomass.zip'} -d {DATA_DIR}
print('Files:', os.listdir(DATA_DIR))

Downloading csiro-biomass.zip to csiro-biomass
100% 1.02G/1.02G [00:50<00:00, 21.4MB/s]
100% 1.02G/1.02G [00:50<00:00, 21.6MB/s]
Files: ['sample_submission.csv', 'train', 'test.csv', 'train.csv', 'test', 'csiro-biomass.zip']


## 3) Quick EDA & target auto‑detection
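This cell lists the available CSVs and guesses the ID, image, group, and target columns by matching column names against regex patterns. The detection looks at column names only, so a long‑format file (one row per target, as in the output below) yields an empty list; set `TARGET_COLS` by hand in that case.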

In [4]:
import pandas as pd, re
from pathlib import Path

def find_csv(root):
    return sorted([p for p in Path(root).rglob('*.csv')])

csvs = find_csv(DATA_DIR)
print('Found CSVs:\n', '\n'.join(map(str,csvs)))

# Heuristics for common file names
train_candidates = [p for p in csvs if re.search(r'train', p.name, re.I)]
test_candidates  = [p for p in csvs if re.search(r'test',  p.name, re.I)]
sub_candidates   = [p for p in csvs if re.search(r'sample|submission', p.name, re.I)]

TRAIN_CSV = train_candidates[0] if train_candidates else csvs[0]
TEST_CSV  = test_candidates[0]  if test_candidates else (csvs[1] if len(csvs)>1 else csvs[0])
print('Guessed TRAIN:', TRAIN_CSV.name)
print('Guessed TEST :', TEST_CSV.name)

train_df = pd.read_csv(TRAIN_CSV)
print('Train shape:', train_df.shape)
print(train_df.head(3))

# Guess ID, image, group columns
id_cols = [c for c in train_df.columns if re.search(r'id$', c, re.I)]
img_cols = [c for c in train_df.columns if re.search(r'image|img|path|filename', c, re.I)]
group_cols = [c for c in train_df.columns if re.search(r'(site|farm|location|block|paddock|date)', c, re.I)]
print('ID columns:', id_cols)
print('IMG columns:', img_cols)
print('Group-ish columns:', group_cols)

# Guess target columns: names containing biomass components
target_patterns = r'(green|dead|legume|clover|total|biomass)'
TARGET_COLS = [c for c in train_df.columns
               if re.search(target_patterns, c, re.I) and train_df[c].dtype != 'O']
# Keep at most 6 numeric targets
TARGET_COLS = [c for c in TARGET_COLS if c not in id_cols+img_cols][:6]
print('AUTO TARGET_COLS:', TARGET_COLS)

# Persist a small config for later cells
import yaml, json, os
cfg = {
    'data_dir': str(DATA_DIR),
    'train_csv': str(TRAIN_CSV),
    'test_csv': str(TEST_CSV),
    'id_col': id_cols[0] if id_cols else None,
    'img_col': img_cols[0] if img_cols else None,
    'group_cols': group_cols[:2],
    'target_cols': TARGET_COLS,
}
os.makedirs('cfg', exist_ok=True)
with open('cfg/baseline.yaml','w') as f: yaml.safe_dump(cfg, f)
print('\nSaved cfg/baseline.yaml:', cfg)

Found CSVs:
 csiro-biomass/sample_submission.csv
csiro-biomass/test.csv
csiro-biomass/train.csv
Guessed TRAIN: train.csv
Guessed TEST : test.csv
Train shape: (1785, 9)
                    sample_id              image_path Sampling_Date State  \
0  ID1011485656__Dry_Clover_g  train/ID1011485656.jpg      2015/9/4   Tas   
1    ID1011485656__Dry_Dead_g  train/ID1011485656.jpg      2015/9/4   Tas   
2   ID1011485656__Dry_Green_g  train/ID1011485656.jpg      2015/9/4   Tas   

           Species  Pre_GSHH_NDVI  Height_Ave_cm   target_name   target  
0  Ryegrass_Clover           0.62         4.6667  Dry_Clover_g   0.0000  
1  Ryegrass_Clover           0.62         4.6667    Dry_Dead_g  31.9984  
2  Ryegrass_Clover           0.62         4.6667   Dry_Green_g  16.2751  
ID columns: ['sample_id']
IMG columns: ['image_path']
Group-ish columns: ['Sampling_Date']
AUTO TARGET_COLS: []

Saved cfg/baseline.yaml: {'data_dir': 'csiro-biomass', 'train_csv': 'csiro-biomass/train.csv', 'test_csv': 'csiro-
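
The empty `AUTO TARGET_COLS` above follows from the file layout: `train.csv` is in long format, with one row per (image, `target_name`) pair and the value in `target`, so no column name matches the biomass patterns. One way to get per-image targets is to pivot to wide format. A minimal sketch on a toy frame that copies the three rows printed above; `long_to_wide` is a hypothetical helper, and it assumes the metadata columns are constant within an image:

```python
import pandas as pd

def long_to_wide(df, key="image_path", name_col="target_name", value_col="target"):
    """Pivot one-row-per-target data into one row per image with a column per target."""
    # Metadata: everything except the per-target columns, one row per image.
    meta_cols = [c for c in df.columns if c not in (name_col, value_col, "sample_id")]
    meta = df[meta_cols].drop_duplicates(subset=key).set_index(key)
    # Targets: one column per distinct target name.
    wide = df.pivot(index=key, columns=name_col, values=value_col)
    wide.columns.name = None
    target_cols = list(wide.columns)
    return meta.join(wide).reset_index(), target_cols

# Toy frame mirroring the first three rows of train.csv shown above.
toy = pd.DataFrame({
    "sample_id": ["ID1011485656__Dry_Clover_g",
                  "ID1011485656__Dry_Dead_g",
                  "ID1011485656__Dry_Green_g"],
    "image_path": ["train/ID1011485656.jpg"] * 3,
    "Sampling_Date": ["2015/9/4"] * 3,
    "Height_Ave_cm": [4.6667] * 3,
    "target_name": ["Dry_Clover_g", "Dry_Dead_g", "Dry_Green_g"],
    "target": [0.0, 31.9984, 16.2751],
})
wide_df, target_cols = long_to_wide(toy)
print(target_cols)
print(wide_df)
```

Applied to the real file, `target_cols` could replace the auto-detected list in `cfg/baseline.yaml`, and the wide frame would be the one passed to the dataset class and the CV split.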

## 4) Dataset & augmentations

In [5]:
import os, cv2, numpy as np, torch
from pathlib import Path  # used by BiomassDataset.load_image
from torch.utils.data import Dataset
from torchvision import transforms
from PIL import Image

# Simple albumentations pipeline
import albumentations as A
from albumentations.pytorch import ToTensorV2

IM_SIZE = 384

def build_train_aug():
    return A.Compose([
        A.LongestMaxSize(max_size=IM_SIZE),
        A.PadIfNeeded(IM_SIZE, IM_SIZE, border_mode=cv2.BORDER_REFLECT_101),
        A.RandomBrightnessContrast(0.1, 0.1, p=0.5),
        A.HueSaturationValue(10, 10, 10, p=0.3),
        A.ShiftScaleRotate(shift_limit=0.02, scale_limit=0.05, rotate_limit=10, border_mode=cv2.BORDER_REFLECT_101, p=0.5),
        A.Normalize(),
        ToTensorV2(),
    ])

def build_valid_aug():
    return A.Compose([
        A.LongestMaxSize(max_size=IM_SIZE),
        A.PadIfNeeded(IM_SIZE, IM_SIZE, border_mode=cv2.BORDER_REFLECT_101),
        A.Normalize(),
        ToTensorV2(),
    ])

class BiomassDataset(Dataset):
    def __init__(self, df, cfg, root=None, img_aug=None, is_test=False):
        self.df = df.reset_index(drop=True)
        self.cfg = cfg
        self.root = root or cfg['data_dir']
        self.img_col = cfg['img_col']
        self.id_col  = cfg['id_col'] or 'id'
        self.targets = cfg['target_cols']
        self.is_test = is_test
        self.aug = img_aug or build_valid_aug()
    def __len__(self): return len(self.df)
    def load_image(self, row):
        path = str(Path(self.root)/row[self.img_col]) if self.img_col else None
        if (not path) or (not os.path.exists(path)):
            # fallback: try to find an images/ folder and use id.jpg/png
            path = None  # discard the non-existent path so the check below can fire
            iid = str(row[self.id_col]) if self.id_col and self.id_col in row else str(row.name)
            for ext in ('.jpg','.jpeg','.png','.bmp','.tif','.tiff'):
                trial = os.path.join(self.root, 'images', f'{iid}{ext}')
                if os.path.exists(trial): path = trial; break
        if not path:
            raise FileNotFoundError('Image path not found; set cfg.img_col correctly.')
        image = cv2.imread(path, cv2.IMREAD_COLOR)
        if image is None:
            raise IOError(f'Could not decode image: {path}')
        # OpenCV loads BGR; convert to a contiguous RGB array
        return cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        image = self.load_image(row)
        augmented = self.aug(image=image)
        image_t = augmented['image']
        if self.is_test or not self.targets:
            return image_t, torch.tensor([0.0])
        y = torch.tensor(row[self.targets].values.astype(np.float32))
        return image_t, y


## 5) Model — ViT‑tiny head for multi‑target regression

In [6]:
import timm, torch, torch.nn as nn, pytorch_lightning as pl
from torch.utils.data import DataLoader

class Regressor(pl.LightningModule):
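    def __init__(self, backbone='vit_tiny_patch16_224.augreg_in21k', n_out=1, lr=2e-4, img_size=384):
        super().__init__()
        self.save_hyperparameters()
        # This ViT checkpoint expects 224-px inputs; pass img_size so timm resamples the
        # position embeddings to match the 384-px images produced by the augmentations.
        # Drop the img_size argument for CNN backbones, which do not accept it.
        self.backbone = timm.create_model(backbone, pretrained=True, num_classes=0, global_pool='avg', img_size=img_size)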
        in_features = self.backbone.num_features
        self.head = nn.Sequential(
            nn.LayerNorm(in_features),
            nn.Linear(in_features, 256),
            nn.GELU(),
            nn.Linear(256, n_out)
        )
        self.loss = nn.L1Loss()  # MAE for robustness
        self.lr = lr
    def forward(self, x):
        feat = self.backbone(x)
        return self.head(feat)
    def training_step(self, batch, batch_idx):
        x, y = batch
        yhat = self(x)
        loss = self.loss(yhat, y)
        self.log('train_mae', loss, prog_bar=True)
        return loss
    def validation_step(self, batch, batch_idx):
        x, y = batch
        yhat = self(x)
        loss = self.loss(yhat, y)
        self.log('val_mae', loss, prog_bar=True)
        return loss
    def configure_optimizers(self):
        opt = torch.optim.AdamW(self.parameters(), lr=self.lr, weight_decay=1e-4)
        sch = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=10)
        return [opt], [sch]

def make_loaders(train_df, valid_df, cfg, bs=16):
    train_ds = BiomassDataset(train_df, cfg, img_aug=build_train_aug())
    valid_ds = BiomassDataset(valid_df, cfg, img_aug=build_valid_aug())
    tr = DataLoader(train_ds, batch_size=bs, shuffle=True, num_workers=2, pin_memory=True)
    va = DataLoader(valid_ds, batch_size=bs*2, shuffle=False, num_workers=2, pin_memory=True)
    return tr, va


ModuleNotFoundError: No module named 'pytorch_lightning'

## 6) Cross‑validation split (GroupKFold by site/date if present)

In [None]:
import pandas as pd, numpy as np
from sklearn.model_selection import GroupKFold, KFold
from pathlib import Path
import yaml, math

with open('cfg/baseline.yaml') as f: cfg = yaml.safe_load(f)

df = pd.read_csv(cfg['train_csv'])
targets = cfg['target_cols']
assert len(targets) >= 1, "No target columns detected; please edit cfg/baseline.yaml"

# make groups
groups_cols = cfg['group_cols'] or []
if len(groups_cols) >= 1:
    groups = df[groups_cols[0]].astype(str)
    if len(groups_cols) >= 2:
        groups = groups + '_' + df[groups_cols[1]].astype(str)
    splitter = GroupKFold(n_splits=5)
    cv = list(splitter.split(df, groups=groups))
    print('Using GroupKFold on', groups_cols)
else:
    splitter = KFold(n_splits=5, shuffle=True, random_state=42)
    cv = list(splitter.split(df))
    print('Using KFold (no group columns found)')

print('Fold sizes:', [len(v) for _, v in cv])
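
Because several rows can share an image or sampling date, it is worth confirming that no group ends up on both sides of a split. A small sketch of such a check; `assert_no_group_leakage` is a hypothetical helper that takes the same `(train_idx, valid_idx)` pairs as `cv`, and the toy folds are hand-written for illustration:

```python
import numpy as np

def assert_no_group_leakage(groups, cv):
    """Raise if any group label occurs in both the train and validation part of a fold."""
    groups = np.asarray(groups)
    for fold, (tr_idx, va_idx) in enumerate(cv):
        shared = set(groups[tr_idx]) & set(groups[va_idx])
        if shared:
            raise ValueError(f"fold {fold}: groups in both train and valid: {sorted(shared)}")

# Toy example: six rows, three dates, two hand-written folds.
toy_groups = ["2015/9/4", "2015/9/4", "2015/9/5", "2015/9/5", "2015/9/6", "2015/9/6"]
good_cv = [(np.array([2, 3, 4, 5]), np.array([0, 1])),
           (np.array([0, 1, 4, 5]), np.array([2, 3]))]
assert_no_group_leakage(toy_groups, good_cv)  # passes silently
```

In the notebook the call would be `assert_no_group_leakage(groups, cv)` on the GroupKFold branch; the plain KFold branch has no groups to check.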

## 7) Train one fold (quick sanity check)

In [None]:
import pytorch_lightning as pl, numpy as np
from pytorch_lightning.callbacks import ModelCheckpoint, EarlyStopping

fold = 0
tr_idx, va_idx = cv[fold]
train_df = df.iloc[tr_idx].reset_index(drop=True)
valid_df = df.iloc[va_idx].reset_index(drop=True)

tr_loader, va_loader = make_loaders(train_df, valid_df, cfg, bs=16)
model = Regressor(n_out=len(targets))

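# f-string fills in the fold number; the doubled braces leave {epoch}/{val_mae} for Lightning to format
ckpt = ModelCheckpoint(monitor='val_mae', mode='min', save_top_k=1, filename=f'fold{fold}-{{epoch:02d}}-{{val_mae:.4f}}')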
es = EarlyStopping(monitor='val_mae', mode='min', patience=3)
trainer = pl.Trainer(max_epochs=10, precision='16-mixed', callbacks=[ckpt, es], log_every_n_steps=20, enable_checkpointing=True)

trainer.fit(model, tr_loader, va_loader)

## 8) OOF predictions + Submission template
This saves `out_oof_fold0.csv` (validation predictions for fold 0) for diagnostics and writes a placeholder `submission.csv` built from the sample submission file, if one is present.

In [None]:
import numpy as np, pandas as pd, yaml, os, re, torch
from pathlib import Path

# Save OOF for the single fold (demo); extend to full CV later
valid_loader = va_loader
model.eval(); preds=[]; gts=[]
with torch.no_grad():
    for xb, yb in valid_loader:
        xb = xb.to(model.device)
        yhat = model(xb).cpu().numpy()
        preds.append(yhat); gts.append(yb.numpy())
preds = np.vstack(preds); gts = np.vstack(gts)
oof = valid_df.copy()
for i,t in enumerate(targets): oof[f'pred_{t}'] = preds[:,i]
oof.to_csv('out_oof_fold0.csv', index=False)
print('Saved out_oof_fold0.csv with columns:', list(oof.columns))

# Build submission
# Try to find sample submission
sub_path = None
for p in Path(cfg['data_dir']).rglob('*.csv'):
    if re.search(r'sample|submission', p.name, re.I):
        sub_path = p; break

if sub_path:
    sub = pd.read_csv(sub_path)
    # Try to populate columns that intersect with TARGET_COLS
    for t in targets:
        for c in sub.columns:
            if re.sub('[^a-z]','',c.lower()) == re.sub('[^a-z]','',t.lower()):
                # Fill with global mean as placeholder
                sub[c] = df[t].mean()
    sub.to_csv('submission.csv', index=False)
    print('Wrote submission.csv based on sample:', sub.shape)
else:
    print('No sample_submission found; please craft one based on competition schema.')
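
The saved predictions can also be scored per target rather than through the single averaged `val_mae`. A minimal NumPy sketch, assuming `gts` and `preds` are arrays of shape `(n_samples, n_targets)` as built above; `per_target_scores` is a hypothetical helper, and MAE and R² are diagnostic choices here, not the competition metric:

```python
import numpy as np

def per_target_scores(y_true, y_pred, names):
    """Return {target_name: {'mae': ..., 'r2': ...}} for column-aligned arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    scores = {}
    for i, name in enumerate(names):
        err = y_true[:, i] - y_pred[:, i]
        mae = float(np.mean(np.abs(err)))
        ss_res = float(np.sum(err ** 2))
        ss_tot = float(np.sum((y_true[:, i] - y_true[:, i].mean()) ** 2))
        # R^2 is undefined when the target is constant in this fold.
        r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
        scores[name] = {"mae": mae, "r2": r2}
    return scores

# Toy check with two made-up targets.
toy_true = [[0.0, 10.0], [2.0, 20.0], [4.0, 30.0]]
toy_pred = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
toy_scores = per_target_scores(toy_true, toy_pred, ["a", "b"])
for name, s in toy_scores.items():
    print(f"{name}: MAE={s['mae']:.4f}  R2={s['r2']:.4f}")
```

In the notebook the call would be `per_target_scores(gts, preds, targets)`.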

## 9) Next steps
- Expand to full 5‑fold CV: concatenate the OOF predictions across folds and average the fold models' test predictions.
- Try ConvNeXt‑T or EfficientNet‑B0 as the backbone.
- Add tabular fusion (height, AOS NDVI) by concatenating the image embedding with normalized tabular features.
- Add quantile heads for uncertainty estimates.
- Ensemble across seeds and backbones.
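
For the quantile-head idea, each quantile output would be trained with the pinball (quantile) loss in place of MAE. A NumPy sketch of the loss for reference; `pinball_loss` is a hypothetical helper and the toy numbers are illustrative only:

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Mean pinball loss for a quantile level q in (0, 1)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # Under-prediction (diff > 0) costs q per unit; over-prediction costs (1 - q) per unit.
    return float(np.mean(np.maximum(q * diff, (q - 1.0) * diff)))

y = np.array([10.0, 20.0, 30.0])
under = pinball_loss(y, y - 5.0, q=0.9)  # predictions 5 units too low
over = pinball_loss(y, y + 5.0, q=0.9)   # predictions 5 units too high
print(under, over)
```

At `q=0.9` under-prediction is penalized nine times more than over-prediction, which pushes that head toward the upper quantile; at `q=0.5` the loss is half the MAE.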
