In [1]:
import subprocess

# Print GPU status; `|| true` keeps the shell command from failing when nvidia-smi is missing.
print(subprocess.run(['bash', '-lc', 'nvidia-smi || true'], capture_output=True, text=True).stdout)

Sun Sep 28 05:55:25 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.06             Driver Version: 550.144.06     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A10-24Q                 On  |   00000002:00:00.0 Off |                    0 |
| N/A   N/A    P0             N/A /  N/A  |     182MiB /  24512MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [5]:
import json
import pandas as pd
import numpy as np
from collections import Counter

# Load train metadata
with open('nybg2020/train/metadata.json', 'r') as f:
    train_meta = json.load(f)

print('Type of train_meta:', type(train_meta))
print('Keys in train_meta:', list(train_meta.keys()))

# Extract images and annotations
train_images = train_meta['images']
train_annotations = train_meta['annotations']

print('Number of train images:', len(train_images))
print('Number of annotations:', len(train_annotations))

# Create DataFrames
train_df = pd.DataFrame(train_images)
annotations_df = pd.DataFrame(train_annotations)

print('Train images columns:', list(train_df.columns))
print('Annotations columns:', list(annotations_df.columns))

# Merge on image id (images have 'id', annotations have 'image_id')
train_df = train_df.merge(annotations_df, left_on='id', right_on='image_id', how='left')
print('Merged train shape:', train_df.shape)
print(train_df.head())

print('\nClass distribution:')
class_counts = Counter(train_df['category_id'])
print(f'Number of unique classes: {len(class_counts)}')
print(f'Min images per class: {min(class_counts.values())}')
print(f'Max images per class: {max(class_counts.values())}')
print(f'Mean images per class: {np.mean(list(class_counts.values())):.2f}')
print(f'Total train samples: {len(train_df)}')

# Load test metadata
with open('nybg2020/test/metadata.json', 'r') as f:
    test_meta = json.load(f)

print('\nType of test_meta:', type(test_meta))
print('Keys in test_meta:', list(test_meta.keys()))

test_images = test_meta['images']
test_df = pd.DataFrame(test_images)
print('Test images columns:', list(test_df.columns))
print('Test shape:', test_df.shape)
print(test_df.head())

# Check sample submission
sample_sub = pd.read_csv('sample_submission.csv')
print('\nSample submission shape:', sample_sub.shape)
print(sample_sub.head())
print('\nSubmission format: id, Predicted')

Type of train_meta: <class 'dict'>
Keys in train_meta: ['annotations', 'categories', 'images', 'info', 'licenses', 'regions']
Number of train images: 811623
Number of annotations: 811623


Train images columns: ['file_name', 'height', 'id', 'license', 'width']
Annotations columns: ['category_id', 'id', 'image_id', 'region_id']
Merged train shape: (811623, 9)
                  file_name  height    id_x  license  width  category_id  \
0  images/156/72/124136.jpg    1000  124136        1    661        15672   
1    images/156/72/5327.jpg    1000    5327        1    661        15672   
2  images/156/72/449419.jpg    1000  449419        1    662        15672   
3   images/156/72/29079.jpg    1000   29079        1    661        15672   
4  images/156/72/368979.jpg    1000  368979        1    667        15672   

     id_y  image_id  region_id  
0  124136    124136          1  
1    5327      5327          1  
2  449419    449419          1  
3   29079     29079          1  
4  368979    368979          1  

Class distribution:
Number of unique classes: 32093
Min images per class: 1
Max images per class: 1412
Mean images per class: 25.29
Total train samples: 811623



Type of test_meta: <class 'dict'>
Keys in test_meta: ['images', 'info', 'licenses']
Test images columns: ['file_name', 'height', 'id', 'license', 'width']
Test shape: (219124, 5)
          file_name  height id  license  width
0  images/000/0.jpg    1000  0        1    667
1  images/000/1.jpg    1000  1        1    667
2  images/000/2.jpg    1000  2        1    675
3  images/000/3.jpg    1000  3        1    676
4  images/000/4.jpg    1000  4        1    678

Sample submission shape: (219124, 2)
   Id  Predicted
0   0          0
1   1          0
2   2          0
3   3          0
4   4          0

Submission format: id, Predicted


# Revised Plan for Herbarium 2020 (Post-Expert Review)

## Dataset Overview
- Train: 811,623 images across 32,093 classes, highly imbalanced (long tail: min 1, max 1,412, mean 25.29 images per class)
- Test: 219,124 images
- Metric: macro F1 score
- Goal: gold medal (macro F1 ≥ 0.63151)
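
Macro F1 weights every class equally, so a class with one training image counts as much as the class with 1,412. The sketch below is a minimal pure-Python version on toy labels. The function name `macro_f1` and the data are illustrative, and it averages over the classes that appear in either the labels or the predictions, which may differ from the competition's exact scorer.

```python
from collections import Counter

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in y_true or y_pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted class gains a false positive
            fn[t] += 1   # true class gains a false negative
    classes = set(y_true) | set(y_pred)
    scores = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy example: class 0 is common, classes 1 and 2 are rare.
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 0, 2]
score = macro_f1(y_true, y_pred)   # per-class F1: 8/9, 0, 1
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Here accuracy is 5/6 while macro F1 is 17/27 (about 0.63), because the single missed rare class contributes a full zero to the average.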

## Updated Strategy (Incorporating Expert Advice)
1. **Environment Setup**: Install PyTorch 2.4.1+cu121 (the cu121 wheels run on the driver shown above, which reports CUDA 12.4), timm, albumentations, and scikit-learn. Use AMP (automatic mixed precision) for training and inference.
2. **Quick Baseline (Retrieval/Embeddings - First Submission)**:
   - Use timm tf_efficientnet_b3_ns (num_classes=0) to extract embeddings on train/test (input 512px, center crop).
   - Compute class prototypes: mean embedding per category_id, using up to 5 images per class (single-image classes use their one embedding).
   - For test: embed each image and predict the nearest prototype by cosine similarity (FAISS if possible, else batched matmul).
   - Add simple TTA: average the embeddings of the original image and its horizontal flip.
   - Then train a linear head on the train embeddings with Balanced Softmax or CB-Focal loss and a class-balanced sampler.
   - This yields a fast first submission (target: bronze/silver); iterate from there.
3. **Data Pipeline**:
   - Custom Dataset: Load from nybg2020/train/images/{dir}/{file} using file_name from metadata.
   - Preprocess: crop tightly to the non-white content (threshold the margins), resize the shorter side to 512, center-crop to 480, normalize with ImageNet mean/std.
   - Augmentations (for training): HorizontalFlip, small Rotate(±15°), BrightnessContrast (mild), Mixup α=0.2, CutMix p=0.2, Label Smoothing 0.05. Avoid heavy rotation/blur/perspective.
4. **Handling Imbalance**:
   - CB-Focal Loss (beta=0.9999, gamma=1.5-2.0) or LDAM-DRW.
   - Sampler: Class-Aware (uniform over classes) or sqrt-frequency.
   - Two-stage: Stage 1 instance-balanced full data; Stage 2 class-balanced head fine-tune.
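   - CV: single stratified 90/10 split, or 2 folds at most, for OOF macro F1. Classes with a single image cannot appear on both sides of a split, so keep them in the training part.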
5. **Full Models**:
   - Primary: tf_efficientnetv2_m or tf_efficientnet_b5_ns via timm, GeM pooling if easy.
   - Diversity: resnest101e at 384px.
   - Train: AdamW with lr 3e-4, cosine schedule, 500 warmup steps, batch size 64 (AMP), EMA decay 0.9998.
   - Fine-tune at 512px for 1-1.5 epochs, lr 1e-4, no Mixup/CutMix.
6. **Evaluation & Inference**:
   - Macro F1 on OOF; early stop on val macro F1.
   - TTA: orig + hflip (2-4x), average logits.
   - Ensemble: Weighted average of 1-2 models by OOF F1.
   - Submission: `Id` must match the test image `id`, and `Predicted` must be the original `category_id`. If training uses contiguous class indices, map them back before writing the file.
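
The prototype retrieval in step 2 can be sketched without any deep-learning dependency. The code below is a toy illustration only: the 2-D vectors stand in for backbone embeddings, the helper names (`l2_normalize`, `build_prototypes`, `predict`) are hypothetical, and at real scale the similarity search would be a batched matrix product or FAISS, as the plan states.

```python
import math
from collections import defaultdict

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def build_prototypes(embeddings, labels, max_per_class=5):
    """Mean of up to max_per_class normalized embeddings per category_id."""
    grouped = defaultdict(list)
    for emb, label in zip(embeddings, labels):
        if len(grouped[label]) < max_per_class:
            grouped[label].append(l2_normalize(emb))
    prototypes = {}
    for label, vecs in grouped.items():
        mean = [sum(col) / len(vecs) for col in zip(*vecs)]
        prototypes[label] = l2_normalize(mean)
    return prototypes

def predict(prototypes, query):
    """Return the category_id whose prototype has the highest cosine similarity."""
    q = l2_normalize(query)
    return max(prototypes, key=lambda c: sum(a * b for a, b in zip(prototypes[c], q)))

# Toy data: 15672 is a category_id from the training output; 42 is made up.
train_emb = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0]]
train_lab = [15672, 15672, 42]
protos = build_prototypes(train_emb, train_lab)
pred = predict(protos, [0.2, 0.8])
```

Because both prototypes and queries are unit length, the dot product equals cosine similarity, which is what makes the batched-matmul variant a drop-in replacement.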

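For the imbalance handling in step 4, the class-balanced part of CB-Focal weights each class by the inverse of its effective number of samples, (1 - β^n) / (1 - β), and a sqrt-frequency sampler draws each image with weight 1/sqrt(n). The sketch below computes both from class counts. The function names and the toy counts are illustrative, and the focal term itself is not shown.

```python
def class_balanced_weights(counts, beta=0.9999):
    """Per-class loss weights proportional to 1 / effective number of samples."""
    raw = {c: (1.0 - beta) / (1.0 - beta ** n) for c, n in counts.items()}
    scale = len(raw) / sum(raw.values())   # normalize so weights sum to the class count
    return {c: w * scale for c, w in raw.items()}

def sqrt_sampling_weights(counts):
    """Per-image sampling weight 1/sqrt(n_c); a class is then drawn with probability ∝ sqrt(n_c)."""
    return {c: n ** -0.5 for c, n in counts.items()}

# Toy counts echoing the dataset's extremes (min 1, max 1412) and its mean (~25).
counts = {"tail": 1, "typical": 25, "head": 1412}
weights = class_balanced_weights(counts)
sample_w = sqrt_sampling_weights(counts)
```

In PyTorch, per-class weights like these could be passed as the `weight` argument of `torch.nn.CrossEntropyLoss`, and per-image weights to `torch.utils.data.WeightedRandomSampler`; whether that matches the final training setup is an assumption.
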
## 24h Timeline
- 0-1h: PyTorch install, implement tight crop, quick embedding baseline in '01_quick_baseline.ipynb'.
- 1-4h: Extract embeddings, compute prototypes, generate first submission.csv (retrieval).
- 4-6h: Train linear head on embeddings, evaluate OOF macro F1.
- 6-13h: Train Model A (tf_efficientnetv2_m 384px) with CB-Focal, Class-Aware sampler, AMP/EMA.
- 13-15h: Fine-tune Model A at 512px.
- 15-21h: Train Model B (resnest101e 384px), fine-tune at 512px.
- 21-23h: TTA inference, blend, generate final submission.
- 23-24h: Sanity check, submit.

## Avoided Mistakes
- No top-class subsampling; ensure tail coverage.
- Mild augs only; preserve botanical details.
- 1-2 folds max; full data training.
- Verify paths (use file_name), id mapping, submission format.
- Log macro F1 per epoch; monitor rare class performance.
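
The id-mapping check above matters because a classifier head outputs contiguous indices 0..C-1, while `Predicted` must hold original `category_id` values. A minimal sketch of the round trip follows. The helper names are hypothetical, and whether the dataset's `category_id` values are already contiguous is not shown in the output above, so the mapping is built explicitly.

```python
def build_label_maps(category_ids):
    """Map original category_id values to contiguous indices and back."""
    unique = sorted(set(category_ids))
    cat_to_idx = {cat: i for i, cat in enumerate(unique)}
    idx_to_cat = {i: cat for cat, i in cat_to_idx.items()}
    return cat_to_idx, idx_to_cat

def make_submission_rows(test_ids, pred_indices, idx_to_cat):
    """Pair each test id with the original category_id of its predicted index."""
    if len(test_ids) != len(pred_indices):
        raise ValueError("one prediction is required per test id")
    return [(tid, idx_to_cat[p]) for tid, p in zip(test_ids, pred_indices)]

# Toy values: 15672 appears in the training output; 7 and 300 are made up.
cat_to_idx, idx_to_cat = build_label_maps([15672, 7, 15672, 300])
rows = make_submission_rows([0, 1, 2], [2, 0, 1], idx_to_cat)
```

Each row is an `(Id, Predicted)` pair in the column order of `sample_submission.csv`; comparing the row count against the 219,124 rows of the sample file is a cheap final check.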

## Next Steps
- Create '01_quick_baseline.ipynb': PyTorch setup, embedding extraction, prototype retrieval for fast submission.
- After the first OOF score, request expert review with that score to decide between a second model and the two-stage fine-tune.
- If retrieval baseline scores well, pivot to light fine-tune.