# iMet Collection 2020 - FGVC7: Plan and EDA

## 1. Goal
The objective is to classify artworks from The Met's collection with fine-grained attributes. This is a multi-label image classification problem. The evaluation metric is the micro F1-score.

**Medal Targets:**
- **Gold:** ≥ 0.696
- **Silver:** ≥ 0.649
- **Bronze:** ≥ 0.649

## 2. Revised Workflow Plan (Post-Expert Review)
The initial plan has been updated based on expert feedback to accelerate progress towards a medal-winning solution. The focus is on using proven high-impact techniques for multi-label classification from the start.

### Key Changes from Initial Plan:
- **Model:** Start directly with a strong baseline: `tf_efficientnet_b4_ns` at 384px.
- **Validation:** Use `MultilabelStratifiedKFold` immediately, so that rare labels are spread across folds and validation scores stay representative.
- **Loss Function:** Prioritize `Asymmetric Loss (ASL)` or `BCEWithLogitsLoss` with `pos_weight` to handle class imbalance.
- **Training:** Employ Automatic Mixed Precision (AMP) and potentially Exponential Moving Average (EMA) for faster and more stable training.
- **Thresholding:** **Do not use a fixed 0.5 threshold.** Tune a global threshold on Out-of-Fold (OOF) predictions. Implement a fallback to predict the highest-scoring class if no prediction exceeds the threshold.
- **TTA:** Use horizontal flip Test-Time Augmentation (TTA) during inference.
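
The thresholding rule in the list above can be sketched in NumPy as follows. This is a minimal illustration, not a tuned implementation: the function names (`micro_f1`, `apply_threshold`, `tune_global_threshold`) and the search grid of 0.05 to 0.95 in steps of 0.01 are assumptions made for the example, and `probs` is assumed to hold sigmoid probabilities of shape `(n_images, n_classes)`.

```python
import numpy as np

def micro_f1(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Micro-averaged F1: pool TP/FP/FN over every (image, class) cell."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0

def apply_threshold(probs: np.ndarray, thr: float) -> np.ndarray:
    """Binarize probabilities, then apply the "at-least-one" fallback."""
    preds = (probs >= thr).astype(np.int8)
    # Rows where nothing cleared the threshold get their top-scoring class.
    rows = np.flatnonzero(preds.sum(axis=1) == 0)
    preds[rows, probs[rows].argmax(axis=1)] = 1
    return preds

def tune_global_threshold(y_true: np.ndarray, probs: np.ndarray, grid=None):
    """Grid-search one global threshold; the grid range is an assumption."""
    if grid is None:
        grid = np.arange(0.05, 0.95, 0.01)
    scores = [micro_f1(y_true, apply_threshold(probs, t)) for t in grid]
    best = int(np.argmax(scores))
    return float(grid[best]), float(scores[best])
```

Scoring each candidate threshold with the fallback already applied keeps the tuned value consistent with how predictions are generated at inference time.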

### Phase 1: Fast-Track Setup & EDA (Hours 0-4)
1.  **Essential EDA:**
    - Load `train.csv` and `labels.csv`.
    - Create the multi-hot encoded label matrix.
    - Analyze label frequencies and the distribution of labels per image to understand the imbalance.
    - Create stable mappings: `attribute_id <-> class_index`.
2.  **Validation Strategy:**
    - Implement a `MultilabelStratifiedKFold` split (e.g., 5 folds). We will start by training on just one fold to iterate quickly.
3.  **Dataloader Pipeline:**
    - Create a PyTorch `Dataset`.
    - Use `albumentations` for augmentations: `RandomResizedCrop(384)`, `HorizontalFlip`, light color jitter.
    - Ensure dataloaders are fast: set `num_workers`, `pin_memory=True`.
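
The label-encoding steps from the EDA item above could look roughly like this. It is a sketch under assumptions: the helper names `build_label_maps` and `multi_hot` are hypothetical, and each training row is assumed to carry its labels as a space-separated string of attribute ids (the same format the submission file uses).

```python
import numpy as np

def build_label_maps(attribute_ids):
    """Stable attribute_id <-> class_index mappings (sorted by attribute id)."""
    ids = sorted(set(int(a) for a in attribute_ids))
    id_to_idx = {a: i for i, a in enumerate(ids)}
    idx_to_id = {i: a for a, i in id_to_idx.items()}
    return id_to_idx, idx_to_id

def multi_hot(label_strings, id_to_idx):
    """Turn space-separated attribute-id strings into an (N, C) 0/1 matrix."""
    y = np.zeros((len(label_strings), len(id_to_idx)), dtype=np.float32)
    for row, s in enumerate(label_strings):
        cols = [id_to_idx[int(a)] for a in s.split()]
        if cols:  # skip rows with no labels
            y[row, cols] = 1.0
    return y

def label_stats(y):
    """Per-class frequency and labels-per-image counts for the imbalance check."""
    return y.sum(axis=0), y.sum(axis=1)
```

With pandas, `attribute_ids` would come from the `attribute_id` column of `labels.csv` and `label_strings` from the label column of `train.csv` (assumed here to be named `attribute_ids`). Sorting the ids before enumerating keeps the mapping identical across runs.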

### Phase 2: Strong Baseline Training (Hours 4-14)
1.  **Model & Training Recipe:**
    - **Model:** `timm.create_model('tf_efficientnet_b4_ns', pretrained=True, num_classes=N_CLASSES)`.
    - **Image Size:** 384x384.
    - **Loss:** `BCEWithLogitsLoss` with pre-calculated `pos_weight`. (ASL is a later improvement if needed).
    - **Optimizer:** `AdamW`.
    - **Scheduler:** `CosineAnnealingLR`, preceded by a short warmup phase (the scheduler itself does not include one).
    - **Training:** Use AMP (`torch.cuda.amp`). Train for 6-8 epochs on a single fold to establish a baseline score and training time.
2.  **Evaluation & Checkpointing:**
    - In the validation loop, save OOF predictions (logits/probabilities).
    - Track both validation loss and micro F1-score.
    - Save the model checkpoint with the best micro F1-score.
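
One common way to pre-calculate `pos_weight` for the loss in step 1 is the per-class ratio of negatives to positives. The sketch below assumes a multi-hot label matrix `y` of shape `(n_images, n_classes)`; the function name `compute_pos_weight` and the cap `max_weight=50.0` are hypothetical choices for illustration, not values taken from the plan.

```python
import numpy as np

def compute_pos_weight(y: np.ndarray, max_weight: float = 50.0) -> np.ndarray:
    """Per-class negatives/positives ratio, capped to avoid extreme weights."""
    n = y.shape[0]
    pos = y.sum(axis=0)
    neg = n - pos
    # Guard against division by zero for classes with no positives.
    w = neg / np.maximum(pos, 1.0)
    # The cap is an assumption: very rare classes would otherwise dominate.
    return np.minimum(w, max_weight).astype(np.float32)

# In PyTorch the result would be passed as, for example:
#   torch.nn.BCEWithLogitsLoss(pos_weight=torch.tensor(w))
```

The weights should be computed from the training fold only, so that validation labels do not leak into the loss.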

### Phase 3: Optimization & Scaling (Hours 14-20)
1.  **Threshold Optimization:**
    - Using the validation predictions saved for the first fold (the only OOF predictions available at this point), run a grid search for the global threshold that maximizes the micro F1-score.
    - Implement the "at-least-one" fallback: if an image has no predictions above the threshold, assign it the label with the highest probability.
2.  **Cross-Validation Training:**
    - If time permits and the single-fold model is promising, train models on the remaining folds of the `MultilabelStratifiedKFold` split.
3.  **Inference:**
    - Write a clean inference loop for the test set.
    - Implement horizontal flip TTA: predict on original and flipped images, then average the logits.
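
The horizontal-flip TTA in step 3 can be sketched independently of the model. Here `predict_logits` is a hypothetical callable standing in for the trained network (it takes a batch and returns logits of shape `(N, n_classes)`), and images are assumed to be laid out as `(N, C, H, W)`, so the width axis is the last one.

```python
import numpy as np

def predict_with_hflip_tta(predict_logits, images: np.ndarray) -> np.ndarray:
    """Average logits over the original batch and its horizontal mirror."""
    logits = predict_logits(images)
    # Reverse the last (width) axis; in PyTorch this would be
    # torch.flip(images, dims=[-1]).
    flipped = np.ascontiguousarray(images[..., ::-1])
    logits_flipped = predict_logits(flipped)
    return (logits + logits_flipped) / 2.0
```

Averaging happens in logit space, as the plan specifies; the sigmoid is applied only once, after all averaging is done.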

### Phase 4: Final Submission (Hours 20-24)
1.  **Ensembling & Prediction:**
    - Average the (TTA-enhanced) logit predictions from all trained fold models.
2.  **Final Prediction Generation:**
    - Convert the averaged logits to probabilities with a sigmoid, then apply the globally optimized threshold.
    - Apply the "at-least-one" fallback rule.
    - Convert the binary predictions back to `attribute_ids`.
3.  **Submission:**
    - Format the predictions into `submission.csv` ensuring the `id` order matches `sample_submission.csv` and `attribute_ids` are space-separated strings.
    - Perform a final sanity check on the file format.
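
The Phase 4 steps can be tied together in a few lines. This is a sketch under assumptions: `fold_logits` is a list of per-fold logit arrays of identical shape `(n_test, n_classes)`, `idx_to_id` is the class-index to `attribute_id` mapping built during EDA, and the function name `logits_to_attribute_strings` is hypothetical.

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def logits_to_attribute_strings(fold_logits, threshold, idx_to_id):
    """Average fold logits, threshold with fallback, emit id strings."""
    mean_logits = np.mean(np.stack(fold_logits, axis=0), axis=0)
    probs = sigmoid(mean_logits)
    preds = probs >= threshold
    # "At-least-one" fallback: rows with no prediction get their top class.
    rows = np.flatnonzero(~preds.any(axis=1))
    preds[rows, probs[rows].argmax(axis=1)] = True
    # Space-separated attribute ids, one string per test image.
    return [
        " ".join(str(idx_to_id[int(j)]) for j in np.flatnonzero(row))
        for row in preds
    ]
```

The returned strings would fill the `attribute_ids` column of `submission.csv`, with rows kept in the `id` order of `sample_submission.csv`. A quick sanity check is to confirm that no string is empty and that the row count matches the sample file.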