# SETI Breakthrough Listen - E.T. Signal Search: Revised Plan (Post-Expert Review)

This notebook outlines the revised plan to tackle the SETI Breakthrough Listen competition, incorporating feedback from Kaggle Grandmasters. The goal is to achieve a medal-winning score on the AUC-ROC metric.

## 1. Initial Setup & Environment
*   **Goal:** Prepare the environment for the project.
*   **Actions:**
    *   Install `timm` and `albumentations` if not present.
    *   Import the necessary libraries (pandas, numpy, matplotlib, sklearn, torch, timm, albumentations).
    *   Define constants for file paths.
    *   Set up logging and device (CUDA/CPU).
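
The setup steps above could look roughly like the sketch below. The directory `data/`, the seed value, and the helper names `seed_everything` and `get_device` are assumptions made for illustration, not part of the competition layout. The `torch` import is guarded so the cell still runs where PyTorch is missing.

```python
import logging
import random
from pathlib import Path

import numpy as np

# Hypothetical locations; adjust to the actual competition layout.
DATA_DIR = Path("data")
TRAIN_LABELS = DATA_DIR / "train_labels.csv"
SAMPLE_SUBMISSION = DATA_DIR / "sample_submission.csv"
SEED = 42  # arbitrary choice

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("seti")


def seed_everything(seed: int) -> None:
    """Seed Python, NumPy and (if installed) PyTorch for repeatable runs."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch

        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # safe to call without a GPU
    except ImportError:
        logger.warning("torch not installed; skipping torch seeding")


def get_device() -> str:
    """Return "cuda" when PyTorch sees a GPU, otherwise "cpu"."""
    try:
        import torch

        return "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:
        return "cpu"


seed_everything(SEED)
DEVICE = get_device()
logger.info("Using device: %s", DEVICE)
```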

## 2. Exploratory Data Analysis (EDA)
*   **Goal:** Understand the data structure and confirm key assumptions.
*   **Actions:**
    *   Inspect the file structure using `ls -R`. Avoid any files in `old_leaky_data/`.
    *   Load `train_labels.csv` and analyze the target distribution. Calculate `pos_weight` for the loss function.
    *   Load `sample_submission.csv` to understand the required submission format.
    *   Load a single data file (`.npy`) to confirm its shape is `(6, 273, 256)`. The 6 slices form 3 consecutive ON-OFF cadence pairs.
    *   Visualize a few positive and negative samples after applying the 3-channel difference preprocessing.
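
As a minimal sketch of the label analysis and shape check, the snippet below computes `pos_weight` as the ratio of negatives to positives, which is the convention `BCEWithLogitsLoss` expects. The column names `id` and `target` and the synthetic frame are assumptions standing in for the real `train_labels.csv`. The helpers `compute_pos_weight` and `check_cadence_shape` are hypothetical names.

```python
import numpy as np
import pandas as pd


def compute_pos_weight(labels: pd.Series) -> float:
    """Return n_negative / n_positive for a binary label column."""
    n_pos = int((labels == 1).sum())
    n_neg = int((labels == 0).sum())
    if n_pos == 0:
        raise ValueError("no positive samples; pos_weight is undefined")
    return n_neg / n_pos


def check_cadence_shape(arr: np.ndarray, expected=(6, 273, 256)) -> None:
    """Raise if a loaded cadence does not have the expected shape."""
    if arr.shape != expected:
        raise ValueError(f"unexpected shape {arr.shape}, expected {expected}")


# Synthetic stand-in for pd.read_csv("train_labels.csv"); column names are assumed.
train_labels = pd.DataFrame(
    {"id": [f"s{i}" for i in range(10)], "target": [0] * 9 + [1]}
)
print(train_labels["target"].value_counts(normalize=True))

pos_weight = compute_pos_weight(train_labels["target"])
print("pos_weight:", pos_weight)

# Synthetic stand-in for np.load(<path to one .npy file>).
check_cadence_shape(np.zeros((6, 273, 256), dtype=np.float16))
```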

## 3. Data Preparation & Preprocessing (Top Priority)
*   **Goal:** Create a robust data loading and preprocessing pipeline based on expert advice.
*   **Actions:**
    *   **Input Representation:**
        *   Do NOT use raw 6 channels. The input data is a sequence of 3 ON-OFF cadence pairs.
        *   Create a 3-channel image by taking the difference between each ON-OFF pair: `[ON_1 - OFF_1, ON_2 - OFF_2, ON_3 - OFF_3]`.
    *   **Normalization:**
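        *   Apply a per-sample, per-channel normalization scheme. A good starting point is to apply `log1p` for contrast and then standardize (z-score). Because ON-OFF differences can be negative and `log1p` is undefined for values at or below -1, use a signed variant such as `sign(x) * log1p(|x|)`.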
        *   Ensure identical normalization is applied to train, validation, and test sets.
    *   **Dataset Class:**
        *   Create a `torch.utils.data.Dataset` class that loads `.npy` files on-the-fly.
        *   The `__getitem__` method will perform the 3-channel differencing and normalization.
    *   **Augmentations:**
        *   Use light, physically-sensible augmentations with `albumentations`.
        *   Good choices: HorizontalFlip, VerticalFlip, small time/frequency shifts (e.g., `ShiftScaleRotate` with small shifts and no rotation).
        *   Avoid heavy distortions. MixUp/CutMix are noted to be less effective for this problem.
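
The differencing and normalization steps above can be sketched in NumPy as follows. Two things here are assumptions rather than confirmed facts: the slices are taken to be ordered `ON_1, OFF_1, ON_2, OFF_2, ON_3, OFF_3`, and the contrast step uses a signed `log1p` so that negative differences stay finite. The function name `preprocess_cadence` is hypothetical.

```python
import numpy as np


def preprocess_cadence(cadence: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Turn a (6, H, W) cadence into a normalized (3, H, W) difference image."""
    if cadence.ndim != 3 or cadence.shape[0] != 6:
        raise ValueError(f"expected shape (6, H, W), got {cadence.shape}")
    x = cadence.astype(np.float32)
    # Assumed ordering: even slices are ON, odd slices are OFF.
    diff = x[0::2] - x[1::2]
    # Signed log1p keeps negative differences finite (assumption, see lead-in).
    diff = np.sign(diff) * np.log1p(np.abs(diff))
    # Per-sample, per-channel z-score.
    mean = diff.mean(axis=(1, 2), keepdims=True)
    std = diff.std(axis=(1, 2), keepdims=True)
    return (diff - mean) / (std + eps)


# Synthetic stand-in for one loaded .npy file.
rng = np.random.default_rng(0)
sample = rng.normal(size=(6, 273, 256)).astype(np.float16)
image = preprocess_cadence(sample)
print(image.shape, image.dtype)
```

A `Dataset.__getitem__` could call this function on the loaded array and wrap the result with `torch.from_numpy`. Since the statistics are computed per sample, the same code applies unchanged to train, validation, and test data.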

## 4. Baseline Model & Training
*   **Goal:** Build and train a strong baseline model using a robust validation strategy.
*   **Actions:**
    *   **Validation Strategy (Crucial):**
        *   Use `StratifiedGroupKFold` (e.g., k=5).
        *   Group samples by their base ID (e.g., the part of the filename before the first underscore) to prevent leakage from near-duplicate observations.
    *   **Model Choice:**
        *   Use a pretrained 3-channel CNN from `timm`. Start with `efficientnet_b0` or `convnext_tiny` for speed, then move to `efficientnet_b2/b3` for performance.
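        *   With 3 input channels, the pretrained first convolutional layer can be used as-is; it would only need changing if the channel count differed. Most `timm` CNNs accept the native `273x256` input thanks to global pooling, but resizing to `224x224` or `256x256` is a simple alternative.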
    *   **Training Loop:**
        *   **Loss Function:** `BCEWithLogitsLoss` with `pos_weight` calculated from the training data imbalance.
        *   **Optimizer:** `AdamW`.
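        *   **Scheduler:** `CosineAnnealingLR` preceded by a short linear warmup. `CosineAnnealingLR` has no built-in warmup, so chain `LinearLR` and `CosineAnnealingLR` with `SequentialLR`.
        *   **Mixed Precision:** Use `torch.cuda.amp` (`autocast` plus `GradScaler`) for faster training.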
        *   **Metric & Stopping:** Monitor validation AUC-ROC and use Early Stopping to save the best model per fold.

## 5. Iteration, Ensembling, and Submission
*   **Goal:** Improve the baseline and generate the final submission.
*   **Actions:**
    *   **Cross-Validation:** Train one model per fold across all 5 folds of the `StratifiedGroupKFold` split.
    *   **Test-Time Augmentation (TTA):** For inference, apply augmentations (e.g., horizontal/vertical flips) to each test sample and average the predictions.
    *   **Ensembling:** The primary ensemble strategy will be to average the predictions from the 5 models trained on different folds.
    *   **Submission:**
        *   Run inference on the test set using the ensembled models with TTA.
        *   Format the predictions into `submission.csv`.
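
A minimal NumPy sketch of flip-based TTA combined with fold averaging is shown below. The `predict_fn` callables stand in for trained fold models and are hypothetical stubs. A real version would wrap a `torch` model in evaluation mode and apply a sigmoid to its logits.

```python
import numpy as np


def tta_views(batch: np.ndarray):
    """Yield the original (N, C, H, W) batch plus its two axis flips."""
    yield batch
    yield batch[..., ::-1]     # flip along the last axis (W)
    yield batch[..., ::-1, :]  # flip along the second-to-last axis (H)


def predict_with_tta(predict_fn, batch: np.ndarray) -> np.ndarray:
    """Average one model's probabilities over all TTA views."""
    preds = [predict_fn(np.ascontiguousarray(view)) for view in tta_views(batch)]
    return np.mean(preds, axis=0)


def ensemble_folds(predict_fns, batch: np.ndarray) -> np.ndarray:
    """Average the TTA predictions of every fold model."""
    return np.mean([predict_with_tta(fn, batch) for fn in predict_fns], axis=0)


def make_stub_model(bias: float):
    """Hypothetical stand-in for a trained fold model (returns probabilities)."""

    def predict(batch: np.ndarray) -> np.ndarray:
        logits = batch.mean(axis=(1, 2, 3)) + bias
        return 1.0 / (1.0 + np.exp(-logits))

    return predict


rng = np.random.default_rng(0)
test_batch = rng.normal(size=(4, 3, 8, 8)).astype(np.float32)
fold_models = [make_stub_model(b) for b in (-0.5, 0.0, 0.5)]
probs = ensemble_folds(fold_models, test_batch)
print(probs)
```

Averaging probabilities is the simplest choice. Because AUC-ROC depends only on ranking, averaging ranks across folds is a common alternative worth comparing on out-of-fold predictions.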

## 6. Performance Targets & Sanity Checks
*   **Initial Sanity Check:** A single-fold training run should achieve a validation AUC > 0.73. If not, debug the preprocessing, normalization, and data loading.
*   **Medal Target:** A 5-fold cross-validated average AUC should be >= 0.77 for a strong result. The goal is to push this higher with model selection and tuning.