# 🎯 Chest X-Ray Classification - Optimized for 80%+ Score

**Target Score: 80-82% (Public Leaderboard)**

## 📋 Strategy:

This notebook reproduces the **80.122% baseline** configuration that actually works!

### Why This Approach?

After extensive experiments (Exp1-5), we found:
- ❌ Complex models (ConvNeXt-Base, 512px) → **Lower scores** (72-76%)
- ✅ **Simple config (ResNet18, 224px) → 80.122%** ✨

### Key Success Factors:

1. ✅ **ResNet18** (not ConvNeXt) - Simple but effective
2. ✅ **224px** (not 512px) - Optimal resolution
3. ✅ **Weighted Sampler** - Handles COVID-19 (only 1% of samples)
4. ✅ **Label Smoothing 0.05** - Prevents overfitting
5. ✅ **TTA** - Test-Time Augmentation for +1-2% boost
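
To see why a weighted sampler helps with a class that makes up only about 1% of the data, here is a minimal, framework-free sketch. The class counts are made-up illustration values, not the competition's real distribution, and `inverse_frequency_weights` is a hypothetical helper. The repository's own sampler may differ; in PyTorch, weights like these are typically passed to `torch.utils.data.WeightedRandomSampler`.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return one sampling weight per sample: 1 / (count of that sample's class)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


# Illustrative labels only (NOT the real dataset): COVID-19 is 1% of samples.
labels = ["normal"] * 40 + ["bacteria"] * 39 + ["virus"] * 20 + ["COVID-19"] * 1
weights = inverse_frequency_weights(labels)

# Total sampling mass per class. A sampler that draws proportionally to
# `weights` picks each class with probability mass / total.
class_mass = Counter()
for label, weight in zip(labels, weights):
    class_mass[label] += weight

total = sum(class_mass.values())
for label, mass in class_mass.items():
    print(f"{label:10s} share of draws: {mass / total:.2f}")
```

With inverse-frequency weights every class receives the same share of draws, so the single COVID-19 image is seen as often as the 40 normal images combined.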

## ⏱️ Time Required:

- **Setup**: 5-10 minutes
- **Training**: 15-20 minutes (A100) or 40-60 minutes (T4)
- **TTA Inference**: 3-5 minutes
- **Total**: ~30 minutes on A100

## 🎯 Expected Performance:

| Method | Val F1 | Public Score | Time (A100) |
|--------|--------|--------------|-------------|
| Baseline | 0.80-0.82 | 80-81% | 15 min |
| **Baseline + TTA** | **0.81-0.83** | **81-82%** | **20 min** |

---

## 🔧 Before You Start:

### 1. Change Runtime Type:
- Click: `Runtime` → `Change runtime type`
- Hardware accelerator: **GPU**
- GPU type: **A100** (fastest) or T4 (slower but works)

### 2. Get Kaggle API Key:
- Go to: https://www.kaggle.com/settings
- Scroll to "API" section
- Click "Create New API Token"
- Download `kaggle.json`

### 3. Join Competition:
- Visit: https://www.kaggle.com/competitions/cxr-multi-label-classification
- Click "Join Competition" and accept rules

### 4. Run All Cells:
- Just click: `Runtime` → `Run all`
- Upload `kaggle.json` when prompted
- Wait ~30 minutes for training + TTA

---

## Step 0: Verify GPU

⚠️ **CRITICAL**: You MUST have a GPU enabled!

In [None]:
import torch

print("=" * 60)
print("GPU VERIFICATION")
print("=" * 60)

if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
    
    print(f"\n[OK] GPU: {gpu_name}")
    print(f"[OK] Memory: {gpu_memory:.1f} GB")
    print(f"[OK] CUDA: {torch.version.cuda}")
    print(f"[OK] PyTorch: {torch.__version__}")
    
    if "A100" in gpu_name:
        print("\n🚀 EXCELLENT: A100 GPU detected!")
        print("   Training will take ~15-20 minutes")
    elif "T4" in gpu_name:
        print("\n⚡ GOOD: T4 GPU detected!")
        print("   Training will take ~40-60 minutes")
    else:
        print(f"\nℹ️  Detected: {gpu_name}")
    
    # Enable optimizations
    torch.set_float32_matmul_precision('medium')
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
    print(f"\n[OK] TF32 enabled: {torch.backends.cuda.matmul.allow_tf32}")
else:
    print("\n❌ NO GPU DETECTED!")
    print("\n⚠️  Please enable GPU:")
    print("   Runtime → Change runtime type → GPU")
    raise Exception("GPU required for training")

print("=" * 60)

## Step 1: Clone Repository

Download the training code and pre-split data from GitHub.

In [None]:
import os
import shutil

print("=" * 60)
print("CLONE REPOSITORY")
print("=" * 60)

REPO_URL = "https://github.com/thc1006/nycu-CSIC30014-LAB3.git"
PROJECT_DIR = "nycu-CSIC30014-LAB3"

# Remove if exists (to get latest version)
if os.path.exists(PROJECT_DIR):
    print(f"\nRemoving existing {PROJECT_DIR}...")
    shutil.rmtree(PROJECT_DIR)

# Clone
print(f"\nCloning from GitHub...")
!git clone {REPO_URL}

# Change to project directory
os.chdir(PROJECT_DIR)
print(f"\n[OK] Working directory: {os.getcwd()}")

# Show structure
print("\n[OK] Project structure:")
!ls -lh | head -15

print("\n" + "=" * 60)

## Step 2: Install Dependencies

In [None]:
print("=" * 60)
print("INSTALL DEPENDENCIES")
print("=" * 60)
print("\nThis will take 1-2 minutes...\n")

# Install PyTorch with CUDA 12.1
!pip install -q torch torchvision --index-url https://download.pytorch.org/whl/cu121

# Install dependencies
!pip install -q numpy pandas scikit-learn matplotlib tqdm pyyaml opencv-python seaborn albumentations

# Install Kaggle API
!pip install -q kaggle

print("\n[OK] Installation complete!")
print("=" * 60)

## Step 3: Setup Kaggle API

Upload your `kaggle.json` file to authenticate.
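
A `kaggle.json` token is a small JSON file with `username` and `key` fields. The sketch below, using only the standard library, shows one way to sanity-check an uploaded token before saving it. `validate_kaggle_token` is a hypothetical helper, not part of the Kaggle API, and the credentials in the example are dummies.

```python
import json


def validate_kaggle_token(raw_bytes):
    """Parse kaggle.json content and check that both expected fields are present."""
    try:
        token = json.loads(raw_bytes.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as err:
        raise ValueError(f"kaggle.json is not valid JSON: {err}") from err
    if not isinstance(token, dict):
        raise ValueError("kaggle.json must contain a JSON object")
    # Both fields must exist and be non-empty.
    missing = [field for field in ("username", "key") if not token.get(field)]
    if missing:
        raise ValueError(f"kaggle.json is missing fields: {missing}")
    return token


# Dummy credentials for illustration only.
example = b'{"username": "example_user", "key": "0123456789abcdef"}'
token = validate_kaggle_token(example)
print(f"Token looks valid for user: {token['username']}")
```

In the cell below, the same check could be applied to `uploaded['kaggle.json']` before the file is written to `~/.kaggle`.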

In [None]:
import os
import subprocess
from pathlib import Path

from google.colab import files

print("=" * 60)
print("KAGGLE API SETUP")
print("=" * 60)
print("\nPlease upload your kaggle.json file:")
print("(Click 'Choose Files' button below)\n")

uploaded = files.upload()

if 'kaggle.json' in uploaded:
    print("\n[OK] kaggle.json uploaded successfully!")
    
    # Setup Kaggle credentials
    kaggle_dir = Path.home() / '.kaggle'
    kaggle_dir.mkdir(exist_ok=True)
    
    kaggle_json_path = kaggle_dir / 'kaggle.json'
    with open(kaggle_json_path, 'wb') as f:
        f.write(uploaded['kaggle.json'])
    
    # Set permissions
    os.chmod(kaggle_json_path, 0o600)
    
    print(f"   Saved to: {kaggle_json_path}")
    print(f"   Permissions: 600\n")
    
    # Verify authentication
    print("Verifying authentication...")
    result = subprocess.run(
        ['kaggle', 'competitions', 'list', '--page', '1'],
        capture_output=True,
        text=True
    )
    
    if result.returncode == 0:
        print("[OK] Kaggle API authenticated!\n")
    else:
        print("[FAIL] Authentication failed!")
        print(f"Error: {result.stderr}")
else:
    print("\n[FAIL] kaggle.json not uploaded!")
    raise Exception("Please upload kaggle.json")

print("=" * 60)

## Step 4: Download Competition Dataset

⚠️ **IMPORTANT**: You MUST join the competition first!
- Visit: https://www.kaggle.com/competitions/cxr-multi-label-classification
- Click "Join Competition" and accept rules

In [None]:
import zipfile
from tqdm.auto import tqdm

print("=" * 60)
print("DOWNLOAD COMPETITION DATASET")
print("=" * 60)

COMPETITION_NAME = "cxr-multi-label-classification"

print(f"\nCompetition: {COMPETITION_NAME}")
print("\nIMPORTANT: Make sure you've:")
print("  1. Visited https://www.kaggle.com/competitions/cxr-multi-label-classification")
print("  2. Clicked 'Join Competition'")
print("  3. Accepted the rules")
print("\nDownloading...\n")

# Download from competition
result = subprocess.run(
    ['kaggle', 'competitions', 'download', '-c', COMPETITION_NAME],
    capture_output=True,
    text=True
)

if result.returncode != 0:
    if "403" in result.stderr or "Forbidden" in result.stderr:
        print("[FAIL] 403 Forbidden Error!")
        print("\nYou haven't accepted the competition rules yet.")
        print(f"\nPlease:")
        print(f"  1. Visit: https://www.kaggle.com/competitions/{COMPETITION_NAME}")
        print(f"  2. Click 'Join Competition'")
        print(f"  3. Accept the rules")
        print(f"  4. Re-run this cell")
        raise Exception("Need to join competition first")
    else:
        print(f"[FAIL] Download failed: {result.stderr}")
        raise Exception("Competition download failed")

print("[OK] Competition data downloaded!")

# Extract all zip files
print("\nExtracting files...")
zip_files = [f for f in os.listdir('.') if f.endswith('.zip')]

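if not zip_files:
    print("[FAIL] No zip files found!")
    # Stop here instead of reporting a successful extraction
    raise FileNotFoundError("No zip archives were downloaded")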
else:
    for zip_file in zip_files:
        print(f"\n  Processing: {zip_file}")
        
        with zipfile.ZipFile(zip_file, 'r') as zip_ref:
            file_list = zip_ref.namelist()
            
            for file in tqdm(file_list, desc="  Extracting", leave=False):
                zip_ref.extract(file, '.')
        
        os.remove(zip_file)
        print(f"  [OK] Extracted and removed {zip_file}")

print("\n" + "=" * 60)
print("DOWNLOAD & EXTRACTION COMPLETE")
print("=" * 60)

## Step 5: Verify Data

Check that we have all required files.

In [None]:
import pandas as pd

print("=" * 60)
print("VERIFY DATA")
print("=" * 60)

# Check directories and CSVs
expected_dirs = ['train_images', 'val_images', 'test_images']
expected_csvs = ['data/train_data.csv', 'data/val_data.csv', 'data/test_data.csv']

all_good = True

print("\nImage directories:")
for dir_name in expected_dirs:
    if os.path.exists(dir_name):
        count = len([f for f in os.listdir(dir_name) if f.endswith(('.jpeg', '.jpg', '.png'))])
        print(f"  [OK] {dir_name}/ ({count} images)")
    else:
        print(f"  [FAIL] {dir_name}/ NOT FOUND")
        all_good = False

print("\nCSV files:")
for csv_file in expected_csvs:
    if os.path.exists(csv_file):
        df = pd.read_csv(csv_file)
        print(f"  [OK] {csv_file} ({len(df)} samples)")
        
        # Show class distribution
        if 'train' in csv_file or 'val' in csv_file:
            label_cols = ['normal', 'bacteria', 'virus', 'COVID-19']
            if all(col in df.columns for col in label_cols):
                print(f"       Normal={int(df['normal'].sum())}, "
                      f"Bacteria={int(df['bacteria'].sum())}, "
                      f"Virus={int(df['virus'].sum())}, "
                      f"COVID-19={int(df['COVID-19'].sum())}")
    else:
        print(f"  [FAIL] {csv_file} NOT FOUND")
        all_good = False

if all_good:
    print("\n" + "=" * 60)
    print("[OK] ALL DATA VERIFIED!")
    print("=" * 60)
else:
    print("\n[FAIL] Some files missing!")
    raise Exception("Data verification failed")

## Step 6: 🔥 Train Baseline Model (80.122% Config)

### Configuration:
- Model: **ResNet18** (not ConvNeXt!)
- Image size: **224px** (not 512px!)
- Batch size: 32 (A100) or 16 (T4)
- Epochs: 12
- Loss: CrossEntropy + **Label Smoothing 0.05**
- **Weighted Sampler**: True (critical for COVID-19)
- AMP: bfloat16

### Expected:
- Training time: 15-20 min (A100)
- Val F1: 0.80-0.82
- GPU utilization: 60-80%
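
Label smoothing with ε = 0.05 replaces the one-hot target with a slightly softened distribution, so the loss never rewards pushing a probability all the way to 1. A minimal pure-Python sketch of the smoothed cross-entropy follows. It assumes the same convention as PyTorch's `CrossEntropyLoss(label_smoothing=...)`, where ε is spread uniformly over all K classes. The logits are made-up values and the repository's loss code may differ.

```python
import math


def smoothed_cross_entropy(logits, target_index, smoothing=0.05):
    """Cross-entropy of softmax(logits) against a label-smoothed target."""
    num_classes = len(logits)
    # Numerically stable log-softmax.
    max_logit = max(logits)
    log_norm = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    log_probs = [z - log_norm for z in logits]
    # Smoothed target: eps / K on every class, plus (1 - eps) on the true class.
    target = [smoothing / num_classes] * num_classes
    target[target_index] += 1.0 - smoothing
    return -sum(t * lp for t, lp in zip(target, log_probs))


# Made-up scores for normal, bacteria, virus, COVID-19.
logits = [4.0, 0.5, 0.2, -1.0]
plain = smoothed_cross_entropy(logits, target_index=0, smoothing=0.0)
smooth = smoothed_cross_entropy(logits, target_index=0, smoothing=0.05)
print(f"plain CE: {plain:.4f}  smoothed CE: {smooth:.4f}")
```

For a confident, correct prediction the smoothed loss is larger than the plain one, which is the mild penalty on over-confidence that the configuration relies on.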

In [None]:
# Set PYTHONPATH
os.environ['PYTHONPATH'] = os.getcwd()

print("=" * 60)
print("TRAINING BASELINE MODEL (80.122% CONFIG)")
print("=" * 60)
print(f"\nWorking directory: {os.getcwd()}")
print(f"Config: configs/model_small.yaml")
print(f"\nModel: ResNet18 @ 224px")
print(f"Epochs: 12")
print(f"Weighted Sampler: True")
print(f"Label Smoothing: 0.05")
print(f"\nTraining time: ~15-20 minutes (A100)")
print("\nYou can monitor GPU: Runtime → Manage sessions")
print("=" * 60)
print()

# Train using the baseline config
!python -m src.train --config configs/model_small.yaml

print()
print("=" * 60)
print("TRAINING COMPLETE!")
print("=" * 60)
print("\nModel saved to: outputs/run1/best.pt")
print("\nExpected Val F1: 0.80-0.82")
print("=" * 60)

## Step 7: Evaluate Model
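
The evaluation reports an F1 score. As a reference for what that number measures, here is a small pure-Python sketch of macro-averaged F1 (per-class F1, then an unweighted mean). Whether `src.eval` uses macro averaging is an assumption here, and the labels below are toy values.

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for cls in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


classes = ["normal", "bacteria", "virus", "COVID-19"]
# Toy labels for illustration only.
y_true = ["normal", "normal", "bacteria", "virus", "COVID-19", "bacteria"]
y_pred = ["normal", "bacteria", "bacteria", "virus", "normal", "bacteria"]
print(f"macro F1: {macro_f1(y_true, y_pred, classes):.3f}")
```

In this toy example the one missed COVID-19 case contributes an F1 of 0 for its class and pulls the average down by a full quarter, which illustrates why the rare class matters so much under macro averaging.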

In [None]:
print("=" * 60)
print("EVALUATING TRAINED MODEL")
print("=" * 60)
print()

model_path = 'outputs/run1/best.pt'

if not os.path.exists(model_path):
    print(f"[FAIL] Model not found: {model_path}")
    print("   Please run Step 6 (Training) first.")
else:
    print(f"[OK] Model found: {model_path}\n")
    
    !python -m src.eval --config configs/model_small.yaml --ckpt {model_path}

print("\n" + "=" * 60)

## Step 8: Generate Standard Predictions

First, let's generate standard predictions (without TTA).

In [None]:
print("=" * 60)
print("GENERATING STANDARD PREDICTIONS")
print("=" * 60)
print()

model_path = 'outputs/run1/best.pt'

if not os.path.exists(model_path):
    print(f"[FAIL] Model not found: {model_path}")
else:
    print(f"[OK] Model found: {model_path}\n")
    
    !python -m src.predict --config configs/model_small.yaml --ckpt {model_path}
    
    print("\n[OK] Predictions generated!")
    print("   Output: data/submission.csv")

print("\n" + "=" * 60)

## Step 9: Generate TTA Predictions (Recommended)

Test-Time Augmentation (TTA) is expected to add +0.5-1.5% F1.

### TTA Transforms:
1. Original image
2. Horizontal flip
3. Vertical flip
4. Rotate 90°
5. Rotate 180°
6. Rotate 270°

Average all 6 predictions for robust results.
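
The averaging step can be sketched without any deep-learning framework. Below, an image is a nested list, the six views are built with plain list operations, and `toy_model` is a hypothetical stand-in for the trained network that returns class probabilities. The actual `src.tta_predict` implementation may differ.

```python
def hflip(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def vflip(img):
    """Mirror the image top to bottom."""
    return img[::-1]


def rot90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def tta_views(img):
    """Original, horizontal flip, vertical flip, and 90/180/270 degree rotations."""
    r90 = rot90(img)
    r180 = rot90(r90)
    r270 = rot90(r180)
    return [img, hflip(img), vflip(img), r90, r180, r270]


def toy_model(img):
    """Hypothetical stand-in for the network: two 'probabilities' from one pixel."""
    p = img[0][0] / 10.0
    return [p, 1.0 - p]


def tta_predict(model, img):
    """Average the model's class probabilities over all TTA views."""
    preds = [model(view) for view in tta_views(img)]
    return [sum(col) / len(preds) for col in zip(*preds)]


image = [[1, 2],
         [3, 4]]
print(tta_predict(toy_model, image))
```

Averaging probabilities (rather than hard labels) keeps the result a valid distribution and lets views that disagree partially cancel out.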

In [None]:
print("=" * 60)
print("GENERATING TTA PREDICTIONS")
print("=" * 60)
print()
print("Test-Time Augmentation:")
print("  - 6 transformations (original, flips, rotations)")
print("  - Averages predictions for robustness")
print("  - Expected: +0.5-1.5% F1 boost")
print()

model_path = 'outputs/run1/best.pt'

if not os.path.exists(model_path):
    print(f"[FAIL] Model not found: {model_path}")
else:
    print(f"[OK] Model found: {model_path}\n")
    
    !python -m src.tta_predict --config configs/model_small.yaml --ckpt {model_path}
    
    print("\n[OK] TTA Predictions generated!")
    print("   Output: submission_tta.csv")

print("\n" + "=" * 60)

## Step 10: Download Submission Files
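
Before downloading, it can be worth checking that every row of a submission is one-hot across the four label columns. The sketch below uses only the standard library. The label column names come from this notebook, while `check_one_hot`, the `id` column, and the sample rows are hypothetical, and the check assumes the competition expects exactly one positive label per image.

```python
import csv
import io

LABEL_COLS = ["normal", "bacteria", "virus", "COVID-19"]


def check_one_hot(csv_file):
    """Return the 1-based data-row numbers whose label columns are not one-hot."""
    bad_rows = []
    for row_number, row in enumerate(csv.DictReader(csv_file), start=1):
        values = [float(row[col]) for col in LABEL_COLS]
        # A valid row has exactly one 1 and three 0s.
        if sorted(values) != [0.0, 0.0, 0.0, 1.0]:
            bad_rows.append(row_number)
    return bad_rows


# Hypothetical sample; the real file's ID column name may differ.
sample = io.StringIO(
    "id,normal,bacteria,virus,COVID-19\n"
    "a.jpeg,1,0,0,0\n"
    "b.jpeg,0,1,1,0\n"
    "c.jpeg,0,0,0,1\n"
)
print("Rows that are not one-hot:", check_one_hot(sample))
```

For a real file, the same function can be called on an open handle, for example `with open('submission_tta.csv', newline='') as f: check_one_hot(f)`.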

In [None]:
print("=" * 60)
print("DOWNLOAD SUBMISSION FILES")
print("=" * 60)
print()

# Check both submission files
standard_file = 'data/submission.csv'
tta_file = 'submission_tta.csv'

files_to_download = []

if os.path.exists(standard_file):
    df = pd.read_csv(standard_file)
    print(f"[OK] {standard_file} ({len(df)} samples)")
    files_to_download.append(standard_file)
    
    # Show distribution
    print("\nStandard prediction distribution:")
    pred_counts = df[['normal', 'bacteria', 'virus', 'COVID-19']].sum()
    for cls, count in pred_counts.items():
        pct = count / len(df) * 100
        print(f"  {cls:12s}: {int(count):4d} ({pct:5.2f}%)")

if os.path.exists(tta_file):
    df = pd.read_csv(tta_file)
    print(f"\n[OK] {tta_file} ({len(df)} samples)")
    files_to_download.append(tta_file)
    
    # Show distribution
    print("\nTTA prediction distribution:")
    pred_counts = df[['normal', 'bacteria', 'virus', 'COVID-19']].sum()
    for cls, count in pred_counts.items():
        pct = count / len(df) * 100
        print(f"  {cls:12s}: {int(count):4d} ({pct:5.2f}%)")

if files_to_download:
    print("\n" + "=" * 60)
    print("Downloading files...")
    print("=" * 60)
    
    from google.colab import files as colab_files
    
    for file in files_to_download:
        print(f"\nDownloading: {file}")
        colab_files.download(file)
    
    print("\n" + "=" * 60)
    print("DOWNLOAD COMPLETE!")
    print("=" * 60)
    print("\n📊 EXPECTED KAGGLE SCORES:")
    print("   - Standard: 80-81%")
    print("   - TTA (recommended): 81-82%")
    print("\n📝 NEXT STEPS:")
    print("   1. Go to Kaggle competition page")
    print("   2. Click 'Submit Predictions'")
    print("   3. Upload submission_tta.csv (recommended)")
    print("   4. Check your score on the leaderboard!")
    print("\n" + "=" * 60)
else:
    print("\n[FAIL] No submission files found!")
    print("Please run Steps 8 and 9 first.")

---

## 🎉 Training Complete!

### Performance Summary:

| Metric | Value |
|--------|-------|
| **Model** | ResNet18 (11M params) |
| **Image Size** | 224×224 |
| **Training Time** | ~15-20 minutes (A100) |
| **Expected Val F1** | 0.80-0.82 |
| **Expected Public Score** | **80-82%** |

### Why This Works:

1. ✅ **Simple is Better** - ResNet18 > Complex models for this dataset
2. ✅ **Weighted Sampler** - Handles COVID-19 (only 1% of samples)
3. ✅ **Label Smoothing** - Prevents overfitting
4. ✅ **TTA** - Low-risk improvement (+1-2%)

### Lessons Learned:

We ran 5 more complex experiments (ConvNeXt, EfficientNet, 512px), for example:
- Exp1 (ConvNeXt-Tiny, 288px): 76.15% ❌
- Exp2 (EfficientNetV2-S, 320px): 71.95% ❌
- **Baseline (ResNet18, 224px): 80.122% ✅**

**Conclusion**: Simple configuration works best!

### Next Steps to Reach 85%+ (Optional):

1. **Train Multiple Models** - ResNet34, ResNet50
2. **Soft Ensemble** - Average predictions from 3 models
3. **Longer Training** - 20-30 epochs with early stopping
4. **SWA** - Stochastic Weight Averaging
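
Soft ensembling, step 2 above, means averaging class probabilities across models before taking the argmax, rather than voting on hard labels. A minimal sketch with made-up probabilities for three hypothetical models:

```python
CLASSES = ["normal", "bacteria", "virus", "COVID-19"]


def soft_ensemble(prob_lists):
    """Average per-class probabilities from several models for one image."""
    n_models = len(prob_lists)
    return [sum(probs) / n_models for probs in zip(*prob_lists)]


# Made-up outputs of three hypothetical models for a single image.
model_probs = [
    [0.10, 0.50, 0.35, 0.05],
    [0.05, 0.40, 0.50, 0.05],
    [0.10, 0.30, 0.55, 0.05],
]
avg = soft_ensemble(model_probs)
prediction = CLASSES[avg.index(max(avg))]
print(f"averaged probabilities: {[round(p, 3) for p in avg]} -> {prediction}")
```

Because the average keeps each model's confidence, a model that is strongly sure can outweigh two that are only marginally in favor of another class, which hard voting cannot express.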

---

**Congratulations! You've successfully trained a model that matches the 80.122% baseline! 🚀**