# Baseline Model Robustness Experiments (with BLEU Scores)

This notebook measures how the baseline captioning model's loss and BLEU scores degrade under various image corruptions.

**Tests:**
- Baseline performance (no corruption)
- All 11 corruption types (Gaussian noise, shot noise, impulse noise, defocus blur, motion blur, zoom blur, brightness, contrast, JPEG compression, pixelate, elastic transform)
- 3 severity levels (1, 3, 5) for each corruption
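
The test grid described above can be enumerated as in the sketch below. The string identifiers are hypothetical labels for the 11 corruption types and may not match the names used inside the `robustness` package.

```python
from itertools import product

# Hypothetical identifiers for the 11 corruption types listed above;
# the project's own names may differ.
CORRUPTION_TYPES = [
    "gaussian_noise", "shot_noise", "impulse_noise",
    "defocus_blur", "motion_blur", "zoom_blur",
    "brightness", "contrast", "jpeg_compression",
    "pixelate", "elastic_transform",
]
SEVERITY_LEVELS = [1, 3, 5]

# One (corruption, severity) pair per corrupted evaluation run.
test_grid = list(product(CORRUPTION_TYPES, SEVERITY_LEVELS))
print(len(test_grid))  # 11 types x 3 severities = 33
```

The uncorrupted baseline run is evaluated in addition to these 33 pairs.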

**Output:**
- Results summary (console) with Loss and BLEU scores
- JSON results file (includes all BLEU metrics)
- Visualization plots (loss and BLEU-4)

## Setup

In [2]:
# setup
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/GTech\ OMSCS/CS\ 7643/group\ project/CS7643_project

import sys
import os
project_root = '/content/drive/MyDrive/GTech OMSCS/CS 7643/group project/CS7643_project'
if project_root not in sys.path:
    sys.path.insert(0, project_root)

%pip install nltk -q
import nltk
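nltk.download('punkt', quiet=True)
# Newer NLTK releases (3.9+) load tokenizer data from 'punkt_tab' instead
nltk.download('punkt_tab', quiet=True)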



## Config


In [None]:
# Model config
MODEL_PATH = './epoch_decoder_only_baseline_3'

# Dataset config
DATASET_NAME = 'flickr8k'  # 'flickr8k' or 'flickr30k'
SPLIT = 'test'  # 'test' or 'dev'

# Experiment config
OUTPUT_DIR = './robustness_results/baseline'
BATCH_SIZE = 16
MAX_LEN = 48

# Corruption config
SEVERITY_LEVELS = [1, 3, 5]

# Device
import torch
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"


## Run Robustness Experiments

This will:
1. Load the model
2. Test baseline performance (no corruption)
3. Test all 11 corruption types at 3 severity levels each (33 corruption tests total)
4. Calculate Loss and BLEU scores (BLEU-1, BLEU-2, BLEU-3, BLEU-4)
5. Generate results and visualizations
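
This notebook does not show how the project computes BLEU in step 4. As an illustration only, the sketch below implements cumulative corpus-level BLEU-n in pure Python, with uniform weights, a brevity penalty, and no smoothing. The function names are hypothetical, and the project's own scores may use a different tokenizer or smoothing and so may differ.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu_n(references, hypotheses, max_n=4):
    """Cumulative BLEU-max_n (uniform weights, brevity penalty, no smoothing).

    references: one list of reference token lists per hypothesis.
    hypotheses: list of predicted token lists.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for refs, hyp in zip(references, hypotheses):
        hyp_len += len(hyp)
        # Use the reference length closest to the hypothesis length
        # (ties go to the shorter reference).
        ref_len += min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            max_ref = Counter()
            for r in refs:
                max_ref |= ngrams(r, n)      # element-wise max over references
            clipped = hyp_counts & max_ref   # element-wise min = clipped counts
            matches[n - 1] += sum(clipped.values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU = 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)

refs = [["a dog runs on the grass".split()]]
hyps = ["a dog runs on grass".split()]
bleu_scores = {f"bleu_{n}": corpus_bleu_n(refs, hyps, max_n=n) for n in range(1, 5)}
```

Because BLEU-n here is cumulative, each higher-order score also multiplies in the lower-order precisions, so BLEU-4 is usually the smallest of the four.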

Results are saved automatically after each corruption type, so an interrupted run can resume from the checkpoint file instead of starting over.

**Checkpoint file:** `{OUTPUT_DIR}/checkpoint_{DATASET_NAME}_{SPLIT}.json`
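
Saving after each corruption type and resuming later can be sketched as follows. This is a minimal illustration, not the project's implementation: `load_checkpoint`, `save_checkpoint`, `run_with_checkpoint`, and the `evaluate` callback are hypothetical names, and the real checkpoint schema may differ.

```python
import json
import os

def load_checkpoint(path):
    """Return previously saved results, or an empty dict if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

def save_checkpoint(path, results):
    """Write to a temporary file, then rename, so a crash cannot leave a half-written file."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(results, f, indent=2)
    os.replace(tmp_path, path)

def run_with_checkpoint(path, corruption_types, evaluate):
    """Evaluate each corruption type once, skipping those already in the checkpoint."""
    results = load_checkpoint(path)
    for name in corruption_types:
        if name in results:
            continue  # finished in an earlier run
        results[name] = evaluate(name)  # evaluate is a caller-supplied function
        save_checkpoint(path, results)
    return results
```

Writing through a temporary file and `os.replace` keeps the previous checkpoint intact if the session is interrupted in the middle of a save.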

In [None]:
from robustness.run_baseline_robustness import run_baseline_robustness_experiments
import os

checkpoint_file = os.path.join(OUTPUT_DIR, f'checkpoint_{DATASET_NAME}_{SPLIT}.json')
if os.path.exists(checkpoint_file):
    print(f"✓ Found checkpoint: {checkpoint_file}")
    print("Resuming from checkpoint...")
else:
    print("Starting new experiment...")

# Run experiments (automatically resumes from checkpoint if exists)
results = run_baseline_robustness_experiments(
    model_path=MODEL_PATH,
    dataset_name=DATASET_NAME,
    split=SPLIT,
    output_dir=OUTPUT_DIR,
    corruption_types=None,
    severity_levels=SEVERITY_LEVELS,
    batch_size=BATCH_SIZE,
    max_len=MAX_LEN,
    device=DEVICE,
    base_dir='./'
)


## View Results

The results have been saved to:
- JSON file: `{OUTPUT_DIR}/baseline_robustness_{DATASET_NAME}_{SPLIT}.json`
- Loss plot: `{OUTPUT_DIR}/robustness_loss_{DATASET_NAME}_{SPLIT}.png`
- BLEU-4 plot: `{OUTPUT_DIR}/robustness_bleu4_{DATASET_NAME}_{SPLIT}.png`
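
If the `results` variable is no longer in memory, for example in a fresh session, the saved JSON file can be reloaded rather than rerunning the experiments. A minimal sketch follows, in which `results_path` and `load_results` are hypothetical helper names; the filename pattern is taken from the list above.

```python
import json
import os

def results_path(output_dir, dataset_name, split):
    """Build the results filename from the pattern listed above."""
    return os.path.join(output_dir, f"baseline_robustness_{dataset_name}_{split}.json")

def load_results(output_dir, dataset_name, split):
    """Load a previously saved results file as a dict."""
    with open(results_path(output_dir, dataset_name, split)) as f:
        return json.load(f)

# Usage with this notebook's config variables:
# results = load_results(OUTPUT_DIR, DATASET_NAME, SPLIT)
```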

In [None]:
if 'baseline' in results:
    baseline = results['baseline']
    print("Baseline Performance (No Corruption):")
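    loss = baseline.get('loss')
    # Format only a numeric loss; applying ':.4f' to an 'N/A' fallback string raises ValueError
    print(f"  Loss: {loss:.4f}" if loss is not None else "  Loss: N/A")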
    if 'metrics' in baseline:
        metrics = baseline['metrics']
        print(f"\nBLEU Scores:")
        print(f"  BLEU-1: {metrics.get('bleu_1', 0):.4f}")
        print(f"  BLEU-2: {metrics.get('bleu_2', 0):.4f}")
        print(f"  BLEU-3: {metrics.get('bleu_3', 0):.4f}")
        print(f"  BLEU-4: {metrics.get('bleu_4', 0):.4f}")
        print(f"\nCaption Lengths:")
        print(f"  Avg Prediction: {metrics.get('avg_pred_length', 0):.2f} tokens")
        print(f"  Avg Reference: {metrics.get('avg_ref_length', 0):.2f} tokens")
