# 📋 Task 2: Domain Generalization via Invariant & Robust Learning

In this task, we explore Domain Generalization (DG), where a model is trained on multiple source domains and must generalize to a completely unseen target domain. We will implement and compare four methods: ERM, IRM, GroupDRO, and SAM.

Our setup will use the **PACS dataset**. We will train on the **Art, Cartoon, and Photo** domains, holding out the **Sketch** domain as our unseen test environment, as suggested in the assignment manual.

---

## **Part 1: Empirical Risk Minimization (ERM) Baseline**

### **1.1. Overview**

We begin by establishing a baseline using standard **Empirical Risk Minimization (ERM)**. This approach involves merging all data from the source domains into a single dataset and training a standard classifier on it. This model's performance on the unseen target domain will serve as the benchmark against which we will compare more advanced DG techniques.
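
As a minimal sketch of what "merging all data" means for the training objective, the example below pools per-example losses from the three source domains and averages them without looking at domain labels. The loss values and the helper name `pooled_erm_loss` are hypothetical and only illustrate the idea; they are not taken from DomainBed.

```python
def pooled_erm_loss(losses_by_domain):
    """Average per-example losses after pooling all source domains."""
    pooled = [loss for losses in losses_by_domain.values() for loss in losses]
    return sum(pooled) / len(pooled)


# Hypothetical per-example losses for the three source domains
losses_by_domain = {
    "art": [0.9, 1.1],
    "cartoon": [0.4, 0.6, 0.5, 0.5],
    "photo": [0.2, 0.2],
}

erm_loss = pooled_erm_loss(losses_by_domain)
per_domain_mean = {
    name: sum(losses) / len(losses) for name, losses in losses_by_domain.items()
}

print(f"Pooled ERM loss: {erm_loss:.3f}")
print("Per-domain mean losses:", per_domain_mean)
```

The pooled average hides the fact that one domain (here, the hypothetical "art" losses) is much harder than the others, which is the gap the later methods try to address.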

### **1.2. Environment Setup**

First, we need to set up the Python environment to ensure the notebook can find and import the DomainBed library from our `code/` directory.

In [7]:
import json
import os
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Path to the DomainBed repository checkout inside the 'code' folder.
# This directory is the parent of the importable 'domainbed' package.
module_path = os.path.abspath(os.path.join(".", "code", "domainbed"))

if module_path not in sys.path:
    sys.path.append(module_path)
    print(f"✅ Added '{module_path}' to Python path.")

# Import the main training function from DomainBed
try:
    from domainbed.scripts import train

    print("✅ Successfully imported DomainBed.")
except ImportError as e:
    print(
        "❌ Error importing DomainBed. Check that the path is correct and the repository is at './code/domainbed'."
    )
    print(e)

# Set plotting style for later
sns.set_theme(style="whitegrid")

✅ Successfully imported DomainBed.


### **1.3. Experiment Runner Function**

To keep our code clean, we'll define a helper function that launches any DomainBed experiment from a dictionary of arguments. It builds the equivalent command line and runs the training script in a subshell.

In [None]:
def run_experiment(args_dict):
    """
    Build a shell command from a dictionary of arguments and run the
    DomainBed training script with PYTHONPATH set.

    Boolean values become bare flags (True) or are omitted (False); all
    other values are appended as `--key value`. Values are inserted
    verbatim, so anything containing spaces or shell metacharacters must
    be quoted by the caller (as done for the `hparams` JSON string).
    """
    # Directory that contains the 'domainbed' package
    module_path = os.path.abspath(os.path.join('.', 'code', 'domainbed'))

    # Quote module_path so that spaces in the directory name are handled.
    command = f'PYTHONPATH="{module_path}" python -m domainbed.scripts.train'

    # Append each argument from the dictionary to the command string
    for key, value in args_dict.items():
        if isinstance(value, bool):
            if value:
                command += f" --{key}"
        else:
            command += f" --{key} {value}"

    print("🚀 Executing Command:")
    print(command)

    # Execute the command in the shell and keep its exit status
    exit_status = os.system(command)

    algorithm = args_dict.get('algorithm', 'N/A')
    if exit_status == 0:
        print(f"\n🎉 Training finished for {algorithm}.")
    else:
        print(f"\n❌ Training failed for {algorithm} (exit status {exit_status}).")
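
The runner above relies on the caller to quote values by hand. A possible alternative, sketched here under the assumption that the same `--key value` convention is kept, is to build the command as a list of arguments so that quoting is handled by the standard library. The helper name `build_train_command` and the example dictionary are hypothetical.

```python
import json
import shlex


def build_train_command(args_dict):
    """Return the training command as a list of arguments (hypothetical helper)."""
    argv = ["python", "-m", "domainbed.scripts.train"]
    for key, value in args_dict.items():
        if isinstance(value, bool):
            # True becomes a bare flag; False is omitted entirely
            if value:
                argv.append(f"--{key}")
        else:
            argv.extend([f"--{key}", str(value)])
    return argv


# Example arguments; note that hparams is plain JSON with no manual quotes
example_args = {
    "dataset": "PACS",
    "algorithm": "ERM",
    "test_env": 3,
    "skip_model_save": True,
    "hparams": json.dumps({"progress_bar": True}),
}

argv = build_train_command(example_args)

# shlex.join produces a copy-pasteable shell string with safe quoting
print(shlex.join(argv))
```

A list like `argv` can be passed directly to `subprocess.run` together with an `env` dictionary that sets `PYTHONPATH`, which avoids going through the shell at all.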

### **1.4. Run ERM Training**

Now, we define the specific parameters for our ERM baseline experiment and launch the training.

In [None]:
# --- ERM Experiment Configuration ---

# 1. Define hyperparameters in their own dictionary.
hparams = {
    'progress_bar': True
}

# 2. Configure the main experiment arguments.
erm_args = {
    'data_dir': './data/',
    'dataset': 'PACS',
    'algorithm': 'ERM',
    'test_env': 3,  # The index for the 'Sketch' domain in PACS
    'output_dir': './results/erm',
    'hparams_seed': 0,
    'trial_seed': 0,
    'seed': 0,
    'hparams': f"'{json.dumps(hparams)}'"
}

run_experiment(erm_args)

🚀 Executing Command:
PYTHONPATH="/root/IbsATML/PA2/Domain Generalisation/code/domainbed" python -m domainbed.scripts.train --data_dir ./data/ --dataset PACS --algorithm ERM --test_env 3 --output_dir ./results/erm --hparams_seed 0 --trial_seed 0 --seed 0 --hparams '{"progress_bar": true}'


  from pkg_resources import parse_version


Environment:
	Python: 3.12.11
	PyTorch: 2.8.0+cu129
	Torchvision: 0.23.0+cu129
	CUDA: 12.9
	CUDNN: 91002
	NumPy: 2.1.2
	PIL: 11.0.0
Args:
	algorithm: ERM
	checkpoint_freq: None
	data_dir: ./data/
	dataset: PACS
	holdout_fraction: 0.2
	hparams: {"progress_bar": true}
	hparams_seed: 0
	output_dir: ./results/erm
	save_model_every_checkpoint: False
	seed: 0
	skip_model_save: False
	steps: None
	task: domain_generalization
	test_envs: [3]
	trial_seed: 0
	uda_holdout_fraction: 0
HParams:
	batch_size: 32
	class_balanced: False
	data_augmentation: True
	dinov2: False
	freeze_bn: False
	lars: False
	linear_steps: 500
	lr: 5e-05
	nonlinear_classifier: False
	progress_bar: True
	resnet18: False
	resnet50_augmix: True
	resnet_dropout: 0.0
	vit: False
	vit_attn_tune: False
	vit_dropout: 0.0
	weight_decay: 0.0
env0_in_acc   env0_out_acc  env1_in_acc   env1_out_acc  env2_in_acc   env2_out_acc  env3_in_acc   env3_out_acc  epoch         loss          mem_gb        step          step_time    
0.11165344

---

## **Part 2: Invariant Risk Minimization (IRM)**

### **2.1. Overview**

Now we move to **Invariant Risk Minimization (IRM)**, as required by Part 2 of the assignment. The core idea behind IRM is to learn a feature representation where the optimal classifier is the same across all training domains. This is intended to prevent the model from relying on spurious, domain-specific correlations, thereby improving generalization to unseen domains.
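
To make the invariance idea concrete, here is a minimal pure-Python sketch of the IRMv1-style penalty: for each environment, take the gradient of the mean cross-entropy with respect to a dummy scalar that multiplies the logits, evaluated at 1, and square it. This illustrates the penalty term only and is not DomainBed's implementation; the logits, labels, and function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_cross_entropy(batch_logits, labels, scale=1.0):
    """Mean cross-entropy of `scale * logits` against integer labels."""
    total = 0.0
    for logits, y in zip(batch_logits, labels):
        probs = softmax([scale * z for z in logits])
        total -= math.log(probs[y])
    return total / len(labels)


def irm_penalty(batch_logits, labels):
    """Squared gradient of the mean cross-entropy w.r.t. the dummy scale at 1."""
    grad = 0.0
    for logits, y in zip(batch_logits, labels):
        probs = softmax(logits)
        # d/dw CE(w * logits, y) at w = 1 is sum_k (p_k - 1[k == y]) * logits_k
        grad += sum(
            (p - (1.0 if k == y else 0.0)) * z
            for k, (p, z) in enumerate(zip(probs, logits))
        )
    grad /= len(labels)
    return grad ** 2


# Hypothetical logits and labels for two training environments
environments = {
    "art": ([[2.0, 0.0, -1.0], [0.5, 1.5, 0.0]], [0, 1]),
    "cartoon": ([[0.1, 0.2, 0.0], [1.0, -1.0, 0.5]], [2, 0]),
}
penalty_weight = 10  # hypothetical value for illustration

risks = [mean_cross_entropy(x, y) for x, y in environments.values()]
penalties = [irm_penalty(x, y) for x, y in environments.values()]
mean_risk = sum(risks) / len(risks)
mean_penalty = sum(penalties) / len(penalties)
objective = mean_risk + penalty_weight * mean_penalty

print(f"mean risk={mean_risk:.4f}, mean penalty={mean_penalty:.4f}")
```

A penalty of zero for an environment means the dummy scale of 1 is already a stationary point of that environment's risk, which is the sense in which the classifier is "optimal" there.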

We will first run IRM with its default hyperparameters from DomainBed and then perform an ablation study with a stronger penalty weight to analyze its stability and performance.

### **2.2. Run IRM Training (Default Hyperparameters)**

We start with `irm_lambda=10`, the penalty weight used as the default setting in this notebook.
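
The configuration in the next cell sets both `irm_lambda` and `irm_penalty_anneal_iters`. A minimal sketch of how these two hyperparameters are assumed to interact is a step schedule: the penalty has weight 1.0 during the first `irm_penalty_anneal_iters` updates and weight `irm_lambda` afterwards. This mirrors a common IRMv1 setup; the exact behavior of the installed DomainBed version should be confirmed in its source. The function name is hypothetical.

```python
def irm_penalty_weight(step, irm_lambda, anneal_iters):
    """Assumed step schedule: 1.0 before `anneal_iters`, `irm_lambda` afterwards."""
    return float(irm_lambda) if step >= anneal_iters else 1.0


for step in (0, 499, 500, 5000):
    weight = irm_penalty_weight(step, irm_lambda=10, anneal_iters=500)
    print(f"step {step:>4}: penalty weight = {weight}")
```

Under this assumption, the first 500 steps behave almost like ERM, and the invariance penalty only dominates once the features have had time to form.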

In [None]:
# --- IRM Experiment Configuration (Default) ---

# 1. Define hyperparameters for the default IRM run.
hparams_irm_default = {
    "progress_bar": True,
    "irm_lambda": 10,  # Default penalty weight
    "irm_penalty_anneal_iters": 500,  # Steps to anneal the penalty
}

# 2. Configure the main experiment arguments.
irm_args_default = {
    "data_dir": "./data/",
    "dataset": "PACS",
    "algorithm": "IRM",
    "test_env": 3,  # Sketch domain
    "output_dir": "./results/irm_default",
    "hparams_seed": 0,
    "trial_seed": 0,
    "seed": 0,
    "hparams": f"'{json.dumps(hparams_irm_default)}'",
}

run_experiment(irm_args_default)

🚀 Executing Command:
PYTHONPATH="/root/IbsATML/PA2/Domain Generalisation/code/domainbed" python -m domainbed.scripts.train --data_dir ./data/ --dataset PACS --algorithm IRM --test_env 3 --output_dir ./results/irm_default --hparams_seed 0 --trial_seed 0 --seed 0 --hparams '{"progress_bar": true, "irm_lambda": 10, "irm_penalty_anneal_iters": 500}'


  from pkg_resources import parse_version


Environment:
	Python: 3.12.11
	PyTorch: 2.8.0+cu129
	Torchvision: 0.23.0+cu129
	CUDA: 12.9
	CUDNN: 91002
	NumPy: 2.1.2
	PIL: 11.0.0
Args:
	algorithm: IRM
	checkpoint_freq: None
	data_dir: ./data/
	dataset: PACS
	holdout_fraction: 0.2
	hparams: {"progress_bar": true, "irm_lambda": 10, "irm_penalty_anneal_iters": 500}
	hparams_seed: 0
	output_dir: ./results/irm_default
	save_model_every_checkpoint: False
	seed: 0
	skip_model_save: False
	steps: None
	task: domain_generalization
	test_envs: [3]
	trial_seed: 0
	uda_holdout_fraction: 0
HParams:
	batch_size: 32
	class_balanced: False
	data_augmentation: True
	dinov2: False
	freeze_bn: False
	irm_lambda: 10
	irm_penalty_anneal_iters: 500
	lars: False
	linear_steps: 500
	lr: 5e-05
	nonlinear_classifier: False
	progress_bar: True
	resnet18: False
	resnet50_augmix: True
	resnet_dropout: 0.0
	vit: False
	vit_attn_tune: False
	vit_dropout: 0.0
	weight_decay: 0.0
env0_in_acc   env0_out_acc  env1_in_acc   env1_out_acc  env2_in_acc   env2_out_acc  en

### **2.3. Ablation Study: Run IRM with Stronger Penalty**

We will now conduct a stability analysis by increasing the penalty weight to `irm_lambda=25`. This will help us understand how sensitive the IRM algorithm is to this hyperparameter.

In [None]:
# --- IRM Experiment Configuration (Stronger Penalty) ---

# 1. Define hyperparameters with the increased penalty.
hparams_irm_stronger = {
    "progress_bar": True,
    "irm_lambda": 25,
    "irm_penalty_anneal_iters": 500,
}

# 2. Configure the main experiment arguments.
irm_args_stronger = {
    "data_dir": "./data/",
    "dataset": "PACS",
    "algorithm": "IRM",
    "test_env": 3,  # Sketch domain
    "output_dir": "./results/irm_stronger_penalty",
    "hparams_seed": 0,
    "trial_seed": 0,
    "seed": 0,
    "hparams": f"'{json.dumps(hparams_irm_stronger)}'",
}

run_experiment(irm_args_stronger)

🚀 Executing Command:
PYTHONPATH="/root/IbsATML/PA2/Domain Generalisation/code/domainbed" python -m domainbed.scripts.train --data_dir ./data/ --dataset PACS --algorithm IRM --test_env 3 --output_dir ./results/irm_stronger_penalty --hparams_seed 0 --trial_seed 0 --seed 0 --hparams '{"progress_bar": true, "irm_lambda": 25, "irm_penalty_anneal_iters": 500}'


  from pkg_resources import parse_version


Environment:
	Python: 3.12.11
	PyTorch: 2.8.0+cu129
	Torchvision: 0.23.0+cu129
	CUDA: 12.9
	CUDNN: 91002
	NumPy: 2.1.2
	PIL: 11.0.0
Args:
	algorithm: IRM
	checkpoint_freq: None
	data_dir: ./data/
	dataset: PACS
	holdout_fraction: 0.2
	hparams: {"progress_bar": true, "irm_lambda": 25, "irm_penalty_anneal_iters": 500}
	hparams_seed: 0
	output_dir: ./results/irm_stronger_penalty
	save_model_every_checkpoint: False
	seed: 0
	skip_model_save: False
	steps: None
	task: domain_generalization
	test_envs: [3]
	trial_seed: 0
	uda_holdout_fraction: 0
HParams:
	batch_size: 32
	class_balanced: False
	data_augmentation: True
	dinov2: False
	freeze_bn: False
	irm_lambda: 25
	irm_penalty_anneal_iters: 500
	lars: False
	linear_steps: 500
	lr: 5e-05
	nonlinear_classifier: False
	progress_bar: True
	resnet18: False
	resnet50_augmix: True
	resnet_dropout: 0.0
	vit: False
	vit_attn_tune: False
	vit_dropout: 0.0
	weight_decay: 0.0
env0_in_acc   env0_out_acc  env1_in_acc   env1_out_acc  env2_in_acc   env2_ou

### **2.4. Ablation Study: Run IRM with Even Stronger Penalty**

To complete our hyperparameter sensitivity analysis, we now test IRM with an even stronger penalty (`irm_lambda=100`), which is 10x the default used above. Together, the three runs span one order of magnitude: λ=10 (default), λ=25 (stronger), and λ=100 (strongest).

In [None]:
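# --- IRM Experiment Configuration (Strongest Penalty) ---
# Note: despite the "weaker"/"weak" names used below, this run uses the
# strongest penalty of the three IRM runs.

# 1. Define hyperparameters with the largest penalty weight.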
hparams_irm_weaker = {
    "progress_bar": True,
    "irm_lambda": 100,  # 10x stronger penalty
    "irm_penalty_anneal_iters": 500,
}

# 2. Configure the main experiment arguments.
irm_args_weaker = {
    "data_dir": "./data/",
    "dataset": "PACS",
    "algorithm": "IRM",
    "test_env": 3,  # Sketch domain
    "output_dir": "./results/irm_weak_penalty",
    "hparams_seed": 0,
    "trial_seed": 0,
    "seed": 0,
    "hparams": f"'{json.dumps(hparams_irm_weaker)}'",
}

run_experiment(irm_args_weaker)

🚀 Executing Command:
PYTHONPATH="/root/IbsATML/PA2/Domain Generalisation/code/domainbed" python -m domainbed.scripts.train --data_dir ./data/ --dataset PACS --algorithm IRM --test_env 3 --output_dir ./results/irm_weak_penalty --hparams_seed 0 --trial_seed 0 --seed 0 --hparams '{"progress_bar": true, "irm_lambda": 100, "irm_penalty_anneal_iters": 500}'


  from pkg_resources import parse_version


Environment:
	Python: 3.12.11
	PyTorch: 2.8.0+cu129
	Torchvision: 0.23.0+cu129
	CUDA: 12.9
	CUDNN: 91002
	NumPy: 2.1.2
	PIL: 11.0.0
Args:
	algorithm: IRM
	checkpoint_freq: None
	data_dir: ./data/
	dataset: PACS
	holdout_fraction: 0.2
	hparams: {"progress_bar": true, "irm_lambda": 100, "irm_penalty_anneal_iters": 500}
	hparams_seed: 0
	output_dir: ./results/irm_weak_penalty
	save_model_every_checkpoint: False
	seed: 0
	skip_model_save: False
	steps: None
	task: domain_generalization
	test_envs: [3]
	trial_seed: 0
	uda_holdout_fraction: 0
HParams:
	batch_size: 32
	class_balanced: False
	data_augmentation: True
	dinov2: False
	freeze_bn: False
	irm_lambda: 100
	irm_penalty_anneal_iters: 500
	lars: False
	linear_steps: 500
	lr: 5e-05
	nonlinear_classifier: False
	progress_bar: True
	resnet18: False
	resnet50_augmix: True
	resnet_dropout: 0.0
	vit: False
	vit_attn_tune: False
	vit_dropout: 0.0
	weight_decay: 0.0
env0_in_acc   env0_out_acc  env1_in_acc   env1_out_acc  env2_in_acc   env2_out_

---

## **Part 3: Group Distributionally Robust Optimization (GroupDRO)**

### **3.1. Overview**

Next, we implement **Group Distributionally Robust Optimization (GroupDRO)**. Instead of averaging the loss over all source domains like ERM, GroupDRO optimizes for worst-case performance among them. It maintains a weight for each source domain; at each step, the weights of domains with higher loss are increased, and the model is updated on the resulting weighted loss, so training prioritizes the currently "hardest" domain.

The goal is to prevent the model from simply overfitting to easier domains, thereby encouraging it to learn more robust features that can generalize better to unseen environments.
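
A minimal sketch of the soft reweighting commonly used to implement this idea is shown below: per-domain weights are updated with an exponentiated-gradient step and then renormalized, and the training loss is the weighted sum of the per-domain losses. The loss values are hypothetical and held fixed for illustration, and `groupdro_step` is an illustrative name, not a DomainBed function.

```python
import math


def groupdro_step(q, losses, eta):
    """One exponentiated-gradient update of the group weights and the weighted loss."""
    unnormalized = [q_i * math.exp(eta * loss) for q_i, loss in zip(q, losses)]
    total = sum(unnormalized)
    new_q = [u / total for u in unnormalized]
    robust_loss = sum(q_i * loss for q_i, loss in zip(new_q, losses))
    return new_q, robust_loss


# Hypothetical per-domain losses (Art, Cartoon, Photo), held fixed for illustration
losses = [1.0, 0.5, 0.2]
q = [1 / 3] * 3  # start from uniform weights

for _ in range(100):
    q, robust_loss = groupdro_step(q, losses, eta=0.01)

average_loss = sum(losses) / len(losses)
print("group weights:", [round(w, 3) for w in q])
print(f"weighted loss={robust_loss:.3f} vs plain average={average_loss:.3f}")
```

With a small step size such as 0.01, the weights drift toward the hardest domain gradually rather than jumping to it, which keeps the objective closer to ERM early in training.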

### **3.2. Training GroupDRO**

We will now train our model using the GroupDRO algorithm. The DomainBed library already includes this implementation. We only need to specify `GroupDRO` as the algorithm and set its associated hyperparameters. Based on the original paper and DomainBed's defaults, we will use a `groupdro_eta` of `1e-2`.

In [None]:
# --- GroupDRO Experiment Configuration ---

# 1. Define hyperparameters for GroupDRO
hparams_groupdro = {
    "progress_bar": True,
    "groupdro_eta": 0.01  # Default learning rate for group weights
}

# 2. Configure the experiment arguments (same format as the IRM experiments)
groupdro_args = {
    "data_dir": "./data/",
    "dataset": "PACS",
    "algorithm": "GroupDRO",
    "test_env": 3,  # Test on Sketch
    "output_dir": "./results/groupdro",
    "hparams_seed": 0,
    "trial_seed": 0,
    "seed": 0,
    "hparams": f"'{json.dumps(hparams_groupdro)}'"
}

os.makedirs(groupdro_args['output_dir'], exist_ok=True)

print("Starting GroupDRO training...")
run_experiment(groupdro_args)
print("\nGroupDRO training complete!")

Starting GroupDRO training...
🚀 Executing Command:
PYTHONPATH="/root/IbsATML/PA2/Domain Generalisation/code/domainbed" python -m domainbed.scripts.train --data_dir ./data/ --dataset PACS --algorithm GroupDRO --test_env 3 --output_dir ./results/groupdro --hparams_seed 0 --trial_seed 0 --seed 0 --hparams '{"progress_bar": true, "groupdro_eta": 0.01}'


  from pkg_resources import parse_version


Environment:
	Python: 3.12.11
	PyTorch: 2.8.0+cu129
	Torchvision: 0.23.0+cu129
	CUDA: 12.9
	CUDNN: 91002
	NumPy: 2.1.2
	PIL: 11.0.0
Args:
	algorithm: GroupDRO
	checkpoint_freq: None
	data_dir: ./data/
	dataset: PACS
	holdout_fraction: 0.2
	hparams: {"progress_bar": true, "groupdro_eta": 0.01}
	hparams_seed: 0
	output_dir: ./results/groupdro
	save_model_every_checkpoint: False
	seed: 0
	skip_model_save: False
	steps: None
	task: domain_generalization
	test_envs: [3]
	trial_seed: 0
	uda_holdout_fraction: 0
HParams:
	batch_size: 32
	class_balanced: False
	data_augmentation: True
	dinov2: False
	freeze_bn: False
	groupdro_eta: 0.01
	lars: False
	linear_steps: 500
	lr: 5e-05
	nonlinear_classifier: False
	progress_bar: True
	resnet18: False
	resnet50_augmix: True
	resnet_dropout: 0.0
	vit: False
	vit_attn_tune: False
	vit_dropout: 0.0
	weight_decay: 0.0
env0_in_acc   env0_out_acc  env1_in_acc   env1_out_acc  env2_in_acc   env2_out_acc  env3_in_acc   env3_out_acc  epoch         loss          

### **3.3. Results & Analysis**

After training is complete, we'll parse the output file to extract the final accuracies on both the source domains and the unseen target domain (Sketch) and do a quick comparison to our baseline.

In [9]:
# Load results
try:
    with open(os.path.join(groupdro_args['output_dir'], 'results.jsonl'), 'r') as f:
        groupdro_results_log = [json.loads(line) for line in f]
except FileNotFoundError:
    print("❌ Results file for GroupDRO not found. Please ensure training completed successfully.")
    groupdro_results_log = []

if groupdro_results_log:
    groupdro_df = pd.DataFrame(groupdro_results_log)
    
    # Extract final accuracies
    final_step = groupdro_df['step'].max()
    final_accuracies = groupdro_df[groupdro_df['step'] == final_step]

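    # Get target and source accuracies (converted to percent).
    # Target: held-out ('out') split of the unseen Sketch domain.
    # Sources: 'in' splits, i.e. the portions of each source domain used for training.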
    groupdro_target_acc = final_accuracies['env3_out_acc'].values[0] * 100
    groupdro_art_acc = final_accuracies['env0_in_acc'].values[0] * 100
    groupdro_cartoon_acc = final_accuracies['env1_in_acc'].values[0] * 100
    groupdro_photo_acc = final_accuracies['env2_in_acc'].values[0] * 100
    
    # Calculate balance metrics
    source_accuracies = [groupdro_art_acc, groupdro_cartoon_acc, groupdro_photo_acc]
    source_avg = np.mean(source_accuracies)
    source_range = max(source_accuracies) - min(source_accuracies)
    source_std = np.std(source_accuracies)
    
    # Quick comparison to ERM (reference values are hardcoded, not loaded from ./results/erm)
    erm_target = 83.82
    erm_range = 5.13  # 100.00 - 94.87
    target_gap = groupdro_target_acc - erm_target
    
    # Display results
    print("=" * 70)
    print("📊 GroupDRO Results Summary")
    print("=" * 70)
    print(f"\n🎯 Target Domain (Sketch): {groupdro_target_acc:.2f}% ({target_gap:+.2f}% vs ERM)")
    print(f"\n📚 Source Domains:")
    print(f"  ├─ Art:     {groupdro_art_acc:.2f}%")
    print(f"  ├─ Cartoon: {groupdro_cartoon_acc:.2f}%")
    print(f"  ├─ Photo:   {groupdro_photo_acc:.2f}%")
    print(f"  └─ Average: {source_avg:.2f}%")
    print(f"\n⚖️  Balance: Range={source_range:.2f}% (ERM: {erm_range:.2f}%), Std={source_std:.2f}%")
    
    # Status check
    if groupdro_target_acc > erm_target:
        print("\n✅ Status: OUTPERFORMED ERM!")
    elif groupdro_target_acc > 78.85:
        print("\n⚠️  Status: Better than IRM, below ERM")
    else:
        print("\n❌ Status: Underperformed")
    
    print("=" * 70)
    
else:
    print("⚠️ No results to display. Training may have failed.")

📊 GroupDRO Results Summary

🎯 Target Domain (Sketch): 85.10% (+1.28% vs ERM)

📚 Source Domains:
  ├─ Art:     99.63%
  ├─ Cartoon: 99.47%
  ├─ Photo:   99.93%
  └─ Average: 99.68%

⚖️  Balance: Range=0.46% (ERM: 5.13%), Std=0.19%

✅ Status: OUTPERFORMED ERM!
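
The comparison above hardcodes the ERM and IRM reference numbers. One way to avoid that, sketched here under the assumption that every run writes a `results.jsonl` with the same column names used above, is a small helper that reads the final-step accuracies for any run. The helper name `load_final_accuracies` and the demo records are hypothetical.

```python
import json
import os
import tempfile


def load_final_accuracies(results_path, test_env, source_envs):
    """Return target and source accuracies (in %) from the last logged step."""
    with open(results_path, "r") as f:
        records = [json.loads(line) for line in f if line.strip()]
    if not records:
        raise ValueError(f"No records found in {results_path}")
    final = max(records, key=lambda r: r["step"])
    return {
        "target": final[f"env{test_env}_out_acc"] * 100,
        "sources": {env: final[f"env{env}_in_acc"] * 100 for env in source_envs},
    }


# Demo with made-up records written to a temporary file
demo_records = [
    {"step": 0, "env0_in_acc": 0.25, "env1_in_acc": 0.25,
     "env2_in_acc": 0.25, "env3_out_acc": 0.125},
    {"step": 300, "env0_in_acc": 0.75, "env1_in_acc": 0.5,
     "env2_in_acc": 1.0, "env3_out_acc": 0.5},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "results.jsonl")
    with open(path, "w") as f:
        for record in demo_records:
            f.write(json.dumps(record) + "\n")
    summary = load_final_accuracies(path, test_env=3, source_envs=[0, 1, 2])

print(summary)
```

Calling the same helper on each run's output directory would let the ERM, IRM, and GroupDRO numbers be compared from the logged files rather than from values copied by hand.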
