# Synthetic Experiment Reproduction (Google Colab Version)
This notebook guides you through reproducing the synthetic hidden subgroup discovery experiment.

**Hardware Recommendation:** Ensure you are using a GPU runtime (Runtime > Change runtime type > T4 GPU).

## 1. Setup Environment

In [None]:
# Clone repo
!git clone https://github.com/stu-rdy/hidden-subgroup-perf.git
%cd hidden-subgroup-perf/experiments/synthetic

In [None]:
import os
print("Current directory:", os.getcwd())
os.makedirs("../../results", exist_ok=True)

### 1.1 Install Specialized Research Tools
We need to downgrade NumPy to 1.x so that it stays compatible with the research tools installed below. Because Colab 2024 uses Python 3.12, we pin specific versions to avoid binary incompatibility.

In [None]:
# 1. Clean environment: remove incompatible versions
%pip uninstall -y numpy pandas

# 2. Install stable binaries that are compatible with Python 3.12
%pip install "numpy==1.26.4" "pandas>=2.1.1" --quiet

# 3. Install core project dependencies
%pip install -r requirements.txt --quiet

# 4. Install CLIP and HazyResearch Discovery Tools
%pip install git+https://github.com/openai/CLIP.git --quiet
%pip install git+https://github.com/hazyresearch/meerkat.git --quiet
%pip install "domino[clip] @ git+https://github.com/hazyresearch/domino.git" --quiet

### ⚠️ ACTION REQUIRED: RESTART RUNTIME
After the cell above finishes, you **MUST** click **Runtime > Restart session** in the menu above. 

Failure to do this will result in a `ValueError: numpy.dtype size changed` error because the old versions are still in memory.
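
The reason for the restart can be checked directly: a module that was imported before the reinstall keeps its old version in memory, while the package metadata on disk already reports the new one. The cell below is a minimal sketch of that comparison using only the standard library. The helper name `stale_packages` is hypothetical and not part of this repository, and the check only covers modules that expose `__version__`.

```python
import sys
from importlib import metadata


def stale_packages(names):
    """Return {name: (loaded, installed)} for modules whose in-memory
    version differs from the version currently installed on disk."""
    stale = {}
    for name in names:
        module = sys.modules.get(name)
        if module is None:
            continue  # not imported yet, so nothing stale is held in memory
        loaded = getattr(module, "__version__", None)
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            continue  # no installed distribution found under this name
        if loaded is not None and loaded != installed:
            stale[name] = (loaded, installed)
    return stale


mismatches = stale_packages(["numpy", "pandas"])
if mismatches:
    print("Restart the session; stale modules in memory:", mismatches)
else:
    print("No stale numpy/pandas modules detected in this session.")
```

If this reports a mismatch, the session still holds the old binaries and needs the restart described above.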

## 2. Environment Verification
Run this cell **AFTER** restarting the session.

In [None]:
import torch
import numpy as np
import pandas as pd
import meerkat as mk
import domino

print("✅ Device: CUDA available" if torch.cuda.is_available() else "❌ Device: NO GPU FOUND")
print(f"✅ NumPy version: {np.__version__} (Should be 1.26.4)")
print(f"✅ Pandas version: {pd.__version__}")
print("✅ All research tools imported successfully!")

if not np.__version__.startswith("1.26"):
    print("\n⚠️ WARNING: Your NumPy version is still wrong. Did you forget to 'Restart session'?")

## 3. Run Reproduction Pipeline
We use `configs/colab_config.yaml` for standardized runs on NVIDIA T4 GPUs.

In [None]:
CONFIG = "configs/colab_config.yaml"
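
Before launching the pipeline, it may help to confirm that the config file and scripts referenced in this notebook are reachable from the current working directory (which should be `experiments/synthetic` after the `%cd` above). The following is a small pre-flight sketch. The helper `missing_paths` is a hypothetical name, and the file list is simply copied from the cells in this notebook.

```python
from pathlib import Path

CONFIG = "configs/colab_config.yaml"

# Paths copied from the cells in this notebook, relative to experiments/synthetic.
REQUIRED = [
    CONFIG,
    "requirements.txt",
    "scripts/1_setup_data.py",
    "scripts/2_generate_synthetic.py",
    "scripts/3_train_model.py",
    "scripts/4_extract_features.py",
    "scripts/5_run_analysis.py",
]


def missing_paths(paths, root="."):
    """Return the entries of `paths` that are not regular files under `root`."""
    return [p for p in paths if not (Path(root) / p).is_file()]


missing = missing_paths(REQUIRED)
if missing:
    print("Missing files (is the working directory experiments/synthetic?):")
    for path in missing:
        print("  -", path)
else:
    print("All expected files found.")
```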

In [None]:
# 3.1 Data Preparation
!python scripts/1_setup_data.py
!python scripts/2_generate_synthetic.py --config {CONFIG}

In [None]:
# 3.2 Model Training (includes Early Stopping on Worst-Group Accuracy)
!python scripts/3_train_model.py --config {CONFIG}
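
For readers unfamiliar with the metric named in the comment above: worst-group accuracy is commonly defined as the minimum per-group accuracy over all subgroups. The sketch below illustrates that definition on toy data. It is not taken from `3_train_model.py`, whose exact implementation may differ, and the function name and toy labels are made up for illustration.

```python
from collections import defaultdict


def worst_group_accuracy(y_true, y_pred, groups):
    """Return (worst_accuracy, worst_group, per_group_accuracy)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    if not total:
        raise ValueError("at least one example is required")
    per_group = {g: correct[g] / total[g] for g in total}
    worst = min(per_group, key=per_group.get)
    return per_group[worst], worst, per_group


# Toy example: group "a" is classified perfectly, group "b" mostly fails.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0]
groups = ["a", "a", "a", "b", "b", "b"]

wga, worst, per_group = worst_group_accuracy(y_true, y_pred, groups)
print(f"Worst group: {worst!r} with accuracy {wga:.3f}")
print("Per-group accuracy:", per_group)
```

Here the overall accuracy is 4/6, yet group `"b"` reaches only 1/3, which is the kind of gap that early stopping on this metric is meant to track.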

In [None]:
# 3.3 Discovery Analysis
!python scripts/4_extract_features.py --config {CONFIG}
!python scripts/5_run_analysis.py --config {CONFIG}
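
A failing `!python` line does not, by itself, stop later lines or cells from running, so an early failure can go unnoticed until the results file turns out to be missing. As an optional alternative, the sketch below runs the same five scripts through `subprocess` and stops at the first non-zero exit code. The helper `run_steps` and the `STEPS` list are hypothetical names; the script paths and the `--config` flag are copied from the cells above.

```python
import subprocess
import sys

CONFIG = "configs/colab_config.yaml"

# Same commands as the cells above, expressed as argument lists.
STEPS = [
    ["scripts/1_setup_data.py"],
    ["scripts/2_generate_synthetic.py", "--config", CONFIG],
    ["scripts/3_train_model.py", "--config", CONFIG],
    ["scripts/4_extract_features.py", "--config", CONFIG],
    ["scripts/5_run_analysis.py", "--config", CONFIG],
]


def run_steps(steps, python=sys.executable):
    """Run each step in order; return the index of the first failing step,
    or None if every step exits with code 0."""
    for index, args in enumerate(steps):
        print("Running:", " ".join(args))
        result = subprocess.run([python, *args])
        if result.returncode != 0:
            print(f"Step failed with exit code {result.returncode}: {args[0]}")
            return index
    return None


# Uncomment to run the full pipeline from experiments/synthetic:
# run_steps(STEPS)
```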

## 4. Results Visualization
The analysis identifies 'slices' of the data on which the model fails. We check whether the 'vertical line' artifact shows up among them as a high-error subgroup.

In [None]:
import pandas as pd
try:
    results = pd.read_csv('../../results/synthetic_analysis.csv')
    print("Found Discovered Slices:")
    display(results.head(10))
except FileNotFoundError:
    print("❌ Results file not found. Check if scripts 4 and 5 completed successfully.")