# Synthetic Experiment Reproduction (Google Colab Version)
This notebook guides you through reproducing the synthetic hidden subgroup discovery experiment.

**Hardware Recommendation:** Ensure you are using a GPU runtime (Runtime > Change runtime type > T4 GPU).

## 1. Setup Environment

### 1.1 Install Specialized Research Tools
The research tools installed below (CLIP, Meerkat, Domino) need NumPy 1.x, so we downgrade NumPy. Since Colab 2024 uses Python 3.12, we pin specific NumPy and pandas versions to avoid binary incompatibility.

In [None]:
# 1. Clean environment: remove incompatible versions
%pip uninstall -y numpy pandas

# 2. Install pinned binaries that are compatible with Python 3.12
%pip install "numpy==1.26.4" "pandas>=2.1.1" --quiet

# 3. Install CLIP and HazyResearch Discovery Tools
%pip install git+https://github.com/openai/CLIP.git --quiet
%pip install git+https://github.com/hazyresearch/meerkat.git --quiet
%pip install "domino[clip] @ git+https://github.com/hazyresearch/domino.git" --quiet

### ACTION REQUIRED: RESTART RUNTIME
After the cell above finishes, you **MUST** click **Runtime > Restart session** in the menu above. 

Failure to do this will result in a `ValueError: numpy.dtype size changed` error because the old versions are still in memory.
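
To see whether a restart is still pending, one option is to compare the version of a module already loaded in memory with the version pip has recorded on disk. The sketch below is a minimal illustration of that idea and is not part of the original notebook; the helper names `loaded_version`, `installed_version`, and `needs_restart` are hypothetical.

```python
import sys
from importlib import metadata


def loaded_version(module_name):
    """Version of a module already imported in this process, or None."""
    module = sys.modules.get(module_name)
    return getattr(module, "__version__", None) if module else None


def installed_version(dist_name):
    """Version recorded on disk by the installer, or None if not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def needs_restart(module_name, dist_name=None):
    """True only if the module is loaded AND differs from the on-disk version."""
    loaded = loaded_version(module_name)
    installed = installed_version(dist_name or module_name)
    return loaded is not None and installed is not None and loaded != installed


for name in ("numpy", "pandas"):
    print(f"{name}: restart needed = {needs_restart(name)}")
```

A module that was never imported in the current session reports `False`, since nothing stale is held in memory for it.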

In [None]:
import torch
import numpy as np
import pandas as pd
import meerkat as mk
import domino

print("✅ Device: CUDA available" if torch.cuda.is_available() else "❌ Device: NO GPU FOUND")
print(f"✅ NumPy version: {np.__version__} (Should be 1.26.4)")
print(f"✅ Pandas version: {pd.__version__}")
print("✅ All research tools imported successfully!")

if not np.__version__.startswith("1.26"):
    print("\nWARNING: Your NumPy version is still wrong. Did you forget to 'Restart session'?")

### 1.2 Clone the GitHub Repository

In [None]:
# Clone repo
!git clone https://github.com/stu-rdy/uncover_hidden_stratification.git
%cd uncover_hidden_stratification/experiments/synthetic

In [None]:
import os
print("Current directory:", os.getcwd())
os.makedirs("../../results", exist_ok=True)

## 3. Run Reproduction Pipeline
We use `configs/colab_config.yaml` for standardized runs on NVIDIA T4 GPUs.
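
Before launching the pipeline, it can help to confirm that the config and the five scripts are reachable from the current working directory. The following is a small pre-flight sketch, not part of the original notebook; the file names are copied from the cells in this notebook, and the helper `find_missing` is a hypothetical name.

```python
from pathlib import Path

# File names taken from the cells in this notebook; adjust if the repo layout changes.
REQUIRED_FILES = [
    "configs/colab_config.yaml",
    "scripts/1_setup_data.py",
    "scripts/2_generate_synthetic.py",
    "scripts/3_train_model.py",
    "scripts/4_extract_features.py",
    "scripts/5_run_analysis.py",
]


def find_missing(paths, root="."):
    """Return the subset of `paths` that are not regular files under `root`."""
    root = Path(root)
    return [p for p in paths if not (root / p).is_file()]


missing = find_missing(REQUIRED_FILES)
if missing:
    print("Missing files (is the working directory experiments/synthetic?):")
    for p in missing:
        print("  -", p)
else:
    print("All pipeline files found.")
```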

In [None]:
CONFIG = "configs/colab_config.yaml"


In [None]:
# 3.1 Data Preparation
!python scripts/1_setup_data.py
!python scripts/2_generate_synthetic.py --config {CONFIG} --no-wandb

In [None]:
# 3.2 Model Training (includes Early Stopping on Worst-Group Accuracy)
!python scripts/3_train_model.py --config {CONFIG} --no-wandb
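
The training script itself is not shown in this notebook, so the following is only a generic sketch of what early stopping on worst-group accuracy usually means: compute accuracy separately for each group, monitor the minimum, and stop once it has not improved for a set number of epochs. The names `worst_group_accuracy` and `EarlyStopper` and the `patience` parameter are illustrative assumptions, not the repository's API.

```python
from collections import defaultdict


def worst_group_accuracy(preds, labels, groups):
    """Lowest per-group accuracy; `groups` holds one hashable group ID per example."""
    correct, total = defaultdict(int), defaultdict(int)
    for p, y, g in zip(preds, labels, groups):
        total[g] += 1
        correct[g] += int(p == y)
    # Raises ValueError on empty input, since there is no group to score.
    return min(correct[g] / total[g] for g in total)


class EarlyStopper:
    """Signal a stop when the score has not improved for `patience` epochs."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, score):
        """Record one epoch's score; return True if training should stop."""
        if score > self.best:
            self.best = score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Toy validation data: group "a" is fully correct, group "b" is mostly wrong.
preds = [1, 1, 0, 0, 1, 0]
labels = [1, 1, 0, 1, 1, 1]
groups = ["a", "a", "a", "b", "b", "b"]
print(f"worst-group accuracy: {worst_group_accuracy(preds, labels, groups):.3f}")
```

Monitoring the minimum rather than the average keeps a model from being selected on the strength of the majority group alone.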

In [None]:
# 3.3 Discovery Analysis
!python scripts/4_extract_features.py --config {CONFIG}
!python scripts/5_run_analysis.py --config {CONFIG} --no-wandb

## 4. Results Visualization

In [None]:
import pandas as pd
try:
    results = pd.read_csv('../../results/synthetic_analysis.csv')
    print("Found Discovered Slices:")
    display(results.head(10))
except FileNotFoundError:
    print("Results file not found. Check if scripts 4 and 5 completed successfully.")

### 4.1 Discovery Visualizations

In [None]:
from IPython.display import Image, display
import os

plot_dir = '../../results/plots'
if os.path.isdir(plot_dir):
    # Check each plot separately so one missing file does not raise an error
    for name in ('slice_analysis.png', 'error_concentration.png'):
        plot_path = os.path.join(plot_dir, name)
        if os.path.exists(plot_path):
            display(Image(filename=plot_path))
        else:
            print(f"Plot not found: {plot_path}")
else:
    print("Plot directory not found.")

### 4.2 Example Images for Top Error Slices
These images are automatically extracted based on their contribution to the total error mass.

In [None]:
import os
from IPython.display import Image, display, Markdown

example_dir = '../../results/slice_examples'
if os.path.exists(example_dir):
    # Read the summary if it exists
    summary_path = os.path.join(example_dir, "extraction_summary.md")
    if os.path.exists(summary_path):
        with open(summary_path, 'r') as f:
            display(Markdown(f.read()))
            
    # List and display images per slice
    for slice_folder in sorted(os.listdir(example_dir)):
        slice_path = os.path.join(example_dir, slice_folder)
        if os.path.isdir(slice_path):
            display(Markdown(f"#### Images for {slice_folder}"))
            # Match extensions case-insensitively so files like .PNG are included
            image_files = [f for f in os.listdir(slice_path) if f.lower().endswith(('.png', '.jpg', '.jpeg'))]
            for img_file in sorted(image_files):
                img_path = os.path.join(slice_path, img_file)
                display(Markdown(f"**{img_file}**"))
                display(Image(filename=img_path, width=300))
else:
    print("Example images directory not found.")
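
The phrase "contribution to the total error mass" in Section 4.2 can be made concrete with a small sketch. It assumes that a slice's error mass is the fraction of all misclassified examples that fall into that slice; the repository may define it differently, and `error_mass_by_slice` is a hypothetical helper name.

```python
from collections import Counter


def error_mass_by_slice(slice_ids, is_error):
    """Fraction of all errors that falls in each slice, largest share first.

    Assumption: "error mass" means the share of misclassified examples.
    Returns an empty dict when there are no errors at all.
    """
    errors = Counter(s for s, e in zip(slice_ids, is_error) if e)
    total = sum(errors.values())
    if total == 0:
        return {}
    return {s: n / total for s, n in errors.most_common()}


# Toy example: 8 examples in 3 slices, 4 of them misclassified.
slice_ids = [0, 0, 0, 1, 1, 2, 2, 2]
is_error = [True, True, False, True, False, False, False, True]
print(error_mass_by_slice(slice_ids, is_error))
```

Under this assumed definition the shares sum to 1, so ranking slices by share shows where the model's mistakes concentrate.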