# Fixed Gamma and Kappa Workflow

This notebook documents the workflow for:
1. Pooling kappa and gamma from training batches
2. Running predictions with fixed gamma and kappa (only lambda is learned)

## Overview

During prediction, we found that kappa (the calibration parameter) and gamma (the genetic effect weights) were being re-learned in every batch, so their values differed from one prediction batch to the next. The solution is to:
- Pool kappa and gamma from the training batches (where they were learned on the full population)
- Fix these values during prediction, so only lambda (individual-specific signature loadings) is learned

This ensures consistent, population-level calibration and genetic effects across all prediction batches.


## Step 1: Pool Kappa and Gamma from Training Batches

First, we pool the kappa and gamma values from all training batch checkpoints to get population-level estimates.
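
As a rough illustration of what pooling means here, the sketch below averages per-batch kappa scalars and gamma matrices element-wise. It is not the actual `pool_kappa_and_gamma_from_batches.py` script: the function name `pool_kappa_gamma` is hypothetical, the gamma matrices are random stand-ins, and the returned keys simply mirror the ones read back in the verification cell (`kappa`, `gamma`, `kappa_array`, `n_batches`).

```python
import numpy as np

def pool_kappa_gamma(kappas, gammas):
    """Average per-batch kappa scalars and gamma matrices (hypothetical helper)."""
    kappa_array = np.asarray(kappas, dtype=float)
    # Stack to shape (n_batches, P, K), then average over the batch axis
    gamma_stack = np.stack([np.asarray(g, dtype=float) for g in gammas])
    return {
        "kappa": float(kappa_array.mean()),
        "gamma": gamma_stack.mean(axis=0),
        "kappa_array": kappa_array,
        "n_batches": len(kappa_array),
    }

# Toy inputs: three kappa values taken from the log, random stand-ins for gamma
rng = np.random.default_rng(0)
kappas = [2.884689, 2.925767, 2.906323]
gammas = [rng.normal(scale=0.001, size=(47, 21)) for _ in kappas]

pooled = pool_kappa_gamma(kappas, gammas)
print(f"kappa={pooled['kappa']:.6f}, gamma shape={pooled['gamma'].shape}")
```

A plain mean weights every batch equally, which seems a reasonable default when all batches are the same size; a size-weighted mean would be the natural variant otherwise.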


In [1]:
# Run the pooling script
# This will:
# 1. Load kappa and gamma from all training batch files
# 2. Compute mean kappa (pooled across batches)
# 3. Compute mean gamma (pooled across batches)
# 4. Save to pooled_kappa_gamma.pt

import subprocess
import sys

# Paths
pooling_script = '/Users/sarahurbut/aladynoulli2/claudefile/pool_kappa_and_gamma_from_batches.py'
batch_pattern = '/Users/sarahurbut/Library/CloudStorage/Dropbox/censor_e_batchrun_vectorized/enrollment_model_W0.0001_batch_*.pt'
output_dir = '/Users/sarahurbut/Library/CloudStorage/Dropbox-Personal/data_for_running/'

# Run the script
cmd = [
    sys.executable,
    pooling_script,
    '--batch_pattern', batch_pattern,
    '--output_dir', output_dir
]

print("Running pooling script...")
print(f"Command: {' '.join(cmd)}")
print("\n" + "="*80)

result = subprocess.run(cmd, capture_output=True, text=True)

print(result.stdout)
if result.stderr:
    print("\nSTDERR:")
    print(result.stderr)

if result.returncode == 0:
    print("\n✓ Pooling completed successfully!")
else:
    print(f"\n✗ Pooling failed with return code {result.returncode}")


Running pooling script...
Command: /opt/miniconda3/envs/new_env_pyro2/bin/python /Users/sarahurbut/aladynoulli2/claudefile/pool_kappa_and_gamma_from_batches.py --batch_pattern /Users/sarahurbut/Library/CloudStorage/Dropbox/censor_e_batchrun_vectorized/enrollment_model_W0.0001_batch_*.pt --output_dir /Users/sarahurbut/Library/CloudStorage/Dropbox-Personal/data_for_running/

POOLING KAPPA AND GAMMA FROM TRAINING BATCHES
Found 40 files matching pattern: /Users/sarahurbut/Library/CloudStorage/Dropbox/censor_e_batchrun_vectorized/enrollment_model_W0.0001_batch_*.pt
  Loaded kappa=2.884689, gamma shape=(47, 21) from enrollment_model_W0.0001_batch_0_10000.pt
  Loaded kappa=2.925767, gamma shape=(47, 21) from enrollment_model_W0.0001_batch_100000_110000.pt
  Loaded kappa=2.906323, gamma shape=(47, 21) from enrollment_model_W0.0001_batch_10000_20000.pt

KAPPA POOLING RESULTS
  Number of batches: 40
  Pooled kappa (mean): 2.932933
  Std kappa: 0.030100
  Min kappa: 2.884689
  Max kappa: 3.002235

In [2]:
# Verify the pooled values
import torch
import numpy as np

pooled_path = '/Users/sarahurbut/Library/CloudStorage/Dropbox-Personal/data_for_running/pooled_kappa_gamma.pt'
pooled_data = torch.load(pooled_path, weights_only=False)

print("Pooled Kappa and Gamma Summary")
print("="*80)
print(f"Kappa: {pooled_data['kappa']:.6f}")
print(f"Gamma shape: {pooled_data['gamma'].shape}")
print(f"Number of batches used: {pooled_data['n_batches']}")

if 'kappa_array' in pooled_data:
    kappa_array = pooled_data['kappa_array']
    print(f"\nKappa statistics across batches:")
    print(f"  Mean: {np.mean(kappa_array):.6f}")
    print(f"  Std: {np.std(kappa_array):.6f}")
    print(f"  Min: {np.min(kappa_array):.6f}")
    print(f"  Max: {np.max(kappa_array):.6f}")

print(f"\nGamma statistics:")
gamma = pooled_data['gamma']
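if isinstance(gamma, torch.Tensor):
    # detach() and cpu() keep the conversion safe if the tensor tracks gradients or lives on a GPU
    gamma = gamma.detach().cpu().numpy()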
print(f"  Mean |γ|: {np.abs(gamma).mean():.6f}")
print(f"  Max |γ|: {np.abs(gamma).max():.6f}")
print(f"  Shape: {gamma.shape}")


Pooled Kappa and Gamma Summary
Kappa: 2.932933
Gamma shape: (47, 21)
Number of batches used: 40

Kappa statistics across batches:
  Mean: 2.932933
  Std: 0.030100
  Min: 2.884689
  Max: 3.002235

Gamma statistics:
  Mean |γ|: 0.000205
  Max |γ|: 0.007480
  Shape: (47, 21)
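
To put the batch-to-batch spread of kappa in perspective, the reported statistics can be expressed relative to the pooled mean. The short calculation below uses only the numbers printed above; the variable names are illustrative.

```python
# Values copied from the kappa statistics printed above
kappa_mean, kappa_std = 2.932933, 0.030100
kappa_min, kappa_max = 2.884689, 3.002235

cv = kappa_std / kappa_mean                       # coefficient of variation
rel_range = (kappa_max - kappa_min) / kappa_mean  # full range relative to the mean

print(f"CV: {cv:.2%}, relative range: {rel_range:.2%}")
```

The standard deviation is about 1% of the mean and the full range about 4%, so the training batches agree closely on kappa.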


## Step 2: Run Predictions with Fixed Gamma and Kappa

Now we run the prediction script using the pooled gamma and kappa values. This ensures:
- Consistent calibration (kappa) across all batches
- Consistent genetic effects (gamma) across all batches
- Only lambda (individual signature loadings) is learned per batch
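
The general mechanism for learning only lambda can be sketched in PyTorch as follows. This is a toy under stated assumptions, not the prediction script's implementation: the tensor names, the covariate matrix `G`, the dimensions, and the squared-error loss are all made up for illustration. Only the idea carries over: gamma and kappa are kept out of the optimizer, and lambda is the sole trainable tensor.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy dimensions; gamma's (47, 21) shape matches the pooled file
n_people, n_features, n_signatures = 8, 47, 21

# Stand-ins for the pooled values (the notebook loads these from pooled_kappa_gamma.pt)
gamma = (torch.randn(n_features, n_signatures) * 0.01).requires_grad_(False)
kappa = torch.tensor(2.932933).requires_grad_(False)

# Lambda is the only tensor that requires gradients and the only one the optimizer sees
lam = torch.zeros(n_people, n_signatures, requires_grad=True)
optimizer = torch.optim.Adam([lam], lr=0.1)

G = torch.randn(n_people, n_features)         # toy covariates (assumption)
target = torch.rand(n_people, n_signatures)   # toy target (assumption)

gamma_before, kappa_before = gamma.clone(), kappa.clone()
for _ in range(20):
    optimizer.zero_grad()
    # Toy loss for illustration only; NOT the real model likelihood
    pred = kappa * (lam + G @ gamma)
    loss = ((pred - target) ** 2).mean()
    loss.backward()
    optimizer.step()

lambda_changed = not torch.equal(lam.detach(), torch.zeros_like(lam))
gamma_unchanged = torch.equal(gamma, gamma_before)
kappa_unchanged = torch.equal(kappa, kappa_before)
print(lambda_changed, gamma_unchanged, kappa_unchanged)
```

Comparing the fixed tensors before and after fitting, as in the last lines, is a cheap sanity check that nothing other than lambda was updated.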


In [3]:
# Example command to run predictions with fixed gamma and kappa
# This can be run in a terminal or as a background job

prediction_script = '/Users/sarahurbut/aladynoulli2/claudefile/run_aladyn_predict_with_master_vector_cenosrE_fixedgk.py'
master_checkpoint = '/Users/sarahurbut/Library/CloudStorage/Dropbox-Personal/data_for_running/master_for_fitting_pooled_correctedE.pt'
pooled_gk_path = '/Users/sarahurbut/Library/CloudStorage/Dropbox-Personal/data_for_running/pooled_kappa_gamma.pt'
output_dir = '/Users/sarahurbut/Library/CloudStorage/Dropbox/enrollment_predictions_fixedphi_fixedgk_vectorized/'

# Example command (built as a string for display only; copy it into a terminal to run)
example_cmd = f"""
python {prediction_script} \\
    --trained_model_path {master_checkpoint} \\
    --pooled_gamma_kappa_path {pooled_gk_path} \\
    --output_dir {output_dir} \\
    --batch_size 10000 \\
    --num_epochs 200 \\
    --learning_rate 0.1 \\
    --lambda_reg 0.01 \\
    --max_batches 40
"""

print("Example command to run predictions:")
print("="*80)
print(example_cmd)
print("="*80)
print("\nTo run in background:")
print(f"nohup python {prediction_script} \\")
print(f"    --trained_model_path {master_checkpoint} \\")
print(f"    --pooled_gamma_kappa_path {pooled_gk_path} \\")
print(f"    --output_dir {output_dir} \\")
print(f"    --max_batches 40 \\")
print(f"    > predict_fixedgk.log 2>&1 &")


Example command to run predictions:

python /Users/sarahurbut/aladynoulli2/claudefile/run_aladyn_predict_with_master_vector_cenosrE_fixedgk.py \
    --trained_model_path /Users/sarahurbut/Library/CloudStorage/Dropbox-Personal/data_for_running/master_for_fitting_pooled_correctedE.pt \
    --pooled_gamma_kappa_path /Users/sarahurbut/Library/CloudStorage/Dropbox-Personal/data_for_running/pooled_kappa_gamma.pt \
    --output_dir /Users/sarahurbut/Library/CloudStorage/Dropbox/enrollment_predictions_fixedphi_fixedgk_vectorized/ \
    --batch_size 10000 \
    --num_epochs 200 \
    --learning_rate 0.1 \
    --lambda_reg 0.01 \
    --max_batches 40


To run in background:
nohup python /Users/sarahurbut/aladynoulli2/claudefile/run_aladyn_predict_with_master_vector_cenosrE_fixedgk.py \
    --trained_model_path /Users/sarahurbut/Library/CloudStorage/Dropbox-Personal/data_for_running/master_for_fitting_pooled_correctedE.pt \
    --pooled_gamma_kappa_path /Users/sarahurbut/Library/CloudStorage/Drop
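
The foreground and background commands above repeat the same arguments. One way to avoid the duplication is to build the argument list once and render it with `shlex.join`, which also quotes paths that contain spaces. The helper name `build_predict_cmd` and the placeholder paths below are hypothetical; the flag names are the ones used in this notebook.

```python
import shlex

def build_predict_cmd(script, model_path, pooled_path, out_dir, **options):
    """Return the argument list for the prediction script (hypothetical helper)."""
    cmd = [
        "python", script,
        "--trained_model_path", model_path,
        "--pooled_gamma_kappa_path", pooled_path,
        "--output_dir", out_dir,
    ]
    # Extra keyword arguments become "--name value" pairs
    for flag, value in options.items():
        cmd += [f"--{flag}", str(value)]
    return cmd

# Placeholder paths for illustration only
cmd = build_predict_cmd(
    "predict.py", "master.pt", "pooled_kappa_gamma.pt", "out dir/",
    batch_size=10000, max_batches=40,
)
foreground = shlex.join(cmd)
background = f"nohup {foreground} > predict_fixedgk.log 2>&1 &"
print(foreground)
print(background)
```

The same list can be passed directly to `subprocess.run`, so the displayed command and the executed command cannot drift apart.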

In [None]:
# Optionally: Run a test batch to verify everything works
# Uncomment to run a single batch test

# import subprocess
# import sys

# test_cmd = [
#     sys.executable,
#     prediction_script,
#     '--trained_model_path', master_checkpoint,
#     '--pooled_gamma_kappa_path', pooled_gk_path,
#     '--output_dir', output_dir.rstrip('/') + '_test',  # sibling directory, not a '_test' folder inside output_dir
#     '--batch_size', '10000',
#     '--num_epochs', '10',  # Fewer epochs for testing
#     '--max_batches', '1',  # Just one batch
# ]

# print("Running test batch...")
# result = subprocess.run(test_cmd, capture_output=True, text=True)
# print(result.stdout)
# if result.stderr:
#     print("\nSTDERR:")
#     print(result.stderr)


## Step 3: Verify Predictions

After running predictions, verify that:
1. All batches completed successfully
2. Predictions are concatenated into a single file
3. The model used fixed gamma and kappa (check batch_info.pt)
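
For the third check, one option is to compare the kappa recorded in `batch_info.pt` against the pooled value. The sketch below assumes `batch_info` may contain a `kappa_value` entry (the verification cell treats it as optional) and that the pooled file stores `kappa`; the function name and tolerance are hypothetical, and plain dictionaries stand in for the `torch.load` results.

```python
import math

def kappa_matches_pooled(batch_info, pooled_data, rel_tol=1e-6):
    """Return True/False when kappa_value is recorded, or None when it is missing."""
    if "kappa_value" not in batch_info:
        return None
    return math.isclose(
        float(batch_info["kappa_value"]),
        float(pooled_data["kappa"]),
        rel_tol=rel_tol,
    )

# Toy dictionaries standing in for the loaded files
pooled_data = {"kappa": 2.932933}
print(kappa_matches_pooled({"kappa_value": 2.932933}, pooled_data))  # recorded and equal
print(kappa_matches_pooled({"kappa_value": 2.90}, pooled_data))      # recorded but different
print(kappa_matches_pooled({}, pooled_data))                         # not recorded
```

Returning `None` for a missing entry keeps "not recorded" distinct from "recorded but different", which call for different follow-up.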


In [None]:
# Check prediction results
import torch
from pathlib import Path

output_dir = '/Users/sarahurbut/Library/CloudStorage/Dropbox/enrollment_predictions_fixedphi_fixedgk_vectorized/'
info_path = Path(output_dir) / 'batch_info.pt'
full_pi_path = Path(output_dir) / 'pi_enroll_fixedphi_sex_FULL.pt'

print("Verifying Predictions")
print("="*80)

# Check if batch info exists
if info_path.exists():
    batch_info = torch.load(info_path, weights_only=False)
    print(f"✓ Batch info found")
    print(f"  Total batches: {len(batch_info['batches'])}")
    print(f"  Total patients: {batch_info['total_patients']:,}")
    print(f"  Diseases: {batch_info['n_diseases']}")
    print(f"  Timepoints: {batch_info['n_timepoints']}")
    if 'gamma_shape' in batch_info:
        print(f"  Gamma shape used: {batch_info['gamma_shape']}")
    if 'kappa_value' in batch_info:
        print(f"  Kappa value used: {batch_info['kappa_value']:.6f}")
else:
    print(f"✗ Batch info not found at {info_path}")

# Check if full predictions exist
if full_pi_path.exists():
    pi_full = torch.load(full_pi_path, weights_only=False)
    print(f"\n✓ Full predictions found")
    print(f"  Shape: {pi_full.shape}")
    print(f"  Mean pi: {pi_full.mean():.6f}")
    print(f"  Min pi: {pi_full.min():.6f}")
    print(f"  Max pi: {pi_full.max():.6f}")
else:
    print(f"\n✗ Full predictions not found at {full_pi_path}")

# List all batch files (sorted() is lexicographic, so First/Last may not follow batch-index order)
batch_files = sorted(Path(output_dir).glob('pi_enroll_fixedphi_sex_*_*.pt'))
if batch_files:
    print(f"\n✓ Found {len(batch_files)} batch prediction files")
    print(f"  First: {batch_files[0].name}")
    print(f"  Last: {batch_files[-1].name}")
else:
    print(f"\n✗ No batch prediction files found")
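
Because `sorted()` compares file names as strings, `batch_100000_110000` sorts before `batch_10000_20000`, as the order of the training log shows. The sketch below sorts by the numeric range in the file name and reports gaps between consecutive ranges. It assumes names end in `_{start}_{stop}.pt`, which holds for the training files and is only presumed for the prediction files; both helper names are hypothetical.

```python
import re

def batch_range(name):
    """Extract (start, stop) from a name ending in _{start}_{stop}.pt."""
    m = re.search(r"_(\d+)_(\d+)\.pt$", name)
    if m is None:
        raise ValueError(f"no batch range found in {name!r}")
    return int(m.group(1)), int(m.group(2))

def find_gaps(ranges):
    """Return (prev_stop, next_start) pairs where consecutive ranges do not meet."""
    return [(a[1], b[0]) for a, b in zip(ranges, ranges[1:]) if a[1] != b[0]]

# The three training file names shown in the pooling log
names = [
    "enrollment_model_W0.0001_batch_0_10000.pt",
    "enrollment_model_W0.0001_batch_100000_110000.pt",
    "enrollment_model_W0.0001_batch_10000_20000.pt",
]
ordered = sorted(names, key=batch_range)
gaps = find_gaps([batch_range(n) for n in ordered])
print(ordered)
print(gaps)
```

With only these three names, the gap check reports the missing stretch between 20000 and 100000; on a complete set of batches it should return an empty list.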


## Summary

This workflow ensures that:
1. **Kappa** (calibration parameter) is fixed at its pooled, population-level value, so calibration no longer varies from batch to batch
2. **Gamma** (genetic effect weights) is fixed at the population level, ensuring consistent genetic effects
3. **Lambda** (individual signature loadings) is still learned per batch, allowing for individual-specific predictions

The key advantage is that predictions will have consistent calibration and genetic effects across all batches, matching the training population parameters.
