This guide shows how to configure and manage cluster resources for distributed computing with phasic.

## Overview

The configuration system provides:

- **YAML-based configs** - Separate configuration from code
- **Predefined profiles** - Quick start with standard configurations
- **Script generation** - Automatic SLURM script creation
- **Flexible customization** - Adapt to any cluster setup

## Configuration Management

### Loading Predefined Profiles

phasic includes several predefined profiles for common scenarios:

In [None]:
from phasic.cluster_configs import get_default_config

# Load a predefined profile
config = get_default_config("medium")

print("Medium Cluster Configuration")
print("=" * 60)
print(f"Name: {config.name}")
print(f"Nodes: {config.nodes}")
print(f"CPUs per node: {config.cpus_per_node}")
print(f"Memory per CPU: {config.memory_per_cpu}")
print(f"Time limit: {config.time_limit}")
print(f"Partition: {config.partition}")
print(f"Total devices: {config.total_devices}")
print(f"Platform: {config.platform}")
print("=" * 60)

### Available Profiles

Let's examine all available profiles:

In [None]:
import pandas as pd

# List all profiles
profiles = ["debug", "small", "medium", "large", "production"]

# Intended use case for each profile
use_cases = {
    'debug': 'Quick testing',
    'small': 'Development',
    'medium': 'Standard jobs',
    'large': 'Large-scale inference',
    'production': 'Maximum scale'
}

# Create comparison table
data = []
for profile in profiles:
    cfg = get_default_config(profile)
    data.append({
        'Profile': profile,
        'Nodes': cfg.nodes,
        'CPUs/node': cfg.cpus_per_node,
        'Total Devices': cfg.total_devices,
        'Memory/CPU': cfg.memory_per_cpu,
        'Time Limit': cfg.time_limit,
        'Use Case': use_cases[profile]
    })

df = pd.DataFrame(data)
print("\nAvailable Cluster Profiles:")
print(df.to_string(index=False))

### Creating Custom Configurations

For cluster-specific settings, create a YAML configuration file:

In [None]:
# Example: Create a custom configuration
custom_config_yaml = """
name: my_cluster
nodes: 6
cpus_per_node: 24
memory_per_cpu: "8G"
time_limit: "04:00:00"
partition: "gpu-partition"
qos: "high-priority"
coordinator_port: 12345
platform: "gpu"
gpus_per_node: 4

# Network configuration
network_interface: "ib0"  # InfiniBand interface

# Environment variables
env_vars:
  JAX_ENABLE_X64: "1"
  XLA_PYTHON_CLIENT_PREALLOCATE: "false"
  CUDA_VISIBLE_DEVICES: "0,1,2,3"

# Modules to load
modules_to_load:
  - "cuda/11.8"
  - "python/3.11"
  - "gcc/11.2.0"

# Additional SBATCH options
extra_sbatch_options:
  constraint: "skylake"
  account: "my_project"
"""

# Save to file
import os
os.makedirs("slurm_configs", exist_ok=True)

with open("slurm_configs/my_cluster.yaml", "w") as f:
    f.write(custom_config_yaml)

print("Created custom configuration: slurm_configs/my_cluster.yaml")
print("\nConfiguration content:")
print(custom_config_yaml)
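
Before writing a custom configuration to disk, it can help to sanity-check a few fields. The sketch below is only an illustration: `validate_config` is a hypothetical helper, not part of phasic, and it works on a plain dictionary with the same keys as the YAML above. The regular expressions cover only two of the time formats SLURM accepts.

```python
import re

# Hypothetical helper, not part of phasic.
# Accepts HH:MM:SS or D-HH:MM:SS (SLURM also accepts other forms).
TIME_RE = re.compile(r"^(?:\d+-)?\d{1,2}:[0-5]\d:[0-5]\d$")
# A number followed by one of the SLURM memory unit suffixes.
MEM_RE = re.compile(r"^\d+[KMGT]$")


def validate_config(cfg):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    for key in ("nodes", "cpus_per_node"):
        value = cfg.get(key)
        if not isinstance(value, int) or value < 1:
            problems.append(f"{key} must be a positive integer, got {value!r}")
    if not TIME_RE.match(str(cfg.get("time_limit", ""))):
        problems.append("time_limit should look like HH:MM:SS or D-HH:MM:SS")
    if not MEM_RE.match(str(cfg.get("memory_per_cpu", ""))):
        problems.append("memory_per_cpu should look like 8G or 4000M")
    if cfg.get("platform") == "gpu" and not cfg.get("gpus_per_node"):
        problems.append("platform is gpu but gpus_per_node is missing or 0")
    return problems


example = {
    "nodes": 6,
    "cpus_per_node": 24,
    "memory_per_cpu": "8G",
    "time_limit": "04:00:00",
    "platform": "gpu",
    "gpus_per_node": 4,
}
for problem in validate_config(example):
    print(f"  ! {problem}")
```

An empty result only means these basic checks passed; the cluster may still reject the job for site-specific reasons such as partition limits.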

### Loading Custom Configurations

In [None]:
from phasic.cluster_configs import load_config

# Load the custom configuration
custom_config = load_config("slurm_configs/my_cluster.yaml")

print("Loaded Custom Configuration:")
print("=" * 60)
print(f"Name: {custom_config.name}")
print(f"Platform: {custom_config.platform}")
print(f"Nodes: {custom_config.nodes}")
print(f"GPUs per node: {custom_config.gpus_per_node}")
print(f"Total devices: {custom_config.total_devices}")
print(f"Network: {custom_config.network_interface}")
print("\nEnvironment variables:")
for key, val in custom_config.env_vars.items():
    print(f"  {key}: {val}")
print("\nModules to load:")
for mod in custom_config.modules_to_load:
    print(f"  - {mod}")
print("=" * 60)

## SLURM Script Generation

The `generate_slurm_script.py` tool creates complete SLURM submission scripts from configurations.

### Basic Usage

In [None]:
# Generate script from profile
!python ../examples/generate_slurm_script.py \
    --profile small \
    --script my_inference.py \
    --output submit_small.sh

print("\nGenerated SLURM script: submit_small.sh")
print("\nFirst 30 lines of generated script:")
!head -30 submit_small.sh

### Generated Script Structure

The generated script includes:

1. **SBATCH directives** - Resource requests
2. **Module loading** - Environment setup
3. **Python environment** - Pixi or Conda activation
4. **Coordinator setup** - JAX distributed configuration
5. **Execution** - Running your script with srun
6. **Status reporting** - Job completion information
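
To make the first item concrete, the sketch below shows one way configuration fields could map onto SBATCH directives. It is not the output of `generate_slurm_script.py`: `render_sbatch_header` is a hypothetical helper, and running one task per node with `cpus_per_node` passed as `--cpus-per-task` is an assumption about the layout.

```python
def render_sbatch_header(cfg, job_name="phasic_job"):
    """Build #SBATCH lines from a plain dict (hypothetical helper)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={cfg['nodes']}",
        # Assumption: one task per node that uses all requested CPUs.
        "#SBATCH --ntasks-per-node=1",
        f"#SBATCH --cpus-per-task={cfg['cpus_per_node']}",
        f"#SBATCH --mem-per-cpu={cfg['memory_per_cpu']}",
        f"#SBATCH --time={cfg['time_limit']}",
        f"#SBATCH --partition={cfg['partition']}",
    ]
    # Optional fields are only emitted when present in the config.
    if cfg.get("qos"):
        lines.append(f"#SBATCH --qos={cfg['qos']}")
    if cfg.get("gpus_per_node"):
        lines.append(f"#SBATCH --gres=gpu:{cfg['gpus_per_node']}")
    for option, value in cfg.get("extra_sbatch_options", {}).items():
        lines.append(f"#SBATCH --{option}={value}")
    return "\n".join(lines)


header = render_sbatch_header(
    {
        "nodes": 6,
        "cpus_per_node": 24,
        "memory_per_cpu": "8G",
        "time_limit": "04:00:00",
        "partition": "gpu-partition",
        "qos": "high-priority",
        "gpus_per_node": 4,
        "extra_sbatch_options": {"constraint": "skylake", "account": "my_project"},
    },
    job_name="my_gpu_job",
)
print(header)
```

Comparing a header like this with the `#SBATCH` lines in the generated script is a quick way to confirm that your configuration was picked up as intended.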

### Advanced: Custom Config to Script

In [None]:
# Generate from custom configuration
!python ../examples/generate_slurm_script.py \
    --config slurm_configs/my_cluster.yaml \
    --script my_inference.py \
    --output submit_custom.sh \
    --job-name "my_gpu_job"

print("Generated custom SLURM script: submit_custom.sh")
print("\nSBATCH directives from custom config:")
!grep "^#SBATCH" submit_custom.sh

## Complete Workflow Example

Here's a complete workflow from development to production:

### Step 1: Develop Locally

In [None]:
# Create your inference script
# The string starts with the shebang so that it is the first line of the file
inference_script = """#!/usr/bin/env python3
from phasic import initialize_distributed, Graph, SVGD

# Initialize (works locally AND on SLURM)
dist_info = initialize_distributed()

if dist_info.is_coordinator:
    print(f"Running on {dist_info.global_device_count} devices")

# Your inference code here
# ...
"""

with open("my_inference.py", "w") as f:
    f.write(inference_script)

print("Created my_inference.py")

# Test locally
print("\nTesting locally:")
!python my_inference.py
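
A script can run unchanged in both settings because SLURM exports variables such as `SLURM_JOB_ID`, `SLURM_PROCID`, and `SLURM_NTASKS` to each task. The sketch below illustrates that idea only; `describe_launch_environment` is a hypothetical helper and is not necessarily how `initialize_distributed` decides what to do.

```python
import os


def describe_launch_environment(env=None):
    """Report whether the process looks like a SLURM task (hypothetical helper)."""
    env = os.environ if env is None else env
    if "SLURM_JOB_ID" not in env:
        # No SLURM job: treat this as a single local process.
        return {"mode": "local", "process_id": 0, "num_processes": 1}
    return {
        "mode": "slurm",
        "process_id": int(env.get("SLURM_PROCID", 0)),
        "num_processes": int(env.get("SLURM_NTASKS", 1)),
    }


print(describe_launch_environment())
```

Passing a dictionary instead of relying on `os.environ` makes it easy to try out the SLURM branch on a laptop.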

### Step 2: Test on Small Scale

In [None]:
# Generate script for small-scale testing
!python ../examples/generate_slurm_script.py \
    --profile debug \
    --script my_inference.py \
    --output submit_test.sh

print("Generated test submission script")
print("\nTo submit:")
print("  sbatch submit_test.sh")
print("\nOr quick submit:")
print("  sbatch <(python ../examples/generate_slurm_script.py --profile debug --script my_inference.py)")

### Step 3: Scale to Production

In [None]:
# Generate production-scale script
!python ../examples/generate_slurm_script.py \
    --profile production \
    --script my_inference.py \
    --output submit_production.sh

print("Generated production submission script")
print("\nProduction configuration:")
prod_config = get_default_config("production")
print(f"  Nodes: {prod_config.nodes}")
print(f"  Total devices: {prod_config.total_devices}")
print(f"  Time limit: {prod_config.time_limit}")
print("\nTo submit:")
print("  sbatch submit_production.sh")

## Monitoring Jobs

Once submitted, you can monitor your jobs:

### Check Job Status

In [None]:
# This cell shows commands - run on cluster
monitoring_commands = """
# Check your jobs
squeue -u $USER

# Check specific job
squeue -j <job_id>

# View job details
scontrol show job <job_id>

# View output in real-time
tail -f logs/my_inference_<job_id>.out

# View errors
tail -f logs/my_inference_<job_id>.err

# Cancel job
scancel <job_id>
"""

print("Job Monitoring Commands:")
print("=" * 60)
print(monitoring_commands)
print("=" * 60)
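
If you prefer to check job status from Python, `squeue` can emit machine-readable output through its `-h` (no header) and `-o` (format) options. The following is a minimal sketch; `parse_squeue` and `list_my_jobs` are hypothetical helpers, and `list_my_jobs` only works on a machine where `squeue` is installed.

```python
import subprocess

# %i = job ID, %T = job state, %M = time used, %D = number of nodes
SQUEUE_FORMAT = "%i|%T|%M|%D"


def parse_squeue(text):
    """Parse output produced with SQUEUE_FORMAT into a list of dicts."""
    jobs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        job_id, state, elapsed, nodes = line.split("|")
        jobs.append(
            {"job_id": job_id, "state": state, "elapsed": elapsed, "nodes": int(nodes)}
        )
    return jobs


def list_my_jobs(user):
    """Run squeue for one user and return the parsed jobs (needs SLURM)."""
    result = subprocess.run(
        ["squeue", "-u", user, "-h", "-o", SQUEUE_FORMAT],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_squeue(result.stdout)


# Parsing can be tried without a cluster by feeding in sample text.
sample = "12345|RUNNING|1:02:03|4\n12346|PENDING|0:00|2\n"
for job in parse_squeue(sample):
    print(job)
```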

## Configuration Best Practices

### 1. Start Small

Always test with `debug` or `small` profiles before scaling up:

In [None]:
# Development workflow
workflow = """
1. Test locally:        python my_script.py
2. Test on cluster:     sbatch <(python generate_slurm_script.py --profile debug --script my_script.py)
3. Small scale:         sbatch <(python generate_slurm_script.py --profile small --script my_script.py)
4. Production scale:    sbatch <(python generate_slurm_script.py --profile production --script my_script.py)
"""

print("Recommended Development Workflow:")
print(workflow)

### 2. Cluster-Specific Configs

Create configs for each cluster you use:

In [None]:
# Example: Different clusters
clusters = {
    "local_cluster": {
        "partition": "compute",
        "modules": ["python/3.11"],
        "network": "eth0"
    },
    "hpc_center": {
        "partition": "gpu-nodes",
        "modules": ["cuda/11.8", "python/3.11", "gcc/11"],
        "network": "ib0",
        "qos": "high-priority"
    },
    "cloud_cluster": {
        "partition": "standard",
        "modules": [],  # Using containers
        "network": "eth0"
    }
}

print("Cluster-Specific Settings:")
print("=" * 60)
for cluster, settings in clusters.items():
    print(f"\n{cluster}:")
    for key, val in settings.items():
        print(f"  {key}: {val}")
print("=" * 60)
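
With one YAML file per cluster, you can pick the right file automatically from the login node's hostname. The sketch below assumes made-up hostname patterns and file names; `pick_config_path` is a hypothetical helper, and the patterns need to be replaced with ones that match your sites.

```python
import fnmatch
import socket

# Hypothetical hostname patterns and config paths; adjust for your clusters.
CONFIG_RULES = [
    ("*.hpc.example.org", "slurm_configs/hpc_center.yaml"),
    ("login*.cloud.example.com", "slurm_configs/cloud_cluster.yaml"),
]
DEFAULT_CONFIG = "slurm_configs/local_cluster.yaml"


def pick_config_path(hostname=None, rules=CONFIG_RULES, default=DEFAULT_CONFIG):
    """Return the config path of the first rule whose pattern matches."""
    hostname = socket.getfqdn() if hostname is None else hostname
    for pattern, path in rules:
        if fnmatch.fnmatch(hostname, pattern):
            return path
    return default


print(pick_config_path("node01.hpc.example.org"))
print(pick_config_path("my-laptop"))
```

The returned path can then be passed to `load_config` or to the generator's `--config` option.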

### 3. Resource Estimation

Estimate resources needed for your job:

In [None]:
def estimate_resources(n_particles, n_iterations, model_complexity="medium"):
    """
    Rough resource estimation for SVGD inference.
    
    Parameters
    ----------
    n_particles : int
        Number of SVGD particles
    n_iterations : int
        Number of iterations
    model_complexity : str
        "simple", "medium", or "complex"
    """
    # Rough estimates (adjust based on your model)
    time_per_eval = {
        "simple": 0.001,   # 1ms per evaluation
        "medium": 0.01,    # 10ms per evaluation
        "complex": 0.1     # 100ms per evaluation
    }[model_complexity]
    
    # Serial estimate: assumes evaluations run one after another on one device
    total_evals = n_particles * n_iterations
    total_time_seconds = total_evals * time_per_eval
    
    # Add overhead (30%)
    total_time_seconds *= 1.3
    
    hours = int(total_time_seconds // 3600)
    minutes = int((total_time_seconds % 3600) // 60)
    
    # Memory estimate (very rough)
    memory_per_particle_mb = 10  # Adjust for your model
    total_memory_gb = (n_particles * memory_per_particle_mb) / 1024
    
    print("Resource Estimation:")
    print(f"  Particles: {n_particles:,}")
    print(f"  Iterations: {n_iterations:,}")
    print(f"  Total evaluations: {total_evals:,}")
    print(f"  Estimated time: {hours}h {minutes}m")
    print(f"  Estimated memory: {total_memory_gb:.1f} GB")
    print("\nRecommended configuration:")
    
    if hours < 1:
        print("  Profile: debug or small")
    elif hours < 2:
        print("  Profile: small or medium")
    elif hours < 4:
        print("  Profile: medium or large")
    else:
        print("  Profile: large or production")

# Example estimations
print("Small job:")
estimate_resources(100, 500, "medium")
print("\n" + "=" * 60 + "\n")
print("Large job:")
estimate_resources(1000, 2000, "complex")
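
The estimate above is a serial one, so it ignores the speedup from running on several devices. As a rough sketch, you can divide the serial time by the device count and add a safety margin to get a value for `time_limit`. `slurm_time_limit` is a hypothetical helper, and ideal (linear) scaling is an optimistic assumption that real jobs rarely reach.

```python
import math


def slurm_time_limit(serial_seconds, n_devices=1, safety_factor=1.5):
    """Convert a serial time estimate into an HH:MM:SS string.

    Assumes ideal scaling across n_devices, then multiplies by safety_factor.
    """
    if n_devices < 1:
        raise ValueError("n_devices must be at least 1")
    wall_seconds = math.ceil(serial_seconds / n_devices * safety_factor)
    hours, remainder = divmod(wall_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# Small job: 100 particles x 500 iterations x 0.01 s x 1.3 overhead = 650 s
print(slurm_time_limit(650))                    # 00:16:15
# Large job: 1000 x 2000 x 0.1 s x 1.3 overhead = 260,000 s, on 16 devices
print(slurm_time_limit(260_000, n_devices=16))  # 06:46:15
```

Once you have timings from a `debug` or `small` run, replace the assumed per-evaluation cost with a measured one.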

## Troubleshooting

### Common Issues and Solutions

In [None]:
troubleshooting = {
    "Job stays in queue": [
        "Check partition availability: sinfo",
        "Check your priority: sprio -j <job_id>",
        "Reduce resource requests (nodes, time, memory)",
        "Use different partition or QoS"
    ],
    "Job fails immediately": [
        "Check error log: logs/jobname_<id>.err",
        "Verify modules load correctly",
        "Check Python environment is activated",
        "Test script locally first"
    ],
    "Out of memory": [
        "Increase memory_per_cpu in config",
        "Reduce particles per device",
        "Use batch processing",
        "Check for memory leaks"
    ],
    "Timeout before completion": [
        "Increase time_limit in config",
        "Optimize model computation",
        "Use checkpointing",
        "Reduce iterations or particles"
    ],
    "Inter-node communication fails": [
        "Check network_interface setting",
        "Verify coordinator_port is open",
        "Check firewall rules",
        "Test with single node first"
    ]
}

print("Troubleshooting Guide:")
print("=" * 60)
for issue, solutions in troubleshooting.items():
    print(f"\n{issue}:")
    for solution in solutions:
        print(f"  • {solution}")
print("\n" + "=" * 60)
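
For the inter-node communication case, a quick TCP connection test from a worker node can show whether the coordinator port is reachable. This is a minimal sketch using only the standard library; `port_is_reachable` is a hypothetical helper, and it returns `True` only while something is listening on the port, so run it while the coordinator process is up. The hostname in the comment is a placeholder.

```python
import socket


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Example (placeholder hostname): port_is_reachable("node001", 12345)
```

A `False` result does not say why the connection failed; firewall rules, a wrong `network_interface`, or a coordinator that has not started yet all look the same from here.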

## Advanced Topics

### GPU Configuration

For GPU clusters, set `platform` to `"gpu"`, specify `gpus_per_node`, and export the GPU-related environment variables:

In [None]:
gpu_config_yaml = """
name: gpu_cluster
nodes: 4
cpus_per_node: 16
memory_per_cpu: "8G"
time_limit: "02:00:00"
partition: "gpu"
platform: "gpu"
gpus_per_node: 4

env_vars:
  JAX_PLATFORMS: "gpu"
  CUDA_VISIBLE_DEVICES: "0,1,2,3"
  XLA_PYTHON_CLIENT_MEM_FRACTION: "0.8"

modules_to_load:
  - "cuda/11.8"
  - "cudnn/8.6"
"""

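# Make sure the target directory exists before writing
import os
os.makedirs("slurm_configs", exist_ok=True)

with open("slurm_configs/gpu_cluster.yaml", "w") as f: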
    f.write(gpu_config_yaml)

print("Created GPU cluster configuration")
print("\nKey GPU settings:")
print("  • platform: gpu")
print("  • gpus_per_node: 4")
print("  • JAX_PLATFORMS: gpu")
print("  • CUDA modules loaded")

### Environment Variables Reference

In [None]:
env_vars_reference = {
    "JAX Configuration": {
        "JAX_PLATFORMS": "Backend to use (e.g., 'cpu' or 'gpu')",
        "JAX_ENABLE_X64": "Enable 64-bit precision (1 or 0)",
        "JAX_COORDINATOR_PORT": "Port for distributed coordinator",
    },
    "XLA Configuration": {
        "XLA_FLAGS": "XLA compiler flags",
        "XLA_PYTHON_CLIENT_PREALLOCATE": "Preallocate GPU memory (true/false)",
        "XLA_PYTHON_CLIENT_MEM_FRACTION": "Fraction of GPU memory to use (0-1)",
    },
    "CUDA Configuration": {
        "CUDA_VISIBLE_DEVICES": "Which GPUs to use (e.g., '0,1,2,3')",
        "NCCL_SOCKET_IFNAME": "Network interface for NCCL (e.g., 'ib0')",
    },
    "SLURM Variables": {
        "SLURM_COORDINATOR_ADDRESS": "Set by script (coordinator hostname)",
        "SLURM_JOB_ID": "Auto-set by SLURM",
        "SLURM_PROCID": "Auto-set by SLURM (process rank)",
    }
}

print("Environment Variables Reference:")
print("=" * 60)
for category, variables in env_vars_reference.items():  # avoid shadowing built-in vars()
    print(f"\n{category}:")
    for var, desc in variables.items():
        print(f"  {var}")
        print(f"    {desc}")
print("\n" + "=" * 60)
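
When debugging a job, it is often useful to print which of these variables are actually set inside the running process. The sketch below is one way to do that; `report_env` and the `TRACKED_VARS` list are illustrative names, not part of phasic.

```python
import os

# Names taken from the reference table above.
TRACKED_VARS = [
    "JAX_PLATFORMS",
    "JAX_ENABLE_X64",
    "XLA_FLAGS",
    "XLA_PYTHON_CLIENT_PREALLOCATE",
    "XLA_PYTHON_CLIENT_MEM_FRACTION",
    "CUDA_VISIBLE_DEVICES",
    "NCCL_SOCKET_IFNAME",
    "SLURM_JOB_ID",
    "SLURM_PROCID",
]


def report_env(names=TRACKED_VARS, env=None):
    """Map each variable name to its value, or '<unset>' if it is missing."""
    env = os.environ if env is None else env
    return {name: env.get(name, "<unset>") for name in names}


for name, value in report_env().items():
    print(f"{name:32s} {value}")
```

Calling this at the top of your inference script puts the values in the job's output log, next to any later error messages.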

## Summary

This guide covered:

- **Configuration management** - YAML-based cluster configs
- **Predefined profiles** - Quick start with standard setups
- **Script generation** - Automatic SLURM script creation
- **Workflow best practices** - From development to production
- **Troubleshooting** - Common issues and solutions

## Key Takeaways

1. **Start with predefined profiles** for quick testing
2. **Create cluster-specific configs** for production
3. **Use script generator** to avoid manual SBATCH setup
4. **Test incrementally** (local → debug → small → production)
5. **Monitor resources** to optimize configuration

## Next Steps

- **[Distributed Computing Basics](distributed_computing_basics.ipynb)** - Learn the fundamentals
- **[Distributed SVGD Inference](distributed_svgd_inference.ipynb)** - Apply to inference problems
- **[API Reference](../api/index.html)** - Complete API documentation