# HPO Job Runner for HPC

**Purpose**: Submit and manage hyperparameter optimization (HPO) jobs on a SLURM cluster

**Date**: January 16, 2026

---

## 📋 Overview

This notebook helps you:
1. Submit individual HPO jobs to SLURM
2. Submit batch experiments (all models for a dataset)
3. Monitor running jobs
4. Check job status and logs
5. Analyze completed results

**Models**: NHITS_Q, TFT_Q, TIMESNET_Q  
**Datasets**: heat, water_centrum, water_tommerby  
**Total Experiments**: 9 (3 models × 3 datasets)
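
The nine experiments are the Cartesian product of the model and dataset lists. As a minimal sketch (the `experiments` name is illustrative and not part of the notebook), they can be enumerated in the same dataset-major order that the batch submission function in this notebook uses:

```python
from itertools import product

MODELS = ['NHITS_Q', 'TFT_Q', 'TIMESNET_Q']
DATASETS = ['heat', 'water_centrum', 'water_tommerby']

# Dataset-major order: all models for the first dataset, then the next dataset
experiments = [(model, dataset) for dataset, model in product(DATASETS, MODELS)]

for i, (model, dataset) in enumerate(experiments, start=1):
    print(f"[{i}/{len(experiments)}] {model} on {dataset}")
```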

## 🔧 Setup

In [38]:
import subprocess
import json
import os
from pathlib import Path
from datetime import datetime
import time
import pandas as pd
from glob import glob

# Set working directory
os.chdir('/home/hpc/iwi5/iwi5389h/ExAI-Timeseries-Thesis')
print(f"✅ Working directory: {os.getcwd()}")

# Configuration
MODELS = ['NHITS_Q', 'TFT_Q', 'TIMESNET_Q']
DATASETS = ['heat', 'water_centrum', 'water_tommerby']
DEFAULT_TRIALS = 50

print(f"✅ Models: {', '.join(MODELS)}")
print(f"✅ Datasets: {', '.join(DATASETS)}")

✅ Working directory: /home/hpc/iwi5/iwi5389h/ExAI-Timeseries-Thesis
✅ Models: NHITS_Q, TFT_Q, TIMESNET_Q
✅ Datasets: heat, water_centrum, water_tommerby


## 🚀 Job Submission Functions

In [39]:
def submit_single_job(model, dataset, trials=50, dry_run=False):
    """
    Submit a single HPO job to SLURM
    
    Args:
        model: Model name (NHITS_Q, TFT_Q, TIMESNET_Q)
        dataset: Dataset name (heat, water_centrum, water_tommerby)
        trials: Number of optimization trials (default: 50)
        dry_run: If True, print command without executing
    
    Returns:
        Job ID if submitted, None otherwise
    """
    cmd = f"./hpo/submit_job.sh {model} {dataset} {trials}"
    
    if dry_run:
        print(f"[DRY RUN] Would execute: {cmd}")
        return None
    
    try:
        result = subprocess.run(
            cmd,
            shell=True,
            capture_output=True,
            text=True,
            check=True
        )
        
        # Parse job ID from output: "Submitted batch job 1234567"
        output = result.stdout.strip()
        job_id = None
        
        for line in output.split('\n'):
            if "Submitted batch job" in line:
                # Extract just the job ID number
                parts = line.split()
                if len(parts) >= 4:
                    job_id = parts[3]
                    break
        
        if job_id:
            print(f"✅ Submitted {model} on {dataset} - Job ID: {job_id}")
            return job_id
        else:
            print(f"⚠️  Job submitted but couldn't extract ID")
            print(f"   Full output:\n{output}")
            return None
            
    except subprocess.CalledProcessError as e:
        print(f"❌ Failed to submit {model} on {dataset}")
        print(f"   stdout: {e.stdout}")
        print(f"   stderr: {e.stderr}")
        return None


def submit_batch_jobs(models, datasets, trials=50, delay=5, dry_run=False):
    """
    Submit multiple HPO jobs with delay between submissions
    
    Args:
        models: List of model names
        datasets: List of dataset names
        trials: Number of trials per job
        delay: Seconds to wait between submissions
        dry_run: If True, print commands without executing
    
    Returns:
        Dictionary mapping (model, dataset) to job_id
    """
    job_ids = {}
    total = len(models) * len(datasets)
    current = 0
    
    print(f"📊 Submitting {total} jobs...\n")
    
    for dataset in datasets:
        for model in models:
            current += 1
            print(f"[{current}/{total}] {model} on {dataset}")
            
            job_id = submit_single_job(model, dataset, trials, dry_run)
            if job_id:
                job_ids[(model, dataset)] = job_id
            
            # Wait between submissions to avoid overwhelming scheduler
            if current < total and not dry_run:
                time.sleep(delay)
        
        print()  # Blank line between datasets
    
    print(f"\n✅ Submitted {len(job_ids)}/{total} jobs successfully")
    return job_ids


def save_job_tracker(job_ids, filename='hpo/hpo_current_jobs.json'):
    """
    Save submitted job IDs for later tracking
    
    Args:
        job_ids: Dictionary mapping (model, dataset) to job_id
        filename: Path to save JSON file
    """
    # Convert tuple keys to string for JSON serialization
    serializable = {
        f"{model}_{dataset}": {
            'job_id': job_id,
            'model': model,
            'dataset': dataset,
            'submitted_at': datetime.now().isoformat()
        }
        for (model, dataset), job_id in job_ids.items()
    }
    
    with open(filename, 'w') as f:
        json.dump(serializable, f, indent=2)
    
    print(f"💾 Saved job tracker to {filename}")

print("✅ Job submission functions loaded")

✅ Job submission functions loaded
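
The job ID extraction in `submit_single_job` relies on the position of the ID within the `Submitted batch job 1234567` line. A regular expression is one alternative that tolerates extra text before or after that line. This is only a sketch: `parse_job_id` is a hypothetical helper, not a function defined in this notebook, and the sample strings are made up.

```python
import re
from typing import Optional


def parse_job_id(output: str) -> Optional[str]:
    """Return the job ID from sbatch-style output, or None if it is absent."""
    match = re.search(r"Submitted batch job (\d+)", output)
    return match.group(1) if match else None


# Hand-written sample output (not captured from a real submission)
sample = "Preparing environment\nSubmitted batch job 1234567\n"
print(parse_job_id(sample))
print(parse_job_id("no job line here"))
```

Returning `None` when nothing matches keeps the same contract as `submit_single_job`, so callers can keep using `if job_id:`.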


## 📊 Job Monitoring Functions

In [40]:
def get_job_status(job_id=None):
    """
    Get status of SLURM jobs
    
    Args:
        job_id: Specific job ID to check, or None for all user jobs
    
    Returns:
        DataFrame with job information
    """
    if job_id:
        cmd = f"squeue -j {job_id} -o '%.18i %.9P %.30j %.8u %.2t %.10M %.6D %R'"
    else:
        cmd = "squeue -u $USER -o '%.18i %.9P %.30j %.8u %.2t %.10M %.6D %R'"
    
    try:
        result = subprocess.run(
            cmd,
            shell=True,
            capture_output=True,
            text=True,
            check=True
        )
        
        lines = result.stdout.strip().split('\n')
        if len(lines) <= 1:
            print("No jobs found")
            return None
        
        # Parse output into DataFrame
        header = lines[0].split()
        data = [line.split(None, len(header)-1) for line in lines[1:]]
        
        df = pd.DataFrame(data, columns=header)
        return df
        
    except subprocess.CalledProcessError:
        print("❌ Failed to get job status")
        return None


def check_hpo_jobs():
    """
    Check status of all HPO jobs (jobs with 'hpo_' prefix)
    
    Returns:
        DataFrame with HPO job information
    """
    df = get_job_status()
    if df is None:
        return None
    
    # Filter for HPO jobs
    hpo_jobs = df[df['NAME'].str.contains('hpo_', na=False)]
    
    if len(hpo_jobs) == 0:
        print("No HPO jobs currently running")
        return None
    
    return hpo_jobs


def tail_log(model, dataset, job_id, lines=20, log_type='log'):
    """
    Display last N lines of a job's log file
    
    Args:
        model: Model name
        dataset: Dataset name
        job_id: Job ID
        lines: Number of lines to show
        log_type: 'log' for stdout or 'err' for stderr
    """
    extension = 'log' if log_type == 'log' else 'err'
    log_file = f"hpo/logs/hpo_{model}_{dataset}_{job_id}.{extension}"
    
    if not os.path.exists(log_file):
        print(f"❌ Log file not found: {log_file}")
        # Try to find similar files
        pattern = f"hpo/logs/hpo_{model}_{dataset}_*.{extension}"
        matches = glob(pattern)
        if matches:
            print(f"\n💡 Found similar files:")
            for match in matches:
                print(f"   {match}")
        return
    
    cmd = f"tail -n {lines} {log_file}"
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    
    log_type_name = "stdout" if log_type == 'log' else "stderr"
    print(f"📄 Last {lines} lines of {log_file} ({log_type_name}):\n")
    print(result.stdout)


def view_logs(model, dataset, job_id, lines=30):
    """
    Display both stdout and stderr logs for a job
    
    Args:
        model: Model name
        dataset: Dataset name
        job_id: Job ID
        lines: Number of lines to show from each log
    """
    print("="*80)
    print(f"LOGS FOR: {model} on {dataset} (Job {job_id})")
    print("="*80)
    
    # Check stdout log
    print("\n📊 STDOUT LOG:")
    print("-"*80)
    tail_log(model, dataset, job_id, lines, 'log')
    
    # Check stderr log
    print("\n" + "="*80)
    print("⚠️  STDERR LOG:")
    print("-"*80)
    tail_log(model, dataset, job_id, lines, 'err')
    print("="*80)


def cancel_job(job_id):
    """
    Cancel a SLURM job
    
    Args:
        job_id: Job ID to cancel
    """
    cmd = f"scancel {job_id}"
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    
    if result.returncode == 0:
        print(f"✅ Cancelled job {job_id}")
    else:
        print(f"❌ Failed to cancel job {job_id}")
        print(result.stderr)

print("✅ Job monitoring functions loaded")

✅ Job monitoring functions loaded
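
`get_job_status` splits each `squeue` row with a bounded `maxsplit`, so the final `NODELIST(REASON)` column stays in one piece even when the reason contains spaces. The sketch below shows that parsing step in isolation on hand-written text. `parse_squeue` is a hypothetical helper, and the second sample row (including its reason string) is invented for illustration.

```python
def parse_squeue(text):
    """Parse whitespace-separated squeue output into a list of row dicts."""
    lines = text.strip().split('\n')
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        # Split into at most len(header) fields; the last field keeps its spaces
        fields = line.split(None, len(header) - 1)
        rows.append(dict(zip(header, fields)))
    return rows


# Hand-written sample; the second row is hypothetical
sample = """
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1511645 a100 hpo_TIMESNET_Q_heat iwi5389h PD 0:00 1 (Priority)
1234567 a100 hpo_NHITS_Q_heat iwi5389h PD 0:00 1 (held by admin)
"""

rows = parse_squeue(sample)
for row in rows:
    print(row['JOBID'], row['ST'], row['NODELIST(REASON)'])
```

The list of dicts can be passed directly to `pd.DataFrame(rows)` to get the same table shape that `get_job_status` returns.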


## 🎯 Quick Actions

### Option 1: Submit Single Job

In [49]:
# Submit a single job
# Modify these values as needed:

MODEL = 'TIMESNET_Q'  # NHITS_Q, TFT_Q, or TIMESNET_Q
DATASET = 'water_tommerby'   # heat, water_centrum, or water_tommerby
TRIALS = 50        # Number of optimization trials
DRY_RUN = False     # False submits the job; set to True to preview the command

job_id = submit_single_job(MODEL, DATASET, TRIALS, dry_run=DRY_RUN)

✅ Submitted TIMESNET_Q on water_tommerby - Job ID: 1511651


### Option 2: Submit All Jobs for One Dataset

In [None]:
# Submit all 3 models for a specific dataset

DATASET = 'heat'   # heat, water_centrum, or water_tommerby
TRIALS = 50
DRY_RUN = True     # Set to False to actually submit

job_ids = submit_batch_jobs(
    models=MODELS,
    datasets=[DATASET],
    trials=TRIALS,
    delay=5,
    dry_run=DRY_RUN
)

if not DRY_RUN and job_ids:
    save_job_tracker(job_ids)

### Option 3: Submit ALL 9 Experiments (Priority Order)

In [24]:
# # Submit all experiments in priority order:
# # Priority 1: heat (most critical)
# # Priority 2: water_centrum
# # Priority 3: water_tommerby

# TRIALS = 50
# DRY_RUN = True  # Set to False to actually submit

# PRIORITY_ORDER = ['heat', 'water_centrum', 'water_tommerby']

# all_job_ids = submit_batch_jobs(
#     models=MODELS,
#     datasets=PRIORITY_ORDER,
#     trials=TRIALS,
#     delay=5,
#     dry_run=DRY_RUN
# )

# if not DRY_RUN and all_job_ids:
#     save_job_tracker(all_job_ids)
#     print(f"\n📊 Total jobs submitted: {len(all_job_ids)}")
#     print(f"⏱️  Estimated total GPU hours: 80-100 hours")
#     print(f"💾 Job tracker saved to hpo/hpo_current_jobs.json")

## 📈 Monitor Running Jobs

### Check All User Jobs

In [50]:
# Check all your running jobs
df = get_job_status()

if df is not None:
    print(f"\n📊 Total active jobs: {len(df)}\n")
    display(df)


📊 Total active jobs: 9



|   | JOBID | PARTITION | NAME | USER | ST | TIME | NODES | NODELIST(REASON) |
|---|---|---|---|---|---|---|---|---|
| 0 | 1511651 | a100 | hpo_TIMESNET_Q_water_tommerby | iwi5389h | PD | 0:00 | 1 | (Priority) |
| 1 | 1511650 | a100 | hpo_TFT_Q_water_tommerby | iwi5389h | PD | 0:00 | 1 | (Priority) |
| 2 | 1511649 | a100 | hpo_NHITS_Q_water_tommerby | iwi5389h | PD | 0:00 | 1 | (Priority) |
| 3 | 1511648 | a100 | hpo_TIMESNET_Q_water_centrum | iwi5389h | PD | 0:00 | 1 | (Priority) |
| 4 | 1511647 | a100 | hpo_TFT_Q_water_centrum | iwi5389h | PD | 0:00 | 1 | (Priority) |
| 5 | 1511646 | a100 | hpo_NHITS_Q_water_centrum | iwi5389h | PD | 0:00 | 1 | (Priority) |
| 6 | 1511645 | a100 | hpo_TIMESNET_Q_heat | iwi5389h | PD | 0:00 | 1 | (Priority) |
| 7 | 1511644 | a100 | hpo_TFT_Q_heat | iwi5389h | PD | 0:00 | 1 | (Priority) |
| 8 | 1511642 | a100 | hpo_NHITS_Q_heat | iwi5389h | PD | 0:00 | 1 | (Priority) |


### Check HPO Jobs Only

In [37]:
# Check only HPO-related jobs
hpo_df = check_hpo_jobs()

if hpo_df is not None:
    print(f"\n🔬 HPO jobs in queue: {len(hpo_df)}\n")
    display(hpo_df)
    
    # Count by status
    status_counts = hpo_df['ST'].value_counts()
    print("\nStatus breakdown:")
    for status, count in status_counts.items():
        status_name = {'R': 'Running', 'PD': 'Pending', 'CG': 'Completing'}.get(status, status)
        print(f"  {status_name}: {count}")

No jobs found


### View Job Log

In [None]:
# View last 30 lines of a job's logs (both stdout and stderr)
# Modify these values:

MODEL = 'NHITS_Q'
DATASET = 'heat'
JOB_ID = '1234567'  # Replace with actual job ID
LINES = 30

# Option 1: View both stdout and stderr
view_logs(MODEL, DATASET, JOB_ID, LINES)

# Option 2: View only stdout
# tail_log(MODEL, DATASET, JOB_ID, LINES, 'log')

# Option 3: View only stderr
# tail_log(MODEL, DATASET, JOB_ID, LINES, 'err')

### Load Tracked Jobs

In [27]:
# Load previously submitted jobs from tracker file
tracker_file = 'hpo/hpo_current_jobs.json'

if os.path.exists(tracker_file):
    with open(tracker_file) as f:
        tracked_jobs = json.load(f)
    
    print(f"📊 Tracked jobs: {len(tracked_jobs)}\n")
    
    # Create summary DataFrame
    summary_data = []
    for key, info in tracked_jobs.items():
        summary_data.append({
            'Model': info['model'],
            'Dataset': info['dataset'],
            'Job ID': info['job_id'],
            'Submitted': info['submitted_at'][:19]  # Trim fractional seconds
        })
    
    df_tracked = pd.DataFrame(summary_data)
    display(df_tracked)
else:
    print("❌ No tracked jobs file found")
    print("   Submit jobs first to create tracking file")

❌ No tracked jobs file found
   Submit jobs first to create tracking file
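
The tracker file stores one record per `MODEL_DATASET` key, because JSON cannot use the `(model, dataset)` tuples as keys. The round trip below is a sketch of how the saved records map back to the original dictionary. It writes to a temporary directory rather than `hpo/hpo_current_jobs.json`, and the `records` and `restored` names are illustrative.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path

# Job IDs taken from the squeue listing earlier in this notebook
job_ids = {('NHITS_Q', 'heat'): '1511642', ('TFT_Q', 'heat'): '1511644'}

# Same record layout that save_job_tracker writes
records = {
    f"{model}_{dataset}": {
        'job_id': job_id,
        'model': model,
        'dataset': dataset,
        'submitted_at': datetime.now().isoformat(),
    }
    for (model, dataset), job_id in job_ids.items()
}

with tempfile.TemporaryDirectory() as tmp:
    tracker = Path(tmp) / 'hpo_current_jobs.json'
    tracker.write_text(json.dumps(records, indent=2))
    loaded = json.loads(tracker.read_text())

# Rebuild the tuple-keyed mapping from the stored model and dataset fields
restored = {(info['model'], info['dataset']): info['job_id'] for info in loaded.values()}
print(restored == job_ids)
```

Rebuilding the keys from the stored `model` and `dataset` fields avoids splitting the `MODEL_DATASET` string, which would be ambiguous since both parts contain underscores.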


## 🗑️ Job Management

### Cancel a Specific Job

In [None]:
# Cancel a job by ID
JOB_ID = '1234567'  # Replace with actual job ID

# Uncomment to execute:
# cancel_job(JOB_ID)

### Cancel All HPO Jobs (DANGEROUS!)

In [35]:
# Cancel ALL HPO jobs
# WARNING: This will cancel all running HPO experiments!

CONFIRM = False  # Set to True to execute

if CONFIRM:
    hpo_df = check_hpo_jobs()
    if hpo_df is not None:
        job_ids = hpo_df['JOBID'].tolist()
        print(f"⚠️  Cancelling {len(job_ids)} HPO jobs...\n")

        for job_id in job_ids:
            cancel_job(job_id)

        print(f"\n✅ Cancelled {len(job_ids)} jobs")
else:
    print("⚠️  Set CONFIRM=True to cancel all HPO jobs")

⚠️  Set CONFIRM=True to cancel all HPO jobs


## 📊 Results Analysis

### Check Completed Results

In [None]:
# Find all completed HPO results
result_files = glob('hpo/results/*/best_params_*.json')

print(f"📊 Found {len(result_files)} completed HPO results\n")

if result_files:
    results_summary = []
    
    for result_file in sorted(result_files):
        with open(result_file) as f:
            data = json.load(f)
        
        # Extract info from filename: best_params_MODEL_DATASET_JOBID.json
        filename = os.path.basename(result_file)
        stem = filename.replace('best_params_', '').replace('.json', '')

        # Model and dataset names both contain underscores (e.g., NHITS_Q,
        # water_centrum), so split off the job ID first and then match the
        # remainder against the known dataset names.
        prefix, _, job_id = stem.rpartition('_')
        model = dataset = 'unknown'
        for known in DATASETS:
            if prefix.endswith(f"_{known}"):
                dataset = known
                model = prefix[:-len(known) - 1]
                break
        
        balanced = data.get('best_balanced', {})
        best_mae = data.get('best_mae', {})
        
        results_summary.append({
            'Model': model,
            'Dataset': dataset,
            'Job ID': job_id,
            'Best MAE': f"{best_mae.get('mae', 999):.3f}",
            'PICP (%)': f"{best_mae.get('picp', 0)*100:.1f}",
            'Balanced MAE': f"{balanced.get('mae', 999):.3f}",
            'Balanced PICP (%)': f"{balanced.get('picp', 0)*100:.1f}",
            'Pareto Solutions': data.get('num_pareto_solutions', 0)
        })
    
    df_results = pd.DataFrame(results_summary)
    display(df_results)
    
    print(f"\n💾 Result files located in: hpo/results/*/")
else:
    print("❌ No completed results found yet")
    print("   Results will appear in hpo/results/ after jobs complete")

### Run Full Analysis Script

In [None]:
# Run the comprehensive analysis script
!python hpo/analyze_results.py

## 📝 Notes

### Estimated Runtime per Job
- **NHITS_Q**: ~20-25 minutes per trial → **16-20 GPU hours** for 50 trials
- **TFT_Q**: ~30-35 minutes per trial → **25-30 GPU hours** for 50 trials
- **TIMESNET_Q**: ~20-25 minutes per trial → **16-20 GPU hours** for 50 trials
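
These totals follow from multiplying the per-trial time by the trial count and converting minutes to hours. The sketch below reproduces that arithmetic, assuming trials run sequentially on a single GPU. `gpu_hours` is an illustrative helper, and the per-trial ranges are the rough estimates listed above, not measurements.

```python
def gpu_hours(minutes_per_trial, trials=50):
    """Convert a per-trial runtime in minutes into total GPU hours."""
    return minutes_per_trial * trials / 60


# (low, high) minutes per trial, copied from the estimates above
estimates = {'NHITS_Q': (20, 25), 'TFT_Q': (30, 35), 'TIMESNET_Q': (20, 25)}

for model, (low, high) in estimates.items():
    print(f"{model}: {gpu_hours(low):.1f}-{gpu_hours(high):.1f} GPU hours")
```

Changing `trials` shows how the totals scale if a job is run with fewer optimization trials.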

### Priority Recommendations
1. **Heat dataset first** (primary thesis dataset)
2. **Water centrum second** (good validation dataset)
3. **Water tommerby last** (additional validation)

### Troubleshooting
- If jobs fail immediately, check logs: `tail -f hpo/logs/hpo_MODEL_DATASET_JOBID.log`
- If jobs stay pending for a long time, check the partition status: `sinfo -p a100` (the partition reported by `squeue` for these jobs)
- For out of memory errors, reduce batch_size in search space

### Resource Allocation
- **GPU**: 1× NVIDIA A100 per job
- **RAM**: 64GB per job
- **Time limit**: 12 hours per job
- **Partition**: a100 (as reported by `squeue` for the submitted jobs)