Slurm Whacker

A Python tool to manage preemptable Slurm jobs that are blocking pending jobs due to AssocGrpNodeLimit constraints.

Overview

This script identifies pending jobs blocked by AssocGrpNodeLimit and finds preemptable jobs running on nodes used by the same accounts. It then kills the lowest-priority preemptable jobs to free up resources.

This script is an interim workaround while we work with SchedMD on a permanent fix for an issue where jobs blocked by AssocGrpNodeLimit do not preempt running jobs even though they should.

Slurm Account Name Format: Accounts are expected in format <facility>:<repo> or <facility>:<repo>@<cluster> (e.g., fpd:g4sim@milano).
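As an illustration of the account name format, here is a minimal parsing sketch. The names SlurmAccount and parse_account are hypothetical and are not taken from the script itself; the sketch only assumes the format described above.

```python
from typing import NamedTuple, Optional


class SlurmAccount(NamedTuple):
    facility: str
    repo: str
    cluster: Optional[str]  # None when the "@<cluster>" suffix is absent


def parse_account(name: str) -> SlurmAccount:
    """Split '<facility>:<repo>' or '<facility>:<repo>@<cluster>' into parts."""
    base, _, cluster = name.partition("@")
    facility, sep, repo = base.partition(":")
    if not sep or not facility or not repo:
        raise ValueError(f"unexpected account format: {name!r}")
    return SlurmAccount(facility, repo, cluster or None)


print(parse_account("fpd:g4sim@milano"))
# SlurmAccount(facility='fpd', repo='g4sim', cluster='milano')
print(parse_account("fpd:g4sim"))
# SlurmAccount(facility='fpd', repo='g4sim', cluster=None)
```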

Features

  • 🔍 Identifies pending jobs blocked by AssocGrpNodeLimit
  • 🎯 Finds preemptable jobs on nodes where the pending job's facility already has running jobs
  • 📊 Kills lowest-priority jobs first; at equal priority, jobs in 'default' repos go first
  • 🔒 Dry-run mode for safe testing
  • 🏢 Facility-based consolidation analysis to identify node sharing patterns
  • 📈 Fragmentation scoring to assess resource utilization efficiency

Requirements

  • Python 3.9+
  • Slurm workload manager (with squeue, scontrol, and scancel commands)
  • Access permissions to query and cancel Slurm jobs

Setup

Quick Start

Use the provided Makefile to set up everything automatically:

# Create virtual environment and install dependencies
make

# Or explicitly:
make dev

Usage

Using Make (Recommended)

# Dry-run mode (safe - shows what would be done)
# Default: facility matching
make dry-run

# Dry-run with verbose logging
make dry-run-verbose

# Actually kill jobs (facility matching - default)
make run

# Run with verbose logging
make run-verbose

# Show all available commands
make help

Direct Python Usage

# Activate virtual environment first
source venv/bin/activate

# Show help
python kill_preemptable_jobs.py --help

# Dry-run mode (default: facility matching)
python kill_preemptable_jobs.py --dry-run

# Dry-run with tables (shows resource details)
python kill_preemptable_jobs.py --dry-run -v

# Dry-run with full debug logging
python kill_preemptable_jobs.py --dry-run -vv

# Actually kill jobs (default: facility matching)
python kill_preemptable_jobs.py

# Kill jobs with tables
python kill_preemptable_jobs.py -v

# Kill jobs with full debug logging
python kill_preemptable_jobs.py -vv

Command Line Options

  • --dry-run: Show what would be done without actually killing jobs
  • --verbose / -v: Control output verbosity (can be repeated)
    • No flag (default): Show only summaries and key status messages
    • -v: Show summaries + detailed tables (resource breakdowns, job lists)
    • -vv: Show summaries + tables + debug logs (all Slurm commands, data processing)
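The repeatable verbosity flag can be sketched as a counting option. The script lists click as its CLI dependency; this sketch deliberately uses the standard-library argparse instead so that it runs without extra packages, and the helper names build_parser and describe_verbosity are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="kill_preemptable_jobs.py")
    parser.add_argument("--dry-run", action="store_true",
                        help="show what would be done without killing jobs")
    # action="count" makes -v repeatable: no flag -> 0, -v -> 1, -vv -> 2
    parser.add_argument("-v", "--verbose", action="count", default=0,
                        help="repeat for more detail (-v tables, -vv debug)")
    return parser


def describe_verbosity(level: int) -> str:
    """Map the flag count to the output tiers listed above."""
    if level <= 0:
        return "summaries"
    if level == 1:
        return "summaries + tables"
    return "summaries + tables + debug logs"


args = build_parser().parse_args(["--dry-run", "-vv"])
print(args.dry_run, args.verbose, describe_verbosity(args.verbose))
# True 2 summaries + tables + debug logs
```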

Verbosity Examples

# Minimal output - just summaries (47 lines)
python kill_preemptable_jobs.py --dry-run

# Moderate output - summaries + tables (67 lines)
python kill_preemptable_jobs.py --dry-run -v

# Full output - everything including debug info (154+ lines)
python kill_preemptable_jobs.py --dry-run -vv

How It Works

Processing Pipeline

The tool follows an 11-step pipeline to manage preemptable jobs:

Steps 1-6: Data Collection

  1. Query Pending Jobs: Finds all pending non-preemptable jobs with AssocGrpNodeLimit reason
  2. Filter by Account Capacity: Validates pending jobs against account capacity
  3. Display Pending Jobs Summary: Shows summary by facility
  4. Find Occupied Nodes: Locates the nodes where the pending jobs' accounts or facilities already have running jobs
  5. Locate Preemptable Jobs: Finds all running preemptable jobs on those nodes
  6. Get Node Memory Information: Collects memory constraints for resource calculations
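Step 1 might look roughly like the sketch below, which parses pipe-delimited squeue output and keeps the blocked, non-preemptable jobs. The exact squeue invocation used by the script is not shown in this README, so the field layout, the helper names, and the sample QoS value "normal" are all assumptions.

```python
from typing import Dict, List

# Assumed field order, as produced by something like:
#   squeue --noheader --states=PENDING --format="%i|%a|%q|%r"
# (%i job ID, %a account, %q QOS, %r reason). The real script may differ.
FIELDS = ("job_id", "account", "qos", "reason")


def parse_squeue(text: str) -> List[Dict[str, str]]:
    """Turn pipe-delimited squeue lines into dicts, skipping malformed lines."""
    jobs = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == len(FIELDS):
            jobs.append(dict(zip(FIELDS, parts)))
    return jobs


def blocked_jobs(jobs: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep non-preemptable jobs whose pending reason is AssocGrpNodeLimit."""
    return [j for j in jobs
            if j["reason"] == "AssocGrpNodeLimit" and j["qos"] != "preemptable"]


# Made-up sample output; "normal" is a hypothetical QoS name.
sample = """\
1001|fpd:g4sim@milano|normal|AssocGrpNodeLimit
1002|fpd:g4sim@milano|preemptable|AssocGrpNodeLimit
1003|atlas:default@milano|normal|Priority
"""
print([j["job_id"] for j in blocked_jobs(parse_squeue(sample))])
# ['1001']
```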

Step 7: Account-Based Processing

  • Process each account independently based on account limits and usage
  • Calculate resource requirements per partition
  • Determine jobs to terminate based on account limits
  • Sort by priority (lowest first), then prefer 'default' repos
  • Generate account results with kill decisions
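The ordering rule above (lowest priority first, then 'default' repos) can be expressed as a two-part sort key. This is a sketch only: the Job tuple and the helpers repo_of and kill_order are hypothetical, and the priorities are made-up values.

```python
from typing import List, NamedTuple


class Job(NamedTuple):
    job_id: int
    account: str   # "<facility>:<repo>" with an optional "@<cluster>" suffix
    priority: int


def repo_of(account: str) -> str:
    """Extract the repo part of an account name."""
    return account.partition("@")[0].partition(":")[2]


def kill_order(jobs: List[Job]) -> List[Job]:
    # Sort by priority ascending. Within a priority, False sorts before True,
    # so jobs in 'default' repos come first.
    return sorted(jobs, key=lambda j: (j.priority, repo_of(j.account) != "default"))


jobs = [
    Job(1, "fpd:g4sim@milano", 50),
    Job(2, "fpd:default@milano", 50),
    Job(3, "fpd:analysis@milano", 10),
]
print([j.job_id for j in kill_order(jobs)])
# [3, 2, 1]
```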

Step 8: Facility-Based Processing

  • Group accounts by facility prefix (e.g., fpd:g4sim, fpd:analysis → facility fpd)
  • Analyze node distribution across accounts within each facility
  • Calculate fragmentation scores (0.0-1.0) indicating node sharing
  • Identify consolidation opportunities where multiple accounts share nodes
  • Display multi-account node details for visibility

Fragmentation Score Interpretation:

  • < 0.1: Low fragmentation (excellent consolidation)
  • 0.1 to < 0.3: Moderate fragmentation (some sharing)
  • ≥ 0.3: High fragmentation (many nodes shared across accounts)

Step 9: Cross-Account Summary

  • Display total statistics across all accounts
  • Show per-account breakdown of kill decisions

Step 10: Display All Jobs by Account

  • Show detailed job listings grouped by account
  • Mark jobs to be killed vs preserved

Step 11: Execute Terminations

  • Cancel minimum number of preemptable jobs to free required resources
  • Or perform dry-run to show what would be done
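A minimal sketch of the termination step, assuming jobs are cancelled one at a time with scancel and that a failed cancellation is logged without stopping the loop. The function cancel_jobs and its run parameter are hypothetical; the parameter exists only so the sketch can be exercised without a Slurm installation.

```python
import subprocess
from typing import Callable, Iterable, List


def cancel_jobs(job_ids: Iterable[int], dry_run: bool = True,
                run: Callable = subprocess.run) -> List[int]:
    """Return the IDs that were cancelled (or would be, in dry-run mode)."""
    done = []
    for job_id in job_ids:
        cmd = ["scancel", str(job_id)]
        if dry_run:
            # Dry-run: report the command instead of executing it.
            print("DRY-RUN would run:", " ".join(cmd))
            done.append(job_id)
            continue
        try:
            run(cmd, check=True, capture_output=True, text=True)
            done.append(job_id)
        except (subprocess.CalledProcessError, OSError) as exc:
            # Log and continue, so one failure does not stop the rest.
            print(f"failed to cancel {job_id}: {exc}")
    return done


cancel_jobs([1001, 1002], dry_run=True)
```

Defaulting dry_run to True in the sketch means a caller has to opt in explicitly before anything is cancelled.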

Visual Workflow

┌─────────────────────────────────────────────────────────────────┐
│ Steps 1-6: Data Collection                                      │
│ • Get pending jobs with AssocGrpNodeLimit                       │
│ • Filter by account capacity                                    │
│ • Get nodes for pending jobs                                    │
│ • Get and filter preemptable jobs                               │
│ • Get node memory information                                   │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│ Step 7: ACCOUNT-BASED PROCESSING                                │
│ • Process each account independently                            │
│ • Calculate resource requirements per partition                 │
│ • Determine jobs to terminate based on account limits           │
│ • Generate account_results list                                 │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│ Step 8: FACILITY-BASED PROCESSING                               │
│ • Group accounts by facility prefix                             │
│ • Analyze node distribution per facility                        │
│ • Calculate fragmentation scores                                │
│ • Identify consolidation opportunities                          │
│ • Display multi-account node details                            │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│ Step 9: CROSS-ACCOUNT SUMMARY                                   │
│ • Display total statistics across all accounts                  │
│ • Show per-account breakdown of kill decisions                  │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│ Step 10: Display All Jobs by Account                            │
│ • Show detailed job listings                                    │
│ • Mark jobs to be killed vs preserved                           │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│ Step 11: Execute Terminations                                   │
│ • Kill jobs (or dry-run)                                        │
└─────────────────────────────────────────────────────────────────┘

Facility-Based Consolidation Analysis

What is a Facility?

In the account naming convention <facility>:<repo>@<partition>, the facility is the prefix before the colon. For example:

  • In fpd:g4sim@milano: fpd is the facility, g4sim is the repo, milano is the partition
  • Multiple accounts can share the same facility: fpd:g4sim, fpd:analysis, fpd:simulation
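Grouping accounts by facility then amounts to keying on the text before the colon. A short sketch, with the hypothetical helper group_by_facility and made-up account names:

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_facility(accounts: Iterable[str]) -> Dict[str, List[str]]:
    """Map each facility prefix to the accounts that share it."""
    groups = defaultdict(list)
    for account in accounts:
        facility = account.split(":", 1)[0]
        groups[facility].append(account)
    return dict(groups)


print(group_by_facility(["fpd:g4sim", "fpd:analysis", "atlas:default"]))
# {'fpd': ['fpd:g4sim', 'fpd:analysis'], 'atlas': ['atlas:default']}
```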

Node Distribution Analysis

The facility-based processing analyzes how jobs are distributed across nodes:

  • Single-account nodes: Nodes where all jobs belong to one account (optimal consolidation)
  • Multi-account nodes: Nodes shared by multiple accounts from the same facility (fragmentation)

Example Output

================================================================================
FACILITY-BASED PROCESSING
================================================================================
Analyzing job placement by facility to identify consolidation opportunities.

Found 2 unique facility/facilities: atlas, fpd

----------------------------------------------------------------------------------------------------
FACILITY: fpd
----------------------------------------------------------------------------------------------------
Accounts in facility: 3
  → fpd:g4sim@milano
  → fpd:analysis@milano
  → fpd:simulation@milano

Total pending jobs: 45
Total preemptable jobs: 120
Total jobs to kill (from account-based): 25

Node Distribution Analysis:
  → Total nodes in use: 50
  → Nodes with single account: 35 (70.0%)
  → Nodes with multiple accounts: 15 (30.0%)
  → Fragmentation score: 0.30

⚠️  High fragmentation detected - multiple accounts sharing many nodes

Nodes with multiple accounts (consolidation opportunities):
  → node001: 2 accounts (fpd:g4sim@milano, fpd:analysis@milano), 8 jobs
  → node015: 2 accounts (fpd:analysis@milano, fpd:simulation@milano), 6 jobs
  → node023: 3 accounts (fpd:g4sim@milano, fpd:analysis@milano, fpd:simulation@milano), 12 jobs
  ... and 12 more nodes
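In the example output above, 15 of 50 nodes host multiple accounts and the reported score is 0.30, which is consistent with the fragmentation score being the fraction of in-use nodes shared by more than one account. That formula is inferred from the example rather than confirmed by the source, so treat the following as a sketch under that assumption; node_distribution is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def node_distribution(placements: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """placements: (node, account) pairs for running jobs in one facility."""
    accounts_on: Dict[str, Set[str]] = defaultdict(set)
    for node, account in placements:
        accounts_on[node].add(account)
    total = len(accounts_on)
    multi = sum(1 for accts in accounts_on.values() if len(accts) > 1)
    # Assumed definition: share of in-use nodes hosting more than one account.
    score = multi / total if total else 0.0
    return {"total": total, "single": total - multi, "multi": multi, "score": score}


placements = [
    ("node001", "fpd:g4sim@milano"), ("node001", "fpd:analysis@milano"),
    ("node002", "fpd:g4sim@milano"), ("node003", "fpd:g4sim@milano"),
    ("node004", "fpd:simulation@milano"),
]
print(node_distribution(placements))
# {'total': 4, 'single': 3, 'multi': 1, 'score': 0.25}
```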

Benefits

  1. Visibility: Provides insights into resource fragmentation at the facility level
  2. Planning: Helps administrators understand consolidation opportunities
  3. Efficiency: Identifies cases where better job placement could improve resource utilization
  4. Debugging: Helps diagnose scheduling inefficiencies related to node sharing

Design Considerations

  • Non-Invasive: Currently read-only analysis that doesn't modify kill decisions
  • Facility-Aware: Respects the facility grouping implicit in account names
  • Partition-Aware: Maintains partition boundaries (jobs only analyzed within their partition)

Future Enhancement Opportunities

While the current implementation focuses on analysis and reporting, future enhancements could include:

  • Active consolidation that modifies kill decisions to prioritize better facility-level consolidation
  • Node affinity preferences when selecting which preemptable jobs to kill
  • Facility-level bin packing to optimize job placement
  • Cross-account coordination within facilities

Resource Calculation

  • CPUs & GPUs: Summed across all jobs, since these resources can be aggregated across nodes
  • Memory: Treated as a per-node constraint; the requirement is the largest single job's request, since jobs can run on different nodes
  • The script kills the minimum number of preemptable jobs needed to satisfy resource requirements
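The aggregation rules above can be sketched as follows: CPUs and GPUs are summed, memory takes the largest single request, and candidates are consumed in kill order until the CPU and GPU totals are covered. The function names and dict keys are hypothetical, the numbers are made up, and the per-node memory check is omitted for brevity. Note that this pass picks the shortest prefix of the kill order, which is not necessarily the smallest possible set of jobs.

```python
from typing import Dict, List


def required_resources(pending: List[Dict[str, int]]) -> Dict[str, int]:
    """Sum CPUs and GPUs; take the largest single memory request."""
    return {
        "cpus": sum(j["cpus"] for j in pending),
        "gpus": sum(j["gpus"] for j in pending),
        "mem_mb": max((j["mem_mb"] for j in pending), default=0),
    }


def pick_jobs_to_kill(candidates: List[Dict[str, int]],
                      need: Dict[str, int]) -> List[int]:
    """Walk candidates in kill order; stop once enough CPUs and GPUs are freed."""
    freed = {"cpus": 0, "gpus": 0}
    chosen = []
    for job in candidates:
        if freed["cpus"] >= need["cpus"] and freed["gpus"] >= need["gpus"]:
            break
        chosen.append(job["job_id"])
        freed["cpus"] += job["cpus"]
        freed["gpus"] += job["gpus"]
    return chosen


pending = [
    {"cpus": 4, "gpus": 0, "mem_mb": 8000},
    {"cpus": 8, "gpus": 1, "mem_mb": 16000},
]
need = required_resources(pending)
print(need)
# {'cpus': 12, 'gpus': 1, 'mem_mb': 16000}

candidates = [  # already sorted in kill order
    {"job_id": 1, "cpus": 8, "gpus": 0},
    {"job_id": 2, "cpus": 4, "gpus": 1},
    {"job_id": 3, "cpus": 16, "gpus": 2},
]
print(pick_jobs_to_kill(candidates, need))
# [1, 2]
```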

Makefile Targets

Setup

  • make or make venv - Create virtual environment and install dependencies
  • make install - Install/update dependencies in existing venv

Dry-run (Safe Testing)

  • make dry-run - Run in dry-run mode with facility matching (default)
  • make dry-run-verbose - Run in dry-run mode with debug logging (facility)
  • make dry-run-full-account - Run in dry-run mode with full account matching
  • make dry-run-full-account-verbose - Run in dry-run mode with full account matching & verbose

Execute (Will Kill Jobs)

  • make run - Run with facility matching and kill jobs (default)
  • make run-verbose - Run with facility matching & verbose logging
  • make run-full-account - Run with full account matching and kill jobs (narrower scope)
  • make run-full-account-verbose - Run with full account matching & verbose logging

Maintenance

  • make clean - Remove virtual environment and cache files
  • make help - Show help message

Development

Project Structure

slurm-whacker/
├── kill_preemptable_jobs.py  # Main script
├── requirements.txt           # Python dependencies
├── Makefile                   # Build automation
├── README.md                  # This file
├── CHANGELOG.md               # Version history
├── QUICK_REFERENCE.md         # Quick usage guide
└── TESTING.md                 # Testing guide

Dependencies

  • click (>=8.0.4,<8.1.0) - Command line interface framework
  • loguru (>=0.7.0) - Beautiful logging with colors
  • python-hostlist (>=1.21) - Slurm hostlist parsing for performance optimization

Cleaning Up

# Remove virtual environment and cache files
make clean

Safety Notes

  • ⚠️ Always use --dry-run first to see what will be affected
  • 🏢 Facility matching is the default - this affects all repos in the facility
  • Only preemptable jobs (QoS "preemptable") will be targeted
  • Jobs are killed in priority order (lowest priority first); at the same priority level, jobs in 'default' repos are killed before those in other repos
  • Failed cancellations are logged but don't stop the process

License

This tool is provided as-is for managing Slurm workloads.

Contributing

Feel free to submit issues or pull requests for improvements.
