Skip to content

A simple job runner utility for SLURM clusters

Notifications You must be signed in to change notification settings

ylabbe/job-runner

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Main features

  • Support for multi-node multi-gpu jobs. The assignment of GPUs to each process (via $CUDA_VISIBLE_DEVICES) is automatically done. Each process of the job is assumed to be single-GPU.
  • Run a command without writing submission scripts
runjob --ngpus=8 --project=myproject --queue=myqueue python distributed_program.py
  • Create automatically one directory for each job and redirect the output of each process to a separate log file.
  • Automatically print the outputs of your job to your monitor. No need to run tail -f manually.
  • Cancel your job using a CTRL+C as if you were running you program interactively.
  • Exports useful environment variables related to your project and available resources.

Exported environment variables:

  • JOB_DIR
  • PROJECT_DIR
  • N_CPUS
  • CONDA_ROOT
  • CONDA_ENV
  • N_PROCS
  • PROC_ID
  • OUT_FILE
  • JOB_LOG_FILE

Config file

See examples/config.yaml for config of projects and queues.

Usage

runjob-config examples/config.yaml
runjob --ngpus=8 --project=myproject --queue=myqueue python distributed_program.py

This will start a multi-gpu (possibly multi-node according to your queue config) job with 1 process per GPU and print the output (stdout and stderr) of one of the processes (the one with SLURM_LOCALID=0). Use keyboard interupt to cancel your job.

Running tests

Make sure you are on a SLURM cluster which sinfo should output something.

runjob-config examples/config.yaml
pytest -vs

Features that will be added in the future:

  • A simple interface to resume long-running jobs automatically.
  • Copy your python project to an temporary directory to ensure modifications you make to your code do not affect your job while it's running.
  • A utility to runjob for interactive jobs.

About

A simple job runner utility for SLURM clusters

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages