Dannce on Slurm HPC

Diego Aldarondo edited this page Jan 17, 2022 · 2 revisions

We offer several modules to assist with launching dannce on Slurm-equipped high-performance clusters. They support parallel inference for center-of-mass (COM) and dannce keypoints, as well as parallel grid search for dannce training.

Slurm configuration file

Differences between slurm clusters are accounted for using a slurm configuration file. The file consists of sbatch command-line arguments that specify the resources required for dannce operations on a slurm system. It also specifies the setup script that activates the appropriate dannce environment on your HPC. For example, this is the configuration file for the Harvard Cannon cluster, cluster/holyoke.yaml:

# Dannce slurm configuration
dannce_train: "--job-name=trainDannce -p olveczkygpu,gpu --mem=80000 -t 3-00:00 --gres=gpu:1 -N 1 -n 8 --constraint=cc5.2 --exclude=holygpu7c1726"
dannce_train_grid: "--job-name=trainDannce -p olveczkygpu,gpu --mem=80000 -t 3-00:00 --gres=gpu:1 -N 1 -n 8 --constraint=cc5.2 --exclude=holygpu7c1726"
dannce_predict: "--job-name=predictDannce -p olveczkygpu,gpu,cox,gpu_requeue --mem=30000 -t 1-00:00 --gres=gpu:1 -N 1 -n 8 --constraint=cc5.2 --exclude=holygpu7c1726"
dannce_multi_predict: "--job-name=predictDannce -p olveczkygpu,gpu,cox,gpu_requeue --mem=30000 -t 0-03:00 --gres=gpu:1 -N 1 -n 8 --constraint=cc5.2 --exclude=holygpu7c1726"

# Com slurm configuration
com_train: "--job-name=trainCom -p olveczkygpu,gpu --mem=30000 -t 3-00:00 --gres=gpu:1 -N 1 -n 8 --constraint=cc5.2 --exclude=holygpu7c1726"
com_predict: "--job-name=predictCom -p olveczkygpu,gpu,cox,gpu_requeue --mem=10000 -t 1-00:00 --gres=gpu:1 -N 1 -n 8 --constraint=cc5.2 --exclude=holygpu7c1726"
com_multi_predict: "--job-name=predictCom -p olveczkygpu,gpu,cox,gpu_requeue --mem=10000 -t 0-03:00 --gres=gpu:1 -N 1 -n 8 --constraint=cc5.2 --exclude=holygpu7c1726"

# Inference
inference: '--job-name=inference -p olveczky,shared --mem=30000 -t 3-00:00 -N 1 -n 8 --constraint="intel&avx2"'
# Setup functions (optional, set to "" if no setup is required. Trailing ; is required)
setup: "module load Anaconda3/2020.11; source activate dannce;"

Your slurm configuration file should be included as a key-value pair in your COM and dannce configuration files as follows:

slurm_config: /path/to/slurm_configuration.yaml
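To illustrate how the pieces fit together, the sketch below shows how the sbatch arguments and setup string from a slurm configuration file might be combined into a full submission command. The helper name `build_sbatch_command` is hypothetical (not part of the dannce package), and the parsed YAML is stood in by a plain dict:

```python
# Sketch: combining slurm_config entries into an sbatch command.
# In practice the dict below would come from parsing the slurm_config
# file referenced in your model configuration.
slurm_config = {
    "com_train": "--job-name=trainCom -p gpu --mem=30000 -t 3-00:00 --gres=gpu:1",
    "setup": "module load Anaconda3/2020.11; source activate dannce;",
}

def build_sbatch_command(config: dict, operation: str, wrapped_command: str) -> str:
    """Compose an sbatch invocation for a dannce operation (illustrative only)."""
    sbatch_args = config[operation]
    # The setup string (trailing ';' required) is prepended so the dannce
    # environment is active before the wrapped command runs.
    payload = f'{config.get("setup", "")} {wrapped_command}'.strip()
    return f'sbatch {sbatch_args} --wrap="{payload}"'

cmd = build_sbatch_command(slurm_config, "com_train",
                           "com-train /path/to/com_config.yaml")
print(cmd)
```

This is why the trailing semicolon matters: without it, the setup string and the wrapped command would run together into a single invalid shell statement.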

Command line slurm interface

There are several command-line functions that facilitate deployment on a Slurm HPC. They all accept the path to a model configuration file as input, denoted here by $com_config and $dannce_config. For each command, the full list of command-line arguments is available through the --help flag.

  1. com-train-sbatch $com_config

Submit a COM training job to the cluster.

  2. com-predict-sbatch $com_config

Submit a single-process COM prediction job to the cluster.

  3. com-predict-multi-gpu $com_config

Submit a multi-process COM prediction job to the cluster. (Results can be merged with com-merge $com_config)

  4. dannce-train-sbatch $dannce_config

Submit a dannce training job to the cluster.

  5. dannce-train-grid $dannce_config /path/to/training_params.yaml

Submit multiple training jobs to the cluster, specified by the training_params.yaml file (described below).

  6. dannce-predict-sbatch $dannce_config

Submit a single-instance, single-process dannce prediction job to the cluster.

  7. dannce-predict-multi-gpu $dannce_config

Submit a single-instance, multi-process dannce prediction job to the cluster. (Results can be merged with dannce-merge $dannce_config)

  8. dannce-inference-sbatch $com_config $dannce_config

For each instance:

  • Submit a multi-process COM prediction job to the cluster.
  • Merge the results.
  • Submit a multi-process dannce prediction job to the cluster.
  • Merge the results.

Requires that the COM and dannce networks are already trained and specified in io.yaml.
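The per-instance steps above amount to an ordered command pipeline. The sketch below lays it out explicitly; the command names come from the list above, while the function, config paths, and instance list are illustrative placeholders (the actual submission logic lives in the dannce slurm modules):

```python
def inference_pipeline(com_config: str, dannce_config: str, instances: list) -> list:
    """Return the ordered cluster commands that dannce-inference-sbatch
    conceptually runs for each instance (illustrative sketch only)."""
    commands = []
    for instance in instances:
        commands += [
            f"com-predict-multi-gpu {com_config}",        # parallel COM prediction
            f"com-merge {com_config}",                    # merge COM results
            f"dannce-predict-multi-gpu {dannce_config}",  # parallel dannce prediction
            f"dannce-merge {dannce_config}",              # merge dannce results
        ]
    return commands

steps = inference_pipeline("com_config.yaml", "dannce_config.yaml", ["instance0"])
```

The ordering matters: dannce prediction consumes the merged COM results, so each merge must complete before the next submission.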

Training parameters file

Training parameters for a grid search can be defined in a training_parameters.yaml file. These incorporate changes to the base dannce configuration file used in dannce-train-grid. The training_parameters.yaml file consists of a list of dictionaries specifying the parameters to change. For example, the following parameters file will launch two training jobs: one using the default loss function, and the other using an L1 loss.

batch_params:
    - dannce_finetune_weights: /n/holylfs02/LABS/olveczky_lab/Diego/code/dannce/demo/markerless_mouse_1/DANNCE/weights/weights.rat.MAX/
      data_split_seed: 42
      dannce_train_dir: ./DANNCE/FT_MAX
    - dannce_finetune_weights: /n/holylfs02/LABS/olveczky_lab/Diego/code/dannce/demo/markerless_mouse_1/DANNCE/weights/weights.rat.MAX/
      data_split_seed: 42
      loss: mask_nan_l1_loss
      dannce_train_dir: ./DANNCE/FT_MAX_L1
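To illustrate how batch_params is consumed, here is a minimal sketch of the expansion: each dictionary in the list overrides keys in the base configuration, yielding one job configuration per entry. The merge semantics and the base values shown are assumptions for illustration, not the actual dannce implementation:

```python
# Sketch: expanding batch_params into per-job configurations.
# base_config stands in for the parsed base dannce config; the default
# loss name here is a placeholder, not necessarily dannce's real default.
base_config = {"loss": "mask_nan_loss", "epochs": 100}
batch_params = [
    {"dannce_train_dir": "./DANNCE/FT_MAX"},
    {"loss": "mask_nan_l1_loss", "dannce_train_dir": "./DANNCE/FT_MAX_L1"},
]

# Later entries override base keys; untouched keys keep their base values,
# which is why the first job above trains with the default loss.
job_configs = [{**base_config, **overrides} for overrides in batch_params]
```

Each resulting dictionary corresponds to one training job submitted by dannce-train-grid.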