JB Mouret edited this page Aug 23, 2015 · 2 revisions

Some clusters provide a "best effort" queue: you can use all the available cores but if someone requests them, your jobs are automatically killed.

Resuming

To take advantage of these queues, sferes must be able to resume interrupted experiments. By default, all sferes binaries should be "resumable" unless SFERES_NO_STATE is defined. Technically, a special statistics class (called State) saves the genotypes of the whole population.

  • To resume an experiment, run: your_binary --resume your_gen_file
  • Sferes tries to save the last generation before dying, but this may fail if a single generation takes too long to compute.
  • A useful option is -d, which specifies where sferes should write the results (typically, you want the resumed binary to write its results in the same directory as the original experiment).
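
The resume step can be scripted. The following is a minimal sketch, not the code used by sferes: it assumes that the state files are named gen_<number> (an assumption about the file naming, check your result directory), and the helper names latest_gen_file and resume_command are hypothetical. Only the --resume and -d options come from the description above.

```python
import re
from pathlib import Path


def latest_gen_file(exp_dir):
    """Return the gen_<number> file with the highest number, or None.

    Assumption: state files are named gen_<number>; sub-directories
    are searched as well.
    """
    best, best_gen = None, -1
    for path in Path(exp_dir).rglob("gen_*"):
        match = re.fullmatch(r"gen_(\d+)", path.name)
        if path.is_file() and match and int(match.group(1)) > best_gen:
            best, best_gen = path, int(match.group(1))
    return best


def resume_command(binary, exp_dir):
    """Build the argument list that starts or resumes an experiment."""
    # -d keeps the results in the directory of the original experiment.
    cmd = [str(binary), "-d", str(exp_dir)]
    gen_file = latest_gen_file(exp_dir)
    if gen_file is not None:
        # A saved generation exists: resume from it instead of restarting.
        cmd += ["--resume", str(gen_file)]
    return cmd
```

The returned list can be passed directly to subprocess.run or subprocess.Popen. Comparing generation numbers as integers avoids picking gen_9 over gen_10, which a plain string sort would do.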

sferes_job_manager.py

Resuming the right job by hand is a tedious process. To avoid this issue, we provide a script (scripts/sferes_job_manager.py) that takes care of everything:

  • we submit many instances of this script
  • each time the script is activated by the scheduler, it looks at the state of the experiments and launches or resumes one experiment.

This means that you should typically queue more instances of this script than the total number of experiments.

The script takes a JSON file as input (see scripts/sferes_job_manager_ex.json):

{
  "replicates": 3,
  "bin_dir": "/Users/jbm/Documents/git/sferes2/build/debug/examples/",
  "res_dir": "/Users/jbm/Documents/git/sferes2/test_job_manager",
  "exps": ["ex_ea", "ex_ea2"]
}

This JSON file requests 3 replicates of ex_ea and 3 replicates of ex_ea2. The binaries are read from /Users/jbm/Documents/git/sferes2/build/debug/examples/ and the results are written to /Users/jbm/Documents/git/sferes2/test_job_manager (the script will create ex_ea/exp_0, ex_ea/exp_1, etc.).
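
To make the directory layout concrete, here is a small sketch that expands such a configuration into one (binary, result directory) pair per replicate. It is an illustration only, not the code of sferes_job_manager.py; the function name list_experiments is hypothetical, while the keys and the exp_<i> naming come from the description above.

```python
import json
from pathlib import Path


def list_experiments(config):
    """Return (binary, result directory) pairs, one per replicate."""
    bin_dir = Path(config["bin_dir"])
    res_dir = Path(config["res_dir"])
    jobs = []
    for exp in config["exps"]:
        for i in range(config["replicates"]):
            # e.g. <res_dir>/ex_ea/exp_0, <res_dir>/ex_ea/exp_1, ...
            jobs.append((bin_dir / exp, res_dir / exp / f"exp_{i}"))
    return jobs


config = json.loads("""
{
  "replicates": 3,
  "bin_dir": "/Users/jbm/Documents/git/sferes2/build/debug/examples/",
  "res_dir": "/Users/jbm/Documents/git/sferes2/test_job_manager",
  "exps": ["ex_ea", "ex_ea2"]
}
""")

for binary, out_dir in list_experiments(config):
    print(binary, "->", out_dir)
```

With 2 experiments and 3 replicates this lists 6 jobs. Note that strict parsers such as Python's json module reject a trailing comma after the last entry of an object.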

Each time this script is run, it looks at the "open experiments" and "adopts" one. If the script is killed, it forwards the signal to the sferes binary.
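
The two mechanisms described here, adopting an experiment and forwarding signals, could be sketched as follows. This is one possible design under stated assumptions, not a description of how sferes_job_manager.py is written: the lock file name adopted.lock and the functions try_adopt, release and run_and_forward are hypothetical, and exclusive file creation may not be reliable on every network file system.

```python
import os
import signal
import subprocess

LOCK_NAME = "adopted.lock"  # hypothetical name, not taken from sferes


def try_adopt(exp_dir):
    """Claim exp_dir by creating a lock file; return False if already taken."""
    os.makedirs(exp_dir, exist_ok=True)
    try:
        # O_EXCL makes the creation fail if another instance got there first.
        fd = os.open(os.path.join(exp_dir, LOCK_NAME),
                     os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.write(fd, str(os.getpid()).encode())
    os.close(fd)
    return True


def release(exp_dir):
    """Reopen the experiment so that another instance can adopt it."""
    os.remove(os.path.join(exp_dir, LOCK_NAME))


def run_and_forward(cmd):
    """Run cmd, relay SIGTERM/SIGINT to it, and return its exit code."""
    child = subprocess.Popen(cmd)

    def forward(signum, _frame):
        # Pass the scheduler's signal on to the child process.
        child.send_signal(signum)

    previous = {sig: signal.signal(sig, forward)
                for sig in (signal.SIGTERM, signal.SIGINT)}
    try:
        return child.wait()
    finally:
        # Restore the handlers that were active before the call.
        for sig, handler in previous.items():
            if handler is not None:
                signal.signal(sig, handler)
```

A manager built on this sketch would call try_adopt on each open experiment until one succeeds, run the binary with run_and_forward, and call release afterwards so that a later instance can resume an interrupted run. It would also need its own way to tell finished experiments from interrupted ones, which is not shown here.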

Making your job resumable

Everything should work "out of the box" except:

  • Be careful with the copy constructors of your genotypes: if they are not properly written, resuming might be buggy.
  • If you use a custom EA, your_ea->_set_pop(pop) will be called when resuming. If your EA only uses the basic this->_pop, there is nothing to do. However, if it stores the population in a different structure, you should redefine _set_pop and add the code that fills this structure.