
The `grosetta`:command: and `gdocking`:command: scripts

GC3Apps provide two scripts to drive execution of applications (protocols, in Rosetta_ terminology) from the Rosetta_ bioinformatics suite.

The purpose of `grosetta`:command: and `gdocking`:command: is to execute several concurrent runs of minirosetta or docking_protocol on a set of input files, and collect the generated output. These runs are performed in parallel using every available GC3Pie :term:`resource`; you can of course control how many runs should be executed and select what output files you want from each one.

The script `grosetta`:command: is a relatively generic front-end that executes the minirosetta program by default (but a different application can be chosen with the -x `command-line option`:term:). The `gdocking`:command: script is specialized for running Rosetta_'s docking_protocol program.

Introduction

The `grosetta`:command: and `gdocking`:command: scripts execute several runs of minirosetta or docking_protocol on a set of input files, and collect the generated output. These runs are performed in parallel, up to a limit that can be configured with the -J `command-line option`:term:. You can control how many runs are executed and select which output files you want from each one.

Note

The `grosetta`:command: and `gdocking`:command: scripts are very similar in usage. In the following, whatever is written about `grosetta`:command: applies to `gdocking`:command: as well; the differences will be pointed out on a case-by-case basis.

In more detail, `grosetta`:command: does the following:

  1. Reads the `session`:term: (specified on the command line with the --session option) and loads all stored jobs into memory. If the session directory does not exist, one will be created with empty contents.

  2. Scans the input file names given on the command-line, and generates a number of identical computational jobs, all running the same Rosetta_ program on the same set of input files. The objective is to compute a specified number P of decoys of any given PDB file.

    The number P of wanted decoys can be set with the --total-decoys option (see below). The option --decoys-per-job sets the number of decoys that each computational job computes; this should be an estimate based on the maximum allowed run time of each job and the time taken by the Rosetta_ protocol to compute a single decoy.

  3. Updates the state of all existing jobs, collects output from finished jobs, and submits new jobs generated in step 2.

    Finally, a summary table of all known jobs is printed. (To control the amount of printed information, see the -l command-line option in the `session-based script`:ref: section.)

  4. If the -C command-line option was given (see below), waits the specified number of seconds, and then goes back to step 3.

    The program `grosetta`:command: exits when all jobs have run to completion, i.e., when the wanted number of decoys have been computed.

    Execution can be interrupted at any time by pressing Ctrl+C. If the execution has been interrupted, it can be resumed at a later stage by calling `grosetta`:command: with exactly the same command-line options.
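The steps above can be pictured as a simple polling loop. The following is only a toy sketch in Python: the ``ToyJob`` class, the ``run_session`` function and the simplified state sequence are hypothetical illustrations, not part of the GC3Pie API, and no real job is submitted.

```python
import time
from dataclasses import dataclass


@dataclass
class ToyJob:
    """Hypothetical stand-in for a computational job (not a GC3Pie class)."""
    name: str
    state: str = "NEW"

    def advance(self):
        # Move the job one step along a simplified state sequence.
        order = ["NEW", "SUBMITTED", "RUNNING", "TERMINATED"]
        if self.state != "TERMINATED":
            self.state = order[order.index(self.state) + 1]


def run_session(jobs, interval=0.0, keep_going=True):
    """Repeat step 3 until every job is TERMINATED.

    With keep_going=False a single pass is made, which mimics an
    invocation without the -C option. Returns the number of passes.
    """
    passes = 0
    while True:
        for job in jobs:
            job.advance()            # step 3: update / submit / collect
        passes += 1
        summary = {}
        for job in jobs:
            summary[job.state] = summary.get(job.state, 0) + 1
        print(summary)               # stand-in for the summary table
        if not keep_going or all(j.state == "TERMINATED" for j in jobs):
            return passes
        time.sleep(interval)         # step 4: wait, then back to step 3


jobs = [ToyJob("0--1"), ToyJob("2--3")]
passes = run_session(jobs, interval=0.0)
```

In this toy model each pass moves every job forward by one state; real jobs of course change state at their own pace, which is why the actual script polls at the interval given with -C.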

The `gdocking`:command: program works in exactly the same way, with the important exception that `gdocking`:command: uses a separate Rosetta_ docking_protocol program invocation per input file.

Command-line invocation of `grosetta`:command:

The `grosetta`:command: script is based on GC3Pie's `session-based script <session-based script>`:ref: model; please also read the `session-based script`:ref: section for an introduction to sessions and generic command-line options.

A `grosetta`:command: command-line is constructed as follows:

  1. The 1st argument is the flags file, containing options to pass to every executed Rosetta_ program;
  2. then follow any number of input files (copied from your PC to the execution site);
  3. then a literal colon character :;
  4. finally, you can list any number of output file patterns (copied back from the execution site to your PC); wildcards (e.g., *.pdb) are allowed, but you must enclose them in quotes. Note that:
    • you can omit the output files: the default is "*.pdb" "*.sc" "*.fasc"
    • if you omit the output file patterns, omit the colon as well
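The structure described above can be illustrated with a small Python sketch that splits a list of positional arguments at the colon. The function name ``split_arguments`` is hypothetical and this is not how `grosetta`:command: itself parses its command line; in particular, falling back to the defaults when a colon is followed by no patterns is an assumption of this sketch.

```python
# Default output patterns, as stated in the list above.
DEFAULT_PATTERNS = ["*.pdb", "*.sc", "*.fasc"]


def split_arguments(args):
    """Split positional arguments into (flags_file, inputs, patterns)."""
    if ":" in args:
        colon = args.index(":")
        before, patterns = args[:colon], args[colon + 1:]
    else:
        before, patterns = list(args), []
    if not before:
        raise ValueError("the first argument must be the flags file")
    flags_file, inputs = before[0], before[1:]
    # Assumption: an empty pattern list means "use the defaults".
    return flags_file, inputs, patterns or list(DEFAULT_PATTERNS)


flags, inputs, patterns = split_arguments(
    ["flags", "query.fasta", "1bjpA.pdb", ":", "*.pdb", "*.sc"]
)
print(flags, inputs, patterns)
```

When typing such a command in a shell, remember that the patterns after the colon must be quoted so that the shell does not expand them locally.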

Example 1. The following command-line invocation uses `grosetta`:command: to run minirosetta on the molecule files 1bjpA.pdb, 1ca7A.pdb, and 1cgqA.pdb. The flags file (1st command-line argument) is a text file containing options to pass to the actual minirosetta program. Additional input files are specified on the command line between the flags file and the PDB input files.

   $ grosetta flags alignment.filt query.fasta query.psipred_ss2 boinc_aaquery03_05.200_v1_3.gz boinc_aaquery09_05.200_v1_3.gz 1bjpA.pdb 1ca7A.pdb 1cgqA.pdb

You can see that the listing of output patterns has been omitted,
so `grosetta`:command: will use the default and retrieve all
`*.pdb`:file:, `*.sc`:file: and `*.fasc`:file: files.

There will be a number of identical jobs being executed as a result of a `grosetta`:command: or `gdocking`:command: invocation; this number depends on the ratio of the values given to options -P and -p:

-P NUM, --total-decoys NUM
 Compute NUM decoys per input file.
-p NUM, --decoys-per-job NUM
 Compute NUM decoys in a single job (default: 1). This parameter should be tuned so that the running time of a single job does not exceed the maximum wall-clock time (see the --wall-clock-time command-line option in `session-based script`:ref:).

If you omit -P and -p, they both default to 1, i.e., one job will be created (as in Example 1 above).
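The arithmetic behind these two options can be sketched in a few lines of Python. Both function names are hypothetical, and the ``safety`` margin used to estimate a value for -p is an assumption of this sketch, not something the scripts compute for you.

```python
def max_decoys_per_job(walltime_seconds, seconds_per_decoy, safety=0.8):
    """Estimate a value for -p that fits in the allowed wall-clock time.

    `safety` (an assumption of this sketch) leaves some headroom for
    start-up time and slower-than-average decoys.
    """
    return max(1, int(walltime_seconds * safety / seconds_per_decoy))


def decoys_per_job(total_decoys=1, per_job=1):
    """Return how many decoys each job computes, given -P and -p."""
    if total_decoys < 1 or per_job < 1:
        raise ValueError("both values must be positive integers")
    full_jobs, remainder = divmod(total_decoys, per_job)
    return [per_job] * full_jobs + ([remainder] if remainder else [])


print(decoys_per_job())        # [1]: the defaults give a single job
print(decoys_per_job(5, 2))    # [2, 2, 1]: three jobs, the last one smaller
print(max_decoys_per_job(3600, 124))
```

The number of jobs is the length of the returned list, i.e., the ratio P/p rounded up.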

Example 2. The following command-line invocation will run 3 parallel instances of minirosetta, each of which generates 2 decoys (except the last one, which only generates 1 decoy) of the molecule described in file 1bjpA.pdb:

$ grosetta --session SAMPLE_SESSION --total-decoys 5 --decoys-per-job 2 flags alignment.filt query.fasta query.psipred_ss2 boinc_aaquery03_05.200_v1_3.gz boinc_aaquery09_05.200_v1_3.gz 1bjpA.pdb

In this example, job information is stored into session SAMPLE_SESSION (see the documentation of the --session option in `session-based script`:ref:). The command above creates the jobs, submits them, and finally prints the following status report:

Status of jobs in the 'SAMPLE_SESSION' session: (at 10:53:46, 02/28/12)
        NEW   0/3    (0.0%)
    RUNNING   0/3    (0.0%)
    STOPPED   0/3    (0.0%)
  SUBMITTED   3/3   (100.0%)
 TERMINATED   0/3    (0.0%)
TERMINATING   0/3    (0.0%)
      total   3/3   (100.0%)

Note that the status report counts the number of jobs in the session, not the total number of decoys being generated. (Feel free to report this as a bug.)

Calling grosetta over and over again will result in the same jobs being monitored; to create new jobs, change the command line and raise the value for -P or -p. (To completely erase an existing session and start over, use the --new-session option, as per `session-based script <session-based script>`:ref: documentation.)

The -C option tells `grosetta`:command: to continue running until all jobs have finished running and the output files have been correctly retrieved. On successful completion, the command given in Example 2 above would print:

Status of jobs in the 'SAMPLE_SESSION' session: (at 11:05:50, 02/28/12)
        NEW   0/3    (0.0%)
    RUNNING   0/3    (0.0%)
    STOPPED   0/3    (0.0%)
  SUBMITTED   0/3    (0.0%)
 TERMINATED   3/3   (100.0%)
TERMINATING   0/3    (0.0%)
         ok   3/3   (100.0%)
      total   3/3   (100.0%)

The three jobs are named 0--1, 2--3 and 4--5 (you could see this by passing the -l option to `grosetta`:command:); each of these jobs will create an output directory named after the job.

In general, `grosetta`:command: jobs are named `{N}--{M}`:file: with N and M being two integers from 0 up to the value specified with option --total-decoys. Jobs generated by `gdocking`:command: are instead named after the input file, with a `.{N}--{M}`:file: suffix added.
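If you post-process output directories with your own scripts, the two naming schemes can be told apart with a regular expression. The following Python sketch is only an illustration; the function name ``parse_job_name`` is hypothetical.

```python
import re

# Optional "<input file stem>." prefix (gdocking), then "N--M".
JOB_NAME_RE = re.compile(r"(?:(?P<stem>.+)\.)?(?P<first>\d+)--(?P<last>\d+)")


def parse_job_name(name):
    """Return (stem, N, M); stem is None for grosetta-style names."""
    match = JOB_NAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not a decoy job name: {name!r}")
    return match.group("stem"), int(match.group("first")), int(match.group("last"))


print(parse_job_name("0--1"))         # (None, 0, 1)
print(parse_job_name("1bjpa.1--2"))   # ('1bjpa', 1, 2)
```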

For each job, the set of output files is automatically retrieved and placed in the locations described below.

Note

The naming and contents of output files differ between `grosetta`:command: and `gdocking`:command:. Refer to the appropriate section below!

Output files for `grosetta`:command:

Upon successful completion, the output directory of each `grosetta`:command: job contains:

The `minirosetta.static.log`:file: file contains the output log of the minirosetta execution. For each of the S_*.pdb files above, a line like the following should be present in the log file (the file name and number of elapsed seconds will of course vary!):

protocols.jd2.JobDistributor: S_1CA7A_1_0001 reported success in 124 seconds

The `minirosetta.static.stdout.txt`:file: file contains a copy of the minirosetta output log, plus the output of the wrapper script. After a successful minirosetta run, the last line of this file will read:

minirosetta.static: All done, exitcode: 0
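Both checks can be automated. The Python sketch below looks for the two lines quoted above in the text of the log files; the function names are hypothetical, and the regular expression assumes that the log lines have exactly the form shown in this section.

```python
import re

SUCCESS_RE = re.compile(
    r"protocols\.jd2\.JobDistributor: (?P<decoy>\S+) "
    r"reported success in (?P<seconds>\d+) seconds"
)


def successful_decoys(log_text):
    """Map each decoy reported as successful to its elapsed seconds."""
    return {m.group("decoy"): int(m.group("seconds"))
            for m in SUCCESS_RE.finditer(log_text)}


def wrapper_succeeded(stdout_text):
    """True if the last non-empty line reports exit code 0."""
    lines = [line.strip() for line in stdout_text.splitlines() if line.strip()]
    return bool(lines) and lines[-1] == "minirosetta.static: All done, exitcode: 0"


sample_log = (
    "protocols.jd2.JobDistributor: S_1CA7A_1_0001 reported success in 124 seconds\n"
)
print(successful_decoys(sample_log))
print(wrapper_succeeded(sample_log + "minirosetta.static: All done, exitcode: 0\n"))
```

In practice you would read the text from `minirosetta.static.log`:file: and `minirosetta.static.stdout.txt`:file: in the job output directory.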

Output files for `gdocking`:command:

Execution of gdocking yields the following output:

  • For each .pdb input file, a .decoys.tar file (e.g., for 1bjpa.pdb input, a 1bjpa.decoys.tar output is produced), which contains the .pdb files of the decoys produced by `gdocking`:command:.
  • For each successful job, a .N--M directory: e.g., for the 1bjpa.1--2 job, a 1bjpa.1--2/ directory is created, with the following content:
    • docking_protocol.log: output of Rosetta's docking_protocol program;
    • docking_protocol.stderr.txt, docking_protocol.stdout.txt: the standard error and standard output of the job. The "stdout" file contains a copy of the docking_protocol.log contents, plus the output from the wrapper script.
    • docking_protocol.tar.gz: the .pdb decoy files produced by the job.
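To see which decoys an archive contains without unpacking it, Python's standard ``tarfile`` module is enough. The sketch below is self-contained: it builds a small in-memory archive for demonstration, and the member names used there are made up for the example (the actual names inside a .decoys.tar file may differ).

```python
import io
import tarfile


def list_decoys(archive):
    """Return the sorted .pdb member names of a tar archive.

    `archive` is either a path (str) or an open binary file object.
    """
    kwargs = {"name": archive} if isinstance(archive, str) else {"fileobj": archive}
    with tarfile.open(mode="r:*", **kwargs) as tar:
        return sorted(n for n in tar.getnames() if n.endswith(".pdb"))


# Build a toy archive in memory (illustrative member names only).
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tar:
    for name in ("1bjpa.1.pdb", "1bjpa.0.pdb", "notes.txt"):
        data = b"REMARK toy content\n"
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
buffer.seek(0)

decoys = list_decoys(buffer)
print(decoys)
```

With a real run you would pass the archive path instead, e.g. ``list_decoys("1bjpa.decoys.tar")``.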

The following scheme summarizes the location of `gdocking`:command: output files:

(directory where gdocking is run)/
  |
  +- file1.pdb      Original input file
  |
  +- file1.N--M/    Directory collecting job outputs from job file1.N--M
  |    |
  |    +- docking_protocol.tar.gz
  |    +- docking_protocol.log
  |    +- docking_protocol.stderr.txt
  |    ... etc
  |
  +- file1.N--M.fasc   FASC file for decoys N to M [1]
  |
  +- file1.decoys.tar  tar archive of PDB file of all decoys
  |                    generated corresponding to 'file1.pdb' [2]
  |
  ...

Let P be the total number of decoys (the argument to the -P option), and p be the number of decoys per job (argument to the -p option). Then you would get in a single directory:

  1. (P/p) different .fasc files, corresponding to the (P/p) jobs;
  2. P different .pdb files, named `a_file.0.pdb`:file: to `a_file.{(P-1)}.pdb`:file:
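As a quick sanity check after a run, the expected counts and names can be computed from P and p. The Python sketch below is an illustration only: the function name is hypothetical, and rounding P/p up when P is not a multiple of p is an assumption (consistent with Example 2, where 5 decoys at 2 per job give 3 jobs).

```python
import math


def expected_gdocking_outputs(pdb_name, total_decoys, per_job):
    """Return (number of .fasc files, list of expected decoy .pdb names)."""
    stem = pdb_name[:-len(".pdb")] if pdb_name.endswith(".pdb") else pdb_name
    # One .fasc file per job; assumption: P/p is rounded up.
    fasc_count = math.ceil(total_decoys / per_job)
    # Decoys are numbered from 0 to P-1.
    pdb_files = [f"{stem}.{i}.pdb" for i in range(total_decoys)]
    return fasc_count, pdb_files


count, names = expected_gdocking_outputs("a_file.pdb", total_decoys=4, per_job=2)
print(count)    # 2
print(names)    # ['a_file.0.pdb', 'a_file.1.pdb', 'a_file.2.pdb', 'a_file.3.pdb']
```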

Example usage

This section contains commented example sessions with `grosetta`:command:. All the files used in this example are available in the GC3Pie Rosetta test directory (courtesy of Lars Malmstroem).

Manage a set of jobs from start to end

In typical operation, one calls `grosetta`:command: with the -C option and lets it manage a set of jobs until completion.

So, to generate one decoy from a set of given input files, one can use the following command-line invocation:

$ grosetta -s example -C 120 -P 1 -p 1 \
    flags alignment.filt query.fasta \
    query.psipred_ss2 boinc_aaquery03_05.200_v1_3.gz \
    boinc_aaquery09_05.200_v1_3.gz 1bjpA.pdb 1ca7A.pdb \
    2fltA.pdb 2fm7A.pdb 2op8A.pdb 2ormA.pdb 2os5A.pdb \
    3c6vA.pdb

The -s example option tells `grosetta`:command: to store information about the computational jobs in the example.jobs directory.

The -C 120 option tells `grosetta`:command: to update job state every 120 seconds; output from finished jobs is retrieved and new jobs are submitted at the same interval.

The -P 1 and -p 1 options set the total number of decoys to compute and the maximum number of decoys that a single computational job can handle. These values can be arbitrarily high; however, the -p value should be such that the computational job can actually compute that many decoys in the allotted `wall-clock time <walltime>`:term:.

The above command will start by printing a status report like the following:

Status of jobs in the 'example.csv' session:
 SUBMITTED   1/1 (100.0%)

It will continue printing an updated status report every 120 seconds until the requested number of decoys (set by the -P option) has been computed.

In GC3Pie terminology, when a job is finished and its output has been successfully retrieved, the job is marked as TERMINATED:

Status of jobs in the 'example.csv' session:
 TERMINATED   1/1 (100.0%)

Managing a session by repeated `grosetta`:command: invocation

We now show how one can obtain the same result by calling `grosetta`:command: multiple times (there could be hours of interruption between one invocation and the next one).

Note

This is not the typical mode of operating with `grosetta`:command:, but may still be useful in certain settings.

  1. Create a session (1 job only, since no -P option is given); the session name is chosen with the -s (short for --session) option. Take care to reuse the same session name in subsequent commands.

    $ grosetta -s example flags alignment.filt query.fasta \
        query.psipred_ss2 boinc_aaquery03_05.200_v1_3.gz \
        boinc_aaquery09_05.200_v1_3.gz 1bjpA.pdb 1ca7A.pdb \
        2fltA.pdb 2fm7A.pdb 2op8A.pdb 2ormA.pdb 2os5A.pdb
    Status of jobs in the 'example.csv' session:
     SUBMITTED   1/1 (100.0%)
    
  2. Now we call `grosetta`:command: again, and request that 3 decoys be computed starting from a single PDB file (--total-decoys 3 on the command line). Since we are submitting a single PDB file, the 3 decoys will be computed all in a single run, so the --decoys-per-job option will have value 3.

    $ grosetta -s example --total-decoys 3 --decoys-per-job 3 \
        flags alignment.filt query.fasta \
        query.psipred_ss2 boinc_aaquery03_05.200_v1_3.gz \
        boinc_aaquery09_05.200_v1_3.gz 3c6vA.pdb
    Status of jobs in the 'example.csv' session:
     SUBMITTED   3/3 (100.0%)
    

    Note that 3 jobs were submitted: `grosetta`:command: interprets the --total-decoys option globally, and adds one job to compute the 2 missing decoys from the file set from step 1. (This is currently a limitation of `grosetta`:command:.)

    From here on, one could simply run grosetta -C 120 and let it manage the session until completion of all jobs, as in the section "Manage a set of jobs from start to end" above. To illustrate the use of several command-line options of `grosetta`:command:, we shall instead show how to manage the session by repeated separate invocations.

  3. The next step is to monitor the session, so we add the command-line option -l, which tells `grosetta`:command: to list all the jobs with their status. Also note that we keep the -s example option to tell `grosetta`:command: that we would like to operate on the session named example.

    All non-option arguments can be omitted: as long as the total number of decoys is unchanged, they're not needed.

    $ grosetta -s example -l
    Decoys Nr.       State (JobID)       Info
    ================================================================================
    0--1             RUNNING (job.766)   Running at Mon Dec 20 19:32:08 2010
    2--3             RUNNING (job.767)   Running at Mon Dec 20 19:33:23 2010
    0--2             RUNNING (job.768)   Running at Mon Dec 20 19:33:43 2010
    

    Without the -l option only a summary of job statuses is presented:

    $ grosetta -s example
    Status of jobs in the 'grosetta.csv' session:
     RUNNING     3/3 (100.0%)
    

    Alternatively, we can keep the command line arguments used in the previous invocation: they will be ignored since they do not add any new job (the number of decoys to compute is always 1):

    $ grosetta -s example -l flags alignment.filt query.fasta \
        query.psipred_ss2 boinc_aaquery03_05.200_v1_3.gz \
        boinc_aaquery09_05.200_v1_3.gz 1bjpA.pdb 1ca7A.pdb \
        2fltA.pdb 2fm7A.pdb 2op8A.pdb 2ormA.pdb 2os5A.pdb \
        3c6vA.pdb
    Decoys Nr.       State (JobID)       Info
    ================================================================================
    0--1             RUNNING (job.766)
    2--3             RUNNING (job.767)   Running at Mon Dec 20 19:33:23 2010
    0--2             RUNNING (job.768)   Running at Mon Dec 20 19:33:43 2010
    

    Note that the -l option can also be used in combination with the -C option (see "Manage a set of jobs from start to end").

  4. Calling grosetta again when jobs are done triggers automated download of the results:

    $ ../grosetta.py
    File downloaded:
    gsiftp://idgc3grid01.uzh.ch:2811/jobs/214661292869757468202765/minirosetta.static.stdout.txt
    File downloaded:
    gsiftp://idgc3grid01.uzh.ch:2811/jobs/214661292869757468202765/minirosetta.static.log
     ...
    File downloaded:
    gsiftp://idgc3grid01.uzh.ch:2811/jobs/214661292869757468202765/.arc/input
    Status of jobs in the 'grosetta.csv' session:
     TERMINATED  1/1 (100.0%)
     ok          1/1 (100.0%)
    

    The -l option comes in handy to see which directory contains the job output:

    $ grosetta -l
    Decoys Nr.       State (JobID)         Info
    ==================================================================================
    0--1             TERMINATED (job.766)  Output retrieved into directory '/tmp/0--1'