GC3Apps provide two scripts to drive execution of applications (protocols, in Rosetta terminology) from the Rosetta bioinformatics suite.
The purpose of grosetta and gdocking is to execute several concurrent runs of minirosetta or docking_protocol on a set of input files, and collect the generated output. These runs are performed in parallel using every available GC3Pie resource; you can of course control how many runs should be executed and select what output files you want from each one.
The script grosetta is a relatively generic front-end that executes the minirosetta program by default (but a different application can be chosen with the -x command-line option). The gdocking script is specialized for running Rosetta's docking_protocol program.
The grosetta and gdocking scripts execute several runs of minirosetta or docking_protocol on a set of input files, and collect the generated output. These runs are performed in parallel, up to a limit that can be configured with the -J command-line option. You can of course control how many runs should be executed and select what output files you want from each one.
Note
The grosetta and gdocking scripts are very similar in usage. In the following, whatever is written about grosetta applies to gdocking as well; the differences will be pointed out on a case-by-case basis.
In more detail, grosetta does the following:
1. Reads the session (specified on the command line with the --session option) and loads all stored jobs into memory. If the session directory does not exist, one will be created with empty contents.
2. Scans the input file names given on the command line, and generates a number of identical computational jobs, all running the same Rosetta program on the same set of input files. The objective is to compute a specified number P of decoys of any given PDB file.
   The number P of wanted decoys can be set with the --total-decoys option (see below). The --decoys-per-job option sets the number of decoys that each computational job can compute; this should be a guess based on the maximum allowed run time of each job and the time taken by the Rosetta protocol to compute a single decoy.
3. Updates the state of all existing jobs, collects output from finished jobs, and submits new jobs generated in step 2.
   Finally, a summary table of all known jobs is printed. (To control the amount of printed information, see the -l command-line option in the session-based script section.)
4. If the -C command-line option was given (see below), waits the specified number of seconds, and then goes back to step 3.
   The program grosetta exits when all jobs have run to completion, i.e., when the wanted number of decoys have been computed.
Execution can be interrupted at any time by pressing Ctrl+C. If the execution has been interrupted, it can be resumed at a later stage by calling grosetta with exactly the same command-line options.
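The update-and-wait cycle described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not GC3Pie code: the job dictionaries, the `advance` stand-in and the `run_session` function are all hypothetical names introduced here, and the `interval` parameter merely plays the role of the -C option.

```python
import time

def advance(job):
    """Hypothetical stand-in for a real state update: move one step forward."""
    order = ["NEW", "SUBMITTED", "RUNNING", "TERMINATED"]
    position = order.index(job["state"])
    job["state"] = order[min(position + 1, len(order) - 1)]

def summarize(jobs):
    """Count jobs per state, like the summary table printed after each pass."""
    counts = {}
    for job in jobs:
        counts[job["state"]] = counts.get(job["state"], 0) + 1
    return counts

def run_session(jobs, interval=None, sleep=time.sleep):
    """Do one update pass, or keep polling every `interval` seconds (like -C)."""
    while True:
        for job in jobs:
            advance(job)  # update state, collect output, submit new jobs
        counts = summarize(jobs)
        print(counts)
        finished = counts.get("TERMINATED", 0) == len(jobs)
        if finished or interval is None:
            return counts
        sleep(interval)
```

Calling `run_session(jobs)` performs a single pass and returns, which mirrors an invocation without -C; passing `interval=120` keeps looping until every job is TERMINATED.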
The gdocking program works in exactly the same way, with the important exception that gdocking uses a separate Rosetta docking_protocol program invocation per input file.
The grosetta script is based on GC3Pie's session-based script model; please also read the session-based script section for an introduction to sessions and generic command-line options.
A grosetta command-line is constructed as follows:
- The 1st argument is the flags file, containing options to pass to every executed Rosetta program;
- then follows any number of input files (copied from your PC to the execution site);
- then a literal colon character : ;
- finally, you can list any number of output file patterns (copied back from the execution site to your PC); wildcards (e.g., *.pdb) are allowed, but you must enclose them in quotes. Note that:
  - you can omit the output files: the default is "*.pdb" "*.sc" "*.fasc"
  - if you omit the output files patterns, omit the colon as well
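The argument layout above can be illustrated with a small Python sketch that splits a list of positional arguments at the colon. The function name `split_arguments` is hypothetical and this is not how grosetta itself parses its command line; falling back to the default patterns when the colon is present but nothing follows it is an assumption made here for simplicity.

```python
DEFAULT_OUTPUT_PATTERNS = ["*.pdb", "*.sc", "*.fasc"]

def split_arguments(args):
    """Split positional arguments into (flags_file, input_files, output_patterns)."""
    if ":" in args:
        colon = args.index(":")
        before, patterns = args[:colon], args[colon + 1:]
    else:
        before, patterns = list(args), []
    if not before:
        raise ValueError("the flags file is required as the first argument")
    flags_file, input_files = before[0], before[1:]
    # Assumption: an empty pattern list falls back to the documented defaults.
    return flags_file, input_files, (patterns or list(DEFAULT_OUTPUT_PATTERNS))

print(split_arguments(["flags", "query.fasta", "1bjpA.pdb", ":", "*.sc"]))
```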
Example 1. The following command-line invocation uses grosetta to run minirosetta on the molecule files 1bjpA.pdb, 1ca7A.pdb, and 1cgqA.pdb. The flags file (1st command-line argument) is a text file containing options to pass to the actual minirosetta program. Additional input files are specified on the command line between the flags file and the PDB input files.
$ grosetta flags alignment.filt query.fasta query.psipred_ss2 \
    boinc_aaquery03_05.200_v1_3.gz boinc_aaquery09_05.200_v1_3.gz \
    1bjpA.pdb 1ca7A.pdb 1cgqA.pdb
You can see that the listing of output patterns has been omitted, so grosetta will use the default and retrieve all *.pdb, *.sc and *.fasc files.
There will be a number of identical jobs being executed as a result of a grosetta or gdocking invocation; this number depends on the ratio of the values given to options -P and -p:
- -P NUM, --total-decoys NUM
  Compute NUM decoys per input file.
- -p NUM, --decoys-per-job NUM
  Compute NUM decoys in a single job (default: 1). This parameter should be tuned so that the running time of a single job does not exceed the maximum wall-clock time (see the --wall-clock-time command-line option in session-based script).
If you omit -P and -p, they both default to 1, i.e., one job will be created (as in Example 1 above).
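The relation between -P, -p and the number of jobs can be made concrete with a short Python sketch. The function `plan_jobs` is a hypothetical helper, not part of grosetta; it assumes that the jobs are filled with p decoys each and that a final, smaller job computes whatever remains when P is not a multiple of p.

```python
def plan_jobs(total_decoys=1, decoys_per_job=1):
    """Return the number of decoys each job computes for one set of input files."""
    if total_decoys < 1 or decoys_per_job < 1:
        raise ValueError("both values must be positive integers")
    full_jobs, remainder = divmod(total_decoys, decoys_per_job)
    plan = [decoys_per_job] * full_jobs
    if remainder:
        plan.append(remainder)  # the last job computes the leftover decoys
    return plan

print(plan_jobs(5, 2))  # -P 5 -p 2
print(plan_jobs())      # both options omitted
```

With -P 5 and -p 2 the plan contains three jobs, the last of which computes a single decoy; with both options omitted it contains exactly one job.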
Example 2. The following command-line invocation will run 3 parallel instances of minirosetta, each of which generates 2 decoys (except the last one, which only generates 1 decoy) of the molecule described in file 1bjpA.pdb:
$ grosetta --session SAMPLE_SESSION --total-decoys 5 --decoys-per-job 2 \
    flags alignment.filt query.fasta query.psipred_ss2 \
    boinc_aaquery03_05.200_v1_3.gz boinc_aaquery09_05.200_v1_3.gz 1bjpA.pdb
In this example, job information is stored into session SAMPLE_SESSION (see the documentation of the --session option in session-based script). The command above creates the jobs, submits them, and finally prints the following status report:
Status of jobs in the 'SAMPLE_SESSION' session: (at 10:53:46, 02/28/12)
NEW 0/3 (0.0%)
RUNNING 0/3 (0.0%)
STOPPED 0/3 (0.0%)
SUBMITTED 3/3 (100.0%)
TERMINATED 0/3 (0.0%)
TERMINATING 0/3 (0.0%)
total 3/3 (100.0%)
Note that the status report counts the number of jobs in the session, not the total number of decoys being generated. (Feel free to report this as a bug.)
Calling grosetta over and over again will result in the same jobs being monitored; to create new jobs, change the command line and raise the value for -P or -p. (To completely erase an existing session and start over, use the --new-session option, as per session-based script documentation.)
The -C option tells grosetta to continue running until all jobs have finished running and the output files have been correctly retrieved. On successful completion, the command given in Example 2 above would print:
Status of jobs in the 'SAMPLE_SESSION' session: (at 11:05:50, 02/28/12)
NEW 0/3 (0.0%)
RUNNING 0/3 (0.0%)
STOPPED 0/3 (0.0%)
SUBMITTED 0/3 (0.0%)
TERMINATED 3/3 (100.0%)
TERMINATING 0/3 (0.0%)
ok 3/3 (100.0%)
total 3/3 (100.0%)
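When grosetta is driven from another script, a status report like the one above can be turned into numbers. The following Python sketch assumes only the line layout shown in this section (a state name, a count/total pair and a percentage); `parse_status_report` and `all_terminated` are hypothetical helper names, not GC3Pie functions.

```python
import re

# One report line: state name, then "count/total", then the percentage.
STATUS_LINE = re.compile(r"^\s*(?P<state>[A-Za-z]+)\s+(?P<count>\d+)/(?P<total>\d+)\s+\(")

def parse_status_report(text):
    """Map each state in a status report to a (count, total) tuple."""
    counts = {}
    for line in text.splitlines():
        match = STATUS_LINE.match(line)
        if match:  # the header line does not match and is skipped
            counts[match.group("state")] = (
                int(match.group("count")),
                int(match.group("total")),
            )
    return counts

def all_terminated(counts):
    """True when every job in the session is in the TERMINATED state."""
    done, total = counts.get("TERMINATED", (0, 0))
    return total > 0 and done == total
```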
The three jobs are named 0--1, 2--3 and 4--5 (you could see this by passing the -l option to grosetta); each of these jobs will create an output directory named after the job.
In general, grosetta jobs are named {N}--{M} with N and M being two integers from 0 up to the value specified with option --total-decoys. Jobs generated by gdocking are instead named after the input file, with a .{N}--{M} suffix added.
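Since output directories are named after the jobs, it can be handy to split a job name back into its parts. The Python sketch below relies only on the two naming patterns described above; `parse_job_name` is a hypothetical helper, and it makes no claim about how grosetta chooses N and M.

```python
import re

# Optional "<input file stem>." prefix (gdocking), then "N--M".
JOB_NAME = re.compile(r"^(?:(?P<prefix>.+)\.)?(?P<first>\d+)--(?P<last>\d+)$")

def parse_job_name(name):
    """Return (prefix, N, M); prefix is None for plain grosetta job names."""
    match = JOB_NAME.match(name)
    if match is None:
        raise ValueError(f"not a grosetta/gdocking job name: {name!r}")
    return match.group("prefix"), int(match.group("first")), int(match.group("last"))

print(parse_job_name("0--1"))
print(parse_job_name("1bjpa.1--2"))
```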
For each job, the set of output files is automatically retrieved and placed in the locations described below.
Note
The naming and contents of output files differ between grosetta and gdocking. Refer to the appropriate section below!
Upon successful completion, the output directory of each grosetta job contains:
- A copy of the input PDB files;
- Additional .pdb files named S_{random string}.pdb, generated by minirosetta during its run;
- A file score.sc;
- Files minirosetta.static.log, minirosetta.static.stdout.txt and minirosetta.static.stderr.txt.
The minirosetta.static.log file contains the output log of the minirosetta execution. For each of the S_*.pdb files above, a line like the following should be present in the log file (the file name and number of elapsed seconds will of course vary!):
protocols.jd2.JobDistributor: S_1CA7A_1_0001 reported success in 124 seconds
The minirosetta.static.stdout.txt file contains a copy of the minirosetta output log, plus the output of the wrapper script. In case of a successful minirosetta run, the last line of this file will read:
minirosetta.static: All done, exitcode: 0
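The two log lines quoted above lend themselves to a quick automated check of a job's output directory. The Python sketch below matches only the line formats shown in this section; `successful_decoys` and `wrapper_exit_code` are hypothetical helper names, and real log files may of course contain variations not covered here.

```python
import re

SUCCESS = re.compile(
    r"protocols\.jd2\.JobDistributor: (?P<decoy>\S+) "
    r"reported success in (?P<seconds>\d+) seconds"
)
ALL_DONE = re.compile(r"All done, exitcode: (?P<code>-?\d+)\s*$")

def successful_decoys(log_text):
    """Map each decoy reported as successful in the log to its elapsed seconds."""
    return {m.group("decoy"): int(m.group("seconds")) for m in SUCCESS.finditer(log_text)}

def wrapper_exit_code(stdout_text):
    """Return the exit code from the last non-empty stdout line, or None."""
    lines = [line for line in stdout_text.splitlines() if line.strip()]
    if not lines:
        return None
    match = ALL_DONE.search(lines[-1])
    return int(match.group("code")) if match else None
```

A job directory could then be considered complete when `wrapper_exit_code` returns 0 and `successful_decoys` lists one entry per S_*.pdb file.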
Execution of gdocking yields the following output:
- For each .pdb input file, a .decoys.tar file (e.g., for 1bjpa.pdb input, a 1bjpa.decoys.tar output is produced), which contains the .pdb files of the decoys produced by gdocking.
- For each successful job, a .N--M directory: e.g., for the 1bjpa.1--2 job, a 1bjpa.1--2/ directory is created, with the following content:
  - docking_protocol.log: output of Rosetta's docking_protocol program;
  - docking_protocol.stderr.txt, docking_protocol.stdout.txt: the standard error and standard output of the job. The "stdout" file contains a copy of the docking_protocol.log contents, plus the output from the wrapper script.
  - docking_protocol.tar.gz: the .pdb decoy files produced by the job.
The following scheme summarizes the location of gdocking output files:
(directory where gdocking is run)/
|
+- file1.pdb Original input file
|
+- file1.N--M/ Directory collecting job outputs from job file1.N--M
| |
| +- docking_protocol.tar.gz
| +- docking_protocol.log
| +- docking_protocol.stderr.txt
| ... etc
|
+- file1.N--M.fasc FASC file for decoys N to M [1]
|
+- file1.decoys.tar tar archive of PDB file of all decoys
| generated corresponding to 'file1.pdb' [2]
|
...
Let P be the total number of decoys (the argument to the -P option), and p be the number of decoys per job (argument to the -p option). Then you would get in a single directory:
- (P/p) different .fasc files, corresponding to the (P/p) jobs;
- P different .pdb files, named a_file.0.pdb to a_file.{(P-1)}.pdb
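The naming rule for decoy files can be used to check that a run produced everything it was asked for. The Python sketch below lists the expected .pdb names for a given P and compares them with the members of a .decoys.tar archive. Both function names are hypothetical, and the check assumes that the archive stores the decoys under exactly the names given above, which this section does not state explicitly.

```python
import tarfile

def expected_decoy_names(stem, total_decoys):
    """Names a_file.0.pdb ... a_file.{P-1}.pdb for an input file stem."""
    return [f"{stem}.{index}.pdb" for index in range(total_decoys)]

def missing_decoys(tar_path, stem, total_decoys):
    """List expected decoy files that are absent from the tar archive."""
    with tarfile.open(tar_path) as archive:
        # Compare base names only, in case members are stored under a directory.
        present = {name.rsplit("/", 1)[-1] for name in archive.getnames()}
    return [name for name in expected_decoy_names(stem, total_decoys) if name not in present]
```

An empty list returned by `missing_decoys("file1.decoys.tar", "file1", P)` would indicate that all P decoys are in the archive.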
This section contains commented example sessions with grosetta. All the files used in this example are available in the GC3Pie Rosetta test directory (courtesy of Lars Malmstroem).
In typical operation, one calls grosetta with the -C option and lets it manage a set of jobs until completion.
So, to generate one decoy from a set of given input files, one can use the following command-line invocation:
$ grosetta -s example -C 120 -P 1 -p 1 \
flags alignment.filt query.fasta \
query.psipred_ss2 boinc_aaquery03_05.200_v1_3.gz \
boinc_aaquery09_05.200_v1_3.gz 1bjpA.pdb 1ca7A.pdb \
2fltA.pdb 2fm7A.pdb 2op8A.pdb 2ormA.pdb 2os5A.pdb \
3c6vA.pdb
The -s example option tells grosetta to store information about the computational jobs in the example.jobs directory.
The -C 120 option tells grosetta to update job state every 120 seconds; output from finished jobs is retrieved and new jobs are submitted at the same interval.
The -P 1 and -p 1 options set the total number of decoys to compute and the maximum number of decoys that a single computational job can handle. These values can be arbitrarily high (however, the p value should be such that the computational job can actually compute that many decoys in the allotted wall-clock time).
The above command will start by printing a status report like the following:
Status of jobs in the 'example.csv' session:
SUBMITTED 1/1 (100.0%)
It will continue printing an updated status report every 120 seconds until the requested number of decoys (set by the -P option) has been computed.
In GC3Pie terminology, when a job is finished and its output has been successfully retrieved, the job is marked as TERMINATED:
Status of jobs in the 'example.csv' session:
TERMINATED 1/1 (100.0%)
We now show how one can obtain the same result by calling grosetta multiple times (there could be hours of interruption between one invocation and the next one).
Note
This is not the typical mode of operating with grosetta, but may still be useful in certain settings.
1. Create a session (1 job only, since no -P option is given); the session name is chosen with the -s (short for --session) option. You should take care to re-use the same session name with subsequent commands.
   $ grosetta -s example flags alignment.filt query.fasta \
       query.psipred_ss2 boinc_aaquery03_05.200_v1_3.gz \
       boinc_aaquery09_05.200_v1_3.gz 1bjpA.pdb 1ca7A.pdb \
       2fltA.pdb 2fm7A.pdb 2op8A.pdb 2ormA.pdb 2os5A.pdb
   Status of jobs in the 'example.csv' session:
   SUBMITTED 1/1 (100.0%)
2. Now we call grosetta again, and request that 3 decoys be computed starting from a single PDB file (--total-decoys 3 on the command line). Since we are submitting a single PDB file, the 3 decoys will be computed all in a single run, so the --decoys-per-job option will have value 3.
   $ grosetta -s example --total-decoys 3 --decoys-per-job 3 \
       flags alignment.filt query.fasta \
       query.psipred_ss2 boinc_aaquery03_05.200_v1_3.gz \
       boinc_aaquery09_05.200_v1_3.gz 3c6vA.pdb
   Status of jobs in the 'example.csv' session:
   SUBMITTED 3/3 (100.0%)
   Note that 3 jobs were submitted: grosetta interprets the --total-decoys option globally, and adds one job to compute the 2 missing decoys from the file set from step 1. (This is currently a limitation of grosetta.)
   From here on, one could simply run grosetta -C 120 and let it manage the session until completion of all jobs, as in the example "Manage a set of jobs from start to end" above. To show the use of several command-line options of grosetta, we shall further show how to manage the session by repeated separate invocations.
3. Next step is to monitor the session, so we add the command-line option -l, which tells grosetta to list all the jobs with their status. Also note that we keep the -s example option to tell grosetta that we would like to operate on the session named example.
   All non-option arguments can be omitted: as long as the total number of decoys is unchanged, they're not needed.
   $ grosetta -s example -l
   Decoys Nr.   State (JobID)       Info
   ================================================================================
   0--1         RUNNING (job.766)   Running at Mon Dec 20 19:32:08 2010
   2--3         RUNNING (job.767)   Running at Mon Dec 20 19:33:23 2010
   0--2         RUNNING (job.768)   Running at Mon Dec 20 19:33:43 2010
   Without the -l option only a summary of job statuses is presented:
   $ grosetta -s example
   Status of jobs in the 'grosetta.csv' session:
   RUNNING 3/3 (100.0%)
   Alternatively, we can keep the command line arguments used in the previous invocation: they will be ignored since they do not add any new job (the number of decoys to compute is always 1):
   $ grosetta -s example -l flags alignment.filt query.fasta \
       query.psipred_ss2 boinc_aaquery03_05.200_v1_3.gz \
       boinc_aaquery09_05.200_v1_3.gz 1bjpA.pdb 1ca7A.pdb \
       2fltA.pdb 2fm7A.pdb 2op8A.pdb 2ormA.pdb 2os5A.pdb \
       3c6vA.pdb
   Decoys Nr.   State (JobID)       Info
   ================================================================================
   0--1         RUNNING (job.766)
   2--3         RUNNING (job.767)   Running at Mon Dec 20 19:33:23 2010
   0--2         RUNNING (job.768)   Running at Mon Dec 20 19:33:43 2010
   Note that the -l option is available also in combination with the -C option (see "Manage a set of jobs from start to end").
4. Calling grosetta again when jobs are done triggers automated download of the results:
   $ ../grosetta.py
   File downloaded: gsiftp://idgc3grid01.uzh.ch:2811/jobs/214661292869757468202765/minirosetta.static.stdout.txt
   File downloaded: gsiftp://idgc3grid01.uzh.ch:2811/jobs/214661292869757468202765/minirosetta.static.log
   ...
   File downloaded: gsiftp://idgc3grid01.uzh.ch:2811/jobs/214661292869757468202765/.arc/input
   Status of jobs in the 'grosetta.csv' session:
   TERMINATED 1/1 (100.0%)
   ok 1/1 (100.0%)
5. The -l option comes in handy to see what directory contains the job output:
   $ grosetta -l
   Decoys Nr.   State (JobID)          Info
   ==================================================================================
   0--1         TERMINATED (job.766)   Output retrieved into directory '/tmp/0--1'