
Frequently Asked Questions

melrom edited this page Dec 6, 2012 · 38 revisions

Q: Why did you call it BigJob?

Because we just love British comedies.

Q: How can I delete a BigJob installation installed with the bootstrap script?

Remove the following directory: rm -r $HOME/.bigjob/python/lib/python2.<PYTHON_MINOR_VERSION>/site-packages/BigJob-<BIGJOB_VERSION>-py2.X.egg/

Q: How can a BigJob package be updated?

To update the BigJob package, run:

easy_install -U bigjob

Q: Can tasks be distributed across two machines on two different infrastructures (e.g. Eric on LONI and Ranger on XSEDE) that require different credentials (SSH users or Globus certificates)?

Currently, SAGA Context is supported. If SSH is used, different credentials can be configured in the ~/.ssh/config file (see man ssh_config). SAGA-Python (Bliss) currently does not support Globus.

Q: How can I install/configure my own Redis server?

Redis is the most stable and fastest backend (requires Python >2.5) and the recommended way of using BigJob. Redis can easily be run in user space. It can be downloaded from http://redis.io/download (only ~500 KB). Once you have downloaded and compiled Redis, start a Redis server on the machine of your choice:

$ redis-server 
[489] 13 Sep 10:11:28 # Warning: no config file specified, using the default config. In order to specify a config file use 'redis-server /path/to/redis.conf'
[489] 13 Sep 10:11:28 * Server started, Redis version 2.2.12
[489] 13 Sep 10:11:28 * The server is now ready to accept connections on port 6379
[489] 13 Sep 10:11:28 - 0 clients connected (0 slaves), 922160 bytes in use

Then set the COORDINATION_URL parameter (defined at the top of most examples) to the endpoint of your Redis installation, e.g.

redis://<hostname>:6379 

The coordination URL is passed to the constructor of the PilotComputeService or the PilotDataService, e.g.

pilot_compute_service = PilotComputeService(coordination_url="redis://<hostname>:6379")

It is recommended to set up a password for your Redis server. Otherwise, other users will be able to access and manipulate the data stored in your Redis server.
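Before handing a coordination URL to PilotComputeService, it can help to confirm that the Redis endpoint actually accepts connections. The following is a minimal sketch using only the Python standard library. The helper names split_coordination_url and redis_reachable are hypothetical and not part of BigJob, and the code assumes a Python 3 interpreter on the machine where the check is run.

```python
import socket
from urllib.parse import urlparse


def split_coordination_url(url, default_port=6379):
    """Return (hostname, port) for a URL of the form redis://<hostname>:6379."""
    parsed = urlparse(url)
    if parsed.scheme != "redis":
        raise ValueError("expected a redis:// URL, got: %r" % url)
    if not parsed.hostname:
        raise ValueError("no hostname in URL: %r" % url)
    return parsed.hostname, parsed.port or default_port


def redis_reachable(url, timeout=3.0):
    """Return True if a TCP connection to the Redis endpoint succeeds."""
    host, port = split_coordination_url(url)
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
    sock.close()
    return True
```

Calling redis_reachable("redis://localhost:6379") returns True only if something is listening on that port. It does not verify that the listener is Redis or that a password would be accepted.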

Q: How do I execute and reconnect to long-running sessions of BigJob in a Unix terminal?

The UNIX screen tool should be used to reconnect to a running BigJob session on a remote machine. For documentation on screen, see the screen man page.

Do not submit a BigJob from your local machine to a remote host and then close the terminal unless the session runs inside screen.

Q: If BigJob is being used to launch Pilot-Jobs on multiple machines, does SAGA have to be installed on all the machines?

It is recommended to have SAGA-Python (Bliss) installed on the resources running the pilot. BJ will work without SAGA-Python on the resource, but will not support file staging.

Q: The agent cannot be started correctly. I see a Python error in the agent stdout/stderr file.

Please make sure that the resource has a suitable Python version installed. The following command should return a valid Python version (Python 2.7 in the optimal case):

$ ssh localhost "python -V"
Python 2.7.2
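When several resources need to be verified, the same check can be scripted. This is a minimal sketch, assuming passwordless SSH to each host and a Python 3 interpreter on the machine running the script. The helper names are hypothetical. Python 2 prints its version banner to stderr, so the sketch merges stderr into stdout before parsing.

```python
import re
import subprocess


def parse_python_version(banner):
    """Turn a banner such as 'Python 2.7.2' into a tuple like (2, 7, 2)."""
    match = re.search(r"Python (\d+)\.(\d+)(?:\.(\d+))?", banner)
    if match is None:
        raise ValueError("not a Python version banner: %r" % banner)
    return tuple(int(part) for part in match.groups() if part is not None)


def remote_python_version(host):
    """Run `python -V` on host over SSH and return the parsed version."""
    result = subprocess.run(
        ["ssh", host, "python -V"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # Python 2 writes the banner to stderr
        universal_newlines=True,
    )
    # Raises ValueError if ssh failed or no Python was found on the host.
    return parse_python_version(result.stdout)
```

A result below (2, 7) suggests that the agent may pick up an unsuitable interpreter on that resource.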

Q: Does BigJob stage work & data unit files onto the target resource where the work unit has to execute?

Yes, there is SSH-based support for file stage-in.

Q: How can I run on a remote resource?

The BigJob manager expects a URL to a SAGA Job Service as a parameter (lrms_url). The respective SAGA adaptor needs to be installed and working (please test the adaptor properly with SAGA before using BJ). Currently, BigJob works with the following SAGA Job adaptors:

SAGA/PBS: lrms_url = "pbs://localhost" 
SAGA/SSH: lrms_url = "ssh://oliver2.loni.org"
SAGA/PBS+SSH: lrms_url = "pbs+ssh://oliver2.loni.org"
SAGA/Globus: lrms_url = "gram://oliver1.loni.org/jobmanager-pbs" (only SAGA C++, deprecated)
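A typo in the URL scheme is an easy mistake to make, so it can be worth checking lrms_url before submitting. The sketch below is an illustration only: the function lrms_scheme is hypothetical, and the list of schemes is copied from the adaptor list above rather than queried from SAGA.

```python
from urllib.parse import urlparse

# Schemes taken from the adaptor list above (gram is SAGA C++ only, deprecated).
SUPPORTED_SCHEMES = ("pbs", "ssh", "pbs+ssh", "gram")


def lrms_scheme(lrms_url):
    """Return the scheme of lrms_url, or raise ValueError if it is not listed."""
    scheme = urlparse(lrms_url).scheme
    if scheme not in SUPPORTED_SCHEMES:
        raise ValueError("unsupported lrms_url scheme in %r" % lrms_url)
    return scheme
```

This only validates the form of the URL. Whether the corresponding adaptor is installed and working still has to be tested with SAGA itself.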

Q: Can I use BigJob with SAGA-Python (Bliss)?

Yes. Bliss (>0.2.3) is the best-supported SAGA version for BigJob, and it is the default.

Q: Can I use BigJob with SAGA-C++?

Yes, but it is deprecated.

Q: My stdout file doesn't contain the output of /bin/date but "ssh: connect to host localhost port 22: Connection refused"

BigJob uses SSH to execute sub-jobs. Please ensure that your local SSH daemon is up and running and that you can log in without a password.

Q: Why is BigJob downloading an installation package?

BigJob attempts to install itself if it cannot find a valid BJ installation on a resource (i.e. if import bigjob fails). By default, BigJob searches $HOME/.bigjob/python for a working BJ installation. Please make sure that the correct Python is found in your default paths. If BJ attempts to install itself even though it is already installed on a resource, this can be a sign that the wrong Python is being found.

Q: How can I control logging?

BigJob utilizes a configuration file named bigjob.conf located in the root of the BigJob installation (e.g. $HOME/.bigjob/python/lib/python2.<PYTHON_MINOR_VERSION>/site-packages/BigJob-<BIGJOB_VERSION>-py2.X.egg/):

# Logging config
# logging.DEBUG, logging.INFO, logging.WARNING, logging.ERROR, logging.CRITICAL
logging.level=logging.INFO

Alternatively you can set the logging level in the code:

import logging
from bigjob import logger
logger.setLevel(logging.FATAL)

or via the environment variable BIGJOB_VERBOSE. For example, for full debug log output use:

export BIGJOB_VERBOSE=5

Q: How can I control BigJob log messages from its application?

The BigJob logger can be obtained from the logging module, and further handlers can be added to the logger object. For example, with the instructions below, the application and BigJob debug messages are written to the file namd_bigwork.log:

import logging
logger = logging.getLogger('bigjob')
fh = logging.FileHandler('namd_bigwork.log',mode='w')
fh.setLevel(logging.DEBUG)
logger.addHandler(fh)
logger.debug("Logging to namd_bigwork.log at DEBUG level")
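Note that a handler only receives records that the logger itself lets through, so if the logger's effective level is higher than DEBUG, the file stays empty regardless of the handler level. The sketch below sets both levels and adds a timestamped format. It uses only the standard logging module, and the function name log_bigjob_to_file is hypothetical.

```python
import logging


def log_bigjob_to_file(path, level=logging.DEBUG):
    """Attach a file handler to the 'bigjob' logger and return the logger."""
    logger = logging.getLogger("bigjob")
    # The logger's own level is checked before any handler sees a record.
    logger.setLevel(level)
    handler = logging.FileHandler(path, mode="w")
    handler.setLevel(level)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

For example, logger = log_bigjob_to_file("namd_bigwork.log") followed by logger.debug("application started") writes a timestamped DEBUG line to the file.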

Q: How can I use paths relative to the $HOME directory on a remote resource?

BigJob expands the tokens $HOME and ~ in the executable and working_directory attributes of the Compute Unit Description to the home directory on the respective resource.
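The practical consequence is that these tokens resolve against the remote home directory, not the home directory of the machine that submits the job. The following sketch only illustrates the effect of such an expansion. It is not BigJob's implementation, and the function expand_home and the path /home/alice are made up for the example.

```python
def expand_home(path, home):
    """Replace a leading '~' or any '$HOME' token in path with home."""
    if path == "~" or path.startswith("~/"):
        path = home + path[1:]
    return path.replace("$HOME", home)


# The same description yields different absolute paths on different resources.
print(expand_home("~/run/namd", "/home/alice"))
print(expand_home("$HOME/bin/app", "/home/alice"))
```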

Q: Can I reconnect to a currently running BigJob?

Yes, if your BigJob manager (or application) terminates before all ComputeUnits terminate, you can reconnect to a running pilot by providing a pilot_url to the PilotCompute constructor. For example:

pilot = PilotCompute(pilot_url="redis://localhost:6379/bigjob:bj-a7bfae68-25a0-11e2-bd6c-705681b3df0f:localhost")

Q: What directories is BJ creating? How can I control the directory in which my subjob is executed?

By default, BJ creates a directory structure relative to the BJ working directory specified in start_pilot_job:

<BIGJOB_WORKING_DIRECTORY>/bj-54aaba6c-32ec-11e1-a4e5-00264a13ca4c/sj-55010912-32ec-11e1-a4e5-00264a13ca4c
<BIGJOB_WORKING_DIRECTORY>/bj-54aaba6c-32ec-11e1-a4e5-00264a13ca4c/sj-55153072-32ec-11e1-a4e5-00264a13ca4c

A separate directory is created for each sub-job. Sub-jobs can be executed in any directory by setting the working directory in the sub-job description:

jd.working_directory = "<your directory of choice>" 
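When post-processing results, it can be handy to compute where a sub-job ran by default. This is a small sketch that follows the layout shown above. The function name and the /work/alice path are hypothetical, and the identifiers are the ones from the listing.

```python
import posixpath


def subjob_directory(working_directory, pilot_id, subjob_id):
    """Return the default directory for a sub-job, following the layout above."""
    return posixpath.join(working_directory, pilot_id, subjob_id)


print(subjob_directory(
    "/work/alice",
    "bj-54aaba6c-32ec-11e1-a4e5-00264a13ca4c",
    "sj-55010912-32ec-11e1-a4e5-00264a13ca4c",
))
```

This applies only when working_directory is not overridden in the sub-job description.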

Q: Does BigJob work on Kraken?

Yes, it works, but with limitations. Kraken requires the user to launch jobs with aprun, and aprun can only be called once per batch job. BJ's compute unit launch mechanism, which spawns one process per compute unit, is therefore not compatible with aprun.

You can, however, execute a single compute unit concurrently by setting the NUMBER_SUBJOBS variable in the environment of the job description:

jd.environment=["NUMBER_SUBJOBS=2"]

FAQ for SAGA C++/BigJob

Q: SAGA C++ - What does the error - saga.bad_parameter: SAGA(BadParameter) - mean?

Very likely, the SAGA C++ adaptor is not configured correctly. If the PBSPro adaptor is used, export PBS_HOME; if the Torque adaptor is used, export TORQUE_HOME. Each variable must point to the installation location of the corresponding scheduler. For example:

$ which qsub
/usr/local/bin/qsub
$ export PBS_HOME=/usr/local
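The same lookup can be done from Python before the BigJob application starts. This is a minimal sketch, assuming (as in the example above) that qsub lives in <prefix>/bin, so the prefix is two directory levels above the binary. The function names are hypothetical, and shutil.which requires Python 3.

```python
import os
import shutil


def scheduler_home(qsub_path):
    """Return the installation prefix for a qsub binary located in <prefix>/bin."""
    return os.path.dirname(os.path.dirname(os.path.abspath(qsub_path)))


def guess_scheduler_home():
    """Locate qsub on PATH and return its prefix, or None if qsub is missing."""
    qsub = shutil.which("qsub")
    return scheduler_home(qsub) if qsub else None
```

Assigning the result to os.environ["PBS_HOME"] (or "TORQUE_HOME") affects only the current process and the processes it starts, so it has to happen before SAGA is loaded.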

Q: Can BigJob be run from a normal desktop/laptop on a cluster (like LONI)?

Yes, it is possible. The prerequisites are: (1) SAGA, (2) the Globus client tools, and (3) the cluster's (LONI) CA added to your list of trusted certificate authorities. The commands below set up the LONI certificates; for more information, see https://docs.loni.org/wiki/LONI_Certificates.

cd $HOME/.globus/certificates
wget https://docs.loni.org/mediawiki-1.13.3-docsloni/images/9/9c/a3bf9f3c.0 --no-check-certificate
wget https://docs.loni.org/mediawiki-1.13.3-docsloni/images/d/d9/a3bf9f3c.signing_policy --no-check-certificate 