Merge branch 'develop-prod'

* develop-prod: (144 commits)
  throughput measurement script updated
  Update 05_bigjob_mandelbrot.py
  working (?!?) MB on stampede...
  desperate attempt to get MB to work on stampede
  make threadpool size on agent side configurable via pilot description
  Make executor threadpool size configurable
  Throughput measurements example
  make local executor plugin obey number_of_processes
  re-did some fix lost in reset
  make launch mechanism detection more reliable to missing configuration files
  improved error message in case of bootstrap failures
  made AST optional (support older Python versions on agent side)
  added support for aprun on Hector
  fixing build script
  further work on Hector support
  more manifest updates
  updated manifest
  reverted changes
  removed allocation from tutorial examples
  argh
  fixed manifest
  ...

Conflicts:
	setup.py
commit 45d7d5192df10fc2c084b2f63fe29cf042ba938e 2 parents 33d653a + bc2cab0
@drelu drelu authored
Showing with 1,313 additions and 34,617 deletions.
  1. +19 −2 MANIFEST.in
  2. +8 −0 Makefile
  3. +1 −0  README.md
  4. +2 −2 bigjob.conf
  5. +12 −0 bigjob.sublime-project
  6. +0 −5 bigjob/__init__.py
  7. +47 −11 bigjob/bigjob_agent.py
  8. +113 −113 bigjob/bigjob_manager.py
  9. +10 −8 bigjob/job_plugin/ec2ssh.py
  10. +11 −7 bigjob/job_plugin/gcessh.py
  11. +5 −11 bigjob/job_plugin/slurmssh.py
  12. +3 −0  bigjob_agent.conf
  13. +3 −3 coordination/bigjob_coordination_redis.py
  14. BIN  docs/build/doctrees/_themes/armstrong/README.doctree
  15. BIN  docs/build/doctrees/architecture/index.doctree
  16. BIN  docs/build/doctrees/environment.pickle
  17. BIN  docs/build/doctrees/index.doctree
  18. BIN  docs/build/doctrees/install/config.doctree
  19. BIN  docs/build/doctrees/install/index.doctree
  20. BIN  docs/build/doctrees/install/install.doctree
  21. BIN  docs/build/doctrees/install/redis.doctree
  22. BIN  docs/build/doctrees/install/trouble.doctree
  23. BIN  docs/build/doctrees/install/xsede.doctree
  24. BIN  docs/build/doctrees/intro/index.doctree
  25. BIN  docs/build/doctrees/library/index.doctree
  26. BIN  docs/build/doctrees/patterns/chained.doctree
  27. BIN  docs/build/doctrees/patterns/coupled.doctree
  28. BIN  docs/build/doctrees/patterns/exsede.doctree
  29. BIN  docs/build/doctrees/patterns/index.doctree
  30. BIN  docs/build/doctrees/patterns/pdata.doctree
  31. BIN  docs/build/doctrees/patterns/simple.doctree
  32. BIN  docs/build/doctrees/tutorial/index.doctree
  33. BIN  docs/build/doctrees/tutorial/part1.doctree
  34. BIN  docs/build/doctrees/tutorial/part2.doctree
  35. BIN  docs/build/doctrees/tutorial/part3.doctree
  36. BIN  docs/build/doctrees/tutorial/part4.doctree
  37. BIN  docs/build/doctrees/tutorial/part5.doctree
  38. BIN  docs/build/doctrees/tutorial/part6.doctree
  39. BIN  docs/build/doctrees/usage/appwriting.doctree
  40. BIN  docs/build/doctrees/usage/cmdtools.doctree
  41. BIN  docs/build/doctrees/usage/index.doctree
  42. BIN  docs/build/doctrees/usage/logging.doctree
  43. BIN  docs/build/doctrees/usage/output.doctree
  44. BIN  docs/build/doctrees/usage/pilotdata.doctree
  45. +0 −4 docs/build/html/.buildinfo
  46. BIN  docs/build/html/_images/bigjob-architecture.png
  47. +0 −70 docs/build/html/_sources/_themes/armstrong/README.txt
  48. +0 −3  docs/build/html/_sources/architecture/index.txt
  49. +0 −37 docs/build/html/_sources/index.txt
  50. +0 −54 docs/build/html/_sources/install/config.txt
  51. +0 −17 docs/build/html/_sources/install/index.txt
  52. +0 −154 docs/build/html/_sources/install/install.txt
  53. +0 −28 docs/build/html/_sources/install/redis.txt
  54. +0 −72 docs/build/html/_sources/install/trouble.txt
  55. +0 −128 docs/build/html/_sources/install/xsede.txt
  56. +0 −74 docs/build/html/_sources/intro/index.txt
  57. +0 −450 docs/build/html/_sources/library/index.txt
  58. +0 −14 docs/build/html/_sources/patterns/chained.txt
  59. +0 −8 docs/build/html/_sources/patterns/coupled.txt
  60. +0 −122 docs/build/html/_sources/patterns/exsede.txt
  61. +0 −9 docs/build/html/_sources/patterns/index.txt
  62. +0 −104 docs/build/html/_sources/patterns/pdata.txt
  63. +0 −12 docs/build/html/_sources/patterns/simple.txt
  64. +0 −32 docs/build/html/_sources/tutorial/index.txt
  65. +0 −186 docs/build/html/_sources/tutorial/part1.txt
  66. +0 −96 docs/build/html/_sources/tutorial/part2.txt
  67. +0 −3  docs/build/html/_sources/tutorial/part3.txt
  68. +0 −3  docs/build/html/_sources/tutorial/part4.txt
  69. +0 −3  docs/build/html/_sources/tutorial/part5.txt
  70. +0 −3  docs/build/html/_sources/tutorial/part6.txt
  71. +0 −192 docs/build/html/_sources/usage/appwriting.txt
  72. +0 −82 docs/build/html/_sources/usage/cmdtools.txt
  73. +0 −17 docs/build/html/_sources/usage/index.txt
  74. +0 −83 docs/build/html/_sources/usage/logging.txt
  75. +0 −17 docs/build/html/_sources/usage/output.txt
  76. +0 −221 docs/build/html/_sources/usage/pilotdata.txt
  77. +0 −464 docs/build/html/_static/agogo.css
  78. BIN  docs/build/html/_static/ajax-loader.gif
  79. +0 −439 docs/build/html/_static/basic.css
  80. BIN  docs/build/html/_static/bgfooter.png
  81. BIN  docs/build/html/_static/bgtop.png
  82. +0 −1,109 docs/build/html/_static/bootstrap-2.3.0/css/bootstrap-responsive.css
  83. +0 −9 docs/build/html/_static/bootstrap-2.3.0/css/bootstrap-responsive.min.css
  84. +0 −6,158 docs/build/html/_static/bootstrap-2.3.0/css/bootstrap.css
  85. +0 −9 docs/build/html/_static/bootstrap-2.3.0/css/bootstrap.min.css
  86. BIN  docs/build/html/_static/bootstrap-2.3.0/img/glyphicons-halflings-white.png
  87. BIN  docs/build/html/_static/bootstrap-2.3.0/img/glyphicons-halflings.png
  88. +0 −2,268 docs/build/html/_static/bootstrap-2.3.0/js/bootstrap.js
  89. +0 −6 docs/build/html/_static/bootstrap-2.3.0/js/bootstrap.min.js
  90. +0 −30 docs/build/html/_static/bootstrap-sphinx.css
  91. +0 −112 docs/build/html/_static/bootstrap-sphinx.js
  92. BIN  docs/build/html/_static/comment-bright.png
  93. BIN  docs/build/html/_static/comment-close.png
  94. BIN  docs/build/html/_static/comment.png
  95. BIN  docs/build/html/_static/darkmetal.png
  96. +0 −256 docs/build/html/_static/default.css
  97. +0 −170 docs/build/html/_static/default.css.disabled
  98. BIN  docs/build/html/_static/dialog-note.png
  99. BIN  docs/build/html/_static/dialog-seealso.png
  100. BIN  docs/build/html/_static/dialog-topic.png
  101. BIN  docs/build/html/_static/dialog-warning.png
  102. +0 −247 docs/build/html/_static/doctools.js
  103. BIN  docs/build/html/_static/down-pressed.png
  104. BIN  docs/build/html/_static/down.png
  105. +0 −310 docs/build/html/_static/epub.css
  106. BIN  docs/build/html/_static/file.png
  107. BIN  docs/build/html/_static/footerbg.png
  108. BIN  docs/build/html/_static/headerbg.png
  109. +0 −7 docs/build/html/_static/ie6.css
  110. +0 −154 docs/build/html/_static/jquery.js
  111. +0 −9,597 docs/build/html/_static/js/jquery-1.9.1.js
  112. +0 −5 docs/build/html/_static/js/jquery-1.9.1.min.js
  113. +0 −2  docs/build/html/_static/js/jquery-fix.js
  114. BIN  docs/build/html/_static/logo.png
  115. BIN  docs/build/html/_static/metal.png
  116. BIN  docs/build/html/_static/middlebg.png
  117. BIN  docs/build/html/_static/minus.png
  118. +0 −245 docs/build/html/_static/nature.css
  119. BIN  docs/build/html/_static/navigation.png
  120. BIN  docs/build/html/_static/plus.png
  121. +0 −5 docs/build/html/_static/print.css
  122. +0 −261 docs/build/html/_static/pydoctheme.css
  123. +0 −62 docs/build/html/_static/pygments.css
  124. +0 −323 docs/build/html/_static/pyramid.css
  125. +0 −786 docs/build/html/_static/rtd.css
  126. +0 −431 docs/build/html/_static/scrolls.css
  127. +0 −560 docs/build/html/_static/searchtools.js
  128. +0 −151 docs/build/html/_static/sidebar.js
  129. +0 −26 docs/build/html/_static/theme_extras.js
  130. BIN  docs/build/html/_static/transparent.gif
  131. +0 −23 docs/build/html/_static/underscore.js
  132. BIN  docs/build/html/_static/up-pressed.png
  133. BIN  docs/build/html/_static/up.png
  134. BIN  docs/build/html/_static/watermark.png
  135. BIN  docs/build/html/_static/watermark_blur.png
  136. +0 −808 docs/build/html/_static/websupport.js
  137. +0 −167 docs/build/html/_themes/armstrong/README.html
  138. +0 −104 docs/build/html/architecture/index.html
  139. +0 −343 docs/build/html/genindex.html
  140. +0 −185 docs/build/html/index.html
  141. +0 −157 docs/build/html/install/config.html
  142. +0 −184 docs/build/html/install/index.html
  143. +0 −275 docs/build/html/install/install.html
  144. +0 −144 docs/build/html/install/redis.html
  145. +0 −180 docs/build/html/install/trouble.html
  146. +0 −270 docs/build/html/install/xsede.html
  147. +0 −204 docs/build/html/intro/index.html
  148. +0 −898 docs/build/html/library/index.html
  149. BIN  docs/build/html/objects.inv
  150. +0 −197 docs/build/html/patterns/chained.html
  151. +0 −204 docs/build/html/patterns/coupled.html
  152. +0 −274 docs/build/html/patterns/exsede.html
  153. +0 −132 docs/build/html/patterns/index.html
  154. +0 −224 docs/build/html/patterns/pdata.html
  155. +0 −178 docs/build/html/patterns/simple.html
  156. +0 −112 docs/build/html/search.html
  157. +0 −1  docs/build/html/searchindex.js
  158. +0 −151 docs/build/html/tutorial/index.html
  159. +0 −296 docs/build/html/tutorial/part1.html
  160. +0 −252 docs/build/html/tutorial/part2.html
  161. +0 −128 docs/build/html/tutorial/part3.html
  162. +0 −128 docs/build/html/tutorial/part4.html
  163. +0 −118 docs/build/html/tutorial/part5.html
  164. +0 −105 docs/build/html/tutorial/part6.html
  165. +0 −320 docs/build/html/usage/appwriting.html
  166. +0 −215 docs/build/html/usage/cmdtools.html
  167. +0 −166 docs/build/html/usage/index.html
  168. +0 −225 docs/build/html/usage/logging.html
  169. +0 −135 docs/build/html/usage/output.html
  170. +0 −340 docs/build/html/usage/pilotdata.html
  171. +2 −2 docs/source/_themes/armstrong/theme.conf
  172. BIN  docs/source/images/github.jpg
  173. BIN  docs/source/images/google.png
  174. +10 −2 docs/source/index.rst
  175. +2 −2 docs/source/install/install.rst
  176. +52 −51 docs/source/library/index.rst
  177. +14 −6 docs/source/usage/appwriting.rst
  178. +62 −0 examples/example_styleguide.py
  179. +2 −6 examples/pilot-api/example-pilot-api.py
  180. +47 −0 examples/pilot-api/fromsphinx.py
  181. +63 −33 examples/tutorial/local_simple_ensembles.py
  182. +83 −0 examples/xsede2013/01_bigjob-simple-ensemble.py
  183. +91 −0 examples/xsede2013/02_bigjob-simple-ensemble-datatransfer.py
  184. +107 −0 examples/xsede2013/03_bigjob_chained_ensemble.py
  185. +128 −0 examples/xsede2013/04_bigjob_coupled_ensembles.py
  186. +130 −0 examples/xsede2013/05_bigjob_mandelbrot.py
  187. +81 −0 examples/xsede2013/CDS-01_bigjob-simple-ensemble.py
  188. +94 −0 examples/xsede2013/CDS-02_bigjob-simple-ensemble-datatransfer.py
  189. +95 −0 examples/xsede2013/mandelbrot.py
  190. +8 −0 examples/xsede2013/mandelbrot.sh
  191. +0 −1  pilot/__init__.py
  192. +1 −1  pilot/coordination/nocoord_adaptor.py
  193. +3 −5 pilot/coordination/redis_adaptor.py
  194. +1 −1  pilot/filemanagement/irods_adaptor.py
  195. +1 −1  pilot/filemanagement/s3_adaptor.py
  196. +2 −1  pilot/impl/pilot_manager_decentral.py
Sorry, we could not display the entire diff because it was too big.
21 MANIFEST.in
@@ -1,2 +1,19 @@
-recursive-include package *
-prune doc
+include *.cfg
+include *.conf
+include *.md
+
+recursive-include api *
+recursive-include bigjob *
+recursive-include bigjob_dynamic *
+recursive-include bootstrap *
+recursive-include cli *
+recursive-include coordination *
+recursive-include pilot *
+recursive-include scripts *
+recursive-include tests *
+recursive-include util *
+
+global-exclude *.dll
+global-exclude *.pyc
+global-exclude *.pyo
+global-exclude *.so
8 Makefile
@@ -0,0 +1,8 @@
+
+.PHONY: clean
+
+
+clean:
+ -rm -rf build/ saga.egg-info/ temp/ MANIFEST dist/ *.egg-info
+ make -C docs clean
+ find . -name \*.pyc -exec rm -f {} \;
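The `clean` target's `find . -name \*.pyc -exec rm -f {} \;` sweep can be sketched in the project's own language (the scratch directory and file names below are illustrative, not part of the repository):

```python
import os, tempfile

def clean_pyc(root):
    """Delete every *.pyc under root, like the Makefile's find/rm line."""
    removed = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".pyc"):
                os.remove(os.path.join(dirpath, name))
                removed += 1
    return removed

# scratch tree with one source file and one stale bytecode file
scratch = tempfile.mkdtemp()
open(os.path.join(scratch, "mod.py"), "w").close()
open(os.path.join(scratch, "mod.pyc"), "w").close()
count = clean_pyc(scratch)
```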
1  README.md
@@ -76,3 +76,4 @@ Building PyPi package
Upload to PyPi
python setup.py sdist upload
+
4 bigjob.conf
@@ -5,5 +5,5 @@ saga=bliss
# Logging config
# logging.DEBUG, logging.INFO, logging.WARNING, logging.ERROR, logging.CRITICAL
# logging.level=logging.DEBUG
-logging.level=logging.FATAL
-#logging.level=logging.DEBUG
+#logging.level=logging.FATAL
+logging.level=logging.DEBUG
12 bigjob.sublime-project
@@ -0,0 +1,12 @@
+{
+ "folders":
+ [
+ {
+ "path": ".",
+ "folder_exclude_patterns": ["build", "dist", "*.egg-info",
+ "test", "temp", "venv", "*.egg"],
+ "file_exclude_patterns": ["*.sublime-workspace", "*.egg"]
+
+ }
+ ]
+}
5 bigjob/__init__.py
@@ -8,7 +8,6 @@
#READ config
-SAGA_BLISS=False
try:
import ConfigParser
_CONFIG_FILE="bigjob.conf"
@@ -67,10 +66,6 @@
paramiko_logger = logging.getLogger(name="paramiko.transport")
paramiko_logger.setLevel(logging.ERROR)
#logging.basicConfig(level=logging_level)
-
- saga = default_dict["saga"]
- if saga.lower() == "bliss":
- SAGA_BLISS=True
except:
print("bigjob.conf could not be read")
58 bigjob/bigjob_agent.py
@@ -22,6 +22,12 @@
logging.basicConfig(level=logging.DEBUG)
+# Optional Imports
+try:
+ import ast
+except:
+ logging.debug("Python version <2.6. AST coult not be imported. ")
+
try:
import saga
except:
@@ -91,19 +97,15 @@ def __init__(self, args):
# linked under mpirun_rsh
if default_dict.has_key("mpirun"):
self.MPIRUN=default_dict["mpirun"]
+
+ if default_dict.has_key("number_executor_threads"):
+ THREAD_POOL_SIZE=int(default_dict["number_executor_threads"])
+
self.OUTPUT_TAR=False
if default_dict.has_key("create_output_tar"):
self.OUTPUT_TAR=eval(default_dict["create_output_tar"])
logger.debug("Create output tar: %r", self.OUTPUT_TAR)
- self.LAUNCH_METHOD="ssh"
- if default_dict.has_key("launch_method"):
- self.LAUNCH_METHOD=self.__get_launch_method(default_dict["launch_method"])
-
- logging.debug("Launch Method: " + self.LAUNCH_METHOD + " mpi: " + self.MPIRUN + " shell: " + self.SHELL)
-
- # init rms (SGE/PBS)
- self.init_rms()
self.failed_polls = 0
##############################################################################
@@ -169,9 +171,32 @@ def __init__(self, args):
logger.debug("set state to : " + str(bigjob.state.Running))
self.coordination.set_pilot_state(self.base_url, str(bigjob.state.Running), False)
self.pilot_description = self.coordination.get_pilot_description(self.base_url)
+ try:
+ self.pilot_description = ast.literal_eval(self.pilot_description)
+ except:
+ logger.warn("Unable to parse pilot description")
+ self.pilot_description = None
+
+
+ ############################################################################
+ # Detect launch method
+ self.LAUNCH_METHOD="ssh"
+ if default_dict.has_key("launch_method"):
+ self.LAUNCH_METHOD=default_dict["launch_method"]
+
+ self.LAUNCH_METHOD=self.__get_launch_method(self.LAUNCH_METHOD)
+
+ logging.debug("Launch Method: " + self.LAUNCH_METHOD + " mpi: " + self.MPIRUN + " shell: " + self.SHELL)
+
+ # init rms (SGE/PBS)
+ self.init_rms()
##############################################################################
# start background thread for polling new jobs and monitoring current jobs
+ # check whether user requested a certain threadpool size
+ if self.pilot_description!=None and self.pilot_description.has_key("number_executor_threads"):
+ THREAD_POOL_SIZE=int(self.pilot_description["number_executor_threads"])
+ logger.debug("Creating executor thread pool of size: %d"%(THREAD_POOL_SIZE))
self.resource_lock=threading.RLock()
self.threadpool = ThreadPool(THREAD_POOL_SIZE)
@@ -238,7 +263,11 @@ def init_local(self):
""" initialize free nodes list with dummy (for fork jobs)"""
logger.debug("Init nodefile from /proc/cpuinfo")
try:
- num_cpus = self.get_num_cpus()
+ num_cpus=1
+ if self.pilot_description != None:
+ num_cpus = self.pilot_description["number_of_processes"]
+ else:
+ num_cpus = self.get_num_cpus()
for i in range(0, num_cpus):
self.freenodes.append("localhost\n")
except IOError:
@@ -269,12 +298,19 @@ def init_pbs(self):
""" initialize free nodes list from PBS environment """
logger.debug("Init nodeslist from PBS NODEFILE")
if self.LAUNCH_METHOD == "aprun":
- # Workaround for Kraken
+ # Workaround for Kraken and Hector
# PBS_NODEFILE does only contain front node
# thus we create a dummy node file with the respective
# number of slots
# aprun does not rely on the nodefile for job launching
- number_nodes = os.environ.get("PBS_NNODES")
+
+ # get number of requested slots from pilot description
+ number_of_requested_processes = self.pilot_description["number_of_processes"]
+ if os.environ.has_key("PBS_NNODES"):
+ # use PBS assigned node count if available
+ number_nodes = os.environ.get("PBS_NNODES")
+ else:
+ number_nodes = number_of_requested_processes
self.freenodes=[]
for i in range(0, int(number_nodes)):
slot = "slot-%d\n"%i
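The pilot description arrives from the coordination service as a string, and the hunks above parse it with `ast.literal_eval`, falling back to `None` on failure and honoring a user-requested `number_executor_threads`. A minimal sketch of that parse-with-fallback pattern (the helper name and dictionary contents are illustrative, not from the repository):

```python
import ast

def parse_pilot_description(raw, default_pool_size=3):
    """Parse a pilot description string into a dict; None on parse failure."""
    try:
        desc = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # mirrors the agent's "Unable to parse pilot description" fallback
        return None, default_pool_size
    # honor a user-requested executor thread pool size if present
    pool_size = int(desc.get("number_executor_threads", default_pool_size))
    return desc, pool_size

desc, size = parse_pilot_description(
    "{'number_executor_threads': 8, 'number_of_processes': 4}")
```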
226 bigjob/bigjob_manager.py
@@ -18,8 +18,18 @@
import types
import subprocess
import pdb
+
+# the one and only saga
+import saga
+from saga.job import Description
+from saga import Url as SAGAUrl
+from saga.job import Description as SAGAJobDescription
+from saga.job import Service as SAGAJobService
+from saga import Session as SAGASession
+from saga import Context as SAGAContext
+
+from saga.utils.object_cache import ObjectCache as SAGAObjectCache
-from bigjob import SAGA_BLISS
from bigjob.state import Running, New, Failed, Done, Unknown
from bigjob import logger
@@ -33,10 +43,6 @@
from job_plugin.ec2ssh import Service as EC2Service
except:
pass
-try:
- from job_plugin.slurmssh import Service as SlurmService
-except:
- pass
# import other BigJob packages
@@ -44,48 +50,10 @@
import api.base
sys.path.append(os.path.dirname(__file__))
-
-if SAGA_BLISS == False:
- try:
- import saga
- logger.info("Using SAGA C++/Python.")
- is_bliss=False
- except:
- logger.warn("SAGA C++ and Python bindings not found. Using Bliss.")
- try:
- import bliss.saga as saga
- is_bliss=True
- except:
- logger.warn("SAGA Bliss not found")
-else:
- logger.info("Using SAGA Bliss.")
- try:
- import bliss.saga as saga
- is_bliss=True
- except:
- logger.warn("SAGA Bliss not found")
-
-
-"""BigJob Job Description is always derived from BLISS Job Description
- BLISS Job Description behaves compatible to SAGA C++ job description
-"""
-import bliss.saga.job.Description
-
-"""BLISS / SAGA C++ detection """
-if is_bliss:
- import bliss.saga as saga
- from bliss.saga import Url as SAGAUrl
- from bliss.saga.job import Description as SAGAJobDescription
- from bliss.saga.job import Service as SAGAJobService
- from bliss.saga import Session as SAGASession
- from bliss.saga import Context as SAGAContext
-else:
- from saga import url as SAGAUrl
- from saga.job import description as SAGAJobDescription
- from saga.job import service as SAGAJobService
- from saga import session as SAGASession
- from saga import context as SAGAContext
-
+# Some python version detection
+if sys.version_info < (2, 5):
+ sys.path.append(os.path.dirname( __file__ ) + "/ext/uuid-1.30/")
+ sys.stderr.write("Warning: Using unsupported Python version\n")
if sys.version_info < (2, 4):
sys.stderr.write("Error: Python versions <2.4 not supported\n")
@@ -159,6 +127,7 @@ def __init__(self,
logger.error("Coordination URL not set. Exiting BigJob.")
#self.launch_method=""
self.__filemanager=None
+ self._ocache = SAGAObjectCache ()
# restore existing BJ or initialize new BJ
if pilot_url!=None:
@@ -187,7 +156,7 @@ def __init__(self,
self.working_directory = None
logger.debug("initialized BigJob: " + self.app_url)
-
+
def start_pilot_job(self,
lrms_url,
number_nodes=1,
@@ -198,6 +167,7 @@ def start_pilot_job(self,
walltime=None,
processes_per_node=1,
filetransfers=None,
+ spmd_variation=None,
external_queue="",
pilot_compute_description=None):
""" Start a batch job (using SAGA Job API) at resource manager. Currently, the following resource manager are supported:
@@ -239,10 +209,10 @@ def start_pilot_job(self,
elif lrms_saga_url.scheme=="ec2+ssh" or lrms_saga_url.scheme=="euca+ssh" \
or lrms_saga_url.scheme=="nova+ssh":
self.js = EC2Service(lrms_saga_url, pilot_compute_description)
- elif lrms_saga_url.scheme=="slurm+ssh":
- self.js = SlurmService(lrms_saga_url, pilot_compute_description)
+ #elif lrms_saga_url.scheme=="slurm+ssh":
+ # self.js = SlurmService(lrms_saga_url, pilot_compute_description)
else:
- self.js = SAGAJobService(lrms_saga_url)
+ self.js = self._ocache.get_obj (lrms_saga_url, lambda : SAGAJobService (lrms_saga_url))
##############################################################################
# create job description
jd = SAGAJobDescription()
@@ -263,14 +233,14 @@ def start_pilot_job(self,
if queue != None:
jd.queue = queue
+ if spmd_variation != None:
+ jd.spmd_variation = spmd_variation
if project !=None:
jd.project=project
if walltime!=None:
logger.debug("setting walltime to: " + str(walltime))
- if is_bliss:
- jd.wall_time_limit=int(walltime)
- else:
- jd.wall_time_limit=str(walltime)
+ jd.wall_time_limit=int(walltime)
+
##############################################################################
@@ -335,19 +305,20 @@ def start_pilot_job(self,
)
logger.debug("Adaptor specific modifications: " + str(lrms_saga_url.scheme))
- if is_bliss and lrms_saga_url.scheme.startswith("condor")==False:
- bootstrap_script = self.__escape_bliss(bootstrap_script)
- else:
- if lrms_saga_url.scheme == "gram":
- bootstrap_script = self.__escape_rsl(bootstrap_script)
- elif lrms_saga_url.scheme == "pbspro" or lrms_saga_url.scheme=="xt5torque" or lrms_saga_url.scheme=="torque":
- bootstrap_script = self.__escape_pbs(bootstrap_script)
- elif lrms_saga_url.scheme == "ssh" and lrms_saga_url.scheme == "slurm+ssh":
- bootstrap_script = self.__escape_ssh(bootstrap_script)
+ #if lrms_saga_url.scheme.startswith("condor") == False:
+ # bootstrap_script = self.__escape_saga(bootstrap_script)
+ #else:
+ # if lrms_saga_url.scheme == "gram":
+ # bootstrap_script = self.__escape_rsl(bootstrap_script)
+ # elif lrms_saga_url.scheme == "pbspro" or lrms_saga_url.scheme=="xt5torque" or lrms_saga_url.scheme=="torque":
+ # bootstrap_script = self.__escape_pbs(bootstrap_script)
+ # elif lrms_saga_url.scheme == "ssh" and lrms_saga_url.scheme == "slurm+ssh":
+ # bootstrap_script = self.__escape_ssh(bootstrap_script)
+ bootstrap_script = self.__escape_pbs(bootstrap_script)
logger.debug(bootstrap_script)
-
-
+
+
# Define Agent Executable in Job description
# in Condor case bootstrap script is staged
# (Python app cannot be passed inline in Condor job description)
@@ -385,11 +356,7 @@ def start_pilot_job(self,
logger.debug("Condor file transfers: " + str(bj_file_transfers))
jd.file_transfer = bj_file_transfers
else:
- if is_bliss:
- jd.total_cpu_count=int(number_nodes)
- else:
- jd.number_of_processes=str(number_nodes)
- jd.processes_per_host=str(processes_per_node)
+ jd.total_cpu_count=int(number_nodes)
jd.spmd_variation = "single"
if pilot_compute_description!=None and pilot_compute_description.has_key("spmd_variation"):
jd.spmd_variation=pilot_compute_description["spmd_variation"]
@@ -405,8 +372,15 @@ def start_pilot_job(self,
# Create and submit pilot job to job service
logger.debug("Creating pilot job with description: %s" % str(jd))
self.job = self.js.create_job(jd)
- logger.debug("Submit pilot job to: " + str(lrms_saga_url))
+ logger.debug("Trying to submit pilot job to: " + str(lrms_saga_url))
self.job.run()
+
+ if self.job.state == saga.job.FAILED:
+ logger.debug("SUBMISSION FAILED. Exiting... ")
+ sys.exit(-1)
+ else:
+ logger.debug("Submission succeeded. Job ID: %s " % self.job.id)
+
return self.pilot_url
@@ -487,14 +461,24 @@ def cancel(self):
""" duck typing for cancel of saga.cpr.job and saga.job.job """
logger.debug("Cancel Pilot Job")
try:
- if self.url.scheme.startswith("condor")==False:
- self.job.cancel()
- else:
- pass
- #logger.debug("Output files are being transfered to file: outpt.tar.gz. Please wait until transfer is complete.")
+ self.job.cancel()
+ except:
+ pass
+ #traceback.print_stack()
+
+ logger.debug("Cancel Job Service")
+ try:
+ if not self._ocache.rem_obj (self.js) :
+ logger.debug("Cancel Job Service Manually")
+ del (self.js)
+ else :
+ logger.debug("Cancel Job Service done")
+
+ self.js = None
except:
pass
#traceback.print_stack()
+
try:
self._stop_pilot_job()
logger.debug("delete pilot job: " + str(self.pilot_url))
@@ -561,7 +545,7 @@ def _add_subjob(self, queue_url, jd, job_url, job_id):
logger.debug("create dictionary for job description. Job-URL: " + job_url)
# put job description attributes to Coordination Service
job_dict = {}
- # to accomendate current bug in bliss (Number of processes is not returned from list attributes)
+ # to accomendate current bug in saga (Number of processes is not returned from list attributes)
job_dict["NumberOfProcesses"] = "1"
attributes = jd.list_attributes()
logger.debug("SJ Attributes: " + str(jd))
@@ -639,8 +623,11 @@ def __generate_bootstrap_script(self, coordination_host, coordination_namespace,
try: import bigjob.bigjob_agent
except:
print "BigJob not installed. Attempt to install it.";
- opener = urllib.FancyURLopener({});
- opener.retrieve(BOOTSTRAP_URL, BOOTSTRAP_FILE);
+ try:
+ opener = urllib.FancyURLopener({});
+ opener.retrieve(BOOTSTRAP_URL, BOOTSTRAP_FILE);
+ except Exception, ex:
+ print "Unable to download bootstrap script: " + str(ex) + ". Please install BigJob manually."
print "Execute: " + "python " + BOOTSTRAP_FILE + " " + BIGJOB_PYTHON_DIR
os.system("/usr/bin/env")
try:
@@ -653,7 +640,10 @@ def __generate_bootstrap_script(self, coordination_host, coordination_namespace,
activate_this = os.path.join(BIGJOB_PYTHON_DIR, "bin/activate_this.py");
execfile(activate_this, dict(__file__=activate_this))
#try to import BJ once again
-import bigjob.bigjob_agent
+try:
+ import bigjob.bigjob_agent
+except Exception, ex:
+ print "Unable install BigJob: " + str(ex) + ". Please install BigJob manually."
# execute bj agent
args = list()
args.append("bigjob_agent.py")
@@ -737,8 +727,8 @@ def __escape_ssh(self, bootstrap_script):
bootstrap_script = "\"" + bootstrap_script+ "\""
return bootstrap_script
- def __escape_bliss(self, bootstrap_script):
- logger.debug("Escape Bliss")
+ def __escape_saga(self, bootstrap_script):
+ logger.debug("Escape SAGA")
#bootstrap_script = bootstrap_script.replace("\'", "\"")
#bootstrap_script = "\'" + bootstrap_script+ "\'"
bootstrap_script = bootstrap_script.replace('"','\\"')
@@ -889,7 +879,7 @@ def __initialize_pilot_data(self, service_url):
# initialize file adaptor
# Pilot Data API for File Management
if service_url.startswith("ssh:"):
- logger.debug("Use SSH backend")
+ logger.debug("Use SSH backend for PilotData")
try:
from pilot.filemanagement.ssh_adaptor import SSHFileAdaptor
self.__filemanager = SSHFileAdaptor(service_url)
@@ -1000,11 +990,11 @@ def __print_traceback(self):
def __repr__(self):
return self.pilot_url
- def __del__(self):
- """ BJ is not cancelled when object terminates
- Application can reconnect to BJ via pilot url later on"""
- pass
- #self.cancel()
+ # def __del__(self):
+ # """ BJ is not cancelled when object terminates
+ # Application can reconnect to BJ via pilot url later on"""
+ # pass
+ # #self.cancel()
@@ -1153,17 +1143,6 @@ def __parse_subjob_url(self, subjob_url):
## Properties for description class
#
-def environment():
- doc = "The environment variables to set in the job's execution context."
- def fget(self):
- return self._environment
- def fset(self, val):
- self._environment = val
- def fdel(self, val):
- self._environment = None
- return locals()
-
-
def input_data():
doc = "List of input data units."
def fget(self):
@@ -1185,23 +1164,44 @@ def fdel(self, val):
return locals()
-class description(bliss.saga.job.Description):
+class description(SAGAJobDescription):
""" Sub-job description """
- environment = property(**environment())
- input_data = property(**input_data())
- output_data = property(**output_data())
-
+ ##input_data = property(**input_data())
+ ##output_data = property(**output_data())
+ ##environment = {}
+ # --------------------------------------------------------------------------
+ #
def __init__(self):
- bliss.saga.job.Description.__init__(self)
+ saga.job.Description.__init__(self)
#self.attributes_extensible_ (True)
-
+
# Extend description class by Pilot-Data relevant attributes
self._output_data = None
self._input_data = None
-
- self._register_rw_vec_attribute(name="InputData",
- accessor=self.__class__.input_data)
- self._register_rw_vec_attribute(name="OutputData",
- accessor=self.__class__.output_data)
-
+
+ import saga.attributes as sa
+
+ self._attributes_extensible (True)
+ self._attributes_camelcasing (True)
+
+
+ self._attributes_register ("InputData", None, sa.ANY, sa.VECTOR, sa.WRITEABLE)
+ self._attributes_register ("OutputData", None, sa.ANY, sa.VECTOR, sa.WRITEABLE)
+
+ self._attributes_set_getter ("InputData", self._get_input_data )
+ self._attributes_set_getter ("OutputData", self._get_output_data)
+
+ # --------------------------------------------------------------------------
+ #
+ def _get_input_data (self) :
+ print "get caled. returning: %s" % self.input_data
+ return self.input_data
+
+ # --------------------------------------------------------------------------
+ #
+ def _get_output_data (self) :
+ return self.output_data
+
+
+
18 bigjob/job_plugin/ec2ssh.py
@@ -10,8 +10,7 @@
from boto.ec2.regioninfo import RegionInfo
import boto.ec2
-import bliss.saga as saga
-import bliss
+import saga
###############################################################################
# EC2 General
@@ -168,15 +167,18 @@ def run(self):
self.network_ip = self.instance.ip_address
url = "ssh://" + str(self.network_ip)
logger.debug("Connect to: %s"%(url))
- js = saga.job.Service(url)
+
# Submit job
- ctx = saga.Context()
- ctx.type = saga.Context.SSH
- ctx.userid = self.pilot_compute_description["vm_ssh_username"]
- ctx.userkey = self.pilot_compute_description["vm_ssh_keyfile"]
- js.session.contexts = [ctx]
+ ctx = saga.Context("SSH")
+ #ctx.type = saga.Context.SSH
+ ctx.user_id = self.pilot_compute_description["vm_ssh_username"]
+ ctx.user_key = self.pilot_compute_description["vm_ssh_keyfile"]
+
+ session = saga.Session()
+ session.add_context(ctx)
+ js = saga.job.Service(url, session=session)
logger.debug("Job Description Type: " + str(type(self.job_description)))
job = js.create_job(self.job_description)
18 bigjob/job_plugin/gcessh.py
@@ -10,7 +10,7 @@
import uuid
import time
-import bliss.saga as saga
+import saga
"""
AN OAUTH2 Client Id must be created at the Google API console at:
@@ -165,15 +165,19 @@ def run(self):
self.network_ip = compute_instance_details["networkInterfaces"][0]["accessConfigs"][0]['natIP']
url = "ssh://" + str(self.network_ip)
logger.debug("Connect to: %s"%(url))
- js = saga.job.Service(url)
+
# Submit job
- ctx = saga.Context()
- ctx.type = saga.Context.SSH
- ctx.userid = self.pilot_compute_description["vm_ssh_username"]
- ctx.userkey = self.pilot_compute_description["vm_ssh_keyfile"]
- js.session.contexts = [ctx]
+ ctx = saga.Context("SSH")
+ #ctx.type = saga.Context.SSH
+ ctx.user_id = self.pilot_compute_description["vm_ssh_username"]
+ ctx.user_key = self.pilot_compute_description["vm_ssh_keyfile"]
+ #js.session.contexts = [ctx]
+
+ session = saga.Session()
+ session.add_context(ctx)
+ js = saga.job.Service(url, session=session)
job = js.create_job(self.job_description)
print "Submit pilot job to: " + str(url)
16 bigjob/job_plugin/slurmssh.py
@@ -8,13 +8,7 @@
from bigjob import logger
import bigjob
-try:
- import bliss.saga as saga
-except:
- logger.warn("slurm+ssh://<hostname> plugin not compatible with SAGA Bliss. Use slurm+ssh://<hostname>")
-
-
-
+import saga
class Service(object):
""" Plugin for SlURM """
@@ -105,8 +99,8 @@ def run(self):
jd.arguments = ["-c", self.bootstrap_script]
jd.executable = "python"
jd.working_directory = self.working_directory
- jd.output = "bliss_job_submission.out"
- jd.error = "bliss_job_submission.err"
+ jd.output = "saga_job_submission.out"
+ jd.error = "saga_job_submission.err"
# Submit job
js = None
js = saga.job.Service(self.resource_url)
@@ -121,11 +115,11 @@ def run(self):
if saga_surl.username!=None and saga_surl.username!="":
sftp_url = sftp_url + str(saga_surl.username) + "@"
sftp_url = sftp_url + saga_surl.host + "/"
- outfile = sftp_url + self.working_directory+'/bliss_job_submission.out'
+ outfile = sftp_url + self.working_directory+'/saga_job_submission.out'
logger.debug("BigJob/SLURM: get outfile: " + outfile)
out = saga.filesystem.File(outfile)
out.copy("sftp://localhost/"+os.getcwd() + "/tmpout")
- errfile = sftp_url + self.working_directory+'/bliss_job_submission.err'
+ errfile = sftp_url + self.working_directory+'/saga_job_submission.err'
err = saga.filesystem.File(errfile)
err.copy("sftp://localhost/"+os.getcwd() + "/tmperr")
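The sftp URL assembly in the hunk above (optional `username@` prefix, host, then the remote working directory and file name) can be factored into a small helper. `build_sftp_url` is a hypothetical name for illustration, not code from the plugin; it reproduces the concatenation exactly as written, including the double slash when the working directory is absolute.

```python
def build_sftp_url(host, working_directory, filename, username=None):
    """Build an sftp:// URL the way the slurmssh plugin concatenates it."""
    url = "sftp://"
    if username:  # mirrors: saga_surl.username != None and != ""
        url += username + "@"
    url += host + "/"
    return url + working_directory + "/" + filename

outfile = build_sftp_url("stampede.tacc.utexas.edu", "/tmp/agent",
                         "saga_job_submission.out", username="johndoe")
# -> "sftp://johndoe@stampede.tacc.utexas.edu//tmp/agent/saga_job_submission.out"
```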
3  bigjob_agent.conf
@@ -5,6 +5,9 @@ cpr=False
shell = /bin/bash
mpirun = mpirun
+# control multi-threaded compute unit execution
+number_executor_threads=3
+
# Launch Method
# Default launch method is ssh
# Future support for aprun (e.g. for Kraken)
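An agent could read the new `number_executor_threads` option with the standard library config parser and fall back to a single thread when the option is absent. This is a minimal sketch assuming standard INI syntax; the `[bigjob]` section name here is a placeholder, and the real `bigjob_agent.conf` layout may differ.

```python
import configparser

# Inline sample standing in for bigjob_agent.conf
SAMPLE = """
[bigjob]
shell = /bin/bash
mpirun = mpirun
# control multi-threaded compute unit execution
number_executor_threads = 3
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE)

# Fall back to one executor thread if the option is not configured
n_threads = parser.getint("bigjob", "number_executor_threads", fallback=1)
```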
6 coordination/bigjob_coordination_redis.py
@@ -78,9 +78,9 @@ def __init__(self, server=REDIS_SERVER, server_port=REDIS_SERVER_PORT, server_co
self.pipe = self.redis_client.pipeline()
try:
self.redis_client.ping()
- except:
- logger.error("Please start Redis server!")
- raise Exception("Please start Redis server!")
+ except Exception, ex:
+ logger.error("Cannot connect to Redis server: %s" % str(ex))
+ raise Exception("Cannot connect to Redis server: %s" % str(ex))
def get_address(self):
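The change above replaces a bare `except` and a generic "Please start Redis server!" message with one that preserves the root cause. The same wrap-and-re-raise pattern is shown below with a stub client so it runs without redis-py or a live server; the function and class names are illustrative, and the real code calls `ping()` on a redis-py client.

```python
class StubRedisClient(object):
    """Stand-in for a redis-py client whose server is unreachable."""
    def ping(self):
        raise RuntimeError("Connection refused")

def check_redis(client, server="localhost", port=6379):
    try:
        client.ping()
    except Exception as ex:
        # Keep the original error in the raised exception so the log and
        # traceback say *why* the connection failed, not just that it did.
        raise Exception("Cannot connect to Redis server %s:%s: %s"
                        % (server, port, ex))

try:
    check_redis(StubRedisClient())
except Exception as ex:
    message = str(ex)
```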
BIN  docs/build/doctrees/_themes/armstrong/README.doctree
BIN  docs/build/doctrees/architecture/index.doctree
BIN  docs/build/doctrees/environment.pickle
BIN  docs/build/doctrees/index.doctree
BIN  docs/build/doctrees/install/config.doctree
BIN  docs/build/doctrees/install/index.doctree
BIN  docs/build/doctrees/install/install.doctree
BIN  docs/build/doctrees/install/redis.doctree
BIN  docs/build/doctrees/install/trouble.doctree
BIN  docs/build/doctrees/install/xsede.doctree
BIN  docs/build/doctrees/intro/index.doctree
BIN  docs/build/doctrees/library/index.doctree
BIN  docs/build/doctrees/patterns/chained.doctree
BIN  docs/build/doctrees/patterns/coupled.doctree
BIN  docs/build/doctrees/patterns/exsede.doctree
BIN  docs/build/doctrees/patterns/index.doctree
BIN  docs/build/doctrees/patterns/pdata.doctree
BIN  docs/build/doctrees/patterns/simple.doctree
BIN  docs/build/doctrees/tutorial/index.doctree
BIN  docs/build/doctrees/tutorial/part1.doctree
BIN  docs/build/doctrees/tutorial/part2.doctree
BIN  docs/build/doctrees/tutorial/part3.doctree
BIN  docs/build/doctrees/tutorial/part4.doctree
BIN  docs/build/doctrees/tutorial/part5.doctree
BIN  docs/build/doctrees/tutorial/part6.doctree
BIN  docs/build/doctrees/usage/appwriting.doctree
BIN  docs/build/doctrees/usage/cmdtools.doctree
BIN  docs/build/doctrees/usage/index.doctree
BIN  docs/build/doctrees/usage/logging.doctree
BIN  docs/build/doctrees/usage/output.doctree
BIN  docs/build/doctrees/usage/pilotdata.doctree
Binary files not shown
4 docs/build/html/.buildinfo
@@ -1,4 +0,0 @@
-# Sphinx build info version 1
-# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: b01a59cbadba2a2e69b70132a1b242d3
-tags: fbb0d17656682115ca4d033fb2f83ba1
BIN  docs/build/html/_images/bigjob-architecture.png
Deleted file not rendered
70 docs/build/html/_sources/_themes/armstrong/README.txt
@@ -1,70 +0,0 @@
-Armstrong Sphinx Theme
-======================
-Sphinx theme for Armstrong documentation
-
-
-Usage
------
-Symlink this repository into your documentation at ``docs/_themes/armstrong``
-then add the following two settings to your Sphinx ``conf.py`` file::
-
- html_theme = "armstrong"
- html_theme_path = ["_themes", ]
-
-You can also change colors and such by adjusting the ``html_theme_options``
-dictionary. For a list of all settings, see ``theme.conf``.
-
-
-Defaults
---------
-This repository has been customized for Armstrong documentation, but you can
-use the original default color scheme on your project by copying the
-``rtd-theme.conf`` over the existing ``theme.conf``.
-
-
-Contributing
-------------
-
-* Create something awesome -- make the code better, add some functionality,
- whatever (this is the hardest part).
-* `Fork it`_
-* Create a topic branch to house your changes
-* Get all of your commits in the new topic branch
-* Submit a `pull request`_
-
-.. _Fork it: http://help.github.com/forking/
-.. _pull request: http://help.github.com/pull-requests/
-
-
-State of Project
-----------------
-Armstrong is an open-source news platform that is freely available to any
-organization. It is the result of a collaboration between the `Texas Tribune`_
-and `Bay Citizen`_, and a grant from the `John S. and James L. Knight
-Foundation`_. The first stable release is scheduled for September, 2011.
-
-To follow development, be sure to join the `Google Group`_.
-
-``armstrong_sphinx`` is part of the `Armstrong`_ project. Unless you're
-looking for a Sphinx theme, you're probably looking for the main project.
-
-.. _Armstrong: http://www.armstrongcms.org/
-.. _Bay Citizen: http://www.baycitizen.org/
-.. _John S. and James L. Knight Foundation: http://www.knightfoundation.org/
-.. _Texas Tribune: http://www.texastribune.org/
-.. _Google Group: http://groups.google.com/group/armstrongcms
-
-
-Credit
-------
-This theme is based on the the excellent `Read the Docs`_ theme. The original
-can be found in the `readthedocs.org`_ repository on GitHub.
-
-.. _Read the Docs: http://readthedocs.org/
-.. _readthedocs.org: https://github.com/rtfd/readthedocs.org
-
-
-License
--------
-Like the original RTD code, this code is licensed under a BSD. See the
-associated ``LICENSE`` file for more information.
3  docs/build/html/_sources/architecture/index.txt
@@ -1,3 +0,0 @@
-##############
-BigJob Architecture
-##############
37 docs/build/html/_sources/index.txt
@@ -1,37 +0,0 @@
-.. BigJob documentation master file, created by
- sphinx-quickstart on Mon Dec 3 21:55:42 2012.
- You can adapt this file completely to your liking, but it should at least
- contain the root `toctree` directive.
-
-##############################
-BigJob |version| User Manual
-##############################
-
-BigJob is a light-weight Python package that provides a Pilot-based Job and Data Management system. BigJob aims to be as flexible and extensible as possible - it installs where you want it (requiring no root access to the target machine). Unlike many other Pilot-Job systems, BigJob natively supports MPI jobs and, because of its integration with saga-python_, works on a variety of backend systems (such as SGE, PBS, SLURM, etc.). BigJob has been shown to work on grids, clouds, and clusters, as well as locally on your personal computer.
-
-More information can be found at the BigJob_ website.
-
-.. _BigJob: https://github.com/saga-project/BigJob/
-.. _saga-python: ttps://github.com/saga-project/saga-python/
-
-
-Contents
---------
-
-.. toctree::
- :numbered:
- :maxdepth: 2
-
- intro/index.rst
- install/index.rst
- usage/index.rst
- patterns/index.rst
- library/index.rst
- tutorial/index.rst
-
-Indices and tables
-==================
-
-* :ref:`genindex`
-* :ref:`search`
-
54 docs/build/html/_sources/install/config.txt
@@ -1,54 +0,0 @@
-#############
-Configuration
-#############
-
-.. note::
-
- SAGA has been designed as a zero-configuration library. Unless you are
- experiencing problems with one of the default configuration settings, there's
- really no need to create a configuration file for SAGA.
-
-SAGA and its individual middleware adaptors provide various optional
-:ref:`conf_options`. While SAGA tries to provide sensible default values for
-the majority of these options (zero-conf), it can sometimes be necessary to
-modify or extend SAGA's configuration. SAGA provides two ways to access and
-modify its configuration: via :ref:`conf_file` (recommended) and via the
-:ref:`conf_api` (for advanced use-cases).
-
-.. _conf_file:
-
-Configuration Files
--------------------
-
-If you need to make persistent changes to any of SAGA's :ref:`conf_options`, the
-simplest option is to create a configuration file. During startup, SAGA checks
-in two different locations for the existence of a configuration file:
-
-- ``/etc/saga.conf`` - for a system-wide configuration
-- ``$HOME/.saga.conf`` - for a user-specific configuration (Note that it start with a '.')
-
-If a configuration file is found, it is parsed by SAGA's configuration system.
-If files are present in both locations, SAGA will try to merge both, with the
-user-level configuration (``$HOME/.saga.conf``) always having precedence over
-the system-wide configuration (``$HOME/.saga.conf``). SAGA configuration files
-use a structure that looks like this::
-
- [saga.core]
- option = value
-
- [saga.logging]
- option = value
-
- [saga.adaptor.name]
- option = value
-
-
-.. _conf_options:
-
-Configuration Options
----------------------
-
-.. warning:: This should be generated automatically!
-
-
-
17 docs/build/html/_sources/install/index.txt
@@ -1,17 +0,0 @@
-.. _using-index:
-
-############
-Installation
-############
-
-
-This part of the documentation is devoted to general information on the setup
-and configuration of BigJob and things that make working with BigJob easier.
-
-
-.. toctree::
-
- install.rst
- redis.rst
- xsede.rst
- trouble.rst
154 docs/build/html/_sources/install/install.txt
@@ -1,154 +0,0 @@
-
-#################
-Installing BigJob
-#################
-
-=================
-Environment Setup
-=================
-
-This section will explain how to set up your environment and install BigJob.
-
------------------
-Prerequisites
------------------
-* Python 2.6 or higher. Python 2.7.x is recommended.
-* Redis Server
-* SAGA-Python Installation (automatically installed when installing BigJob following this guide)
-
------------------
-Bootstrap your Local Python Environment
------------------
-
-Assuming you don't want to mess with your system Python installation, you need a place where you can install BigJob locally. A small tool called `virtualenv <http://pypi.python.org/pypi/virtualenv/>`_ allows you to create a local Python software repository that behaves exactly like the global Python repository, with the only difference that you have write access to it. This is referred to as a 'virtual environment.'
-
-To create your local Python environment run the following command (you can install virtualenv on most systems via apt-get or yum, etc.)::
-
- virtualenv $HOME/.bigjob/python
-
-If you don't have virtualenv installed and you don't have root access to your machine, you can use the following script instead::
-
- curl --insecure -s https://raw.github.com/pypa/virtualenv/master/virtualenv.py | python - $HOME/.bigjob/python
-
------------------
-Activate your Local Python Environment
------------------
-
-You need to *activate* your Python environment in order to make it work. Run the command below. It will temporarily modify your :code:`PYTHONPATH` so that it points to :code:`$HOME/.bigjob/python/lib/python2.7/site-packages/` instead of the the system site-package directory::
-
- source $HOME/.bigjob/python/bin/activate
-
-Activating the virtualenv is *very* important. If you don't activate your virtual Python environment, the rest of this tutorial **will not work.** You can usually tell that your environment is activated properly if your bash command-line prompt starts with :code:`(python)`.
-
-The last step in this process is to add your newly created virtualenv to your :code:`.bashrc` so that any batch jobs that you submit have the same Python environment as you have on your submitting resource. Add the following line at the end of your :code:`$HOME/.bashrc` file::
-
- source $HOME/.bigjob/python/bin/activate
-
-=================
-Install BigJob
-=================
-
-After your virtual environment is active, you are ready to install BigJob. BigJob is available via PyPi and can be installed using easy_install as follows::
-
- easy_install BigJob
-
-You can change the default installation directory by calling::
-
- easy_install --prefix=<target-dir> BigJob
-
-To make sure that your installation works, run the following command to check if the BigJob module can be imported by the python interpreter::
-
- python -c "import pilot; print pilot.version"
-
-=================
-Execution Setup
-=================
-
-There are two requirements for proper BigJob execution:
-
-#. Agent Directory
-#. SSH Password-Less Login
-
------------------
-Create your Agent Directory
------------------
-
-BigJob needs a working directory in which to store all of its output, run information, and any errors that may occur. This directory can be named anything you choose, but for any examples in this manual, we will call the directory 'agent' (default). You should create this directory in the same location you run your scripts from, i.e. usually :code:`$SCRATCH` or :code:`$WORK`. You can create this directory by typing::
-
- mkdir agent
-
------------------
-SSH Password-Less Login
------------------
-
-If you are planning to submit from one resource to another, you must have SSH password-less login enabled to the submitting resource. This is achieved by placing your public key on one resource in the authorized_keys file on the target machine.
-
-Examples of when you would need password-less login:
-
-#. You want to submit from your local machine to an XSEDE resource
-#. You want to submit from one XSEDE resource to another
-#. You want to submit from your local cluster to external clusters, etc. etc.
-
-^^^^^^^^^^^^^^^^^
-Prerequisites
-^^^^^^^^^^^^^^^^^
-
-* :code:`openssh-server` (if you're running on your own system)
-* If running on XSEDE or FutureGrid systems, you do not have prerequisites.
-
-^^^^^^^^^^^^^^^^^
-Key Generation and Installation
-^^^^^^^^^^^^^^^^^
-
-1. **Generate Public/Private Key Pair**
-
-First, you have to generate a key. You do this as follows:
-
-* Step 1: Use the command :code:`ssh-keygen -t rsa -C <your-e-mail>` to generate the key.
-* Step 2: Specify the KeyPair location and name. We recommend that you use the default location if you do not yet have another key there, e.g. :code:`/home/username/.ssh/id_rsa`
-* Step 3: Type user defined passphrase when asking passphrase for the key.
-
-Example::
-
- ssh-keygen -t rsa -C johndoe@email.edu
-
- Generating public/private rsa key pair.
- Enter file in which to save the key (/home/johndoe/.ssh/id_rsa):
- Enter passphrase (empty for no passphrase):
- Enter same passphrase again:
- Your identification has been saved in /home/johndoe/.ssh/id_rsa.
- Your public key has been saved in /home/johndoe/.ssh/id_rsa.pub.
- The key fingerprint is: 34:87:67:ea:c2:49:ee:c2:81:d2:10:84:b1:3e:05:59 johndoe@email.edu
-
-2. **List the Result**
-
-You can find your key under the key location. As we used the .ssh directory, it will be located there.::
-
- cd /home/username/.ssh
- ls
-
-Verify that you have created the files :code:`id_rsa` and :code:`id_rsa.pub`.
-
-3. **Capture the Public Key for Target Machine**
-
-Use a text editor to open the :code:`id_rsa.pub` file. Copy the **entire** contents of this file.
-
-The contents of this file needs to be appended to the target machine's :code:`.ssh/authorized_keys` file. If the authorized_keys file is not accessible, then just create a :code:`.ssh/authorized_keys2` file and paste the key.
-
-Now the target machine is ready to accept your ssh key.
-
-4. **Test your Key Installation**
-
-The ssh-add command tells the machine which keys to use. For a test, type::
-
- ssh-agent sh -c 'ssh-add < /dev/null && bash'
-
-This will start the ssh-agent, add your default identity (prompting you for your passphrase), and spawn a bash shell.
-
-From this new shell, you should be able to :code:`ssh target_machine`. This should let you in without typing a password or passphrase.
-
-Test whether you have a password-less login to the target machine by executing the simple command::
-
- ssh <hostname> /bin/date
-
-This command should execute without password input.
28 docs/build/html/_sources/install/redis.txt
@@ -1,28 +0,0 @@
-
-#########################
-Setting Up a Redis Server
-#########################
-
-BigJob uses a Redis server for coordination and task management. Redis is the most stable and fastest backend (requires Python >2.5) and the recommended way of using BigJob. BigJob will **not** work without a coordination backend.
-
-Redis can easily be run in user space. For additional information about redis, please visit the website, `redis.io<http://www.redis.io>`_. To install your own redis server, please take the following steps::
-
- wget http://download.redis.io/redis-stable.tar.gz
- tar xvzf redis-stable.tar.gz
- cd redis-stable
- make
-
-Once you have downloaded and installed it, start a Redis server on the machine of your choice as follows::
-
- $ cd redis-stable
- $ ./src/redis-server
- [489] 13 Sep 10:11:28 # Warning: no config file specified, using the default config. In order to specify a config file use 'redis-server /path/to/redis.conf'
- [489] 13 Sep 10:11:28 * Server started, Redis version 2.2.12
- [489] 13 Sep 10:11:28 * The server is now ready to accept connections on port 6379
- [489] 13 Sep 10:11:28 - 0 clients connected (0 slaves), 922160 bytes in use
-
-You can install redis on a persistent server and use this server as your dedicated coordination server.
-
-
-
-
72 docs/build/html/_sources/install/trouble.txt
@@ -1,72 +0,0 @@
-###############
-Troubleshooting
-###############
-
-Having trouble with your BigJob installation? We're here to help! Below is a list of some common installation problems. If your problem persists, you can always message us at `bigjob-users@googlegroups.com <bigjob-users@googlegroups.com>`_.
-
-If you are encountering errors that aren't listed below, set the environment variable :code:`$BIGJOB_VERBOSE=100` in your :code:`.bashrc`.
-
-======================
-Common Error Messages
-======================
-
-1. The most common problems we encounter are with incorrect python version.
-
-In these cases, :code:`import pilot` may return::
-
- Traceback (most recent call last):
- File "<string>", line 1, in <module>
- ImportError: No module named pilot
-
-Using a virtualenv will modify your Python path, but you can verify that you are using the correct Python in two ways. From command line::
-
- which python
-
-should return the installation directory where you installed BigJob (i.e. $HOME/.bigjob/python/...).
-
-On remote resources such as XSEDE, before installing your virtualenv, you must be using Python 2.7.x. Some of these resources use Python 2.4 or Python 2.6 by default. You can use :code:`module load python` to upgrade to Python 2.7.x.
-
-Verify that your python version is correct at the destination by trying::
-
- ssh <name-of.remote.resource> "python -V"
-
-If this does not give the correct python version, check your :code:`.bashrc` at the destination to verify that you source your virtual environment.
-
-2. My stdout file doesn't contain the output of /bin/date but "ssh: connect to host localhost port 22: Connection refused"
-
-BigJob utilizes ssh for the execution of sub-jobs. Please ensure that your local SSH daemon is up and running and that you can login without password.
-
-==========================
-Frequently Asked Questions
-==========================
-
-**Q: How can I update my existing BigJob package?**::
-
- easy_install -U bigjob
-
-**Q: How do I execute and reconnect to long-running sessions of BigJob in a Unix terminal?**
-
-The UNIX :code:`screen` tool can / should be used to re-connect to a running BigJob session on a remote machine. For documentation on screen, please see `Screen Manpage <http://www.slac.stanford.edu/comp/unix/package/epics/extensions/iocConsole/screen.1.html>`_.
-
-You should not just submit a BigJob from your local machine to a remote host and then close the terminal without the use of screen.
-
-**Q: Can I reconnect to a current running BigJob?**
-
-Yes, if your BigJob manager (or application) terminates before all ComputeUnits terminate, you can reconnect to a running pilot by providing a :code:`pilot_url` to the PilotCompute constructor. For example::
-
-
- pilot = PilotCompute(pilot_url="redis://localhost:6379/bigjob:bj-a7bfae68-25a0-11e2-bd6c-705681b3df0f:localhost")
-
-**Q: Why is BigJob downloading an installation package?**
-
-BigJob attempts to install itself, if it can't find a valid BJ installation on a resource (i.e. if :code:`import pilot` fails). By default BigJob searches for :code:`$HOME/.bigjob/python` for a working BJ installation. Please, make sure that the correct Python is found in your default paths. If BJ attempts to install itself despite being already installed on a resource, this can be a sign that the wrong Python is found.
-
-
-
-
-
-
-
-
-
-
128 docs/build/html/_sources/install/xsede.txt
@@ -1,128 +0,0 @@
-###################
-XSEDE Specific Tips
-###################
-
-This page provides both general and specific tips for running on XSEDE infrastructure. General information is provided first, and then tips are listed by machine name (i.e. Lonestar, Kraken, Trestles, Stampede etc). If you are interested in running on a specific machine, please scroll down until you see the machine name.
-
-If you do not see a particular machine name, BigJob may run on this machine but not be supported yet in the documentation. Please feel free to email :code:`bigjob-users@googlegroups.com` to request machine information to be added.
-
-===================
-General
-===================
-
-------------------
-Where to Run
-------------------
-
-In general, on XSEDE machines, production-grade science should be done in either the :code:`$SCRATCH` or `$WORK` directories on the machine. This means you will run your BigJob script and make your BigJob :code:`agent` directory in either $SCRATCH or $WORK and **not** in $HOME.
-
-------------------------------
-Adding your Project Allocation
-------------------------------
-
-When creating BigJob scripts for XSEDE machines, it is necessary to add the :code:`project` field to the :code:`pilot_compute_description`. ::
-
- "project": "TG-XXXXXXXXX"
-
-TG-XXXXX must be replaced with your individual allocation SU number as provided to you by XSEDE.
-
-===================
-Stampede
-===================
-
-----------------------
-service_url
-----------------------
-
-Stampede uses the SLURM batch queuing system. When editing your scripts, the :code:`service_url` should be set to :code:`slurm+ssh://login1.stampede.tacc.utexas.edu`.
-
-
-===================
-Lonestar
-===================
-
-Installation of a virtual environment on Lonestar requires the use of a higher python version than the default. In order to load Python 2.7.x before installing the virtual environment, please execute::
-
- module load python
-
-Then you can proceed with the Installation instructions, and make sure that you activate your virtual environment in your :code:`.bashrc` before you try to run BigJob.
-
-You will need to put the following two lines in both your :code:`.bashrc` and your :code:`.bash_profile` in order to run on Ranger. This is due to the fact that interactive shells source a different file than regular shells. ::
-
- module load python
- source $HOME/bigjob/.python/bin/activate
-
-----------------------
-service_url
-----------------------
-
-Lonestar uses the Sun Grid Engine (SGE) batch queuing system. When editing your scripts, the :code:`service_url` should be set to :code:`sge://localhost` for running locally on Lonestar or :code:`sge+ssh://lonestar.tacc.utexas.edu` for running remotely.
-
-----------------------
-queues
-----------------------
-
-Commonly used queues on Lonestar to run BigJob:
-
-+------------+------------+-----------+------------------+
-| Queue Name | Max Runtime| Max Procs | Purpose |
-+============+============+===========+==================+
-| normal | 24 hrs | 4104 | normal priority |
-+------------+------------+-----------+------------------+
-| development| 1 hr | 264 | development |
-+------------+------------+-----------+------------------+
-| largemem | 24 hrs | 48 | large memory jobs|
-+------------+------------+-----------+------------------+
-
-A complete list of Lonestar queues can be found `here <http://www.tacc.utexas.edu/user-services/user-guides/lonestar-user-guide>`_.
-
-===================
-Kraken
-===================
-
-------------------------------
-Load Proper Python Environment
-------------------------------
-
-Before installing your virtual environment, you must do a :code:`module load python` on Kraken to ensure you're using Python 2.7.x instead of the system-level Python.
-
-------------------------------
-Using Lustre Scratch
-------------------------------
-
-Prior to running code on Kraken, you will need to make a directory called :code:`agent` in the same location that you are running your scripts from. The BigJob agent relies on :code:`aprun` to execute subjobs. :code:`aprun` works only if the working directory of the Pilot and Compute Units is set to the scratch space of Kraken.
-
-Create your agent directory in :code:`/lustre/scratch/<username>` by typing::
-
- cd /lustre/scratch/<username>
- mkdir agent
-
-Replace :code:`<username>` with your Kraken username.
-
-------------------------------
-Activate your Credentials
-------------------------------
-
-To submit jobs to Kraken from another resource using gsissh, the use of myproxy is required. To start a my proxy server, execute the following command::
-
- myproxy-logon -T -t <number of hours> -l <your username>
-
-You need to use your XSEDE portal username and password. To verify that your my proxy server is running, type :code:`grid-proxy-info`.
-
-If it was successful, you should see a valid proxy running.
-
-----------------------
-service_url
-----------------------
-
-Kraken is a Cray machine with a special Torque queuing system. It requires the use of GSISSH (Globus certificates required). Initiate a grid proxy (using :code:`myproxy-logon`) before executing the BigJob application. When editing your scripts, the :code:`service_url` should be set to :code:`xt5torque+gsissh://gsissh.kraken.nics.xsede.org`.
-
-===================
-Trestles
-===================
-
-----------------------
-service_url
-----------------------
-
-Trestles uses the Torque queuing system. When editing your scripts, the :code:`service_url` should be set to :code:`pbs+ssh://trestles.sdsc.edu`.
74 docs/build/html/_sources/intro/index.txt
@@ -1,74 +0,0 @@
-############
-Introduction
-############
-
-BigJob is a Pilot-Job framework built on top of `The Simple API for Grid Applications (SAGA) <http://saga-project.github.com>`_, a high-level, easy-to-use API for accessing distributed resources. BigJob supports a wide range of application types and is usable over a broad range of infrastructures, i.e., it is general-purpose, extensible, and interoperable. It is written in the python programming language.
-
-===========================
-Introduction to Pilot-Jobs
-===========================
-
-Pilot-Jobs support the decoupling of workload submission from resource assignment. This results in a flexible execution model, which in turn enables the distributed scale-out of applications on multiple and possibly heterogeneous resources. Pilot-Jobs support the use of container jobs with sophisticated workflow management to coordinate the launch and interaction of actual computational tasks within the container. It allows the execution of jobs without the necessity to queue each individual job.
-
-============================
-Why do you need Pilot-Jobs?
-============================
-
-Production-grade distributed cyberinfrastructure almost always has a local resource manager installed, such as a batch queuing system. A distributed application often requires many jobs to produce useful output data; these jobs often have the same executable. A traditional way of submitting these jobs would be to submit an individual job for each executable. These jobs (often hundreds) sit in the batch queuing system and may not become active at the same time. Overall, time-to-completion can take many hours due to load and scheduling variations.
-
-A Pilot-Job provides an alternative approach. It can be thought of as a container job for many sub-jobs. A Pilot-Job acquires the resources necessary to execute the sub-jobs (thus, it asks for all of the resources required to run the sub-jobs, rather than just one sub-job). If a system has a batch queue, the Pilot-Job is submitted to this queue. Once it becomes active, it can run the sub-jobs directly, instead of having to wait for each sub-job to queue. This eliminates the need to submit a different job for every executable, and significantly reduces the time-to-completion.
-
-============================
-What makes BigJob different?
-============================
-
-Unlike other common Pilot-Job systems, SAGA BigJob:
-
-#. Natively supports MPI jobs
-#. Works on a variety of back-end systems
-
-===========================
-What can I use BigJob for?
-===========================
-
-* Parameter sweeps
-* Many instances of the same task (ensemble)
-* Chained tasks
-* Loosely coupled but distinct tasks
-* Dependent tasks
- * Tasks with Data Dependencies
- * Tasks with Compute Dependencies
-
-=================
-BigJob Overview
-=================
-
-BigJob is comprised of three major components: (1) The Pilot-Manager, (2) The Pilot-Agent, and (3) The distributed coordination service. In order to understand what each component is responsible for, we must first describe the break down of a distributed application.
-
-An application is comprised of compute units (the application kernel) and data units (i.e. input/output files or data). Using the Pilot-API, an application can create a Pilot (Pilot-Compute [aka: Pilot-Job] or Pilot-Data) in order to acquire resources (computational or storage, respectively). The Pilot-Compute is the entity that actually gets submitted and scheduled on a resource using the resource management system. Once the resources are acquired, the application can submit compute units and data units via the Pilot-Manager.
-
-The Pilot-Manager is responsible for the orchestration and scheduling of Pilots. It runs locally on the machine used to run the distributed application. For the submission of Pilots, BigJob relies on the SAGA Job API, and thus can be used in conjunction with different SAGA adaptors, e.g. the Globus, PBS, Condor, and Amazon Web Services adaptors. The Pilot-Manager ensures that tasks are launched onto the correct resource, based upon the specified jobID, with the correct number of processes.
-
-The Pilot-Manager then stores information into the distributed coordination service (usually a redis database). For each new job (or chunk of data), an entry is created in the database by the BigJob manager. This database can be located on any resource, including the localhost. It is used for communication between the Pilot-Manager and the Pilot-Agent.
-
-Once the Pilot-Compute is submitted to the batch queuing system of the remote resource and becomes active, the Pilot-Agent comes into play. The Pilot-Agent is responsible for gathering local information and for executing the actual tasks (compute units) on its local resource. It achieves this by periodically polling for new jobs. If a new job is found and resources are available, the job is dispatched, otherwise it is queued. If multiple resources (machines) are acquired, there will be multiple Pilot-Agents.
-
-The overall BigJob architecture is shown below. BigJob utilizes a Master-Worker coordination model.
-
-.. image:: ../images/bigjob-architecture.png
- :width: 500px
- :align: center
-
--------------------
-Supported Adaptors
--------------------
-
-* **fork** - Allows job execution and file handling on the local machine
-* **SSH** - Allows job execution on remote hosts via SSH
-* **GSISSH** - Allows job execution on remote hosts via GSISSH
-* **PBS(+SSH,+GSISSH)** - Provides local and remote access (SSH+GSISSH) to PBS/Torque clusters
-* **SGE(+SSH,+GSISSH)** - Provides local and remote access (SSH+GSISSH) to Sun (Oracle) Grid Engine clusters
-* **SLURM(+SSH)** - Provides local and remote access (SSH) to SLURM clusters
-* **GRAM** - Uses Globus to submit jobs. Globus certificates are required.
-* **Amazon EC2(+SSH)** - Start Virtual Machines and submit jobs to AWS clouds
-* **Eucalyptus(+SSH)** - Start Virtual Machines and submit jobs to Eucalyptus clouds
450 docs/build/html/_sources/library/index.txt
@@ -1,450 +0,0 @@
-#################
-Library Reference
-#################
-
-.. pilot/impl/pilot_manager.py defines:
-.. class ComputeDataService
-..
-.. pilot/impl/pilotdata_manager.py defines
-.. class PilotData
-.. class PilotDataService
-.. class DataUnit
-..
-.. pilot/impl/pilotcompute_manager.py defines
-.. class PilotCompute
-.. class PilotComputeService
-.. class ComputeUnit
-
-.. pilot/api/compute/api.py defines
-.. class PilotComputeDescription
-.. class State
-
-.. pilot/api/data/api.py defines
-.. class PilotDataDescription
-
-Compute and Data Services
-*************************
-
-This section is meant to provide a hierarchical overview of the various library components and their interaction. The subsections then provide the API details associated with each component.
-
-The main concepts and classes exposed by the Compute part of the API are:
-
-* **PilotCompute (PC):** a pilot job, which can execute some compute workload (ComputeUnit).
-* **PilotComputeDescription (PCD):** description for specifying the requirements of a PilotCompute.
-* **PilotComputeService (PCS):** a factory for creating PilotComputes.
-
-The data side of the Pilot API is symmetric to the compute side. The exposed classes for managing Pilot Data are:
-
-* **PilotData (PD):** a pilot that manages some data workload (DataUnit).
-* **PilotDataDescription (PDD):** an abstract description of the requirements of the PD.
-* **PilotDataService (PDS):** a factory (service) which can create PilotDatas according to some specification.
-
-The application workload is represented by so-called ComputeUnits and DataUnits:
-
-* **ComputeUnit (CU):** a work item executed on a PilotCompute.
-* **DataUnit (DU):** a data item managed by a PilotData.
-
-Both Compute and Data Units are specified using an abstract description object:
-
-* **ComputeUnitDescription (CUD):** abstract description of a ComputeUnit.
-* **DataUnitDescription (DUD):** abstract description of a DataUnit.
-
-The ComputeDataService represents the central entry point for the application workload:
-
-* **ComputeDataService (CDS):** a service which can map CUs and DUs to a set of PilotComputes and PilotData. The ComputeDataService (CDS) takes care of the placement of Compute and Data Units. The set of PilotComputes and PilotData available to the CDS can be changed during the application's runtime. The CDS handles different data-compute affinities and takes care of compute/data co-location for the requested data-compute workload.
-
-PilotComputeService
-===================
-
-The PilotComputeService (PCS) is a factory for creating Pilot-Compute objects, where the latter is the individual handle to the resource. The PCS takes the COORDINATION_URL (as defined above) as an argument. This is for coordination of the compute units with the redis database.
-
-.. autoclass:: pilot.impl.pilotcompute_manager.PilotComputeService
- :members:
-
-
-PilotComputeDescription
-=======================
-
-The PCD defines the compute resource on which you will be running and the attributes required for managing jobs on that resource. Recall that a Pilot-Job requests the resources required to run all of the jobs. Any number of PilotComputes can be instantiated, depending on the compute resources available to the application (using two machines rather than one requires two Pilot Compute Descriptions).
-
-An example of a Pilot Compute Description is shown below::
-
- pilot_compute_description = {
- "service_url": 'pbs+ssh://india.futuregrid.org',
- "number_of_processes": 8,
- "processes_per_node":8,
- "working_directory": "/N/u/<username>",
- 'affinity_datacenter_label': "us-east-indiana",
- 'affinity_machine_label': "india"
- }
-
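
Since the description is a plain dictionary, it can be sanity-checked before handing it to the service. The helper below is purely illustrative -- it is not part of the Pilot-API, and the choice of required keys is an assumption based on the examples in this guide:

```python
# Illustrative only (not part of the Pilot-API): sanity-check a
# pilot_compute_description dict before use. The required/consistency
# rules here are assumptions drawn from the examples in this guide.
def check_pcd(pcd):
    if "service_url" not in pcd:
        raise ValueError("a PilotComputeDescription needs a service_url")
    n = int(pcd.get("number_of_processes", 1))
    ppn = pcd.get("processes_per_node")
    if ppn is not None and n % int(ppn) != 0:
        raise ValueError("number_of_processes should be a multiple "
                         "of processes_per_node")
    return True

pilot_compute_description = {
    "service_url": "pbs+ssh://india.futuregrid.org",
    "number_of_processes": 8,
    "processes_per_node": 8,
    "working_directory": "/tmp",
}
print(check_pcd(pilot_compute_description))  # True
```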
-.. class:: PilotComputeDescription
-
-.. data:: affinity_datacenter_label
-
-The data center label used for affinity topology.
-
-:type: string
-
- .. note:: Data centers and machines are organized in a logical topology tree (similar to the tree spawned by a DNS topology). The further the distance between two resources, the smaller their affinity.
-
-.. data:: affinity_machine_label
-
-The machine (resource) label used for affinity topology.
-
-:type: string
-
- .. note:: Data centers and machines are organized in a logical topology tree (similar to the tree spawned by a DNS topology). The further the distance between two resources, the smaller their affinity.
-
-.. data:: output
-
-Controls the location of the Pilot-Agent standard output file.
-
-:type: string
-
-.. data:: error
-
-Controls the location of the Pilot-Agent standard error file.
-
-:type: string
-
-.. data:: number_of_processes
-
-The number of cores that need to be allocated to run the jobs.
-
-:type: string
-
-.. data:: processes_per_node
-
-The number of cores per node.
-
-This argument does not actually limit the number of processes that can run on a node but specifies the order in which processes are assigned to nodes. You can think of :code:`-ppn 2` as repeating each line of the :code:`$PBS_NODEFILE` 2 times.
-
-:type: string
-
- .. note:: This field is required by some XSEDE/Torque clusters. If you have to specify ppn when running an MPI job on the command line, then you most likely need this field in your BigJob script.
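
The :code:`$PBS_NODEFILE` analogy above can be spelled out in a few lines of plain Python. This only mimics the slot layout; it is not BigJob's launcher code, and the node names are invented:

```python
# Mimics the note above: with ppn=2, each node line of $PBS_NODEFILE is
# effectively repeated twice when process slots are laid out
# (illustration only, not BigJob code).
nodes = ["node1", "node2"]   # contents of a hypothetical $PBS_NODEFILE
ppn = 2                      # processes_per_node
slots = [n for n in nodes for _ in range(ppn)]
print(slots)  # ['node1', 'node1', 'node2', 'node2']
```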
-
-.. data:: project
-
-The project allocation, if running on an XSEDE resource.
-
-:type: string
-
- .. note:: This field must be removed if you are running somewhere that does not require an allocation.
-
-.. data:: queue
-
-The job queue to be used.
-
-:type: string
-
- .. note:: If you are not submitting to a batch queuing system, remove this parameter.
-
-.. data:: service_url
-
-Specifies the SAGA-Python job adaptor (often this is based on the batch queuing system) and resource hostname (for instance, lonestar.tacc.utexas.edu) on which jobs can be executed.
-
-:type: string
-
- .. note:: For remote hosts, password-less login must be enabled.
-
-.. data:: wall_time_limit
-
-The number of minutes the resources are requested for
-
-:type: string
-
-.. data:: working_directory
-
-The directory in which the Pilot-Job agent executes
-
-:type: string
-
-
-PilotCompute
-============
-A pilot job, which can execute some compute workload (ComputeUnit).
-
-This is the object that is returned by the PilotComputeService when a new PilotCompute is created based on a PilotComputeDescription.
-
-The PilotCompute object can be used by the application to keep track of active pilots.
-
-A PilotCompute has state, can be queried, and cancelled.
-
-.. autoclass:: pilot.impl.pilotcompute_manager.PilotCompute
- :members:
-
-
-PilotDataService
-================
-
-The PilotDataService (PDS) is a factory for creating Pilot-Data objects. The PDS takes the COORDINATION_URL as an argument. This is for coordination of the data units with the redis database.
-
-
-.. autoclass:: pilot.impl.pilotdata_manager.PilotDataService
- :members:
-
-
-PilotDataDescription
-=======================
-PilotDataDescription objects are used to describe the requirements for a
-:class:`~pilot.impl.pilotdata_manager.PilotData` instance. Currently, the only
-generic property that can be set is :data:`size`, all other properties are
-backend-specific security / authentication hints. Example::
-
- pdd = PilotDataDescription()
- pdd.size = 100
-
- data_pilot = service.create_pilot(pdd)
-
-.. data:: size
-
- The storage space required (in megabytes) on the storage resource.
-
- :type: int
-
- .. note:: The 'size' attribute is not supported by all PilotData backends.
-
-.. data:: userkey
-
- The SSH private key -- required on some systems so that the Pilot-Data can ensure the SSH service is accessible from worker nodes.
-
- :type: string
-
- .. note:: 'userkey' is only supported by backends where worker nodes need private key access. An example of this is OSG.
-
-.. data:: access_key_id
-
- The 'username' for Amazon AWS-compliant instances. It is an alphanumeric text string that uniquely identifies the user who owns an account. No two accounts can have the same access key.
-
- :type: string
-
- .. note:: 'access_key_id' is only supported by AWS-compliant EC2-based connections. This applies to Amazon AWS, Eucalyptus, and OpenStack. Please see Amazon's documentation to learn how to obtain your access key id and password.
-
-.. data:: secret_access_key
-
- The 'password' for Amazon AWS-compliant instances. It is called secret because it is assumed to be known to the owner only.
-
- :type: string
-
- .. note:: 'secret_access_key' is only supported by AWS-compliant EC2-based connections. This applies to Amazon AWS, Eucalyptus, and OpenStack. Please see Amazon's documentation to learn how to obtain your access key id and password.
-
-.. data:: service_url
-
- Specifies the file adaptor and resource hostname on which a Pilot-Data will be created.
-
- :type: string
-
-PilotData
-=========
-A Pilot-Data, which can store some data (DataUnit).
-
-This is the object that is returned by the PilotDataService when a new PilotData is created based on a PilotDataDescription.
-
-The PilotData object can be used by the application to keep track of active pilots.
-
-.. autoclass:: pilot.impl.pilotdata_manager.PilotData
- :members:
-
-ComputeDataService
-==================
-
-The Compute Data Service is created to handle both Pilot Compute and Pilot Data entities in a holistic way. It represents the central entry point for the application workload. The CDS takes care of the placement of Compute and Data Units. The set of Pilot Computes and Pilot Data available to the CDS can be changed during the application's runtime. The CDS handles different data-compute affinities and takes care of compute/data co-location for the requested data-compute workload.
-
-.. autoclass:: pilot.impl.pilot_manager.ComputeDataService
- :members:
-
-
-Compute and Data Units
-**********************
-
-ComputeUnitDescription
-=======================
-
-The ComputeUnitDescription defines the actual compute unit that will be run. The executable specified here is what constitutes the individual jobs that will run within the Pilot. This executable can have input arguments or environment variables that must be passed with it in order for it to execute properly.
-
-Example::
-
- compute_unit_description = {
- "executable": "/bin/cat",
- "arguments": ["test.txt"],
- "number_of_processes": 1,
- "output": "stdout.txt",
- "error": "stderr.txt",
- "input_data" : [data_unit.get_url()], # this stages the content of the data unit to the working directory of the compute unit
- "affinity_datacenter_label": "eu-de-south",
- "affinity_machine_label": "mymachine-1"
- }
-
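
To make the mapping from description fields to a launch command concrete, here is a hypothetical sketch. It mirrors the single/mpi behavior documented for :code:`spmd_variation` in this section, but it is not BigJob's actual executor code:

```python
# Hypothetical sketch (not BigJob's executor): build the command line a
# ComputeUnitDescription implies. "single" runs the executable directly;
# "mpi" wraps it in mpirun -np <number_of_processes>.
def launch_command(cud):
    cmd = [cud["executable"]] + list(cud.get("arguments", []))
    if cud.get("spmd_variation", "single") == "mpi":
        cmd = ["mpirun", "-np", str(cud["number_of_processes"])] + cmd
    return " ".join(cmd)

print(launch_command({"executable": "/bin/cat",
                      "arguments": ["test.txt"],
                      "number_of_processes": 1}))   # /bin/cat test.txt
print(launch_command({"executable": "a.out",
                      "number_of_processes": 4,
                      "spmd_variation": "mpi"}))    # mpirun -np 4 a.out
```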
-
-.. class:: ComputeUnitDescription
-
-.. data:: executable
-
-Specifies the path to the executable that will be run
-
- :type: string
-
-.. data:: arguments
-
-Specifies any arguments that the executable needs. For instance, if running an executable from the command line requires a -p flag, then this -p flag can be added in this section.
-
- :type: string
-
-.. data:: environment
-
-Specifies any environment variables that need to be passed with the compute unit in order for the executable to work.
-
- :type: string
-
-.. data:: working_directory
-
-The working directory for the executable
-
- :type: string
-
- .. note:: Recommendation: Do not set the working directory! If unset, the working directory is a sandbox directory of the CU (created automatically by BigJob).
-
-.. data:: input
-
-Specifies the file from which <stdin> is read.
-
- :type: string
-
-.. data:: output
-
-Specifies the name of the file that captures the output from <stdout>. The default is stdout.txt.
-
- :type: string
-
-.. data:: error
-
-Specifies the name of the file that captures the output from <stderr>. The default is stderr.txt.
-
- :type: string
-
-.. data:: number_of_processes
-
-Defines how many CPU cores are reserved for the application process.
-
-For instance, if you need 4 cores for 1 MPI executable, this value would be 4.
-
- :type: string
-
-.. data:: spmd_variation
-
-Defines how the application process is launched. Valid strings for this field are 'single' or 'mpi'. If your executable is :code:`a.out`, "single" executes it as :code:`./a.out`, while "mpi" executes :code:`mpirun -np <number_of_processes> ./a.out` (note: :code:`aprun` is used for Kraken, and :code:`srun/ibrun` is used for Stampede).
-
- :type: string
-
-.. data:: input_data
-
-Specifies the input data flow for a ComputeUnit. This is used in conjunction with PilotData. The format is :code:`[<data unit url>, … ]`
-
- :type: string
-
-.. data:: output_data
-
-Specifies the output data flow for a ComputeUnit. This is used in conjunction with PilotData. The format is :code:`[<data unit url>, … ]`
-
- :type: string
-
-.. data:: affinity_datacenter_label
-
-The data center label used for affinity topology.
-
-:type: string
-
- .. note:: Data centers and machines are organized in a logical topology tree (similar to the tree spawned by a DNS topology). The further the distance between two resources, the smaller their affinity.
-
-.. data:: affinity_machine_label
-
-The machine (resource) label used for affinity topology.
-
-:type: string
-
- .. note:: Data centers and machines are organized in a logical topology tree (similar to the tree spawned by a DNS topology). The further the distance between two resources, the smaller their affinity.
-
-ComputeUnitDescription objects are loosely typed. A dictionary containing the respective keys can be passed instead to the ComputeDataService.
-
-ComputeUnit
-===========
-A ComputeUnit is a work item executed by a PilotCompute. These are what constitute the individual jobs that will run within the Pilot. Oftentimes, this will be an executable, which can have input arguments or environment variables.
-
-A ComputeUnit is the object that is returned by the ComputeDataService when a new ComputeUnit is submitted based on a ComputeUnitDescription. The ComputeUnit object can be used by the application to keep track of ComputeUnits that are active.
-
-A ComputeUnit has state, can be queried, and can be cancelled.
-
-.. autoclass:: pilot.impl.pilotcompute_manager.ComputeUnit
- :members:
-
-
-DataUnitDescription
-=======================
-
-The data unit description defines the different files to be moved around. There is currently no support for directories. ::
-
- data_unit_description = {
- 'file_urls': [file1, file2, file3]
- }
-
-.. class:: DataUnitDescription
-
-.. data:: file_urls
-
-The list of files that make up the DataUnit.
-
-:type: string
-
-.. data:: affinity_datacenter_label
-
-The data center label used for affinity topology.
-
-:type: string
-
- .. note:: Data centers and machines are organized in a logical topology tree (similar to the tree spawned by a DNS topology). The further the distance between two resources, the smaller their affinity.
-
-.. data:: affinity_machine_label
-
-The machine (resource) label used for affinity topology.
-
-:type: string
-
- .. note:: Data centers and machines are organized in a logical topology tree (similar to the tree spawned by a DNS topology). The further the distance between two resources, the smaller their affinity.
-
-
-DataUnit
-========
-A DataUnit is a container for a logical group of data that is often accessed together or comprises a larger set of data; e.g. a data file or chunk.
-
-A DataUnit is the object that is returned by the ComputeDataService when a new DataUnit is submitted based on a DataUnitDescription. The DataUnit object can be used by the application to keep track of DataUnits that are active.
-
-A DataUnit has state, can be queried, and can be cancelled.
-
-.. autoclass:: pilot.impl.pilotdata_manager.DataUnit
- :members:
-
-
-State Enumeration
-******************
-
-Pilots and Compute Units can have state. These states can be queried using the :code:`get_state()` function. States are used for PilotCompute, PilotData, ComputeUnit, DataUnit and ComputeDataService. The following table describes the values that states can have.
-
-.. class:: State
-
-.. cssclass:: table-hover
-+------------------------------+
-| **State** |
-+------------------------------+
-| .. data:: Unknown='Unknown' |
-+------------------------------+
-| .. data:: New='New' |
-+------------------------------+
-| .. data:: Running='Running'  |
-+------------------------------+
-| .. data:: Done='Done'        |
-+------------------------------+
-| .. data:: Canceled='Canceled'|
-+------------------------------+
-| .. data:: Failed='Failed'    |
-+------------------------------+
-| .. data:: Pending='Pending'  |
-+------------------------------+
14 docs/build/html/_sources/patterns/chained.txt
@@ -1,14 +0,0 @@
-###############
-Chained Example
-###############
-
-What if you had two different executables to run? What if this second set of executables had some dependencies on data from A? Can you use one BigJob to run both jobs? Yes!
-
-The below example submits a set of echo jobs (set A) using BigJob, and for every successful job (with state Done), it submits another /bin/echo job (set B) to the same Pilot-Job.
-
-We can think of this as A is comprised of subjobs {a1,a2,a3}, while B is comprised of subjobs {b1,b2,b3}. Rather than wait for each subjob {a1},{a2},{a3} to complete, {b1} can run as soon as {a1} is complete, or {b1} can run as soon as a slot becomes available -- i.e. {a2} could finish before {a1}.
-
-The code below demonstrates this behavior. As soon as there is a slot available to run a job in B (i.e. a job in A has completed), it executes the job in B. This keeps the BigJob utilization high.
-
-.. literalinclude:: ../../../examples/tutorial/local_chained_ensembles.py
- :language: python
8 docs/build/html/_sources/patterns/coupled.txt
@@ -1,8 +0,0 @@
-#################
-Coupled Ensembles
-#################
-
-The script provides a simple workflow which submits a set of jobs (A) and a set of jobs (B), waits until they are completed, and then submits a set of jobs (C). It demonstrates the synchronization mechanisms provided by the Pilot-API. This example is useful if an executable C has dependencies on some of the output generated by jobs A and B.
-
-.. literalinclude:: ../../../examples/tutorial/local_coupled_ensembles.py
- :language: python
122 docs/build/html/_sources/patterns/exsede.txt
@@ -1,122 +0,0 @@
-#############################
-XSEDE Simple Ensemble Example
-#############################
-
-One of the features of BigJob is its support for application-level programmability: many of the parameters in each script are customizable and configurable. Several parameters must be added to the PilotComputeDescription in order to run on XSEDE.
-
-----------------------------
-:code:`service_url`
-----------------------------
-
-The service URL communicates what type of queueing system or middleware you want to use and where it is. The following table shows the machine type and the adaptor to use for that machine.
-
-+-----------------------------+-----------------------------------------------------------------------------+
-| Machine | :code:`service_url` |
-+=============================+=============================================================================+
-| All machines |* *fork://localhost* |
-| |* *ssh://eric1.loni.org* |
-+-----------------------------+-----------------------------------------------------------------------------+
-| Stampede |* **Local:** *slurm://localhost* |
-| |* **Remote:** *slurm+ssh://stampede.tacc.utexas.edu* |
-+-----------------------------+-----------------------------------------------------------------------------+
-| Lonestar and Ranger |* **Local:** *sge://localhost* |
-| |* **Remote (over SSH):** *sge+ssh://lonestar.tacc.utexas.edu* |
-| |* **Remote (GSISSH):** *sge+gsissh://ranger.tacc.utexas.edu* |
-+-----------------------------+-----------------------------------------------------------------------------+
-| Trestles |* **Local:** *pbs://localhost* |
-| |* **Remote (over SSH):** *pbs+ssh://trestles.sdsc.edu* |
-+-----------------------------+-----------------------------------------------------------------------------+
-| Kraken |* **Local:** *xt5torque://localhost* |
-| |* **Remote (GSISSH):** *xt5torque+gsissh://gsissh.kraken.nics.xsede.org* |
-+-----------------------------+-----------------------------------------------------------------------------+
-
-
-----------------------------
-:code:`project`
-----------------------------
-
-When running on XSEDE, the project parameter must be changed to your project's allocation number.
-
-----------------------------
-:code:`number_of_processes`
-----------------------------
-
-This refers to the number of cores used. If your machine does not have 12 cores per node, you will have to change this parameter.
-
-----------------------------
-:code:`queue`
-----------------------------
-
-This refers to the name of the queue on the submission machine. For example, two queue names on Lonestar are 'normal' and 'development'. Please refer to the machine-specific documentation to find out the names of the queues on the machines.
-
--------------------------------
-Example PilotComputeDescription
--------------------------------
-
-::
-
- pilot_compute_description = {
- "service_url": 'slurm+ssh://stampede.tacc.utexas.edu',
- "number_of_processes": 32,
- "queue":"normal",
- "project":"TG-MCBXXXXXX", # if None default allocation is used
- "walltime":10,
- "working_directory": os.getcwd()
- }
-
-
-----------------------------------
-Simple Ensembles Stampede Example
-----------------------------------
-
-Now that we have modified the Pilot Compute Description, we can put this together with our simple ensemble pattern to build a script that executes on Stampede. Note that the PCD is the only thing that changes in this example. ::
-
- import os
- import time
- import sys
- from pilot import PilotComputeService, ComputeDataService, State
-
- ### This is the number of jobs you want to run
- NUMBER_JOBS=4
- COORDINATION_URL = "redis://localhost"
-
- if __name__ == "__main__":