CU's fail on Stampede - no STDOUT no STDERR #620

Closed
antonst opened this issue May 7, 2015 · 20 comments

Comments

@antonst
Contributor

antonst commented May 7, 2015

Terminal output
agent.err
agent.log
agent.out

@marksantcroos
Contributor

Duplicate of your own #281?

@antonst
Contributor Author

antonst commented May 7, 2015

No, it is not, but in principle it could just as well be a triplicate, given that there are 127 open issues.

@marksantcroos
Contributor

So what are you reporting? That a unit fails? That all units fail? That there is no STDOUT and no STDERR?

@andre-merzky
Member

FWIW, the agent.err contains:

2015:05:07 15:07:37 16837 StageinWorker-0 radical.pilot.agent : [ERROR ] Copy'ed /work/02457/antontre/radical.pilot.sandbox/rp.session.ip-10-184-31-85.treikalis.016562.0007-pilot.0000/staging_area/ala10_remd_75_1.rst to /work/02457/antontre/radical.pilot.sandbox/rp.session.ip-10-184-31-85.treikalis.016562.0007-pilot.0000/unit.000663/ala10_remd_75_1.rst - failure ([Errno 2] No such file or directory: '/work/02457/antontre/radical.pilot.sandbox/rp.session.ip-10-184-31-85.treikalis.016562.0007-pilot.0000/staging_area/ala10_remd_75_1.rst')
Traceback (most recent call last):
  File "/work/02457/antontre/radical.pilot.sandbox/ve_stampede/rp_install/bin/radical-pilot-agent-multicore.py", line 4502, in run
    elif directive['action'] == COPY: shutil.copyfile(source, abs_target)
  File "/opt/apps/intel13/python/2.7.9/lib/python2.7/shutil.py", line 82, in copyfile
    with open(src, 'rb') as fsrc:
IOError: [Errno 2] No such file or directory: '/work/02457/antontre/radical.pilot.sandbox/rp.session.ip-10-184-31-85.treikalis.016562.0007-pilot.0000/staging_area/ala10_remd_75_1.rst'

I would assume there is an input file missing?
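To make the "missing input file" reading of the traceback above concrete, here is a minimal sketch of a copy step that checks for the source before calling shutil.copyfile and reports the problem instead of raising. The helper name copy_staged_file and its return convention are assumptions for illustration only; this is not the agent's actual staging code.

```python
import os
import shutil
import tempfile


def copy_staged_file(source, target):
    """Copy source to target; return (ok, message) instead of raising.

    Hypothetical helper, not part of radical.pilot.
    """
    if not os.path.isfile(source):
        # Same condition that makes shutil.copyfile raise [Errno 2].
        return False, "missing staging source: %s" % source
    target_dir = os.path.dirname(target)
    if target_dir and not os.path.isdir(target_dir):
        os.makedirs(target_dir)
    shutil.copyfile(source, target)
    return True, "copied %s to %s" % (source, target)


if __name__ == "__main__":
    # Temporary directory layout that mimics staging_area/ and unit.NNNNNN/.
    root = tempfile.mkdtemp()
    src = os.path.join(root, "staging_area", "ala10_remd_75_1.rst")
    tgt = os.path.join(root, "unit.000663", "ala10_remd_75_1.rst")
    print(copy_staged_file(src, tgt))
```

A check like this only makes the failure easier to read; it does not explain why the file was never produced upstream.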


99 little bugs in the code.
99 little bugs in the code.
Take one down, patch it around.

127 little bugs in the code...

@antonst
Contributor Author

antonst commented May 7, 2015

This file is missing because unit.000467, which is an Amber MD run, failed. Contents of unit.000467:
login4.stampede(29)$ cd unit.000467
login4.stampede(30)$ ls -lrt
total 68
-rw------- 1 antontre G-801782 99 May 7 14:58 ala10_us.RST.77
-rw------- 1 antontre G-801782 315 May 7 14:58 ala10_us.mdin
-rw------- 1 antontre G-801782 55330 May 7 14:58 ala10.prmtop
-rw------- 1 antontre G-801782 317 May 7 15:01 ala10_remd_75_1.mdin

@andre-merzky
Member

How do you see that this unit failed? The DB records are gone, unfortunately...


@antonst
Contributor Author

antonst commented May 7, 2015

2015:05:07 20:01:12 28186  Thread-3     radical.pilot         : [INFO    ] RUN ComputeUnit 'unit.000467' state changed from 'StagingInput' to 'Failed'.
2015:05:07 20:01:12 28186  Thread-3     radical.repex.pk-patternB: [INFO    ] ComputeUnit 'unit.000467' state changed to Failed.
2015:05:07 20:01:12 28186  Thread-3     radical.repex.pk-patternB: [ERROR   ] Log: {'log': [<radical.pilot.logentry.Logentry object at 0x7fa6a820f0d0>, <radical.pilot.logentry.Logentry object at 0x7fa6a820f090>, <radical.pilot.logentry.Logentry object at 0x7fa6a820f050>, <radical.pilot.logentry.Logentry object at 0x7fa6a9750f90>], 'state': u'Failed', 'working_directory': u'sftp://stampede.tacc.utexas.edu/work/02457/antontre/radical.pilot.sandbox/rp.session.ip-10-184-31-85.treikalis.016562.0007-pilot.0000//unit.000467', 'uid': 'unit.000467', 'submission_time': datetime.datetime(2015, 5, 7, 19, 58, 49, 941000), 'execution_details': {u'stdout': None, u'Agent_Output_Directives': [{u'target': u'staging:///ala10_remd_75_1.mdinfo', u'priority': 0, u'source': u'ala10_remd_75_1.mdinfo', u'state': u'Pending', u'flags': [u'CreateParents', u'SkipFailed'], u'action': u'Copy'}, {u'target': u'staging:///ala10_remd_75_1.rst', u'priority': 0, u'source': u'ala10_remd_75_1.rst', u'state': u'Pending', u'flags': [u'CreateParents', u'SkipFailed'], u'action': u'Copy'}], u'Agent_Output_Status': u'New', u'exec_locs': None, u'FTW_Input_Directives': [{u'target': u'ala10_remd_75_1.mdin', u'priority': 0, u'source': u'ala10_remd_75_1.mdin', u'state': u'Pending', u'flags': [u'CreateParents', u'SkipFailed'], u'action': u'Transfer'}], u'log': [{u'timestamp': datetime.datetime(2015, 5, 7, 19, 58, 49, 960000), u'message': u'Scheduled for data transfer to ComputePilot pilot.0000.'}, {u'timestamp': datetime.datetime(2015, 5, 7, 19, 58, 50, 255000), u'message': u'unit needs input staging'}, {u'timestamp': datetime.datetime(2015, 5, 7, 19, 58, 50, 285000), u'message': u"Copy'ed /work/02457/antontre/radical.pilot.sandbox/rp.session.ip-10-184-31-85.treikalis.016562.0007-pilot.0000/staging_area/ala10.prmtop to /work/02457/antontre/radical.pilot.sandbox/rp.session.ip-10-184-31-85.treikalis.016562.0007-pilot.0000/unit.000467/ala10.prmtop - success"}, {u'timestamp': datetime.datetime(2015, 5, 7, 20, 1, 6, 
650000), u'message': u'Input transfer failed: cannot release object -- not managed'}], u'exit_code': None, u'FTW_Input_Status': u'Executing', u'state': u'Failed', u'unitmanager': u'554bc01323769c6e1a27f580', u'statehistory': [{u'timestamp': datetime.datetime(2015, 5, 7, 19, 58, 49, 939000), u'state': u'Scheduling'}, {u'timestamp': datetime.datetime(2015, 5, 7, 19, 58, 50, 109000), u'state': u'StagingInput'}, {u'timestamp': datetime.datetime(2015, 5, 7, 19, 58, 50, 255000), u'state': u'StagingInput'}, {u'timestamp': datetime.datetime(2015, 5, 7, 20, 0, 55, 413000), u'state': u'StagingInput'}, {u'timestamp': datetime.datetime(2015, 5, 7, 20, 1, 6, 650000), u'state': u'Failed'}], u'pilot': u'pilot.0000', u'FTW_Output_Directives': [], u'pilot_sandbox': u'sftp://stampede.tacc.utexas.edu/work/02457/antontre/radical.pilot.sandbox/rp.session.ip-10-184-31-85.treikalis.016562.0007-pilot.0000/', u'description': {u'kernel': None, u'executable': u'/opt/apps/intel13/mvapich2_1_9/amber/12.0/bin/sander.MPI', u'name': None, u'restartable': False, u'stdout': None, u'output_staging': [{u'action': u'Copy', u'source': u'ala10_remd_75_1.mdinfo', u'flags': [u'CreateParents', u'SkipFailed'], u'target': u'staging:///ala10_remd_75_1.mdinfo', u'priority': 0}, {u'action': u'Copy', u'source': u'ala10_remd_75_1.rst', u'flags': [u'CreateParents', u'SkipFailed'], u'target': u'staging:///ala10_remd_75_1.rst', u'priority': 0}], u'pre_exec': [u'module load TACC', u'module load amber/12.0'], u'mpi': True, u'environment': None, u'cleanup': False, u'arguments': [u'-O', u'-i ', u'ala10_remd_75_1.mdin', u'-o ', u'ala10_remd_75_1.mdout', u'-p ', u'ala10.prmtop', u'-c ', u'../staging_area//replica_75_0/ala10_minimized.inpcrd', u'-r ', u'ala10_remd_75_1.rst', u'-x ', u'ala10_remd_75_1.mdcrd', u'-inf ', u'ala10_remd_75_1.mdinfo'], u'stderr': None, u'cores': 1, u'post_exec': None, u'input_staging': [{u'action': u'Transfer', u'source': u'ala10_remd_75_1.mdin', u'flags': [u'CreateParents', u'SkipFailed'], 
u'target': u'ala10_remd_75_1.mdin', u'priority': 0}, {u'action': u'Copy', u'source': u'staging:///ala10.prmtop', u'flags': [u'CreateParents', u'SkipFailed'], u'target': u'ala10.prmtop', u'priority': 0}, {u'action': u'Copy', u'source': u'staging:///ala10_us.mdin', u'flags': [u'CreateParents', u'SkipFailed'], u'target': u'ala10_us.mdin', u'priority': 0}, {u'action': u'Copy', u'source': u'staging:///ala10_us.RST.77', u'flags': [u'CreateParents', u'SkipFailed'], u'target': u'ala10_us.RST.77', u'priority': 0}]}, u'restartable': False, u'started': None, u'FTW_Output_Status': None, u'finished': None, u'Agent_Input_Directives': [{u'target': u'ala10.prmtop', u'priority': 0, u'source': u'staging:///ala10.prmtop', u'state': u'Done', u'flags': [u'CreateParents', u'SkipFailed'], u'action': u'Copy'}, {u'target': u'ala10_us.mdin', u'priority': 0, u'source': u'staging:///ala10_us.mdin', u'state': u'Pending', u'flags': [u'CreateParents', u'SkipFailed'], u'action': u'Copy'}, {u'target': u'ala10_us.RST.77', u'priority': 0, u'source': u'staging:///ala10_us.RST.77', u'state': u'Pending', u'flags': [u'CreateParents', u'SkipFailed'], u'action': u'Copy'}], u'Agent_Input_Status': u'Done', u'submitted': datetime.datetime(2015, 5, 7, 19, 58, 49, 941000), u'sandbox': u'sftp://stampede.tacc.utexas.edu/work/02457/antontre/radical.pilot.sandbox/rp.session.ip-10-184-31-85.treikalis.016562.0007-pilot.0000//unit.000467', u'stderr': None, u'_id': u'unit.000467'}, 'stop_time': None, 'start_time': None, 'exit_code': None, 'name': None}
2015:05:07 20:01:12 28186  InputFileTransferWorker-1 radical.pilot         : [DEBUG   ] read : [   19] [    6] (sftp> )

@andre-merzky
Member

Thanks.

Antons, can you please try to rerun it and see if this is reproducible? It seems you hit an internal error on the radical.utils layer. I don't understand it yet I'm afraid -- knowing if it is reproducible would help to triage it. Thanks...

@andre-merzky andre-merzky self-assigned this May 7, 2015
@andre-merzky
Member

 {u'timestamp': datetime.datetime(2015, 5, 7, 19, 58, 50, 285000),
 u'message': u"Copy'ed /work/02457/antontre/radical.pilot.sandbox/rp.session.ip-10-184-31-85.treikalis.016562.0007-pilot.0000/staging_area/ala10.prmtop to /work/02457/antontre/radical.pilot.sandbox/rp.session.ip-10-184-31-85.treikalis.016562.0007-pilot.0000/unit.000467/ala10.prmtop - success"},

 {u'timestamp': datetime.datetime(2015, 5, 7, 20, 1, 6, 650000),
 u'message': u'Input transfer failed: cannot release object -- not managed'}],
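One thing the two entries above show is how long the unit sat between the successful copy and the failed input transfer. The sketch below computes that gap from the timestamps as they appear in the dump; the first message text is abbreviated here, and the helper name gaps is an assumption for illustration.

```python
import datetime

# The two unit log entries quoted above; timestamps copied from the dump,
# first message shortened for readability.
log = [
    {"timestamp": datetime.datetime(2015, 5, 7, 19, 58, 50, 285000),
     "message": "Copy'ed ala10.prmtop - success (abbreviated)"},
    {"timestamp": datetime.datetime(2015, 5, 7, 20, 1, 6, 650000),
     "message": "Input transfer failed: cannot release object -- not managed"},
]


def gaps(entries):
    """Return (seconds_since_previous_entry, message) for each entry after the first."""
    ordered = sorted(entries, key=lambda e: e["timestamp"])
    return [((cur["timestamp"] - prev["timestamp"]).total_seconds(), cur["message"])
            for prev, cur in zip(ordered, ordered[1:])]


for seconds, message in gaps(log):
    print("%8.3f s  %s" % (seconds, message))
```

The gap comes out at a little over two minutes, which may help when correlating the unit log with the client-side sftp debug output.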

@marksantcroos
Contributor

And the stacktrace:

2015:05:07 20:01:06 28186  InputFileTransferWorker-2 radical.pilot         : [ERROR   ] {'timestamp': datetime.datetime(2015, 5, 7, 20, 1, 6, 650230), 'message': 'Input transfer failed: cannot release object -- not managed'}
Traceback (most recent call last):
  File "/home/treikalis/repex-16/local/lib/python2.7/site-packages/radical.pilot-0.31-py2.7.egg/radical/pilot/controller/input_file_transfer_worker.py", line 188, in run
    input_file.close()
  File "/home/treikalis/repex-16/local/lib/python2.7/site-packages/saga_python-0.28-py2.7.egg/saga/filesystem/file.py", line 178, in close
    return self._adaptor.close ()
  File "/home/treikalis/repex-16/local/lib/python2.7/site-packages/saga_python-0.28-py2.7.egg/saga/adaptors/cpi/decorators.py", line 57, in wrap_function
    return sync_function (self, *args, **kwargs)
  File "/home/treikalis/repex-16/local/lib/python2.7/site-packages/saga_python-0.28-py2.7.egg/saga/adaptors/shell/shell_file.py", line 1079, in close
    self.finalize (kill=True)
  File "/home/treikalis/repex-16/local/lib/python2.7/site-packages/saga_python-0.28-py2.7.egg/saga/adaptors/shell/shell_file.py", line 1063, in finalize
    self.lm.release (self.local)
  File "/home/treikalis/repex-16/lib/python2.7/site-packages/radical.utils-0.28-py2.7.egg/radical/utils/lease_manager.py", line 416, in release
    raise RuntimeError ("cannot release object -- not managed")
RuntimeError: cannot release object -- not managed
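For readers unfamiliar with the pattern, the following toy lease pool shows one way an error of this shape can arise: an object is handed out, dropped from the pool's bookkeeping, and then released by its holder. Everything here (ToyLeaseManager, purge, the bookkeeping layout) is hypothetical and deliberately simplified; it is not the radical.utils lease manager, and the thread does not establish that this is what actually happened.

```python
class ToyLeaseManager(object):
    """Hypothetical, simplified lease pool; NOT the radical.utils implementation."""

    def __init__(self):
        self._managed = {}  # id(obj) -> [obj, is_leased]

    def lease(self, factory):
        """Return a free pooled object, creating one with factory() if needed."""
        for entry in self._managed.values():
            if not entry[1]:
                entry[1] = True
                return entry[0]
        obj = factory()
        self._managed[id(obj)] = [obj, True]
        return obj

    def release(self, obj):
        """Mark obj as free again; fail if the pool does not know it."""
        entry = self._managed.get(id(obj))
        if entry is None:
            raise RuntimeError("cannot release object -- not managed")
        entry[1] = False

    def purge(self, obj):
        """Forget obj entirely, e.g. after a timeout (assumed behaviour)."""
        self._managed.pop(id(obj), None)


if __name__ == "__main__":
    lm = ToyLeaseManager()
    conn = lm.lease(object)   # a plain object stands in for a connection
    lm.purge(conn)            # pool forgets the object while it is still held
    try:
        lm.release(conn)
    except RuntimeError as exc:
        print("release failed: %s" % exc)
```

If something like this is the cause, the interesting question is which code path removes the object from the pool while a worker still holds it.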

@antonst
Contributor Author

antonst commented May 8, 2015

Thank you for your input, gentlemen. If I may ask, apart from verifying that this issue is reproducible, is there any other reason to re-run?

@marksantcroos
Contributor

Mainly to get an intuition for how often it occurs, and more hints about where to instrument the code to get further debugging information.
Meanwhile, we have started trying to reproduce it in isolation.

marksantcroos pushed a commit to radical-cybertools/radical.utils that referenced this issue May 8, 2015
@antonst
Contributor Author

antonst commented May 10, 2015

Reproduced with RADICAL_PILOT_VERBOSE=debug SAGA_VERBOSE=debug RADICAL_VERBOSE=debug RADICAL_REPEX_VERBOSE=info
terminal output

@antonst antonst closed this as completed May 10, 2015
@antonst antonst reopened this May 10, 2015
@andre-merzky
Member

Great! Would you mind giving us instructions on how exactly to run your code to reproduce it? Thanks!

@antonst
Contributor Author

antonst commented May 10, 2015

Sorry, logs and terminal output are missing. Will post soon.

@andre-merzky
Member

Antons, could you please try to add the following entries to your ~/.saga.cfg:

[saga.utils.pty]
connection_pool_size = 20
connection_pool_ttl = 1200
connection_pool_wait = 1200

Also, see above, could you provide instructions on how to run your code? Thanks!

@antonst
Contributor Author

antonst commented May 18, 2015

This was the US (umbrella sampling) use case from RepEx. You can run it as follows:

git clone https://github.com/radical-cybertools/RepEx.git
cd RepEx
git checkout feature/2d-prof
python setup.py install
cd examples/amber_pattern_b_umbrella_sampling
# set your username and allocation for Stampede under input.PILOT in amber_input.json
# set the number of replicas to >= 196 and the number of cores to >= 96
# export any variables you need, e.g. RADICAL_PILOT_VERBOSE
# now you can run this simulation:
RADICAL_REPEX_VERBOSE=info python launch_simulation_pattern_b_amber_us.py --input='amber_input.json'

@antonst
Contributor Author

antonst commented Jun 13, 2015

The proposed ~/.saga.cfg settings result in the following error:

radical.utils.config.config.ValueTypeError: Option saga.utils.pty.connection_pool_ttl requires value of type '<type 'int'>' but got '<type 'str'>'.
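The error message suggests that the value read from the config file arrived as a string where an integer was expected. As a general illustration of that mismatch, the sketch below parses the same entries with Python's standard configparser module (Python 3 spelling), which returns every value as a string unless it is converted explicitly. This uses only the standard library and is an assumption-laden illustration; it does not show how radical.utils reads ~/.saga.cfg, nor is it a confirmed fix.

```python
import configparser

# The entries suggested earlier in the thread, inlined for the example.
CFG_TEXT = """
[saga.utils.pty]
connection_pool_size = 20
connection_pool_ttl = 1200
connection_pool_wait = 1200
"""

parser = configparser.ConfigParser()
parser.read_string(CFG_TEXT)
section = parser["saga.utils.pty"]

raw = section["connection_pool_ttl"]          # INI values are read as str
ttl = section.getint("connection_pool_ttl")   # explicit conversion to int

print(type(raw).__name__, repr(raw))
print(type(ttl).__name__, repr(ttl))
```

Whether the conversion belongs in the config layer or in the option's declared type is a question for the maintainers.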

@andre-merzky
Member

Hi Antons -- we dropped the ball on this ticket. Does this still pop up for you?

@andre-merzky
Member

Optimistically closing due to inactivity (and also because the lease manager has changed since then).

3 participants