SCOOP is Losing Futures #64

Open
maharjun opened this issue Jun 6, 2017 · 3 comments · May be fixed by #67

Comments

@maharjun
Contributor

maharjun commented Jun 6, 2017

As a result of the changes in the master branch, it appears that SCOOP is losing futures even when there are no communication issues. The following two problems lead to lost futures.

  1. execQueue.inprogress does not seem to be updated when a future that is popped from the queue is run (via runFuture). This means that, for as long as a job is running, none of ready, movable, and inprogress contains the future being run. If the asynchronous thread that performs futures reporting sends an update message during this phase, the executing future is left out of the report, causing it to be erroneously deleted from assigned_tasks in the broker.

  2. The following sequence of events is an issue:

    1. A future gets completed on a remote worker thread
    2. sendResult is called on that future on the remote executing worker. This results in a STATUS_DONE being sent to the broker, which then (wrongly) deletes the future from assigned_tasks.
    3. self.askForPreviousFutures() is called on the process that spawned this future (this is possible because the processes are asynchronous). This then leads to the report of a 'Lost future'.

    Basically, the error arises because the future is not deleted on the originator worker before the STATUS_DONE is sent.
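
The two problems above can be reproduced in a toy model. The sketch below is not SCOOP's code: ToyBroker, ToyQueue, scenario_one and scenario_two are hypothetical names made up for illustration, and the broker logic is reduced to the assigned_tasks bookkeeping described in this report. It only shows why a status report sent mid-run, or a STATUS_DONE processed before the originator forgets the future, makes the broker lose track of it.

```python
# Toy model of the bookkeeping described above. All classes and functions
# here are hypothetical illustrations, NOT SCOOP's actual implementation.

class ToyBroker:
    """Tracks which future ids each worker is believed to hold."""

    def __init__(self):
        self.assigned_tasks = {}  # worker name -> set of future ids

    def assign(self, worker, future_id):
        self.assigned_tasks.setdefault(worker, set()).add(future_id)

    def status_update(self, worker, reported_ids):
        # The broker trusts the report: anything not listed is forgotten.
        self.assigned_tasks[worker] = set(reported_ids)

    def status_done(self, worker, future_id):
        self.assigned_tasks.get(worker, set()).discard(future_id)

    def knows(self, future_id):
        return any(future_id in ids for ids in self.assigned_tasks.values())


class ToyQueue:
    """Worker-side queue; track_running toggles the inprogress update."""

    def __init__(self, track_running):
        self.ready = []
        self.inprogress = set()
        self.track_running = track_running

    def pop_for_execution(self):
        future_id = self.ready.pop(0)
        if self.track_running:
            self.inprogress.add(future_id)  # the update that appears missing
        return future_id

    def report(self):
        return set(self.ready) | self.inprogress


def scenario_one(track_running):
    """Problem 1: a status report is sent while the future is running."""
    broker = ToyBroker()
    queue = ToyQueue(track_running)
    queue.ready.append("f1")
    broker.assign("w1", "f1")
    queue.pop_for_execution()                    # f1 starts running
    broker.status_update("w1", queue.report())   # reporting thread fires mid-run
    return broker.knows("f1")


def scenario_two(delete_on_origin_first):
    """Problem 2: STATUS_DONE reaches the broker before the originator
    has dropped the future from its own pending set."""
    broker = ToyBroker()
    broker.assign("w2", "f2")
    origin_pending = {"f2"}                      # originator still waits for f2
    if delete_on_origin_first:
        origin_pending.discard("f2")             # originator handles result first
    broker.status_done("w2", "f2")
    # Analogue of askForPreviousFutures: which pending futures are unknown?
    return {f for f in origin_pending if not broker.knows(f)}


if __name__ == "__main__":
    print("problem 1, inprogress not updated:", scenario_one(False))
    print("problem 1, inprogress updated:    ", scenario_one(True))
    print("problem 2, STATUS_DONE first:     ", scenario_two(False))
    print("problem 2, originator first:      ", scenario_two(True))
```

In this model the future survives only when the running future is included in the report (problem 1) and when the originator drops the future before the broker processes STATUS_DONE (problem 2). Whether the same ordering fix is the right one for SCOOP itself is an assumption that would need checking against the real code.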

@soravux
Owner

soravux commented Jun 7, 2017

Thanks for taking the time to understand this error. I am sadly very busy with other matters, but I will gladly take any pull request proposing a solution for this.

@maharjun maharjun linked a pull request Jun 7, 2017 that will close this issue
@nfaguirrec

Hello everyone,

I am not sure whether the issue that maharjun describes is related to the one I am observing (I apologize in advance if it is not). But what I see is definitely a case of “SCOOP is Losing Futures” (see below). The weird thing is that when I use version 0.7.1.1 (from pip) instead of 0.7.2.0 (cloned from the repository), I do not observe the error anymore. Unfortunately, I need other upgrades that are only available in the latest version. I would greatly appreciate any help you can give me.

All the best,
Nestor

[2017-07-26 16:49:16,747] scoopzmq (192.168.2.23:52308) WARNING Lost track of future ('192.168.2.23:52308', 4):KFoldCrossValidation_runWorker((<Model.Model instance at 0x2ae4d309ecf8>, <Optimizer.Optimizer object at 0x2ae4d2f73cd0>),){}=None. Resending it...
(MainThread) Lost track of future ('192.168.2.23:52308', 4):KFoldCrossValidation_runWorker((<Model.Model instance at 0x2ae4d309ecf8>, <Optimizer.Optimizer object at 0x2ae4d2f73cd0>),){}=None. Resending it...
Traceback (most recent call last):
  File "/usr/projects/hpcsoft/toss2/common/anaconda/4.1.1-python-2.7/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/projects/hpcsoft/toss2/common/anaconda/4.1.1-python-2.7/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "build/bdist.linux-x86_64/egg/scoop/bootstrap/__main__.py", line 298, in <module>
  File "build/bdist.linux-x86_64/egg/scoop/bootstrap/__main__.py", line 92, in main
  File "build/bdist.linux-x86_64/egg/scoop/bootstrap/__main__.py", line 285, in run
  File "build/bdist.linux-x86_64/egg/scoop/bootstrap/__main__.py", line 266, in futures_startup
  File "build/bdist.linux-x86_64/egg/scoop/futures.py", line 65, in _startup
  File "build/bdist.linux-x86_64/egg/scoop/_control.py", line 273, in runController
  File "build/bdist.linux-x86_64/egg/scoop/_types.py", line 359, in pop
  File "build/bdist.linux-x86_64/egg/scoop/_types.py", line 382, in updateQueue
  File "build/bdist.linux-x86_64/egg/scoop/_comm/scoopzmq.py", line 352, in recvFuture
  File "build/bdist.linux-x86_64/egg/scoop/_comm/scoopzmq.py", line 269, in _recv
  File "build/bdist.linux-x86_64/egg/scoop/_comm/scoopzmq.py", line 369, in sendFuture
  File "build/bdist.linux-x86_64/egg/scoop/encapsulation.py", line 164, in pickleFileLike
IOError: File not open for reading
[2017-07-26 16:49:16,816] launcher (127.0.0.1:42167) INFO Root process is done.
[2017-07-26 16:49:16,816] workerLaunch (127.0.0.1:42167) DEBUG Closing workers on wf535 (4 workers).
[2017-07-26 16:49:16,816] brokerLaunch (127.0.0.1:42167) DEBUG Closing local broker.
[2017-07-26 16:49:16,816] launcher (127.0.0.1:42167) INFO Finished cleaning spawned subprocesses.

@RuralHunter

RuralHunter commented Aug 27, 2023

I have the same problem for local workers with version 0.7.2.

[2023-08-27 12:32:44,600] scoopzmq  (b'127.0.0.1:59141') WARNING Lost track of future (b'127.0.0.1:59141', 9):run_test_on_date('2023-08-25-08',){}=None. Resending it...
2023-08-27 12:32:44 WARNING SCOOPLogger Lost track of future (b'127.0.0.1:59141', 9):run_test_on_date('2023-08-25-08',){}=None. Resending it...

The workers complete their tasks (according to the worker logs), but the futures are resent again and again, and the main process never ends.

4 participants