Retry socket connections #266

Closed · stephanie-wang opened this issue Feb 10, 2017 · 1 comment · Fixed by #294
Labels: bug (Something that is supposed to be working; but isn't)

@stephanie-wang (Contributor) commented:
A common pattern for a Ray process is to connect to one or more sockets. Currently, we often try once and exit violently if the initial connection is unsuccessful. However, these connection failures may be transient (for instance, if the listen backlog on the server side is full during photon_connect). In these cases, we should retry the connection with some backoff.
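A minimal sketch of what such a retry loop could look like around the Unix-socket connect in src/common/io.c. The names connect_ipc_sock_retry, NUM_CONNECT_ATTEMPTS, and CONNECT_TIMEOUT_MS are hypothetical illustrations, not necessarily what #294 ends up using:

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical retry parameters; real values would need tuning. */
#define NUM_CONNECT_ATTEMPTS 10
#define CONNECT_TIMEOUT_MS 100

/* Connect to a Unix domain socket, retrying with capped exponential backoff
 * instead of dying on the first ECONNREFUSED. Returns the connected file
 * descriptor, or -1 if every attempt failed. */
int connect_ipc_sock_retry(const char *socket_pathname) {
  struct sockaddr_un addr;
  memset(&addr, 0, sizeof(addr));
  addr.sun_family = AF_UNIX;
  strncpy(addr.sun_path, socket_pathname, sizeof(addr.sun_path) - 1);

  for (int attempt = 0; attempt < NUM_CONNECT_ATTEMPTS; ++attempt) {
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) {
      return -1;
    }
    if (connect(fd, (struct sockaddr *) &addr, sizeof(addr)) == 0) {
      return fd; /* Connected successfully. */
    }
    /* Possibly transient (e.g. the server's listen backlog is full):
     * close the socket, wait, and try again with a longer delay. */
    close(fd);
    fprintf(stderr, "Connection to %s failed, retrying.\n", socket_pathname);
    int delay_ms = CONNECT_TIMEOUT_MS * (1 << (attempt < 3 ? attempt : 3));
    usleep((useconds_t) delay_ms * 1000);
  }
  return -1;
}

The same pattern would apply to the Redis and Plasma socket connections; the backoff keeps a briefly overloaded server from killing the whole worker on startup.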

stephanie-wang added the bug label on Feb 10, 2017
@robertnishihara (Collaborator) commented:

This is probably the cause of the error in testGettingManyObjects in stress_tests.py in this log: https://travis-ci.org/ray-project/ray/jobs/200210188.

testGettingManyObjects (__main__.TaskTests) ... Waiting for redis server at 127.0.0.1:38081 to respond...
10098:M 10 Feb 07:14:23.632 # Server started, Redis version 3.9.102
Failed to connect to the redis server, retrying.
Waiting for redis server at 127.0.0.1:38081 to respond...
[INFO] (/Users/travis/build/ray-project/ray/src/plasma/plasma_store.c:870) Allowing the Plasma store to use up to 3.44GB of memory.
[INFO] (/Users/travis/build/ray-project/ray/src/photon/photon_scheduler.c:799) Start worker command is python /Users/travis/.local/lib/python3.5/site-packages/ray-0.0.1-py3.5.egg/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --object-store-name=/tmp/plasma_store79902205 --object-store-manager-name=/tmp/plasma_manager74482700 --local-scheduler-name=/tmp/scheduler28728453 --redis-address=127.0.0.1:38081
[ERROR] (/Users/travis/build/ray-project/ray/src/common/io.c:115: errno: Connection refused) Connection to socket failed for pathname /tmp/scheduler28728453.
[FATAL] (/Users/travis/build/ray-project/ray/src/photon/photon_client.c:14: errno: Bad file descriptor) Check failure: success == 0 
Unable to register worker with local scheduler
0   libphoton.so                        0x0000000102e0f1ff photon_connect + 287
1   libphoton.so                        0x0000000102e0c24e PyPhotonClient_init + 78
[ERROR] (/Users/travis/build/ray-project/ray/src/common/io.c:115: errno: Connection refused) Connection to socket failed for pathname /tmp/scheduler28728453.
[FATAL] (/Users/travis/build/ray-project/ray/src/photon/photon_client.c:14: errno: Bad file descriptor) Check failure: success == 0 
Unable to register worker with local scheduler
0   libphoton.so                        0x0000000102e101ff photon_connect + 287
1   libphoton.so                        0x0000000102e0d24e PyPhotonClient_init + 78
2   libpython3.5m.dylib                 0x0000000100062329 type_call + 281
3   libpython3.5m.dylib                 0x000000010000fd73 PyObject_Call + 99
4   libpython3.5m.dylib                 0x00000001000bd766 PyEval_EvalFrameEx + 23590
5   libpython3.5m.dylib                 0x00000001000c10c3 _PyEval_EvalCodeWithName + 1779
6   libpython3.5m.dylib                 0x00000001000c19ae fast_function + 334
7   libpython3.5m.dylib                 0x00000001000bd434 PyEval_EvalFrameEx + 22772
8   libpython3.5m.dylib                 0x00000001000c10c3 _PyEval_EvalCodeWithName + 1779
9   libpython3.5m.dylib                 0x00000001000b7ac1 PyEval_EvalCode + 81
10  libpython3.5m.dylib                 0x00000001000e6937 PyRun_FileExFlags + 215
11  libpython3.5m.dylib                 0x00000001000e60ea PyRun_SimpleFileExFlags + 842
12  libpython3.5m.dylib                 0x00000001000fcc5b Py_Main + 3355
13  python                              0x0000000100000dc7 main + 215
14  python                              0x0000000100000ce4 start + 52
15  ???                                 0x0000000000000007 0x0 + 7
[INFO] (/Users/travis/build/ray-project/ray/src/plasma/plasma_manager.c:1396) Disconnecting client on fd 19
[INFO] (/Users/travis/build/ray-project/ray/src/plasma/plasma_store.c:796) Disconnecting client on fd 16
2   libpython3.5m.dylib                 0x0000000100062329 type_call + 281
3   libpython3.5m.dylib                 0x000000010000fd73 PyObject_Call + 99
4   libpython3.5m.dylib                 0x00000001000bd766 PyEval_EvalFrameEx + 23590
5   libpython3.5m.dylib                 0x00000001000c10c3 _PyEval_EvalCodeWithName + 1779
6   libpython3.5m.dylib                 0x00000001000c19ae fast_function + 334
7   libpython3.5m.dylib                 0x00000001000bd434 PyEval_EvalFrameEx + 22772
8   libpython3.5m.dylib                 0x00000001000c10c3 _PyEval_EvalCodeWithName + 1779
9   libpython3.5m.dylib                 0x00000001000b7ac1 PyEval_EvalCode + 81
10  libpython3.5m.dylib                 0x00000001000e6937 PyRun_FileExFlags + 215
11  libpython3.5m.dylib                 0x00000001000e60ea PyRun_SimpleFileExFlags + 842
12  libpython3.5m.dylib                 0x00000001000fcc5b Py_Main + 3355
13  python                              0x0000000100000dc7 main + 215
14  python                              0x0000000100000ce4 start + 52
15  ???                                 0x0000000000000007 0x0 + 7
[INFO] (/Users/travis/build/ray-project/ray/src/plasma/plasma_manager.c:1396) Disconnecting client on fd 18
[INFO] (/Users/travis/build/ray-project/ray/src/plasma/plasma_store.c:796) Disconnecting client on fd 15
[INFO] (/Users/travis/build/ray-project/ray/src/plasma/plasma_manager.c:1396) Disconnecting client on fd 12
[INFO] (/Users/travis/build/ray-project/ray/src/plasma/plasma_store.c:796) Disconnecting client on fd 9
[INFO] (/Users/travis/build/ray-project/ray/src/plasma/plasma_manager.c:1396) Disconnecting client on fd 13
[INFO] (/Users/travis/build/ray-project/ray/src/plasma/plasma_store.c:796) Disconnecting client on fd 10
[INFO] (/Users/travis/build/ray-project/ray/src/plasma/plasma_manager.c:1396) Disconnecting client on fd 14
[INFO] (/Users/travis/build/ray-project/ray/src/plasma/plasma_store.c:796) Disconnecting client on fd 11
[INFO] (/Users/travis/build/ray-project/ray/src/plasma/plasma_manager.c:1396) Disconnecting client on fd 15
[INFO] (/Users/travis/build/ray-project/ray/src/plasma/plasma_store.c:796) Disconnecting client on fd 12
[INFO] (/Users/travis/build/ray-project/ray/src/plasma/plasma_manager.c:1396) Disconnecting client on fd 16
[INFO] (/Users/travis/build/ray-project/ray/src/plasma/plasma_store.c:796) Disconnecting client on fd 13
[INFO] (/Users/travis/build/ray-project/ray/src/plasma/plasma_manager.c:1396) Disconnecting client on fd 17
[INFO] (/Users/travis/build/ray-project/ray/src/plasma/plasma_store.c:796) Disconnecting client on fd 14
[INFO] (/Users/travis/build/ray-project/ray/src/plasma/plasma_manager.c:1396) Disconnecting client on fd 19
[INFO] (/Users/travis/build/ray-project/ray/src/plasma/plasma_store.c:796) Disconnecting client on fd 17
[INFO] (/Users/travis/build/ray-project/ray/src/plasma/plasma_manager.c:1396) Disconnecting client on fd 20
[INFO] (/Users/travis/build/ray-project/ray/src/plasma/plasma_manager.c:1396) Disconnecting client on fd 21
[INFO] (/Users/travis/build/ray-project/ray/src/plasma/plasma_manager.c:1396) Disconnecting client on fd 11
[INFO] (/Users/travis/build/ray-project/ray/src/plasma/plasma_store.c:796) Disconnecting client on fd 18
[INFO] (/Users/travis/build/ray-project/ray/src/plasma/plasma_store.c:796) Disconnecting client on fd 19
10098:signal-handler (1486710883) Received SIGTERM scheduling shutdown...
10098:M 10 Feb 07:14:43.662 # User requested shutdown...
10098:M 10 Feb 07:14:43.662 # Redis is now ready to exit, bye bye...
ok

The test still passes even though one of the workers failed to connect. Also, self.assertTrue(ray.services.all_processes_alive()) doesn't catch the dead worker, because workers are now started by the local scheduler rather than by services.py.
