Temporary solution for race between plugin_runner and device-aggregator #1036
Can one of the admins verify this patch? |
Jenkins: trigger a test run |
Some early feedback.
In addition to the comments, an ordering constraint should be added so that device-aggregator.socket comes up before nginx.service and plugin-runner.service.
This won't eliminate all the errors, but it should make them less frequent.
In addition, the device-aggregator Docker container should have a health check that tests whether the connection is responsive.
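As a rough illustration of the health-check idea, here is a minimal sketch in Python. It only checks that a TCP connection can be opened within a timeout, which is an assumption about what "responsive" should mean here; the function name `is_responsive` and the host/port arguments are hypothetical and not taken from this PR.

```python
import socket


def is_responsive(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) opens within `timeout`.

    Hypothetical helper: a real health check would more likely send a
    request and inspect the reply, not just open the connection.
    """
    try:
        # create_connection raises OSError (refused, unreachable, timed out)
        # when the listener is not accepting connections.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A container health check could wrap this in a small script that exits with 0 when the function returns True and 1 otherwise.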
73c9a6c to cc716c0
14c38c3 to b12a881
Signed-off-by: Alex Talker <alextalker@ya.ru>
Co-Authored-By: Joe Grund <grundjoseph@gmail.com>
Signed-off-by: Joe Grund <jgrund@whamcloud.io>
84fcbd7 to 3efae4d
Refreshed on latest master |
Setup
4 servers, aka 2 DC: the first pair holds data (active-active), the second holds metadata (active-passive).
1 IML server.
The network between the Lustre client and the servers is IPoIB via a simple switch. Everything runs in monitored mode.
Problem
Sometimes the following message is seen in the logs:
Usually it happens:
The reason seems to be a race condition between plugin_runner and device-aggregator-daemon.
First, no dependency is declared between these services in their systemd units, and the connection goes via HTTP.
Second, since the connection is proxied by nginx, nginx returns 502 if the service hasn't started yet.
Third, a connection initiated by the agent doesn't mean the device information has already arrived, because the two happen in parallel. Thus, it is useful to wait for the incoming data.
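The second and third points can be sketched as a single polling loop. This is only an illustration under stated assumptions, not the code in this PR: `fetch` is a hypothetical callable returning an `(http_status, body)` pair, a 502 is treated as "the upstream has not started yet", and an empty body with status 200 is treated as "no device data has arrived yet".

```python
import time


def wait_for_devices(fetch, timeout=30.0, interval=1.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll `fetch` until it yields device data or `timeout` seconds pass.

    `fetch` is a hypothetical callable returning (http_status, body).
    Returns the body on success, or None if the deadline is reached.
    """
    deadline = clock() + timeout
    while True:
        status, body = fetch()
        if status == 200 and body:
            return body
        # 502: nginx answered but the proxied service is not up yet.
        # 200 with an empty body: the service is up but has no data yet.
        # Anything else is not part of the race and is reported at once.
        if status not in (200, 502):
            raise RuntimeError(f"unexpected HTTP status {status}")
        if clock() >= deadline:
            return None
        sleep(interval)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real delays; a caller would normally leave them at their defaults.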
Signed-off-by: Alex Talker alextalker@ya.ru