This repository has been archived by the owner on Jul 25, 2022. It is now read-only.

Temporary solution for race between plugin_runner and device-aggregator #1036

Merged
jgrund merged 7 commits into whamcloud:master on Aug 19, 2019


AlexTalker
Contributor

Setup

4 servers, aka 2 DC: the first pair holds data (active-active), the second holds metadata (active-passive).
1 IML server.
The network between the Lustre client and the servers is IPoIB via a simple switch.
Everything runs in monitored mode.

Problem

Sometimes the following message appears in the logs:

[2019-07-04 08:44:49,214: ERROR/plugin_runner] iml-device-aggregator is not providing expected data, ensure iml-device-scanner package is installed and relevant services are running on storage servers (No JSON object could be decoded)

It usually happens:

  1. On restart/reboot of the IML server
  2. Randomly

The reason seems to be a race condition between plugin_runner and device-aggregator-daemon.
First, no dependency is declared between these services in their systemd units, and the connection goes over HTTP.
Second, since the connection is proxied by nginx, nginx returns 502 if the service hasn't started yet.
Third, even if a connection from an agent has been initiated, that doesn't mean the device information, which arrives in parallel, is already there. Thus, it is useful to wait for the incoming data.
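
The waiting idea can be sketched roughly as below. This is only an illustration, not the code in this patch: `fetch_device_data`, its `fetch` callable, and the `attempts`/`delay` parameters are hypothetical names. `fetch` stands in for whatever performs the HTTP request to the aggregator and returns the response body as a string. An nginx 502 page fails JSON decoding, and an empty JSON document is treated as "data has not arrived yet"; both cases trigger another attempt.

```python
import json
import time


def fetch_device_data(fetch, attempts=5, delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-empty JSON document.

    fetch is a hypothetical zero-argument callable that returns the
    response body as a string (assumption: it wraps the HTTP request
    to the device aggregator).
    """
    last_error = None
    for attempt in range(attempts):
        try:
            # A 502 error page or an empty body is not valid JSON.
            data = json.loads(fetch())
        except (ValueError, IOError) as e:
            last_error = e
        else:
            if data:
                return data
            # Valid JSON, but no devices reported yet: keep waiting.
            last_error = ValueError("aggregator returned no device data yet")
        if attempt < attempts - 1:
            sleep(delay)
    raise RuntimeError(
        "no device data after {} attempts: {}".format(attempts, last_error)
    )
```

A bounded number of attempts keeps a genuinely missing service visible as an error instead of hiding it behind an endless wait.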

Signed-off-by: Alex Talker <alextalker@ya.ru>

@manager-for-lustre-jenkins
Collaborator

Can one of the admins verify this patch?

@jgrund
Member

jgrund commented Jul 5, 2019

Jenkins: trigger a test run

Member

@jgrund jgrund left a comment

Some early feedback.

In addition to comments, there should be an ordering constraint added so that device-aggregator.socket comes up before nginx.service and plugin-runner.service.

This won't eliminate all the errors, but should lessen their occurrence.

In addition, the device-aggregator docker container should have a health-check that tests the connection is responsive.
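
One possible shape for such a health check is sketched below. The comment above does not say how the check should be implemented, so everything here is an assumption: `is_responsive` is a hypothetical helper, and a plain TCP connect is used for illustration even though the aggregator may actually listen on a different kind of socket.

```python
import socket


def is_responsive(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds in time.

    Assumption: the service is reachable over TCP; this only proves the
    listener accepts connections, not that it serves valid data.
    """
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except (socket.error, socket.timeout):
        return False
    conn.close()
    return True
```

A container health check could call this and exit non-zero when it returns False. A stricter check would also issue a request and validate the response.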

chroma_core/plugins/linux.py (outdated)
chroma_core/plugins/linux.py
chroma_core/plugins/block_devices.py
chroma_core/plugins/block_devices.py (outdated)
chroma_core/plugins/block_devices.py
@jgrund
Member

jgrund commented Jul 8, 2019

Jenkins: trigger a test run

@jgrund
Member

jgrund commented Jul 9, 2019

Jenkins: trigger a test run

@jgrund
Member

jgrund commented Jul 9, 2019

Jenkins: trigger a test run

@jgrund
Member

jgrund commented Jul 11, 2019

Jenkins: trigger a test run

@jgrund
Member

jgrund commented Jul 12, 2019

Jenkins: trigger a test run

@AlexTalker AlexTalker changed the title WIP: Temporary solution for race between plugin_runner and device-aggregator Temporary solution for race between plugin_runner and device-aggregator Jul 26, 2019
AlexTalker and others added 6 commits August 15, 2019 18:08
Co-Authored-By: Joe Grund <grundjoseph@gmail.com>
Signed-off-by: Alex Talker <alextalker@ya.ru>
Signed-off-by: Alex Talker <alextalker@ya.ru>
Signed-off-by: Alex Talker <alextalker@ya.ru>
Signed-off-by: Alex Talker <alextalker@ya.ru>
Signed-off-by: Joe Grund <jgrund@whamcloud.io>
@jgrund
Member

jgrund commented Aug 16, 2019

Refreshed on latest master

@jgrund
Member

jgrund commented Aug 16, 2019

Jenkins: trigger a test run

@jgrund jgrund self-requested a review August 16, 2019 13:35
Signed-off-by: Joe Grund <jgrund@whamcloud.io>
@jgrund
Member

jgrund commented Aug 16, 2019

Jenkins: trigger a test run

@jgrund jgrund requested a review from johnsonw August 19, 2019 13:58
@jgrund jgrund merged commit aeab881 into whamcloud:master Aug 19, 2019
4 participants