[2.2] Empty hostgroup now cause Arbiter to fail to start #1468

Open
lostmimic opened this issue Jan 20, 2015 · 14 comments

Comments

@lostmimic

This is related to issue #1050

I run multiple environments, and the service configs contain a large set of services that may or may not be used in a given environment. Previously, if a hostgroup was empty, it was simply ignored and startup moved along. Since the change for issue #1050, an empty hostgroup causes the Arbiter to fail to start.

Looking through other issues, I found #851, where this exact scenario was discussed when someone wanted a Nagios feature brought over. Naparuba said "But it should just be a warning, not a full block of the start." and we are now seeing a full block of the start in this scenario :(

@naparuba
Contributor

Argh, yes, this is not good. I'll have a look; it is indeed a regression from the chosen behavior.

@naparuba
Contributor

Hmm, I cannot reproduce this with the tests (see tests/test_hostgroup_no_host.py, which has been there for a long time).

I'll try with a full launch.

@naparuba
Contributor

OK, empty hostgroups are fine, so is the problem that a service was applied to an empty group? I'll take a look at this case too.

@naparuba
Contributor

OK, in fact the test already covers this case.

Can you provide a configuration sample (or, better, a pull request extending the test_hostgroup_no_host.py test, if you know how)? :)

Thanks

@lostmimic
Author

This used to work in 2.0.3 (we are upgrading because of another bug, in a module we need, that was fixed in 2.2).

We are using the mod-import-aws plugin to generate the hosts. Here is a sample configuration that threw an error:

define host{
    name nginx-app-hosts
    hostgroups +nginx-app-hosts

    register 0
}

...

define service{
    name generic-nginx-process-check
    use generic-active-critical-service
    service_description Assure Nginx Process is Running
    hostgroup_name nginx-app-hosts
    check_command check-nginx-procs
    servicegroups nginx-service-checks
    action_url 0006NginxDown
}

and the /tmp/bad_start_for_arbiter file had this in it:

[1421782614] ERROR: [Shinken] [service::UNKNOWN-SERVICE] the hostgroup '[u'nginx-app-hosts']' is unknown
[1421782614] ERROR: [Shinken] The service 'Assure Nginx Process is Running' is not bound do any host.
[1421782614] ERROR: [Shinken] [items] In Assure Nginx Process is Running is incorrect ; from /etc/shinken/templates/template-services.cfg:120
[1421782614] ERROR: [Shinken] [service::UNKNOWN-SERVICE] a service has been defined without service_description in unknown

I think what is going on is that the hostgroup is not being created at all because no instance in that EC2 environment matches it.

@lostmimic
Author

So I was able to fix the unknown-hostgroup error by explicitly adding all the missing hostgroups to one of the hostgroup configs (I did not have to do that on 2.0.3).

But we still see these types of errors:

[1421861689] ERROR: [Shinken] [service::UNKNOWN-SERVICE] a service has been defined without service_description in unknown

If a service is a template, do we still need a service_description? (We didn't get blocked by this on 2.0.3.)

[1421861689] ERROR: [Shinken] The service 'Assure Our Process is Running' is not bound do any host.
[1421861689] ERROR: [Shinken] [items] In Assure Our Process is Running is incorrect ; from /etc/shinken/services/services_productf.cfg:9

It seems that if we define a service that is not attached to any hosts (because the hostgroup is an empty set) we get these errors.
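The two failure modes reported so far (a hostgroup that is never defined versus one that is defined but has no members) can be told apart with a small lookup. The following is only a rough sketch in plain Python: `resolve_hostgroup` and the dict-based registry are hypothetical names for illustration and are not Shinken's API.

```python
def resolve_hostgroup(hostgroups, name):
    """Classify a hostgroup lookup and return (status, hosts).

    `hostgroups` is a hypothetical registry: a dict mapping a group name
    to the list of host names that belong to it.
    """
    if name not in hostgroups:
        # The group was never defined at all.
        return "unknown", []
    members = hostgroups[name]
    if not members:
        # The group exists but currently has no hosts.
        return "empty", []
    return "ok", list(members)


registry = {"nginx-app-hosts": [], "web": ["web-1", "web-2"]}
status_empty = resolve_hostgroup(registry, "nginx-app-hosts")
status_missing = resolve_hostgroup(registry, "db-hosts")
status_ok = resolve_hostgroup(registry, "web")
```

Under this sketch, only the "unknown" case would arguably deserve to be treated as a configuration problem; the "empty" case is the one this issue asks to tolerate.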

@lostmimic
Author

It looks like it is not splitting up the hosts in the host_group and is instead taking them as one large string:

ERROR: [Shinken] The service 'Check Health Page' got an unknown host_name 'product-prod-eu-product-6_i-12345678,product-prod-eu-product-2_i-12345678,product-prod-eu-product-4_i-12345678,product-prod-eu-product-3_i-12345678,product-prod-eu-product-5_i-12345678,product-prod-eu-product-1_i-12345678'.

@Seb-Solon
Contributor

Don't bother, I made those commits. I may have a look this weekend and fix it quickly with the provided doc :)

@lostmimic
Author

Too late, I dug a little haha...

Seems I found the change in shinken/complexexpression.py as part of "Merge : Rework-Parsing-Clean branch after drift" (2f0e42f)

-            elts = hg.get_hosts().split(',')
+            elts = hg.get_hosts()

That seems to be because get_hosts() changed: it now returns a list rather than a string, with an empty list instead of an empty string when there are no members (which, I'm guessing, is the oddity we originally noticed with empty groups).

     def get_hosts(self):
-        return getattr(self, 'members', '')
+        if getattr(self, 'members', None) is not None:
+            return self.members
+        else:
+            return []

And this is because the Itemgroup class changed the members attribute from a string to a list, though it should have split on the commas...

properties.update({
-        'members': StringProp(fill_brok=['full_status']),
+        'members': ListProp(fill_brok=['full_status'], default=None, split_on_coma=True),

But I haven't gone further than that.
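To make the string-versus-list change concrete, here is a minimal sketch of a normalizer that accepts either form and always yields a clean list of names. It is written for Python 3 and is not taken from Shinken; `split_members` is a hypothetical helper, and the assumption that a list entry may itself still contain commas is based only on the error message quoted in this thread.

```python
def split_members(members):
    """Normalize a 'members' value to a list of non-empty, stripped names.

    Accepts the old comma-separated string form, the new list form, or None.
    """
    if members is None:
        return []
    if isinstance(members, str):
        members = [members]
    names = []
    for entry in members:
        # Assumption: an entry may still hold several comma-joined names.
        for name in entry.split(","):
            name = name.strip()
            if name:
                names.append(name)
    return names


old_style = split_members("web-1, web-2")
new_style = split_members(["web-1", "web-2"])
unsplit = split_members(["web-1,web-2"])
empty = split_members("")
# For comparison: splitting an empty string directly yields one empty name.
naive = "".split(",")
```

Note the last line: a bare `''.split(',')` produces `['']`, a list with one empty name rather than an empty list, which is one way the old string-based code could behave oddly with empty groups.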

@Seb-Solon
Contributor

Yeah, that's the way it should be done. Members should be a list; that's not an issue. To avoid bailing out, the trick is to put the message into the config warnings instead of the config errors. See the function append_unknown_host in itemgroup (or something like that).
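The warning-versus-error idea could look roughly like the sketch below. It is an illustration only: the class `ServiceSketch` and the attribute names `configuration_warnings` and `configuration_errors` are assumptions made for this example, not confirmed by this thread as the actual Shinken code.

```python
class ServiceSketch:
    """Hypothetical stand-in for a service item, for illustration only."""

    def __init__(self, description, hosts):
        self.description = description
        self.hosts = hosts
        self.configuration_warnings = []
        self.configuration_errors = []

    def check(self):
        """Return True when nothing blocking was found."""
        if not self.description:
            # A missing description is treated as blocking in this sketch.
            self.configuration_errors.append(
                "service has no service_description")
        if not self.hosts:
            # An unbound service is reported but does not block startup.
            self.configuration_warnings.append(
                "service '%s' is not bound to any host" % self.description)
        return not self.configuration_errors


svc = ServiceSketch("Assure Nginx Process is Running", [])
can_start = svc.check()
```

With this split, a service whose hostgroup resolved to no hosts still leaves a trace in the warnings list, while only real errors decide whether startup is blocked.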

@lostmimic
Author

Hey Sebastion, what about this error:

ERROR: [Shinken] The service 'Check Health Page' got an unknown host_name 'product-prod-eu-product-6_i-12345678,product-prod-eu-product-2_i-12345678,product-prod-eu-product-4_i-12345678,product-prod-eu-product-3_i-12345678,product-prod-eu-product-5_i-12345678,product-prod-eu-product-1_i-12345678'.

It seems like it took the list, joined it into a single host_name, and then failed because that host didn't exist.

@Seb-Solon
Contributor

Yeah, because host_name should be split on commas. I'm pretty sure it was defined like that in the conf. The thing is, the doc was not clear on whether this should be a list or a string.
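A short sketch of why the lookup fails when host_name is not split: the whole comma-joined string is compared against the known hosts as if it were a single name. The helper `find_unknown_hosts` and the sample host names are hypothetical and only illustrate the point; this is not Shinken's lookup code.

```python
def find_unknown_hosts(host_name, known_hosts):
    """Split a comma-separated host_name and return the names not in known_hosts."""
    names = [n.strip() for n in host_name.split(",") if n.strip()]
    return [n for n in names if n not in known_hosts]


known = {"product-1", "product-2"}
joined = "product-1,product-2"

# Looking up the unsplit string fails even though both hosts exist.
whole_string_known = joined in known
# Splitting first resolves every name.
unknown_after_split = find_unknown_hosts(joined, known)
unknown_with_typo = find_unknown_hosts("product-1, product-9", known)
```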

@lostmimic
Author

Any update on this bug?

@titilambert titilambert modified the milestone: 2.4 (Noteworthy Nagapie) **we will find :)** Mar 9, 2015
@Seb-Solon
Contributor

We should rework the code a bit more to handle that properly. As we are in freeze mode, we will do it in the next release.

@Seb-Solon Seb-Solon modified the milestones: 2.6 (), 2.4 (Noteworthy Nagapie) **we will find :)** Mar 18, 2015
@titilambert titilambert mentioned this issue Mar 18, 2015
11 tasks
Seb-Solon pushed a commit to Alignak-monitoring/alignak that referenced this issue Sep 15, 2015