This repository has been archived by the owner on Jan 7, 2020. It is now read-only.

uchiwa needs restarting to update results #438

Closed · jaxxstorm opened this issue Jan 27, 2016 · 35 comments

@jaxxstorm

This is really hard for me to pin down, but I want to open this to see if anyone else is having a similar problem.

Uchiwa is showing me stale results until I start the daemon. Even if I shift+refresh the page and completely clear my browser cache, the stale results still remain in the client.

Is there anything I can do to provide some debug output for this? It's very frustrating for my users, who think there's a problem, log in to a box to debug it, and find the issue was resolved ages ago.

@mvdavid7

What is your refresh parameter in the uchiwa config set to? http://docs.uchiwa.io/en/latest/configuration/uchiwa/
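For reference, the refresh interval (in seconds) lives under the uchiwa block of uchiwa.json. A minimal sketch based on the linked configuration docs, with a placeholder datacenter name and hostname:

{
  "sensu": [
    {
      "name": "example-dc",
      "host": "sensu-api.example.com",
      "port": 4567
    }
  ],
  "uchiwa": {
    "host": "0.0.0.0",
    "port": 3000,
    "refresh": 5
  }
}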

@palourde
Contributor

@jaxxstorm Not sure what you mean by until I start the daemon. Are you referring to the Uchiwa daemon?

I also recommend checking your browser console and the Uchiwa logs.

@jaxxstorm
Author

@mvdavid7 it's set to "5"

@palourde It's a typo; it's meant to say until I restart the daemon.

@palourde
Contributor

@jaxxstorm As suggested, try consulting your browser logs and the Uchiwa logs. You might also want to try a more verbose log level (see loglevel in http://docs.uchiwa.io/en/latest/configuration/uchiwa/; requires 0.14.1).
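For example, assuming the same uchiwa.json layout sketched above, a more verbose log level would just be an extra loglevel entry under the uchiwa block (values here are placeholders):

"uchiwa": {
  "refresh": 5,
  "loglevel": "debug"
}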

@neilprosser

@jaxxstorm - I've seen this issue too and find that I have to restart Uchiwa before I start seeing the latest results. The data looks fine if I drill into a client, but the summaries seem out of whack.

We've just restarted our instance of Uchiwa to clear the problem, but I'll play around with log levels and browser logs to see whether we can find anything out.

It might be worth noting that this only started happening in the last few versions, though I'm unable to pin it down to a specific version.

@jaxxstorm
Author

Thanks @neilprosser glad I'm not the only one experiencing this.

@jaxxstorm
Author

@palourde this is still ongoing for me, and I see nothing in the logs to indicate why it's happening. I have a screen capture of the issue occurring and then being resolved by a restart; is there somewhere I can send it to you off the issue? I don't want the hostnames being made public.

@shaimona

I am having the same issue: I have to restart Uchiwa for the dashboard to refresh. The dashboard is not refreshing at the interval configured in the uchiwa.json file, and there is nothing in the logs indicating any issues.

@palourde
Contributor

@jaxxstorm You can send me an email at simon.plourde@gmail.com

@palourde palourde self-assigned this Feb 27, 2016
@s-schweer

I'm having the same issue as well, and there's nothing helpful in the logs...

@jaxxstorm
Author

Hi @palourde, I've sent over the video of this as promised. Sorry for the delay.

@jaxxstorm
Author

Any update on this? I'm having to restart uchiwa once an hour at the moment

@sumit-sngpt

I too have to restart Uchiwa more frequently these days. There are stale alerts, and after a restart it works fine. I checked the logs but found nothing helpful.

@zbintliff

We ran into the issue today as well. We have Uchiwa fronting three datacenters and it didn't refresh for over 16 hours. Specific clients looked fine, but the 'checks' and 'clients' dashboards were stale. We did not have this issue until upgrading from 0.12.x to 0.14.x.

@nicholascapo

I experienced this issue too, but was able to resolve it by reducing the refresh parameter from 1000(!) to 10.

During my testing I cleared the Sensu Redis instance and reinstalled the uchiwa deb package.

@rmc3
Contributor

rmc3 commented Mar 17, 2016

+1, also experiencing this issue. I'll see if I can reproduce it in my dev environment and shim in some hopefully useful debugging output.

@rmc3
Contributor

rmc3 commented Mar 24, 2016

The issue reoccurred for me today. Both prod and dev instances stopped updating at roughly the same time, around 16:20 CDT. Both instances are configured with the same Sensu datacenter, so it's plausible that they were receiving the same data at the time that they stopped updating.

I presently have the log level set to trace in my dev instance. The last 12 or so lines of the logs, up to the point where the data stops updating, are below. They don't tell us much other than that the daemon appears to have stopped doing anything after its most recent set of GET requests to the Sensu API. It's worth noting that the logs stop before the "sending data" messages for that update run. This, together with the fact that two separate instances stopped updating at almost the same time, makes me wonder whether a particular pattern of response from the Sensu API is causing the goroutine that fetches and processes data to choke.

@palourde, any recommendations on debugging text or similar that might help us diagnose this?

Some additional information:
Prod instance is running Uchiwa 0.12.0 with Golang 1.3.3 on CentOS 6
Dev instance is running Uchiwa 0.14.3 with Golang 1.3.3 on CentOS 6
Sensu is a slightly modified 0.20.3-1, but the modifications should only affect clients.

2016-03-24 16:19:25.093939500 {"timestamp":"2016-03-24T16:19:25.093598398-05:00","level":"info","src":{"func":"github.com/sensu/uchiwa/uchiwa/daemon.(*Daemon).fetchData","line":72},"message":"Updating the datacenter REDACTED"}
2016-03-24 16:19:25.093943500 {"timestamp":"2016-03-24T16:19:25.093743802-05:00","level":"debug","src":{"func":"github.com/sensu/uchiwa/uchiwa/sensu.(*Sensu).getSlice","line":50},"message":"GET http://REDACTED:4567/stashes"}
2016-03-24 16:19:25.128619500 {"timestamp":"2016-03-24T16:19:25.128563928-05:00","level":"debug","src":{"func":"github.com/sensu/uchiwa/uchiwa/sensu.(*Sensu).getSlice","line":50},"message":"GET http://REDACTED:4567/checks"}
2016-03-24 16:19:25.131601500 {"timestamp":"2016-03-24T16:19:25.131557337-05:00","level":"debug","src":{"func":"github.com/sensu/uchiwa/uchiwa/sensu.(*Sensu).getSlice","line":50},"message":"GET http://REDACTED:4567/clients"}
2016-03-24 16:19:25.341522500 {"timestamp":"2016-03-24T16:19:25.34142188-05:00","level":"debug","src":{"func":"github.com/sensu/uchiwa/uchiwa/sensu.(*Sensu).getSlice","line":50},"message":"GET http://REDACTED:4567/events"}
2016-03-24 16:19:25.401431500 {"timestamp":"2016-03-24T16:19:25.401355259-05:00","level":"debug","src":{"func":"github.com/sensu/uchiwa/uchiwa/sensu.(*Sensu).getBytes","line":35},"message":"GET http://REDACTED:4567/info"}
2016-03-24 16:19:25.408374500 {"timestamp":"2016-03-24T16:19:25.408306867-05:00","level":"debug","src":{"func":"github.com/sensu/uchiwa/uchiwa/sensu.(*Sensu).getSlice","line":50},"message":"GET http://REDACTED:4567/aggregates"}
2016-03-24 16:19:25.550048500 {"timestamp":"2016-03-24T16:19:25.549968229-05:00","level":"trace","src":{"func":"github.com/sensu/uchiwa/uchiwa/daemon.(*Daemon).Start","line":49},"message":"Sending results on the 'data' channel"}
2016-03-24 16:19:28.285855500 {"timestamp":"2016-03-24T16:19:28.285691515-05:00","level":"trace","src":{"func":"github.com/sensu/uchiwa/uchiwa.(*Uchiwa).listener","line":88},"message":"Received results on the 'data' channel"}
2016-03-24 16:19:30.094041500 {"timestamp":"2016-03-24T16:19:30.093693807-05:00","level":"info","src":{"func":"github.com/sensu/uchiwa/uchiwa/daemon.(*Daemon).fetchData","line":72},"message":"Updating the datacenter REDACTED"}
2016-03-24 16:19:30.094045500 {"timestamp":"2016-03-24T16:19:30.093835274-05:00","level":"debug","src":{"func":"github.com/sensu/uchiwa/uchiwa/sensu.(*Sensu).getSlice","line":50},"message":"GET http://REDACTED:4567/stashes"}
2016-03-24 16:19:30.121815500 {"timestamp":"2016-03-24T16:19:30.121762641-05:00","level":"debug","src":{"func":"github.com/sensu/uchiwa/uchiwa/sensu.(*Sensu).getSlice","line":50},"message":"GET http://REDACTED:4567/checks"}
2016-03-24 16:19:30.124705500 {"timestamp":"2016-03-24T16:19:30.124661904-05:00","level":"debug","src":{"func":"github.com/sensu/uchiwa/uchiwa/sensu.(*Sensu).getSlice","line":50},"message":"GET http://REDACTED:4567/clients"}

@rmc3
Contributor

rmc3 commented Mar 24, 2016

Another data point: in the logs of one of our Sensu API servers, we can see that every 5 seconds Uchiwa makes two paginated /clients API calls, one for the first thousand clients and a second for the next thousand. This is expected, as this datacenter has just over one thousand Sensu clients in it. However, at 16:19:30, the same time we see the Uchiwa daemon's logs stop updating, we see the following in the Sensu API logs:

{"timestamp":"2016-03-24T16:19:30.122638-0500","level":"info","message":"GET /clients","remote_address":"REDACTED","user_agent":"Go 1.1 package http","request_method":"GET","request_uri":"/clients?limit=1000&offset=0","request_body":""}
{"timestamp":"2016-03-24T16:19:30.283571-0500","level":"info","message":"GET /clients","remote_address":"REDACTED","user_agent":"Go 1.1 package http","request_method":"GET","request_uri":"/clients?limit=1000&offset=1000","request_body":""}
{"timestamp":"2016-03-24T16:19:30.304643-0500","level":"info","message":"GET /clients","remote_address":"REDACTED","user_agent":"Go 1.1 package http","request_method":"GET","request_uri":"/clients?limit=1000&offset=2000","request_body":""}
{"timestamp":"2016-03-24T16:19:30.344294-0500","level":"info","message":"GET /clients","remote_address":"REDACTED","user_agent":"Go 1.1 package http","request_method":"GET","request_uri":"/clients?limit=1000&offset=3000","request_body":""}
{"timestamp":"2016-03-24T16:19:30.362130-0500","level":"info","message":"GET /clients","remote_address":"REDACTED","user_agent":"Go 1.1 package http","request_method":"GET","request_uri":"/clients?limit=1000&offset=4000","request_body":""}
{"timestamp":"2016-03-24T16:19:30.378422-0500","level":"info","message":"GET /clients","remote_address":"REDACTED","user_agent":"Go 1.1 package http","request_method":"GET","request_uri":"/clients?limit=1000&offset=5000","request_body":""}
{"timestamp":"2016-03-24T16:19:30.394185-0500","level":"info","message":"GET /clients","remote_address":"REDACTED","user_agent":"Go 1.1 package http","request_method":"GET","request_uri":"/clients?limit=1000&offset=6000","request_body":""}
{"timestamp":"2016-03-24T16:19:30.410672-0500","level":"info","message":"GET /clients","remote_address":"REDACTED","user_agent":"Go 1.1 package http","request_method":"GET","request_uri":"/clients?limit=1000&offset=7000","request_body":""}
{"timestamp":"2016-03-24T16:19:30.460350-0500","level":"info","message":"GET /clients","remote_address":"REDACTED","user_agent":"Go 1.1 package http","request_method":"GET","request_uri":"/clients?limit=1000&offset=8000","request_body":""}
{"timestamp":"2016-03-24T16:19:30.477301-0500","level":"info","message":"GET /clients","remote_address":"REDACTED","user_agent":"Go 1.1 package http","request_method":"GET","request_uri":"/clients?limit=1000&offset=9000","request_body":""}
{"timestamp":"2016-03-24T16:19:30.493842-0500","level":"info","message":"GET /clients","remote_address":"REDACTED","user_agent":"Go 1.1 package http","request_method":"GET","request_uri":"/clients?limit=1000&offset=10000","request_body":""}
{"timestamp":"2016-03-24T16:19:30.510607-0500","level":"info","message":"GET /clients","remote_address":"REDACTED","user_agent":"Go 1.1 package http","request_method":"GET","request_uri":"/clients?limit=1000&offset=11000","request_body":""}
{"timestamp":"2016-03-24T16:19:30.526880-0500","level":"info","message":"GET /clients","remote_address":"REDACTED","user_agent":"Go 1.1 package http","request_method":"GET","request_uri":"/clients?limit=1000&offset=12000","request_body":""}
{"timestamp":"2016-03-24T16:19:30.561055-0500","level":"info","message":"GET /clients","remote_address":"REDACTED","user_agent":"Go 1.1 package http","request_method":"GET","request_uri":"/clients?limit=1000&offset=13000","request_body":""}
{"timestamp":"2016-03-24T16:19:30.578465-0500","level":"info","message":"GET /clients","remote_address":"REDACTED","user_agent":"Go 1.1 package http","request_method":"GET","request_uri":"/clients?limit=1000&offset=14000","request_body":""}
{"timestamp":"2016-03-24T16:19:30.595062-0500","level":"info","message":"GET /clients","remote_address":"REDACTED","user_agent":"Go 1.1 package http","request_method":"GET","request_uri":"/clients?limit=1000&offset=15000","request_body":""}
{"timestamp":"2016-03-24T16:19:30.612245-0500","level":"info","message":"GET /clients","remote_address":"REDACTED","user_agent":"Go 1.1 package http","request_method":"GET","request_uri":"/clients?limit=1000&offset=16000","request_body":""}
{"timestamp":"2016-03-24T16:19:30.721524-0500","level":"info","message":"GET /clients","remote_address":"REDACTED","user_agent":"Go 1.1 package http","request_method":"GET","request_uri":"/clients?limit=1000&offset=17000","request_body":""}
{"timestamp":"2016-03-24T16:19:30.757014-0500","level":"info","message":"GET /clients","remote_address":"REDACTED","user_agent":"Go 1.1 package http","request_method":"GET","request_uri":"/clients?limit=1000&offset=18000","request_body":""}
{"timestamp":"2016-03-24T16:19:30.774617-0500","level":"info","message":"GET /clients","remote_address":"REDACTED","user_agent":"Go 1.1 package http","request_method":"GET","request_uri":"/clients?limit=1000&offset=19000","request_body":""}
{"timestamp":"2016-03-24T16:19:30.792330-0500","level":"info","message":"GET /clients","remote_address":"REDACTED","user_agent":"Go 1.1 package http","request_method":"GET","request_uri":"/clients?limit=1000&offset=20000","request_body":""}

It continued to flood the API server with scores of requests per second until Uchiwa (in this case, the dev instance) was restarted at 17:30:51. By that time, it was attempting to get clients 192402000 through 192403000 :)

{"timestamp":"2016-03-24T17:30:51.087669-0500","level":"info","message":"GET /clients","remote_address":"REDACTED","user_agent":"Go 1.1 package http","request_method":"GET","request_uri":"/clients?limit=1000&offset=192402000","request_body":""}

EDIT: We see the same behavior on another Sensu API server (there are multiple behind HAProxy) from the production instance, which got to fetching clients 100873000 through 100874000 before it was restarted at 16:55:

{"timestamp":"2016-03-24T16:55:19.595021-0500","level":"info","message":"GET /clients","remote_address":"REDACTED","user_agent":"Go 1.1 package http","request_method":"GET","request_uri":"/clients?limit=1000&offset=100873000","request_body":""}

@rmc3
Contributor

rmc3 commented Mar 25, 2016

I've added the following debugging text for now, to see if some unexpected value returned by the initial call is breaking the for loop in getSlice():

index 23bc6b4..765106e 100644
--- a/uchiwa/sensu/methods.go
+++ b/uchiwa/sensu/methods.go
@@ -58,6 +58,9 @@ func (api *API) getSlice(endpoint string, limit int) ([]interface{}, error) {
                        logger.Warning(err)
                }

+                logger.Debug("xPagination: ", xPagination)
+                logger.Debug("len(list): ", len(list))
+
                for len(list) < xPagination.Total {
                        offset += limit
                        params := u.Query()

Hopefully we'll get to reproduce the issue soon :)

@rmc3
Contributor

rmc3 commented Mar 25, 2016

The dev instance stopped updating again about 4 hours ago. Prod didn't go with it this time.

The logs got spammed incredibly hard by the debugging text I added, so I don't have the beginning, where the infinite loop of /clients GETs started. What I do have is filled with this:

{"timestamp":"2016-03-25T13:19:48.108528673-05:00","level":"debug","src":{"func":"github.com/sensu/uchiwa/uchiwa/sensu.(*API).getSlice","line":63},"message":"xPagination: {1000 0 1054}"}
{"timestamp":"2016-03-25T13:19:48.108611955-05:00","level":"debug","src":{"func":"github.com/sensu/uchiwa/uchiwa/sensu.(*API).getSlice","line":64},"message":"len(list): 1053"}
{"timestamp":"2016-03-25T13:19:48.128685825-05:00","level":"debug","src":{"func":"github.com/sensu/uchiwa/uchiwa/sensu.(*API).getSlice","line":63},"message":"xPagination: {1000 0 1054}"}
{"timestamp":"2016-03-25T13:19:48.128738656-05:00","level":"debug","src":{"func":"github.com/sensu/uchiwa/uchiwa/sensu.(*API).getSlice","line":64},"message":"len(list): 1053"}
{"timestamp":"2016-03-25T13:19:48.143000776-05:00","level":"debug","src":{"func":"github.com/sensu/uchiwa/uchiwa/sensu.(*API).getSlice","line":63},"message":"xPagination: {1000 0 1054}"}
{"timestamp":"2016-03-25T13:19:48.143048577-05:00","level":"debug","src":{"func":"github.com/sensu/uchiwa/uchiwa/sensu.(*API).getSlice","line":64},"message":"len(list): 1053"}

It looks like xPagination.Total = 1054 while len(list) (likely after the 2nd paginated request, where it enumerated clients 1000 to 2000) = 1053. Since no new clients are found in any of the subsequent paginated requests, the condition len(list) < xPagination.Total is always true.

I'm not sure what could have caused this. Perhaps a client was in the process of being added or removed, and this caused the API to very briefly return inconsistent data. The fact that this is a brief, transient state might also explain why the prod instance didn't experience the issue at the same time.
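To illustrate the failure mode, here is a simplified, self-contained sketch of the pagination loop described above. It is not the actual getSlice() code; fetchPage is a hypothetical stand-in for one paginated GET of /clients, hard-coded to mimic the inconsistent 1054-vs-1053 state:

package main

import "fmt"

// fetchPage is a hypothetical stand-in for one paginated GET of /clients.
// It pretends the API reports a total of 1054 clients while only 1053 can
// ever be returned across all pages, mimicking the inconsistent state above.
func fetchPage(offset, limit int) (items []interface{}, total int) {
	const reportedTotal = 1054
	const actualClients = 1053
	for i := offset; i < offset+limit && i < actualClients; i++ {
		items = append(items, i)
	}
	return items, reportedTotal
}

func main() {
	const limit = 1000
	offset := 0
	list, total := fetchPage(offset, limit)
	// Same shape as the loop in getSlice(): keep paging until the list
	// reaches the total reported by X-Pagination. Once len(list) is stuck
	// at 1053 the condition never becomes false.
	for len(list) < total {
		offset += limit
		page, _ := fetchPage(offset, limit)
		list = append(list, page...)
		fmt.Printf("offset=%d len(list)=%d total=%d\n", offset, len(list), total)
		if offset >= 5000 {
			// Cap the demo so it terminates; the real loop has no such
			// guard and keeps requesting ever-larger offsets forever.
			break
		}
	}
}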

Any thoughts on how we might mitigate this, or further troubleshooting steps?

EDIT: I should clarify that those debugging lines were moved inside the for loop, which is different from what I posted last night.

--- a/uchiwa/sensu/methods.go
+++ b/uchiwa/sensu/methods.go
@@ -58,7 +58,10 @@ func (api *API) getSlice(endpoint string, limit int) ([]interface{}, error) {
                        logger.Warning(err)
                }

                for len(list) < xPagination.Total {
+                       logger.Debug("xPagination: ", xPagination)
+                       logger.Debug("len(list): ", len(list))
                        offset += limit
                        params := u.Query()
                        params.Set("offset", strconv.Itoa(offset))

@rmc3
Contributor

rmc3 commented Mar 25, 2016

As a temporary solution, I'm running with this change, which disables paging of client GETs:

Contegix@2f4def0

@rmc3
Contributor

rmc3 commented Mar 27, 2016

A slightly more elegant solution might be to break out of the loop if a subsequent paginated request does not return any additional elements. It's not pretty, but it would mitigate the infinite-loop scenario while still allowing paginated requests.
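A rough sketch of that idea, reusing the hypothetical fetchPage helper from the earlier example (this only illustrates the guard; it is not the code from any actual PR):

// getAll pages through results but bails out as soon as a page adds no new
// elements, even if the reported total has not been reached, so an
// inconsistent total can no longer cause an infinite loop.
func getAll(limit int) []interface{} {
	offset := 0
	list, total := fetchPage(offset, limit)
	for len(list) < total {
		offset += limit
		page, _ := fetchPage(offset, limit)
		if len(page) == 0 {
			break
		}
		list = append(list, page...)
	}
	return list
}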

@rmc3
Contributor

rmc3 commented Apr 11, 2016

We've been running with Contegix@2f4def0 in prod and #464 in dev for over two weeks now. Neither has locked up.

In the meantime, since @palourde looks to be MIA, I've gone ahead and updated master of my fork to run with #464, along with the necessary import path changes to point the imports at the right namespace.

https://github.com/rmc3/uchiwa

If you need a fix before #464 gets merged and a new version is pushed, you might want to try running Uchiwa from source with my fork. See http://docs.uchiwa.io/en/latest/getting-started/#from-source. Just keep in mind that anywhere it says "github.com/sensu/uchiwa", you'll use "github.com/rmc3/uchiwa" instead.

@jaxxstorm
Author

First of all, thanks so much @rmc3 for supplying a fix for this.

I built a binary as follows. It's basically the from-source build instructions, but with minimal impact on my current system:

Ensure Go 1.3+ is installed, as well as a recent version of rake.
Clone the uchiwa-build repo: https://github.com/sensu/uchiwa-build
Edit all references to sensu/uchiwa to use rmc3/uchiwa.
Set your $GOPATH: mkdir $HOME/go && export GOPATH=$HOME/go
Run rake install_deps from within the uchiwa-build repo.
Run rake build from within the uchiwa-build repo.

This should create a file /tmp/assets/uchiwa-linux-x86_64; copy it to /opt/uchiwa/bin/uchiwa-patched.
Then update your init/upstart script as required to use this binary:

program=/opt/uchiwa/bin/uchiwa-patched

Hopefully someone from Sensu sees this soon. Ping @palourde, @calebhailey, or @portertech.

@rmc3
Contributor

rmc3 commented Apr 12, 2016

Good thought, @jaxxstorm. It's my understanding that you should also be able to build a new Uchiwa binary with plain go build, without needing the uchiwa-build repo. With typical Go settings:

go get github.com/rmc3/uchiwa
go build $GOPATH/src/github.com/rmc3/uchiwa/uchiwa.go

This should create an executable binary named "uchiwa" in your current working directory, which you can then put in place as @jaxxstorm suggested.

@palourde
Contributor

Great job, everyone, troubleshooting this issue. I'll take a look at the related PR today and hopefully have a new release ready ASAP.

@palourde
Contributor

I pushed a simple fix for this issue without disabling pagination: #478

@harishbsrinivas

+1, having the same issue on the latest release, 0.14.5-1. I'll try to implement the fix outlined by @palourde.

@rmc3
Contributor

rmc3 commented May 20, 2016

@harishbsrinivas I don't know if there have been any packages built for 0.14.4, but the release was tagged 10 days ago and it's the only release since @palourde committed the fix. If packages are available, upgrading to that version should fix the issue.

@palourde
Contributor

@harishbsrinivas This fix is included in 0.14.5-1; make sure to restart Uchiwa after the upgrade.

@jeremysmitherman

This has regressed in 0.19.0-1.

Exact same behavior: the individual client shows correct data while the summary page is stale until a restart.

@hany

hany commented Mar 13, 2017

👍 Also having this same issue with 0.22.0. The main "events" page shows some stale checks (check results I have deleted). If I click on the stale result, it takes me to the client summary page instead of the alert page (indicating it was actually deleted).

Restarting uchiwa clears this up.

@danslinky

Also experiencing the same issue as @hany

@palourde
Contributor

@hany @user9384732902

Thank you for reporting this problem. Could you open a new issue so we can properly track it, and include both browser console logs and Uchiwa logs? Also, if possible, could you include some screenshots of the behaviour you're experiencing? I'm unable to reproduce this problem, so it would help us a lot! Thanks.

@powellchristoph

powellchristoph commented Oct 10, 2017

I am currently experiencing the issue in 0.26.3-1. I will collect information and create a new bug report.

EDIT: Never mind, it was a very high refresh value that came from somewhere.
