
Request timed out causing meteor instances to be restarted #753

Closed
kaihendry opened this issue Apr 9, 2019 · 6 comments
Comments

@kaihendry
Contributor

https://media.dev.unee-t.com/2019-04-09/request-timed-out.mp4

Meteor instances in production seem to get stuck and fail to respond to health checks.

@kaihendry
Contributor Author

kaihendry commented Apr 9, 2019

@nbiton I'm fairly confident it goes down (becomes unresponsive) during one of those repeated Bugzilla AJAX retries, as mentioned in #699.

Actually, I am not sure anymore, since the log times and the event don't line up exactly.

@kaihendry
Contributor Author

We need to improve logging, since production doesn't appear to log requests right now. Did we pull #631 out of prod? I'm wondering why the logs are not on one line.

https://media.dev.unee-t.com/2019-04-09/case-dev-vs-prod-logging.mp4
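For reference, one-record-per-line logging usually comes down to emitting each request as a single JSON object. This is a minimal sketch with a hypothetical helper name; the real MEFE logger from #631 isn't shown in this thread:

```javascript
// Hypothetical helper -- not the actual MEFE/#631 logger.
function formatLogLine(fields) {
  // JSON.stringify without an indent argument keeps the whole record on
  // one line, which is what log aggregators expect.
  return JSON.stringify(fields);
}

const line = formatLogLine({
  method: 'get',
  endpoint: '/rest/bug/73397',
  statusCode: 200,
  duration: 74517,
});
console.log(line);
// {"method":"get","endpoint":"/rest/bug/73397","statusCode":200,"duration":74517}
```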

@kaihendry
Contributor Author

If NodeJS locks up and gets killed, there is some downtime before it comes back online. This could be expedited by #750, since that proposal would avoid the npm install stage in our current Docker image.

@kaihendry
Contributor Author

We reproduced the problem locally, in the sense that we see Meteor taking a very long time to process an HTTP.call:

I20190412-10:35:46.717(8)? { statusCode: 200,
I20190412-10:35:46.717(8)?   method: 'get',
I20190412-10:35:46.718(8)?   endpoint: '/rest/bug/73397/comment',
I20190412-10:35:46.718(8)?   payload: { api_key: 'secret' },
I20190412-10:35:46.718(8)?   duration: 74517 }
I20190412-10:35:46.722(8)? { statusCode: 200,
I20190412-10:35:46.722(8)?   method: 'get',
I20190412-10:35:46.723(8)?   endpoint: '/rest/bug/73397',
I20190412-10:35:46.723(8)?   payload: { api_key: 'secret' },
I20190412-10:35:46.723(8)?   duration: 74525 }

MEFE reports 74 s while Bugzilla (below) reports about half a second. This is the crux of the problem:

bugzilla_1   | 172.24.0.1 - - [12/Apr/2019:02:25:05 +0000] "GET /rest/bug/70142/comment?api_key=secret HTTP/1.1" 200 1352 500 "-" "-"
bugzilla_1   | 172.24.0.1 - - [12/Apr/2019:02:25:05 +0000] "GET /rest/bug/70142?api_key=secret HTTP/1.1" 200 3239 585 "-" "-"

@kaihendry
Contributor Author

Now that we have things running locally, we know it's not a DevOps issue. It appears to be related to how caseNotifications are read, which causes a huge delay. @nbiton is working on a fix.
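The thread doesn't show the actual caseNotifications code, but a common cause of this kind of read-side delay is re-scanning the whole notification set once per case. A hypothetical sketch (invented data shapes and names) contrasting that pattern with a single grouped pass:

```javascript
// Invented shapes -- the real caseNotifications schema isn't shown in this thread.
const notifications = [...Array(5000)].map((_, i) => ({ caseId: i % 500, read: false }));
const cases = [...Array(500)].map((_, i) => ({ id: i }));

// Slow pattern: re-scan every notification for every case,
// O(cases * notifications).
const slow = cases.map(
  (c) => notifications.filter((n) => n.caseId === c.id && !n.read).length
);

// Faster pattern: group once, then do a constant-time lookup per case,
// O(cases + notifications).
const unreadByCase = new Map();
for (const n of notifications) {
  if (!n.read) unreadByCase.set(n.caseId, (unreadByCase.get(n.caseId) || 0) + 1);
}
const fast = cases.map((c) => unreadByCase.get(c.id) || 0);

console.log(slow.every((v, i) => v === fast[i])); // true
```

The same idea applies to MongoDB queries: one indexed, grouped read instead of one collection scan per case.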

@franck-boullier
Member

This should now be fixed thanks to throttling and the implementation of unee-t/bz-database#129.
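For reference, the throttling idea can be sketched as a small wrapper (hypothetical names; the actual fix lives in unee-t/bz-database#129 and isn't reproduced here):

```javascript
// Minimal throttle sketch: at most one underlying call per `intervalMs`;
// extra calls inside the window are dropped (variants could queue instead).
function throttle(fn, intervalMs) {
  let last = 0;
  return (...args) => {
    const now = Date.now();
    if (now - last >= intervalMs) {
      last = now;
      return fn(...args);
    }
    return undefined; // call suppressed
  };
}

// Hypothetical usage: wrap the sync trigger so a burst of case updates
// collapses into a single request per second.
let calls = 0;
const syncCase = throttle(() => { calls += 1; }, 1000);
for (let i = 0; i < 100; i++) syncCase();
console.log(calls); // 1
```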
