Have a cluster issue for when disk performance degrades #1114
Comments
While we are at it, maybe it should also warn when a) disk space and b) main memory get close to exhaustion?
There are (or there were) some stats about disk/memory usage in the reported stats.
My first instinct is that we should seriously limit the amount of work we do on this (at the very least, I think we should be extremely careful about introducing these types of features). There are many, many existing monitoring tools, and some of them are extremely good and very widely adopted. If we start introducing these features ourselves, we have to make sure they offer people something better or more convenient than what they can get with existing tools, and I suspect that would be very difficult to do and would involve a very non-trivial amount of work. I think a good solution here might be to introduce an issue that pops up when a node in the cluster has a significantly higher latency than the average latency in the cluster. This will automatically take care of a whole class of issues (disk performance, network issues, hardware failures, etc.) without having to reimplement existing monitoring tools. It would help people monitor the cluster, would involve relatively little work, and wouldn't take us too far into the monitoring tools rabbit hole. Also, I think we should wait to do this until we get more information about the problems people encounter when operating RethinkDB. So I'd try to wait to schedule this until we finish the low-hanging ReQL issues and the performance/scalability issues.
I think the latency issue might be a good one to add. This came up because a user had a failing disk, and from his perspective it just seemed like RethinkDB was slow (he'd been waiting 2 days for replication to finish). It fooled the rest of us for a while, and we probably would have spent a long time debugging it had the user not happened to check dmesg and notice some error messages. I think we should actually prioritize this a bit, maybe not in 1.8 but in 1.9, because it leads to a really bad experience when this happens, and a latency-checking issue is not too hard to add and would catch a lot of cases like this. We shouldn't do fancy stuff because, like you said, there are better monitoring solutions. But for the users not running them (of which I'm pretty sure there are many), having something bare-bones could seriously improve the experience.
I agree with this (this being an out-of-whack latency issue, and prioritizing it a bit).
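To make the "out-of-whack latency" idea concrete, here is a minimal sketch in Python. It is not RethinkDB code: the function name `find_latency_outliers`, the `factor` and `min_nodes` parameters, and the sample numbers are all hypothetical. One assumption worth flagging is that each node is compared against the mean of the *other* nodes rather than the cluster-wide mean, so that a single very slow node does not drag the baseline up and hide itself.

```python
from statistics import mean


def find_latency_outliers(latencies_ms, factor=3.0, min_nodes=3):
    """Return the nodes whose latency exceeds `factor` times the mean
    latency of the other nodes. All names and defaults are hypothetical."""
    # With too few nodes there is no meaningful baseline to compare against.
    if len(latencies_ms) < min_nodes:
        return []
    outliers = []
    for node, latency in latencies_ms.items():
        # Baseline excludes the node under test.
        others = [v for n, v in latencies_ms.items() if n != node]
        baseline = mean(others)
        if baseline > 0 and latency > factor * baseline:
            outliers.append(node)
    return sorted(outliers)


# Made-up sample: three healthy nodes and one with a failing disk.
samples = {"node_a": 4.0, "node_b": 5.0, "node_c": 6.0, "node_d": 90.0}
print(find_latency_outliers(samples))
```

A relative check like this needs no per-deployment tuning of absolute thresholds, which is part of why it could cover disk, network, and hardware problems at once. It would not catch the case where every node degrades together.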
When disk performance degrades too much the cluster becomes very unresponsive and it looks like other things are wrong. It would be really nice to have an issue which popped up when writes to disk were taking longer than could be reasonably expected.
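As a rough illustration of "writes to disk taking longer than could be reasonably expected", the sketch below times a small synced write and compares it to a threshold. This is only an assumption about how such a probe might look, not how RethinkDB measures disk latency; `timed_write_seconds`, `write_is_slow`, the 1 MiB payload, and the 0.5 second threshold are all hypothetical.

```python
import os
import tempfile
import time


def timed_write_seconds(directory, size_bytes=1 << 20):
    """Write `size_bytes` to a temporary file in `directory`, force it to
    disk with fsync, and return the elapsed wall-clock time in seconds."""
    payload = b"\0" * size_bytes
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            # fsync so the measurement covers the device, not just the page cache.
            os.fsync(f.fileno())
        return time.perf_counter() - start
    finally:
        os.remove(path)


def write_is_slow(elapsed_seconds, threshold_seconds=0.5):
    """Hypothetical rule: flag the write if it exceeded the threshold."""
    return elapsed_seconds > threshold_seconds


elapsed = timed_write_seconds(tempfile.gettempdir())
print("slow disk" if write_is_slow(elapsed) else "disk ok")
```

A fixed threshold is the weak point here, since what counts as "reasonable" depends on the hardware. That is one argument for the relative, cluster-wide latency comparison discussed in the comments.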