Expose stats through the ReQL admin API #2885
Comments
In addition to the perfmon stats, we could also expose the number of documents in each shard (which comes from performing a distribution query) |
To get the ball rolling on a stats proposal, I'll first give an example of what stats we currently provide, and then discuss which ones I think are worthwhile to have in the ReQL stats tables. I am trying to keep modification of existing perfmons to a minimum; new stats can be handled later.
Server stats:
Proposed format for a
Table stats (per server):
Proposed format for a
The available interface for these tables is nothing special; they should work just like the existing
Final thoughts:
|
I agree that this project should be done one step at a time, but I think we should also use this thread to discuss what we want the stats to look like when we ship
|
/cc @coffeemug; I think this is still on the list of things you care about. |
I do care about this (and all other user-facing issues in reql-admin, which are pretty much all of them 😄) I'll comment on the format details next week. |
Good wishlist, I think a few things are infeasible, but most can be incorporated. A few comments:
I don't think we should be tracking usage by things other than RethinkDB. These stats should be for service-monitoring, not server-monitoring. If users want that information, they should use real server monitoring tools, which we shouldn't be trying to compete with.
I think this would be prohibitively expensive to track, and would involve a lot of work, but perhaps it could be done.
Do we officially support multiple disks on a single server? I can imagine a user could use a work-around to make this work, but I think we should just report the disk used by the
Is this feasible to calculate in a moving window, or would this be for the entire lifetime of the cluster? In particular, I don't think it's feasible to give this at a cluster level unless we're averaging these values across the cluster, which is incorrect/misleading. As for giving these values for individual tables, it would be pretty difficult to categorize queries in that way, and we're probably better off giving per-server statistics like this.
Here are updated proposals for the stats tables. I tried to incorporate as much as I could. This is a wishlist at the moment, but it should still be practical.
server_stats
table_stats
|
Forgot to fill in the |
This is a lot of stuff that you have here. I will try to cover what I have learned from working with the existing API.
Server Stats Purpose
{
"id": <UUID>,
"name": <STRING>,
"coroutines": {
"active": <NUMBER>,
"allocated": <NUMBER>
},
"query_language": {
"queries_per_sec": <NUMBER>,
"active_queries": <NUMBER>,
"total_queries": <NUMBER>,
"query_duration_ms": {
"50_percentile": <NUMBER>,
"90_percentile": <NUMBER>,
"99_percentile": <NUMBER>
}
},
"disk": {
"tables": {
<table_id>: {
"db": <STRING>,
"table": <STRING>,
"table_id": <STRING>,
"used_bytes": <NUMBER>,
"reads_per_sec": <NUMBER>,
"writes_per_sec": <NUMBER>
}, ...
},
"used_bytes": <NUMBER>,
"free_bytes": <NUMBER>,
"total_bytes": <NUMBER>,
"reads_per_sec": <NUMBER>,
"writes_per_sec": <NUMBER>
},
"memory": {
"tables": {
<table_id> : {
"db": <STRING>,
"table": <STRING>,
"table_id": <UUID>,
"used_bytes": <NUMBER>
}, ...
},
"used_bytes": <NUMBER>,
"free_bytes": <NUMBER>,
"total_bytes": <NUMBER>,
"active_swap": <BOOL>
},
"cpu": {
"user": <NUMBER>,
"system": <NUMBER>,
"total": <NUMBER>
},
"network": {
"cluster_latency_ms": <NUMBER>,
"intracluster": {
<SERVER_ID> : {
"sent_bytes": <NUMBER>,
"received_bytes": <NUMBER>,
"active_connections": <NUMBER>,
"total_connections": <NUMBER>
},
"sent_bytes": <NUMBER>,
"received_bytes": <NUMBER>,
"active_connections": <NUMBER>,
"total_connections": <NUMBER>
},
"clients": {
"sent_bytes": <NUMBER>,
"received_bytes": <NUMBER>,
"active_connections": <NUMBER>,
"total_connections": <NUMBER>
}
}
} |
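As a rough illustration of how a client might consume a document shaped like the wishlist above, here is a minimal Python sketch that pulls the per-table disk usage out of one server document. The sample values, UUID-like strings, and the helper name `table_disk_usage` are invented for the example; only the field names come from the proposal, and nothing here reflects a shipped API.

```python
def table_disk_usage(server_doc):
    """Return {'db.table': used_bytes} for one server_stats-style document."""
    tables = server_doc["disk"]["tables"]
    return {
        f"{entry['db']}.{entry['table']}": entry["used_bytes"]
        for entry in tables.values()
    }


# Invented sample data that follows the proposed field names (assumption).
sample_doc = {
    "id": "00000000-0000-0000-0000-000000000001",
    "name": "server_a",
    "disk": {
        "tables": {
            "table-uuid-1": {
                "db": "test",
                "table": "users",
                "table_id": "table-uuid-1",
                "used_bytes": 4096,
            },
            "table-uuid-2": {
                "db": "test",
                "table": "logs",
                "table_id": "table-uuid-2",
                "used_bytes": 1024,
            },
        },
        "used_bytes": 5120,
    },
}

usage = table_disk_usage(sample_doc)
# Table with the largest on-disk footprint on this server.
largest = max(usage, key=usage.get)
print(usage, largest)
```

Keying the nested `tables` object by table ID keeps lookups cheap, but it also means every consumer has to walk the object to get readable names, which is part of why the nesting is questioned later in the thread.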
A few thoughts/questions:
|
@wojons and @neumino, thanks for bringing up these concerns, I'll try to address them here:
If we want to provide a way to clear running totals, I propose allowing a delete of the table stats, like
We can add these back in, I omitted them to avoid giving too much information to users. If we do give these kinds of stats to users, they would probably need to be per-server to avoid the discontinuities mentioned above.
Yes. We currently don't have tracking for this, but it shouldn't be too hard to get it. This is wishlist stuff and will not be in the first draft.
My biggest concern here is any stat that is both per-server and per-table. For example, the
We could have a
With how we're handling server membership in the cluster, I don't think we will have any special handling for a down server, it will just be omitted from the results.
I don't think this is feasible to track. Existing perfmons don't give us a way to store history, and the memory requirements would be difficult to manage. Stats are collected on-demand, though they are continuously 'computed'.
I agree this results in a rather nasty query, and that #2708 would simplify this a lot. Note that getting the rows read/written (rather than disk reads/writes) would be much simpler. |
Just a couple of points: Having disk writes/s reads/s is not a useful metric in general.
That would be a per-table property. @Tryneus you added a I don't think we should include disk i/o and memory usage in the
Finally, (@wojons) percentiles for the perfmons that measure timings would be really cool. I think that might be the next step, but shouldn't be in the first version. |
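A small sketch of why percentile stats are awkward to report at the cluster level, as raised earlier in the thread: averaging per-server percentiles does not give the percentile of the pooled query durations. The durations below are invented, and the nearest-rank `percentile` helper is just one common definition chosen for the example, not something the perfmons implement.

```python
def percentile(values, pct):
    """Nearest-rank percentile (pct is an integer 1..100) of a non-empty list."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, clamped to at least rank 1.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Invented query durations in milliseconds for two servers.
server_a = [1] * 9 + [2]   # mostly fast queries
server_b = [100] * 10      # uniformly slow queries

p90_a = percentile(server_a, 90)
p90_b = percentile(server_b, 90)

# Naive cluster value: average of the per-server percentiles.
averaged_p90 = (p90_a + p90_b) / 2

# Value computed over all queries in the cluster.
pooled_p90 = percentile(server_a + server_b, 90)

print(averaged_p90, pooled_p90)
```

The two numbers differ widely, which is the reason for keeping such stats per server: a cluster-level percentile would need the raw samples (or a mergeable summary) from every server rather than the already-reduced values.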
@coffeemug, could you make a decision on the format of |
As to which perfmons we might want to hide from the table: Note that some perfmons are marked as "internal metric" (or their description uses the word obscure or similar). These are likely the ones we want to skip. I think @wojons is using most of the remaining ones for his monitoring solution, so we should not remove those unless there's a good reason for doing so. |
The perfmons in the serializer that have extents as their unit should probably all be in bytes instead. |
With respect to which stats to include, I'd like to defer to @danielmewes and @wojons -- they seem to understand what is and isn't needed really well and have done a lot more research than I have (especially @wojons, since he actually built a RethinkDB monitoring product and was responsible for operating infrastructures before). My only guidance is to include as few metrics as possible, give them good names, and clearly include units into the names in a consistent way to avoid any confusion. However, I don't like the idea of a This way the user can use |
I'll go through this again and prepare a full proposal for which perfmons I think we should have in v1. |
Here's a proposal adapted from @Tryneus's earlier one. It describes the entries in a unified table.
one per server
{
server_id: <UUID>,
server_name: <STRING>,
coroutines: {
active: <NUMBER>,
allocated: <NUMBER>
},
query_server: {
queries_per_sec: <NUMBER>,
queries_total: <NUMBER>,
queries_active: <NUMBER>
},
network: {
intracluster: {
sent_bytes_per_sec: <NUMBER>,
sent_bytes_total: <NUMBER>,
received_bytes_per_sec: <NUMBER>,
received_bytes_total: <NUMBER>,
open_connections: <NUMBER>
},
clients: {
sent_bytes_per_sec: <NUMBER>,
sent_bytes_total: <NUMBER>,
received_bytes_per_sec: <NUMBER>,
received_bytes_total: <NUMBER>,
open_connections: <NUMBER>
}
},
started_at: <TIME>
}
one per table/server pair
{
server_id: <UUID>,
server_name: <STRING>,
table_id: <UUID>,
db_name: <STRING>,
table_name: <STRING>,
indexes: {
<name>: {
reads_per_sec: <NUMBER>,
reads_total: <NUMBER>,
writes_per_sec: <NUMBER>,
writes_total: <NUMBER>
},
...
},
disk: {
read_bytes_per_sec: <NUMBER>, // currently serializer_block_reads, must be modified to report bytes
read_bytes_total: <NUMBER>,
written_bytes_per_sec: <NUMBER>, // currently serializer_block_writes, must be modified to report bytes
written_bytes_total: <NUMBER>,
commits_per_sec: <NUMBER>, // currently serializer_index_writes
commits_total: <NUMBER>,
space_usage: {
lba_bytes: <NUMBER>, // currently serializer_lba_extents, multiply by extent size
data_bytes: <NUMBER>, // approximate as follows: take serializer_data_extents, multiply by extent size, subtract serializer_old_garbage_block_bytes
garbage_bytes: <NUMBER> // currently serializer_old_garbage_block_bytes
}
},
cache: {
in_use_bytes: <NUMBER> // Should be straightforward to add
}
}
one per server that timed out
{
server_id: <UUID>,
server_name: <STRING>,
error: "Timed out. Unable to retrieve stats."
}
All stats have totals, and I think this is ok because they are all reported per server. I've included the
Most of the stats correspond directly to existing perfmons, which should make the implementation relatively easy. The exceptions are the cache.in_use_bytes perfmon and the network.*.open_connections one. I think both of them are very useful, and probably worth adding.
@wojons you also asked about the GC stats. I've omitted them for now, because I think we should improve them first. For example, we have a stat that has the number of extents GCed by the data GC, but we don't know how much of those extents was actually garbage, which makes it confusing. Instead we should first implement some more meaningful stats such as gc_bytes_reclaimed.
Remarks? Suggestions? Things that are missing which we absolutely need in the first version? |
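The space_usage conversions noted in the comments of the table/server proposal above can be worked through with a short sketch. The perfmon names are the ones mentioned in those comments; the function name, the sample counts, and especially the extent size are assumptions made up for the arithmetic, not values taken from the serializer.

```python
def space_usage(lba_extents, data_extents, old_garbage_block_bytes, extent_size):
    """Convert extent-based perfmon values into the proposed byte fields."""
    lba_bytes = lba_extents * extent_size
    garbage_bytes = old_garbage_block_bytes
    # Approximation from the proposal: data extents in bytes minus known garbage.
    data_bytes = data_extents * extent_size - garbage_bytes
    return {
        "lba_bytes": lba_bytes,
        "data_bytes": data_bytes,
        "garbage_bytes": garbage_bytes,
    }


# Illustrative numbers only; 512 KiB is an assumed extent size.
ASSUMED_EXTENT_SIZE = 512 * 1024

example = space_usage(
    lba_extents=2,
    data_extents=10,
    old_garbage_block_bytes=300_000,
    extent_size=ASSUMED_EXTENT_SIZE,
)
print(example)
```

Because data_bytes is derived by subtraction, it is only as accurate as the garbage counter, which matches the "approximate" wording in the proposal.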
+1 for the composite key idea |
Ah, yes. I don't have an opinion on what a good pkey should be, I just saw that it wasn't specified. A composite key seems reasonable. |
@timmaxw and I talked about this more offline.
Here's the new proposal:
one per server
one per table
one per table/server pair
for timed out servers
@coffeemug does this look ok to you? (Edit: Updated the one per server document structure) |
Yes. One option to avoid adding a per-table read/write metric while allowing people not to count dups is to expose info on whether a given table/server stats entry is for a primary replica (e.g. |
I have mixed feelings about that. On the one hand, it reduces duplication; the table was getting rather cluttered. On the other hand, there's no longer a natural way to get read/write stats for a given table; it's such a simple stat that it feels like people shouldn't have to do a complex query to get it. (For reads, they need to sum the stats for all servers; for writes, they have to take the stat on the primary, which should be the same as all the other servers.) |
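To make the cost of dropping a per-table entry concrete, here is a minimal sketch of the client-side aggregation described above: reads are summed over all servers, while writes are taken from the primary only. The flat `reads_per_sec` / `writes_per_sec` fields and the `is_primary` flag are simplifying assumptions for the example (the proposal nests these stats and does not define such a flag), and the rows are invented.

```python
from collections import defaultdict


def per_table_rates(rows):
    """Combine table/server rows into one read/write rate per table."""
    totals = defaultdict(lambda: {"reads_per_sec": 0, "writes_per_sec": 0})
    for row in rows:
        entry = totals[row["table_id"]]
        # Reads are served independently by each server, so they add up.
        entry["reads_per_sec"] += row["reads_per_sec"]
        # Writes are applied on every replica, so count them once (primary only).
        if row["is_primary"]:
            entry["writes_per_sec"] = row["writes_per_sec"]
    return dict(totals)


# Invented table/server rows; `is_primary` is a hypothetical field.
rows = [
    {"table_id": "t1", "server_id": "s1", "is_primary": True,
     "reads_per_sec": 10, "writes_per_sec": 5},
    {"table_id": "t1", "server_id": "s2", "is_primary": False,
     "reads_per_sec": 7, "writes_per_sec": 5},
    {"table_id": "t2", "server_id": "s1", "is_primary": True,
     "reads_per_sec": 1, "writes_per_sec": 0},
]

rates = per_table_rates(rows)
print(rates)
```

Summing writes across servers instead would count each write once per replica, which is the duplicate-counting problem this part of the thread is trying to avoid.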
Ok, I don't feel too strongly about it and don't mind special casing it. (I think we should include a |
We had considered something like that. Note however that in the current proposal, the finest granularity of information is table/server. For having a |
I see. Ok, kindly disregard my suggestion. They can still do it by grouping all machines for a given table and collecting stats from a random one if they wanted to. |
I still need to go over the formats, but here are some replies that I have so far.
I think this should stay per-table. It's nice to know that this server is doing lots of so-and-so, but when you have hundreds of tables this can become a problem.
I think speed of reads and writes is one thing, but that is more of an SSD thing. On an SSD lots of small reads and writes are fine, but they can kill a spinning disk while it reports only 1 MB/s. I think both the speed and the number of reads/writes (or blocks) should be shown.
I think this is great if it's in a multi index. I attached some of the photos so you can see how some of the stats that I get are used. I have a few more that I am not graphing at the moment, but you should get the idea. |
Thank you for your feedback @wojons. You will also be able to get the current GC activity from the I agree that something that's roughly equivalent to "number of disk seeks" would be very useful on rotational drives. However we don't currently have anything implemented to measure this. The writes/s that we currently have in |
I think this is a pretty important branch, so not rushing to get it out is a good idea. Make sure there is enough base code that making additions is straightforward. |
We should also observe #2890 (comment) in here, and not expose db, table and server UUIDs unless the opt arg is set accordingly. |
To clarify: I think the primary key should always use UUIDs. Just fields such as |
I think the fields should have consistent name -- |
Ok, that sounds good @coffeemug . |
Working on this now. |
This is up in review 2356. |
👏 |
This has been approved and merged into |
The graphs in the webui are now using the real stats tables on |
👍 👏 This is amazing. |
Proposed API: Introduce two new pseudo-tables, rethinkdb.table_stats and rethinkdb.server_stats. The first has one document per table, and the second has one document per server. Each has a bunch of fields with statistics about the table or server. We could also have nested sub-documents to organize the stats.
One possible problem is that stats looks a lot like status.