Implement explain command #175
Not making it into 1.4, moving to 1.5.
I'm getting close enough to a working implementation of this that I think we should discuss the API a bit. There are two components that make up this issue: the server-side profiling backend and the frontend visualization.
I'd like to get the backend part merged in without the frontend part, for two reasons. First, I'd like to get this to the people doing profiling as soon as possible because I think it will make their jobs easier, and to avoid a nasty merge conflict the easiest thing to do is merge it into next. Second, this will help a lot in providing support. One of the most common questions I get from users is "why is this query slow?" When the query in question involves multiple tables, I frequently need to ask how many documents are in each table, how big the documents in each table are, and what the schema of those documents is before I can even make a guess at the problem. It gets to be a pretty painful experience for everyone involved. If I could just tell the user to run the query with `profile=True`, this would get a lot easier.

Here's what I'd propose for the API:

```python
>>> query.run(profile=True)
{
    "value": query.run(),
    "profile": [
        { # single step
            "description": "Read table",
            "duration(ms)": 5,
            "sub_tasks": [ ... ]
        },
        [ # tasks happening in parallel
            [
                {
                    "description": "Read from shard 1",
                    "duration(ms)": 5,
                    "sub_tasks": [ ... ]
                },
                ...
            ],
            [
                {
                    "description": "Read from shard 2",
                    "duration(ms)": 5,
                    "sub_tasks": [ ... ]
                },
                ...
            ]
        ],
        ...
    ]
}
```
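To make the proposed structure concrete, here's a small hypothetical helper (not part of any driver) that walks a profile tree in this shape and renders it as indented lines. It assumes the convention above: a profile is a list whose entries are either task dicts or lists of parallel branches, each branch itself a task list.

```python
def format_profile(profile, indent=0):
    """Return indented description lines for a profile tree."""
    lines = []
    pad = "  " * indent
    for entry in profile:
        if isinstance(entry, dict):
            # A single sequential step; recurse into its sub-tasks.
            lines.append(f"{pad}{entry['description']}: {entry['duration(ms)']} ms")
            lines.extend(format_profile(entry.get("sub_tasks", []), indent + 1))
        else:
            # A list of branches that ran in parallel.
            lines.append(f"{pad}[parallel]")
            for branch in entry:
                lines.extend(format_profile(branch, indent + 1))
    return lines

profile = [
    {"description": "Read table", "duration(ms)": 5, "sub_tasks": []},
    [
        [{"description": "Read from shard 1", "duration(ms)": 5, "sub_tasks": []}],
        [{"description": "Read from shard 2", "duration(ms)": 5, "sub_tasks": []}],
    ],
]
print("\n".join(format_profile(profile)))
```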
Wow, that's awesome. I strongly agree with merging this without even working on the front-end part. I think we should merge this in, use it for a bit (ourselves and with other people), learn how it turns out to work, what's useful and what isn't, maybe tweak the API/information a bit, and only add the front-end visualization later.

On the API:
I think it would be really cool to get an internal MVP of this working and post real output from an example query. It would probably make feedback much more constructive, since it's a bit hard to go off a pseudocode example.

Also, I would love to run this on some of my queries, like yesterday. This is really cool.
I agree with splitting back end/front end. I still have quite a lot of things to finish before working on the visualization part, and I would tend to think that it's going to be tricky to build a good one, so merging the back end without the front end is definitely a good idea.

About the API, a few questions:
```python
>>> query.run(profile=True)
{
    "value": query.run(),
    "profile": [
        { # single step
            "description": "Read table",
            "duration(ms)": 5,
            "sub_tasks": [ ... ]
        },
        { # tasks happening in parallel
            "description": "Reading from shards",
            "duration(ms)": 5,
            "tasks_in_parallel": [
                [
                    {
                        "description": "Read from shard 1",
                        "duration(ms)": 5,
                        "sub_tasks": [ ... ]
                    },
                    ...
                ],
                [
                    {
                        "description": "Read from shard 2",
                        "duration(ms)": 5,
                        "sub_tasks": [ ... ]
                    },
                    ...
                ]
            ]
        },
        ...
    ]
}
```
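One practical consequence of this variant is that wall-clock time can be estimated directly from the tree: sequential steps add up, while parallel branches contribute only the slowest branch. Here's a hypothetical sketch under that assumption (field names taken from the proposal above; sub-task durations are not re-added since they are already counted in their parent's duration):

```python
def total_ms(profile):
    """Estimate wall-clock time for a profile: sequential entries add,
    parallel branches take the slowest branch."""
    total = 0
    for entry in profile:
        if isinstance(entry, dict) and "tasks_in_parallel" in entry:
            # Parallel branches overlap, so only the slowest one counts.
            total += max(total_ms(branch) for branch in entry["tasks_in_parallel"])
        elif isinstance(entry, dict):
            # Sub-task time is already included in the parent's duration.
            total += entry["duration(ms)"]
        else:
            # Bare list of parallel branches, as in the first proposal.
            total += max(total_ms(branch) for branch in entry)
    return total

profile = [
    {"description": "Read table", "duration(ms)": 5},
    {
        "description": "Reading from shards",
        "duration(ms)": 6,
        "tasks_in_parallel": [
            [{"description": "Read from shard 1", "duration(ms)": 6}],
            [{"description": "Read from shard 2", "duration(ms)": 4}],
        ],
    },
]
```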
I'm gonna split responses up over a bunch of posts here so that I'm not trying to wrangle one huge response.
A bit broader (although I'm not sure I have the most precise definition of a subquery). A subtask is a task whose duration is considered part of the parent task's duration. So, taking a simple addition query:

```python
{
    "description": "add",
    "duration(ms)": 1,
    "sub_tasks": [
        {
            "description": "expr",
            "duration(ms)": 0.2
        },
        {
            "description": "expr",
            "duration(ms)": 0.1
        }
    ]
}
```

Think of it just like how most profiling tools annotate function calls, except that when things go over the network they automatically send back their profiling information.
I know this is just an example, but in case you didn't think of it, I think you should seriously consider grouping low constant factor commands into a single task. For example, breaking up expressions like this into subtasks seems like it would make the output a lot less useful.
Steps are just things that happen sequentially; they don't correspond to jumps between physical machines. So in the example above, each of the top-level entries in the profile happens one after another.
I definitely agree about this. I've engineered things such that it will be very easy to add fields to profile nodes; all you have to do is change the function that renders a node as a datum to propagate that information back up. Machine and shard information are obvious things to include, and thread might make sense as well. These will be easy to add as they're needed; I just haven't gotten around to adding anything more than a simple string. The above applies to the TYPE suggestion as well. I'd have to think a bit more about what TYPE would actually be, but we'll definitely have something of that ilk.
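For illustration, a profile node extended this way might look like the following. The extra field names here (`machine`, `shard`, `thread`) are hypothetical, taken from the suggestions above rather than from any implemented schema:

```python
# A profile node with the suggested extra fields propagated up.
# Field names beyond "description"/"duration(ms)"/"sub_tasks" are
# hypothetical, per the discussion above.
node = {
    "description": "Read from shard 1",
    "duration(ms)": 5,
    "machine": "server_a",
    "shard": 1,
    "thread": 3,
    "sub_tasks": [],
}
```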
I think this is wonderful. My only preference is to just call everything "tasks" instead of "subtasks" -- it seems obvious from the nesting of the output that it's part of a broader parent task.
We'll have this.
This is a bit vague to me. The most we can say about intra-cluster traffic is, for example: "We were doing a read, and we measured this process to take 10 ms. The machine we did the read from measured its own process to take 7 ms. From this we surmise that 3 ms were spent waiting on the network." We could probably record, when we send information over the network, how much we're sending and to whom. However, we can't in general say that we sent a message to a machine and it arrived 1.5 ms later.
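The inference described above amounts to a subtraction, which a hypothetical helper could perform on a pair of matching nodes (function name is illustrative, not from the codebase):

```python
def inferred_network_ms(local_node, remote_node):
    """The coordinator measured the whole round trip; the remote machine
    measured only its own work.  The difference is attributed to the
    network, since one-way latency can't be measured directly."""
    return local_node["duration(ms)"] - remote_node["duration(ms)"]

# Matches the numbers in the comment above: 10 ms observed locally,
# 7 ms reported by the remote machine, so 3 ms attributed to the network.
waited = inferred_network_ms({"duration(ms)": 10}, {"duration(ms)": 7})
```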
This isn't explicitly enforced by the schema, but in practice they always will be. Consider these for reference:

```python
{
    "value": ...,
    "profile": A
}
{
    "value": ...,
    "profile": [
        {
            "duration": ...,
            "description": ...,
            "sub_tasks": B
        }
    ]
}
```

The schema here is recursive, such that `B` has the same schema as `A`.
As for grouping tasks with low constant factors into groups, I think that might be being a bit smarter in the backend than we want to be. We're definitely going to need to do some grouping, but I'm a bit dubious about doing it based on the content of the subtasks, since it's unclear exactly where we'd draw the line. For example, we definitely want the full profile for the arguments to `r.table("foo").map(...).reduce(...) + r.table("bar").map(...).reduce(...)`.
Also, I was going to put off mentioning this for a while, since it only really becomes meaningful once we have a visualizer, but another piece of info we'll be including in each profile node is the backtrace info we use for error reporting. Once we have this we can present a view that shows the relative expense of various parts of the query.
I think the Unix philosophy of "do one thing and do it well" breaks down for profilers in practice. The profilers end up dumping too much data, which is useful in theory (if it could be analyzed), but is too voluminous to analyze in practice. Second-order analysis tools tend to leave a lot to be desired, so people end up feeling like they have enough information, but no immediately obvious actions they can take based on it.

I can easily think of a heuristic that would probably do a good job: group low constant time deterministic operations. Basically all the math, all the operations on fields, etc, but not point reads, math with nondeterministic components, array ops, etc. This isn't ideal (random isn't deterministic but should be grouped anyway), but I think it's pretty good. We might need to tweak this, but I think we could arrive at good heuristics pretty quickly. It doesn't seem hard.

Anyway, this is a bit hypothetical. I think we'll be able to tell very quickly what's worth doing and what isn't by running it on actual queries and examining the results.
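The grouping heuristic described above could be sketched as a post-processing pass over a profile. Everything here is hypothetical: the set of "cheap deterministic" operation names and the merged-task description are illustrative, not drawn from any implemented term list.

```python
# Hypothetical set of low-constant-factor deterministic operations.
CHEAP_DETERMINISTIC = {"add", "sub", "mul", "getattr", "expr"}

def group_cheap_ops(profile):
    """Merge runs of consecutive cheap deterministic leaf tasks into a
    single summary task; leave everything else (reads, array ops,
    tasks with children) untouched."""
    grouped, run = [], []

    def flush():
        if run:
            grouped.append({
                "description": "arithmetic/field ops (%d grouped)" % len(run),
                "duration(ms)": sum(t["duration(ms)"] for t in run),
            })
            run.clear()

    for task in profile:
        cheap = (isinstance(task, dict)
                 and task.get("description") in CHEAP_DETERMINISTIC
                 and not task.get("sub_tasks"))
        if cheap:
            run.append(task)
        else:
            flush()
            grouped.append(task)
    flush()
    return grouped

steps = [
    {"description": "add", "duration(ms)": 0.1},
    {"description": "expr", "duration(ms)": 0.2},
    {"description": "Read table", "duration(ms)": 5},
]
summary = group_cheap_ops(steps)
```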
I believe this calls for the following -- http://www.youtube.com/watch?v=xB5ceAruYrI.

Also, my GOD the OS X build takes a long time. The anticipation is killing me :-D
As discussed...