The `distinct` command returns array instead of stream #1566
This is actually the correct behavior. Computing `distinct` requires loading everything into memory, so the result comes back as an array.
Probably related to #1135.
@neumino it is not, actually. Even if #1135 were implemented, this would still return an array.
Hmm. Apparently the docs also say that.
That being said, it's definitely not a point-release issue, so I'm moving it into the backlog.
We absolutely intended the type system to be used to denote operations that require loading the stream into RAM, and it's the only paradigm that makes sense. This is in fact an issue @timmaxw and I spent quite a while hashing out. What you're suggesting here is incredibly dangerous. Consider:

```python
class high_scoring_users_paginator(object):
    def __init__(self):
        # un-indexed order_by: the server sorts everything and holds the
        # result in RAM for as long as this cursor stays open
        # (assumes a default connection set up via r.connect(...).repl())
        self.cursor = r.table("users").order_by("score").run()

    def get_next_page(self, n):
        res = []
        for user in self.cursor:
            res.append(user)
            if len(res) == n:
                break
        return res
```

Perfectly reasonable code; in fact, this is the recommended way to do pagination. I'll open a separate bug for that, but I'm pretty sure this issue should be closed. There's no way for `distinct` to return anything other than an array.

Note: I did actually just test this, and it seems to behave exactly as this bug would predict. Running:

```python
while True:
    r.table("foo").order_by("bar").run()
```

will make your server rapidly consume memory. That memory usage will go away if you kill the script (thus closing the connection).
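The "memory goes away when the connection closes" behavior described above can be illustrated with a toy model. This is plain Python with invented names, not anything from the RethinkDB codebase: a fake server keeps a buffer per open cursor, and closing the cursor is what releases it.

```python
# Toy model (hypothetical names, no real RethinkDB involved): the server
# holds a materialized result buffer for every open cursor until that
# cursor is explicitly closed.
class FakeServer:
    def __init__(self):
        self.buffers = {}   # cursor id -> rows held in server RAM
        self.next_id = 0

    def open_cursor(self, rows):
        cid = self.next_id
        self.next_id += 1
        self.buffers[cid] = list(rows)   # memory pinned server-side
        return Cursor(self, cid)

class Cursor:
    def __init__(self, server, cid):
        self.server, self.cid, self.pos = server, cid, 0

    def __iter__(self):
        return self

    def __next__(self):
        buf = self.server.buffers[self.cid]
        if self.pos >= len(buf):
            raise StopIteration
        row = buf[self.pos]
        self.pos += 1
        return row

    def close(self):
        # releasing the cursor frees the server-side buffer
        self.server.buffers.pop(self.cid, None)

server = FakeServer()
c = server.open_cursor(range(3))
assert list(c) == [0, 1, 2]
c.close()
assert server.buffers == {}   # nothing pinned once the cursor is closed
```

A loop that opens cursors without closing them keeps growing `server.buffers`, which is the shape of the problem the comment above demonstrates.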
I think this is less of a problem than it appears, because 99% of apps written with Rethink will open the connection, do some stuff, and close it for each web request (which is < 50ms). It's very rare to hold cursors across requests. Admittedly this is a problem for repls and some apps, but I think the solution is to cap the amount of data we can order (and error if the query goes over the cap), not to return an array.

Also, what about the objections above? (Changing the driver interface depending on whether you use an index.) I would much rather wait and implement a per-query memory limit. I think it's a better solution to this problem than returning arrays.
I just talked about this with Joe.

We've had an ambiguity with selections for a long time, and it's led to a lot of exploding complexity internally (we currently have arrays, eager streams, lazy streams, and wrapper streams). Occasionally this complexity bubbles up to form warts like this one, where people have different intuitions.

One way to resolve this is to have the rule "you always get out what you put in".

Another way to resolve it is to have the rule "you always get out whatever we use internally". If everything is in memory, you get back an array; if stuff is on disk, you get back a stream. (I think this is what Joe wants.) This is very doable -- we just make selections be parameterized on the underlying representation.

The first approach is definitely easier to understand, but gives users a poor intuition for the system. The second approach is safer but more irritating. The question of which to use comes down to how much we need the safety and how irritating it is for users to learn and work around that rule. I have mixed opinions on that question, but I think I'm leaning toward JD's point of view at the moment. What do other people think?

(Also, tagging this as a RQL proposal since most of the discussion seems to be going on here.)
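The "parameterized on the underlying representation" idea can be sketched in a few lines of Python. This is a toy model with invented names, not RethinkDB's actual type system: lazy operations preserve whatever representation came in, while operations that must materialize everything always hand back the in-memory flavor.

```python
# Hypothetical sketch of selections parameterized on representation.
# ARRAY means fully in memory; STREAM means lazy / backed by disk.
ARRAY, STREAM = "array", "stream"

class Selection:
    def __init__(self, rows, repr_):
        self.rows, self.repr = rows, repr_

    def filter(self, pred):
        # filter can be evaluated lazily, so it preserves the
        # representation of its input
        return Selection([r for r in self.rows if pred(r)], self.repr)

    def distinct(self):
        # distinct has to see every row, so its result is always
        # materialized in memory
        seen, out = set(), []
        for row in self.rows:
            if row not in seen:
                seen.add(row)
                out.append(row)
        return Selection(out, ARRAY)

s = Selection([3, 1, 3, 2], STREAM)
assert s.filter(lambda x: x > 1).repr == STREAM   # stream in, stream out
assert s.distinct().repr == ARRAY                 # distinct materializes
```

Under this model the driver (or type checker) can tell callers exactly which results are pinned in RAM, which is the safety property argued for in the second approach.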
Copy-pasted from #1567:
No, this is just wrong up and down. For one thing, we should not bank on one-query-per-connection being a universal enough pattern that we consider it acceptable to leak memory when people don't use it. For another, this is just not true: even if connection pools are rare right now, I guarantee people are going to use them once there are good libraries that make them easy. These actually already exist in a somewhat nascent form. A good example is here: https://github.com/nviennot/nobrainer. This is a library that we recommend on our site and that people seem to actually be using; it has a bug report from 10 days ago. It also has this bug.
This isn't true, and I'm not sure why you think it is. As for the objections above:
Obviously this is the whole point of this bug report, so it doesn't apply. As part of this issue we should make everything follow consistent rules.
As mentioned above, this isn't correct. On the last two points, I'll concede that this does indeed make the APIs more complicated, but it does so in a very consistent way that makes them much safer. I've had users ask me for an easy way to tell which functions can use lots of memory, and right now there isn't one. With the API I'm proposing, you'll know that queries only consume memory on the server while they're running (disregarding a very small constant overhead for streams). That's a really nice guarantee to have, and it's really unsafe not to have it.
This isn't even close to a solution to this problem. Whatever limit we select, you can still leak that much memory per query; how is that a solution? What memory limit are we going to select that we're okay leaking on each query and that is still high enough to make the query language useful? This isn't how we write software; we're supposed to be the guys who don't leave landmines like this lying around to screw our users over. And if we were discussing this in the context of why a user had hit the OOM killer, I really doubt I would need to justify fixing it. I actually think there's a very good chance users are hitting this and we just haven't diagnosed it yet. OOM-related problems are among the most commonly reported issues, and at this point we generally just say we have myriad memory problems that we know about and are working on. I really feel pretty strongly that this is something that needs to be fixed.
This is actually a lot easier to hit than I thought before, because it happens in the data explorer.
@jdoliner -- that sounds like a separate bug; I would bet that the data explorer is leaking other resources associated with the connection as well. Could you open an issue for that?
The data explorer closes the connection every five minutes, which should be enough, no? I guess I could also close the cursor every time a new query is executed.
Closing the connection should definitely do it.
I think saying it's "just wrong up and down" is a bit excessive. I agree that what I said isn't necessarily a viable long-term assumption, and that it isn't 100% true for every use case today, but it does hold for most use cases today. I agree that we should fix the underlying problem, but I do have concerns about this specific proposal. I was making an argument that this isn't urgent and that we shouldn't rush it into a point release without looking at this issue from various points of view.
I was under the impression that you could only execute mutation operations on the original table via selections (which is why we have the `Selection` type).
Agreed, I retract this specific objection.
I think the word "leak" doesn't quite apply to what's going on here. Normally when I think of software leaking memory, I think of something unpredictable and uncontrollable. That isn't the case here -- there are very clear semantics for cleaning up the RAM (close the cursor).
I agree that this is an issue that could screw people over, and it would be great to fix it. But it's not a landmine in the way a BKL is a landmine. Pretty much every RDBMS handles this the way we do right now (they give you a cursor and the server holds on to the memory while the cursor is open). It's the status quo.

I agree that to be a 10x product for ops we need to get rid of cases like this to the extent that we can, so ops people don't have to worry about running out of RAM, running into performance problems, etc. (Or, when we can't get rid of issues like this, we need to give people visibility into what happens in the cluster and an easy way to fix things via admin tools.) I just have concerns about this specific proposal. (Actually, the only concern I have is the selection/mutation issue.)
@mlucy already chimed in on this.
Actually, in addition to the selection/mutation concern, the other concern I have is latency/protobuf limitations. We can't currently stream arrays, so a big GMR result or a big un-indexed `order_by` result would have to be sent all at once.

Also, just saw @mlucy's comment. I really like the idea of parameterized selections. This way the rule is "you mostly get out what you put in, parameterized on what we use internally". This makes a lot of sense to me. I'm still concerned about the latency of things like GMR, though, and about potential protobuf issues with sending huge arrays.
Also, moving to subsequent.
I really don't like the idea of sending back arrays. I believe that asking users to understand the internals of RethinkDB is just not nice. Making a distinction between indexed and un-indexed `order_by` does exactly that.

If the concern is about leaks, we can still kill a cursor if it has not been used for the last X seconds and dump an error message somewhere.

Also, about my comment in #1567 (comment): the data explorer freezes for a good second when parsing 1000 documents. If you somehow get back a bigger array (up to one million documents), that would just kill your Chrome tab (or your browser).
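The "kill a cursor if it has not been used for the last X seconds" idea could look roughly like this. It is a hypothetical sketch with invented names, not anything RethinkDB actually implements: a registry remembers when each cursor was last touched, and a periodic reaper closes the stale ones.

```python
import time

# Hypothetical idle-cursor reaper, sketching the suggestion above.
class CursorRegistry:
    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_used = {}   # cursor id -> last access timestamp

    def touch(self, cid, now=None):
        # called on every read from the cursor
        self.last_used[cid] = time.monotonic() if now is None else now

    def reap(self, now=None):
        """Forget cursors idle longer than the timeout; return their ids.
        A real server would also free their buffers and log an error."""
        now = time.monotonic() if now is None else now
        dead = [cid for cid, t in self.last_used.items()
                if now - t > self.timeout_s]
        for cid in dead:
            del self.last_used[cid]
        return dead

reg = CursorRegistry(timeout_s=60)
reg.touch("a", now=0)
reg.touch("b", now=50)
assert reg.reap(now=100) == ["a"]   # "a" has been idle 100s > 60s
assert "b" in reg.last_used         # "b" is still fresh
```

Explicit timestamps are injected here so the behavior is deterministic; a real implementation would run `reap` on a timer and use the clock directly.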
After talking to @mlucy and @jdoliner, I think that in this case it's a good idea. It is a bit of a pain, but it's not a pain we're introducing for no reason. Understanding the difference here is critical, because if you don't, you can seriously shoot yourself in the foot and run into all sorts of performance issues (which really piss people off). We're going to implement an array/memory limit, so these operations won't return large results -- in those cases they'll error instead and tell people to use indexes. The tradeoff is that adding a bit of pain for developers makes ops 10x nicer, and I think in this case it's justified.
We should also see if we can make this easier to handle in JavaScript. In Python and Ruby you can iterate arrays and cursors using identical code, so a lot of users may not even need to change their code when we start giving them back arrays. In theory something similar could be done for JS.
You can use identical code in JS now too. When we return arrays, we splice functions into them so users can interact with them in exactly the same way as they do with cursors.
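For what it's worth, the identical-iteration point is easy to see in plain Python: caller code written against the iterator protocol behaves the same whether it is handed an in-memory array or a lazy cursor (simulated here with a generator; the names are invented for illustration).

```python
def as_cursor(rows):
    # stand-in for a driver cursor: a lazy generator over rows
    for row in rows:
        yield row

def first_n(result, n):
    # caller code that neither knows nor cares whether `result` is an
    # in-memory array or a lazy cursor
    out = []
    for row in result:
        out.append(row)
        if len(out) == n:
            break
    return out

data = [1, 2, 3, 4]
assert first_n(data, 2) == [1, 2]              # array result
assert first_n(as_cursor(data), 2) == [1, 2]   # cursor result
```

This is why switching some queries from cursors to arrays can be transparent in Python and Ruby, and why splicing cursor-like methods onto arrays achieves the same effect in JS.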
This will make some of the batching stuff I'm doing easier, so I'm going to roll this into #1543. (Changing un-indexed `order_by`, that is.)
If `order_by` returns an array, the data explorer will break as soon as a query returns something like 10k small documents (it will freeze in an irritating way -- it's already noticeable).
I'm strongly opposed to having un-indexed `order_by` return an array.
@neumino We're planning to have a per-query memory limit, so you can set that lower and be guaranteed not to get back an object above that limit. We might want to split that into a limit on the amount of memory a query can use during execution and a limit on the size of the object a query can return.
@coffeemug asked me to chime in here since @neumino had concerns about arrays vs streams. Here are my thoughts:

The first and foremost priority I have as a developer is that I don't want a query to accidentally take down the server (or reduce service quality for other queries). Always returning an array incurs a predictable cost (I have to know which queries return arrays vs streams, and that getting an array means a hefty amount of data being returned all at once). The alternative (streams) means I might have to worry about cases where the query reduces performance or takes down the entire server. Unpredictability is strictly worse than a known cost.

The key here is good documentation: advertising which queries return streams and which return arrays, providing clear errors or documentation on how simply creating an index gets me my friendly streams back, and explaining why this cost is being incurred. The cost is bearable -- once it's been explained to me.
I also talked to a few other people about this, and I think it's best to go with the parameterized-selection proposal above. Sorry @neumino.
@mlucy -- as far as I can tell this is done. Can we close this?
I believe it should return a stream instead. Assigning to @mlucy.