The distinct command returns array instead of stream #1566

Closed

coffeemug opened this issue Oct 24, 2013 · 27 comments
@coffeemug
Contributor

r.table('foo').distinct().type_of() # returns array

I believe it should return a stream instead. Assigning to @mlucy.

@jdoliner
Contributor

This is actually the correct behavior. Computing distinct requires loading everything into memory, so ARRAY is the most honest type. It's good to do this, because you can then use the type system to figure out whether a query requires loading everything into memory.
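For example, you can inspect this before running a query. A minimal sketch (the exact strings type_of reports here are assumptions based on this discussion):

import rethinkdb as r

conn = r.connect("localhost", 28015)

# distinct materializes its input, so it reports an in-memory type:
print(r.table("foo").distinct().type_of().run(conn))  # e.g. "ARRAY"

# a bare table stays lazy and batched:
print(r.table("foo").type_of().run(conn))             # e.g. "TABLE"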

@neumino
Member

neumino commented Oct 24, 2013

Probably related to #1135

@jdoliner
Contributor

@neumino it is not, actually. Even if #1135 were implemented, this would still return an ARRAY. We could in theory add an indexed version of distinct, i.e. distinct(index="foo"), and that would return a stream. But seeing as we don't even have a version of distinct that takes a function, and AFAIK no one has requested one, I'm not totally sure this is actually something people would care about.
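For reference, the hypothetical indexed form would look something like this (proposed only, not implemented; the index name is illustrative):

# distinct over a secondary index could deduplicate in index order
# without materializing the whole table:
r.table("foo").distinct(index="foo").type_of()  # would report "STREAM"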

@coffeemug
Contributor Author

Hmm. Apparently the docs also say that distinct returns an array (http://rethinkdb.com/api/#py:aggregation-distinct). Somehow I didn't expect this. I don't think we ever intended to use the type system to denote operations that require loading data into RAM on the server. It doesn't necessarily mean it's a bad idea, but it's not immediately obvious to me that we should be doing this. Here are reasons not to do it:

  • It's inconsistent. I believe unindexed order_by currently requires loading all data into RAM, but it doesn't return an array.
  • There may be operations that must load all data into RAM to complete, but can't return an array because they have to return a selection (like order_by). This suggests that using arrays to denote operations that have to load all data into RAM isn't the right paradigm.
  • Returning things as arrays vs. streams has an impact on the interface the end user has to program against in the drivers. If we have indexed operations (like distinct) return streams and unindexed ones return arrays, we'd change the driver interface depending on whether the command is indexed. I don't think that's a good idea.
  • Returning streams has additional advantages -- users can stream data to the clients without waiting for the whole dataset to be sent over the network. Additionally, large result sets encoded as arrays would probably hit protobuf limits, while streams would not. (For the same reason, I think groupBy should also return a stream.)

That being said, it's definitely not a point release issue, so moving it into backlog.

@jdoliner
Contributor

We absolutely intended the type system to be used to denote operations that require loading the stream into RAM, and it's the only paradigm that makes sense. This is in fact an issue @timmaxw and I spent quite a while hashing out. What you're suggesting here is incredibly dangerous. The fact that order_by returns a stream is a really, really nasty bug that I didn't realize existed and should be fixed. Consider the following code, a novice attempt at what a server that serves a high score list might look like:

import rethinkdb as r

class high_scoring_users_paginator(object):
    def __init__(self, conn):
        # the cursor holds server-side state until it is exhausted
        # or the connection is closed
        self.cursor = r.table("users").order_by("score").run(conn)

    def get_next_page(self, n):
        res = []
        for user in self.cursor:
            res.append(user)
            if len(res) == n:
                break
        return res

Perfectly reasonable code; in fact, this is the recommended way to do pagination, because skip and limit are inherently slow. However, you've just really screwed yourself. order_by returns a stream, not an array. That means that anything you don't consume off the cursor is still in memory on the server. So for each request for the high score list you'll be storing a copy of the users table in memory on the server, and you will continue to do so until the connection is closed. Needless to say, it's really, really easy to hit the OOM killer this way. We should fix this ASAP; I wouldn't be surprised if this is responsible for a few of our OOM-killer-related problems.

I'll open a separate bug for this but I'm pretty sure this issue should be closed. There's no way for distinct to return anything other than an array and not be a gigantic memory leak waiting to happen.

Note: I did actually just test this and it seems to behave exactly as this bug would predict. Running:

conn = r.connect()
while True:
    # each iteration opens a new server-side cursor that is never drained
    r.table("foo").order_by("bar").run(conn)

will make your server rapidly consume memory. That memory usage will go away if you kill the script (thus closing the connection).

@coffeemug
Contributor Author

I think this is less of a problem than it appears because 99% of apps written with Rethink will open the connection, do some stuff, and close it for each web request (which is < 50ms). It's very rare to hold cursors across requests. Admittedly this is a problem for REPLs and some apps, but I think the solution is to cap the amount of data we can order (and error if the query goes over the cap), not to return an array.

For order_by, what do we do about r.table("foo").order_by("bar").limit(5).delete()? It's a very common query, and returning an array would make it impossible (or much more complicated).

Also, what about the objections above? (Changing the driver interface depending on whether you use an index.)

I would much rather wait to implement per-query memory limit. I think it's a better solution to this problem than returning arrays.

@mlucy
Member

mlucy commented Oct 24, 2013

I just talked about this with Joe. I think having order_by return an array or a stream depending on whether or not you use an index is irritating, but I also buy the argument that we're reflecting something real in the underlying system.

We've had an ambiguity with selections for a long time, and it's led to a lot of exploding complexity internally (we currently have arrays, eager streams, lazy streams, and wrapper streams). Occasionally this complexity bubbles up to form warts like this one where people have different intuitions.

One way to resolve this is to have the rule "you always get out what you put in". If order_by takes a stream, it returns a stream; if it takes an array, it returns an array. This is roughly what we have now, plus or minus a few edge cases.

Another way to resolve it is to have the rule "you always get out whatever we use internally". If everything is in memory, you get back an array; if stuff is on disk, you get back a stream. (I think this is what Joe wants.) This is very doable -- we just make selections be parameterized on the underlying representation (we'd have selection<stream>, selection<array>, and selection<object>).
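To illustrate how the second rule would classify queries (the type names are assumptions based on the parameterization described above):

# indexed order_by can stream lazily from disk:
r.table("users").order_by(index="score")  # -> selection<stream>

# unindexed order_by must sort everything in server memory:
r.table("users").order_by("score")        # -> selection<array>

# a literal array is already in memory and stays an array:
r.expr([3, 1, 2]).order_by(lambda x: x)   # -> array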

The first approach is definitely easier to understand, but gives users a poor intuition for the system. The second approach is safer but more irritating. The question of which to use comes down to how much we need the safety and how irritating it is for users to learn and work around that rule.

I have mixed opinions on that question, but I think I'm leaning toward JD's point of view at the moment. What do other people think?

(Also, tagging this as an RQL proposal since most of the discussion seems to be going on here.)

@jdoliner
Contributor

Copy pasted from #1567.

I think this is less of a problem than it appears because 99% of apps written with Rethink will open the connection, do some stuff, and close it for each web request

No, this is just wrong up and down. For one thing, we should not bank on one-query-per-connection being a universal enough pattern that we consider leaking memory acceptable when people don't use it. For another, this is just not true. Even if connection pools are rare right now, I guarantee people are going to use them once there are good libraries that make them easy. These actually already exist in a somewhat nascent form. A good example is https://github.com/nviennot/nobrainer. This is a library that we recommend on our site and that people seem to actually be using; it has a bug report from 10 days ago. It also has this bug: every time you use order_by in nobrainer you'll be using a table's worth of memory on the server until your program terminates.

For order_by, what do we do about r.table("foo").order_by("bar").limit(5).delete()? It's a very common query, and returning an array would make it impossible (or much more complicated).

This isn't true. I'm not sure why you think it is.

As for the objections above:

It's inconsistent. I believe unindexed order_by currently requires loading all data into RAM, but it doesn't return an array.

Obviously this is the whole point of this bug report, so it doesn't apply. As part of this issue we should make everything follow consistent rules.

There may be operations that must load all data into RAM to complete, but can't return an array because they have to return a selection (like order_by). This suggests that using arrays to denote operations that have to load all data into RAM isn't the right paradigm.

As mentioned above this isn't correct.

On the last two points, I'll concede that this does indeed make the APIs more complicated, but it does so in a very consistent way that makes them much safer. I've had users ask me for an easy way to tell which functions can use lots of memory, and right now there's no easy way to tell which ones they are. With the API I'm proposing, you'll know that queries only consume memory on the server while they're running (disregarding a very small constant overhead for streams). That's a really nice guarantee to have, and it's really unsafe not to have it.

I would much rather wait to implement per-query memory limit. I think it's a better solution to this problem than returning arrays.

This isn't even close to a solution to this problem. Whatever limit we select, you can still leak that much memory per query. How is that a solution? What memory limit are we going to select that we're okay leaking on each query and that is still high enough to make the query language useful?


This isn't how we write software; we're supposed to be the guys who don't leave landmines like this lying around to screw our users over. And if we were discussing this in the context of why a user had hit the OOM killer, I really doubt I would need to justify fixing it. I actually think there's a very good chance users are hitting this and we just haven't diagnosed it yet: OOM-related problems are among the most commonly reported issues, and at this point we generally just say we have myriad memory problems that we know about and are working on. I really feel pretty strongly that this is something that needs to be fixed.

@jdoliner
Contributor

This is actually a lot easier to hit than I thought, because it happens in the data explorer. If you type r.table("foo").orderBy(...) in the data explorer, then unless you click through all of the results you'll be leaking memory. We have some things that are supposed to time these streams out, but they seem to be broken somehow. Right now, once you do an in-memory orderBy in the data explorer, you have about 5 minutes to click through all the results. If you don't, whatever's left on the server will sit in memory until you kill the process. It never gets reclaimed.

@mlucy
Member

mlucy commented Oct 24, 2013

@jdoliner -- that sounds like a separate bug; I would bet that the data explorer is leaking other resources associated with the connection as well. Could you open an issue for that?

@neumino
Member

neumino commented Oct 24, 2013

The data explorer closes the connection every five minutes, which should be enough, no?

I guess I could close the cursor too every time a new query is executed.

@mlucy
Member

mlucy commented Oct 24, 2013

Closing the connection should definitely do it.

@coffeemug
Contributor Author

I think this is less of a problem than it appears because 99% of apps written with Rethink will open the connection, do some stuff, and close it for each web request

No, this is just wrong up and down.

I think saying it's "just wrong up and down" is a bit excessive. I agree that what I said isn't necessarily a viable long-term assumption, and that it isn't 100% true for every use case today, but it does hold for most use cases today. I agree that we should fix the underlying problem, but I do have concerns about this specific proposal. I was making an argument that this isn't urgent and that we shouldn't rush it into a point release without looking at this issue from various points of view.

For order_by, what do we do about r.table("foo").order_by("bar").limit(5).delete()? It's a very common query, and returning an array would make it impossible (or much more complicated).

This isn't true. I'm not sure why you think it is.

I was under the impression that you could only execute mutation operations on the original table via selections (which is why we have the types selection and single_row_selection). @mlucy -- could you chime in? There is no command in RethinkDB that I am aware of that allows you to execute a mutation operation on an array (or any type other than selections). We can consider changing that, but I think it's a big change that we should discuss. I'm not sure what it would mean to call delete on an array.
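To make the distinction concrete, a sketch (assumes an index named "bar" and an open connection conn):

# a selection remembers which table rows it came from, so mutations chain off it:
r.table("foo").order_by(index="bar").limit(5).delete().run(conn)

# a plain array has no link back to a table, so the server has nothing to delete:
r.expr([1, 2, 3]).delete().run(conn)  # errors: delete expects a selection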

On the last 2 points I'll concede that this does indeed make the APIs more complicated but it does so in a very consistent way that makes it much safer.

Agreed, I retract this specific objection.

I would much rather wait to implement per-query memory limit. I think it's a better solution to this problem than returning arrays.

This isn't even close to a solution to this problem. Whatever limit we select you can still leak that much memory per query.

I think the word "leak" doesn't quite apply to what's going on here. Normally when I think of software leaking memory, I think of something unpredictable and uncontrollable. It isn't the case here -- there are very clear semantics on what to do to clean up the RAM (close the cursor).

This isn't how we write software, we're supposed to be the guys who don't leave landmines like this lying around to screw our users over.

I agree that this is an issue that could screw people over, and it would be great to fix it. But it's not a landmine in the way a BKL is a landmine. Pretty much every RDBMS handles this the way we do right now (they give you a cursor, and the server holds on to the memory while the cursor is open). It's the status quo. I agree that to be a 10x product for ops we need to get rid of cases like this to the extent that we can, so ops people don't have to worry about running out of RAM, running into performance problems, etc. (Or, when we can't get rid of issues like this, we need to give people visibility into what happens in the cluster and an easy way to fix things via admin tools.) I just have concerns about this specific proposal.

(Actually, the only concern I have is the selection/mutation issue)

@jdoliner
Contributor

This is very doable -- we just make selections be parameterized on the underlying representation

@mlucy already chimed in on this.

@coffeemug
Contributor Author

Actually, in addition to the selection/mutation concern, the other concern I have is latency/protobuf limitations. We can't currently stream arrays, so a big GMR result or orderBy result will have serious latency problems, or will crash protobufs, if we return arrays. (A solution to this might be streaming arrays in some way, but that's what streams are for.)

Also, I just saw @mlucy's comment. I really like the idea of parameterized selections. This way the rule is "you mostly get out what you put in, parameterized on what we use internally". This makes a lot of sense to me. I'm still concerned about the latency of things like GMR, though, and potential protobuf issues with sending huge arrays.

@coffeemug
Contributor Author

Also, moving to subsequent.

@neumino
Member

neumino commented Oct 24, 2013

I really don't like the idea of sending back arrays. We should always give users a cursor and ask them to close it if they don't fetch all the data. That would be much more developer-friendly.
At some point the JavaScript driver did not provide the same interface for arrays and streams, and that was a huge pain to deal with.

I believe that asking users to understand the internals of RethinkDB is just not nice. Making a difference between arrays and streams is just a way to force them to understand some subtle differences.
About distinct: the current implementation is not obvious at all. One (like me...) could perfectly well imagine distinct going through the stream, checking each document against previously visited ones (stored in a hash), and sending back elements one by one.
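Something like this client-side sketch, purely for illustration (how the server actually implements distinct is a separate question):

import json

def streaming_distinct(cursor):
    seen = set()  # hash of documents already emitted
    for doc in cursor:
        key = json.dumps(doc, sort_keys=True)  # hashable identity for a document
        if key not in seen:
            seen.add(key)
            yield doc

(The hash still grows, but only with the number of distinct elements, and results go out one at a time.)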

If the concern is about leaks, we can still kill a cursor if it has not been used for the last X seconds and dump an error message somewhere.

Also, about my comment here, #1567 (comment): the data explorer freezes for a second when parsing 1,000 documents. If you somehow get back a bigger array (up to one million), that would just kill your Chrome tab (or your browser).

@coffeemug
Contributor Author

Making a difference between arrays and streams is just a way to force them to understand some subtle differences.

After talking to @mlucy and @jdoliner, I think that in this case it's a good idea. It is a bit of a pain, but it's not a pain we're introducing for no reason. Understanding the difference here is critical, because if you don't, you can seriously shoot yourself in the foot and run into all sorts of performance issues (which really piss people off).

We're going to implement an array/memory limit, so these operations won't return large results -- in those cases they'll error instead and tell people to use indexes. The tradeoff here is that adding a bit of pain for developers makes ops 10x nicer, and I think in this case it's justified.
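In practice, that would look something like this (a sketch; the index name "score" and the exact limit behavior are illustrative):

conn = r.connect()

# create a secondary index once; order_by can then stream from it:
r.table("users").index_create("score").run(conn)
r.table("users").order_by(index="score").run(conn)  # stream, bounded memory

# without an index, the in-memory sort would error under the proposed
# limit once the result grows too large:
r.table("users").order_by("score").run(conn)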

@jdoliner
Contributor

We should also maybe see if we can make this easier to handle in JavaScript. In Python and Ruby you can iterate arrays and cursors using identical code, so a lot of users may well not even need to change their code when we start giving them back arrays. In theory something similar could be done for JS.
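For example, in Python the same loop works either way (a sketch, assuming a table "foo" and an open connection conn):

# run() may hand back a cursor (stream) or a plain list (array);
# the iteration code is identical in both cases:
for doc in r.table("foo").distinct().run(conn):
    print(doc)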

@coffeemug
Contributor Author

You can use identical code in JS now too. When we return arrays, we splice functions into them so users can interact with them in exactly the same way as they do with cursors.

@mlucy
Member

mlucy commented Oct 24, 2013

This will make some of the batching stuff I'm doing easier, so I'm going to roll this into #1543.

@mlucy
Member

mlucy commented Oct 24, 2013

(Changing un-indexed orderby so that it returns an array, that is.)

@neumino
Member

neumino commented Oct 25, 2013

If order_by returns an array, the data explorer will break as soon as this query returns something like 10k small documents (it will freeze in an irritating way -- it's already noticeable):

r.table("foo").orderBy("key")

I'm strongly opposed to having orderBy return an array.

@jdoliner
Contributor

@neumino We're planning to have a per-query memory limit, so you can set that lower and be guaranteed not to get back an object above that limit. We might want to split that into a limit on the amount of memory a query can use during execution and a limit on the size of the object a query can return.

@mglukhovsky
Member

@coffeemug asked me to chime in here since @neumino had concerns about arrays vs streams. Here are my thoughts:

The first and foremost priority I have as a developer is that I don't want a query to accidentally take down the server (or reduce service quality for other queries).

Always returning an array incurs a predictable cost (I have to know which queries return arrays vs. streams, and that getting an array means a hefty amount of data comes back all at once). The alternative (always returning streams) means I might have to worry about cases where a query quietly reduces performance or takes down the entire server. Unpredictability is strictly worse than a known cost.

The key here is good documentation. Advertising which queries return a stream and which return arrays, providing clear errors or documentation on how simply creating an index gets me my friendly streams, and explaining why this cost is being incurred makes it bearable -- once it's been explained to me.

@coffeemug
Contributor Author

I also talked to a few other people about this, and I think it's best to go with the parameterized selection proposal above. Sorry @neumino.

@coffeemug
Contributor Author

@mlucy -- as far as I can tell this is done. Can we close this?

@mlucy mlucy closed this as completed Nov 20, 2013