
Delete Throttling #1609

Open · wojons opened this issue Nov 5, 2013 · 9 comments

@wojons (Contributor) commented Nov 5, 2013

I am hoping there is a way for a feature to throttle deletes. I am partitioning my data by week, and overall I don't care if it takes a week to delete the previous week's data. What happens is that when I delete the previous week, it locks up the servers since they are racing to delete. Maybe some sort of flag for a slow delete, or a delete that yields.

@coffeemug (Contributor)

I think there are two open questions here.

  • I think we might want to give range queries a lower priority than point queries in general. @danielmewes -- can you comment on this?
  • We might want to add a priority flag to run. This is an interesting idea we haven't explored yet (see the hypothetical sketch after this list). What do others think? Should we do it? Would it be hard to do?
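
For concreteness, here is a purely hypothetical sketch of what such a flag could look like from the client side, using the Python driver. The priority optarg does not exist in any RethinkDB driver, and the table and index names are made up for illustration.

```python
# Hypothetical only: "priority" is NOT a real optarg in the RethinkDB drivers.
# This merely illustrates the flag proposed above, as a client might use it.
import rethinkdb as r

conn = r.connect(host="localhost", port=28015)

# A long-running bulk delete the application is happy to have run slowly,
# scheduled behind latency-sensitive point queries.
(r.table("events")
  .between("2013-W44", "2013-W45", index="week")
  .delete()
  .run(conn, priority="low"))  # hypothetical flag, not part of the real API
```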

@jdoliner (Contributor) commented Nov 5, 2013

There's some really low-hanging fruit for optimizing deletes. I suspect this won't be an issue once we fix them.

@danielmewes (Member)

I fear that everything we can do with respect to priorities might not have the desired effects when it comes to write queries.
I haven't actually looked at the implementation of delete, but I could imagine that this is actually a locking issue. It seems that yielding the locks within a delete might be difficult. I'll have to think about that a bit more.

@jdoliner: Do you have something specific in mind?

@wojons: What size is your database? Do you know if it fits in the cache or if there is disk i/o involved?

@wojons (Contributor, Author) commented Nov 5, 2013

@coffeemug a priority flag for everything would be SUPER useful, I think, even on queries, because some queries I expect to take a long time and don't care if they do, but there are some things that you want to be super fast.

@danielmewes the database could fit in cache if I allocated more space to the VM. The tables are normally 4-16 GB in size, but soon I will keep more than two weeks of data and it won't fit, so it's all disk-based as well.

@danielmewes (Member)

Regarding the priority flag: I'm not certain it would work as expected. First of all, there are two ways in which we can influence the priority of a given query:

  • by changing the scheduler priority of the involved coroutines, or
  • by changing the i/o priority of the transaction.

Both of these options have the desired effect for some kinds of queries (e.g. backfilling). They have basically no effect for other queries (especially short-running ones). In yet other cases, reducing the priority of a given query can have negative effects on the overall cluster performance, because it might end up holding locks for longer than necessary or interacting badly with the i/o requirements of other queries [1].

We should keep the idea in mind, but it requires a lot of testing and probably a number of careful changes to actually become a useful and reliable feature.

Edit:
Sorry, forgot the [1]: What I mean is that if two queries request the same block from disk, one with a high i/o priority and the other with a low one, and the low-priority query requests the block slightly earlier than the high-priority one, then the block is actually going to be loaded at the lower priority. The high-priority query will be slowed down by the low-priority one. This seems like it would be rare, but it's just one of the things that we have to keep in mind and that require testing.

@danielmewes (Member)

@wojons: As a workaround, would it be an option to divide the delete into smaller batches from the application side? For example, you could run something like r.table(...).between(...).limit(50).delete() repeatedly until the deleted field of the result is 0 (sketched below). We will hopefully find a proper solution to the problem eventually, but my impression right now is that it could take a little while for us to completely fix this.
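
A minimal sketch of that batched workaround with the Python driver, assuming a table named events with a secondary index week that identifies the partition being dropped; the table name, index name, bounds, batch size, and sleep interval are illustrative assumptions, not part of the suggestion above.

```python
import time
import rethinkdb as r

conn = r.connect(host="localhost", port=28015, db="mydb")

# Delete last week's partition in small batches so other queries can
# interleave, instead of issuing one huge range delete.
while True:
    result = (r.table("events")
              .between("2013-W44", "2013-W45", index="week")  # assumed index
              .limit(50)
              .delete()
              .run(conn))
    if result["deleted"] == 0:
        break  # nothing left in the range
    time.sleep(0.1)  # brief pause to yield i/o to other queries
```

The batch size and the pause between batches are the knobs for trading total deletion time against the impact on other queries.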

@jdoliner (Contributor) commented Nov 5, 2013

Well, one optimization is that we could turn r.table(...).between(...).delete() into a range delete, which would be fast as blazes. Another, more general optimization is to not transfer the entire row over the network to do a delete.

@danielmewes (Member)

@jdoliner: That of course would make a lot of sense.
Mind, though, that the "fast as blazes" part will only hold for in-memory data sets, right?

@coffeemug (Contributor)

Moving to backlog. We'll look into this after #1762 is fixed, but this is generally outside the scope of the LTS release.
