Write throughput when globally distributed #1020

Closed
jmordica opened this issue May 12, 2022 · 35 comments
@jmordica

Using this as a place to further discuss the idea of having an option to allow rqlite to cache write transactions for ~100ms, in order to decrease the number of round trips per transaction (which could significantly improve write throughput).

Are there any other thoughts since the last discussion about this?

If we come up with a solution I’d be willing to fund the development.

@otoolep
Member

otoolep commented May 12, 2022

I'd like to understand the use case better. What data would you like to store such that you can tolerate the potential for data loss (node crashes before it sends the batch to the Raft log)? And if you knew a write request that had been previously ack'ed by rqlite was not actually committed, how do you expect to recover?

@jmordica
Author

Good questions:

  1. Fast-changing hot rows like user status/state. These types of transactions also don't care as much about read-after-write consistency, and can tolerate missing some transactions, based on the likelihood that a new state/status change will happen again very soon (overwriting the previously missed state). One would normally use a KV store for something like this, but the current distributed KV implementations are complex and require yet more infrastructure.
  2. I'm not sure how to overcome the issue where an ack'ed request might not actually be written. However, these types of transactions are somewhat understood not to need guaranteed consistency, because they will likely be overwritten very soon after the state/status changes again.

I would just implement my own bulk requests using the API, but in a distributed environment where writes can happen from multiple clients, it seems better to handle the caching server-side (in rqlite), unless there are other workarounds I'm not thinking of.

@otoolep
Member

otoolep commented May 12, 2022

OK, thanks, I recall us talking about this. I'll have to think about it, it would be a different API and behaviour to what rqlite currently exposes. On the surface it mightn't be too difficult, but it would have to be maintained going forward. Long term, that can be the higher cost. So I can't promise I'll add this to rqlite at the present time, but I'm happy to leave the issue open for discussion.

@otoolep
Member

otoolep commented May 12, 2022

If you can tolerate data loss, have you looked at https://litestream.io/ ?

@otoolep otoolep closed this as completed May 12, 2022
@otoolep otoolep reopened this May 12, 2022
@otoolep
Member

otoolep commented May 12, 2022

Another question -- still considering this. What do you want to happen if some of the requests fail, but others go through? If a bulk request, created locally by rqlite, has more than 1 SQL INSERT, some may fail, some may succeed. So you could even have a partial success, which again would be difficult to track. Do you care? Probably not.

@otoolep
Member

otoolep commented May 12, 2022

Slept on this feature request, and it's more appealing to me now. :-)

I'd just think about semantics a little more. I presume a new API would be OK for this? Something like this:

localhost:4001/db/bulk

Posting writes to this endpoint would return once the receiving node got the data. That node would batch them up, and if the batch reached some max size, or a timeout expired, it would take all the data, create a bulk request, and send it to the leader.

Another way it could work would be the following. You send as many requests as you want to this new endpoint. When you've batched enough, you do this:

localhost:4001/db/bulk?commit=true

and it then sends all the requests it has received in a single batch to the leader -- and the endpoint doesn't return until the commit happens. That way your client code would get a positive signal that the data was written. I actually like this as it's very close to what a client would do if it was doing the batching itself; we just move that code into the rqlite system. WDYT? Any downsides?
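The flush-on-size-or-timeout behavior described above can be sketched client-side. This is a minimal illustration only, not rqlite code: `send` stands in for the POST of a batch to the leader, and all names and defaults here are assumptions for the sketch.

```python
import threading
import time

class WriteBatcher:
    """Collects SQL statements and flushes them as one batch when either
    max_batch statements accumulate or max_wait seconds elapse."""

    def __init__(self, send, max_batch=16, max_wait=0.05):
        self.send = send            # callable taking a list of statements
        self.max_batch = max_batch
        self.max_wait = max_wait
        self._buf = []
        self._lock = threading.Lock()
        self._deadline = None

    def add(self, stmt):
        with self._lock:
            self._buf.append(stmt)
            if self._deadline is None:
                self._deadline = time.monotonic() + self.max_wait
            if len(self._buf) >= self.max_batch:
                self._flush_locked()

    def poll(self):
        # Call periodically (e.g. from a timer) to honor the timeout.
        with self._lock:
            if self._buf and time.monotonic() >= self._deadline:
                self._flush_locked()

    def flush(self):
        # Explicit flush -- the "commit=true" style of operation.
        with self._lock:
            if self._buf:
                self._flush_locked()

    def _flush_locked(self):
        batch, self._buf = self._buf, []
        self._deadline = None
        self.send(batch)
```

The explicit `flush()` corresponds to the commit-style request, while `add()` plus `poll()` corresponds to the transparent size/timeout model.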

@otoolep
Member

otoolep commented May 12, 2022

Alternatively, maybe you really do want the model where there is a queue inside the node: you just keep pushing writes to it, and it takes care of them for you, draining the queue completely transparently. It would mean your client code would do less work, but it gives up a lot of visibility into what is going on. But as you say, perhaps you don't really care, and the trade-off is worth it.

@otoolep
Member

otoolep commented May 12, 2022

There are other things to think about in the queue model:

  • what should be the maximum size of outstanding messages? You don't want to run out of memory
  • what should the queue's behavior be if the leader is not contactable? Buffer? Drop? Block client? (See question above)

It gets more complicated quickly. Ultimately the behaviour will be configurable by the client, but you see what I mean.

@otoolep
Member

otoolep commented May 12, 2022

With a queuing model, I might create the following API:

  • POST http://localhost:4001/db/execute/queue/_default -> send a write request to the default queue; the write request would be in the body of the request. The default queue would always exist. The HTTP request would return once the write had been added to the queue, with no guarantee it will be written to the SQLite database. But it will be, assuming the node doesn't crash.
  • POST http://localhost:4001/db/execute/queue/_default/flush -> explicitly flush the queue now, and wait until the flush completes before returning.
  • POST http://localhost:4001/db/execute/queue/_default/config -> configure the max queue depth, timeout, etc. The queue config would be in the body of the request. The queue would have sensible defaults, however.
  • GET http://localhost:4001/db/execute/queue/_default/config -> get the current configuration for the default queue.
  • POST http://localhost:4001/db/execute/queue -> create new queues, which can be configured separately.
  • GET http://localhost:4001/db/execute/queue -> get a listing of all queues.

@jmordica
Author

😂 you really did sleep on this one.

So to your first point about queueing the requests up on a node: is there a reason the node queuing up the requests wouldn't be the leader? So when a node receives requests, it immediately sends them to the leader, and the leader does the batching/queueing. This should simplify things and keep you from having a bunch of queued requests on follower nodes. It would make it easier to track the queue size and other limits. It seems like that would keep it from getting really complicated.

As far as your last example endpoints go, my initial thought is that this seems a bit over-engineered? The flush endpoint I wouldn't use, because it would require fairly significant code changes to accomplish successfully on the client side (at least in my case). I would opt for some timeout config param in ms, where the leader commits the transactions on its own. The idea of having multiple queues besides the default could be interesting, but seems a bit over-engineered, at least for my purpose (but others may disagree).

I'm just saying, if I'm the only one interested in this along with a few others, maybe it's not worth building it out to the degree you outlined until others show more interest. I do think it solves a need in an interesting way, because with this default queue with a timeout, you now have a distributed KV store that won't need thousands of round-trip requests per second.

@otoolep
Member

otoolep commented May 12, 2022

So to your first point about queueing the requests up on a node, is there a reason the node queuing up the requests wouldn't be the leader?

No reason, if your client code specifically looks for the leader and sends requests to that node. But that requires work, which is not strictly necessary as nodes will forward requests transparently to the leader. It's up to you what you want to do -- I agree you can go straight to the leader if you wish. However, the issue still remains: the moment you introduce a queue, there is always a chance that the write will fail (for whatever reason) at some later time. There must be a policy for what to do if a queued write fails -- whether that queue lives on the leader or a follower.

So when a node receives requests, it immediately sends them to the leader and the leader does the batching/queueing. This should simplify it and keep you from having a bunch of queued requests on follower nodes. Would make it easier to track the queue size, and other limitations. It seems like that would keep it from getting real complicated.

That is going to be an implementation detail. Conceptually queues would live distinct from the leader -- distinct from the Raft system. So what node the queues actually live on will not matter to the client. The client won't be able to tell the difference.

As far as your last example endpoints go, my initial thought is that this seems a bit over engineered? The flush endpoint I wouldn’t use because there would need fairly significant code changes to accomplish this successfully on the client side (at least in my case). I would opt for some timeout config param in ms where the leader commits the transactions on its own. The idea of having multiple queues besides the default could be interesting but seems a bit over engineered, at least for my purpose (but others may disagree).

Over-engineering is a matter of taste and experience, perhaps. :-) I need to think about other people who will come along and want to use the endpoint -- what they will need. To be clear I wouldn't necessarily implement all what I'm proposing at the start, but the API would need to support iterative changes, and the end-state for the API should be clear (as much as possible) when development starts. That way I don't end up with a dead-end API, which needs to be changed in a breaking fashion in the future. As long as I can see how a) I can add an explicit "flush" to the API, and b) add more queues, without making breaking changes, I can focus on your use case to start.

I’m just saying if I’m the only one interested in this along with a few others, maybe its not worth building it out to the degree you outlined until others have more interest. I do think it solves a need in an interesting way because with this default queue with a timeout, you now have a distributed kv store that won’t have 1000’s of round trip requests per second.

With a distinct endpoint I'm much more comfortable exposing this new mode of operation -- that approach makes it much easier to keep it separate from the current write path. Go is a nice language, and it may be possible that this isn't much work. So the ROI on doing this work may be high, that's why I'm now seriously considering it.

@jmordica
Author

There are other things to think about in the queue model:

  • what should be the maximum size of outstanding messages? You don't want to run out of memory
  • what should the queue's behavior be if the leader is not contactable? Buffer? Drop? Block client? (See question above)

It gets more complicated quickly. Ultimately the behaviour will be configurable by the client, but you see what I mean.

I think if you don’t allow the queue to be configured over 1000ms you wouldn’t have to worry much about memory? I can’t think of a reason you would want the queue timeout above 1000ms anyway.

Also the problem if the leader can’t be contacted would somewhat be solved if the follower doesn’t queue the transactions but goes ahead and sends them to the leader as they arrive (to be queued on the leader).

@otoolep
Member

otoolep commented May 12, 2022

I think if you don’t allow the queue to be configured over 1000ms you wouldn’t have to worry much about memory? I can’t think of a reason you would want the queue timeout above 1000ms anyway.

I think you underestimate the size, and number, of requests someone could write in 1 second. :-) But yes, I agree it's probably not an issue in practice. But reliability is important to me; someone should have the ability to set limits, such that the rqlited process does not crash hard.
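The kind of limit being discussed might look like a bounded buffer with an explicit overflow policy. A sketch, where the `DROP`/`BLOCK` policy names, the class, and the defaults are all illustrative, not actual rqlite configuration:

```python
import queue

class BoundedWriteQueue:
    """Bounded queue of pending writes with a configurable overflow policy,
    so the process cannot grow without limit if the leader is unreachable."""

    DROP, BLOCK = "drop", "block"

    def __init__(self, capacity=128, policy=DROP):
        self._q = queue.Queue(maxsize=capacity)
        self.policy = policy
        self.dropped = 0   # count of writes rejected by the DROP policy

    def put(self, stmt):
        if self.policy == self.BLOCK:
            self._q.put(stmt)           # blocks the caller until space frees
            return True
        try:
            self._q.put_nowait(stmt)    # DROP: reject on overflow
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self, max_batch=16):
        """Take up to max_batch queued writes for one bulk request."""
        batch = []
        while len(batch) < max_batch:
            try:
                batch.append(self._q.get_nowait())
            except queue.Empty:
                break
        return batch
```

Either policy keeps memory bounded; the difference is whether back-pressure is pushed onto the client (BLOCK) or the write is silently lost (DROP), which is the trade-off raised above.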

Also the problem if the leader can’t be contacted would somewhat be solved if the follower doesn’t queue the transactions but goes ahead and sends them to the leader as they arrive (to be queued on the leader).

But what if the leader is not contactable in the second case ("as they arrive")? It's the same problem. Just because the queues "live" in the leader node's RAM doesn't actually solve the core problem -- we're dealing with a distributed system here. Successful writes require the leader to contact the followers, so the network is always involved. Anyway, I know you know this -- we can make it work. I'm just pointing out that where the queues live doesn't really have anything to do with solving the core problem.

@otoolep
Member

otoolep commented May 12, 2022

Also, there are a couple of other interesting things to note. If the design doesn't dictate where the queues live, you benefit from spreading the memory load across the nodes in the cluster, avoiding consuming RAM on a single node. Secondly, if you, the client code, want all the queues to live in a particular node -- the leader perhaps -- you can just send all your requests to the leader, and that's exactly what will happen.

@jmordica
Author

I think you underestimate the size of requests, and number, someone could write in 1 second. :-) But yes, I agree it's probably not an issue in practise. But reliability is important to me, someone should have the ability to set limits, such that the rqlited process does not crash hard.

I understand the concern to maintain reliability and agree with this approach of configurable limits with sane defaults.

Also, there are a couple of other interesting things to note. If the design doesn't dictate where the queues live, you benefit from spreading the memory load across the nodes in the cluster, avoiding consuming RAM on a single node. Secondly, if you, the client code, want all the queues to live in a particular node -- the leader perhaps -- you can just send all your requests to the leader, and that's exactly what will happen.

This is a good point. If the client "learns" who the leader is, it can send write transactions directly to it; if not, using memory on other nodes is totally fine. I was just trying to simplify the codebase, but now that I'm thinking about the distributed nature of the system, this approach is much better than my original thought of having only the leader be responsible for queuing.

If it's outlined and documented well, I think this could pull a lot of people away from a traditional mindset when it comes to distributed kv/relational data, especially because most people using these types of higher write throughput systems are either guaranteeing consistency or not. There is nothing I'm aware of that allows you to use a single database for both datasets that need consistency and datasets that don't need consistency (but need higher throughput).

@otoolep
Member

otoolep commented May 12, 2022

OK, let me see what I can do, I'm definitely intrigued. This is going to take a few days, minimum. Perhaps a couple of weeks to get right.

@otoolep
Member

otoolep commented May 17, 2022

I've started coding the implementation, as per the PR above. It's going as I expect, seems fairly straightforward. Fair amount of testing to do yet. If you can build HEAD, I'll let you know when the first usable version merges. You can try it out before I release it officially.

@jmordica
Author

Sweet! Yes I can. Looking good so far.

@otoolep
Member

otoolep commented May 17, 2022

Merged -- can you build HEAD and give it a go? I'm seeing big speed-up, as you can imagine. But of course, this path incurs a small risk of data loss.....

Full docs here: https://github.com/rqlite/rqlite/blob/master/DOC/QUEUED_WRITES.md

Basically, do everything you were doing before, but send your requests to localhost:4001/db/execute/queue/_default.

@otoolep
Member

otoolep commented May 17, 2022

@jmordica -- are you using your own code to access rqlite, or a client library written by someone else?

@otoolep
Member

otoolep commented May 17, 2022

I may still tweak the shape of the API, but the core functionality is done now.

@otoolep
Member

otoolep commented May 17, 2022

#1025

@otoolep
Member

otoolep commented May 17, 2022

I changed the API, it was ugly. It now looks like this: localhost:4001/db/execute?queue

Full docs here: https://github.com/rqlite/rqlite/blob/master/DOC/QUEUED_WRITES.md
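For reference, a queued write is shaped like a normal execute request with the `queue` query parameter added. A minimal Python sketch using only the standard library -- the node address, table, and statements are placeholders, and the linked docs are the authoritative source for the request format:

```python
import json
import urllib.request

RQLITE = "http://localhost:4001"  # placeholder node address

def build_queued_write(statements, node=RQLITE):
    """Build the URL and JSON body for rqlite's queued-write endpoint.
    The body is the same JSON array /db/execute takes; only the
    ?queue parameter differs."""
    url = node + "/db/execute?queue"
    body = json.dumps(statements).encode("utf-8")
    return url, body

def queued_write(statements, node=RQLITE):
    """Send the batch. The node acks once the data is queued, not
    committed, so a crash before the queue drains can lose these writes."""
    url, body = build_queued_write(statements, node)
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example, assuming a running node and an existing table:
# queued_write([["INSERT INTO user_status(user_id, state) VALUES(?, ?)",
#                1, "online"]])
```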

@otoolep
Member

otoolep commented May 17, 2022

So build HEAD, let me know how it goes.

@jmordica
Author

@jmordica -- are you using your own code to access rqlite, or a client library written by someone else?

I'm using the JavaScript library. I'm going to modify it to support the queue functionality, and hopefully the owner will accept PRs.

Sweet! I'll be trying this out tomorrow and will report back. I'll also do a throughput test and compare with/without queueing. Should be interesting!

@otoolep
Member

otoolep commented May 18, 2022

I’ll also do a throughput test

It will appear to be very high if you simply measure how long the write takes, since rqlite will just be adding data to memory and then returning to the caller. You will really need to test how long it takes for the writes to reach the whole cluster. I suggest you play around with the queue parameters (set at launch time) to see what effect that has.
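One way to measure that end-to-end time is to poll the farthest node until the expected row count appears. A sketch, where `count_rows` stands in for a SELECT COUNT(*) issued against that node:

```python
import time

def wait_for_replication(count_rows, expected, timeout=5.0, interval=0.01):
    """Poll count_rows() until it reaches `expected` or the timeout expires.
    Returns elapsed seconds on success, or None if the timeout was hit."""
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        if count_rows() >= expected:
            return time.monotonic() - start
        time.sleep(interval)
    return None

# Usage sketch: start the clock just before sending the queued writes,
# then call wait_for_replication(query_farthest_node, 500) to time how
# long until all 500 inserts are visible cluster-wide.
```

This measures replication visibility rather than ack latency, which is the distinction being made above.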

The API isn't quite stable yet, I may need to change its shape slightly as we go. So I would hold off on any PR just yet (though that is a great idea once the API is finalized).

@jmordica
Author

For sure. I just mean I’m going to measure how many actual writes per second happen across the cluster while adjusting the queue timeout.

@otoolep
Member

otoolep commented May 18, 2022

Cool.

I have a feeling the queue capacity is a bit high, batch size too. I think lower numbers would still result in major improvements, perhaps set the queue capacity to 128 and the batch size 16. Timeout maybe 50ms.

@otoolep
Member

otoolep commented May 20, 2022

Any updates @jmordica -- I'm still doing some development, tweaking the API surface a little. The changes will not be large, mostly focused around providing a way to confirm a given queue has been flushed, and safely committed to disk. I think some clients may wish to queue up a bunch of writes, but checkpoint every so often.

@jmordica
Author

Hey there! I plan to do the tests tonight and report back. Good idea here.

@jmordica
Author

I ran out of time tonight but I’ll be getting on it this weekend and will report back.

@jmordica
Author

Wow, that's fast.

I was able to do a build and deploy to us-east1, us-central1, and us-west1, with latency upwards of ~60ms. I started out with the defaults of the queue feature, with the client running on the same machine as the leader, and without using the -write-queue-tx flag.

I used a small bash script to loop over a bunch of inserts (500) and sent them via the queue endpoint (waiting for ack before sending the next). Obviously rqlite returned to the client instantly after each insert. I saw no failures and these writes took under a second to complete.

In the time it took to hit enter to run this script and immediately run a select query on the farthest node, the data was already on the farthest node (resulting in 500 successful inserts in under a second).

I tried the same thing again with 1000 inserts and it took almost 2 seconds to complete. However, the script running the inserts didn't even finish for a whole 2 seconds, so those 2 seconds weren't due to a queue or latency issue, but more of a curl/bash/userspace thing.

I didn't test with multiple clients writing simultaneously, but I can. I'm assuming you can get even more throughput than ~500 per/sec, but anything above that is quite insane cross-region and would be outside the bounds of normal business use.

I'm also assuming that by adjusting the queue timeout you could squeeze more out of it, but why bother.

~50ms seems like a good default. You basically get 4-6 successful writes with ~60ms of latency. The payload (size of the batch) doesn’t seem to have a big impact.

@otoolep
Member

otoolep commented May 23, 2022

Great, thanks for the update. Yeah, it's not that surprising it's so much faster -- bigger batches were always the way to get more throughput. It's not just the network effect either; the batches are also rolled into a single Raft log entry, so the disk-write overhead is amortized over the larger amount of data.

Still, good to get confirmation it's working. I'm tweaking the API a little, to make it easier to know when a given queued request has actually been written to Raft. The API won't change that much though, and once I'm finished, I'll release 7.5.0.

@otoolep
Member

otoolep commented May 23, 2022

This should be done now, fixed on master.

@otoolep otoolep closed this as completed May 23, 2022
@otoolep
Member

otoolep commented May 23, 2022

@jmordica -- I'd appreciate if you could build the latest source, and try it out. It should work the same from your point-of-view, but there are some extra flags that might be useful to you:

https://github.com/rqlite/rqlite/blob/master/DOC/QUEUED_WRITES.md

Basically you can wait on the request, and when it returns you can be sure that all queued writes on that node have been persisted to disk. That might be useful to you.

If nothing else, building and running the latest code will give me some confidence that it's ready for release (even if you don't use the new flags, just trying it out to make sure it's stable is helpful).
