
RFD 110: Discussion #52

Open
jmwski opened this issue Aug 22, 2017 · 5 comments

Comments

@jmwski
Contributor

jmwski commented Aug 22, 2017

This issue represents an opportunity for discussion of RFD 108 Operator Configurable Throttles for Manta while it remains in a pre-draft state.

@jmwski jmwski changed the title RFD 108: Discussion RFD 109: Discussion Aug 23, 2017
@trentm trentm changed the title RFD 109: Discussion RFD 108: Discussion Aug 23, 2017
@jmwski jmwski changed the title RFD 108: Discussion RFD 110: Discussion Sep 1, 2017
@trentm
Contributor

trentm commented Sep 11, 2017

Note that restify@5.x has server.inflightRequests() and a simple plugin to respond with 429s if the number of inflight requests goes above a hard threshold: https://github.com/restify/node-restify/blob/master/lib/plugins/inflightRequestThrottle.js
This seems equivalent to the "concurrency + queueTolerance" behavior described in the current proposal.

A discussion of, or comparison with, restify's current throttle plugin (http://restify.com/docs/plugins-api/#throttle) would be useful. Sorry if I missed that discussion elsewhere.
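
For reference, a minimal sketch of the inflight-counting approach described above, written as a plain restify pre handler rather than the plugin itself; the threshold, handler name, and port are illustrative assumptions, not restify's API:

```js
// Illustrative sketch only: a hand-rolled inflight-request throttle in the
// spirit of restify's inflightRequestThrottle plugin. The threshold and
// handler name are made up for this example.
var restify = require('restify');

var MAX_INFLIGHT = 100;         // roughly "concurrency + queueTolerance"
var inflight = 0;

var server = restify.createServer();

server.pre(function throttleInflight(req, res, next) {
    if (inflight >= MAX_INFLIGHT) {
        // Too many requests already running or queued: shed load.
        res.send(429, { message: 'server busy, retry later' });
        return (next(false));
    }
    inflight++;
    res.once('finish', function () {
        inflight--;
    });
    return (next());
});

server.get('/ping', function (req, res, next) {
    res.send(200, { ok: true });
    return (next());
});

server.listen(8080);
```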

@jmwski
Contributor Author

jmwski commented Sep 11, 2017

At a manta call a couple of weeks back I proposed using a similar restify plugin, but we decided against it because (1) not all of the services that we want to throttle use restify, and (2) we didn't want to add a dependency on restify. I can include some notes about this in the RFD itself.

@rmustacc
Contributor

Thanks for putting this together.

Is it possible to get a discussion of why we ended up going with a per-process model with no state shared between them? I'm not sure if this was just for simplicity, if other things were ruled out already, or not. I mostly ask because other major cases that have come up for throttling are to have per-user throttles, where the stateless nature can break down right away. In addition, it can be harder to reason about what the per-process throttle should be, as that ends up meaning that the throttle scales with the number of zones. Also, was there any consideration for doing the throttling based on the backend resources? For example, the global handled request rate may vary a lot depending on the actual consumed resources.

Is throttling going to be enabled by default, or will it be disabled by default and left up to operators to enable somehow? If the latter, how do we plan on managing this across the fleet, even statically? Would these be sapi properties that config-agent picks up, or is there some other means that we're expecting to use?

@jmwski
Contributor Author

jmwski commented Sep 12, 2017

The per-process muskie throttle was just an initial iteration, intended mostly as a proof of concept but also as a base implementation to build a more generic library on.

I think making throttling decisions based on the availability of backend resources is a good idea. Do you have any suggestions as to what specific resources would be useful to take into account and how I might be able to interface with such metrics from a node module?
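
As one purely illustrative example of a local pressure signal a node process can sample with no external dependencies, the sketch below measures event-loop lag; the interval, threshold, and function names are assumptions for illustration, not something the RFD proposes:

```js
// Illustrative only: sample event-loop lag as a crude local "pressure"
// signal that a throttle could consult. Interval and threshold are made up.
var SAMPLE_MS = 500;
var lastSample = Date.now();
var loopLagMs = 0;

setInterval(function sampleLag() {
    var now = Date.now();
    // Any time beyond the expected interval is time the event loop was blocked.
    loopLagMs = Math.max(0, now - lastSample - SAMPLE_MS);
    lastSample = now;
}, SAMPLE_MS);

function underPressure() {
    // Hypothetical policy: start shedding load once the loop lags by >= 100ms.
    return (loopLagMs >= 100);
}
```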

The eventual plan is to have a globally aware, stateful throttling service. If I recall correctly, in an early discussion at the manta calls we decided that sapi might not be the right place for these tunables to be pushed out, because those are intended for long-lived properties that don't change frequently (I'm not 100% certain, but I seem to recall dap saying something along those lines).

In the RFD I propose the idea of having an HTTP endpoint for setting and getting throttle parameters dynamically. I guess what I really want is for it to play the role of an agent that could be queried periodically by a global throttling service, which could then make decisions based on information from all the muskie processes.
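
A rough sketch of what such an agent-style endpoint might look like, using only node's built-in http module (the route, field names, and port are hypothetical, not what the RFD specifies):

```js
// Hypothetical GET/PUT endpoint for throttle tunables, for illustration only.
var http = require('http');

var throttleParams = {
    concurrency: 50,
    queueTolerance: 25,
    enabled: false
};

http.createServer(function (req, res) {
    if (req.url !== '/throttle/params') {
        res.writeHead(404);
        res.end();
        return;
    }
    if (req.method === 'GET') {
        res.writeHead(200, { 'content-type': 'application/json' });
        res.end(JSON.stringify(throttleParams));
        return;
    }
    if (req.method === 'PUT') {
        var body = '';
        req.on('data', function (chunk) { body += chunk; });
        req.on('end', function () {
            try {
                var update = JSON.parse(body);
                // Only accept keys that are already known tunables.
                Object.keys(update).forEach(function (k) {
                    if (throttleParams.hasOwnProperty(k)) {
                        throttleParams[k] = update[k];
                    }
                });
                res.writeHead(204);
            } catch (e) {
                res.writeHead(400);
            }
            res.end();
        });
        return;
    }
    res.writeHead(405);
    res.end();
}).listen(8081);
```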

I think maybe leaving throttling off by default might be less invasive, but I could also see having a boolean per-service parameter that decides whether throttling is enabled.

UPDATE: I've modified the RFD to (hopefully) address the above feedback! Thanks to everyone who has read through it so far!

@kellymclaughlin

kellymclaughlin commented Sep 18, 2017

Returning a 429 for every throttled request seems potentially confusing since the throttle is not actually looking at the user account making the requests to make decisions. It seems like there are two ideas here: user rate-limiting and backpressure to prevent system overload. This makes me think about two different things:

  1. It seems like most of the cases we're trying to handle with the initial iteration are actually server-side problems, and something like a 503 would be more appropriate. For internal services like moray, where database replication is laggy, that isn't a user problem; it's our internal service that is unable to keep up. If the return status code is going to be static, maybe it should at least be configurable.

  2. Maybe the throttle (now or in the future) should have a backpressure component for managing overload that receives feedback from the servers it communicates with, as well as a user rate-limiting component. It would be great to be able to distinguish these two cases and return different status codes based on the problem (see the sketch after this list).
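
To make that concrete, a tiny sketch in which the shed status code is chosen per cause rather than hard-coded; the reason names and mapping are hypothetical, and `res` is assumed to be a restify response object:

```js
// Illustrative only: map the reason a request is shed to a status code rather
// than always returning 429. Reason names and mapping are hypothetical.
var STATUS_BY_REASON = {
    userRateLimit: 429,     // this particular account is over its limit
    serverOverload: 503     // the service (or a backend) cannot keep up
};

function rejectRequest(res, reason) {
    var code = STATUS_BY_REASON[reason] || 503;
    res.send(code, {
        message: 'request throttled',
        reason: reason
    });
}
```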

As an aside, I wonder if it would be a good idea to standardize our manta components to all use either an HTTP framework like restify or the RPC framework. I can't claim to know the history about why different pieces use one or the other, but having more commonality has a lot of advantages when learning and debugging and also means less software (and accompanying bugs) to support.

@jmwski jmwski closed this as completed Sep 29, 2017
@jmwski jmwski reopened this Sep 29, 2017