
proxy: add mcp.backend_depth_limit(int) #1137

Merged: 1 commit into memcached:next from proxy_bedepthlim, May 20, 2024

Conversation

@dormando (Member) commented May 16, 2024

If a backend's queue depth exceeds this limit, fast-fail any further requests.

A global parallel request limit can mean a single slow or overloaded backend causes the entire proxy to stop working. As a first layer of defence, the queue depth for a particular backend should be capped.
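
For illustration, a minimal sketch of how the new tunable might sit in a proxy Lua config. This assumes the usual mcp_config_pools/mcp_config_routes layout from the proxy examples; the backend addresses, the limit value of 500, and the route function are placeholders, and only mcp.backend_depth_limit(int) itself is the API added by this PR:

```lua
-- pool/settings stage: cap any single backend's request queue depth.
function mcp_config_pools()
    -- new tunable from this PR: once a backend has this many requests
    -- queued, further requests to it fail fast instead of piling up.
    -- 500 is an arbitrary example value.
    mcp.backend_depth_limit(500)

    local b1 = mcp.backend('b1', '127.0.0.1', 11211)
    local b2 = mcp.backend('b2', '127.0.0.1', 11212)
    return mcp.pool({ b1, b2 })
end

-- route stage: nothing special is needed here; a request routed to an
-- over-limit backend returns an error instead of queueing behind it.
function mcp_config_routes(pool)
    mcp.attach(mcp.CMD_ANY_STORAGE, function(r)
        return pool(r)
    end)
end
```

The point of the per-backend cap is that one stuck backend burns at most its own depth limit worth of in-flight requests, rather than consuming the global parallel request limit and starving every other backend.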

TODO:

- [ ] might add a counter for requests failed due to a bad backend while I'm in here.
- [ ] might also add a tunable for fast-failing while a backend is connecting/validating. been meaning to do that forever.

delaying further TODOs until after the TLS work is in testing.

@dormando added the needs review/testing and proxy labels May 16, 2024
@dormando (Member, Author)

definitely failing the stability suite... requires a combo of running the fault-blocked test first and then any other test after that. working on it.

@dormando (Member, Author)

dumbassery.

@dormando (Member, Author)

got a bizarre protocol desync failure an hour in... not reproducing it on a retry using short times. need to re-run using longer timings. blocking merge until I can get multiple clean full runs.

@dormando (Member, Author) commented May 17, 2024

passed the full suite at 10s and at 30s per test. running again at 60s with the same instance as the 30s run. sadly fat-fingered the wrong window and didn't dump a core when it failed earlier. :/ This might be a rare unrelated problem.

If it passes the 60s suite I'll still merge, but will try to keep running the tests every day for a while to see if there are any other rare issues.

@dormando (Member, Author)

New stat does get hit:

STAT proxy_config_reloads 875
STAT proxy_config_reload_fails 0
STAT proxy_config_cron_runs 0
STAT proxy_config_cron_fails 0
STAT proxy_backend_total 60
STAT proxy_backend_marked_bad 222
STAT proxy_backend_failed 9844
STAT proxy_request_failed_depth 848466

wonder if I have an mcshredder bug that's seeing a protocol desync?

@dormando (Member, Author)

STAT proxy_conn_requests 24064087641
STAT proxy_conn_errors 0
STAT proxy_conn_oom 0
STAT proxy_req_active 0
STAT proxy_await_active 0
STAT proxy_config_reloads 7692
STAT proxy_config_reload_fails 0
STAT proxy_config_cron_runs 0
STAT proxy_config_cron_fails 0
STAT proxy_backend_total 60
STAT proxy_backend_marked_bad 1450
STAT proxy_backend_failed 83787
STAT proxy_request_failed_depth 8268609

24b queries, 7.6k reloads... didn't see the earlier issue again. going to merge this now.

@dormando merged commit 4a75ac2 into memcached:next May 20, 2024
1 check passed
@dormando deleted the proxy_bedepthlim branch May 20, 2024 16:35