
XMLRPC statistics on "abusive" requests #9136

Open
abitrolly opened this issue Feb 25, 2021 · 6 comments
Labels: needs discussion (a product management/policy issue maintainers and users should discuss)

Comments

@abitrolly
Contributor

What's the problem this feature will solve?

An ongoing two-month outage of XMLRPC search, reported at https://status.python.org/incidents/grk0k7sz6zkp, could be solved by optimizing or caching popular queries (a toy caching sketch follows).
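
For illustration only, caching popular queries might look roughly like this; the TTL, the query shape, and `backend` are assumptions, not Warehouse code:

```python
# Toy TTL cache for popular search queries, keyed on a normalized form of
# the query dict. Purely illustrative, not Warehouse code.
import time

_CACHE = {}
TTL_SECONDS = 300

def _key(query):
    # Normalize so {"name": ["requests"]} and insertion order don't matter.
    return tuple(
        (field, tuple(value) if isinstance(value, list) else value)
        for field, value in sorted(query.items())
    )

def cached_search(query, backend):
    key = _key(query)
    entry = _CACHE.get(key)
    if entry and time.monotonic() - entry[0] < TTL_SECONDS:
        return entry[1]  # cache hit: skip the expensive backend call
    result = backend(query)
    _CACHE[key] = (time.monotonic(), result)
    return result
```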

Describe the solution you'd like

I'd like to see the volume and contents of the following (a collection sketch is below the list):

  • the most popular API requests
  • the longest-running API requests
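
For instance, such stats could be aggregated from request logs along these lines; the JSON-lines log format and field names are assumptions, not Warehouse's actual logs:

```python
# Sketch: surface the most popular and the slowest XMLRPC requests from a
# JSON-lines log of {"method": ..., "query": ..., "ms": ...} records.
# The log format is an assumption.
import json
from collections import Counter

def summarize(log_path, top=10):
    popular = Counter()
    slowest = []
    with open(log_path) as fh:
        for line in fh:
            rec = json.loads(line)
            key = (rec["method"], rec["query"])
            popular[key] += 1
            slowest.append((rec["ms"], key))
    slowest.sort(reverse=True)
    return popular.most_common(top), slowest[:top]
```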

Additional context

Depending on the statistics, it may be possible to provision additional index servers to offload API requests, or to provide a way for organizations to incrementally sync the database. Sync could be done either through global event notifications, similar to Fedora Messaging, or through the standard P2P Merkle-tree lookup mechanism employed by blockchains; a toy sketch of the latter follows.
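
To make the Merkle-tree idea concrete, here is one level of such a tree as a toy sketch; bucketing by first letter and the `{name: serial}` shape are assumptions, not an existing PyPI mechanism:

```python
# Toy Merkle-style diff: hash buckets of (name, serial) pairs; a mirror
# compares its bucket hashes against the server's and re-fetches only the
# buckets whose hashes differ.
import hashlib
from collections import defaultdict

def bucket_hashes(projects):
    """projects: {name: last_serial}. Returns {bucket_key: sha256 hex}."""
    buckets = defaultdict(list)
    for name, serial in sorted(projects.items()):
        buckets[name[0]].append(f"{name}:{serial}")
    return {
        key: hashlib.sha256("\n".join(rows).encode()).hexdigest()
        for key, rows in buckets.items()
    }

def changed_buckets(local, remote):
    lh, rh = bucket_hashes(local), bucket_hashes(remote)
    return {key for key in set(lh) | set(rh) if lh.get(key) != rh.get(key)}
```

A real tree would nest this, so a single root hash answers "did anything change?" before any buckets are compared.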

@ewdurbin
Member

ewdurbin commented Mar 4, 2021

[Graph: XMLRPC call rate over time]

Our current attempted call rate for the disabled search endpoint is roughly 100 rps (yellow trace). All of these requests receive either a rate-limit response (brown trace) or a disabled response (red trace). The call rate has not changed since we implemented rate limiting or disabled search.

The issue isn't solely one of provisioning resources to sustain the search volume; it is that we don't have any viable mechanism to communicate with the users of the very expensive XMLRPC API who abuse the endpoint. Architecturally, XMLRPC being based on POST requests, combined with the high cardinality of results (search queries are arbitrary), makes caching this at the CDN edge, or otherwise reducing the load imposed on our backends, untenable in the long run. For example:
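
```python
# Every XMLRPC call is "POST /pypi" with the query inside the XML body,
# so an edge cache keyed on method + URL sees identical requests for
# completely different queries. (search now returns a disabled fault.)
import xmlrpc.client

client = xmlrpc.client.ServerProxy("https://pypi.org/pypi")
client.search({"name": "requests"})   # POST /pypi, query only in the body
client.search({"name": "flask"})      # same method, same URL, different body
```

A GET-based search with the query in the URL would be trivially cacheable; the POST body is what defeats the edge.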

Our current search is based on Elasticsearch, which I'm not familiar enough with to determine whether such incremental syncs are viable.
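
For illustration only, an incremental sync might look roughly like this, assuming a change feed (the hypothetical `fetch_changed()` below) that does not exist today:

```python
# Hypothetical incremental sync: pull projects changed since a serial and
# bulk-upsert them into a local Elasticsearch index.
from elasticsearch import Elasticsearch, helpers

def fetch_changed(since):
    """Placeholder: yield {"name": ..., "summary": ...} docs changed
    after `since`; providing such a feed is the open question here."""
    return iter(())

def sync_changes(es, last_serial):
    actions = (
        {"_index": "projects", "_id": doc["name"], "_source": doc}
        for doc in fetch_changed(last_serial)
    )
    ok, _errors = helpers.bulk(es, actions, raise_on_error=False)
    return ok  # number of documents successfully indexed
```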

@abitrolly
Contributor Author

@ewdurbin is it possible to publish popularity stats for these ~100 rps without actually serving the requests? Without that, we can only state that optimization in the general sense is impossible.

@ewdurbin
Member

ewdurbin commented Mar 4, 2021

Popularity in what sense?

@abitrolly
Contributor Author

The structure of the requests: which queries are sent, and how popular each kind is. Then it will be possible to determine the overhead of particular query structures, set selective filters to cut the expensive requests, and optimize the most popular ones further; see the sketch below.
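
Something like this (illustrative; the query shape and `handler` are assumptions):

```python
# Sketch: bucket search calls by query "shape" (which fields are queried,
# not their values) so per-structure cost and popularity can be compared.
import time
from collections import Counter, defaultdict

shape_counts = Counter()
shape_seconds = defaultdict(float)

def record(query, handler):
    shape = tuple(sorted(query))  # e.g. ("name",) or ("name", "summary")
    start = time.monotonic()
    result = handler(query)
    shape_counts[shape] += 1
    shape_seconds[shape] += time.monotonic() - start
    return result
```

Expensive shapes with low counts are candidates for filtering; cheap, popular shapes are candidates for caching.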

@di
Member

di commented Mar 11, 2021

How do you propose to "set selective filters to cut expensive requests" and how would that be less expensive than the current response?

@di added the needs discussion label Mar 11, 2021
@abitrolly
Contributor Author

Filters can be set at the load balancer, at the web server, at the middleware, or at the Django level. It might even be possible to set them at the SQL level, if the database's EXPLAIN output shows that a query is too expensive to run. Whatever method is chosen, it depends on metrics; the best way is to add OpenTracing, of course. Maybe the "abusive" requests are just malformed XML that makes the parser choke. A middleware-level sketch is below.
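
A minimal WSGI-level sketch of such a filter (framework-agnostic, so it could sit in front of Django or anything else; the /pypi path, the method list, and the fault text are assumptions, not Warehouse's actual code):

```python
# Peek at the XMLRPC method name and short-circuit expensive calls before
# they reach the application. Illustrative only.
import io
import xmlrpc.client

class BlockExpensiveCalls:
    BLOCKED = {"search"}

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        if environ["REQUEST_METHOD"] == "POST" and environ.get("PATH_INFO") == "/pypi":
            length = int(environ.get("CONTENT_LENGTH") or 0)
            body = environ["wsgi.input"].read(length)
            try:
                _params, method = xmlrpc.client.loads(body)
            except Exception:
                method = None  # malformed XML: worth counting separately too
            if method is None or method in self.BLOCKED:
                fault = xmlrpc.client.dumps(
                    xmlrpc.client.Fault(-32500, "search is disabled"),
                    methodresponse=True,
                )
                start_response("200 OK", [("Content-Type", "text/xml")])
                return [fault.encode("utf-8")]
            environ["wsgi.input"] = io.BytesIO(body)  # replay body downstream
        return self.app(environ, start_response)
```

Counting how often the malformed branch fires would also answer whether the "abusive" traffic is broken clients rather than deliberate load.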
