Optimize robotparser for long list of rules #149381

@serhiy-storchaka

Description

Previously, robotparser implemented an old pre-standard specification which nobody uses anymore. It returned the result after finding the first matching rule, which gave incorrect results in many cases (see #83368). After #138907 it follows the longest-path rule.

The code can be optimized, for example by sorting rules by path length, matching them from longest to shortest, and stopping as soon as a match is found, since no remaining (shorter) path can produce a longer match. This only works for paths that do not contain the metacharacters * and $.
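A minimal sketch of that idea, assuming literal rules only (no * or $). The `Rule` class and `find_longest_match` function are illustrative names, not robotparser's actual internals:

```python
# Hypothetical sketch of the early-exit idea for literal rules
# (paths without the * and $ metacharacters).  Rule and
# find_longest_match are illustrative, not robotparser's real API.


class Rule:
    def __init__(self, path, allowance):
        self.path = path
        self.allowance = allowance


def find_longest_match(rules, url):
    """Return the allowance of the longest literal rule matching url.

    rules must be pre-sorted by path length, longest first, so the
    first prefix match is guaranteed to be the longest one and the
    scan can stop there.
    """
    for rule in rules:  # longest path first
        if url.startswith(rule.path):
            return rule.allowance  # no longer match can follow
    return True  # no rule matched: allowed by default


rules = sorted(
    [Rule("/private/", False), Rule("/", True), Rule("/private/data/", False)],
    key=lambda r: len(r.path),
    reverse=True,
)
print(find_longest_match(rules, "/private/data/x"))  # False
print(find_longest_match(rules, "/public"))          # True
```

Because the list is sorted longest-first, the loop can also stop early once the matched prefix length exceeds the length of the next rule's path, which is what makes the average case cheap.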

Other optimizations are also possible, for example a trie-like structure, which could also handle paths with metacharacters. But this would significantly complicate the code.
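A rough sketch of the trie idea for literal paths only (handling * and $ is exactly the complication mentioned above and is omitted). This is my own illustration, not code from the linked PR: each character of a rule path is a trie edge, and a node that ends a rule stores its allowance; lookup walks the URL once and remembers the allowance of the deepest rule seen, giving longest-match semantics in O(len(url)) regardless of the number of rules:

```python
# Hypothetical character-level trie for literal robots.txt paths.
# Longest-match lookup: the deepest node with a stored allowance wins.


class TrieNode:
    __slots__ = ("children", "allowance")

    def __init__(self):
        self.children = {}
        self.allowance = None  # set if a rule path ends at this node


def insert(root, path, allowance):
    node = root
    for ch in path:
        node = node.children.setdefault(ch, TrieNode())
    node.allowance = allowance


def lookup(root, url):
    node, result = root, True  # allowed by default
    for ch in url:
        node = node.children.get(ch)
        if node is None:
            break  # no rule extends this far
        if node.allowance is not None:
            result = node.allowance  # deeper (longer) match wins
    return result


root = TrieNode()
insert(root, "/private/", False)
insert(root, "/private/data/", True)
print(lookup(root, "/private/data/x"))  # True (longest rule allows)
print(lookup(root, "/private/other"))   # False
print(lookup(root, "/public"))          # True (no matching rule)
```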

I am not actually sure that such an optimization is necessary. In most cases the number of rules should not be too large, which is why I did not include it in the previous PR. We need to collect some data first, so I am publishing my code as a draft.

Labels: performance (Performance or resource usage), stdlib (Standard Library Python modules in the Lib/ directory), type-feature (A feature request or enhancement)
