robotparser doesn't handle URL's with query strings #50574

Closed
skybrian mannequin opened this issue Jun 23, 2009 · 3 comments
Labels: stdlib (Python modules in the Lib dir), type-bug (An unexpected behavior, bug, or error)

Comments

@skybrian
Mannequin

skybrian mannequin commented Jun 23, 2009

BPO 6325
Nosy @orsenthil
Files
  • 6325.diff

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/orsenthil'
    closed_at = <Date 2010-07-28.16:37:55.548>
    created_at = <Date 2009-06-23.04:25:48.828>
    labels = ['type-bug', 'library']
    title = "robotparser doesn't handle URL's with query strings"
    updated_at = <Date 2010-07-28.16:37:55.546>
    user = 'https://bugs.python.org/skybrian'

    bugs.python.org fields:

    activity = <Date 2010-07-28.16:37:55.546>
    actor = 'orsenthil'
    assignee = 'orsenthil'
    closed = True
    closed_date = <Date 2010-07-28.16:37:55.548>
    closer = 'orsenthil'
    components = ['Library (Lib)']
    creation = <Date 2009-06-23.04:25:48.828>
    creator = 'skybrian'
    dependencies = []
    files = ['18218']
    hgrepos = []
    issue_num = 6325
    keywords = ['patch']
    message_count = 3.0
    messages = ['89622', '111687', '111831']
    nosy_count = 3.0
    nosy_names = ['orsenthil', 'skybrian', 'mikejs']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue6325'
    versions = ['Python 3.1', 'Python 2.7', 'Python 3.2']

    @skybrian
    Mannequin Author

    skybrian mannequin commented Jun 23, 2009

    If a robots.txt file contains a rule of the form:

    Disallow: /some/path?name=value

    then, as far as I can tell, the pattern will never match a URL passed to
    can_fetch().
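A minimal reproduction sketch of the case described above. It is written for Python 3, where the module lives at `urllib.robotparser`, rather than the Python 2 `robotparser` module the report was tested against. The host `example.com`, the agent name `ExampleBot`, and the helper `is_allowed` are placeholders chosen for illustration. On an interpreter that includes the fix, the first URL should be reported as disallowed; on an affected version it would be reported as fetchable.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content using the rule from the report.
ROBOTS_TXT = """\
User-agent: *
Disallow: /some/path?name=value
"""


def is_allowed(url, useragent="ExampleBot"):
    """Parse ROBOTS_TXT in memory and ask whether `url` may be fetched."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(useragent, url)


if __name__ == "__main__":
    # URL whose path and query string match the Disallow rule exactly.
    print(is_allowed("http://example.com/some/path?name=value"))
    # Unrelated URL, not covered by any rule.
    print(is_allowed("http://example.com/other"))
```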

    It's arguable whether this is a bug. The 1994 robots.txt protocol is
    silent on whether to treat query strings specially and just says "any
    URL that starts with this value will not be retrieved". The 1997 draft
    standard talks about the path portion of a URL but doesn't give any
    examples about how to treat the '?' character in a robots.txt pattern.

    Google extends the protocol to allow wildcard characters in a way that
    doesn't treat the '?' character specially. See:
    http://www.google.com/support/webmasters/bin/answer.py?answer=40360&cbid=-1rdq1gi8f11xx&src=cb&lev=answer#3

    I'll leave aside whether to implement pattern matching, but it seems
    like a good idea to do something reasonable when a robots.txt pattern
    contains a literal '?', and treating it as a literal character seems
    simplest.

    Cause: in robotparser.can_fetch(), there is this code which seems to
    take only the path (stripping the query string).

     url = urllib.quote(urlparse.urlparse(urllib.unquote(url))[2]) or "/"
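To illustrate why the expression above discards the query string, here is a small sketch using the Python 3 equivalents (`urllib.parse.urlparse`, `quote`, `unquote`) of the Python 2 calls quoted in the report. Index 2 of the parse result is the path component only; the query lives at index 4. The function name `path_only` and the `example.com` URLs are illustrative, not part of the module.

```python
from urllib.parse import quote, unquote, urlparse


def path_only(url):
    """Mirror the quoted expression: keep just the path (index 2), re-quoted."""
    return quote(urlparse(unquote(url))[2]) or "/"


if __name__ == "__main__":
    parts = urlparse("http://example.com/some/path?name=value")
    print(parts[2])  # path component
    print(parts[4])  # query component, which the expression never looks at
    print(path_only("http://example.com/some/path?name=value"))
```

Because the value compared against the rules never contains `?name=value`, a rule that includes a query string has nothing to match.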

    Also, when parsing patterns in the robots.txt file, a '?' character
    seems to be automatically URL-escaped. There's nothing in a standards
    doc about doing this so I think that might be a bug too.
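The escaping mentioned above can be seen in isolation. This sketch assumes the rule path is passed through `quote` with its default safe characters (only `/` left unescaped), which is what the reporter's description suggests; it uses Python 3's `urllib.parse.quote` in place of Python 2's `urllib.quote`. The helper name `escape_rule_path` is hypothetical.

```python
from urllib.parse import quote


def escape_rule_path(rule_path):
    """Percent-encode a robots.txt rule path with quote()'s default safe set."""
    return quote(rule_path)


if __name__ == "__main__":
    # '?' becomes %3F and '=' becomes %3D, so the stored rule no longer
    # contains a literal question mark.
    print(escape_rule_path("/some/path?name=value"))
```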

    Tested with Python 2.4. I also looked at the code in Subversion head, and it
    doesn't look like anything has changed on the trunk.

    @skybrian skybrian mannequin added the stdlib (Python modules in the Lib dir) and type-bug (An unexpected behavior, bug, or error) labels Jun 23, 2009
    @BreamoreBoy BreamoreBoy mannequin assigned orsenthil Jul 10, 2010
    @mikejs
    Mannequin

    mikejs mannequin commented Jul 27, 2010

    The supplied patch matches rules that include query params.

    @orsenthil
    Member

    I modified the patch slightly (so that it takes care of path, query, params and fragments).

    Fixed in r83209, r83210 and r83211.
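One way to picture a fix that "takes care of path, query, params and fragments" is to rebuild a host-less URL from those four components and apply the same normalization to both the rule and the candidate URL before a prefix comparison. The sketch below is an illustration under that assumption, not the committed patch; `comparable` and `rule_applies` are hypothetical names, and it uses Python 3's `urllib.parse`.

```python
from urllib.parse import quote, unquote, urlparse, urlunparse


def comparable(url_or_rule):
    """Drop scheme and host; keep path, params, query and fragment, re-quoted."""
    parsed = urlparse(unquote(url_or_rule))
    rebuilt = urlunparse(
        ("", "", parsed.path, parsed.params, parsed.query, parsed.fragment)
    )
    return quote(rebuilt) or "/"


def rule_applies(rule_path, url):
    """Prefix match after normalizing the rule and the URL the same way."""
    return comparable(url).startswith(comparable(rule_path))


if __name__ == "__main__":
    rule = "/some/path?name=value"
    print(rule_applies(rule, "http://example.com/some/path?name=value"))
    print(rule_applies(rule, "http://example.com/some/path"))
```

Because both sides go through the same quoting, the `?` is escaped consistently and the query string takes part in the prefix match.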

    I also think that robotparser should be extended to allow regexes in the Allow and Disallow patterns. (I shall open an issue in the tracker if one is not already present.)

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022