web.cookies() -- improve error handling, add unicode support, and speed it up #148

Closed
benhoyt opened this Issue Apr 26, 2012 · 4 comments

Projects

None yet

2 participants

Contributor
benhoyt commented Apr 26, 2012

We (Oyster.com) had a few issues with cookie handling:

  • We call util.cookies() a few times throughout a particular request, and it may be called up to say 10 times. Because Cookie.SimpleCookie is quite slow, we found during profile that this was a significant part of the request time. We've made two speed-ups: (1) store the parsed cookies in web.ctx._parsed_cookies for later use so if you call web.cookies() 10 times you only pay for 1, and (2) it uses simple string splitting if it looks okay to do so, otherwise it falls back to SimpleCookie.
  • SimpleCookie, and hence web.cookies(), is not very liberal on what it accepts. If there's bad user input in the cookie header, web.cookies() will raise CookieError and you'll get no cookies. Our version loads the cookies it can, and ignores errors.
  • Our version decode unicode cookies, like web.input().

Our version is pasted below for review/inclusion if you like:

def decode_cookie(value):
    r"""Safely decode a cookie value, handling UTF-8 and ISO8859 nicely. If the value is
    an ASCII string, return a byte string, otherwise return a unicode string.

    >>> decode_cookie('')
    ''
    >>> decode_cookie('asdf')
    'asdf'
    >>> decode_cookie('foo %C3%A9 bar')
    u'foo \xe9 bar'
    >>> decode_cookie('foo %E9 bar')
    u'foo \xe9 bar'

    """
    value = urllib.unquote(value)
    try:
        # First try to return a plain ASCII byte string
        return str(unicode(value, 'us-ascii'))
    except UnicodeError:
        # Then try UTF-8, and if that fails, ISO8859
        try:
            return unicode(value, 'utf-8')
        except UnicodeError:
            return unicode(value, 'iso8859', 'ignore')

def parse_cookies(http_cookie):
    """Parse a HTTP_COOKIE header and return dict of decoded cookie names and values."""
    if '"' in http_cookie:
        # HTTP_COOKIE has quotes in it, use slow but correct cookie parsing
        cookie = Cookie.SimpleCookie()
        try:
            cookie.load(http_cookie)
        except Cookie.CookieError:
            # If HTTP_COOKIE header is malformed, try at least to load the cookies we can by
            # first splitting on ';' and loading each attr=value pair separately
            cookie = Cookie.SimpleCookie()
            for attr_value in http_cookie.split(';'):
                try:
                    cookie.load(attr_value)
                except Cookie.CookieError:
                    pass
        cookies = dict((k, decode_cookie(v.value)) for k, v in cookie.iteritems())
    else:
        # HTTP_COOKIE doesn't have quotes, use fast cookie parsing
        cookies = {}
        for key_value in http_cookie.split(';'):
            key_value = key_value.split('=', 1)
            if len(key_value) == 2:
                key, value = key_value
                cookies[key.strip()] = decode_cookie(value.strip())
    return cookies

def cookies(*requireds, **defaults):
    r"""Returns storage object with all request cookies in it. Like web.storify(), raise
    KeyError if not all requireds are present, and use the defaults given.
    defaults['_http_cookie'] overrides HTTP_COOKIE header if present (for unit testing).

    This is like web.py's original web.cookies(), but is more forgiving on bad
    HTTP_COOKIE input -- it tries to parse at least the cookies it can. It's also about
    4x as fast in the normal case (Cookie.SimpleCookie is slow!).

    >>> sorted(cookies(_http_cookie='').items())
    []
    >>> sorted(cookies(_http_cookie='a=1').items())
    [('a', '1')]
    >>> sorted(cookies(_http_cookie='a=1%202').items())
    [('a', '1 2')]
    >>> sorted(cookies(_http_cookie='a=Z%C3%A9Z').items())
    [('a', u'Z\xe9Z')]
    >>> sorted(cookies(_http_cookie='a=1; b=2; c=3').items())
    [('a', '1'), ('b', '2'), ('c', '3')]
    >>> sorted(cookies(_http_cookie='a=1; b=w("x")|y=z; c=3').items())
    [('a', '1'), ('b', 'w('), ('c', '3')]
    >>> sorted(cookies(_http_cookie='a=1; b=w(%22x%22)|y=z; c=3').items())
    [('a', '1'), ('b', 'w("x")|y=z'), ('c', '3')]

    >>> sorted(cookies(_http_cookie='keebler=E=mc2').items())
    [('keebler', 'E=mc2')]
    >>> sorted(cookies(_http_cookie=r'keebler="E=mc2; L=\"Loves\"; fudge=\012;"').items())
    [('keebler', 'E=mc2; L="Loves"; fudge=\n;')]

    """
    if '_parsed_cookies' in web.ctx and '_http_cookie' not in defaults:
        # Parsed cookies are cached for this request
        parsed_cookies = web.ctx._parsed_cookies
    else:
        http_cookie = defaults.pop('_http_cookie', None)
        if http_cookie is None:
            http_cookie = web.ctx.env.get('HTTP_COOKIE', '')
        parsed_cookies = parse_cookies(http_cookie)
        web.ctx._parsed_cookies = parsed_cookies

    for required in requireds:
        if required not in parsed_cookies:
            raise KeyError(required)

    output = web.storage(defaults)
    output.update(parsed_cookies)
    return output
Contributor

This looks good to me. Do you have any numbers on how fast this simple version is compared to SimpleCookie?

Can you also share your test cases?

Contributor
benhoyt commented Apr 27, 2012

I actually don't have my test cases from earlier, but I used about a 1KB cookie (which is about what we were seeing -- with Google Analytics and other stuff, cookie headers add up). But something like this:

# about 1KB of cookies without quotes
# web_cookies: 4.90331822896
# our_cookies: 1.18649113093
test_cookies = '; '.join('cookie{0}=valuevaluevaluevaluevaluevalue{1}'.format(i, i) for i in range(25))
t = timeit.Timer('web_cookies(_http_cookie=test_cookies)', 'from __main__ import web_cookies, test_cookies')
print t.timeit(number=10000)
t = timeit.Timer('our_cookies(_http_cookie=test_cookies)', 'from __main__ import our_cookies, test_cookies')
print t.timeit(number=10000)

# about 1KB of cookies with quotes (no real speed-up)
# web_cookies: 6.78449911483
# our_cookies: 6.50397727312
test_cookies = '; '.join('cookie{0}="valuevaluevaluevaluevaluevalue{1}"'.format(i, i) for i in range(25))
t = timeit.Timer('web_cookies(_http_cookie=test_cookies)', 'from __main__ import web_cookies, test_cookies')
print t.timeit(number=10000)
t = timeit.Timer('our_cookies(_http_cookie=test_cookies)', 'from __main__ import our_cookies, test_cookies')
print t.timeit(number=10000)

# BUT, commenting out the "and '_http_cookie' not in defaults" so it's
# actually caching them in web.ctx, which is how it'll look in real life.
# about 1KB of cookies without quotes and caching
# web_cookies: 4.90086106546
# our_cookies: 0.0288045923491

# about 1KB of cookies with quotes and caching
# web_cookies: 6.71030800227
# our_cookies: 0.0289513803879

The bottom two aren't quite a real test, because you wouldn't call cookies() 10,000 times in one request, but we call it upwards of 10 times, and the point is, after the first time it's cached so the call is basically free.

It turns out that we do now have cookies with quotes in them on our site, so we're getting the main gain from the web.ctx caching -- if anything, it's the error handling and the web.ctx caching that I'd pick from what I've got.

Contributor

I'm worried about the decode_cookie function. It is returning strings sometimes and unicode someother times.

The current implementation always returns strings. Changing it may break existing applications.

@benhoyt shall we remove the unicode conversion in decode_cookie?

Contributor
benhoyt commented May 22, 2012

Fair enough about not wanting to break existing applications. I think this is a good feature, though, and it parallels how functions like web.input() work (automatically decoding to unicode). Maybe you could add a unicode parameter so you can say cookies(unicode=True) to have it return unicode, otherwise it returns a byte string?

FWIW, when we changed our codebase to using the new unicode-returning version, it didn't break anything.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment