Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] CookiesMiddleware: keep cookies from 'Cookie' request header, fix encoding #2400

Merged
merged 1 commit into from
May 27, 2020

Conversation

elacuesta
Copy link
Member

@elacuesta elacuesta commented Nov 18, 2016

Fixes #1992.

This PR modifies the CookiesMiddleware._get_request_cookies method to prevent the middleware from discarding cookies already set in the headers (either by the DefaultHeadersMiddleware or by using the Request's headers constructor parameter).

Update: also fixes #3575

@codecov-io
Copy link

codecov-io commented Nov 18, 2016

Codecov Report

Merging #2400 into master will increase coverage by 0.02%.
The diff coverage is 95.55%.

@@            Coverage Diff             @@
##           master    #2400      +/-   ##
==========================================
+ Coverage   84.57%   84.59%   +0.02%     
==========================================
  Files         164      164              
  Lines        9931     9961      +30     
  Branches     1477     1484       +7     
==========================================
+ Hits         8399     8427      +28     
- Misses       1266     1267       +1     
- Partials      266      267       +1     
Impacted Files Coverage Δ
scrapy/downloadermiddlewares/cookies.py 95.74% <95.55%> (-1.14%) ⬇️

@elacuesta elacuesta force-pushed the keep_cookie_header branch 5 times, most recently from 40856ed to 139644a Compare November 21, 2016 18:00
@elacuesta elacuesta force-pushed the keep_cookie_header branch 3 times, most recently from ae5bc13 to a37aec1 Compare February 16, 2017 15:25
@elacuesta
Copy link
Member Author

@kmike, about

cookies should be UTF-8 in most cases (but other encodings are also possible, so the code must be robust)

I believe the current CookieJar implementation (scrapy.http.cookies.CookieJar, based on http.cookiejar) does not handle encodings other than UTF8 properly. A different implementation could solve that, but IMHO that would fall out of the scope of this PR.
Check the following snippet:

# -*- coding: utf-8 -*-

from __future__ import print_function
from scrapy.http import Request, Response
from scrapy.http.cookies import CookieJar

def test_cookie(encoding):
    print('# Encoding: {} #'.format(encoding))
    jar = CookieJar()
    response = Response('http://example.org', headers={'Set-Cookie': u'a=ä'.encode(encoding)})
    request = Request('http://example.org')
    cookies_from_response = jar.make_cookies(response, request)
    print('cookies:', cookies_from_response)
    for cookie in cookies_from_response:
        jar.set_cookie(cookie)
    jar.add_cookie_header(request)
    print('Cookie header:', request.headers['Cookie'])
    print()

test_cookie('utf8')
test_cookie('latin1')
test_cookie('utf16')

and the output:

$ python2 cookies_jar.py 
# Encoding: utf8 #
cookies: [Cookie(version=0, name='a', value='\xc3\xa4', port=None, port_specified=False, domain='example.org', domain_specified=False, domain_initial_dot=False, path='/', path_specified=False, secure=False, expires=None, discard=True, comment=None, comment_url=None, rest={}, rfc2109=False)]
Cookie header: a=ä

# Encoding: latin1 #
cookies: [Cookie(version=0, name='a', value='\xe4', port=None, port_specified=False, domain='example.org', domain_specified=False, domain_initial_dot=False, path='/', path_specified=False, secure=False, expires=None, discard=True, comment=None, comment_url=None, rest={}, rfc2109=False)]
Cookie header: a=

# Encoding: utf16 #
cookies: [Cookie(version=0, name='\xff\xfea\x00', value='\x00\xe4\x00', port=None, port_specified=False, domain='example.org', domain_specified=False, domain_initial_dot=False, path='/', path_specified=False, secure=False, expires=None, discard=True, comment=None, comment_url=None, rest={}, rfc2109=False)]
Cookie header: ��a=�

$ python3 cookies_jar.py 
# Encoding: utf8 #
cookies: [Cookie(version=0, name='a', value='ä', port=None, port_specified=False, domain='example.org', domain_specified=False, domain_initial_dot=False, path='/', path_specified=False, secure=False, expires=None, discard=True, comment=None, comment_url=None, rest={}, rfc2109=False)]
Cookie header: b'a=\xc3\xa4'

# Encoding: latin1 #
cookies: [Cookie(version=0, name='a', value='�', port=None, port_specified=False, domain='example.org', domain_specified=False, domain_initial_dot=False, path='/', path_specified=False, secure=False, expires=None, discard=True, comment=None, comment_url=None, rest={}, rfc2109=False)]
Cookie header: b'a=\xef\xbf\xbd'

# Encoding: utf16 #
cookies: [Cookie(version=0, name='��a\x00', value='\x00�\x00', port=None, port_specified=False, domain='example.org', domain_specified=False, domain_initial_dot=False, path='/', path_specified=False, secure=False, expires=None, discard=True, comment=None, comment_url=None, rest={}, rfc2109=False)]
Cookie header: b'\xef\xbf\xbd\xef\xbf\xbda\x00=\x00\xef\xbf\xbd\x00'

@kmike
Copy link
Member

kmike commented Feb 16, 2017

@elacuesta yeah, it can't handle them because the encoding is unknown. But it shouldn't raise an exception even if a header is not utf-8 - there are byte sequences which can raise UnicodeDecodeError when decoded from utf-8; the whole response shouldn't fail in this case.

@elacuesta
Copy link
Member Author

Right, what do you think about the following then?

  1. mention in the docs that cookies should be either unicode strings or UTF8-encoded bytes
  2. catch UnicodeDecodeError when parsing cookies in CookiesMiddleware and log a warning
  3. discard/keep (with probably bogus value) non-UTF8 cookies

@kmike
Copy link
Member

kmike commented Feb 17, 2017

mention in the docs that cookies should be either unicode strings or UTF8-encoded bytes

It won't work - it is website who defines what are cookies, we can't prescribe that. Non-utf8 cookies are likely bugs though.

catch UnicodeDecodeError when parsing cookies in CookiesMiddleware and log a warning

This sounds good to me.

Warnings can be annoying if one scrapes a website which uses such cookies consistently, so it'd be nice to document how to disable these warnings, e.g. by pointing to https://doc.scrapy.org/en/latest/topics/logging.html#advanced-customization.

discard/keep (with probably bogus value) non-UTF8 cookies

Currently non-UTF8 cookies are kept with bogus values - errors='replace' is used everywhere. Though not ideal, I haven't heard anyone complaining; probably it doesn't happen often, or maybe it happens in cases where non-lossy cookie values are not important.

@elacuesta
Copy link
Member Author

@kmike It seems to me that http.cookiejar works consistently in py2/py3 only with unicode strings, I added some code to decode cookies from the Cookie header (which is always bytes thanks to the Headers implementation) using utf8 and falling back to latin1 in case of UnicodeDecodeError.

With this implementation the Cookie header would always end up encoded with utf8 even if the cookies were already passed as bytes using a different encoding. What do you think about that?

@elacuesta elacuesta force-pushed the keep_cookie_header branch 2 times, most recently from 218bab3 to 92e4038 Compare October 12, 2018 18:18
@elacuesta elacuesta changed the title CookiesMiddleware: keep cookies from 'Cookie' request header CookiesMiddleware: keep cookies from 'Cookie' request header, fix encoding Jan 10, 2019
@elacuesta
Copy link
Member Author

I made some changes to the tests and to the docs to point to the logging configuration page.

@kmike could you please re-check this? The current status is:

Cookies set in the Request.cookies attribute

  1. UTF8-encoded bytes: sent as they are, Double-encoded cookies #3575 is also fixed
  2. Non-UTF8-encoded bytes: sent as UTF8-encoded bytes - a warning is logged
  3. Unicode string: encoded with UTF8

Cookies set in the Cookie request header (currently being ignored)

  1. UTF8-encoded bytes: sent as they are
  2. Non-UTF8-encoded bytes: sent as UTF8-encoded bytes - a warning is logged
  3. Unicode string: encoded with UTF8

@elacuesta
Copy link
Member Author

Added a check and a test case to filter out invalid cookies (dicts with no name or value in the Request.cookies attribute)

docs/topics/downloader-middleware.rst Outdated Show resolved Hide resolved
scrapy/downloadermiddlewares/cookies.py Outdated Show resolved Hide resolved
@elacuesta
Copy link
Member Author

Since we're only splitting by one char (";") and ignoring whitespaces, it seems like using plain bytes.split and bytes.strip is faster than using regex.

In [1]: import re
   ...:
   ...: re_split = re.compile(br'\s*;\s*')
   ...:
   ...: def cookies(n=4):
   ...:     values = [b'%i=%i' % (i, i*10) for i in range(10**n)]
   ...:     return b' ; '.join(values)
   ...:
   ...: %timeit re_split.split(cookies())
   ...: %timeit [c.strip() for c in cookies().split(b';')]
7.58 ms ± 39.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.5 ms ± 42.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@Gallaecio Gallaecio changed the title CookiesMiddleware: keep cookies from 'Cookie' request header, fix encoding [MRG+1] CookiesMiddleware: keep cookies from 'Cookie' request header, fix encoding Mar 26, 2019
@elacuesta elacuesta force-pushed the keep_cookie_header branch 2 times, most recently from b8da528 to 81a436d Compare August 28, 2019 18:59
@elacuesta elacuesta closed this Aug 28, 2019
@elacuesta elacuesta reopened this Aug 28, 2019
@elacuesta elacuesta force-pushed the keep_cookie_header branch 4 times, most recently from f3b0889 to e48445b Compare December 5, 2019 14:21
@elacuesta elacuesta closed this Feb 3, 2020
@elacuesta elacuesta reopened this Feb 3, 2020
@Gallaecio Gallaecio added this to the v2.1 milestone Mar 19, 2020
@Gallaecio Gallaecio removed this from the v2.1 milestone Apr 16, 2020
@kmike kmike requested a review from wRAR May 21, 2020 13:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cookies from the Cookie request header are not processed Double-encoded cookies
6 participants