Add encoding to Link object + re-enable HTML entities links tests #1949
cf.
you'll see |
@@ -28,6 +28,7 @@ def __init__(self, url, text='', fragment='', nofollow=False):
         self.text = text
         self.fragment = fragment
         self.nofollow = nofollow
+        self.encoding = encoding
In Python 2, when a unicode URL is passed, it is converted to bytes using the utf-8 encoding, regardless of the `encoding` passed. Should we use `encoding` instead? If so, the warning may become misleading. There should be a warning if `encoding` is not passed, but if the user passes it explicitly then it may be fine to pass a unicode URL in Python 2.
I admit I don't quite understand why a Unicode URL is converted to bytes within `Link`.
Was the intent to make it safe? (In that case, we could store the safe_url version of it, using `encoding`.)
Reading scrapy/tests/test_link.py, I find the following odd:
link = Link(u"http://www.example.com/\xa3")
self.assertIsInstance(link.url, str)
self.assertEqual(link.url, b'http://www.example.com/\xc2\xa3') <-----
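For context, the bytes in the marked assertion are simply the UTF-8 encoding of U+00A3; a page declared as Latin-1 would represent the same character with a different byte. A small standard-library illustration (not Scrapy code; the variable names are only for this sketch):

```python
# The same unicode URL yields different bytes depending on the encoding used.
url = u"http://www.example.com/\xa3"

utf8_bytes = url.encode("utf-8")      # the form the test above expects
latin1_bytes = url.encode("latin-1")  # what a Latin-1 encoded page would use

print(utf8_bytes)
print(latin1_bytes)
```

This is why always converting with utf-8 can disagree with the encoding of the page the link came from.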
In Python 2.x all URLs were bytes, so we required link.url to be bytes. For backwards compatibility, and because URL-related functions in Python 2.x lack unicode support, link.url is still required to be bytes in Python 2.x.
No idea what this test is for :)
So, in Python 2, we could pass it through `safe_url_string`, using `encoding` (and the result would already be bytes, so no separate conversion would be needed)?
Or do you mean `safe_url_string` called twice, once in `LinkExtractor` and once in `Request` init?
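Whether escaping twice is harmful depends on whether the escaping step is idempotent. The sketch below uses a hypothetical `escape_url` helper built on the standard library's `urllib.parse.quote`; it is a stand-in for illustration, not w3lib's `safe_url_string`, and the set of safe characters is an assumption. Because `%` is kept as a safe character, a second pass leaves an already-escaped URL unchanged.

```python
from urllib.parse import quote

# Assumed set of characters left untouched; "%" is included so that
# existing percent-escapes are not escaped a second time.
SAFE_CHARS = "/:?=&#%"


def escape_url(url, encoding="utf-8"):
    """Hypothetical helper: percent-escape non-ASCII characters of `url`
    using the given encoding."""
    return quote(url, safe=SAFE_CHARS, encoding=encoding)


once = escape_url(u"http://www.example.com/\xa3", encoding="latin-1")
twice = escape_url(once, encoding="latin-1")
print(once)
print(twice)
```

Under these assumptions the second call is a no-op, so the concern would be redundant work rather than a changed URL.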
yep
so your comment is on https://github.com/scrapy/scrapy/blob/link-encoding/scrapy/linkextractors/lxmlhtml.py#L60, right?
Tests seem to pass fine WITHOUT `safe_url_string` in `LinkExtractor`. Checking further.
My comment was about your suggestion to use `safe_url_string` in the `Link` constructor; I hadn't noticed it in `LinkExtractor`.
Current coverage is 81.65%
@@             master   #1949   diff @@
==========================================
  Files           161     161
  Lines          8669    8669
  Methods           0       0
  Messages          0       0
  Branches       1269    1269
==========================================
- Hits           7080    7078     -2
  Misses         1233    1233
- Partials        356     358     +2
@kmike , with What happens with
|
Add `encoding` parameter and attribute to `Link` object.
That's the only way I found to properly account for encoding information when building requests from link extractors.
Supersedes #1880
Should fix #1403
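A minimal sketch of the idea, assuming a simplified link object rather than Scrapy's actual `Link` class: `SketchLink`, `request_url`, the `encoding=None` default, and the utf-8 fallback are all hypothetical names and choices made for illustration only.

```python
from urllib.parse import quote


class SketchLink:
    """Hypothetical, simplified stand-in for a link object (not Scrapy's Link)."""

    def __init__(self, url, text="", fragment="", nofollow=False, encoding=None):
        self.url = url
        self.text = text
        self.fragment = fragment
        self.nofollow = nofollow
        # Assumed default of None; the real default is not shown in the diff.
        self.encoding = encoding


def request_url(link, default_encoding="utf-8"):
    """Percent-escape the link URL using the encoding carried by the link,
    falling back to an assumed default when none was recorded."""
    encoding = link.encoding or default_encoding
    return quote(link.url, safe="/:?=&#%", encoding=encoding)


latin1_link = SketchLink(u"http://www.example.com/\xa3", encoding="latin-1")
plain_link = SketchLink(u"http://www.example.com/\xa3")
print(request_url(latin1_link))
print(request_url(plain_link))
```

Carrying the encoding on the link lets the code that builds the request escape the URL the way the source page would have, instead of assuming utf-8.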