Change URI verification in command line tools #4603
Codecov Report
@@ Coverage Diff @@
## master #4603 +/- ##
==========================================
+ Coverage 84.63% 84.69% +0.05%
==========================================
Files 163 163
Lines 9971 9989 +18
Branches 1485 1490 +5
==========================================
+ Hits 8439 8460 +21
+ Misses 1266 1262 -4
- Partials 266 267 +1
Could we still extract the protocol and check that it is one of those supported by the Scrapy project being used?
Also, I wonder if it wouldn’t be better to modify `is_url` upstream, making the list of supported protocols a parameter with the current behavior as its default value for backward compatibility, and then in Scrapy passing a list of protocols based on the running Scrapy project.
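A minimal sketch of the backward-compatible signature suggested here, assuming the existing check accepts only `http`, `https`, and `file` (the default shown in this PR's diff). The name `is_url_with_schemes` and the `schemes` parameter are hypothetical and are not part of w3lib:

```python
# Assumed current behavior: only these three schemes are accepted.
DEFAULT_SCHEMES = frozenset({"http", "https", "file"})


def is_url_with_schemes(text, schemes=DEFAULT_SCHEMES):
    """Return True if text starts with an allowed scheme followed by '://'.

    Hypothetical helper: callers that pass only `text` keep the assumed
    current behavior, while a project can pass its own set of schemes.
    """
    scheme, separator, _ = text.partition("://")
    return bool(separator) and scheme.lower() in schemes
```

With this shape, `is_url_with_schemes("s3://bucket/key")` is rejected by default, and passing `schemes={"s3"}` accepts it.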
I added a check in Also, I think it is probably a good idea to extract the code that produces a set of supported protocols in
I’ve left a few style comments, but it looks good to me, good work!
As for w3lib, I would do the following:
- Rename `is_uri` here as `_is_uri` (or `_is_url`) so that we do not need to deprecate it if we remove it later.
- Add a similar change to `w3lib`.
- Once a `w3lib` version with that change is published, we can replace the private function here with `w3lib`’s, and upgrade the required `w3lib` version of Scrapy accordingly.
As for using a shared function to get supported project URL schemes, I’m not sure if it is worth it, but you could extract some sharing function in `scrapy/commands/__init__.py`. I wonder if the function should also check the length of `args`, as the check seems to be the same in both commands, or if that would be too tightly coupled.
supported_protocols = settings.getdict('DOWNLOAD_HANDLERS_BASE').keys()
supported_protocols |= settings.getdict('DOWNLOAD_HANDLERS').keys()
See `settings.getwithbase`. Using that, extracting this code into a function is less important, as you could simplify the line below as:
if not is_uri(args[0], settings.getwithbase('DOWNLOAD_HANDLERS')):
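To illustrate what merging the base and project handler settings amounts to, here is a plain-dictionary sketch. It does not call Scrapy; the function name `supported_schemes` and the handler strings are made up for illustration, and it assumes that a handler set to `None` means the scheme is disabled in the project:

```python
def supported_schemes(base_handlers, project_handlers):
    """Merge base and project handler mappings and return enabled schemes.

    Plain-dict stand-in for illustration only. Project entries override
    base entries; a value of None is assumed to disable that scheme.
    """
    merged = {**base_handlers, **project_handlers}
    return {scheme for scheme, handler in merged.items() if handler is not None}


# Hypothetical settings values.
base = {"http": "HTTPHandler", "https": "HTTPHandler", "file": "FileHandler", "s3": "S3Handler"}
project = {"s3": None, "ftp": "FTPHandler"}
schemes = supported_schemes(base, project)
```

Under that assumption, a plain union of the two key sets would still include `s3` here, whereas filtering on the merged values drops it.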
@@ -133,3 +133,15 @@ def strip_url(url, strip_credentials=True, strip_default_port=True, origin_only=
'' if origin_only else parsed_url.query,
'' if strip_fragment else parsed_url.fragment
))


def is_uri(text, protocols={'http', 'https', 'file'}):
💄 It may be better to make `protocols` keyword only:
- def is_uri(text, protocols={'http', 'https', 'file'}):
+ def is_uri(text, *, protocols={'http', 'https', 'file'}):
Also, maybe it should be `schemes` instead of `protocols`.
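A short sketch of what the keyword-only marker changes for callers. The body below is a simplified stand-in that only inspects the text before the first `:`; it is not the function from this PR, and it uses the `schemes` name proposed above with an immutable `frozenset` default:

```python
def is_uri(text, *, schemes=frozenset({"http", "https", "file"})):
    """Simplified stand-in: check only the scheme before the first ':'."""
    scheme, separator, _ = text.partition(":")
    return bool(separator) and scheme.lower() in schemes


def call_positionally():
    """Return the error raised when schemes is passed positionally."""
    try:
        is_uri("ftp://example.com", {"ftp"})
    except TypeError as error:
        return error
    return None
```

Calling `is_uri("ftp://example.com", schemes={"ftp"})` works, while the positional form raises `TypeError`, so the call site always names the argument.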
pattern = re.compile(regex)
match = pattern.fullmatch(text)
💄 This would be simpler and the performance should be the same:
- pattern = re.compile(regex)
- match = pattern.fullmatch(text)
+ match = re.fullmatch(pattern, text)
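The two forms can be compared side by side in a small sketch. The pattern below is a hypothetical placeholder, not the regex from this PR. Note that once the `re.compile` line is removed, the first argument to `re.fullmatch` would presumably be the pattern string itself (named `regex` in the diff):

```python
import re

# Hypothetical pattern for illustration: a scheme, then ':' and the rest.
regex = r"[A-Za-z][A-Za-z0-9+.-]*:.*"
text = "https://example.com/page"

# Two-step form: compile first, then match.
compiled = re.compile(regex)
match_compiled = compiled.fullmatch(text)

# One-step form: the module-level function takes the pattern string directly.
match_direct = re.fullmatch(regex, text)
```

The `re` module keeps a cache of recently compiled patterns, which is why the one-step form should perform about the same for a single pattern used repeatedly.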
This closes #4530. I implemented a new `is_uri` function in `scrapy.utils.url` that verifies if given text is a well-formed URI as defined by RFC 3986 and replaced calls to `w3lib.url.is_url` in `fetch` and `parse` commands with it.
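For readers unfamiliar with the RFC 3986 grammar, a rough sketch of such a check follows. It is not the implementation from this PR: the name `looks_like_uri` is hypothetical, the component split is adapted from the regular expression in RFC 3986 Appendix B with the scheme made mandatory, and characters inside each component are not validated.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )  (RFC 3986, section 3.1)
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")

# Component split adapted from RFC 3986 Appendix B; the scheme is required here.
URI_RE = re.compile(
    r"(?P<scheme>[^:/?#]+):"
    r"(//(?P<authority>[^/?#]*))?"
    r"(?P<path>[^?#]*)"
    r"(\?(?P<query>[^#]*))?"
    r"(#(?P<fragment>.*))?"
)


def looks_like_uri(text, *, schemes=None):
    """Loose check: text splits into URI components with a valid scheme.

    If schemes is given, the scheme must also be one of them
    (compared in lowercase, since schemes are case-insensitive).
    """
    match = URI_RE.fullmatch(text)
    if match is None:
        return False
    scheme = match.group("scheme")
    if SCHEME_RE.fullmatch(scheme) is None:
        return False
    return schemes is None or scheme.lower() in schemes
```

A stricter check would also validate the authority, path, query, and fragment against the character rules in the RFC.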