Change URI verification in command line tools #4603

Open · wants to merge 3 commits into base: master
Conversation

michalp2213 (Contributor)

This closes #4530. I implemented a new is_uri function in scrapy.utils.url that verifies whether the given text is a well-formed URI as defined by RFC 3986, and replaced the calls to w3lib.url.is_url in the fetch and parse commands with it.
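
For reference, a minimal sketch of what such a well-formedness check can look like (the regex below is illustrative, not the exact one from this PR):

    import re

    # RFC 3986, section 3.1: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
    # Rough check: a valid scheme, a colon, then no whitespace.
    _URI_REGEX = r'[A-Za-z][A-Za-z0-9+.\-]*:\S*'

    def is_uri(text):
        return re.fullmatch(_URI_REGEX, text) is not None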

codecov bot commented May 31, 2020

Codecov Report

Merging #4603 into master will increase coverage by 0.05%.
The diff coverage is 59.37%.

@@            Coverage Diff             @@
##           master    #4603      +/-   ##
==========================================
+ Coverage   84.63%   84.69%   +0.05%     
==========================================
  Files         163      163              
  Lines        9971     9989      +18     
  Branches     1485     1490       +5     
==========================================
+ Hits         8439     8460      +21     
+ Misses       1266     1262       -4     
- Partials      266      267       +1     
Impacted Files Coverage Δ
scrapy/commands/parse.py 20.10% <18.18%> (-0.12%) ⬇️
scrapy/commands/fetch.py 86.53% <63.63%> (-2.60%) ⬇️
scrapy/utils/url.py 100.00% <100.00%> (ø)
scrapy/crawler.py 87.15% <0.00%> (-1.12%) ⬇️
scrapy/http/response/text.py 100.00% <0.00%> (ø)
scrapy/commands/__init__.py 62.31% <0.00%> (+1.20%) ⬆️
scrapy/utils/spider.py 77.77% <0.00%> (+2.77%) ⬆️
scrapy/commands/runspider.py 89.13% <0.00%> (+5.52%) ⬆️
scrapy/commands/crawl.py 35.00% <0.00%> (+6.42%) ⬆️

Gallaecio (Member) left a comment

Could we still extract the protocol and check that it is one of those supported by the Scrapy project being used?

Also, I wonder if it wouldn’t be better to modify is_url upstream, making the list of supported protocols a parameter with the current behavior as its default value for backward compatibility, and then in Scrapy passing a list of protocols based on the running Scrapy project.
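
A sketch of what that upstream change might look like (hypothetical; w3lib’s current is_url takes only the text argument, and its real implementation may check more than this):

    # Hypothetical w3lib.url.is_url with a protocols parameter; the
    # default preserves today's behaviour for existing callers.
    def is_url(text, protocols=('http', 'https', 'file')):
        scheme, sep, rest = text.partition('://')
        return bool(sep) and scheme.lower() in protocols and bool(rest.strip())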

michalp2213 (Contributor, Author)

I added a check in is_uri that the URI’s scheme is in the set of supported protocols (passed as an argument, defaulting to {'http', 'https', 'file'}), and modified fetch and parse to pass the set of protocols supported by the current project (from DOWNLOAD_HANDLERS_BASE and DOWNLOAD_HANDLERS). If you want to modify is_url in w3lib instead, I can submit these changes as a pull request there.

Also, I think it would probably be a good idea to extract the code that produces the set of supported protocols in fetch and parse into a separate function. What do you think would be a good place for it? utils.project?

Gallaecio (Member) left a comment

I’ve left a few style comments, but it looks good to me, good work!

As for w3lib, I would do the following:

  1. Rename is_uri here as _is_uri (or _is_url) so that we do not need to deprecate it if we remove it later.
  2. Add a similar change to w3lib.
  3. Once a w3lib version with that change is published, we can replace the private function here with w3lib’s, and upgrade the required w3lib version of Scrapy accordingly.

As for using a shared function to get the supported project URL schemes, I’m not sure it is worth it, but you could extract a shared function in scrapy/commands/__init__.py. I also wonder whether that function should check the length of args, as that check seems to be the same in both commands, or whether that would be too tightly coupled.
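
One possible shape for such a helper (the name and location are illustrative, not from the PR):

    # Hypothetical helper in scrapy/commands/__init__.py
    def _supported_schemes(settings):
        """Return every URL scheme the project has a download handler for."""
        return set(settings.getwithbase('DOWNLOAD_HANDLERS'))

Folding the len(args) check into it would save another duplicated line, at the cost of tying the helper to the two commands’ argument conventions.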

Comment on lines +55 to +56
supported_protocols = settings.getdict('DOWNLOAD_HANDLERS_BASE').keys()
supported_protocols |= settings.getdict('DOWNLOAD_HANDLERS').keys()

See settings.getwithbase. Using that, extracting this code into a function is less important, as you could simplify the line below as:

        if not is_uri(args[0], settings.getwithbase('DOWNLOAD_HANDLERS')):
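
To see why getwithbase removes the need for the two-step union (the handler values below are placeholders):

    from scrapy.settings import BaseSettings

    settings = BaseSettings({
        'DOWNLOAD_HANDLERS_BASE': {'http': '...', 'https': '...', 'file': '...'},
        'DOWNLOAD_HANDLERS': {'s3': '...'},
    })
    # getwithbase merges NAME_BASE with NAME, so iterating the result
    # yields every supported scheme in a single call.
    print(set(settings.getwithbase('DOWNLOAD_HANDLERS')))
    # {'http', 'https', 'file', 's3'} (set order may vary)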

@@ -133,3 +133,15 @@ def strip_url(url, strip_credentials=True, strip_default_port=True, origin_only=
'' if origin_only else parsed_url.query,
'' if strip_fragment else parsed_url.fragment
))


def is_uri(text, protocols={'http', 'https', 'file'}):

💄 It may be better to make protocols keyword-only:

Suggested change
def is_uri(text, protocols={'http', 'https', 'file'}):
def is_uri(text, *, protocols={'http', 'https', 'file'}):

Also, maybe it should be schemes instead of protocols.
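
With the keyword-only form, call sites have to spell the parameter out (hypothetical calls):

    is_uri('s3://bucket/key', protocols={'s3'})  # OK
    is_uri('s3://bucket/key', {'s3'})            # TypeError: positional arg rejected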

Comment on lines +140 to +141
pattern = re.compile(regex)
match = pattern.fullmatch(text)

💄 This would be simpler and the performance should be the same:

Suggested change
pattern = re.compile(regex)
match = pattern.fullmatch(text)
match = re.fullmatch(regex, text)
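
The performance matches because the re module caches compiled patterns internally, so the module-level functions skip recompilation on repeated calls. A small self-contained illustration:

    import re

    # re.fullmatch(regex, text) hits re's internal pattern cache after the
    # first call, so looping costs about the same as compiling once up front.
    for text in ('http://example.com', 'not a uri'):
        print(bool(re.fullmatch(r'[A-Za-z][A-Za-z0-9+.\-]*:\S*', text)))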

Successfully merging this pull request may close these issues.

scrapy command line tools fail on custom schemes