Change URI verification in command line tools #4603

Open · wants to merge 3 commits into base: master
Conversation

michalp2213 (Contributor)

This closes #4530. I implemented a new is_uri function in scrapy.utils.url that verifies whether the given text is a well-formed URI as defined by RFC 3986, and replaced the calls to w3lib.url.is_url in the fetch and parse commands with it.
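
For reference, a minimal sketch of what such a well-formedness check can look like (the regex below is illustrative, not the exact one from this PR):

    import re

    # RFC 3986, section 3.1: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
    # Rough check: a valid scheme, a colon, then no whitespace.
    _URI_REGEX = r'[A-Za-z][A-Za-z0-9+.\-]*:\S*'

    def is_uri(text):
        return re.fullmatch(_URI_REGEX, text) is not None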

codecov bot commented May 31, 2020

Codecov Report

Merging #4603 into master will increase coverage by 0.05%.
The diff coverage is 59.37%.

@@            Coverage Diff             @@
##           master    #4603      +/-   ##
==========================================
+ Coverage   84.63%   84.69%   +0.05%     
==========================================
  Files         163      163              
  Lines        9971     9989      +18     
  Branches     1485     1490       +5     
==========================================
+ Hits         8439     8460      +21     
+ Misses       1266     1262       -4     
- Partials      266      267       +1     
Impacted Files Coverage Δ
scrapy/commands/parse.py 20.10% <18.18%> (-0.12%) ⬇️
scrapy/commands/fetch.py 86.53% <63.63%> (-2.60%) ⬇️
scrapy/utils/url.py 100.00% <100.00%> (ø)
scrapy/crawler.py 87.15% <0.00%> (-1.12%) ⬇️
scrapy/http/response/text.py 100.00% <0.00%> (ø)
scrapy/commands/__init__.py 62.31% <0.00%> (+1.20%) ⬆️
scrapy/utils/spider.py 77.77% <0.00%> (+2.77%) ⬆️
scrapy/commands/runspider.py 89.13% <0.00%> (+5.52%) ⬆️
scrapy/commands/crawl.py 35.00% <0.00%> (+6.42%) ⬆️

Gallaecio (Member) left a comment

Could we still extract the protocol and check that it is one of those supported by the Scrapy project being used?

Also, I wonder if it wouldn’t be better to modify is_url upstream, making the list of supported protocols a parameter with the current behavior as its default value for backward compatibility, and then in Scrapy passing a list of protocols based on the running Scrapy project.
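
A sketch of what that upstream change might look like (hypothetical; w3lib’s current is_url takes only the text argument, and its real implementation may check more than this):

    # Hypothetical w3lib.url.is_url with a protocols parameter; the
    # default preserves today's behaviour for existing callers.
    def is_url(text, protocols=('http', 'https', 'file')):
        scheme, sep, rest = text.partition('://')
        return bool(sep) and scheme.lower() in protocols and bool(rest.strip())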

michalp2213 (Contributor, Author)

I added a check in is_uri that the URI’s scheme is in the set of supported protocols (passed as an argument, defaulting to {'http', 'https', 'file'}), and modified fetch and parse to pass the set of protocols supported by the current project (from DOWNLOAD_HANDLERS_BASE and DOWNLOAD_HANDLERS). If you want to modify is_url in w3lib instead, I can submit these changes as a pull request there.

Also, I think it would probably be a good idea to extract the code that produces the set of supported protocols in fetch and parse into a separate function. What do you think would be a good place for it? utils.project?

Gallaecio (Member) left a comment

I’ve left a few style comments, but it looks good to me, good work!

As for w3lib, I would do the following:

  1. Rename is_uri here as _is_uri (or _is_url) so that we do not need to deprecate it if we remove it later.
  2. Add a similar change to w3lib.
  3. Once a w3lib version with that change is published, we can replace the private function here with w3lib’s, and upgrade the required w3lib version of Scrapy accordingly.

As for using a shared function to get the supported project URL schemes, I’m not sure it is worth it, but you could extract a shared function in scrapy/commands/__init__.py. I also wonder whether that function should check the length of args, as that check seems to be the same in both commands, or whether that would be too tightly coupled.
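
One possible shape for such a helper (the name and location are illustrative, not from the PR):

    # Hypothetical helper in scrapy/commands/__init__.py
    def _supported_schemes(settings):
        """Return every URL scheme the project has a download handler for."""
        return set(settings.getwithbase('DOWNLOAD_HANDLERS'))

Folding the len(args) check into it would save another duplicated line, at the cost of tying the helper to the two commands’ argument conventions.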

Comment on lines +55 to +56
supported_protocols = settings.getdict('DOWNLOAD_HANDLERS_BASE').keys()
supported_protocols |= settings.getdict('DOWNLOAD_HANDLERS').keys()

See settings.getwithbase. Using that, extracting this code into a function is less important, as you could simplify the line below as:

        if not is_uri(args[0], settings.getwithbase('DOWNLOAD_HANDLERS')):
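
To see why getwithbase removes the need for the two-step union (the handler values below are placeholders):

    from scrapy.settings import BaseSettings

    settings = BaseSettings({
        'DOWNLOAD_HANDLERS_BASE': {'http': '...', 'https': '...', 'file': '...'},
        'DOWNLOAD_HANDLERS': {'s3': '...'},
    })
    # getwithbase merges NAME_BASE with NAME, so iterating the result
    # yields every supported scheme in a single call.
    print(set(settings.getwithbase('DOWNLOAD_HANDLERS')))
    # {'http', 'https', 'file', 's3'} (set order may vary)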

@@ -133,3 +133,15 @@ def strip_url(url, strip_credentials=True, strip_default_port=True, origin_only=
'' if origin_only else parsed_url.query,
'' if strip_fragment else parsed_url.fragment
))


def is_uri(text, protocols={'http', 'https', 'file'}):

💄 It may be better to make protocols keyword-only:

Suggested change
def is_uri(text, protocols={'http', 'https', 'file'}):
def is_uri(text, *, protocols={'http', 'https', 'file'}):

Also, maybe it should be schemes instead of protocols.
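
With the keyword-only form, call sites have to spell the parameter out (hypothetical calls):

    is_uri('s3://bucket/key', protocols={'s3'})  # OK
    is_uri('s3://bucket/key', {'s3'})            # TypeError: positional arg rejected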

Comment on lines +140 to +141
pattern = re.compile(regex)
match = pattern.fullmatch(text)

💄 This would be simpler and the performance should be the same:

Suggested change
pattern = re.compile(regex)
match = pattern.fullmatch(text)
match = re.fullmatch(regex, text)
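
The performance matches because the re module caches compiled patterns internally, so the module-level functions skip recompilation on repeated calls. A small self-contained illustration:

    import re

    # re.fullmatch(regex, text) hits re's internal pattern cache after the
    # first call, so looping costs about the same as compiling once up front.
    for text in ('http://example.com', 'not a uri'):
        print(bool(re.fullmatch(r'[A-Za-z][A-Za-z0-9+.\-]*:\S*', text)))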

Successfully merging this pull request may close these issues.

scrapy command line tools fail on custom schemes