Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] #1487 add_scheme_if_missing for `scrapy shell` command #1498

Merged
merged 4 commits into from Oct 1, 2015

Conversation

@Preetwinder
Copy link
Contributor

@Preetwinder Preetwinder commented Sep 16, 2015

This change addresses the enhancement mentioned in #1487.

The add_scheme_if_missing function uses the parse_url function to parse the url. It checks if the scheme is missing, if it is, 'http' is added as the default scheme. It checks if the netloc field is empty. If it is empty, it replaces it with the path field and changes the path field to empty. This is done because when handing invalid url's like "www.example.com" where the scheme is missing, the parse_url function(which uses urlparse itself) leaves the netloc empty and puts the entire url in path. This leads to incorrect url generation when geturl() is called.

The add_scheme_if_missing function is called by the shell command right after it extracts the url from the arguments list. A check is performed to ensure that a null value is not passed to the function.

The test tests some common url variations.

@codecov-io
Copy link

@codecov-io codecov-io commented Sep 16, 2015

Current coverage is 82.44%

Merging #1498 into master will not affect coverage as of 0d6f0b2

Powered by Codecov. Updated on successful CI builds.

self.assertEqual(add_scheme_if_missing('//example.com'),
'http://example.com')
self.assertEqual(add_scheme_if_missing('https://www.example.com'),
'https://www.example.com')

This comment has been minimized.

@kmike

kmike Sep 16, 2015
Member

Could you please add tests for URLs with

  • some variations of port, fragment, GET arguments;
  • URLs without scheme but with // separator, like //example.com/foo/bar?
@@ -110,3 +110,11 @@ def escape_ajax(url):
if not frag.startswith('!'):
return url
return add_or_replace_parameter(defrag, '_escaped_fragment_', frag[1:])

def add_scheme_if_missing(url):

This comment has been minimized.

@kmike

kmike Sep 16, 2015
Member

Could you please add a docstring?

@nramirezuy
Copy link
Contributor

@nramirezuy nramirezuy commented Sep 16, 2015

This new function should be called add_http_if_no_scheme; it just adds http.

@Preetwinder
Copy link
Contributor Author

@Preetwinder Preetwinder commented Sep 16, 2015

Thank you for reviewing my request. I should have tested more rigorously before submitting. I was not aware of the different variations of URLs that kmike mentioned. Also this is the first time I am contributing to an open source project.
While writing more extensive tests, I discovered another failure point due to urlparse incorrectly handling invalid URLs. If the scheme is not present and both the port and the path are specified, an incorrect URL is generated. This happens because urlparse uses the first occurrence of : to parse the scheme. Correcting this would require some hackey code, if we intend to stay within the feature set provided by the urlparse library.
I feel such an addition would negatively affect the readability of the code. I have been unable to find an elegant solution to the problem. It might be preferable to use a version of the code which directly alters the URL, similar to the code provided by kmike on the issue page, since the urlparse library isn't designed to handle invalid urls. Please guide me in regards to which approach is preferable.

@Preetwinder
Copy link
Contributor Author

@Preetwinder Preetwinder commented Sep 18, 2015

I have proceeded to make the changes mentioned by kmike and nramirezuy. Apart from adding the new tests and the docstring, I have also tested the function against multiple other URL variants separately.

@@ -110,3 +110,12 @@ def escape_ajax(url):
if not frag.startswith('!'):
return url
return add_or_replace_parameter(defrag, '_escaped_fragment_', frag[1:])

def add_http_if_no_scheme(url):
"""Adds http as the default scheme if it is missing from the url"""

This comment has been minimized.

@kmike

kmike Sep 20, 2015
Member

A couple minor notes: according to pep8 there should be 2 blank lines before the function; according to pep257 docstring should use imperative mood:

The docstring is a phrase ending in a period. It prescribes the function or method's effect as a command ("Do this", "Return that"), not as a description; e.g. don't write "Returns the pathname ...".

"""Adds http as the default scheme if it is missing from the url"""
parser = parse_url(url)
if url.startswith('//'):
url = 'http:' + url

This comment has been minimized.

@kmike

kmike Sep 20, 2015
Member

if url starts with '//' there is no need to call parse_url

self.assertEqual(add_http_if_no_scheme('https://www.example.com'),
'https://www.example.com')
self.assertEqual(add_http_if_no_scheme('ftp://www.example.com'),
'ftp://www.example.com')

This comment has been minimized.

@kmike

kmike Sep 20, 2015
Member

Could you please split this test into several test cases? It makes sense to create a new TestCase for add_http_if_no_scheme and create several smaller test methods (e.g. test_username_password, test_preserve_ftp, test_leading_slashes, etc).

I know other tests in this module don't follow this convention (yet), but we're slowly moving to more fine-grained test methods because they are easier to read and provide a possiblity to run the rest of the tests if one of them fails.

@Preetwinder
Copy link
Contributor Author

@Preetwinder Preetwinder commented Sep 21, 2015

Thank you kmike for your feedback and patience. Since the master branch has changed, should I merge or rebase before pushing the new commits?

@kmike
Copy link
Member

@kmike kmike commented Sep 21, 2015

Thanks @Preetwinder! Your PR is very good (and it was good from the beginning); +1 to merge it after a few minor changes.

Since the master branch has changed, should I merge or rebase before pushing the new commits?

I think it is better to rebase.

While writing more extensive tests, I discovered another failure point due to urlparse incorrectly handling invalid URLs. If the scheme is not present and both the port and the path are specified, an incorrect URL is generated. This happens because urlparse uses the first occurrence of : to parse the scheme. Correcting this would require some hackey code, if we intend to stay within the feature set provided by the urlparse library.

Yeah, urlparse doesn't handle invalid URLs. There is no need to stay within urlparse feature set; I'm fine even with replacing urplarse entirely.

In the end we should provide a function to "normalize" a URL like browsers normalize them when users enter a URL to the address bar. Adding schema is a step this function should be doing.

@Preetwinder Preetwinder force-pushed the Preetwinder:scrapy-add_scheme branch from b7660e3 to 47c8e2b Sep 24, 2015
@Preetwinder
Copy link
Contributor Author

@Preetwinder Preetwinder commented Sep 24, 2015

Thank you for your encouraging words. Please tell, if any further changes need to be made to the PR.

Yeah, urlparse doesn't handle invalid URLs. There is no need to stay within urlparse feature set; I'm fine even with replacing urplarse entirely.

In the end we should provide a function to "normalize" a URL like browsers normalize them when users enter a URL to the address bar. Adding schema is a step this function should be doing.

You mention in the link you provided that a url parsing implementation extracted from a Web Browser would be suitable as a replacement for urlparse. Would Wget also be a suitable candidate for this? It is fast, GPL 3 licensed, supports modern urls.

@kmike kmike changed the title #1487 add_scheme_if_missing for `scrapy shell` command [MRG+1] #1487 add_scheme_if_missing for `scrapy shell` command Sep 30, 2015
@kmike
Copy link
Member

@kmike kmike commented Sep 30, 2015

The PR looks good to me, +1 to merge.

You mention in the link you provided that a url parsing implementation extracted from a Web Browser would be suitable as a replacement for urlparse. Would Wget also be a suitable candidate for this? It is fast, GPL 3 licensed, supports modern urls.

We can't use wget code because Scrapy uses BSD license, and relying on wget code will force us to change the license to GPL.

dangra added a commit that referenced this pull request Oct 1, 2015
[MRG+1] #1487 add_scheme_if_missing for `scrapy shell` command
@dangra dangra merged commit fe15f93 into scrapy:master Oct 1, 2015
2 checks passed
2 checks passed
codecov/patch 100.00% of diff hit (target 100.00%)
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@dangra
Copy link
Member

@dangra dangra commented Oct 1, 2015

💯 simple and well tested. thanks!

@Preetwinder
Copy link
Contributor Author

@Preetwinder Preetwinder commented Oct 1, 2015

Thank you!

@Preetwinder Preetwinder deleted the Preetwinder:scrapy-add_scheme branch Oct 3, 2015
@kmike kmike mentioned this pull request Oct 30, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

5 participants
You can’t perform that action at this time.