[MRG+1] #1487 add_scheme_if_missing for `scrapy shell` command #1498

Merged

merged 4 commits into scrapy:master from Preetwinder:scrapy-add_scheme on Oct 1, 2015

Conversation

@Preetwinder (Contributor) commented Sep 16, 2015

This change addresses the enhancement mentioned in #1487.

The add_scheme_if_missing function uses the parse_url function to parse the URL. If the scheme is missing, 'http' is added as the default scheme. It then checks whether the netloc field is empty; if so, it moves the path field into netloc and sets path to empty. This is needed because for scheme-less URLs like "www.example.com", the parse_url function (which itself uses urlparse) leaves netloc empty and puts the entire URL in path, which leads to incorrect URL generation when geturl() is called.

The add_scheme_if_missing function is called by the shell command right after it extracts the URL from the arguments list. A check ensures that a null value is not passed to the function.

The test covers some common URL variations.
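The described behavior can be sketched roughly as follows. This is an illustrative sketch only, using the stdlib urllib.parse directly rather than Scrapy's parse_url helper, and it follows the names in the PR description rather than the final merged code (the function is renamed later in this thread):

```python
from urllib.parse import urlparse, urlunparse

def add_scheme_if_missing(url):
    """Default to 'http' when a URL like 'www.example.com' has no scheme."""
    parts = urlparse(url)
    if not parts.scheme:
        parts = parts._replace(scheme="http")
    if not parts.netloc:
        # For scheme-less input, urlparse leaves netloc empty and puts the
        # whole URL in path; swap them so urlunparse produces a valid URL.
        parts = parts._replace(netloc=parts.path, path="")
    return urlunparse(parts)

print(add_scheme_if_missing("www.example.com"))     # http://www.example.com
print(add_scheme_if_missing("http://example.com/a"))  # http://example.com/a
```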

codecov-io commented Sep 16, 2015

Current coverage is 82.44%

Merging #1498 into master will not affect coverage as of 0d6f0b2

Powered by Codecov. Updated on successful CI builds.


@nramirezuy (Member) commented Sep 16, 2015

This new function should be called add_http_if_no_scheme; it just adds http.


@Preetwinder (Contributor) commented Sep 16, 2015

Thank you for reviewing my request. I should have tested more rigorously before submitting. I was not aware of the different variations of URLs that kmike mentioned. This is also the first time I am contributing to an open source project.

While writing more extensive tests, I discovered another failure point due to urlparse incorrectly handling invalid URLs. If the scheme is not present and both the port and the path are specified, an incorrect URL is generated. This happens because urlparse uses the first occurrence of : to parse the scheme. Correcting this would require some hacky code, if we intend to stay within the feature set provided by the urlparse library.

I feel such an addition would negatively affect the readability of the code. I have been unable to find an elegant solution to the problem. It might be preferable to use a version of the code which directly alters the URL, similar to the code kmike provided on the issue page, since the urlparse library isn't designed to handle invalid URLs. Please advise which approach is preferable.

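The port-plus-path failure mode described above is easy to reproduce with the stdlib parser (illustrative only):

```python
from urllib.parse import urlparse

# 'www.example.com' consists entirely of valid scheme characters
# (letters, digits, '+', '-', '.'), so when no real scheme is present
# urlparse treats everything before the first ':' as the scheme.
parts = urlparse("www.example.com:8080/some/path")
print(parts.scheme)  # www.example.com
print(parts.netloc)  # '' (empty)
print(parts.path)    # 8080/some/path
```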

@Preetwinder (Contributor) commented Sep 18, 2015

I have proceeded to make the changes mentioned by kmike and nramirezuy. Apart from adding the new tests and the docstring, I have also tested the function against multiple other URL variants separately.


@Preetwinder (Contributor) commented Sep 21, 2015

Thank you kmike for your feedback and patience. Since the master branch has changed, should I merge or rebase before pushing the new commits?


@kmike (Member) commented Sep 21, 2015

Thanks @Preetwinder! Your PR is very good (and it was good from the beginning); +1 to merge it after a few minor changes.

> Since the master branch has changed, should I merge or rebase before pushing the new commits?

I think it is better to rebase.

> While writing more extensive tests, I discovered another failure point due to urlparse incorrectly handling invalid URLs. If the scheme is not present and both the port and the path are specified, an incorrect URL is generated. This happens because urlparse uses the first occurrence of : to parse the scheme. Correcting this would require some hacky code, if we intend to stay within the feature set provided by the urlparse library.

Yeah, urlparse doesn't handle invalid URLs. There is no need to stay within the urlparse feature set; I'm fine even with replacing urlparse entirely.

In the end we should provide a function to "normalize" a URL the way browsers normalize what users enter in the address bar. Adding the scheme is one step this function should perform.


@Preetwinder (Contributor) commented Sep 24, 2015

Thank you for your encouraging words. Please tell me if any further changes need to be made to the PR.

> Yeah, urlparse doesn't handle invalid URLs. There is no need to stay within the urlparse feature set; I'm fine even with replacing urlparse entirely.
>
> In the end we should provide a function to "normalize" a URL the way browsers normalize what users enter in the address bar. Adding the scheme is one step this function should perform.

You mention in the link you provided that a URL parsing implementation extracted from a web browser would be a suitable replacement for urlparse. Would Wget also be a suitable candidate? It is fast, GPL 3 licensed, and supports modern URLs.


@kmike kmike changed the title from #1487 add_scheme_if_missing for `scrapy shell` command to [MRG+1] #1487 add_scheme_if_missing for `scrapy shell` command Sep 30, 2015

@kmike (Member) commented Sep 30, 2015

The PR looks good to me, +1 to merge.

> You mention in the link you provided that a URL parsing implementation extracted from a web browser would be a suitable replacement for urlparse. Would Wget also be a suitable candidate? It is fast, GPL 3 licensed, and supports modern URLs.

We can't use Wget code because Scrapy uses the BSD license, and relying on Wget code would force us to change the license to GPL.


dangra added a commit that referenced this pull request Oct 1, 2015

Merge pull request #1498 from Preetwinder/scrapy-add_scheme
[MRG+1] #1487 add_scheme_if_missing for `scrapy shell` command

@dangra dangra merged commit fe15f93 into scrapy:master Oct 1, 2015

2 checks passed:

codecov/patch: 100.00% of diff hit (target 100.00%)
continuous-integration/travis-ci/pr: The Travis CI build passed
@dangra (Member) commented Oct 1, 2015

💯 simple and well tested. thanks!


@Preetwinder (Contributor) commented Oct 1, 2015

Thank you!


@Preetwinder Preetwinder deleted the Preetwinder:scrapy-add_scheme branch Oct 3, 2015

@kmike kmike referenced this pull request Oct 30, 2015

Closed: make Scrapy 1.1 release #1529
