Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] Create Request from curl command #3862

Merged
merged 19 commits into from
Aug 8, 2019

Conversation

noviluni
Copy link
Member

@noviluni noviluni commented Jul 9, 2019

This is a WIP, but I open this PR to discuss how to implement this feature.

Based on: #2985 (abandoned)

@noviluni
Copy link
Member Author

noviluni commented Jul 9, 2019

Based on comments from #2985 (@kmike), I have renamed the function, added docs, added some tests and implemented Request.from_curl.

I have also improved the function by adding basic auth support and fixing when scheme is missing.

My idea is to add more tests with special cases (got some ideas from: https://github.com/NickCarneiro/curlconverter/tree/master/fixtures/curl_commands) and of course implement all missing features.

Any ideas or feedback? :)

@codecov
Copy link

codecov bot commented Jul 9, 2019

Codecov Report

Merging #3862 into master will increase coverage by 0.07%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #3862      +/-   ##
==========================================
+ Coverage   85.56%   85.63%   +0.07%     
==========================================
  Files         164      167       +3     
  Lines        9565     9742     +177     
  Branches     1435     1457      +22     
==========================================
+ Hits         8184     8343     +159     
- Misses       1133     1146      +13     
- Partials      248      253       +5
Impacted Files Coverage Δ
scrapy/http/request/__init__.py 100% <100%> (ø) ⬆️
scrapy/utils/curl.py 100% <100%> (ø)
scrapy/core/downloader/tls.py 88.88% <0%> (-1.12%) ⬇️
scrapy/core/downloader/handlers/http10.py 100% <0%> (ø) ⬆️
scrapy/pipelines/files.py 66.53% <0%> (ø) ⬆️
scrapy/downloadermiddlewares/robotstxt.py 100% <0%> (ø) ⬆️
scrapy/logformatter.py 90.47% <0%> (ø) ⬆️
scrapy/core/downloader/handlers/http11.py 92.53% <0%> (ø) ⬆️
scrapy/robotstxt.py 96.72% <0%> (ø)
... and 5 more

Copy link
Member

@Gallaecio Gallaecio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look great, thanks for picking this up.

I think there are a few pages of the documentation that would greatly benefit from covering this feature, if you are up to it:

docs/topics/request-response.rst Outdated Show resolved Hide resolved
docs/topics/request-response.rst Outdated Show resolved Hide resolved
scrapy/http/request/__init__.py Outdated Show resolved Hide resolved
scrapy/utils/curl.py Outdated Show resolved Hide resolved
@kmike
Copy link
Member

kmike commented Jul 11, 2019

@noviluni just heads up: we want to make Scrapy release next week; if you finish this PR this week, we'll get it in :)

@noviluni

This comment has been minimized.

@noviluni
Copy link
Member Author

@noviluni just heads up: we want to make Scrapy release next week; if you finish this PR this week, we'll get it in :)

@kmike That would be great, but I need feedback about what to do with unrecognized parameters (comment above).

Apart of the above, we could explain in the docs that there is a limited support and that certain parameters are not allowed.

@noviluni noviluni force-pushed the add-parse-curl-cmd-func branch from 1008d17 to 28197ad Compare July 11, 2019 14:04
Copy link
Member

@Gallaecio Gallaecio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work with the documentation topics. I’ve just added a few suggestions.

docs/topics/developer-tools.rst Outdated Show resolved Hide resolved
docs/topics/developer-tools.rst Outdated Show resolved Hide resolved
docs/topics/developer-tools.rst Outdated Show resolved Hide resolved
docs/topics/dynamic-content.rst Outdated Show resolved Hide resolved
docs/topics/dynamic-content.rst Outdated Show resolved Hide resolved
docs/topics/dynamic-content.rst Outdated Show resolved Hide resolved
docs/topics/developer-tools.rst Show resolved Hide resolved
@Gallaecio
Copy link
Member

[…] we could explain in the docs that there is a limited support and that certain parameters are not allowed.

Yes. Maybe we should even list all the specific parameters that are supported. And, if we end up raising an error on unrecognized parameters, we should also list supported parameters that have no impact on the resulting Request.

scrapy/utils/curl.py Outdated Show resolved Hide resolved
@noviluni noviluni force-pushed the add-parse-curl-cmd-func branch from ec52575 to 876db14 Compare July 18, 2019 15:51
scrapy/utils/curl.py Outdated Show resolved Hide resolved
scrapy/utils/curl.py Outdated Show resolved Hide resolved
Copy link
Member

@Gallaecio Gallaecio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good to me.

Further improvements (supporting more cURL options, allowing to override certain middlewares, enforcing cURL’s default user agent if no User-Agent header is provided) can come later based on user feedback.

scrapy/http/request/__init__.py Outdated Show resolved Hide resolved
scrapy/utils/curl.py Outdated Show resolved Hide resolved
scrapy/utils/curl.py Outdated Show resolved Hide resolved
@Gallaecio Gallaecio changed the title Create Request from curl command [MRG+1] Create Request from curl command Aug 2, 2019
"""
request_kwargs = curl_to_request_kwargs(curl_command, ignore_unknown_options)
request_kwargs.update(kwargs)
return Request(**request_kwargs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it should use cls instead of Request, not sure about this. Opinions?

Copy link
Member Author

@noviluni noviluni Aug 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that's right... It’s not as trivial as it seems. Suppose someone uses from_curl from JSONRequest:

JSONRequest.from_curl('curl example.org').

Using Request the result would be:

{'_encoding': 'utf-8',
 'method': 'GET',
 '_url': 'http://example.org',
 '_body': b'',
 'priority': 0,
 'callback': None,
 'errback': None,
 'cookies': {},
 'headers': {},
 'dont_filter': False,
 '_meta': None,
 '_cb_kwargs': None,
 'flags': []}

Using cls the result would be:

{'_dumps_kwargs': {'sort_keys': True},
 '_encoding': 'utf-8',
 'method': 'GET',
 '_url': 'http://example.org',
 '_body': b'',
 'priority': 0,
 'callback': None,
 'errback': None,
 'cookies': {},
 'headers': {b'Content-Type': b'application/json',
  b'Accept': b'application/json, text/javascript, */*; q=0.01'},
 'dont_filter': False,
 '_meta': None,
 '_cb_kwargs': None,
 'flags': []}

(The same happens with XmlRpcRequest)
I think that the first approach could be better, because it respects the curl command, but I understand that for some users could be counter-intuitive.. seems a UX dilema. In any case, we should probably add something to the docstring. What do you think? @Gallaecio any thoughs?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are not using cls, maybe we should reimplement the new class method in our built-in Request subclasses to raise a NotImplementedError exception.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's the same situation as with middlewares: scrapy should try to create the request with the inputs specified, and then if they are overridden by the class itself, or by the middlewares, you just need to be prepared to that it might happen and not expect 100% conformance.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems a valid argument. We could use cls and extend the "caution" message regarding the middlewares adding that if it's used from Request subclasses they could add some modifications to the request.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I add a commit with that idea. @kmike @Gallaecio tell me what you think. :)

@kmike
Copy link
Member

kmike commented Aug 2, 2019

Thanks @noviluni! This looks really good. I got a late comment about a class which from_curl constructor should create, but no strong opinion on this.

Copy link
Member

@kmike kmike left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @noviluni, @immerrr and @Gallaecio!

@Gallaecio Gallaecio closed this Aug 8, 2019
@Gallaecio Gallaecio reopened this Aug 8, 2019
@Gallaecio
Copy link
Member

The Python 3.7 test failures look like Scrapy’s typical random test failures.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants