Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support transparent mapping of Scrapy requests to Zyte Data API requests #41

Merged
merged 85 commits into from Oct 19, 2022

Conversation

Gallaecio
Copy link
Contributor

@Gallaecio Gallaecio commented Aug 12, 2022

Built on top of #40, this pull request aims to address point 4 from #40 (comment).

Please, ignore the actual implementation at the time being, and instead have a look at the test scenarios introduced here. I think we should discuss changes, additions and removals around those before moving forward. Questions to discuss include:

  • Are we OK with making the feature opt-out, or should it be opt-in?
  • Better suggestions for new settings? Are the two future-proofing-related settings worth adding?
  • Any additional considerations to make the plugin more future-proof? Does any of the approaches go too far?
  • Are we OK with the approach to warning, and the scenarios causing a warning? Should we be more lenient with headers about warnings? Is it OK to warn against request metadata usage for all keys that can be defined through request attributes?

To do:

  • Agree on specifications based on test scenarios.
  • Provide documentation.
  • Provide complete test coverage.
  • Refactor the implementation.
  • Make final logic adjustments to tests and implementation

Resolves #12, resolves #16, resolves #17, resolves #19.

@Gallaecio Gallaecio requested review from kmike and BurnzZ August 12, 2022 21:07
Copy link
Member

@BurnzZ BurnzZ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we OK with making the feature opt-out, or should it be opt-in

I'm leaning towards having it opt-out (enabled by default) since it seems more natural to write the requests this way. +1 on the current setup.

Better suggestions for new settings? Are the two future-proofing-related settings worth adding?

Could you clarify the expected behaviors for ZYTE_API_UNSUPPORTED_HEADERS and ZYTE_API_BROWSER_HEADERS? I'm not entirely sure if I understood how they work.

Any additional considerations to make the plugin more future-proof? Does any of the approaches go too far?

I think the ZYTE_API_AUTOMAP captures most of the cases that we need. Great work!

Are we OK with the approach to warning, and the scenarios causing a warning? Should we be more lenient with headers about warnings? Is it OK to warn against request metadata usage for all keys that can be defined through request attributes?

The route of issuing a warning sounds great. Though I think we need to tweak it a bit to make it clear for the users what is actually passed (i.e. the values taking precedence) when warnings are used.

tests/test_api_requests.py Outdated Show resolved Hide resolved
tests/test_api_requests.py Outdated Show resolved Hide resolved
@kmike
Copy link
Member

kmike commented Aug 15, 2022

hey! I was thinking about being able to do the following 3 things in a spider, at the same time:

  1. Make all "normal" requests go through Zyte API instead of regular Scrapy downloader, in a way which is similar to using a proxy for everything.
  2. It'd be nice to be able to control some parameters for these "normal" requests per-request.
  3. Make some requests to use the exact Zyte API options provided, without any magic. There is going to be an UI to debug these parameters, finiding those which are working, and copy-pasting them.

Use case: extract data from some pages, or to use some browser action, while downloading everything else as usual.

So, it seems it's neither "opt-in" nor "opt-out", it's almost like two separate features (1+2 vs 3), which exist in parallel. Does it make sense?

@Gallaecio
Copy link
Contributor Author

Gallaecio commented Aug 24, 2022

1 would be the default behavior with the proposed implementation, and could be disabled through a setting.

2 would also work with the proposed implementation. But you would get a warning if you try to control through Zyte API parameters something that you can control through Request parameters.

For 3, what about defining a zyte_api_automap request meta key that can be set to False per request?

@Gallaecio
Copy link
Contributor Author

Could you clarify the expected behaviors for ZYTE_API_UNSUPPORTED_HEADERS and ZYTE_API_BROWSER_HEADERS? I'm not entirely sure if I understood how they work.

First off: I am more than open to remove them altogether.

The idea is for these parameters to allow flexibility to support some future Zyte API changes without needing to upgrade to a newer scrapy-zyte-api version.

If tomorrow Zyte API starts allowing to set the Cookie or User-Agent headers, you can use the ZYTE_API_UNSUPPORTED_HEADERSsetting to make it so that scrapy-zyte-api stops excluding them from the mapping. So, if you set the User-Agent header, and set ZYTE_API_UNSUPPORTED_HEADERS=['Cookies'], the User-Agent header is included in customHttpRequestHeaders.

Similar for ZYTE_API_BROWSER_HEADERS. If in the future Zyte API allows an additional header, like Accept-Language, you can set ZYTE_API_BROWSER_HEADERS={"Accept-Language": "acceptLanguage", "Referer": "referer"} so that the header gets mapped into requestHeaders as acceptLanguage.

@codecov
Copy link

codecov bot commented Aug 24, 2022

Codecov Report

Merging #41 (80f7b46) into main (9d1f2c8) will increase coverage by 0.27%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main      #41      +/-   ##
==========================================
+ Coverage   99.41%   99.69%   +0.27%     
==========================================
  Files           4        4              
  Lines         172      323     +151     
==========================================
+ Hits          171      322     +151     
  Misses          1        1              
Impacted Files Coverage Δ
scrapy_zyte_api/_params.py 100.00% <100.00%> (ø)
scrapy_zyte_api/handler.py 99.03% <100.00%> (+0.08%) ⬆️
scrapy_zyte_api/__init__.py

scrapy_zyte_api/handler.py Outdated Show resolved Hide resolved
scrapy_zyte_api/handler.py Outdated Show resolved Hide resolved
scrapy_zyte_api/handler.py Outdated Show resolved Hide resolved
@Gallaecio
Copy link
Contributor Author

Tests refactored to use the new _get_api_params instead of instantiating a mock server every time. They are also split into groups, but otherwise are basically the same as before.

However, I also made the following changes as I was working on that refactoring:

  • Backward incompatible: if the meta object is truthy but not a mapping, ValueError is now raised (before it would raise TypeError or ValueError depending on the value type)
  • Added additional warnings when defining in meta parameters with no effect (removing them would cause the same API parameters to be used in the end due to the automatic mapping logic).
  • Automated mapping now removes parameters if set to their server-side default value, minimizing parameters sent to the server.
  • None as a meta parameter value unsets a parameter set through ZYTE_API_DEFAULT_PARAMS. If we fear null may become a non-default, valid value for an API parameter, we could go for a special UNSET object. I added this at first to allow silencing warnings caused by combinations of default params and meta params that made sense separately, but I later decided to solve that in a way that does not require explicitly unsetting params, and now I am not sure if we should keep this feature or remove it. I did not come up with the alternative either; the scenarios I could think of are covered by test_default_params_automap and trigger a warning with the current implementation although I think they should not.
  • Explicitly defined httpResponseBody=False, browserHtml=False and screenshot=False in contexts where they are not necessary triggers a warning.

The implementation is growing horrible, but I think we can make it clean once we figure out what behavior we want API-wise.

setup.py Outdated Show resolved Hide resolved
tests/test_api_requests.py Outdated Show resolved Hide resolved
tests/test_api_requests.py Outdated Show resolved Hide resolved
{
"httpResponseBody": True,
"httpResponseHeaders": True,
"httpRequestBody": "YQ==",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also ensure that the httpRequestMethod is not GET when the body is set?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it is technically possible for it to be. If it was not technically possible, it should be Scrapy itself complaining, not this plugin.

I am not sure whether or not Zyte API itself allows it, but if it does not, I think it may be best to let Zyte API be the one who complains; similarly to how I think we should allow httpResponseBody and browserHtml be combined, even though Zyte API does not currently support that.

tests/test_api_requests.py Show resolved Hide resolved
tests/test_api_requests.py Outdated Show resolved Hide resolved
scrapy_zyte_api/handler.py Outdated Show resolved Hide resolved
tests/test_api_requests.py Outdated Show resolved Hide resolved
tests/test_api_requests.py Show resolved Hide resolved
@Gallaecio
Copy link
Contributor Author

@kmike I backtracked on what we discussed elsewhere about marking httpResponseHeaders as an implementation detail in the documentation, because depending on how Zyte API implements text decoding, it may be best that automatic parameter mapping continues to use the current httpResponseBody + httpResponseHeaders combo by default to support requests for binary data as well.

README.rst Show resolved Hide resolved
README.rst Show resolved Hide resolved
README.rst Show resolved Hide resolved
…nd do not enable httpResponseHeaders as a side effect of browserHtml
@Gallaecio
Copy link
Contributor Author

Gallaecio commented Oct 18, 2022

I have documented httpResponseBody and httpResponseHeaders being True by default when using automatically-mapped request parameters as an implementation detail, recommended enabling them manually in the right scenarios, and removed warnings that used to be triggered by that.

I also realized that there was no reason for httpResponseHeaders to be enabled as a side effect of enabling browserHtml, and changed that as well. I suspect it stemmed from me misunderstanding how the current implementation works (i.e. maybe I was thinking without response headers browserHtml response were also treated as binary responses).

Copy link
Member

@kmike kmike left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks awesome, thanks @Gallaecio!

@kmike
Copy link
Member

kmike commented Oct 19, 2022

Ok, there are no further comments, so let's merge it :)

@kmike kmike merged commit fe0ea7d into scrapy-plugins:main Oct 19, 2022
@Gallaecio Gallaecio mentioned this pull request Oct 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants