Support transparent mapping of Scrapy requests to Zyte Data API requests #41

Gallaecio · 2022-08-12T21:07:46Z

Built on top of #40, this pull request aims to address point 4 from #40 (comment).

Please ignore the actual implementation for the time being, and instead have a look at the test scenarios introduced here. I think we should discuss changes, additions and removals around those before moving forward. Questions to discuss include:

Are we OK with making the feature opt-out, or should it be opt-in?
Better suggestions for new settings? Are the two future-proofing-related settings worth adding?
Any additional considerations to make the plugin more future-proof? Does any of the approaches go too far?
Are we OK with the approach to warning, and the scenarios causing a warning? Should we be more lenient with headers about warnings? Is it OK to warn against request metadata usage for all keys that can be defined through request attributes?

To do:

Agree on specifications based on test scenarios.
Provide documentation.
Provide complete test coverage.
Refactor the implementation.
Make final logic adjustments to tests and implementation.

Resolves #12, resolves #16, resolves #17, resolves #19.

BurnzZ

Are we OK with making the feature opt-out, or should it be opt-in?

I'm leaning towards having it opt-out (enabled by default) since it seems more natural to write the requests this way. +1 on the current setup.

Better suggestions for new settings? Are the two future-proofing-related settings worth adding?

Could you clarify the expected behaviors for ZYTE_API_UNSUPPORTED_HEADERS and ZYTE_API_BROWSER_HEADERS? I'm not entirely sure if I understood how they work.

Any additional considerations to make the plugin more future-proof? Does any of the approaches go too far?

I think the ZYTE_API_AUTOMAP captures most of the cases that we need. Great work!

Are we OK with the approach to warning, and the scenarios causing a warning? Should we be more lenient with headers about warnings? Is it OK to warn against request metadata usage for all keys that can be defined through request attributes?

The route of issuing a warning sounds great. Though I think we need to tweak it a bit to make it clear to users which values are actually sent (i.e. which values take precedence) when a warning is issued.
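
A minimal sketch of a warning that spells out precedence. `merge_with_warning` is a hypothetical helper, not the plugin's code, and the rule that the explicitly defined value wins is an assumption made only for illustration:

```python
import warnings


def merge_with_warning(mapped, explicit):
    """Merge automatically mapped params with explicitly defined ones.

    Hypothetical helper: assumes explicit values win, and the warning
    message names the value that ends up being sent.
    """
    result = dict(mapped)
    for key, value in explicit.items():
        if key in mapped and mapped[key] != value:
            warnings.warn(
                f"Parameter {key!r} is defined both through the request "
                f"({mapped[key]!r}) and explicitly ({value!r}); "
                f"using the explicit value {value!r}.",
                stacklevel=2,
            )
        result[key] = value
    return result
```

The point of the message format is that a user reading the log can tell what was sent without knowing the merge rules.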

kmike · 2022-08-15T08:30:24Z

hey! I was thinking about being able to do the following 3 things in a spider, at the same time:

Make all "normal" requests go through Zyte API instead of regular Scrapy downloader, in a way which is similar to using a proxy for everything.
It'd be nice to be able to control some parameters for these "normal" requests per-request.
Make some requests use the exact Zyte API options provided, without any magic. There is going to be a UI to debug these parameters, find those which work, and copy-paste them.

Use case: extract data from some pages, or to use some browser action, while downloading everything else as usual.

So, it seems it's neither "opt-in" nor "opt-out", it's almost like two separate features (1+2 vs 3), which exist in parallel. Does it make sense?

Gallaecio · 2022-08-24T09:02:59Z

1 would be the default behavior with the proposed implementation, and could be disabled through a setting.

2 would also work with the proposed implementation. But you would get a warning if you try to control through Zyte API parameters something that you can control through Request parameters.

For 3, what about defining a zyte_api_automap request meta key that can be set to False per request?
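
A rough sketch of how a per-request key could override the project-wide switch. `use_automap` is a hypothetical function; the `ZYTE_API_AUTOMAP` default of `True` (opt-out) and the `zyte_api` meta key carrying the exact parameters are assumptions for illustration:

```python
def use_automap(settings, meta):
    """Decide whether a request gets automatic parameter mapping.

    The per-request ``zyte_api_automap`` meta key, when present, overrides
    the project-wide ``ZYTE_API_AUTOMAP`` setting (assumed to default to
    True, i.e. opt-out).
    """
    default = settings.get("ZYTE_API_AUTOMAP", True)
    return bool(meta.get("zyte_api_automap", default))


settings = {}  # nothing configured: mapping is on by default

# Cases 1 and 2: a "normal" request, mapped automatically.
normal_request_meta = {}

# Case 3: exact parameters, no mapping.
exact_request_meta = {
    "zyte_api_automap": False,
    "zyte_api": {"browserHtml": True},  # assumed meta key, sent as-is
}
```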

Gallaecio · 2022-08-24T09:13:24Z

Could you clarify the expected behaviors for ZYTE_API_UNSUPPORTED_HEADERS and ZYTE_API_BROWSER_HEADERS? I'm not entirely sure if I understood how they work.

First off: I am more than open to removing them altogether.

The idea is for these parameters to allow flexibility to support some future Zyte API changes without needing to upgrade to a newer scrapy-zyte-api version.

If tomorrow Zyte API starts allowing the Cookie or User-Agent headers to be set, you can use the ZYTE_API_UNSUPPORTED_HEADERS setting to make scrapy-zyte-api stop excluding them from the mapping. So, if you set the User-Agent header and set ZYTE_API_UNSUPPORTED_HEADERS=['Cookie'], the User-Agent header is included in customHttpRequestHeaders.

Similarly for ZYTE_API_BROWSER_HEADERS. If in the future Zyte API allows an additional header, like Accept-Language, you can set ZYTE_API_BROWSER_HEADERS={"Accept-Language": "acceptLanguage", "Referer": "referer"} so that the header gets mapped into requestHeaders as acceptLanguage.
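
To illustrate the two settings, here is a simplified sketch. `map_headers` and the default values are hypothetical, and unlike a real handler it returns both header parameters at once instead of choosing one based on the requested output:

```python
# Assumed defaults, taken from the examples in this comment.
DEFAULT_UNSUPPORTED_HEADERS = ["Cookie", "User-Agent"]
DEFAULT_BROWSER_HEADERS = {"Referer": "referer"}


def map_headers(headers, settings):
    """Split request headers into customHttpRequestHeaders and requestHeaders."""
    unsupported = {
        name.lower()
        for name in settings.get(
            "ZYTE_API_UNSUPPORTED_HEADERS", DEFAULT_UNSUPPORTED_HEADERS
        )
    }
    browser = {
        name.lower(): param
        for name, param in settings.get(
            "ZYTE_API_BROWSER_HEADERS", DEFAULT_BROWSER_HEADERS
        ).items()
    }
    custom_http_request_headers = []
    request_headers = {}
    for name, value in headers.items():
        key = name.lower()
        if key in unsupported:
            continue  # excluded from the mapping altogether
        custom_http_request_headers.append({"name": name, "value": value})
        if key in browser:
            request_headers[browser[key]] = value
    return custom_http_request_headers, request_headers
```

With `ZYTE_API_UNSUPPORTED_HEADERS` set to `["Cookie"]`, a `User-Agent` header is no longer skipped; with the assumed defaults it is dropped.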

codecov · 2022-08-24T09:48:09Z

Codecov Report

Merging #41 (80f7b46) into main (9d1f2c8) will increase coverage by 0.27%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main      #41      +/-   ##
==========================================
+ Coverage   99.41%   99.69%   +0.27%     
==========================================
  Files           4        4              
  Lines         172      323     +151     
==========================================
+ Hits          171      322     +151     
  Misses          1        1

Impacted Files	Coverage Δ
scrapy_zyte_api/_params.py	100.00% <100.00%> (ø)
scrapy_zyte_api/handler.py	99.03% <100.00%> (+0.08%) ⬆️

Gallaecio · 2022-08-25T18:42:37Z

Tests refactored to use the new _get_api_params instead of instantiating a mock server every time. They are also split into groups, but otherwise are basically the same as before.

However, I also made the following changes as I was working on that refactoring:

Backward incompatible: if the meta object is truthy but not a mapping, ValueError is now raised (before it would raise TypeError or ValueError depending on the value type)
Added additional warnings for parameters defined in meta that have no effect (removing them would cause the same API parameters to be used in the end due to the automatic mapping logic).
Automated mapping now removes parameters if set to their server-side default value, minimizing parameters sent to the server.
None as a meta parameter value unsets a parameter set through ZYTE_API_DEFAULT_PARAMS. If we fear null may become a non-default, valid value for an API parameter, we could go for a special UNSET object. I added this at first to allow silencing warnings caused by combinations of default params and meta params that made sense separately, but I later decided to solve that in a way that does not require explicitly unsetting params, and now I am not sure if we should keep this feature or remove it. I did not come up with the alternative either; the scenarios I could think of are covered by test_default_params_automap and trigger a warning with the current implementation although I think they should not.
Explicitly defining httpResponseBody=False, browserHtml=False or screenshot=False in contexts where they are not necessary now triggers a warning.
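
The meta validation, the None-unsets-a-default rule and the removal of server-side defaults described in the list above can be sketched roughly as follows. `build_params` and the `SERVER_DEFAULTS` table are hypothetical and only cover a few parameters:

```python
from collections.abc import Mapping

# Assumed server-side defaults, for illustration only.
SERVER_DEFAULTS = {
    "httpRequestMethod": "GET",
    "browserHtml": False,
    "screenshot": False,
}


def build_params(default_params, meta_params):
    """Combine default params with per-request meta params."""
    # A truthy meta value that is not a mapping is rejected with ValueError.
    if meta_params and not isinstance(meta_params, Mapping):
        raise ValueError(
            f"Expected a mapping, got {type(meta_params).__name__}"
        )
    params = dict(default_params)
    for key, value in (meta_params or {}).items():
        if value is None:
            params.pop(key, None)  # None unsets a default parameter
        else:
            params[key] = value
    # Drop values the server would assume anyway.
    return {
        key: value
        for key, value in params.items()
        if not (key in SERVER_DEFAULTS and SERVER_DEFAULTS[key] == value)
    }
```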

The implementation is growing horrible, but I think we can make it clean once we figure out what behavior we want API-wise.

BurnzZ · 2022-08-26T07:09:25Z

tests/test_api_requests.py

+            {
+                "httpResponseBody": True,
+                "httpResponseHeaders": True,
+                "httpRequestBody": "YQ==",


Should we also ensure that the httpRequestMethod is not GET when the body is set?

Gallaecio · 2022-08-29

I believe it is technically possible for it to be. If it was not technically possible, it should be Scrapy itself complaining, not this plugin.

I am not sure whether or not Zyte API itself allows it, but if it does not, I think it may be best to let Zyte API be the one that complains, similarly to how I think we should allow httpResponseBody and browserHtml to be combined, even though Zyte API does not currently support that.
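
For reference, the `"YQ=="` in the test scenario is the Base64 encoding of the single byte `a`. A minimal sketch of the method and body mapping follows; `map_method_and_body` is a hypothetical name, and it deliberately does not validate the method and body combination, in line with the reasoning above:

```python
import base64


def map_method_and_body(method, body):
    """Map a request method and body (bytes) to Zyte API parameters."""
    params = {}
    if method != "GET":  # GET is assumed to be the server-side default
        params["httpRequestMethod"] = method
    if body:
        params["httpRequestBody"] = base64.b64encode(body).decode("ascii")
    return params
```

A GET request with a body simply yields `httpRequestBody` without `httpRequestMethod`, leaving it to the server to accept or reject it.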


Gallaecio · 2022-10-14T10:48:33Z

@kmike I backtracked on what we discussed elsewhere about marking httpResponseHeaders as an implementation detail in the documentation, because depending on how Zyte API implements text decoding, it may be best that automatic parameter mapping continues to use the current httpResponseBody + httpResponseHeaders combo by default to support requests for binary data as well.
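
To show why the body and headers combo helps with both text and binary data, here is a sketch that assumes decoding happens on the client and relies only on the Content-Type charset. `decode_body` is hypothetical, and the plugin's actual response mapping may work differently:

```python
import base64
from email.message import Message


def decode_body(api_response):
    """Return text if the headers declare a charset, raw bytes otherwise."""
    body = base64.b64decode(api_response["httpResponseBody"])
    content_type = None
    for header in api_response.get("httpResponseHeaders", []):
        if header["name"].lower() == "content-type":
            content_type = header["value"]
    if content_type is None:
        return body  # no headers: nothing to base a decoding decision on
    message = Message()
    message["Content-Type"] = content_type
    charset = message.get_content_charset()
    if charset is None:
        return body  # e.g. image/png: keep the binary data as is
    return body.decode(charset)
```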

Gallaecio · 2022-10-18T08:59:41Z

I have documented httpResponseBody and httpResponseHeaders being True by default when using automatically-mapped request parameters as an implementation detail, recommended enabling them manually in the right scenarios, and removed warnings that used to be triggered by that.

I also realized that there was no reason for httpResponseHeaders to be enabled as a side effect of enabling browserHtml, and changed that as well. I suspect it stemmed from me misunderstanding how the current implementation works (i.e. maybe I was thinking that, without response headers, browserHtml responses were also treated as binary responses).
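
A sketch of the resulting defaults under automatic mapping. `default_outputs` is a hypothetical function, and the rule that requesting browserHtml or screenshot suppresses the httpResponseBody default is an assumption, not something stated in this thread:

```python
def default_outputs(explicit):
    """Fill in default output parameters for an automatically mapped request."""
    params = dict(explicit)
    # Assumption: browser outputs replace the default HTTP response body.
    wants_browser = params.get("browserHtml") or params.get("screenshot")
    if not wants_browser:
        params.setdefault("httpResponseBody", True)
    # Headers accompany the body; browserHtml alone does not enable them.
    if params.get("httpResponseBody"):
        params.setdefault("httpResponseHeaders", True)
    return params
```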

kmike

Looks awesome, thanks @Gallaecio!

kmike · 2022-10-19T14:49:10Z

Ok, there are no further comments, so let's merge it :)

Gallaecio added 4 commits August 12, 2022 10:39

Add ZYTE_API_ENABLED

ce6f9ba

Split ZYTE_API_ALL off of ZYTE_API_ENABLED

6348e81

ZYTE_API_ALL → ZYTE_API_ON_ALL_REQUESTS; mention setting default valu…

5f78c55

…es explicitly

Initial proposal for transparent mapping

68f43dc

Gallaecio requested review from kmike and BurnzZ August 12, 2022 21:07

Gallaecio added 2 commits August 12, 2022 23:38

Update test_zyte_api_request_meta

a2ad009

Run pre-commit hooks

4c44ebe

BurnzZ reviewed Aug 15, 2022

Gallaecio added 3 commits August 24, 2022 09:59

Merge remote-tracking branch 'upstream/main' into enable-setting

78fb734

README: move ZYTE_API_ENABLED under Configuration

26b2ef0

test_zyte_api_request_meta: update misleading values

e78a499

Gallaecio added 3 commits August 24, 2022 11:29

Merge remote-tracking branch 'origin/enable-setting' into transparent…

29fdbec

…-mapping

Clarify value precedence in warnings

d9fbbf1

Include an unsupported header in the test for ZYTE_API_UNSUPPORTED_HE…

4442ac9

…ADERS

Gallaecio added 3 commits August 24, 2022 11:49

Implement the zyte_api_automap request meta key

a80b045

Add a missing test case

9864273

Restore Python 3.7 support

afd9b6d

kmike reviewed Aug 24, 2022

Gallaecio added 2 commits August 24, 2022 22:17

Refactor parse_api_params and start refactoring tests

c49e1b3

Complete test refactoring

400a4c2

BurnzZ reviewed Aug 26, 2022

Revert change no longer needed

a19c004

BurnzZ reviewed Sep 27, 2022

Gallaecio and others added 16 commits September 27, 2022 16:32

Update tests/test_api_requests.py

d454c9b

Co-authored-by: Kevin Lloyd Bernal <kevinoxy@gmail.com>

README: sort usage approaches by relevance

1c46849

README: include more code examples

987ef16

README: provide more specific links to the Zyte API documentation

a4fa769

unsupported headers → skip headers

6098b69

README: add code examples of automated parameter mapping

41d9a70

Clarify how default parameter settings affect unrelated requests

8e93e00

_iter_headers: parameter → header_parameter

0019ea2

Move parameter handling to its own module

82cbc54

make_handler: clarify when handler is set to None

109ed87

tests: fix typo (Request.meta → Request.method)

33a0501

tests: Fix example Content-Length value

9f7fb4b

Clarify message about httpResponseHeaders=False being unnecessary

6a1a496

Remove commented out code

dcfc5b7

tests: fix Content-Length expectations

40e708e

Pin pyopenssl==22.0.0 on the pinned Tox environment

329ba58

kmike reviewed Oct 16, 2022

Gallaecio added 3 commits October 17, 2022 09:59

Document how httpResponseBody should be accompanied by httpResponseHe…

79ecfd1

…aders when defining parameters manually

Add a section about response mapping

dd67e12

Discourage using certain parameters on ZYTE_API_AUTOMAP_PARAMS

c386036

kmike reviewed Oct 17, 2022

Do not warn about explicit httpResponseBody or httpResponseHeaders, a…

80f7b46

…nd do not enable httpResponseHeaders as a side effect of browserHtml

kmike approved these changes Oct 18, 2022

kmike merged commit fe0ea7d into scrapy-plugins:main Oct 19, 2022

Gallaecio mentioned this pull request Oct 20, 2022

CI: run twine check #55

Merged
