Skip to content

Bind request to response only if not bound already #4529

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
elacuesta opened this issue May 1, 2020 · 3 comments · Fixed by #4632
Closed

Bind request to response only if not bound already #4529

elacuesta opened this issue May 1, 2020 · 3 comments · Fixed by #4632
Labels

Comments

@elacuesta
Copy link
Member

elacuesta commented May 1, 2020

Summary

Currently requests are bound to responses in the engine and/or in the scraper.
I would like to start a conversation to figure out the implications of binding them only if the response.request attribute is not already set, i.e. something like this:

diff --git scrapy/core/engine.py scrapy/core/engine.py
index 77d71846..72f1628a 100644
--- scrapy/core/engine.py
+++ scrapy/core/engine.py
@@ -244,7 +244,8 @@ class ExecutionEngine:
                     % (type(response), response)
                 )
             if isinstance(response, Response):
-                response.request = request  # tie request to response received
+                if response.request is None:
+                    response.request = request
                 logkws = self.logformatter.crawled(request, response, spider)
                 if logkws is not None:
                     logger.log(*logformatter_adapter(logkws), extra={'spider': spider})


diff --git scrapy/core/scraper.py scrapy/core/scraper.py
index edbb4dd6..256935da 100644
--- scrapy/core/scraper.py
+++ scrapy/core/scraper.py
@@ -146,7 +146,8 @@ class Scraper:
                 self._log_download_errors, request_result, request, spider)
 
     def call_spider(self, result, request, spider):
-        result.request = request
+        if not hasattr(result, "request") or result.request is None:
+            result.request = request
         dfd = defer_result(result)
         callback = request.callback or spider.parse
         warn_on_generator_with_return_value(spider, callback)

Motivation

Recently I've been working on a downloader middleware to transparently redirect requests to a service which takes an URL and returns its response body along with some other metadata. The process_request method replaces the URL, and the process_response decodes the response body and creates a new response, making it look like it came directly from the target website, not from the download service. At this point, I'd like to be able to restore the original request as well, but any request I return in the Response.request attribute gets overridden.

Describe alternatives you've considered

ATM I don't see a way to modify this behaviour without altering the Scrapy core.

@dangra
Copy link
Member

dangra commented May 1, 2020

Binding at core/engine.py predates our public release (83dcf8a)
image

Binding at core/scraper.py came 4 years later (0256a34#diff-efe9358e1656d355258b3d73c1a4d189R139)
image

Pablo's merge commit explain why we need scraper's bounding 1e1cc76

@kmike
Copy link
Member

kmike commented May 3, 2020

I'm not sure I understand the implicatiions of the change. But scrapy-splash had a similar challenge (if I understand the problem properly), and in the response it provides url + a few other attributes of both requests, via separate attributes.

I recall implementing it was not nice, and there are caveats, but it was doable. Don't recall the details, but https://github.com/scrapy-plugins/scrapy-splash/blob/e40ca4f3b367ab463273bee1357d3edfe0601f0d/scrapy_splash/response.py#L22 can be a good starting point.

@elacuesta
Copy link
Member Author

Right, I should have checked the blame 😅
Indeed, bad things happen if the request is not present, but I think the current behaviour is not intuitive. You will not get the object you want, and it might bite some other people like it bit me (this example for instance, which only works because request.meta is modified in-place).
So far I am returning the original attributes in the meta dictionary, but this it not ideal IMHO. For instance, requests will appear with the modified URL in the logs, making it difficult to identify them (as they all point to the same URL). I made a custom logformatter as a workaround for this, but it'd be good if I could just be able to bind the actual request I want.

For the record, all tests pass with the above changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants