Implement a close reason for robots.txt affecting all start requests #6164
base: master
Conversation
Codecov Report

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #6164       +/-   ##
===========================================
- Coverage   88.59%   43.11%   -45.48%
===========================================
  Files         159      159
  Lines       11582    11624      +42
  Branches     1885     1891       +6
===========================================
- Hits        10261     5012    -5249
- Misses        994     6220    +5226
- Partials      327      392      +65
            self.signals.send_catch_log(signal=signals.start_requests_exhausted)
        except Exception:
            self.slot.start_requests = None
            logger.error(
                "Error while obtaining start requests",
                exc_info=True,
                extra={"spider": self.spider},
            )
            self.signals.send_catch_log(signal=signals.start_requests_exhausted)
I wonder if we should implement separate methods for these, but it felt a bit out of scope.
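For illustration, the "separate methods" idea mentioned above could be a small helper on the engine that both paths call. A sketch only; the method name is hypothetical and not part of the PR:

    def _signal_start_requests_exhausted(self) -> None:
        # Single place that fires the signal, called from both the normal
        # exhaustion path and the error path shown above.
        self.signals.send_catch_log(signal=signals.start_requests_exhausted)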
    def _start_request_returned(self, request):
        self._total_start_request_count += 1
I thought about counting them on the engine and passing the count as a parameter through the signal, but it felt weird.
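For comparison, a sketch of that alternative (not what the PR implements; the count parameter is hypothetical):

    # Engine-side counting, passing the running total through the signal:
    self._total_start_request_count += 1
    self.signals.send_catch_log(
        signal=signals.start_request_returned,
        request=request,
        count=self._total_start_request_count,  # hypothetical extra argument
    )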
start_request_returned
~~~~~~~~~~~~~~~~~~~~~~

.. signal:: start_request_returned
.. function:: start_request_returned(request)

    .. versionadded:: VERSION

    Sent after a :class:`~scrapy.http.Request` is returned by the
    :meth:`spider.start_requests <scrapy.Spider.start_requests>` iterator and
    processed by the
    :meth:`process_start_requests <scrapy.spidermiddlewares.SpiderMiddleware.process_start_requests>`
    method of :ref:`spider middlewares <topics-spider-middleware>`, and before
    that request reaches the :ref:`scheduler <topics-scheduler>`
    (:signal:`request_scheduled` signal).

    This signal does not support returning deferreds from its handlers.

    :param request: Returned request.
    :type request: scrapy.http.Request
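A minimal usage sketch for this signal, assuming the signal name from this PR lands and using the standard from_crawler hook (spider and handler names are made up):

    from scrapy import Spider, signals


    class ExampleSpider(Spider):
        name = "example"

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super().from_crawler(crawler, *args, **kwargs)
            crawler.signals.connect(
                spider.on_start_request_returned,
                signal=signals.start_request_returned,
            )
            return spider

        def on_start_request_returned(self, request):
            # Handlers must not return a Deferred, per the note above.
            self.logger.info("Start request returned: %s", request.url)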
start_requests_exhausted
~~~~~~~~~~~~~~~~~~~~~~~~
I went with these names to be more in line with current signal naming, although I am not 100% sure about them. I used “returned” because that’s the wording used in the iterator docs, and “exhausted” because it works both for the iterator stopping and for an exception being raised.
        fingerprint = self._fingerprinter.fingerprint(request)
        self._pending_start_request_fingerprints.add(fingerprint)
This is the best I could come up with to avoid changing the request meta. It gets rather verbose in the 2 process_request methods, and weird having to remove and re-add fingerprints in the first one, but it works.
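A rough sketch of that bookkeeping, assuming a middleware that takes the configured fingerprinter from the crawler (the class and method names are hypothetical):

    class StartRequestTracker:
        # Hypothetical fragment illustrating the fingerprint bookkeeping
        # described above, so request.meta stays untouched.

        def __init__(self, fingerprinter):
            self._fingerprinter = fingerprinter
            self._pending_start_request_fingerprints = set()

        @classmethod
        def from_crawler(cls, crawler):
            # crawler.request_fingerprinter is the fingerprinter configured
            # through REQUEST_FINGERPRINTER_CLASS (Scrapy 2.7+).
            return cls(crawler.request_fingerprinter)

        def track(self, request):
            fingerprint = self._fingerprinter.fingerprint(request)
            self._pending_start_request_fingerprints.add(fingerprint)

        def untrack(self, request):
            # In process_request the fingerprint may need to be removed and
            # re-added if the request object is replaced, as noted above.
            fingerprint = self._fingerprinter.fingerprint(request)
            self._pending_start_request_fingerprints.discard(fingerprint)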