
Implement a close reason for robots.txt affecting all start requests #6164

Open · Gallaecio wants to merge 3 commits into master

Conversation

Gallaecio (Member)

No description provided.

@Gallaecio Gallaecio requested review from kmike and wRAR November 30, 2023 14:45

codecov bot commented Nov 30, 2023

Codecov Report

Merging #6164 (e11a6fb) into master (fa690fb) will decrease coverage by 45.48%.
Report is 2 commits behind head on master.
The diff coverage is 34.04%.

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #6164       +/-   ##
===========================================
- Coverage   88.59%   43.11%   -45.48%     
===========================================
  Files         159      159               
  Lines       11582    11624       +42     
  Branches     1885     1891        +6     
===========================================
- Hits        10261     5012     -5249     
- Misses        994     6220     +5226     
- Partials      327      392       +65     
Files                                      Coverage           Δ
scrapy/crawler.py                          54.10% <ø>         -33.82% ⬇️
scrapy/signals.py                          100.00% <100.00%>  ø
scrapy/core/engine.py                      74.62% <66.66%>    -11.04% ⬇️
scrapy/downloadermiddlewares/robotstxt.py  26.31% <28.57%>    -71.61% ⬇️

... and 137 files with indirect coverage changes

@Gallaecio Gallaecio marked this pull request as ready for review November 30, 2023 14:48
Comment on lines +193 to +201
        self.signals.send_catch_log(signal=signals.start_requests_exhausted)
    except Exception:
        self.slot.start_requests = None
        logger.error(
            "Error while obtaining start requests",
            exc_info=True,
            extra={"spider": self.spider},
        )
        self.signals.send_catch_log(signal=signals.start_requests_exhausted)
@Gallaecio (Member, Author) commented Dec 14, 2023

I wonder if we should implement separate methods for these, but it felt a bit out of scope.
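For reference, a minimal sketch of what factoring the duplicated signal call into a separate method could look like; the method name is hypothetical and not part of this PR:

    def _send_start_requests_exhausted(self) -> None:
        # Hypothetical helper: both the normal-exhaustion path and the
        # error path above would call this instead of duplicating the line.
        self.signals.send_catch_log(signal=signals.start_requests_exhausted)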

)

def _start_request_returned(self, request):
    self._total_start_request_count += 1
@Gallaecio (Member, Author)

I thought about counting them on the engine and passing the count as a parameter through the signal, but it felt weird.
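For comparison, a rough sketch of that rejected alternative (hypothetical, not what this PR implements): the engine would own the counter and forward it as a signal argument, which send_catch_log passes through to handlers.

    # Hypothetical engine-side alternative, not what the PR does: the engine
    # keeps the counter and includes it in the signal payload.
    self._total_start_request_count += 1
    self.signals.send_catch_log(
        signal=signals.start_request_returned,
        request=request,
        total=self._total_start_request_count,  # hypothetical parameter
    )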

Comment on lines +469 to +490
start_request_returned
~~~~~~~~~~~~~~~~~~~~~~

.. signal:: start_request_returned
.. function:: start_request_returned(request)

.. versionadded:: VERSION

Sent after a :class:`~scrapy.http.Request` is returned by the
:meth:`spider.start_requests <scrapy.Spider.start_requests>` iterator and
processed by the
:meth:`process_start_requests <scrapy.spidermiddlewares.SpiderMiddleware.process_start_requests>`
method of :ref:`spider middlewares <topics-spider-middleware>`, and before
that request reaches the :ref:`scheduler <topics-scheduler>`
(:signal:`request_scheduled` signal).

This signal does not support returning deferreds from its handlers.

:param request: Returned request.
:type request: scrapy.http.Request

start_requests_exhausted
@Gallaecio (Member, Author) commented Dec 14, 2023

I went with these names to be more in line with current signal naming, although I am not 100% sure about them. I used “returned” because that’s the wording used in the iterator docs, and “exhausted” because it works both for the iterator stopping and for an exception being raised.
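Assuming the signal lands as documented above, connecting a handler would follow the usual Scrapy pattern; the spider below is a made-up illustration:

    from scrapy import Spider, signals

    class CountingSpider(Spider):
        name = "counting"
        start_urls = ["https://example.com"]  # placeholder URL

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super().from_crawler(crawler, *args, **kwargs)
            # start_request_returned is the new signal proposed in this PR.
            crawler.signals.connect(
                spider.on_start_request_returned,
                signal=signals.start_request_returned,
            )
            return spider

        def on_start_request_returned(self, request):
            # Handlers of this signal must not return deferreds (see docs above).
            self.logger.info("Start request returned: %s", request.url)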

Comment on lines +100 to +101
fingerprint = self._fingerprinter.fingerprint(request)
self._pending_start_request_fingerprints.add(fingerprint)
@Gallaecio (Member, Author)

This is the best I could come up with to avoid changing the request meta. It gets rather verbose in the two process_request methods, and having to remove and re-add fingerprints in the first one is weird, but it works.
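A minimal sketch of the fingerprint bookkeeping being described, with a hypothetical helper class standing in for the actual middleware code:

    class PendingStartRequestTracker:
        """Hypothetical helper: track start requests by fingerprint so that
        nothing needs to be stored in request.meta."""

        def __init__(self, fingerprinter):
            self._fingerprinter = fingerprinter
            self._pending = set()

        def add(self, request):
            self._pending.add(self._fingerprinter.fingerprint(request))

        def pop(self, request):
            # Returns True if *request* was a pending start request, removing
            # its fingerprint; a caller that must keep tracking the request has
            # to re-add it, which is the awkwardness mentioned above.
            fingerprint = self._fingerprinter.fingerprint(request)
            if fingerprint in self._pending:
                self._pending.remove(fingerprint)
                return True
            return False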

Labels: none yet
Projects: none yet
Development: successfully merging this pull request may close these issues (none yet)
3 participants