
Implement a close reason for robots.txt affecting all start requests #6164

Open · Gallaecio wants to merge 3 commits into master

Conversation

Gallaecio (Member)

No description provided.

@Gallaecio Gallaecio requested review from kmike and wRAR November 30, 2023 14:45

codecov bot commented Nov 30, 2023

Codecov Report

Merging #6164 (e11a6fb) into master (fa690fb) will decrease coverage by 45.48%.
Report is 2 commits behind head on master.
The diff coverage is 34.04%.

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #6164       +/-   ##
===========================================
- Coverage   88.59%   43.11%   -45.48%     
===========================================
  Files         159      159               
  Lines       11582    11624       +42     
  Branches     1885     1891        +6     
===========================================
- Hits        10261     5012     -5249     
- Misses        994     6220     +5226     
- Partials      327      392       +65     
Files                                      Coverage           Δ
scrapy/crawler.py                          54.10% <ø>         -33.82% ⬇️
scrapy/signals.py                          100.00% <100.00%>  ø
scrapy/core/engine.py                      74.62% <66.66%>    -11.04% ⬇️
scrapy/downloadermiddlewares/robotstxt.py  26.31% <28.57%>    -71.61% ⬇️

... and 137 files with indirect coverage changes

@Gallaecio Gallaecio marked this pull request as ready for review November 30, 2023 14:48
Comment on lines +193 to +201
        self.signals.send_catch_log(signal=signals.start_requests_exhausted)
    except Exception:
        self.slot.start_requests = None
        logger.error(
            "Error while obtaining start requests",
            exc_info=True,
            extra={"spider": self.spider},
        )
        self.signals.send_catch_log(signal=signals.start_requests_exhausted)
@Gallaecio (Member, Author) commented Dec 14, 2023

I wonder if we should implement separate methods for these, but it felt a bit out of scope.
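For reference, a minimal sketch of what factoring the duplicated signal call into a separate method could look like; the method name is hypothetical and not part of this PR:

    def _send_start_requests_exhausted(self) -> None:
        # Hypothetical helper: both the normal-exhaustion path and the
        # error path above would call this instead of duplicating the line.
        self.signals.send_catch_log(signal=signals.start_requests_exhausted)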

)

def _start_request_returned(self, request):
    self._total_start_request_count += 1
@Gallaecio (Member, Author)

I thought about counting them on the engine and passing the count as a parameter through the signal, but it felt weird.
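For comparison, a rough sketch of that rejected alternative (hypothetical, not what this PR implements): the engine would own the counter and forward it as a signal argument, which send_catch_log passes through to handlers.

    # Hypothetical engine-side alternative, not what the PR does: the engine
    # keeps the counter and includes it in the signal payload.
    self._total_start_request_count += 1
    self.signals.send_catch_log(
        signal=signals.start_request_returned,
        request=request,
        total=self._total_start_request_count,  # hypothetical parameter
    )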

Comment on lines +469 to +490
start_request_returned
~~~~~~~~~~~~~~~~~~~~~~

.. signal:: start_request_returned
.. function:: start_request_returned(request)

.. versionadded:: VERSION

Sent after a :class:`~scrapy.http.Request` is returned by the
:meth:`spider.start_requests <scrapy.Spider.start_requests>` iterator and
processed by the
:meth:`process_start_requests <scrapy.spidermiddlewares.SpiderMiddleware.process_start_requests>`
method of :ref:`spider middlewares <topics-spider-middleware>`, and before
that request reaches the :ref:`scheduler <topics-scheduler>`
(:signal:`request_scheduled` signal).

This signal does not support returning deferreds from its handlers.

:param request: Returned request.
:type request: scrapy.http.Request

start_requests_exhausted
@Gallaecio (Member, Author) commented Dec 14, 2023

I went with these names to be more in line with current signal naming, although I am not 100% sure about them. I used “returned” because that’s the wording used in the iterator docs, and “exhausted” because it works both for the iterator stopping and for an exception being raised.
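Assuming the signal lands as documented above, connecting a handler would follow the usual Scrapy pattern; the spider below is a made-up illustration:

    from scrapy import Spider, signals

    class CountingSpider(Spider):
        name = "counting"
        start_urls = ["https://example.com"]  # placeholder URL

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super().from_crawler(crawler, *args, **kwargs)
            # start_request_returned is the new signal proposed in this PR.
            crawler.signals.connect(
                spider.on_start_request_returned,
                signal=signals.start_request_returned,
            )
            return spider

        def on_start_request_returned(self, request):
            # Handlers of this signal must not return deferreds (see docs above).
            self.logger.info("Start request returned: %s", request.url)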

Comment on lines +100 to +101
fingerprint = self._fingerprinter.fingerprint(request)
self._pending_start_request_fingerprints.add(fingerprint)
@Gallaecio (Member, Author)

This is the best I could come up with to avoid changing the request meta. It gets rather verbose in the two process_request methods, and having to remove and re-add fingerprints in the first one is weird, but it works.
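A minimal sketch of the fingerprint bookkeeping being described, with a hypothetical helper class standing in for the actual middleware code:

    class PendingStartRequestTracker:
        """Hypothetical helper: track start requests by fingerprint so that
        nothing needs to be stored in request.meta."""

        def __init__(self, fingerprinter):
            self._fingerprinter = fingerprinter
            self._pending = set()

        def add(self, request):
            self._pending.add(self._fingerprinter.fingerprint(request))

        def pop(self, request):
            # Returns True if *request* was a pending start request, removing
            # its fingerprint; a caller that must keep tracking the request has
            # to re-add it, which is the awkwardness mentioned above.
            fingerprint = self._fingerprinter.fingerprint(request)
            if fingerprint in self._pending:
                self._pending.remove(fingerprint)
                return True
            return False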

Labels: none yet
Projects: none yet
Development: successfully merging this pull request may close these issues (none yet)
3 participants