
Update dont_filter param within request-response.rst #6401

Open

wants to merge 1 commit into base: master
Conversation


@bmazaoreg bmazaoreg commented Jun 12, 2024

This PR updates the `dont_filter` param documentation to improve its accuracy, as suggested in #6398. It adds a description of what the param does, as well as a list of built-in middlewares that take it into account.

Resolves #6398
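
For reference, a minimal sketch (hypothetical spider; the name and URL are placeholders, not part of the patch) of the scheduler-level duplicate filtering the updated text describes:

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    # Hypothetical spider used only to illustrate dont_filter.
    name = "example"
    start_urls = ["https://example.com/"]

    def parse(self, response):
        # The scheduler's duplicate filter would normally drop a second
        # request for an already-seen URL; dont_filter=True bypasses it.
        yield scrapy.Request(
            response.url, callback=self.parse_again, dont_filter=True
        )

    def parse_again(self, response):
        self.logger.info("Re-fetched %s", response.url)
```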

Increased the accuracy of the dont_filter param docs description
Comment on lines +147 to +165
:param dont_filter: indicates that this request should not be dropped by any
middleware or the scheduler. This parameter is crucial for scenarios where you
wish to ensure that a request is processed even if it has been seen before,
bypassing both the scheduler's duplicate filtering mechanism and any built-in
or third-party middleware filters designed to prevent repeated processing. It
is particularly useful in complex scraping projects where certain requests
need to be retried under specific conditions, regardless of whether they have
been previously processed. However, caution should be exercised when using
this option, as improper usage can lead to infinite crawling loops. The
default value is ``False``, meaning requests are subject to filtering unless
explicitly instructed otherwise.

Built-in middlewares that take ``dont_filter`` into account:

- OffSiteMiddleware: filters out requests for URLs outside the domains covered
  by the spider. If the request has the ``dont_filter`` attribute set, the
  offsite middleware allows the request even if its domain is not listed in
  the spider's ``allowed_domains``.

- DepthMiddleware: tracks the depth of each request inside the site being
  scraped. It sets ``request.meta['depth'] = 0`` whenever there is no value
  previously set and increments it by 1 otherwise. The ``dont_filter``
  attribute can influence how requests are prioritized based on their depth.
@Gallaecio Gallaecio (Member) commented Jun 12, 2024

I think we can keep it much shorter without losing much information. Also, looking at the code, it does not look like `DepthMiddleware` respects `dont_filter`.

Suggested change
:param dont_filter: indicates whether :ref:`components <topics-components>`
may drop this request (``False``, default) or not (``True``).
:ref:`Built-in schedulers <topics-scheduler>` (which by default drop
duplicate requests) and
:class:`~scrapy.downloadermiddlewares.offsite.OffSiteMiddleware`
respect this parameter. Some third-party components may also respect
it.
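
For illustration, a hypothetical spider (domain names are placeholders) showing the OffSiteMiddleware behavior that both versions of the text describe:

```python
import scrapy


class OffsiteExampleSpider(scrapy.Spider):
    # Hypothetical spider; domains and URLs are placeholders.
    name = "offsite_example"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/"]

    def parse(self, response):
        # Dropped by the offsite middleware: the domain is not in
        # allowed_domains and dont_filter is False (the default).
        yield scrapy.Request("https://other.example.org/page")
        # Allowed through: dont_filter=True tells components such as the
        # offsite middleware and the scheduler's dupefilter not to drop it.
        yield scrapy.Request(
            "https://other.example.org/page", dont_filter=True
        )
```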

Linked issue: Improve the docs of the dont_filter parameter of Request (#6398)