Update dont_filter param within request-response.rst #6401

Open · wants to merge 1 commit into base: master
23 changes: 19 additions & 4 deletions docs/topics/request-response.rst
@@ -144,10 +144,25 @@ Request objects
Negative values are allowed in order to indicate relatively low-priority.
:type priority: int

- :param dont_filter: indicates that this request should not be filtered by
-   the scheduler. This is used when you want to perform an identical
-   request multiple times, to ignore the duplicates filter. Use it with
-   care, or you will get into crawling loops. Default to ``False``.
+ :param dont_filter: indicates that this request should not be dropped by any
+   middleware or the scheduler. This parameter is crucial for scenarios where you
+   wish to ensure that a request is processed even if it has been seen before, bypassing
+   both the scheduler's duplicate filtering mechanism and any built-in or third-party
+   middleware filters designed to prevent repeated processing. It is particularly useful in
+   complex scraping projects where certain requests need to be retried under specific conditions,
+   regardless of whether they have been previously processed. However, caution should be
+   exercised when using this option, as improper usage can lead to infinite crawling loops.
+   The default value is ``False``, meaning requests are subject to filtering unless explicitly instructed otherwise.
+
+   Built-in Middlewares that take 'dont_filter' into account:
+
+   - OffSiteMiddleware: Filters out requests for URLs outside the domains covered by the spider.
+     If the request has the `dont_filter` attribute set, the offsite middleware will allow the
+     request even if its domain is not listed in allowed domains
+
+   - DepthMiddleware: Tracks the depth of each request inside the site being scraped. It sets
+     `request.meta['depth'] = 0` whenever there is no value previously set and increments it by 1 otherwise.
+     The `dont_filter` attribute can influence how requests are prioritized based on their depth
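
For context (an editorial sketch, not part of the PR): how ``dont_filter``
is used in practice against the scheduler's duplicate filter. The spider
name and URLs below are illustrative placeholders, not taken from the PR::

    import scrapy

    class RetryExampleSpider(scrapy.Spider):
        """Hypothetical spider that deliberately re-fetches a seen URL."""

        name = "retry_example"
        start_urls = ["https://example.com/status"]

        def parse(self, response):
            yield {"first_status": response.status}
            # Without dont_filter=True, this request would be dropped by
            # the scheduler's duplicate filter, since the URL has already
            # been crawled.
            yield scrapy.Request(
                response.url,
                callback=self.parse_again,
                dont_filter=True,
            )

        def parse_again(self, response):
            yield {"second_status": response.status}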
Comment on lines +147 to +165
@Gallaecio (Member) commented on Jun 12, 2024:

I think we can keep it much shorter without losing much information. Also, looking at the code, it does not look like DepthMiddleware respects dont_filter.

Suggested change
- :param dont_filter: indicates that this request should not be dropped by any
-   middleware or the scheduler. This parameter is crucial for scenarios where you
-   wish to ensure that a request is processed even if it has been seen before, bypassing
-   both the scheduler's duplicate filtering mechanism and any built-in or third-party
-   middleware filters designed to prevent repeated processing. It is particularly useful in
-   complex scraping projects where certain requests need to be retried under specific conditions,
-   regardless of whether they have been previously processed. However, caution should be
-   exercised when using this option, as improper usage can lead to infinite crawling loops.
-   The default value is ``False``, meaning requests are subject to filtering unless explicitly instructed otherwise.
-   Built-in Middlewares that take 'dont_filter' into account:
-   - OffSiteMiddleware: Filters out requests for URLs outside the domains covered by the spider.
-     If the request has the `dont_filter` attribute set, the offsite middleware will allow the
-     request even if its domain is not listed in allowed domains
-   - DepthMiddleware: Tracks the depth of each request inside the site being scraped. It sets
-     `request.meta['depth'] = 0` whenever there is no value previously set and increments it by 1 otherwise.
-     The `dont_filter` attribute can influence how requests are prioritized based on their depth
+ :param dont_filter: indicates whether :ref:`components <topics-components>`
+   may drop this request (``False``, default) or not (``True``).
+   :ref:`Built-in schedulers <topics-scheduler>` (which by default drop
+   duplicate requests) and
+   :class:`~scrapy.downloadermiddlewares.offsite.OffSiteMiddleware`
+   respect this parameter. Some third-party components may also respect
+   it.
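
For context (again editorial, not part of the suggestion): a sketch of the
``OffSiteMiddleware`` interaction the suggested text refers to. Domains and
URLs are placeholders::

    import scrapy

    class OffsiteExampleSpider(scrapy.Spider):
        """Hypothetical spider combining allowed_domains with dont_filter."""

        name = "offsite_example"
        allowed_domains = ["example.com"]
        start_urls = ["https://example.com/"]

        def parse(self, response):
            # Dropped by OffSiteMiddleware: host not in allowed_domains.
            yield scrapy.Request(
                "https://other.example.org/page",
                callback=self.parse_other,
            )
            # Let through: dont_filter=True also tells OffSiteMiddleware
            # not to drop the request, as the suggested text notes.
            yield scrapy.Request(
                "https://other.example.org/page",
                callback=self.parse_other,
                dont_filter=True,
            )

        def parse_other(self, response):
            yield {"url": response.url, "status": response.status}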

:type dont_filter: bool

:param errback: a function that will be called if any exception was