
Update dont_filter param within request-response.rst #6401

Open

wants to merge 1 commit into base: master
Conversation


@bmazaoreg bmazaoreg commented Jun 12, 2024

This PR updates the `dont_filter` param documentation to improve its accuracy, as suggested in #6398. It adds a description of what the param does, as well as a list of built-in middlewares that take it into account.

Resolves #6398
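
For reference, a minimal sketch (hypothetical spider; the name and URL are placeholders, not part of the patch) of the scheduler-level duplicate filtering the updated text describes:

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    # Hypothetical spider used only to illustrate dont_filter.
    name = "example"
    start_urls = ["https://example.com/"]

    def parse(self, response):
        # The scheduler's duplicate filter would normally drop a second
        # request for an already-seen URL; dont_filter=True bypasses it.
        yield scrapy.Request(
            response.url, callback=self.parse_again, dont_filter=True
        )

    def parse_again(self, response):
        self.logger.info("Re-fetched %s", response.url)
```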

Increased the accuracy of the dont_filter param docs description
Comment on lines +147 to +165
:param dont_filter: indicates that this request should not be dropped by any
middleware or the scheduler. This parameter is crucial for scenarios where you
wish to ensure that a request is processed even if it has been seen before,
bypassing both the scheduler's duplicate filtering mechanism and any built-in
or third-party middleware filters designed to prevent repeated processing. It
is particularly useful in complex scraping projects where certain requests
need to be retried under specific conditions, regardless of whether they have
been previously processed. However, caution should be exercised when using
this option, as improper usage can lead to infinite crawling loops. The
default value is ``False``, meaning requests are subject to filtering unless
explicitly instructed otherwise.

Built-in middlewares that take ``dont_filter`` into account:

- OffSiteMiddleware: filters out requests for URLs outside the domains covered
  by the spider. If the request has the ``dont_filter`` attribute set, the
  offsite middleware allows the request even if its domain is not listed in
  the spider's ``allowed_domains``.

- DepthMiddleware: tracks the depth of each request inside the site being
  scraped. It sets ``request.meta['depth'] = 0`` whenever there is no value
  previously set and increments it by 1 otherwise. The ``dont_filter``
  attribute can influence how requests are prioritized based on their depth.
@Gallaecio Gallaecio (Member) commented Jun 12, 2024

I think we can keep it much shorter without losing much information. Also, looking at the code, it does not look like `DepthMiddleware` respects `dont_filter`.

Suggested change
:param dont_filter: indicates whether :ref:`components <topics-components>`
may drop this request (``False``, default) or not (``True``).
:ref:`Built-in schedulers <topics-scheduler>` (which by default drop
duplicate requests) and
:class:`~scrapy.downloadermiddlewares.offsite.OffSiteMiddleware`
respect this parameter. Some third-party components may also respect
it.
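
For illustration, a hypothetical spider (domain names are placeholders) showing the OffSiteMiddleware behavior that both versions of the text describe:

```python
import scrapy


class OffsiteExampleSpider(scrapy.Spider):
    # Hypothetical spider; domains and URLs are placeholders.
    name = "offsite_example"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/"]

    def parse(self, response):
        # Dropped by the offsite middleware: the domain is not in
        # allowed_domains and dont_filter is False (the default).
        yield scrapy.Request("https://other.example.org/page")
        # Allowed through: dont_filter=True tells components such as the
        # offsite middleware and the scheduler's dupefilter not to drop it.
        yield scrapy.Request(
            "https://other.example.org/page", dont_filter=True
        )
```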

Linked issue: Improve the docs of the dont_filter parameter of Request (#6398)