
Remove UrlLengthMiddleware from default enabled middlewares #5135


Closed
rennerocha opened this issue May 6, 2021 · 12 comments · Fixed by #5250

Comments

@rennerocha
Contributor

rennerocha commented May 6, 2021

According to RFC 2616, section 3.2.1:

   The HTTP protocol does not place any a priori limit on the length of a URI.
   Servers MUST be able to handle the URI of any resource they serve, and
   SHOULD be able to handle URIs of unbounded length if they provide 
   GET-based forms that could generate such URIs. A server SHOULD 
   return 414 (Request-URI Too Long) status if a URI is longer than the server
   can handle (see section 10.4.15).

We have scrapy.spidermiddlewares.urllength.UrlLengthMiddleware enabled by default, with a limit defined by the URLLENGTH_LIMIT setting (which can be modified in project settings) set to 2083. As mentioned here, the reason for this number is a limit of Microsoft Internet Explorer, which could not handle URIs longer than that.

This can cause problems for spiders, which will skip requests with URIs longer than that. Certainly we can change URLLENGTH_LIMIT on these spiders, but sometimes it is not easy to determine the right value, and we end up setting an arbitrarily high number just to make the middleware happy. This is what I am doing in a real-world project, but the solution doesn't look good.

I know that we can either disable the middleware or change the length limit, but I think it is smoother for the user not to have to worry about this artificial limit we have in Scrapy. We are not using Microsoft Internet Explorer; we don't need this limit.

Some alternatives that I considered:

  • Remove UrlLengthMiddleware from the default enabled middlewares, so we don't need to worry about the limit unless we really have to (I don't know the exact use case that required this limit, so keeping the middleware available may make sense);
  • Change the default value to a more reasonable one (though it is difficult to find a reasonable value);
  • Allow URLLENGTH_LIMIT = -1 and, in that case, ignore the limit. This seems an easier change than modifying the SPIDER_MIDDLEWARES setting.
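For context, the two existing workarounds mentioned above can be sketched as settings.py fragments. This is a minimal sketch; the limit value of 10_000 is an arbitrary illustration, not a recommended number:

```python
# Option A: disable the middleware entirely by mapping it to None
# in the SPIDER_MIDDLEWARES setting.
SPIDER_MIDDLEWARES = {
    "scrapy.spidermiddlewares.urllength.UrlLengthMiddleware": None,
}

# Option B: raise the limit to an arbitrarily high value
# (the "make the middleware happy" workaround described above).
URLLENGTH_LIMIT = 10_000
```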
@Gallaecio
Member

Gallaecio commented May 6, 2021

I would go for URLLENGTH_LIMIT = -1 (or None or 0?) and making that the default value, unless we decide to remove the middleware altogether.

And I think we should at least consider removing the middleware altogether unless we can come up with scenarios where this middleware can be useful, and if we do we should mention those in the documentation of the URLLENGTH_LIMIT setting.

@kmike
Member

kmike commented May 9, 2021

MSIE is not the only reason for having a URL length limit. Sometimes, when you're doing broad crawls, you can have a website returning links of ever-increasing length, which usually indicates a loop (and sometimes incorrect link-extraction code); the URL length limit acts as a stopping condition in this case. It also puts some limit on the request size. I'm not sure, but maybe it was also useful for data URIs (before we had a downloader handler for them), to prevent queues from exploding.

I'd still consider having some URL length limit a good practice for broad crawls.
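As a toy illustration of the failure mode described above (assuming a hypothetical site whose pages link to ever-longer variants of their own URL, e.g. due to a relative-link bug): without a length cap the crawl never terminates, while a cap bounds the number of hops.

```python
URLLENGTH_LIMIT = 100  # hypothetical cap, for illustration only

def next_url(url):
    # Each "page" links to itself with an extra path segment appended,
    # modelling a site stuck in a link-generation loop.
    return url + "/page"

url, hops = "http://example.com", 0
while len(url) <= URLLENGTH_LIMIT:
    url = next_url(url)
    hops += 1

# The cap stops the crawl after a bounded number of hops,
# even though the site itself would loop forever.
```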

@Gallaecio
Member

you can have a website returning links of ever-increasing length, which usually indicates a loop

Doesn’t that happen through redirects? (i.e. handled by REDIRECT_MAX_TIMES) Or are we talking about a website containing ever-increasing links in the HTML of their responses?

@kmike
Member

kmike commented May 10, 2021

Yeah, it is about ever-increasing links in HTML responses, or links which could be incorrectly built by the client code.

@Gallaecio
Member

That could probably be handled by DEPTH_LIMIT, but since it is disabled by default, I guess it makes sense to keep URLLENGTH_LIMIT set by default.

Shall we simply allow setting URLLENGTH_LIMIT to a value that effectively disables the middleware? Any preference? (-1, 0, None).
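The DEPTH_LIMIT alternative mentioned above could be sketched as a hypothetical settings.py fragment (the specific values are illustrative, not recommendations from this thread):

```python
# Cap crawl depth instead of URL length: a looping site that emits
# ever-longer links also grows in link depth, so a depth cap can act
# as the stopping condition.
DEPTH_LIMIT = 20       # stop following links after 20 hops from the seeds

# 0 disables UrlLengthMiddleware (per the discussion in this thread).
URLLENGTH_LIMIT = 0
```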

@kmike
Member

kmike commented May 10, 2021

Shall we simply allow setting URLLENGTH_LIMIT to a value that effectively disables the middleware? Any preference? (-1, 0, None).

Yeah, why not? I think we're using 0 for other settings as such value.

@Gallaecio Gallaecio removed the discuss label May 10, 2021
@Gallaecio
Member

@rennerocha We need to add documentation and tests for it, but it turns out the existing code already disables the middleware if you set the setting to 0.
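A simplified sketch of the mechanism (not the verbatim Scrapy source, but following the NotConfigured convention Scrapy components use): a falsy URLLENGTH_LIMIT makes the middleware's factory raise NotConfigured, which tells Scrapy to skip loading the component entirely. A plain dict stands in for Scrapy's Settings object here.

```python
class NotConfigured(Exception):
    """Scrapy raises this to signal that a component should be disabled."""

class UrlLengthMiddleware:
    def __init__(self, maxlength):
        self.maxlength = maxlength

    @classmethod
    def from_settings(cls, settings):
        maxlength = settings.get("URLLENGTH_LIMIT", 0)
        if not maxlength:  # 0 (or unset) -> Scrapy drops the middleware
            raise NotConfigured
        return cls(maxlength)

    def allow(self, url):
        # Requests whose URL exceeds maxlength would be filtered out.
        return len(url) <= self.maxlength
```

With URLLENGTH_LIMIT set to 0, from_settings raises NotConfigured and no filtering ever happens; any positive value restores the length check.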

@sidharthkumar2019

sidharthkumar2019 commented Aug 4, 2021

I want to contribute. Has this issue been resolved?

@Gallaecio Gallaecio added the docs label Aug 5, 2021
@Gallaecio
Member

Gallaecio commented Aug 5, 2021

@sidharthkumar2019 It hasn’t been resolved, it’s up for the taking.

The goal here is to update the documentation of the URLLENGTH_LIMIT setting to indicate how it can be disabled and to mention scenarios where it can be useful (to justify it being enabled by default).

@bit2244
Contributor

bit2244 commented Sep 30, 2021

I suppose this is still open; if so, I would like to add to the docs.

@Gallaecio
Member

@iDeepverma Feel free! Let us know if you have any questions.

@bit2244
Contributor

bit2244 commented Sep 30, 2021

@Gallaecio Should I document setting DEPTH_LIMIT to an appropriate value as the recommended practice when disabling URLLENGTH_LIMIT, to avoid loops (which can cause URLs of increasing length) as discussed in the comments above?
