Remove UrlLengthMiddleware from default enabled middlewares #5135
Comments
I would go for And I think we should at least consider removing the middleware altogether unless we can come up with scenarios where this middleware can be useful, and if we do we should mention those in the documentation of the |
MSIE is not the only reason for having a URL length limit. Sometimes, when you're doing broad crawls, a website returns links of ever-increasing length, which usually indicates a loop (and sometimes incorrect link extraction code); the URL length limit acts as a stopping condition in this case. It also puts some limit on the request size. I'm not sure, but it may also have been useful for data URIs (before we had a downloader handler for them), to prevent queues from exploding. I'd still consider having some URL length limit a good practice for broad crawls. |
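To illustrate the stopping-condition argument above, here is a minimal sketch in plain Python (no Scrapy involved). The function name `simulate_growing_links`, the `/next` suffix, and the `max_steps` guard are all hypothetical and exist only for this illustration; the point is that a length cap bounds a crawl whose links keep growing.

```python
from typing import List


def simulate_growing_links(start_url: str, max_length: int, max_steps: int = 1000) -> List[str]:
    """Follow a hypothetical site where every page links to a longer URL.

    The walk stops as soon as the next URL exceeds max_length, which is
    the role a URL length limit plays in a broad crawl.
    """
    visited: List[str] = []
    url = start_url
    for _ in range(max_steps):  # safety guard for the simulation itself
        if len(url) > max_length:
            break  # the length limit acts as the stopping condition
        visited.append(url)
        url = url + "/next"  # hypothetical ever-growing link
    return visited


if __name__ == "__main__":
    for u in simulate_growing_links("http://example.com", max_length=40):
        print(len(u), u)
```

Without the length check, the only thing ending this walk would be the artificial `max_steps` guard.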
Doesn’t that happen through redirects? (i.e. handled by |
Yeah, it is about ever-increasing links in HTML responses, or links which could be incorrectly built by the client code. |
That could probably be handled by Shall we simply allow to set |
Yeah, why not? I think we already use 0 as that kind of value for other settings. |
@rennerocha We need to add documentation and tests for it, but know that it turns out the existing code already disables the middleware if you set the setting to |
I want to contribute. Has this issue been resolved? |
@sidharthkumar2019 It hasn’t been resolved, it’s up for the taking. The goal here is to update the documentation of the |
I suppose this is still open; if so, I would like to add to the docs. |
@iDeepverma Feel free! Let us know if you have any question. |
@Gallaecio Should I Add using |
According to RFC 2396, section 3.2.1:

We have `scrapy.spidermiddlewares.urllength.UrlLengthMiddleware` enabled by default, with a default limit defined by the `URLLENGTH_LIMIT` setting (which can be modified in the project settings) and set to `2083`. As mentioned here, the reason for this number is related to the limits of Microsoft Internet Explorer, which cannot handle URIs longer than that.

This can cause problems for spiders, which will skip requests for URIs longer than that. Certainly we can change `URLLENGTH_LIMIT` on these spiders, but sometimes it is not easy to pick the right value, and we end up setting a higher number just to make the middleware happy. This is what I am doing in a real-world project, but the solution doesn't look good.

I know that we can either disable the middleware or change the length limit, but I think it is smoother for the user not to have to worry about this artificial limit we have in Scrapy. We are not using Microsoft Internet Explorer, so we don't need this limit.
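For reference, the two workarounds mentioned above (raising the limit or disabling the middleware) might look like this in a project's `settings.py`. This is a sketch: the value `10000` is an arbitrary illustration, not a recommendation, and in a real project you would normally pick one of the two options rather than both.

```python
# settings.py (project settings)

# Option 1: raise the limit so that longer URLs are not skipped.
# 10000 is an arbitrary value chosen for this example.
URLLENGTH_LIMIT = 10000

# Option 2: disable the middleware entirely by mapping it to None.
SPIDER_MIDDLEWARES = {
    "scrapy.spidermiddlewares.urllength.UrlLengthMiddleware": None,
}
```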
Some alternatives that I considered:

- Remove `UrlLengthMiddleware` from the default enabled middlewares, so we don't need to worry about that limit unless we really need to (I don't know the exact use case that required this limit, so keeping the middleware available may make sense);
- Allow `URLLENGTH_LIMIT = -1` and, in that case, ignore the limit. This seems an easier change in the settings than modifying the `SPIDER_MIDDLEWARES` setting.
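The second alternative can be sketched in plain Python as follows. This is not Scrapy's implementation: `filter_by_limit` is a hypothetical helper, and treating every non-positive value as "no limit" is an assumption made here so that the sketch covers both the `-1` proposed in this issue and the `0` discussed in the comments.

```python
from typing import Iterable, Iterator


def filter_by_limit(urls: Iterable[str], limit: int) -> Iterator[str]:
    """Yield only the URLs whose length is within the limit.

    Assumption for this sketch: a non-positive limit (e.g. -1 or 0)
    means "no limit", so every URL passes through unchanged.
    """
    if limit <= 0:
        yield from urls  # limit disabled: nothing is filtered
        return
    for url in urls:
        if len(url) <= limit:
            yield url  # short enough, keep it
        # longer URLs are silently dropped in this sketch


if __name__ == "__main__":
    sample = ["http://example.com/a", "http://example.com/" + "x" * 3000]
    print(len(list(filter_by_limit(sample, 2083))))
    print(len(list(filter_by_limit(sample, -1))))
```

A sentinel value keeps the change to a single setting, which is the convenience the issue argues for over editing `SPIDER_MIDDLEWARES`.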