Attribute handle_httpstatus_list not working for codes 301 and 302 #1334

Closed
zxcfer opened this issue Jul 3, 2015 · 11 comments

Comments

@zxcfer

zxcfer commented Jul 3, 2015

When I set handle_httpstatus_list = [301, 302], the spider doesn't execute parse. However, it does execute it for other codes such as 404.

@jdemaeyer
Contributor

The 3xx HTTP code range is for redirects, and those are handled by the Redirect Downloader Middleware. If you don't want that middleware to automatically follow redirects, but instead handle them in your spider, you have two options:

  1. Completely disable the RedirectMiddleware by setting REDIRECT_ENABLED = False in your settings,
  2. Or, more versatile, decide whether you want the redirect middleware to automatically follow on a per-request basis. It looks for a dont_redirect key set to True in the Request.meta dictionary. So if you instantiate a request like Request("http://some.url", meta={'dont_redirect': True}), the redirect middleware would be disabled for this request only.

@jdemaeyer
Contributor

Maybe the RedirectMiddleware should honour handle_httpstatus_list as well?

@zxcfer
Author

zxcfer commented Jul 3, 2015

I think RedirectMiddleware should honour handle_httpstatus_list as well.

I tried setting both REDIRECT_ENABLED and dont_redirect to False. Both ran, but the response still came back with a 200 status code.

@jdemaeyer
Contributor

You need to set dont_redirect to True (see the 'dont'? ;)) if you want to go that way. Which version of Scrapy are you using, and how do you update the settings? I ask because I just accidentally tried using Spider.custom_settings with a Scrapy version that didn't support it yet.

Here's a Spider that has REDIRECT_ENABLED = False and does what you want with Scrapy 1.0.1:

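# ~/playground/spidy.py

import scrapy


class Spidy(scrapy.Spider):
    name = "Spidy the Spider"
    start_urls = ["https://jigsaw.w3.org/HTTP/300/301.html"]
    # Turn off RedirectMiddleware for this spider only
    custom_settings = {'REDIRECT_ENABLED': False}
    # Let 301 responses through to parse() instead of filtering them as errors
    handle_httpstatus_list = [301]

    def parse(self, response):
        # %-formatting prints the same line on Python 2 and Python 3
        print("Got this: %s" % response.status)

jakob@MosEisley ~/playground % scrapy runspider spidy.py 2> /dev/null
Got this: 301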

@barraponto
Contributor

In the code above, dont_redirect is set to True from within scrapy.Spider.make_requests_from_url.

@jdemaeyer
Contributor

Hm, don't you mean dont_filter (which is irrelevant for RedirectMiddleware)? Unless the Request class has some default meta dict, I don't think dont_redirect is touched anywhere in stock Scrapy.

@barraponto
Contributor

Yeah, sorry. So in your example above, dont_redirect is not set to True.
UPDATE: oh, it's either REDIRECT_ENABLED set to False or dont_redirect set to True. Not both.

@zxcfer
Author

zxcfer commented Jul 8, 2015

jdemaeyer was right: everything worked as expected, and the problem came from some of the pages I was crawling. If nobody else is interested in having RedirectMiddleware follow the handle_httpstatus_list rule, we can close this issue.

@kmike
Member

kmike commented Jul 8, 2015

I think it's a good idea for RedirectMiddleware to respect handle_httpstatus_list.
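To make the proposal concrete, here is a toy model of the decision being discussed, written in plain Python with no Scrapy imports. The function name and signature are hypothetical, and this is not the code from the eventual fix. It only illustrates the idea that a redirect status listed in handle_httpstatus_list would be passed to the spider, just as a request with dont_redirect set already is.

```python
def passes_through_to_spider(status, request_meta, spider_handled=()):
    """Return True if a redirect response should reach the spider unchanged.

    Hypothetical model of the proposed behaviour, not Scrapy's source.
    """
    # Existing behaviour: the request explicitly opted out of redirects.
    if request_meta.get("dont_redirect", False):
        return True
    # Proposed behaviour: the spider declared that it handles this status.
    return status in spider_handled


# A 301 is followed by default, but passed through once the spider lists it.
print(passes_through_to_spider(301, {}))
print(passes_through_to_spider(301, {}, spider_handled=[301, 302]))
```

Under this model the spider from the original report would receive its 301 and 302 responses without changing any settings.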

@nramirezuy
Contributor

+1

@dangra
Member

dangra commented Aug 3, 2015

fixed by #1364

@dangra dangra closed this as completed Aug 3, 2015