Attribute `handle_httpstatus_list` not working for codes 301 and 302 #1334

Closed
iwxfer opened this issue Jul 3, 2015 · 11 comments
Comments

@iwxfer iwxfer commented Jul 3, 2015

When I set handle_httpstatus_list = [301, 302], the spider doesn't call parse. However, parse is called for other codes like 404.

Contributor

@jdemaeyer jdemaeyer commented Jul 3, 2015

The 3xx HTTP code range is for redirects, and those are handled by the Redirect Downloader Middleware. If you don't want that middleware to automatically follow redirects, but instead handle them in your spider, you have two options:

  1. Completely disable the RedirectMiddleware by setting REDIRECT_ENABLED = False in your settings,
  2. Or, more versatile, decide on a per-request basis whether the redirect middleware should follow redirects automatically. It looks for a dont_redirect key set to True in the Request.meta dictionary. So if you instantiate a request like Request("http://some.url", meta={'dont_redirect': True}), the redirect middleware is disabled for that request only.
Contributor

@jdemaeyer jdemaeyer commented Jul 3, 2015

Maybe the RedirectMiddleware should honour handle_httpstatus_list as well?

Author

@iwxfer iwxfer commented Jul 3, 2015

I think RedirectMiddleware should honour handle_httpstatus_list as well.

I tried setting both REDIRECT_ENABLED and dont_redirect to False. Both worked, but the response returned had a 200 status code.

Contributor

@jdemaeyer jdemaeyer commented Jul 3, 2015

You need to set dont_redirect to True (see the 'dont'? ;)) if you want to use that way. What version of Scrapy are you using and how do you update the settings? I ask b/c I just accidentally tried using Spider.custom_settings with a Scrapy version that didn't support it yet.

Here's a Spider that has REDIRECT_ENABLED = False and does what you want with Scrapy 1.0.1:

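```python
# ~/playground/spidy.py

import scrapy


class Spidy(scrapy.Spider):
    name = "Spidy the Spider"
    start_urls = ["https://jigsaw.w3.org/HTTP/300/301.html"]
    # Disable the redirect middleware for this spider only.
    custom_settings = {'REDIRECT_ENABLED': False}
    # Let 301 responses through to parse() instead of discarding them.
    handle_httpstatus_list = [301]

    def parse(self, response):
        # %-formatting prints the same text on Python 2 and Python 3.
        print("Got this: %s" % response.status)
```

```
jakob@MosEisley ~/playground % scrapy runspider spidy.py 2> /dev/null
Got this: 301
```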
Contributor

@barraponto barraponto commented Jul 7, 2015

In the code above, dont_redirect is set to True from within scrapy.Spider.make_requests_from_url.

Contributor

@jdemaeyer jdemaeyer commented Jul 8, 2015

Hm, don't you mean dont_filter (which is irrelevant for RedirectMiddleware)? Unless the Request class has some default meta dict, I don't think dont_redirect is touched anywhere in stock Scrapy.

Contributor

@barraponto barraponto commented Jul 8, 2015

Yeah, sorry. So in your example above, dont_redirect is not set to True.
UPDATE: oh, it's either REDIRECT_ENABLED set to False or dont_redirect set to True. Not both.

Author

@iwxfer iwxfer commented Jul 8, 2015

You were right, jdemaeyer: everything works as expected. The problem came from some of the pages I was crawling. If nobody else is interested in having RedirectMiddleware follow the handle_httpstatus_list rule, we can close this issue.

Member

@kmike kmike commented Jul 8, 2015

I think it's a good idea for RedirectMiddleware to respect handle_httpstatus_list.
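
To make the proposal concrete, the decision discussed in this thread can be modelled as a small plain-Python function. This is a simplified sketch, not Scrapy's actual code and not the change that was eventually merged: the function name, its parameters, and the set of redirect status codes are all assumptions made for illustration.

```python
# Status codes treated as redirects in this sketch (an assumption).
REDIRECT_STATUSES = frozenset([301, 302, 303, 307, 308])


def should_follow_redirect(status, meta=None,
                           handle_httpstatus_list=(),
                           redirect_enabled=True):
    """Return True if a redirect response should be followed automatically.

    Hypothetical model: `meta` stands in for Request.meta,
    `handle_httpstatus_list` for the spider attribute, and
    `redirect_enabled` for the REDIRECT_ENABLED setting.
    """
    meta = meta or {}
    if not redirect_enabled:
        return False  # middleware switched off globally
    if meta.get("dont_redirect", False):
        return False  # this request opted out
    if status in handle_httpstatus_list:
        return False  # proposed: the spider wants to handle this status
    return status in REDIRECT_STATUSES


# The spider asked to handle 301 itself, so it is not followed;
# 302 is not in the list and is still followed.
print(should_follow_redirect(301, handle_httpstatus_list=[301]))
print(should_follow_redirect(302, handle_httpstatus_list=[301]))
```

Under this model, any one of the three opt-outs is enough on its own, which matches the observation above that either REDIRECT_ENABLED or dont_redirect suffices.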

Contributor

@nramirezuy nramirezuy commented Jul 17, 2015

+1

Member

@dangra dangra commented Aug 3, 2015

fixed by #1364

@dangra dangra closed this Aug 3, 2015