
MediaPipeline (and ImagesPipeline/FilesPipeline) does not handle HTTP redirections #2004

Closed
ghost opened this issue May 22, 2016 · 11 comments

@ghost ghost commented May 22, 2016

Basically, what's happening is that my spider is unable to download files because the URLs provided in file_urls actually redirect to the final download link. However, because of the following code, the redirect downloader middleware is effectively disabled, which makes the download fail.

def _check_media_to_download(self, result, request, info):
    if result is not None:
        return result
    if self.download_func:
        # this ugly code was left only to support tests. TODO: remove
        dfd = mustbe_deferred(self.download_func, request, info.spider)
        dfd.addCallbacks(
            callback=self.media_downloaded, callbackArgs=(request, info),
            errback=self.media_failed, errbackArgs=(request, info))
    else:
        request.meta['handle_httpstatus_all'] = True
        dfd = self.crawler.engine.download(request, info.spider)
        dfd.addCallbacks(
            callback=self.media_downloaded, callbackArgs=(request, info),
            errback=self.media_failed, errbackArgs=(request, info))
    return dfd

And here is the check in the redirect middleware that disables it:

if (request.meta.get('dont_redirect', False) or
        response.status in getattr(spider, 'handle_httpstatus_list', []) or
        response.status in request.meta.get('handle_httpstatus_list', []) or
        request.meta.get('handle_httpstatus_all', False)):
    return response

My question is: what is the point of setting handle_httpstatus_all when it effectively disables all of these checks?
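For anyone needing a stopgap, something along these lines might work. This is only a sketch against the 1.x internals quoted above; the subclass name and the 400-599 range are illustrative, not anything Scrapy ships:

from scrapy.pipelines.files import FilesPipeline
from scrapy.utils.defer import mustbe_deferred

class RedirectAwareFilesPipeline(FilesPipeline):
    # Hypothetical workaround sketch, not the upstream fix.

    def _check_media_to_download(self, result, request, info):
        if result is not None:
            return result
        if self.download_func:
            dfd = mustbe_deferred(self.download_func, request, info.spider)
        else:
            # Unlike handle_httpstatus_all, this leaves 3xx codes out of the list that
            # RedirectMiddleware treats as "hand the raw response back", so redirects
            # get followed; other statuses still reach media_downloaded()/media_failed().
            request.meta['handle_httpstatus_list'] = list(range(400, 600))
            dfd = self.crawler.engine.download(request, info.spider)
        dfd.addCallbacks(
            callback=self.media_downloaded, callbackArgs=(request, info),
            errback=self.media_failed, errbackArgs=(request, info))
        return dfd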


@ghost ghost commented May 22, 2016

Also, here is the error that is caused:

[scrapy] WARNING: File (code: 302): Error downloading file from <GET <url>> referred in <None>


@redapple redapple commented Jun 14, 2016

That line is quite old and I don't know the exact reason for it.
It may be there to catch all non-200 responses, but 301 and 302 would indeed be better handled by the redirect middleware, so handle_httpstatus_all may well be too broad in scope.

@redapple redapple added the bug label Jun 14, 2016

@jbothma jbothma commented Jul 8, 2016

Ditto. I overrode the function that sets it to True, but I'm nervous it'll bite me.
What does a request do when there's an error? Will it raise an exception or something, so I don't end up treating an error or an empty body as my file?


@jbothma jbothma commented Jul 8, 2016

I think a 301 should modify the corresponding element in file_urls to the new location, although that might be surprising to the user. Adding a redirect trail to the item might be excessive?

A 302 should probably just follow through to the final destination?

This should all respect the allowed domains, right? Is that automatically taken care of by the downloader?

So correct behaviour might look like:

  1. remove the override of handle_httpstatus_all
  2. verify that any HTTP errors result in not storing anything or setting the files field
  3. logging a helpful error
  4. updating the relevant original url to the last location of the first chain of 301s, e.g. scraped url > 301 a > 301 b > 302 c > 301 d > 200 e should see b placed in file_urls

I think 2 and 3 are taken care of by the current code.

I think 4 is technically correct, but it isn't easy since the redirect status codes aren't recorded in redirect_urls. Do you think it's necessary, or is it just over-complicating things? One way to record that trail is sketched below.
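For 4, one way to record the missing trail would be a tiny extra downloader middleware, sketched here. The class and meta key names are made up for illustration, and it assumes the middleware is registered with a priority number above RedirectMiddleware's default of 600, so its process_response sees each 3xx before the redirect is followed:

class RedirectStatusTrailMiddleware(object):
    # Hypothetical helper, not part of Scrapy: record the status of every redirect hop
    # so that, together with response.meta['redirect_urls'], the last location of the
    # first chain of 301s can be computed afterwards.

    def process_response(self, request, response, spider):
        if 300 <= response.status < 400 and 'Location' in response.headers:
            # RedirectMiddleware carries meta over to the follow-up request, so this
            # list accumulates across the whole chain, mirroring redirect_urls.
            request.meta.setdefault('redirect_status_trail', []).append(response.status)
        return response

# settings.py
# DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.RedirectStatusTrailMiddleware': 650}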


@Granitosaurus Granitosaurus commented Aug 22, 2016

@jbothma To answer your question:

This should all respect the allowed domains, right? Is that automatically taken care of by the downloader?

Yes. Since all scheduled requests go through the middlewares, they will pass through OffsiteMiddleware, which does the usual match against spider.allowed_domains.

Regarding the issue, I think handle_httpstatus_all should be replaced with a handle_httpstatus_list that excludes the 3xx codes.
For FilesPipeline, the media_downloaded() callback by default raises an exception for anything that is not a 200 or has an empty body, so this pipeline doesn't benefit from the setting at all, other than wrapping the error. Excluding the 3xx codes wouldn't break anything other than enabling the redirection, which might be unwanted for some reason, but I bet it's the other way around more often than not.

I think implementing a pipeline setting to manage this would be an ideal solution.
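From the user's side I would picture it roughly like this; the MEDIA_ALLOW_REDIRECTS name is only a guess at what such an opt-in setting could be called, not a confirmed Scrapy setting:

# settings.py -- sketch of the opt-in; the setting name is assumed for illustration
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = '/path/to/downloaded/files'
MEDIA_ALLOW_REDIRECTS = True   # follow 3xx for file_urls instead of failing with "(code: 302)"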


@jbothma jbothma commented Aug 22, 2016

Thanks for the response.

A pipeline setting to opt in to allowing 3xx redirect codes?



@Granitosaurus Granitosaurus commented Aug 24, 2016

I submitted a PR with this ☝️
But the question remains: should ignoring redirects remain the default behaviour, as it is right now, or should it be changed?


@jbothma jbothma commented Aug 24, 2016

Ah cool. I wanted to submit such a PR, but first wanted to clarify that the pipeline setting you referred to was about allowing redirect codes and thus following them.

I think the default behaviour should probably remain as it is, with a setting to opt in to following redirects, and your proposed solution sounds good to me.

An additional question is what the value of the referer should be, and whether the user might want the full redirect chain (considering it could be of arbitrary length), but those could come later. Most people probably don't care that much; they just want the end file (or an error if the end result isn't a 200).



@Granitosaurus Granitosaurus commented Aug 24, 2016

For handling redirect chains, I think it should be left to the RedirectMiddleware; it uses the REDIRECT_MAX_TIMES setting to cap the chain length and handles everything else by itself.

What do you mean regarding the referer, @jbothma? I think Scrapy by default populates the referer with the last request.url, right? So the concern might be that, with multiple URLs, the original file URL could get lost? Since a redirect chain is handled one redirect at a time, all of the redirects with their respective referers end up in the log, so I don't think the original referer can be lost here.

Redirect middleware logs every redirect as:

logger.debug("Redirecting (%(reason)s) to %(redirected)s from %(request)s",
             {'reason': reason, 'redirected': redirected, 'request': request},
             extra={'spider': spider})

So for debugging purposes everything can easily be traced back, if that was what your question was.
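And once the redirects are actually followed, the trail is not only in the logs: RedirectMiddleware stores the URLs it visited in response.meta['redirect_urls'], so a pipeline or callback can recover the original file URL directly. A tiny illustrative helper (the function name is made up):

def redirect_trail(response):
    # URLs requested before each redirect; empty if no redirect happened.
    hops = response.meta.get('redirect_urls', [])
    original_url = hops[0] if hops else response.url
    return original_url, hops + [response.url]   # the trail ends at the final URL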


@jbothma jbothma commented Aug 24, 2016

Oh cool! I was thinking it would be useful if the URLs in the chain were attached to the response object, but this isn't a critical need for me, and it's probably just a sidetrack.

It's probably most useful and expected that the referrer would be the original URL's referrer, but since that isn't set before there's a redirect anyway, I track the referrer manually in my item.

So it sounds like your PR would address the core issue of being able to download media behind a redirect.

Thanks!


@redapple redapple changed the title from "handle_httpstatus_all set to True?" to "MediaPipeline (and ImagesPipeline/FilesPipeline) does not handle HTTP redirections" Nov 16, 2016
@redapple redapple added this to the v1.3 milestone Nov 16, 2016

@SeanPollock SeanPollock commented Feb 23, 2017

Hi, I ran into this issue today too.

Any updates on this bug, or a reason the pull request wasn't merged?

@redapple redapple added this to the v1.5 milestone Mar 2, 2017
@redapple redapple removed this from the v1.4 milestone Mar 2, 2017
@redapple redapple self-assigned this Mar 3, 2017
@redapple redapple added this to the v1.4 milestone Mar 12, 2017
@redapple redapple removed this from the v1.5 milestone Mar 12, 2017
@kmike kmike closed this in #2616 Mar 16, 2017