Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error 302 redirection with headers location starts with 3 slash #4032

Closed
sicklife opened this issue Sep 23, 2019 · 8 comments · Fixed by #4042
Closed

Error 302 redirection with headers location starts with 3 slash #4032

sicklife opened this issue Sep 23, 2019 · 8 comments · Fixed by #4042
Labels

Comments

@sicklife
Copy link

@sicklife sicklife commented Sep 23, 2019

Description

when the 302 response return a headers's location startswith 3 slash, the scrapy redirect to a url different from what the browser do.

Steps to Reproduce

  1. scrapy shell https://www.hjenglish.com/new/p1285798/

Expected behavior:
redirect to https://fr.hujiang.com/new/p1285798/ as browser Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36 do.

Actual behavior:
redirct to https://www.hjenglish.com/fr.hujiang.com/new/p1285798

Reproduces how often:

everytime

Versions

Scrapy : 1.7.3
lxml : 4.3.2.0
libxml2 : 2.9.9
cssselect : 1.1.0
parsel : 1.5.2
w3lib : 1.20.0
Twisted : 19.7.0
Python : 3.7.3 (default, Mar 27 2019, 17:13:21) [MSC v.1915 64 bit (AMD64)]
pyOpenSSL : 19.0.0 (OpenSSL 1.1.1c 28 May 2019)
cryptography : 2.6.1
Platform : Windows-10-10.0.17134-SP0

Additional context

I check the defination of Location in rfc and end with reference resolution. But I fail to findout how to resolve the Location startswith ///. So I don't know why Chrome did so.

The behavior of scrapy is determined by redirect.py#L73, which will truncate /// to /

I'm wandering the differents betweent scarpy and browser...

@Gallaecio Gallaecio added the bug label Sep 24, 2019
@Gallaecio
Copy link
Member

@Gallaecio Gallaecio commented Sep 24, 2019

We aim to behave as close to browsers as possible, so let’s have a look at it.

@amardeep24
Copy link
Contributor

@amardeep24 amardeep24 commented Sep 26, 2019

I was able to replicate this issue, @Gallaecio may I work on this?

@Gallaecio
Copy link
Member

@Gallaecio Gallaecio commented Sep 26, 2019

@amardeep24 Please, go ahead! 🙂

@amardeep24
Copy link
Contributor

@amardeep24 amardeep24 commented Sep 27, 2019

@Gallaecio The difference between Scrapy and a browser is that, when the location header in the response is having a relative URL like ///fr.hujiang.com/new/p1285798/ (which is permitted as per the RFC 7213) the browser is making a request to the absolute URL https://fr.hujiang.com/new/p1285798/ where as Scrapy is considering it as a relative URL w.r.t to the base URL: https://www.hjenglish.com/fr.hujiang.com/new/p1285798.

This is happening because we have urljoin(request.url, location) in redirect.py, which resolves to https://www.hjenglish.com/fr.hujiang.com/new/p1285798 from the location sanitized URL /fr.hujiang.com/new/p1285798/.

How can we know whether the URL should be relative w.r.t base URL or a completely new URL? should we make two calls one for the https://www.hjenglish.com/fr.hujiang.com/new/p1285798 and one for the https://fr.hujiang.com/new/p1285798/ by adding https:// before the relative URL and scrape the one which succeed ?

@Gallaecio
Copy link
Member

@Gallaecio Gallaecio commented Sep 27, 2019

First we should find out if the URL sanitizer should be updated to handle this case differently. Otherwise, maybe we need to replace /// at the start by <protocol>:// before passing the value to the sanitizer.

@amardeep24
Copy link
Contributor

@amardeep24 amardeep24 commented Sep 27, 2019

@Gallaecio This fixes the issue:

import logging
+import re
from six.moves.urllib.parse import urljoin
+from six.moves.urllib.parse import urlparse

from w3lib.url import safe_url_string

@@ -71,6 +73,9 @@ class RedirectMiddleware(BaseRedirectMiddleware):
            return response

        location = safe_url_string(response.headers['location'])
+        if response.headers['location'].decode().startswith('/', 1):
+            request_scheme = urlparse(request.url).scheme
+            location = request_scheme + '://' + re.sub('^/*', '', location)

        redirected_url = urljoin(request.url, location)

Here I checked whether the location header contains a URL starting with multiple /// if it does then append <scheme>:// before it and make it an absolute URL else use the old logic. Should I raise a PR?

@Gallaecio
Copy link
Member

@Gallaecio Gallaecio commented Sep 27, 2019

Please, open a pull request.

amardeep24 added a commit to amardeep24/scrapy that referenced this issue Sep 27, 2019
@amardeep24
Copy link
Contributor

@amardeep24 amardeep24 commented Sep 27, 2019

Raised PR#4042 @Gallaecio please review :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants