Add missing '/' when the URL path is empty #3494
Codecov Report
@@ Coverage Diff @@
## master #3494 +/- ##
=======================================
Coverage 84.38% 84.38%
=======================================
Files 167 167
Lines 9376 9376
Branches 1392 1392
=======================================
Hits 7912 7912
Misses 1206 1206
Partials 258 258
|
Hey! A nice catch 👍. I'm not sure, though, that using canonicalize_url is good here: canonicalize_url is lossy, and its main purpose is to compare/deduplicate URLs. Do you know what kind of normalization browsers perform? It is likely different from what canonicalize_url does. |
@kmike I'm not sure how this is exactly handled in the different browsers to be honest. I've had a look at how
By glimpsing into the
/* If the URL is malformatted (missing a '/' after hostname before path) we
 * insert a slash here. The only letters except '/' that can start a path is
 * '?' and '#' - as controlled by the two sscanf() patterns above.
 */
if(path[0] != '/') {
  /* We need this function to deal with overlapping memory areas. We know
     that the memory area 'path' points to is 'urllen' bytes big and that
     is bigger than the path. Use +1 to move the zero byte too. */
  memmove(&path[1], path, strlen(path) + 1);
  path[0] = '/';
  rebuild_url = TRUE;
}
They simply add a slash. I could create a function that does that, something like this:
>>> from six.moves.urllib.parse import urlsplit, urlunsplit
>>> url = 'https://queue.watsons.com.sg?c=aswatson&e=watsonprdsg&ver=v3-java-3.5.2&cver=55&cid=zh-CN&l=PoC+Layout+SG&t=https%3A%2F%2Fwww.watsons.com.sg%2F'
>>> scheme, netloc, path, query, fragment = urlsplit(url)
>>> path = path if path.startswith('/') else ('/' + path)
>>> urlunsplit((scheme, netloc, path, query, fragment))
'https://queue.watsons.com.sg/?c=aswatson&e=watsonprdsg&ver=v3-java-3.5.2&cver=55&cid=zh-CN&l=PoC+Layout+SG&t=https%3A%2F%2Fwww.watsons.com.sg%2F'
What would be the best place to put this function? I'm thinking of calling it Do you think it should be defined in the redirect middleware script or is it general enough to be part of |
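The interpreter session above can be wrapped into a small helper. This is only a sketch: the name add_missing_slash is hypothetical (it does not come from this PR), and it uses Python 3's urllib.parse instead of six.moves. It follows the curl approach of prepending '/' to the path, with one extra guard (an assumption on my side) so that relative URLs without a host are left alone.

```python
from urllib.parse import urlsplit, urlunsplit


def add_missing_slash(url):
    """Prepend '/' to the path of a URL that has a host but no leading slash.

    Hypothetical helper for illustration; not part of Scrapy or w3lib.
    """
    scheme, netloc, path, query, fragment = urlsplit(url)
    # Only touch URLs with a host, so relative references such as
    # 'foo/bar' are returned unchanged.
    if netloc and not path.startswith('/'):
        path = '/' + path
    return urlunsplit((scheme, netloc, path, query, fragment))


# The shortened redirect URL from this PR gains the missing slash.
print(add_missing_slash('https://queue.watsons.com.sg?c=aswatson'))
# A URL that already has the slash is left as it is.
print(add_missing_slash('https://queue.watsons.com.sg/?c=aswatson'))
```

Calling the helper twice on the same URL gives the same result, so it should be safe to apply to every Location value.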
hey @guillaume-humbert, thanks for checking it! However, I think we should make sure we're doing the right thing: curl is not a gold standard. It is generally good, but I recall some incorrect handling of edge cases in the past. "Do what a browser does" is the gold standard here, as web developers test their websites against browsers; "do what's in the RFC" may also work, though if this conflicts with browser behavior, the browser wins. So, questions:
The best resource I found to answer these questions is https://url.spec.whatwg.org/#url-representation. It says:
"special URL" is defined here https://url.spec.whatwg.org/#is-special as having a special scheme (https://url.spec.whatwg.org/#special-scheme), which is one of "ftp", "file", "gopher", "http", "https", "ws" or "wss". So this suggests we should always add I think the best place to put such code is w3lib.url. And it looks like this is already implemented here: https://github.com/scrapy/w3lib/blob/4e6865e0a8eae05c64ff766b4869332d6942e9ab/w3lib/url.py#L87, in |
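As a rough illustration of the rule discussed in this comment, the sketch below only touches URLs whose scheme is in the special-scheme list quoted above and whose path is empty. The function name ensure_root_path and the constant SPECIAL_SCHEMES are hypothetical; this is not the w3lib implementation, just one way such a check could look.

```python
from urllib.parse import urlsplit, urlunsplit

# Special schemes as listed in the comment above (quoted from the
# WHATWG URL spec as it stood when this thread was written).
SPECIAL_SCHEMES = {"ftp", "file", "gopher", "http", "https", "ws", "wss"}


def ensure_root_path(url):
    """Give special-scheme URLs with a host and an empty path the path '/'.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    if parts.scheme in SPECIAL_SCHEMES and parts.netloc and parts.path == "":
        parts = parts._replace(path="/")
    return urlunsplit(parts)


print(ensure_root_path("http://example.com"))         # empty path, special scheme
print(ensure_root_path("http://example.com#frag"))    # fragment directly after host
print(ensure_root_path("mailto:someone@example.com")) # non-special scheme, untouched
```

Restricting the change to special schemes keeps URLs such as mailto: links unchanged, which matches the spec wording more closely than adding a slash unconditionally.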
@kmike Thanks for sharing the links, this is useful! Now, I was thinking we could replace the calls to However, So I'm thinking we could:
What do you think? |
I am trying this and it looks like the middleware is not executed. Details are below. Any suggestions? I also tried putting a print statement in "process_response" under the class "FixLocationHeaderMiddleWare", but there was no output either. Thanks.
1.) Scrapy log when running the script
2.) My main scrapy.py file
class WatsonsSpider(scrapy.Spider):
3.) My middlewares.py file:
class FixLocationHeaderMiddleWare:
|
@pc2000sg If you look closely at my Stack Overflow answer, you will see that your settings are wrong. Instead of:
custom_settings = {
    'diffmarts.middlewares.FixLocationHeaderMiddleWare': 300
}
it should be:
custom_settings = {
    'DOWNLOADER_MIDDLEWARES': {
        'diffmarts.middlewares.FixLocationHeaderMiddleWare': 650
    }
}
Also, make sure that the order number is greater than 600, as it needs to be applied after the RedirectMiddleware, which has an order number of 600. |
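For readers following along, here is a minimal sketch of what such a downloader middleware and its registration could look like. It is not the code from the Stack Overflow answer: the class name AddSlashToLocationMiddleware and the module path myproject.middlewares are placeholders. The ordering matters because Scrapy calls process_response in decreasing order of the middleware numbers, so a middleware at 650 sees the response before RedirectMiddleware at 600 does.

```python
from urllib.parse import urlsplit, urlunsplit


class AddSlashToLocationMiddleware:
    """Hypothetical downloader middleware that adds a missing '/'
    to the Location header of redirect responses."""

    def process_response(self, request, response, spider):
        location = response.headers.get("Location")
        if location is None or not (300 <= response.status < 400):
            return response
        # Scrapy stores header values as bytes; accept str as well.
        text = location.decode("latin-1") if isinstance(location, bytes) else location
        parts = urlsplit(text)
        if parts.netloc and not parts.path.startswith("/"):
            fixed = parts._replace(path="/" + parts.path)
            response.headers["Location"] = urlunsplit(fixed)
        return response


# Spider-level settings. "myproject.middlewares" is a placeholder path;
# 650 is above RedirectMiddleware's 600, as explained in the comment above.
custom_settings = {
    "DOWNLOADER_MIDDLEWARES": {
        "myproject.middlewares.AddSlashToLocationMiddleware": 650,
    },
}
```

Putting the middleware path directly at the top level of custom_settings, as in the first snippet of the comment above, only defines an unrelated setting with that name, which would explain why process_response never ran.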
OK got it. Let me give it a try. Thanks for your advice!
|
@guillaume-humbert 'diffmarts.middlewares.FixLocationHeaderMiddleWare', |
Issue closed by hardcoding the redirect URL as the start_url in my Scrapy spider. Thanks everyone for the inputs and suggestions! |
This is to fix the issue reported in #1133
When performing redirects, Scrapy creates a new request based on the URL from the response's Location header.
In some cases, the redirect URL that we receive in the response is not properly normalized.
When performing an HTTP GET request with this URL, we may receive an HTTP error.
However, an actual web browser handles it fine, because the browser normalizes the Location URL before performing the next request.
For example, when querying https://www.watsons.com.sg, the response location header contains:
https://queue.watsons.com.sg?c=aswatson&e=watsonprdsg&ver=v3-java-3.5.2&cver=55&cid=zh-CN&l=PoC+Layout+SG&t=https%3A%2F%2Fwww.watsons.com.sg%2F
We can see that this part of the URL is not normalized:
queue.watsons.com.sg?c=aswatson
It is missing a slash. If you send an HTTP GET to this URL, you get an error response back (HTTP 400).
Firefox actually normalizes it to:
queue.watsons.com.sg/?c=aswatson
(notice the additional slash).
I've added a unit test to show the behavior.
More details in my answer on Stack Overflow: https://stackoverflow.com/a/53323565/869764
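To see why the slash stays missing, the short sketch below assumes the redirect target is resolved against the request URL with the standard library's urljoin (an assumption made for illustration; the query string is shortened). The join returns the Location value with its empty path untouched, so nothing along that route adds the '/'.

```python
from urllib.parse import urljoin, urlsplit

request_url = "https://www.watsons.com.sg"
# Shortened version of the Location header value from the description.
location = "https://queue.watsons.com.sg?c=aswatson"

# Resolve the redirect target the way a relative Location would be resolved.
redirect_url = urljoin(request_url, location)

print(redirect_url)
# The path component is still empty after the join.
print(repr(urlsplit(redirect_url).path))
```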