Add missing '/' when the URL path is empty #3494

Open · wants to merge 1 commit into base: master

Conversation

influentcoder

This is to fix the issue reported in #1133.

When performing redirects, Scrapy creates a new request based on the URL from the response's Location header.

In some cases, the redirect URL that we receive in the response is not properly normalized. When performing an HTTP GET request with this URL, we may receive an HTTP error.

However, an actual web browser handles it fine, because the browser normalizes the Location URL before performing the next request.

For example, when querying https://www.watsons.com.sg, the response's Location header contains:

https://queue.watsons.com.sg?c=aswatson&e=watsonprdsg&ver=v3-java-3.5.2&cver=55&cid=zh-CN&l=PoC+Layout+SG&t=https%3A%2F%2Fwww.watsons.com.sg%2F

We can see that this part of the URL is not normalized: queue.watsons.com.sg?c=aswatson; it's missing a slash before the query string.
If you send an HTTP GET to this URL, you get an error response back (HTTP 400).

Firefox actually normalizes it to queue.watsons.com.sg/?c=aswatson (notice the additional slash).

I've added a unit test to show the behavior.

More details in my answer on Stack Overflow: https://stackoverflow.com/a/53323565/869764
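
For reference, a test for this behavior could look roughly like the following. This is only a minimal sketch in the style of Scrapy's redirect middleware tests; the test name and the example.com URLs are illustrative, not the actual committed test:

from scrapy.downloadermiddlewares.redirect import RedirectMiddleware
from scrapy.http import Request, Response
from scrapy.settings import Settings
from scrapy.spiders import Spider

def test_redirect_adds_missing_slash():
    # Hypothetical test; names and URLs are illustrative.
    mw = RedirectMiddleware(Settings())
    spider = Spider('example')

    req = Request('http://www.example.com/')
    # A Location header whose URL has an empty path, as some servers send it.
    rsp = Response('http://www.example.com/', status=301,
                   headers={'Location': 'http://www.example.com?a=1'})

    req2 = mw.process_response(req, rsp, spider)
    assert isinstance(req2, Request)
    # With the fix, the redirect request gets the missing '/' before the query.
    assert req2.url == 'http://www.example.com/?a=1'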

@influentcoder changed the title from "Canonicalize the redirect URL #1133" to "Canonicalize the redirect URL" on Nov 15, 2018
@codecov

codecov bot commented Nov 15, 2018

Codecov Report

Merging #3494 into master will not change coverage.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master    #3494   +/-   ##
=======================================
  Coverage   84.38%   84.38%           
=======================================
  Files         167      167           
  Lines        9376     9376           
  Branches     1392     1392           
=======================================
  Hits         7912     7912           
  Misses       1206     1206           
  Partials      258      258
Impacted Files Coverage Δ
scrapy/downloadermiddlewares/redirect.py 96.72% <100%> (ø) ⬆️

@kmike
Member

kmike commented Nov 19, 2018

Hey! A nice catch 👍. I'm not sure though that using canonicalize_url is good here, as canonicalize_url is lossy, and its main purpose is to compare/deduplicate URLs. Do you know what kind of normalization browsers perform? It is likely different from what canonicalize_url does.
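
For example, canonicalize_url reorders query arguments and drops the fragment by default, so it is not a faithful rewrite of the original URL (the URL below is just an illustration):

>>> from w3lib.url import canonicalize_url
>>> # query arguments get sorted and the fragment is dropped
>>> canonicalize_url('https://example.com/search?b=2&a=1#results')
'https://example.com/search?a=1&b=2'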

@influentcoder
Author

@kmike I'm not sure exactly how this is handled in the different browsers, to be honest. I've had a look at how curl handles it.

$ curl -v -L https://www.watsons.com.sg/
...
...
* Issue another request to this URL: 'https://queue.watsons.com.sg?c=aswatson&e=watsonprdsg&ver=v3-java-3.5.2&cver=55&cid=zh-CN&l=PoC+Layout+SG&t=https%3A%2F%2Fwww.watsons.com.sg%2F'
* Rebuilt URL to: https://queue.watsons.com.sg/?c=aswatson&e=watsonprdsg&ver=v3-java-3.5.2&cver=55&cid=zh-CN&l=PoC+Layout+SG&t=https%3A%2F%2Fwww.watsons.com.sg%2F
...
...

Glancing through the curl code base, I found this: https://github.com/curl/curl/blob/curl-7_58_0/lib/url.c#L2280

  /* If the URL is malformatted (missing a '/' after hostname before path) we
   * insert a slash here. The only letters except '/' that can start a path is
   * '?' and '#' - as controlled by the two sscanf() patterns above.
   */
  if(path[0] != '/') {
    /* We need this function to deal with overlapping memory areas. We know
       that the memory area 'path' points to is 'urllen' bytes big and that
       is bigger than the path. Use +1 to move the zero byte too. */
    memmove(&path[1], path, strlen(path) + 1);
    path[0] = '/';
    rebuild_url = TRUE;
  }

They simply add a / before the URL path if it's missing.

I could create a function that does that, something like this:

>>> from six.moves.urllib.parse import urlsplit, urlunsplit
>>> url = 'https://queue.watsons.com.sg?c=aswatson&e=watsonprdsg&ver=v3-java-3.5.2&cver=55&cid=zh-CN&l=PoC+Layout+SG&t=https%3A%2F%2Fwww.watsons.com.sg%2F'
>>> scheme, netloc, path, query, fragment = urlsplit(url)
>>> path = path if path.startswith('/') else ('/' + path)
>>> urlunsplit((scheme, netloc, path, query, fragment))
'https://queue.watsons.com.sg/?c=aswatson&e=watsonprdsg&ver=v3-java-3.5.2&cver=55&cid=zh-CN&l=PoC+Layout+SG&t=https%3A%2F%2Fwww.watsons.com.sg%2F'
>>>

What would be the best place to put this function? I'm thinking of calling it rebuild_url.

Do you think it should be defined in the redirect middleware module, or is it general enough to be part of w3lib's url.py?

@kmike
Member

kmike commented Nov 20, 2018

hey @guillaume-humbert, thanks for checking it! However, I think we should make sure we're doing the right thing; curl is not a gold standard. It is generally good, but I recall some incorrect handling of edge cases in the past. "Do what a browser is doing" is the gold standard here, as web developers test their websites against browsers; "do what's in the RFC" may also work, though if that conflicts with browser behavior, the browser wins.

So, questions:

  • does this normalization happen only for redirects (Location header), or for other requests as well? What's the best stage we should normalize URLs at?
  • does it happen only when query is present, or always?
  • etc.

The best resource I found to answer these questions is https://url.spec.whatwg.org/#url-representation. It says:

A URL’s path is a list of zero or more ASCII strings holding data, usually identifying a location in hierarchical form. It is initially empty.
Note: A special URL always has a non-empty path.

"special URL" is defined here https://url.spec.whatwg.org/#is-special as having a special scheme (https://url.spec.whatwg.org/#special-scheme), which is one of "ftp", "file", "gopher", "http", "https", "ws" or "wss".

So this suggests we should always add / to a URL if the path is empty (regardless of redirect/etc.) and the scheme is one of those, but not otherwise.

I think the best place to put such code is w3lib.url. And it looks like this is already implemented here: https://github.com/scrapy/w3lib/blob/4e6865e0a8eae05c64ff766b4869332d6942e9ab/w3lib/url.py#L87, in the safe_download_url function (though not 100% correctly, as it is done regardless of scheme). But for some reason this function is not used in Scrapy! I have no idea why.
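
For illustration, the spec-based rule could be sketched roughly like this (the helper name and the mailto example are made up for the sketch, not existing w3lib/Scrapy code):

from six.moves.urllib.parse import urlsplit, urlunsplit

# "Special" schemes per https://url.spec.whatwg.org/#special-scheme
SPECIAL_SCHEMES = {'ftp', 'file', 'gopher', 'http', 'https', 'ws', 'wss'}

def add_missing_path_slash(url):
    # Hypothetical helper: insert '/' as the path, but only for special schemes.
    scheme, netloc, path, query, fragment = urlsplit(url)
    if not path and scheme in SPECIAL_SCHEMES:
        path = '/'
    return urlunsplit((scheme, netloc, path, query, fragment))

>>> add_missing_path_slash('https://queue.watsons.com.sg?c=aswatson')
'https://queue.watsons.com.sg/?c=aswatson'
>>> add_missing_path_slash('mailto:someone@example.com')  # not a special scheme, unchanged
'mailto:someone@example.com'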

@influentcoder
Author

@kmike Thanks for sharing the links, this is useful!
I've checked the browser behavior in Firefox and Chrome; they apparently add / to the path when it is empty, for any request.
E.g. https://www.google.com?q=test is rewritten to https://www.google.com/?q=test

Now, I was thinking we could replace the calls to safe_url_string with safe_download_url, since safe_download_url calls safe_url_string and then adds the / when needed.

However, safe_url_string accepts additional parameters encoding and path_encoding, so to achieve that, we would need to add these parameters to safe_download_url and pass them along.

So I'm thinking we could:

  1. In w3lib, add two additional parameters to safe_download_url and pass them to safe_url_string.
  2. Update safe_download_url as per the URL spec, meaning add the missing / only for special URLs.
  3. In Scrapy, replace all references to safe_url_string with safe_download_url.

What do you think?
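
For illustration, the updated w3lib function could then look roughly like this; it's only a sketch of the proposal (the parameter defaults and the special-scheme set are assumptions, not the actual w3lib code):

from six.moves.urllib.parse import urlsplit, urlunsplit
from w3lib.url import safe_url_string

# "Special" schemes per https://url.spec.whatwg.org/#special-scheme
_SPECIAL_SCHEMES = {'ftp', 'file', 'gopher', 'http', 'https', 'ws', 'wss'}

def safe_download_url(url, encoding='utf8', path_encoding='utf8'):
    # Step 1: escape unsafe characters, passing the encodings along.
    safe_url = safe_url_string(url, encoding=encoding, path_encoding=path_encoding)
    # Step 2: add the missing '/' only when the path is empty and the scheme is special.
    scheme, netloc, path, query, fragment = urlsplit(safe_url)
    if not path and scheme in _SPECIAL_SCHEMES:
        path = '/'
    return urlunsplit((scheme, netloc, path, query, fragment))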

@influentcoder changed the title from "Canonicalize the redirect URL" to "Add missing '/' when the URL path is empty" on Nov 21, 2018
@pc2000sg

I am trying this and it looks like the middleware is not executed. It goes as follows. Any suggestions? I tried to put a print statement in process_response under the FixLocationHeaderMiddleWare class, but there is no output either. Thanks.

1.) Scrapy log when running the script
......
2018-11-27 22:30:49 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'diffmarts.middlewares.FixLocationHeaderMiddleWare',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-11-27 22:30:49 [scrapy.middleware] INFO: Enabled item pipelines:
['diffmarts.pipelines.DiffmartsPipeline']
2018-11-27 22:30:49 [scrapy.core.engine] INFO: Spider opened
2018-11-27 22:30:49 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-11-27 22:30:50 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 https://queue.watsons.com.sg?c=aswatson&e=watsonprdsg&ver=v3-java-3.5.2&cver=62&cid=zh-CN&l=PoC+Layout+SG&t=https%3A%2F%2Fwww.watsons.com.sg%2F>: HTTP status code is not handled or not allowed
2018-11-27 22:30:50 [scrapy.core.engine] INFO: Closing spider (finished)
2018-11-27 22:30:52 [scrapy.core.engine] ERROR: Scraper close failure

2.) My main scrapy.py file
import scrapy

class WatsonsSpider(scrapy.Spider):
    name = 'watsons'
    start_urls = ['http://www.watsons.com.sg/']
    print("debug1")
    custom_settings = {
        'diffmarts.middlewares.FixLocationHeaderMiddleWare': 300
    }

    def parse(self, response):
        # print(response.url)
        for category_selector in response.xpath("//div[contains(@Class, 'navSubContainer-box clearfix')]//div[contains(@Class, 'navCol')]//dt/a"):
            cat_name = ' '.join(category_selector.xpath(".//text()").extract()).strip(' \n\r\t')

3.) My middlewares.py file:
# ... imports at the top of the file, e.g.:
from w3lib.url import canonicalize_url

# ... existing middleware class that defines spider_opened:
    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class FixLocationHeaderMiddleWare:

    def process_response(self, request, response, spider):
        if 'location' in response.headers:
            response.headers['location'] = canonicalize_url(response.headers['location'])
        return response

@influentcoder
Author

@pc2000sg If you look closely at my Stack Overflow answer, you will see that your settings are wrong. Instead of:

custom_settings = {
'diffmarts.middlewares.FixLocationHeaderMiddleWare':300
}

It should be:

custom_settings = {
    'DOWNLOADER_MIDDLEWARES': {
        'diffmarts.middlewares.FixLocationHeaderMiddleWare': 650
    }
}

Also, make sure that the order number is greater than 600, as it needs to be applied after the RedirectMiddleware, which has an order number of 600.
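
For completeness, a self-contained version of the workaround could look roughly like this (just a sketch: the module path comes from your example, and it uses the urlsplit-based slash fix from earlier in this thread rather than canonicalize_url):

# diffmarts/middlewares.py (sketch)
from six.moves.urllib.parse import urlsplit, urlunsplit

class FixLocationHeaderMiddleWare:

    def process_response(self, request, response, spider):
        if 'location' in response.headers:
            location = response.headers['location'].decode()
            scheme, netloc, path, query, fragment = urlsplit(location)
            if not path:
                # add the missing '/' before the query string
                path = '/'
            response.headers['location'] = urlunsplit((scheme, netloc, path, query, fragment))
        return response

# Enable it via DOWNLOADER_MIDDLEWARES with a priority of 650
# (above RedirectMiddleware's 600), as shown above.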

@pc2000sg

pc2000sg commented Dec 15, 2018 via email

@pc2000sg

pc2000sg commented Jan 1, 2019

@guillaume-humbert
I finally figured out how to code the middleware changes. However, after calling safe_download_url and adding the "/" to the URL, it seems that the 400 error still persists. The Scrapy log is attached below for reference. I wonder if there is another possible cause of the 400 error? Thanks.

'diffmarts.middlewares.FixLocationHeaderMiddleWare',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-01-01 21:17:14 [scrapy.middleware] INFO: Enabled item pipelines:
['diffmarts.pipelines.DiffmartsPipeline']
2019-01-01 21:17:14 [scrapy.core.engine] INFO: Spider opened
2019-01-01 21:17:14 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-01 21:17:15 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 https://queue.watsons.com.sg/?c=aswatson&e=watsonprdsg&ver=v3-java-3.5.2&cver=62&cid=zh-CN&l=PoC+Layout+SG&t=https%3A%2F%2Fwww.watsons.com.sg%2F>: HTTP status code is not handled or not allowed
2019-01-01 21:17:15 [scrapy.core.engine] INFO: Closing spider (finished)
2019-01-01 21:17:17 [scrapy.core.engine] ERROR: Scraper close failure
Traceback (most recent call last):

@pc2000sg

pc2000sg commented Jan 9, 2019

Issue closed by hardcoding the redirect URL as the start_url in the Scrapy spider. Thanks everyone for the inputs and suggestions!
