Add missing '/' when the URL path is empty #3494

Open · wants to merge 1 commit into base: master

Conversation

influentcoder

This is to fix the issue reported in #1133.

When performing redirects, Scrapy creates a new request based on the URL from the response's Location header.

In some cases, the redirect URL that we receive in the response is not properly normalized. When performing an HTTP GET request with this URL, we may receive an HTTP error.

However, an actual web browser handles it fine, because the browser normalizes the Location URL before performing the next request.

For example, when querying https://www.watsons.com.sg, the response's Location header contains:

https://queue.watsons.com.sg?c=aswatson&e=watsonprdsg&ver=v3-java-3.5.2&cver=55&cid=zh-CN&l=PoC+Layout+SG&t=https%3A%2F%2Fwww.watsons.com.sg%2F

We can see that this part of the URL is not normalized: queue.watsons.com.sg?c=aswatson; it's missing a slash before the query string.
If you send an HTTP GET to this URL, you get an error response back (HTTP 400).

Firefox actually normalizes it to queue.watsons.com.sg/?c=aswatson (notice the additional slash).

I've added a unit test to show the behavior.

More details in my answer on Stack Overflow: https://stackoverflow.com/a/53323565/869764
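
For reference, a test for this behavior could look roughly like the following. This is only a minimal sketch in the style of Scrapy's redirect middleware tests; the test name and the example.com URLs are illustrative, not the actual committed test:

from scrapy.downloadermiddlewares.redirect import RedirectMiddleware
from scrapy.http import Request, Response
from scrapy.settings import Settings
from scrapy.spiders import Spider

def test_redirect_adds_missing_slash():
    # Hypothetical test; names and URLs are illustrative.
    mw = RedirectMiddleware(Settings())
    spider = Spider('example')

    req = Request('http://www.example.com/')
    # A Location header whose URL has an empty path, as some servers send it.
    rsp = Response('http://www.example.com/', status=301,
                   headers={'Location': 'http://www.example.com?a=1'})

    req2 = mw.process_response(req, rsp, spider)
    assert isinstance(req2, Request)
    # With the fix, the redirect request gets the missing '/' before the query.
    assert req2.url == 'http://www.example.com/?a=1'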

@influentcoder changed the title from "Canonicalize the redirect URL #1133" to "Canonicalize the redirect URL" on Nov 15, 2018
@codecov

codecov bot commented Nov 15, 2018

Codecov Report

Merging #3494 into master will not change coverage.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master    #3494   +/-   ##
=======================================
  Coverage   84.38%   84.38%           
=======================================
  Files         167      167           
  Lines        9376     9376           
  Branches     1392     1392           
=======================================
  Hits         7912     7912           
  Misses       1206     1206           
  Partials      258      258
Impacted Files Coverage Δ
scrapy/downloadermiddlewares/redirect.py 96.72% <100%> (ø) ⬆️

@kmike
Member

kmike commented Nov 19, 2018

Hey! A nice catch 👍. I'm not sure though that using canonicalize_url is good here, as canonicalize_url is lossy, and its main purpose is to compare/deduplicate URLs. Do you know what kind of normalization browsers perform? It is likely different from what canonicalize_url does.
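
For example, canonicalize_url reorders query arguments and drops the fragment by default, so it is not a faithful rewrite of the original URL (the URL below is just an illustration):

>>> from w3lib.url import canonicalize_url
>>> # query arguments get sorted and the fragment is dropped
>>> canonicalize_url('https://example.com/search?b=2&a=1#results')
'https://example.com/search?a=1&b=2'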

@influentcoder
Author

@kmike I'm not sure exactly how this is handled in the different browsers, to be honest. I've had a look at how curl handles it.

$ curl -v -L https://www.watsons.com.sg/
...
...
* Issue another request to this URL: 'https://queue.watsons.com.sg?c=aswatson&e=watsonprdsg&ver=v3-java-3.5.2&cver=55&cid=zh-CN&l=PoC+Layout+SG&t=https%3A%2F%2Fwww.watsons.com.sg%2F'
* Rebuilt URL to: https://queue.watsons.com.sg/?c=aswatson&e=watsonprdsg&ver=v3-java-3.5.2&cver=55&cid=zh-CN&l=PoC+Layout+SG&t=https%3A%2F%2Fwww.watsons.com.sg%2F
...
...

Glancing through the curl code base, I found this: https://github.com/curl/curl/blob/curl-7_58_0/lib/url.c#L2280

  /* If the URL is malformatted (missing a '/' after hostname before path) we
   * insert a slash here. The only letters except '/' that can start a path is
   * '?' and '#' - as controlled by the two sscanf() patterns above.
   */
  if(path[0] != '/') {
    /* We need this function to deal with overlapping memory areas. We know
       that the memory area 'path' points to is 'urllen' bytes big and that
       is bigger than the path. Use +1 to move the zero byte too. */
    memmove(&path[1], path, strlen(path) + 1);
    path[0] = '/';
    rebuild_url = TRUE;
  }

They simply add a / before the URL path if it's missing.

I could create a function that does that, something like this:

>>> from six.moves.urllib.parse import urlsplit, urlunsplit
>>> url = 'https://queue.watsons.com.sg?c=aswatson&e=watsonprdsg&ver=v3-java-3.5.2&cver=55&cid=zh-CN&l=PoC+Layout+SG&t=https%3A%2F%2Fwww.watsons.com.sg%2F'
>>> scheme, netloc, path, query, fragment = urlsplit(url)
>>> path = path if path.startswith('/') else ('/' + path)
>>> urlunsplit((scheme, netloc, path, query, fragment))
'https://queue.watsons.com.sg/?c=aswatson&e=watsonprdsg&ver=v3-java-3.5.2&cver=55&cid=zh-CN&l=PoC+Layout+SG&t=https%3A%2F%2Fwww.watsons.com.sg%2F'
>>>

What would be the best place to put this function? I'm thinking of calling it rebuild_url.

Do you think it should be defined in the redirect middleware module, or is it general enough to be part of w3lib's url.py?

@kmike
Member

kmike commented Nov 20, 2018

hey @guillaume-humbert, thanks for checking it! However, I think we should make sure we're doing the right thing; curl is not a gold standard. It is generally good, but I recall some incorrect handling of edge cases in the past. "Do what a browser is doing" is the gold standard here, as web developers test their websites against browsers; "do what's in the RFC" may also work, though if that conflicts with browser behavior, the browser wins.

So, questions:

  • does this normalization happen only for redirects (Location header), or for other requests as well? What's the best stage we should normalize URLs at?
  • does it happen only when query is present, or always?
  • etc.

The best resource I found to answer these questions is https://url.spec.whatwg.org/#url-representation. It says:

A URL’s path is a list of zero or more ASCII strings holding data, usually identifying a location in hierarchical form. It is initially empty.
Note: A special URL always has a non-empty path.

"special URL" is defined here https://url.spec.whatwg.org/#is-special as having a special scheme (https://url.spec.whatwg.org/#special-scheme), which is one of "ftp", "file", "gopher", "http", "https", "ws" or "wss".

So this suggests we should always add / to a URL if the path is empty (regardless of redirect/etc.) and the scheme is one of those, but not otherwise.

I think the best place to put such code is w3lib.url. And it looks like this is already implemented here: https://github.com/scrapy/w3lib/blob/4e6865e0a8eae05c64ff766b4869332d6942e9ab/w3lib/url.py#L87, in the safe_download_url function (though not 100% correctly, as it is done regardless of scheme). But for some reason this function is not used in Scrapy! I have no idea why.
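
For illustration, the spec-based rule could be sketched roughly like this (the helper name and the mailto example are made up for the sketch, not existing w3lib/Scrapy code):

from six.moves.urllib.parse import urlsplit, urlunsplit

# "Special" schemes per https://url.spec.whatwg.org/#special-scheme
SPECIAL_SCHEMES = {'ftp', 'file', 'gopher', 'http', 'https', 'ws', 'wss'}

def add_missing_path_slash(url):
    # Hypothetical helper: insert '/' as the path, but only for special schemes.
    scheme, netloc, path, query, fragment = urlsplit(url)
    if not path and scheme in SPECIAL_SCHEMES:
        path = '/'
    return urlunsplit((scheme, netloc, path, query, fragment))

>>> add_missing_path_slash('https://queue.watsons.com.sg?c=aswatson')
'https://queue.watsons.com.sg/?c=aswatson'
>>> add_missing_path_slash('mailto:someone@example.com')  # not a special scheme, unchanged
'mailto:someone@example.com'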

@influentcoder
Author

@kmike Thanks for sharing the links, this is useful!
I've checked the browser behavior in Firefox and Chrome; they apparently add / to the path when it is empty, for any request.
E.g. https://www.google.com?q=test is rewritten to https://www.google.com/?q=test

Now, I was thinking we could replace the calls to safe_url_string with safe_download_url, since safe_download_url calls safe_url_string and then adds the / when needed.

However, safe_url_string accepts additional parameters encoding and path_encoding, so to achieve that, we would need to add these parameters to safe_download_url and pass them along.

So I'm thinking we could:

  1. In w3lib, add two additional parameters to safe_download_url and pass them to safe_url_string.
  2. Update safe_download_url as per the URL spec, meaning add the missing / only for special URLs.
  3. In Scrapy, replace all references to safe_url_string with safe_download_url.

What do you think?
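
For illustration, the updated w3lib function could then look roughly like this; it's only a sketch of the proposal (the parameter defaults and the special-scheme set are assumptions, not the actual w3lib code):

from six.moves.urllib.parse import urlsplit, urlunsplit
from w3lib.url import safe_url_string

# "Special" schemes per https://url.spec.whatwg.org/#special-scheme
_SPECIAL_SCHEMES = {'ftp', 'file', 'gopher', 'http', 'https', 'ws', 'wss'}

def safe_download_url(url, encoding='utf8', path_encoding='utf8'):
    # Step 1: escape unsafe characters, passing the encodings along.
    safe_url = safe_url_string(url, encoding=encoding, path_encoding=path_encoding)
    # Step 2: add the missing '/' only when the path is empty and the scheme is special.
    scheme, netloc, path, query, fragment = urlsplit(safe_url)
    if not path and scheme in _SPECIAL_SCHEMES:
        path = '/'
    return urlunsplit((scheme, netloc, path, query, fragment))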

@influentcoder changed the title from "Canonicalize the redirect URL" to "Add missing '/' when the URL path is empty" on Nov 21, 2018
@pc2000sg

I am trying this and it looks like the middleware is not executed. It goes as follows. Any suggestions? I tried to put a print statement in process_response under the FixLocationHeaderMiddleWare class, but there is no output either. Thanks.

1.) Scrapy log when running the script
......
2018-11-27 22:30:49 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'diffmarts.middlewares.FixLocationHeaderMiddleWare',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-11-27 22:30:49 [scrapy.middleware] INFO: Enabled item pipelines:
['diffmarts.pipelines.DiffmartsPipeline']
2018-11-27 22:30:49 [scrapy.core.engine] INFO: Spider opened
2018-11-27 22:30:49 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-11-27 22:30:50 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 https://queue.watsons.com.sg?c=aswatson&e=watsonprdsg&ver=v3-java-3.5.2&cver=62&cid=zh-CN&l=PoC+Layout+SG&t=https%3A%2F%2Fwww.watsons.com.sg%2F>: HTTP status code is not handled or not allowed
2018-11-27 22:30:50 [scrapy.core.engine] INFO: Closing spider (finished)
2018-11-27 22:30:52 [scrapy.core.engine] ERROR: Scraper close failure

2.) My main scrapy.py file
import scrapy

class WatsonsSpider(scrapy.Spider):
    name = 'watsons'
    start_urls = ['http://www.watsons.com.sg/']
    print("debug1")
    custom_settings = {
        'diffmarts.middlewares.FixLocationHeaderMiddleWare': 300
    }

    def parse(self, response):
        # print(response.url)
        for category_selector in response.xpath("//div[contains(@Class, 'navSubContainer-box clearfix')]//div[contains(@Class, 'navCol')]//dt/a"):
            cat_name = ' '.join(category_selector.xpath(".//text()").extract()).strip(' \n\r\t')

3.) My middlewares.py file:
# ... imports at the top of the file, e.g.:
from w3lib.url import canonicalize_url

# ... existing middleware class that defines spider_opened:
    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class FixLocationHeaderMiddleWare:

    def process_response(self, request, response, spider):
        if 'location' in response.headers:
            response.headers['location'] = canonicalize_url(response.headers['location'])
        return response

@influentcoder
Author

@pc2000sg If you look closely at my Stack Overflow answer, you will see that your settings are wrong. Instead of:

custom_settings = {
'diffmarts.middlewares.FixLocationHeaderMiddleWare':300
}

It should be:

custom_settings = {
    'DOWNLOADER_MIDDLEWARES': {
        'diffmarts.middlewares.FixLocationHeaderMiddleWare': 650
    }
}

Also, make sure that the order number is greater than 600, as it needs to be applied after the RedirectMiddleware, which has an order number of 600.
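
For completeness, a self-contained version of the workaround could look roughly like this (just a sketch: the module path comes from your example, and it uses the urlsplit-based slash fix from earlier in this thread rather than canonicalize_url):

# diffmarts/middlewares.py (sketch)
from six.moves.urllib.parse import urlsplit, urlunsplit

class FixLocationHeaderMiddleWare:

    def process_response(self, request, response, spider):
        if 'location' in response.headers:
            location = response.headers['location'].decode()
            scheme, netloc, path, query, fragment = urlsplit(location)
            if not path:
                # add the missing '/' before the query string
                path = '/'
            response.headers['location'] = urlunsplit((scheme, netloc, path, query, fragment))
        return response

# Enable it via DOWNLOADER_MIDDLEWARES with a priority of 650
# (above RedirectMiddleware's 600), as shown above.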

@pc2000sg

pc2000sg commented Dec 15, 2018 via email

@pc2000sg

pc2000sg commented Jan 1, 2019

@guillaume-humbert
I finally figured out how to code the middleware changes. However, after calling safe_download_url and adding the "/" to the URL, it seems that the 400 error still persists. The Scrapy log is attached below for reference. I wonder if there is another possible cause of the 400 error? Thanks.

'diffmarts.middlewares.FixLocationHeaderMiddleWare',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-01-01 21:17:14 [scrapy.middleware] INFO: Enabled item pipelines:
['diffmarts.pipelines.DiffmartsPipeline']
2019-01-01 21:17:14 [scrapy.core.engine] INFO: Spider opened
2019-01-01 21:17:14 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-01 21:17:15 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 https://queue.watsons.com.sg/?c=aswatson&e=watsonprdsg&ver=v3-java-3.5.2&cver=62&cid=zh-CN&l=PoC+Layout+SG&t=https%3A%2F%2Fwww.watsons.com.sg%2F>: HTTP status code is not handled or not allowed
2019-01-01 21:17:15 [scrapy.core.engine] INFO: Closing spider (finished)
2019-01-01 21:17:17 [scrapy.core.engine] ERROR: Scraper close failure
Traceback (most recent call last):

@pc2000sg

pc2000sg commented Jan 9, 2019

Issue closed by hardcoding the redirect URL as the start_url in the Scrapy spider. Thanks everyone for the inputs and suggestions!
