Description
When using the open_in_browser() feature, Scrapy tries to add a <base> tag so that remote resources load and external links work in the local browser. This feature relies on the following code (scrapy/utils/response.py, lines 81 to 84 at commit 06f3d12):
if isinstance(response, HtmlResponse):
    if b'<base' not in body:
        repl = f'<head><base href="{response.url}">'
        body = body.replace(b'<head>', to_bytes(repl))
Some websites use attributes on the <head> tag, which prevents the <base> tag from being injected, and therefore prevents external resources from loading.
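The mismatch can be seen in isolation with plain bytes, without running a spider. The snippet below is a minimal sketch: `add_base_literal` is a hypothetical stand-in that mirrors the literal replacement quoted above, and `str.encode` is used in place of Scrapy's `to_bytes` so that it runs with the standard library only.

```python
# Hypothetical stand-in for the injection step quoted above (not Scrapy's API).
def add_base_literal(body: bytes, url: str) -> bytes:
    if b"<base" not in body:
        repl = f'<head><base href="{url}">'
        # Literal match: only the exact byte sequence b"<head>" is replaced.
        body = body.replace(b"<head>", repl.encode("utf-8"))
    return body


url = "https://example.com/"
plain = b"<html><head><title>Title</title></head></html>"
with_attr = b'<html><head id="example"><title>Title</title></head></html>'

print(b"<base" in add_base_literal(plain, url))      # True
print(b"<base" in add_base_literal(with_attr, url))  # False
```

Because `<head id="example">` never contains the exact sequence `<head>`, the replacement is a silent no-op and the saved page keeps its relative URLs unresolved.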
How to reproduce the issue
Simply create a basic spider following the Scrapy tutorial and use the following code:
import scrapy
from scrapy.utils.response import open_in_browser


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = [
        'https://example.com/head-without-argument.html',
        'https://example.com/head-with-argument.html',
    ]

    def parse(self, response):
        # Save the response to a temporary file and open it in the local browser.
        open_in_browser(response)
For the scraped pages themselves, use the simplest markup possible (I have not been able to quickly find a public page using attributes on <head>, sorry):
<!DOCTYPE html>
<html>
<!-- head-without-argument.html -->
<head>
<title>Title</title>
</head>
<body>
<p>Foo</p>
<img src="./assets/image.jpg">
</body>
</html>
<!DOCTYPE html>
<html>
<!-- head-with-argument.html -->
<head id="example">
<title>Title</title>
</head>
<body>
<p>Foo</p>
<img src="./assets/image.jpg">
</body>
</html>
Then run the spider with scrapy crawl example and you'll see that:
The head-without-argument.html output renders the resource correctly.
The head-with-argument.html output does not render the resource.
How to fix the issue
At the very least, the literal replace() call should be replaced by a regex replacement:
if isinstance(response, HtmlResponse):
    if b'<base' not in body:
        repl = f'\\1<base href="{response.url}">'
        body = re.sub(b"(<head.*?>)", to_bytes(repl), body)
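One caveat with the pattern above: `<head.*?>` also matches tags such as `<header>`, and `re.sub` without a `count` rewrites every match. The sketch below shows one possible tightening, assuming only the standard library. `inject_base_tag` and `HEAD_OPEN_RE` are hypothetical names used for illustration, not Scrapy's implementation.

```python
import re

# "<head" must be followed by ">" or whitespace, so "<header>" is not matched.
HEAD_OPEN_RE = re.compile(rb"(<head(?:>|\s[^>]*>))", re.IGNORECASE)


def inject_base_tag(body: bytes, url: str) -> bytes:
    """Insert a <base> tag right after the first opening <head ...> tag."""
    if b"<base" in body:
        return body
    base = f'<base href="{url}">'.encode("utf-8")
    # A function replacement keeps backslashes in the URL from being read as
    # regex escapes; count=1 limits the change to the first match.
    return HEAD_OPEN_RE.sub(lambda m: m.group(1) + base, body, count=1)


page = b'<html><head id="example"><title>Title</title></head></html>'
print(inject_base_tag(page, "https://example.com/"))
```

This sketch still has limits: an attribute value containing `>` would end the match early, and a `<head` string inside an HTML comment would be matched like a real tag.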
Environment
Scrapy : 2.5.1
lxml : 4.6.3.0
libxml2 : 2.9.4
cssselect : 1.1.0
parsel : 1.6.0
w3lib : 1.22.0
Twisted : 21.7.0
Python : 3.9.7 (default, Sep 3 2021, 04:31:11) - [Clang 12.0.5 (clang-1205.0.22.9)]
pyOpenSSL : 21.0.0 (OpenSSL 1.1.1l 24 Aug 2021)
cryptography : 35.0.0
Platform : macOS-11.6-arm64-arm-64bit