Open in Browser <base> replacement will fail if <head> has attributes #5319

@zessx

Description

When using the open_in_browser() feature, Scrapy tries to add a <base> tag so that remote resources load and external links work in the local browser. This feature relies on the following code:

if isinstance(response, HtmlResponse):
    if b'<base' not in body:
        repl = f'<head><base href="{response.url}">'
        body = body.replace(b'<head>', to_bytes(repl))

Some websites use attributes on the <head> tag, which prevents the <base> tag from being injected, and therefore external resources from being loaded.
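
The failure can be seen without running a crawl. The sketch below copies the literal replacement into a hypothetical helper, `inject_base` (not part of Scrapy), and uses `str.encode` in place of Scrapy's `to_bytes`. The two byte strings and the URL are made-up test inputs.

```python
def inject_base(body: bytes, url: str) -> bytes:
    """Mimic the literal replacement quoted in this issue (hypothetical helper)."""
    if b"<base" not in body:
        repl = f'<head><base href="{url}">'
        # Only an exact b"<head>" is matched, so any attribute defeats it.
        body = body.replace(b"<head>", repl.encode("utf-8"))
    return body


url = "https://example.com/page.html"  # assumed URL, for illustration only
plain = b"<html><head><title>T</title></head></html>"
with_attr = b'<html><head id="example"><title>T</title></head></html>'

print(b"<base" in inject_base(plain, url))      # True
print(b"<base" in inject_base(with_attr, url))  # False
```

The second body comes back unchanged, which is why relative URLs such as ./assets/image.jpg no longer resolve once the page is opened from a local file.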

How to reproduce the issue

Simply create a basic spider following the Scrapy tutorial and use the following code:

import scrapy
from scrapy.utils.response import open_in_browser

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = [
        'https://example.com/head-without-argument.html',
        'https://example.com/head-with-argument.html',
    ]

    def parse(self, response):
        # Open each downloaded page in the local browser.
        open_in_browser(response)

For the scraped pages themselves, use the simplest code possible (I've not been able to quickly find a public page using attributes on <head>, sorry):

<!DOCTYPE html>
<html>
  <!-- head-without-argument.html -->
  <head>
    <title>Title</title>
  </head>
  <body>
    <p>Foo</p>
    <img src="./assets/image.jpg">
  </body>
</html>

<!DOCTYPE html>
<html>
  <!-- head-with-argument.html -->
  <head id="example">
    <title>Title</title>
  </head>
  <body>
    <p>Foo</p>
    <img src="./assets/image.jpg">
  </body>
</html>

Then run the spider with scrapy crawl example and you'll see that:

  1. The head-without-argument.html output renders the image correctly
  2. The head-with-argument.html output doesn't load the image

How to fix the issue

At the very least, the literal replace() call should be replaced by a regex replacement:

if isinstance(response, HtmlResponse):
    if b'<base' not in body:
        repl = f'\\1<base href="{response.url}">'
        body = re.sub(b"(<head.*?>)", to_bytes(repl), body)
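
A slightly stricter variant might be worth considering, since `<head.*?>` also matches `<header>` and the URL is interpreted as a regex replacement template. The sketch below is only a suggestion, not the actual Scrapy patch: `HEAD_OPEN_RE` and `inject_base_regex` are hypothetical names, and it assumes that attribute values never contain `<` or `>`.

```python
import re

# Match "<head>" or "<head ...attributes...>", but not "<header>".
# Assumption: attribute values contain no "<" or ">" characters.
HEAD_OPEN_RE = re.compile(rb"<head(?:\s[^<>]*)?>", re.IGNORECASE)


def inject_base_regex(body: bytes, url: str) -> bytes:
    """Insert a <base> tag right after the opening <head> tag (hypothetical helper)."""
    if b"<base" in body:
        return body
    base_tag = f'<base href="{url}">'.encode("utf-8")
    # A function replacement keeps backslashes in the URL from being read
    # as escapes; count=1 limits the change to the first opening tag.
    return HEAD_OPEN_RE.sub(lambda m: m.group(0) + base_tag, body, count=1)


url = "https://example.com/page.html"  # assumed URL, for illustration only
page = b'<html><head id="example"><title>T</title></head></html>'
print(inject_base_regex(page, url))
```

With this pattern a body that only contains `<header>` is left untouched, and both `<head>` and `<head id="example">` get the `<base>` tag.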

Environment

Scrapy       : 2.5.1
lxml         : 4.6.3.0
libxml2      : 2.9.4
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 1.22.0
Twisted      : 21.7.0
Python       : 3.9.7 (default, Sep  3 2021, 04:31:11) - [Clang 12.0.5 (clang-1205.0.22.9)]
pyOpenSSL    : 21.0.0 (OpenSSL 1.1.1l  24 Aug 2021)
cryptography : 35.0.0
Platform     : macOS-11.6-arm64-arm-64bit
