Open in Browser <base> replacement will fail if <head> has attributes #5319

Closed
zessx opened this issue Nov 15, 2021 · 1 comment · Fixed by #5320

zessx commented Nov 15, 2021

Description

When using the open_in_browser() feature, Scrapy tries to add a <base> tag so that remote resources are loaded and external links work in the local browser. This feature relies on the following code:

if isinstance(response, HtmlResponse):
    if b'<base' not in body:
        repl = f'<head><base href="{response.url}">'
        body = body.replace(b'<head>', to_bytes(repl))

Some websites use attributes on the <head> tag. In that case the literal b'<head>' is never found, so the <base> tag is not injected and external resources are not loaded.
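The failure can be shown without running a spider. The sketch below mimics the literal replacement quoted above in a standalone helper; `inject_base_literal` and the sample bodies are hypothetical names made up for illustration, not part of Scrapy.

```python
def inject_base_literal(body: bytes, url: str) -> bytes:
    """Mimic the literal replacement (hypothetical helper, not Scrapy API)."""
    if b"<base" not in body:
        repl = f'<head><base href="{url}">'
        # Only an exact, attribute-free "<head>" is replaced here.
        body = body.replace(b"<head>", repl.encode("utf-8"))
    return body


url = "https://example.com/"
plain = b"<html><head><title>Title</title></head></html>"
with_attr = b'<html><head id="example"><title>Title</title></head></html>'

print(b"<base" in inject_base_literal(plain, url))      # True
print(b"<base" in inject_base_literal(with_attr, url))  # False
```

With an attribute on <head>, the body comes back unchanged, which matches the behaviour described above.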

How to reproduce the issue

Create a basic spider following the Scrapy tutorial and use the following code:

import scrapy
from scrapy.utils.response import open_in_browser

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = [
        'https://example.com/head-without-argument.html', 
        'https://example.com/head-with-argument.html']

    def parse(self, response):
        open_in_browser(response)

For the scraped pages themselves, use the simplest code possible (I've not been able to quickly find a public page using attributes on <head>, sorry):

<!DOCTYPE html>
<html>
  <!-- head-without-argument.html -->
  <head>
    <title>Title</title>
  </head>
  <body>
    <p>Foo</p>
    <img src="./assets/image.jpg">
  </body>
</html>

<!DOCTYPE html>
<html>
  <!-- head-with-argument.html -->
  <head id="example">
    <title>Title</title>
  </head>
  <body>
    <p>Foo</p>
    <img src="./assets/image.jpg">
  </body>
</html>

Then run the spider with scrapy crawl example and you'll see that:

  1. The head-without-argument.html output renders the resource correctly
  2. The head-with-argument.html output doesn't render the resource

How to fix the issue

At the very least, the literal replace() call should be replaced by a regex replacement:

 if isinstance(response, HtmlResponse): 
     if b'<base' not in body: 
         repl = f'\\1<base href="{response.url}">' 
         body = re.sub(b"(<head.*?>)", to_bytes(repl), body)
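One thing to watch with the pattern above is that `<head.*?>` also matches `<header>`, and `re.sub` replaces every match by default. A slightly stricter variant, as a standalone sketch, could look like the following; `inject_base_regex` and `HEAD_RE` are hypothetical names used for illustration, and this is not necessarily what the linked PR does.

```python
import re

# Match "<head>" or "<head ...attributes>", but not "<header>".
# Assumption: attribute values do not contain a literal ">".
HEAD_RE = re.compile(rb"(<head(?:>|\s[^>]*>))", re.IGNORECASE)


def inject_base_regex(body: bytes, url: str) -> bytes:
    """Insert a <base> tag right after the first <head ...> opening tag."""
    if b"<base" in body:
        return body
    base_tag = f'<base href="{url}">'.encode("utf-8")
    # A function replacement keeps backslashes in the URL from being read
    # as regex escapes; count=1 limits the change to the first match.
    return HEAD_RE.sub(lambda m: m.group(1) + base_tag, body, count=1)


page = b'<html><head id="example"><title>Title</title></head></html>'
print(inject_base_regex(page, "https://example.com/"))
```

This still does not handle a `<head>` that appears inside an HTML comment before the real one, so it should be read as a starting point rather than a complete fix.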

Environment

Scrapy       : 2.5.1
lxml         : 4.6.3.0
libxml2      : 2.9.4
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 1.22.0
Twisted      : 21.7.0
Python       : 3.9.7 (default, Sep  3 2021, 04:31:11) - [Clang 12.0.5 (clang-1205.0.22.9)]
pyOpenSSL    : 21.0.0 (OpenSSL 1.1.1l  24 Aug 2021)
cryptography : 35.0.0
Platform     : macOS-11.6-arm64-arm-64bit

zessx commented Nov 15, 2021

Related PR: #5320
