Ignore data:image URLs #429

Closed
nikolay-borzov opened this issue Nov 2, 2020 · 3 comments · Fixed by #438

nikolay-borzov commented Nov 2, 2020

Configuration

version: 4.2.3

options:

{
  urls: [
    'http://example.com/product/abc'
  ],
  directory: path.resolve(__dirname, 'src'),
  urlFilter(url) {
    return url.indexOf('data:image') === -1;
  }
}

Description

The page contains an image like:

<picture>
 <source media="(max-width: 559px)"
   srcset=" 1x,
            2x">
  <source media="(min-width: 560px) and (max-width: 719px)"
    srcset=" 1x,
             2x">
  <source media="(min-width: 720px) and (max-width: 899px)"
    srcset=" 1x,
             2x">
  <source media="(min-width: 900px) and (max-width: 1199px)"
    srcset=" 1x,
             2x">
  <source media="(min-width: 1200px)"
    srcset=" 1x,
             2x">
  <img
    src=""
    srcset="http://example.com/product/abc/media/521811121-392x351.jpg 2x">
</picture>

data:image URLs are not fully ignored even though urlFilter is set to filter out URLs containing 'data:image'. It looks like two URLs are extracted from each data:image URL: one is data:image/gif;base64, and the other is the part after the comma, R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7

website-scraper:debug filtering out { url: "data:image/gif;base64", filename: "undefined", depth: 3 } by url filter +0ms
website-scraper:debug found requested resource for { url: "http://example.com/product/abc/R0lGODlhAQABAIAAAAAAAP///R0lGODlhAQABAIAAAAAAAP///R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7", filename: "undefined", depth: 3 } +0ms

R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7 is appended to the base URL and then requested as if it were a regular resource.
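
To illustrate the behavior described above, here is a minimal sketch of how splitting a srcset value on every comma could produce the two URLs seen in the log. The function naiveParseSrcset and the trailing slash on the base URL are assumptions made for this illustration; this is not the actual code of website-scraper or of the srcset module.

```javascript
// Hypothetical naive parser for illustration only; NOT the srcset module's code.
function naiveParseSrcset(srcset) {
  return srcset
    .split(',')                           // also splits inside the data URL
    .map((part) => part.trim())
    .filter(Boolean)
    .map((part) => part.split(/\s+/)[0]); // drop descriptors such as "1x"
}

const dataUrl = 'data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7';
const candidates = naiveParseSrcset(`${dataUrl} 1x`);

// Same check as the urlFilter option in the configuration from this report.
const urlFilter = (url) => url.indexOf('data:image') === -1;
const kept = candidates.filter(urlFilter);

// Assumed base URL (with a trailing slash) used to resolve the leftover fragment.
const baseUrl = 'http://example.com/product/abc/';
const resolved = new URL(kept[0], baseUrl);
const lastSegment = resolved.pathname.split('/').pop();

console.log(candidates);   // the data URL split into two pieces
console.log(kept);         // only the base64 payload passes the filter
console.log(resolved.href);
console.log(lastSegment);
```

The first piece still contains 'data:image' and is rejected by the filter, but the base64 payload does not, so it passes and is resolved like a relative path. Its last path segment is yH5BAEAAAAALAAAAAABAAEAAAIBRAA7, which matches the name of the downloaded file.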

Expected behavior: Nothing should be downloaded from URLs formed as base URL + part of a data:image URL

Actual behavior: HTML is downloaded as yH5BAEAAAAALAAAAAABAAEAAAIBRAA7_1.html

Additional Information

The downloaded HTML contains an extra space between data:image/gif;base64, and R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7

<picture>
 <source media="(max-width: 559px)"
   srcset="data:image/gif;base64, R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7 1x,
           data:image/gif;base64, R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7 2x">

I'm sorry, I cannot provide the real URL. It's an internal URL on the customer side.

s0ph1e commented Nov 3, 2020

Hi @nikolay-borzov 👋

Thank you for reporting this issue. I will try to reproduce it.

stale bot commented Jan 3, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Jan 3, 2021
s0ph1e added maybe-later and removed wontfix labels Jan 3, 2021

aivus commented Apr 10, 2021

The space is added because of a bug in the srcset module (sindresorhus/srcset#1).
However, this bug was fixed in v3.0.0.

Upgrading to this major version of srcset requires Node.js 10+, so we need to drop support for older Node.js versions here. I've created #438 for this.
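
For reference, a comma-aware split can keep a data URL in one piece. The sketch below is loosely modeled on the candidate-splitting idea in the HTML srcset parsing rules: a URL is a run of non-whitespace characters, and a comma only ends a candidate when it trails the URL or follows the descriptors. The function parseSrcsetCandidates is a hypothetical name, descriptors are skipped rather than validated, and this is not the implementation used in srcset v3.0.0.

```javascript
// Hypothetical simplified splitter; NOT the srcset v3.0.0 implementation.
function parseSrcsetCandidates(input) {
  const urls = [];
  let i = 0;
  while (i < input.length) {
    // Skip whitespace and commas between candidates.
    while (i < input.length && /[\s,]/.test(input[i])) i++;
    if (i >= input.length) break;

    // The URL is the run of non-whitespace characters, commas included.
    const start = i;
    while (i < input.length && !/\s/.test(input[i])) i++;
    let url = input.slice(start, i);

    if (url.endsWith(',')) {
      // Trailing commas end the candidate; it has no descriptors.
      url = url.replace(/,+$/, '');
    } else {
      // Skip descriptors (for example "1x") up to the next comma.
      while (i < input.length && input[i] !== ',') i++;
    }
    urls.push(url);
  }
  return urls;
}

const sample = 'data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7';
console.log(parseSrcsetCandidates(`${sample} 1x, ${sample} 2x`));
```

With this approach each candidate URL still contains 'data:image', so a urlFilter like the one in the report would reject it as a whole.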

aivus mentioned this issue Apr 11, 2021