srcset source termination #419

PRGfx · 2018-02-28T20:42:45Z

What happened?

srcset URLs are not always correctly terminated:
not passing a (optional) descriptor to the source string (see srcset) results in an additional discovered URL containing the next character which can be a , , a %22 (") or %27 (').
As a result such image URLs will not be ignored as they don't match a file extension.

What should have happened?

The returned URLs should match the input in count and characters (see examples below).

Steps to reproduce the problem

I got a fresh clone and added these tests to test/discovery.js:

appended comma

it("should not append comma", function() {
    discover("<img srcset='https://example.com/image1.png, https://example.com/image2.png'>", {
        url: "https://example.com/"
    })
    .should.eql(["https://example.com/image1.png", "https://example.com/image2.png"]);
});

Result is ["https://example.com/image1.png,", "https://example.com/image2.png", "example.com/image1.png"]. Note however that the result was okay with relative URLs.

appended quote

it("should work with sourcesets", function() {
    var links = discover("<picture>" +
        "<source srcset='https://example.com/image1.svg, https://example.com/image2.svg' media='(min-width: 123px)'>" +
        "<source srcset='https://example.com/image3.svg' media='(min-width: 456px)'>" +
        "<img src='https://example.com/image4.svg'>" +
        "</picture>");
    links.should.be.an("array");
    links.should.eql([
        "https://example.com/image1.svg",
        "https://example.com/image2.svg",
        "https://example.com/image3.svg",
        "https://example.com/image4.svg"
    ]);
});

Result is ["https://example.com/image4.svg", "https://example.com/image1.svg,", "https://example.com/image2.svg", "https://example.com/image3.svg", "https://example.com/image1.svg", "https://example.com/image2.svg%27"]. Using relative URLs in this example only returns image 4, 1 and 2%27. While this example of course is semantically wrong, using two images in the first source only serves to illustrate both described errors. Removing one source from the srcset still adds a %27-appended URL though.

A non-greedy matcher in crawler.js:341 seems to resolve that problem.

The text was updated successfully, but these errors were encountered:

Fixes #441 and #419

fredrikekelund · 2018-10-31T08:08:52Z

Hi @PRGfx, it looks like this was due to the regexp on this line. When simplecrawler was being more actively maintained, our standard answer used to be that the default discovery mechanism would rather grab a few false positives than be too prescriptive about which resources were discovered. Since this was such a simple fix though, I went ahead and pushed a fix.

Fixes simplecrawler#441 and simplecrawler#419

fredrikekelund added a commit that referenced this issue Oct 30, 2018

Minor improvements to default discoverRegex

f36d822

Fixes #441 and #419

fredrikekelund mentioned this issue Oct 30, 2018

Minor improvements to default discoverRegex #442

Merged

fredrikekelund closed this as completed Oct 31, 2018

gombosg pushed a commit to gombosg/simplecrawler that referenced this issue Dec 11, 2018

Minor improvements to default discoverRegex

7c84671

Fixes simplecrawler#441 and simplecrawler#419

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

srcset source termination #419

srcset source termination #419

PRGfx commented Feb 28, 2018

fredrikekelund commented Oct 31, 2018

srcset source termination #419

srcset source termination #419

Comments

PRGfx commented Feb 28, 2018

What happened?

What should have happened?

Steps to reproduce the problem

appended comma

appended quote

fredrikekelund commented Oct 31, 2018