Skip to content
This repository has been archived by the owner on Mar 7, 2021. It is now read-only.

srcset source termination #419

Closed
PRGfx opened this issue Feb 28, 2018 · 1 comment
Closed

srcset source termination #419

PRGfx opened this issue Feb 28, 2018 · 1 comment

Comments

@PRGfx
Copy link

PRGfx commented Feb 28, 2018

What happened?

srcset URLs are not always correctly terminated:
not passing a (optional) descriptor to the source string (see srcset) results in an additional discovered URL containing the next character which can be a , , a %22 (") or %27 (').
As a result such image URLs will not be ignored as they don't match a file extension.

What should have happened?

The returned URLs should match the input in count and characters (see examples below).

Steps to reproduce the problem

I got a fresh clone and added these tests to test/discovery.js:

appended comma

it("should not append comma", function() {
    discover("<img srcset='https://example.com/image1.png, https://example.com/image2.png'>", {
        url: "https://example.com/"
    })
    .should.eql(["https://example.com/image1.png", "https://example.com/image2.png"]);
});

Result is ["https://example.com/image1.png,", "https://example.com/image2.png", "example.com/image1.png"]. Note however that the result was okay with relative URLs.

appended quote

it("should work with sourcesets", function() {
    var links = discover("<picture>" +
        "<source srcset='https://example.com/image1.svg, https://example.com/image2.svg' media='(min-width: 123px)'>" +
        "<source srcset='https://example.com/image3.svg' media='(min-width: 456px)'>" +
        "<img src='https://example.com/image4.svg'>" +
        "</picture>");
    links.should.be.an("array");
    links.should.eql([
        "https://example.com/image1.svg",
        "https://example.com/image2.svg",
        "https://example.com/image3.svg",
        "https://example.com/image4.svg"
    ]);
});

Result is ["https://example.com/image4.svg", "https://example.com/image1.svg,", "https://example.com/image2.svg", "https://example.com/image3.svg", "https://example.com/image1.svg", "https://example.com/image2.svg%27"]. Using relative URLs in this example only returns image 4, 1 and 2%27. While this example of course is semantically wrong, using two images in the first source only serves to illustrate both described errors. Removing one source from the srcset still adds a %27-appended URL though.

A non-greedy matcher in crawler.js:341 seems to resolve that problem.

@fredrikekelund
Copy link
Collaborator

Hi @PRGfx, it looks like this was due to the regexp on this line. When simplecrawler was being more actively maintained, our standard answer used to be that the default discovery mechanism would rather grab a few false positives than be too prescriptive about which resources were discovered. Since this was such a simple fix though, I went ahead and pushed a fix.

gombosg pushed a commit to gombosg/simplecrawler that referenced this issue Dec 11, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants