[fix] support data:image URLs in prerendering #2713

gg187on · 2021-10-31T15:16:48Z

fixes: #2645

Before submitting the PR, please make sure you do the following

It's really useful if your PR references an issue where it is discussed ahead of time. In many cases, features are absent for a reason. For large changes, please create an RFC: https://github.com/sveltejs/rfcs
This message body should clearly illustrate what problems it solves.
Ideally, include a test that fails without this PR but passes with it.

Tests

Run the tests with pnpm test and lint the project with pnpm lint and pnpm check

Changesets

If your PR makes a change that should be noted in one or more packages' changelogs, generate a changeset by running pnpx changeset and following the prompts. All changesets should be patch until SvelteKit 1.0

changeset-bot · 2021-10-31T15:16:51Z

⚠️ No Changeset found

Latest commit: 0d5bfd9

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

bluwy · 2021-11-01T02:47:44Z

Would be nice to add a test case for this as well

benmccann · 2021-11-03T13:33:39Z

Discussion on Discord about whether we might need to do HTML parsing to make this work: https://discord.com/channels/457912077277855764/781266015601295360/904411015083606106

gg187on · 2021-11-06T13:08:35Z

there is long discusion https://discord.com/channels/457912077277855764/781266015601295360/906016984267886682 about using combination of more regexes with JS logic... and spliting it to more steps.

I liked idea of using something like Syberag suggested

function extractData(source) {
    let getTag = /<(a|link|img|source)\b/g
    let getAttributes = /(?<name>\w+)\s*(?:=(?:"([^"]*)"|'([^']*)|([^>]+)\b))?/g
    let tags = source.matchAll(getTag)

    return Array.from(tags).map(tag => {
        let start = tag.index
        let subSource = source.substr(start + tag[0].length) // Issue is probably here, because I take the document until the end instead of just the tag
        let attributes = Array.from(subSource.matchAll(getAttributes))
        return {
            tag: tag[1],
            attributes: attributes.map(attr => ({name: attr.groups.name, value: attr[2]||attr[3]||attr[4]}))}
    })
}

even tho, this doesn't work correctly, but with this approach I think, we can make performance fast, dependency less and less buggy solution.

On the other hand, JSdom is something, other devs know more likely, and would be able to contribute in future with less effort to understand code.

E: I also got one really dumb idea of after matching tag, extract it, serialize, and then character after character read it, where we would put some boolean to variable, if attribute value is closed, and if not, ignore > character. Then extract part in href/src/srcset attributes and call await visit() on that URL when needed. (aka, when it's URL, not data:uri) ... But not sure about performance, I think it would be too slow.

E: seems parse5 solution will be good for solving this issue.

Rich-Harris · 2021-11-11T23:24:18Z

My understanding from the Discord thread is that the issue here is attributes containing the > character. It feels like the proper solution here is for attribute values to be properly escaped by Svelte, so that we are able to use regexes safely — parsing the HTML would be a hugely expensive change. Does that feel like a good approach?

gg187on · 2021-11-19T10:52:28Z

My understanding from the Discord thread is that the issue here is attributes containing the > character. It feels like the proper solution here is for attribute values to be properly escaped by Svelte, so that we are able to use regexes safely — parsing the HTML would be a hugely expensive change. Does that feel like a good approach?

In input, we get string, that contains HTML code... but in this string how we can detect which > character to escape, and which not? Resp. do we escape all of them? If so, how will that regex match, what is tag, what attribute, and where tag ends?

Now imagine You have this crazy input:

<img data-something=" src=" crossorigin src="dog>cat.png"> crazy link </img>

It means, that " src=" is not attribute name, but value. Regex is chomsky type 3 grammar, while HTML is type 2.
We write regex, that will find something between " and "... what we will get in match:

src=
crossorigin src=
dog>cat.png

And that's when there is not another tag before or after. And there can be even nested attributes inside attributes... not just <img src="data:img/svg+xml,<svg >"> but what if there is in svg another attribute with data:uri? Yes it is escaped, now if we escape another > we will not recognize, which is ending which tag.

If we match anything between < and > we get:

img data-something=" src=" crossorigin src="dog -> our issue now.

When we want to read src value, we get:

crossorigin src=
dog

those both values are incorrect, we want dog>cat.png

I'm not sure how we would be able to fix it with escaping, when we still needs to find where to escape > character and where not. Also there is still a lot of other issues related to how we parse it with regex. For example: #2742

where @Conduitry answered #2742 (comment)

I think, because regex can't cover all cases of HTML language parsing, I'm sure it would be good to rewrite it, and use parse5 parser instead. It would fix a lot of bugs, and even those, we don't know about today.

I think parse5 solution will give us a lot of advantages, both now and in future. And it will make code even more maintainable, as regex is just not readable.

Ofc, maybe I'm incorrect, and it is possible to escape it, I don't totally understand this, but still I think, parse5 is better solution.

gg187on · 2021-12-06T08:27:05Z

I see there is good solution with parse5 in PR #2923 ... that fix this problem. I think I can close this issue in favor of #2923 , I was writting solution with parse5 too, but I think it's not better than solution in that PR, so I think, it's better to merge that PR.

vercel · 2021-12-06T08:50:45Z

@gg187on is attempting to deploy a commit to the Svelte Team on Vercel.

A member of the Team first needs to authorize it.

[fix] support data:image URLs in prerendering

c06d9cc

better solution for data: hrefs.

691c5dd

benmccann mentioned this pull request Nov 3, 2021

[chore] refactor is_rel_external regex #2725

Closed

si3nloong mentioned this pull request Nov 26, 2021

[fix] prerender with parse5 #2923

Closed

5 tasks

gg187on added 3 commits December 6, 2021 09:40

tests

5c03da0

test pattern parsing HTML tags.

ad3dc54

formating

0d5bfd9

gg187on closed this Dec 6, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[fix] support data:image URLs in prerendering #2713

[fix] support data:image URLs in prerendering #2713

gg187on commented Oct 31, 2021 •

edited

changeset-bot bot commented Oct 31, 2021 •

edited

bluwy commented Nov 1, 2021 •

edited

benmccann commented Nov 3, 2021

gg187on commented Nov 6, 2021 •

edited

Rich-Harris commented Nov 11, 2021

gg187on commented Nov 19, 2021 •

edited

gg187on commented Dec 6, 2021 •

edited

vercel bot commented Dec 6, 2021

[fix] support data:image URLs in prerendering #2713

[fix] support data:image URLs in prerendering #2713

Conversation

gg187on commented Oct 31, 2021 • edited

Before submitting the PR, please make sure you do the following

Tests

Changesets

changeset-bot bot commented Oct 31, 2021 • edited

⚠️ No Changeset found

bluwy commented Nov 1, 2021 • edited

benmccann commented Nov 3, 2021

gg187on commented Nov 6, 2021 • edited

Rich-Harris commented Nov 11, 2021

gg187on commented Nov 19, 2021 • edited

gg187on commented Dec 6, 2021 • edited

vercel bot commented Dec 6, 2021

gg187on commented Oct 31, 2021 •

edited

changeset-bot bot commented Oct 31, 2021 •

edited

bluwy commented Nov 1, 2021 •

edited

gg187on commented Nov 6, 2021 •

edited

gg187on commented Nov 19, 2021 •

edited

gg187on commented Dec 6, 2021 •

edited