Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix] support data:image URLs in prerendering #2713

Closed
wants to merge 5 commits into from

Conversation

gg187on
Copy link

@gg187on gg187on commented Oct 31, 2021

fixes: #2645

Before submitting the PR, please make sure you do the following

  • It's really useful if your PR references an issue where it is discussed ahead of time. In many cases, features are absent for a reason. For large changes, please create an RFC: https://github.com/sveltejs/rfcs
  • This message body should clearly illustrate what problems it solves.
  • Ideally, include a test that fails without this PR but passes with it.

Tests

  • Run the tests with pnpm test and lint the project with pnpm lint and pnpm check

Changesets

  • If your PR makes a change that should be noted in one or more packages' changelogs, generate a changeset by running pnpx changeset and following the prompts. All changesets should be patch until SvelteKit 1.0

@changeset-bot
Copy link

changeset-bot bot commented Oct 31, 2021

⚠️ No Changeset found

Latest commit: 0d5bfd9

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@bluwy
Copy link
Member

bluwy commented Nov 1, 2021

Would be nice to add a test case for this as well

@benmccann
Copy link
Member

Discussion on Discord about whether we might need to do HTML parsing to make this work: https://discord.com/channels/457912077277855764/781266015601295360/904411015083606106

@gg187on
Copy link
Author

gg187on commented Nov 6, 2021

there is long discusion https://discord.com/channels/457912077277855764/781266015601295360/906016984267886682 about using combination of more regexes with JS logic... and spliting it to more steps.

I liked idea of using something like Syberag suggested

function extractData(source) {
    let getTag = /<(a|link|img|source)\b/g
    let getAttributes = /(?<name>\w+)\s*(?:=(?:"([^"]*)"|'([^']*)|([^>]+)\b))?/g
    let tags = source.matchAll(getTag)

    return Array.from(tags).map(tag => {
        let start = tag.index
        let subSource = source.substr(start + tag[0].length) // Issue is probably here, because I take the document until the end instead of just the tag
        let attributes = Array.from(subSource.matchAll(getAttributes))
        return {
            tag: tag[1],
            attributes: attributes.map(attr => ({name: attr.groups.name, value: attr[2]||attr[3]||attr[4]}))}
    })
}

even tho, this doesn't work correctly, but with this approach I think, we can make performance fast, dependency less and less buggy solution.

On the other hand, JSdom is something, other devs know more likely, and would be able to contribute in future with less effort to understand code.


E: I also got one really dumb idea of after matching tag, extract it, serialize, and then character after character read it, where we would put some boolean to variable, if attribute value is closed, and if not, ignore > character. Then extract part in href/src/srcset attributes and call await visit() on that URL when needed. (aka, when it's URL, not data:uri) ... But not sure about performance, I think it would be too slow.


E: seems parse5 solution will be good for solving this issue.

@Rich-Harris
Copy link
Member

My understanding from the Discord thread is that the issue here is attributes containing the > character. It feels like the proper solution here is for attribute values to be properly escaped by Svelte, so that we are able to use regexes safely — parsing the HTML would be a hugely expensive change. Does that feel like a good approach?

@gg187on
Copy link
Author

gg187on commented Nov 19, 2021

My understanding from the Discord thread is that the issue here is attributes containing the > character. It feels like the proper solution here is for attribute values to be properly escaped by Svelte, so that we are able to use regexes safely — parsing the HTML would be a hugely expensive change. Does that feel like a good approach?

In input, we get string, that contains HTML code... but in this string how we can detect which > character to escape, and which not? Resp. do we escape all of them? If so, how will that regex match, what is tag, what attribute, and where tag ends?

Now imagine You have this crazy input:

<img data-something=" src=" crossorigin src="dog>cat.png"> crazy link </img>

It means, that " src=" is not attribute name, but value. Regex is chomsky type 3 grammar, while HTML is type 2.
We write regex, that will find something between " and "... what we will get in match:

  1. src=
  2. crossorigin src=
  3. dog>cat.png

And that's when there is not another tag before or after. And there can be even nested attributes inside attributes... not just <img src="data:img/svg+xml,<svg >"> but what if there is in svg another attribute with data:uri? Yes it is escaped, now if we escape another > we will not recognize, which is ending which tag.

If we match anything between < and > we get:

img data-something=" src=" crossorigin src="dog -> our issue now.

When we want to read src value, we get:

  1. crossorigin src=
  2. dog

those both values are incorrect, we want dog>cat.png

I'm not sure how we would be able to fix it with escaping, when we still needs to find where to escape > character and where not. Also there is still a lot of other issues related to how we parse it with regex. For example: #2742

where @Conduitry answered #2742 (comment)

I think, because regex can't cover all cases of HTML language parsing, I'm sure it would be good to rewrite it, and use parse5 parser instead. It would fix a lot of bugs, and even those, we don't know about today.

I think parse5 solution will give us a lot of advantages, both now and in future. And it will make code even more maintainable, as regex is just not readable.

Ofc, maybe I'm incorrect, and it is possible to escape it, I don't totally understand this, but still I think, parse5 is better solution.

@si3nloong si3nloong mentioned this pull request Nov 26, 2021
5 tasks
@gg187on
Copy link
Author

gg187on commented Dec 6, 2021

I see there is good solution with parse5 in PR #2923 ... that fix this problem. I think I can close this issue in favor of #2923 , I was writting solution with parse5 too, but I think it's not better than solution in that PR, so I think, it's better to merge that PR.

@vercel
Copy link

vercel bot commented Dec 6, 2021

@gg187on is attempting to deploy a commit to the Svelte Team on Vercel.

A member of the Team first needs to authorize it.

@gg187on gg187on closed this Dec 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

adapter-static data:image URL error 404
4 participants