feat: parser instead of regex for preprocessor #6611

tanhauhau · 2021-08-03T03:10:11Z

Fixes #5900, fixes #5292, fixes #4701

Replace the regex to a simple parser, allow us to fix #5292 and support #4701

allow return attributes, fixes feature: preprocessor can change tag attributes like <script lang="ts"> #5900
introduce another type for preprocessor expression instead of reusing the existing script
- the returned code for expression has to be an expression, if the preprocessor transformed it into a multiline statements, it will crash the compiler later on.

Before submitting the PR, please make sure you do the following

Prefix your PR title with [feat], [fix], [chore], or [docs].
This message body should clearly illustrate what problems it solves.
Ideally, include a test that fails without this PR but passes with it.

Tests

Run the tests with npm test and lint the project with npm run lint

Conduitry · 2021-08-03T21:56:04Z

I'm worried about using a parser for this, because we have no idea what sort of syntax people will be using in the original (unpreprocessed) code. For example, we have no idea whether the language that they're trying to preprocess in expressions has any of the same features of matching braces or strings or comments as JS does.

tanhauhau · 2021-08-04T00:20:21Z

hmm.. fair enough, would it make sense to have an option on the preprocessors to decide whether to use regex or parser?

for example:

const preprocessor = {
  style() {},
  markup() {},
  script() {},
  mode: 'parser' | 'regex' = 'regex',
};

with default as regex, which will make preprocess() backwards compatible, and preprocessors author can update on their own pace

though, i'm not sure to use the word "regex" and "parser", it feels like leaking our implementation out

and alternative maybe to specify as version number:

const preprocessor = {
  style() {},
  markup() {},
  script() {},
  version: 1,
};

dummdidumm · 2021-08-16T06:49:46Z

I like this parser very very much, because it could help us with all sorts of tooling. Preprocessing would get easier, but the ESLint and Prettier plugins could also benefit from this, and language-tools could possibly get rid of some custom top level script extractor logic. So it would be great if the parser could be available for direct use.

Regarding the preprocessor enhancement: I propose to not enhance the existing API and instead try to make it part of the new API which is discussed in this RFC: sveltejs/rfcs#56

I therefore propose to split this PR up into two pieces: One which simply makes this new parser available publicly (whether that's under svelte/preprocess or not is TBD), and once the new preprocessor API is finalized another PR where the parser is usable as part of that new API.

Another thing that would help tooling: Add a isTopLevel property to script / style entries to distinguish

<script>
   console.log('top level')
</script>
<div>
  <script>console.log('not top')</script>
</div>

dummdidumm · 2021-08-17T06:51:26Z

@pngwn might have some thoughts about this, he wrote a language-agnostic parser, too, albeit with more sophisticated parsing mechanism for the contents of example each etc and producing a complete AST in the end (https://github.com/pngwn/MDsveX/tree/master/packages/svelte-parse).

Conduitry · 2021-10-06T19:10:10Z

Regarding the ESLint plugin benefiting from this: I'm not sure what that would look like exactly, but unless it were a synchronous API (which the existing preprocessor API is not), it couldn't be used in ESLint, where all plugins need to run synchronously.

dummdidumm · 2021-10-07T08:20:40Z

The parser itself is synchronous and could help ESLint etc to for example find out where the top level script/style tags are - which is why I asked if we could split this PR into two, one for the parser itself because it would be immediately useful in itself, and the preprocessing changes independent of that.

baseballyama · 2021-11-30T09:59:19Z

I may not understand something regarding this PR.
However, I have a small idea for finding top-level <script> / <style> tag.
So I'm posting it.
I'm sorry if I'm saying fundamentally wrong or any considerations were omitted.

I think this process could be made easier by adding a rule that <script>/ <style> tags must always be at the beginning of a line.
This is breaking changes but I don't think it will affect most users.
And I wrote about how to notice it to affected users at the bottom.

This way, perhaps only regular expressions, or a very simple parser that is completely language-independent, will be able to extract each part correctly.

The below (maybe only) case doesn't handle properly, but I think it's enough to put it in the documentation because this is a super edge case.

<sctipt>.
  const str = `
<script>
  const hoge = "hoge";
</script>
`;
</script>

It would be nice to have some kind of warning and a way to disable it (like ) in case a user unintentionally writes an invalid <script> / <style> tag.

And to do this, we need to traverse an entire code before separating it for finding <script> / <style> tags that are not at the beginning of a line.
Is there currently a mechanism for this?

By the way,
For the part that extracts expressions, for example, if user use pug, the notation will be as follows.
(Just I copied from web.)
if statements doesn't have {, }.
Therefore current parser may doesn't run as expected if my understand is correct.

 ul
   +if('posts && posts.length > 1')
     +each('posts as post')
       li
         a(rel="prefetch" href="blog/{post.slug}") {post.title}
     +else()
       span No posts :boom:

bluwy · 2021-12-04T09:31:39Z

@baseballyama I'm not sure if checking for <script> at the first of line would be a robust solution. The format of the code shouldn't change its semantic meaning, so I think a simple parser would work better here.

Re pug, it shouldn't be an issue since the parser would always run on pure Svelte code. Pug would be preprocessed into html before Svelte parses it.

baseballyama · 2021-12-04T14:58:38Z

@bluwy

Thank you for your comment!

I'm not sure if checking for <script> at the first of line would be a robust solution. The format of the code shouldn't change its semantic meaning, so I think a simple parser would work better here.

I add an additional explanation of my idea.

Point1

This is just IMO, but the parser doesn't know anything other than <script> / </script> and <style> / </style> strings before splitting them into script and style and markup parts.
This is because it doesn't know what language is used.
Thus, the parser needs to rely only on the <script> / </script> string and the <style> / </style> string for parsing.

According to GitHub issues, Svelte users sometimes write <script> or <style> strings inside <script> or <style> or markup part.
However, there doesn't seem to be a case for writing them at the beginning of a line.
This is because there is a context that it is a real <script> / <style> tag.

Point2

From another perspective, I think we can recognize <script> and <style> not as HTML tags, but just separators that represent sections of a .svelte file.
Then it is ergonomically natural to write the separator at the beginning of the line, like # in markdown.

For these two reasons, it may make sense to introduce the rule of my idea,
But anyway, if it is true that this rule can simplify the parsing process, I thought it would be a good idea to consider it.
And if the maintainers think that a simple parser is better, of course, I will follow it!

Re pug, it shouldn't be an issue since the parser would always run on pure Svelte code. Pug would be preprocessed into html before Svelte parses it.

For instance, I had understood that there is a use case where language-tools want to use this before pre-processing.
Sorry if I'm wrong. (I need to learn more about it.😅)
And I totally agree with you about the Svelte compiler, I don't think it will be a problem since it is already pre-processed.

bluwy · 2021-12-05T18:31:17Z

@baseballyama I'm still not sure if it's possible with the idea you suggested. I agree that we can find the start tag easily with this assumption, but the closing tag is still hard to match. Re point2, determining whether they are a separator is another tricky one to solve. Markdown's # works because it must start on a new line, Svelte however doesn't mind where the tag is at as long as it's proper HTML, for example a Svelte file written only in one line. But if I'm still misunderstanding your point, feel free to try out your idea 😅

Re language tools, I'm not familiar with the code, but it looks like it's for stripping out the tags before svelte.parse so it doesn't error out. @dummdidumm may know more about that.

If you have anymore questions (that might be off-topic), It'd be great to have them in the Svelte discord contributing channel. Would be nice to see you there too!

parser instead of regex for preprocessor

24b8d45

tanhauhau force-pushed the tanhauhau/preproces-parse branch from 7d10174 to 24b8d45 Compare August 3, 2021 03:15

fix test

094a42a

dummdidumm mentioned this pull request Aug 16, 2021

Preprocessing API rework sveltejs/rfcs#56

Open

bluwy mentioned this pull request Sep 19, 2021

svelte-kit package errors due to not removing lang="ts" sveltejs/kit#2450

Closed

bluwy mentioned this pull request Nov 13, 2021

ParseError: Unexpected end of input #6923

Open

tanhauhau added this to the one day milestone Mar 1, 2023

benmccann changed the title ~~[feat] parser instead of regex for preprocessor~~ feat: parser instead of regex for preprocessor Mar 14, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: parser instead of regex for preprocessor #6611

feat: parser instead of regex for preprocessor #6611

tanhauhau commented Aug 3, 2021 •

edited

Loading

Conduitry commented Aug 3, 2021

tanhauhau commented Aug 4, 2021 •

edited

Loading

dummdidumm commented Aug 16, 2021 •

edited

Loading

dummdidumm commented Aug 17, 2021

Conduitry commented Oct 6, 2021

dummdidumm commented Oct 7, 2021

baseballyama commented Nov 30, 2021

bluwy commented Dec 4, 2021

baseballyama commented Dec 4, 2021 •

edited

Loading

bluwy commented Dec 5, 2021

feat: parser instead of regex for preprocessor #6611

Are you sure you want to change the base?

feat: parser instead of regex for preprocessor #6611

Conversation

tanhauhau commented Aug 3, 2021 • edited Loading

Before submitting the PR, please make sure you do the following

Tests

Conduitry commented Aug 3, 2021

tanhauhau commented Aug 4, 2021 • edited Loading

dummdidumm commented Aug 16, 2021 • edited Loading

dummdidumm commented Aug 17, 2021

Conduitry commented Oct 6, 2021

dummdidumm commented Oct 7, 2021

baseballyama commented Nov 30, 2021

bluwy commented Dec 4, 2021

baseballyama commented Dec 4, 2021 • edited Loading

Point1

Point2

bluwy commented Dec 5, 2021

tanhauhau commented Aug 3, 2021 •

edited

Loading

tanhauhau commented Aug 4, 2021 •

edited

Loading

dummdidumm commented Aug 16, 2021 •

edited

Loading

baseballyama commented Dec 4, 2021 •

edited

Loading