Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider RegExp template tag instead #45

Closed
littledan opened this issue Jan 18, 2021 · 27 comments
Closed

Consider RegExp template tag instead #45

littledan opened this issue Jan 18, 2021 · 27 comments

Comments

@littledan
Copy link
Member

I would prefer if, rather than pursuing a low-level RegExp.escape feature, we would work on a high-level RegExp templating feature. This version would solve the user-facing problem more directly, avoiding the need to concatenate RegExps in the result, and can help engines avoid re-parsing (by just parsing the shared part once). If any part of RegExp syntax ends up requiring context-dependent escaping, a template constructor could resolve that in a way that context-independent RegExp.escape cannot.

Such a feature could look like this, to find the value of the variable x followed by the value of y with arbitrary whitespace in between:

RegExp.build`${x}\s+${y}/m`

Here, flags are provided after a / within the RegExp (which is of course optional).

@ljharb
Copy link
Member

ljharb commented Jan 18, 2021

That was the expressed preference 5 years ago, and in that time, no delegate nor anybody in userland seems to have developed a template tag, as a proposal or a library, and userland seems quite convinced that the existing solution does already solve the user-facing problem.

Do you have an example of a library that attempts to solve this? If there are none, I think that does serve as evidence that it would not, in fact, add any value. If there are some, it'd be great to learn about them.

@littledan
Copy link
Member Author

littledan commented Jan 18, 2021

Well, I think we should encourage people who want to do work in TC39 to develop this. I think the only evidence we have right now is that it hasn't been popularized yet; we're allowed to go beyond what popular libraries do.

@ljharb
Copy link
Member

ljharb commented Jan 18, 2021

Indeed, but the problem that needs solving - and has needed solving for much longer than the 5 years since this proposal was rejected - has been consistently and completely solved by the low-level feature contained in this proposal. No need has manifested for anything more.

@littledan
Copy link
Member Author

Yes, I agree that this problem is extremely important to solve.

@oliverfoster
Copy link
Contributor

oliverfoster commented Jan 18, 2021

History of the tag vs escape quarrel: Present the proposal again?, Results of TC39 presentation, specifically comments such as this, this, this, this, this, this

For what it's worth from me: We currently have a working solution. One in the hand vs two in the bush.

@benjamingr
Copy link
Collaborator

I would prefer if, rather than pursuing a low-level RegExp.escape feature, we would work on a high-level RegExp templating feature.

Talking to users, we have found that almost everyone wants a RegExp.escape for the simple cases and no one asked for RegExp.tag, users I've spoken with found .tag very confusing as well.

@littledan
Copy link
Member Author

It would be great if the proposal's README summarized these explanations. I was unaware of some of this context. I would be especially interested in hearing about developers who are confused by RegExp.tag--this is news to me.

To be clear, I'm not opposed to RegExp.escape. I was trying to motivate people to work on a higher-level construct. You've given good context for explaining why that's not a good idea. It's more important to solve this problem one way or another, than in the way that I aesthetically prefer.

@ljharb
Copy link
Member

ljharb commented Jan 20, 2021

@littledan this proposal is 5 years old. The agenda item is not a proposal, it’s a discussion topic, during which context will be provided. Once this proposal is being actually proposed, then it will contain sufficient context in its documents.

@littledan
Copy link
Member Author

I suggest you add an FAQ topic to explain the context here as well, to help people get prepared for the meeting and avoid repeated questions along the lines I made.

@benjamingr
Copy link
Collaborator

I've made a PR here: #46 just to start a FAQ, contributions welcome.

@gibson042
Copy link
Contributor

gibson042 commented Jan 28, 2021

Do you have an example of a library that attempts to solve this? If there are none, I think that does serve as evidence that it would not, in fact, add any value. If there are some, it'd be great to learn about them.

I know of at least XRegExp.build.

@benjamingr
Copy link
Collaborator

benjamingr commented Jan 28, 2021

Wow 9 years ago slevithan/xregexp@47b338d - also CC @LeaVerou

Edit: also relevant: slevithan/xregexp#103

Edit2: Note that XRegExp.tag does not actually perform context sensitive escaping - it just calls .escape on every substitution.

@ljharb
Copy link
Member

ljharb commented Jan 28, 2021

It would be really interesting (but likely impossible) to know how much of xregexp usage is .build.

@leo60228
Copy link

I think that having both a RegExp.escape method that tries to handle as many escaping scenarios as possible and a RegExp.literal method providing a simple interface for interpolation would be the best route.

RegExp.escape could possibly be done by having a second argument specifying the regex up until the string being escaped, and if that argument is null or missing it would escape for all possible contexts.

@benjamingr
Copy link
Collaborator

@ljharb just a note - xregexp.build doesn't actually do what the suggested thing here does.

@leo60228 yeah and to be clear an .escape proposal doesn't stop any future .tag/.build efforts at all.

@phpnode
Copy link

phpnode commented Feb 27, 2021

This well-intentioned but entirely wrong idea derailed this proposal 5 years ago and we ended up without a solution at all. Please let's not do this again.

@benjamingr
Copy link
Collaborator

benjamingr commented Feb 27, 2021

@phpnode the way the stage process works is that the committee considers alternatives. I think the best path forward is to make a good faith effort to consider alternatives.

The more I talk to users and language committees of other languages the more I lean towards .escape and not a tagged-template - but I am one (relatively inexperienced in language design) developer. By the way note that the two APIs are not mutually exclusive and nothing prevents .tag to land at some point in the future anyway.

I am frustrated by how things went 5 years ago, given the amount of work I put into this proposal back then. I think a better approach back then would have been a longer and more patient dialog with TC39 to understand how to push this* forward.

Worst case (for people who need RegExp.escape as the API) - if the committee decides against .escape then WHATWG members I spoke with were very optimistic about our ability to add this to the web at which point I will also add it to Node.js. It would be unfortunate to get another cross-platform web API like AbortController rather than something that's part of the language - so I feel strongly that we should explore TC39 standardisation first and do the hard work of getting buy-in from delegates.

*this - an API for escaping regular expressions, not necessarily RegExp.escape or a tagged template.

@gibson042
Copy link
Contributor

gibson042 commented Mar 13, 2021

Do you have an example of a library that attempts to solve this? If there are none, I think that does serve as evidence that it would not, in fact, add any value. If there are some, it'd be great to learn about them.

Another example: https://runkit.com/tolmasky/templated-regular-expression-example

const { queryString } = templateRegExp(
{
    key: /[^=#]+/,
    value: /[^&#]+/,
    parameter: /${key}=${value}/,
    queryString: /\?(${parameter})(?:&(${parameter}))*/
});

queryString.test("?a=b&c=d")

@benjamingr
Copy link
Collaborator

I'm not sure how that's a demonstration of escaping? That appears to be just a way to create a RegExp from objects (I also have one from ±6 years ago I don't maintain here: https://github.com/benjamingr/js-restructure :)).

Is this it? https://www.npmjs.com/package/templated-regular-expression

@gibson042
Copy link
Contributor

Yes, that's the package. And it's not intended as a demonstration of escaping, but rather another response to the "Do you have an example of a library that attempts to solve [high-level RegExp templating]?" question above regarding this specific issue, which @littledan raised to "solve the user-facing problem more directly, avoiding the need to concatenate RegExps".

@pygy
Copy link

pygy commented Apr 5, 2022

Letting users compose RegExps would be awesome. The RegExp syntax are optimized for small searchers, but the regular formalism makes it perfect for lexers, and that use case is very poorly served by literals. Likewise, backtracking makes sense for small searchers, but you can easily run into ReDOS when writing non-trivial parsers.

For previous art, there's alsoThere's also https://github.com/trusktr/regexr by @trusktr.

I have a personnal preference for JS-based combinators, as seen in compose-regexp. The compose-regexp functions accepts either string or RegExps as arguments, and strings are escaped. For completeness, it would require a charset builder with set operations.

The ${...} syntax for splicing in values doesn't cut it in RegExp context IMO. as both $ and {} already have meanings in RegExp context. Also, how would one interpret splicing values in character class context? That use case would be better served by a dedicated JS API IMO.

Alternatively one could imagine a syntax extension in RegExps for splicing in full patterns (the point of the regular formalism is composition, after all, that is at the core of the definition). Something like

const p1 = RegExp.quantifier("{4}", /\w/)
const p2 = RegExp.intersect(/\p{sc=Latin}/u, /\p{Uppercaser}/u)

const combined = /^(?:(?>p1)|(?>p2))$/

// compare with

const tCombined = RegExp.from`^$(?:{p1}|${p2})$`

// or with combinators

const cCombined = sequence(/^/, either(p1,p2), /$/)

Such a syntax would prevent splicing in character class and {} quantifier context at parse time. These use cases would be better served by a dedicated JS API as shown in the example above.

Edit, a bit rusty :-)

@rauschma
Copy link

rauschma commented Oct 19, 2022

  • My opinion: A template tag would be great to have but deserves its own proposal and is orthogonal to an operation for escaping.

  • This is my template tag library: https://github.com/rauschma/re-template-tag

    • It escapes string substitutions.
    • It inserts RegExp substitutions verbatim (enabling RegExp composition).
    • I recently used RegExp composition for a tokenizer and it helped tremendously.
  • A template tag would nicely complement the upcoming extended /x mode. Quoting the proposal: “While the x-mode flag can be used in a RegularExpressionLiteral, it does not permit the use of LineTerminator in RegularExpressonLiteral. For multi-line regular expressions you would need to use the RegExp constructor.”

  • I’m wondering if RegExp could play double duty as a template tag (it would have to check whether its first argument is an Array or not):

    const RE_DATE = RegExp`/^${RE_YEAR}-${RE_MONTH}-${RE_DAY}$/u`;

@cben
Copy link

cben commented Jan 11, 2023

another template tag library: https://github.com/mikesamuel/regexp-make-js

@pygy
Copy link

pygy commented Jan 12, 2023

Syntax aside, when composing, one has to take what's spliced in into account, or risk getting nonsensical results. Here's an example:

const ab = /a|b/
const c = /c/

import {re} from 're-template-tag';
console.log(re`${ab}${c}`) // /a|bc/


import {sequence} from "compose-regexp"
console.log(sequence(ab, c)) // /(?:a|b)c/

@erights
Copy link

erights commented Mar 18, 2023

I will object to any proposal that creates a new vulnerability to injection attacks. AFAICT, the means the only acceptable proposal is one based on tagged template literals.

@mikesamuel and I wrote and championed the tagged template literal proposal. We got them added to JS primarily to support writing context aware SAFE escapers of embedded languages. This is exactly the situation we're talking about here. RegExp is precisely such an embedded language, and safe escaping cannot be done in this language without context sensitivity.

At #45 (comment) , @cben links to

https://github.com/mikesamuel/regexp-make-js

This is the one I want to call everyone's attention to. It was written by @mikesamuel with substantial input from me, explicitly for the purpose of doing SAFE context aware escaping. It also accepts RegExp instances as substitution values, expressing nested matchers, not just nested data. This enables powerful composition of RegExps.

I would support a proposal based on this implementation, that preserves both its safety and its compositionality. I would also support a proposal that rejects in more odd contexts, rather than bothering to do context aware escaping for those odd contexts.

@pygy
Copy link

pygy commented Mar 18, 2023

@erights could you give an example of an injection attack enabled by a non-ttl solution ?

E.g. how would you craft an attack on compose-regexp?

Edit: looking at the issues over at https://github.com/mikesamuel/regexp-make-js, I see a problem with sequence(/a{/, 1, /}/, which currently returns /a{1}/ rather than /a{(?:)1}/. It can be patched easily (edit2: done as of v0.6.31... | edit3: as of v0.7.0 lone quantifier brackets in input are rejected as syntax errors).

Edit4: @erights beside quantifier brackets and back references, do you have other "weird" scenarios in mind?

@ljharb
Copy link
Member

ljharb commented Sep 26, 2023

This proposal advanced to stage 2 with RegExp.escape semantics. tc39/proposals@cc2b37d

@ljharb ljharb closed this as completed Sep 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests