Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support comments in JSON #2049

Merged
merged 5 commits into from Jan 28, 2022
Merged

Support comments in JSON #2049

merged 5 commits into from Jan 28, 2022

Conversation

kurtmckee
Copy link
Contributor

This adds support for comments in JSON. Valid JSON documents are not affected, but if comments appear in the document they will be parsed as comments now.

This added support doesn't claim to support any JSON supersets like JSONC or JSON5. It's just comments. 馃槃

Example, rendered in Monokai:

image

Please let me know if there's anything that I need to cleanup or improve. Thanks!

@not-my-profile
Copy link
Contributor

I would expect the JSON lexer to parse strictly according to the JSON spec (with comments resulting in Error tokens). So I think it would be better to introduce a new JSON+comments lexer, based on the JSON lexer.

@Anteru
Copy link
Collaborator

Anteru commented Jan 26, 2022

I'd like to better understand why. We're not here to educate people on correct JSON -- what's the advantage of splitting it out in yet another flavor of JSON? If people don't know if it's JSON or not, they can always use the JavaScript lexer, but having to select between 5 flavors of JSON hardly seems like the most user-friendly solution (and it would require us to implement JSON5, JSON C, etc.) I'd like to understand your motivation and why you think we should flag JSON with comments as Error.

@not-my-profile
Copy link
Contributor

We're not here to educate people on correct JSON

Sure but I'd argue that we're also not here to misguide people (into thinking that comments are part of JSON).

it would require us to implement JSON5, JSON C, etc

This PR turns the lexer for JSON into a lexer for "JSON with Comments" (as established by Microsoft's VS Code). Introducing "JSON with comments" as a separate lexer changes nothing about whether or not we need to support JSON5. In any case supporting JSON5 would require a dedicated lexer. (Unless you're really advocating that the JSON lexer should handle all extensions of JSON ... but I don't think you are).

having to select between 5 flavors of JSON hardly seems like the most user-friendly solution

I agree but I think the proper way to address this would be to implement proper lexer guessing. I suggested a solution for that in #2005 (btw haven't heard back from you yet regarding my last reply).

@kurtmckee
Copy link
Contributor Author

@not-my-profile,

I'd argue that we're also not here to misguide people (into thinking that comments are part of JSON).

I hear you. Many years ago I bought an MP3 player and eagerly started testing to confirm that it was following the MP3 spec...but I discovered it was able to play an animated GIF as cover art. That offended me, so I threw it in the trash. Other people might not have been so lucky, and might have assumed that comments are part of JSON animated GIFs are part of the MP3 spec.

Look, I'm not going to be drawn into a pharisaical discussion about spec purity. People, companies, and software products are adding comments to JSON documents, and this patch will address those real-world situations. It's okay if the patch isn't merged, but I'm not going to defend the lexer on this one point while you gloss over the fact that the existing lexer also doesn't validate that there are no trailing commas after final array items.

@not-my-profile
Copy link
Contributor

Thanks for the sarcasm, I think it really adds to the discussion.

I never said that we shouldn't address the fact that people use extensions of JSON, I just think that there's a better way to do so.

while you gloss over the fact that the existing lexer also doesn't validate that there are no trailing commas after final array items

I indeed wasn't aware of that. I also think that emitting error tokens for such trailing commas would be preferable.

@Anteru
Copy link
Collaborator

Anteru commented Jan 26, 2022

Hey, please don't get too wound up about this. I very much appreciate the contributions both of you have made and I would not want this to end in bad blood.

Ultimately, the way Pygments is working is that it's not validating any input language, nor was it ever a goal of Pygments to do so (unless @birkenfeld has different goals in mind.) We have plenty of lexers which accept slightly more relaxed code for the sake of user convenience. It's also a question of discoverability -- a naive user might try JSON, see it works, move along, but not know how to find JSON+comments or select it if it doesn't. I'm leaning towards accepting this change as-is to reduce user pain at no (obvious) downside. I do hear your concern about highlighting invalid JSON but I'm afraid that ship has sailed unfortunately. As with every project used in the real-world, we'll have to make some decisions which will leave folks unhappy. I hope you can accept that and continue to contribute (and yes, I'm aware I didn't reply on the other thread yet -- I don't have a good solution for that either, nor did I spent sufficient time on said issue to give a good reply :/.)

@not-my-profile
Copy link
Contributor

I hope you can accept that

Of course, no worries. If the JsonLexer handles more than JSON, I think it should be documented in its docstring.

@birkenfeld
Copy link
Member

@Anteru is correct - this library is not meant to validate. Marking unrecognized things as error is not an attempt at validation, it is a refusal to guess more than anything else.

So since we have a good, and in many cases valid, interpretation of this syntax, it is worth supporting it without having another slightly different lexer, IMO.

Of course, it should be prominent in the docstring.

@kurtmckee
Copy link
Contributor Author

kurtmckee commented Jan 26, 2022

@not-my-profile I apologize for the sarcasm. That stemmed from something unrelated to this discussion, and I'm sorry for it.

@birkenfeld I'll update the doc string to reflect that comments are also supported.

@Anteru
Copy link
Collaborator

Anteru commented Jan 26, 2022

Thanks everyone!

@kurtmckee
Copy link
Contributor Author

It looks like some random input triggered an issue in the Elpi lexer:

! KeyboardInterrupt: interrupted ElpiLexer.get_tokens(): test_content='v{+MKj-AwTQIn:HAZiJ|23C;BwQHBoG)bOjlWg0ts5\'vRzu2+<i=v.fddOcMj+lQ8Bp$-V9rsephX(osSZPkMMX[#H_OLY\\GB]bVN2Df((TOqx}$&xF%@~I8Z\';&}{:uv_6`PGRbaVp/,fvK!;}t&D:Nx&0~Enn`a6\x7f3l(wP>-3w#%G]:#\x7f)-4F?^{@D3$1YUUt6kJ>_z@egLZ[jp6Y9&{^w`yKmy<sC*<cNS?x=#7q\x7f!/4|3j?11P5dN@GimkS?\'90^e=7Ba$2iU@fRu[\\W0IcJn*^)$\x7f\\ZY}C|\\pAaU<bulP]!H|L6;S/m.+,inoaE*\'C="Ho5?T"V5hAK)fWF5\'QM>JocLF`A+z\x7f~.Ju_q-_8.F%Szhy^R"q9`zk(9CUr%s!y=mEy4~\\grT)D1hh,e78/Wkldr|,"/>E2:%E"<8gQq#KdWctXTg7]x*1VNYt04;[DrR!bLOX7[m}~IIX4*e{]>.,\n' !

I don't think this is related to the JSON changes but I'll try to get this addressed.

@kurtmckee
Copy link
Contributor Author

kurtmckee commented Jan 28, 2022

Got it. The Elpi lexer has pathological backtracking that can be triggered with the following minimal input:

from pygments.lexers.elpi import ElpiLexer

list(ElpiLexer().get_tokens("a" * 30))

Only lowercase characters will trigger pathological backtracking. Uppercase characters and digits are not a problem.

I'll open a ticket regarding this issue, but would like to address it in a separate PR.

Edit: Opened #2053 to track this.

@Anteru Anteru merged commit 22dca20 into pygments:master Jan 28, 2022
@Anteru Anteru added this to the 2.12.0 milestone Jan 28, 2022
@Anteru Anteru added the changelog-update Items which need to get mentioned in the changelog label Jan 28, 2022
@Anteru Anteru self-assigned this Jan 28, 2022
@Anteru
Copy link
Collaborator

Anteru commented Jan 28, 2022

Thanks (for both!)

@kurtmckee kurtmckee deleted the json-comments branch January 31, 2022 14:42
@Anteru Anteru removed the changelog-update Items which need to get mentioned in the changelog label Feb 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants