Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial commit for MCFunction Lexer + tests #2107

Merged
merged 15 commits into from
Apr 24, 2022
Merged

Conversation

RitikShah
Copy link
Contributor

@RitikShah RitikShah commented Apr 7, 2022

Hi, I'm a newer open-source contributor so hopefully I've dotted my eyes and crossed my ts.

I'm committing 2 lexers for the mcfunction language (and the snbt data format) used in the popular game, Minecraft. This lexer was loosely based off of work done by @Arcensoth here, which is currently being used to syntax highlight mcfunction on Github (and most text editors). This lexer mostly strays away from the line-based lexing approach that's currently on Github, which allows it to work for commands and data to span multiple lines cleanly.

This lexer is designed for the Java Edition of the game (you can read more about the language on the wiki), however, it should also function for the Bedrock Edition.

I've included 1 full examplefiles test alongside some key snippets to help cover the entire lexer. Potentially, we can add more to help cover the lexer if needed.


I've tried my best to keep a approach to all of the regex, but I'm a little unsure on some of the detailed cases. Particularly, I spun out the snbt data format (read more here), since it's quite useful to highlight this data on it's own, but I ran into some weird handling trying to reuse the same lexer when needed inside the language. This is mostly due to the fact that both JSON and SNBT are very similar, yet, distinct data formats that the game uses (sometimes, inside of each other).

I also had some struggles with the "selector" state management, I feel like it could be simplified somewhat. Hopefully, someone else can take a look and provide suggestions.

I appreciate anyone taking a look at this PR! Since Github has mcfunction support, it's common to see it used in codeblocks in documentation. I'm hoping this lexer will be useful for those using Sphinx alongside their minecraft project.

Thanks!

Copy link
Contributor

@jeanas jeanas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a first round of review, will take another look later. This looks quite good-quality overall.

pygments/lexers/mcfunction.py Outdated Show resolved Hide resolved
pygments/lexers/mcfunction.py Outdated Show resolved Hide resolved
pygments/lexers/mcfunction.py Outdated Show resolved Hide resolved
pygments/lexers/mcfunction.py Outdated Show resolved Hide resolved
pygments/lexers/mcfunction.py Outdated Show resolved Hide resolved
pygments/lexers/mcfunction.py Outdated Show resolved Hide resolved
pygments/lexers/mcfunction.py Outdated Show resolved Hide resolved
pygments/lexers/mcfunction.py Outdated Show resolved Hide resolved
pygments/lexers/mcfunction.py Outdated Show resolved Hide resolved
pygments/lexers/mcfunction.py Outdated Show resolved Hide resolved
@jeanas
Copy link
Contributor

jeanas commented Apr 7, 2022

There are a bunch of linting errors in the CI (https://github.com/pygments/pygments/runs/5868270378?check_suite_focus=true), be sure to check and fix them (run the tool locally with REGEXLINT=/path/to/regexlint make regexlint after cloning https://github.com/pygments/regexlint).

@RitikShah
Copy link
Contributor Author

RitikShah commented Apr 7, 2022

Thanks so much for these comments. Since you mentioned you wanted to take a closer look, I wasn't sure if I should implement your first-look suggestions or leave it static.

Also, I had some issues with literals in the MCFunctionLexer (particularly floats and numbers). In the commands syntax, we have a coordinate concept which allows you to specify different aspects of relative through preceding operators (the ~ and ^).

tp @s 100.5 80 -100.5
tp @s 10 ~ -10
tp @s ~10 ~ ~-10
tp @s 10 ^0.5 -10
tp @s 10 ^-0.5 -10

For some reason, my literals and operators doesn't play nice with each other, but I'm not entirely sure why!

(Edit: Oh, regexlint is nice, I ran make all and I thought that was good).

@jeanas
Copy link
Contributor

jeanas commented Apr 7, 2022

Since you mentioned you wanted to take a closer look, I wasn't sure if I should implement your first-look suggestions or leave it static.

It'd be helpful for me (and likely for other maintainers if they want to review this as well) if you started doing the updates now.

For some reason, my literals and operators doesn't play nice with each other, but I'm not entirely sure why!

Will take a look when I review this again.

Copy link
Contributor Author

@RitikShah RitikShah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I accidentally started a review, but I have to submit review to get those comments posted?
Edit: Working on some updates!

],

"list": [
(r"(?<=\[)[BIL](?:\;)", Keyword.Type),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NBT can have special typed Int, Byte and Long arrays which start with the type:

fake_command {ints: [I; 1, 2, 3], bytes: [B; 0b, 1b, 2b], longs: [L; 1234L, 2134L, 3245L]}

I was ensuring that this type definition only appears at the start.
(accidentally clicked Start A Review, oops)

"compound": [
# this handles the unquoted snbt keys
# note: stringified keys still work
(r"[A-z_]+", Name.Attribute),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mhm, this is tricky. theoretically, "yes", but it must be quoted (i think). U can even have stuff like . and : as a key here, but you must quote the key, which makes it almost JSON-esque. However, if something is quoted, it'll match as a literal anyways so it should be fine here.

@RitikShah
Copy link
Contributor Author

RitikShah commented Apr 8, 2022

My latest push has many things broken still (working on the refactor). I've actually fixed my issues with the float and int literals (still need to get 2.2e10 working).

I'm working on refactoring the NBT aspect of the lexer. When I use the SNBTLexer inside of the MCFunctionLexer, I'm unsure how to properly "pop" out.

In my MCFunctionLexer, the root state includes a nbt state which has the following rules:

"nbt": [
    (r"\{", Punctuation, "nbt.compound"),
    (r"\[", Punctuation, "nbt.list"),
],
"nbt.compound": [
    (r".+", using(SNBTLexer, state=("compound",))),
    default("#pop"),
],
"nbt.list": [
    (r".+", using(SNBTLexer, state=("list",))),
    default("#pop"),
],

And then my SNBTLexer root state was changed from a suggestion earlier to:

"root": [
    (r"\{", Punctuation.SNBT.Start, "compound"),   # The token has a special name for debugging
    (r"[^\{]+", Text.TEST),
],

With using, I pass in a state which starts the SNBTLexer off at a specific state (which works). The issue occurs when this state eventually pops since it puts me back in the SNBTLexer root state instead of the original MCFunctionLexer nbt.compound state (which would default pop). I took a look at the method and it looks like if you pass in a string, it automatically puts in the root state (this was odd, I feel like it should be in the docstring), so I passed in a tuple with 1-element to try to avoid that but it still occurred.

I think I'm not understanding how using works which is why I'm unable to solve this. I feel like I want to be able to just push the compound state from SNBTLexer onto the stack ((r"\{", Punctuation, SNBTLexer.push_state("compound")), like this) but the using helper only allows me to substitute a token where I can classify subtokens within one regex rule.

I think I need to add some extra rules in the state with using and do something like pushing and popping brackets, but it feels very odd since that behavior is already implemented in SNBTLexer.


Alternatively, I can maybe combine selector and snbt parsing (also json since snbt works for that anyways). I didn't want to do this at first, but it might be better in the long-run since the structure for the selectors and similar structures (@a[attribute=value], minecraft:block[attribute=value], item{key: value}), I can just let : and = be valid separators?

Edit: Currently trying this idea..

@jeanas
Copy link
Contributor

jeanas commented Apr 10, 2022

using() is basically intended for parsing embedded code in another language within code in a main language (think <script> in HTML and the like). It's not intended for use cases where you have two similar languages that should share definitions. For that, the primary tool is subclassing. See https://pygments.org/docs/lexerdevelopment/#subclassing-lexers-derived-from-regexlexer.

@RitikShah
Copy link
Contributor Author

Just pushed my latest, this has generic property which treats keys and values the same across every interface of mapping (and lists).

I think this is the best way to handle the various property maps across this strange language. I think I might use the subclassing method to handle the literals the same across the SNBTLexer and the MCFunctionLexer.

tests/snippets/mcfunction/commenting.txt Outdated Show resolved Hide resolved
pygments/lexers/mcfunction.py Outdated Show resolved Hide resolved
pygments/lexers/mcfunction.py Outdated Show resolved Hide resolved
pygments/lexers/mcfunction.py Outdated Show resolved Hide resolved
pygments/lexers/mcfunction.py Show resolved Hide resolved
pygments/lexers/mcfunction.py Show resolved Hide resolved
pygments/lexers/mcfunction.py Outdated Show resolved Hide resolved
@birkenfeld
Copy link
Member

Just a few more comments - also the mapfile seems to be out of date, so please run map mapfiles once more. Thanks!

@RitikShah
Copy link
Contributor Author

RitikShah commented Apr 11, 2022

I'm getting closer to a completed lexer here (looks like I still failed tests?). At the moment, I'm mostly worried about my handling of some literals.

I think the route I went with a generic property map was successful, hopefully I did fine with my usage of state (I tried to avoid state inflation by ensuring I popped as I went). I actually used a small debug script in the get_tokens_unprocessed to append the state stack to each token (literally just directly using __getattr__) so it would be easier to debug. I'm not sure if there was an easier way to trace through the lexer, perhaps it could be interesting to have a debug mode where metadata is attached to each token (the current stack perhaps).

Edit: Tests failed bc I stupidly changed \S -> . by accident.

@RitikShah
Copy link
Contributor Author

This time, I made sure to pass all of the regex tests and ran make all. Hopefully everything turns right this time 😅

@Anteru Anteru merged commit 8ad1bea into pygments:master Apr 24, 2022
@Anteru Anteru added the changelog-update Items which need to get mentioned in the changelog label Apr 24, 2022
@Anteru Anteru self-assigned this Apr 24, 2022
@Anteru Anteru added this to the 2.12.0 milestone Apr 24, 2022
@Anteru Anteru removed the changelog-update Items which need to get mentioned in the changelog label Apr 24, 2022
@triplay-9
Copy link

triplay-9 commented Sep 11, 2023

I just cloned pygments:master and found that pygments/lexers/mcfunction.py is missing. The file disappears in 2.14.0, but is present in 2.13.0.

Oh, I found it, now it's pygments/lexers/minecraft.py. Sorry.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants