Unicode character classes #235

Open
terpstra opened this issue Dec 22, 2018 · 12 comments

Comments

@terpstra

commented Dec 22, 2018

Firstly, thanks a lot for this tool. It saved me a lot of time! I am using re2c to create a parser for an as-yet unpublished build tool. The input files are UTF-8 encoded. Everything works fine for the ASCII character set.

However, I'd like to expand my identifier space to allow Unicode letters in addition to [a-zA-Z]. Currently the only way to do this that I can see is to write a parser for UnicodeData.txt that grabs all of the letter category code points and dumps them into a giant character class. That's fine, but now I have a generator for a generator for C++. It seems like this sort of Unicode character class functionality would be more naturally supported directly in re2c itself.
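For illustration, the "generator for a generator" described above might look roughly like the sketch below. It is only a sketch under stated assumptions: the function names (`letter_ranges`, `escape`, `re2c_class`) are made up for this example, the input is a handful of sample rows in UnicodeData.txt format rather than the real file, and the choice of `\uXXXX` / `\UXXXXXXXX` escapes for the emitted class should be checked against the re2c documentation for the version in use.

```python
# Hypothetical sketch: turn UnicodeData.txt rows into one re2c character class.
# Assumes rows are sorted by code point, as in the real UnicodeData.txt.

# A few sample rows in UnicodeData.txt format (illustrative, not the full file).
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;
0043;LATIN CAPITAL LETTER C;Lu;0;L;;;;;N;;;;0063;
005F;LOW LINE;Pc;0;ON;;;;;N;SPACING UNDERSCORE;;;;
00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;LATIN SMALL LETTER E ACUTE;;00C9;;00C9
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FFF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
"""


def letter_ranges(lines, prefix="L"):
    """Return merged inclusive (first, last) ranges whose category starts with prefix."""
    ranges = []
    range_start = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        cp = int(fields[0], 16)
        name, category = fields[1], fields[2]
        if name.endswith(", First>"):
            # Large blocks are stored as a First/Last pair of rows.
            range_start = cp
            continue
        if name.endswith(", Last>"):
            first, last = range_start, cp
            range_start = None
        else:
            first = last = cp
        if not category.startswith(prefix):
            continue
        if ranges and ranges[-1][1] + 1 == first:
            # Extend the previous range when code points are adjacent.
            ranges[-1] = (ranges[-1][0], last)
        else:
            ranges.append((first, last))
    return ranges


def escape(cp):
    """Format a code point as a 4-digit or 8-digit hex escape (assumed re2c syntax)."""
    return "\\u%04X" % cp if cp <= 0xFFFF else "\\U%08X" % cp


def re2c_class(name, ranges):
    """Emit a named definition of the form `name = [...];`."""
    body = "".join(
        escape(a) if a == b else escape(a) + "-" + escape(b) for a, b in ranges
    )
    return "%s = [%s];" % (name, body)


print(re2c_class("L", letter_ranges(SAMPLE.splitlines())))
# prints: L = [\u0041-\u0043\u00E9\u4E00-\u9FFF];
```

Passing a longer prefix such as `"Lu"` would restrict the output to a subcategory, at the cost of one generated definition per category.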

I was somewhat surprised this was not already supported, so I went looking for these classes in re2c and could not find them. Apologies if this is already supported and my grep-powers were insufficient.

Thanks!

@skvadrik

Owner

commented Dec 22, 2018

Hi @terpstra , your grep was correct: re2c doesn't support syntactic aliases for Unicode character classes yet. There is no technical reason it can't do that, but you are the first to ask.

As a temporary quick workaround, I can generate and distribute together with the re2c source code an "official" file with re2c definitions of Unicode categories: unicode_categories.re.txt. This is to be included verbatim into your .re files; the name L can be used in subsequent re2c blocks to denote Unicode letters. The definitions are generated from the same scripts that generate re2c tests, so the definitions are coherent with what re2c is able to handle at the moment. The generator doesn't use UnicodeData.txt directly (though it should); it uses the Haskell Data.CharSet library.

@terpstra

Author

commented Dec 22, 2018

Thanks a lot for this! Does re2c support some form of 'include'? Dumping tables this large into a source file whose main focus is parsing distracts the reader.

Ultimately, I think users will want all the classes and subclasses in Unicode, for example the Lu class for upper-case letters. Do you think this is a good candidate for future inclusion?

@skvadrik

Owner

commented Dec 23, 2018

Does re2c support some form of 'include'?

No, but it would be useful. An initial implementation might only allow including files from the current directory (the one re2c is run from); otherwise we'd also need to support include paths.

Ultimately, I think users will want all the classes and subclasses in Unicode.

Agreed.

Do you think this is a good candidate for future inclusion?

Yes. Don't close the issue. :)

@terpstra

Author

commented Dec 25, 2018

I've noticed that "L \ Lu" in re2c v1.1.1 reports:
re2c: error: line 359, column 12: can only difference char sets

It seems that the inclusion of any value above 0x80 in a character class renders it no longer a character class.
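Until the difference operator handles such classes, one conceivable workaround is to compute the difference in whatever script generates the definitions and emit the result as a plain character class. The sketch below is an assumption about how that could be done, not something re2c or its maintainers prescribe; `subtract_ranges` is a hypothetical helper, and the two range lists are a small hand-written excerpt (ASCII and part of Latin-1), not the full L and Lu categories.

```python
# Hypothetical sketch: compute "L \ Lu" on code point ranges before emitting a class.

def subtract_ranges(a, b):
    """Return the parts of ranges `a` not covered by ranges `b`.

    Both inputs are sorted lists of non-overlapping inclusive (first, last) pairs.
    """
    result = []
    for first, last in a:
        cur = first  # lowest code point of this range not yet handled
        for b_first, b_last in b:
            if b_last < cur:
                continue  # this b range lies entirely below what is left
            if b_first > last:
                break  # this and all later b ranges lie above the a range
            if b_first > cur:
                result.append((cur, b_first - 1))  # keep the gap before the b range
            cur = b_last + 1
            if cur > last:
                break
        if cur <= last:
            result.append((cur, last))  # keep the tail after the last overlap
    return result


# Hand-written excerpt of letters (L) and upper-case letters (Lu).
L_EXCERPT = [(0x41, 0x5A), (0x61, 0x7A), (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0xFF)]
LU_EXCERPT = [(0x41, 0x5A), (0xC0, 0xD6), (0xD8, 0xDE)]

for first, last in subtract_ranges(L_EXCERPT, LU_EXCERPT):
    print("%04X-%04X" % (first, last))
# prints:
# 0061-007A
# 00DF-00F6
# 00F8-00FF
```

The resulting ranges can then be written out as an ordinary class definition, so the .re file never needs the `\` operator on classes that contain non-ASCII code points.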

@skvadrik

Owner

commented Dec 25, 2018

@terpstra I opened #236: this is a known limitation, but worth a separate issue.

@skvadrik

Owner

commented Dec 26, 2018

@terpstra Meanwhile, re2c has learnt to handle include files (b94c5af):

  • /*!include:re2c "x.re" */ works in the same way as #include "x.re" in C/C++, as if x.re was pasted verbatim in place of the directive.
  • The -I <path> option specifies search paths for included files. The default search path is the directory of the source file, e.g. if you run re2c x/y/z.re, then the default include path will be x/y/.
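
To make the "pasted verbatim" behaviour concrete, here is a toy model of it. This is not re2c's implementation: `expand_includes` is a hypothetical helper, files are held in an in-memory dictionary so that no assumption is made about the search order of include paths, and the file contents are placeholders rather than the real unicode_categories.re.

```python
# Toy model of textual inclusion; illustrative only, not how re2c is implemented.
import re

# Matches a directive of the form /*!include:re2c "name" */
INCLUDE = re.compile(r'/\*!include:re2c\s+"([^"]+)"\s*\*/')


def expand_includes(text, files, depth=0):
    """Replace each include directive with the named file's contents, recursively."""
    if depth > 16:
        raise RecursionError("include nesting too deep")

    def paste(match):
        # A function replacement is inserted literally, so backslashes survive.
        return expand_includes(files[match.group(1)], files, depth + 1)

    return INCLUDE.sub(paste, text)


# Placeholder file contents (hypothetical, heavily abbreviated).
FILES = {
    "unicode_categories.re": '/*!re2c\n    L = [a-zA-Z\\u00C0-\\u00D6];\n*/',
}

MAIN = '/*!include:re2c "unicode_categories.re" */\n/*!re2c\n    identifier = L+;\n*/\n'

print(expand_includes(MAIN, FILES))
```

After expansion the definition of `L` precedes the block that uses it, which is why the including file can refer to `L` as if it had been defined locally.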
@terpstra

Author

commented Dec 26, 2018

Nice!

Do you plan to put unicode_categories.re somewhere in the include path? For now I'm just copy-pasting it into my own symbol.re as you suggested.

@skvadrik

Owner

commented Dec 26, 2018

For now I think the best option is to copy unicode_categories.re into your source tree and then put /*!include:re2c "path/unicode_categories.re" */ in your .re file. If unicode_categories.re gets updated, at least you won't have to modify the including .re file and glue it together from pieces.

Perhaps later re2c will install these definition files in some default location, or at least in a location relative to the re2c root directory, and we'll have a "standard library" of useful regular expressions.

@fletcher

commented Jan 17, 2019

FYI -- this precompiled set of Unicode definitions is fantastic -- I needed to add support for Unicode strings to a project I started today, and found this. Made short work of an otherwise complicated problem. Thanks!

(PS-- Thanks for asking about this Brett!)

(PPS -- It goes without saying, but also to second Brett's thanks for re2c. I've been using it for a few years now and am always impressed with how easy it is to use!)

@skvadrik

Owner

commented Jan 17, 2019

Glad to hear that it works for you!
I wish the original re2c author and long-time contributors like @dnuffer read the above comment.

@terpstra

Author

commented Jan 17, 2019

Who is Brett? From the context, it sounds like you meant me.

@fletcher

commented Jan 17, 2019

@terpstra My apologies, you are right. I saw terpstra and immediately thought of Brett Terpstra since our software projects intersect at times. But that isn't you, so while my comment stands in its intent (I appreciate your asking about this!) it doesn't mean quite as much since we've never met and your name is not Brett.....

Move along.... Nothing to see here.... Just another person making an idiot of themselves on the internet... ;)

@skvadrik skvadrik referenced this issue May 23, 2019