
Use case: Lexing/Tokenization #445

Closed
kraigher opened this issue Feb 8, 2018 · 7 comments
kraigher commented Feb 8, 2018

I am looking at writing a lexer/tokenizer using this crate. It seems RegexSet is not quite what I need, since it matches all regexes in parallel. For a lexer/tokenizer I would want to specify regexes in priority order and avoid matching a lower-priority regex if a higher-priority regex has already matched. Consider the following simple example:

// Hypothetical API: `RegexTokenizer` does not exist in the regex crate.
let tokenizer = regex::RegexTokenizer::new(&[
    "\"(.*?)\"",      // 0: string literal
    "[a-zA-Z_]+",     // 1: identifier
    "-?[1-9][0-9]*",  // 2: integer
]);

// NOTE: My specific use case requires stateful lexing,
//       where I need to switch regexes based on the previous token,
//       so I do not use an iterator over matches.
let mut start = 0;
while start < code.len() {
    // (`match` is a Rust keyword, so the hypothetical method is `find` here.)
    match tokenizer.find(&code[start..]) {
        Some(m) => {
            match m.index {
                0 => println!("string = \"{}\"", m.get(1)),
                1 => println!("identifier = \"{}\"", m.get(0)),
                2 => println!("integer = \"{}\"", m.get(0)),
                _ => unreachable!(),
            }
            start += m.len();
        }
        None => {
            println!("error");
            break;
        }
    }
}

So my question: Is there anything the regex crate can do to make this use case easier and better performing? Currently I build one single normal regex using | (the or operator) with a named group (?P<name>...) for each token type, followed by a long list of if-statements checking captures.name("nameX"). Is there a better way to solve this use case using the regex crate?
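
For reference, a minimal sketch of that alternation workaround against the real regex API; the token patterns and input string here are illustrative, and anything unmatched (e.g. whitespace) is silently skipped, which a real lexer would instead report:

use regex::Regex;

fn main() {
    // Highest-priority branch first; one named group per token type.
    let re = Regex::new(
        r#"(?P<string>"[^"]*")|(?P<ident>[a-zA-Z_]+)|(?P<int>-?[1-9][0-9]*)"#,
    )
    .unwrap();

    let code = r#"foo "bar" 42"#;
    let mut start = 0;
    while let Some(caps) = re.captures(&code[start..]) {
        if let Some(s) = caps.name("string") {
            println!("string = {}", s.as_str());
        } else if let Some(s) = caps.name("ident") {
            println!("identifier = {}", s.as_str());
        } else if let Some(s) = caps.name("int") {
            println!("integer = {}", s.as_str());
        }
        // Advance past the whole match; this also skips any unmatched prefix.
        start += caps.get(0).unwrap().end();
    }
}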

@BurntSushi (Member)

Sorry, but I don't understand your question here. A RegexSet matches all of the regexes and tells you which have matched. If you have a priority order, then you can just respect that and ignore any lower priority matches in favor of the highest priority match. The bit that I am confused about is why you can't abide lower priority regexes being involved in a match. You don't actually say why?
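
For illustration, a minimal sketch of that approach with the existing RegexSet API; the patterns are placeholders, and taking the first index yielded by iter() as the winner relies on the ascending-order behavior discussed later in this thread:

use regex::RegexSet;

fn main() {
    // Patterns listed in priority order: index 0 is the highest priority.
    let set = RegexSet::new(&[
        r#""[^"]*""#,     // 0: string literal
        r"[a-zA-Z_]+",    // 1: identifier
        r"-?[1-9][0-9]*", // 2: integer
    ])
    .unwrap();

    // Take the lowest matching index and ignore the lower-priority matches.
    if let Some(best) = set.matches("foo").iter().next() {
        println!("highest-priority match: pattern {}", best);
    }
}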

In any case, if for some reason you only want the highest priority regex to match (maybe the lower priority regexes can match a lot of text and therefore be slow?), then your solution to create an alternation is the way to go, where the highest priority regexes appear earliest in the alternation.

Saying that RegexSet is inappropriate because you are trying to use it for a lexer/tokenizer doesn't really make a lot of sense, since that's exactly one of the principal use cases for a RegexSet. So it's a really confusing claim. Instead, please provide more detail about why it isn't appropriate for the specific thing you're trying to accomplish.


kraigher commented Feb 8, 2018

Sorry if I was unclear. I think my concerns are the following:

  1. Is "unnecessary" work being performed to get a full parallel match compared to using priority information.

  2. How can I efficiently get the index of the highest-priority match? Maybe just calling .next() on the matches iterator is what I want, but is it guaranteed that the indexes appear in ascending order?

  3. How do I get the length of the match in the input string? (In my use case the start position is always 0, so that is not interesting.) It seems I can only get the index of the regex in the RegexSet that matched, but not the length of the match or its position within the input string.

@BurntSushi (Member)

Is "unnecessary" work being performed to get a full parallel match compared to using priority information.

If you ask for all possible matches, then the regex engine will do enough work to confirm or deny each match. If your tokens are reasonably exclusive, then the amount of extra work here may be negligible. I'd strongly recommend that you benchmark this and decide for yourself.

How can I efficiently get the index of the highest-priority match? Maybe just calling .next() on the matches iterator is what I want, but is it guaranteed that the indexes appear in ascending order?

Yes, as mentioned in the documentation.

How do I get the length of the match in the input string? (In my use case the start position is always 0, so that is not interesting.) It seems I can only get the index of the regex in the RegexSet that matched, but not the length of the match or its position within the input string.

The documentation discusses these limitations and the best work-arounds given the current regex implementation.
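
A hedged sketch of the work-around those docs suggest: use the RegexSet only to learn which pattern matched, then re-run that single pattern as a plain Regex to recover the offsets (the patterns and input below are illustrative):

use regex::{Regex, RegexSet};

fn main() {
    let patterns = [r#""[^"]*""#, r"[a-zA-Z_]+", r"-?[1-9][0-9]*"];
    let set = RegexSet::new(&patterns).unwrap();
    // Also compile each pattern on its own so match spans can be recovered.
    let regexes: Vec<Regex> = patterns.iter().map(|p| Regex::new(p).unwrap()).collect();

    let input = "hello";
    if let Some(i) = set.matches(input).iter().next() {
        // Re-run only the winning pattern to get its span.
        if let Some(m) = regexes[i].find(input) {
            println!("pattern {} matched at {}..{}", i, m.start(), m.end());
        }
    }
}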


kraigher commented Feb 9, 2018

Thank you for your answer.

  1. OK, I guess there is no problem here.

  2. I cannot see that the documentation says the indexes will appear in ascending order in the matches iterator. It only says that the indexes correspond to the order in the RegexSet, which I read only as meaning that index 0 corresponds to the first regex, etc. One could imagine that if regex 2 matched before regex 1, the order would be [2, 1]. Are you saying that it is a commitment of the API to provide indexes in ascending order?

  3. Yes, so it is a limitation currently. But are there plans to lift the limitation, given that it could improve the lexing/tokenization use case? I guess the performance must be better with a multi-output DFA than with running several DFAs one after another?


BurntSushi commented Feb 9, 2018

One could imagine that if regex 2 matched before regex 1, the order would be [2, 1]. Are you saying that it is a commitment of the API to provide indexes in ascending order?

Oh interesting! I see the problem now. My bad. Yes, this is definitely a documentation bug. Indices will be emitted in ascending order. I guess those docs belong on the SetMatches::iter method.

Yes, so it is a limitation currently. But are there plans to lift the limitation, given that it could improve the lexing/tokenization use case? I guess the performance must be better with a multi-output DFA than with running several DFAs one after another?

It's not like I set out to cripple RegexSet intentionally. And it's not like it's just a matter of exposing something that is currently hidden. It requires significant implementation work, and is partially a research task at this point. I've tried before and failed. I maintain this project in my free time, so I can't answer you with commitments.

BurntSushi added the bug label Feb 9, 2018

kraigher commented Feb 9, 2018

It's not like I set out to cripple RegexSet intentionally. And it's not like it's just a matter of exposing something that is currently hidden. It requires significant implementation work, and is partially a research task at this point. I've tried before and failed. I maintain this project in my free time, so I can't answer you with commitments.

I fully understand; I run an open-source project in my free time as well. I just wanted to know the intentions/roadmap in this area.

@BurntSushi (Member)

@kraigher OK, understood. The best timeline I can give you: "years, if ever."
