Look at if a generated regex parser can do better job than pugixml (performance wise) #5

mithro · 2019-07-02T21:40:32Z

pugixml is a generic XML parser. However, now that we are generating a parser, we should see whether we can do better by generating one that only parses files in the exact given XML formats. Using something like Google's re2 would be a good option for that.

mithro · 2019-07-02T21:40:57Z

See also https://rust-leipzig.github.io/regex/2017/03/28/comparison-of-regex-engines/ and https://news.ycombinator.com/item?id=14608663

duck2 · 2019-07-02T22:00:19Z

I don't think this would work. Even the non-requirement of order on the attributes will make the resulting regular expression explode combinatorially. Think of an element with 6 required attributes:

<element (attr1="([\w]*)" attr2="([\w]*)" attr3="([\w]*)" attr4="([\w]*)" attr5="([\w]*)" attr6="([\w]*)"
|attr1="([\w]*)" attr2="([\w]*)" attr3="([\w]*)" attr4="([\w]*)" attr6="([\w]*)" attr5="([\w]*)"
| [718 more permutations...])>
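
To make the growth concrete, here is a minimal Python sketch that builds the order-sensitive alternation for a list of required attributes and counts its branches. It is only an illustration: the helper name `permutation_regex` is hypothetical, and the attribute pattern is the simplified one from the example above.

```python
import itertools


def permutation_regex(names):
    """Build one alternation branch per ordering of the given attribute names.

    Returns the pattern text and the number of branches it contains.
    """
    branches = [
        " ".join(rf'{name}="([\w]*)"' for name in order)
        for order in itertools.permutations(names)
    ]
    return "<element (?:" + "|".join(branches) + ")>", len(branches)


names = [f"attr{i}" for i in range(1, 7)]  # six required attributes
pattern, branch_count = permutation_regex(names)
print(branch_count)  # 720, i.e. 6!
print(len(pattern))  # size of the generated pattern in characters
```

Each additional required attribute multiplies the branch count by the new attribute total, so the pattern grows factorially.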

mithro · 2019-07-02T22:11:31Z

Use an alternation and match it repeatedly. Something like:

<element (attr1="([\w]*)"|attr2="([\w]*)"|attr3="([\w]*)")*>

duck2 · 2019-07-02T22:18:27Z

Yes, but then we won't be checking if all required attributes are present. State machines are really bad at handling independent inputs.
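
A small Python sketch of that gap, assuming the simplified three-attribute pattern from the previous comment with whitespace between attributes made explicit (the function name `accepts` is hypothetical). The repeated alternation accepts any attribute order, but it equally accepts tags with missing or duplicated attributes.

```python
import re

# Simplified attribute pattern from the comment above; the explicit
# whitespace handling is an assumption made for this sketch.
ATTR = r'(?:attr1|attr2|attr3)="\w*"'
TAG = re.compile(rf'<element(?:\s+{ATTR})*\s*>')


def accepts(tag: str) -> bool:
    """Return True if the alternation-based pattern matches the whole tag."""
    return TAG.fullmatch(tag) is not None


print(accepts('<element attr1="a" attr2="b" attr3="c">'))  # True
print(accepts('<element attr3="c" attr1="a" attr2="b">'))  # True, order is free
print(accepts('<element attr1="a">'))                      # True, attr2 and attr3 missing
print(accepts('<element attr1="a" attr1="b">'))            # True, attr1 duplicated
```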

mithro · 2019-07-02T22:34:04Z

@duck2 - That can be done after the tag has been parsed?
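
One way this could look, as a hedged Python sketch rather than the project's actual design: the regex only recognizes the tag shape, and a separate step checks the collected attributes. The `REQUIRED` set and the function name `parse_element` are hypothetical.

```python
import re

# The regex only recognizes the overall tag shape; it does not know
# which attributes are required.
TAG_RE = re.compile(r'<element((?:\s+\w+="\w*")*)\s*>')
ATTR_RE = re.compile(r'(\w+)="(\w*)"')

# Hypothetical schema for this sketch: all three attributes are required.
REQUIRED = frozenset({"attr1", "attr2", "attr3"})


def parse_element(tag):
    """Parse one <element ...> tag, then validate its attributes."""
    m = TAG_RE.fullmatch(tag)
    if m is None:
        raise ValueError(f"not an <element> tag: {tag!r}")

    attrs = {}
    for name, value in ATTR_RE.findall(m.group(1)):
        if name in attrs:
            raise ValueError(f"duplicate attribute: {name}")
        attrs[name] = value

    # Validation happens after parsing, outside the regex.
    missing = REQUIRED - set(attrs)
    if missing:
        raise ValueError(f"missing required attributes: {sorted(missing)}")
    unknown = set(attrs) - REQUIRED
    if unknown:
        raise ValueError(f"unknown attributes: {sorted(unknown)}")
    return attrs


print(parse_element('<element attr2="b" attr1="a" attr3="c">'))
```

With this split the pattern stays linear in the number of attributes, and the required-attribute check costs one set difference per tag.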

duck2 · 2019-07-02T22:43:48Z

Need to think more. Maybe we could make something like an opinionated SAX parser out of this, the output of which can be fed to the general-purpose validators.
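
A rough Python sketch of what such an event stream might look like, under heavy assumptions: it ignores text nodes, comments, CDATA, entities and namespaces, and the names `events`, `TOKEN_RE` and `ATTR_RE` are hypothetical. It only shows regex matches being turned into SAX-style start/end events that a separate validator could consume.

```python
import re

# One pattern for opening, closing and self-closing tags.
TOKEN_RE = re.compile(
    r'<(?P<close>/)?(?P<name>\w+)(?P<attrs>(?:\s+\w+="[^"]*")*)\s*(?P<empty>/)?>'
)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def events(xml):
    """Yield ("start", name, attrs) and ("end", name, None) in document order."""
    for m in TOKEN_RE.finditer(xml):
        name = m.group("name")
        if m.group("close"):
            yield ("end", name, None)
            continue
        yield ("start", name, dict(ATTR_RE.findall(m.group("attrs"))))
        if m.group("empty"):
            # A self-closing tag produces both events back to back.
            yield ("end", name, None)


for event in events('<a x="1"><b y="2"/></a>'):
    print(event)
```

A validator downstream would then only see a flat sequence of events and would not need to know how the tokenizing was done.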

mithro · 2019-07-02T22:45:20Z

@duck2 Notice how I separated some final validation from the parsing in the example here -> #1 (comment)
