(Original issue 1361 created by howardchris on 2017-07-11T08:45:49.801114+00:00)
The attached text file contains JSON data, downloaded from a request to google.co.uk (the text isn't actually legal JSON, but that's besides the point).
When using the JsonLexer to parse this file, the process hangs for a very long time, and the CPU of one core is maxed out.
The reason is catastrophic backtracking in the regex engine, caused by the regex used to match strings:
#!python
r'"(\\\\|\\"|[^"])*"'
I believe the problem is that the groups in the regex are not mutually exclusive, as the last group can match a backslash character.
The solution I have found is to make each group in the regex mutually exclusive, by adding a backslash character to the final group:
#!python
r'"(\\\\|\\"|[^"\\])*"'
Note that this particular regex is actually in two places (in both the simplevalue and objectvalue sections), and so both regexes should be updated.
A good description of the problem can be found here:
http://www.regular-expressions.info/catastrophic.html
After making these changes, the data in the attached file can be formatted in ~1s on my machine.
(Original issue 1361 created by howardchris on 2017-07-11T08:45:49.801114+00:00)
The attached text file contains JSON data, downloaded from a request to google.co.uk (the text isn't actually legal JSON, but that's besides the point).
When using the JsonLexer to parse this file, the process hangs for a very long time, and the CPU of one core is maxed out.
The reason is catastrophic backtracking in the regex engine, caused by the regex used to match strings:
I believe the problem is that the groups in the regex are not mutually exclusive, as the last group can match a backslash character.
The solution I have found is to make each group in the regex mutually exclusive, by adding a backslash character to the final group:
Note that this particular regex is actually in two places (in both the simplevalue and objectvalue sections), and so both regexes should be updated.
A good description of the problem can be found here:
http://www.regular-expressions.info/catastrophic.html
After making these changes, the data in the attached file can be formatted in ~1s on my machine.