Elusive bug seemingly involving parse #3692
Describe the bug
An elusive and persistent crash involving parse & GC.
I have prepared 2 ways to trigger the bug:
Both ways crash for me (W7, various builds of 2018 - Dec 31, Dec 30, Dec 18, Dec 7, and a few more)
I am unable to produce a smaller snippet that will trigger it for certain. Any attempt to isolate it reduces the chances of it manifesting, so at this point the best way to find out what's happening is likely to analyze the code armed with the knowledge of GC & parse internals and make educated guesses. That, or scan the aforementioned internals for allocation bugs.
What else sometimes helps me to trigger it:
The bug has been fixed. This is the detailed log of the whole debugging session, done by me and Qingtian. It took us about 3.5 hours, working in parallel for the first hour, then together on the resolution of the bug. Without the logging process, it would have taken 1 to 1.5 hours less to figure out the cause and the proper fix, and I must say it is particularly tedious to do such detailed logging (I have dropped some extra fine-grained probing cases that were not successful), though I hope it can serve as an example to other contributors of the methods we use to investigate crashes caused by particularly complex bugs that are triggered by a GC pass.
1. Reproducing the issue
The description provides steps to reproduce the issue, but using consoles instead of direct compilation of the script. So the first thing to do is try to isolate the crash from the consoles and interpreter by compiling the input script directly in debug mode.
Bingo, it can exhibit the same crash, and with the extra bonus of a much smaller call-stack:
Now let's confirm that the GC is involved by trying again with
2. Investigating possible causes of the bug
Looking at the stack trace, it crashes in Parse code, more precisely in
The only rule in the input script that matches that pattern is
So, we remove the probing code in the rule, and go look at line 499 in %parse.reds to get more clues. The referred line is:
which is very unlikely to crash, as that would mean the generated code for such an expression is wrong, which is, by experience, extremely unlikely. So, let's wrap that part of the code with some logs to confirm whether the crash really comes from there or not:
The result is inconclusive, as the nested loop is called too many times: after 5 minutes of run time, the program still hasn't halted in any way.
Anyway, imprecise error line reporting in R/S is caused by macros being expanded at block level, so new-line metadata gets messed up quickly (to be improved in the future).
Though, just above the reported error line, there is a macro lying there:
Now we get the line
Indeed, we now get an "assertion failed" instead of an "access violation". So
Though, we have an issue with probing: we are in a deeply nested loop, and anything printed there will result in too many logs, which would make it difficult to reach the right position in an acceptable time frame. We need to add some conditions to reduce the logs to a much smaller amount. We can notice that the crash always happens after 8 GC passes, as in debug mode the GC passes are reported like this:
As it never reaches more than 8 passes, we can use the internal GC passes counter to only log after the 8th pass. So the code becomes:
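The gating idea itself is independent of Red/System. As a rough illustration only, here is a minimal Python sketch of it. The `GatedProbe` class, its `gc_passes` counter and the simulated loop are all hypothetical stand-ins, not the runtime's actual counter or the probing code used in this session:

```python
class GatedProbe:
    """Record probe messages only once enough GC passes have happened."""

    def __init__(self, min_passes):
        self.min_passes = min_passes  # passes required before logging starts
        self.gc_passes = 0            # hypothetical stand-in for the GC pass counter
        self.lines = []               # captured probe output

    def on_gc_pass(self):
        # Called by the (simulated) collector after each pass.
        self.gc_passes += 1

    def probe(self, message):
        # Stay silent until the threshold is reached, so the hot loop
        # does not flood the log before the interesting region.
        if self.gc_passes >= self.min_passes:
            self.lines.append(f"[gc={self.gc_passes}] {message}")


probe = GatedProbe(min_passes=8)
for i in range(10_000):        # stand-in for the deeply nested loop
    if i % 1_000 == 999:       # pretend a GC pass happens every 1000 iterations
        probe.on_gc_pass()
    probe.probe(f"iteration {i}")

print(len(probe.lines), "probe lines kept out of 10000 calls")
```

In this simulation the first seven GC passes produce no output at all, so only the iterations closest to the failure are logged.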
Running the code now gives us:
So the culprit is
Ah, finally something interesting, that is a special value used by the GC compacting routine when compiled in debug mode, to mark the newly freed contiguous area ahead of the in-use area (the memory chunk "heap", that is available for new series allocations). So, in this last call to
So, let's dump the content of the input string buffer to see if there's anything odd by injecting a conditional
This is a typical series buffer memory dump. The header part is five 32-bit words long and is followed by the series data.
A first check to do there is to ensure that those values are consistent, i.e. that tail - offset is less than the allocated size. That's the case here.
We can notice three important pieces of information there:
Let's search for the beginning of this pattern (we take only
That matches the UCS-4 input string above. We can immediately notice that the
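This kind of sanity check can be sketched in a few lines of Python. The field order (`flags`, `node`, `size`, `offset`, `tail`), the little-endian layout and the helper name `check_series_header` are assumptions made for illustration, not a statement about the actual Red runtime structures. The sketch also accepts a completely full buffer, which is slightly more lenient than the check described above:

```python
import struct

HEADER_WORDS = 5        # per the text: the header is five 32-bit words
HEADER_FORMAT = "<5I"   # assumption: little-endian unsigned words


def check_series_header(raw):
    """Return (used, size) if the header looks consistent, else raise ValueError."""
    if len(raw) < HEADER_WORDS * 4:
        raise ValueError("buffer is shorter than a header")
    # Assumed field order, for illustration only.
    flags, node, size, offset, tail = struct.unpack_from(HEADER_FORMAT, raw)
    if tail < offset:
        raise ValueError("tail precedes offset")
    used = tail - offset  # bytes currently in use
    if used > size:
        raise ValueError(f"{used} bytes in use but only {size} allocated")
    return used, size


# A made-up header: 64 bytes allocated, 16 of them in use.
good = struct.pack(HEADER_FORMAT, 0, 0x1000, 64, 0x2000, 0x2010) + bytes(64)
print(check_series_header(good))
```

A header that fails this check points at corruption of the header itself, whereas a header that passes it (as is the case here) shifts suspicion to the data area or to the code that wrote into it.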
So let's now explore the UTF-8 conversion code in %runtime/unicode.reds. The main function for such conversion is
So let's first see if
The result is negative, that line is never printed, so
At this point, we strongly suspect an edge case problem, probably related to a string buffer expansion in
Now that we are very close to finding the root cause, Qingtian is proposing to leave the conclusion as an exercise for the reader, as dinner time is passing already....
The first hint in the code above is that
3. Providing a fix for the issue
As I vaguely remember that this conditional expression has been the cause of issues in the past, I have a quick look at the other similar functions used for upgrading internal string representations. This is what we find there:
We can see that
So we see that line 181 was modified by commit ec6b1fa. Let's git blame the previous version now:
And the version before:
Now, we have a complete history of what happened there, to shed more light on why the issue was fixed for one of the conversion functions and not the others. We can first notice that the initial code was correct, but not very clear as the
When we look at the commit that introduced the bug, we want to know if it was related to a bug fix or something else. The commit is b5e3798 and is related to the big refactoring done to move many series actions code into the parent virtual type
We can now implement a safe fix for it, and spread it over the three similar functions. After running the whole Red unit test suite to ensure that this fix causes no regression, we can finally submit it to the red/red repo and have a well-deserved dinner!