Half the generated files are empty? #43

kaby76 · 2022-02-18T21:19:20Z

I'm using the grammar here and generated tests using the latest on master:

grammarinator-process  VerilogLexer.g4 VerilogParser.g4 -o .
grammarinator-generate VerilogGenerator.VerilogGenerator  --sys-path . -d 30 -n 10 -r source_text --serializer grammarinator.runtime.simple_space_serializer

I'm not sure I understand why half of the files generated have zero character length.

CityOfLight77 · 2022-02-19T05:11:00Z

I'm facing the same issue: with every grammar I tested, empty files were generated, but I don't know whether that is intended or not.

renatahodovan · 2022-02-19T10:29:35Z

Hi @kaby76 and @CityOfLight77

It's not a surprise if you look carefully at the grammar the test cases are generated from. In the case of VerilogGenerator, the start rule used in the example is source_text. Its definition in the grammar is:

// START SYMBOL
source_text
	: description* EOF
	;

This means that source_text is constructed from zero or more description elements (because of the Kleene star quantifier * after description), i.e., a Verilog parser should accept empty files.
Grammarinator does exactly the same in the opposite direction: before every generation it rolls a die to decide whether to generate zero or more description elements (i.e., whether to generate an empty file or not).

Although this random decision about expanding zero-or-more quantifiers is quite useful deeper in the derivation tree to avoid infinite recursion, near the start rule it is worth manually replacing the * with + (Kleene plus, the "one or more" quantifier), i.e., writing description+ EOF, to avoid empty output files.

I hope this helps!

@CityOfLight77 If it doesn't solve your problem with empty files, please share the grammar and I'll look into it.

Cheers,
Reni

kaby76 · 2022-02-19T13:10:30Z

I ran grammarinator-generate.exe VerilogGenerator.VerilogGenerator --sys-path . -d 10 -n 100 -r source_text --serializer grammarinator.runtime.simple_space_serializer, then used Trash to get the number of children of the source_text rule (for i in tests/*; do trparse -t gen $i 2>/dev/null | trxgrep ' /source_text/*' | trtext -c ; done > o) and plotted a histogram of the number of children of source_text over the 100 generated tests. It seems the "sampling" of the LL-derivations follows a bell curve. Why is that?

renatahodovan · 2022-02-20T19:07:19Z

Hi @kaby76

It's not a bell curve but a (1/x)^n curve (in this case (1/2)^n), which is exactly what we expect from quantifiers by definition/implementation. Quantifiers are generated according to the following pseudocode:

source_text = UnparserRule(name='source_text')
while random_decision():
    source_text += UnparserRule(name='description')

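It means that the probability of generating at least one description is 1/2, at least two descriptions is (1/2)^2, at least three is (1/2)^3, etc., i.e., (1/2)^n, which is what your plot shows as well. (Exactly n descriptions need n successful decisions followed by one failed one, so that probability is (1/2)^(n+1); the shape of the curve is the same.)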
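
The loop above can be simulated without Grammarinator to see the shape of the histogram. The following is only a sketch: count_descriptions, histogram, and the p_continue parameter are hypothetical names made up for this illustration, and the fair 1/2 coin is an assumption taken from the (1/2)^n figure quoted in this thread, not from the library's source.

```python
import random
from collections import Counter


def count_descriptions(rng, p_continue=0.5):
    """Mimic the pseudocode: add one child per successful random decision.

    p_continue is an assumed stand-in for random_decision(); 0.5 models
    a fair coin flip.
    """
    n = 0
    while rng.random() < p_continue:
        n += 1
    return n


def histogram(samples=100_000, seed=0):
    """Count how often each number of children occurs over many runs."""
    rng = random.Random(seed)
    return Counter(count_descriptions(rng) for _ in range(samples))


if __name__ == "__main__":
    hist = histogram()
    total = sum(hist.values())
    for n in range(6):
        # Exactly n children: n successes followed by one failure.
        print(n, round(hist[n] / total, 3), "expected", 0.5 ** (n + 1))
```

Under these assumptions, about half of the runs produce zero children, which matches the share of empty files reported at the top of the thread, and each further child count is roughly half as frequent as the previous one.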

kaby76 · 2022-02-21T12:18:34Z

Thanks. That explains quite a bit of what the generated code is doing. I can now follow through on what for _ in self._model.quantify(current, 0, min=0, max=inf) does.

akosthekiss · 2022-02-21T13:43:09Z

@kaby76 I was just about to leave a comment guiding you to models, in case you wanted to tweak the "let's flip a coin" default approach. You can write your own decision model that has the same API as DefaultModel. Every random decision of the generated fuzzer (e.g., how to choose an alternative from A | B or how many times to iterate over *) actually happens here. And the default model can be replaced even from the command line using the -m or --model switch:

https://github.com/renatahodovan/grammarinator/blob/master/grammarinator/generate.py#L237-L238

As the documentation of models is incomplete (so to say), let me introduce quantify(self, node, idx, min, max). Whenever a quantifier is reached during test case generation, the model's quantify method is called in a for loop. In fact, quantify should be a generator, and it should yield as many times as the loop is expected to iterate. It is expected to yield between min and max times (inclusive). To help quantify make the decision, the current node, for which children are being generated, is passed as an argument; e.g., node.name names the rule in the grammar that corresponds to the node. Moreover, idx is also passed as an argument, which uniquely identifies the quantifier within the rule. (E.g., in S: A* B?;, * has index 0, ? has index 1.)
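
To make the generator contract concrete, here is a minimal standalone sketch. StartRuleModel is a hypothetical class written for this illustration; it does not import or subclass anything from Grammarinator, it only follows the quantify(self, node, idx, min, max) signature described above, and the SimpleNamespace object is a stand-in for a real tree node. A model usable with the tool would presumably also need the other methods of DefaultModel, which are not shown.

```python
import random
from types import SimpleNamespace


class StartRuleModel:
    """Hypothetical model sketch: coin flips, but never zero at the start rule."""

    def __init__(self, start_rule="source_text", seed=None):
        self._start_rule = start_rule
        self._rng = random.Random(seed)

    def quantify(self, node, idx, min, max):
        # `min` and `max` shadow the built-ins here to match the quoted
        # signature, so plain comparisons are used instead of min()/max().
        lower = min
        if node.name == self._start_rule and idx == 0 and lower < 1:
            lower = 1  # force at least one iteration at the start rule
        if lower > max:
            lower = max  # never exceed the quantifier's upper bound
        count = 0
        while count < lower:
            yield
            count += 1
        # Beyond the lower bound, fall back to an assumed fair coin flip.
        while count < max and self._rng.random() < 0.5:
            yield
            count += 1


if __name__ == "__main__":
    node = SimpleNamespace(name="source_text")  # stand-in for a tree node
    model = StartRuleModel(seed=0)
    iterations = sum(1 for _ in model.quantify(node, 0, min=0, max=float("inf")))
    print("iterations at the start rule:", iterations)
```

Each yield corresponds to one iteration of the for loop in the generated code, so the number of yields is the number of children created for that quantifier.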

I know that the above is a bit brief, but I hope it helps.

BTW, there is also a subclass of DefaultModel, called DispatchingModel. It simplifies tweaking the random decisions in some selected rules by writing methods named like quantify_<RULE>. E.g., in your example:

class VerilogModel(grammarinator.runtime.DispatchingModel):
    def quantify_source_text(self, node, idx, min, max):
        yield
        yield
        yield

(And this would create test cases that always contained exactly three descriptions. The rest of the quantifiers would still use the flip-the-coin approach.)

kaby76 closed this as completed Feb 21, 2022
