Half the generated files are empty? #43

Closed
kaby76 opened this issue Feb 18, 2022 · 6 comments

Comments

@kaby76

kaby76 commented Feb 18, 2022

I'm using the grammar here and generated tests using the latest on master:

grammarinator-process  VerilogLexer.g4 VerilogParser.g4 -o .
grammarinator-generate VerilogGenerator.VerilogGenerator  --sys-path . -d 30 -n 10 -r source_text --serializer grammarinator.runtime.simple_space_serializer

I'm not sure I understand why half of the files generated have zero character length.

@CityOfLight77

I'm facing the same issue: all the grammars I tested generated empty files, but I don't know whether that's intended or not.

@renatahodovan
Owner

Hi @kaby76 and @CityOfLight77

It's not a surprise if you look carefully at the grammar the test cases are generated from. In the case of VerilogGenerator, the start rule used in the example is source_text. Its definition in the grammar is:

// START SYMBOL
source_text
	: description* EOF
	;

It means that source_text is constructed from zero or more description elements (due to the Kleene star quantifier * after description), i.e., a Verilog parser should accept empty files.
Grammarinator does exactly the same in the opposite direction: before every iteration it rolls a die to decide whether to generate another description, so if the very first roll says stop, the result is an empty file.

Although this random decision about expanding a zero-or-more quantifier is quite useful deeper in the derivation tree to avoid infinite recursion, at the top, around the start rule, it's worth manually replacing the * with + (Kleene plus, the "one or more" quantifier) to avoid empty output files.
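To make the effect of swapping * for + concrete, here is a minimal stand-alone simulation. It does not use Grammarinator at all: `quantify` and `empty_fraction` are hypothetical helpers written for this illustration, and the fair coin (probability 1/2) is an assumption taken from the description above.

```python
import random


def quantify(rng, min_count, max_count=float("inf")):
    """Yield once per loop iteration: mandatory rounds first, then coin flips."""
    count = 0
    while count < min_count:
        yield
        count += 1
    # Assumed behaviour: keep going while a fair coin says so.
    while count < max_count and rng.random() < 0.5:
        yield
        count += 1


def empty_fraction(min_count, runs=10_000, seed=0):
    """Fraction of simulated outputs that contain zero descriptions."""
    rng = random.Random(seed)
    empty = 0
    for _ in range(runs):
        descriptions = sum(1 for _ in quantify(rng, min_count))
        if descriptions == 0:
            empty += 1
    return empty / runs


star = empty_fraction(min_count=0)  # models: description* EOF
plus = empty_fraction(min_count=1)  # models: description+ EOF
print(f"'*' start rule: {star:.1%} empty, '+' start rule: {plus:.1%} empty")
```

With the star variant roughly half of the simulated outputs come out empty, while the plus variant never produces an empty one, which matches what was reported in this issue.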

I hope this helps!

@CityOfLight77 If it doesn't solve your problem with empty files, please share the grammar and I'll look into it.

Cheers,
Reni

@kaby76
Author

kaby76 commented Feb 19, 2022

For grammarinator-generate.exe VerilogGenerator.VerilogGenerator --sys-path . -d 10 -n 100 -r source_text --serializer grammarinator.runtime.simple_space_serializer, I then used Trash to get the number of children for the source_text rule (for i in tests/*; do trparse -t gen $i 2>/dev/null | trxgrep ' /source_text/*' | trtext -c ; done > o) and made a histogram plot for the number of children in a source_text for 100 generated tests. It seems the "sampling" for the LL-derivations follows a bell curve. Why is that?

[Image: histogram of the number of children of source_text across the 100 generated tests]

@renatahodovan
Owner

Hi @kaby76

It's not a bell curve but a (1/x)^n curve (in this case (1/2)^n), which is exactly what we expect from quantifiers by definition and implementation. Quantifiers are generated according to the following pseudocode:

source_text = UnparserRule(name='source_text')
while random_decision():
    source_text += UnparserRule(name='description')

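It means that the probability of generating at least one description is 1/2, at least two descriptions is (1/2)^2, at least three is (1/2)^3, etc., i.e., (1/2)^n for at least n (and (1/2)^(n+1) for exactly n, since the loop must also stop afterwards), which is what your plot shows as well.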
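The decay can be checked numerically with a short sketch. It assumes that random_decision() in the pseudocode is a fair coin and reads "n descriptions" as "at least n descriptions"; under that assumption the chance of at least n is (1/2)^n and the chance of exactly n is (1/2)^(n+1). The helper names below are made up for this illustration.

```python
from collections import Counter
from fractions import Fraction
import random


def count_descriptions(rng):
    """Run the while-loop from the pseudocode and return the iteration count."""
    n = 0
    while rng.random() < 0.5:  # stand-in for random_decision(), assumed fair
        n += 1
    return n


def p_at_least(n):
    """Probability that the loop body runs at least n times."""
    return Fraction(1, 2) ** n


def p_exactly(n):
    """Probability of exactly n iterations: n successes, then one stop."""
    return Fraction(1, 2) ** (n + 1)


RUNS = 100_000
rng = random.Random(42)
histogram = Counter(count_descriptions(rng) for _ in range(RUNS))

for n in range(6):
    observed = histogram[n] / RUNS
    print(f"n={n}: observed {observed:.4f}, expected {float(p_exactly(n)):.4f}")
```

Each bar is about half the height of the previous one, so a histogram of the counts falls off geometrically from n = 0 rather than peaking in the middle as a bell curve would.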

@kaby76
Author

kaby76 commented Feb 21, 2022

Thanks. That explains quite a bit of what the generated code is doing. I can now follow through on what for _ in self._model.quantify(current, 0, min=0, max=inf) does.

@kaby76 kaby76 closed this as completed Feb 21, 2022
@akosthekiss
Collaborator

@kaby76 I was just about to leave a comment guiding you to models, in case you want to tweak the default "let's flip a coin" approach. You can write your own decision model with the same API as DefaultModel. Every random decision of the generated fuzzer (e.g., how to choose an alternative from A | B or how many times to iterate over *) is actually made by the model. And the default model can be replaced even from the command line using the -m or --model switch:

https://github.com/renatahodovan/grammarinator/blob/master/grammarinator/generate.py#L237-L238

As the documentation of models is incomplete (so to say), let me introduce quantify(self, node, idx, min, max). Whenever a quantifier is reached during test case generation, the model's quantify method is called in a for loop. quantify should be a generator that yields as many times as the loop is expected to iterate, which must be between min and max times (inclusive). To help quantify make the decision, the current node, for which children are being generated, is passed as an argument; e.g., node.name names the grammar rule that corresponds to the node. In addition, idx is passed as an argument; it uniquely identifies the quantifier within the rule. (E.g., in S: A* B?;, * has index 0 and ? has index 1.)
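As an illustration of the generator contract described above, here is a minimal sketch of a custom quantify that picks a uniformly random iteration count and caps it. Everything in it is an assumption for demonstration purposes: BoundedModel is a hypothetical stand-alone class (a real model would presumably subclass DefaultModel so that the other decisions keep working), and the node is faked with a SimpleNamespace so the snippet runs without Grammarinator installed.

```python
import random
from math import inf
from types import SimpleNamespace


class BoundedModel:
    """Hypothetical model: uniform iteration count, capped at `limit`."""

    def __init__(self, limit=5, seed=None):
        self._limit = limit
        self._rng = random.Random(seed)

    def quantify(self, node, idx, min, max):
        # `min` and `max` shadow the builtins, matching the signature above,
        # so the bounds are computed with plain comparisons.
        upper = max if max < self._limit else self._limit
        if upper < min:
            upper = min  # never yield fewer times than the grammar requires
        for _ in range(self._rng.randint(min, upper)):
            yield  # one yield == one iteration of the caller's for loop


# The generated fuzzer drives the model roughly like this:
node = SimpleNamespace(name="source_text")  # fake node, only .name is used
model = BoundedModel(limit=5, seed=1)
children = []
for _ in model.quantify(node, 0, min=0, max=inf):
    children.append("description")
print(len(children), "descriptions generated")
```

The cap is a design choice for the sketch: it keeps unbounded quantifiers (max=inf) from producing arbitrarily long outputs while still honouring min.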

I know that the above is a bit brief, but I hope it helps.

BTW, there is also a subclass of DefaultModel, called DispatchingModel. It simplifies tweaking the random decisions in some selected rules by writing methods named like quantify_<RULE>. E.g., in your example:

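import grammarinator.runtime


class VerilogModel(grammarinator.runtime.DispatchingModel):
    # Called only for quantifiers inside the source_text rule.
    def quantify_source_text(self, node, idx, min, max):
        # Each yield allows one loop iteration, i.e., one description.
        yield
        yield
        yield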

(This would create test cases that always contain exactly three descriptions. The rest of the quantifiers would still use the flip-the-coin approach.)
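To show how name-based dispatch of this kind can work, here is a toy re-creation that runs without Grammarinator. It is a sketch of the idea only, not the library's code: CoinFlipModel, ToyDispatchingModel and ToyVerilogModel are hypothetical names, and the fallback is an assumed fair coin flip.

```python
import random
from math import inf
from types import SimpleNamespace


class CoinFlipModel:
    """Stand-in for the default behaviour (not the library's class)."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)

    def quantify(self, node, idx, min, max):
        count = 0
        # Mandatory iterations first, then continue while the coin agrees.
        while count < min or (count < max and self._rng.random() < 0.5):
            yield
            count += 1


class ToyDispatchingModel(CoinFlipModel):
    """Looks for a method called quantify_<RULE>, else falls back."""

    def quantify(self, node, idx, min, max):
        handler = getattr(self, "quantify_" + node.name, None)
        if handler is None:
            handler = super().quantify
        yield from handler(node, idx, min, max)


class ToyVerilogModel(ToyDispatchingModel):
    def quantify_source_text(self, node, idx, min, max):
        # Exactly three iterations for quantifiers in source_text.
        yield
        yield
        yield


model = ToyVerilogModel(seed=0)
start = SimpleNamespace(name="source_text")
other = SimpleNamespace(name="description")
print(len(list(model.quantify(start, 0, 0, inf))))  # rule-specific method
print(len(list(model.quantify(other, 0, 2, 4))))    # coin-flip fallback
```

Only rules with a matching quantify_<RULE> method get the custom behaviour; every other rule goes through the fallback unchanged.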
