New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

a ppx_deriving plugin would be super-useful #7

Closed
gasche opened this Issue Jun 15, 2017 · 10 comments

Comments

Projects
None yet
5 participants
@gasche
Contributor

gasche commented Jun 15, 2017

When I look at @yomimono's heroic code, it appears clear that without a metaprogramming way to generate Crowbar generators for algebraic datatypes to be enumerated in "the obvious way" (with annotations to override special cases, etc.), it can currently be very painful to deploy Crowbar for a library.

Do some people here have plans for such a plugin?

I looked around a bit and there already exist at least one ppx_deriving plugin for random generation, @paurkedal's ppx_deriving_random. Unfortunately, because of the difference in type signatures for generators and generator language in general, I don't think that it would be possible to reuse it directly. It should be a very good start for hacking a crowbar generator, though.

(In general the distance between Crowbar-style generation interfaces and Quickcheck-style generation interface is an uncanny zero, so it sounds like a nice factorization should be within reach, but right now I'm tempted to try Crowbar on stuff and I would really like even a separate/specialized generator.)

@paurkedal

This comment has been minimized.

Show comment
Hide comment
@paurkedal

paurkedal Jun 18, 2017

The Crowbar interface seems to nicely map from a PPX similar to ppx_deriving_random. In particular, there are no distribution-related parameters, which simplifies things.

I think the thing which most limits the usefulness of PPX for random number generation is the need of controlling the distributions. Some generation of some ADTs diverges without the [@weight ...] annotations, and more generally one might want them for balancing the generated trees. For elementary types, it may e.g. be useful to control whether an int is generated as Random.int (1 lsl Random.int n) for some n or Random.bits (), etc. Stilll, one can save a lot of coding on the more straight forward cases, which is particularly nice for testing.

paurkedal commented Jun 18, 2017

The Crowbar interface seems to nicely map from a PPX similar to ppx_deriving_random. In particular, there are no distribution-related parameters, which simplifies things.

I think the thing which most limits the usefulness of PPX for random number generation is the need of controlling the distributions. Some generation of some ADTs diverges without the [@weight ...] annotations, and more generally one might want them for balancing the generated trees. For elementary types, it may e.g. be useful to control whether an int is generated as Random.int (1 lsl Random.int n) for some n or Random.bits (), etc. Stilll, one can save a lot of coding on the more straight forward cases, which is particularly nice for testing.

@gasche

This comment has been minimized.

Show comment
Hide comment
@gasche

gasche Dec 3, 2017

Contributor

So @yomimono has a first prototype of ppx_deriving plugin at https://github.com/yomimono/ppx_deriving_crowbar , but as I understand there are issues with stack overflows caused by deep non-tail-call calls when generating deep values at recursive types.

Contributor

gasche commented Dec 3, 2017

So @yomimono has a first prototype of ppx_deriving plugin at https://github.com/yomimono/ppx_deriving_crowbar , but as I understand there are issues with stack overflows caused by deep non-tail-call calls when generating deep values at recursive types.

@yomimono

This comment has been minimized.

Show comment
Hide comment
@yomimono

yomimono Dec 3, 2017

Collaborator

It's true! (Both things.) The stack issue isn't specific to users of ppx_deriving_crowbar but it's much easier to trigger without understanding why when automatically deriving a generator from a type.

In addition to the repository itself, there's a proof-of-concept which automatically derives the types in parsetree.mli at https://github.com/yomimono/ocaml-test-omp. Attempting to run this example illustrates the stack problem well :)

Collaborator

yomimono commented Dec 3, 2017

It's true! (Both things.) The stack issue isn't specific to users of ppx_deriving_crowbar but it's much easier to trigger without understanding why when automatically deriving a generator from a type.

In addition to the repository itself, there's a proof-of-concept which automatically derives the types in parsetree.mli at https://github.com/yomimono/ocaml-test-omp. Attempting to run this example illustrates the stack problem well :)

@gasche

This comment has been minimized.

Show comment
Hide comment
@gasche

gasche Dec 4, 2017

Contributor

(cc @Armael who was looking at the stack-overflow thing tonight, and would need a repro case.)

Contributor

gasche commented Dec 4, 2017

(cc @Armael who was looking at the stack-overflow thing tonight, and would need a repro case.)

@stedolan

This comment has been minimized.

Show comment
Hide comment
@stedolan

stedolan Dec 4, 2017

Owner

Is there still a stack overflow after @yomimono's #16? (which is merged on the master branch)

Owner

stedolan commented Dec 4, 2017

Is there still a stack overflow after @yomimono's #16? (which is merged on the master branch)

@yomimono

This comment has been minimized.

Show comment
Hide comment
@yomimono

yomimono Dec 4, 2017

Collaborator

Yes. #16 isn't sufficient for very branchy and recursive structures like the generators we get from the parsetree types.

Collaborator

yomimono commented Dec 4, 2017

Yes. #16 isn't sufficient for very branchy and recursive structures like the generators we get from the parsetree types.

@Armael

This comment has been minimized.

Show comment
Hide comment
@Armael

Armael Dec 4, 2017

Contributor

What about writing the generators in CPS? This would not fix the distribution issues, but at least ensure we do not stack overflow (at the price of probably slower generators).
I had a look and could try to submit a PR if this is considered a good idea.

Contributor

Armael commented Dec 4, 2017

What about writing the generators in CPS? This would not fix the distribution issues, but at least ensure we do not stack overflow (at the price of probably slower generators).
I had a look and could try to submit a PR if this is considered a good idea.

@stedolan

This comment has been minimized.

Show comment
Hide comment
@stedolan

stedolan Dec 4, 2017

Owner

I don't think CPS-style is the answer. Generators should never generate test cases deep enough to stack overflow - bugs tend to be reproducible with small examples, and smaller examples test faster. The bug is that Crowbar does not bound the size of test cases well, because I wrote awful sizing logic. (Once again, crowbar proves to be an excellent tool for finding bugs in crowbar).

#16 fixed the worst one, and #18 fixes more. With both, the o-m-p tests now run without stack overflows (on my machine...), but I wouldn't be surprised if there were other sizing bugs.

Owner

stedolan commented Dec 4, 2017

I don't think CPS-style is the answer. Generators should never generate test cases deep enough to stack overflow - bugs tend to be reproducible with small examples, and smaller examples test faster. The bug is that Crowbar does not bound the size of test cases well, because I wrote awful sizing logic. (Once again, crowbar proves to be an excellent tool for finding bugs in crowbar).

#16 fixed the worst one, and #18 fixes more. With both, the o-m-p tests now run without stack overflows (on my machine...), but I wouldn't be surprised if there were other sizing bugs.

@paurkedal

This comment has been minimized.

Show comment
Hide comment
@paurkedal

paurkedal Dec 4, 2017

I agree hard limits may be needed to guard against stack overflow, but if there is a distribution issue then the coverage will be very poor, so I think it needs to be addressed as well. I didn't solve this in ppx_deriving regexp, but pushed the issue onto the user by adding the [@weight ] attribute to manually tweak the distribution.

If we recall linear algebra class, we can consider what it takes for random sampling of a recursive data type to converge. Given a set of mutually recursive algebraic types tᵢ and let nᵢⱼₖ be the number of occurrences of type tₖ in the jth branch of tᵢ. If we sample branche j of tᵢ with probability pᵢⱼ, then the expectation value sᵢ of the size of tᵢ becomes

sᵢ = ∑ⱼₖ pᵢⱼ nᵢⱼₖ sₖ + bᵢ

where bᵢ is contributions from non-recursive sub-types. This is a linear algebra system

∑ₖ aᵢₖ sₖ = bᵢ   where aᵢₖ = δᵢₖ - ∑ⱼ pᵢⱼ nᵢⱼₖ

That system is not always solvable, and also if any of the sizes comes out negative, I gather the distribution must be divergent. E.g. the just-divergent t₁ = 1/2 unit + 1/2 (t₁ * t₁) gives a₁₁ = 0, which is the singular 1 by 1 matrix. If we tip the weights into the fully divergent t₁ = 1/3 unit + 2/3 (t₁ * t₁), then a₁₁ = -1/3 and s₁ = -3.

In the case where manual tweaking of weights is acceptable, a convergence check could be done at compile time. But it may also be possible to automatically find suitable weights. If we expand the co-recursion and collect same-order terms into a single t = p₀ unit + p₁ t + p₂ (t * t) + ... then we need ∑ pₙ n < 1. Considering some non-normalised cₙ ∝ pₙ, then this is (∑ cₙ n) / (∑ cₙ) < 1. Using a geometric form cₙ = xⁿ, I get the convergence condition x < 1/2 using the formulas from the Wikipedia page on Geometric Series. Note, however, that if terms of the type is missing, we can't include them in the calculation. If the unit term is missing, it's easily seen that (∑ cₙ n) / (∑ cₙ) >= 1, so the generator will always diverge, as expected. Missing higher terms should be okay.

But please do check my calculations, as I was just sketching this out while writing.

Clarification: By t = 1/2 unit + 1/2 (t * t) I mean a type like type t = A of unit | B of t * t sampled with equal probability for each branch. Non-recursive non-unit constructor arguments are omitted in the last paragraph, as they have no bearing on convergence.

paurkedal commented Dec 4, 2017

I agree hard limits may be needed to guard against stack overflow, but if there is a distribution issue then the coverage will be very poor, so I think it needs to be addressed as well. I didn't solve this in ppx_deriving regexp, but pushed the issue onto the user by adding the [@weight ] attribute to manually tweak the distribution.

If we recall linear algebra class, we can consider what it takes for random sampling of a recursive data type to converge. Given a set of mutually recursive algebraic types tᵢ and let nᵢⱼₖ be the number of occurrences of type tₖ in the jth branch of tᵢ. If we sample branche j of tᵢ with probability pᵢⱼ, then the expectation value sᵢ of the size of tᵢ becomes

sᵢ = ∑ⱼₖ pᵢⱼ nᵢⱼₖ sₖ + bᵢ

where bᵢ is contributions from non-recursive sub-types. This is a linear algebra system

∑ₖ aᵢₖ sₖ = bᵢ   where aᵢₖ = δᵢₖ - ∑ⱼ pᵢⱼ nᵢⱼₖ

That system is not always solvable, and also if any of the sizes comes out negative, I gather the distribution must be divergent. E.g. the just-divergent t₁ = 1/2 unit + 1/2 (t₁ * t₁) gives a₁₁ = 0, which is the singular 1 by 1 matrix. If we tip the weights into the fully divergent t₁ = 1/3 unit + 2/3 (t₁ * t₁), then a₁₁ = -1/3 and s₁ = -3.

In the case where manual tweaking of weights is acceptable, a convergence check could be done at compile time. But it may also be possible to automatically find suitable weights. If we expand the co-recursion and collect same-order terms into a single t = p₀ unit + p₁ t + p₂ (t * t) + ... then we need ∑ pₙ n < 1. Considering some non-normalised cₙ ∝ pₙ, then this is (∑ cₙ n) / (∑ cₙ) < 1. Using a geometric form cₙ = xⁿ, I get the convergence condition x < 1/2 using the formulas from the Wikipedia page on Geometric Series. Note, however, that if terms of the type is missing, we can't include them in the calculation. If the unit term is missing, it's easily seen that (∑ cₙ n) / (∑ cₙ) >= 1, so the generator will always diverge, as expected. Missing higher terms should be okay.

But please do check my calculations, as I was just sketching this out while writing.

Clarification: By t = 1/2 unit + 1/2 (t * t) I mean a type like type t = A of unit | B of t * t sampled with equal probability for each branch. Non-recursive non-unit constructor arguments are omitted in the last paragraph, as they have no bearing on convergence.

@yomimono

This comment has been minimized.

Show comment
Hide comment
@yomimono

yomimono Mar 30, 2018

Collaborator

I've now released ppx_deriving_crowbar into opam, so I'll close this issue.

Collaborator

yomimono commented Mar 30, 2018

I've now released ppx_deriving_crowbar into opam, so I'll close this issue.

@yomimono yomimono closed this Mar 30, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment