Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Symbolic dimensions allowed for Finite Random Variables #16962

Open
wants to merge 15 commits into
base: master
from

Conversation

Projects
None yet
5 participants
@czgdp1807
Copy link
Member

commented Jun 4, 2019

References to other Issues or PRs

Fixes #16961

Brief description of what is fixed or changed

Symbolic dimensions have been allowed for Die in sympy.stats.frv_types. Un-evaluated objects for Density are returned, allowing the substitution of the symbolic dimension with integers.

Other comments

I have made the change only to Die and that too for Density. Further changes will be made according to the comments received for the this approach. This approach can be easily extended to other distributions as well.
ping @oscarbenjamin @Upabjojr

Release Notes

  • stats
    • symbolic dimensions allowed for finite distributions
@sympy-bot

This comment has been minimized.

Copy link

commented Jun 4, 2019

Hi, I am the SymPy bot (v147). I'm here to help you write a release notes entry. Please read the guide on how to write release notes.

Your release notes are in good order.

Here is what the release notes will look like:

  • stats

This will be added to https://github.com/sympy/sympy/wiki/Release-Notes-for-1.5.

Note: This comment will be updated with the latest check if you edit the pull request. You need to reload the page to see it.

Click here to see the pull request description that was parsed.

<!-- Your title above should be a short description of what
was changed. Do not include the issue number in the title. -->

#### References to other Issues or PRs
<!-- If this pull request fixes an issue, write "Fixes #NNNN" in that exact
format, e.g. "Fixes #1234". See
https://github.com/blog/1506-closing-issues-via-pull-requests . Please also
write a comment on that issue linking back to this pull request once it is
open. -->
Fixes https://github.com/sympy/sympy/issues/16961

#### Brief description of what is fixed or changed
Symbolic dimensions have been allowed for `Die` in sympy.stats.frv_types. Un-evaluated objects for Density are returned, allowing the substitution of the symbolic dimension with integers.

#### Other comments
I have made the change only to `Die` and that too for `Density`. Further changes will be made according to the comments received for the this approach. This approach can be easily extended to other distributions as well.
ping @oscarbenjamin @Upabjojr 

#### Release Notes

<!-- Write the release notes for this release below. See
https://github.com/sympy/sympy/wiki/Writing-Release-Notes for more information
on how to write release notes. The bot will check your release notes
automatically to see if they are formatted correctly. -->

<!-- BEGIN RELEASE NOTES -->
* stats
  * symbolic dimensions allowed for finite distributions
<!-- END RELEASE NOTES -->

@oscarbenjamin

This comment has been minimized.

Copy link
Contributor

commented Jun 4, 2019

I tried D = Die('D', n) and found that things like E(D), P(Eq(D, 2)) or P(D<2) went into infinite loops.

Also I don't know the code in the stats module that well so I'm not sure if it makes sense to handle this case with conditional code in FinitePSpace or if it would be better to have a new class for this.

@czgdp1807

This comment has been minimized.

Copy link
Member Author

commented Jun 4, 2019

I have written in the description that changes have been made only for density. Further changes will be made as per the comments received. Thanks for bringing the issue to my notice. I will work on these and push the changes ASAP.

@czgdp1807

This comment has been minimized.

Copy link
Member Author

commented Jun 4, 2019

I think using existing classes is better as there are already too many classes in stats module making it difficult to handle.
I don't mean that it's impossible to add new classes but it's an issue which, from my point of view, can be fixed by adding some conditional checks and doing computations accordingly.

@codecov

This comment has been minimized.

Copy link

commented Jun 5, 2019

Codecov Report

Merging #16962 into master will increase coverage by 0.258%.
The diff coverage is 82.242%.

@@              Coverage Diff              @@
##            master    #16962       +/-   ##
=============================================
+ Coverage   74.126%   74.384%   +0.258%     
=============================================
  Files          620       622        +2     
  Lines       160572    160849      +277     
  Branches     37677     37754       +77     
=============================================
+ Hits        119026    119647      +621     
+ Misses       36140     35872      -268     
+ Partials      5406      5330       -76

@oscarbenjamin oscarbenjamin added the stats label Jun 5, 2019

Show resolved Hide resolved sympy/stats/tests/test_finite_rv.py Outdated
Show resolved Hide resolved sympy/stats/tests/test_finite_rv.py Outdated
@oscarbenjamin

This comment has been minimized.

Copy link
Contributor

commented Jun 8, 2019

This has merge conflicts now

@czgdp1807 czgdp1807 added the GSoC label Jun 8, 2019

@czgdp1807 czgdp1807 changed the title [GSoC] symbolic dimensions allowed for Die Symbolic dimensions allowed for Die Jun 8, 2019

Show resolved Hide resolved sympy/stats/frv.py Outdated
@oscarbenjamin

This comment has been minimized.

Copy link
Contributor

commented Jun 8, 2019

Is it possible to do anything with these symbolically sized RVs without substituting for the size?

I just tried a few things:

In [1]: from sympy.stats import *                                                                                                                                                                                  

In [2]: D = Die('D', n)                                                                                                                                                                                            

In [3]: D                                                                                                                                                                                                          
Out[3]: D

In [4]: E(D)                                                                                                                                                                                                       
Out[4]: Expectation(D)

In [5]: P(Eq(D, 1))                                                                                                                                                                                                
Out[5]: Probability(Eq(D, 1))

In [6]: P(Eq(D, 1)).subs(n, 2)                                                                                                                                                                                     
Out[6]: Probability(Eq(D, 1))

In [7]: E(D)                                                                                                                                                                                                       
Out[7]: Expectation(D)

In [8]: E(D).subs(n, 2)                                                                                                                                                                                            
Out[8]: Expectation(D)

In [9]: E(D.subs(n, 2))                                                                                                                                                                                            
Out[9]: 3/2

We know that E(D) is (n+1)/2 and that P(Eq(D, 1)) is 1/n (assuming n>=1). It seems to me that these could evaluate to something in terms of n...

@czgdp1807

This comment has been minimized.

Copy link
Member Author

commented Jun 9, 2019

@oscarbenjamin I don't think it's possible to do that because we need to traverse the set. Though we can add some formulae like by adding methods like, expectation which will respond to symbolic dimensions. However, that will be quite ad-hoc? What are your views @Upabjojr?

@czgdp1807 czgdp1807 changed the title Symbolic dimensions allowed for Die Symbolic dimensions allowed for Finite Random Variables Jun 9, 2019

@oscarbenjamin

This comment has been minimized.

Copy link
Contributor

commented Jun 9, 2019

It isn't necessary to traverse the set: the formulas are known. I don't really see the point of being able to have a Die of symbolic size n if the only way to compute anything with it is to substitute n for an integer.

It might be better to have a new class to represent these distributions that doesn't work by iterating over the sample space.

@czgdp1807

This comment has been minimized.

Copy link
Member Author

commented Jun 9, 2019

It isn't necessary to traverse the set: the formulas are known.

I have also observed that some refactoring is needed to sympy.stats.frv_types.
Though one notable point is formulae aren't generic enough.

However, that will be quite ad-hoc? What are your views @Upabjojr?

Let's wait for @Upabjojr 's comment. Infact, adding too many formulae isn't recommended according to the previously held discussions with him. I believe that symbolic_probability allows generic results.

@sidhantnagpal

This comment has been minimized.

Copy link
Member

commented Jun 10, 2019

Is it necessary to have dict define the density of a finite_rv?
How about specifying it using union with limits (ref: #9815 and #14566) or a set-builder notation like:

In [20]: imageset(i, binomial(n, i) * p**i * (1-p)**(n-i), S.Naturals0.intersect(Interval(0, n)))                                                          
Out[20]: 
⎧ x        n - x ⎛n⎞                  ⎫
⎨p ⋅(1 - p)     ⋅⎜ ⎟ | x ∊ ℕ₀ ∩ [0, n]⎬
⎩                ⎝x⎠                  ⎭

In [21]: binp = _; binp.subs(n, 9)                                                           
Out[21]: 
⎧ x        9 - x ⎛9⎞                   ⎫
⎨p ⋅(1 - p)     ⋅⎜ ⎟ | x ∊ {0, 1, …, 9}⎬
⎩                ⎝x⎠                   ⎭

In [22]: binp.args[0] # Lambda object                                                         
Out[22]: 
     x        n - x ⎛n⎞
x ↦ p ⋅(1 - p)     ⋅⎜ ⎟
                    ⎝x⎠
@czgdp1807

This comment has been minimized.

Copy link
Member Author

commented Jun 10, 2019

Is it necessary to have dict define the density of a finite_rv?

I think using dict is restricting every work of symbolic dimensions. So we should not use it.

How about specifying it using union with limits (ref: #9815 and #14566)

Quite a good idea. I will implement this tomorrow if no one objects. Though it will require some refactoring of code. Please provide your views @Upabjojr .

@oscarbenjamin

This comment has been minimized.

Copy link
Contributor

commented Jun 10, 2019

What is the point in having a set?

What you want is the expression or map that creates the set (i.e. binp.args[0]) not the set itself. You can use that to calculate the probabilities.

Note that if the set simplifies in some way then args[0] may not be a Lambda any more.

@sidhantnagpal

This comment has been minimized.

Copy link
Member

commented Jun 10, 2019

What is the point in having a set?

To substantiate the finite support of the probability distribution.

Note that if the set simplifies in some way then args[0] may not be a Lambda any more.

Right, it would be interesting to see if the expressions can remain unevaluated.

@oscarbenjamin

This comment has been minimized.

Copy link
Contributor

commented Jun 10, 2019

Right, it would be interesting to see if the expressions can remain unevaluated.

The easiest way to do that is by not using imageset.

You need two objects: the set of possible values and the function that maps those to their associated probabilities. Combining those with imageset conceptually loses the information connecting which value correspond to which probability (even if the unevaluated ImageSet still keeps them as separate objects).

For a Die you would get:

In [11]: imageset(x, 1/n, S.Naturals0.intersect(Interval(1, n+1)))              
Out[11]: 
⎧1⎫
⎨─⎬
⎩n⎭

That's a valid representation of the set but not really what is wanted here.

@czgdp1807

This comment has been minimized.

Copy link
Member Author

commented Jun 10, 2019

though one notable point is changing from dict to imageset will break the current tests and the current code of the users. I think we should use imageset only for symbolic dimensions. However only three distributions have symbolic dimensions and for those we can go for imageset and that too only one dimensions contain symbols.

@czgdp1807

This comment has been minimized.

Copy link
Member Author

commented Jun 10, 2019

Nice point #16962 (comment).
Why don't we use unevaluated ImageSet directly which will keep the information intact?

@oscarbenjamin

This comment has been minimized.

Copy link
Contributor

commented Jun 10, 2019

I think that ultimately the best approach is to reimplement all of the finite probability spaces without dicts using symbolic objects like Sum, Lambda, Range etc. Collapsing those objects down to ordinary numbers when there are known integer dimensions should be handled by the symbolic machinery. This would be analogous to the way that continuous RVs are represented but with Sum instead of Integral.

@czgdp1807

This comment has been minimized.

Copy link
Member Author

commented Jun 10, 2019

I am in agreement with @oscarbenjamin. Infact, while working with this PR I was thinking of refactoring the finite rv. Now, I think I can work on it and if it is completed I will make a PR otherwise we can think more here.

@Upabjojr

This comment has been minimized.

Copy link
Contributor

commented Jun 10, 2019

Using ImageSet is actually a really nice idea.

I would recommend to try to do some minimal intervention in your PRs, avoid refactoring all of finite_set. For example, you may add the ".dict" property to the superclass which checks if an "ImageSet" has been defined. Subclasses may override this method.

@oscarbenjamin

This comment has been minimized.

Copy link
Contributor

commented Jun 10, 2019

Something related to ImageSet could be good here but ImageSet itself is a poor fit.

In the mathematical concept of an image set there are 3 involved objects:

  1. The input set (the domain)
  2. The map (function)
  3. The output set (the image of the domain under the function)

SymPy's ImageSet is created by providing the input set and the map. However as an object it represents the output set and that is not what is wanted here (see the Die example above).

What is wanted here are the other two objects: the input set and the domain. It would be possible to define a new object (a symbolic "dict") that represents the combination of those two but it's also possible just to keep them as separate objects: a Set and a Lambda.

So ImageSet should not be used but the related concept of having a Set and a Lambda could work.

Show resolved Hide resolved sympy/stats/frv_types.py Outdated

@czgdp1807 czgdp1807 changed the title Symbolic dimensions allowed for Finite Random Variables [WIP] Symbolic dimensions allowed for Finite Random Variables Jun 11, 2019

@czgdp1807

This comment has been minimized.

Copy link
Member Author

commented Jun 11, 2019

I am working on some changes keeping in my mind all the comments received. Will update the PR as soon as I am done.

@czgdp1807

This comment has been minimized.

Copy link
Member Author

commented Jun 12, 2019

I have been experimenting with some changes SingleFiniteDistribution, FinitePSpace. I have some queries and some suggestions.

  1. What's the use of characteristic_function, moment_generating_function in SingleFiniteDistribution? : The computations for MGF and CF are duplicated in SingleFiniteDistribution and FinitePSpace. The former class has characteristic_function and moment_generating_function and later class has, compute_characteristic_function and compute_moment_generating_function for computations. The corresponding interface functions, i.e., characteristic_function and moment_generating_function in sympy.stats.rv use the methods in FinitePSpace. So,why the duplication in SingleFiniteDistribution?

  2. Would it be good to have pmf rather than pdf? : To me the notion of pdf for distributions in frv_types.py sounds non-intuitive. They are creating complications in result generation. We can replace pdf with pmf with easier definitions and use Sum in frv.py to generate results.

  3. Currently, the use of dict restricts the usage of symbolic dimensions. For example, in compute_cdf in FinitePSpace uses dict to compute cdf and then chaches it. I would suggest another method of handling queries online by using something like, Lambda(k, Sum(self.distribution.pmf(ki), (ki, self.distributionlow, k)). Here, self.distribution.low means the lower bound of the set defined in a specific distribution. Why to use it? The reason is, for general case, set is defined as Intersection(S.Naturals0, Interval(0, n)), which don't return self.inf, self.sup. By handling queries using just in time, we will not affect the performance because, the random variables are finite, so time for computation won't be as high. Now if someone passes too high input then the current implementation will also fail.

Please provide your comments on my suggestion.

@oscarbenjamin

This comment has been minimized.

Copy link
Contributor

commented Jun 12, 2019

The set Intersection(S.Naturals0, Interval(0, n)) should just be range(n). The range class should accept symbolic dimensions (I though someone already added that recently...).

Otherwise what you say all seems reasonable to me but I don't know the current implementation that well.

@czgdp1807

This comment has been minimized.

Copy link
Member Author

commented Jun 12, 2019

The set Intersection(S.Naturals0, Interval(0, n)) should just be range(n). The range class should accept symbolic dimensions (I though someone already added that recently...).

I was working on it #16838. But due to some reasons(mentioned in the last comments of the thread in that PR) it was closed.

@czgdp1807

This comment has been minimized.

Copy link
Member Author

commented Jun 12, 2019

That's why I suggest Intersection(S.Naturals0, Interval(0, n)) which can work with current implementations of other modules. Why can we not add attributes like, low, high in SingleFiniteDistribution?

@oscarbenjamin

This comment has been minimized.

Copy link
Contributor

commented Jun 12, 2019

Intersection(S.Naturals0, Interval(0, n)) should just evaluate to Range(n).

@czgdp1807

This comment has been minimized.

Copy link
Member Author

commented Jun 13, 2019

As no major objections have been received on #16962 (comment) apart from some suggestions, I will start my work on it.

czgdp1807 added some commits Jun 13, 2019

@czgdp1807

This comment has been minimized.

Copy link
Member Author

commented Jun 16, 2019

ping @Upabjojr @sidhantnagpal
Brief Summary of Changes

  • Some uncovered, unused methods have been removed from SingleFiniteDistribution. The names are, characteristic_function and moment_generating_function.
  • The symbolic dimensions have been added to FinitePSpace with if clause which checks that whether the dimension is symbolic via is_symbolic attribute. The pdf attribute is shifted to specific distributions rather than keeping it in the superclass and all the work is done using the pdf only as is done with other distributions in sympy.stats. pdf -> pmf hasn't been done for compatibility reasons.
  • The approach of dict for integer dimensions hasn't been removed as it will break the current API. Overall, the user facing API is intact.

The tests have been written for Die. Tests for other distributions which have symbolic dimensions will be added after 24 hours if no objection on the current approach is raised. Thanks.

@czgdp1807

This comment has been minimized.

Copy link
Member Author

commented Jun 17, 2019

Any objections/suggestions?

Show resolved Hide resolved sympy/stats/frv.py Outdated
@@ -88,7 +106,7 @@ def dict(self):

@property
def set(self):
return self.args
return list(self.args)

This comment has been minimized.

Copy link
@Upabjojr

Upabjojr Jun 17, 2019

Contributor

why list?

This comment has been minimized.

Copy link
@czgdp1807

czgdp1807 Jun 17, 2019

Author Member

In frv_types, the distributions were returning the set as list, so to keep everything in agreement, I made this change.

Show resolved Hide resolved sympy/stats/tests/test_finite_rv.py Outdated
Show resolved Hide resolved sympy/stats/tests/test_finite_rv.py Outdated
@oscarbenjamin

This comment has been minimized.

Copy link
Contributor

commented Jun 17, 2019

It looks like most methods of FinitePSpace have:

if self.is_symbolic:  # is_symbolic is not a good name for this attribute
    do x
else:
   do y

The ones that don't yet have that check probably should because they don't work with symbolic dimensions:

In [44]: D = Die('D', n)

In [45]: D.pspace.sorted_cdf(D)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-45-637c066e196e> in <module>
----> 1 D.pspace.sorted_cdf(D)
--> 288         items = list(cdf.items())
AttributeError: 'Lambda' object has no attribute 'items'

In [46]: D.pspace.compute_quantile(D)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
--> 338         for key, value in cdf.items():
AttributeError: 'Lambda' object has no attribute 'items'

In [47]: D.pspace.where(D > 3)  # hangs...

Needing to check is_x in each method of a class is a good sign that it should be two separate classes: the two different cases do not share any code.

@czgdp1807

This comment has been minimized.

Copy link
Member Author

commented Jun 17, 2019

what about SymbolicFinitePSpace? we will use the probability space based on the inputs received while creating the random symbol? what do you say @Upabjojr? i have an inclination towards @oscarbenjamin's idea

@czgdp1807

This comment has been minimized.

Copy link
Member Author

commented Jun 18, 2019

Needing to check is_x in each method of a class is a good sign that it should be two separate classes: the two different cases do not share any code.

So, I think I can start my work on this. In fact, this will enhance code maintenance in future too.

czgdp1807 added some commits Jun 19, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.