Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GSoC] Added Mixture distribution #19886

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

Smit-create
Copy link
Member

References to other Issues or PRs

Discussions in #18730

Brief description of what is fixed or changed

Other comments

  • Add tests and classes
  • Add docs

Release Notes

  • stats
    • Added Mixture Distribution

ping @czgdp1807 @Upabjojr @jmig5776

@sympy-bot
Copy link

Hi, I am the SymPy bot (v160). I'm here to help you write a release notes entry. Please read the guide on how to write release notes.

Your release notes are in good order.

Here is what the release notes will look like:

This will be added to https://github.com/sympy/sympy/wiki/Release-Notes-for-1.7.

Note: This comment will be updated with the latest check if you edit the pull request. You need to reload the page to see it.

Click here to see the pull request description that was parsed.

<!-- Your title above should be a short description of what
was changed. Do not include the issue number in the title. -->

#### References to other Issues or PRs
<!-- If this pull request fixes an issue, write "Fixes #NNNN" in that exact
format, e.g. "Fixes #1234" (see
https://tinyurl.com/auto-closing for more information). Also, please
write a comment on that issue linking back to this pull request once it is
open. -->
Discussions in #18730

#### Brief description of what is fixed or changed


#### Other comments
- [x] Add tests and classes
- [ ] Add docs

#### Release Notes

<!-- Write the release notes for this release below. See
https://github.com/sympy/sympy/wiki/Writing-Release-Notes for more information
on how to write release notes. The bot will check your release notes
automatically to see if they are formatted correctly. -->

<!-- BEGIN RELEASE NOTES -->
* stats
    * Added Mixture Distribution
<!-- END RELEASE NOTES -->
ping @czgdp1807 @Upabjojr @jmig5776 

@sympy-bot
Copy link

sympy-bot commented Aug 4, 2020

🟠

Hi, I am the SymPy bot (v160). I've noticed that some of your commits add or delete files. Since this is sometimes done unintentionally, I wanted to alert you about it.

This is an experimental feature of SymPy Bot. If you have any feedback on it, please comment at sympy/sympy-bot#75.

The following commits add new files:

  • d2fdd4c:
    • sympy/stats/mixture_rv.py
    • sympy/stats/tests/test_mixture_rv.py

If these files were added/deleted on purpose, you can ignore this message.

@codecov
Copy link

codecov bot commented Aug 4, 2020

Codecov Report

Merging #19886 into master will increase coverage by 11.552%.
The diff coverage is 90.322%.

@@              Coverage Diff               @@
##            master    #19886        +/-   ##
==============================================
+ Coverage   64.256%   75.809%   +11.552%     
==============================================
  Files          666       669         +3     
  Lines       172568    173342       +774     
  Branches     40689     40857       +168     
==============================================
+ Hits        110887    131410     +20523     
+ Misses       55277     36195     -19082     
+ Partials      6404      5737       -667     

@Upabjojr
Copy link
Contributor

Upabjojr commented Aug 4, 2020

Maybe there is a possibility for a different kind of appoarch: you designed an API asking for

  1. a list of probabilities summing up to one,
  2. a list of RVs.

Point 1. is just a categorical distribution. Point 2. can be considered as a joint distribution of independent random variables.

I was wondering, could we instead approach this design by defining:

  • A new object, maybe called IndependentRandomSymbolProduct(X, Y, ... ), which creates joint RVs out of a product of independent RVs.
  • An element index, maybe equivalent to the class MatrixElement in the matrix expressions module. Let's call it the [ ] operator.
  • a compound distribution with the categorical distribution as a parameter inside the [ ] operator.

At this point, for example, you would define the mixture of X and Y random variables with probabilities 1/3 and 2/3 as

IndependentRandomSymbolProduct(X, Y)[CategoricalRV(probs=[1/3, 2/3])]

This is a completely different approach. I was curious about making the class construction more similar to the way the compound distribution is defined.

@Smit-create
Copy link
Member Author

Smit-create commented Aug 5, 2020

Okay, I understood your approach, but according to the definition of IID Rvs, all the RVs should have the same distribution as per wikipedia, so for IID Random Symbol should be in the following way I think:

1. X = IIDRandomSymbol('X', [X1, X2, ..., Xn]) # where all Xi have same distribution
2. density(X)([x1, x2, ..., xn]) # this will give density(X1)(x1) * density(X2)(x2) * ... * density(Xn)(xn)
3. # Similarly for probability and expectations methods

@czgdp1807
Copy link
Member

czgdp1807 commented Aug 6, 2020

I think rather than taking RVs, distributions should be taken. MixtureDistribution should be in public scope. See, https://reference.wolfram.com/language/ref/MixtureDistribution.html
We discussed this earlier as well. In addition, I am unable to figure out the relation between joint distribution and mixture distribution. The latter one is simply the weighted sum of all the PDFs of the respective distributions but joing distribution needs some special conditions like independence for the joint CDF to be de-composed into product of individual CDFs. How are these two similar?

N = Normal("N", 0, 1)
M = Normal('M', 1, 2)
Z = Laplace('L', 3, 1)
D = Mixture('D', [S(2)/10, S(5)/10, S(3)/10], [N, M, Z])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of creating a new Mixture RV, a user should be able to define a new distribution using, MixtureDistribution and then pass it on to custom RV API. Mixture can be removed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But for accessing P, E, etc. we need an RV. Considering that the MixtureDIstribution is created, then we need an RV which will take the MixtureDistrbution as an argument and then able to change the pspace to any of the Finite, Continuous or Discrete Pspaces upon calling of compute_cdf, compute_density etc., so we need Mixture RV IMO. The change can be:

>>> M = Mixture('M', [wt1, wt2, wt3, ...,], [dist1, dist2, dist3, ...,])

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, Mixture can be a function rather than a class. It can return a generic RV/ContinuousRV/DiscreteRV/FiniteRV depending on the type of PDF.
P.S. I saw that ContinuousRV generates only RVs with continuous PDF but what about PDFs which are neither continuous nor discrete? What will be returned in that case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, Mixture can be a function rather than a class

Do you mean to remove the MixtureDistribution/MixturePSpace class?

I saw that ContinuousRV generates only RVs with continuous PDF but what about PDFs which are neither continuous nor discrete?

Presently, I have considered the 3 cases of Continuous, Discrete, and Finite. Should an error be raised if the distribution is not out of these 3?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should an error be raised if the distribution is not out of these 3?

No a generic random symbol can be returned having the custom PDF given by the user.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean to remove the MixtureDistribution/MixturePSpace class?

Mixture should be a constructor function returning generic RandomSymbol objects with the given mixture probability distribution.

@Upabjojr
Copy link
Contributor

Upabjojr commented Aug 7, 2020

but according to the definition of IDD Rvs

I simply meant independent RVs, not identical, in the example above.

@Smit-create
Copy link
Member Author

Smit-create commented Aug 9, 2020

I simply meant independent RVs, not identical, in the example above.

@Upabjojr @czgdp1807 For implementing the Independent Random Symbol, should it be done in the following way? Continuing the #19886 (comment)

>>> X = IndependentRandomSymbols('X', [X1, X2, X3,...,]) # where X1, X2, etc. are Independent Random Variables
>>> density(X)([x1, x2,...x3]) # product of respective pdfs
...
>>> E(X) # product of respective expectations
...

@Upabjojr
Copy link
Contributor

Upabjojr commented Aug 9, 2020

@Upabjojr @czgdp1807 For implementing the Independent Random Symbol, should it be done in the following way? Continuing the #19886 (comment)

>>> X = IndependentRandomSymbols('X', [X1, X2, X3,...,]) # where X1, X2, etc. are Independent Random Variables
>>> density(X)([x1, x2,...x3]) # product of respective pdfs
...

Yes, that is a nice notation. Better in another PR, this one is already long enough. Maybe density(X)(x1, x2, ... , xn) without the list [ ]? How did you implement it for the other multivariate RVs?

>>> E(X) # product of respective expectations
...

On Wikipedia, the expected value of a multivariate random variables in the vector of the expected values of the components:
https://en.wikipedia.org/wiki/Multivariate_random_variable#Expected_value

@Upabjojr
Copy link
Contributor

Upabjojr commented Aug 9, 2020

>>> X = IndependentRandomSymbols('X', [X1, X2, X3,...,]) # where X1, X2, etc. are Independent Random Variables
>>> density(X)([x1, x2,...x3]) # product of respective pdfs

I thought a bit over it... we are creating too many classes this way. We already have the Array and Matrix elements. What about just using those ones?

@Upabjojr
Copy link
Contributor

Upabjojr commented Aug 9, 2020

I mean, this looks like it's already working:

In [4]: N = Normal("N", 0, 1)

In [5]: P = Poisson("P", 3)

In [6]: m = Matrix([N, P])

In [7]: m
Out[7]:
Matrix([
[N],
[P]])

In [8]: E(m)
Out[8]:
Matrix([
[0],
[3]])

This is the correct behaviour of the expectation if m is to be considered a multivariate random variable composed of two independent univariate RVs N and P.

@Upabjojr
Copy link
Contributor

Upabjojr commented Aug 9, 2020

OTOH, there is a bug:

In [10]: a = Array([N, P])

In [11]: a
Out[11]: [N, P]

In [12]: E(a)
---------------------------------------------------------------------------
AttributeError: 'ImmutableDenseNDimArray' object has no attribute 'expand'

@Upabjojr
Copy link
Contributor

Upabjojr commented Aug 9, 2020

In order to create a mixture distribution this way, one should:

In [17]: B = Bernoulli("B", 0.3)

In [18]: m[B, 1]
---------------------------------------------------------------------------
ValueError: index out of boundary

This is likely a bug in the class MatrixElement. I would expect this instance of MatrixElement to be exactly equivalent to MixtureDistribution([7/10, 3/10], [N, P]).

In case of multivariate RVs, create a Matrix or Array object and then pass a categorical distribution RV as the matrix/array index.

@Smit-create maybe with this procedure we can avoid creating new classes. What do you think?

@Upabjojr
Copy link
Contributor

Upabjojr commented Aug 9, 2020

Strangely enough, the same bug does not appear if we pass the Bernoulli RV as index to an array:

In [22]: a[B]
Out[22]: [N, P][B]

@Upabjojr
Copy link
Contributor

Upabjojr commented Aug 9, 2020

Expectation apparently works with Array:

In [23]: E(a[B])
Out[23]: 0.900000000000000

In [24]: 7/10*E(N) + 3/10*E(P)
Out[24]: 0.900000000000000

@Smit-create
Copy link
Member Author

Yes, I understood the procedure with the Array, but I think density and E are working fine but not cdf:

>>> N = Normal("N", 0, 1)
>>> M = Normal('M', 1, 2)
>>> Z = Laplace('L', 3, 1)
>>> wt = Binomial('B', 2, S(2)/10)
>>> E(a[wt])
11/25
>>> density(a[wt])(x)
2*sqrt(2)*exp(-1/8)*exp(x/4)*exp(-x**2/8)/(25*sqrt(pi)) + exp(-Abs(x - 3))/50 + 8*sqrt(2)*exp(-x**2/2)
/(25*sqrt(pi))
>>> cdf(a[wt])(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/smit/Smitlunagariya/sympy/sympy/stats/rv.py", line 950, in cdf
    result = pspace(expr).compute_cdf(expr, **kwargs)
  File "/home/smit/Smitlunagariya/sympy/sympy/stats/rv.py", line 489, in compute_cdf
    raise ValueError("CDF not well defined on multivariate expressions")
ValueError: CDF not well defined on multivariate expressions

Also, this works for the random variables, but not when we pass instances of distribution. I think that we couldn't allow custom weights to be passed directly as we would need to create a custom RV for indexing.

@Upabjojr
Copy link
Contributor

Maybe cdf should be aware of MatrixElement?

@Smit-create
Copy link
Member Author

Maybe cdf should be aware of MatrixElement?

It should be. But this will work if we have Rvs as an argument and not when Distributions as E, cdf etc are not operated with Distributions. So, IMHO we should have a separate class/function for Mixture

@czgdp1807
Copy link
Member

It seems like there are a bunch of approaches presented here but a consensus isn't reached. This should be there in ideas list of next GSoC and all the approaches discussed so far can be compiled in a single issue (already open, OP can be edited). I learnt that particularly for stats, without arriving at a consensus we should certainly not open a PR. When something is stalled, efforts are wasted a bit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants