[GSoC] Added Mixture distribution #19886

Smit-create · 2020-08-04T05:22:06Z

References to other Issues or PRs

Discussions in #18730

Brief description of what is fixed or changed

Other comments

Add tests and classes
Add docs

Release Notes

stats
- Added Mixture Distribution

ping @czgdp1807 @Upabjojr @jmig5776

sympy-bot · 2020-08-04T05:22:15Z

✅

Hi, I am the SymPy bot (v160). I'm here to help you write a release notes entry. Please read the guide on how to write release notes.

Your release notes are in good order.

Here is what the release notes will look like:

stats
- Added Mixture Distribution (#19886 by @Smit-create)

This will be added to https://github.com/sympy/sympy/wiki/Release-Notes-for-1.7.

Note: This comment will be updated with the latest check if you edit the pull request. You need to reload the page to see it.

Click here to see the pull request description that was parsed.

<!-- Your title above should be a short description of what
was changed. Do not include the issue number in the title. -->

#### References to other Issues or PRs
<!-- If this pull request fixes an issue, write "Fixes #NNNN" in that exact
format, e.g. "Fixes #1234" (see
https://tinyurl.com/auto-closing for more information). Also, please
write a comment on that issue linking back to this pull request once it is
open. -->
Discussions in #18730

#### Brief description of what is fixed or changed


#### Other comments
- [x] Add tests and classes
- [ ] Add docs

#### Release Notes

<!-- Write the release notes for this release below. See
https://github.com/sympy/sympy/wiki/Writing-Release-Notes for more information
on how to write release notes. The bot will check your release notes
automatically to see if they are formatted correctly. -->

<!-- BEGIN RELEASE NOTES -->
* stats
    * Added Mixture Distribution
<!-- END RELEASE NOTES -->
ping @czgdp1807 @Upabjojr @jmig5776

sympy-bot · 2020-08-04T05:22:16Z

🟠

Hi, I am the SymPy bot (v160). I've noticed that some of your commits add or delete files. Since this is sometimes done unintentionally, I wanted to alert you about it.

This is an experimental feature of SymPy Bot. If you have any feedback on it, please comment at sympy/sympy-bot#75.

The following commits add new files:

d2fdd4c:
- sympy/stats/mixture_rv.py
- sympy/stats/tests/test_mixture_rv.py

If these files were added/deleted on purpose, you can ignore this message.

codecov · 2020-08-04T07:11:49Z

Codecov Report

Merging #19886 into master will increase coverage by 11.552%.
The diff coverage is 90.322%.

@@              Coverage Diff               @@
##            master    #19886        +/-   ##
==============================================
+ Coverage   64.256%   75.809%   +11.552%     
==============================================
  Files          666       669         +3     
  Lines       172568    173342       +774     
  Branches     40689     40857       +168     
==============================================
+ Hits        110887    131410     +20523     
+ Misses       55277     36195     -19082     
+ Partials      6404      5737       -667

Upabjojr · 2020-08-04T20:55:30Z

Maybe there is a possibility for a different kind of appoarch: you designed an API asking for

a list of probabilities summing up to one,
a list of RVs.

Point 1. is just a categorical distribution. Point 2. can be considered as a joint distribution of independent random variables.

I was wondering, could we instead approach this design by defining:

A new object, maybe called IndependentRandomSymbolProduct(X, Y, ... ), which creates joint RVs out of a product of independent RVs.
An element index, maybe equivalent to the class MatrixElement in the matrix expressions module. Let's call it the [ ] operator.
a compound distribution with the categorical distribution as a parameter inside the [ ] operator.

At this point, for example, you would define the mixture of X and Y random variables with probabilities 1/3 and 2/3 as

IndependentRandomSymbolProduct(X, Y)[CategoricalRV(probs=[1/3, 2/3])]

This is a completely different approach. I was curious about making the class construction more similar to the way the compound distribution is defined.

Smit-create · 2020-08-05T05:01:01Z

Okay, I understood your approach, but according to the definition of IID Rvs, all the RVs should have the same distribution as per wikipedia, so for IID Random Symbol should be in the following way I think:

1. X = IIDRandomSymbol('X', [X1, X2, ..., Xn]) # where all Xi have same distribution
2. density(X)([x1, x2, ..., xn]) # this will give density(X1)(x1) * density(X2)(x2) * ... * density(Xn)(xn)
3. # Similarly for probability and expectations methods

czgdp1807 · 2020-08-06T06:39:43Z

I think rather than taking RVs, distributions should be taken. MixtureDistribution should be in public scope. See, https://reference.wolfram.com/language/ref/MixtureDistribution.html
We discussed this earlier as well. In addition, I am unable to figure out the relation between joint distribution and mixture distribution. The latter one is simply the weighted sum of all the PDFs of the respective distributions but joing distribution needs some special conditions like independence for the joint CDF to be de-composed into product of individual CDFs. How are these two similar?

sympy/stats/tests/test_mixture_rv.py

czgdp1807 · 2020-08-06T06:44:06Z

sympy/stats/tests/test_mixture_rv.py

+    N = Normal("N", 0, 1)
+    M = Normal('M', 1, 2)
+    Z = Laplace('L', 3, 1)
+    D = Mixture('D', [S(2)/10, S(5)/10, S(3)/10], [N, M, Z])


Instead of creating a new Mixture RV, a user should be able to define a new distribution using, MixtureDistribution and then pass it on to custom RV API. Mixture can be removed.

But for accessing P, E, etc. we need an RV. Considering that the MixtureDIstribution is created, then we need an RV which will take the MixtureDistrbution as an argument and then able to change the pspace to any of the Finite, Continuous or Discrete Pspaces upon calling of compute_cdf, compute_density etc., so we need Mixture RV IMO. The change can be:

>>> M = Mixture('M', [wt1, wt2, wt3, ...,], [dist1, dist2, dist3, ...,])

I think, Mixture can be a function rather than a class. It can return a generic RV/ContinuousRV/DiscreteRV/FiniteRV depending on the type of PDF.
P.S. I saw that ContinuousRV generates only RVs with continuous PDF but what about PDFs which are neither continuous nor discrete? What will be returned in that case?

I think, Mixture can be a function rather than a class

Do you mean to remove the MixtureDistribution/MixturePSpace class?

I saw that ContinuousRV generates only RVs with continuous PDF but what about PDFs which are neither continuous nor discrete?

Presently, I have considered the 3 cases of Continuous, Discrete, and Finite. Should an error be raised if the distribution is not out of these 3?

Should an error be raised if the distribution is not out of these 3?

No a generic random symbol can be returned having the custom PDF given by the user.

Do you mean to remove the MixtureDistribution/MixturePSpace class?

Mixture should be a constructor function returning generic RandomSymbol objects with the given mixture probability distribution.

Upabjojr · 2020-08-07T08:04:32Z

but according to the definition of IDD Rvs

I simply meant independent RVs, not identical, in the example above.

Smit-create · 2020-08-09T04:34:12Z

I simply meant independent RVs, not identical, in the example above.

@Upabjojr @czgdp1807 For implementing the Independent Random Symbol, should it be done in the following way? Continuing the #19886 (comment)

>>> X = IndependentRandomSymbols('X', [X1, X2, X3,...,]) # where X1, X2, etc. are Independent Random Variables
>>> density(X)([x1, x2,...x3]) # product of respective pdfs
...
>>> E(X) # product of respective expectations
...

Upabjojr · 2020-08-09T06:41:59Z

@Upabjojr @czgdp1807 For implementing the Independent Random Symbol, should it be done in the following way? Continuing the #19886 (comment)
>>> X = IndependentRandomSymbols('X', [X1, X2, X3,...,]) # where X1, X2, etc. are Independent Random Variables
>>> density(X)([x1, x2,...x3]) # product of respective pdfs
...

Yes, that is a nice notation. Better in another PR, this one is already long enough. Maybe density(X)(x1, x2, ... , xn) without the list [ ]? How did you implement it for the other multivariate RVs?

>>> E(X) # product of respective expectations
...

On Wikipedia, the expected value of a multivariate random variables in the vector of the expected values of the components:
https://en.wikipedia.org/wiki/Multivariate_random_variable#Expected_value

Upabjojr · 2020-08-09T14:06:07Z

>>> X = IndependentRandomSymbols('X', [X1, X2, X3,...,]) # where X1, X2, etc. are Independent Random Variables
>>> density(X)([x1, x2,...x3]) # product of respective pdfs

I thought a bit over it... we are creating too many classes this way. We already have the Array and Matrix elements. What about just using those ones?

Upabjojr · 2020-08-09T14:13:21Z

I mean, this looks like it's already working:

In [4]: N = Normal("N", 0, 1)

In [5]: P = Poisson("P", 3)

In [6]: m = Matrix([N, P])

In [7]: m
Out[7]:
Matrix([
[N],
[P]])

In [8]: E(m)
Out[8]:
Matrix([
[0],
[3]])

This is the correct behaviour of the expectation if m is to be considered a multivariate random variable composed of two independent univariate RVs N and P.

Upabjojr · 2020-08-09T14:14:27Z

OTOH, there is a bug:

In [10]: a = Array([N, P])

In [11]: a
Out[11]: [N, P]

In [12]: E(a)
---------------------------------------------------------------------------
AttributeError: 'ImmutableDenseNDimArray' object has no attribute 'expand'

Upabjojr · 2020-08-09T14:21:03Z

In order to create a mixture distribution this way, one should:

In [17]: B = Bernoulli("B", 0.3)

In [18]: m[B, 1]
---------------------------------------------------------------------------
ValueError: index out of boundary

This is likely a bug in the class MatrixElement. I would expect this instance of MatrixElement to be exactly equivalent to MixtureDistribution([7/10, 3/10], [N, P]).

In case of multivariate RVs, create a Matrix or Array object and then pass a categorical distribution RV as the matrix/array index.

@Smit-create maybe with this procedure we can avoid creating new classes. What do you think?

Upabjojr · 2020-08-09T14:22:23Z

Strangely enough, the same bug does not appear if we pass the Bernoulli RV as index to an array:

In [22]: a[B]
Out[22]: [N, P][B]

Upabjojr · 2020-08-09T14:24:48Z

Expectation apparently works with Array:

In [23]: E(a[B])
Out[23]: 0.900000000000000

In [24]: 7/10*E(N) + 3/10*E(P)
Out[24]: 0.900000000000000

Smit-create · 2020-08-10T07:57:07Z

Yes, I understood the procedure with the Array, but I think density and E are working fine but not cdf:

>>> N = Normal("N", 0, 1)
>>> M = Normal('M', 1, 2)
>>> Z = Laplace('L', 3, 1)
>>> wt = Binomial('B', 2, S(2)/10)
>>> E(a[wt])
11/25
>>> density(a[wt])(x)
2*sqrt(2)*exp(-1/8)*exp(x/4)*exp(-x**2/8)/(25*sqrt(pi)) + exp(-Abs(x - 3))/50 + 8*sqrt(2)*exp(-x**2/2)
/(25*sqrt(pi))
>>> cdf(a[wt])(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/smit/Smitlunagariya/sympy/sympy/stats/rv.py", line 950, in cdf
    result = pspace(expr).compute_cdf(expr, **kwargs)
  File "/home/smit/Smitlunagariya/sympy/sympy/stats/rv.py", line 489, in compute_cdf
    raise ValueError("CDF not well defined on multivariate expressions")
ValueError: CDF not well defined on multivariate expressions

Also, this works for the random variables, but not when we pass instances of distribution. I think that we couldn't allow custom weights to be passed directly as we would need to create a custom RV for indexing.

Upabjojr · 2020-08-10T13:25:52Z

Maybe cdf should be aware of MatrixElement?

Smit-create · 2020-08-12T10:16:02Z

Maybe cdf should be aware of MatrixElement?

It should be. But this will work if we have Rvs as an argument and not when Distributions as E, cdf etc are not operated with Distributions. So, IMHO we should have a separate class/function for Mixture

czgdp1807 · 2020-12-06T09:29:59Z

It seems like there are a bunch of approaches presented here but a consensus isn't reached. This should be there in ideas list of next GSoC and all the approaches discussed so far can be compiled in a single issue (already open, OP can be edited). I learnt that particularly for stats, without arriving at a consensus we should certainly not open a PR. When something is stalled, efforts are wasted a bit.

add mixture dist.

d2fdd4c

Smit-create added GSoC stats labels Aug 4, 2020

czgdp1807 reviewed Aug 6, 2020

View reviewed changes

sympy/stats/tests/test_mixture_rv.py Outdated Show resolved Hide resolved

czgdp1807 reviewed Aug 6, 2020

View reviewed changes

sympy/stats/tests/test_mixture_rv.py Outdated Show resolved Hide resolved

czgdp1807 reviewed Aug 6, 2020

View reviewed changes

apply suggestions

ced6c64

Smit-create requested a review from czgdp1807 August 7, 2020 06:06

github-actions bot added the Merge conflict label Oct 26, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[GSoC] Added Mixture distribution #19886

[GSoC] Added Mixture distribution #19886

Smit-create commented Aug 4, 2020

sympy-bot commented Aug 4, 2020

sympy-bot commented Aug 4, 2020 •

edited

codecov bot commented Aug 4, 2020 •

edited

Upabjojr commented Aug 4, 2020

Smit-create commented Aug 5, 2020 •

edited

czgdp1807 commented Aug 6, 2020 •

edited

czgdp1807 Aug 6, 2020

Smit-create Aug 6, 2020

czgdp1807 Aug 8, 2020

Smit-create Aug 8, 2020

czgdp1807 Aug 13, 2020

czgdp1807 Aug 13, 2020

Upabjojr commented Aug 7, 2020

Smit-create commented Aug 9, 2020 •

edited

Upabjojr commented Aug 9, 2020

Upabjojr commented Aug 9, 2020

Upabjojr commented Aug 9, 2020

Upabjojr commented Aug 9, 2020

Upabjojr commented Aug 9, 2020 •

edited

Upabjojr commented Aug 9, 2020

Upabjojr commented Aug 9, 2020

Smit-create commented Aug 10, 2020

Upabjojr commented Aug 10, 2020

Smit-create commented Aug 12, 2020

czgdp1807 commented Dec 6, 2020

[GSoC] Added Mixture distribution #19886

Are you sure you want to change the base?

[GSoC] Added Mixture distribution #19886

Conversation

Smit-create commented Aug 4, 2020

References to other Issues or PRs

Brief description of what is fixed or changed

Other comments

Release Notes

sympy-bot commented Aug 4, 2020

sympy-bot commented Aug 4, 2020 • edited

🟠

codecov bot commented Aug 4, 2020 • edited

Codecov Report

Upabjojr commented Aug 4, 2020

Smit-create commented Aug 5, 2020 • edited

czgdp1807 commented Aug 6, 2020 • edited

czgdp1807 Aug 6, 2020

Choose a reason for hiding this comment

Smit-create Aug 6, 2020

Choose a reason for hiding this comment

czgdp1807 Aug 8, 2020

Choose a reason for hiding this comment

Smit-create Aug 8, 2020

Choose a reason for hiding this comment

czgdp1807 Aug 13, 2020

Choose a reason for hiding this comment

czgdp1807 Aug 13, 2020

Choose a reason for hiding this comment

Upabjojr commented Aug 7, 2020

Smit-create commented Aug 9, 2020 • edited

Upabjojr commented Aug 9, 2020

Upabjojr commented Aug 9, 2020

Upabjojr commented Aug 9, 2020

Upabjojr commented Aug 9, 2020

Upabjojr commented Aug 9, 2020 • edited

Upabjojr commented Aug 9, 2020

Upabjojr commented Aug 9, 2020

Smit-create commented Aug 10, 2020

Upabjojr commented Aug 10, 2020

Smit-create commented Aug 12, 2020

czgdp1807 commented Dec 6, 2020

sympy-bot commented Aug 4, 2020 •

edited

codecov bot commented Aug 4, 2020 •

edited

Smit-create commented Aug 5, 2020 •

edited

czgdp1807 commented Aug 6, 2020 •

edited

Smit-create commented Aug 9, 2020 •

edited

Upabjojr commented Aug 9, 2020 •

edited