-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stats: refactory of sampling methods #20494
Stats: refactory of sampling methods #20494
Conversation
✅ Hi, I am the SymPy bot (v161). I'm here to help you write a release notes entry. Please read the guide on how to write release notes. Your release notes are in good order. Here is what the release notes will look like:
This will be added to https://github.com/sympy/sympy/wiki/Release-Notes-for-1.8. Click here to see the pull request description that was parsed.
Update The release notes on the wiki have been updated. |
N.B. #18754 (comment) Adding a new library can involve changes across multiple distribution classes across multiple files. But as in the above comment, |
Codecov Report
@@ Coverage Diff @@
## master #20494 +/- ##
=============================================
- Coverage 75.781% 75.702% -0.080%
=============================================
Files 673 676 +3
Lines 174510 174529 +19
Branches 41202 41191 -11
=============================================
- Hits 132247 132123 -124
- Misses 36546 36686 +140
- Partials 5717 5720 +3 |
I don't see it as a problem. Implementing a few more methods for every distribution is really OK. The current state of external calls is a typical debugging nightmare... putting methods inside dictionaries then looking them up. Python indeed supports first-class functions, but this kind of coding should really be avoided. The viable alternative would be |
Furthermore, the same code was copy-pasted a few times. That's also a design flaw. |
In general, I keep telling everybody to code as if you were coding in C++ or Java. Using Python magic may be evil. |
It's possible to avoid the magic while still organising the sampling code in classes for each external library rather than distributed as methods over all the different distribution classes. The current code is a little messy but there's nothing wrong with using a dict to define the mapping from sympy classes to sampling functions (it should use the classes as keys rather than strings though). The fact that it's harder to do it that way in C++ or Java does not mean that it is a bad idea in Python. A stronger consideration to me is that we should aim to concentrate code that is specific to external libraries in one place like a compatibility layer. |
Well, singledispatching wouldn't require a class either.
It's unreadable, messy and hard to debug. Since Python 3.4 there's single dispatching in the standard library, so we should use that instead of
Too much freedom in Python may lead to messy code.
Well, that could be turned the other way round. We could also concentrate the code specific to a particular distribution function into that distribution. Either way or the other. Anyway, the only alternative I see is using |
I think singledispatch is reasonable. It means we can group all the code together for each external sampling library and makes it easy to add new libraries but also makes the sampling code extensible from outside if needed by users for their own classes.
It might do. You have the freedom to write code in different ways so you have a bit more responsibility to design something sensible but also more flexibility to do so. You can make a mess with ordinary methods as well if you don't take care to design properly (and document the design) e.g. #20479. |
I think the approach in the diff should be good because the only drawback is that this will require adding new methods to many distributions for supporting a new library. Since, python is interpreted language, |
sympy/stats/crv.py
Outdated
class SampleContinuousNumpy: | ||
"""Returns the sample from numpy of the given distribution""" | ||
|
||
def __new__(cls, dist, size): | ||
return cls._sample_numpy(dist, size) | ||
|
||
@classmethod | ||
def _sample_numpy(cls, dist, size): | ||
"""Sample from NumPy.""" | ||
|
||
import numpy | ||
numpy_rv_map = { | ||
'BetaDistribution': lambda dist, size: numpy.random.beta(a=float(dist.alpha), | ||
b=float(dist.beta), size=size), | ||
'ChiSquaredDistribution': lambda dist, size: numpy.random.chisquare( | ||
df=float(dist.k), size=size), | ||
'ExponentialDistribution': lambda dist, size: numpy.random.exponential( | ||
1/float(dist.rate), size=size), | ||
'GammaDistribution': lambda dist, size: numpy.random.gamma(float(dist.k), | ||
float(dist.theta), size=size), | ||
'LogNormalDistribution': lambda dist, size: numpy.random.lognormal( | ||
float(dist.mean), float(dist.std), size=size), | ||
'NormalDistribution': lambda dist, size: numpy.random.normal( | ||
float(dist.mean), float(dist.std), size=size), | ||
'ParetoDistribution': lambda dist, size: (numpy.random.pareto( | ||
a=float(dist.alpha), size=size) + 1) * float(dist.xm), | ||
'UniformDistribution': lambda dist, size: numpy.random.uniform( | ||
low=float(dist.left), high=float(dist.right), size=size) | ||
} | ||
|
||
dist_list = numpy_rv_map.keys() | ||
|
||
if dist.__class__.__name__ not in dist_list: | ||
return None | ||
|
||
return numpy_rv_map[dist.__class__.__name__](dist, size) | ||
|
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. The above two comments can be ignored.
sympy/stats/rv.py
Outdated
# See links below referring to sections beginning with "A common parametrization..." | ||
# I will remove all these comments if everything is ok. | ||
|
||
samps = self._do_sample_scipy(size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that import scipy
can be written just above this line. This may help in avoid writing, import scipy
for every implementation of self._do_sample_scipy
as scipy
will be imported into the namespace before control goes for sampling using it. Possibly, you may have tried it. Can you please let me know why it didn't work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same goes for other libraries if it works.
🟠Hi, I am the SymPy bot (v161). I've noticed that some of your commits add or delete files. Since this is sometimes done unintentionally, I wanted to alert you about it. This is an experimental feature of SymPy Bot. If you have any feedback on it, please comment at sympy/sympy-bot#75. The following commits add new files:
The following commits delete files:
If these files were added/deleted on purpose, you can ignore this message. |
IMHO, the previous commit(adding sampling methods to distribution classes) was better. Now, things are way more scattered and debugging would be more difficult since features of a particular distribution are not there where they should be. Using |
I was following the suggestion of putting all sampling calls in the same place. In the end, what this PR does is adding methods for calling external samplers to the distribution classes. The previous commit was adding methods inside the class, this last commit adds methods outside of the class using singledispatch. Singledispatch is now in the standard library of Python, so that's a standard way to declare methods to a class externally. |
What is certain... I'm going to remove all strings |
I don't have much problem with usage of
But by doing that we are putting methods related to a distribution in a different place. So one way or the other things will be scattered. I am not sure if things like P.S. I hope switching back to previous diff just needs a minor effort i.e., |
Sure. That's cryptic too and should be avoided whenever we have an alternative to it. No problems with this thought. |
I don't mind so much what you two want to do with this. Personally I think that the current dict approach is fine except that it should have been using the class rather than the name of the class as the keys in the dict and also the logic was a bit convoluted. The two approaches here are both friendlier to subclasses but I don't really see why there should be any need to support subclassing these. |
There is function overloading by default, though that's not exactly the same as singledispatch as the dispatch is done at compile time (instead of runtime). That's the main difference between Technically not even class methods are dispatched (at runtime) in C++, unless you define them as virtual. (Java does it by default instead). In SymEngine they are using the visitor pattern to implement singledipatching (it's a trick involving the proper usage of two virtual methods). So yes, there is singledispatch. SymEngine uses it exactly for the same purpose that I am using it in this PR (i.e. not defining too many methods in classes).
Well, no need to rush. There are advantages and disadvantages in both approaches. SymEngine uses both IIRC. I was initially a bit more favorable to class methods, but this second way is not so bad in the end. |
This is something new to me. Thanks for bringing this up, it seems to be a very nice application of Dynamic Binding.
Agreed. I think the new approach has an advantage of avoiding pollution of distribution classes with multiple methods for a multiple libraries. I didn't think of this earlier. The |
I am marking this stalled until the release of |
Another advantage of For example, if an external library can handle optimized sampling of the addion or multiplication of random variables, we can add dispatchers for |
We may continue this in a week or so. We need minor changes here like, adding seed argument to every function. I will push all those changes here. |
I will bring the back changes from |
Classes were invented to avoid that kind of edits. class SamplingOnSomeExternalLibrary:
def __init__(self, seed):
self.seed = seed
@singledispatchmethod
def sample(self, x):
pass
@sample.register
def _(self, x: NormalDistribution):
return some_library.sample_normal(... , seed=self.seed) The problem is than singledispatchmethod was introduced only in Python 3.8 (single dispatch on functions OTOH was introduced in Python 3.4). I'm definitely in favor of singledispatching, as you can define the dispatching logic for classes in SymPy's core without editing SymPy's core. For example If we are going to use classes to avoid adding parameters to every dispatched function, we still need to find a way to make |
#20658 has been merged. We can proceed with this PR according to the latest API changes (addition of optional |
Should we use a class with |
The current approach in the diff looks good. |
I have moved all sampling tests for CRV, FRV and DRV to ================================================= test process starts ==================================================
executable: /usr/bin/python3 (3.6.9-final-0) [CPython]
architecture: 64-bit
cache: yes
ground types: python
numpy: 1.19.2
random seed: 54908228
hash randomization: on (PYTHONHASHSEED=470444264)
sympy/stats/sampling/tests/test_sample_continuous_rv.py[3] EEs [FAIL]
sympy/stats/sampling/tests/test_sample_discrete_rv.py[3] EEs [FAIL]
sympy/stats/sampling/tests/test_sample_finite_rv.py[4] EEs/home/czgdp1807ssd/sympy_project/sympy/sympy/stats/rv.py:1110: UserWarning:
The return type of sample has been changed to return an iterator
object since version 1.7. For more information see
https://github.com/sympy/sympy/issues/19061
warnings.warn(filldedent(message))
. [FAIL]
________________________________________________________________________________________________________________________
______________________ sympy/stats/sampling/tests/test_sample_continuous_rv.py:test_sample_numpy _______________________
Traceback (most recent call last):
File "/home/czgdp1807ssd/sympy_project/sympy/sympy/stats/sampling/tests/test_sample_continuous_rv.py", line 25, in test_sample_numpy
samps = next(sample(X, size=size, library='numpy'))
File "/home/czgdp1807ssd/sympy_project/sympy/sympy/stats/rv.py", line 1283, in return_generator_finite
library=library, seed=seed) # a dictionary that maps RVs to values
File "/home/czgdp1807ssd/sympy_project/sympy/sympy/stats/crv.py", line 473, in sample
return {self.value: self.distribution.sample(size, library=library, seed=seed)}
File "/home/czgdp1807ssd/sympy_project/sympy/sympy/stats/rv.py", line 1565, in sample
% (self, library))
NotImplementedError: Sampling for BetaDistribution(1, 1) is not currently implemented from numpy
________________________________________________________________________________________________________________________
______________________ sympy/stats/sampling/tests/test_sample_continuous_rv.py:test_sample_scipy _______________________
Traceback (most recent call last):
File "/home/czgdp1807ssd/sympy_project/sympy/sympy/stats/sampling/tests/test_sample_continuous_rv.py", line 58, in test_sample_scipy
g_sample = list(sample(Gamma("G", 2, 7), size=size, numsamples=numsamples))
File "/home/czgdp1807ssd/sympy_project/sympy/sympy/stats/rv.py", line 1283, in return_generator_finite
library=library, seed=seed) # a dictionary that maps RVs to values
File "/home/czgdp1807ssd/sympy_project/sympy/sympy/stats/crv.py", line 473, in sample
return {self.value: self.distribution.sample(size, library=library, seed=seed)}
File "/home/czgdp1807ssd/sympy_project/sympy/sympy/stats/rv.py", line 1565, in sample
% (self, library))
NotImplementedError: Sampling for GammaDistribution(2, 7) is not currently implemented from scipy
________________________________________________________________________________________________________________________
_______________________ sympy/stats/sampling/tests/test_sample_discrete_rv.py:test_sample_numpy ________________________
Traceback (most recent call last):
File "/home/czgdp1807ssd/sympy_project/sympy/sympy/stats/sampling/tests/test_sample_discrete_rv.py", line 20, in test_sample_numpy
samps = next(sample(X, size=size, library='numpy'))
File "/home/czgdp1807ssd/sympy_project/sympy/sympy/stats/rv.py", line 1283, in return_generator_finite
library=library, seed=seed) # a dictionary that maps RVs to values
File "/home/czgdp1807ssd/sympy_project/sympy/sympy/stats/drv.py", line 288, in sample
return {self.value: self.distribution.sample(size, library=library, seed=seed)}
File "/home/czgdp1807ssd/sympy_project/sympy/sympy/stats/rv.py", line 1565, in sample
% (self, library))
NotImplementedError: Sampling for GeometricDistribution(0.5) is not currently implemented from numpy
________________________________________________________________________________________________________________________
_______________________ sympy/stats/sampling/tests/test_sample_discrete_rv.py:test_sample_scipy ________________________
Traceback (most recent call last):
File "/home/czgdp1807ssd/sympy_project/sympy/sympy/stats/sampling/tests/test_sample_discrete_rv.py", line 50, in test_sample_scipy
z_sample = list(sample(Zeta("G", 7), size=size, numsamples=numsamples))
File "/home/czgdp1807ssd/sympy_project/sympy/sympy/stats/rv.py", line 1283, in return_generator_finite
library=library, seed=seed) # a dictionary that maps RVs to values
File "/home/czgdp1807ssd/sympy_project/sympy/sympy/stats/drv.py", line 288, in sample
return {self.value: self.distribution.sample(size, library=library, seed=seed)}
File "/home/czgdp1807ssd/sympy_project/sympy/sympy/stats/rv.py", line 1565, in sample
% (self, library))
NotImplementedError: Sampling for ZetaDistribution(7) is not currently implemented from scipy
________________________________________________________________________________________________________________________
________________________ sympy/stats/sampling/tests/test_sample_finite_rv.py:test_sample_numpy _________________________
Traceback (most recent call last):
File "/home/czgdp1807ssd/sympy_project/sympy/sympy/stats/sampling/tests/test_sample_finite_rv.py", line 19, in test_sample_numpy
samps = next(sample(X, size=size, library='numpy'))
File "/home/czgdp1807ssd/sympy_project/sympy/sympy/stats/rv.py", line 1283, in return_generator_finite
library=library, seed=seed) # a dictionary that maps RVs to values
File "/home/czgdp1807ssd/sympy_project/sympy/sympy/stats/frv.py", line 344, in sample
return {self.value: self.distribution.sample(size, library, seed)}
File "/home/czgdp1807ssd/sympy_project/sympy/sympy/stats/rv.py", line 1565, in sample
% (self, library))
NotImplementedError: Sampling for BinomialDistribution(5, 0.4, 1, 0) is not currently implemented from numpy
________________________________________________________________________________________________________________________
________________________ sympy/stats/sampling/tests/test_sample_finite_rv.py:test_sample_scipy _________________________
Traceback (most recent call last):
File "/home/czgdp1807ssd/sympy_project/sympy/sympy/stats/sampling/tests/test_sample_finite_rv.py", line 47, in test_sample_scipy
h_sample = list(sample(Hypergeometric("H", 1, 1, 1), size=size, numsamples=numsamples))
File "/home/czgdp1807ssd/sympy_project/sympy/sympy/stats/rv.py", line 1283, in return_generator_finite
library=library, seed=seed) # a dictionary that maps RVs to values
File "/home/czgdp1807ssd/sympy_project/sympy/sympy/stats/frv.py", line 344, in sample
return {self.value: self.distribution.sample(size, library, seed)}
File "/home/czgdp1807ssd/sympy_project/sympy/sympy/stats/rv.py", line 1565, in sample
% (self, library))
NotImplementedError: Sampling for HypergeometricDistribution(1, 1, 1) is not currently implemented from scipy
========================== tests finished: 1 passed, 3 skipped, 6 exceptions, in 0.42 seconds ==========================
DO *NOT* COMMIT! |
What version of |
Python 3.6.9 (default, Jul 17 2020, 12:50:27)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import scipy
>>> scipy.__version__
'1.5.3' The tests on |
Apparently, SymPy uses the package Line 65 in dc75152
|
Sorry, they are installed with Line 224 in dc75152
|
Travis isn't used any more so testing is done with the github actions config in |
|
||
|
||
@singledispatch | ||
def do_sample_numpy(dist, size, rand_state): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was called irrespective of the type of dist
and hence, None
was returned leading to NotImplementedError
.
|
||
# CRV: | ||
|
||
@do_sample_numpy.register(BetaDistribution) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do_sample_numpy
wasn't registered with BetaDistribution
at least in my system and therefore wasn't overridden. Similar was the case with all other distributions. Anyways, @do_sample_numpy.register(BetaDistribution)
is more stable and less risky than @do_sample_numpy.register
and depending on type annotations.
I think all is good with this PR. Will merge in a couple of days if no objections raised. Thanks. |
Just cleaning up the code.
References to other Issues or PRs
Brief description of what is fixed or changed
Other comments
Release Notes