Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: improvements to the Stable distribution #9523

Merged
merged 67 commits into from Dec 18, 2021

Conversation

bsdz
Copy link
Contributor

@bsdz bsdz commented Nov 22, 2018

Improve performance/stability for piecewise approach especially when alpha = 1.
Replace Mitnik FFT with Wang's better Simpson approach.
Correct location / scale for alpha = 1.
Introduce multiple parameterizations including original S1 and new S0.
Correct typo in rvs for alpha = 1.
Improve testing by using sample points based on percentiles.

@ilayn
Copy link
Member

ilayn commented Nov 23, 2018

Thanks for the PR @bsdz. Do you mind adding some motivation for these replacements and how have been replaced?

@rgommers rgommers added scipy.stats maintenance Items related to regular maintenance tasks labels Nov 23, 2018
@bsdz
Copy link
Contributor Author

bsdz commented Nov 23, 2018

Do you mind adding some motivation for these replacements and how have been replaced?

Ah yes of course, there are basically 3 main points I wanted to address.

Firstly, for the piecewise approach we always had trouble when alpha==1. The integrand is continuous and monotonically increases from zero to 1/e before falling back to zero. The function is very peaked and many of our optimizers are unable to easily locate the peak. So now we follow Nolan's advice and use a root finder to locate this peak. We still have a problem with the integrand often losing support after the peak and this can create a long zero tail that quadpack doesn't like so I've added simple bound reducing loop (it's something I hope others might offer better solutions to).

This fix makes the "best" method choice, that selected between and "piecewise" and "dni" (direct numerical integration) based on whether alpha == 1 or not, redundant and so that's been deprecated.

Secondly, for the FFT implementation (used when one needs to generate many sample points for plotting, MLE, etc), there has been a better approach for a while and I explored this in a short paper (Stable Probability Densities using FFTs and Newton-Cote Rules in Scipy). Here a general approach is also derived with accuracy/performance measured and compared. From there I felt that moving from Mitnik's 1st order to Wang's 3rd order FFT approach is very beneficial. Although I have left the machinery in the code that allows one to experiment with higher orders for an increase in accuracy that also will come with a performance hit. This change makes the FFT much more usable with improved accuracy as should be visible in the test cases.

Finally, regarding the name deprecations, after reading more around the numerical considerations around Stable distributions, I felt that we should try and keep with the general community in terms of naming. So 'quadrature' method is now 'dni' which is short for direct numerical integration (of characteristic function). I replaced 'zolotarev' with 'piecewise' which was a mistake on my part from my original PR; I hadn't realised that the original author of that method was actually Nolan who used expansions derived originally by Zolotarev. I've chosen 'piecewise' as it represents the segmenting of the parameter domain and calculates each segment separately.

I also deprecated the auto switch over to FFT when number of requested sample points reached 'pdf_fft_min_points_threshold'. While using the module I realised that one actually tends to explicitly choose what method one wishes to use and since the FFT had been considered unstable in 1.2 the attribute fell out of use.

@bsdz bsdz changed the title Stable improvements ENH: Stable improvements Nov 23, 2018
@tylerjereddy
Copy link
Contributor

Strange Azure failures noted & cross-referenced from a similar issue recently observed with Azure CI in NumPy. Hopefully someone from the Microsoft team can give us some advice there.

@ericsciple
Copy link

@tylerjereddy i'm looking right now...

@ericsciple
Copy link

@tylerjereddy can you try vmImage (the file contains vmIMage)

I do see in our data, the issue you are describing. You are correct it went to the wrong pool.

However I can't reproduce the issue on my account, but I might be hitting a server with different binaries version.

Here is what I tried.

jobs:
- job: Linux
  pool:
    vmIMage: ubuntu-16.04
  steps:
  - script: pwd
- job: Windows
  pool:
    vmIMage: 'VS2017-Win2016'
  variables:
      OPENBLAS_32: https://3f23b170c54c2533c070-1c8a9b3114517dc5fe17b7c3f8c63a43.ssl.cf2.rackcdn.com/openblas-5f998ef_gcc7_1_0_win32.zip
      OPENBLAS_64: https://3f23b170c54c2533c070-1c8a9b3114517dc5fe17b7c3f8c63a43.ssl.cf2.rackcdn.com/openblas-5f998ef_gcc7_1_0_win64.zip
  strategy:
    maxParallel: 1
    matrix:
        Python37-64bit-full:
          PYTHON_VERSION: '3.7'
          PYTHON_ARCH: 'x64'
          TEST_MODE: full
          OPENBLAS: $(OPENBLAS_64)
          BITS: 64
  steps:
  - script: cd

@tylerjereddy
Copy link
Contributor

@ericsciple Ok, see #9736

@rgommers
Copy link
Member

rgommers commented Mar 1, 2019

For reference, this is a follow-up to gh-7374. Azure issues were resolved.

@rgommers rgommers changed the title ENH: Stable improvements ENH: improvements to the Stable distribution Mar 1, 2019
@rgommers
Copy link
Member

rgommers commented Mar 1, 2019

Finally, regarding the name deprecations, after reading more around the numerical considerations around Stable distributions, I felt that we should try and keep with the general community in terms of naming.

It looks to me like it would be better to keep the old names around as aliases, rather than deprecating them. The rationale for deprecation is not very strong here.

@bsdz
Copy link
Contributor Author

bsdz commented Mar 1, 2019

Finally, regarding the name deprecations, after reading more around the numerical considerations around Stable distributions, I felt that we should try and keep with the general community in terms of naming.

It looks to me like it would be better to keep the old names around as aliases, rather than deprecating them. The rationale for deprecation is not very strong here.

I'm happy to remove the deprecation warnings and update the docstring. Then that should mean the new/old names simply behave as aliases. If that sounds good I'll try and update this PR in the following week.

@rgommers
Copy link
Member

rgommers commented Mar 1, 2019

yes that sounds good, thanks @bsdz

@bsdz
Copy link
Contributor Author

bsdz commented Mar 14, 2019

I've updated now to remove deprecations. Would it be possible to add this PR to the 1.3 milestone?

Also I see one of the checks failing - pypy3. It seems unrelated. It's showing:

FAILED scipy/io/matlab/tests/test_streams.py::TestZlibInputStream::test_seek

There seems to be a similar failure on different recently updated PR.

@rgommers rgommers added this to the 1.3.0 milestone Mar 14, 2019
@rgommers
Copy link
Member

milestone added. please ignore the pypy CI failure, it was already resolved in master but CircleCI runs your actual PR commit rather than the merge of that with master

Copy link
Contributor

@tylerjereddy tylerjereddy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a few comments. This is a rather big diff that I'd like a stats reviewer to confirm "ok" on. I'm unlikely to merge before next week otherwise.

In addition, the commit history could be condensed to remove the whitespace adjustment stuff & indeed there are still whitespace / formatting issues to clean up.

scipy/stats/_continuous_distns.py Outdated Show resolved Hide resolved
scipy/stats/_continuous_distns.py Outdated Show resolved Hide resolved
scipy/stats/_continuous_distns.py Outdated Show resolved Hide resolved
scipy/stats/tests/test_distributions.py Outdated Show resolved Hide resolved
@WarrenWeckesser
Copy link
Member

It's always nice to see proposals to improve scipy, and if I had infinite time I'd dig into this, but I don't, so I won't be able to do a real review of this in the short term. Instead, I'll just be that guy and nag about PEP-8. 😝 Please keep all lines less than 80 characters.

Re-add changes.
Reformat to less than 79 chars.
Copy link
Contributor

@tylerjereddy tylerjereddy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so I'm going to bump the milestone in absence of a resident stats reviewer giving the "ok."

A few other notes:

  • there seems to be a lot of hard work here & there's even a link to an exploratory whitepaper by the PR author
  • the commit messages should be prefixed with i.e., ENH and describe the change(s); currently, the commit messages are not informative wrt the main changes
  • are there no open issues related to the stability limitations addressed here?

return np.cos(alpha*xi)**(1/(alpha-1)) * \
(np.cos(theta)/np.sin(alpha*(xi+theta)))**(alpha/(alpha-1)) * \
(np.cos(alpha*xi+(alpha-1)*theta)/np.cos(theta))
return np.cos(alpha * xi)**(1 / (alpha - 1)) * (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is a duplicate of V(theta) from _pdf_single_value_piecewise() in the same module

The crusade against duplication is probably sometimes a bit much, but just bringing this up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think perhaps on a later refactor this class could go in its own module and common code could be pushed up to the top.

Regarding amending commits with ENH prefixes, I can't really help there as I'm likely to get it wrong. Sorry.

if beta > 0:

def V(theta):
expr_1 = np.pi/2+beta*theta
return 2. * expr_1 * np.exp(expr_1*np.tan(theta)/beta) / np.cos(theta) / np.pi
expr_1 = np.pi / 2 + beta * theta
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function also gets duplicated between pdf and cdf versions. Again, duplication isn't always so terrible--just bringing this up.

@tylerjereddy tylerjereddy modified the milestones: 1.3.0, 1.4.0 Apr 23, 2019
@tylerjereddy
Copy link
Contributor

@WarrenWeckesser @rlucas7 @josef-pkt Should I bump the milestone again?

@WarrenWeckesser
Copy link
Member

@tylerjereddy, I won't have time to look at this before 1.4 is branched.

@rlucas7
Copy link
Member

rlucas7 commented Nov 12, 2019 via email

@ragibson
Copy link
Contributor

Actually, this is also sufficient if you already have scipy installed from that commit

>>> from scipy.stats import levy_stable
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ryan/.local/lib/python3.9/site-packages/scipy/stats/__init__.py", line 453, in <module>
    from ._stats_py import *
  File "/home/ryan/.local/lib/python3.9/site-packages/scipy/stats/_stats_py.py", line 44, in <module>
    from . import distributions
  File "/home/ryan/.local/lib/python3.9/site-packages/scipy/stats/distributions.py", line 16, in <module>
    from ._levy_stable import levy_stable
ImportError: cannot import name 'levy_stable' from 'scipy.stats._levy_stable' (unknown location)

@steppi
Copy link
Contributor

steppi commented Dec 15, 2021

I see. I’ve been invoking tests with the pytest command directly and didn’t try the runtests.py script. I’ll try this out later today

@steppi
Copy link
Contributor

steppi commented Dec 15, 2021

Actually, this is also sufficient if you already have scipy installed from that commit

>>> from scipy.stats import levy_stable
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ryan/.local/lib/python3.9/site-packages/scipy/stats/__init__.py", line 453, in <module>
    from ._stats_py import *
  File "/home/ryan/.local/lib/python3.9/site-packages/scipy/stats/_stats_py.py", line 44, in <module>
    from . import distributions
  File "/home/ryan/.local/lib/python3.9/site-packages/scipy/stats/distributions.py", line 16, in <module>
    from ._levy_stable import levy_stable
ImportError: cannot import name 'levy_stable' from 'scipy.stats._levy_stable' (unknown location)

Oh. This actually works on my machine. I’ve been installing in a fresh virtual environment with pip install -e . on Ubuntu 20.04 with an X86_64 architecture. Using a recent Python 3.9 that was built from source.

@ragibson
Copy link
Contributor

ragibson commented Dec 15, 2021

I'm currently on x86_64 on a distro based on Ubuntu 18.04. It looks like it does work for me in a virtual environment with pip's -e flag.

There's no error on a fresh venv using pip install -e ., but I do get one on a fresh venv using pip install .. Pip's -e flag probably leaves some extra imports available (namely, 'scipy.stats._levy_stable' (unknown location)) for development purposes since it directly links to the checkout directory.

@steppi
Copy link
Contributor

steppi commented Dec 15, 2021

I'm currently on x86_64 on a distro based on Ubuntu 18.04. It looks like it does work for me in a virtual environment with pip's -e flag.

There's no error on a fresh venv using pip install -e ., but I do get one on a fresh venv using pip install .. Pip's -e flag probably leaves some extra imports available (namely, 'scipy.stats._levy_stable' (unknown location)) for development purposes since it directly links to the checkout directory.

I see. I’ll make a note to always test with pip install . before pushing. I didn’t realize this could be an issue. I guess this is why the src/ layout is recommended now for new projects. https://hynek.me/articles/testing-packaging/

@steppi
Copy link
Contributor

steppi commented Dec 18, 2021

@ragibson . I got it working by explicitly adding the line

    config.add_data_files(join('_levy_stable', '__init__.py'))

to scipy/stats/setup.py. There must be some quirk of the scipy build system that caused the __init__.py not to be included automatically. The build system is significantly more complex than any I'd worked with before. It seems kind of like an abuse to use add_data_files for this purpose. Do you think it's acceptable? If so, is it OK if I force push to this branch?

@steppi
Copy link
Contributor

steppi commented Dec 18, 2021

I found .py files being included with add_data_files in scipy/special/setup.py so I think this must be acceptable. I'll change the line to

    config.add_data_files(join('_levy_stable', '*.py'))

so there won't be problems down the line if someone decides to add another .py file to scipy/stats/_levy_stable/.

@ragibson
Copy link
Contributor

There must be some quirk of the scipy build system that caused the __init__.py not to be included automatically. The build system is significantly more complex than any I'd worked with before. It seems kind of like an abuse to use add_data_files for this purpose.

I found .py files being included with add_data_files in scipy/special/setup.py so I think this must be acceptable.

Sounds good to me. Honestly, I've not yet been curious enough to look deeply into how scipy's setup works. I think the data files list is a feature of setuptools, but I have no idea as to why it's needed for .py files sometimes here.

If there's a more standard way to do this, perhaps a maintainer will complain and suggest a better method during review.

@steppi
Copy link
Contributor

steppi commented Dec 18, 2021

Sounds good to me. Honestly, I've not yet been curious enough to look deeply into how scipy's setup works. I think the data files list is a feature of setuptools, but I have no idea as to why it's needed for .py files sometimes here.

If there's a more standard way to do this, perhaps a maintainer will complain and suggest a better method during review.

Cool, looks like everything is working. Thanks @ragibson for reproducing the bug and helping me figure out the problem with using pip install -e . vs pip install . before testing.

@rgommers we got the consolidation of code into scipy/stats/_levy_stable/__init__.py working. Let us know if everything looks OK. I guess we should wait to merge this one until at least after the xslow marking PR, gh-15234, goes in.

Copy link
Member

@rgommers rgommers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! That turned out to be harder than I expected, sorry if I caused you some headaches.

This now looks good to me. Please go ahead and merge @steppi. I added one comment about why add_data_files seemed necessary. You can ignore that, it's more an FYI - I suggest to open a new issue for that when you merge and Cc me, then I'll take it along when updating the Meson build for this PR.

@@ -96,6 +96,18 @@ def configuration(parent_package='', top_path=None):
# Type stubs
config.add_data_files('*.pyi')

# add levy stable module
config.add_data_files(join('_levy_stable', '*.py'))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the reason this didn't work otherwise is that it's a separate submodule with an __init__.py, and those are normally treated with config.add_subpackage(like right above in thissetup.py`).

@steppi steppi merged commit e569c7a into scipy:master Dec 18, 2021
@steppi
Copy link
Contributor

steppi commented Dec 18, 2021

Great! That turned out to be harder than I expected, sorry if I caused you some headaches.

This now looks good to me. Please go ahead and merge @steppi. I added one comment about why add_data_files seemed necessary. You can ignore that, it's more an FYI - I suggest to open a new issue for that when you merge and Cc me, then I'll take it along when updating the Meson build for this PR.

Thanks @rgommers! issue incoming later today. And thanks @bsdz and @ragibson for all of the hard work you put into this one! I'm glad I was able to help you push it to finish line.

@bsdz
Copy link
Contributor Author

bsdz commented Dec 18, 2021

Fantastic work @steppi and @ragibson! Thanks for getting it through and sorry to leave you guys to it for the last leg. Also thanks @rgommers for final merge :)

@tupui
Copy link
Member

tupui commented Jan 7, 2022

@steppi
Copy link
Contributor

steppi commented Jan 7, 2022

@steppi can you add a note in the wiki? https://github.com/scipy/scipy/wiki/Release-note-entries-for-SciPy-1.9.0

Sure thing. I should have a some time tomorrow.

@tupui
Copy link
Member

tupui commented Jan 7, 2022

Thanks 🙏😃

@steppi
Copy link
Contributor

steppi commented Jan 9, 2022

@tupui, I've written up a draft for the wiki. I plan to come back to it later for another round of editing. @ragibson, @bsdz, if either of you have time, let me know if you think everything looks OK or if you think I've left out anything important.

@bsdz
Copy link
Contributor Author

bsdz commented Jan 9, 2022

@tupui, I've written up a draft for the wiki. I plan to come back to it later for another round of editing. @ragibson, @bsdz, if either of you have time, let me know if you think everything looks OK or if you think I've left out anything important.

Reads well to me. Covers the important improvements. Thanks @steppi

@ragibson
Copy link
Contributor

ragibson commented Jan 9, 2022

@steppi Yes, this looks great to me -- good work! 👍

@steppi
Copy link
Contributor

steppi commented Jan 9, 2022

Thanks for taking a look. Looking it over again, I think the only that needs to change is to not mention the mean and variance when talking about "S0" being a proper location-scale parametrization. The variance only exists when alpha = 2 (normal distribution) and the mean only exists when alpha > 1.

steppi pushed a commit that referenced this pull request Feb 22, 2022
* Create test_levy_stable_fitstart.py

Test scipy.stats.levy_stable._fitstart() with data for negative beta, negative loc (delta), and alpha==1.

* Update _continuous_distns.py

Fix estimate of alpha when beta < 0, and allow negative values of loc (delta)

* TST: Tidy tests and resolve NameError

Uses numpy.testing.assert_allclose() instead of generic assert with numpy.isclose().
Removes named constants that were only used once, for the sake of simplicity.
Removes one test of dubious utility.
Resolves NameError caused by incorrect module reference.

* BUG: Fix levy_stable._fitstart()

Calculate alpha estimate when beta < 0, in style similar to the other parameters.
Adjust delta estimate from zeta, when alpha is *not* 1, fixing apparent typo in comparator.

* TST: levy_stable._fitstart() new tests

TestLevyStable.test_fit_beta_flip() confirms that sign of beta affects loc, not alpha or scale.
TestLevyStable.test_fit_delta_shift() confirms that loc slides up and down if data shifts.
TestLevyStable.test_fit_loc_extrap() confirms that loc goes out of sample for alpha close to 1.

* TST: delete test_levy_stable_fitstart.py

Tests improved and relocated to test_distributions.py

* STY: Resolve linter complaints

Replace lambda expression with def.
Wrap long line.

* STY: Resolve linter complaints

Removed whitespace on blank lines.

* STY: Resolve PEP8 violations

Following #9523.
Replaced lambda expressions with def.
Wrapped long lines.
Removed trailing whitespace.

* STY: Wrap long lines

* TST: Update reason for test failure

This is #14476 and the bug was not fixed.

Co-authored-by: Matt Haberland <mhaberla@calpoly.edu>
Co-authored-by: steppi <albert_steppi@hms.harvard.edu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C/C++ Items related to the internal C/C++ code base Cython Issues with the internal Cython code base maintenance Items related to regular maintenance tasks scipy.stats
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Levy stable