Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Multivariate t-distribution #12582

Merged
merged 9 commits into from
Oct 11, 2020

Conversation

gwgundersen
Copy link
Contributor

@gwgundersen gwgundersen commented Jul 19, 2020

Reference issue

Closes #10042. Duplicates #11413 because the old branch was out-of-date.

What does this implement/fix?

Adds support for the multivariate t-distribution, particularly pdf, logpdf, and rvs. See my blog post for a high-level discussion of my approach.

Additional information

Completed:

  • No licensing issues.
  • Unit tests comparing logpdf and pdf to MATLAB's mvtpdf function; unit tests for arguments, reproducibility (with random_state), and some testable mathematical properties.
  • All tests pass locally.
  • All public functions have docstrings and examples.
  • Code style follows existing content in _multivariate.py.
  • Docstrings with new functionality.
  • Added myself to THANKS.txt
  • Added a documentation page via the docstring.

Would appreciate feedback on:

  1. How to unit test rvs?
  2. Should we support cdf and logcdf given that the multivariate t-distribution does not have a closed-form solution for the CDF? While I have seen some implementations (see Consider support for multivariate student-t distribution? #10042 (comment) in particular), I don't yet fully understand how they work and would prefer adding these methods to be a separate PR.
  3. Should we support entropy? I don't know anything about the differential entropy of the multivariate t-distribution, and would also prefer punting that to a future PR.

@gwgundersen
Copy link
Contributor Author

My tests pass locally, but the CI tests fail for a reason I don't understand, e.g.:

loc = rng.random(3)
E   AttributeError: 'mtrand.RandomState' object has no attribute 'random'

@mdhaber
Copy link
Contributor

mdhaber commented Aug 3, 2020

@gwgundersen Thanks for submitting this. Other opinions may differ, but I suspect it's OK to start without cdf and entropy in this PR and add later, since not all multivariate distributions have all the methods we might like.

I'm not sure if this will help, but try git submodule update --init before make html. You can also see the documentation built by CI: click "Details".
image

It looks like it's not showing up in the documentation. I think you need to add your distribution to scipy/stats/__init__.py:

Multivariate distributions
==========================
.. autosummary::
   :toctree: generated/
   multivariate_normal   -- Multivariate normal distribution
   matrix_normal         -- Matrix normal distribution
   dirichlet             -- Dirichlet
   wishart               -- Wishart
   invwishart            -- Inverse Wishart
   multinomial           -- Multinomial distribution
   special_ortho_group   -- SO(N) group
   ortho_group           -- O(N) group
   unitary_group         -- U(N) group
   random_correlation    -- random correlation matrices

I'm not sure about the failing test or how to test the RVS; perhaps @WarrenWeckesser has thoughts?

@gwgundersen
Copy link
Contributor Author

@mdhaber, thanks for that suggestion. I tried initializing and updating the git submodule, but it did not seem to work. Is it possible I need to rebuild Scipy from scratch to update the version number based on the commit? I've tried to do that and, oddly, am completely failing to do so—I was able to build Scipy previously, and I don't know if it's a change in the build procedure or something local to me. I wrote more details here.

@mdhaber
Copy link
Contributor

mdhaber commented Aug 16, 2020

@gwgundersen yes, it sounds to me like you need to build SciPy again, too. I'm not sure how to fix your build error, though. Sometimes my approach has been to nuke my environment and start from scratch, but I see that you've written up your issue, so hopefully you'll get a more helpful reply. I've also seen this issue for macs lately, but it looks different from yours.

THANKS.txt Outdated Show resolved Hide resolved
@rlucas7
Copy link
Member

rlucas7 commented Sep 14, 2020

@mdhaber and @gwgundersen I did some testing of this PR locally to see if I could get the docstrings to build.

I couldn't set my local changes to track this pr for some reason (not sure why) so I posted the changes here:

rlucas7@56c9d27

it is mostly spacing around sections of docstrings (e.g. before example line, etc).
The examples render locally for my html and latex doc builds.
There may be an extra newline character here or there but that should give you an idea of what's left for docstring changes.

Let me know if I can help further to get this PR to merging, seems close.

@mdhaber
Copy link
Contributor

mdhaber commented Sep 15, 2020

@rlucas7 Do you want me to add those to this PR?

@gwgundersen
Copy link
Contributor Author

@mdhaber and @rlucas7, thanks for keeping this fire alive. I should have time to look at this again later in the week. Also, I have access to a new machine, which may help with my issues building the docs. I hope that building the docs correctly and pushing will fix some of the failing tests.

@rlucas7
Copy link
Member

rlucas7 commented Sep 16, 2020

@mdhaber wrote:

@rlucas7 Do you want me to add those to this PR?

Assuming @gwgundersen is ok with that then yes, the edits I made were all only newline characters so nothing beyond cosmetic changes to make sphinx happy.

Would be nice to get the doc build to green so we can focus on the scientific/probabilistic aspects of the PR and move this one forward.

@rlucas7
Copy link
Member

rlucas7 commented Sep 16, 2020

@mdhaber and @rlucas7, thanks for keeping this fire alive. I should have time to look at this again later in the week. Also, I have access to a new machine, which may help with my issues building the docs. I hope that building the docs correctly and pushing will fix some of the failing tests.

No worries @gwgundersen please let us know if you run into issues w/dev setup.

@mdhaber
Copy link
Contributor

mdhaber commented Sep 16, 2020

OK I went ahead and pushed @rlucas7's changes to @gwgundersen's branch so we can see how they affect the doc rendering.
@gwgundersen you'll need to pull the changes e.g. git pull origin/10042-multivariate-t-again before continuing work.

Copy link
Member

@rlucas7 rlucas7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mdhaber I left a couple suggested changes to get the rvs reproducibility test to pass for me locally. One had some weird issue trying to overwrite the suggested change you made, perhaps you can take a look, and if they look ok to you, commit and see if gets to green?

I also left several comments, mostly minor, 2 less minor ones though.

scipy/stats/_multivariate.py Show resolved Hide resolved
scipy/stats/_multivariate.py Outdated Show resolved Hide resolved
scipy/stats/_multivariate.py Outdated Show resolved Hide resolved
scipy/stats/_multivariate.py Show resolved Hide resolved
scipy/stats/tests/test_multivariate.py Outdated Show resolved Hide resolved
scipy/stats/tests/test_multivariate.py Outdated Show resolved Hide resolved
scipy/stats/tests/test_multivariate.py Show resolved Hide resolved
scipy/stats/tests/test_multivariate.py Show resolved Hide resolved
scipy/stats/tests/test_multivariate.py Outdated Show resolved Hide resolved
scipy/stats/_multivariate.py Show resolved Hide resolved
Copy link
Member

@rlucas7 rlucas7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gwgundersen hopefully my review/clarifying comments here help. Ping me in comments if still unclear. I think is mostly a matter of adding some branching for np.inf df case.

scipy/stats/_multivariate.py Outdated Show resolved Hide resolved
scipy/stats/_multivariate.py Show resolved Hide resolved
scipy/stats/tests/test_multivariate.py Show resolved Hide resolved
scipy/stats/tests/test_multivariate.py Show resolved Hide resolved
scipy/stats/tests/test_multivariate.py Outdated Show resolved Hide resolved
Copy link
Member

@rlucas7 rlucas7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gwgundersen apologies for my confusing review yesterday. For the df parameter the way things are currently written in this PR, everything should work correctly if df is e.g. 1.1 except for the check that you have for int. So replace that check with the suggested change and I think the PR is good from that side. I think my comments for when df==np.inf is/are still valid though, let me know if unclear.

scipy/stats/_multivariate.py Show resolved Hide resolved
scipy/stats/_multivariate.py Outdated Show resolved Hide resolved
Copy link
Member

@rlucas7 rlucas7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gwgundersen left a couple more comments. I noticed something weird with the shapes, not sure the cause but behavior is inconsistent:

>>> import numpy as np
>>> import scipy.stats as ss
>>> loc = np.zeros(4)
>>> shape = np.eye(4)
>>> df = 1
>>> shape[2, 2] = 0
>>> ss.multivariate_t(loc, shape, df, allow_singular=True).pdf(np.zeros(4))
array([0.07599089])
>>> ss.multivariate_t(loc, shape, np.inf, allow_singular=True).pdf(np.zeros(4))
0.06349363593424098 

Could you take a look and add a test for that one?

Left a couple suggestions for the last unit test that was previously puzzling me-now I think I understand that one better.

Most other comments are cosmetic or style changes.

scipy/stats/tests/test_multivariate.py Outdated Show resolved Hide resolved
scipy/stats/tests/test_multivariate.py Outdated Show resolved Hide resolved
scipy/stats/tests/test_multivariate.py Show resolved Hide resolved
scipy/stats/tests/test_multivariate.py Outdated Show resolved Hide resolved
scipy/stats/tests/test_multivariate.py Outdated Show resolved Hide resolved
scipy/stats/_multivariate.py Show resolved Hide resolved
scipy/stats/_multivariate.py Show resolved Hide resolved
scipy/stats/tests/test_multivariate.py Outdated Show resolved Hide resolved
scipy/stats/_multivariate.py Outdated Show resolved Hide resolved
scipy/stats/_multivariate.py Outdated Show resolved Hide resolved
@gwgundersen
Copy link
Contributor Author

gwgundersen commented Sep 23, 2020

I noticed something weird with the shapes, not sure the cause but behavior is inconsistent:

>>> import numpy as np
>>> import scipy.stats as ss
>>> loc = np.zeros(4)
>>> shape = np.eye(4)
>>> df = 1
>>> shape[2, 2] = 0
>>> ss.multivariate_t(loc, shape, df, allow_singular=True).pdf(np.zeros(4))
array([0.07599089])
>>> ss.multivariate_t(loc, shape, np.inf, allow_singular=True).pdf(np.zeros(4))
0.06349363593424098 

Ah, good catch. When using np.inf, we delegate to multivariate_normal, which uses _squeeze_output, converting values like [[2]] to 2. I've added _squeeze_output to multivariate_t's _logpdf function and rvs functions. I've added a new unit test test_shape_correctness:

def test_shape_correctness(self):
    # pdf and logpdf should return scalar when the 
    # number of samples in x is one.
    dim = 4
    loc = np.zeros(dim)
    shape = np.eye(dim)
    df = 1
    x = np.zeros(dim)
    res = multivariate_t(loc, shape, df).pdf(x)
    assert np.isscalar(res)
    res = multivariate_t(loc, shape, df).logpdf(x)
    assert np.isscalar(res)

    # pdf() and logpdf() should return probabilities of shape 
    # (n_samples,) when x has n_samples.
    n_samples = 7
    x = np.random.random((n_samples, dim))
    res = multivariate_t(loc, shape, df).pdf(x)
    assert (res.shape == (n_samples,))
    res = multivariate_t(loc, shape, df).logpdf(x)
    assert (res.shape == (n_samples,))

    # rvs() should return scalar unless a size argument is applied.
    res = multivariate_t(np.zeros(1), np.eye(1), 1).rvs()
    assert np.isscalar(res)

    # rvs() should return vector of shape (size,) if size argument 
    # is applied.
    size = 7
    res = multivariate_t(1, 1, 1).rvs(size=size)
    assert (res.shape == (size,))

scipy/stats/tests/test_multivariate.py Outdated Show resolved Hide resolved
scipy/stats/tests/test_multivariate.py Outdated Show resolved Hide resolved
Comment on lines 1815 to 1819
try:
multivariate_t(loc, shape, df, allow_singular=True)
except np.linalg.LinAlgError:
pytest.fail("Test failed by unexpectedly raising `LinAlgError`.")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comment above-writing unit tests that happy code paths do not raise exceptions is unnecessary.

@gwgundersen
Copy link
Contributor Author

@rlucas, what if I just do a git push --force with my rebased branch, despite these failing tests? I suspect the issue has to do with my machine or OS (Ubuntu 20.04.1), since:

  1. The unit tests pass for you.
  2. The code and tests appear to rely on LAPACK.
  3. The same unit tests fail when using a newly cloned version of master.
  4. The same unit tests fail when using my current branch (not rebased).

Alternatively, I could ask the developers at #11524 for help, since that PR introduced the unit tests that are failing for me.

@gwgundersen
Copy link
Contributor Author

Also, the failing tests "smell" like they may be caused by differing mathematical or algorithmic conventions, e.g.:

>>> np.isclose(u1_2, u1)
array([[False,  True, False, ..., False, False, False],
       [False,  True, False, ..., False, False, False],
       [False,  True, False, ..., False, False, False],
       ...,
       [False, False, False, ..., False, False, False],
       [False,  True, False, ..., False, False, False],
       [False,  True, False, ..., False, False, False]])
>>> u1_2[0,0]
0.058479194
>>> u1[0,0]
-0.058478586
>>> u1_2[0,2]
0.040936727
>>> u1[0,2]
-0.040936757

@rlucas7
Copy link
Member

rlucas7 commented Oct 5, 2020 via email

gwgundersen and others added 7 commits October 5, 2020 22:43
Fixed doc building.

Removing unneeded newline

formatting

Added multivariate T to documentation.

Removing thanks update

test changes to multivariate-t docs
Co-authored-by: Lucas Roberts <rlucas7@users.noreply.github.com>
Fix rendering in docs.

Co-authored-by: Matt Haberland <mhaberla@calpoly.edu>

Added reference

Co-authored-by: Lucas Roberts <rlucas7@users.noreply.github.com>

Changed positive definite to positive semidefinite.
Fixes and tests for handling default arguments.

Removed unneeded comment.

Code and tests for defaulting to multivariate_normal when df=np.inf.

Amended code to allow `df` to be non-integer, but must be greater than zero. Added unit tests.

Amended documentation for df

Call _squeeze_output on pdf result; unit test pdf shapes.

Update scipy/stats/tests/test_multivariate.py

Co-authored-by: Lucas Roberts <rlucas7@users.noreply.github.com>

Update scipy/stats/_multivariate.py

Co-authored-by: Lucas Roberts <rlucas7@users.noreply.github.com>

Update scipy/stats/_multivariate.py

Co-authored-by: Lucas Roberts <rlucas7@users.noreply.github.com>

Added squeeze_output and tests for rvs.
Simplified tests using parameterize decorator; fixed names in pdf_correctness test.

Use parameterize and assert_equal

Use assert_raises alias rather than context manager pattern.

Added newline to break up long line

Removed test coverage

simplified unit test

Added numpy to imports

Removed unneeded numpy import; made axes eual.

Removed explicitly test not raising ValueError with df float; instead, use df float in another passing test.
@ilayn
Copy link
Member

ilayn commented Oct 6, 2020

Just to make sure that this is not a sign problem (since the results are unique up to columns signs), could you please check if this turns out all true

>>> np.allclose(np.abs(u1_2), np.abs(u1), rtol=0., atol=np.finfo(np.float32).eps*100)

@gwgundersen
Copy link
Contributor Author

gwgundersen commented Oct 6, 2020

@rlucas7, I rebased, confirmed the stats tests pass, and force pushed. The CI checks pass, and I didn't bring in any additional commits.

@gwgundersen
Copy link
Contributor Author

@ilayn, that command returns False:

> /home/gwg/projects/scipy/build/testenv/lib/python3.7/site-packages/scipy/linalg/tests/test_decomp_cossin.py(151)test_cossin_separate()
-> assert_allclose(u1_2, u1, rtol=0., atol=10*np.finfo(dtype_).eps)
(Pdb) np.allclose(np.abs(u1_2), np.abs(u1), rtol=0., atol=np.finfo(np.float32).eps*100)
False

Also, despite my linalg tests failing locally, the CI tests all pass.

@ilayn
Copy link
Member

ilayn commented Oct 6, 2020

That can happen also due to an outdated blas/lapack distro (mkl or openblas etc.) Because apparently not only signs are wrong but the results are also different so it is an actual miscalculation.

@rlucas7
Copy link
Member

rlucas7 commented Oct 7, 2020

That can happen also due to an outdated blas/lapack distro (mkl or openblas etc.) Because apparently not only signs are wrong but the results are also different so it is an actual miscalculation.

Thanks for commenting on this @ilayn I hadn't think about BLAS version issues.

So that we know in case of future questions, it would be great @gwgundersen if you could let us know which BLAS and version you have running.

Copy link
Member

@rlucas7 rlucas7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

making 2 minor (cosmetic) changes to docs and marking approved. I'll leave open until the upcoming weekend to give others some time to comment. If no further comments by Sunday October 11th, at 9am EST, I'll merge to master (sometime after 9am).

scipy/stats/_multivariate.py Outdated Show resolved Hide resolved
scipy/stats/_multivariate.py Outdated Show resolved Hide resolved
@gwgundersen
Copy link
Contributor Author

gwgundersen commented Oct 9, 2020

@ilayn, I'm not sure which BLAS and LAPACK versions are being used. This is what I tried when trying to find out:

>>> import numpy.distutils.system_info as sysinfo

>>> sysinfo.get_info('lapack')
{'libraries': ['lapack', 'lapack'],
 'library_dirs': ['/usr/lib/x86_64-linux-gnu'],
 'language': 'f77'}

>>> sysinfo.get_info('blas') 
{'libraries': ['blas', 'blas'],
 'library_dirs': ['/usr/lib/x86_64-linux-gnu'],
 'include_dirs': ['/usr/local/include',
  '/usr/include',
  '/home/gwg/miniconda3/envs/scipy/include'],
 'language': 'c',
 'define_macros': [('HAVE_CBLAS', None)]}

When I ls in those directories, I see multiple shared object files:

/usr/lib/x86_64-linux-gnu/lapack$ ls
liblapack.a  liblapack.so  liblapack.so.3  liblapack.so.3.9.0

/usr/lib/x86_64-linux-gnu/openblas-pthread$ ls
cmake       libblas.so.3  liblapack.so.3         libopenblasp-r0.3.8.so  pkgconfig
libblas.a   liblapack.a   libopenblas.a          libopenblas.so
libblas.so  liblapack.so  libopenblasp-r0.3.8.a  libopenblas.so.0

I also tried to re-build Scipy from scratch pointing to the latest object files:

export LAPACK=/usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
export BLAS=/usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
python setup.py install

But the tests still fail. Happy to try something else if you have advice, but I can't find any info online about how to find the actual BLAS or LAPACK versions being used.

@rlucas7 rlucas7 merged commit 882849f into scipy:master Oct 11, 2020
@rlucas7
Copy link
Member

rlucas7 commented Oct 11, 2020

Squashed and merged, nice work @gwgundersen

@rlucas7 rlucas7 added this to the 1.6.0 milestone Oct 11, 2020
@mdhaber
Copy link
Contributor

mdhaber commented Oct 11, 2020

Thanks @gwgundersen, @rlucas!

@powen-kao
Copy link

Hello @gwgundersen ,
I find the code in the following line has a problem with Lint check and cause failure in continuous-integration/travis-ci/pr
I guess it's because of space before "="

dev = x - loc

it produces the following error

36.65s$ pycodestyle scipy benchmarks/benchmarks
scipy/stats/_multivariate.py:4083:12: E221 multiple spaces before operator
1       E221 multiple spaces before operator
The command "pycodestyle scipy benchmarks/benchmarks" exited with 1.

@rlucas7
Copy link
Member

rlucas7 commented Oct 11, 2020

@hp5588 I'm seeing that now, I've got a PR to fix it.

ref:
#12938

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement A new feature or improvement scipy.stats
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Consider support for multivariate student-t distribution?
7 participants