MMD unbiased estimate issue #4595

Closed

saeidnp opened this issue Mar 29, 2019 · 12 comments

@saeidnp

saeidnp commented Mar 29, 2019

I ran the following code, which computes the unbiased MMD estimate multiple times. Since both distributions are Normal(0,1) and the estimate is unbiased, I expect the outputs to average to 0. However, it always prints the exact same number (-0.6577018139068969).

import numpy as np
from torch.distributions.normal import Normal
import shogun as sg

for _ in range(100):
    X = Normal(0,1).sample((7000,)).numpy().reshape(-1,1)
    Y = Normal(0,1).sample((7000,)).numpy().reshape(-1,1)
    mmd = sg.QuadraticTimeMMD()
    mmd.set_p(sg.RealFeatures(X.T.astype(np.float64)))
    mmd.set_q(sg.RealFeatures(Y.T.astype(np.float64)))
    mmd.set_kernel(sg.GaussianKernel(32, 1))
    mmd.set_statistic_type(sg.ST_UNBIASED_FULL)
    stat = mmd.compute_statistic()
    print('stat = ', stat)
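
(For reference, the quantity the snippet above is meant to estimate can be cross-checked with a plain numpy implementation of the unbiased MMD^2 estimator of Gretton et al. The sketch below is an independent sanity check, not Shogun's code: the kernel k(x, y) = exp(-||x - y||^2 / width) and the assumption that it matches the width convention of sg.GaussianKernel are mine, and compute_statistic may additionally rescale the estimate, so only the sign and average behaviour should be compared.)

import numpy as np

def unbiased_mmd2(X, Y, width=1.0):
    # Unbiased MMD^2 estimate with a Gaussian kernel.
    # X: (m, d) array, Y: (n, d) array. Everything is accumulated in float64.
    def gram(A, B):
        sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
        return np.exp(-sq / width)   # assumed kernel parameterisation
    m, n = len(X), len(Y)
    Kxx, Kyy, Kxy = gram(X, X), gram(Y, Y), gram(X, Y)
    # Drop the diagonal terms of Kxx and Kyy to keep the estimate unbiased.
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_x + term_y - 2 * Kxy.sum() / (m * n)

# Under H0 (both samples from N(0, 1)) this averages to ~0 over repetitions.
X = np.random.randn(2000, 1)
Y = np.random.randn(2000, 1)
print(unbiased_mmd2(X, Y))
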
@lambday
Member

lambday commented Mar 29, 2019

Hi @saeidnp. Thanks for letting us know. I'll check and get back to you on this.

@karlnapf
Member

Any updates, @lambday?

@karlnapf
Member

karlnapf commented Apr 13, 2019

I just tried and I get

import numpy as np
import shogun as sg

stats = []
for _ in range(100):
    N = 2000
    X = np.random.randn(1,N)
    Y = np.random.randn(1,N)
    mmd = sg.QuadraticTimeMMD()
    mmd.set_p(sg.RealFeatures(X.astype(np.float64)))
    mmd.set_q(sg.RealFeatures(Y.astype(np.float64)))
    mmd.set_kernel(sg.GaussianKernel(32, 1))
    mmd.set_statistic_type(sg.ST_UNBIASED_FULL)
    stat = mmd.compute_statistic()
    print('stat = ', stat)
    stats += [stat]
    print('Average so far:', np.mean(stats))
...
('stat = ', 0.09728989243740216)
('Average so far:', -0.061675398767896017)
('stat = ', 0.5960679263807833)
('Average so far:', -0.046726686832698761)
('stat = ', -0.24397649394813925)
('Average so far:', -0.051110015879708551)
('stat = ', 1.8716953927651048)
('Average so far:', -0.0093098983004734764)
('stat = ', -0.37008902290835977)
('Average so far:', -0.016986049887875315)
('stat = ', -0.15827194147277623)
('Average so far:', -0.01992950596256075)
('stat = ', 0.9893259266391397)
('Average so far:', 0.00066754368237191238)
('stat = ', -0.36889349576085806)
('Average so far:', -0.006723677106492687)
...

I guess there is an issue with your random number generator?

@saeidnp
Author

saeidnp commented Apr 19, 2019

@karlnapf your code uses fewer samples than mine.
When I run your code as is, I get the expected output. But when I tried it with N = 7000, I got that magic number (-0.6577018139068969) again!

@karlnapf
Member

That seems strange to me but I’ll check!

@karlnapf
Member

I get exactly the same magic number. Crazy! This is a bug and I will dig deeper once I have a moment. @lambday might also be able to help

@karlnapf
Member

This is an overflow issue caused by a .sum() call on an Eigen3 block. Will fix it next.
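
A rough numpy illustration of the effect (not the Eigen code path itself): a 7000 x 7000 kernel matrix has 49 million entries, and a single-precision accumulator stops growing once it reaches 2**24 ≈ 1.68e7, which is why the statistic collapses to the same constant regardless of the data.

import numpy as np

# float32 carries 24 mantissa bits: once an accumulator reaches 2**24,
# adding terms of order 1 is rounded away completely.
acc = np.float32(2**24)
print(acc + np.float32(1.0) == acc)   # True

# Sequential float32 accumulation of 20 million ones saturates near 2**24
# instead of reaching 20 000 000 (np.cumsum accumulates in float32 here).
ones = np.ones(20_000_000, dtype=np.float32)
print(np.cumsum(ones)[-1])            # 16777216.0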

karlnapf added a commit to karlnapf/shogun that referenced this issue May 2, 2019
karlnapf added a commit to karlnapf/shogun that referenced this issue May 2, 2019
@karlnapf
Member

karlnapf commented May 8, 2019

(13:52:45) Heiko: block.cast<float64>().sum()
(13:52:49) Heiko: is that lazy evaluated?
(13:52:59) Heiko: or is a new matrix allocated
(13:52:59) Heiko: ?
(13:53:43) ChriSopht: the casting is done lazily. but this does not get vectorized
(13:54:00) Heiko: i see
(13:54:18) Heiko: so if I want to have it vectorized I need to allocate a new matrix with the casted values
(13:55:09) ChriSopht: HeikoS: that would vectorize the summation but not the conversion. so likely not worth the overhead
(13:55:44) ChriSopht: eventually this needs to be implemented in Eigen, but it is not too trivial to do generically
(13:55:56) Heiko: ChriSopht: alright thanks
(13:56:07) Heiko: well I can think about changing the original array in the first place maybe
(13:56:12) Heiko: but this is helpful, thanks!
(13:57:07) ChriSopht: sure that is possible, depends of course if that is beneficial (more memory/cache throughput).
(13:58:14) ChriSopht: you could try to implement a custom SIMD-cast+summation function (it is not very hard if you just consider one target architecture)
(13:58:40) Heiko: ah that might be an idea
(13:58:51) Heiko: do you have any pointers for that?
(14:01:32) ChriSopht: you need one cvtps_pd (depending on your architecture): https://software.intel.com/sites/landingpage/IntrinsicsGuide/#cats=Convert&techs=SSE2,AVX,AVX_512&expand=1762&text=cvtps_pd and then one add_pd in your main-loop
(14:02:13) ChriSopht: depending on your block-sizes you can also use two or four accumulators (better throughput)
(14:03:22) ChriSopht: last reduction can be done by Eigen::internal::predux
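
(A rough numpy analogue of the options discussed above, not the Eigen code itself: the block can either be materialised as a float64 copy before summing, the analogue of allocating the casted matrix, or converted on the fly while accumulating in double precision, the analogue of the lazy block.cast<float64>().sum(). The 7000 x 7000 size and the constant 0.5 entries are stand-ins, not real kernel values.)

import numpy as np

# Stand-in for a float32 kernel block (~200 MB of data).
K = np.full((7000, 7000), 0.5, dtype=np.float32)

# Option 1: allocate a float64 copy, then sum. The reduction vectorizes over
# the copy, but the conversion costs an extra ~400 MB allocation.
print(K.astype(np.float64).sum())   # 24500000.0

# Option 2: accumulate in float64 while reducing the float32 data directly.
# No extra full-size array; the cast happens as part of the reduction.
print(K.sum(dtype=np.float64))      # 24500000.0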

@gbaydin

gbaydin commented May 10, 2019

@saeidnp can you please confirm whether the fix for this closed issue resolved the "magic number" problem? The issue is closed, but I don't see a message here saying that the bug is fixed.

@karlnapf
Member

@gbaydin yes, it is solved via my hotfix. Could you confirm whether you are using it?

It will (unfortunately) now be a tiny bit slower than before due to the lack of low-level vectorization in Eigen3 when summing the kernel matrix: it is now a C++ loop rather than SIMD summations of the columns. But that should only be noticeable when sampling from the null distribution (repeated computation of the test). We will run benchmarks and potentially try to improve it.

@saeidnp
Author

saeidnp commented May 10, 2019

@gbaydin @karlnapf yes it passes my tests.

@karlnapf
Member

Great, thanks
