
MMD two sample tests with 1 instance in one sample #4631

Open
amir-rahnama opened this issue May 12, 2019 · 6 comments

@amir-rahnama

I noticed that when I run a quadratic-time MMD two-sample test in a case where one sample has a single instance, the results all come back NaN. Now, this could be due to an assumption in MMD itself, but I was wondering whether I can still interpret mmd.perform_test(alpha) or not?

@karlnapf
Member

Thanks for reporting! I think this should raise an assertion error, @lambday ?

@ambodi it won't return anything sensible. The test statistic might be OK (depending on whether you use the biased or unbiased estimator), but the p-value will be nonsense (a permutation test makes no sense with one sample, and the other approximations will also fail). Just curious, why are you interested in this case?

@amir-rahnama
Author

@karlnapf it does not raise an error but returns NaNs.

Great question. I have this case: one function outputs a single vector and another function outputs 5000 vectors, and I wanted to evaluate the divergence between these two functions. Now I can definitely run the functions over a range of values, let's say 700, but then the problem is that Shogun runs out of memory for 5000*700 samples.
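Just as a rough back-of-the-envelope (assuming the quadratic-time test materialises the full pooled kernel matrix, which is my guess for where the memory goes):

```python
# Hypothetical sizes: 5000 vectors per input from one function, 1 per input from
# the other, evaluated over 700 inputs, pooled for a quadratic-time MMD test.
pooled = 5000 * 700 + 1 * 700      # ~3.5 million samples in total
kernel_bytes = pooled ** 2 * 8     # full float64 kernel matrix
print(kernel_bytes / 1e12)         # ~98 TB, far beyond any machine's memory
```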

Any suggestions?

@lambday
Member

lambday commented May 14, 2019

@karlnapf yeah maybe there should be an assertion, but we've left it up to the user. For the unbiased case we should at least have that check.

@ambodi The NaN values come from how we compute the MMD^2 estimates in the unbiased case (see http://shogun.ml/api/latest/classshogun_1_1CMMD.html). It's the (n-1) term in the denominator that is messing things up. The unbiased estimator is the default; if you want to change it, you can use set_statistic_type().
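To illustrate, here is a minimal NumPy sketch of the two estimators with a Gaussian kernel (not Shogun's actual implementation; the data and bandwidth are placeholders):

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    # Gaussian kernel matrix between the rows of A and B.
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2(X, Y, sigma=1.0, unbiased=True):
    m, n = len(X), len(Y)
    Kxx, Kyy, Kxy = rbf(X, X, sigma), rbf(Y, Y, sigma), rbf(X, Y, sigma)
    if unbiased:
        # U-statistic: off-diagonal terms only, divided by m*(m-1) and n*(n-1).
        # With n == 1 the n*(n-1) denominator is zero, hence the NaN.
        xx = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
        yy = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    else:
        # Biased V-statistic: keep the diagonal, divide by m^2 and n^2.
        xx, yy = Kxx.mean(), Kyy.mean()
    return xx + yy - 2 * Kxy.mean()

X = np.random.randn(100, 3)
Y = np.random.randn(1, 3)                 # one sample has a single instance
print(mmd2(X, Y, unbiased=True))          # nan (0/0 from the n*(n-1) term)
print(mmd2(X, Y, unbiased=False))         # finite value
```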

It's interesting that Shogun runs out of memory for 5000x700 samples. What's the dimension? Can you post your code somewhere so that we can try it out?

[EDIT: fixed incorrect statement about scaling]

@karlnapf
Member


@ambodi Statistically, this problem seems pretty much impossible to solve with a test. You might want to try an outlier detection algorithm, like a one-class SVM: train it on the 5000 vectors and then apply it to the single vector to see whether it is similar enough or not.
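Roughly along these lines, as a minimal sketch with scikit-learn's OneClassSVM (the array shapes, dimension, and the nu/gamma values here are placeholders):

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Placeholder data: 5000 reference vectors from one function, one query vector
# from the other (dimension 10 chosen arbitrarily).
reference = np.random.randn(5000, 10)
query = np.random.randn(1, 10)

# Fit the one-class SVM on the large sample only.
clf = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
clf.fit(reference)

print(clf.predict(query))            # +1: looks like the reference data, -1: outlier
print(clf.decision_function(query))  # signed distance to the boundary (higher = more typical)
```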

@stale

stale bot commented Feb 26, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Feb 26, 2020
@bhavukkalra
Contributor

Hello @karlnapf, I am using the scikit-learn outlier detection tools for this task:
https://scikit-learn.org/stable/modules/outlier_detection.html
Could you please clarify which 5000 vectors you were referring to above, so that I can use them in the training process? We could also use try blocks to detect the cases that interfere with the (n-1) term in the denominator.
What do you suggest?
