Windows only: problem with mutual_info_classif when discrete=True #9772

Closed
csalazar94 opened this issue Sep 14, 2017 · 7 comments · Fixed by #10414

csalazar94 commented Sep 14, 2017

Description

RuntimeWarning: invalid value encountered in log
log_outer = -np.log(outer) + log(pi.sum()) + log(pj.sum())

Steps/Code to Reproduce

import pandas as pd
from sklearn.feature_selection import mutual_info_classif

a = [1]*(52632+2529) + [2]*(14660+793) + [3]*(3271+204) + [4]*(814+39) + [5]*(316+20)

b = [0]*52632 + [1]*2529 + [0]*14660 + [1]*793 + [0]*3271 + [1]*204 + [0]*814 + [1]*39 + [0]*316+ [1]*20

df = pd.DataFrame([a,b]).T

mutual_info_classif(df.loc[:,0].values.reshape(-1, 1), df.loc[:,1], discrete_features=True)

Expected Results

array([ 1.48233078])

Actual Results

array([ nan])

Versions

Windows-7-6.1.7601-SP1
Python 3.6.2 |Continuum Analytics, Inc.| (default, Jul 20 2017, 12:30:02) [MSC v.1900 64 bit (AMD64)]
NumPy 1.13.1
SciPy 0.19.1
Scikit-Learn 0.19.0

lesteve (Member) commented Sep 15, 2017

I cannot reproduce this on an Ubuntu box with similar Python, NumPy, SciPy and scikit-learn versions. The output I get is array([ 0.00848119]).

lesteve (Member) commented Sep 15, 2017

Could you add a few more details about where your expected result comes from?

csalazar94 (Author) commented Sep 15, 2017

I'm trying to compute the mutual information between two discrete variables: one takes values from 1 to 5, the other takes the values 0 and 1.

I tried again but I still get this:

RuntimeWarning: invalid value encountered in log
log_outer = -np.log(outer) + log(pi.sum()) + log(pj.sum())

lesteve (Member) commented Sep 18, 2017

Actually, I can reproduce this on Windows. It looks like an int32 array is created somewhere and a computation on it overflows, so we end up taking the log of a negative int32. More investigation is needed to pinpoint the source of the problem.
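The overflow described here can be reproduced on any platform by forcing the marginal counts to int32. The sketch below is an illustration, not the scikit-learn code: the names pi and pj mirror the warning message, the counts are the class totals from the snippet in the first post, and the explicit dtype=np.int32 is an assumption standing in for the 32-bit default integer that NumPy used on Windows at the time.

```python
import numpy as np

# Marginal counts taken from the reproduction snippet. Storing them as int32
# is an assumption that mimics the platform where the bug appeared.
pi = np.array([55161, 15453, 3475, 853, 336], dtype=np.int32)  # feature values 1..5
pj = np.array([71693, 3585], dtype=np.int32)                   # labels 0 and 1

# Element-wise products of int32 arrays wrap around silently.
outer32 = pi[:, None] * pj[None, :]

# The same products computed in int64 are exact.
outer64 = pi.astype(np.int64)[:, None] * pj.astype(np.int64)[None, :]

# Taking the log of a wrapped (negative) product yields nan.
with np.errstate(invalid="ignore"):
    log_outer32 = np.log(outer32)

print(outer32[0, 0], outer64[0, 0])
print(np.isnan(log_outer32).any())
```

Only the first product is affected: 55161 * 71693 = 3954657573 exceeds the int32 maximum of 2147483647, so it wraps to a negative number and its log is nan.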

lesteve added the Bug label Sep 18, 2017
lesteve added this to the 0.20 milestone Sep 18, 2017
lesteve changed the title from "problem with mutual_info_classif when discrete=True" to "Windows only: problem with mutual_info_classif when discrete=True" Sep 18, 2017
lesteve (Member) commented Sep 18, 2017

@csalazar94, can you explain why you expected array([ 1.48233078]) as the output of your snippet?

thechargedneutron (Contributor) commented Dec 23, 2017

@lesteve I am experiencing this error on an Ubuntu system. Also, after fixing a probable integer overflow, I get [ 0.00012122] as the output, which is inconsistent with the array([ 0.00848119]) you reported.

The problem lies at line 605 of sklearn/metrics/cluster/supervised.py:
outer = pi.take(nzx) * pj.take(nzy)
I modified it to this:

outer = pi.take(nzx) * pj.take(nzy)
if np.any(outer<0):
    outer = pi.take(nzx).astype(np.int64) * pj.take(nzy).astype(np.int64)

and thus got [ 0.00012122]. Please verify.
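One way to check this value independently of NumPy integer types is to compute the mutual information directly from the contingency counts with Python integers, which cannot overflow. The sketch below assumes the natural-log definition of mutual information; the helper name mutual_information is hypothetical and is not part of scikit-learn.

```python
import math

# Contingency counts from the first post: rows are feature values 1..5,
# columns are labels 0 and 1.
counts = [
    (52632, 2529),
    (14660, 793),
    (3271, 204),
    (814, 39),
    (316, 20),
]


def mutual_information(table):
    """Mutual information, in nats, of a contingency table of counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    mi = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            if n_ij == 0:
                continue  # empty cells contribute nothing
            # Python ints have arbitrary precision, so these products are exact.
            mi += (n_ij / n) * math.log(n_ij * n / (row_totals[i] * col_totals[j]))
    return mi


mi = mutual_information(counts)
print(mi)
```

The result should land close to the [ 0.00012122] reported above. As a sanity bound, mutual information cannot exceed the entropy of the binary label, which is at most ln 2 (about 0.693 nats), so a value of 1.48 is not attainable for this data.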

lesteve (Member) commented Jan 2, 2018

> @lesteve I am experiencing this error on an Ubuntu system.

My guess is that you are using a 32-bit Python. I can reproduce the problem with a 32-bit Python.

> and thus got [ 0.00012122]. Please verify.

I get this value as well. I am not sure why I reported a different value in my earlier comment.

> The problem lies at line 605, sklearn/metrics/cluster/supervised.py

It seems you figured out where the integer overflow happens, thanks! I am not sure what the best fix is. One option is to cast pi and pj to int64, which also ensures that pi.sum() and pj.sum() do not overflow.
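A minimal sketch of the up-front cast suggested here, under stated assumptions: the contingency table is rebuilt from the counts in the first post, the int32 dtype is forced to mimic the affected platform, and the variable names follow the snippet quoted in this thread. This is not necessarily the change that was eventually merged.

```python
import numpy as np

# Contingency table from the first post, stored as int32 to mimic the
# platform where the bug appeared (an assumption for illustration).
contingency = np.array(
    [[52632, 2529], [14660, 793], [3271, 204], [814, 39], [316, 20]],
    dtype=np.int32,
)
nzx, nzy = np.nonzero(contingency)
pi = contingency.sum(axis=1, dtype=np.int32)
pj = contingency.sum(axis=0, dtype=np.int32)

# Cast the marginals once, before any arithmetic, so every later product
# and sum is carried out in 64-bit integers.
pi64 = pi.astype(np.int64, copy=False)
pj64 = pj.astype(np.int64, copy=False)
outer = pi64.take(nzx) * pj64.take(nzy)

# A sign check alone can miss an overflow: this int32 product wraps to a
# positive but wrong value instead of a negative one.
wrapped = np.array([70000], dtype=np.int32) * np.array([70000], dtype=np.int32)

print(outer.dtype, outer.min())
print(wrapped[0])
```

Casting before multiplying avoids relying on a negative result to detect the overflow. The last lines show why that matters: 70000 * 70000 wraps to a positive int32, so a check such as np.any(outer < 0) would not catch it.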

As for testing, you are more than welcome to add a test similar to the one in the first post (without the pandas dependency). This should fail on Windows without your fix.
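A regression test along these lines could look like the sketch below. It rebuilds the data from the first post with NumPy only; the helper name make_overflow_case and the test name are hypothetical, and the assertion only checks that the result is finite rather than pinning an exact value.

```python
import numpy as np

# Per-feature-value counts of (label 0, label 1) from the first post.
COUNTS = [(52632, 2529), (14660, 793), (3271, 204), (814, 39), (316, 20)]


def make_overflow_case():
    """Rebuild the feature and label arrays from the first post without pandas."""
    x = np.repeat(np.arange(1, 6), [n0 + n1 for n0, n1 in COUNTS])
    y = np.concatenate([np.repeat([0, 1], [n0, n1]) for n0, n1 in COUNTS])
    return x, y


def test_mutual_info_classif_int_overflow():
    # Hypothetical test name. scikit-learn is imported lazily so that the
    # data helper above can be used without it.
    from sklearn.feature_selection import mutual_info_classif

    x, y = make_overflow_case()
    mi = mutual_info_classif(x.reshape(-1, 1), y, discrete_features=True)
    # Before the fix, the affected platforms returned nan here.
    assert np.all(np.isfinite(mi))
```

The test is only expected to fail on platforms where the default integer is 32 bits wide, such as the Windows and 32-bit Python setups mentioned in this thread.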
