Windows only: problem with mutual_info_classif when discrete=True #9772

Closed
csalazar94 opened this Issue Sep 14, 2017 · 7 comments

csalazar94 commented Sep 14, 2017

Description

RuntimeWarning: invalid value encountered in log
log_outer = -np.log(outer) + log(pi.sum()) + log(pj.sum())

Steps/Code to Reproduce

import pandas as pd
from sklearn.feature_selection import mutual_info_classif

a = [1]*(52632+2529) + [2]*(14660+793) + [3]*(3271+204) + [4]*(814+39) + [5]*(316+20)

b = [0]*52632 + [1]*2529 + [0]*14660 + [1]*793 + [0]*3271 + [1]*204 + [0]*814 + [1]*39 + [0]*316+ [1]*20

df = pd.DataFrame([a,b]).T

mutual_info_classif(df.loc[:,0].values.reshape(-1, 1), df.loc[:,1], discrete_features=True)

Expected Results

array([ 1.48233078])

Actual Results

array([ nan])

Versions

Windows-7-6.1.7601-SP1
Python 3.6.2 |Continuum Analytics, Inc.| (default, Jul 20 2017, 12:30:02) [MSC v.1900 64 bit (AMD64)]
NumPy 1.13.1
SciPy 0.19.1
Scikit-Learn 0.19.0

Member

lesteve commented Sep 15, 2017

Cannot reproduce on an Ubuntu box with similar Python, NumPy, SciPy and scikit-learn versions. The output I get is array([ 0.00848119]).

Member

lesteve commented Sep 15, 2017

Maybe you could add a bit more detail about where your expected result comes from?

Author

csalazar94 commented Sep 15, 2017

I'm trying to compute the mutual information between two discrete variables: one takes values between 1 and 5, the other is 0 or 1.

I tried again but I still got this:

RuntimeWarning: invalid value encountered in log
log_outer = -np.log(outer) + log(pi.sum()) + log(pj.sum())

Member

lesteve commented Sep 18, 2017

I can reproduce on Windows actually. It looks like we get an int32 somewhere and a computation overflows, so you end up taking the log of a negative int32. More investigation is needed to pinpoint the source of the problem.
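To illustrate the suspected overflow, here is a minimal sketch using the marginal counts of the two variables from the first post. The names `pi` and `pj` mirror the ones in the warning message; forcing them to `int32` is an assumption made to mimic the Windows behaviour on any platform, not a statement about what scikit-learn does internally.

```python
import numpy as np

# Marginal counts derived from the snippet in the first post.
pi = np.array([55161, 15453, 3475, 853, 336], dtype=np.int32)  # counts of a = 1..5
pj = np.array([71693, 3585], dtype=np.int32)                   # counts of b = 0, 1

# int32 * int32 stays int32, so products above 2**31 - 1 wrap around silently.
outer32 = pi[:, None] * pj[None, :]
# The same product computed in int64 holds the true values.
outer64 = pi.astype(np.int64)[:, None] * pj.astype(np.int64)[None, :]

print(outer32[0, 0], outer64[0, 0])

# Taking the log of the wrapped (negative) entry yields nan.
with np.errstate(invalid="ignore"):
    print(np.log(outer32[0, 0].astype(np.float64)))
```

The largest product, 55161 * 71693, exceeds the int32 maximum of 2147483647, which is consistent with the "invalid value encountered in log" warning.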

@lesteve lesteve added the Bug label Sep 18, 2017

@lesteve lesteve added this to the 0.20 milestone Sep 18, 2017

@lesteve lesteve changed the title problem with mutual_info_classif when discrete=True Windows only: problem with mutual_info_classif when discrete=True Sep 18, 2017

Member

lesteve commented Sep 18, 2017

@csalazar94 can you add some details about why you expected array([ 1.48233078]) as the output of your snippet?

Contributor

thechargedneutron commented Dec 23, 2017

@lesteve I am experiencing this error on an Ubuntu system. Also, after fixing a probable integer overflow, I get [ 0.00012122] as the output, which is inconsistent with your observation.

The problem lies at line 605, sklearn/metrics/cluster/supervised.py
outer = pi.take(nzx) * pj.take(nzy)
I modified it to this:

outer = pi.take(nzx) * pj.take(nzy)
if np.any(outer<0):
    outer = pi.take(nzx).astype(np.int64) * pj.take(nzy).astype(np.int64)

and thus got [ 0.00012122]. Please verify.

Member

lesteve commented Jan 2, 2018

> @lesteve I am experiencing this error on an Ubuntu system.

I am guessing this is because you are using a 32-bit python. I can reproduce the problem using a 32-bit python.

> and thus got [ 0.00012122]. Please verify.

This is what I get as well. Not sure why I had a different value in my previous post.
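As an independent cross-check of that value, the mutual information can be computed directly from the 5x2 contingency table in float64, where no integer overflow can occur. This is a sketch using the textbook definition with natural logarithms (an assumption about the units), not scikit-learn's code path.

```python
import numpy as np

# Contingency table from the first post: rows are a = 1..5, columns are b = 0, 1.
counts = np.array(
    [[52632, 2529],
     [14660, 793],
     [3271, 204],
     [814, 39],
     [316, 20]],
    dtype=np.float64,
)

p_xy = counts / counts.sum()            # joint distribution
p_x = p_xy.sum(axis=1, keepdims=True)   # marginal of a
p_y = p_xy.sum(axis=0, keepdims=True)   # marginal of b

# MI = sum_ij p_ij * log(p_ij / (p_i * p_j)); every cell is non-zero here.
mi = float(np.sum(p_xy * np.log(p_xy / (p_x * p_y))))
print(mi)
```

The result should land close to the 0.00012122 reported above. Note also that the mutual information with a binary variable is bounded by log(2), about 0.693 nats (1 bit), so the expected value of 1.48233078 in the first post cannot be correct.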

> The problem lies at line 605, sklearn/metrics/cluster/supervised.py

Seems like you figured out where the int overflow happens, thanks! Not sure what the best fix actually is. Maybe cast pi and pj to int64; that way pi.sum() and pj.sum() cannot overflow either.
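A minimal sketch of the cast-first idea follows. The helper name `outer_counts_int64` is hypothetical and the arrays are illustrative stand-ins for what the function sees; this is not the fix that was merged.

```python
import numpy as np

def outer_counts_int64(pi, pj, nzx, nzy):
    """Hypothetical helper: cast the marginals to int64 before multiplying."""
    pi64 = np.asarray(pi).astype(np.int64, copy=False)
    pj64 = np.asarray(pj).astype(np.int64, copy=False)
    return pi64.take(nzx) * pj64.take(nzy)

# Marginals from the first post, forced to int32 to mimic the failing setup.
pi = np.array([55161, 15453, 3475, 853, 336], dtype=np.int32)
pj = np.array([71693, 3585], dtype=np.int32)
# Every cell of the 5x2 contingency table is non-zero in this example.
nzx, nzy = np.nonzero(np.ones((5, 2), dtype=bool))

outer = outer_counts_int64(pi, pj, nzx, nzy)
print(outer.min() > 0)

# Made-up counts showing why testing `outer < 0` after the fact is fragile:
# 70000 * 70000 wraps around to a positive int32 value.
wrapped = np.array([70000], dtype=np.int32) * np.array([70000], dtype=np.int32)
print(wrapped[0])
```

Casting unconditionally avoids relying on the sign of the wrapped result, since an overflowed product can come out positive and slip past an `outer < 0` check.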

As for testing, you are more than welcome to add a test similar to the one in the first post (without the pandas dependency). This should fail on Windows without your fix.
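One possible shape for such a test, as a sketch only: the test name is hypothetical, and the assertions check just that the result is finite and non-negative rather than pinning an exact value.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def test_mutual_info_classif_large_counts():
    # Same data as the first post, built without pandas.
    # Each pair is (number of b == 0, number of b == 1) for a = 1..5.
    counts = [(52632, 2529), (14660, 793), (3271, 204), (814, 39), (316, 20)]
    a, b = [], []
    for value, (n_zero, n_one) in enumerate(counts, start=1):
        a += [value] * (n_zero + n_one)
        b += [0] * n_zero + [1] * n_one

    X = np.array(a).reshape(-1, 1)
    y = np.array(b)

    mi = mutual_info_classif(X, y, discrete_features=True)

    # With the overflow, the result was nan; a correct result is finite and >= 0.
    assert np.all(np.isfinite(mi))
    assert mi[0] >= 0
```

On platforms where the default integer is 64-bit, this passes with or without a fix, so it only guards against the regression on setups where the counts end up as int32.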
