ValueError: The first argument of bincount must be non-negative #6

Closed
AGS0331 opened this issue Mar 19, 2016 · 1 comment

AGS0331 commented Mar 19, 2016

I am testing pandas-profiling with the DataFrame I am currently working with and get the following error:

/usr/local/lib/python2.7/site-packages/numpy/lib/function_base.pyc in histogram(a, bins, range, normed, weights, density)
    247                 n.imag += np.bincount(indices, weights=tmp_w.imag, minlength=bins)
    248             else:
--> 249                 n += np.bincount(indices, weights=tmp_w, minlength=bins).astype(ntype)
    250 
    251         # We now compute the bin edges since these are returned
ValueError: The first argument of bincount must be non-negative

The dataframe I'm using:

print df.head()

0         1   2   3         4    5    6            7    8    9         10  \
0   1  0.697663   1   1  0.000005  1.0  inf     2.307568  inf  inf  0.000005   
1   1  0.983510   1   1  0.000008  1.0  inf    59.642170  inf  inf  0.000008   
2   1  1.000000   1   1  0.000000  1.0  inf          inf  inf  inf  0.000000   
3   1  0.999195   1   1  0.000004  1.0  inf  1241.660000  inf  inf  0.000004   
4   1  1.000000   1   1  0.000064  1.0  inf          inf  inf  inf  0.000064   

    11  
0  inf  
1  inf  
2  inf  
3  inf  
4  inf  

JosPolfliet (Contributor) commented

Awesome you found a bug!

After investigating, this turns out to be a bug (or feature) in pandas, which calls NumPy with illegal arguments. It is triggered by the infinities in your data.

As a workaround you can replace the infinities with missing values like so:

df.replace(to_replace=[np.inf, -np.inf], value=np.nan, inplace=True)

Depending on what you are trying to do, NaN might actually be better than infinities. Are the infinities real, mathematical infinities? If they are right-censored, for example because your sensor can only read up to a certain upper limit, then replacing them with that upper limit might be more appropriate.

In any case, many of the calculated statistics will be invalid. For example, when the data contain a positive infinity, the mean should be positive infinity as well, but after the replacement the value is ignored, which skews the results.
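To make the two choices above concrete, here is a minimal sketch that treats the infinities either as missing values or as right-censored readings. The limit `SENSOR_MAX` is a made-up value for illustration only; it does not come from the reported data.

```python
import numpy as np
import pandas as pd

# Hypothetical upper limit of the sensor; an assumption, not from the issue.
SENSOR_MAX = 2000.0

series = pd.Series([2.307568, 59.642170, np.inf, 1241.66, np.inf])

# Option A: treat infinities as missing values.
as_missing = series.replace([np.inf, -np.inf], np.nan)

# Option B: treat positive infinities as readings censored at SENSOR_MAX.
as_censored = series.replace(np.inf, SENSOR_MAX)

print(as_missing.isna().sum())  # number of values now treated as missing
print(as_censored.max())        # largest value after applying the limit
```

Which option is appropriate depends on what the infinities mean in the data, so neither should be read as the recommended default.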
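A small sketch of the effect on the mean, using made-up numbers rather than the reported data: with the infinity present the mean is infinite, and after replacing it with NaN the mean is computed over the remaining values only.

```python
import numpy as np
import pandas as pd

# Illustrative values, not taken from the issue.
series = pd.Series([1.0, 2.0, 3.0, np.inf])

# With the infinity in place, the mean is positive infinity.
mean_with_inf = series.mean()

# After the replacement, pandas skips the NaN and averages 1, 2 and 3.
mean_after_nan = series.replace(np.inf, np.nan).mean()

print(mean_with_inf, mean_after_nan)
```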

I will leave this bug open for now to think about how we should deal with infinities:

  • Replace them with NaNs by default
  • Group them in separate bins and plot them separately
  • Reject the variable since algorithms won't fit anyway

Would love input on this as well.
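As one possible starting point for that discussion, the sketch below combines the second and third options: it counts the infinities separately, histograms only the finite values, and flags the variable as rejected when no finite values remain. The function name `describe_with_infinities` and the shape of its return value are hypothetical; this is not how pandas-profiling handles the case.

```python
import numpy as np
import pandas as pd


def describe_with_infinities(series, bins=10):
    """Hypothetical helper: report infinities apart from the finite histogram."""
    values = series.to_numpy(dtype=float)

    # Count positive and negative infinities as their own groups.
    n_pos_inf = int(np.isposinf(values).sum())
    n_neg_inf = int(np.isneginf(values).sum())

    # Keep only finite values; this also drops NaNs.
    finite = values[np.isfinite(values)]

    # Reject the variable if there is nothing finite left to bin.
    if finite.size == 0:
        return {"rejected": True, "n_pos_inf": n_pos_inf,
                "n_neg_inf": n_neg_inf, "counts": None, "edges": None}

    counts, edges = np.histogram(finite, bins=bins)
    return {"rejected": False, "n_pos_inf": n_pos_inf,
            "n_neg_inf": n_neg_inf, "counts": counts, "edges": edges}


summary = describe_with_infinities(
    pd.Series([2.307568, 59.642170, np.inf, 1241.66, np.inf])
)
print(summary["n_pos_inf"], summary["counts"].sum())
```

Keeping the infinity counts next to the histogram avoids silently discarding them, at the cost of a slightly more complex result for every numeric variable.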

To reproduce:

import pandas as pd
import numpy as np
series=pd.Series([2.307568, 59.642170, np.inf, 1241.660000, np.inf])
plot = series.plot(kind='hist', bins=10)
