Mode() not compatible with fillna() #9750

Closed
alfonsomhc opened this issue Mar 30, 2015 · 9 comments

Comments

@alfonsomhc
Contributor

I made a toy DataFrame:
df = pandas.DataFrame([[1, 1, 1],[2, 1, 1],[2, 1, 1],[numpy.nan, numpy.nan, numpy.nan]], columns=["a","b","c"])

I tried different methods to fill the missing values. These work as expected:
df.fillna(df.mean())
df.fillna(df.median())

But this doesn't work:
df.fillna(df.mode())

Inspecting the output of df.mode(), I see it has a different format than df.mean() and df.median(). As a user, I would expect these functions to behave the same way, so that I can fill missing values as described.
Using Pandas 0.15.2
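
A minimal sketch of what is likely going on here, assuming a recent pandas release rather than 0.15.2: `df.mean()` returns a Series keyed by column name, while `df.mode()` returns a DataFrame whose row index is `0, 1, ...`. When `fillna` receives a DataFrame it aligns on both index and columns, so the NaN row (label 3) finds no matching row in the mode table and stays unfilled.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 1, 1], [2, 1, 1], [2, 1, 1], [np.nan, np.nan, np.nan]],
    columns=["a", "b", "c"],
)

means = df.mean()  # Series indexed by column name
modes = df.mode()  # DataFrame with one row per tied mode (here a single row, label 0)

filled_mean = df.fillna(means)  # each column's NaN is filled from the Series
filled_mode = df.fillna(modes)  # aligned on row labels too, so row 3 is left as NaN

print(type(means).__name__, type(modes).__name__)
print(filled_mode)
```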

@alfonsomhc
Contributor Author

I have found that if I want to fill NaN with the mode, I need to do this:
df.fillna(df.mode().ix[0])
I would have expected mean, median and mode to all return the same type of object. As far as I understand, mean and median return a Series (for my example DataFrame), but mode returns a DataFrame...
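
For readers on current pandas: the `.ix` indexer used above was deprecated and later removed (it is gone as of pandas 1.0), so the same workaround would be written with `.iloc`. A sketch of the equivalent, using the toy frame from the first comment:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 1, 1], [2, 1, 1], [2, 1, 1], [np.nan, np.nan, np.nan]],
    columns=["a", "b", "c"],
)

# Take the first row of the mode table; this is a Series keyed by column name,
# the same shape of object that df.mean() and df.median() return.
first_mode = df.mode().iloc[0]

filled = df.fillna(first_mode)
print(filled)
```

If a column has several tied modes, this picks only the first one listed, which may or may not be what you want.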

@TomAugspurger
Contributor

mode can't reduce a DataFrame to a Series because several values can be tied for the highest count:

In [16]: df = pd.DataFrame({'A': [1, 2, 1, 2, 1, 2, 3]})

In [17]: df.mode()
Out[17]:
   A
0  1
1  2
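
To extend the example above, here is a sketch with a second column (`B` is an illustrative addition, not part of the original example). When columns have different numbers of tied modes, the shorter ones are padded with NaN, which is another reason the result has to be a DataFrame rather than a Series.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": [1, 2, 1, 2, 1, 2, 3],  # 1 and 2 each appear three times: two modes
        "B": [5, 5, 5, 6, 6, 7, 8],  # 5 appears three times: a single mode
    }
)

modes = df.mode()
print(modes)  # two rows; column B is padded with NaN in the second row
```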

@shoyer
Member

shoyer commented Mar 31, 2015

Hmm. If I were designing mode from scratch, I would probably have it just use the first such value -- similar to np.argmax. But at this point, we are probably stuck. We could consider adding some sort of keyword argument to change this behavior, but indexing is also pretty easy.
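
As a rough illustration of the comparison: `np.argmax` returns the index of the first maximum when there are ties, and indexing the result of `mode()` gives the analogous "first mode" behavior. `first_mode` below is a hypothetical helper written for this sketch, not a pandas function.

```python
import numpy as np
import pandas as pd

# np.argmax breaks ties by returning the first position of the maximum.
counts = np.array([3, 3, 1])
first_max_position = int(np.argmax(counts))


def first_mode(series):
    """Hypothetical helper: return only the first listed mode of a Series."""
    return series.mode().iloc[0]


s = pd.Series([1, 2, 1, 2, 1, 2, 3])
print(first_max_position, first_mode(s))
```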

@alfonsomhc
Contributor Author

Thanks for looking into this and also for the explanation. As a user I would like a parameter that controls this behavior, where the default is to return a Series (i.e. choose the first mode if there are several). Whatever you decide, may I suggest that at least the clarification/example given by TomAugspurger be added to the documentation (http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.mode.html)? I did read that page before creating this issue, and the reason why a DataFrame is returned was not clear to me...

@shoyer
Member

shoyer commented Mar 31, 2015

@alfonsomhc If you'd like to put together a PR with a documentation patch, it would be gratefully accepted.

@alfonsomhc
Contributor Author

I see that the page I referred to is generated from the docstrings in the file pandas/core/frame.py.
Should I just add the note there, then?

@alfonsomhc
Contributor Author

I didn't really know how to do the pull request. Hopefully I didn't break anything!

@alfonsomhc
Contributor Author

And now suddenly the issue is closed? Hopefully somebody can verify what I did. In case it wasn't clear enough, this is the first time I have contributed to an open source project...

@alfonsomhc alfonsomhc reopened this Apr 1, 2015
@mroeschke
Member

Closing as it looks like the proper documentation was added.
