
Feature request: Option to include NaNs in value_counts() #5569

Closed
michaelaye opened this issue Nov 22, 2013 · 10 comments · Fixed by #7424
Labels
Enhancement Numeric Operations Arithmetic, Comparison, and Logical operations
Milestone

Comments

@michaelaye
Contributor

I find it highly valuable to also know how many NaN values are in my Series.
Could value_counts() take an option, maybe include_nans=True, that adds a count for those to its output?

@jtratner
Contributor

To be clear, you can already do this: series.isnull().sum()

The problem is that you'd end up with NaN in the resulting Index, which causes problems.
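The workaround in practice looks something like this (a minimal sketch, assuming a Series with missing values):

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "a", np.nan, np.nan])

# value_counts() silently drops missing values,
# so the resulting index stays NaN-free...
counts = s.value_counts()
print(counts)

# ...and the NaN tally has to be fetched separately.
print("NaN:", s.isnull().sum())
```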

@michaelaye
Contributor Author

Yes, I know the workaround, but I would like to see it solved in value_counts, because it is a relevant piece of information within the scope of value_counts.
I see the problem with NaN in the Index. How about converting the index to dtype 'object' just for this output, so the NaN could be packed into a string?

@jtratner
Contributor

This is trivial to implement, and the issues with NaN in an Index aren't that big. We wouldn't include NaN by default.

So we're down to a design decision. To me, NaN means "missing value", so it doesn't make sense for it to show up as a counted value, or in the mode.

@michaelaye
Contributor Author

I wouldn't include it by default either. But it is extremely helpful to see, in one overview, the rough ratio between successful measurements and NaNs.
In my case, I have 3 different categories plus 'not categorized'. When displayed like so:

'a'   1000
'b'   5000
'c'   4000
'nan' 100000

I immediately know that something went wrong, which I wouldn't suspect if I didn't see the NaNs. Without the NaN count, I first have to sum up all the real values and compare that to the length of the Series; it's always one more step. Sure, I can write my own wrapper, but I thought it would be useful to have this at least as an option to the value_counts call.
What did you mean by "or for the mode"?
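The wrapper mentioned above could look roughly like this (a sketch, not pandas API; the function name and the 'NaN' string label are arbitrary choices, echoing the pack-NaN-into-a-string idea):

```python
import numpy as np
import pandas as pd

def value_counts_with_nan(series):
    """value_counts() plus one extra row counting missing values,
    filed under a string label so the index stays NaN-free."""
    counts = series.value_counts()
    counts["NaN"] = series.isnull().sum()
    return counts

s = pd.Series(["a"] * 3 + ["b"] * 2 + [np.nan] * 4)
print(value_counts_with_nan(s))
```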

@jtratner
Contributor

Well, mode is really a special case of value_counts(), so if you add the option for one, it makes sense to use the same kwarg for the other.

@michaelaye
Contributor Author

I have never used 'mode' and don't understand what it does. In my case:

print(df.marking.mode())
print()
print(df.marking.value_counts())

0    blotch
dtype: object

blotch         3854641
fan            3192799
None           2785831
interesting     884843
dtype: int64

I cheated by replacing the NaNs with the string 'None'. ;)
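For context on the mode question: mode() returns the most frequent value(s), and, like value_counts(), it skips missing values by default. A minimal sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series(["blotch", "blotch", "fan", np.nan])

# mode() is effectively the index of the top row of value_counts();
# both ignore NaN unless told otherwise.
print(s.mode())
print(s.value_counts())
```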

@ghost ghost assigned jtratner Dec 5, 2013
@jreback jreback modified the milestones: 0.15.0, 0.14.0 Feb 18, 2014
@nmontpetit

From my experience, it is very helpful for nan to show up as a counted value.

I work in a SAS shop, but I'm moving all of my analysis and reporting work from SAS to Python. I use value_counts to give me the results I would get from PROC FREQ in SAS. I use PROC FREQ daily, and almost always I'm looking at real-world data with missing values. I honestly cannot remember a case where I didn't want the missing values to be in the frequency counts.

I've got to believe I'm nowhere near the only person who needs to see frequency counts for nan values. I could see the lack of this feature slowing the adoption of pandas among SAS users.

SAS does not report missing values in frequency reports by default, but I'm OK with always selecting that option when I run PROC FREQ.

I do know how to add the missing counts to my value_counts output, but it's annoying to need to do it pretty much every time I use value_counts. If I -- and probably others -- need to do this every time we use value_counts it seems like including nan counts in the results is a reasonable option to add to the method.

@hayd
Contributor

hayd commented Jun 13, 2014

+1 will fix along with #7424.

@jankatins
Contributor

Something similar is in #7217. If NaN is a problem in index, this will also come up in Categoricals: jreback@725a497

@abhishekmamdapure

Why wait for the feature? You can try something like this, which I have been using for a long time:

Name Designation
aaa Data Scientist
bbb Data Scientist
ccc Data Scientist
ddd  
eee ML Engineer
fff ML Engineer
ggg  
hhh Data Analyst
iii  

df['Designation'].astype('str').value_counts()

Data Scientist    3
nan               3
ML Engineer       2
Data Analyst      1
Name: Designation, dtype: int64
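For readers landing here later: current pandas supports this directly via the dropna keyword on value_counts() (the option that came out of the fix referenced at the top of this issue), which avoids the astype('str') detour:

```python
import numpy as np
import pandas as pd

s = pd.Series(
    ["Data Scientist"] * 3 + [np.nan] * 3 + ["ML Engineer"] * 2 + ["Data Analyst"]
)

# dropna=False keeps missing values as a counted category.
print(s.value_counts(dropna=False))
```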
