Incorrect calculation for % unique for variables with missing values #56

Closed
mhowison opened this issue Aug 23, 2017 · 1 comment
Labels
bug 🐛 Something isn't working


@mhowison

We've found that the % unique calculation sometimes shows up as >100%. It appears to happen when a variable has a small number of unique values and "missing" is one of them. It looks like "missing" is counted in the numerator but not in the denominator of the calculation.

For example, we have a variable with 2 possible values ("Y" and "missing") that shows 200.0% unique (i.e. 2/1).

Another variable with 3 possible values ("Y", "N", "missing") shows 150.0% unique (i.e. 3/2).
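
The arithmetic behind those two figures can be reproduced with plain pandas. This is only a minimal sketch of the suspected mismatch, not the library's actual code: the function name `pct_unique_inconsistent` is hypothetical, and the one-row-per-value series are simplified stand-ins for the real variables.

```python
import numpy as np
import pandas as pd


def pct_unique_inconsistent(series: pd.Series) -> float:
    """Hypothetical helper mimicking the suspected numerator/denominator mismatch."""
    # Numerator: distinct values, with NaN counted as one of them.
    distinct = series.nunique(dropna=False)
    # Denominator: Series.count() skips NaN, so it is smaller than the numerator's basis.
    non_missing = series.count()
    return 100.0 * distinct / non_missing


two_values = pd.Series(["Y", np.nan])
three_values = pd.Series(["Y", "N", np.nan])

print(pct_unique_inconsistent(two_values))    # 200.0 (2 distinct / 1 non-missing)
print(pct_unique_inconsistent(three_values))  # 150.0 (3 distinct / 2 non-missing)
```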

romainx added a commit to romainx/pandas-profiling that referenced this issue Oct 6, 2017
…inct_count. But when the percentage of unique values is computed, the distinct count is divided by the count that does not include na. This is the cause of the inconsistent results highlighted in ydataai#56. I've fixed the implementation to be consistent: na is considered as a distinct value also in the percentage of unique values. Another implementation would have been to not consider na at all. A specific test case has been added.
@romainx romainx mentioned this issue Oct 6, 2017
romainx added a commit that referenced this issue Oct 31, 2017
@romainx romainx added the bug 🐛 Something isn't working label Oct 31, 2017
@romainx
Contributor

romainx commented Oct 31, 2017

The current implementation includes NA in the distinct count. But when the percentage of unique values is computed, the distinct count is divided by a count that does not include NA. This is the cause of the inconsistent behavior highlighted here.

I've fixed the implementation to be consistent: NA is now treated as a distinct value in the percentage of unique values as well, so the result is consistent.

import numpy as np
import pandas as pd
import pandas_profiling

df = pd.DataFrame({'test': [0, 1, np.nan]})
pandas_profiling.ProfileReport(df)
# With the fix, the report shows:
# Unique (%)	100.0%
# Instead of previously
# Unique (%)	150.0%
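
For comparison, here is a rough sketch of the two consistent options discussed in this thread: counting NA in both the numerator and the denominator, or in neither. The function names are hypothetical and this is not the code from the referenced commit, only an illustration of the idea using standard pandas calls.

```python
import numpy as np
import pandas as pd


def pct_unique_with_na(series: pd.Series) -> float:
    """Hypothetical helper: NA counts as a value in numerator and denominator."""
    total = len(series)
    if total == 0:
        return float("nan")
    return 100.0 * series.nunique(dropna=False) / total


def pct_unique_without_na(series: pd.Series) -> float:
    """Hypothetical helper: NA is ignored in numerator and denominator."""
    non_missing = series.count()
    if non_missing == 0:
        return float("nan")
    return 100.0 * series.nunique() / non_missing


s = pd.Series([0, 1, np.nan])
print(pct_unique_with_na(s))     # 100.0 (3 distinct / 3 rows)
print(pct_unique_without_na(s))  # 100.0 (2 distinct / 2 non-missing rows)
```

Either way the ratio cannot exceed 100%, because the numerator and denominator are taken over the same set of rows.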

@romainx romainx closed this as completed Oct 31, 2017
chanedwin pushed a commit to chanedwin/pandas-profiling that referenced this issue Oct 11, 2020