Incorrect calculation for % unique for variables with missing values #56
Labels: bug 🐛
romainx added a commit to romainx/pandas-profiling that referenced this issue Oct 6, 2017
…inct_count. But when the percentage of unique values is computed, the distinct count is divided by the count that does not include na. This is the cause of the inconsistent results highlighted in ydataai#56. I've fixed the implementation to be consistent: na is also considered a distinct value in the percentage of unique values. Another implementation would have been to not consider na at all. A specific test case has been added.
Merged
The current implementation includes …
I've fixed the implementation to be consistent. Now:
import numpy as np
import pandas as pd
import pandas_profiling
df = pd.DataFrame({'test': [0, 1, np.nan]})
pandas_profiling.ProfileReport(df)
# Unique (%) 100.0%
# Instead of previously
# Unique (%) 150.0%
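The two consistent options mentioned in the commit message (count na as a distinct value everywhere, or ignore it everywhere) can be sketched with plain pandas. This is only an illustration under stated assumptions, not the code from the pull request: the helper names `pct_unique_with_na` and `pct_unique_without_na` are hypothetical, and empty or all-na columns (which would divide by zero) are not handled.

```python
import numpy as np
import pandas as pd

def pct_unique_with_na(s: pd.Series) -> float:
    # Hypothetical helper: na counts as one distinct value in the numerator
    # and as an observation in the denominator.
    return 100.0 * s.nunique(dropna=False) / len(s)

def pct_unique_without_na(s: pd.Series) -> float:
    # Hypothetical helper: na is ignored in both numerator and denominator.
    return 100.0 * s.nunique() / s.count()

s = pd.Series([0, 1, np.nan])
print(pct_unique_with_na(s))     # 3 distinct / 3 rows
print(pct_unique_without_na(s))  # 2 distinct / 2 non-missing rows
```

Either choice keeps the result at or below 100%, because the numerator and denominator are computed over the same set of values.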
chanedwin pushed a commit to chanedwin/pandas-profiling that referenced this issue Oct 11, 2020
…inct_count. But when the percentage of unique values is computed, the distinct count is divided by the count that does not include na. This is the cause of the inconsistent results highlighted in ydataai#56. I've fixed the implementation to be consistent: na is also considered a distinct value in the percentage of unique values. Another implementation would have been to not consider na at all. A specific test case has been added.
chanedwin pushed a commit to chanedwin/pandas-profiling that referenced this issue Oct 11, 2020
Fixed issue ydataai#56
We've found that the % unique calculation sometimes shows up as >100%, and it appears to happen in cases where there are a small number of unique values and "missing" is one of them. It looks like "missing" is counted in the numerator but not in the denominator of the calculation.
For example, we have a variable with 2 possible values ("Y" and "missing") that shows 200.0% unique (i.e. 2/1).
Another variable with 3 possible values ("Y", "N", "missing") shows 150.0% unique (i.e. 3/2).
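The arithmetic behind these two figures can be reproduced with plain pandas. This is a minimal sketch of the suspected mismatch, not the library's actual code: the helper name `pct_unique_mismatched` is hypothetical, and it assumes "missing" is stored as None/NaN.

```python
import pandas as pd

def pct_unique_mismatched(s: pd.Series) -> float:
    # Hypothetical helper mimicking the suspected bug:
    # the numerator counts the missing value as a distinct value...
    distinct_with_na = s.nunique(dropna=False)
    # ...while the denominator counts only non-missing observations.
    non_missing = s.count()
    return 100.0 * distinct_with_na / non_missing

two_values = pd.Series(["Y", None])          # "Y" and missing
three_values = pd.Series(["Y", "N", None])   # "Y", "N" and missing

print(pct_unique_mismatched(two_values))    # 200.0 (2/1)
print(pct_unique_mismatched(three_values))  # 150.0 (3/2)
```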