Incorrect calculation for % unique for variables with missing values #56

mhowison · 2017-08-23T12:26:01Z

We've found that the % unique calculation sometimes shows up as >100%. It appears to happen when a variable has a small number of unique values and "missing" is one of them. It looks like "missing" is counted in the numerator but not in the denominator of the calculation.

For example, we have a variable with 2 possible values ("Y" and "missing") that shows 200.0% unique (i.e. 2/1).

Another variable with 3 possible values ("Y", "N", "missing") shows 150.0% unique (i.e. 3/2).
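
The arithmetic above can be reproduced with plain pandas. This is only a minimal sketch of the suspected mismatch, not the actual pandas-profiling code: the helper name `pct_unique_mismatched` is hypothetical, and it assumes the numerator counts NaN as a distinct value while the denominator counts only non-missing rows.

```python
import numpy as np
import pandas as pd


def pct_unique_mismatched(series: pd.Series) -> float:
    """Hypothetical helper: distinct count includes NaN, row count does not."""
    distinct_with_na = series.nunique(dropna=False)  # NaN counts as one distinct value
    non_missing_rows = series.count()                # NaN rows are excluded here
    return 100.0 * distinct_with_na / non_missing_rows


two_values = pd.Series(["Y", np.nan])
three_values = pd.Series(["Y", "N", np.nan])

print(pct_unique_mismatched(two_values))    # 200.0 (2 distinct / 1 non-missing row)
print(pct_unique_mismatched(three_values))  # 150.0 (3 distinct / 2 non-missing rows)
```

Both results exceed 100% only because the two counts disagree about whether missing values are included.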

…inct_count. But when the percentage of unique values is computed, the distinct count is divided by the count that does not include na. This is the cause of the inconsistent results highlighted in ydataai#56. I've fixed the implementation to be consistent: na is also considered a distinct value in the percentage of unique values. Another implementation would have been to not consider na at all. A specific test case has been added.

Fixed issue #56

romainx · 2017-10-31T14:51:16Z

The current implementation includes na in the distinct count. But when the percentage of unique values is computed, the distinct count is divided by a count that does not include na. This is the cause of the inconsistent behavior highlighted here.

I've fixed the implementation to be consistent: na is now also counted in the denominator of the percentage of unique values, so it is treated as a distinct value on both sides of the calculation and the result can no longer exceed 100%.

import numpy as np
import pandas as pd
import pandas_profiling

df = pd.DataFrame({'test': [0, 1, np.nan]})
pandas_profiling.ProfileReport(df)
# Unique (%)	100.0%
# instead of the previous
# Unique (%)	150.0%
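
For comparison, the two consistent options mentioned in this thread (count na on both sides, or ignore it on both sides) might look roughly like the sketch below. The function names are hypothetical and this is not the code merged in the pull request; it only illustrates that either choice keeps the percentage at or below 100%.

```python
import numpy as np
import pandas as pd


def pct_unique_with_na(series: pd.Series) -> float:
    """Hypothetical: NaN is a distinct value and NaN rows count in the total."""
    if len(series) == 0:
        return float("nan")
    return 100.0 * series.nunique(dropna=False) / len(series)


def pct_unique_without_na(series: pd.Series) -> float:
    """Hypothetical: NaN is ignored in both the distinct count and the total."""
    non_missing_rows = series.count()
    if non_missing_rows == 0:
        return float("nan")
    return 100.0 * series.nunique(dropna=True) / non_missing_rows


numeric = pd.Series([0, 1, np.nan])
flags = pd.Series(["Y", "Y", "N", np.nan])

print(pct_unique_with_na(numeric))     # 100.0 (3 distinct / 3 rows)
print(pct_unique_without_na(numeric))  # 100.0 (2 distinct / 2 non-missing rows)
print(pct_unique_with_na(flags))       # 75.0  (3 distinct / 4 rows)
print(pct_unique_without_na(flags))    # 2 distinct / 3 non-missing rows, about 66.7
```

The two options agree on the first series but differ on the second, so which one is preferable depends on whether "missing" should be reported as a value in its own right.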

romainx mentioned this issue Oct 6, 2017

Fixed issue #56 #60

Merged

romainx added a commit that referenced this issue Oct 31, 2017

Merge pull request #60 from romainx/master

6629f6e

Fixed issue #56

romainx added the "bug 🐛" label Oct 31, 2017

romainx closed this as completed Oct 31, 2017

romainx mentioned this issue Nov 13, 2017

OverflowError: signed integer is greater than maximum #69

Closed

chanedwin pushed a commit to chanedwin/pandas-profiling that referenced this issue Oct 11, 2020

Merge pull request ydataai#60 from romainx/master

b448e4a

Fixed issue ydataai#56
