Description
Environment Details
- SDMetrics version: 0.16.0 (latest)
Error Description
The ContingencySimilarity metric produces a RuntimeWarning whenever there are NaN values in a column and there are different combinations of values that appear in the real and synthetic data.
Fortunately, it appears that the computed score is unaffected -- i.e. if you replace all the NaN values in the real/synthetic data with a non-null value, the warning goes away and the score remains the same. So this is not really a concern for the overall quality score, but it is still annoying to see the warning printed out so many times.
Steps to reproduce
import pandas as pd
import numpy as np
from sdmetrics.column_pairs import ContingencySimilarity
real_data = pd.DataFrame(data={
    'A': ['value']*4,
    'B': ['1', '2', '3', np.nan]
})
synthetic_data = pd.DataFrame(data={
    'A': ['value']*3,
    'B': ['1', '2', np.nan]
})
ContingencySimilarity.compute(
    real_data=real_data[['A', 'B']],
    synthetic_data=synthetic_data[['A', 'B']]
)
/usr/local/lib/python3.10/dist-packages/sdmetrics/column_pairs/statistical/contingency_similarity.py:47: RuntimeWarning: The values in the array are unorderable. Pass `sort=False` to suppress this warning.
combined_index = contingency_real.index.union(contingency_synthetic.index)
0.75
Additional Context
From @pvk-developer: This may be happening due to the changes we made in #625. Before we used crosstab and now we use union.
When using union, it doesn't really seem like the result needs to be sorted, so we might be able to fix it by just turning the sorting off.
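A minimal sketch of that idea, using pandas only. The two MultiIndex objects are hand-built stand-ins for the indexes of the real and synthetic contingency tables (an assumption for illustration, not how SDMetrics constructs them). The snippet calls `union` with `sort=False` and records any warnings so you can check whether the RuntimeWarning still appears:

```python
import warnings

import numpy as np
import pandas as pd

# Hypothetical stand-ins for contingency_real.index and
# contingency_synthetic.index, mirroring the (A, B) pairs in the repro above.
real_index = pd.MultiIndex.from_tuples(
    [('value', '1'), ('value', '2'), ('value', '3'), ('value', np.nan)],
    names=['A', 'B'],
)
synthetic_index = pd.MultiIndex.from_tuples(
    [('value', '1'), ('value', '2'), ('value', np.nan)],
    names=['A', 'B'],
)

# Record every warning raised while taking the union without sorting.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    combined_index = real_index.union(synthetic_index, sort=False)

# Keep only RuntimeWarnings, the category reported in this issue.
runtime_warnings = [w for w in caught if issubclass(w.category, RuntimeWarning)]
print(len(runtime_warnings))
print(combined_index.get_level_values('B').tolist())
```

With `sort=False` the combined index keeps the order in which entries are encountered, so this only helps if nothing downstream relies on the index being sorted.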