Try to improve performance of contingency_similarity #622

amontanez24 · 2024-08-12T15:22:39Z

Problem Description

As a user, I'd like to get the results of my metrics reports as quickly as possible.

We audited the QualityReport because it seemed slow. The conclusion was that most of the time is spent in the contingency similarity metric, specifically in these lines:

SDMetrics/sdmetrics/column_pairs/statistical/contingency_similarity.py

Lines 45 to 54 in 685731f

    
    contingency_real = pd.crosstab(
        index=real[columns[0]].astype(str),
        columns=real[columns[1]].astype(str),
        normalize=True,
    )
    contingency_synthetic = pd.crosstab(
        index=synthetic[columns[0]].astype(str),
        columns=synthetic[columns[1]].astype(str),
        normalize=True,
    )

This is the performance report's visualization

Expected behavior

The goal of this issue is to improve the performance of contingency_similarity without changing the algorithm at all. Optimizations that are in scope include:

Trying different pandas or numpy functions instead of crosstab
Trying to do any type conversions at a higher level (e.g., the astype(str) calls happen multiple times on the same columns)
Seeing if there is a more efficient way to compute the table
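
As a rough illustration of the first two ideas, the sketch below computes the same normalized joint frequencies with a groupby instead of crosstab, converting both columns to strings in a single call. The function name joint_frequencies is hypothetical and is not part of SDMetrics. Whether this is actually faster would need to be measured on real data, and it assumes the columns have no missing values that astype(str) would leave as NaN.

```python
import pandas as pd


def joint_frequencies(data: pd.DataFrame, col_a: str, col_b: str) -> pd.Series:
    """Return normalized joint frequencies of two columns.

    Hypothetical helper, not part of SDMetrics. The result is a Series
    indexed by (col_a value, col_b value) pairs that occur in the data;
    combinations that never occur are simply absent instead of being
    stored as zeros, unlike the dense table that pd.crosstab builds.
    """
    # Convert both columns to strings once, rather than once per crosstab call.
    as_str = data[[col_a, col_b]].astype(str)
    # Count each observed pair, then divide by the row count to normalize.
    return as_str.groupby([col_a, col_b]).size() / len(as_str)
```

Because the result is sparse, comparing the real and synthetic tables would require aligning the two Series on the union of their index pairs and filling missing entries with 0.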

The optimizations should not change the overall algorithm of the metric.
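
One way to guard this constraint is to compare any candidate implementation against the existing crosstab output. The following is a minimal sketch of such a check; crosstab_reference and matches_reference are hypothetical names introduced here for illustration, and the reference simply mirrors the crosstab call quoted above.

```python
import numpy as np
import pandas as pd


def crosstab_reference(data: pd.DataFrame, col_a: str, col_b: str) -> pd.DataFrame:
    """Contingency table built the same way as the lines quoted in this issue."""
    return pd.crosstab(
        index=data[col_a].astype(str),
        columns=data[col_b].astype(str),
        normalize=True,
    )


def matches_reference(candidate: pd.DataFrame, reference: pd.DataFrame) -> bool:
    """Check that two contingency tables hold the same frequencies.

    Hypothetical helper. Rows and columns are aligned on the union of
    their labels, and cells missing from either table count as 0.
    """
    candidate, reference = candidate.align(reference, join='outer', fill_value=0.0)
    return bool(np.allclose(candidate.to_numpy(), reference.to_numpy()))
```

A candidate that passes this check on a range of datasets produces the same table as crosstab, which suggests the metric's result is unchanged.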

Additional context

If not many optimizations can be made, we can follow up with a different issue.


amontanez24 added the internal label (The issue doesn't change the API or functionality) Aug 12, 2024

amontanez24 mentioned this issue Aug 20, 2024

Try to improve performance of contingency_similarity #625

Merged

amontanez24 self-assigned this Aug 27, 2024

amontanez24 added this to the 0.15.2 milestone Aug 27, 2024

amontanez24 closed this as completed in #625 Aug 29, 2024
