Try to improve performance of contingency_similarity #622

Closed
amontanez24 opened this issue Aug 12, 2024 · 0 comments · Fixed by #625
Labels
internal The issue doesn't change the API or functionality

Problem Description

As a user, I'd like to get the results of my metrics reports as quickly as possible.

We audited the QualityReport because it seemed slow. The conclusion was that most of the time is spent in the contingency similarity metric, specifically in these lines:

contingency_real = pd.crosstab(
    index=real[columns[0]].astype(str),
    columns=real[columns[1]].astype(str),
    normalize=True,
)
contingency_synthetic = pd.crosstab(
    index=synthetic[columns[0]].astype(str),
    columns=synthetic[columns[1]].astype(str),
    normalize=True,
)

This is the performance report's visualization: [image]

Expected behavior

The goal of this issue is to improve the performance of contingency_similarity without changing the algorithm at all. Optimizations that are in scope include:

  • Trying different pandas or numpy functions instead of crosstab
  • Trying to do any type conversions at a higher level (e.g., the astype(str) calls currently happen multiple times on the same columns)
  • Seeing if there is a more efficient way to compute the table

The optimizations should not change the overall algorithm of the metric.
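
As an illustration of the first two in-scope ideas, here is a minimal sketch, not the fix that was merged. The helper names `normalized_contingency` and `pairwise_tables` are hypothetical and are not part of the library. The sketch assumes a groupby-based count can stand in for crosstab, and that each column can be converted to str once and reused across column pairs. Whether this is actually faster would need to be measured on real data.

```python
import pandas as pd


def normalized_contingency(first: pd.Series, second: pd.Series) -> pd.Series:
    """Joint relative frequency of each observed (first, second) value pair.

    Hypothetical helper. It holds the same proportions as
    pd.crosstab(first, second, normalize=True), but in long form:
    one entry per observed pair instead of a dense 2-D table.
    """
    # Use the raw values so differing indexes on the inputs cannot misalign rows.
    pairs = pd.DataFrame({"a": first.to_numpy(), "b": second.to_numpy()})
    counts = pairs.groupby(["a", "b"]).size()
    return counts / counts.sum()


def pairwise_tables(data: pd.DataFrame, column_pairs) -> dict:
    """Build one table per column pair, converting each column to str only once.

    Hypothetical helper showing the conversion hoisted out of the per-pair loop.
    """
    needed = sorted({name for pair in column_pairs for name in pair})
    as_str = data[needed].astype(str)  # single conversion, reused for every pair
    return {
        (left, right): normalized_contingency(as_str[left], as_str[right])
        for left, right in column_pairs
    }
```

The long-form result can be turned back into a dense table with `.unstack(fill_value=0)` if the rest of the metric expects the crosstab layout, which keeps the algorithm itself unchanged.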

Additional context

  • If not many optimizations can be made, we can follow up with a different issue.

amontanez24 added the "internal" label (The issue doesn't change the API or functionality) on Aug 12, 2024
amontanez24 self-assigned this on Aug 27, 2024
amontanez24 added this to the 0.15.2 milestone on Aug 27, 2024