Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Too slow "Column Pair Trends" #546

Closed
echatzikyriakidis opened this issue Mar 26, 2024 · 4 comments
Closed

Too slow "Column Pair Trends" #546

echatzikyriakidis opened this issue Mar 26, 2024 · 4 comments
Labels
bug Something isn't working resolution:resolved The issue was fixed, the question was answered, etc. resolution:WAI The software is working as intended

Comments

@echatzikyriakidis
Copy link

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDMetrics version: 0.13.1
  • Python version: 3.10.12
  • Operating System: Ubuntu 22.04.3 LTS (jammy)

Error Description

Hi @npatki,

It seems that , it is too slow when running Column Pair Trends from Quality Report.

My current example:

Generating report ...
(1/4) Evaluating Column Shapes: : 100%|██████████| 59/59 [03:39<00:00, 3.72s/it]
(2/4) Evaluating Column Pair Trends: : 0%| | 0/158 [00:00<?, ?it/s]

Suggestion:

Is it possible to change the library so that both single-table and multi-table reports (Quality+Diagnostic and any other that exists) to allow parallelization (either multithreading or multiprocessing) ?

Every calculation of column shapes or trends in column pairs can run in parallel. No need for sequential computation, since each computation is independent. Right?

Thanks!

@echatzikyriakidis echatzikyriakidis added bug Something isn't working new Label applied to new issues labels Mar 26, 2024
@echatzikyriakidis
Copy link
Author

echatzikyriakidis commented Mar 26, 2024

OK, I have to say that after a change everything goes fast, but in reality I don't know exactly why because have no clue how the library works.

Many of my fields are high-cardinality ones with almost unique text values. So far, I had them as pii=true in the metadata and sdtype=categorical. I decided to remove sdtype completely but the library failed asking for sdtype field. Then I changed it to sdtype=text and now it runs very fast. Why is that? I don't care about these fields, I just want the library to skip them and not taking them account for column shapes or column pair trends.

Is this a correct approach to skip them? I just need some validation.

Thanks.

@npatki
Copy link
Contributor

npatki commented Apr 1, 2024

Hi @echatzikyriakidis, appreciate the feedback.

Before going to parallelization (which we certainly can look into), it is helpful to look back at metadata and ensure everything is running right. SDMetrics uses metadata to make sure it is applying the correct metrics. For example, if you are storing HTTP codes such as 404 (error not found), 500 (server error), 200 (ok), etc. then it should make sure to treat those as discrete categories instead of a numerical distribution.

Here are the docs for what the metadata should look like.

Based on your description, here's what I think is going on:

  1. Sdtype categorical is not compatible with pii. If you mark something as categorical, SDMetrics will dutifully evaluate every single category, which could take a long time if you have non-statistical, high cardinality value such as a text description. See metadata spec.
  2. Any "other" sdtype (that is not categorical, numerical, datetime) is treated as a non-statistical value, meaning that it gets skipped. Therefore, when you apply text, it is skipping the column.

I would recommend continuing to use text to skip over columns that you do not want to include in the report. In the meantime, our team look into improving the experience for indicating which columns to skip.

@npatki npatki added under discussion Issue is currently being discussed and removed new Label applied to new issues labels Apr 1, 2024
@echatzikyriakidis
Copy link
Author

Hi @npatki,

That's exactly what I ended up doing. I set those high cardinality text fields with sdtype=text and now everything is fast.

Thanks!

@npatki npatki added resolution:WAI The software is working as intended resolution:resolved The issue was fixed, the question was answered, etc. and removed under discussion Issue is currently being discussed labels Apr 2, 2024
@npatki
Copy link
Contributor

npatki commented Apr 2, 2024

Thanks for confirming @echatzikyriakidis. I left a feature request in #548 so make it easier (and more intuitive) to specify which columns you want to ignore when generating a report.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working resolution:resolved The issue was fixed, the question was answered, etc. resolution:WAI The software is working as intended
Projects
None yet
Development

No branches or pull requests

2 participants