Too slow "Column Pair Trends" #546

echatzikyriakidis · 2024-03-26T13:31:54Z

Environment Details

Please indicate the following details about the environment in which you found the bug:

SDMetrics version: 0.13.1
Python version: 3.10.12
Operating System: Ubuntu 22.04.3 LTS (jammy)

Error Description

Running Column Pair Trends from the Quality Report seems to be too slow.

My current example:

Generating report ...
(1/4) Evaluating Column Shapes: : 100%|██████████| 59/59 [03:39<00:00, 3.72s/it]
(2/4) Evaluating Column Pair Trends: : 0%| | 0/158 [00:00<?, ?it/s]

Suggestion:

Would it be possible to change the library so that both the single-table and multi-table reports (Quality, Diagnostic, and any others that exist) allow parallelization (either multithreading or multiprocessing)?

Every calculation of column shapes or trends in column pairs can run in parallel. No need for sequential computation, since each computation is independent. Right?
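To make the independence argument concrete, here is a minimal sketch of scoring column pairs concurrently. It is not SDMetrics code: `pearson`, `score_pair`, and `score_all_pairs` are hypothetical helpers, the per-pair score is only a stand-in for whatever metric the report applies, and the data is made up. It assumes numerical columns with non-zero variance.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Plain-Python Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def score_pair(real, synthetic, pair):
    """Hypothetical stand-in metric: 1.0 means both tables show the same correlation."""
    a, b = pair
    r_real = pearson(real[a], real[b])
    r_synth = pearson(synthetic[a], synthetic[b])
    return pair, 1 - abs(r_real - r_synth) / 2


def score_all_pairs(real, synthetic, max_workers=4):
    """Score every column pair; tasks only read the data and share no mutable state."""
    pairs = list(combinations(sorted(real), 2))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda p: score_pair(real, synthetic, p), pairs)
        return dict(results)


# Made-up tables stored as {column name: list of values}.
real = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 3, 2, 1]}
synthetic = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 9], "c": [4, 3, 1, 2]}
scores = score_all_pairs(real, synthetic)
```

Threads keep the sketch short, but pure-Python, CPU-bound work like this is limited by the GIL, so a process pool would probably be the variant to try for a real speedup.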

Thanks!

echatzikyriakidis · 2024-03-26T15:25:18Z

OK, I have to say that after one change everything runs fast, but I don't know exactly why, because I have no clue how the library works.

Many of my fields are high-cardinality ones with almost unique text values. So far, I had them as pii=true in the metadata with sdtype=categorical. I tried removing sdtype completely, but the library failed, asking for the sdtype field. Then I changed it to sdtype=text and now it runs very fast. Why is that? I don't care about these fields; I just want the library to skip them and not take them into account for column shapes or column pair trends.

Is this a correct approach to skip them? I just need some validation.

Thanks.

npatki · 2024-04-01T23:55:14Z

Hi @echatzikyriakidis, appreciate the feedback.

Before going to parallelization (which we certainly can look into), it is helpful to look back at metadata and ensure everything is running right. SDMetrics uses metadata to make sure it is applying the correct metrics. For example, if you are storing HTTP codes such as 404 (error not found), 500 (server error), 200 (ok), etc. then it should make sure to treat those as discrete categories instead of a numerical distribution.
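As a small illustration of the HTTP code example, the sketch below contrasts the numerical and categorical views of the same column. The column name `http_status` and the sample values are assumptions made up for illustration; only the `sdtype` key mirrors the metadata discussed in this thread.

```python
from collections import Counter
from statistics import mean

# Made-up sample of HTTP status codes.
status_codes = [200, 200, 404, 500, 200, 404]

# Numerical view: an average of status codes has no real meaning.
numeric_view = mean(status_codes)

# Categorical view: frequencies per code are what is worth comparing.
category_view = Counter(status_codes)

# Hypothetical metadata entry telling the report to use the categorical view.
column_metadata = {"http_status": {"sdtype": "categorical"}}
```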

Here are the docs for what the metadata should look like.

Based on your description, here's what I think is going on:

Sdtype categorical is not compatible with pii. If you mark something as categorical, SDMetrics will dutifully evaluate every single category, which could take a long time if you have non-statistical, high-cardinality values such as text descriptions. See metadata spec.
Any "other" sdtype (that is not categorical, numerical, datetime) is treated as a non-statistical value, meaning that it gets skipped. Therefore, when you apply text, it is skipping the column.

I would recommend continuing to use text to skip over columns that you do not want to include in the report. In the meantime, our team will look into improving the experience for indicating which columns to skip.
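A minimal sketch of that workaround, assuming the metadata is a dictionary with a top-level `columns` key as in the metadata docs. The helper `mark_columns_as_text` and the column names are hypothetical, and any other keys on a column (such as `pii`) are left untouched.

```python
import copy


def mark_columns_as_text(metadata, columns_to_skip):
    """Return a copy of the metadata with the given columns set to sdtype 'text'."""
    updated = copy.deepcopy(metadata)
    for name in columns_to_skip:
        updated["columns"][name]["sdtype"] = "text"
    return updated


# Hypothetical metadata; the column names are made up for illustration.
metadata = {
    "columns": {
        "age": {"sdtype": "numerical"},
        "country": {"sdtype": "categorical"},
        "notes": {"sdtype": "categorical", "pii": True},
    }
}

report_metadata = mark_columns_as_text(metadata, ["notes"])

# With SDMetrics installed, the updated metadata would then be passed to the report:
# from sdmetrics.reports.single_table import QualityReport
# report = QualityReport()
# report.generate(real_data, synthetic_data, report_metadata)
```

Working on a copy keeps the original metadata available for anything else that still relies on the original sdtypes.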

echatzikyriakidis · 2024-04-02T06:50:26Z

Hi @npatki,

That's exactly what I ended up doing. I set those high cardinality text fields with sdtype=text and now everything is fast.

Thanks!

npatki · 2024-04-02T14:25:55Z

Thanks for confirming, @echatzikyriakidis. I filed a feature request in #548 to make it easier (and more intuitive) to specify which columns you want to ignore when generating a report.

echatzikyriakidis added the labels bug (Something isn't working) and new (Label applied to new issues) on Mar 26, 2024

npatki added the label under discussion (Issue is currently being discussed) and removed the label new (Label applied to new issues) on Apr 1, 2024

echatzikyriakidis closed this as completed Apr 2, 2024

npatki added the labels resolution:WAI (The software is working as intended) and resolution:resolved (The issue was fixed, the question was answered, etc.) and removed the label under discussion (Issue is currently being discussed) on Apr 2, 2024

npatki mentioned this issue Apr 2, 2024

Better way to ignore columns when running a report #548

Open
