Summarizing column (describe-sheet) is much slower (+300%) with typed date column #2271

adren · 2024-01-25T19:59:13Z

When summarizing all columns (Ctrl+I / describe-sheet) on a large CSV file the computation is much more slower when a date column is typed as date compared basic summarizing.

This regression is not seen with other typed column (integer)

the test is done on a 2M file with 9 columns that can be downloaded here
https://github.com/datablist/sample-csv-files/raw/main/files/people/people-2000000.zip

once uncompressed, this will produce people-2000000.csv

here is a first script to describe all columns with only the 1st column typed as integer

#!vd -p
{"sheet": null, "col": null, "row": null, "longname": "open-file", "input": "people-2000000.csv", "keystrokes": "o", "comment": null}
{"sheet": "people-2000000", "col": "Index", "row": "", "longname": "type-int", "input": "", "keystrokes": "#", "comment": "set type of current column to int"}
{"sheet": "people-2000000", "col": "", "row": "", "longname": "describe-sheet", "input": "", "keystrokes": "Shift+I", "comment": "open Describe Sheet with descriptive statistics for all visible columns"}

which takes 22 to run on my computer

$ time vd -b -p describe_i.vdj
saul.pw/VisiData v3.0.2
opening describe_i.vdj as vdj
Support VisiData: https://github.com/sponsors/saulpw
opening people-2000000.csv as csv
set type of current column to int
open Describe Sheet with descriptive statistics for all visible columns
replay complete

real    0m22,657s
user    0m21,951s
sys     0m0,717s

and a second script with only the date type ("Date of birth" column)

#!vd -p
{"sheet": null, "col": null, "row": null, "longname": "open-file", "input": "people-2000000.csv", "keystrokes": "o", "comment": null}
{"sheet": "people-2000000", "col": "Date of birth", "row": "", "longname": "type-date", "input": "", "keystrokes": "@", "comment": "set type of current column to date"}
{"sheet": "people-2000000", "col": "", "row": "", "longname": "describe-sheet", "input": "", "keystrokes": "Shift+I", "comment": "open Describe Sheet with descriptive statistics for all visible columns"}

and this one takes 1min28 seconds to run

describe_d.vdj
saul.pw/VisiData v3.0.2
Support VisiData: https://github.com/sponsors/saulpw
opening describe_d.vdj as vdj
opening people-2000000.csv as csv
set type of current column to date
open Describe Sheet with descriptive statistics for all visible columns
replay complete

real    1m28,849s
user    1m28,088s
sys     0m0,774s

88 seconds is exactly 4 times longer compared to 22 seconds

there is no difference in the computation time between typing the 1st column as integer or leaving it

One can expect that typing a column as a date would not takes 300% longer for summarizing the data

This is tested with visidata v3.0.2 and Python 3.11.4-4 on Ubuntu 23.10

The text was updated successfully, but these errors were encountered:

anjakefala · 2024-01-25T22:59:48Z

Thank you for such a detailed report, and providing sample data!

This regression is not seen with other typed column (integer)

Do you mean that this behaviour has regressed since a previous version, or that you noticed a difference in behaviour between int and date?

midichef · 2024-01-26T04:12:26Z

Great repro case for this report.

I narrowed down where to look. This line creates the slowness for dates. If I comment it out, describe-sheet finishes quickly.

visidata/visidata/features/describe.py

Line 85 in 6a1f17c

v = srccol.type(v)

saulpw · 2024-01-26T04:53:04Z

Parsing dates is expensive, especially with python-dateutil. If you know the format, try using z@.

adren · 2024-01-26T06:02:44Z

Thank you for such a detailed report, and providing sample data!

This regression is not seen with other typed column (integer)

Do you mean that this behaviour has regressed since a previous version, or that you noticed a difference in behaviour between int and date?

No, it's not a regression in 3.x per se: the problem was already there in 2.11

adren · 2024-01-26T06:08:29Z

Parsing dates is expensive, especially with python-dateutil. If you know the format, try using z@.

Even when specifying a custom date format on that particular big file, the summarizing still takes 49 seconds
which is +122% compared to the 22 seconds (no typing)
but still better that without specifying (122 seconds)

$ time vd -b -p describe_cd.vdj 
saul.pw/VisiData v3.0.2
Support VisiData: https://github.com/sponsors/saulpw
opening describe_cd.vdj as vdj
opening people-2000000.csv as csv
set type of current column to custom date format
open Describe Sheet with descriptive statistics for all visible columns
replay complete

real    0m49,026s
user    0m48,227s
sys     0m0,809s

saulpw · 2024-01-26T19:06:17Z

Yes, this makes sense. Again, parsing dates is expensive, even with strptime (which is what z@ uses). If you write a Python script that converts all elements in that column to date objects and then summarizes them, you should find that it takes about the same amount of time. If you want the values summarized as dates, the work has to be done somewhere!

adren added the bug label Jan 25, 2024

anjakefala closed this as completed Jul 13, 2024

anjakefala added By Design and removed bug labels Jul 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Summarizing column (describe-sheet) is much slower (+300%) with typed date column #2271

Summarizing column (describe-sheet) is much slower (+300%) with typed date column #2271

adren commented Jan 25, 2024 •

edited

Loading

anjakefala commented Jan 25, 2024

midichef commented Jan 26, 2024

saulpw commented Jan 26, 2024

adren commented Jan 26, 2024

adren commented Jan 26, 2024

saulpw commented Jan 26, 2024

Summarizing column (describe-sheet) is much slower (+300%) with typed date column #2271

Summarizing column (describe-sheet) is much slower (+300%) with typed date column #2271

Comments

adren commented Jan 25, 2024 • edited Loading

anjakefala commented Jan 25, 2024

midichef commented Jan 26, 2024

saulpw commented Jan 26, 2024

adren commented Jan 26, 2024

adren commented Jan 26, 2024

saulpw commented Jan 26, 2024

adren commented Jan 25, 2024 •

edited

Loading