Summarizing column (describe-sheet) is much slower (+300%) with typed date column #2271
Comments
Thank you for such a detailed report, and providing sample data!
Do you mean that this behaviour has regressed since a previous version, or that you noticed a difference in behaviour between |
Great repro case for this report. I narrowed down where to look: the line at visidata/visidata/features/describe.py, Line 85 in 6a1f17c, creates the slowness for dates. If I comment it out, describe-sheet finishes quickly.
Parsing dates is expensive, especially with python-dateutil. If you know the format, try using |
No, it's not a regression in 3.x per se: the problem was already there in 2.11 |
Even when specifying a custom date format on that particular big file, summarizing still takes 49 seconds.
Yes, this makes sense. Again, parsing dates is expensive, even with strptime (which is what |
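To give a rough feel for the relative cost mentioned above, here is a minimal standalone sketch that times `int()` against `datetime.strptime` on synthetic strings. It does not use visidata at all, and the ISO `%Y-%m-%d` format, the sample size, and the function names are assumptions made for illustration, not taken from the sample file or from `describe.py`. `date.fromisoformat` is included only as a faster stdlib comparison point for strings that are already in ISO format.

```python
import timeit
from datetime import date, datetime

# Synthetic inputs (assumed ISO format; not read from people-2000000.csv).
N = 50_000
date_strings = [f"19{70 + i % 30}-{1 + i % 12:02d}-{1 + i % 28:02d}" for i in range(N)]
int_strings = [str(i) for i in range(N)]


def parse_ints(values):
    """Convert each string to an int, as an integer-typed column would."""
    return [int(v) for v in values]


def parse_strptime(values, fmt="%Y-%m-%d"):
    """Parse each string with an explicit strptime format."""
    return [datetime.strptime(v, fmt).date() for v in values]


def parse_fromisoformat(values):
    """Parse ISO-format strings with the dedicated stdlib fast path."""
    return [date.fromisoformat(v) for v in values]


if __name__ == "__main__":
    cases = [
        (parse_ints, int_strings),
        (parse_strptime, date_strings),
        (parse_fromisoformat, date_strings),
    ]
    for fn, data in cases:
        # One pass over the data; absolute timings vary by machine.
        elapsed = timeit.timeit(lambda: fn(data), number=1)
        print(f"{fn.__name__}: {elapsed:.3f}s for {len(data)} values")
```

The absolute numbers depend on the machine and Python version, so none are quoted here. The point is only that a per-cell date parse costs noticeably more than a per-cell `int()`, and that any code path which parses the same cells more than once multiplies that cost.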
When summarizing all columns (Ctrl+I / describe-sheet) on a large CSV file, the computation is much slower when a column is typed as date than when the columns are left untyped.
This slowdown is not seen with other typed columns (integer).
The test is done on a file with 2 million rows and 9 columns that can be downloaded here:
https://github.com/datablist/sample-csv-files/raw/main/files/people/people-2000000.zip
Once uncompressed, this produces people-2000000.csv.
Here is a first script that describes all columns with only the 1st column typed as integer,
which takes 22 seconds to run on my computer,
and a second script with only the date type set (on the "Date of birth" column),
which takes 1 min 28 s to run.
88 seconds is exactly 4 times as long as 22 seconds.
There is no difference in computation time between typing the 1st column as integer and leaving it untyped.
One would expect that typing a column as a date would not make summarizing the data take 300% longer.
This is tested with visidata v3.0.2 and Python 3.11.4-4 on Ubuntu 23.10