Tablite throws IndexError when reading a complex CSV file #33

ypanagis · 2022-10-08T17:42:55Z

I am trying tablite with a CSV file that has many fields, some of which are long (more specifically, full texts). When reading it with Table.import_file, I get the following exception:

Exception: Traceback (most recent call last):
  File "/Users/ypanagis/opt/anaconda3/envs/tabular/lib/python3.8/site-packages/mplite/__init__.py", line 26, in execute
    return self.f(*self.args,**self.kwargs)
  File "/Users/ypanagis/opt/anaconda3/envs/tabular/lib/python3.8/site-packages/tablite/core.py", line 2646, in text_reader_task
    data[header].append(fields[index])
IndexError: list index out of range

When the same dataset is converted to xlsx and then opened, no errors occur. I am attaching an example file that causes the error.

test.csv


root-11 · 2022-10-08T19:12:02Z

Hey @ypanagis
Thanks for this. I'll make the error message more informative in the next release.

The file you use claims to have 31 headers, but there are only 30 columns.
The column for "court" is missing.
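A header/row mismatch like this can be checked independently of tablite. The following is a minimal sketch using only Python's standard csv module; the helper name field_count_report and the inline sample are illustrative assumptions, not part of tablite or of the attached file.

```python
import csv
import io
from collections import Counter


def field_count_report(text, delimiter=",", quotechar='"'):
    """Compare the number of header fields with the number of fields in each row.

    Returns (header_count, Counter of row lengths, list of (record_number, n_fields)
    for every record whose length differs from the header).
    """
    reader = csv.reader(io.StringIO(text, newline=""),
                        delimiter=delimiter, quotechar=quotechar)
    header = next(reader)
    counts = Counter()
    mismatches = []
    # Record 1 is the header, so data records start at 2.
    for record_number, row in enumerate(reader, start=2):
        counts[len(row)] += 1
        if len(row) != len(header):
            mismatches.append((record_number, len(row)))
    return len(header), counts, mismatches


# Tiny made-up sample: three headers, but no row ever fills the third one,
# and the last row is cut short.
sample = 'a,b,court\n1,"long, quoted text"\n2,"x"\n3\n'
n_headers, counts, mismatches = field_count_report(sample)
print(n_headers, dict(counts), mismatches)
```

For a real file, pass the file's contents (or adapt the helper to take an open file handle). If fields hold full texts, the csv module's default field size limit may need raising with csv.field_size_limit.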

This will work for you:

from tablite import Table
from pathlib import Path

# Adjust this path to wherever the CSV file lives.
path = Path(__file__).parent / "data" / 'long_text_test.csv'
assert path.exists()

# The 30 columns that actually hold data; the header "court" is left out.
columns = [
    "sharepointid","Rank","ITEMID","docname","doctype","application","APPNO",
    "ARTICLES","violation","nonviolation","CONCLUSION","importance","ORIGINATING BODY ID",
    "typedescription","kpdate","kpdateAsText","documentcollectionid","documentcollectionid2",
    "languageisocode","extractedappno","isplaceholder","doctypebranch","RESPONDENT",
    "respondentOrderEng","scl","ECLI","ORIGINATING BODY","YEAR","FULLTEXT","judges"]

# Import the listed columns, using the double quote as text qualifier.
t = Table.import_file(path, import_as='csv', columns={c:'f' for c in columns}, text_qualifier='"')

# Show the first five columns.
selection = columns[:5]
t.__getitem__(*selection).show()

PS: Note that your file test.csv is cut off mid-row (the last row has only 25 fields).

Here's what the output looks like on my machine:

ypanagis · 2022-10-11T16:19:22Z

Hi @root-11 and thank you for your reply. First of all, yes, this CSV is rather ill-structured and is missing values in some columns. One of those is the column "court", as you correctly noticed.

I didn't know of the columns and text_qualifier parameters of import_file and it is a real convenience that they are included!

I played a bit with the example script that you gave, setting selection = columns[-3:] to see how it works with the last few columns, which include FULLTEXT, the long one. However, I got the error in the attached file. After browsing the messages, it seems that lines 45-46 of the CSV cause the error, but it is not obvious why, and I couldn't see anything wrong in the CSV (there may be something, of course).
error.txt

root-11 · 2022-10-11T17:07:11Z

That's Python's multiprocessing module crashing.
The test suite runs Python 3.8 on Linux, just like you, so this seems strange. Could it be a difference between your conda env and Python's own venv?

ypanagis · 2022-10-11T17:13:17Z

I run the script on macOS. Can this also be an issue with multiprocessing?

root-11 · 2022-10-11T17:20:58Z

I'm not sure. Can you try to run the multiprocessing test suite in this script:
https://github.com/root-11/mplite/blob/main/tests/test_basics.py

If that doesn't work, I'll have to do a deeper dive into why macOS behaves differently.

root-11 · 2022-10-11T22:24:01Z

I've added Windows and macOS to the test matrix and they all come out positive:

ypanagis · 2022-10-14T17:11:29Z

I changed to Python 3.9 as you suggested, but it now gives me the error in the attached file. My PC also has mamba installed and the environment is now a mamba one, but I hope this is not a problem.

Note that I saw the same error when I removed the last two columns, which had some empty values, in case those caused issues.

tablite is in version 2022.10.08.
error.txt

root-11 · 2022-10-15T10:49:19Z

Thanks for that Yannis! I'll look into that immediately.

root-11 · 2022-10-15T10:54:39Z

So the error says that psutil.virtual_memory().free is zero.

Could you run this on your mac for me:

import psutil
print(psutil.virtual_memory().free)

ypanagis · 2022-10-15T11:01:29Z

Thanks Bjorn, I just ran it and it gives this RuntimeError, which seems to come from a Python 3.8 bug:

RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

...
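For context: Python raises this RuntimeError when a script starts child processes while it is still being imported under the "spawn" start method, which has been the default on macOS since Python 3.8. The thread does not confirm that this was the cause here. The sketch below only illustrates the usual guard around the entry point; run_pool is a hypothetical helper name, and the built-in abs stands in for real work.

```python
import multiprocessing as mp


def run_pool():
    """Run a trivial job in a small worker pool using the 'spawn' start method."""
    # 'spawn' is requested explicitly to mimic the macOS default on any platform.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        # The built-in abs is used as the worker so the sketch does not depend
        # on functions defined in this script.
        return pool.map(abs, [-1, 2, -3])


# With 'spawn', each child process re-imports the main module. Anything that
# starts processes must therefore sit behind this guard, otherwise the child
# tries to start processes of its own during its bootstrapping phase.
if __name__ == "__main__":
    results = run_pool()
    print(results)
```

Any script that triggers multiprocessing indirectly, for example by calling a library that spawns workers, needs the same guard around the code that starts the work.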

root-11 · 2022-10-19T15:52:27Z

@ypanagis - you think we can close this ticket now?

ypanagis · 2022-10-19T15:55:12Z

Yes @root-11 makes total sense to me. Will try the package some more, but this part is definitely over now.

root-11 · 2022-10-19T16:02:20Z

Neat. Just FYI: I've released a new version today with slightly better memory management.
The details are in the changelog: https://github.com/root-11/tablite/blob/master/changelog.md

root-11 self-assigned this Oct 8, 2022

root-11 closed this as completed Oct 19, 2022
