Tablite throws IndexError when reading a complex CSV file #33

Closed
ypanagis opened this issue Oct 8, 2022 · 13 comments

@ypanagis

ypanagis commented Oct 8, 2022

I am trying tablite with a CSV file that has many fields, some of which are long (more specifically, full texts). When reading it with Table.import_file, I get the following exception:

Exception: Traceback (most recent call last):
  File "/Users/ypanagis/opt/anaconda3/envs/tabular/lib/python3.8/site-packages/mplite/__init__.py", line 26, in execute
    return self.f(*self.args,**self.kwargs)
  File "/Users/ypanagis/opt/anaconda3/envs/tabular/lib/python3.8/site-packages/tablite/core.py", line 2646, in text_reader_task
    data[header].append(fields[index])
IndexError: list index out of range

When the same dataset is converted to xlsx and then opened, no errors occur. I am attaching an example file that causes the error.

test.csv

@root-11 self-assigned this Oct 8, 2022
@root-11
Owner

root-11 commented Oct 8, 2022

Hey @ypanagis
Thanks for this. I'll make the error message more informative in the next release.

The file you use claims to have 31 headers, but there are only 30 columns.
The column for "court" is missing.

This will work for you:

from tablite import Table
from pathlib import Path

# Locate the CSV file relative to this script.
path = Path(__file__).parent / "data" / 'long_text_test.csv'
assert path.exists()

# The 30 columns that are present in the data; "court" is left out.
columns = [
    "sharepointid","Rank","ITEMID","docname","doctype","application","APPNO",
    "ARTICLES","violation","nonviolation","CONCLUSION","importance","ORIGINATING BODY ID",
    "typedescription","kpdate","kpdateAsText","documentcollectionid","documentcollectionid2",
    "languageisocode","extractedappno","isplaceholder","doctypebranch","RESPONDENT",
    "respondentOrderEng","scl","ECLI","ORIGINATING BODY","YEAR","FULLTEXT","judges"]

# Import the listed columns, using '"' as the text qualifier around the long text fields.
t = Table.import_file(path, import_as='csv', columns={c:'f' for c in columns}, text_qualifier='"')

# Show only the first five columns.
selection = columns[:5]
t.__getitem__(*selection).show()

PS: Note that your file test.csv is cut off mid-row (the last row has only 25 fields).

Here's what the output looks like on my machine:
image
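
The header/column mismatch and the truncated last row mentioned above can be spotted before importing. The following is only a minimal sketch using Python's standard csv module, not tablite's own reader; the function name find_bad_rows and the sample data are made up for illustration.

```python
import csv
import io


def find_bad_rows(text, delimiter=",", quotechar='"'):
    """Return (header_field_count, [(line_number, field_count), ...]).

    The list holds every data row whose field count differs from the header's.
    Hypothetical helper for illustration; it is not part of tablite.
    """
    # newline="" leaves line endings untouched so the csv module can handle them.
    reader = csv.reader(io.StringIO(text, newline=""),
                        delimiter=delimiter, quotechar=quotechar)
    header = next(reader)
    bad = []
    for row in reader:
        if len(row) != len(header):
            # reader.line_num is the number of source lines read so far.
            bad.append((reader.line_num, len(row)))
    return len(header), bad


# Made-up sample: three headers, but the last row has only two fields.
sample = "a,b,c\n1,2,3\n4,5\n"
print(find_bad_rows(sample))  # (3, [(3, 2)])
```

For a real file, the same check could be run on the text read with open(path, newline="").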

@ypanagis
Author

ypanagis commented Oct 11, 2022

Hi @root-11, and thank you for your reply. First of all, yes, this CSV is rather ill-structured and is missing values in some columns. One of those is the column "court", as you very correctly noticed.

I didn't know about the columns and text_qualifier parameters of import_file, and it is a real convenience that they are included!
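
To illustrate why a text qualifier matters for long text fields, here is a small sketch. It uses the quotechar option of Python's standard csv module as a stand-in; it does not show how tablite implements text_qualifier, and the sample row is invented.

```python
import csv
import io

# Invented sample: the FULLTEXT field contains both a comma and a line break.
raw = 'docname,FULLTEXT\ncase-1,"First line, with a comma\nand a second line"\n'


def parse(text, quotechar):
    """Parse CSV text with or without a quote character (text qualifier)."""
    quoting = csv.QUOTE_MINIMAL if quotechar else csv.QUOTE_NONE
    reader = csv.reader(io.StringIO(text, newline=""),
                        quotechar=quotechar, quoting=quoting)
    return list(reader)


# With the qualifier, the long text stays in one field of one row.
print([len(row) for row in parse(raw, '"')])   # [2, 2]

# Without it, the comma and the line break split the text into extra fields and rows.
print([len(row) for row in parse(raw, None)])  # [2, 3, 1]
```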

I played a bit with the example script that you gave, setting selection = columns[-3:] to see, for example, how it works with the last few columns, which include FULLTEXT, the long one. However, I got the error in the attached file. After browsing the messages, it seems that lines 45-46 of the CSV cause an error, but it is not obvious what the problem is, and I couldn't see anything wrong in the CSV (there can be something, of course).
error.txt

@root-11
Owner

root-11 commented Oct 11, 2022

That's Python's multiprocessing module crashing.
As the test suite runs Python 3.8 on Linux just like you, this seems strange. Could it be a difference between your conda env and Python's own venv?

@ypanagis
Author

I run the script on macOS; could this also be an issue with multiprocessing?

@root-11
Owner

root-11 commented Oct 11, 2022

I'm not sure. Can you try to run the multiprocessing test suite in this script:
https://github.com/root-11/mplite/blob/main/tests/test_basics.py

If that doesn't work, I'll have to do a deeper dive into why macOS behaves differently.

@root-11
Owner

root-11 commented Oct 11, 2022

I've added Windows and macOS to the test matrix, and the tests pass on all of them:

image

@ypanagis
Author

ypanagis commented Oct 14, 2022

I changed to Python 3.9 as you suggested, but it now gives me the error in the attached file. My PC also has mamba installed, so the environment is now a mamba one, but I hope this is not a problem.

Note that I saw the same error when I removed the last two columns, which had some empty values, in case those caused issues.

The tablite version is 2022.10.08.
error.txt

@root-11
Owner

root-11 commented Oct 15, 2022

Thanks for that Yannis! I'll look into that immediately.

@root-11
Owner

root-11 commented Oct 15, 2022

So the error says that psutil.virtual_memory().free is zero.

Could you run this on your mac for me:

import psutil
psutil.virtual_memory().free

@ypanagis
Author

ypanagis commented Oct 15, 2022

Thanks Bjorn, I just ran it, and it gives this RuntimeError from a Python 3.8 bug:

RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

...
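
For context, this is the message Python's multiprocessing module raises when a new process is started while the main module is still being imported. That typically happens under the "spawn" start method (the default on macOS since Python 3.8) when the calling script has no if __name__ == "__main__": guard. Below is a minimal sketch of the guarded pattern, assuming that is what happened here; run_workers is a hypothetical name and none of this is tablite or mplite code.

```python
import math
import multiprocessing


def run_workers(values):
    """Compute square roots in a small worker pool (illustrative only)."""
    # "spawn" is requested explicitly so the sketch behaves the same on
    # Linux as it does by default on macOS.
    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        return pool.map(math.sqrt, values)


# With "spawn", each child process re-imports the main module. Without this
# guard, the import would try to start a pool again inside the child and
# fail with the bootstrapping RuntimeError quoted above.
if __name__ == "__main__":
    print(run_workers([1.0, 4.0, 9.0]))  # [1.0, 2.0, 3.0]
```

Under this assumption, a script that calls Table.import_file at module level would need the same guard around that call.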

@root-11
Owner

root-11 commented Oct 19, 2022

@ypanagis - you think we can close this ticket now?

@ypanagis
Author

Yes @root-11, that makes total sense to me. I will try the package some more, but this part is definitely over now.

@root-11
Owner

root-11 commented Oct 19, 2022

Neat. Just FYI: I've released a new version today with slightly better memory management.
The details are in the changelog: https://github.com/root-11/tablite/blob/master/changelog.md

@root-11 closed this as completed Oct 19, 2022