Unable to process .arrow file in the datasets #545

Closed
ramses-lee opened this issue Mar 6, 2024 · 3 comments

@ramses-lee

A general demonstration is outlined in this Google Colab notebook: https://colab.research.google.com/drive/1oKhivD5T9Yi1gMl0_7dUwqVFqiNfD43k?usp=sharing

The 'flights-200k.arrow' file produces an error every time I try to read it with the pandas package.

@domoritz
Member

domoritz commented Mar 6, 2024

Can you try reading it both as an Arrow IPC file and as an IPC stream? Maybe try pyarrow directly.

@ramses-lee
Author

Not exactly sure what you meant, but I tested both the Parquet read_table() function and the pyarrow memory_map() function, and both gave me an error.

@domoritz
Member

domoritz commented Mar 6, 2024

Ahh, I fixed it. The file wasn't closed properly.

This works now.

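import pyarrow as pa

# Read the whole Arrow IPC file into memory as bytes.
with open('data/flights-200k.arrow', 'rb') as f:
    buf = f.read()

# Open the bytes in the IPC file format and convert to a pandas DataFrame.
with pa.ipc.open_file(buf) as reader:
    df = reader.read_pandas()

print(df)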
