
Exporting to arrow seems to create corrupted or invalid output #2228

Closed
alvations opened this issue Oct 13, 2022 · 2 comments

Comments

@alvations

With these vaex and pyarrow versions:

>>> vaex.__version__
{'vaex': '4.12.0',
 'vaex-core': '4.12.0',
 'vaex-viz': '0.5.3',
 'vaex-hdf5': '0.12.3',
 'vaex-server': '0.8.1',
 'vaex-astro': '0.9.1',
 'vaex-jupyter': '0.8.0',
 'vaex-ml': '0.18.0'}

>>> pyarrow.__version__
8.0.0

When reading a TSV file and exporting it to arrow, the resulting file cannot be loaded by pyarrow.parquet.read_table(). For example, given a file s2t.tsv created as follows:

$ printf "test-1\nfoobar\ntest-1\nfoobar\ntest-1\nfoobar\ntest-1\nfoobar\n" > s
$ printf "1-best\npoo bear\n1-best\npoo bear\n1-best\npoo bear\n1-best\npoo bear\n" > t
$ paste s t > s2t.tsv

Then exporting the TSV to arrow and reading it back:

import vaex
import pyarrow as pa
import pyarrow.parquet  # needed so that pa.parquet is available

df = vaex.from_csv('s2t.tsv', sep='\t', header=None)
df.export_arrow('s2t.parquet')

pa.parquet.read_table('s2t.parquet')

It throws the following error:

---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
/tmp/ipykernel_17/3649263967.py in <module>
      1 import pyarrow as pa
      2 
----> 3 pa.parquet.read_table('s2t.parquet')

/opt/conda/lib/python3.7/site-packages/pyarrow/parquet/__init__.py in read_table(source, columns, use_threads, metadata, schema, use_pandas_metadata, memory_map, read_dictionary, filesystem, filters, buffer_size, partitioning, use_legacy_dataset, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, decryption_properties)
   2746                 ignore_prefixes=ignore_prefixes,
   2747                 pre_buffer=pre_buffer,
-> 2748                 coerce_int96_timestamp_unit=coerce_int96_timestamp_unit
   2749             )
   2750         except ImportError:

/opt/conda/lib/python3.7/site-packages/pyarrow/parquet/__init__.py in __init__(self, path_or_paths, filesystem, filters, partitioning, read_dictionary, buffer_size, memory_map, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, schema, decryption_properties, **kwargs)
   2338 
   2339             self._dataset = ds.FileSystemDataset(
-> 2340                 [fragment], schema=schema or fragment.physical_schema,
   2341                 format=parquet_format,
   2342                 filesystem=fragment.filesystem

/opt/conda/lib/python3.7/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Fragment.physical_schema.__get__()

/opt/conda/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

/opt/conda/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Could not open Parquet input source 's2t.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

Is there some additional args/kwargs that should be added when exporting or reading the parquet files?

Or is the exporting to arrow bugged/broken somehow?

@JovanVeljanoski
Member

JovanVeljanoski commented Oct 13, 2022

You are using the wrong method.

Basically you need to

df.export_parquet("file.parquet")

# or 

df.export("file.parquet") # This will automatically pick the exporter above based on the file extension

By contrast, df.export_arrow("file.arrow") exports to a different, Arrow-native file format (Arrow IPC), which pyarrow.parquet cannot read.

@alvations
Author

Thanks for the quick reply! I've got the right write/read functions and extensions now.
