
Exporting to arrow seems to create corrupted or invalid output #2228

Closed
alvations opened this issue Oct 13, 2022 · 2 comments

Comments

@alvations

With these vaex and pyarrow versions:

>>> vaex.__version__
{'vaex': '4.12.0',
 'vaex-core': '4.12.0',
 'vaex-viz': '0.5.3',
 'vaex-hdf5': '0.12.3',
 'vaex-server': '0.8.1',
 'vaex-astro': '0.9.1',
 'vaex-jupyter': '0.8.0',
 'vaex-ml': '0.18.0'}

>>> pyarrow.__version__
8.0.0

When reading a TSV file and exporting it to arrow, the resulting file cannot be loaded by pyarrow.parquet.read_table(). For example, given a file s2t.tsv created as follows:

$ printf "test-1\nfoobar\ntest-1\nfoobar\ntest-1\nfoobar\ntest-1\nfoobar\n" > s
$ printf "1-best\npoo bear\n1-best\npoo bear\n1-best\npoo bear\n1-best\npoo bear\n" > t
$ paste s t > s2t.tsv

Then exporting the TSV to arrow and reading it back:

import vaex
import pyarrow as pa
import pyarrow.parquet  # needed so that pa.parquet is available

df = vaex.from_csv('s2t.tsv', sep='\t', header=None)
df.export_arrow('s2t.parquet')

pa.parquet.read_table('s2t.parquet')

It throws the following error:

---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
/tmp/ipykernel_17/3649263967.py in <module>
      1 import pyarrow as pa
      2 
----> 3 pa.parquet.read_table('s2t.parquet')

/opt/conda/lib/python3.7/site-packages/pyarrow/parquet/__init__.py in read_table(source, columns, use_threads, metadata, schema, use_pandas_metadata, memory_map, read_dictionary, filesystem, filters, buffer_size, partitioning, use_legacy_dataset, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, decryption_properties)
   2746                 ignore_prefixes=ignore_prefixes,
   2747                 pre_buffer=pre_buffer,
-> 2748                 coerce_int96_timestamp_unit=coerce_int96_timestamp_unit
   2749             )
   2750         except ImportError:

/opt/conda/lib/python3.7/site-packages/pyarrow/parquet/__init__.py in __init__(self, path_or_paths, filesystem, filters, partitioning, read_dictionary, buffer_size, memory_map, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, schema, decryption_properties, **kwargs)
   2338 
   2339             self._dataset = ds.FileSystemDataset(
-> 2340                 [fragment], schema=schema or fragment.physical_schema,
   2341                 format=parquet_format,
   2342                 filesystem=fragment.filesystem

/opt/conda/lib/python3.7/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Fragment.physical_schema.__get__()

/opt/conda/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

/opt/conda/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Could not open Parquet input source 's2t.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

Is there some additional args/kwargs that should be added when exporting or reading the parquet files?

Or is the exporting to arrow bugged/broken somehow?

@JovanVeljanoski
Member

JovanVeljanoski commented Oct 13, 2022

You are using the wrong method.

Basically you need to

df.export_parquet("file.parquet")

# or 

df.export("file.parquet") # This will automatically pick the exporter above based on the file extension

By contrast, df.export_arrow("file.arrow") exports to a different, Arrow-native file format (Arrow IPC), which pyarrow.parquet cannot read.

@alvations
Author

Thanks for the quick reply! I've got the right write/read functions and extensions now.
