When reading a TSV file and exporting it to Arrow, the resulting file couldn't be loaded back with pyarrow.parquet.read_table(). For example, given a file s2t.tsv:
$ printf "test-1\nfoobar\ntest-1\nfoobar\ntest-1\nfoobar\ntest-1\nfoobar\n" > s
$ printf "1-best\npoo bear\n1-best\npoo bear\n1-best\npoo bear\n1-best\npoo bear\n" > t
$ paste s t > s2t.tsv
Then exporting the TSV to Arrow as follows, and reading it back:
import vaex
import pyarrow as pa
df = vaex.from_csv('s2t.tsv', sep='\t', header=None)
df.export_arrow('s2t.parquet')
pa.parquet.read_table('s2t.parquet')
It throws the following error:
---------------------------------------------------------------------------
ArrowInvalid Traceback (most recent call last)
/tmp/ipykernel_17/3649263967.py in <module>
1 import pyarrow as pa
2
----> 3 pa.parquet.read_table('s2t.parquet')
/opt/conda/lib/python3.7/site-packages/pyarrow/parquet/__init__.py in read_table(source, columns, use_threads, metadata, schema, use_pandas_metadata, memory_map, read_dictionary, filesystem, filters, buffer_size, partitioning, use_legacy_dataset, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, decryption_properties)
2746 ignore_prefixes=ignore_prefixes,
2747 pre_buffer=pre_buffer,
-> 2748 coerce_int96_timestamp_unit=coerce_int96_timestamp_unit
2749 )
2750 except ImportError:
/opt/conda/lib/python3.7/site-packages/pyarrow/parquet/__init__.py in __init__(self, path_or_paths, filesystem, filters, partitioning, read_dictionary, buffer_size, memory_map, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, schema, decryption_properties, **kwargs)
2338
2339 self._dataset = ds.FileSystemDataset(
-> 2340 [fragment], schema=schema or fragment.physical_schema,
2341 format=parquet_format,
2342 filesystem=fragment.filesystem
/opt/conda/lib/python3.7/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Fragment.physical_schema.__get__()
/opt/conda/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
/opt/conda/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowInvalid: Could not open Parquet input source 's2t.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
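The "magic bytes not found in footer" message means pyarrow did not find the 4-byte `PAR1` marker that every Parquet file must begin and end with, i.e. the exported file is not Parquet at all. A quick way to see what was actually written (a small sketch; `detect_format` is a hypothetical helper, not part of pyarrow) is to inspect the file's leading bytes, since Arrow IPC files instead start with `ARROW1`:

```python
def detect_format(path):
    """Distinguish Parquet from Arrow IPC files by their magic bytes.

    Parquet files start (and end) with b"PAR1"; Arrow IPC files start
    with b"ARROW1". Anything else is reported as unknown.
    """
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(b"PAR1"):
        return "parquet"
    if head.startswith(b"ARROW1"):
        return "arrow"
    return "unknown"
```

Running this on the file produced by export_arrow should show which format actually ended up on disk.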
Are there additional args/kwargs that should be passed when exporting or reading the parquet files?
Or is the export to Arrow bugged/broken somehow?
On these vaex and pyarrow versions: