Fix export (hdf5) of joined dataframe which contains missing values #1418

Open

JovanVeljanoski
Member

This PR addresses the issue raised in #1413.

Notes:

  • The added test checks the export against hdf5, arrow, parquet and csv. Only hdf5 and csv report problems, but for different reasons;
  • The hdf5 test fails because the data seems to be saved incorrectly;
  • The csv file is exported correctly. However, when the data is read back, the missing values are converted to nan. I think these should automatically be turned into missing (masked) values, not np.nan. What do you think, @maartenbreddels @xdssio? This might be trickier though, so I am happy to leave it as is and adjust the test. Just wondering what the ideal case should be; in my opinion, missing values.
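To make the csv point concrete, here is a minimal NumPy-only sketch of turning NaN placeholders back into masked values after a read. The helper name `nan_to_masked_int` is hypothetical and is not part of vaex. It also assumes that every NaN in the column means "missing", which would wrongly mask a genuine NaN in a float column.

```python
import numpy as np


def nan_to_masked_int(values):
    """Hypothetical helper: convert a NaN-marked float array to a masked int64 array."""
    values = np.asarray(values, dtype=np.float64)
    missing = np.isnan(values)
    # Fill the missing slots with 0 before casting so the int cast is well defined.
    filled = np.where(missing, 0, values).astype(np.int64)
    return np.ma.masked_array(filled, mask=missing)


# What a csv reader typically hands back for an int column with two missing rows.
read_back = np.array([1.0, np.nan, np.nan])
restored = nan_to_masked_int(read_back)
```

The result keeps the int64 dtype and carries the missing rows in the mask, which is the behaviour the test would need if missing values are the ideal outcome.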

Checklist:

  • Add tests
  • Make tests pass

@Ben-Epstein
Contributor

Ben-Epstein commented Feb 18, 2022

fwiw I believe the issue with hdf5 stems from the fact that numpy int dtypes cannot hold "missing" (nan) values.

import numpy as np

np.array([1,2,3,4,None], dtype=np.int64)

>>> TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

Converting the column to something non-int would work, but that may not be the desired outcome. Or maybe you could convert to a pyarrow array?
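As a rough illustration of why the non-int route may not be the desired outcome, this NumPy-only sketch stores the column as float64 and then shows the precision cost for large integers. The values are made up for the example.

```python
import numpy as np

# Storing the column as float64 lets NaN mark the missing row.
as_float = np.array([1, 2, 3, 4, np.nan], dtype=np.float64)

# The cost: float64 has a 53-bit significand, so large integers get rounded.
big = np.array([2**53 + 1], dtype=np.int64)
round_tripped = big.astype(np.float64).astype(np.int64)

print(as_float.dtype)          # the column is no longer an integer column
print(big[0], round_tripped[0])  # the two values differ after the round trip
```

An Arrow int64 array avoids this trade-off because it tracks nulls in a separate validity bitmap instead of using a sentinel value.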

For example, this works:

import vaex
import pandas as pd
import pyarrow as pa

df1 = vaex.from_pandas(pd.DataFrame([1,2,3], columns=['idx']))
df2 = vaex.from_pandas(pd.DataFrame({'idx':[1,6,7], 't1': [0,0,0]}))
df3 = df1.join(df2, how='left', on='idx', rsuffix='_y')

df3["idx_y"] = df3.idx_y.astype(pa.int64())
df3["t1"] = df3.t1.astype(pa.int64())
display(df3)

df3.export('test.hdf5')
vaex.open('test.hdf5')
