[BUG-REPORT] open_many and concat does not handle missing columns correctly #1491

Open
FrankBoermanTenneT opened this issue Aug 4, 2021 · 5 comments

@FrankBoermanTenneT

FrankBoermanTenneT commented Aug 4, 2021

Hi, I have a bunch of HDF5 files that are similar, but some of them are missing a column.

I can open them with vaex.open_many without an error, or load them one by one with vaex.open and then stitch them together with vaex.concat. However, when I try to write them out to one big HDF5 file with df.export_hdf5, I get this error:

  File "C:\Users\<snip>\Anaconda3\lib\site-packages\vaex\dataset.py", line 1453, in is_masked
    ar = self._columns[column]
KeyError: 'PTDF_IT-AT'

I expected vaex to insert NA values for the columns that are missing from some of the files, as per #156. However, it does not seem to do that. Am I looking at it the wrong way?

I am running vaex 4.3.0

@JovanVeljanoski
Member

Thank you for the report. I can confirm this is a bug.

For a quick and dirty solution you can try:

  • export to Arrow (and, if needed, re-export that to HDF5)
  • for the numerical columns, you could try df.materialize(col_that_is_sometimes_missing)
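
A minimal sketch of the first suggestion, assuming the combined DataFrame exports to Arrow without hitting the same error. The function names, the input paths, and the choice to put the intermediate Arrow file next to the HDF5 target are illustrative assumptions, not something from this thread:

```python
from pathlib import Path


def arrow_path_for(hdf5_path):
    """Return a sibling path with an .arrow suffix for the intermediate file."""
    return Path(hdf5_path).with_suffix(".arrow")


def export_via_arrow(paths, hdf5_path):
    """Concatenate the files in `paths` and write them to `hdf5_path` via Arrow."""
    # Imported lazily so the path helper above can be used without vaex installed.
    import vaex

    df = vaex.open_many(paths)  # the files may have differing columns
    intermediate = arrow_path_for(hdf5_path)

    # Step 1: export the concatenated DataFrame to Arrow.
    df.export_arrow(str(intermediate))

    # Step 2 (only if HDF5 is really needed): reopen the Arrow file and re-export it.
    vaex.open(str(intermediate)).export_hdf5(str(hdf5_path))
    return intermediate
```

A call would look like `export_via_arrow(["part1.hdf5", "part2.hdf5"], "combined.hdf5")`, where the file names are placeholders. For the second suggestion, the equivalent step would be `df = df.materialize("PTDF_IT-AT")` before calling `df.export_hdf5`.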

I will try to push a test for this soon!

@FrankBoermanTenneT
Author

Hi @JovanVeljanoski, thanks for the quick reply. As a workaround, I figured out which columns were breaking and removed them for now (it turned out they mostly held None data anyway, which I didn't know). This fixed my problem for now.
But it would be great if you managed to fix this! Thanks in advance, big fan of the library!
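
A rough sketch of how such a workaround could look, assuming each file opens fine on its own. The helper names and file names are hypothetical. The first function is plain Python and only compares column-name lists. The second uses vaex.open, df.get_column_names, df.drop and vaex.concat to drop every column that is not present in all files:

```python
def columns_missing_somewhere(columns_per_file):
    """Map each column absent from at least one file to the files lacking it."""
    all_columns = set()
    for columns in columns_per_file.values():
        all_columns.update(columns)

    missing = {}
    for name in sorted(all_columns):
        lacking = [path for path, columns in columns_per_file.items()
                   if name not in columns]
        if lacking:
            missing[name] = lacking
    return missing


def open_without_incomplete_columns(paths):
    """Open each file, drop columns not shared by all files, then concatenate."""
    # Imported lazily so the helper above can be used without vaex installed.
    import vaex

    frames = {path: vaex.open(path) for path in paths}
    columns_per_file = {path: df.get_column_names() for path, df in frames.items()}
    incomplete = columns_missing_somewhere(columns_per_file)

    kept = []
    for path, df in frames.items():
        to_drop = [c for c in columns_per_file[path] if c in incomplete]
        kept.append(df.drop(to_drop) if to_drop else df)
    return vaex.concat(kept), incomplete


# Example with made-up file names and the column from the traceback:
report = columns_missing_somewhere({
    "a.hdf5": ["x", "PTDF_IT-AT"],
    "b.hdf5": ["x"],
})
print(report)
```

The returned mapping also shows which files lack each column, which helps to decide whether dropping a column loses any real data.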

@maartenbreddels
Member

> big fan of the library!

Always good to hear :) Thanks!

@FrankBoermanTenneT
Author

Well, your test suite is good, I think. If I read it correctly, this: https://github.com/vaexio/vaex/pull/1493/checks?check_run_id=3242308011#step:18:274 is my exact bug (I am running Python 3.8.8 on Windows 10). So it is at least reproducible.

@maartenbreddels
Member

Nice work, @JovanVeljanoski!
