
Behaviour of append method #23

Closed
marchinidavide opened this issue Aug 15, 2019 · 6 comments

Comments

@marchinidavide

Hi everyone :)

I would like to confirm my understanding of the method for appending data to an item using collection.append(item, data).
To my understanding, this operation creates a new parquet file and modifies the metadata.
I would like to avoid ending up with thousands of very small files and instead "include the new data in the last .parquet file", creating a new file only once the last one reaches the predefined length (I'm fine with the current value of 1 million rows).
I see from the code that this ends up calling Dask's dd.to_parquet(); I tried to dig deeper into it, but I find the code very convoluted and difficult to read :(
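For reference, the pattern I'm talking about looks roughly like this (the store, collection, and item names are just placeholders):

```python
import pystore

pystore.set_path("~/pystore")            # where the datastore lives on disk
store = pystore.store("mydatastore")     # placeholder store name
collection = store.collection("eod")     # placeholder collection name

collection.write("AAPL", initial_df, metadata={"source": "example"})
collection.append("AAPL", new_rows_df)   # as far as I can tell, each call writes a new parquet file

df = collection.item("AAPL").to_pandas() # read the whole item back into pandas
```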

Ideally my workflow would be this in pseudocode:

open the last parquet file in pandas
append the new data
remove the last parquet file
write with to_parquet() in chunks of 10**6 rows
update the _metadata file so PyStore can still do efficient reads

I didn't find a way to modify the _metadata file; any hint on this would be really appreciated :)
Also, any opinion on why I shouldn't be doing this is very welcome!
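A rough sketch of what I have in mind (assuming the item directory contains Dask/fastparquet-style part.N.parquet files; the paths are placeholders, and I'm not certain fastparquet.writer.merge() is the right way to rebuild _metadata):

```python
import glob
import os

import fastparquet
import pandas as pd


def part_number(path):
    # dask/fastparquet name partitions part.N.parquet
    return int(os.path.basename(path).split(".")[1])


def append_compacted(item_path, new_data, rows_per_file=10**6):
    """Fold new_data into the last partition instead of writing a tiny new file."""
    parts = sorted(glob.glob(os.path.join(item_path, "part.*.parquet")), key=part_number)

    # 1) open the last parquet file in pandas and append the new data
    if parts:
        combined = pd.concat([pd.read_parquet(parts[-1]), new_data]).sort_index()
        start = part_number(parts[-1])
        os.remove(parts[-1])  # 2) remove the last parquet file
    else:
        combined, start = new_data.sort_index(), 0

    # 3) rewrite in chunks of up to rows_per_file rows, continuing the numbering
    #    (compression and other write options should match the existing partitions)
    for i, offset in enumerate(range(0, len(combined), rows_per_file)):
        fname = os.path.join(item_path, "part.%d.parquet" % (start + i))
        fastparquet.write(fname, combined.iloc[offset:offset + rows_per_file])

    # 4) rebuild _metadata so PyStore/Dask can still read the item efficiently
    parts = sorted(glob.glob(os.path.join(item_path, "part.*.parquet")), key=part_number)
    fastparquet.writer.merge(parts)
```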

@marchinidavide
Author

I think it's closely related to dask/fastparquet#114 ... happy to move the discussion to Dask repository if more appropriate!

@cevans3098

@marchinidavide I had the same issue and think I resolved it. See my code and comments in issue #17.

It would be great if you could confirm and verify performance.

@marchinidavide
Author

> @marchinidavide I had the same issue and think I resolved it. See my code and comments in issue #17.
>
> It would be great if you could confirm and verify performance.

Thanks for the link to your solution; it seems I completely missed it!
I will for sure try it and benchmark it with my data!

ranaroussi added a commit that referenced this issue Aug 20, 2019
@ranaroussi
Owner

I've pushed a new version to the dev branch. It should result in faster and more consistent behavior when using append. By default, PyStore will aim for partitions of ~99MB each (as per Dask's recommendation).

LMK.
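Roughly, the sizing works along these lines (a simplified sketch, not the exact code): measure the Dask dataframe's in-memory size and pick a partition count that targets about 99MB per partition.

```python
import dask.dataframe as dd

TARGET_PARTITION_BYTES = 99 * 1024 ** 2  # ~99MB per partition, per Dask's recommendation


def repartition_to_target(ddf: dd.DataFrame) -> dd.DataFrame:
    # total in-memory footprint of the dataframe, in bytes
    total_bytes = int(ddf.memory_usage(deep=True).sum().compute())
    npartitions = max(1, total_bytes // TARGET_PARTITION_BYTES + 1)
    return ddf.repartition(npartitions=npartitions)
```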

@marchinidavide
Author

Thanks! Will definitely have a look soon and give feedback!

@ranaroussi
Owner

ranaroussi commented Aug 21, 2019

Closing this issue and moving all related discussions to issue #21.

Please see my comments here: #21 (comment), and here: #21 (comment)
