
Behaviour of append method #23

Closed
marchinidavide opened this issue Aug 15, 2019 · 6 comments

Comments

@marchinidavide

Hi everyone :)

I would like to confirm my understanding of the method for appending data to an item using collection.append(item, data).
To my understanding, this operation creates a new parquet file and modifies the metadata.
I would like to avoid ending up with thousands of very small files and instead "include the new data in the last .parquet file", creating a new file only once the last one reaches the predefined length (I'm fine with the current value of 1 million rows).
I see from the code that this ends up calling Dask's dd.to_parquet(); I tried to dig deeper into it, but I find the code very convoluted and difficult to read :(
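For reference, the pattern I'm talking about looks roughly like this (the store, collection, and item names are just placeholders):

```python
import pystore

pystore.set_path("~/pystore")            # where the datastore lives on disk
store = pystore.store("mydatastore")     # placeholder store name
collection = store.collection("eod")     # placeholder collection name

collection.write("AAPL", initial_df, metadata={"source": "example"})
collection.append("AAPL", new_rows_df)   # as far as I can tell, each call writes a new parquet file

df = collection.item("AAPL").to_pandas() # read the whole item back into pandas
```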

Ideally my workflow would be this in pseudocode:

open the last parquet file in pandas
append the new data
remove the last parquet file
write with to_parquet() in chunks of 10**6 rows
update the _metadata file so PyStore can still do efficient reads

I didn't find a way to modify the _metadata file; any hint on this would be really appreciated :)
Also, any opinion on why I shouldn't be doing this is very welcome!
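A rough sketch of what I have in mind (assuming the item directory contains Dask/fastparquet-style part.N.parquet files; the paths are placeholders, and I'm not certain fastparquet.writer.merge() is the right way to rebuild _metadata):

```python
import glob
import os

import fastparquet
import pandas as pd


def part_number(path):
    # dask/fastparquet name partitions part.N.parquet
    return int(os.path.basename(path).split(".")[1])


def append_compacted(item_path, new_data, rows_per_file=10**6):
    """Fold new_data into the last partition instead of writing a tiny new file."""
    parts = sorted(glob.glob(os.path.join(item_path, "part.*.parquet")), key=part_number)

    # 1) open the last parquet file in pandas and append the new data
    if parts:
        combined = pd.concat([pd.read_parquet(parts[-1]), new_data]).sort_index()
        start = part_number(parts[-1])
        os.remove(parts[-1])  # 2) remove the last parquet file
    else:
        combined, start = new_data.sort_index(), 0

    # 3) rewrite in chunks of up to rows_per_file rows, continuing the numbering
    #    (compression and other write options should match the existing partitions)
    for i, offset in enumerate(range(0, len(combined), rows_per_file)):
        fname = os.path.join(item_path, "part.%d.parquet" % (start + i))
        fastparquet.write(fname, combined.iloc[offset:offset + rows_per_file])

    # 4) rebuild _metadata so PyStore/Dask can still read the item efficiently
    parts = sorted(glob.glob(os.path.join(item_path, "part.*.parquet")), key=part_number)
    fastparquet.writer.merge(parts)
```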

@marchinidavide
Author

I think it's closely related to dask/fastparquet#114 ... happy to move the discussion to Dask repository if more appropriate!

@cevans3098

@marchinidavide I had the same issue and think I resolved it. See my code and comments in issue #17.

It would be great if you could confirm and verify performance.

@marchinidavide
Author

> @marchinidavide I had the same issue and think I resolved it. See my code and comments in issue #17.
>
> It would be great if you could confirm and verify performance.

Thanks for the link to your solution; it seems I completely missed it!
I will for sure try it and benchmark it with my data!

ranaroussi added a commit that referenced this issue Aug 20, 2019
@ranaroussi
Owner

I've pushed a new version to the dev branch. It should result in faster and more consistent behavior when using append. By default, PyStore will aim for partitions of ~99MB each (as per Dask's recommendation).

LMK.
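Roughly, the sizing works along these lines (a simplified sketch, not the exact code): measure the Dask dataframe's in-memory size and pick a partition count that targets about 99MB per partition.

```python
import dask.dataframe as dd

TARGET_PARTITION_BYTES = 99 * 1024 ** 2  # ~99MB per partition, per Dask's recommendation


def repartition_to_target(ddf: dd.DataFrame) -> dd.DataFrame:
    # total in-memory footprint of the dataframe, in bytes
    total_bytes = int(ddf.memory_usage(deep=True).sum().compute())
    npartitions = max(1, total_bytes // TARGET_PARTITION_BYTES + 1)
    return ddf.repartition(npartitions=npartitions)
```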

@marchinidavide
Author

Thanks! Will definitely have a look soon and give feedback!

@ranaroussi
Owner

ranaroussi commented Aug 21, 2019

Closing this issue and moving all related discussions to issue #21.

Please see my comments here: #21 (comment), and here: #21 (comment)
