[Feature Request] Using parquet files instead/alongside torch splits #377

Dsantra92 · 2022-09-09T22:59:04Z

Hello devs.
I am trying to develop support for OGB datasets in MLDatasets.jl. One of the bottlenecks we are facing is loading the .pt files. The current implementation, which relies on a Pickle.jl hack, uses substantially more memory than Python does. With the new support for TorchArrow, could you provide Parquet files for loading the splits?

weihua916 · 2023-02-17T23:42:28Z

Hi! Are the split files that large? They just store the split indices, no?

Dsantra92 · 2023-02-18T00:35:24Z

I was asking whether it is possible, or planned, to store the computed splits in a language-independent format.

weihua916 · 2023-02-18T02:20:38Z

I see. That'd require all zipped files to be re-created. I do not think we will support this in the immediate future. You can probably consider some workaround on your side.
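One possible workaround of this kind, sketched here under assumptions and not something OGB provides: a one-off conversion step on the Python side that writes the split indices to a plain, language-independent file, so the Julia side does not need to unpickle .pt files. The sketch below uses gzipped CSV from the standard library instead of Parquet, to avoid extra dependencies. The function names `write_splits_csv` and `read_split_csv` are hypothetical, and the input is assumed to be a flat mapping from split name to integer indices (for example, obtained by calling `.tolist()` on each loaded tensor).

```python
import csv
import gzip
from pathlib import Path
from typing import Dict, Iterable, List


def write_splits_csv(splits: Dict[str, Iterable[int]], out_dir: str) -> List[Path]:
    """Write each split's indices to <out_dir>/<name>.csv.gz, one index per row.

    `splits` is assumed to be a flat mapping such as
    {"train": [...], "valid": [...], "test": [...]}; nested split
    dictionaries would need to be flattened first.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for name, indices in splits.items():
        path = out / f"{name}.csv.gz"
        # Text mode with newline="" is what the csv module expects.
        with gzip.open(path, "wt", newline="", encoding="utf-8") as fh:
            writer = csv.writer(fh)
            for idx in indices:
                writer.writerow([int(idx)])
        written.append(path)
    return written


def read_split_csv(path) -> List[int]:
    """Read one split file back into a list of ints (round-trip check)."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as fh:
        return [int(row[0]) for row in csv.reader(fh) if row]


if __name__ == "__main__":
    import tempfile

    demo = {"train": [0, 1, 2], "valid": [3], "test": [4, 5]}
    with tempfile.TemporaryDirectory() as tmp:
        for p in write_splits_csv(demo, tmp):
            print(p.name, read_split_csv(p))
```

The same structure would carry over to Parquet by swapping the writer for a Parquet library; CSV is used here only because it needs nothing beyond the standard library and is readable from most languages.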

Dsantra92 · 2023-02-18T08:54:44Z

Makes sense!🙁

Dsantra92 closed this as not planned · 2023-02-18
