
Support for uploading datasets #1409

Closed · DumbMachine opened this issue Jan 20, 2020 · 4 comments

Comments

@DumbMachine
Contributor

Dataset sizes are increasing, and so is the computational power required to process them. Most data scientists and other users prefer cloud-based solutions for storing datasets. This issue is opened to discuss a feature that does the following:

  • Allows users to upload datasets (up to 10 GB) to the Kaggle platform (a sketch of this follows below).
  • Allows users to connect to their choice of cloud-based storage (e.g. Google Cloud Storage buckets) by supporting uploads to and downloads from a bucket, with initial support for the major providers (AWS, GCP, and Azure).
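
As a rough illustration of the first point, the official kaggle Python package already exposes a dataset-creation entry point. A minimal sketch, assuming the folder contains the dataset files plus a dataset-metadata.json (required by the Kaggle API) and that credentials are configured in ~/.kaggle/kaggle.json:

```python
# Minimal sketch: pushing a local folder to Kaggle as a new dataset.
# Assumes dataset-metadata.json exists in the folder and Kaggle API
# credentials are set up; dataset_create_new is part of the kaggle package.
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()
api.dataset_create_new(folder="path/to/dataset")  # subject to Kaggle's size limits
```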

The motivation for this feature is as follows:

  • It allows users to save processed data.
  • It reduces configuration work by letting users upload datasets with simple CLI commands.
  • Users can easily version each upload; we could use Git LFS for better versioning/history of each upload.

This would require some research into the configuration of the respective cloud solutions, as well as a good design for proper dataset versioning.

@DumbMachine
Contributor Author

@ethanwhite @henrykironde let me know your thoughts on this.

@henrykironde
Contributor

I do think this is a good idea, and I really like it. We could start by integrating with Kaggle and see where we go from there. As for versioning, we already have a tool to track that; we shall see how we advance to support this feature further.

@ethanwhite
Member

I think the idea of cloud storage backends is a good one. If storing into a cloud database, this should already be supported through the existing ability to pass remote database locations, so this issue would be about adding flat-file storage, which I agree would be useful. I think this could be combined with https://github.com/weecology/retriever/wiki/GSoC-2020-Project-Ideas#data-retriever-add-support-for-more-raw-data-formats to make a project that is basically about adding new types of data as sources and new backends for data to be converted into.
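
For reference, here is a minimal sketch of that existing remote-database path via the Python API; the exact parameter names (user, host, port) are assumptions based on the current install_postgres interface:

```python
# Minimal sketch: retriever already supports remote database backends
# by passing connection parameters; parameter names are assumptions
# based on the current Python API, not a confirmed signature.
import retriever

retriever.install_postgres(
    "iris",                 # any dataset known to retriever
    user="myuser",
    password="mypassword",
    host="db.example.org",  # remote database location
    port=5432,
)
```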

@DumbMachine
Contributor Author

I believe combining provenance with this idea of uploading data is a good fit. Users will mostly have two choices when working with this:

  1. A user is working with a dataset included in retriever-recipes. After some modifications, they wish to save the changes they made to the dataset. They can commit the changes: retriever commit <dataset> -m "<commit_message>" --path <path>
     Then, if they wish to upload these changes for use elsewhere: retriever upload <dataset> --backend aws --bucket_name <bucketname> (a sketch of this step follows below)
  2. Users who have their own data follow much the same process: commit the changes and then upload.
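
To make the proposal concrete, here is a minimal sketch of what the upload step might do internally for the aws backend. The upload_dataset helper and the committed-dataset path are hypothetical, but the boto3 calls are the standard S3 upload API:

```python
# Hypothetical sketch of an AWS S3 upload backend.
# upload_dataset is an illustrative helper, not an existing retriever
# function; boto3's client("s3").upload_file is the standard S3 upload call.
import os
import boto3

def upload_dataset(dataset_dir, bucket_name, prefix=""):
    """Upload every file in a committed dataset directory to an S3 bucket."""
    s3 = boto3.client("s3")  # credentials come from the usual AWS config/env vars
    for root, _dirs, files in os.walk(dataset_dir):
        for name in files:
            path = os.path.join(root, name)
            rel = os.path.relpath(path, dataset_dir)
            key = f"{prefix}/{rel}" if prefix else rel
            s3.upload_file(path, bucket_name, key)

# Example: upload_dataset("committed/my-dataset", "my-bucket", "my-dataset/v1")
```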
