
Support for uploading datasets #1409

Closed · DumbMachine opened this issue Jan 20, 2020 · 4 comments

Comments

@DumbMachine
Contributor

Dataset sizes are increasing, and so is the computational power required to process them. Most data scientists and other users prefer cloud-based solutions for storing datasets. This issue is opened to discuss a feature that does the following:

  • Allows users to upload datasets (up to 10 GB) to the Kaggle platform (a sketch of this follows below).
  • Allows users to connect to their choice of cloud-based storage (e.g. Google Cloud Storage buckets) by supporting uploads to and downloads from a bucket, with initial support for the major providers (AWS, GCP, and Azure).
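
As a rough illustration of the first point, the official kaggle Python package already exposes a dataset-creation entry point. A minimal sketch, assuming the folder contains the dataset files plus a dataset-metadata.json (required by the Kaggle API) and that credentials are configured in ~/.kaggle/kaggle.json:

```python
# Minimal sketch: pushing a local folder to Kaggle as a new dataset.
# Assumes dataset-metadata.json exists in the folder and Kaggle API
# credentials are set up; dataset_create_new is part of the kaggle package.
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()
api.dataset_create_new(folder="path/to/dataset")  # subject to Kaggle's size limits
```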

The motivation for this feature is as follows:

  • It allows users to save processed data.
  • It reduces configuration work by letting users upload datasets with simple CLI commands.
  • Users can easily version each upload; we could use Git LFS for better versioning/history of each upload.

This would require some research into the configuration of the respective cloud solutions, as well as a good design for proper dataset versioning.

@DumbMachine
Contributor Author

@ethanwhite @henrykironde let me know your thoughts on this.

@henrykironde
Contributor

I do think this is a good idea, and I really like it. We could start by integrating with Kaggle and see where we go from there. As for versioning, we already have a tool to track that; we shall see how we advance to support this feature further.

@ethanwhite
Member

I think the idea of cloud storage backends is a good one. If storing into a cloud database, this should already be supported through the existing ability to pass remote database locations, so this issue would be about adding flat-file storage, which I agree would be useful. I think this could be combined with https://github.com/weecology/retriever/wiki/GSoC-2020-Project-Ideas#data-retriever-add-support-for-more-raw-data-formats to make a project that is basically about adding new types of data as sources and new backends for data to be converted into.
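
For reference, here is a minimal sketch of that existing remote-database path via the Python API; the exact parameter names (user, host, port) are assumptions based on the current install_postgres interface:

```python
# Minimal sketch: retriever already supports remote database backends
# by passing connection parameters; parameter names are assumptions
# based on the current Python API, not a confirmed signature.
import retriever

retriever.install_postgres(
    "iris",                 # any dataset known to retriever
    user="myuser",
    password="mypassword",
    host="db.example.org",  # remote database location
    port=5432,
)
```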

@DumbMachine
Contributor Author

I believe combining provenance with this idea of uploading data is a good fit. Users will mostly have two choices when working with this:

  1. A user is working with a dataset included in retriever-recipes. After some modifications, they wish to save the changes they made to the dataset. They can commit the changes: retriever commit <dataset> -m "<commit_message>" --path <path>
     Then, if they wish to upload these changes for use elsewhere: retriever upload <dataset> --backend aws --bucket_name <bucketname> (a sketch of this step follows below)
  2. Users who have their own data follow much the same process: commit the changes and then upload.
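
To make the proposal concrete, here is a minimal sketch of what the upload step might do internally for the aws backend. The upload_dataset helper and the committed-dataset path are hypothetical, but the boto3 calls are the standard S3 upload API:

```python
# Hypothetical sketch of an AWS S3 upload backend.
# upload_dataset is an illustrative helper, not an existing retriever
# function; boto3's client("s3").upload_file is the standard S3 upload call.
import os
import boto3

def upload_dataset(dataset_dir, bucket_name, prefix=""):
    """Upload every file in a committed dataset directory to an S3 bucket."""
    s3 = boto3.client("s3")  # credentials come from the usual AWS config/env vars
    for root, _dirs, files in os.walk(dataset_dir):
        for name in files:
            path = os.path.join(root, name)
            rel = os.path.relpath(path, dataset_dir)
            key = f"{prefix}/{rel}" if prefix else rel
            s3.upload_file(path, bucket_name, key)

# Example: upload_dataset("committed/my-dataset", "my-bucket", "my-dataset/v1")
```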
