Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[datasets] Module design suggestion #94

Closed
2 tasks done
frgfm opened this issue Jan 27, 2021 · 5 comments
Closed
2 tasks done

[datasets] Module design suggestion #94

frgfm opened this issue Jan 27, 2021 · 5 comments
Assignees
Labels
module: datasets Related to datasets type: enhancement New feature or request
Milestone

Comments

@frgfm
Copy link
Member

frgfm commented Jan 27, 2021

Here is a suggestion for the pyrovision.datasets module's organization:

  • utils.py: shared utility functions
  • XYZ.py: a VisionDataset inherited class definion with functions for dataset XYZ

This suggestion is based on the organization of https://github.com/pytorch/vision/tree/master/torchvision/datasets which is quite effective. The documentation should also include detailed explanation on how to use it.

If one dataset was made available by Pyronear, it should be accessible via a link with a hash for integrity check.

@frgfm frgfm added type: enhancement New feature or request module: datasets Related to datasets labels Jan 27, 2021
@frgfm frgfm added this to the 0.1.1 milestone Jan 27, 2021
@frgfm frgfm self-assigned this Jan 27, 2021
@TekayaNidham
Copy link
Contributor

Good idea, that would work

@frgfm
Copy link
Member Author

frgfm commented Feb 7, 2021

Hey @MateoLostanlen @blenzi @x0s @Akilditu ,

While refactoring the datasets module, I ended up wondering whether some things should be removed because of their misalignment with the repo's purpose. Generally speaking, the purpose of the datasets module in this repo is to make for each dataset its source accessible (using processed annotations if need be) and offering a torchvision dataset. I believe there are some features linked to wildfire that are not aligned with this (considering that for now, the source of wildfire is not even accessible publicly). Some details below:

openfire

  • the URL needs to be moved to a release attachments

video_utils

  • FireLabeler: for me, this helps to structure the dataset before exposing it publicly, but not after. In my opinion, this should be removed from the library.
  • FrameExtractor: this is really helpful, but we could make it much better. First, do we want something to do this statically (creating image files from video files) or dynamically (used by the dataloader)? Second, whatever the first answer is, we need to make it compatible for other video datasets (perhaps there are interesting things we could use in https://github.com/pytorch/vision/blob/master/torchvision/datasets/video_utils.py). I would argue that considering its name, the parameters of the constructor should be sampling parameters (frames / sec, and max_frames for instance), and its call arguments should be a video file (or already read). What do you think?

utils

  • with some regex, we could refactor or remove the name + extension resolution

wildfire
In this module, I believe we need to make some decisions that users won't have to do later including: decide on some criteria using metadata to discard invalid samples (frames), all the remaining frames make the dataset (imbalanced perhaps but still). Next we need to select sampling (none, origin_proportion, positive_ratio), and finally train/val/test split (this last part can be done in the same fashion as with sklearn). In the short term:

  • WildFireDataset: the class gives access to a non-processed entity. First we need to decide (or split) whether this is an Image dataset or a video dataset. Assuming we go with images, is it using image files or video files (sampled into frames). Additionally, we give access to "target_names" which should be used to decide whether we keep/reject samples before making it public. We could set a confidence threshold and discard video samples that do not match it.
  • WildFireSplitter: rereading the code, this should be a function. Additionally, I'm starting to think that sklearn utilities could do the same job considering the nature of the split.
  • computeSubset: this should be refactored and integrated in the Dataset object to be used with the sampling argument in the constructor
  • ExhaustSplitStrategy: can someone remind me the purpose of it?

I do believe it would bring much more value to upload a clean subset of the dataset similar to the one of @MateoLostanlen to have a stable user-friendly and shareable dataset. What do you think?

@frgfm frgfm modified the milestones: 0.1.1, 0.1.2 Feb 28, 2021
@frgfm
Copy link
Member Author

frgfm commented Mar 10, 2021

Any feedback @MateoLostanlen @blenzi @fe51 @Akilditu ? :)

@MateoLostanlen
Copy link
Member

@Akilditu has sarted a Dataset Repo, I guess of these stuff should move there

@frgfm
Copy link
Member Author

frgfm commented Jul 4, 2022

Closed by #136 & #138

@frgfm frgfm closed this as completed Jul 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module: datasets Related to datasets type: enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants