[datasets] Module design suggestion #94

frgfm · 2021-01-27T23:17:54Z

Here is a suggestion for the pyrovision.datasets module's organization:

utils.py: shared utility functions
XYZ.py: a VisionDataset inherited class definion with functions for dataset XYZ

This suggestion is based on the organization of https://github.com/pytorch/vision/tree/master/torchvision/datasets which is quite effective. The documentation should also include detailed explanation on how to use it.

If one dataset was made available by Pyronear, it should be accessible via a link with a hash for integrity check.

Refactors into a flat module (refactor: Refactored pyrovision.datasets module #97)
Assess which functions/classes should be kept in the repo (remove dataset creation related modules #136, refactor: Refactored package for incoming v0.2 #138)

The text was updated successfully, but these errors were encountered:

TekayaNidham · 2021-01-27T23:37:09Z

Good idea, that would work

frgfm · 2021-02-07T21:05:51Z

Hey @MateoLostanlen @blenzi @x0s @Akilditu ,

While refactoring the datasets module, I ended up wondering whether some things should be removed because of their misalignment with the repo's purpose. Generally speaking, the purpose of the datasets module in this repo is to make for each dataset its source accessible (using processed annotations if need be) and offering a torchvision dataset. I believe there are some features linked to wildfire that are not aligned with this (considering that for now, the source of wildfire is not even accessible publicly). Some details below:

openfire

the URL needs to be moved to a release attachments

video_utils

FireLabeler: for me, this helps to structure the dataset before exposing it publicly, but not after. In my opinion, this should be removed from the library.
FrameExtractor: this is really helpful, but we could make it much better. First, do we want something to do this statically (creating image files from video files) or dynamically (used by the dataloader)? Second, whatever the first answer is, we need to make it compatible for other video datasets (perhaps there are interesting things we could use in https://github.com/pytorch/vision/blob/master/torchvision/datasets/video_utils.py). I would argue that considering its name, the parameters of the constructor should be sampling parameters (frames / sec, and max_frames for instance), and its call arguments should be a video file (or already read). What do you think?

utils

with some regex, we could refactor or remove the name + extension resolution

wildfire
In this module, I believe we need to make some decisions that users won't have to do later including: decide on some criteria using metadata to discard invalid samples (frames), all the remaining frames make the dataset (imbalanced perhaps but still). Next we need to select sampling (none, origin_proportion, positive_ratio), and finally train/val/test split (this last part can be done in the same fashion as with sklearn). In the short term:

WildFireDataset: the class gives access to a non-processed entity. First we need to decide (or split) whether this is an Image dataset or a video dataset. Assuming we go with images, is it using image files or video files (sampled into frames). Additionally, we give access to "target_names" which should be used to decide whether we keep/reject samples before making it public. We could set a confidence threshold and discard video samples that do not match it.
WildFireSplitter: rereading the code, this should be a function. Additionally, I'm starting to think that sklearn utilities could do the same job considering the nature of the split.
computeSubset: this should be refactored and integrated in the Dataset object to be used with the sampling argument in the constructor
ExhaustSplitStrategy: can someone remind me the purpose of it?

I do believe it would bring much more value to upload a clean subset of the dataset similar to the one of @MateoLostanlen to have a stable user-friendly and shareable dataset. What do you think?

frgfm · 2021-03-10T19:31:49Z

Any feedback @MateoLostanlen @blenzi @fe51 @Akilditu ? :)

MateoLostanlen · 2021-04-15T15:44:57Z

@Akilditu has sarted a Dataset Repo, I guess of these stuff should move there

frgfm · 2022-07-04T20:48:56Z

Closed by #136 & #138

frgfm added type: enhancement New feature or request module: datasets Related to datasets labels Jan 27, 2021

frgfm added this to the 0.1.1 milestone Jan 27, 2021

frgfm self-assigned this Jan 27, 2021

frgfm mentioned this issue Feb 3, 2021

refactor: Refactored pyrovision.datasets module #97

Merged

frgfm modified the milestones: 0.1.1, 0.1.2 Feb 28, 2021

MateoLostanlen mentioned this issue Feb 14, 2022

remove dataset creation related modules #136

Merged

frgfm modified the milestones: 0.1.2, 0.2.0 Jul 4, 2022

frgfm closed this as completed Jul 4, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[datasets] Module design suggestion #94

[datasets] Module design suggestion #94

frgfm commented Jan 27, 2021 •

edited

Loading

TekayaNidham commented Jan 27, 2021

frgfm commented Feb 7, 2021

frgfm commented Mar 10, 2021

MateoLostanlen commented Apr 15, 2021

frgfm commented Jul 4, 2022

[datasets] Module design suggestion #94

[datasets] Module design suggestion #94

Comments

frgfm commented Jan 27, 2021 • edited Loading

TekayaNidham commented Jan 27, 2021

frgfm commented Feb 7, 2021

frgfm commented Mar 10, 2021

MateoLostanlen commented Apr 15, 2021

frgfm commented Jul 4, 2022

frgfm commented Jan 27, 2021 •

edited

Loading