
Add SubSampler dataset #66

Merged
51 commits merged into pyronear:master on Sep 18, 2020

Conversation

MateoLostanlen (Member)

This PR introduces my wildfireSubSampler dataset. The goal of this dataset is to combine K frames from the same sequence. It makes it possible to train a modified version of ResNet18, the ssResNet18. This model passes the K frames through the convolutional part of the network independently, then assembles them in the fully connected part to predict whether there is fire or not. The idea is to add a kind of temporal information to improve the predictions.
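
A minimal PyTorch sketch of the idea described above (a hypothetical illustration, not the actual ssResNet18 code from this PR): a shared ResNet-18 backbone processes each of the K frames independently, and the pooled features are concatenated before a fully connected head predicts fire / no fire.

import torch
import torch.nn as nn
from torchvision.models import resnet18


class KFrameResNet(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, k_frames: int = 2):
        super().__init__()
        backbone = resnet18(pretrained=False)
        # Keep everything up to (and including) global average pooling, i.e. drop the final fc layer.
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        self.k_frames = k_frames
        # The K pooled feature vectors (512 each) are concatenated and classified jointly.
        self.head = nn.Linear(512 * k_frames, 1)

    def forward(self, x):
        # x has shape (batch, K, 3, H, W): run the shared backbone on each frame independently.
        feats = [self.features(x[:, i]).flatten(1) for i in range(self.k_frames)]
        return self.head(torch.cat(feats, dim=1))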

@MateoLostanlen MateoLostanlen requested a review from frgfm May 17, 2020 15:52
@MateoLostanlen MateoLostanlen added the type: enhancement New feature or request label May 17, 2020
@MateoLostanlen MateoLostanlen added this to In progress in PyroNear kanban via automation May 17, 2020
@MateoLostanlen MateoLostanlen added module: datasets Related to datasets module: models Related to models labels May 17, 2020
frgfm (Member) commented May 19, 2020

I'll wait for #63 to be merged to ensure the tests are running correctly! If I may, for the next PRs, try to make sure that they are not inter-dependent; that will make reviewing more straightforward 👍

MateoLostanlen (Member, Author)

OK, sorry about that!

codecov bot commented May 19, 2020

Codecov Report

Merging #66 into master will increase coverage by 1.04%.
The diff coverage is 95.45%.


@@            Coverage Diff             @@
##           master      #66      +/-   ##
==========================================
+ Coverage   85.21%   86.25%   +1.04%     
==========================================
  Files          20       21       +1     
  Lines         771      866      +95     
==========================================
+ Hits          657      747      +90     
- Misses        114      119       +5     
Flag Coverage Δ
#unittests 86.25% <95.45%> (+1.04%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
pyronear/datasets/wildfire/wildfire.py 97.22% <95.34%> (-1.14%) ⬇️
pyronear/datasets/wildfire/__init__.py 100.00% <100.00%> (ø)
pyronear/utils/collect_env.py 70.66% <0.00%> (-1.34%) ⬇️
pyronear/datasets/openfire.py 89.18% <0.00%> (-0.29%) ⬇️
pyronear/nn/__init__.py 100.00% <0.00%> (ø)
pyronear/models/utils.py 98.18% <0.00%> (ø)
pyronear/models/resnet.py 82.60% <0.00%> (ø)
pyronear/nn/functional.py 100.00% <0.00%> (ø)
pyronear/datasets/utils.py 91.66% <0.00%> (ø)
pyronear/utils/__init__.py 100.00% <0.00%> (ø)
... and 12 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 61e6410...b86ca35

frgfm (Member) left a comment

Thanks again! Would you mind opening a different PR for the model?

Outdated review threads (now resolved) on: pyronear/datasets/wildfire/split_strategy.py, pyronear/datasets/wildfire/wildfire.py, test/test_datasets_wildfire_split.py
PyroNear kanban automation moved this from In progress to Review in progress May 19, 2020
@MateoLostanlen MateoLostanlen removed the module: models Related to models label May 22, 2020
MateoLostanlen (Member, Author)

@x0s I have made the changes; let me know if it is OK for you.

@MateoLostanlen MateoLostanlen requested a review from x0s June 30, 2020 08:54
x0s previously requested changes Jun 30, 2020

x0s (Contributor) left a comment

Hi, thanks for adapting your code.

You seem to use a class where only a function is needed. There is actually no benefit in doing so, because it encourages implicit state. As you know from the Zen of Python:

Explicit is better than implicit.
Simple is better than complex.

I don't have time to review the algorithm (the subsampling part), but it would be great to have more comments explaining how it works.

Also, we could use a lighter fixture for the checked labels or metadata. It would be good to make it consistent with the already existing fixtures to avoid confusion.

Deterministic behaviour is really expected: we need to be able to reproduce our results so they are stable over time.

Also, please make the whole PR compliant with the PEP 8 naming conventions, as you said you would:

PS: OK for the PEP 8, I will change that

Using this package may help.
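
As a purely illustrative sketch of the function-over-class suggestion (the column name 'seq' and the signature are assumptions, not code from this PR), the subsampling can be expressed as a pure function of the metadata with no hidden state:

import pandas as pd


def subsample_sequences(metadata: pd.DataFrame, frame_per_seq: int, seed: int = 42) -> pd.DataFrame:
    # Keep `frame_per_seq` randomly chosen frames per sequence, reproducibly for a fixed seed.
    return (metadata
            .groupby('seq', group_keys=False)  # 'seq' is assumed to identify the video sequence
            .apply(lambda g: g.sample(n=min(frame_per_seq, len(g)), random_state=seed)))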

@@ -5,6 +5,8 @@
import numpy as np
import pandas as pd
import torch
from random import SystemRandom
x0s (Contributor):

Why are you using SystemRandom? It generates unreproducible sequences that also change depending on the system.

MateoLostanlen (Author):

When I used random I got this error in the Codacy/PR Quality Review: "Standard pseudo-random generators are not suitable for security/cryptographic purposes." So I switched to SystemRandom.

frgfm (Member):

I'm not familiar with SystemRandom, but if we only need a scalar random value, you can easily discard the Codacy warning. If there is indeed a lack of availability on some systems, as mentioned in the earlier reference, it might be safer to stick with a bare random.random().

Alternatively, this file already has a numpy dependency, and numpy.random has broad support!
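
A minimal sketch of the seeded numpy alternative, assuming only a few reproducible draws are needed:

import numpy as np

rng = np.random.default_rng(seed=42)            # deterministic for a fixed seed, portable across systems
p = rng.random()                                # scalar in [0, 1)
picks = rng.choice(10, size=2, replace=False)   # e.g. pick 2 frame indices out of 10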

frame_per_seq: int
frame per sequence to take

frame_per_seq: float
x0s (Contributor):

typo

MateoLostanlen (Author):

What is the issue here?

frgfm (Member):

I think he was referring to the duplicated parameter description of frame_per_seq!

MateoLostanlen (Author):

Thanks, it's fixed

frgfm (Member):

Not sure if we were talking about the same thing, but there are two mentions of frame_per_seq, and there is no mention of the seed argument in the constructor docstring. Mind fixing that up? :)
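
For reference, a hedged sketch of what the deduplicated Parameters section could look like (the wording is an assumption, not the merged docstring):

Parameters
----------
metadata: pandas.DataFrame
    Metadata describing the frames of each sequence
frame_per_seq: int
    Number of frames to keep per sequence
seed: int
    Seed of the random generator, fixed for reproducibility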

Example
-------
subsampler = WildFireSubSampler(dataset.metadata, 2)
dataset.metadata = subsampler.computeSubSet()
x0s (Contributor):

The metadata attribute is not expected to be modified this way.
It is much better to initialize a Dataset with the expected metadata instead of modifying the metadata attribute afterwards.
Remember that at init, a Dataset is bound to a metadata file and a path leading to the frames described in that metadata file.
Dodging init may create divergence and unexpected behaviour.

MateoLostanlen (Author):

OK, thanks; this is not here anymore in the latest update.
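
A minimal sketch of the initialization pattern recommended above (file names and the WildFireDataset constructor arguments are assumptions, not code from this PR):

import pandas as pd
from pyronear.datasets.wildfire import WildFireDataset, WildFireSubSampler

metadata = pd.read_csv('wildfire_dataset.csv')   # hypothetical metadata file
subsampled_metadata = WildFireSubSampler(metadata, 2).computeSubSet()

# Build the dataset from the subsampled metadata up front, instead of mutating
# dataset.metadata afterwards, so frames and metadata stay consistent from __init__.
dataset = WildFireDataset(metadata=subsampled_metadata, path_to_frames='frames/')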

self.metadata = metadata
self.metadata.index = np.arange(len(self.metadata))
self.imgs = self.metadata['imgFile']
self.SubSetImgs = []
x0s (Contributor):

Please follow PEP8

MateoLostanlen (Author):

You are saying that because the "s" in SubSetImgs should be lowercase, right?


self.assertEqual(len(wildfire), 1999)

subsampler = WildFireSubSampler(wildfire.metadata, 2)
x0s (Contributor):

The Subsampler takes a metadata file as input and outputs another metadata file.

Why not simply init a WildFireDataset with the new subsampled metadata file?

MateoLostanlen (Author):

You are right, I have changed that!

@@ -191,3 +193,97 @@ def set_splits(self, dataframes):
# Determine estimated(posterior) parameters
self.n_samples_ = {set_: len(self.splits[set_]) for set_ in ['train', 'val', 'test']}
self.ratios_ = {set_: (self.n_samples_[set_] / len(self.wildfire)) for set_ in ['train', 'val', 'test']}


class WildFireSubSampler:
x0s (Contributor):

Actually this subsampler is not bound to the WildFire Dataset and can be used with any wildfire video dataset annotated the same way. To help users understand what is done here, the class name could better reflect this possibility.

MateoLostanlen (Author):

Done


Parameters
---------
metadata: Pandas.DataFrame
x0s (Contributor):

It would be great to share the same interface as WildFireDataset.

MateoLostanlen (Author):

Not relevant anymore with the latest update.

frgfm (Member):

Since the files changed, is this still an issue?

meta2 = subsampler.computeSubSet()

self.assertEqual(len(meta2), 400)
self.assertFalse(wildfire.metadata['imgFile'].values.tolist() == meta2['imgFile'].values.tolist())
x0s (Contributor):

We definitely need reproducibility when dealing with random numbers. If it is not ensured, we cannot guarantee any result or stability.
If you don't want to be bound to a particular random generator, seeds can be hyperoptimized while keeping the predictor deterministic.

MateoLostanlen (Author):

Done !
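
A hedged sketch of the kind of reproducibility check being discussed, assuming the subsampler exposes the seed argument mentioned above and reusing the wildfire dataset from this test:

sub_a = WildFireSubSampler(wildfire.metadata, 2, seed=42).computeSubSet()
sub_b = WildFireSubSampler(wildfire.metadata, 2, seed=42).computeSubSet()
# Same seed, same subset: the result is stable across runs.
self.assertEqual(sub_a['imgFile'].tolist(), sub_b['imgFile'].tolist())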

from pyronear.datasets.wildfire import WildFireDataset, WildFireSubSampler


class WildFireDatasetSubSampler(unittest.TestCase):
x0s (Contributor):

Please append Tester to this class name so it doesn't collide with others.

MateoLostanlen (Author):

I'm sorry but I don't understand your comment :/

frgfm (Member):

I think he was suggesting appending Tester at the end of this class's name :) (so WildFireDatasetSubSamplerTester, or something like WildFireSubSamplerTester if you want a shorter name)

frgfm (Member):

Would you mind changing the class name so that we can close this thread? 🙏

self.SubSetImgs = []
self.probTh = probTh
self.frame_per_seq = frame_per_seq
# Define sequences numbers
x0s (Contributor):

Why are you re-extracting the fBase from the imgFile? Shouldn't it already be present, since the previous step was frame extraction?

MateoLostanlen (Author):

Yes, you are right, sorry about that.

frgfm (Member) left a comment

Apart from the docstring to fix, the PR looks alright, almost there! I still have concerns about adding new .csv content files each time we want to add a new feature and its unit tests, though.

frame_per_seq: int
frame per sequence to take

frame_per_seq: float
frgfm (Member):

Not sure if we were talking about the same thing, but there are two mentions of frame_per_seq, and there is no mention of the seed argument in the constructor docstring. Mind fixing that up? :)

from pyronear.datasets.wildfire import WildFireDataset, WildFireSubSampler


class WildFireDatasetSubSampler(unittest.TestCase):
frgfm (Member):

Would you mind changing the class name so that we can close this thread? 🙏

MateoLostanlen (Member, Author)

> Apart from the docstring to fix, the PR looks alright, almost there! I still have concerns about adding new .csv content files each time we want to add a new feature and its unit tests, though.

We do not need a new csv for each feature, only a csv complete enough to test all cases. I suggest opening an issue on this subject and creating a PR soon to solve it.

I just made a commit to change the name of the test class and to remove the error in the doc. Can we validate the PR?

frgfm (Member) commented Sep 17, 2020

> Apart from the docstring to fix, the PR looks alright, almost there! I still have concerns about adding new .csv content files each time we want to add a new feature and its unit tests, though.
>
> We do not need a new csv for each feature, only a csv complete enough to test all cases. I suggest opening an issue on this subject and creating a PR soon to solve it.
>
> I just made a commit to change the name of the test class and to remove the error in the doc. Can we validate the PR?

Just missing the seed argument in the docstring 🙏
Well, it's far from being one .csv for all features. In this PR, for instance, we introduce a 2,000-line content file for testing. Let's discuss this in an issue as you mentioned; it will be easier then!

frgfm previously approved these changes Sep 17, 2020
frgfm (Member) left a comment

Looks good to me! I'll open an issue later for fixture & content files addition in PR, thanks!

frgfm (Member) left a comment

Looks good to me!

PyroNear kanban automation moved this from Review in progress to Reviewer approved Sep 18, 2020
@MateoLostanlen MateoLostanlen merged commit 8116724 into pyronear:master Sep 18, 2020
PyroNear kanban automation moved this from Reviewer approved to Done Sep 18, 2020
@MateoLostanlen MateoLostanlen deleted the addSubSamplerDataSet branch January 16, 2021 09:48
Labels
module: datasets (Related to datasets), type: enhancement (New feature or request)

3 participants