
Add class subset selection #151

Merged: 20 commits into main from class-subsampling, Jun 8, 2021
Conversation

mustafa1728 (Contributor) commented Jun 4, 2021:

Associated with #149 (comment).

Description

Adding class subset selection to core kale API.
This will allow datasets to easily select only a subset of classes for training, validation and testing.

API update description

Adding a class_ids=[id_0, id_1, ...] parameter to the relevant dataset initialisations makes them use only the class-subset data. If the parameter is omitted or set to None, the dataset uses all classes.
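As an illustration of the intended behaviour, here is a hedged sketch using a toy torch dataset (the toy data and variable names are made up for illustration; the filtering mirrors the torch.utils.data.Subset approach used in this PR):

```python
import torch
from torch.utils.data import TensorDataset, Subset

# Toy dataset: one sample per class, labels 0..9.
dataset = TensorDataset(torch.zeros(10, 1), torch.arange(10))

class_ids = [0, 3, 7]  # select only these classes

if class_ids is None:
    selected = dataset  # parameter omitted or None: use all classes
else:
    # Keep only the samples whose label is in class_ids.
    indices = [i for i in range(len(dataset)) if dataset[i][1] in class_ids]
    selected = Subset(dataset, indices)

# selected now contains 3 of the 10 samples
```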

Status

Ready

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • In-line docstrings updated.

@mustafa1728 mustafa1728 added work-in-progress Work in progress that should NOT be merged new feature New feature/module (including request) labels Jun 4, 2021
@mustafa1728 mustafa1728 self-assigned this Jun 4, 2021
codecov-commenter commented Jun 4, 2021:

Codecov Report

Merging #151 (2a230eb) into main (8177ad9) will decrease coverage by 0.09%.
The diff coverage is 48.57%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #151      +/-   ##
==========================================
- Coverage   88.18%   88.08%   -0.10%     
==========================================
  Files          44       44              
  Lines        4122     4156      +34     
==========================================
+ Hits         3635     3661      +26     
- Misses        487      495       +8     
Impacted Files Coverage Δ
kale/loaddata/video_multi_domain.py 74.80% <40.00%> (-0.90%) ⬇️
kale/loaddata/multi_domain.py 87.80% <50.00%> (-0.59%) ⬇️
kale/loaddata/dataset_access.py 89.47% <100.00%> (+1.97%) ⬆️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8177ad9...2a230eb. Read the comment docs.

@mustafa1728 mustafa1728 marked this pull request as draft June 4, 2021 04:38
@mustafa1728 mustafa1728 changed the title from "Class subsampling" to "Add Class subsampling" Jun 5, 2021
return test_dataset
else:
sub_indices = [i for i in range(0, len(test_dataset)) if test_dataset[i][1] in class_ids]
return torch.utils.data.Subset(test_dataset, sub_indices)
Member:

Lines 72-76 are largely a repetition of lines 39-43. Is it possible and worthwhile to reduce this repetition (e.g. define those lines as a function taking the ids and the dataset)?

Contributor Author:

Yes, I feel that is definitely worthwhile. I am thinking of a private _get_subset function? Also, I wanted to ask whether this would be alright, since this function will appear in the Read the Docs pages as well.

Member:

The code will be reviewed and can be updated, so make whatever changes seem best to you and we will review them. When unsure, check how PyTorch implements something similar and learn from that.

)
dataset_subsampled.prepare_data_loaders()

assert len(dataset_subsampled) <= len(dataset)
Member:

You can have stronger assertions here. Since these digit datasets have the same number of samples for each class, we should have len(dataset_subsampled) == 0.3 * len(dataset), as you take 3 out of 10 classes, right?
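The stronger assertion can be sketched with a toy balanced dataset (the sizes, names, and filtering code below are illustrative, not the actual digits datasets):

```python
import torch
from torch.utils.data import TensorDataset, Subset

n_per_class = 5
labels = torch.arange(10).repeat_interleave(n_per_class)  # 10 balanced classes
dataset = TensorDataset(torch.zeros(len(labels), 1), labels)

class_ids = [0, 4, 9]  # take 3 of the 10 classes
sub_indices = [i for i in range(len(dataset)) if dataset[i][1] in class_ids]
dataset_subsampled = Subset(dataset, sub_indices)

# Weak assertion (current test): the subset is no larger than the dataset.
assert len(dataset_subsampled) <= len(dataset)
# Stronger assertion (suggested): with balanced classes the size is exact.
assert len(dataset_subsampled) == len(class_ids) * n_per_class  # 0.3 * len(dataset)
```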

assert len(source_train) <= len(source.get_train())
assert len(source_test) <= len(source.get_test())

assert isinstance(source_train, torch.utils.data.Dataset)
haipinglu (Member) commented Jun 6, 2021:

These type assertions are less useful, right? We used similar assertions when we did not have stronger ones implemented. In your case, the assertions above can be made stronger. Consider removing these.

Member:

Agree with Haiping. In addition, most of my assertions are too general and simple and only check parameter types, because we wanted to achieve high coverage as a first step. From now on, however, we are going to improve our tests by replacing some assertions with stronger ones. So in your test, you can explore assertions that check more detail than the current code does.

@@ -51,3 +53,38 @@ def test_get_train_test(dataset_name, download_path):
assert source.n_classes() == 10
assert isinstance(source_train, torch.utils.data.Dataset)
assert isinstance(source_test, torch.utils.data.Dataset)


@pytest.mark.parametrize("dataset_name", ALL)
Member:

It may not be necessary to loop through all variations, if they were tested elsewhere. We can talk tomorrow.

if dataset_subsampled.flow:
dataset_subsampled._flow_source_by_split = {"train": subsampled_train_val[0]}
dataset_subsampled._flow_target_by_split = {"train": subsampled_train_val[0]}
assert len(dataset_subsampled) <= len(dataset)
Member:

See the comments for digits. Stronger versions are possible.

assert isinstance(subsampled_train_val[1], torch.utils.data.Dataset)

assert len(subsampled_train_val[0]) <= len(train_val[0])
assert len(subsampled_train_val[1]) <= len(train_val[1])
Member:

See the comments for digits. Stronger versions are possible.

haipinglu (Member) left a review comment:

Good job. Please see the comments. If any further clarification is needed, we can discuss tomorrow (Monday) and/or Tuesday.
You can mark it as Ready for review (bottom of PR) now.

@xianyuanliu Please take a look too before we meet on Monday.

@mustafa1728 mustafa1728 marked this pull request as ready for review June 6, 2021 10:42
xianyuanliu (Member) left a review comment:

Well done. The code needs some reorganizing; if you have other ideas, please comment.

)

logging.debug("Load source Test")
self._source_by_split["test"] = self._source_access.get_test()
self._source_by_split["test"] = self._source_access.get_test_class_subset(self.class_ids)
Member:

We may keep get_train_val() and get_test() as the main entry points because they are clearer to understand than get_test_class_subset(). We can set a flag (if...else...) in get_train() and get_test() to trigger the process of getting samples of specific classes from the dataset.

Contributor Author:

I think the issue with this is that get_train() and get_test() are redefined by the child classes, so similar flags would have to be handled by each child class separately. get_train_val() is not redefined, so I have added flags inside it.

mustafa1728 (Contributor Author) commented Jun 6, 2021:

I think another option could be to have the child classes redefine a different function (maybe _get_train() private or get_train_all()) and call this inside get_train() with class_id flags?

Member:

I think another option could be to have the child classes redefine a different function (maybe _get_train() private or get_train_all()) and call this inside get_train() with class_id flags?

That would be a good idea, if feasible.
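That option might be sketched as follows (class and method names other than get_train() are hypothetical; the real DatasetAccess has more machinery than shown here):

```python
import torch
from torch.utils.data import TensorDataset, Subset

class DatasetAccess:
    """Sketch: the base class keeps the class_ids flag logic in get_train(),
    so child classes only redefine the private _get_train()."""

    def __init__(self, class_ids=None):
        self._class_ids = class_ids

    def _get_train(self):
        # Child classes redefine this instead of get_train().
        raise NotImplementedError

    def get_train(self):
        dataset = self._get_train()
        if self._class_ids is None:
            return dataset  # no flag: use all classes
        sub_indices = [i for i in range(len(dataset)) if dataset[i][1] in self._class_ids]
        return Subset(dataset, sub_indices)

class ToyAccess(DatasetAccess):
    """Hypothetical child class: one sample per class, labels 0..9."""

    def _get_train(self):
        return TensorDataset(torch.zeros(10, 1), torch.arange(10))
```

For example, ToyAccess(class_ids=[1, 2]).get_train() would yield a 2-sample subset, while ToyAccess().get_train() would return all 10 samples.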

logging.debug("Load target Test")
self._target_by_split["test"] = self._target_access.get_test()
self._target_by_split["test"] = self._target_access.get_test_class_subset(self.class_ids)
Member:

Like above.

Returns:
Dataset: a torch.utils.data.Dataset
"""
train_dataset = self.get_train()
train_dataset = self.get_train_class_subset(class_ids)
Member:

We could pass a class_ids flag to get_train() directly, without changing the self.get_train() call here.


mustafa1728 (Contributor Author):

A simpler solution could be to keep a single separate function for getting class subsets and to use this function in MultiDomainDatasets and VideoMultiDomainDatasets. The advantage is that the base DatasetAccess class is not modified, and the code becomes a bit easier to understand as well. The last two commits are in this direction; the separate function is in kale.utils.class_subset.
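A minimal sketch of such a standalone helper, consistent with how get_class_subset is called later in this thread (the actual implementation in kale.utils.class_subset may differ):

```python
import torch
from torch.utils.data import Subset

def get_class_subset(dataset, class_ids):
    """Return a Subset of `dataset` keeping only the samples whose label
    (the second element of each item) is in `class_ids`."""
    sub_indices = [i for i in range(len(dataset)) if dataset[i][1] in class_ids]
    return Subset(dataset, sub_indices)
```

Because it takes a plain torch dataset, the same helper can serve both the image and flow branches of the multi-domain loaders.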

xianyuanliu (Member) left a review comment:

Good job!

It would be better to put get_class_subset in the DatasetAccess class if possible; I think it will be easier for people to find it there. get_class_subset is also an enhancement of DatasetAccess.

self._flow_source_by_split["test"] = self._source_access_dict["flow"].get_test()
self._flow_target_by_split["test"] = self._target_access_dict["flow"].get_test()
if self.class_ids is not None:
self._flow_source_by_split["test"] = get_class_subset(
Member:

Could we put this function in the DatasetAccess? Is it feasible?
For example,
self._flow_source_by_split["test"] = self._flow_source_by_split["test"].get_class_subset(self.class_ids)

Contributor Author:

Yes agreed, that will be better. Thanks, I'll update it.

Contributor Author:

Ohh, so it may not be feasible as is, since self._flow_source_by_split["test"] is a plain torch.utils.data.Dataset object and not a DatasetAccess object.

The function could be put inside the DatasetAccess class as a static method, so it would be used like: dataset = DatasetAccess.get_class_subset(dataset, class_ids). Would this be appropriate?

Member:

Oh, I overlooked the type. No need to change it then; the static-method form seems a little weird :) The current one is okay, but kale/utils/class_subset.py only has a few lines. Do we have anywhere else to put this function? We can keep the current one for now and discuss with Haiping at the meeting. Thanks!

xianyuanliu (Member) left a review comment:

Well done! Thanks!


@pytest.mark.parametrize("class_subset", CLASS_SUBSETS)
@pytest.mark.parametrize("val_ratio", VAL_RATIO)
def test_class_subsampling(class_subset, val_ratio, download_path):
Member:

Try to avoid using "subsampling".

@haipinglu haipinglu enabled auto-merge June 8, 2021 14:04
@haipinglu haipinglu merged commit 136ef5f into main Jun 8, 2021
@haipinglu haipinglu deleted the class-subsampling branch June 8, 2021 15:57
@bobturneruk bobturneruk mentioned this pull request Jun 8, 2021
1 task
@github-actions github-actions bot mentioned this pull request Jun 21, 2021
1 task
@mustafa1728 mustafa1728 changed the title from "Add Class subsampling" to "Add class subset selection" Jun 21, 2021
4 participants