Thank you for your awesome work! When using the proposed model in another downstream task, I found a potential bug in https://github.com/TUM-DAML/gemnet_pytorch/blob/master/gemnet/training/data_provider.py#L31-L49: it can cause the model to be evaluated on training data rather than on the intended validation/test subset.
For example, when split="val" the indices are [n_train : n_train + n_val]. Since shuffle=False, the index sampler is SequentialSampler(Subset(data_container, indices)). Note that this sampler yields indices in the range [0, n_val).
However,
super().__init__(
    data_container,  # here: the full dataset
    sampler=batch_sampler,
    collate_fn=lambda x: collate(x, data_container),
    pin_memory=True,  # load on CPU, push to GPU
    **kwargs
)
as shown in this code snippet, the dataset passed to the DataLoader is the full dataset. The DataLoader iterator then fetches samples according to the indices produced by the sampler. As illustrated above, those indices lie in the range [0, n_val), so the loader actually takes data from a subset of the training split rather than the validation split.
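To make the mismatch concrete, here is a minimal, self-contained sketch of the sampler behaviour; a plain Python list stands in for the GemNet DataContainer, and n_train, n_val and the batch size are made-up numbers:

from torch.utils.data import BatchSampler, SequentialSampler, Subset

n_train, n_val = 80, 20
full_data = list(range(n_train + n_val))             # items 0..79 "train", 80..99 "val"
val_indices = list(range(n_train, n_train + n_val))

idx_sampler = SequentialSampler(Subset(full_data, val_indices))
batch_sampler = BatchSampler(idx_sampler, batch_size=5, drop_last=False)

first_batch = next(iter(batch_sampler))
print(first_batch)                          # [0, 1, 2, 3, 4]
print([full_data[i] for i in first_batch])  # [0, 1, 2, 3, 4] -> training items, not 80..84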
I noticed that this problem is avoided in train.ipynb by using two separate data containers. However, this part may be confusing for users who want to generalize the model to other datasets.
I tried to fix it as follows:
import torch
from torch.utils.data import (
    BatchSampler,
    DataLoader,
    SequentialSampler,
    Subset,
    SubsetRandomSampler,
)


class CustomDataLoader(DataLoader):
    def __init__(
        self, data_container, batch_size, indices, shuffle, seed=None, **kwargs
    ):
        if shuffle:
            generator = torch.Generator()
            if seed is not None:
                generator.manual_seed(seed)
            # Random sampling: indices already point into the full data_container.
            idx_sampler = SubsetRandomSampler(indices, generator)
        else:
            # Sequential sampling over the subset yields indices 0, 1, 2, ...
            idx_sampler = SequentialSampler(Subset(data_container, indices))

        batch_sampler = BatchSampler(
            idx_sampler, batch_size=batch_size, drop_last=False
        )

        # Note: there would be a bug here if we did not use Subset.
        # The sequential sampler yields indices like (0, 1, 2, 3, ...),
        # but these indices are applied to the dataset passed to the DataLoader.
        # Without wrapping the full dataset in Subset, the loader would read
        # from the training split instead of the requested split.
        dataset = data_container if shuffle else Subset(data_container, indices)

        super().__init__(
            dataset,
            sampler=batch_sampler,
            collate_fn=data_container.collate_fn,
            pin_memory=True,  # load on CPU, push to GPU
            **kwargs
        )
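As a quick sanity check of the key line above (dataset = data_container if shuffle else Subset(data_container, indices)), here is the same toy setup read through the Subset, the way the fixed loader does when shuffle=False (again, a plain list stands in for the real DataContainer):

from torch.utils.data import BatchSampler, SequentialSampler, Subset

n_train, n_val = 80, 20
full_data = list(range(n_train + n_val))
val_indices = list(range(n_train, n_train + n_val))

val_subset = Subset(full_data, val_indices)
idx_sampler = SequentialSampler(val_subset)
batch_sampler = BatchSampler(idx_sampler, batch_size=5, drop_last=False)

first_batch = next(iter(batch_sampler))
print([val_subset[i] for i in first_batch])  # [80, 81, 82, 83, 84] -> validation items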