
A potential bug in data_provider.py #4

Closed · gyfastas opened this issue Jan 11, 2022 · 1 comment · Fixed by #5

gyfastas commented Jan 11, 2022

Thank you for your awesome work! While using the proposed model in another downstream task, I found a potential bug in https://github.com/TUM-DAML/gemnet_pytorch/blob/master/gemnet/training/data_provider.py#L31-L49: it can cause the model to be evaluated on the training data rather than on the intended validation/test subset.

For example, when using split="val", the indices are [n_train : n_train + n_val]. Since shuffle=False, the index sampler is SequentialSampler(Subset(data_container, indices)). Note that this sampler produces indices in the range [0, n_val).

However, the DataLoader is constructed as:

        super().__init__(
            data_container, ## here is the full set.
            sampler=batch_sampler,
            collate_fn=lambda x: collate(x, data_container),
            pin_memory=True,  # load on CPU push to GPU
            **kwargs
        )

As shown in this snippet, the "dataset" passed to the DataLoader is the full dataset. The data loader's iterator then fetches samples according to the indices produced by the sampler. As illustrated above, those indices lie in [0, n_val), so the loader actually draws data from the beginning of the training part.
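
To make the mismatch concrete, here is a minimal, self-contained sketch (using a toy TensorDataset rather than the GemNet DataContainer, so the names are illustrative only). It shows that a SequentialSampler over a Subset yields local indices, which then select training items when applied to the full dataset:

import torch
from torch.utils.data import SequentialSampler, Subset, TensorDataset

full = TensorDataset(torch.arange(10))   # items 0..9; pretend 0..7 are "train"
val_indices = [8, 9]                     # ... and 8..9 are "val"

sampler = SequentialSampler(Subset(full, val_indices))
print(list(sampler))                             # [0, 1] -- local indices, not [8, 9]

# Applying these local indices to the *full* dataset returns training items:
print([full[i][0].item() for i in sampler])      # [0, 1]

# Applying them to the Subset instead returns the intended validation items:
val_subset = Subset(full, val_indices)
print([val_subset[i][0].item() for i in sampler])  # [8, 9]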

I noticed that this problem is avoided in train.ipynb by using two separate data containers. However, this detail may be confusing for users who want to apply the model to other datasets.

I tried to fix it as follows:

import torch
from torch.utils.data import (
    BatchSampler,
    DataLoader,
    SequentialSampler,
    Subset,
    SubsetRandomSampler,
)


class CustomDataLoader(DataLoader):
    def __init__(
        self, data_container, batch_size, indices, shuffle, seed=None, **kwargs
    ):
        if shuffle:
            generator = torch.Generator()
            if seed is not None:
                generator.manual_seed(seed)
            idx_sampler = SubsetRandomSampler(indices, generator=generator)
        else:
            idx_sampler = SequentialSampler(Subset(data_container, indices))

        batch_sampler = BatchSampler(
            idx_sampler, batch_size=batch_size, drop_last=False
        )

        # Note: there is a bug here if we do not use a Subset.
        # A SequentialSampler over the Subset yields local indices (0, 1, 2, ...),
        # but those indices are applied to the dataset passed below.
        # Without wrapping the container in a Subset, the loader would read
        # from the training part of the full dataset.
        dataset = data_container if shuffle else Subset(data_container, indices)

        super().__init__(
            dataset,
            sampler=batch_sampler,
            collate_fn=data_container.collate_fn,
            pin_memory=True,  # load on CPU, push to GPU
            **kwargs
        )
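
This keeps the original pattern of passing the BatchSampler via sampler= (so each "index" handed to the dataset is a whole list of indices) and only changes which dataset those indices are resolved against: when shuffle=False, the container is wrapped in a Subset so the sampler's local indices map back to the intended split. It assumes, as in the original loader, that the dataset supports list indexing, and additionally that data_container exposes a collate_fn.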

gasteigerjo (Contributor) commented

Thank you for highlighting this! I've set up a PR that should fix this issue and make the code a bit cleaner. We'll verify and merge it soon.
