Thank you for your awesome work! When using the proposed model in another downstream task, I found a potential bug in https://github.com/TUM-DAML/gemnet_pytorch/blob/master/gemnet/training/data_provider.py#L31-L49: it can cause the model to be evaluated on training data rather than on the intended validation/test subset.
For example, when split="val" the indices are [n_train : n_train + n_val]. Since shuffle=False, the index sampler is SequentialSampler(Subset(data_container, indices)). Note that this sampler yields indices in the range [0, n_val).
However,
super().__init__(
    data_container,  # here: the full dataset
    sampler=batch_sampler,
    collate_fn=lambda x: collate(x, data_container),
    pin_memory=True,  # load on CPU, push to GPU
    **kwargs
)
as shown in this code snippet, the dataset passed to the DataLoader is the full dataset. The DataLoader iterator then fetches samples according to the indices produced by the sampler. As illustrated above, those indices lie in the range [0, n_val), so the loader actually takes data from a subset of the training split rather than the validation split.
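To make the mismatch concrete, here is a minimal, self-contained sketch of the sampler behaviour; a plain Python list stands in for the GemNet DataContainer, and n_train, n_val and the batch size are made-up numbers:

from torch.utils.data import BatchSampler, SequentialSampler, Subset

n_train, n_val = 80, 20
full_data = list(range(n_train + n_val))             # items 0..79 "train", 80..99 "val"
val_indices = list(range(n_train, n_train + n_val))

idx_sampler = SequentialSampler(Subset(full_data, val_indices))
batch_sampler = BatchSampler(idx_sampler, batch_size=5, drop_last=False)

first_batch = next(iter(batch_sampler))
print(first_batch)                          # [0, 1, 2, 3, 4]
print([full_data[i] for i in first_batch])  # [0, 1, 2, 3, 4] -> training items, not 80..84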
I noticed that this problem is avoided in train.ipynb by using two separate data containers. However, this part may be confusing for users who want to generalize the model to other datasets.
I tried to fix it as follows:
import torch
from torch.utils.data import (
    BatchSampler,
    DataLoader,
    SequentialSampler,
    Subset,
    SubsetRandomSampler,
)


class CustomDataLoader(DataLoader):
    def __init__(
        self, data_container, batch_size, indices, shuffle, seed=None, **kwargs
    ):
        if shuffle:
            generator = torch.Generator()
            if seed is not None:
                generator.manual_seed(seed)
            # Random sampling: indices already point into the full data_container.
            idx_sampler = SubsetRandomSampler(indices, generator)
        else:
            # Sequential sampling over the subset yields indices 0, 1, 2, ...
            idx_sampler = SequentialSampler(Subset(data_container, indices))

        batch_sampler = BatchSampler(
            idx_sampler, batch_size=batch_size, drop_last=False
        )

        # Note: there would be a bug here if we did not use Subset.
        # The sequential sampler yields indices like (0, 1, 2, 3, ...),
        # but these indices are applied to the dataset passed to the DataLoader.
        # Without wrapping the full dataset in Subset, the loader would read
        # from the training split instead of the requested split.
        dataset = data_container if shuffle else Subset(data_container, indices)

        super().__init__(
            dataset,
            sampler=batch_sampler,
            collate_fn=data_container.collate_fn,
            pin_memory=True,  # load on CPU, push to GPU
            **kwargs
        )
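As a quick sanity check of the key line above (dataset = data_container if shuffle else Subset(data_container, indices)), here is the same toy setup read through the Subset, the way the fixed loader does when shuffle=False (again, a plain list stands in for the real DataContainer):

from torch.utils.data import BatchSampler, SequentialSampler, Subset

n_train, n_val = 80, 20
full_data = list(range(n_train + n_val))
val_indices = list(range(n_train, n_train + n_val))

val_subset = Subset(full_data, val_indices)
idx_sampler = SequentialSampler(val_subset)
batch_sampler = BatchSampler(idx_sampler, batch_size=5, drop_last=False)

first_batch = next(iter(batch_sampler))
print([val_subset[i] for i in first_batch])  # [80, 81, 82, 83, 84] -> validation items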