
Implemented Var misuse dataset #3978

Open

amitamb wants to merge 25 commits into master
Conversation

@amitamb commented Jan 30, 2022

No description provided.


class VarMisuse(Dataset):
    r"""The VarMisuse dataset from the `"Learning to Represent Programs
    with Graphs." <https://arxiv.org/abs/1711.00740>`_ paper, consisting of
Member:

Suggested change:
-     with Graphs." <https://arxiv.org/abs/1711.00740>`_ paper, consisting of
+     with Graphs" <https://arxiv.org/abs/1711.00740>`_ paper, consisting of

Author:

Will change that.

Author:

Changed.

def get(self, idx):
    # Each processed file stores 100 samples; map the global index to a
    # (file, within-file) pair, then locate the file's window and chunk.
    file_idx, data_idx = divmod(idx, 100)
    window, chunk = divmod(file_idx, 10)
    data_list = torch.load(
Member:

How big is the underlying dataset on disk? Can't we use InMemoryDataset here?

While I like the current approach of grouping multiple graphs together in one file, it looks like we are not really making use of it here: we load multiple graphs from disk, use one of them, and throw the remaining ones away.

Sadly, I currently have no good idea how to resolve this. Any thoughts?

Author:

@rusty1s The total sample counts are as follows:

Train: 1250 * 100 = 125000
Valid: 214 * 100 = 21400
Test: 586 * 100 = 58600

The total space taken by the gzipped files is around ~3.5GB for train, ~500MB for valid, and ~1.5GB for test.

The way I am processing it at present, each processed .pt file takes around 5 to 10MB per chunk, i.e., per 100 samples. So storing all train samples in memory would take around 7.5MB * 1250 ≈ 9.375GB or more.

I thought that is too big to store in memory; most of the other in-memory datasets are much smaller.

> Currently, we load multiple graphs from disk, use one of them, and throw the remaining ones away.

Maybe we could keep the last opened chunk file around and reuse it somehow, but that would only help for sequential access, and since samples are typically loaded after shuffling, it would not be of much help. I thought a chunk of 100 samples, or ~5MB, is not too big a price for accessing an individual sample, and with an SSD it likely will not be a bottleneck.

Member:

Yes, shuffling is indeed a problem for this approach. I think 9GB would still be a reasonable amount of data to hold in memory, but we can certainly think about adding a dataset interface in between, e.g., an InMemoryChunkDataset that can effectively re-use already loaded chunks.
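
To make the idea concrete, here is a minimal sketch of what such an InMemoryChunkDataset could look like, keeping the most recently used chunks in memory with simple LRU eviction. The names chunk_{i}.pt, num_chunks, and max_cached_chunks are illustrative assumptions rather than part of the PyG API, and the usual raw_file_names/processed_file_names/download/process hooks are omitted for brevity:

from collections import OrderedDict
import os.path as osp

import torch
from torch_geometric.data import Dataset


class InMemoryChunkDataset(Dataset):
    # Sketch only: the usual `raw_file_names`/`processed_file_names`/
    # `download`/`process` hooks are omitted.
    chunk_size = 100        # samples per processed `.pt` file, as in this PR
    max_cached_chunks = 8   # ~40-80MB at 5-10MB per chunk

    def __init__(self, root, num_chunks, transform=None, pre_transform=None):
        self.num_chunks = num_chunks  # hypothetical: total number of chunk files
        super().__init__(root, transform, pre_transform)
        self._cache = OrderedDict()  # chunk_idx -> list of `Data` objects

    def _load_chunk(self, chunk_idx):
        if chunk_idx in self._cache:
            self._cache.move_to_end(chunk_idx)  # mark as most recently used
        else:
            if len(self._cache) >= self.max_cached_chunks:
                self._cache.popitem(last=False)  # evict least recently used
            path = osp.join(self.processed_dir, f'chunk_{chunk_idx}.pt')
            self._cache[chunk_idx] = torch.load(path)
        return self._cache[chunk_idx]

    def len(self):
        return self.num_chunks * self.chunk_size

    def get(self, idx):
        chunk_idx, data_idx = divmod(idx, self.chunk_size)
        return self._load_chunk(chunk_idx)[data_idx]

With sequential or buffered-shuffle access, most get() calls would then hit an already loaded chunk instead of reloading a ~5-10MB file from disk.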

@rusty1s commented Feb 8, 2022

Thanks for the PR and sorry for my super late review :) Left some comments.

@rusty1s linked an issue on Feb 8, 2022 that may be closed by this pull request.
@amitamb commented Feb 14, 2022

No problem. I have an example implementation in the works; it is working for a single sample/batch, but the weights go to 0 for a full batch. Would you mind taking a look at it?

@tsadigovAgmail commented
Is there any work being done here?


Successfully merging this pull request may close these issues.

VarMisuse Dataset + Examples