Implemented Var misuse dataset #3978
base: master
Conversation
for more information, see https://pre-commit.ci
```python
class VarMisuse(Dataset):
    r"""The VarMisuse dataset from the `"Learning to Represent Programs
    with Graphs." <https://arxiv.org/abs/1711.00740>`_ paper, consisting of
```
Suggested change:

```diff
- with Graphs." <https://arxiv.org/abs/1711.00740>`_ paper, consisting of
+ with Graphs" <https://arxiv.org/abs/1711.00740>`_ paper, consisting of
```
Will change that.
Changed.
```python
def get(self, idx):
    file_idx, data_idx = divmod(idx, 100)
    window, chunk = divmod(file_idx, 10)
    data_list = torch.load(
```
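For reference, the index arithmetic in the excerpt maps a flat sample index to a chunk file and an in-file offset. A minimal standalone sketch, assuming 100 samples per chunk file and 10 chunk files per window as in the snippet (the helper name `locate` is hypothetical):

```python
def locate(idx, samples_per_file=100, files_per_window=10):
    """Map a flat sample index to (window, chunk, data_idx),
    mirroring the two divmod calls in get()."""
    file_idx, data_idx = divmod(idx, samples_per_file)
    window, chunk = divmod(file_idx, files_per_window)
    return window, chunk, data_idx

# Example: sample 2345 lives in window 2, chunk 3, position 45.
print(locate(2345))  # (2, 3, 45)
```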
How big is the underlying dataset on disk? Can't we use InMemoryDataset here?
While I like the current approach of grouping multiple graphs together in one file, it looks like we are not really making use of that here: we load multiple graphs from disk, use one of them, and throw the remaining ones away.
Sadly, I currently have no good idea how to resolve this. Any thoughts?
@rusty1s The total sample counts are as follows:
Train: 1250 * 100 = 125,000
Valid: 214 * 100 = 21,400
Test: 586 * 100 = 58,600
The gzipped files take around ~3.5 GB for train, ~500 MB for valid, and ~1.5 GB for test.
With the current processing, each processed .pt file takes around 5 to 10 MB per chunk, i.e., per 100 samples. So storing all train samples in memory would take around 7.5 MB * 1250 ≈ 9.4 GB or more.
I thought that is too big to hold in memory; most of the other in-memory datasets are much smaller.
> Currently, we are loading multiple graphs from disk, use one of them, and throw the remaining ones away.
Maybe we can keep the last opened chunk file in memory and reuse it somehow, but that would only help for sequential access. I think samples will typically be loaded after shuffling, so that would not be of much help. 100 samples, or ~5 MB, did not seem too big a cost for accessing an individual sample, and with an SSD it likely will not be a bottleneck.
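The reuse idea above can be sketched as a dataset that caches the most recently loaded chunk. This is an illustration only, not part of the PR; the `load_fn` callable stands in for the real `torch.load(...)` call and path scheme:

```python
class ChunkCachedDataset:
    """Sketch: cache the most recently loaded chunk so that consecutive
    accesses to the same 100-sample chunk hit memory instead of disk.
    `load_fn(file_idx)` is a hypothetical stand-in for torch.load(...)."""

    def __init__(self, load_fn, samples_per_file=100):
        self.load_fn = load_fn
        self.samples_per_file = samples_per_file
        self._cached_idx = None   # index of the chunk currently in memory
        self._cached_list = None  # its list of samples

    def get(self, idx):
        file_idx, data_idx = divmod(idx, self.samples_per_file)
        if file_idx != self._cached_idx:
            # Cache miss: read the whole chunk once and keep it for reuse.
            self._cached_list = self.load_fn(file_idx)
            self._cached_idx = file_idx
        return self._cached_list[data_idx]
```

As noted, this only pays off for sequential access; shuffled access still reloads a chunk for almost every sample.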
Yes, shuffling is indeed a problem for this approach. I think 9 GB would still be a reasonable amount of data to hold in memory, but we can certainly think about adding a dataset interface in between, e.g., an InMemoryChunkDataset that can effectively re-use already loaded chunks.
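The proposed in-between interface could keep several recently used chunks resident, so even shuffled access reuses warm chunks. A sketch under stated assumptions: the class name follows the comment above but is not an existing PyG API, and `load_fn` and `max_chunks` are hypothetical:

```python
from collections import OrderedDict

class InMemoryChunkDataset:
    """Sketch of the proposed interface: hold up to `max_chunks` loaded
    chunks in an LRU cache so shuffled access can still hit warm chunks."""

    def __init__(self, load_fn, samples_per_file=100, max_chunks=8):
        self.load_fn = load_fn
        self.samples_per_file = samples_per_file
        self.max_chunks = max_chunks
        self._cache = OrderedDict()  # file_idx -> list of samples

    def get(self, idx):
        file_idx, data_idx = divmod(idx, self.samples_per_file)
        if file_idx in self._cache:
            self._cache.move_to_end(file_idx)  # mark as recently used
        else:
            self._cache[file_idx] = self.load_fn(file_idx)
            if len(self._cache) > self.max_chunks:
                self._cache.popitem(last=False)  # evict least recently used
        return self._cache[file_idx][data_idx]
```

With `max_chunks` sized so the cache fits in RAM, this trades a bounded memory footprint against disk reads, sitting between the fully-in-memory and load-per-access extremes discussed above.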
Thanks for the PR, and sorry for my super late review :) Left some comments.
No problem. I have an example implementation in the works; it works for a single sample/batch, but the weights go to 0 for a full batch. Would you mind taking a look at it?
Is there any work being done here?