Implemented Var misuse dataset #3978
base: master
Conversation
for more information, see https://pre-commit.ci
```python
class VarMisuse(Dataset):
    r"""The VarMisuse dataset from the `"Learning to Represent Programs
    with Graphs." <https://arxiv.org/abs/1711.00740>`_ paper, consisting of
```
Suggested change:

```diff
- with Graphs." <https://arxiv.org/abs/1711.00740>`_ paper, consisting of
+ with Graphs" <https://arxiv.org/abs/1711.00740>`_ paper, consisting of
```
Will change that.
Changed.
```python
def get(self, idx):
    file_idx, data_idx = divmod(idx, 100)
    window, chunk = divmod(file_idx, 10)
    data_list = torch.load(
```
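For reference, the index arithmetic in the excerpt maps a flat sample index to a chunk file and an in-file offset. A minimal standalone sketch, assuming 100 samples per chunk file and 10 chunk files per window as in the snippet (the helper name `locate` is hypothetical):

```python
def locate(idx, samples_per_file=100, files_per_window=10):
    """Map a flat sample index to (window, chunk, data_idx),
    mirroring the two divmod calls in get()."""
    file_idx, data_idx = divmod(idx, samples_per_file)
    window, chunk = divmod(file_idx, files_per_window)
    return window, chunk, data_idx

# Example: sample 2345 lives in window 2, chunk 3, position 45.
print(locate(2345))  # (2, 3, 45)
```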
How big is the underlying dataset on disk? Can't we use InMemoryDataset here?
While I like the current approach of grouping multiple graphs together in one file, it looks like we are not really making use of that here: we load multiple graphs from disk, use one of them, and throw the remaining ones away.
Sadly, I currently have no good idea how to resolve this. Any thoughts?
@rusty1s The total sample counts are as follows:
Train: 1250 * 100 = 125,000
Valid: 214 * 100 = 21,400
Test: 586 * 100 = 58,600
The gzipped files take around ~3.5 GB for train, ~500 MB for valid, and ~1.5 GB for test.
With the current processing, each processed .pt file takes around 5 to 10 MB per chunk, i.e., per 100 samples. So storing all train samples in memory would take around 7.5 MB * 1250 ≈ 9.4 GB or more.
I thought that is too big to hold in memory; most of the other in-memory datasets are much smaller.
> Currently, we are loading multiple graphs from disk, use one of them, and throw the remaining ones away.
Maybe we can keep the last opened chunk file in memory and reuse it somehow, but that would only help for sequential access. I think samples will typically be loaded after shuffling, so that would not be of much help. 100 samples, or ~5 MB, did not seem too big a cost for accessing an individual sample, and with an SSD it likely will not be a bottleneck.
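The reuse idea above can be sketched as a dataset that caches the most recently loaded chunk. This is an illustration only, not part of the PR; the `load_fn` callable stands in for the real `torch.load(...)` call and path scheme:

```python
class ChunkCachedDataset:
    """Sketch: cache the most recently loaded chunk so that consecutive
    accesses to the same 100-sample chunk hit memory instead of disk.
    `load_fn(file_idx)` is a hypothetical stand-in for torch.load(...)."""

    def __init__(self, load_fn, samples_per_file=100):
        self.load_fn = load_fn
        self.samples_per_file = samples_per_file
        self._cached_idx = None   # index of the chunk currently in memory
        self._cached_list = None  # its list of samples

    def get(self, idx):
        file_idx, data_idx = divmod(idx, self.samples_per_file)
        if file_idx != self._cached_idx:
            # Cache miss: read the whole chunk once and keep it for reuse.
            self._cached_list = self.load_fn(file_idx)
            self._cached_idx = file_idx
        return self._cached_list[data_idx]
```

As noted, this only pays off for sequential access; shuffled access still reloads a chunk for almost every sample.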
Yes, shuffling is indeed a problem for this approach. I think 9 GB would still be a reasonable amount of data to hold in memory, but we can certainly think about adding a dataset interface in between, e.g., an InMemoryChunkDataset that can effectively re-use already loaded chunks.
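The proposed in-between interface could keep several recently used chunks resident, so even shuffled access reuses warm chunks. A sketch under stated assumptions: the class name follows the comment above but is not an existing PyG API, and `load_fn` and `max_chunks` are hypothetical:

```python
from collections import OrderedDict

class InMemoryChunkDataset:
    """Sketch of the proposed interface: hold up to `max_chunks` loaded
    chunks in an LRU cache so shuffled access can still hit warm chunks."""

    def __init__(self, load_fn, samples_per_file=100, max_chunks=8):
        self.load_fn = load_fn
        self.samples_per_file = samples_per_file
        self.max_chunks = max_chunks
        self._cache = OrderedDict()  # file_idx -> list of samples

    def get(self, idx):
        file_idx, data_idx = divmod(idx, self.samples_per_file)
        if file_idx in self._cache:
            self._cache.move_to_end(file_idx)  # mark as recently used
        else:
            self._cache[file_idx] = self.load_fn(file_idx)
            if len(self._cache) > self.max_chunks:
                self._cache.popitem(last=False)  # evict least recently used
        return self._cache[file_idx][data_idx]
```

With `max_chunks` sized so the cache fits in RAM, this trades a bounded memory footprint against disk reads, sitting between the fully-in-memory and load-per-access extremes discussed above.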
Thanks for the PR, and sorry for my super late review :) Left some comments.
No problem. I have an example implementation in the works; it works for a single sample/batch, but the weights go to 0 for a full batch. Would you mind taking a look at it?
Is there any work being done here?