Handle node-level and edge-level temporal information when generating partitions #8718

kgajdamo · 2024-01-04T10:25:02Z

Description

Temporal data definition:
Time data can be added to the FeatureStore in two ways:

it can be loaded directly from the partition with the LocalFeatureStore.from_partition() function
or
it can be added manually by calling put_tensor() on the LocalFeatureStore object.
Node-level temporal data: every partition must hold the same time vector, which is global; its size equals the number of nodes in the whole graph.
Why:
We operate on global node IDs.
Edge-level temporal data: each partition has its own time vector, which is local to that partition; its size equals the number of edges in that partition (part_data.edge_index.size(1)).
Why:
Each partition has its own unique edge_index in COO format, which the neighbor sampler later converts to a CSR/CSC matrix. Global edge IDs are therefore not available during sampling, so the time of a specific edge could not be looked up in a global vector. The edge-level time information must be local.
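The size rules above can be illustrated with a small, dependency-free sketch. It uses plain Python lists instead of tensors, and the toy graph, the dictionary layout, and the helper names (`check_time_shapes`, `lookup_times`) are assumptions made for illustration; none of them are part of the PyG API.

```python
# Toy graph: 6 nodes and 5 edges split into 2 partitions (all values are made up).
num_nodes = 6

# Node-level time is global: one entry per node of the whole graph,
# indexed by global node ID. Every partition holds this same vector.
node_time = [10, 11, 12, 13, 14, 15]

# Each partition keeps its own COO edge list (rows of global node IDs) and a
# local edge-time list aligned position by position with that edge list.
partitions = [
    {"edge_index": ([0, 1, 2], [1, 2, 0]), "edge_time": [100, 101, 102]},
    {"edge_index": ([3, 4], [4, 5]), "edge_time": [200, 201]},
]


def check_time_shapes(node_time, partitions, num_nodes):
    """Verify the size rules: global node time, per-partition edge time."""
    assert len(node_time) == num_nodes
    for part in partitions:
        src, dst = part["edge_index"]
        assert len(src) == len(dst)
        # Edge time has one entry per edge stored in this partition.
        assert len(part["edge_time"]) == len(src)
    return True


def lookup_times(part, local_edge_pos, node_time):
    """Return (source node time, destination node time, edge time) for one edge.

    Node times are found through global node IDs; the edge time is found
    through the edge's local position inside the partition.
    """
    src, dst = part["edge_index"]
    return (
        node_time[src[local_edge_pos]],
        node_time[dst[local_edge_pos]],
        part["edge_time"][local_edge_pos],
    )
```

In this sketch, the first edge of the second partition connects nodes 3 and 4, so its node times come from the global vector while its edge time comes from the partition's own list.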

How to distinguish node-level or edge-level temporal data:

time_attr='time' for node-level temporal sampling.
time_attr='edge_time' for edge-level temporal sampling. This differs from the single-machine case, where both edge time and node time use time_attr='time'. The feature store has no node_store/edge_store, so for now the attribute name alone determines whether node-level or edge-level temporal sampling is used.
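A minimal sketch of the name-based dispatch described above. The property names `is_node_level_time` and `is_edge_level_time` appear in this PR's commit list, but the `TemporalConfig` class and its validation are hypothetical and only illustrate the idea; the actual implementation in `partition.py` may differ.

```python
class TemporalConfig:
    """Hypothetical helper (not part of torch_geometric).

    Decides the temporal sampling mode from the attribute name alone.
    """

    def __init__(self, time_attr=None):
        # Assumption: only these values are meaningful in the distributed setup.
        if time_attr not in (None, "time", "edge_time"):
            raise ValueError(f"Unsupported time_attr: {time_attr!r}")
        self.time_attr = time_attr

    @property
    def is_node_level_time(self):
        # 'time' selects node-level temporal sampling.
        return self.time_attr == "time"

    @property
    def is_edge_level_time(self):
        # 'edge_time' selects edge-level temporal sampling.
        return self.time_attr == "edge_time"
```

For example, `TemporalConfig("edge_time").is_edge_level_time` is `True`, while `TemporalConfig(None)` reports neither mode, which corresponds to non-temporal sampling.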

Where temporal data is saved:

time has been added to the node features -> node_feats.pt
edge_time has been added to the edge features -> edge_feats.pt

codecov · 2024-01-04T10:35:16Z

Codecov Report

Attention: 1 line in your changes is missing coverage. Please review.

Comparison is base (b26c034) 89.86% compared to head (bffc0ef) 89.39%.

| Files | Patch % | Lines |
|---|---|---|
| torch_geometric/distributed/partition.py | 96.87% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #8718      +/-   ##
==========================================
- Coverage   89.86%   89.39%   -0.47%     
==========================================
  Files         479      479              
  Lines       31087    31119      +32     
==========================================
- Hits        27937    27820     -117     
- Misses       3150     3299     +149


torch_geometric/distributed/partition.py

JakubPietrakIntel

looks great, thanks :)

rusty1s · 2024-01-22T07:30:14Z

torch_geometric/distributed/partition.py

+                        'global_id': edge_id,
+                        'feats': dict(edge_attr=part_data.edge_attr[perm]),
+                    }
+                    if is_edge_level_time:


Is it expected that we only add edge_time in case there exists edge features?

kgajdamo (Jan 22, 2024): While writing an e2e script that uses the MovieLens dataset (specifically: MovieLens(path, model_name='all-MiniLM-L6-v2')), I noticed that it has no edge attributes, so I had to move edge_time outside the condition. Does that make sense?

rusty1s (Jan 23, 2024): Yes, but here it is inside the condition, right? Maybe I am misunderstanding.

kgajdamo (Jan 24, 2024): You are right. Here is the fix: #8815
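The issue raised in this thread can be sketched as follows. The code uses plain dictionaries instead of tensors, and both function names, the `edge_id`/`edge_time` keys on the partition dictionary, and the placement of `edge_time` in the output are assumptions for illustration; the actual change in #8815 is not shown here and may look different.

```python
def build_edge_feats_nested(part, is_edge_level_time):
    """Variant under review: edge_time is only written if edge_attr exists."""
    efeat = None
    if "edge_attr" in part:
        efeat = {
            "global_id": part["edge_id"],
            "feats": {"edge_attr": part["edge_attr"]},
        }
        if is_edge_level_time:
            efeat["edge_time"] = part["edge_time"]
    return efeat


def build_edge_feats_hoisted(part, is_edge_level_time):
    """Hypothetical fix: the edge_time check sits outside the edge_attr check."""
    efeat = {}
    if "edge_attr" in part:
        efeat["global_id"] = part["edge_id"]
        efeat["feats"] = {"edge_attr": part["edge_attr"]}
    if is_edge_level_time:
        # Keep the edge IDs so the time values can still be matched to edges.
        efeat.setdefault("global_id", part["edge_id"])
        efeat["edge_time"] = part["edge_time"]
    return efeat or None


# A partition without edge attributes, like the MovieLens case in the thread.
part_without_attr = {"edge_id": [7, 8], "edge_time": [100, 101]}
```

With `part_without_attr`, the nested variant returns `None` and the edge times are silently dropped, while the hoisted variant still returns them.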

This PR enables distributed edge sampling for heterogeneous graphs.

**Added:**
- Distributed edge heterogeneous sampling.
- Distributed edge heterogeneous node-level and edge-level temporal sampling.
- `DistEdgeHeteroSamplerInput` class, which serves as input data to the `node_sample` function when a given input edge has different source and target node types.
- Unit tests.

**Comments:**
- When a given input edge has distinct source and destination node types, the data of each type must be handled separately, which differs slightly from the case with only one input node type.
- This PR depends on: [#8718](#8718)

---------

Co-authored-by: JakubPietrakIntel <jakub.pietrak@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Matthias Fey <matthias.fey@tu-dortmund.de>

kgajdamo added the distributed label Jan 4, 2024

kgajdamo requested review from ZhengHongming888 and JakubPietrakIntel January 4, 2024 10:25

kgajdamo self-assigned this Jan 4, 2024

kgajdamo requested review from wsad1 and rusty1s as code owners January 4, 2024 10:25

kgajdamo marked this pull request as draft January 4, 2024 14:31

kgajdamo marked this pull request as ready for review January 4, 2024 15:12

kgajdamo force-pushed the partition-temporal branch 2 times, most recently from 7ce0b74 to 4eb0778 Compare January 5, 2024 08:53

kgajdamo mentioned this pull request Jan 5, 2024

Enable distributed link hetero sampling #8722

Merged

JakubPietrakIntel reviewed Jan 5, 2024

torch_geometric/distributed/partition.py

JakubPietrakIntel reviewed Jan 5, 2024

torch_geometric/distributed/partition.py (outdated)

torch_geometric/distributed/partition.py

kgajdamo force-pushed the partition-temporal branch from cc523d9 to 1ee2d8a Compare January 5, 2024 10:23

kgajdamo marked this pull request as draft January 5, 2024 11:29

rusty1s added feature 0 - Priority P0 labels Jan 5, 2024

kgajdamo marked this pull request as ready for review January 5, 2024 15:12

rusty1s reviewed Jan 8, 2024

torch_geometric/distributed/partition.py (outdated)

kgajdamo force-pushed the partition-temporal branch from 58f44a6 to 1d3f262 Compare January 12, 2024 15:45

kgajdamo and others added 7 commits January 12, 2024 17:23

handle node-level and edge-level temporal information when generating partitions

7681858

update CHANGELOG.md

7c344f7

get temporal data from partition

9a9e424

dist neighbor sampler tests to use from_partition instead of `put_tensor` to define temporal data

b2f0642

apply code review comment - global_eid -> offsetted_eid

7cbcc35

add is_node_level_time and is_edge_level_time properties

1d3f262

update

1403b68

rusty1s approved these changes Jan 22, 2024

Merge branch 'master' into partition-temporal

bffc0ef

rusty1s enabled auto-merge (squash) January 22, 2024 07:31

rusty1s disabled auto-merge January 22, 2024 07:44

rusty1s merged commit 81fdeaf into pyg-team:master Jan 22, 2024
14 checks passed

rusty1s Jan 22, 2024

kgajdamo Jan 22, 2024

rusty1s Jan 23, 2024

kgajdamo Jan 24, 2024