-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle node-level and edge-level temporal information when generating partitions #8718
Conversation
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## master #8718 +/- ##
==========================================
- Coverage 89.86% 89.39% -0.47%
==========================================
Files 479 479
Lines 31087 31119 +32
==========================================
- Hits 27937 27820 -117
- Misses 3150 3299 +149 ☔ View full report in Codecov by Sentry. |
7ce0b74
to
4eb0778
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks great, thanks :)
cc523d9
to
1ee2d8a
Compare
58f44a6
to
1d3f262
Compare
…ensor` to define temporal data
'global_id': edge_id, | ||
'feats': dict(edge_attr=part_data.edge_attr[perm]), | ||
} | ||
if is_edge_level_time: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it expected that we only add edge_time
in case there exists edge features?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While writing an e2e script that uses the MovieLens dataset (specifically: MovieLens(path, model_name='all-MiniLM-L6-v2') I noticed that there were no edge attributes there. So I had to move edge_time outside the condition. Does it make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but here it is inside the condition, right? Maybe I am misunderstanding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. Here is the fix: #8815
This PR enables distributed edge sampling for heterogeneous graphs. **Added:** - Distributed edge heterogeneous sampling. - Distributed edge heterogeneous node-level and edge-level temporal sampling. - `DistEdgeHeteroSamplerInput` class, which serves as an input data to the `node_sample` function when for a given input edge there are different source and target node types. - unit tests **Comments:** - In the case when a given input edge has distinct source and destination node types it is necessary to handle the data of each of these types separately, so it is slightly different from the situation when we have only one input node type. - This PR depends on: [#8718](#8718) --------- Co-authored-by: JakubPietrakIntel <jakub.pietrak@intel.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Matthias Fey <matthias.fey@tu-dortmund.de>
Description
Temporal data definition:
Time data can be added to the FeatureStore in two ways:
it can be obtained directly from the partition using
LocalFeatureStore.from_partition()
functionor
you can add them yourself using the
put_tensor()
function on theLocalFeatureStore
object.Node-level temporal data: each partition must have the same time vector, which is global and its size is equal to the number of nodes in the whole graph.
Why:
We operate on global node ids.
Edge-level temporal data: each partition has its own time vector, which is local for a given partition and its size is equal to the number of edges in the given partition (part_data.edge_index.size(1)).
Why:
Each partition has its own unique edge_index in COO format, which is later converted to a CSR/CSC matrix in the neighbor sampler. So we do not have information about the global edge IDs when sampling and we would not be able to find the correct time information for a specific edge. Therefore, this information must be local.
How to distinguish node-level or edge-level temporal data:
time_attr
='time' for node-level temporal sampling.time_attr
='edge_time' for edge-level temporal sampling. It is different from a single machine case when both edge time and node time havetime_attr
='time'. It is handled this way because of the lack of the node_store/edge_store in the feature store, so at the moment we determine whether to use node-level or edge-level temporal sampling based on the attribute name.Where temporal data is saved:
time
has been added to the node features -> node_feats.ptedge_time
has been added to the edge features -> edge_feats.pt