Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle node-level and edge-level temporal information when generating partitions #8718

Merged
merged 8 commits into from
Jan 22, 2024

Conversation

kgajdamo
Copy link
Contributor

@kgajdamo kgajdamo commented Jan 4, 2024

Description

Temporal data definition:
Time data can be added to the FeatureStore in two ways:

  • it can be obtained directly from the partition using LocalFeatureStore.from_partition() function
    or

  • you can add them yourself using the put_tensor() function on the LocalFeatureStore object.

  • Node-level temporal data: each partition must have the same time vector, which is global and its size is equal to the number of nodes in the whole graph.
    Why:
    We operate on global node ids.

  • Edge-level temporal data: each partition has its own time vector, which is local for a given partition and its size is equal to the number of edges in the given partition (part_data.edge_index.size(1)).
    Why:
    Each partition has its own unique edge_index in COO format, which is later converted to a CSR/CSC matrix in the neighbor sampler. So we do not have information about the global edge IDs when sampling and we would not be able to find the correct time information for a specific edge. Therefore, this information must be local.

How to distinguish node-level or edge-level temporal data:

  • time_attr='time' for node-level temporal sampling.
  • time_attr='edge_time' for edge-level temporal sampling. It is different from a single machine case when both edge time and node time have time_attr='time'. It is handled this way because of the lack of the node_store/edge_store in the feature store, so at the moment we determine whether to use node-level or edge-level temporal sampling based on the attribute name.

Where temporal data is saved:

  • time has been added to the node features -> node_feats.pt
  • edge_time has been added to the edge features -> edge_feats.pt

Copy link

codecov bot commented Jan 4, 2024

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (b26c034) 89.86% compared to head (bffc0ef) 89.39%.

Files Patch % Lines
torch_geometric/distributed/partition.py 96.87% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8718      +/-   ##
==========================================
- Coverage   89.86%   89.39%   -0.47%     
==========================================
  Files         479      479              
  Lines       31087    31119      +32     
==========================================
- Hits        27937    27820     -117     
- Misses       3150     3299     +149     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@kgajdamo kgajdamo marked this pull request as draft January 4, 2024 14:31
@kgajdamo kgajdamo marked this pull request as ready for review January 4, 2024 15:12
@kgajdamo kgajdamo force-pushed the partition-temporal branch 2 times, most recently from 7ce0b74 to 4eb0778 Compare January 5, 2024 08:53
Copy link
Contributor

@JakubPietrakIntel JakubPietrakIntel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great, thanks :)

torch_geometric/distributed/partition.py Outdated Show resolved Hide resolved
torch_geometric/distributed/partition.py Show resolved Hide resolved
@kgajdamo kgajdamo marked this pull request as draft January 5, 2024 11:29
@kgajdamo kgajdamo marked this pull request as ready for review January 5, 2024 15:12
'global_id': edge_id,
'feats': dict(edge_attr=part_data.edge_attr[perm]),
}
if is_edge_level_time:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it expected that we only add edge_time in case there exists edge features?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While writing an e2e script that uses the MovieLens dataset (specifically: MovieLens(path, model_name='all-MiniLM-L6-v2') I noticed that there were no edge attributes there. So I had to move edge_time outside the condition. Does it make sense?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but here it is inside the condition, right? Maybe I am misunderstanding.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. Here is the fix: #8815

@rusty1s rusty1s enabled auto-merge (squash) January 22, 2024 07:31
@rusty1s rusty1s disabled auto-merge January 22, 2024 07:44
@rusty1s rusty1s merged commit 81fdeaf into pyg-team:master Jan 22, 2024
14 checks passed
rusty1s added a commit that referenced this pull request Feb 4, 2024
This PR enables distributed edge sampling for heterogeneous graphs.

**Added:**
- Distributed edge heterogeneous sampling.
- Distributed edge heterogeneous node-level and edge-level temporal
sampling.
- `DistEdgeHeteroSamplerInput` class, which serves as an input data to
the `node_sample` function when for a given input edge there are
different source and target node types.
- unit tests

**Comments:**
- In the case when a given input edge has distinct source and
destination node types it is necessary to handle the data of each of
these types separately, so it is slightly different from the situation
when we have only one input node type.
- This PR depends on:
[#8718](#8718)

---------

Co-authored-by: JakubPietrakIntel <jakub.pietrak@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Matthias Fey <matthias.fey@tu-dortmund.de>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants