
Add dist_context/RPC for distributed training #7671

Merged
merged 17 commits into pyg-team:master on Jul 28, 2023

Conversation

ZhengHongming888
Contributor

This code is part of the overall distributed training support for PyG.

It adds two pieces:

  1. dist_context sets up the distributed mode (e.g., worker mode) and stores the distributed context information such as role, rank, group_name, world_size, etc.
  2. rpc builds on PyTorch's rpc API to provide an is_rpc_initialized check and wrapper functionality such as init_rpc(), RpcCall(), RpcRouter, rpc_async_request() and rpc_sync_request(), etc. (a minimal usage sketch follows below).

This basic RPC functionality will later be used for feature lookup after node sampling.
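
For reference, here is a minimal sketch of how a distributed context and the RPC initialization could fit together. The names below (DistRole, DistContext, init_worker_rpc, and the master_addr/master_port parameters) are illustrative assumptions, not the exact API added in this PR:

from dataclasses import dataclass
from enum import Enum

import torch.distributed.rpc as rpc


class DistRole(Enum):
    WORKER = 1  # illustrative single worker-mode role


@dataclass
class DistContext:
    # Assumed fields, mirroring the description above (role, rank, group_name,
    # world_size, etc.):
    rank: int
    world_size: int
    group_name: str
    role: DistRole = DistRole.WORKER

    @property
    def worker_name(self) -> str:
        return f'{self.group_name}-{self.rank}'


def init_worker_rpc(ctx: DistContext, master_addr: str, master_port: int):
    # Initialize the underlying PyTorch RPC agent for this worker process:
    options = rpc.TensorPipeRpcBackendOptions(
        init_method=f'tcp://{master_addr}:{master_port}')
    rpc.init_rpc(name=ctx.worker_name, rank=ctx.rank,
                 world_size=ctx.world_size, rpc_backend_options=options)

Once the RPC agent is up, the synchronous/asynchronous request wrappers can route calls to a remote worker by its worker_name.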

Please let us know if you have any comments. Thanks.

@codecov

codecov bot commented Jun 30, 2023

Codecov Report

Merging #7671 (649d034) into master (02cc18d) will decrease coverage by 0.39%.
The diff coverage is 88.46%.

❗ Current head 649d034 differs from pull request most recent head dabb855. Consider uploading reports for the commit dabb855 to get more accurate results.

@@            Coverage Diff             @@
##           master    #7671      +/-   ##
==========================================
- Coverage   91.95%   91.57%   -0.39%     
==========================================
  Files         453      455       +2     
  Lines       25556    25657     +101     
==========================================
- Hits        23500    23495       -5     
- Misses       2056     2162     +106     
Files Changed Coverage Δ
torch_geometric/distributed/rpc.py 86.51% <86.51%> (ø)
torch_geometric/distributed/dist_context.py 100.00% <100.00%> (ø)

... and 22 files with indirect coverage changes


@rusty1s rusty1s changed the title Add dist_context/rpc for distributed training Add dist_context/RPC for distributed training Jul 2, 2023
Contributor

@mananshah99 mananshah99 left a comment


Thanks! Left some initial comments. The biggest one has to do with the use of global state here; we may get a cleaner implementation if we remove some of this state and consolidate it under classes that can be set as instance variables on the feature and graph store.

Review threads (resolved): torch_geometric/distributed/dist_context.py (4), torch_geometric/distributed/rpc.py (5)
Comment on lines 188 to 195
def _rpc_remote_async_call(call_id, *args, **kwargs):
    r"""Entry point for asynchronous RPC requests: looks up the registered
    callee by :obj:`call_id` in the global call pool and forwards the call."""
    return _rpc_call_pool.get(call_id).rpc_async_call(*args, **kwargs)


def _rpc_remote_sync_all(call_id, *args, **kwargs):
    r"""Entry point for synchronous RPC requests: looks up the registered
    callee by :obj:`call_id` in the global call pool and forwards the call."""
    return _rpc_call_pool.get(call_id).rpc_sync_call(*args, **kwargs)
Contributor


Why do we need a global pool here? Can't we just pass the sync or async function and its args directly to rpc_request_async and rpc_request_sync below?

Contributor Author


We use a single pool here to distinguish the different call functions across the multiprocessing workers; keeping them registered in one place makes the code easier to follow.
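
For illustration, a minimal sketch of the call-pool pattern being discussed; aside from the two entry functions quoted above, all names here (RpcCallBase, RpcCallPool, register) are assumptions rather than the PR's exact API:

import threading


class RpcCallBase:
    # Interface for objects that can be invoked remotely.
    def rpc_sync_call(self, *args, **kwargs):
        raise NotImplementedError

    def rpc_async_call(self, *args, **kwargs):
        raise NotImplementedError


class RpcCallPool:
    # Registers callee objects under integer ids so that remote workers only
    # need to send (call_id, args) over RPC instead of the callee itself.
    def __init__(self):
        self._lock = threading.Lock()
        self._pool = {}
        self._next_id = 0

    def register(self, callee: RpcCallBase) -> int:
        with self._lock:
            call_id = self._next_id
            self._pool[call_id] = callee
            self._next_id += 1
        return call_id

    def get(self, call_id: int) -> RpcCallBase:
        return self._pool[call_id]


_rpc_call_pool = RpcCallPool()

With this pattern, each worker registers its callees once (e.g., a feature-lookup handler) and other workers then invoke them by id through the module-level entry functions shown above.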

Contributor

@mananshah99 mananshah99 left a comment


Mostly looks good. Left a few additional comments on top of the ones from before; once we address these, we should be good to merge this and handle more clean-up later.

Review threads (resolved): torch_geometric/distributed/dist_context.py (2), torch_geometric/distributed/rpc.py (2)
Member

@rusty1s rusty1s left a comment


A bit hard to review this without proper context. Not sure if possible, but would be great to test some of these.

Review threads (resolved): torch_geometric/distributed/dist_context.py (3), torch_geometric/distributed/rpc.py (7)
@rusty1s rusty1s enabled auto-merge (squash) July 28, 2023 15:56
@rusty1s rusty1s merged commit 6acc096 into pyg-team:master Jul 28, 2023
14 checks passed