Conversation

@JackCaoG
Collaborator

Reverts #3536

@JackCaoG
Collaborator Author

Need to set the port for the GPU CI and also handle the PJRT test in the CPU tests.

@JackCaoG
Collaborator Author

Need to look into the CPU CI failure.

@JackCaoG
Collaborator Author

Not able to repro the hang on my dev machine. Downloading the Docker image used by CI.

@JackCaoG
Collaborator Author

Able to repro using the CircleCI Docker image, looking into it.

@JackCaoG
Collaborator Author

Seems like for the multi-CPU cc tests, multiple gRPC servers are started:

2022-05-24 03:23:04.964812: I  159322 tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:438] Started server with target: grpc://localhost:37785
2022-05-24 03:23:04.975578: I  159321 tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job localservice -> {0 -> localhost:55535}
2022-05-24 03:23:04.976036: I  159321 tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:438] Started server with target: grpc://localhost:55535
2022-05-24 03:23:05.000570: I  159319 tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job localservice -> {0 -> localhost:36029}
2022-05-24 03:23:05.002110: I  159319 tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:438] Started server with target: grpc://localhost:36029
2022-05-24 03:23:05.002230: I  159319 tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:1489] Creating mesh service bound to de4155e68d22:34453
2022-05-24 03:23:05.007282: I  159320 tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job localservice -> {0 -> localhost:32791}
2022-05-24 03:23:05.008858: I  159320 tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:438] Started server with target: grpc://localhost:32791

This won't work with the single xrt_server approach in this PR.
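For context, a common way test harnesses elsewhere sidestep random-port collisions (not what this PR does) is to let the kernel assign a free port by binding to port 0. A hypothetical sketch, with `pick_free_port` being an illustrative helper, not a function from this codebase:

```python
import socket

# Hypothetical sketch, not part of this PR: instead of picking a random port
# and hoping it is free, bind to port 0 and let the kernel choose an unused
# one. There is still a small race between returning the port and the server
# binding it, but collisions are far less likely than with blind random picks.
def pick_free_port(host: str = "localhost") -> int:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))          # port 0 -> kernel assigns a free port
        return s.getsockname()[1]  # the port the kernel picked
```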

@JackCaoG
Collaborator Author

Well, I guess this makes sense: we have ~8 cc op tests, and each of them will start 4 gRPC servers on random ports. Every time there is a port conflict, a test will fail. One thing I can do is limit these cc op tests on CPU to only run on the pytorch/xla CI. It is very unlikely that upstream will break the pt/xla cc ops anyway.
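A rough back-of-the-envelope check of why occasional collisions are expected. This is a hypothetical birthday-paradox estimate, assuming all ~32 ports (8 tests x 4 servers) are drawn independently and uniformly from the default Linux ephemeral range, which the actual tests may not do exactly:

```python
# Hypothetical illustration, not code from this PR: estimate the probability
# that at least two of N independently chosen random ports collide.
def collision_probability(servers: int, port_range: int) -> float:
    """Birthday-paradox probability of at least one port collision."""
    p_no_collision = 1.0
    for i in range(servers):
        # i ports are already taken; the next draw must avoid all of them
        p_no_collision *= (port_range - i) / port_range
    return 1.0 - p_no_collision

# Default Linux ephemeral range 32768-60999 is ~28k ports.
prob = collision_probability(servers=32, port_range=61000 - 32768)
print(f"collision chance per run: {prob:.2%}")
```

Under these assumptions the chance per run is on the order of a couple of percent, which is low but more than enough to produce regular flakes across many CI runs.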

@JackCaoG
Collaborator Author

I will close this PR and open a new one that only runs the cc op tests on the torch/xla CI.

@JackCaoG closed this May 24, 2022