Support unordered sharding spec for partial replication #5316

JackCaoG · 2023-07-17T23:46:01Z

No description provided.

jonb377

Thanks Jack! LGTM

jonb377 · 2023-07-18T00:29:23Z

torch_xla/experimental/xla_sharding.py

-def _get_group_assignment(
-    sharding_type: ShardingType, mesh: Mesh,
-    partition_spec: Tuple[Union[int, None]]) -> Tuple[List, List]:
+def _get_group_assignment(sharding_type: ShardingType, mesh: Mesh,


jonb377 · Jul 18, 2023

It looks like the mesh parameter isn't used anymore.

JackCaoG · Jul 18, 2023

Nice catch, I should remove it.

JackCaoG · Jul 18, 2023

@jonb377 I will fix this in a follow-up PR.

* Suport unordered sharding spec for partial replication
* add 4d test
* handle 2d tensor with 2d mesh case
* refactoring
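As an illustration of what an "unordered" partition spec combined with partial replication could mean, here is a minimal pure-Python sketch. It is not the torch_xla implementation: the helper name `sketch_tile_assignment`, its signature, and the row-major device numbering are all assumptions made for this example. The idea shown is that mesh axes named in the spec are moved to the front in tensor-dimension order, and any mesh axes left unused are grouped into a trailing replication dimension.

```python
from itertools import product
from math import prod
from typing import List, Optional, Sequence, Tuple


def sketch_tile_assignment(
    mesh_shape: Sequence[int],
    partition_spec: Sequence[Optional[int]],
) -> Tuple[List[int], List[int]]:
    """Return (tile_dims, device_order) for a partially replicated sharding.

    Hypothetical helper for illustration only. Devices are assumed to be
    numbered 0..N-1 in row-major order over the logical mesh.
    """
    used = [axis for axis in partition_spec if axis is not None]
    if len(set(used)) != len(used):
        raise ValueError("a mesh axis may shard at most one tensor dimension")
    unused = [axis for axis in range(len(mesh_shape)) if axis not in used]

    # Permute mesh axes: sharding axes in tensor-dimension order first,
    # then the unused (replicated) axes last.
    perm = used + unused

    # One tile dimension per tensor dimension, plus a trailing dimension
    # whose size is the number of replicas of each shard.
    tile_dims = [1 if axis is None else mesh_shape[axis] for axis in partition_spec]
    tile_dims.append(prod(mesh_shape[axis] for axis in unused))

    # Row-major strides of the original (unpermuted) mesh.
    strides = [prod(mesh_shape[i + 1:]) for i in range(len(mesh_shape))]

    # Walk the permuted mesh in row-major order, recording original device ids.
    device_order = []
    for idx in product(*(range(mesh_shape[axis]) for axis in perm)):
        device_order.append(sum(i * strides[axis] for i, axis in zip(idx, perm)))
    return tile_dims, device_order


if __name__ == "__main__":
    # 2x2x2 mesh; tensor dim 0 is sharded over mesh axis 1 and tensor dim 1
    # over mesh axis 0 (an "unordered" spec); mesh axis 2 is replicated.
    print(sketch_tile_assignment((2, 2, 2), (1, 0)))
```

With an ordered spec such as `(0, 1)` on the same mesh, the permutation is the identity and the device order is simply `0..7`; the unordered spec `(1, 0)` is where the axis permutation starts to matter.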

* Update inline style code to multiline (#5291) * Fix typo in _test.yml (#5172) s/metadtaa/metadata/ * [SPMD][Virtual Device]All tensors should be in SPMD:0 C++ device (#5284) * Move all tensors to SPMD:0 C++ device under spmd context * fix load shards * fix test_mark_sharding_2d by not creating placeholder for virtual device * fix the waitdeviceop for spmd case * Fix test_shard_hashing * fix spmd device casting issue * remove hacks in test_xla_virtual_device.py * add test for new virtual device usage * fix review comments * fix IsTpuDevice * linter * Revert pr #2682 (#5215) * Make README more actionable (#5262) * Make README more actionable * move profiling guide link * text wrapping * [SPMD] Use xs.Mesh in test_2d_tensor_3d_mesh (#5295) * use mesh in test_2d_tensor_3d_mesh * remove attributes patch * [SPMD] Add FSDP sharding for test_train_spmd_linear_model.py (#5299) Summary: This diff adds FSDP sharding for test_train_spmd_linear_model.py. Test Plan: PJRT_DEVICE=TPU XLA_USE_SPMD=1 python test/spmd/test_train_spmd_linear_model.py --sharding fsdp * [SPMD] Avoid recompilations in xs.mark_sharding() (#5300) Summary: This pull requests fixes the recompilation issue in xs.mark_sharding(). xtensor->GetXlaData() will compile the program if xtensor is an IR in order to get the BackendData. I believe this is not intended given the error message below suggests only data type xtensors are supported. Test Plan: PJRT_DEVICE=TPU XLA_USE_SPMD=1 python test/spmd/test_xla_sharding.py * [SPMD] Support mark_sharding on IRs (#5301) Summary: This pull requests fixes the recompilation issue in xs.mark_sharding(). xtensor->GetXlaData() will compile the program if xtensor is an IR in order to get the BackendData. I believe this is not intended given the error message below suggests only data type xtensors are supported. 
Test Plan: PJRT_DEVICE=TPU XLA_USE_SPMD=1 python test/spmd/test_xla_sharding.py * [SPMD] Allow dumping post optimizations hlo (#5302) Summary: This pull request partial reverts the change in #5266 to re-enble dumping post optimizations hlo. Test Plan: XLA_USE_SPMD=1 PJRT_DEVICE=TPU python test/spmd/test_xla_sharding.py -v -k test_xla_sharded_hlo_dump_post_optimizations * Add `_sharded_cpu_state_dict` for distributed checkpointing (#5288) * initiak commit * Add test workflow for `xrt` branch (#5241) * Add test workflow for `xrt` branch * Only run for PRs targeting XRT branch * Add function to generate stablehlo based callable from pytorch model (#5216) * Add function to generate stablehlo based callable from pytorch model Added function `torch_xla.experimental.stablehlo_saved_model.export_pytorch_model`. This function will take a pytorch Module and convert it into stablehlo bytecode. * Only run the main CI workflow on PRs targeting master and release branches (#5244) * Only run main CI for master and release branches. * Disabling XRT tests on main CI * AMP for TPUs v3 (#5161) * remove duplicate autocast_test (#5246) * Remove `test_experimental_pjrt_tpu.py` from TPU CI (#5247) * Install `expecttest` in xla_test_job.yaml (#5252) * Add IAM roles for cloudbuild_editors (#5251) * [Functionalization] Remove view in view_symint (#5231) * [Functionalization] Remove view in view_symint Summary: This pull request removes views in tensor_method::view_symint. 
Test Plan: XLA_DISABLE_FUNCTIONALIZATION=1 PJRT_DEVICE=TPU python ../test/test_view_ops.py -v -k TestViewOpsXLA.test_view_view PJRT_DEVICE=TPU python ../test/test_view_ops.py -v -k TestViewOpsXLA.test_view_view * Fix linters * fixed the test * ran the linter --------- Co-authored-by: Xiongfei Wei <isaacwxf23@gmail.com> * Delete XRT from the main branch (#5240) * Delete XRT from the main branch * Remove dead import * formatting * Remove disable_xrt build option * Fix runtime init * Revert "Remove disable_xrt build option" This reverts commit ba312e7. * Add disable XRT option back * formatting * Prune mesh service * Remove obsolete test * Remove other run server script * Remove XRT config * Update PJRT default device test * Add a file I forgot to save * if using_pjrt -> @requires_pjrt * Remove irrelevant test case * Remove XRT env vars * fix md link * formatting * Remove extra `requires_pjrt` * merge conflicts * Add other autocast back * Add nightly build for cuda 12 (#5253) * Fix the linter command in the CI (#5254) * fix linter command * ran linter * Jack cao g/fix spmd buff is null (#5256) * Fix that non-tensor scalar can't be handled by virtual device * add test * comment * Skip calling as_strided in empty_strided_symint if the input has dynamic dimensions. (#5239) * Skip calling as_strided in empty_strided_symint. * only return empty_symint conditionally. 
* add a comment * Add XRT nightly builds (#5261) * Add XRT nightly builds * remove space * [OpenXLA] Migrate to pull XLA from OpenXLA (#5202) PyTorch/XLA migrate to pull XLA from OpenXLA by replacing TensorFlow with OpenXLA after deprecating XRT usage, and replace TensorFlow-pin with OpenXLA-pin to May09 * Add ToString method for both PjrtData and PjrtShardedData (#5265) * Add ToString method for both PjrtData and PjrtShardedData * on cpu same config will become replicated, dont't check actual op sharding type * Update Sharded graph HLO dumping (#5266) * Enable PjRt Client Compilation with StableHLO (#5233) * Enable xla PjRt client compilation with StableHLO * add XLA_STABLEHLO_COMPILE to configuration.yaml * fix merge conflict * dummy commit to trigger ci * Revert "dummy commit to trigger ci" This reverts commit f7aec23. * Disable Bazel remote cache for forked PR (#5259) * disable bazel remote cache if gcloud key is empty * remove remote cache from setup.py * experiment with debug msg * fix flag * add more logs * skip remote chache if credential file is empty * add comment * add logs * add check in test and coverage script * fix condition in coverage test * advance branch pr * allow remote cache if gloud file isn't specified explicitly * remove dummy comment * Suppress debug symbols in OpenXLA code (#5269) * [SPMD] Sharding n-d tensor on (n+1)-d Mesh (#5268) * Make TPU detection more robust (#5271) * Clean bazel stuff on distutils clean. (#5274) * Clean bazel stuff on distutils clean * Fix python formatting * Delete unused .so file, and .lds files (#5275) * [OpenXLA] Delete unused .so file and .lds files * Fix the error when export_torch_model is given a non-tensor (#5277) However the generated StableHLO graph still hardcodes the non-tensor value. this is not correct, will fix later. 
* Dsiable test_simple_model_with_different_input_shape since it is curretnly broken by pytorch (#5282) * Always do build_ext in python setup.py develop (#5273) Bazel should figure out that _XLAC.so is current or not, and trigger rebuild if any cpp files changed. * Remove or improve several hardcoded TPU test conditions (#5272) * Remove or improve several hardcoded TPU test conditions * Fix test condition * Add `runtime.host_index` (#5283) * Make it an error if calling sizes() on a dynamic tensor. (#4998) * Err if calling sizes() on dynamic tensor * try to set has_symbolic_sizes_strides_ * resolve merge conflict * enable CONTINUE_ON_ERROR * fixed the python test test_SizeEq_should_not_compile_for_identical_symints * fix test_index_types * set CONTINUE_ON_ERROR to true * remove some unwanted code. * add a print * directly set has_symbolic_sizes_strides_ = true * make some fixes. * fix empty_strided_symint * ran linter * change error type in the test. * fix comments * ran linter * Fix the error where mark_step does not materalize tensors on SPMD:0 (#5281) * Fix the error where mark_step does not materalize tensors on SPMD:0 * typo * fix test_non_tensor_scalar * Disable torch._dynamo.config.automatic_dynamic_shapes (#5285) * Set torch._dynamo.config.automatic_dynamic_shapes to False * Enable DynamoInferenceBasicTest.test_simple_model_with_different_input_shape * run linter * wrap only if sharding type is non-replicated * Handle non-tensors * run linter * Call wrap_if_sharded first * Add exception in test for unsharded tensor * fix test * Use torch.Tensor instead of torch.tensor * use .cpu() only for tensors --------- Co-authored-by: Will Cromar <wcromar@google.com> Co-authored-by: qihqi <hanq@google.com> Co-authored-by: Meghan Cowan <cowanmeg@google.com> Co-authored-by: Mateusz Lewko <mateusz.lewko@gmail.com> Co-authored-by: Jiewen Tan <jwtan@google.com> Co-authored-by: Xiongfei Wei <isaacwxf23@gmail.com> Co-authored-by: Wonjoo Lee <wonjoo@google.com> Co-authored-by: 
JackCaoG <59073027+JackCaoG@users.noreply.github.com> Co-authored-by: Manfei <41607353+ManfeiBai@users.noreply.github.com> Co-authored-by: Siyuan Liu <lsiyuan@google.com> Co-authored-by: stgpetrovic <stgpetrovic@gmail.com> Co-authored-by: Mohit Khatwani <118776932+khatwanimohit@users.noreply.github.com> * Supoort unordered sharding spec correctly (#5305) * Supoort non-ordered sharding spec correctly * use permute instead of transpose * use dim > 2 to suit TPU v3(otherwise can't be divide evenly) * Support unordered sharding spec for partial replication (#5316) * Suport unordered sharding spec for partial replication * add 4d test * handle 2d tensor with 2d mesh case * refactoring * Fix mismatched GPU docker image in the doc. (#5319) * quick refactor on _get_group_assignment (#5318) * Add tf independent serialization (#5308) Create a serialization format for StableHLO graphs and weights without tf.saved_model Need to not use tensorflow because tensorflow is no longer dependency of pytorch/xla. Information saved are enough to reconstruct the tf.saved_model for serving. Information stored: * metadata on which tensor maps which input position * StableHLO version number * metadata on which tensor corresponds to user input or parameter * metadata on shape and dtype of each tensor. * Tensors themselves are saved as numpy arrays using np.save. 
* Disable coverage for now (#5321) * Enable Some input output aliasing under SPMD (#5320) * Use `_sharded_cpu_state_dict` functionality to Write Items for SPMD Save Planner (#5315) * initial commit * add suggested changes * add unit test * fix test * fix test * add suggested changes * remove is_sharded_tensor check * check if device type is xla in `wrap_if_sharded` * change order * update resolve_data and add more tests * run linter * use subtest * formatting fixes * run linter * handle single tensor for method send_to_device_single (#5317) * handle single tensor for method send_to_device_single * fix broadcast parameter --------- Co-authored-by: Wonjoo Lee <wonjoo@google.com> Co-authored-by: Nikita Shulga <nshulga@meta.com> Co-authored-by: iefgnoix <isaacwxf23@gmail.com> Co-authored-by: Will Cromar <wcromar@google.com> Co-authored-by: Mohit Khatwani <118776932+khatwanimohit@users.noreply.github.com> Co-authored-by: Jiewen Tan <jwtan@google.com> Co-authored-by: Yash Shah <55116947+yashs97@users.noreply.github.com> Co-authored-by: qihqi <hanq@google.com> Co-authored-by: Meghan Cowan <cowanmeg@google.com> Co-authored-by: Mateusz Lewko <mateusz.lewko@gmail.com> Co-authored-by: Manfei <41607353+ManfeiBai@users.noreply.github.com> Co-authored-by: Siyuan Liu <lsiyuan@google.com> Co-authored-by: stgpetrovic <stgpetrovic@gmail.com>

This reverts commit 78c6840.

* Revert "Support unordered sharding spec for partial replication (#5316)"
* Update test_2d_tensor_3d_mesh unit test to surface a bug
* Use partial replication for 2D tensor over 3D mesh sharding

@yeounoh

* Sharding should be per output of IR Node, instead of per IR Node (#5330) * sharding should be per output of IR Node, instead of per IR Node * Update sharding_hash method * Add test for sharding on IR with multiple output * fix cpu test * Fix a bug in getSharding * Update Python device API for SPMD (#5129) * Make python Api to respect the virtual device when SPMD is enabled * fix typo * Check out the release branch instead of origin/master in ansible (#5344) * Also dump output sharding on HLO file (#5339) * Also dump output sharding on HLO file * only dump output sharding if dump format is HLO * add test * fix typo * Make all-reduce a no-op when world size is 1 (#5342) * Make all-reduce a no-op when world size is 1 * Fix torch.distributed test * add fs linker flag (#5347) * Add py3.10 whl path to doc, refactor whl table (#5354) * fix amp dtype setting for GPU (#5337) * fix amp dtype setting for GPU. * fix ut * fix lint. * minor. * Add python test for SPMD+Runtime Python API (#5349) * Add python test for SPMD+Runtime Python API * replace test name * Update test_xla_spmd_python_api_interaction.py * Check the actual device instead of query env var for virtual device (#5352) * Check the actual device instead of query env var for virtual device * revert unneeded change * minor changes * [BE] use self.assertEquals instead of str equality in test_zero1.py (#5364) * Revert "[BE] use self.assertEquals instead of str equality in test_zero1.py (#5364)" (#5366) This reverts commit 8ada333. * [Dynamo|TPU] Tweak `atol` and `rtol` for `test_dynamo.py` (#5363) * tweak `atol` and `rtol` * [Dynamo|TPU] Skip`DynamoTrainingBasicTest.test_resnet18` on TPU (#5362) * Skip`DynamoTrainingBasicTest.test_resnet18` on TPU * Add a script for running stablehlo tests. 
(#5360) * Add kokoro presubmit for stablehlo tests * Don't rewrite index hints in global save planning (#5348) * [Dynamo|TPU] Skip `DynamoInferenceBasicTest.test_resnet18` on TPU (#5361) * Skip `DynamoInferenceBasicTest.test_resnet18` on TPU * [BE] use self.assertEquals instead of str equality in test_zero1.py (#5367) * [BE] use self.assertEquals instead of str equality in test_zero1.py * Use our own assertEqual * Remove print statements * Fix ReplicateShardedData for int type (#5374) * Fix ReplicateShardedData for int type * add test * Update dynamo.md (#5378) Update dynamo.md to remove note about fallback ops since they're supported now * Revert "Fix ReplicateShardedData for int type (#5374)" (#5380) This reverts commit 7fb7dfe. * Remove the mention of XRT_TPU_CONFIG in the CONTRIBUTING.md (#5379) * [Dynamo|TPU] Tweak `atol` and `rtol` for `test_simple_model_with_different_input_shape` on TPU (#5373) * tweak `atol` and `rtol` for `test_simple_model_with_different_input_shape` on TPU * Rectify test_zero1.py once optim.load_state_dict doesn't guarantee immutability (#5382) * [TEST ONLY] print statements for test_zero1.py to debug * Try fix * Rectify test_zero1.py to account for state_dict modification * Fix lint * Add gpu doc for how to build PyTorch/XLA from source with GPU support. (#5384) * Add gpu doc for how to build PyTorch/XLA from source with GPU support. 
* fix typo * fix comments * fix comments * clear pending ir should also clear the cc op tokens (#5385) * Port resnet data loading optimizations to SPMD test script (#5386) * Add support for in-place ops with self tensors in dynamo bridge (#5309) * Add more support for in-place ops in dynamo bridge Run linter * Add check to explicitly sync self tensors Remove debugging lines Update unit tests to a model * Clean up some code Surround in an if-statement Update metrics for fallback related dynamo tests Update cloned args logic Revert "Update metrics for fallback related dynamo tests" This reverts commit 3855f43. * Update single_node flag back to False * Add dynamo test in TPU CI (#5381) Add dynamo test in TPU CI * Add manual seed in multihost checkpoint (#5392) * Fix change_id type in coverage uploading (#5394) * Update dynamo cpu fallback op to aten::_foobar (#5393) * Run single host multi GPU tests in the CI. (#5387) * Add gpu doc for how to build PyTorch/XLA from source with GPU support. * Run single host multi GPU tests. * fix linter * fix linter * fix error * fix test * [PJRT] Separate collective ops test from TPU runtime test. (#5396) * [PJRT] Separate collective ops test from TPU runtime test. 
* formatting * Fix ReplicateShardedData for int type (#5404) * Update the dynamo backend name to `openxla` (#5402) * Replace aot backend with openxla * Update the inference backend except the fallback tests * handle the fallback tests * update remaining test * update doc * add torch pin * Delete .torcch_pin * linter * [SPMD] Multi-host batch sharded data loading (#5331) * Refactor to share code between export_torch_model and save_as_stablehlo (#5388) * Refactor to share code between export_torch_model and save_as_stablehlo * Fix TPU collective ops test for multi-host TPUs (#5408) * Fix TPU collective ops test for multi-host TPUs * formatting * Partially replicate lower-rank tensors (#5409) * Partially replicate lower-rank tensors * Fix unit test * Remove unnecessary device count check * Fix unordered partition spec test * yapf * Revert "Partially replicate lower-rank tensors (#5409)" (#5412) This reverts commit 56a6a02. * SPMD cross slice-replication using partial_replication sharding (#5411) * Revert "Support unordered sharding spec for partial replication (#5316)" * Update test_2d_tensor_3d_mesh unit test to surface a bug * Use partial replication for 2D tensor over 3D mesh sharding * Fix the incorect clone arg condition in dynamo bridge (#5414) * [SPMD] named partition spec support (#5415) [SPMD] named partition spec * [PJRT|TPU] Update `test_xla_devices_single_process_all_chips` for expected device number (#5421) Update `test_xla_devices_single_process_all_chips` for expected device number * Add repo for libcudnn8=8.7.0.84 and CUDA 11.8 (#5425) * Update fix_includes.sh (#5441) Without this patch I cannot get torch_xla to build outside of the docker. This should fix it. * [PJRT] Support `torchrun` with `pjrt://` `init_method` (#5438) * Support torchrun with `pjrt://` `init_method` * move import * fix error * Fix NameError * Fix path * Remove from TPU CI * Bugfix + add more test for llama (#5439) Bugfix details: 1. 
When the graph have mutations the exported graph will have additional inputs. For now we are dropping them. 2. We should trace with args instead of final_args. * Move the C++ test build to CI build job instead of test job (#5442) * Update gcc to 10. (#5445) * Update gcc to 10, And use unversioned clang-format (so it's installation will succeed) in both debian bullseye and buster * gcc10 to ansible * Update the random seed for every dynamo execution (#5444) * Revert "Update gcc to 10. (#5445)" (#5449) This reverts commit 454e916. Co-authored-by: JackCaoG <59073027+JackCaoG@users.noreply.github.com> * Install gcc-10 (#5450) * Revert "Install gcc-10 (#5450)" (#5452) This reverts commit 65b7639. * parallelize SPMD inputhandler and GetDataShards (#5447) * parallelize SPMD inputhandler and GetDataShards * add output handler trace * Remove base image override from TPU CI build (#5453) * Update to GCC 10 (#5451) * Cache sharded placeholder for dynamo execution (#5446) * Cache the output sharding spec for dynamo * address review comments * add test * remove dead code * add missing wait deivce ops * Update xla_graph_executor.cpp * linter * Remove Docker image override from dev image (#5456) * hack: implement (unimplement?) GetDataShard for XRT * skip flaky test (#5459) * Neuron import hook (#5429) * Enable Neuron import hook for calling initialization functions if using AWS Neuron * removing copy/paste error * moving aws init call and removing comment * Add missing includes (#5434) * Add missing includes Currently this is included indirectly through PyTorch includes, but when I remove the include from PyTorch's headers, the xla build fails. 
* [TESTING] Pin PyTorch PR * Retrigger CI after timeout * Remove .torch_pin * [GPU]Update README.md with wheel/docker for CUDA12.0 and deprecate CUDA11.7 (#5443) * [GPU]Update README.md with wheel and docker support CUDA12.0 and deprecate CUDA 11.7 * Update README.md with docker support CUDA 12.0 and python 3.8 * Update README.md * Update README.md * update remote cache key in ansible (#5463) * Fix data type in Pow with Scalar base and Tensor exponent (#5467) * fix dtype inference * fix linter * bump the timeout for CI (#5470) * Fix the input sharding for dynamo (#5469) * Enabling sharding device data IR (#5475) * Allow shard device data IR * Handle XLATensor that is DeviceData IR and does not have XLAData * fix typo * Introduce `torch_xla.runtime.use_spmd()` (#5474) Introduce torch_xla.runtime.use_spmd() and torch_xla.runtime.is_spmd() * Enable PJRT C API Client and other changes for Neuron (#5428) * Enable PJRT C API Client and other changes for Neuron * keeping quotes consistent * fixing device type call * refactoring neuron initialization with spawn * updating replication setting only for torchrun * removing set replication in xla backed was added to rendezvous handler * removing workaround for world_size/master_port for neuron * fixing linter issues * Don't move full tensor to device in deferred_init (#4819) * [SPMD] Fix HybridMesh ordering (#5478) Summary: In xs.HybridMesh, it assumes the xr.global_runtime_device_attributes() will return the attributes according to the PyTorch/XLA's logical global ordinals. However, it turns out not to be the case. To fix this, we pass the logical global ordinal as one of the attributes and xs.HybridMesh will sort the attributes according to this new attribute before using the array. Test Plan: PJRT_DEVICE=TPU USE_XLA_SPMD=1 python test/spmd/test_xla_sharding.py -v -k test_hybrid_mesh * [SPMD] Properly skip tests on TPU V2 (#5479) Summary: Some of the tests only fail on TPU V2 but were skipped for all TPUs. Let's fix that. 
Test Plan: PJRT_DEVICE=TPU USE_XLA_SPMD=1 python test/spmd/test_xla_sharding.py * Add @yeounoh to .github CODEOWNERS (#5482) * Add Python API to execute StableHLO bytecode (#5476) * [SPMD] Fix TPU CI after #5478 (#5487) * [SPMD] Fix TPU CI after #5478 Summary: Let's fix all TPU CI failures after #5478. Test Plan: TPU CI * Fix linters * [SPMD] Fix XLA_DUMP_POST_OPTIMIZATIONS test (#5485) Summary: XLA_DUMP_POST_OPTIMIZATIONS was set as static which means that the value will be fixed during the whole test run for a particular test suite. Therefore, let's make a separate file. Test Plan: PJRT_DEVICE=TPU USE_XLA_SPMD=1 python test/spmd/test_xla_sharding.py PJRT_DEVICE=TPU USE_XLA_SPMD=1 python test/spmd/test_xla_sharding_hlo.py * [Dist] Refactor ZeRO-1 (#5145) * refactor * fix * fix * add padding * more robust save/load * Update artifacts.auto.tfvars for 2.1 release (#5483) * Update artifacts.auto.tfvars for 2.1 release Update artifacts.auto.tfvars for 2.1 release * Remove cuda version 11.7 and add 12.0 for 2.1 triggers * Add 3.10 tpu version * Add ShardingSpec to XLATensor when it is created with a PJRTShardedData (#5489) * Add ShardingSpec to XLATensor when it is created with a PJRTShardedData * add test * Add topological sorting to dynamo partitions (#5472) * Add topological sorting to dynamo partitions * Run linter * Update unit tests to include more in-place ops * [SPMD] Patch nn.Linear (#5491) Summary: This pull request introduces a patched version of torch.nn.functional.linear that uses einsum instead of torch.matmul which will flatten the tensors to 2D and collide the sharded dimensions. The torch.matmul default behavior makes it very hard for XLA compiler to propagate the sharding annotation. 
Test Plan: PJRT_DEVICE=CPU python test/test_operations.py -v -k test_patched_linear * [original author: mrnikwaws] Neuron operator support (#5471) * adding glu operator support * adding glu operator * fixing yaml * fixing linter issues * fixing linter issues * fixing spacing * fixing spacing * fixing spacing * fixing spacing * fixing shape helper * fixing spacing * [SPMD] Make IR sharding custom sharding op (#5433) Summary: This pull request changes the syntax of IR sharding by making it a new node instead of just attaching the sharding spec to the tensor. On the same time, we will still attach a sharding spec to the newly created XLATensor which will hold the new IR node. This new IR node will be a CustomSharding node and in hlo: %annotate = f32[6,3] custom-call(%copy), custom_call_target="Sharding", sharding={devices=[2,1]0,1} Test Plan: PJRT_DEVICE=TPU XLA_USE_SPMD=1 python test/spmd/test_xla_sharding.py -v -k test_mark_sharding_ir PJRT_DEVICE=TPU XLA_USE_SPMD=1 python test/spmd/test_xla_sharding.py -v -k test_inplace_add_with_sharding * Support input sharding changed after first dynamo tracing (#5477) * Support input sharding changed after first dynamo tracing * fix linter * Handle the different input for dynamo sharding change * update counter * only get sharding specs when spmd is enabled * add option to skip checking input sharding after x runs * handle the cpu test * make XLA_DYNAMO_INPUT_SHARDING_CHECK_THREASHOLD configable * fix review comments * Always use ExecuteReplicated with SPMD (#5494) * Always use ExecuteReplicated with SPMD * Add unit test * Skip a couple tests on TPU due to precision issue (#5496) * Refactor stablehlo API and put them in official location. (#5493) Changes include: * make end point in torch_xla/init.py for exposed APIs torch_xla.save_as_stablehlo and torch_xla.save_torch_model_as_stablehlo. * All tf related integration to its own file. 
* Remove args as argument (because it will spear inside of ExportedProgram) but allow user to override it (which we use for now. * Support tuples in partition spec (#5488) * Support tuples in partition spec * Add unit test for partial replication * yapf * Support higher-rank tensors over lower-rank mesh * Fix test & yapf * Don't use partition_spec when creating group assignment * Update documentation * More documentation * Translate named specs in ShardingSpec * Add a API to explictly init runtime (#5500) * Add explict error message when tensor is on CPU for dynamo backend (#5499) * remove torchvision in stablehlo.py (#5501) * Fix tupled partition spec test on v3 (#5503) * Update dynamo doc (#5506) * Update dynamo.md (#5509) fixing typo * Get original_traced_args as example_inputs. (#5511) Change due to changing name in pytorch/pytorch#107978 * mark_sharding over a replicated tensor is allowed. (#5513) * [SPMD] Propagate replicated output (#5508) Summary: During the LLaMA2 experiements, I disovered that manually marking 1D tensors to be replicated can greatly save a lot of memory. Then I disocvered that explicitly replicated spec will get dropped after mark_step. That is caused by PrepareOutputShardingPropagation where it explicitly clear the sharding spec for replicated output. So, I went ahead and fix that. Further, I did some experiements of propogating replicated output and that drop the requirements of manually replicating 1D tensors. Hence, I made this change. I'm still not quite sure why, will follow up later. Test Plan: PJRT_DEVICE=TPU python test/spmd/test_xla_sharding.py * Disable cxx abi in ansible when building pt/xla for branch r2.0 (#5332) * Update pytorch git tag for r2.1 (#5529) Update more places Add torch_pin * Enable megacore_dense by default (#5520) (#5531) Summary: This change enables megacore_dense by default to allow asynchorous cc ops especailly for GSPMD. 
Test Plan: CI Co-authored-by: Jiewen Tan <jwtan@google.com> * Add option to unbundle libtpu (#5534) (#5536) * Add optiona to unbundle libtpu * Add clarifying note * Revert 2.1 terraform changes (#5537) * Fix FSDP for Models with Frozen Weights (#5484) (#5539) * Fix fsdp not freeing forzen full params * add test * formatting * remove unnecessary env var in test Co-authored-by: Liyang90 <liyanglu@google.com> * Update r2.1 wheel to be compatible with PyPI (#5550) * Update project metadata and remove useless files * Update README * Add manylinux platform tag * formatting * Add resnet50-weight-quant colab notebook (#5407) (#5556) * Add resnet50-weight-only-quant colab notebook * update notebook with llama blog link Co-authored-by: Siyuan Liu <lsiyuan@google.com> * hack: add placeholders for `HasSharding` and `GetSharding` to XRT * formatting * hack: always return false from `HasSharding` * Update torch pin to current RC for CI testing * Cherry pick `pjrt://` init method rename and doc updates (#5562) * Change `pjrt://` init method to `xla://` (#5560) * Update PJRT documentation for the 2.1 release (#5557) * Update PJRT documentation for the 2.1 release * clarify plugins * clarify PJRT doc * Update `pjrt://` to `xla://` * Use new cache silo and skip test build * hack: disable missing test * hack: alter cache silo name * formatting --------- Co-authored-by: JackCaoG <59073027+JackCaoG@users.noreply.github.com> Co-authored-by: iefgnoix <isaacwxf23@gmail.com> Co-authored-by: Siyuan Liu <lsiyuan@google.com> Co-authored-by: Baole Ai <baoleai01@gmail.com> Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com> Co-authored-by: Manfei <41607353+ManfeiBai@users.noreply.github.com> Co-authored-by: qihqi <hanq@google.com> Co-authored-by: jonb377 <jonbolin@google.com> Co-authored-by: Wonjoo Lee <wonjoo@google.com> Co-authored-by: Mohit Khatwani <118776932+khatwanimohit@users.noreply.github.com> Co-authored-by: Yeounoh Chung <yeounoh@google.com> Co-authored-by: 
Mateusz Lewko <mateusz.lewko@gmail.com> Co-authored-by: Alisson Azzolini <37222419+aazzolini@users.noreply.github.com> Co-authored-by: aws-kingrj <78175353+aws-kingrj@users.noreply.github.com> Co-authored-by: peterbell10 <peterbell10@live.co.uk> Co-authored-by: Zach Zheng <zczheng@amazon.com> Co-authored-by: Jiewen Tan <jwtan@google.com> Co-authored-by: Huang, Guangtai <guangtai@amazon.com> Co-authored-by: Shauheen <shauheen@users.noreply.github.com> Co-authored-by: Liyang90 <liyanglu@google.com>

JackCaoG added 4 commits July 17, 2023 23:44

Suport unordered sharding spec for partial replication

8f28425

add 4d test

8c84620

handle 2d tensor with 2d mesh case

a86ba43

refactoring

6be5202

JackCaoG added the SPMD label Jul 17, 2023

JackCaoG requested review from alanwaketan, jonb377 and khatwanimohit July 17, 2023 23:46

jonb377 approved these changes Jul 18, 2023


JackCaoG merged commit 78c6840 into master Jul 18, 2023
8 checks passed

khatwanimohit pushed a commit that referenced this pull request Jul 20, 2023

Support unordered sharding spec for partial replication (#5316)

5ec8813

* Suport unordered sharding spec for partial replication
* add 4d test
* handle 2d tensor with 2d mesh case
* refactoring

JackCaoG added a commit that referenced this pull request Jul 21, 2023

Support unordered sharding spec for partial replication (#5316)

aac03da

* Suport unordered sharding spec for partial replication
* add 4d test
* handle 2d tensor with 2d mesh case
* refactoring

yeounoh added a commit that referenced this pull request Aug 4, 2023

Revert "Support unordered sharding spec for partial replication (#5316)"

b8bf1d4

This reverts commit 78c6840.

yeounoh added a commit that referenced this pull request Aug 7, 2023

Revert "Support unordered sharding spec for partial replication (#5316)"

8c0f327

This reverts commit 78c6840.
