
Cherry-pick 2.1 release branch into XRT branch through 9/14 #5574

Merged: 117 commits merged into xrt on Sep 15, 2023

Conversation

@will-cromar (Collaborator) commented on Sep 14, 2023

Skipped commits that update the bazel workspace or are incompatible with XRT:

I had to make substantial edits (i.e., not just renaming imports) to the following commits to make them build against our pins:

Last commit picked: ee72332

JackCaoG and others added 30 commits September 14, 2023 20:23
* sharding should be per output of IR Node, instead of per IR Node

* Update sharding_hash method

* Add test for sharding on IR with multiple output

* fix cpu test

* Fix a bug in getSharding
* Make the Python API respect the virtual device when SPMD is enabled

* fix typo
* Also dump output sharding on HLO file

* only dump output sharding if dump format is HLO

* add test

* fix typo
* Make all-reduce a no-op when world size is 1

* Fix torch.distributed test
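As a rough illustration of the call site this commit touches (a sketch, not code from the PR), the snippet below uses the public `torch_xla.core.xla_model` API; with a world size of 1 the all-reduce should now lower to a no-op:

```python
# Minimal sketch, assuming a single XLA device (world size 1).
import torch
import torch_xla.core.xla_model as xm

t = torch.ones(4, device=xm.xla_device())
xm.all_reduce(xm.REDUCE_SUM, [t])  # reduces in place when given a list of tensors
xm.mark_step()                     # with world size 1, no collective should be emitted
```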
* fix amp dtype setting for GPU.

* fix ut

* fix lint.

* minor.
* Add python test for SPMD+Runtime Python API

* replace test name

* Update test_xla_spmd_python_api_interaction.py
…5352)

* Check the actual device instead of query env var for virtual device

* revert unneeded change

* minor changes
* Skip `DynamoTrainingBasicTest.test_resnet18` on TPU
* Add kokoro presubmit for stablehlo tests


* Skip `DynamoInferenceBasicTest.test_resnet18` on TPU
…5367)

* [BE] use self.assertEquals instead of str equality in test_zero1.py

* Use our own assertEqual

* Remove print statements
* Fix ReplicateShardedData for int type

* add test
Update dynamo.md to remove note about fallback ops since they're supported now
…erent_input_shape` on TPU (#5373)

* tweak `atol` and `rtol` for `test_simple_model_with_different_input_shape` on TPU
…mutability (#5382)

* [TEST ONLY] print statements for test_zero1.py to debug

* Try fix

* Rectify test_zero1.py to account for state_dict modification

* Fix lint
#5384)

* Add gpu doc for how to build PyTorch/XLA from source with GPU support.

* fix typo

* fix comments

* fix comments
* Add more support for in-place ops in dynamo bridge

Run linter

* Add check to explicitly sync self tensors

Remove debugging lines

Update unit tests to a model

* Clean up some code

Surround  in an if-statement

Update metrics for fallback related dynamo tests

Update cloned args logic

Revert "Update metrics for fallback related dynamo tests"

This reverts commit 3855f43.

* Update single_node flag back to False
Add dynamo test in TPU CI
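As a rough illustration of what the dynamo-bridge commits above exercise (a sketch, not code from this PR), the snippet below compiles a function containing an in-place op through `torch.compile`, assuming the 2.1-era `openxla` backend name:

```python
# Illustrative sketch; assumes torch_xla 2.1 with the `openxla` dynamo backend.
import torch
import torch_xla.core.xla_model as xm

def inplace_step(t):
    t.add_(1.0)  # in-place mutation the dynamo bridge has to sync back to `t`
    return t

compiled = torch.compile(inplace_step, backend='openxla')
x = torch.zeros(4, device=xm.xla_device())
print(compiled(x))  # expect a tensor of ones on the XLA device
```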
yeounoh and others added 12 commits September 14, 2023 22:36
Summary:
During the LLaMA2 experiments, I discovered that manually marking 1D tensors to be replicated can save a lot of memory. Then I discovered that an explicitly replicated spec gets dropped after mark_step. That is caused by PrepareOutputShardingPropagation, which explicitly clears the sharding spec for replicated output. So, I went ahead and fixed that.

Further, I did some experiments propagating replicated output, and that drops the requirement of manually replicating 1D tensors. Hence, I made this change.

I'm still not quite sure why; I will follow up later.

Test Plan:
PJRT_DEVICE=TPU python test/spmd/test_xla_sharding.py
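For context, here is a minimal, hypothetical sketch of the scenario described in the summary above, assuming the 2.1-era SPMD API under `torch_xla.experimental.xla_sharding` and a TPU runtime; before this fix, the explicitly replicated spec could be dropped at `mark_step`:

```python
# Hypothetical sketch; module and function names assume the 2.1-era SPMD API.
import numpy as np
import torch
import torch_xla.core.xla_model as xm
import torch_xla.experimental.xla_sharding as xs
import torch_xla.runtime as xr

xr.use_spmd()  # enable SPMD / the virtual device

num_devices = xr.global_runtime_device_count()
mesh = xs.Mesh(np.arange(num_devices), (num_devices,), ('data',))

t = torch.zeros(128, device=xm.xla_device())
xs.mark_sharding(t, mesh, (None,))  # partition spec (None,) marks the 1D tensor replicated
xm.mark_step()  # the explicit replicated spec should now survive this step
```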
Update more places

Add torch_pin
Summary:
This change enables megacore_dense by default to allow asynchronous cc
ops, especially for GSPMD.

Test Plan:
CI

Co-authored-by: Jiewen Tan <jwtan@google.com>
* Add option to unbundle libtpu

* Add clarifying note
* Fix fsdp not freeing frozen full params

* add test

* formatting

* remove unnecessary env var in test

Co-authored-by: Liyang90 <liyanglu@google.com>
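As an aside, a minimal, hypothetical sketch of the situation the FSDP fix above targets (frozen parameters inside a module wrapped with torch_xla's FSDP); the model and its layers are illustrative:

```python
# Hypothetical sketch; assumes torch_xla's XlaFullyShardedDataParallel wrapper.
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm
from torch_xla.distributed.fsdp import XlaFullyShardedDataParallel as FSDP

device = xm.xla_device()
model = nn.Sequential(nn.Linear(16, 16), nn.Linear(16, 4)).to(device)
for p in model[0].parameters():
    p.requires_grad = False  # frozen params whose gathered full params should be freed

wrapped = FSDP(model)
out = wrapped(torch.randn(2, 16, device=device))
out.sum().backward()
xm.mark_step()
```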
* Update project metadata and remove useless files

* Update README

* Add manylinux platform tag

* formatting
* Add resnet50-weight-only-quant colab notebook

* update notebook with llama blog link

Co-authored-by: Siyuan Liu <lsiyuan@google.com>
@will-cromar (Collaborator, Author) commented:

Something actually unconditionally calls HasSharding, leading to this error with XRT: RuntimeError: ./torch_xla/csrc/runtime/xrt_computation_client.h:179 : HasSharding not implemented

@will-cromar (Collaborator, Author) commented:

I'm able to build the current commit against PyTorch on the 2.1 branch and run MNIST on TPU with XRT 🎊

* Change `pjrt://` init method to `xla://` (#5560)

* Update PJRT documentation for the 2.1 release (#5557)

* Update PJRT documentation for the 2.1 release

* clarify plugins

* clarify PJRT doc

* Update `pjrt://` to `xla://`
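A short sketch of what the rename in the commits above means in user code, per the 2.1 PJRT docs (the `pjrt://` spelling was the pre-2.1 form); this is typically called inside each `xmp.spawn` worker:

```python
# Sketch of the renamed init method; assumes torch_xla 2.1 with the PJRT runtime.
import torch.distributed as dist
import torch_xla.distributed.xla_backend  # registers the `xla` process group backend

# Previously: dist.init_process_group('xla', init_method='pjrt://')
dist.init_process_group('xla', init_method='xla://')
```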
@will-cromar changed the title from "Cherry-pick commits from 9/27 to 8/9 into XRT branch" to "Cherry-pick 2.1 release branch into XRT branch through 9/14" on Sep 14, 2023
@will-cromar (Collaborator, Author) commented:

I disabled this test, which is missing from this branch: run_stablehlo_compile "$CDIR/stablehlo/test_stablehlo_compile.py"

I don't know where it got lost, but there's no reason to use StableHLO with XRT anyway.

@will-cromar (Collaborator, Author) commented:

Also, I was hitting a weird build failure on this branch until I updated the CI cache silo name. I wonder if that's why the wheel build is failing. I'll try updating the silo name in this branch.

@will-cromar marked this pull request as ready for review on September 15, 2023 17:49
@will-cromar merged commit 7c32c0f into xrt on Sep 15, 2023
3 of 4 checks passed