
Cherry-pick 2.1 release branch into XRT branch through 9/14 #5574

Merged: 117 commits merged into xrt on Sep 15, 2023

Conversation

@will-cromar (Collaborator) commented on Sep 14, 2023

Skipped commits that update the bazel workspace or are incompatible with XRT:

I had to make substantial edits (i.e., not just renaming imports) to the following commits to make them build against our pins:

Last commit picked: ee72332

JackCaoG and others added 30 commits September 14, 2023 20:23
* sharding should be per output of IR Node, instead of per IR Node

* Update sharding_hash method

* Add test for sharding on IR with multiple output

* fix cpu test

* Fix a bug in getSharding
* Make the Python API respect the virtual device when SPMD is enabled

* fix typo
* Also dump output sharding on HLO file

* only dump output sharding if dump format is HLO

* add test

* fix typo
* Make all-reduce a no-op when world size is 1

* Fix torch.distributed test
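As a rough illustration of the call site this commit touches (a sketch, not code from the PR), the snippet below uses the public `torch_xla.core.xla_model` API; with a world size of 1 the all-reduce should now lower to a no-op:

```python
# Minimal sketch, assuming a single XLA device (world size 1).
import torch
import torch_xla.core.xla_model as xm

t = torch.ones(4, device=xm.xla_device())
xm.all_reduce(xm.REDUCE_SUM, [t])  # reduces in place when given a list of tensors
xm.mark_step()                     # with world size 1, no collective should be emitted
```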
* fix amp dtype setting for GPU.

* fix ut

* fix lint.

* minor.
* Add python test for SPMD+Runtime Python API

* replace test name

* Update test_xla_spmd_python_api_interaction.py
…5352)

* Check the actual device instead of query env var for virtual device

* revert unneeded change

* minor changes
* Skip `DynamoTrainingBasicTest.test_resnet18` on TPU
* Add kokoro presubmit for stablehlo tests


* Skip `DynamoInferenceBasicTest.test_resnet18` on TPU
…5367)

* [BE] use self.assertEquals instead of str equality in test_zero1.py

* Use our own assertEqual

* Remove print statements
* Fix ReplicateShardedData for int type

* add test
Update dynamo.md to remove note about fallback ops since they're supported now
…erent_input_shape` on TPU (#5373)

* tweak `atol` and `rtol` for `test_simple_model_with_different_input_shape` on TPU
…mutability (#5382)

* [TEST ONLY] print statements for test_zero1.py to debug

* Try fix

* Rectify test_zero1.py to account for state_dict modification

* Fix lint
#5384)

* Add gpu doc for how to build PyTorch/XLA from source with GPU support.

* fix typo

* fix comments

* fix comments
* Add more support for in-place ops in dynamo bridge

Run linter

* Add check to explicitly sync self tensors

Remove debugging lines

Update unit tests to a model

* Clean up some code

Surround  in an if-statement

Update metrics for fallback related dynamo tests

Update cloned args logic

Revert "Update metrics for fallback related dynamo tests"

This reverts commit 3855f43.

* Update single_node flag back to False
Add dynamo test in TPU CI
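As a rough illustration of what the dynamo-bridge commits above exercise (a sketch, not code from this PR), the snippet below compiles a function containing an in-place op through `torch.compile`, assuming the 2.1-era `openxla` backend name:

```python
# Illustrative sketch; assumes torch_xla 2.1 with the `openxla` dynamo backend.
import torch
import torch_xla.core.xla_model as xm

def inplace_step(t):
    t.add_(1.0)  # in-place mutation the dynamo bridge has to sync back to `t`
    return t

compiled = torch.compile(inplace_step, backend='openxla')
x = torch.zeros(4, device=xm.xla_device())
print(compiled(x))  # expect a tensor of ones on the XLA device
```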
yeounoh and others added 12 commits September 14, 2023 22:36
Summary:
During the LLaMA2 experiments, I discovered that manually marking 1D tensors to be replicated can save a lot of memory. Then I discovered that an explicitly replicated spec gets dropped after mark_step. That is caused by PrepareOutputShardingPropagation, which explicitly clears the sharding spec for replicated output. So, I went ahead and fixed that.

Further, I did some experiments propagating replicated output, and that drops the requirement of manually replicating 1D tensors. Hence, I made this change.

I'm still not quite sure why; I will follow up later.

Test Plan:
PJRT_DEVICE=TPU python test/spmd/test_xla_sharding.py
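For context, here is a minimal, hypothetical sketch of the scenario described in the summary above, assuming the 2.1-era SPMD API under `torch_xla.experimental.xla_sharding` and a TPU runtime; before this fix, the explicitly replicated spec could be dropped at `mark_step`:

```python
# Hypothetical sketch; module and function names assume the 2.1-era SPMD API.
import numpy as np
import torch
import torch_xla.core.xla_model as xm
import torch_xla.experimental.xla_sharding as xs
import torch_xla.runtime as xr

xr.use_spmd()  # enable SPMD / the virtual device

num_devices = xr.global_runtime_device_count()
mesh = xs.Mesh(np.arange(num_devices), (num_devices,), ('data',))

t = torch.zeros(128, device=xm.xla_device())
xs.mark_sharding(t, mesh, (None,))  # partition spec (None,) marks the 1D tensor replicated
xm.mark_step()  # the explicit replicated spec should now survive this step
```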
Update more places

Add torch_pin
Summary:
This change enables megacore_dense by default to allow asynchronous cc
ops, especially for GSPMD.

Test Plan:
CI

Co-authored-by: Jiewen Tan <jwtan@google.com>
* Add option to unbundle libtpu

* Add clarifying note
* Fix fsdp not freeing frozen full params

* add test

* formatting

* remove unnecessary env var in test

Co-authored-by: Liyang90 <liyanglu@google.com>
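As an aside, a minimal, hypothetical sketch of the situation the FSDP fix above targets (frozen parameters inside a module wrapped with torch_xla's FSDP); the model and its layers are illustrative:

```python
# Hypothetical sketch; assumes torch_xla's XlaFullyShardedDataParallel wrapper.
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm
from torch_xla.distributed.fsdp import XlaFullyShardedDataParallel as FSDP

device = xm.xla_device()
model = nn.Sequential(nn.Linear(16, 16), nn.Linear(16, 4)).to(device)
for p in model[0].parameters():
    p.requires_grad = False  # frozen params whose gathered full params should be freed

wrapped = FSDP(model)
out = wrapped(torch.randn(2, 16, device=device))
out.sum().backward()
xm.mark_step()
```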
* Update project metadata and remove useless files

* Update README

* Add manylinux platform tag

* formatting
* Add resnet50-weight-only-quant colab notebook

* update notebook with llama blog link

Co-authored-by: Siyuan Liu <lsiyuan@google.com>
@will-cromar (Collaborator, Author) commented:

Something actually unconditionally calls HasSharding, leading to this error with XRT: RuntimeError: ./torch_xla/csrc/runtime/xrt_computation_client.h:179 : HasSharding not implemented

@will-cromar (Collaborator, Author) commented:

I'm able to build the current commit against PyTorch on the 2.1 branch and run MNIST on TPU with XRT 🎊

* Change `pjrt://` init method to `xla://` (#5560)

* Update PJRT documentation for the 2.1 release (#5557)

* Update PJRT documentation for the 2.1 release

* clarify plugins

* clarify PJRT doc

* Update `pjrt://` to `xla://`
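A short sketch of what the rename in the commits above means in user code, per the 2.1 PJRT docs (the `pjrt://` spelling was the pre-2.1 form); this is typically called inside each `xmp.spawn` worker:

```python
# Sketch of the renamed init method; assumes torch_xla 2.1 with the PJRT runtime.
import torch.distributed as dist
import torch_xla.distributed.xla_backend  # registers the `xla` process group backend

# Previously: dist.init_process_group('xla', init_method='pjrt://')
dist.init_process_group('xla', init_method='xla://')
```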
@will-cromar changed the title from "Cherry-pick commits from 9/27 to 8/9 into XRT branch" to "Cherry-pick 2.1 release branch into XRT branch through 9/14" on Sep 14, 2023
@will-cromar (Collaborator, Author) commented:

I disabled this test, which is missing from this branch: run_stablehlo_compile "$CDIR/stablehlo/test_stablehlo_compile.py"

I don't know where it got lost, but there's no reason to use StableHLO with XRT anyway.

@will-cromar (Collaborator, Author) commented:

Also, I was hitting a weird build failure on this branch until I updated the CI cache silo name. I wonder if that's why the wheel build is failing. I'll try updating the silo name in this branch.

@will-cromar marked this pull request as ready for review on September 15, 2023 17:49
@will-cromar merged commit 7c32c0f into xrt on Sep 15, 2023
3 of 4 checks passed