Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add XRT nightly builds #5261

Merged
merged 2 commits into from
Jun 28, 2023
Merged

Add XRT nightly builds #5261

merged 2 commits into from
Jun 28, 2023

Conversation

will-cromar
Copy link
Collaborator

Add nightly builds targeting the XRT branch. Add XRT to the Docker image tag and wheel destination to prevent conflicts with the main branch's nightly builds. Am I missing anything else?

terraform plan:

Terraform will perform the following actions:

  # module.xrt_nightly_builds["3.10_tpuvm"].module.cloud_build.google_cloudbuild_trigger.trigger will be created
  + resource "google_cloudbuild_trigger" "trigger" {
      + create_time = (known after apply)
      + description = "Builds nightly xla:nightly_3.10_tpuvm' TPU docker image and corresponding wheels for PyTorch/XLA. Trigger managed by Terraform setup in infra/tpu-pytorch-releases/artifacts_builds.tf."
      + id          = (known after apply)
      + location    = "us-central1"
      + name        = "nightly-xrt-3-10-tpuvm"
      + project     = (known after apply)
      + trigger_id  = (known after apply)

      + approval_config {
          + approval_required = (known after apply)
        }

      + build {
          + timeout = "21600s"

          + options {
              + dynamic_substitutions = true
              + substitution_option   = "ALLOW_LOOSE"
              + worker_pool           = "projects/tpu-pytorch-releases/locations/us-central1/workerPools/main"
            }

          + step {
              + args = [
                  + "fetch",
                  + "--unshallow",
                ]
              + id   = "git_fetch"
              + name = "gcr.io/cloud-builders/git"
            }
          + step {
              + args = [
                  + "checkout",
                  + "$COMMIT_SHA",
                ]
              + id   = "git_checkout"
              + name = "gcr.io/cloud-builders/git"
            }
          + step {
              + args       = [
                  + "-c",
                  + "docker build --progress=plain --network=cloudbuild -f=Dockerfile . --build-arg=python_version=3.10 --build-arg=ansible_vars='{\"accelerator\":\"tpu\",\"arch\":\"amd64\",\"cuda_version\":\"11.8\",\"nightly_release\":\"true\",\"package_version\":\"2.1.0\",\"python_version\":\"3.10\",\"pytorch_git_rev\":\"main\",\"xla_git_rev\":\"$COMMIT_SHA\"}' -t=\"us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:$(echo nightly_xrt_3.10_tpuvm)\" -t=\"us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:$(echo nightly_xrt_3.10_tpuvm_$(date +%Y%m%d))\" -t=local_image",
                ]
              + dir        = "infra/ansible"
              + entrypoint = "bash"
              + id         = "build_xla_docker_image"
              + name       = "gcr.io/cloud-builders/docker"
            }
          + step {
              + args       = [
                  + "-c",
                  + "docker push --all-tags us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla",
                ]
              + entrypoint = "bash"
              + id         = "push_xla_docker_image"
              + name       = "gcr.io/cloud-builders/docker"
            }
          + step {
              + args       = [
                  + "-c",
                  + "echo The following wheels will be published && ls /dist/*.whl && cp /dist/*.whl /wheels",
                ]
              + entrypoint = "bash"
              + id         = "copy_wheels_to_volume"
              + name       = "local_image"

              + volumes {
                  + name = "wheels"
                  + path = "/wheels"
                }
            }
          + step {
              + args       = [
                  + "-c",
                  + "gsutil cp /wheels/*.whl gs://pytorch-xla-releases/wheels/xrt/tpuvm",
                ]
              + entrypoint = "bash"
              + id         = "upload_wheels_to_storage_bucket"
              + name       = "gcr.io/cloud-builders/gsutil"

              + volumes {
                  + name = "wheels"
                  + path = "/wheels"
                }
            }
        }

      + source_to_build {
          + ref       = "refs/heads/xrt"
          + repo_type = "GITHUB"
          + uri       = "https://github.com/pytorch/xla"
        }
    }

  # module.xrt_nightly_builds["3.8_cuda_12.0"].module.cloud_build.google_cloudbuild_trigger.trigger will be created
  + resource "google_cloudbuild_trigger" "trigger" {
      + create_time = (known after apply)
      + description = "Builds nightly xla:nightly_3.8_cuda_12.0' CUDA 12.0 docker image and corresponding wheels for PyTorch/XLA. Trigger managed by Terraform setup in infra/tpu-pytorch-releases/artifacts_builds.tf."
      + id          = (known after apply)
      + location    = "us-central1"
      + name        = "nightly-xrt-3-8-cuda-12-0"
      + project     = (known after apply)
      + trigger_id  = (known after apply)

      + approval_config {
          + approval_required = (known after apply)
        }

      + build {
          + timeout = "21600s"

          + options {
              + dynamic_substitutions = true
              + substitution_option   = "ALLOW_LOOSE"
              + worker_pool           = "projects/tpu-pytorch-releases/locations/us-central1/workerPools/main"
            }

          + step {
              + args = [
                  + "fetch",
                  + "--unshallow",
                ]
              + id   = "git_fetch"
              + name = "gcr.io/cloud-builders/git"
            }
          + step {
              + args = [
                  + "checkout",
                  + "$COMMIT_SHA",
                ]
              + id   = "git_checkout"
              + name = "gcr.io/cloud-builders/git"
            }
          + step {
              + args       = [
                  + "-c",
                  + "docker build --progress=plain --network=cloudbuild -f=Dockerfile . --build-arg=python_version=3.8 --build-arg=ansible_vars='{\"accelerator\":\"cuda\",\"arch\":\"amd64\",\"cuda_version\":\"12.0\",\"nightly_release\":\"true\",\"package_version\":\"2.1.0\",\"python_version\":\"3.8\",\"pytorch_git_rev\":\"main\",\"xla_git_rev\":\"$COMMIT_SHA\"}' -t=\"us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:$(echo nightly_xrt_3.8_cuda_12.0)\" -t=\"us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:$(echo nightly_xrt_3.8_cuda_12.0_$(date +%Y%m%d))\" -t=local_image",
                ]
              + dir        = "infra/ansible"
              + entrypoint = "bash"
              + id         = "build_xla_docker_image"
              + name       = "gcr.io/cloud-builders/docker"
            }
          + step {
              + args       = [
                  + "-c",
                  + "docker push --all-tags us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla",
                ]
              + entrypoint = "bash"
              + id         = "push_xla_docker_image"
              + name       = "gcr.io/cloud-builders/docker"
            }
          + step {
              + args       = [
                  + "-c",
                  + "echo The following wheels will be published && ls /dist/*.whl && cp /dist/*.whl /wheels",
                ]
              + entrypoint = "bash"
              + id         = "copy_wheels_to_volume"
              + name       = "local_image"

              + volumes {
                  + name = "wheels"
                  + path = "/wheels"
                }
            }
          + step {
              + args       = [
                  + "-c",
                  + "gsutil cp /wheels/*.whl gs://pytorch-xla-releases/wheels/xrt/cuda/12.0",
                ]
              + entrypoint = "bash"
              + id         = "upload_wheels_to_storage_bucket"
              + name       = "gcr.io/cloud-builders/gsutil"

              + volumes {
                  + name = "wheels"
                  + path = "/wheels"
                }
            }
        }

      + source_to_build {
          + ref       = "refs/heads/xrt"
          + repo_type = "GITHUB"
          + uri       = "https://github.com/pytorch/xla"
        }
    }

  # module.xrt_nightly_builds["3.10_tpuvm"].module.cloud_build.module.schedule_triggers[0].google_cloud_scheduler_job.trigger_schedule will be created
  + resource "google_cloud_scheduler_job" "trigger_schedule" {
      + id        = (known after apply)
      + name      = "nightly-xrt-3-10-tpuvm-schedule"
      + paused    = (known after apply)
      + project   = (known after apply)
      + region    = (known after apply)
      + schedule  = "0 0 * * *"
      + state     = (known after apply)
      + time_zone = "America/Los_Angeles"

      + http_target {
          + http_method = "POST"
          + uri         = (known after apply)

          + oauth_token {
              + service_account_email = "build-triggers-scheduler@tpu-pytorch-releases.iam.gserviceaccount.com"
            }
        }
    }

  # module.xrt_nightly_builds["3.8_cuda_12.0"].module.cloud_build.module.schedule_triggers[0].google_cloud_scheduler_job.trigger_schedule will be created
  + resource "google_cloud_scheduler_job" "trigger_schedule" {
      + id        = (known after apply)
      + name      = "nightly-xrt-3-8-cuda-12-0-schedule"
      + paused    = (known after apply)
      + project   = (known after apply)
      + region    = (known after apply)
      + schedule  = "0 0 * * *"
      + state     = (known after apply)
      + time_zone = "America/Los_Angeles"

      + http_target {
          + http_method = "POST"
          + uri         = (known after apply)

          + oauth_token {
              + service_account_email = "build-triggers-scheduler@tpu-pytorch-releases.iam.gserviceaccount.com"
            }
        }
    }

Plan: 4 to add, 0 to change, 0 to destroy.

xrt_nightly_builds = [
{
accelerator = "tpu"
python_version = "3.10"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have we decided that for 2.1 release we are only going to build 3.10 whl? I think ideally we should have 3.8 as well, so user on old TPUVM don't need to delete and recreate their VM....

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are other important update in the new base image. IMO it's better to create a new TPU VM.

@JackCaoG
Copy link
Collaborator

Is this pr ready to merge?

"infra/tpu-pytorch-releases/artifacts_builds.tf."
])

wheels_dest = "${module.releases_storage_bucket.url}/wheels/xrt/${
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good, the xrt directory is needed, as these wheels will have the same name as non-xrt wheels.

Copy link
Collaborator

@mateuszlewko mateuszlewko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

I'll merge and see if there are any problems.

@mateuszlewko mateuszlewko merged commit 2e6b9bb into master Jun 28, 2023
7 checks passed
@mateuszlewko
Copy link
Collaborator

Looks like it worked fine, the wheels have been published.

JackCaoG pushed a commit that referenced this pull request Jul 10, 2023
* Add XRT nightly builds

* remove space
JackCaoG added a commit that referenced this pull request Jul 12, 2023
* Skip calling as_strided in empty_strided_symint if the input has dynamic dimensions. (#5239)

* Skip calling as_strided in empty_strided_symint.

* only return empty_symint conditionally.

* add a comment

* Add XRT nightly builds (#5261)

* Add XRT nightly builds

* remove space

* Add ToString method for both PjrtData and PjrtShardedData (#5265)

* Add ToString method for both PjrtData and PjrtShardedData

* on cpu same config will become replicated, dont't check actual op sharding type

* fix xrt tostring

* Update Sharded graph HLO dumping (#5266)

* Disable Bazel remote cache for forked PR (#5259)

* disable bazel remote cache if gcloud key is empty

* remove remote cache from setup.py

* experiment with debug msg

* fix flag

* add more logs

* skip remote chache if credential file is empty

* add comment

* add logs

* add check in test and coverage script

* fix condition in coverage test

* advance branch pr

* allow remote cache if gloud file isn't specified explicitly

* remove dummy comment

* Suppress debug symbols in OpenXLA code (#5269)

* [SPMD] Sharding n-d tensor on (n+1)-d Mesh (#5268)

* Make TPU detection more robust (#5271)

* Clean bazel stuff on distutils clean. (#5274)

* Clean bazel stuff on distutils clean

* Fix python formatting

* fix conflict

* Fix the error when export_torch_model is given a non-tensor (#5277)

However the generated StableHLO graph still hardcodes the
non-tensor value. this is not correct, will fix later.

* Dsiable test_simple_model_with_different_input_shape since it is curretnly broken by pytorch (#5282)

* Always do build_ext in python setup.py develop (#5273)

Bazel should figure out that _XLAC.so is current
or not, and trigger rebuild if any cpp files changed.

* Remove or improve several hardcoded TPU test conditions (#5272)

* Remove or improve several hardcoded TPU test conditions

* Fix test condition

* Add `runtime.host_index` (#5283)

* Make it an error if calling sizes() on a dynamic tensor. (#4998)

* Err if calling sizes() on dynamic tensor

* try to set has_symbolic_sizes_strides_

* resolve merge conflict

* enable CONTINUE_ON_ERROR

* fixed the python test test_SizeEq_should_not_compile_for_identical_symints

* fix test_index_types

* set CONTINUE_ON_ERROR to true

* remove some unwanted code.

* add a print

* directly set has_symbolic_sizes_strides_ = true

* make some fixes.

* fix empty_strided_symint

* ran linter

* change error type in the test.

* fix comments

* ran linter

* Fix the error where mark_step does not materalize tensors on SPMD:0 (#5281)

* Fix the error where mark_step does not materalize tensors on SPMD:0

* typo

* fix test_non_tensor_scalar

* Disable torch._dynamo.config.automatic_dynamic_shapes (#5285)

* Set torch._dynamo.config.automatic_dynamic_shapes to False

* Enable DynamoInferenceBasicTest.test_simple_model_with_different_input_shape

* [Traceable Collecive] Hide token for all_gather (#5232)

Summary:
This pull request does the following:
1. It hides token for all_gather.
2. It folds the out-of-place all_gather into the regular all_gather.
3. It fixes an issue with the last all_reduce_in_place PR where it forgot to set the token.

Test Plan:
PJRT_DEVICE=TPU python test/test_mp_all_gather.py

* Lower squeeze.dims (#5286)

* avoid copy proto in PrepareOutputShardingPropagation (#5287)

* Revert "Suppress debug symbols in OpenXLA code (#5269)"

This reverts commit 3967d7b.

* Revert "fix conflict"

This reverts commit e91ad3a.

---------

Co-authored-by: iefgnoix <isaacwxf23@gmail.com>
Co-authored-by: Will Cromar <wcromar@google.com>
Co-authored-by: Siyuan Liu <lsiyuan@google.com>
Co-authored-by: stgpetrovic <stgpetrovic@gmail.com>
Co-authored-by: Mohit Khatwani <118776932+khatwanimohit@users.noreply.github.com>
Co-authored-by: qihqi <hanq@google.com>
Co-authored-by: Wonjoo Lee <wonjoo@google.com>
Co-authored-by: Jiewen Tan <jwtan@google.com>
Co-authored-by: Baole Ai <baoleai01@gmail.com>
jonb377 pushed a commit that referenced this pull request Jul 14, 2023
* initiak commit

* Add test workflow for `xrt` branch (#5241)

* Add test workflow for `xrt` branch

* Only run for PRs targeting XRT branch

* Add function to generate stablehlo based callable from pytorch model (#5216)

* Add function to generate stablehlo based callable from pytorch model

Added function
`torch_xla.experimental.stablehlo_saved_model.export_pytorch_model`.
This function will take a pytorch Module and convert it into stablehlo
bytecode.

* Only run the main CI workflow on PRs targeting master and release branches (#5244)

* Only run main CI for master and release branches.

* Disabling XRT tests on main CI

* AMP for TPUs v3 (#5161)

* remove duplicate autocast_test (#5246)

* Remove `test_experimental_pjrt_tpu.py` from TPU CI (#5247)

* Install `expecttest` in xla_test_job.yaml (#5252)

* Add IAM roles for cloudbuild_editors (#5251)

* [Functionalization] Remove view in view_symint (#5231)

* [Functionalization] Remove view in view_symint

Summary:
This pull request removes views in tensor_method::view_symint.

Test Plan:
XLA_DISABLE_FUNCTIONALIZATION=1 PJRT_DEVICE=TPU python ../test/test_view_ops.py -v -k TestViewOpsXLA.test_view_view
PJRT_DEVICE=TPU python ../test/test_view_ops.py -v -k TestViewOpsXLA.test_view_view

* Fix linters

* fixed the test

* ran the linter

---------

Co-authored-by: Xiongfei Wei <isaacwxf23@gmail.com>

* Delete XRT from the main branch (#5240)

* Delete XRT from the main branch

* Remove dead import

* formatting

* Remove disable_xrt build option

* Fix runtime init

* Revert "Remove disable_xrt build option"

This reverts commit ba312e7.

* Add disable XRT option back

* formatting

* Prune mesh service

* Remove obsolete test

* Remove other run server script

* Remove XRT config

* Update PJRT default device test

* Add a file I forgot to save

* if using_pjrt -> @requires_pjrt

* Remove irrelevant test case

* Remove XRT env vars

* fix md link

* formatting

* Remove extra `requires_pjrt`

* merge conflicts

* Add other autocast back

* Add nightly build for cuda 12 (#5253)

* Fix the linter command in the CI (#5254)

* fix linter command

* ran linter

* Jack cao g/fix spmd buff is null (#5256)

* Fix that non-tensor scalar can't be handled by virtual device

* add test

* comment

* Skip calling as_strided in empty_strided_symint if the input has dynamic dimensions. (#5239)

* Skip calling as_strided in empty_strided_symint.

* only return empty_symint conditionally.

* add a comment

* Add XRT nightly builds (#5261)

* Add XRT nightly builds

* remove space

* [OpenXLA] Migrate to pull XLA from OpenXLA (#5202)

PyTorch/XLA migrate to pull XLA from OpenXLA by replacing TensorFlow with OpenXLA after deprecating XRT usage, and replace TensorFlow-pin with OpenXLA-pin to May09

* Add ToString method for both PjrtData and PjrtShardedData (#5265)

* Add ToString method for both PjrtData and PjrtShardedData

* on cpu same config will become replicated, dont't check actual op sharding type

* Update Sharded graph HLO dumping (#5266)

* Enable PjRt Client Compilation with StableHLO (#5233)

* Enable xla PjRt client compilation with StableHLO

* add XLA_STABLEHLO_COMPILE to configuration.yaml

* fix merge conflict

* dummy commit to trigger ci

* Revert "dummy commit to trigger ci"

This reverts commit f7aec23.

* Disable Bazel remote cache for forked PR (#5259)

* disable bazel remote cache if gcloud key is empty

* remove remote cache from setup.py

* experiment with debug msg

* fix flag

* add more logs

* skip remote chache if credential file is empty

* add comment

* add logs

* add check in test and coverage script

* fix condition in coverage test

* advance branch pr

* allow remote cache if gloud file isn't specified explicitly

* remove dummy comment

* Suppress debug symbols in OpenXLA code (#5269)

* [SPMD] Sharding n-d tensor on (n+1)-d Mesh (#5268)

* Make TPU detection more robust (#5271)

* Clean bazel stuff on distutils clean. (#5274)

* Clean bazel stuff on distutils clean

* Fix python formatting

* Delete unused .so file, and .lds files (#5275)

* [OpenXLA] Delete unused .so file and .lds files

* Fix the error when export_torch_model is given a non-tensor (#5277)

However the generated StableHLO graph still hardcodes the
non-tensor value. this is not correct, will fix later.

* Dsiable test_simple_model_with_different_input_shape since it is curretnly broken by pytorch (#5282)

* Always do build_ext in python setup.py develop (#5273)

Bazel should figure out that _XLAC.so is current
or not, and trigger rebuild if any cpp files changed.

* Remove or improve several hardcoded TPU test conditions (#5272)

* Remove or improve several hardcoded TPU test conditions

* Fix test condition

* Add `runtime.host_index` (#5283)

* Make it an error if calling sizes() on a dynamic tensor. (#4998)

* Err if calling sizes() on dynamic tensor

* try to set has_symbolic_sizes_strides_

* resolve merge conflict

* enable CONTINUE_ON_ERROR

* fixed the python test test_SizeEq_should_not_compile_for_identical_symints

* fix test_index_types

* set CONTINUE_ON_ERROR to true

* remove some unwanted code.

* add a print

* directly set has_symbolic_sizes_strides_ = true

* make some fixes.

* fix empty_strided_symint

* ran linter

* change error type in the test.

* fix comments

* ran linter

* Fix the error where mark_step does not materalize tensors on SPMD:0 (#5281)

* Fix the error where mark_step does not materalize tensors on SPMD:0

* typo

* fix test_non_tensor_scalar

* Disable torch._dynamo.config.automatic_dynamic_shapes (#5285)

* Set torch._dynamo.config.automatic_dynamic_shapes to False

* Enable DynamoInferenceBasicTest.test_simple_model_with_different_input_shape

* run linter

* wrap only if sharding type is non-replicated

* Handle non-tensors

* run linter

* Call wrap_if_sharded first

* Add exception in test for unsharded tensor

* fix test

* Use torch.Tensor instead of torch.tensor

* use .cpu() only for tensors

---------

Co-authored-by: Will Cromar <wcromar@google.com>
Co-authored-by: qihqi <hanq@google.com>
Co-authored-by: Meghan Cowan <cowanmeg@google.com>
Co-authored-by: Mateusz Lewko <mateusz.lewko@gmail.com>
Co-authored-by: Jiewen Tan <jwtan@google.com>
Co-authored-by: Xiongfei Wei <isaacwxf23@gmail.com>
Co-authored-by: Wonjoo Lee <wonjoo@google.com>
Co-authored-by: JackCaoG <59073027+JackCaoG@users.noreply.github.com>
Co-authored-by: Manfei <41607353+ManfeiBai@users.noreply.github.com>
Co-authored-by: Siyuan Liu <lsiyuan@google.com>
Co-authored-by: stgpetrovic <stgpetrovic@gmail.com>
Co-authored-by: Mohit Khatwani <118776932+khatwanimohit@users.noreply.github.com>
khatwanimohit added a commit that referenced this pull request Jul 17, 2023
* initiak commit

* Add test workflow for `xrt` branch (#5241)

* Add test workflow for `xrt` branch

* Only run for PRs targeting XRT branch

* Add function to generate stablehlo based callable from pytorch model (#5216)

* Add function to generate stablehlo based callable from pytorch model

Added function
`torch_xla.experimental.stablehlo_saved_model.export_pytorch_model`.
This function will take a pytorch Module and convert it into stablehlo
bytecode.

* Only run the main CI workflow on PRs targeting master and release branches (#5244)

* Only run main CI for master and release branches.

* Disabling XRT tests on main CI

* AMP for TPUs v3 (#5161)

* remove duplicate autocast_test (#5246)

* Remove `test_experimental_pjrt_tpu.py` from TPU CI (#5247)

* Install `expecttest` in xla_test_job.yaml (#5252)

* Add IAM roles for cloudbuild_editors (#5251)

* [Functionalization] Remove view in view_symint (#5231)

* [Functionalization] Remove view in view_symint

Summary:
This pull request removes views in tensor_method::view_symint.

Test Plan:
XLA_DISABLE_FUNCTIONALIZATION=1 PJRT_DEVICE=TPU python ../test/test_view_ops.py -v -k TestViewOpsXLA.test_view_view
PJRT_DEVICE=TPU python ../test/test_view_ops.py -v -k TestViewOpsXLA.test_view_view

* Fix linters

* fixed the test

* ran the linter

---------

Co-authored-by: Xiongfei Wei <isaacwxf23@gmail.com>

* Delete XRT from the main branch (#5240)

* Delete XRT from the main branch

* Remove dead import

* formatting

* Remove disable_xrt build option

* Fix runtime init

* Revert "Remove disable_xrt build option"

This reverts commit ba312e7.

* Add disable XRT option back

* formatting

* Prune mesh service

* Remove obsolete test

* Remove other run server script

* Remove XRT config

* Update PJRT default device test

* Add a file I forgot to save

* if using_pjrt -> @requires_pjrt

* Remove irrelevant test case

* Remove XRT env vars

* fix md link

* formatting

* Remove extra `requires_pjrt`

* merge conflicts

* Add other autocast back

* Add nightly build for cuda 12 (#5253)

* Fix the linter command in the CI (#5254)

* fix linter command

* ran linter

* Jack cao g/fix spmd buff is null (#5256)

* Fix that non-tensor scalar can't be handled by virtual device

* add test

* comment

* Skip calling as_strided in empty_strided_symint if the input has dynamic dimensions. (#5239)

* Skip calling as_strided in empty_strided_symint.

* only return empty_symint conditionally.

* add a comment

* Add XRT nightly builds (#5261)

* Add XRT nightly builds

* remove space

* [OpenXLA] Migrate to pull XLA from OpenXLA (#5202)

PyTorch/XLA migrate to pull XLA from OpenXLA by replacing TensorFlow with OpenXLA after deprecating XRT usage, and replace TensorFlow-pin with OpenXLA-pin to May09

* Add ToString method for both PjrtData and PjrtShardedData (#5265)

* Add ToString method for both PjrtData and PjrtShardedData

* on cpu same config will become replicated, dont't check actual op sharding type

* Update Sharded graph HLO dumping (#5266)

* Enable PjRt Client Compilation with StableHLO (#5233)

* Enable xla PjRt client compilation with StableHLO

* add XLA_STABLEHLO_COMPILE to configuration.yaml

* fix merge conflict

* dummy commit to trigger ci

* Revert "dummy commit to trigger ci"

This reverts commit f7aec23.

* Disable Bazel remote cache for forked PR (#5259)

* disable bazel remote cache if gcloud key is empty

* remove remote cache from setup.py

* experiment with debug msg

* fix flag

* add more logs

* skip remote chache if credential file is empty

* add comment

* add logs

* add check in test and coverage script

* fix condition in coverage test

* advance branch pr

* allow remote cache if gloud file isn't specified explicitly

* remove dummy comment

* Suppress debug symbols in OpenXLA code (#5269)

* [SPMD] Sharding n-d tensor on (n+1)-d Mesh (#5268)

* Make TPU detection more robust (#5271)

* Clean bazel stuff on distutils clean. (#5274)

* Clean bazel stuff on distutils clean

* Fix python formatting

* Delete unused .so file, and .lds files (#5275)

* [OpenXLA] Delete unused .so file and .lds files

* Fix the error when export_torch_model is given a non-tensor (#5277)

However the generated StableHLO graph still hardcodes the
non-tensor value. this is not correct, will fix later.

* Dsiable test_simple_model_with_different_input_shape since it is curretnly broken by pytorch (#5282)

* Always do build_ext in python setup.py develop (#5273)

Bazel should figure out that _XLAC.so is current
or not, and trigger rebuild if any cpp files changed.

* Remove or improve several hardcoded TPU test conditions (#5272)

* Remove or improve several hardcoded TPU test conditions

* Fix test condition

* Add `runtime.host_index` (#5283)

* Make it an error if calling sizes() on a dynamic tensor. (#4998)

* Err if calling sizes() on dynamic tensor

* try to set has_symbolic_sizes_strides_

* resolve merge conflict

* enable CONTINUE_ON_ERROR

* fixed the python test test_SizeEq_should_not_compile_for_identical_symints

* fix test_index_types

* set CONTINUE_ON_ERROR to true

* remove some unwanted code.

* add a print

* directly set has_symbolic_sizes_strides_ = true

* make some fixes.

* fix empty_strided_symint

* ran linter

* change error type in the test.

* fix comments

* ran linter

* Fix the error where mark_step does not materalize tensors on SPMD:0 (#5281)

* Fix the error where mark_step does not materalize tensors on SPMD:0

* typo

* fix test_non_tensor_scalar

* Disable torch._dynamo.config.automatic_dynamic_shapes (#5285)

* Set torch._dynamo.config.automatic_dynamic_shapes to False

* Enable DynamoInferenceBasicTest.test_simple_model_with_different_input_shape

* run linter

* wrap only if sharding type is non-replicated

* Handle non-tensors

* run linter

* Call wrap_if_sharded first

* Add exception in test for unsharded tensor

* fix test

* Use torch.Tensor instead of torch.tensor

* use .cpu() only for tensors

---------

Co-authored-by: Will Cromar <wcromar@google.com>
Co-authored-by: qihqi <hanq@google.com>
Co-authored-by: Meghan Cowan <cowanmeg@google.com>
Co-authored-by: Mateusz Lewko <mateusz.lewko@gmail.com>
Co-authored-by: Jiewen Tan <jwtan@google.com>
Co-authored-by: Xiongfei Wei <isaacwxf23@gmail.com>
Co-authored-by: Wonjoo Lee <wonjoo@google.com>
Co-authored-by: JackCaoG <59073027+JackCaoG@users.noreply.github.com>
Co-authored-by: Manfei <41607353+ManfeiBai@users.noreply.github.com>
Co-authored-by: Siyuan Liu <lsiyuan@google.com>
Co-authored-by: stgpetrovic <stgpetrovic@gmail.com>
Co-authored-by: Mohit Khatwani <118776932+khatwanimohit@users.noreply.github.com>
khatwanimohit added a commit that referenced this pull request Jul 20, 2023
* initiak commit

* Add test workflow for `xrt` branch (#5241)

* Add test workflow for `xrt` branch

* Only run for PRs targeting XRT branch

* Add function to generate stablehlo based callable from pytorch model (#5216)

* Add function to generate stablehlo based callable from pytorch model

Added function
`torch_xla.experimental.stablehlo_saved_model.export_pytorch_model`.
This function will take a pytorch Module and convert it into stablehlo
bytecode.

* Only run the main CI workflow on PRs targeting master and release branches (#5244)

* Only run main CI for master and release branches.

* Disabling XRT tests on main CI

* AMP for TPUs v3 (#5161)

* remove duplicate autocast_test (#5246)

* Remove `test_experimental_pjrt_tpu.py` from TPU CI (#5247)

* Install `expecttest` in xla_test_job.yaml (#5252)

* Add IAM roles for cloudbuild_editors (#5251)

* [Functionalization] Remove view in view_symint (#5231)

* [Functionalization] Remove view in view_symint

Summary:
This pull request removes views in tensor_method::view_symint.

Test Plan:
XLA_DISABLE_FUNCTIONALIZATION=1 PJRT_DEVICE=TPU python ../test/test_view_ops.py -v -k TestViewOpsXLA.test_view_view
PJRT_DEVICE=TPU python ../test/test_view_ops.py -v -k TestViewOpsXLA.test_view_view

* Fix linters

* fixed the test

* ran the linter

---------

Co-authored-by: Xiongfei Wei <isaacwxf23@gmail.com>

* Delete XRT from the main branch (#5240)

* Delete XRT from the main branch

* Remove dead import

* formatting

* Remove disable_xrt build option

* Fix runtime init

* Revert "Remove disable_xrt build option"

This reverts commit ba312e7.

* Add disable XRT option back

* formatting

* Prune mesh service

* Remove obsolete test

* Remove other run server script

* Remove XRT config

* Update PJRT default device test

* Add a file I forgot to save

* if using_pjrt -> @requires_pjrt

* Remove irrelevant test case

* Remove XRT env vars

* fix md link

* formatting

* Remove extra `requires_pjrt`

* merge conflicts

* Add other autocast back

* Add nightly build for cuda 12 (#5253)

* Fix the linter command in the CI (#5254)

* fix linter command

* ran linter

* Jack cao g/fix spmd buff is null (#5256)

* Fix that non-tensor scalar can't be handled by virtual device

* add test

* comment

* Skip calling as_strided in empty_strided_symint if the input has dynamic dimensions. (#5239)

* Skip calling as_strided in empty_strided_symint.

* only return empty_symint conditionally.

* add a comment

* Add XRT nightly builds (#5261)

* Add XRT nightly builds

* remove space

* [OpenXLA] Migrate to pull XLA from OpenXLA (#5202)

PyTorch/XLA migrate to pull XLA from OpenXLA by replacing TensorFlow with OpenXLA after deprecating XRT usage, and replace TensorFlow-pin with OpenXLA-pin to May09

* Add ToString method for both PjrtData and PjrtShardedData (#5265)

* Add ToString method for both PjrtData and PjrtShardedData

* on cpu same config will become replicated, dont't check actual op sharding type

* Update Sharded graph HLO dumping (#5266)

* Enable PjRt Client Compilation with StableHLO (#5233)

* Enable xla PjRt client compilation with StableHLO

* add XLA_STABLEHLO_COMPILE to configuration.yaml

* fix merge conflict

* dummy commit to trigger ci

* Revert "dummy commit to trigger ci"

This reverts commit f7aec23.

* Disable Bazel remote cache for forked PR (#5259)

* disable bazel remote cache if gcloud key is empty

* remove remote cache from setup.py

* experiment with debug msg

* fix flag

* add more logs

* skip remote chache if credential file is empty

* add comment

* add logs

* add check in test and coverage script

* fix condition in coverage test

* advance branch pr

* allow remote cache if gloud file isn't specified explicitly

* remove dummy comment

* Suppress debug symbols in OpenXLA code (#5269)

* [SPMD] Sharding n-d tensor on (n+1)-d Mesh (#5268)

* Make TPU detection more robust (#5271)

* Clean bazel stuff on distutils clean. (#5274)

* Clean bazel stuff on distutils clean

* Fix python formatting

* Delete unused .so file, and .lds files (#5275)

* [OpenXLA] Delete unused .so file and .lds files

* Fix the error when export_torch_model is given a non-tensor (#5277)

However the generated StableHLO graph still hardcodes the
non-tensor value. this is not correct, will fix later.

* Dsiable test_simple_model_with_different_input_shape since it is curretnly broken by pytorch (#5282)

* Always do build_ext in python setup.py develop (#5273)

Bazel should figure out that _XLAC.so is current
or not, and trigger rebuild if any cpp files changed.

* Remove or improve several hardcoded TPU test conditions (#5272)

* Remove or improve several hardcoded TPU test conditions

* Fix test condition

* Add `runtime.host_index` (#5283)

* Make it an error if calling sizes() on a dynamic tensor. (#4998)

* Err if calling sizes() on dynamic tensor

* try to set has_symbolic_sizes_strides_

* resolve merge conflict

* enable CONTINUE_ON_ERROR

* fixed the python test test_SizeEq_should_not_compile_for_identical_symints

* fix test_index_types

* set CONTINUE_ON_ERROR to true

* remove some unwanted code.

* add a print

* directly set has_symbolic_sizes_strides_ = true

* make some fixes.

* fix empty_strided_symint

* ran linter

* change error type in the test.

* fix comments

* ran linter

* Fix the error where mark_step does not materalize tensors on SPMD:0 (#5281)

* Fix the error where mark_step does not materalize tensors on SPMD:0

* typo

* fix test_non_tensor_scalar

* Disable torch._dynamo.config.automatic_dynamic_shapes (#5285)

* Set torch._dynamo.config.automatic_dynamic_shapes to False

* Enable DynamoInferenceBasicTest.test_simple_model_with_different_input_shape

* run linter

* wrap only if sharding type is non-replicated

* Handle non-tensors

* run linter

* Call wrap_if_sharded first

* Add exception in test for unsharded tensor

* fix test

* Use torch.Tensor instead of torch.tensor

* use .cpu() only for tensors

---------

Co-authored-by: Will Cromar <wcromar@google.com>
Co-authored-by: qihqi <hanq@google.com>
Co-authored-by: Meghan Cowan <cowanmeg@google.com>
Co-authored-by: Mateusz Lewko <mateusz.lewko@gmail.com>
Co-authored-by: Jiewen Tan <jwtan@google.com>
Co-authored-by: Xiongfei Wei <isaacwxf23@gmail.com>
Co-authored-by: Wonjoo Lee <wonjoo@google.com>
Co-authored-by: JackCaoG <59073027+JackCaoG@users.noreply.github.com>
Co-authored-by: Manfei <41607353+ManfeiBai@users.noreply.github.com>
Co-authored-by: Siyuan Liu <lsiyuan@google.com>
Co-authored-by: stgpetrovic <stgpetrovic@gmail.com>
Co-authored-by: Mohit Khatwani <118776932+khatwanimohit@users.noreply.github.com>
JackCaoG added a commit that referenced this pull request Jul 21, 2023
* initiak commit

* Add test workflow for `xrt` branch (#5241)

* Add test workflow for `xrt` branch

* Only run for PRs targeting XRT branch

* Add function to generate stablehlo based callable from pytorch model (#5216)

* Add function to generate stablehlo based callable from pytorch model

Added function
`torch_xla.experimental.stablehlo_saved_model.export_pytorch_model`.
This function will take a pytorch Module and convert it into stablehlo
bytecode.

* Only run the main CI workflow on PRs targeting master and release branches (#5244)

* Only run main CI for master and release branches.

* Disabling XRT tests on main CI

* AMP for TPUs v3 (#5161)

* remove duplicate autocast_test (#5246)

* Remove `test_experimental_pjrt_tpu.py` from TPU CI (#5247)

* Install `expecttest` in xla_test_job.yaml (#5252)

* Add IAM roles for cloudbuild_editors (#5251)

* [Functionalization] Remove view in view_symint (#5231)

* [Functionalization] Remove view in view_symint

Summary:
This pull request removes views in tensor_method::view_symint.

Test Plan:
XLA_DISABLE_FUNCTIONALIZATION=1 PJRT_DEVICE=TPU python ../test/test_view_ops.py -v -k TestViewOpsXLA.test_view_view
PJRT_DEVICE=TPU python ../test/test_view_ops.py -v -k TestViewOpsXLA.test_view_view

* Fix linters

* fixed the test

* ran the linter

---------

Co-authored-by: Xiongfei Wei <isaacwxf23@gmail.com>

* Delete XRT from the main branch (#5240)

* Delete XRT from the main branch

* Remove dead import

* formatting

* Remove disable_xrt build option

* Fix runtime init

* Revert "Remove disable_xrt build option"

This reverts commit ba312e7.

* Add disable XRT option back

* formatting

* Prune mesh service

* Remove obsolete test

* Remove other run server script

* Remove XRT config

* Update PJRT default device test

* Add a file I forgot to save

* if using_pjrt -> @requires_pjrt

* Remove irrelevant test case

* Remove XRT env vars

* fix md link

* formatting

* Remove extra `requires_pjrt`

* merge conflicts

* Add other autocast back

* Add nightly build for cuda 12 (#5253)

* Fix the linter command in the CI (#5254)

* fix linter command

* ran linter

* Jack cao g/fix spmd buff is null (#5256)

* Fix that non-tensor scalar can't be handled by virtual device

* add test

* comment

* Skip calling as_strided in empty_strided_symint if the input has dynamic dimensions. (#5239)

* Skip calling as_strided in empty_strided_symint.

* only return empty_symint conditionally.

* add a comment

* Add XRT nightly builds (#5261)

* Add XRT nightly builds

* remove space

* [OpenXLA] Migrate to pull XLA from OpenXLA (#5202)

PyTorch/XLA migrate to pull XLA from OpenXLA by replacing TensorFlow with OpenXLA after deprecating XRT usage, and replace TensorFlow-pin with OpenXLA-pin to May09

* Add ToString method for both PjrtData and PjrtShardedData (#5265)

* Add ToString method for both PjrtData and PjrtShardedData

* on cpu same config will become replicated, dont't check actual op sharding type

* Update Sharded graph HLO dumping (#5266)

* Enable PjRt Client Compilation with StableHLO (#5233)

* Enable xla PjRt client compilation with StableHLO

* add XLA_STABLEHLO_COMPILE to configuration.yaml

* fix merge conflict

* dummy commit to trigger ci

* Revert "dummy commit to trigger ci"

This reverts commit f7aec23.

* Disable Bazel remote cache for forked PR (#5259)

* disable bazel remote cache if gcloud key is empty

* remove remote cache from setup.py

* experiment with debug msg

* fix flag

* add more logs

* skip remote chache if credential file is empty

* add comment

* add logs

* add check in test and coverage script

* fix condition in coverage test

* advance branch pr

* allow remote cache if gloud file isn't specified explicitly

* remove dummy comment

* Suppress debug symbols in OpenXLA code (#5269)

* [SPMD] Sharding n-d tensor on (n+1)-d Mesh (#5268)

* Make TPU detection more robust (#5271)

* Clean bazel stuff on distutils clean. (#5274)

* Clean bazel stuff on distutils clean

* Fix python formatting

* Delete unused .so file, and .lds files (#5275)

* [OpenXLA] Delete unused .so file and .lds files

* Fix the error when export_torch_model is given a non-tensor (#5277)

However the generated StableHLO graph still hardcodes the
non-tensor value. this is not correct, will fix later.

* Dsiable test_simple_model_with_different_input_shape since it is curretnly broken by pytorch (#5282)

* Always do build_ext in python setup.py develop (#5273)

Bazel should figure out that _XLAC.so is current
or not, and trigger rebuild if any cpp files changed.

* Remove or improve several hardcoded TPU test conditions (#5272)

* Remove or improve several hardcoded TPU test conditions

* Fix test condition

* Add `runtime.host_index` (#5283)

* Make it an error if calling sizes() on a dynamic tensor. (#4998)

* Err if calling sizes() on dynamic tensor

* try to set has_symbolic_sizes_strides_

* resolve merge conflict

* enable CONTINUE_ON_ERROR

* fixed the python test test_SizeEq_should_not_compile_for_identical_symints

* fix test_index_types

* set CONTINUE_ON_ERROR to true

* remove some unwanted code.

* add a print

* directly set has_symbolic_sizes_strides_ = true

* make some fixes.

* fix empty_strided_symint

* ran linter

* change error type in the test.

* fix comments

* ran linter

* Fix the error where mark_step does not materalize tensors on SPMD:0 (#5281)

* Fix the error where mark_step does not materalize tensors on SPMD:0

* typo

* fix test_non_tensor_scalar

* Disable torch._dynamo.config.automatic_dynamic_shapes (#5285)

* Set torch._dynamo.config.automatic_dynamic_shapes to False

* Enable DynamoInferenceBasicTest.test_simple_model_with_different_input_shape

* run linter

* wrap only if sharding type is non-replicated

* Handle non-tensors

* run linter

* Call wrap_if_sharded first

* Add exception in test for unsharded tensor

* fix test

* Use torch.Tensor instead of torch.tensor

* use .cpu() only for tensors

---------

Co-authored-by: Will Cromar <wcromar@google.com>
Co-authored-by: qihqi <hanq@google.com>
Co-authored-by: Meghan Cowan <cowanmeg@google.com>
Co-authored-by: Mateusz Lewko <mateusz.lewko@gmail.com>
Co-authored-by: Jiewen Tan <jwtan@google.com>
Co-authored-by: Xiongfei Wei <isaacwxf23@gmail.com>
Co-authored-by: Wonjoo Lee <wonjoo@google.com>
Co-authored-by: JackCaoG <59073027+JackCaoG@users.noreply.github.com>
Co-authored-by: Manfei <41607353+ManfeiBai@users.noreply.github.com>
Co-authored-by: Siyuan Liu <lsiyuan@google.com>
Co-authored-by: stgpetrovic <stgpetrovic@gmail.com>
Co-authored-by: Mohit Khatwani <118776932+khatwanimohit@users.noreply.github.com>
JackCaoG added a commit that referenced this pull request Jul 24, 2023
* Update inline style code to multiline (#5291)

* Fix typo in _test.yml (#5172)

s/metadtaa/metadata/

* [SPMD][Virtual Device]All tensors should be in SPMD:0 C++ device (#5284)

* Move all tensors to SPMD:0 C++ device under spmd context

* fix load shards

* fix test_mark_sharding_2d by not creating placeholder for virtual device

* fix the waitdeviceop for spmd case

* Fix test_shard_hashing

* fix spmd device casting issue

* remove hacks in test_xla_virtual_device.py

* add test for new virtual device usage

* fix review comments

* fix IsTpuDevice

* linter

* Revert pr #2682 (#5215)

* Make README more actionable (#5262)

* Make README more actionable

* move profiling guide link

* text wrapping

* [SPMD] Use xs.Mesh in test_2d_tensor_3d_mesh (#5295)

* use mesh in test_2d_tensor_3d_mesh

* remove attributes patch

* [SPMD] Add FSDP sharding for test_train_spmd_linear_model.py (#5299)

Summary:
This diff adds FSDP sharding for test_train_spmd_linear_model.py.

Test Plan:
PJRT_DEVICE=TPU XLA_USE_SPMD=1 python test/spmd/test_train_spmd_linear_model.py --sharding fsdp

* [SPMD] Avoid recompilations in xs.mark_sharding() (#5300)

Summary:
This pull requests fixes the recompilation issue in xs.mark_sharding().
xtensor->GetXlaData() will compile the program if xtensor is an IR in order
to get the BackendData. I believe this is not intended given the error message
below suggests only data type xtensors are supported.

Test Plan:
PJRT_DEVICE=TPU XLA_USE_SPMD=1 python test/spmd/test_xla_sharding.py

* [SPMD] Support mark_sharding on IRs (#5301)

Summary:
This pull requests fixes the recompilation issue in xs.mark_sharding().
xtensor->GetXlaData() will compile the program if xtensor is an IR in order
to get the BackendData. I believe this is not intended given the error message
below suggests only data type xtensors are supported.

Test Plan:
PJRT_DEVICE=TPU XLA_USE_SPMD=1 python test/spmd/test_xla_sharding.py

* [SPMD] Allow dumping post optimizations hlo (#5302)

Summary:
This pull request partial reverts the change in #5266 to re-enble
dumping post optimizations hlo.

Test Plan:
XLA_USE_SPMD=1 PJRT_DEVICE=TPU python test/spmd/test_xla_sharding.py -v -k test_xla_sharded_hlo_dump_post_optimizations

* Add `_sharded_cpu_state_dict` for distributed checkpointing (#5288)

* initiak commit

* Add test workflow for `xrt` branch (#5241)

* Add test workflow for `xrt` branch

* Only run for PRs targeting XRT branch

* Add function to generate stablehlo based callable from pytorch model (#5216)

* Add function to generate stablehlo based callable from pytorch model

Added function
`torch_xla.experimental.stablehlo_saved_model.export_pytorch_model`.
This function will take a pytorch Module and convert it into stablehlo
bytecode.

* Only run the main CI workflow on PRs targeting master and release branches (#5244)

* Only run main CI for master and release branches.

* Disabling XRT tests on main CI

* AMP for TPUs v3 (#5161)

* remove duplicate autocast_test (#5246)

* Remove `test_experimental_pjrt_tpu.py` from TPU CI (#5247)

* Install `expecttest` in xla_test_job.yaml (#5252)

* Add IAM roles for cloudbuild_editors (#5251)

* [Functionalization] Remove view in view_symint (#5231)

* [Functionalization] Remove view in view_symint

Summary:
This pull request removes views in tensor_method::view_symint.

Test Plan:
XLA_DISABLE_FUNCTIONALIZATION=1 PJRT_DEVICE=TPU python ../test/test_view_ops.py -v -k TestViewOpsXLA.test_view_view
PJRT_DEVICE=TPU python ../test/test_view_ops.py -v -k TestViewOpsXLA.test_view_view

* Fix linters

* fixed the test

* ran the linter

---------

Co-authored-by: Xiongfei Wei <isaacwxf23@gmail.com>

* Delete XRT from the main branch (#5240)

* Delete XRT from the main branch

* Remove dead import

* formatting

* Remove disable_xrt build option

* Fix runtime init

* Revert "Remove disable_xrt build option"

This reverts commit ba312e7.

* Add disable XRT option back

* formatting

* Prune mesh service

* Remove obsolete test

* Remove other run server script

* Remove XRT config

* Update PJRT default device test

* Add a file I forgot to save

* if using_pjrt -> @requires_pjrt

* Remove irrelevant test case

* Remove XRT env vars

* fix md link

* formatting

* Remove extra `requires_pjrt`

* merge conflicts

* Add other autocast back

* Add nightly build for cuda 12 (#5253)

* Fix the linter command in the CI (#5254)

* fix linter command

* ran linter

* Jack cao g/fix spmd buff is null (#5256)

* Fix that non-tensor scalar can't be handled by virtual device

* add test

* comment

* Skip calling as_strided in empty_strided_symint if the input has dynamic dimensions. (#5239)

* Skip calling as_strided in empty_strided_symint.

* only return empty_symint conditionally.

* add a comment

* Add XRT nightly builds (#5261)

* Add XRT nightly builds

* remove space

* [OpenXLA] Migrate to pull XLA from OpenXLA (#5202)

PyTorch/XLA migrate to pull XLA from OpenXLA by replacing TensorFlow with OpenXLA after deprecating XRT usage, and replace TensorFlow-pin with OpenXLA-pin to May09

* Add ToString method for both PjrtData and PjrtShardedData (#5265)

* Add ToString method for both PjrtData and PjrtShardedData

* on cpu same config will become replicated, dont't check actual op sharding type

* Update Sharded graph HLO dumping (#5266)

* Enable PjRt Client Compilation with StableHLO (#5233)

* Enable xla PjRt client compilation with StableHLO

* add XLA_STABLEHLO_COMPILE to configuration.yaml

* fix merge conflict

* dummy commit to trigger ci

* Revert "dummy commit to trigger ci"

This reverts commit f7aec23.

* Disable Bazel remote cache for forked PR (#5259)

* disable bazel remote cache if gcloud key is empty

* remove remote cache from setup.py

* experiment with debug msg

* fix flag

* add more logs

* skip remote chache if credential file is empty

* add comment

* add logs

* add check in test and coverage script

* fix condition in coverage test

* advance branch pr

* allow remote cache if gloud file isn't specified explicitly

* remove dummy comment

* Suppress debug symbols in OpenXLA code (#5269)

* [SPMD] Sharding n-d tensor on (n+1)-d Mesh (#5268)

* Make TPU detection more robust (#5271)

* Clean bazel stuff on distutils clean. (#5274)

* Clean bazel stuff on distutils clean

* Fix python formatting

* Delete unused .so file, and .lds files (#5275)

* [OpenXLA] Delete unused .so file and .lds files

* Fix the error when export_torch_model is given a non-tensor (#5277)

However the generated StableHLO graph still hardcodes the
non-tensor value. this is not correct, will fix later.

* Dsiable test_simple_model_with_different_input_shape since it is curretnly broken by pytorch (#5282)

* Always do build_ext in python setup.py develop (#5273)

Bazel should figure out that _XLAC.so is current
or not, and trigger rebuild if any cpp files changed.

* Remove or improve several hardcoded TPU test conditions (#5272)

* Remove or improve several hardcoded TPU test conditions

* Fix test condition

* Add `runtime.host_index` (#5283)

* Make it an error if calling sizes() on a dynamic tensor. (#4998)

* Err if calling sizes() on dynamic tensor

* try to set has_symbolic_sizes_strides_

* resolve merge conflict

* enable CONTINUE_ON_ERROR

* fixed the python test test_SizeEq_should_not_compile_for_identical_symints

* fix test_index_types

* set CONTINUE_ON_ERROR to true

* remove some unwanted code.

* add a print

* directly set has_symbolic_sizes_strides_ = true

* make some fixes.

* fix empty_strided_symint

* ran linter

* change error type in the test.

* fix comments

* ran linter

* Fix the error where mark_step does not materalize tensors on SPMD:0 (#5281)

* Fix the error where mark_step does not materalize tensors on SPMD:0

* typo

* fix test_non_tensor_scalar

* Disable torch._dynamo.config.automatic_dynamic_shapes (#5285)

* Set torch._dynamo.config.automatic_dynamic_shapes to False

* Enable DynamoInferenceBasicTest.test_simple_model_with_different_input_shape

* run linter

* wrap only if sharding type is non-replicated

* Handle non-tensors

* run linter

* Call wrap_if_sharded first

* Add exception in test for unsharded tensor

* fix test

* Use torch.Tensor instead of torch.tensor

* use .cpu() only for tensors

---------

Co-authored-by: Will Cromar <wcromar@google.com>
Co-authored-by: qihqi <hanq@google.com>
Co-authored-by: Meghan Cowan <cowanmeg@google.com>
Co-authored-by: Mateusz Lewko <mateusz.lewko@gmail.com>
Co-authored-by: Jiewen Tan <jwtan@google.com>
Co-authored-by: Xiongfei Wei <isaacwxf23@gmail.com>
Co-authored-by: Wonjoo Lee <wonjoo@google.com>
Co-authored-by: JackCaoG <59073027+JackCaoG@users.noreply.github.com>
Co-authored-by: Manfei <41607353+ManfeiBai@users.noreply.github.com>
Co-authored-by: Siyuan Liu <lsiyuan@google.com>
Co-authored-by: stgpetrovic <stgpetrovic@gmail.com>
Co-authored-by: Mohit Khatwani <118776932+khatwanimohit@users.noreply.github.com>

* Supoort unordered sharding spec correctly (#5305)

* Supoort non-ordered sharding spec correctly

* use permute instead of transpose

* use dim > 2 to suit TPU v3(otherwise can't be divide evenly)

* Support unordered sharding spec for partial replication (#5316)

* Suport unordered sharding spec for partial replication

* add 4d test

* handle 2d tensor with 2d mesh case

* refactoring

* Fix mismatched GPU docker image in the doc. (#5319)

* quick refactor on _get_group_assignment (#5318)

* Add tf independent serialization (#5308)

Create a serialization format for StableHLO graphs and weights without tf.saved_model

Need to not use tensorflow because tensorflow is no longer dependency of pytorch/xla.
Information saved are enough to reconstruct the tf.saved_model for serving.
Information stored:

* metadata on which tensor maps which input position
* StableHLO version number
* metadata on which tensor corresponds to user input or parameter
* metadata on shape and dtype of each tensor.
* Tensors themselves are saved as numpy arrays using np.save.

* Disable coverage for now (#5321)

* Enable Some input output aliasing under SPMD (#5320)

* Use `_sharded_cpu_state_dict` functionality to Write Items for SPMD Save Planner (#5315)

* initial commit

* add suggested changes

* add unit test

* fix test

* fix test

* add suggested changes

* remove is_sharded_tensor check

* check if device type is xla in `wrap_if_sharded`

* change order

* update resolve_data and add more tests

* run linter

* use subtest

* formatting fixes

* run linter

* handle single tensor for method send_to_device_single (#5317)

* handle single tensor for method send_to_device_single

* fix broadcast parameter

---------

Co-authored-by: Wonjoo Lee <wonjoo@google.com>
Co-authored-by: Nikita Shulga <nshulga@meta.com>
Co-authored-by: iefgnoix <isaacwxf23@gmail.com>
Co-authored-by: Will Cromar <wcromar@google.com>
Co-authored-by: Mohit Khatwani <118776932+khatwanimohit@users.noreply.github.com>
Co-authored-by: Jiewen Tan <jwtan@google.com>
Co-authored-by: Yash Shah <55116947+yashs97@users.noreply.github.com>
Co-authored-by: qihqi <hanq@google.com>
Co-authored-by: Meghan Cowan <cowanmeg@google.com>
Co-authored-by: Mateusz Lewko <mateusz.lewko@gmail.com>
Co-authored-by: Manfei <41607353+ManfeiBai@users.noreply.github.com>
Co-authored-by: Siyuan Liu <lsiyuan@google.com>
Co-authored-by: stgpetrovic <stgpetrovic@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants