Merged
100 commits
f224540
Add basic ansible configuration for bazel and installing apt pkgs
Jan 18, 2023
8b9f75e
Add apt repos and some signing keys
Jan 18, 2023
e4b1d14
Add pip packages
mateuszlewko Jan 19, 2023
83f03a4
Don't use apt-key for adding repo keys
mateuszlewko Jan 19, 2023
ae5bd04
Don't use apt-key for adding repo keys
mateuszlewko Jan 19, 2023
9e20eff
Add fetch_srcs role for fetching PyTorch and XLA repos
mateuszlewko Jan 20, 2023
23edf5f
Add patches application
mateuszlewko Jan 20, 2023
962af4f
Add role for compiling PyTorch and XLA sources
mateuszlewko Jan 20, 2023
e42b986
WIP in build srcs
mateuszlewko Jan 20, 2023
12cdf8d
Successfully build XLA
mateuszlewko Jan 23, 2023
06f9555
Clean-up and merge env variables; Separate stage; arch and accelerato…
mateuszlewko Jan 23, 2023
756cc26
Fix passing env variables; Add missing XLA_SANDBOX_BUILD
mateuszlewko Jan 23, 2023
524cfb6
Rename playbooks dir to ansible
mateuszlewko Jan 23, 2023
aa28b5a
Add cloudbuild file that uses ansible playbook
mateuszlewko Jan 23, 2023
9dafb22
Add 'signed-by' to all apt repos
mateuszlewko Jan 23, 2023
eb5e81d
Add placeholders for release config vars
mateuszlewko Jan 23, 2023
02dd57b
Add release build
mateuszlewko Jan 23, 2023
fd97bb0
Disable verbose ansible in docker build
mateuszlewko Jan 23, 2023
267672f
Add ansible config file and enable displaying tasks duration
mateuszlewko Jan 24, 2023
4e3676d
Add TORCH_XLA_VERSION env variable, which is used when building XLA
mateuszlewko Jan 24, 2023
dd8ecec
Disable Ansible warnings about no inventory; Force git clone; revert …
mateuszlewko Jan 24, 2023
ca16d26
Add basic tests for bazel and fetch_srcs roles
mateuszlewko Jan 24, 2023
86b6b6f
Add import tests for build_srcs
mateuszlewko Jan 24, 2023
c79b451
Set git versions for which imports work
mateuszlewko Jan 27, 2023
ae83150
Pass env vars to imports test
mateuszlewko Jan 27, 2023
a443d13
Add configure_env role and apply minor cleanup
mateuszlewko Jan 30, 2023
44ca33a
Don't replace existing env var entries in /etc/environment
mateuszlewko Jan 30, 2023
2a3170f
Move ansible dir to /docker/experimental
mateuszlewko Jan 30, 2023
8097849
Minor
mateuszlewko Feb 9, 2023
4b3eb79
Add cloudbuild file that builds tpu wheels and dev/release docker images
mateuszlewko Feb 9, 2023
95938b0
Add initial Terraform configuration with bucket for storing state
mateuszlewko Feb 9, 2023
3777a71
Add Google Cloud bucket for wheels
mateuszlewko Feb 10, 2023
0cb6c8b
Add cloudbuild trigger for dev image
mateuszlewko Feb 10, 2023
8e20adb
Remove TF state backup
mateuszlewko Feb 10, 2023
4aa8fb5
Add missing ansible install in release image dockerfile
mateuszlewko Feb 10, 2023
c34fd6c
Add a step to build release image and wheels
mateuszlewko Feb 10, 2023
94fd434
Revert prev change
mateuszlewko Feb 10, 2023
6a99df9
Add ids to buildsteps
mateuszlewko Feb 13, 2023
ecf5578
Add artifacts repo and public permissions for the wheels bucket
mateuszlewko Feb 14, 2023
7a57a1f
Provision worker pool
mateuszlewko Feb 15, 2023
14aa5a8
Correct docker repo url
mateuszlewko Feb 15, 2023
fb085f4
Add trigger for release images
mateuszlewko Feb 15, 2023
3a4fefe
Minor fixes to release image trigger
mateuszlewko Feb 15, 2023
b6e70b1
Use trigger name instead of id in cloud scheduler job
mateuszlewko Feb 15, 2023
075952c
Push wheels from release-image-trigger to wheels bucket
mateuszlewko Feb 15, 2023
cf01e1e
Add some outputs
mateuszlewko Feb 15, 2023
0321c16
Add a todo for wheels naming
mateuszlewko Feb 15, 2023
0ff3559
Correct the list of artifacts for release_images
mateuszlewko Feb 16, 2023
7f61c1e
Propagate cloudbuild triggers from docker_images variable.
mateuszlewko Feb 16, 2023
8af004a
Add dir parameter
mateuszlewko Feb 16, 2023
b466327
Add -c before docker args
mateuszlewko Feb 16, 2023
6c4c79e
Fix docker_images build args
mateuszlewko Feb 16, 2023
0bfe8b8
Pass build args for development image
mateuszlewko Feb 16, 2023
bb6435e
Add many docker images
mateuszlewko Feb 16, 2023
c69a03d
Remove unused cloudbuild file
mateuszlewko Feb 16, 2023
7da3dfa
Pass pytorch and xla git revs to ansible
mateuszlewko Feb 16, 2023
b30aba2
Remove -trigger suffix
mateuszlewko Feb 16, 2023
afa66cd
Set proper name to triggers
mateuszlewko Feb 16, 2023
9cfa91c
Complete readme file
mateuszlewko Feb 16, 2023
5d4ab4c
Include build logs with github status
mateuszlewko Feb 16, 2023
d269fa6
Set object only if there are any wheels specified
mateuszlewko Feb 16, 2023
f09ec16
Add trigger for release images
mateuszlewko Feb 15, 2023
3b65e24
Reduce the number of explicit variables for docker images
mateuszlewko Feb 16, 2023
c8036d7
Rename name parameter to trigger_name
mateuszlewko Feb 16, 2023
a5ff1be
Add 6h timeout to cloud builds
mateuszlewko Feb 16, 2023
4d7891d
Move terraform_cloudbuild to terraform
mateuszlewko Mar 14, 2023
a4ab9dc
Disable clang
mateuszlewko Mar 15, 2023
2c3ecaf
Add trigger for XLA 2.0 on TPUVM
mateuszlewko Mar 15, 2023
e6e292e
Add --progress=plain to docker builds
mateuszlewko Mar 15, 2023
1b791af
Set gcc and g++ explicitly
mateuszlewko Mar 15, 2023
88ae574
Remove Python 3.7 build trigger from Terraform (#4534)
will-cromar Jan 30, 2023
c82b626
Use Ansible for building wheels and provisioning docker images. (#4531)
mateuszlewko Feb 23, 2023
347c7b1
Add cloudbuild file that builds tpu wheels and dev/release docker images
mateuszlewko Feb 9, 2023
c97dc33
Add trigger for release images
mateuszlewko Feb 15, 2023
5eb4557
Move terraform_cloudbuild to terraform
mateuszlewko Mar 14, 2023
51614c9
Rebase fix
mateuszlewko Mar 15, 2023
d5c51e8
Add cuda_version to ansible vars
mateuszlewko Mar 15, 2023
0e69044
Increase verbosity of ansible playbook
mateuszlewko Mar 16, 2023
1fabfd2
Use e2-standard-32 for staging worker pool
mateuszlewko Mar 16, 2023
d46f9d0
Pass git revs to build stage of ansible playbook
mateuszlewko Mar 16, 2023
b1cdde5
Fix passing extra args to ansible
mateuszlewko Mar 16, 2023
32ae9df
Add missing docker build file args
mateuszlewko Mar 16, 2023
6142734
Wrap cuda version in quotes to ensure they're treated as string
mateuszlewko Mar 16, 2023
f2375a2
Set xla_git_rev to HEAD for v2.0 build
mateuszlewko Mar 16, 2023
e0e2640
Fix wrong rebase
mateuszlewko Mar 16, 2023
2207f85
Pass package version to ansible config
mateuszlewko Mar 16, 2023
7fe8115
Set cuda_version to string
mateuszlewko Mar 16, 2023
e525c78
Add debug statement for listing generated wheels
mateuszlewko Mar 16, 2023
6ce0146
Fixes to package_version names
mateuszlewko Mar 16, 2023
b001c64
Format terraform files; use xla branch r2.0 for building v2.0 package
mateuszlewko Mar 17, 2023
059882d
Debugging kaniko
mateuszlewko Mar 17, 2023
e413012
Debugging kaniko
mateuszlewko Mar 17, 2023
5e71a9f
Revert from kaniko to docker builder
mateuszlewko Mar 20, 2023
4839de0
Debugging artifacts
mateuszlewko Mar 20, 2023
c60ad3b
Debugging artifacts
mateuszlewko Mar 20, 2023
b203140
Debugging artifacts
mateuszlewko Mar 20, 2023
1a1d900
Debugging artifacts
mateuszlewko Mar 20, 2023
5498533
Fixed uploading wheels to storage bucket
mateuszlewko Mar 20, 2023
f97b74e
Use tag v2.0.0 for xla sources
mateuszlewko Mar 20, 2023
026af0c
Use master branch for ansible setup version in all triggers
mateuszlewko Mar 20, 2023
56 changes: 56 additions & 0 deletions docker/experimental/ansible/Dockerfile
@@ -0,0 +1,56 @@
ARG python_version=3.8
Collaborator comment: I love how simple the Dockerfile becomes with ansible!

ARG debian_version=buster

FROM python:${python_version}-${debian_version} AS build

RUN pip install ansible

COPY . /ansible
WORKDIR /ansible

ARG arch=amd64
ARG accelerator=tpu
ARG cuda_version=11.8
ARG pytorch_git_rev=HEAD
ARG xla_git_rev=HEAD
ARG package_version

RUN ansible-playbook -vvv playbook.yaml -e \
"stage=build \
arch=${arch} \
accelerator=${accelerator} \
cuda_version=${cuda_version} \
pytorch_git_rev=${pytorch_git_rev} \
xla_git_rev=${xla_git_rev} \
package_version=${package_version}"

FROM python:${python_version}-${debian_version} AS release

WORKDIR /ansible
COPY . /ansible

ARG arch=amd64
ARG accelerator=tpu
ARG cuda_version=11.8
ARG pytorch_git_rev=HEAD
ARG xla_git_rev=HEAD

RUN pip install ansible
RUN ansible-playbook -vvv playbook.yaml -e \
"stage=release \
arch=${arch} \
accelerator=${accelerator} \
cuda_version=${cuda_version} \
pytorch_git_rev=${pytorch_git_rev} \
xla_git_rev=${xla_git_rev} \
" --tags "install_deps"

WORKDIR /dist
COPY --from=build /src/pytorch/dist/*.whl .
COPY --from=build /src/pytorch/xla/dist/*.whl .

RUN echo "Installing the following wheels" && ls /dist/*.whl
RUN pip install *.whl

WORKDIR /
RUN rm -rf /ansible
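The build stage packs all variables into a single `-e` string for `ansible-playbook`. Conceptually, Ansible splits that string on whitespace into `key=value` pairs; a rough Python sketch of that parsing (not Ansible's actual implementation, which also handles quoting and JSON):

```python
def parse_extra_vars(extra: str) -> dict:
    """Split a space-separated key=value string into a dict,
    roughly as `ansible-playbook -e "k1=v1 k2=v2"` does."""
    pairs = (item.split("=", 1) for item in extra.split())
    return {key: value for key, value in pairs}

extra = "stage=build arch=amd64 accelerator=tpu cuda_version=11.8"
vars_ = parse_extra_vars(extra)
print(vars_["stage"], vars_["cuda_version"])  # build 11.8
```

This is why the Dockerfile can wrap the whole `-e` argument in one quoted string spanning several `ARG` interpolations.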
2 changes: 2 additions & 0 deletions docker/experimental/ansible/ansible.cfg
@@ -7,6 +7,8 @@ callbacks_enabled = profile_tasks
# The playbook is only run on the implicit localhost.
# Silence warning about empty hosts inventory.
localhost_warning = False
# Make output human-readable.
stdout_callback = yaml

[inventory]
# Silence warning about no inventory.
16 changes: 8 additions & 8 deletions docker/experimental/ansible/config/apt.yaml
@@ -15,11 +15,11 @@ apt:
- wget

build_cuda:
- cuda-libraries-11-8
- cuda-toolkit-11-8
- cuda-minimal-build-11-8
- libcudnn8=8.8.0.121-1+cuda11.8
- libcudnn8-dev=8.8.0.121-1+cuda11.8
- "cuda-libraries-{{ cuda_version | replace('.', '-') }}"
- "cuda-toolkit-{{ cuda_version | replace('.', '-') }}"
- "cuda-minimal-build-{{ cuda_version | replace('.', '-') }}"
- "{{ cuda_deps['libcudnn'][cuda_version] }}"
- "{{ cuda_deps['libcudnn-dev'][cuda_version] }}"

build_amd64:
- "clang-{{ clang_version }}"
@@ -39,9 +39,9 @@
- patch

release_cuda:
- cuda-libraries-11-8
- cuda-minimal-build-11-8
- libcudnn8=8.8.0.121-1+cuda11.8
- "cuda-libraries-{{ cuda_version | replace('.', '-') }}"
- "cuda-minimal-build-{{ cuda_version | replace('.', '-') }}"
- "{{ cuda_deps['libcudnn'][cuda_version] }}"

# Specify objects with string fields `url` and `keyring`.
# The keyring path should start with /usr/share/keyrings/ for debian and ubuntu.
7 changes: 7 additions & 0 deletions docker/experimental/ansible/config/cuda_deps.yaml
@@ -0,0 +1,7 @@
# Versions of cuda dependencies for given cuda versions.
# Note: wrap versions in quotes to ensure they're treated as strings.
cuda_deps:
libcudnn:
"11.8": libcudnn8=8.8.0.121-1+cuda11.8
libcudnn-dev:
"11.8": libcudnn8-dev=8.8.0.121-1+cuda11.8
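The diff replaces hard-coded CUDA 11.8 package names with lookups keyed by `cuda_version`. A plain-Python sketch of what the Jinja2 expressions in `apt.yaml` (`cuda_version | replace('.', '-')` and `cuda_deps['libcudnn'][cuda_version]`) resolve to:

```python
# Mirror of config/cuda_deps.yaml as a Python dict.
cuda_deps = {
    "libcudnn": {"11.8": "libcudnn8=8.8.0.121-1+cuda11.8"},
    "libcudnn-dev": {"11.8": "libcudnn8-dev=8.8.0.121-1+cuda11.8"},
}
cuda_version = "11.8"

# "cuda-libraries-{{ cuda_version | replace('.', '-') }}"
pkg = f"cuda-libraries-{cuda_version.replace('.', '-')}"
print(pkg)  # cuda-libraries-11-8

# "{{ cuda_deps['libcudnn'][cuda_version] }}"
print(cuda_deps["libcudnn"][cuda_version])  # libcudnn8=8.8.0.121-1+cuda11.8
```

Supporting a new CUDA release then only requires adding a `"12.x"` entry to each map rather than editing every package list.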
14 changes: 10 additions & 4 deletions docker/experimental/ansible/config/env.yaml
@@ -2,8 +2,11 @@
# They'll be accessible for all processes on the host.
release_env:
common:
CC: "clang-{{ clang_version }}"
CXX: "clang++-{{ clang_version }}"
# Force GCC because clang/bazel has issues.
CC: gcc
CXX: g++
# CC: "clang-{{ clang_version }}"
# CXX: "clang++-{{ clang_version }}"
LD_LIBRARY_PATH: "$LD_LIBRARY_PATH:/usr/local/lib"

tpu:
@@ -20,8 +23,11 @@ build_env:
LD_LIBRARY_PATH: "$LD_LIBRARY_PATH:/usr/local/lib"
# Set explicitly to 0 as setup.py defaults this flag to true if unset.
BUILD_CPP_TESTS: 0
CC: "clang-{{ clang_version }}"
CXX: "clang++-{{ clang_version }}"
# Force GCC because clang/bazel has issues.
CC: gcc
CXX: g++
# CC: "clang-{{ clang_version }}"
# CXX: "clang++-{{ clang_version }}"
PYTORCH_BUILD_NUMBER: 1
TORCH_XLA_VERSION: "{{ package_version }}"
PYTORCH_BUILD_VERSION: "{{ package_version }}"
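`env.yaml` defines a `common` group plus per-arch and per-accelerator groups, which the playbook later merges with Jinja2's `combine` filter. In plain Python the merge is an ordinary dict update where later groups override earlier ones; a sketch with hypothetical group contents:

```python
common = {"CC": "gcc", "CXX": "g++", "BUILD_CPP_TESTS": "0"}
arch_env = {"TARGET_ARCH": "amd64"}             # hypothetical per-arch group
accel_env = {"XLA_CUDA": "1", "CC": "gcc-10"}   # hypothetical; overrides CC

# Equivalent of: common | combine(arch_env) | combine(accel_env)
merged = {**common, **arch_env, **accel_env}
print(merged["CC"])  # gcc-10 — the last combine wins
```

The override order matters: accelerator-specific settings can shadow the common defaults, which is how a single `env.yaml` serves both TPU and CUDA builds.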
1 change: 1 addition & 0 deletions docker/experimental/ansible/config/vars.yaml
@@ -1,5 +1,6 @@
# Used for fetching cuda from the right repo, see apt.yaml.
cuda_repo: ubuntu1804
cuda_version: "11.8"
# Used for fetching clang from the right repo, see apt.yaml.
llvm_debian_repo: buster
clang_version: 10
Collaborator comment: It seems that we're going for gcc in the end until the clang/bazel interaction is fixed. We should keep clang around for other things though, so nothing to change here, just FYI.

Collaborator (author) reply: Ack
18 changes: 18 additions & 0 deletions docker/experimental/ansible/development.Dockerfile
@@ -0,0 +1,18 @@
# Dockerfile for building a development image.
# The built image contains all required pip and apt packages for building and
# running PyTorch and PyTorch/XLA. The image doesn't contain any source code.
ARG python_version=3.8
ARG debian_version=buster

FROM python:${python_version}-${debian_version}

RUN pip install ansible

COPY . /ansible
WORKDIR /ansible

ARG arch=amd64
ARG accelerator=tpu

RUN ansible-playbook playbook.yaml -e "stage=build arch=${arch} accelerator=${accelerator}" --skip-tags "fetch_srcs,build_srcs"
RUN ansible-playbook playbook.yaml -e "stage=release arch=${arch} accelerator=${accelerator}" --skip-tags "fetch_srcs,build_srcs"
14 changes: 10 additions & 4 deletions docker/experimental/ansible/playbook.yaml
@@ -12,7 +12,7 @@
ansible.builtin.assert:
that: "{{ lookup('ansible.builtin.vars', item.name) is regex(item.pattern) }}"
fail_msg: |
"Variable '{{ item.name }}' doesn't match pattern '{{ item.pattern }}'"
"Variable '{{ item.name }} = '{{ lookup('ansible.builtin.vars', item.name) }}' doesn't match pattern '{{ item.pattern }}'"
"Pass the required variable with: --e \"{{ item.name }}=<value>\""
loop:
- name: stage
@@ -28,12 +28,16 @@
loop:
# vars.yaml should be the first as other config files depend on it.
- vars.yaml
# cuda_deps should be loaded before apt, since apt depends on it.
- cuda_deps.yaml
- apt.yaml
- pip.yaml
- env.yaml
tags: always # Execute this task even when `--skip-tags` is used.

roles:
- bazel
- role: bazel
tags: bazel

- role: install_deps
vars:
@@ -62,12 +66,12 @@
pip.pkgs_nodeps[stage + '_' + arch] | default([], true) +
pip.pkgs_nodeps[stage + '_' + accelerator] | default([], true)
}}"
tags: install_deps

- role: fetch_srcs
vars:
src_root: "/src"
pytorch_git_rev: HEAD
xla_git_rev: HEAD
tags: fetch_srcs

- role: build_srcs
vars:
@@ -77,6 +81,7 @@
combine(build_env[arch] | default({}, true)) |
combine(build_env[accelerator] | default({}, true))
}}"
tags: build_srcs

- role: configure_env
vars:
@@ -86,3 +91,4 @@
combine(release_env[accelerator] | default({}, true))
}}"
when: stage == "release"
tags: configure_env
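The diff adds one tag per role so that `--tags`/`--skip-tags` can select build phases (as `development.Dockerfile` does with `--skip-tags "fetch_srcs,build_srcs"`). A simplified sketch of the selection logic, roughly mirroring Ansible's behavior including the special `always` tag:

```python
def select_roles(roles, skip_tags=(), only_tags=()):
    """Return role names that survive --skip-tags / --tags filtering.
    Roles tagged 'always' run regardless of --tags (simplified)."""
    selected = []
    for name, tags in roles:
        if "always" in tags:
            selected.append(name)
        elif any(t in skip_tags for t in tags):
            continue  # explicitly skipped
        elif only_tags and not any(t in only_tags for t in tags):
            continue  # --tags given, and none match
        else:
            selected.append(name)
    return selected

roles = [("bazel", {"bazel"}), ("install_deps", {"install_deps"}),
         ("fetch_srcs", {"fetch_srcs"}), ("build_srcs", {"build_srcs"})]
print(select_roles(roles, skip_tags={"fetch_srcs", "build_srcs"}))
# ['bazel', 'install_deps']
```

This is why the config-loading task above is tagged `always`: every phase needs the loaded variables, whatever tag subset is requested.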
@@ -34,6 +34,7 @@
# localhost.
remote_src: true
strip: 1
ignore_whitespace: true
basedir: "{{ (src_root, 'pytorch/xla/third_party/tensorflow') | path_join }}"
Collaborator comment: This is now gone, and bazel does the patching.

Collaborator (author) reply:
But it's not gone for the v1.12 and v1.13 releases? The problem with this approach is that the build & deployment configuration is not bound to the source version (and for now I don't think we can improve on that). So this build process has to work for current and future versions of the code as well as older ones.

Do you think that having this step while bazel is doing the patching is problematic? I added `ignore_errors: true`, so if the step fails, the playbook still continues.

loop: "{{ tf_patches.files | map(attribute='path') }}"
ignore_errors: true
33 changes: 12 additions & 21 deletions docker/experimental/terraform/README.md
@@ -1,26 +1,17 @@
# Terraform configuration for build/test resources
# Terraform for CloudBuild triggers

Download the latest Terraform binary for your system and add it to your `$PATH`:
https://developer.hashicorp.com/terraform/downloads
This Terraform setup provisions:
- public storage bucket for PyTorch and PyTorch/XLA wheels.
- private storage bucket for Terraform state.
- public artifact repository for docker images.
- cloud builds for nightly and release docker images and wheels.
- scheduled jobs and a service account for triggering cloud builds.

Terraform state is stored in a shared GCS bucket. To initialize Terraform, run
the following:
# Running

```
# Authenticate with GCP
gcloud auth login --update-adc
1. Run `gcloud auth application-default login` on your local workstation.
2. Make sure that a recent Terraform binary is installed (>= 1.3.8).
If not, install Terraform from the [official source](https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli).
3. Run `terraform apply -var-file=vars/staging.tfvars`.

# Initialize Terraform
terraform init
```

To preview your changes run `terraform plan`.

If the changes look correct, you can update the project with `terraform apply`.

Resources:

- Official Terraform documentation: https://developer.hashicorp.com/terraform/docs
- GCP Terraform documentation: https://cloud.google.com/docs/terraform/get-started-with-terraform
- Storing Terraform state in GCS: https://cloud.google.com/docs/terraform/resource-management/store-state
- Cloud Build Trigger documentation: https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/cloudbuild_trigger
24 changes: 24 additions & 0 deletions docker/experimental/terraform/artifact_repo.tf
@@ -0,0 +1,24 @@
# Docker repository in Artifact Registry for all public images.
resource "google_artifact_registry_repository" "public_docker_repo" {
location = var.public_docker_repo.location
repository_id = var.public_docker_repo.id
description = "Official docker images."
format = "DOCKER"
}

resource "google_artifact_registry_repository_iam_member" "all_users_read_public_docker_repo" {
role = "roles/artifactregistry.reader"
member = "allUsers"
project = google_artifact_registry_repository.public_docker_repo.project
location = google_artifact_registry_repository.public_docker_repo.location
repository = google_artifact_registry_repository.public_docker_repo.name
}

locals {
public_repo = google_artifact_registry_repository.public_docker_repo
public_docker_repo_url = "${local.public_repo.location}-docker.pkg.dev/${var.project_id}/${local.public_repo.repository_id}"
}

output "public_docker_registry_url" {
value = local.public_docker_repo_url
}
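The `public_docker_repo_url` local is plain string interpolation of the Artifact Registry URL scheme `<location>-docker.pkg.dev/<project>/<repository>`. Reproduced in Python with placeholder values (the real ones come from Terraform variables):

```python
location = "us-central1"         # placeholder for var.public_docker_repo.location
project_id = "example-project"   # placeholder for var.project_id
repository_id = "docker-public"  # placeholder for var.public_docker_repo.id

# Equivalent of: "${local.public_repo.location}-docker.pkg.dev/${var.project_id}/${local.public_repo.repository_id}"
url = f"{location}-docker.pkg.dev/{project_id}/{repository_id}"
print(url)  # us-central1-docker.pkg.dev/example-project/docker-public
```

Exposing the URL as a Terraform output lets the Cloud Build triggers tag and push images without hard-coding the registry host.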
43 changes: 43 additions & 0 deletions docker/experimental/terraform/buckets.tf
@@ -0,0 +1,43 @@
resource "google_storage_bucket" "tfstate" {
name = "${var.project_id}-tfstate${var.storage_bucket_suffix}"
force_destroy = false
location = "US"
storage_class = "STANDARD"

# Required by project policy.
# See https://cloud.google.com/storage/docs/uniform-bucket-level-access.
uniform_bucket_level_access = false

versioning {
enabled = true
}
}

# Storage bucket for all publicly released wheels.
resource "google_storage_bucket" "public_wheels" {
name = "${var.project_id}-wheels-public"
force_destroy = false
location = "US"
storage_class = "STANDARD"

uniform_bucket_level_access = false

versioning {
enabled = true
}
}

# Grants all users (public) read access to the bucket with wheels.
resource "google_storage_bucket_access_control" "all_users_read_public_wheels" {
bucket = google_storage_bucket.public_wheels.name
role = "READER"
entity = "allUsers"
}

output "public_wheels_bucket_url" {
value = google_storage_bucket.public_wheels.url
}

output "tfstate_bucket_url" {
value = google_storage_bucket.tfstate.url
}