[RFC] Add Intel GPU support into PyTorch CI/CD  #114850

@chuanqi129

Motivation

As mentioned in [RFC] Intel GPU Upstreaming, integrating the new Intel GPU device and its associated features into PyTorch requires CI/CD tests designed specifically for Intel GPU devices. These tests will ensure the quality of incoming pull requests and gate their acceptance.

Design Philosophy

For the new CI/CD enablement, we aim to reuse the existing PyTorch CI/CD infrastructure wherever possible. Intel GPU related tests will be dispatched to Intel Developer Cloud (IDC) instances, which provide Intel GPU hardware as self-hosted runners. Based on our understanding of the current PyTorch CI/CD tests, we divide the Intel GPU related tests into three categories: pull tests, inductor tests, and other tests. For inductor tests, we will extend the existing inductor test workflow to cover Intel GPU. For the other tests, a new workflow will serve as the entry point, mirroring the approach used for other devices. All Intel GPU tests follow the rules below.

  • Docker-based builds and tests
  • Multiple steps for both inductor and other tests
    • The base Docker image, which provides the PyTorch build and test environment, is built on AWS instance runners.
    • The wheel is built directly on AWS instance runners.
    • Tests can be sharded and dispatched to IDC instance runners.
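As a rough illustration of what sharding means here, the sketch below splits a list of test modules across shards with a simple round-robin rule. This is an assumption made for illustration only: the RFC does not specify the sharding algorithm, PyTorch's real test sharding may balance by test duration instead, and the function name and test names are hypothetical.

```python
from typing import List, Sequence


def select_shard(tests: Sequence[str], shard: int, num_shards: int) -> List[str]:
    """Return the tests assigned to a 1-indexed shard (round-robin sketch)."""
    if not 1 <= shard <= num_shards:
        raise ValueError(f"shard must be in [1, {num_shards}], got {shard}")
    # Sort first so every runner computes the same split independently.
    ordered = sorted(tests)
    # Shard k takes every num_shards-th test, starting at index k - 1.
    return ordered[shard - 1 :: num_shards]


if __name__ == "__main__":
    # Hypothetical test module names, used only for illustration.
    tests = ["test_ops", "test_nn", "test_torch", "test_autograd", "test_xpu"]
    for shard in (1, 2):
        print(shard, select_shard(tests, shard, num_shards=2))
```

Because each runner derives its own slice from the same sorted list, the shards are disjoint and together cover every test without any coordination between runners.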

Detail

Entry point for pull tests

For the basic build test on pull requests, we will add an Intel GPU specific build job to .github/workflows/pull.yml, triggered by each pull request.

Entry point for Inductor tests

For Inductor tests, we will add Intel GPU specific test jobs to .github/workflows/inductor.yml, triggered by ciflow/inductor. To avoid breaking other Inductor related PRs in the first stage, we plan to add a new ciflow label, ciflow/xpu, to .github/pytorch-probot.yml and run the Intel GPU Inductor tests only for PRs that carry both ciflow/inductor and ciflow/xpu.
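The gating rule for the first stage can be stated as a small predicate. The sketch below only illustrates the intended logic; it is not how the workflow or the PyTorch bot implements it, and the function and constant names are hypothetical.

```python
from typing import Iterable

# Both labels must be present before the Intel GPU Inductor tests run.
REQUIRED_LABELS = frozenset({"ciflow/inductor", "ciflow/xpu"})


def should_run_xpu_inductor(pr_labels: Iterable[str]) -> bool:
    """Return True only if the PR carries every required ciflow label."""
    return REQUIRED_LABELS.issubset(set(pr_labels))


if __name__ == "__main__":
    print(should_run_xpu_inductor(["ciflow/inductor"]))                # False
    print(should_run_xpu_inductor(["ciflow/inductor", "ciflow/xpu"]))  # True
```

Under this rule, a PR labeled only ciflow/inductor continues to run the existing Inductor jobs without being affected by the Intel GPU ones.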

Entry point for other tests

The remaining Intel GPU related tests will be triggered by PRs labeled ciflow/xpu, or periodically on a schedule. To achieve this, we will add a new entry-point workflow, .github/workflows/xpu.yml, with content like the following.

name: xpu

on:
  push:
    branches:
      - main
      - release/*
    tags:
      - ciflow/xpu/*
  workflow_dispatch:
  schedule:
    # Run once a day at 00:00 UTC
    - cron: 0 0 * * *

concurrency:
  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}
  cancel-in-progress: true

jobs:
  linux-jammy-xpu-py3_8-build:
    name: linux-jammy-xpu-py3.8
    uses: ./.github/workflows/_linux-build.yml
    with:
      build-environment: linux-jammy-xpu-py3.8
      docker-image-name: pytorch-linux-jammy-xpu-n-py3
      sync-tag: xpu-build
      test-matrix: |
        { include: [
          { config: "default", shard: 1, num_shards: 2, runner: "linux.idc.xpu" },
          { config: "default", shard: 2, num_shards: 2, runner: "linux.idc.xpu" },
        ]}

  linux-jammy-xpu-py3_8-test:
    name: linux-jammy-xpu-py3.8
    uses: ./.github/workflows/_xpu-test.yml
    needs: linux-jammy-xpu-py3_8-build
    with:
      build-environment: linux-jammy-xpu-py3.8
      docker-image: ${{ needs.linux-jammy-xpu-py3_8-build.outputs.docker-image }}
      test-matrix: ${{ needs.linux-jammy-xpu-py3_8-build.outputs.test-matrix }} 
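To see what the concurrency group in the workflow above evaluates to for each trigger, the following Python sketch mimics the expression. It assumes GitHub Actions expression semantics in which a || b yields the first truthy operand, cond && value yields value only when cond is true, null interpolates as an empty string, and booleans interpolate as true or false. The function names and sample values (the SHA and the PR number in the tag) are illustrative and not part of the proposal.

```python
def gh_str(value):
    """Render a value roughly as GitHub Actions interpolates it into a string."""
    if value is None:
        return ""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def concurrency_group(workflow, event_name, ref_type, ref_name, sha, pr_number=None):
    """Mimic the concurrency group expression from xpu.yml (sketch only)."""
    parts = [
        workflow,                            # github.workflow
        pr_number or ref_name,               # pull_request.number || ref_name
        (ref_type == "branch") and sha,      # ref_type == 'branch' && sha
        event_name == "workflow_dispatch",   # manual trigger flag
        event_name == "schedule",            # timer trigger flag
    ]
    return "-".join(gh_str(p) for p in parts)


if __name__ == "__main__":
    # Push to a branch: the SHA is part of the key.
    print(concurrency_group("xpu", "push", "branch", "main", "abc123"))
    # Push of a ciflow tag: the SHA is replaced by "false".
    print(concurrency_group("xpu", "push", "tag", "ciflow/xpu/12345", "abc123"))
```

Under these assumptions, each commit pushed to main or a release branch gets its own group because the SHA is part of the key, so a later commit does not cancel an earlier run. Repeated pushes of the same ciflow/xpu/* tag share one group, so with cancel-in-progress enabled the newer run cancels the older one.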

Build & Test

We will add an Intel GPU specific base image Dockerfile, .ci/docker/ubuntu-xpu/Dockerfile, and an Intel GPU section in the image build script .ci/docker/build.sh, to support building the Intel GPU base image on linux.2xlarge runners.

For the PyTorch wheel build, unlike other devices, we currently need to dispatch it to IDC instance runners. We will reuse .github/workflows/_linux-build.yml with an Intel GPU specific build-environment and add an Intel GPU section to the PyTorch build script .ci/pytorch/build.sh.

For the tests, we will add a new Intel GPU test workflow, .github/workflows/_xpu-test.yml, and the necessary GitHub actions such as setup-xpu and teardown-xpu. We will also add an Intel GPU section to the test script .ci/pytorch/test.sh, along with a set of utility scripts for Intel GPU.

cc @seemethere @malfet @pytorch/pytorch-dev-infra @frank-wei @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10

Labels: module: ci, module: intel, oncall: releng, triaged
