[inductor][cpp] GEMM template (infra and fp32) #124021

Closed
wants to merge 63 commits

Commits on Apr 14, 2024

  1. Update

    [ghstack-poisoned]
    jgong5 committed Apr 14, 2024 (00eb31a)

Commits on Apr 16, 2024

  1. Update

    [ghstack-poisoned]
    jgong5 committed Apr 16, 2024 (b6ff5fe)
  2. Update

    [ghstack-poisoned]
    jgong5 committed Apr 16, 2024 (0355c46)

Commits on Apr 17, 2024

  1. Update

    [ghstack-poisoned]
    jgong5 committed Apr 17, 2024 (ba94cdf)
  2. Update

    [ghstack-poisoned]
    jgong5 committed Apr 17, 2024 (5ad7899)
  3. Update

    [ghstack-poisoned]
    jgong5 committed Apr 17, 2024 (1c4edcd)
  4. Update

    [ghstack-poisoned]
    jgong5 committed Apr 17, 2024 (a56957d)
  5. Update

    [ghstack-poisoned]
    jgong5 committed Apr 17, 2024 (5bf33c4)

Commits on Apr 18, 2024

  1. Update

    [ghstack-poisoned]
    jgong5 committed Apr 18, 2024 (f780f9c)
  2. Update

    [ghstack-poisoned]
    jgong5 committed Apr 18, 2024 (0580a46)

Commits on Apr 26, 2024

  1. Update

    [ghstack-poisoned]
    jgong5 committed Apr 26, 2024 (d795f31)

Commits on Apr 27, 2024

  1. Update

    [ghstack-poisoned]
    jgong5 committed Apr 27, 2024 (002bedb)

Commits on Apr 28, 2024

  1. Update

    [ghstack-poisoned]
    jgong5 committed Apr 28, 2024 (2bfc603)
  2. Update

    [ghstack-poisoned]
    jgong5 committed Apr 28, 2024 (a416d41)
  3. Update

    [ghstack-poisoned]
    jgong5 committed Apr 28, 2024 (8d3f8aa)
  4. Update

    [ghstack-poisoned]
    jgong5 committed Apr 28, 2024 (701a0cd)
  5. Update

    [ghstack-poisoned]
    jgong5 committed Apr 28, 2024 (85ce15a)
  6. Update

    [ghstack-poisoned]
    jgong5 committed Apr 28, 2024 (5f0133e)
  7. Update

    [ghstack-poisoned]
    jgong5 committed Apr 28, 2024 (b1f731b)
  8. Update

    [ghstack-poisoned]
    jgong5 committed Apr 28, 2024 (c0d77bc)

Commits on Apr 29, 2024

  1. Update

    [ghstack-poisoned]
    jgong5 committed Apr 29, 2024 (ab8e6a9)
  2. Update

    [ghstack-poisoned]
    jgong5 committed Apr 29, 2024 (c2c5d2d)
  3. Update

    [ghstack-poisoned]
    jgong5 committed Apr 29, 2024 (b079a2c)
  4. Update

    [ghstack-poisoned]
    jgong5 committed Apr 29, 2024 (fac3997)
  5. Update

    [ghstack-poisoned]
    jgong5 committed Apr 29, 2024 (b0e451c)
  6. Update

    [ghstack-poisoned]
    jgong5 committed Apr 29, 2024 (ff91a01)
  7. Update

    [ghstack-poisoned]
    jgong5 committed Apr 29, 2024 (59086de)
  8. Update

    [ghstack-poisoned]
    jgong5 committed Apr 29, 2024 (bfce7d8)
  9. Update

    [ghstack-poisoned]
    jgong5 committed Apr 29, 2024 (7e6490a)
  10. Update

    [ghstack-poisoned]
    jgong5 committed Apr 29, 2024 (7a4dc85)

Commits on Apr 30, 2024

  1. Update

    [ghstack-poisoned]
    jgong5 committed Apr 30, 2024 (0cec870)
  2. Update

    [ghstack-poisoned]
    jgong5 committed Apr 30, 2024 (fc8a9c8)
  3. Update

    [ghstack-poisoned]
    jgong5 committed Apr 30, 2024 (b337242)
  4. Update

    [ghstack-poisoned]
    jgong5 committed Apr 30, 2024 (1c5a149)

Commits on May 6, 2024

  1. Update

    [ghstack-poisoned]
    jgong5 committed May 6, 2024 (7ae7be0)
  2. Update

    [ghstack-poisoned]
    jgong5 committed May 6, 2024 (614a739)
  3. Update

    [ghstack-poisoned]
    jgong5 committed May 6, 2024 (6b682e2)

Commits on May 7, 2024

  1. Update

    [ghstack-poisoned]
    jgong5 committed May 7, 2024 (66f5e31)
  2. Update on "[inductor][cpp] GEMM template (infra and fp32)"

    [ghstack-poisoned]
    jgong5 committed May 7, 2024 (d56bebf)
  3. Update on "[inductor][cpp] GEMM template (infra and fp32)"

    [ghstack-poisoned]
    jgong5 committed May 7, 2024 (acb4a95)
  4. Update on "[inductor][cpp] GEMM template (infra and fp32)"

    [ghstack-poisoned]
    jgong5 committed May 7, 2024 (70a6d7d)

Commits on May 8, 2024

  1. Update on "[inductor][cpp] GEMM template (infra and fp32)"

    This PR adds the Cpp template infrastructure and the initial FP32 GEMM template. See RFC #125683 for more background info.
    1. Cpp template infrastructure
    Template abstractions similar to those of the CUTLASS template, i.e., `CppTemplate`, `CppTemplateKernel`, and `CppTemplateBuffer`, plus the MicroGemm micro-kernel abstraction (`CppMicroGemm`) that Cpp GEMM templates can use. A usage sketch follows this commit entry.
    2. Initial FP32 GEMM template
    This adds a GEMM template implementation, `CppPackedGemmTemplate`, that supports GEMM with a constant weight (`B`). It requires `N` to be a multiple of the register-blocking size while allowing static or dynamic sizes for the `M` (batch) dimension of `A`. The `B` matrix is prepacked; this constant-weight setting is typical for inference workloads. The template handles thread decomposition (via `thread_blocking`) and cache blocking (via `cache_blocking`), then invokes `CppMicroGemm`, which handles register blocking, instruction selection, and other CPU-architecture-specific optimizations. A `CppMicroGemmFP32Vec` micro-kernel implementation is provided for fp32 matmuls, built on the ATen Vec abstraction. An illustrative blocking sketch also follows this commit entry.
    3. Correctness and performance
    The changes have been validated with fp32 inference on the three benchmark suites (torchbench, huggingface, and timm_models), with both static and dynamic shapes. Since this is an initial implementation, further performance improvements are planned in follow-up PRs, including kernel optimizations and fusions. Compared to the ATen kernels, which are backed by MKL, perf gains are observed only on a select number of models. The gains are more pronounced with dynamic shapes, since MKL supports packed GEMM only for static shapes. Details are below.
    
    Static shapes
    | Configuration | torchbench | huggingface | timm_models |
    |------------|-------------|--------------|--------------|
    | Multi-threaded (baseline) | 1.47x | 1.36x | 1.91x |
    | Multi-threaded (max-autotune) | 1.47x | 1.36x | 1.92x |
    | Single-threaded (baseline) | 1.56x | 1.19x | 1.51x |
    | Single-threaded (max-autotune) | 1.56x | 1.19x | 1.52x |
    
    Key models being sped up:
    drq: 1.14x
    soft_act: 1.12x
    cait_m36_384: 1.18x
    
    Dynamic shapes
    | Configuration | torchbench | huggingface | timm_models |
    | --- | --- | --- | --- |
    | Multi-threaded (baseline) | 1.43x | 1.28x | 1.85x |
    | Multi-threaded (max-autotune) | 1.47x | 1.28x | 1.85x |
    | Single-threaded (baseline) | 1.55x | 1.20x | 1.51x |
    | Single-threaded (max-autotune) | 1.56x | 1.19x | 1.53x |
    
    Key models being sped up:
    BERT_pytorch: 1.22x
    pyhpc_turbulent: 1.13x
    soft_actor_critic: 1.77x
    BlenderbotForCausalLM: 1.09x
    cait_m36_384: 1.17x
    
    cc voznesenskym penguinwu EikanWang Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang 
    
    [ghstack-poisoned]
    jgong5 committed May 8, 2024 (0162cf6)
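A minimal usage sketch of how a template like this is exercised end to end: compile a constant-weight `Linear` on CPU with max-autotune so Inductor can benchmark the generated GEMM candidate against the ATen kernel. `max_autotune`, `max_autotune_gemm_backends`, and `freezing` are existing Inductor config options; the `"CPP"` backend value and the model/shapes are assumptions for illustration, not code taken from this PR.

```python
import torch

# Existing Inductor knobs; the "CPP" backend name is an assumption based on this PR's description.
torch._inductor.config.freezing = True                           # treat weights as constants so B can be prepacked
torch._inductor.config.max_autotune = True                       # benchmark candidate kernels and pick the fastest
torch._inductor.config.max_autotune_gemm_backends = "CPP,ATEN"   # assumed: let the C++ template compete with ATen

class TinyMLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # N = 1024 is a multiple of typical register-blocking widths
        self.fc = torch.nn.Linear(1024, 1024, bias=False)

    def forward(self, x):
        return self.fc(x)

model = TinyMLP().eval()
x = torch.randn(128, 1024)  # M = 128 (batch dim); dynamic M is also supported by the template

with torch.no_grad():
    compiled = torch.compile(model)                  # Inductor CPU backend
    out = compiled(x)
    torch.testing.assert_close(out, model(x), rtol=1e-4, atol=1e-4)
```

Freezing is what makes `B` a compile-time constant, which is the precondition for prepacking it.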
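For item 2, a small self-contained NumPy sketch of the same decomposition may help: `B` is prepacked into `[N/Nr, K, Nr]` tiles, the output is walked in cache blocks, and an inner micro-gemm updates one `Mr x Nr` register tile at a time. The block sizes and the `prepack_b`/`micro_gemm`/`packed_gemm` helpers are hypothetical stand-ins for what `cache_blocking`, `thread_blocking`, and `CppMicroGemmFP32Vec` actually do; this illustrates the blocking scheme, not the template's generated code.

```python
import numpy as np

Mr, Nr = 4, 16            # register-block sizes handled by the micro-gemm (hypothetical)
Mc, Nc, Kc = 32, 64, 128  # cache-block sizes (hypothetical)

def prepack_b(B, Nr):
    """Reorder B (K, N) into tiles of shape (N // Nr, K, Nr)."""
    K, N = B.shape
    assert N % Nr == 0, "N must be a multiple of the register blocking"
    return B.reshape(K, N // Nr, Nr).transpose(1, 0, 2).copy()

def micro_gemm(A_blk, Bp_tile, C_blk):
    """Update one (<=Mr) x Nr register tile in place: C_blk += A_blk @ Bp_tile."""
    # The real CppMicroGemmFP32Vec does this with ATen Vectorized FMAs;
    # a plain matmul stands in here.
    C_blk += A_blk @ Bp_tile

def packed_gemm(A, Bp):
    M, K = A.shape
    N = Bp.shape[0] * Nr
    C = np.zeros((M, N), dtype=A.dtype)
    # Cache blocking over M, N, K; the real template additionally splits these
    # loops across threads (thread_blocking) before the cache-block loops.
    for mc in range(0, M, Mc):
        for nc in range(0, N, Nc):
            for kc in range(0, K, Kc):
                for m in range(mc, min(mc + Mc, M), Mr):
                    for n in range(nc, min(nc + Nc, N), Nr):
                        micro_gemm(
                            A[m:m + Mr, kc:kc + Kc],
                            Bp[n // Nr, kc:kc + Kc, :],
                            C[m:m + Mr, n:n + Nr],
                        )
    return C

A = np.random.rand(37, 256).astype(np.float32)   # M need not be a multiple of Mr
B = np.random.rand(256, 128).astype(np.float32)  # N = 128 is a multiple of Nr
Bp = prepack_b(B, Nr)
np.testing.assert_allclose(packed_gemm(A, Bp), A @ B, rtol=1e-4)
```

Prepacking `B` into `[N/Nr, K, Nr]` tiles keeps each `Nr`-wide slice contiguous, which is what lets the real micro-kernel read it with unit-stride vector loads.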
  2. Update on "[inductor][cpp] GEMM template (infra and fp32)"

    [ghstack-poisoned]
    jgong5 committed May 8, 2024 (92f4ac4)
  3. Update on "[inductor][cpp] GEMM template (infra and fp32)"

    [ghstack-poisoned]
    jgong5 committed May 8, 2024 (8cfdb7d)

Commits on May 9, 2024

  1. Update on "[inductor][cpp] GEMM template (infra and fp32)"

    [ghstack-poisoned]
    jgong5 committed May 9, 2024 (55d98b0)
  2. Update on "[inductor][cpp] GEMM template (infra and fp32)"

    [ghstack-poisoned]
    jgong5 committed May 9, 2024 (ca09328)
  3. Update on "[inductor][cpp] GEMM template (infra and fp32)"

    [ghstack-poisoned]
    jgong5 committed May 9, 2024 (e96352e)

Commits on May 12, 2024

  1. Update on "[inductor][cpp] GEMM template (infra and fp32)"

    [ghstack-poisoned]
    jgong5 committed May 12, 2024 (f427d85)

Commits on May 15, 2024

  1. Update

    [ghstack-poisoned]
    jgong5 committed May 15, 2024 (f0e2203)
  2. Update

    [ghstack-poisoned]
    jgong5 committed May 15, 2024 (0cacd09)

Commits on May 19, 2024

  1. Update on "[inductor][cpp] GEMM template (infra and fp32)"

    [ghstack-poisoned]
    jgong5 committed May 19, 2024 (67877a6)
  2. Update on "[inductor][cpp] GEMM template (infra and fp32)"

    [ghstack-poisoned]
    jgong5 committed May 19, 2024 (6687ccf)
  3. Update on "[inductor][cpp] GEMM template (infra and fp32)"

    [ghstack-poisoned]
    jgong5 committed May 19, 2024 (70b35d3)
  4. Update on "[inductor][cpp] GEMM template (infra and fp32)"

    [ghstack-poisoned]
    jgong5 committed May 19, 2024 (3bcbae9)
  5. Update on "[inductor][cpp] GEMM template (infra and fp32)"

    [ghstack-poisoned]
    jgong5 committed May 19, 2024 (3a8012d)
  6. Update on "[inductor][cpp] GEMM template (infra and fp32)"

    [ghstack-poisoned]
    jgong5 committed May 19, 2024 (fbb0064)

Commits on May 20, 2024

  1. Update on "[inductor][cpp] GEMM template (infra and fp32)"

    [ghstack-poisoned]
    jgong5 committed May 20, 2024 (2993c2e)

Commits on May 21, 2024

  1. Update on "[inductor][cpp] GEMM template (infra and fp32)"

    This PR adds the Cpp template infrastructure and the initial FP32 gemm template. See RFC #125683 for more background info.
    1. Cpp template infrastructure
    Similar template abstractions as the CUTLASS template, i.e., `CppTemplate`, `CppTemplateKernel`, `CppTemplateBuffer`. The MicroGemm micro-kernel abstraction that can be used by Cpp GEMM templates.
    2. Initial FP32 gemm template
    This involves a GEMM template implementation `CppPackedGemmTemplate` that supports GEMM with constant weight (`B`) requiring `N` to be a multiple of register blocking while allows the static or dynamic sizes for the `M` (batch dim) of `A`. The `B` matrix would be prepacked. This is a typical setting for inference workloads. The template handles the thread decomposition (via `thread_blocking`) and cache blocking (via `cache_blocking`). Then it invokes `CppMicroGemm` which handles register blocking, instruction selection, and other CPU architecture-specific optimizations. A `CppMicroGemmFP32Vec` micro-kernel implementation is provided for fp32 matmuls implemented with ATen vec abstraction.
    3. Correctness and performance
    The changes have been validated with fp32 inference on the three benchmark suites (torchbench, huggingface and timm_models) with both static shape and dynamic shapes. Since it is an initial implementation, we are still working on further performance improves with follow-up PRs including the optimizations in kernels as well as fusions. The perf gains are only observed from a selective number of models compared to the ATen kernels which are implemented with MKL. The perf gains are more obvious with dynamic shapes since MKL only supports packed gemm for static shapes. Below are details.
    
    Static shapes
    | Benchmark | torchbench | huggingface | timm_models |
    |------------|-------------|--------------|--------------|
    | Multi-threaded (baseline) | 1.47x | 1.36x | 1.91x |
    | Multi-threaded (max-autotune) | 1.47x | 1.36x | 1.92x |
    | Single-threaded (baseline) | 1.56x | 1.19x | 1.51x |
    | Single-threaded (max-autotune) | 1.56x | 1.19x | 1.52x |
    
    Key models being sped up:
    drq: 1.14x
    soft_act: 1.12x
    cait_m36_384: 1.18x
    
    Dynamic shapes
    | Benchmark | torchbench | huggingface | timm_models |
    | --- | --- | --- | --- |
    | Multi-threaded (baseline) | 1.43x | 1.28x | 1.85x |
    | Multi-threaded (max-autotune) | 1.47x | 1.28x | 1.85x |
    | Single-threaded (baseline) | 1.55x | 1.20x | 1.51x |
    | Single-threaded (max-autotune) | 1.56x | 1.19x | 1.53x |
    
    Key models being sped up:
    BERT_pytorch: 1.22x
    pyhpc_turbulent: 1.13x
    soft_actor_critic: 1.77x
    BlenderbotForCausalLM: 1.09x
    cait_m36_384: 1.17x
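    
    To make the decomposition above concrete, here is a plain-Python sketch of the loop nest it describes: cache blocking over tiles of `M`/`N`/`K` with an inner register-blocked `Mr x Nr` micro-kernel. The helper name and block sizes are made up for illustration and do not mirror the actual `CppPackedGemmTemplate`/`CppMicroGemm` code; the thread-blocking level (splitting the outer tile loops across threads) is only noted in a comment.
    
    ```python
    import numpy as np
    
    def blocked_gemm(A, B, Mr=4, Nr=16, Mc=64, Nc=64, Kc=256):
        """Illustrative blocked C = A @ B; block sizes are arbitrary examples."""
        M, K = A.shape
        K2, N = B.shape
        assert K == K2 and N % Nr == 0  # mirrors "N must be a multiple of register blocking"
        C = np.zeros((M, N), dtype=A.dtype)
        # Thread blocking (omitted here): the real template partitions the
        # (mc, nc) tile space across threads via thread_blocking.
        for nc in range(0, N, Nc):          # cache blocking over N
            for kc in range(0, K, Kc):      # cache blocking over K (accumulated)
                for mc in range(0, M, Mc):  # cache blocking over M
                    # Micro-kernel: register-blocked Mr x Nr updates; the real
                    # CppMicroGemmFP32Vec vectorizes this with the ATen Vec API.
                    for nr in range(nc, min(nc + Nc, N), Nr):
                        for mr in range(mc, min(mc + Mc, M), Mr):
                            m_end = min(mr + Mr, M)
                            k_end = min(kc + Kc, K)
                            C[mr:m_end, nr:nr + Nr] += (
                                A[mr:m_end, kc:k_end] @ B[kc:k_end, nr:nr + Nr]
                            )
        return C
    
    A = np.random.rand(37, 512).astype(np.float32)   # M need not be a multiple of Mr/Mc
    B = np.random.rand(512, 256).astype(np.float32)  # N is a multiple of Nr
    assert np.allclose(blocked_gemm(A, B), A @ B, rtol=1e-4, atol=1e-4)
    ```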
    
    cc voznesenskym penguinwu EikanWang Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang
    
    Differential Revision: [D57585365](https://our.internmc.facebook.com/intern/diff/D57585365)
    
    [ghstack-poisoned]
    jgong5 committed May 21, 2024
    ac36018

Commits on May 23, 2024

  1. Update on "[inductor][cpp] GEMM template (infra and fp32)"

    [ghstack-poisoned]
    jgong5 committed May 23, 2024
    ad5e500

Commits on May 24, 2024

  1. Update on "[inductor][cpp] GEMM template (infra and fp32)"

    [ghstack-poisoned]
    jgong5 committed May 24, 2024
    5f07582

Commits on May 28, 2024

  1. Update

    [ghstack-poisoned]
    jgong5 committed May 28, 2024
    c872b7c

Commits on May 29, 2024

  1. Update

    [ghstack-poisoned]
    jgong5 committed May 29, 2024
    bdb239e
  2. Update

    [ghstack-poisoned]
    jgong5 committed May 29, 2024
    6ad24d3