
Fused attention patterns #97741

Closed

wants to merge 13 commits into from

Commits on Mar 28, 2023

  1. Fused attention patterns

    Patterns based on #94729 --
    mainly as a forcing function for implementing joint graph replacements.
    
    Up until now, we had two places to do pattern matching:
    1) Pre-grad has janky infra (the graph is not normalized or functional), but
       it is desirable for many types of passes where you want your change to
       affect grad formulas.
    2) Post-grad has good infra, but can't change grad formulas.
    
    This PR adds a third place to do pattern matching: the joint
    forward+backward graph.  The idea is to lower the patterns to a joint
    graph and replace both the forward and backward parts before we
    partition them.  This allows us to do something similar to pre-grad
    transforms, but running after normalization and functionalization.
    
    [ghstack-poisoned]
    jansel committed Mar 28, 2023
    SHA: c4828fb
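To make the rewrite concrete, here is a minimal sketch of the kind of transformation such patterns perform, expressed with the generic torch.fx subgraph rewriter rather than the Inductor pattern-matching infrastructure this PR adds. The `ToyAttention` module, the tensor shapes, and the fixed `0.125` scale (which matches SDPA's default `1/sqrt(64)` for a head dimension of 64) are illustrative assumptions, not code from the PR.

```python
# Illustrative sketch only: rewrite an unfused attention computation into a
# call to torch.nn.functional.scaled_dot_product_attention using the generic
# torch.fx subgraph rewriter. The PR itself does this with Inductor's own
# pattern matcher on the joint forward+backward graph.
import torch
import torch.fx as fx
from torch.fx import subgraph_rewriter


class ToyAttention(torch.nn.Module):  # assumed example module, not from the PR
    def forward(self, q, k, v):
        scores = torch.matmul(q, k.transpose(-2, -1)) * 0.125  # scale for head_dim=64
        return torch.matmul(torch.softmax(scores, dim=-1), v)


def pattern(q, k, v):
    # The unfused computation we want to find in the traced graph.
    scores = torch.matmul(q, k.transpose(-2, -1)) * 0.125
    return torch.matmul(torch.softmax(scores, dim=-1), v)


def replacement(q, k, v):
    # The fused kernel we want to call instead.
    return torch.nn.functional.scaled_dot_product_attention(q, k, v)


gm = fx.symbolic_trace(ToyAttention())
matches = subgraph_rewriter.replace_pattern(gm, pattern, replacement)
print(f"replaced {len(matches)} occurrence(s)")
print(gm.code)
```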

Commits on Mar 29, 2023

  1. Update on "Fused attention patterns"

    jansel committed Mar 29, 2023
    SHA: f9dcd5c

Commits on Mar 30, 2023

  1. Update on "Fused attention patterns"

    Patterns based on #94729, mainly as a forcing function for implementing joint graph replacements.
    
    Up until now, we had two places to do pattern matching:
    1) Pre-grad has janky infra (the graph is not normalized or functional), but
       it is desirable for many types of passes where you want your change to
       affect grad formulas.
    2) Post-grad has good infra, but can't change grad formulas.
    
    This PR adds a third place to do pattern matching: the joint
    forward+backward graph.  The idea is to lower the patterns to a joint
    graph and replace both the forward and backward parts before we
    partition them.  This allows us to do something similar to pre-grad
    transforms, but running after normalization and functionalization.
    
    Note that we don't seem to have kernels for all of these patterns; some
    get decomposed in the dispatcher.
    
    cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire
    
    [ghstack-poisoned]
    jansel committed Mar 30, 2023
    SHA: e989709
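For readers unfamiliar with what the "joint forward+backward graph" refers to, the sketch below traces a function that computes both the attention output and its input gradients into a single FX graph using `make_fx`; AOTAutograd builds an analogous joint graph, which is where these replacements run before partitioning. The function, shapes, and scale are assumptions for illustration, not code from the PR.

```python
# Illustrative sketch only: build a small "joint" graph that contains both
# the forward attention computation and its backward (gradient) computation,
# traced into one torch.fx graph with make_fx. Shapes and the 0.125 scale
# are assumptions for the example.
import torch
from torch.fx.experimental.proxy_tensor import make_fx


def joint(q, k, v, grad_out):
    # Forward: plain unfused attention.
    scores = torch.matmul(q, k.transpose(-2, -1)) * 0.125
    out = torch.matmul(torch.softmax(scores, dim=-1), v)
    # Backward: gradients of the output w.r.t. q, k, v, driven by grad_out.
    gq, gk, gv = torch.autograd.grad(out, (q, k, v), grad_out)
    return out, gq, gk, gv


q, k, v = (torch.randn(2, 8, 64, requires_grad=True) for _ in range(3))
grad_out = torch.randn(2, 8, 64)

joint_gm = make_fx(joint)(q, k, v, grad_out)
print(joint_gm.graph)  # one graph holding both forward and backward ops
```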
  2. Update on "Fused attention patterns"

    jansel committed Mar 30, 2023
    SHA: 676ba98

Commits on Apr 1, 2023

  1. Update on "Fused attention patterns"

    jansel committed Apr 1, 2023
    SHA: 3f21c07

Commits on Apr 2, 2023

  1. Update on "Fused attention patterns"

    jansel committed Apr 2, 2023
    SHA: 8a4ae9c
  2. Update on "Fused attention patterns"

    jansel committed Apr 2, 2023
    SHA: 700c058

Commits on Apr 4, 2023

  1. Update on "Fused attention patterns"

    jansel committed Apr 4, 2023
    SHA: 0762a70

Commits on Apr 6, 2023

  1. Update on "Fused attention patterns"

    jansel committed Apr 6, 2023
    SHA: aa9a378
  2. Update on "Fused attention patterns"

    jansel committed Apr 6, 2023
    SHA: c658449
  3. Update on "Fused attention patterns"

    jansel committed Apr 6, 2023
    SHA: 4045a85
  4. Update on "Fused attention patterns"

    jansel committed Apr 6, 2023
    SHA: edd46af

Commits on Apr 9, 2023

  1. Update on "Fused attention patterns"

    jansel committed Apr 9, 2023
    SHA: 75c89eb