[inductor] graph replayer #106952

Closed
wants to merge 5 commits

Commits on Aug 10, 2023

  1. [inductor] graph replayer

    [ghstack-poisoned]
    shunting314 committed Aug 10, 2023 (61a76d8)
  2. Update on "[inductor] graph replayer"

    Recently I've found it a bit painful to run benchmark scripts in my dev environment. E.g., the command below
    ```
    python benchmarks/dynamo/huggingface.py --backend inductor --amp --performance --only YituTechConvBert --training
    ```
    took about 2 minutes to run. It may take even longer for some other models.
    
    The command is slow since it
    - needs to do the dynamo work
    - verifies the model on CPU
    - runs perf tests
    - compiles all the graphs
    
    However, oftentimes I only need to debug inductor-specific logic like loop ordering and fusion. A lot of the work the script does is useless to me. Also, I only need to test one graph at a time (e.g., check the fwd graph first and, once that's done, continue to check the bwd graph) rather than compiling all the graphs.
    
    The graph replayer adds a `save_args` decorator to the `compile_fx_inner` function. When `config.save_args` is true, it pickles all the arguments to `compile_fx_inner` to the file system. Later on, we can call `load_args_and_run_compile_fx_inner("/tmp/inductor_saved_args/compile_fx_inner_0.pkl")` to replay the graph and compile it with inductor.
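
    To make the mechanism concrete, here is a rough self-contained sketch of the save/replay pattern (illustrative only, not the PR's actual code; the real decorator also has to deal with serializing the FX graph, and names like `SAVE_DIR`, `save_enabled`, and `load_args_and_run` are made up for the example):
    ```
    import functools
    import itertools
    import os
    import pickle

    SAVE_DIR = "/tmp/inductor_saved_args"  # assumed dump location, matching the path above
    _call_counter = itertools.count()
    save_enabled = True  # stands in for config.save_args

    def save_args(fn):
        """Pickle each call's (args, kwargs) to disk, then call fn as usual."""
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if save_enabled:
                os.makedirs(SAVE_DIR, exist_ok=True)
                path = os.path.join(SAVE_DIR, f"{fn.__name__}_{next(_call_counter)}.pkl")
                with open(path, "wb") as f:
                    pickle.dump((args, kwargs), f)
            return fn(*args, **kwargs)
        return wrapper

    def load_args_and_run(fn, path):
        """Replay a saved call: unpickle the arguments and invoke fn directly."""
        with open(path, "rb") as f:
            args, kwargs = pickle.load(f)
        return fn(*args, **kwargs)
    ```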
    
    Replaying the fwd graph took around 60 seconds (maybe this can be reduced further, but it is already a 2x speedup for dev efficiency), and it only took around 20 seconds to reach the `Scheduler.__init__` method.
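
    For reference, that replay run is roughly just the following (the exact module paths are my assumption; the function and config names come from the description above):
    ```
    import torch._inductor.config as inductor_config
    inductor_config.save_args = True  # dump compile_fx_inner's arguments during the benchmark run

    # ... run the benchmark command above once to produce the pickle files ...

    # Then, in a fresh process, replay and compile just one saved graph:
    from torch._inductor.compile_fx import load_args_and_run_compile_fx_inner
    load_args_and_run_compile_fx_inner("/tmp/inductor_saved_args/compile_fx_inner_0.pkl")
    ```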
    
    I also checked the existing `TORCH_COMPILE_DEBUG` flag. The most similar part of `TORCH_COMPILE_DEBUG` is that it can save a graph and its arguments and rerun it later on. But the difference here is that, rather than running the model, we want to call the inductor API to compile the model (without even going through dynamo or aot-autograd).
    
    
    [ghstack-poisoned]
    shunting314 committed Aug 10, 2023 (eabfafb)
  3. Update on "[inductor] graph replayer"

    [ghstack-poisoned]
    shunting314 committed Aug 10, 2023 (542ec13)
  4. Update on "[inductor] graph replayer"

    [ghstack-poisoned]
    shunting314 committed Aug 10, 2023 (13c755f)

Commits on Aug 11, 2023

  1. Update on "[inductor] graph replayer"

    [ghstack-poisoned]
    shunting314 committed Aug 11, 2023 (d2a86d8)