[perf] Reduce tensor & aten overhead #13049

@zou3519

Description

Motivation

Launching a FusionGroup for a JIT LSTM cell takes ~40us of CPU time, as reported by the autograd profiler. This is excessive: the kernel itself takes < 10us of CUDA time and could probably be made faster still. Having found nothing noticeably wrong in the JIT itself, I am looking into core performance overheads.

Methodology

Performance is measured with gbenchmark microbenchmarks: https://github.com/pytorch/benchmark/tree/master/timing/cpp2. Here is some sample output that motivates many of the tasks below.

Investigations
