[FEAT] Performance Profiler #495
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/495
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure as of commit 9b37dd3 with merge base 12ac498: the following job has failed.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
tutorials/profiler/model.py (Outdated)
@@ -0,0 +1,257 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
This seems like gpt-fast right? In which case we have a fork already in https://github.com/pytorch/ao/tree/main/torchao/_models/llama
It's a simplified version with lots of features stripped out and cleaner printing to demonstrate usage of the performance counter.
See the Usage section of the PR and the README for more details.
I see! I will be out this week for a friend's wedding but @andrewor14 mind reviewing this PR?
Let me know if this is along the lines of what you had in mind regarding #426.
The core abstractions `DeviceSpec` and `TransformerPerformanceCounter` take care of tracking the necessary measurements (achieved BW and FLOPs/s) and detecting device-specific BW and FLOPs/s for MBU and MFU, as seen in the example output above.
Happy to adapt the API however useful.
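For reference, a minimal sketch of how MBU and MFU are conventionally defined (the standard formulation, not necessarily the exact API in this PR; variable names are placeholders):

```python
# Model Bandwidth Utilization (MBU) and Model FLOPs Utilization (MFU):
# achieved throughput divided by the device's theoretical peak.
def mbu(bytes_moved: float, latency_s: float, peak_bandwidth_bytes_per_s: float) -> float:
    achieved_bandwidth = bytes_moved / latency_s
    return achieved_bandwidth / peak_bandwidth_bytes_per_s

def mfu(flop_count: float, latency_s: float, peak_flops_per_s: float) -> float:
    achieved_flops_per_s = flop_count / latency_s
    return achieved_flops_per_s / peak_flops_per_s
```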
Hi @jeromeku, thanks for working on this. It looks great overall! Left a few comments, mostly about code reuse and location. Another thing I'm wondering is how we should make it easy for developers to profile their quantized models. Do you think we should just integrate it into generate.py? Maybe @jerryzh168 and @HDCharles should take a look too.
torchao/profiler/__init__.py (Outdated)
]


def get_all_base_classes(object):
This looks like it's only used in 1 place. Maybe we should inline?
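For illustration, the suggested inlining could look roughly like this (the call-site object `model` is a placeholder; the actual call site in the PR may differ):

```python
import inspect

import torch.nn as nn

model = nn.Linear(4, 4)  # stand-in for whatever object the real call site passes
# Inlined equivalent of get_all_base_classes(model):
base_class_names = [cls.__name__.lower() for cls in inspect.getmro(model.__class__)]
print(base_class_names)  # ['linear', 'module', 'object']
```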
torchao/profiler/__init__.py (Outdated)
    return [cls.__name__.lower() for cls in inspect.getmro(object.__class__)]


def total_model_params(
Looks like a util method. Should we move this to a utils file, e.g. torchao/profiler/utils.py? Also, if these are not meant to be called by the user, I would call it _total_model_params to make it private.
6: "E", | ||
7: "Z", | ||
8: "Y", | ||
} |
Can this be a simple array?
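A sketch of the suggestion, assuming the dict maps a power-of-1000 exponent to a unit suffix (the `format_value` helper is hypothetical, only there to show the list being indexed):

```python
# Since the keys are consecutive small integers, a list indexed by the exponent
# works in place of the dict.
_SUFFIXES = ["", "K", "M", "G", "T", "P", "E", "Z", "Y"]

def format_value(x: float) -> str:
    exp = 0
    while abs(x) >= 1000 and exp < len(_SUFFIXES) - 1:
        x /= 1000
        exp += 1
    return f"{x:.2f}{_SUFFIXES[exp]}"

print(format_value(2.3e12))  # "2.30T"
```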
tutorials/profiler/generate.py (Outdated)
@@ -0,0 +1,339 @@
import pytest
This feels more like a script that users will call instead of a tutorial. I feel this should just live under torchao/profiler/generate.py?
same for the README
tutorials/profiler/tokenizer.py (Outdated)
@@ -0,0 +1,112 @@
import os
I agree with @msaroufim that we should not duplicate this code (and model.py). We already copied them from gpt-fast, so we should only have one version in torchao. Can you merge these with the existing ones under torchao/_models?
@jeromeku the only controversial thing about this PR is the code duplication; if we could reduce it, merging this would be a no-brainer.
We're planning a release on Aug 8; do you think you'll have time to land your changes before then?
tutorials/profiler/generate.py (Outdated)
@@ -0,0 +1,339 @@
import pytest
Nit: it's probably better to rename the file itself to something that includes profiler in the name as well; otherwise it will be a bit confusing, I think.
I think this looks great. One thing is that maybe we have to think about where to put the example code (which also requires the llama model definition). We have an existing place for the llama model: https://github.com/pytorch/ao/tree/main/torchao/_models/llama, which also contains eval and benchmark code; everything is under
@jerryzh168 @msaroufim @andrewor14 Thanks for the feedback -- will make the changes. Been caught up with some other things but should have some free time in the coming days.
@andrewor14 @jerryzh168 @msaroufim Made the following changes:
The CI failures aren't related to this PR...
I've seen this kind of large-scale IMA (illegal memory access) error when either
One trick that might help in the meantime, just for this PR: try rebasing to main to see if this issue repros, and if it doesn't, then change the GitHub Actions workflow regression test to only run your test and let's see if any issues still pop up. I also left a few misc pieces of feedback.
Finally, this PR would benefit from a simple README around where people should plug in their changes to run performance benchmarks. For example, this script, while significantly simpler, did help people run evals without much headache by just adding yet another if condition: https://github.com/pytorch/ao/blob/main/scripts/hf_eval.py
    dtype: Optional[torch.dtype] = None
    flops_by_dtype: dict = field(default_factory=dict)

    def _post_init_check(self):
This example was failing on these asserts for me. One set of failures was because the code was based off an old branch, and another set was because it seems like device_spec has non-optional parameters like dtype and flops_by_dtype:
```python
from torchao.profiler import CUDADeviceSpec, TransformerPerformanceCounter
import torch

# NOTE: `model`, `encoded_prompt`, and `x` are assumed to be defined elsewhere
# (the Transformer and its tokenized prompt), as in the full generate.py example.

# Device spec is a dataclass that contains device info such as name, peak bandwidth, peak FLOPs.
# If these fields are not manually specified, they will be automatically populated using
# CUDA runtime functions exposed by `torch.cuda` and `triton.runtime.driver`.
device_spec = CUDADeviceSpec()

# The manager object tracks latency, data movement (bytes), and FLOPs across multiple contexts
# and maintains performance metrics for each context and in aggregate.
manager = TransformerPerformanceCounter(device_spec=device_spec)

# Prefill
with manager.count(label="prefill", num_tokens=x.numel()):
    out = model(encoded_prompt)
# Print recorded stats for the "prefill" context
manager.print_summary(labels=["prefill"], show=True)

# Decode
with manager.count(label="decode", num_tokens=1):
    out = model(out[-1])
# Print recorded stats for the "decode" context
manager.print_summary(labels=["decode"], show=True)

# Print accumulated stats across all contexts
manager.print_summary(show=True)
```
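A possible workaround sketch for the assert failure, assuming the `dtype` field shown in the diff above is accepted as a constructor argument (unverified against the PR branch):

```python
import torch

from torchao.profiler import CUDADeviceSpec

# Supply dtype explicitly instead of relying on the Optional[...] default;
# flops_by_dtype could be passed similarly if the post-init check requires it.
device_spec = CUDADeviceSpec(dtype=torch.bfloat16)
```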
)

# -------------------- Device Spec Tests ------------------- #
DEVICE_NAMES = ["h100 sxm", "a100", "nvidia geforce rtx 4090"]
We only have A10G GPUs in CI; that might explain some of the CI failures. And we're exploring L4 instances next since those are cheaper and have fp8 support.
@@ -0,0 +1,442 @@
"""
This script is still quite similar to https://github.com/pytorch/ao/blob/main/torchao/_models/llama/generate.py, and I was hoping we could converge the two, or at the very least, if we are creating a separate profiling script, it should call as many functions from generate as possible.
@msaroufim
Add Performance Profiler
Initial implementation of a performance profiler per #426.
Overview
Primary contribution is a `TransformerPerformanceCounter`, which records data movement and FLOPs across multiple contexts. Combined with a `DeviceSpec`, theoretical peak performance / utilization stats are accumulated for each individual context and in aggregate, which can be used for Speed of Light / roofline analysis and other downstream performance profiling.

Motivation is to create a lightweight method for collecting useful performance stats using (mostly) `torch`-native features, as a complement to `torch.profiler` and before diving into tools such as `nsight compute`.

Details
Below is an example demonstrating the basic API.
- `CUDADeviceSpec` is a lightweight dataclass that models device info
  - uses `torch.cuda` and `triton.runtime.driver` CUDA APIs where possible
  - relies on `torchao.profiler.device_spec._AVAILABLE_GPU_SPECS` to fill in peak FLOPs, though this should be possible to calculate directly using the `cudaDriver` API (a rough sketch of such a calculation follows below).
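To illustrate the kind of calculation hinted at above, here is a back-of-the-envelope peak-FLOPs estimate from the SM count. The cores-per-SM and boost-clock numbers are assumptions for an A100, not values the PR reads from the driver, and the snippet requires a CUDA device:

```python
import torch

# Peak FP32 FLOPs ~= SM count x FP32 cores per SM x 2 (FMA) x boost clock.
props = torch.cuda.get_device_properties(0)
fp32_cores_per_sm = 64      # assumed for A100 (SM80)
boost_clock_hz = 1.41e9     # assumed A100 boost clock
peak_fp32_flops = props.multi_processor_count * fp32_cores_per_sm * 2 * boost_clock_hz
print(f"{peak_fp32_flops / 1e12:.1f} TFLOP/s")  # ~19.5 on an A100 (108 SMs)
```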
- `TransformerPerformanceCounter` uses `PerformanceCounterMode` under the hood to capture data movement and FLOPs
  - `PerformanceCounterMode` is an extended version of `torch.utils.flop_counter.FlopCounterMode` which counts data movement and FLOPs by `aten` operator, organized by `torch.nn.Module`, via `__torch_dispatch__` (see the sketch below).
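A simplified sketch of the dispatch-based counting idea referenced above; it shows only the data-movement half and is not torchao's actual `PerformanceCounterMode`:

```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode
from torch.utils._pytree import tree_flatten

class ByteCounterMode(TorchDispatchMode):
    """Tally bytes touched by each aten op dispatched while the mode is active."""

    def __init__(self):
        super().__init__()
        self.bytes_by_op = {}

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        out = func(*args, **(kwargs or {}))
        # Rough upper bound on data movement: total size of tensor inputs and outputs.
        flat, _ = tree_flatten((args, kwargs, out))
        nbytes = sum(t.numel() * t.element_size() for t in flat if isinstance(t, torch.Tensor))
        self.bytes_by_op[str(func)] = self.bytes_by_op.get(str(func), 0) + nbytes
        return out

model = torch.nn.Linear(128, 128)
x = torch.randn(8, 128)
with ByteCounterMode() as counter:
    model(x)
print(counter.bytes_by_op)  # e.g. {'aten.t.default': ..., 'aten.addmm.default': ...}
```

The real counter additionally counts FLOPs per op (as `FlopCounterMode` does) and attributes both to the enclosing `torch.nn.Module`.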
- Metrics are encapsulated by `PerformanceStats`:
- In addition to the raw / derived metrics, a `TransformerPerformanceCounter` also has convenience methods for summarizing accumulated stats. From the above example, this will print:
Tests
See `test/profiler/test_device_spec.py` and `test/profiler/test_performance_counter.py` for unit tests for each of these components.

Usage
An end-to-end example is available in `tutorials/profiler`: the `generate.py` script from `gpt-fast` with prettier printing, using `TransformerPerformanceCounter`.

Running the example for `llama2-7b` (on an `RTX 3090`) prints the following, with the outputs from `TransformerPerformanceCounter` annotated as such, and those from the original `gpt-fast` script prepended with `GPTFast`:

TODO
- More detailed metrics?
  - `CUPTI` / `ncu`
  - `torch.profiler` (see the sketch below)
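As a hedged illustration of the `torch.profiler` TODO item above, a kernel-level trace can complement the coarse-grained counters; the model and input below are placeholders:

```python
import torch
from torch.profiler import ProfilerActivity, profile

model = torch.nn.Linear(1024, 1024)  # placeholder model
x = torch.randn(16, 1024)            # placeholder input

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)
    model, x = model.cuda(), x.cuda()

with profile(activities=activities) as prof:
    model(x)

print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```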