Add QAT support for distributed finetuning #980

Merged 1 commit into pytorch:main on Jun 27, 2024

Conversation

@andrewor14 (Contributor) commented May 14, 2024

Summary: This commit adds the option to run quantization-aware training (QAT) during finetuning. QAT refers to "fake quantizing" the weights and activations during training, which performs the following transformation on the inputs but still keeps all intermediate values in floating point:

x_q = clamp((x_bf16 / scale) + zp)
x_fq = (x_q - zp) * scale
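
For illustration only, here is a minimal fake-quant round trip in plain PyTorch following the two lines above. The explicit round() and the signed 4-bit qmin/qmax bounds are assumptions added for this sketch; the actual torchao 8da4w implementation uses per-token and per-group scales and differs in details:

```python
import torch

def fake_quantize(x_bf16: torch.Tensor, scale: float, zp: int,
                  qmin: int = -8, qmax: int = 7) -> torch.Tensor:
    # Snap to the integer grid (round + clamp), then map back to floating point.
    x_q = torch.clamp(torch.round(x_bf16 / scale) + zp, qmin, qmax)
    x_fq = (x_q - zp) * scale
    return x_fq.to(x_bf16.dtype)

w = torch.randn(32, 32, dtype=torch.bfloat16)
w_fq = fake_quantize(w, scale=w.abs().max().item() / 7, zp=0)
```

During QAT the model trains on the fake-quantized values instead of the raw weights, so the weights learn to compensate for quantization error while everything stays in bf16.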

Currently only 8-bit per-token dynamic activations + 4-bit grouped per-channel weights (8da4w) is supported. Users can enable this by specifying a QAT quantizer in their config files:

tune run --nnodes 1 --nproc_per_node 8 qat_distributed --config llama3/8B_qat_full

# or add this to your config file
# quantizer:
#   _component_: torchtune.utils.quantization.Int8DynActInt4WeightQATQuantizer
#   groupsize: 256

Test Plan:

Initial results for Llama2 demonstrate that QAT recovers roughly half of the accuracy lost to quantization on some tasks (last two rows):

|                 | hellaswag |          | wikitext        |                 |               | arc_easy |          | arc_challenge |          |
|-----------------|:---------:|----------|:---------------:|-----------------|---------------|:--------:|----------|:-------------:|----------|
|                 | acc       | acc_norm | word_perplexity | byte_perplexity | bits_per_byte | acc      | acc_norm | acc           | acc_norm |
| No quantization | 59.659%   | 76.927%  | 12.183          | 1.596           | 0.674         | 76.010%  | 72.054%  | 48.720%       | 47.867%  |
| PTQ             | 57.150%   | 74.945%  | 12.995          | 1.615           | 0.692         | 75.968%  | 70.118%  | 46.416%       | 45.904%  |
| QAT (bf16)      | 58.435%   | 76.190%  | 12.200          | 1.596           | 0.675         | 76.810%  | 72.180%  | 47.270%       | 47.184%  |
| QAT (quantized) | 58.504%   | 76.170%  | 12.199          | 1.596           | 0.675         | 76.431%  | 71.928%  | 47.184%       | 48.123%  |
| PTQ degradation | -2.509%   | -1.982%  | +0.812          | +0.019          | +0.018        | -0.042%  | -1.936%  | -2.304%       | -1.963%  |
| QAT degradation | -1.155%   | -0.757%  | +0.016          | +0.000          | +0.001        | +0.421%  | -0.126%  | -1.536%       | +0.256%  |

pytorch-bot (bot) commented May 14, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/980

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c110f45 with merge base c1c9f30:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on May 14, 2024
@andrewor14 (Contributor Author) commented:

By the way, the tests are failing due to torchtune's dependency on an old version of torchao, which doesn't have QAT support yet. We're about to release 0.2.0 on the torchao side (aiming early next week), so the tests won't pass until then.

@kartikayk (Contributor) left a comment:

Thanks for adding this @andrewor14! A few high-level comments apart from the one on teasing this out into a separate recipe:

Currently only 8-bit per token dynamic activations + 4-bit grouped per channel weights (8da4w) is supported

A couple of questions:

  • Do we plan on adding other quantization methods? Or what's the long-term support plan for this?
  • We need a lot of documentation here to make sure users understand what this actually means. Is this reasonably well understood by the general audience? For example, I'm not sure about all of the details here.

We also need to consider how to lower the bar for adoption here. I think this needs a tutorial or a deep dive added to the torchtune docs.

cc: @ebsmothers

@@ -116,6 +116,8 @@ def __init__(self, cfg: DictConfig) -> None:
# Training cfg
self._resume_from_checkpoint = cfg.resume_from_checkpoint
self._gradient_accumulation_steps = cfg.gradient_accumulation_steps
self._qat_enable_fake_quant_step = cfg.get("qat_enable_fake_quant_step", None)
Contributor:

I think this can be a bit more user friendly i.e. something like enable_qat? Or does that not make sense?

Contributor Author:

This config is about when to enable fake quant, not whether to enable QAT itself. E.g. setting this to 1000 means we will run regular finetuning for the first 1000 steps, and only enable fake quant after 1000 steps. I'll add better comments/docs about this. Do you think the name makes sense or do you have suggestions?
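
For reference, here is a minimal sketch (not the actual recipe code) of how a delayed-start flag like this could be wired into a training loop. The `enable_fake_quant` / `disable_fake_quant` callables are hypothetical stand-ins for whatever toggles the quantizer exposes:

```python
from typing import Callable, Iterable, Optional

import torch

def train_with_delayed_fake_quant(
    model: torch.nn.Module,
    batches: Iterable[dict],
    optimizer: torch.optim.Optimizer,
    loss_fn: Callable[[torch.nn.Module, dict], torch.Tensor],
    enable_fake_quant: Callable[[torch.nn.Module], None],
    disable_fake_quant: Callable[[torch.nn.Module], None],
    fake_quant_after_n_steps: Optional[int] = None,
) -> None:
    # Start with plain bf16 finetuning if a delay is configured.
    if fake_quant_after_n_steps is not None:
        disable_fake_quant(model)
    for step, batch in enumerate(batches):
        # Turn fake quantization on once the configured step is reached.
        if fake_quant_after_n_steps is not None and step == fake_quant_after_n_steps:
            enable_fake_quant(model)
        loss = loss_fn(model, batch)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```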

Contributor:

Ah, then I grossly misunderstood :) Maybe something like quant_after_n_steps?

Contributor:

What is the motivation of delaying the fake quantization until after N steps? Are there issues with stability (and if so, how does training without fake quantization first mitigate them)?

@@ -288,6 +292,18 @@ def _setup_model(
ac_option,
)

# Optionally apply quantization-aware training during finetuning
Contributor:

This change is intrusive enough that I'd prefer this to be a separate recipe where you can remove all of the non-QAT-related code paths. Generally we don't want to have recipes with a bunch of if-else blocks, since this:

  • reduces readability of code
  • significantly increases the chances of bugs as recipes become more complicated
  • makes maintenance really hard

Contributor Author:

Sure, I can move this out to a separate recipe. However, this will require copying and pasting all the non-QAT-related training code, and over time the new recipe will likely diverge from the full_finetune_distributed recipe. If you think that is preferable to complicating this existing recipe, then I'll go ahead and separate it.

Contributor:

We've been pretty good at making sure we update all of the recipes with new features. I do think QAT is something we want to publicize heavily and so having its own recipe opens up avenues for future work as well

@andrewor14 (Contributor Author) commented:

Hi @kartikayk, thanks for the comments, responding inline:

  • Do we plan on adding other quantization methods? Or what's the long-term support plan for this?

Yes, in the long term we do plan to support other QAT configurations (e.g. 2- or 3-bit weight only if we can get good results), that's why I kept the quantizer specification general.

  • We need a lot of documentation here to make sure users understand what this actually means. Is this reasonably well understood by the general audience? For example, I'm not sure about all of the details here. We also need to consider how to lower the bar for adoption here. I think this needs a tutorial or a deep dive added to the torchtune docs.

For sure. Should I add the README in this PR and add the tutorial separately?

@kartikayk (Contributor) commented:

Should I add the README in this PR and add the tutorial separately?

I think a tutorial/deep-dive in the docs would be really helpful. You can add some details on what the quantization methods mean as well; I think this will be very useful and make the flow more noob-friendly.

Here are some pointers:

I don't mind this as a follow-up.

@ebsmothers let me know if you have differing thoughts on this.

@ebsmothers (Contributor) commented:

Thanks for the PR! I think I'm in agreement with most everything that's been said already: (1) separate recipe for QAT (I agree this will make it easier to scale once we add other quant techniques anyways) and (2) add some kind of tutorial but as a follow-up (also happy to provide any pointers or guidance you need in advance).

@@ -0,0 +1,78 @@
# Config for multi-device QAT finetuning in qat_distributed.py
Contributor:

We should figure out where we wanna put QAT configs. I guess it's somewhat different from our current configs layout: right now we segment by model at the top level. For now we only have one technique, but if we plan to support more we should think about how to split it up. I can think of 3 ways to do this:

  1. Provide a single QAT config per technique and keep them all at the top level (e.g. qat_full.yaml, qat_lora.yaml, ...)
  2. Provide QAT configs per model (so llama3/8B_qat_full_finetune.yaml etc.)
  3. Provide a separate QAT folder and put everything in there with defaults chosen for a canonical model (basically (1) but under a qat config directory)

That's not a major blocker for this PR, but lmk which one makes most sense to you. I would, however, consider at least renaming qat.yaml -> qat_full.yaml or something like that.

Contributor Author:

Sounds good. I mostly just followed quantization.yaml, eleuther_evaluation.yaml, and generation.yaml so far in this PR. I think these configs have the same problem. Both (2) and (3) make sense to me. What do you think?

Contributor Author:

Ok, I did (2) for now (put it in respective llama2 and llama3 dirs). Let me know if this sounds reasonable to you

Contributor:

Yeah I think this is reasonable. It is a bit of a pain to override all the necessary config fields to change models from the command line, so I think it makes sense to separate out QAT configs by model (since for (3) it'd get really verbose to change the model from the default config anyways). The downside is the configs are slightly less visible, but given we have a standalone top-level recipe this is OK imo.

pyproject.toml Outdated
@@ -25,7 +25,8 @@ dependencies = [
"omegaconf",

# Quantization
"torchao==0.1",
# TODO: update to 0.3
"torchao==0.2",
Contributor:

What's the plan for merging here? Will we wait until 0.3 is available? Or will we merge sooner on a nightly?

Contributor Author:

I think we'll wait till 0.3 is available

Resolved review threads on: recipes/qat_distributed.py, recipes/quantization.md, recipes/quantize.py
Comment on lines +107 to +112
if "qat" in self._quantization_mode:
self._model = self._quantizer.convert(self._model)
else:
self._model = self._quantizer.quantize(self._model)
Contributor:

So we need to gate on quantization mode because the QAT checkpoints are in bf16, right? Noob question but if we are not doing any subsequent training why can't we just call .quantize directly and infer all the quantizer params from the checkpoint?

Contributor Author:

Technically we can, and the numerics may be the same, but officially the torchao QAT flow is:

quantizer = 8da4wQATQuantizer()
model = quantizer.prepare(model)
train(model)
model = quantizer.convert(model)

If we just called quantize here, we would have to introduce a different quantizer:

quantizer = 8da4wQATQuantizer()
model = quantizer.prepare(model)
train(model)
ptq_quantizer = 8da4wQuantizer()
ptq_quantizer.quantize(model)

I feel it's better to call the complete QAT flow rather than switch quantizers in the middle.
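
For concreteness, a minimal sketch of that flow using the quantizer named in the config earlier in this thread; the import path is the torchtune alias shown in that config, and `train_fn` is a placeholder for your finetuning loop:

```python
from torch import nn
from torchtune.utils.quantization import Int8DynActInt4WeightQATQuantizer

def qat_flow(model: nn.Module, train_fn) -> nn.Module:
    # prepare(): insert fake-quantization logic into the model before training.
    quantizer = Int8DynActInt4WeightQATQuantizer(groupsize=256)
    model = quantizer.prepare(model)
    # Finetune with fake quantization enabled.
    train_fn(model)
    # convert(): replace the fake-quant modules with actually quantized ones.
    return quantizer.convert(model)
```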

@andrewor14 (Contributor Author) commented:

Update: I think I've addressed all the comments and this PR is ready from my side. Do you have other comments @ebsmothers @kartikayk?

Note that landing is blocked right now on the torchao 0.3 release (currently scheduled for 6/26). This is because QAT was only added in torchao 0.2, but the following error was not fixed until torchao 0.3, so there's no other way to get the QAT feature unless we want to rely on nightlies, which we don't.

  File "/__w/_temp/conda_environment_9553124795/lib/python3.8/site-packages/torchtune/modules/common_utils.py", line 12, in <module>
    from torchao.dtypes.nf4tensor import NF4Tensor
  File "/__w/_temp/conda_environment_9553124795/lib/python3.8/site-packages/torchao/__init__.py", line 14, in <module>
    from . import _C
ImportError: /__w/_temp/conda_environment_9553124795/lib/python3.8/site-packages/torchao/_C.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN5torch3jit11parseSchemaERKSs

@codecov-commenter commented Jun 27, 2024

Codecov Report

Attention: Patch coverage is 9.32836% with 243 lines in your changes missing coverage. Please review.

Project coverage is 64.89%. Comparing base (52e3283) to head (c110f45).
Report is 1 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| recipes/qat_distributed.py | 0.00% | 212 Missing ⚠️ |
| tests/recipes/test_qat_distributed.py | 48.57% | 18 Missing ⚠️ |
| torchtune/utils/quantization.py | 46.15% | 7 Missing ⚠️ |
| recipes/quantize.py | 0.00% | 5 Missing ⚠️ |
| tests/recipes/test_configs.py | 66.66% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #980      +/-   ##
==========================================
- Coverage   66.69%   64.89%   -1.81%     
==========================================
  Files         184      186       +2     
  Lines        8578     8838     +260     
==========================================
+ Hits         5721     5735      +14     
- Misses       2857     3103     +246     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ebsmothers merged commit fd7c15f into pytorch:main on Jun 27, 2024
29 checks passed
maximegmd pushed a commit to maximegmd/torchtune that referenced this pull request Jul 13, 2024