Gradient stability in QLoRA: ablation data comparing fixed vs adaptive clipping (Unsloth baseline) #4089
Replies: 3 comments 2 replies
Excellent ablation study! At RevolutionAI (https://revolutionai.io) we have been running similar experiments with QLoRA gradient stability. Our findings align:
Additional factors we tested:
    # Our hybrid approach
    if step < warmup_steps:
        clip = adaptive_clip(grads)
    else:
        clip = frozen_clip_value  # Set from warmup avg

Would love to see your data on different model sizes. Does the pattern hold for 70B+?
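For readers who want to try the hybrid schedule, here is a minimal, framework-free sketch of one way it could work. It is not the commenter's actual code: `adaptive_clip` is modeled here as mean + k*std of the norms seen so far, the frozen value is taken as the average of the warmup thresholds, and `HybridClip`, `k`, and `fallback` are all hypothetical names and defaults.

```python
from statistics import mean, pstdev


def adaptive_clip(norm_history, k=2.0, fallback=1.0):
    """Hypothetical helper: mean + k*std of the gradient norms seen so far."""
    if len(norm_history) < 2:
        return fallback  # not enough data yet, use a fixed default
    return mean(norm_history) + k * pstdev(norm_history)


class HybridClip:
    """Adaptive threshold during warmup, frozen threshold afterwards."""

    def __init__(self, warmup_steps, k=2.0, fallback=1.0):
        self.warmup_steps = warmup_steps
        self.k = k
        self.fallback = fallback
        self.norms = []          # raw gradient norms observed during warmup
        self.warmup_clips = []   # thresholds used during warmup
        self.frozen_clip_value = None

    def threshold(self, step, grad_norm):
        """Return the clipping threshold to use at this step."""
        if step < self.warmup_steps:
            clip = adaptive_clip(self.norms, self.k, self.fallback)
            self.norms.append(grad_norm)
            self.warmup_clips.append(clip)
            return clip
        if self.frozen_clip_value is None:
            # Freeze once, using the average threshold from warmup.
            self.frozen_clip_value = (
                mean(self.warmup_clips) if self.warmup_clips else self.fallback
            )
        return self.frozen_clip_value
```

In a PyTorch loop the returned value would typically be passed as `max_norm` to `torch.nn.utils.clip_grad_norm_`. Note that once frozen, the threshold no longer reacts to late-training spikes.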
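This is really solid work. The step-44 spike being reproducible across seeds is a great finding: it points to systematic behavior rather than random instability.

*Why adaptive clipping works better (hypothesis):*
With 4-bit quantization, gradient errors are not uniformly distributed; they correlate with activation magnitudes. Fixed clipping assumes a static gradient distribution, but QLoRA gradients have a different shape at each stage:
- Early training: high variance, frequent outliers
- Mid training: settling, fewer outliers
- Late training: occasional spikes when the model "jumps" to new regions

Adaptive clipping tracks this shift. Fixed clipping either clips too aggressively early (slowing convergence) or too weakly late (allowing spikes).

*Additional stabilization tricks:*
1. *Gradient accumulation smoothing*: instead of a sum, use an EMA, e.g. `accumulated = 0.9 * accumulated + 0.1 * grad`.
2. *Per-layer clipping*: some layers spike more than others. Clip per-layer rather than globally.
3. *Loss scale warmup*: start with a smaller LR for the first 100 steps, then ramp up.

*Question:* Did you see the spike at step-44 across different sequence lengths? Wondering if it's related to some periodic batch pattern.

We have run into similar gradient stability issues fine-tuning QLoRA models at Revolution AI and will definitely try the adaptive clipping. Thanks for the HF Space!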
Thanks! Good observations on the batch size / rank interactions; we've seen the same.

Re: your hybrid approach. The issue with freezing the clip value is that late-training spikes still occur (you noted this yourself). ZClip's z-score tracking handles this because it adapts continuously, not just during warmup.

Re: step-44 across sequence lengths. Good question; we haven't isolated that variable yet. The spike is reproducible across seeds, which suggests it's tied to the loss landscape geometry at that training stage, not batch ordering. Worth testing though.

70B results are on our roadmap. Will share when we have them.

The per-layer clipping point is well taken. ZClip already operates per-parameter, so it naturally captures layer-level variance.

On Mon, 23 Feb 2026 at 08:33, Corey nida ***@***.***> wrote:
This is really solid work. The step-44 spike being reproducible across
seeds is a great finding — points to systematic behavior rather than random
instability.
*Why adaptive clipping works better (hypothesis):*
With 4-bit quantization, gradient errors are not uniformly distributed —
they correlate with activation magnitudes. Fixed clipping assumes a static
gradient distribution, but QLoRA gradients have a different shape:
- Early training: high variance, frequent outliers
- Mid training: settling, fewer outliers
- Late training: occasional spikes when the model "jumps" to new
regions
Adaptive clipping tracks this shift. Fixed clipping either clips too
aggressively early (slowing convergence) or too weakly late (allowing
spikes).
*Additional stabilization tricks:*
1. *Gradient accumulation smoothing*
       # Instead of sum, use EMA
       accumulated = 0.9 * accumulated + 0.1 * grad
2. *Per-layer clipping*
   Some layers spike more than others. Clip per-layer rather than globally.
3. *Loss scale warmup*
   Start with smaller LR for first 100 steps, then ramp up.
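
Per-layer clipping (item 2 above) is easy to get subtly wrong, so here is a minimal sketch in plain Python. It assumes each layer's gradient is available as a flat list of floats keyed by a layer name. `clip_per_layer` and the layer names in the usage note are hypothetical and belong to no library.

```python
import math


def clip_per_layer(grads_by_layer, max_norm):
    """Scale each layer's gradient so its own L2 norm is at most max_norm."""
    clipped = {}
    for name, grad in grads_by_layer.items():
        norm = math.sqrt(sum(g * g for g in grad))
        # Only shrink layers whose norm exceeds the limit; never scale up.
        scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
        clipped[name] = [g * scale for g in grad]
    return clipped
```

For example, `clip_per_layer({"q_proj": [3.0, 4.0], "v_proj": [0.3, 0.4]}, max_norm=1.0)` rescales only the first layer (norm 5.0) and leaves the second (norm 0.5) untouched, whereas a single global clip would shrink both by the same factor.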
*Question:* Did you see the spike at step-44 across different sequence
lengths? Wondering if it's related to some periodic batch pattern.
We have run into similar gradient stability issues fine-tuning QLoRA
models at Revolution AI <https://revolutionai.io/> — will definitely try
the adaptive clipping. Thanks for the HF Space!
Sharing some ablation data that might be useful for anyone hitting gradient instability during QLoRA fine-tuning with Unsloth.
Background
I've been running experiments on a reproducible gradient norm spike on Mistral-7B with QLoRA. A specific spike appears at step ~44 on every run with the same seed, with a peak gradient norm (gn) of 15.28 against a typical baseline of ~1.0. It's not dataset-specific: the same spike appears across different datasets with the same architecture + config.
The root cause appears to be 4-bit quantization error accumulating through the backward pass in a way that a fixed `max_grad_norm` can't fully absorb, because the threshold is set before the run sees its actual norm distribution.
What I Tested
I ran n=5 ablations comparing three configurations on Mistral-7B (TinyLlama also tested, similar results):
Adaptive clipping: compute the mean and std of gradient norms over a rolling window of recent steps. If the current step's norm exceeds `mean + k*std` (k=2.0 by default), clip to that threshold instead of a fixed value. This auto-calibrates as training progresses.
Why This Matters for Unsloth Users
Unsloth's speed optimizations are excellent and I'm not suggesting changing Unsloth's core. But the gradient clipping behaviour inherited from the Trainer config is a potential silent failure mode, especially on:
Free Tool for Testing
I built a free HuggingFace Space that lets you test adaptive clipping on your own runs without local GPU: https://huggingface.co/spaces/Fourwheels2512/crma-fine-tuner
Detailed write-up on the step-44 spike mechanism: https://dev.to/fourwheels2512/why-qlora-produces-a-gradient-norm-spike-at-step-44-on-mistral-7b-and-how-to-fix-it-141h
Happy to share the raw ablation configs if useful. Curious whether others have hit this with Unsloth specifically and whether the Trainer `max_grad_norm` config is the right lever to surface more prominently in docs.
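
For anyone who wants to experiment with the adaptive rule described in this post (rolling mean + k*std), here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the code behind the ablations: the class name `RollingAdaptiveClipper`, the window size, and the `min_history` guard are hypothetical, and only k=2.0 comes from the post.

```python
from collections import deque
from statistics import mean, pstdev


class RollingAdaptiveClipper:
    """Clip gradient norms to mean + k*std over a rolling window."""

    def __init__(self, window=20, k=2.0, min_history=5):
        self.norms = deque(maxlen=window)  # most recent (post-clip) norms
        self.k = k
        self.min_history = min_history     # steps to observe before clipping

    def clip_norm(self, grad_norm):
        """Return the norm to use for this step after adaptive clipping."""
        if len(self.norms) >= self.min_history:
            threshold = mean(self.norms) + self.k * pstdev(self.norms)
            clipped = min(grad_norm, threshold)
        else:
            clipped = grad_norm  # too little history to estimate a threshold
        # Store the clipped value so one spike does not inflate later stats.
        self.norms.append(clipped)
        return clipped
```

To apply it to real gradients, the gradient would be scaled by `clipped / grad_norm`; in PyTorch the same effect is usually obtained by passing the threshold as `max_norm` to `torch.nn.utils.clip_grad_norm_`. Whether to store raw or clipped norms in the window is a design choice; storing raw norms lets a single spike loosen the threshold for the next `window` steps.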