Understanding the intricate non-convex training dynamics of softmax-based models is crucial for explaining the empirical success of transformers. In this article, we analyze the gradient flow dynamics of the value-softmax model (defined in Section 3 of the paper).
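The paper's central claim — that gradient flow polarizes softmax outputs toward low-entropy, near one-hot distributions — can be illustrated with a deliberately simplified sketch. The toy example below is ours, not the paper's value-softmax model: plain gradient descent on a cross-entropy loss over raw logits, where the entropy of the softmax output shrinks over training.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    return -np.sum(p * np.log(p + 1e-12))

# Toy setup (illustrative only, NOT the paper's value-softmax model):
# minimize the cross-entropy -log softmax(z)[target] over logits z
# by gradient descent and track the entropy of the softmax output.
z = np.zeros(5)           # uniform start: softmax(z) has maximal entropy
target = 0
lr = 0.5
entropies = []
for step in range(200):
    p = softmax(z)
    entropies.append(entropy(p))
    grad = p.copy()
    grad[target] -= 1.0   # gradient of cross-entropy w.r.t. the logits
    z -= lr * grad

print(entropies[0], entropies[-1])  # entropy decays from log(5) toward 0
```

The output distribution polarizes toward a one-hot vector, so its entropy decays toward zero — the qualitative behavior the paper analyzes in a much more general setting.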
The paper is supported by a range of experiments with small task-specific models, as well as one experiment analyzing pretrained LLMs (Figure 7 of the paper). The LLM experiment requires considerably more effort to set up and run, hence we provide the recipes for reproducing the experiments separately for small models and for LLMs.
To reproduce the experiments with small models, do the following:

- Install uv.
- Create the Python environment: in the root of the repository, run `uv sync`.
- For the experiments with the value-softmax model (Section 3 of the paper), run `visualizations/toy_experiments.ipynb`.
- Create an account in Weights & Biases if you don't have one.
- For the experiments with induction heads (Section 4.1), run the wandb sweep:
  ```
  uv run wandb sweep sweeps/induction.yaml
  uv run wandb agent <SWEEP_ID>
  ```
- For the experiments with classification (Section 4.2):
  - Run the wandb sweep:
    ```
    uv run wandb sweep sweeps/classification.yaml
    uv run wandb agent <SWEEP_ID>
    ```
  - Run `visualizations/classification_with_sample_tracking.ipynb`.
To reproduce the experiment with LLMs, do the following:

- Clone axlearn-mod into a separate folder.
- Install uv.
- Create the Python environment with the axlearn dependency: `uv sync --extra axlearn`.
- Install the `gsutil` tool (e.g., as part of the Google Cloud CLI).
- Download the softmax and sigmoid models and run inference:
  ```
  uv run python pretrained.py --model softmax --n-samples 1000
  uv run python pretrained.py --model sigmoid --n-samples 1000
  ```
- To visualize the results, run `visualizations/pretrained.ipynb`.
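For context on the softmax/sigmoid distinction probed above: softmax normalizes attention scores into a probability distribution (the weights sum to 1 and are coupled across positions), whereas an elementwise sigmoid squashes each score independently, with no sum-to-one constraint. A minimal numpy sketch of the two (illustrative only; it does not reflect the internals of `pretrained.py`):

```python
import numpy as np

scores = np.array([2.0, 1.0, -1.0])  # example attention scores

# Softmax: coupled normalization -> a probability distribution over positions.
softmax_w = np.exp(scores) / np.exp(scores).sum()

# Sigmoid: each score squashed independently -> no sum-to-one constraint.
sigmoid_w = 1.0 / (1.0 + np.exp(-scores))

print(softmax_w.sum())   # sums to 1 (up to float error)
print(sigmoid_w.sum())   # generally != 1
```

Because the softmax weights are constrained to a probability simplex, they can polarize toward a low-entropy (near one-hot) distribution; the sigmoid weights have no such coupling, which is what the comparison in Figure 7 exploits.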
If you find our work useful for your research, please cite it as

```bibtex
@article{varre2026gradient,
  title={Gradient Flow Polarizes Softmax Outputs towards Low-Entropy Solutions},
  author={Varre, Aditya and Rofin, Mark and Flammarion, Nicolas},
  journal={arXiv preprint arXiv:2603.06248},
  year={2026}
}
```
