tml-epfl/softmax

Gradient Flow Polarizes Softmax Outputs towards Low-Entropy Solutions

Read the paper

Abstract

Understanding the intricate non-convex training dynamics of softmax-based models is crucial for explaining the empirical success of transformers. In this article, we analyze the gradient flow dynamics of the value-softmax model, defined as $\mathcal{L}(\mathbf{V} \sigma(\mathbf{a}))$, where $\mathbf{V}$ and $\mathbf{a}$ are a learnable value matrix and attention vector, respectively. As the matrix-times-softmax-vector parameterization constitutes the core building block of self-attention, our analysis provides direct insight into transformers' training dynamics. We reveal that gradient flow on this structure inherently drives the optimization toward solutions characterized by low-entropy outputs. We demonstrate the universality of this polarizing effect across various objectives, including logistic and square loss. Furthermore, we discuss the practical implications of these theoretical results, offering a formal mechanism for empirical phenomena such as attention sinks and massive activations.
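The value-softmax model above can be sketched numerically. The snippet below is a minimal illustration, not the paper's code: it runs plain gradient descent on the square loss $\frac{1}{2}\|\mathbf{V}\sigma(\mathbf{a}) - \mathbf{y}\|^2$ and tracks the entropy of the softmax output $\sigma(\mathbf{a})$. All dimensions, the step size, the initialization scale, and the target $\mathbf{y}$ are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 6                        # output dim and number of logits (arbitrary)
V = 0.1 * rng.normal(size=(d, n))  # learnable value matrix
a = 0.1 * rng.normal(size=n)       # learnable attention (logit) vector
y = rng.normal(size=d)             # fixed regression target (illustrative)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def entropy(p):
    return -np.sum(p * np.log(p + 1e-12))

lr, steps = 0.05, 5000
losses, entropies = [], []
for _ in range(steps):
    s = softmax(a)
    r = V @ s - y                      # residual of 0.5 * ||V s - y||^2
    losses.append(0.5 * r @ r)
    entropies.append(entropy(s))
    J = np.diag(s) - np.outer(s, s)    # Jacobian of softmax at a
    grad_V = np.outer(r, s)            # dL/dV
    grad_a = J @ (V.T @ r)             # dL/da (chain rule through softmax)
    V -= lr * grad_V
    a -= lr * grad_a

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3g}")
print(f"softmax entropy: {entropies[0]:.3f} -> {entropies[-1]:.3f}")
```

Per the paper's result, one would expect the entropy trace to drift toward low values along the gradient flow; how pronounced this is in a finite-step run depends on initialization, step size, and the objective, so consult the paper (and `visualizations/toy_experiments.ipynb`) for the precise conditions.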

Reproducing the experiments

The paper is supported by a range of experiments with small task-specific models, as well as one experiment analyzing pretrained LLMs (Figure 7 of the paper). The LLM experiment requires considerably more effort to set up and run, so we provide separate recipes for reproducing the experiments with small models and with LLMs.

To reproduce the experiments with small models, do the following:

  1. Install uv.

  2. Create the Python environment: in the root of the repository run uv sync.

  3. For the experiments with value-softmax model (Section 3 of the paper), run visualizations/toy_experiments.ipynb.

  4. Create an account in Weights & Biases if you don't have one.

  5. For the experiments with induction heads (Section 4.1):

    1. Run the wandb sweep:

      uv run wandb sweep sweeps/induction.yaml
      uv run wandb agent <SWEEP_ID>
      
    2. Run visualizations/induction.ipynb.

    3. Run visualizations/induction_save_attn_patterns.ipynb.

  6. For the experiments with classification (Section 4.2):

    1. Run the wandb sweep:

      uv run wandb sweep sweeps/classification.yaml
      uv run wandb agent <SWEEP_ID>
      
    2. Run visualizations/classification.ipynb.

    3. Run visualizations/classification_with_sample_tracking.ipynb.


To reproduce the experiment with LLMs, do the following:

  1. Clone axlearn-mod into a separate folder.

  2. Install uv.

  3. Create the Python environment with the axlearn dependency:

    uv sync --extra axlearn
    
  4. Install the gsutil tool (e.g., as part of the Google Cloud CLI).

  5. Run the downloading and inference of softmax and sigmoid models:

    uv run python pretrained.py --model softmax --n-samples 1000
    uv run python pretrained.py --model sigmoid --n-samples 1000
    
  6. To visualize the results, run visualizations/pretrained.ipynb.


Citation information

If you find our work useful for your research, please cite it as

@article{varre2026gradient,
  title={Gradient Flow Polarizes Softmax Outputs towards Low-Entropy Solutions},
  author={Varre, Aditya and Rofin, Mark and Flammarion, Nicolas},
  journal={arXiv preprint arXiv:2603.06248},
  year={2026}
}

About

Official code for the paper "Gradient Flow Polarizes Softmax Outputs towards Low-Entropy Solutions"
