On enwik8
python train.py
Optionally, use Dynamic Tanh (DyT) instead of RMSNorm or LayerNorm
Ref: Transformers without Normalization
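A minimal sketch of the DyT layer as described in the paper: a learnable scalar alpha inside a tanh, followed by a per-channel scale and shift, used as a drop-in replacement for the normalization layer (the init_alpha default of 0.5 follows the paper's suggested starting value; the class name is hypothetical):

import torch
import torch.nn as nn

class DynamicTanh(nn.Module):
    # Drop-in replacement for LayerNorm/RMSNorm: y = gamma * tanh(alpha * x) + beta
    def __init__(self, dim, init_alpha=0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), init_alpha))  # learnable scalar
        self.gamma = nn.Parameter(torch.ones(dim))                # per-channel scale
        self.beta = nn.Parameter(torch.zeros(dim))                # per-channel shift

    def forward(self, x):
        return self.gamma * torch.tanh(self.alpha * x) + self.beta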
Speed test: Colab
python train_infini.py
On MNIST
RoPE as the spatial position embedding in 2D attention
python train_dit.py
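A minimal sketch of axial 2D RoPE under the usual formulation (an assumption, not necessarily this repo's exact code): half of the channels are rotated by the row index, the other half by the column index:

import torch

def rotate_half(x):
    # Split channels in half and rotate: (x1, x2) -> (-x2, x1)
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def axial_rope_2d(x, h, w, base=10000.0):
    # x: (h * w, dim) with dim divisible by 4; first half of the channels is
    # rotated by the row index, second half by the column index
    half = x.shape[-1] // 2
    def angles(pos, d):
        freqs = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
        a = pos.float()[:, None] * freqs[None, :]
        return torch.cat((a, a), dim=-1)  # (n, d)
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    ay, ax = angles(ys.reshape(-1), half), angles(xs.reshape(-1), half)
    xr, xc = x[..., :half], x[..., half:]
    xr = xr * ay.cos() + rotate_half(xr) * ay.sin()
    xc = xc * ax.cos() + rotate_half(xc) * ax.sin()
    return torch.cat((xr, xc), dim=-1)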
Sin-cos embedding before 3D attention
python train_dit_3d.py
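A minimal sketch of a factorized 3D sin-cos positional embedding (the per-axis channel split is an assumption; the repo's layout may differ): the channel dimension is divided across the time, height, and width axes, each with a standard 1D sin-cos embedding:

import torch

def sincos_1d(pos, dim, base=10000.0):
    # Standard 1D sin-cos embedding for one axis: (n,) -> (n, dim), dim even
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    a = pos.float()[:, None] * freqs[None, :]
    return torch.cat((a.sin(), a.cos()), dim=-1)

def sincos_3d(t, h, w, dim):
    # dim divisible by 6: one third of the channels per (time, height, width) axis
    d = dim // 3
    ts, ys, xs = torch.meshgrid(torch.arange(t), torch.arange(h), torch.arange(w), indexing="ij")
    return torch.cat([sincos_1d(p.reshape(-1), d) for p in (ts, ys, xs)], dim=-1)  # (t*h*w, dim)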
The DiT block here has an additional dim_head parameter. Standard DiT:
# Head dimension is derived from hidden_size and num_heads
dim_head = hidden_size // num_heads
DiT variant:
# Head dimension is an independent parameter, so the attention width
# can be set independently of hidden_size
inner_dim = num_heads * dim_head
Model Parameter Efficiency
Standard DiT:
- Requires a large hidden_size (e.g., 256) to keep a reasonable attention head dimension
- The full model width is applied to every operation (feed-forward, projections, etc.)
DiT variant:
- Can use a much smaller hidden_size (e.g., 16) while keeping attention capacity
- Significantly reduces the parameter count in the tokenization and feed-forward layers (see the sketch below)
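As a concrete illustration, here is a minimal sketch of such a decoupled attention block (the class name is hypothetical; the defaults mirror the numbers above): tokens are projected from a small hidden_size up to num_heads * dim_head for attention, then back down:

import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledAttention(nn.Module):
    # Attention width num_heads * dim_head is chosen independently of hidden_size,
    # so hidden_size = 16 can still attend at width 4 * 32 = 128
    def __init__(self, hidden_size=16, num_heads=4, dim_head=32):
        super().__init__()
        self.num_heads, self.dim_head = num_heads, dim_head
        inner_dim = num_heads * dim_head
        self.to_qkv = nn.Linear(hidden_size, inner_dim * 3)  # project up
        self.to_out = nn.Linear(inner_dim, hidden_size)      # project back down

    def forward(self, x):  # x: (batch, seq, hidden_size)
        b, n, _ = x.shape
        qkv = self.to_qkv(x).view(b, n, 3, self.num_heads, self.dim_head)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each (batch, heads, seq, dim_head)
        out = F.scaled_dot_product_attention(q, k, v)
        return self.to_out(out.transpose(1, 2).reshape(b, n, -1))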
python train_flt.py
python train_shortcut.py
python train_unet.py