On enwik8
python train.py
Optionally, use Dynamic Tanh (DyT) instead of RMSNorm or LayerNorm
Ref: Transformers without Normalization
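A minimal sketch of the DyT layer as described in the paper: a learnable scalar alpha inside a tanh, followed by a per-channel scale and shift, used as a drop-in replacement for the normalization layer (the init_alpha default of 0.5 follows the paper's suggested starting value; the class name is hypothetical):

import torch
import torch.nn as nn

class DynamicTanh(nn.Module):
    # Drop-in replacement for LayerNorm/RMSNorm: y = gamma * tanh(alpha * x) + beta
    def __init__(self, dim, init_alpha=0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), init_alpha))  # learnable scalar
        self.gamma = nn.Parameter(torch.ones(dim))                # per-channel scale
        self.beta = nn.Parameter(torch.zeros(dim))                # per-channel shift

    def forward(self, x):
        return self.gamma * torch.tanh(self.alpha * x) + self.beta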
Speed test: Colab
python train_infini.py
On MNIST
RoPE as the spatial position embedding in 2D attention
python train_dit.py
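A minimal sketch of axial 2D RoPE under the usual formulation (an assumption, not necessarily this repo's exact code): half of the channels are rotated by the row index, the other half by the column index:

import torch

def rotate_half(x):
    # Split channels in half and rotate: (x1, x2) -> (-x2, x1)
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def axial_rope_2d(x, h, w, base=10000.0):
    # x: (h * w, dim) with dim divisible by 4; first half of the channels is
    # rotated by the row index, second half by the column index
    half = x.shape[-1] // 2
    def angles(pos, d):
        freqs = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
        a = pos.float()[:, None] * freqs[None, :]
        return torch.cat((a, a), dim=-1)  # (n, d)
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    ay, ax = angles(ys.reshape(-1), half), angles(xs.reshape(-1), half)
    xr, xc = x[..., :half], x[..., half:]
    xr = xr * ay.cos() + rotate_half(xr) * ay.sin()
    xc = xc * ax.cos() + rotate_half(xc) * ax.sin()
    return torch.cat((xr, xc), dim=-1)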
Sin-cos embedding before 3D attention
python train_dit_3d.py
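A minimal sketch of a factorized 3D sin-cos positional embedding (the per-axis channel split is an assumption; the repo's layout may differ): the channel dimension is divided across the time, height, and width axes, each with a standard 1D sin-cos embedding:

import torch

def sincos_1d(pos, dim, base=10000.0):
    # Standard 1D sin-cos embedding for one axis: (n,) -> (n, dim), dim even
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    a = pos.float()[:, None] * freqs[None, :]
    return torch.cat((a.sin(), a.cos()), dim=-1)

def sincos_3d(t, h, w, dim):
    # dim divisible by 6: one third of the channels per (time, height, width) axis
    d = dim // 3
    ts, ys, xs = torch.meshgrid(torch.arange(t), torch.arange(h), torch.arange(w), indexing="ij")
    return torch.cat([sincos_1d(p.reshape(-1), d) for p in (ts, ys, xs)], dim=-1)  # (t*h*w, dim)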
The DiT block here has an additional dim_head parameter. Standard DiT:
# Head dimension is derived from hidden_size and num_heads
dim_head = hidden_size // num_heads
DiT variant:
# Head dimension is an independent parameter, so the attention width
# can be set independently of hidden_size
inner_dim = num_heads * dim_head
Model Parameter Efficiency
Standard DiT:
- Requires a large hidden_size (e.g., 256) to keep a reasonable attention head dimension
- The full model width is applied to every operation (feed-forward, projections, etc.)
DiT variant:
- Can use a much smaller hidden_size (e.g., 16) while keeping attention capacity
- Significantly reduces the parameter count in the tokenization and feed-forward layers (see the sketch below)
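As a concrete illustration, here is a minimal sketch of such a decoupled attention block (the class name is hypothetical; the defaults mirror the numbers above): tokens are projected from a small hidden_size up to num_heads * dim_head for attention, then back down:

import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledAttention(nn.Module):
    # Attention width num_heads * dim_head is chosen independently of hidden_size,
    # so hidden_size = 16 can still attend at width 4 * 32 = 128
    def __init__(self, hidden_size=16, num_heads=4, dim_head=32):
        super().__init__()
        self.num_heads, self.dim_head = num_heads, dim_head
        inner_dim = num_heads * dim_head
        self.to_qkv = nn.Linear(hidden_size, inner_dim * 3)  # project up
        self.to_out = nn.Linear(inner_dim, hidden_size)      # project back down

    def forward(self, x):  # x: (batch, seq, hidden_size)
        b, n, _ = x.shape
        qkv = self.to_qkv(x).view(b, n, 3, self.num_heads, self.dim_head)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each (batch, heads, seq, dim_head)
        out = F.scaled_dot_product_attention(q, k, v)
        return self.to_out(out.transpose(1, 2).reshape(b, n, -1))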
python train_flt.py
python train_shortcut.py
python train_unet.py