
quantumiracle/transformers-pytorch


Transformer and Diffusion Transformer

Vanilla Transformer

on enwik8

python train.py

Optionally, use Dynamic Tanh (DyT) instead of RMSNorm or LayerNorm.

Ref: Transformers without Normalization

Speed test: Colab
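
For illustration, a minimal DyT layer as described in that paper (a learnable scalar alpha inside tanh, plus per-channel scale and shift) can be sketched as below; the module name and the init value are assumptions here, not necessarily this repo's exact implementation:

import torch
import torch.nn as nn

class DynamicTanh(nn.Module):
    # Drop-in replacement for LayerNorm/RMSNorm: y = gamma * tanh(alpha * x) + beta
    def __init__(self, dim: int, init_alpha: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(init_alpha))  # learnable scalar
        self.gamma = nn.Parameter(torch.ones(dim))           # per-channel scale
        self.beta = nn.Parameter(torch.zeros(dim))           # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gamma * torch.tanh(self.alpha * x) + self.beta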

Transformer with Infini-attention

python train_infini.py
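
For orientation, the core of Infini-attention (Munkhdalai et al., 2024) is a compressive memory that is read with a non-negative feature map of the queries and updated with key-value outer products, then blended with local attention through a learned gate. A rough sketch of one segment's memory read/update, under those assumptions (the ELU+1 feature map and the gate beta follow the paper, not necessarily this repo's code):

import torch
import torch.nn.functional as F

def elu_plus_one(x):
    return F.elu(x) + 1.0  # non-negative feature map for linear attention

def infini_attention_step(q, k, v, memory, z, local_out, beta):
    # q, k, v: (batch, heads, seq, dim_head)
    # memory: (batch, heads, dim_head, dim_head); z: (batch, heads, dim_head, 1)
    sigma_q, sigma_k = elu_plus_one(q), elu_plus_one(k)
    # Read from the compressive memory accumulated over previous segments
    mem_out = (sigma_q @ memory) / (sigma_q @ z + 1e-6)
    # Write this segment's key-value pairs into memory and the normalizer
    memory = memory + sigma_k.transpose(-2, -1) @ v
    z = z + sigma_k.sum(dim=-2, keepdim=True).transpose(-2, -1)
    # Learned scalar gate blends the long-term read with local attention
    gate = torch.sigmoid(beta)
    return gate * mem_out + (1 - gate) * local_out, memory, z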

2D Diffusion Transformer

on MNIST

RoPE as the positional embedding in 2D spatial attention

python train_dit.py
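
As a sketch of the 2D RoPE idea, a common axial formulation applies 1D rotary embeddings to row positions on the first half of the head dimension and to column positions on the second half; the helpers below illustrate this under the assumption that the head dimension is divisible by 4 (function names are illustrative, not the repo's API):

import torch

def rope_1d(x, positions, theta=10000.0):
    # x: (..., seq, dim) with even dim; rotate channel pairs by position-dependent angles
    dim = x.shape[-1]
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = positions[:, None].float() * freqs[None, :]  # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1).flatten(-2)

def rope_2d(x, h, w):
    # Axial 2D RoPE: first half of the head dim encodes rows, second half columns
    rows = torch.arange(h).repeat_interleave(w)  # row index of each patch token
    cols = torch.arange(w).repeat(h)             # column index of each patch token
    half = x.shape[-1] // 2
    return torch.cat([rope_1d(x[..., :half], rows), rope_1d(x[..., half:], cols)], dim=-1)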

3D Diffusion Transformer

Sin-cos embedding before 3D attention

python train_dit_3d.py
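
A common way to build such a 3D sin-cos embedding is to factorize it over the time, height, and width axes and concatenate the three parts; the sketch below assumes the embedding dimension is divisible by 6 and is illustrative rather than the repo's exact code:

import torch

def sincos_1d(positions, dim):
    # Standard 1D sin-cos embedding: (len,) -> (len, dim), dim even
    freqs = 1.0 / (10000 ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = positions.float()[:, None] * freqs[None, :]
    return torch.cat([angles.sin(), angles.cos()], dim=-1)

def sincos_3d(t, h, w, dim):
    # Factorized 3D embedding: split channels evenly across time/height/width
    d = dim // 3  # assumes dim % 6 == 0 so each chunk is even
    emb_t = sincos_1d(torch.arange(t), d)
    emb_h = sincos_1d(torch.arange(h), d)
    emb_w = sincos_1d(torch.arange(w), d)
    # Broadcast each axis embedding over the full (t, h, w) grid, then concat
    grid = torch.cat([
        emb_t[:, None, None, :].expand(t, h, w, d),
        emb_h[None, :, None, :].expand(t, h, w, d),
        emb_w[None, None, :, :].expand(t, h, w, d),
    ], dim=-1)
    return grid.reshape(t * h * w, dim)  # one embedding per spatiotemporal token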

DiT variant

The variant exposes an additional dim_head parameter in the DiT block. Standard DiT:

# Head dimension is derived from hidden_size and num_heads
dim_head = hidden_size // num_heads

DiT variant:

# Head dimension is an independent parameter,
# so attention width can be set independently of hidden_size
inner_dim = num_heads * dim_head

Model Parameter Efficiency

Standard DiT:

  • Requires a large hidden_size (e.g., 256) to keep the attention head dimension reasonable
  • Applies the full model width to all operations (feed-forward, projections, etc.)

DiT variant:

  • Can use a much smaller hidden_size (e.g., 16) while maintaining attention capacity
  • Significantly reduces the parameter count in the tokenization and feed-forward layers; see the sketch below
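
To make the decoupling concrete, here is a minimal sketch of an attention block whose projection width is num_heads * dim_head rather than hidden_size; the class and argument names are illustrative, not the repo's exact API:

import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledAttention(nn.Module):
    # hidden_size sets token/FFN width; num_heads * dim_head sets attention width
    def __init__(self, hidden_size: int, num_heads: int, dim_head: int):
        super().__init__()
        inner_dim = num_heads * dim_head
        self.to_qkv = nn.Linear(hidden_size, inner_dim * 3, bias=False)
        self.to_out = nn.Linear(inner_dim, hidden_size, bias=False)
        self.num_heads, self.dim_head = num_heads, dim_head

    def forward(self, x):
        b, n, _ = x.shape
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, seq, dim_head)
        q, k, v = (t.view(b, n, self.num_heads, self.dim_head).transpose(1, 2)
                   for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v)
        return self.to_out(out.transpose(1, 2).reshape(b, n, -1))

With hidden_size=16 (the README's example) and, say, num_heads=4 and dim_head=64 (illustrative values), the attention projections still operate at width 256 while the token embeddings and feed-forward layers stay 16-wide, which is where the parameter savings come from.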

Flow Matching with Transformer

python train_flt.py
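
For reference, a single flow-matching training step in its common rectified-flow form (linear interpolation between noise and data, with the model regressing the constant velocity x1 - x0) looks roughly like this; the model signature is an assumption:

import torch
import torch.nn.functional as F

def flow_matching_loss(model, x1):
    # x1: batch of clean data; x0: standard Gaussian noise
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.shape[0], device=x1.device)       # uniform time in [0, 1]
    t_exp = t.view(-1, *([1] * (x1.dim() - 1)))
    xt = (1 - t_exp) * x0 + t_exp * x1                  # straight-line interpolant
    v_target = x1 - x0                                  # constant target velocity
    v_pred = model(xt, t)                               # model predicts velocity
    return F.mse_loss(v_pred, v_target)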

Shortcut Model (Flow Matching) with Transformer

python train_shortcut.py
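
The shortcut-model idea (as in "One Step Diffusion via Shortcut Models") conditions the velocity model on a step size d and trains large steps to be self-consistent with two half-steps; a rough sketch of that self-consistency target, with assumed signatures and shapes, is:

import torch
import torch.nn.functional as F

def shortcut_consistency_loss(model, xt, t, d):
    # model(x, t, d) predicts the average velocity for a jump of size d from time t
    with torch.no_grad():
        v1 = model(xt, t, d / 2)                                  # first half-step
        x_mid = xt + (d / 2).view(-1, *([1] * (xt.dim() - 1))) * v1
        v2 = model(x_mid, t + d / 2, d / 2)                       # second half-step
        v_target = (v1 + v2) / 2  # two half-steps define the full-step target
    return F.mse_loss(model(xt, t, d), v_target)

In the paper this term is combined with a plain flow-matching loss at the smallest step size; how this repo weights the two is not stated here.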

Unet

python train_unet.py

Result Analysis

See the doc and blog for detailed result analysis.
