v0.1.2
The attention bias in MosaicBERT has attn_bias.ndim == 4, so I generalized flash_attention_n to accommodate this.
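As a rough illustration of what accepting a 4-dimensional bias involves, here is a minimal NumPy sketch. It is not the code from this release: the helper name expand_bias_to_4d and the assumed score layout (batch, heads, q_len, k_len) are hypothetical, chosen only to show how lower-rank and 4-D biases can be handled uniformly.

```python
import numpy as np


def expand_bias_to_4d(attn_bias: np.ndarray) -> np.ndarray:
    """Hypothetical helper: left-pad singleton axes until the bias is 4-D."""
    if attn_bias.ndim > 4:
        raise ValueError(f"expected at most 4 dimensions, got {attn_bias.ndim}")
    # Prepend size-1 axes so the result broadcasts against attention
    # scores of (assumed) shape (batch, heads, q_len, k_len).
    return attn_bias.reshape((1,) * (4 - attn_bias.ndim) + attn_bias.shape)


batch, heads, q_len, k_len = 2, 3, 4, 4
scores = np.zeros((batch, heads, q_len, k_len))

bias_2d = np.ones((q_len, k_len))            # shared across batch and heads
bias_4d = np.ones((1, heads, q_len, k_len))  # already 4-D, e.g. a per-head bias

for bias in (bias_2d, bias_4d):
    biased = scores + expand_bias_to_4d(bias)
    assert biased.shape == (batch, heads, q_len, k_len)
```

Padding on the left follows the usual broadcasting convention, so a bias that is already 4-D passes through unchanged.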