This repo is a collection of AWESOME things about the attention mechanism, Transformer networks, and pre-trained models, including papers, code, etc. Feel free to star and fork, to open pull requests, and to report issues.
- Pre-trained Models for Natural Language Processing: A Survey [Invited Review of Science China Technological Sciences] [pdf]
- Deep High-Resolution Representation Learning for Visual Recognition [TPAMI 2019] [pdf] [code]
- Deep High-Resolution Representation Learning for Human Pose Estimation [CVPR 2019] [pdf] [code]
Gradually add high-to-low resolution subnetworks one by one to form more stages, and connect the multi-resolution subnetworks in parallel.
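The parallel multi-resolution connection can be sketched in a few lines of NumPy. This is a toy illustration, not HRNet's actual layers: average pooling and nearest-neighbor upsampling stand in for the network's strided convolutions and bilinear upsampling, and the learned transforms are omitted.

```python
import numpy as np

def downsample(x):
    """2x2 average pooling (stand-in for a strided convolution)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(x):
    """Nearest-neighbor 2x upsampling (stand-in for bilinear upsampling)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def fuse(high, low):
    """Exchange information between two parallel resolution branches:
    each branch receives the other resampled to its own resolution."""
    return high + upsample(low), low + downsample(high)

high = np.ones((4, 4))   # high-resolution branch
low = np.zeros((2, 2))   # low-resolution branch
new_high, new_low = fuse(high, low)
```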
- Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks [NIPS 2015] [pdf] [code]
Use a cascade of convolutional networks within a Laplacian pyramid framework to generate images in a coarse-to-fine fashion.
- Feature Pyramid Networks for Object Detection [CVPR 2017] [pdf] [code]
Exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost.
- U-Net: Convolutional Networks for Biomedical Image Segmentation [MICCAI 2015] [pdf] [code]
The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization.
- Fully Convolutional Networks for Semantic Segmentation [CVPR 2015] [pdf] [code]
Build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning.
- Multiresolution Transformer Networks: Recurrence is Not Essential for Modeling Hierarchical Structure [arXiv Aug 2019] [pdf]
Establish connections between the dynamics of Transformers and recurrent networks to argue that several factors, including gradient flow along an ensemble of multiple weakly dependent paths, play a paramount role in the success of the Transformer. Then leverage these dynamics to introduce Multiresolution Transformer Networks as the first architecture that exploits hierarchical structure in data via self-attention.
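The coarse-to-fine scheme that LAPGAN builds on is the classic Laplacian pyramid decomposition, which can be sketched as follows. This sketch uses simple average-pool/nearest-neighbor resampling in place of the usual Gaussian filtering; in LAPGAN, a conditional GAN generates each residual instead of storing it.

```python
import numpy as np

def down(x):
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def up(x):
    return x.repeat(2, axis=0).repeat(2, axis=1)

def laplacian_pyramid(img, levels):
    """Decompose an image into band-pass residuals plus a coarse base."""
    pyramid, current = [], img
    for _ in range(levels):
        coarse = down(current)
        pyramid.append(current - up(coarse))  # high-frequency residual
        current = coarse
    pyramid.append(current)                   # coarsest level
    return pyramid

def reconstruct(pyramid):
    """Coarse-to-fine reconstruction: upsample, then add the residual
    at each level -- exactly the order in which LAPGAN generates."""
    current = pyramid[-1]
    for residual in reversed(pyramid[:-1]):
        current = up(current) + residual
    return current

img = np.arange(16.0).reshape(4, 4)
pyr = laplacian_pyramid(img, 2)
```

Reconstruction is lossless by construction, since each residual stores the exact difference between a level and its upsampled coarser version.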
- MUSE: Parallel Multi-Scale Attention for Sequence to Sequence Learning [arXiv Nov 2019] [pdf] [code]
Explore parallel multi-scale representation learning on sequence data, striving to capture both long-range and short-range language structures.
- Multi-scale Transformer Language Models [arXiv May 2020] [pdf]
Learn representations of text at multiple scales and present three different architectures that have an inductive bias to handle the hierarchical nature of language.
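As a toy illustration of MUSE's parallel long-range/short-range idea, the sketch below sums a global self-attention branch with a local convolutional branch. Everything here is simplified: the real model also has a pointwise branch and learned projections, and the window width is an illustrative choice.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_attention(x):
    """Plain dot-product self-attention: the long-range branch."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def local_conv(x, width=3):
    """Moving average over a small window: the short-range branch
    (a stand-in for a learned depthwise convolution)."""
    pad = width // 2
    padded = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    return np.stack([padded[i:i + width].mean(axis=0) for i in range(len(x))])

def parallel_multiscale(x):
    """Combine both scales in parallel by summing the branches."""
    return global_attention(x) + local_conv(x)

x = np.random.default_rng(0).normal(size=(6, 4))
y = parallel_multiscale(x)
```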
- REFORMER: The Efficient Transformer [ICLR 2020] [pdf] [code]
Replace dot-product attention with one that uses locality-sensitive hashing, changing its complexity from O(L^2) to O(L log L), where L is the length of the sequence.
- Longformer: The Long-Document Transformer [arXiv Apr. 2020] [pdf] [code]
Propose an attention mechanism that is a drop-in replacement for standard self-attention and combines a local windowed attention with a task-motivated global attention.
- Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing [arXiv Jun. 2020] [pdf] [code]
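Longformer's combination of local windowed attention with a few global tokens can be sketched as a boolean attention mask. The window size and the choice of global positions below are illustrative, not the paper's defaults.

```python
import numpy as np

def longformer_mask(seq_len, window, global_idx):
    """Boolean mask: True where attention is allowed.
    Local: each token attends to neighbors within `window` positions.
    Global: designated tokens (e.g. [CLS]) attend everywhere, and
    every token attends to them."""
    i = np.arange(seq_len)
    mask = np.abs(i[:, None] - i[None, :]) <= window  # sliding window
    mask[global_idx, :] = True   # global tokens attend to all
    mask[:, global_idx] = True   # all tokens attend to global tokens
    return mask

mask = longformer_mask(seq_len=8, window=1, global_idx=[0])
```

The number of allowed entries grows as O(n·w) plus O(n) per global token, rather than O(n²), which is what makes the pattern tractable for long documents.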
- Lite Transformer with Long-Short Range Attention [ICLR 2020] [pdf] [code]
Present an efficient mobile NLP architecture, Lite Transformer, to facilitate deploying mobile NLP applications on edge devices.
- Tree-structured Attention with Hierarchical Accumulation [ICLR 2020] [pdf] [code]
Present an attention-based hierarchical encoding method.
- Transformer-XH: Multi-Evidence Reasoning with eXtra Hop Attention [ICLR 2020] [pdf]
Present Transformer-XH, which uses extra hop attention to enable intrinsic modeling of structured texts in a fully data-driven way.
- Linformer: Self-Attention with Linear Complexity [arXiv Jun. 2020] [pdf] Demonstrate that the self-attention mechanism can be approximated by a low-rank matrix and propose a new self-attention mechanism, which reduces the overall self-attention complexity from O(n^2) to O(n) in both time and space.
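Linformer's low-rank trick can be sketched directly: project the keys and values from length n down to a fixed length k before attending, so the score matrix is n×k rather than n×n. The dimensions below are illustrative, and random matrices stand in for the learned projections E and F.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def linformer_attention(Q, K, V, E, F):
    """Project K and V from n rows down to k rows with (k x n)
    matrices E and F, giving O(n*k) time and memory -- linear in n
    for a fixed projected length k."""
    K_proj = E @ K                                 # (k, d)
    V_proj = F @ V                                 # (k, d)
    scores = Q @ K_proj.T / np.sqrt(Q.shape[-1])   # (n, k), not (n, n)
    return softmax(scores) @ V_proj                # (n, d)

rng = np.random.default_rng(0)
n, d, k = 16, 8, 4
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
E, F = (rng.normal(size=(k, n)) for _ in range(2))
out = linformer_attention(Q, K, V, E, F)
```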
- Adaptive Attention Span in Transformers [ACL 2019] [pdf] [code] Propose a novel self-attention mechanism that can learn its optimal attention span.
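The learnable span in this paper is realized as a soft mask over token distances that is fully on up to a learned span z, then ramps linearly to zero. A NumPy sketch (the ramp length R is a hyperparameter; values here are illustrative):

```python
import numpy as np

def span_mask(distances, z, ramp=4.0):
    """Soft masking function m_z(x) = clip((R + z - x) / R, 0, 1):
    tokens within distance z are fully attended, then the mask decays
    linearly to zero over `ramp` positions. Because it is piecewise
    linear in z, the span z can be learned by gradient descent
    alongside the other parameters."""
    return np.clip((ramp + z - distances) / ramp, 0.0, 1.0)

d = np.arange(12.0)       # distance of each past token
m = span_mask(d, z=4.0)   # 1 up to distance 4, then a linear ramp
```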
- BP-Transformer: Modelling Long-Range Context via Binary Partitioning [arXiv Nov. 2019] [pdf] [code]
Adopt a fine-to-coarse attention mechanism on multi-scale spans via binary partitioning.
- Adaptively Sparse Transformers [EMNLP 2019] [pdf] [code]
Introduce the adaptively sparse Transformer, wherein attention heads have flexible, context-dependent sparsity patterns. This sparsity is accomplished by replacing softmax with α-entmax: a differentiable generalization of softmax that allows low-scoring words to receive precisely zero weight.
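For intuition about the α-entmax family: α=1 recovers the ordinary softmax, while α=2 is sparsemax (Martins & Astudillo, 2016), which has a simple closed-form solution and already exhibits the key property that low-scoring entries get exactly zero probability. A sketch of that α=2 special case (the paper itself learns α per head and mostly uses intermediate values such as α=1.5):

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of z onto the probability simplex:
    the alpha=2 member of the alpha-entmax family. Entries below a
    data-dependent threshold tau receive exactly zero weight."""
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum   # which sorted entries stay nonzero
    k_z = k[support][-1]                  # size of the support
    tau = (cumsum[support][-1] - 1) / k_z
    return np.maximum(z - tau, 0.0)

p = sparsemax(np.array([2.0, 1.0, -1.0]))
```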