This repo is a collection of AWESOME things about the attention mechanism, Transformer networks, and pre-trained models, including papers, code, etc. Feel free to star and fork, to open pull requests, and to report issues.
- Pre-trained Models for Natural Language Processing: A Survey [Invited Review of Science China Technological Sciences] [pdf]
- Deep High-Resolution Representation Learning for Visual Recognition [TPAMI 2019] [pdf] [code]
- Deep High-Resolution Representation Learning for Human Pose Estimation [CVPR 2019] [pdf] [code]
Gradually add high-to-low resolution subnetworks one by one to form more stages, and connect the multi-resolution subnetworks in parallel.
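The parallel multi-resolution connection can be sketched in a few lines of NumPy. This is a toy illustration, not HRNet's actual layers: average pooling and nearest-neighbor upsampling stand in for the network's strided convolutions and bilinear upsampling, and the learned transforms are omitted.

```python
import numpy as np

def downsample(x):
    """2x2 average pooling (stand-in for a strided convolution)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(x):
    """Nearest-neighbor 2x upsampling (stand-in for bilinear upsampling)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def fuse(high, low):
    """Exchange information between two parallel resolution branches:
    each branch receives the other resampled to its own resolution."""
    return high + upsample(low), low + downsample(high)

high = np.ones((4, 4))   # high-resolution branch
low = np.zeros((2, 2))   # low-resolution branch
new_high, new_low = fuse(high, low)
```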
- Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks [NIPS 2015] [pdf] [code]
Use a cascade of convolutional networks within a Laplacian pyramid framework to generate images in a coarse-to-fine fashion.
- Feature Pyramid Networks for Object Detection [CVPR 2017] [pdf] [code]
Exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost.
- U-Net: Convolutional Networks for Biomedical Image Segmentation [MICCAI 2015] [pdf] [code]
The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization.
- Fully Convolutional Networks for Semantic Segmentation [CVPR 2015] [pdf] [code]
Build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning.
- Multiresolution Transformer Networks: Recurrence is Not Essential for Modeling Hierarchical Structure [arXiv Aug 2019] [pdf]
Establish connections between the dynamics of Transformers and recurrent networks to argue that several factors, including gradient flow along an ensemble of multiple weakly dependent paths, play a paramount role in the success of the Transformer. Then leverage these dynamics to introduce Multiresolution Transformer Networks as the first architecture that exploits hierarchical structure in data via self-attention.
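The coarse-to-fine scheme that LAPGAN builds on is the classic Laplacian pyramid decomposition, which can be sketched as follows. This sketch uses simple average-pool/nearest-neighbor resampling in place of the usual Gaussian filtering; in LAPGAN, a conditional GAN generates each residual instead of storing it.

```python
import numpy as np

def down(x):
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def up(x):
    return x.repeat(2, axis=0).repeat(2, axis=1)

def laplacian_pyramid(img, levels):
    """Decompose an image into band-pass residuals plus a coarse base."""
    pyramid, current = [], img
    for _ in range(levels):
        coarse = down(current)
        pyramid.append(current - up(coarse))  # high-frequency residual
        current = coarse
    pyramid.append(current)                   # coarsest level
    return pyramid

def reconstruct(pyramid):
    """Coarse-to-fine reconstruction: upsample, then add the residual
    at each level -- exactly the order in which LAPGAN generates."""
    current = pyramid[-1]
    for residual in reversed(pyramid[:-1]):
        current = up(current) + residual
    return current

img = np.arange(16.0).reshape(4, 4)
pyr = laplacian_pyramid(img, 2)
```

Reconstruction is lossless by construction, since each residual stores the exact difference between a level and its upsampled coarser version.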
- MUSE: Parallel Multi-Scale Attention for Sequence to Sequence Learning [arXiv Nov 2019] [pdf] [code]
Explore parallel multi-scale representation learning on sequence data, striving to capture both long-range and short-range language structures.
- Multi-scale Transformer Language Models [arXiv May 2020] [pdf]
Learn representations of text at multiple scales and present three different architectures that have an inductive bias to handle the hierarchical nature of language.
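As a toy illustration of MUSE's parallel long-range/short-range idea, the sketch below sums a global self-attention branch with a local convolutional branch. Everything here is simplified: the real model also has a pointwise branch and learned projections, and the window width is an illustrative choice.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_attention(x):
    """Plain dot-product self-attention: the long-range branch."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def local_conv(x, width=3):
    """Moving average over a small window: the short-range branch
    (a stand-in for a learned depthwise convolution)."""
    pad = width // 2
    padded = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    return np.stack([padded[i:i + width].mean(axis=0) for i in range(len(x))])

def parallel_multiscale(x):
    """Combine both scales in parallel by summing the branches."""
    return global_attention(x) + local_conv(x)

x = np.random.default_rng(0).normal(size=(6, 4))
y = parallel_multiscale(x)
```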
- REFORMER: The Efficient Transformer [ICLR 2020] [pdf] [code]
Replace dot-product attention with one that uses locality-sensitive hashing, changing its complexity from O(L^2) to O(L log L), where L is the length of the sequence.
- Longformer: The Long-Document Transformer [arXiv Apr. 2020] [pdf] [code]
Propose an attention mechanism that is a drop-in replacement for standard self-attention and combines a local windowed attention with a task-motivated global attention.
- Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing [arXiv Jun. 2020] [pdf] [code]
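Longformer's combination of local windowed attention with a few global tokens can be sketched as a boolean attention mask. The window size and the choice of global positions below are illustrative, not the paper's defaults.

```python
import numpy as np

def longformer_mask(seq_len, window, global_idx):
    """Boolean mask: True where attention is allowed.
    Local: each token attends to neighbors within `window` positions.
    Global: designated tokens (e.g. [CLS]) attend everywhere, and
    every token attends to them."""
    i = np.arange(seq_len)
    mask = np.abs(i[:, None] - i[None, :]) <= window  # sliding window
    mask[global_idx, :] = True   # global tokens attend to all
    mask[:, global_idx] = True   # all tokens attend to global tokens
    return mask

mask = longformer_mask(seq_len=8, window=1, global_idx=[0])
```

The number of allowed entries grows as O(n·w) plus O(n) per global token, rather than O(n²), which is what makes the pattern tractable for long documents.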
- Lite Transformer with Long-Short Range Attention [ICLR 2020] [pdf] [code]
Present an efficient mobile NLP architecture, Lite Transformer, to facilitate deploying mobile NLP applications on edge devices.
- Tree-structured Attention with Hierarchical Accumulation [ICLR 2020] [pdf] [code]
Present an attention-based hierarchical encoding method.
- Transformer-XH: Multi-Evidence Reasoning with eXtra Hop Attention [ICLR 2020] [pdf]
Present Transformer-XH, which uses extra hop attention to enable intrinsic modeling of structured texts in a fully data-driven way.
- Linformer: Self-Attention with Linear Complexity [arXiv Jun. 2020] [pdf] Demonstrate that the self-attention mechanism can be approximated by a low-rank matrix and propose a new self-attention mechanism, which reduces the overall self-attention complexity from O(n^2) to O(n) in both time and space.
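Linformer's low-rank trick can be sketched directly: project the keys and values from length n down to a fixed length k before attending, so the score matrix is n×k rather than n×n. The dimensions below are illustrative, and random matrices stand in for the learned projections E and F.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def linformer_attention(Q, K, V, E, F):
    """Project K and V from n rows down to k rows with (k x n)
    matrices E and F, giving O(n*k) time and memory -- linear in n
    for a fixed projected length k."""
    K_proj = E @ K                                 # (k, d)
    V_proj = F @ V                                 # (k, d)
    scores = Q @ K_proj.T / np.sqrt(Q.shape[-1])   # (n, k), not (n, n)
    return softmax(scores) @ V_proj                # (n, d)

rng = np.random.default_rng(0)
n, d, k = 16, 8, 4
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
E, F = (rng.normal(size=(k, n)) for _ in range(2))
out = linformer_attention(Q, K, V, E, F)
```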
- Adaptive Attention Span in Transformers [ACL 2019] [pdf] [code] Propose a novel self-attention mechanism that can learn its optimal attention span.
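The learnable span in this paper is realized as a soft mask over token distances that is fully on up to a learned span z, then ramps linearly to zero. A NumPy sketch (the ramp length R is a hyperparameter; values here are illustrative):

```python
import numpy as np

def span_mask(distances, z, ramp=4.0):
    """Soft masking function m_z(x) = clip((R + z - x) / R, 0, 1):
    tokens within distance z are fully attended, then the mask decays
    linearly to zero over `ramp` positions. Because it is piecewise
    linear in z, the span z can be learned by gradient descent
    alongside the other parameters."""
    return np.clip((ramp + z - distances) / ramp, 0.0, 1.0)

d = np.arange(12.0)       # distance of each past token
m = span_mask(d, z=4.0)   # 1 up to distance 4, then a linear ramp
```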
- BP-Transformer: Modelling Long-Range Context via Binary Partitioning [arXiv Nov. 2019] [pdf] [code]
Adopt a fine-to-coarse attention mechanism on multi-scale spans via binary partitioning.
- Adaptively Sparse Transformers [EMNLP 2019] [pdf] [code]
Introduce the adaptively sparse Transformer, wherein attention heads have flexible, context-dependent sparsity patterns. This sparsity is accomplished by replacing softmax with α-entmax: a differentiable generalization of softmax that allows low-scoring words to receive precisely zero weight.
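For intuition about the α-entmax family: α=1 recovers the ordinary softmax, while α=2 is sparsemax (Martins & Astudillo, 2016), which has a simple closed-form solution and already exhibits the key property that low-scoring entries get exactly zero probability. A sketch of that α=2 special case (the paper itself learns α per head and mostly uses intermediate values such as α=1.5):

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of z onto the probability simplex:
    the alpha=2 member of the alpha-entmax family. Entries below a
    data-dependent threshold tau receive exactly zero weight."""
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum   # which sorted entries stay nonzero
    k_z = k[support][-1]                  # size of the support
    tau = (cumsum[support][-1] - 1) / k_z
    return np.maximum(z - tau, 0.0)

p = sparsemax(np.array([2.0, 1.0, -1.0]))
```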