awesome-transformers

MIT License

This repo is a collection of AWESOME things about attention mechanisms, Transformer networks, and pre-trained models, including papers, code, etc. Feel free to star and fork, open pull requests, or report issues.

Contents

  • Papers
  • Library
  • Lectures and Tutorials
  • Other Resources

Papers

Survey

  • Pre-trained Models for Natural Language Processing: A Survey [Invited Review of Science China Technological Sciences] [pdf]

Theory

Multi-Scale Transformers

  • Deep High-Resolution Representation Learning for Visual Recognition [TPAMI 2019] [pdf] [code]
  • Deep High-Resolution Representation Learning for Human Pose Estimation [CVPR 2019] [pdf] [code]
    Gradually add high-to-low resolution subnetworks one by one to form more stages, and connect the multi-resolution subnetworks in parallel.
  • Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks [NIPS 2015] [pdf] [code]
    Use a cascade of convolutional networks within a Laplacian pyramid framework to generate images in a coarse-to-fine fashion.
  • Feature Pyramid Networks for Object Detection [CVPR 2017] [pdf] [code]
    Exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost.
  • U-Net: Convolutional Networks for Biomedical Image Segmentation [MICCAI 2015] [pdf] [code]
    The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization.
  • Fully Convolutional Networks for Semantic Segmentation [CVPR 2015] [pdf] [code]
    Build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning.
  • Multiresolution Transformer Networks: Recurrence is Not Essential for Modeling Hierarchical Structure [arXiv Aug 2019] [pdf]
    Establish connections between the dynamics of Transformers and recurrent networks to argue that several factors, including gradient flow along an ensemble of multiple weakly dependent paths, play a paramount role in the success of the Transformer. Then leverage these dynamics to introduce Multiresolution Transformer Networks as the first architecture that exploits hierarchical structure in data via self-attention.
  • MUSE: Parallel Multi-Scale Attention for Sequence to Sequence Learning [arXiv Nov 2019] [pdf] [code]
    Explore parallel multi-scale representation learning on sequence data, striving to capture both long-range and short-range language structures.
  • Multi-scale Transformer Language Models [arXiv May 2020] [pdf]
    Learn representations of text at multiple scales and present three different architectures that have an inductive bias to handle the hierarchical nature of language (a toy sketch of the shared multi-scale attention idea follows this list).
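
A common thread in the multi-scale entries above is attending over the sequence at more than one resolution. The snippet below is a minimal NumPy sketch of that idea, not the method of any listed paper: the multi_scale_attention helper, its strides parameter, the average pooling, and the concatenation of per-scale outputs are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: q (n_q, d), k and v (n_k, d) -> (n_q, d).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def pool(x, stride):
    # Average-pool an (n, d) sequence along the length axis
    # (n is truncated to a multiple of stride).
    n, d = x.shape
    n = (n // stride) * stride
    return x[:n].reshape(n // stride, stride, d).mean(axis=1)

def multi_scale_attention(x, strides=(1, 2, 4)):
    # Queries stay at full resolution; keys/values come from progressively
    # coarser pooled copies. Per-scale outputs are concatenated per token
    # (an assumed combination rule, for illustration only).
    outputs = [attention(x, pool(x, s), pool(x, s)) for s in strides]
    return np.concatenate(outputs, axis=-1)  # (n, d * len(strides))

x = np.random.randn(16, 8)             # toy sequence: 16 tokens, dimension 8
print(multi_scale_attention(x).shape)  # (16, 24)
```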

Sparse Transformers

  • REFORMER: The Efficient Transformer [ICLR 2020] [pdf] [code]
    Replace dot-product attention with one that uses locality-sensitive hashing, reducing its complexity from O(L^2) to O(L log L), where L is the length of the sequence.
  • Longformer: The Long-Document Transformer [arXiv Apr. 2020] [pdf] [code]
    Propose an attention mechanism that is a drop-in replacement for standard self-attention and combines local windowed attention with task-motivated global attention.
  • Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing [arXiv Jun. 2020] [pdf] [code]
  • Lite Transformer with Long-Short Range Attention [ICLR 2020] [pdf] [code]
    Present an efficient mobile NLP architecture, Lite Transformer, to facilitate deploying mobile NLP applications on edge devices.
  • Tree-structured Attention with Hierarchical Accumulation [ICLR 2020] [pdf] [code]
    Present an attention-based hierarchical encoding method.
  • Transformer-XH: Multi-Evidence Reasoning with eXtra Hop Attention [ICLR 2020] [pdf]
    Present Transformer-XH, which uses extra hop attention to enable intrinsic modeling of structured texts in a fully data-driven way.
  • Linformer: Self-Attention with Linear Complexity [arXiv Jun. 2020] [pdf]
    Demonstrate that the self-attention mechanism can be approximated by a low-rank matrix and propose a new self-attention mechanism, which reduces the overall self-attention complexity from O(n^2) to O(n) in both time and space (a toy sketch of the low-rank idea follows this list).
  • Adaptive Attention Span in Transformers [ACL 2019] [pdf] [code]
    Propose a novel self-attention mechanism that can learn its optimal attention span.
  • BP-Transformer: Modelling Long-Range Context via Binary Partitioning [arXiv Nov. 2019] [pdf] [code]
    Adopt a fine-to-coarse attention mechanism on multi-scale spans via binary partitioning.
  • Adaptively Sparse Transformers [EMNLP 2019] [pdf] [code]
    Introduce the adaptively sparse Transformer, wherein attention heads have flexible, context-dependent sparsity patterns. This sparsity is accomplished by replacing softmax with α-entmax: a differentiable generalization of softmax that allows low-scoring words to receive precisely zero weight.
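
The Linformer entry above argues that the n x n attention matrix is approximately low-rank. The snippet below is a minimal NumPy sketch of that low-rank idea under stated assumptions: the linformer_style_attention helper and its proj_dim parameter are hypothetical names, and the projection matrices are fixed random placeholders where the paper learns them; it only shows how projecting the length axis of keys and values makes the score matrix linear in sequence length.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def linformer_style_attention(q, k, v, proj_dim=32):
    # q, k, v: (n, d). Project the *length* axis of K and V from n down to
    # proj_dim, so the attention matrix is (n, proj_dim) instead of (n, n).
    n, d = k.shape
    rng = np.random.default_rng(0)
    E = rng.standard_normal((proj_dim, n)) / np.sqrt(n)  # placeholder for a learned projection
    F = rng.standard_normal((proj_dim, n)) / np.sqrt(n)  # placeholder for a learned projection
    k_low, v_low = E @ k, F @ v                          # (proj_dim, d)
    scores = q @ k_low.T / np.sqrt(d)                    # (n, proj_dim): linear in n
    return softmax(scores) @ v_low                       # (n, d)

n, d = 1024, 64
q, k, v = (np.random.randn(n, d) for _ in range(3))
print(linformer_style_attention(q, k, v).shape)  # (1024, 64)
```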

Library

Lectures and Tutorials

Other Resources
