qithink/NeurIPS2021_Vision_Transformer_Paper_Collection

NeurIPS2021: Vision Transformer Paper Collection

1. Do Vision Transformers See Like Convolutional Neural Networks?

Authors: Maithra Raghu · Tom Unterthiner · Simon Kornblith · Chiyuan Zhang · Alexey Dosovitskiy

TL;DR: We use representation analysis methods to study Vision Transformers and understand differences between them and CNNs.

OpenReview: https://openreview.net/forum?id=R-616EWWKF5

PDF: https://openreview.net/pdf?id=R-616EWWKF5

2. Early Convolutions Help Transformers See Better

Authors: Tete Xiao · Piotr Dollar · Mannat Singh · Eric Mintun · Trevor Darrell · Ross B Girshick

TL;DR: None

OpenReview: https://openreview.net/forum?id=Lpfh1Bpqfk

PDF: https://openreview.net/pdf?id=Lpfh1Bpqfk

3. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers

Authors: Enze Xie · Wenhai Wang · Zhiding Yu · Anima Anandkumar · Jose M. Alvarez · Ping Luo

TL;DR: We present a simple yet powerful pipeline for semantic segmentation with Transformers.

OpenReview: https://openreview.net/forum?id=OG18MI5TRL

PDF: https://openreview.net/pdf?id=OG18MI5TRL

4. Long-Short Transformer: Efficient Transformers for Language and Vision

Authors: Chen Zhu · Wei Ping · Chaowei Xiao · Mohammad Shoeybi · Tom Goldstein · Anima Anandkumar · Bryan Catanzaro

TL;DR: We propose an efficient attention mechanism that is applicable to both autoregressive and bidirectional models in language and vision.

OpenReview: https://openreview.net/forum?id=M_lkFOwVdYc

PDF: https://openreview.net/pdf?id=M_lkFOwVdYc

5. Test-Time Personalization with a Transformer for Human Pose Estimation

Authors: Yizhuo Li · Miao Hao · Zonglin Di · Nitesh Bharadwaj Gundavarapu · Xiaolong Wang

TL;DR: None

OpenReview: https://openreview.net/forum?id=cwSkaedP-wz

PDF: https://openreview.net/pdf?id=cwSkaedP-wz

6. VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

Authors: Hassan Akbari · Liangzhe Yuan · Rui Qian · Wei-Hong Chuang · Shih-Fu Chang · Yin Cui · Boqing Gong

TL;DR: A pure Transformer-based pipeline for learning semantic representations from raw video, audio, and text without supervision

OpenReview: https://openreview.net/forum?id=RzYrn625bu8

PDF: https://openreview.net/pdf?id=RzYrn625bu8

7. Searching the Search Space of Vision Transformer

Authors: Minghao Chen · Kan Wu · Bolin Ni · Houwen Peng · Bei Liu · Jianlong Fu · Hongyang Chao · Haibin Ling

TL;DR: None

OpenReview: https://openreview.net/forum?id=AVS8CamBecS

PDF: https://openreview.net/pdf?id=AVS8CamBecS

8. Learning Generative Vision Transformer with Energy-Based Latent Space for Saliency Prediction

Authors: Jing Zhang · Jianwen Xie · Nick Barnes · Ping Li

TL;DR: None

OpenReview: https://openreview.net/forum?id=LoUdcqLuPej

PDF: https://openreview.net/pdf?id=LoUdcqLuPej

9. TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up

Authors: Yifan Jiang · Shiyu Chang · Zhangyang Wang

TL;DR: We build a pure transformer-based generative adversarial network and show its competitive performance on multiple benchmarks.

OpenReview: https://openreview.net/forum?id=1GTpBZvNUrk

PDF: https://openreview.net/pdf?id=1GTpBZvNUrk

10. Augmented Shortcuts for Vision Transformers

Authors: Yehui Tang · Kai Han · Chang Xu · An Xiao · Yiping Deng · Chao Xu · Yunhe Wang

TL;DR: A novel augmented shortcut scheme for improving feature diversity in vision transformers.

OpenReview: https://openreview.net/forum?id=XiZYCewdxMQ

PDF: https://openreview.net/pdf?id=XiZYCewdxMQ

11. Improved Transformer for High-Resolution GANs

Authors: Long Zhao · Zizhao Zhang · Ting Chen · Dimitris Metaxas · Han Zhang

TL;DR: We propose a Transformer-based generator for high-resolution image synthesis.

OpenReview: https://openreview.net/forum?id=zmbiQmdtg9

PDF: https://openreview.net/pdf?id=zmbiQmdtg9

12. Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers

Authors: Yanhong Zeng · Huan Yang · Hongyang Chao · Jianbo Wang · Jianlong Fu

TL;DR: We present a new perspective that treats image synthesis as a visual token generation problem, and a model named TokenGAN with token-based representation and Transformer-based modeling.

OpenReview: https://openreview.net/forum?id=lGoKo9WS2A_

PDF: https://openreview.net/pdf?id=lGoKo9WS2A_

13. Multi-Person 3D Motion Prediction with Multi-Range Transformers

Authors: Jiashun Wang · Huazhe Xu · Medhini Narasimhan · Xiaolong Wang

TL;DR: None

OpenReview: https://openreview.net/forum?id=gCaaFNvjfpPe

PDF: https://openreview.net/pdf?id=gCaaFNvjfpPe

14. You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection

Authors: Yuxin Fang · Bencheng Liao · Xinggang Wang · Jiemin Fang · Jiyang Qi · Rui Wu · Jianwei Niu · Wenyu Liu

TL;DR: We study the transferability of the vanilla ViT pre-trained on mid-sized ImageNet-1k to the more challenging COCO object detection benchmark.

OpenReview: https://openreview.net/forum?id=nVofoXjTmA_

PDF: https://openreview.net/pdf?id=nVofoXjTmA_

15. Not All Images are Worth 16x16 Words: Dynamic Transformers for Efficient Image Recognition

Authors: Yulin Wang · Rui Huang · Shiji Song · Zeyi Huang · Gao Huang

TL;DR: We develop a Dynamic Vision Transformer (DVT) to automatically configure a proper number of tokens for each individual image, leading to a significant improvement in computational efficiency, both theoretically and empirically.

OpenReview: https://openreview.net/forum?id=M0J1c3PqwKZ

PDF: https://openreview.net/pdf?id=M0J1c3PqwKZ

16. All Tokens Matter: Token Labeling for Training Better Vision Transformers

Authors: Zi-Hang Jiang · Andrew Hou · Li Yuan · Daquan Zhou · Yujun Shi · Xiaojie Jin · Anran Wang · Jiashi Feng

TL;DR: None

OpenReview: https://openreview.net/forum?id=2vubO341F_E

PDF: https://openreview.net/pdf?id=2vubO341F_E

17. The Image Local Autoregressive Transformer

Authors: Chenjie Cao · Yuxin Hong · Xiang Li · Chengrong Wang · Chengming Xu · Yanwei Fu · Xiangyang Xue

TL;DR: This paper proposes an image Local Autoregressive Transformer (iLAT) to effectively solve locally guided image synthesis.

OpenReview: https://openreview.net/forum?id=6mEWjDYJeE-

PDF: https://openreview.net/pdf?id=6mEWjDYJeE-

18. Efficient Training of Visual Transformers with Small Datasets

Authors: Yahui Liu · Enver Sangineto · Wei Bi · Nicu Sebe · Bruno Lepri · Marco De Nadai

TL;DR: None

OpenReview: https://openreview.net/forum?id=SCN8UaetXx

PDF: https://openreview.net/pdf?id=SCN8UaetXx

19. Post-Training Quantization for Vision Transformer

Authors: Zhenhua Liu · Yunhe Wang · Kai Han · Wei Zhang · Siwei Ma · Wen Gao

TL;DR: We propose a post-training quantization scheme for vision transformers that considers the ranking loss of self-attention and takes the nuclear norm of the features as a measure of each transformer layer's sensitivity.

OpenReview: https://openreview.net/forum?id=9TX5OsKJvm

PDF: https://openreview.net/pdf?id=9TX5OsKJvm

20. Dynamic Grained Encoder for Vision Transformers

Authors: Lin Song · Songyang Zhang · Song Liu · Zeming Li · Xuming He · Hongbin Sun · Jian Sun · Nanning Zheng

TL;DR: This paper introduces a dynamic network mechanism into Vision Transformers to reduce the spatial redundancy of image features.

OpenReview: https://openreview.net/forum?id=gnAIV-EKw2

PDF: https://openreview.net/pdf?id=gnAIV-EKw2

21. Revitalizing CNN Attention via Transformers in Self-Supervised Visual Representation Learning

Authors: Chongjian Ge · Youwei Liang · Yibing Song · Jianbo Jiao · Jue Wang · Ping Luo

TL;DR: We revitalize CNN encoder attentions via transformer in self-supervised visual representation learning

OpenReview: https://openreview.net/forum?id=sRojdWhXJx

PDF: https://openreview.net/pdf?id=sRojdWhXJx

22. DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification

Authors: Yongming Rao · Wenliang Zhao · Benlin Liu · Jiwen Lu · Jie Zhou · Cho-Jui Hsieh

TL;DR: We propose a dynamic token sparsification framework to prune redundant tokens progressively and dynamically for vision transformer acceleration.

OpenReview: https://openreview.net/forum?id=jB0Nlbwlybm

PDF: https://openreview.net/pdf?id=jB0Nlbwlybm

23. Transformer in Transformer

Authors: Kai Han · An Xiao · Enhua Wu · Jianyuan Guo · Chunjing Xu · Yunhe Wang

TL;DR: None

OpenReview: https://openreview.net/forum?id=iFODavhthGZ

PDF: https://openreview.net/pdf?id=iFODavhthGZ

24. Video Instance Segmentation using Inter-Frame Communication Transformers

Authors: Sukjun Hwang · Miran Heo · Seoung Wug Oh · Seon Joo Kim

TL;DR: None

OpenReview: https://openreview.net/forum?id=pvjfA4wogD6

PDF: https://openreview.net/pdf?id=pvjfA4wogD6

25. Learning to Iteratively Solve Routing Problems with Dual-Aspect Collaborative Transformer

Authors: Yining Ma · Jingwen Li · Zhiguang Cao · Wen Song · Le Zhang · Zhenghua Chen · Jing Tang

TL;DR: We present a Dual-Aspect Collaborative Transformer to solve vehicle routing problems, which delivers superior performance.

OpenReview: https://openreview.net/forum?id=63pC59XOZLZ

PDF: https://openreview.net/pdf?id=63pC59XOZLZ

26. Shifted Chunk Transformer for Spatio-Temporal Representational Learning

Authors: Xuefan Zha · Wentao Zhu · Lv Xun · Sen Yang · Ji Liu

TL;DR: None

OpenReview: https://openreview.net/forum?id=fDSDkiiXHzj

PDF: https://openreview.net/pdf?id=fDSDkiiXHzj

27. Chasing Sparsity in Vision Transformers: An End-to-End Exploration

Authors: Tianlong Chen · Yu Cheng · Zhe Gan · Lu Yuan · Lei Zhang · Zhangyang Wang

TL;DR: We jointly optimize model parameters and explore connectivity throughout training, ending up with one sparse vision transformer as the final output.

OpenReview: https://openreview.net/forum?id=LKoMTwTuQnC

PDF: https://openreview.net/pdf?id=LKoMTwTuQnC

28. HRFormer: High-Resolution Vision Transformer for Dense Prediction

Authors: Yuhui Yuan · Rao Fu · Lang Huang · Weihong Lin · Chao Zhang · Xilin Chen · Jingdong Wang

TL;DR: None

OpenReview: https://openreview.net/forum?id=DF8LCjR03tX

PDF: https://openreview.net/pdf?id=DF8LCjR03tX

29. TransMIL: Transformer based Correlated Multiple Instance Learning for Whole Slide Image Classification

Authors: Zhuchen Shao · Hao Bian · Yang Chen · Yifeng Wang · Jian Zhang · Xiangyang Ji · Yongbing Zhang

TL;DR: We propose a correlated MIL framework and devise a Transformer-based MIL for the weakly supervised classification of whole slide images.

OpenReview: https://openreview.net/forum?id=LKUfuWxajHc

PDF: https://openreview.net/pdf?id=LKUfuWxajHc

30. Federated Split Task-Agnostic Vision Transformer for COVID-19 CXR Diagnosis

Authors: Sangjoon Park · Gwanghyun Kim · Jeongsol Kim · Boah Kim · Jong Chul Ye

TL;DR: We propose a novel Federated Split Task-Agnostic (FeSTA) framework that leverages the benefits of Vision Transformers to simultaneously process multiple CXR tasks, including the diagnosis of COVID-19.

OpenReview: https://openreview.net/forum?id=Ggikq6Tdxch

PDF: https://openreview.net/pdf?id=Ggikq6Tdxch

31. MST: Masked Self-Supervised Transformer for Visual Representation

Authors: Zhaowen Li · Zhiyang Chen · Fan Yang · Wei Li · Yousong Zhu · Chaoyang Zhao · Rui Deng · Liwei Wu · Rui Zhao · Ming Tang · Jinqiao Wang

TL;DR: None

OpenReview: https://openreview.net/forum?id=y_OmkmCH9w

PDF: https://openreview.net/pdf?id=y_OmkmCH9w

32. ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias

Authors: Yufei Xu · Qiming Zhang · Jing Zhang · Dacheng Tao

TL;DR: None

OpenReview: https://openreview.net/forum?id=_RnHyIeu5Y5

PDF: https://openreview.net/pdf?id=_RnHyIeu5Y5

33. Focal Attention for Long-Range Interactions in Vision Transformers

Authors: Jianwei Yang · Chunyuan Li · Pengchuan Zhang · Xiyang Dai · Bin Xiao · Lu Yuan · Jianfeng Gao

TL;DR: An effective focal attention mechanism for modeling short- and long-range visual dependencies in Vision Transformers

OpenReview: https://openreview.net/forum?id=2zCRcTafea

PDF: https://openreview.net/pdf?id=2zCRcTafea

34. Blending Anti-Aliasing into Vision Transformer

Authors: Shengju Qian · Hao Shao · Yi Zhu · Mu Li · Jiaya Jia

TL;DR: None

OpenReview: https://openreview.net/forum?id=0-0Wk0t6A_Z

PDF: https://openreview.net/pdf?id=0-0Wk0t6A_Z

35. Referring Transformer: A One-step Approach to Multi-task Visual Grounding

Authors: Muchen Li · Leonid Sigal

TL;DR: A one-step transformer based model that solves Referring Expression Detection and Referring Expression Segmentation jointly.

OpenReview: https://openreview.net/forum?id=j7u7cJDBo8p

PDF: https://openreview.net/pdf?id=j7u7cJDBo8p

36. Adder Attention for Vision Transformer

Authors: Han Shu · Jiahao Wang · Hanting Chen · Lin Li · Yujiu Yang · Yunhe Wang

TL;DR: Implementing transformers using cheap addition operation

OpenReview: https://openreview.net/forum?id=5Ld5bRB9jzY

PDF: https://openreview.net/pdf?id=5Ld5bRB9jzY

37. History Aware Multimodal Transformer for Vision-and-Language Navigation

Authors: Shizhe Chen · Pierre-Louis Guhur · Cordelia Schmid · Ivan Laptev

TL;DR: None

OpenReview: https://openreview.net/forum?id=SQxuiYf2TT

PDF: https://openreview.net/pdf?id=SQxuiYf2TT

38. Associating Objects with Transformers for Video Object Segmentation

Authors: Zongxin Yang · Yunchao Wei · Yi Yang

TL;DR: None

OpenReview: https://openreview.net/forum?id=hl3v8io3ZYt

PDF: https://openreview.net/pdf?id=hl3v8io3ZYt

39. Twins: Revisiting the Design of Spatial Attention in Vision Transformers

Authors: Xiangxiang Chu · Zhi Tian · Yuqing Wang · Bo Zhang · Haibing Ren · Xiaolin Wei · Huaxia Xia · Chunhua Shen

TL;DR: Two simple and effective vision transformer designs that are on par with the Swin Transformer.

OpenReview: https://openreview.net/forum?id=5kTlVBkzSRx

PDF: https://openreview.net/pdf?id=5kTlVBkzSRx

40. IA-RED$^2$: Interpretability-Aware Redundancy Reduction for Vision Transformers

Authors: Bowen Pan · Rameswar Panda · Yifan Jiang · Zhangyang Wang · Rogerio Feris · Aude Oliva

TL;DR: An input-dependent and interpretable dynamic inference framework for vision transformer, which adaptively decides the patch tokens to compute per input instance.

OpenReview: https://openreview.net/forum?id=7X_sBjIwtm9

PDF: https://openreview.net/pdf?id=7X_sBjIwtm9

41. TransMatcher: Deep Image Matching Through Transformers for Generalizable Person Re-identification

Authors: Shengcai Liao · Ling Shao

TL;DR: A transformer-based deep image matching method is developed for generalizable person re-identification, achieving state-of-the-art performance.

OpenReview: https://openreview.net/forum?id=I3yGrFoH8DF

PDF: https://openreview.net/pdf?id=I3yGrFoH8DF

42. CATs: Cost Aggregation Transformers for Visual Correspondence

Authors: Seokju Cho · Sunghwan Hong · Sangryul Jeon · Yunsung Lee · Kwanghoon Sohn · Seungryong Kim

TL;DR: None

OpenReview: https://openreview.net/forum?id=eVuMspr9cu5

PDF: https://openreview.net/pdf?id=eVuMspr9cu5

43. Compositional Transformers for Scene Generation

Authors: Drew Arad Hudson · Larry Zitnick

TL;DR: A new compositional transformer model for recurrent scene generation.

OpenReview: https://openreview.net/forum?id=YQeWoRnwTnE

PDF: https://openreview.net/pdf?id=YQeWoRnwTnE

44. SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation

Authors: Abhinav Moudgil · Arjun Majumdar · Harsh Agrawal · Stefan Lee · Dhruv Batra

TL;DR: We design a new vision-and-language navigation agent that operates on both scene and object features with a multimodal transformer using a selective attention pattern for object-centric processing.

OpenReview: https://openreview.net/forum?id=E5EoQqCVYX

PDF: https://openreview.net/pdf?id=E5EoQqCVYX

45. Glance-and-Gaze Vision Transformer

Authors: Qihang Yu · Yingda Xia · Yutong Bai · Yongyi Lu · Alan Yuille · Wei Shen

TL;DR: A new state-of-the-art efficient and effective Vision Transformer with Glance-and-Gaze mechanism

OpenReview: https://openreview.net/forum?id=GitDcBlcg78

PDF: https://openreview.net/pdf?id=GitDcBlcg78

46. XCiT: Cross-Covariance Image Transformers

Authors: Alaa El-Nouby Ali · Hugo Touvron · Mathilde Caron · Piotr Bojanowski · Matthijs Douze · Armand Joulin · Ivan Laptev · Natalia Neverova · Gabriel Synnaeve · Jakob Verbeek · Herve Jegou

TL;DR: A new transformer model for image processing, whose complexity is linear in the image resolution with no approximation

OpenReview: https://openreview.net/forum?id=kzPtpIpF8o

PDF: https://openreview.net/pdf?id=kzPtpIpF8o

47. Are Transformers more robust than CNNs?

Authors: Yutong Bai · Jieru Mei · Alan Yuille · Cihang Xie

TL;DR: We provide the first fair and in-depth comparison between Transformers and CNNs

OpenReview: https://openreview.net/forum?id=hbHkvGBZB9

PDF: https://openreview.net/pdf?id=hbHkvGBZB9

48. Implicit Transformer Network for Screen Content Image Continuous Super-Resolution

Authors: Jingyu Yang · Sheng Shen · Huanjing Yue · Kun Li

TL;DR: A novel method for Screen Content Image Continuous Super-Resolution

OpenReview: https://openreview.net/forum?id=x4t0fxWPNdi

PDF: https://openreview.net/pdf?id=x4t0fxWPNdi

49. Few-Shot Segmentation via Cycle-Consistent Transformer

Authors: Gengwei Zhang · Guoliang Kang · Yi Yang · Yunchao Wei

TL;DR: We propose a cycle-consistent transformer (CyCTR) to aggregate the pixel-wise support features into the query ones for few-shot segmentation while avoiding bias from harmful information.

OpenReview: https://openreview.net/forum?id=LWH-C1HoQG_

PDF: https://openreview.net/pdf?id=LWH-C1HoQG_

50. Intriguing Properties of Vision Transformers

Authors: Muhammad Muzammal Naseer · Kanchana Ranasinghe · Salman H Khan · Munawar Hayat · Fahad Shahbaz Khan · Ming-Hsuan Yang

TL;DR: Analysis of content-dependent long-range interaction modeling capabilities of Vision Transformers in terms of robustness against image nuisances such as severe occlusions, domain shifts, spatial permutations, adversarial and natural perturbations.

OpenReview: https://openreview.net/forum?id=o2mbl-Hmfgd

PDF: https://openreview.net/pdf?id=o2mbl-Hmfgd

51. ResT: An Efficient Transformer for Visual Recognition

Authors: Qinglong Zhang · Yu-Bin Yang

TL;DR: An efficient multi-scale vision Transformer that can tackle input images with arbitrary size.

OpenReview: https://openreview.net/forum?id=6Ab68Ip4Mu

PDF: https://openreview.net/pdf?id=6Ab68Ip4Mu

52. TransformerFusion: Monocular RGB Scene Reconstruction using Transformers

Authors: Aljaz Bozic · Pablo Palafox · Justus Thies · Angela Dai · Matthias Niessner

TL;DR: We propose a transformer-based approach for 3D scene reconstruction from multi-view RGB input.

OpenReview: https://openreview.net/forum?id=ZEoMBPtvqey

PDF: https://openreview.net/pdf?id=ZEoMBPtvqey

53. Cross-view Geo-localization with Layer-to-Layer Transformer

Authors: Hongji Yang · Xiufan Lu · Yingying Zhu

TL;DR: This paper proposes a novel layer-to-layer Transformer for cross-view geo-localization.

OpenReview: https://openreview.net/forum?id=tQgj7CDTfKB

PDF: https://openreview.net/pdf?id=tQgj7CDTfKB

54. Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers

Authors: Mandela Patrick · Dylan Campbell · Yuki Asano · Ishan Misra · Florian Metze · Christoph Feichtenhofer · Andrea Vedaldi · João Henriques

TL;DR: A new attention block for video transformers that implicitly models motion paths

OpenReview: https://openreview.net/forum?id=mfQxdSMWOF

PDF: https://openreview.net/pdf?id=mfQxdSMWOF

55. Space-time Mixing Attention for Video Transformer

Authors: Adrian Bulat · Juanma Perez Rua · Swathikiran Sudhakaran · Brais Martinez · Georgios Tzimiropoulos

TL;DR: None

OpenReview: https://openreview.net/forum?id=QgX15Mdi1E_

PDF: https://openreview.net/pdf?id=QgX15Mdi1E_


Welcome to follow
