Authors: Maithra Raghu · Tom Unterthiner · Simon Kornblith · Chiyuan Zhang · Alexey Dosovitskiy
TL;DR: We use representation analysis methods to study Vision Transformers and understand differences between them and CNNs.
OpenReview: https://openreview.net/forum?id=R-616EWWKF5
PDF: https://openreview.net/pdf?id=R-616EWWKF5
Authors: Tete Xiao · Piotr Dollar · Mannat Singh · Eric Mintun · Trevor Darrell · Ross B Girshick
TL;DR: None
OpenReview: https://openreview.net/forum?id=Lpfh1Bpqfk
PDF: https://openreview.net/pdf?id=Lpfh1Bpqfk
Authors: Enze Xie · Wenhai Wang · Zhiding Yu · Anima Anandkumar · Jose M. Alvarez · Ping Luo
TL;DR: We present a simple yet powerful pipeline for semantic segmentation with Transformers.
OpenReview: https://openreview.net/forum?id=OG18MI5TRL
PDF: https://openreview.net/pdf?id=OG18MI5TRL
Authors: Chen Zhu · Wei Ping · Chaowei Xiao · Mohammad Shoeybi · Tom Goldstein · Anima Anandkumar · Bryan Catanzaro
TL;DR: We propose an efficient attention mechanism that is applicable to both autoregressive and bidirectional models in language and vision.
OpenReview: https://openreview.net/forum?id=M_lkFOwVdYc
PDF: https://openreview.net/pdf?id=M_lkFOwVdYc
Authors: Yizhuo Li · Miao Hao · Zonglin Di · Nitesh Bharadwaj Gundavarapu · Xiaolong Wang
TL;DR: None
OpenReview: https://openreview.net/forum?id=cwSkaedP-wz
PDF: https://openreview.net/pdf?id=cwSkaedP-wz
Authors: Hassan Akbari · Liangzhe Yuan · Rui Qian · Wei-Hong Chuang · Shih-Fu Chang · Yin Cui · Boqing Gong
TL;DR: A pure Transformer-based pipeline for learning semantic representations from raw video, audio, and text without supervision
OpenReview: https://openreview.net/forum?id=RzYrn625bu8
PDF: https://openreview.net/pdf?id=RzYrn625bu8
Authors: Minghao Chen · Kan Wu · Bolin Ni · Houwen Peng · Bei Liu · Jianlong Fu · Hongyang Chao · Haibin Ling
TL;DR: None
OpenReview: https://openreview.net/forum?id=AVS8CamBecS
PDF: https://openreview.net/pdf?id=AVS8CamBecS
Authors: Jing Zhang · Jianwen Xie · Nick Barnes · Ping Li
TL;DR: None
OpenReview: https://openreview.net/forum?id=LoUdcqLuPej
PDF: https://openreview.net/pdf?id=LoUdcqLuPej
Authors: Yifan Jiang · Shiyu Chang · Zhangyang Wang
TL;DR: We build a pure transformer-based generative adversarial network and show its competitive performance on multiple benchmarks.
OpenReview: https://openreview.net/forum?id=1GTpBZvNUrk
PDF: https://openreview.net/pdf?id=1GTpBZvNUrk
Authors: Yehui Tang · Kai Han · Chang Xu · An Xiao · Yiping Deng · Chao Xu · Yunhe Wang
TL;DR: A novel augmented shortcut scheme for improving feature diversity in vision transformers.
OpenReview: https://openreview.net/forum?id=XiZYCewdxMQ
PDF: https://openreview.net/pdf?id=XiZYCewdxMQ
Authors: Long Zhao · Zizhao Zhang · Ting Chen · Dimitris Metaxas · Han Zhang
TL;DR: We propose a Transformer-based generator for high-resolution image synthesis.
OpenReview: https://openreview.net/forum?id=zmbiQmdtg9
PDF: https://openreview.net/pdf?id=zmbiQmdtg9
Authors: Yanhong Zeng · Huan Yang · Hongyang Chao · Jianbo Wang · Jianlong Fu
TL;DR: We present a new perspective that treats image synthesis as a visual token generation problem, and a model named TokenGAN with token-based representation and Transformer-based modeling.
OpenReview: https://openreview.net/forum?id=lGoKo9WS2A_
PDF: https://openreview.net/pdf?id=lGoKo9WS2A_
Authors: Jiashun Wang · Huazhe Xu · Medhini Narasimhan · Xiaolong Wang
TL;DR: None
OpenReview: https://openreview.net/forum?id=gCaaFNvjfpPe
PDF: https://openreview.net/pdf?id=gCaaFNvjfpPe
Authors: Yuxin Fang · Bencheng Liao · Xinggang Wang · Jiemin Fang · Jiyang Qi · Rui Wu · Jianwei Niu · Wenyu Liu
TL;DR: We study the transferability of the vanilla ViT pre-trained on mid-sized ImageNet-1k to the more challenging COCO object detection benchmark.
OpenReview: https://openreview.net/forum?id=nVofoXjTmA_
PDF: https://openreview.net/pdf?id=nVofoXjTmA_
Authors: Yulin Wang · Rui Huang · Shiji Song · Zeyi Huang · Gao Huang
TL;DR: We develop a Dynamic Vision Transformer (DVT) to automatically configure a proper number of tokens for each individual image, leading to a significant improvement in computational efficiency, both theoretically and empirically.
OpenReview: https://openreview.net/forum?id=M0J1c3PqwKZ
PDF: https://openreview.net/pdf?id=M0J1c3PqwKZ
Authors: Zi-Hang Jiang · Andrew Hou · Li Yuan · Daquan Zhou · Yujun Shi · Xiaojie Jin · Anran Wang · Jiashi Feng
TL;DR: None
OpenReview: https://openreview.net/forum?id=2vubO341F_E
PDF: https://openreview.net/pdf?id=2vubO341F_E
Authors: Chenjie Cao · Yuxin Hong · Xiang Li · Chengrong Wang · Chengming Xu · Yanwei Fu · Xiangyang Xue
TL;DR: This paper proposes an image Local Autoregressive Transformer (iLAT) to effectively solve locally guided image synthesis.
OpenReview: https://openreview.net/forum?id=6mEWjDYJeE-
PDF: https://openreview.net/pdf?id=6mEWjDYJeE-
Authors: Yahui Liu · Enver Sangineto · Wei Bi · Nicu Sebe · Bruno Lepri · Marco Nadai
TL;DR: None
OpenReview: https://openreview.net/forum?id=SCN8UaetXx
PDF: https://openreview.net/pdf?id=SCN8UaetXx
Authors: Zhenhua Liu · Yunhe Wang · Kai Han · Wei Zhang · Siwei Ma · Wen Gao
TL;DR: We propose a post-training quantization scheme for visual transformers, which considers the ranking loss of self-attention and takes the nuclear norm of the features as the evaluation of the sensitivity of each transformer layer.
OpenReview: https://openreview.net/forum?id=9TX5OsKJvm
PDF: https://openreview.net/pdf?id=9TX5OsKJvm
Authors: Lin Song · Songyang Zhang · Song Liu · Zeming Li · Xuming He · Hongbin Sun · Jian Sun · Nanning Zheng
TL;DR: This paper introduces a dynamic network mechanism into Vision Transformers to reduce the spatial redundancy of image features.
OpenReview: https://openreview.net/forum?id=gnAIV-EKw2
PDF: https://openreview.net/pdf?id=gnAIV-EKw2
Authors: Chongjian Ge · Youwei Liang · Yibing Song · Jianbo Jiao · Jue Wang · Ping Luo
TL;DR: We revitalize CNN encoder attentions via transformer in self-supervised visual representation learning
OpenReview: https://openreview.net/forum?id=sRojdWhXJx
PDF: https://openreview.net/pdf?id=sRojdWhXJx
Authors: Yongming Rao · Wenliang Zhao · Benlin Liu · Jiwen Lu · Jie Zhou · Cho-Jui Hsieh
TL;DR: We propose a dynamic token sparsification framework to prune redundant tokens progressively and dynamically for vision transformer acceleration.
OpenReview: https://openreview.net/forum?id=jB0Nlbwlybm
PDF: https://openreview.net/pdf?id=jB0Nlbwlybm
Authors: Kai Han · An Xiao · Enhua Wu · Jianyuan Guo · Chunjing Xu · Yunhe Wang
TL;DR: None
OpenReview: https://openreview.net/forum?id=iFODavhthGZ
PDF: https://openreview.net/pdf?id=iFODavhthGZ
Authors: Sukjun Hwang · Miran Heo · Seoung Wug Oh · Seon Joo Kim
TL;DR: None
OpenReview: https://openreview.net/forum?id=pvjfA4wogD6
PDF: https://openreview.net/pdf?id=pvjfA4wogD6
Authors: Yining Ma · Jingwen Li · Zhiguang Cao · Wen Song · Le Zhang · Zhenghua Chen · Jing Tang
TL;DR: We present a Dual-Aspect Collaborative Transformer to solve vehicle routing problems, which delivers superior performance.
OpenReview: https://openreview.net/forum?id=63pC59XOZLZ
PDF: https://openreview.net/pdf?id=63pC59XOZLZ
Authors: Xuefan Zha · Wentao Zhu · Lv Xun · Sen Yang · Ji Liu
TL;DR: None
OpenReview: https://openreview.net/forum?id=fDSDkiiXHzj
PDF: https://openreview.net/pdf?id=fDSDkiiXHzj
Authors: Tianlong Chen · Yu Cheng · Zhe Gan · Lu Yuan · Lei Zhang · Zhangyang Wang
TL;DR: We jointly optimize model parameters and explore connectivity throughout training, ending up with one sparse vision transformer as the final output.
OpenReview: https://openreview.net/forum?id=LKoMTwTuQnC
PDF: https://openreview.net/pdf?id=LKoMTwTuQnC
Authors: Yuhui Yuan · Rao Fu · Lang Huang · Weihong Lin · Chao Zhang · Xilin Chen · Jingdong Wang
TL;DR: None
OpenReview: https://openreview.net/forum?id=DF8LCjR03tX
PDF: https://openreview.net/pdf?id=DF8LCjR03tX
29. TransMIL: Transformer based Correlated Multiple Instance Learning for Whole Slide Image Classification
Authors: Zhuchen Shao · Hao Bian · Yang Chen · Yifeng Wang · Jian Zhang · Xiangyang Ji · Yongbing Zhang
TL;DR: We propose a correlated MIL framework and devise a Transformer-based MIL for weakly supervised classification of whole slide images.
OpenReview: https://openreview.net/forum?id=LKUfuWxajHc
PDF: https://openreview.net/pdf?id=LKUfuWxajHc
Authors: Sangjoon Park · Gwanghyun Kim · Jeongsol Kim · Boah Kim · Jong Chul Ye
TL;DR: We propose a novel Federated Split Task-Agnostic (FeSTA) framework that leverages the Vision Transformer to simultaneously process multiple CXR tasks, including the diagnosis of COVID-19.
OpenReview: https://openreview.net/forum?id=Ggikq6Tdxch
PDF: https://openreview.net/pdf?id=Ggikq6Tdxch
Authors: Zhaowen Li · Zhiyang Chen · Fan Yang · Wei Li · Yousong Zhu · Chaoyang Zhao · Rui Deng · Liwei Wu · Rui Zhao · Ming Tang · Jinqiao Wang
TL;DR: None
OpenReview: https://openreview.net/forum?id=y_OmkmCH9w
PDF: https://openreview.net/pdf?id=y_OmkmCH9w
Authors: Yufei Xu · Qiming Zhang · Jing Zhang · Dacheng Tao
TL;DR: None
OpenReview: https://openreview.net/forum?id=_RnHyIeu5Y5
PDF: https://openreview.net/pdf?id=_RnHyIeu5Y5
Authors: Jianwei Yang · Chunyuan Li · Pengchuan Zhang · Xiyang Dai · Bin Xiao · Lu Yuan · Jianfeng Gao
TL;DR: An effective focal attention mechanism for modeling short- and long-range visual dependencies in Vision Transformers
OpenReview: https://openreview.net/forum?id=2zCRcTafea
PDF: https://openreview.net/pdf?id=2zCRcTafea
Authors: Shengju Qian · Hao Shao · Yi Zhu · Mu Li · Jiaya Jia
TL;DR: None
OpenReview: https://openreview.net/forum?id=0-0Wk0t6A_Z
PDF: https://openreview.net/pdf?id=0-0Wk0t6A_Z
Authors: Muchen Li · Leonid Sigal
TL;DR: A one-step transformer-based model that solves Referring Expression Detection and Referring Expression Segmentation jointly.
OpenReview: https://openreview.net/forum?id=j7u7cJDBo8p
PDF: https://openreview.net/pdf?id=j7u7cJDBo8p
Authors: Han Shu · Jiahao Wang · Hanting Chen · Lin Li · Yujiu Yang · Yunhe Wang
TL;DR: Implementing transformers using cheap addition operations
OpenReview: https://openreview.net/forum?id=5Ld5bRB9jzY
PDF: https://openreview.net/pdf?id=5Ld5bRB9jzY
Authors: Shizhe Chen · Pierre-Louis Guhur · Cordelia Schmid · Ivan Laptev
TL;DR: None
OpenReview: https://openreview.net/forum?id=SQxuiYf2TT
PDF: https://openreview.net/pdf?id=SQxuiYf2TT
Authors: Zongxin Yang · Yunchao Wei · Yi Yang
TL;DR: None
OpenReview: https://openreview.net/forum?id=hl3v8io3ZYt
PDF: https://openreview.net/pdf?id=hl3v8io3ZYt
Authors: Xiangxiang Chu · Zhi Tian · Yuqing Wang · Bo Zhang · Haibing Ren · Xiaolin Wei · Huaxia Xia · Chunhua Shen
TL;DR: Two simple and effective vision transformer designs that are on par with the Swin transformer
OpenReview: https://openreview.net/forum?id=5kTlVBkzSRx
PDF: https://openreview.net/pdf?id=5kTlVBkzSRx
Authors: Bowen Pan · Rameswar Panda · Yifan Jiang · Zhangyang Wang · Rogerio Feris · Aude Oliva
TL;DR: An input-dependent and interpretable dynamic inference framework for vision transformer, which adaptively decides the patch tokens to compute per input instance.
OpenReview: https://openreview.net/forum?id=7X_sBjIwtm9
PDF: https://openreview.net/pdf?id=7X_sBjIwtm9
41. TransMatcher: Deep Image Matching Through Transformers for Generalizable Person Re-identification
Authors: Shengcai Liao · Ling Shao
TL;DR: A transformer-based deep image matching method is developed for generalizable person re-identification, achieving state-of-the-art performance.
OpenReview: https://openreview.net/forum?id=I3yGrFoH8DF
PDF: https://openreview.net/pdf?id=I3yGrFoH8DF
Authors: Seokju Cho · Sunghwan Hong · Sangryul Jeon · Yunsung Lee · Kwanghoon Sohn · Seungryong Kim
TL;DR: None
OpenReview: https://openreview.net/forum?id=eVuMspr9cu5
PDF: https://openreview.net/pdf?id=eVuMspr9cu5
Authors: Drew Arad Hudson · Larry Zitnick
TL;DR: A new compositional transformer model for recurrent scene generation.
OpenReview: https://openreview.net/forum?id=YQeWoRnwTnE
PDF: https://openreview.net/pdf?id=YQeWoRnwTnE
Authors: Abhinav Moudgil · Arjun Majumdar · Harsh Agrawal · Stefan Lee · Dhruv Batra
TL;DR: We design a new vision-and-language navigation agent that operates on both scene and object features with a multimodal transformer using a selective attention pattern for object-centric processing.
OpenReview: https://openreview.net/forum?id=E5EoQqCVYX
PDF: https://openreview.net/pdf?id=E5EoQqCVYX
Authors: Qihang Yu · Yingda Xia · Yutong Bai · Yongyi Lu · Alan Yuille · Wei Shen
TL;DR: A new state-of-the-art, efficient and effective Vision Transformer with a Glance-and-Gaze mechanism
OpenReview: https://openreview.net/forum?id=GitDcBlcg78
PDF: https://openreview.net/pdf?id=GitDcBlcg78
Authors: Alaa El-Nouby Ali · Hugo Touvron · Mathilde Caron · Piotr Bojanowski · Matthijs Douze · Armand Joulin · Ivan Laptev · Natalia Neverova · Gabriel Synnaeve · Jakob Verbeek · Herve Jegou
TL;DR: A new transformer model for image processing, whose complexity is linear in the image resolution with no approximation
OpenReview: https://openreview.net/forum?id=kzPtpIpF8o
PDF: https://openreview.net/pdf?id=kzPtpIpF8o
Authors: Yutong Bai · Jieru Mei · Alan Yuille · Cihang Xie
TL;DR: We provide the first fair and in-depth comparison between Transformers and CNNs
OpenReview: https://openreview.net/forum?id=hbHkvGBZB9
PDF: https://openreview.net/pdf?id=hbHkvGBZB9
Authors: Jingyu Yang · Sheng Shen · Huanjing Yue · Kun Li
TL;DR: A novel method for Screen Content Image Continuous Super-Resolution
OpenReview: https://openreview.net/forum?id=x4t0fxWPNdi
PDF: https://openreview.net/pdf?id=x4t0fxWPNdi
Authors: Gengwei Zhang · Guoliang Kang · Yi Yang · Yunchao Wei
TL;DR: We propose a cycle-consistent transformer (CyCTR) to aggregate the pixel-wise support features into the query ones for few-shot segmentation while avoiding being biased by harmful information.
OpenReview: https://openreview.net/forum?id=LWH-C1HoQG_
PDF: https://openreview.net/pdf?id=LWH-C1HoQG_
Authors: Muhammad Muzammal Naseer · Kanchana Ranasinghe · Salman H Khan · Munawar Hayat · Fahad Shahbaz Khan · Ming-Hsuan Yang
TL;DR: Analysis of content-dependent long-range interaction modeling capabilities of Vision Transformers in terms of robustness against image nuisances such as severe occlusions, domain shifts, spatial permutations, adversarial and natural perturbations.
OpenReview: https://openreview.net/forum?id=o2mbl-Hmfgd
PDF: https://openreview.net/pdf?id=o2mbl-Hmfgd
Authors: Qinglong Zhang · Yu-Bin Yang
TL;DR: An efficient multi-scale vision Transformer that can tackle input images with arbitrary size.
OpenReview: https://openreview.net/forum?id=6Ab68Ip4Mu
PDF: https://openreview.net/pdf?id=6Ab68Ip4Mu
Authors: Aljaz Bozic · Pablo Palafox · Justus Thies · Angela Dai · Matthias Niessner
TL;DR: We propose a transformer-based approach for 3D scene reconstruction from multi-view RGB input.
OpenReview: https://openreview.net/forum?id=ZEoMBPtvqey
PDF: https://openreview.net/pdf?id=ZEoMBPtvqey
Authors: Hongji Yang · Xiufan Lu · Yingying Zhu
TL;DR: This paper proposes a novel layer-to-layer Transformer for cross-view geo-localization.
OpenReview: https://openreview.net/forum?id=tQgj7CDTfKB
PDF: https://openreview.net/pdf?id=tQgj7CDTfKB
Authors: Mandela Patrick · Dylan Campbell · Yuki Asano · Ishan Misra · Florian Metze · Christoph Feichtenhofer · Andrea Vedaldi · João Henriques
TL;DR: A new attention block for video transformers that implicitly models motion paths
OpenReview: https://openreview.net/forum?id=mfQxdSMWOF
PDF: https://openreview.net/pdf?id=mfQxdSMWOF
Authors: Adrian Bulat · Juanma Perez Rua · Swathikiran Sudhakaran · Brais Martinez · Georgios Tzimiropoulos
TL;DR: None
OpenReview: https://openreview.net/forum?id=QgX15Mdi1E_
PDF: https://openreview.net/pdf?id=QgX15Mdi1E_