Papers-2022.md

December 2022

  • Rethinking with Retrieval: Faithful Large Language Model Inference - [Arxiv] [QA]
  • A Survey on In-context Learning - [Arxiv] [QA]
  • Disjoint Masking with Joint Distillation for Efficient Masked Image Modeling - [Arxiv] [QA]
  • Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? - [Arxiv] [QA]
  • Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models - [Arxiv] [QA]
  • Unlearnable Clusters: Towards Label-agnostic Unlearnable Examples - [Arxiv] [QA]
  • Tracing the Origin of Adversarial Attack for Forensic Investigation and Deterrence - [Arxiv] [QA]
  • Imitator: Personalized Speech-driven 3D Facial Animation - [Arxiv] [QA]
  • NeRF-Gaze: A Head-Eye Redirection Parametric Model for Gaze Estimation - [Arxiv] [QA]
  • NIRVANA: Neural Implicit Representations of Videos with Adaptive Networks and Autoregressive Patch-wise Modeling - [Arxiv] [QA]
  • Transformer in Transformer as Backbone for Deep Reinforcement Learning - [Arxiv] [QA]
  • Scale-MAE: A Scale-Aware Masked Autoencoder for Multiscale Geospatial Representation Learning - [Arxiv] [QA]
  • Improving Visual Representation Learning through Perceptual Understanding - [Arxiv] [QA]
  • Effects of Data Geometry in Early Deep Learning - [Arxiv] [QA]
  • StyleRes: Transforming the Residuals for Real Image Editing with StyleGAN - [Arxiv] [QA]
  • MagicNet: Semi-Supervised Multi-Organ Segmentation via Magic-Cube Partition and Recovery - [Arxiv] [QA]
  • Foreground-Background Separation through Concept Distillation from Generative Image Foundation Models - [Arxiv] [QA]
  • Detection of out-of-distribution samples using binary neuron activation patterns - [Arxiv] [QA]
  • Discriminator-Cooperated Feature Map Distillation for GAN Compression - [Arxiv] [QA]
  • Cramming: Training a Language Model on a Single GPU in One Day - [Arxiv] [QA]
  • Demonstrate-Search-Predict: Composing retrieval and language models for knowledge-intensive NLP - [Arxiv] [QA]
  • Dream3D: Zero-Shot Text-to-3D Synthesis Using 3D Shape Prior and Text-to-Image Diffusion Models - [Arxiv] [QA]
  • Multi-Realism Image Compression with a Conditional Generator - [Arxiv] [QA]
  • Noise-aware Learning from Web-crawled Image-Text Data for Image Captioning - [Arxiv] [QA]
  • Interactive Segmentation of Radiance Fields - [Arxiv] [QA]
  • GEDI: GEnerative and DIscriminative Training for Self-Supervised Learning - [Arxiv] [QA]
  • Behavioral Cloning via Search in Video PreTraining Latent Space - [Arxiv] [QA]
  • DSI2I: Dense Style for Unpaired Image-to-Image Translation - [Arxiv] [QA]
  • Large Language Models Encode Clinical Knowledge - [Arxiv] [QA]
  • SMMix: Self-Motivated Image Mixing for Vision Transformers - [Arxiv] [QA]
  • TexPose: Neural Texture Learning for Self-Supervised 6D Object Pose Estimation - [Arxiv] [QA]
  • When Do Curricula Work in Federated Learning? - [Arxiv] [QA]
  • On Realization of Intelligent Decision-Making in the Real World: A Foundation Decision Model Perspective - [Arxiv] [QA]
  • HandsOff: Labeled Dataset Generation With No Additional Human Annotations - [Arxiv] [QA]
  • xFBD: Focused Building Damage Dataset and Analysis - [Arxiv] [QA]
  • Detecting Objects with Context-Likelihood Graphs and Graph Refinement - [Arxiv] [QA]
  • A-NeSI: A Scalable Approximate Method for Probabilistic Neurosymbolic Inference - [Arxiv] [QA]
  • Do DALL-E and Flamingo Understand Each Other? - [Arxiv] [QA]
  • Learning to Detect and Segment for Open Vocabulary Object Detection - [Arxiv] [QA]
  • On Calibrating Semantic Segmentation Models: Analyses and An Algorithm - [Arxiv] [QA]
  • OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization - [Arxiv] [QA]
  • DisCoScene: Spatially Disentangled Generative Radiance Fields for Controllable 3D-aware Scene Synthesis - [Arxiv] [QA]
  • Shakes on a Plane: Unsupervised Depth Estimation from Unstabilized Photography - [Arxiv] [QA]
  • Removing Objects From Neural Radiance Fields - [Arxiv] [QA]
  • Markov Categories and Entropy - [Arxiv] [QA]
  • Text Generation with Diffusion Language Models: A Pre-training Approach with Continuous Paragraph Denoise - [Arxiv] [QA]
  • DDColor: Towards Photo-Realistic Image Colorization via Dual Decoders - [Arxiv] [QA]
  • Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation - [Arxiv] [QA]
  • Generalized Decoding for Pixel, Image, and Language - [Arxiv] [QA]
  • 3D Highlighter: Localizing Regions on 3D Shapes via Text Descriptions - [Arxiv] [QA]
  • Hi-LASSIE: High-Fidelity Articulated Shape and Skeleton Discovery from Sparse Image Ensemble - [Arxiv] [QA]
  • Revisiting Residual Networks for Adversarial Robustness: An Architectural Perspective - [Arxiv] [QA]
  • TruFor: Leveraging all-round clues for trustworthy image forgery detection and localization - [Arxiv] [QA]
  • Critic-Guided Decoding for Controlled Text Generation - [Arxiv] [QA]
  • In-Sensor & Neuromorphic Computing are all you need for Energy Efficient Computer Vision - [Arxiv] [QA]
  • MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning - [Arxiv] [QA]
  • MoralDial: A Framework to Train and Evaluate Moral Dialogue Systems via Moral Discussions - [Arxiv] [QA]
  • PaletteNeRF: Palette-based Appearance Editing of Neural Radiance Fields - [Arxiv] [QA]
  • Analyzing Semantic Faithfulness of Language Models via Input Intervention on Conversational Question Answering - [Arxiv] [QA]
  • Scene-aware Egocentric 3D Human Pose Estimation - [Arxiv] [QA]
  • Full-Body Articulated Human-Object Interaction - [Arxiv] [QA]
  • Ontologically Faithful Generation of Non-Player Character Dialogues - [Arxiv] [QA]
  • Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers - [Arxiv] [QA]
  • Unleashing the Power of Visual Prompting At the Pixel Level - [Arxiv] [QA]
  • InstantAvatar: Learning Avatars from Monocular Video in 60 Seconds - [Arxiv] [QA]
  • A Survey of Deep Learning for Mathematical Reasoning - [Arxiv] [QA]
  • Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions - [Arxiv] [QA]
  • Precise Zero-Shot Dense Retrieval without Relevance Labels - [Arxiv] [QA]
  • LAMBADA: Backward Chaining for Automated Reasoning in Natural Language - [Arxiv] [QA]
  • Controllable Text Generation with Language Constraints - [Arxiv] [QA]
  • QuantArt: Quantizing Image Style Transfer Towards High Visual Fidelity - [Arxiv] [QA]
  • Towards Reasoning in Large Language Models: A Survey - [Arxiv] [QA]
  • SeqDiffuSeq: Text Diffusion with Encoder-Decoder Transformers - [Arxiv] [QA]
  • ReCode: Robustness Evaluation of Code Generation Models - [Arxiv] [QA]
  • Pre-trained Language Models for Keyphrase Generation: A Thorough Empirical Study - [Arxiv] [QA]
  • StyleDomain: Efficient and Lightweight Parameterizations of StyleGAN for One-shot and Few-shot Domain Adaptation - [Arxiv] [QA]
  • Hoyer regularizer is all you need for ultra low-latency spiking neural networks - [Arxiv] [QA]
  • Planning-oriented Autonomous Driving - [Arxiv] [QA]
  • Large Language Models Are Reasoning Teachers - [Arxiv] [QA]
  • RepMode: Learning to Re-parameterize Diverse Experts for Subcellular Structure Prediction - [Arxiv] [QA]
  • I Cast Detect Thoughts: Learning to Converse and Guide with Intents and Theory-of-Mind in Dungeons and Dragons - [Arxiv] [QA]
  • Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters - [Arxiv] [QA]
  • MetaCLUE: Towards Comprehensive Visual Metaphors Research - [Arxiv] [QA]
  • Panoptic Lifting for 3D Scene Understanding with Neural Fields - [Arxiv] [QA]
  • Denotationally Correct, Purely Functional, Efficient Reverse-mode Automatic Differentiation - [Arxiv] [QA]
  • Scalable Diffusion Models with Transformers - [Arxiv] [QA]
  • One Embedder, Any Task: Instruction-Finetuned Text Embeddings - [Arxiv] [QA]
  • Position-guided Text Prompt for Vision-Language Pre-training - [Arxiv] [QA]
  • Don't Generate, Discriminate: A Proposal for Grounding Language Models to Real-World Environments - [Arxiv] [QA]
  • The case for 4-bit precision: k-bit Inference Scaling Laws - [Arxiv] [QA]
  • A Probabilistic Framework for Lifelong Test-Time Adaptation - [Arxiv] [QA]
  • Reasoning with Language Model Prompting: A Survey - [Arxiv] [QA]
  • Large Language Models are Better Reasoners with Self-Verification - [Arxiv] [QA]
  • Interactive Cartoonization with Controllable Perceptual Factors - [Arxiv] [QA]
  • HARP: Personalized Hand Reconstruction from a Monocular RGB Video - [Arxiv] [QA]
  • MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation - [Arxiv] [QA]
  • Latent Diffusion for Language Generation - [Arxiv] [QA]
  • Difformer: Empowering Diffusion Models on the Embedding Space for Text Generation - [Arxiv] [QA]
  • Distilling Vision-Language Pre-training to Collaborate with Weakly-Supervised Temporal Action Localization - [Arxiv] [QA]
  • Out-of-domain GAN inversion via Invertibility Decomposition for Photo-Realistic Human Face Manipulation - [Arxiv] [QA]
  • Discovering Language Model Behaviors with Model-Written Evaluations - [Arxiv] [QA]
  • PAL: Persona-Augmented Emotional Support Conversation Generation - [Arxiv] [QA]
  • Discrete Point-wise Attack Is Not Enough: Generalized Manifold Adversarial Attack for Face Recognition - [Arxiv] [QA]
  • Emergent Analogical Reasoning in Large Language Models - [Arxiv] [QA]
  • Don't Forget Your ABC's: Evaluating the State-of-the-Art in Chat-Oriented Dialogue Systems - [Arxiv] [QA]
  • Can Retriever-Augmented Language Models Reason? The Blame Game Between the Retriever and the Language Model - [Arxiv] [QA]
  • Let's Negotiate! A Survey of Negotiation Dialogue Systems - [Arxiv] [QA]
  • Masked Wavelet Representation for Compact Neural Radiance Fields - [Arxiv] [QA]
  • Fine-Tuning Is All You Need to Mitigate Backdoor Attacks - [Arxiv] [QA]
  • Minimizing Maximum Model Discrepancy for Transferable Black-box Targeted Attacks - [Arxiv] [QA]
  • A Layered Architecture for Universal Causality - [Arxiv] [QA]
  • Are We Ready for Vision-Centric Driving Streaming Perception? The ASAP Benchmark - [Arxiv] [QA]
  • Neural Story Planning - [Arxiv] [QA]
  • Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models - [Arxiv] [QA]
  • The Impact of Symbolic Representations on In-context Learning for Few-shot Reasoning - [Arxiv] [QA]
  • Attentive Mask CLIP - [Arxiv] [QA]
  • GFPose: Learning 3D Human Pose Prior with Gradient Fields - [Arxiv] [QA]
  • Learnable Commutative Monoids for Graph Neural Networks - [Arxiv] [QA]
  • Teaching Small Language Models to Reason - [Arxiv] [QA]
  • Autoencoders as Cross-Modal Teachers: Can Pretrained 2D Image Transformers Help 3D Representation Learning? - [Arxiv] [QA]
  • RepQ-ViT: Scale Reparameterization for Post-Training Quantization of Vision Transformers - [Arxiv] [QA]
  • Backdoor Attack Detection in Computer Vision by Applying Matrix Factorization on the Weights of Deep Networks - [Arxiv] [QA]
  • Injecting Domain Knowledge in Language Models for Task-Oriented Dialogue Systems - [Arxiv] [QA]
  • MAViL: Masked Audio-Video Learners - [Arxiv] [QA]
  • VolRecon: Volume Rendering of Signed Ray Distance Functions for Generalizable Multi-View Reconstruction - [Arxiv] [QA]
  • MetaPortrait: Identity-Preserving Talking Head Generation with Fast Personalized Adaptation - [Arxiv] [QA]
  • On Second Thought, Let's Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning - [Arxiv] [QA]
  • Rethinking Vision Transformers for MobileNet Size and Speed - [Arxiv] [QA]
  • Real-Time Neural Light Field on Mobile Devices - [Arxiv] [QA]
  • Objaverse: A Universe of Annotated 3D Objects - [Arxiv] [QA]
  • Sliced Optimal Partial Transport - [Arxiv] [QA]
  • CLIPPO: Image-and-Language Understanding from Pixels Only - [Arxiv] [QA]
  • FlexiViT: One Model for All Patch Sizes - [Arxiv] [QA]
  • EVAL: Explainable Video Anomaly Localization - [Arxiv] [QA]
  • DeepLSD: Line Segment Detection and Refinement with Deep Image Gradients - [Arxiv] [QA]
  • Relightable Neural Human Assets from Multi-view Gradient Illuminations - [Arxiv] [QA]
  • Constitutional AI: Harmlessness from AI Feedback - [Arxiv] [QA]
  • NeuralDome: A Neural Modeling Pipeline on Multi-View Human-Object Interactions - [Arxiv] [QA]
  • Enhanced Training of Query-Based Object Detection via Selective Query Recollection - [Arxiv] [QA]
  • SteerNeRF: Accelerating NeRF Rendering via Smooth Viewpoint Trajectory - [Arxiv] [QA]
  • IMos: Intent-Driven Full-Body Motion Synthesis for Human-Object Interactions - [Arxiv] [QA]
  • Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language - [Arxiv] [QA]
  • ECON: Explicit Clothed humans Optimized via Normal integration - [Arxiv] [QA]
  • Self-Supervised Geometry-Aware Encoder for Style-Based 3D GAN Inversion - [Arxiv] [QA]
  • Policy Adaptation from Foundation Model Feedback - [Arxiv] [QA]
  • NoPe-NeRF: Optimising Neural Radiance Field with No Pose Prior - [Arxiv] [QA]
  • ConQueR: Query Contrast Voxel-DETR for 3D Object Detection - [Arxiv] [QA]
  • HOOD: Hierarchical Graphs for Generalized Modelling of Clothing Dynamics - [Arxiv] [QA]
  • Reproducible scaling laws for contrastive language-image learning - [Arxiv] [QA]
  • PD-Quant: Post-Training Quantization based on Prediction Difference Metric - [Arxiv] [QA]
  • Understanding Zero-Shot Adversarial Robustness for Large-Scale Models - [Arxiv] [QA]
  • EgoLoc: Revisiting 3D Object Localization from Egocentric Videos with Visual Queries - [Arxiv] [QA]
  • Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting - [Arxiv] [QA]
  • CREPE: Can Vision-Language Foundation Models Reason Compositionally? - [Arxiv] [QA]
  • Look Before You Match: Instance Understanding Matters in Video Object Segmentation - [Arxiv] [QA]
  • Structured 3D Features for Reconstructing Controllable Avatars - [Arxiv] [QA]
  • Learning 3D Representations from 2D Pre-trained Models via Image-to-Point Masked Autoencoders - [Arxiv] [QA]
  • Category Theory for Quantum Natural Language Processing - [Arxiv] [QA]
  • FastMIM: Expediting Masked Image Modeling Pre-training for Vision - [Arxiv] [QA]
  • Pixel is All You Need: Adversarial Trajectory-Ensemble Active Learning for Salient Object Detection - [Arxiv] [QA]
  • DA Wand: Distortion-Aware Selection using Neural Mesh Parameterization - [Arxiv] [QA]
  • DeepMapping2: Self-Supervised Large-Scale LiDAR Map Optimization - [Arxiv] [QA]
  • Doubly Right Object Recognition: A Why Prompt for Visual Rationales - [Arxiv] [QA]
  • Breaking the "Object" in Video Object Segmentation - [Arxiv] [QA]
  • Rodin: A Generative Model for Sculpting 3D Digital Avatars Using Diffusion - [Arxiv] [QA]
  • RGBD2: Generative Scene Synthesis via Incremental View Inpainting using RGBD Diffusion Models - [Arxiv] [QA]
  • Towards Practical Plug-and-Play Diffusion Models - [Arxiv] [QA]
  • Evaluation and Improvement of Interpretability for Self-Explainable Part-Prototype Networks - [Arxiv] [QA]
  • ALSO: Automotive Lidar Self-supervision by Occupancy estimation - [Arxiv] [QA]
  • Accelerating Dataset Distillation via Model Augmentation - [Arxiv] [QA]
  • MoDem: Accelerating Visual Model-Based Reinforcement Learning with Demonstrations - [Arxiv] [QA]
  • REAP: A Large-Scale Realistic Adversarial Patch Benchmark - [Arxiv] [QA]
  • Masked autoencoders are effective solution to transformer data-hungry - [Arxiv] [QA]
  • Cross-Modal Learning with 3D Deformable Attention for Action Recognition - [Arxiv] [QA]
  • Recurrent Vision Transformers for Object Detection with Event Cameras - [Arxiv] [QA]
  • PromptCAL: Contrastive Affinity Learning via Auxiliary Prompts for Generalized Novel Category Discovery - [Arxiv] [QA]
  • How to Backdoor Diffusion Models? - [Arxiv] [QA]
  • Source-free Depth for Object Pop-out - [Arxiv] [QA]
  • HumanGen: Generating Human Radiance Fields with Explicit Priors - [Arxiv] [QA]
  • Position Embedding Needs an Independent Layer Normalization - [Arxiv] [QA]
  • NeuS2: Fast Learning of Neural Implicit Surfaces for Multi-view Reconstruction - [Arxiv] [QA]
  • MAGVIT: Masked Generative Video Transformer - [Arxiv] [QA]
  • A Whac-A-Mole Dilemma: Shortcuts Come in Multiples Where Mitigating One Amplifies Others - [Arxiv] [QA]
  • VindLU: A Recipe for Effective Video-and-Language Pretraining - [Arxiv] [QA]
  • SmartBrush: Text and Shape Guided Object Inpainting with Diffusion Model - [Arxiv] [QA]
  • Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis - [Arxiv] [QA]
  • Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning - [Arxiv] [QA]
  • Augmentation Matters: A Simple-yet-Effective Approach to Semi-supervised Semantic Segmentation - [Arxiv] [QA]
  • Seeing a Rose in Five Thousand Ways - [Arxiv] [QA]
  • Information-Theoretic Safe Exploration with Gaussian Processes - [Arxiv] [QA]
  • Genie: Show Me the Data for Quantization - [Arxiv] [QA]
  • SLAM for Visually Impaired Navigation: A Systematic Literature Review of the Current State of Research - [Arxiv] [QA]
  • ShadowDiffusion: When Degradation Prior Meets Diffusion Model for Shadow Removal - [Arxiv] [QA]
  • Primal Dual Alternating Proximal Gradient Algorithms for Nonsmooth Nonconvex Minimax Problems with Coupled Linear Constraints - [Arxiv] [QA]
  • MIMO Is All You Need : A Strong Multi-In-Multi-Out Baseline for Video Prediction - [Arxiv] [QA]
  • FLAG3D: A 3D Fitness Activity Dataset with Language Instruction - [Arxiv] [QA]
  • Ego-Body Pose Estimation via Ego-Head Pose Estimation - [Arxiv] [QA]
  • Structured Like a Language Model: Analysing AI as an Automated Subject - [Arxiv] [QA]
  • ORCa: Glossy Objects as Radiance Field Cameras - [Arxiv] [QA]
  • Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning - [Arxiv] [QA]
  • MoFusion: A Framework for Denoising-Diffusion-based Motion Synthesis - [Arxiv] [QA]
  • SDFusion: Multimodal 3D Shape Completion, Reconstruction, and Generation - [Arxiv] [QA]
  • Multi-Concept Customization of Text-to-Image Diffusion - [Arxiv] [QA]
  • Fresnel Microfacet BRDF: Unification of Polari-Radiometric Surface-Body Reflection - [Arxiv] [QA]
  • Phone2Proc: Bringing Robust Robots Into Our Chaotic World - [Arxiv] [QA]
  • Generating Holistic 3D Human Motion from Speech - [Arxiv] [QA]
  • BEVBert: Multimodal Map Pre-training for Language-guided Navigation - [Arxiv] [QA]
  • MIME: Human-Aware 3D Scene Generation - [Arxiv] [QA]
  • On the Robustness of Normalizing Flows for Inverse Problems in Imaging - [Arxiv] [QA]
  • GazeNeRF: 3D-Aware Gaze Redirection with Neural Radiance Fields - [Arxiv] [QA]
  • Decorate the Newcomers: Visual Domain Prompt for Continual Test Time Adaptation - [Arxiv] [QA]
  • Deep Incubation: Training Large Models by Divide-and-Conquering - [Arxiv] [QA]
  • Successive Prompting for Decomposing Complex Questions - [Arxiv] [QA]
  • Exploiting Completeness and Uncertainty of Pseudo Labels for Weakly Supervised Video Anomaly Detection - [Arxiv] [QA]
  • LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models - [Arxiv] [QA]
  • Learning to Dub Movies via Hierarchical Prosody Models - [Arxiv] [QA]
  • Executing your Commands via Motion Diffusion in Latent Space - [Arxiv] [QA]
  • Teaching Matters: Investigating the Role of Supervision in Vision Transformers - [Arxiv] [QA]
  • Talking Head Generation with Probabilistic Audio-to-Visual Diffusion Priors - [Arxiv] [QA]
  • iQuery: Instruments as Queries for Audio-Visual Sound Separation - [Arxiv] [QA]
  • Reconciling a Centroid-Hypothesis Conflict in Source-Free Domain Adaptation - [Arxiv] [QA]
  • GLeaD: Improving GANs with A Generator-Leading Task - [Arxiv] [QA]
  • FineDance: A Fine-grained Choreography Dataset for 3D Full Body Dance Generation - [Arxiv] [QA]
  • EditableNeRF: Editing Topologically Varying Neural Radiance Fields by Key Points - [Arxiv] [QA]
  • Name Your Colour For the Task: Artificially Discover Colour Naming via Colour Quantisation Transformer - [Arxiv] [QA]
  • Diffusion-SDF: Text-to-Shape via Voxelized Diffusion - [Arxiv] [QA]
  • NeRDi: Single-View NeRF Synthesis with Language-Guided Diffusion as General Image Priors - [Arxiv] [QA]
  • Fine-tuned CLIP Models are Efficient Video Learners - [Arxiv] [QA]
  • Perspective Fields for Single Image Camera Calibration - [Arxiv] [QA]
  • Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning - [Arxiv] [QA]
  • Visual Query Tuning: Towards Effective Usage of Intermediate Representations for Parameter and Memory Efficient Transfer Learning - [Arxiv] [QA]
  • InternVideo: General Video Foundation Models via Generative and Discriminative Learning - [Arxiv] [QA]
  • Semantic-Conditional Diffusion Networks for Image Captioning - [Arxiv] [QA]
  • Iterative Next Boundary Detection for Instance Segmentation of Tree Rings in Microscopy Images of Shrub Cross Sections - [Arxiv] [QA]
  • Leveraging Different Learning Styles for Improved Knowledge Distillation in Biomedical Imaging - [Arxiv] [QA]
  • Diffusion Video Autoencoders: Toward Temporally Consistent Face Video Editing via Disentangled Video Encoding - [Arxiv] [QA]
  • Adaptive Testing of Computer Vision Models - [Arxiv] [QA]
  • Learning Neural Parametric Head Models - [Arxiv] [QA]
  • Unifying Vision, Text, and Layout for Universal Document Processing - [Arxiv] [QA]
  • SceneRF: Self-Supervised Monocular 3D Scene Reconstruction with Radiance Fields - [Arxiv] [QA]
  • Images Speak in Images: A Generalist Painter for In-Context Visual Learning - [Arxiv] [QA]
  • PEANUT: Predicting and Navigating to Unseen Targets - [Arxiv] [QA]
  • One-shot Implicit Animatable Avatars with Model-based Priors - [Arxiv] [QA]
  • Block Selection Method for Using Feature Norm in Out-of-distribution Detection - [Arxiv] [QA]
  • I2MVFormer: Large Language Model Generated Multi-View Document Supervision for Zero-Shot Image Classification - [Arxiv] [QA]
  • Momentum Decoding: Open-ended Text Generation As Graph Exploration - [Arxiv] [QA]
  • Prototypical Residual Networks for Anomaly Detection and Localization - [Arxiv] [QA]
  • Learning Imbalanced Data with Vision Transformers - [Arxiv] [QA]
  • Multiscale Structure Guided Diffusion for Image Deblurring - [Arxiv] [QA]
  • Self-supervised AutoFlow - [Arxiv] [QA]
  • Improving Zero-shot Generalization and Robustness of Multi-modal Models - [Arxiv] [QA]
  • Fast Point Cloud Generation with Straight Flows - [Arxiv] [QA]
  • Neural Fourier Filter Bank - [Arxiv] [QA]
  • StegaNeRF: Embedding Invisible Information within Neural Radiance Fields - [Arxiv] [QA]
  • PartSLIP: Low-Shot Part Segmentation for 3D Point Clouds via Pretrained Image-Language Models - [Arxiv] [QA]
  • PGFed: Personalize Each Client's Global Objective for Federated Learning - [Arxiv] [QA]
  • PROB: Probabilistic Objectness for Open World Object Detection - [Arxiv] [QA]
  • Moving Beyond Downstream Task Accuracy for Information Retrieval Benchmarking - [Arxiv] [QA]
  • MIC: Masked Image Consistency for Context-Enhanced Domain Adaptation - [Arxiv] [QA]
  • DiffRF: Rendering-Guided 3D Radiance Field Diffusion - [Arxiv] [QA]
  • RT-NeRF: Real-Time On-Device Neural Radiance Fields Towards Immersive AR/VR Rendering - [Arxiv] [QA]
  • Are Straight-Through gradients and Soft-Thresholding all you need for Sparse Training? - [Arxiv] [QA]
  • Cloud-Device Collaborative Adaptation to Continual Changing Environments in the Real-world - [Arxiv] [QA]
  • StructVPR: Distill Structural Knowledge with Weighting Samples for Visual Place Recognition - [Arxiv] [QA]
  • Scaling Language-Image Pre-training via Masking - [Arxiv] [QA]
  • SparseFusion: Distilling View-conditioned Diffusion for 3D Reconstruction - [Arxiv] [QA]
  • Unite and Conquer: Plug & Play Multi-Modal Synthesis using Diffusion Models - [Arxiv] [QA]
  • 3D Segmentation of Humans in Point Clouds with Synthetic Data - [Arxiv] [QA]
  • Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs - [Arxiv] [QA]
  • ResFormer: Scaling ViTs with Multi-Resolution Training - [Arxiv] [QA]
  • Score Jacobian Chaining: Lifting Pretrained 2D Diffusion Models for 3D Generation - [Arxiv] [QA]
  • Exploiting Proximity-Aware Tasks for Embodied Social Navigation - [Arxiv] [QA]
  • Hyperbolic Contrastive Learning for Visual Representations beyond Objects - [Arxiv] [QA]
  • Finetune like you pretrain: Improved finetuning of zero-shot vision models - [Arxiv] [QA]
  • Graph Convolutional Neural Networks as Parametric CoKleisli morphisms - [Arxiv] [QA]
  • Language Model Pre-training on True Negatives - [Arxiv] [QA]
  • 3D-Aware Object Goal Navigation via Simultaneous Exploration and Identification - [Arxiv] [QA]
  • Parametric Information Maximization for Generalized Category Discovery - [Arxiv] [QA]
  • All You Need Is Hashing: Defending Against Data Reconstruction Attack in Vertical Federated Learning - [Arxiv] [QA]
  • Distilling Reasoning Capabilities into Smaller Language Models - [Arxiv] [QA]

November 2022

  • Plateau-reduced Differentiable Path Tracing - [Arxiv] [QA]
  • SinGRAF: Learning a 3D Generative Radiance Field for a Single Scene - [Arxiv] [QA]
  • CREPE: Open-Domain Question Answering with False Presuppositions - [Arxiv] [QA]
  • CLIPascene: Scene Sketching with Different Types and Levels of Abstraction - [Arxiv] [QA]
  • NeRFInvertor: High Fidelity NeRF-GAN Inversion for Single-shot Real Image Animation - [Arxiv] [QA]
  • Fast Inference from Transformers via Speculative Decoding - [Arxiv] [QA]
  • High-Fidelity Guided Image Synthesis with Latent Diffusion Models - [Arxiv] [QA]
  • Spatio-Temporal Crop Aggregation for Video Representation Learning - [Arxiv] [QA]
  • BASiS: Batch Aligned Spectral Embedding Space - [Arxiv] [QA]
  • DiffPose: Toward More Reliable 3D Pose Estimation - [Arxiv] [QA]
  • 3D GAN Inversion with Facial Symmetry Prior - [Arxiv] [QA]
  • 3D Neural Field Generation using Triplane Diffusion - [Arxiv] [QA]
  • DINER: Depth-aware Image-based NEural Radiance fields - [Arxiv] [QA]
  • Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles - [Arxiv] [QA]
  • NeuralLift-360: Lifting An In-the-wild 2D Photo to A 3D Object with 360° Views - [Arxiv] [QA]
  • RGB no more: Minimally-decoded JPEG Vision Transformers - [Arxiv] [QA]
  • Compressing Volumetric Radiance Fields to 1 MB - [Arxiv] [QA]
  • PLA: Language-Driven Open-Vocabulary 3D Scene Understanding - [Arxiv] [QA]
  • Advancing Deep Metric Learning Through Multiple Batch Norms And Multi-Targeted Adversarial Examples - [Arxiv] [QA]
  • Scalable Hierarchical Over-the-Air Federated Learning - [Arxiv] [QA]
  • Out-Of-Distribution Detection Is Not All You Need - [Arxiv] [QA]
  • Wavelet Diffusion Models are fast and scalable Image Generators - [Arxiv] [QA]
  • ExpNet: A unified network for Expert-Level Classification - [Arxiv] [QA]
  • NoisyQuant: Noisy Bias-Enhanced Post-Training Activation Quantization for Vision Transformers - [Arxiv] [QA]
  • Dimensionality-Varying Diffusion Process - [Arxiv] [QA]
  • UDE: A Unified Driving Engine for Human Motion Generation - [Arxiv] [QA]
  • SparsePose: Sparse-View Camera Pose Regression and Refinement - [Arxiv] [QA]
  • MegaBlocks: Efficient Sparse Training with Mixture-of-Experts - [Arxiv] [QA]
  • Decentralized Learning with Multi-Headed Distillation - [Arxiv] [QA]
  • Post-training Quantization on Diffusion Models - [Arxiv] [QA]
  • High-fidelity 3D GAN Inversion by Pseudo-multi-view Optimization - [Arxiv] [QA]
  • Connecting the Dots: Floorplan Reconstruction Using Two-Level Queries - [Arxiv] [QA]
  • Is Conditional Generative Modeling all you need for Decision-Making? - [Arxiv] [QA]
  • SuS-X: Training-Free Name-Only Transfer of Vision-Language Models - [Arxiv] [QA]
  • In-Hand 3D Object Scanning from an RGB Sequence - [Arxiv] [QA]
  • A Light Touch Approach to Teaching Transformers Multi-view Geometry - [Arxiv] [QA]
  • Class Adaptive Network Calibration - [Arxiv] [QA]
  • FeatureBooster: Boosting Feature Descriptors with a Lightweight Neural Network - [Arxiv] [QA]
  • High-fidelity Facial Avatar Reconstruction from Monocular Video with Generative Priors - [Arxiv] [QA]
  • DiffusionBERT: Improving Generative Masked Language Models with Diffusion Models - [Arxiv] [QA]
  • Optimal Sparse Regression Trees - [Arxiv] [QA]
  • Post-Processing Temporal Action Detection - [Arxiv] [QA]
  • FJMP: Factorized Joint Multi-Agent Motion Prediction over Learned Directed Acyclic Interaction Graphs - [Arxiv] [QA]
  • Dense Text Retrieval based on Pretrained Language Models: A Survey - [Arxiv] [QA]
  • 3DPPE: 3D Point Positional Encoding for Multi-Camera 3D Object Detection Transformers - [Arxiv] [QA]
  • SliceMatch: Geometry-guided Aggregation for Cross-View Pose Estimation - [Arxiv] [QA]
  • Towards Improved Input Masking for Convolutional Neural Networks - [Arxiv] [QA]
  • Residual Pattern Learning for Pixel-wise Out-of-Distribution Detection in Semantic Segmentation - [Arxiv] [QA]
  • Progressive Disentangled Representation Learning for Fine-Grained Controllable Talking Head Synthesis - [Arxiv] [QA]
  • Meta Architecture for Point Cloud Analysis - [Arxiv] [QA]
  • SpaText: Spatio-Textual Representation for Controllable Image Generation - [Arxiv] [QA]
  • RUST: Latent Neural Scene Representations from Unposed Imagery - [Arxiv] [QA]
  • BeLFusion: Latent Diffusion for Behavior-Driven Human Motion Prediction - [Arxiv] [QA]
  • RbA: Segmenting Unknown Regions Rejected by All - [Arxiv] [QA]
  • NeuralUDF: Learning Unsigned Distance Fields for Multi-view Reconstruction of Surfaces with Arbitrary Topologies - [Arxiv] [QA]
  • A Strong Baseline for Generalized Few-Shot Semantic Segmentation - [Arxiv] [QA]
  • ShadowNeuS: Neural SDF Reconstruction by Shadow Ray Supervision - [Arxiv] [QA]
  • Fine-Grained Face Swapping via Regional GAN Inversion - [Arxiv] [QA]
  • SCOOP: Self-Supervised Correspondence and Optimization-Based Scene Flow - [Arxiv] [QA]
  • CoMFormer: Continual Learning in Semantic and Panoptic Segmentation - [Arxiv] [QA]
  • Unsupervised Continual Semantic Adaptation through Neural Rendering - [Arxiv] [QA]
  • MPCViT: Searching for Accurate and Efficient MPC-Friendly Vision Transformer with Heterogeneous Attention - [Arxiv] [QA]
  • Learning Detailed Radiance Manifolds for High-Fidelity and 3D-Consistent Portrait Synthesis from Monocular Image - [Arxiv] [QA]
  • Learning with Silver Standard Data for Zero-shot Relation Extraction - [Arxiv] [QA]
  • FFHQ-UV: Normalized Facial UV-Texture Dataset for 3D Face Reconstruction - [Arxiv] [QA]
  • GEFF: Improving Any Clothes-Changing Person ReID Model using Gallery Enrichment with Face Features - [Arxiv] [QA]
  • SAGA: Spectral Adversarial Geometric Attack on 3D Meshes - [Arxiv] [QA]
  • Diffusion-SDF: Conditional Generative Modeling of Signed Distance Functions - [Arxiv] [QA]
  • Attention-based Feature Compression for CNN Inference Offloading in Edge Computing - [Arxiv] [QA]
  • Perception-Oriented Single Image Super-Resolution using Optimal Objective Estimation - [Arxiv] [QA]
  • SfM-TTR: Using Structure from Motion for Test-Time Refinement of Single-View Depth Networks - [Arxiv] [QA]
  • Video Test-Time Adaptation for Action Recognition - [Arxiv] [QA]
  • TSGP: Two-Stage Generative Prompting for Unsupervised Commonsense Question Answering - [Arxiv] [QA]
  • Pose-disentangled Contrastive Learning for Self-supervised Facial Representation - [Arxiv] [QA]
  • Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning - [Arxiv] [QA]
  • Shifted Diffusion for Text-to-image Generation - [Arxiv] [QA]
  • HouseDiffusion: Vector Floorplan Generation via a Diffusion Model with Discrete and Continuous Denoising - [Arxiv] [QA]
  • Paint by Example: Exemplar-based Image Editing with Diffusion Models - [Arxiv] [QA]
  • ClimateNeRF: Extreme Weather Synthesis in Neural Radiance Field - [Arxiv] [QA]
  • Generalizable Implicit Neural Representations via Instance Pattern Composers - [Arxiv] [QA]
  • SVFormer: Semi-supervised Video Transformer for Action Recognition - [Arxiv] [QA]
  • CODA-Prompt: COntinual Decomposed Attention-based Prompting for Rehearsal-Free Continual Learning - [Arxiv] [QA]
  • ReCo: Region-Controlled Text-to-Image Generation - [Arxiv] [QA]
  • Inversion-Based Style Transfer with Diffusion Models - [Arxiv] [QA]
  • Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation - [Arxiv] [QA]
  • FeTrIL: Feature Translation for Exemplar-Free Class-Incremental Learning - [Arxiv] [QA]
  • Robust Mean Teacher for Continual and Gradual Test-Time Adaptation - [Arxiv] [QA]
  • Open-vocabulary Attribute Detection - [Arxiv] [QA]
  • OReX: Object Reconstruction from Planar Cross-sections Using Neural Fields - [Arxiv] [QA]
  • ActMAD: Activation Matching to Align Distributions for Test-Time-Training - [Arxiv] [QA]
  • BAD-NeRF: Bundle Adjusted Deblur Neural Radiance Fields - [Arxiv] [QA]
  • Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation - [Arxiv] [QA]
  • Hand Avatar: Free-Pose Hand Animation and Rendering from Monocular Video - [Arxiv] [QA]
  • VoP: Text-Video Co-operative Prompt Tuning for Cross-Modal Retrieval - [Arxiv] [QA]
  • Texts as Images in Prompt Tuning for Multi-Label Image Recognition - [Arxiv] [QA]
  • Integrally Pre-Trained Transformer Pyramid Networks - [Arxiv] [QA]
  • PNI : Industrial Anomaly Detection using Position and Neighborhood Information - [Arxiv] [QA]
  • Improving Robust Generalization by Direct PAC-Bayesian Bound Minimization - [Arxiv] [QA]
  • Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks - [Arxiv] [QA]
  • PermutoSDF: Fast Multi-View Reconstruction with Implicit Surfaces using Permutohedral Lattices - [Arxiv] [QA]
  • CASSPR: Cross Attention Single Scan Place Recognition - [Arxiv] [QA]
  • AeDet: Azimuth-invariant Multi-view 3D Object Detection - [Arxiv] [QA]
  • Person Image Synthesis via Denoising Diffusion Model - [Arxiv] [QA]
  • Instant Volumetric Head Avatars - [Arxiv] [QA]
  • Shortcomings of Top-Down Randomization-Based Sanity Checks for Evaluations of Deep Neural Network Explanations - [Arxiv] [QA]
  • EDICT: Exact Diffusion Inversion via Coupled Transformations - [Arxiv] [QA]
  • OCTET: Object-aware Counterfactual Explanations - [Arxiv] [QA]
  • DETRs with Collaborative Hybrid Assignments Training - [Arxiv] [QA]
  • GlowGAN: Unsupervised Learning of HDR Images from LDR Images in the Wild - [Arxiv] [QA]
  • DOLCE: A Model-Based Probabilistic Diffusion Framework for Limited-Angle CT Reconstruction - [Arxiv] [QA]
  • Out-of-Candidate Rectification for Weakly Supervised Semantic Segmentation - [Arxiv] [QA]
  • SPIn-NeRF: Multiview Segmentation and Perceptual Inpainting with Neural Radiance Fields - [Arxiv] [QA]
  • Efficient Frequency Domain-based Transformers for High-Quality Image Deblurring - [Arxiv] [QA]
  • SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation - [Arxiv] [QA]
  • DiffDreamer: Towards Consistent Unsupervised Single-view Scene Extrapolation with Conditional Diffusion Models - [Arxiv] [QA]
  • Explaining Image Classifiers with Multiscale Directional Image Representation - [Arxiv] [QA]
  • Backdoor Cleansing with Unlabeled Data - [Arxiv] [QA]
  • Level-S$^2$fM: Structure from Motion on Neural Level Set of Implicit Surfaces - [Arxiv] [QA]
  • One Eye is All You Need: Lightweight Ensembles for Gaze Estimation with Single Encoders - [Arxiv] [QA]
  • Multi-Directional Subspace Editing in Style-Space - [Arxiv] [QA]
  • Visual Dexterity: In-hand Dexterous Manipulation from Depth - [Arxiv] [QA]
  • SceneComposer: Any-Level Semantic Image Synthesis - [Arxiv] [QA]
  • SPARF: Neural Radiance Fields from Sparse and Noisy Poses - [Arxiv] [QA]
  • PLIKS: A Pseudo-Linear Inverse Kinematic Solver for 3D Human Body Estimation - [Arxiv] [QA]
  • Teaching Structured Vision&Language Concepts to Vision&Language Models - [Arxiv] [QA]
  • Multitask Vision-Language Prompt Tuning - [Arxiv] [QA]
  • ESLAM: Efficient Dense SLAM System Based on Hybrid Representation of Signed Distance Fields - [Arxiv] [QA]
  • PointCLIP V2: Prompting CLIP and GPT for Powerful 3D Open-world Learning - [Arxiv] [QA]
  • Shape, Pose, and Appearance from a Single Image via Bootstrapped Radiance Field Inversion - [Arxiv] [QA]
  • Guided Depth Super-Resolution by Deep Anisotropic Diffusion - [Arxiv] [QA]
  • Efficient Second-Order Plane Adjustment - [Arxiv] [QA]
  • Local-to-Global Registration for Bundle-Adjusting Neural Radiance Fields - [Arxiv] [QA]
  • Delving StyleGAN Inversion for Image Editing: A Foundation Latent Space Viewpoint - [Arxiv] [QA]
  • SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training - [Arxiv] [QA]
  • MATE: Masked Autoencoders are Online 3D Test-Time Learners - [Arxiv] [QA]
  • Blur Interpolation Transformer for Real-World Motion from Blur - [Arxiv] [QA]
  • DyNCA: Real-time Dynamic Texture Synthesis Using Neural Cellular Automata - [Arxiv] [QA]
  • From Node Interaction to Hop Interaction: New Effective and Scalable Graph Learning Paradigm - [Arxiv] [QA]
  • Few-shot Non-line-of-sight Imaging with Signal-surface Collaborative Regularization - [Arxiv] [QA]
  • Instance-specific and Model-adaptive Supervision for Semi-supervised Semantic Segmentation - [Arxiv] [QA]
  • DeSTSeg: Segmentation Guided Denoising Student-Teacher for Anomaly Detection - [Arxiv] [QA]
  • Neural Dependencies Emerging from Learning Massive Categories - [Arxiv] [QA]
  • SeeABLE: Soft Discrepancies and Bounded Contrastive Learning for Exposing Deepfakes - [Arxiv] [QA]
  • DrapeNet: Garment Generation and Self-Supervised Draping - [Arxiv] [QA]
  • Investigating Prompt Engineering in Diffusion Models - [Arxiv] [QA]
  • Next3D: Generative Neural Texture Rasterization for 3D-Aware Head Avatars - [Arxiv] [QA]
  • NeuMap: Neural Coordinate Mapping by Auto-Transdecoder for Camera Localization - [Arxiv] [QA]
  • Vision Transformer with Super Token Sampling - [Arxiv] [QA]
  • Language in a Bottle: Language Model Guided Concept Bottlenecks for Interpretable Image Classification - [Arxiv] [QA]
  • You Need Multiple Exiting: Dynamic Early Exiting for Accelerating Unified Vision Language Model - [Arxiv] [QA]
  • DynIBaR: Neural Dynamic Image-Based Rendering - [Arxiv] [QA]
  • The Stack: 3 TB of permissively licensed source code - [Arxiv] [QA]
  • Minimizing the Accumulated Trajectory Error to Improve Dataset Distillation - [Arxiv] [QA]
  • Leveraging per Image-Token Consistency for Vision-Language Pre-training - [Arxiv] [QA]
  • DYNAFED: Tackling Client Data Heterogeneity with Global Dynamics - [Arxiv] [QA]
  • Learning to Generate Image Embeddings with User-level Differential Privacy - [Arxiv] [QA]
  • DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting - [Arxiv] [QA]
  • Passive Micron-scale Time-of-Flight with Sunlight Interferometry - [Arxiv] [QA]
  • EDGE: Editable Dance Generation From Music - [Arxiv] [QA]
  • Parallel Diffusion Models of Operator and Image for Blind Inverse Problems - [Arxiv] [QA]
  • Solving 3D Inverse Problems using Pre-trained 2D Diffusion Models - [Arxiv] [QA]
  • LidarGait: Benchmarking 3D Gait Recognition with Point Clouds - [Arxiv] [QA]
  • MatrixVT: Efficient Multi-Camera to BEV Transformation for 3D Perception - [Arxiv] [QA]
  • Tired of Over-smoothing? Stress Graph Drawing Is All You Need! - [Arxiv] [QA]
  • A Practical Stereo Depth System for Smart Glasses - [Arxiv] [QA]
  • Where is my Wallet? Modeling Object Proposal Sets for Egocentric Visual Query Localization - [Arxiv] [QA]
  • Magic3D: High-Resolution Text-to-3D Content Creation - [Arxiv] [QA]
  • BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision - [Arxiv] [QA]
  • PAL: Program-aided Language Models - [Arxiv] [QA]
  • Visual Programming: Compositional visual reasoning without training - [Arxiv] [QA]
  • Task Residual for Tuning Vision-Language Models - [Arxiv] [QA]
  • Patch-Craft Self-Supervised Training for Correlated Image Denoising - [Arxiv] [QA]
  • SPACE: Speech-driven Portrait Animation with Controllable Expression - [Arxiv] [QA]
  • Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks - [Arxiv] [QA]
  • Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information - [Arxiv] [QA]
  • InstructPix2Pix: Learning to Follow Image Editing Instructions - [Arxiv] [QA]
  • CAE v2: Context Autoencoder with CLIP Target - [Arxiv] [QA]
  • Null-text Inversion for Editing Real Images using Guided Diffusion Models - [Arxiv] [QA]
  • MOTRv2: Bootstrapping End-to-End Multi-Object Tracking by Pretrained Object Detectors - [Arxiv] [QA]
  • I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision - [Arxiv] [QA]
  • EfficientTrain: Exploring Generalized Curriculum Learning for Training Visual Backbones - [Arxiv] [QA]
  • AligNeRF: High-Fidelity Neural Radiance Fields via Alignment-Aware Training - [Arxiv] [QA]
  • CRAFT: Concept Recursive Activation FacTorization for Explainability - [Arxiv] [QA]
  • UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer - [Arxiv] [QA]
  • DETRDistill: A Universal Knowledge Distillation Framework for DETR-families - [Arxiv] [QA]
  • UMFuse: Unified Multi View Fusion for Human Editing applications - [Arxiv] [QA]
  • Task-aware Retrieval with Instructions - [Arxiv] [QA]
  • AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders - [Arxiv] [QA]
  • Token Turing Machines - [Arxiv] [QA]
  • MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis - [Arxiv] [QA]
  • Holistic Evaluation of Language Models - [Arxiv] [QA]
  • Galactica: A Large Language Model for Science - [Arxiv] [QA]
  • Stare at What You See: Masked Image Modeling without Reconstruction - [Arxiv] [QA]
  • A Generalized Framework for Video Instance Segmentation - [Arxiv] [QA]
  • Addressing the issue of stochastic environments and local decision-making in multi-objective reinforcement learning - [Arxiv] [QA]
  • Consistent Direct Time-of-Flight Video Depth Super-Resolution - [Arxiv] [QA]
  • R-Pred: Two-Stage Motion Prediction Via Tube-Query Attention-Based Trajectory Refinement - [Arxiv] [QA]
  • PromptCap: Prompt-Guided Task-Aware Image Captioning - [Arxiv] [QA]
  • Versatile Diffusion: Text, Images and Variations All in One Diffusion Model - [Arxiv] [QA]
  • Is Style All You Need? Dependencies Between Emotion and GST-based Speaker Recognition - [Arxiv] [QA]
  • Uncertainty-aware Gait Recognition via Learning from Dirichlet Distribution-based Evidence - [Arxiv] [QA]
  • Teaching Algorithmic Reasoning via In-context Learning - [Arxiv] [QA]
  • DINER: Disorder-Invariant Implicit Neural Representation - [Arxiv] [QA]
  • EVA: Exploring the Limits of Masked Visual Representation Learning at Scale - [Arxiv] [QA]
  • Follow the Wisdom of the Crowd: Effective Text Generation via Minimum Bayes Risk Decoding - [Arxiv] [QA]
  • Latent-NeRF for Shape-Guided Generation of 3D Shapes and Textures - [Arxiv] [QA]
  • Imagination is All You Need! Curved Contrastive Learning for Abstract Sequence Modeling Utilized on Long Short-Term Dialogue Planning - [Arxiv] [QA]
  • PKCAM: Previous Knowledge Channel Attention Module - [Arxiv] [QA]
  • MLIC: Multi-Reference Entropy Model for Learned Image Compression - [Arxiv] [QA]
  • Fcaformer: Forward Cross Attention in Hybrid Vision Transformer - [Arxiv] [QA]
  • ParCNetV2: Oversized Kernel with Enhanced Attention - [Arxiv] [QA]
  • Joint Data Deepening-and-Prefetching for Energy-Efficient Edge Learning - [Arxiv] [QA]
  • BiViT: Extremely Compressed Binary Vision Transformer - [Arxiv] [QA]
  • OverFlow: Putting flows on top of neural transducers for better TTS - [Arxiv] [QA]
  • VGFlow: Visibility guided Flow Network for Human Reposing - [Arxiv] [QA]
  • Residual Degradation Learning Unfolding Framework with Mixing Priors across Spectral and Spatial for Compressive Spectral Imaging - [Arxiv] [QA]
  • SCOTCH and SODA: A Transformer Video Shadow Detection Framework - [Arxiv] [QA]
  • Large Language Models Meet Harry Potter: A Bilingual Dataset for Aligning Dialogue Agents with Characters - [Arxiv] [QA]
  • CXTrack: Improving 3D Point Cloud Tracking with Contextual Information - [Arxiv] [QA]
  • MARLIN: Masked Autoencoder for facial video Representation LearnINg - [Arxiv] [QA]
  • OpenGait: Revisiting Gait Recognition Toward Better Practicality - [Arxiv] [QA]
  • Rewards Encoding Environment Dynamics Improves Preference-based Reinforcement Learning - [Arxiv] [QA]
  • Probabilistic Debiasing of Scene Graphs - [Arxiv] [QA]
  • Phase-Shifting Coder: Predicting Accurate Orientation in Oriented Object Detection - [Arxiv] [QA]
  • Masked Contrastive Representation Learning - [Arxiv] [QA]
  • Delay Embedded Echo-State Network: A Predictor for Partially Observed Systems - [Arxiv] [QA]
  • High-Quality Entity Segmentation - [Arxiv] [QA]
  • OneFormer: One Transformer to Rule Universal Image Segmentation - [Arxiv] [QA]
  • MMDialog: A Large-scale Multi-turn Dialogue Dataset Towards Multi-modal Open-domain Conversation - [Arxiv] [QA]
  • Secure Aggregation Is Not All You Need: Mitigating Privacy Attacks with Noise Tolerance in Federated Learning - [Arxiv] [QA]
  • GAPartNet: Cross-Category Domain-Generalizable Object Perception and Manipulation via Generalizable and Actionable Parts - [Arxiv] [QA]
  • Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models - [Arxiv] [QA]
  • BLOOM: A 176B-Parameter Open-Access Multilingual Language Model - [Arxiv] [QA]
  • NoiSER: Noise is All You Need for Low-Light Image Enhancement - [Arxiv] [QA]
  • Cross-Attention is all you need: Real-Time Streaming Transformers for Personalised Speech Enhancement - [Arxiv] [QA]
  • Self-conditioned Embedding Diffusion for Text Generation - [Arxiv] [QA]
  • $BT^2$: Backward-compatible Training with Basis Transformation - [Arxiv] [QA]
  • Common Pets in 3D: Dynamic New-View Synthesis of Real-Life Deformable Categories - [Arxiv] [QA]
  • A Unified Pyramid Recurrent Network for Video Frame Interpolation - [Arxiv] [QA]
  • Rickrolling the Artist: Injecting Backdoors into Text Encoders for Text-to-Image Synthesis - [Arxiv] [QA]
  • Large Language Models Are Human-Level Prompt Engineers - [Arxiv] [QA]
  • Crosslingual Generalization through Multitask Finetuning - [Arxiv] [QA]
  • Progressive Transformation Learning for Leveraging Virtual Images in Training - [Arxiv] [QA]
  • PINTO: Faithful Language Reasoning Using Prompt-Generated Rationales - [Arxiv] [QA]
  • eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers - [Arxiv] [QA]
  • The Enemy of My Enemy is My Friend: Exploring Inverse Adversaries for Improving Adversarial Training - [Arxiv] [QA]
  • The Perils of Learning From Unlabeled Data: Backdoor Attacks on Semi-supervised Learning - [Arxiv] [QA]
  • CARE: Causality Reasoning for Empathetic Responses by Conditional Graph Generation - [Arxiv] [QA]

October 2022

  • SSD-LM: Semi-autoregressive Simplex-based Diffusion Language Model for Text Generation and Modular Control - [Arxiv] [QA]
  • GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers - [Arxiv] [QA]
  • DanZero: Mastering GuanDan Game with Reinforcement Learning - [Arxiv] [QA]
  • DiffusER: Discrete Diffusion via Edit-based Reconstruction - [Arxiv] [QA]
  • A simple, efficient and scalable contrastive masked autoencoder for learning visual representations - [Arxiv] [QA]
  • Temporal-Viewpoint Transportation Plan for Skeletal Few-shot Action Recognition - [Arxiv] [QA]
  • Saliency Can Be All You Need In Contrastive Self-Supervised Learning - [Arxiv] [QA]
  • STPrompt: Semantic-guided and Task-driven prompts for Effective Few-shot Classification - [Arxiv] [QA]
  • Rawgment: Noise-Accounted RAW Augmentation Enables Recognition in a Wide Variety of Environments - [Arxiv] [QA]
  • Vox-Fusion: Dense Tracking and Mapping with Voxel-based Neural Implicit Representation - [Arxiv] [QA]
  • ACES: Translation Accuracy Challenge Sets for Evaluating Machine Translation Metrics - [Arxiv] [QA]
  • Working Alliance Transformer for Psychotherapy Dialogue Classification - [Arxiv] [QA]
  • FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion - [Arxiv] [QA]
  • Contrastive Decoding: Open-ended Text Generation as Optimization - [Arxiv] [QA]
  • Streaming Radiance Fields for 3D Video Synthesis - [Arxiv] [QA]
  • Implicit Identity Leakage: The Stumbling Block to Improving Deepfake Detection Generalization - [Arxiv] [QA]
  • Contrastive Search Is What You Need For Neural Text Generation - [Arxiv] [QA]
  • FineD-Eval: Fine-grained Automatic Dialogue-Level Evaluation - [Arxiv] [QA]
  • Towards Robust Recommender Systems via Triple Cooperative Defense - [Arxiv] [QA]
  • Dichotomy of Control: Separating What You Can Control from What You Cannot - [Arxiv] [QA]
  • Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task - [Arxiv] [QA]
  • GlassesGAN: Eyewear Personalization using Synthetic Appearance Discovery and Targeted Subspace Modeling - [Arxiv] [QA]
  • Efficiently Trained Low-Resource Mongolian Text-to-Speech System Based On FullConv-TTS - [Arxiv] [QA]
  • 10 hours data is all you need - [Arxiv] [QA]
  • Overview of Dialogue Robot Competition 2022 - [Arxiv] [QA]
  • DANLI: Deliberative Agent for Following Natural Language Instructions - [Arxiv] [QA]
  • Towards Efficient Dialogue Pre-training with Transferable and Interpretable Latent Structure - [Arxiv] [QA]
  • Collaborative Reasoning on Multi-Modal Semantic Graphs for Video-Grounded Dialogue Generation - [Arxiv] [QA]
  • There Is No Standard Answer: Knowledge-Grounded Dialogue Generation with Adversarial Activated Multi-Reference Learning - [Arxiv] [QA]
  • WikiWhy: Answering and Explaining Cause-and-Effect Questions - [Arxiv] [QA]
  • Large Language Models Can Self-Improve - [Arxiv] [QA]
  • i-MAE: Are Latent Representations in Masked Autoencoders Linearly Separable? - [Arxiv] [QA]
  • Scaling Instruction-Finetuned Language Models - [Arxiv] [QA]
  • On the Feasibility of Cross-Task Transfer with Model-Based Reinforcement Learning - [Arxiv] [QA]
  • Scaling Laws for Reward Model Overoptimization - [Arxiv] [QA]
  • A Unified View of Masked Image Modeling - [Arxiv] [QA]
  • Co-guiding Net: Achieving Mutual Guidances between Multiple Intent Detection and Slot Filling via Heterogeneous Semantics-Label Graphs - [Arxiv] [QA]
  • How to Boost Face Recognition with StyleGAN? - [Arxiv] [QA]
  • Bag All You Need: Learning a Generalizable Bagging Strategy for Heterogeneous Objects - [Arxiv] [QA]
  • Perceptual Grouping in Contrastive Vision-Language Models - [Arxiv] [QA]
  • MotionDeltaCNN: Sparse CNN Inference of Frame Differences in Moving Camera Videos - [Arxiv] [QA]
  • DisCup: Discriminator Cooperative Unlikelihood Prompt-tuning for Controllable Text Generation - [Arxiv] [QA]
  • Non-Contrastive Learning Meets Language-Image Pre-Training - [Arxiv] [QA]
  • Imagic: Text-Based Real Image Editing with Diffusion Models - [Arxiv] [QA]
  • Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them - [Arxiv] [QA]
  • Multi-Agent Automated Machine Learning - [Arxiv] [QA]
  • DiffuSeq: Sequence to Sequence Text Generation with Diffusion Models - [Arxiv] [QA]
  • Keep Me Updated! Memory Management in Long-term Conversations - [Arxiv] [QA]
  • Scratching Visual Transformer's Back with Uniform Attention - [Arxiv] [QA]
  • Data-Efficient Augmentation for Training Neural Networks - [Arxiv] [QA]
  • How Mask Matters: Towards Theoretical Understandings of Masked Autoencoders - [Arxiv] [QA]
  • Is synthetic data from generative models ready for image recognition? - [Arxiv] [QA]
  • DyLoRA: Parameter Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation - [Arxiv] [QA]
  • Visual Reinforcement Learning with Self-Supervised 3D Representations - [Arxiv] [QA]
  • Unified Vision and Language Prompt Learning - [Arxiv] [QA]
  • Visual Classification via Description from Large Language Models - [Arxiv] [QA]
  • Language Models of Code are Few-Shot Commonsense Learners - [Arxiv] [QA]
  • CUF: Continuous Upsampling Filters - [Arxiv] [QA]
  • Retrospectives on the Embodied AI Workshop - [Arxiv] [QA]
  • H2RBox: Horizontal Box Annotation is All You Need for Oriented Object Detection - [Arxiv] [QA]
  • Explanations from Large Language Models Make Small Reasoners Better - [Arxiv] [QA]
  • Large Language Models are few(1)-shot Table Reasoners - [Arxiv] [QA]
  • RankT5: Fine-Tuning T5 for Text Ranking with Ranking Losses - [Arxiv] [QA]
  • Token-Label Alignment for Vision Transformers - [Arxiv] [QA]
  • A Generalist Framework for Panoptic Segmentation of Images and Videos - [Arxiv] [QA]
  • Visual Prompting for Adversarial Robustness - [Arxiv] [QA]
  • Masked Motion Encoding for Self-Supervised Video Representation Learning - [Arxiv] [QA]
  • Multi-Granularity Cross-modal Alignment for Generalized Medical Visual Representation Learning - [Arxiv] [QA]
  • BEV-LaneDet: a Simple and Effective 3D Lane Detection Baseline - [Arxiv] [QA]
  • ACSeg: Adaptive Conceptualization for Unsupervised Semantic Segmentation - [Arxiv] [QA]
  • Habitat-Matterport 3D Semantics Dataset - [Arxiv] [QA]
  • OPERA: Omni-Supervised Representation Learning with Hierarchical Supervisions - [Arxiv] [QA]
  • Mind's Eye: Grounded Language Model Reasoning through Simulation - [Arxiv] [QA]
  • MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model - [Arxiv] [QA]
  • It Takes Two: Masked Appearance-Motion Modeling for Self-supervised Video Transformer Pre-training - [Arxiv] [QA]
  • BoxTeacher: Exploring High-Quality Pseudo Labels for Weakly Supervised Instance Segmentation - [Arxiv] [QA]
  • Multi-Object Navigation with dynamically learned neural implicit representations - [Arxiv] [QA]
  • Certified Training: Small Boxes are All You Need - [Arxiv] [QA]
  • Denoising Masked AutoEncoders Help Robust Classification - [Arxiv] [QA]
  • Iterative Convex Optimization for Model Predictive Control with Discrete-Time High-Order Control Barrier Functions - [Arxiv] [QA]
  • Improving Multi-turn Emotional Support Dialogue Generation with Lookahead Strategy Planning - [Arxiv] [QA]
  • Uncertainty-Aware Unsupervised Image Deblurring with Deep Residual Prior - [Arxiv] [QA]
  • Controllable Dialogue Simulation with In-Context Learning - [Arxiv] [QA]
  • Self-supervised Video Representation Learning with Motion-Aware Masked Autoencoders - [Arxiv] [QA]
  • Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP - [Arxiv] [QA]
  • Don't Lose Yourself! Empathetic Response Generation via Explicit Self-Other Awareness - [Arxiv] [QA]
  • ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering - [Arxiv] [QA]
  • Is margin all you need? An extensive empirical study of active learning on tabular data - [Arxiv] [QA]
  • Large Language Models can Implement Policy Iteration - [Arxiv] [QA]
  • GraspCaps: Capsule Networks Are All You Need for Grasping Familiar Objects - [Arxiv] [QA]
  • Automatic Chain of Thought Prompting in Large Language Models - [Arxiv] [QA]
  • Trans2k: Unlocking the Power of Deep Models for Transparent Object Tracking - [Arxiv] [QA]
  • Measuring and Narrowing the Compositionality Gap in Language Models - [Arxiv] [QA]
  • Critical Learning Periods for Multisensory Integration in Deep Networks - [Arxiv] [QA]
  • A ResNet is All You Need? Modeling A Strong Baseline for Detecting Referable Diabetic Retinopathy in Fundus Images - [Arxiv] [QA]
  • FAST: Improving Controllability for Text Generation with Feedback Aware Self-Training - [Arxiv] [QA]
  • On Distillation of Guided Diffusion Models - [Arxiv] [QA]
  • MaPLe: Multi-modal Prompt Learning - [Arxiv] [QA]
  • CLIP model is an Efficient Continual Learner - [Arxiv] [QA]
  • A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning - [Arxiv] [QA]
  • VIMA: General Robot Manipulation with Multimodal Prompts - [Arxiv] [QA]
  • Iterative Vision-and-Language Navigation - [Arxiv] [QA]
  • Rainier: Reinforced Knowledge Introspector for Commonsense Question Answering - [Arxiv] [QA]
  • Language Models are Multilingual Chain-of-Thought Reasoners - [Arxiv] [QA]
  • ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs - [Arxiv] [QA]
  • A Distributional Lens for Multi-Aspect Controllable Text Generation - [Arxiv] [QA]
  • ReAct: Synergizing Reasoning and Acting in Language Models - [Arxiv] [QA]
  • Depth Is All You Need for Monocular 3D Detection - [Arxiv] [QA]
  • GLM-130B: An Open Bilingual Pre-trained Model - [Arxiv] [QA]
  • Decomposed Prompting: A Modular Approach for Solving Complex Tasks - [Arxiv] [QA]
  • Promising or Elusive? Unsupervised Object Segmentation from Real-world Single Images - [Arxiv] [QA]
  • Imagen Video: High Definition Video Generation with Diffusion Models - [Arxiv] [QA]
  • CorefDiffs: Co-referential and Differential Knowledge Flow in Document Grounded Conversations - [Arxiv] [QA]
  • Teaching Yourself: Graph Self-Distillation on Neighborhood for Node Classification - [Arxiv] [QA]
  • Exploring The Role of Mean Teachers in Self-supervised Masked Auto-Encoders - [Arxiv] [QA]
  • Affection: Learning Affective Explanations for Real-World Visual Data - [Arxiv] [QA]
  • Collecting The Puzzle Pieces: Disentangled Self-Driven Human Pose Transfer by Permuting Textures - [Arxiv] [QA]
  • Group Personalized Federated Learning - [Arxiv] [QA]
  • Centerpoints Are All You Need in Overhead Imagery - [Arxiv] [QA]
  • COPILOT: Human-Environment Collision Prediction and Localization from Egocentric Videos - [Arxiv] [QA]
  • PlaneDepth: Self-supervised Depth Estimation via Orthogonal Planes - [Arxiv] [QA]
  • Knowledge Unlearning for Mitigating Privacy Risks in Language Models - [Arxiv] [QA]
  • Extraneousness-Aware Imitation Learning - [Arxiv] [QA]
  • Recitation-Augmented Language Models - [Arxiv] [QA]
  • Event-based Temporally Dense Optical Flow Estimation with Sequential Learning - [Arxiv] [QA]
  • Is Reinforcement Learning (Not) for Natural Language Processing: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization - [Arxiv] [QA]
  • Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought - [Arxiv] [QA]
  • Masked Spiking Transformer - [Arxiv] [QA]
  • Visual Prompt Tuning for Generative Transfer Learning - [Arxiv] [QA]
  • Membership Inference Attacks Against Text-to-image Generation Models - [Arxiv] [QA]
  • Improving Sample Quality of Diffusion Models Using Self-Attention Guidance - [Arxiv] [QA]
  • Mastering Spatial Graph Prediction of Road Networks - [Arxiv] [QA]
  • Complexity-Based Prompting for Multi-Step Reasoning - [Arxiv] [QA]
  • IntrinsicNeRF: Learning Intrinsic Neural Radiance Fields for Editable Novel View Synthesis - [Arxiv] [QA]
  • "Help Me Help the AI": Understanding How Explainability Can Support Human-AI Interaction - [Arxiv] [QA]
  • Contrastive Audio-Visual Masked Autoencoder - [Arxiv] [QA]
  • NeRF: Neural Radiance Field in 3D Vision, A Comprehensive Review - [Arxiv] [QA]
  • Multimodal Analogical Reasoning over Knowledge Graphs - [Arxiv] [QA]

September 2022

  • Bias Mimicking: A Simple Sampling Approach for Bias Mitigation - [Arxiv] [QA]
  • Combining Efficient and Precise Sign Language Recognition: Good pose estimation library is all you need - [Arxiv] [QA]
  • Sphere-Guided Training of Neural Implicit Surfaces - [Arxiv] [QA]
  • SmallCap: Lightweight Image Captioning Prompted with Retrieval Augmentation - [Arxiv] [QA]
  • Hiding Visual Information via Obfuscating Adversarial Perturbations - [Arxiv] [QA]
  • Learning Transferable Spatiotemporal Representations from Natural Script Knowledge - [Arxiv] [QA]
  • State-specific protein-ligand complex structure prediction with a multi-scale deep generative model - [Arxiv] [QA]
  • Compositional Semantic Parsing with Large Language Models - [Arxiv] [QA]
  • DreamFusion: Text-to-3D using 2D Diffusion - [Arxiv] [QA]
  • EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding - [Arxiv] [QA]
  • Contrastive Unsupervised Learning of World Model with Invariant Causal Features - [Arxiv] [QA]
  • Make-A-Video: Text-to-Video Generation without Text-Video Data - [Arxiv] [QA]
  • Dependent Bayesian Lenses: Categories of Bidirectional Markov Kernels with Canonical Bayesian Inversion - [Arxiv] [QA]
  • Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning - [Arxiv] [QA]
  • Improving alignment of dialogue agents via targeted human judgements - [Arxiv] [QA]
  • Learning State-Aware Visual Representations from Audible Interactions - [Arxiv] [QA]
  • Sentiment is all you need to win US Presidential elections - [Arxiv] [QA]
  • Can Large Language Models Truly Understand Prompts? A Case Study with Negated Prompts - [Arxiv] [QA]
  • Deep Fair Clustering via Maximizing and Minimizing Mutual Information: Theory, Algorithm and Metric - [Arxiv] [QA]
  • Paraphrasing Is All You Need for Novel Object Captioning - [Arxiv] [QA]
  • Generating Formal Safety Assurances for High-Dimensional Reachability - [Arxiv] [QA]
  • Probabilistic Planning with Partially Ordered Preferences over Temporal Goals - [Arxiv] [QA]
  • All are Worth Words: A ViT Backbone for Diffusion Models - [Arxiv] [QA]
  • Promptagator: Few-shot Dense Retrieval From 8 Examples - [Arxiv] [QA]
  • Control Barrier Functions in UGVs for Kinematic Obstacle Avoidance: A Collision Cone Approach - [Arxiv] [QA]
  • ProgPrompt: Generating Situated Robot Task Plans using Large Language Models - [Arxiv] [QA]
  • Pretraining the Vision Transformer using self-supervised methods for vision based Deep Reinforcement Learning - [Arxiv] [QA]
  • Generate rather than Retrieve: Large Language Models are Strong Context Generators - [Arxiv] [QA]
  • Target-Guided Open-Domain Conversation Planning - [Arxiv] [QA]
  • Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering - [Arxiv] [QA]
  • Hierarchical Temporal Transformer for 3D Hand Pose Estimation and Action Recognition from Egocentric RGB Videos - [Arxiv] [QA]
  • Space-time tradeoffs of lenses and optics via higher category theory - [Arxiv] [QA]
  • Audio-Visual Fusion for Emotion Recognition in the Valence-Arousal Space Using Joint Cross-Attention - [Arxiv] [QA]
  • Loc-NeRF: Monte Carlo Localization using Neural Radiance Fields - [Arxiv] [QA]
  • Learning Symbolic Model-Agnostic Loss Functions via Meta-Learning - [Arxiv] [QA]
  • Semantic Segmentation using Neural Ordinary Differential Equations - [Arxiv] [QA]
  • A Benchmark for Understanding and Generating Dialogue between Characters in Stories - [Arxiv] [QA]
  • Revisiting Rolling Shutter Bundle Adjustment: Toward Accurate and Fast Solution - [Arxiv] [QA]
  • Psychologically-informed chain-of-thought prompts for metaphor understanding in large language models - [Arxiv] [QA]
  • CurveFormer: 3D Lane Detection by Curve Propagation with Curve Queries and Attention - [Arxiv] [QA]
  • Spatial-then-Temporal Self-Supervised Learning for Video Correspondence - [Arxiv] [QA]
  • Test-Time Training with Masked Autoencoders - [Arxiv] [QA]
  • Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models - [Arxiv] [QA]
  • A Geometric Perspective on Variational Autoencoders - [Arxiv] [QA]
  • Not As Easy As You Think -- Experiences and Lessons Learnt from Trying to Create a Bottom-Up Visualization Image Typology - [Arxiv] [QA]
  • PaLI: A Jointly-Scaled Multilingual Language-Image Model - [Arxiv] [QA]
  • Certified Robustness to Word Substitution Ranking Attack for Neural Ranking Models - [Arxiv] [QA]
  • Order-Disorder: Imitation Adversarial Attacks for Black-box Neural Ranking Models - [Arxiv] [QA]
  • Hard Negatives or False Negatives: Correcting Pooling Bias in Training Neural Ranking Models - [Arxiv] [QA]
  • Chain of Explanation: New Prompting Method to Generate Higher Quality Natural Language Explanation for Implicit Hate Speech - [Arxiv] [QA]
  • Exploring Target Representations for Masked Autoencoders - [Arxiv] [QA]
  • Developing a multi-variate prediction model for the detection of COVID-19 from Crowd-sourced Respiratory Voice Data - [Arxiv] [QA]
  • Enhancing the Self-Universality for Transferable Targeted Attacks - [Arxiv] [QA]
  • What does a platypus look like? Generating customized prompts for zero-shot image classification - [Arxiv] [QA]
  • MimCo: Masked Image Modeling Pre-training with Contrastive Teacher - [Arxiv] [QA]
  • EnergonAI: An Inference System for 10-100 Billion Parameter Transformer Models - [Arxiv] [QA]
  • Selective Annotation Makes Language Models Better Few-Shot Learners - [Arxiv] [QA]
  • RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection - [Arxiv] [QA]
  • An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling - [Arxiv] [QA]
  • TogetherNet: Bridging Image Restoration and Object Detection Together via Dynamic Enhancement Learning - [Arxiv] [QA]
  • Synthesizing Photorealistic Virtual Humans Through Cross-modal Disentanglement - [Arxiv] [QA]
  • Petals: Collaborative Inference and Fine-tuning of Large Models - [Arxiv] [QA]
  • Visual Prompting via Image Inpainting - [Arxiv] [QA]
  • FLAME: Free-form Language-based Motion Synthesis & Editing - [Arxiv] [QA]

August 2022

  • LANIT: Language-Driven Image-to-Image Translation for Unlabeled Data - [Arxiv] [QA]
  • Rethinking Conversational Recommendations: Is Decision Tree All You Need? - [Arxiv] [QA]
  • Faithful Reasoning Using Large Language Models - [Arxiv] [QA]
  • Benchmark Results for Bookshelf Organization Problem as Mixed Integer Nonlinear Program with Mode Switch and Collision Avoidance - [Arxiv] [QA]
  • TrojViT: Trojan Insertion in Vision Transformers - [Arxiv] [QA]
  • Multi-Outputs Is All You Need For Deblur - [Arxiv] [QA]
  • MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining - [Arxiv] [QA]
  • Masked Autoencoders Enable Efficient Knowledge Distillers - [Arxiv] [QA]
  • DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation - [Arxiv] [QA]
  • Understanding Diffusion Models: A Unified Perspective - [Arxiv] [QA]
  • Towards Efficient Use of Multi-Scale Features in Transformer-Based Object Detectors - [Arxiv] [QA]
  • Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned - [Arxiv] [QA]
  • Improving Personality Consistency in Conversation by Persona Extending - [Arxiv] [QA]
  • Extending nnU-Net is all you need - [Arxiv] [QA]
  • Hierarchically Decomposed Graph Convolutional Networks for Skeleton-Based Action Recognition - [Arxiv] [QA]
  • Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks - [Arxiv] [QA]
  • Are disentangled representations all you need to build speaker anonymization systems? - [Arxiv] [QA]
  • Scattered or Connected? An Optimized Parameter-efficient Tuning Approach for Information Retrieval - [Arxiv] [QA]
  • A Contrastive Pre-training Approach to Learn Discriminative Autoencoder for Dense Retrieval - [Arxiv] [QA]
  • Label-Noise Learning with Intrinsically Long-Tailed Data - [Arxiv] [QA]
  • DenseShift: Towards Accurate and Efficient Low-Bit Power-of-Two Quantization - [Arxiv] [QA]
  • SAFARI: Versatile and Efficient Evaluations for Robustness of Interpretability - [Arxiv] [QA]
  • Pseudo-Labels Are All You Need - [Arxiv] [QA]
  • CASE: Aligning Coarse-to-Fine Cognition and Affection for Empathetic Response Generation - [Arxiv] [QA]
  • Differentiable Architecture Search with Random Features - [Arxiv] [QA]
  • GraVoS: Voxel Selection for 3D Point-Cloud Detection - [Arxiv] [QA]
  • Towards Open-vocabulary Scene Graph Generation with Prompt-based Finetuning - [Arxiv] [QA]
  • Significance of Skeleton-based Features in Virtual Try-On - [Arxiv] [QA]
  • CorpusBrain: Pre-train a Generative Retrieval Model for Knowledge-Intensive Language Tasks - [Arxiv] [QA]
  • Semi-Supervised Video Inpainting with Cycle Consistency Constraints - [Arxiv] [QA]
  • Long-Short History of Gradients is All You Need: Detecting Malicious and Unreliable Clients in Federated Learning - [Arxiv] [QA]
  • Dropout is NOT All You Need to Prevent Gradient Leakage - [Arxiv] [QA]
  • MILAN: Masked Image Pretraining on Language Assisted Representation - [Arxiv] [QA]
  • PSUMNet: Unified Modality Part Streams are All You Need for Efficient Pose-based Action Recognition - [Arxiv] [QA]
  • Assessing the Unitary RNN as an End-to-End Compositional Model of Syntax - [Arxiv] [QA]
  • Safety and Performance, Why not Both? Bi-Objective Optimized Model Compression toward AI Software Deployment - [Arxiv] [QA]
  • Generative Action Description Prompts for Skeleton-based Action Recognition - [Arxiv] [QA]
  • Understanding Masked Image Modeling via Learning Occlusion Invariant Feature - [Arxiv] [QA]
  • Follow Me: Conversation Planning for Target-driven Recommendation Dialogue Systems - [Arxiv] [QA]
  • Atlas: Few-shot Learning with Retrieval Augmented Language Models - [Arxiv] [QA]
  • BlenderBot 3: a deployed conversational agent that continually learns to responsibly engage - [Arxiv] [QA]
  • PointConvFormer: Revenge of the Point-based Convolution - [Arxiv] [QA]
  • DropKey - [Arxiv] [QA]
  • Prompt Tuning for Generative Multimodal Pretrained Models - [Arxiv] [QA]
  • Masked Vision and Language Modeling for Multi-modal Representation Learning - [Arxiv] [QA]
  • Detecting Multivariate Time Series Anomalies with Zero Known Label - [Arxiv] [QA]
  • Character Generation through Self-Supervised Vectorization - [Arxiv] [QA]
  • Prompt-to-Prompt Image Editing with Cross Attention Control - [Arxiv] [QA]
  • An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion - [Arxiv] [QA]
  • ViP3D: End-to-end Visual Trajectory Prediction via 3D Agent Queries - [Arxiv] [QA]
  • Reduction Rules and ILP Are All You Need: Minimal Directed Feedback Vertex Set - [Arxiv] [QA]
  • OmniCity: Omnipotent City Understanding with Multi-level and Multi-view Images - [Arxiv] [QA]
  • Neural network layers as parametric spans - [Arxiv] [QA]
  • Generative Bias for Robust Visual Question Answering - [Arxiv] [QA]
  • Composable Text Controls in Latent Space with ODEs - [Arxiv] [QA]
  • Search for or Navigate to? Dual Adaptive Thinking for Object Navigation - [Arxiv] [QA]

July 2022

  • SdAE: Self-distillated Masked Autoencoder - [Arxiv] [QA]
  • Less is More: Consistent Video Depth Estimation with Masked Frames Modeling - [Arxiv] [QA]
  • MobileNeRF: Exploiting the Polygon Rasterization Pipeline for Efficient Neural Field Rendering on Mobile Architectures - [Arxiv] [QA]
  • A Survey on Masked Autoencoder for Self-supervised Learning in Vision and Beyond - [Arxiv] [QA]
  • Language Models Can Teach Themselves to Program Better - [Arxiv] [QA]
  • Visual Recognition by Request - [Arxiv] [QA]
  • Contrastive Masked Autoencoders are Stronger Vision Learners - [Arxiv] [QA]
  • Group DETR: Fast DETR Training with Group-Wise One-to-Many Assignment - [Arxiv] [QA]
  • DETRs with Hybrid Matching - [Arxiv] [QA]
  • Visual correspondence-based explanations improve AI robustness and human-AI team accuracy - [Arxiv] [QA]
  • Dynamic Planning in Open-Ended Dialogue using Reinforcement Learning - [Arxiv] [QA]
  • Is GPT-3 all you need for Visual Question Answering in Cultural Heritage? - [Arxiv] [QA]
  • Neural Generation Meets Real People: Building a Social, Informative Open-Domain Dialogue Agent - [Arxiv] [QA]
  • All you need for horizontal slicing in 5G network - [Arxiv] [QA]
  • Divide and Conquer: 3D Point Cloud Instance Segmentation With Point-Wise Binarization - [Arxiv] [QA]
  • Adaptive Soft Contrastive Learning - [Arxiv] [QA]
  • Omni3D: A Large Benchmark and Model for 3D Object Detection in the Wild - [Arxiv] [QA]
  • Language Model Cascades - [Arxiv] [QA]
  • Tailoring Self-Supervision for Supervised Learning - [Arxiv] [QA]
  • GRIT: Faster and Better Image captioning Transformer Using Dual Visual Features - [Arxiv] [QA]
  • FedDM: Iterative Distribution Matching for Communication-Efficient Federated Learning - [Arxiv] [QA]
  • Overlooked factors in concept-based explanations: Dataset choice, concept learnability, and human capability - [Arxiv] [QA]
  • Geometric Features Informed Multi-person Human-object Interaction Recognition in Videos - [Arxiv] [QA]
  • Consistent Query Answering for Expressive Constraints under Tuple-Deletion Semantics - [Arxiv] [QA]
  • FedX: Unsupervised Federated Learning with Cross Knowledge Distillation - [Arxiv] [QA]
  • Label2Label: A Language Modeling Framework for Multi-Attribute Learning - [Arxiv] [QA]
  • Class-incremental Novel Class Discovery - [Arxiv] [QA]
  • UniFusion: Unified Multi-view Fusion Transformer for Spatial-Temporal Representation in Bird's-Eye-View - [Arxiv] [QA]
  • Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding - [Arxiv] [QA]
  • Adaptive Assignment for Geometry Aware Local Feature Matching - [Arxiv] [QA]
  • SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery - [Arxiv] [QA]
  • Knowledge Guided Bidirectional Attention Network for Human-Object Interaction Detection - [Arxiv] [QA]
  • Clover: Towards A Unified Video-Language Alignment and Fusion Model - [Arxiv] [QA]
  • Position Prediction as an Effective Pretraining Strategy - [Arxiv] [QA]
  • Bootstrapped Masked Autoencoders for Vision BERT Pretraining - [Arxiv] [QA]
  • Language models show human-like content effects on reasoning - [Arxiv] [QA]
  • Masked Autoencoders that Listen - [Arxiv] [QA]
  • PointNorm: Dual Normalization is All You Need for Point Cloud Analysis - [Arxiv] [QA]
  • Look-ups are not (yet) all you need for deep learning inference - [Arxiv] [QA]
  • A Data-Based Perspective on Transfer Learning - [Arxiv] [QA]
  • Inner Monologue: Embodied Reasoning through Planning with Language Models - [Arxiv] [QA]
  • Towards Hard-Positive Query Mining for DETR-based Human-Object Interaction Detection - [Arxiv] [QA]
  • Bootstrapping a User-Centered Task-Oriented Dialogue System - [Arxiv] [QA]
  • A Skeleton-aware Graph Convolutional Network for Human-Object Interaction Detection - [Arxiv] [QA]
  • LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action - [Arxiv] [QA]
  • Training Transformers Together - [Arxiv] [QA]
  • Back to the Source: Diffusion-Driven Test-Time Adaptation - [Arxiv] [QA]
  • YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors - [Arxiv] [QA]
  • Chairs Can be Stood on: Overcoming Object Bias in Human-Object Interaction Detection - [Arxiv] [QA]
  • Is a PET all you need? A multi-modal study for Alzheimer's disease using 3D CNNs - [Arxiv] [QA]
  • Best Subset Selection with Efficient Primal-Dual Algorithm - [Arxiv] [QA]
  • Distance Matters in Human-Object Interaction Detection - [Arxiv] [QA]
  • Domain-Independent Deception: Definition, Taxonomy and the Linguistic Cues Debate - [Arxiv] [QA]
  • Beyond mAP: Towards better evaluation of instance segmentation - [Arxiv] [QA]
  • PVO: Panoptic Visual Odometry - [Arxiv] [QA]
  • Explicit Boundary Guided Semi-Push-Pull Contrastive Learning for Supervised Anomaly Detection - [Arxiv] [QA]
  • I-ViT: Integer-only Quantization for Efficient Vision Transformer Inference - [Arxiv] [QA]
  • WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents - [Arxiv] [QA]
  • Towards Robust Referring Video Object Segmentation with Cyclic Relational Consensus - [Arxiv] [QA]
  • Computer-assisted Pronunciation Training -- Speech synthesis is almost all you need - [Arxiv] [QA]
  • Rationale-Augmented Ensembles in Language Models - [Arxiv] [QA]

June 2022

  • LaserMix for Semi-Supervised LiDAR Semantic Segmentation - [Arxiv] [QA]
  • On-Device Training Under 256KB Memory - [Arxiv] [QA]
  • Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations - [Arxiv] [QA]
  • Automatically Balancing Model Accuracy and Complexity using Solution and Fitness Evolution (SAFE) - [Arxiv] [QA]
  • UniDAformer: Unified Domain Adaptive Panoptic Segmentation Transformer via Hierarchical Mask Calibration - [Arxiv] [QA]
  • Solving Quantitative Reasoning Problems with Language Models - [Arxiv] [QA]
  • BertNet: Harvesting Knowledge Graphs with Arbitrary Relations from Pretrained Language Models - [Arxiv] [QA]
  • Solution and Fitness Evolution (SAFE): A Study of Multiobjective Problems - [Arxiv] [QA]
  • Solution and Fitness Evolution (SAFE): Coevolving Solutions and Their Objective Functions - [Arxiv] [QA]
  • CV 3315 Is All You Need : Semantic Segmentation Competition - [Arxiv] [QA]
  • ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings - [Arxiv] [QA]
  • Diegetic Representation of Feedback in Open Games - [Arxiv] [QA]
  • Temporal Attention Unit: Towards Efficient Spatiotemporal Predictive Learning - [Arxiv] [QA]
  • zPROBE: Zero Peek Robustness Checks for Federated Learning - [Arxiv] [QA]
  • Task-Adaptive Few-shot Node Classification - [Arxiv] [QA]
  • EventNeRF: Neural Radiance Fields from a Single Colour Event Camera - [Arxiv] [QA]
  • MaskViT: Masked Visual Pre-Training for Video Prediction - [Arxiv] [QA]
  • Rethinking Surgical Instrument Segmentation: A Background Image Can Be All You Need - [Arxiv] [QA]
  • CLAMP: Prompt-based Contrastive Learning for Connecting Language and Animal Pose - [Arxiv] [QA]
  • Invariant Causal Mechanisms through Distribution Matching - [Arxiv] [QA]
  • GODEL: Large-Scale Pre-Training for Goal-Directed Dialog - [Arxiv] [QA]
  • KiloNeuS: A Versatile Neural Implicit Surface Representation for Real-Time Rendering - [Arxiv] [QA]
  • Questions Are All You Need to Train a Dense Passage Retriever - [Arxiv] [QA]
  • LargeKernel3D: Scaling up Kernels in 3D Sparse CNNs - [Arxiv] [QA]
  • Marginal Tail-Adaptive Normalizing Flows - [Arxiv] [QA]
  • SemMAE: Semantic-Guided Masking for Learning Masked Autoencoders - [Arxiv] [QA]
  • Occupancy-MAE: Self-supervised Pre-training Large-scale LiDAR Point Clouds with Masked Occupancy Autoencoders - [Arxiv] [QA]
  • DualCoOp: Fast Adaptation to Multi-Label Recognition with Limited Annotations - [Arxiv] [QA]
  • All you need is feedback: Communication with block attention feedback codes - [Arxiv] [QA]
  • Gender Artifacts in Visual Datasets - [Arxiv] [QA]
  • Landscape Learning for Neural Network Inversion - [Arxiv] [QA]
  • MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge - [Arxiv] [QA]
  • Sheaf Neural Networks with Connection Laplacians - [Arxiv] [QA]
  • PRANC: Pseudo RAndom Networks for Compacting deep models - [Arxiv] [QA]
  • OmniMAE: Single Model Masked Pretraining on Images and Videos - [Arxiv] [QA]
  • Switchable Representation Learning Framework with Self-compatibility - [Arxiv] [QA]
  • Zero-Shot Video Question Answering via Frozen Bidirectional Language Models - [Arxiv] [QA]
  • Balancing Discriminability and Transferability for Source-Free Domain Adaptation - [Arxiv] [QA]
  • Architectural Backdoors in Neural Networks - [Arxiv] [QA]
  • Masked Frequency Modeling for Self-Supervised Visual Pre-Training - [Arxiv] [QA]
  • Masked Siamese ConvNets - [Arxiv] [QA]
  • Structured Sparsity Learning for Efficient Video Super-Resolution - [Arxiv] [QA]
  • Emergent Abilities of Large Language Models - [Arxiv] [QA]
  • A smile is all you need: Predicting limiting activity coefficients from SMILES with natural language processing - [Arxiv] [QA]
  • GRAM-HD: 3D-Consistent Image Generation at High Resolution with Generative Radiance Manifolds - [Arxiv] [QA]
  • Proximal Splitting Adversarial Attacks for Semantic Segmentation - [Arxiv] [QA]
  • LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling - [Arxiv] [QA]
  • Confidence Score for Source-Free Unsupervised Domain Adaptation - [Arxiv] [QA]
  • Transformers are Meta-Reinforcement Learners - [Arxiv] [QA]
  • Look, Radiate, and Learn: Self-Supervised Localisation via Radio-Visual Correspondence - [Arxiv] [QA]
  • Language Models are General-Purpose Interfaces - [Arxiv] [QA]
  • Mining Multi-Label Samples from Single Positive Labels - [Arxiv] [QA]
  • Building a Personalized Dialogue System with Prompt-Tuning - [Arxiv] [QA]
  • Balanced Product of Calibrated Experts for Long-Tailed Recognition - [Arxiv] [QA]
  • Referring Image Matting - [Arxiv] [QA]
  • Masked Autoencoders are Robust Data Augmentors - [Arxiv] [QA]
  • Neural Prompt Search - [Arxiv] [QA]
  • Extreme Masking for Learning Instance and Distributed Visual Representations - [Arxiv] [QA]
  • On Data Scaling in Masked Image Modeling - [Arxiv] [QA]
  • Simple Cues Lead to a Strong Multi-Object Tracker - [Arxiv] [QA]
  • Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models - [Arxiv] [QA]
  • Spatial-temporal Concept based Explanation of 3D ConvNets - [Arxiv] [QA]
  • Words are all you need? Language as an approximation for human similarity judgments - [Arxiv] [QA]
  • MobileOne: An Improved One millisecond Mobile Backbone - [Arxiv] [QA]
  • Towards Understanding Why Mask-Reconstruction Pretraining Helps in Downstream Tasks - [Arxiv] [QA]
  • Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding - [Arxiv] [QA]
  • Unsupervised Context Aware Sentence Representation Pretraining for Multi-lingual Dense Retrieval - [Arxiv] [QA]
  • Spatial Parsing and Dynamic Temporal Pooling networks for Human-Object Interaction detection - [Arxiv] [QA]
  • TriBYOL: Triplet BYOL for Self-Supervised Representation Learning - [Arxiv] [QA]
  • Self-Knowledge Distillation based Self-Supervised Learning for Covid-19 Detection from Chest X-Ray Images - [Arxiv] [QA]
  • Self-supervised Learning for Human Activity Recognition Using 700,000 Person-days of Wearable Data - [Arxiv] [QA]
  • Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation - [Arxiv] [QA]
  • A Neural Corpus Indexer for Document Retrieval - [Arxiv] [QA]
  • Revisiting Realistic Test-Time Training: Sequential Inference and Adaptation by Anchored Clustering - [Arxiv] [QA]
  • Is More Data All You Need? A Causal Exploration - [Arxiv] [QA]
  • Learning to Break the Loop: Analyzing and Mitigating Repetitions for Neural Text Generation - [Arxiv] [QA]
  • Making Large Language Models Better Reasoners with Step-Aware Verifier - [Arxiv] [QA]
  • PIDNet: A Real-time Semantic Segmentation Network Inspired by PID Controllers - [Arxiv] [QA]
  • Delving into the Openness of CLIP - [Arxiv] [QA]
  • Video-based Human-Object Interaction Detection from Tubelet Tokens - [Arxiv] [QA]
  • Revisiting the "Video" in Video-Language Understanding - [Arxiv] [QA]
  • PROMISSING: Pruning Missing Values in Neural Networks - [Arxiv] [QA]
  • A Survey on Computationally Efficient Neural Architecture Search - [Arxiv] [QA]
  • Learning Probabilistic Topological Representations Using Discrete Morse Theory - [Arxiv] [QA]
  • MultiHiertt: Numerical Reasoning over Multi Hierarchical Tabular and Textual Data - [Arxiv] [QA]
  • PETRv2: A Unified Framework for 3D Perception from Multi-Camera Images - [Arxiv] [QA]
  • Siamese Image Modeling for Self-Supervised Vision Representation Learning - [Arxiv] [QA]
  • Multi-View Active Fine-Grained Recognition - [Arxiv] [QA]
  • Prefix Conditioning Unifies Language and Label Supervision - [Arxiv] [QA]
  • Unified Recurrence Modeling for Video Action Anticipation - [Arxiv] [QA]
  • NIPQ: Noise proxy-based Integrated Pseudo-Quantization - [Arxiv] [QA]
  • Efficient Self-supervised Vision Pretraining with Local Masked Reconstruction - [Arxiv] [QA]
  • ORC: Network Group-based Knowledge Distillation using Online Role Change - [Arxiv] [QA]
  • MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining - [Arxiv] [QA]

May 2022

  • Evolving Domain Generalization - [Arxiv] [QA]
  • itKD: Interchange Transfer-based Knowledge Distillation for 3D Object Detection - [Arxiv] [QA]
  • FeatER: An Efficient Network for Human Reconstruction via Feature Map-Based TransformER - [Arxiv] [QA]
  • Non-Markovian Reward Modelling from Trajectory Labels via Interpretable Multiple Instance Learning - [Arxiv] [QA]
  • Self-Supervised Visual Representation Learning with Semantic Grouping - [Arxiv] [QA]
  • GMML is All you Need - [Arxiv] [QA]
  • Knowledge Distillation for 6D Pose Estimation by Aligning Distributions of Local Predictions - [Arxiv] [QA]
  • HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling - [Arxiv] [QA]
  • FRAug: Tackling Federated Learning with Non-IID Features via Representation Augmentation - [Arxiv] [QA]
  • Robust Weight Perturbation for Adversarial Training - [Arxiv] [QA]
  • EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction - [Arxiv] [QA]
  • CPED: A Large-Scale Chinese Personalized and Emotional Dialogue Dataset for Conversational AI - [Arxiv] [QA]
  • CoNT: Contrastive Neural Text Generation - [Arxiv] [QA]
  • Frustratingly Easy Regularization on Representation Can Boost Deep Reinforcement Learning - [Arxiv] [QA]
  • SupMAE: Supervised Masked Autoencoders Are Efficient Vision Learners - [Arxiv] [QA]
  • Additive Higher-Order Factorization Machines - [Arxiv] [QA]
  • A Closer Look at Self-Supervised Lightweight Vision Transformers - [Arxiv] [QA]
  • Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training - [Arxiv] [QA]
  • Object-wise Masked Autoencoders for Fast Pre-training - [Arxiv] [QA]
  • Semi-supervised Semantics-guided Adversarial Training for Trajectory Prediction - [Arxiv] [QA]
  • Controllable Text Generation with Neurally-Decomposed Oracle - [Arxiv] [QA]
  • Diffusion-LM Improves Controllable Text Generation - [Arxiv] [QA]
  • Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation - [Arxiv] [QA]
  • Bayesian Robust Graph Contrastive Learning - [Arxiv] [QA]
  • GIT: A Generative Image-to-text Transformer for Vision and Language - [Arxiv] [QA]
  • Prototype Based Classification from Hierarchy to Fairness - [Arxiv] [QA]
  • Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN - [Arxiv] [QA]
  • Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions - [Arxiv] [QA]
  • Quark: Controllable Text Generation with Reinforced Unlearning - [Arxiv] [QA]
  • Revealing the Dark Secrets of Masked Image Modeling - [Arxiv] [QA]
  • Green Hierarchical Vision Transformer for Masked Image Modeling - [Arxiv] [QA]
  • Physical-World Optical Adversarial Attacks on 3D Face Recognition - [Arxiv] [QA]
  • MixMAE: Mixed and Masked Autoencoder for Efficient Pretraining of Hierarchical Vision Transformers - [Arxiv] [QA]
  • Pretraining is All You Need for Image-to-Image Translation - [Arxiv] [QA]
  • RSTGen: Imbuing Fine-Grained Interpretable Control into Long-FormText Generators - [Arxiv] [QA]
  • Masked Jigsaw Puzzle: A Versatile Position Embedding for Vision Transformers - [Arxiv] [QA]
  • TALM: Tool Augmented Language Models - [Arxiv] [QA]
  • Large Language Models are Zero-Shot Reasoners - [Arxiv] [QA]
  • Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations - [Arxiv] [QA]
  • Learning Context-Aware Service Representation for Service Recommendation in Workflow Composition - [Arxiv] [QA]
  • Decoder Denoising Pretraining for Semantic Segmentation - [Arxiv] [QA]
  • PointDistiller: Structured Knowledge Distillation Towards Efficient and Compact 3D Detection - [Arxiv] [QA]
  • FaceMAE: Privacy-Preserving Face Recognition via Masked Autoencoders - [Arxiv] [QA]
  • GraphMAE: Self-Supervised Masked Graph Autoencoders - [Arxiv] [QA]
  • All You Need Is Logs: Improving Code Completion by Learning from Anonymous IDE Usage Logs - [Arxiv] [QA]
  • Swept-Angle Synthetic Wavelength Interferometry - [Arxiv] [QA]
  • Least-to-Most Prompting Enables Complex Reasoning in Large Language Models - [Arxiv] [QA]
  • A Review of Safe Reinforcement Learning: Methods, Theory and Applications - [Arxiv] [QA]
  • Adaptive Fairness-Aware Online Meta-Learning for Changing Environments - [Arxiv] [QA]
  • Sampling Is All You Need on Modeling Long-Term User Behaviors for CTR Prediction - [Arxiv] [QA]
  • Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality - [Arxiv] [QA]
  • Can Foundation Models Wrangle Your Data? - [Arxiv] [QA]
  • RankGen: Improving Text Generation with Large Ranking Models - [Arxiv] [QA]
  • Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning - [Arxiv] [QA]
  • Masked Image Modeling with Denoising Contrast - [Arxiv] [QA]
  • Integrally Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection - [Arxiv] [QA]
  • Learning Graph Structure from Convolutional Mixtures - [Arxiv] [QA]
  • Target-Guided Dialogue Response Generation Using Commonsense and Data Augmentation - [Arxiv] [QA]
  • Masked Autoencoders As Spatiotemporal Learners - [Arxiv] [QA]
  • Global Contrast Masked Autoencoders Are Powerful Pathological Representation Learners - [Arxiv] [QA]
  • Positional Information is All You Need: A Novel Pipeline for Self-Supervised SVDE from Videos - [Arxiv] [QA]
  • Need is All You Need: Homeostatic Neural Networks Adapt to Concept Shift - [Arxiv] [QA]
  • A CLIP-Hitchhiker's Guide to Long Video Retrieval - [Arxiv] [QA]
  • Robust Losses for Learning Value Functions - [Arxiv] [QA]
  • LogicSolver: Towards Interpretable Math Word Problem Solving with Logical Prompt-enhanced Learning - [Arxiv] [QA]
  • BBDM: Image-to-image Translation with Brownian Bridge Diffusion Models - [Arxiv] [QA]
  • Diffusion Models for Adversarial Purification - [Arxiv] [QA]
  • Long-term Control for Dialogue Generation: Methods and Evaluation - [Arxiv] [QA]
  • Aligning Robot Representations with Humans - [Arxiv] [QA]
  • A Generalist Agent - [Arxiv] [QA]
  • An Empirical Study Of Self-supervised Learning Approaches For Object Detection With Transformers - [Arxiv] [QA]
  • Reduce Information Loss in Transformers for Pluralistic Image Inpainting - [Arxiv] [QA]
  • Learning to Answer Visual Questions from Web Videos - [Arxiv] [QA]
  • When does dough become a bagel? Analyzing the remaining mistakes on ImageNet - [Arxiv] [QA]
  • A for-loop is all you need. For solving the inverse problem in the case of personalized tumor growth modeling - [Arxiv] [QA]
  • Activating More Pixels in Image Super-Resolution Transformer - [Arxiv] [QA]
  • ConvMAE: Masked Convolution Meets Masked Autoencoders - [Arxiv] [QA]
  • Towards a Progression-Aware Autonomous Dialogue Agent - [Arxiv] [QA]
  • The Unreliability of Explanations in Few-shot Prompting for Textual Reasoning - [Arxiv] [QA]
  • Spiking Graph Convolutional Networks - [Arxiv] [QA]
  • A Simple Contrastive Learning Objective for Alleviating Neural Text Degeneration - [Arxiv] [QA]
  • Lexical Knowledge Internalization for Neural Dialog Generation - [Arxiv] [QA]
  • End2End Multi-View Feature Matching with Differentiable Pose Optimization - [Arxiv] [QA]
  • Learning to Transfer Prompts for Text Generation - [Arxiv] [QA]
  • OPT: Open Pre-trained Transformer Language Models - [Arxiv] [QA]

April 2022

  • Building a Role Specified Open-Domain Dialogue System Leveraging Large-Scale Language Models - [Arxiv] [QA]
  • SVTR: Scene Text Recognition with a Single Visual Model - [Arxiv] [QA]
  • Flamingo: a Visual Language Model for Few-Shot Learning - [Arxiv] [QA]
  • ARCTIC: A Dataset for Dexterous Bimanual Hand-Object Manipulation - [Arxiv] [QA]
  • The Wisdom of Crowds: Temporal Progressive Attention for Early Action Prediction - [Arxiv] [QA]
  • Power Bundle Adjustment for Large-Scale 3D Reconstruction - [Arxiv] [QA]
  • Masked Spectrogram Prediction For Self-Supervised Audio Pre-Training - [Arxiv] [QA]
  • Control Globally, Understand Locally: A Global-to-Local Hierarchical Graph Network for Emotional Support Conversation - [Arxiv] [QA]
  • MM-TTA: Multi-Modal Test-Time Adaptation for 3D Semantic Segmentation - [Arxiv] [QA]
  • PolyLoss: A Polynomial Expansion Perspective of Classification Loss Functions - [Arxiv] [QA]
  • MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval - [Arxiv] [QA]
  • SceneTrilogy: On Human Scene-Sketch and its Complementarity with Photo and Text - [Arxiv] [QA]
  • Masked Image Modeling Advances 3D Medical Image Analysis - [Arxiv] [QA]
  • LoL: A Comparative Regularization Loss over Query Reformulation Losses for Pseudo-Relevance Feedback - [Arxiv] [QA]
  • Evaluating Interpolation and Extrapolation Performance of Neural Retrieval Models - [Arxiv] [QA]
  • Simulating Fluids in Real-World Still Images - [Arxiv] [QA]
  • RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning - [Arxiv] [QA]
  • Meet Your Favorite Character: Open-domain Chatbot Mimicking Fictional Characters with only a Few Utterances - [Arxiv] [QA]
  • Pre-train a Discriminative Text Encoder for Dense Retrieval via Contrastive Span Prediction - [Arxiv] [QA]
  • Autoregressive Search Engines: Generating Substrings as Document Identifiers - [Arxiv] [QA]
  • Sharper Utility Bounds for Differentially Private Models - [Arxiv] [QA]
  • Towards Multi-Turn Empathetic Dialogs with Positive Emotion Elicitation - [Arxiv] [QA]
  • Event Transition Planning for Open-ended Text Generation - [Arxiv] [QA]
  • Human-Object Interaction Detection via Disentangled Transformer - [Arxiv] [QA]
  • Visio-Linguistic Brain Encoding - [Arxiv] [QA]
  • The Devil is in the Frequency: Geminated Gestalt Autoencoder for Self-Supervised Visual Pre-Training - [Arxiv] [QA]
  • CPFair: Personalized Consumer and Producer Fairness Re-ranking for Recommender Systems - [Arxiv] [QA]
  • Interactiveness Field in Human-Object Interactions - [Arxiv] [QA]
  • Improving Passage Retrieval with Zero-Shot Question Generation - [Arxiv] [QA]
  • INSTA-BNN: Binary Neural Network with INSTAnce-aware Threshold - [Arxiv] [QA]
  • A Personalized Dialogue Generator with Implicit User Persona Detection - [Arxiv] [QA]
  • LaMemo: Language Modeling with Look-Ahead Memory - [Arxiv] [QA]
  • Measuring Compositional Consistency for Video Question Answering - [Arxiv] [QA]
  • Neighborhood Attention Transformer - [Arxiv] [QA]
  • Masked Siamese Networks for Label-Efficient Learning - [Arxiv] [QA]
  • BEHAVE: Dataset and Method for Tracking Human Object Interactions - [Arxiv] [QA]
  • GPT-NeoX-20B: An Open-Source Autoregressive Language Model - [Arxiv] [QA]
  • Learning Convolutional Neural Networks in the Frequency Domain - [Arxiv] [QA]
  • Transparent Shape from a Single View Polarization Image - [Arxiv] [QA]
  • Neural Topic Modeling of Psychotherapy Sessions - [Arxiv] [QA]
  • MGM: A meshfree geometric multilevel method for systems arising from elliptic equations on point cloud surfaces - [Arxiv] [QA]
  • Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback - [Arxiv] [QA]
  • Bootstrap Motion Forecasting With Self-Consistent Constraints - [Arxiv] [QA]
  • Stylized Knowledge-Grounded Dialogue Generation via Disentangled Template Rewriting - [Arxiv] [QA]
  • Deep Annotation of Therapeutic Working Alliance in Psychotherapy - [Arxiv] [QA]
  • Overlapping Word Removal is All You Need: Revisiting Data Imbalance in Hope Speech Detection - [Arxiv] [QA]
  • Exploring the Universal Vulnerability of Prompt-based Learning Paradigm - [Arxiv] [QA]
  • Focal Length and Object Pose Estimation via Render and Compare - [Arxiv] [QA]
  • Category-Aware Transformer Network for Better Human-Object Interaction Detection - [Arxiv] [QA]
  • Consistency Learning via Decoding Path Augmentation for Transformers in Human Object Interaction Detection - [Arxiv] [QA]
  • DualPrompt: Complementary Prompting for Rehearsal-free Continual Learning - [Arxiv] [QA]
  • Representation Learning by Detecting Incorrect Location Embeddings - [Arxiv] [QA]
  • Learning Trajectory-Aware Transformer for Video Super-Resolution - [Arxiv] [QA]
  • Federated Learning with Partial Model Personalization - [Arxiv] [QA]
  • Unsupervised Prompt Learning for Vision-Language Models - [Arxiv] [QA]
  • Interacting with Non-Cooperative User: A New Paradigm for Proactive Dialogue Policy - [Arxiv] [QA]
  • Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality - [Arxiv] [QA]
  • Knowledge Infused Decoding - [Arxiv] [QA]
  • Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection - [Arxiv] [QA]
  • Towards An End-to-End Framework for Flow-Guided Video Inpainting - [Arxiv] [QA]
  • There Are a Thousand Hamlets in a Thousand People's Eyes: Enhancing Knowledge-grounded Dialogue with Personal Memory - [Arxiv] [QA]
  • Efficient Test-Time Model Adaptation without Forgetting - [Arxiv] [QA]
  • C3KG: A Chinese Commonsense Conversation Knowledge Graph - [Arxiv] [QA]
  • CHORE: Contact, Human and Object REconstruction from a single RGB image - [Arxiv] [QA]
  • Can language models learn from explanations in context? - [Arxiv] [QA]
  • PaLM: Scaling Language Modeling with Pathways - [Arxiv] [QA]
  • At the Locus of Performance: Quantifying the Effects of Copious 3D-Stacked Cache on HPC Workloads - [Arxiv] [QA]
  • $\textit{latent}$-GLAT: Glancing at Latent Variables for Parallel Text Generation - [Arxiv] [QA]
  • Learning Neural Acoustic Fields - [Arxiv] [QA]
  • Do As I Can, Not As I Say: Grounding Language in Robotic Affordances - [Arxiv] [QA]
  • MultiMAE: Multi-modal Multi-task Masked Autoencoders - [Arxiv] [QA]
  • Value Gradient weighted Model-Based Reinforcement Learning - [Arxiv] [QA]
  • PRADA: Practical Black-Box Adversarial Attacks against Neural Ranking Models - [Arxiv] [QA]
  • Probabilistic Implicit Scene Completion - [Arxiv] [QA]
  • Dynamic Focus-aware Positional Queries for Semantic Segmentation - [Arxiv] [QA]
  • What to look at and where: Semantic and Spatial Refined Transformer for detecting human-object interactions - [Arxiv] [QA]
  • Implicit Feedback for Dense Passage Retrieval: A Counterfactual Approach - [Arxiv] [QA]
  • Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language - [Arxiv] [QA]
  • End-to-End Zero-Shot HOI Detection via Vision and Language Knowledge Distillation - [Arxiv] [QA]
  • Distill-VQ: Learning Retrieval Oriented Vector Quantization By Distilling Knowledge from Dense Embeddings - [Arxiv] [QA]

March 2022

  • TransGeo: Transformer Is All You Need for Cross-view Image Geo-localization - [Arxiv] [QA]
  • Exploring Visual Prompts for Adapting Large-Scale Models - [Arxiv] [QA]
  • A 23 MW data centre is all you need - [Arxiv] [QA]
  • R2L: Distilling Neural Radiance Field to Neural Light Field for Efficient Novel View Synthesis - [Arxiv] [QA]
  • Templates for 3D Object Pose Estimation Revisited: Generalization to New Objects and Robustness to Occlusions - [Arxiv] [QA]
  • SimVQA: Exploring Simulated Environments for Visual Question Answering - [Arxiv] [QA]
  • Self-distillation Augmented Masked Autoencoders for Histopathological Image Classification - [Arxiv] [QA]
  • MAE-AST: Masked Autoencoding Audio Spectrogram Transformer - [Arxiv] [QA]
  • PromptDet: Towards Open-vocabulary Detection using Uncurated Images - [Arxiv] [QA]
  • A Sequential Quadratic Programming Approach to the Solution of Open-Loop Generalized Nash Equilibria - [Arxiv] [QA]
  • Instance Relation Graph Guided Source-Free Domain Adaptive Object Detection - [Arxiv] [QA]
  • Causal de Finetti: On the Identification of Invariant Causal Structure in Exchangeable Data - [Arxiv] [QA]
  • Training Compute-Optimal Large Language Models - [Arxiv] [QA]
  • Graph Neural Networks are Dynamic Programmers - [Arxiv] [QA]
  • mc-BEiT: Multi-choice Discretization for Image BERT Pre-training - [Arxiv] [QA]
  • MAT: Mask-Aware Transformer for Large Hole Image Inpainting - [Arxiv] [QA]
  • Parameter-efficient Model Adaptation for Vision Transformers - [Arxiv] [QA]
  • Generalizing Few-Shot NAS with Gradient Matching - [Arxiv] [QA]
  • Neural Vocoder is All You Need for Speech Super-resolution - [Arxiv] [QA]
  • Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model - [Arxiv] [QA]
  • A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition - [Arxiv] [QA]
  • MSTR: Multi-Scale Transformer for End-to-End Human-Object Interaction Detection - [Arxiv] [QA]
  • ImFace: A Nonlinear 3D Morphable Face Model with Implicit Neural Representations - [Arxiv] [QA]
  • STaR: Bootstrapping Reasoning With Reasoning - [Arxiv] [QA]
  • UV Volumes for Real-time Rendering of Editable Free-view Human Performance - [Arxiv] [QA]
  • Discovering Human-Object Interaction Concepts via Self-Compositional Learning - [Arxiv] [QA]
  • How Severe is Benchmark-Sensitivity in Video Self-Supervised Learning? - [Arxiv] [QA]
  • GEN-VLKT: Simplify Association and Enhance Interaction Understanding for HOI Detection - [Arxiv] [QA]
  • AutoML for Deep Recommender Systems: A Survey - [Arxiv] [QA]
  • Spectral Measurement Sparsification for Pose-Graph SLAM - [Arxiv] [QA]
  • Continual Test-Time Domain Adaptation - [Arxiv] [QA]
  • MISC: A MIxed Strategy-Aware Model Integrating COMET for Emotional Support Conversation - [Arxiv] [QA]
  • A Comparative Survey of Deep Active Learning - [Arxiv] [QA]
  • Linking Emergent and Natural Languages via Corpus Transfer - [Arxiv] [QA]
  • MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection - [Arxiv] [QA]
  • What to Hide from Your Students: Attention-Guided Masked Image Modeling - [Arxiv] [QA]
  • VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training - [Arxiv] [QA]
  • Pathways: Asynchronous Distributed Dataflow for ML - [Arxiv] [QA]
  • Ev-TTA: Test-Time Adaptation for Event-Based Object Recognition - [Arxiv] [QA]
  • Deep Frequency Filtering for Domain Generalization - [Arxiv] [QA]
  • Visual Prompt Tuning - [Arxiv] [QA]
  • Self-supervision through Random Segments with Autoregressive Coding (RandSAC) - [Arxiv] [QA]
  • Language modeling via stochastic processes - [Arxiv] [QA]
  • Masked Discrimination for Self-Supervised Learning on Point Clouds - [Arxiv] [QA]
  • Self-Consistency Improves Chain of Thought Reasoning in Language Models - [Arxiv] [QA]
  • The Conceptual VAE - [Arxiv] [QA]
  • Teaching language models to support answers with verified quotes - [Arxiv] [QA]
  • Towards Large-Scale Interpretable Knowledge Graph Reasoning for Dialogue Systems - [Arxiv] [QA]
  • Iwin: Human-Object Interaction Detection via Transformer with Irregular Windows - [Arxiv] [QA]
  • CoWs on Pasture: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation - [Arxiv] [QA]
  • On Robust Prefix-Tuning for Text Classification - [Arxiv] [QA]
  • Exploring Motion Ambiguity and Alignment for High-Quality Video Frame Interpolation - [Arxiv] [QA]
  • SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition - [Arxiv] [QA]
  • Generative Principal Component Analysis - [Arxiv] [QA]
  • Monotonic Differentiable Sorting Networks - [Arxiv] [QA]
  • A Framework and Benchmark for Deep Batch Active Learning for Regression - [Arxiv] [QA]
  • RoMe: A Robust Metric for Evaluating Natural Language Generation - [Arxiv] [QA]
  • PLANET: Dynamic Content Planning in Autoregressive Transformers for Long-form Text Generation - [Arxiv] [QA]
  • Memorizing Transformers - [Arxiv] [QA]
  • Multi-Stage Prompting for Knowledgeable Dialogue Generation - [Arxiv] [QA]
  • Differentiable DAG Sampling - [Arxiv] [QA]
  • Iteratively Prompt Pre-trained Language Models for Chain of Thought - [Arxiv] [QA]
  • Multi-View Document Representation Learning for Open-Domain Dense Retrieval - [Arxiv] [QA]
  • Unified Visual Transformer Compression - [Arxiv] [QA]
  • Vision-Based Manipulators Need to Also See from Their Hands - [Arxiv] [QA]
  • Augmenting Document Representations for Dense Retrieval with Interpolation and Perturbation - [Arxiv] [QA]
  • ActFormer: A GAN-based Transformer towards General Action-Conditioned 3D Human Motion Generation - [Arxiv] [QA]
  • Distraction is All You Need for Fairness - [Arxiv] [QA]
  • ScienceWorld: Is your Agent Smarter than a 5th Grader? - [Arxiv] [QA]
  • Respecting causality is all you need for training physics-informed neural networks - [Arxiv] [QA]
  • All in One: Exploring Unified Video-Language Pre-training - [Arxiv] [QA]
  • Orchestrated Value Mapping for Reinforcement Learning - [Arxiv] [QA]
  • Masked Autoencoders for Point Cloud Self-supervised Learning - [Arxiv] [QA]
  • PromptChainer: Chaining Large Language Model Prompts through Visual Programming - [Arxiv] [QA]
  • Categories of Differentiable Polynomial Circuits for Machine Learning - [Arxiv] [QA]
  • BiBERT: Accurate Fully Binarized BERT - [Arxiv] [QA]
  • MISF: Multi-level Interactive Siamese Filtering for High-Fidelity Image Inpainting - [Arxiv] [QA]
  • LaPraDoR: Unsupervised Pretrained Dense Retriever for Zero-Shot Text Retrieval - [Arxiv] [QA]
  • An Interpretable Neuro-Symbolic Reasoning Framework for Task-Oriented Dialogue Generation - [Arxiv] [QA]
  • Long Time No See! Open-Domain Conversation with Long-Term Persona Memory - [Arxiv] [QA]
  • Conditional Prompt Learning for Vision-Language Models - [Arxiv] [QA]
  • Back to the Feature: Classical 3D Features are (Almost) All You Need for 3D Anomaly Detection - [Arxiv] [QA]
  • Self Pre-training with Masked Autoencoders for Medical Image Classification and Segmentation - [Arxiv] [QA]
  • MVP: Multimodality-guided Visual Pre-training - [Arxiv] [QA]
  • Internet-augmented language models through few-shot prompting for open-domain question answering - [Arxiv] [QA]
  • All You Need is LUV: Unsupervised Collection of Labeled Images using Invisible UV Fluorescent Indicators - [Arxiv] [QA]
  • Source-free Video Domain Adaptation by Learning Temporal Consistency for Action Recognition - [Arxiv] [QA]
  • Kubric: A scalable dataset generator - [Arxiv] [QA]
  • Self-supervised Implicit Glyph Attention for Text Recognition - [Arxiv] [QA]
  • Adaptive Cross-Layer Attention for Image Restoration - [Arxiv] [QA]
  • Structured Pruning is All You Need for Pruning CNNs at Initialization - [Arxiv] [QA]
  • Neural Simulated Annealing - [Arxiv] [QA]
  • Training language models to follow instructions with human feedback - [Arxiv] [QA]
  • BoMD: Bag of Multi-label Descriptors for Noisy Chest X-ray Classification - [Arxiv] [QA]
  • Video is All You Need: Attacking PPG-based Biometric Authentication - [Arxiv] [QA]
  • Incremental Transformer Structure Enhanced Image Inpainting with Masking Positional Encoding - [Arxiv] [QA]
  • Towards a unified view of unsupervised non-local methods for image denoising: the NL-Ridge approach - [Arxiv] [QA]
  • DynamicRetriever: A Pre-training Model-based IR System with Neither Sparse nor Dense Index - [Arxiv] [QA]

February 2022

  • One Model is All You Need: Multi-Task Learning Enables Simultaneous Histology Image Segmentation and Classification - [Arxiv] [QA]
  • A Proximal Algorithm for Sampling - [Arxiv] [QA]
  • Rethinking and Refining the Distinct Metric - [Arxiv] [QA]
  • Filter-enhanced MLP is All You Need for Sequential Recommendation - [Arxiv] [QA]
  • The Spectral Bias of Polynomial Neural Networks - [Arxiv] [QA]
  • AugESC: Dialogue Augmentation with Large Language Models for Emotional Support Conversation - [Arxiv] [QA]
  • Ask2Mask: Guided Data Selection for Masked Speech Modeling - [Arxiv] [QA]
  • Effective Actor-centric Human-object Interaction Detection - [Arxiv] [QA]
  • All You Need Is Supervised Learning: From Imitation Learning to Meta-RL With Upside Down RL - [Arxiv] [QA]
  • Auto-scaling Vision Transformers without Training - [Arxiv] [QA]
  • COLD Decoding: Energy-based Constrained Text Generation with Langevin Dynamics - [Arxiv] [QA]
  • Socialformer: Social Network Inspired Long Document Modeling for Document Ranking - [Arxiv] [QA]
  • PyTorch Geometric Signed Directed: A Software Package on Graph Neural Networks for Signed and Directed Graphs - [Arxiv] [QA]
  • Adversarial Attacks on Speech Recognition Systems for Mission-Critical Applications: A Survey - [Arxiv] [QA]
  • 1-WL Expressiveness Is (Almost) All You Need - [Arxiv] [QA]
  • Pseudo Numerical Methods for Diffusion Models on Manifolds - [Arxiv] [QA]
  • Bayes-Optimal Classifiers under Group Fairness - [Arxiv] [QA]
  • Bit-wise Training of Neural Network Weights - [Arxiv] [QA]
  • Highlighting Object Category Immunity for the Generalization of Human-Object Interaction Detection - [Arxiv] [QA]
  • Unsupervised Multiple-Object Tracking with a Dynamical Variational Autoencoder - [Arxiv] [QA]
  • Masked prediction tasks: a parameter identifiability view - [Arxiv] [QA]
  • Gaussian Mixture Convolution Networks - [Arxiv] [QA]
  • cosFormer: Rethinking Softmax in Attention - [Arxiv] [QA]
  • Task-Agnostic Graph Explanations - [Arxiv] [QA]
  • Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations - [Arxiv] [QA]
  • Don't Lie to Me! Robust and Efficient Explainability with Verified Perturbation Analysis - [Arxiv] [QA]
  • A precortical module for robust CNNs to light variations - [Arxiv] [QA]
  • Exploring Discontinuity for Video Frame Interpolation - [Arxiv] [QA]
  • Transformer Memory as a Differentiable Search Index - [Arxiv] [QA]
  • What Do They Capture? -- A Structural Analysis of Pre-Trained Language Models for Source Code - [Arxiv] [QA]
  • Domain Adaptation via Prompt Learning - [Arxiv] [QA]
  • FlowEval: A Consensus-Based Dialogue Evaluation Framework Using Segment Act Flows - [Arxiv] [QA]
  • A Contrastive Framework for Neural Text Generation - [Arxiv] [QA]
  • Conditional Contrastive Learning with Kernel - [Arxiv] [QA]
  • Domain Adversarial Training: A Game Perspective - [Arxiv] [QA]
  • InPars: Data Augmentation for Information Retrieval using Large Language Models - [Arxiv] [QA]
  • Neural Sheaf Diffusion: A Topological Perspective on Heterophily and Oversmoothing in GNNs - [Arxiv] [QA]
  • GiraffeDet: A Heavy-Neck Paradigm for Object Detection - [Arxiv] [QA]
  • Distillation with Contrast is All You Need for Self-Supervised Point Cloud Representation Learning - [Arxiv] [QA]
  • Towards Compositional Adversarial Robustness: Generalizing Adversarial Training to Composite Semantic Perturbations - [Arxiv] [QA]
  • MaskGIT: Masked Generative Image Transformer - [Arxiv] [QA]
  • FMP: Toward Fair Graph Message Passing against Topology Bias - [Arxiv] [QA]
  • DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models - [Arxiv] [QA]
  • How to Understand Masked Autoencoders - [Arxiv] [QA]
  • Survey of Hallucination in Natural Language Generation - [Arxiv] [QA]
  • GrASP: Gradient-Based Affordance Selection for Planning - [Arxiv] [QA]
  • PolicyCleanse: Backdoor Detection and Mitigation in Reinforcement Learning - [Arxiv] [QA]
  • data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language - [Arxiv] [QA]
  • Corrupted Image Modeling for Self-Supervised Visual Pre-Training - [Arxiv] [QA]
  • Message Passing Neural PDE Solvers - [Arxiv] [QA]
  • Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics - [Arxiv] [QA]
  • OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework - [Arxiv] [QA]
  • Context Autoencoder for Self-Supervised Representation Learning - [Arxiv] [QA]
  • User Satisfaction Estimation with Sequential Dialogue Act Modeling in Goal-oriented Conversational Systems - [Arxiv] [QA]
  • DEVO: Depth-Event Camera Visual Odometry in Challenging Conditions - [Arxiv] [QA]
  • One-Nearest-Neighbor Search is All You Need for Minimax Optimal Regression and Classification - [Arxiv] [QA]
  • Webly Supervised Concept Expansion for General Purpose Vision Models - [Arxiv] [QA]
  • Structured Prediction Problem Archive - [Arxiv] [QA]
  • mSLAM: Massively multilingual joint pre-training for speech and text - [Arxiv] [QA]
  • A Survey on Retrieval-Augmented Text Generation - [Arxiv] [QA]
  • ColloSSL: Collaborative Self-Supervised Learning for Human Activity Recognition - [Arxiv] [QA]
  • Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics - [Arxiv] [QA]
  • CLA-NeRF: Category-Level Articulated Neural Radiance Field - [Arxiv] [QA]

January 2022

  • Signing the Supermask: Keep, Hide, Invert - [Arxiv] [QA]
  • Few-Shot Backdoor Attacks on Visual Object Tracking - [Arxiv] [QA]
  • Causal Explanations and XAI - [Arxiv] [QA]
  • Adversarial Masking for Self-Supervised Learning - [Arxiv] [QA]
  • Robust Imitation Learning from Corrupted Demonstrations - [Arxiv] [QA]
  • Rebalancing Batch Normalization for Exemplar-based Class-Incremental Learning - [Arxiv] [QA]
  • ItôWave: Itô Stochastic Differential Equation Is All You Need For Wave Generation - [Arxiv] [QA]
  • Counterfactual Plans under Distributional Ambiguity - [Arxiv] [QA]
  • DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR - [Arxiv] [QA]
  • Mask-based Latent Reconstruction for Reinforcement Learning - [Arxiv] [QA]
  • Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model - [Arxiv] [QA]
  • Chain-of-Thought Prompting Elicits Reasoning in Large Language Models - [Arxiv] [QA]
  • DiscoScore: Evaluating Text Generation with BERT and Discourse Coherence - [Arxiv] [QA]
  • Natural Language Descriptions of Deep Visual Features - [Arxiv] [QA]
  • Explanatory Learning: Beyond Empiricism in Neural Networks - [Arxiv] [QA]
  • RePaint: Inpainting using Denoising Diffusion Probabilistic Models - [Arxiv] [QA]
  • Learning Graph Augmentations to Learn Graph Representations - [Arxiv] [QA]
  • Patches Are All You Need? - [Arxiv] [QA]
  • Neural Implicit Surface Evolution - [Arxiv] [QA]
  • Universal Online Learning with Unbounded Losses: Memory Is All You Need - [Arxiv] [QA]
  • Fast Differentiable Matrix Square Root - [Arxiv] [QA]
  • End-to-end Generative Pretraining for Multimodal Video Captioning - [Arxiv] [QA]
  • LaMDA: Language Models for Dialog Applications - [Arxiv] [QA]
  • Safe Deep RL in 3D Environments using Human Feedback - [Arxiv] [QA]
  • CM3: A Causal Masked Multimodal Model of the Internet - [Arxiv] [QA]
  • Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents - [Arxiv] [QA]
  • GANmouflage: 3D Object Nondetection with Texture Fields - [Arxiv] [QA]
  • RePre: Improving Self-Supervised Vision Transformer with Reconstructive Pre-training - [Arxiv] [QA]
  • Parameter-free Online Test-time Adaptation - [Arxiv] [QA]
  • Progressively Optimized Bi-Granular Document Representation for Scalable Embedding Based Retrieval - [Arxiv] [QA]
  • A Survey of Controllable Text Generation using Transformer-based Pre-trained Language Models - [Arxiv] [QA]
  • Neural Circuit Architectural Priors for Embodied Control - [Arxiv] [QA]
  • SparseDet: Improving Sparsely Annotated Object Detection with Pseudo-positive Mining - [Arxiv] [QA]
  • Structure and Semantics Preserving Document Representations - [Arxiv] [QA]
  • 3D Face Morphing Attacks: Generation, Vulnerability and Detection - [Arxiv] [QA]
  • QuadTree Attention for Vision Transformers - [Arxiv] [QA]
  • Categorical Hopfield Networks - [Arxiv] [QA]
  • Detecting Human-to-Human-or-Object (H2O) Interactions with DIABOLO - [Arxiv] [QA]
  • Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets - [Arxiv] [QA]
  • All You Need In Sign Language Production - [Arxiv] [QA]
  • C2-CRS: Coarse-to-Fine Contrastive Learning for Conversational Recommender System - [Arxiv] [QA]
  • Class-Incremental Continual Learning into the eXtended DER-verse - [Arxiv] [QA]
  • Vision Transformer with Deformable Attention - [Arxiv] [QA]