- Rethinking with Retrieval: Faithful Large Language Model Inference - [Arxiv] [QA]
- A Survey on In-context Learning - [Arxiv] [QA]
- Disjoint Masking with Joint Distillation for Efficient Masked Image Modeling - [Arxiv] [QA]
- Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? - [Arxiv] [QA]
- Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models - [Arxiv] [QA]
- Unlearnable Clusters: Towards Label-agnostic Unlearnable Examples - [Arxiv] [QA]
- Tracing the Origin of Adversarial Attack for Forensic Investigation and Deterrence - [Arxiv] [QA]
- Imitator: Personalized Speech-driven 3D Facial Animation - [Arxiv] [QA]
- NeRF-Gaze: A Head-Eye Redirection Parametric Model for Gaze Estimation - [Arxiv] [QA]
- NIRVANA: Neural Implicit Representations of Videos with Adaptive Networks and Autoregressive Patch-wise Modeling - [Arxiv] [QA]
- Transformer in Transformer as Backbone for Deep Reinforcement Learning - [Arxiv] [QA]
- Scale-MAE: A Scale-Aware Masked Autoencoder for Multiscale Geospatial Representation Learning - [Arxiv] [QA]
- Improving Visual Representation Learning through Perceptual Understanding - [Arxiv] [QA]
- Effects of Data Geometry in Early Deep Learning - [Arxiv] [QA]
- StyleRes: Transforming the Residuals for Real Image Editing with StyleGAN - [Arxiv] [QA]
- MagicNet: Semi-Supervised Multi-Organ Segmentation via Magic-Cube Partition and Recovery - [Arxiv] [QA]
- Foreground-Background Separation through Concept Distillation from Generative Image Foundation Models - [Arxiv] [QA]
- Detection of out-of-distribution samples using binary neuron activation patterns - [Arxiv] [QA]
- Discriminator-Cooperated Feature Map Distillation for GAN Compression - [Arxiv] [QA]
- Cramming: Training a Language Model on a Single GPU in One Day - [Arxiv] [QA]
- Demonstrate-Search-Predict: Composing retrieval and language models for knowledge-intensive NLP - [Arxiv] [QA]
- Dream3D: Zero-Shot Text-to-3D Synthesis Using 3D Shape Prior and Text-to-Image Diffusion Models - [Arxiv] [QA]
- Multi-Realism Image Compression with a Conditional Generator - [Arxiv] [QA]
- Noise-aware Learning from Web-crawled Image-Text Data for Image Captioning - [Arxiv] [QA]
- Interactive Segmentation of Radiance Fields - [Arxiv] [QA]
- GEDI: GEnerative and DIscriminative Training for Self-Supervised Learning - [Arxiv] [QA]
- Behavioral Cloning via Search in Video PreTraining Latent Space - [Arxiv] [QA]
- DSI2I: Dense Style for Unpaired Image-to-Image Translation - [Arxiv] [QA]
- Large Language Models Encode Clinical Knowledge - [Arxiv] [QA]
- SMMix: Self-Motivated Image Mixing for Vision Transformers - [Arxiv] [QA]
- TexPose: Neural Texture Learning for Self-Supervised 6D Object Pose Estimation - [Arxiv] [QA]
- When Do Curricula Work in Federated Learning? - [Arxiv] [QA]
- On Realization of Intelligent Decision-Making in the Real World: A Foundation Decision Model Perspective - [Arxiv] [QA]
- HandsOff: Labeled Dataset Generation With No Additional Human Annotations - [Arxiv] [QA]
- xFBD: Focused Building Damage Dataset and Analysis - [Arxiv] [QA]
- Detecting Objects with Context-Likelihood Graphs and Graph Refinement - [Arxiv] [QA]
- A-NeSI: A Scalable Approximate Method for Probabilistic Neurosymbolic Inference - [Arxiv] [QA]
- Do DALL-E and Flamingo Understand Each Other? - [Arxiv] [QA]
- Learning to Detect and Segment for Open Vocabulary Object Detection - [Arxiv] [QA]
- On Calibrating Semantic Segmentation Models: Analyses and An Algorithm - [Arxiv] [QA]
- OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization - [Arxiv] [QA]
- DisCoScene: Spatially Disentangled Generative Radiance Fields for Controllable 3D-aware Scene Synthesis - [Arxiv] [QA]
- Shakes on a Plane: Unsupervised Depth Estimation from Unstabilized Photography - [Arxiv] [QA]
- Removing Objects From Neural Radiance Fields - [Arxiv] [QA]
- Markov Categories and Entropy - [Arxiv] [QA]
- Text Generation with Diffusion Language Models: A Pre-training Approach with Continuous Paragraph Denoise - [Arxiv] [QA]
- DDColor: Towards Photo-Realistic Image Colorization via Dual Decoders - [Arxiv] [QA]
- Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation - [Arxiv] [QA]
- Generalized Decoding for Pixel, Image, and Language - [Arxiv] [QA]
- 3D Highlighter: Localizing Regions on 3D Shapes via Text Descriptions - [Arxiv] [QA]
- Hi-LASSIE: High-Fidelity Articulated Shape and Skeleton Discovery from Sparse Image Ensemble - [Arxiv] [QA]
- Revisiting Residual Networks for Adversarial Robustness: An Architectural Perspective - [Arxiv] [QA]
- TruFor: Leveraging all-round clues for trustworthy image forgery detection and localization - [Arxiv] [QA]
- Critic-Guided Decoding for Controlled Text Generation - [Arxiv] [QA]
- In-Sensor & Neuromorphic Computing are all you need for Energy Efficient Computer Vision - [Arxiv] [QA]
- MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning - [Arxiv] [QA]
- MoralDial: A Framework to Train and Evaluate Moral Dialogue Systems via Moral Discussions - [Arxiv] [QA]
- PaletteNeRF: Palette-based Appearance Editing of Neural Radiance Fields - [Arxiv] [QA]
- Analyzing Semantic Faithfulness of Language Models via Input Intervention on Conversational Question Answering - [Arxiv] [QA]
- Scene-aware Egocentric 3D Human Pose Estimation - [Arxiv] [QA]
- Full-Body Articulated Human-Object Interaction - [Arxiv] [QA]
- Ontologically Faithful Generation of Non-Player Character Dialogues - [Arxiv] [QA]
- Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers - [Arxiv] [QA]
- Unleashing the Power of Visual Prompting At the Pixel Level - [Arxiv] [QA]
- InstantAvatar: Learning Avatars from Monocular Video in 60 Seconds - [Arxiv] [QA]
- A Survey of Deep Learning for Mathematical Reasoning - [Arxiv] [QA]
- Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions - [Arxiv] [QA]
- Precise Zero-Shot Dense Retrieval without Relevance Labels - [Arxiv] [QA]
- LAMBADA: Backward Chaining for Automated Reasoning in Natural Language - [Arxiv] [QA]
- Controllable Text Generation with Language Constraints - [Arxiv] [QA]
- QuantArt: Quantizing Image Style Transfer Towards High Visual Fidelity - [Arxiv] [QA]
- Towards Reasoning in Large Language Models: A Survey - [Arxiv] [QA]
- SeqDiffuSeq: Text Diffusion with Encoder-Decoder Transformers - [Arxiv] [QA]
- ReCode: Robustness Evaluation of Code Generation Models - [Arxiv] [QA]
- Pre-trained Language Models for Keyphrase Generation: A Thorough Empirical Study - [Arxiv] [QA]
- StyleDomain: Efficient and Lightweight Parameterizations of StyleGAN for One-shot and Few-shot Domain Adaptation - [Arxiv] [QA]
- Hoyer regularizer is all you need for ultra low-latency spiking neural networks - [Arxiv] [QA]
- Planning-oriented Autonomous Driving - [Arxiv] [QA]
- Large Language Models Are Reasoning Teachers - [Arxiv] [QA]
- RepMode: Learning to Re-parameterize Diverse Experts for Subcellular Structure Prediction - [Arxiv] [QA]
- I Cast Detect Thoughts: Learning to Converse and Guide with Intents and Theory-of-Mind in Dungeons and Dragons - [Arxiv] [QA]
- Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters - [Arxiv] [QA]
- MetaCLUE: Towards Comprehensive Visual Metaphors Research - [Arxiv] [QA]
- Panoptic Lifting for 3D Scene Understanding with Neural Fields - [Arxiv] [QA]
- Denotationally Correct, Purely Functional, Efficient Reverse-mode Automatic Differentiation - [Arxiv] [QA]
- Scalable Diffusion Models with Transformers - [Arxiv] [QA]
- One Embedder, Any Task: Instruction-Finetuned Text Embeddings - [Arxiv] [QA]
- Position-guided Text Prompt for Vision-Language Pre-training - [Arxiv] [QA]
- Don't Generate, Discriminate: A Proposal for Grounding Language Models to Real-World Environments - [Arxiv] [QA]
- The case for 4-bit precision: k-bit Inference Scaling Laws - [Arxiv] [QA]
- A Probabilistic Framework for Lifelong Test-Time Adaptation - [Arxiv] [QA]
- Reasoning with Language Model Prompting: A Survey - [Arxiv] [QA]
- Large Language Models are Better Reasoners with Self-Verification - [Arxiv] [QA]
- Interactive Cartoonization with Controllable Perceptual Factors - [Arxiv] [QA]
- HARP: Personalized Hand Reconstruction from a Monocular RGB Video - [Arxiv] [QA]
- MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation - [Arxiv] [QA]
- Latent Diffusion for Language Generation - [Arxiv] [QA]
- Difformer: Empowering Diffusion Models on the Embedding Space for Text Generation - [Arxiv] [QA]
- Distilling Vision-Language Pre-training to Collaborate with Weakly-Supervised Temporal Action Localization - [Arxiv] [QA]
- Out-of-domain GAN inversion via Invertibility Decomposition for Photo-Realistic Human Face Manipulation - [Arxiv] [QA]
- Discovering Language Model Behaviors with Model-Written Evaluations - [Arxiv] [QA]
- PAL: Persona-Augmented Emotional Support Conversation Generation - [Arxiv] [QA]
- Discrete Point-wise Attack Is Not Enough: Generalized Manifold Adversarial Attack for Face Recognition - [Arxiv] [QA]
- Emergent Analogical Reasoning in Large Language Models - [Arxiv] [QA]
- Don't Forget Your ABC's: Evaluating the State-of-the-Art in Chat-Oriented Dialogue Systems - [Arxiv] [QA]
- Can Retriever-Augmented Language Models Reason? The Blame Game Between the Retriever and the Language Model - [Arxiv] [QA]
- Let's Negotiate! A Survey of Negotiation Dialogue Systems - [Arxiv] [QA]
- Masked Wavelet Representation for Compact Neural Radiance Fields - [Arxiv] [QA]
- Fine-Tuning Is All You Need to Mitigate Backdoor Attacks - [Arxiv] [QA]
- Minimizing Maximum Model Discrepancy for Transferable Black-box Targeted Attacks - [Arxiv] [QA]
- A Layered Architecture for Universal Causality - [Arxiv] [QA]
- Are We Ready for Vision-Centric Driving Streaming Perception? The ASAP Benchmark - [Arxiv] [QA]
- Neural Story Planning - [Arxiv] [QA]
- Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models - [Arxiv] [QA]
- The Impact of Symbolic Representations on In-context Learning for Few-shot Reasoning - [Arxiv] [QA]
- Attentive Mask CLIP - [Arxiv] [QA]
- GFPose: Learning 3D Human Pose Prior with Gradient Fields - [Arxiv] [QA]
- Learnable Commutative Monoids for Graph Neural Networks - [Arxiv] [QA]
- Teaching Small Language Models to Reason - [Arxiv] [QA]
- Autoencoders as Cross-Modal Teachers: Can Pretrained 2D Image Transformers Help 3D Representation Learning? - [Arxiv] [QA]
- RepQ-ViT: Scale Reparameterization for Post-Training Quantization of Vision Transformers - [Arxiv] [QA]
- Backdoor Attack Detection in Computer Vision by Applying Matrix Factorization on the Weights of Deep Networks - [Arxiv] [QA]
- Injecting Domain Knowledge in Language Models for Task-Oriented Dialogue Systems - [Arxiv] [QA]
- MAViL: Masked Audio-Video Learners - [Arxiv] [QA]
- VolRecon: Volume Rendering of Signed Ray Distance Functions for Generalizable Multi-View Reconstruction - [Arxiv] [QA]
- MetaPortrait: Identity-Preserving Talking Head Generation with Fast Personalized Adaptation - [Arxiv] [QA]
- On Second Thought, Let's Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning - [Arxiv] [QA]
- Rethinking Vision Transformers for MobileNet Size and Speed - [Arxiv] [QA]
- Real-Time Neural Light Field on Mobile Devices - [Arxiv] [QA]
- Objaverse: A Universe of Annotated 3D Objects - [Arxiv] [QA]
- Sliced Optimal Partial Transport - [Arxiv] [QA]
- CLIPPO: Image-and-Language Understanding from Pixels Only - [Arxiv] [QA]
- FlexiViT: One Model for All Patch Sizes - [Arxiv] [QA]
- EVAL: Explainable Video Anomaly Localization - [Arxiv] [QA]
- DeepLSD: Line Segment Detection and Refinement with Deep Image Gradients - [Arxiv] [QA]
- Relightable Neural Human Assets from Multi-view Gradient Illuminations - [Arxiv] [QA]
- Constitutional AI: Harmlessness from AI Feedback - [Arxiv] [QA]
- NeuralDome: A Neural Modeling Pipeline on Multi-View Human-Object Interactions - [Arxiv] [QA]
- Enhanced Training of Query-Based Object Detection via Selective Query Recollection - [Arxiv] [QA]
- SteerNeRF: Accelerating NeRF Rendering via Smooth Viewpoint Trajectory - [Arxiv] [QA]
- IMos: Intent-Driven Full-Body Motion Synthesis for Human-Object Interactions - [Arxiv] [QA]
- Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language - [Arxiv] [QA]
- ECON: Explicit Clothed humans Optimized via Normal integration - [Arxiv] [QA]
- Self-Supervised Geometry-Aware Encoder for Style-Based 3D GAN Inversion - [Arxiv] [QA]
- Policy Adaptation from Foundation Model Feedback - [Arxiv] [QA]
- NoPe-NeRF: Optimising Neural Radiance Field with No Pose Prior - [Arxiv] [QA]
- ConQueR: Query Contrast Voxel-DETR for 3D Object Detection - [Arxiv] [QA]
- HOOD: Hierarchical Graphs for Generalized Modelling of Clothing Dynamics - [Arxiv] [QA]
- Reproducible scaling laws for contrastive language-image learning - [Arxiv] [QA]
- PD-Quant: Post-Training Quantization based on Prediction Difference Metric - [Arxiv] [QA]
- Understanding Zero-Shot Adversarial Robustness for Large-Scale Models - [Arxiv] [QA]
- EgoLoc: Revisiting 3D Object Localization from Egocentric Videos with Visual Queries - [Arxiv] [QA]
- Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting - [Arxiv] [QA]
- CREPE: Can Vision-Language Foundation Models Reason Compositionally? - [Arxiv] [QA]
- Look Before You Match: Instance Understanding Matters in Video Object Segmentation - [Arxiv] [QA]
- Structured 3D Features for Reconstructing Controllable Avatars - [Arxiv] [QA]
- Learning 3D Representations from 2D Pre-trained Models via Image-to-Point Masked Autoencoders - [Arxiv] [QA]
- Category Theory for Quantum Natural Language Processing - [Arxiv] [QA]
- FastMIM: Expediting Masked Image Modeling Pre-training for Vision - [Arxiv] [QA]
- Pixel is All You Need: Adversarial Trajectory-Ensemble Active Learning for Salient Object Detection - [Arxiv] [QA]
- DA Wand: Distortion-Aware Selection using Neural Mesh Parameterization - [Arxiv] [QA]
- DeepMapping2: Self-Supervised Large-Scale LiDAR Map Optimization - [Arxiv] [QA]
- Doubly Right Object Recognition: A Why Prompt for Visual Rationales - [Arxiv] [QA]
- Breaking the "Object" in Video Object Segmentation - [Arxiv] [QA]
- Rodin: A Generative Model for Sculpting 3D Digital Avatars Using Diffusion - [Arxiv] [QA]
- RGBD2: Generative Scene Synthesis via Incremental View Inpainting using RGBD Diffusion Models - [Arxiv] [QA]
- Towards Practical Plug-and-Play Diffusion Models - [Arxiv] [QA]
- Evaluation and Improvement of Interpretability for Self-Explainable Part-Prototype Networks - [Arxiv] [QA]
- ALSO: Automotive Lidar Self-supervision by Occupancy estimation - [Arxiv] [QA]
- Accelerating Dataset Distillation via Model Augmentation - [Arxiv] [QA]
- MoDem: Accelerating Visual Model-Based Reinforcement Learning with Demonstrations - [Arxiv] [QA]
- REAP: A Large-Scale Realistic Adversarial Patch Benchmark - [Arxiv] [QA]
- Masked autoencoders are effective solution to transformer data-hungry - [Arxiv] [QA]
- Cross-Modal Learning with 3D Deformable Attention for Action Recognition - [Arxiv] [QA]
- Recurrent Vision Transformers for Object Detection with Event Cameras - [Arxiv] [QA]
- PromptCAL: Contrastive Affinity Learning via Auxiliary Prompts for Generalized Novel Category Discovery - [Arxiv] [QA]
- How to Backdoor Diffusion Models? - [Arxiv] [QA]
- Source-free Depth for Object Pop-out - [Arxiv] [QA]
- HumanGen: Generating Human Radiance Fields with Explicit Priors - [Arxiv] [QA]
- Position Embedding Needs an Independent Layer Normalization - [Arxiv] [QA]
- NeuS2: Fast Learning of Neural Implicit Surfaces for Multi-view Reconstruction - [Arxiv] [QA]
- MAGVIT: Masked Generative Video Transformer - [Arxiv] [QA]
- A Whac-A-Mole Dilemma: Shortcuts Come in Multiples Where Mitigating One Amplifies Others - [Arxiv] [QA]
- VindLU: A Recipe for Effective Video-and-Language Pretraining - [Arxiv] [QA]
- SmartBrush: Text and Shape Guided Object Inpainting with Diffusion Model - [Arxiv] [QA]
- Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis - [Arxiv] [QA]
- Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning - [Arxiv] [QA]
- Augmentation Matters: A Simple-yet-Effective Approach to Semi-supervised Semantic Segmentation - [Arxiv] [QA]
- Seeing a Rose in Five Thousand Ways - [Arxiv] [QA]
- Information-Theoretic Safe Exploration with Gaussian Processes - [Arxiv] [QA]
- Genie: Show Me the Data for Quantization - [Arxiv] [QA]
- SLAM for Visually Impaired Navigation: A Systematic Literature Review of the Current State of Research - [Arxiv] [QA]
- ShadowDiffusion: When Degradation Prior Meets Diffusion Model for Shadow Removal - [Arxiv] [QA]
- Primal Dual Alternating Proximal Gradient Algorithms for Nonsmooth Nonconvex Minimax Problems with Coupled Linear Constraints - [Arxiv] [QA]
- MIMO Is All You Need : A Strong Multi-In-Multi-Out Baseline for Video Prediction - [Arxiv] [QA]
- FLAG3D: A 3D Fitness Activity Dataset with Language Instruction - [Arxiv] [QA]
- Ego-Body Pose Estimation via Ego-Head Pose Estimation - [Arxiv] [QA]
- Structured Like a Language Model: Analysing AI as an Automated Subject - [Arxiv] [QA]
- ORCa: Glossy Objects as Radiance Field Cameras - [Arxiv] [QA]
- Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning - [Arxiv] [QA]
- MoFusion: A Framework for Denoising-Diffusion-based Motion Synthesis - [Arxiv] [QA]
- SDFusion: Multimodal 3D Shape Completion, Reconstruction, and Generation - [Arxiv] [QA]
- Multi-Concept Customization of Text-to-Image Diffusion - [Arxiv] [QA]
- Fresnel Microfacet BRDF: Unification of Polari-Radiometric Surface-Body Reflection - [Arxiv] [QA]
- Phone2Proc: Bringing Robust Robots Into Our Chaotic World - [Arxiv] [QA]
- Generating Holistic 3D Human Motion from Speech - [Arxiv] [QA]
- BEVBert: Multimodal Map Pre-training for Language-guided Navigation - [Arxiv] [QA]
- MIME: Human-Aware 3D Scene Generation - [Arxiv] [QA]
- On the Robustness of Normalizing Flows for Inverse Problems in Imaging - [Arxiv] [QA]
- GazeNeRF: 3D-Aware Gaze Redirection with Neural Radiance Fields - [Arxiv] [QA]
- Decorate the Newcomers: Visual Domain Prompt for Continual Test Time Adaptation - [Arxiv] [QA]
- Deep Incubation: Training Large Models by Divide-and-Conquering - [Arxiv] [QA]
- Successive Prompting for Decomposing Complex Questions - [Arxiv] [QA]
- Exploiting Completeness and Uncertainty of Pseudo Labels for Weakly Supervised Video Anomaly Detection - [Arxiv] [QA]
- LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models - [Arxiv] [QA]
- Learning to Dub Movies via Hierarchical Prosody Models - [Arxiv] [QA]
- Executing your Commands via Motion Diffusion in Latent Space - [Arxiv] [QA]
- Teaching Matters: Investigating the Role of Supervision in Vision Transformers - [Arxiv] [QA]
- Talking Head Generation with Probabilistic Audio-to-Visual Diffusion Priors - [Arxiv] [QA]
- iQuery: Instruments as Queries for Audio-Visual Sound Separation - [Arxiv] [QA]
- Reconciling a Centroid-Hypothesis Conflict in Source-Free Domain Adaptation - [Arxiv] [QA]
- GLeaD: Improving GANs with A Generator-Leading Task - [Arxiv] [QA]
- FineDance: A Fine-grained Choreography Dataset for 3D Full Body Dance Generation - [Arxiv] [QA]
- EditableNeRF: Editing Topologically Varying Neural Radiance Fields by Key Points - [Arxiv] [QA]
- Name Your Colour For the Task: Artificially Discover Colour Naming via Colour Quantisation Transformer - [Arxiv] [QA]
- Diffusion-SDF: Text-to-Shape via Voxelized Diffusion - [Arxiv] [QA]
- NeRDi: Single-View NeRF Synthesis with Language-Guided Diffusion as General Image Priors - [Arxiv] [QA]
- Fine-tuned CLIP Models are Efficient Video Learners - [Arxiv] [QA]
- Perspective Fields for Single Image Camera Calibration - [Arxiv] [QA]
- Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning - [Arxiv] [QA]
- Visual Query Tuning: Towards Effective Usage of Intermediate Representations for Parameter and Memory Efficient Transfer Learning - [Arxiv] [QA]
- InternVideo: General Video Foundation Models via Generative and Discriminative Learning - [Arxiv] [QA]
- Semantic-Conditional Diffusion Networks for Image Captioning - [Arxiv] [QA]
- Iterative Next Boundary Detection for Instance Segmentation of Tree Rings in Microscopy Images of Shrub Cross Sections - [Arxiv] [QA]
- Leveraging Different Learning Styles for Improved Knowledge Distillation in Biomedical Imaging - [Arxiv] [QA]
- Diffusion Video Autoencoders: Toward Temporally Consistent Face Video Editing via Disentangled Video Encoding - [Arxiv] [QA]
- Adaptive Testing of Computer Vision Models - [Arxiv] [QA]
- Learning Neural Parametric Head Models - [Arxiv] [QA]
- Unifying Vision, Text, and Layout for Universal Document Processing - [Arxiv] [QA]
- SceneRF: Self-Supervised Monocular 3D Scene Reconstruction with Radiance Fields - [Arxiv] [QA]
- Images Speak in Images: A Generalist Painter for In-Context Visual Learning - [Arxiv] [QA]
- PEANUT: Predicting and Navigating to Unseen Targets - [Arxiv] [QA]
- One-shot Implicit Animatable Avatars with Model-based Priors - [Arxiv] [QA]
- Block Selection Method for Using Feature Norm in Out-of-distribution Detection - [Arxiv] [QA]
- I2MVFormer: Large Language Model Generated Multi-View Document Supervision for Zero-Shot Image Classification - [Arxiv] [QA]
- Momentum Decoding: Open-ended Text Generation As Graph Exploration - [Arxiv] [QA]
- Prototypical Residual Networks for Anomaly Detection and Localization - [Arxiv] [QA]
- Learning Imbalanced Data with Vision Transformers - [Arxiv] [QA]
- Multiscale Structure Guided Diffusion for Image Deblurring - [Arxiv] [QA]
- Self-supervised AutoFlow - [Arxiv] [QA]
- Improving Zero-shot Generalization and Robustness of Multi-modal Models - [Arxiv] [QA]
- Fast Point Cloud Generation with Straight Flows - [Arxiv] [QA]
- Neural Fourier Filter Bank - [Arxiv] [QA]
- StegaNeRF: Embedding Invisible Information within Neural Radiance Fields - [Arxiv] [QA]
- PartSLIP: Low-Shot Part Segmentation for 3D Point Clouds via Pretrained Image-Language Models - [Arxiv] [QA]
- PGFed: Personalize Each Client's Global Objective for Federated Learning - [Arxiv] [QA]
- PROB: Probabilistic Objectness for Open World Object Detection - [Arxiv] [QA]
- Moving Beyond Downstream Task Accuracy for Information Retrieval Benchmarking - [Arxiv] [QA]
- MIC: Masked Image Consistency for Context-Enhanced Domain Adaptation - [Arxiv] [QA]
- DiffRF: Rendering-Guided 3D Radiance Field Diffusion - [Arxiv] [QA]
- RT-NeRF: Real-Time On-Device Neural Radiance Fields Towards Immersive AR/VR Rendering - [Arxiv] [QA]
- Are Straight-Through gradients and Soft-Thresholding all you need for Sparse Training? - [Arxiv] [QA]
- Cloud-Device Collaborative Adaptation to Continual Changing Environments in the Real-world - [Arxiv] [QA]
- StructVPR: Distill Structural Knowledge with Weighting Samples for Visual Place Recognition - [Arxiv] [QA]
- Scaling Language-Image Pre-training via Masking - [Arxiv] [QA]
- SparseFusion: Distilling View-conditioned Diffusion for 3D Reconstruction - [Arxiv] [QA]
- Unite and Conquer: Plug & Play Multi-Modal Synthesis using Diffusion Models - [Arxiv] [QA]
- 3D Segmentation of Humans in Point Clouds with Synthetic Data - [Arxiv] [QA]
- Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs - [Arxiv] [QA]
- ResFormer: Scaling ViTs with Multi-Resolution Training - [Arxiv] [QA]
- Score Jacobian Chaining: Lifting Pretrained 2D Diffusion Models for 3D Generation - [Arxiv] [QA]
- Exploiting Proximity-Aware Tasks for Embodied Social Navigation - [Arxiv] [QA]
- Hyperbolic Contrastive Learning for Visual Representations beyond Objects - [Arxiv] [QA]
- Finetune like you pretrain: Improved finetuning of zero-shot vision models - [Arxiv] [QA]
- Graph Convolutional Neural Networks as Parametric CoKleisli morphisms - [Arxiv] [QA]
- Language Model Pre-training on True Negatives - [Arxiv] [QA]
- 3D-Aware Object Goal Navigation via Simultaneous Exploration and Identification - [Arxiv] [QA]
- Parametric Information Maximization for Generalized Category Discovery - [Arxiv] [QA]
- All You Need Is Hashing: Defending Against Data Reconstruction Attack in Vertical Federated Learning - [Arxiv] [QA]
- Distilling Reasoning Capabilities into Smaller Language Models - [Arxiv] [QA]
- Plateau-reduced Differentiable Path Tracing - [Arxiv] [QA]
- SinGRAF: Learning a 3D Generative Radiance Field for a Single Scene - [Arxiv] [QA]
- CREPE: Open-Domain Question Answering with False Presuppositions - [Arxiv] [QA]
- CLIPascene: Scene Sketching with Different Types and Levels of Abstraction - [Arxiv] [QA]
- NeRFInvertor: High Fidelity NeRF-GAN Inversion for Single-shot Real Image Animation - [Arxiv] [QA]
- Fast Inference from Transformers via Speculative Decoding - [Arxiv] [QA]
- High-Fidelity Guided Image Synthesis with Latent Diffusion Models - [Arxiv] [QA]
- Spatio-Temporal Crop Aggregation for Video Representation Learning - [Arxiv] [QA]
- BASiS: Batch Aligned Spectral Embedding Space - [Arxiv] [QA]
- DiffPose: Toward More Reliable 3D Pose Estimation - [Arxiv] [QA]
- 3D GAN Inversion with Facial Symmetry Prior - [Arxiv] [QA]
- 3D Neural Field Generation using Triplane Diffusion - [Arxiv] [QA]
- DINER: Depth-aware Image-based NEural Radiance fields - [Arxiv] [QA]
- Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles - [Arxiv] [QA]
- NeuralLift-360: Lifting An In-the-wild 2D Photo to A 3D Object with 360° Views - [Arxiv] [QA]
- RGB no more: Minimally-decoded JPEG Vision Transformers - [Arxiv] [QA]
- Compressing Volumetric Radiance Fields to 1 MB - [Arxiv] [QA]
- PLA: Language-Driven Open-Vocabulary 3D Scene Understanding - [Arxiv] [QA]
- Advancing Deep Metric Learning Through Multiple Batch Norms And Multi-Targeted Adversarial Examples - [Arxiv] [QA]
- Scalable Hierarchical Over-the-Air Federated Learning - [Arxiv] [QA]
- Out-Of-Distribution Detection Is Not All You Need - [Arxiv] [QA]
- Wavelet Diffusion Models are fast and scalable Image Generators - [Arxiv] [QA]
- ExpNet: A unified network for Expert-Level Classification - [Arxiv] [QA]
- NoisyQuant: Noisy Bias-Enhanced Post-Training Activation Quantization for Vision Transformers - [Arxiv] [QA]
- Dimensionality-Varying Diffusion Process - [Arxiv] [QA]
- UDE: A Unified Driving Engine for Human Motion Generation - [Arxiv] [QA]
- SparsePose: Sparse-View Camera Pose Regression and Refinement - [Arxiv] [QA]
- MegaBlocks: Efficient Sparse Training with Mixture-of-Experts - [Arxiv] [QA]
- Decentralized Learning with Multi-Headed Distillation - [Arxiv] [QA]
- Post-training Quantization on Diffusion Models - [Arxiv] [QA]
- High-fidelity 3D GAN Inversion by Pseudo-multi-view Optimization - [Arxiv] [QA]
- Connecting the Dots: Floorplan Reconstruction Using Two-Level Queries - [Arxiv] [QA]
- Is Conditional Generative Modeling all you need for Decision-Making? - [Arxiv] [QA]
- SuS-X: Training-Free Name-Only Transfer of Vision-Language Models - [Arxiv] [QA]
- In-Hand 3D Object Scanning from an RGB Sequence - [Arxiv] [QA]
- A Light Touch Approach to Teaching Transformers Multi-view Geometry - [Arxiv] [QA]
- Class Adaptive Network Calibration - [Arxiv] [QA]
- FeatureBooster: Boosting Feature Descriptors with a Lightweight Neural Network - [Arxiv] [QA]
- High-fidelity Facial Avatar Reconstruction from Monocular Video with Generative Priors - [Arxiv] [QA]
- DiffusionBERT: Improving Generative Masked Language Models with Diffusion Models - [Arxiv] [QA]
- Optimal Sparse Regression Trees - [Arxiv] [QA]
- Post-Processing Temporal Action Detection - [Arxiv] [QA]
- FJMP: Factorized Joint Multi-Agent Motion Prediction over Learned Directed Acyclic Interaction Graphs - [Arxiv] [QA]
- Dense Text Retrieval based on Pretrained Language Models: A Survey - [Arxiv] [QA]
- 3DPPE: 3D Point Positional Encoding for Multi-Camera 3D Object Detection Transformers - [Arxiv] [QA]
- SliceMatch: Geometry-guided Aggregation for Cross-View Pose Estimation - [Arxiv] [QA]
- Towards Improved Input Masking for Convolutional Neural Networks - [Arxiv] [QA]
- Residual Pattern Learning for Pixel-wise Out-of-Distribution Detection in Semantic Segmentation - [Arxiv] [QA]
- Progressive Disentangled Representation Learning for Fine-Grained Controllable Talking Head Synthesis - [Arxiv] [QA]
- Meta Architecture for Point Cloud Analysis - [Arxiv] [QA]
- SpaText: Spatio-Textual Representation for Controllable Image Generation - [Arxiv] [QA]
- RUST: Latent Neural Scene Representations from Unposed Imagery - [Arxiv] [QA]
- BeLFusion: Latent Diffusion for Behavior-Driven Human Motion Prediction - [Arxiv] [QA]
- RbA: Segmenting Unknown Regions Rejected by All - [Arxiv] [QA]
- NeuralUDF: Learning Unsigned Distance Fields for Multi-view Reconstruction of Surfaces with Arbitrary Topologies - [Arxiv] [QA]
- A Strong Baseline for Generalized Few-Shot Semantic Segmentation - [Arxiv] [QA]
- ShadowNeuS: Neural SDF Reconstruction by Shadow Ray Supervision - [Arxiv] [QA]
- Fine-Grained Face Swapping via Regional GAN Inversion - [Arxiv] [QA]
- SCOOP: Self-Supervised Correspondence and Optimization-Based Scene Flow - [Arxiv] [QA]
- CoMFormer: Continual Learning in Semantic and Panoptic Segmentation - [Arxiv] [QA]
- Unsupervised Continual Semantic Adaptation through Neural Rendering - [Arxiv] [QA]
- MPCViT: Searching for Accurate and Efficient MPC-Friendly Vision Transformer with Heterogeneous Attention - [Arxiv] [QA]
- Learning Detailed Radiance Manifolds for High-Fidelity and 3D-Consistent Portrait Synthesis from Monocular Image - [Arxiv] [QA]
- Learning with Silver Standard Data for Zero-shot Relation Extraction - [Arxiv] [QA]
- FFHQ-UV: Normalized Facial UV-Texture Dataset for 3D Face Reconstruction - [Arxiv] [QA]
- GEFF: Improving Any Clothes-Changing Person ReID Model using Gallery Enrichment with Face Features - [Arxiv] [QA]
- SAGA: Spectral Adversarial Geometric Attack on 3D Meshes - [Arxiv] [QA]
- Diffusion-SDF: Conditional Generative Modeling of Signed Distance Functions - [Arxiv] [QA]
- Attention-based Feature Compression for CNN Inference Offloading in Edge Computing - [Arxiv] [QA]
- Perception-Oriented Single Image Super-Resolution using Optimal Objective Estimation - [Arxiv] [QA]
- SfM-TTR: Using Structure from Motion for Test-Time Refinement of Single-View Depth Networks - [Arxiv] [QA]
- Video Test-Time Adaptation for Action Recognition - [Arxiv] [QA]
- TSGP: Two-Stage Generative Prompting for Unsupervised Commonsense Question Answering - [Arxiv] [QA]
- Pose-disentangled Contrastive Learning for Self-supervised Facial Representation - [Arxiv] [QA]
- Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning - [Arxiv] [QA]
- Shifted Diffusion for Text-to-image Generation - [Arxiv] [QA]
- HouseDiffusion: Vector Floorplan Generation via a Diffusion Model with Discrete and Continuous Denoising - [Arxiv] [QA]
- Paint by Example: Exemplar-based Image Editing with Diffusion Models - [Arxiv] [QA]
- ClimateNeRF: Extreme Weather Synthesis in Neural Radiance Field - [Arxiv] [QA]
- Generalizable Implicit Neural Representations via Instance Pattern Composers - [Arxiv] [QA]
- SVFormer: Semi-supervised Video Transformer for Action Recognition - [Arxiv] [QA]
- CODA-Prompt: COntinual Decomposed Attention-based Prompting for Rehearsal-Free Continual Learning - [Arxiv] [QA]
- ReCo: Region-Controlled Text-to-Image Generation - [Arxiv] [QA]
- Inversion-Based Style Transfer with Diffusion Models - [Arxiv] [QA]
- Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation - [Arxiv] [QA]
- FeTrIL: Feature Translation for Exemplar-Free Class-Incremental Learning - [Arxiv] [QA]
- Robust Mean Teacher for Continual and Gradual Test-Time Adaptation - [Arxiv] [QA]
- Open-vocabulary Attribute Detection - [Arxiv] [QA]
- OReX: Object Reconstruction from Planar Cross-sections Using Neural Fields - [Arxiv] [QA]
- ActMAD: Activation Matching to Align Distributions for Test-Time-Training - [Arxiv] [QA]
- BAD-NeRF: Bundle Adjusted Deblur Neural Radiance Fields - [Arxiv] [QA]
- Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation - [Arxiv] [QA]
- Hand Avatar: Free-Pose Hand Animation and Rendering from Monocular Video - [Arxiv] [QA]
- VoP: Text-Video Co-operative Prompt Tuning for Cross-Modal Retrieval - [Arxiv] [QA]
- Texts as Images in Prompt Tuning for Multi-Label Image Recognition - [Arxiv] [QA]
- Integrally Pre-Trained Transformer Pyramid Networks - [Arxiv] [QA]
- PNI : Industrial Anomaly Detection using Position and Neighborhood Information - [Arxiv] [QA]
- Improving Robust Generalization by Direct PAC-Bayesian Bound Minimization - [Arxiv] [QA]
- Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks - [Arxiv] [QA]
- PermutoSDF: Fast Multi-View Reconstruction with Implicit Surfaces using Permutohedral Lattices - [Arxiv] [QA]
- CASSPR: Cross Attention Single Scan Place Recognition - [Arxiv] [QA]
- AeDet: Azimuth-invariant Multi-view 3D Object Detection - [Arxiv] [QA]
- Person Image Synthesis via Denoising Diffusion Model - [Arxiv] [QA]
- Instant Volumetric Head Avatars - [Arxiv] [QA]
- Shortcomings of Top-Down Randomization-Based Sanity Checks for Evaluations of Deep Neural Network Explanations - [Arxiv] [QA]
- EDICT: Exact Diffusion Inversion via Coupled Transformations - [Arxiv] [QA]
- OCTET: Object-aware Counterfactual Explanations - [Arxiv] [QA]
- DETRs with Collaborative Hybrid Assignments Training - [Arxiv] [QA]
- GlowGAN: Unsupervised Learning of HDR Images from LDR Images in the Wild - [Arxiv] [QA]
- DOLCE: A Model-Based Probabilistic Diffusion Framework for Limited-Angle CT Reconstruction - [Arxiv] [QA]
- Out-of-Candidate Rectification for Weakly Supervised Semantic Segmentation - [Arxiv] [QA]
- SPIn-NeRF: Multiview Segmentation and Perceptual Inpainting with Neural Radiance Fields - [Arxiv] [QA]
- Efficient Frequency Domain-based Transformers for High-Quality Image Deblurring - [Arxiv] [QA]
- SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation - [Arxiv] [QA]
- DiffDreamer: Towards Consistent Unsupervised Single-view Scene Extrapolation with Conditional Diffusion Models - [Arxiv] [QA]
- Explaining Image Classifiers with Multiscale Directional Image Representation - [Arxiv] [QA]
- Backdoor Cleansing with Unlabeled Data - [Arxiv] [QA]
- Level-S$^2$fM: Structure from Motion on Neural Level Set of Implicit Surfaces - [Arxiv] [QA]
- One Eye is All You Need: Lightweight Ensembles for Gaze Estimation with Single Encoders - [Arxiv] [QA]
- Multi-Directional Subspace Editing in Style-Space - [Arxiv] [QA]
- Visual Dexterity: In-hand Dexterous Manipulation from Depth - [Arxiv] [QA]
- SceneComposer: Any-Level Semantic Image Synthesis - [Arxiv] [QA]
- SPARF: Neural Radiance Fields from Sparse and Noisy Poses - [Arxiv] [QA]
- PLIKS: A Pseudo-Linear Inverse Kinematic Solver for 3D Human Body Estimation - [Arxiv] [QA]
- Teaching Structured Vision&Language Concepts to Vision&Language Models - [Arxiv] [QA]
- Multitask Vision-Language Prompt Tuning - [Arxiv] [QA]
- ESLAM: Efficient Dense SLAM System Based on Hybrid Representation of Signed Distance Fields - [Arxiv] [QA]
- PointCLIP V2: Prompting CLIP and GPT for Powerful 3D Open-world Learning - [Arxiv] [QA]
- Shape, Pose, and Appearance from a Single Image via Bootstrapped Radiance Field Inversion - [Arxiv] [QA]
- Guided Depth Super-Resolution by Deep Anisotropic Diffusion - [Arxiv] [QA]
- Efficient Second-Order Plane Adjustment - [Arxiv] [QA]
- Local-to-Global Registration for Bundle-Adjusting Neural Radiance Fields - [Arxiv] [QA]
- Delving StyleGAN Inversion for Image Editing: A Foundation Latent Space Viewpoint - [Arxiv] [QA]
- SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training - [Arxiv] [QA]
- MATE: Masked Autoencoders are Online 3D Test-Time Learners - [Arxiv] [QA]
- Blur Interpolation Transformer for Real-World Motion from Blur - [Arxiv] [QA]
- DyNCA: Real-time Dynamic Texture Synthesis Using Neural Cellular Automata - [Arxiv] [QA]
- From Node Interaction to Hop Interaction: New Effective and Scalable Graph Learning Paradigm - [Arxiv] [QA]
- Few-shot Non-line-of-sight Imaging with Signal-surface Collaborative Regularization - [Arxiv] [QA]
- Instance-specific and Model-adaptive Supervision for Semi-supervised Semantic Segmentation - [Arxiv] [QA]
- DeSTSeg: Segmentation Guided Denoising Student-Teacher for Anomaly Detection - [Arxiv] [QA]
- Neural Dependencies Emerging from Learning Massive Categories - [Arxiv] [QA]
- SeeABLE: Soft Discrepancies and Bounded Contrastive Learning for Exposing Deepfakes - [Arxiv] [QA]
- DrapeNet: Garment Generation and Self-Supervised Draping - [Arxiv] [QA]
- Investigating Prompt Engineering in Diffusion Models - [Arxiv] [QA]
- Next3D: Generative Neural Texture Rasterization for 3D-Aware Head Avatars - [Arxiv] [QA]
- NeuMap: Neural Coordinate Mapping by Auto-Transdecoder for Camera Localization - [Arxiv] [QA]
- Vision Transformer with Super Token Sampling - [Arxiv] [QA]
- Language in a Bottle: Language Model Guided Concept Bottlenecks for Interpretable Image Classification - [Arxiv] [QA]
- You Need Multiple Exiting: Dynamic Early Exiting for Accelerating Unified Vision Language Model - [Arxiv] [QA]
- DynIBaR: Neural Dynamic Image-Based Rendering - [Arxiv] [QA]
- The Stack: 3 TB of permissively licensed source code - [Arxiv] [QA]
- Minimizing the Accumulated Trajectory Error to Improve Dataset Distillation - [Arxiv] [QA]
- Leveraging per Image-Token Consistency for Vision-Language Pre-training - [Arxiv] [QA]
- DYNAFED: Tackling Client Data Heterogeneity with Global Dynamics - [Arxiv] [QA]
- Learning to Generate Image Embeddings with User-level Differential Privacy - [Arxiv] [QA]
- DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting - [Arxiv] [QA]
- Passive Micron-scale Time-of-Flight with Sunlight Interferometry - [Arxiv] [QA]
- EDGE: Editable Dance Generation From Music - [Arxiv] [QA]
- Parallel Diffusion Models of Operator and Image for Blind Inverse Problems - [Arxiv] [QA]
- Solving 3D Inverse Problems using Pre-trained 2D Diffusion Models - [Arxiv] [QA]
- LidarGait: Benchmarking 3D Gait Recognition with Point Clouds - [Arxiv] [QA]
- MatrixVT: Efficient Multi-Camera to BEV Transformation for 3D Perception - [Arxiv] [QA]
- Tired of Over-smoothing? Stress Graph Drawing Is All You Need! - [Arxiv] [QA]
- A Practical Stereo Depth System for Smart Glasses - [Arxiv] [QA]
- Where is my Wallet? Modeling Object Proposal Sets for Egocentric Visual Query Localization - [Arxiv] [QA]
- Magic3D: High-Resolution Text-to-3D Content Creation - [Arxiv] [QA]
- BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision - [Arxiv] [QA]
- PAL: Program-aided Language Models - [Arxiv] [QA]
- Visual Programming: Compositional visual reasoning without training - [Arxiv] [QA]
- Task Residual for Tuning Vision-Language Models - [Arxiv] [QA]
- Patch-Craft Self-Supervised Training for Correlated Image Denoising - [Arxiv] [QA]
- SPACE: Speech-driven Portrait Animation with Controllable Expression - [Arxiv] [QA]
- Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks - [Arxiv] [QA]
- Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information - [Arxiv] [QA]
- InstructPix2Pix: Learning to Follow Image Editing Instructions - [Arxiv] [QA]
- CAE v2: Context Autoencoder with CLIP Target - [Arxiv] [QA]
- Null-text Inversion for Editing Real Images using Guided Diffusion Models - [Arxiv] [QA]
- MOTRv2: Bootstrapping End-to-End Multi-Object Tracking by Pretrained Object Detectors - [Arxiv] [QA]
- I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision - [Arxiv] [QA]
- EfficientTrain: Exploring Generalized Curriculum Learning for Training Visual Backbones - [Arxiv] [QA]
- AligNeRF: High-Fidelity Neural Radiance Fields via Alignment-Aware Training - [Arxiv] [QA]
- CRAFT: Concept Recursive Activation FacTorization for Explainability - [Arxiv] [QA]
- UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer - [Arxiv] [QA]
- DETRDistill: A Universal Knowledge Distillation Framework for DETR-families - [Arxiv] [QA]
- UMFuse: Unified Multi View Fusion for Human Editing applications - [Arxiv] [QA]
- Task-aware Retrieval with Instructions - [Arxiv] [QA]
- AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders - [Arxiv] [QA]
- Token Turing Machines - [Arxiv] [QA]
- MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis - [Arxiv] [QA]
- Holistic Evaluation of Language Models - [Arxiv] [QA]
- Galactica: A Large Language Model for Science - [Arxiv] [QA]
- Stare at What You See: Masked Image Modeling without Reconstruction - [Arxiv] [QA]
- A Generalized Framework for Video Instance Segmentation - [Arxiv] [QA]
- Addressing the issue of stochastic environments and local decision-making in multi-objective reinforcement learning - [Arxiv] [QA]
- Consistent Direct Time-of-Flight Video Depth Super-Resolution - [Arxiv] [QA]
- R-Pred: Two-Stage Motion Prediction Via Tube-Query Attention-Based Trajectory Refinement - [Arxiv] [QA]
- PromptCap: Prompt-Guided Task-Aware Image Captioning - [Arxiv] [QA]
- Versatile Diffusion: Text, Images and Variations All in One Diffusion Model - [Arxiv] [QA]
- Is Style All You Need? Dependencies Between Emotion and GST-based Speaker Recognition - [Arxiv] [QA]
- Uncertainty-aware Gait Recognition via Learning from Dirichlet Distribution-based Evidence - [Arxiv] [QA]
- Teaching Algorithmic Reasoning via In-context Learning - [Arxiv] [QA]
- DINER: Disorder-Invariant Implicit Neural Representation - [Arxiv] [QA]
- EVA: Exploring the Limits of Masked Visual Representation Learning at Scale - [Arxiv] [QA]
- Follow the Wisdom of the Crowd: Effective Text Generation via Minimum Bayes Risk Decoding - [Arxiv] [QA]
- Latent-NeRF for Shape-Guided Generation of 3D Shapes and Textures - [Arxiv] [QA]
- Imagination is All You Need! Curved Contrastive Learning for Abstract Sequence Modeling Utilized on Long Short-Term Dialogue Planning - [Arxiv] [QA]
- PKCAM: Previous Knowledge Channel Attention Module - [Arxiv] [QA]
- MLIC: Multi-Reference Entropy Model for Learned Image Compression - [Arxiv] [QA]
- Fcaformer: Forward Cross Attention in Hybrid Vision Transformer - [Arxiv] [QA]
- ParCNetV2: Oversized Kernel with Enhanced Attention - [Arxiv] [QA]
- Joint Data Deepening-and-Prefetching for Energy-Efficient Edge Learning - [Arxiv] [QA]
- BiViT: Extremely Compressed Binary Vision Transformer - [Arxiv] [QA]
- OverFlow: Putting flows on top of neural transducers for better TTS - [Arxiv] [QA]
- VGFlow: Visibility guided Flow Network for Human Reposing - [Arxiv] [QA]
- Residual Degradation Learning Unfolding Framework with Mixing Priors across Spectral and Spatial for Compressive Spectral Imaging - [Arxiv] [QA]
- SCOTCH and SODA: A Transformer Video Shadow Detection Framework - [Arxiv] [QA]
- Large Language Models Meet Harry Potter: A Bilingual Dataset for Aligning Dialogue Agents with Characters - [Arxiv] [QA]
- CXTrack: Improving 3D Point Cloud Tracking with Contextual Information - [Arxiv] [QA]
- MARLIN: Masked Autoencoder for facial video Representation LearnINg - [Arxiv] [QA]
- OpenGait: Revisiting Gait Recognition Toward Better Practicality - [Arxiv] [QA]
- Rewards Encoding Environment Dynamics Improves Preference-based Reinforcement Learning - [Arxiv] [QA]
- Probabilistic Debiasing of Scene Graphs - [Arxiv] [QA]
- Phase-Shifting Coder: Predicting Accurate Orientation in Oriented Object Detection - [Arxiv] [QA]
- Masked Contrastive Representation Learning - [Arxiv] [QA]
- Delay Embedded Echo-State Network: A Predictor for Partially Observed Systems - [Arxiv] [QA]
- High-Quality Entity Segmentation - [Arxiv] [QA]
- OneFormer: One Transformer to Rule Universal Image Segmentation - [Arxiv] [QA]
- MMDialog: A Large-scale Multi-turn Dialogue Dataset Towards Multi-modal Open-domain Conversation - [Arxiv] [QA]
- Secure Aggregation Is Not All You Need: Mitigating Privacy Attacks with Noise Tolerance in Federated Learning - [Arxiv] [QA]
- GAPartNet: Cross-Category Domain-Generalizable Object Perception and Manipulation via Generalizable and Actionable Parts - [Arxiv] [QA]
- Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models - [Arxiv] [QA]
- BLOOM: A 176B-Parameter Open-Access Multilingual Language Model - [Arxiv] [QA]
- NoiSER: Noise is All You Need for Low-Light Image Enhancement - [Arxiv] [QA]
- Cross-Attention is all you need: Real-Time Streaming Transformers for Personalised Speech Enhancement - [Arxiv] [QA]
- Self-conditioned Embedding Diffusion for Text Generation - [Arxiv] [QA]
- $BT^2$: Backward-compatible Training with Basis Transformation - [Arxiv] [QA]
- Common Pets in 3D: Dynamic New-View Synthesis of Real-Life Deformable Categories - [Arxiv] [QA]
- A Unified Pyramid Recurrent Network for Video Frame Interpolation - [Arxiv] [QA]
- Rickrolling the Artist: Injecting Backdoors into Text Encoders for Text-to-Image Synthesis - [Arxiv] [QA]
- Large Language Models Are Human-Level Prompt Engineers - [Arxiv] [QA]
- Crosslingual Generalization through Multitask Finetuning - [Arxiv] [QA]
- Progressive Transformation Learning for Leveraging Virtual Images in Training - [Arxiv] [QA]
- PINTO: Faithful Language Reasoning Using Prompt-Generated Rationales - [Arxiv] [QA]
- eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers - [Arxiv] [QA]
- The Enemy of My Enemy is My Friend: Exploring Inverse Adversaries for Improving Adversarial Training - [Arxiv] [QA]
- The Perils of Learning From Unlabeled Data: Backdoor Attacks on Semi-supervised Learning - [Arxiv] [QA]
- CARE: Causality Reasoning for Empathetic Responses by Conditional Graph Generation - [Arxiv] [QA]
- SSD-LM: Semi-autoregressive Simplex-based Diffusion Language Model for Text Generation and Modular Control - [Arxiv] [QA]
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers - [Arxiv] [QA]
- DanZero: Mastering GuanDan Game with Reinforcement Learning - [Arxiv] [QA]
- DiffusER: Discrete Diffusion via Edit-based Reconstruction - [Arxiv] [QA]
- A simple, efficient and scalable contrastive masked autoencoder for learning visual representations - [Arxiv] [QA]
- Temporal-Viewpoint Transportation Plan for Skeletal Few-shot Action Recognition - [Arxiv] [QA]
- Saliency Can Be All You Need In Contrastive Self-Supervised Learning - [Arxiv] [QA]
- STPrompt: Semantic-guided and Task-driven prompts for Effective Few-shot Classification - [Arxiv] [QA]
- Rawgment: Noise-Accounted RAW Augmentation Enables Recognition in a Wide Variety of Environments - [Arxiv] [QA]
- Vox-Fusion: Dense Tracking and Mapping with Voxel-based Neural Implicit Representation - [Arxiv] [QA]
- ACES: Translation Accuracy Challenge Sets for Evaluating Machine Translation Metrics - [Arxiv] [QA]
- Working Alliance Transformer for Psychotherapy Dialogue Classification - [Arxiv] [QA]
- FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion - [Arxiv] [QA]
- Contrastive Decoding: Open-ended Text Generation as Optimization - [Arxiv] [QA]
- Streaming Radiance Fields for 3D Video Synthesis - [Arxiv] [QA]
- Implicit Identity Leakage: The Stumbling Block to Improving Deepfake Detection Generalization - [Arxiv] [QA]
- Contrastive Search Is What You Need For Neural Text Generation - [Arxiv] [QA]
- FineD-Eval: Fine-grained Automatic Dialogue-Level Evaluation - [Arxiv] [QA]
- Towards Robust Recommender Systems via Triple Cooperative Defense - [Arxiv] [QA]
- Dichotomy of Control: Separating What You Can Control from What You Cannot - [Arxiv] [QA]
- Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task - [Arxiv] [QA]
- GlassesGAN: Eyewear Personalization using Synthetic Appearance Discovery and Targeted Subspace Modeling - [Arxiv] [QA]
- Efficiently Trained Low-Resource Mongolian Text-to-Speech System Based On FullConv-TTS - [Arxiv] [QA]
- 10 hours data is all you need - [Arxiv] [QA]
- Overview of Dialogue Robot Competition 2022 - [Arxiv] [QA]
- DANLI: Deliberative Agent for Following Natural Language Instructions - [Arxiv] [QA]
- Towards Efficient Dialogue Pre-training with Transferable and Interpretable Latent Structure - [Arxiv] [QA]
- Collaborative Reasoning on Multi-Modal Semantic Graphs for Video-Grounded Dialogue Generation - [Arxiv] [QA]
- There Is No Standard Answer: Knowledge-Grounded Dialogue Generation with Adversarial Activated Multi-Reference Learning - [Arxiv] [QA]
- WikiWhy: Answering and Explaining Cause-and-Effect Questions - [Arxiv] [QA]
- Large Language Models Can Self-Improve - [Arxiv] [QA]
- i-MAE: Are Latent Representations in Masked Autoencoders Linearly Separable? - [Arxiv] [QA]
- Scaling Instruction-Finetuned Language Models - [Arxiv] [QA]
- On the Feasibility of Cross-Task Transfer with Model-Based Reinforcement Learning - [Arxiv] [QA]
- Scaling Laws for Reward Model Overoptimization - [Arxiv] [QA]
- A Unified View of Masked Image Modeling - [Arxiv] [QA]
- Co-guiding Net: Achieving Mutual Guidances between Multiple Intent Detection and Slot Filling via Heterogeneous Semantics-Label Graphs - [Arxiv] [QA]
- How to Boost Face Recognition with StyleGAN? - [Arxiv] [QA]
- Bag All You Need: Learning a Generalizable Bagging Strategy for Heterogeneous Objects - [Arxiv] [QA]
- Perceptual Grouping in Contrastive Vision-Language Models - [Arxiv] [QA]
- MotionDeltaCNN: Sparse CNN Inference of Frame Differences in Moving Camera Videos - [Arxiv] [QA]
- DisCup: Discriminator Cooperative Unlikelihood Prompt-tuning for Controllable Text Generation - [Arxiv] [QA]
- Non-Contrastive Learning Meets Language-Image Pre-Training - [Arxiv] [QA]
- Imagic: Text-Based Real Image Editing with Diffusion Models - [Arxiv] [QA]
- Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them - [Arxiv] [QA]
- Multi-Agent Automated Machine Learning - [Arxiv] [QA]
- DiffuSeq: Sequence to Sequence Text Generation with Diffusion Models - [Arxiv] [QA]
- Keep Me Updated! Memory Management in Long-term Conversations - [Arxiv] [QA]
- Scratching Visual Transformer's Back with Uniform Attention - [Arxiv] [QA]
- Data-Efficient Augmentation for Training Neural Networks - [Arxiv] [QA]
- How Mask Matters: Towards Theoretical Understandings of Masked Autoencoders - [Arxiv] [QA]
- Is synthetic data from generative models ready for image recognition? - [Arxiv] [QA]
- DyLoRA: Parameter Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation - [Arxiv] [QA]
- Visual Reinforcement Learning with Self-Supervised 3D Representations - [Arxiv] [QA]
- Unified Vision and Language Prompt Learning - [Arxiv] [QA]
- Visual Classification via Description from Large Language Models - [Arxiv] [QA]
- Language Models of Code are Few-Shot Commonsense Learners - [Arxiv] [QA]
- CUF: Continuous Upsampling Filters - [Arxiv] [QA]
- Retrospectives on the Embodied AI Workshop - [Arxiv] [QA]
- H2RBox: Horizontal Box Annotation is All You Need for Oriented Object Detection - [Arxiv] [QA]
- Explanations from Large Language Models Make Small Reasoners Better - [Arxiv] [QA]
- Large Language Models are few(1)-shot Table Reasoners - [Arxiv] [QA]
- RankT5: Fine-Tuning T5 for Text Ranking with Ranking Losses - [Arxiv] [QA]
- Token-Label Alignment for Vision Transformers - [Arxiv] [QA]
- A Generalist Framework for Panoptic Segmentation of Images and Videos - [Arxiv] [QA]
- Visual Prompting for Adversarial Robustness - [Arxiv] [QA]
- Masked Motion Encoding for Self-Supervised Video Representation Learning - [Arxiv] [QA]
- Multi-Granularity Cross-modal Alignment for Generalized Medical Visual Representation Learning - [Arxiv] [QA]
- BEV-LaneDet: a Simple and Effective 3D Lane Detection Baseline - [Arxiv] [QA]
- ACSeg: Adaptive Conceptualization for Unsupervised Semantic Segmentation - [Arxiv] [QA]
- Habitat-Matterport 3D Semantics Dataset - [Arxiv] [QA]
- OPERA: Omni-Supervised Representation Learning with Hierarchical Supervisions - [Arxiv] [QA]
- Mind's Eye: Grounded Language Model Reasoning through Simulation - [Arxiv] [QA]
- MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model - [Arxiv] [QA]
- It Takes Two: Masked Appearance-Motion Modeling for Self-supervised Video Transformer Pre-training - [Arxiv] [QA]
- BoxTeacher: Exploring High-Quality Pseudo Labels for Weakly Supervised Instance Segmentation - [Arxiv] [QA]
- Multi-Object Navigation with dynamically learned neural implicit representations - [Arxiv] [QA]
- Certified Training: Small Boxes are All You Need - [Arxiv] [QA]
- Denoising Masked AutoEncoders Help Robust Classification - [Arxiv] [QA]
- Iterative Convex Optimization for Model Predictive Control with Discrete-Time High-Order Control Barrier Functions - [Arxiv] [QA]
- Improving Multi-turn Emotional Support Dialogue Generation with Lookahead Strategy Planning - [Arxiv] [QA]
- Uncertainty-Aware Unsupervised Image Deblurring with Deep Residual Prior - [Arxiv] [QA]
- Controllable Dialogue Simulation with In-Context Learning - [Arxiv] [QA]
- Self-supervised Video Representation Learning with Motion-Aware Masked Autoencoders - [Arxiv] [QA]
- Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP - [Arxiv] [QA]
- Don't Lose Yourself! Empathetic Response Generation via Explicit Self-Other Awareness - [Arxiv] [QA]
- ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering - [Arxiv] [QA]
- Is margin all you need? An extensive empirical study of active learning on tabular data - [Arxiv] [QA]
- Large Language Models can Implement Policy Iteration - [Arxiv] [QA]
- GraspCaps: Capsule Networks Are All You Need for Grasping Familiar Objects - [Arxiv] [QA]
- Automatic Chain of Thought Prompting in Large Language Models - [Arxiv] [QA]
- Trans2k: Unlocking the Power of Deep Models for Transparent Object Tracking - [Arxiv] [QA]
- Measuring and Narrowing the Compositionality Gap in Language Models - [Arxiv] [QA]
- Critical Learning Periods for Multisensory Integration in Deep Networks - [Arxiv] [QA]
- A ResNet is All You Need? Modeling A Strong Baseline for Detecting Referable Diabetic Retinopathy in Fundus Images - [Arxiv] [QA]
- FAST: Improving Controllability for Text Generation with Feedback Aware Self-Training - [Arxiv] [QA]
- On Distillation of Guided Diffusion Models - [Arxiv] [QA]
- MaPLe: Multi-modal Prompt Learning - [Arxiv] [QA]
- CLIP model is an Efficient Continual Learner - [Arxiv] [QA]
- A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning - [Arxiv] [QA]
- VIMA: General Robot Manipulation with Multimodal Prompts - [Arxiv] [QA]
- Iterative Vision-and-Language Navigation - [Arxiv] [QA]
- Rainier: Reinforced Knowledge Introspector for Commonsense Question Answering - [Arxiv] [QA]
- Language Models are Multilingual Chain-of-Thought Reasoners - [Arxiv] [QA]
- ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs - [Arxiv] [QA]
- A Distributional Lens for Multi-Aspect Controllable Text Generation - [Arxiv] [QA]
- ReAct: Synergizing Reasoning and Acting in Language Models - [Arxiv] [QA]
- Depth Is All You Need for Monocular 3D Detection - [Arxiv] [QA]
- GLM-130B: An Open Bilingual Pre-trained Model - [Arxiv] [QA]
- Decomposed Prompting: A Modular Approach for Solving Complex Tasks - [Arxiv] [QA]
- Promising or Elusive? Unsupervised Object Segmentation from Real-world Single Images - [Arxiv] [QA]
- Imagen Video: High Definition Video Generation with Diffusion Models - [Arxiv] [QA]
- CorefDiffs: Co-referential and Differential Knowledge Flow in Document Grounded Conversations - [Arxiv] [QA]
- Teaching Yourself: Graph Self-Distillation on Neighborhood for Node Classification - [Arxiv] [QA]
- Exploring The Role of Mean Teachers in Self-supervised Masked Auto-Encoders - [Arxiv] [QA]
- Affection: Learning Affective Explanations for Real-World Visual Data - [Arxiv] [QA]
- Collecting The Puzzle Pieces: Disentangled Self-Driven Human Pose Transfer by Permuting Textures - [Arxiv] [QA]
- Group Personalized Federated Learning - [Arxiv] [QA]
- Centerpoints Are All You Need in Overhead Imagery - [Arxiv] [QA]
- COPILOT: Human-Environment Collision Prediction and Localization from Egocentric Videos - [Arxiv] [QA]
- PlaneDepth: Self-supervised Depth Estimation via Orthogonal Planes - [Arxiv] [QA]
- Knowledge Unlearning for Mitigating Privacy Risks in Language Models - [Arxiv] [QA]
- Extraneousness-Aware Imitation Learning - [Arxiv] [QA]
- Recitation-Augmented Language Models - [Arxiv] [QA]
- Event-based Temporally Dense Optical Flow Estimation with Sequential Learning - [Arxiv] [QA]
- Is Reinforcement Learning (Not) for Natural Language Processing: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization - [Arxiv] [QA]
- Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought - [Arxiv] [QA]
- Masked Spiking Transformer - [Arxiv] [QA]
- Visual Prompt Tuning for Generative Transfer Learning - [Arxiv] [QA]
- Membership Inference Attacks Against Text-to-image Generation Models - [Arxiv] [QA]
- Improving Sample Quality of Diffusion Models Using Self-Attention Guidance - [Arxiv] [QA]
- Mastering Spatial Graph Prediction of Road Networks - [Arxiv] [QA]
- Complexity-Based Prompting for Multi-Step Reasoning - [Arxiv] [QA]
- IntrinsicNeRF: Learning Intrinsic Neural Radiance Fields for Editable Novel View Synthesis - [Arxiv] [QA]
- "Help Me Help the AI": Understanding How Explainability Can Support Human-AI Interaction - [Arxiv] [QA]
- Contrastive Audio-Visual Masked Autoencoder - [Arxiv] [QA]
- NeRF: Neural Radiance Field in 3D Vision, A Comprehensive Review - [Arxiv] [QA]
- Multimodal Analogical Reasoning over Knowledge Graphs - [Arxiv] [QA]
- Bias Mimicking: A Simple Sampling Approach for Bias Mitigation - [Arxiv] [QA]
- Combining Efficient and Precise Sign Language Recognition: Good pose estimation library is all you need - [Arxiv] [QA]
- Sphere-Guided Training of Neural Implicit Surfaces - [Arxiv] [QA]
- SmallCap: Lightweight Image Captioning Prompted with Retrieval Augmentation - [Arxiv] [QA]
- Hiding Visual Information via Obfuscating Adversarial Perturbations - [Arxiv] [QA]
- Learning Transferable Spatiotemporal Representations from Natural Script Knowledge - [Arxiv] [QA]
- State-specific protein-ligand complex structure prediction with a multi-scale deep generative model - [Arxiv] [QA]
- Compositional Semantic Parsing with Large Language Models - [Arxiv] [QA]
- DreamFusion: Text-to-3D using 2D Diffusion - [Arxiv] [QA]
- EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding - [Arxiv] [QA]
- Contrastive Unsupervised Learning of World Model with Invariant Causal Features - [Arxiv] [QA]
- Make-A-Video: Text-to-Video Generation without Text-Video Data - [Arxiv] [QA]
- Dependent Bayesian Lenses: Categories of Bidirectional Markov Kernels with Canonical Bayesian Inversion - [Arxiv] [QA]
- Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning - [Arxiv] [QA]
- Improving alignment of dialogue agents via targeted human judgements - [Arxiv] [QA]
- Learning State-Aware Visual Representations from Audible Interactions - [Arxiv] [QA]
- Sentiment is all you need to win US Presidential elections - [Arxiv] [QA]
- Can Large Language Models Truly Understand Prompts? A Case Study with Negated Prompts - [Arxiv] [QA]
- Deep Fair Clustering via Maximizing and Minimizing Mutual Information: Theory, Algorithm and Metric - [Arxiv] [QA]
- Paraphrasing Is All You Need for Novel Object Captioning - [Arxiv] [QA]
- Generating Formal Safety Assurances for High-Dimensional Reachability - [Arxiv] [QA]
- Probabilistic Planning with Partially Ordered Preferences over Temporal Goals - [Arxiv] [QA]
- All are Worth Words: A ViT Backbone for Diffusion Models - [Arxiv] [QA]
- Promptagator: Few-shot Dense Retrieval From 8 Examples - [Arxiv] [QA]
- Control Barrier Functions in UGVs for Kinematic Obstacle Avoidance: A Collision Cone Approach - [Arxiv] [QA]
- ProgPrompt: Generating Situated Robot Task Plans using Large Language Models - [Arxiv] [QA]
- Pretraining the Vision Transformer using self-supervised methods for vision based Deep Reinforcement Learning - [Arxiv] [QA]
- Generate rather than Retrieve: Large Language Models are Strong Context Generators - [Arxiv] [QA]
- Target-Guided Open-Domain Conversation Planning - [Arxiv] [QA]
- Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering - [Arxiv] [QA]
- Hierarchical Temporal Transformer for 3D Hand Pose Estimation and Action Recognition from Egocentric RGB Videos - [Arxiv] [QA]
- Space-time tradeoffs of lenses and optics via higher category theory - [Arxiv] [QA]
- Audio-Visual Fusion for Emotion Recognition in the Valence-Arousal Space Using Joint Cross-Attention - [Arxiv] [QA]
- Loc-NeRF: Monte Carlo Localization using Neural Radiance Fields - [Arxiv] [QA]
- Learning Symbolic Model-Agnostic Loss Functions via Meta-Learning - [Arxiv] [QA]
- Semantic Segmentation using Neural Ordinary Differential Equations - [Arxiv] [QA]
- A Benchmark for Understanding and Generating Dialogue between Characters in Stories - [Arxiv] [QA]
- Revisiting Rolling Shutter Bundle Adjustment: Toward Accurate and Fast Solution - [Arxiv] [QA]
- Psychologically-informed chain-of-thought prompts for metaphor understanding in large language models - [Arxiv] [QA]
- CurveFormer: 3D Lane Detection by Curve Propagation with Curve Queries and Attention - [Arxiv] [QA]
- Spatial-then-Temporal Self-Supervised Learning for Video Correspondence - [Arxiv] [QA]
- Test-Time Training with Masked Autoencoders - [Arxiv] [QA]
- Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models - [Arxiv] [QA]
- A Geometric Perspective on Variational Autoencoders - [Arxiv] [QA]
- Not As Easy As You Think -- Experiences and Lessons Learnt from Trying to Create a Bottom-Up Visualization Image Typology - [Arxiv] [QA]
- PaLI: A Jointly-Scaled Multilingual Language-Image Model - [Arxiv] [QA]
- Certified Robustness to Word Substitution Ranking Attack for Neural Ranking Models - [Arxiv] [QA]
- Order-Disorder: Imitation Adversarial Attacks for Black-box Neural Ranking Models - [Arxiv] [QA]
- Hard Negatives or False Negatives: Correcting Pooling Bias in Training Neural Ranking Models - [Arxiv] [QA]
- Chain of Explanation: New Prompting Method to Generate Higher Quality Natural Language Explanation for Implicit Hate Speech - [Arxiv] [QA]
- Exploring Target Representations for Masked Autoencoders - [Arxiv] [QA]
- Developing a multi-variate prediction model for the detection of COVID-19 from Crowd-sourced Respiratory Voice Data - [Arxiv] [QA]
- Enhancing the Self-Universality for Transferable Targeted Attacks - [Arxiv] [QA]
- What does a platypus look like? Generating customized prompts for zero-shot image classification - [Arxiv] [QA]
- MimCo: Masked Image Modeling Pre-training with Contrastive Teacher - [Arxiv] [QA]
- EnergonAI: An Inference System for 10-100 Billion Parameter Transformer Models - [Arxiv] [QA]
- Selective Annotation Makes Language Models Better Few-Shot Learners - [Arxiv] [QA]
- RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection - [Arxiv] [QA]
- An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling - [Arxiv] [QA]
- TogetherNet: Bridging Image Restoration and Object Detection Together via Dynamic Enhancement Learning - [Arxiv] [QA]
- Synthesizing Photorealistic Virtual Humans Through Cross-modal Disentanglement - [Arxiv] [QA]
- Petals: Collaborative Inference and Fine-tuning of Large Models - [Arxiv] [QA]
- Visual Prompting via Image Inpainting - [Arxiv] [QA]
- FLAME: Free-form Language-based Motion Synthesis & Editing - [Arxiv] [QA]
- LANIT: Language-Driven Image-to-Image Translation for Unlabeled Data - [Arxiv] [QA]
- Rethinking Conversational Recommendations: Is Decision Tree All You Need? - [Arxiv] [QA]
- Faithful Reasoning Using Large Language Models - [Arxiv] [QA]
- Benchmark Results for Bookshelf Organization Problem as Mixed Integer Nonlinear Program with Mode Switch and Collision Avoidance - [Arxiv] [QA]
- TrojViT: Trojan Insertion in Vision Transformers - [Arxiv] [QA]
- Multi-Outputs Is All You Need For Deblur - [Arxiv] [QA]
- MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining - [Arxiv] [QA]
- Masked Autoencoders Enable Efficient Knowledge Distillers - [Arxiv] [QA]
- DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation - [Arxiv] [QA]
- Understanding Diffusion Models: A Unified Perspective - [Arxiv] [QA]
- Towards Efficient Use of Multi-Scale Features in Transformer-Based Object Detectors - [Arxiv] [QA]
- Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned - [Arxiv] [QA]
- Improving Personality Consistency in Conversation by Persona Extending - [Arxiv] [QA]
- Extending nnU-Net is all you need - [Arxiv] [QA]
- Hierarchically Decomposed Graph Convolutional Networks for Skeleton-Based Action Recognition - [Arxiv] [QA]
- Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks - [Arxiv] [QA]
- Are disentangled representations all you need to build speaker anonymization systems? - [Arxiv] [QA]
- Scattered or Connected? An Optimized Parameter-efficient Tuning Approach for Information Retrieval - [Arxiv] [QA]
- A Contrastive Pre-training Approach to Learn Discriminative Autoencoder for Dense Retrieval - [Arxiv] [QA]
- Label-Noise Learning with Intrinsically Long-Tailed Data - [Arxiv] [QA]
- DenseShift: Towards Accurate and Efficient Low-Bit Power-of-Two Quantization - [Arxiv] [QA]
- SAFARI: Versatile and Efficient Evaluations for Robustness of Interpretability - [Arxiv] [QA]
- Pseudo-Labels Are All You Need - [Arxiv] [QA]
- CASE: Aligning Coarse-to-Fine Cognition and Affection for Empathetic Response Generation - [Arxiv] [QA]
- Differentiable Architecture Search with Random Features - [Arxiv] [QA]
- GraVoS: Voxel Selection for 3D Point-Cloud Detection - [Arxiv] [QA]
- Towards Open-vocabulary Scene Graph Generation with Prompt-based Finetuning - [Arxiv] [QA]
- Significance of Skeleton-based Features in Virtual Try-On - [Arxiv] [QA]
- CorpusBrain: Pre-train a Generative Retrieval Model for Knowledge-Intensive Language Tasks - [Arxiv] [QA]
- Semi-Supervised Video Inpainting with Cycle Consistency Constraints - [Arxiv] [QA]
- Long-Short History of Gradients is All You Need: Detecting Malicious and Unreliable Clients in Federated Learning - [Arxiv] [QA]
- Dropout is NOT All You Need to Prevent Gradient Leakage - [Arxiv] [QA]
- MILAN: Masked Image Pretraining on Language Assisted Representation - [Arxiv] [QA]
- PSUMNet: Unified Modality Part Streams are All You Need for Efficient Pose-based Action Recognition - [Arxiv] [QA]
- Assessing the Unitary RNN as an End-to-End Compositional Model of Syntax - [Arxiv] [QA]
- Safety and Performance, Why not Both? Bi-Objective Optimized Model Compression toward AI Software Deployment - [Arxiv] [QA]
- Generative Action Description Prompts for Skeleton-based Action Recognition - [Arxiv] [QA]
- Understanding Masked Image Modeling via Learning Occlusion Invariant Feature - [Arxiv] [QA]
- Follow Me: Conversation Planning for Target-driven Recommendation Dialogue Systems - [Arxiv] [QA]
- Atlas: Few-shot Learning with Retrieval Augmented Language Models - [Arxiv] [QA]
- BlenderBot 3: a deployed conversational agent that continually learns to responsibly engage - [Arxiv] [QA]
- PointConvFormer: Revenge of the Point-based Convolution - [Arxiv] [QA]
- DropKey - [Arxiv] [QA]
- Prompt Tuning for Generative Multimodal Pretrained Models - [Arxiv] [QA]
- Masked Vision and Language Modeling for Multi-modal Representation Learning - [Arxiv] [QA]
- Detecting Multivariate Time Series Anomalies with Zero Known Label - [Arxiv] [QA]
- Character Generation through Self-Supervised Vectorization - [Arxiv] [QA]
- Prompt-to-Prompt Image Editing with Cross Attention Control - [Arxiv] [QA]
- An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion - [Arxiv] [QA]
- ViP3D: End-to-end Visual Trajectory Prediction via 3D Agent Queries - [Arxiv] [QA]
- Reduction Rules and ILP Are All You Need: Minimal Directed Feedback Vertex Set - [Arxiv] [QA]
- OmniCity: Omnipotent City Understanding with Multi-level and Multi-view Images - [Arxiv] [QA]
- Neural network layers as parametric spans - [Arxiv] [QA]
- Generative Bias for Robust Visual Question Answering - [Arxiv] [QA]
- Composable Text Controls in Latent Space with ODEs - [Arxiv] [QA]
- Search for or Navigate to? Dual Adaptive Thinking for Object Navigation - [Arxiv] [QA]
- SdAE: Self-distillated Masked Autoencoder - [Arxiv] [QA]
- Less is More: Consistent Video Depth Estimation with Masked Frames Modeling - [Arxiv] [QA]
- MobileNeRF: Exploiting the Polygon Rasterization Pipeline for Efficient Neural Field Rendering on Mobile Architectures - [Arxiv] [QA]
- A Survey on Masked Autoencoder for Self-supervised Learning in Vision and Beyond - [Arxiv] [QA]
- Language Models Can Teach Themselves to Program Better - [Arxiv] [QA]
- Visual Recognition by Request - [Arxiv] [QA]
- Contrastive Masked Autoencoders are Stronger Vision Learners - [Arxiv] [QA]
- Group DETR: Fast DETR Training with Group-Wise One-to-Many Assignment - [Arxiv] [QA]
- DETRs with Hybrid Matching - [Arxiv] [QA]
- Visual correspondence-based explanations improve AI robustness and human-AI team accuracy - [Arxiv] [QA]
- Dynamic Planning in Open-Ended Dialogue using Reinforcement Learning - [Arxiv] [QA]
- Is GPT-3 all you need for Visual Question Answering in Cultural Heritage? - [Arxiv] [QA]
- Neural Generation Meets Real People: Building a Social, Informative Open-Domain Dialogue Agent - [Arxiv] [QA]
- All you need for horizontal slicing in 5G network - [Arxiv] [QA]
- Divide and Conquer: 3D Point Cloud Instance Segmentation With Point-Wise Binarization - [Arxiv] [QA]
- Adaptive Soft Contrastive Learning - [Arxiv] [QA]
- Omni3D: A Large Benchmark and Model for 3D Object Detection in the Wild - [Arxiv] [QA]
- Language Model Cascades - [Arxiv] [QA]
- Tailoring Self-Supervision for Supervised Learning - [Arxiv] [QA]
- GRIT: Faster and Better Image captioning Transformer Using Dual Visual Features - [Arxiv] [QA]
- FedDM: Iterative Distribution Matching for Communication-Efficient Federated Learning - [Arxiv] [QA]
- Overlooked factors in concept-based explanations: Dataset choice, concept learnability, and human capability - [Arxiv] [QA]
- Geometric Features Informed Multi-person Human-object Interaction Recognition in Videos - [Arxiv] [QA]
- Consistent Query Answering for Expressive Constraints under Tuple-Deletion Semantics - [Arxiv] [QA]
- FedX: Unsupervised Federated Learning with Cross Knowledge Distillation - [Arxiv] [QA]
- Label2Label: A Language Modeling Framework for Multi-Attribute Learning - [Arxiv] [QA]
- Class-incremental Novel Class Discovery - [Arxiv] [QA]
- UniFusion: Unified Multi-view Fusion Transformer for Spatial-Temporal Representation in Bird's-Eye-View - [Arxiv] [QA]
- Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding - [Arxiv] [QA]
- Adaptive Assignment for Geometry Aware Local Feature Matching - [Arxiv] [QA]
- SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery - [Arxiv] [QA]
- Knowledge Guided Bidirectional Attention Network for Human-Object Interaction Detection - [Arxiv] [QA]
- Clover: Towards A Unified Video-Language Alignment and Fusion Model - [Arxiv] [QA]
- Position Prediction as an Effective Pretraining Strategy - [Arxiv] [QA]
- Bootstrapped Masked Autoencoders for Vision BERT Pretraining - [Arxiv] [QA]
- Language models show human-like content effects on reasoning - [Arxiv] [QA]
- Masked Autoencoders that Listen - [Arxiv] [QA]
- PointNorm: Dual Normalization is All You Need for Point Cloud Analysis - [Arxiv] [QA]
- Look-ups are not (yet) all you need for deep learning inference - [Arxiv] [QA]
- A Data-Based Perspective on Transfer Learning - [Arxiv] [QA]
- Inner Monologue: Embodied Reasoning through Planning with Language Models - [Arxiv] [QA]
- Towards Hard-Positive Query Mining for DETR-based Human-Object Interaction Detection - [Arxiv] [QA]
- Bootstrapping a User-Centered Task-Oriented Dialogue System - [Arxiv] [QA]
- A Skeleton-aware Graph Convolutional Network for Human-Object Interaction Detection - [Arxiv] [QA]
- LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action - [Arxiv] [QA]
- Training Transformers Together - [Arxiv] [QA]
- Back to the Source: Diffusion-Driven Test-Time Adaptation - [Arxiv] [QA]
- YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors - [Arxiv] [QA]
- Chairs Can be Stood on: Overcoming Object Bias in Human-Object Interaction Detection - [Arxiv] [QA]
- Is a PET all you need? A multi-modal study for Alzheimer's disease using 3D CNNs - [Arxiv] [QA]
- Best Subset Selection with Efficient Primal-Dual Algorithm - [Arxiv] [QA]
- Distance Matters in Human-Object Interaction Detection - [Arxiv] [QA]
- Domain-Independent Deception: Definition, Taxonomy and the Linguistic Cues Debate - [Arxiv] [QA]
- Beyond mAP: Towards better evaluation of instance segmentation - [Arxiv] [QA]
- PVO: Panoptic Visual Odometry - [Arxiv] [QA]
- Explicit Boundary Guided Semi-Push-Pull Contrastive Learning for Supervised Anomaly Detection - [Arxiv] [QA]
- I-ViT: Integer-only Quantization for Efficient Vision Transformer Inference - [Arxiv] [QA]
- WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents - [Arxiv] [QA]
- Towards Robust Referring Video Object Segmentation with Cyclic Relational Consensus - [Arxiv] [QA]
- Computer-assisted Pronunciation Training -- Speech synthesis is almost all you need - [Arxiv] [QA]
- Rationale-Augmented Ensembles in Language Models - [Arxiv] [QA]
- LaserMix for Semi-Supervised LiDAR Semantic Segmentation - [Arxiv] [QA]
- On-Device Training Under 256KB Memory - [Arxiv] [QA]
- Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations - [Arxiv] [QA]
- Automatically Balancing Model Accuracy and Complexity using Solution and Fitness Evolution (SAFE) - [Arxiv] [QA]
- UniDAformer: Unified Domain Adaptive Panoptic Segmentation Transformer via Hierarchical Mask Calibration - [Arxiv] [QA]
- Solving Quantitative Reasoning Problems with Language Models - [Arxiv] [QA]
- BertNet: Harvesting Knowledge Graphs with Arbitrary Relations from Pretrained Language Models - [Arxiv] [QA]
- Solution and Fitness Evolution (SAFE): A Study of Multiobjective Problems - [Arxiv] [QA]
- Solution and Fitness Evolution (SAFE): Coevolving Solutions and Their Objective Functions - [Arxiv] [QA]
- CV 3315 Is All You Need: Semantic Segmentation Competition - [Arxiv] [QA]
- ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings - [Arxiv] [QA]
- Diegetic Representation of Feedback in Open Games - [Arxiv] [QA]
- Temporal Attention Unit: Towards Efficient Spatiotemporal Predictive Learning - [Arxiv] [QA]
- zPROBE: Zero Peek Robustness Checks for Federated Learning - [Arxiv] [QA]
- Task-Adaptive Few-shot Node Classification - [Arxiv] [QA]
- EventNeRF: Neural Radiance Fields from a Single Colour Event Camera - [Arxiv] [QA]
- MaskViT: Masked Visual Pre-Training for Video Prediction - [Arxiv] [QA]
- Rethinking Surgical Instrument Segmentation: A Background Image Can Be All You Need - [Arxiv] [QA]
- CLAMP: Prompt-based Contrastive Learning for Connecting Language and Animal Pose - [Arxiv] [QA]
- Invariant Causal Mechanisms through Distribution Matching - [Arxiv] [QA]
- GODEL: Large-Scale Pre-Training for Goal-Directed Dialog - [Arxiv] [QA]
- KiloNeuS: A Versatile Neural Implicit Surface Representation for Real-Time Rendering - [Arxiv] [QA]
- Questions Are All You Need to Train a Dense Passage Retriever - [Arxiv] [QA]
- LargeKernel3D: Scaling up Kernels in 3D Sparse CNNs - [Arxiv] [QA]
- Marginal Tail-Adaptive Normalizing Flows - [Arxiv] [QA]
- SemMAE: Semantic-Guided Masking for Learning Masked Autoencoders - [Arxiv] [QA]
- Occupancy-MAE: Self-supervised Pre-training Large-scale LiDAR Point Clouds with Masked Occupancy Autoencoders - [Arxiv] [QA]
- DualCoOp: Fast Adaptation to Multi-Label Recognition with Limited Annotations - [Arxiv] [QA]
- All you need is feedback: Communication with block attention feedback codes - [Arxiv] [QA]
- Gender Artifacts in Visual Datasets - [Arxiv] [QA]
- Landscape Learning for Neural Network Inversion - [Arxiv] [QA]
- MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge - [Arxiv] [QA]
- Sheaf Neural Networks with Connection Laplacians - [Arxiv] [QA]
- PRANC: Pseudo RAndom Networks for Compacting deep models - [Arxiv] [QA]
- OmniMAE: Single Model Masked Pretraining on Images and Videos - [Arxiv] [QA]
- Switchable Representation Learning Framework with Self-compatibility - [Arxiv] [QA]
- Zero-Shot Video Question Answering via Frozen Bidirectional Language Models - [Arxiv] [QA]
- Balancing Discriminability and Transferability for Source-Free Domain Adaptation - [Arxiv] [QA]
- Architectural Backdoors in Neural Networks - [Arxiv] [QA]
- Masked Frequency Modeling for Self-Supervised Visual Pre-Training - [Arxiv] [QA]
- Masked Siamese ConvNets - [Arxiv] [QA]
- Structured Sparsity Learning for Efficient Video Super-Resolution - [Arxiv] [QA]
- Emergent Abilities of Large Language Models - [Arxiv] [QA]
- A smile is all you need: Predicting limiting activity coefficients from SMILES with natural language processing - [Arxiv] [QA]
- GRAM-HD: 3D-Consistent Image Generation at High Resolution with Generative Radiance Manifolds - [Arxiv] [QA]
- Proximal Splitting Adversarial Attacks for Semantic Segmentation - [Arxiv] [QA]
- LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling - [Arxiv] [QA]
- Confidence Score for Source-Free Unsupervised Domain Adaptation - [Arxiv] [QA]
- Transformers are Meta-Reinforcement Learners - [Arxiv] [QA]
- Look, Radiate, and Learn: Self-Supervised Localisation via Radio-Visual Correspondence - [Arxiv] [QA]
- Language Models are General-Purpose Interfaces - [Arxiv] [QA]
- Mining Multi-Label Samples from Single Positive Labels - [Arxiv] [QA]
- Building a Personalized Dialogue System with Prompt-Tuning - [Arxiv] [QA]
- Balanced Product of Calibrated Experts for Long-Tailed Recognition - [Arxiv] [QA]
- Referring Image Matting - [Arxiv] [QA]
- Masked Autoencoders are Robust Data Augmentors - [Arxiv] [QA]
- Neural Prompt Search - [Arxiv] [QA]
- Extreme Masking for Learning Instance and Distributed Visual Representations - [Arxiv] [QA]
- On Data Scaling in Masked Image Modeling - [Arxiv] [QA]
- Simple Cues Lead to a Strong Multi-Object Tracker - [Arxiv] [QA]
- Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models - [Arxiv] [QA]
- Spatial-temporal Concept based Explanation of 3D ConvNets - [Arxiv] [QA]
- Words are all you need? Language as an approximation for human similarity judgments - [Arxiv] [QA]
- MobileOne: An Improved One millisecond Mobile Backbone - [Arxiv] [QA]
- Towards Understanding Why Mask-Reconstruction Pretraining Helps in Downstream Tasks - [Arxiv] [QA]
- Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding - [Arxiv] [QA]
- Unsupervised Context Aware Sentence Representation Pretraining for Multi-lingual Dense Retrieval - [Arxiv] [QA]
- Spatial Parsing and Dynamic Temporal Pooling networks for Human-Object Interaction detection - [Arxiv] [QA]
- TriBYOL: Triplet BYOL for Self-Supervised Representation Learning - [Arxiv] [QA]
- Self-Knowledge Distillation based Self-Supervised Learning for Covid-19 Detection from Chest X-Ray Images - [Arxiv] [QA]
- Self-supervised Learning for Human Activity Recognition Using 700,000 Person-days of Wearable Data - [Arxiv] [QA]
- Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation - [Arxiv] [QA]
- A Neural Corpus Indexer for Document Retrieval - [Arxiv] [QA]
- Revisiting Realistic Test-Time Training: Sequential Inference and Adaptation by Anchored Clustering - [Arxiv] [QA]
- Is More Data All You Need? A Causal Exploration - [Arxiv] [QA]
- Learning to Break the Loop: Analyzing and Mitigating Repetitions for Neural Text Generation - [Arxiv] [QA]
- Making Large Language Models Better Reasoners with Step-Aware Verifier - [Arxiv] [QA]
- PIDNet: A Real-time Semantic Segmentation Network Inspired by PID Controllers - [Arxiv] [QA]
- Delving into the Openness of CLIP - [Arxiv] [QA]
- Video-based Human-Object Interaction Detection from Tubelet Tokens - [Arxiv] [QA]
- Revisiting the "Video" in Video-Language Understanding - [Arxiv] [QA]
- PROMISSING: Pruning Missing Values in Neural Networks - [Arxiv] [QA]
- A Survey on Computationally Efficient Neural Architecture Search - [Arxiv] [QA]
- Learning Probabilistic Topological Representations Using Discrete Morse Theory - [Arxiv] [QA]
- MultiHiertt: Numerical Reasoning over Multi Hierarchical Tabular and Textual Data - [Arxiv] [QA]
- PETRv2: A Unified Framework for 3D Perception from Multi-Camera Images - [Arxiv] [QA]
- Siamese Image Modeling for Self-Supervised Vision Representation Learning - [Arxiv] [QA]
- Multi-View Active Fine-Grained Recognition - [Arxiv] [QA]
- Prefix Conditioning Unifies Language and Label Supervision - [Arxiv] [QA]
- Unified Recurrence Modeling for Video Action Anticipation - [Arxiv] [QA]
- NIPQ: Noise proxy-based Integrated Pseudo-Quantization - [Arxiv] [QA]
- Efficient Self-supervised Vision Pretraining with Local Masked Reconstruction - [Arxiv] [QA]
- ORC: Network Group-based Knowledge Distillation using Online Role Change - [Arxiv] [QA]
- MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining - [Arxiv] [QA]
- Evolving Domain Generalization - [Arxiv] [QA]
- itKD: Interchange Transfer-based Knowledge Distillation for 3D Object Detection - [Arxiv] [QA]
- FeatER: An Efficient Network for Human Reconstruction via Feature Map-Based TransformER - [Arxiv] [QA]
- Non-Markovian Reward Modelling from Trajectory Labels via Interpretable Multiple Instance Learning - [Arxiv] [QA]
- Self-Supervised Visual Representation Learning with Semantic Grouping - [Arxiv] [QA]
- GMML is All you Need - [Arxiv] [QA]
- Knowledge Distillation for 6D Pose Estimation by Aligning Distributions of Local Predictions - [Arxiv] [QA]
- HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling - [Arxiv] [QA]
- FRAug: Tackling Federated Learning with Non-IID Features via Representation Augmentation - [Arxiv] [QA]
- Robust Weight Perturbation for Adversarial Training - [Arxiv] [QA]
- EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction - [Arxiv] [QA]
- CPED: A Large-Scale Chinese Personalized and Emotional Dialogue Dataset for Conversational AI - [Arxiv] [QA]
- CoNT: Contrastive Neural Text Generation - [Arxiv] [QA]
- Frustratingly Easy Regularization on Representation Can Boost Deep Reinforcement Learning - [Arxiv] [QA]
- SupMAE: Supervised Masked Autoencoders Are Efficient Vision Learners - [Arxiv] [QA]
- Additive Higher-Order Factorization Machines - [Arxiv] [QA]
- A Closer Look at Self-Supervised Lightweight Vision Transformers - [Arxiv] [QA]
- Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training - [Arxiv] [QA]
- Object-wise Masked Autoencoders for Fast Pre-training - [Arxiv] [QA]
- Semi-supervised Semantics-guided Adversarial Training for Trajectory Prediction - [Arxiv] [QA]
- Controllable Text Generation with Neurally-Decomposed Oracle - [Arxiv] [QA]
- Diffusion-LM Improves Controllable Text Generation - [Arxiv] [QA]
- Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation - [Arxiv] [QA]
- Bayesian Robust Graph Contrastive Learning - [Arxiv] [QA]
- GIT: A Generative Image-to-text Transformer for Vision and Language - [Arxiv] [QA]
- Prototype Based Classification from Hierarchy to Fairness - [Arxiv] [QA]
- Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN - [Arxiv] [QA]
- Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions - [Arxiv] [QA]
- Quark: Controllable Text Generation with Reinforced Unlearning - [Arxiv] [QA]
- Revealing the Dark Secrets of Masked Image Modeling - [Arxiv] [QA]
- Green Hierarchical Vision Transformer for Masked Image Modeling - [Arxiv] [QA]
- Physical-World Optical Adversarial Attacks on 3D Face Recognition - [Arxiv] [QA]
- MixMAE: Mixed and Masked Autoencoder for Efficient Pretraining of Hierarchical Vision Transformers - [Arxiv] [QA]
- Pretraining is All You Need for Image-to-Image Translation - [Arxiv] [QA]
- RSTGen: Imbuing Fine-Grained Interpretable Control into Long-Form Text Generators - [Arxiv] [QA]
- Masked Jigsaw Puzzle: A Versatile Position Embedding for Vision Transformers - [Arxiv] [QA]
- TALM: Tool Augmented Language Models - [Arxiv] [QA]
- Large Language Models are Zero-Shot Reasoners - [Arxiv] [QA]
- Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations - [Arxiv] [QA]
- Learning Context-Aware Service Representation for Service Recommendation in Workflow Composition - [Arxiv] [QA]
- Decoder Denoising Pretraining for Semantic Segmentation - [Arxiv] [QA]
- PointDistiller: Structured Knowledge Distillation Towards Efficient and Compact 3D Detection - [Arxiv] [QA]
- FaceMAE: Privacy-Preserving Face Recognition via Masked Autoencoders - [Arxiv] [QA]
- GraphMAE: Self-Supervised Masked Graph Autoencoders - [Arxiv] [QA]
- All You Need Is Logs: Improving Code Completion by Learning from Anonymous IDE Usage Logs - [Arxiv] [QA]
- Swept-Angle Synthetic Wavelength Interferometry - [Arxiv] [QA]
- Least-to-Most Prompting Enables Complex Reasoning in Large Language Models - [Arxiv] [QA]
- A Review of Safe Reinforcement Learning: Methods, Theory and Applications - [Arxiv] [QA]
- Adaptive Fairness-Aware Online Meta-Learning for Changing Environments - [Arxiv] [QA]
- Sampling Is All You Need on Modeling Long-Term User Behaviors for CTR Prediction - [Arxiv] [QA]
- Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality - [Arxiv] [QA]
- Can Foundation Models Wrangle Your Data? - [Arxiv] [QA]
- RankGen: Improving Text Generation with Large Ranking Models - [Arxiv] [QA]
- Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning - [Arxiv] [QA]
- Masked Image Modeling with Denoising Contrast - [Arxiv] [QA]
- Integrally Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection - [Arxiv] [QA]
- Learning Graph Structure from Convolutional Mixtures - [Arxiv] [QA]
- Target-Guided Dialogue Response Generation Using Commonsense and Data Augmentation - [Arxiv] [QA]
- Masked Autoencoders As Spatiotemporal Learners - [Arxiv] [QA]
- Global Contrast Masked Autoencoders Are Powerful Pathological Representation Learners - [Arxiv] [QA]
- Positional Information is All You Need: A Novel Pipeline for Self-Supervised SVDE from Videos - [Arxiv] [QA]
- Need is All You Need: Homeostatic Neural Networks Adapt to Concept Shift - [Arxiv] [QA]
- A CLIP-Hitchhiker's Guide to Long Video Retrieval - [Arxiv] [QA]
- Robust Losses for Learning Value Functions - [Arxiv] [QA]
- LogicSolver: Towards Interpretable Math Word Problem Solving with Logical Prompt-enhanced Learning - [Arxiv] [QA]
- BBDM: Image-to-image Translation with Brownian Bridge Diffusion Models - [Arxiv] [QA]
- Diffusion Models for Adversarial Purification - [Arxiv] [QA]
- Long-term Control for Dialogue Generation: Methods and Evaluation - [Arxiv] [QA]
- Aligning Robot Representations with Humans - [Arxiv] [QA]
- A Generalist Agent - [Arxiv] [QA]
- An Empirical Study Of Self-supervised Learning Approaches For Object Detection With Transformers - [Arxiv] [QA]
- Reduce Information Loss in Transformers for Pluralistic Image Inpainting - [Arxiv] [QA]
- Learning to Answer Visual Questions from Web Videos - [Arxiv] [QA]
- When does dough become a bagel? Analyzing the remaining mistakes on ImageNet - [Arxiv] [QA]
- A for-loop is all you need. For solving the inverse problem in the case of personalized tumor growth modeling - [Arxiv] [QA]
- Activating More Pixels in Image Super-Resolution Transformer - [Arxiv] [QA]
- ConvMAE: Masked Convolution Meets Masked Autoencoders - [Arxiv] [QA]
- Towards a Progression-Aware Autonomous Dialogue Agent - [Arxiv] [QA]
- The Unreliability of Explanations in Few-shot Prompting for Textual Reasoning - [Arxiv] [QA]
- Spiking Graph Convolutional Networks - [Arxiv] [QA]
- A Simple Contrastive Learning Objective for Alleviating Neural Text Degeneration - [Arxiv] [QA]
- Lexical Knowledge Internalization for Neural Dialog Generation - [Arxiv] [QA]
- End2End Multi-View Feature Matching with Differentiable Pose Optimization - [Arxiv] [QA]
- Learning to Transfer Prompts for Text Generation - [Arxiv] [QA]
- OPT: Open Pre-trained Transformer Language Models - [Arxiv] [QA]
- Building a Role Specified Open-Domain Dialogue System Leveraging Large-Scale Language Models - [Arxiv] [QA]
- SVTR: Scene Text Recognition with a Single Visual Model - [Arxiv] [QA]
- Flamingo: a Visual Language Model for Few-Shot Learning - [Arxiv] [QA]
- ARCTIC: A Dataset for Dexterous Bimanual Hand-Object Manipulation - [Arxiv] [QA]
- The Wisdom of Crowds: Temporal Progressive Attention for Early Action Prediction - [Arxiv] [QA]
- Power Bundle Adjustment for Large-Scale 3D Reconstruction - [Arxiv] [QA]
- Masked Spectrogram Prediction For Self-Supervised Audio Pre-Training - [Arxiv] [QA]
- Control Globally, Understand Locally: A Global-to-Local Hierarchical Graph Network for Emotional Support Conversation - [Arxiv] [QA]
- MM-TTA: Multi-Modal Test-Time Adaptation for 3D Semantic Segmentation - [Arxiv] [QA]
- PolyLoss: A Polynomial Expansion Perspective of Classification Loss Functions - [Arxiv] [QA]
- MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval - [Arxiv] [QA]
- SceneTrilogy: On Human Scene-Sketch and its Complementarity with Photo and Text - [Arxiv] [QA]
- Masked Image Modeling Advances 3D Medical Image Analysis - [Arxiv] [QA]
- LoL: A Comparative Regularization Loss over Query Reformulation Losses for Pseudo-Relevance Feedback - [Arxiv] [QA]
- Evaluating Interpolation and Extrapolation Performance of Neural Retrieval Models - [Arxiv] [QA]
- Simulating Fluids in Real-World Still Images - [Arxiv] [QA]
- RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning - [Arxiv] [QA]
- Meet Your Favorite Character: Open-domain Chatbot Mimicking Fictional Characters with only a Few Utterances - [Arxiv] [QA]
- Pre-train a Discriminative Text Encoder for Dense Retrieval via Contrastive Span Prediction - [Arxiv] [QA]
- Autoregressive Search Engines: Generating Substrings as Document Identifiers - [Arxiv] [QA]
- Sharper Utility Bounds for Differentially Private Models - [Arxiv] [QA]
- Towards Multi-Turn Empathetic Dialogs with Positive Emotion Elicitation - [Arxiv] [QA]
- Event Transition Planning for Open-ended Text Generation - [Arxiv] [QA]
- Human-Object Interaction Detection via Disentangled Transformer - [Arxiv] [QA]
- Visio-Linguistic Brain Encoding - [Arxiv] [QA]
- The Devil is in the Frequency: Geminated Gestalt Autoencoder for Self-Supervised Visual Pre-Training - [Arxiv] [QA]
- CPFair: Personalized Consumer and Producer Fairness Re-ranking for Recommender Systems - [Arxiv] [QA]
- Interactiveness Field in Human-Object Interactions - [Arxiv] [QA]
- Improving Passage Retrieval with Zero-Shot Question Generation - [Arxiv] [QA]
- INSTA-BNN: Binary Neural Network with INSTAnce-aware Threshold - [Arxiv] [QA]
- A Personalized Dialogue Generator with Implicit User Persona Detection - [Arxiv] [QA]
- LaMemo: Language Modeling with Look-Ahead Memory - [Arxiv] [QA]
- Measuring Compositional Consistency for Video Question Answering - [Arxiv] [QA]
- Neighborhood Attention Transformer - [Arxiv] [QA]
- Masked Siamese Networks for Label-Efficient Learning - [Arxiv] [QA]
- BEHAVE: Dataset and Method for Tracking Human Object Interactions - [Arxiv] [QA]
- GPT-NeoX-20B: An Open-Source Autoregressive Language Model - [Arxiv] [QA]
- Learning Convolutional Neural Networks in the Frequency Domain - [Arxiv] [QA]
- Transparent Shape from a Single View Polarization Image - [Arxiv] [QA]
- Neural Topic Modeling of Psychotherapy Sessions - [Arxiv] [QA]
- MGM: A meshfree geometric multilevel method for systems arising from elliptic equations on point cloud surfaces - [Arxiv] [QA]
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback - [Arxiv] [QA]
- Bootstrap Motion Forecasting With Self-Consistent Constraints - [Arxiv] [QA]
- Stylized Knowledge-Grounded Dialogue Generation via Disentangled Template Rewriting - [Arxiv] [QA]
- Deep Annotation of Therapeutic Working Alliance in Psychotherapy - [Arxiv] [QA]
- Overlapping Word Removal is All You Need: Revisiting Data Imbalance in Hope Speech Detection - [Arxiv] [QA]
- Exploring the Universal Vulnerability of Prompt-based Learning Paradigm - [Arxiv] [QA]
- Focal Length and Object Pose Estimation via Render and Compare - [Arxiv] [QA]
- Category-Aware Transformer Network for Better Human-Object Interaction Detection - [Arxiv] [QA]
- Consistency Learning via Decoding Path Augmentation for Transformers in Human Object Interaction Detection - [Arxiv] [QA]
- DualPrompt: Complementary Prompting for Rehearsal-free Continual Learning - [Arxiv] [QA]
- Representation Learning by Detecting Incorrect Location Embeddings - [Arxiv] [QA]
- Learning Trajectory-Aware Transformer for Video Super-Resolution - [Arxiv] [QA]
- Federated Learning with Partial Model Personalization - [Arxiv] [QA]
- Unsupervised Prompt Learning for Vision-Language Models - [Arxiv] [QA]
- Interacting with Non-Cooperative User: A New Paradigm for Proactive Dialogue Policy - [Arxiv] [QA]
- Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality - [Arxiv] [QA]
- Knowledge Infused Decoding - [Arxiv] [QA]
- Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection - [Arxiv] [QA]
- Towards An End-to-End Framework for Flow-Guided Video Inpainting - [Arxiv] [QA]
- There Are a Thousand Hamlets in a Thousand People's Eyes: Enhancing Knowledge-grounded Dialogue with Personal Memory - [Arxiv] [QA]
- Efficient Test-Time Model Adaptation without Forgetting - [Arxiv] [QA]
- C3KG: A Chinese Commonsense Conversation Knowledge Graph - [Arxiv] [QA]
- CHORE: Contact, Human and Object REconstruction from a single RGB image - [Arxiv] [QA]
- Can language models learn from explanations in context? - [Arxiv] [QA]
- PaLM: Scaling Language Modeling with Pathways - [Arxiv] [QA]
- At the Locus of Performance: Quantifying the Effects of Copious 3D-Stacked Cache on HPC Workloads - [Arxiv] [QA]
- $\textit{latent}$-GLAT: Glancing at Latent Variables for Parallel Text Generation - [Arxiv] [QA]
- Learning Neural Acoustic Fields - [Arxiv] [QA]
- Do As I Can, Not As I Say: Grounding Language in Robotic Affordances - [Arxiv] [QA]
- MultiMAE: Multi-modal Multi-task Masked Autoencoders - [Arxiv] [QA]
- Value Gradient weighted Model-Based Reinforcement Learning - [Arxiv] [QA]
- PRADA: Practical Black-Box Adversarial Attacks against Neural Ranking Models - [Arxiv] [QA]
- Probabilistic Implicit Scene Completion - [Arxiv] [QA]
- Dynamic Focus-aware Positional Queries for Semantic Segmentation - [Arxiv] [QA]
- What to look at and where: Semantic and Spatial Refined Transformer for detecting human-object interactions - [Arxiv] [QA]
- Implicit Feedback for Dense Passage Retrieval: A Counterfactual Approach - [Arxiv] [QA]
- Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language - [Arxiv] [QA]
- End-to-End Zero-Shot HOI Detection via Vision and Language Knowledge Distillation - [Arxiv] [QA]
- Distill-VQ: Learning Retrieval Oriented Vector Quantization By Distilling Knowledge from Dense Embeddings - [Arxiv] [QA]
- TransGeo: Transformer Is All You Need for Cross-view Image Geo-localization - [Arxiv] [QA]
- Exploring Visual Prompts for Adapting Large-Scale Models - [Arxiv] [QA]
- A 23 MW data centre is all you need - [Arxiv] [QA]
- R2L: Distilling Neural Radiance Field to Neural Light Field for Efficient Novel View Synthesis - [Arxiv] [QA]
- Templates for 3D Object Pose Estimation Revisited: Generalization to New Objects and Robustness to Occlusions - [Arxiv] [QA]
- SimVQA: Exploring Simulated Environments for Visual Question Answering - [Arxiv] [QA]
- Self-distillation Augmented Masked Autoencoders for Histopathological Image Classification - [Arxiv] [QA]
- MAE-AST: Masked Autoencoding Audio Spectrogram Transformer - [Arxiv] [QA]
- PromptDet: Towards Open-vocabulary Detection using Uncurated Images - [Arxiv] [QA]
- A Sequential Quadratic Programming Approach to the Solution of Open-Loop Generalized Nash Equilibria - [Arxiv] [QA]
- Instance Relation Graph Guided Source-Free Domain Adaptive Object Detection - [Arxiv] [QA]
- Causal de Finetti: On the Identification of Invariant Causal Structure in Exchangeable Data - [Arxiv] [QA]
- Training Compute-Optimal Large Language Models - [Arxiv] [QA]
- Graph Neural Networks are Dynamic Programmers - [Arxiv] [QA]
- mc-BEiT: Multi-choice Discretization for Image BERT Pre-training - [Arxiv] [QA]
- MAT: Mask-Aware Transformer for Large Hole Image Inpainting - [Arxiv] [QA]
- Parameter-efficient Model Adaptation for Vision Transformers - [Arxiv] [QA]
- Generalizing Few-Shot NAS with Gradient Matching - [Arxiv] [QA]
- Neural Vocoder is All You Need for Speech Super-resolution - [Arxiv] [QA]
- Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model - [Arxiv] [QA]
- A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition - [Arxiv] [QA]
- MSTR: Multi-Scale Transformer for End-to-End Human-Object Interaction Detection - [Arxiv] [QA]
- ImFace: A Nonlinear 3D Morphable Face Model with Implicit Neural Representations - [Arxiv] [QA]
- STaR: Bootstrapping Reasoning With Reasoning - [Arxiv] [QA]
- UV Volumes for Real-time Rendering of Editable Free-view Human Performance - [Arxiv] [QA]
- Discovering Human-Object Interaction Concepts via Self-Compositional Learning - [Arxiv] [QA]
- How Severe is Benchmark-Sensitivity in Video Self-Supervised Learning? - [Arxiv] [QA]
- GEN-VLKT: Simplify Association and Enhance Interaction Understanding for HOI Detection - [Arxiv] [QA]
- AutoML for Deep Recommender Systems: A Survey - [Arxiv] [QA]
- Spectral Measurement Sparsification for Pose-Graph SLAM - [Arxiv] [QA]
- Continual Test-Time Domain Adaptation - [Arxiv] [QA]
- MISC: A MIxed Strategy-Aware Model Integrating COMET for Emotional Support Conversation - [Arxiv] [QA]
- A Comparative Survey of Deep Active Learning - [Arxiv] [QA]
- Linking Emergent and Natural Languages via Corpus Transfer - [Arxiv] [QA]
- MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection - [Arxiv] [QA]
- What to Hide from Your Students: Attention-Guided Masked Image Modeling - [Arxiv] [QA]
- VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training - [Arxiv] [QA]
- Pathways: Asynchronous Distributed Dataflow for ML - [Arxiv] [QA]
- Ev-TTA: Test-Time Adaptation for Event-Based Object Recognition - [Arxiv] [QA]
- Deep Frequency Filtering for Domain Generalization - [Arxiv] [QA]
- Visual Prompt Tuning - [Arxiv] [QA]
- Self-supervision through Random Segments with Autoregressive Coding (RandSAC) - [Arxiv] [QA]
- Language modeling via stochastic processes - [Arxiv] [QA]
- Masked Discrimination for Self-Supervised Learning on Point Clouds - [Arxiv] [QA]
- Self-Consistency Improves Chain of Thought Reasoning in Language Models - [Arxiv] [QA]
- The Conceptual VAE - [Arxiv] [QA]
- Teaching language models to support answers with verified quotes - [Arxiv] [QA]
- Towards Large-Scale Interpretable Knowledge Graph Reasoning for Dialogue Systems - [Arxiv] [QA]
- Iwin: Human-Object Interaction Detection via Transformer with Irregular Windows - [Arxiv] [QA]
- CoWs on Pasture: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation - [Arxiv] [QA]
- On Robust Prefix-Tuning for Text Classification - [Arxiv] [QA]
- Exploring Motion Ambiguity and Alignment for High-Quality Video Frame Interpolation - [Arxiv] [QA]
- SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition - [Arxiv] [QA]
- Generative Principal Component Analysis - [Arxiv] [QA]
- Monotonic Differentiable Sorting Networks - [Arxiv] [QA]
- A Framework and Benchmark for Deep Batch Active Learning for Regression - [Arxiv] [QA]
- RoMe: A Robust Metric for Evaluating Natural Language Generation - [Arxiv] [QA]
- PLANET: Dynamic Content Planning in Autoregressive Transformers for Long-form Text Generation - [Arxiv] [QA]
- Memorizing Transformers - [Arxiv] [QA]
- Multi-Stage Prompting for Knowledgeable Dialogue Generation - [Arxiv] [QA]
- Differentiable DAG Sampling - [Arxiv] [QA]
- Iteratively Prompt Pre-trained Language Models for Chain of Thought - [Arxiv] [QA]
- Multi-View Document Representation Learning for Open-Domain Dense Retrieval - [Arxiv] [QA]
- Unified Visual Transformer Compression - [Arxiv] [QA]
- Vision-Based Manipulators Need to Also See from Their Hands - [Arxiv] [QA]
- Augmenting Document Representations for Dense Retrieval with Interpolation and Perturbation - [Arxiv] [QA]
- ActFormer: A GAN-based Transformer towards General Action-Conditioned 3D Human Motion Generation - [Arxiv] [QA]
- Distraction is All You Need for Fairness - [Arxiv] [QA]
- ScienceWorld: Is your Agent Smarter than a 5th Grader? - [Arxiv] [QA]
- Respecting causality is all you need for training physics-informed neural networks - [Arxiv] [QA]
- All in One: Exploring Unified Video-Language Pre-training - [Arxiv] [QA]
- Orchestrated Value Mapping for Reinforcement Learning - [Arxiv] [QA]
- Masked Autoencoders for Point Cloud Self-supervised Learning - [Arxiv] [QA]
- PromptChainer: Chaining Large Language Model Prompts through Visual Programming - [Arxiv] [QA]
- Categories of Differentiable Polynomial Circuits for Machine Learning - [Arxiv] [QA]
- BiBERT: Accurate Fully Binarized BERT - [Arxiv] [QA]
- MISF: Multi-level Interactive Siamese Filtering for High-Fidelity Image Inpainting - [Arxiv] [QA]
- LaPraDoR: Unsupervised Pretrained Dense Retriever for Zero-Shot Text Retrieval - [Arxiv] [QA]
- An Interpretable Neuro-Symbolic Reasoning Framework for Task-Oriented Dialogue Generation - [Arxiv] [QA]
- Long Time No See! Open-Domain Conversation with Long-Term Persona Memory - [Arxiv] [QA]
- Conditional Prompt Learning for Vision-Language Models - [Arxiv] [QA]
- Back to the Feature: Classical 3D Features are (Almost) All You Need for 3D Anomaly Detection - [Arxiv] [QA]
- Self Pre-training with Masked Autoencoders for Medical Image Classification and Segmentation - [Arxiv] [QA]
- MVP: Multimodality-guided Visual Pre-training - [Arxiv] [QA]
- Internet-augmented language models through few-shot prompting for open-domain question answering - [Arxiv] [QA]
- All You Need is LUV: Unsupervised Collection of Labeled Images using Invisible UV Fluorescent Indicators - [Arxiv] [QA]
- Source-free Video Domain Adaptation by Learning Temporal Consistency for Action Recognition - [Arxiv] [QA]
- Kubric: A scalable dataset generator - [Arxiv] [QA]
- Self-supervised Implicit Glyph Attention for Text Recognition - [Arxiv] [QA]
- Adaptive Cross-Layer Attention for Image Restoration - [Arxiv] [QA]
- Structured Pruning is All You Need for Pruning CNNs at Initialization - [Arxiv] [QA]
- Neural Simulated Annealing - [Arxiv] [QA]
- Training language models to follow instructions with human feedback - [Arxiv] [QA]
- BoMD: Bag of Multi-label Descriptors for Noisy Chest X-ray Classification - [Arxiv] [QA]
- Video is All You Need: Attacking PPG-based Biometric Authentication - [Arxiv] [QA]
- Incremental Transformer Structure Enhanced Image Inpainting with Masking Positional Encoding - [Arxiv] [QA]
- Towards a unified view of unsupervised non-local methods for image denoising: the NL-Ridge approach - [Arxiv] [QA]
- DynamicRetriever: A Pre-training Model-based IR System with Neither Sparse nor Dense Index - [Arxiv] [QA]
- One Model is All You Need: Multi-Task Learning Enables Simultaneous Histology Image Segmentation and Classification - [Arxiv] [QA]
- A Proximal Algorithm for Sampling - [Arxiv] [QA]
- Rethinking and Refining the Distinct Metric - [Arxiv] [QA]
- Filter-enhanced MLP is All You Need for Sequential Recommendation - [Arxiv] [QA]
- The Spectral Bias of Polynomial Neural Networks - [Arxiv] [QA]
- AugESC: Dialogue Augmentation with Large Language Models for Emotional Support Conversation - [Arxiv] [QA]
- Ask2Mask: Guided Data Selection for Masked Speech Modeling - [Arxiv] [QA]
- Effective Actor-centric Human-object Interaction Detection - [Arxiv] [QA]
- All You Need Is Supervised Learning: From Imitation Learning to Meta-RL With Upside Down RL - [Arxiv] [QA]
- Auto-scaling Vision Transformers without Training - [Arxiv] [QA]
- COLD Decoding: Energy-based Constrained Text Generation with Langevin Dynamics - [Arxiv] [QA]
- Socialformer: Social Network Inspired Long Document Modeling for Document Ranking - [Arxiv] [QA]
- PyTorch Geometric Signed Directed: A Software Package on Graph Neural Networks for Signed and Directed Graphs - [Arxiv] [QA]
- Adversarial Attacks on Speech Recognition Systems for Mission-Critical Applications: A Survey - [Arxiv] [QA]
- 1-WL Expressiveness Is (Almost) All You Need - [Arxiv] [QA]
- Pseudo Numerical Methods for Diffusion Models on Manifolds - [Arxiv] [QA]
- Bayes-Optimal Classifiers under Group Fairness - [Arxiv] [QA]
- Bit-wise Training of Neural Network Weights - [Arxiv] [QA]
- Highlighting Object Category Immunity for the Generalization of Human-Object Interaction Detection - [Arxiv] [QA]
- Unsupervised Multiple-Object Tracking with a Dynamical Variational Autoencoder - [Arxiv] [QA]
- Masked prediction tasks: a parameter identifiability view - [Arxiv] [QA]
- Gaussian Mixture Convolution Networks - [Arxiv] [QA]
- cosFormer: Rethinking Softmax in Attention - [Arxiv] [QA]
- Task-Agnostic Graph Explanations - [Arxiv] [QA]
- Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations - [Arxiv] [QA]
- Don't Lie to Me! Robust and Efficient Explainability with Verified Perturbation Analysis - [Arxiv] [QA]
- A precortical module for robust CNNs to light variations - [Arxiv] [QA]
- Exploring Discontinuity for Video Frame Interpolation - [Arxiv] [QA]
- Transformer Memory as a Differentiable Search Index - [Arxiv] [QA]
- What Do They Capture? -- A Structural Analysis of Pre-Trained Language Models for Source Code - [Arxiv] [QA]
- Domain Adaptation via Prompt Learning - [Arxiv] [QA]
- FlowEval: A Consensus-Based Dialogue Evaluation Framework Using Segment Act Flows - [Arxiv] [QA]
- A Contrastive Framework for Neural Text Generation - [Arxiv] [QA]
- Conditional Contrastive Learning with Kernel - [Arxiv] [QA]
- Domain Adversarial Training: A Game Perspective - [Arxiv] [QA]
- InPars: Data Augmentation for Information Retrieval using Large Language Models - [Arxiv] [QA]
- Neural Sheaf Diffusion: A Topological Perspective on Heterophily and Oversmoothing in GNNs - [Arxiv] [QA]
- GiraffeDet: A Heavy-Neck Paradigm for Object Detection - [Arxiv] [QA]
- Distillation with Contrast is All You Need for Self-Supervised Point Cloud Representation Learning - [Arxiv] [QA]
- Towards Compositional Adversarial Robustness: Generalizing Adversarial Training to Composite Semantic Perturbations - [Arxiv] [QA]
- MaskGIT: Masked Generative Image Transformer - [Arxiv] [QA]
- FMP: Toward Fair Graph Message Passing against Topology Bias - [Arxiv] [QA]
- DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models - [Arxiv] [QA]
- How to Understand Masked Autoencoders - [Arxiv] [QA]
- Survey of Hallucination in Natural Language Generation - [Arxiv] [QA]
- GrASP: Gradient-Based Affordance Selection for Planning - [Arxiv] [QA]
- PolicyCleanse: Backdoor Detection and Mitigation in Reinforcement Learning - [Arxiv] [QA]
- data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language - [Arxiv] [QA]
- Corrupted Image Modeling for Self-Supervised Visual Pre-Training - [Arxiv] [QA]
- Message Passing Neural PDE Solvers - [Arxiv] [QA]
- Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics - [Arxiv] [QA]
- OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework - [Arxiv] [QA]
- Context Autoencoder for Self-Supervised Representation Learning - [Arxiv] [QA]
- User Satisfaction Estimation with Sequential Dialogue Act Modeling in Goal-oriented Conversational Systems - [Arxiv] [QA]
- DEVO: Depth-Event Camera Visual Odometry in Challenging Conditions - [Arxiv] [QA]
- One-Nearest-Neighbor Search is All You Need for Minimax Optimal Regression and Classification - [Arxiv] [QA]
- Webly Supervised Concept Expansion for General Purpose Vision Models - [Arxiv] [QA]
- Structured Prediction Problem Archive - [Arxiv] [QA]
- mSLAM: Massively multilingual joint pre-training for speech and text - [Arxiv] [QA]
- A Survey on Retrieval-Augmented Text Generation - [Arxiv] [QA]
- ColloSSL: Collaborative Self-Supervised Learning for Human Activity Recognition - [Arxiv] [QA]
- Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics - [Arxiv] [QA]
- CLA-NeRF: Category-Level Articulated Neural Radiance Field - [Arxiv] [QA]
- Signing the Supermask: Keep, Hide, Invert - [Arxiv] [QA]
- Few-Shot Backdoor Attacks on Visual Object Tracking - [Arxiv] [QA]
- Causal Explanations and XAI - [Arxiv] [QA]
- Adversarial Masking for Self-Supervised Learning - [Arxiv] [QA]
- Robust Imitation Learning from Corrupted Demonstrations - [Arxiv] [QA]
- Rebalancing Batch Normalization for Exemplar-based Class-Incremental Learning - [Arxiv] [QA]
- ItôWave: Itô Stochastic Differential Equation Is All You Need For Wave Generation - [Arxiv] [QA]
- Counterfactual Plans under Distributional Ambiguity - [Arxiv] [QA]
- DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR - [Arxiv] [QA]
- Mask-based Latent Reconstruction for Reinforcement Learning - [Arxiv] [QA]
- Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model - [Arxiv] [QA]
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models - [Arxiv] [QA]
- DiscoScore: Evaluating Text Generation with BERT and Discourse Coherence - [Arxiv] [QA]
- Natural Language Descriptions of Deep Visual Features - [Arxiv] [QA]
- Explanatory Learning: Beyond Empiricism in Neural Networks - [Arxiv] [QA]
- RePaint: Inpainting using Denoising Diffusion Probabilistic Models - [Arxiv] [QA]
- Learning Graph Augmentations to Learn Graph Representations - [Arxiv] [QA]
- Patches Are All You Need? - [Arxiv] [QA]
- Neural Implicit Surface Evolution - [Arxiv] [QA]
- Universal Online Learning with Unbounded Losses: Memory Is All You Need - [Arxiv] [QA]
- Fast Differentiable Matrix Square Root - [Arxiv] [QA]
- End-to-end Generative Pretraining for Multimodal Video Captioning - [Arxiv] [QA]
- LaMDA: Language Models for Dialog Applications - [Arxiv] [QA]
- Safe Deep RL in 3D Environments using Human Feedback - [Arxiv] [QA]
- CM3: A Causal Masked Multimodal Model of the Internet - [Arxiv] [QA]
- Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents - [Arxiv] [QA]
- GANmouflage: 3D Object Nondetection with Texture Fields - [Arxiv] [QA]
- RePre: Improving Self-Supervised Vision Transformer with Reconstructive Pre-training - [Arxiv] [QA]
- Parameter-free Online Test-time Adaptation - [Arxiv] [QA]
- Progressively Optimized Bi-Granular Document Representation for Scalable Embedding Based Retrieval - [Arxiv] [QA]
- A Survey of Controllable Text Generation using Transformer-based Pre-trained Language Models - [Arxiv] [QA]
- Neural Circuit Architectural Priors for Embodied Control - [Arxiv] [QA]
- SparseDet: Improving Sparsely Annotated Object Detection with Pseudo-positive Mining - [Arxiv] [QA]
- Structure and Semantics Preserving Document Representations - [Arxiv] [QA]
- 3D Face Morphing Attacks: Generation, Vulnerability and Detection - [Arxiv] [QA]
- QuadTree Attention for Vision Transformers - [Arxiv] [QA]
- Categorical Hopfield Networks - [Arxiv] [QA]
- Detecting Human-to-Human-or-Object (H2O) Interactions with DIABOLO - [Arxiv] [QA]
- Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets - [Arxiv] [QA]
- All You Need In Sign Language Production - [Arxiv] [QA]
- C2-CRS: Coarse-to-Fine Contrastive Learning for Conversational Recommender System - [Arxiv] [QA]
- Class-Incremental Continual Learning into the eXtended DER-verse - [Arxiv] [QA]
- Vision Transformer with Deformable Attention - [Arxiv] [QA]