New submissions for Wed, 3 Aug 22 #208

zhuhu00 opened this issue Aug 3, 2022 · 0 comments
zhuhu00 commented Aug 3, 2022

New submissions for Wed, 3 Aug 22

Keyword: SLAM

There is no result

Keyword: odometry

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: lidar

Mitigating Shadows in Lidar Scan Matching using Spherical Voxels

  • Authors: Matthew McDermott, Jason Rife
  • Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.01150
  • Pdf link: https://arxiv.org/pdf/2208.01150
  • Abstract
    In this paper we propose an approach to mitigate shadowing errors in Lidar scan matching, by introducing a preprocessing step based on spherical gridding. Because the grid aligns with the Lidar beam, it is relatively easy to eliminate shadow edges which cause systematic errors in Lidar scan matching. As we show through simulation, our proposed algorithm provides better results than ground-plane removal, the most common existing strategy for shadow mitigation. Unlike ground plane removal, our method applies to arbitrary terrains (e.g. shadows on urban walls, shadows in hilly terrain) while retaining key Lidar points on the ground that are critical for estimating changes in height, pitch, and roll. Our preprocessing algorithm can be used with a range of scan-matching methods; however, for voxel-based scan matching methods, it provides additional benefits by reducing computation costs and more evenly distributing Lidar points among voxels.
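
For intuition, a minimal Python sketch of the spherical-gridding step (not the authors' implementation; the bin counts and range resolution below are illustrative assumptions):

```python
import numpy as np

def spherical_voxel_indices(points, n_az=360, n_el=64, range_res=0.5):
    """Assign each lidar point to a spherical voxel (azimuth, elevation, range).

    points: (N, 3) array of x, y, z in the sensor frame.
    Returns an (N, 3) integer array of voxel indices per point.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)
    az = np.arctan2(y, x)  # [-pi, pi)
    el = np.arcsin(np.clip(z / np.maximum(r, 1e-9), -1.0, 1.0))
    az_idx = ((az + np.pi) / (2 * np.pi) * n_az).astype(int) % n_az
    el_idx = np.clip(((el + np.pi / 2) / np.pi * n_el).astype(int), 0, n_el - 1)
    r_idx = (r / range_res).astype(int)
    # Within one (az_idx, el_idx) column, a jump in r_idx marks a range
    # discontinuity; points just beyond it border a shadow region and can
    # be filtered before scan matching.
    return np.stack([az_idx, el_idx, r_idx], axis=1)
```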

Ithaca365: Dataset and Driving Perception under Repeated and Challenging Weather Conditions

  • Authors: Carlos A. Diaz-Ruiz (1), Youya Xia (1), Yurong You (1), Jose Nino (1), Junan Chen (1), Josephine Monica (1), Xiangyu Chen (1), Katie Luo (1), Yan Wang (1), Marc Emond (1), Wei-Lun Chao (2), Bharath Hariharan (1), Kilian Q. Weinberger (1), Mark Campbell (1) ((1) Cornell University, (2) The Ohio State University)
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.01166
  • Pdf link: https://arxiv.org/pdf/2208.01166
  • Abstract
    Advances in perception for self-driving cars have accelerated in recent years due to the availability of large-scale datasets, typically collected at specific locations and under nice weather conditions. Yet, to achieve the high safety requirement, these perceptual systems must operate robustly under a wide variety of weather conditions including snow and rain. In this paper, we present a new dataset to enable robust autonomous driving via a novel data collection process - data is repeatedly recorded along a 15 km route under diverse scene (urban, highway, rural, campus), weather (snow, rain, sun), time (day/night), and traffic conditions (pedestrians, cyclists and cars). The dataset includes images and point clouds from cameras and LiDAR sensors, along with high-precision GPS/INS to establish correspondence across routes. The dataset includes road and object annotations using amodal masks to capture partial occlusions and 3D bounding boxes. We demonstrate the uniqueness of this dataset by analyzing the performance of baselines in amodal segmentation of road and objects, depth estimation, and 3D object detection. The repeated routes open new research directions in object discovery, continual learning, and anomaly detection. Link to Ithaca365: https://ithaca365.mae.cornell.edu/

Keyword: loop detection

There is no result

Keyword: nerf

T4DT: Tensorizing Time for Learning Temporal 3D Visual Data

  • Authors: Mikhail Usvyatsov, Rafael Ballester-Rippoll, Lina Bashaeva, Konrad Schindler, Gonzalo Ferrer, Ivan Oseledets
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.01421
  • Pdf link: https://arxiv.org/pdf/2208.01421
  • Abstract
    Unlike 2D raster images, there is no single dominant representation for 3D visual data processing. Different formats like point clouds, meshes, or implicit functions each have their strengths and weaknesses. Still, grid representations such as signed distance functions have attractive properties also in 3D. In particular, they offer constant-time random access and are eminently suitable for modern machine learning. Unfortunately, the storage size of a grid grows exponentially with its dimension. Hence they often exceed memory limits even at moderate resolution. This work explores various low-rank tensor formats, including the Tucker, tensor train, and quantics tensor train decompositions, to compress time-varying 3D data. Our method iteratively computes, voxelizes, and compresses each frame's truncated signed distance function and applies tensor rank truncation to condense all frames into a single, compressed tensor that represents the entire 4D scene. We show that low-rank tensor compression provides an extremely compact way to store and query time-varying signed distance functions. It significantly reduces the memory footprint of 4D scenes while surprisingly preserving their geometric quality. Unlike existing iterative learning-based approaches like DeepSDF and NeRF, our method uses a closed-form algorithm with theoretical guarantees.
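
As a toy illustration of the grid-compression idea, the numpy sketch below performs Tucker-style compression of a single 3D grid via truncated HOSVD. It is an assumption-laden stand-in: the paper additionally uses tensor-train and quantics tensor-train formats and condenses the time axis jointly.

```python
import numpy as np

def hosvd_compress(grid, ranks):
    """Truncated higher-order SVD (Tucker) of a dense 3D grid such as one
    TSDF frame. grid: (I, J, K); ranks: (r1, r2, r3) multilinear ranks."""
    factors = []
    for mode, r in enumerate(ranks):
        unfolding = np.moveaxis(grid, mode, 0).reshape(grid.shape[mode], -1)
        u, _, _ = np.linalg.svd(unfolding, full_matrices=False)
        factors.append(u[:, :r])
    core = grid
    for u in factors:
        # Contracting mode 0 each time cycles the axes; after three passes
        # the original axis order is restored.
        core = np.tensordot(core, u.T, axes=([0], [1]))
    return core, factors

def hosvd_reconstruct(core, factors):
    out = core
    for u in factors:
        out = np.tensordot(out, u, axes=([0], [1]))
    return out

# tsdf = ...  # e.g. a (256, 256, 256) truncated signed distance grid
# core, fs = hosvd_compress(tsdf, ranks=(32, 32, 32))
# err = np.linalg.norm(hosvd_reconstruct(core, fs) - tsdf) / np.linalg.norm(tsdf)
```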

Keyword: mapping

Explicit Use of Fourier Spectrum in Generative Adversarial Networks

  • Authors: Soroush Sheikh Gargar
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2208.01265
  • Pdf link: https://arxiv.org/pdf/2208.01265
  • Abstract
    Generative Adversarial Networks have drawn researchers' attention due to their state-of-the-art performance in generating new images with only a dataset of the target distribution. It has been shown that there is a dissimilarity between the spectrum of authentic images and fake ones. Since the Fourier transform is a bijective mapping, it is fair to conclude that the model has a significant problem in learning the original distribution. In this work, we investigate the possible reasons for this drawback in the architecture and mathematical theory of current GANs. Then we propose a new model to reduce the discrepancies between the spectrum of the actual and fake images. To that end, we design a brand new architecture for the frequency domain using the blueprint of geometric deep learning. Then, we experimentally show promising improvements in the quality of the generated images by considering the Fourier domain representation of the original data as a principal feature in the training process.
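
The spectral dissimilarity mentioned above is commonly measured on azimuthally averaged Fourier power spectra. A sketch of such a measurement (an assumed diagnostic, not the paper's code):

```python
import numpy as np

def radial_power_spectrum(img):
    """Azimuthally averaged log power spectrum of a grayscale image."""
    f = np.fft.fftshift(np.fft.fft2(img))
    power = np.abs(f) ** 2
    h, w = img.shape
    yy, xx = np.indices((h, w))
    radius = np.hypot(yy - h // 2, xx - w // 2).astype(int).ravel()
    sums = np.bincount(radius, power.ravel())
    counts = np.bincount(radius)
    return np.log1p(sums / np.maximum(counts, 1))

# spectral_gap = radial_power_spectrum(real_img) - radial_power_spectrum(fake_img)
# GAN-generated images typically deviate in the high-frequency tail.
```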

Overlooked Poses Actually Make Sense: Distilling Privileged Knowledge for Human Motion Prediction

  • Authors: Xiaoning Sun, Qiongjie Cui, Huaijiang Sun, Bin Li, Weiqing Li, Jianfeng Lu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.01302
  • Pdf link: https://arxiv.org/pdf/2208.01302
  • Abstract
    Previous works on human motion prediction follow the pattern of building a mapping relation between the sequence observed and the one to be predicted. However, due to the inherent complexity of multivariate time series data, it still remains a challenge to find the extrapolation relation between motion sequences. In this paper, we present a new prediction pattern, which introduces previously overlooked human poses, to implement the prediction task from the view of interpolation. These poses exist after the predicted sequence, and form the privileged sequence. To be specific, we first propose an InTerPolation learning Network (ITP-Network) that encodes both the observed sequence and the privileged sequence to interpolate the in-between predicted sequence, wherein the embedded Privileged-sequence-Encoder (Priv-Encoder) learns the privileged knowledge (PK) simultaneously. Then, we propose a Final Prediction Network (FP-Network) for which the privileged sequence is not observable, but which is equipped with a novel PK-Simulator that distills PK learned from the previous network. This simulator takes as input the observed sequence, but approximates the behavior of Priv-Encoder, enabling FP-Network to imitate the interpolation process. Extensive experimental results demonstrate that our prediction pattern achieves state-of-the-art performance on the benchmark H3.6M, CMU-Mocap and 3DPW datasets in both short-term and long-term predictions.

Keyword: localization

DSR -- A dual subspace re-projection network for surface anomaly detection

  • Authors: Vitjan Zavrtanik, Matej Kristan, Danijel Skočaj
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.01521
  • Pdf link: https://arxiv.org/pdf/2208.01521
  • Abstract
    The state-of-the-art in discriminative unsupervised surface anomaly detection relies on external datasets for synthesizing anomaly-augmented training images. Such approaches are prone to failure on near-in-distribution anomalies, since these are difficult to synthesize realistically due to their similarity to anomaly-free regions. We propose an architecture based on quantized feature space representation with dual decoders, DSR, that avoids the image-level anomaly synthesis requirement. Without making any assumptions about the visual properties of anomalies, DSR generates the anomalies at the feature level by sampling the learned quantized feature space, which allows a controlled generation of near-in-distribution anomalies. DSR achieves state-of-the-art results on the KSDD2 and MVTec anomaly detection datasets. The experiments on the challenging real-world KSDD2 dataset show that DSR significantly outperforms other unsupervised surface anomaly detection methods, improving the previous top-performing methods by 10% AP in anomaly detection and 35% AP in anomaly localization.
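
A hedged sketch of the feature-level anomaly generation idea: resample codebook indices of a quantized feature map inside a mask. The interface below is assumed; DSR's controlled near-in-distribution sampling is more involved.

```python
import torch

def sample_feature_anomaly(index_map, codebook, mask):
    """Resample codebook indices inside a mask to create a feature-level
    anomaly, then look up the perturbed features.

    index_map: (H, W) long tensor of codebook indices for one image.
    codebook: (K, d) embedding table. mask: (H, W) bool tensor.
    """
    k = codebook.size(0)
    perturbed = index_map.clone()
    perturbed[mask] = torch.randint(0, k, (int(mask.sum()),))
    return codebook[perturbed]  # (H, W, d) anomalous feature map
```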

Keyword: transformer

What Can Transformers Learn In-Context? A Case Study of Simple Function Classes

  • Authors: Shivam Garg, Dimitris Tsipras, Percy Liang, Gregory Valiant
  • Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2208.01066
  • Pdf link: https://arxiv.org/pdf/2208.01066
  • Abstract
    In-context learning refers to the ability of a model to condition on a prompt sequence consisting of in-context examples (input-output pairs corresponding to some task) along with a new query input, and generate the corresponding output. Crucially, in-context learning happens only at inference time without any parameter updates to the model. While large language models such as GPT-3 exhibit some ability to perform in-context learning, it is unclear what the relationship is between tasks on which this succeeds and what is present in the training data. To make progress towards understanding in-context learning, we consider the well-defined problem of training a model to in-context learn a function class (e.g., linear functions): that is, given data derived from some functions in the class, can we train a model to in-context learn "most" functions from this class? We show empirically that standard Transformers can be trained from scratch to perform in-context learning of linear functions -- that is, the trained model is able to learn unseen linear functions from in-context examples with performance comparable to the optimal least squares estimator. In fact, in-context learning is possible even under two forms of distribution shift: (i) between the training data of the model and inference-time prompts, and (ii) between the in-context examples and the query input during inference. We also show that we can train Transformers to in-context learn more complex function classes -- namely sparse linear functions, two-layer neural networks, and decision trees -- with performance that matches or exceeds task-specific learning algorithms. Our code and models are available at https://github.com/dtsip/in-context-learning .
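
The training setup is easy to reproduce in miniature. The sketch below generates one in-context prompt for a random linear function and computes the least-squares baseline the paper compares against; the Transformer itself is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_context = 20, 40

# One in-context learning instance: a random linear function and examples.
w = rng.normal(size=d)
X = rng.normal(size=(n_context, d))
y = X @ w
x_query = rng.normal(size=d)

# The optimal baseline the paper compares against: least squares on the context.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_pred = x_query @ w_hat  # matches y_query = x_query @ w in the noiseless case

# A trained Transformer instead receives the flattened prompt
# [x_1, y_1, ..., x_n, y_n, x_query] and must emit its own prediction.
```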

Improving the Trainability of Deep Neural Networks through Layerwise Batch-Entropy Regularization

  • Authors: David Peer, Bart Keulen, Sebastian Stabinger, Justus Piater, Antonio Rodríguez-Sánchez
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2208.01134
  • Pdf link: https://arxiv.org/pdf/2208.01134
  • Abstract
    Training deep neural networks is a very demanding task; it is especially challenging to adapt architectures to improve the performance of trained models. Sometimes shallow networks generalize better than deep networks, and the addition of more layers results in higher training and test errors. The deep residual learning framework addresses this degradation problem by adding skip connections to several neural network layers. It would at first seem counter-intuitive that such skip connections are needed to train deep networks successfully, as the expressivity of a network would grow exponentially with depth. In this paper, we first analyze the flow of information through neural networks. We introduce and evaluate the batch-entropy, which quantifies the flow of information through each layer of a neural network. We prove empirically and theoretically that a positive batch-entropy is required for gradient descent-based training approaches to optimize a given loss function successfully. Based on those insights, we introduce batch-entropy regularization to enable gradient descent-based training algorithms to optimize the flow of information through each hidden layer individually. With batch-entropy regularization, gradient descent optimizers can transform untrainable networks into trainable networks. We show empirically that we can therefore train a "vanilla" fully connected network and convolutional neural network -- no skip connections, batch normalization, dropout, or any other architectural tweak -- with 500 layers by simply adding the batch-entropy regularization term to the loss function. The effect of batch-entropy regularization is evaluated not only on vanilla neural networks, but also on residual networks, autoencoders, and transformer models over a wide range of computer vision as well as natural language processing tasks.
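
A hedged sketch of the regularizer's shape: measure a per-layer batch-entropy and penalize its deviation from a target. The Gaussian differential-entropy proxy below is an illustrative assumption; the paper defines its own estimator.

```python
import torch

def batch_entropy(a, eps=1e-6):
    """Gaussian differential-entropy proxy for activations across a batch.
    a: (batch, features) output of one layer. Illustrative stand-in only."""
    std = a.std(dim=0) + eps
    return (0.5 * torch.log(2 * torch.pi * torch.e * std ** 2)).mean()

def lbe_regularized_loss(task_loss, layer_activations, alpha=0.1, target=1.0):
    """Penalize each layer's batch-entropy for deviating from a target,
    keeping information flowing through every hidden layer."""
    reg = sum((batch_entropy(a) - target) ** 2 for a in layer_activations)
    return task_loss + alpha * reg / len(layer_activations)
```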

BATMAN: Bilateral Attention Transformer in Motion-Appearance Neighboring Space for Video Object Segmentation

  • Authors: Ye Yu, Jialin Yuan, Gaurav Mittal, Li Fuxin, Mei Chen
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.01159
  • Pdf link: https://arxiv.org/pdf/2208.01159
  • Abstract
    Video Object Segmentation (VOS) is fundamental to video understanding. Transformer-based methods show significant performance improvement on semi-supervised VOS. However, existing work faces challenges segmenting visually similar objects in close proximity to each other. In this paper, we propose a novel Bilateral Attention Transformer in Motion-Appearance Neighboring space (BATMAN) for semi-supervised VOS. It captures object motion in the video via a novel optical flow calibration module that fuses the segmentation mask with optical flow estimation to improve within-object optical flow smoothness and reduce noise at object boundaries. This calibrated optical flow is then employed in our novel bilateral attention, which computes the correspondence between the query and reference frames in the neighboring bilateral space considering both motion and appearance. Extensive experiments validate the effectiveness of the BATMAN architecture, which outperforms all existing state-of-the-art methods on four popular VOS benchmarks: Youtube-VOS 2019 (85.0%), Youtube-VOS 2018 (85.3%), DAVIS 2017Val/Testdev (86.2%/82.2%), and DAVIS 2016 (92.5%).

Pose Uncertainty Aware Movement Synchrony Estimation via Spatial-Temporal Graph Transformer

  • Authors: Jicheng Li, Anjana Bhat, Roghayeh Barmaki
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
  • Arxiv link: https://arxiv.org/abs/2208.01161
  • Pdf link: https://arxiv.org/pdf/2208.01161
  • Abstract
    Movement synchrony reflects the coordination of body movements between interacting dyads. The estimation of movement synchrony has been automated by powerful deep learning models such as transformer networks. However, instead of designing a specialized network for movement synchrony estimation, previous transformer-based works broadly adopted architectures from other tasks such as human activity recognition. Therefore, this paper proposes a skeleton-based graph transformer for movement synchrony estimation. The proposed model applies ST-GCN, a spatial-temporal graph convolutional neural network, for skeleton feature extraction, followed by a spatial transformer for spatial feature generation. The spatial transformer is guided by a uniquely designed joint position embedding shared between the same joints of interacting individuals. Besides, we incorporate a temporal similarity matrix in the temporal attention computation, considering the periodic nature of body movements. In addition, the confidence score associated with each joint reflects the uncertainty of a pose, a point that previous works on movement synchrony estimation have not sufficiently emphasized. Since transformer networks demand a significant amount of data to train, we constructed a dataset for movement synchrony estimation using Human3.6M, a benchmark dataset for human activity recognition, and pretrained our model on it using contrastive learning. We further applied knowledge distillation to alleviate information loss introduced by pose detector failure in a privacy-preserving way. We compared our method with representative approaches on PT13, a dataset collected from autism therapy interventions. Our method achieved an overall accuracy of 88.98% and surpassed its counterparts by a wide margin while maintaining data privacy.

Making the Best of Both Worlds: A Domain-Oriented Transformer for Unsupervised Domain Adaptation

  • Authors: Wenxuan Ma, Jinming Zhang, Shuang Li, Chi Harold Liu, Yulin Wang, Wei Li
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.01195
  • Pdf link: https://arxiv.org/pdf/2208.01195
  • Abstract
    Extensive studies on Unsupervised Domain Adaptation (UDA) have propelled the deployment of deep learning from limited experimental datasets into real-world unconstrained domains. Most UDA approaches align features within a common embedding space and apply a shared classifier for target prediction. However, since a perfectly aligned feature space may not exist when the domain discrepancy is large, these methods suffer from two limitations. First, the coercive domain alignment deteriorates target domain discriminability due to lacking target label supervision. Second, the source-supervised classifier is inevitably biased toward source data, so it may underperform in the target domain. To alleviate these issues, we propose to simultaneously conduct feature alignment in two individual spaces focusing on different domains, and create for each space a domain-oriented classifier tailored specifically for that domain. Specifically, we design a Domain-Oriented Transformer (DOT) that has two individual classification tokens to learn different domain-oriented representations, and two classifiers to preserve domain-wise discriminability. A theoretically guaranteed contrastive alignment and a source-guided pseudo-label refinement strategy are utilized to explore both domain-invariant and domain-specific information. Comprehensive experiments validate that our method achieves state-of-the-art performance on several benchmarks.
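
A simplified sketch of the two-classification-token design (the contrastive alignment and pseudo-label refinement of the full DOT model are omitted):

```python
import torch
from torch import nn

class TwoTokenHead(nn.Module):
    """Two learnable classification tokens, each paired with its own
    classifier, so source- and target-oriented representations stay separate."""
    def __init__(self, encoder, dim, num_classes):
        super().__init__()
        self.encoder = encoder  # any transformer encoder over (B, N, dim)
        self.cls_src = nn.Parameter(torch.zeros(1, 1, dim))
        self.cls_tgt = nn.Parameter(torch.zeros(1, 1, dim))
        self.head_src = nn.Linear(dim, num_classes)
        self.head_tgt = nn.Linear(dim, num_classes)

    def forward(self, patches):  # patches: (B, N, dim)
        b = patches.size(0)
        tokens = torch.cat([self.cls_src.expand(b, -1, -1),
                            self.cls_tgt.expand(b, -1, -1), patches], dim=1)
        out = self.encoder(tokens)
        return self.head_src(out[:, 0]), self.head_tgt(out[:, 1])
```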

A Novel Transformer Network with Shifted Window Cross-Attention for Spatiotemporal Weather Forecasting

  • Authors: Alabi Bojesomo, Hasan Al Marzouqi, Panos Liatsis
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.01252
  • Pdf link: https://arxiv.org/pdf/2208.01252
  • Abstract
    Earth observation is a growing research area that can capitalize on the power of AI for short-term forecasting, a nowcasting scenario. In this work, we tackle the challenge of weather forecasting using a video transformer network. Vision transformer architectures have been explored in various applications, with major constraints being the computational complexity of attention and data-hungry training. To address these issues, we propose the use of Video Swin-Transformer, coupled with a dedicated augmentation scheme. Moreover, we employ gradual spatial reduction on the encoder side and cross-attention on the decoder. The proposed approach is tested on the Weather4Cast2021 weather forecasting challenge data, which requires the prediction of future frames 8 hours ahead (4 per hour) from an hourly weather product sequence. The dataset was normalized to 0-1 to facilitate using the evaluation metrics across different datasets. The model achieves an MSE score of 0.4750 when provided with training data, and 0.4420 during transfer learning without using training data.

Silo NLP's Participation at WAT2022

  • Authors: Shantipriya Parida, Subhadarshi Panda, Stig-Arne Grönroos, Mark Granroth-Wilding, Mika Koistinen
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2208.01296
  • Pdf link: https://arxiv.org/pdf/2208.01296
  • Abstract
    This paper provides the system description of "Silo NLP's" submission to the Workshop on Asian Translation (WAT2022). We have participated in the Indic Multimodal tasks (English->Hindi, English->Malayalam, and English->Bengali Multimodal Translation). For text-only translation, we trained Transformers from scratch and fine-tuned mBART-50 models. For multimodal translation, we used the same mBART architecture and extracted object tags from the images to use as visual features concatenated with the text sequence. Our submission tops many tasks including English->Hindi multimodal translation (evaluation test), English->Malayalam text-only and multimodal translation (evaluation test), English->Bengali multimodal translation (challenge test), and English->Bengali text-only translation (evaluation test).
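
The multimodal input construction described can be as simple as string concatenation; a sketch, with the separator token being an assumption:

```python
def build_multimodal_input(text, object_tags, sep="##"):
    """Concatenate detected object tags onto the source sentence before
    tokenization; the mBART encoder then sees text and visual cues jointly."""
    return f"{text} {sep} {' '.join(object_tags)}"

# build_multimodal_input("A man rides a horse.", ["man", "horse", "field"])
# -> 'A man rides a horse. ## man horse field'
```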

Unified Normalization for Accelerating and Stabilizing Transformers

  • Authors: Qiming Yang, Kai Zhang, Chaoxiang Lan, Zhi Yang, Zheyang Li, Wenming Tan, Jun Xiao, Shiliang Pu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
  • Arxiv link: https://arxiv.org/abs/2208.01313
  • Pdf link: https://arxiv.org/pdf/2208.01313
  • Abstract
    Solid results from Transformers have made them prevailing architectures in various natural language and vision tasks. As a default component in Transformers, Layer Normalization (LN) normalizes activations within each token to boost robustness. However, LN requires on-the-fly statistics calculation in inference as well as division and square root operations, leading to inefficiency on hardware. What is more, replacing LN with other hardware-efficient normalization schemes (e.g., Batch Normalization) results in inferior performance, or even collapse in training. We find that this dilemma is caused by abnormal behaviors of activation statistics, including large fluctuations over iterations and extreme outliers across layers. To tackle these issues, we propose Unified Normalization (UN), which can speed up inference by being fused with other linear operations while achieving performance on par with LN. UN strives to boost performance by calibrating the activation and gradient statistics with a tailored fluctuation smoothing strategy. Meanwhile, an adaptive outlier filtration strategy, whose effectiveness is theoretically proved and experimentally verified in this paper, is applied to avoid collapse in training. We demonstrate that UN can be an efficient drop-in alternative to LN by conducting extensive experiments on language and vision tasks. Besides, we evaluate the efficiency of our method on GPU. Transformers equipped with UN enjoy about 31% inference speedup and nearly 18% memory reduction. Code will be released at https://github.com/hikvision-research/Unified-Normalization.
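
Why fixed statistics permit fusion: once the mean and variance no longer depend on the input, normalization is an affine map that folds into the adjacent linear layer. A generic folding sketch (BN-style, not UN's exact scheme):

```python
import torch
from torch import nn

@torch.no_grad()
def fold_norm_into_linear(linear, mean, var, gamma, beta, eps=1e-5):
    """Fuse y = gamma * (linear(x) - mean) / sqrt(var + eps) + beta into a
    single Linear layer (assumes linear.bias is not None).

    mean, var, gamma, beta: (out_features,) fixed statistics and affine params.
    """
    scale = gamma / torch.sqrt(var + eps)
    fused = nn.Linear(linear.in_features, linear.out_features)
    fused.weight.copy_(linear.weight * scale[:, None])
    fused.bias.copy_((linear.bias - mean) * scale + beta)
    return fused
```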

A Comparative Study on COVID-19 Fake News Detection Using Different Transformer Based Models

  • Authors: Sajib Kumar Saha Joy, Dibyo Fabian Dofadar, Riyo Hayat Khan, Md. Sabbir Ahmed, Rafeed Rahman
  • Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2208.01355
  • Pdf link: https://arxiv.org/pdf/2208.01355
  • Abstract
    The rapid advancement of social networks and the convenience of internet availability have accelerated the rampant spread of false news and rumors on social media sites. Amid the COVID-19 epidemic, this misleading information has aggravated the situation by putting people's mental and physical health in danger. To limit the spread of such inaccuracies, identifying fake news on online platforms could be the first and foremost step. In this research, the authors have conducted a comparative analysis by implementing five transformer-based models, namely BERT, BERT without LSTM, ALBERT, RoBERTa, and a hybrid of BERT & ALBERT, in order to detect the fraudulent news of COVID-19 from the internet. The COVID-19 Fake News Dataset has been used for training and testing the models. Among all these models, the RoBERTa model performed better than the others by obtaining an F1 score of 0.98 in both the real and fake classes.
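
A minimal fine-tuning sketch with the Hugging Face transformers library, a standard route for such comparisons; the dataset objects and hyperparameters are placeholders, not the paper's settings:

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2)  # real (0) vs. fake (1)

def encode(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128,
                     padding="max_length")

# train_ds / eval_ds: tokenized splits of a labeled claim-text dataset,
# e.g. raw_train.map(encode, batched=True) on the COVID-19 Fake News Dataset.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train_ds,
    eval_dataset=eval_ds,
)
trainer.train()
```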

Detecting Individual Decision-Making Style: Exploring Behavioral Stylometry in Chess

  • Authors: Reid McIlroy-Young, Russell Wang, Siddhartha Sen, Jon Kleinberg, Ashton Anderson
  • Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2208.01366
  • Pdf link: https://arxiv.org/pdf/2208.01366
  • Abstract
    The advent of machine learning models that surpass human decision-making ability in complex domains has initiated a movement towards building AI systems that interact with humans. Many building blocks are essential for this activity, with a central one being the algorithmic characterization of human behavior. While much of the existing work focuses on aggregate human behavior, an important long-range goal is to develop behavioral models that specialize to individual people and can differentiate among them. To formalize this process, we study the problem of behavioral stylometry, in which the task is to identify a decision-maker from their decisions alone. We present a transformer-based approach to behavioral stylometry in the context of chess, where one attempts to identify the player who played a set of games. Our method operates in a few-shot classification framework, and can correctly identify a player from among thousands of candidate players with 98% accuracy given only 100 labeled games. Even when trained on amateur play, our method generalises to out-of-distribution samples of Grandmaster players, despite the dramatic differences between amateur and world-class players. Finally, we consider more broadly what our resulting embeddings reveal about human style in chess, as well as the potential ethical implications of powerful methods for identifying individuals from behavioral data.
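
The few-shot identification step can be summarized as nearest-centroid matching over game embeddings; a sketch, with the embedding model assumed as a black box:

```python
import numpy as np

def identify_player(query_games, reference_sets, embed):
    """Nearest-centroid few-shot identification.

    embed: function mapping one game to a fixed-size vector (the paper
    trains a transformer for this; here it is a black box).
    reference_sets: dict of player -> list of labeled games.
    """
    centroids = {p: np.mean([embed(g) for g in games], axis=0)
                 for p, games in reference_sets.items()}
    q = np.mean([embed(g) for g in query_games], axis=0)
    return min(centroids, key=lambda p: np.linalg.norm(q - centroids[p]))
```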

Multi-Module G2P Converter for Persian Focusing on Relations between Words

  • Authors: Mahdi Rezaei, Negar Nayeri, Saeed Farzi, Hossein Sameti
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2208.01371
  • Pdf link: https://arxiv.org/pdf/2208.01371
  • Abstract
    In this paper, we investigate the application of end-to-end and multi-module frameworks for G2P conversion for the Persian language. The results demonstrate that our proposed multi-module G2P system outperforms our end-to-end systems in terms of accuracy and speed. The system consists of a pronunciation dictionary as our look-up table, along with separate models to handle homographs, OOVs and ezafe in Persian created using GRU and Transformer architectures. The system is sequence-level rather than word-level, which allows it to effectively capture the unwritten relations between words (cross-word information) necessary for homograph disambiguation and ezafe recognition without the need for any pre-processing. After evaluation, our system achieved a 94.48% word-level accuracy, outperforming the previous G2P systems for Persian.
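
A hedged sketch of the multi-module dispatch logic; all component interfaces (lexicon, homograph model, OOV model, ezafe tagger) are assumptions, not the authors' code:

```python
def g2p(sentence, lexicon, homograph_model, oov_model, ezafe_tagger):
    """Route each word to the lexicon, a homograph disambiguator, or an
    OOV model, then append ezafe from sentence-level context."""
    ezafe_flags = ezafe_tagger(sentence)  # one boolean per word
    phonemes = []
    for word, has_ezafe in zip(sentence.split(), ezafe_flags):
        prons = lexicon.get(word)  # list of pronunciations or None
        if prons is None:
            pron = oov_model(word)  # neural grapheme-to-phoneme fallback
        elif len(prons) > 1:
            pron = homograph_model(word, sentence)  # context-aware choice
        else:
            pron = prons[0]
        phonemes.append(pron + ("-e" if has_ezafe else ""))
    return " ".join(phonemes)
```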

BERT4Loc: BERT for Location -- POI Recommender System

  • Authors: Syed Raza Bashir, Vojislav Misic
  • Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2208.01375
  • Pdf link: https://arxiv.org/pdf/2208.01375
  • Abstract
    Recommending points of interest is a difficult problem that requires precise location information to be extracted from a location-based social media platform. Another challenging and critical problem for such a location-aware recommendation system is modelling users' preferences based on their historical behaviors. We propose a location-aware recommender system based on Bidirectional Encoder Representations from Transformers for the purpose of providing users with location-based recommendations. The proposed model incorporates location data and user preferences. When compared to predicting the next item of interest (location) at each position in a sequence, our model can provide the user with more relevant results. Extensive experiments on a benchmark dataset demonstrate that our model consistently outperforms a variety of state-of-the-art sequential models.

Active entailment encoding for explanation tree construction using parsimonious generation of hard negatives

  • Authors: Alex Bogatu, Zili Zhou, Dónal Landers, André Freitas
  • Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2208.01376
  • Pdf link: https://arxiv.org/pdf/2208.01376
  • Abstract
    Entailment trees have been proposed to simulate the human reasoning process of explanation generation in the context of open-domain textual question answering. However, in practice, manually constructing these explanation trees proves a laborious process that requires active human involvement. Given the complexity of capturing the line of reasoning from question to answer or from claim to premises, the issue arises of how to assist the user in efficiently constructing multi-level entailment trees given a large set of available facts. In this paper, we frame the construction of entailment trees as a sequence of active premise selection steps, i.e., for each intermediate node in an explanation tree, the expert needs to annotate positive and negative examples of premise facts from a large candidate list. We then iteratively fine-tune pre-trained Transformer models with the resulting positive and tightly controlled negative samples and aim to balance the encoding of semantic relationships and explanatory entailment relationships. Experimental evaluation confirms the measurable efficiency gains of the proposed active fine-tuning method in facilitating entailment tree construction: up to 20% improvement in explanatory premise selection when compared against several alternatives.
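
A sketch of the kind of fine-tuning signal described: score annotated premises above hard negatives for a tree node. The triplet-style loss below is illustrative, not the paper's exact objective:

```python
import torch
import torch.nn.functional as F

def premise_selection_loss(node, positives, negatives, margin=0.5):
    """Score annotated premise facts above hard negatives for one tree node.

    node: (d,) embedding of the intermediate conclusion.
    positives / negatives: (n, d) embeddings of candidate facts.
    """
    pos = F.cosine_similarity(node.unsqueeze(0), positives)  # (n,)
    neg = F.cosine_similarity(node.unsqueeze(0), negatives)
    # Hinge on the hardest pair: weakest positive vs. strongest negative.
    return F.relu(margin - pos.min() + neg.max())
```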

ferret: a Framework for Benchmarking Explainers on Transformers

  • Authors: Giuseppe Attanasio, Eliana Pastor, Chiara Di Bonaventura, Debora Nozza
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2208.01575
  • Pdf link: https://arxiv.org/pdf/2208.01575
  • Abstract
    Many interpretability tools allow practitioners and researchers to explain Natural Language Processing systems. However, each tool requires different configurations and provides explanations in different forms, hindering the possibility of assessing and comparing them. A principled, unified evaluation benchmark will guide the users through the central question: which explanation method is more reliable for my use case? We introduce ferret, an easy-to-use, extensible Python library to explain Transformer-based models integrated with the Hugging Face Hub. It offers a unified benchmarking suite to test and compare a wide range of state-of-the-art explainers on any text or interpretability corpora. In addition, ferret provides convenient programming abstractions to foster the introduction of new explanation methods, datasets, or evaluation metrics.

Keyword: autonomous driving

Ithaca365: Dataset and Driving Perception under Repeated and Challenging Weather Conditions

  • Authors: Carlos A. Diaz-Ruiz (1), Youya Xia (1), Yurong You (1), Jose Nino (1), Junan Chen (1), Josephine Monica (1), Xiangyu Chen (1), Katie Luo (1), Yan Wang (1), Marc Emond (1), Wei-Lun Chao (2), Bharath Hariharan (1), Kilian Q. Weinberger (1), Mark Campbell (1) ((1) Cornell University, (2) The Ohio State University)
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2208.01166
  • Pdf link: https://arxiv.org/pdf/2208.01166
  • Abstract
    Advances in perception for self-driving cars have accelerated in recent years due to the availability of large-scale datasets, typically collected at specific locations and under nice weather conditions. Yet, to achieve the high safety requirement, these perceptual systems must operate robustly under a wide variety of weather conditions including snow and rain. In this paper, we present a new dataset to enable robust autonomous driving via a novel data collection process - data is repeatedly recorded along a 15 km route under diverse scene (urban, highway, rural, campus), weather (snow, rain, sun), time (day/night), and traffic conditions (pedestrians, cyclists and cars). The dataset includes images and point clouds from cameras and LiDAR sensors, along with high-precision GPS/INS to establish correspondence across routes. The dataset includes road and object annotations using amodal masks to capture partial occlusions and 3D bounding boxes. We demonstrate the uniqueness of this dataset by analyzing the performance of baselines in amodal segmentation of road and objects, depth estimation, and 3D object detection. The repeated routes open new research directions in object discovery, continual learning, and anomaly detection. Link to Ithaca365: https://ithaca365.mae.cornell.edu/

ViP3D: End-to-end Visual Trajectory Prediction via 3D Agent Queries

  • Authors: Junru Gu, Chenxu Hu, Tianyuan Zhang, Xuanyao Chen, Yilun Wang, Yue Wang, Hang Zhao
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2208.01582
  • Pdf link: https://arxiv.org/pdf/2208.01582
  • Abstract
    Existing autonomous driving pipelines separate the perception module from the prediction module. The two modules communicate via hand-picked features such as agent boxes and trajectories as interfaces. Due to this separation, the prediction module only receives partial information from the perception module. Even worse, errors from the perception modules can propagate and accumulate, adversely affecting the prediction results. In this work, we propose ViP3D, a visual trajectory prediction pipeline that leverages the rich information from raw videos to predict future trajectories of agents in a scene. ViP3D employs sparse agent queries throughout the pipeline, making it fully differentiable and interpretable. Furthermore, we propose an evaluation metric for this novel end-to-end visual trajectory prediction task. Extensive experimental results on the nuScenes dataset show the strong performance of ViP3D over traditional pipelines and previous end-to-end models.