New submissions for Mon, 18 Jul 22 #200

zhuhu00 commented Jul 18, 2022

Keyword: SLAM

Fixed-Parameter Tractability of Maximum Colored Path and Beyond

  • Authors: Fedor V. Fomin, Petr A. Golovach, Tuukka Korhonen, Kirill Simonov, Giannos Stamoulis
  • Subjects: Data Structures and Algorithms (cs.DS)
  • Arxiv link: https://arxiv.org/abs/2207.07449
  • Pdf link: https://arxiv.org/pdf/2207.07449
  • Abstract
    We introduce a general method for obtaining fixed-parameter algorithms for problems about finding paths in undirected graphs, where the length of the path could be unbounded in the parameter. The first application of our method is as follows. We give a randomized algorithm, that given a colored $n$-vertex undirected graph, vertices $s$ and $t$, and an integer $k$, finds an $(s,t)$-path containing at least $k$ different colors in time $2^k n^{O(1)}$. This is the first FPT algorithm for this problem, and it generalizes the algorithm of Björklund, Husfeldt, and Taslaman [SODA 2012] on finding a path through $k$ specified vertices. It also implies the first $2^k n^{O(1)}$ time algorithm for finding an $(s,t)$-path of length at least $k$. Our method yields FPT algorithms for even more general problems. For example, we consider the problem where the input consists of an $n$-vertex undirected graph $G$, a matroid $M$ whose elements correspond to the vertices of $G$ and which is represented over a finite field of order $q$, a positive integer weight function on the vertices of $G$, two sets of vertices $S,T \subseteq V(G)$, and integers $p,k,w$, and the task is to find $p$ vertex-disjoint paths from $S$ to $T$ so that the union of the vertices of these paths contains an independent set of $M$ of cardinality $k$ and weight $w$, while minimizing the sum of the lengths of the paths. We give a $2^{p+O(k^2 \log (q+k))} n^{O(1)} w$ time randomized algorithm for this problem.

Keyword: odometry

There is no result

Keyword: livox

There is no result

Keyword: loam

There is no result

Keyword: lidar

ST-P3: End-to-end Vision-based Autonomous Driving via Spatial-Temporal Feature Learning

  • Authors: Shengchao Hu, Li Chen, Penghao Wu, Hongyang Li, Junchi Yan, Dacheng Tao
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2207.07601
  • Pdf link: https://arxiv.org/pdf/2207.07601
  • Abstract
    Many existing autonomous driving paradigms involve a multi-stage discrete pipeline of tasks. To better predict the control signals and enhance user safety, an end-to-end approach that benefits from joint spatial-temporal feature learning is desirable. While there are some pioneering works on LiDAR-based input or implicit design, in this paper we formulate the problem in an interpretable vision-based setting. In particular, we propose a spatial-temporal feature learning scheme towards a set of more representative features for perception, prediction and planning tasks simultaneously, which is called ST-P3. Specifically, an egocentric-aligned accumulation technique is proposed to preserve geometry information in 3D space before the bird's eye view transformation for perception; a dual pathway modeling is devised to take past motion variations into account for future prediction; a temporal-based refinement unit is introduced to compensate for recognizing vision-based elements for planning. To the best of our knowledge, we are the first to systematically investigate each part of an interpretable end-to-end vision-based autonomous driving system. We benchmark our approach against previous state-of-the-arts on both open-loop nuScenes dataset as well as closed-loop CARLA simulation. The results show the effectiveness of our method. Source code, model and protocol details are made publicly available at https://github.com/OpenPerceptionX/ST-P3.

DOLPHINS: Dataset for Collaborative Perception enabled Harmonious and Interconnected Self-driving

  • Authors: Ruiqing Mao, Jingyu Guo, Yukuan Jia, Yuxuan Sun, Sheng Zhou, Zhisheng Niu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
  • Arxiv link: https://arxiv.org/abs/2207.07609
  • Pdf link: https://arxiv.org/pdf/2207.07609
  • Abstract
    Vehicle-to-Everything (V2X) network has enabled collaborative perception in autonomous driving, which is a promising solution to the fundamental defect of stand-alone intelligence including blind zones and long-range perception. However, the lack of datasets has severely blocked the development of collaborative perception algorithms. In this work, we release DOLPHINS: Dataset for cOllaborative Perception enabled Harmonious and INterconnected Self-driving, as a new simulated large-scale various-scenario multi-view multi-modality autonomous driving dataset, which provides a ground-breaking benchmark platform for interconnected autonomous driving. DOLPHINS outperforms current datasets in six dimensions: temporally-aligned images and point clouds from both vehicles and Road Side Units (RSUs) enabling both Vehicle-to-Vehicle (V2V) and Vehicle-to-Infrastructure (V2I) based collaborative perception; 6 typical scenarios with dynamic weather conditions make the most various interconnected autonomous driving dataset; meticulously selected viewpoints providing full coverage of the key areas and every object; 42376 frames and 292549 objects, as well as the corresponding 3D annotations, geo-positions, and calibrations, compose the largest dataset for collaborative perception; Full-HD images and 64-line LiDARs construct high-resolution data with sufficient details; well-organized APIs and open-source codes ensure the extensibility of DOLPHINS. We also construct a benchmark of 2D detection, 3D detection, and multi-view collaborative perception tasks on DOLPHINS. The experiment results show that the raw-level fusion scheme through V2X communication can help to improve the precision as well as to reduce the necessity of expensive LiDAR equipment on vehicles when RSUs exist, which may accelerate the popularity of interconnected self-driving vehicles. DOLPHINS is now available on https://dolphins-dataset.net/.

Keyword: loop detection

There is no result

Keyword: nerf

There is no result

Keyword: mapping

Z-polyregular functions

  • Authors: Thomas Colcombet, Gaëtan Douéneau-Tabot, Aliaume Lopez
  • Subjects: Formal Languages and Automata Theory (cs.FL)
  • Arxiv link: https://arxiv.org/abs/2207.07450
  • Pdf link: https://arxiv.org/pdf/2207.07450
  • Abstract
    Given an MSO formula $\phi$ with free variables $x_1, \dots, x_k$, one can define the function $\#\phi$ mapping a word $w$ to the number of valuations satisfying $\phi$ in $w$. In this paper, we introduce the class of $\mathbb{Z}$-linear combinations of such functions, that we call $\mathbb{Z}$-polyregular functions. Indeed, it turns out to be closely related to the well-studied class of polyregular functions. The main results of this paper solve two natural decision problems for $\mathbb{Z}$-polyregular functions. First, we show that one can minimise the number $k \ge 0$ of free variables which are needed to describe a function. Then, we show how to decide if a function can be defined using first-order formulas, by extending the notion of residual automaton and providing an original semantic characterisation based on aperiodicity. We also connect this class of functions to $\mathbb{Z}$-rational series.
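
As a small worked illustration of the counting construction (not from the paper): take the MSO formula $\phi(x) = a(x)$, stating that position $x$ carries the letter $a$.

```latex
% #phi counts the valuations of x satisfying phi, i.e. the number of a's in w:
\[
  \#\phi(w) = |w|_a , \qquad \text{e.g. } \#\phi(abba) = 2, \quad \#\phi(bbb) = 0 .
\]
% A Z-linear combination of such counting functions, for instance
\[
  f = 2\,\#\phi - \#\psi \quad \text{with } \psi(x) = b(x) ,
\]
% is Z-polyregular; it maps abba to 2*2 - 2 = 2 and can take negative values.
```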

DeepSolar tracker: towards unsupervised assessment with open-source data of the accuracy of deep learning-based distributed PV mapping

  • Authors: Gabriel Kasmi, Laurent Dubus, Philippe Blanc, Yves-Marie Saint-Drenan
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
  • Arxiv link: https://arxiv.org/abs/2207.07466
  • Pdf link: https://arxiv.org/pdf/2207.07466
  • Abstract
    Photovoltaic (PV) energy is key to mitigating the current energy crisis. However, distributed PV generation, which amounts to half of the PV energy generation, makes it increasingly difficult for transmission system operators (TSOs) to balance the load and supply and avoid grid congestions. Indeed, in the absence of measurements, estimating the distributed PV generation is tough. In recent years, many remote sensing-based approaches have been proposed to map distributed PV installations. However, to be applicable in industrial settings, one needs to assess the accuracy of the mapping over the whole deployment area. We build on existing work to propose an automated PV registry pipeline. This pipeline automatically generates a dataset recording all distributed PV installations' location, area, installed capacity, and tilt angle. It only requires aerial orthoimagery and topological data, both of which are freely accessible online. In order to assess the accuracy of the registry, we propose an unsupervised method based on the Registre national d'installation (RNI), that centralizes all individual PV systems aggregated at communal level, enabling practitioners to assess the accuracy of the registry and eventually remove outliers. We deploy our model on 9 French départements covering more than 50 000 square kilometers, providing the largest mapping of distributed PV panels with this level of detail to date. We then demonstrate how practitioners can use our unsupervised accuracy assessment method to assess the accuracy of the outputs. In particular, we show how it can easily identify outliers in the detections. Overall, our approach paves the way for a safer integration of deep learning-based pipelines for remote PV mapping. Code is available at https://github.com/gabrielkasmi/dsfrance.
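
A rough sketch of the kind of commune-level consistency check the registry enables (illustrative only, not the authors' code; all column names and the outlier threshold are assumptions):

```python
import pandas as pd

# Hypothetical inputs: one row per detected installation, plus the RNI reference
# aggregated at the commune level (all column names are assumptions).
detections = pd.DataFrame({
    "commune_id": ["01001", "01001", "01002"],
    "estimated_kwp": [6.0, 3.0, 9.0],
})
rni = pd.DataFrame({
    "commune_id": ["01001", "01002"],
    "registered_kwp": [9.5, 30.0],
})

# Aggregate the detections to the commune level and join with the registry.
per_commune = detections.groupby("commune_id", as_index=False)["estimated_kwp"].sum()
merged = per_commune.merge(rni, on="commune_id", how="outer").fillna(0.0)

# Flag communes whose detected capacity deviates strongly from the registry;
# the 50% relative-error threshold is an arbitrary illustrative choice.
merged["rel_error"] = (merged["estimated_kwp"] - merged["registered_kwp"]).abs() \
    / merged["registered_kwp"].clip(lower=1e-6)
print(merged[merged["rel_error"] > 0.5])
```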

Keyword: localization

Between-Tactor Display Using Dynamic Tactile Stimuli

  • Authors: Ryo Eguchi, David Vacek, Cole Godzinski, Silvia Curry, Max Evans, Allison M. Okamura
  • Subjects: Human-Computer Interaction (cs.HC); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2207.07120
  • Pdf link: https://arxiv.org/pdf/2207.07120
  • Abstract
    Display of illusory vibration locations between physical vibrotactile motors (tactors) placed on the skin has the potential to reduce the number of tactors in distributed tactile displays. This paper presents a between-tactor display method that uses dynamic tactile stimuli to generate illusory vibration locations. A belt with only 6 vibration motors displays 24 targets consisting of on-tactor and between-tactor locations. On-tactor locations are represented by simply vibrating the relevant single tactor. Between-tactor locations are displayed by adjusting the relative vibration amplitudes of two adjacent motors, with either (1) constant vibration amplitudes or (2) perturbed vibration amplitudes (creating local illusory motion). User testing showed that perturbations improve recognition accuracy for in-between tactor localization.
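
A minimal sketch of the constant-amplitude between-tactor rendering (condition 1 above); the linear panning law, belt geometry, and function names are illustrative assumptions rather than the authors' exact mapping:

```python
import numpy as np

N_TACTORS = 6   # evenly spaced around the belt (assumed geometry)

def tactor_amplitudes(target_angle_deg, max_amp=1.0):
    """Amplitudes for all tactors when rendering a target location on the belt.

    The target is displayed by the two adjacent tactors, whose relative amplitudes
    come from linear interpolation of the angular distance (assumed panning law).
    """
    spacing = 360.0 / N_TACTORS
    idx = int(target_angle_deg // spacing) % N_TACTORS   # tactor just "before" the target
    nxt = (idx + 1) % N_TACTORS                          # tactor just "after" the target
    frac = (target_angle_deg % spacing) / spacing        # 0 -> on idx, 1 -> on nxt

    amps = np.zeros(N_TACTORS)
    amps[idx] = max_amp * (1.0 - frac)
    amps[nxt] = max_amp * frac
    return amps

# A target halfway between tactors 0 and 1 (30 degrees) drives both at half amplitude.
print(tactor_amplitudes(30.0))   # [0.5 0.5 0.  0.  0.  0. ]
```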

On the use of graph neural networks and shape-function-based gradient computation in the deep energy method

  • Authors: Junyan He, Diab Abueidda, Seid Koric, Iwona Jasiuk
  • Subjects: Computational Engineering, Finance, and Science (cs.CE)
  • Arxiv link: https://arxiv.org/abs/2207.07216
  • Pdf link: https://arxiv.org/pdf/2207.07216
  • Abstract
    A graph neural network (GCN) is employed in the deep energy method (DEM) model to solve the momentum balance equation in 3D for the deformation of linear elastic and hyperelastic materials, owing to its ability to handle irregular domains better than the traditional DEM based on a multilayer perceptron (MLP) network. Its accuracy and solution time are compared to those of the MLP-based DEM model. Through numerical examples, we demonstrate that the GCN-based model delivers similar accuracy while having a shorter run time. Two different spatial gradient computation techniques, one based on automatic differentiation (AD) and the other based on shape function (SF) gradients, are also assessed. We provide a simple example to demonstrate the strain localization instability associated with the AD-based gradient computation and show, via four numerical examples, that the instability exists in more general cases. The SF-based gradient computation is shown to be more robust and delivers an accurate solution even at severe deformations. Therefore, the combination of the GCN-based DEM model and SF-based gradient computation is potentially a promising candidate for solving problems involving severe material and geometric nonlinearities.
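
For readers unfamiliar with the AD-based spatial gradients being compared here, a generic sketch (not the paper's GCN-based DEM code) of computing a displacement gradient with automatic differentiation might look like this; the toy network and shapes are assumptions:

```python
import torch

# Toy displacement network: maps 3D coordinates to a 3D displacement field.
# (The architecture is an illustrative stand-in, not the paper's GCN-based DEM.)
net = torch.nn.Sequential(
    torch.nn.Linear(3, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 3),
)

coords = torch.rand(128, 3, requires_grad=True)   # sample points in the domain
u = net(coords)                                   # predicted displacements, shape (128, 3)

# AD-based spatial gradient: du_i/dx_j at every point, assembled component by component.
grad_u = torch.stack(
    [torch.autograd.grad(u[:, i].sum(), coords, create_graph=True)[0] for i in range(3)],
    dim=1,
)  # shape (128, 3, 3); strains and the energy functional would be built from this
```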

3D Instances as 1D Kernels

  • Authors: Yizheng Wu, Min Shi, Shuaiyuan Du, Hao Lu, Zhiguo Cao, Weicai Zhong
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2207.07372
  • Pdf link: https://arxiv.org/pdf/2207.07372
  • Abstract
    We introduce a 3D instance representation, termed instance kernels, where instances are represented by one-dimensional vectors that encode the semantic, positional, and shape information of 3D instances. We show that instance kernels enable easy mask inference by simply scanning kernels over the entire scenes, avoiding the heavy reliance on proposals or heuristic clustering algorithms in standard 3D instance segmentation pipelines. The idea of instance kernel is inspired by recent success of dynamic convolutions in 2D/3D instance segmentation. However, we find it non-trivial to represent 3D instances due to the disordered and unstructured nature of point cloud data, e.g., poor instance localization can significantly degrade instance representation. To remedy this, we construct a novel 3D instance encoding paradigm. First, potential instance centroids are localized as candidates. Then, a candidate merging scheme is devised to simultaneously aggregate duplicated candidates and collect context around the merged centroids to form the instance kernels. Once instance kernels are available, instance masks can be reconstructed via dynamic convolutions whose weights are conditioned on instance kernels. The whole pipeline is instantiated with a dynamic kernel network (DKNet). Results show that DKNet outperforms the state of the arts on both ScanNetV2 and S3DIS datasets with better instance localization. Code is available: https://github.com/W1zheng/DKNet.
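
A minimal sketch of the dynamic-convolution mask decoding that instance kernels enable (the general mechanism, not DKNet itself); the dimensions and single-layer decoder are illustrative assumptions:

```python
import torch

num_points, feat_dim, num_instances = 5000, 32, 10

point_feats = torch.randn(num_points, feat_dim)      # per-point scene features
inst_kernels = torch.randn(num_instances, feat_dim)  # one 1D kernel per instance (after candidate merging)

# Dynamic 1x1 "convolution": each instance kernel acts as the weight of a linear
# classifier scanned over all points, producing one mask per instance.
mask_logits = inst_kernels @ point_feats.t()          # (num_instances, num_points)
masks = mask_logits.sigmoid() > 0.5                   # boolean per-instance point masks
```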

Keyword: transformer

Lightweight Vision Transformer with Cross Feature Attention

  • Authors: Youpeng Zhao, Huadong Tang, Yingying Jiang, Yong A, Qiang Wu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2207.07268
  • Pdf link: https://arxiv.org/pdf/2207.07268
  • Abstract
    Recent advances in vision transformers (ViTs) have achieved great performance in visual recognition tasks. Convolutional neural networks (CNNs) exploit spatial inductive bias to learn visual representations, but these networks are spatially local. ViTs can learn global representations with their self-attention mechanism, but they are usually heavy-weight and unsuitable for mobile devices. In this paper, we propose cross feature attention (XFA) to bring down computation cost for transformers, and combine efficient mobile CNNs to form a novel efficient light-weight CNN-ViT hybrid model, XFormer, which can serve as a general-purpose backbone to learn both global and local representation. Experimental results show that XFormer outperforms numerous CNN and ViT-based models across different tasks and datasets. On ImageNet1K dataset, XFormer achieves top-1 accuracy of 78.5% with 5.5 million parameters, which is 2.2% and 6.3% more accurate than EfficientNet-B0 (CNN-based) and DeiT (ViT-based) for similar number of parameters. Our model also performs well when transferring to object detection and semantic segmentation tasks. On MS COCO dataset, XFormer exceeds MobileNetV2 by 10.5 AP (22.7 -> 33.2 AP) in YOLOv3 framework with only 6.3M parameters and 3.8G FLOPs. On Cityscapes dataset, with only a simple all-MLP decoder, XFormer achieves mIoU of 78.5 and FPS of 15.3, surpassing state-of-the-art lightweight segmentation networks.
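
The abstract does not spell out the XFA formulation, so the sketch below illustrates the general idea of sub-quadratic attention with a standard linearized-attention trick (reassociating the matrix product so no N x N map is materialised); it is not the paper's XFA:

```python
import torch
import torch.nn.functional as F

N, d = 1024, 64                      # tokens, head dimension (illustrative)
Q, K, V = (torch.randn(N, d) for _ in range(3))

# Standard self-attention builds an N x N matrix: O(N^2 d) time and O(N^2) memory.
standard = torch.softmax(Q @ K.t() / d ** 0.5, dim=-1) @ V

# Linearized variant: with a positive feature map phi (here elu + 1), reassociate
# phi(Q) (phi(K)^T V), which costs O(N d^2) and never forms the N x N matrix.
phi = lambda x: F.elu(x) + 1.0
num = phi(Q) @ (phi(K).t() @ V)                        # (N, d)
den = phi(Q) @ phi(K).sum(dim=0, keepdim=True).t()     # (N, 1) normalisation
linearized = num / den
```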

Parameterization of Cross-Token Relations with Relative Positional Encoding for Vision MLP

  • Authors: Zhicai Wang, Yanbin Hao, Xingyu Gao, Hao Zhang, Shuo Wang, Tingting Mu, Xiangnan He
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2207.07284
  • Pdf link: https://arxiv.org/pdf/2207.07284
  • Abstract
    Vision multi-layer perceptrons (MLPs) have shown promising performance in computer vision tasks, and have become the main competitor of CNNs and vision Transformers. They use token-mixing layers to capture cross-token interactions, as opposed to the multi-head self-attention mechanism used by Transformers. However, the heavily parameterized token-mixing layers naturally lack mechanisms to capture local information and multi-granular non-local relations, thus their discriminative power is restrained. To tackle this issue, we propose a new positional spatial gating unit (PoSGU). It exploits the attention formulations used in the classical relative positional encoding (RPE), to efficiently encode the cross-token relations for token mixing. It can successfully reduce the current quadratic parameter complexity $O(N^2)$ of vision MLPs to $O(N)$ and $O(1)$. We experiment with two RPE mechanisms, and further propose a group-wise extension to improve their expressive power with the accomplishment of multi-granular contexts. These then serve as the key building blocks of a new type of vision MLP, referred to as PosMLP. We evaluate the effectiveness of the proposed approach by conducting thorough experiments, demonstrating an improved or comparable performance with reduced parameter complexity. For instance, for a model trained on ImageNet1K, we achieve a performance improvement from 72.14% to 74.02% and a learnable parameter reduction from $19.4M$ to $18.2M$. Code can be found at https://github.com/Zhicaiwww/PosMLP.
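
The parameter-complexity reduction can be pictured with a toy token-mixing layer: a full learnable N x N mixing matrix (O(N^2) parameters) versus weights looked up from a table indexed by relative position (O(N) parameters). This is a simplified illustration of relative-positional token mixing, not the exact PoSGU:

```python
import torch

N = 16                                   # sequence length (illustrative)

# Dense token mixing, shown only for contrast: a full N x N matrix, O(N^2) parameters.
dense_mix = torch.nn.Parameter(torch.randn(N, N))

# Relative-position mixing: one learnable scalar per relative offset, O(N) parameters.
rel_bias = torch.nn.Parameter(torch.randn(2 * N - 1))
idx = torch.arange(N)
rel_index = idx[:, None] - idx[None, :] + (N - 1)   # offsets shifted into [0, 2N-2]
rel_mix = rel_bias[rel_index]                        # (N, N) weights shared across equal offsets

x = torch.randn(N, 8)                     # N tokens with 8 channels
mixed = rel_mix.softmax(dim=-1) @ x       # tokens mixed using relative-position weights only
```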

Z-Index at CheckThat! Lab 2022: Check-Worthiness Identification on Tweet Text

  • Authors: Prerona Tarannum, Firoj Alam, Md. Arid Hasan, Sheak Rashed Haider Noori
  • Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2207.07308
  • Pdf link: https://arxiv.org/pdf/2207.07308
  • Abstract
    The wide use of social media and digital technologies facilitates sharing various news and information about events and activities. Despite the sharing of positive information, misleading and false information is also spreading on social media. There have been efforts to identify such misleading information both manually, by human experts, and with automatic tools. Manual effort does not scale well because of the high volume of information containing factual claims that appears online. Therefore, automatically identifying check-worthy claims can be very useful for human experts. In this study, we describe our participation in Subtask-1A: Check-worthiness of tweets (English, Dutch and Spanish) of the CheckThat! lab at CLEF 2022. We performed standard preprocessing steps and applied different models to identify whether a given text is worthy of fact checking or not. We used an oversampling technique to balance the dataset and applied SVM and Random Forest (RF) with TF-IDF representations. We also used BERT multilingual (BERT-m) and XLM-RoBERTa-base pre-trained models for the experiments. We used BERT-m for the official submissions, and our systems ranked 3rd, 5th, and 12th in Spanish, Dutch, and English, respectively. In further experiments, our evaluation shows that the transformer models (BERT-m and XLM-RoBERTa-base) outperform SVM and RF for Dutch and English, whereas a different pattern is observed for Spanish.
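
A minimal sketch of the classical baseline described above (TF-IDF features, oversampling for class balance, then SVM); the toy data and the choice of imbalanced-learn's RandomOverSampler are illustrative assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from imblearn.over_sampling import RandomOverSampler

# Toy tweets labelled 1 if check-worthy, 0 otherwise (illustrative only).
texts = ["new law cuts hospital funding by 40%", "nice weather today",
         "officials say the vaccine is unsafe", "happy birthday!",
         "great match last night", "loving this song"]
labels = [1, 0, 1, 0, 0, 0]

vec = TfidfVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(texts)

# Oversample the minority (check-worthy) class so the classifier sees a balanced set.
X_bal, y_bal = RandomOverSampler(random_state=0).fit_resample(X, labels)

clf = LinearSVC().fit(X_bal, y_bal)
print(clf.predict(vec.transform(["breaking: 70% of voters affected by the outage"])))
```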

Learning Parallax Transformer Network for Stereo Image JPEG Artifacts Removal

  • Authors: Xuhao Jiang, Weimin Tan, Ri Cheng, Shili Zhou, Bo Yan
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2207.07335
  • Pdf link: https://arxiv.org/pdf/2207.07335
  • Abstract
    Under stereo settings, the performance of image JPEG artifacts removal can be further improved by exploiting the additional information provided by a second view. However, incorporating this information for stereo image JPEG artifacts removal is a huge challenge, since the existing compression artifacts make pixel-level view alignment difficult. In this paper, we propose a novel parallax transformer network (PTNet) to integrate the information from stereo image pairs for stereo image JPEG artifacts removal. Specifically, a well-designed symmetric bi-directional parallax transformer module is proposed to match features with similar textures between different views instead of pixel-level view alignment. Due to the issues of occlusions and boundaries, a confidence-based cross-view fusion module is proposed to achieve better feature fusion for both views, where the cross-view features are weighted with confidence maps. In particular, we adopt a coarse-to-fine design for the cross-view interaction, leading to better performance. Comprehensive experimental results demonstrate that our PTNet can effectively remove compression artifacts and achieves superior performance compared with the other state-of-the-art methods tested.
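
A rough sketch of confidence-weighted cross-view feature fusion as described in the abstract (the generic mechanism, not the actual PTNet module); the shapes and the way the confidence map is produced are assumptions:

```python
import torch

B, C, H, W = 2, 64, 32, 32
feat_self = torch.randn(B, C, H, W)    # features of the target view
feat_cross = torch.randn(B, C, H, W)   # features matched from the other view

# A small head predicts a per-pixel confidence in [0, 1] for the cross-view features,
# intended to be low near occlusions and image boundaries (assumed 1x1-conv design).
conf_head = torch.nn.Sequential(
    torch.nn.Conv2d(2 * C, 1, kernel_size=1),
    torch.nn.Sigmoid(),
)
conf = conf_head(torch.cat([feat_self, feat_cross], dim=1))   # (B, 1, H, W)

# Confidence-weighted fusion: fall back to the view's own features wherever the
# cross-view information is deemed unreliable.
fused = conf * feat_cross + (1.0 - conf) * feat_self
```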

A Dual-Masked Auto-Encoder for Robust Motion Capture with Spatial-Temporal Skeletal Token Completion

  • Authors: Junkun Jiang, Jie Chen, Yike Guo
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2207.07381
  • Pdf link: https://arxiv.org/pdf/2207.07381
  • Abstract
    Multi-person motion capture can be challenging due to ambiguities caused by severe occlusion, fast body movement, and complex interactions. Existing frameworks build on 2D pose estimations and triangulate to 3D coordinates via reasoning the appearance, trajectory, and geometric consistencies among multi-camera observations. However, 2D joint detection is usually incomplete and with wrong identity assignments due to limited observation angle, which leads to noisy 3D triangulation results. To overcome this issue, we propose to explore the short-range autoregressive characteristics of skeletal motion using transformer. First, we propose an adaptive, identity-aware triangulation module to reconstruct 3D joints and identify the missing joints for each identity. To generate complete 3D skeletal motion, we then propose a Dual-Masked Auto-Encoder (D-MAE) which encodes the joint status with both skeletal-structural and temporal position encoding for trajectory completion. D-MAE's flexible masking and encoding mechanism enable arbitrary skeleton definitions to be conveniently deployed under the same framework. In order to demonstrate the proposed model's capability in dealing with severe data loss scenarios, we contribute a high-accuracy and challenging motion capture dataset of multi-person interactions with severe occlusion. Evaluations on both benchmark and our new dataset demonstrate the efficiency of our proposed model, as well as its advantage against the other state-of-the-art methods.
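
A small sketch of the "dual" encoding idea: each joint token gets both a skeletal-structural (which joint) and a temporal (which frame) position embedding added to it. The embedding sizes and the plain additive combination are assumptions, not the exact D-MAE design:

```python
import torch

num_joints, num_frames, dim = 17, 64, 128

joint_embed = torch.nn.Embedding(num_joints, dim)   # skeletal-structural position ("which joint")
frame_embed = torch.nn.Embedding(num_frames, dim)   # temporal position ("which frame")

tokens = torch.randn(num_frames, num_joints, dim)   # per-joint features, possibly masked

j_idx = torch.arange(num_joints)
t_idx = torch.arange(num_frames)
tokens = tokens + joint_embed(j_idx)[None, :, :] + frame_embed(t_idx)[:, None, :]
# Each token now carries both encodings and can be fed to an encoder-decoder
# transformer that reconstructs the masked (missing) joint trajectories.
```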

A Systematic Review and Replicability Study of BERT4Rec for Sequential Recommendation

  • Authors: Aleksandr Petrov, Craig Macdonald
  • Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2207.07483
  • Pdf link: https://arxiv.org/pdf/2207.07483
  • Abstract
    BERT4Rec is an effective model for sequential recommendation based on the Transformer architecture. In the original publication, BERT4Rec claimed superiority over other available sequential recommendation approaches (e.g. SASRec), and it is now frequently being used as a state-of-the-art baseline for sequential recommendations. However, not all subsequent publications confirmed this result; several proposed other models that were shown to outperform BERT4Rec in effectiveness. In this paper we systematically review all publications that compare BERT4Rec with another popular Transformer-based model, namely SASRec, and show that BERT4Rec results are not consistent within these publications. To understand the reasons behind this inconsistency, we analyse the available implementations of BERT4Rec and show that we fail to reproduce results of the original BERT4Rec publication when using their default configuration parameters. However, we are able to replicate the reported results with the original code if training for a much longer amount of time (up to 30x) compared to the default configuration. We also propose our own implementation of BERT4Rec based on the Hugging Face Transformers library, which we demonstrate replicates the originally reported results on 3 out of 4 datasets, while requiring up to 95% less training time to converge. Overall, from our systematic review and detailed experiments, we conclude that BERT4Rec does indeed exhibit state-of-the-art effectiveness for sequential recommendation, but only when trained for a sufficient amount of time. Additionally, we show that our implementation can further benefit from adapting other Transformer architectures that are available in the Hugging Face Transformers library (e.g. using disentangled attention, as provided by DeBERTa, or larger hidden layer size cf. ALBERT).
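
For readers unfamiliar with the setup, a BERT4Rec-style model can be instantiated directly from the Hugging Face Transformers library by treating item IDs as the vocabulary; the configuration values below are illustrative, not the authors' replication settings:

```python
from transformers import BertConfig, BertForMaskedLM

NUM_ITEMS = 40_000                 # size of the item catalogue (illustrative)

config = BertConfig(
    vocab_size=NUM_ITEMS + 2,      # item IDs plus padding/mask special IDs
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=2,
    max_position_embeddings=200,   # maximum interaction-sequence length
)
model = BertForMaskedLM(config)

# Training then proceeds like masked language modelling: randomly mask item IDs in
# each user's interaction sequence and predict them from the surrounding items.
```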

Rethinking Attention Mechanism in Time Series Classification

  • Authors: Bowen Zhao, Huanlai Xing, Xinhan Wang, Fuhong Song, Zhiwen Xiao
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2207.07564
  • Pdf link: https://arxiv.org/pdf/2207.07564
  • Abstract
    Attention-based models have been widely used in many areas, such as computer vision and natural language processing. However, relevant applications in time series classification (TSC) have not been explored deeply yet, so a significant number of TSC algorithms still suffer from the general problems of the attention mechanism, such as quadratic complexity. In this paper, we improve the efficiency and performance of the attention mechanism by proposing flexible multi-head linear attention (FMLA), which enhances locality awareness through layer-wise interactions with deformable convolutional blocks and online knowledge distillation. Furthermore, we propose a simple but effective mask mechanism that helps reduce the noise influence in time series and decreases the redundancy of the proposed FMLA by proportionally masking some positions of each given series. To stabilize this mechanism, samples are forwarded through the model with random mask layers several times and their outputs are aggregated to teach the same model with regular mask layers. We conduct extensive experiments on 85 UCR2018 datasets to compare our algorithm with 11 well-known ones, and the results show that our algorithm has comparable performance in terms of top-1 accuracy. We also compare our model with three Transformer-based models with respect to floating-point operations per second and number of parameters, and find that our algorithm achieves significantly better efficiency with lower complexity.
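
A loose sketch of the mask-and-aggregate training signal described above, phrased as generic self-distillation (not the actual FMLA code); the masking ratio, number of passes, and KL loss are assumptions, and `model` stands in for a classifier returning per-class logits:

```python
import torch
import torch.nn.functional as F

def random_mask(x, ratio=0.2):
    """Zero out a random proportion of time steps in a batch of series (B, T, C)."""
    keep = (torch.rand(x.shape[:2], device=x.device) > ratio).float().unsqueeze(-1)
    return x * keep

def mask_distillation_loss(model, x, num_passes=3):
    # Teacher signal: aggregate several forward passes with random masking.
    with torch.no_grad():
        teacher = torch.stack(
            [model(random_mask(x)).softmax(dim=-1) for _ in range(num_passes)]
        ).mean(dim=0)
    # Student: the same model on the input with its regular masking, trained to
    # match the aggregated teacher distribution.
    student_log_probs = model(x).log_softmax(dim=-1)
    return F.kl_div(student_log_probs, teacher, reduction="batchmean")
```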

Mobile Keystroke Biometrics Using Transformers

  • Authors: Giuseppe Stragapede, Paula Delgado-Santos, Ruben Tolosana, Ruben Vera-Rodriguez, Richard Guest, Aythami Morales
  • Subjects: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
  • Arxiv link: https://arxiv.org/abs/2207.07596
  • Pdf link: https://arxiv.org/pdf/2207.07596
  • Abstract
    Behavioural biometrics have proven to be effective against identity theft while also being considered user-friendly authentication methods. One of the most popular traits in the literature is keystroke dynamics, due to the large deployment of computers and mobile devices in our society. This paper focuses on improving keystroke biometric systems in the free-text scenario. This scenario is characterised as very challenging due to the uncontrolled text conditions, the influence of the user's emotional and physical state, and the application in use. To overcome these drawbacks, methods based on deep learning such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have been proposed in the literature, outperforming traditional machine learning methods. However, these architectures still have aspects that need to be reviewed and improved. To the best of our knowledge, this is the first study that proposes keystroke biometric systems based on Transformers. The proposed Transformer architecture achieves Equal Error Rate (EER) values of 3.84% on the popular Aalto mobile keystroke database using only 5 enrolment sessions, outperforming other state-of-the-art approaches in the literature by a large margin.

Position Prediction as an Effective Pretraining Strategy

  • Authors: Shuangfei Zhai, Navdeep Jaitly, Jason Ramapuram, Dan Busbridge, Tatiana Likhomanenko, Joseph Yitan Cheng, Walter Talbott, Chen Huang, Hanlin Goh, Joshua Susskind
  • Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
  • Arxiv link: https://arxiv.org/abs/2207.07611
  • Pdf link: https://arxiv.org/pdf/2207.07611
  • Abstract
    Transformers have gained increasing popularity in a wide range of applications, including Natural Language Processing (NLP), Computer Vision and Speech Recognition, because of their powerful representational capacity. However, harnessing this representational capacity effectively requires a large amount of data, strong regularization, or both, to mitigate overfitting. Recently, the power of the Transformer has been unlocked by self-supervised pretraining strategies based on masked autoencoders which rely on reconstructing masked inputs, directly, or contrastively from unmasked content. This pretraining strategy which has been used in BERT models in NLP, Wav2Vec models in Speech and, recently, in MAE models in Vision, forces the model to learn about relationships between the content in different parts of the input using autoencoding related objectives. In this paper, we propose a novel, but surprisingly simple alternative to content reconstruction -- that of predicting locations from content, without providing positional information for it. Doing so requires the Transformer to understand the positional relationships between different parts of the input, from their content alone. This amounts to an efficient implementation where the pretext task is a classification problem among all possible positions for each input token. We experiment on both Vision and Speech benchmarks, where our approach brings improvements over strong supervised training baselines and is comparable to modern unsupervised/self-supervised pretraining methods. Our method also enables Transformers trained without position embeddings to outperform ones trained with full position information.
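
A minimal sketch of the pretext task as described: encode patch content without position embeddings and train a per-token classifier over all possible positions. The model sizes and patch setup are illustrative assumptions:

```python
import torch
import torch.nn as nn

num_patches, dim = 196, 256        # e.g. a 14x14 patch grid (illustrative)

encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)   # note: no position embeddings added
pos_head = nn.Linear(dim, num_patches)                          # classify each token's position

patch_tokens = torch.randn(2, num_patches, dim)   # content embeddings only, no positional info
logits = pos_head(encoder(patch_tokens))          # (2, num_patches, num_patches)

# Pretext target: every token must predict its own index among all possible positions.
targets = torch.arange(num_patches).expand(2, -1)
loss = nn.functional.cross_entropy(logits.reshape(-1, num_patches), targets.reshape(-1))
```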

Keyword: autonomous driving

Stochastic Market Games

  • Authors: Kyrill Schmid, Lenz Belzner, Robert Müller, Johannes Tochtermann, Claudia Linhoff-Popien
  • Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2207.07388
  • Pdf link: https://arxiv.org/pdf/2207.07388
  • Abstract
    Some of the most relevant future applications of multi-agent systems like autonomous driving or factories as a service display mixed-motive scenarios, where agents might have conflicting goals. In these settings agents are likely to learn undesirable outcomes in terms of cooperation under independent learning, such as overly greedy behavior. Motivated from real world societies, in this work we propose to utilize market forces to provide incentives for agents to become cooperative. As demonstrated in an iterated version of the Prisoner's Dilemma, the proposed market formulation can change the dynamics of the game to consistently learn cooperative policies. Further we evaluate our approach in spatially and temporally extended settings for varying numbers of agents. We empirically find that the presence of markets can improve both the overall result and agent individual returns via their trading activities.

ST-P3: End-to-end Vision-based Autonomous Driving via Spatial-Temporal Feature Learning

  • Authors: Shengchao Hu, Li Chen, Penghao Wu, Hongyang Li, Junchi Yan, Dacheng Tao
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2207.07601
  • Pdf link: https://arxiv.org/pdf/2207.07601
  • Abstract
    Many existing autonomous driving paradigms involve a multi-stage discrete pipeline of tasks. To better predict the control signals and enhance user safety, an end-to-end approach that benefits from joint spatial-temporal feature learning is desirable. While there are some pioneering works on LiDAR-based input or implicit design, in this paper we formulate the problem in an interpretable vision-based setting. In particular, we propose a spatial-temporal feature learning scheme towards a set of more representative features for perception, prediction and planning tasks simultaneously, which is called ST-P3. Specifically, an egocentric-aligned accumulation technique is proposed to preserve geometry information in 3D space before the bird's eye view transformation for perception; a dual pathway modeling is devised to take past motion variations into account for future prediction; a temporal-based refinement unit is introduced to compensate for recognizing vision-based elements for planning. To the best of our knowledge, we are the first to systematically investigate each part of an interpretable end-to-end vision-based autonomous driving system. We benchmark our approach against previous state-of-the-arts on both open-loop nuScenes dataset as well as closed-loop CARLA simulation. The results show the effectiveness of our method. Source code, model and protocol details are made publicly available at https://github.com/OpenPerceptionX/ST-P3.

DOLPHINS: Dataset for Collaborative Perception enabled Harmonious and Interconnected Self-driving

  • Authors: Ruiqing Mao, Jingyu Guo, Yukuan Jia, Yuxuan Sun, Sheng Zhou, Zhisheng Niu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
  • Arxiv link: https://arxiv.org/abs/2207.07609
  • Pdf link: https://arxiv.org/pdf/2207.07609
  • Abstract
    Vehicle-to-Everything (V2X) network has enabled collaborative perception in autonomous driving, which is a promising solution to the fundamental defect of stand-alone intelligence including blind zones and long-range perception. However, the lack of datasets has severely blocked the development of collaborative perception algorithms. In this work, we release DOLPHINS: Dataset for cOllaborative Perception enabled Harmonious and INterconnected Self-driving, as a new simulated large-scale various-scenario multi-view multi-modality autonomous driving dataset, which provides a ground-breaking benchmark platform for interconnected autonomous driving. DOLPHINS outperforms current datasets in six dimensions: temporally-aligned images and point clouds from both vehicles and Road Side Units (RSUs) enabling both Vehicle-to-Vehicle (V2V) and Vehicle-to-Infrastructure (V2I) based collaborative perception; 6 typical scenarios with dynamic weather conditions make the most various interconnected autonomous driving dataset; meticulously selected viewpoints providing full coverage of the key areas and every object; 42376 frames and 292549 objects, as well as the corresponding 3D annotations, geo-positions, and calibrations, compose the largest dataset for collaborative perception; Full-HD images and 64-line LiDARs construct high-resolution data with sufficient details; well-organized APIs and open-source codes ensure the extensibility of DOLPHINS. We also construct a benchmark of 2D detection, 3D detection, and multi-view collaborative perception tasks on DOLPHINS. The experiment results show that the raw-level fusion scheme through V2X communication can help to improve the precision as well as to reduce the necessity of expensive LiDAR equipment on vehicles when RSUs exist, which may accelerate the popularity of interconnected self-driving vehicles. DOLPHINS is now available on https://dolphins-dataset.net/.