Please consider citing our paper as:
```
@ARTICLE{10264066,
  author={Hou, Yuanbo and Song, Siyang and Yu, Chuang and Wang, Wenwu and Botteldooren, Dick},
  journal={IEEE Signal Processing Letters},
  title={Audio Event-Relational Graph Representation Learning for Acoustic Scene Classification},
  year={2023},
  volume={30},
  number={},
  pages={1382-1386},
  doi={10.1109/LSP.2023.3319233}}
```
Figure presents the confusion matrix of ERGL and the distribution visualization of graph representations learned by ERGL. In Figure (a), most misclassified samples appear between \textit{airp.} (airport) and \textit{mall} (shopping mall), and between \textit{bus} and \textit{tram}. The distribution of graph representations in Figure (b) shows that the results are consistent with the confusion matrix.
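The confusion matrix above can be reproduced in a few lines; the sketch below uses hypothetical labels (not actual ERGL outputs) just to show the row/column convention:

```python
import numpy as np

# Hypothetical scene labels; the paper uses the 10 TAU scene classes.
scenes = ["airport", "bus", "metro", "metro_station", "park",
          "public_square", "shopping_mall", "street_pedestrian",
          "street_traffic", "tram"]

def confusion_matrix(y_true, y_pred, n_classes):
    """Row = true class, column = predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Toy example: two airport clips, one misclassified as shopping mall.
y_true = [0, 0, 6]
y_pred = [0, 6, 6]
cm = confusion_matrix(y_true, y_pred, len(scenes))
```

Off-diagonal mass (e.g. `cm[0, 6]`) is exactly the airport/mall confusion discussed above.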
Based on the Pearson correlation coefficients of the predictions on the test set, we visualize the correlation between the 10 scene classes and the 25 event classes, which are automatically selected by the model. As can be seen from the figure, on the test set:
- For scene airport, the related typical audio events are [Speech, Tick];
- For scene bus, the related typical audio events are [Car, Vehicle, Inside];
- For scene metro, the related typical audio events are [Rail, Vehicle, Railroad];
- For scene metro station, the related typical audio events are [Train, Rail, Railroad];
- For scene park, the related typical audio events are [Bird, Outside, Animal, Silence];
- For scene public square, the related typical audio events are [Clip-clop, Outside, Animal, Raindrop, Patter];
- For scene shopping mall, the related typical audio events are [Music, Speech];
- For scene street pedestrian, the related typical audio events are [Clip-clop, Tick, Patter, Rain];
- For scene street traffic, the related typical audio events are [Noise, Rain, Fan, Car, Vehicle];
- For scene tram, the related typical audio events are [Vehicle, Inside, Car].
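The scene–event correlation map can be sketched as follows. The array shapes and random inputs are placeholders, not real ERGL predictions; only the Pearson computation itself is the point:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical test-set outputs: N clips, 10 scene scores, 25 event scores.
N, n_scenes, n_events = 200, 10, 25
scene_probs = rng.random((N, n_scenes))
event_probs = rng.random((N, n_events))

def pearson_matrix(A, B):
    """Pearson correlation between every column of A and every column of B."""
    A = (A - A.mean(axis=0)) / A.std(axis=0)
    B = (B - B.mean(axis=0)) / B.std(axis=0)
    return A.T @ B / A.shape[0]

corr = pearson_matrix(scene_probs, event_probs)  # shape (10, 25)

# "Typical" events for scene 0 = the most positively correlated event columns.
top_events_for_scene0 = np.argsort(corr[0])[::-1][:3]
```

Taking the top-correlated event columns per scene row reproduces lists like [Speech, Tick] for airport above.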
This figure illustrates that AE-based relational edges effectively help ERGL improve its accuracy in 6 scenes: \textit{airp.} (airport), \textit{bus}, \textit{metro}, \textit{park}, \textit{squa.} (public square), and \textit{traff.} (street traffic). The accuracy of \textit{airp.} improves the most, mainly because the misclassified samples between \textit{airp.} and \textit{stat.} (metro station), and between \textit{airp.} and \textit{pedes.} (street pedestrian), are reduced from 30 and 39 to 8 and 16, respectively. In contrast, introducing MEL increases the misclassified samples between \textit{stat.} and \textit{metro}. Even for humans, it is challenging to distinguish these similar scenes relying on audio alone. In short, introducing AE-based relational edges effectively improves the performance of ERGL in 6 scenes, increasing its \textit{Acc} from 73.35% to 78.08%.
To further explore the reasons for the misclassification between similar scenes in graph representation-based classification, Figure presents the structure of graph representations of audio clips from different scenes.
Graph structures of test audio samples from similar scenes. Green dots represent 25 classes of AEs used in this paper. A larger dot denotes a higher probability of the event. A thicker line denotes a larger edge value between nodes in the graph representation.
For visualization, a node is considered inactive if its probability is less than 0.1. Edges between two inactive nodes are not displayed. Multi-dimensional features of the edge are represented by their mean value.
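The pruning rules above (inactive nodes below 0.1, hidden edges between inactive nodes, edge features collapsed to their mean) can be sketched as below; the edge-feature dimension of 8 and the random inputs are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n_nodes = 25                                    # one node per AE class
node_prob = rng.random(n_nodes)                 # event probabilities for one clip
edge_feat = rng.random((n_nodes, n_nodes, 8))   # multi-dimensional edge features

active = node_prob >= 0.1                # nodes under 0.1 are treated as inactive
edge_value = edge_feat.mean(axis=-1)     # represent each edge by its mean value

# An edge is displayed only if at least one of its endpoints is active.
keep = active[:, None] | active[None, :]
edges = [(i, j, edge_value[i, j])
         for i in range(n_nodes) for j in range(i + 1, n_nodes)
         if keep[i, j]]
```

Node size then maps to `node_prob` and line thickness to the kept `edge_value`s.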
The dominant AEs in the \textit{airp.} and \textit{mall} scenes are \textit{speech} and \textit{music}, so the connections in the graph are mainly gathered around \textit{speech} and \textit{music}. This may be why ERGL confuses the \textit{airp.} and \textit{mall} scenes. Beyond these similarities, the third most prominent audio event in the airport is \textit{silence}, while in the shopping mall it is \textit{animal} sounds, which reflects the differences between the two scenes.
The dominant audio events in bus and tram scenes are \textit{vehicle}, \textit{music}, \textit{speech}, \textit{train}, \textit{car} and \textit{silence}. The audio events with dominant connections are the same: \textit{vehicle}, \textit{music} and \textit{silence}.
With similar dominant events and graph structures, the model tends to be confused by these similar scenes.
The top 25 events are selected over the entire dataset and are not tailored to each individual target scene, so in a given scene graph some events may look out of place. At the semantic level, these 25 event classes are somewhat insufficient for describing 10 different scene classes. Note that the top 25 events are chosen automatically by the classification model, without involving any artificial prior knowledge.
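The dataset-level (rather than per-scene) selection described above might be sketched as follows; the 527-class tagger output and the averaging criterion are assumptions for illustration, since the paper's model selects the events through training rather than a fixed statistic:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical event probabilities from a pretrained tagger
# (e.g. 527 AudioSet classes) over every clip in the dataset.
n_clips, n_tagger_classes = 500, 527
event_probs = rng.random((n_clips, n_tagger_classes))

k = 25
mean_prob = event_probs.mean(axis=0)       # one dataset-wide score per event
top_k = np.argsort(mean_prob)[::-1][:k]    # indices of the 25 retained events
```

Because the score is pooled over all clips, an event that is frequent overall can end up in a scene graph where it is semantically odd.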