Please consider citing our paper as:
```
@ARTICLE{10264066,
  author={Hou, Yuanbo and Song, Siyang and Yu, Chuang and Wang, Wenwu and Botteldooren, Dick},
  journal={IEEE Signal Processing Letters},
  title={Audio Event-Relational Graph Representation Learning for Acoustic Scene Classification},
  year={2023},
  volume={30},
  number={},
  pages={1382-1386},
  doi={10.1109/LSP.2023.3319233}}
```
Figure presents the confusion matrix of ERGL and the distribution visualization of graph representations learned by ERGL. In Figure (a), most misclassified samples appear between \textit{airp.} (airport) and \textit{mall} (shopping mall), and between \textit{bus} and \textit{tram}. The distribution of graph representations in Figure (b) shows that the results are consistent with the confusion matrix.
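The confusion matrix above can be reproduced in a few lines; the sketch below uses hypothetical labels (not actual ERGL outputs) just to show the row/column convention:

```python
import numpy as np

# Hypothetical scene labels; the paper uses the 10 TAU scene classes.
scenes = ["airport", "bus", "metro", "metro_station", "park",
          "public_square", "shopping_mall", "street_pedestrian",
          "street_traffic", "tram"]

def confusion_matrix(y_true, y_pred, n_classes):
    """Row = true class, column = predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Toy example: two airport clips, one misclassified as shopping mall.
y_true = [0, 0, 6]
y_pred = [0, 6, 6]
cm = confusion_matrix(y_true, y_pred, len(scenes))
```

Off-diagonal mass (e.g. `cm[0, 6]`) is exactly the airport/mall confusion discussed above.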
Based on the Pearson correlation coefficients of the predictions on the test set, we visualize the correlation between the 10 scene classes and the 25 event classes, which are automatically selected by the model. As can be seen from the figure, on the test set:
- For scene airport, the related typical audio events are [Speech, Tick];
- For scene bus, the related typical audio events are [Car, Vehicle, Inside];
- For scene metro, the related typical audio events are [Rail, Vehicle, Railroad];
- For scene metro station, the related typical audio events are [Train, Rail, Railroad];
- For scene park, the related typical audio events are [Bird, Outside, Animal, Silence];
- For scene public square, the related typical audio events are [Clip-clop, Outside, Animal, Raindrop, Patter];
- For scene shopping mall, the related typical audio events are [Music, Speech];
- For scene street pedestrian, the related typical audio events are [Clip-clop, Tick, Patter, Rain];
- For scene street traffic, the related typical audio events are [Noise, Rain, Fan, Car, Vehicle];
- For scene tram, the related typical audio events are [Vehicle, Inside, Car].
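The scene–event correlation map can be sketched as follows. The array shapes and random inputs are placeholders, not real ERGL predictions; only the Pearson computation itself is the point:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical test-set outputs: N clips, 10 scene scores, 25 event scores.
N, n_scenes, n_events = 200, 10, 25
scene_probs = rng.random((N, n_scenes))
event_probs = rng.random((N, n_events))

def pearson_matrix(A, B):
    """Pearson correlation between every column of A and every column of B."""
    A = (A - A.mean(axis=0)) / A.std(axis=0)
    B = (B - B.mean(axis=0)) / B.std(axis=0)
    return A.T @ B / A.shape[0]

corr = pearson_matrix(scene_probs, event_probs)  # shape (10, 25)

# "Typical" events for scene 0 = the most positively correlated event columns.
top_events_for_scene0 = np.argsort(corr[0])[::-1][:3]
```

Taking the top-correlated event columns per scene row reproduces lists like [Speech, Tick] for airport above.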
This figure illustrates that AE-based relational edges effectively help ERGL improve its accuracy in 6 scenes: \textit{airp.} (airport), \textit{bus}, \textit{metro}, \textit{park}, \textit{squa.} (public square), and \textit{traff.} (street traffic). The accuracy of \textit{airp.} improves the most, mainly because the misclassified samples between \textit{airp.} and \textit{stat.} (metro station), and between \textit{airp.} and \textit{pedes.} (street pedestrian), are reduced from 30 and 39 to 8 and 16, respectively. In contrast, introducing MEL increases the misclassified samples between \textit{stat.} and \textit{metro}. Even for humans, it is challenging to distinguish these similar scenes relying on audio alone. In short, introducing AE-based relational edges effectively improves the performance of ERGL in 6 scenes, increasing its \textit{Acc} from 73.35% to 78.08%.
To further explore the reasons for the misclassification between similar scenes in graph representation-based classification, Figure presents the structure of graph representations of audio clips from different scenes.
Graph structures of test audio samples from similar scenes. Green dots represent 25 classes of AEs used in this paper. A larger dot denotes a higher probability of the event. A thicker line denotes a larger edge value between nodes in the graph representation.
For visualization, a node is considered inactive if its probability is less than 0.1. Edges between two inactive nodes are not displayed. Multi-dimensional features of the edge are represented by their mean value.
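The pruning rules above (inactive nodes below 0.1, hidden edges between inactive nodes, edge features collapsed to their mean) can be sketched as below; the edge-feature dimension of 8 and the random inputs are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n_nodes = 25                                    # one node per AE class
node_prob = rng.random(n_nodes)                 # event probabilities for one clip
edge_feat = rng.random((n_nodes, n_nodes, 8))   # multi-dimensional edge features

active = node_prob >= 0.1                # nodes under 0.1 are treated as inactive
edge_value = edge_feat.mean(axis=-1)     # represent each edge by its mean value

# An edge is displayed only if at least one of its endpoints is active.
keep = active[:, None] | active[None, :]
edges = [(i, j, edge_value[i, j])
         for i in range(n_nodes) for j in range(i + 1, n_nodes)
         if keep[i, j]]
```

Node size then maps to `node_prob` and line thickness to the kept `edge_value`s.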
The dominant AEs in the \textit{airp.} and \textit{mall} scenes are \textit{speech} and \textit{music}, so the connections in the graph are mainly gathered around \textit{speech} and \textit{music}. This may be why ERGL confuses the \textit{airp.} and \textit{mall} scenes. Beyond these similarities, the third most prominent audio event in the airport is \textit{silence}, while in the shopping mall it is \textit{animal} sounds, which reflects the differences between the two scenes.
The dominant audio events in bus and tram scenes are \textit{vehicle}, \textit{music}, \textit{speech}, \textit{train}, \textit{car} and \textit{silence}. The audio events with dominant connections are the same: \textit{vehicle}, \textit{music} and \textit{silence}.
With similar dominant events and graph structures, the model tends to be confused by these similar scenes.
The top 25 events are selected over the entire dataset and are not tailored to each individual target scene, so in a given scene graph some events may look out of place. At the semantic level, these 25 event classes are somewhat insufficient for describing 10 different scene classes. Note that the top 25 events are chosen automatically by the classification model, without involving any artificial prior knowledge.
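The dataset-level (rather than per-scene) selection described above might be sketched as follows; the 527-class tagger output and the averaging criterion are assumptions for illustration, since the paper's model selects the events through training rather than a fixed statistic:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical event probabilities from a pretrained tagger
# (e.g. 527 AudioSet classes) over every clip in the dataset.
n_clips, n_tagger_classes = 500, 527
event_probs = rng.random((n_clips, n_tagger_classes))

k = 25
mean_prob = event_probs.mean(axis=0)       # one dataset-wide score per event
top_k = np.argsort(mean_prob)[::-1][:k]    # indices of the 25 retained events
```

Because the score is pooled over all clips, an event that is frequent overall can end up in a scene graph where it is semantically odd.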