AdaUchendu/AwesomeTDA4NLP


Unveiling Topological Structures from Language: A Comprehensive Survey of Topological Data Analysis Applications in NLP

Overview

The surge of data available on the internet has led to the adoption of various computational methods to analyze and extract valuable insights from this wealth of information. Among these, the field of Machine Learning (ML) has thrived by leveraging data to extract meaningful insights. However, ML techniques face notable challenges when dealing with real-world data, often due to issues of imbalance, noise, insufficient labeling, and high dimensionality. To address these limitations, some researchers advocate for the adoption of Topological Data Analysis (TDA), a statistical approach that discerningly captures the intrinsic shape of data despite noise. Despite its potential, TDA has not gained as much traction within the Natural Language Processing (NLP) domain compared to structurally distinct areas like computer vision. Nevertheless, a dedicated community of researchers has been exploring the application of TDA in NLP, yielding 97 papers we comprehensively survey in this paper. Our findings categorize these efforts into theoretical and non-theoretical approaches. Theoretical approaches aim to explain linguistic phenomena from a topological viewpoint, while non-theoretical approaches merge TDA with ML features, utilizing diverse numerical representation techniques. We conclude by exploring the challenges and unresolved questions that persist in this niche field.
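As a rough illustration of what "capturing the shape of data" can mean in practice, the sketch below computes 0-dimensional persistent homology (the scales at which connected components merge) for a small point cloud. It is not taken from the survey or from any listed paper: the function name `h0_persistence` and the toy 2-D vectors are hypothetical, and real pipelines would typically run a dedicated TDA library on actual text embeddings.

```python
import math
from itertools import combinations


def h0_persistence(points):
    """Death scales of connected components in a Vietoris-Rips filtration.

    Every point is born at scale 0. A component "dies" at the length of the
    edge that merges it into another component. The single component that
    survives forever is omitted from the result.
    """
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, processed from shortest to longest.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )

    deaths = []
    for dist, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j  # merge: one component dies at this scale
            deaths.append(dist)
    return deaths


# Toy 2-D "embeddings" (made-up values): two tight clusters far apart.
toy_vectors = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
deaths = h0_persistence(toy_vectors)
print(deaths)  # [1.0, 1.0, 10.0]
```

The two short-lived components (death at 1.0) reflect points merging within each cluster, while the long-lived one (death at 10.0) reflects the gap between clusters. Persistence values like these are the kind of feature that the non-theoretical approaches feed into ML models.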

PDF file

How to cite the Repo or Paper

@article{uchendu2024unveiling,
  title={Unveiling Topological Structures in Text: A Comprehensive Survey of Topological Data Analysis Applications in NLP},
  author={Uchendu, Adaku and Le, Thai},
  journal={arXiv preprint arXiv:2411.10298},
  year={2024},
  url={https://arxiv.org/abs/2411.10298} 
}

Paper List

1. Theoretical Approaches

1a. Topological Space

Semantic Topological Space

  • Semantic topology. Jussi Karlgren, Martin Bohman, Ariel Ekgren, Gabriel Isheden, Emelie Kullmann, and David Nilsson. Proceedings of the 23rd ACM International Conference on Information and Knowledge Management (2014) [link]
  • Context-aware profiling of concepts from a semantic topological space. Knowledge-Based Systems (2017) [link]

Syntactic Topological Space

  • Persistent topology of syntax. Alexander Port, Iulia Gheorghita, Daniel Guth, John M Clark, Crystal Liang, Shival Dasu, and Matilde Marcolli. Mathematics in Computer Science. (2018) [link]
  • Topological analysis of syntactic structures. Alexander Port, Taelin Karidi, and Matilde Marcolli. Mathematics in Computer Science (2022) [link]

1b. Topology of Topic Evolution

  • A simplified topological representation of text for local and global context. Ishrat Rahman Sami and Katayoun Farrahi. Proceedings of the 25th ACM International Conference on Multimedia. (2017) [link]

1c. Topological “Shape” of Words

  • The shape of word embeddings: Quantifying non-isometry with topological data analysis. Ondřej Draganov and Steven Skiena. Findings of the Association for Computational Linguistics: EMNLP 2024. (2024) [link]
  • The shape of words-topological structure in natural language data. Stephen Fitz. Topological, Algebraic, and Geometric Learning Workshops 2022. (2022) [link]
  • Hidden holes: topological aspects of language models. Stephen Fitz, Peter Romero, and Jiyan Jonas Schneider. arXiv preprint arXiv:2406.05798. (2024) [link]
  • Linguistics from a topological viewpoint. Rui Dong. arXiv preprint arXiv:2403.15440. (2024) [link]
  • An application of persistent homology and the graph theory to linguistics: The case of Tifinagh and Phoenician scripts. Hajar Bouazzaoui, Mohamed Abdou Elomary, My Ismail Mamouni. Statistics in Transition new series 22.3. (2021) [link]

2. Non-theoretical Approaches

2a. TF-IDF

  • Persistent homology: An introduction and a new text representation for natural language processing. Xiaojin Zhu. IJCAI (2013) [link]
  • Text Classification via Topological Data Analysis. Bendik Løvlie. Master’s thesis, Norwegian University of Science and Technology (NTNU) (2023) [link]
  • Movie genre detection using topological data analysis. Pratik Doshi and Wlodek Zadrozny. Statistical Language and Speech Processing: 6th International Conference, SLSP 2018 (2018) [link]
  • Genre classification: A topological data analysis approach. Kevin Shin. (2019) [link]
  • Text Mining via Homology. Blaž Sovdat. Master’s thesis, University of Ljubljana (2016) [link]
  • Topological data analysis for discourse semantics? Ketki Savle, Wlodek Zadrozny, and Minwoo Lee. Proceedings of the 13th International Conference on Computational Semantics-Student Papers (2019) [link]
  • Does the geometry of word embeddings help document classification? Paul Michel, Abhilasha Ravichander, and Shruti Rijhwani. Proceedings of the 2nd Workshop on Representation Learning for NLP. (2017) [link]
  • A topological collapse for document summarization. Hui Guan, Wen Tang, Hamid Krim, James Keiser, Andrew Rindos, and Radmila Sazdanovic. 2016 IEEE 17th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC). (2016) [link]
  • Topic detection in Twitter using topology data analysis. Pablo Torres-Tramón, Hugo Hromic, and Bahareh Rahmanzadeh Heravi. Current Trends in Web Engineering. (2015) [link]
  • Extractive text summarization using topological features. Ankit Kumar and Apurba Sarkar. International Workshop on Combinatorial Image Analysis. (2022) [link]
  • Novel topological shapes of model interpretability. Hendrik Jacob van Veen. TDA and Beyond at the 34th Conference on Neural Information Processing Systems (NeurIPS 2020). (2020) [link]
  • An introduction to a new text classification and visualization for natural language processing using topological data analysis. Naiereh Elyasi and Mehdi Hosseini Moghadam. arXiv preprint arXiv:1906.01726. (2019) [link]
  • Topological data analysis of open-ended responses. Bright Effah. Ph.D. thesis, University of Cape Coast (2017) [link]
  • The shape of poems. Lamis Maadarani and Sayonita Ghosh Hajra. Sac State Scholars' Fall Poster Forum (2020) [link]

2b. Pre-trained Embeddings

Word2Vec

  • Story trees: Representing documents using topological persistence. Pantea Haghighatkhah, Antske Fokkens, Pia Sommerauer, Bettina Speckmann, and Kevin Verbeek. Proceedings of the Thirteenth LREC 2022 (2022) [link]
  • Topological analysis of contradictions in text. Xiangcheng Wu, Xi Niu, and Ruhani Rahman. Proceedings of the 45th International ACM SIGIR. (2022) [link]
  • Comparison of word embeddings of unaligned audio and text data using persistent homology. Zhandos Yessenbayev and Zhanibek Kozhirbayev. International Conference on Speech and Computer. (2022) [link]
  • Use of riemannian distance metric to verify topological similarity of acoustic and text domains. Zhandos Yessenbayev and Zhanibek Kozhirbayev. International Conference on Artificial Neural Networks. (2024) [link]
  • An explainable topological search engine with giotto-tda. Filip Cornell. gtda-challenge-2020. (2020) [link]
  • Topological analysis of averaged sentence embeddings. Wesley J Holmes. Master’s thesis, Wright State University. (2020) [link]
  • Topological data analysis for word sense disambiguation. Michael Rawson, Samuel Dooley, Mithun Bharadwaj, and Rishabh Choudhary. arXiv preprint arXiv:2203.00565. (2022) [link]
  • Local homology of word embeddings. Tadas Temčinas. arXiv preprint arXiv:1810.10136. (2018) [link]
  • Geometry of textual data augmentation: Insights from large language models. Sherry JH Feng, Edmund MK Lai, and Weihua Li. Electronics. (2024) [link]
  • Con connections: Detecting fraud from abstracts using topological data analysis. Sarah Tymochko, Julien Chaput, Timothy Doster, Emilie Purvine, Jackson Warley, and Tegan Emerson. 20th IEEE International Conference on Machine Learning and Applications (ICMLA). (2021) [link]
  • Topological data analysis on simple english wikipedia articles. Matthew Wright and Xiaojun Zheng. The PUMP Journal of Undergraduate Research. (2020) [link]
  • Argumentative topology: Finding loop (holes) in logic. Sarah Tymochko, Zachary New, Lucius Bynum, Emilie Purvine, Timothy Doster, Julien Chaput, and Tegan Emerson. arXiv preprint arXiv:2011.08952. (2020) [link]
  • Prediction of disease type from topological features of time series. Giovanni Petri and Antonio Leitao. gtda-challenge-2020. (2020) [link]
  • Summary and distance between sets of texts based on topological data analysis. Eduardo Paluzo Hidalgo, Rocío González Díaz, and Miguel Ángel Gutiérrez Naranjo. arXiv preprint arXiv:1912.09253. (2019) [link]
  • Detecting Narrative Shifts through Persistent Structures: A Topological Analysis of Media Discourse. Mark M. Bailey, Mark I. Heiligman. arXiv preprint arXiv:2506.14836. (2025) [link]

GloVe

  • Topological signature of 19th century novelists: Persistent homology in text mining. Shafie Gholizadeh, Armin Seyeditabari, and Wlodek Zadrozny. Big Data and Cognitive Computing (2018) [link]
  • A novel method of extracting topological features from word embeddings. Shafie Gholizadeh, Armin Seyeditabari, and Wlodek Zadrozny. arXiv preprint arXiv:2003.13074. (2020) [link]
  • Topological interpretability for deep learning. Adam Spannaus, Heidi A Hanson, Georgia Tourassi, and Lynne Penberthy. Proceedings of the Platform for Advanced Scientific Computing Conference. (2024) [link]
  • A note on argumentative topology: Circularity and syllogisms as unsolved problems. Wlodek W Zadrozny. arXiv preprint arXiv:2102.03874. (2021) [link]
  • The Hidden Shape of Data: Topological Data Analysis for Anxiety Detection in Text. Morgan Byers. Ph.D. thesis, Texas State University. (2021) [link]
  • Unsupervised geometric and topological approaches for cross-lingual sentence representation and comparison. Shaked Haim Meirom and Omer Bobrowski. Proceedings of the 7th Workshop on Representation Learning for NLP @ ACL 2022 (2022) [link]
  • Abstraction, reasoning and deep learning: A study of the "look and say" sequence. Wlodek W Zadrozny. arXiv preprint arXiv:2109.12755. (2021) [link]
  • Topological data analysis helps to improve accuracy of deep learning models for fake news detection trained on very small training sets. Ran Deng and Fedor Duzhin. Big Data and Cognitive Computing (2022) [link]
  • Network and topological analysis of scholarly metadata: A platform to model and predict collaboration. Lance Novak. Master’s thesis, Purdue University. (2019) [link]

FastText

  • Topology of word embeddings: Singularities reflect polysemy. Alexander Jakubowski, Milica Gasic, and Marcus Zibrowius. Proceedings of the Ninth Joint Conference on Lexical and Computational Semantics (2020) [link]
  • Analysis of word embeddings: A clustering and topological approach. Jonas Folkvord Triki. Master’s thesis, The University of Bergen. (2021) [link]
  • An analysis of the effect of polysemy on the topology of the latent manifold. Denis Shehu. Master’s thesis, Eindhoven University of Technology. (2024) [link]

ELMo

  • Con connections: Detecting fraud from abstracts using topological data analysis. Sarah Tymochko, Julien Chaput, Timothy Doster, Emilie Purvine, Jackson Warley, and Tegan Emerson. 20th IEEE International Conference on Machine Learning and Applications (ICMLA). (2021) [link]

Transformers

CLS

  • Intrinsic dimension estimation for robust detection of ai-generated texts. Eduard Tulchinskii, Kristian Kuznetsov, Laida Kushnareva, Daniil Cherniavskii, Sergey Nikolenko, Evgeny Burnaev, Serguei Barannikov, and Irina Piontkovskaya. Advances in Neural Information Processing Systems. (2024) [link]
  • Ai-generated text boundary detection with roft. Laida Kushnareva, Tatiana Gaintseva, German Magai, Serguei Barannikov, Dmitry Abulkhanov, Kristian Kuznetsov, Eduard Tulchinskii, Irina Piontkovskaya, and Sergey Nikolenko. 1st Conference on Language Modeling (COLM). (2024) [link]
  • Estimating class separability of text embeddings with persistent homology. Kostis Gourgoulias, Najah Ghalyan, Maxime Labonne, Sean Moran, Joseph Sabelja. Transactions on Machine Learning Research. (2024) [link]
  • Topobert: Exploring the topology of fine-tuned word representations. Archit Rathore, Yichu Zhou, Vivek Srikumar, and Bei Wang. Information Visualization (2023) [link]
  • Combining topological signature with text embeddings: Multi-modal approach to fake news detection. Rachel Lavery, Anna Jurek-Loughrey, and Lu Bai. 35th Irish Signals and Systems Conference (ISSC) (2024) [link]
  • Persistence homology of tedtalk: Do sentence embeddings have a topological shape? Shouman Das, Syed A Haque, and Md Iftekhar Tanveer. arXiv preprint arXiv:2103.14131. (2021) [link]
  • Topic modeling with topological data analysis. Ciarán Byrne, Danijela Horak, Karo Moilanen, and Amandla Mabona. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. (2022) [link]
  • Short-PHD: Detecting Short LLM-generated Text with Topological Data Analysis After Off-topic Content Insertion. Dongjun Wei, Minjia Mao, Xiao Fang, Michael Chau. arXiv preprint arXiv:2504.02873 (2025) [link]
  • Topological Data Analysis for Distinguishing Human-Written and AI-Generated Abstracts. Ann Guilinger, Eli Best, Vinay Awasthi. preprints.org (2025) [link]
  • Topo Goes Political: TDA-Based Controversy Detection in Imbalanced Reddit Political Data. Arvindh Arun, Karuna K Chandra, Akshit Sinha, Balakumar Velayutham, Jashn Arora, Manish Jain, Ponnurangam Kumaraguru. Companion Proceedings of the ACM on Web Conference 2025. (2025) [link]

Hidden

  • Topformer: Topology-aware authorship attribution of deepfake texts with diverse writing styles. Adaku Uchendu, Thai Le, and Dongwon Lee. ECAI 2024. (2024) [link]
  • Applications of topological data analysis to natural language processing and computer vision. Jason S Garcia. Ph.D. thesis, Colorado State University. (2022) [link]
  • Persistent topological features in large language models. Yuri Gardinazzi, Giada Panerai, Karthik Viswanathan, Alessio Ansuini, Alberto Cazzaniga, and Matteo Biagetti. ICML 2025 (2025) [link]
  • Topological interpretations of GPT-3. Tianyi Sun and Bradley Nelson. arXiv preprint arXiv:2308.03565 (2023) [link]
  • Bertops: Studying BERT representations under a topological lens. Jatin Chauhan and Manohar Kaul. International Joint Conference on Neural Networks (IJCNN). (2022) [link]
  • Relative representations: Topological and geometric perspectives. Alejandro García-Castellanos, Giovanni Luca Marchetti, Danica Kragic, and Martina Scolamiero. arXiv preprint arXiv:2409.10967. (2024) [link]
  • Local topology measures of contextual language model latent spaces with applications to dialogue term extraction. Benjamin Matthias Ruppik, Michael Heck, Carel van Niekerk, Renato Vukovic, Hsien-chin Lin, Shutong Feng, Marcus Zibrowius, and Milica Gašić. Proceedings of the 25th Meeting of the Special Interest Group on Discourse and Dialogue. (2024) [link]
  • The more polypersonal the better - A short look on space geometry of fine-tuned layers. Sergei Kudriashov, Veronika Zykova, Angelina Stepanova, Jacob Raskind, and Eduard Klyshinsky. International Conference on Neuroinformatics. (2024) [link]
  • A Green AI Methodology Based on Persistent Homology for Compressing BERT. Luis Balderas, Miguel Lastra, and José M. Benítez. Applied Sciences. (2025) [link]
  • Topological Data Mapping of Online Hate Speech, Misinformation, and General Mental Health: A Large Language Model Based Study. Andrew Alexander and Hongbin Wang. arXiv preprint arXiv:2309.13098. (2023) [link]
  • Holes in Latent Space: Topological Signatures Under Adversarial Influence. Aideen Fay, Inés García-Redondo, Qiquan Wang, Haim Dubossarsky, Anthea Monod. arXiv preprint arXiv:2505.20435. (2025) [link]

Attention

  • Artificial text detection via examining the topology of attention maps. Laida Kushnareva, Daniil Cherniavskii, Vladislav Mikhailov, Ekaterina Artemova, Serguei Barannikov, Alexander Bernstein, Irina Piontkovskaya, Dmitri Piontkovski, and Evgeny Burnaev. Proceedings of the 2021 EMNLP (2021) [link]
  • Acceptability judgements via examining the topology of attention maps. Daniil Cherniavskii, Eduard Tulchinskii, Vladislav Mikhailov, Irina Proskurina, Laida Kushnareva, Ekaterina Artemova, Serguei Barannikov, Irina Piontkovskaya, Dmitri Piontkovski, and Evgeny Burnaev. Findings of the Association for Computational Linguistics: EMNLP 2022. (2022) [link]
  • Can BERT eat RuCola? Topological data analysis to explain. Irina Proskurina, Ekaterina Artemova, and Irina Piontkovskaya. Proceedings of the 9th Workshop on Slavic Natural Language Processing 2023 (SlavicNLP 2023). (2023) [link]
  • Beyond words: A topological exploration of coherence in text documents. Samyak Jain, Rishi Singhal, Sriram Krishna, Yaman K Singla, and Rajiv Ratn Shah. The Second Tiny Papers Track at ICLR 2024. (2024) [link]
  • Uncertainty estimation of transformers’ predictions via topological analysis of the attention matrices. Elizaveta Kostenok, Daniil Cherniavskii, and Alexey Zaytsev. arXiv preprint arXiv:2308.11295. (2023) [link]
  • The topological bert: Transforming attention into topology for natural language processing. Ilan Perez and Raphael Reinauer. arXiv preprint arXiv:2206.15195. (2022) [link]
  • Detecting out-of-distribution text using topological features of transformer-based language models. Andres Pollano, Anupam Chaudhuri, and Anj Simmons. The IJCAI-2024 AISafety Workshop. (2024) [link]
  • Beyond simple averaging: Improving nlp ensemble performance with topological-data-analysis-based weighting. Polina Proskura and Alexey Zaytsev. IEEE 11th International Conference on Data Science and Advanced Analytics (DSAA) (2024) [link] OR [link]
  • Vulnerability Detection via Topological Analysis of Attention Maps. Pavel Snopov and Andrey Nikolaevich Golubinskiy. arXiv preprint arXiv:2410.03470 (2024) [link]
  • Topological data analysis for speech processing. Eduard Tulchinskii, Kristian Kuznetsov, Daniil Cherniavskii, Serguei Barannikov, Sergey Nikolenko, and Evgeny Burnaev. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. (2023) [link]
  • Dialogue term extraction using transfer learning and topological data analysis. Renato Vukovic, Michael Heck, Benjamin Ruppik, Carel van Niekerk, Marcus Zibrowius, and Milica Gasic. Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue. (2022) [link]
  • Authorship Attribution by Attention Pattern of BERT with Topological Data Analysis and UMAP. Wataru Sakurai, Masato Asano, Daisuke Imoto, Masakatsu Honma, and Kenji Kurosawa. 2025 International Conference on Artificial Intelligence in Information and Communication (ICAIIC). (2025) [link]
  • Hallucination Detection in LLMs via Topological Divergence on Attention Graphs. Alexandra Bazarova, Aleksandr Yugay, Andrey Shulga, Alina Ermilova, Andrei Volodichev, Konstantin Polev, Julia Belikova, Rauf Parchiev, Dmitry Simakov, Maxim Savchenko, Andrey Savchenko, Serguei Barannikov, Alexey Zaytsev. arXiv preprint arXiv:2504.10063. (2025) [link]

2c. Symbolic Representation

  • Tongue twisters detection in Ukrainian by using tda. Iryna Yurchuk and Olga Gurnik. CEUR Workshop Proceedings. (2023) [link]
  • Topological structure of Ukrainian tongue twisters based on speech sound analysis. Tetiana Kovaliuk, Iryna Yurchuk, and Olga Gurnik. MoDaST-2024: 6th International Workshop on Modern Data Science Technologies. (2024) [link]

2d. Multi-Modal Representation

  • Topological data analysis of human vowels: Persistent homologies across representation spaces. Guillem Bonafos, Jean-Marc Freyermuth, Pierre Pudlo, Samuel Tronçon, and Arnaud Rey. arXiv preprint arXiv:2310.06508. (2023) [link]
  • Dirichlet process mixture model based on topologically augmented signal representation for clustering infant vocalizations. Guillem Bonafos, Clara Bourot, Pierre Pudlo, Jean-Marc Freyermuth, Laurence Reboul, Samuel Tronçon, and Arnaud Rey. Interspeech 2024. (2024) [link]
  • Towards emotion recognition: a persistent entropy application. Rocio Gonzalez-Diaz, Eduardo Paluzo-Hidalgo, and José F Quesada. Computational Topology in Image Context: 7th International Workshop, CTIC 2019. (2019) [link]
  • Emotion recognition in talking-face videos using persistent entropy and neural networks. Eduardo Paluzo-Hidalgo, Rocio Gonzalez-Diaz, and Guillermo Aguirre-Carrazana. Electronic Research Archive. (2022) [link]
  • Topological data analysis to engineer features from audio signals for depression detection. ML Tlachac, Adam Sargent, Ermal Toto, Randy Paffenroth, and Elke Rundensteiner. 19th IEEE International Conference on Machine Learning and Applications (ICMLA). (2020) [link]
  • Topology-enhanced machine learning for consonant recognition. Y Zhu, P Feng, S Yi, Q Qu, and Z Yu. Research Square. (2024) [link]
  • Bridging Topological Persistence and Machine Learning for Music Information Retrieval. Luca Sassone, Marco Manetti, Mattia G Bergomi, and Massimo Ferri. Ph.D. thesis, Sapienza – University of Rome. (2022) [link]
  • Dynamical and topological tools for (modern) music analysis. Mattia Giuseppe Bergomi. Ph.D. thesis, Université Pierre et Marie Curie-Paris VI; Università degli studi (Milan, Italie). (2015) [link]
  • Topological Signatures of Adversaries in Multimodal Alignments. Minh Vu, Geigh Zollicoffer, Huy Mai, Ben Nebgen, Boian Alexandrov, Manish Bhattarai. arXiv preprint arXiv:2501.18006. (2025) [link]

Other non-TDA Topological Approaches in NLP

  • Discover the semantic topology in high-dimensional data. IJ Chiang. Expert Systems with Applications 33.1: 256-262 (2007) [link]
  • Computational topology in text mining. Hubert Wagner, Paweł Dłotko, and Marian Mrozek. Computational Topology in Image Context: 4th International Workshop. (2012) [link]

Tutorials

  • A Tutorial on Topological Data Analysis in Text Mining. Shafie Gholizadeh, Wlodek Zadrozny. IEEE BigData 2020. (2020) [link]
  • Topological Data Analysis in Natural Language Processing - A Tutorial. Wlodek Zadrozny. The International FLAIRS Conference Proceedings (FLAIRS-36). (2023) [link]

Resources

To learn more about TDA techniques and about libraries for implementing TDA in Python, R, and other languages, see FatemehTarashi's TDA repo and list of conferences and workshops.

Contributing

We welcome contributions from the community. If you have a paper that applies TDA to NLP tasks, please submit a pull request or open an issue.

License

This project is licensed under the MIT License. See the LICENSE file for details.


Contact

For any questions or inquiries, please contact Adaku Uchendu and Thai Le.
