📚 Wikipedia Semantic Mapping using Sentence-BERT and UMAP Clustering
This project explores the hidden semantic structure of Wikipedia topics using advanced NLP and unsupervised learning techniques. It retrieves topic descriptions from Wikipedia, generates dense vector embeddings using Sentence-BERT, and uses UMAP and DBSCAN to visualize and discover clusters of semantically related topics.
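A minimal sketch of the retrieval and embedding steps is shown below. The sample topics, summary length, and model name (`all-MiniLM-L6-v2`) are illustrative assumptions, not the project's exact configuration.

```python
import wikipedia
from sentence_transformers import SentenceTransformer

# Hypothetical sample topics; the real project would use a larger list.
topics = ["Machine learning", "Quantum mechanics", "Impressionism", "Plate tectonics"]

# Fetch a short description (summary) for each topic from Wikipedia.
summaries = []
for topic in topics:
    try:
        summaries.append(wikipedia.summary(topic, sentences=3))
    except wikipedia.exceptions.WikipediaException:
        summaries.append("")  # keep alignment if a topic fails to resolve

# Encode the summaries into dense sentence embeddings with Sentence-BERT.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(summaries, show_progress_bar=True)
print(embeddings.shape)  # (num_topics, embedding_dim)
```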
🔍 Overview
- Sentence Embeddings: Contextual embeddings generated using `sentence-transformers` (Sentence-BERT).
- Dimensionality Reduction: UMAP projects high-dimensional vectors to 2D space for visual exploration.
- Clustering: DBSCAN detects dense topic clusters without requiring a predefined number of clusters.
- Keyword Extraction: KeyBERT is used to summarize each cluster with meaningful keywords.
- Visualization: Matplotlib is used to create clear, labeled plots of topic groupings (see the combined sketch after this list).
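A combined sketch of these steps is given below. It assumes the `embeddings` and `summaries` objects from the snippet above; the UMAP and DBSCAN parameters (`n_neighbors`, `eps`, `min_samples`) are illustrative values that would need tuning on real data.

```python
import umap
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from keybert import KeyBERT

# Project the high-dimensional embeddings down to 2D for visual exploration.
# n_neighbors must be smaller than the number of topics.
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
coords = reducer.fit_transform(embeddings)

# Detect dense topic clusters; DBSCAN needs no preset cluster count
# and labels outliers as -1.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(coords)

# Summarize each cluster with KeyBERT keywords extracted from its pooled text.
kw_model = KeyBERT()
cluster_keywords = {}
for cluster_id in set(labels):
    if cluster_id == -1:
        continue  # skip noise points
    cluster_text = " ".join(s for s, l in zip(summaries, labels) if l == cluster_id)
    keywords = kw_model.extract_keywords(cluster_text, top_n=3)
    cluster_keywords[cluster_id] = [kw for kw, _ in keywords]

# Plot the 2D map, coloring points by cluster and labeling each cluster
# with its extracted keywords.
plt.figure(figsize=(10, 8))
plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=40)
for cluster_id, kws in cluster_keywords.items():
    center = coords[labels == cluster_id].mean(axis=0)
    plt.annotate(", ".join(kws), center, fontsize=9, weight="bold")
plt.title("Semantic map of Wikipedia topics")
plt.show()
```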
🧠 Techniques Used
- ✅ BERT-based semantic embeddings (`sentence-transformers`)
- ✅ Unsupervised learning (UMAP + DBSCAN)
- ✅ Clustering and keyword interpretation (`KeyBERT`)
- ✅ Data visualization with Matplotlib
📦 Requirements
- sentence-transformers
- umap-learn
- scikit-learn
- matplotlib
- keybert
- wikipedia
- numpy
🎯 Motivation
By combining sentence-level embeddings with manifold-based dimensionality reduction and density-based clustering, this project produces an intuitive map of how Wikipedia topics relate semantically. It demonstrates how modern NLP models can uncover rich latent structure in textual data.
🤝 Contributions
Contributions, issues, and improvements are welcome! Feel free to open a pull request or submit feedback.