📚 Wikipedia Semantic Mapping using Sentence-BERT and UMAP Clustering
This project explores the hidden semantic structure of Wikipedia topics using advanced NLP and unsupervised learning techniques. It retrieves topic descriptions from Wikipedia, generates dense vector embeddings using Sentence-BERT, and uses UMAP and DBSCAN to visualize and discover clusters of semantically related topics.
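A minimal sketch of the retrieval and embedding steps is shown below. The sample topics, summary length, and model name (`all-MiniLM-L6-v2`) are illustrative assumptions, not the project's exact configuration.

```python
import wikipedia
from sentence_transformers import SentenceTransformer

# Hypothetical sample topics; the real project would use a larger list.
topics = ["Machine learning", "Quantum mechanics", "Impressionism", "Plate tectonics"]

# Fetch a short description (summary) for each topic from Wikipedia.
summaries = []
for topic in topics:
    try:
        summaries.append(wikipedia.summary(topic, sentences=3))
    except wikipedia.exceptions.WikipediaException:
        summaries.append("")  # keep alignment if a topic fails to resolve

# Encode the summaries into dense sentence embeddings with Sentence-BERT.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(summaries, show_progress_bar=True)
print(embeddings.shape)  # (num_topics, embedding_dim)
```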
🔍 Overview
- Sentence Embeddings: Contextual embeddings generated using `sentence-transformers` (Sentence-BERT).
- Dimensionality Reduction: UMAP projects high-dimensional vectors to 2D space for visual exploration.
- Clustering: DBSCAN detects dense topic clusters without requiring a predefined number of clusters.
- Keyword Extraction: KeyBERT is used to summarize each cluster with meaningful keywords.
- Visualization: Matplotlib is used to create clear, labeled plots of topic groupings (see the combined sketch after this list).
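A combined sketch of these steps is given below. It assumes the `embeddings` and `summaries` objects from the snippet above; the UMAP and DBSCAN parameters (`n_neighbors`, `eps`, `min_samples`) are illustrative values that would need tuning on real data.

```python
import umap
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from keybert import KeyBERT

# Project the high-dimensional embeddings down to 2D for visual exploration.
# n_neighbors must be smaller than the number of topics.
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
coords = reducer.fit_transform(embeddings)

# Detect dense topic clusters; DBSCAN needs no preset cluster count
# and labels outliers as -1.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(coords)

# Summarize each cluster with KeyBERT keywords extracted from its pooled text.
kw_model = KeyBERT()
cluster_keywords = {}
for cluster_id in set(labels):
    if cluster_id == -1:
        continue  # skip noise points
    cluster_text = " ".join(s for s, l in zip(summaries, labels) if l == cluster_id)
    keywords = kw_model.extract_keywords(cluster_text, top_n=3)
    cluster_keywords[cluster_id] = [kw for kw, _ in keywords]

# Plot the 2D map, coloring points by cluster and labeling each cluster
# with its extracted keywords.
plt.figure(figsize=(10, 8))
plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=40)
for cluster_id, kws in cluster_keywords.items():
    center = coords[labels == cluster_id].mean(axis=0)
    plt.annotate(", ".join(kws), center, fontsize=9, weight="bold")
plt.title("Semantic map of Wikipedia topics")
plt.show()
```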
🧠 Techniques Used
- ✅ BERT-based semantic embeddings (`sentence-transformers`)
- ✅ Unsupervised learning (UMAP + DBSCAN)
- ✅ Clustering and keyword interpretation (`KeyBERT`)
- ✅ Data visualization with Matplotlib
📦 Requirements
- sentence-transformers
- umap-learn
- scikit-learn
- matplotlib
- keybert
- wikipedia
- numpy
🎯 Motivation
By combining sentence-level embeddings with manifold-based dimensionality reduction and density-based clustering, this project produces an intuitive map of how Wikipedia topics relate semantically. It demonstrates how modern NLP models can uncover rich latent structure in textual data.
🤝 Contributions
Contributions, issues, and improvements are welcome! Feel free to open a pull request or submit feedback.