Big Data Clustering Analytics
Urban Mobility | Fraud Detection | Scalable Clustering
This repository contains a real-world, scalable clustering system designed to discover patterns and anomalies in large, complex datasets.
The project focuses on two impactful domains:
- Urban Mobility Analysis (NYC Taxi Trips)
- Credit Card Fraud Detection
The goal is to show how modern clustering algorithms such as KMeans++, DBSCAN, OPTICS, BIRCH, and DENCLUE perform on large, high-dimensional datasets of the kind encountered in industry.
This is not a toy example; it is a research-grade, production-inspired clustering framework.
Real-world data is:
- Large
- Noisy
- High-dimensional
- Mostly unlabeled
Traditional clustering methods that assume the data fits in memory, or that compare every pair of points, break down at this scale.
This project demonstrates how scalable and density-based algorithms can uncover:
- Mobility patterns in a smart city
- Anomalous transactions in financial data
- Meaningful clusters without labels
This project uses two publicly available Kaggle datasets:
The NYC Taxi Trip Duration dataset is used to analyze:
- High-demand routes
- Travel-time clusters
- Trip distance vs duration
- Urban movement behavior
Features:
- Pickup & drop-off coordinates
- Trip distance (computed using Euclidean distance)
- Trip duration
- Passenger count
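The distance feature listed above can be sketched as follows. This is a minimal illustration, not the repository's code: the function name and the sample coordinates are hypothetical, and the result is in coordinate units (degrees), because the Euclidean formula is applied directly to longitude/latitude pairs.

```python
import math


def euclidean_trip_distance(pickup_lon, pickup_lat, dropoff_lon, dropoff_lat):
    """Straight-line distance between pickup and drop-off, in coordinate units."""
    return math.hypot(dropoff_lon - pickup_lon, dropoff_lat - pickup_lat)


# Hypothetical trip; the coordinates are illustrative only.
distance = euclidean_trip_distance(-73.9857, 40.7484, -73.9680, 40.7851)
print(f"Euclidean distance: {distance:.4f} degrees")
```

Because the value is expressed in degrees rather than kilometres, it is best read as a relative measure when comparing trips.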
The Credit Card Fraud Detection dataset is a real-world financial dataset with:
- 284,807 transactions
- 492 fraud cases (0.17%)
- 28 PCA-transformed features
Fraud is treated as an anomaly detection problem, where unusual transactions form sparse clusters.
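One way to picture this: density-based algorithms such as DBSCAN label points that belong to no dense region as noise, and those noise points can be treated as candidate anomalies. The sketch below uses scikit-learn's `DBSCAN` on synthetic data rather than the real transactions; the `eps` and `min_samples` values and the injected outliers are illustrative assumptions, not the settings used in this repository.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for transaction features: one dense group of "normal"
# points plus three isolated points playing the role of unusual transactions.
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
outliers = np.array([[8.0, 8.0], [-9.0, 7.5], [10.0, -8.0]])
X = np.vstack([normal, outliers])

# Scale features so that eps means the same thing along every axis.
X_scaled = StandardScaler().fit_transform(X)

# DBSCAN assigns the label -1 to points that are not part of any dense region.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled)
anomaly_idx = np.flatnonzero(labels == -1)

print(f"{anomaly_idx.size} points flagged as noise out of {len(X)}")
```

Some points on the fringe of the dense group may also be flagged, so the noise label is a starting point for review rather than a final fraud verdict.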
Each algorithm is implemented as a separate Python file for clarity and modularity.
| Category | Algorithms |
|---|---|
| Partition-based | Mini-Batch KMeans++, CLARA, CLARANS |
| Hierarchical | BIRCH, CURE |
| Density-based | DBSCAN, OPTICS, DENCLUE |
| Grid-based | STING |
This structure allows easy testing, comparison, and reuse.
Raw Data (Kaggle)
↓
Cleaning & Feature Engineering
↓
Scaling & PCA (for fraud data)
↓
Individual Clustering Algorithms
↓
Validation Metrics
↓
Visualization & Insights
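The scaling, PCA, and clustering stages of this flow could be wired together with a scikit-learn `Pipeline`. The sketch below is an assumption about how such a pipeline might look, run on synthetic data; the step names, the number of components, and the cluster count are hypothetical choices, not values taken from this repository.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for cleaned features: two well-separated groups in 10-D.
X = np.vstack([
    rng.normal(loc=-5.0, scale=1.0, size=(200, 10)),
    rng.normal(loc=5.0, scale=1.0, size=(200, 10)),
])

pipeline = Pipeline([
    ("scale", StandardScaler()),                  # scaling
    ("pca", PCA(n_components=3, random_state=0)),  # dimensionality reduction
    ("cluster", MiniBatchKMeans(                   # clustering
        n_clusters=2, init="k-means++", n_init=3,
        batch_size=256, random_state=0,
    )),
])

# fit_predict runs every stage in order and returns one label per row.
labels = pipeline.fit_predict(X)
print(np.bincount(labels))
```

Keeping the stages in one object means the same preprocessing is applied whenever the clustering step is swapped for another algorithm.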
Clustering quality is measured using:
- Silhouette Score
- Davies-Bouldin Index
- Adjusted Rand Index (ARI), where ground-truth labels are available
- Entropy
These evaluate how well clusters are separated, compact, and meaningful.
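As a rough illustration of these metrics, the sketch below computes them on synthetic data. The silhouette, Davies-Bouldin, and ARI functions come from `sklearn.metrics`; the `cluster_entropy` helper is a hypothetical name, and it assumes "entropy" means the size-weighted class entropy within each cluster, which may differ from the definition used in this repository.

```python
import numpy as np
from sklearn.metrics import (
    adjusted_rand_score,
    davies_bouldin_score,
    silhouette_score,
)


def cluster_entropy(true_labels, cluster_labels):
    """Size-weighted entropy (in bits) of the true classes inside each cluster."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    total = 0.0
    for c in np.unique(cluster_labels):
        members = true_labels[cluster_labels == c]
        _, counts = np.unique(members, return_counts=True)
        p = counts / counts.sum()
        total += (members.size / true_labels.size) * -(p * np.log2(p)).sum()
    return float(total)


rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=6.0, scale=0.5, size=(50, 2)),
])
true_labels = np.array([0] * 50 + [1] * 50)
cluster_labels = 1 - true_labels  # same grouping, different label names

sil = silhouette_score(X, cluster_labels)                # higher is better
dbi = davies_bouldin_score(X, cluster_labels)            # lower is better
ari = adjusted_rand_score(true_labels, cluster_labels)   # needs true labels
ent = cluster_entropy(true_labels, cluster_labels)       # 0 means pure clusters

print(f"silhouette={sil:.3f} davies_bouldin={dbi:.3f} ari={ari:.3f} entropy={ent:.3f}")
```

Silhouette and Davies-Bouldin need only the features and the cluster labels, while ARI and this entropy variant also require reference labels, such as the fraud flag.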
- Mini-Batch KMeans++ scales efficiently for millions of taxi trips
- BIRCH clusters big data with low memory usage
- OPTICS and DENCLUE are highly effective for fraud detection
- Density-based methods isolate fraudulent transactions as anomalies
This confirms why hybrid clustering strategies are needed in real-world analytics.
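The low-memory behaviour attributed to BIRCH comes from building a compact tree of subclusters incrementally, so the data can be fed in chunks. The sketch below shows this with scikit-learn's `Birch.partial_fit` on synthetic data; the chunk count, `threshold`, and `n_clusters` are illustrative assumptions, not the repository's settings.

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(7)

# Synthetic stand-in for a large dataset: two compact groups of points.
X = np.vstack([
    rng.normal(loc=-5.0, scale=0.5, size=(500, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(500, 2)),
])
rng.shuffle(X)  # mix both groups into every chunk

model = Birch(threshold=0.5, n_clusters=2)

# Feed the data one chunk at a time; only the current chunk and the
# subcluster tree need to be held by the model during each call.
for chunk in np.array_split(X, 4):
    model.partial_fit(chunk)

labels = model.predict(X)
probe = model.predict(np.array([[-5.0, -5.0], [5.0, 5.0]]))
print(np.bincount(labels), probe)
```

In a real run the chunks would come from reading the CSV in pieces instead of splitting an in-memory array.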
Due to Kaggle licensing and file size limits, datasets are not stored in this repository.
Please download them from:
- NYC Taxi Trip Duration: https://www.kaggle.com/c/nyc-taxi-trip-duration
- Credit Card Fraud Detection: https://www.kaggle.com/mlg-ulb/creditcardfraud
After downloading, place the CSV files into:
data/raw/
Install dependencies:
pip install -r requirements.txt
Run any clustering algorithm:
python kmeans.py
python dbscan.py
python optics.py
python denclue.py
python birch.py
python clara.py
python clarans.py
Each file runs the full pipeline for that specific algorithm.
This project shows hands-on skills in:
- Big data preprocessing
- Scalable machine learning
- Unsupervised learning
- Anomaly detection
- Feature engineering
- Model evaluation
- High-dimensional data handling
These are core skills used in:
- FinTech
- Smart cities
- Risk analytics
- Data engineering
- AI research
Sai Teja Bandaru
Bachelor's in Data Analytics
Università degli Studi della Campania Luigi Vanvitelli
Feel free to star the repository or use it as a reference for:
- Research
- Data science portfolios
- Machine learning engineering
- Big data analytics
This repository represents real-world clustering at scale.