Python implementation of DBCLASD, a non-parametric clustering algorithm
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
.gitignore
Aggregation2d.txt
LICENSE
README.md
dbclasd.py

README.md

py-dbclasd

Python implementation of DBCLASD: a non-parametric clustering algorithm

=== INTRODUCTION ===

Many recent (and not so recent) surveys on clustering algorithms, highlight the strength of this non-parametric clustering method. As this is one of my research interests, I wanted to give it a try and compare it myself to see if it suits my needs. Unfortunately, I haven't been able to find any implementation of this algorithm whatsoever :-(

This is why, I decided to do it myself. First to be able to test its novelties against other methods and second, because I wanted to contribute to the scientific and CS community somehow and this seemed to me as a nice way to do it.

I've tried my best to stick to the description given in the paper, which I found to be rather confusing once you really get into the implementation details. However, there is at least one thing that I felt needed to be done in order for the algorithm to deliver meaningful results, namely a condition to prevent the evaluation of points whose 29 initial nearest neighbors have been mostly labeled already (i.e., if more than half of the 29-NN of a candidate point have labels assigned already).

I'd be happy to get feedback, specially if there are still bugs around. I tried to make the code as efficient as possible without sacrificing readability. That means that I tried to stick to the pseudocode given in the paper as much as possible. The code is of course, far from being efficient but I guess I can think of creating branch with an efficient version afterwards.

=== EXECUTING DBCLASD ===

The code should run out of the box, given that you have all packages required (they're all standard):

  • NumPy
  • SciPy
  • scikit-learn
  • matplotlib

To give it a test-run, make sure you have a copy of the text file named Aggregation2d.txt which contains some sample data (which was publicly available at http://cs.joensuu.fi/sipu/datasets/ ). Now, run the command

$ python dbclasd.py -i Aggregation2d.txt

This should load the data and perform the clustering on it. At the end, a plot of all classes is saved to the current directory. Make sure matplotlib is properly configured and you have enough permissions to save in '.'

References: Xiaowei Xu; Ester, M.; Kriegel, H.-P.; Sander, J., "A distribution-based clustering algorithm for mining in large spatial databases," Data Engineering, 1998. Proceedings., 14th International Conference on , vol., no., pp.324,331, 23-27 Feb 1998 - http://www.ualr.edu/xwxu/publications/icde-98.pdf