Homepage: http://github.com/whym/scluster
Contact: http://whym.org
Spectral clustering is a modern clustering technique considered effective for image clustering, among other tasks. [1] [2]
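As a rough illustration of the idea, the sketch below covers only the simplest two-cluster case described in the tutorial [1]: build the unnormalized graph Laplacian from a similarity matrix and split the items by the sign of its second eigenvector. The function name `spectral_bipartition` and the toy similarity values are made up for this example; this is not necessarily how scluster itself is implemented.

```python
import numpy as np


def spectral_bipartition(affinity):
    """Split items into two groups by the sign of the Fiedler vector.

    Hypothetical helper for illustration; `affinity` must be a symmetric
    matrix of non-negative pairwise similarities for a connected graph.
    """
    w = np.asarray(affinity, dtype=float)
    degree = w.sum(axis=1)
    # Unnormalized graph Laplacian: L = D - W.
    laplacian = np.diag(degree) - w
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    _, eigenvectors = np.linalg.eigh(laplacian)
    # The eigenvector of the second-smallest eigenvalue is the Fiedler vector.
    fiedler = eigenvectors[:, 1]
    return fiedler > 0


# Toy similarities: items 0 and 1 are close, items 2 and 3 are close,
# and the two pairs are only weakly linked.
affinity = [
    [0.0, 0.9, 0.1, 0.0],
    [0.9, 0.0, 0.0, 0.1],
    [0.1, 0.0, 0.0, 0.9],
    [0.0, 0.1, 0.9, 0.0],
]
labels = spectral_bipartition(affinity)
```

For more than two clusters, the usual approach is to take several eigenvectors and run k-means on their rows, as the tutorial [1] explains.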
This software finds clusters among documents based on the bag-of-words representation [3] and TF-IDF weighting [4].
[1] Ulrike von Luxburg, A Tutorial on Spectral Clustering, 2006. http://arxiv.org/abs/0711.0189
[2] Chris H. Q. Ding, Spectral Clustering, 2004. http://ranger.uta.edu/~chqding/Spectral/
[3] http://en.wikipedia.org/wiki/Bag_of_words_model
[4] http://en.wikipedia.org/wiki/Tf%E2%80%93idf
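To make the bag-of-words representation [3] and TF-IDF weighting [4] concrete, here is a minimal sketch in plain Python. It assumes whitespace tokenization, raw term counts for TF, and log(N / df) for IDF, which is one common variant and may differ from what scluster actually computes. The names `tfidf_vectors` and `cosine` are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (illustrative only)."""
    # Bag of words: word order is discarded, only per-term counts are kept.
    bags = [Counter(doc.lower().split()) for doc in docs]
    n_docs = len(bags)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for bag in bags for term in bag)
    return [
        {term: count * math.log(n_docs / float(df[term]))
         for term, count in bag.items()}
        for bag in bags
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "palm oil prices rise",
    "vegetable oil prices fall",
    "ship departs from port",
]
vectors = tfidf_vectors(docs)
similar = cosine(vectors[0], vectors[1])    # documents share "oil" and "prices"
unrelated = cosine(vectors[0], vectors[2])  # documents share no terms
```

Pairwise similarities like these are the kind of input a spectral clustering step can work from.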
The following software is required:
- Python 2.7 or 3.4
- NumPy
- SciPy
Clone this repository.
Prepare documents as raw-text files, and put them in a directory, for example, 'reuters'.
Prepare a category file. For example, 'cats.txt' may contain:
14833 palm-oil veg-oil
14839 ship
This means that the file '14833' has 'palm-oil' and 'veg-oil' as its categories, and '14839' has 'ship' as its category.
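The category file format can be read with a few lines of Python. The sketch below assumes one document per line, with the file name first and its categories after it, separated by whitespace. `parse_categories` is a hypothetical helper for illustration, not a function provided by scluster.

```python
def parse_categories(text):
    """Map each file name to its list of categories (illustrative only)."""
    categories = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # First field is the file name, the rest are its categories.
        categories[fields[0]] = fields[1:]
    return categories


sample = "14833 palm-oil veg-oil\n14839 ship\n"
cats = parse_categories(sample)
```

Here `cats["14833"]` holds the two categories of the first file and `cats["14839"]` holds the single category of the second.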
Run:
python -m scluster.clusterer cats.txt reuters/ -m kmeans
- When you use the Reuters set [5], note that document No. 17980 might contain a non-Unicode character at line 10. It should probably read: "world economic growth-side measures ..."
[5] http://www.daviddlewis.com/resources/testcollections/reuters21578/
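A problem like the one noted for document 17980 can be located before clustering. The following sketch assumes the documents are meant to be UTF-8. It reports which lines of a file fail to decode and shows a tolerant way to read such a file. Both helper names are hypothetical, and the sample bytes are invented for the demonstration rather than taken from the Reuters collection.

```python
import io
import os
import tempfile


def find_undecodable_lines(path, encoding="utf-8"):
    """Return the 1-based numbers of lines that fail to decode."""
    bad = []
    with open(path, "rb") as f:
        for number, raw in enumerate(f, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError:
                bad.append(number)
    return bad


def read_text(path, encoding="utf-8"):
    """Read a file, replacing undecodable bytes with U+FFFD."""
    with io.open(path, encoding=encoding, errors="replace") as f:
        return f.read()


# Demonstration on a temporary file whose second line contains a
# Latin-1 byte (0xE9) that is not valid UTF-8 in this position.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"plain ascii line\ncaf\xe9 in Latin-1\n")
bad_lines = find_undecodable_lines(path)
text = read_text(path)
os.remove(path)
```

Replacing the bad bytes keeps the rest of the document usable, while the reported line numbers show where a manual fix might be needed.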