Coreset-based clustering

This code implements algorithms described in "Improved MapReduce and Streaming Algorithms for k-Center Clustering (with Outliers)".

To compile it, you will need sbt and Java 8. The code is based on Spark, which has problems running on Java 9 (at least as of April 2018).

To build the code, the following command is sufficient:

sbt compile

If you want a "fat jar" suitable for deployment on a Spark cluster, use the following command:

sbt assembly

If instead you only want to test the software locally, use the provided run script.

Running the program

To see all the available options:

./run Main --help

The main parameters are the following:

k: controls the desired number of clusters
z: controls the number of outliers allowed. If not provided, the software performs a clustering without outliers.
p: the number of blocks in which to partition the input (only used with the mapreduce and random coresets, see below)
tau: the size of each coreset
input: path to the input file
coreset: the coreset construction to use, chosen among the following
- none: just run the sequential algorithm. Beware: if you also specify z, use only small inputs, since the sequential algorithm with z outliers takes cubic time.
- mapreduce: use the MapReduce coreset-based algorithm
- streaming: use the streaming coreset-based algorithm
- random: build a coreset obtained by sampling points at random from each partition in MapReduce
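
To give an intuition for how k, z, p and tau interact, here is a minimal Python sketch. It is not the code in this repository (which is written in Scala on top of Spark), and all function names are made up for illustration. It assumes the classic greedy farthest-first heuristic both for selecting the tau representatives of each block and for the final clustering, and it only evaluates the objective with z outliers rather than implementing the outlier-aware algorithm.

```python
import math

def farthest_first(points, k):
    """Greedy k-center: start anywhere, then repeatedly add the point
    farthest from the centers chosen so far."""
    if not points or k <= 0:
        return []
    centers = [points[0]]
    # closest[i] is the distance from points[i] to its nearest center.
    closest = [math.dist(pt, centers[0]) for pt in points]
    while len(centers) < min(k, len(points)):
        far = max(range(len(points)), key=closest.__getitem__)
        centers.append(points[far])
        closest = [min(c, math.dist(pt, points[far]))
                   for c, pt in zip(closest, points)]
    return centers

def radius(points, centers, z=0):
    """Largest point-to-nearest-center distance after discarding the z
    farthest points (z=0 gives the plain k-center objective)."""
    dists = sorted(min(math.dist(pt, c) for c in centers) for pt in points)
    kept = dists[:len(dists) - z]
    return kept[-1] if kept else 0.0

def coreset_k_center(points, k, p, tau):
    """Split the input into p blocks, keep tau representatives per block,
    then cluster the union of the representatives (hypothetical helper)."""
    blocks = [points[i::p] for i in range(p)]
    coreset = [c for block in blocks for c in farthest_first(block, tau)]
    return farthest_first(coreset, k)
```

For example, `coreset_k_center(points, k=3, p=2, tau=3)` returns three centers chosen from a six-point coreset, and `radius(points, centers, z=1)` reports the clustering radius when the single farthest point is treated as an outlier.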

Preparing datasets

In the paper we use the following two datasets as benchmarks:

Higgs
Power

Both datasets need some preprocessing to be used as input to the software.

Preparing the Higgs dataset

wget https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz
cat HIGGS.csv.gz | gunzip | cut -d ',' -f 23,24,25,26,27,28,29 | gzip > Higgs.csv.gz
./run VectorIO Higgs.csv.gz Higgs.bin
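
If `cut` is not available, the column selection in the second command can be reproduced with a few lines of Python. This is only a sketch of that one step, with hypothetical function names; it keeps fields 23 to 29 (1-based, as `cut` counts them), which to our understanding are the seven high-level features of the Higgs dataset. The conversion to the binary format still has to be done with VectorIO.

```python
import csv
import gzip

# 1-based field numbers kept by `cut -d ',' -f 23,24,25,26,27,28,29`.
KEPT_FIELDS = range(23, 30)

def select_fields(row, fields=KEPT_FIELDS):
    """Return the requested 1-based fields of a parsed CSV row."""
    return [row[f - 1] for f in fields]

def project_csv_gz(src, dst):
    """Read a gzipped CSV and write a gzipped CSV with only the kept fields."""
    with gzip.open(src, "rt", newline="") as fin, \
         gzip.open(dst, "wt", newline="") as fout:
        writer = csv.writer(fout, lineterminator="\n")
        for row in csv.reader(fin):
            writer.writerow(select_fields(row))
```

Calling `project_csv_gz("HIGGS.csv.gz", "Higgs.csv.gz")` would then stand in for the `cat | gunzip | cut | gzip` pipeline above.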

Preparing the Power dataset

wget https://archive.ics.uci.edu/ml/machine-learning-databases/00235/household_power_consumption.zip
unzip household_power_consumption.zip
cat household_power_consumption.txt | sed 1d | cut -d ';' -f 1,2 --complement | sed "s/;/,/g" > household_power_consumption.csv
./run VectorIO household_power_consumption.csv power.bin
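
The text-processing pipeline in the third command does three things: it drops the header line, removes the first two fields (date and time), and replaces the semicolon separators with commas. A rough Python equivalent, with a hypothetical function name and assuming every data line contains at least one semicolon, could look like this:

```python
def clean_power_lines(lines):
    """Mimic `sed 1d | cut -d ';' -f 1,2 --complement | sed "s/;/,/g"`."""
    cleaned = []
    for number, line in enumerate(lines):
        if number == 0:
            continue  # sed 1d: skip the header line
        fields = line.rstrip("\n").split(";")
        # Drop the first two fields, then join the rest with commas.
        cleaned.append(",".join(fields[2:]))
    return cleaned
```

The returned lines can be written to household_power_consumption.csv and then passed to VectorIO as shown above.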

Adding outliers to an existing dataset

If your dataset is stored in a directory dataset.bin, then you can add 200 outliers with the following command:

./run InjectOutliers -i dataset.bin/ --output dataset.csv --outliers 200 --factor 10 --num-partitions 1

To get a description of the available options, execute ./run InjectOutliers --help.
