This code implements algorithms described in "Improved MapReduce and Streaming Algorithms for k-Center Clustering (with Outliers)".
To compile it you will need sbt and Java version 8. The code is based on Spark, which has problems running on Java 9 (at least as of April 2018, see this issue).
To build the code, the following command is sufficient:
sbt compile
If you want a "fat jar" suitable for deployment on a Spark cluster, use the following command:
sbt assembly
If instead you just want to test the software locally, use the provided run script.
To see all the available options:
./run Main --help
The main parameters are the following:
- k: controls the desired number of clusters
- z: controls the number of outliers allowed. If not provided, the software performs a clustering without outliers.
- p: the number of blocks in which to partition the input (only used with the mapreduce and random coresets, see below)
- tau: the size of each coreset
- input: path to the input file
- coreset: you can choose the coreset to use among the following ones
  - none: just run the sequential algorithm. Beware: if you also specify z, take care to use only small inputs, since the sequential algorithm with z outliers has cubic running time.
  - mapreduce: use the MapReduce coreset-based algorithm
  - streaming: use the streaming coreset-based algorithm
  - random: build a coreset obtained by sampling points at random from each partition in MapReduce
In the paper we use two datasets from the UCI Machine Learning Repository as a benchmark: HIGGS and household power consumption.
Both datasets need some preprocessing to be used as input to the software.
wget https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz
cat HIGGS.csv.gz | gunzip | cut -d ',' -f 23,24,25,26,27,28,29 | gzip > Higgs.csv.gz
./run VectorIO Higgs.csv.gz Higgs.bin
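To see what the cut step in the HIGGS pipeline keeps, here is a minimal sketch on a made-up row. The row is an assumption for illustration only: it has 29 comma-separated fields, each holding its own position number, so the selected columns are easy to read off.

```shell
# Made-up row (not real HIGGS data): 29 fields, each equal to its position.
row="1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29"

# Same cut invocation as in the preprocessing pipeline: keep fields 23 through 29.
selected=$(printf '%s\n' "$row" | cut -d ',' -f 23,24,25,26,27,28,29)

echo "$selected"
```

Each output line therefore carries seven values, which is the dimensionality of the points passed to VectorIO.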
wget https://archive.ics.uci.edu/ml/machine-learning-databases/00235/household_power_consumption.zip
unzip household_power_consumption.zip
cat household_power_consumption.txt | sed 1d | cut -d ';' -f 1,2 --complement | sed "s/;/,/g" > household_power_consumption.csv
./run VectorIO household_power_consumption.csv power.bin
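The effect of the sed and cut steps above can be checked on a small sample before processing the full file. The sketch below uses a made-up three-line input (a header plus two records, with hypothetical column names a, b, c); like the original pipeline, it relies on the --complement option of GNU cut.

```shell
# Made-up sample: a header line followed by two ';'-separated records.
sample='Date;Time;a;b;c
16/12/2006;17:24:00;4.216;0.418;234.840
16/12/2006;17:25:00;5.360;0.436;233.630'

# Same steps as the preprocessing pipeline:
#   sed 1d                 drops the header line
#   cut ... --complement   drops fields 1 and 2 (date and time)
#   sed "s/;/,/g"          turns the remaining ';' separators into ','
cleaned=$(printf '%s\n' "$sample" | sed 1d | cut -d ';' -f 1,2 --complement | sed "s/;/,/g")

echo "$cleaned"
```

What remains is a comma-separated file of numeric columns only, the same kind of input that VectorIO is given for the HIGGS data.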
If your dataset is stored in a directory dataset.bin, then you can add 200 outliers with the following command:
./run InjectOutliers -i dataset.bin/ --output dataset.csv --outliers 200 --factor 10 --num-partitions 1
To get a description of the available options, execute ./run InjectOutliers --help.