[SPARK-17389][ML][MLLIB] KMeans speedup with better choice of k-means|| init steps = 2

## What changes were proposed in this pull request?

Reduce default k-means|| init steps to 2 from 5. See JIRA for discussion.
See also apache#14948
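
Callers that depend on the previous behavior can still set the step count explicitly. A minimal sketch, assuming Spark MLlib is on the classpath; the object name `PinInitSteps`, the helper `build`, and the value `k = 3` are illustrative and not part of this patch:

```scala
import org.apache.spark.mllib.clustering.KMeans

object PinInitSteps {
  // Builds a KMeans estimator with the k-means|| step count set explicitly,
  // so the result does not depend on the library default (5 before this
  // commit, 2 after it).
  def build(k: Int, initSteps: Int): KMeans =
    new KMeans()
      .setK(k)
      .setInitializationMode(KMeans.K_MEANS_PARALLEL)
      .setInitializationSteps(initSteps)

  def main(args: Array[String]): Unit = {
    val kmeans = build(k = 3, initSteps = 5)
    // Read the setting back to confirm it was applied.
    println(kmeans.getInitializationSteps)
  }
}
```

Constructing the estimator does not need a `SparkContext`; one is only required once `run` is called on an RDD of vectors.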

## How was this patch tested?

Existing tests.
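
The suite change in this commit also drops an `i != j` guard when building edges. A minimal sketch of why that guard looks redundant, assuming nothing beyond the loop shape shown in the diff: `j` ranges over `0 until i`, so `i == j` never occurs. `Edge` here is a hypothetical stand-in for the GraphX class, and the constant weight stands in for `sim(...)`, so the sketch runs without Spark:

```scala
object EdgeGuardSketch {
  // Hypothetical stand-in for GraphX's Edge, so the sketch runs without Spark.
  final case class Edge(src: Long, dst: Long, attr: Double)

  // Same index pattern as the test: j ranges over 0 until i, so j < i always.
  def pairs(n: Int): Seq[(Long, Long)] =
    for (i <- 1 until n; j <- 0 until i) yield (i.toLong, j.toLong)

  // Emit both directions for each pair, with a placeholder weight of 1.0.
  def edges(n: Int): Seq[Edge] =
    pairs(n).flatMap { case (i, j) => Seq(Edge(i, j, 1.0), Edge(j, i, 1.0)) }

  def main(args: Array[String]): Unit = {
    val n = 20 // n1 + n2 = 10 + 10, as in the updated test
    // No pair has i == j, so filtering on i != j would remove nothing.
    assert(pairs(n).forall { case (i, j) => i != j })
    // n * (n - 1) / 2 unordered pairs, two directed edges each.
    assert(edges(n).size == n * (n - 1))
  }
}
```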

Author: Sean Owen <sowen@cloudera.com>

Closes apache#14956 from srowen/SPARK-17389.2.
srowen authored and wgtmac committed Sep 19, 2016
1 parent 9e6680e commit 71d6291
Showing 2 changed files with 6 additions and 10 deletions.
@@ -51,10 +51,10 @@ class KMeans private (

/**
* Constructs a KMeans instance with default parameters: {k: 2, maxIterations: 20, runs: 1,
- * initializationMode: "k-means||", initializationSteps: 5, epsilon: 1e-4, seed: random}.
+ * initializationMode: "k-means||", initializationSteps: 2, epsilon: 1e-4, seed: random}.
*/
@Since("0.8.0")
- def this() = this(2, 20, 1, KMeans.K_MEANS_PARALLEL, 5, 1e-4, Utils.random.nextLong())
+ def this() = this(2, 20, 1, KMeans.K_MEANS_PARALLEL, 2, 1e-4, Utils.random.nextLong())

/**
* Number of clusters to create (k).
@@ -134,7 +134,7 @@ class KMeans private (

/**
* Set the number of steps for the k-means|| initialization mode. This is an advanced
- * setting -- the default of 5 is almost always enough. Default: 5.
+ * setting -- the default of 2 is almost always enough. Default: 2.
*/
@Since("0.8.0")
def setInitializationSteps(initializationSteps: Int): this.type = {
@@ -49,7 +49,7 @@ class PowerIterationClusteringSuite extends SparkFunSuite with MLlibTestSparkCon
val r1 = 1.0
val n1 = 10
val r2 = 4.0
- val n2 = 40
+ val n2 = 10
val n = n1 + n2
val points = genCircle(r1, n1) ++ genCircle(r2, n2)
val similarities = for (i <- 1 until n; j <- 0 until i) yield {
@@ -83,19 +83,15 @@ class PowerIterationClusteringSuite extends SparkFunSuite with MLlibTestSparkCon
val r1 = 1.0
val n1 = 10
val r2 = 4.0
- val n2 = 40
+ val n2 = 10
val n = n1 + n2
val points = genCircle(r1, n1) ++ genCircle(r2, n2)
val similarities = for (i <- 1 until n; j <- 0 until i) yield {
(i.toLong, j.toLong, sim(points(i), points(j)))
}

val edges = similarities.flatMap { case (i, j, s) =>
- if (i != j) {
- Seq(Edge(i, j, s), Edge(j, i, s))
- } else {
- None
- }
+ Seq(Edge(i, j, s), Edge(j, i, s))
}
val graph = Graph.fromEdges(sc.parallelize(edges, 2), 0.0)
