compute cost for k-means #173

Merged
merged 3 commits into rstudio:master on Aug 17, 2016

@MarcinKosinski
Contributor

MarcinKosinski commented Aug 16, 2016

I have added an option to ml_kmeans for computing the cost of the clustering, which can help determine the number of clusters in a k-means problem, as suggested here: http://stackoverflow.com/a/15376462/3857701

I've added a new parameter, compute.cost, to ml_kmeans which, when set to TRUE, invokes computeCost on the fitted model. I have also extended the print.ml_model_kmeans function so that it now reports additional information about the clustering:

> print(kmeans_model3)
K-means clustering with 3 clusters

Cluster centers:
  Petal_Width Petal_Length
1    1.359259     4.292593
2    2.047826     5.626087
3    0.246000     1.462000

Within Set Sum of Squared Errors =  31.41289
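
The computed cost is also stored on the returned model object, so different fits can be compared directly; for example, using the model fitted above (the value matches the one reported by print):

> kmeans_model3$cost
[1] 31.41289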

The full code below shows how to determine the number of clusters in k-means clustering, based on the example from your website: http://spark.rstudio.com/mllib.html#k-means_clustering

> library(sparklyr)
> library(ggplot2)
> library(dplyr)
> sc <- spark_connect("local", version = "1.6.1")
Re-using existing Spark connection to local
> iris_tbl <- copy_to(sc, iris, "iris", overwrite = TRUE)
The following columns have been renamed:
- 'Sepal.Length' => 'Sepal_Length' (#1)
- 'Sepal.Width'  => 'Sepal_Width'  (#2)
- 'Petal.Length' => 'Petal_Length' (#3)
- 'Petal.Width'  => 'Petal_Width'  (#4)
> iris_tbl
Source:   query [?? x 5]
Database: spark connection master=local[8] app=sparklyr local=TRUE

   Sepal_Length Sepal_Width Petal_Length Petal_Width Species
          <dbl>       <dbl>        <dbl>       <dbl>   <chr>
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
# ... with more rows
> 
> 
> kmeans_model2 <- iris_tbl %>%
+   select(Petal_Width, Petal_Length) %>%
+   ml_kmeans(centers = 2, compute.cost = TRUE)
> 
> 
> kmeans_model3 <- iris_tbl %>%
+   select(Petal_Width, Petal_Length) %>%
+   ml_kmeans(centers = 3, compute.cost = TRUE)
> 
> kmeans_model4 <- iris_tbl %>%
+   select(Petal_Width, Petal_Length) %>%
+   ml_kmeans(centers = 4, compute.cost = TRUE)
> 
> 
> kmeans_model5 <- iris_tbl %>%
+   select(Petal_Width, Petal_Length) %>%
+   ml_kmeans(centers = 5, compute.cost = TRUE)
> 
> kmeans_model6 <- iris_tbl %>%
+   select(Petal_Width, Petal_Length) %>%
+   ml_kmeans(centers = 6, compute.cost = TRUE)
> 
> kmeans_model7 <- iris_tbl %>%
+   select(Petal_Width, Petal_Length) %>%
+   ml_kmeans(centers = 7, compute.cost = TRUE)
> 
> 
> # print our model fit
> print(kmeans_model3)
K-means clustering with 3 clusters

Cluster centers:
  Petal_Width Petal_Length
1    1.359259     4.292593
2    2.047826     5.626087
3    0.246000     1.462000

Within Set Sum of Squared Errors =  31.41289
> 
> data.frame(costs = c(kmeans_model2$cost,
+                      kmeans_model3$cost,
+                      kmeans_model4$cost,
+                      kmeans_model5$cost,
+                      kmeans_model6$cost,
+                      kmeans_model7$cost),
+            clusters = 2:7,
+            stringsAsFactors = FALSE) %>%
+   ggplot(aes(y = costs, x = clusters)) + 
+   geom_line() +
+   labs(title = "Within Set Sum of Squared Errors \nfor various number of clusters \nin K-means Clustering")

[Plot: Within Set Sum of Squared Errors for various numbers of clusters in k-means clustering]
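
The same elbow curve can be produced without fitting one named model per value of k. A minimal sketch, assuming the compute.cost argument from this PR and that the fitted model exposes the cost as model$cost (as in the session above); elbow_costs is a hypothetical helper, not part of sparklyr:

library(sparklyr)
library(dplyr)
library(ggplot2)

# fit k-means for each k and collect the within-set sum of squared errors
elbow_costs <- function(tbl, ks) {
  data.frame(
    clusters = ks,
    costs = vapply(ks, function(k) {
      model <- tbl %>%
        select(Petal_Width, Petal_Length) %>%
        ml_kmeans(centers = k, compute.cost = TRUE)
      model$cost
    }, numeric(1))
  )
}

elbow_costs(iris_tbl, 2:7) %>%
  ggplot(aes(x = clusters, y = costs)) +
  geom_line() +
  geom_point() +
  labs(title = "Within Set Sum of Squared Errors\nfor various numbers of clusters")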

In the future, I wonder whether a plot.ml_model_kmeans function producing such graphs would be valuable.

@@ -31,7 +33,7 @@ ml_kmeans <- function(x,
envir <- new.env(parent = emptyenv())
- envir$id <- random_string("id_")
+ envir$id <- sparklyr:::random_string("id_")

@MarcinKosinski

MarcinKosinski Aug 16, 2016

Contributor

This might not be necessary

@kevinushey

kevinushey Aug 16, 2016

Contributor

Indeed, we don't need to namespace-qualify calls to sparklyr's own functions within sparklyr.

R/ml_kmeans.R
@@ -18,7 +19,8 @@ ml_kmeans <- function(x,
centers,
max.iter = 100,
features = dplyr::tbl_vars(x),
- ...) {
+ ...,
+ compute.cost = FALSE) {

@kevinushey

kevinushey Aug 16, 2016

Contributor

I think this can be moved before the ... parameter.
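
That is, a sketch of the suggested signature, keeping the FALSE default from the diff above (the default is revisited later in this thread):

ml_kmeans <- function(x,
                      centers,
                      max.iter = 100,
                      features = dplyr::tbl_vars(x),
                      compute.cost = FALSE,
                      ...) {
  # body unchanged
}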

@MarcinKosinski

MarcinKosinski Aug 17, 2016

Contributor

I'll fix that.

@@ -54,19 +56,33 @@ ml_kmeans <- function(x,
# extract cluster centers
kmmCenters <- invoke(fit, "clusterCenters")
+ # compute cost for k-means
+ if (compute.cost)

@kevinushey

kevinushey Aug 16, 2016

Contributor

How expensive is this computation? Is it something we could feasibly do by default?

@MarcinKosinski

MarcinKosinski Aug 17, 2016

Contributor

I have really no idea.

This is the method:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala#L84-L88

I think you have to subtract each observation's corresponding cluster center from it, square that difference, and sum it over all observations. This looks trivial and can easily be done in parallel. I haven't noticed a big difference in computing time when clustering 100 columns and 7 million observations.

Should we default to compute.cost = TRUE?
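
For intuition, computeCost is just the within-set sum of squared Euclidean distances from each point to its assigned center. A minimal local-R sketch of that definition, using base R kmeans on the same iris columns (illustrative only, not the Spark implementation, which distributes the same sum):

# within-set sum of squared errors, given points, centers, and assignments
wssse <- function(points, centers, assignment) {
  sum(vapply(seq_len(nrow(points)), function(i) {
    sum((points[i, ] - centers[assignment[i], ])^2)
  }, numeric(1)))
}

pts <- as.matrix(iris[, c("Petal.Width", "Petal.Length")])
km  <- kmeans(pts, centers = 3)
wssse(pts, km$centers, km$cluster)   # equals km$tot.withinss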

@kevinushey

kevinushey Aug 17, 2016

Contributor

Let's default to TRUE and allow users to opt out if it's too expensive for their particular case.
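
So, assuming the default does become TRUE, users with very large data could opt out explicitly:

# skip the extra computeCost pass over the data
kmeans_model <- iris_tbl %>%
  select(Petal_Width, Petal_Length) %>%
  ml_kmeans(centers = 3, compute.cost = FALSE)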

@MarcinKosinski

MarcinKosinski Aug 17, 2016

Contributor

No problem :) This is the functionality that I need, so it's vital to me. Regarding the unit tests, I think they are very simple right now and I'll come up with a more complex one.

R/ml_kmeans.R
- data = df,
- model.parameters = as.list(envir)
- )
+ if (compute.cost) {

@kevinushey

kevinushey Aug 16, 2016

Contributor

I think it would be okay to have a single call to ml_model; just have the default cost be NULL.
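
A sketch of what that could look like; apart from the names visible in the diff fragments above (compute.cost, fit, df, envir, kmmCenters), the argument layout of ml_model() and the handle passed to computeCost are assumptions, not the actual source:

# compute the cost only when requested; otherwise leave it NULL so a
# single ml_model() call covers both cases
kmmCost <- if (compute.cost)
  invoke(fit, "computeCost", tdf)   # tdf: placeholder for the Spark-side data
else
  NULL

ml_model("kmeans", fit,
         centers = kmmCenters,
         cost = kmmCost,
         data = df,
         model.parameters = as.list(envir))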

@kevinushey

kevinushey commented Aug 16, 2016

Contributor

If you haven't already, would you be willing to submit a CLA? https://www.rstudio.com/wp-content/uploads/2014/06/RStudioIndividualContributorAgreement.pdf

@MarcinKosinski

MarcinKosinski commented Aug 17, 2016

Contributor

Contribution Agreement signed and sent. Updates provided in commit 8d17d6e.

@kevinushey

kevinushey commented Aug 17, 2016

Contributor

Awesome, thanks! (and extra thanks for providing unit tests!)

kevinushey merged commit b96cd39 into rstudio:master on Aug 17, 2016

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed