setEpsilon and setInitializationMode for ml_kmeans #178

Closed
MarcinKosinski opened this Issue Aug 17, 2016 · 3 comments

@MarcinKosinski
Contributor

MarcinKosinski commented Aug 17, 2016

I am working on extending ml_kmeans with support for the setEpsilon and setInitializationMode parameters (https://github.com/MarcinKosinski/sparklyr/blob/feature/setEpsilonInKmeans/R/ml_kmeans.R#L61), since these further parameters can be set on KMeans (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.clustering.KMeans).

However, I am getting a strange error that I cannot get past. Do you have any suggestions?

> library(sparklyr)
> sc <- spark_connect("local", version = "2.0.0")
> 
> library(dplyr)
> iris_tbl <- dplyr::copy_to(sc, iris, "iris", overwrite = TRUE)
The following columns have been renamed:
- 'Sepal.Length' => 'Sepal_Length' (#1)
- 'Sepal.Width'  => 'Sepal_Width'  (#2)
- 'Petal.Length' => 'Petal_Length' (#3)
- 'Petal.Width'  => 'Petal_Width'  (#4)
> 
> s <- ml_kmeans(iris_tbl %>% dplyr::select(Sepal_Length, Petal_Length), 
+                centers = 3, max.iter = 5)
 Error: failed to invoke spark command
16/08/17 17:40:55 INFO SparkSqlParser: Parsing command: sparklyr_tmp_ab5205e5f92
16/08/17 17:40:55 INFO SparkSqlParser: Parsing command: SELECT *
16/08/17 17:40:55 INFO CodeGenerator: Code generated in 7.533418 ms
16/08/17 17:40:55 INFO SparkSqlParser: Parsing command: SELECT *
16/08/17 17:40:55 INFO SparkSqlParser: Parsing command: sparklyr_tmp_ab54f3f9c9d
16/08/17 17:40:55 INFO SparkSqlParser: Parsing command: SELECT *
16/08/17 17:40:55 INFO CodeGenerator: Code generated in 19.315911 ms
16/08/17 17:40:55 INFO SparkSqlParser: Parsing command: SELECT *
16/08/17 17:40:55 ERROR setEpsilon on 58 failed 

My modified ml_kmeans function is below:

ml_kmeans <- function(x,
                      centers,
                      max.iter = 100,
                      features = dplyr::tbl_vars(x),
                      compute.cost = TRUE,
                      epsilon = 0.0001,
                      mode = "k-means||",
                      ...) {

  df <- spark_dataframe(x)
  sc <- spark_connection(df)

  df <- ml_prepare_features(df, features)

  centers <- ensure_scalar_integer(centers)
  max.iter <- ensure_scalar_integer(max.iter)
  only_model <- ensure_scalar_boolean(list(...)$only_model, default = FALSE)
  epsilon <- ensure_scalar_double(epsilon)

  assert_that(mode %in% c("random", "k-means||"))
  mode <- ensure_scalar_character(mode)

  envir <- new.env(parent = emptyenv())

  envir$id <- sparklyr:::random_string("id_")
  df <- df %>%
    sdf_with_unique_id(envir$id) %>%
    spark_dataframe()

  tdf <- ml_prepare_dataframe(df, features, envir = envir)

  envir$model <- "org.apache.spark.ml.clustering.KMeans"
  kmeans <- invoke_new(sc, envir$model)

  model <- kmeans %>%
    invoke("setK", centers) %>%
    invoke("setMaxIter", max.iter) %>%
    invoke("setEpsilon", epsilon) %>%
    invoke("setInitializationMode", mode) %>%
    invoke("setFeaturesCol", envir$features)

  if (only_model) return(model)

  fit <- model %>%
    invoke("fit", tdf)

  # extract cluster centers
  kmmCenters <- invoke(fit, "clusterCenters")

  # compute cost for k-means
  if (compute.cost)
    kmmCost <- invoke(fit, "computeCost", tdf)

  centersList <- transpose_list(lapply(kmmCenters, function(center) {
    as.numeric(invoke(center, "toArray"))
  }))

  names(centersList) <- features
  centers <- as.data.frame(centersList, stringsAsFactors = FALSE, optional = TRUE)

  ml_model("kmeans", fit,
           centers = centers,
           features = features,
           data = df,
           model.parameters = as.list(envir),
           cost = if (compute.cost) kmmCost else NULL
  )
}

Session Info

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

locale:
 [1] LC_CTYPE=pl_PL.UTF-8       LC_NUMERIC=C               LC_TIME=pl_PL.UTF-8       
 [4] LC_COLLATE=pl_PL.UTF-8     LC_MONETARY=pl_PL.UTF-8    LC_MESSAGES=pl_PL.UTF-8   
 [7] LC_PAPER=pl_PL.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=pl_PL.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.5.0    sparklyr_0.3.4

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.6     digest_0.6.9    withr_1.0.2     rprojroot_1.0-2 assertthat_0.1 
 [6] rappdirs_0.3.1  R6_2.1.2        DBI_0.4-1       magrittr_1.5    lazyeval_0.2.0 
[11] config_0.1.0    tools_3.3.1     readr_0.2.2     yaml_2.1.13     parallel_3.3.1 
[16] tibble_1.1     
@kevinushey
Contributor

kevinushey commented Aug 17, 2016

It looks like setEpsilon is a method available to mllib.KMeans, but not ml.KMeans. Compare the API documentation at:

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.clustering.KMeans
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.clustering.KMeans

We might need to ask the Spark team to add these parameters to the ml.KMeans implementation. (As I understand it, the Spark team now promotes the use of the ml package with Spark DataFrames over mllib, which is why we use the ml package throughout sparklyr.)
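For reference, the DataFrame-based ml.KMeans does expose the same knobs under different names: the convergence tolerance setter is setTol and the initialization-mode setter is setInitMode. A minimal sketch of the corresponding invoke chain (assuming a live sparklyr connection `sc`; not run here):

```r
# Sketch only: requires a live Spark connection `sc` and library(sparklyr).
kmeans <- invoke_new(sc, "org.apache.spark.ml.clustering.KMeans")
model <- kmeans %>%
  invoke("setK", 3L) %>%
  invoke("setMaxIter", 5L) %>%
  invoke("setTol", 1e-4) %>%           # ml analogue of mllib's setEpsilon
  invoke("setInitMode", "k-means||")   # ml analogue of setInitializationMode
```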

@MarcinKosinski
Contributor

MarcinKosinski commented Aug 17, 2016

Thanks for pointing this out. Those two links are a good comparison for beginners like me;
now I understand Spark's ML/MLlib split and sparklyr better.

I'll keep thinking about this issue :). setEpsilon is now setTol, so I can come up with a PR in the near future adding more KMeans parameters.

@MarcinKosinski
Contributor

MarcinKosinski commented Aug 18, 2016

Moved to #179.
