
cannot fit 'intercept only' logistic regression model #1596

Closed
shabbybanks opened this issue Jul 13, 2018 · 12 comments · Fixed by shabbybanks/sparklyr#1

@shabbybanks
Contributor

It is possible to fit an 'intercept-only' glm in R, but not via ml_logistic_regression:

# testing
train_data <- iris %>%
  mutate(is_setosa=as.numeric((Species=='setosa'))) %>%
  mutate(is_not_setosa=1-is_setosa) %>%
  setNames(gsub('\\.','_',names(.)))

# runs just fine:
r_mod_one <- glm(formula=is_setosa ~ 1,data=train_data,family=binomial(link='logit'))

# copy data to spark ...
spark_data <- copy_to(sc,train_data,'like_iris',overwrite=TRUE)
# then fit
# but this errors out:
s_mod_one <- spark_data %>% ml_logistic_regression(formula = is_setosa ~ 1)
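For reference, the intercept-only logistic fit has a closed form, so it is easy to see what the Spark fit should return: the intercept is the log-odds of the sample proportion of positives. A quick check of that arithmetic (a Python sketch, not sparklyr; the 50-of-150 setosa split mirrors iris):

```python
import math

# Intercept-only logistic regression has a closed-form MLE: the fitted
# intercept equals logit(p), the log-odds of the sample proportion of
# positives. iris has 50 setosa rows out of 150.
y = [1.0] * 50 + [0.0] * 100

p = sum(y) / len(y)                  # sample proportion of positives: 1/3
intercept = math.log(p / (1.0 - p))  # logit(1/3) = log(0.5)

print(round(intercept, 4))  # -0.6931
```

The R glm above should report this same value as its (Intercept) coefficient.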

Depending on the size of the data (I was trying this on 'real' data, not iris), it may take a while before throwing an error, which suggests the problem occurs later in the processing rather than earlier. The error message is the mystifying:

 java.lang.IllegalArgumentException: requirement failed: Vector should have dimension larger than zero.

with this awesome stack trace:

 1: spark_data %>% ml_logistic_regression(formula = is_setosa ~ 1)
 2: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
 3: eval(quote(`_fseq`(`_lhs`)), env, env)
 4: eval(quote(`_fseq`(`_lhs`)), env, env)
 5: `_fseq`(`_lhs`)
 6: freduce(value, `_function_list`)
 7: withVisible(function_list[[k]](value))
 8: function_list[[k]](value)
 9: ml_logistic_regression(., formula = is_setosa ~ 1)
10: ml_logistic_regression.tbl_spark(., formula = is_setosa ~ 1)
11: ml_generate_ml_model(x, predictor = predictor, formula = formula, features_
12: pipeline %>% ml_fit(x)
13: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
14: eval(quote(`_fseq`(`_lhs`)), env, env)
15: eval(quote(`_fseq`(`_lhs`)), env, env)
16: `_fseq`(`_lhs`)
17: freduce(value, `_function_list`)
18: withVisible(function_list[[k]](value))
19: function_list[[k]](value)
20: ml_fit(., x)
21: spark_jobj(x) %>% invoke("fit", spark_dataframe(dataset)) %>% ml_constructo
22: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
23: eval(quote(`_fseq`(`_lhs`)), env, env)
24: eval(quote(`_fseq`(`_lhs`)), env, env)
25: `_fseq`(`_lhs`)
26: freduce(value, `_function_list`)
27: function_list[[i]](value)
28: invoke(., "fit", spark_dataframe(dataset))
29: invoke.shell_jobj(., "fit", spark_dataframe(dataset))
30: invoke_method(spark_connection(jobj), FALSE, jobj, method, ...)
31: invoke_method.spark_shell_connection(spark_connection(jobj), FALSE, jobj, m
32: core_invoke_method(sc, static, object, method, ...)
33: withr::with_options(list(warning.length = 8000), {
    if (nzchar(msg)) {

34: force(code)
@shabbybanks
Contributor Author

For that matter, fitting a perfectly cromulent model without an intercept also throws an error. With the same setup:

# this works fine
s_mod_ok  <- spark_data %>%
  ml_logistic_regression(formula = is_setosa ~ Sepal_Length,fit_intercept=TRUE)

# this errors:
s_mod_bad <- spark_data %>%
  ml_logistic_regression(formula = is_setosa ~ Sepal_Length,fit_intercept=FALSE)

The error and stack trace I get are:

Error: `x` must be a vector

Enter a frame number, or 0 to exit

 1: spark_data %>% ml_logistic_regression(formula = is_setosa ~ Sepal_Length, f
 2: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
 3: eval(quote(`_fseq`(`_lhs`)), env, env)
 4: eval(quote(`_fseq`(`_lhs`)), env, env)
 5: `_fseq`(`_lhs`)
 6: freduce(value, `_function_list`)
 7: withVisible(function_list[[k]](value))
 8: function_list[[k]](value)
 9: ml_logistic_regression(., formula = is_setosa ~ Sepal_Length, fit_intercept
10: ml_logistic_regression.tbl_spark(., formula = is_setosa ~ Sepal_Length, fit
11: ml_generate_ml_model(x, predictor = predictor, formula = formula, features_
12: do.call(constructor, args)
13: (function (pipeline, pipeline_model, model, dataset, formula, feature_names
14: rlang::set_names(coefficients, feature_names)
15: set_names_impl(x, x, nm, ...)
16: abort("`x` must be a vector")

shabbybanks added a commit to shabbybanks/sparklyr that referenced this issue Jul 13, 2018
As this was written, I believe `coefficients` was referencing a function, not the model's coefficients.
Hoping this will fix sparklyr#1596
@shabbybanks
Contributor Author

To be sure, the PR should fix the latter case, where there is a variable and no intercept. I should have made this two separate issues. My bad.

@shabbybanks
Contributor Author

Sorry, this should not have been closed. #1597 fixes the case of 'no intercept', which you could use to 'fake' the intercept-only model, I think. However, the following are all broken:

# make some data
train_data <- iris %>% mutate(is_setosa=as.numeric((Species=='setosa')))

# copy it into spark
spark_data <- copy_to(sc,train_data,'like_iris',overwrite=TRUE)

# these all error.
# scala error:
s_int_mod <- spark_data %>% ml_logistic_regression(formula = is_setosa ~ 1)
# nonsensical in R:
s_int_mod <- spark_data %>% ml_logistic_regression(formula = is_setosa ~ )
# this throws a scala error:
s_int_mod <- spark_data %>% ml_logistic_regression(formula = "is_setosa ~ ")
# this might work when I get fix for #1597, but is against the spirit, really.
s_int_mod <- spark_data %>% mutate(one=1.0) %>% ml_logistic_regression(formula = is_setosa ~ one,fit_intercept=FALSE)
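The last workaround is algebraically the same model: the coefficient on the constant ones column plays the role of the intercept. A few Newton steps on the log-likelihood make that concrete (a Python sketch assuming the 50-of-150 setosa split, not sparklyr):

```python
import math

# With a single all-ones "feature" and no intercept, the log-likelihood is
# l(b) = n_pos * b - n * log(1 + exp(b)), so its maximizer is logit(n_pos/n),
# i.e. exactly the intercept of the intercept-only fit.
n_pos, n_neg = 50, 100
n = n_pos + n_neg

b = 0.0  # coefficient on the all-ones column
for _ in range(25):
    p = 1.0 / (1.0 + math.exp(-b))  # predicted probability, same for every row
    grad = n_pos - n * p            # d loglik / db
    hess = -n * p * (1.0 - p)       # d^2 loglik / db^2
    b -= grad / hess                # Newton step

print(round(b, 4))  # -0.6931, i.e. logit(50/150)
```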

The first one, which is of interest here, gives the stack trace:

Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 143.0 failed 4 times, most recent failure: Lost task 0.3 in stage 143.0 (TID 111, vaecdh00361.ussdnve.baml.com, executor 10): java.lang.IllegalArgumentException: requirement failed: Vector should have dimension larger than zero.
        at scala.Predef$.require(Predef.scala:224)
        at org.apache.spark.mllib.stat.MultivariateOnlineSummarizer.add(MultivariateOnlineSummarizer.scala:74)
        at org.apache.spark.ml.classification.LogisticRegression$$anonfun$15.apply(LogisticRegression.scala:509)
        at org.apache.spark.ml.classification.LogisticRegression$$anonfun$15.apply(LogisticRegression.scala:508)

...

Enter a frame number, or 0 to exit

 1: spark_data %>% ml_logistic_regression(formula = is_setosa ~ 1)
 2: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
 3: eval(quote(`_fseq`(`_lhs`)), env, env)
 4: eval(quote(`_fseq`(`_lhs`)), env, env)
 5: `_fseq`(`_lhs`)
 6: freduce(value, `_function_list`)
 7: withVisible(function_list[[k]](value))
 8: function_list[[k]](value)
 9: ml_logistic_regression(., formula = is_setosa ~ 1)
10: ml_logistic_regression.tbl_spark(., formula = is_setosa ~ 1)
11: ml_generate_ml_model(x, predictor = predictor, formula = formula, features_
12: pipeline %>% ml_fit(x)
...

From what I can tell, ml_generate_ml_model thinks that features_col should be 'features', but that column does not exist in the data. If during debugging I set features_col=c() and rerun, I get:

Error during wrapup: java.lang.IllegalArgumentException: Field "features" does not exist.
        at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
        at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)

So perhaps the underlying logistic regression really wants to have features in it.
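For what it's worth, the Scala requirement that fails is a guard in MultivariateOnlineSummarizer insisting that every row's features vector be non-empty, which a zero-dimension features column from 'y ~ 1' can never satisfy. A minimal sketch of that guard (a hypothetical Python re-creation, not the Spark source):

```python
class OnlineSummarizer:
    """Toy stand-in for Spark's MultivariateOnlineSummarizer."""

    def __init__(self):
        self.count = 0
        self.sums = None

    def add(self, features):
        # The guard an intercept-only fit trips: an empty features vector
        # is rejected before any statistics are accumulated.
        if len(features) == 0:
            raise ValueError(
                "requirement failed: Vector should have dimension larger than zero.")
        if self.sums is None:
            self.sums = [0.0] * len(features)
        for i, v in enumerate(features):
            self.sums[i] += v
        self.count += 1
        return self

s = OnlineSummarizer()
s.add([5.1, 3.5])  # a normal feature vector is fine
try:
    s.add([])      # zero-dimension vector, as produced for 'y ~ 1'
except ValueError as e:
    print(e)
```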

I will submit some tests that would catch this one, but I have no fix in mind, so the tests will only break the build.

@shabbybanks shabbybanks reopened this Jul 17, 2018
@shabbybanks
Contributor Author

The 'pipeline API' gives the same error, of course:

# try the pipeline model
pipeline <- ml_pipeline(sc) %>%
  ft_r_formula(is_setosa ~ 1) %>%
  ml_logistic_regression()

s_int_mod <- pipeline %>%
  ml_fit(spark_data)
Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 145.0 failed 4 times, most recent failure: Lost task 0.3 in stage 145.0 (TID 119, vaecdh00344.ussdnve.baml.com, executor 12): java.lang.IllegalArgumentException: requirement failed: Vector should have dimension larger than zero.
        at scala.Predef$.require(Predef.scala:224)
        at org.apache.spark.mllib.stat.MultivariateOnlineSummarizer.add(MultivariateOnlineSummarizer.scala:74)
        at org.apache.spark.ml.classification.LogisticRegression$$anonfun$15.apply(LogisticRegression.scala:509)
        at org.apache.spark.ml.classification.LogisticRegression$$anonfun$15.apply(LogisticRegression.scala:508)

@kevinykuo kevinykuo self-assigned this Jul 18, 2018
@kevinykuo kevinykuo added the ml label Jul 18, 2018
@kevinykuo kevinykuo added this to the 0.10.0 milestone Jul 18, 2018
shabbybanks added a commit to shabbybanks/sparklyr that referenced this issue Jul 18, 2018
adding test to catch sparklyr#1596.
fix not in place yet, this will break the build.
@kevinykuo kevinykuo modified the milestones: 0.10.0, 0.9.0 Sep 13, 2018
@kevinykuo
Collaborator

OK I think RFormula isn't working as expected in spark.ml. I'm getting

> iris_tbl <- sdf_copy_to(sc, iris)
> ft_r_formula(iris_tbl, "Species ~ 1")
 Error: org.apache.spark.sql.AnalysisException: cannot resolve 'named_struct()' due to data type mismatch: input to function named_struct requires at least one argument;;
'Project [Sepal_Length#4034, Sepal_Width#4035, Petal_Length#4036, Petal_Width#4037, Species#4038, UDF(named_struct()) AS features#4165]
+- AnalysisBarrier
      +- Project [Sepal_Length#4034, Sepal_Width#4035, Petal_Length#4036, Petal_Width#4037, Species#4038]
         +- SubqueryAlias iris
            +- LogicalRDD [Sepal_Length#4034, Sepal_Width#4035, Petal_Length#4036, Petal_Width#4037, Species#4038], false

	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)

So the parser isn't understanding that ~ 1 should give us a features column with ones. We could try to do the parsing ourselves in sparklyr but that would be a bigger undertaking. It's probably better to fix this upstream in apache/spark. Leaving this issue open for now.

@shabbybanks
Contributor Author

Thanks for the update and for all the work. Should I submit an issue upstream in apache/spark, or have you already?

@kevinykuo
Collaborator

I have not.

@shabbybanks
Contributor Author

I looked to see if it had already been filed and found only one existing issue, SPARK-19400, which seems to suggest that RFormula may accept y ~ 1 but that IWLS fails on it. I am not adept enough in raw Spark to test this, however.

@kevinykuo
Collaborator

Thanks, this is interesting, taking a look now...

@kevinykuo
Collaborator

Failing in Spark 2.3.0...

scala> val output = formula.fit(dataset).transform(dataset)
org.apache.spark.sql.AnalysisException: cannot resolve 'named_struct()' due to data type mismatch: input to function named_struct requires at least one argument;;
'Project [y#11, w#12, off#13, x1#14, x2#15, UDF(named_struct()) AS features#34]
+- AnalysisBarrier
      +- Project [_1#5 AS y#11, _2#6 AS w#12, _3#7 AS off#13, _4#8 AS x1#14, _5#9 AS x2#15]
         +- LocalRelation [_1#5, _2#6, _3#7, _4#8, _5#9]

@kevinykuo
Collaborator

In Spark 2.2.0 this works in sparklyr:

df <- tribble(
  ~y, ~w, ~off, ~x1, ~x2,
  1, 1, 2, 0, 5,
  0.5, 2, 1, 1, 2,
  1, 3, 0.5, 2, 1,
  2, 4, 1.5, 3, 3
)
df_tbl <- sdf_copy_to(sc, df)
fm <- ft_r_formula(sc, "y ~ 1")
output <- fm %>% ml_fit_and_transform(df_tbl)
glr <- ml_generalized_linear_regression(sc, family = "poisson")
model <- glr %>% ml_fit(output)
model
# GeneralizedLinearRegressionModel (Transformer)
# <generalized_linear_regression_151621472ad45> 
#   (Parameters -- Column Names)
# features_col: features
# label_col: label
# prediction_col: prediction
# (Transformer Info)
# coefficients:  num(0)  
# intercept:  num 0.118 
# num_features:  int 0 
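That intercept checks out against the closed form: for an intercept-only Poisson GLM with a log link, the fitted intercept is the log of the sample mean of y. Verifying the arithmetic (a Python sketch using the four y values from the tribble above):

```python
import math

# Intercept-only Poisson regression with a log link has a closed-form MLE:
# exp(intercept) equals the sample mean of y.
y = [1.0, 0.5, 1.0, 2.0]

intercept = math.log(sum(y) / len(y))  # log(1.125)

print(round(intercept, 3))  # 0.118, matching the Spark output above
```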

@kevinykuo kevinykuo removed their assignment Jan 13, 2020
@yitao-li
Contributor

Looks like the issue has been fixed in apache/spark.
