
cannot fit 'intercept only' logistic regression model #1596

Closed
shabbybanks opened this issue Jul 13, 2018 · 12 comments · Fixed by shabbybanks/sparklyr#1

@shabbybanks
Contributor

It is possible to fit an 'intercept-only' glm in R, but not via ml_logistic_regression:

# testing
train_data <- iris %>%
  mutate(is_setosa=as.numeric((Species=='setosa'))) %>%
  mutate(is_not_setosa=1-is_setosa) %>%
  setNames(gsub('\\.','_',names(.)))

# runs just fine:
r_mod_one <- glm(formula=is_setosa ~ 1,data=train_data,family=binomial(link='logit'))

# copy data to spark ...
spark_data <- copy_to(sc,train_data,'like_iris',overwrite=TRUE)
# then fit
# but this errors out:
s_mod_one <- spark_data %>% ml_logistic_regression(formula = is_setosa ~ 1)
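For reference, the intercept-only logistic fit has a closed form, so it is easy to see what the Spark fit should return: the intercept is the log-odds of the sample proportion of positives. A quick check of that arithmetic (a Python sketch, not sparklyr; the 50-of-150 setosa split mirrors iris):

```python
import math

# Intercept-only logistic regression has a closed-form MLE: the fitted
# intercept equals logit(p), the log-odds of the sample proportion of
# positives. iris has 50 setosa rows out of 150.
y = [1.0] * 50 + [0.0] * 100

p = sum(y) / len(y)                  # sample proportion of positives: 1/3
intercept = math.log(p / (1.0 - p))  # logit(1/3) = log(0.5)

print(round(intercept, 4))  # -0.6931
```

The R glm above should report this same value as its (Intercept) coefficient.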

Depending on the size of the data (I was trying this on 'real' data, not iris), it may take a while before throwing an error, which suggests the problem occurs later in the processing rather than earlier. The error message is the mystifying:

 java.lang.IllegalArgumentException: requirement failed: Vector should have dimension larger than zero.

with this awesome stack trace:

 1: spark_data %>% ml_logistic_regression(formula = is_setosa ~ 1)
 2: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
 3: eval(quote(`_fseq`(`_lhs`)), env, env)
 4: eval(quote(`_fseq`(`_lhs`)), env, env)
 5: `_fseq`(`_lhs`)
 6: freduce(value, `_function_list`)
 7: withVisible(function_list[[k]](value))
 8: function_list[[k]](value)
 9: ml_logistic_regression(., formula = is_setosa ~ 1)
10: ml_logistic_regression.tbl_spark(., formula = is_setosa ~ 1)
11: ml_generate_ml_model(x, predictor = predictor, formula = formula, features_
12: pipeline %>% ml_fit(x)
13: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
14: eval(quote(`_fseq`(`_lhs`)), env, env)
15: eval(quote(`_fseq`(`_lhs`)), env, env)
16: `_fseq`(`_lhs`)
17: freduce(value, `_function_list`)
18: withVisible(function_list[[k]](value))
19: function_list[[k]](value)
20: ml_fit(., x)
21: spark_jobj(x) %>% invoke("fit", spark_dataframe(dataset)) %>% ml_constructo
22: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
23: eval(quote(`_fseq`(`_lhs`)), env, env)
24: eval(quote(`_fseq`(`_lhs`)), env, env)
25: `_fseq`(`_lhs`)
26: freduce(value, `_function_list`)
27: function_list[[i]](value)
28: invoke(., "fit", spark_dataframe(dataset))
29: invoke.shell_jobj(., "fit", spark_dataframe(dataset))
30: invoke_method(spark_connection(jobj), FALSE, jobj, method, ...)
31: invoke_method.spark_shell_connection(spark_connection(jobj), FALSE, jobj, m
32: core_invoke_method(sc, static, object, method, ...)
33: withr::with_options(list(warning.length = 8000), {
    if (nzchar(msg)) {

34: force(code)
@shabbybanks
Contributor Author

For that matter, fitting a perfectly cromulent model without an intercept also throws an error. With the same setup:

# this works fine
s_mod_ok  <- spark_data %>%
  ml_logistic_regression(formula = is_setosa ~ Sepal_Length,fit_intercept=TRUE)

# this errors:
s_mod_bad <- spark_data %>%
  ml_logistic_regression(formula = is_setosa ~ Sepal_Length,fit_intercept=FALSE)

The error and stack trace I get are:

Error: `x` must be a vector

Enter a frame number, or 0 to exit

 1: spark_data %>% ml_logistic_regression(formula = is_setosa ~ Sepal_Length, f
 2: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
 3: eval(quote(`_fseq`(`_lhs`)), env, env)
 4: eval(quote(`_fseq`(`_lhs`)), env, env)
 5: `_fseq`(`_lhs`)
 6: freduce(value, `_function_list`)
 7: withVisible(function_list[[k]](value))
 8: function_list[[k]](value)
 9: ml_logistic_regression(., formula = is_setosa ~ Sepal_Length, fit_intercept
10: ml_logistic_regression.tbl_spark(., formula = is_setosa ~ Sepal_Length, fit
11: ml_generate_ml_model(x, predictor = predictor, formula = formula, features_
12: do.call(constructor, args)
13: (function (pipeline, pipeline_model, model, dataset, formula, feature_names
14: rlang::set_names(coefficients, feature_names)
15: set_names_impl(x, x, nm, ...)
16: abort("`x` must be a vector")

shabbybanks added a commit to shabbybanks/sparklyr that referenced this issue Jul 13, 2018
As this was written, I believe `coefficients` was referencing a function, not the model's coefficients.
Hoping this will fix sparklyr#1596
@shabbybanks
Contributor Author

To be sure, the PR should fix the latter case, where there is a variable and no intercept. I should have made this two separate issues. My bad.

@shabbybanks
Contributor Author

Sorry, this should not have been closed. #1597 fixes the case of 'no intercept', which you could use to 'fake' the intercept-only model, I think. However, the following are all broken:

# make some data
train_data <- iris %>% mutate(is_setosa=as.numeric((Species=='setosa')))

# copy it into spark
spark_data <- copy_to(sc,train_data,'like_iris',overwrite=TRUE)

# these all error.
# scala error:
s_int_mod <- spark_data %>% ml_logistic_regression(formula = is_setosa ~ 1)
# nonsensical in R:
s_int_mod <- spark_data %>% ml_logistic_regression(formula = is_setosa ~ )
# this throws a scala error:
s_int_mod <- spark_data %>% ml_logistic_regression(formula = "is_setosa ~ ")
# this might work when I get fix for #1597, but is against the spirit, really.
s_int_mod <- spark_data %>% mutate(one=1.0) %>% ml_logistic_regression(formula = is_setosa ~ one,fit_intercept=FALSE)
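The last workaround is algebraically the same model: the coefficient on the constant ones column plays the role of the intercept. A few Newton steps on the log-likelihood make that concrete (a Python sketch assuming the 50-of-150 setosa split, not sparklyr):

```python
import math

# With a single all-ones "feature" and no intercept, the log-likelihood is
# l(b) = n_pos * b - n * log(1 + exp(b)), so its maximizer is logit(n_pos/n),
# i.e. exactly the intercept of the intercept-only fit.
n_pos, n_neg = 50, 100
n = n_pos + n_neg

b = 0.0  # coefficient on the all-ones column
for _ in range(25):
    p = 1.0 / (1.0 + math.exp(-b))  # predicted probability, same for every row
    grad = n_pos - n * p            # d loglik / db
    hess = -n * p * (1.0 - p)       # d^2 loglik / db^2
    b -= grad / hess                # Newton step

print(round(b, 4))  # -0.6931, i.e. logit(50/150)
```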

The first one, which is of interest here, gives the stack trace:

Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 143.0 failed 4 times, most recent failure: Lost task 0.3 in stage 143.0 (TID 111, vaecdh00361.ussdnve.baml.com, executor 10): java.lang.IllegalArgumentException: requirement failed: Vector should have dimension larger than zero.
        at scala.Predef$.require(Predef.scala:224)
        at org.apache.spark.mllib.stat.MultivariateOnlineSummarizer.add(MultivariateOnlineSummarizer.scala:74)
        at org.apache.spark.ml.classification.LogisticRegression$$anonfun$15.apply(LogisticRegression.scala:509)
        at org.apache.spark.ml.classification.LogisticRegression$$anonfun$15.apply(LogisticRegression.scala:508)

...

Enter a frame number, or 0 to exit

 1: spark_data %>% ml_logistic_regression(formula = is_setosa ~ 1)
 2: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
 3: eval(quote(`_fseq`(`_lhs`)), env, env)
 4: eval(quote(`_fseq`(`_lhs`)), env, env)
 5: `_fseq`(`_lhs`)
 6: freduce(value, `_function_list`)
 7: withVisible(function_list[[k]](value))
 8: function_list[[k]](value)
 9: ml_logistic_regression(., formula = is_setosa ~ 1)
10: ml_logistic_regression.tbl_spark(., formula = is_setosa ~ 1)
11: ml_generate_ml_model(x, predictor = predictor, formula = formula, features_
12: pipeline %>% ml_fit(x)
...

From what I can tell, ml_generate_ml_model thinks that features_col should be 'features', but that column does not exist in the data. If during debugging I set features_col=c() and rerun, I get:

Error during wrapup: java.lang.IllegalArgumentException: Field "features" does not exist.
        at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
        at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)

So perhaps the underlying logistic regression really wants to have features in it.
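For what it's worth, the Scala requirement that fails is a guard in MultivariateOnlineSummarizer insisting that every row's features vector be non-empty, which a zero-dimension features column from 'y ~ 1' can never satisfy. A minimal sketch of that guard (a hypothetical Python re-creation, not the Spark source):

```python
class OnlineSummarizer:
    """Toy stand-in for Spark's MultivariateOnlineSummarizer."""

    def __init__(self):
        self.count = 0
        self.sums = None

    def add(self, features):
        # The guard an intercept-only fit trips: an empty features vector
        # is rejected before any statistics are accumulated.
        if len(features) == 0:
            raise ValueError(
                "requirement failed: Vector should have dimension larger than zero.")
        if self.sums is None:
            self.sums = [0.0] * len(features)
        for i, v in enumerate(features):
            self.sums[i] += v
        self.count += 1
        return self

s = OnlineSummarizer()
s.add([5.1, 3.5])  # a normal feature vector is fine
try:
    s.add([])      # zero-dimension vector, as produced for 'y ~ 1'
except ValueError as e:
    print(e)
```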

I will submit some tests that would catch this one, but I have no fix in mind, so the tests will only break the build.

@shabbybanks shabbybanks reopened this Jul 17, 2018
@shabbybanks
Contributor Author

The 'pipeline API' gives the same error, of course:

# try the pipeline model
pipeline <- ml_pipeline(sc) %>%
  ft_r_formula(is_setosa ~ 1) %>%
  ml_logistic_regression()

s_int_mod <- pipeline %>%
  ml_fit(spark_data)
Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 145.0 failed 4 times, most recent failure: Lost task 0.3 in stage 145.0 (TID 119, vaecdh00344.ussdnve.baml.com, executor 12): java.lang.IllegalArgumentException: requirement failed: Vector should have dimension larger than zero.
        at scala.Predef$.require(Predef.scala:224)
        at org.apache.spark.mllib.stat.MultivariateOnlineSummarizer.add(MultivariateOnlineSummarizer.scala:74)
        at org.apache.spark.ml.classification.LogisticRegression$$anonfun$15.apply(LogisticRegression.scala:509)
        at org.apache.spark.ml.classification.LogisticRegression$$anonfun$15.apply(LogisticRegression.scala:508)

@kevinykuo kevinykuo self-assigned this Jul 18, 2018
@kevinykuo kevinykuo added the ml label Jul 18, 2018
@kevinykuo kevinykuo added this to the 0.10.0 milestone Jul 18, 2018
shabbybanks added a commit to shabbybanks/sparklyr that referenced this issue Jul 18, 2018
adding test to catch sparklyr#1596.
fix not in place yet, this will break the build.
@kevinykuo kevinykuo modified the milestones: 0.10.0, 0.9.0 Sep 13, 2018
@kevinykuo
Collaborator

OK I think RFormula isn't working as expected in spark.ml. I'm getting

> iris_tbl <- sdf_copy_to(sc, iris)
> ft_r_formula(iris_tbl, "Species ~ 1")
 Error: org.apache.spark.sql.AnalysisException: cannot resolve 'named_struct()' due to data type mismatch: input to function named_struct requires at least one argument;;
'Project [Sepal_Length#4034, Sepal_Width#4035, Petal_Length#4036, Petal_Width#4037, Species#4038, UDF(named_struct()) AS features#4165]
+- AnalysisBarrier
      +- Project [Sepal_Length#4034, Sepal_Width#4035, Petal_Length#4036, Petal_Width#4037, Species#4038]
         +- SubqueryAlias iris
            +- LogicalRDD [Sepal_Length#4034, Sepal_Width#4035, Petal_Length#4036, Petal_Width#4037, Species#4038], false

	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)

So the parser isn't understanding that ~ 1 should give us a features column with ones. We could try to do the parsing ourselves in sparklyr but that would be a bigger undertaking. It's probably better to fix this upstream in apache/spark. Leaving this issue open for now.

@shabbybanks
Contributor Author

Thanks for the update and for all the work. Should I submit an issue upstream in apache/spark, or have you already?

@kevinykuo
Collaborator

I have not.

@shabbybanks
Contributor Author

I looked to see if it had already been filed and found only one existing issue, SPARK-19400, which seems to suggest that RFormula may accept y ~ 1 but that IWLS fails on it. I am not adept enough in raw Spark to test this, however.

@kevinykuo
Collaborator

Thanks, this is interesting, taking a look now...

@kevinykuo
Collaborator

Failing in Spark 2.3.0...

scala> val output = formula.fit(dataset).transform(dataset)
org.apache.spark.sql.AnalysisException: cannot resolve 'named_struct()' due to data type mismatch: input to function named_struct requires at least one argument;;
'Project [y#11, w#12, off#13, x1#14, x2#15, UDF(named_struct()) AS features#34]
+- AnalysisBarrier
      +- Project [_1#5 AS y#11, _2#6 AS w#12, _3#7 AS off#13, _4#8 AS x1#14, _5#9 AS x2#15]
         +- LocalRelation [_1#5, _2#6, _3#7, _4#8, _5#9]

@kevinykuo
Collaborator

In Spark 2.2.0 this works in sparklyr:

df <- tribble(
  ~y, ~w, ~off, ~x1, ~x2,
  1, 1, 2, 0, 5,
  0.5, 2, 1, 1, 2,
  1, 3, 0.5, 2, 1,
  2, 4, 1.5, 3, 3
)
df_tbl <- sdf_copy_to(sc, df)
fm <- ft_r_formula(sc, "y ~ 1")
output <- fm %>% ml_fit_and_transform(df_tbl)
glr <- ml_generalized_linear_regression(sc, family = "poisson")
model <- glr %>% ml_fit(output)
model
# GeneralizedLinearRegressionModel (Transformer)
# <generalized_linear_regression_151621472ad45> 
#   (Parameters -- Column Names)
# features_col: features
# label_col: label
# prediction_col: prediction
# (Transformer Info)
# coefficients:  num(0)  
# intercept:  num 0.118 
# num_features:  int 0 
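That intercept checks out against the closed form: for an intercept-only Poisson GLM with a log link, the fitted intercept is the log of the sample mean of y. Verifying the arithmetic (a Python sketch using the four y values from the tribble above):

```python
import math

# Intercept-only Poisson regression with a log link has a closed-form MLE:
# exp(intercept) equals the sample mean of y.
y = [1.0, 0.5, 1.0, 2.0]

intercept = math.log(sum(y) / len(y))  # log(1.125)

print(round(intercept, 3))  # 0.118, matching the Spark output above
```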

@kevinykuo kevinykuo removed their assignment Jan 13, 2020
@yitao-li
Contributor

Looks like the issue has been fixed in apache/spark.
