Unexpectedly different behavior for factors/dummy variables between parsnip and workflows #326

Closed · juliasilge opened this issue on Feb 6, 2020 · 5 comments
Labels: feature (a feature request or enhancement)

@juliasilge (Member)

When training what seems like the same model (same model specification, same formula, same data) with parsnip versus with workflows, it is surprising to get different results. I found this behavior quite unexpected, especially the choice that workflows made.

Some options to reduce user surprise 😮 would be clearer documentation or changed behavior in parsnip, in workflows, or both.

## base R lm() for comparison
lm(Sepal.Length ~ ., iris)
#> 
#> Call:
#> lm(formula = Sepal.Length ~ ., data = iris)
#> 
#> Coefficients:
#>       (Intercept)        Sepal.Width       Petal.Length        Petal.Width  
#>            2.1713             0.4959             0.8292            -0.3152  
#> Speciesversicolor   Speciesvirginica  
#>           -0.7236            -1.0235

library(parsnip)
lm_spec <- linear_reg() %>%
  set_engine(engine = "lm") 

## parsnip version looks the same as lm
lm_spec %>%
  fit(Sepal.Length ~ ., data = iris)
#> parsnip model object
#> 
#> Fit time:  2ms 
#> 
#> Call:
#> stats::lm(formula = formula, data = data)
#> 
#> Coefficients:
#>       (Intercept)        Sepal.Width       Petal.Length        Petal.Width  
#>            2.1713             0.4959             0.8292            -0.3152  
#> Speciesversicolor   Speciesvirginica  
#>           -0.7236            -1.0235

## workflows version has made a different choice about dummy variables
library(workflows)
workflow() %>%
  add_model(lm_spec) %>%
  add_formula(Sepal.Length ~ .) %>%
  fit(data = iris)
#> ══ Workflow [trained] ═══════════════════════════════════════════════════════════════════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: linear_reg()
#> 
#> ── Preprocessor ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> Sepal.Length ~ .
#> 
#> ── Model ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> 
#> Call:
#> stats::lm(formula = formula, data = data)
#> 
#> Coefficients:
#>       (Intercept)        Sepal.Width       Petal.Length        Petal.Width  
#>            1.1478             0.4959             0.8292            -0.3152  
#>     Speciessetosa  Speciesversicolor   Speciesvirginica  
#>            1.0235             0.2999                 NA

Created on 2020-02-06 by the reprex package (v0.3.0)
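
For reference, the workflows coefficients above look like what you would get by one-hot encoding Species yourself and then calling lm() with an intercept: the design matrix becomes rank-deficient, which is where the NA coefficient comes from. A rough sketch of that equivalence (not part of the reprex above):

## one-hot encode Species (all three levels), then fit with an intercept;
## the design matrix is rank-deficient, so lm() aliases one Species column
iris_onehot <- cbind(
  iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")],
  model.matrix(~ Species + 0, data = iris)
)
lm(Sepal.Length ~ ., data = iris_onehot)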

@LaugeGregers

This comment has been minimized.

@juliasilge (Member, Author)

Yep, we are aware of where the difference arises. Current thoughts on resolving the unexpected differences are in #290.

@juliasilge transferred this issue from tidymodels/parsnip on Jun 5, 2020
@juliasilge (Member, Author) commented on Jun 5, 2020

We addressed the issue with indicator/dummy variables in #319 and tidymodels/workflows#51. Next, we need to address the difference between one-hot encoding and indicator/dummy variables (i.e., the intercept handling).
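
For context, the two encodings differ in how they interact with the intercept: with an intercept present, R's default treatment contrasts drop one factor level (indicator/dummy coding), while one-hot encoding keeps a column for every level. A quick base R illustration of the two design matrices:

## indicator/dummy coding: with an intercept, one Species level is dropped
model.matrix(~ Species, data = iris[c(1, 51, 101), ])

## one-hot coding: with the intercept removed, every Species level gets a column
model.matrix(~ Species + 0, data = iris[c(1, 51, 101), ])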

@juliasilge changed the title from "Document (or change) unexpectedly different behavior for factors/dummy variables between parsnip and workflows" to "Unexpectedly different behavior for factors/dummy variables between parsnip and workflows" on Jun 5, 2020
@juliasilge transferred this issue from tidymodels/hardhat on Jun 5, 2020
@juliasilge added the "feature (a feature request or enhancement)" label on Jun 5, 2020
@juliasilge (Member, Author)

This has been closed in #332 and tidymodels/workflows#53.

library(parsnip)

lm_spec <- linear_reg() %>%
  set_engine(engine = "lm") 

lm_spec %>%
  fit(Sepal.Length ~ ., data = iris)
#> parsnip model object
#> 
#> Fit time:  5ms 
#> 
#> Call:
#> stats::lm(formula = Sepal.Length ~ ., data = data)
#> 
#> Coefficients:
#>       (Intercept)        Sepal.Width       Petal.Length        Petal.Width  
#>            2.1713             0.4959             0.8292            -0.3152  
#> Speciesversicolor   Speciesvirginica  
#>           -0.7236            -1.0235

library(workflows)

workflow() %>%
  add_model(lm_spec) %>%
  add_formula(Sepal.Length ~ .) %>%
  fit(data = iris)
#> ══ Workflow [trained] ═══════════════════════════════════════
#> Preprocessor: Formula
#> Model: linear_reg()
#> 
#> ── Preprocessor ─────────────────────────────────────────────
#> Sepal.Length ~ .
#> 
#> ── Model ────────────────────────────────────────────────────
#> 
#> Call:
#> stats::lm(formula = ..y ~ ., data = data)
#> 
#> Coefficients:
#>       (Intercept)        Sepal.Width       Petal.Length        Petal.Width  
#>            2.1713             0.4959             0.8292            -0.3152  
#> Speciesversicolor   Speciesvirginica  
#>           -0.7236            -1.0235

Created on 2020-07-02 by the reprex package (v0.3.0.9001)
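
As a quick sanity check that the two fits now agree, the underlying lm coefficients can be compared directly; one way, using workflows' pull_workflow_fit() to get the parsnip fit back out of the trained workflow:

parsnip_fit <- lm_spec %>%
  fit(Sepal.Length ~ ., data = iris)

wf_fit <- workflow() %>%
  add_model(lm_spec) %>%
  add_formula(Sepal.Length ~ .) %>%
  fit(data = iris)

## both $fit slots hold the underlying stats::lm object
all.equal(
  coef(parsnip_fit$fit),
  coef(pull_workflow_fit(wf_fit)$fit)
)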

@github-actions bot commented on Mar 6, 2021

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions bot locked and limited the conversation to collaborators on Mar 6, 2021