Checking if a variable is being used as both an outcome and a predictor #35

Closed
EmilHvitfeldt opened this issue Feb 16, 2019 · 2 comments

@EmilHvitfeldt
Member

I don't know whether this would fit within this package or somewhere else, but would it be beneficial to alert the user (or throw an error) if a variable appears on both the left and right sides of a formula?

@DavisVaughan
Member

This sounds useful! model.matrix() actually checks for this automatically: it throws a warning and drops the duplicated predictor, but only if the predictor is exactly the same as the outcome, meaning log(Sepal.Width) would not count as a duplicate. I think this is rather aggressive.

I don't think I would put this in mold(), as I wouldn't call this a "required" check, but I want hardhat to have a number of extra optional validate_***() functions that developers can use, and this seems like one of them.

Below is one version of a validate function for this. It uses the original column names to check for duplicates, so Sepal.Width ~ Sepal.Width and Sepal.Width ~ log(Sepal.Width) will both be flagged as having duplicates. There could also be a version that works more like model.matrix() and checks that the processed training data does not have duplicates (so log(Sepal.Width) would be treated as different from Sepal.Width).

library(hardhat)

.formula <- Sepal.Width ~ Sepal.Width

# //////////////////////////////////////////////////////////////////////////////

# mold() accepts a formula that uses the same variable on both sides
x <- mold(.formula, iris)

x$predictors
#> # A tibble: 150 x 1
#>    Sepal.Width
#>          <dbl>
#>  1         3.5
#>  2         3  
#>  3         3.2
#>  4         3.1
#>  5         3.6
#>  6         3.9
#>  7         3.4
#>  8         3.4
#>  9         2.9
#> 10         3.1
#> # … with 140 more rows

x$outcomes
#> # A tibble: 150 x 1
#>    Sepal.Width
#>          <dbl>
#>  1         3.5
#>  2         3  
#>  3         3.2
#>  4         3.1
#>  5         3.6
#>  6         3.9
#>  7         3.4
#>  8         3.4
#>  9         2.9
#> 10         3.1
#> # … with 140 more rows

# //////////////////////////////////////////////////////////////////////////////

# model.matrix() warns here and drops the duplicated predictor
mf <- model.frame(.formula, iris)
head(model.matrix(terms(mf), mf))
#> Warning in model.matrix.default(terms(mf), mf): the response appeared on
#> the right-hand side and was dropped
#> Warning in model.matrix.default(terms(mf), mf): problem with term 1 in
#> model.matrix: no columns are assigned
#>   (Intercept)
#> 1           1
#> 2           1
#> 3           1
#> 4           1
#> 5           1
#> 6           1

# //////////////////////////////////////////////////////////////////////////////

# the original column names for both sides are stored in the preprocessor
x$preprocessor$predictors$names
#> [1] "Sepal.Width"
x$preprocessor$outcomes$names
#> [1] "Sepal.Width"

# //////////////////////////////////////////////////////////////////////////////

validate_lhs_rhs_duplication <- function(preprocessor) {
  
  # only formula-based (terms) preprocessors have a left- and right-hand side
  if (!inherits(preprocessor, "terms_preprocessor")) {
    return(invisible(preprocessor))
  }
  
  # original (pre-transformation) column names used on each side
  original_predictor_names <- preprocessor$predictors$names
  original_outcome_names <- preprocessor$outcomes$names
  
  dups <- intersect(original_predictor_names, original_outcome_names)
  
  if (length(dups) > 0) {
    
    # format the offending names as 'a', 'b', ...
    dups <- glue::glue_collapse(glue::single_quote(dups), ", ")
    
    rlang::abort(glue::glue(
      "The supplied `formula` cannot have the same term ",
      "as both an outcome and a predictor. The following terms ",
      "appear on both sides of the formula: {dups}."
    ))
  }
  
  # return the input invisibly so the check can be used in a pipeline
  invisible(preprocessor)
}

validate_lhs_rhs_duplication(x$preprocessor)
#> Error: The supplied `formula` cannot have the same term as both an outcome and a predictor. The following terms appear on both sides of the formula: 'Sepal.Width'.
#> Backtrace:
#>     █
#>  1. └─global::validate_lhs_rhs_duplication(x$preprocessor)

Created on 2019-02-16 by the reprex package (v0.2.1.9000)
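
For comparison, the model.matrix()-style variant mentioned above (where log(Sepal.Width) does not count as a duplicate of Sepal.Width) might look roughly like the sketch below. It uses only base R and inspects the "factors" attribute of the terms object to see whether the response variable takes part in any right-hand-side term. The name find_response_on_rhs() is hypothetical and not part of hardhat. The sketch assumes a single response and a formula without `.`, and it reports the offending terms rather than dropping them.

```r
# Hypothetical helper (not part of hardhat): report right-hand-side terms that
# involve the response variable exactly as written on the left-hand side.
find_response_on_rhs <- function(formula) {
  stopifnot(inherits(formula, "formula"))

  tt <- stats::terms(formula)

  # index of the response among the formula's variables (0 if there is none)
  response <- attr(tt, "response")

  # variables-by-terms matrix; it is empty for intercept-only formulas
  factors <- attr(tt, "factors")

  if (response == 0L || length(factors) == 0L) {
    return(character())
  }

  # a non-zero entry in the response row means that term uses the response
  uses_response <- factors[response, ] > 0

  colnames(factors)[uses_response]
}

find_response_on_rhs(Sepal.Width ~ Sepal.Width)

find_response_on_rhs(Sepal.Width ~ log(Sepal.Width))
```

Because the comparison is made on the variables as they appear in the formula, a transformed copy of the outcome is treated as a different variable, which is the opposite trade-off from the name-based validate_lhs_rhs_duplication() above.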

@github-actions

github-actions bot commented Jul 1, 2021

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Jul 1, 2021