
Invertible transformations #264

Closed
alexpghayes opened this issue Nov 25, 2018 · 13 comments

Comments

@alexpghayes
Contributor

In essence, recipes currently defines a forward map. You fit the forward map with prep() and apply it to data with bake(). The backwards direction should work the same way, i.e. you should just apply some generic like unbake() to the data.

But now there are these issues of fidelity in the recovery. Three possibilities:

  • all steps are invertible (e.g. centering, scaling, etc.) and allow perfect recovery
  • the steps are pseudo-invertible (e.g. PCA where you drop terms)
  • the steps just aren't invertible (t-SNE)

The last two should definitely error or at least warn when you try to invert them. This would also require modifying all the prep() methods to learn the backwards map at prep() time (i.e. when you prep() a PCA step, if you support inversion, you now need to save the information associated with the back transform, which is distinct from the information associated with the forward transform).
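A very rough sketch of what that could look like (unbake() and all of the method bodies below are hypothetical, not existing recipes API; it assumes each trained step keeps whatever it needs for the back transform, e.g. the training means for a centering step):

# Hypothetical sketch only: an unbake() generic that walks the prepped
# steps in reverse and applies each step's stored back-transform.
unbake <- function(object, new_data, ...) {
  UseMethod("unbake")
}

unbake.recipe <- function(object, new_data, ...) {
  # undo the trained steps in reverse order
  for (step in rev(object$steps)) {
    new_data <- unbake(step, new_data)
  }
  new_data
}

# default: refuse to invert steps that don't support it
unbake.step <- function(object, new_data, ...) {
  stop("This type of step cannot be inverted.", call. = FALSE)
}

unbake.step_center <- function(object, new_data, ...) {
  # the forward map subtracted the training means saved at prep() time,
  # so the backward map just adds them back
  for (col in names(object$means)) {
    new_data[[col]] <- new_data[[col]] + object$means[[col]]
  }
  new_data
}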

Side-note: these generics have nice analogues in the sklearn world:

  • prep() is fit()
  • juice() is fit_transform()
  • bake() is transform()
  • unbake() would be untransform(), although I'm not sure how many inverse transformations (if any) are supported in sklearn
@topepo
Member

topepo commented Nov 26, 2018

I don't know that unbake is the right approach (although I like the idea). You might want to undo only certain transformations, and those might be included in the original recipe.

  • Suppose there is an imputation method that needs centered and scaled data. You could use those steps before the imputation method but might want the data back in original units before running the model. The uncenter and unscale steps would have to be in the recipe. The tokenize/untokenize discussion in textrecipes#17 is another example.

  • You might not want to undo everything that can be back transformed.

My thought is to have a specific step type that can be used to undo a previous step using the id field that we just added. Something like:

recipe(y ~ ., data = dat) %>%
  step_center(a, b, id = "center a and b") %>%
  step_scale(a, b, id = "scale a and b") %>%
  step_impute_method(a, b) %>%
  step_undo(id = "scale a and b") %>%
  step_undo(id = "center a and b") %>%
  step_blab_blah_blah() 

My thinking is that we create S3 methods for steps that can be inverted using something along the lines of

invert <- function(x, ...) {
  UseMethod("invert")
}

# default: most steps can't be undone
invert.step <- function(x, ...) {
  stop("This type of step cannot be inverted.", call. = FALSE)
}

invert.step_sqrt <- function(x, ...) {
  # stuff
}
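
For a step like step_center the inverse could be as simple as flipping the sign of the stored means, so that applying the step again shifts the data back (a minimal sketch, assuming the trained step keeps its means in x$means):

invert.step_center <- function(x, ...) {
  # centering subtracts the training means; subtracting the negated
  # means is the same as adding them back
  x$means <- -x$means
  x
}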

if you support inversion, you now need to save the information associated with the back transform, which is distinct from the information associated with the forward transform.

Can you give an example where those two pieces of information would be different?

@alexpghayes
Contributor Author

I'm on board. Are you thinking of having a single recipe that different generics can be applied to, or a forward recipe and an undo recipe?

Can you give an example where those two pieces of information would be different?

I thought this was the case for PCA but have since realized this was a brain fart.

@topepo
Member

topepo commented Nov 26, 2018

I'm on board. Are you thinking of having a single recipe that different generics can be applied to, or a forward recipe and an undo recipe?

It would be done at the step level (I think) with the usual recipe/prep/bake workflow.

@DavisVaughan
Member

I think you need a separate undo recipe if this is also trying to solve the problem of back-transforming predictions.

suppressPackageStartupMessages({
  library(AmesHousing)
  library(recipes)
  library(rsample)
  library(parsnip)
})
ames <- make_ames()

split <- initial_split(ames)
train <- training(split)
test  <- testing(split)

rec <- recipe(Sale_Price ~ Longitude, data = train) %>%
  step_log(Sale_Price, id = "log")

p_rec <- prep(rec, training = train, retain = TRUE)

model <- linear_reg() %>%
  set_engine("lm") %>%
  fit(Sale_Price ~ Longitude, data = juice(p_rec))

pred <- predict(model, new_data = bake(p_rec, test))

pred
#> # A tibble: 732 x 1
#>    .pred
#>    <dbl>
#>  1  11.9
#>  2  11.9
#>  3  12.0
#>  4  12.0
#>  5  12.0
#>  6  12.0
#>  7  12.0
#>  8  11.9
#>  9  11.9
#> 10  11.9
#> # ... with 722 more rows

# need to undo log transform on the predictions

Once you have pred, you need some way to only undo the steps applied to Sale_Price. If I were to add a step_undo() onto the current recipe and then apply it to pred, it wouldn't work because it would first log then "unlog" (besides the fact that the column names are wrong).
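
Right now you'd have to undo it by hand after predicting, e.g. (a manual sketch continuing the reprex above):

library(dplyr)

# manually invert the "log" step on the prediction column
pred_original_scale <- pred %>%
  mutate(.pred = exp(.pred))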

@alexpghayes
Contributor Author

I find that to be the most intuitive as well.

@ryankarel

Has there been any progress on this issue?

@topepo
Member

topepo commented Aug 13, 2019

A little. We have step ID values that we can refer back to. Otherwise no. There probably won't be for a few months.

@smingerson
Contributor

@topepo, would you review a PR which began to implement step_undo() and invert.*() as described in your rough sketch?

@piinghel

piinghel commented Mar 5, 2020

Hi, has this already been implemented? This seems like a crucial step in every modelling framework...

@mjkeefe

mjkeefe commented Jun 2, 2020

Will this include things like back-transforming an outcome variable if it was log-transformed? This would seem important for assessing models using yardstick, right? You would want metrics in the original scale of the outcome, not the log-transformed scale.
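
For example, with a log-transformed outcome the current workaround is to back-transform the predictions by hand and score against the outcome in its original units (a rough sketch reusing the objects from Davis's reprex above):

library(dplyr)
library(yardstick)

# back-transform predictions, then score against the raw outcome
results <- test %>%
  select(Sale_Price) %>%
  bind_cols(predict(model, new_data = bake(p_rec, test))) %>%
  mutate(.pred = exp(.pred))

rmse(results, truth = Sale_Price, estimate = .pred)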

@alexpghayes
Contributor Author

alexpghayes commented Jun 2, 2020 via email

@topepo
Member

topepo commented Jun 5, 2020

This will eventually be a post-processor in workflows. You shouldn't do this in recipes since the full recipe results are given to the model. Untransforming needs to happen after the model is fit.

@topepo topepo closed this as completed Jun 5, 2020
@github-actions

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Feb 21, 2021