
Invertible transformations #264

Closed
alexpghayes opened this issue Nov 25, 2018 · 13 comments

Comments

@alexpghayes
Contributor

In essence, recipes currently defines a forward map. You fit the forward map with prep() and apply it to data with bake(). The backwards direction should work the same way, i.e. you should just apply some generic like unbake() to the data.

But now there are these issues of fidelity in the recovery. Three possibilities:

  • all steps are invertible (e.g. centering, scaling, etc.) and allow perfect recovery
  • the steps are pseudo-invertible (e.g. PCA where you drop terms)
  • the steps just aren't invertible (t-SNE)

The last two should definitely error or at least warn when you try to invert them. This would also require modifying all the prep() methods to learn the backwards map at prep() time (i.e. when you prep() a PCA step, if you support inversion, you now need to save the information associated with the back transform, which is distinct from the information associated with the forward transform).
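A very rough sketch of what that could look like (unbake() and all of the method bodies below are hypothetical, not existing recipes API; it assumes each trained step keeps whatever it needs for the back transform, e.g. the training means for a centering step):

# Hypothetical sketch only: an unbake() generic that walks the prepped
# steps in reverse and applies each step's stored back-transform.
unbake <- function(object, new_data, ...) {
  UseMethod("unbake")
}

unbake.recipe <- function(object, new_data, ...) {
  # undo the trained steps in reverse order
  for (step in rev(object$steps)) {
    new_data <- unbake(step, new_data)
  }
  new_data
}

# default: refuse to invert steps that don't support it
unbake.step <- function(object, new_data, ...) {
  stop("This type of step cannot be inverted.", call. = FALSE)
}

unbake.step_center <- function(object, new_data, ...) {
  # the forward map subtracted the training means saved at prep() time,
  # so the backward map just adds them back
  for (col in names(object$means)) {
    new_data[[col]] <- new_data[[col]] + object$means[[col]]
  }
  new_data
}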

Side-note: these generics have nice analogues in the sklearn world:

  • prep() is fit()
  • juice() is fit_transform()
  • bake() is transform()
  • unbake() would be untransform(), although I'm not sure how many inverse transformations (if any) are supported in sklearn
@topepo
Member

topepo commented Nov 26, 2018

I don't know that unbake is the right approach (although I like the idea). You might want to undo only certain transformations, and those might be included in the original recipe.

  • Suppose there is an imputation method that needs centered and scaled data. You could use those steps before the imputation method but might want the data back in original units before running the model. The uncenter and unscale steps would have to be in the recipe. The tokenize/untokenize discussion in textrecipes#17 is another example.

  • You might not want to undo everything that can be back transformed.

My thought is to have a specific step type that can be used to undo a previous step using the id field that we just added. Something like:

recipe(y ~ ., data = dat) %>%
  step_center(a, b, id = "center a and b") %>%
  step_scale(a, b, id = "scale a and b") %>%
  step_impute_method(a, b) %>%
  step_undo(id = "scale a and b") %>%
  step_undo(id = "center a and b") %>%
  step_blab_blah_blah() 

My thinking is that we create S3 methods for steps that can be inverted using something along the lines of

invert <- function(x, ...) {
  UseMethod("invert")
}

# default: most steps can't be undone
invert.step <- function(x, ...) {
  stop("This type of step cannot be inverted.", call. = FALSE)
}

invert.step_sqrt <- function(x, ...) {
  # stuff
}
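
For a step like step_center the inverse could be as simple as flipping the sign of the stored means, so that applying the step again shifts the data back (a minimal sketch, assuming the trained step keeps its means in x$means):

invert.step_center <- function(x, ...) {
  # centering subtracts the training means; subtracting the negated
  # means is the same as adding them back
  x$means <- -x$means
  x
}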

if you support inversion, you now need to save the information associated with the back transform, which is distinct from the information associated with the forward transform.

Can you give an example where those two pieces of information would be different?

@alexpghayes
Contributor Author

I'm on board. Are you thinking of having a single recipe that different generics can be applied to, or a forward recipe and an undo recipe?

Can you give an example where those two pieces of information would be different?

I thought this was the case for PCA but have since realized this was a brain fart.

@topepo
Member

topepo commented Nov 26, 2018

I'm on board. Are you thinking of having a single recipe that different generics can be applied to, or a forward recipe and an undo recipe?

It would be done at the step level (I think) with the usual recipe/prep/bake workflow.

@DavisVaughan
Member

I think you need a separate undo recipe if this is also trying to solve the problem of back-transforming predictions.

suppressPackageStartupMessages({
  library(AmesHousing)
  library(recipes)
  library(rsample)
  library(parsnip)
})
ames <- make_ames()

split <- initial_split(ames)
train <- training(split)
test  <- testing(split)

rec <- recipe(Sale_Price ~ Longitude, data = train) %>%
  step_log(Sale_Price, id = "log")

p_rec <- prep(rec, training = train, retain = TRUE)

model <- linear_reg() %>%
  set_engine("lm") %>%
  fit(Sale_Price ~ Longitude, data = juice(p_rec))

pred <- predict(model, new_data = bake(p_rec, test))

pred
#> # A tibble: 732 x 1
#>    .pred
#>    <dbl>
#>  1  11.9
#>  2  11.9
#>  3  12.0
#>  4  12.0
#>  5  12.0
#>  6  12.0
#>  7  12.0
#>  8  11.9
#>  9  11.9
#> 10  11.9
#> # ... with 722 more rows

# need to undo log transform on the predictions

Once you have pred, you need some way to only undo the steps applied to Sale_Price. If I were to add a step_undo() onto the current recipe and then apply it to pred, it wouldn't work because it would first log then "unlog" (besides the fact that the column names are wrong).
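
Right now you'd have to undo it by hand after predicting, e.g. (a manual sketch continuing the reprex above):

library(dplyr)

# manually invert the "log" step on the prediction column
pred_original_scale <- pred %>%
  mutate(.pred = exp(.pred))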

@alexpghayes
Contributor Author

I find that to be the most intuitive as well.

@ryankarel

Has there been any progress on this issue?

@topepo
Member

topepo commented Aug 13, 2019

A little. We have step ID values that we can refer back to. Otherwise no. There probably won't be for a few months.

@smingerson
Contributor

@topepo, would you review a PR which began to implement step_undo() and invert.*() as described in your rough sketch?

@piinghel

piinghel commented Mar 5, 2020

Hi, has this already been implemented? This seems like a crucial step in every modelling framework...

@mjkeefe

mjkeefe commented Jun 2, 2020

Will this include things like back-transforming an outcome variable if it was log-transformed? This would seem important for assessing models using yardstick, right? You would want metrics in the original scale of the outcome, not the log-transformed scale.
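
For example, with a log-transformed outcome the current workaround is to back-transform the predictions by hand and score against the outcome in its original units (a rough sketch reusing the objects from Davis's reprex above):

library(dplyr)
library(yardstick)

# back-transform predictions, then score against the raw outcome
results <- test %>%
  select(Sale_Price) %>%
  bind_cols(predict(model, new_data = bake(p_rec, test))) %>%
  mutate(.pred = exp(.pred))

rmse(results, truth = Sale_Price, estimate = .pred)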

@alexpghayes
Contributor Author

alexpghayes commented Jun 2, 2020 via email

@topepo
Member

topepo commented Jun 5, 2020

This will eventually be a post-processor in workflows. You shouldn't do this in recipes since the full recipe results are given to the model. Untransforming needs to happen after the model is fit.

@topepo topepo closed this as completed Jun 5, 2020
@github-actions

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Feb 21, 2021