Investigate using vctrs for coercion #59

Closed
DavisVaughan opened this issue Feb 25, 2019 · 3 comments

@DavisVaughan
Member

Investigate using vctrs to coerce `new_data` columns to their correct types (for example, character columns could perhaps be coerced to factors automatically).
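
To make the idea concrete, here is a minimal base R sketch of the character-to-factor case. It is not hardhat's or vctrs' implementation: `coerce_to_training_levels()` is a hypothetical helper, and the example values are made up for illustration. It assumes the training-time factor levels are known and that values outside those levels should become `NA` with a warning.

```r
# Hypothetical helper (not part of hardhat or vctrs): coerce `x` to a factor
# that uses the levels seen at training time, warning about novel values.
coerce_to_training_levels <- function(x, training_levels, col = "x") {
  x_chr <- as.character(x)

  # Values present in the new data but unknown at training time
  novel <- setdiff(unique(x_chr[!is.na(x_chr)]), training_levels)

  if (length(novel) > 0) {
    warning(
      "Column `", col, "` has novel levels that will become NA: ",
      paste0("'", novel, "'", collapse = ", "),
      call. = FALSE
    )
  }

  # Values outside `training_levels` are converted to NA by factor()
  factor(x_chr, levels = training_levels)
}

training_levels <- c("Africa", "Americas", "Asia", "Europe", "Oceania")
new_continent <- c("Mars", "Asia", "Europe")

# Catch the warning so the example runs quietly, then report it as a message
result <- withCallingHandlers(
  coerce_to_training_levels(new_continent, training_levels, col = "continent"),
  warning = function(w) {
    message("Caught: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

levels(result)
```

The helper always returns a factor with exactly the training levels, so downstream code can rely on the type even when the new data arrives as character.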

@DavisVaughan
Member Author

suppressPackageStartupMessages({
  library(hardhat)
  library(vctrs)
  library(gapminder)
  library(rsample)
  library(dplyr)
})
#> Warning: package 'rsample' was built under R version 3.5.2
#> Warning: package 'dplyr' was built under R version 3.5.2

options(rlang__backtrace_on_error = "none")

split <- initial_split(gapminder)
gap_train <- training(split)
gap_test <- testing(split)

# 0 row slice (this is `info`!)
gap_train_0 <- vec_slice(gap_train, 0L)

# all of the df - df casts are due to df_col_cast()
# which is called from vec_cast()

# ///////////////////////

# continent is character not factor
gap_test2 <- mutate(gap_test, continent = as.character(continent))

# silently fixes that
common2 <- vec_cast(gap_test2, gap_train_0)

# recovered levels and type
levels(common2$continent)
#> [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"

# Takeaway) This is good

# ///////////////////////

# continent is character not factor, AND has too many implicit levels internally
gap_test2b <- mutate(gap_test, continent = as.character(continent))
gap_test2b$continent[1] <- "jupyter"

# noisily fixes that
common2b <- vec_cast(gap_test2b, gap_train_0)
#> Warning: Lossy cast from <character> to <factor<69262>>
#> Locations: 1

# recovered levels and type
levels(common2b$continent)
#> [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"

# Takeaway) This is good, but a better warning would be
# good. This is a TOO MANY LEVELS problem.

# ///////////////////////

# continent is character not factor, AND doesn't have enough levels internally
gap_test2c <- mutate(gap_test, continent = as.character(continent))
gap_test2c$continent <- gsub("Asia", NA_character_, gap_test2c$continent)

# silently fixes that
common2c <- vec_cast(gap_test2c, gap_train_0)

# recovered levels and type
levels(common2c$continent)
#> [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"

# Takeaway) This is good

# ///////////////////////

# continent is numeric
gap_test3 <- mutate(gap_test, continent = 1)

# not the best error message
common3 <- vec_cast(gap_test3, gap_train_0)
#> Error: Can't cast <double> to <factor<69262>>

# Takeaway) This is good, but we want a better error message.
# Ideally, producing that better message would be vctrs' job.

# ///////////////////////

# too many columns
gap_test4 <- mutate(gap_test, x = 4)

# this actually emits a decent warning message,
# but I think we would still rather have our own
vec_cast(gap_test4, gap_train)
#> Warning: Lossy cast from <tbl_df<
#>   country  : factor<bf6dc>
#>   continent: factor<69262>
#>   year     : integer
#>   lifeExp  : double
#>   pop      : integer
#>   gdpPercap: double
#>   x        : double
#> >> to <tbl_df<
#>   country  : factor<bf6dc>
#>   continent: factor<69262>
#>   year     : integer
#>   lifeExp  : double
#>   pop      : integer
#>   gdpPercap: double
#> >>
#> Locations: 
#> Dropped variables: `x`
#> # A tibble: 426 x 6
#>    country     continent  year lifeExp      pop gdpPercap
#>    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
#>  1 Afghanistan Asia       1952    28.8  8425333      779.
#>  2 Afghanistan Asia       1987    40.8 13867957      852.
#>  3 Albania     Europe     1987    72    3075321     3739.
#>  4 Albania     Europe     1992    71.6  3326498     2497.
#>  5 Albania     Europe     2002    75.7  3508512     4604.
#>  6 Algeria     Africa     1957    45.7 10270856     3014.
#>  7 Algeria     Africa     1977    58.0 17152804     4910.
#>  8 Algeria     Africa     1987    65.8 23254956     5681.
#>  9 Algeria     Africa     2007    72.3 33333216     6223.
#> 10 Angola      Africa     1962    34    4826015     4269.
#> # … with 416 more rows

# Takeaway) This is good, but we probably want a custom message.
# Because vctrs "types" its warnings, I think I can do this.

# ///////////////////////

# not enough columns
gap_test5 <- select(gap_test, -pop)

# silently adds column of NA values
vec_cast(gap_test5, gap_train_0)
#> # A tibble: 426 x 6
#>    country     continent  year lifeExp   pop gdpPercap
#>    <fct>       <fct>     <int>   <dbl> <int>     <dbl>
#>  1 Afghanistan Asia       1952    28.8    NA      779.
#>  2 Afghanistan Asia       1987    40.8    NA      852.
#>  3 Albania     Europe     1987    72      NA     3739.
#>  4 Albania     Europe     1992    71.6    NA     2497.
#>  5 Albania     Europe     2002    75.7    NA     4604.
#>  6 Algeria     Africa     1957    45.7    NA     3014.
#>  7 Algeria     Africa     1977    58.0    NA     4910.
#>  8 Algeria     Africa     1987    65.8    NA     5681.
#>  9 Algeria     Africa     2007    72.3    NA     6223.
#> 10 Angola      Africa     1962    34      NA     4269.
#> # … with 416 more rows

# Takeaway) We should let shrink() be noisy here and error

# ///////////////////////

# Too many levels, but the factor's actual values never use that level
gap_test6 <- mutate(gap_test, continent = factor(continent, c(levels(continent), "extra_level")))

# silently fixes that
common6 <- vec_cast(gap_test6, gap_train_0)

levels(common6$continent)
#> [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"

# Takeaway) Silence is okay here as the values aren't actually affected

# ///////////////////////

# Too many levels (in test), AND the factor's actual values use that level
# (here we drop Africa from the train data to demonstrate)
gap_train_0_6b <- mutate(gap_train_0, continent = factor(continent, levels(continent)[-1]))

# noisily drops the level and coerces problematic
# positions to NA
common6b <- vec_cast(gap_test, gap_train_0_6b)
#> Warning: Lossy cast from <factor<69262>> to <factor<e5252>>
#> Locations: 6, 7, 8, 9, 10, 11, 12, 13, 33, 34, 35, 42, 43, 47, 48, 49, 5...

# no africa
levels(common6b$continent)
#> [1] "Americas" "Asia"     "Europe"   "Oceania"

# Takeaway) Noisy is good, but I think I want a different warning.
# Again, capture the typed warning. The handler could check whether `x` is a factor;
# if so, the lossy cast is specific to having too many factor levels.

# ///////////////////////

# not enough levels (in test)
gap_test7 <- mutate(gap_test, continent = factor(continent, levels(continent)[-1]))

# silently fixes that
common7 <- vec_cast(gap_test7, gap_train_0)

levels(common7$continent)
#> [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"

Created on 2019-02-25 by the reprex package (v0.2.1.9000)
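
For the "not enough columns" case above, where the cast silently fills the missing column with `NA`, a stricter check could run before casting. The following is only a base R sketch under stated assumptions: `check_required_columns()` is a hypothetical helper, not the `shrink()` mentioned in the takeaway and not part of vctrs, and the small data frames stand in for the gapminder data.

```r
# Hypothetical helper: error if `new_data` lacks any column that the
# 0-row prototype (the training "info") requires.
check_required_columns <- function(new_data, prototype) {
  missing_cols <- setdiff(names(prototype), names(new_data))

  if (length(missing_cols) > 0) {
    stop(
      "`new_data` is missing required columns: ",
      paste0("`", missing_cols, "`", collapse = ", "),
      call. = FALSE
    )
  }

  invisible(new_data)
}

# Stand-in for a 0-row slice of the training data
prototype <- data.frame(year = integer(), pop = integer(), lifeExp = double())

# New data without `pop`
new_data <- data.frame(year = 1952L, lifeExp = 28.8)

# Capture the error message instead of halting the script
outcome <- tryCatch(
  check_required_columns(new_data, prototype),
  error = function(e) conditionMessage(e)
)

print(outcome)
```

Running a check like this first turns the silent `NA` fill into an explicit error that names the missing columns.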

@DavisVaughan
Member Author

Closed by #71

@github-actions

github-actions bot commented Jul 1, 2021

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Jul 1, 2021