caret::dummyVars reoccurring pattern in column name causes errors in dummy variable names #390

Closed
JanLauGe opened this issue Mar 10, 2016 · 5 comments

Comments

@JanLauGe

I noticed that dummyVars produces erroneous variable names when creating (predicting) dummy variables if one column name in the original dataset matches the start of a later column name. In these cases, the new dummy variable names are split in the wrong place: the remainder of the longer column name is attached to the factor level name.
As far as I can tell, the function still returns the right values, just under confusing names.

Minimal dataset:

data <- data.frame('id' = seq(1,30,1),
                   'fooFactor' = factor(c(rep(1,10), rep(2,10), rep(3,10))),
                   'fooFactorBar' = factor(c(rep(4,10), rep(5,10), rep(6,10))),
                   'fooBarFactor' = factor(c(rep(7,10), rep(8,10), rep(9,10))))

Minimal, runnable code:

library(caret)
library(dplyr)
#make some data
data <- data.frame('id' = seq(1,30,1),
                   'fooFactor' = factor(c(rep(1,10), rep(2,10), rep(3,10))),
                   'fooFactorBar' = factor(c(rep(4,10), rep(5,10), rep(6,10))),
                   'fooBarFactor' = factor(c(rep(7,10), rep(8,10), rep(9,10))))
#dummify the data
dummies <- dummyVars(formula = id ~., 
                     data = data,
                     sep = '-') %>%
  predict(data)

#check the names
colnames(dummies)
#will return:
# [1] "fooFactor-1"    "fooFactor-2"    "fooFactor-3"    "fooFactor-Bar4" "fooFactor-Bar5"
# [6] "fooFactor-Bar6" "fooBarFactor-7" "fooBarFactor-8" "fooBarFactor-9"

#notice how 'fooFactor' and 'fooBarFactor' are both fine,
#but 'fooFactorBar' gets turned into 'fooFactor-Bar4' etc.

The same happens with 'levelsOnly = TRUE', by the way. With this option, the dummy variable names become 1, 2, 3, Bar4, Bar5, Bar6, 7, 8, 9.
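
One way to picture the mis-split reported above is a prefix lookup that stops at the first variable name that matches. The sketch below is an illustration only, not caret's actual implementation: `is_prefix`, `naive_split` and `longest_split` are hypothetical helpers, and the fused names in `raw` assume the variable name and the level are simply concatenated before the separator is added.

```r
# Hypothetical illustration; this is NOT caret's internal code.
vars <- c("fooFactor", "fooFactorBar", "fooBarFactor")
# Assumed fused names (variable name immediately followed by the level)
raw <- c("fooFactor1", "fooFactorBar4", "fooBarFactor7")

# TRUE for every variable name that is a prefix of `col`
is_prefix <- function(col, vars) substring(col, 1, nchar(vars)) == vars

# Naive: use the first variable whose name is a prefix of the column
naive_split <- function(col, vars, sep = "-") {
  hit <- vars[is_prefix(col, vars)][1]
  paste0(hit, sep, substring(col, nchar(hit) + 1))
}

# Less fragile: prefer the longest matching variable name
longest_split <- function(col, vars, sep = "-") {
  hits <- vars[is_prefix(col, vars)]
  hit <- hits[which.max(nchar(hits))]
  paste0(hit, sep, substring(col, nchar(hit) + 1))
}

naive <- vapply(raw, naive_split, character(1), vars = vars, USE.NAMES = FALSE)
longest <- vapply(raw, longest_split, character(1), vars = vars, USE.NAMES = FALSE)

print(naive)    # "fooFactorBar4" is attributed to "fooFactor"
print(longest)  # "fooFactorBar4" is attributed to "fooFactorBar"
```

Preferring the longest prefix is enough for this dataset, but it can still guess wrong when a level name itself begins with the tail of another variable name, so it should be read as a demonstration of the failure mode rather than a general fix.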

This is my first bug report on GitHub. Please point out anything that is missing or could be done better. Thanks for all the effort that went into this fantastic and super helpful package!

Session Info:

R version 3.2.3 (2015-12-10)
Platform: x86_64-apple-darwin14.0.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] dplyr_0.4.3 caret_6.0-64 ggplot2_2.0.0 lattice_0.20-33 plyr_1.8.3

loaded via a namespace (and not attached):
[1] Rcpp_0.12.3 magrittr_1.5 splines_3.2.3 MASS_7.3-45 munsell_0.4.2
[6] colorspace_1.2-6 R6_2.1.2 foreach_1.4.3 minqa_1.2.4 stringr_1.0.0
[11] car_2.1-1 tools_3.2.3 nnet_7.3-11 pbkrtest_0.4-6 parallel_3.2.3
[16] grid_3.2.3 gtable_0.1.2 nlme_3.1-124 mgcv_1.8-11 quantreg_5.19
[21] DBI_0.3.1 MatrixModels_0.4-1 iterators_1.0.8 assertthat_0.1 lme4_1.1-10
[26] Matrix_1.2-3 nloptr_1.0.4 reshape2_1.4.1 codetools_0.2-14 stringi_1.0-1
[31] scales_0.3.0 stats4_3.2.3 SparseM_1.7

topepo added a commit that referenced this issue Apr 7, 2016
@topepo
Owner

topepo commented Apr 7, 2016

These changes should fix the issue. Thanks.

@JanLauGe
Author

JanLauGe commented Apr 7, 2016

Thank you :)

@topepo
Owner

topepo commented Apr 7, 2016

No problem...

@topepo topepo closed this as completed Apr 7, 2016
@adimajo
Contributor

adimajo commented Feb 25, 2019

I ran into this issue with the latest version of caret (6.0-81):

library(caret)
15 columns of categorical features whose levels range from "1" to "15":
data = data.frame(matrix(rep(as.factor(sample.int(15, size = 100, replace = TRUE, prob = rep(1/15,15))), 15), ncol = 15))
Learning the mapping:
essai_dummyVars = caret::dummyVars(stats::as.formula(paste0("~ ", colnames(data), collapse = "+")), data)
Predicting:
essai_predict = predict(essai_dummyVars, data)
colnames(essai_predict)

Should return:
"X1.1" [...] "X1.15" "X2.1" [...] "X2.15" [...] "X15.1" [...] "X15.15"

Returns:
[1] "X1.1" "X1.10" "X1.1.1" "X1.1.2" "X1.1.3" "X1.1.4" "X1.1.5" "X1.2" "X1.3"
[10] "X1.4" "X1.5" "X1.6" "X1.7" "X1.8" "X1.9" "X2.1" "X2.10" "X2.11"
[19] "X2.12" "X2.13" "X2.14" "X2.15" "X2.2" "X2.3" "X2.4" "X2.5" "X2.6"
[28] "X2.7" "X2.8" "X2.9" "X3.1" "X3.10" "X3.11" "X3.12" "X3.13" "X3.14"
[37] "X3.15" "X3.2" "X3.3" "X3.4" "X3.5" "X3.6" "X3.7" "X3.8" "X3.9"
[46] "X4.1" "X4.10" "X4.11" "X4.12" "X4.13" "X4.14" "X4.15" "X4.2" "X4.3"
[55] "X4.4" "X4.5" "X4.6" "X4.7" "X4.8" "X4.9" "X5.1" "X5.10" "X5.11"
[64] "X5.12" "X5.13" "X5.14" "X5.15" "X5.2" "X5.3" "X5.4" "X5.5" "X5.6"
[73] "X5.7" "X5.8" "X5.9" "X6.1" "X6.10" "X6.11" "X6.12" "X6.13" "X6.14"
[82] "X6.15" "X6.2" "X6.3" "X6.4" "X6.5" "X6.6" "X6.7" "X6.8" "X6.9"
[91] "X7.1" "X7.10" "X7.11" "X7.12" "X7.13" "X7.14" "X7.15" "X7.2" "X7.3"
[100] "X7.4" "X7.5" "X7.6" "X7.7" "X7.8" "X7.9" "X8.1" "X8.10" "X8.11"
[109] "X8.12" "X8.13" "X8.14" "X8.15" "X8.2" "X8.3" "X8.4" "X8.5" "X8.6"
[118] "X8.7" "X8.8" "X8.9" "X9.1" "X9.10" "X9.11" "X9.12" "X9.13" "X9.14"
[127] "X9.15" "X9.2" "X9.3" "X9.4" "X9.5" "X9.6" "X9.7" "X9.8" "X9.9"
[136] "X10.1" "X10.10" "X10.11" "X10.12" "X10.13" "X10.14" "X10.15" "X10.2" "X10.3"
[145] "X10.4" "X10.5" "X10.6" "X10.7" "X10.8" "X10.9" "X1.1.1" "X1.1.10" "X1.1.11"
[154] "X1.1.12" "X1.1.13" "X1.1.14" "X1.1.15" "X1.1.2" "X1.1.3" "X1.1.4" "X1.1.5" "X1.1.6"
[163] "X1.1.7" "X1.1.8" "X1.1.9" "X1.2.1" "X1.2.10" "X1.2.11" "X1.2.12" "X1.2.13" "X1.2.14"
[172] "X1.2.15" "X1.2.2" "X1.2.3" "X1.2.4" "X1.2.5" "X1.2.6" "X1.2.7" "X1.2.8" "X1.2.9"
[181] "X1.3.1" "X1.3.10" "X1.3.11" "X1.3.12" "X1.3.13" "X1.3.14" "X1.3.15" "X1.3.2" "X1.3.3"
[190] "X1.3.4" "X1.3.5" "X1.3.6" "X1.3.7" "X1.3.8" "X1.3.9" "X1.4.1" "X1.4.10" "X1.4.11"
[199] "X1.4.12" "X1.4.13" "X1.4.14" "X1.4.15" "X1.4.2" "X1.4.3" "X1.4.4" "X1.4.5" "X1.4.6"
[208] "X1.4.7" "X1.4.8" "X1.4.9" "X1.5.1" "X1.5.10" "X1.5.11" "X1.5.12" "X1.5.13" "X1.5.14"
[217] "X1.5.15" "X1.5.2" "X1.5.3" "X1.5.4" "X1.5.5" "X1.5.6" "X1.5.7" "X1.5.8" "X1.5.9"

@adimajo
Contributor

adimajo commented Feb 25, 2019

I reran @JanLauGe's MWE and it is indeed fixed.
I guess my issue has something to do with integers in the column names.
Shall I open another issue?

EDIT: the function stri_replace_last from the stringi package might fix the problem with the gsub call on line 142 of fix b4f8a87, but it is probably overkill.
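
A possible reason why integer-like column names and levels are harder to handle is that concatenating them without a separator is ambiguous. The sketch below only demonstrates that ambiguity in base R and makes no claim about how caret builds its names internally; `vars`, `lvls`, `fused` and `separated` are illustrative names, and the "X1" to "X15" columns with levels "1" to "15" mirror the example above.

```r
# Illustration of the naming ambiguity only; NOT caret's internal code.
vars <- paste0("X", 1:15)      # column names X1 ... X15
lvls <- as.character(1:15)     # factor levels "1" ... "15"

# Every (variable, level) pair
pairs <- expand.grid(level = lvls, var = vars, stringsAsFactors = FALSE)

# Fusing name and level with no separator: "X1" + "11" and "X11" + "1"
# both give "X111", so the fused string cannot be split back reliably.
fused <- paste0(pairs$var, pairs$level)
print(pairs[fused == "X111", c("var", "level")])

# Building the final name directly from the known pair keeps it unique.
separated <- paste(pairs$var, pairs$level, sep = ".")
print(anyDuplicated(fused) > 0)       # TRUE: collisions exist
print(anyDuplicated(separated) == 0)  # TRUE: no collisions
```

If the names are assembled from the known variable/level pairs like this, no pattern replacement on the fused string is needed at all, which avoids the question of whether to replace the first, last, or every match.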
