caret::dummyVars reoccurring pattern in column name causes errors in dummy variable names #390

JanLauGe · 2016-03-10T12:12:27Z

I noticed that dummyVars produces erroneous variable names when creating (predicting) dummy variables if one column name in the original dataset matches the start of a subsequent column name. In these cases the new dummy variable names are split in the wrong place: the remainder of the longer column name ends up attached to the factor level instead of the column name.
As far as I can tell, the function still returns the correct values, just under confusing names.

Minimal dataset:

data <- data.frame('id' = seq(1,30,1),
                   'fooFactor' = factor(c(rep(1,10), rep(2,10), rep(3,10))),
                   'fooFactorBar' = factor(c(rep(4,10), rep(5,10), rep(6,10))),
                   'fooBarFactor' = factor(c(rep(7,10), rep(8,10), rep(9,10))))

Minimal, runnable code:

library(caret)
library(dplyr)
#make some data
data <- data.frame('id' = seq(1,30,1),
                   'fooFactor' = factor(c(rep(1,10), rep(2,10), rep(3,10))),
                   'fooFactorBar' = factor(c(rep(4,10), rep(5,10), rep(6,10))),
                   'fooBarFactor' = factor(c(rep(7,10), rep(8,10), rep(9,10))))
#dummify the data
dummies <- dummyVars(formula = id ~., 
                     data = data,
                     sep = '-') %>%
  predict(data)

#check the names
colnames(dummies)
#will return:
# [1] "fooFactor-1"    "fooFactor-2"    "fooFactor-3"    "fooFactor-Bar4" "fooFactor-Bar5"
# [6] "fooFactor-Bar6" "fooBarFactor-7" "fooBarFactor-8" "fooBarFactor-9"

#notice how 'fooFactor' and 'fooBarFactor' are both fine,
#but 'fooFactorBar' gets turned into 'fooFactor-Bar4' etc.

The same is true when using 'levelsOnly = TRUE' by the way. With this option, dummy variable names become 1, 2, 3, Bar4, Bar5, Bar6, 7, 8, 9.
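
The output above is consistent with a name being split at the first column name that matches its prefix. The following is a minimal sketch of that failure mode, not caret's actual implementation: `naive_split` and `levels_by_col` are hypothetical names introduced only for illustration, and the raw names are assumed to be the column name directly followed by the level.

```r
# Hypothetical illustration only; this is not caret's internal code.
cols <- c("fooFactor", "fooFactorBar", "fooBarFactor")
# Assumed raw names: column name immediately followed by the factor level.
raw  <- c("fooFactor1", "fooFactorBar4", "fooBarFactor7")

# Naive approach: insert the separator after the first column name
# that matches the start of the raw name.
naive_split <- function(x, cols, sep = "-") {
  for (col in cols) {
    if (startsWith(x, col)) {
      return(paste0(col, sep, substring(x, nchar(col) + 1)))
    }
  }
  x
}
naive <- vapply(raw, naive_split, character(1), cols = cols, USE.NAMES = FALSE)
print(naive)   # "fooFactor-1" "fooFactor-Bar4" "fooBarFactor-7"

# Alternative: build each name from the known (column, level) pair,
# so no prefix matching is needed at all.
levels_by_col <- list(fooFactor = "1", fooFactorBar = "4", fooBarFactor = "7")
robust <- unlist(lapply(names(levels_by_col), function(col) {
  paste(col, levels_by_col[[col]], sep = "-")
}))
print(robust)  # "fooFactor-1" "fooFactorBar-4" "fooBarFactor-7"
```

Building the names from the column and level directly avoids the ambiguity, because the split point is never inferred from the concatenated string.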

This is my first bug report on github. Please point out anything that is missing or should be done better. Thanks for all the effort that went into this fantastic and super helpful package!

Session Info:

R version 3.2.3 (2015-12-10)
Platform: x86_64-apple-darwin14.0.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] dplyr_0.4.3 caret_6.0-64 ggplot2_2.0.0 lattice_0.20-33 plyr_1.8.3

loaded via a namespace (and not attached):
[1] Rcpp_0.12.3 magrittr_1.5 splines_3.2.3 MASS_7.3-45 munsell_0.4.2
[6] colorspace_1.2-6 R6_2.1.2 foreach_1.4.3 minqa_1.2.4 stringr_1.0.0
[11] car_2.1-1 tools_3.2.3 nnet_7.3-11 pbkrtest_0.4-6 parallel_3.2.3
[16] grid_3.2.3 gtable_0.1.2 nlme_3.1-124 mgcv_1.8-11 quantreg_5.19
[21] DBI_0.3.1 MatrixModels_0.4-1 iterators_1.0.8 assertthat_0.1 lme4_1.1-10
[26] Matrix_1.2-3 nloptr_1.0.4 reshape2_1.4.1 codetools_0.2-14 stringi_1.0-1
[31] scales_0.3.0 stats4_3.2.3 SparseM_1.7

topepo · 2016-04-07T10:50:02Z

These changes should fix the issue. Thanks.

JanLauGe · 2016-04-07T11:20:45Z

Thank you :)

topepo · 2016-04-07T15:18:37Z

No problem...

adimajo · 2019-02-25T13:36:16Z

I ran into this issue with the latest version of caret (6.0-81):

library(caret)
# 15 columns of categorical features whose levels range from "1" to "15":
data = data.frame(matrix(rep(as.factor(sample.int(15, size = 100, replace = TRUE, prob = rep(1/15,15))), 15), ncol = 15))
# Learning the mapping:
essai_dummyVars = caret::dummyVars(stats::as.formula(paste0("~ ", colnames(data), collapse = "+")), data)
# Predicting:
essai_predict = predict(essai_dummyVars, data)
colnames(essai_predict)

Should return:
"X1.1" [...] "X1.15" "X2.1" [...] "X2.15" [...] "X15.1" [...] "X15.15"

Returns:
[1] "X1.1" "X1.10" "X1.1.1" "X1.1.2" "X1.1.3" "X1.1.4" "X1.1.5" "X1.2" "X1.3"
[10] "X1.4" "X1.5" "X1.6" "X1.7" "X1.8" "X1.9" "X2.1" "X2.10" "X2.11"
[19] "X2.12" "X2.13" "X2.14" "X2.15" "X2.2" "X2.3" "X2.4" "X2.5" "X2.6"
[28] "X2.7" "X2.8" "X2.9" "X3.1" "X3.10" "X3.11" "X3.12" "X3.13" "X3.14"
[37] "X3.15" "X3.2" "X3.3" "X3.4" "X3.5" "X3.6" "X3.7" "X3.8" "X3.9"
[46] "X4.1" "X4.10" "X4.11" "X4.12" "X4.13" "X4.14" "X4.15" "X4.2" "X4.3"
[55] "X4.4" "X4.5" "X4.6" "X4.7" "X4.8" "X4.9" "X5.1" "X5.10" "X5.11"
[64] "X5.12" "X5.13" "X5.14" "X5.15" "X5.2" "X5.3" "X5.4" "X5.5" "X5.6"
[73] "X5.7" "X5.8" "X5.9" "X6.1" "X6.10" "X6.11" "X6.12" "X6.13" "X6.14"
[82] "X6.15" "X6.2" "X6.3" "X6.4" "X6.5" "X6.6" "X6.7" "X6.8" "X6.9"
[91] "X7.1" "X7.10" "X7.11" "X7.12" "X7.13" "X7.14" "X7.15" "X7.2" "X7.3"
[100] "X7.4" "X7.5" "X7.6" "X7.7" "X7.8" "X7.9" "X8.1" "X8.10" "X8.11"
[109] "X8.12" "X8.13" "X8.14" "X8.15" "X8.2" "X8.3" "X8.4" "X8.5" "X8.6"
[118] "X8.7" "X8.8" "X8.9" "X9.1" "X9.10" "X9.11" "X9.12" "X9.13" "X9.14"
[127] "X9.15" "X9.2" "X9.3" "X9.4" "X9.5" "X9.6" "X9.7" "X9.8" "X9.9"
[136] "X10.1" "X10.10" "X10.11" "X10.12" "X10.13" "X10.14" "X10.15" "X10.2" "X10.3"
[145] "X10.4" "X10.5" "X10.6" "X10.7" "X10.8" "X10.9" "X1.1.1" "X1.1.10" "X1.1.11"
[154] "X1.1.12" "X1.1.13" "X1.1.14" "X1.1.15" "X1.1.2" "X1.1.3" "X1.1.4" "X1.1.5" "X1.1.6"
[163] "X1.1.7" "X1.1.8" "X1.1.9" "X1.2.1" "X1.2.10" "X1.2.11" "X1.2.12" "X1.2.13" "X1.2.14"
[172] "X1.2.15" "X1.2.2" "X1.2.3" "X1.2.4" "X1.2.5" "X1.2.6" "X1.2.7" "X1.2.8" "X1.2.9"
[181] "X1.3.1" "X1.3.10" "X1.3.11" "X1.3.12" "X1.3.13" "X1.3.14" "X1.3.15" "X1.3.2" "X1.3.3"
[190] "X1.3.4" "X1.3.5" "X1.3.6" "X1.3.7" "X1.3.8" "X1.3.9" "X1.4.1" "X1.4.10" "X1.4.11"
[199] "X1.4.12" "X1.4.13" "X1.4.14" "X1.4.15" "X1.4.2" "X1.4.3" "X1.4.4" "X1.4.5" "X1.4.6"
[208] "X1.4.7" "X1.4.8" "X1.4.9" "X1.5.1" "X1.5.10" "X1.5.11" "X1.5.12" "X1.5.13" "X1.5.14"
[217] "X1.5.15" "X1.5.2" "X1.5.3" "X1.5.4" "X1.5.5" "X1.5.6" "X1.5.7" "X1.5.8" "X1.5.9"

adimajo · 2019-02-25T13:49:21Z

I reran @JanLauGe's MWE and it is indeed fixed.
I guess my issue has something to do with integers in the column names.
Shall I open another issue?

EDIT: the function stri_replace_last from the stringi package might fix the problem with the gsub command on line 142 of the fix in b4f8a87, but it is probably overkill.
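
To illustrate why replacing every occurrence can go wrong when column names and levels share digits, here is a minimal base R sketch. It does not reproduce the gsub call in commit b4f8a87; the variables `name`, `level` and `sep` are assumptions chosen for the example, and the goal is assumed to be inserting the separator just before the level at the end of the name.

```r
# Hypothetical example; it does not reproduce the gsub call in commit b4f8a87.
name  <- "X11"  # assumed: column "X1" followed by level "1"
level <- "1"
sep   <- "."

# gsub() replaces every occurrence of the level, including the digit
# that belongs to the column name.
all_occurrences <- gsub(level, paste0(sep, level), name, fixed = TRUE)
print(all_occurrences)  # "X.1.1"

# Anchoring the pattern at the end of the string replaces only the
# trailing level. A level containing regex metacharacters would need
# escaping first.
last_only <- sub(paste0(level, "$"), paste0(sep, level), name)
print(last_only)        # "X1.1"
```

If stringi is available, `stringi::stri_replace_last_fixed(name, level, paste0(sep, level))` should give the same result as the anchored `sub()` for this input, without any regex escaping.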

topepo added a commit that referenced this issue Apr 7, 2016

fixes for issue #390

b4f8a87

topepo closed this as completed Apr 7, 2016

adimajo mentioned this issue Feb 26, 2019

A quick fix of issue #390 partially solved in commit b4f8a87ce516b6f5d9d8a11cfb940a3b696ec5ca (see my recent comments on #390) #1011

Merged

topepo added a commit that referenced this issue Mar 25, 2019

Merge pull request #1011 from adimajo/master

0a0ac85

A quick fix of issue #390 partially solved in commit b4f8a87 (see my recent comments on #390)
