Factor variables #16

Open
matthieugomez opened this Issue Nov 6, 2014 · 4 comments

Projects

None yet

4 participants

@matthieugomez

When the formula contains terms such as : or , it would be nice to split term in two columns. For instance

tidy(lm(y ~ x:as.factor(z)))

gives

term 
(Intercept)     
x*as.factor(z):1
x*as.factor(z):2
x*as.factor(z):3

but it could give two supplementary columns

term              
(Intercept)       NA   NA     
x*as.factor(z):1   x    1
x*as.factor(z):2   x    2
x*as.factor(z):3   x    3

This would allow to plot the estimate as a function of z, for instance.
I'm not sure what's the best way to do it that avoids regex.

Collaborator
dgrtwo commented Nov 6, 2014

I see what you mean but this seems like something that would best be handled in post-tidying dplyr steps. An analysis could be arbitrarily complicated, with multiple interaction terms in different combinations, and any operation to plot them would need to filter out and manipulate terms anyway.

Also, I can't reproduce the above example. Why does the z factor have one term rather than having one for each level? And it's not clear what the z column in the output represents: why is it z in one case but 1/2/3 in others (If it's a factor, does it simply have an extra level relative to the z input?) Could you suggest a reproducible example?

I see. I now handle that using regex, but I was just wondering if something cleaner was possible. It corresponds a little bit to tidyr::separate

Attached are examples (the second row in the first post corresponds to the intercept (edited))

N=1e3; K=100
set.seed(1)
DT <- data.table(
  z = sample(3, N/K, TRUE),
  x =  sample(5, N, TRUE),
  y =  sample(1e6, N, TRUE)                       
)
tidy(lm(y~ x+ x:as.factor(z), DT))
#>              term   estimate std.error  statistic      p.value
#> 1     (Intercept) 501831.257 22033.010 22.7763371 9.041980e-93
#> 2               x  -8988.640  7681.250 -1.1702053 2.421983e-01
#> 3 x:as.factor(z)2   8682.732  6670.360  1.3016885 1.933239e-01
#> 4 x:as.factor(z)3   5488.104  7472.208  0.7344688 4.628359e-01

tidy(lm(y~ as.factor(z)+ x:as.factor(z), DT))
#>              term     estimate std.error    statistic      p.value
#> 1     (Intercept) 539837.42605  41834.66 12.904069970 2.464753e-35
#> 2   as.factor(z)2 -38945.31370  54239.54 -0.718024374 4.729110e-01
#> 3   as.factor(z)3 -70280.63443  57378.57 -1.224858572 2.209187e-01
#> 4 as.factor(z)1:x -19071.47164  12166.52 -1.567537331 1.173074e-01
#> 5 as.factor(z)2:x    -52.63436  10312.06 -0.005104157 9.959285e-01
#> 6 as.factor(z)3:x   5886.15152  12694.31  0.463684419 6.429754e-01

tidy(lm(y~ x*as.factor(z), DT))
#>              term  estimate std.error  statistic      p.value
#> 1     (Intercept) 539837.43  41834.66 12.9040700 2.464753e-35
#> 2               x -19071.47  12166.52 -1.5675373 1.173074e-01
#> 3   as.factor(z)2 -38945.31  54239.54 -0.7180244 4.729110e-01
#> 4   as.factor(z)3 -70280.63  57378.57 -1.2248586 2.209187e-01
#> 5 x:as.factor(z)2  19018.84  15948.75  1.1924969 2.333511e-01
#> 6 x:as.factor(z)3  24957.62  17583.22  1.4194002 1.560959e-01

I agree with @matthieugomez, I would also like to see tidy handle interactions. My current workflow is to perform the regression and then calculate stratum-specific effects with the survey package, for each interaction term. If this could all be wrapped up in a tidy() function, with exponentiation and confidence intervals, it would be much neater.

I've got no idea how it could be achieved though.

counts <- c(18,17,15,20,10,20,25,13,12)
outcome <- gl(3,1,9)
treatment <- gl(3,3)
# print(d.AD <- data.frame(treatment, outcome, counts))

glm.D93 <- glm(counts ~ outcome * treatment, family = poisson())
#anova(glm.D93)
summary(glm.D93)

require(survey)
svycontrast(glm.D93, c("outcome2" = 1, "outcome3:treatment3" = 1))
exp(svycontrast(glm.D93, c("outcome2" = 1, "outcome3:treatment3" = 1))[[1]])
Contributor
nutterb commented Mar 22, 2015

@matthieugomez, Here's a sort-of solution that gets the two columns you want using regular expressions. Based on the common rules for variable names, I think that this will work for nearly all factor specifications because there really isn't a whole lot of transforming you can do to a factor. (I bring this up because I don't handle things like + and ^ in the regular expressions. I can't think of an example where one would use such a thing with a factor)

But I'll agree with @dgrtwo that this is probably not something suited to what broom is trying to accomplish. At least not in my understanding. Also, this solution is model specific, so while broom can combine the results of multiple models, the code below has to be applied for each model independently.

mtcars2 <- transform(mtcars,
                     am = factor(am, 0:1, c("Manual", "Automatic")))
Hmisc::label(mtcars2[, c("mpg", "am", "qsec", "wt")], self=FALSE) <- 
  c("Miles per Gallon", "Transmission", "Quarter Mile Time", "Weight")

fit0 <- lm(mpg ~ factor(am) * qsec + wt, data=mtcars)
tbl0 <- tidy(fit0)

makeRegex <- function(fit){
  fctr <- attributes(terms(fit))$dataClasses
  fctr <- names(fctr)[fctr == "factor"]
  fctr_regex <- paste0(fctr, collapse="|")
  fctr_regex <- gsub("[(]", "[(]", fctr_regex)
  fctr_regex <- gsub("[)]", "[)]", fctr_regex)
  fctr_regex <- gsub("[.]", "[.]", fctr_regex)
  fctr_regex <- paste0("(", fctr_regex, ")")
  return(fctr_regex)
}

fr <- makeRegex(fit0)

tbl0$level <- gsub(fr, "", tbl0$term)
tbl0$term_alt <- sapply(tbl0$term, 
                        function(x, reg){
                          x <- unlist(strsplit(x, ":"))
                          x <- ifelse(grepl(reg, x), 
                                      stringr::str_extract(x, reg),
                                      x)
                          x <- paste0(x, collapse=":")
                        },
                        fr)
tbl0

fit <- lm(mpg ~ am * qsec + wt, data=mtcars2)
tbl <- tidy(fit)

fr <- makeRegex(fit)

tbl$level <- gsub(fr, "", tbl$term)
tbl$term_alt <- sapply(tbl$term, 
                        function(x, reg){
                          x <- unlist(strsplit(x, ":"))
                          x <- ifelse(grepl(reg, x), 
                                      stringr::str_extract(x, reg),
                                      x)
                          x <- paste0(x, collapse=":")
                        },
                        fr)
tbl
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment