# Factor variables #16

Closed
opened this Issue Nov 6, 2014 · 5 comments

Projects
None yet
5 participants

### matthieugomez commented Nov 6, 2014

 When the formula contains terms such as : or , it would be nice to split term in two columns. For instance `tidy(lm(y ~ x:as.factor(z)))` gives ```term (Intercept) x*as.factor(z):1 x*as.factor(z):2 x*as.factor(z):3``` but it could give two supplementary columns ```term (Intercept) NA NA x*as.factor(z):1 x 1 x*as.factor(z):2 x 2 x*as.factor(z):3 x 3``` This would allow to plot the estimate as a function of z, for instance. I'm not sure what's the best way to do it that avoids regex.
Collaborator

### dgrtwo commented Nov 6, 2014

 I see what you mean but this seems like something that would best be handled in post-tidying dplyr steps. An analysis could be arbitrarily complicated, with multiple interaction terms in different combinations, and any operation to plot them would need to filter out and manipulate terms anyway. Also, I can't reproduce the above example. Why does the `z` factor have one term rather than having one for each level? And it's not clear what the `z` column in the output represents: why is it `z` in one case but 1/2/3 in others (If it's a factor, does it simply have an extra level relative to the `z` input?) Could you suggest a reproducible example?

### matthieugomez commented Nov 6, 2014

 I see. I now handle that using regex, but I was just wondering if something cleaner was possible. It corresponds a little bit to `tidyr::separate` Attached are examples (the second row in the first post corresponds to the intercept (edited)) ```N=1e3; K=100 set.seed(1) DT <- data.table( z = sample(3, N/K, TRUE), x = sample(5, N, TRUE), y = sample(1e6, N, TRUE) ) tidy(lm(y~ x+ x:as.factor(z), DT)) #> term estimate std.error statistic p.value #> 1 (Intercept) 501831.257 22033.010 22.7763371 9.041980e-93 #> 2 x -8988.640 7681.250 -1.1702053 2.421983e-01 #> 3 x:as.factor(z)2 8682.732 6670.360 1.3016885 1.933239e-01 #> 4 x:as.factor(z)3 5488.104 7472.208 0.7344688 4.628359e-01 tidy(lm(y~ as.factor(z)+ x:as.factor(z), DT)) #> term estimate std.error statistic p.value #> 1 (Intercept) 539837.42605 41834.66 12.904069970 2.464753e-35 #> 2 as.factor(z)2 -38945.31370 54239.54 -0.718024374 4.729110e-01 #> 3 as.factor(z)3 -70280.63443 57378.57 -1.224858572 2.209187e-01 #> 4 as.factor(z)1:x -19071.47164 12166.52 -1.567537331 1.173074e-01 #> 5 as.factor(z)2:x -52.63436 10312.06 -0.005104157 9.959285e-01 #> 6 as.factor(z)3:x 5886.15152 12694.31 0.463684419 6.429754e-01 tidy(lm(y~ x*as.factor(z), DT)) #> term estimate std.error statistic p.value #> 1 (Intercept) 539837.43 41834.66 12.9040700 2.464753e-35 #> 2 x -19071.47 12166.52 -1.5675373 1.173074e-01 #> 3 as.factor(z)2 -38945.31 54239.54 -0.7180244 4.729110e-01 #> 4 as.factor(z)3 -70280.63 57378.57 -1.2248586 2.209187e-01 #> 5 x:as.factor(z)2 19018.84 15948.75 1.1924969 2.333511e-01 #> 6 x:as.factor(z)3 24957.62 17583.22 1.4194002 1.560959e-01```

### simonthelwall commented Feb 27, 2015

 I agree with @matthieugomez, I would also like to see tidy handle interactions. My current workflow is to perform the regression and then calculate stratum-specific effects with the `survey` package, for each interaction term. If this could all be wrapped up in a `tidy()` function, with exponentiation and confidence intervals, it would be much neater. I've got no idea how it could be achieved though. ```counts <- c(18,17,15,20,10,20,25,13,12) outcome <- gl(3,1,9) treatment <- gl(3,3) # print(d.AD <- data.frame(treatment, outcome, counts)) glm.D93 <- glm(counts ~ outcome * treatment, family = poisson()) #anova(glm.D93) summary(glm.D93) require(survey) svycontrast(glm.D93, c("outcome2" = 1, "outcome3:treatment3" = 1)) exp(svycontrast(glm.D93, c("outcome2" = 1, "outcome3:treatment3" = 1))[[1]])```
Contributor

### nutterb commented Mar 22, 2015

 @matthieugomez, Here's a sort-of solution that gets the two columns you want using regular expressions. Based on the common rules for variable names, I think that this will work for nearly all factor specifications because there really isn't a whole lot of transforming you can do to a factor. (I bring this up because I don't handle things like `+` and `^` in the regular expressions. I can't think of an example where one would use such a thing with a factor) But I'll agree with @dgrtwo that this is probably not something suited to what `broom` is trying to accomplish. At least not in my understanding. Also, this solution is model specific, so while `broom` can combine the results of multiple models, the code below has to be applied for each model independently. ```mtcars2 <- transform(mtcars, am = factor(am, 0:1, c("Manual", "Automatic"))) Hmisc::label(mtcars2[, c("mpg", "am", "qsec", "wt")], self=FALSE) <- c("Miles per Gallon", "Transmission", "Quarter Mile Time", "Weight") fit0 <- lm(mpg ~ factor(am) * qsec + wt, data=mtcars) tbl0 <- tidy(fit0) makeRegex <- function(fit){ fctr <- attributes(terms(fit))\$dataClasses fctr <- names(fctr)[fctr == "factor"] fctr_regex <- paste0(fctr, collapse="|") fctr_regex <- gsub("[(]", "[(]", fctr_regex) fctr_regex <- gsub("[)]", "[)]", fctr_regex) fctr_regex <- gsub("[.]", "[.]", fctr_regex) fctr_regex <- paste0("(", fctr_regex, ")") return(fctr_regex) } fr <- makeRegex(fit0) tbl0\$level <- gsub(fr, "", tbl0\$term) tbl0\$term_alt <- sapply(tbl0\$term, function(x, reg){ x <- unlist(strsplit(x, ":")) x <- ifelse(grepl(reg, x), stringr::str_extract(x, reg), x) x <- paste0(x, collapse=":") }, fr) tbl0 fit <- lm(mpg ~ am * qsec + wt, data=mtcars2) tbl <- tidy(fit) fr <- makeRegex(fit) tbl\$level <- gsub(fr, "", tbl\$term) tbl\$term_alt <- sapply(tbl\$term, function(x, reg){ x <- unlist(strsplit(x, ":")) x <- ifelse(grepl(reg, x), stringr::str_extract(x, reg), x) x <- paste0(x, collapse=":") }, fr) tbl```

Closed

Collaborator

### alexpghayes commented Jun 4, 2018

 I agree that this would be nice, but think it's out of scope for `broom` at the moment. I see `broom` as tidying `summary` methods, etc, that produce non-tidy output. Making those methods produce more useful information in the first place is hugely valuable, but not something for `broom` to take on at the moment. As an aside, I think a there's a potential for blockbuster package if you reimagined the interface for linear models and got it right.