New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Factor variables #16

Closed
matthieugomez opened this Issue Nov 6, 2014 · 5 comments

Comments

Projects
None yet
5 participants
@matthieugomez

matthieugomez commented Nov 6, 2014

When the formula contains terms such as : or , it would be nice to split term in two columns. For instance

tidy(lm(y ~ x:as.factor(z)))

gives

term 
(Intercept)     
x*as.factor(z):1
x*as.factor(z):2
x*as.factor(z):3

but it could give two supplementary columns

term              
(Intercept)       NA   NA     
x*as.factor(z):1   x    1
x*as.factor(z):2   x    2
x*as.factor(z):3   x    3

This would allow to plot the estimate as a function of z, for instance.
I'm not sure what's the best way to do it that avoids regex.

@dgrtwo

This comment has been minimized.

Show comment
Hide comment
@dgrtwo

dgrtwo Nov 6, 2014

Collaborator

I see what you mean but this seems like something that would best be handled in post-tidying dplyr steps. An analysis could be arbitrarily complicated, with multiple interaction terms in different combinations, and any operation to plot them would need to filter out and manipulate terms anyway.

Also, I can't reproduce the above example. Why does the z factor have one term rather than having one for each level? And it's not clear what the z column in the output represents: why is it z in one case but 1/2/3 in others (If it's a factor, does it simply have an extra level relative to the z input?) Could you suggest a reproducible example?

Collaborator

dgrtwo commented Nov 6, 2014

I see what you mean but this seems like something that would best be handled in post-tidying dplyr steps. An analysis could be arbitrarily complicated, with multiple interaction terms in different combinations, and any operation to plot them would need to filter out and manipulate terms anyway.

Also, I can't reproduce the above example. Why does the z factor have one term rather than having one for each level? And it's not clear what the z column in the output represents: why is it z in one case but 1/2/3 in others (If it's a factor, does it simply have an extra level relative to the z input?) Could you suggest a reproducible example?

@matthieugomez

This comment has been minimized.

Show comment
Hide comment
@matthieugomez

matthieugomez Nov 6, 2014

I see. I now handle that using regex, but I was just wondering if something cleaner was possible. It corresponds a little bit to tidyr::separate

Attached are examples (the second row in the first post corresponds to the intercept (edited))

N=1e3; K=100
set.seed(1)
DT <- data.table(
  z = sample(3, N/K, TRUE),
  x =  sample(5, N, TRUE),
  y =  sample(1e6, N, TRUE)                       
)
tidy(lm(y~ x+ x:as.factor(z), DT))
#>              term   estimate std.error  statistic      p.value
#> 1     (Intercept) 501831.257 22033.010 22.7763371 9.041980e-93
#> 2               x  -8988.640  7681.250 -1.1702053 2.421983e-01
#> 3 x:as.factor(z)2   8682.732  6670.360  1.3016885 1.933239e-01
#> 4 x:as.factor(z)3   5488.104  7472.208  0.7344688 4.628359e-01

tidy(lm(y~ as.factor(z)+ x:as.factor(z), DT))
#>              term     estimate std.error    statistic      p.value
#> 1     (Intercept) 539837.42605  41834.66 12.904069970 2.464753e-35
#> 2   as.factor(z)2 -38945.31370  54239.54 -0.718024374 4.729110e-01
#> 3   as.factor(z)3 -70280.63443  57378.57 -1.224858572 2.209187e-01
#> 4 as.factor(z)1:x -19071.47164  12166.52 -1.567537331 1.173074e-01
#> 5 as.factor(z)2:x    -52.63436  10312.06 -0.005104157 9.959285e-01
#> 6 as.factor(z)3:x   5886.15152  12694.31  0.463684419 6.429754e-01

tidy(lm(y~ x*as.factor(z), DT))
#>              term  estimate std.error  statistic      p.value
#> 1     (Intercept) 539837.43  41834.66 12.9040700 2.464753e-35
#> 2               x -19071.47  12166.52 -1.5675373 1.173074e-01
#> 3   as.factor(z)2 -38945.31  54239.54 -0.7180244 4.729110e-01
#> 4   as.factor(z)3 -70280.63  57378.57 -1.2248586 2.209187e-01
#> 5 x:as.factor(z)2  19018.84  15948.75  1.1924969 2.333511e-01
#> 6 x:as.factor(z)3  24957.62  17583.22  1.4194002 1.560959e-01

matthieugomez commented Nov 6, 2014

I see. I now handle that using regex, but I was just wondering if something cleaner was possible. It corresponds a little bit to tidyr::separate

Attached are examples (the second row in the first post corresponds to the intercept (edited))

N=1e3; K=100
set.seed(1)
DT <- data.table(
  z = sample(3, N/K, TRUE),
  x =  sample(5, N, TRUE),
  y =  sample(1e6, N, TRUE)                       
)
tidy(lm(y~ x+ x:as.factor(z), DT))
#>              term   estimate std.error  statistic      p.value
#> 1     (Intercept) 501831.257 22033.010 22.7763371 9.041980e-93
#> 2               x  -8988.640  7681.250 -1.1702053 2.421983e-01
#> 3 x:as.factor(z)2   8682.732  6670.360  1.3016885 1.933239e-01
#> 4 x:as.factor(z)3   5488.104  7472.208  0.7344688 4.628359e-01

tidy(lm(y~ as.factor(z)+ x:as.factor(z), DT))
#>              term     estimate std.error    statistic      p.value
#> 1     (Intercept) 539837.42605  41834.66 12.904069970 2.464753e-35
#> 2   as.factor(z)2 -38945.31370  54239.54 -0.718024374 4.729110e-01
#> 3   as.factor(z)3 -70280.63443  57378.57 -1.224858572 2.209187e-01
#> 4 as.factor(z)1:x -19071.47164  12166.52 -1.567537331 1.173074e-01
#> 5 as.factor(z)2:x    -52.63436  10312.06 -0.005104157 9.959285e-01
#> 6 as.factor(z)3:x   5886.15152  12694.31  0.463684419 6.429754e-01

tidy(lm(y~ x*as.factor(z), DT))
#>              term  estimate std.error  statistic      p.value
#> 1     (Intercept) 539837.43  41834.66 12.9040700 2.464753e-35
#> 2               x -19071.47  12166.52 -1.5675373 1.173074e-01
#> 3   as.factor(z)2 -38945.31  54239.54 -0.7180244 4.729110e-01
#> 4   as.factor(z)3 -70280.63  57378.57 -1.2248586 2.209187e-01
#> 5 x:as.factor(z)2  19018.84  15948.75  1.1924969 2.333511e-01
#> 6 x:as.factor(z)3  24957.62  17583.22  1.4194002 1.560959e-01
@simonthelwall

This comment has been minimized.

Show comment
Hide comment
@simonthelwall

simonthelwall Feb 27, 2015

I agree with @matthieugomez, I would also like to see tidy handle interactions. My current workflow is to perform the regression and then calculate stratum-specific effects with the survey package, for each interaction term. If this could all be wrapped up in a tidy() function, with exponentiation and confidence intervals, it would be much neater.

I've got no idea how it could be achieved though.

counts <- c(18,17,15,20,10,20,25,13,12)
outcome <- gl(3,1,9)
treatment <- gl(3,3)
# print(d.AD <- data.frame(treatment, outcome, counts))

glm.D93 <- glm(counts ~ outcome * treatment, family = poisson())
#anova(glm.D93)
summary(glm.D93)

require(survey)
svycontrast(glm.D93, c("outcome2" = 1, "outcome3:treatment3" = 1))
exp(svycontrast(glm.D93, c("outcome2" = 1, "outcome3:treatment3" = 1))[[1]])

simonthelwall commented Feb 27, 2015

I agree with @matthieugomez, I would also like to see tidy handle interactions. My current workflow is to perform the regression and then calculate stratum-specific effects with the survey package, for each interaction term. If this could all be wrapped up in a tidy() function, with exponentiation and confidence intervals, it would be much neater.

I've got no idea how it could be achieved though.

counts <- c(18,17,15,20,10,20,25,13,12)
outcome <- gl(3,1,9)
treatment <- gl(3,3)
# print(d.AD <- data.frame(treatment, outcome, counts))

glm.D93 <- glm(counts ~ outcome * treatment, family = poisson())
#anova(glm.D93)
summary(glm.D93)

require(survey)
svycontrast(glm.D93, c("outcome2" = 1, "outcome3:treatment3" = 1))
exp(svycontrast(glm.D93, c("outcome2" = 1, "outcome3:treatment3" = 1))[[1]])
@nutterb

This comment has been minimized.

Show comment
Hide comment
@nutterb

nutterb Mar 22, 2015

Contributor

@matthieugomez, Here's a sort-of solution that gets the two columns you want using regular expressions. Based on the common rules for variable names, I think that this will work for nearly all factor specifications because there really isn't a whole lot of transforming you can do to a factor. (I bring this up because I don't handle things like + and ^ in the regular expressions. I can't think of an example where one would use such a thing with a factor)

But I'll agree with @dgrtwo that this is probably not something suited to what broom is trying to accomplish. At least not in my understanding. Also, this solution is model specific, so while broom can combine the results of multiple models, the code below has to be applied for each model independently.

mtcars2 <- transform(mtcars,
                     am = factor(am, 0:1, c("Manual", "Automatic")))
Hmisc::label(mtcars2[, c("mpg", "am", "qsec", "wt")], self=FALSE) <- 
  c("Miles per Gallon", "Transmission", "Quarter Mile Time", "Weight")

fit0 <- lm(mpg ~ factor(am) * qsec + wt, data=mtcars)
tbl0 <- tidy(fit0)

makeRegex <- function(fit){
  fctr <- attributes(terms(fit))$dataClasses
  fctr <- names(fctr)[fctr == "factor"]
  fctr_regex <- paste0(fctr, collapse="|")
  fctr_regex <- gsub("[(]", "[(]", fctr_regex)
  fctr_regex <- gsub("[)]", "[)]", fctr_regex)
  fctr_regex <- gsub("[.]", "[.]", fctr_regex)
  fctr_regex <- paste0("(", fctr_regex, ")")
  return(fctr_regex)
}

fr <- makeRegex(fit0)

tbl0$level <- gsub(fr, "", tbl0$term)
tbl0$term_alt <- sapply(tbl0$term, 
                        function(x, reg){
                          x <- unlist(strsplit(x, ":"))
                          x <- ifelse(grepl(reg, x), 
                                      stringr::str_extract(x, reg),
                                      x)
                          x <- paste0(x, collapse=":")
                        },
                        fr)
tbl0

fit <- lm(mpg ~ am * qsec + wt, data=mtcars2)
tbl <- tidy(fit)

fr <- makeRegex(fit)

tbl$level <- gsub(fr, "", tbl$term)
tbl$term_alt <- sapply(tbl$term, 
                        function(x, reg){
                          x <- unlist(strsplit(x, ":"))
                          x <- ifelse(grepl(reg, x), 
                                      stringr::str_extract(x, reg),
                                      x)
                          x <- paste0(x, collapse=":")
                        },
                        fr)
tbl
Contributor

nutterb commented Mar 22, 2015

@matthieugomez, Here's a sort-of solution that gets the two columns you want using regular expressions. Based on the common rules for variable names, I think that this will work for nearly all factor specifications because there really isn't a whole lot of transforming you can do to a factor. (I bring this up because I don't handle things like + and ^ in the regular expressions. I can't think of an example where one would use such a thing with a factor)

But I'll agree with @dgrtwo that this is probably not something suited to what broom is trying to accomplish. At least not in my understanding. Also, this solution is model specific, so while broom can combine the results of multiple models, the code below has to be applied for each model independently.

mtcars2 <- transform(mtcars,
                     am = factor(am, 0:1, c("Manual", "Automatic")))
Hmisc::label(mtcars2[, c("mpg", "am", "qsec", "wt")], self=FALSE) <- 
  c("Miles per Gallon", "Transmission", "Quarter Mile Time", "Weight")

fit0 <- lm(mpg ~ factor(am) * qsec + wt, data=mtcars)
tbl0 <- tidy(fit0)

makeRegex <- function(fit){
  fctr <- attributes(terms(fit))$dataClasses
  fctr <- names(fctr)[fctr == "factor"]
  fctr_regex <- paste0(fctr, collapse="|")
  fctr_regex <- gsub("[(]", "[(]", fctr_regex)
  fctr_regex <- gsub("[)]", "[)]", fctr_regex)
  fctr_regex <- gsub("[.]", "[.]", fctr_regex)
  fctr_regex <- paste0("(", fctr_regex, ")")
  return(fctr_regex)
}

fr <- makeRegex(fit0)

tbl0$level <- gsub(fr, "", tbl0$term)
tbl0$term_alt <- sapply(tbl0$term, 
                        function(x, reg){
                          x <- unlist(strsplit(x, ":"))
                          x <- ifelse(grepl(reg, x), 
                                      stringr::str_extract(x, reg),
                                      x)
                          x <- paste0(x, collapse=":")
                        },
                        fr)
tbl0

fit <- lm(mpg ~ am * qsec + wt, data=mtcars2)
tbl <- tidy(fit)

fr <- makeRegex(fit)

tbl$level <- gsub(fr, "", tbl$term)
tbl$term_alt <- sapply(tbl$term, 
                        function(x, reg){
                          x <- unlist(strsplit(x, ":"))
                          x <- ifelse(grepl(reg, x), 
                                      stringr::str_extract(x, reg),
                                      x)
                          x <- paste0(x, collapse=":")
                        },
                        fr)
tbl
@alexpghayes

This comment has been minimized.

Show comment
Hide comment
@alexpghayes

alexpghayes Jun 4, 2018

Collaborator

I agree that this would be nice, but think it's out of scope for broom at the moment. I see broom as tidying summary methods, etc, that produce non-tidy output. Making those methods produce more useful information in the first place is hugely valuable, but not something for broom to take on at the moment.

As an aside, I think a there's a potential for blockbuster package if you reimagined the interface for linear models and got it right.

Collaborator

alexpghayes commented Jun 4, 2018

I agree that this would be nice, but think it's out of scope for broom at the moment. I see broom as tidying summary methods, etc, that produce non-tidy output. Making those methods produce more useful information in the first place is hugely valuable, but not something for broom to take on at the moment.

As an aside, I think a there's a potential for blockbuster package if you reimagined the interface for linear models and got it right.

@alexpghayes alexpghayes closed this Jun 4, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment