A framework for reproducible tables #69

Open
bzkrouse opened this issue May 3, 2017 · 28 comments

Comments

@bzkrouse
commented May 3, 2017

In my work (clinical research), we make a lot of tables, usually comparing 2 or more groups. It's nice to format the table programmatically so that it is reproducible and ready for publication. The process to do so usually looks something like this:

  • Make a df that contains everything needed for the table (for example: N per group, mean/adjusted mean for each group, mean difference/OR/RR and confidence interval, p-value)
  • Do a bunch of reshaping, reordering, and pasting ), (, and %s here and there to get the results formatted correctly
  • Print/output the result
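
For illustration, the three steps above might look like the following minimal sketch. It uses only base R and the built-in mtcars data. The choice of variables (mpg by am), the Welch t-test, and the column labels are assumptions made for the example, not part of any particular study workflow.

```r
# Step 1: build one data frame holding everything the table needs.
# (Assumed example: mpg compared between transmission types in mtcars.)
groups <- split(mtcars$mpg, mtcars$am)
stats <- data.frame(
  group = names(groups),
  n     = vapply(groups, length, integer(1)),
  mean  = vapply(groups, mean, numeric(1)),
  sd    = vapply(groups, sd, numeric(1)),
  row.names = NULL
)
p_value <- t.test(mpg ~ am, data = mtcars)$p.value  # Welch two-sample t-test

# Step 2: reshape and paste the pieces into display strings such as "17.1 (3.8)".
table_ready <- data.frame(
  Transmission = c("Automatic", "Manual"),  # labels for am = 0 and am = 1
  N            = stats$n,
  `Mean (SD)`  = sprintf("%.1f (%.1f)", stats$mean, stats$sd),
  check.names  = FALSE
)

# Step 3: print/output the result.
print(table_ready, row.names = FALSE)
cat(sprintf("p = %.3f\n", p_value))
```

Step 2 is the hand-written formatting that a shared tool could take over.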

With tidy tools like dplyr, broom, and purrr, it is easier than ever before to create the self-contained data frame. However, getting all the necessary pieces and working the df into a table-ready format is a process that seems to be recreated from scratch each time. It would be great to have a tool that helps to automate this process a bit. Here are some rough thoughts on what this could look like:

  • Functions that, based on desired table/analysis type, automate the different stats and combine them together
  • Something like a compareGroups for the tidyverse
  • Maybe borrowing formatting conventions from summary functions in #50

Does anyone have any interest or thoughts about this topic? Are there any tools already out there that help with this? If not an unconf project, would love a related discussion about people’s workflows!

@sfirke

commented May 3, 2017

I so hear this. I think the formatting aspect of summary tables in R is quite tedious and a barrier to winning people over from Excel for routine analyses.

I took a crack at this with the janitor package, specifically creating tabulations and 2-way/crosstab/contingency tables and formatting them with percentages, rounding, etc. for quick publication. I have focused on simple counts and percentages rather than statistics, but maybe the formatting aspect could be leveraged?

I'm rethinking the approach to janitor's tabulations and formatting, making the functions more modular and coherent and less a set of utilities. If this comes close, maybe something could be built into janitor, or those functions or ideas could be extended. Or if it should be something separate, that's great too and I'd love to help.

@haozhu233

commented May 3, 2017

To help generate nice-looking tables with grouping factors, I wrote a package called kableExtra a few months ago. Basically, you can do something like this in a pdf_document:

library(dplyr)
library(knitr)
library(kableExtra)
library(ezsummary)

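# Summarise the variables by cylinder count, one mean/sd column pair per group
mtcars %>%
  group_by(cyl) %>%
  ezsummary(flavor = "wide") %>%
  kable(format = "latex", booktabs = TRUE,
        col.names = c("variable", rep(c("mean", "sd"), 3))) %>%
  # Add a spanning header row that labels each pair of columns by group
  add_header_above(c(" ", "4 cyl" = 2, "6 cyl" = 2, "8 cyl" = 2))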

Then you will get something like this:
[screenshot of the resulting table]

(ezsummary is something I wrote in the past. Some aspects of its design fall a little short of my expectations, but I still use it sometimes. :P )

@njtierney

commented May 4, 2017

This is a great idea!

Creating these tables is something I find so frustrating when writing a paper or report. It's totally one of those things where I've just gone:

Bah! I'll just write in the values manually just this once

Except it's almost never just this once, and it adds to reproducibility hell.

Having a tool (or tools) that makes it easier to create these sorts of tables would for sure ease a pressure point in reproducibility.

ezsummary and kableExtra both look amazing, @haozhu233! I'd love to learn more about ezsummary and kableExtra and see if we can develop them further.

@stephlocke

commented May 4, 2017

I don't know if this might be of interest, but I met the guy who built this the other day and was impressed with the level of docs: https://cran.r-project.org/web/packages/pivottabler/index.html

@batpigandme

commented May 4, 2017

@njtierney I'm digging kableExtra atm, but I've also used huxtable, and pander. huxtable has a "table of regressions" format, which you can see here.
I'm sure there are still lots of gaps, just wanted to put these out there.

@njtierney

commented May 5, 2017

Ah, good to know! It's great to gather all these resources together!

Maybe we can work together on some examples of tables we have made for papers/reports, try all these different methods/pkgs out, and then work out what was great and what could be improved?

@bzkrouse

Author

commented May 5, 2017

Wow, it's nice to hear other people are having similar thoughts (well said, @njtierney!). @sfirke and @haozhu233 - really appreciate the tools you've built and the fact that you've already spent so much time thinking about this problem. If any of these tools can be leveraged or extended, that would be amazing. It would be great to figure out a way to incorporate more of the statistical/modeling aspect of the analysis. More specifically, in the case of a table that contains many models/tests, a potential tool could pair nicely with the purrr workflow.

I will also mention the tangram package, which came onto my radar yesterday. I don't know much about it, but it seems to have a unique table-building model.
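
As a rough sketch of the many-models case mentioned above: fit one model per outcome, pull out the estimate, confidence interval, and p-value, and stack the results into one table-ready data frame. This version uses base R's lapply() where a purrr workflow would use map(); the outcomes, the predictor (am), and the column labels are assumptions chosen only for illustration.

```r
# Assumed example: the effect of transmission type (am) on several outcomes.
outcomes <- c("mpg", "hp", "wt")

# Fit one linear model per outcome and keep only the pieces the table needs.
rows <- lapply(outcomes, function(y) {
  fit <- lm(reformulate("am", response = y), data = mtcars)
  est <- coef(fit)[["am"]]
  ci  <- confint(fit)["am", ]
  p   <- summary(fit)$coefficients["am", "Pr(>|t|)"]
  data.frame(
    Outcome = y,
    `Difference (95% CI)` = sprintf("%.2f (%.2f, %.2f)", est, ci[[1]], ci[[2]]),
    `p-value` = format.pval(p, digits = 2, eps = 0.001),
    check.names = FALSE
  )
})

# Stack the per-model rows into one table-ready data frame.
model_table <- do.call(rbind, rows)
print(model_table, row.names = FALSE)
```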

@haozhu233

commented May 5, 2017

@njtierney Great idea! I think this type of "literature review" will be very useful for our community. After that we will have a better understanding of what we have right now and what exactly we need. I can imagine during the unconf, we can easily generate a blog post that @stefaniebutland would like to see. ;)

@njtierney

commented May 5, 2017

So many interesting things to work on at the unconf!

@jhollist

Member

commented May 5, 2017

Agree with all that this is needed, and this thread helps summarize a lot. Just an idea: what about a gallery of tables with the code to produce them? Something similar to @haozhu233's example above, but for different typical table types.

@haozhu233

commented May 5, 2017

I feel like the gallery @jhollist just mentioned will definitely be super helpful. We can also borrow an idea from the design of ggplot2, which offers both ggplot(), which is powerful and customizable, and qplot(), which bootstraps common plot types.

@bzkrouse

Author

commented May 5, 2017

These are great ideas! I agree the lit review and gallery concept will both be very helpful and great resources for the broader community. It would be nice to take stock of what tools are out there and what types of tables should be covered. @haozhu233 - yes!! to your idea of structuring like ggplot2. That sounds ideal for a tool that is meant to be easy to use and easily extensible. Maybe we could try out a paradigm where you start with a simple table and add "layers" of details, complexities, and/or customizations.

@elinw

commented May 6, 2017

This is great; I actually mentioned something like this to @stefaniebutland in my talk with her. It's a huge issue in sociology because we make crosstabs a lot, and they are really a pain in R, especially with multiple variables. This is something I wrote to make crosstabs easier for my students (and me): https://github.com/elinw/lehmansociology/blob/master/R/crosstab.R but the print function is really painful. Even what should be simple frequency tables are hard in base R; this is what we came up with just to illustrate: https://github.com/elinw/lehmansociology/blob/master/R/frequency.R. @sfirke I'm going to have a look at janitor!

@noamross added the publication label May 7, 2017

@elinw

commented May 8, 2017

Wow, kableExtra, nice!

If we are making a literature review then I think formattable should be in there. And of course tables.
I have a lot of PHP/Web experience and the way tables are handled in R always feels very different to me.

@karawoo

commented May 8, 2017

Someone gave a lightning talk at the Seattle useR meetup on this topic a few months ago. He showed a few examples, one of which was tableone

@haozhu233

commented May 9, 2017

Just saw desctable on my github timeline. It seems to be another good fit for this issue.

@bzkrouse

Author

commented May 11, 2017

Nice, lots of examples :) desctable is really interesting! It seems to focus more on ease of process and content than on styling (my impression is that some of these packages emphasize one or the other). I'll throw another one into the mix: arsenal, which gets more into stats and models. (@elinw this may be of interest to you for frequency tables...)

@sfirke

commented May 11, 2017

There are also older-school table printing options, like gmodels::CrossTable() and there's one in Hmisc (I think summary?). A literature review of what's out there and how it differs would be a boon to folks navigating all of the options. Makes me think of reviews of a field of products from The WireCutter.

@bzkrouse

Author

commented May 17, 2017

Summary of this thread:

There are lots of existing packages/functions for creating and/or formatting tables of various types. There seems to be a consensus that more work may be needed in this area, but we first need to understand all that is available right now. The great discussion in #78 could inform this process. From there, we can determine what is needed going forward. Potential ideas for the unconf, summarized from discussion above:

  1. Perform "lit review" of existing packages

    • perform as a case study for #78
    • compare existing packages by trying them out on a set of common table types
    • create a gallery of tables with the code to produce them
    • create blog post
      • present results of lit review to benefit the community
      • reference WireCutter for ideas
  2. Are there improvements to be made? If so, planning the future of tables in R:

    • would be informed by the lit review
    • consider extending existing packages or creating a new one
    • borrow ideas from ggplot2
@jhollist

Member

commented May 17, 2017

@maelle

Member

commented May 18, 2017

In case it wasn't in your list I just saw this https://gdemin.github.io/expss/

@jsta

Contributor

commented May 18, 2017

I was digging into the huxtable docs and found a vignette which compares the features of many table making packages: https://cran.r-project.org/web/packages/huxtable/vignettes/design-principles.html

@wampeh1

commented May 19, 2017

I work for the Federal Reserve Board (FRB). My duties include reading data from various sources (including PDF, Excel, and XML), processing these data, and recompiling them to produce tables and charts in LaTeX and FAME for publication purposes. I am currently searching for a similar tool (or tools) in R to replicate these processes (including creating tables). Very interested in this topic.

@aammd

commented May 24, 2017

This issue reminds me a little of this silly joke flowchart I made. But perhaps a (slightly) more serious flowchart would be helpful to people?

I also wonder if it would be possible to create some kind of DSL for making tables that works with the pipe operator? Similar to @haozhu233's suggestion to use something like a ggplot2 syntax.

@sfirke

commented May 24, 2017

A grammar of tables w/ modular piping functions, à la ggplot2, would be wonderful. I have been stumbling toward something similar, though in a limited use case (simple one-way and two-way tabulations). So far I have (on a dev branch):

library(janitor)
mtcars %>%
  crosstab(cyl, am) %>% 
  adorn_totals("row") %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting() %>% 
  adorn_ns()

#>     cyl          0          1
#> 1     4 27.3%  (3) 72.7%  (8)
#> 2     6 57.1%  (4) 42.9%  (3)
#> 3     8 85.7% (12) 14.3%  (2)
#> 4 Total 59.4% (19) 40.6% (13)

But this is hardly a grammar - just a vote of enthusiasm for going in that direction 😀

@GShotwell

commented May 24, 2017

dplyr::case_when() might be a good model for table formatting. Maybe something like this:

mtcars %>% 
  group_by(cyl) %>% 
  summarize(
            n = n(), 
            price = 10000 * wt,
            percent_wt = wt / sum(wt)) %>% 
  format(
          n ~ 'comma',
          price ~ 'euro',
          percent_wt ~ 'percent')
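
One way such a formula interface could work is sketched below in base R. format_cols() and its three formatters are hypothetical names invented for this illustration; they are not part of dplyr or any package mentioned in this thread. The sketch also assumes group means and group shares as the summaries, since each formatter needs one value per group.

```r
# Hypothetical helper (not an existing API): apply a named formatter to each
# column given as a formula of the form `column ~ "style"`.
format_cols <- function(df, ...) {
  # Illustrative formatters; the style names follow the proposal above.
  formatters <- list(
    comma   = function(x) format(x, big.mark = ",", trim = TRUE),
    euro    = function(x) paste0("EUR ", format(round(x), big.mark = ",", trim = TRUE)),
    percent = function(x) sprintf("%.1f%%", 100 * x)
  )
  for (spec in list(...)) {
    stopifnot(inherits(spec, "formula"), length(spec) == 3)
    col   <- as.character(spec[[2]])  # left-hand side: column name
    style <- spec[[3]]                # right-hand side: formatter name
    stopifnot(col %in% names(df), style %in% names(formatters))
    df[[col]] <- formatters[[style]](df[[col]])
  }
  df
}

# Assumed per-group summaries: count, mean "price", and share of total weight.
by_cyl <- data.frame(
  cyl        = sort(unique(mtcars$cyl)),
  n          = as.vector(table(mtcars$cyl)),
  price      = as.vector(tapply(10000 * mtcars$wt, mtcars$cyl, mean)),
  percent_wt = as.vector(tapply(mtcars$wt, mtcars$cyl, sum)) / sum(mtcars$wt)
)

formatted <- format_cols(by_cyl, n ~ "comma", price ~ "euro", percent_wt ~ "percent")
print(formatted, row.names = FALSE)
```

Keeping the formatters in a plain named list means new styles could be added without changing the calling syntax.
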
@elinw

commented May 25, 2017

@GShotwell I like that idea a lot; in a strange way it's like tables but more like normal language. It would be great if there were a dplyr n() function (with some other name) that is a real function. I always end up needing n when doing calculations for tables.

@stefaniebutland

Collaborator

commented Oct 24, 2017

"Combining the two issues, we set out to create a guide that could help users navigate package selection, using the case of reproducible tables as a case study."

Repo: https://github.com/ropenscilabs/packagemetrics
Blog post: packagemetrics - Helping you choose a package since runconf17
