A framework for reproducible tables #69

bzkrouse · 2017-05-03T19:55:09Z

In my work (clinical research), we make a lot of tables, usually comparing 2 or more groups. It's nice to format the table programmatically so that it is reproducible and ready for publication. The process to do so usually looks something like this:

1. Make a df that contains everything needed for the table (for example: N per group, mean/adjusted mean for each group, mean difference/OR/RR and confidence interval, p-value)
2. Do a bunch of reshaping, reordering, and pasting of ")", "(", and "%" characters here and there to get the results formatted correctly
3. Print/output the result
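
As a rough illustration of those three steps, here is a minimal base R sketch. The choice of the built-in mtcars data, cyl as the grouping variable, mpg as the outcome, and a "mean (SD)" cell format are all assumptions made for the example, not part of the original workflow.

```r
# Step 1: build a data frame holding everything the table needs
# (here: N, mean, and SD of mpg per cylinder group; illustrative choices).
groups <- split(mtcars$mpg, mtcars$cyl)
stats <- data.frame(
  cyl  = names(groups),
  n    = vapply(groups, length, integer(1)),
  mean = vapply(groups, mean, numeric(1)),
  sd   = vapply(groups, sd, numeric(1)),
  row.names = NULL
)

# Step 2: paste the pieces into publication-style cells.
table_ready <- data.frame(
  Cylinders   = stats$cyl,
  N           = stats$n,
  `Mean (SD)` = sprintf("%.1f (%.1f)", stats$mean, stats$sd),
  check.names = FALSE
)

# Step 3: print/output the result.
print(table_ready, row.names = FALSE)
```

Even in this tiny case, step 2 is hand-written string pasting, which is the part that tends to be rebuilt from scratch for every table.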

With tidy tools like dplyr, broom, and purrr, it is easier than ever before to create the self-contained data frame. However, getting all the necessary pieces and working the df into a table-ready format is a process that seems to be recreated from scratch each time. It would be great to have a tool that helps to automate this process a bit. Here are some vague thoughts on what this could look like:

- Functions that, based on the desired table/analysis type, automate the different stats and combine them together
- Something like a compareGroups for the tidyverse
- Maybe borrowing formatting conventions from the summary functions discussed in "Pipeable summary functions" (#50)

Does anyone have any interest or thoughts about this topic? Are there any tools already out there that help with this? If not an unconf project, would love a related discussion about people’s workflows!

sfirke · 2017-05-03T22:57:39Z

I so hear this. I think the formatting aspect of summary tables in R is quite tedious and a barrier to winning people over from Excel for routine analyses.

I took a crack at this with the janitor package, specifically creating tabulations and 2-way/crosstab/contingency tables and formatting them with percentages, rounding, etc. for quick publication. I have focused on simple counts and percentages rather than any statistics, but maybe the formatting aspect could be leveraged?

I'm rethinking the approach to janitor's tabulations and formatting, making the functions more modular and coherent and less a set of utilities. If this comes kind of close, maybe something could be built into janitor or those functions or ideas could be extended. Or if it should be something separate, that's great too and I'd love to help ⛏

haozhu233 · 2017-05-03T23:56:11Z

To help generate nice-looking tables with grouping factors, I wrote a package called kableExtra a few months ago. So basically you can do something like the following in a pdf_document:

library(dplyr)
library(knitr)
library(kableExtra)
library(ezsummary)

mtcars %>%
  group_by(cyl) %>%
  ezsummary(flavor = "wide") %>%
  kable(format = "latex", booktabs = TRUE,
        col.names = c("variable", rep(c("mean", "sd"), 3))) %>%
  add_header_above(c(" ", "4 cyl" = 2, "6 cyl" = 2, "8 cyl" = 2))

Then you will get something like

(ezsummary is something I wrote in the past. Some of its design aspects fall a little below my expectations, but I still use it sometimes. :P )

njtierney · 2017-05-04T04:59:23Z

This is a great idea!

Creating these tables is something I find so frustrating when writing a paper or report. It's totally one of those things where I've just gone:

Bah! I'll just write in the values manually just this once

Except it's almost never just this once, and it adds to reproducibility hell.

Having a tool (or tools) that makes it easier to create these sorts of tables would for sure ease a pressure point in reproducibility.

ezsummary and kableExtra both look amazing, @haozhu233! I'd love to learn more about ezsummary and kableExtra and see if we can develop them further.

stephlocke · 2017-05-04T17:40:48Z

I don't know if this might be of interest, but I met the guy who built this the other day. I was impressed with the level of docs: https://cran.r-project.org/web/packages/pivottabler/index.html

batpigandme · 2017-05-04T19:35:43Z

@njtierney I'm digging kableExtra atm, but I've also used huxtable, and pander. huxtable has a "table of regressions" format, which you can see here.
I'm sure there are still lots of gaps, just wanted to put these out there.

njtierney · 2017-05-05T01:03:29Z

Ah, good to know! It's great to gather all these resources together!

Maybe we can work together on some examples of tables we have made for papers/reports, try all these different methods/pkgs out, and then work out what was great and what could be improved?

bzkrouse · 2017-05-05T01:33:29Z

Wow, it's nice to hear other people are having similar thoughts (well said @njtierney !). @sfirke and @haozhu233 - really appreciate the tools you've built and the fact that you've already spent so much time thinking about this problem. If any of these tools can be leveraged or extended, that would be amazing. It would be great to figure out a way to incorporate more of the statistical/modeling aspect of the analysis. More specifically, in the case of a table that contains many models/tests, a potential tool could pair nicely with the purrr workflow.

I will mention the tangram package that came onto my radar yesterday that I don't know much about but seems to have a unique table building model.

haozhu233 · 2017-05-05T01:39:18Z

@njtierney Great idea! I think this type of "literature review" will be very useful for our community. After that we will have a better understanding of what we have right now and what exactly we need. I can imagine during the unconf, we can easily generate a blog post that @stefaniebutland would like to see. ;)

njtierney · 2017-05-05T01:53:17Z

So many interesting things to work on at the unconf!

jhollist · 2017-05-05T13:34:43Z

Agree with all that this is needed, and this thread helps summarize a lot. Just an idea: what about a gallery of tables with the code to produce them? Something similar to @haozhu233's example above, but for different typical table types.

haozhu233 · 2017-05-05T15:04:23Z

I feel like the gallery @jhollist just mentioned will definitely be super helpful. We can also borrow some ideas from the design of ggplot2, which offers both ggplot(), which is powerful and customizable, and qplot(), which provides shortcuts for common plot types.

bzkrouse · 2017-05-05T22:21:10Z

These are great ideas! I agree the lit review and gallery concept will both be very helpful and great resources for the broader community. It would be nice to take stock of what tools are out there and what types of tables should be covered. @haozhu233 - yes!! to your idea of structuring like ggplot2. That sounds ideal for a tool that is meant to be easy to use and easily extensible. Maybe we could try out a paradigm where you start with a simple table and add "layers" of details, complexities, and/or customizations.
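
One way to picture the "layers" paradigm is the sketch below. It is purely hypothetical: simple_table(), add_layer(), render(), and the two example layers are invented names for illustration and do not exist in any package mentioned in this thread. It uses only base R and the native pipe, so it assumes R 4.1 or later.

```r
# Hypothetical constructor: a data frame plus an (initially empty) list of layers.
simple_table <- function(df) {
  structure(list(data = df, layers = list()), class = "simple_table")
}

# A "layer" is a function from data frame to data frame, stored for later.
add_layer <- function(tbl, layer) {
  tbl$layers <- c(tbl$layers, list(layer))
  tbl
}

# Illustrative layer: round every numeric column.
round_numbers <- function(digits = 1) {
  function(df) {
    is_num <- vapply(df, is.numeric, logical(1))
    df[is_num] <- lapply(df[is_num], round, digits = digits)
    df
  }
}

# Illustrative layer: rename columns, given as old = "New" pairs.
rename_columns <- function(...) {
  new_names <- c(...)
  function(df) {
    idx <- match(names(new_names), names(df))
    names(df)[idx] <- unname(new_names)
    df
  }
}

# Rendering applies the stored layers in the order they were added.
render <- function(tbl) {
  Reduce(function(df, layer) layer(df), tbl$layers, tbl$data)
}

# Start with a simple table, then add details one layer at a time.
base_df <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
out <- simple_table(base_df) |>
  add_layer(round_numbers(1)) |>
  add_layer(rename_columns(cyl = "Cylinders", mpg = "Mean MPG")) |>
  render()
print(out)
```

Storing layers and applying them only at render time is one possible design. Applying each layer immediately would be simpler, but it would make it harder to reorder or inspect the layers afterwards.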

elinw · 2017-05-06T18:52:32Z

This is great, I actually mentioned something like this to @stefaniebutland in my talk with her. It's a huge issue in sociology because we make crosstabs a lot and they are really a pain overall in R especially with multiple variables. This is something I wrote to make crosstab making easier for my students (and me) https://github.com/elinw/lehmansociology/blob/master/R/crosstab.R but the print function is really painful. Even what should be simple frequency tables are hard in base R, this is what we came up with just to illustrate https://github.com/elinw/lehmansociology/blob/master/R/frequency.R. @sfirke I'm going to have a look at janitor!

elinw · 2017-05-08T16:23:35Z

Wow, kableExtra, nice!

If we are making a literature review then I think formattable should be in there. And of course tables.
I have a lot of PHP/Web experience and the way tables are handled in R always feels very different to me.

karawoo · 2017-05-08T20:24:54Z

Someone gave a lightning talk at the Seattle useR meetup on this topic a few months ago. He showed a few examples, one of which was tableone

haozhu233 · 2017-05-09T20:49:34Z

Just saw desctable on my github timeline. It seems to be another good fit for this issue.

bzkrouse · 2017-05-11T11:09:42Z

Nice, lots of examples :) desctable is really interesting! It seems to focus more on ease of process and content than on styling (it's my impression that some of these packages emphasize one or the other). I'll throw another one into the mix: arsenal, which gets more into stats and models. (@elinw this may be of interest to you for frequency tables...)

sfirke · 2017-05-11T15:17:44Z

There are also older-school table printing options, like gmodels::CrossTable() and there's one in Hmisc (I think summary?). A literature review of what's out there and how it differs would be a boon to folks navigating all of the options. Makes me think of reviews of a field of products from The WireCutter.

bzkrouse · 2017-05-17T19:58:23Z

Summary of this thread:

There are lots of existing packages/functions for creating and/or formatting tables of various types. There seems to be a consensus that more work may be needed in this area, but we first need to understand all that is available right now. The great discussion in #78 could inform this process. From there, we can determine what is needed going forward. Potential ideas for the unconf, summarized from discussion above:

1. Perform "lit review" of existing packages
- perform as a case study for Avoiding redundant / overlapping packages #78
- compare existing packages by trying them out on a set of common table types
- create a gallery of tables with the code to produce them
- create blog post
  - present results of lit review to benefit the community
  - reference WireCutter for ideas
2. Are there improvements to be made? If so, planning the future of tables in R:
- would be informed by the lit review
- consider extending existing packages or creating a new one
- borrow ideas from ggplot2

jhollist · 2017-05-17T20:09:03Z

I will be following along with the unconf remotely (via slack, issues, twitter, etc.). I'll keep my eye on this and if there is anything I can do remotely, would be happy to do so. If it makes sense we could chat via appear.in (I'm not on skype). I do really like this idea of "lit reviews" for packages. It feels like a more targeted/granular version of a task view and I think it could be very useful. We've had fits and starts of a discussion on what to do with https://github.com/ropensci/maptools. It was intended to be a Task View but we got some pushback due to the overlap with the Spatial Task View. Anyway, I think this general idea of targeted reviews could fill the void between "packages useful for a broad area" and "use package X to do Y". And thanks for the interesting discussion!


maelle · 2017-05-18T07:50:58Z

In case it wasn't in your list I just saw this https://gdemin.github.io/expss/

jsta · 2017-05-18T13:37:16Z

I was digging into the huxtable docs and found a vignette which compares the features of many table making packages: https://cran.r-project.org/web/packages/huxtable/vignettes/design-principles.html

wampeh1 · 2017-05-19T21:04:00Z

I work for the Federal Reserve Board (FRB). My duties include reading data from various sources (including PDF, Excel, and XML), processing these data, and recompiling them to produce tables and charts in LaTeX and FAME for publication purposes. I am currently searching for similar tools in R to replicate these processes (including creating tables). Very interested in this topic.

aammd · 2017-05-24T07:47:54Z

This issue reminds me a little of this silly joke flowchart I made. But perhaps a (slightly) more serious flowchart would be helpful to people?

I also wonder if it would be possible to create some kind of DSL for making tables that works with the pipe operator? Similar to @haozhu233's suggestion to use something like a ggplot2 syntax.

sfirke · 2017-05-24T13:37:00Z

A grammar of tables with modular piping functions, à la ggplot2, would be wonderful. I have been stumbling toward something similar, though in a limited use case (simple one-way and two-way tabulations). So far I have (on a dev branch):

library(janitor)
mtcars %>%
  crosstab(cyl, am) %>% 
  adorn_totals("row") %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting() %>% 
  adorn_ns()

#>     cyl          0          1
#> 1     4 27.3%  (3) 72.7%  (8)
#> 2     6 57.1%  (4) 42.9%  (3)
#> 3     8 85.7% (12) 14.3%  (2)
#> 4 Total 59.4% (19) 40.6% (13)

But this is hardly a grammar - just a vote of enthusiasm for going in that direction 😀

gshotwell · 2017-05-24T14:45:03Z

dplyr::case_when() might be a good model for table formatting. Maybe something like this:

mtcars %>% 
  group_by(cyl) %>% 
  summarize(
            n = n(), 
            price = 10000 * wt,
            percent_wt = wt / sum(wt)) %>% 
  format(
          n ~ 'comma',
          price ~ 'euro',
          percent_wt ~ 'percent')
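
To make the formula idea concrete, here is a minimal base R sketch of how a `column ~ "style"` specification could be interpreted. Everything in it is an assumption for illustration: format_columns() is a hypothetical helper (named differently from the proposal above so that it does not mask base::format()), the three formatter names simply mirror the ones in the proposal, and the input numbers are made up.

```r
# Hypothetical helper: apply a named formatter to each column given as
# `column ~ "style"`. Not part of dplyr or any existing package.
format_columns <- function(df, ...) {
  formatters <- list(
    comma   = function(x) format(x, big.mark = ",", trim = TRUE),
    euro    = function(x) paste0("\u20ac", format(round(x), big.mark = ",", trim = TRUE)),
    percent = function(x) sprintf("%.1f%%", 100 * x)
  )
  for (spec in list(...)) {
    stopifnot(inherits(spec, "formula"), length(spec) == 3)
    column <- as.character(spec[[2]])  # left-hand side: column name
    style  <- spec[[3]]                # right-hand side: formatter name
    stopifnot(column %in% names(df), style %in% names(formatters))
    df[[column]] <- formatters[[style]](df[[column]])
  }
  df
}

# Made-up summary data, one row per group.
summary_df <- data.frame(
  cyl        = c(4, 6, 8),
  n          = c(1100L, 70L, 14000L),
  price      = c(22857.3, 31171.4, 39992.1),
  percent_wt = c(0.244, 0.212, 0.544)
)

formatted <- format_columns(
  summary_df,
  n ~ "comma",
  price ~ "euro",
  percent_wt ~ "percent"
)
print(formatted)
```

The formulas are never evaluated as models. They are only taken apart into a column name and a formatter name, which is what makes the call read a bit like case_when().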

elinw · 2017-05-25T13:51:04Z

@gshotwell I like that idea a lot, in a strange way it's like tables but more like normal language. It would be great if there were a dplyr n() function (with some other name) that is a real function, I always end up needing n in doing calculations when creating tables.

stefaniebutland · 2017-10-24T18:05:58Z

"Combining the two issues, we set out to create a guide that could help users navigate package selection, using the case of reproducible tables as a case study."

Repo: https://github.com/ropenscilabs/packagemetrics
Blog post: packagemetrics - Helping you choose a package since runconf17

noamross added the publication label May 7, 2017

noamross added the reproducibility label May 9, 2017

sfirke mentioned this issue May 11, 2017

Avoiding redundant / overlapping packages #78

Open
