ggduo: pairs plots for multiple regression, cca, time series

Barret Schloerke edited this page Mar 25, 2016 · 15 revisions

Background

The functions ggpairs and ggscatmat in GGally provide generalized pairs plots for a data frame in R. All pairs of variables are displayed in a matrix layout, with the default plot for each cell depending on the types of the two variables. The diagonal contains univariate displays. These functions extend the classic pairs function in base R, which only handles real-valued variables, so that different variable types are handled flexibly and the plots are drawn with the graphics package ggplot2.
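As a minimal sketch of the existing behaviour, the calls below build a generalized pairs plot for the built-in iris data, which mixes four numeric columns with one factor, and a numeric-only scatterplot matrix. The choice of iris and the variable names p_mixed, p_numeric and n_panels are illustrative assumptions, and the GGally calls are guarded so the script still runs when the package is not installed.

```r
# iris has 4 numeric columns and 1 factor (Species), so a full pairs
# plot has one panel for every ordered pair of columns.
n_panels <- ncol(iris)^2

# Guarded: only runs when GGally is installed.
if (requireNamespace("GGally", quietly = TRUE)) {
  # Generalized pairs plot: panel type depends on the variable types.
  p_mixed <- GGally::ggpairs(iris)

  # Scatterplot matrix restricted to the real-valued columns.
  p_numeric <- GGally::ggscatmat(iris, columns = 1:4)

  print(p_mixed)
  print(p_numeric)
}
```

Every column is plotted against every other column here, which is the behaviour the rest of this page proposes to relax.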

This is appropriate for multivariate data, where we want to see every variable plotted against every other. But in many problems, such as regression or multiple time series, there are two groups of variables, e.g. response variables and explanatory variables, and we would like to see one group plotted against the other. New functions are needed to accomplish this.
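The two-group layout can be sketched with base R graphics alone. This is only an illustration of the idea, not the proposed implementation: the mtcars data and the split into x_vars (explanatory) and y_vars (response) are assumptions made for the example.

```r
# Hypothetical split of mtcars into explanatory and response variables.
x_vars <- c("wt", "hp", "disp")
y_vars <- c("mpg", "qsec")

# One row per (x, y) combination; x varies fastest, so panels fill
# row by row with one response per row.
grid <- expand.grid(x = x_vars, y = y_vars, stringsAsFactors = FALSE)

op <- par(mfrow = c(length(y_vars), length(x_vars)), mar = c(4, 4, 1, 1))
for (i in seq_len(nrow(grid))) {
  plot(mtcars[[grid$x[i]]], mtcars[[grid$y[i]]],
       xlab = grid$x[i], ylab = grid$y[i])
}
par(op)  # restore the previous graphics settings

# Panels needed: one group vs the other, compared with all pairs.
n_two_group <- nrow(grid)
n_full <- (length(x_vars) + length(y_vars))^2
```

With three explanatory and two response variables, the two-group display needs far fewer panels than a full pairs plot of all five columns, and every panel it does draw is one of interest.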

Related work

GGally package description:

The R package 'ggplot2' is a plotting system based on the grammar of graphics. 'GGally' extends 'ggplot2' by adding several functions to reduce the complexity of combining geoms with transformed data. Some of these functions include a pairwise plot matrix, a scatterplot plot matrix, a parallel coordinates plot, a survival plot, and several functions to plot networks.

The 'stats' package has a ts.plot function that draws multiple time series in a single plot. Canonical correlation analysis has no dedicated plotting function for displaying the associations between two sets of variables. Current examples use ggpairs, which displays all pairs of columns even though only a subset of the combinations is needed. Multiple regression is in the same position: it is currently plotted with ggpairs, although only a subset of the pairs of columns needs to be displayed.
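For reference, the existing base R tools mentioned above can be exercised as follows. The datasets (EuStockMarkets and LifeCycleSavings, both shipped with R) and the grouping of columns into pop and oec are assumptions chosen for illustration.

```r
# Multiple time series in one plot: EuStockMarkets holds four
# daily stock index series (DAX, SMI, CAC, FTSE).
ts.plot(EuStockMarkets,
        gpars = list(col = 1:4, ylab = "Closing price"))
legend("topleft", legend = colnames(EuStockMarkets), col = 1:4, lty = 1)

# Canonical correlation between two groups of columns.
pop <- LifeCycleSavings[, c("pop15", "pop75")]
oec <- LifeCycleSavings[, c("sr", "dpi", "ddpi")]
cc <- cancor(pop, oec)

# cancor returns numbers only; there is no accompanying plot of
# one group of variables against the other.
print(cc$cor)
```

The canonical correlations summarize the association between the two groups, but seeing which variables drive it still requires plotting each column of one group against each column of the other.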

Details of your coding project

The outcomes of the project are:

  • An R function implementing a generalized pairs plot for two groups of variables, similar to the ggpairs function, and ideally a second function, like ggscatmat, for the case where all variables are real-valued
  • Several cognostics for choosing which variables to plot and for sorting the plots, for data with large numbers of variables
  • Merge of code into master GGally code base
  • Vignette illustrating usage
  • Vignette page on GGally's website
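One way a cognostic could be used to sort plots is sketched below. The absolute Pearson correlation is an assumed stand-in, since the proposal does not say which cognostics will be developed, and the mtcars variables and the names pairs_df and ranked are likewise illustrative.

```r
# Hypothetical explanatory and response groups from mtcars.
x_vars <- c("wt", "hp", "disp")
y_vars <- c("mpg", "qsec")

# Correlation of every x column with every y column (3 x 2 matrix).
cors <- cor(mtcars[, x_vars], mtcars[, y_vars])

# Flatten to one row per (x, y) pair. expand.grid varies x fastest,
# matching the column-major order of as.vector(cors).
pairs_df <- expand.grid(x = x_vars, y = y_vars, stringsAsFactors = FALSE)
pairs_df$abs_cor <- abs(as.vector(cors))

# Strongest linear associations first.
ranked <- pairs_df[order(pairs_df$abs_cor, decreasing = TRUE), ]
print(ranked)
```

With many variables, only the top few rows of such a ranking would be plotted. Other cognostics, for example measures of nonlinearity or outlyingness, could replace abs_cor without changing the sorting step.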

Expected impact

GGally is currently downloaded more than 14,000 times a month (download statistics from the RStudio CRAN mirror only). The new ggduo function would be immediately available to all R users through CRAN. The GGally package currently has 9 contributing authors and many more users who submit issues on GitHub. This level of use suggests that the new functionality would fill a gap in a popular package.

With a single line of code, R users would be able to plot multiple time series, whether the variables are categorical or continuous, as well as arbitrary combinations of variables, which would support multivariate regression and canonical correlation problems. This would encourage people to look at their data and help them build better models.

Mentors

Once you have a solution to the medium and/or the hard problem, please get in touch with Dianne Cook and Ryan Hafen.

Tests

Potential students can complete the following tests to demonstrate their capabilities for this project.

  • Easy: Install the GGally package from GitHub (you might have to install the devtools package first). Run one of the examples, put the chart in a knitr/Rmarkdown document, and write a paragraph to explain the chart.
  • Medium: Make a new correlation plot that can be used in ggpairs. The plot function should change the background color depending on the correlation value, and the regular background and grid should be removed from the plot.
  • Hard: Make a function that prints the legend of a ggplot2 object. Display the results in a knitr/Rmarkdown document.
  • Hard: All documents should be vignettes in a small package that depends upon GGally. The R code necessary will be located in the R directory and documented.

Solutions of tests

Schloerke Solution Github Repo

Links to vignettes

References

  • Emerson, John W., Walton A. Green, Barret Schloerke, Di Cook, Heike Hofmann, and Hadley Wickham. The Generalized Pairs Plot, Journal of Computational and Graphical Statistics, 22 (1), 79-91; doi: 10.1080/10618600.2012.694762, 2012.
  • Wickham H., ggplot2: Elegant graphics for data analysis. useR, Springer, July 2009.
  • Wilkinson L., The Grammar of Graphics, Statistics and Computing, Springer, 1999.
  • Everitt, B. and Hothorn, T. An Introduction to Applied Multivariate Analysis with R, Springer, 2011.
  • Hyndman, R. J. and Athanopoulos, G. Forecasting Principles and Practice, https://www.otexts.org/book/fpp/, 2013.
  • Tukey, J. W. The Collected Works, 1965-1985; see the pages on cognostics (Google Books).