
Parallel Coordinate Plots in ggplot2

YueHeeeee edited this page Apr 9, 2019 · 9 revisions

Background

Parallel coordinate plots allow a multivariate view of data by visualizing each row of a data set as a polyline across all dimensions, which are shown as vertical axes. Parallel coordinate plots are often, and for obvious reasons, criticized as "spaghetti plots". However, they do allow insight into the presence of low-dimensional clusters and outliers. The chart below shows a parallel coordinate plot of the (in)famous Iris data. Lines are colored by species. It is immediately clear that one of the species is well separated from the other two, while there is some overlap between the other two species.

In 2016 Martin Konrad provided a nice overview of the state of parallel coordinate plots in ggplot2. Konrad pointed out that the three ggplot2-related solutions were ggparcoord (GGally package, Schloerke et al., 2018) for numeric data, ggparallel (ggparallel package, Hofmann and Vendettuoli, 2013) for discrete data, and a straight-up ggplot2 (Wickham 2016) solution based on standard geoms. This situation has not changed much since.

The function ggparcoord from the GGally package implements parallel coordinate plots using ggplot2 functionality. However, it does not support the full functionality of ggplot2, such as faceting or additional layers. Additionally, while ggparcoord implements many options for scaling, as well as visualization options such as showing additional points and boxplots, the function does not allow the same flexibility as full ggplot2.
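
For orientation, a typical ggparcoord call on the Iris data might look like the sketch below. The choice of columns, the "uniminmax" scaling and the alpha value are illustrative assumptions, not settings taken from this page.

```r
library(GGally)

# Parallel coordinate plot of the four numeric iris measurements.
# Column 5 (Species) is used to colour the lines; "uniminmax" rescales
# every axis to the range [0, 1].
p_ggally <- ggparcoord(
  data = iris,
  columns = 1:4,
  groupColumn = 5,
  scale = "uniminmax",
  alphaLines = 0.5
)

p_ggally
```

Reshaping and scaling both happen inside the function call, so the user only sees the finished plot object.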

A conceptual problem of parallel coordinate plots is the way categorical variables are handled. The currently most common solution is to convert any categorical variable into an ordinal variable and then treat it in the same way as numeric variables. Treating categorical variables in this way introduces a lot of overplotting on each axis. Simple work-arounds such as jittering would be obvious solutions but do not work 'out of the box' within the ggplot2 framework.
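
The conversion and the resulting overplotting can be seen in a small base R sketch. The toy data frame and the jitter width of 0.2 are assumptions made for illustration only.

```r
# Hypothetical toy data: one categorical variable with three ordered levels.
toy <- data.frame(
  size = factor(
    c("small", "large", "medium", "small", "large"),
    levels = c("small", "medium", "large")
  )
)

# Common approach: replace each level by its integer code.
toy$size_num <- as.integer(toy$size)

# Several observations now share exactly the same axis position.
n_positions <- length(unique(toy$size_num))

# Manual jitter: spread observations within +/- 0.2 of their level code,
# so they stay close to their level but no longer coincide.
set.seed(1)
toy$size_jit <- toy$size_num + runif(nrow(toy), min = -0.2, max = 0.2)
```

Here five observations collapse onto three positions before jittering. The jitter has to be applied to the data by hand before plotting.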

Related work

Parallel coordinate plots have been implemented in analysis software since the mid 1980s (Inselberg 1985, Wegman 1990). Several packages in R are dedicated to visualizing parallel coordinate plots.

Using base plots, the main function for drawing parallel coordinate plots is parcoord, implemented in MASS (Venables and Ripley 2002). The package gclus (Hurley 2019) implements cparcoord to include panel color as a representation of the strength of a correlation between neighboring axes.

Within the ggplot2 environment there are several packages implementing parallel coordinate plots. For numeric variables there is the function ggparcoord from the GGally package; for categorical variables the ggparallel package provides an implementation of pcp-like plots, such as the Hammock plot (Schonlau 2003) and parsets (Kosara et al., 2006).

The bigPint Google Summer of Code project 2017 implemented static and interactive versions of parallel coordinate plots within the framework of plotting large data interactively. These functions are meant for exploration and discovery and are not fully parameterized for their appearance.

All of these implementations have in common that they describe highly specialized plots, in the sense that there are tens of parameters describing construction, type, and appearance of the plot. While this gives the user some flexibility, it runs counter to the modular approach of the tidyverse, and in particular to the layered approach of ggplot2: the implementations make use of ggplot2 but are not native to it.

Details of your coding project

One of the challenges in implementing parallel coordinate plots within the ggplot2 environment is that parallel coordinate plots do not have a straightforward place in the grammar of graphics because of the underlying projective geometry and the arbitrary number of variables. The general solution to this problem is to make parallel coordinate plots fit the framework by reshaping the original data: variables of the original data set are converted into a 'tidy' long format (Wickham 2014). Variables are then shown as levels along a (categorical) x axis, with their values plotted along the y axis (after some standardization).
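
The reshaping described above can be sketched with tidyr and standard ggplot2 geoms. This is a minimal illustration on the Iris data, not the project's implementation; the column names id, variable, value and value_std, and the min-max standardization, are assumptions.

```r
library(ggplot2)
library(tidyr)

# Give every observation an id so its values can be joined into one line.
iris_id <- cbind(id = seq_len(nrow(iris)), iris)

# Long format: one row per (observation, variable) pair.
iris_long <- pivot_longer(
  iris_id,
  cols = c(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width),
  names_to = "variable",
  values_to = "value"
)

# Standardize each variable to [0, 1] so that all axes share one y scale.
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))
iris_long$value_std <- ave(iris_long$value, iris_long$variable, FUN = rescale01)

# Keep the original column order along the categorical x axis.
iris_long$variable <- factor(iris_long$variable, levels = names(iris)[1:4])

# Each observation becomes one polyline across the four axes.
p_long <- ggplot(
  iris_long,
  aes(x = variable, y = value_std, group = id, colour = Species)
) +
  geom_line(alpha = 0.5)

p_long
```

The group aesthetic is what ties the four rows of one flower back into a single line.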

This means that any implementation has to define how much of this data processing is done automatically: in the existing packages all of the data processing is done inside the function and is therefore closed off from the user, whereas using basic ggplot2 functionality leaves all of the data processing to the user.

This project implements parallel coordinate plots using a modular approach consistent with the tidyverse framework. It distinguishes between data shaping functions and their visualisation. By breaking down complex data processing into its basic operations we can use a mix and match approach that allows full flexibility on the data processing side and allows us to make use of the full ggplot2 framework for the visualization.
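
One possible shape of such a modular design is sketched below. The helper names pcp_long and pcp_scale and the columns pcp_id, pcp_var and pcp_value are hypothetical and only illustrate the separation of concerns; the method labels "uniminmax" and "std" are borrowed from ggparcoord's scale argument.

```r
library(ggplot2)
library(tidyr)

# Hypothetical shaping step: selected columns to long format,
# one row per (observation, variable) pair.
pcp_long <- function(data, cols) {
  data$pcp_id <- seq_len(nrow(data))
  long <- pivot_longer(
    data,
    cols = all_of(cols),
    names_to = "pcp_var",
    values_to = "pcp_value"
  )
  long$pcp_var <- factor(long$pcp_var, levels = cols)
  long
}

# Hypothetical scaling step: rescale values within each variable.
pcp_scale <- function(long, method = c("uniminmax", "std")) {
  method <- match.arg(method)
  f <- switch(
    method,
    uniminmax = function(x) (x - min(x)) / (max(x) - min(x)),
    std = function(x) (x - mean(x)) / sd(x)
  )
  long$pcp_value <- ave(long$pcp_value, long$pcp_var, FUN = f)
  long
}

# Mix and match: shape, scale, then use any ggplot2 layer or facet.
iris_pcp <- pcp_scale(pcp_long(iris, names(iris)[1:4]), method = "std")

p_facets <- ggplot(iris_pcp, aes(pcp_var, pcp_value, group = pcp_id)) +
  geom_line(alpha = 0.4) +
  facet_wrap(~Species)

p_facets
```

Because the plot is built from an ordinary data frame and geom_line, faceting and extra layers work without special support.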

The outcomes of the project are:

  • An R package for a generalized version of parallel coordinate plots, consisting of a set of functions targeted at reshaping and rescaling the original data, and a set of geoms tailored for use with parallel coordinate plots. The package will be fully functional and completely documented.

  • A set of examples documenting the use and flexibility of the implementation.

  • A set of new tools that allows working with categorical variables in parallel coordinate plots more seamlessly.

Expected impact

The project provides a set of functions for working with parallel coordinate plots. By focusing on ggplot2 we want to reach the wider community of ggplot2 users who draw parallel coordinate plots.

Mentors

  • Heike Hofmann hofmann@iastate.edu is a Professor of Statistics at Iowa State University. Dr Hofmann is an expert in data visualization and has been a mentor for several GSOC students with R since 2016. She is the author of various R packages including ggparallel, ggmosaic, lvplot and x3ptools.
  • Di Cook dicook@monash.edu is a Professor at Monash and an authority in data visualization. She has been mentoring students in GSOC projects for years.

Tests

  • Easy: install packages GGally and ggparallel. Find or create two data sets to visualize with these packages. Choose the data such that they highlight the main purpose of ggparcoord (GGally) and ggparallel, respectively. Show the plots and describe them appropriately. Be creative (i.e. do not just use a standard example).
  • Medium: recreate your ggparcoord example from scratch using ggplot2 geoms such as geom_point and geom_line. Wrap the code into a (set of) function(s). Show that your code works on other examples. Comment on the balance that your function(s) strike(s) between ease of use and flexibility.
  • Hard: expand the functionality to include jittering. Jittered points should stay on their respective axes, but should be dispersed along an appropriate range along the y axis. Wrap your function(s) into a package. Write documentation and tests. Discuss additional functionality that you are planning to implement.

Solutions of tests

Students, please post a link to your test results here.

References

  • Hofmann H., Vendettuoli M.: Common Angle Plots as Perception-True Visualizations of Categorical Associations, IEEE Transactions on Visualization and Computer Graphics, 19(12), 2297-2305, 2013.
  • Inselberg A.: The Plane with Parallel Coordinates, The Visual Computer, 1(2), 69-91, 1985.
  • Kosara R., Bendix F., Hauser H.: Parallel Sets: Interactive Exploration and Visual Analysis of Categorical Data, IEEE Transactions on Visualization and Computer Graphics, 12(4), 558-568, 2006.
  • Schloerke B., Crowley J., Cook D., Briatte F., Marbach M., Thoen E., Elberg A., Larmarange J.: GGally: Extension to 'ggplot2', R package version 1.4.0.
  • Schonlau M.: Visualizing Categorical Data Arising in the Health Sciences Using Hammock Plots, Proc. of Section on Statistical Graphics ASA, 2003.
  • Venables W.N., Ripley B.D.: Modern Applied Statistics with S (4th ed), Springer, 2002.
  • Wegman E.: Hyperdimensional Data Analysis Using Parallel Coordinates, JASA, 85(411), 664-675, 1990.
  • Wickham H.: ggplot2: Elegant graphics for data analysis (2nd ed), Springer, 2016.
  • Wickham H.: Tidy data, The Journal of Statistical Software, 59, 2014.
  • Wilkinson L.: The Grammar of Graphics, Statistics and Computing, Springer, 1999.