madblobfish edited this page Aug 22, 2017 · 3 revisions



To describe how scales work, we will first describe the domain (the data space) and the range (the aesthetic space), and then outline the process by which one is mapped to the other.

Since an input variable is either discrete or continuous, the domain is either a set of values (stored as a factor, character vector or logical vector) or an interval on the real line (stored as a numeric vector of length 2). For example, in the mammals sleep dataset (msleep), the domain of the discrete variable vore is {carni, herbi, insecti, omni}, and the domain of the continuous variable bodywt is [0.005, 6654]. We often think of these as data ranges, but here we are focusing on their nature as input to the scale, i.e., as the domain of a function.
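As a rough illustration of the two kinds of domain, the sketch below uses a small made-up data frame (`animals`, with hypothetical columns `diet` and `weight`) rather than msleep itself, so it runs in base R without any packages.

```r
# Hypothetical toy data standing in for a real dataset
animals <- data.frame(
  diet = c("carni", "herbi", "omni", "herbi"),
  weight = c(0.005, 12.5, 6654, 3.2),
  stringsAsFactors = FALSE
)

# Discrete domain: the set of distinct values
discrete_domain <- sort(unique(animals$diet))

# Continuous domain: an interval, stored as a numeric vector of length 2
continuous_domain <- range(animals$weight)
```

Here `discrete_domain` is a character vector of the distinct diets, and `continuous_domain` holds the minimum and maximum weight.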

The range can also be discrete or continuous. For discrete scales, it is a vector of aesthetic values corresponding to the input values. For continuous scales, it is a 1d path through some more complicated space. For example, a colour gradient interpolates linearly from one colour to another. The range is either specified by the user when the scale is created, or by the scale itself.
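The two kinds of range can be sketched in base R. The colour names and the black-to-white gradient below are arbitrary choices for illustration, and `grDevices::colorRamp()` stands in for whatever interpolation a real scale uses.

```r
# Discrete range: one aesthetic value per input value (colours chosen arbitrarily)
discrete_range <- c(carni = "red", herbi = "darkgreen", omni = "blue")

# Continuous range: a path through RGB space from black to white.
# colorRamp() returns a function of a position in [0, 1].
gradient <- grDevices::colorRamp(c("black", "white"))

start  <- gradient(0)    # RGB triple at the start of the path
middle <- gradient(0.5)  # RGB triple halfway along the path
end    <- gradient(1)    # RGB triple at the end of the path
```

Each call to `gradient()` returns a one-row matrix of red, green and blue components on a 0 to 255 scale.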

Scaling takes place in three steps:

  • Transformation, for continuous domains only, transforms the raw data with functions such as a logarithm or square root.

    Scale transformation occurs before statistical transformation so stats are based on the transformed data. This ensures that log(x) vs. log(y) on linear scales looks the same as x vs. y on log scales.

    library(ggplot2)
    library(scales)  # provides reciprocal_trans(), sqrt_trans() and reverse_trans()

    # Untransformed baseline
    qplot(displ, cty, data = mpg)
    # Transformed position scale
    qplot(displ, cty, data = mpg) +
      scale_y_continuous(trans = reciprocal_trans())
    # Transformed non-position scales
    qplot(displ, cty, data = mpg, size = cty) +
      scale_size(trans = sqrt_trans())
    qplot(displ, cty, data = mpg, colour = cty) +
      scale_colour_continuous(trans = reverse_trans())
  • Training learns the domain of the scale by looking at the range of all the datasets to which it will be applied. This is simple in a plot of raw data with only one layer: it's the range of a continuous variable, or the levels of a discrete variable. More complicated plots require this training to be repeated over multiple layers in multiple panels.

    qplot(displ, cty, data = mpg) + facet_grid(. ~ cyl)
    qplot(displ, cty, data = mpg) + facet_grid(drv ~ cyl)
    qplot(displ, cty, data = mpg) + facet_grid(. ~ cyl, scales = "free")
    qplot(displ, cty, data = mpg) + facet_wrap(~ class)
    qplot(displ, cty, data = mpg) + facet_wrap(~ class, scales = "free")

    Training is performed both before and after statistical transformation, as described in the next section.

    The domain can also be specified directly, overriding the training process, by manually setting the domain of the scale with the limits argument. Any values outside of the domain of the scale are dealt with according to the oob (out of bounds) parameter of the scale.

  • Mapping converts data values to aesthetic values.
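The three steps can be sketched in base R for a continuous size scale. This is only an illustration of the idea, not how ggplot2 implements it: the data, the log10 transformation, and the `size_palette` function with its output sizes of 1 to 6 are all assumptions made for the example.

```r
x <- c(1, 10, 100, 1000)  # made-up raw data

# 1. Transformation: apply the scale's transformation to the raw data
x_trans <- log10(x)

# 2. Training: learn the domain from the transformed data
domain <- range(x_trans)

# 3. Mapping: rescale the domain to [0, 1], then apply a palette
rescaled <- (x_trans - domain[1]) / diff(domain)
size_palette <- function(p) 1 + p * 5  # hypothetical palette: sizes from 1 to 6
sizes <- size_palette(rescaled)
```

The smallest value ends up with size 1 and the largest with size 6, with the intermediate values evenly spaced because the data are evenly spaced on the log scale.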


Statistical transformations add an additional complication, because they can extend the range of an existing scale, or if they add a new aesthetic, require a complete pass of the scaling algorithm. The following examples illustrate these cases:

# Data that needs to be included in range
qplot(displ, cty, data = mpg) + geom_smooth(method = lm)

# New position scale
qplot(cty, data = mpg, binwidth = 10)
qplot(cty, data = mpg, binwidth = 10) + scale_y_sqrt()

# New non-position scale
# stat_sum() counts the observations at each location; the count n is mapped to size
ggplot(mpg, aes(class, manufacturer)) + stat_sum(aes(size = ..n..))
last_plot() + scale_size_area()
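The first case, where a statistic produces values that must be included in the range, can be imitated in base R. The sketch below uses the built-in mtcars data and `lm()` with a confidence band as a stand-in for a smoothing stat; it shows the idea of retraining, not ggplot2's internals.

```r
# Fit a linear model to a dataset that ships with base R
fit <- lm(mpg ~ wt, data = mtcars)

# Confidence band for the fitted line: columns fit, lwr, upr
band <- predict(fit, interval = "confidence")

# Domain learned from the raw data alone
domain_raw <- range(mtcars$mpg)

# Domain after also training on the values produced by the "stat"
domain_with_stat <- range(c(mtcars$mpg, band[, "lwr"], band[, "upr"]))
```

The retrained domain can never be narrower than the raw one, and it is wider whenever the band extends past the observed data.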

Yet another complication is that some statistics need to know the range of the axis. For example, stat_bin (the stat behind histograms) needs it to compute consistent breaks across panels, and stat_smooth needs it for the fullrange option. This means that scales need to be trained twice: once for position scales prior to stat transformation, and once for all scales after stat transformation and position adjustment.
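To see why a binning statistic needs the trained range first, consider this base R sketch with two made-up panels. The choice of four equal-width bins is arbitrary; the point is that the breaks are derived from the shared range, so both panels are binned identically.

```r
panel_a <- c(1, 2, 2, 3)  # made-up data for one panel
panel_b <- c(6, 7, 9)     # made-up data for another panel

# First training pass: position range across all panels
shared_range <- range(c(panel_a, panel_b))

# Breaks computed once from the shared range (four equal-width bins)
breaks <- seq(shared_range[1], shared_range[2], length.out = 5)

# Each panel is then binned with the same breaks
counts_a <- table(cut(panel_a, breaks, include.lowest = TRUE))
counts_b <- table(cut(panel_b, breaks, include.lowest = TRUE))
```

Had each panel computed breaks from its own data, the bins would not line up and the panels could not be compared directly.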

Writing your own

Writing your own scale is now fairly simple: you need to figure out which component of the scale you want to modify, and write a new function for that component:

  • The most common component to modify is the palette. A palette defines the scaled output. Continuous palettes should return a function that accepts a value between 0 and 1, and discrete palettes should return a function that accepts an integer giving the number of levels.

  • The rescaler function converts continuous data to the range [0, 1], to then be passed on to the palette. The two built-in rescalers are rescale and rescale_mid.

  • The out-of-bounds parameter, oob, defines what happens to scaled values that are outside the range of the limits. The built-in behaviours are censor and squish.

  • breaks and labels determine the breaks from the limits, and the labels from the breaks.
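The components listed above can be tried out individually with the scales package. In this sketch, `censor()`, `squish()` and `rescale()` are the built-in functions named in the list, while `grey_palette` and `size_palette` are hypothetical palettes written for the example, and the data and limits are made up.

```r
library(scales)

x <- c(-5, 0, 5, 10, 15)  # made-up data
limits <- c(0, 10)        # scale limits chosen for the example

# Out-of-bounds handling: censor() replaces outside values with NA,
# squish() moves them to the nearest limit
censored <- censor(x, range = limits)
squished <- squish(x, range = limits)

# Rescaler: map the limits onto [0, 1]
rescaled <- rescale(squished, to = c(0, 1), from = limits)

# Hypothetical continuous palette: takes values in [0, 1], returns greys
grey_palette <- function(p) grDevices::grey(p)

# Hypothetical discrete palette: takes the number of levels, returns that many sizes
size_palette <- function(n) seq(1, 6, length.out = n)

colours <- grey_palette(rescaled)
sizes <- size_palette(3)
```

Swapping `squish()` for `censor()` before rescaling would drop the out-of-range points instead of pinning them to the ends of the palette.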

To get started, I recommend downloading the source versions of ggplot2 and scales from CRAN. This will provide you with a large set of examples to learn from and build on.