# 3 Geometries

A plot’s geometry dictates what visual elements will be used. In this chapter, we’ll familiarize you with the geometries used in the three most common plot types you’ll encounter: scatter plots, bar charts and line plots. We’ll look at a variety of different ways to construct these plots.

# Overplotting 1: Large datasets

Scatter plots (using geom_point()) are intuitive, easily understood, and very common, but we must always consider overplotting, particularly in the following four situations:

- Large datasets
- Aligned values on a single axis
- Low-precision data
- Integer data

Typically, alpha blending (i.e. adding transparency) is recommended when using solid shapes. Alternatively, you can use opaque, hollow shapes.

Small points are suitable for large datasets with regions of high density (lots of overlapping).
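
To see why alpha blending reveals density, it can help to work out how opacity builds up when semi-transparent points of the same color overlap. The sketch below assumes standard "over" compositing, where each extra layer hides a fraction alpha of what is still showing through; `stacked_opacity` is a hypothetical helper name, not part of ggplot2.

```r
# Combined opacity of n overlapping points, each drawn with the same alpha.
# Assumes standard "over" compositing: every layer hides a fraction `alpha`
# of whatever is still visible underneath.
stacked_opacity <- function(alpha, n) {
  1 - (1 - alpha)^n
}

# Opacity after 1 to 5 overlapping points at alpha = 0.5
overlap <- 1:5
data.frame(points = overlap, opacity = stacked_opacity(0.5, overlap))
```

At alpha = 0.5 a region is almost fully opaque after only a handful of overlaps, so very dense data may call for a smaller alpha.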

Let's use the diamonds dataset to practice dealing with the large dataset case.

# Instructions:

- Add a points layer to the base plot.
- Set the point transparency to 0.5.
- Set shape to ".", which draws each point as a single pixel.

In [None]:
# Plot price vs. carat, colored by clarity
plt_price_vs_carat_by_clarity <- ggplot(diamonds, aes(carat, price, color = clarity))

# Add a point layer with tiny points
plt_price_vs_carat_by_clarity + geom_point(alpha = 0.5, shape = ".")

- Update the point shape to remove the line outlines by setting shape to 16.

In [None]:
# Plot price vs. carat, colored by clarity
plt_price_vs_carat_by_clarity <- ggplot(diamonds, aes(carat, price, color = clarity))

# Use shape 16 (solid circle, no outline), keeping transparency at 0.5
plt_price_vs_carat_by_clarity + geom_point(alpha = 0.5, shape = 16)

# Overplotting 2: Aligned values

Let's take a look at another case where we should be aware of overplotting: aligned values on a single axis.

This occurs when one axis is continuous and the other is categorical, which can be overcome with some form of jittering.
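
As a rough illustration of what jittering does, the sketch below uses base R's `jitter()` to add uniform random noise to values that all sit at the same position. It assumes, as the ggplot2 documentation describes for position_jitter(), that a width of 0.3 moves each point by at most 0.3 in either direction; the variable names are illustrative.

```r
set.seed(42)  # make the random noise reproducible

# Ten observations that all share the same x position (e.g. one category)
x_aligned <- rep(2, 10)

# Add uniform noise in [-0.3, 0.3]; the total spread is twice the amount
x_jittered <- jitter(x_aligned, amount = 0.3)

# Smallest and largest displacement actually applied
range(x_jittered - x_aligned)
```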

In the mtcars dataset, fcyl and fam are categorical versions of cyl and am, respectively.
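
fcyl and fam are not columns of the built-in mtcars data, so they need to be created before the plots here will run. The following is one plausible way to build them, assuming plain factor conversions. The labels "automatic" and "manual" follow the coding documented in `?mtcars` (0 = automatic, 1 = manual), but the exact labels used in the course data are an assumption.

```r
# Work on a copy so the built-in dataset is left untouched
mtcars <- datasets::mtcars

# Categorical version of the number of cylinders (levels "4", "6", "8")
mtcars$fcyl <- factor(mtcars$cyl)

# Categorical version of transmission type; in ?mtcars, 0 = automatic, 1 = manual
# (the label text is an assumption)
mtcars$fam <- factor(mtcars$am, levels = c(0, 1), labels = c("automatic", "manual"))

str(mtcars[c("cyl", "fcyl", "am", "fam")])
```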

# Instructions:

- Create a base plot plt_mpg_vs_fcyl_by_fam of mpg vs. fcyl, colored by fam.
- Add a points layer to the base plot.

In [None]:
# Plot base
plt_mpg_vs_fcyl_by_fam <- ggplot(mtcars, aes(fcyl, mpg, color = fam))

# Default points are shown for comparison
plt_mpg_vs_fcyl_by_fam + geom_point()

- Add some jittering by using position_jitter(), setting the width to 0.3.

In [None]:
# Plot base
plt_mpg_vs_fcyl_by_fam <- ggplot(mtcars, aes(fcyl, mpg, color = fam))

# Default points are shown for comparison
plt_mpg_vs_fcyl_by_fam + geom_point()

# Alter the point positions by jittering, width 0.3
plt_mpg_vs_fcyl_by_fam + geom_point(position = position_jitter(0.3))

- Alternatively, use position_jitterdodge(). Set jitter.width and dodge.width to 0.3 to separate subgroups further.

In [None]:
# Plot base
plt_mpg_vs_fcyl_by_fam <- ggplot(mtcars, aes(fcyl, mpg, color = fam))

# Default points are shown for comparison
plt_mpg_vs_fcyl_by_fam + geom_point()

# Now jitter and dodge the point positions
plt_mpg_vs_fcyl_by_fam + geom_point(position = position_jitterdodge(jitter.width = 0.3, dodge.width = 0.3))

# Overplotting 3: Low-precision data

You already saw how to deal with overplotting when using geom_point() in two cases:

- Large datasets
- Aligned values on a single axis

For large datasets we used transparency and small points; for aligned values we jittered the points with position_jitter() or position_jitterdodge().

Let's take a look at another case:

- Low-precision data

This results from low-resolution measurements, as in the iris dataset, which is measured to 1 mm precision. It's similar to case 2, but here we can jitter on both the x and y axes.

# Instructions:

- Change the points layer into a jitter layer.
- Reduce the jitter layer's width by setting the width argument to 0.1.

In [None]:
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  # Swap for jitter layer with width 0.1
  geom_jitter(alpha = 0.5, width = 0.1)

- Let's use a different approach: Within geom_point(), set position to "jitter".

In [None]:
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  # Set the position to jitter
  geom_point(alpha = 0.5, position = "jitter")

- Provide an alternative specification: Have the position argument call position_jitter() with a width of 0.1.

In [None]:
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  # Use a jitter position function with width 0.1
  geom_point(alpha = 0.5, position = position_jitter(0.1))

# Overplotting 4: Integer data

Let's take a look at the last case of dealing with overplotting:

- Integer data

This can be variables of type integer (e.g. 1, 2, 3…) or categorical variables (i.e. class factor). A factor is stored internally as an integer vector with a levels attribute.

You'll typically have a small, defined number of intersections between two variables, which is similar to case 3, but you may miss it if you don't realize that integer and factor data behave like low-precision data.
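
The link between factors and integers can be checked directly in base R. This is a minimal sketch with a made-up factor named `size`; it is not part of the Vocab data.

```r
# A small, made-up factor with three levels
size <- factor(c("small", "large", "medium", "small"),
               levels = c("small", "medium", "large"))

typeof(size)      # storage type of the underlying vector
class(size)       # the class that makes R treat it as categorical
as.integer(size)  # the integer codes that index into levels(size)
levels(size)      # the labels the codes refer to
```

Because only a handful of distinct codes exist, every observation lands on one of a few grid positions, which is exactly the situation where jittering helps.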

The Vocab dataset provided contains the years of education and vocabulary test scores from respondents to US General Social Surveys from 1972-2004.

# Instructions:

- Examine the Vocab dataset using str().
- Using Vocab, draw a plot of vocabulary vs education.
- Add a point layer.

In [None]:
# Examine the structure of Vocab
str(Vocab)

# Plot vocabulary vs. education
ggplot(Vocab, aes(education, vocabulary)) +
  # Add a point layer
  geom_point()

- Replace the point layer with a jitter layer.

In [None]:
ggplot(Vocab, aes(education, vocabulary)) +
  # Change to a jitter layer
  geom_jitter()

- Set the jitter transparency to 0.2.

In [None]:
ggplot(Vocab, aes(education, vocabulary)) +
  # Set the transparency to 0.2
  geom_jitter(alpha = 0.2)

- Set the shape of the jittered points to hollow circles, (shape 1).

In [None]:
ggplot(Vocab, aes(education, vocabulary)) +
  # Set the shape to 1
  geom_jitter(alpha = 0.2, shape = 1)

# Drawing histograms

Recall that histograms cut up a continuous variable into discrete bins and, by default, map the internally calculated count variable (the number of observations in each bin) onto the y aesthetic. An internal variable called density can be accessed by using the .. notation, i.e. ..density.. (two dots on each side). Plotting this variable shows the frequency density of each bin; the relative frequency of a bin is its height times its width.
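
The relationship between count, density and relative frequency can be checked with a small base R sketch. It uses `hist()` with `plot = FALSE` rather than ggplot2, and the break points (10 to 34 in steps of 1) are an assumption chosen to cover the range of mpg, so the bins need not line up exactly with those drawn by geom_histogram().

```r
# Bin mpg into bins of width 1 without drawing anything
mpg <- datasets::mtcars$mpg
bins <- hist(mpg, breaks = seq(10, 34, by = 1), plot = FALSE)

bin_width <- diff(bins$breaks)

# Density is the count scaled by the sample size and the bin width
density_by_hand <- bins$counts / (length(mpg) * bin_width)
all.equal(density_by_hand, bins$density)

# Relative frequency of each bin = height (density) x width; these sum to 1
relative_freq <- bins$density * bin_width
sum(relative_freq)
```

In ggplot2 3.4.0 and later, the `..density..` notation is deprecated in favor of `after_stat(density)`, so `aes(mpg, after_stat(density))` is the forward-compatible spelling of the same mapping.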

# Instructions:

- Using mtcars, map mpg onto the x aesthetic.
- Add a histogram layer using geom_histogram().

In [None]:
# Plot mpg
ggplot(mtcars, aes(mpg)) +
  # Add a histogram layer
  geom_histogram()

- Set the histogram binwidth to 1.

In [None]:
ggplot(mtcars, aes(mpg)) +
  # Set the binwidth to 1
  geom_histogram(binwidth = 1)

- Map y to the internal variable ..density.. to show frequency densities.

In [None]:
# Map y to ..density..
ggplot(mtcars, aes(mpg, ..density..)) +
  geom_histogram(binwidth = 1)

- Set the fill color of the histogram bars to datacamp_light_blue.

In [None]:
datacamp_light_blue <- "#51A8C9"
ggplot(mtcars, aes(mpg, ..density..)) +
  # Set the fill color to datacamp_light_blue
  geom_histogram(binwidth = 1, fill = datacamp_light_blue)

# Positions in histograms

Here, we'll examine the various ways of applying positions to histograms. geom_histogram(), a special case of geom_bar(), has a position argument that can take on the following values:

- stack (the default): Bars for different groups are stacked on top of each other.
- dodge: Bars for different groups are placed side by side.
- fill: Bars for different groups are shown as proportions.
- identity: Bars are drawn exactly where the computed values put them, without adjustment, so bars for different groups overlap.

# Instructions:

- Update the aesthetics so that the fill color of the bars is determined by fam.

In [None]:
# Update the aesthetics so the fill color is by fam
ggplot(mtcars, aes(mpg, fill = fam)) +
  geom_histogram(binwidth = 1)

- Update the histogram layer to position the bars side-by-side, that is, "dodge".

In [None]:
ggplot(mtcars, aes(mpg, fill = fam)) +
  # Change the position to dodge
  geom_histogram(binwidth = 1, position = "dodge")

- Update the histogram layer so the bars' positions "fill" the y-axis.

In [None]:
ggplot(mtcars, aes(mpg, fill = fam)) +
  # Change the position to fill
  geom_histogram(binwidth = 1, position = "fill")

- Update the histogram layer so the bars are drawn on top of each other, using the "identity" position. So that each bar can be seen, set alpha to 0.4.

In [None]:
ggplot(mtcars, aes(mpg, fill = fam)) +
  # Change the position to identity, with transparency 0.4
  geom_histogram(binwidth = 1, position = "identity", alpha = 0.4)

# Position in bar and col plots

Let's see how the position argument changes geom_bar().

We have three position options:

- stack: The default
- dodge: Places bars side by side; usually preferred for comparing groups
- fill: To show proportions

While we will be using geom_bar() here, note that the function geom_col() is just geom_bar() with the stat argument set to "identity" (its position still defaults to "stack"). It is used when we want the heights of the bars to represent the exact values in the data.

In this exercise, you'll draw the total count of cars having a given number of cylinders (fcyl), according to manual or automatic transmission type (fam).
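
The numbers behind these bars can be tabulated directly in base R, which is a handy way to check what each position shows. This sketch uses the raw cyl and am columns (am is coded 0 = automatic, 1 = manual in `?mtcars`) instead of fcyl and fam, so it runs on the unmodified dataset.

```r
cars <- datasets::mtcars

# Counts per cylinder/transmission combination: the bar heights for
# position = "stack" (stacked) or "dodge" (side by side)
counts <- table(cyl = cars$cyl, am = cars$am)
counts

# Within-cylinder proportions: the segment heights for position = "fill"
props <- prop.table(counts, margin = 1)
round(props, 2)
```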

# Instructions:

- Using mtcars, plot fcyl, filled by fam.
- Add a bar layer using geom_bar().

In [None]:
# Plot fcyl, filled by fam
ggplot(mtcars, aes(fcyl, fill = fam)) +
  # Add a bar layer
  geom_bar()

- Set the bar position argument to "fill".

In [None]:
ggplot(mtcars, aes(fcyl, fill = fam)) +
  # Set the position to "fill"
  geom_bar(position = "fill")

- Change the bar position argument to "dodge".

In [None]:
ggplot(mtcars, aes(fcyl, fill = fam)) +
  # Change the position to "dodge"
  geom_bar(position = "dodge")

# Overlapping bar plots

You can customize bar plots further by adjusting the dodging so that your bars partially overlap each other. Instead of using position = "dodge", you're going to use position_dodge(), like you did with position_jitter() in the previous exercises. You can also save the result as an object, for example posn_d, so that you can easily reuse it.

Remember, the reason you want to use position_dodge() (and position_jitter()) is to specify how much dodging (or jittering) you want.

For this example, you'll use the mtcars dataset.
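
Saving the position specification in an object might look like the sketch below. It assumes ggplot2 is installed, and it uses `factor(am)` as a stand-in for fam so that it runs on the built-in mtcars data.

```r
library(ggplot2)

# Store the dodge specification once so it can be reused across layers
posn_d <- position_dodge(width = 0.2)

# factor(am) stands in for fam here (an assumption, to avoid extra columns)
plt_overlap <- ggplot(mtcars, aes(cyl, fill = factor(am))) +
  geom_bar(position = posn_d, alpha = 0.6)

# Call print(plt_overlap) to draw the plot
```

Keeping the width in one place means every layer that should share the same dodging can be passed `position = posn_d` instead of repeating the number.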

# Instructions:

- Use the functional form of the bar position: replace "dodge" with a call to position_dodge().
- Set its width to 0.2.

In [None]:
ggplot(mtcars, aes(cyl, fill = fam)) +
  # Change position to use the functional form, with width 0.2
  geom_bar(position = position_dodge(width = 0.2))

- Set the transparency of the bars to 0.6.

In [None]:
ggplot(mtcars, aes(cyl, fill = fam)) +
  # Set the transparency to 0.6
  geom_bar(position = position_dodge(width = 0.2), alpha = 0.6)

# Bar plots: sequential color palette

In this bar plot, we'll fill each segment according to an ordinal variable. The best way to do that is with a sequential color palette.

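Here's an example of setting a ColorBrewer palette with the mtcars dataset. Note that "Set1" is a qualitative palette, which suits an unordered variable such as fam; for an ordinal variable, pick a sequential palette such as "Blues":

ggplot(mtcars, aes(fcyl, fill = fam)) +
  geom_bar() +
  scale_fill_brewer(palette = "Set1")

In the exercise, you'll use similar code on the Vocab dataset, where the fill variable, vocabulary, is ordinal.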

# Instructions:

- Plot the Vocab dataset, mapping education onto x and vocabulary onto fill.

In [None]:
# Plot education, filled by vocabulary
ggplot(Vocab, aes(education, fill = vocabulary))

- Add a bar layer, setting position to "fill".

In [None]:
# Plot education, filled by vocabulary
ggplot(Vocab, aes(education, fill = vocabulary)) +
  # Add a bar layer with position "fill"
  geom_bar(position = "fill")

- Add a brewer fill scale, using the default palette (don't pass any arguments). Notice how this generates a warning message and an incomplete plot: vocabulary has 11 possible scores (0 to 10), but the default palette provides at most 9 colors.

In [None]:
# Plot education, filled by vocabulary
ggplot(Vocab, aes(education, fill = vocabulary)) +
  # Add a bar layer with position "fill"
  geom_bar(position = "fill") +
  # Add a brewer fill scale with default palette
  scale_fill_brewer()
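
The warning in the last step is consistent with vocabulary having 11 possible scores (0 to 10), more than the default palette can supply. Assuming the RColorBrewer package is installed, its palette metadata can be inspected as in this sketch; the variable name `n_scores` is illustrative.

```r
library(RColorBrewer)

# Palette metadata: maximum number of colors and palette category
brewer.pal.info[c("Blues", "Set1"), c("maxcolors", "category")]

# "Blues" (the default sequential palette) tops out at 9 colors,
# fewer than the 11 possible vocabulary scores (0 to 10)
n_scores <- length(0:10)
n_scores > brewer.pal.info["Blues", "maxcolors"]
```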

# Basic line plots

Here, we'll use the economics dataset to make some line plots. The dataset contains a time series for unemployment and population statistics from the Federal Reserve Bank of St. Louis in the United States. The data is contained in the ggplot2 package.

To begin with, you can look at how the number of unemployed people and the unemployment rate (the number of unemployed people as a proportion of the population) change over time.

# Instructions:

- Print the head of the economics dataset.
- Plot unemploy vs. date as a line plot.

In [None]:
# Print the head of economics
head(economics)

# Using economics, plot unemploy vs. date
ggplot(economics, aes(date, unemploy)) +
  # Make it a line plot
  geom_line()

- Adjust the y-axis aesthetic of the plot so that it represents the proportion of the population that is unemployed.

In [None]:
# Change the y-axis to the proportion of the population that is unemployed
ggplot(economics, aes(date, unemploy / pop)) +
  geom_line()

# Multiple time series

We already saw how the form of your data affects how you can plot it. Let's explore that further with multiple time series. Here, it's important that all lines are on the same scale, and if possible, on the same plot.

fish.species contains the global capture rates of seven salmon species from 1950–2010. Each variable (column) is a salmon species and each observation (row) is one year. fish.tidy contains the same data, but in three columns: Species, Year, and Capture (i.e. one variable per column).
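
The difference between the two layouts can be sketched with a tiny made-up data frame. The values below are invented for illustration and are not the real capture figures; `fish_wide` and `fish_long` are hypothetical names standing in for fish.species and fish.tidy.

```r
# Wide layout: one column per species (made-up numbers, two species only)
fish_wide <- data.frame(
  Year    = c(1950, 1951),
  Pink    = c(100, 200),
  Rainbow = c(0, 5)
)

species_cols <- setdiff(names(fish_wide), "Year")

# Long layout: one row per species/year combination
fish_long <- data.frame(
  Species = rep(species_cols, each = nrow(fish_wide)),
  Year    = rep(fish_wide$Year, times = length(species_cols)),
  Capture = unlist(fish_wide[species_cols], use.names = FALSE)
)

fish_long
```

The long layout is what lets a single Species column be mapped to group or color. If the tidyr package is available, `pivot_longer(fish_wide, -Year, names_to = "Species", values_to = "Capture")` should produce the same three columns.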

# Instructions:

- Use str() in the console to examine the structure of both fish.species and fish.tidy.
- Plot only the Rainbow salmon time series with geom_line().

In [None]:
# Plot the Rainbow Salmon time series
ggplot(fish.species, aes(x = Year, y = Rainbow)) +
  geom_line()

- Plot only the Pink salmon time series with geom_line().

In [None]:
# Plot the Rainbow Salmon time series
ggplot(fish.species, aes(x = Year, y = Rainbow)) +
  geom_line()

# Plot the Pink Salmon time series
ggplot(fish.species, aes(x = Year, y = Pink)) +
  geom_line()

- Now try to plot all time series in a single plot.
- Plot the fish.tidy dataset, mapping Year to x and Capture to y. Group by fish species within the aesthetics of geom_line().

In [None]:
# Plot the Rainbow Salmon time series
ggplot(fish.species, aes(x = Year, y = Rainbow)) +
  geom_line()

# Plot the Pink Salmon time series
ggplot(fish.species, aes(x = Year, y = Pink)) +
  geom_line()

# Plot multiple time-series by grouping by species
ggplot(fish.tidy, aes(Year, Capture)) +
  geom_line(aes(group = Species))

- Let's add color to the previous plot to distinguish between the different time series.
- Plot the fish.tidy dataset again, this time making sure to color by Species.

In [None]:
# Plot the Rainbow Salmon time series
ggplot(fish.species, aes(x = Year, y = Rainbow)) +
  geom_line()

# Plot the Pink Salmon time series
ggplot(fish.species, aes(x = Year, y = Pink)) +
  geom_line()

# Plot multiple time-series by grouping by species
ggplot(fish.tidy, aes(Year, Capture)) +
  geom_line(aes(group = Species))

# Plot multiple time-series by coloring by species
ggplot(fish.tidy, aes(Year, Capture, color = Species)) +
  geom_line(aes(group = Species))