# 2 Exploring Numerical Data

In this chapter, you will learn how to graphically summarize numerical data.

# Faceted histogram

In this chapter, you'll be working with the cars dataset, which records characteristics of all the new car models for sale in the US in a certain year. You will investigate the distribution of mileage across a categorical variable, but before you get there, you'll want to familiarize yourself with the dataset.

# Instructions:

- The cars dataset has been loaded in your workspace.

* Load the ggplot2 package.
* View the size of the data and the variable types using str().
* Plot a histogram of city_mpg faceted by suv, a logical variable indicating whether the car is an SUV or not.

In [None]:
# Load package
library(ggplot2)

# Learn data structure
str(cars)

# Create faceted histogram
ggplot(cars, aes(x = city_mpg)) +
  geom_histogram() +
  facet_wrap(~ suv)
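
The faceting step can also be tried without the course data. The sketch below uses a small, made-up data frame (`toy_cars` is hypothetical and only borrows the column names `city_mpg` and `suv`). It pairs the faceted histogram with a per-group summary and uses `label_both` so that each facet strip names the variable instead of showing a bare `TRUE` or `FALSE`.

```r
library(ggplot2)

# Hypothetical stand-in for the course data (values are invented)
toy_cars <- data.frame(
  city_mpg = c(18, 21, 24, 28, 15, 16),
  suv      = c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE)
)

# Size of the data and the type of each variable
str(toy_cars)

# Numeric counterpart of the facets: one median per level of suv
med <- tapply(toy_cars$city_mpg, toy_cars$suv, median)
med

# One histogram panel per level of suv, with labelled strips
p <- ggplot(toy_cars, aes(x = city_mpg)) +
  geom_histogram(binwidth = 2) +
  facet_wrap(~ suv, labeller = label_both)
p
```

Setting `binwidth` explicitly here is only a convenience for the tiny example; it avoids the message ggplot2 prints when it falls back to its default of 30 bins.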

# Boxplots and density plots

The mileage of a car tends to be associated with the size of its engine (as measured by the number of cylinders). To explore the relationship between these two variables, you could stick to using histograms, but in this exercise you'll try your hand at two alternatives: the box plot and the density plot.

# Instructions:

- A quick look at unique(cars$ncyl) shows that there are more possible levels of ncyl than you might think. Here, restrict your attention to the most common levels.

- Filter cars to include only cars with 4, 6, or 8 cylinders and save the result as common_cyl. The %in% operator may prove useful here.
- Create side-by-side box plots of city_mpg separated out by ncyl.
- Create overlaid density plots of city_mpg colored by ncyl.

In [None]:
# Load dplyr for filter() (and the %>% pipe used in later exercises)
library(dplyr)

# Filter cars with 4, 6, 8 cylinders
common_cyl <- filter(cars, ncyl %in% c(4, 6, 8))

# Create box plots of city mpg by ncyl
ggplot(common_cyl, aes(x = as.factor(ncyl), y = city_mpg)) +
  geom_boxplot()

# Create overlaid density plots for same data
ggplot(common_cyl, aes(x = city_mpg, fill = as.factor(ncyl))) +
  geom_density(alpha = .3)
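
To see why `%in%` is suggested for the filtering step, here is a minimal base R sketch on a made-up vector (the values are invented for illustration). `%in%` asks whether each element belongs to a set, whereas `==` compares element by element and silently recycles the shorter vector.

```r
# Hypothetical cylinder counts
ncyl <- c(8, 4, 6, 4, 12, 6)

# Membership test: TRUE wherever the value is 4, 6, or 8
keep_in <- ncyl %in% c(4, 6, 8)
ncyl[keep_in]

# Element-wise comparison against the recycled vector 4, 6, 8, 4, 6, 8:
# this is not a membership test and drops rows that should be kept
keep_eq <- ncyl == c(4, 6, 8)
ncyl[keep_eq]

# Tabulating the values is one way to find the most common levels
table(ncyl)
```

Because the length of `ncyl` is a multiple of 3 in this example, the `==` version runs without a warning, which is what makes the mistake easy to miss.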

# Compare distribution via plots

Which of the following interpretations of the plot is not valid?

# Instructions:

Possible answers

( ) The highest mileage cars have 4 cylinders.

( ) The typical 4 cylinder car gets better mileage than the typical 6 cylinder car, which gets better mileage than the typical 8 cylinder car.

( ) Most of the 4 cylinder cars get better mileage than even the most efficient 8 cylinder cars.

(x) The variability in mileage of 8 cylinder cars is similar to the variability in mileage of 4 cylinder cars.

# Marginal and conditional histograms

Now, turn your attention to a new variable: horsepwr. The goal is to get a sense of the marginal distribution of this variable and then compare it to the distribution of horsepower conditional on the price of the car being less than $25,000.

You'll be making two plots using the "data pipeline" paradigm, where you start with the raw data and end with the plot.

# Instructions:

- Create a histogram of the distribution of horsepwr across all cars and add an appropriate title. Start by piping in the raw dataset.
- Create a second histogram of the distribution of horsepower, but only for those cars that have an msrp less than $25,000. Keep the limits of the x-axis similar to those of the first plot, and add a descriptive title.

In [None]:
# Create hist of horsepwr
cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram() +
  ggtitle("Distribution of horsepower")

# Create hist of horsepwr for affordable cars
cars %>% 
  filter(msrp < 25000) %>%
  ggplot(aes(horsepwr)) +
  geom_histogram() +
  xlim(c(90, 550)) +
  ggtitle("Distribution of horsepower for cars under $25k")
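
The difference between the marginal and the conditional view can also be checked numerically. The following is a sketch on a made-up data frame (`toy_cars` and its values are hypothetical; only the column names `horsepwr` and `msrp` come from the exercise). It also shows that the pipe simply passes its left-hand side as the first argument of the next function.

```r
library(dplyr)

# Hypothetical stand-in for the course data (values are invented)
toy_cars <- data.frame(
  horsepwr = c(110, 140, 200, 225, 300, 450),
  msrp     = c(14000, 18000, 24000, 31000, 45000, 90000)
)

# Marginal: every row contributes
marginal <- toy_cars %>%
  summarise(n = n(), max_hp = max(horsepwr))

# Conditional: only rows that satisfy the condition contribute
conditional <- toy_cars %>%
  filter(msrp < 25000) %>%
  summarise(n = n(), max_hp = max(horsepwr))

marginal
conditional

# The piped and the nested form give the same result
same <- identical(
  toy_cars %>% filter(msrp < 25000),
  filter(toy_cars, msrp < 25000)
)
same
```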

# Marginal and conditional histograms interpretation

Observe the two histograms in the plotting window and decide which of the following is a valid interpretation.

# Instructions:

Possible answers

( ) Cars with around 300 horsepower are more common than cars with around 200 horsepower.

(x) The highest horsepower car in the less expensive range has just under 250 horsepower.

( ) Most cars under $25,000 vary from roughly 100 horsepower to roughly 350 horsepower.

# Three binwidths

Before you take these plots for granted, it's a good idea to see how things change when you alter the binwidth. The binwidth determines how smooth your distribution will appear: the smaller the binwidth, the more jagged your distribution becomes. It's good practice to consider several binwidths in order to detect different types of structure in your data.

# Instructions:

- Create the following three plots, adding a title to each to indicate the binwidth used:

* A histogram of horsepower (i.e. horsepwr) with a binwidth of 3.
* A second histogram of horsepower with a binwidth of 30.
* A third histogram of horsepower with a binwidth of 60.

In [None]:
# Create hist of horsepwr with binwidth of 3
cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram(binwidth = 3) +
  ggtitle("Distribution of horsepower: binwidth 3")

# Create hist of horsepwr with binwidth of 30
cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram(binwidth = 30) +
  ggtitle("Distribution of horsepower: binwidth 30")

# Create hist of horsepwr with binwidth of 60
cars %>%
  ggplot(aes(horsepwr)) +
  geom_histogram(binwidth = 60) +
  ggtitle("Distribution of horsepower: binwidth 60")
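
One way to make the effect of the binwidth concrete is to count the bins that each setting produces. The sketch below uses simulated values rather than the course data (`toy` and the helper `count_bins()` are hypothetical names), and reads the computed bins back with ggplot2's `layer_data()`.

```r
library(ggplot2)

# Simulated horsepower-like values; not the course data
set.seed(1)
toy <- data.frame(horsepwr = round(rnorm(200, mean = 215, sd = 70)))

# Hypothetical helper: build a histogram and count its bins
count_bins <- function(bw) {
  p <- ggplot(toy, aes(horsepwr)) +
    geom_histogram(binwidth = bw)
  nrow(layer_data(p))  # one row per bin
}

# Fewer, wider bins as the binwidth grows
n_bins <- sapply(c(3, 30, 60), count_bins)
n_bins
```

With many narrow bins, individual values show up as separate spikes; with a few wide bins, only the overall shape remains.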

# Three binwidths interpretation

What feature is present in Plot A (binwidth 3) that's not found in Plots B or C (binwidths 30 and 60)?

# Instructions:

Possible answers

( ) The most common horsepower is around 200.

(x) There is a tendency for cars to have horsepower right at 200 or 300 horsepower.

( ) There is a second mode around 300 horsepower.

# Box plots for outliers

In addition to indicating the center and spread of a distribution, a box plot provides a graphical means to detect outliers. You can apply this method to the msrp column (manufacturer's suggested retail price) to detect if there are unusually expensive or cheap cars.

# Instructions:

- Construct a box plot of msrp.
- Exclude the largest 3-5 outliers by filtering the rows to retain only cars with an msrp below $100,000. Save this reduced dataset as cars_no_out.
- Construct a similar box plot of msrp using this reduced dataset. Compare the two plots.

In [None]:
# Construct box plot of msrp
cars %>%
  ggplot(aes(x = 1, y = msrp)) +
  geom_boxplot()

# Exclude outliers from data
cars_no_out <- cars %>%
  filter(msrp < 100000)

# Construct box plot of msrp using the reduced dataset
cars_no_out %>%
  ggplot(aes(x = 1, y = msrp)) +
  geom_boxplot()
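
For reference, the points that `geom_boxplot()` draws individually are, by default, the values lying more than 1.5 times the interquartile range beyond the box. The base R sketch below applies that rule by hand to a made-up price vector (the values are invented, not taken from the cars data).

```r
# Hypothetical prices, with one unusually expensive car
msrp <- c(12000, 15000, 18000, 21000, 24000, 27000, 30000, 33000, 150000)

# Quartiles and interquartile range
q   <- unname(quantile(msrp, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Fences: 1.5 * IQR beyond the first and third quartiles
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers <- msrp[msrp < lower | msrp > upper]
outliers

# Base R's boxplot.stats() applies the same kind of rule
boxplot.stats(msrp)$out
```

The $100,000 cutoff used in the exercise is a simpler, hand-picked threshold; the rule above is what determines which points appear as dots on the plot.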

# Plot selection

Consider two other columns in the cars dataset: city_mpg and width. Which is the most appropriate plot for displaying the important features of their distributions? Remember, both density plots and box plots display the central tendency and spread of the data, but the box plot is more robust to outliers.

# Instructions:

- Use density plots or box plots to construct the following visualizations. For each variable, try both plots and submit the one that is better at capturing the important structure.

- Display the distribution of city_mpg.
- Display the distribution of width.

In [None]:
# Create plot of city_mpg
cars %>%
  ggplot(aes(x = 1, y = city_mpg)) +
  geom_boxplot()

# Create plot of width
cars %>% 
  ggplot(aes(x = width)) +
  geom_density()
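
The remark about robustness can be illustrated with the summaries that a box plot is built from. In this base R sketch (made-up numbers), adding a single extreme value barely moves the median and IQR, while the mean and standard deviation shift substantially.

```r
# Hypothetical mileage values, then the same values plus one extreme car
x     <- c(20, 21, 22, 23, 24)
x_out <- c(x, 60)

# Box plot ingredients: median and IQR
robust <- rbind(
  without = c(median = median(x),     iqr = IQR(x)),
  with    = c(median = median(x_out), iqr = IQR(x_out))
)

# Non-robust counterparts: mean and standard deviation
non_robust <- rbind(
  without = c(mean = mean(x),     sd = sd(x)),
  with    = c(mean = mean(x_out), sd = sd(x_out))
)

robust
non_robust
```

A density plot, on the other hand, can reveal features such as multiple modes that a box plot hides, which is why the exercise asks you to try both.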

# 3 variable plot

Faceting is a valuable technique for looking at several conditional distributions at the same time. If the faceted distributions are laid out in a grid, you can consider the association between a variable and two others, one on the rows of the grid and the other on the columns.

# Instructions:

- common_cyl, which you created to contain only cars with 4, 6, or 8 cylinders, is available in your workspace.

- Using common_cyl, create a histogram of hwy_mpg.
- Grid-facet the plot rowwise by ncyl and columnwise by suv.
- Add a title to your plot to indicate what variables are being faceted on.

In [None]:
# Facet histograms of hwy mileage by ncyl (rows) and suv (columns)
common_cyl %>%
  ggplot(aes(x = hwy_mpg)) +
  geom_histogram() +
  facet_grid(ncyl ~ suv) +
  ggtitle("Mileage by suv and ncyl")
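
Each panel of a grid-faceted plot shows one combination of the two faceting variables. A numeric counterpart, sketched here on a made-up data frame (`toy_cyl` and its values are hypothetical; the column names follow the exercise), is to group by the same two variables and summarise each cell.

```r
library(dplyr)

# Hypothetical stand-in for common_cyl (values are invented)
toy_cyl <- data.frame(
  ncyl    = c(4, 4, 6, 6, 8, 8, 4, 6),
  suv     = c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE),
  hwy_mpg = c(34, 27, 28, 22, 24, 18, 36, 26)
)

# One row per (ncyl, suv) combination, i.e. one row per facet panel
cell_summary <- toy_cyl %>%
  group_by(ncyl, suv) %>%
  summarise(n = n(), median_hwy = median(hwy_mpg), .groups = "drop")

cell_summary
```

In `facet_grid(ncyl ~ suv)`, the variable to the left of `~` defines the rows and the one to the right defines the columns, so this table has one row for each panel of that grid.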

# Interpret 3 var plot

Which of the following interpretations of the plot is valid?

# Instructions:

Possible answers

(x) Across both SUVs and non-SUVs, mileage tends to decrease as the number of cylinders increases.

( ) There are more SUVs than non-SUVs across all cylinder types.

( ) There is more variability in 6-cylinder non-SUVs than in any other type of car.