# L02-E2-Overplotting
## Exercise Instructions

* Complete all cells as instructed, replacing any ??? with the appropriate code

* Execute Jupyter **Kernel** > **Restart & Run All** and ensure that all code blocks run without error



# Overplotting

When we plot every observation, we may not be seeing an accurate picture of the data: two or more identical observations can be plotted on top of each other, and the plot shows them as a single point. This happens most often when a discrete or categorical variable is plotted in a scatterplot.
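
As a rough illustration of how much a plain scatterplot can hide, the sketch below counts repeated rows in a small made-up data frame. The names toy, group and score are invented for this example and are not part of the exercise.

```r
# Made-up data: 8 observations, but only 3 distinct (group, score) pairs
toy <- data.frame(
  group = c(4, 4, 4, 6, 6, 6, 6, 8),
  score = c(30, 30, 30, 25, 25, 25, 25, 20)
)

# Rows that repeat an earlier (group, score) pair would be drawn
# exactly on top of that earlier point
hidden <- sum(duplicated(toy))

# Number of distinct positions that would actually be visible
visible <- nrow(unique(toy))

hidden   # 5
visible  # 3
```

Eight observations would show up as only three dots, which is the problem jitter and alpha blending are meant to reveal.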

One way to see more of the points is to move each one by a small random amount so that they no longer sit on top of each other. ggplot calls this jitter. Ironically, we are making the plotted positions less accurate in order to analyze the data more accurately by eye. Many geoms accept a position argument that can be set to "jitter", but jittering is so common for scatterplots that it has its own geom, geom_jitter(). It is shorthand for geom_point(position = "jitter"). I find the code easier to read when I use a specialized geom rather than an extra argument on a more generic geom.
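
A minimal sketch of the two spellings side by side, using a made-up data frame (toy, group and score are illustrative names, not part of the exercise). Both plots should draw jittered points, although the exact positions differ from run to run because the offsets are random.

```r
library(ggplot2)

# Made-up data with heavy overlap: 60 rows, only 3 distinct positions
toy <- data.frame(
  group = rep(c(4, 6, 8), each = 20),
  score = rep(c(30, 25, 20), each = 20)
)

# Specialized geom
p_jitter <- ggplot(toy, aes(x = group, y = score)) +
  geom_jitter()

# Generic geom with its position argument set to "jitter"
p_point <- ggplot(toy, aes(x = group, y = score)) +
  geom_point(position = "jitter")
```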

## R Features
* library()
* glimpse()
* ggplot()
* geom_point()
* geom_smooth()
* geom_jitter()
* labs()

## Datasets
* mpg

## Library and mpg dataset

In [None]:
# Load the tidyverse library
???

In [None]:
# Explore the data
# Dataframe: mpg
# Hint: glimpse()
???

## Basic scatterplot + trendline

In [None]:
# Create a scatterplot and linear trendline of mpg 
# Miles per gallon vs engine cylinders
# cyl on the x-axis and hwy on the y-axis
# This is without any jitter so it will be a baseline for comparison
ggplot(data = ??? , mapping = aes(x = ??? , y = ??? )) + 
   geom_point() +
   geom_smooth(method = "lm", se = FALSE) +
   labs(title="mpg trend: hwy vs cyl")

### geom_jitter()

In [None]:
# Create a scatterplot and linear trendline with jitter of mpg 
# Miles per gallon vs engine cylinders
# cyl on the x-axis and hwy on the y-axis 
# Change this one to add the jitter by
# changing geom_point to geom_jitter
ggplot(data = mpg, mapping = aes(x = cyl, y = hwy)) + 
   ??? () + 
   geom_smooth(method = "lm", se = FALSE) +
   labs(title="mpg trend: hwy vs cyl", subtitle="Points jittered")

The chart looks quite a bit different and much more representative of the data. 

## Alpha Blending
Another option is to change the transparency of the points. This is called alpha and is available on most geoms. All we need to do is set it. The value ranges from 0 to 1, where 0 is completely transparent and 1 is completely opaque. It defaults to 1, which is why the points so far have been completely opaque.

Recall the conversation about whether to paint the trend line over the points or the points over the trend line. I often use alpha on the second and later layers to help with this. geom_smooth() accepts alpha too, although there it controls the shaded confidence band rather than the line itself.

In [None]:
# Since we already have the baseline plot above, let's see what alpha blending looks like

# Create a scatterplot and linear trendline with alpha blending of mpg 
# Miles per gallon vs engine cylinders
# cyl on the x-axis and hwy on the y-axis
# Change this one to add alpha blending to geom_point
# Set the alpha value to 0.1
ggplot(data = mpg, mapping = aes(x = cyl, y = hwy)) + 
   geom_point(alpha = ??? ) +
   geom_smooth(method = "lm", se = FALSE) +
   labs(title="mpg trend: hwy vs cyl", subtitle="Points Alpha blended")

The chart looks quite a bit different and much more representative of the data. The darker dots represent more observations. You can even start to see the distribution for 6 cylinders, with a cluster at 18 and another at 25. Note that choosing the value for alpha takes a bit of trial and error, because it depends on the amount of overlap and the amount of data. Try different numbers until you see the pattern clearly.
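
To get a feel for why darker dots mean more observations, here is a small worked calculation. It assumes the graphics device blends each point over the ones beneath it in the usual way, so a stack of n identical points with transparency level alpha ends up with opacity 1 - (1 - alpha)^n. The helper name stacked_opacity is made up for this sketch.

```r
# Opacity of a stack of n points that each have the given alpha,
# assuming each point is blended over the ones beneath it
stacked_opacity <- function(alpha, n) {
  1 - (1 - alpha)^n
}

stacked_opacity(0.1, 1)   # 0.1: a single point is faint
stacked_opacity(0.1, 10)  # about 0.65
stacked_opacity(0.1, 40)  # about 0.99: nearly solid
```

Under that assumption, a low alpha keeps single points faint while heavily repeated values still stand out, which is why the best value depends on how much overlap the data has.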

## Add jitter and alpha
Let's add both jitter and alpha.

Create a scatterplot and linear trendline with alpha blending and jitter of mpg.

In [None]:
# Miles per gallon vs engine cylinders
# hwy on the y-axis and cyl on the x-axis
# Change this one to jitter the points as well as alpha blend them
# Replace ??? with the geom that jitters; alpha stays at 0.1
ggplot(data = mpg, mapping = aes(x = cyl, y = hwy)) + 
   ??? (alpha = 0.1) +
   geom_smooth(method = "lm", se = FALSE) +
   labs(title="mpg trend: hwy vs cyl", subtitle="Points Alpha blended + jitter")

Notice that the points now look a little too washed out. Jitter spreads them apart, so fewer points overlap and there is less stacking to darken them.

In [None]:
# Experiment with alpha value

# Create a scatterplot and linear trendline with alpha blending and jitter of mpg 
# Miles per gallon vs engine cylinders
# hwy on the y-axis and cyl on the x-axis
# This one uses geom_jitter with alpha blending
# Set the alpha value to what you think looks best
# Just change the value to something between 0 and 1 and 
# rerun the block
# Do this for a few different values.
ggplot(data = mpg, mapping = aes(x = cyl, y = hwy)) + 
   geom_jitter(alpha = ??? ) +  # Change alpha value to your choice
   geom_smooth(method="lm", se=FALSE) +
   labs(title="mpg trend: hwy vs cyl", subtitle="Points Alpha blended + jitter")

It takes a bit of trial and error to get alpha to a good value. There may be too few overlapping values after the jitter to even need alpha blending. The jitter spreads out the values quite a bit in this case. If there were a large amount of data, then jitter + alpha would be more helpful.

In [None]:
# Open usage help on geom_jitter
# We are looking for a way to control 
# the amount of jitter
# the jitter seems to be a bit too much
? geom_jitter

In [None]:
# Experiment with jitter amount

# Create a scatterplot and linear trendline with alpha blending and jitter of mpg 
# Miles per gallon vs engine cylinders
# cyl on the x-axis and hwy on the y-axis
# This one uses geom_jitter with alpha blending
# Set the width and height of the jitter amount
# between 0 and 1. 
# Experiment with different values for width and height
# Feel free to also update alpha as necessary
ggplot(data = mpg, mapping = aes(x = cyl, y = hwy)) + 
   geom_jitter(alpha = 0.5, width = ??? , height = ??? ) +   
   geom_smooth(method="lm", se=FALSE) +
   labs(title="mpg trend: hwy vs cyl", subtitle="Points Alpha blended + custom jitter")

It is easy to add jitter and alpha. Without these features, scatterplots can hide frequently occurring values.

Notice that it seemed to work better with jitter width being smaller than 0.5 but 0.5 still worked well for jitter height.
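
To check how far a given width actually moves the points, the sketch below inspects the drawn positions with layer_data() from ggplot2. It uses a made-up data frame (toy, group and score are illustrative names). Jitter is applied in both the positive and negative direction, so a width of 0.1 should keep every point within 0.1 of its true x value.

```r
library(ggplot2)

# Made-up data: integer x values with many repeats
toy <- data.frame(
  group = rep(c(4, 6, 8), each = 30),
  score = rep(c(30, 25, 20), each = 30)
)

p <- ggplot(toy, aes(x = group, y = score)) +
  geom_jitter(width = 0.1, height = 0)

# layer_data() returns the positions ggplot will actually draw
drawn <- layer_data(p)

# The true x values are whole numbers, so the distance to the nearest
# whole number is the size of the horizontal shift
max_shift <- max(abs(drawn$x - round(drawn$x)))

max_shift <= 0.1                           # TRUE: within +/- width
all(sort(drawn$y) == sort(toy$score))      # TRUE: height = 0 leaves y alone
```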

Notice that the trend line didn't change. It is computed from the underlying data and has nothing to do with how the points are drawn. That is another good reason to include a trend line.
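
One way to confirm this, sketched on a made-up data frame (toy, group and score are illustrative names), is to compare the fitted line from a plot that uses geom_point() with the one from a plot that uses geom_jitter(). The formula argument is spelled out only to avoid the message geom_smooth() prints when it picks the default.

```r
library(ggplot2)

# Made-up data with some spread in y so the fit is not trivial
toy <- data.frame(
  group = rep(c(4, 6, 8), each = 30),
  score = rep(c(30, 25, 20), each = 30) + rep(c(-2, 0, 2), times = 30)
)

p_point <- ggplot(toy, aes(x = group, y = score)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x)

p_jitter <- ggplot(toy, aes(x = group, y = score)) +
  geom_jitter(width = 0.3, height = 0.5) +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x)

# Layer 2 is the trend line in both plots
line_point  <- layer_data(p_point, i = 2)
line_jitter <- layer_data(p_jitter, i = 2)

all.equal(line_point$y, line_jitter$y)  # TRUE
```

The jitter belongs to the point layer only, so the smoothing layer still sees the original values.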

If you chose a small jitter width of around 0.1, the 5-cylinder group becomes clearer. How many vehicles have 5 cylinders?

# Code Summary
Let's put all the lines of code together to see the bigger picture of what we ended up with. Even though we did a good job of representing categorical data in a scatterplot, we will soon look at different ways of representing this information.

In [None]:
# Load libraries
???

# Explore mpg data
???

# Plot + trend mpg: hwy vs cyl
# Use jitter and alpha as desired
ggplot(data = ??? , mapping = aes(x = ??? , y = hwy)) + 
   ???(alpha = ??? , width = ??? , height = ??? ) + 
   geom_smooth(method="lm", se=FALSE) +
   labs(title="mpg trend: hwy vs cyl", subtitle="Points Alpha blended + custom jitter")