# L02-E1-Trendlines
## Exercise Instructions

* Complete all cells as instructed, replacing any ??? with the appropriate code

* Execute Jupyter **Kernel** > **Restart & Run All** and ensure that all code blocks run without error


# Trendlines

You will add trend lines to your scatterplots. ggplot can calculate a trend line automatically with geom_smooth(), which is what we'll use here.

## R Features
* library()
* glimpse()
* ggplot()
* geom_point()
* geom_smooth()
* labs()

## Datasets
* mtcars
* mpg

In [None]:
# Load the tidyverse library
library( ??? )

In [None]:
# Explore the data 
# Dataframe: mtcars
# Hint: glimpse()
??? (mtcars)

# Adding a Trendline
Create a trend line for mtcars.

In [None]:
# Miles per gallon vs. engine displacement
# disp on the x-axis and mpg on the y-axis
# Hint: geom_smooth
ggplot(data = mtcars) + 
   ??? (mapping = aes(x = ??? , y = ??? )) 

After you run it, notice the informational message that it is using method = 'loess'. Also notice that there is a gray band around the line and that the points are gone.

At this point we are looking for something a bit simpler than what geom_smooth provides by default. We will need to pass some parameters to it to change its behavior. 
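
To see where that informational message comes from, here is a minimal sketch that builds a default geom_smooth() layer and collects the message instead of printing it. It deliberately uses wt (car weight) rather than the exercise columns, the names p, msgs, and built are illustrative only, and the exact wording of the message may differ between ggplot2 versions.

```r
library(ggplot2)

# Default smoother on a different pair of mtcars columns (wt vs mpg)
p <- ggplot(data = mtcars) +
  geom_smooth(mapping = aes(x = wt, y = mpg))

# Building the plot triggers the informational message; collect it quietly
msgs <- character()
built <- withCallingHandlers(
  ggplot_build(p),
  message = function(m) {
    msgs <<- c(msgs, conditionMessage(m))
    invokeRestart("muffleMessage")
  }
)

# Show whatever geom_smooth() reported about its method and formula
print(msgs)
```

The message only appears when the plot is built or printed, which is why creating p on its own stays silent.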

## ? Help
To see what the parameters are and what they default to when we don't specify anything, take a look at the Usage section of the help.

Get help by executing ? followed by the name of the function. It works best if the functions are loaded by library before calling help on them. So all the base R functions and the tidyverse functions will be readily available. 

In [None]:
# Display help for geom_smooth by 
# typing a question mark and then the function
# you want help on, in this case geom_smooth
? geom_smooth

In the usage screen, there is a lot of information. Notice geom_smooth(mapping = NULL, ...) that matches the mapping parameter we have been using.

We first want to change the method from loess to a straight line. This is controlled by the method parameter. The Arguments section of the help describes method and its default (shown as "auto" or NULL, depending on your ggplot2 version). If you read closely, it lists some more example values to use.
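
Besides reading the help page, the defaults can also be inspected from code. The following is a small sketch using base R's args() and formals(); the variable name defaults is illustrative, and the value reported for method depends on the installed ggplot2 version.

```r
library(ggplot2)

# args() prints the function signature, the same information as the Usage section
args(geom_smooth)

# formals() returns the arguments and their default values as a list-like object
defaults <- formals(geom_smooth)
names(defaults)

defaults$se      # default for the confidence band
defaults$method  # default smoothing method (version dependent)
```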

## Trendline: Linear Model
We want to use "lm" which stands for linear model so we can get a line.

Let's try the plot again. This time we will add the parameter method = "lm" to geom_smooth.

Create a trend line of mtcars.

In [None]:
# Miles per gallon vs. engine displacement
# disp on the x-axis and mpg on the y-axis
# Add a linear trend line
ggplot(data = mtcars) + 
   geom_smooth(mapping = aes(x = ??? , y = ??? ), method = "???") 

After you run it, notice that there is no informational message about loess and the line is straight!
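
To make the "linear model" connection concrete, here is a hedged sketch that compares the line drawn by geom_smooth(method = "lm") with a straight line fitted directly by base R's lm(). It uses wt instead of the exercise columns, and the names p, trend, fit, and expected are illustrative. layer_data() is a ggplot2 helper that returns the values a layer actually draws.

```r
library(ggplot2)

# Straight-line trend for wt vs mpg (not one of the exercise pairs)
p <- ggplot(data = mtcars) +
  geom_smooth(mapping = aes(x = wt, y = mpg), method = "lm", se = FALSE)

# x/y values that the first (and only) layer draws
trend <- suppressMessages(layer_data(p, 1))

# Fit the same straight line directly and predict at the same x values
fit <- lm(mpg ~ wt, data = mtcars)
expected <- unname(predict(fit, newdata = data.frame(wt = trend$x)))

# TRUE when the plotted line and the lm() predictions agree
isTRUE(all.equal(trend$y, expected))
```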

## Trendline: Remove confidence interval
We are not done yet; we need to remove that gray band.
That band is due to the default parameter se = TRUE.
Can you find se in the Arguments list?


se displays the confidence interval around the line,
something we'll discuss later in the course.


For now, it is distracting and we want to remove it.
It looks like we can set se = FALSE and that should do the trick.
Although F can be used as shorthand for FALSE, use FALSE as a best practice.
Note that it must be written in all uppercase letters because R is case sensitive.
Let's try it.

Create a trend line of mtcars.

In [None]:
# Miles per gallon vs. engine displacement
# disp on the x-axis and mpg on the y-axis 
# Add a linear trend line with no error band
ggplot(data = mtcars) + 
   geom_smooth(mapping = aes(x = disp, y = mpg), method = "lm", se = ??? )  # Remember R is case sensitive

After you run it, notice the gray confidence interval band is gone.
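
On the earlier advice to spell out FALSE rather than F: the short sketch below shows one reason, using only base R. F is an ordinary variable that can be overwritten, while FALSE is a reserved word. The name result and the text it holds are illustrative.

```r
# F is an ordinary variable that starts out equal to FALSE, so it can be overwritten
F <- TRUE
isTRUE(F)   # F no longer means FALSE in this session

# FALSE is a reserved word; trying to assign to it is an error
result <- tryCatch(
  eval(parse(text = "FALSE <- TRUE")),
  error = function(e) "cannot assign to FALSE"
)
result

# Clean up: remove our F so the built-in one is visible again
rm(F)
identical(F, FALSE)
```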


By the way, we are done with help for a while, so feel free to close it
if it is taking up screen space. You can always use ? to get it back.
Use help whenever you need it.

## Trendline + Scatterplot: mpg vs disp
A line without the points can be good for some analysis, but I really want to see how close the points are to the line.
Let's get the points back on our graph. Recall that we can add more layers by chaining them together with the + operator.

Create a scatterplot and linear trend line of mtcars.

In [None]:
# Miles per gallon vs. engine displacement
# disp on the x-axis and mpg on the y-axis 
ggplot(data = mtcars) + 
   ??? (mapping = aes(x = disp, y = mpg), method = "lm", se = FALSE) ???  # Remember the +
   ??? (mapping = aes(x = disp, y = mpg))

After you run this, notice that we get the points and the trend line, 
like we wanted.

## Scatterplot + Trendline: mpg vs hp
Create a scatterplot and linear trendline of mtcars.

In [None]:
# Miles per gallon vs engine horsepower
# hp on the x-axis and  mpg on the y-axis
ggplot(data = mtcars) + 
   geom_point(mapping = aes(x = hp, y = ??? )) +
   geom_smooth(mapping = aes(x = ??? , y = mpg), method = "lm", se = ??? )

The order of geom_point and geom_smooth can make a difference in the visualization. 
The layers are plotted over one another.
If geom_smooth is last then the trend line will be painted *over* the points.
If geom_point is last then the points will be painted *over* the line.
Can you spot where the line is going over a point?
The order you choose is a matter of preference and a good discussion in a data visualization course. 
Do you want the user's focus to be on the trend or on the point?
For data analysis my preference is for geom_smooth to come later,
so points are not plotted over it. Otherwise, when there are *lots* of points,
it may be hard to see the line.
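
The effect of layer order can be sketched as follows. Both plots use wt rather than the exercise columns, and line_on_top, points_on_top, and the helper layer_order() are hypothetical names introduced only for this illustration. The helper lists each plot's layers in the order they were added, which is also the order in which they are drawn.

```r
library(ggplot2)

# Points first, line second: the line is drawn on top of the points
line_on_top <- ggplot(data = mtcars) +
  geom_point(mapping = aes(x = wt, y = mpg)) +
  geom_smooth(mapping = aes(x = wt, y = mpg), method = "lm", se = FALSE)

# Line first, points second: the points are drawn on top of the line
points_on_top <- ggplot(data = mtcars) +
  geom_smooth(mapping = aes(x = wt, y = mpg), method = "lm", se = FALSE) +
  geom_point(mapping = aes(x = wt, y = mpg))

# Hypothetical helper: name the geom used by each layer, in stored order
layer_order <- function(p) {
  vapply(p$layers, function(l) class(l$geom)[1], character(1), USE.NAMES = FALSE)
}

layer_order(line_on_top)
layer_order(points_on_top)
```

Printing line_on_top and points_on_top side by side makes it easier to spot where the line covers a point.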

## Scatterplot + Trendline: mpg vs cyl
Create a scatterplot and linear trendline of mtcars.

In [None]:
# Miles per gallon vs engine cylinders
# cyl on the x-axis and mpg on the y-axis
ggplot(data = mtcars) + 
   geom_point(mapping = aes(x = ??? , y = mpg)) +
   geom_smooth(mapping = aes(x = cyl, y = ??? ), method = ??? , se = FALSE)

# mpg dataset
Let's switch from the mtcars dataset to the mpg dataset and do some more practice. First, I want to introduce a feature that will save some typing when you have multiple layers in your plot. ggplot and many other functions have the notion of default values for parameters. If you like the default values, then you don't need to do any more typing. You only need to type in a parameter and its new value when you want to change it from the default.

# Inherit data from ggplot()
When the ggplot function is first called with data = (data frame name), that actually sets the default data parameter for every layer. This is why we didn't have to specify it in geom_point or geom_smooth. Notice that we are duplicating the mapping = aes(...) code in geom_point and geom_smooth. The code would be more readable, and we would be less likely to make a mistake, if we could specify this just once instead of for every layer. To do this, we put the mapping in the initial ggplot function call, and then all the layers will inherit it. Less typing!!

In [None]:
# Explore the data
# Dataframe: mpg
??? (mpg)

## Scatterplot + Trendline: hwy vs displ
Create a scatterplot and linear trendline of mpg dataset.

In [None]:
# Highway miles per gallon vs engine displacement
# hwy on the y-axis and displ on the x-axis
# We move the mapping parameter to ggplot instead of in the geoms
ggplot(data = ??? , mapping = aes(x = ??? , y = hwy)) + 
   geom_point() +
   geom_smooth(method = "lm", se = FALSE)

# Inherit mapping from ggplot()
Aside from less typing, notice how much easier it is to read the geom code once the mapping info has moved to ggplot. We will do this from now on because, as more geoms are added, we can take advantage of typing the mapping only once at the top. If we need to change the mapping, we only need to do it in one place.
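
As a quick check that inheriting the mapping changes only the typing and not the plot, the sketch below builds the same chart both ways on mtcars (with wt instead of the exercise columns) and compares the point coordinates. The names per_layer, inherited, and same_points are illustrative.

```r
library(ggplot2)

# Mapping repeated in every layer
per_layer <- ggplot(data = mtcars) +
  geom_point(mapping = aes(x = wt, y = mpg)) +
  geom_smooth(mapping = aes(x = wt, y = mpg), method = "lm", se = FALSE)

# Mapping given once in ggplot() and inherited by both layers
inherited <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Compare the coordinates drawn by the point layer in each version
pts_a <- suppressMessages(layer_data(per_layer, 1))[, c("x", "y")]
pts_b <- suppressMessages(layer_data(inherited, 1))[, c("x", "y")]
same_points <- isTRUE(all.equal(pts_a, pts_b))
same_points
```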

## Scatterplot + Trendline: hwy vs cyl
Create a scatterplot and linear trendline of mpg dataset.

In [None]:
# Highway miles per gallon vs engine cylinders
# hwy on the y-axis and cyl on the x-axis
ggplot(data = mpg, mapping = aes(x = ??? , y = hwy)) + 
   geom_point() + 
   geom_smooth( ??? = "lm", se = FALSE)

# Summary 
Let's put all the lines of code together to see the bigger picture of what we did. I'll switch the mtcars plots to inherit the mappings from ggplot to make that code look cleaner too. Also, let's add some chart titles using the labs() function.
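
Before filling in the summary cell, here is a minimal sketch of labs() on a plot that is not part of the exercise (wt vs mpg). Besides title, labs() also accepts x and y to relabel the axes. The label text and the name p are illustrative choices, and reading the title back through p$labels is an assumption about how the plot object stores it.

```r
library(ggplot2)

# Title plus axis labels; wt and mpg stand in for the exercise columns
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "mtcars trend: mpg vs wt",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon"
  )

# The labels are stored on the plot object
p$labels$title
```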

In [None]:
# Load libraries
library( ??? )

# Explore mtcars dataset
??? (mtcars)

# Scatterplot and linear trend mtcars: y = mpg vs x = disp, hp, cyl
ggplot(data = mtcars, mapping = aes( ??? )) + 
   geom_point() +
   geom_smooth( ??? ) +
   labs(title="mtcars trend: mpg vs disp")

ggplot(data = mtcars, mapping = aes( ??? )) + 
   geom_point() +
   geom_smooth( ??? ) +
   labs(title="mtcars trend: mpg vs hp")

ggplot(data = mtcars, mapping = aes( ??? )) + 
   geom_point() +
   geom_smooth( ??? ) +
   labs(title="mtcars trend: mpg vs cyl")

# Explore mpg dataset
glimpse(mpg)

# Scatterplot and linear trend mpg: y = hwy vs x = displ, cyl
ggplot(data = mpg, mapping = aes( ??? )) + 
   geom_point() +
   geom_smooth( ??? ) +
   labs(title="mpg trend: hwy vs displ")

ggplot(data = mpg, mapping = aes( ??? )) + 
   geom_point() +
   geom_smooth( ??? ) +
   labs(title="mpg trend: hwy vs cyl")

By the way, is it just me, or do those highway mileage vs cylinder plots have something interesting going on?